Context Aware Cooking Assistant

Designing a smart assistant that supports the constant context switching of cooking by understanding user progress and the environment through multimodal sensing

TL;DR

Deriving Design Needs for AI Cooking Assistants

Cooking involves constant context switching between understanding instructions and performing physical tasks in real-time. This research explored how to design multimodal cooking assistants that understand user progress, environmental context, and support natural interaction flows.

Research Demo Video

The Problem

Cooking assistance requires understanding context, not just controlling media

Following cooking videos requires users to constantly alternate attention between understanding video instructions and performing those steps in their physical environment. Current video navigation tools treat this as a simple media control problem, ignoring the rich contextual information available in the cooking environment.

Context switching creates cognitive overhead

Users must mentally track their progress, map video instructions onto their specific setup, and troubleshoot differences between what they see on screen and the reality of their own kitchen.

How to design effective systems

Many existing cooking assistants and research efforts focus on agents that only process video. However, with recent advances in intelligent multimodal agent systems, we can now envision agents that offer contextually relevant assistance. The challenge lies in how to design such systems effectively.

What I Did

🔬

Wizard-of-Oz Study Design

Designed and conducted a controlled study simulating AI-powered contextual assistance through human operators

📊

Interaction Analysis

Analyzed 30+ hours of cooking sessions to understand query patterns, workflow alignment, and assistant needs

🎨

Design Framework

Derived design principles for context-aware assistants that extend beyond the cooking domain

User Study

Wizard-of-Oz study setup showing researcher and participant perspectives

Wizard-of-Oz study setup: researcher could see participant's cooking environment while controlling shared video interface

Research Goal

How can we design context-aware assistants that support users as they switch focus between following procedural video instructions and performing cooking tasks in real-time?

  • What types of contextual information enable more effective cooking assistance?
  • How do users naturally interact with environment-aware assistants during hands-on tasks?
  • What design patterns from cooking assistance can generalize to other procedural domains?

Study Protocol

Video conference setup: Participants joined from their home kitchens while researchers controlled a shared video screen

Environmental awareness simulation: Researchers could see the participant's cooking space and context in real time

Natural interaction: Participants encouraged to ask any questions they would want from an ideal cooking assistant

Data Capture

Multi-stream recording: User video, audio, all verbal interactions, and environmental sounds

Recipe selection: Participants chose an unfamiliar but interesting dish from five options and selected their own YouTube videos

Session structure: 1-hour cooking sessions with pre/post surveys and demographic data

30
Participants (intermediate cooks on average)
5
Standardized recipe categories
100%
Home kitchen environments
⚠️ Technical Decision

Used a Wizard-of-Oz methodology rather than a full AI implementation to focus on interaction patterns and user needs without being constrained by current AI limitations. This allowed us to simulate ideal contextual awareness and study user expectations.

UX Analysis Framework

01

Open Coding & Thematic Analysis

Two researchers independently performed line-by-line open coding on session transcripts and survey responses using established thematic analysis methods. Initial themes included "asking to replay video near end of cooking step" and "interactions requiring multimodal sensing."
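The write-up does not state how agreement between the two coders was checked; purely as a hypothetical illustration of one common check, the sketch below computes Cohen's kappa over the code labels two researchers assign to the same transcript lines. The labels and values are invented, not the study's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical open codes assigned by two researchers to the same transcript lines.
coder_1 = ["replay_request", "progress_check", "timing", "progress_check"]
coder_2 = ["replay_request", "progress_check", "timing", "adaptation"]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")  # kappa = 0.67
```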

02

Interaction Categorization Framework

Systematically grouped all user interactions into distinct categories based on intent, context, and required assistant capabilities. Categories emerged from data rather than predetermined frameworks.

Framework showing categorization of user interactions

Interaction categorization framework derived from thematic analysis
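The actual category set lives in the framework figure above; as a sketch of the kind of schema such a framework implies, the snippet below tags a logged utterance with a category and the sensing modalities needed to answer it. The category and modality names are assumptions loosely based on the query types reported in the findings, not the study's codebook.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Category(Enum):
    # Illustrative categories only; the study derived its own set from the data.
    PROGRESS_CHECK = auto()   # "Is this the right consistency?"
    ADAPTATION = auto()       # "I don't have a stand mixer, what should I do?"
    TIMING = auto()           # "How much longer should this simmer?"
    VIDEO_CONTROL = auto()    # "Replay that last step."

class Modality(Enum):
    VISION = auto()
    AUDIO = auto()
    NONE = auto()             # Answerable from the video or recipe text alone.

@dataclass
class CodedInteraction:
    """One participant utterance tagged during analysis."""
    participant: str
    timestamp_s: float
    utterance: str
    category: Category
    required_modalities: tuple[Modality, ...]

example = CodedInteraction(
    participant="P07",
    timestamp_s=812.0,
    utterance="Did I add enough salt?",
    category=Category.PROGRESS_CHECK,
    required_modalities=(Modality.VISION,),
)
print(example.category.name, [m.name for m in example.required_modalities])
```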

03

User Persona Development

Used human-centered design methods to collaboratively build user personas based on cooking expertise, interaction patterns, and assistant usage preferences. Personas were finalized through researcher consensus and validated against participant data.

User personas derived from cooking interaction patterns

Data-driven user personas showing distinct interaction and assistance needs

04

Task Flow & Interaction Mapping

Derived typical cooking task flows and mapped where context switches and assistant interactions naturally occurred. Identified predictable patterns across different cooking styles and expertise levels.

Example cooking task flow showing context switch points

Example cooking task flow highlighting natural context switch points and assistant intervention opportunities
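As a rough sketch of how a mapped flow could be represented for later analysis, the snippet below encodes a hypothetical recipe's steps with their phase and whether a context switch was typically observed there. The step names and flags are invented for illustration, not taken from the study data.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    phase: str                   # "preparation" | "execution" | "validation"
    switch_likely: bool = False  # Did participants typically switch context here?
    note: str = ""

# Hypothetical stir-fry flow; the switch points mirror the pattern reported in
# the findings (ingredient prep, technique changes, timing decisions).
flow = [
    Step("gather ingredients", "preparation", switch_likely=True, note="ingredient prep"),
    Step("chop vegetables", "preparation"),
    Step("heat wok", "execution", switch_likely=True, note="timing decision"),
    Step("stir-fry aromatics", "execution", switch_likely=True, note="technique change"),
    Step("add sauce and simmer", "execution", switch_likely=True, note="timing decision"),
    Step("taste and adjust", "validation", switch_likely=True, note="progress check"),
]

# Candidate moments where an assistant could offer to pause or replay the video.
intervention_points = [s.name for s in flow if s.switch_likely]
print(intervention_points)
```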

05

Persona-Based Interaction Analysis

Created comprehensive interaction profiles by mapping each participant's query patterns to derived personas, revealing distinct usage patterns and assistant needs across user types.

Bar chart showing interaction patterns grouped by persona

Bar chart analysis: participant interactions categorized and grouped by derived personas
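A minimal sketch of how such a chart could be produced, assuming hypothetical persona names and interaction counts; the study's real personas and numbers are in the figure above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical counts of interactions per query category, grouped by persona.
personas = ["Recipe Follower", "Improviser", "Learner"]   # invented names
categories = ["Progress", "Adaptation", "Timing"]
counts = np.array([
    [42, 11, 30],   # Recipe Follower
    [18, 37, 15],   # Improviser
    [25, 20, 28],   # Learner
])

x = np.arange(len(categories))
width = 0.25
fig, ax = plt.subplots()
for i, persona in enumerate(personas):
    ax.bar(x + i * width, counts[i], width, label=persona)

ax.set_xticks(x + width)
ax.set_xticklabels(categories)
ax.set_ylabel("Number of interactions")
ax.set_title("Query categories by persona (illustrative data)")
ax.legend()
plt.show()
```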

Key Findings

1
Users form distinct query patterns based on environmental context

Analysis revealed three primary query categories that emerged naturally during context-aware cooking assistance:

  • Progress queries: "Did I add enough salt?" "Is this the right consistency?"
  • Adaptation queries: "I don't have a stand mixer, what should I do?" "My pan is smaller than in the video"
  • Timing queries: "How much longer should this simmer?" "When should I start the next step?"

These query types rarely occur with traditional video controls, suggesting environmental awareness unlocks new interaction possibilities.
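To make the distinction concrete, here is a minimal keyword-heuristic router for these three query types. It only illustrates how each type maps to a different information need; the cue lists are assumptions, and a deployed assistant would use a trained intent classifier instead.

```python
# Keyword cues are illustrative; a real assistant would use a trained intent model.
PROGRESS_CUES = ("enough", "right consistency", "done", "look ok", "is this")
ADAPTATION_CUES = ("don't have", "instead of", "substitute", "smaller", "bigger")
TIMING_CUES = ("how much longer", "how long", "when should", "ready yet")

def route_query(query: str) -> str:
    """Map a user query to the kind of context needed to answer it."""
    q = query.lower()
    if any(cue in q for cue in ADAPTATION_CUES):
        return "adaptation: compare the user's equipment/ingredients with the recipe"
    if any(cue in q for cue in TIMING_CUES):
        return "timing: consult step timers and the observed cooking state"
    if any(cue in q for cue in PROGRESS_CUES):
        return "progress: inspect the camera view of the user's food"
    return "fallback: answer from the video or recipe content alone"

for q in ["Did I add enough salt?",
          "I don't have a stand mixer, what should I do?",
          "How much longer should this simmer?"]:
    print(q, "->", route_query(q))
```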

2
Workflow structure stays surprisingly consistent across users, even as interaction preferences vary dramatically

While users had different interaction preferences and query patterns, their overall workflow structure showed surprising consistency:

  • All users followed similar preparation → execution → validation cycles
  • Context switches occurred at predictable points (ingredient prep, technique changes, timing decisions)
  • Environmental queries clustered around decision points and skill-adaptation moments

This suggests context-aware systems can predict when assistance will be needed, even across diverse user preferences.
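Read operationally: because switch-prone step types are known in advance, even a very simple policy could decide when to proactively pause the video. The sketch below assumes hypothetical step-type labels and a single attention signal; it is not the study's implementation.

```python
# Step types where context switches clustered in the study's sessions.
SWITCH_PRONE_STEP_TYPES = {"ingredient_prep", "technique_change", "timing_decision"}

def should_pause_proactively(step_type: str, user_away_from_screen: bool) -> bool:
    """Pause before switch-prone steps once the user turns to the counter or stove."""
    return step_type in SWITCH_PRONE_STEP_TYPES and user_away_from_screen

print(should_pause_proactively("technique_change", user_away_from_screen=True))  # True
print(should_pause_proactively("narration", user_away_from_screen=True))         # False
```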

3
Multimodal sensing enables proactive rather than reactive assistance

Because the researcher acting as the wizard could see and hear each participant's kitchen, the assistant could anticipate user needs before explicit requests. These interactions point to the sensing a real system would need:

  • Visual awareness of the counter caught missing or substituted ingredients (a natural target for computer vision)
  • Audio cues revealed technique struggles (e.g., the sound of inadequate whisking)
  • Awareness of timing surfaced issues such as oven preheating and pan temperature (candidates for simple IoT sensors)
  • Combining these signals enabled contextually relevant suggestions at opportune moments

Users reported that the system made them feel "understood" rather than simply handed another set of media controls, leading to higher task confidence and better learning outcomes.
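In the study this awareness came from a human wizard; as a sketch of how fused vision, audio, and sensor signals could drive the same proactive behavior, the rule-based example below assumes hypothetical detector outputs and thresholds.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnvironmentState:
    """Hypothetical fused outputs of vision, audio, and IoT-sensor detectors."""
    missing_ingredients: list[str]   # from computer vision over the counter
    whisking_intensity: float        # 0..1, from audio analysis
    pan_temp_c: Optional[float]      # from an IoT thermometer, if one exists
    current_step: str                # from procedure tracking

def proactive_suggestion(state: EnvironmentState) -> Optional[str]:
    """Rule-based fusion: surface the most urgent suggestion, or stay quiet."""
    if state.missing_ingredients:
        return f"You may be missing {', '.join(state.missing_ingredients)}. Want a substitute?"
    if state.current_step == "whisk_eggs" and state.whisking_intensity < 0.3:
        return "The mixture may need more vigorous whisking before it thickens."
    if state.current_step == "sear_steak" and state.pan_temp_c is not None and state.pan_temp_c < 180:
        return "The pan is not fully heated yet; waiting a minute will give a better sear."
    return None  # Default to silence so the assistant does not interrupt needlessly.

state = EnvironmentState(missing_ingredients=[], whisking_intensity=0.2,
                         pan_temp_c=None, current_step="whisk_eggs")
print(proactive_suggestion(state))
```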

Design Implications & Future Work

Design goal: Support recognition of the task-completion flow
Technical implementation:
  • Develop computer vision models for cooking-state recognition
  • Implement temporal reasoning for multi-step procedure tracking
  • Create adaptive timing models based on user skill and environmental factors
Broader applications: Procedural assistance systems for manufacturing, healthcare protocols, educational labs, and repair/maintenance tasks.

Design goal: Accommodate diverse user queries and characteristics
Technical implementation:
  • Build multimodal interaction frameworks (voice, gesture, visual)
  • Develop user modeling for skill adaptation and preference learning
  • Create context-aware natural language understanding for domain-specific queries
Broader applications: Adaptive AI tutoring systems that accommodate different learning styles, physical abilities, and expertise levels across domains.

Design goal: Leverage multimodal sensing of the environment
Technical implementation:
  • Integrate IoT sensors with computer vision and audio processing
  • Develop sensor fusion algorithms for environmental state estimation
  • Create privacy-preserving edge computing for real-time analysis
Broader applications: Smart environments that understand human activity and provide contextual assistance in homes, workplaces, and public spaces.
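As a sketch of the "temporal reasoning for multi-step procedure tracking" item above, the snippet below tracks progress through recipe steps from timestamped observations, using a minimum duration per step to ignore implausibly early completion cues. The step names, cue labels, and durations are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class RecipeStep:
    name: str
    completion_cue: str     # observation label that marks this step as done
    min_duration_s: float   # ignore cues that arrive implausibly early

class ProcedureTracker:
    """Tracks which recipe step the user is on from timestamped observations.

    Observations would come from vision/audio detectors in a real system;
    here they are just (timestamp, label) pairs.
    """
    def __init__(self, steps: list[RecipeStep]):
        self.steps = steps
        self.index = 0
        self.step_started_at = 0.0

    @property
    def current_step(self) -> RecipeStep:
        return self.steps[self.index]

    def observe(self, timestamp_s: float, label: str) -> None:
        step = self.current_step
        elapsed = timestamp_s - self.step_started_at
        if label == step.completion_cue and elapsed >= step.min_duration_s:
            if self.index < len(self.steps) - 1:
                self.index += 1
                self.step_started_at = timestamp_s

steps = [
    RecipeStep("chop onions", completion_cue="cutting_board_cleared", min_duration_s=60),
    RecipeStep("saute onions", completion_cue="onions_translucent", min_duration_s=180),
    RecipeStep("add sauce", completion_cue="sauce_poured", min_duration_s=10),
]
tracker = ProcedureTracker(steps)
tracker.observe(95.0, "cutting_board_cleared")
tracker.observe(300.0, "onions_translucent")
print(tracker.current_step.name)   # -> "add sauce"
```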
💡

Reflection

This work demonstrates that effective AI assistance requires understanding not just user intent, but environmental context and task progression. The multimodal sensing approach and interaction patterns identified extend beyond cooking to any domain involving procedural guidance in physical environments.

Technical contribution: Context-awareness requirements for multimodal sensing and real-time environmental understanding
Interaction design: Context-aware query patterns that inform natural language interfaces
Broader impact: Design principles for AI assistants in physical, procedural domains

References

2023

  1. Identifying Multimodal Context Awareness Requirements for Supporting User Interaction with Procedural Videos
    Georgianna Lin, Jin Yi Li, Afsaneh Fazly, and 2 more authors
    In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023