Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

Multimodal Human Video. We capture human motion by recording video data, muscle activity through EMG sensors, and interaction sounds through microphones. These signals reveal critical details about manipulation.

Chain-of-Modality. Our approach enables Vision Language Models to analyze each modality sequentially, extracting force timing, hand movements, and object recognition to generate robot-executable code.

Abstract

Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data — videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with generalization to new task setups and objects in real-world robot experiments.

Method: Chain-of-Modality

Chain-of-Modality (CoM) is a prompting strategy that enables Vision Language Models to analyze multimodal human demonstration data sequentially. By examining each modality step-by-step, CoM extracts key information and progressively refines its understanding to produce accurate task plans and control parameters.

The image above illustrates the Chain-of-Modality approach. On the left, the baseline "Merged" method combines all modalities into a single input for the VLM. In contrast, CoM (right) analyzes each modality sequentially: first force data to determine when force is applied, then hand pose to infer grasping and twisting actions, and finally images to identify the specific objects and actions (e.g., twisting a bottle cap). This sequential analysis enables CoM to generate more accurate robot-executable Python programs.

Example: Bottle Opening Task

The video above demonstrates Chain-of-Modality in action for the bottle opening task. Notice how the process begins by analyzing the force signal peaks (shown in red) to identify key moments when force is applied. Next, the hand pose analysis reveals grasping and twisting motions with specific rotation directions. Finally, the visual analysis identifies the bottle and cap objects. By the end, CoM has assembled a complete task plan with detailed parameters that can be translated into robot code.

Experiment Tasks

We evaluate our approach on four manipulation tasks that require precise control parameters:

Opening Bottle: A bi-manual task requiring precision in grasping and twisting. In real-world evaluations, we tested with 7 different bottles (6 unseen) and deployed on two robot platforms (bi-manual ViperX and KUKA).
Inserting Plug: Requires varying force application during different phases of manipulation. For testing generalization, we randomly placed the plug, power strip, and box in different configurations.
Playing Drum: Demands precise control of force and timing. We tested with different drumming beats to evaluate adaptability to various rhythmic patterns.
Wiping Board: Tests force-controlled surface interaction. We evaluated with marker drawings of different shapes and varying positions on the board.

Qualitative Results

The figure above shows qualitative results of how Chain-of-Modality generates detailed task plans for various manipulation tasks. CoM successfully segments videos into subtasks, specifying the skills, force levels, and target objects at each stage.

Opening Bottle: CoM correctly identifies the sequence of grasping, twisting, and releasing actions with proper rotation directions. It captures multiple twist cycles and generates accurate control parameters for each action.
Inserting Plug: CoM recognizes the need for different force levels during the manipulation process - light force when adjusting the plug orientation and higher force during the insertion phase.
Playing Drum: CoM captures rhythmic patterns and force variations, detecting different intensities for each drum hit and properly segmenting the sequence of motions.
Wiping Board: CoM accurately extracts the wiping trajectory and pressure needed for effective cleaning, including the changes in direction and force applied during different phases of wiping.

Robot Demonstrations

Below we showcase robotic implementations of tasks learned through our Chain-of-Modality approach. These demonstrations highlight the robot's ability to replicate human manipulation skills with appropriate force control parameters.

Task 1: Opening a Bottle

Chain-of-Modality Video Analysis

Real Robot Execution Results

Task 2: Inserting a Plug

Chain-of-Modality Video Analysis

Real Robot Execution Results

Task 3: Playing a Drum

Chain-of-Modality Video Analysis

Real Robot Execution Results

Task 4: Wiping a Board

Chain-of-Modality Video Analysis

Real Robot Execution Results

Quantitative Results

The figure compares Chain-of-Modality with various baseline methods across three manipulation tasks using both Gemini 1.5 pro and GPT-4o models.

We observe that processing and analyzing each modality separately consistently outperforms baselines that either merge modality inputs or generate single merged answers. CoM further outperforms the Sep-Sep approach (which separates inputs and outputs but lacks progressive refinement) by more than 19% with Gemini 1.5 pro and 17% with GPT-4o.

Our results confirm that force information greatly enhances understanding of human task plans. Methods with force inputs significantly outperform those without, helping VLMs better segment the video into different stages. This leads to an average 42% improvement in similarity score between extracted task plans and ground truth.

In tasks requiring fine-grained manipulation like Opening Bottle, methods with all modalities achieve the highest success rates. Hand pose plays a crucial role in these tasks, providing significant assistance for detecting subtle finger movements and rotations.

BibTeX

@article{wang2025chainofmodality,
      title={Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models}, 
      author={Chen Wang and Fei Xia and Wenhao Yu and Tingnan Zhang and Ruohan Zhang and C. Karen Liu and Li Fei-Fei and Jie Tan and Jacky Liang},
      year={2025},
      journal={arXiv preprint arXiv:2504.13351}
}