Learning Robot Manipulation from In-the-Wild Audio-Visual Data

Zeyi Liu1    Cheng Chi1,2    Eric Cousineau3    Naveen Kuppuswamy3   
Benjamin Burchfiel3    Shuran Song1,2

1Stanford University      2Columbia University      3Toyota Research Institute

Paper Video (YouTube) Code Dataset
Human Demonstration with Audio-Visual Feedback
Robot Policy Rollout

Audio signals provide rich information about robot interactions and object properties through contact. This information can, perhaps surprisingly, ease the learning of contact-rich robot manipulation skills, especially when visual information alone is ambiguous or incomplete. However, the use of audio data in robot manipulation has been constrained to teleoperated demonstrations collected with a microphone attached to either the robot or the object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device that records in-the-wild human demonstrations with synchronized audio and visual feedback, and a corresponding policy interface that learns robot manipulation policies directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing contact events and modes, or actively sensing object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments by learning from diverse in-the-wild human demonstrations.
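As a rough illustration of what a joint audio-visual observation can look like, here is a minimal numpy-only sketch that turns a contact-microphone clip into a log-magnitude spectrogram and concatenates a time-pooled audio feature with a visual embedding. Every function name and dimension below is an illustrative assumption; the actual ManiWAV system uses learned encoders and is not reproduced here.

```python
import numpy as np

def log_spectrogram(audio, n_fft=256, hop=128):
    """Short-time log-magnitude spectrogram via a sliding windowed FFT."""
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        window = audio[start:start + n_fft] * np.hanning(n_fft)
        mag = np.abs(np.fft.rfft(window))
        frames.append(np.log1p(mag))
    return np.stack(frames)            # shape: (time, n_fft // 2 + 1)

def fuse_features(spec, visual_feat):
    """Concatenate a time-pooled audio feature with a visual embedding."""
    audio_feat = spec.mean(axis=0)     # pool spectrogram over time
    return np.concatenate([audio_feat, visual_feat])

# Toy example: 1 s of synthetic contact audio at 16 kHz,
# fused with a placeholder 512-dim visual embedding.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
spec = log_spectrogram(audio)
obs = fuse_features(spec, visual_feat=np.zeros(512))
print(obs.shape)  # (641,) = 129 audio bins + 512 visual dims
```

In practice the pooling and concatenation would be replaced by learned fusion (the page's ablations contrast MLP fusion with the full model), but the sketch shows why contact audio adds a complementary channel to vision.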

Technical Summary Video (4 min)

Capability Experiments

(a) Wiping Whiteboard 🪧

The robot is tasked with wiping off a shape (e.g., a heart or square) drawn on a whiteboard. The robot can start from any initial configuration above the whiteboard, grasping an eraser parallel to the board. The main challenge of the task is that the robot needs to exert an appropriate amount of contact force on the whiteboard while moving the eraser along the shape.

ManiWAV (unmute to hear the contact mic recording):

In distribution
Unseen shape (e.g. star)
Unseen table height
Unseen eraser


Vision only: Eraser fails to get into contact with the whiteboard and floats.
Vision only: Eraser does not get into contact with the whiteboard and floats.
MLP fusion: Policy terminates early before shape is completely wiped off.
No noise augmentation: Robot presses too hard on the whiteboard, causing gripper to bend.
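The 'no noise augmentation' ablation above refers to mixing background noise into the training audio so the policy does not overfit to clean recordings. A minimal sketch of SNR-controlled noise mixing is shown below; the function name and SNR value are my own assumptions, not the paper's implementation.

```python
import numpy as np

def augment_with_noise(clip, noise, snr_db):
    """Mix background noise into a training clip at a target SNR (in dB)."""
    sig_pow = np.mean(clip ** 2)
    noise_pow = np.mean(noise ** 2)
    # Scale the noise so that sig_pow / (scale^2 * noise_pow) = 10^(snr_db/10).
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clip + scale * noise

# Toy example: a 440 Hz tone at 16 kHz mixed with white noise at 10 dB SNR.
rng = np.random.default_rng(1)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
aug = augment_with_noise(clip, noise, snr_db=10.0)
```

At training time the noise clip and SNR would typically be sampled at random per example, which encourages the policy to rely on contact sounds rather than on incidental background acoustics.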

Key Findings:

(b) Flipping Bagel 🥯

The robot is tasked with flipping a bagel in a pan from facing down to facing up using a spatula. To perform this task successfully, the robot needs to sense and switch between different contact modes -- precisely inserting the spatula between the bagel and the pan, maintaining contact while sliding, and tilting the spatula up once the bagel is in contact with the edge of the pan.

ManiWAV (unmute to hear the contact mic recording):

In distribution
In distribution
Unseen table height
Noise perturbation


Vision only: Spatula pokes on the side of the bagel.
Vision only: Robot loses contact with the bagel before it's flipped.
ResNet: A policy trained with a ResNet18 audio encoder fails due to spatula displacement.
MLP policy: Using an MLP instead of Diffusion Policy also sometimes loses contact with the bagel before it is flipped.

In-the-Wild Generalization:

Different Bagels
Different Pans
Different Pans
Different Environments

Key Findings:

(c) Pouring 🎲

The robot is tasked with picking up the white cup and pouring the dice into the pink cup if the white cup is not empty. When finished pouring, the robot needs to place the empty cup at a designated location. The challenge of the task is that, given the camera viewpoint, the robot cannot observe whether there are dice in the cup either before or after the pouring action; it therefore needs to leverage feedback from the vibrations of the objects inside the cup. Watch the video below for details of the task and ablations.

Key Findings:

(d) Taping Wires with Velcro Tape ➰

The robot is tasked with choosing the 'hook' tape from several tapes (either 'hook' or 'loop') and strapping wires by attaching the 'hook' tape to a 'loop' tape underneath the wires. The challenge of the task is that the difference between 'loop' and 'hook' tape is not observable with vision, but the subtle difference in surface material generates different sounds when sliding the gripper finger against the tape. Watch the video below for details of the task and ablations.

Key Findings:

More Results

Attention Map Visualization


Interestingly, we find that a policy co-trained with audio attends more to task-relevant regions (the shape of the drawing, or the free space inside the pan). In contrast, the vision-only policy often overfits to background structures as a shortcut for estimating contact (e.g., the edge of the whiteboard, the table, and room structures).


@article{liu2024maniwav,
    title={ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data},
    author={Liu, Zeyi and Chi, Cheng and Cousineau, Eric and Kuppuswamy, Naveen and Burchfiel, Benjamin and Song, Shuran},
    journal={arXiv preprint arXiv:2406.19464},
    year={2024}
}


If you have any questions, please feel free to contact Zeyi Liu.


The authors would like to thank Yifan Hou and Zhenjia Xu for their help with discussions and the setup of real-world experiments, and Karan Singh for assistance with audio visualization and data collection. In addition, we would like to thank all REALab members: Huy Ha, Mandi Zhao, Mengda Xu, Xiaomeng Xu, Chuer Pan, Austin Patel, Yihuai Gao, Haochen Shi, Dominik Bauer, Samir Gadre, et al., and additional collaborators at TRI, Siyuan Feng and Russ Tedrake, for fruitful technical discussions and emotional support. The authors would also like to specially thank Xiaoran 'Van' Fan for his help with task brainstorming and his audio expertise. This work was supported in part by the Toyota Research Institute, NSF Awards #2143601 and #2132519, and a Sloan Fellowship. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.