LaVA-Man: Learning Visual Action Representations for Robot Manipulation

Queen Mary University of London, UK
University College London, UK

Conference on Robot Learning (CoRL), 2025

Abstract

Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model's ability to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the Omni-Object Pick-and-Place (OOPP) dataset, which consists of annotated robot tabletop manipulation episodes covering 180 object classes and 3,200 instances with corresponding textual instructions. This dataset enables the model to acquire diverse object priors and allows for a more comprehensive evaluation of its generalisation capability across object instances. Experimental results on five benchmarks, including both simulated and real-robot validations, demonstrate that our method outperforms prior art.

Overview

We present LaVA-Man, a self-supervised framework for learning visual-action representations for robot manipulation via goal-image prediction. We also introduce the Omni-Object Pick-and-Place (OOPP) dataset so that the model learns a diverse, open-vocabulary object prior. The learned representations can be adapted to various downstream robotic perception and manipulation tasks.
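The pretext objective above amounts to a masked-image-modelling loss on the goal image, conditioned on the current observation and the instruction. The sketch below is a minimal PyTorch-style illustration under assumed generic encoder and decoder modules; the module names, mask ratio, and patch size are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


def patchify(images, patch=16):
    """Split (B, C, H, W) images into flattened non-overlapping patches of shape (B, N, C*patch*patch)."""
    B, C, H, W = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H//p, W//p, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)


class GoalReconstructionPretext(nn.Module):
    """Minimal sketch of the masked goal-image reconstruction pretext task.

    The image encoder, text encoder, and decoder are placeholder modules;
    the actual LaVA-Man architecture may differ.
    """

    def __init__(self, image_encoder, text_encoder, decoder, mask_ratio=0.75, patch=16):
        super().__init__()
        self.image_encoder = image_encoder   # encodes the current observation
        self.text_encoder = text_encoder     # encodes the textual instruction
        self.decoder = decoder               # predicts masked goal-image patches
        self.mask_ratio = mask_ratio
        self.patch = patch

    def forward(self, obs_image, goal_image, instruction_tokens):
        # Encode the current observation and the instruction.
        obs_feat = self.image_encoder(obs_image)
        text_feat = self.text_encoder(instruction_tokens)

        # Randomly mask a fraction of the goal-image patches.
        goal_patches = patchify(goal_image, self.patch)  # (B, N, D)
        mask = torch.rand(goal_patches.shape[:2], device=goal_patches.device) < self.mask_ratio
        visible = goal_patches.masked_fill(mask.unsqueeze(-1), 0.0)

        # Reconstruct the goal conditioned on observation and language features.
        pred = self.decoder(visible, obs_feat, text_feat)

        # Pixel reconstruction loss computed on masked patches only.
        loss = ((pred - goal_patches) ** 2)[mask].mean()
        return loss
```

Because the loss is defined purely on images and text, no robot action labels are needed at this stage; action heads are attached only during fine-tuning on a few demonstrations.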

Visualised affordances

New dataset

We also introduce the Omni-Object Pick-and-Place (OOPP) dataset, a tabletop simulation benchmark containing 3,200 unique real-scanned object instances across 180 distinct categories.
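As described in the abstract, each OOPP episode pairs tabletop observations with a textual instruction. The dataclass below is a hypothetical illustration of such a record; all field names are assumptions rather than the released schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OOPPEpisode:
    """Hypothetical record for one annotated pick-and-place episode.

    Field names are illustrative assumptions; consult the released
    dataset for its actual schema.
    """
    observation: np.ndarray   # RGB tabletop view before manipulation, shape (H, W, 3)
    goal: np.ndarray          # RGB view after the pick-and-place, shape (H, W, 3)
    instruction: str          # e.g. a templated pick-and-place instruction
    object_class: str         # one of the 180 object categories
    instance_id: int          # index into the 3,200 real-scanned instances
```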

Real robot experiments

Simulation experiments

Predictions in pretext task

The goal-image prediction samples shown here illustrate that the learned representation captures the underlying causality of visual state transitions in language-guided manipulation.

Failure cases