Pipeline
We first feed the observation into the CLIP visual encoder and extract the [CLS] token as a global embedding. In parallel, SAM2 segments object masks in the observation, yielding visual crops of individual objects; these crops are then passed through the CLIP visual encoder to produce instance embeddings. To integrate these features, an instance-level semantic module fuses each instance embedding with the global embedding. From the fused features, we compute a confidence score for each object that measures how well it aligns with the given instruction, which is encoded as a text embedding. In the Target Localization stage, a lightweight fully connected network (FCN) refines the specific manipulation position, ensuring an accurate pick. In the Region Determination stage, the crop of the picked object is used to determine the optimal placement position and angle via cross-correlation.
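To make the fusion and scoring step concrete, the sketch below shows one plausible form of the instance-level semantic module in PyTorch. The module name, the concatenate-and-project fusion, and the cosine-similarity confidence are assumptions for illustration; the paper does not specify the fusion architecture, and random tensors stand in for the CLIP and SAM2 outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceSemanticModule(nn.Module):
    """Fuses each instance embedding with the global [CLS] embedding
    and scores alignment with the instruction text embedding.
    A minimal sketch, assuming a concat-and-project fusion."""
    def __init__(self, dim=512):
        super().__init__()
        # Hypothetical fusion head: concatenate instance and global
        # features, then project back to the shared embedding space.
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, inst_emb, global_emb, text_emb):
        # inst_emb:   (N, D) CLIP embeddings of SAM2 object crops
        # global_emb: (D,)   CLIP [CLS] embedding of the observation
        # text_emb:   (D,)   CLIP embedding of the instruction
        g = global_emb.expand_as(inst_emb)               # (N, D)
        fused = self.fuse(torch.cat([inst_emb, g], -1))  # (N, D)
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        # Cosine similarity with the instruction gives a
        # per-object confidence score.
        return fused @ text                              # (N,)

# Toy usage with random features in place of real CLIP/SAM2 outputs.
module = InstanceSemanticModule(dim=512)
inst = torch.randn(5, 512)   # embeddings of 5 object crops
glob = torch.randn(512)      # global observation embedding
text = torch.randn(512)      # instruction embedding
conf = module(inst, glob, text)
target = conf.argmax()       # index of the most instruction-aligned object
print(conf, target)
```

In this reading, the object with the highest confidence is selected as the pick target, after which the FCN and cross-correlation stages operate on it as described above.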