DiffPort: Adapting Pre-trained Diffusion Models for Generalizable Robot Manipulation

1 Queen Mary University of London, UK
2 Korea Advanced Institute of Science & Technology, Korea
3 Idiap Research Institute, Switzerland

*Indicates Equal Contribution

We present DiffPort, a vision-language model that leverages pre-trained diffusion models for robot manipulation tasks. Our model applies to a wide range of robot tasks in both simulation and the real world.

Abstract

We introduce a new framework that leverages a pre-trained text-to-image diffusion model for language-guided robot manipulation. We use a learnable captioner that transforms the textual commands for robot action into text embeddings aligned with the pre-trained diffusion model. We then utilize the diffusion-aligned text embeddings and the visual observations as input to extract features from the diffusion model. These semantic features are then integrated with an affordance prediction network that guides robot actions for pick and place tasks. We validate our framework on diverse language-guided tabletop robot manipulation tasks in both simulation and real-world environments. The results demonstrate the advantages of our approach in manipulating previously seen and unseen objects.
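As a rough illustration of the captioner idea described above, the sketch below (in PyTorch, with assumed module names, a hypothetical tokenization, and an embedding width of 768 chosen to match a typical diffusion text encoder) maps command tokens into a diffusion-aligned text-embedding space. It is not the exact DiffPort captioner, only a minimal sketch under those assumptions.

```python
import torch
import torch.nn as nn

class LearnableCaptioner(nn.Module):
    """Hypothetical sketch: map tokenized robot commands to embeddings
    compatible with a pre-trained diffusion model's text-conditioning space."""
    def __init__(self, vocab_size=49408, embed_dim=768, num_layers=2, max_len=77):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # projection intended to align the output with the diffusion text space
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        x = self.token_embed(token_ids) + self.pos_embed[:, : token_ids.size(1)]
        x = self.encoder(x)
        return self.proj(x)  # (B, L, embed_dim) diffusion-aligned text embeddings

# usage with placeholder token ids standing in for a tokenized command
captioner = LearnableCaptioner()
dummy_ids = torch.randint(0, 49408, (1, 12))
text_emb = captioner(dummy_ids)
print(text_emb.shape)  # torch.Size([1, 12, 768])
```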

Pipeline

Our model takes an RGB-D observation and the corresponding language command as input during manipulation. In the decoder, features from the diffusion model and the transporter network are fused through concatenation, while features from the text encoder are integrated via element-wise multiplication.
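A minimal sketch of this fusion step, assuming placeholder channel sizes and module names (the actual decoder layout may differ):

```python
import torch
import torch.nn as nn

class FusionDecoderBlock(nn.Module):
    """Sketch of the described fusion: diffusion and transporter feature maps are
    concatenated along channels, then modulated by the text embedding via
    element-wise multiplication."""
    def __init__(self, diff_ch, trans_ch, text_dim, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(diff_ch + trans_ch, out_ch, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, out_ch)  # project text features to channel dim

    def forward(self, diff_feat, trans_feat, text_emb):
        # diff_feat: (B, diff_ch, H, W); trans_feat: (B, trans_ch, H, W); text_emb: (B, text_dim)
        x = torch.cat([diff_feat, trans_feat], dim=1)      # channel-wise concatenation
        x = torch.relu(self.fuse(x))
        gate = self.text_proj(text_emb)[:, :, None, None]  # broadcast over spatial dims
        return x * gate                                    # element-wise multiplication

# usage with placeholder shapes
block = FusionDecoderBlock(diff_ch=320, trans_ch=64, text_dim=768, out_ch=64)
out = block(torch.randn(1, 320, 40, 40), torch.randn(1, 64, 40, 40), torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```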


Robot Setup

We conduct experiments using a Franka Emika Panda robot equipped with an RGB-D end-effector camera. We also show the seen and unseen splits of objects and colored blocks used in the manipulation tasks.


Real robot experiments

Seen colors and objects
Put the yellow block in the red bowl
Put the blue block in the red bowl
Put the screwdriver in the brown box
Put the banana in the brown box

Unseen colors and objects
Put the orange block in the red bowl
Put the green block in the red bowl
Put the strawberry in the brown box
Put the golf ball in the brown box

Simulation experiments

Seen colors and objects

Unseen colors and objects

Affordance Prediction

Affordance predictions for both pick and place. DiffPort generates affordance predictions and localizes objects without using any explicit object representations (e.g., object detection and segmentation).
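For illustration only, one simple way to turn such dense affordance maps into pick and place targets is an argmax over each heatmap. The sketch below assumes top-down (H, W) heatmaps and is not necessarily how DiffPort decodes its actions.

```python
import numpy as np

def select_pick_place(pick_heatmap, place_heatmap):
    """Illustrative sketch: convert dense pick/place affordance heatmaps into
    pixel targets by taking the argmax of each map.
    pick_heatmap, place_heatmap: (H, W) arrays of affordance scores."""
    pick_uv = np.unravel_index(np.argmax(pick_heatmap), pick_heatmap.shape)
    place_uv = np.unravel_index(np.argmax(place_heatmap), place_heatmap.shape)
    return pick_uv, place_uv  # pixel coordinates, to be back-projected with depth

# usage with random placeholder heatmaps
pick, place = select_pick_place(np.random.rand(224, 224), np.random.rand(224, 224))
print(pick, place)
```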
