DiffPort: Adapting Pre-trained Diffusion Models for Generalizable Robot Manipulation

1 Queen Mary University of London, UK
2 Korea Advanced Institute of Science & Technology, Korea
3 Idiap Research Institute, Switzerland

*Indicates Equal Contribution

We present DiffPort, a vision-language model that leverages pre-trained diffusion models for robot manipulation tasks. Our model applies to a wide range of robot tasks in both simulation and the real world.

Abstract

We introduce a new framework that leverages a pre-trained text-to-image diffusion model for language-guided robot manipulation. We use a learnable captioner that transforms the textual commands for robot action into text embeddings aligned with the pre-trained diffusion model. We then utilize the diffusion-aligned text embeddings and the visual observations as input to extract features from the diffusion model. These semantic features are then integrated with an affordance prediction network that guides robot actions for pick and place tasks. We validate our framework on diverse language-guided tabletop robot manipulation tasks in both simulation and real-world environments. The results demonstrate the advantages of our approach in manipulating previously seen and unseen objects.
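As a rough illustration of the captioner idea described above, the sketch below (in PyTorch, with assumed module names, a hypothetical tokenization, and an embedding width of 768 chosen to match a typical diffusion text encoder) maps command tokens into a diffusion-aligned text-embedding space. It is not the exact DiffPort captioner, only a minimal sketch under those assumptions.

```python
import torch
import torch.nn as nn

class LearnableCaptioner(nn.Module):
    """Hypothetical sketch: map tokenized robot commands to embeddings
    compatible with a pre-trained diffusion model's text-conditioning space."""
    def __init__(self, vocab_size=49408, embed_dim=768, num_layers=2, max_len=77):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # projection intended to align the output with the diffusion text space
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        x = self.token_embed(token_ids) + self.pos_embed[:, : token_ids.size(1)]
        x = self.encoder(x)
        return self.proj(x)  # (B, L, embed_dim) diffusion-aligned text embeddings

# usage with placeholder token ids standing in for a tokenized command
captioner = LearnableCaptioner()
dummy_ids = torch.randint(0, 49408, (1, 12))
text_emb = captioner(dummy_ids)
print(text_emb.shape)  # torch.Size([1, 12, 768])
```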

Pipeline

Our model takes an RGB-D observation and the corresponding language command as input during manipulation. In the decoder, features from the diffusion model and the transporter network are fused through concatenation, while features from the text encoder are integrated via element-wise multiplication.
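A minimal sketch of this fusion step, assuming placeholder channel sizes and module names (the actual decoder layout may differ):

```python
import torch
import torch.nn as nn

class FusionDecoderBlock(nn.Module):
    """Sketch of the described fusion: diffusion and transporter feature maps are
    concatenated along channels, then modulated by the text embedding via
    element-wise multiplication."""
    def __init__(self, diff_ch, trans_ch, text_dim, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(diff_ch + trans_ch, out_ch, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, out_ch)  # project text features to channel dim

    def forward(self, diff_feat, trans_feat, text_emb):
        # diff_feat: (B, diff_ch, H, W); trans_feat: (B, trans_ch, H, W); text_emb: (B, text_dim)
        x = torch.cat([diff_feat, trans_feat], dim=1)      # channel-wise concatenation
        x = torch.relu(self.fuse(x))
        gate = self.text_proj(text_emb)[:, :, None, None]  # broadcast over spatial dims
        return x * gate                                    # element-wise multiplication

# usage with placeholder shapes
block = FusionDecoderBlock(diff_ch=320, trans_ch=64, text_dim=768, out_ch=64)
out = block(torch.randn(1, 320, 40, 40), torch.randn(1, 64, 40, 40), torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```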


Robot Setup

We conduct experiments using a Franka Emika Panda robot equipped with an RGB-D end-effector camera. We also show the seen and unseen splits of objects and colored blocks used in the manipulation tasks.


Real robot experiments

Seen colors and objects
Put the yellow block in the red bowl
Put the blue block in the red bowl
Put the screwdriver in the brown box
Put the banana in the brown box

Unseen colors and objects
Put the orange block in the red bowl
Put the green block in the red bowl
Put the strawberry in the brown box
Put the golf ball in the brown box

Simulation experiments

Seen colors and objects

Unseen colors and objects

Affordance Prediction

Affordance predictions for both pick and place. DiffPort generates affordance predictions and localizes objects without using any explicit object representations (e.g., object detection and segmentation).
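For illustration only, one simple way to turn such dense affordance maps into pick and place targets is an argmax over each heatmap. The sketch below assumes top-down (H, W) heatmaps and is not necessarily how DiffPort decodes its actions.

```python
import numpy as np

def select_pick_place(pick_heatmap, place_heatmap):
    """Illustrative sketch: convert dense pick/place affordance heatmaps into
    pixel targets by taking the argmax of each map.
    pick_heatmap, place_heatmap: (H, W) arrays of affordance scores."""
    pick_uv = np.unravel_index(np.argmax(pick_heatmap), pick_heatmap.shape)
    place_uv = np.unravel_index(np.argmax(place_heatmap), place_heatmap.shape)
    return pick_uv, place_uv  # pixel coordinates, to be back-projected with depth

# usage with random placeholder heatmaps
pick, place = select_pick_place(np.random.rand(224, 224), np.random.rand(224, 224))
print(pick, place)
```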
