Pipeline
Our model processes an RGB-D observation and corresponding robot commands during manipulation. In the decoder, features from the diffusion model and the transporter are fused through concatenation, while features from the text encoder are integrated via element-wise multiplication.
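The two fusion operations in the decoder can be illustrated with a minimal NumPy sketch. It is not the actual implementation: the function name `fuse_features`, the channel-first layout, the feature sizes, and the order of operations (concatenate the two visual streams first, then multiply by the text feature) are all assumptions made for illustration. The sketch further assumes that the text feature has already been projected to the channel dimension of the concatenated visual features, so that it can be broadcast over the spatial dimensions.

```python
import numpy as np


def fuse_features(diffusion_feat, transporter_feat, text_feat):
    """Hypothetical decoder fusion step (illustrative only).

    diffusion_feat:   array of shape (C1, H, W), diffusion-model features
    transporter_feat: array of shape (C2, H, W), transporter features
    text_feat:        array of shape (C1 + C2,), text-encoder features,
                      assumed to be pre-projected to the fused channel size
    """
    # Fuse the two visual streams by concatenation along the channel axis.
    visual = np.concatenate([diffusion_feat, transporter_feat], axis=0)

    if text_feat.shape != (visual.shape[0],):
        raise ValueError("text_feat must have one entry per fused channel")

    # Integrate the text features by element-wise multiplication,
    # broadcasting the per-channel text vector over height and width.
    return visual * text_feat[:, None, None]


# Example with made-up sizes: 4 + 6 channels on an 8 x 8 feature map.
rng = np.random.default_rng(0)
diffusion_feat = rng.standard_normal((4, 8, 8))
transporter_feat = rng.standard_normal((6, 8, 8))
text_feat = rng.standard_normal(10)

fused = fuse_features(diffusion_feat, transporter_feat, text_feat)
print(fused.shape)  # (10, 8, 8)
```

Concatenation keeps the two visual feature sets in separate channels, whereas the multiplicative step lets the text feature rescale each channel, which is one common way to condition visual features on language.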