Learning human-to-robot handovers through 3D scene reconstruction

Yuekun Wu1, Yik Lung Pang1, Andrea Cavallaro2,3, Changjae Oh1
1Queen Mary University of London
2Idiap Research Institute
3École Polytechnique Fédérale de Lausanne

We propose the first method for learning supervised robot handover policies solely from RGB images, without the need for real-robot training or real-robot data collection.

Abstract

Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training in simulation offers a cost-effective alternative, the visual domain gap between simulation and the robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised robot handover policies solely from RGB images, without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RHO-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and directly deploy this policy in the real environment. Experiments in both the Gaussian Splatting reconstructed scene and real-world human-to-robot handovers demonstrate that H2RHO-SGS serves as a new and effective representation for the human-to-robot handover task.
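Because the hand-eye camera is rigidly attached to the gripper, a camera pose simulated in the reconstructed scene maps to a gripper pose through the fixed hand-eye extrinsic. The minimal sketch below illustrates this relation with homogeneous transforms; the function name, variable names, and numerical values are illustrative assumptions, not taken from our implementation.

```python
import numpy as np

def gripper_pose_from_camera_pose(T_world_camera: np.ndarray,
                                  T_gripper_camera: np.ndarray) -> np.ndarray:
    """The camera is rigidly mounted on the gripper, so
    T_world_camera = T_world_gripper @ T_gripper_camera.
    Solving for the gripper pose turns a simulated camera pose in the
    reconstructed scene into a gripper pose."""
    return T_world_camera @ np.linalg.inv(T_gripper_camera)

# Example: a camera pose sampled in the Gaussian Splatting scene
# (values are placeholders for illustration only).
T_world_camera = np.eye(4)
T_world_camera[:3, 3] = [0.4, 0.0, 0.3]      # 40 cm forward, 30 cm up
T_gripper_camera = np.eye(4)
T_gripper_camera[:3, 3] = [0.0, 0.0, 0.08]   # camera offset from the gripper
T_world_gripper = gripper_pose_from_camera_pose(T_world_camera, T_gripper_camera)
```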

Pipeline

Overview of our method. (a) Given sparse-view RGB-D handover images, we reconstruct a 3D scene using Gaussian Splatting (GS) and then estimate grasp poses from the object and hand point clouds extracted from the GS scene. (b) We then use the GS scene and grasp pose to generate the gripper’s trajectory toward the pre-grasp pose and to render the hand-eye image at each sampled pose. (c) Each trajectory yields a handover demonstration containing hand-eye images, object and hand masks, gripper pose transformations, and pre-grasp pose labels; together, these demonstrations form the training dataset. (d) The dataset is used to train a handover policy. At inference time, only the hand-eye RGB image and masks are required.
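As a rough sketch of how steps (b) and (c) could be realized, the snippet below samples gripper poses along a path to the pre-grasp pose and records an (image, masks, action, label) tuple at each pose. The renderer and mask functions are assumed interfaces, and only the translation is interpolated for simplicity; this is an illustration, not our actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class HandoverStep:
    rgb: np.ndarray          # hand-eye image rendered from the GS scene
    masks: np.ndarray        # object and hand masks for the same view
    delta_pose: np.ndarray   # 4x4 relative gripper transform to the next waypoint
    is_pregrasp: bool        # label: does this step reach the pre-grasp pose?

def generate_demo(T_start: np.ndarray,
                  T_pregrasp: np.ndarray,
                  render_view: Callable[[np.ndarray], np.ndarray],
                  render_masks: Callable[[np.ndarray], np.ndarray],
                  num_steps: int = 20) -> List[HandoverStep]:
    """Interpolate the gripper translation toward the pre-grasp pose and log
    (image, masks, action) at every sampled pose."""
    poses = []
    for i in range(num_steps + 1):
        a = i / num_steps
        T = T_pregrasp.copy()                 # keep the pre-grasp orientation
        T[:3, 3] = (1 - a) * T_start[:3, 3] + a * T_pregrasp[:3, 3]
        poses.append(T)
    demo = []
    for T_cur, T_next in zip(poses[:-1], poses[1:]):
        delta = np.linalg.inv(T_cur) @ T_next  # relative motion in gripper frame
        demo.append(HandoverStep(rgb=render_view(T_cur),
                                 masks=render_masks(T_cur),
                                 delta_pose=delta,
                                 is_pregrasp=bool(np.allclose(T_next, T_pregrasp))))
    return demo
```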

Robot setup

We use 16 household objects for training the grasping policy, as shown in the top part of the figure. For real-robot experiments (bottom-left), we use a UR5 robotic arm equipped with a Robotiq 2F-85 two-finger gripper and a hand-eye Intel RealSense D435i camera; a human demonstrator holds the object to simulate the handover scenario. To evaluate the model’s performance, we use 6 test objects in total (bottom-right), including 4 seen during training and 2 unseen, to assess both effectiveness and generalization in handover tasks.
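For reference, hand-eye RGB frames can be grabbed from the RealSense D435i with the pyrealsense2 library roughly as follows; the stream resolution and frame rate here are assumptions, not necessarily the settings used in our experiments.

```python
import numpy as np
import pyrealsense2 as rs

# Start a color stream from the wrist-mounted D435i (assumed 640x480 @ 30 fps).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    color_frame = frames.get_color_frame()
    rgb = np.asanyarray(color_frame.get_data())[:, :, ::-1]  # BGR -> RGB
finally:
    pipeline.stop()
```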

Reconstruction pipeline

Real-robot experiments (2× speed)