Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Queen Mary, University of London
Idiap Research Institute, École Polytechnique Fédérale de Lausanne

We propose StereoHO, a hand-object reconstruction method for wide-baseline stereo cameras, enabling a robot to receive general household objects from humans.

Abstract

Jointly estimating hand and object shape is key to successful robot grasping in human-to-robot handovers. However, methods that rely on hand-crafted prior knowledge about the geometric structure of the object fail to generalise to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset so that our method generalises to unseen objects, and we use RGB inputs instead of depth, as RGB better captures transparent objects. Our method achieves a lower object Chamfer distance than existing RGB-based hand-object reconstruction methods in both single-view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from humans.

Human-to-robot handover setup

We reconstruct the hand-object pointcloud from stereo RGB input for human-to-robot handover. (1) A safe grasp is selected for the handover and the robot moves in to grasp the object. (2) The object is delivered to a target location on the table. (3) The robot returns to its starting position.

Robot setup
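The control flow of this loop is simple enough to sketch in Python. The robot interface and the reconstruction and grasp-selection functions below are hypothetical placeholders, not the actual API of this work:

```python
def handover_episode(robot, reconstruct_hand_object, select_safe_grasp,
                     stereo_rgb, target_pose, home_pose):
    """One handover episode following the three stages above.

    `robot` is any object exposing move_to_pose / close_gripper /
    open_gripper; `reconstruct_hand_object` and `select_safe_grasp`
    are injected callables (all hypothetical interfaces).
    """
    # Reconstruct the hand-object pointcloud from the stereo RGB pair.
    pointcloud = reconstruct_hand_object(stereo_rgb)

    # (1) Select a safe grasp on the object and move in to grasp it.
    grasp_pose = select_safe_grasp(pointcloud)
    robot.move_to_pose(grasp_pose)
    robot.close_gripper()

    # (2) Deliver the object to the target location on the table.
    robot.move_to_pose(target_pose)
    robot.open_gripper()

    # (3) Return to the starting position.
    robot.move_to_pose(home_pose)
```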

Stereo hand-object reconstruction

Our proposed hand-object reconstruction method takes two cropped images from a wide-baseline stereo camera pair. StereoHO first performs shape estimation from each view independently to obtain per-view probability distributions over the shape codebooks. The stereo probability distribution is computed by element-wise multiplication of the per-view distributions. The trained SDF decoder transforms the stereo prediction into the hand-object TSDF. Surface points, sampled as a pointcloud from the TSDF, are projected into each view using the predicted camera projection parameters, and the segmentation masks are used to remove outliers and obtain the final pointcloud.

Reconstruction pipeline
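The two core operations of this pipeline, fusing the per-view codebook distributions and removing outliers via mask projection, can be sketched in Python with NumPy. The renormalisation after the element-wise product, and all function and argument names, are illustrative assumptions rather than the exact implementation:

```python
import numpy as np

def fuse_view_probabilities(p_left, p_right, eps=1e-12):
    # Element-wise product of the two per-view distributions over
    # codebook entries, renormalised to sum to one (renormalisation
    # is our assumption; the page only states the multiplication).
    p = p_left * p_right
    return p / (p.sum(axis=-1, keepdims=True) + eps)

def mask_outlier_removal(points, projections, masks):
    # Keep only surface points whose 2D projection lands inside the
    # hand-object segmentation mask in every view. `projections` is
    # a list of 3x4 camera projection matrices and `masks` a list of
    # binary HxW arrays (hypothetical interfaces).
    keep = np.ones(len(points), dtype=bool)
    homog = np.hstack([points, np.ones((len(points), 1))])  # Nx4
    for P, mask in zip(projections, masks):
        uvw = homog @ P.T                       # project into the view
        uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
        h, w = mask.shape
        inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < h))
        on_mask = np.zeros(len(points), dtype=bool)
        on_mask[inside] = mask[uv[inside, 1], uv[inside, 0]] > 0
        keep &= on_mask
    return points[keep]
```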

Robot control

Our proposed robot control pipeline for human-to-robot handover (modules from other works are in grey). We first perform hand-object detection to obtain bounding boxes. Hand-object segmentation masks and wrist poses are estimated on the image cropped around the hand. We combine the outputs of these preprocessing steps for stereo hand-object reconstruction to obtain the pointcloud. Grasp estimation is performed on the reconstructed shape, and the grasp is transformed at each timestep using the estimated wrist poses. The wrist pose in robot coordinate space is computed using the world-to-robot-base transform obtained through hand-eye calibration.

Robot pipeline
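The per-timestep grasp update reduces to a chain of homogeneous transforms. A minimal sketch, assuming 4x4 matrices and a grasp expressed relative to the wrist (the frame decomposition here is our illustration, not a stated convention of the paper):

```python
import numpy as np

def grasp_in_robot_frame(T_base_world, T_world_wrist_t, T_wrist_grasp):
    # T_base_world   : world -> robot base, from hand-eye calibration.
    # T_world_wrist_t: wrist pose in world coordinates at timestep t.
    # T_wrist_grasp  : grasp pose relative to the wrist, fixed once
    #                  at grasp-estimation time.
    # All inputs are 4x4 homogeneous transforms (illustrative frames).
    return T_base_world @ T_world_wrist_t @ T_wrist_grasp
```

Composing the transforms this way lets the selected grasp follow the hand as it moves, so the robot targets an up-to-date pose at every timestep.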

Hand-object reconstructions

Comparison of single-view and stereo hand-object reconstructions on DexYCB. In the single-view setting, each reconstruction corresponds to the image on the same row; in the stereo setting, the same reconstruction is shown from two viewpoints. Our method yields less noisy reconstructions by taking segmentation masks as an additional input. In the stereo setting, our method improves both the hand and the object reconstructions by combining the predictions from the individual views.

Comparison with previous work

Our method allows the robot to grasp with 6-DoF, resulting in more natural handovers. In contrast, previous stereo RGB-based work performs grasping with fixed rotations, forcing the human to adapt to the gripper pose by holding the object upright.

Ours with 6-DoF grasping
Previous work grasping with fixed rotation