Pipeline
We first feed the observation into the CLIP visual encoder and extract the [CLS] token as a global embedding. In parallel, SAM2 segments object masks in the observation, yielding visual crops of individual objects; these crops are then passed through the CLIP visual encoder to produce instance embeddings. To integrate these features, an instance-level semantic module fuses each instance embedding with the global embedding. From the fused features, we compute a confidence score for each object that measures how well it aligns with the given instruction, which is encoded as a text embedding. In the Target Localization stage, a lightweight fully connected network (FCN) refines the specific manipulation position, ensuring an accurate pick. In the Region Determination stage, the crop of the picked object is used to determine the optimal placement position and angle via cross-correlation.
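To make the fusion and scoring step concrete, the sketch below shows one plausible form of the instance-level semantic module in PyTorch. The module name, the concatenate-and-project fusion, and the cosine-similarity confidence are assumptions for illustration; the paper does not specify the fusion architecture, and random tensors stand in for the CLIP and SAM2 outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceSemanticModule(nn.Module):
    """Fuses each instance embedding with the global [CLS] embedding
    and scores alignment with the instruction text embedding.
    A minimal sketch, assuming a concat-and-project fusion."""
    def __init__(self, dim=512):
        super().__init__()
        # Hypothetical fusion head: concatenate instance and global
        # features, then project back to the shared embedding space.
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, inst_emb, global_emb, text_emb):
        # inst_emb:   (N, D) CLIP embeddings of SAM2 object crops
        # global_emb: (D,)   CLIP [CLS] embedding of the observation
        # text_emb:   (D,)   CLIP embedding of the instruction
        g = global_emb.expand_as(inst_emb)               # (N, D)
        fused = self.fuse(torch.cat([inst_emb, g], -1))  # (N, D)
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        # Cosine similarity with the instruction gives a
        # per-object confidence score.
        return fused @ text                              # (N,)

# Toy usage with random features in place of real CLIP/SAM2 outputs.
module = InstanceSemanticModule(dim=512)
inst = torch.randn(5, 512)   # embeddings of 5 object crops
glob = torch.randn(512)      # global observation embedding
text = torch.randn(512)      # instruction embedding
conf = module(inst, glob, text)
target = conf.argmax()       # index of the most instruction-aligned object
print(conf, target)
```

In this reading, the object with the highest confidence is selected as the pick target, after which the FCN and cross-correlation stages operate on it as described above.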