ABM: Attention before Manipulation

Fan Zhuo, Ying He, Fei Yu, Pengteng Li, Zheyi Zhao, Xilong Sun

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1816-1824. https://doi.org/10.24963/ijcai.2024/201

Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs into policy learning. However, these methods remain limited to the labeled data they are trained on, resulting in poor generalization to unseen instructions and objects. To address this limitation, we propose a simple method called "Attention before Manipulation" (ABM), which fully leverages the object knowledge encoded in CLIP to extract information about the target object in the image. It constructs an Object Mask Field, which serves as a better representation of the target object and allows the model to separate visual grounding from action prediction and acquire specific manipulation skills effectively. We train ABM on 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in zero-shot and compositional generalization settings.
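To make the underlying idea concrete, below is a minimal sketch of how object localization can be extracted from a frozen CLIP by scoring patch-level image embeddings against a text query and thresholding the resulting heatmap into a coarse mask. This is an illustrative MaskCLIP-style heuristic under stated assumptions, not the paper's exact Object Mask Field construction; the model checkpoint, prompt, image path, and thresholding rule are all assumptions for demonstration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP ViT variant with a known patch grid works similarly.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical inputs: a scene image and a language query naming the target object.
image = Image.open("scene.png").convert("RGB")
inputs = processor(text=["a red block"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    # Run the vision tower and keep the patch tokens (drop the CLS token).
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:, :]
    # Project patch tokens into CLIP's shared image-text embedding space.
    # (CLIP normally projects only the pooled CLS token; applying the same
    # projection per patch is the common heuristic for dense localization.)
    patch_tokens = model.vision_model.post_layernorm(patch_tokens)
    patch_embeds = model.visual_projection(patch_tokens)          # (1, N, D)
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"])                  # (1, D)

# Cosine similarity between each patch and the text query.
patch_embeds = patch_embeds / patch_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
sim = (patch_embeds @ text_embeds.T).squeeze(-1)                  # (1, N)

# Reshape to the patch grid (7x7 for ViT-B/32 at 224x224) and threshold.
side = int(sim.shape[-1] ** 0.5)
heatmap = sim.reshape(1, side, side)
mask = (heatmap > heatmap.mean()).float()  # coarse binary mask of the target
```

A representation like `mask` (upsampled to image resolution) could then be fed to a downstream policy alongside the raw observation, so that grounding the instruction and predicting actions are handled by separate components, which is the decoupling the abstract describes.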
Keywords:
Computer Vision: CV: Embodied vision: Active agents, simulation
Robotics: ROB: Applications
Robotics: ROB: Manipulation
Robotics: ROB: Robotics and vision