MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models

Kanxue Li, Baosheng Yu, Qi Zheng, Yibing Zhan, Yuhui Zhang, Tianle Zhang, Yijun Yang, Yue Chen, Lei Sun, Qiong Cao, Li Shen, Lusong Li, Dapeng Tao, Xiaodong He

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 129-138. https://doi.org/10.24963/ijcai.2024/15

Foundation models have demonstrated significant emergent abilities, holding great promise for enhancing the reasoning and planning capacities of embodied agents. However, a comprehensive benchmark for evaluating embodied agents with multimodal observations in complex environments is still lacking. In this paper, we present MuEP, a comprehensive Multimodal benchmark for Embodied Planning. MuEP facilitates the evaluation of multimodal, multi-turn interactions of embodied agents in complex scenes, incorporating fine-grained evaluation metrics that provide insight into an agent's performance throughout each task. Furthermore, we evaluate embodied agents built on recent state-of-the-art foundation models, including large language models (LLMs) and large multimodal models (LMMs), on the proposed benchmark. Experimental results show that foundation models operating on textual representations of environments usually outperform their visual counterparts, suggesting a gap in embodied planning abilities with multimodal observations. We also find that, beyond common-sense knowledge, the ability to generate precise control language is indispensable for accurate embodied task completion. We hope the proposed MuEP benchmark can contribute to the advancement of embodied AI with foundation models.
Keywords:
Agent-based and Multi-agent Systems: MAS: Agent-based simulation and emergence
Computer Vision: CV: Embodied vision: Active agents, simulation
Multidisciplinary Topics and Applications: MTA: Databases