Map-based Modular Approach for Zero-shot Embodied Question Answering


TL;DR: Our paper introduces a map-based modular approach to Embodied Question Answering (EQA), enabling real-world robots to explore while answering a wide range of natural language questions, with demonstrated effectiveness in both virtual and real-world settings.

[Figure: teaser]

Abstract

Embodied Question Answering (EQA) serves as a benchmark task to evaluate the capability of robots to navigate within novel environments and identify objects in response to human queries. However, existing EQA methods often rely on simulated environments and operate with limited vocabularies. This paper presents a map-based modular approach to EQA, enabling real-world robots to explore and map unknown environments. By leveraging foundation models, our method facilitates answering a diverse range of questions using natural language. We conducted extensive experiments in both virtual and real-world settings, demonstrating the robustness of our approach in navigating and comprehending queries within unknown environments.


Data Preprocessing

[Figure: prompts for data preprocessing]

We pre-process the dataset using gpt-3.5-turbo-0613: it extracts the target object category from a given question for ObjNav and converts the question into declarative text for image-text matching.
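
As a concrete illustration, the snippet below sketches how such prompts could be issued through the OpenAI Python SDK; the prompt wording and the helper names (extract_target_object, to_declarative) are assumptions for illustration, not the exact prompts used in our pre-processing.

    # A minimal sketch of the question pre-processing step, assuming the
    # OpenAI Python SDK (>= 1.0). Prompt wording and helper names are
    # illustrative, not the exact prompts used in the paper.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def _ask(instruction: str, question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-0613",
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": question},
            ],
            temperature=0.0,
        )
        return response.choices[0].message.content.strip()

    def extract_target_object(question: str) -> str:
        # Target object category used as the ObjNav goal, e.g. "sofa"
        return _ask("Extract the single target object category from the "
                    "question. Answer with the category name only.", question)

    def to_declarative(question: str) -> str:
        # Declarative text used for image-text matching
        return _ask("Rewrite the question as a short declarative sentence "
                    "describing the scene it asks about.", question)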


Method

[Figure: overview of the proposed method]

We propose a map-based modular approach that combines effective off-the-shelf models to perform EQA. The proposed method comprises a Navigation module (outlined in blue) and a VQA module (outlined in red). The Navigation module consists of a Perception module and a set of policies: the Perception module incrementally builds a 2D map while storing observed images along with their image-text matching scores; the Global Policy selects a long-term goal based on the 2D map and its frontiers; and the Deterministic Local Policy outputs low-level actions toward that goal. Finally, the VQA module provides an answer based on the memorized images and the given question.
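
A schematic Python sketch of this control loop is given below; the module interfaces (perception, global_policy, local_policy, vqa) are hypothetical placeholders that only illustrate the data flow between the modules, not the released implementation.

    # Schematic control loop showing how the modules fit together. The module
    # objects passed in are hypothetical stand-ins for illustration only.
    def run_episode(env, perception, global_policy, local_policy, vqa,
                    question, declarative_text, max_steps=500):
        memory = []          # (RGB frame, image-text matching score) pairs
        obs = env.reset()
        for _ in range(max_steps):
            # Perception: update the 2D map and score the current observation
            occupancy_map, frontiers, itm_score = perception.update(
                obs, declarative_text)
            memory.append((obs["rgb"], itm_score))

            # Global Policy: choose a long-term goal from the map frontiers
            goal = global_policy.select_goal(occupancy_map, frontiers)
            if goal is None:         # no frontier left, exploration finished
                break

            # Deterministic Local Policy: low-level action toward the goal
            action = local_policy.plan(occupancy_map, goal)
            obs = env.step(action)

        # VQA module: answer from the best-scoring memorized image(s)
        memory.sort(key=lambda pair: pair[1], reverse=True)
        best_images = [img for img, _ in memory[:1]]
        return vqa.answer(question, best_images)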


Image-text Matching in MP3D-EQA

[Figure: image-text matching results]

The EQA agent has to distinguish target objects from other objects based on the declarative text converted from the question. To tackle this problem, we use the vision-language foundation models BLIP2 and CLIP as the image-text matching module. The results on MP3D-EQA show that the combination of BLIP2 with declarative text performs best among the evaluated pairs.
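
For illustration, the sketch below computes such an image-text matching score with CLIP via Hugging Face transformers; the checkpoint name is an assumption, and BLIP2 scores can be obtained analogously with a BLIP2 image-text matching model.

    # Minimal image-text matching sketch using CLIP through Hugging Face
    # transformers. The checkpoint is an assumption; the paper also
    # evaluates BLIP2 scores computed in the same way.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def itm_score(image: Image.Image, declarative_text: str) -> float:
        # Scaled cosine similarity between the observation and the
        # declarative text converted from the question
        inputs = processor(text=[declarative_text], images=image,
                           return_tensors="pt", padding=True)
        return model(**inputs).logits_per_image.item()

    # e.g. itm_score(Image.open("frame.png"), "There is a sofa in the living room.")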


VQA Results in MP3D-EQA

[Figure: VQA results]

We investigate which VQA models are effective on MP3D-EQA. The results indicate that LLaVA is the most effective, so we adopt it as the VQA module.
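
As a rough sketch, LLaVA can be queried for VQA through Hugging Face transformers as shown below; the checkpoint and prompt template are assumptions rather than the exact inference setup used in our experiments.

    # Minimal LLaVA VQA sketch via Hugging Face transformers. The checkpoint
    # and prompt template are assumptions, not the paper's exact setup.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    @torch.no_grad()
    def answer_question(image: Image.Image, question: str) -> str:
        prompt = f"USER: <image>\n{question} ASSISTANT:"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        inputs = inputs.to(model.device, torch.float16)
        output_ids = model.generate(**inputs, max_new_tokens=32)
        text = processor.decode(output_ids[0], skip_special_tokens=True)
        return text.split("ASSISTANT:")[-1].strip()

    # e.g. answer_question(Image.open("best_frame.png"), "What color is the sofa?")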


EQA Results in Simulation

[Figure: EQA results in simulation]

The EQA results on MP3D-EQA show that our method consistently outperforms the VQA-only baseline, indicating that the navigation module gathers the observations needed to answer questions efficiently.



BibTeX


        @inproceedings{sakamoto2024mapeqa,
          author={Koya Sakamoto and Daichi Azuma and Taiki Miyanishi and Shuhei Kurita and Motoaki Kawanabe},
          title={Map-based Modular Approach for Zero-shot Embodied Question Answering},
          booktitle={Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
          year={2024},
        }