TY - GEN
T1 - Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents
AU - Chen, Yuanwen
AU - Zhang, Xinyao
AU - Chen, Yaran
AU - Zhao, Dongbin
AU - Zhao, Yunzhen
AU - Zhao, Zhe
AU - Hu, Pengfei
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Embodied Instruction Following (EIF) involves the task of locating and manipulating objects according to language instructions. Existing methods face challenges in small object navigation due to ineffective exploration and imperfect perception, which ultimately affects their performance. This study focuses on small object navigation in the EIF domain. We propose Common Sense Language-guided exploration (CSL), a novel approach that leverages common-sense knowledge from seen scenes and information from language instructions to infer the location of objects. The proposed CSL significantly improves exploration efficiency. Additionally, we propose Hierarchical Dense Perception (HDP), which uses hierarchical features to perform semantic segmentation and depth estimation. The use of HDP significantly improves the agent's perceptual capabilities. Experiments on the ALFRED benchmark demonstrate the effectiveness of CSL-HDP. The proposed CSL-HDP achieves an absolute improvement of 9.29% (18.45% relative) on unseen test scenes compared to the previous state-of-the-art, securing the top position on the leaderboard. Code will be available at https://github.com/Cyuanwen/CSL-HDP.
KW - Computer Vision
KW - Embodied Instruction Following
KW - Object Navigation
UR - http://www.scopus.com/inward/record.url?scp=85206591441&partnerID=8YFLogxK
U2 - 10.1109/ICME57554.2024.10687514
DO - 10.1109/ICME57554.2024.10687514
M3 - Conference Proceeding
AN - SCOPUS:85206591441
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -