Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents

Yuanwen Chen; Xinyao Zhang; Yaran Chen; Dongbin Zhao; Yunzhen Zhao; Zhe Zhao; Pengfei Hu

doi:10.1109/ICME57554.2024.10687514

Yuanwen Chen, Xinyao Zhang, Yaran Chen^*, Dongbin Zhao, Yunzhen Zhao, Zhe Zhao, Pengfei Hu

^*Corresponding author for this work

Department of Intelligent Science

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

2 Citations (Scopus)

Abstract

Embodied Instruction Following (EIF) involves the task of locating and manipulating objects according to language instructions. Existing methods face challenges in small object navigation due to ineffective exploration and imperfect perception, which ultimately affects their performance. This study focuses on small object navigation in the EIF domain. We propose Common Sense Language-guided exploration (CSL), a novel approach that leverages common-sense knowledge from seen scenes and information from language instructions to infer the location of objects. The proposed CSL significantly improves exploration efficiency. Additionally, we propose Hierarchical Dense Perception (HDP), which uses hierarchical features to perform semantic segmentation and depth estimation. The use of HDP significantly improves the agent's perceptual capabilities. Experiments on the ALFRED benchmark demonstrate the effectiveness of CSL-HDP. The proposed CSL-HDP achieves an absolute improvement of 9.29% (18.45% relative) on unseen test scenes compared to the previous state-of-the-art, securing the top position on the leaderboard. Code will be available at https://github.com/Cyuanwen/CSL-HDP.
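The abstract reports both an absolute and a relative improvement, which together imply approximate values for the previous and new scores. The sketch below works through that arithmetic. It assumes the relative figure is the absolute gain divided by the previous state-of-the-art score, and that both reported numbers are rounded, so the results are estimates rather than values stated in the paper. The function name `implied_rates` is hypothetical and used only for illustration.

```python
def implied_rates(absolute_gain: float, relative_gain: float) -> tuple[float, float]:
    """Estimate (previous score, new score) in percentage points.

    Assumes relative_gain = absolute_gain / previous_score, with both
    inputs given as percentages (e.g. 9.29 and 18.45).
    """
    previous = absolute_gain / (relative_gain / 100.0)
    return previous, previous + absolute_gain


# Figures taken from the abstract: 9.29% absolute, 18.45% relative.
baseline, improved = implied_rates(9.29, 18.45)
print(f"implied previous score: {baseline:.2f}%")
print(f"implied new score:      {improved:.2f}%")
```

Under these assumptions the previous score comes out near 50% and the new score near 60% on unseen test scenes; the abstract does not name the metric, so the units here are simply "percentage points of the reported measure".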

Original language	English
Title of host publication	2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher	IEEE Computer Society
ISBN (Electronic)	9798350390155
DOIs	https://doi.org/10.1109/ICME57554.2024.10687514
Publication status	Published - 2024
Event	2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada; Duration: 15 Jul 2024 → 19 Jul 2024

Publication series

Name	Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print)	1945-7871
ISSN (Electronic)	1945-788X

Conference

Conference	2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory	Canada
City	Niagara Falls
Period	15/07/24 → 19/07/24

Keywords

Computer Vision
Embodied Instruction Following
Object Navigation

Access to Document

10.1109/ICME57554.2024.10687514

Cite this

Chen, Y., Zhang, X., Chen, Y., Zhao, D., Zhao, Y., Zhao, Z., & Hu, P. (2024). Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents. In 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 (Proceedings - IEEE International Conference on Multimedia and Expo). IEEE Computer Society. https://doi.org/10.1109/ICME57554.2024.10687514

@inproceedings{d68d8d5027fa4d32a6c2d71f5d73f4d7,
  title = "Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents",
  abstract = "Embodied Instruction Following (EIF) involves the task of locating and manipulating objects according to language instructions. Existing methods face challenges in small object navigation due to ineffective exploration and imperfect perception, which ultimately affects their performance. This study focuses on small object navigation in the EIF domain. We propose Common Sense Language-guided exploration (CSL), a novel approach that leverages common-sense knowledge from seen scenes and information from language instructions to infer the location of objects. The proposed CSL significantly improves exploration efficiency. Additionally, we propose Hierarchical Dense Perception (HDP), which uses hierarchical features to perform semantic segmentation and depth estimation. The use of HDP significantly improves the agent's perceptual capabilities. Experiments on the ALFRED benchmark demonstrate the effectiveness of CSL-HDP. The proposed CSL-HDP achieves an absolute improvement of 9.29% (18.45% relative) on unseen test scenes compared to the previous state-of-the-art, securing the top position on the leaderboard. Code will be available at https://github.com/Cyuanwen/CSL-HDP.",
  keywords = "Computer Vision, Embodied Instruction Following, Object Navigation",
  author = "Yuanwen Chen and Xinyao Zhang and Yaran Chen and Dongbin Zhao and Yunzhen Zhao and Zhe Zhao and Pengfei Hu",
  note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 ; Conference date: 15-07-2024 Through 19-07-2024",
  year = "2024",
  doi = "10.1109/ICME57554.2024.10687514",
  language = "English",
  series = "Proceedings - IEEE International Conference on Multimedia and Expo",
  publisher = "IEEE Computer Society",
  booktitle = "2024 IEEE International Conference on Multimedia and Expo, ICME 2024",
}

Chen, Y, Zhang, X, Chen, Y, Zhao, D, Zhao, Y, Zhao, Z & Hu, P 2024, Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents. in 2024 IEEE International Conference on Multimedia and Expo, ICME 2024. Proceedings - IEEE International Conference on Multimedia and Expo, IEEE Computer Society, 2024 IEEE International Conference on Multimedia and Expo, ICME 2024, Niagara Falls, Canada, 15/07/24. https://doi.org/10.1109/ICME57554.2024.10687514

Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents. / Chen, Yuanwen; Zhang, Xinyao; Chen, Yaran et al.
2024 IEEE International Conference on Multimedia and Expo, ICME 2024. IEEE Computer Society, 2024. (Proceedings - IEEE International Conference on Multimedia and Expo).

TY - GEN

T1 - Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents

AU - Chen, Yuanwen

AU - Zhang, Xinyao

AU - Chen, Yaran

AU - Zhao, Dongbin

AU - Zhao, Yunzhen

AU - Zhao, Zhe

AU - Hu, Pengfei

PY - 2024

Y1 - 2024

AB - Embodied Instruction Following (EIF) involves the task of locating and manipulating objects according to language instructions. Existing methods face challenges in small object navigation due to ineffective exploration and imperfect perception, which ultimately affects their performance. This study focuses on small object navigation in the EIF domain. We propose Common Sense Language-guided exploration (CSL), a novel approach that leverages common-sense knowledge from seen scenes and information from language instructions to infer the location of objects. The proposed CSL significantly improves exploration efficiency. Additionally, we propose Hierarchical Dense Perception (HDP), which uses hierarchical features to perform semantic segmentation and depth estimation. The use of HDP significantly improves the agent's perceptual capabilities. Experiments on the ALFRED benchmark demonstrate the effectiveness of CSL-HDP. The proposed CSL-HDP achieves an absolute improvement of 9.29% (18.45% relative) on unseen test scenes compared to the previous state-of-the-art, securing the top position on the leaderboard. Code will be available at https://github.com/Cyuanwen/CSL-HDP.

KW - Computer Vision

KW - Embodied Instruction Following

KW - Object Navigation

UR - http://www.scopus.com/inward/record.url?scp=85206591441&partnerID=8YFLogxK

U2 - 10.1109/ICME57554.2024.10687514

DO - 10.1109/ICME57554.2024.10687514

M3 - Conference Proceeding

AN - SCOPUS:85206591441

T3 - Proceedings - IEEE International Conference on Multimedia and Expo

BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024

PB - IEEE Computer Society

T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024

Y2 - 15 July 2024 through 19 July 2024

ER -

Chen Y, Zhang X, Chen Y, Zhao D, Zhao Y, Zhao Z et al. Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents. In 2024 IEEE International Conference on Multimedia and Expo, ICME 2024. IEEE Computer Society. 2024. (Proceedings - IEEE International Conference on Multimedia and Expo). doi: 10.1109/ICME57554.2024.10687514
