TY - JOUR
T1 - SIRI: Spatial Relation Induced Network for Spatial Description Resolution
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
AU - Wang, Peiyao
AU - Luo, Weixin
AU - Xu, Yanyu
AU - Li, Haojie
AU - Xu, Shugong
AU - Yang, Jianyu
AU - Gao, Shenghua
N1 - Publisher Copyright:
© 2020 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2020
Y1 - 2020
AB - Spatial Description Resolution, a language-guided localization task, aims to locate a target in a panoramic street view given a corresponding language description. Explicitly characterizing object-level relationships while distilling spatial relationships is currently absent from, yet crucial to, this task. Mimicking humans, who sequentially traverse spatial relationship words and objects from a first-person view to locate their target, we propose a novel Spatial Relationship Induced (SIRI) network. Specifically, visual features are first correlated at an implicit object level in a projected latent space; they are then distilled by each spatial relationship word, yielding a differently activated feature for each spatial relationship. Further, we introduce global position priors to compensate for the absence of positional information, which can otherwise cause ambiguities in global positional reasoning. The linguistic and visual features are concatenated to finalize the target localization. Experimental results on the Touchdown dataset show that our method outperforms the state-of-the-art method by around 24% in accuracy, measured within an 80-pixel radius. Our method also generalizes well on our proposed extended dataset, collected under the same settings as Touchdown. The code for this project is publicly available at https://github.com/wong-puiyiu/siri-sdr.
UR - http://www.scopus.com/inward/record.url?scp=85108413500&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85108413500
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
Y2 - 6 December 2020 through 12 December 2020
ER -