TY - JOUR
T1 - WaterVG
T2 - Waterway Visual Grounding Based on Text-Guided Vision and mmWave Radar
AU - Guan, Runwei
AU - Jia, Liye
AU - Yao, Shanliang
AU - Yang, Fengyufan
AU - Xu, Sheng
AU - Purwanto, Erick
AU - Zhu, Xiaohui
AU - Man, Ka Lok
AU - Lim, Eng Gee
AU - Smith, Jeremy
AU - Hu, Xuming
AU - Yue, Yutao
N1 - Publisher Copyright:
© 2000-2011 IEEE.
PY - 2025
Y1 - 2025
AB - Waterway perception is critical for the special operations and autonomous navigation of Unmanned Surface Vessels (USVs), but current perception schemes are purely sensor-based, neglecting the human-USV interaction needed for embodied perception across diverse operations. Therefore, inspired by visual grounding, we present WaterVG, the first visual grounding dataset tailored for USV-based waterway perception guided by human prompts. WaterVG contains a wealth of prompts describing multiple targets, with instance-level annotations including bounding boxes and masks. Specifically, WaterVG comprises 11,568 samples and 34,987 referred targets, integrating both visual and radar characteristics. This text-guided two-sensor design yields fine-grained text prompts, containing both qualitative and quantitative descriptions, aligned with the visual and radar features of the referred targets. To extend endurance and maintain normal USV operation in open waterways, we propose Potamoi, a low-power visual grounding model. Potamoi is a multi-task model employing a Phased Heterogeneous Modality Fusion (PHMF) mechanism, which includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). The ARW module uses a gating mechanism to adaptively extract essential radar features for fusion with visual inputs, ensuring alignment with the prompts. MHSCA, with a low parameter count and low FLOPs, effectively integrates contextual information from both sensors with linguistic features, delivering strong performance on visual grounding tasks. Comprehensive experiments and evaluations on WaterVG demonstrate that Potamoi achieves state-of-the-art results compared with existing methods. The project is available at https://github.com/GuanRunwei/WaterVG.
KW - interactive perception
KW - multi-modal learning
KW - perception of unmanned surface vessels
KW - visual grounding
UR - http://www.scopus.com/inward/record.url?scp=85217017429&partnerID=8YFLogxK
U2 - 10.1109/TITS.2025.3527011
DO - 10.1109/TITS.2025.3527011
M3 - Article
AN - SCOPUS:85217017429
SN - 1524-9050
JO - IEEE Transactions on Intelligent Transportation Systems
JF - IEEE Transactions on Intelligent Transportation Systems
ER -