WaterVG: Waterway Visual Grounding Based on Text-Guided Vision and mmWave Radar

Runwei Guan, Liye Jia, Shanliang Yao, Fengyufan Yang, Sheng Xu, Erick Purwanto, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Xuming Hu, Yutao Yue*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Waterway perception is critical for the special operations and autonomous navigation of Unmanned Surface Vessels (USVs), but current perception schemes are purely sensor-based and neglect human–USV interaction for embodied perception in various operations. Therefore, inspired by visual grounding, we present WaterVG, the inaugural visual grounding dataset tailored for USV-based waterway perception guided by human prompts. WaterVG contains a wealth of prompts describing multiple targets, with instance-level annotations, including bounding boxes and masks. Specifically, WaterVG comprises 11,568 samples and 34,987 referred targets, integrating both visual and radar characteristics. The text-guided two-sensor pattern provides a fine granularity of text prompts aligned with the visual and radar features of the referent targets, containing both qualitative and numeric descriptions. To enhance the endurance and maintain the normal operations of USVs in open waterways, we propose Potamoi, a low-power visual grounding model. Potamoi is a multi-task model employing a sophisticated Phased Heterogeneous Modality Fusion (PHMF) mechanism, which includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). The ARW module utilizes a gating mechanism to adaptively extract essential radar features for fusion with visual inputs, ensuring alignment with the text prompt. MHSCA, characterized by its low parameter count and FLOPs, effectively integrates contextual information from both sensors with linguistic features, delivering outstanding performance in visual grounding tasks. Comprehensive experiments and evaluations on WaterVG demonstrate that Potamoi achieves state-of-the-art results compared to existing methods. The project is available at https://github.com/GuanRunwei/WaterVG.
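The two fusion ideas named in the abstract — a gate that adaptively weights radar features before fusion (ARW) and a cross attention that binds sensor tokens to the text prompt (MHSCA) — can be sketched roughly as below. This is an illustrative NumPy sketch only: the function names, toy dimensions, random weights, and the single-head simplification are assumptions, not the paper's actual multi-head slim implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_radar_weighting(radar, W_gate):
    """Gate radar channels: a sigmoid gate computed from pooled radar
    features decides how much each channel contributes to fusion.
    (Illustrative stand-in for the paper's ARW module.)"""
    pooled = radar.mean(axis=0)          # (C,) channel-wise pooling
    gate = sigmoid(W_gate @ pooled)      # (C,) per-channel gate in (0, 1)
    return radar * gate                  # (N, C) gated radar features

def cross_attention(query, key, value):
    """Single-head scaled dot-product cross attention: sensor tokens
    attend to text tokens (a simplification of multi-head MHSCA)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)                        # (Nq, Nk)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))    # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ value                                           # (Nq, d)

# Toy shapes: 4 radar points, 6 visual tokens, 5 text tokens, feature dim 8.
C = 8
radar  = rng.normal(size=(4, C))
visual = rng.normal(size=(6, C))
text   = rng.normal(size=(5, C))
W_gate = rng.normal(size=(C, C))       # would be learned in practice

gated_radar   = adaptive_radar_weighting(radar, W_gate)
sensor_tokens = np.concatenate([visual, gated_radar], axis=0)  # (10, C)
fused = cross_attention(sensor_tokens, text, text)             # (10, C)
print(fused.shape)  # → (10, 8)
```

Because the gate lies in (0, 1), gated radar features are always a damped version of the raw ones, which is the sense in which ARW "adaptively extracts essential radar features" before they meet the visual stream.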

Original language: English
Journal: IEEE Transactions on Intelligent Transportation Systems
Publication status: Accepted/In press, 2025

Keywords

  • interactive perception
  • multi-modal learning
  • perception of unmanned surface vessels
  • visual grounding
