WaterVG: Waterway Visual Grounding Based on Text-Guided Vision and mmWave Radar

Runwei Guan, Liye Jia, Shanliang Yao, Fengyufan Yang, Sheng Xu, Erick Purwanto, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Xuming Hu, Yutao Yue*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Waterway perception is critical for the special operations and autonomous navigation of Unmanned Surface Vessels (USVs), but current perception schemes are purely sensor-based and neglect human–USV interaction for embodied perception in various operations. Therefore, inspired by visual grounding, we present WaterVG, the inaugural visual grounding dataset tailored for USV-based waterway perception guided by human prompts. WaterVG contains a wealth of prompts describing multiple targets, with instance-level annotations, including bounding boxes and masks. Specifically, WaterVG comprises 11,568 samples and 34,987 referred targets, integrating both visual and radar characteristics. The text-guided two-sensor pattern provides a fine granularity of text prompts aligned with the visual and radar features of the referent targets, containing both qualitative and numeric descriptions. To enhance the endurance and maintain the normal operations of USVs in open waterways, we propose Potamoi, a low-power visual grounding model. Potamoi is a multi-task model employing a sophisticated Phased Heterogeneous Modality Fusion (PHMF) mechanism, which includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). The ARW module utilizes a gating mechanism to adaptively extract essential radar features for fusion with visual inputs, ensuring alignment with the text prompt. MHSCA, characterized by its low parameter count and FLOPs, effectively integrates contextual information from both sensors with linguistic features, delivering outstanding performance in visual grounding tasks. Comprehensive experiments and evaluations on WaterVG demonstrate that Potamoi achieves state-of-the-art results compared to existing methods. The project is available at https://github.com/GuanRunwei/WaterVG.
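The two fusion ideas named in the abstract — a gate that adaptively weights radar features before fusion (ARW) and a cross attention that binds sensor tokens to the text prompt (MHSCA) — can be sketched roughly as below. This is an illustrative NumPy sketch only: the function names, toy dimensions, random weights, and the single-head simplification are assumptions, not the paper's actual multi-head slim implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_radar_weighting(radar, W_gate):
    """Gate radar channels: a sigmoid gate computed from pooled radar
    features decides how much each channel contributes to fusion.
    (Illustrative stand-in for the paper's ARW module.)"""
    pooled = radar.mean(axis=0)          # (C,) channel-wise pooling
    gate = sigmoid(W_gate @ pooled)      # (C,) per-channel gate in (0, 1)
    return radar * gate                  # (N, C) gated radar features

def cross_attention(query, key, value):
    """Single-head scaled dot-product cross attention: sensor tokens
    attend to text tokens (a simplification of multi-head MHSCA)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)                        # (Nq, Nk)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))    # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ value                                           # (Nq, d)

# Toy shapes: 4 radar points, 6 visual tokens, 5 text tokens, feature dim 8.
C = 8
radar  = rng.normal(size=(4, C))
visual = rng.normal(size=(6, C))
text   = rng.normal(size=(5, C))
W_gate = rng.normal(size=(C, C))       # would be learned in practice

gated_radar   = adaptive_radar_weighting(radar, W_gate)
sensor_tokens = np.concatenate([visual, gated_radar], axis=0)  # (10, C)
fused = cross_attention(sensor_tokens, text, text)             # (10, C)
print(fused.shape)  # → (10, 8)
```

Because the gate lies in (0, 1), gated radar features are always a damped version of the raw ones, which is the sense in which ARW "adaptively extracts essential radar features" before they meet the visual stream.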

Original language: English
Journal: IEEE Transactions on Intelligent Transportation Systems
Publication status: Accepted/In press, 2025

Keywords

  • interactive perception
  • multi-modal learning
  • perception of unmanned surface vessels
  • visual grounding
