The perception of waterways based on human intent holds significant importance for autonomous navigation and operations of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, in this paper, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception based on human intention prompts. WaterVG encompasses prompts describing multiple targets, with annotations at the instance level including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,950 referred targets, which integrates both visual and radar characteristics captured by monocular camera and millimeter-wave (mmWave) radar, enabling a finer granularity of text prompts. Furthermore, we propose a novel multi-modal visual grounding model, Potamoi, which is a multi-modal and multi-task model based on the one-stage paradigm with a designed Phased Heterogeneous Modality Fusion (PHMF) structure, including Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). In specific, MHSCA is a low-cost and efficient fusion module with a remarkably small parameter count and FLOPs, elegantly aligning and fusing scenario context information captured by two sensors with linguistic features, which can effectively address tasks of referring expression comprehension and segmentation based on fine-grained prompts. Comprehensive experiments and evaluations have been conducted on WaterVG, where our Potamoi archives state-of-the-art performances compared with counterparts.
Original languageEnglish
Title of host publicationUnder double-blind review
Publication statusPublished - 19 Mar 2024


Dive into the research topics of 'WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar'. Together they form a unique fingerprint.

Cite this