Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning

Mingjie Sun, Jimin Xiao*, Eng Gee Lim, Cairong Zhao, Yao Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, where a first-frame mask-level or language-level label, respectively, provides the target information. Due to the huge gap between the modalities, existing works have not proposed a unified framework covering both sub-tasks. In this work, such a unified framework is designed: the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into tokens of the same size, which are processed equally by a self-attention based segmentation model. Furthermore, to highlight significant information and discard non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize them. Experiments show that the proposed method achieves new state-of-the-art performance: 52.8% J&F on the Ref-YouTube-VOS dataset and 83.2% J_s on the YouTube-VOS dataset. The code will be released.
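To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: mapping image patches and words to same-size tokens that a self-attention model can process equally, and a filter network whose keep/discard decisions are trained with a policy-gradient (REINFORCE-style) objective. All module names, dimensions, and the surrogate loss are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    """Map image patches and word embeddings to tokens of the same size."""

    def __init__(self, patch_dim=768, word_dim=300, token_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(patch_dim, token_dim)  # image patches -> tokens
        self.text_proj = nn.Linear(word_dim, token_dim)     # word embeddings -> tokens

    def forward(self, patches, words):
        # patches: (B, N_p, patch_dim); words: (B, N_w, word_dim)
        tokens = torch.cat([self.visual_proj(patches), self.text_proj(words)], dim=1)
        return tokens  # (B, N_p + N_w, token_dim), processed equally downstream

class TokenFilter(nn.Module):
    """Score each token; keeping or discarding a token is a discrete action,
    so the scorer can be optimized with a policy-gradient objective."""

    def __init__(self, token_dim=256):
        super().__init__()
        self.score = nn.Linear(token_dim, 1)

    def forward(self, tokens):
        keep_prob = torch.sigmoid(self.score(tokens)).squeeze(-1)  # (B, N)
        keep = torch.bernoulli(keep_prob)                          # sampled keep/drop actions
        log_prob = (keep * keep_prob.clamp_min(1e-6).log()
                    + (1 - keep) * (1 - keep_prob).clamp_min(1e-6).log())
        return tokens * keep.unsqueeze(-1), log_prob.sum(dim=1)

# REINFORCE-style update: the reward would be the segmentation quality
# (e.g. J&F) obtained with the filtered tokens; here it is a placeholder.
tokenizer, filt = UnifiedTokenizer(), TokenFilter()
tokens = tokenizer(torch.randn(2, 196, 768), torch.randn(2, 10, 300))
filtered, log_prob = filt(tokens)
reward = torch.rand(2)                      # placeholder segmentation reward
loss = -(reward * log_prob).mean()          # policy-gradient surrogate loss
loss.backward()

The sampled Bernoulli mask is non-differentiable, which is why the sketch weights the action log-probabilities by the reward instead of backpropagating through the mask itself; this is the standard REINFORCE workaround for discrete selection.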

Original language: English
Pages (from-to): 6722-6734
Number of pages: 13
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 8
Publication status: Published - 2024

Keywords

  • Video object segmentation
  • multiple modalities
  • reinforcement learning
