TY - JOUR
T1 - Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning
AU - Sun, Mingjie
AU - Xiao, Jimin
AU - Lim, Eng Gee
AU - Zhao, Cairong
AU - Zhao, Yao
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, where a first-frame mask-level or language-level label, respectively, provides the target information. Due to the large gap between the two modalities, existing works have not proposed a unified framework for these two sub-tasks. In this work, such a unified framework is designed: the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into tokens of the same size, which are processed equally by a self-attention-based segmentation model. Furthermore, to highlight significant information and discard non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize them. Experiments show that the proposed method achieves new state-of-the-art performance: 52.8% J&F on the Ref-YouTube-VOS dataset and 83.2% JS on the YouTube-VOS dataset. The code will be released.
AB - The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, where a first-frame mask-level or language-level label, respectively, provides the target information. Due to the large gap between the two modalities, existing works have not proposed a unified framework for these two sub-tasks. In this work, such a unified framework is designed: the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into tokens of the same size, which are processed equally by a self-attention-based segmentation model. Furthermore, to highlight significant information and discard non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize them. Experiments show that the proposed method achieves new state-of-the-art performance: 52.8% J&F on the Ref-YouTube-VOS dataset and 83.2% JS on the YouTube-VOS dataset. The code will be released.
KW - Video object segmentation
KW - multiple modalities
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85162699873&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3284165
DO - 10.1109/TCSVT.2023.3284165
M3 - Article
AN - SCOPUS:85162699873
SN - 1051-8215
VL - 34
SP - 6722
EP - 6734
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 8
ER -