Starting Point Selection and Multiple-Standard Matching for Video Object Segmentation With Language Annotation

Mingjie Sun, Jimin Xiao*, Eng Gee Lim, Yao Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

In this study, we investigate language-level video object segmentation, in which a language annotation on the first frame describes the target object. Because a language label is typically compatible with every frame in a video, the proposed method can choose the most suitable starting frame and thereby mitigate initialization failure. In addition to the visual feature extracted from a static video frame, a motion-language score based on optical flow is proposed to describe moving objects more accurately. The scores from these multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely used video object segmentation datasets (DAVIS 2017, DAVIS 2016, SegTrack V2, and YouTubeObject), and a novel accuracy, measured as mean region similarity, is obtained on both DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%). The code will be published.
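
The two ideas named in the abstract, choosing a starting frame and aggregating scores from several standards with attention weights, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a softmax over per-standard logits, and the toy numbers are all assumptions made for illustration, and the abstract does not specify how the scores or attention weights are actually computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_starting_frame(frame_scores):
    """Return the index of the frame with the highest language-matching score.

    frame_scores is a hypothetical list holding, for each frame, the best
    score between the language label and any candidate object in that frame.
    """
    return max(range(len(frame_scores)), key=lambda i: frame_scores[i])


def aggregate_scores(standard_scores, attention_logits):
    """Combine per-standard candidate scores with softmax attention weights.

    standard_scores: one list of candidate scores per standard
                     (e.g. a visual-language and a motion-language score).
    attention_logits: one hypothetical logit per standard.
    """
    weights = softmax(attention_logits)
    num_candidates = len(standard_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, standard_scores))
        for c in range(num_candidates)
    ]


# Toy data (made up): three frames, two candidates, two standards.
start = select_starting_frame([0.2, 0.9, 0.5])
visual = [0.8, 0.2]
motion = [0.4, 0.6]
fused = aggregate_scores([visual, motion], attention_logits=[0.0, 0.0])
best_candidate = max(range(len(fused)), key=lambda c: fused[c])
```

With equal logits the attention weights are uniform, so the fused score is simply the mean of the two standards; a learned attention module would instead produce input-dependent weights.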

Original language: English
Pages (from-to): 3354-3363
Number of pages: 10
Journal: IEEE Transactions on Multimedia
Volume: 25
Publication status: Published - 2023

Keywords

  • Starting point
  • language annotation
  • matching strategy
  • video object segmentation
