Abstract
In this study, we investigate language-level video object segmentation, in which a first-frame language annotation describes the target object. Because a language label is typically compatible with all frames in a video, the proposed method can choose the most suitable starting frame to mitigate initialization failure. In addition to the visual feature extracted from a static video frame, we propose a motion-language score based on optical flow to describe moving objects more accurately. Scores from multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely used video object segmentation datasets (DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject), and it attains a new level of accuracy, measured as mean region similarity, on both DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%). The code will be published.
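The abstract names two ideas that a small example can make concrete: choosing a starting frame by how well it matches the language label, and combining several matching scores with attention-style weights. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names, the per-frame score lists, and the fixed `attention_logits` are all hypothetical; in the paper the scores presumably come from learned visual-language and motion-language models, and the attention weights would be learned rather than supplied by hand.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_scores(scores, attention_logits):
    """Combine the scores from several matching standards into one value.

    The weights are the softmax of `attention_logits` (hypothetical,
    hand-supplied here; a real model would predict them).
    """
    weights = softmax(attention_logits)
    return sum(w * s for w, s in zip(weights, scores))


def select_starting_frame(frame_scores, attention_logits):
    """Pick the frame whose aggregated score is highest.

    `frame_scores[i]` holds the scores of frame i under each standard,
    for example [visual_language_score, motion_language_score].
    Returns (best_frame_index, list_of_aggregated_scores).
    """
    aggregated = [aggregate_scores(s, attention_logits) for s in frame_scores]
    best = max(range(len(aggregated)), key=aggregated.__getitem__)
    return best, aggregated


if __name__ == "__main__":
    # Made-up scores for three frames under two standards.
    scores = [[0.2, 0.1], [0.9, 0.7], [0.5, 0.6]]
    best, agg = select_starting_frame(scores, attention_logits=[0.0, 0.0])
    print("starting frame:", best)
```

With equal logits the two standards are weighted equally, so the second frame (index 1) is selected because both of its made-up scores are the highest.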
| Original language | English |
|---|---|
| Pages (from-to) | 3354-3363 |
| Number of pages | 10 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 25 |
| Publication status | Published - 2023 |
Keywords
- Starting point
- language annotation
- matching strategy
- video object segmentation
Title
Starting Point Selection and Multiple-Standard Matching for Video Object Segmentation With Language Annotation