Starting Point Selection and Multiple-Standard Matching for Video Object Segmentation With Language Annotation

Mingjie Sun; Jimin Xiao; Eng Gee Lim; Yao Zhao

doi:10.1109/TMM.2022.3159403

Mingjie Sun, Jimin Xiao*, Eng Gee Lim, Yao Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

In this study, we investigate language-level video object segmentation, in which a first-frame language annotation describes the target object. Because a language label is typically compatible with every frame in a video, the proposed method can choose the most suitable starting frame, which mitigates initialization failure. In addition to visual features extracted from static video frames, a motion-language score based on optical flow is proposed to describe moving objects more accurately. The scores of multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely used video object segmentation datasets (DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject), and it reports a mean region similarity of 67.2% on DAVIS 2017 and 83.5% on DAVIS 2016. The code will be published.
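
The three ideas named in the abstract (picking a starting frame, combining several scores, and measuring region similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the plain-list mask format, the softmax weighting used as a stand-in for the attention-based aggregation, and the convention that two empty masks score 1.0 are all assumptions made for illustration. Only region similarity itself is standard: it is the intersection over union (Jaccard index) of the predicted and ground-truth masks, averaged over frames.

```python
import math
from typing import List, Sequence

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def select_start_frame(frame_scores: Sequence[float]) -> int:
    """Return the index of the frame with the highest matching score.

    Hypothetical helper: assumes one language-matching score per frame.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    return max(range(len(frame_scores)), key=lambda i: frame_scores[i])


def aggregate_scores(scores: Sequence[float], logits: Sequence[float]) -> float:
    """Combine per-standard scores with softmax weights.

    Assumption: softmax over hypothetical logits stands in for the
    attention-based mechanism, whose details the abstract does not give.
    """
    if not scores or len(scores) != len(logits):
        raise ValueError("scores and logits must be non-empty and equal in length")
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(scores, exps))


def region_similarity(pred: Mask, truth: Mask) -> float:
    """Jaccard index (intersection over union) of two binary masks."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_region_similarity(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-frame Jaccard index over a sequence of frames."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal in length")
    values = [region_similarity(p, t) for p, t in zip(preds, truths)]
    return sum(values) / len(values)
```

For example, a prediction that covers two pixels where the ground truth covers one of them has one pixel of intersection and two of union, so its region similarity is 0.5.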

Original language	English
Pages (from-to)	3354-3363
Number of pages	10
Journal	IEEE Transactions on Multimedia
Volume	25
DOIs	https://doi.org/10.1109/TMM.2022.3159403
Publication status	Published - 2023

Keywords

Starting point
language annotation
matching strategy
video object segmentation

Access to Document

10.1109/TMM.2022.3159403

Cite this

@article{1ac8f4ec547d46ceb5d06f25e7d8864c,
  title = "Starting Point Selection and Multiple-Standard Matching for Video Object Segmentation With Language Annotation",
  abstract = "In this study, we investigate language-level video object segmentation, where first-frame language annotation is used to describe the target object. Because a language label is typically compatible with all frames in a video, the proposed method can choose the most suitable starting frame to mitigate initialization failure. Apart from extracting the visual feature from a static video frame, a motion-language score based on optical flow is also proposed to describe moving objects more accurately. Scores of multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely-used video object segmentation datasets, including the DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject datasets, and a novel accuracy measured as mean region similarity is obtained on both the DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%) datasets. The code will be published.",
  keywords = "Starting point, language annotation, matching strategy, video object segmentation",
  author = "Mingjie Sun and Jimin Xiao and Lim, {Eng Gee} and Yao Zhao",
  note = "Publisher Copyright: {\textcopyright} 1999-2012 IEEE.",
  year = "2023",
  doi = "10.1109/TMM.2022.3159403",
  language = "English",
  volume = "25",
  pages = "3354--3363",
  journal = "IEEE Transactions on Multimedia",
  issn = "1520-9210",
}

TY - JOUR
T1 - Starting Point Selection and Multiple-Standard Matching for Video Object Segmentation With Language Annotation
AU - Sun, Mingjie
AU - Xiao, Jimin
AU - Lim, Eng Gee
AU - Zhao, Yao
PY - 2023
Y1 - 2023
N2 - In this study, we investigate language-level video object segmentation, where first-frame language annotation is used to describe the target object. Because a language label is typically compatible with all frames in a video, the proposed method can choose the most suitable starting frame to mitigate initialization failure. Apart from extracting the visual feature from a static video frame, a motion-language score based on optical flow is also proposed to describe moving objects more accurately. Scores of multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely-used video object segmentation datasets, including the DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject datasets, and a novel accuracy measured as mean region similarity is obtained on both the DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%) datasets. The code will be published.
AB - In this study, we investigate language-level video object segmentation, where first-frame language annotation is used to describe the target object. Because a language label is typically compatible with all frames in a video, the proposed method can choose the most suitable starting frame to mitigate initialization failure. Apart from extracting the visual feature from a static video frame, a motion-language score based on optical flow is also proposed to describe moving objects more accurately. Scores of multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely-used video object segmentation datasets, including the DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject datasets, and a novel accuracy measured as mean region similarity is obtained on both the DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%) datasets. The code will be published.
KW - Starting point
KW - language annotation
KW - matching strategy
KW - video object segmentation
UR - http://www.scopus.com/inward/record.url?scp=85126514942&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3159403
DO - 10.1109/TMM.2022.3159403
M3 - Article
AN - SCOPUS:85126514942
SN - 1520-9210
VL - 25
SP - 3354
EP - 3363
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -
