TY - GEN
T1 - Flexible Video Matting With Temporally Coherent Trimaps Generation
AU - Xue, Chenhui
AU - Xu, Shugong
AU - Mu, Shiyi
AU - Gao, Yilin
N1 - Publisher Copyright:
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and inflexible. Recent advances in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes containing multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network capable of generating alpha mattes for any specified instance in a video using simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) with a video object segmentation network to obtain semantic masks for the target instance. Additionally, we design a Mask-to-Trimap (MTT) module for FVM based on a recurrent architecture. This module uses the semantic masks and temporal information in the video to predict temporally consistent trimaps, which are then fed into the matting module to generate temporally consistent alpha mattes. Experimental results on a video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and exhibits superior temporal coherence compared with methods that directly apply image matting techniques to video matting tasks.
AB - Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and inflexible. Recent advances in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes containing multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network capable of generating alpha mattes for any specified instance in a video using simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) with a video object segmentation network to obtain semantic masks for the target instance. Additionally, we design a Mask-to-Trimap (MTT) module for FVM based on a recurrent architecture. This module uses the semantic masks and temporal information in the video to predict temporally consistent trimaps, which are then fed into the matting module to generate temporally consistent alpha mattes. Experimental results on a video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and exhibits superior temporal coherence compared with methods that directly apply image matting techniques to video matting tasks.
KW - Mask-to-Trimap
KW - Segment Anything
KW - Temporal Coherence
KW - Video Matting
UR - http://www.scopus.com/inward/record.url?scp=85219212457&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8702-9_12
DO - 10.1007/978-981-97-8702-9_12
M3 - Conference Proceeding
AN - SCOPUS:85219212457
SN - 9789819787012
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 172
EP - 185
BT - Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings
A2 - Wallraven, Christian
A2 - Liu, Cheng-Lin
A2 - Ross, Arun
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024
Y2 - 3 July 2024 through 6 July 2024
ER -