Flexible Video Matting With Temporally Coherent Trimaps Generation

Chenhui Xue; Shugong Xu; Shiyi Mu; Yilin Gao

doi:10.1007/978-981-97-8702-9_12

Corresponding author: Shugong Xu

Shanghai University

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and rigid. Recent advancements in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes featuring multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network capable of generating alpha mattes for any specified instance in a video using simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) and a video object segmentation network to obtain semantic masks for the target instance. Additionally, we have designed a Mask-to-Trimap (MTT) module for FVM, based on a recurrent architecture. This module utilizes semantic masks and temporal information in the video to predict temporally consistent trimaps, which are subsequently fed into the matting module to generate temporally consistent alpha mattes. Experimental results on the video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and exhibits superior temporal coherence compared with methods that directly apply image matting techniques to video matting tasks.
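
For orientation, a trimap labels each pixel as definite foreground, definite background, or unknown. The abstract does not describe the MTT module beyond its recurrent design, so the sketch below is not the authors' method. It shows the classical per-frame alternative, which erodes and dilates a binary mask and marks the band in between as unknown. The function names, the `radius` parameter, and the 0/128/255 label values are illustrative assumptions.

```python
from typing import List

Mask = List[List[int]]

# Illustrative label values (an assumption, not taken from the paper).
FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0


def _window_reduce(mask: Mask, radius: int, require_all: bool) -> Mask:
    """Binary erosion (require_all=True) or dilation (require_all=False).

    Uses a square window of side 2 * radius + 1. Windows are clipped at the
    image border, so out-of-bounds pixels are simply ignored.
    """
    height, width = len(mask), len(mask[0])
    combine = all if require_all else any
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = (
                mask[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            out[y][x] = int(combine(window))
    return out


def mask_to_trimap(mask: Mask, radius: int = 1) -> Mask:
    """Turn a binary mask (1 = object, 0 = background) into a trimap.

    Pixels that survive erosion are foreground, pixels outside the dilated
    mask are background, and the band in between is marked unknown.
    """
    eroded = _window_reduce(mask, radius, require_all=True)
    dilated = _window_reduce(mask, radius, require_all=False)
    return [
        [
            FOREGROUND if eroded[y][x] else UNKNOWN if dilated[y][x] else BACKGROUND
            for x in range(len(mask[0]))
        ]
        for y in range(len(mask))
    ]
```

Applied independently to each frame, a fixed-width band like this uses no temporal context, which is the gap a mask-to-trimap predictor with temporal information is meant to close.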

Original language	English
Title of host publication	Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings
Editors	Christian Wallraven, Cheng-Lin Liu, Arun Ross
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	172-185
Number of pages	14
ISBN (Print)	9789819787012
DOIs	https://doi.org/10.1007/978-981-97-8702-9_12
Publication status	Published - 2025
Externally published	Yes
Event	4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024 - Jeju Island, Korea, Republic of; Duration: 3 Jul 2024 → 6 Jul 2024

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14892 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024
Country/Territory	Korea, Republic of
City	Jeju Island
Period	3/07/24 → 6/07/24

Keywords

Mask-to-Trimap
Segment Anything
Temporal Coherence
Video Matting

Access to Document

10.1007/978-981-97-8702-9_12

Cite this

Xue, C., Xu, S., Mu, S., & Gao, Y. (2025). Flexible Video Matting With Temporally Coherent Trimaps Generation. In C. Wallraven, C.-L. Liu, & A. Ross (Eds.), Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings (pp. 172-185). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14892 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-8702-9_12

Xue, Chenhui ; Xu, Shugong ; Mu, Shiyi et al. / Flexible Video Matting With Temporally Coherent Trimaps Generation. Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings. editor / Christian Wallraven ; Cheng-Lin Liu ; Arun Ross. Springer Science and Business Media Deutschland GmbH, 2025. pp. 172-185 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{2fe2d74f05a34f60a6d4b2117d6ef5a5,
  title = "Flexible Video Matting With Temporally Coherent Trimaps Generation",
  abstract = "Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and rigid. Recent advancements in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes featuring multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network capable of generating alpha mattes for any specified instance in a video using simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) and a video object segmentation network to obtain semantic masks for the target instance. Additionally, we have designed a Mask-to-Trimap (MTT) module for FVM, based on a recurrent architecture. This module utilizes semantic masks and temporal information in the video to predict temporally consistent trimaps, which are subsequently fed into the matting module to generate temporally consistent alpha mattes. Experimental results on the video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and exhibits superior temporal coherence compared with methods that directly apply image matting techniques to video matting tasks.",
  keywords = "Mask-to-Trimap, Segment Anything, Temporal Coherence, Video Matting",
  author = "Chenhui Xue and Shugong Xu and Shiyi Mu and Yilin Gao",
  note = "Publisher Copyright: {\textcopyright} The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.; 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024 ; Conference date: 03-07-2024 Through 06-07-2024",
  year = "2025",
  doi = "10.1007/978-981-97-8702-9_12",
  language = "English",
  isbn = "9789819787012",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  volume = "14892",
  publisher = "Springer Science and Business Media Deutschland GmbH",
  pages = "172--185",
  editor = "Christian Wallraven and Cheng-Lin Liu and Arun Ross",
  booktitle = "Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings",
}

Xue, C, Xu, S, Mu, S & Gao, Y 2025, Flexible Video Matting With Temporally Coherent Trimaps Generation. in C Wallraven, C-L Liu & A Ross (eds), Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14892 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 172-185, 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024, Jeju Island, Korea, Republic of, 3/07/24. https://doi.org/10.1007/978-981-97-8702-9_12

Flexible Video Matting With Temporally Coherent Trimaps Generation. / Xue, Chenhui; Xu, Shugong; Mu, Shiyi et al.
Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings. ed. / Christian Wallraven; Cheng-Lin Liu; Arun Ross. Springer Science and Business Media Deutschland GmbH, 2025. p. 172-185 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14892 LNCS).

TY  - GEN
T1  - Flexible Video Matting With Temporally Coherent Trimaps Generation
AU  - Xue, Chenhui
AU  - Xu, Shugong
AU  - Mu, Shiyi
AU  - Gao, Yilin
N1  - Publisher Copyright: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY  - 2025
Y1  - 2025
AB  - Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and rigid. Recent advancements in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes featuring multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network capable of generating alpha mattes for any specified instance in a video using simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) and a video object segmentation network to obtain semantic masks for the target instance. Additionally, we have designed a Mask-to-Trimap (MTT) module for FVM, based on a recurrent architecture. This module utilizes semantic masks and temporal information in the video to predict temporally consistent trimaps, which are subsequently fed into the matting module to generate temporally consistent alpha mattes. Experimental results on the video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and exhibits superior temporal coherence compared with methods that directly apply image matting techniques to video matting tasks.
KW  - Mask-to-Trimap
KW  - Segment Anything
KW  - Temporal Coherence
KW  - Video Matting
UR  - http://www.scopus.com/inward/record.url?scp=85219212457&partnerID=8YFLogxK
U2  - 10.1007/978-981-97-8702-9_12
DO  - 10.1007/978-981-97-8702-9_12
M3  - Conference Proceeding
AN  - SCOPUS:85219212457
SN  - 9789819787012
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 172
EP  - 185
BT  - Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings
A2  - Wallraven, Christian
A2  - Liu, Cheng-Lin
A2  - Ross, Arun
PB  - Springer Science and Business Media Deutschland GmbH
T2  - 4th International Conference on Pattern Recognition and Artificial Intelligence, ICPRAI 2024
Y2  - 3 July 2024 through 6 July 2024
ER  - 

Xue C, Xu S, Mu S, Gao Y. Flexible Video Matting With Temporally Coherent Trimaps Generation. In Wallraven C, Liu CL, Ross A, editors, Pattern Recognition and Artificial Intelligence - 4th International Conference, ICPRAI 2024, Proceedings. Springer Science and Business Media Deutschland GmbH. 2025. p. 172-185. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-981-97-8702-9_12
