TY - JOUR
T1 - Global and Compact Video Context Embedding for Video Semantic Segmentation
AU - Sun, Lei
AU - Liu, Yun
AU - Sun, Guolei
AU - Wu, Min
AU - Xu, Zhijie
AU - Wang, Kaiwei
AU - Van Gool, Luc
N1 - Publisher Copyright:
© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
PY - 2024
Y1 - 2024
N2 - Intuitively, global video context could benefit video semantic segmentation (VSS) if it is designed to model global temporal and spatial dependencies simultaneously for a holistic understanding of the semantic scenes in a video clip. However, we find that existing VSS approaches focus only on modeling local video context. This paper attempts to bridge this gap by learning global video context for VSS. Apart from being global, the video context should also be compact, considering the large number of video feature tokens and the redundancy among nearby video frames. We then embed the learned global and compact video context into the features of the target video frame to improve their distinguishability. The proposed VSS method is dubbed Global and Compact Video Context Embedding (GCVCE). Owing to its compact nature, the number of global context tokens is very limited, making GCVCE flexible and efficient for VSS. Since it may be too challenging to directly abstract a large number of video feature tokens into a small number of global context tokens, we further design a Cascaded Convolutional Downsampling (CCD) module before GCVCE to help it work better. A 1.6% improvement in mIoU on the popular VSPW dataset over previous state-of-the-art methods demonstrates the effectiveness and efficiency of GCVCE and CCD for VSS. Code and models will be made publicly available.
AB - Intuitively, global video context could benefit video semantic segmentation (VSS) if it is designed to model global temporal and spatial dependencies simultaneously for a holistic understanding of the semantic scenes in a video clip. However, we find that existing VSS approaches focus only on modeling local video context. This paper attempts to bridge this gap by learning global video context for VSS. Apart from being global, the video context should also be compact, considering the large number of video feature tokens and the redundancy among nearby video frames. We then embed the learned global and compact video context into the features of the target video frame to improve their distinguishability. The proposed VSS method is dubbed Global and Compact Video Context Embedding (GCVCE). Owing to its compact nature, the number of global context tokens is very limited, making GCVCE flexible and efficient for VSS. Since it may be too challenging to directly abstract a large number of video feature tokens into a small number of global context tokens, we further design a Cascaded Convolutional Downsampling (CCD) module before GCVCE to help it work better. A 1.6% improvement in mIoU on the popular VSPW dataset over previous state-of-the-art methods demonstrates the effectiveness and efficiency of GCVCE and CCD for VSS. Code and models will be made publicly available.
KW - compact video context
KW - global video context
KW - video context embedding
KW - video semantic segmentation
UR - http://www.scopus.com/inward/record.url?scp=85195389393&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3409150
DO - 10.1109/ACCESS.2024.3409150
M3 - Article
AN - SCOPUS:85195389393
SN - 2169-3536
VL - 12
SP - 135589
EP - 135600
JO - IEEE Access
JF - IEEE Access
ER -