Space or time for video classification transformers

Xing Wu; Chenjie Tao; Jian Zhang; Qun Sun; Jianjia Wang; Weimin Li; Yue Liu; Yike Guo

doi:10.1007/s10489-023-04756-5

Xing Wu^*, Chenjie Tao, Jian Zhang, Qun Sun, Jianjia Wang, Weimin Li, Yue Liu, Yike Guo

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Spatial and temporal attention play an important role in video classification tasks. However, few studies examine the mechanism of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer scales well in training and captures long-range dependencies among sequences, and it has achieved great success in many fields, especially in video classification. The spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention module and adjust the way the attention modules interact. Single-stream models and two-stream models are then designed to study how information interacts between spatial attention and temporal attention through a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure is worth considering in some cases, where it can achieve better results than the popular single-stream structure.
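To make the idea of Divided-Space-Time Attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes a video is stored as `video[t][s]`, one feature vector per frame `t` and patch position `s`, and it uses single-head attention with identity query/key/value projections and no residual connections. All function names, and the temporal-then-spatial ordering, are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Single-head self-attention over a list of vectors.

    Assumption for brevity: queries, keys and values are the tokens
    themselves (identity projections, no learned weights).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


def temporal_attention(video):
    """Each patch position s attends across frames only."""
    num_frames, num_patches = len(video), len(video[0])
    out = [[None] * num_patches for _ in range(num_frames)]
    for s in range(num_patches):
        column = attend([video[t][s] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][s] = column[t]
    return out


def spatial_attention(video):
    """Each frame attends across its own patches only."""
    return [attend(frame) for frame in video]


def divided_space_time_block(video):
    """Hypothetical block: temporal attention followed by spatial attention."""
    return spatial_attention(temporal_attention(video))


# Toy input: 3 frames, 2 patches per frame, 2 features per patch.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
out = divided_space_time_block(video)
```

Because the two modules are separate functions, their order, number, or fusion can be changed independently, which is the kind of configurability the abstract attributes to the divided design.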

Original language	English
Pages (from-to)	23039-23048
Number of pages	10
Journal	Applied Intelligence
Volume	53
Issue number	20
DOIs	https://doi.org/10.1007/s10489-023-04756-5
Publication status	Published - Oct 2023
Externally published	Yes

Keywords

Spatial attention
Temporal attention
Transformer
Video classification

Cite this

@article{a05d4db4a57a45e38906243a2cd9c302,

title = "Space or time for video classification transformers",

abstract = "Spatial and temporal attention play an important role in video classification tasks. However, few studies examine the mechanism of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer scales well in training and captures long-range dependencies among sequences, and it has achieved great success in many fields, especially in video classification. The spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention module and adjust the way the attention modules interact. Single-stream models and two-stream models are then designed to study how information interacts between spatial attention and temporal attention through a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure is worth considering in some cases, where it can achieve better results than the popular single-stream structure.",

keywords = "Spatial attention, Temporal attention, Transformer, Video classification",

author = "Xing Wu and Chenjie Tao and Jian Zhang and Qun Sun and Jianjia Wang and Weimin Li and Yue Liu and Yike Guo",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

month = oct,

doi = "10.1007/s10489-023-04756-5",

language = "English",

volume = "53",

pages = "23039--23048",

journal = "Applied Intelligence",

issn = "0924-669X",

number = "20",

}

TY - JOUR

T1 - Space or time for video classification transformers

AU - Wu, Xing

AU - Tao, Chenjie

AU - Zhang, Jian

AU - Sun, Qun

AU - Wang, Jianjia

AU - Li, Weimin

AU - Liu, Yue

AU - Guo, Yike

PY - 2023/10

Y1 - 2023/10

N2 - Spatial and temporal attention play an important role in video classification tasks. However, few studies examine the mechanism of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer scales well in training and captures long-range dependencies among sequences, and it has achieved great success in many fields, especially in video classification. The spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention module and adjust the way the attention modules interact. Single-stream models and two-stream models are then designed to study how information interacts between spatial attention and temporal attention through a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure is worth considering in some cases, where it can achieve better results than the popular single-stream structure.

KW - Spatial attention

KW - Temporal attention

KW - Transformer

KW - Video classification

UR - http://www.scopus.com/inward/record.url?scp=85164011725&partnerID=8YFLogxK

U2 - 10.1007/s10489-023-04756-5

DO - 10.1007/s10489-023-04756-5

M3 - Article

AN - SCOPUS:85164011725

SN - 0924-669X

VL - 53

SP - 23039

EP - 23048

JO - Applied Intelligence

JF - Applied Intelligence

IS - 20

ER -
