TY - JOUR
T1 - Space or time for video classification transformers
AU - Wu, Xing
AU - Tao, Chenjie
AU - Zhang, Jian
AU - Sun, Qun
AU - Wang, Jianjia
AU - Li, Weimin
AU - Liu, Yue
AU - Guo, Yike
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/10
Y1 - 2023/10
N2 - Spatial and temporal attention play an important role in video classification tasks. However, there are few studies on the mechanism of spatial and temporal attention behind classification problems. The Transformer offers excellent training scalability and captures long-range dependencies among sequences thanks to its self-attention mechanism, and it has achieved great success in many fields, especially in video classification. The spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention modules and adjust the way they interact. Single-stream and two-stream models are then designed to study how spatial attention and temporal attention exchange information, through a series of carefully designed experiments. Experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream models deserve consideration in some cases, as they can achieve better results than the popular single-stream models.
AB - Spatial and temporal attention play an important role in video classification tasks. However, there are few studies on the mechanism of spatial and temporal attention behind classification problems. The Transformer offers excellent training scalability and captures long-range dependencies among sequences thanks to its self-attention mechanism, and it has achieved great success in many fields, especially in video classification. The spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention modules and adjust the way they interact. Single-stream and two-stream models are then designed to study how spatial attention and temporal attention exchange information, through a series of carefully designed experiments. Experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream models deserve consideration in some cases, as they can achieve better results than the popular single-stream models.
KW - Spatial attention
KW - Temporal attention
KW - Transformer
KW - Video classification
UR - http://www.scopus.com/inward/record.url?scp=85164011725&partnerID=8YFLogxK
U2 - 10.1007/s10489-023-04756-5
DO - 10.1007/s10489-023-04756-5
M3 - Article
AN - SCOPUS:85164011725
SN - 0924-669X
VL - 53
SP - 23039
EP - 23048
JO - Applied Intelligence
JF - Applied Intelligence
IS - 20
ER -