Space or time for video classification transformers

Xing Wu*, Chenjie Tao, Jian Zhang, Qun Sun, Jianjia Wang, Weimin Li, Yue Liu, Yike Guo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Spatial and temporal attention play an important role in video classification tasks. However, few studies have examined the mechanisms of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer scales well in training and captures long-range dependencies among sequences, and it has achieved great success in many fields, especially video classification. In this work, spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention modules and adjust how they interact. Single-stream and two-stream models are then designed to study how information interacts between spatial attention and temporal attention, through a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure is worth considering in some cases, where it can achieve better results than the popular single-stream structure.
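The separation into a temporal module and a spatial module can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with identity query/key/value projections (no learned weights), a video stored as nested lists `video[t][s]` of D-dimensional tokens, and residual connections. The function names and the `n_temporal` / `n_spatial` parameters are hypothetical, introduced only to show that the two modules can be configured independently.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention over a list of D-dim tokens.
    # Assumption: identity Q/K/V projections, single head.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

def temporal_attention(video):
    # For each patch position s, attend across frames t only.
    T, S = len(video), len(video[0])
    out = [[None] * S for _ in range(T)]
    for s in range(S):
        attended = self_attention([video[t][s] for t in range(T)])
        for t in range(T):
            out[t][s] = attended[t]
    return out

def spatial_attention(video):
    # For each frame t, attend across its patches only.
    return [self_attention(frame) for frame in video]

def add(a, b):
    # Element-wise sum of two (T, S, D) nested lists (residual connection).
    return [[[x + y for x, y in zip(ta, tb)] for ta, tb in zip(fa, fb)]
            for fa, fb in zip(a, b)]

def divided_space_time_block(video, n_temporal=1, n_spatial=1):
    # Hypothetical knobs: how many temporal and spatial attention
    # modules to apply; equal counts correspond to a balanced setup.
    x = video
    for _ in range(n_temporal):
        x = add(x, temporal_attention(x))
    for _ in range(n_spatial):
        x = add(x, spatial_attention(x))
    return x
```

Calling `divided_space_time_block(video, n_temporal=1, n_spatial=2)` would apply more spatial than temporal attention, one example of an unbalanced configuration; which ratio works best is an empirical question that this sketch does not answer.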

Original language: English
Pages (from-to): 23039-23048
Number of pages: 10
Journal: Applied Intelligence
Issue number: 20
Publication status: Published - Oct 2023
Externally published: Yes


  • Spatial attention
  • Temporal attention
  • Transformer
  • Video classification

