Efficient 3D human pose estimation via spatio-temporal graph transformer with token pruning

  • Zuhe Li
  • Hongyang Chen
  • Fengqin Wang
  • Gang Xu
  • Qidong Liu
  • Yushan Pan*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

3D human pose estimation aims to determine joint positions from video data to construct a 3D representation of human pose. Recent Transformer-based approaches have demonstrated remarkable results in 3D single-view human pose estimation. However, these methods struggle to capture local joint relationships and are computationally intensive, limiting their deployment on edge devices. To address these issues, we propose the STGFormer, a model that comprehensively and efficiently learns spatial correlations of human joints as well as temporal correlations between poses. The STGFormer integrates spatio-temporal self-attention mechanisms with graph convolutional networks (GCNs) to capture both global and local information. Additionally, a Token Pruning Module is introduced to select representative pose tokens, reducing redundant information and improving computational efficiency. Extensive experiments on two challenging benchmarks, Human3.6M and MPI-INF-3DHP, demonstrate that our method achieves highly competitive performance in terms of the complexity-accuracy trade-off compared to other Transformer-based methods. Here, we show that our approach achieves state-of-the-art or comparable results while significantly reducing computational costs, underscoring its potential for real-world applications.
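To make the token pruning idea concrete, the sketch below shows one plausible way a pruning step over pose tokens could be implemented. It is a minimal illustration only, not the authors' module: the PyTorch framing, the learned linear scoring head, the `keep_ratio` parameter, and the tensor shapes are all assumptions introduced here for clarity.

```python
# Illustrative sketch (assumed PyTorch; not the paper's implementation).
# Keeps the top-k highest-scoring pose tokens and drops the rest,
# mimicking the idea of retaining only representative tokens.
import torch
import torch.nn as nn


class TokenPruning(nn.Module):
    """Keep the top-k pose tokens ranked by a learned importance score (assumption)."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token importance score (hypothetical choice)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_frames, dim) -- one pose token per frame
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)     # (batch, num_frames)
        idx = scores.topk(k, dim=1).indices         # indices of the kept tokens
        idx = idx.sort(dim=1).values                # preserve temporal order
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))


# Example usage: halve a 243-frame token sequence before heavier temporal blocks.
x = torch.randn(2, 243, 256)
pruned = TokenPruning(dim=256, keep_ratio=0.5)(x)
print(pruned.shape)  # torch.Size([2, 121, 256])
```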

Original language: English
Article number: 398
Journal: Multimedia Systems
Volume: 31
Issue number: 5
DOIs
Publication status: Published - Oct 2025

Keywords

  • 3D human pose estimation
  • Efficient transformer
  • Spatio-temporal correlation learning
  • Token pruning
