Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition

Haotian Xu, Xiaobo Jin*, Qiufeng Wang, Amir Hussain, Kaizhu Huang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Many current action recognition methods rely mainly on information from the spatial stream. We propose a new perspective, inspired by the human visual system, that combines the spatial and temporal streams and measures their attention consistency. Specifically, we develop a branch-independent convolutional neural network (CNN)-based algorithm with a novel attention-consistency loss, which encourages the temporal stream to concentrate on the same discriminative regions as the spatial stream over the same period. The consistency loss is further combined with the cross-entropy loss to enhance visual attention consistency. We evaluate the proposed method for action recognition on two benchmark datasets: Kinetics400 and UCF101. Despite its apparent simplicity, our framework with attention consistency outperforms most two-stream networks, achieving 75.7% top-1 accuracy on Kinetics400 and 95.7% on UCF101, while reducing computational cost by 7.1% compared with our baseline. In particular, our method attains remarkable improvements on complex action classes, suggesting that the proposed network can serve as a potential benchmark for handling complicated scenarios in Industry 4.0 applications.
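The abstract describes combining an attention-consistency term with the cross-entropy loss so that the two streams attend to the same regions. Below is a minimal sketch of one plausible reading of that idea; the abstract does not give the exact formulation, so the use of MSE as the consistency metric, the normalization, the weighting factor `lam`, and all function and tensor names are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of an attention-consistency loss combined with cross-entropy.
# The choice of MSE, the normalization, and the weight `lam` are assumptions;
# the paper's actual formulation may differ.
import torch
import torch.nn.functional as F


def attention_consistency_loss(spatial_attn: torch.Tensor,
                               temporal_attn: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the attention maps of the two streams.

    spatial_attn, temporal_attn: (B, H, W) attention maps over the same period,
    normalized per sample so the comparison is scale-free.
    """
    s = spatial_attn.flatten(1)
    t = temporal_attn.flatten(1)
    s = s / (s.sum(dim=1, keepdim=True) + 1e-8)
    t = t / (t.sum(dim=1, keepdim=True) + 1e-8)
    return F.mse_loss(s, t)


def total_loss(logits_spatial: torch.Tensor,
               logits_temporal: torch.Tensor,
               labels: torch.Tensor,
               spatial_attn: torch.Tensor,
               temporal_attn: torch.Tensor,
               lam: float = 0.5) -> torch.Tensor:
    """Cross-entropy on both streams plus the weighted consistency term."""
    ce = F.cross_entropy(logits_spatial, labels) + F.cross_entropy(logits_temporal, labels)
    return ce + lam * attention_consistency_loss(spatial_attn, temporal_attn)
```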

Original language: English
Article number: 119
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 18
Issue number: 2s
DOIs
Publication status: Published - 6 Oct 2022

Keywords

  • Action recognition
  • attention consistency
  • multi-level attention
  • two-stream structure

