TY - GEN
T1 - Action recognition in videos with temporal segments fusions
AU - Fang, Yuanye
AU - Zhang, Rui
AU - Wang, Qiu Feng
AU - Huang, Kaizhu
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - Deep Convolutional Neural Networks (CNNs) have achieved great success in object recognition. However, they struggle to capture long-range temporal information, which plays an important role in action recognition in videos. To overcome this issue, a two-stream architecture consisting of spatial and temporal segment-based CNNs has recently been widely used; however, the relationship among the segments has not been sufficiently investigated. In this paper, we propose to combine multiple segments through a fully connected layer in a deep CNN model for the whole action video. Moreover, four streams (i.e., RGB, RGB differences, optical flow, and warped optical flow) are carefully integrated with a linear combination, whose weights are optimized on the validation datasets. We evaluate the recognition accuracy of the proposed method on two benchmark datasets, UCF101 and HMDB51. Extensive experimental results demonstrate the effectiveness of our proposed method. Specifically, the proposed method clearly improves the accuracy of action recognition in videos (e.g., compared with the baseline, the accuracy is improved from 94.20% to 97.30% on UCF101 and from 69.40% to 77.99% on HMDB51). Furthermore, the proposed method achieves accuracy competitive with the state-of-the-art 3D convolutional method, but with far fewer parameters.
AB - Deep Convolutional Neural Networks (CNNs) have achieved great success in object recognition. However, they struggle to capture long-range temporal information, which plays an important role in action recognition in videos. To overcome this issue, a two-stream architecture consisting of spatial and temporal segment-based CNNs has recently been widely used; however, the relationship among the segments has not been sufficiently investigated. In this paper, we propose to combine multiple segments through a fully connected layer in a deep CNN model for the whole action video. Moreover, four streams (i.e., RGB, RGB differences, optical flow, and warped optical flow) are carefully integrated with a linear combination, whose weights are optimized on the validation datasets. We evaluate the recognition accuracy of the proposed method on two benchmark datasets, UCF101 and HMDB51. Extensive experimental results demonstrate the effectiveness of our proposed method. Specifically, the proposed method clearly improves the accuracy of action recognition in videos (e.g., compared with the baseline, the accuracy is improved from 94.20% to 97.30% on UCF101 and from 69.40% to 77.99% on HMDB51). Furthermore, the proposed method achieves accuracy competitive with the state-of-the-art 3D convolutional method, but with far fewer parameters.
KW - Action recognition
KW - Convolutional Neural Networks
KW - Segments fusion
KW - Temporal segments models
UR - http://www.scopus.com/inward/record.url?scp=85080907811&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-39431-8_23
DO - 10.1007/978-3-030-39431-8_23
M3 - Conference Proceeding
AN - SCOPUS:85080907811
SN - 9783030394301
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 244
EP - 253
BT - Advances in Brain Inspired Cognitive Systems - 10th International Conference, BICS 2019, Proceedings
A2 - Ren, Jinchang
A2 - Hussain, Amir
A2 - Zhao, Huimin
A2 - Cai, Jun
A2 - Chen, Rongjun
A2 - Xiao, Yinyin
A2 - Huang, Kaizhu
A2 - Zheng, Jiangbin
PB - Springer
T2 - 10th International Conference on Brain Inspired Cognitive Systems, BICS 2019
Y2 - 13 July 2019 through 14 July 2019
ER -