Action Recognition from Still Images Based on Deep VLAD Spatial Pyramids

Shiyang Yan*, Jeremy S. Smith, Bailing Zhang

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

30 Citations (Scopus)

Abstract

The recognition of human actions in images is a challenging task in computer vision. In many applications, actions can be exploited as mid-level semantic features for high level tasks. Actions often appear in fine-grained categorization, where the differences between two categories are small. Recently, deep learning approaches have achieved great success in many vision tasks, e.g., image classification, object detection, and attribute and action recognition. Also, the Bag-of-Visual-Words (BoVW) and its extensions, e.g., Vector of Locally Aggregated Descriptors (VLAD) encoding, have proved to be powerful in identifying global contextual information. In this paper, we propose a new action recognition scheme by combining the powerful feature representational capabilities of Convolutional Neural Networks (CNNs) with the VLAD encoding scheme. Specifically, we encode the CNN features of image patches generated by a region proposal algorithm with VLAD and subsequently represent an image by the compact code, which not only captures the more fine-grained properties of the images but also contains global contextual information. To identify the spatial information, we exploit the spatial pyramid representation and encode CNN features inside each pyramid. Experiments have verified that the proposed schemes are not only suitable for action recognition but also applicable to more general recognition tasks such as attribute classification. The proposed scheme is validated with four benchmark datasets with competitive mAP results of 88.5% on the Stanford 40 Action dataset, 81.3% on the People Playing Musical Instrument dataset, 90.4% on the Berkeley Attributes of People dataset and 74.2% on the 27 Human Attributes dataset.

Original languageEnglish
Pages (from-to)118-129
Number of pages12
JournalSignal Processing: Image Communication
Volume54
DOIs
Publication statusPublished - 1 May 2017

Keywords

  • Actions
  • Convolutional Neural Networks
  • Spatial pyramids
  • VLAD encoding

Fingerprint

Dive into the research topics of 'Action Recognition from Still Images Based on Deep VLAD Spatial Pyramids'. Together they form a unique fingerprint.

Cite this