Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding

Shiyang Yan; Jeremy S. Smith; Bailing Zhang

doi:10.1007/978-3-319-54526-4_37

Shiyang Yan^*, Jeremy S. Smith, Bailing Zhang

^*Corresponding author for this work

School of Advanced Technology

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

Determination of human attributes and recognition of actions in still images are two related and challenging tasks in computer vision, which often appear in fine-grained domains where the distinctions between the different categories are very small. Deep Convolutional Neural Network (CNN) models have demonstrated their remarkable representational learning capability through various examples. However, the successes are very limited for attributes and action recognition, as the potential of CNNs to acquire both the global and local information of an image remains largely unexplored. This paper proposes to tackle the problem with an encoding of a spatial pyramid Vector of Locally Aggregated Descriptors (VLAD) on top of CNN features. With region proposals generated by Edgeboxes, a compact and efficient representation of an image is thus produced for subsequent prediction of attributes and classification of actions. The proposed scheme is validated with competitive results on two benchmark datasets: 90.4% mean Average Precision (mAP) on the Berkeley Attributes of People dataset and 88.5% mAP on the Stanford 40 action dataset.
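
As a rough illustration of the encoding the abstract names, the sketch below computes a VLAD vector per spatial cell and concatenates the cells across pyramid levels. It is not the authors' implementation. The function names, the two-level pyramid (1x1 and 2x2), the hard nearest-center assignment, and the toy two-dimensional descriptors are all assumptions made for the example. In practice the descriptors would be CNN features of region proposals, and the codebook would be learned (for example with k-means) rather than given.

```python
import math


def vlad(descriptors, centers):
    """Plain VLAD: sum residuals to the nearest center, then L2-normalize."""
    k, d = len(centers), len(centers[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest codebook center.
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])),
        )
        for j in range(d):
            acc[nearest][j] += x[j] - centers[nearest][j]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    # An empty cell yields an all-zero vector, which is left unnormalized.
    return [v / norm for v in flat] if norm > 0 else flat


def spatial_pyramid_vlad(descriptors, positions, centers, levels=(1, 2)):
    """Concatenate per-cell VLAD vectors over every cell of every level.

    positions holds hypothetical (x, y) region centers scaled to [0, 1].
    """
    out = []
    for n in levels:
        for row in range(n):
            for col in range(n):
                cell = [
                    desc
                    for desc, (x, y) in zip(descriptors, positions)
                    if min(int(x * n), n - 1) == col
                    and min(int(y * n), n - 1) == row
                ]
                out.extend(vlad(cell, centers))
    return out


# Toy data: 2 codebook centers and 3 region descriptors, all 2-dimensional.
centers = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
positions = [(0.2, 0.2), (0.7, 0.3), (0.6, 0.8)]
code = spatial_pyramid_vlad(descriptors, positions, centers)
print(len(code))
```

The output length is (number of cells) x (number of centers) x (descriptor dimension), so the representation grows with the pyramid depth and codebook size rather than with the number of region proposals.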

Original language	English
Title of host publication	Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers
Editors	Chu-Song Chen, Kai-Kuang Ma, Jiwen Lu
Publisher	Springer Verlag
Pages	500-514
Number of pages	15
ISBN (Print)	9783319545257
DOIs	https://doi.org/10.1007/978-3-319-54526-4_37
Publication status	Published - 2017
Event	13th Asian Conference on Computer Vision, ACCV 2016 - Taipei, Taiwan, Province of China
Duration	20 Nov 2016 → 24 Nov 2016

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10118 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	13th Asian Conference on Computer Vision, ACCV 2016
Country/Territory	Taiwan, Province of China
City	Taipei
Period	20/11/16 → 24/11/16

Access to Document

10.1007/978-3-319-54526-4_37

Cite this

Yan, S., Smith, J. S., & Zhang, B. (2017). Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding. In C.-S. Chen, K.-K. Ma, & J. Lu (Eds.), Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers (pp. 500-514). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10118 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-54526-4_37

Yan, Shiyang ; Smith, Jeremy S. ; Zhang, Bailing. / Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding. Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers. editor / Chu-Song Chen ; Kai-Kuang Ma ; Jiwen Lu. Springer Verlag, 2017. pp. 500-514 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{0ac36c7c2a31404d8bb29974d5f0e3a4,

title = "Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding",

abstract = "Determination of human attributes and recognition of actions in still images are two related and challenging tasks in computer vision, which often appear in fine-grained domains where the distinctions between the different categories are very small. Deep Convolutional Neural Network (CNN) models have demonstrated their remarkable representational learning capability through various examples. However, the successes are very limited for attributes and action recognition as the potential of CNNs to acquire both of the global and local information of an image remains largely unexplored. This paper proposes to tackle the problem with an encoding of a spatial pyramid Vector of Locally Aggregated Descriptors (VLAD) on top of CNN features. With region proposals generated by Edgeboxes, a compact and efficient representation of an image is thus produced for subsequent prediction of attributes and classification of actions. The proposed scheme is validated with competitive results on two benchmark datasets: 90.4% mean Average Precision (mAP) on the Berkeley Attributes of People dataset and 88.5% mAP on the Stanford 40 action dataset.",

author = "Shiyang Yan and Smith, {Jeremy S.} and Bailing Zhang",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 13th Asian Conference on Computer Vision, ACCV 2016 ; Conference date: 20-11-2016 Through 24-11-2016",

year = "2017",

doi = "10.1007/978-3-319-54526-4_37",

language = "English",

isbn = "9783319545257",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "500--514",

editor = "Chu-Song Chen and Kai-Kuang Ma and Jiwen Lu",

booktitle = "Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers",

}

Yan, S, Smith, JS & Zhang, B 2017, Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding. in C-S Chen, K-K Ma & J Lu (eds), Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10118 LNCS, Springer Verlag, pp. 500-514, 13th Asian Conference on Computer Vision, ACCV 2016, Taipei, Taiwan, Province of China, 20/11/16. https://doi.org/10.1007/978-3-319-54526-4_37

Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding. / Yan, Shiyang; Smith, Jeremy S.; Zhang, Bailing.
Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers. ed. / Chu-Song Chen; Kai-Kuang Ma; Jiwen Lu. Springer Verlag, 2017. p. 500-514 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10118 LNCS).

TY - GEN

T1 - Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding

AU - Yan, Shiyang

AU - Smith, Jeremy S.

AU - Zhang, Bailing

N1 - Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

N2 - Determination of human attributes and recognition of actions in still images are two related and challenging tasks in computer vision, which often appear in fine-grained domains where the distinctions between the different categories are very small. Deep Convolutional Neural Network (CNN) models have demonstrated their remarkable representational learning capability through various examples. However, the successes are very limited for attributes and action recognition as the potential of CNNs to acquire both of the global and local information of an image remains largely unexplored. This paper proposes to tackle the problem with an encoding of a spatial pyramid Vector of Locally Aggregated Descriptors (VLAD) on top of CNN features. With region proposals generated by Edgeboxes, a compact and efficient representation of an image is thus produced for subsequent prediction of attributes and classification of actions. The proposed scheme is validated with competitive results on two benchmark datasets: 90.4% mean Average Precision (mAP) on the Berkeley Attributes of People dataset and 88.5% mAP on the Stanford 40 action dataset.

UR - http://www.scopus.com/inward/record.url?scp=85016121116&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-54526-4_37

DO - 10.1007/978-3-319-54526-4_37

M3 - Conference Proceeding

AN - SCOPUS:85016121116

SN - 9783319545257

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 500

EP - 514

BT - Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers

A2 - Chen, Chu-Song

A2 - Ma, Kai-Kuang

A2 - Lu, Jiwen

PB - Springer Verlag

T2 - 13th Asian Conference on Computer Vision, ACCV 2016

Y2 - 20 November 2016 through 24 November 2016

ER -

Yan S, Smith JS, Zhang B. Attributes and action recognition based on convolutional neural networks and spatial pyramid VLAD encoding. In Chen CS, Ma KK, Lu J, editors, Computer Vision - ACCV 2016 Workshops, ACCV 2016 International Workshops, Revised Selected Papers. Springer Verlag. 2017. p. 500-514. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-54526-4_37
