CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization

Yuxi Li; Weiyao Lin; John See; Ning Xu; Shugong Xu; Ke Yan; Cong Yang

doi:10.1007/978-3-030-58517-4_30

Yuxi Li, Weiyao Lin*, John See, Ning Xu, Shugong Xu, Ke Yan, Cong Yang

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

15 Citations (Scopus)

Abstract

Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. The CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams, and then refines the tubes’ location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves competitive results on action detection benchmarks of UCF101-24, UCFSports and JHMDB-21 with inference speed that is 3.3× faster than the nearest competitor.

Original language	English
Title of host publication	Computer Vision – ECCV 2020 - 16th European Conference, Proceedings
Editors	Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	510-527
Number of pages	18
ISBN (Print)	9783030585167
DOIs	https://doi.org/10.1007/978-3-030-58517-4_30
Publication status	Published - 2020
Externally published	Yes
Event	16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration	23 Aug 2020 → 28 Aug 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12361 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th European Conference on Computer Vision, ECCV 2020
Country/Territory	United Kingdom
City	Glasgow
Period	23/08/20 → 28/08/20

Keywords

Coarse-to-fine paradigm
Parameterized modeling
Spatiotemporal action detection

Access to Document

10.1007/978-3-030-58517-4_30

Cite this

Li, Y., Lin, W., See, J., Xu, N., Xu, S., Yan, K., & Yang, C. (2020). CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 - 16th European Conference, Proceedings (pp. 510-527). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12361 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58517-4_30

Li, Yuxi; Lin, Weiyao; See, John et al. / CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. editor / Andrea Vedaldi; Horst Bischof; Thomas Brox; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. pp. 510-527 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{1a9a499eb5a844da9e3e97e2112a7576,
title = "CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization",
abstract = "Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. The CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams, and then refines the tubes{\textquoteright} location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves competitive results on action detection benchmarks of UCF101-24, UCFSports and JHMDB-21 with inference speed that is 3.3$\times$ faster than the nearest competitor.",
keywords = "Coarse-to-fine paradigm, Parameterized modeling, Spatiotemporal action detection",
author = "Yuxi Li and Weiyao Lin and John See and Ning Xu and Shugong Xu and Ke Yan and Cong Yang",
note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 16th European Conference on Computer Vision, ECCV 2020 ; Conference date: 23-08-2020 Through 28-08-2020",
year = "2020",
doi = "10.1007/978-3-030-58517-4_30",
language = "English",
isbn = "9783030585167",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "510--527",
editor = "Andrea Vedaldi and Horst Bischof and Thomas Brox and Jan-Michael Frahm",
booktitle = "Computer Vision – ECCV 2020 - 16th European Conference, Proceedings",
}

Li, Y, Lin, W, See, J, Xu, N, Xu, S, Yan, K & Yang, C 2020, CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. in A Vedaldi, H Bischof, T Brox & J-M Frahm (eds), Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12361 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 510-527, 16th European Conference on Computer Vision, ECCV 2020, Glasgow, United Kingdom, 23/08/20. https://doi.org/10.1007/978-3-030-58517-4_30

CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. / Li, Yuxi; Lin, Weiyao; See, John et al.
Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. ed. / Andrea Vedaldi; Horst Bischof; Thomas Brox; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. p. 510-527 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12361 LNCS).

TY  - GEN
T1  - CFAD
T2  - 16th European Conference on Computer Vision, ECCV 2020
AU  - Li, Yuxi
AU  - Lin, Weiyao
AU  - See, John
AU  - Xu, Ning
AU  - Xu, Shugong
AU  - Yan, Ke
AU  - Yang, Cong
PY  - 2020
Y1  - 2020
N2  - Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. The CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams, and then refines the tubes’ location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves competitive results on action detection benchmarks of UCF101-24, UCFSports and JHMDB-21 with inference speed that is 3.3× faster than the nearest competitor.
AB  - Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. The CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams, and then refines the tubes’ location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves competitive results on action detection benchmarks of UCF101-24, UCFSports and JHMDB-21 with inference speed that is 3.3× faster than the nearest competitor.
KW  - Coarse-to-fine paradigm
KW  - Parameterized modeling
KW  - Spatiotemporal action detection
UR  - http://www.scopus.com/inward/record.url?scp=85092911883&partnerID=8YFLogxK
U2  - 10.1007/978-3-030-58517-4_30
DO  - 10.1007/978-3-030-58517-4_30
M3  - Conference Proceeding
AN  - SCOPUS:85092911883
SN  - 9783030585167
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 510
EP  - 527
BT  - Computer Vision – ECCV 2020 - 16th European Conference, Proceedings
A2  - Vedaldi, Andrea
A2  - Bischof, Horst
A2  - Brox, Thomas
A2  - Frahm, Jan-Michael
PB  - Springer Science and Business Media Deutschland GmbH
Y2  - 23 August 2020 through 28 August 2020
ER  - 

Li Y, Lin W, See J, Xu N, Xu S, Yan K et al. CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. In Vedaldi A, Bischof H, Brox T, Frahm JM, editors, Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 510-527. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-58517-4_30
