Audio Captioning Based on Transformer and Pre-Trained CNN

Kun Chen; Yusong Wu; Ziyue Wang; Xuan Zhang; Fudong Nian; Shengchen Li; Xi Shao

Department of Intelligent Science

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Original language	English
Title of host publication	Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020)
Pages	21-25
Publication status	Published - 11 Feb 2020

Access to Document

https://dcase.community/documents/workshop2020/proceedings/DCASE2020Workshop_Chen_16.pdf

Cite this

@inproceedings{500e7ad242334617979da5f55275e7ef,
  title = "Audio Captioning Based on Transformer and Pre-Trained CNN",
  abstract = "Automated audio captioning is the task that generates text description of a piece of audio. This paper proposes a solution of automated audio captioning based on a combination of pre-trained CNN layers and a sequence-to-sequence architecture based on Transformer. The pre-trained CNN layers are adopted from a CNN based neural network for acoustic event tagging, which makes the latent variable resulted more efficient on generating captions. Transformer decoder is used in the sequence-to-sequence architecture as a consequence of comparing the performance of the more classical LSTM layers. The proposed system achieves a SPIDEr score of 0.227 for the DCASE challenge 2020 Task 6 with data augmentation and label smoothing applied.",
  author = "Kun Chen and Yusong Wu and Ziyue Wang and Xuan Zhang and Fudong Nian and Shengchen Li and Xi Shao",
  year = "2020",
  month = feb,
  day = "11",
  language = "English",
  pages = "21--25",
  booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020)",
}

TY  - GEN
T1  - Audio Captioning Based on Transformer and Pre-Trained CNN
AU  - Chen, Kun
AU  - Wu, Yusong
AU  - Wang, Ziyue
AU  - Zhang, Xuan
AU  - Nian, Fudong
AU  - Li, Shengchen
AU  - Shao, Xi
PY  - 2020/2/11
Y1  - 2020/2/11
N2  - Automated audio captioning is the task that generates text description of a piece of audio. This paper proposes a solution of automated audio captioning based on a combination of pre-trained CNN layers and a sequence-to-sequence architecture based on Transformer. The pre-trained CNN layers are adopted from a CNN based neural network for acoustic event tagging, which makes the latent variable resulted more efficient on generating captions. Transformer decoder is used in the sequence-to-sequence architecture as a consequence of comparing the performance of the more classical LSTM layers. The proposed system achieves a SPIDEr score of 0.227 for the DCASE challenge 2020 Task 6 with data augmentation and label smoothing applied.
AB  - Automated audio captioning is the task that generates text description of a piece of audio. This paper proposes a solution of automated audio captioning based on a combination of pre-trained CNN layers and a sequence-to-sequence architecture based on Transformer. The pre-trained CNN layers are adopted from a CNN based neural network for acoustic event tagging, which makes the latent variable resulted more efficient on generating captions. Transformer decoder is used in the sequence-to-sequence architecture as a consequence of comparing the performance of the more classical LSTM layers. The proposed system achieves a SPIDEr score of 0.227 for the DCASE challenge 2020 Task 6 with data augmentation and label smoothing applied.
M3  - Conference Proceeding
SP  - 21
EP  - 25
BT  - Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020)
ER  -