TY - GEN
T1 - Acoustic Scene Generation with Conditional SampleRNN
AU - Kong, Qiuqiang
AU - Xu, Yong
AU - Iqbal, Turab
AU - Cao, Yin
AU - Wang, Wenwu
AU - Plumbley, Mark D.
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Acoustic scene generation (ASG) is the task of generating waveforms for acoustic scenes. ASG can be used to generate audio scenes for movies and computer games. Recently, neural networks such as SampleRNN have been used for speech and music generation. However, ASG is more challenging due to the wide variety of acoustic scenes, and evaluating a generative model is also difficult. In this paper, we propose a conditional SampleRNN model to generate acoustic scenes conditioned on input class labels. We also propose objective criteria, based on classification accuracy, to evaluate the quality and diversity of the generated samples. Experiments on the DCASE 2016 Task 1 acoustic scene data show that the generated audio samples achieve a classification accuracy of 65.5%, compared with 6.7% for samples generated by a random model and 83.1% for real recordings. A classifier trained only on the generated samples achieves an accuracy of 51.3%, as opposed to 6.7% when trained on samples generated by a random model.
KW - SampleRNN
KW - acoustic scene generation
KW - generative model
KW - recurrent neural network
UR - http://www.scopus.com/inward/record.url?scp=85069497442&partnerID=8YFLogxK
DO - 10.1109/ICASSP.2019.8683727
M3 - Conference Proceeding
AN - SCOPUS:85069497442
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 925
EP - 929
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -