TY - GEN
T1 - Speaker Recognition-Assisted Robust Audio Deepfake Detection
AU - Pan, Jiahui
AU - Nie, Shuai
AU - Zhang, Hui
AU - He, Shulin
AU - Zhang, Kanghao
AU - Liang, Shan
AU - Zhang, Xueliang
AU - Tao, Jianhua
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - Audio deepfake detection is usually formulated as a binary classification between genuine and fake speech for an entire utterance. Environmental clues such as background and device noise can be used as classification features, but they are easily attacked, e.g. by simply adding real noise to the fake speech. In contrast, the spectral discrimination of speech is a more robust feature and has been used in speaker recognition models to authenticate speaker identity. In this study, we propose a speaker recognition-assisted audio deepfake detector. Feature representations extracted by a speaker recognition model are introduced into multiple layers of the deepfake detector to fully exploit the inherent spectral discrimination of speech. The speaker recognition and audio deepfake detection models are jointly optimized by a multi-objective learning method. Systematic experiments on the ASVspoof 2019 logical access corpus demonstrate that the proposed approach outperforms existing single systems and significantly improves robustness to noise.
AB - Audio deepfake detection is usually formulated as a binary classification between genuine and fake speech for an entire utterance. Environmental clues such as background and device noise can be used as classification features, but they are easily attacked, e.g. by simply adding real noise to the fake speech. In contrast, the spectral discrimination of speech is a more robust feature and has been used in speaker recognition models to authenticate speaker identity. In this study, we propose a speaker recognition-assisted audio deepfake detector. Feature representations extracted by a speaker recognition model are introduced into multiple layers of the deepfake detector to fully exploit the inherent spectral discrimination of speech. The speaker recognition and audio deepfake detection models are jointly optimized by a multi-objective learning method. Systematic experiments on the ASVspoof 2019 logical access corpus demonstrate that the proposed approach outperforms existing single systems and significantly improves robustness to noise.
KW - ASVspoof 2019
KW - audio deepfake detection
KW - speaker recognition-assisted
KW - spectral discrimination
UR - http://www.scopus.com/inward/record.url?scp=85140060422&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-72
DO - 10.21437/Interspeech.2022-72
M3 - Conference Proceeding
AN - SCOPUS:85140060422
VL - 2022-September
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4202
EP - 4206
BT - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -