TY - GEN
T1 - Speaker Recognition-Assisted Robust Audio Deepfake Detection
AU - Pan, Jiahui
AU - Nie, Shuai
AU - Zhang, Hui
AU - He, Shulin
AU - Zhang, Kanghao
AU - Liang, Shan
AU - Zhang, Xueliang
AU - Tao, Jianhua
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - Audio deepfake detection is usually formulated as a binary classification between genuine and fake speech for an entire utterance. Environmental clues such as background and device noise can be used as classification features, but they are easily attacked, e.g. by simply adding real noise to the fake speech. In contrast, the spectral discrimination of speech is a more robust feature and has been used in speaker recognition models to authenticate speaker identity. In this study, we propose a speaker recognition-assisted audio deepfake detector. Feature representations extracted by a speaker recognition model are introduced into multiple layers of the deepfake detector to fully exploit the inherent spectral discrimination of speech. The speaker recognition and audio deepfake detection models are jointly optimized by a multi-objective learning method. Systematic experiments on the ASVspoof 2019 logical access corpus demonstrate that the proposed approach outperforms existing single systems and significantly improves robustness to noise.
AB - Audio deepfake detection is usually formulated as a binary classification between genuine and fake speech for an entire utterance. Environmental clues such as background and device noise can be used as classification features, but they are easily attacked, e.g. by simply adding real noise to the fake speech. In contrast, the spectral discrimination of speech is a more robust feature and has been used in speaker recognition models to authenticate speaker identity. In this study, we propose a speaker recognition-assisted audio deepfake detector. Feature representations extracted by a speaker recognition model are introduced into multiple layers of the deepfake detector to fully exploit the inherent spectral discrimination of speech. The speaker recognition and audio deepfake detection models are jointly optimized by a multi-objective learning method. Systematic experiments on the ASVspoof 2019 logical access corpus demonstrate that the proposed approach outperforms existing single systems and significantly improves robustness to noise.
KW - ASVspoof 2019
KW - audio deepfake detection
KW - speaker recognition-assisted
KW - spectral discrimination
UR - http://www.scopus.com/inward/record.url?scp=85140060422&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-72
DO - 10.21437/Interspeech.2022-72
M3 - Conference Proceeding
AN - SCOPUS:85140060422
VL - 2022-September
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4202
EP - 4206
BT - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -