Deep Normalization for Speaker Vectors

Yunqi Cai, Lantian Li, Andrew Abel, Xiaoyan Zhu, Dong Wang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

19 Citations (Scopus)

Abstract

Deep speaker embedding has demonstrated state-of-the-art performance in speaker recognition tasks. However, one potential issue with this approach is that the speaker vectors derived from deep embedding models tend to be non-Gaussian for individual speakers, and the distributions of different speakers are non-homogeneous. These irregular distributions can seriously degrade speaker recognition performance, especially with the popular PLDA scoring method, which assumes a homogeneous Gaussian distribution. In this article, we argue that deep speaker vectors require deep normalization, and we propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model. We demonstrate the effectiveness of the proposed approach with experiments on the widely used SITW and CNCeleb corpora. In these experiments, the DNF-based normalization delivered substantial performance gains and also showed strong generalization capability in out-of-domain tests.
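
To make the idea concrete, the sketch below illustrates one way a discriminative normalization flow can be realized, assuming RealNVP-style affine coupling layers and a per-speaker Gaussian prior with unit covariance in the latent space. The class names, layer count, 512-dimensional vectors, and PyTorch framing are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn


    class AffineCoupling(nn.Module):
        """Invertible affine coupling layer (RealNVP style): the second half of
        the dimensions is scaled and shifted conditioned on the first half."""
        def __init__(self, dim, hidden=256):
            super().__init__()
            self.half = dim // 2
            self.net = nn.Sequential(
                nn.Linear(self.half, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * (dim - self.half)),
            )

        def forward(self, x):
            xa, xb = x[:, :self.half], x[:, self.half:]
            log_s, t = self.net(xa).chunk(2, dim=1)
            log_s = torch.tanh(log_s)                 # keep scales numerically tame
            zb = xb * torch.exp(log_s) + t
            return torch.cat([xa, zb], dim=1), log_s.sum(dim=1)


    class DNF(nn.Module):
        """Stack of coupling layers with a per-speaker Gaussian prior
        z ~ N(mu_speaker, I); maximizing the likelihood pushes each speaker's
        normalized vectors toward its own Gaussian (the 'discriminative' part)."""
        def __init__(self, dim, n_speakers, n_layers=6):
            super().__init__()
            self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
            self.register_buffer(
                "perms", torch.stack([torch.randperm(dim) for _ in range(n_layers)]))
            self.means = nn.Parameter(torch.zeros(n_speakers, dim))

        def transform(self, x):
            """Map raw speaker vectors x to the normalized latent space."""
            log_det = torch.zeros(x.size(0), device=x.device)
            z = x
            for layer, perm in zip(self.layers, self.perms):
                z = z[:, perm]                        # mix dimensions between layers
                z, ld = layer(z)
                log_det = log_det + ld
            return z, log_det

        def log_likelihood(self, x, speaker_ids):
            """Training objective: log N(z; mu_spk, I) + log|det J| (up to a constant)."""
            z, log_det = self.transform(x)
            mu = self.means[speaker_ids]
            log_prior = -0.5 * ((z - mu) ** 2).sum(dim=1)
            return log_prior + log_det


    # Hypothetical usage: train on labeled speaker vectors, then score the
    # normalized vectors with a standard PLDA back-end.
    if __name__ == "__main__":
        flow = DNF(dim=512, n_speakers=1000)
        x = torch.randn(8, 512)                       # a batch of raw speaker vectors
        spk = torch.randint(0, 1000, (8,))
        loss = -flow.log_likelihood(x, spk).mean()    # maximize likelihood
        loss.backward()
        z_test, _ = flow.transform(torch.randn(4, 512))  # normalized vectors for PLDA

Under this reading, the flow is trained by maximizing the per-utterance log-likelihood on labeled speaker vectors; at test time, vectors are simply mapped through the flow (no speaker labels are needed), and the resulting latents, now closer to homogeneous Gaussians, are scored with a conventional PLDA back-end.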

Original language: English
Article number: 9296778
Pages (from-to): 733-744
Number of pages: 12
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 29
DOIs
Publication status: Published - 2021

Keywords

  • Normalization flow
  • speaker embedding
  • speaker recognition
