Triplet based embedding distance and similarity learning for text-independent speaker verification

Zongze Ren; Zhiyong Chen; Shugong Xu

doi:10.1109/APSIPAASC47483.2019.9023253

Shanghai University

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

4 Citations (Scopus)

Abstract

Speaker embeddings have become increasingly popular for text-independent speaker verification. In this paper, we propose two improvements to the training stage. Both are based on triplets, because the training stage and the evaluation stage of the baseline x-vector system pursue different aims. First, we introduce a triplet loss that optimizes the Euclidean distances between embeddings while also minimizing the multi-class cross-entropy loss. Second, we design an embedding similarity measurement network that controls the similarity between two selected embeddings. We further train the two new methods jointly with the original network and achieve state-of-the-art performance. The synergy of multi-task training is shown by a 9% reduction in equal error rate (EER) and detection cost function (DCF) on the 2016 NIST Speaker Recognition Evaluation (SRE) test set.
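The joint objective described in the abstract (a triplet loss on Euclidean embedding distances, minimized together with a multi-class cross-entropy loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the hinge form of the triplet loss, the margin of 0.2, the weighting factor `weight`, and all function names are assumptions made for illustration, and the abstract does not say whether plain or squared distances are used.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances, averaged over the batch.

    Each argument has shape (batch, dim). The margin value is an assumption.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor to same speaker
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor to other speaker
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def cross_entropy(logits, labels):
    """Multi-class cross-entropy from raw logits, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def joint_loss(logits, labels, anchor, positive, negative, weight=1.0, margin=0.2):
    """Speaker-classification loss plus a weighted triplet term (weight is hypothetical)."""
    return cross_entropy(logits, labels) + weight * triplet_loss(
        anchor, positive, negative, margin
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(3, 8, 16))   # anchor, positive, negative embeddings
    logits = rng.normal(size=(8, 10))   # scores over 10 hypothetical training speakers
    labels = rng.integers(0, 10, size=8)
    print(joint_loss(logits, labels, emb[0], emb[1], emb[2]))
```

The triplet term is zero whenever the negative is already farther from the anchor than the positive by at least the margin, so only violating triplets contribute to the gradient in a trainable version of this sketch.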

Original language	English
Title of host publication	2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	558-562
Number of pages	5
ISBN (Electronic)	9781728132488
DOIs	https://doi.org/10.1109/APSIPAASC47483.2019.9023253
Publication status	Published - Nov 2019
Externally published	Yes
Event	2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 - Lanzhou, China; Duration: 18 Nov 2019 → 21 Nov 2019

Publication series

Name	2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019

Conference

Conference	2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019
Country/Territory	China
City	Lanzhou
Period	18/11/19 → 21/11/19

Keywords

Deep neural network
Similarity learning
Speaker verification

Access to Document

10.1109/APSIPAASC47483.2019.9023253

Cite this

Ren, Z., Chen, Z., & Xu, S. (2019). Triplet based embedding distance and similarity learning for text-independent speaker verification. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 (pp. 558-562). Article 9023253 (2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPAASC47483.2019.9023253

@inproceedings{0d31183b7a184165853b2a364da95bf5,
title = "Triplet based embedding distance and similarity learning for text-independent speaker verification",
abstract = "Speaker embeddings have become increasingly popular for text-independent speaker verification. In this paper, we propose two improvements to the training stage. Both are based on triplets, because the training stage and the evaluation stage of the baseline x-vector system pursue different aims. First, we introduce a triplet loss that optimizes the Euclidean distances between embeddings while also minimizing the multi-class cross-entropy loss. Second, we design an embedding similarity measurement network that controls the similarity between two selected embeddings. We further train the two new methods jointly with the original network and achieve state-of-the-art performance. The synergy of multi-task training is shown by a 9% reduction in equal error rate (EER) and detection cost function (DCF) on the 2016 NIST Speaker Recognition Evaluation (SRE) test set.",
keywords = "Deep neural network, Similarity learning, Speaker verification",
author = "Zongze Ren and Zhiyong Chen and Shugong Xu",
note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019 ; Conference date: 18-11-2019 Through 21-11-2019",
year = "2019",
month = nov,
doi = "10.1109/APSIPAASC47483.2019.9023253",
language = "English",
series = "2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "558--562",
booktitle = "2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019",
}
