Abstract
Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Although past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with deep ASV systems remains an open question. Instead of a static-taper design, we propose to optimize the multi-taper estimator jointly with a deep neural network trained for ASV tasks. With a maximum improvement of 25.8% in equal error rate over the static-taper design on the SITW corpus, our method helps preserve a balanced level of leakage and variance, providing more robustness.
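To make the idea of a multi-taper estimate concrete, the following is a minimal sketch of the basic, static estimator: several orthogonal tapers are applied to the same frame, and the resulting periodograms are averaged. It is not the jointly optimized estimator proposed in the paper. Sine tapers and uniform weights are assumptions chosen to keep the example dependency-light; the function names `sine_tapers` and `multitaper_psd`, the frame length of 400 samples, and the choice of 6 tapers are all hypothetical.

```python
import numpy as np


def sine_tapers(frame_len, num_tapers):
    """Return a (num_tapers, frame_len) array of orthonormal sine tapers.

    Assumption: sine tapers stand in for whatever taper family is used
    in practice; they are picked here only because they need no extra
    dependencies.
    """
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    scale = np.sqrt(2.0 / (frame_len + 1))
    return scale * np.sin(np.pi * k * n / (frame_len + 1))


def multitaper_psd(frame, tapers, weights=None):
    """Weighted average of the periodograms obtained with each taper."""
    num_tapers = tapers.shape[0]
    if weights is None:
        # Assumption: uniform weights, i.e. a plain average.
        weights = np.full(num_tapers, 1.0 / num_tapers)
    # One tapered DFT per taper: shape (num_tapers, frame_len // 2 + 1).
    spectra = np.fft.rfft(tapers * frame, axis=-1)
    eigenspectra = np.abs(spectra) ** 2
    # Combine the per-taper spectra into a single estimate.
    return weights @ eigenspectra


# Synthetic white-noise frame standing in for a speech frame.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
tapers = sine_tapers(400, 6)
psd = multitaper_psd(frame, tapers)
```

In an MFCC pipeline, an estimate such as `psd` would take the place of the single-window periodogram before the mel filterbank, logarithm, and discrete cosine transform are applied. Averaging over more tapers lowers the variance of the estimate at the cost of frequency resolution, which is the leakage and variance trade-off the abstract refers to.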
| Original language | English |
|---|---|
| Pages (from-to) | 2187-2191 |
| Number of pages | 5 |
| Journal | IEEE Signal Processing Letters |
| Volume | 28 |
| Publication status | Published - 2021 |
| Externally published | Yes |
Keywords
- Multi-taper spectrum
- speaker verification