Abstract
In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting that leverages multi-modal enrollments of text and speech templates. Unlike previous methods that rely solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities and compares them with the query speech embedding to detect the target keywords. To make MM-KWS applicable across diverse languages, we build its feature extractor on several multilingual pre-trained models and validate its effectiveness on Mandarin and English tasks. In addition, we integrate advanced data augmentation tools for hard case mining to strengthen MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS significantly outperforms prior methods.
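The abstract describes a matching step in which embeddings of the enrolled keyword, drawn from several modalities, are compared against a query speech embedding. The sketch below is only a minimal illustration of that idea under stated assumptions: random vectors stand in for encoder outputs, and the function name `detect_keyword`, the equal-weight score fusion, and the 0.7 threshold are hypothetical choices, not the authors' implementation (which uses trained multilingual encoders and learned comparison modules).

```python
# Hypothetical sketch of multi-modal keyword matching; not the MM-KWS code.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-dimensional embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def detect_keyword(query_emb: np.ndarray,
                   templates: dict,
                   weights: dict = None,
                   threshold: float = 0.7):
    """Fuse per-modality similarities and compare against a threshold.

    templates maps a modality name ('phoneme', 'text', 'speech') to the
    embedding of the enrolled keyword in that modality. Equal fusion
    weights are an assumption made here for simplicity.
    """
    if weights is None:
        weights = {m: 1.0 / len(templates) for m in templates}
    score = sum(w * cosine_similarity(query_emb, templates[m])
                for m, w in weights.items())
    return score >= threshold, score


# Toy usage: random vectors stand in for pre-trained encoder outputs.
rng = np.random.default_rng(0)
dim = 256
enrolled = {m: rng.standard_normal(dim) for m in ("phoneme", "text", "speech")}
query = rng.standard_normal(dim)
hit, score = detect_keyword(query, enrolled)
print(f"detected={hit}, fused score={score:.3f}")
```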
Original language | English |
---|---|
Pages (from-to) | 2415-2419 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
DOIs | |
Publication status | Published - 2024 |
Externally published | Yes |
Event | 25th Interspeech Conference 2024, Kos Island, Greece, 1 Sept 2024 → 5 Sept 2024 |
Keywords
- hard case mining
- multi-modal
- multilingual
- user-defined keyword spotting
- zero-shot learning