Cantonese natural language processing in the transformers era: a survey and current challenges

Rong Xiang; Emmanuele Chersoni; Yixia Li; Jing Li; Chu Ren Huang; Yushan Pan; Yushi Li

doi:10.1007/s10579-024-09744-w

Rong Xiang, Emmanuele Chersoni^*, Yixia Li, Jing Li, Chu Ren Huang, Yushan Pan, Yushi Li

^*Corresponding author for this work

Hong Kong Polytechnic University

Research output: Contribution to journal › Article › peer-review

Abstract

Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models. We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.

Original language	English
Pages (from-to)	1747-1773
Number of pages	27
Journal	Language Resources and Evaluation
Volume	59
Issue number	2
DOIs	https://doi.org/10.1007/s10579-024-09744-w
Publication status	Published - 8 Jun 2024

Keywords

Cantonese
Code-switching
Evaluation resources
Multilingualism
NLP for social media

Cite this

@article{620f1eba707d4bf5a86165a6242898b4,

title = "Cantonese natural language processing in the transformers era: a survey and current challenges",

abstract = "Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models. We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.",

keywords = "Cantonese, Code-switching, Evaluation resources, Multilingualism, NLP for social media",

author = "Rong Xiang and Emmanuele Chersoni and Yixia Li and Jing Li and Huang, {Chu Ren} and Yushan Pan and Yushi Li",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2024.",

year = "2024",

month = jun,

day = "8",

doi = "10.1007/s10579-024-09744-w",

language = "English",

volume = "59",

pages = "1747--1773",

journal = "Language Resources and Evaluation",

issn = "1574-020X",

publisher = "Springer",

number = "2",

}

TY - JOUR

T1 - Cantonese natural language processing in the transformers era

T2 - a survey and current challenges

AU - Xiang, Rong

AU - Chersoni, Emmanuele

AU - Li, Yixia

AU - Li, Jing

AU - Huang, Chu Ren

AU - Pan, Yushan

AU - Li, Yushi

N1 - Publisher Copyright: © The Author(s) 2024.

PY - 2024/6/8

Y1 - 2024/6/8

N2 - Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models. We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.

KW - Cantonese

KW - Code-switching

KW - Evaluation resources

KW - Multilingualism

KW - NLP for social media

UR - http://www.scopus.com/inward/record.url?scp=85195564852&partnerID=8YFLogxK

U2 - 10.1007/s10579-024-09744-w

DO - 10.1007/s10579-024-09744-w

M3 - Article

AN - SCOPUS:85195564852

SN - 1574-020X

VL - 59

SP - 1747

EP - 1773

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

IS - 2

ER -
