TY - JOUR
T1 - Cantonese natural language processing in the transformers era
T2 - a survey and current challenges
AU - Xiang, Rong
AU - Chersoni, Emmanuele
AU - Li, Yixia
AU - Li, Jing
AU - Huang, Chu Ren
AU - Pan, Yushan
AU - Li, Yushi
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2024/6/8
Y1 - 2024/6/8
AB - Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models. We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.
KW - Cantonese
KW - Code-switching
KW - Evaluation resources
KW - Multilingualism
KW - NLP for social media
UR - http://www.scopus.com/inward/record.url?scp=85195564852&partnerID=8YFLogxK
U2 - 10.1007/s10579-024-09744-w
DO - 10.1007/s10579-024-09744-w
M3 - Article
AN - SCOPUS:85195564852
SN - 1574-020X
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
ER -