TY - GEN
T1 - Advancing Low-Resource Machine Translation
T2 - 21st International Conference on Intelligent Computing, ICIC 2025
AU - Lu, Zhixiang
AU - Ji, Peichen
AU - Li, Yulong
AU - Sun, Ding
AU - Xue, Chenyu
AU - Xue, Haochen
AU - Zhou, Mian
AU - Stefanidis, Angelos
AU - Su, Jionglong
AU - Jiang, Zhengyong
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored for low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we apply to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese-Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained with UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and Deepseek-v3, both in translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale LLM inference. Our framework sets a new performance benchmark for Chinese-Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
AB - Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored for low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we apply to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese-Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained with UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and Deepseek-v3, both in translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale LLM inference. Our framework sets a new performance benchmark for Chinese-Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
KW - Data Selection
KW - Large Language Models
KW - Low-resource Scenarios
KW - Machine Translation
KW - Translation Quality Estimation
UR - https://www.scopus.com/pages/publications/105012426280
U2 - 10.1007/978-981-95-0020-8_41
DO - 10.1007/978-981-95-0020-8_41
M3 - Conference Proceeding
AN - SCOPUS:105012426280
SN - 9789819500192
T3 - Lecture Notes in Computer Science
SP - 482
EP - 493
BT - Advanced Intelligent Computing Technology and Applications - 21st International Conference, ICIC 2025, Proceedings
A2 - Huang, De-Shuang
A2 - Zhang, Qinhu
A2 - Zhang, Chuanlei
A2 - Chen, Wei
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 26 July 2025 through 29 July 2025
ER -