TY - GEN
T1 - Enhancing Low-Resource Translation with Large Language Models: A Unified Data Selection and Scoring Optimization Framework
AU - Lu, Zhixiang
AU - Li, Yulong
AU - Sun, Ding
AU - Xue, Chenyue
AU - Wang, Zichun
AU - Pang, Chengren
AU - Xue, Haochen
AU - Zhou, Mian
AU - Su, Jionglong
AU - Jiang, Zhengyong
PY - 2025
Y1 - 2025
AB - Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored for low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we apply to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese–Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained with UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and Deepseek-v3, both in translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale inference. Our framework sets a new performance benchmark for Chinese–Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
M3 - Conference Proceeding
BT - 2025 21st International Conference on Intelligent Computing
ER -