Enhancing Low-Resource Translation with Large Language Models: A Unified Data Selection and Scoring Optimization Framework

Zhixiang Lu, Yulong Li, Ding Sun, Chenyue Xue, Zichun Wang, Chengren Pang, Haochen Xue, Mian Zhou, Jionglong Su, Zhengyong Jiang*

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited by data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, which leverages LLMs for high-quality data augmentation and filtering tailored to low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we use to fine-tune mBART, a compact multilingual model. We focus on Chinese–Dutch translation, a challenging and previously underexplored low-resource setting. Our experiments demonstrate that mBART trained on UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and DeepSeek-V3 in both translation accuracy and linguistic consistency. This finding highlights how strategically enhanced datasets can maximize the performance of smaller models, offering a cost-effective and efficient alternative to large-scale inference. Our framework sets a new performance benchmark for Chinese–Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
Original language: English
Title of host publication: 2025 21st International Conference on Intelligent Computing
Publication status: Accepted/In press - 2025
