Enhancing Low-Resource Translation with Large Language Models: A Unified Data Selection and Scoring Optimization Framework

Abstract
Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited by data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored to low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we use to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese–Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained on UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and DeepSeek-V3 in both translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale inference. Our framework sets a new performance benchmark for Chinese–Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
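The abstract does not spell out UDSSO's internals, but its overall shape (score candidate sentence pairs, keep the best, fine-tune mBART on the filtered corpus) can be sketched. In the sketch below, `Pair`, `score_pair`, `select_top`, and `keep_frac` are hypothetical names, and the length-ratio heuristic is only a runnable stand-in for the paper's LLM-based scorer; the fine-tuning step uses the standard Hugging Face `transformers` API for mBART-50, which covers both Chinese (`zh_CN`) and Dutch (`nl_XX`).

```python
# Minimal sketch of an UDSSO-style pipeline: score candidate zh-nl pairs,
# keep the top-scoring fraction, and fine-tune mBART-50 on the result.
# score_pair / select_top / keep_frac are hypothetical stand-ins for the
# paper's LLM-based scoring and selection, which the abstract does not detail.
from dataclasses import dataclass

from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Trainer,
    TrainingArguments,
)


@dataclass
class Pair:
    zh: str  # Chinese source sentence
    nl: str  # Dutch target sentence


def score_pair(pair: Pair) -> float:
    """Placeholder quality score in [0, 1].

    UDSSO scores pairs with an LLM; this crude length-ratio heuristic
    exists only so the sketch runs end to end.
    """
    ratio = (len(pair.zh) + 1) / (len(pair.nl) + 1)
    return 1.0 / (1.0 + abs(ratio - 0.5))  # zh chars ~ half the nl chars


def select_top(pairs: list[Pair], keep_frac: float = 0.5) -> list[Pair]:
    """Keep the highest-scoring fraction of the candidate corpus."""
    ranked = sorted(pairs, key=score_pair, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]


if __name__ == "__main__":
    raw = [
        Pair("今天天气很好。", "Het weer is vandaag mooi."),
        Pair("我喜欢读书。", "Ik lees graag boeken."),
        Pair("乱码乱码乱码乱码", "??"),  # noisy pair the filter should rank last
    ]
    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50", src_lang="zh_CN", tgt_lang="nl_XX"
    )
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
    # Tokenize only the filtered corpus; text_target produces decoder labels.
    train_ds = [
        tokenizer(p.zh, text_target=p.nl, max_length=128, truncation=True)
        for p in select_top(raw)
    ]
    Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="udsso_mbart",
            num_train_epochs=1,
            per_device_train_batch_size=2,
        ),
        train_dataset=train_ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    ).train()
```

The design choice worth noting is that all the cost lives in the one-time data scoring pass; the model that is actually fine-tuned and deployed stays small, which is the cost-effectiveness argument the abstract makes against large-scale LLM inference.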
| Field | Value |
|---|---|
| Original language | English |
| Title of host publication | 2025 21st International Conference on Intelligent Computing |
| Publication status | Accepted/In press - 2025 |