TY - GEN
T1 - Enhancing Low-Resource Translation with Large Language Models: A Unified Data Selection and Scoring Optimization Framework
AU - Lu, Zhixiang
AU - Li, Yulong
AU - Sun, Ding
AU - Xue, Chenyue
AU - Wang, Zichun
AU - Pang, Chengren
AU - Xue, Haochen
AU - Zhou, Mian
AU - Su, Jionglong
AU - Jiang, Zhengyong
PY - 2025
Y1 - 2025
AB - Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored for low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we apply to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese–Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained with UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and Deepseek-v3, both in translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale inference. Our framework sets a new performance benchmark for Chinese–Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
M3 - Conference Proceeding
BT - 2025 21st International Conference on Intelligent Computing
ER -