Advancing Low-Resource Machine Translation: A Unified Data Selection and Scoring Optimization Framework

Research output: Chapter in Book or Report/Conference proceeding, Conference Proceeding, peer-review

Abstract

Large language models (LLMs) have achieved remarkable success in machine translation, yet their performance on low-resource language pairs remains limited due to data scarcity and poor generalization. In this work, we propose the Unified Data Selection and Scoring Optimization (UDSSO) framework, a novel system that leverages LLMs for high-quality data augmentation and filtering, specifically tailored for low-resource translation. UDSSO integrates scalable data scoring and selection mechanisms to construct improved training corpora, which we apply to fine-tune a compact multilingual model, mBART. We focus on the challenging Chinese-Dutch translation task, a previously underexplored low-resource setting. Our experiments demonstrate that mBART trained with UDSSO-processed data significantly outperforms state-of-the-art (SOTA) LLMs such as GPT-4o and DeepSeek-V3, both in translation accuracy and linguistic consistency. This finding highlights the power of strategically enhanced datasets in maximizing the performance of smaller models, offering a cost-effective and efficient alternative to large-scale LLM inference. Our framework sets a new performance benchmark for Chinese-Dutch translation and provides a generalizable solution for improving LLM-based translation in low-resource scenarios.
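The abstract does not specify how UDSSO scores or selects data, so the following is only a minimal sketch of the general score-then-filter idea, not the authors' method. The names `select_top_pairs` and `toy_score` are hypothetical, and the length-ratio scorer is a deliberately crude placeholder for whatever LLM-based quality score a real pipeline would use.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def select_top_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep pairs scoring at least `threshold`, ordered best first."""
    scored = [(score_fn(src, tgt), (src, tgt)) for src, tgt in pairs]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in kept]


def toy_score(src: str, tgt: str) -> float:
    """Hypothetical placeholder scorer (assumption, not from the paper).

    Returns 0.0 if either side is empty, otherwise the ratio of the
    shorter string length to the longer one (a value in (0, 1]).
    """
    if not src or not tgt:
        return 0.0
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b)


if __name__ == "__main__":
    candidates = [("abcd", "abcd"), ("abcd", "a"), ("", "x")]
    print(select_top_pairs(candidates, toy_score, threshold=0.5))
```

Passing the scorer in as a function keeps the selection step independent of how quality is estimated, so the placeholder could be swapped for a model-based score without changing the filter.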

Original language: English
Title of host publication: Advanced Intelligent Computing Technology and Applications - 21st International Conference, ICIC 2025, Proceedings
Editors: De-Shuang Huang, Qinhu Zhang, Chuanlei Zhang, Wei Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 482-493
Number of pages: 12
ISBN (Print): 9789819500192
Publication status: Published - 2025
Event: 21st International Conference on Intelligent Computing, ICIC 2025 - Ningbo, China
Duration: 26 Jul 2025 → 29 Jul 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15865 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Intelligent Computing, ICIC 2025
Country/Territory: China
City: Ningbo
Period: 26/07/25 → 29/07/25

Keywords

  • Data Selection
  • Large Language Models
  • Low-resource Scenarios
  • Machine Translation
  • Translation Quality Estimation
