TY - GEN
T1 - Dynamic Data Sampler for Cross-Language Transfer Learning in Large Language Models
AU - Li, Yudong
AU - Feng, Yuhao
AU - Zhou, Wen
AU - Zhao, Zhe
AU - Shen, Linlin
AU - Hou, Cheng
AU - Hou, Xianxu
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, owing to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA-2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
AB - Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, owing to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA-2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
KW - cross-language
KW - knowledge transfer
KW - Large language model
UR - http://www.scopus.com/inward/record.url?scp=85195416893&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10446640
DO - 10.1109/ICASSP48485.2024.10446640
M3 - Conference Proceeding
AN - SCOPUS:85195416893
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 11291
EP - 11295
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Y2 - 14 April 2024 through 19 April 2024
ER -
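
Note: the abstract describes a dynamic data sampler that progressively shifts training batches from unsupervised pre-training text toward supervised fine-tuning pairs. The Python sketch below illustrates one way such a sampler could be structured; the linear schedule, the class name DynamicDataSampler, and the dataset placeholders are illustrative assumptions and not the exact recipe from the paper.

    # Minimal sketch of a dynamic data sampler, assuming a simple linear
    # schedule from unsupervised pre-training data to supervised
    # fine-tuning (SFT) data. Names and schedule are illustrative only.
    import random
    from typing import List


    class DynamicDataSampler:
        def __init__(self, pretrain_data: List[str], sft_data: List[str], total_steps: int):
            self.pretrain_data = pretrain_data  # unsupervised Chinese/English/parallel text
            self.sft_data = sft_data            # supervised instruction-response pairs
            self.total_steps = total_steps
            self.step = 0

        def sft_ratio(self) -> float:
            # Linear ramp from 0 (pure pre-training) to 1 (pure fine-tuning).
            return min(1.0, self.step / max(1, self.total_steps))

        def sample(self) -> str:
            # Draw from the SFT pool with probability sft_ratio(),
            # otherwise from the unsupervised pool, then advance the schedule.
            ratio = self.sft_ratio()
            self.step += 1
            pool = self.sft_data if random.random() < ratio else self.pretrain_data
            return random.choice(pool)


    if __name__ == "__main__":
        sampler = DynamicDataSampler(
            pretrain_data=["<unsupervised document>"],
            sft_data=["<instruction>\n<response>"],
            total_steps=1000,
        )
        batch = [sampler.sample() for _ in range(8)]
        print(batch)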