TY - GEN
T1 - Corpora for document-level neural machine translation
AU - Liu, Siyou
AU - Zhang, Xiaojun
N1 - Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC
PY - 2020
Y1 - 2020
AB - Instead of translating sentences in isolation, document-level machine translation aims to capture discourse dependencies across sentences by considering a document as a whole. In recent years, there has been growing interest in modelling larger context for state-of-the-art neural machine translation (NMT). Although various document-level NMT models have shown significant improvements, three main problems remain: 1) compared with sentence-level translation tasks, the data for training robust document-level models are relatively low-resourced; 2) experiments in previous work are conducted on their own datasets, which vary in size, domain and language; 3) proposed approaches are implemented on distinct NMT architectures such as recurrent neural networks (RNNs) and self-attention networks (SANs). In this paper, we aim to alleviate the low-resource and under-universality problems for document-level NMT. First, we collect a large number of existing document-level corpora, which cover 7 language pairs and 6 domains. To address resource sparsity, we construct a novel document-level parallel corpus for Chinese-Portuguese, a non-English-centred and low-resourced language pair. In addition, we implement and evaluate the commonly-cited document-level method on top of the advanced Transformer model with universal settings. Finally, we not only demonstrate the effectiveness and universality of document-level NMT, but also release the preprocessed data, source code and trained models for comparison and reproducibility.
KW - Corpus
KW - Discourse
KW - Document-Level Translation
KW - Neural Machine Translation
UR - http://www.scopus.com/inward/record.url?scp=85096602820&partnerID=8YFLogxK
M3 - Conference Proceeding
AN - SCOPUS:85096602820
T3 - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
SP - 3775
EP - 3781
BT - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Béchet, Frédéric
A2 - Blache, Philippe
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Hélène
A2 - Moreno, Asunción
A2 - Odijk, Jan
A2 - Piperidis, Stelios
PB - European Language Resources Association (ELRA)
T2 - 12th International Conference on Language Resources and Evaluation, LREC 2020
Y2 - 11 May 2020 through 16 May 2020
ER -