Automatic construction of discourse corpora for dialogue translation

Longyue Wang; Xiaojun Zhang; Zhaopeng Tu; Andy Way; Qun Liu

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

16 Citations (Scopus)

Abstract

This paper proposes a novel approach to automatically constructing a parallel discourse corpus for dialogue machine translation. First, parallel subtitle data and the corresponding monolingual movie script data are crawled from the Internet. Then tags such as speaker and discourse boundary are projected from the script data onto the subtitle data via an information retrieval approach, in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results but also integrate speaker information into translation. Experiments show that the proposed method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, respectively, and that speaker-based language model adaptation yields an improvement of around 0.5 BLEU points in translation quality. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.
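
The abstract does not say which retrieval model is used for the projection step, so the following is only a minimal sketch under stated assumptions: each subtitle line is treated as a query, each script utterance as a document, and the speaker tag of the best TF-IDF cosine match is copied to the subtitle pair. The function names, the 0.3 threshold, and the toy data are all hypothetical and are not taken from the paper; dialogue boundary projection is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def project_speakers(script, subtitles, threshold=0.3):
    """Copy speaker tags from script utterances to subtitle pairs.

    script:    list of (speaker, utterance) in the source language
    subtitles: list of (source_line, target_line) parallel pairs
    threshold: hypothetical cut-off below which no tag is assigned
    """
    docs = [tokenize(u) for _, u in script] + [tokenize(s) for s, _ in subtitles]
    vecs = tfidf_vectors(docs)
    script_vecs, sub_vecs = vecs[:len(script)], vecs[len(script):]

    labelled = []
    for (src, tgt), sub_vec in zip(subtitles, sub_vecs):
        best_score, best_speaker = 0.0, None
        for (speaker, _), script_vec in zip(script, script_vecs):
            score = cosine(sub_vec, script_vec)
            if score > best_score:
                best_score, best_speaker = score, speaker
        # Leave the pair untagged when no script line is similar enough.
        labelled.append((best_speaker if best_score >= threshold else None, src, tgt))
    return labelled


# Toy data, invented for illustration only.
script = [
    ("ANNA", "Where did you put the car keys?"),
    ("BEN", "They are on the kitchen table."),
]
subtitles = [
    ("Where did you put the keys?", "你把钥匙放在哪儿了？"),
    ("They're on the kitchen table.", "在厨房的桌子上。"),
    ("Good night.", "晚安。"),
]
labelled = project_speakers(script, subtitles)
for speaker, src, tgt in labelled:
    print(speaker, "|", src, "|", tgt)
```

Matching only on the source side means the target-language line inherits the tag through the existing subtitle alignment, which is how a monolingual script annotation can end up on bilingual text.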

Original language	English
Title of host publication	Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Editors	Nicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani
Publisher	European Language Resources Association (ELRA)
Pages	2748-2754
Number of pages	7
ISBN (Electronic)	9782951740891
Publication status	Published - 2016
Externally published	Yes
Event	10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia; Duration: 23 May 2016 → 28 May 2016

Publication series

Name	Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

Conference

Conference	10th International Conference on Language Resources and Evaluation, LREC 2016
Country/Territory	Slovenia
City	Portoroz
Period	23/05/16 → 28/05/16

Keywords

Dialogue
Discourse corpus
Information retrieval
Machine translation
Movie script
Movie subtitle

Cite this

Wang, L., Zhang, X., Tu, Z., Way, A., & Liu, Q. (2016). Automatic construction of discourse corpora for dialogue translation. In N. Calzolari, K. Choukri, H. Mazo, A. Moreno, T. Declerck, S. Goggi, M. Grobelnik, J. Odijk, S. Piperidis, B. Maegaard, & J. Mariani (Eds.), Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 2748-2754). (Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016). European Language Resources Association (ELRA).

Wang, Longyue ; Zhang, Xiaojun ; Tu, Zhaopeng et al. / Automatic construction of discourse corpora for dialogue translation. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. editor / Nicoletta Calzolari ; Khalid Choukri ; Helene Mazo ; Asuncion Moreno ; Thierry Declerck ; Sara Goggi ; Marko Grobelnik ; Jan Odijk ; Stelios Piperidis ; Bente Maegaard ; Joseph Mariani. European Language Resources Association (ELRA), 2016. pp. 2748-2754 (Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016).

@inproceedings{bcafd8fb98dd43728b62143042eae94a,

title = "Automatic construction of discourse corpora for dialogue translation",

abstract = "This paper proposes a novel approach to automatically constructing a parallel discourse corpus for dialogue machine translation. First, parallel subtitle data and the corresponding monolingual movie script data are crawled from the Internet. Then tags such as speaker and discourse boundary are projected from the script data onto the subtitle data via an information retrieval approach, in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results but also integrate speaker information into translation. Experiments show that the proposed method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, respectively, and that speaker-based language model adaptation yields an improvement of around 0.5 BLEU points in translation quality. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.",

keywords = "Dialogue, Discourse corpus, Information retrieval, Machine translation, Movie script, Movie subtitle",

author = "Longyue Wang and Xiaojun Zhang and Zhaopeng Tu and Andy Way and Qun Liu",

note = "Funding Information: This work is supported by the Science Foundation of Ireland (SFI) ADAPT project (Grant No.:13/RC/2106), and partly supported by the DCU-Huawei Joint Project (Grant No.:201504032-A, YB2015090061). It is partly supported by the Open Projects Program of National Laboratory of Pattern Recognition (Grant 201407353) and the Open Projects Program of Centre of Translation of GDUFS (Grant CTS201501).; 10th International Conference on Language Resources and Evaluation, LREC 2016 ; Conference date: 23-05-2016 Through 28-05-2016",

year = "2016",

language = "English",

series = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",

publisher = "European Language Resources Association (ELRA)",

pages = "2748--2754",

editor = "Nicoletta Calzolari and Khalid Choukri and Helene Mazo and Asuncion Moreno and Thierry Declerck and Sara Goggi and Marko Grobelnik and Jan Odijk and Stelios Piperidis and Bente Maegaard and Joseph Mariani",

booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",

}

Wang, L, Zhang, X, Tu, Z, Way, A & Liu, Q 2016, Automatic construction of discourse corpora for dialogue translation. in N Calzolari, K Choukri, H Mazo, A Moreno, T Declerck, S Goggi, M Grobelnik, J Odijk, S Piperidis, B Maegaard & J Mariani (eds), Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, European Language Resources Association (ELRA), pp. 2748-2754, 10th International Conference on Language Resources and Evaluation, LREC 2016, Portoroz, Slovenia, 23/05/16.

Automatic construction of discourse corpora for dialogue translation. / Wang, Longyue; Zhang, Xiaojun; Tu, Zhaopeng et al.
Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. ed. / Nicoletta Calzolari; Khalid Choukri; Helene Mazo; Asuncion Moreno; Thierry Declerck; Sara Goggi; Marko Grobelnik; Jan Odijk; Stelios Piperidis; Bente Maegaard; Joseph Mariani. European Language Resources Association (ELRA), 2016. p. 2748-2754 (Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016).

TY - GEN

T1 - Automatic construction of discourse corpora for dialogue translation

AU - Wang, Longyue

AU - Zhang, Xiaojun

AU - Tu, Zhaopeng

AU - Way, Andy

AU - Liu, Qun

N1 - Funding Information: This work is supported by the Science Foundation of Ireland (SFI) ADAPT project (Grant No.:13/RC/2106), and partly supported by the DCU-Huawei Joint Project (Grant No.:201504032-A, YB2015090061). It is partly supported by the Open Projects Program of National Laboratory of Pattern Recognition (Grant 201407353) and the Open Projects Program of Centre of Translation of GDUFS (Grant CTS201501).

PY - 2016

Y1 - 2016

N2 - This paper proposes a novel approach to automatically constructing a parallel discourse corpus for dialogue machine translation. First, parallel subtitle data and the corresponding monolingual movie script data are crawled from the Internet. Then tags such as speaker and discourse boundary are projected from the script data onto the subtitle data via an information retrieval approach, in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results but also integrate speaker information into translation. Experiments show that the proposed method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, respectively, and that speaker-based language model adaptation yields an improvement of around 0.5 BLEU points in translation quality. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.

KW - Dialogue

KW - Discourse corpus

KW - Information retrieval

KW - Machine translation

KW - Movie script

KW - Movie subtitle

UR - http://www.scopus.com/inward/record.url?scp=85024100897&partnerID=8YFLogxK

M3 - Conference Proceeding

AN - SCOPUS:85024100897

T3 - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

SP - 2748

EP - 2754

BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Mazo, Helene

A2 - Moreno, Asuncion

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Grobelnik, Marko

A2 - Odijk, Jan

A2 - Piperidis, Stelios

A2 - Maegaard, Bente

A2 - Mariani, Joseph

PB - European Language Resources Association (ELRA)

T2 - 10th International Conference on Language Resources and Evaluation, LREC 2016

Y2 - 23 May 2016 through 28 May 2016

ER -

Wang L, Zhang X, Tu Z, Way A, Liu Q. Automatic construction of discourse corpora for dialogue translation. In Calzolari N, Choukri K, Mazo H, Moreno A, Declerck T, Goggi S, Grobelnik M, Odijk J, Piperidis S, Maegaard B, Mariani J, editors, Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA). 2016. p. 2748-2754. (Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016).
