Research on the Construction Method of Chinese - Vietnamese Parallel Corpus

Shiying Tu, Haojin Hu, Ronglyu Sun, Yanmei Jing, Wenxue He

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

1 Citation (Scopus)

Abstract

Constructing a Chinese-Vietnamese parallel corpus is a basic research problem in natural language processing. Traditional methods that use the DOM tree or HTML element anchors to extract parallel sentences suffer from low accuracy and slow alignment. This paper therefore proposes a new Web-based scheme for constructing a Chinese-Vietnamese parallel corpus. The scheme identifies parallel web pages using LDA (Latent Dirichlet Allocation) with Gibbs sampling, then uses BeautifulSoup and regular expressions to crawl the web page text and clean the corpus. The DOM tree and HTML element anchors are used to optimize the extraction of parallel sentence pairs. Sentence length and the Champollion algorithm are combined in a dynamic programming algorithm to improve the accuracy and recall of sentence alignment. The scheme successfully built a million-scale Chinese-Vietnamese parallel corpus.
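
As a rough illustration of the sentence-alignment step, the following minimal Python sketch aligns two sentence lists with dynamic programming over sentence lengths only. It is not the paper's implementation: Champollion's lexical scoring is omitted, and the names `BEADS`, `length_cost`, `align`, the `ratio` parameter, and all penalty values are assumptions chosen for illustration.

```python
from math import inf

# Allowed alignment types as (source sentences, target sentences) consumed,
# each with an illustrative penalty. These values are assumptions, not
# figures taken from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len, tgt_len, ratio=1.0):
    """Penalty that grows as tgt_len departs from ratio * src_len."""
    expected = src_len * ratio
    return abs(tgt_len - expected) / max(expected + tgt_len, 1)


def align(src, tgt, ratio=1.0):
    """Return a list of (source_group, target_group) sentence groups."""
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: relax every allowed alignment type from each reachable cell.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored pointers from the end to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Calling `align(chinese_sentences, vietnamese_sentences, ratio=r)` returns grouped sentence pairs, where `r` would be an estimated Vietnamese-to-Chinese character-length ratio. Character counts are a crude proxy across two such different scripts, which is one reason a length score is usually combined with lexical evidence in practice.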

Original language: English
Title of host publication: Proceedings of 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference, IAEAC 2019
Editors: Bing Xu, Kefen Mou
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2006-2011
Number of pages: 6
ISBN (Electronic): 9781728119076
Publication status: Published - Dec 2019
Externally published: Yes
Event: 4th IEEE Advanced Information Technology, Electronic and Automation Control Conference, IAEAC 2019 - Chengdu, China
Duration: 20 Dec 2019 to 22 Dec 2019

Conference

Conference: 4th IEEE Advanced Information Technology, Electronic and Automation Control Conference, IAEAC 2019
Country/Territory: China
City: Chengdu
Period: 20/12/19 to 22/12/19

Keywords

  • Chinese-Vietnamese parallel corpus
  • corpus cleaning
  • parallel web crawling
  • sentence alignment
