TY - GEN
T1 - Automatic Proofreading in Chinese
T2 - 8th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2019
AU - Wang, Qiufeng
AU - Liu, Minghuan
AU - Zhang, Weijia
AU - Guo, Yuhang
AU - Li, Tianrui
N1 - Publisher Copyright:
© 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
N2 - The rapid growth in the volume of text makes manual proofreading increasingly costly. Automatic proofreading, by contrast, offers substantial savings in time and human effort and is attracting growing research interest. In this paper, we propose two attention-based deep neural network models, combined with confusion sets, to detect and correct possible Chinese spelling errors at the character level. Our approaches first model the context of Chinese character embeddings using Long Short-Term Memory (LSTM) networks, then score the candidates from each character's confusion set through an attention mechanism, choosing the highest-scoring candidate as the prediction. We also define a new methodology for obtaining (preceding text, following text, candidates, target) quads and provide a supervised dataset for training and testing (our data has been released to the public at https://github.com/ccit-proofread). Performance evaluation indicates that our models achieve state-of-the-art performance and outperform a set of baselines.
AB - The rapid growth in the volume of text makes manual proofreading increasingly costly. Automatic proofreading, by contrast, offers substantial savings in time and human effort and is attracting growing research interest. In this paper, we propose two attention-based deep neural network models, combined with confusion sets, to detect and correct possible Chinese spelling errors at the character level. Our approaches first model the context of Chinese character embeddings using Long Short-Term Memory (LSTM) networks, then score the candidates from each character's confusion set through an attention mechanism, choosing the highest-scoring candidate as the prediction. We also define a new methodology for obtaining (preceding text, following text, candidates, target) quads and provide a supervised dataset for training and testing (our data has been released to the public at https://github.com/ccit-proofread). Performance evaluation indicates that our models achieve state-of-the-art performance and outperform a set of baselines.
KW - Attention mechanism
KW - Error correction of Chinese text
KW - Error detection of Chinese text
KW - LSTM model
UR - http://www.scopus.com/inward/record.url?scp=85075822914&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-32236-6_31
DO - 10.1007/978-3-030-32236-6_31
M3 - Conference Proceeding
AN - SCOPUS:85075822914
SN - 9783030322359
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 349
EP - 359
BT - Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Proceedings
A2 - Tang, Jie
A2 - Kan, Min-Yen
A2 - Zhao, Dongyan
A2 - Li, Sujian
A2 - Zan, Hongying
PB - Springer
Y2 - 9 October 2019 through 14 October 2019
ER -