TY - GEN
T1 - Keywords-oriented Data Augmentation for Chinese
AU - Yuan, Fang
AU - Hong, Xianbin
AU - Yuan, Cheng
AU - Fei, Xiang
AU - Guan, Sheng Uei
AU - Liu, Dawei
AU - Wang, Wei
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/12/11
Y1 - 2020/12/11
N2 - In natural language processing tasks, data is essential, but data collection is expensive. Large volumes of data serve many tasks well, especially deep learning tasks. Data augmentation methods address data problems by improving both data quality and quantity, for example by generating text without changing its meaning and by expanding the diversity of the data distribution. A straightforward data augmentation method is to sample words in a text and then augment them, with the sampling usually driven by random probability. Although this solution has proved effective over the past few years, random sampling is not the best choice for data augmentation because it can introduce noise, such as stop words, into the original data. The generated data can then interfere with subsequent tasks and reduce the accuracy of their solutions. Hence, this paper introduces a novel data augmentation method that avoids involving such noisy data: keywords-oriented data augmentation for Chinese (KDA). The KDA proposed in this paper comprises a method for extracting keywords based on category labels and an augmentation method based on those keywords. In contrast to random sampling, the proposed technique first selects the key information in the data and then expands the selected data. In the experiments, KDA is compared with two other typical data augmentation techniques on three Chinese data sets for text classification tasks. The results show that KDA performs better on the data augmentation task than the two compared techniques.
AB - In natural language processing tasks, data is essential, but data collection is expensive. Large volumes of data serve many tasks well, especially deep learning tasks. Data augmentation methods address data problems by improving both data quality and quantity, for example by generating text without changing its meaning and by expanding the diversity of the data distribution. A straightforward data augmentation method is to sample words in a text and then augment them, with the sampling usually driven by random probability. Although this solution has proved effective over the past few years, random sampling is not the best choice for data augmentation because it can introduce noise, such as stop words, into the original data. The generated data can then interfere with subsequent tasks and reduce the accuracy of their solutions. Hence, this paper introduces a novel data augmentation method that avoids involving such noisy data: keywords-oriented data augmentation for Chinese (KDA). The KDA proposed in this paper comprises a method for extracting keywords based on category labels and an augmentation method based on those keywords. In contrast to random sampling, the proposed technique first selects the key information in the data and then expands the selected data. In the experiments, KDA is compared with two other typical data augmentation techniques on three Chinese data sets for text classification tasks. The results show that KDA performs better on the data augmentation task than the two compared techniques.
KW - Chinese
KW - Classification
KW - Data Augmentation
UR - http://www.scopus.com/inward/record.url?scp=85101702617&partnerID=8YFLogxK
U2 - 10.1109/ICCC51575.2020.9345133
DO - 10.1109/ICCC51575.2020.9345133
M3 - Conference Proceeding
AN - SCOPUS:85101702617
T3 - 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020
SP - 2006
EP - 2012
BT - 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE International Conference on Computer and Communications, ICCC 2020
Y2 - 11 December 2020 through 14 December 2020
ER -