Keywords-oriented Data Augmentation for Chinese

Fang Yuan; Xianbin Hong; Cheng Yuan; Xiang Fei; Sheng Uei Guan; Dawei Liu; Wei Wang

doi:10.1109/ICCC51575.2020.9345133

Department of Computing

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

2 Citations (Scopus)

Abstract

In natural language processing, data is essential, but collecting it is expensive. Large volumes of data benefit many tasks, particularly deep learning tasks. Data augmentation addresses data scarcity by improving both the quality and the quantity of data, for example by generating text that preserves the original meaning and by broadening the diversity of the data distribution. A simple and widely used approach is to sample words from a text and then augment them, with the sampling usually done at random. Although this approach has proven effective over the past few years, random sampling is not the best choice because it can introduce noise, such as stop words, into the original data. The generated data can then interfere with downstream tasks and reduce their accuracy. This paper therefore introduces a novel data augmentation method that avoids such noisy data: keywords-oriented data augmentation for Chinese (KDA). KDA comprises a method for extracting keywords based on category labels and an augmentation method based on those keywords. Instead of sampling at random, the proposed technique first selects the key information in the data and then expands the selected data. In the experiments, KDA is compared with two other typical data augmentation techniques on three Chinese data sets for text classification. The results show that KDA outperforms both.
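The abstract does not say how keywords are scored or how the selected words are augmented, so the Python sketch below only illustrates the select-then-expand idea and is not the authors' implementation. It assumes the texts are already segmented into words, scores each word by how concentrated it is in one category (a hypothetical scoring rule), and replaces keywords using a small hand-written synonym table (also hypothetical). The names `extract_keywords`, `augment`, `top_k` and `synonyms` are all illustrative.

```python
import math
import random
from collections import Counter, defaultdict


def extract_keywords(docs, labels, top_k=1):
    """Pick the top_k words per category (assumed scoring, not from the paper)."""
    per_label = defaultdict(Counter)  # label -> word counts inside that category
    total = Counter()                 # word counts over all categories
    for tokens, label in zip(docs, labels):
        per_label[label].update(tokens)
        total.update(tokens)
    keywords = {}
    for label, counts in per_label.items():
        # Share of the word's occurrences that fall in this category,
        # weighted by a log of its in-category frequency.
        scores = {w: (c / total[w]) * math.log(1 + c) for w, c in counts.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = set(ranked[:top_k])
    return keywords


def augment(tokens, label, keywords, synonyms, rng):
    """Replace only category keywords that have a synonym; keep other words."""
    return [
        rng.choice(synonyms[w]) if w in keywords[label] and w in synonyms else w
        for w in tokens
    ]


# Toy, pre-segmented corpus with two categories (illustrative data only).
docs = [
    ["这", "部", "电影", "的", "剧情", "很", "精彩"],
    ["电影", "的", "演员", "很", "出色"],
    ["这", "场", "比赛", "的", "球员", "很", "努力"],
    ["比赛", "的", "结果", "很", "意外"],
]
labels = ["娱乐", "娱乐", "体育", "体育"]

# Hypothetical synonym table; note that the stop word "的" also has an entry.
synonyms = {"电影": ["影片"], "比赛": ["赛事"], "的": ["之"]}

keywords = extract_keywords(docs, labels, top_k=1)
rng = random.Random(0)
augmented = augment(docs[0], labels[0], keywords, synonyms, rng)
print(keywords)
print(augmented)
```

In this toy run the stop word "的" has a synonym available but is left unchanged because it is not a category keyword, whereas purely random sampling could have selected it.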

Original language	English
Title of host publication	2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	2006-2012
Number of pages	7
ISBN (Electronic)	9781728186351
DOIs	https://doi.org/10.1109/ICCC51575.2020.9345133
Publication status	Published - 11 Dec 2020
Event	6th IEEE International Conference on Computer and Communications, ICCC 2020 - Chengdu, China Duration: 11 Dec 2020 → 14 Dec 2020

Publication series

Name	2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020

Conference

Conference	6th IEEE International Conference on Computer and Communications, ICCC 2020
Country/Territory	China
City	Chengdu
Period	11/12/20 → 14/12/20

Keywords

Chinese
Classification
Data Augmentation

Access to Document

10.1109/ICCC51575.2020.9345133

Cite this

Yuan, F., Hong, X., Yuan, C., Fei, X., Guan, S. U., Liu, D., & Wang, W. (2020). Keywords-oriented Data Augmentation for Chinese. In 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020 (pp. 2006-2012). Article 9345133 (2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCC51575.2020.9345133

@inproceedings{bdee61da3bd148c19bf7687400ba5f58,

title = "Keywords-oriented Data Augmentation for Chinese",

keywords = "Chinese, Classification, Data Augmentation",

author = "Fang Yuan and Xianbin Hong and Cheng Yuan and Xiang Fei and Guan, {Sheng Uei} and Dawei Liu and Wei Wang",

note = "Publisher Copyright: {\textcopyright} 2020 IEEE.; 6th IEEE International Conference on Computer and Communications, ICCC 2020 ; Conference date: 11-12-2020 Through 14-12-2020",

year = "2020",

month = dec,

day = "11",

doi = "10.1109/ICCC51575.2020.9345133",

language = "English",

series = "2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "2006--2012",

booktitle = "2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020",

}

TY - GEN

T1 - Keywords-oriented Data Augmentation for Chinese

AU - Yuan, Fang

AU - Hong, Xianbin

AU - Yuan, Cheng

AU - Fei, Xiang

AU - Guan, Sheng Uei

AU - Liu, Dawei

AU - Wang, Wei

PY - 2020/12/11

Y1 - 2020/12/11

KW - Chinese

KW - Classification

KW - Data Augmentation

UR - http://www.scopus.com/inward/record.url?scp=85101702617&partnerID=8YFLogxK

U2 - 10.1109/ICCC51575.2020.9345133

DO - 10.1109/ICCC51575.2020.9345133

M3 - Conference Proceeding

AN - SCOPUS:85101702617

T3 - 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020

SP - 2006

EP - 2012

BT - 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 6th IEEE International Conference on Computer and Communications, ICCC 2020

Y2 - 11 December 2020 through 14 December 2020

ER -
