TY - JOUR
T1 - Research on Intelligent Construction Algorithm of Subject Knowledge Thesaurus Based on Literature Resources
AU - Wang, Xiaoxia
AU - Xu, Xiaozhong
AU - Zhang, Jiarui
AU - Zhu, Yue
AU - Fan, Yuhang
AU - Xu, Pengjing
N1 - Publisher Copyright: © Published under licence by IOP Publishing Ltd.
PY - 2021/6/29
Y1 - 2021/6/29
N2 - The implementation of the National Science and Technology Innovation Strategy places exponentially growing demands for knowledge services on literature information institutions. The subject thesaurus is the most important knowledge organization tool for information retrieval and can be widely used for semantic citation, organization, and retrieval of literature resources. This study aims to develop an innovative algorithm for constructing a subject thesaurus from massive literature resource data by mining academic neologisms and the semantic relationships between those neologisms and the subject system. We first collect a literature corpus and carry out the corresponding data pre-processing. Then, using the FastText model to mine academic neologisms, we construct an automatic categorization model for academic neologisms based on the BERT and TextCNN algorithms. The proposed algorithm is validated on 8.1 million multi-source, heterogeneous literature records in the field of marine disciplines. The results show that the algorithm can effectively replace 90% of the manual annotation volume, mine a large number of high-quality marine neologisms, and successfully build a marine science knowledge base with a pass rate of 82.6% in expert review, demonstrating high accuracy and a degree of engineering application potential.
AB - The implementation of the National Science and Technology Innovation Strategy places exponentially growing demands for knowledge services on literature information institutions. The subject thesaurus is the most important knowledge organization tool for information retrieval and can be widely used for semantic citation, organization, and retrieval of literature resources. This study aims to develop an innovative algorithm for constructing a subject thesaurus from massive literature resource data by mining academic neologisms and the semantic relationships between those neologisms and the subject system. We first collect a literature corpus and carry out the corresponding data pre-processing. Then, using the FastText model to mine academic neologisms, we construct an automatic categorization model for academic neologisms based on the BERT and TextCNN algorithms. The proposed algorithm is validated on 8.1 million multi-source, heterogeneous literature records in the field of marine disciplines. The results show that the algorithm can effectively replace 90% of the manual annotation volume, mine a large number of high-quality marine neologisms, and successfully build a marine science knowledge base with a pass rate of 82.6% in expert review, demonstrating high accuracy and a degree of engineering application potential.
UR - http://www.scopus.com/inward/record.url?scp=85109414059&partnerID=8YFLogxK
U2 - 10.1088/1742-6596/1955/1/012038
DO - 10.1088/1742-6596/1955/1/012038
M3 - Conference article
AN - SCOPUS:85109414059
SN - 1742-6588
VL - 1955
JO - Journal of Physics: Conference Series
JF - Journal of Physics: Conference Series
IS - 1
M1 - 012038
T2 - 2021 4th International Symposium on Big Data and Applied Statistics, ISBDAS 2021
Y2 - 21 May 2021 through 23 May 2021
ER -