TY - GEN
T1 - Effective document labeling with very few seed words
T2 - 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
AU - Li, Chenliang
AU - Xing, Jian
AU - Sun, Aixin
AU - Ma, Zongyang
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/10/24
Y1 - 2016/10/24
N2 - Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unla-beled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.
AB - Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unla-beled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.
KW - Dataless text classification
KW - Text analysis
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=84996598610&partnerID=8YFLogxK
U2 - 10.1145/2983323.2983721
DO - 10.1145/2983323.2983721
M3 - Conference Proceeding
AN - SCOPUS:84996598610
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 85
EP - 94
BT - CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 24 October 2016 through 28 October 2016
ER -