Effective document labeling with very few seed words: A topic modeling approach

Chenliang Li; Jian Xing; Aixin Sun; Zongyang Ma

doi:10.1145/2983323.2983721

Chenliang Li^*, Jian Xing, Aixin Sun, Zongyang Ma

^*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

53 Citations (Scopus)

Abstract

Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unlabeled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.
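
To make the dataless setting concrete, the following is a minimal sketch of the task's inputs and outputs. It is not the Seed-Guided Topic Model: it simply labels each document with the category whose seed words it mentions most often. The function name `label_by_seed_words`, the categories, the seed words, and the example documents are all hypothetical and do not come from the paper.

```python
from collections import Counter


def label_by_seed_words(documents, seed_words):
    """Assign each document the category whose seed words it mentions most.

    This is a naive seed-matching baseline, NOT the Seed-Guided Topic Model:
    it ignores co-occurrence with non-seed words and has no general-topics.
    Documents that match no seed word get the label None.
    """
    labels = []
    for text in documents:
        # Whitespace tokenization on lowercased text; seed words are assumed lowercase.
        tokens = Counter(text.lower().split())
        # Score each category by the total count of its seed words in the document.
        scores = {
            category: sum(tokens[word] for word in seeds)
            for category, seeds in seed_words.items()
        }
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] > 0 else None)
    return labels


# Hypothetical categories and seed words, for illustration only.
seed_words = {
    "sports": ["game", "team"],
    "finance": ["stock", "market"],
}
documents = [
    "The team won the game last night",
    "The stock market fell sharply",
    "Nothing relevant here",
]
labels = label_by_seed_words(documents, seed_words)
```

In this sketch, a document containing no seed word cannot be labeled at all. That is one limitation that a model using co-occurrence between seed words and regular words, as the abstract describes, could plausibly address.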

Original language	English
Title of host publication	CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
Publisher	Association for Computing Machinery
Pages	85-94
Number of pages	10
ISBN (Electronic)	9781450340731
DOIs	https://doi.org/10.1145/2983323.2983721
Publication status	Published - 24 Oct 2016
Externally published	Yes
Event	25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States; Duration: 24 Oct 2016 → 28 Oct 2016

Publication series

Name	International Conference on Information and Knowledge Management, Proceedings
Volume	24-28-October-2016

Conference

Conference	25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Country/Territory	United States
City	Indianapolis
Period	24/10/16 → 28/10/16

Keywords

Dataless text classification
Text analysis
Topic modeling

Access to Document

10.1145/2983323.2983721

Cite this

Li, C., Xing, J., Sun, A., & Ma, Z. (2016). Effective document labeling with very few seed words: A topic modeling approach. In CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management (pp. 85-94). (International Conference on Information and Knowledge Management, Proceedings; Vol. 24-28-October-2016). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983721

@inproceedings{a0e1801053ae4116a30f4783c07ae652,

title = "Effective document labeling with very few seed words: A topic modeling approach",

abstract = "Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unlabeled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.",

keywords = "Dataless text classification, Text analysis, Topic modeling",

author = "Chenliang Li and Jian Xing and Aixin Sun and Zongyang Ma",

note = "Publisher Copyright: {\textcopyright} 2016 ACM.; 25th ACM International Conference on Information and Knowledge Management, CIKM 2016 ; Conference date: 24-10-2016 Through 28-10-2016",

year = "2016",

month = oct,

day = "24",

doi = "10.1145/2983323.2983721",

language = "English",

series = "International Conference on Information and Knowledge Management, Proceedings",

publisher = "Association for Computing Machinery",

pages = "85--94",

booktitle = "CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management",

}

Li, C, Xing, J, Sun, A & Ma, Z 2016, Effective document labeling with very few seed words: A topic modeling approach. in CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management. International Conference on Information and Knowledge Management, Proceedings, vol. 24-28-October-2016, Association for Computing Machinery, pp. 85-94, 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, United States, 24/10/16. https://doi.org/10.1145/2983323.2983721

Effective document labeling with very few seed words: A topic modeling approach. / Li, Chenliang; Xing, Jian; Sun, Aixin et al.
CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management. Association for Computing Machinery, 2016. p. 85-94 (International Conference on Information and Knowledge Management, Proceedings; Vol. 24-28-October-2016).

TY - GEN

T1 - Effective document labeling with very few seed words

T2 - 25th ACM International Conference on Information and Knowledge Management, CIKM 2016

AU - Li, Chenliang

AU - Xing, Jian

AU - Sun, Aixin

AU - Ma, Zongyang

PY - 2016/10/24

Y1 - 2016/10/24

AB - Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unlabeled documents, and for each category a small set of seed words that are relevant to the semantic meaning of the category, the STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category, representing its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM can also achieve comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to the tuning parameters. Stable performance with little variation can be achieved in a broad range of parameter settings, making it a desired choice for real applications.

KW - Dataless text classification

KW - Text analysis

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84996598610&partnerID=8YFLogxK

U2 - 10.1145/2983323.2983721

DO - 10.1145/2983323.2983721

M3 - Conference Proceeding

AN - SCOPUS:84996598610

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 85

EP - 94

BT - CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management

PB - Association for Computing Machinery

Y2 - 24 October 2016 through 28 October 2016

ER -
