TY - GEN
T1 - Topic modeling for short texts with auxiliary word embeddings
AU - Li, Chenliang
AU - Wang, Haoran
AU - Zhang, Zhiqian
AU - Sun, Aixin
AU - Ma, Zongyang
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/7/7
Y1 - 2016/7/7
AB - For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is short, word co-occurrences in short texts are much sparser. Data sparsity therefore becomes a bottleneck for conventional topic models to achieve good results on short texts. On the other hand, when a human interprets a piece of short text, the understanding is based not solely on its content words but also on the reader's background knowledge (e.g., semantically related words). Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Pólya urn (GPU) model. In this way, background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves topic representations comparable to or better than those of state-of-the-art models, as measured by topic coherence. The learned topic representation leads to the best accuracy in the text classification task, which serves as an indirect evaluation.
KW - Short texts
KW - Topic model
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=84980351621&partnerID=8YFLogxK
U2 - 10.1145/2911451.2911499
DO - 10.1145/2911451.2911499
M3 - Conference Proceeding
AN - SCOPUS:84980351621
T3 - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 165
EP - 174
BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Y2 - 17 July 2016 through 21 July 2016
ER -