Topic modeling for short texts with auxiliary word embeddings

Chenliang Li*, Haoran Wang, Zhiqian Zhang, Aixin Sun, Zongyang Ma

*Corresponding author for this work

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

290 Citations (Scopus)

Abstract

For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is so short, short texts are much sparser in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck that prevents conventional topic models from achieving good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is based not solely on its content words but also on the reader's background knowledge (e.g., semantically related words). Recent advances in word embedding offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Pólya urn (GPU) model. In this sense, the background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence. The learned topic representation leads to the best accuracy in a text classification task, which is used as an indirect evaluation.
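The promotion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity threshold, the promotion weight of 0.1, and the toy two-dimensional embeddings are all assumptions chosen for illustration, and the abstract does not say how GPU-DMM decides which words to promote or by how much. The sketch only shows the general idea of a generalized Pólya urn update, where assigning a word to a topic also adds a small fractional count for its semantically related words under that topic.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_words(embeddings, threshold):
    """Map each word to the other words whose similarity exceeds threshold.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    related = {w: [] for w in embeddings}
    words = list(embeddings)
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            if cosine(embeddings[w], embeddings[v]) > threshold:
                related[w].append(v)
                related[v].append(w)
    return related


def add_with_promotion(topic_word_counts, topic, word, related, promotion=0.1):
    """Generalized Polya urn style update (illustrative only).

    The sampled word gets a full count under the topic; each related word
    gets a smaller fractional count (the assumed promotion weight).
    """
    topic_word_counts[topic][word] += 1.0
    for v in related.get(word, []):
        topic_word_counts[topic][v] += promotion


# Toy embeddings, made up for this example.
embeddings = {
    "soccer": [1.0, 0.0],
    "football": [0.9, 0.1],
    "stock": [0.0, 1.0],
}
related = related_words(embeddings, threshold=0.8)

counts = defaultdict(lambda: defaultdict(float))
add_with_promotion(counts, topic=0, word="soccer", related=related)
```

In this toy run, "football" receives a fractional count under topic 0 even though it never appeared in the document, while "stock" is unaffected. In a real setting the embeddings would come from a model pre-trained on a large external corpus, as the abstract describes.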

Original language: English
Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 165-174
Number of pages: 10
ISBN (Electronic): 9781450342902
DOIs
Publication status: Published - 7 Jul 2016
Externally published: Yes
Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
Duration: 17 Jul 2016 to 21 Jul 2016

Publication series

Name: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Country/Territory: Italy
City: Pisa
Period: 17/07/16 to 21/07/16

Keywords

  • Short texts
  • Topic model
  • Word embeddings
