Visualizing Topics with Multi-Word Expressions

Resource type
Preprint
Authors/contributors
Blei, D. M.; Lafferty, J. D.
Title
Visualizing Topics with Multi-Word Expressions
Abstract
We describe a new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models. Our method finds significant $n$-grams related to a topic, which are then used to help understand and interpret the underlying distribution. Compared with the usual visualization, which simply lists the most probable topical terms, the multi-word expressions provide a better intuitive impression for what a topic is "about." Our approach is based on a language model of arbitrary length expressions, for which we develop a new methodology based on nested permutation tests to find significant phrases. We show that this method outperforms the more standard use of $\chi^2$ and likelihood ratio tests. We illustrate the topic presentations on corpora of scientific abstracts and news articles.
Repository
arXiv
Archive ID
arXiv:0907.1013
Date
2009-07-06
Accessed
02/11/2023, 21:43
Library Catalogue
Extra
arXiv:0907.1013 [stat]
Citation
Blei, D. M., & Lafferty, J. D. (2009). Visualizing Topics with Multi-Word Expressions (arXiv:0907.1013). arXiv. http://arxiv.org/abs/0907.1013