55 - Topic Modeling

Dividing a set of vectorized documents into N unsupervised topics by determining how similar vectors for a specific topic should be, and how many topics should be distinguished.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

26 Oct 2020

To divide a set of documents into N unsupervised topics, the documents should be represented by compact vectors. Term Frequency * Inverse Document Frequency (TF-IDF), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are the best-known Vector Space Model algorithms for transforming a document into a vector.
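As a rough illustration of the first of these, the sketch below computes TF-IDF vectors using only the Python standard library. The whitespace tokenizer, the unsmoothed `log(N / df)` weighting and the function name `tfidf_vectors` are assumptions made for this example; libraries such as scikit-learn and gensim apply their own tokenization, smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumptions for this sketch: lowercase whitespace tokenization,
    relative term frequency, and an unsmoothed IDF of log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]
vocabulary, vectors = tfidf_vectors(docs)
```

Each document becomes a vector with one dimension per vocabulary term. A term that occurs in few documents gets a high weight; a term that occurs in every document gets a weight of zero.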

The critical step in Topic Modeling is to determine how similar these vectors should be, which documents belong to a specific topic and how many topics should be distinguished. In most libraries you have to define how many topics (clusters) the algorithm should generate.
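To make the "number of topics as input" point concrete, here is a minimal k-means sketch in plain Python. It is not taken from any particular library; the function name `kmeans`, the toy two-dimensional vectors and the deterministic initialisation (the first `n_topics` vectors serve as starting centroids) are assumptions chosen to keep the example small and reproducible.

```python
import math


def kmeans(vectors, n_topics, n_iter=20):
    """Cluster vectors into n_topics groups with plain Lloyd's algorithm.

    n_topics has to be chosen by the caller: the algorithm cannot
    discover it. Initialisation is simplified for this sketch.
    """
    centroids = [list(v) for v in vectors[:n_topics]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(n_topics), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_topics):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy document vectors that fall into two obvious groups.
doc_vectors = [
    [1.0, 0.1], [0.1, 1.0], [0.9, 0.2],
    [0.2, 0.9], [1.0, 0.0], [0.0, 1.0],
]
labels, centroids = kmeans(doc_vectors, n_topics=2)
```

Calling the same function with `n_topics=3` would split the same data into three groups without complaint, which is why choosing this number is the critical decision.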

However, the Top2Vec library automatically reduces the number of dimensions and finds dense areas in this reduced space, so it determines the number of topics for you. You can then reduce this number by iteratively merging the smallest topic into its most similar topic until you reach your target number. Top2Vec also combines document vectors and word vectors to determine the topic vectors and their most important words.
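The merging step can be sketched as follows. This is not Top2Vec's own code: the function name `reduce_topics`, the use of cosine similarity and the size-weighted average used to combine two topic vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_topics(topic_vectors, topic_sizes, target):
    """Merge the smallest topic into its most similar topic until
    only `target` topics remain. Returns (vectors, sizes)."""
    if target < 1:
        raise ValueError("target must be at least 1")
    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    while len(vectors) > target:
        smallest = min(range(len(sizes)), key=sizes.__getitem__)
        others = [i for i in range(len(vectors)) if i != smallest]
        nearest = max(others, key=lambda i: cosine(vectors[smallest], vectors[i]))

        # Assumption: the merged topic vector is the size-weighted mean.
        merged_size = sizes[smallest] + sizes[nearest]
        vectors[nearest] = [
            (sizes[smallest] * a + sizes[nearest] * b) / merged_size
            for a, b in zip(vectors[smallest], vectors[nearest])
        ]
        sizes[nearest] = merged_size
        del vectors[smallest], sizes[smallest]
    return vectors, sizes


# Four hypothetical topics: two large ones, each with a small near-duplicate.
topic_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
topic_sizes = [50, 5, 40, 3]  # number of documents per topic
reduced_vectors, reduced_sizes = reduce_topics(topic_vectors, topic_sizes, target=2)
```

In this toy input each small topic is absorbed by the large topic pointing in nearly the same direction, leaving two topics that together still cover all 98 documents.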

Another informative tool that should be mentioned is pyLDAvis, which is a library for interactive topic model visualization (demo).

The 5 steps of the Top2Vec algorithm (source)

This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.
