60 - Document Similarity

Estimating the degree of similarity between the semantic representations of two documents.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

31 Oct 2020 • 1 min read

Estimating the degree of similarity between the semantic representations of two documents can be approached with different feature extraction techniques. Some examples:

BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency) are statistical techniques. They are, respectively, the current default and the former default similarity algorithm in Elasticsearch and Lucene.
Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing, LSI) for vectorization of documents. It assumes that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a truncated singular value decomposition (closely related to principal component analysis) to the term-document matrix and keeps only the directions associated with the largest singular values, which capture most of the variation in the data.
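Latent Dirichlet Allocation (LDA) is a probabilistic topic model that represents each document as a mixture of topics, so documents can be compared by their topic distributions.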
Doc2Vec (also known as Paragraph Vector or paragraph2vec) is a neural network method that extends the word2vec algorithm to learn, without supervision, continuous representations of larger blocks of text such as sentences, paragraphs, or whole documents.
USE (Universal Sentence Encoder) encodes text into high-dimensional vectors. Pretrained models are available for English, and there is also a multilingual model.
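
To make the first technique in the list concrete, here is a minimal sketch of TF-IDF document vectors compared with cosine similarity, using only the Python standard library. The whitespace tokenizer, the smoothed IDF formula, the toy documents, and the function names (tokenize, tfidf_vectors, cosine) are all assumptions made for illustration. This is not how Elasticsearch or Lucene implement their scoring.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch);
    # real systems use a proper analyzer with stemming, stop words, etc.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF; one of several common variants, chosen here as an assumption.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency is normalized by document length.
        vectors.append({t: (n / len(tokens)) * idf[t] for t, n in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_cats = cosine(vecs[0], vecs[1])       # documents sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
print(sim_cats, sim_unrelated)
```

The two cat sentences share terms and therefore get a positive score, while the first and third documents share no terms and score zero. That limitation of purely lexical features is what the dense representations in the list (LSA, Doc2Vec, USE) try to address; vectors produced by those methods can be compared with the same cosine measure.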

This article is part of the Periodic Table of NLP Tasks project, an effort to systemize NLP tasks.
