15 - Vocabulary Building
The goal of tokenization is a vocabulary. Word-based tokenizers produce larger vocabularies and still encounter more Out-Of-Vocabulary (OOV) words than sub-word tokenizers, which can compose an unseen word from smaller known pieces.
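As a minimal sketch of how OOV words arise in a word-based vocabulary, the snippet below builds a word-to-id mapping from a tiny training corpus and encodes an unseen sentence. The names `build_word_vocab`, `encode`, and the `<unk>` placeholder token are illustrative assumptions, as is the whitespace-and-lowercase tokenization.

```python
def build_word_vocab(texts):
    """Map every whitespace-separated, lowercased word to an integer id."""
    vocab = {"<unk>": 0}  # assumed placeholder id for unknown words
    for text in texts:
        for word in text.lower().split():
            # Assign the next free id the first time a word is seen.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Replace each word with its id, falling back to the <unk> id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


train_texts = ["the cat sat on the mat", "the dog sat on the log"]
word_vocab = build_word_vocab(train_texts)

sentence = "the cat jumped"
ids = encode(sentence, word_vocab)
oov_words = [w for w in sentence.lower().split() if w not in word_vocab]
```

Here `jumped` never appeared in the training texts, so it ends up in `oov_words` and is encoded as the `<unk>` id. A sub-word tokenizer could instead try to represent it from smaller known units.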
Some extra properties might be worth including in the vocabulary. Counting each word type per document and across the entire corpus makes it possible to compute frequency-based measures (like TF-IDF) that reflect how important a word type is.
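The counting described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and one common TF-IDF variant (relative term frequency times the natural log of the inverse document frequency); other weighting schemes exist. The function names `build_vocabulary` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Return per-document counts, corpus-wide counts, and document frequencies."""
    doc_counts = [Counter(doc.lower().split()) for doc in documents]
    corpus_counts = Counter()  # how often each word type occurs in the corpus
    doc_freq = Counter()       # in how many documents each word type occurs
    for counts in doc_counts:
        corpus_counts.update(counts)
        doc_freq.update(counts.keys())
    return doc_counts, corpus_counts, doc_freq


def tf_idf(word, counts, doc_freq, n_docs):
    """TF-IDF of `word` in one document (assumed variant: tf * ln(N / df))."""
    if counts[word] == 0:
        return 0.0  # the word does not occur in this document
    tf = counts[word] / sum(counts.values())
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
doc_counts, corpus_counts, doc_freq = build_vocabulary(documents)

score_the = tf_idf("the", doc_counts[0], doc_freq, len(documents))
score_mat = tf_idf("mat", doc_counts[0], doc_freq, len(documents))
```

In this toy corpus `the` appears in every document, so its score is zero, while `mat` appears in only one document and gets a positive score.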
This article is part of the project Periodic Table of NLP Tasks.