04 - Generating a Corpus

A corpus is a language resource consisting of a structured set of documents and additional information. It incorporates pre-processed documents and their metadata, which might be the output of other NLP tasks.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com

05 Sep 2020 • 1 min read

A corpus is a language resource consisting of a structured set of documents and additional information. The documents can be pre-processed, so a corpus might itself be the output of other NLP tasks. It is important to know the corpus balance: what the corpus represents, which languages, genres and domains it covers, and for which NLP tasks it can be used. A convenient starting point is the set of corpora built into the NLTK library, which you can load directly.
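To make the idea of corpus balance concrete, here is a minimal sketch that measures how a corpus is distributed over its metadata. The `Document` class, its fields, and the `corpus_balance` helper are hypothetical names chosen for illustration, and the four sample documents are made up; a real corpus would bring its own metadata schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    """One corpus entry: the text plus metadata (hypothetical fields)."""
    text: str
    language: str
    genre: str
    domain: str


def corpus_balance(documents, attribute):
    """Return the share of documents per value of one metadata attribute."""
    counts = Counter(getattr(doc, attribute) for doc in documents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Made-up sample corpus, only to show the shape of the data.
corpus = [
    Document("The court ruled on Tuesday.", "en", "news", "legal"),
    Document("De rechter deed dinsdag uitspraak.", "nl", "news", "legal"),
    Document("Stocks rallied after the announcement.", "en", "news", "finance"),
    Document("Great hotel, friendly staff!", "en", "review", "travel"),
]

language_share = corpus_balance(corpus, "language")
genre_share = corpus_balance(corpus, "genre")
print(language_share)
print(genre_share)
```

A skewed result, such as three quarters of the documents being English news, is a hint that models trained on the corpus may not transfer well to other languages or genres.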

Figure: Wikipedia as a source for a corpus.

Wikipedia is one of the most popular open data sources for building a corpus. Select and download a Wikipedia dump, clean the articles with wikiextractor, and you have a great starter corpus. Common Crawl is also often used, but you need to be able to handle its much larger volume of data.
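Once the dump has been cleaned, the articles still have to be read into your own pipeline. The sketch below assumes wikiextractor was run with its JSON output option, so that each output file (named like `AA/wiki_00`) holds one JSON object per line with at least `title` and `text` keys; check the output of your own run, since the layout may differ between versions. The `iter_articles` helper is a hypothetical name, and the demo writes a tiny stand-in file instead of using a real dump.

```python
import json
import tempfile
from pathlib import Path


def iter_articles(extracted_dir):
    """Yield (title, text) pairs from JSON-lines files under extracted_dir.

    Assumes files named wiki_* with one JSON object per line
    (an assumption about the extractor's output layout).
    """
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                text = record.get("text", "").strip()
                if text:  # skip pages without any extracted text
                    yield record.get("title", ""), text


# Demo on a made-up stand-in for the extractor output.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "AA" / "wiki_00"
    shard.parent.mkdir(parents=True)
    records = [
        {"id": "1", "url": "", "title": "Corpus", "text": "A corpus is a set of documents."},
        {"id": "2", "url": "", "title": "Empty page", "text": ""},
    ]
    shard.write_text(
        "\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8"
    )
    articles = list(iter_articles(tmp))

print(articles)
```

Using a generator keeps memory use flat, which matters because a full Wikipedia dump does not fit comfortably in memory as a list of strings.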

This article is part of the Periodic Table of NLP Tasks project. Read more about the making of the Periodic Table and the project to systemize NLP tasks.
