Source Data Loading

04 - Generating a Corpus

A Corpus is a language resource consisting of a structured set of documents and additional information. It can include pre-processed documents and their metadata, which may themselves be the output of other NLP tasks.

Because a corpus can consist of pre-processed documents, it may itself be the output of other NLP tasks. It is important to know the Corpus Balance: what the corpus represents, which languages, genres and domains it covers, and which NLP tasks it can be used for. An easy way to get started is to load one of the corpora that come with the NLTK library.
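To make the idea of Corpus Balance concrete, here is a minimal sketch that measures what share of a corpus's tokens falls under each language or genre. The `Document` class and the `balance` and `load_brown_sample` helpers are hypothetical names introduced for this illustration; only the calls on `nltk.corpus.brown` belong to NLTK. The sketch assumes that every document carries its metadata in a plain dictionary.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """One corpus entry: pre-processed tokens plus descriptive metadata."""
    doc_id: str
    tokens: list
    metadata: dict = field(default_factory=dict)


def balance(corpus, key):
    """Return the share of tokens per value of one metadata field."""
    counts = Counter()
    for doc in corpus:
        counts[doc.metadata.get(key, "unknown")] += len(doc.tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def load_brown_sample(max_docs_per_genre=2):
    """Build Documents from NLTK's Brown corpus (assumes `pip install nltk`)."""
    import nltk  # imported lazily so the rest of the sketch runs without NLTK
    nltk.download("brown", quiet=True)  # fetches the corpus data on first use
    from nltk.corpus import brown

    docs = []
    for genre in brown.categories():
        for fileid in brown.fileids(categories=genre)[:max_docs_per_genre]:
            tokens = list(brown.words(fileids=fileid))
            docs.append(Document(fileid, tokens, {"language": "en", "genre": genre}))
    return docs


# A tiny hand-made corpus, so the example runs without any download.
corpus = [
    Document("d1", ["the", "cat", "sat"], {"language": "en", "genre": "fiction"}),
    Document("d2", ["de", "kat"], {"language": "nl", "genre": "fiction"}),
    Document("d3", ["stocks", "fell", "sharply", "today", "again"],
             {"language": "en", "genre": "news"}),
]
print(balance(corpus, "language"))  # {'en': 0.8, 'nl': 0.2}
```

Calling `balance(load_brown_sample(), "genre")` would produce the same kind of report for a sample of the Brown corpus. Counting tokens rather than documents is a design choice: a few very long documents can otherwise dominate a corpus without showing up in the numbers.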

Wikipedia as a source for a corpus

Wikipedia is one of the most popular open data sources for building a corpus. Select and download a Wikipedia dump, clean the articles with wikiextractor, and you have a great starter corpus. Common Crawl is also often used, but you have to be able to handle the sheer volume of data.
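Once a dump has been cleaned, the articles still have to be read into your own pipeline. The sketch below assumes wikiextractor was run with its JSON output option (for example `python -m wikiextractor.WikiExtractor <dump-file> --json -o extracted`), which is expected to write files named like `extracted/AA/wiki_00` with one JSON object per line holding `id`, `title` and `text`. The `iter_articles` helper and the `extracted` directory name are hypothetical; check the layout your wikiextractor version actually produces.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(extracted_dir: Path, min_chars: int = 1) -> Iterator[dict]:
    """Yield one {"id", "title", "text"} dict per extracted article.

    Articles are streamed line by line, so the whole dump never has to
    fit in memory at once.
    """
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                text = record.get("text", "").strip()
                # Skip articles whose cleaned text is empty or too short.
                if len(text) >= min_chars:
                    yield {
                        "id": record.get("id"),
                        "title": record.get("title"),
                        "text": text,
                    }
```

A loop such as `for article in iter_articles(Path("extracted")): ...` then hands each cleaned article to the next step. The same streaming pattern matters even more for sources the size of Common Crawl.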



This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.