Source Data Loading 07 - Text Extraction and OCR: Extracting text and turning it into qualitative data is challenging when all you have is the output format (a PDF or an image) and you need to reduce it back to its source format, plain textual data.
Source Data Loading 06 - Text and File Scraping: When there is no corpus or API, you have to scrape your textual data or files from the web. Overcome the challenges of IP blocking, cookie walls, request headers and JavaScript-rendered websites.
Source Data Loading 05 - Loading from API: An API serves as the interface between different applications. The requester gets access to the data automatically, with the benefit that the source does not have to know exactly how the other system works.
Source Data Loading 04 - Generating a Corpus: A corpus is a language resource consisting of a structured set of documents and additional information. It incorporates pre-processed documents and their metadata, which might be the output of other NLP tasks.
Source Data Loading 03 - Loading Structured Datafile: Structured data implies ready-to-use data, but you still have to interpret its schema.
Source Data Loading 02 - Manual Typewriting: Typing your own text gives you more confidence when testing your code or a demo.
Source Data Loading 01 - Bits to Character Encoding: Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). Fix or load your data by choosing the right encoding.
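
The idea behind the encoding entry (01) can be sketched in a few lines of Python. This is a minimal illustration, not code taken from the posts themselves: the sample string, the `load_text` helper and its list of candidate encodings are all assumptions made for the example. It shows that the same bytes give different characters under different encodings, and one simple way to load data when the encoding is not known in advance.

```python
# The same text, stored as bytes under UTF-8.
raw = "café".encode("utf-8")

# Decoding with the right encoding recovers the original characters.
good = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, it silently
# produces garbled characters (mojibake).
wrong = raw.decode("latin-1")

# If you know which wrong encoding was applied, the damage can be undone
# by reversing that step and decoding correctly.
repaired = wrong.encode("latin-1").decode("utf-8")


def load_text(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: try candidate encodings in order.

    Returns (text, encoding_used). The candidate list is an assumption;
    adapt it to the sources you actually expect. latin-1 accepts every
    byte value, so as the last entry it acts as a catch-all.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = load_text("café".encode("cp1252"))
print(text, used)
```

Trying strict encodings first and a permissive one last is a deliberate choice: UTF-8 rejects most byte sequences that were written in a legacy encoding, so a successful UTF-8 decode is a fairly strong signal, whereas a successful latin-1 decode says nothing about whether the result is correct.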