Innerdoc

Word Processing

23 - Negation Recognizer

Ignoring a negation flips the polarity that your analysis assigns to a text.
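
As a minimal sketch of the idea, the snippet below marks every token that follows a negation cue with a NOT_ prefix until the next punctuation mark. The cue list, the function name mark_negation and the NOT_ prefix are assumptions chosen for illustration, not a complete negation recognizer.

```python
import re

# Illustrative cue list (an assumption); real recognizers use larger lexicons.
NEGATION_CUES = {"not", "no", "never", "cannot"}
PUNCTUATION = set(".,;!?")

def mark_negation(text):
    """Prefix tokens after a negation cue with NOT_ until punctuation ends the scope."""
    tokens = re.findall(r"[a-z']+|[.,;!?]", text.lower())
    marked, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATION_CUES or tok.endswith("n't"):
            negated = True           # a cue opens a new negation scope
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negated else tok)
    return marked

print(mark_negation("I did not like the movie, but the music was good."))
```

A downstream sentiment model can then treat "NOT_like" and "like" as different features.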

Word Processing

22 - Spell Checker

Spell checkers can recommend corrections at the word level.
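
One simple way to sketch word-level suggestions is to compare a word against a known vocabulary with Python's standard difflib module. The vocabulary, the function name suggest and the cutoff value are assumptions for illustration only.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; in practice this would be a full word list.
VOCABULARY = ["tokenization", "lemmatization", "stemming", "normalization"]

def suggest(word, vocabulary=VOCABULARY, n=3, cutoff=0.8):
    """Return up to n similar vocabulary words, best match first; [] for known words."""
    word = word.lower()
    if word in vocabulary:
        return []                    # known word, nothing to correct
    return get_close_matches(word, vocabulary, n=n, cutoff=cutoff)

print(suggest("lemmatizaton"))
```

This approach only looks at string similarity, so it cannot catch a correctly spelled word used in the wrong context.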

Word Processing

21 - Normalization

Besides stemming or lemmatizing, words may still need to be rewritten to a more standard form.

Word Processing

20 - Lemmatization

Lemmatization usually refers to properly rewriting a word to its base form (lemma).

Word Processing

19 - Stemming

Stemming refers to a crude heuristic process that chops off the ends of words in the hope that words with the same meaning end up with the same surface form.
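
To make the "chopping" concrete, here is a toy suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list, the function name crude_stem and the minimum stem length are assumptions for illustration.

```python
# Illustrative suffix list (an assumption), checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["jumping", "jumped", "jumps", "running", "sing"]:
    print(w, "->", crude_stem(w))
```

The result for "running" is "runn", which shows why stemming is called crude: the output does not have to be a real word.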

Word Parsing

18 - Dependency Parser

A Dependency Parser extracts a dependency graph from a sentence. In the graph, the grammatical structure (such as subject and object) and the relationships between words are represented by dependency tags.

Word Parsing

17 - Part-of-Speech Tagger

The syntactic function of a word, such as noun or verb, is captured by its Part-of-Speech (POS) tag and depends on the context.

Word Parsing

16 - Morphological Tagger

Assigning additional morphological information clarifies the grammatical meaning of a word, in addition to its syntactic function.

Word Parsing

15 - Vocabulary Building

The goal of tokenization is a vocabulary. Word-based tokenizers need larger vocabularies and still encounter more Out-Of-Vocabulary (OOV) words than sub-word tokenizers.
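
A word-level vocabulary and its OOV behaviour can be sketched in a few lines. The names build_vocab and encode, the "<unk>" placeholder and the min_freq parameter are assumptions for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, unk="<unk>"):
    """Map each sufficiently frequent token to an integer id; id 0 is reserved for OOV."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {unk: 0}
    for tok, freq in counts.most_common():   # most frequent tokens get the lowest ids
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, unk="<unk>"):
    """Replace tokens by their ids, falling back to the OOV id for unknown tokens."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
print(encode(["the", "bird", "sat"], vocab))   # "bird" is out-of-vocabulary
```

Sub-word tokenizers reduce the number of OOV hits by splitting unseen words into known pieces, which this word-level sketch does not do.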

Word Parsing

14 - Tokenization

Many NLP tasks rely on intensive matrix calculations, which operate on word IDs rather than on the words themselves. For this, raw text is split up into tokens that represent (sub)words.
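
The splitting step can be sketched with a regular expression that separates words from punctuation. The function name tokenize and the pattern are assumptions for illustration; sub-word schemes are not shown here.

```python
import re

def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    # \w+ matches runs of letters, digits and underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Raw text, split up!"))
```

Note that this simple pattern also splits contractions such as "don't" into three tokens, which may or may not be what a given task needs.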

Training Data Generation

13 - Rule-Based Training Data

Programmatically build training datasets by defining heuristic rules which are used in functions for labeling training data.
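
A minimal sketch of this idea: each heuristic rule becomes a small labeling function that either votes for a label or abstains, and the votes are combined by majority. The rules, the label names and the function weak_label are hypothetical examples, not a specific framework's API.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule does not apply

def lf_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_broken(text):
    return "complaint" if "broken" in text.lower() else ABSTAIN

def lf_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_broken, lf_thanks]

def weak_label(text):
    """Combine the non-abstaining votes by simple majority; None if every rule abstains."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("The item arrived broken, I want a refund"))
```

Labels produced this way are noisy, so it helps to check a sample by hand before training on them.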

Training Data Generation

12 - Textual Data Augmentation

Boost your model's performance by creating new examples from existing data, instead of collecting new data.
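
Synonym replacement is one simple augmentation technique. The sketch below uses a tiny hand-written synonym table; the table, the function name augment and its parameters are assumptions for illustration.

```python
import random

# Hypothetical synonym table; a real setup would use a thesaurus or embeddings.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def augment(tokens, p=0.5, seed=None):
    """Replace each token that has synonyms with a random synonym, with probability p."""
    rng = random.Random(seed)        # a local generator keeps runs reproducible
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out

print(augment(["a", "quick", "and", "happy", "dog"], p=1.0, seed=0))
```

Each call with a different seed yields another variant of the same sentence, which keeps its label.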

Training Data Generation

11 - Crowdsourcing Marketplace

Creating training data is a labor-intensive task. Fine-tune the training data definition yourself and then scale up by outsourcing to remote workers.

Training Data Generation

10 - Training Data Provider

Gold data contains the ground truth. Re-use available resources, but be careful that the dataset matches your purpose.

Training Data Generation

09 - Annotation with Active Learning

Use an annotation tool that benefits from active learning to enforce a robust annotation process and balanced annotations.
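
Uncertainty sampling is one common active learning strategy: the examples the current model is least sure about are sent to the annotator first. The sketch below assumes a binary classifier that outputs a positive-class probability per example; the function name most_uncertain is hypothetical and is not taken from any particular annotation tool.

```python
def most_uncertain(probabilities, k=2):
    """Return the indices of the k examples whose probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Model scores for four unlabeled examples (made-up numbers for illustration).
scores = [0.95, 0.52, 0.10, 0.45]
print(most_uncertain(scores, k=2))
```

After the selected examples are annotated, the model is retrained and the selection is repeated.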

Training Data Generation

08 - Manual Annotation

Nobody wants to do the manual labor of tagging. Everybody wants to build language models with annotated training data.

Source Data Loading

07 - Text Extraction and OCR

Extracting text and turning it into high-quality data is challenging when you only have the output format (PDF or image) and need to get back to the underlying textual data.

Source Data Loading

06 - Text and File Scraping

The lack of a Corpus or API requires you to scrape your textual data or files from the web. Overcome the challenges of IP blocking, cookie walls, request headers and JavaScript-rendered websites.

Source Data Loading

05 - Loading from API

An API serves as the interface between different applications. The requester gets automated access to data, with the benefit that neither side has to know exactly how the other system works.

Source Data Loading

04 - Generating a Corpus

A Corpus is a language resource consisting of a structured set of documents and additional information. It incorporates pre-processed documents and their meta-data, which might be the output of other NLP tasks.

Source Data Loading

03 - Loading Structured Datafile

Structured data implies ready-to-use data, but you have to interpret its schema.
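
Interpreting the schema mostly means deciding what each column is and which type it should have. A minimal sketch with Python's csv module, assuming a hypothetical file with the columns id, text and label:

```python
import csv
import io

def load_rows(csv_text):
    """Parse CSV text and convert each column to the type the schema implies."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "id": int(row["id"]),            # stored as text, used as integer
            "text": row["text"],
            "label": row["label"] or None,   # empty cell means "not labeled yet"
        }
        for row in reader
    ]

sample = 'id,text,label\n1,Great product,positive\n2,"Late, and broken",\n'
for row in load_rows(sample):
    print(row)
```

The same function works on a real file by passing in its contents, for example the result of open(path, newline="").read().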

Source Data Loading

02 - Manual Typewriting

Typing your own text gives you more confidence when testing your code or a demo.
