Word Processing | 23 - Negation Recognizer: Ignoring a negation will flip the perceived polarity of your text.
Word Processing | 21 - Normalization: Besides stemming or lemmatizing, you may still need to rewrite words into a more standard form.
Word Processing | 20 - Lemmatization: Lemmatization usually refers to properly rewriting a word to its base form (lemma).
Word Processing | 19 - Stemming: Stemming refers to a crude heuristic process that chops off the ends of words in the hope that words with the same meaning end up with the same form.
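To make the stemming entry above concrete, here is a minimal sketch of a suffix-chopping stemmer. The suffix list and the minimum stem length are assumptions chosen for illustration; this is not the Porter algorithm or any library's implementation.

```python
# Hypothetical suffix list, ordered longest-first so that a longer suffix
# ("ations") is tried before a shorter one it contains ("s").
SUFFIXES = ("ations", "ation", "ingly", "edly", "ing", "ies", "ed", "es", "ly", "s")

# Assumed lower bound on what must remain after chopping, so that very
# short words such as "is" are left alone.
MIN_STEM_LEN = 3


def crude_stem(word: str) -> str:
    """Chop the first matching suffix off the lowercased word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("connected", "connecting", "connections", "studies", "study"):
        print(w, "->", crude_stem(w))
```

With these rules "connected" and "connecting" collapse to the same stem, while "studies" and "study" do not. That mismatch is the kind of error a crude heuristic makes and a lemmatizer is meant to avoid.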
Word Parsing | 18 - Dependency Parser: A Dependency Parser extracts a dependency graph from a sentence. In the graph, the grammatical structure (such as subject and object) and the relationships between words are represented by dependency tags.
Word Parsing | 17 - Part-of-Speech Tagger: The syntactic function of a word, like noun or verb, is given by its Part-of-Speech (POS) tag and depends on the context.
Word Parsing | 16 - Morphological Tagger: Assigning additional morphological information clarifies the grammatical meaning of a word, in addition to its syntax.
Word Parsing | 15 - Vocabulary Building: The goal of tokenization is a vocabulary. Word-based tokenizers have larger vocabularies and run into more Out-Of-Vocabulary (OOV) words than sub-word tokenizers.
Word Parsing | 14 - Tokenization: Many NLP tasks make use of intensive matrix calculations, for which word IDs are used rather than words. For this, raw text is split up into tokens that represent (sub)words.
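The tokenization and vocabulary entries above can be sketched together: split raw text into tokens, give every token an integer ID, and map anything unseen to a single Out-Of-Vocabulary ID. This is a minimal word-based sketch using only the standard library. The regex, the `<unk>` placeholder and the function names are assumptions for illustration, not a specific tokenizer's API.

```python
import re
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words


def tokenize(text):
    """Lowercase, then split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(texts, min_count=1):
    """Map each token seen at least min_count times to an integer ID; ID 0 is reserved for UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {UNK: 0}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn text into a list of token IDs, falling back to the UNK ID for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


if __name__ == "__main__":
    vocab = build_vocab(["the cat sat", "the dog sat"])
    print(vocab)
    print(encode("the bird sat", vocab))  # "bird" was never seen, so it gets the UNK ID
```

A sub-word tokenizer would instead split "bird" into smaller known pieces, which is why it produces fewer OOV tokens with a smaller vocabulary.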
Training Data Generation | 13 - Rule-based Training Data: Programmatically build training datasets by defining heuristic rules that are used in functions for labeling training data.
Training Data Generation | 12 - Textual Data Augmentation: Boost your performance by creating data out of existing data, instead of collecting new data.
Training Data Generation | 11 - Crowdsourcing Marketplace: Creating training data is a labor-intensive task. Fine-tune the training data definition yourself and then scale up by outsourcing to remote workers.
Training Data Generation | 10 - Training Data Provider: Gold data contains the ground truth. Re-use available resources, but be careful that the dataset matches your purpose.
Training Data Generation | 09 - Annotation with Active Learning: Use an annotation tool that benefits from active learning to enforce a robust annotation process and balanced annotations.
Training Data Generation | 08 - Manual Annotation: Nobody wants to do the manual labor of tagging. Everybody wants to build language models with annotated training data.
Source Data Loading | 07 - Text Extraction and OCR: Extracting text and transforming it into high-quality data is challenging when all you have is the output format (PDF or image) and you need to get back to the underlying textual data.
Source Data Loading | 06 - Text and File Scraping: The lack of a Corpus or API requires you to scrape your textual data or files from the web. Overcome the challenges of IP blocking, cookie walls, request headers and JavaScript-rendered websites.
Source Data Loading | 05 - Loading from API: An API serves as the interface between different applications. The requestor automatically gets access to data, with the benefit that the source doesn't have to know exactly how the other system works.
Source Data Loading | 04 - Generating a Corpus: A Corpus is a language resource consisting of a structured set of documents and additional information. It incorporates pre-processed documents and their metadata, which might be the output of other NLP tasks.
Source Data Loading | 03 - Loading Structured Datafile: Structured data implies ready-to-use data, but you still have to interpret its schema.
Source Data Loading | 02 - Manual Typewriting: Typing your own text gives you more confidence when testing your code or a demo.