Documents: 42 - Language Identification. Identifying the language of a text is often the first step, because it determines which language-specific model to apply.
Documents: 41 - Meta-Info Extractor. Extracting text from a file should be accompanied by extracting its meta-information.
Documents: 40 - Raw Text Cleaning. Pre-processing text to improve the quality of subsequent NLP tasks.
Documents: 39 - Deduplication. Finding texts that are exactly the same or highly similar. Similarity can be measured lexically (overlap of surface forms) or semantically (distance between embeddings).
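
The two lexical cases named in the deduplication entry (exact duplicates and highly similar texts) can be sketched as follows. This is a minimal illustration, not the method of the linked pattern: the function names, the word-trigram shingles, and the 0.8 Jaccard threshold are all assumptions chosen for the example, and embedding-based semantic similarity is not covered.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text; equal fingerprints mean exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping word n-grams (assumed shingle size: 3 words)."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Lexical similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def deduplicate(texts, threshold: float = 0.8):
    """Keep the first occurrence of each text; drop exact and near duplicates.

    The threshold is a hypothetical default, to be tuned per corpus.
    The pairwise comparison is quadratic, so this only suits small collections.
    """
    kept = []
    seen = set()
    for text in texts:
        fp = fingerprint(text)
        if fp in seen:
            continue  # exact duplicate after normalization
        if any(jaccard(text, other) >= threshold for other in kept):
            continue  # near duplicate by lexical overlap
        seen.add(fp)
        kept.append(text)
    return kept
```

For larger corpora, the pairwise loop is usually replaced by an approximate scheme such as MinHash with locality-sensitive hashing. The sketch keeps the exhaustive comparison only because it is easier to read.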