Innerdoc

Custom Solutions for Businesses

Wed, 10 Mar 2021 10:00:00 GMT

Some of our core tasks:

Setting up a Project 🦉

Getting an Understanding of the Product Vision and Context of the larger application or business process where the model will be used
Defining what Accuracy is needed to achieve the usage goals
Defining what the Annotation Scheme and Corpus construction should look like to facilitate consistent annotations and easy decisions from the models
Initiating the Annotation process for human annotators, semi-automated annotations and a good quality control process
Build the Model Architecture, train and evaluate it
Iterate above steps and improve code and data with smart choices, parameter tuning, tricks and sweat

Gathering and Ingesting Textual Data 💬

Extracting textual data from your CMS, ERP and other information systems (Database queries and API calls)
Automated Scraping of content from websites
Handling data files like csv, json, xml, xlsx and many more
Interpreting document types like pdf, docx, rtf, txt, epub, ppt and more
Interpreting special document-types like emails, contracts, invoices, resumes, patient files and other Templated formats
Creating a domain specific Corpus

Processing Textual Data ✨

Parsing different languages like Dutch and English
Finding sentences, words and punctuation.
Recognizing word types (noun, verb, etc), syntactic types (object, subject, etc.), base forms of words (the lemma of ‘was’ is ‘be’)
Labeling Named Entities (NER) like Persons, Organizations (proper and common!), Locations, Places, Nationalities, etc.
Labeling Named Value notations like Money, Email Addresses, URLs, Time, Phone Numbers, Quantity, etc.
Finding Similarity between words, sentences, paragraphs, documents for recommendation and automated learning.
Labeling documents with your domain-specific tags for Text Classification

Advanced Analytics Projects 🚀

We train custom language Parsers that are optimized for your domain-specific content (Retail, Healthcare, Legal, etc)
We create advanced analyses for information extraction with classification, clustering and visualization of the results
We find relations among concepts by statistical and rule-based search based on linguistic features
We provide insight into your textual data so you can make the right decisions

We are ready to help you! Contact us at hey@innerdoc.com

Datashare

Wed, 10 Mar 2021 10:00:00 GMT

Datashare

OSINT

Timeline Demo for a History of NLP

Thu, 04 Feb 2021 10:00:00 GMT

This demo is created to show the integration of Streamlit with TimelineJS. It became a timeline about the history of Natural Language Processing!

Visit demo!

Detecting Emotions in Christmas Lyrics

Mon, 21 Dec 2020 10:00:00 GMT

It started as an idea for a fun pre-Christmas project. With the help of Plutchik’s Wheel of Emotions and zero-shot emotion classification with Transformers, it became this demo.

Visit demo!

Periodic Table of Natural Language Processing Tasks

Tue, 01 Dec 2020 10:00:00 GMT

Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic Table style!

The variation and structure of NLP tasks is endless. Still, you can think about building NLP Pipelines out of standard NLP tasks and dividing them into groups. But what do these tasks entail?

TASK GROUPS

👆 _{click links for tasks}

Source Data Loading 💾 Tasks from this group care about the textual data to do NLP analysis on.
Training Data Generation 💎 Generating gold data which is needed for training language models.
Word Parsing 🔬 Splitting text to tokens and creating the first structured metadata per token.
Word Processing 🛠 Improving the token format on a lexical level.
Phrases and Entities 🗣 Recognizing multi-token phrases.
Entity Enriching 🏛 Enriching entities with structured meta-data.
Sentences and Paragraphs 📃 Working with the sense and coherence of words on a sentence- and paragraph-level.
Documents 📚 Handling the unity of textual data on the level of documents.
Natural Language Models 💃 Steps for building a respectable language model.
Supervised Classification 🚩 Classifying textual data on all kinds of levels.
Unsupervised Signaling 🙈 Unsupervised discovery of important signals.
Similarity 👯 Calculating the closeness of different chunks of text.
Natural Language Generation 🤖 Producing content as if it’s written by humans.
Systems 🚀 NLP systems as a basis for interactive applications.
Information Visualization ✨ Visualizing textual information to better understand complex textual data.

ABOUT THIS PROJECT

I have tried to make the Periodic Table of NLP tasks as complete as possible. It’s therefore more a long-read than some self-contained blog articles.

The set-up and composition of the Periodic Table is subjective. The division of tasks and categories could have been done in multiple other ways. I appreciate your feedback and new ideas in the form below. I tried to make a clear and short description for each task. I omitted the deeper details, but provided links to extra information where possible. If you have improvements, you can send them in the form below or you can contact me on LinkedIn.

CREATE YOUR OWN PERIODIC TABLE

After continuously editing the project table, I made a ‘Periodic Table Creator’ to dynamically rebuild the figure over and over again. The code is available on github.com/innerdoc. I built it with the help of Streamlit and, inspired by Bokeh examples, it became a dynamic creator that can be customized to your own Periodic Table!

Feel free to use the Periodic Table Creator!


Demo Time!

Sun, 29 Nov 2020 10:00:00 GMT

TIMELINE DEMO FOR A HISTORY OF NLP

This demo is created to show the integration of Streamlit with TimelineJS. It became a timeline about the history of Natural Language Processing!

CHRISTMAS LYRICS EMOTION DETECTOR

It started as an idea for a fun pre-Christmas project. With the help of Plutchik’s Wheel of Emotions and zero-shot emotion classification with Transformers, it became this demo.

CREATE YOUR OWN PERIODIC TABLE

After continuously editing the project table, I made a ‘Periodic Table Creator’ to dynamically rebuild the figure over and over again. I built it with the help of Streamlit and, inspired by Bokeh examples, it became a dynamic creator that can be customized to your own Periodic Table!

SCATTERTEXT DEMO

How you doin'? Typology for Affective Meaning

Sat, 28 Nov 2020 10:00:00 GMT

Why compute Affective Meaning?

Applications can be Detecting:

sentiment towards politicians, products, countries, ideas
frustration of callers to a help line
stress in drivers or pilots
depression and other medical conditions
confusion in students talking to e-tutors
emotions in novels (e.g., for studying groups that are feared over time)

Could we generate:

emotions or moods for literacy tutors in the children’s storybook domain
emotions or moods for computer games
personalities for dialogue systems to match the user

Relying on language as an indicator of psychological well-being and a conduit for emotions more generally also has a long tradition in clinical psychology.

Two families of theories of Emotion

Dimensions of Emotion
Affect can vary in Valence and Arousal. Valence can be expressed in positive/pleasant or negative/unpleasant. Arousal can be expressed in strong/activated or weak/deactivated.

Plutchik’s wheel of Emotion
A list of 8 basic emotions in four opposing pairs:

Joy – Sadness
Anger – Fear
Trust – Disgust
Anticipation – Surprise

The Big Five Dimensions of Personality

Extraversion vs. Introversion
sociable, assertive, playful vs. aloof, reserved, shy
Emotional stability vs. Neuroticism
calm, unemotional vs. insecure, anxious
Agreeableness vs. Disagreeable
friendly, cooperative vs. antagonistic, faultfinding
Conscientiousness vs. Unconscientious
self-disciplined, organised vs. inefficient, careless
Openness to experience
intellectual, insightful vs. shallow, unimaginative

Scherer’s typology of affective states

Emotion: relatively brief episode of synchronized response of all or most organismic subsystems in response to the evaluation of an event as being of major significance
angry, sad, joyful, fearful, ashamed, proud, desperate

Mood: diffuse affect state …change in subjective feeling, of low intensity but relatively long duration, often without apparent cause
cheerful, gloomy, irritable, listless, depressed, buoyant

Interpersonal stance: affective stance taken toward another person in a specific interaction, coloring the interpersonal exchange
distant, cold, warm, supportive, contemptuous

Attitudes: relatively enduring, affectively colored beliefs, preferences, predispositions towards objects or persons
liking, loving, hating, valuing, desiring

Personality traits: emotionally laden, stable personality dispositions and behavior tendencies, typical for a person
nervous, anxious, reckless, morose, hostile, envious, jealous

How can we model the Lexical Semantics?

sentiment
emotion
personality
mood
attitudes

Sentiment vs. Affective Meaning 🗣

The expansion of the area of Sentiment Analysis has resulted in a new interest in the quantification of opinion, sentiment, affect, feeling, emotion, personality, mood and attitude. These terms are often used interchangeably.

We differentiate sentiment from affective meaning based on their duration. Sentiment lives longer, while an affective state has a short term duration. An effective sentiment analysis system is one which captures the sentiment of the opinion about an entity. The recognition of Affective Meaning is for example the automatic discovery of an emotional reaction, often of a single person. Unlike opinions, emotions are short-term.

Understanding Sentiment Analysis

Thu, 26 Nov 2020 10:00:00 GMT

Sentiment Analysis can be considered a subfield of information extraction. It is used in a wide range of areas and is sometimes referred to as Opinion Mining. Sentiment Analysis attempts to determine the overall attitude (positive or negative) expressed within a text. Its purpose is to represent emotional or affective meaning.

The technique’s success lies in an imperative need to standardize the measurement of human emotion in social media in order to efficiently monetize it. Signalling a user’s preference is a vital component of the Like economy that underpins the business models of all major social media companies. Peer recommendations are highly trusted by other peers. This makes Sentiment Analysis a valuable source of information.

Scoring 🎯

The simplest scale for sentiment scoring is a binary (positive–negative) or three-way (positive–neutral–negative) categorization. Alternatively, a score between –1 and 1 provides a more detailed range from very negative to very positive.

The level at which sentiment is scored depends on the needs. Sentiment classification at document level is often seen, although applying sentiment analysis at the level of sentences, phrases or named entities is also common.

The simplest approach for scoring a sentiment is a calculation based on a lexicon. This is a precompiled wordlist of terms that indicate positive or negative expressions of sentiment. Summing the positive and negative hits within a document will give the score. In some cases, a simple majority decides the final labeling of the post. This can be enhanced by recognizing negation, applying rule-based predicate structures or by preprocessing steps like lemmatization.
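The lexicon approach can be sketched in a few lines of plain Python. This is only a minimal illustration: the word lists, the negation rule (flip the polarity when the previous token is a negation word) and the function names are assumptions made for the example, not a real sentiment lexicon.

```python
import re

# Hypothetical toy lexicon; a real lexicon holds thousands of terms.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "poor", "hate", "slow"}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / all hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = neg = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity == 0:
            continue
        # Simple negation rule: flip the polarity if the previous token negates.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        if polarity > 0:
            pos += 1
        else:
            neg += 1
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def label(score):
    """Map a score to a positive / neutral / negative label (simple majority)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(lexicon_score("not good, service was slow"))` gives `"negative"`, because the negated "good" counts as a negative hit.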

A more elaborate strategy would be a classification algorithm that interprets on the level of sentences or phrases. The selection of linguistic features for classification also requires choices. For example, term frequency has been found to be a poorer predictor than term presence. The usage of highly emotionally charged terms is more significant than their exact frequency. Word class, multi-word phrases, syntax, and negation have all been used as features, as have text length, exclamation marks, all caps, and character repetition.
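To make the feature choices concrete, here is a small sketch that turns a text into a feature dictionary for a classifier. The function name, the feature names and the vocabulary are assumptions for illustration; it uses term presence (0/1) instead of term frequency and adds the surface features mentioned above.

```python
import re


def surface_features(text, vocabulary):
    """Build a small feature dict for a sentiment classifier."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]
    features = {}
    for term in vocabulary:
        # Term presence (0 or 1) rather than term frequency.
        features["has_" + term] = int(term in lowered)
    features["length"] = len(tokens)
    features["exclamations"] = text.count("!")
    # All-caps tokens of at least two letters, e.g. "WORST".
    features["all_caps"] = sum(1 for t in tokens if len(t) > 1 and t.isupper())
    # Character repetition: the same character three or more times in a row.
    features["char_repetition"] = len(re.findall(r"(\w)\1{2,}", text))
    return features
```

A dictionary like this can be fed to any classifier that accepts numeric features. Note that "slow" occurring three times in a text still yields `has_slow = 1`.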

Challenges ⛰️

Often it is unclear what is measured, the polarity of a described entity or the emotional state of the writer.
Issues arise when genre- or domain-specific sentiment algorithms and dictionaries are suddenly applied to another field and context-dependent word meanings no longer fit with the original context.
The absence of detection of fake news and fraudulent reviews. Fake opinions try to deliberately mislead readers and algorithms by giving undeserving positive opinions to some target objects in order to promote the objects.
People who are skeptical whether computers are able to understand the complexities of natural language assume scoring and interpretation are the same step. While the system only delivers a score, the result should still be interpreted.
Language is easy to misinterpret. For example, misclassification can occur because sarcasm and humor are difficult to recognize.
Spoken opinions complicate sentiment analysis, because proper language structure is often ignored and signs of body language and tone of voice are not recorded.
Sentiment analysis cannot determine why someone is unhappy. On the other hand, it is not easily solved by humans either.

Applications 💶

Market Research: Understanding the voice of Customers when they express their desires, thoughts, preferences and frustrations
Public Relations: Identify opinions towards public persons or organisations
Customer Service and Support: Identify the needs to solve a customer request
Human Resources: Understand the voice of Employees
Healthcare: Understanding the feelings of Patients and measuring the effect of a chosen therapy
Stock Market: Predicting Stock prices, as they are often driven by positive or negative information
Recommender Systems: Help users to decide on products
Business Analytics: Using Sentiment Analysis for decision support systems and business process improvement.

Sentiment vs. Affective Meaning 🗣

Scattertext Project

Tue, 24 Nov 2020 10:00:00 GMT

Sometimes the problems encountered when trying to understand a text, or better, a corpus of documents, become so complex that you need to visualize it first. Here’s an interactive visualization for understanding texts: scattertext, a product of the genius of Jason Kessler.

You can see our demo of scattertext here.

81 - Knowledge Graph Visualization

Sat, 21 Nov 2020 10:00:00 GMT

A Knowledge Graph is a knowledge base with interlinked descriptions of entities. This can be used to put data into context and enhance search engines. Keywords and Named Entity Recognition in combination with relation extraction is a good source when feeding Knowledge Graphs.

Technically, a Knowledge Graph is a network that represents multiple types of entities (nodes) and relations (edges) in the same graph. Each link of the network represents an (entity, relation, value) triplet. For example: Eiffel Tower (entity) is located in (relation) Paris (value). When you know A relates to B and B relates to C, then you automatically profit from the advantage that there is an indirect connection between A and C.

^{Knowledge Graph visualization (source)}

You can build your own network in Python with NetworkX and draw it with pyplot from Matplotlib. If data gets bigger, you need to scale up to a Graph Database like Grakn, ArangoDB or Neo4j.
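To show the idea of triplets and indirect connections without any dependencies, here is a minimal sketch in plain Python (standing in for a graph library or graph database). The example triples and the function names are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical example triples: (entity, relation, value).
TRIPLES = [
    ("Eiffel Tower", "is located in", "Paris"),
    ("Paris", "is capital of", "France"),
    ("France", "is part of", "Europe"),
]


def build_graph(triples):
    """Index triples as adjacency lists: entity -> [(relation, value), ...]."""
    graph = defaultdict(list)
    for entity, relation, value in triples:
        graph[entity].append((relation, value))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for a chain of triples that links start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, value in graph.get(node, []):
            if value not in visited:
                visited.add(value)
                queue.append((value, path + [(node, relation, value)]))
    return None  # no connection found
```

`find_path(build_graph(TRIPLES), "Eiffel Tower", "France")` returns the two triples that connect the tower to France via Paris, which is the indirect A to C connection described above.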

Fraud Detection and Exploratory Data Analysis are important use-cases. For example, graph databases were intensively used to explore complex networked data during the Panama Papers investigation.

80 - Locations on Geomap

Fri, 20 Nov 2020 10:00:00 GMT

Geocoded Named Entities can easily be mapped on a geographical map. There are several services and libraries to do the job:

With a Mapbox account (50k web map loads/month free tier) you can plot your coordinates from a Pandas dataframe to a Plotly scatter-mapbox.
GeoPandas makes working with geospatial data in Python easier. It extends the datatypes used by Pandas to allow spatial operations on geometric types.
Folium creates beautiful and interactive maps by using Python and Leaflet, a JavaScript library for interactive maps. Folium has a lot of Jupyter demo notebooks.

^{Folium chart with Python and Leaflet (source)}

79 - Events on Timeline

Thu, 19 Nov 2020 10:00:00 GMT

Plotting events chronologically on a timeline increases the insight. Some example setups for plots are:

Document Timeline: document publishing date vs document title
Sentence Timeline: date, timestamp or period vs its sentence (demo)
Dispersion Plot: the location (word offset) of a keyword in a text
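The data behind a dispersion plot is simply a list of word offsets per keyword. A minimal sketch, assuming a naive regex tokenizer and a hypothetical function name:

```python
import re


def keyword_offsets(text, keywords):
    """Return {keyword: [word offsets]} for each keyword (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {k.lower() for k in keywords}
    offsets = {k: [] for k in wanted}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets
```

Each offset becomes one tick on the x-axis of the plot, with one row per keyword.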

^{Dispersion plot for Game of Thrones keywords (source)}

Some libraries for inspiration: Seaborn stripplot, Yellowbrick dispersion plot, NLTK dispersion plot, Calmap heatmap per day from pandas timeseries, Knightlab JavaScript timeline for data in Google Sheets.

78 - Word Embedding Visualization

Wed, 18 Nov 2020 10:00:00 GMT

Visualizing Word Embeddings is often done to inspect the embedding and experience the cohesiveness of a subset of the embedding. It is all about dimension reduction; how to get a 2-D chart from e.g. a 300 dimensional embedding. Three often seen dimension reduction techniques:

t-SNE (t-Distributed Stochastic Neighbor Embedding) maps the multi-dimensional data to a lower dimensional space. This is computationally expensive. After this process, the input features are no longer identifiable, and you cannot make any inference based only on the output of t-SNE. Hence it is mainly a data exploration and visualization technique. t-SNE is good at preserving local context (neighbors).
PCA (Principal Component Analysis) is a linear feature extraction technique. It combines your input features in such a way that you can drop the least important components while still retaining the most valuable parts of all of the features. As an added benefit, the new features or components created by PCA are all uncorrelated with one another.
UMAP (Uniform Manifold Approximation and Projection) has some advantages over t-SNE. The most important are its increased speed and its better preservation of both the data’s local (neighbors) and global (clusters) structure.

Scattertext is a famous package for finding distinguishing terms in corpora, and presenting them in an interactive, HTML scatter plot. This is done by visualizing the difference and overlap of two categories of documents. You can try a demo about republican vs democratic speeches.

^{Scattertext visualization (source)}

Google’s TensorBoard Embedding Projector graphically represents high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. A similar but simpler library is RASA’s Whatlies that also helps to inspect your word embedding.

^{Visualization in the Tensorflow projector for the most similar words to ‘school’ (source)}

77 - Wordcloud

Tue, 17 Nov 2020 10:00:00 GMT

The Wordcloud has been around for a long time. Visualizing information is a profession in itself. So there are some best practices, but Wordclouds seem to ignore these. Here are some remarks about the (missing) elements in a Wordcloud:

Stopwords are excluded, while the word don’t has an important meaning in front of another word. Including stopwords will mess up the Wordcloud because of their high frequencies.
Multi-word expressions are not calculated. The separate words from a multi-word expression (e.g. New York Times) will be interpreted totally differently.
Different colors have no different meaning.
Vertical or horizontal words have no different meaning. The same applies to words at the top/bottom/left/right.
No context is given to clarify the sense of that word.
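The first two remarks are easy to see in the word counts that a Wordcloud is drawn from. The sketch below uses a tiny hypothetical stopword list and a naive tokenizer; it is not how any particular Wordcloud library works internally.

```python
import re
from collections import Counter

# Hypothetical tiny stopword list for the example.
STOPWORDS = {"the", "a", "i", "in", "of", "do", "don't"}


def wordcloud_weights(text, stopwords=STOPWORDS):
    """Count single words after stopword removal, as a basic Wordcloud would."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)
```

For the text "I don't like the New York Times. I like New York." the negation "don't" disappears from the counts, and "new", "york" and "times" are counted as separate words instead of as one expression.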

Although there is a lot of resistance against Wordclouds, they are still around. You can generate your Wordclouds with the stylecloud python library.

^{Compare two Wordclouds about the State of the Union 2002 vs 2011 (source)}

76 - Annotated Text Visualization

Mon, 16 Nov 2020 10:00:00 GMT

Printing text, but prettier. Often you want to show text with emphasis on specific words and their metadata. A simple plugin for Streamlit apps is this word annotations plugin. spaCy also has its Named Entity Visualizer.

^{Annotations made with a Streamlit plugin (source)}

75 - Interactive App Creation

Sun, 15 Nov 2020 10:00:00 GMT

Presenting your NLP task results should be transparent, interactive and fancy. Several solutions are available to share your data and code. Notebooks are open-source web applications that allow you to create and share documents that contain live code, equations, visualizations and narrative text. These Notebooks and web apps are flexible and you can arrange the user interface with building-blocks or plugins for your specific use-case.

With Streamlit you can code the UI in the same script as your analysis. When you save the script, the browser refreshes automatically. It’s a great way of building interactive demos. You can also deploy your scripts to the Streamlit cloud platform. The computing power in the free cloud platform is not suited for heavy apps.

^{Streamlit self-driving car demo with code on the left and browser UI with a sidebar on the right (source)}

^{Streamlit demo: Controllable face GAN generator (source)}

Jupyter Notebooks (f.k.a. IPython Notebook) facilitate in-browser interactive computing with direct results. There is also JupyterLab which is a web-based interactive development environment for Jupyter notebooks.

^{Jupyter Notebook with text-, code- and output blocks (source)}

^{JupyterLab environment (source)}

Google’s Colab (from Colaboratory) Notebooks are like Jupyter Notebooks: they are free, run in the cloud and need no setup. In Colab you can choose to run on a (light but free) GPU runtime instead of CPU.

^{Colab demo notebook (source)}

74 - E-Discovery and Media Monitoring

Sat, 14 Nov 2020 10:00:00 GMT

Electronic Discovery and (Social) Media Monitoring are tasks for doing large scale content analysis.

Electronic Discovery is the task of identifying, collecting and producing electronically stored information (ESI) in (legal) investigations. Important aspects are the performance of the system regarding the volume, combining textual data with metadata, preserving and linking the original document and keeping your analysis up-to-date with the latest documents.

^{Organizations cannot do E-discovery without NLP (source)}

(Social) Media Monitoring is the task of analyzing social media, news media or any other content like posts, blogs, articles, whitepapers, comments and conversations. It can be used to improve (social) marketing, listening and engagement.

The goal is to understand the voice of the customer, which can be in any kind of setting like the customer of your brand, or the user of your forum, or your subscriber, etc. This is done by iterating through the cycle of listen — understand — engage. Listening is the part where metrics like tone, emotions, topics, brand attitude are interpreted. Parsing text to insightful metrics might be more interesting than just counting the number of followers, likes, shares, visitors and recommendations.

In practice, you often see sentiment analysis on Twitter data. While brand and conversation audits and interpreting topics and patterns might be more interesting, they are also more complex.

^{A typical social media monitoring layout with volume, sentiment, wordcloud and mentions (source)}

73 - Knowledge Base Population

Fri, 13 Nov 2020 10:00:00 GMT

Knowledge Bases (also known as knowledge graphs or ontologies) are valuable resources for developing intelligence applications, including search, question answering, and recommendation systems. The goal of Knowledge Base Population is discovering facts about entities (NER, NEL) and building a knowledge base with it.

^{From Text to Knowledge Base (source)}

There is often an Inference Engine to complement the Knowledge Base. Together they can be seen as an Expert System. The Knowledge Base represents facts and rules. The Inference Engine applies the rules or AI model to the known facts to deduce new facts.
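A very small rule-based Inference Engine can be sketched with forward chaining: keep applying rules to the known facts until nothing new can be deduced. The facts, rules and function name below are hypothetical, and real expert systems use far richer rule languages.

```python
# Hypothetical knowledge base: plain-string facts and (premises, conclusion) rules.
FACTS = {"Eiffel Tower is in Paris", "Paris is in France", "France is in Europe"}
RULES = [
    (["Eiffel Tower is in Paris", "Paris is in France"], "Eiffel Tower is in France"),
    (["Eiffel Tower is in France", "France is in Europe"], "Eiffel Tower is in Europe"),
]


def forward_chain(facts, rules):
    """Apply rules until no new facts appear; return all known facts.

    A rule fires when all of its premises are known facts; its conclusion
    is then added as a new fact.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known
```

Here the second rule can only fire after the first one has added its conclusion, which is why the loop repeats until a full pass adds nothing.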

72 - Semantic Search Indexing

Thu, 12 Nov 2020 10:00:00 GMT

Search Engines became famous for their keyword-based information retrieval. Adding semantic information about a piece of text can increase search accuracy. Indexing not only the text but also its vector makes it possible to search for the intent and semantic meaning of the search terms, in addition to keyword search.

NMSLIB (Non-Metric Space Library) is a fast similarity search library that can find objects with a minimal (cosine) distance to other objects. When handling a question, you calculate its vector and then find the closest embedding vector in the NMSLIB index. Calculating vectors can, for example, be done with the Universal Sentence Encoder. NMSLIB has become a part of Amazon Elasticsearch Service.
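The lookup step can be illustrated with a brute-force cosine search in plain Python. This is a sketch of the idea only: a library like NMSLIB builds an index to avoid comparing against every vector, and the 3-dimensional vectors below are invented stand-ins for real sentence embeddings.

```python
import math

# Hypothetical mini index: document id -> embedding vector.
INDEX = {
    "refund policy": [0.9, 0.1, 0.0],
    "opening hours": [0.0, 0.8, 0.6],
    "return an item": [0.8, 0.3, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vector, index, k=1):
    """Brute-force search: the k (doc_id, similarity) pairs closest to the query."""
    scored = [(doc_id, cosine_similarity(query_vector, vec))
              for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With a query vector of `[1.0, 0.2, 0.0]`, the two closest entries are "refund policy" and "return an item", in that order.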

71 - Chatbot Dialogue

Wed, 11 Nov 2020 10:00:00 GMT

2016 was the year of the chatbot hype. Talking to your brand through virtual assistants was (is) the future. The challenge is to program a natural and convincing chatbot dialogue for the personas of your customers. You have to meet the customers’ needs and respond to their informal language and emojis.

Some systems to work with:

Dialogflow is Google’s development suite for creating conversational AI applications.
Wit.ai is a Facebook company and is free to use (your data will be shared with Wit for open apps).
Rasa is an open source package to build contextual assistants. You can try training a small chatbot in the Rasa playground.

^{Rasa chatbot example (source)}

70 - Question Answering

Tue, 10 Nov 2020 10:00:00 GMT

Question Answering is the task of automatically answering questions posed by humans in a natural language. There are different settings to answer a question, like abstractive, extractive, boolean and multiple-choice QA.

Extractive QA has the goal to extract a substring from the reference text. Abstractive QA has the goal to generate an answer based on the reference text; the answer might not be a substring of the reference text. Boolean questions have yes/no answers. Multiple-choice questions have several options to choose from.

^{Different QA formats (source)}

A variant to the regular question-answer is Multi-hop question answering which requires a model to gather information from different parts of a text to answer a question.

^{Multi-hop QA example (source)}

A special feature of a QA system is the option to not answer a question, or to answer ‘idk’ (I don’t know). An example is SQuAD. The SQuAD 1.0 QA training data set was created as reference texts with questions that could always be answered. The improved SQuAD 2.0 dataset was supplemented with questions that could not be answered.

As shown, different researchers treat different formats as distinct problems. But AllenAI made UnifiedQA, which is a T5 (Text-to-Text Transfer Transformer) model that was trained on all types of QA-formats. You can try their demo.

Another variant is where there is no reference text that serves the question. The required knowledge has to come from within the model itself. The knowledge is stored in the model’s parameters that it picked up during unsupervised pre-training. You can give this demo a try.

69 - Relation Extraction

Mon, 09 Nov 2020 10:00:00 GMT

Relationship extraction is the task of extracting semantic relationships from a text. A relation can be defined as a connection between entities. There are different ways of extracting relations:

Simple deduction: use the presence of two entities in the same sentence (or paragraph) as an unnamed relation.
Predicate logic: use the Dependency tags to define semantic relation-queries like (subject, verb, object) where the verb defines the relation.
Hearst Patterns: use the POS-tags to extract Hearst Patterns, which are hierarchical relations based on semantic information. Hearst Patterns are used to extract hypernym relations. A hyponym (e.g. Shakespeare) is in a type-of relationship with its hypernym (e.g. author). These are important for extracting tuples for ontologies.

^{Examples of Hearst Patterns (source)}

Word2vec similarity: use vector calculations to define relations, like in Gensim:

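    import gensim

    # Load a previously trained Word2Vec model from disk.
    model = gensim.models.Word2Vec.load('model-01')
    # Vector arithmetic: son - father + mother should land near 'daughter'.
    # The exact score depends on the trained model.
    model.wv.most_similar(positive=['mother', 'son'], negative=['father'], topn=1)

    >>> [('daughter', 0.8783684968948364)]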

Question Answering and Slot Filling: ask a question in a certain relationship-template and use the answer to fill the slot.

    Template: husband_of = ”Who is the husband of [PERSON]?”  
    Question: ”Who is the husband of Michelle?”  
    Answer  : ”Barack”  
    Relation: Barack --> husband_of --> Michelle

Transformer for Relation Extraction: use deep learning for relation extraction. TACRED, with 106k sentence-level examples and 41 relation types, and DocRED, with 107k document-level examples and 96 relation types, are good relation extraction datasets to train models on.
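As an illustration of the Hearst Patterns approach from the list above, here is a deliberately simplified sketch for the single pattern "X such as A, B and C". It is an assumption-laden toy: it uses a plain regular expression instead of POS-tags, and it only handles single-word terms.

```python
import re

# "<hypernym> such as <hyponym>, <hyponym> and <hyponym>" with single-word terms.
SUCH_AS = re.compile(
    r"(\w+),? such as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on commas and on a final 'and' / 'or'.
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs
```

For "He admired authors such as Shakespeare, Goethe and Dante." this yields three type-of tuples with "authors" as the hypernym. A POS-based version would match noun phrases instead of single words.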

68 - Long Text Generation

Sun, 08 Nov 2020 10:00:00 GMT

There are limits to the input for text generation models. Most models have an input limit of around 500 tokens. This is due to the shortcomings of Recurrent Neural Networks (RNN): long-term information has to travel sequentially through all cells before reaching the cell that is currently being processed, which results in vanishing gradients for long sequences.

Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU, which need fewer computations than LSTMs) are better at learning and remembering over long sequences, but eventually they don’t work for very long sequences either.

Vanishing gradients are better addressed by attention-based models like Transformers, which, in contrast to RNNs, can process the input in parallel. Longformer is a model designed for long sequences. It has an attention mechanism that scales linearly with the input sequence length, whereas most self-attention-based models scale quadratically and therefore require more memory.
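The difference between quadratic and linear scaling can be made tangible by counting attention scores. The sketch below compares full self-attention with a sliding window, where each token only looks at a fixed number of neighbours. It is a back-of-the-envelope count with hypothetical function names, and it leaves out the global attention that Longformer adds for selected tokens.

```python
def full_attention_pairs(n_tokens):
    """Full self-attention: every token attends to every token (n * n scores)."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens, window):
    """Sliding window: each token attends to `window` neighbours on either side."""
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - window)
        hi = min(n_tokens - 1, i + window)
        total += hi - lo + 1
    return total
```

Going from 512 to 4096 tokens (8 times longer) multiplies the full attention count by 64, while the windowed count stays below `n_tokens * (2 * window + 1)` and therefore grows roughly in step with the input length.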

BigBird is a model from Google that also uses a sparse attention mechanism to reduce the required computations and memory. BigBird handles inputs that are up to 8 times longer than the original BERT model could handle. Several NLP tasks will benefit from the handling of longer inputs: Long Document Summarization, Question Answering and Genomics Processing.

Want to autowrite yourself? Use the Write With Transformers demo to write text, based on an initial text.

^{Write With Transformers demo (source)}

67 - Paraphrasing

Sat, 07 Nov 2020 10:00:00 GMT

Paraphrasing is the task of expressing the meaning of a source text into a new text by using different words and maintaining the semantic meaning. The goal might be to achieve greater clarity, to prevent plagiarism or to do data augmentation by generating related-but-different training data.

With rule-based functionality you might replace words with their synonyms in a text, but with Neural Networks the process will be more sophisticated and the output will have more variety in the expressions. However, these are types of errors you might find:

^{Paraphrase errors (source)}
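For comparison, the rule-based variant is little more than a dictionary lookup. A minimal sketch, assuming a hypothetical synonym table and a hypothetical function name:

```python
import re

# Hypothetical synonym table; a real system might draw on a thesaurus.
SYNONYMS = {"quick": "fast", "buy": "purchase", "big": "large"}


def paraphrase(text, synonyms=SYNONYMS):
    """Replace each known word with its synonym, keeping initial capitalisation."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # unknown words are left untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)
```

This swaps words one by one without looking at context, so it cannot reorder a sentence or pick the right sense of an ambiguous word. That is where a neural model adds variety.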

66 - Abstractive Summarization

Fri, 06 Nov 2020 10:00:00 GMT

Abstractive summarization systems generate new phrases. The perfect summarizer truly understands the document and expresses this by using as few words as possible.

It is a very difficult task, because the summarizer might produce factually incorrect details, struggle with Out-of-Vocabulary (OOV) words and might be repetitive in its output on important phrases.

Another subtask of Abstractive Summarization is Content Determination. What is the focus of the reader? This is important for deciding what information should be communicated in the summary. If you want to generate a summary from a book, it might be helpful to know what the reader is interested in.

65 - Machine Translation

Thu, 05 Nov 2020 10:00:00 GMT

Machine translation has for decades been one of the most important NLP tasks, because all things start by understanding each other without barriers. Google Translate still has shortcomings but is the absolute leader, and Facebook is in the race with its multilingual machine translation model M2M-100.

However, custom-built models are within reach with the arrival of Neural Machine Translation implementations, which provide sequence-to-sequence models, and Parallel Corpora like Paracrawl and Opus.

With the growing quality of Machine Translation models, there is also an opportunity to better translate training datasets into another language. The English language is almost always used for NLP blogs, model demos and SOTA leaderboards. Translating these superior English resources might benefit other languages.

The figure below shows a Word Alignment matrix from a Neural Machine Translation task. Each pixel shows the weight of the annotation and explains which positions in the source sentence were considered more important when generating the target word.

^{Word Alignment matrix for a translated sentence (source)}

64 - Report Writing

Wed, 04 Nov 2020 10:00:00 GMT

Writing sentences based on structured data is also called Data-to-Text Generation. The task is to generate content without explicitly modelling what to say and in what order. The task can consist of two steps. The first step is to define which parts of the structured data should get the most attention and in what sequence they should occur. The second step is to generate the content, while taking the first step into account.

An example is an automatic summary of the financial results in a Business Intelligence (BI) dashboard. Another example, from sports broadcasting, is generating a match report containing statistics on NBA basketball games.

^{Factual mentions from the table are boldfaced in the description. (source)}

63 - Next Token Prediction

Tue, 03 Nov 2020 10:00:00 GMT

Can I help you by predicting the next word you will type? A lot of apps are using this auto-complete feature to please the user.

N-gram language models can be used as a simple solution for Next Token prediction. Such a model assigns a probability to a sequence of words, in such a way that more likely sequences receive higher scores. For example, ‘I have a pen’ is expected to have a higher probability than ‘I am a pen’, since the first one seems to be a more natural sentence in the real world.
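As a minimal sketch of the idea, the bigram model below counts which word follows which and turns the counts into probabilities. The tiny corpus and the function names `train_bigram_model` and `predict_next` are made up for illustration; a real model would need far more data plus smoothing for unseen words.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word is followed by each other word."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            followers[current][nxt] += 1
    return followers

def predict_next(followers, word, k=3):
    """Return up to k (next_word, probability) pairs, most likely first."""
    counts = followers.get(word.lower())
    if not counts:
        return []  # unseen word: a real model would back off or smooth
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]

# Hypothetical toy corpus
corpus = [
    "I have a pen",
    "I have a book",
    "I have a pen and a book",
    "I am a student",
]
model = train_bigram_model(corpus)
print(predict_next(model, "I"))  # [('have', 0.75), ('am', 0.25)]
```

In this toy corpus ‘have’ follows ‘I’ three times out of four, so it is the suggested completion.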

Long Short Term Memory (LSTM) is a more advanced approach. It will better understand the context and has a better performance, because not all N-grams have to be calculated.

^{Google’s autocomplete (source)}

62 - Contextualized Word Representations

Mon, 02 Nov 2020 10:00:00 GMT

Contextualized / Dynamic Word Representations can be seen as incorporating context into word embeddings and is the ‘upgrade’ of Static Word Representations. Contextualized Embeddings can be found in models like BERT, ELMo, and GPT-2.

^{Pretrained Language Models that take context into consideration (source)}

ELMo (Embeddings from Language Models by AllenNLP) was the response to the polysemy problem: the same word having different meanings based on its context. It took context into consideration by using an LSTM-based model.

BERT (Bidirectional Encoder Representations from Transformers by Google) was a follow-up that considered the context from both the left and the right sides of each word. It was universal, because no domain-specific dataset was needed. It was also generalizable, because a pre-trained BERT model can be fine-tuned easily for various downstream NLP tasks.

GPT (Generative Pretrained Transformer by OpenAI) also emphasized the importance of the Transformer framework, which has a simpler architecture, trains faster and facilitates more parallelization than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism. Attention is an added layer that lets a model focus on what’s important in a long input sequence.

For a technical summary of the (20+) available model types see this Transformers Summary from Hugging Face.

61 - Distributed Word Representations

Sun, 01 Nov 2020 10:00:00 GMT

Distributed / Static Word Representations, also called Word Vectors or Word Embeddings, are multi-dimensional meaning representations of a word, which are reduced to N dimensions. This technique has received a lot of attention since 2013, when Google published the Word2Vec algorithm. There were still several challenges.

Original word embeddings have one vector per word. A large model typically has vectors of 300 or 512 dimensions for around 500k words. This results in embeddings that can grow beyond 500 MB and that have to be loaded into memory. There are three ways to reduce this load, each with a drawback. You can use fewer dimensions, which makes the vectors less distinctive. You can remove the vectors for infrequent words, but these might be the most interesting words. Or you can map multiple words onto one vector (pruning in spaCy), but these words will then be 100% similar to each other.
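The memory footprint can be estimated with simple arithmetic. The sketch below assumes 4 bytes per value (32-bit floats), which is an assumption and not something stated above; the helper name `embedding_size_mb` is made up.

```python
def embedding_size_mb(n_words, n_dims, bytes_per_value=4):
    """Approximate size of a dense embedding table in megabytes (10^6 bytes)."""
    return n_words * n_dims * bytes_per_value / 1_000_000

print(embedding_size_mb(500_000, 300))  # 600.0 -> the full table
print(embedding_size_mb(500_000, 100))  # 200.0 -> fewer dimensions
print(embedding_size_mb(100_000, 300))  # 120.0 -> fewer words
```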

Encountering out-of-vocabulary words is the word embedding problem of having words for which no vector exists. Subword approaches try to solve the unknown word problem by assuming that you can reconstruct a word’s meaning from its parts.

Lexical ambiguity or polysemy is another problem. A word in a word embedding has no context, so the vector for the word ‘bank’ is trained on the semantic context of the ‘river’ bank, but also on that of the ‘financial’ bank. Sense2vec partly solves this context-sensitivity by taking meta-information into account. The model is trained on words like ‘duck|NOUN’ and ‘duck|VERB’ or ‘Obama|PERSON’ and ‘Obama|ORG’ (e.g. the Obama administration) to be more distinctive on the meta-information tag (but what about ‘foot’: body part vs. unit of length?). Nowadays the ambiguity problem is solved by Attention Based Contextualized Word Representations.

An appealing feature of word embeddings (in the early days) was that they contain semantic relations if the training corpus reflects them. An example is ‘Paris’ is to ‘France’ as ‘London’ is to […]. The embedding can respond with ‘England’. However, it’s not always accurate and deep learning models are nowadays a better alternative for finding these relations.

^{Semantic relations in word embeddings (source)}

Best known word embedding models are:

Word2Vec is the first word vector algorithm, created by Tomáš Mikolov at Google. A popular implementation is provided by Gensim.
GloVe algorithm is created by Stanford.
fastText algorithm is created by Facebook and is a subword embedding where each word is represented as a bag of character n-grams. This means that out-of-vocabulary words can be composed from multiple subwords. This makes the algorithm faster, because the embedding is smaller. Trained word vectors for 157 languages are available to download.
BPEmb is also a subword embedding algorithm. Subwords are based on Byte-Pair Encoding (BPE) which is a specific type of subword tokenization. BPEmb has trained models for 275 languages.

60 - Document Similarity

Sat, 31 Oct 2020 10:00:00 GMT

The task of estimating the degree of similarity between the semantic representation of two documents can be done by different techniques for feature extraction. Some examples:

The statistical techniques BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency), which are the default and former-default similarity algorithm in Elasticsearch and Lucene.
Latent Semantic Analysis (LSA/LSI) for vectorization of documents. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies a truncated singular value decomposition (closely related to principal component analysis) to the vector space and only keeps the directions that contain the most variance.
Latent Dirichlet allocation (LDA) which is a probabilistic method.
Doc2Vec (aka paragraph2vec, aka sentence embeddings), a neural network method that modifies the word2vec algorithm for unsupervised learning of continuous representations of larger blocks of text.
USE (Universal Sentence Encoder) encodes text into high dimensional vectors. It has pretrained models for English, but also a multilingual model.
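To make the statistical approach concrete, here is a minimal pure-Python sketch of TF-IDF vectors compared by cosine similarity. It is an illustration only: the whitespace tokenizer, the unsmoothed IDF formula and the three example documents are assumptions, and libraries such as Scikit-Learn or Elasticsearch use more refined variants.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a sparse {term: tf * idf} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # > 0: shared terms
print(cosine_similarity(vecs[0], vecs[2]))  # 0.0: no shared terms
```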

59 - Distance Measures

Fri, 30 Oct 2020 10:00:00 GMT

Distance Measures show how similar words are to each other. There is syntactic (lexical) word similarity and semantic word similarity. Syntactic similarity means that sheep and ship are more similar than sheep and lamb, because semantic meaning is ignored. This can be calculated with the Levenshtein Distance, which is used by the RapidFuzz library. Semantic similarity measures the meaning of the words, so sheep and lamb are more similar than sheep and ship. This can be calculated by measuring the cosine distance between word vectors.

^{Lexical vs Semantic similarity (source)}
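The sheep/ship/lamb example can be checked with a small pure-Python Levenshtein implementation. This is a textbook dynamic-programming sketch, not the RapidFuzz code, which is much faster.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn string a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("sheep", "ship"))  # 2
print(levenshtein("sheep", "lamb"))  # 5
```

The lower the distance, the more similar the spelling, so ‘sheep’ is lexically closer to ‘ship’ than to ‘lamb’.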

58 - WordNet Synsets

Thu, 29 Oct 2020 10:00:00 GMT

WordNet is a lexical database from Princeton. It consists of nouns, verbs, adjectives and adverbs which are grouped into sets of synonyms (synsets), hyponyms, meronyms and hypernyms with descriptions. Each synset describes a concept and is interlinked by means of conceptual-semantic and lexical relations.

^{WordNet results for ‘school’. S = Synset (semantic) relations, W = Word (lexical) relations, n = noun, v = verb, adj = adjective (source)}

Similar structured knowledge bases are ConceptNet or VerbNet. These focus on the major languages, but there are also initiatives for smaller languages like the Open Dutch WordNet. Applications can be tasks for word sense disambiguation, classification of texts, finding similar terms, lexical simplification, etc.

57 - Outlier Detection

Wed, 28 Oct 2020 10:00:00 GMT

Outliers or Anomalies are generally defined as samples that are exceptionally far from the mainstream of (textual) data. The threshold when something is an outlier is very subjective. If you have a vocabulary, an outlier might be defined as a word that is Out-of-Vocabulary (OOV).

Alternatively, an outlier can be the result of an extreme class imbalance, in which case it can be measured in terms of its word or document vector.

56 - Trend Detection

Tue, 27 Oct 2020 10:00:00 GMT

Trending topics in Twitter’s streaming data are one of the best examples of the Trend Detection task. Capturing topics, thoughts and emotions over time produces a very insightful starting point for analysis.

You can quantify the deviation of a particular word count beyond the expected variability, and you can define a threshold above which you call the count a trend. If there is historical data, you can take seasonality-patterns into account in your time-series analysis.

^{Three basic types of Anomalies (source)}
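The word-count threshold described above can be sketched with a z-score: how many standard deviations the current count lies above the historical mean. The function name `is_trending`, the threshold of 3 and the daily counts are assumptions for illustration; seasonality is ignored here.

```python
from statistics import mean, stdev

def is_trending(history, current, threshold=3.0):
    """Flag a count that lies more than `threshold` standard deviations
    above the mean of the historical counts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu  # no variability: any increase stands out
    return (current - mu) / sigma > threshold

# Hypothetical daily mention counts of one word
daily_counts = [12, 15, 11, 14, 13, 12, 16]
print(is_trending(daily_counts, 14))  # False: within normal variation
print(is_trending(daily_counts, 40))  # True: far above expected variability
```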

The difficulty is that you often don’t know the scale, size or time interval of the change in advance. Depending on your use-case there are different setups possible. However, all algorithms present trade-offs, including simplicity vs. robustness, precision, recall and time-to-detection. An older python package from Twitterdev might give you a quick start.

An interesting variant is Cold Trend Detection. This shows which topics have the highest negative change in scores and are cooling down at a certain time.

55 - Topic Modeling

Mon, 26 Oct 2020 10:00:00 GMT

To divide a set of documents into N unsupervised topics, the documents should be represented by compact vectors. Term Frequency * Inverse Document Frequency (TF-IDF), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are the best-known Vector Space Model algorithms for transforming a document into a vector.

The critical step in Topic Modeling is to determine how similar these vectors should be, which documents belong to a specific topic and how many topics should be distinguished. In most libraries you have to define how many topics (clusters) the algorithm should generate.

However, the library Top2Vec automatically reduces the number of dimensions and finds dense areas in this reduced space. So, it determines the number of topics for you. You can reduce this number of topics by iteratively merging the smallest topic into its most similar topic until you reach your target number. It also combines document vectors and word vectors to determine the topic vectors and their most important words.

Another informative tool that should be mentioned is pyLDAvis, which is a library for interactive topic model visualization (demo).

^{The 5 steps of the Top2Vec algorithm (source)}

54 - Extractive Summarization

Sun, 25 Oct 2020 10:00:00 GMT

Extractive Summarization (or summary generation) works in the same way as Keyword extraction. The most relevant sentences are extracted. The algorithm selects sentences by finding the combination of words that are important or seem representative of the entire text. That’s why packages that support Summarization often also support Keyword detection. A variant is multi-document summarization.

^{Extractive vs Abstractive summarization (source)}
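A very simple extractive summarizer scores each sentence by the average frequency of its words in the whole text and keeps the top sentences. The sketch below is only an illustration of that idea; the stopword list, the regex-based sentence splitter and the function name `summarize` are assumptions, and real packages use stronger methods such as TextRank.

```python
import re
from collections import Counter

# Small hand-made stopword list (an assumption for this sketch)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "on", "for"}

def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

text = ("Attention models changed NLP. "
        "Attention models scale well because attention is parallel. "
        "I had lunch yesterday.")
print(summarize(text, 1))  # ['Attention models changed NLP.']
```

The same word frequencies could serve as a crude keyword list, which is why both tasks are often offered by the same package.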

Extractive summarization is also important for the question answering task. By collecting the most relevant documents for a particular question, a summarizer could assemble a cohesive context for the answer. The other way around is also interesting. When building training data for the QA task you have to generate relevant questions; Extractive summarization can identify the important sentences that you want to have questions about.

53 - Keyword Extraction

Sat, 24 Oct 2020 10:00:00 GMT

Provide the most relevant words from this document. That is the task of keyword extraction. It is often used as a starting point when doing unsupervised text analytics. Keywords are also known as common phrases, multi-word expressions and word n-gram collocations.

Calculating keywords can for example be done by the Textrank algorithm or by detecting Ngrams.

52 - Multi-Label Multi-Class Classification

Fri, 23 Oct 2020 10:00:00 GMT

A specific and difficult subtask of Text Classification is Multi-Label Multi-Class Text Classification.

^{Classifier types (source)}

51 - Text Classification

Thu, 22 Oct 2020 10:00:00 GMT

Text classification is the process of assigning tags or categories to text according to its content. It is the broader task of which Intent, Sentiment and Spam classification are a part.

^{Tweet from Richard Socher (source)}

You can start classifying text by using a variety of open source libraries, from Scikit-Learn to NLTK and spaCy. Or watch the video below for a quick intro.

^{Training an Insults Classifier in 1 hour (source)}

50 - Intent Classification

Wed, 21 Oct 2020 10:00:00 GMT

Chatbots and E-assistants have to accomplish two tasks: understanding the user and giving the correct responses. Intent Classification is the task of finding out exactly what the user is asking. For example, does she want to buy tickets, or does she want to know the price of tickets?

^{The Utterance is the user’s quest. The Intent classifier detects the next-best action. Parameterized by the entities. (source)}

The classification task requires training data which might be difficult to create and domain specific. MultiWOZ is a well-known task-oriented dialogue dataset containing over 10k dialogues spanning 8 domains.

49 - Sentiment and Emotion Detection

Tue, 20 Oct 2020 10:00:00 GMT

Sentiment Analysis or Opinion Mining attempts to determine the overall attitude (positive or negative) expressed within a text. Its purpose is to represent emotional or affective meaning.

^{Simple classification with 3 labels (source)}

The technique’s success lies in an imperative need to standardize the measurement of human emotion in social media in order to efficiently monetize it. Signaling a user’s preference is a vital component of the Like economy that underpins the business models of all major social media companies. These peer recommendations are highly trusted by other peers. This makes Sentiment Analysis a valuable source of information.

Sentiment Analysis has a lot of challenges. It is often unclear what is measured: the polarity of a described entity or the emotional state of the writer.

A more focused subfield is Aspect-Based Sentiment Analysis (ABSA). First, the aspect details (or sentiment target) are detected. For example, the sentiment target is a phrase about product X. Then the sentiment of the opinion about the target is detected. A third step, which is often omitted, is to detect the opinion holder.

^{Plutchik’s wheel of Emotion (source)}

Words indicating the level of sentiment will behave very differently when under the semantic scope of negation (see negation section). If you want to analyze e.g. social media text without training a classifier you can use the python library VADER. You can also modify the library to work with non-English text.
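The effect of negation on sentiment words can be shown with a toy lexicon-based scorer. This is not how VADER works internally; the four-word lexicon, the negation list and the rule that a negation only flips the very next token are all simplifying assumptions.

```python
# Hand-made toy lexicon and negation list (assumptions, not VADER's data)
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "no", "never"}

def sentiment_score(text):
    """Sum the lexicon scores; a negation flips the sign of the next token."""
    score = 0.0
    negate = False
    for token in text.lower().replace(".", " ").replace(",", " ").split():
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            score += -value if negate else value
        negate = False  # the negation scope ends after one token
    return score

print(sentiment_score("The food was good"))        # 1.0
print(sentiment_score("The food was not good"))    # -1.0
print(sentiment_score("The service was not bad"))  # 1.0
```

A phrase like ‘not very good’ would escape this one-token scope, which illustrates why negation handling needs more care in practice.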

48 - Spam Detection

Mon, 19 Oct 2020 10:00:00 GMT

Spam Detection is one of the oldest applications of NLP and is a frequently seen use case for demos and tutorials. Receiving email from a Nigerian prince might still be a problem, but ISPs are more and more successful in detecting and filtering it out.

^{Global spam volume as percentage of total e-mail traffic from 2007 to 2019 (source)}

47 - Monitoring Models

Sun, 18 Oct 2020 10:00:00 GMT

Once deployed you need to monitor your processors and improve latency and inference speed. Language Models need to be monitored. Maybe you have a feedback loop that gives the end-user the opportunity to tell if the model did a good or bad job.

46 - Deploying Models

Sat, 17 Oct 2020 10:00:00 GMT

Once you have built your Language Model you need to deploy it. It might be a building block in a larger pipeline. You might want to build a Model Factory to periodically retrain the model. You might need specialized processors like GPUs and TPUs, or a special runtime like ONNX to accelerate inference of AI (Transformer) models. A real DevOps job.

45 - Explaining Models

Fri, 16 Oct 2020 10:00:00 GMT

The business will require that your model can explain its outcomes. Transparency is needed to prevent distrust. For example, to prevent racial bias in sentiment analysis.

ELI5 (Explain Like I’m 5) is a python package which helps to debug machine learning classifiers and explain their predictions. It supports several Machine Learning frameworks, like Scikit-Learn and Keras.

^{ELI5 Textexplainer showing which words contribute to the prediction of y=sci.med (source)}

LIME (Local Interpretable Model-Agnostic Explanations) explains individual predictions for text classifiers:

^{LIME explanations (source)}

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model:

^{SHAP for multi-class text data, with top positive words for emotion angry (source)}

44 - Evaluating Models

Thu, 15 Oct 2020 10:00:00 GMT

To evaluate the quality of a Language Model, it should be compared based on some score. For supervised models, like a text classification model, it is easy to evaluate a model with metrics like precision, recall, accuracy and F1-score. For unsupervised models, like Natural Language Generation models, there are metrics like BLEU, ROUGE and METEOR and benchmarks like GLUE for extrinsic evaluation, and Perplexity as an intrinsic evaluation.
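For a classifier, precision, recall and F1 follow directly from counting true positives, false positives and false negatives. The sketch below uses made-up gold and predicted labels for a spam classifier; the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted, positive_label):
    """Compute precision, recall and F1 for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
gold      = ["spam", "spam", "ham", "ham",  "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]
precision, recall, f1 = precision_recall_f1(gold, predicted, "spam")
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```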

BLEU is a popular word-overlap metric that compares n-grams between a candidate text and a reference text. Unfortunately, it is unable to capture semantics and can lead to poor scores even for appropriate responses. BLEU is popular for evaluating machine translation models.

^{BLEU Metric (source)}
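The core of BLEU is the clipped (modified) n-gram precision: each candidate n-gram only counts as often as it occurs in the reference. The sketch below computes just that building block for a single reference; the full metric additionally combines several n-gram orders and applies a brevity penalty, which are left out here.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate count by the count in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat"
print(modified_precision("the the the the", reference, 1))         # 0.5
print(modified_precision("the cat sat on the mat", reference, 2))  # 0.6
```

Without clipping, the degenerate candidate ‘the the the the’ would score a perfect unigram precision.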

The GLUE (General Language Understanding Evaluation) benchmark is model-agnostic, so any system capable of processing sentences and sentence pairs and producing corresponding predictions is eligible to participate. It has a lot of data sets for several genres and several tasks like coreference resolution, sentiment analysis and question answering. It comes with a leaderboard, so people can use the data sets and see how well their models perform compared to others.

You can calculate these metrics with the NLG-eval package or go to Huggingface for an overview and explanation of several metrics.

Perplexity is an intrinsic evaluation method. It’s not as good as the extrinsic metrics, but it is useful for quickly evaluating the language model itself (e.g. for LDA), without taking into account the specific task it’s going to be used for. In everyday language, perplexity is the inability to understand something; for a language model it expresses how uncertain the model is when predicting a sample. A low perplexity indicates the model is good at predicting the sample. At the time of writing, SOTA (state-of-the-art) perplexity for a language model is 11, although the value depends on the dataset it is measured on.
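Perplexity can be computed as the exponential of the average negative log-probability that the model assigned to each token of the sample. The probabilities below are made up to show the behaviour; the function name is hypothetical.

```python
import math

def perplexity(token_probabilities):
    """exp of the average negative log-probability of the observed tokens."""
    log_sum = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_sum / len(token_probabilities))

# A model that is always 25% sure is as uncertain as a uniform choice
# between 4 options, so its perplexity is (approximately) 4.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uncertain, confident)
```

The more probability the model puts on the tokens that actually occur, the closer the perplexity gets to its minimum of 1.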

43 - Training Models

Wed, 14 Oct 2020 10:00:00 GMT

Training NLP models is a broad topic. It’s best to start light and improve later. You can start by spending two hours on building a rule-based model and see how well it scores. Take this as a baseline score. Then try to improve on it with a simple technique like a regression model. If you want to go further, try training a deep learning model.

The more complex your model, the longer the training time. More performance requires better hardware. Instead of a CPU you might need GPUs or TPUs.

Yoav Goldberg talked about the expertise required to build NLP models. His vision is that in the future (2021+) humans won’t need much ML or linguistic expertise. Humans will be writing rules, aided by ML/DL, resulting in transparent and debuggable models.

^{From Yoav Goldberg’s presentation The missing elements in NLP (spaCy IRL 2019) (source)}

42 - Language Identification

Tue, 13 Oct 2020 10:00:00 GMT

Language identification is one of the first tasks you will do, because you have to select the right language-specific model. You can use the Python library langdetect (55 languages, 99% accuracy) or fastText (176 languages, 93% accuracy).

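    from langdetect import detect, detect_langs

    # detect() returns the most probable language code
    print(detect("War doesn't show who's right, just who's left."))
    # en

    # detect_langs() returns candidates with probabilities (these vary per run)
    print(detect_langs("Otec matka syn."))
    # [sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]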

41 - Meta-Info Extractor

Mon, 12 Oct 2020 10:00:00 GMT

If a document is considered to be information, its title, URL, publishing date, last-editing date, extraction-timestamp, author, filetype and subject are examples of meta-information.

When you extract text from a file with Apache Tika, it will also provide the meta-information. When you use the Twitter API, you can also get the meta-information of a tweet. A Tweet can have over 150 attributes associated with it.

^{Tweet metadata fields (source)}

Once you have extracted the meta-information, it’s interesting to combine it with info extracted from the text. For example, the publication date of the document can be used as the reference date for temporal expressions. You can resolve the date for the word ‘yesterday’ based on the publication date.
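A minimal sketch of this idea, using only the Python standard library: relative day words are resolved against a reference date taken from the metadata. The lookup table, the function name and the example publication date are assumptions for illustration.

```python
from datetime import date, timedelta

# Offsets in days relative to the reference date (a tiny, hand-made table)
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_relative_day(expression, reference_date):
    """Map a relative day word to an absolute date, or None if unknown."""
    offset = RELATIVE_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    return reference_date + timedelta(days=offset)

publication_date = date(2020, 10, 12)  # hypothetical value from the metadata
print(resolve_relative_day("yesterday", publication_date))  # 2020-10-11
```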

40 - Raw Text Cleaning

Sun, 11 Oct 2020 10:00:00 GMT

The task of Raw Text Cleaning consists of several pre-processing steps with the goal to increase the quality of subsequent NLP tasks.

De-hyphenate words that were split up because the end of the line was reached.
Replace redundant whitespace like multiple line-ends, tabs and spaces.
Handle HTML tags like <b>, <div>, <span>, <br/>
Smart Decapitalization for SCREAMING words
Replace rare special characters, for example by substituting a standard dash for the figure dash (‒), en dash (–), em dash (—), horizontal bar (―) and swung dash (⁓), which shouldn’t be confused with the tilde (~).
Replace special letter (áççêñtèd) characters with their simple form.
Remove footnotes, page numbers, headers and references.
Replace numerical words by values: ‘hundred thousand’ to 100,000
Replace emojis by their description, like 😶 to ‘no mouth’
Remove URL and Email references

Etcetera, etcetera! A lot of other NLP tasks described here, can also be used for Text Cleaning.
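A few of the steps above can be sketched with the Python standard library. This is an illustrative subset, not a complete cleaner: the regular expressions are deliberately simple, the function name `clean_text` is made up, and the order of the steps is a design choice.

```python
import re
import unicodedata

# Figure dash, en dash, em dash, horizontal bar, swung dash
DASHES = "\u2012\u2013\u2014\u2015\u2053"

def clean_text(text):
    # Re-join words that were hyphenated across a line end
    # (this also joins genuinely hyphenated words that happen to wrap)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs and email addresses
    text = re.sub(r"https?://\S+|\S+@\S+\.\S+", " ", text)
    # Replace special dashes by a standard dash
    text = re.sub(f"[{DASHES}]", "-", text)
    # Strip accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse redundant whitespace
    return re.sub(r"\s+", " ", text).strip()

raw = ("An exam-\nple with <b>bold</b>  text \u2013 and áççêñtèd\n\n"
       "letters, see https://example.com or mail me@example.com now.")
print(clean_text(raw))
# An example with bold text - and accented letters, see or mail now.
```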

39 - Deduplication

Sat, 10 Oct 2020 10:00:00 GMT

Deduplication is a relevant task if you have little control over your input data collection. For example, you can have a lot of duplicate documents when scraping internet articles or using tweets.

There are different methods for creating a unique document set. For finding texts that are exactly the same you can use hashing. For comparing similarity between texts you can use fuzzy string matching or subsequence matching. This often performs badly when you want to scale up, because the calculation costs grow quadratically with the number of documents to deduplicate. A solution is to cluster the documents first. You then only compare documents within the same cluster instead of comparing all documents against each other.

For comparing semantic similarity between texts you can use distributed or contextualized word representations. A vector then represents each text and the (cosine) distance will indicate the similarity.

Deduplication can benefit a lot from cleaning your text, like lowercasing all text or replacing URLs.
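Exact deduplication by hashing, combined with a light cleaning step, might look like the sketch below. The normalization rules, the `<url>` placeholder and the example documents are assumptions; only near-identical texts that normalize to the same string are caught.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, replace URLs by a placeholder and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Keep the first document for every distinct normalized content hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Breaking news: read more at https://a.example/1",
    "BREAKING NEWS:  read more at https://b.example/2",
    "Something else entirely",
]
print(len(deduplicate(docs)))  # 2
```

Because every document is hashed once and looked up in a set, this step scales linearly, unlike pairwise fuzzy matching.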

38 - Readability Scoring

Fri, 09 Oct 2020 10:00:00 GMT

Readability is the quality of the text that was written. If it’s too long and complicated, no one will understand it. Measuring the readability is about measuring the text quality. This can be done by looking at the keyword density, syllable count and the average length of sentences and words in a document. Also checking for simpler synonyms or words with a higher word prevalence can help. Word prevalence is about word knowledge in the crowd and refers to the number of people who know the word.

Well-known Readability measures are Flesch-Kincaid Grade Level and the Coleman-Liau Index. These are developed for English. For non-English languages there might be specific variants. However, the best language-agnostic linguistic proxy for readability is (not surprisingly) the average number of words per sentence.

You can try the English readability metrics with this python package.

^{Flesch-Kincaid Grade Levels (source)}
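The Flesch-Kincaid Grade Level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below implements that formula with a crude syllable counter (groups of vowels), which is an assumption of this example; dedicated packages count syllables more carefully.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran."
hard = ("Measuring readability requires considering syllable density "
        "and sentence complexity.")
print(round(flesch_kincaid_grade(simple), 2))  # -2.62
print(flesch_kincaid_grade(hard) > flesch_kincaid_grade(simple))  # True
```

Short sentences of one-syllable words can even score below zero, while long words in long sentences push the grade level up.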

37 - Grammar Checker

Thu, 08 Oct 2020 10:00:00 GMT

Grammar Checkers have not been great in the past. Despite all efforts in the pre-deep-learning era, transfer learning is currently the way to go. Depending on the implementation, the language models perform pretty well.

36 - Paragraph Segmentation

Wed, 07 Oct 2020 10:00:00 GMT

Detecting Paragraphs is somewhat less mainstream. Mostly there is some custom logic like: split after two line-ends, or split before an uppercased title. Maybe there is some layout meta-information, or a specific paragraph and chapter numbering that could help.
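The ‘split after two line-ends’ rule can be sketched with one regular expression. The function name and example text are made up; the single line breaks inside a paragraph are treated as plain wrapping.

```python
import re

def split_paragraphs(text):
    """Split on blank lines and unwrap the lines inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]

text = ("INTRODUCTION\n\nFirst paragraph,\nwrapped over two lines.\n\n\n"
        "Second paragraph.")
print(split_paragraphs(text))
# ['INTRODUCTION', 'First paragraph, wrapped over two lines.', 'Second paragraph.']
```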

Mostly, there just is no default way of determining the paragraph boundary and people tend to work with sentences. Still, the unit of a paragraph might be of a higher value than that of a sentence. Examples might be: coreference resolutions that overlap multiple sentences. Questions that find their answer throughout a whole paragraph. A reader that understands a paragraph better than an isolated sentence. It’s clear that the signal from a writer is best expressed in a paragraph.

35 - Sentencizer

Tue, 06 Oct 2020 10:00:00 GMT

Once you have tokenized your textual data, a sentencizer should find the words that together form a sentence.

Starting with a titlecased word, followed by lowercase words, until there is a dot. That might be the simplest (erroneous) version of rule-based sentence boundary detection (SBD) logic.
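A rough approximation of this naive rule (a capital letter, then everything up to the next dot) shows both why it seems to work and why it is erroneous. The regular expression and the example sentences are assumptions for illustration.

```python
import re

def naive_sentences(text):
    """A capital letter followed by anything up to and including a dot."""
    return re.findall(r"[A-Z][^.]*\.", text)

print(naive_sentences("The model works. It is fast."))
# ['The model works.', 'It is fast.']

# Abbreviations and decimals break the rule
print(naive_sentences("Dr. Smith paid $3.50 for coffee."))
# ['Dr.', 'Smith paid $3.']
```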

More sophisticated SBD is, for example, done by the spaCy library. The sentence segmentation is performed by the Dependency Parser, which predicts the sentence boundary by the dependency tags.

In NLTK you can train (unsupervised) a sentence-tokenizer on your own training data. It builds a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.

34 - Text Anonymizer

Mon, 05 Oct 2020 10:00:00 GMT

Text Anonymizing is the task of removing sensitive information before a document is shared with others. Deidentification and obfuscation are often done for strings that identify persons (name, social number, email, etc.) and organizations, or for other sensitive details from crime records and patient dossiers.

A simple solution is to do Named Entity Recognition and replace the found mention with a tag. If it is Anonymization you replace the mention with nothing (the black marker). For Pseudonymization you replace the mention with a unique tag.

^{Deidentification levels (source)}
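The difference between Anonymization and Pseudonymization can be sketched as follows. The entity dictionary stands in for the output of a Named Entity Recognition step and is entirely hypothetical, as are the function names and the `[REDACTED]` marker.

```python
def anonymize(text, entities):
    """Replace every mention with the same marker (the black marker)."""
    for mention in entities:
        text = text.replace(mention, "[REDACTED]")
    return text

def pseudonymize(text, entities):
    """Replace every distinct mention with its own unique tag per label."""
    counters = {}
    for mention, label in entities.items():
        counters[label] = counters.get(label, 0) + 1
        text = text.replace(mention, f"[{label}_{counters[label]}]")
    return text

# Hypothetical NER output: mention text -> entity label
entities = {"Alice Jansen": "PERSON", "Bob Smit": "PERSON", "Acme": "ORG"}
text = "Alice Jansen emailed Bob Smit. Later Alice Jansen left Acme."

print(anonymize(text, entities))
# [REDACTED] emailed [REDACTED]. Later [REDACTED] left [REDACTED].
print(pseudonymize(text, entities))
# [PERSON_1] emailed [PERSON_2]. Later [PERSON_1] left [ORG_1].
```

With pseudonyms a reader can still follow who did what, because the same mention always receives the same tag.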

Text Anonymization has its use cases within governments and organizations that want to avoid being too transparent and have to deal with legal frameworks such as the General Data Protection Regulation (GDPR).

33 - Coreference Resolution

Sun, 04 Oct 2020 10:00:00 GMT

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It has much to do with Named Entity Linking, but it doesn’t necessarily use a knowledge base.

^{The nomenclature of Coreference Resolution (source)}

You can test the coreference demos from AllenNLP or Huggingface.

^{It doesn’t know who’s in the band! Coreference Resolution by Huggingface (source)}

32 - Named Entity Linking

Sat, 03 Oct 2020 10:00:00 GMT

Named Entity Linking (NEL) is the task of assigning a unique identity from a knowledge base to an entity. For example, the entities ‘Obama’ and ‘The President’ from document X might refer to the same person, but “Obama” from document X and “Obama” from document Y might not refer to the same person.

^{Named Entity Linking steps (source)}

NEL can be done by training an Entity Linker with spaCy, or by using the knowledge base DBpedia and the service dbpedia-spotlight.org.

31 - Temporal Parser

Fri, 02 Oct 2020 10:00:00 GMT

Once you have found a string that contains an indication of time, you still have to extract a normalized time format out of it. Otherwise you cannot calculate time differences between events or put results in a timeline.

The numerous time zones and local formats are challenging. Relative notations like ‘tomorrow’ should also be normalized, which requires declaring a reference date that functions as the ‘now’ in relation to the ‘tomorrow’. Another point is durations like ‘the summer of 1969’: when does a summer begin and end?

^{Example of Temporal Parsing (source)}

Some of the best temporal parsers are Scrapinghub’s python dateparser and Facebook’s Duckling (in Haskell).

30 - Geocoding

Thu, 01 Oct 2020 10:00:00 GMT

Geocoding is the task of parsing text into an address and converting an address into geographic coordinates like latitude and longitude. The goal is to plot an item on a map. In a broader sense, this can also be done for organization names (find the headquarters), facilities and landmarks (convert ‘Eiffel Tower’ into 48°51′29.6″N 2°17′40.2″E). The opposite is reverse geocoding: from coordinates to an address.

There are a lot of paid APIs for geocoding. Geopy is a Python client for several popular geocoding web services. If you want to do this yourself, you’ll need a reference database, for example the open data from openaddresses.io or geonames.org, or a local initiative.

^{Geocoding (source)}

29 - Price Parser

Wed, 30 Sep 2020 10:00:00 GMT

A Price Parser extracts prices and currencies from raw text. Applications can be in Price Management and Competitive Pricing, often combined with web scraping. Another usage example is from the Panama Papers: extract all currency amounts from documents and link each amount to a person, organization, date or bank account.

A good price parser normalizes all prices into a standard format. It should also recognize all currencies. It might be difficult to recognize the names and abbreviations of currencies: Euro, EUR, US Dollar, dollar (which one?), USD. The currency symbols are easier to find, because they have their own Unicode category. But even this doesn’t guarantee completeness.

    import unicodedata

    def is_currency_symbol(char):
        # "Sc" is the Unicode general category "Symbol, currency"
        return unicodedata.category(char) == "Sc"

A good python package is price-parser. Another library for finding amounts of money is Facebook’s Duckling (in Haskell).

28 - Abbreviation Finder

Tue, 29 Sep 2020 10:00:00 GMT

An abbreviation is a way to write a word without using the complete spelling. Abbreviations are an efficient way to write text, but they lower text comprehension and increase ambiguity. Acronyms are a specialized form of abbreviation, usually built from the first letters of each word in a multi-word phrase.

Identifying abbreviations often depends on detecting the simple template Long-Form (Short-Form). After identification a consensus view should be reached on all ambiguous long forms. Distinctions should be made between different long forms with the same semantic meaning versus long forms of different concepts.

^{Reaching a Consensus view on Dutch Healthcare Acronyms; choosing the most frequent writing for an acronym (source)}
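The Long-Form (Short-Form) template can be sketched with a regular expression plus an initials check. This is a simplified illustration under stated assumptions: the short form is taken to be 2 to 10 uppercase letters, and each letter must match the first letter of one preceding word. The function name `find_abbreviations` is hypothetical.

```python
import re

def find_abbreviations(text):
    """Return (long_form, short_form) pairs matching 'Long Form (LF)'."""
    pairs = []
    # Assumption: a short form is 2-10 uppercase letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short_form = match.group(1)
        words_before = text[:match.start()].split()
        # Take as many preceding words as the short form has letters.
        candidate = words_before[-len(short_form):]
        if len(candidate) < len(short_form):
            continue
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short_form:
            pairs.append((" ".join(candidate), short_form))
    return pairs

print(find_abbreviations("We use Named Entity Recognition (NER) daily."))
# [('Named Entity Recognition', 'NER')]
```

Short forms that skip stop words or use inner letters are missed by this check. Reaching a consensus on ambiguous long forms is a separate step that is not shown here.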

27 - Named Entity Recognition

Mon, 28 Sep 2020 10:00:00 GMT

Named Entity Recognition (NER) is the task of identifying named entities and their class. A lot of NER models are based upon Wikipedia-generated training data for 3 NER categories: persons, locations, organizations. This is a rather simple NER annotation scheme, but it is available for multiple languages. The Ontonotes training data provides a more fine-grained annotation scheme, like the table below. It all depends on how schemas are defined and how training data is created. Domain-specific text requires custom models, for example for the Legal and Biomedical domains. You can also get inspiration from schema.org.

^{Examples of Named Entity types and Value types (source)}

^{displaCy Named Entity Visualizer demo (source)}

26 - Dependency Nounchunks

Sun, 27 Sep 2020 10:00:00 GMT

A form of Constituency Parsing is breaking a text into sub-phrases. Partial parsing is known as chunking and has noun-phrases (nounchunks), verb-phrases, adjective phrases or prepositional phrases as a result. You can compare this with the more default N-grams.

You can test spaCy for its nounchunker. The chunks are based on dependency tags.

^{Nounchunks from English spaCy model. It would be better to have ‘the summer of 1969’ as chunk (source)}

25 - Rulebased Phrasematcher

Sat, 26 Sep 2020 10:00:00 GMT

A lot of text analytics starts with searching for a list of words or phrases. If you have a problem of finite size where a lookup table (gazetteer) is sufficient, then a matcher might be the solution. This task has evolved over time. It is still rule-based, but smarter. Missed search hits are prevented by looking at tokens instead of raw text.

For example, double spaces and tabs between words will not influence the search results. Searches can also be defined on a more semantic level. For example, one can search for all forms of a verb just by defining the lemma (base form) of the verb. Or search for the lemma of a noun and you will get all the singular and plural forms of this noun.

Performance is also important when searching for a large list of words against a large corpus. Flashtext has an algorithm in python that has a gigantic performance gain compared to regex searches.

spaCy has a Token Matcher for detailed searches on (semantic) properties of (multiple) tokens and a Phrase Matcher for very long lists of (multiple) words.

^{Rule-based Token Matcher demo (source)}
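Why matching on tokens is more robust than matching on raw text can be shown with a small standard-library sketch. It is neither spaCy’s matcher nor the Flashtext algorithm; the helper names `tokenize` and `find_phrases` are hypothetical, and the tokenizer is a deliberately naive regular expression.

```python
import re

def tokenize(text):
    """Lowercased word tokens; whitespace and punctuation are dropped."""
    return re.findall(r"\w+", text.lower())

def find_phrases(text, phrases):
    """Return (start_token_index, phrase) for each gazetteer hit."""
    tokens = tokenize(text)
    # Tokenize the gazetteer entries the same way as the text.
    patterns = {tuple(tokenize(phrase)): phrase for phrase in phrases}
    hits = []
    for start in range(len(tokens)):
        for pattern, phrase in patterns.items():
            if tuple(tokens[start:start + len(pattern)]) == pattern:
                hits.append((start, phrase))
    return hits

text = "I drank  red\twine in the summer   of 69."
print(find_phrases(text, ["red wine", "summer of 69"]))
# [(2, 'red wine'), (6, 'summer of 69')]
```

A plain substring search for ‘red wine’ would miss this text because of the double space and the tab. The nested loop is fine for a demo, but for large gazetteers a trie-based lookup scales much better.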

24 - N-grams

Fri, 25 Sep 2020 10:00:00 GMT

An N-gram is a sequence of N words, with a high probability of occurrence. It can also be named a collocation, a multi-word expression or a common phrase. A Bi-gram example is ‘red wine’ and a Tri-gram is ‘summer of 69’. The probability is calculated by: (the number of times the previous word occurs before this word) / (the total number of times the previous word occurs in the corpus)
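As a minimal sketch, the probability above can be computed for bigrams with two counters. The toy corpus and the function name `bigram_probability` are assumptions made for this example.

```python
from collections import Counter

def bigram_probability(tokens, previous, word):
    """P(word | previous) = count(previous, word) / count(previous)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]

tokens = "red wine and red roses and red wine".split()
# 'red' occurs 3 times and is followed by 'wine' twice.
print(bigram_probability(tokens, "red", "wine"))
```

On this toy corpus the result is 2/3, so ‘red wine’ is a stronger candidate for a common phrase than ‘red roses’ with 1/3.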

You can detect N-grams with Gensim’s Phrases model or with the CountVectorizer from scikit-learn.

N-grams are sometimes used for next-word prediction. This is a simple, but costly (performance) solution. A variant is the cheaper character N-grams, but it’s better to use LSTMs.

23 - Negation Recognizer

Thu, 24 Sep 2020 10:00:00 GMT

A negation is the denial of something and is very important to include in your analysis. When you count the word ‘talent’ in your document, you should also find the preceding word ‘without’. Ignoring negation will flip the polarity of your analysis incorrectly.

^{Negation Recognizer example (source)}

spaCy has a pipeline plugin to detect negations. There is also training code for negBert, but there is no ready-to-go model.

The broader task of Negation Finding is finding the level of Certainty. From completely Affirmed to a Speculative form to Negated. The prepositions before or after a word determine the polarity of the assertion.

^{Examples of different levels of certainty (source)}

22 - Spell Checker

Wed, 23 Sep 2020 10:00:00 GMT

Spell Checkers can recommend corrections on three levels: subword level, word level and sentence level. Spell Checkers evolved from rule-based to deeplearning models. Spark-NLP has a contextual spell checker model and spaCy has a contextual spell checker and a spellchecker based on Hunspell, which is the spell checker of Chrome, Firefox and OpenOffice.

Words that should not be corrected can sometimes be defined as exceptions. This can be done with exception classes, which might overlap strongly with Named Entities. For example, the string ‘Aug-69’ is from the class Date and should not be corrected.

^{Word Spell Checker example (source)}

Spelling correction (but also Lemmatization, Stemming and Normalization) helps you decrease the unique number of tokens in your vocabulary, which improves performance in NLP tasks. Especially when you have noisy textual data like tweets.

An interesting application of Spell Checking is in OCR. The quality of extracted text from OCR can be improved by checking whether there are low-probability words or out-of-context words or out-of-vocabulary words that need to be corrected.

21 - Normalization

Tue, 22 Sep 2020 10:00:00 GMT

Besides Stemming or Lemmatizing, there still might be a need to edit words to move to more default words.

For example, transform word numerals into numbers, handle emoji, substitute contractions (I’m → I am), replace repetitions (Yeaaaaaahhhh → Yeah), or remove gender to prevent gender bias in your model (map all he, his, she, her, etc. to a default form).

You can also normalize spèçíâl characters with Python’s unicodedata. This has, for example, the effect that accents are removed. Curly quotes can be converted to their ASCII equivalent as well, though that takes an explicit character mapping on top of Unicode normalization. An advantage for simplicity, although you lose the directionality of the quote.

^{Word Normalization example (source)}
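A few of these normalization steps can be sketched with the standard library. Unicode normalization by itself leaves curly quotes unchanged, so the sketch adds a small translation table. The table `QUOTES` and the function names are assumptions made for this example, and contractions or word numerals would need lookup tables that are not shown.

```python
import re
import unicodedata

# Assumed mapping from curly quotes to their ASCII equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def squeeze_repetitions(text):
    """Collapse runs of three or more identical characters into one."""
    return re.sub(r"(.)\1{2,}", r"\1", text)

def normalize(text):
    return squeeze_repetitions(strip_accents(text.translate(QUOTES)))

print(strip_accents("spèçíâl"))        # special
print(normalize("\u201cYeaaaaaahhhh\u201d"))  # "Yeah"
```

Collapsing every run of three or more characters is crude: it leaves legitimate double letters alone, but it would also shorten intentional runs such as ‘www’.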

20 - Lemmatization

Mon, 21 Sep 2020 10:00:00 GMT

Lemmatization usually refers to rewriting a word to its base form properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

^{Word Lemmatization example (source)}

19 - Stemming

Sun, 20 Sep 2020 10:00:00 GMT

Stemming refers to a crude heuristic process that chops off the ends of words in the hope that words with the same meaning end up with the same form, so that fewer unique words remain. This often includes the removal of derivational affixes.

^{Word Stemming example (source)}
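How crude such a heuristic can be is shown by the following sketch. It is not the Porter algorithm or any other published stemmer; the suffix list `SUFFIXES` and the function `crude_stem` are assumptions made for this example.

```python
# Assumed suffix list, longest suffixes first so they are tried before shorter ones.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def crude_stem(word, min_stem_length=3):
    """Chop off the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for word in ["walking", "walked", "walks", "running", "was"]:
    print(word, "->", crude_stem(word))
```

‘walking’, ‘walked’ and ‘walks’ all collapse to ‘walk’, while ‘running’ becomes ‘runn’, which is not a real word. That is the trade-off with lemmatization: stemming needs no vocabulary, but the result is not guaranteed to be a dictionary form.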

18 - Dependency Parser

Sat, 19 Sep 2020 10:00:00 GMT

A Dependency Parser extracts a dependency graph from a sentence. In the graph the grammatical structure is represented and relationships are defined between head-words and words, which modify these heads. For example, most sentences have a root-word, which is the main verb, a subject and an object. All tags are defined on the Universal Dependency Relations website. You can also play with the displaCy Dependency Visualizer demo.

^{{ **subject** } had { **object** } example with displaCy Dependency Visualizer (source)}

Dependencies are also great for Information Extraction, because there often is an object-subject relation. spaCy is very useful for navigating the dependency parse tree as you can see in this example.

17 - Part-of-Speech Tagger

Fri, 18 Sep 2020 10:00:00 GMT

The Part-of-Speech (POS) tagger marks each word with a POS-tag, based on its context and definition. So, for example, the word ‘answer’ is tagged as a Noun or as a Verb, depending on its context.

^{Part-of-Speech tags ( **S** : sentence, **NP** : noun phrase) (source)}

There are many variants of POS-tag schemes and their abbreviations. In some schemes (e.g. in the Penn Treebank below) the POS tag includes some morphological information. This (partly) depends on the morphological richness of a language. There is also a Universal Scheme for POS tags.

^{Penn Treebank for POS-tags (source)}

POS-tags are (or were) useful for building lemmatizers and NER systems, but also for information retrieval with rule-based token patterns. For example, see this spaCy demo for rule-based search for token patterns.

16 - Morphological Tagger

Thu, 17 Sep 2020 10:00:00 GMT

The task of the Morphological Tagger is to assign additional morphological information to each token. This can be the gender, case, person, tense, etc. UniMorph and UniversalDependencies are projects to describe all morphological features in a universal schema.

^{**FIN** indicates a finite verb. **IND** indicates the indicative mood. **PFV** indicates the perfective aspect. **PST** indicates the past tense. **2** indicates the second person. **SG** indicates the singular number. **INFM** indicates the informal register. (source)}

The Spanish word ‘hablaste’ can be represented as the lemma ‘hablar’ plus the bundle [FIN;IND;PFV;PST;2;SG;INFM]. This bundle describes all grammatical features for the word. The [V] stands for Verb and is the Part-of-Speech.

Some languages (like German and Arabic) mark a lot of information through morphology, giving them a rich morphology. Other languages (like Mandarin) have almost no morphology, so whatever needs to be encoded gets encoded through syntax. Syntax means: by adding more words and by changing word order.

15 - Vocabulary Building

Wed, 16 Sep 2020 10:00:00 GMT

The goal of tokenization is a vocabulary. Some extra properties might be interesting to include in the vocabulary. Count the word types per document and for the entire corpus for probability measures (like TF-IDF) to reflect how important the word type is.
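Counting word types per document and per corpus can be sketched as follows. It uses one common, unsmoothed TF-IDF variant (term frequency times the log of the inverse document frequency); other definitions exist, and the function name `tf_idf` is an assumption made for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word_type: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word type occur?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

documents = [["red", "wine"], ["red", "roses"]]
for doc_scores in tf_idf(documents):
    print(doc_scores)
```

‘red’ occurs in every document and therefore scores 0, while ‘wine’ and ‘roses’ score higher because each is specific to one document.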

14 - Tokenization

Tue, 15 Sep 2020 10:00:00 GMT

Tokenization is the task of splitting raw text into smaller fundamental units; word tokens. This task is required for almost any NLP task. The goal is to build a vocabulary of word types (in spaCy: lexemes). A word type is a distinct word, in the abstract, rather than a specific instance. Word types are word tokens with no context. A word token is a word (string) observed in a piece of text.

How large your vocabulary should be, is a trade-off between memory limitations vs. coverage of word tokens. Each word token is converted to a unique id per word type. This is for performance reasons, because many NLP tasks make use of intensive (matrix) calculations.

^{spaCy’s rulebased Tokenizer (source)}

The text is generally split up into tokens that match words. But the text can also be split up into subwords or characters. The level of tokenization (or granularity) depends on the NLP-task and the target size for the total number of tokens for your vocabulary. The larger the vocabulary size the more common words you can tokenize and the more memory you need. The smaller the vocabulary size the more subword tokens you need to avoid having to use the -UNK- token (unknown).

Above is an example of a word-level tokenizer. Techniques for subword tokenization are often used for training deeplearning models and have names like Wordpiece, Unigram and Byte Pair Encoding (BPE). For example, BPE ensures that the most common words will be represented in the new vocabulary as a single token, while less common words will be broken down into two or more subword tokens. More details here.

^{Tokenization levels vs Vocabulary size (source)}

Once you choose a tokenizer and train a model on the tokenized data, you should always use that same tokenizer when using the model!
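The conversion of word tokens to ids and the fallback to the -UNK- token can be sketched with a frequency-based vocabulary. The function names `build_vocabulary` and `encode` and the choice to reserve id 0 for -UNK- are assumptions made for this example, not the behaviour of a particular library.

```python
from collections import Counter

UNK = "-UNK-"

def build_vocabulary(tokens, max_size):
    """Map the most frequent word types to ids; id 0 is reserved for UNK."""
    most_common = [word for word, _ in Counter(tokens).most_common(max_size - 1)]
    return {word: idx for idx, word in enumerate([UNK] + most_common)}

def encode(tokens, vocabulary):
    """Replace each token by its id, falling back to the UNK id."""
    return [vocabulary.get(token, vocabulary[UNK]) for token in tokens]

tokens = "the cat the cat the mat".split()
vocabulary = build_vocabulary(tokens, max_size=3)
print(vocabulary)                                   # {'-UNK-': 0, 'the': 1, 'cat': 2}
print(encode(["the", "mat", "cat"], vocabulary))    # [1, 0, 2]
```

With a maximum size of 3 the rare word ‘mat’ no longer fits and is mapped to the unknown id. That is the coverage side of the memory versus coverage trade-off.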

13 - Rulebased Training Data

Mon, 14 Sep 2020 10:00:00 GMT

A solution to scale up your training data is by programmatically building training datasets without manual labeling. The idea is to define heuristic rules which are used in functions for labeling training data.

^{Writing labeling functions in Snorkel (source)}

Since the labeling functions have unknown accuracies and correlations, their output labels may overlap and conflict. By using a model to automatically estimate the accuracies and correlations, reweight and combine the labels, you can produce a final set of clean, integrated training labels.

The goal is to use the resulting labeled training data points to train a machine learning model that can generalize beyond the coverage of the labeling functions. This Python tutorial uses Stanford’s Snorkel library for this purpose. It’s quite successful, because the team is building a business solution around the concept.
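The idea of labeling functions can be sketched without any library. The three keyword rules are hypothetical, and the combination step is a plain majority vote, which is a simplification: it does not estimate accuracies or correlations the way the model described above does, and it is not Snorkel’s API.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical heuristic rules; each one either votes or abstains.
def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN

def lf_negated_great(text):
    return NEGATIVE if "not great" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_negated_great]

def majority_label(text):
    """Combine the votes; abstain when no function fires or on a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(vote for vote in votes if vote != ABSTAIN).most_common()
    if not counts:
        return ABSTAIN
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

for text in ["A great movie", "Terrible plot", "Not great", "It was fine"]:
    print(text, "->", majority_label(text))
```

‘Not great’ shows the overlap and conflict problem: one function votes positive and another negative, so the majority vote abstains. A learned label model would instead weigh the two functions by their estimated accuracy.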

12 - Textual Data Augmentation

Sun, 13 Sep 2020 10:00:00 GMT

The amount of available textual (training) data influences the performance of many NLP tasks. If collecting more data is not an option, there are different techniques for boosting performance on your NLP task.

Data augmentations are a standard part for Computer Visions tasks. However, due to the grammatical structure, the task is much more delicate for textual data and Natural Language Generation.

Here are some examples of how the textual data is transformed by Easy Data Augmentation (EDA) techniques and Back Translation:

^{Textual Data Augmentation Techniques (source)}

Data Augmentation might not help, but it’s worth the shot if you are stuck. Whatever you do; do not validate with augmented textual data!
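Two of the simpler EDA operations, random swap and random deletion, can be sketched with the standard library. This is an illustration rather than the reference EDA implementation; the function names and the default deletion probability are assumptions. Synonym replacement and Back Translation need external resources and are not shown.

```python
import random

def random_swap(tokens, rng):
    """Swap the tokens at two randomly chosen positions."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    i, j = rng.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, but keep at least one token."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [token for token in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)  # seeded so that the augmentation is reproducible
tokens = "the summer of 1969 was warm".split()
print(random_swap(tokens, rng))
print(random_deletion(tokens, rng))
```

Both operations ignore grammar, which is exactly why augmented sentences should only be added to the training split.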

11 - Crowdsourcing Marketplace

Sat, 12 Sep 2020 10:00:00 GMT

Remote workers are often recruited to complete labor-intensive tasks like building training datasets. Amazon’s Mechanical Turk (MTurk) is the leading platform for this task. Work is defined in human intelligence tasks (HITs), which range in length, complexity and compensation.

^{MTurk mechanism (source)}

Outsourcing work is great, but there are limits. While MTurk is often considered a cheap solution to data collection, there are actually many hidden costs. You must be very explicit in defining HIT descriptions. A lot of effort is required for managing the project and monitoring quality control.

It is recommended to start labeling yourself. You will experience a lot of trial and error in the labeling scheme and the annotation and tag definitions. Only proceed if you have confidence in your labeling scheme and have applied it to build a Proof-of-Concept. So, experiment and fine-tune the training data definition yourself and then scale up to MTurk.

10 - Training Data Provider

Fri, 11 Sep 2020 10:00:00 GMT

Gold data refers to data of very high quality, which is more or less as close as you can get to the ground truth. This is the data you want to use when training a new Language Model. Some Data Providers sell these high quality datasets. However, the use of an off-the-shelf dataset depends on the usability of the data. This depends on the NLP-Task, Language, Domain and Tag-schema. I’m skeptical of using third-party datasets as they almost never match your purpose. Unless you use it for a default NLP task or a demo purpose or in addition to your own training dataset.

A good starting point for an overview of company- and research datasets for various tasks can be found in the Big Bad NLP database.

09 - Annotation with Active Learning

Thu, 10 Sep 2020 10:00:00 GMT

It might not be useful to build a training dataset for Named Entity Recognition with 2000 annotations, including 100 occurrences where ‘Barack Obama’ is tagged as a Person. You only want to annotate sentences where the model is least sure about the prediction.

With active learning, the model chooses which sentences should be selected for annotating. Other sentences are skipped, because the model is more certain about those annotations.
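One simple selection strategy is least-confidence sampling, sketched below. The confidence scores are made up for the example, and the function name `least_confident` is hypothetical; this does not describe how any particular annotation tool selects examples.

```python
def least_confident(predictions, n):
    """predictions: list of (sentence, probability of the predicted label)."""
    ranked = sorted(predictions, key=lambda item: item[1])
    return [sentence for sentence, _ in ranked[:n]]

# Made-up model confidences for three unlabeled sentences.
predictions = [
    ("Barack Obama visited Paris.", 0.99),
    ("Apple fell from the tree.", 0.52),
    ("Jordan scored again.", 0.61),
]
print(least_confident(predictions, n=2))
# ['Apple fell from the tree.', 'Jordan scored again.']
```

The sentence with ‘Barack Obama’ is skipped because the model is already almost certain about it, while the ambiguous ‘Apple’ and ‘Jordan’ sentences are sent to the annotator first.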

The makers of spaCy made the annotation tool Prodi.gy, which is powered by active learning (video below).

This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.

08 - Manual Annotation

Wed, 09 Sep 2020 10:00:00 GMT

The good old fashioned manual labor. Annotate a sentence, paragraph or document for your task. For example: tag a word with a Part-of-Speech tag, or Dependency tag. Tag one or more words as a Named Entity. Or tag a sequence of words with a Category-tag.

^{Manually captured annotations (source)}

Over the years, numerous annotation tools have been developed. Here is a list with almost a hundred annotation tools. A lot of them have terribly inefficient user interfaces or are not actively developed.

07 - Text Extraction and OCR

Tue, 08 Sep 2020 10:00:00 GMT

Extracting text from files is a fragile process if you don’t know what to expect from the format of the input files. The Java-based Apache Tika can handle hundreds of file extensions, has a Python port and is your best option. If you are looking for other (Python) packages to extract text, always check the underlying required packages.

However, Tika outputs are often not perfect. The PDF format is especially challenging. PDF was designed as an output format, resulting in a perfectly readable document, but it is difficult to reduce the file back to the source text. The broken lines in PDFs are particularly difficult to clean. Should you remove the line end to restore the sentence, or would that merge a title that has no closing period with the next sentence? PDFs with multiple columns on a page are also often parsed wrongly.

^{OCR on a Polish receipt (source)}

Older PDFs do have extra challenges, because they may contain images of text and not digital text. For this an Optical Character Recognition (OCR) model is needed that is trained for your language and Font type. It has a lot of subtasks like adjusting the scan to the right angle, converting the image from color to black and white, smoothing edges, normalizing the aspect ratio and analyzing the layout for zones like columns, paragraphs or tables. Widely used libraries are OpenCV and Tesseract-OCR which was started decades ago by HP and is now maintained by Google. There is also a Python wrapper.

06 - Text and File Scraping

Mon, 07 Sep 2020 10:00:00 GMT

If you need information from the web, but there is no API or other structured way of retrieving the data, then you might want to scrape textual data, or files.

However, there are some challenges. When visiting a lot of webpages in the same domain in a short time, there’s a good chance your IP address gets blocked by the server. To prevent this you probably want to handle request headers, manage cookies, automatically log in, handle popups and browse JavaScript-generated websites in ‘headless’ mode.

After getting access to the page source of the webpage, one has to define the logic to extract values from the HTML. For example, retrieve all the text from the news article, but not the text from the advertisements or menu buttons.

Very popular Python packages to get started with building successful scrapers are: Scrapy, Beautiful Soup, Urllib and Selenium.
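The extraction logic, keeping the article text and skipping menus and advertisements, can be sketched with Python’s built-in html.parser. The assumption that the content sits in an `<article>` element and that `<nav>` and `<aside>` hold the menu and advertisements is made up for this example; real pages need their own selectors, and the class name `ArticleTextExtractor` is hypothetical.

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Collect text inside <article>, skipping <nav>, <aside>, <script>, <style>."""
    SKIP = {"nav", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.in_article = 0   # nesting depth of <article> elements
        self.skip_depth = 0   # nesting depth of elements to ignore
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article -= 1
        elif tag in self.SKIP:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_article and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><body><nav>Menu</nav><article><h1>Title</h1>"
        "<p>Body text.</p><aside>Buy now!</aside></article></body></html>")
print(extract_article_text(page))  # Title Body text.
```

Packages such as Beautiful Soup make the same selection shorter to write, but the principle stays the same: decide which parts of the HTML tree are content and which are not.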

05 - Loading from API

Sun, 06 Sep 2020 10:00:00 GMT

An API serves as the interface between different applications. The requestor automatically gets access to data, with the benefit that the source doesn’t have to know how the other system exactly works. Unfortunately, few organizations maintain their data for (free) external use by means of an API. Good examples are the Twitter-API (streaming), NewsAPI and GuardianAPI (batch requests).

^{API scheme (source)}

04 - Generating a Corpus

Sat, 05 Sep 2020 10:00:00 GMT

A Corpus is a language resource consisting of a structured set of documents and additional information. A corpus can consist of pre-processed documents. So it might be the output of other NLP tasks. It’s important to know the Corpus Balance. What does the corpus represent; which languages, genres, domains and for which NLP-tasks can it be used. You can load, for example, the corpora that are built-in in the NLTK library.

^{Wikipedia as source for corpus (source)}

Wikipedia is one of the most popular open data sources for building a corpus. Select and download a Wikipedia dump. Clean the articles with wikiextractor and you have a great corpus for starters. Also Commoncrawl is often used, but you have to be able to handle all those gigabytes.

03 - Loading Structured Datafile

Fri, 04 Sep 2020 10:00:00 GMT

If you are lucky there is a ready-to-use file in a valid CSV- or JSON-format with raw text documents in it. This means other people have made an effort for you to save all retrieved documents into one file. This file is a good starting point for your task, provided that you know the constraints and filters that were applied when the structured datafile was created.
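Loading such a file takes only a few lines with the standard library. The sketch below reads the data from in-memory strings so that it is self-contained; the column name `text` is an assumption about how the file is structured.

```python
import csv
import io
import json

# Stand-ins for the contents of a CSV file and a JSON file.
csv_data = 'id,text\n1,"Red wine, please."\n2,Summer of 69\n'
json_data = '[{"id": 1, "text": "Red wine, please."}, {"id": 2, "text": "Summer of 69"}]'

# csv.DictReader handles quoted fields that contain the delimiter.
csv_documents = [row["text"] for row in csv.DictReader(io.StringIO(csv_data))]
json_documents = [record["text"] for record in json.loads(json_data)]

print(csv_documents)   # ['Red wine, please.', 'Summer of 69']
print(json_documents)  # ['Red wine, please.', 'Summer of 69']
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")` and state the encoding explicitly.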

02 - Manual Typewriting

Thu, 03 Sep 2020 10:00:00 GMT

Why not just type your own text and process it in your NLP task? Creating your own content is the best test in demos. Hopefully, it increases your confidence in the NLP task that you are testing and lets you insert all the rare cases you can think of.

01 - Bits to Character Encoding

Wed, 02 Sep 2020 10:00:00 GMT

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). It all starts with loading the textual data in the right encoding. The former CEO of StackOverflow wrote an interesting overview about The Absolute Minimum you should know about Unicode and Character Sets.

^{Non-ASCII characters as replacement (source)}

If you have problems with encodings, try the ftfy library to fix them. If you don’t know what encoding your data has, try chardet or charset_normalizer to detect the encoding.

About Innerdoc

Sat, 01 Aug 2020 10:00:00 GMT

Helping you with Text Analytics tasks.

Do you recognize this? You have an important business problem. All information is available. But … you have to collect, search and analyze it. And the volume … it is hidden in hundreds or thousands of documents. You certainly won’t make this a manual task, or spend a lot of time on it.

Innerdoc’s services make use of Text Mining and Natural Language Processing techniques. We make your textual data useful again. We extract information from Word documents, PDFs, articles, emails, web content, Electronic Health Records, you name it. Innerdoc will analyze and present you the hidden knowledge and information from within the documents.

innerdoc
inner document
get value from documents

We extract all textual data from the documents and transform this unstructured content into structured information. Innerdoc provides a structured overview to all your documents. We provide all the information, but together we decide which problem we solve.

Search, Semantic Meaning, Interactive Apps
Topic Modeling, Data Quality and Linkage
Named Entities, Question Answering

We are ready to help you! Contact us at hey@innerdoc.com

Topics for Innerdoc posts

Mon, 02 Feb 2015 23:46:37 GMT

Topics

NLP basics; tokenization, sentencizer, lemma, pos, dep,
ontology making, schema.org, domain specific, input for NER
NER, input for NEL, how to define NER schema (schema.org, ontotext)
NEL, linked to a knowledge graph
KG - generating, using/querying, examples GKG, Bing, linked data
healthcare KG, SNOMED
NER vs Timelines; books and its main characters mentions
Abbreviations
Definitions
relation extraction, network of people within a text, hearst patterns
(key)word usage, scattertext, anti-wordclouds
topic clustering t-SNE
Korfball blogposts
NLP is hard, buffalo buffalo, siri, i’m bleeding, bigram/trigram
spider web visualizations
web scraping
document extraction
story metrics
sentiment
wordnet
MLML - https://awesome-streamlit.azurewebsites.net/ https://raw.githubusercontent.com/MarcSkovMadsen/awesome-streamlit/master/gallery/medical_language_learner/medical_language_learner.py
similarity - TFIDF gensim.summarisation.BM25- https://github.com/jroakes/tech-seo-crawler/blob/master/lib/doc_frequency.py

Resources

https://web.stanford.edu/~jurafsky/slp3/slides/21_SentLex.pdf
http://liwc.wpengine.com/compare-dictionaries/
https://github.com/Ejhfast/empath-client/blob/master/empath/data/categories.tsv
https://journals.sagepub.com/doi/full/10.1177/2056305118797724
https://www.researchgate.net/publication/336496045_A_Survey_of_Computational_Approaches_and_Challenges_in_Multimodal_Sentiment_Analysis