<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Innerdoc]]></title><description><![CDATA[Deep Text Search. Made Intelligent.]]></description><link>https://www.innerdoc.com</link><generator>GatsbyJS</generator><lastBuildDate>Thu, 04 Feb 2021 16:58:19 GMT</lastBuildDate><item><title><![CDATA[Custom Solutions for Businesses]]></title><description><![CDATA[Some of our core-tasks: Setting up a Project 🦉 Getting an Understanding of the Product Vision and Context of the larger application or…]]></description><link>https://www.innerdoc.com/business-solutions/</link><guid isPermaLink="false">https://www.innerdoc.com/business-solutions/</guid><pubDate>Wed, 10 Mar 2021 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Some of our core-tasks:&lt;/p&gt;
&lt;br&gt;
&lt;h3&gt;Setting up a Project 🦉&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Getting an &lt;mark&gt;Understanding of the Product Vision and Context&lt;/mark&gt; of the larger application or business process where the model will be used&lt;/li&gt;
&lt;li&gt;Defining what &lt;mark&gt;Accuracy&lt;/mark&gt; is needed to achieve the usage goals&lt;/li&gt;
&lt;li&gt;Defining what the &lt;mark&gt;Annotation Scheme and Corpus construction&lt;/mark&gt; should look like to facilitate consistent annotations and easy decisions from the models&lt;/li&gt;
&lt;li&gt;Initiating the &lt;mark&gt;Annotation process&lt;/mark&gt; for human annotators, semi-automated annotations and a good &lt;mark&gt;quality control&lt;/mark&gt; process &lt;/li&gt;
&lt;li&gt;Building the &lt;mark&gt;Model Architecture&lt;/mark&gt;, training and evaluating it&lt;/li&gt;
&lt;li&gt;&lt;mark&gt;Iterating&lt;/mark&gt; on the above steps and improving code and data with smart choices, parameter tuning, tricks and sweat&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Gathering and Ingesting Textual Data 💬&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;mark&gt;Extracting&lt;/mark&gt; textual data from your CMS, ERP and other information systems (Database queries and API calls)&lt;/li&gt;
&lt;li&gt;Automated &lt;mark&gt;Scraping&lt;/mark&gt; of content from websites&lt;/li&gt;
&lt;li&gt;Handling data files like &lt;code class=&quot;language-text&quot;&gt;csv&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;json&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;xml&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;xlsx&lt;/code&gt; and many more&lt;/li&gt;
&lt;li&gt;Interpreting document types like &lt;code class=&quot;language-text&quot;&gt;pdf&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;docx&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;rtf&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;txt&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;epub&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;ppt&lt;/code&gt; and more&lt;/li&gt;
&lt;li&gt;Interpreting special document-types like &lt;code class=&quot;language-text&quot;&gt;emails&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;contracts&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;invoices&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;resumes&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;patient files&lt;/code&gt; and other &lt;mark&gt;Templated formats&lt;/mark&gt;&lt;/li&gt;
&lt;li&gt;Creating a domain specific &lt;mark&gt;Corpus&lt;/mark&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Processing Textual Data ✨&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Parsing different languages like &lt;mark&gt;Dutch&lt;/mark&gt; and &lt;mark&gt;English&lt;/mark&gt;&lt;/li&gt;
&lt;li&gt;Finding sentences, words and punctuation.&lt;/li&gt;
&lt;li&gt;Recognizing word types (&lt;code class=&quot;language-text&quot;&gt;noun&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;verb&lt;/code&gt;, etc.), syntactic types (&lt;code class=&quot;language-text&quot;&gt;object&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;subject&lt;/code&gt;, etc.), base forms of words (the &lt;code class=&quot;language-text&quot;&gt;lemma&lt;/code&gt; of ‘was’ is ‘be’)&lt;/li&gt;
&lt;li&gt;Labeling &lt;mark&gt;Named Entities&lt;/mark&gt; (NER) like &lt;code class=&quot;language-text&quot;&gt;Persons&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Organizations&lt;/code&gt; (proper and common!), &lt;code class=&quot;language-text&quot;&gt;Locations&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Places&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Nationalities&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;Labeling &lt;mark&gt;Named Value notations&lt;/mark&gt; like &lt;code class=&quot;language-text&quot;&gt;Money&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Email Addresses&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;URLs&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Time&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Phone Numbers&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;Quantity&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;Finding &lt;mark&gt;Similarity&lt;/mark&gt; between words, sentences, paragraphs, documents for recommendation and automated learning.&lt;/li&gt;
&lt;li&gt;Labeling documents with your domain-specific tags for &lt;mark&gt;Text Classification&lt;/mark&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Advanced Analytics Projects 🚀&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;We train &lt;mark&gt;custom language Parsers&lt;/mark&gt; that are optimized for your &lt;mark&gt;domain-specific&lt;/mark&gt; content (Retail, Healthcare, Legal, etc.)&lt;/li&gt;
&lt;li&gt;We create advanced analyses for information extraction with &lt;mark&gt;classification&lt;/mark&gt;, &lt;mark&gt;clustering&lt;/mark&gt; and &lt;mark&gt;visualization&lt;/mark&gt; of the results&lt;/li&gt;
&lt;li&gt;We find &lt;mark&gt;relations&lt;/mark&gt; among concepts by &lt;mark&gt;statistical and rule-based search&lt;/mark&gt; based on linguistic features&lt;/li&gt;
&lt;li&gt;We provide &lt;mark&gt;insight&lt;/mark&gt; into your textual data so you can make the right &lt;mark&gt;decisions&lt;/mark&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;We are ready to help you! Contact us at &lt;a href=&quot;mailto:hey@innerdoc.com&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;hey@innerdoc.com&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Datashare]]></title><description><![CDATA[Datashare OSINT]]></description><link>https://www.innerdoc.com/datashare/</link><guid isPermaLink="false">https://www.innerdoc.com/datashare/</guid><pubDate>Wed, 10 Mar 2021 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Datashare&lt;/p&gt;
&lt;p&gt;OSINT&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Timeline Demo for a History of NLP]]></title><description><![CDATA[This demo was created to show the integration of Streamlit with TimelineJS. It became a timeline about the history of Natural Language…]]></description><link>https://www.innerdoc.com/nlp-timeline/</link><guid isPermaLink="false">https://www.innerdoc.com/nlp-timeline/</guid><pubDate>Thu, 04 Feb 2021 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This demo was created to show the integration of &lt;em&gt;Streamlit&lt;/em&gt; with &lt;em&gt;TimelineJS&lt;/em&gt;. It became a timeline about the &lt;em&gt;history of Natural Language Processing&lt;/em&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/nlp-timeline-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1195px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 81.4%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABqUlEQVQ4y5WTWYsCQQyE+///Ix9EEDwQ0UdBGTwQvM/xviJfoIZ21mXZQKaPdFdXKpnwer0Mxxjv97stl0vb7XZ2vV7t+Xxm8bzpbnw/xBu32802m42Nx2ObTCa23W4dFOORfr9v8/ncY/lHtP4AfDwezoyL3W7XwQXIXqFQsHq9bo1Gww6Hg53PZzudTn7vK8PL5WLT6dSGw6GNRiMHRwLseDw6Y/aYM/Lger32zDLAmO5gMLBisWitVsva7bZVq1Vnhu33e0+X1AESOHOIyAIMxAJm5XLZKpWKO/MkSTzGZdgvFgsHns1mPuKkDilSD3ykAYClUsl1giFMSV0MARQYzppRgHTER8pi2Gw2rVarecpimKZpph8FiV0a/qgy6XQ6Hev1eq4dDwAUP/qbCSfYPyxu5HxTf7QNhg5UDq2oHKmpz9SL0unbH5I1NhqQljQCEGD+BiQAnEIwEmePOP2nFiLOPlhBAIyxc4EqrlYrB9LvSIy1epAqc5ZzrDNAHZLn9zgslrHHDyBP4INejNIMRwJS0Do+F691hjlah2/VygstoyB/nXkDgmrNg5l72hsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/f3febac48fbabafbdb768700226c922b/08158/demo-timeline.png&quot;
        srcset=&quot;/static/f3febac48fbabafbdb768700226c922b/0eb09/demo-timeline.png 500w,
/static/f3febac48fbabafbdb768700226c922b/1263b/demo-timeline.png 1000w,
/static/f3febac48fbabafbdb768700226c922b/08158/demo-timeline.png 1195w&quot;
        sizes=&quot;(max-width: 1195px) 100vw, 1195px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Visit &lt;a href=&quot;/nlp-timeline-demo&quot;&gt;demo&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Detecting Emotions in Christmas Lyrics]]></title><description><![CDATA[It started with the idea for a fun pre-Christmas project and with the help of Plutchik’s Wheel of Emotions and Zero-shot emotion…]]></description><link>https://www.innerdoc.com/christmas-lyrics-emotions/</link><guid isPermaLink="false">https://www.innerdoc.com/christmas-lyrics-emotions/</guid><pubDate>Mon, 21 Dec 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It started with the idea for a fun pre-Christmas &lt;a href=&quot;/christmas-lyrics-emotions-demo&quot;&gt;project&lt;/a&gt; and with the help of &lt;em&gt;Plutchik’s Wheel of Emotions&lt;/em&gt; and &lt;em&gt;Zero-shot emotion classification&lt;/em&gt; with Transformers, it became this &lt;a href=&quot;/christmas-lyrics-emotions-demo&quot;&gt;demo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/christmas-lyrics-emotions-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 733px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 145.79999999999998%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAdCAIAAAAl5NuSAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAEs0lEQVQ4y32V+28UVRTH9+/wR3/RqD8QfJEQQQP+YIqxaLXY0EAEgkFNJIaoJKCCRozUxNoiba1iFaS8SqGktZQ+trRdSnfb7XR3h+50Ybs7s7Pzfr8fntlt6Qs8uTuZvXM/53zvufeeG7IsyzBMHX6mocgyx4MJLMsJoiRJMsOw8NR1Q1FUz/NXWUiURJIq8qJA0TQvCALHcTTNMQxFFiWBp2m6WKTABfC+vxbmhfHR6HRsJhlPIVMJBMWm781Noxj0xVEsjWUm7sYSiZS3Ni7AdJoZa7ozejoycmps6hJynXSaMfO3jHk265zGrLwWMI8CS7DruJZuLTTDtnzfcIKmW5Juu24J8x5Dh5a9ByM0YtSS0r6rWOJF35cU+Z4kzjiO/mjZi6Dje67v2cT4Lwxy1rOSRXSXqY7nc+2zqROeZ5fCO2vhwKWr8h52Q01eYBIdYjaqCVeJ6a0qW8cxMZ4bzuKd2cI5zSQeClyES3o8iRUjbXOnKqW+Ax6XUvj6B9GNKr2Xokcy+JGByGtJ7OfS5N3Vsm2mSHR16QU6374PPbFOOLtRmz3IYjU8WTUW3zUUfQGZ2yWpmXS+G6fvLA8ewNxU7F5Dg0HgnpIsXtuX+/0VumdzcfYDBNkcjm6ZmXtX1kYVVbqbaIogjStgz/c0HCe7w8yZgfynveSHP+WOfJQ+/g45+HYc3RFGdvando7M1UUzbbF0u2rQK2BXM9X+5GRt89j640PPHJ1/sj6zoSW6vXXw/In2mdoLYzV/h6tbBrY291ac7q6YwS9pJluivYWEOazMRzP3f+1Hj3X8ueHQgeeqvnv92/3tyb03fzzYvucKcrRn4qsI2pbK9+Ec4rrWioSJOI9eiPmqg6GZivVvVT775qaXK5//5NirX368u/nwob7GODVrWsbIXN8DJv0w7AJcRObDR68KGPVvXW/109WVT1W8uKVqXdWel3bs3n/x65rzX3w/1Jrnsq29J3uil5fv1gDWeZXFCPRWquW91mtHrg+eG6j9puGJN2q2ff5DLDtzLt71/pnPrkzeZEWSV9jVS1W2RFcq3DRMZAXD9jtRcltj1x9TuGE7cTI/hEZuIP2CJi/XvAQ7pmOqJryEJ+6PRjCMMyr+GR8j1OHM7Mm+XlZWNduwXXvV8Qh5ixZsNdvpu4113kwSjPTX+CzOyWmGvjw1Oc/QcF7//0gG2xya43q24+qO/5gSsAyGMlcAIws0RbEsQ1FFluOCc7akyHMXpT1MtVfqCxmGIYmiLCuqqkKJVFQNbMW4JV0LLtwADTSGyr0FgoQS6zhONjsvCBJFUTheME2TphlDN6TAglTblg2V1DENsViwDb0M+4IgKopmmhZBFEQRqo/C8yL4UlXNsmyou1DXg5iuB1U+J1vdKJXltZDruo7tMDQTMIpKForlKcBf+ATVX1YUhmFAgmVacBO4htYxr20fpLvxIHIwE1ClaXp5tcC9bdsQEJzCPQJfIS74hToL0mD8cJw63DQZSdDBOsONA9dCeVZGYCZMEW4P6M/l8oDpug4SgtoKs/Ds2PmJ+k110x1TARwoLxm8Q2RooBB8BNs+MKM8AFIDCYM0Z2O5Ww1D+Wn8P00YA8uoE/dAAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/54d0c06ec4b01f9e30906ef818978796/6e70f/christmas-lyrics-emotion-detector-demo.png&quot;
        srcset=&quot;/static/54d0c06ec4b01f9e30906ef818978796/0eb09/christmas-lyrics-emotion-detector-demo.png 500w,
/static/54d0c06ec4b01f9e30906ef818978796/6e70f/christmas-lyrics-emotion-detector-demo.png 733w&quot;
        sizes=&quot;(max-width: 733px) 100vw, 733px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Visit &lt;a href=&quot;/christmas-lyrics-emotions-demo&quot;&gt;demo&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Periodic Table of Natural Language Processing Tasks]]></title><description><![CDATA[Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/</guid><pubDate>Tue, 01 Dec 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Russian chemist Dmitri Mendeleev published the first Periodic Table in 1869. Now it’s time for the NLP tasks to be organized in the Periodic Table style!&lt;/p&gt;
&lt;p&gt;The variation in NLP tasks and their structure is endless. Still, you can think about building NLP Pipelines out of standard NLP tasks and dividing them into groups. But what do these tasks entail?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More than 80 NLP tasks are explained and subdivided into 15 groups!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h3&gt;TASK GROUPS&lt;/h3&gt;
&lt;p&gt;👆 &lt;sub&gt;click links for tasks&lt;/sub&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/source-data-loading/&quot;&gt;Source Data Loading&lt;/a&gt;&lt;/b&gt; 💾
Tasks in this group deal with the textual data that the NLP analysis is run on.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/training-data-generation/&quot;&gt;Training Data Generation&lt;/a&gt;&lt;/b&gt; 💎
Generating gold data which is needed for training language models.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/word-parsing/&quot;&gt;Word Parsing&lt;/a&gt;&lt;/b&gt; 🔬
Splitting text into tokens and creating the first structured metadata per token.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/word-processing/&quot;&gt;Word Processing&lt;/a&gt;&lt;/b&gt; 🛠
Improving the token format on a lexical level.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/phrases-and-entities/&quot;&gt;Phrases and Entities&lt;/a&gt;&lt;/b&gt; 🗣
Recognizing multi-token phrases.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/entity-enriching/&quot;&gt;Entity Enriching&lt;/a&gt;&lt;/b&gt; 🏛
Enriching entities with structured metadata.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/sentences-and-paragraphs/&quot;&gt;Sentences and Paragraphs&lt;/a&gt;&lt;/b&gt; 📃
Working with the sense and coherence of words on a sentence- and paragraph-level.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/documents/&quot;&gt;Documents&lt;/a&gt;&lt;/b&gt; 📚
Handling the unity of textual data on the level of documents.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/natural-language-models/&quot;&gt;Natural Language Models&lt;/a&gt;&lt;/b&gt; 💃
Steps for building a respectable language model.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/supervised-classification/&quot;&gt;Supervised Classification&lt;/a&gt;&lt;/b&gt; 🚩
Classifying textual data on all kinds of levels.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/unsupervised-signaling/&quot;&gt;Unsupervised Signaling&lt;/a&gt;&lt;/b&gt; 🙈
Unsupervised discovery of important signals.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/similarity/&quot;&gt;Similarity&lt;/a&gt;&lt;/b&gt; 👯
Calculating the closeness of different chunks of text.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/natural-language-generation/&quot;&gt;Natural Language Generation&lt;/a&gt;&lt;/b&gt; 🤖
Producing content as if it were written by humans.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/systems/&quot;&gt;Systems&lt;/a&gt;&lt;/b&gt; 🚀
NLP systems as a basis for interactive applications.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;/tags/information-visualization/&quot;&gt;Information Visualization&lt;/a&gt;&lt;/b&gt; ✨
Visualizing textual information to better understand complex textual data.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;ABOUT THIS PROJECT&lt;/h3&gt;
&lt;p&gt;I have tried to make the Periodic Table of NLP tasks as complete as possible. It is therefore more of a long read than a set of self-contained blog articles.&lt;/p&gt;
&lt;p&gt;The set-up and composition of the Periodic Table is subjective. The division of tasks and categories could have been done in multiple other ways. I appreciate your feedback and new ideas in the form below. I tried to make a clear and short description for each task. I omitted the deeper details, but provided links to extra information where possible. If you have improvements, you can send them in the form below or you can contact me on LinkedIn.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;CREATE YOUR OWN PERIODIC TABLE&lt;/h3&gt;
&lt;p&gt;After continuously editing the project table, I made a ‘Periodic Table Creator’ to dynamically rebuild the figure over and over again. The code is available on &lt;a href=&quot;https://www.github.com/innerdoc&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;github.com/innerdoc&lt;/a&gt;. I built it with the help of Streamlit and, inspired by Bokeh examples, it became a dynamic creator that you can customize for your own Periodic Table!&lt;/p&gt;
&lt;p&gt;Feel free to use the &lt;a href=&quot;/periodic-table-demo&quot;&gt;Periodic Table Creator!&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 2000px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ccc40d2a4a94e88f0701109f5df3dfc5/fec90/periodic-table-of-nlp-tasks-full.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 57.00000000000001%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAIAAADwazoUAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACOElEQVQozz1R72/SUBTtP6yfzGKMfjAxGhOSfXAaB2yCTBnEbW6uDAbsB9CVtpS1A0rL+gotdpCp40fbQVte6wOcNzc35513z3kvOZjv+7Zta5qmANDr9QAAuq532u22otzouizLLUnqdDo/u11FURBQVRXtOI6DhNiv8dScugh5noeGv0D+Q3kP9R8vedt27/5YWAC/OhF66AwhnN+hhjPfuf9nhPiFYDAcWKaBwNSdIk4Bt9FQEQsXpAQNat3bhTdEcza8uWfivjtZMsgPerDf1we/e7ZrE50C12Nqinh+JmIbRTFGt8tVdpJ/Z528vydjFhM3Ch+9+eM+uOsGyM295v5e8zsupQi1kG7hpFa4lGs0DbCNgviF7lA0aSZXrJ1n49Rbo7huFoOmUoZdnq0fPU6/DrGb62xoi49lrvEsyDA6wYNGmZSxyAXYruoMyxi7z839l+PsqlEKG8WgIaQdKV+9/LaSD0S42Cc+mmgk8komB9JLMUUBLEJcxytamSkbuy/m4uPVcSlslcIj4XginlaqOyu5QJTfil59TjaSOTmdvU5T3RIn12lawYLnYqSslEhilHgyTD4d4m9Gpx9GJ2sDHp8KxxQTf3T0KshuBKuhGB/Dm/vo5xfqWUXiCKKFHXJqRripC3XzPGSchU1q2+QPLO5gJJOOyjbF0zX66w/h8LCVyoIspRK0Rjb6nKTJtZqKQdeFM3eGJgrJ82ae78J5Pg7iFgAFDWdw2Sh7x3YRi9p1nb9LVjAiVutLIAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click to enlarge!&quot;
        title=&quot;Click to enlarge!&quot;
        src=&quot;/static/ccc40d2a4a94e88f0701109f5df3dfc5/f97d7/periodic-table-of-nlp-tasks-full.png&quot;
        srcset=&quot;/static/ccc40d2a4a94e88f0701109f5df3dfc5/0eb09/periodic-table-of-nlp-tasks-full.png 500w,
/static/ccc40d2a4a94e88f0701109f5df3dfc5/1263b/periodic-table-of-nlp-tasks-full.png 1000w,
/static/ccc40d2a4a94e88f0701109f5df3dfc5/f97d7/periodic-table-of-nlp-tasks-full.png 2000w,
/static/ccc40d2a4a94e88f0701109f5df3dfc5/49142/periodic-table-of-nlp-tasks-full.png 3000w,
/static/ccc40d2a4a94e88f0701109f5df3dfc5/fec90/periodic-table-of-nlp-tasks-full.png 3014w&quot;
        sizes=&quot;(max-width: 2000px) 100vw, 2000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;
&lt;sup&gt;Click image to enlarge or download.&lt;/sup&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Demo Time!]]></title><description><![CDATA[👆 Click the pictures below for a demo! TIMELINE DEMO FOR A HISTORY OF NLP This demo is created to show the integration of Streamlit with…]]></description><link>https://www.innerdoc.com/demo/</link><guid isPermaLink="false">https://www.innerdoc.com/demo/</guid><pubDate>Sun, 29 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;👆 Click the pictures below for a demo!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;TIMELINE DEMO FOR A HISTORY OF NLP&lt;/h3&gt;
&lt;p&gt;This demo was created to show the integration of &lt;em&gt;Streamlit&lt;/em&gt; with &lt;em&gt;TimelineJS&lt;/em&gt;. It became a timeline about the &lt;em&gt;history of Natural Language Processing&lt;/em&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/nlp-timeline-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1195px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 81.4%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABqUlEQVQ4y5WTWYsCQQyE+///Ix9EEDwQ0UdBGTwQvM/xviJfoIZ21mXZQKaPdFdXKpnwer0Mxxjv97stl0vb7XZ2vV7t+Xxm8bzpbnw/xBu32802m42Nx2ObTCa23W4dFOORfr9v8/ncY/lHtP4AfDwezoyL3W7XwQXIXqFQsHq9bo1Gww6Hg53PZzudTn7vK8PL5WLT6dSGw6GNRiMHRwLseDw6Y/aYM/Lger32zDLAmO5gMLBisWitVsva7bZVq1Vnhu33e0+X1AESOHOIyAIMxAJm5XLZKpWKO/MkSTzGZdgvFgsHns1mPuKkDilSD3ykAYClUsl1giFMSV0MARQYzppRgHTER8pi2Gw2rVarecpimKZpph8FiV0a/qgy6XQ6Hev1eq4dDwAUP/qbCSfYPyxu5HxTf7QNhg5UDq2oHKmpz9SL0unbH5I1NhqQljQCEGD+BiQAnEIwEmePOP2nFiLOPlhBAIyxc4EqrlYrB9LvSIy1epAqc5ZzrDNAHZLn9zgslrHHDyBP4INejNIMRwJS0Do+F691hjlah2/VygstoyB/nXkDgmrNg5l72hsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/f3febac48fbabafbdb768700226c922b/08158/demo-timeline.png&quot;
        srcset=&quot;/static/f3febac48fbabafbdb768700226c922b/0eb09/demo-timeline.png 500w,
/static/f3febac48fbabafbdb768700226c922b/1263b/demo-timeline.png 1000w,
/static/f3febac48fbabafbdb768700226c922b/08158/demo-timeline.png 1195w&quot;
        sizes=&quot;(max-width: 1195px) 100vw, 1195px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;CHRISTMAS LYRICS EMOTION DETECTOR&lt;/h3&gt;
&lt;p&gt;It started with the idea for a fun pre-Christmas project and with the help of &lt;em&gt;Plutchik’s Wheel of Emotions&lt;/em&gt; and &lt;em&gt;Zero-shot emotion classification&lt;/em&gt; with Transformers, it became this demo.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/christmas-lyrics-emotions-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 733px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 145.79999999999998%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAdCAIAAAAl5NuSAAAACXBIWXMAAA7DAAAOwwHHb6hkAAAEs0lEQVQ4y32V+28UVRTH9+/wR3/RqD8QfJEQQQP+YIqxaLXY0EAEgkFNJIaoJKCCRozUxNoiba1iFaS8SqGktZQ+trRdSnfb7XR3h+50Ybs7s7Pzfr8fntlt6Qs8uTuZvXM/53zvufeeG7IsyzBMHX6mocgyx4MJLMsJoiRJMsOw8NR1Q1FUz/NXWUiURJIq8qJA0TQvCALHcTTNMQxFFiWBp2m6WKTABfC+vxbmhfHR6HRsJhlPIVMJBMWm781Noxj0xVEsjWUm7sYSiZS3Ni7AdJoZa7ozejoycmps6hJynXSaMfO3jHk265zGrLwWMI8CS7DruJZuLTTDtnzfcIKmW5Juu24J8x5Dh5a9ByM0YtSS0r6rWOJF35cU+Z4kzjiO/mjZi6Dje67v2cT4Lwxy1rOSRXSXqY7nc+2zqROeZ5fCO2vhwKWr8h52Q01eYBIdYjaqCVeJ6a0qW8cxMZ4bzuKd2cI5zSQeClyES3o8iRUjbXOnKqW+Ax6XUvj6B9GNKr2Xokcy+JGByGtJ7OfS5N3Vsm2mSHR16QU6374PPbFOOLtRmz3IYjU8WTUW3zUUfQGZ2yWpmXS+G6fvLA8ewNxU7F5Dg0HgnpIsXtuX+/0VumdzcfYDBNkcjm6ZmXtX1kYVVbqbaIogjStgz/c0HCe7w8yZgfynveSHP+WOfJQ+/g45+HYc3RFGdvando7M1UUzbbF0u2rQK2BXM9X+5GRt89j640PPHJ1/sj6zoSW6vXXw/In2mdoLYzV/h6tbBrY291ac7q6YwS9pJluivYWEOazMRzP3f+1Hj3X8ueHQgeeqvnv92/3tyb03fzzYvucKcrRn4qsI2pbK9+Ec4rrWioSJOI9eiPmqg6GZivVvVT775qaXK5//5NirX368u/nwob7GODVrWsbIXN8DJv0w7AJcRObDR68KGPVvXW/109WVT1W8uKVqXdWel3bs3n/x65rzX3w/1Jrnsq29J3uil5fv1gDWeZXFCPRWquW91mtHrg+eG6j9puGJN2q2ff5DLDtzLt71/pnPrkzeZEWSV9jVS1W2RFcq3DRMZAXD9jtRcltj1x9TuGE7cTI/hEZuIP2CJi/XvAQ7pmOqJryEJ+6PRjCMMyr+GR8j1OHM7Mm+XlZWNduwXXvV8Qh5ixZsNdvpu4113kwSjPTX+CzOyWmGvjw1Oc/QcF7//0gG2xya43q24+qO/5gSsAyGMlcAIws0RbEsQ1FFluOCc7akyHMXpT1MtVfqCxmGIYmiLCuqqkKJVFQNbMW4JV0LLtwADTSGyr0FgoQS6zhONjsvCBJFUTheME2TphlDN6TAglTblg2V1DENsViwDb0M+4IgKopmmhZBFEQRqo/C8yL4UlXNsmyou1DXg5iuB1U+J1vdKJXltZDruo7tMDQTMIpKForlKcBf+ATVX1YUhmFAgmVacBO4htYxr20fpLvxIHIwE1ClaXp5tcC9bdsQEJzCPQJfIS74hToL0mD8cJw63DQZSdDBOsONA9dCeVZGYCZMEW4P6M/l8oDpug4SgtoKs/Ds2PmJ+k110x1TARwoLxm8Q2RooBB8BNs+MKM8AFIDCYM0Z2O5Ww1D+Wn8P00YA8uoE/dAAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/54d0c06ec4b01f9e30906ef818978796/6e70f/christmas-lyrics-emotion-detector-demo.png&quot;
        srcset=&quot;/static/54d0c06ec4b01f9e30906ef818978796/0eb09/christmas-lyrics-emotion-detector-demo.png 500w,
/static/54d0c06ec4b01f9e30906ef818978796/6e70f/christmas-lyrics-emotion-detector-demo.png 733w&quot;
        sizes=&quot;(max-width: 733px) 100vw, 733px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;CREATE YOUR OWN PERIODIC TABLE&lt;/h3&gt;
&lt;p&gt;After continuously editing the project table, I made a ‘Periodic Table Creator’ to dynamically rebuild the figure over and over again. I built it with the help of Streamlit and, inspired by Bokeh examples, it became a dynamic creator that you can customize for your own Periodic Table!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/periodic-table-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 2000px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 40%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAIAAAB2/0i6AAAACXBIWXMAAA7DAAAOwwHHb6hkAAABvklEQVQY022P3U/aABTF++8uS5Yl6pI94cse5kzmYpwIQ8aHQaoTEEpbW0QqXYFSWlr66WRW2TBbEPpBx1ILa/F1J7nJzUl+554LrJ+wHX00e/QWgeb+eM7Ye9AX/5PruffW0PYMju9DBQ74gEtQVx+M7QCdB7A7VGYivIyaL5bOMjVYHNehdFIbiR1JpsgrIEqoWUYXO9QjnflDH/+V8Sl9OG2mnxjx93W6VyD0y+rNRWvQZIftklJo3dUZRRCEAbBbEbOdW75yND14bmZfGfBbq7ptfk049+riVx/i8s/g9XBrb7uxk+KSZeUU0Ur0gGxLAsfqQITQvvA/xcuiA65OcyEDe28Ru3Y9asuIKyFIM7FSeRdn41E2Boog3kfQbxA7bDJKj+MCWD3mfggX+Sm4ZuVCJr5l1sI2uT+RsJmMlxupFXwjwSdjXNyHsWsY1or0HdmWe0HtSE09Ym/5as6HzVxogm0ZtbBVj43FM/9/mEqtYhufu8lP3H5WODzTyugVtKzNd/3LH8+lDKPz5ydW5uUIfG2gm0Z1Z1LbexDgmYgWyfgL9E2kEQkz0YNuutTLQ+pp44agBJZhvv8D8GeBd8PnTGwAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/b0c57fdd2803d18c9c9ed231712074e6/f97d7/periodic-table-of-nlp-tasks.png&quot;
        srcset=&quot;/static/b0c57fdd2803d18c9c9ed231712074e6/0eb09/periodic-table-of-nlp-tasks.png 500w,
/static/b0c57fdd2803d18c9c9ed231712074e6/1263b/periodic-table-of-nlp-tasks.png 1000w,
/static/b0c57fdd2803d18c9c9ed231712074e6/f97d7/periodic-table-of-nlp-tasks.png 2000w,
/static/b0c57fdd2803d18c9c9ed231712074e6/ef7d6/periodic-table-of-nlp-tasks.png 2997w&quot;
        sizes=&quot;(max-width: 2000px) 100vw, 2000px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;SCATTERTEXT DEMO&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;/scattertext-demo&quot;&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 882px; &quot;
    &gt;
      &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 59.4%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAAA7DAAAOwwHHb6hkAAACBUlEQVQoz21TC8/bIBDL//+V39ameRHCGwIknknWqpN2EgotnM/2HR3+xnEcX9/zsy+l4DxPLlzf77h/n6i1onCdvN9ZayHlhpQylmXB6zUg53IBGWOgtfkHpBXZ9wRrHc8d1s1AbRJL/wNjDbp5ntC/ZsRYrwtCrBhG/tf3sM7BOc+zBLFK3MUV+mHCNAsEkvAxYpcL9ESMfUenlILSDlJ5HuYLUG4KzscLrAG1NS8C2hBw2zALiRADzhpRrcQ2L/j9eFyqOkcWlokuUqJPpO2wl4MJ934RguBkmRLteMGHeHtMsNMbWNrkWKgRa3Z0v35+8BgFfD4R9sLEHSnnS25j43wgsL3klpLpX4Q3EjWw2DTi+XhiI/vEgi06rRWGacGqLQKZRTakgTTWiZ5EepQIklIgMQJGKlIbXs8H5mnCJimf999d79qmdXOgnFUZ7JyayoNMlo1RKTvHIjPJYBx6JGdwssuRjNdxxMTVir6je8/XxGqPZ0//KgEKDrI5akLJZEpWnkCbFMiBfhPM07M3oObe0+fK3O4e2ntg22Bqrbnk1UEUXkqWjByOzI5GD7MKGMqUbIZkw5p/jYymyo/kb9AW2mhkygp6hV1HVC1QxIhMhoVnlkCJA2/4IAwb10h8ST7xP9AWzUfFBM8X4PkawrrCCY4JGe5sROEI+TZmMX0w/gDVkKBee0aTWgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Click for Demo!&quot;
        title=&quot;Click for Demo!&quot;
        src=&quot;/static/135b124efa052a09c67c8165a9fba10b/5f5e5/scatter.png&quot;
        srcset=&quot;/static/135b124efa052a09c67c8165a9fba10b/0eb09/scatter.png 500w,
/static/135b124efa052a09c67c8165a9fba10b/5f5e5/scatter.png 882w&quot;
        sizes=&quot;(max-width: 882px) 100vw, 882px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
    &lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;</content:encoded></item><item><title><![CDATA[How you doin'? Typology for Affective Meaning]]></title><description><![CDATA[Why compute Affective Meaning? Applications can be Detecting: sentiment towards politicians, products, countries, ideas frustration of…]]></description><link>https://www.innerdoc.com/affective-meaning/</link><guid isPermaLink="false">https://www.innerdoc.com/affective-meaning/</guid><pubDate>Sat, 28 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;Why compute Affective Meaning?&lt;/h3&gt;
&lt;p&gt;Applications include detecting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sentiment towards politicians, products, countries, ideas&lt;/li&gt;
&lt;li&gt;frustration of callers to a help line&lt;/li&gt;
&lt;li&gt;stress in drivers or pilots&lt;/li&gt;
&lt;li&gt;depression and other medical conditions&lt;/li&gt;
&lt;li&gt;confusion in students talking to e-tutors&lt;/li&gt;
&lt;li&gt;emotions in novels (e.g., for studying groups that are feared over time)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Could we generate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;emotions or moods for literacy tutors in the children’s storybook domain&lt;/li&gt;
&lt;li&gt;emotions or moods for computer games&lt;/li&gt;
&lt;li&gt;personalities for dialogue systems to match the user&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Relying on language as an indicator of psychological well-being and a conduit for emotions more generally also has a long tradition in clinical psychology.&lt;/p&gt;
&lt;h3&gt;Two families of theories of Emotion&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dimensions of Emotion&lt;/strong&gt;&lt;br&gt;
Affect can vary in Valence and Arousal. Valence ranges from positive/pleasant to negative/unpleasant. Arousal ranges from strong/activated to weak/deactivated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plutchik’s wheel of Emotion&lt;/strong&gt;&lt;br&gt;
A list of 8 basic emotions in four opposing pairs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Joy – Sadness &lt;/li&gt;
&lt;li&gt;Anger – Fear&lt;/li&gt;
&lt;li&gt;Trust – Disgust&lt;/li&gt;
&lt;li&gt;Anticipation – Surprise
&lt;br&gt;
&lt;br&gt;
&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 1733px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/1bcf4b82662f0a564672e85b417a7937/22bcb/plutchik_model_of_emotions_with_faces.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 100%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAACXBIWXMAAC4jAAAuIwF4pT92AAAF+ElEQVQ4yz1UaUxbVxr9KDNp1B8ZVarUqv3R6TKtRv3TLZWqVqo0SqcMnYQGkpQsJihpSRolaVZl0mSioWVSSCAhIYFCsEnYLUNCCpjg2BBjvMYG79h4wcb2M8Y8m2f72Xh73zxaaX4cXd1P9xzpfLrnAOOXAoa0kAtqgCEUhTQieGbH/hj3S7/JxqycXFIAmfjQyWxS+N80zQeantuZyfqO5Bl6I+IaUAlNYYzWs6cOYrQRgPE9ASRnWRiA0lwuSKSXIE+7PlixdLoyUb0e0/E36FDLgzWKL0TETeHlx8oIqfBHIqa/ZZNpMMxVP4OYB8QUCwRAaoYVnYK8X1mYRfwDEmOQpSx8Qn0D44tiRIZoJReOOsmFIwHE5XrC34fzc9eQpu2jNmcL1A3AulDhutjvgsR0IVVbAWsMPos+SSd6R08Rxv5G92Q9FZA3UamY6GrYfjgRMldhin50e9HV4SEWuZkIZTphcTaWLpG/ipcj068vhe/Dyqqs4DfVtZkxWK1++/msrHGcGjuF4YF9Wma+udqGuDFOfr7fZ/oCI4sfIRUs4g0P/vgXP9F6Sa4rm57WHkRPsL2HtQE0uypvaLQApK8D5Mjlt5I2PS/v0Z6ZvVpqWeziYID7AaZ7P7yQThy5EjSyd0MxJlf2PxDJSppMrq1ocu7BCcX2hG9pfHd8LXQ9lQkXU0nPuv/s9izhUXv7+jDlWqCWp7qb/c3vo6WxDJP8zUhrOelV/3GM+YpxKbQtMaUtR+nTLYzRvRWdvpq6ZHbFYnJ1oXdJlmDdvgqZoJcTmpo0W240I6mU59jhGzHJvlvLQwfQd+fj7LLgCwzbvmII27eM2vAuynRFOaNjF857v+n029de8QRUUbm+ASWa6iTLfQ3cAu773oGe4z5+v3BlXCwNFN3luD/m1oR/OpoNtu1G162ifPhRMbr0+1Ft/opRGfcyGlcJiuyHmh9oDx9XmNtrpnQ3u6VPf9rsDIwDpLTOKnrS0jR7oN1i3XE7aXivFvXv1GLg0w70brnJzBZfRv+ts8ycbRsjt1fiQz0HexUVTI+qFAd1e5An2RnrmNhlC8TkN0KU/nOIjczwqD6FwXyYRyjLmhh5UQOqihox8God2ovbGElpG/JPDjOXdVymTlyJLaISZkBegT2yPdgj5eC9x5XYPclZ0bjacD44XgOrjgBgGl8gbYFfiAe6Ptup7m7HSf7g022d6Y7tLdj893rmux9VWKpUYWXXGaay6yT2WK7lx3Rnn/CEFbW2pZGa+eCMwOSRfZliYsDmdO2fIaNvwtqlxshciGIX+3Ib597p1vK72H+Qn+We6GOODFtx19gs83X7Raxq+yF7TtSCF1QdzcABcIbsGpF1EF3hOTZJ+AokQtSeBanVqKoR4pLCnfMoPNXcXbwcd0cH3uf0YjtvBqskNuZgnwJPiIeZY2M1uLf/PPO98AqOOCTnzQtqf5uoDse0gvUf8iZgFjfou2Tb5Y1Cq/m+8lLndz3TgqODOHB6CAfL7+ouDbr0JRN+LBHM4jFNeKRB1Tb5r4fX8eKjm1hx79ycgzRvvSO82tslub37jrD+t2BD9T+GYfTCQzAMmPcNnrqP3VW9ERPfyo2agp+dllhbD48YcceADstkkfPs+w1acv76udEG7w+CeuRrhdxNPwNcE1TDxc5/F4BzeuH/TWF77BQqf1HxTBPz1XZ/1O9whwldMF1b9chOHxe7cSqM/9H4PbN8nS7tDhLnn9jkZzQ+g0k5r35xne+JBllBhQdILwUrTvLZ0CL5krrfALFk7sq41IYy
jQupeK7j0IjBXt6vCkcyWCdx2PHmpBitRNBybUTyUhpzm5L51HMZzPxeXw6ZG4LWEISdJGA+D/oIBU9nFst6HxoT3UOGtFrr3tGqmJO0TJmdLOHPEy6HukEiInvlsrMNw0OfRNMErCSjYAq4IZ5JALiVXoiHE5BcTYFiarGgscMELF4Qy4mXx2XBv+4s7/uTdnmtRB1MVbni682MG7zR6KZbYhE0CUfBTDgKSHoVrIQXIqlV+B/x2G1qQJhP7gAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;Plutchik’s wheel of Emotion&quot;
        title=&quot;Plutchik’s wheel of Emotion&quot;
        src=&quot;/static/1bcf4b82662f0a564672e85b417a7937/22bcb/plutchik_model_of_emotions_with_faces.png&quot;
        srcset=&quot;/static/1bcf4b82662f0a564672e85b417a7937/0eb09/plutchik_model_of_emotions_with_faces.png 500w,
/static/1bcf4b82662f0a564672e85b417a7937/1263b/plutchik_model_of_emotions_with_faces.png 1000w,
/static/1bcf4b82662f0a564672e85b417a7937/22bcb/plutchik_model_of_emotions_with_faces.png 1733w&quot;
        sizes=&quot;(max-width: 1733px) 100vw, 1733px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
&lt;h3&gt;The Big Five Dimensions of Personality&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Extraversion vs. Introversion&lt;br&gt;
sociable, assertive, playful vs. aloof, reserved, shy&lt;/li&gt;
&lt;li&gt;Emotional stability vs. Neuroticism&lt;br&gt;
calm, unemotional vs. insecure, anxious&lt;/li&gt;
&lt;li&gt;Agreeableness vs. Disagreeableness&lt;br&gt;
friendly, cooperative vs. antagonistic, faultfinding&lt;/li&gt;
&lt;li&gt;Conscientiousness vs. Unconscientiousness&lt;br&gt;
self-disciplined, organised vs. inefficient, careless&lt;/li&gt;
&lt;li&gt;Openness to experience&lt;br&gt;
intellectual, insightful vs. shallow, unimaginative&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Scherer’s typology of affective states&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Emotion&lt;/strong&gt;: relatively brief episode of synchronized response of all or most organismic subsystems in response to the evaluation of an event as being of major significance&lt;br&gt;
&lt;mark&gt;angry&lt;/mark&gt;, &lt;mark&gt;sad&lt;/mark&gt;, &lt;mark&gt;joyful&lt;/mark&gt;, &lt;mark&gt;fearful&lt;/mark&gt;, &lt;mark&gt;ashamed&lt;/mark&gt;, &lt;mark&gt;proud&lt;/mark&gt;, &lt;mark&gt;desperate&lt;/mark&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mood&lt;/strong&gt;: diffuse affect state …change in subjective feeling, of low intensity but relatively long duration, often without apparent cause&lt;br&gt;
&lt;mark&gt;cheerful&lt;/mark&gt;, &lt;mark&gt;gloomy&lt;/mark&gt;, &lt;mark&gt;irritable&lt;/mark&gt;, &lt;mark&gt;listless&lt;/mark&gt;, &lt;mark&gt;depressed&lt;/mark&gt;, &lt;mark&gt;buoyant&lt;/mark&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Interpersonal stance&lt;/strong&gt;: affective stance taken toward another person in a specific interaction, coloring the interpersonal exchange&lt;br&gt;
&lt;mark&gt;distant&lt;/mark&gt;, &lt;mark&gt;cold&lt;/mark&gt;, &lt;mark&gt;warm&lt;/mark&gt;, &lt;mark&gt;supportive&lt;/mark&gt;, &lt;mark&gt;contemptuous&lt;/mark&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Attitudes&lt;/strong&gt;: relatively enduring, affectively colored beliefs, preferences, and predispositions towards objects or persons&lt;br&gt;
&lt;mark&gt;liking&lt;/mark&gt;, &lt;mark&gt;loving&lt;/mark&gt;, &lt;mark&gt;hating&lt;/mark&gt;, &lt;mark&gt;valuing&lt;/mark&gt;, &lt;mark&gt;desiring&lt;/mark&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Personality traits&lt;/strong&gt;: emotionally laden, stable personality dispositions and behavior tendencies, typical for a person&lt;br&gt;
&lt;mark&gt;nervous&lt;/mark&gt;, &lt;mark&gt;anxious&lt;/mark&gt;, &lt;mark&gt;reckless&lt;/mark&gt;, &lt;mark&gt;morose&lt;/mark&gt;, &lt;mark&gt;hostile&lt;/mark&gt;, &lt;mark&gt;envious&lt;/mark&gt;, &lt;mark&gt;jealous&lt;/mark&gt;&lt;/p&gt;
&lt;h3&gt;How can we model the Lexical Semantics?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;sentiment&lt;/li&gt;
&lt;li&gt;emotion&lt;/li&gt;
&lt;li&gt;personality&lt;/li&gt;
&lt;li&gt;mood &lt;/li&gt;
&lt;li&gt;attitudes&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Sentiment vs. Affective Meaning 🗣&lt;/h3&gt;
&lt;p&gt;The expansion of the area of Sentiment Analysis has resulted in a new interest in the quantification of opinion, sentiment, affect, feeling, emotion, personality, mood and attitude. These terms are often used interchangeably. &lt;/p&gt;
&lt;p&gt;&lt;mark&gt;We differentiate sentiment from affective meaning based on their duration.&lt;/mark&gt; Sentiment lasts longer, while an affective state is short-lived. An effective sentiment analysis system captures the sentiment of an opinion about an entity. The recognition of Affective Meaning is, for example, the automatic discovery of an emotional reaction, often of a single person. Unlike opinions, emotions are short-term.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Also read our post on understanding &lt;a href=&quot;/sentiment-analysis&quot;&gt;Sentiment Analysis&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[Understanding Sentiment Analysis]]></title><description><![CDATA[Sentiment Analysis can be considered a subfield of information extraction. It is used in a wide range of areas and is sometimes referred to…]]></description><link>https://www.innerdoc.com/sentiment-analysis/</link><guid isPermaLink="false">https://www.innerdoc.com/sentiment-analysis/</guid><pubDate>Thu, 26 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Sentiment Analysis can be considered a subfield of information extraction. It is used in a wide range of areas and is sometimes referred to as Opinion Mining. Sentiment Analysis attempts to determine the overall attitude (positive or negative) expressed within a text. Its purpose is to represent emotional or affective meaning.&lt;/p&gt;
&lt;p&gt;The technique’s success lies in an imperative need to standardize the measurement of human emotion in social media in order to efficiently monetize it. Signalling a user’s preference is a vital component of the &lt;em&gt;Like economy&lt;/em&gt; that underpins the business models of all major social media companies. Peer recommendations are highly trusted by other peers. This makes Sentiment Analysis a valuable source of information.&lt;/p&gt;
&lt;br&gt;
&lt;h3&gt;Scoring 🎯&lt;/h3&gt;
&lt;p&gt;The simplest &lt;mark&gt;scale&lt;/mark&gt; for sentiment scoring is a binary (positive–negative) or three-way (positive–neutral–negative) categorization. Alternatively, a score between –1 and 1 provides a more detailed range from very negative to very positive.&lt;/p&gt;
&lt;p&gt;The &lt;mark&gt;level&lt;/mark&gt; at which sentiment is scored depends on the use case. Sentiment classification at document level is common, as is applying sentiment analysis at the level of sentences, phrases or named entities.&lt;/p&gt;
&lt;p&gt;The simplest approach for scoring sentiment is a &lt;mark&gt;calculation&lt;/mark&gt; based on a lexicon: a precompiled wordlist of terms that indicate positive or negative expressions of sentiment. Summing the positive and negative hits within a document gives the score. In some cases, a simple majority of hits decides the final label of the post. This can be enhanced by recognizing negation, applying rule-based predicate structures or adding preprocessing steps like lemmatization.&lt;/p&gt;
&lt;p&gt;A more elaborate strategy is a &lt;mark&gt;classification algorithm&lt;/mark&gt; that interprets text at the level of sentences or phrases. The selection of linguistic features for classification also requires choices. For example, term frequency has been found to be a poorer predictor than term presence: whether highly emotionally charged terms are used matters more than their exact frequency. Word class, multi-word phrases, syntax, and negation have all been used as features, as have text length, exclamation marks, all caps, and character repetition.&lt;/p&gt;
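&lt;p&gt;The lexicon-based calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not a production scorer: the word lists, the function names and the rule that a negation flips only the next token are all assumptions made for this example.&lt;/p&gt;

```python
import re

# Hypothetical toy lexicon; a real system would load a curated word list.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text):
    """Sum positive (+1) and negative (-1) lexicon hits in a text.

    A negation word flips the polarity of the single token that follows it.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        # True/False behave as 1/0, so this yields +1, -1 or 0.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score


def label(score):
    """Map a summed score onto a positive-neutral-negative scale."""
    if score == 0:
        return "neutral"
    return "positive" if score > 0 else "negative"


print(label(lexicon_score("I love this excellent phone")))
```

&lt;p&gt;Lemmatizing the tokens before the lookup would let a small lexicon match more word forms.&lt;/p&gt;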
&lt;br&gt;
&lt;h3&gt;Challenges ⛰️&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Often it is &lt;mark&gt;unclear what is measured&lt;/mark&gt;, the polarity of a described entity or the emotional state of the writer.&lt;/li&gt;
&lt;li&gt;Issues arise when genre or &lt;mark&gt;domain-specific&lt;/mark&gt; sentiment algorithms and dictionaries are suddenly applied to another field and context-dependent word meanings no longer fit with the original context.&lt;/li&gt;
&lt;li&gt;The absence of &lt;mark&gt;detection of fake news&lt;/mark&gt; and fraudulent reviews. Fake opinions deliberately try to mislead readers and algorithms by giving undeserved positive opinions to target objects in order to promote them.&lt;/li&gt;
&lt;li&gt;People who are skeptical whether computers are able to understand the complexities of natural language &lt;mark&gt;assume scoring and interpretation are the same step&lt;/mark&gt;. While the system only delivers a score, the result should still be interpreted.&lt;/li&gt;
&lt;li&gt;Language is easy to misinterpret. For example, misclassification can occur because &lt;mark&gt;sarcasm and humor are difficult to recognize&lt;/mark&gt;.&lt;/li&gt;
&lt;li&gt;Spoken opinions complicate sentiment analysis, because proper language structure is often ignored and &lt;mark&gt;signs of body language and tone of voice are not recorded&lt;/mark&gt;.&lt;/li&gt;
&lt;li&gt;Sentiment analysis cannot &lt;mark&gt;determine why&lt;/mark&gt; someone is unhappy. On the other hand, it is not easily solved by humans either.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Applications 💶&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;u&gt;Market Research&lt;/u&gt;: Understanding the voice of Customers when they express their desires, thoughts, preferences and frustrations&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Public Relations&lt;/u&gt;: Identify opinions towards public persons or organisations&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Customer Service and Support&lt;/u&gt;: Identify the needs to solve a customer request&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Human Resources&lt;/u&gt;: Understand the voice of Employees&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Healthcare&lt;/u&gt;: Understanding the feelings of Patients and measuring the effect of a chosen therapy&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Stock Market&lt;/u&gt;: Predicting Stock prices, as they are often driven by positive or negative information&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Recommender Systems&lt;/u&gt;: Help users to decide on products&lt;/li&gt;
&lt;li&gt;&lt;u&gt;Business Analytics&lt;/u&gt;: Using Sentiment Analysis for decision support systems and business process improvement.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h3&gt;Sentiment vs. Affective Meaning 🗣&lt;/h3&gt;
&lt;p&gt;The expansion of the area of Sentiment Analysis has resulted in a new interest in the quantification of opinion, sentiment, affect, feeling, emotion, personality, mood and attitude. These terms are often used interchangeably. &lt;/p&gt;
&lt;p&gt;&lt;mark&gt;We differentiate sentiment from affective meaning based on their duration.&lt;/mark&gt; Sentiment lasts longer, while an affective state is short-lived. An effective sentiment analysis system captures the sentiment of an opinion about an entity. The recognition of Affective Meaning is, for example, the automatic discovery of an emotional reaction, often of a single person. Unlike opinions, emotions are short-term.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Also read our post on understanding &lt;a href=&quot;/affective-meaning&quot;&gt;Affective Meaning&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[Scattertext Project]]></title><description><![CDATA[Sometimes the problems encountered when trying to understand a text, or better, a corpus of documents become so complex that you need to…]]></description><link>https://www.innerdoc.com/scattertext/</link><guid isPermaLink="false">https://www.innerdoc.com/scattertext/</guid><pubDate>Tue, 24 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Sometimes the problems encountered when trying to understand a text, or better, a corpus of documents become so complex that you need to visualize it first. Here’s an interactive visualization for understanding texts: scattertext, a product of the genius of Jason Kessler.&lt;/p&gt;
&lt;p&gt;You can see our &lt;a href=&quot;/scattertext-demo&quot;&gt;demo of scattertext here&lt;/a&gt;.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[81 - Knowledge Graph Visualization]]></title><description><![CDATA[A Knowledge Graph is a knowledge base with interlinked descriptions of entities. This can be used to put data into context and enhance…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/81-knowledge-graph-visualization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/81-knowledge-graph-visualization/</guid><pubDate>Sat, 21 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Knowledge Graph is a knowledge base with interlinked descriptions of entities. This can be used to put data into context and enhance search engines. Keywords and Named Entity Recognition in combination with relation extraction is a good source when feeding Knowledge Graphs.&lt;/p&gt;
&lt;p&gt;Technically, a Knowledge Graph is a network that represents multiple types of entities (nodes) and relations (edges) in the same graph. Each link of the network represents an (entity, relation, value) triplet. For example: Eiffel Tower (entity) is located in (relation) Paris (value). When you know that A relates to B and B relates to C, you can also infer an indirect connection between A and C.&lt;/p&gt;
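&lt;p&gt;To make the triplet idea concrete, here is a small sketch that stores (entity, relation, value) triplets in a plain Python dictionary and uses a breadth-first search to recover the indirect connection between A and C. The example facts and the function names are assumptions for illustration; a graph library or graph database would replace this in practice.&lt;/p&gt;

```python
from collections import defaultdict, deque

# Hypothetical example triplets: (entity, relation, value).
TRIPLETS = [
    ("Eiffel Tower", "is located in", "Paris"),
    ("Paris", "is the capital of", "France"),
    ("Louvre", "is located in", "Paris"),
]


def build_graph(triplets):
    """Index directed edges by their source entity."""
    graph = defaultdict(list)
    for entity, relation, value in triplets:
        graph[entity].append((relation, value))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the chain of triplets from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, value in graph.get(node, []):
            if value not in visited:
                visited.add(value)
                queue.append((value, path + [(node, relation, value)]))
    return None  # no directed path exists


graph = build_graph(TRIPLETS)
for triplet in find_path(graph, "Eiffel Tower", "France"):
    print(triplet)
```

&lt;p&gt;The edges are treated as directed here, so the search finds a path from the Eiffel Tower to France but not the other way around.&lt;/p&gt;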
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*r6Wy8ZxPojWES8XzAJEeew.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Knowledge Graph visualization (&lt;a href=&quot;https://www.kaggle.com/ferdzso/knowledge-graph-analysis-with-node2vec&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;You can build your own network in Python with &lt;a href=&quot;https://networkx.org/documentation/stable/tutorial.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NetworkX&lt;/a&gt; and draw it with pyplot from Matplotlib. If data gets bigger, you need to scale up to a Graph Database like &lt;a href=&quot;https://grakn.ai/grakn-core&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Grakn&lt;/a&gt;, &lt;a href=&quot;https://www.arangodb.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ArangoDB&lt;/a&gt; or Neo4j. &lt;/p&gt;
&lt;p&gt;Fraud Detection and Exploratory Data Analysis are important use-cases. For example, graph databases were intensively used to explore complex networked data during the &lt;a href=&quot;https://gijn.org/2016/05/10/the-people-and-the-technology-behind-the-panama-papers/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Panama Papers&lt;/a&gt; investigation.&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[80 - Locations on Geomap]]></title><description><![CDATA[Geocoded Named Entities can easily be mapped on a geographical map. There are several services and libraries to do the job: With a Mapbox…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/80-locations-on-geomap/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/80-locations-on-geomap/</guid><pubDate>Fri, 20 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Geocoded Named Entities can easily be mapped on a geographical map. There are several services and libraries to do the job:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With a &lt;a href=&quot;https://www.mapbox.com/maps/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Mapbox&lt;/a&gt; account (50k web map loads/month free tier) you can plot your coordinates from a Pandas dataframe to a Plotly &lt;a href=&quot;https://plotly.com/python/scattermapbox/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;scatter-mapbox&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://geopandas.org/gallery/create_geopandas_from_pandas.html#sphx-glr-gallery-create-geopandas-from-pandas-py&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;GeoPandas&lt;/a&gt; makes working with geospatial data in Python easier. It extends the datatypes used by Pandas to allow spatial operations on geometric types.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/python-visualization/folium&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Folium&lt;/a&gt; creates beautiful and interactive maps by using Python and &lt;a href=&quot;https://leafletjs.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Leaflet&lt;/a&gt;, a JavaScript library for interactive maps. Folium has a lot of Jupyter &lt;a href=&quot;https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/Quickstart.ipynb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo notebooks&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Ce3q00GMzorlIVisWUu2rg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Folium chart with Python and Leaflet (&lt;a href=&quot;https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/Quickstart.ipynb#Markers&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[79 - Events on Timeline]]></title><description><![CDATA[Plotting events chronologically on a timeline improves insight. Some example setups for plots are: Document Timeline: document publishing…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/79-events-on-timeline/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/79-events-on-timeline/</guid><pubDate>Thu, 19 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Plotting events chronologically on a timeline improves insight. Some example setups for plots are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Document Timeline: document publishing date vs document title&lt;/li&gt;
&lt;li&gt;Sentence Timeline: date, timestamp or period vs its sentence (&lt;a href=&quot;http://tl-generator.herokuapp.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Dispersion Plot: the location (word offset) of a keyword in a text&lt;/li&gt;
&lt;/ul&gt;
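&lt;p&gt;The data behind a dispersion plot is simply the list of word offsets at which each keyword occurs. The sketch below computes those offsets with the standard library only; the function name and the simple tokenization rule are assumptions for this example, and the resulting positions could then be passed to any plotting library.&lt;/p&gt;

```python
import re


def keyword_offsets(text, keywords):
    """Return, per keyword, the word offsets at which it occurs in the text."""
    # Simple tokenization: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    wanted = {keyword.lower() for keyword in keywords}
    offsets = {keyword: [] for keyword in wanted}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets


sample = "Winter is coming. The winter is long."
print(keyword_offsets(sample, ["winter", "long"]))
```

&lt;p&gt;Each keyword becomes one row of the plot, with a mark drawn at every offset in its list.&lt;/p&gt;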
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*u2VFOMZmMCX1ZWXQHo7Okg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Dispersion plot for Game of Thrones keywords (&lt;a href=&quot;https://towardsdatascience.com/text-processing-is-coming-c13a0e2ee15c&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Some libraries for inspiration: Seaborn &lt;a href=&quot;https://seaborn.pydata.org/generated/seaborn.stripplot.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;stripplot&lt;/a&gt;, Yellowbrick &lt;a href=&quot;https://www.scikit-yb.org/en/latest/api/text/dispersion.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;dispersion plot&lt;/a&gt;, NLTK &lt;a href=&quot;https://www.nltk.org/_modules/nltk/draw/dispersion.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;dispersion plot&lt;/a&gt;, Calmap &lt;a href=&quot;https://pythonhosted.org/calmap/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;heatmap per day&lt;/a&gt; from pandas timeseries, Knightlab &lt;a href=&quot;http://timeline.knightlab.com/#make&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;JavaScript timeline&lt;/a&gt; for data in Google Sheets.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[78 - Word Embedding Visualization]]></title><description><![CDATA[Visualizing Word Embeddings is often done to inspect the embedding and experience the cohesiveness of a subset of the embedding. It is all…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/78-word-embedding-visualization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/78-word-embedding-visualization/</guid><pubDate>Wed, 18 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Visualizing Word Embeddings is often done to inspect the embedding and experience the cohesiveness of a subset of the embedding. It is all about dimension reduction; how to get a 2-D chart from e.g. a 300 dimensional embedding. Three often seen dimension reduction techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;t-SNE (t-Distributed Stochastic Neighbor Embedding) maps the multi-dimensional data to a lower dimensional space. This is computationally expensive. After this process, the input features are no longer identifiable, and you cannot make any inference based only on the output of t-SNE. Hence it is mainly a data exploration and visualization technique. t-SNE is good at preserving local context (neighbors).&lt;/li&gt;
&lt;li&gt;PCA (Principal Component Analysis) is a linear feature extraction technique. It combines your input features into new components in such a way that you can drop the least important components while still retaining the most valuable parts of all of the features. As an added benefit, the components created by PCA are uncorrelated with one another.&lt;/li&gt;
&lt;li&gt;UMAP (Uniform Manifold Approximation and Projection) has some advantages over t-SNE, most importantly increased speed and better preservation of the data’s local (neighbors) and global (clusters) structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/JasonKessler/scattertext&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Scattertext&lt;/a&gt; is a famous package for finding distinguishing terms in corpora, and presenting them in an interactive, HTML scatter plot. This is done by visualizing the difference and overlap of two categories of documents. You can try a &lt;a href=&quot;https://jasonkessler.github.io/demo_compact.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo&lt;/a&gt; about republican vs democratic speeches.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1200/1*1ksF4L1FhBmAuvYcLRUIdg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Scattertext visualization (&lt;a href=&quot;https://jasonkessler.github.io/demo_compact.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Google’s &lt;a href=&quot;https://projector.tensorflow.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TensorBoard Embedding Projector&lt;/a&gt; graphically represents high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. A similar but simpler library is Rasa’s &lt;a href=&quot;https://github.com/RasaHQ/whatlies&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Whatlies&lt;/a&gt; that also helps to inspect your word embedding.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1200/1*TXDycHZuhFEo3cko6P3HKw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Visualization in the Tensorflow projector for the most similar words to ‘school’ (&lt;a href=&quot;http://projector.tensorflow.org&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[77 - Wordcloud]]></title><description><![CDATA[The Wordcloud has been around for a long time. Visualizing information is a profession in itself. So there are some best practices, but…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/77-wordcloud/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/77-wordcloud/</guid><pubDate>Tue, 17 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Wordcloud has been around for a long time. Visualizing information is a profession in itself. So there are some best practices, but Wordclouds seem to ignore these. Here are some remarks about the (missing) elements in a Wordcloud:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stopwords are excluded, even though a word like &lt;em&gt;don’t&lt;/em&gt; changes the meaning of the word that follows it. Including stopwords, however, clutters the Wordcloud because of their high frequencies.&lt;/li&gt;
&lt;li&gt;Multi-word expressions are not detected. The separate words from a multi-word expression (e.g. &lt;em&gt;New York Times&lt;/em&gt;) will be interpreted very differently.&lt;/li&gt;
&lt;li&gt;Different colors have no different meaning.&lt;/li&gt;
&lt;li&gt;Vertical or horizontal words have no different meaning. The same applies to words at the top/bottom/left/right.&lt;/li&gt;
&lt;li&gt;No context is given to clarify the sense of that word.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although there is a lot of resistance against Wordclouds, they are still around. You can generate your Wordclouds with the &lt;a href=&quot;https://github.com/minimaxir/stylecloud&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;stylecloud&lt;/a&gt; Python library.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*JrOsOwD7gXCGiasOFnfyUA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Compare two Wordclouds about the State of the Union 2002 vs 2011 (&lt;a href=&quot;https://tagcrowd.com/blog/2011/03/05/state-of-the-union-2002-vs-2011/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[76 - Annotated Text Visualization]]></title><description><![CDATA[Printing text, but prettier. Often you want to show text with emphasis on specific words and their metadata. A simple plugin for Streamlit…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/76-annotated-text-visualization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/76-annotated-text-visualization/</guid><pubDate>Mon, 16 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Printing text, but prettier. Often you want to show text with emphasis on specific words and their metadata. A simple plugin for Streamlit apps is this &lt;a href=&quot;https://github.com/tvst/st-annotated-text&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;word annotations plugin&lt;/a&gt;. spaCy also has its &lt;a href=&quot;https://explosion.ai/demos/displacy-ent&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Named Entity Visualizer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*z7KCpm1Q7lhpr1WtzFDuOg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Annotations made with a Streamlit plugin (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[75 - Interactive App Creation]]></title><description><![CDATA[Presenting your NLP task results should be transparent, interactive and fancy. Several solutions are available to share your data and code…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/75-interactive-app-creation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/75-interactive-app-creation/</guid><pubDate>Sun, 15 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Presenting your NLP task results should be transparent, interactive and fancy. Several solutions are available to share your data and code. Notebooks are open-source web applications that allow you to create and share documents that contain live code, equations, visualizations and narrative text. These Notebooks and web apps are flexible and you can arrange the user interface with building-blocks or plugins for your specific use-case.&lt;/p&gt;
&lt;p&gt;With &lt;a href=&quot;https://www.streamlit.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Streamlit&lt;/a&gt; you can code the UI in the same script as your analysis. When you save the script, the browser refreshes automatically. It’s a great way of building interactive demos. You can also deploy your scripts to the Streamlit cloud platform. The computing power in the free cloud platform is not suited for heavy apps.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Vl_CSZlK9Te1v9wDeZEYPw.gif&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Streamlit self-driving car demo with code on the left and browser UI with a sidebar on the right (&lt;a href=&quot;https://github.com/streamlit/demo-self-driving&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*E2-3AFqihkrDi9G_UKQCUg.gif&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Streamlit demo: Controllable face GAN generator (&lt;a href=&quot;https://github.com/streamlit/demo-face-gan&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href=&quot;https://jupyter.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Jupyter&lt;/a&gt; Notebooks (f.k.a. IPython Notebook) facilitate in-browser interactive computing with direct results. There is also JupyterLab which is a web-based interactive development environment for Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*kESn7YXf8sYxzaPj7JUOfg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Jupyter Notebook with text-, code- and output blocks (&lt;a href=&quot;https://hub-binder.mybinder.ovh/user/fonnesbeck-2352771-mwwtsqub/notebooks/GPTutorial.ipynb&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*P-rJlJaRRm-c2352JW5Hxg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;JupyterLab environment (&lt;a href=&quot;https://jupyterlab.readthedocs.io/en/latest/getting_started/overview.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Google’s &lt;a href=&quot;https://colab.research.google.com/notebooks/intro.ipynb#recent=true&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Colab&lt;/a&gt; (from Colaboratory) Notebooks are like Jupyter Notebooks: they are free, run in the cloud and require no setup. In Colab you can choose to run on a (light but free) GPU runtime, instead of CPU.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*TVCwPcbGaWv8jolaVbQluw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Colab demo notebook (&lt;a href=&quot;https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[74 - E-Discovery and Media Monitoring]]></title><description><![CDATA[Electronic Discovery and (Social) Media Monitoring are tasks for doing large scale content analysis.  Electronic Discovery is the task of…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/74-e-discovery-and-media-monitoring/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/74-e-discovery-and-media-monitoring/</guid><pubDate>Sat, 14 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Electronic Discovery and (Social) Media Monitoring are tasks for doing large scale content analysis. &lt;/p&gt;
&lt;p&gt;Electronic Discovery is the task of identifying, collecting and producing electronically stored information (ESI) in (legal) investigations. Important aspects are the performance of the system regarding the volume, combining textual data with metadata, preserving and linking the original document and keeping your analysis up-to-date with the latest documents.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*cIbGMFGGkEOzf3U9seWebw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Organizations cannot do E-discovery without NLP (&lt;a href=&quot;https://blogs.opentext.com/ediscovery/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;(Social) Media Monitoring is the task of analyzing social media, news media or any other content like posts, blogs, articles, whitepapers, comments and conversations. It can be used to improve (social) marketing, listening and engagement. &lt;/p&gt;
&lt;p&gt;The goal is to understand the voice of the customer, which can be in any kind of setting like the customer of your brand, or the user of your forum, or your subscriber, etc. This is done by iterating through the cycle of listen — understand — engage. Listening is the part where metrics like tone, emotions, topics, brand attitude are interpreted. Parsing text to insightful metrics might be more interesting than just counting the number of followers, likes, shares, visitors and recommendations.&lt;/p&gt;
&lt;p&gt;In practice, you often see sentiment analysis on Twitter data. Brand and conversation audits and the interpretation of topics and patterns might be more interesting, but they are also more complex.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*eI3wGiEd0HvRS30zS5-clQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;A typical social media monitoring layout with volume, sentiment, wordcloud and mentions (&lt;a href=&quot;https://marketingland.com/6-of-the-best-social-listening-tools-for-2019-249953&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[73 - Knowledge Base Population]]></title><description><![CDATA[Knowledge Bases (also known as knowledge graphs or ontologies) are valuable resources for developing intelligence applications, including…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/73-knowledge-base-population/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/73-knowledge-base-population/</guid><pubDate>Fri, 13 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Ontology_(information_science)&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Knowledge Bases&lt;/a&gt; (also known as knowledge graphs or ontologies) are valuable resources for developing intelligence applications, including search, question answering, and recommendation systems. The goal of Knowledge Base Population is to discover facts about entities (NER, NEL) and to build a knowledge base with them. &lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*mfX4MQ-s_p9xZT4BLvvzxg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;From Text to Knowledge Base (&lt;a href=&quot;https://blog.diffbot.com/knowledgenet-a-benchmark-for-knowledge-base-population/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;There is often an Inference Engine to complement the Knowledge Base. Together they can be seen as an Expert System. The Knowledge Base represents facts and rules. The Inference Engine applies the rules or AI model to the known facts to deduce new facts.&lt;/p&gt;
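&lt;p&gt;A minimal sketch of this interplay, assuming facts are stored as (subject, relation, object) triples and using a single made-up transitivity rule. Real inference engines support far richer rule languages; the names below are illustrative only.&lt;/p&gt;

```python
# Facts are (subject, relation, object) triples; the rule below is a made-up example
facts = {
    ("Shakespeare", "is_a", "author"),
    ("author", "is_a", "person"),
}

def apply_transitive_rule(known, relation="is_a"):
    """Return new facts deduced by: (a r b) and (b r c) implies (a r c)."""
    new = set()
    for a, r1, b in known:
        for b2, r2, c in known:
            if r1 == relation and r2 == relation and b == b2 and a != c:
                new.add((a, relation, c))
    return new - known

def infer(known):
    """Forward chaining: apply the rule until no new facts appear."""
    known = set(known)
    while True:
        new = apply_transitive_rule(known)
        if not new:
            return known
        known |= new

all_facts = infer(facts)  # now also contains ("Shakespeare", "is_a", "person")
```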
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[72 - Semantic Search Indexing]]></title><description><![CDATA[Search Engines became famous for their keyword-based information retrieval. Adding semantic information about a piece of text can increase…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/72-semantic-search-indexing/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/72-semantic-search-indexing/</guid><pubDate>Thu, 12 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Search Engines became famous for their keyword-based information retrieval. Adding semantic information about a piece of text can increase search accuracy. Indexing not only the text but also its vector makes it possible to search for the intent and semantic meaning of the search terms, in addition to keyword search.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/nmslib/nmslib&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NMSLIB&lt;/a&gt; (Non-Metric Space Library) is a fast similarity search library that can find objects with a minimal (cosine) distance to other objects. When handling a question, you calculate its vector and then find the closest embedding vector in the NMSLIB index. Calculating vectors can, for example, be done with the Universal Sentence Encoder. NMSLIB has become a part of &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2020/03/build-k-nearest-neighbor-similarity-search-engine-with-amazon-elasticsearch-service/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Amazon Elasticsearch Service&lt;/a&gt;.&lt;/p&gt;
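&lt;p&gt;The lookup step can be sketched without any library as a brute-force search over cosine similarities. This is not the NMSLIB API: NMSLIB builds an approximate index so that it does not have to compare against every vector. The document ids and the 3-dimensional vectors below are made up.&lt;/p&gt;

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vector, index):
    """Return the (doc_id, similarity) pair with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index.items()]
    return max(scored, key=lambda pair: pair[1])

# Made-up 3-dimensional vectors; real sentence embeddings have hundreds of dimensions
index = {
    "doc_pricing": [0.9, 0.1, 0.0],
    "doc_support": [0.1, 0.9, 0.2],
    "doc_careers": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of a question about costs
best_id, best_score = nearest(query, index)
```

&lt;p&gt;Brute force scales linearly with the number of indexed vectors, which is why a dedicated similarity search library becomes attractive for large collections.&lt;/p&gt;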
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[71 - Chatbot Dialogue]]></title><description><![CDATA[2016 was the year of the chatbot-hype. Talking to your brand through virtual assistants was (is) the future. The challenge is to program a…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/71-chatbot-dialogue/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/71-chatbot-dialogue/</guid><pubDate>Wed, 11 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;2016 was the year of the chatbot-hype. Talking to your brand through virtual assistants was (is) the future. The challenge is to program a natural and convincing chatbot dialogue for the personas of your customers. You have to meet the customers’ needs and respond to their informal language and emojis.&lt;/p&gt;
&lt;p&gt;Some systems to work with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/dialogflow&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Dialogflow&lt;/a&gt; is Google’s development suite for creating conversational AI applications.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://wit.ai/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Wit.ai&lt;/a&gt; is a Facebook company and is free to use (your data will be shared with Wit for open apps).&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://rasa.com/docs/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Rasa&lt;/a&gt; is an open source package to build contextual assistants. You can try training a small chatbot in the &lt;a href=&quot;https://rasa.com/docs/rasa/playground/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Rasa playground&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*1-4RSLl4U9B_OjLVhzYc9Q.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Rasa chatbot example (&lt;a href=&quot;https://legacy-docs.rasa.com/docs/core/0.10.4/motivation/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[70 - Question Answering]]></title><description><![CDATA[Question Answering is the task of automatically answering questions posed by humans in a natural language. There are different settings to…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/70-question-answering/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/70-question-answering/</guid><pubDate>Tue, 10 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Question Answering is the task of automatically answering questions posed by humans in a natural language. There are different settings to answer a question, like abstractive, extractive, boolean and multiple-choice QA. &lt;/p&gt;
&lt;p&gt;Extractive QA aims to extract a substring from the reference text as the answer. Abstractive QA aims to generate an answer that is based on the reference text but need not be a substring of it. Boolean QA answers yes-no questions. Multiple-choice QA selects the answer from several given options.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*H-iB6tnAW_JoUhSGChzwyg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Different QA formats (&lt;a href=&quot;https://arxiv.org/pdf/2005.00700.pdf&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;A variant of regular question answering is Multi-hop question answering, which requires a model to gather information from different parts of a text to answer a question.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*J4vcpW7qWlA7j2cJPZajIg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Multi-hop QA example (&lt;a href=&quot;https://www.semanticscholar.org/paper/Multi-hop-Inference-for-Sentence-level-TextGraphs%3A-Jansen/ad406dea8c9c616402c000f32261700b77f9ac3a&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;A special feature of a QA system is the option to &lt;em&gt;not&lt;/em&gt; answer a question, or to answer ‘idk’ (I don’t know). An example is &lt;a href=&quot;https://rajpurkar.github.io/SQuAD-explorer/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;SQuAD&lt;/a&gt;. The SQuAD 1.0 QA training dataset consists of reference texts with questions that can always be answered. The improved SQuAD 2.0 dataset was supplemented with questions that cannot be answered from the reference text.&lt;/p&gt;
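&lt;p&gt;One simple way to give a system this option is to abstain when the best candidate answer is not confident enough. The sketch below assumes a hypothetical QA model that returns (answer, score) pairs; the threshold value is arbitrary and would have to be tuned on held-out data.&lt;/p&gt;

```python
def answer_or_abstain(candidates, threshold=0.5):
    """Pick the best-scoring candidate answer, or 'idk' if it is not confident enough.

    candidates: list of (answer_text, score) pairs from some QA model (hypothetical).
    """
    if not candidates:
        return "idk"
    best_answer, best_score = max(candidates, key=lambda pair: pair[1])
    return best_answer if best_score >= threshold else "idk"

confident = answer_or_abstain([("Paris", 0.92), ("Lyon", 0.05)])
unsure = answer_or_abstain([("Paris", 0.31), ("Lyon", 0.29)])
```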
&lt;p&gt;As shown, different researchers treat different formats as distinct problems. But AllenAI made &lt;a href=&quot;https://github.com/allenai/unifiedqa&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;UnifiedQA&lt;/a&gt;, which is a T5 (Text-to-Text Transfer Transformer) model that was trained on all types of QA-formats. You can try their &lt;a href=&quot;https://unifiedqa.apps.allenai.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;Another variant is closed-book QA, where no reference text is provided with the question. The required knowledge has to come from within the model itself. The knowledge is stored in the model’s parameters, which it picked up during unsupervised pre-training. You can give this &lt;a href=&quot;https://t5-trivia.glitch.me&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo&lt;/a&gt; a try.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[69 - Relation Extraction]]></title><description><![CDATA[Relationship extraction is the task of extracting semantic relationships from a text. A relation can be defined as a connection between…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/69-relation-extraction/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/69-relation-extraction/</guid><pubDate>Mon, 09 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Relationship extraction is the task of extracting semantic relationships from a text. A relation can be defined as a connection between entities. There are different ways of extracting relations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple deduction: use the presence of two entities in the same sentence (or paragraph) as an unnamed relation.&lt;/li&gt;
&lt;li&gt;Predicate logic: use the Dependency tags to define semantic relation-queries like (subject, verb, object) where the verb defines the relation.&lt;/li&gt;
&lt;li&gt;Hearst Patterns: use the POS-tags to extract Hearst Patterns, which are hierarchical relations based on semantic information. Hearst Patterns are used to extract hypernym relations. A hyponym (e.g. Shakespeare) is in a type-of relationship with its hypernym (e.g. author). These are important for extracting tuples for ontologies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*xzk3cWVoTEQkGlXn2UIcJQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Examples of Hearst Patterns (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
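&lt;p&gt;As a rough illustration of a Hearst Pattern, the sketch below matches the classic “X such as A, B and C” construction with a plain regular expression. This is a simplification: it works on raw words instead of POS-tags, so it will over-match on sentences where ordinary words follow the enumeration. The function name and relation label are made up.&lt;/p&gt;

```python
import re

# One classic Hearst pattern: "HYPERNYM such as HYPONYM, HYPONYM and HYPONYM"
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")

def hearst_such_as(sentence):
    """Return (hyponym, 'type_of', hypernym) tuples found with the 'such as' pattern."""
    relations = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", | and | or ", match.group(2))
        relations.extend((h, "type_of", hypernym) for h in hyponyms if h)
    return relations

pairs = hearst_such_as("He read authors such as Shakespeare, Goethe and Cervantes.")
```

&lt;p&gt;Each resulting tuple links a hyponym to its hypernym and could serve as a candidate entry for an ontology.&lt;/p&gt;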
&lt;ul&gt;
&lt;li&gt;Word2vec similarity: use vector calculations to define relations, like in Gensim: &lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;    import gensim

    # Load a previously trained Word2Vec model from disk
    model = gensim.models.Word2Vec.load(&amp;#39;model-01&amp;#39;)
    # Query the word vectors: add father and son, subtract mother
    model.wv.most_similar(positive=[&amp;#39;father&amp;#39;, &amp;#39;son&amp;#39;], negative=[&amp;#39;mother&amp;#39;], topn=1)

    &amp;gt;&amp;gt;&amp;gt; [(&amp;#39;daughter&amp;#39;, 0.8783684968948364)] &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Question Answering and Slot Filling: ask a question in a certain relationship-template and use the answer to fill the slot. &lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;    Template: husband_of = ”Who is the husband of [PERSON]?”  
    Question: ”Who is the husband of Michelle?”  
    Answer  : ”Barack”  
    Relation: Barack --&amp;gt; husband_of --&amp;gt; Michelle &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
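&lt;p&gt;The template above can be turned into a small sketch. The QA model here is a hypothetical stand-in that looks answers up in a dictionary; in practice it would be replaced by a real question answering model.&lt;/p&gt;

```python
# Relation templates; "husband_of" mirrors the example above
TEMPLATES = {"husband_of": "Who is the husband of {person}?"}

def toy_qa_model(question):
    """Hypothetical stand-in for a real QA model: looks answers up in a dict."""
    canned = {"Who is the husband of Michelle?": "Barack"}
    return canned.get(question)

def extract_relation(relation, person, qa_model=toy_qa_model):
    """Fill the template, ask the question and turn the answer into a triple."""
    question = TEMPLATES[relation].format(person=person)
    answer = qa_model(question)
    if answer is None:
        return None  # the slot stays empty when the model has no answer
    return (answer, relation, person)

triple = extract_relation("husband_of", "Michelle")
```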
&lt;ul&gt;
&lt;li&gt;Transformer for Relation Extraction: use &lt;a href=&quot;https://openreview.net/pdf?id=BJgrxbqp67&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;deep learning&lt;/a&gt; for relation extraction. &lt;a href=&quot;https://nlp.stanford.edu/projects/tacred/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TACRED&lt;/a&gt;, with 106k sentence-level examples and 41 relation types, and &lt;a href=&quot;https://github.com/thunlp/DocRED&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;DocRED&lt;/a&gt;, with 107k document-level examples and 96 relation types, are good relation extraction datasets to train models on.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[68 - Long Text Generation]]></title><description><![CDATA[There are limits to the input for text generation models. Most models have a limited input length of around 500 tokens. This is due to the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/68-long-text-generation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/68-long-text-generation/</guid><pubDate>Sun, 08 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There are limits to the input for text generation models. Most models have a limited input length of around 500 tokens. This is due to the shortcomings of Recurrent Neural Networks (RNN), resulting in vanishing gradients for long sequences where long-term information has to sequentially travel through all cells before getting to the present processing cell.&lt;/p&gt;
&lt;p&gt;Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU) use gating mechanisms and are better capable of learning and remembering over long sequences, but eventually they also break down on very long sequences.&lt;/p&gt;
&lt;p&gt;Vanishing gradients are better avoided by using attention-based models like Transformers, which process the input in parallel, in contrast to RNNs. &lt;a href=&quot;https://arxiv.org/abs/2004.05150&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Longformer&lt;/a&gt; is a model designed for long sequences. It has an attention mechanism that scales linearly with the input sequence length, whereas most self-attention based models scale quadratically and therefore require more memory.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2007.14062&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;BigBird&lt;/a&gt;, from Google, is another model with a sparse attention mechanism that reduces the required computations and memory. BigBird handles inputs that are up to 8 times longer than the original BERT model could handle. Several NLP tasks will benefit from the handling of longer inputs: Long Document Summarization, Question Answering and Genomics Processing.&lt;/p&gt;
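&lt;p&gt;The difference between quadratic and linear scaling can be made tangible by counting how many token pairs need an attention score. The sketch below compares full self-attention with a plain local sliding window. It is only an approximation of the sparse schemes mentioned above, which combine local attention with additional patterns such as a few globally attending tokens; the window size is an arbitrary choice.&lt;/p&gt;

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n score computations."""
    return n * n

def sliding_window_pairs(n, window):
    """Each token attends to at most `window` neighbours on each side, plus itself."""
    total = 0
    for i in range(n):
        left = max(0, i - window)
        right = min(n - 1, i + window)
        total += right - left + 1
    return total

# Compare the growth for a short and a long input sequence
for n in (512, 4096):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

&lt;p&gt;With a fixed window, the number of pairs grows roughly in proportion to the sequence length, while the full attention count grows with its square.&lt;/p&gt;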
&lt;p&gt;Want to autowrite yourself? Use the &lt;a href=&quot;https://transformer.huggingface.co/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Write With Transformers demo&lt;/a&gt; to write text, based on an initial text.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*wcMnHrcGfbFA4CJt_AHOPw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Write With Transformers demo (&lt;a href=&quot;https://transformer.huggingface.co/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[67 - Paraphrasing]]></title><description><![CDATA[Paraphrasing is the task of expressing the meaning of a source text in a new text by using different words and maintaining the semantic…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/67-paraphrasing/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/67-paraphrasing/</guid><pubDate>Sat, 07 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Paraphrasing is the task of expressing the meaning of a source text in a new text by using different words and maintaining the semantic meaning. The goal might be to achieve greater clarity, to prevent plagiarism or to do data augmentation by generating related-but-different training data.&lt;/p&gt;
&lt;p&gt;With rule-based functionality you might replace words by their synonyms, but with Neural Networks the process is more sophisticated and the output has more variety in its expressions. However, these are the types of errors you might encounter:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Mr9H8X6Xvylyy1B_P4uNgg.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Paraphrase errors (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
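&lt;p&gt;The rule-based approach mentioned above can be sketched in a few lines. The synonym dictionary is a hypothetical toy example; a real system would rely on a lexical resource and would have to deal with word sense and inflection, which this sketch ignores.&lt;/p&gt;

```python
# Hypothetical mini synonym dictionary; a real system would use a lexical resource
SYNONYMS = {"big": "large", "quick": "fast", "buy": "purchase"}

def paraphrase_rule_based(sentence):
    """Replace each known word by its synonym and leave all other words unchanged."""
    words = sentence.split()
    return " ".join(SYNONYMS.get(w.lower(), w) for w in words)

result = paraphrase_rule_based("I want to buy a big car")
```

&lt;p&gt;Because every word is replaced independently of its context, this kind of substitution keeps the sentence structure identical, which limits the variety of the output.&lt;/p&gt;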
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[66 - Abstractive Summarization]]></title><description><![CDATA[Abstractive summarization systems generate new phrases. The perfect summarizer truly understands the document and expresses this by using as…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/66-abstractive-summarization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/66-abstractive-summarization/</guid><pubDate>Fri, 06 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Abstractive summarization systems generate new phrases. The perfect summarizer truly understands the document and expresses this by using as few words as possible.&lt;/p&gt;
&lt;p&gt;It is a very difficult task, because the summarizer might produce factually incorrect details, struggle with Out-of-Vocabulary (OOV) words and might be repetitive in its output on important phrases.&lt;/p&gt;
&lt;p&gt;A subtask of Abstractive Summarization is Content Determination. What is the focus of the reader? This is important for deciding what information should be communicated in the summary. If you want to generate a summary from a book, it might be helpful to know what the reader is interested in.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[65 - Machine Translation]]></title><description><![CDATA[Language translation by machines has for decades been one of the most important NLP tasks, because all things start by understanding each other…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/65-machine-translation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/65-machine-translation/</guid><pubDate>Thu, 05 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Language translation by machines has for decades been one of the most important NLP tasks, because all things start by understanding each other without barriers. &lt;a href=&quot;https://translate.google.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Google Translate&lt;/a&gt; still has shortcomings but is the absolute leader, and Facebook is in the race with its multilingual machine translation model &lt;a href=&quot;https://github.com/pytorch/fairseq/tree/master/examples/m2m_100&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;M2M-100&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, custom-built models are within reach with the arrival of &lt;a href=&quot;https://opennmt.net/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Neural Machine Translation implementations&lt;/a&gt;, which provide sequence-to-sequence models, and Parallel Corpora like &lt;a href=&quot;https://paracrawl.eu/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Paracrawl&lt;/a&gt; and &lt;a href=&quot;http://opus.nlpl.eu/opus-100.php&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Opus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With the growing quality of Machine Translation models, there is also an opportunity to better translate training datasets into another language. The English language is almost always used for NLP-blogs, model demos and &lt;a href=&quot;https://paperswithcode.com/sota/question-answering-on-squad20&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;SOTA leaderboards&lt;/a&gt;. Translating these superior resources might benefit your work in other languages.&lt;/p&gt;
&lt;p&gt;The figure below shows a Word Alignment matrix from a Neural Machine Translation task. Each pixel shows the weight of the annotation and explains which positions in the source sentence were considered more important when generating the target word.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*IiFGvqsvFArmhvb9yQVXDQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Word Alignment matrix for a translated sentence (&lt;a href=&quot;https://arxiv.org/pdf/1409.0473.pdf&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[64 - Report Writing]]></title><description><![CDATA[Writing sentences based on structured data is also called Data-to-Text Generation. The task is to generate content without explicitly…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/64-report-writing/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/64-report-writing/</guid><pubDate>Wed, 04 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Writing sentences based on structured data is also called Data-to-Text Generation. The task is to generate content without explicitly modelling what to say and in what order. The task can consist of two steps. The first step is to define which parts of the structured data should get the most attention and in what sequence they should occur. The second step is to generate the content, while taking the first step into account.&lt;/p&gt;
&lt;p&gt;An example is an automatic summary of the financial results on a Business Intelligence (BI) dashboard. Another example comes from sports broadcasting: generating a match report containing statistics on NBA basketball games.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1200/1*Oje-gR-R-C4IpZDq3oYl5w.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Factual mentions from the table are boldfaced in the description. (&lt;a href=&quot;https://www.groundai.com/project/a-hierarchical-model-for-data-to-text-generation/1#S1.F1&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[63 - Next Token Prediction]]></title><description><![CDATA[Can I help you by predicting the next word you will type? A lot of apps are using this auto-complete feature to please the user. N-gram…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/63-next-token-prediction/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/63-next-token-prediction/</guid><pubDate>Tue, 03 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Can I help you by predicting the next word you will type? A lot of apps are using this auto-complete feature to please the user.&lt;/p&gt;
&lt;p&gt;N-gram language models can be used as a simple solution for Next Token prediction. Such a model assigns a probability to a sequence of words, in such a way that more likely sequences receive higher scores. For example, ‘I have a pen’ is expected to have a higher probability than ‘I am a pen’, since the first one is a more natural sentence in the real world.&lt;/p&gt;
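&lt;p&gt;As a minimal sketch of this idea, the Python snippet below trains a bigram model (N = 2) on a tiny invented corpus and uses it to score sentences and to suggest a next word. The corpus, the function names and the [s] start marker are illustrative assumptions; a real system would need far more data and some form of smoothing for unseen word pairs.&lt;/p&gt;

```python
from collections import Counter, defaultdict

# Tiny toy corpus, invented for illustration only.
corpus = [
    "i have a pen",
    "i have a book",
    "i have a pen and a book",
    "i am a student",
]


def train_bigrams(sentences):
    """Count how often each word follows the previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["[s]"] + sentence.split()  # [s] marks the sentence start
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev, k=2):
    """Return up to k most likely next words with their probabilities."""
    total = sum(counts[prev].values())
    if total == 0:
        return []
    return [(word, n / total) for word, n in counts[prev].most_common(k)]


def sentence_probability(counts, sentence):
    """Multiply the bigram probabilities of all word pairs in a sentence."""
    tokens = ["[s]"] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        prob *= counts[prev][nxt] / total if total else 0.0
    return prob


model = train_bigrams(corpus)
print(predict_next(model, "have"))
print(sentence_probability(model, "i have a pen"))
print(sentence_probability(model, "i am a pen"))
```

&lt;p&gt;On this toy corpus ‘i have a pen’ scores higher than ‘i am a pen’, simply because ‘i have’ occurs more often than ‘i am’.&lt;/p&gt;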
&lt;p&gt;A Long Short-Term Memory (LSTM) network is a more advanced approach. It captures the context better and performs better, because not all N-grams have to be calculated.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*tV51rGBFxtnOUElG0bUg3w.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Google’s autocomplete (&lt;a href=&quot;https://www.google.com/search?q=james+bond+is+played+by&amp;oq=James+bond+is+played+by&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[62 - Contextualized Word Representations]]></title><description><![CDATA[Contextualized / Dynamic Word Representations can be seen as incorporating context into word embeddings and is the ‘upgrade’ of Static Word…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/62-contextualized-word-representations/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/62-contextualized-word-representations/</guid><pubDate>Mon, 02 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Contextualized / Dynamic Word Representations can be seen as incorporating context into word embeddings and is the ‘upgrade’ of Static Word Representations. Contextualized Embeddings can be found in models like BERT, ELMo, and GPT-2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*vRoy0cVr--AURNjIi-S0zA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Pretrained Language Models that take context into consideration (&lt;a href=&quot;https://www.cellstrat.com/2020/06/02/nlp-with-bert/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;ELMo (Embeddings from Language Models by AllenNLP) was the response to the polysemy problem: the same word having different meanings depending on its context. It took context into consideration by using an LSTM-based model.&lt;/p&gt;
&lt;p&gt;BERT (Bidirectional Encoder Representations from Transformers by Google) was a follow-up that considered the context from both the left and the right sides of each word. It was universal, because no domain-specific dataset was needed. It was also generalizable, because a pre-trained BERT model can be fine-tuned easily for various downstream NLP tasks.&lt;/p&gt;
&lt;p&gt;GPT (Generative Pretrained Transformer by OpenAI) also emphasized the importance of the Transformer framework, which has a simpler architecture, trains faster and allows more parallelization than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism. Attention is an added layer that lets a model focus on what’s important in a long input sequence.&lt;/p&gt;
&lt;p&gt;For a technical summary of the (20+) available model types see this &lt;a href=&quot;https://huggingface.co/transformers/model_summary.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Transformers Summary&lt;/a&gt; from Hugging Face.&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[61 - Distributed Word Representations]]></title><description><![CDATA[Distributed / Static Word Representations, or Word Vectors or Word Embeddings are multi-dimensional meaning representations of a word, which…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/61-distributed-word-representations/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/61-distributed-word-representations/</guid><pubDate>Sun, 01 Nov 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Distributed / Static Word Representations, or Word Vectors or Word Embeddings are multi-dimensional meaning representations of a word, which are reduced to N dimensions. This technique has received a lot of attention since 2013, when Google published the Word2Vec algorithm. It still comes with several challenges.&lt;/p&gt;
&lt;p&gt;Original word embeddings have one vector per word. A vector typically has 300 or 512 dimensions, and a large model covers around 500k words. This results in embeddings that can grow beyond 500 MB and have to be loaded into memory. There are several ways to reduce this load. One can use fewer dimensions, which makes the vectors less distinctive. One can remove the vectors for infrequent words, but these might be the most interesting words. Or one can map multiple words to one vector (&lt;a href=&quot;https://spacy.io/usage/vectors-similarity#custom-vectors-coverage&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;pruning&lt;/a&gt; in spaCy), but these words will then be 100% similar to each other.&lt;/p&gt;
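&lt;p&gt;To get a feeling for these numbers, the raw size of an embedding table can be estimated as words × dimensions × bytes per value. The sketch below assumes 32-bit floats (4 bytes per value), which is an assumption and not a property of any specific model; the function name is made up for illustration.&lt;/p&gt;

```python
def embedding_size_mb(n_words, n_dims, bytes_per_value=4):
    """Raw size of a dense embedding table in megabytes (10**6 bytes).

    bytes_per_value=4 assumes 32-bit floats, which is an assumption.
    """
    return n_words * n_dims * bytes_per_value / 1_000_000


print(embedding_size_mb(500_000, 300))  # 600.0: the large model from the text
print(embedding_size_mb(500_000, 100))  # 200.0: fewer dimensions
print(embedding_size_mb(100_000, 300))  # 120.0: infrequent words removed
```

&lt;p&gt;Both reduction strategies shrink the table linearly, which is why they are tempting despite their drawbacks.&lt;/p&gt;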
&lt;p&gt;Out-of-vocabulary words are another problem: these are words for which no vector exists in the embedding. Subword approaches try to solve this unknown-word problem by assuming that you can reconstruct a word’s meaning from its parts.&lt;/p&gt;
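&lt;p&gt;A rough sketch of the subword idea: split each word into character n-grams and observe that related words share many of them. The square brackets used as word-boundary markers, the n-gram lengths and the function names are illustrative assumptions. The snippet only shows how the parts are formed; it does not train or combine any vectors.&lt;/p&gt;

```python
def char_ngrams(word, n_min=3, n_max=5, left="[", right="]"):
    """Return the set of character n-grams of a word, including boundary markers."""
    marked = left + word + right
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def overlap(word_a, word_b):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    a, b = char_ngrams(word_a), char_ngrams(word_b)
    return len(a.intersection(b)) / len(a.union(b))


print(sorted(char_ngrams("pen", 3, 3)))
print(overlap("translate", "translation"))
print(overlap("translate", "river"))
```

&lt;p&gt;An unseen word such as ‘translation’ still shares most of its parts with the known word ‘translate’, which is what a subword model relies on.&lt;/p&gt;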
&lt;p&gt;Lexical ambiguity or polysemy is another problem. A word in a word embedding has no context, so the vector for the word ‘bank’ is trained on the semantic context of the ‘river’ bank, but also on that of the ‘financial’ bank. &lt;a href=&quot;https://explosion.ai/blog/sense2vec-with-spacy&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Sense2vec&lt;/a&gt; partly solves this context-sensitivity by taking meta-information into account. The model is trained on words like ‘duck|NOUN’ and ‘duck|VERB’ or ‘Obama|PERSON’ and ‘Obama|ORG’ (e.g. the Obama administration), so that the tag makes the senses more distinct (but what about ‘foot’: body part versus unit of length?). Nowadays the ambiguity problem is addressed by attention-based Contextualized Word Representations.&lt;/p&gt;
&lt;p&gt;An appealing feature of word embeddings (in the early days) was that they capture semantic relations, provided the training corpus reflects them. An example: ‘Paris’ is to ‘France’ as ‘London’ is to […]. The embedding can respond with ‘England’. However, this is not always accurate, and deep learning models are nowadays a better alternative for finding these relations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*hELlVp9hmZbDZVFstS61pg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Semantic relations in word embeddings (&lt;a href=&quot;https://samyzaf.com/ML/nlp/nlp.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The best-known word embedding models are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Word2Vec is the first word vector algorithm, created by Tomáš Mikolov at Google. A widely used implementation is available in &lt;a href=&quot;https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Gensim&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nlp.stanford.edu/projects/glove/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;GloVe&lt;/a&gt; is an algorithm created at Stanford.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://fasttext.cc/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;fastText&lt;/a&gt; was created by Facebook and is a subword embedding in which each word is represented as a bag of character n-grams. This means that vectors for out-of-vocabulary words can be composed from multiple subwords. This makes the algorithm faster, because the embedding is smaller. Trained word vectors for 157 languages are available to download.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://nlp.h-its.org/bpemb/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;BPEmb&lt;/a&gt; is also a subword embedding algorithm. Subwords are based on Byte-Pair Encoding (BPE) which is a specific type of subword tokenization. BPEmb has trained models for 275 languages.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[60 - Document Similarity]]></title><description><![CDATA[The task of estimating the degree of similarity between the semantic representation of two documents can be done by different techniques for…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/60-document-similarity/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/60-document-similarity/</guid><pubDate>Sat, 31 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The task of estimating the degree of similarity between the semantic representation of two documents can be done by different techniques for feature extraction. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The statistical techniques BM25 (Best Matching 25) and TF-IDF (Term Frequency * Inverse Document Frequency), which are respectively the default and the former default similarity algorithm in Elasticsearch and Lucene.&lt;/li&gt;
&lt;li&gt;Latent Semantic Analysis (LSA/LSI) for vectorization of documents. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies a truncated singular value decomposition (closely related to principal component analysis) to the vector space and only keeps the directions that contain the most variance.&lt;/li&gt;
&lt;li&gt;Latent Dirichlet Allocation (LDA), which is a probabilistic method.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://radimrehurek.com/gensim/models/doc2vec.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Doc2Vec&lt;/a&gt; (aka paragraph2vec, aka sentence embeddings), a neural network method that extends the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text.&lt;/li&gt;
&lt;li&gt;USE (Universal Sentence Encoder) encodes text into high dimensional vectors. It has pretrained models for English, but also a &lt;a href=&quot;https://tfhub.dev/google/universal-sentence-encoder-multilingual/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;multilingual&lt;/a&gt; model.&lt;/li&gt;
&lt;/ul&gt;
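&lt;p&gt;To make the first technique in the list above concrete, here is a minimal TF-IDF sketch in plain Python that compares documents by the cosine similarity of their vectors. The three documents, the whitespace tokenization and the unsmoothed log(N / df) weighting are simplifying assumptions; Lucene and Elasticsearch use their own, more refined formulas.&lt;/p&gt;

```python
import math
from collections import Counter

# Toy documents, invented for illustration only.
docs = [
    "sheep and lamb graze in a green field",
    "a young lamb follows the sheep",
    "the ship sails across the sea",
]


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict of term weights) per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # the two sheep documents share terms
print(cosine(vectors[0], vectors[2]))  # no shared terms at all
```

&lt;p&gt;Because TF-IDF only looks at shared terms, the first and third document get a similarity of zero. The embedding-based methods in the list are meant to close that gap.&lt;/p&gt;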
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[59 - Distance Measures]]></title><description><![CDATA[Distance Measures show how similar words are to each other. There is lexical word similarity and semantic word similarity. Lexical similarity…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/59-distance-measures/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/59-distance-measures/</guid><pubDate>Fri, 30 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Distance Measures show how similar words are to each other. There is lexical word similarity and semantic word similarity. Lexical similarity compares the spelling of words, so sheep and ship are more similar than sheep and lamb, because meaning is ignored. This can be calculated with the Levenshtein Distance, which is used by the &lt;a href=&quot;https://github.com/maxbachmann/rapidfuzz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;RapidFuzz&lt;/a&gt; library. Semantic similarity measures the meaning of the words, so sheep and lamb are more similar than sheep and ship. This can be calculated by measuring the cosine distance between word vectors.&lt;/p&gt;
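&lt;p&gt;Both measures fit in a few lines of plain Python. The sketch below implements the Levenshtein distance with dynamic programming (it does not use RapidFuzz) and the cosine distance on three hand-picked toy vectors. The vector values are invented for illustration and do not come from a trained embedding.&lt;/p&gt;

```python
import math


def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hand-picked toy vectors (assumption): not taken from a real model.
toy_vectors = {
    "sheep": [0.9, 0.8, 0.1],
    "lamb": [0.8, 0.9, 0.2],
    "ship": [0.1, 0.2, 0.9],
}

print(levenshtein("sheep", "ship"))  # 2
print(levenshtein("sheep", "lamb"))  # 5
print(cosine_distance(toy_vectors["sheep"], toy_vectors["lamb"]))
print(cosine_distance(toy_vectors["sheep"], toy_vectors["ship"]))
```

&lt;p&gt;The edit distance puts sheep closer to ship, while the cosine distance on the toy vectors puts sheep closer to lamb.&lt;/p&gt;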
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*a3ksdSRePqWDQCYyh3U3wg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Lexical vs Semantic similarity (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[58 - WordNet Synsets]]></title><description><![CDATA[WordNet is a lexical database from Princeton. It consists of nouns, verbs, adjectives and adverbs which are grouped into sets of synonyms…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/58-wordnet-synsets/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/58-wordnet-synsets/</guid><pubDate>Thu, 29 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://wordnet.princeton.edu/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;WordNet&lt;/a&gt; is a lexical database from Princeton. It consists of nouns, verbs, adjectives and adverbs which are grouped into sets of synonyms (synsets), hyponyms, meronyms and hypernyms with descriptions. Each synset describes a concept and is interlinked by means of conceptual-semantic and lexical relations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*drXFxQuhg9o-VDXnhE-B3Q.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;WordNet results for ‘school’. S = Synset (semantic) relations, W = Word (lexical) relations, n = noun, v = verb, adj = adjective (&lt;a href=&quot;http://wordnetweb.princeton.edu/perl/webwn?o2=&amp;o0=1&amp;o8=1&amp;o1=1&amp;o7=&amp;o5=&amp;o9=&amp;o6=&amp;o3=&amp;o4=&amp;s=school&amp;i=1&amp;h=1010011231231223123122312302222100000000000#c&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Similar structured knowledge bases are &lt;a href=&quot;https://conceptnet.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ConceptNet&lt;/a&gt; and &lt;a href=&quot;http://verbs.colorado.edu/verbnet/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;VerbNet&lt;/a&gt;. These focus on the major languages, but there are also initiatives for smaller languages like the &lt;a href=&quot;http://wordpress.let.vupr.nl/odwn/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Open Dutch WordNet&lt;/a&gt;. Applications include word sense disambiguation, classification of texts, finding similar terms, lexical simplification, etc.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[57 - Outlier Detection]]></title><description><![CDATA[Outliers or Anomalies are generally defined as samples that are exceptionally far from the mainstream of (textual) data. The threshold at which…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/57-outlier-detection/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/57-outlier-detection/</guid><pubDate>Wed, 28 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Outliers or Anomalies are generally defined as samples that are exceptionally far from the mainstream of (textual) data. The threshold at which something counts as an outlier is very subjective. If you have a vocabulary, an outlier might be defined as a word that is Out-of-Vocabulary (OOV).&lt;/p&gt;
&lt;p&gt;Alternatively, an outlier can be the result of an extreme class imbalance and can be measured in terms of its word or document vector.&lt;/p&gt;
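&lt;p&gt;The vocabulary-based definition can be sketched in a few lines of Python: compute the share of Out-of-Vocabulary tokens per text and flag texts above a threshold. The vocabulary, the example texts and the threshold of 0.5 are invented for illustration, and the threshold remains the subjective choice mentioned above.&lt;/p&gt;

```python
# Toy vocabulary, invented for illustration only.
vocabulary = {"the", "ship", "sails", "across", "sea", "sheep", "lamb", "a"}


def oov_rate(text, vocab):
    """Fraction of whitespace-separated tokens that are not in the vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocab]
    return len(unknown) / len(tokens)


def is_outlier(text, vocab, threshold=0.5):
    """Flag a text when more than `threshold` of its tokens are OOV."""
    return oov_rate(text, vocab) > threshold


print(oov_rate("The ship sails across the sea", vocabulary))
print(is_outlier("xqzt blorp the sea vrrm", vocabulary))
```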
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[56 - Trend Detection]]></title><description><![CDATA[Trending topics in Twitter’s streaming data are one of the best examples of the Trend Detection task. Capturing topics, thoughts and emotions…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/56-trend-detection/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/56-trend-detection/</guid><pubDate>Tue, 27 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Trending topics in Twitter’s streaming data are one of the best &lt;a href=&quot;https://trends24.in/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;examples&lt;/a&gt; of the Trend Detection task. Capturing topics, thoughts and emotions over time produces a very insightful starting point for analysis.&lt;/p&gt;
&lt;p&gt;You can quantify the deviation of a particular word count beyond the expected variability, and you can define a threshold above which you call the count a trend. If there is historical data, you can take seasonality-patterns into account in your time-series analysis. &lt;/p&gt;
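&lt;p&gt;One simple way to quantify such a deviation is a z-score: how many standard deviations the current count lies above the historical mean. The sketch below is only one possible setup; the hourly counts, the threshold of 3 and the function names are assumptions, and seasonality is ignored.&lt;/p&gt;

```python
import statistics

# Hypothetical hourly counts of one word in the recent past (invented data).
history = [12, 15, 11, 14, 13, 12, 16]


def z_score(past_counts, current):
    """Number of standard deviations the current count is away from the mean."""
    mean = statistics.mean(past_counts)
    std = statistics.pstdev(past_counts)
    if std == 0:
        return 0.0  # no variability observed, so no meaningful score
    return (current - mean) / std


def is_trending(past_counts, current, threshold=3.0):
    """Call the count a trend when it lies far above the expected variability."""
    return z_score(past_counts, current) > threshold


def is_cooling(past_counts, current, threshold=3.0):
    """Mirror image of a trend: the count drops far below what is expected."""
    return -z_score(past_counts, current) > threshold


print(is_trending(history, 40))
print(is_trending(history, 15))
print(is_cooling(history, 2))
```

&lt;p&gt;With a higher threshold fewer counts are called a trend, but real trends are also detected later.&lt;/p&gt;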
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*mdeFIbIcFqGoHFUwn1e5ww.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Three basic types of Anomalies (&lt;a href=&quot;https://developer.twitter.com/content/dam/developer-twitter/pdfs-and-files/Trend-Detection.pdf&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The difficulty is that you often don’t know the scale, size or time interval of the change in advance. Depending on your use case, different setups are possible. However, all algorithms present trade-offs, including simplicity vs. robustness, precision, recall and time-to-detection. An older Python package from &lt;a href=&quot;https://github.com/twitterdev/Gnip-Trend-Detection&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Twitterdev&lt;/a&gt; might give you a quick start.&lt;/p&gt;
&lt;p&gt;An interesting variant is Cold Trend Detection. This shows which topics have the highest negative change in scores and are cooling down at a certain time.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[55 - Topic Modeling]]></title><description><![CDATA[To divide a set of documents into N unsupervised topics, the documents should be represented by compact vectors. Term Frequency * Inverse…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/55-topic-modeling/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/55-topic-modeling/</guid><pubDate>Mon, 26 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;To divide a set of documents into N unsupervised topics, the documents should be represented by compact vectors. Term Frequency * Inverse Document Frequency (&lt;a href=&quot;https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TF-IDF&lt;/a&gt;), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are the best-known Vector Space Model algorithms for transforming a document into a vector.&lt;/p&gt;
&lt;p&gt;The critical step in Topic Modeling is to determine how similar these vectors should be, which documents belong to a specific topic and how many topics should be distinguished. In most libraries you have to define how many topics (clusters) the algorithm should generate. &lt;/p&gt;
&lt;p&gt;However, the library &lt;a href=&quot;https://github.com/ddangelov/Top2Vec&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Top2Vec&lt;/a&gt; automatically reduces the number of dimensions and finds dense areas in this reduced space, so it determines the number of topics for you. You can reduce this number of topics by iteratively merging the smallest topic into its most similar topic until you reach your target number. It also combines document vectors and word vectors to determine topic vectors and their most important words.&lt;/p&gt;
&lt;p&gt;Another informative tool that should be mentioned is &lt;a href=&quot;https://github.com/bmabey/pyLDAvis&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;pyLDAvis&lt;/a&gt;, which is a library for interactive topic model visualization (&lt;a href=&quot;https://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;demo&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Zo-p_GV1OjJKbubIq-Ldxw.gif&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;The 5 steps of the Top2Vec algorithm (&lt;a href=&quot;https://github.com/ddangelov/Top2Vec&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[54 - Extractive Summarization]]></title><description><![CDATA[Extractive Summarization (or summary generation) works in the same way as Keyword extraction. The most relevant sentences are extracted. The…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/54-extractive-summarization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/54-extractive-summarization/</guid><pubDate>Sun, 25 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Extractive Summarization (or summary generation) works in the same way as Keyword extraction. The most relevant sentences are extracted. The algorithm selects sentences by finding the combination of words that are important or seem representative of the entire text. That’s why packages that support Summarization often also support Keyword detection. A variant is multi-document summarization.&lt;/p&gt;
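&lt;p&gt;A very small frequency-based sketch of this idea: score each sentence by the average corpus frequency of its non-stopword tokens and keep the best ones. The stopword list, the regular expressions and the function name are illustrative assumptions, and this is a much simpler scoring scheme than graph-based algorithms such as TextRank.&lt;/p&gt;

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption, far from complete).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "that", "are", "for", "on", "by"}


def content_words(text):
    """Lowercased word tokens without stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


text = (
    "Word embeddings map words to vectors. "
    "The weather was nice yesterday. "
    "Vectors of similar words are close together, so embeddings capture word meaning."
)
print(summarize(text, n_sentences=1))
```

&lt;p&gt;The same frequency table could be used to list the top keywords, which illustrates why both tasks are often offered by the same package.&lt;/p&gt;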
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*k2PR-Nbo5pH3aV99uvhAjw.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Extractive vs Abstractive summarization (&lt;a href=&quot;https://isarth.github.io/textrank/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Extractive summarization is also important for the question answering task. By collecting the most relevant documents for a particular question, a summarizer could assemble a cohesive context for the answer. The other way around is also interesting. When building training data for the QA task you have to generate relevant questions; Extractive summarization can identify the important sentences that you want questions about.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[53 - Keyword Extraction]]></title><description><![CDATA[Provide the most relevant words from this document. That is the task of keyword extraction. It is often used as a starting point when doing…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/53-keyword-extraction/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/53-keyword-extraction/</guid><pubDate>Sat, 24 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Provide the most relevant words from this document. That is the task of keyword extraction. It is often used as a starting point when doing unsupervised text analytics. Keywords are also known as common phrases, multi-word expressions and word n-gram collocations.&lt;/p&gt;
&lt;p&gt;Keywords can, for example, be calculated with the &lt;a href=&quot;https://gist.github.com/BrambleXu/3d47bbdbd1ee4e6fc695b0ddb88cbf99&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Textrank&lt;/a&gt; algorithm or by detecting &lt;a href=&quot;https://radimrehurek.com/gensim/models/phrases.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Ngrams&lt;/a&gt;.&lt;/p&gt;
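&lt;p&gt;The n-gram route can be sketched with a plain frequency count: word pairs that occur together repeatedly are candidate multi-word keywords. The example sentence, the minimum count and the function name are assumptions. Libraries such as Gensim use a statistical score instead of a raw count, so treat this only as an illustration of the principle.&lt;/p&gt;

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(" ".join(pair), n) for pair, n in pairs.most_common() if n >= min_count]


tokens = "machine translation is hard but machine translation keeps improving".split()
print(frequent_bigrams(tokens))
```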
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[52 - Multi-Label Multi-Class Classification]]></title><description><![CDATA[A specific and difficult sub-solution of Text Classification is Multi-Label Multi-Class Text Classification.  This article is part of the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/52-multi-label-multi-class-classification/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/52-multi-label-multi-class-classification/</guid><pubDate>Fri, 23 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A specific and difficult sub-solution of Text Classification is Multi-Label Multi-Class Text Classification.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*hL4Gvk9PdkAfHongaZKBWw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Classifier types (&lt;a href=&quot;https://stats.stackexchange.com/questions/11859/what-is-the-difference-between-multiclass-and-multilabel-problem&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[51 - Text Classification]]></title><description><![CDATA[Text classification is the process of assigning tags or categories to text according to its content. It is the broader task of which Intent…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/51-text-classification/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/51-text-classification/</guid><pubDate>Thu, 22 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Text classification is the process of assigning tags or categories to text according to its content. It is the broader task of which Intent, Sentiment and Spam classification are part.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*y_n8Jf5XM5yO_Q0VufG6lg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Tweet from Richard Socher (&lt;a href=&quot;https://twitter.com/RichardSocher/status/840333380130553856&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;You can start classifying text by using a variety of open source libraries, from &lt;a href=&quot;https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;scikit-learn&lt;/a&gt; to &lt;a href=&quot;https://www.nltk.org/book/ch06.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NLTK&lt;/a&gt; and &lt;a href=&quot;https://spacy.io/usage/training#textcat&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;spaCy&lt;/a&gt;. Or watch the video below for a quick intro.&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Training an Insults Classifier in 1 hour (&lt;a href=&quot;https://www.youtube.com/watch?v=5di0KlKl0fE&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[50 - Intent Classification]]></title><description><![CDATA[Chatbots and E-assistants have to accomplish two tasks: understanding the user and giving the correct responses. Intent Classification is…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/50-intent-classification/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/50-intent-classification/</guid><pubDate>Wed, 21 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Chatbots and E-assistants have to accomplish two tasks: understanding the user and giving the correct responses. Intent Classification is the task of finding out what the user exactly asks. For example, does she want to buy tickets, or does she want to know the price of tickets?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*vOH8tZMlaWBTeqCO-Pq2dg.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;The Utterance is the user’s quest. The Intent classifier detects the next-best action. Parameterized by the entities. (&lt;a href=&quot;https://www.elasticpath.com/blog/how-to-choose-an-ecommerce-chatbot-14-solutions-reviewed&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The classification task requires training data, which can be difficult to create and is often domain specific. &lt;a href=&quot;https://github.com/budzianowski/multiwoz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;MultiWOZ&lt;/a&gt; is a well-known task-oriented dialogue dataset containing over 10k dialogues spanning 8 domains.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[49 - Sentiment and Emotion Detection]]></title><description><![CDATA[Sentiment Analysis or Opinion Mining attempts to determine the overall attitude (positive or negative) expressed within a text. Its purpose…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/49-sentiment-and-emotion-detection/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/49-sentiment-and-emotion-detection/</guid><pubDate>Tue, 20 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Sentiment Analysis or Opinion Mining attempts to determine the overall attitude (positive or negative) expressed within a text. Its purpose is to represent emotional or affective meaning.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*ioJAE-VEx8BSFFLiou5dmA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Simple classification with 3 labels (&lt;a href=&quot;https://thedatascientist.com/issue-sentiment-analysis/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The technique’s success lies in an imperative need to standardize the measurement of human emotion in social media in order to efficiently monetize it. Signaling a user’s preference is a vital component of the &lt;em&gt;Like economy&lt;/em&gt; that underpins the business models of all major social media companies. These peer recommendations are highly trusted by other peers. This makes Sentiment Analysis a valuable source of information.&lt;/p&gt;
&lt;p&gt;Sentiment Analysis has a lot of &lt;a href=&quot;https://www.innerdoc.com/sentiment-analysis/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;challenges&lt;/a&gt;. It is often unclear what is measured: the polarity of a described entity or the emotional state of the writer.&lt;/p&gt;
&lt;p&gt;A more focused subfield is Aspect-Based Sentiment Analysis (ABSA). First, the aspect details (or sentiment target) are detected. For example, the sentiment target is a phrase about product X. Then the sentiment of the opinion about the target is detected. A third step, which is often omitted, is to detect the opinion holder.&lt;/p&gt;
&lt;p&gt;The expansion of the area of Sentiment Analysis has resulted in a new interest in the quantification of opinion, sentiment, affect, feeling, emotion, personality, mood and attitude. These terms are often used interchangeably.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*N7-MlH8SaeYRvX34H3N3SQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Plutchik’s Wheel of Emotions (&lt;a href=&quot;https://www.innerdoc.com/affective-meaning/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;We differentiate sentiment from &lt;a href=&quot;https://www.innerdoc.com/affective-meaning/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;affective meaning&lt;/a&gt; based on their duration. Sentiment lives longer, while an affective state has a short-term duration. An effective sentiment analysis system is one which captures the sentiment of the opinion about an entity. The recognition of Affective Meaning is, for example, the automatic discovery of an emotional reaction, often of a single person. Unlike opinions, emotions are short-term.&lt;/p&gt;
&lt;p&gt;Words indicating the level of sentiment will behave very differently when under the semantic scope of negation (see negation section). If you want to analyze e.g. social media text without training a classifier, you can use the Python library &lt;a href=&quot;https://github.com/cjhutto/vaderSentiment&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;VADER&lt;/a&gt;. You can also modify the library to work with non-English text.&lt;/p&gt;
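&lt;p&gt;The effect of negation on a lexicon-based approach can be sketched in a few lines. This is a toy illustration and not how VADER is implemented: the lexicon, its scores, the negation list and the function name polarity are all made up for this example.&lt;/p&gt;

```python
# Toy lexicon with made-up valence scores (assumption; real lexicons are far larger).
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Sum word valences, flipping the next sentiment word after a negation."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = LEXICON.get(word, 0.0)
        if valence != 0.0:
            score += -valence if negate else valence
            negate = False  # the negation has been consumed
    return score

print(polarity("The movie was good."))
print(polarity("The movie was not good."))
```

&lt;p&gt;The same word contributes a positive or a negative score depending on whether it falls under the scope of a negation, which is why a plain word count is not enough.&lt;/p&gt;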
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[48 - Spam Detection]]></title><description><![CDATA[Spam Detection is one of the oldest applications of NLP and is a frequently seen use case for demos and tutorials. Receiving email from a…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/48-spam-detection/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/48-spam-detection/</guid><pubDate>Mon, 19 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Spam Detection is one of the oldest applications of NLP and is a frequently seen use case for demos and tutorials. Receiving email from a Nigerian prince might still be a problem, but ISPs are more and more successful in detecting and filtering it out.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*68iYn4wq8VpEuXsa4anLig.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Global spam volume as percentage of total e-mail traffic from 2007 to 2019 (&lt;a href=&quot;https://www.statista.com/statistics/420400/spam-email-traffic-share-annual/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[47 - Monitoring Models]]></title><description><![CDATA[Once deployed you need to monitor your processors and improve latency and inference speed. Language Models need to be monitored. Maybe you…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/47-monitoring-models/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/47-monitoring-models/</guid><pubDate>Sun, 18 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Once deployed you need to monitor your processors and improve latency and inference speed. Language Models need to be monitored. Maybe you have a feedback loop that gives the end-user the opportunity to tell if the model did a good or bad job.&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[46 - Deploying Models]]></title><description><![CDATA[Once you have built your Language Model, you need to deploy it. It might be a building block in a larger pipeline. You might want to build a…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/46-deploying-models/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/46-deploying-models/</guid><pubDate>Sat, 17 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Once you have built your Language Model, you need to deploy it. It might be a building block in a larger pipeline. You might want to build a Model Factory to periodically retrain the model. You might need specialized processors like GPUs and TPUs or use a special runtime like &lt;a href=&quot;https://github.com/onnx/onnx&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ONNX&lt;/a&gt; to accelerate inference of AI (Transformer) models. A real DevOps job.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[45 - Explaining Models]]></title><description><![CDATA[The business will require that your model can explain its outcomes. Transparency is needed to prevent distrust. For example, to prevent…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/45-explaining-models/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/45-explaining-models/</guid><pubDate>Fri, 16 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The business will require that your model can explain its outcomes. Transparency is needed to prevent distrust. For example, to prevent &lt;a href=&quot;https://www.oxfordinsights.com/racial-bias-in-natural-language-processing&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;racial bias&lt;/a&gt; in sentiment analysis.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/TeamHG-Memex/eli5&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ELI5&lt;/a&gt; (Explain Like I’m 5) is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports several Machine Learning frameworks, like Scikit-Learn and Keras.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*WXnoJUwlG2tR3NWofHPtKw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;ELI5 Textexplainer showing which words contribute to the prediction of y=sci.med (&lt;a href=&quot;https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html#textexplainer&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/marcotcr/lime&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;LIME&lt;/a&gt; (Local Interpretable Model-Agnostic Explanations) explains individual predictions for text classifiers:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*WDQadJlcQKUNq4AoYFp1-w.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;LIME explanations (&lt;a href=&quot;https://marcotcr.github.io/lime/tutorials/Lime%20-%20multiclass.html#Explaining-predictions-without-headers,-quotes-and-footers&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/slundberg/shap&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;SHAP&lt;/a&gt; (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*upS3k3UUv_X4-ckzaiDDHA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;SHAP for multi-class text data, with top positive words for emotion angry (&lt;a href=&quot;https://shap.readthedocs.io/en/latest/example_notebooks/partition_explainer/Emotion%20Data%20Multiclass%20Text%20Explanation%20Demo.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[44 - Evaluating Models]]></title><description><![CDATA[To evaluate the quality of a Language Model, it should be compared based on some score. For supervised models, like a text classification…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/44-evaluating-models/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/44-evaluating-models/</guid><pubDate>Thu, 15 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;To evaluate the quality of a Language Model, it should be compared based on some score. For supervised models, like a text classification model, it will be easy to evaluate a model with metrics like precision, recall, accuracy and F1-score. For unsupervised models, like Natural Language Generation models, there are metrics like BLEU, ROUGE and METEOR and benchmarks like GLUE for extrinsic evaluation, and Perplexity as an intrinsic evaluation.&lt;/p&gt;
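&lt;p&gt;For a binary text classifier, precision, recall and F1-score follow directly from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions; the function name, the labels and the toy predictions are assumptions for illustration.&lt;/p&gt;

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up gold labels and predictions.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall_f1(y_true, y_pred))
```

&lt;p&gt;In practice you would rather use a tested implementation from a library, but the arithmetic is no more than this.&lt;/p&gt;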
&lt;p&gt;BLEU is a popular word-overlap metric that compares n-grams between a candidate text and a reference text. Unfortunately, it is unable to capture semantics and can give poor scores even for appropriate responses. BLEU is popular for evaluating machine translation models.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*uCq5y9bgRVsxxrUHq7M_vQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;BLEU Metric (&lt;a href=&quot;https://www.coursera.org/lecture/machinetranslation/bleu-Bv81F&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
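&lt;p&gt;The n-gram comparison at the heart of BLEU can be sketched as a clipped (modified) n-gram precision. This is a simplified illustration only: full BLEU also combines several n-gram orders and applies a brevity penalty, both of which are left out here. The function names and example sentences are assumptions.&lt;/p&gt;

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "the cat is on the mat"
candidate = "the the the cat mat"
print(modified_precision(candidate, reference, 1))
print(modified_precision(candidate, reference, 2))
```

&lt;p&gt;Because only surface overlap is counted, a correct paraphrase that uses different words would score close to zero, which is the weakness described above.&lt;/p&gt;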
&lt;p&gt;The &lt;a href=&quot;https://gluebenchmark.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;GLUE&lt;/a&gt; (General Language Understanding Evaluation) benchmark is model-agnostic, so any system capable of processing sentences and sentence pairs and producing corresponding predictions is eligible to participate. It offers many data sets covering several genres and tasks, such as coreference resolution, sentiment analysis and question answering. It comes with a leaderboard, so people can use the data sets and see how well their models perform compared to others.&lt;/p&gt;
&lt;p&gt;You can calculate these metrics with the &lt;a href=&quot;https://github.com/Maluuba/nlg-eval&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NLG-eval&lt;/a&gt; package or go to &lt;a href=&quot;https://huggingface.co/metrics&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Huggingface&lt;/a&gt; for an overview and explanation of several metrics.&lt;/p&gt;
&lt;p&gt;Perplexity is an intrinsic evaluation method. It is less informative than the extrinsic metrics, but it is useful for quickly evaluating the language model itself (e.g. for LDA) without taking into account the specific task it is going to be used for. In everyday language, perplexity is the inability to understand something; for a language model, it measures how surprised the model is by a sample. A low perplexity indicates the model is good at predicting the sample. Reported SOTA (state-of-the-art) perplexity for a language model is about 11, but such scores are only comparable between models evaluated on the same data set.&lt;/p&gt;
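&lt;p&gt;Perplexity is the exponential of the average negative log-probability that the model assigned to each token of the sample. A minimal sketch, assuming you already have those per-token probabilities (the lists below are made up):&lt;/p&gt;

```python
import math

def perplexity(token_probabilities):
    """exp of the mean negative log-probability of the observed tokens."""
    n = len(token_probabilities)
    avg_neg_log = -sum(math.log(p) for p in token_probabilities) / n
    return math.exp(avg_neg_log)

# Hypothetical probabilities two models gave to the same four tokens.
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]
print(perplexity(confident))
print(perplexity(unsure))
```

&lt;p&gt;A model that gives every token a probability of 0.5 has a perplexity of about 2, as if it were choosing between two equally likely options at each step. With 0.1 per token the perplexity rises to about 10.&lt;/p&gt;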
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[43 - Training Models]]></title><description><![CDATA[Training NLP models is a broad topic. It’s best to start light and improve later. You can start by building a rulebased model for two hours…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/43-training-models/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/43-training-models/</guid><pubDate>Wed, 14 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Training NLP models is a broad topic. It’s best to start light and improve later. You can start by building a rulebased model for two hours and experience how good it scores. Take this as a baseline score. Then try to improve this with a simple technique like a regression model. If you want to elaborate further, try training a deeplearning model.&lt;/p&gt;
&lt;p&gt;The more complex your model, the longer the training time. More performance requires better hardware. Instead of CPUs you might need GPUs or TPUs.&lt;/p&gt;
&lt;p&gt;Yoav Goldberg talked about the required expertise to build NLP models. His vision is that in the future (2021+) humans won’t require much ML or linguistic expertise. Humans will be writing rules, aided by ML/DL, resulting in transparent and debuggable models.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*r0yLdqjTOneXjO7vXOk5dA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;From Yoav Goldberg’s presentation The missing elements in NLP (spaCy IRL 2019) (&lt;a href=&quot;https://www.youtube.com/watch?v=e12danHhlic&amp;amp;feature=youtu.be&amp;amp;list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc&amp;amp;t=109&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[42 - Language Identification]]></title><description><![CDATA[Language identification is one of the first tasks you will do, because you have to select the right language-specific model. You can use the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/42-language-identification/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/42-language-identification/</guid><pubDate>Tue, 13 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Language identification is one of the first tasks you will do, because you have to select the right language-specific model. You can use the python library &lt;a href=&quot;https://github.com/Mimino666/langdetect&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;langdetect&lt;/a&gt; (55 lang, 99% acc) or &lt;a href=&quot;https://fasttext.cc/docs/en/language-identification.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;fasttext&lt;/a&gt; (176 lang, 93% acc).&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;    from langdetect import detect, detect_langs

    # detect() returns the single most probable language code
    detect(&amp;quot;War doesn&amp;#39;t show who&amp;#39;s right, just who&amp;#39;s left.&amp;quot;)
    # en

    # detect_langs() returns candidate languages with their probabilities
    detect_langs(&amp;quot;Otec matka syn.&amp;quot;)
    # example output: [sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[41 - Meta-Info Extractor]]></title><description><![CDATA[If a document is considered to be information , its title, URL, publishing date, last-editing date, extraction-timestamp, author, filetype…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/41-meta-info-extractor/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/41-meta-info-extractor/</guid><pubDate>Mon, 12 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If a document is considered to be &lt;em&gt;information&lt;/em&gt; , its title, URL, publishing date, last-editing date, extraction-timestamp, author, filetype and subject are examples of &lt;em&gt;meta-information&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;When you extract text from a file with &lt;a href=&quot;https://tika.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Tika&lt;/a&gt;, it will also provide the meta-information. When you use the Twitter API, you can also get the meta-information of a tweet. A Tweet can have over &lt;a href=&quot;https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/intro-to-tweet-json&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;150 attributes&lt;/a&gt; associated with it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*suo0aGZ5sHSYu6l0C08zIg.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Tweet metadata fields (&lt;a href=&quot;https://tighewall.com/2014/10/15/how-youre-actually-paying-for-those-free-apps-and-online-services/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Once you have extracted the meta-information, it’s interesting to combine it with info extracted from the text. For example, the publication date of the document can be used as the reference date for temporal expressions: you can resolve the word ‘yesterday’ to an actual date based on the publication date.&lt;/p&gt;
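&lt;p&gt;Resolving a relative expression against the publication date can be sketched as follows. The mapping of expressions to day offsets and the function name are assumptions for this illustration; a real temporal tagger covers far more expressions.&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical offsets in days for a few relative expressions.
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_relative_date(expression, publication_date):
    """Turn a relative expression into a date, or None if it is unknown."""
    offset = RELATIVE_DAYS.get(expression.lower())
    if offset is None:
        return None
    return publication_date + timedelta(days=offset)

published = date(2020, 10, 12)  # taken from the document meta-information
print(resolve_relative_date("yesterday", published))
```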
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[40 - Raw Text Cleaning]]></title><description><![CDATA[The task of Raw Text Cleaning consists of several pre-processing steps with the goal to increase the quality of subsequent NLP tasks. Dedash…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/40-raw-text-cleaning/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/40-raw-text-cleaning/</guid><pubDate>Sun, 11 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The task of Raw Text Cleaning consists of several pre-processing steps with the goal to increase the quality of subsequent NLP tasks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedash (rejoin) words that were hyphenated and split up because the end of the line was reached.&lt;/li&gt;
&lt;li&gt;Replace redundant whitespace like multiple line-ends, tabs and spaces.&lt;/li&gt;
&lt;li&gt;Handle HTML tags like &amp;#x3C; b &gt;, &amp;#x3C; div &gt;, &amp;#x3C; span &gt;, &amp;#x3C; \ br &gt;&lt;/li&gt;
&lt;li&gt;Smart Decapitalization for SCREAMING words&lt;/li&gt;
&lt;li&gt;Replace rare special characters by a common equivalent, like a standard dash for the figure dash (‒), en dash (–), em dash ( — ), horizontal bar (―) and swung dash (⁓), which shouldn’t be confused with the tilde (~).&lt;/li&gt;
&lt;li&gt;Replace special letter (áççêñtèd) characters with their simple form.&lt;/li&gt;
&lt;li&gt;Remove footnotes, page numbers, headers and references.&lt;/li&gt;
&lt;li&gt;Replace number words by values: ‘hundred thousand’ to 100,000&lt;/li&gt;
&lt;li&gt;Replace emojis by their description, like 😶 to ‘no mouth’&lt;/li&gt;
&lt;li&gt;Remove URL and Email references&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Etcetera, etcetera! Many of the other NLP tasks described here can also be used for Text Cleaning.&lt;/p&gt;
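&lt;p&gt;A few of the steps above can be chained in a small function. This is a minimal sketch with the Python standard library, not a complete cleaning pipeline: the function name, the regular expressions and the order of the steps are assumptions, and the naive dedash rule also removes the hyphen of a genuine compound that happens to break at a line end.&lt;/p&gt;

```python
import re
import unicodedata

# Figure dash, en dash, em dash, horizontal bar, swung dash.
DASHES = "\u2012\u2013\u2014\u2015\u2053"

def clean_text(raw):
    """Apply a handful of cleaning steps and return a single-line string."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)            # rejoin words split at a line end
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)      # drop URLs and email addresses
    text = text.translate({ord(ch): "-" for ch in DASHES})  # map dash variants to a plain dash
    text = unicodedata.normalize("NFKD", text)              # split letters from their accents
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip()                # collapse redundant whitespace

raw = "Pre-\nprocessing  the\tr\u00e9sum\u00e9 \u2013 see https://example.com or mail me@example.com now."
print(clean_text(raw))
```

&lt;p&gt;Which steps you need depends on the subsequent task. Stripping accents, for example, helps matching but destroys information in languages where the accent changes the meaning of a word.&lt;/p&gt;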
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[39 - Deduplication]]></title><description><![CDATA[Deduplication is a relevant task if you have little control over your input data collection. For example, you can have a lot of duplicate…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/39-deduplication/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/39-deduplication/</guid><pubDate>Sat, 10 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Deduplication is a relevant task if you have little control over your input data collection. For example, you can have a lot of duplicate documents when scraping internet articles or using tweets. &lt;/p&gt;
&lt;p&gt;There are different methods for creating a unique document set. For finding texts that are exactly the same you can use &lt;a href=&quot;https://docs.python.org/3/library/hashlib.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;hashing&lt;/a&gt;. For comparing similarity between texts you can use &lt;a href=&quot;https://github.com/maxbachmann/rapidfuzz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;fuzzy string matching&lt;/a&gt; or &lt;a href=&quot;https://docs.python.org/3/library/difflib.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;subsequence matching&lt;/a&gt;. This often performs badly when you want to scale up: calculation costs grow quadratically with the number of documents, because every document is compared against every other. A solution is to cluster the documents first. You then only compare documents within the same cluster instead of comparing all documents against each other.&lt;/p&gt;
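&lt;p&gt;Both ideas fit in one short sketch: a hash catches exact duplicates cheaply, and difflib catches near duplicates. The function names, the normalisation and the threshold of 0.9 are assumptions chosen for this example, and the clustering step is left out.&lt;/p&gt;

```python
import hashlib
from difflib import SequenceMatcher

def fingerprint(text):
    """SHA-256 hash of the lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def deduplicate(documents, threshold=0.9):
    """Keep the first occurrence of each exact or near-duplicate document."""
    unique = []
    seen_hashes = set()
    for doc in documents:
        digest = fingerprint(doc)
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        # Pairwise comparison against everything kept so far: the quadratic part.
        if any(SequenceMatcher(None, doc, kept).ratio() >= threshold for kept in unique):
            continue  # near duplicate
        seen_hashes.add(digest)
        unique.append(doc)
    return unique

docs = [
    "Breaking news: markets rally today",
    "breaking news:   markets rally today",
    "Breaking news: markets rally today!",
    "Weather forecast: rain expected tomorrow",
]
print(deduplicate(docs))
```

&lt;p&gt;The hash lookup costs the same no matter how many documents were seen, while the similarity check gets slower with every document that is kept. That is the reason to restrict it to documents in the same cluster.&lt;/p&gt;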
&lt;p&gt;For comparing semantic similarity between texts you can use distributed or contextualized word representations. A vector then represents each text and the (cosine) distance will indicate the similarity.&lt;/p&gt;
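&lt;p&gt;The cosine comparison itself is simple. In the sketch below, plain word counts stand in for the vectors so that the example runs without a model; this is an assumption for illustration only, since word counts capture shared vocabulary and not meaning. With real distributed or contextualized representations you would apply the same formula to the embedding vectors.&lt;/p&gt;

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(vec_a[key] * vec_b[key] for key in vec_a if key in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def bag_of_words(text):
    """Stand-in for an embedding: a word-count vector (hypothetical helper)."""
    return Counter(text.lower().split())

a = bag_of_words("the cat sat on the mat")
b = bag_of_words("the cat lay on the mat")
c = bag_of_words("stock prices fell sharply")
print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```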
&lt;p&gt;Deduplication can benefit a lot from cleaning your text first, for example by lowercasing all text or replacing URLs.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[38 - Readability Scoring]]></title><description><![CDATA[Readability is the quality of the text that was written. If it’s too long and complicated, no one will understand it. Measuring the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/38-readability-scoring/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/38-readability-scoring/</guid><pubDate>Fri, 09 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Readability is the quality of the text that was written. If it’s too long and complicated, no one will understand it. Measuring the readability is about measuring the text quality. This can be done by looking at the keyword density, syllable count and the average length of sentences and words in a document. Also checking for simpler synonyms or words with a higher word prevalence can help. Word prevalence is about word knowledge in the crowd and refers to the number of people who know the word.&lt;/p&gt;
&lt;p&gt;Well-known readability measures are the Flesch-Kincaid Grade Level and the Coleman-Liau Index. These were developed for English. For non-English languages there might be specific variants. However, the &lt;a href=&quot;https://www.researchgate.net/publication/338046963_Linguistic_Proxies_of_Readability_Comparing_Easy-to-Read_and_regular_newspaper_Dutch&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;best&lt;/a&gt; language-agnostic linguistic proxy for readability is (not surprisingly) the average number of words per sentence.&lt;/p&gt;
&lt;p&gt;You can try the English readability metrics with this Python &lt;a href=&quot;https://github.com/cdimascio/py-readability-metrics&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;package&lt;/a&gt;.&lt;/p&gt;
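&lt;p&gt;To give a feeling for what such a metric computes, here is a rough sketch of the Flesch-Kincaid Grade Level formula and of the words-per-sentence proxy. The sentence splitter and the vowel-group syllable counter are crude assumptions, so the scores will differ from those of a dedicated package.&lt;/p&gt;

```python
import re


def count_syllables(word):
    # Rough heuristic (an assumption): one syllable per group of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    # Returns (number of sentences, number of words, number of syllables)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return len(sentences), len(words), sum(count_syllables(w) for w in words)


def words_per_sentence(text):
    # The language-agnostic proxy: average number of words per sentence
    n_sentences, n_words, _ = text_stats(text)
    return n_words / n_sentences if n_sentences else 0.0


def flesch_kincaid_grade(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    n_sentences, n_words, n_syllables = text_stats(text)
    if not n_sentences or not n_words:
        return 0.0
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
```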
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*FCMgRdKyHNzLPwaOZVOLwQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Flesch-Kincaid Grade Levels (&lt;a href=&quot;https://readable.com/blog/the-flesch-reading-ease-and-flesch-kincaid-grade-level/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[37 - Grammar Checker]]></title><description><![CDATA[Grammar Checkers have not been great in the past. Despite all efforts in the pre-deeplearning era, transfer learning is currently the way to…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/37-grammar-checker/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/37-grammar-checker/</guid><pubDate>Thu, 08 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Grammar Checkers have not been great in the past. Despite all efforts in the pre-deep-learning era, transfer learning is currently the way to go. Depending on the implementation, the language models &lt;a href=&quot;https://towardsdatascience.com/checking-grammar-with-bert-and-ulmfit-1f59c718fe75&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;perform&lt;/a&gt; pretty well.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[36 - Paragraph Segmentation]]></title><description><![CDATA[Detecting Paragraphs is somehow less mainstream. Mostly there is some custom logic like: split after two line-ends, or split before…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/36-paragraph-segmentation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/36-paragraph-segmentation/</guid><pubDate>Wed, 07 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Detecting paragraphs is somewhat less mainstream. Mostly there is some custom logic, like splitting after two line endings or splitting before an uppercased title. Maybe there is some layout metadata, or a specific paragraph and chapter numbering, that could help.&lt;/p&gt;
&lt;p&gt;Mostly, there just is no default way of determining the paragraph boundary and people tend to work with sentences. Still, a paragraph can be a more valuable unit than a sentence. Examples are coreference chains that span multiple sentences, questions whose answer is spread over a whole paragraph, and readers who understand a paragraph better than an isolated sentence. The signal from a writer is often best expressed in a paragraph.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[35 - Sentencizer]]></title><description><![CDATA[Once you tokenized your textual data, a sentencizer should find the words that together form a sentence.  Starting with a titlecased word…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/35-sentencizer/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/35-sentencizer/</guid><pubDate>Tue, 06 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Once you have tokenized your textual data, a sentencizer should find the words that together form a sentence.&lt;/p&gt;
&lt;p&gt;Start with a titlecased word, follow it with lowercase words and stop at a dot: that might be the simplest (and error-prone) version of rule-based sentence boundary detection (SBD) logic.&lt;/p&gt;
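&lt;p&gt;A small sketch of such a naive rule, written with Python's re module. The pattern and the function name are assumptions for illustration. It splits after sentence-final punctuation that is followed by whitespace and an uppercase letter, which also shows why the rule is erroneous: an abbreviation like ‘Dr.’ is treated as a sentence.&lt;/p&gt;

```python
import re

# Sentence-final punctuation, followed by whitespace and an uppercase letter
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")


def naive_sentences(text):
    # Mark every assumed boundary with a newline, then split on the marker
    marked = BOUNDARY.sub(r"\1\n", text)
    return [part.strip() for part in marked.split("\n") if part.strip()]
```

&lt;p&gt;Calling naive_sentences on ‘Dr. Smith arrived. He sat down.’ yields three parts instead of two, because the dot after ‘Dr’ looks like a boundary.&lt;/p&gt;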
&lt;p&gt;More sophisticated SBD is, for example, done by the &lt;a href=&quot;https://spacy.io/usage/linguistic-features#sbd&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;spaCy&lt;/a&gt; library. The sentence segmentation is performed by the Dependency Parser, which predicts the sentence boundary by the dependency tags.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NLTK&lt;/a&gt; you can train an unsupervised sentence tokenizer on your own text. It builds a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[34 - Text Anonymizer]]></title><description><![CDATA[Text Anonymizing is the task of removing sensitive information before a document is shared with others. Deidentification and obfuscation is…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/34-text-anonymizer/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/34-text-anonymizer/</guid><pubDate>Mon, 05 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Text Anonymizing is the task of removing sensitive information before a document is shared with others. Deidentification and obfuscation is often done for strings that identify persons (name, social number, email, etc.) and organizations or for other sensitive details from crime records and patient dossiers.&lt;/p&gt;
&lt;p&gt;A simple solution is to do Named Entity Recognition and replace the found mention with a tag. If it is Anonymization you replace the mention with nothing (the black marker). For Pseudonymization you replace the mention with a unique tag.&lt;/p&gt;
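&lt;p&gt;The difference between the two can be sketched as follows. A regular expression for email addresses stands in for a real Named Entity Recognition model here. That stand-in, the tag format and the function names are assumptions for illustration.&lt;/p&gt;

```python
import re

# Stand-in for an NER model: a simple (incomplete) email pattern
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text, pattern=EMAIL, mask="[REDACTED]"):
    # Anonymization: every mention gets the same mask (the black marker)
    return pattern.sub(mask, text)


def pseudonymize(text, pattern=EMAIL, label="EMAIL"):
    # Pseudonymization: every distinct mention gets its own unique tag
    mapping = {}

    def replace(match):
        mention = match.group(0)
        if mention not in mapping:
            mapping[mention] = "[{}_{}]".format(label, len(mapping) + 1)
        return mapping[mention]

    return pattern.sub(replace, text), mapping
```

&lt;p&gt;The returned mapping makes the pseudonymization reversible, so it is as sensitive as the original text and should be stored separately.&lt;/p&gt;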
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*gBvA0q_bf4hi-C1P4feczA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Deidentification levels (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Text Anonymization has its use cases within governments and organizations that want to avoid being too transparent and have to deal with legal frameworks such as the General Data Protection Regulation (GDPR).&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[33 - Coreference Resolution]]></title><description><![CDATA[Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It has much to do with Named Entity…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/33-coreference-resolution/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/33-coreference-resolution/</guid><pubDate>Sun, 04 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It has much to do with Named Entity Linking, but it doesn’t necessarily use a knowledge base.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*srL_zh6JQjEx-FdC5tTE9A.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;The nomenclature of Coreference Resolution (&lt;a href=&quot;https://github.com/shayneobrien/coreference-resolution&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;You can try the coreference demos from &lt;a href=&quot;https://demo.allennlp.org/coreference-resolution&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;AllenNLP&lt;/a&gt; or &lt;a href=&quot;https://huggingface.co/coref/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Huggingface&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/2560/1*YeedtLnoAuTM1WQJGC5NvQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;It doesn’t know who’s in the band! Coreference Resolution by Huggingface (&lt;a href=&quot;https://huggingface.co/coref/?text=Me%20and%20some%20guys%20from%20school%20had%20a%20band%20and%20we%20tried%20real%20hard.%20Jimmy%20quit%2C%20Jody%20got%20married%2C%20I%20should%27ve%20known%20we%27d%20never%20get%20far.&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[32 - Named Entity Linking]]></title><description><![CDATA[Named Entity Linking (NEL) is the task of assigning a unique identity from a knowledge base to an entity. For example, the entities ‘Obama…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/32-named-entity-linking/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/32-named-entity-linking/</guid><pubDate>Sat, 03 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Named Entity Linking (NEL) is the task of assigning a unique identity from a knowledge base to an entity. For example, the entities ‘Obama’ and ‘The President’ from document X might refer to the same person, but “Obama” from document X and “Obama” from document Y might not refer to the same person.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*z6YQLlEqL01PtNJ7RdedYA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Named Entity Linking steps (&lt;a href=&quot;https://en.wikipedia.org/wiki/Entity_linking#Approaches_to_entity_linking&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;NEL can be done by training an &lt;a href=&quot;https://spacy.io/usage/training#entity-linker&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Entity Linker&lt;/a&gt; with spaCy, or by using the knowledge base &lt;a href=&quot;https://wiki.dbpedia.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;DBpedia&lt;/a&gt; and the service &lt;a href=&quot;https://www.dbpedia-spotlight.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;dbpedia-spotlight.org&lt;/a&gt;.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[31 - Temporal Parser]]></title><description><![CDATA[Once you found some string that contains an indication of time, you still have to extract a normalized time format out of it. Otherwise you…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/31-temporal-parser/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/31-temporal-parser/</guid><pubDate>Fri, 02 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Once you have found a string that contains an indication of time, you still have to extract a normalized time format out of it. Otherwise you cannot calculate time differences between events or put results in a timeline.&lt;/p&gt;
&lt;p&gt;The numerous time zones and local formats make this challenging. Relative notations like ‘tomorrow’ should also be normalized, so you have to declare a reference date that functions as the ‘now’ in relation to the ‘tomorrow’. Another point is durations like ‘the summer of 1969’: when does a summer begin and end?&lt;/p&gt;
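&lt;p&gt;A minimal sketch of both points with Python's datetime module. The lookup table, the function names and the choice for the meteorological summer of the northern hemisphere (1 June to 31 August) are assumptions. A real temporal parser has to make such choices explicit and configurable.&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical lookup table of relative expressions, in days from the reference date
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_relative(expression, reference):
    # The reference date functions as the 'now' for the relative expression
    offset = RELATIVE_OFFSETS[expression.strip().lower()]
    return reference + timedelta(days=offset)


def summer_of(year):
    # Assumption: meteorological summer, northern hemisphere
    return date(year, 6, 1), date(year, 8, 31)
```

&lt;p&gt;With 2 October 2020 as the reference date, ‘tomorrow’ resolves to 3 October 2020. With another reference date the same string resolves to another day.&lt;/p&gt;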
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*d5abWtQwz7lBhcpB7jAVeQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Example of Temporal Parsing (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Some of the best temporal parsers are Scrapinghub’s Python &lt;a href=&quot;https://github.com/scrapinghub/dateparser&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;dateparser&lt;/a&gt; and Facebook’s &lt;a href=&quot;https://github.com/facebook/duckling&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Duckling&lt;/a&gt; (in Haskell).&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[30 - Geocoding]]></title><description><![CDATA[Geocoding is the task of parsing text into an address and converting an addresses into geographic coordinates like latitude and longitude…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/30-geocoding/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/30-geocoding/</guid><pubDate>Thu, 01 Oct 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Geocoding is the task of parsing text into an address and converting that address into geographic coordinates like latitude and longitude. The goal is to plot an item on a map. In a broader sense, this can also be done for organization names (find the headquarters), facilities and landmarks (convert ‘Eiffel Tower’ into &lt;a href=&quot;https://www.google.com/maps/place/48%C2%B051&amp;#x27;29.6%22N+2%C2%B017&amp;#x27;40.2%22E&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;48°51′29.6″N 2°17′40.2″E&lt;/a&gt;). The opposite is reverse geocoding: from coordinates to an address.&lt;/p&gt;
&lt;p&gt;There are a lot of paid APIs for geocoding. &lt;a href=&quot;https://geopy.readthedocs.io/en/stable/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Geopy&lt;/a&gt; is a Python client for several popular geocoding web services. If you want to do this yourself, you’ll need a reference database, for example the open data from &lt;a href=&quot;https://openaddresses.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;openaddresses.io&lt;/a&gt; or &lt;a href=&quot;https://www.geonames.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;geonames.org&lt;/a&gt; or a local initiative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*zH6EsRrPlDY1jPWfosnyAA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Geocoding (&lt;a href=&quot;https://www.giscloud.com/apps/geocoder&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[29 - Price Parser]]></title><description><![CDATA[A Price Parser extracts prices and currency from raw text. Applications can be in Price Management and Competitive Pricing, often combined…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/29-price-parser/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/29-price-parser/</guid><pubDate>Wed, 30 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Price Parser extracts prices and currency from raw text. Applications can be in Price Management and Competitive Pricing, often combined with web scraping. Another usage example is from the Panama Papers: extract all monetary amounts from documents and link each amount to a person, organization, date or bank account.&lt;/p&gt;
&lt;p&gt;A good price parser normalizes all prices into a standard format. It should also recognize all currencies. It can be difficult to recognize currency names and abbreviations such as Euro, EUR, US Dollar, dollar (which one?) and USD. The currency symbols are easier to find, because they have their own Unicode category. Still, this doesn’t guarantee completeness.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;import unicodedata

def is_currency_symbol(char):
    # &quot;Sc&quot; is the Unicode general category &quot;Symbol, currency&quot;
    return unicodedata.category(char) == &quot;Sc&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A good Python package is &lt;a href=&quot;https://github.com/scrapinghub/price-parser&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;price-parser&lt;/a&gt;. Another library for finding amounts of money is Facebook’s &lt;a href=&quot;https://github.com/facebook/duckling&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Duckling&lt;/a&gt; (in Haskell).&lt;/p&gt;
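&lt;p&gt;To illustrate what normalizing into a standard format involves, here is a rough standard-library sketch. It is not how the packages above work. The function name and the decimal-mark rule (a final separator followed by one or two digits is the decimal mark) are assumptions that will fail on some local formats.&lt;/p&gt;

```python
import re
import unicodedata
from decimal import Decimal


def parse_price(text):
    """Return (amount, currency symbol) for the first number in the text, or None."""
    # Currency symbols share the Unicode category "Sc"
    symbols = [ch for ch in text if unicodedata.category(ch) == "Sc"]
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    raw = match.group(0).rstrip(".,")
    # Assumption: a final separator followed by 1-2 digits is the decimal mark,
    # every other separator is a thousands separator
    decimal_match = re.search(r"[.,](\d{1,2})$", raw)
    if decimal_match:
        integer_part = re.sub(r"[.,]", "", raw[:decimal_match.start()])
        amount = Decimal(integer_part + "." + decimal_match.group(1))
    else:
        amount = Decimal(re.sub(r"[.,]", "", raw))
    return amount, (symbols[0] if symbols else None)
```

&lt;p&gt;Currency names and abbreviations like ‘EUR’ or ‘dollar’ are not handled here. They need a lookup table and, for ambiguous names, context.&lt;/p&gt;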
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[28 - Abbreviation Finder]]></title><description><![CDATA[An abbreviation is a way to write a word not using the complete spelling. Abbreviations are an efficient way to write text, but it lowers…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/28-abbreviation-finder/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/28-abbreviation-finder/</guid><pubDate>Tue, 29 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An abbreviation is a way to write a word without using its complete spelling. Abbreviations are an efficient way to write text, but they lower text comprehension and increase ambiguity. Acronyms are a specialized form of abbreviations, usually using the first letters of each word in a multi-word phrase.&lt;/p&gt;
&lt;p&gt;Identifying abbreviations often depends on detecting the simple template &lt;a href=&quot;https://arxiv.org/ftp/arxiv/papers/1309/1309.6185.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;em&gt;Long-Form (Short-Form)&lt;/em&gt;&lt;/a&gt;. After identification a consensus view should be reached on all ambiguous long forms. Distinctions should be made between different long forms with the same semantic meaning versus long forms of different concepts.&lt;/p&gt;
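&lt;p&gt;A minimal sketch of detecting that template, restricted to acronyms whose letters are the initials of the preceding words. The pattern and the function name are assumptions for illustration. Reaching a consensus view over many documents is a separate step.&lt;/p&gt;

```python
import re

# Short form: two or more uppercase letters between parentheses
SHORT_FORM = re.compile(r"\(([A-Z]{2,})\)")


def find_abbreviations(text):
    """Map each short form to its long form for the 'Long-Form (Short-Form)' template."""
    pairs = {}
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        # Take as many preceding words as the short form has letters
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short:
            pairs[short] = " ".join(candidate)
    return pairs
```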
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*esV6MtrttSCVUbSQYwIFFA.gif&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Reaching a Consensus view on Dutch Healthcare Acronyms; choosing the most frequent writing for an acronym (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[27 - Named Entity Recognition]]></title><description><![CDATA[Named Entity Recognition (NER) is the task of identifying named entities and their class. A lot of NER models are based upon Wikipedia…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/27-named-entity-recognition/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/27-named-entity-recognition/</guid><pubDate>Mon, 28 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Named Entity Recognition (NER) is the task of identifying named entities and their class. A lot of NER models are based upon &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S0004370212000276&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Wikipedia-generated&lt;/a&gt; training data for three NER categories: persons, locations and organizations. This is a rather simple NER annotation scheme, but it is available for multiple languages. The &lt;a href=&quot;https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;OntoNotes&lt;/a&gt; training data provides a more fine-grained annotation scheme, like the one in the table below. It all depends on how schemas are defined and how training data is created. Domain-specific text requires custom models, for example for &lt;a href=&quot;https://github.com/ICLRandD/Blackstone#applying-the-ner-model&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Legal&lt;/a&gt; and &lt;a href=&quot;https://allenai.github.io/scispacy/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Biomedical&lt;/a&gt; text. You can also get inspiration from &lt;a href=&quot;https://schema.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;schema.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*hPvw322zk_nHuXiBw9k_tQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Examples of Named Entity types and Value types (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1200/1*oVSQqXJWwUXjvtoyKopzdw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;displaCy Named Entity Visualizer demo (&lt;a href=&quot;https://explosion.ai/demos/displacy-ent&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[26 - Dependency Nounchunks]]></title><description><![CDATA[A form of Constituency Parsing is breaking a text into sub-phrases. Partial parsing is known as chunking and has noun-phrases (nounchunks…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/26-dependency-nounchunks/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/26-dependency-nounchunks/</guid><pubDate>Sun, 27 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A form of Constituency Parsing is breaking a text into sub-phrases. Partial parsing is known as chunking and has noun-phrases (nounchunks), verb-phrases, adjective phrases or prepositional phrases as a result. You can compare this with the more default N-grams.&lt;/p&gt;
&lt;p&gt;You can test spaCy for its &lt;a href=&quot;https://spacy.io/usage/linguistic-features#noun-chunks&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;nounchunker&lt;/a&gt;. The chunks are based on dependency tags.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*YKZjlbLhb2tU5njNWtxQtA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Nounchunks from English spaCy model. It would be better to have ‘the summer of 1969’ as chunk (&lt;a href=&quot;https://spacy.io/usage/linguistic-features#noun-chunks&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[25 - Rulebased Phrasematcher]]></title><description><![CDATA[A lot of text analytic analysis start with searching a list of words or phrases. If you have a problem of finite size where a lookup table…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/25-rulebased-phrasematcher/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/25-rulebased-phrasematcher/</guid><pubDate>Sat, 26 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Many text analytics projects start with searching for a list of words or phrases. If you have a problem of finite size where a lookup table (gazetteer) is sufficient, then a matcher might be the solution. This task has evolved over time. It is still rule-based, but smarter. Missed search hits are prevented by matching on tokens instead of raw text.&lt;/p&gt;
&lt;p&gt;For example, double spaces and tabs between words will not influence the search results. Searches can also be defined on a more semantic level. For example, you can search for all forms of a verb just by defining the lemma (base form) of the verb. Or search for the lemma of a noun and you will get all the singular and plural forms of this noun.&lt;/p&gt;
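&lt;p&gt;The whitespace point can be sketched in a few lines of plain Python. The whitespace tokenizer and the function names are assumptions for illustration. A real matcher would use a proper tokenizer and could compare lemmas instead of lowercased tokens.&lt;/p&gt;

```python
def tokenize(text):
    # Simplistic tokenizer: lowercase and split on any run of whitespace
    return text.lower().split()


def find_phrases(text, phrases):
    """Return (phrase, token index) for every phrase found in the tokenized text."""
    tokens = tokenize(text)
    hits = []
    for phrase in phrases:
        target = tokenize(phrase)
        for start in range(len(tokens) - len(target) + 1):
            if tokens[start:start + len(target)] == target:
                hits.append((phrase, start))
    return hits
```

&lt;p&gt;A raw substring search for ‘red wine’ misses a mention with a double space or a tab between the two words, while the token comparison still finds it.&lt;/p&gt;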
&lt;p&gt;Performance is also important when searching for a large list of words in a large corpus. &lt;a href=&quot;https://github.com/vi3k6i5/flashtext&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Flashtext&lt;/a&gt; is a Python library whose algorithm can be much faster than regex searches, especially for long keyword lists.&lt;/p&gt;
&lt;p&gt;spaCy has a &lt;a href=&quot;https://explosion.ai/demos/matcher&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Token Matcher&lt;/a&gt; for detailed searches on (semantic) &lt;a href=&quot;https://spacy.io/usage/rule-based-matching#adding-patterns-attributes&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;properties&lt;/a&gt; of (multiple) tokens and a &lt;a href=&quot;https://spacy.io/usage/rule-based-matching#phrasematcher&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Phrase Matcher&lt;/a&gt; for very long lists of (multiple) words. &lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*ieEHMKAFdESk7HBZV7AbHQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Rule-based Token Matcher demo (&lt;a href=&quot;https://explosion.ai/demos/matcher&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[24 - N-grams]]></title><description><![CDATA[An N-gram is a sequence of N words, with a high probability of occurrence. It can also be named a collocation, a multi-word expression or a…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/24-n-grams/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/24-n-grams/</guid><pubDate>Fri, 25 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An N-gram is a sequence of N words, with a high probability of occurrence. It can also be named a collocation, a multi-word expression or a common phrase. A Bi-gram example is ‘red wine’ and a Tri-gram is ‘summer of 69’. The probability is calculated by: (the number of times the previous word occurs before this word) / (the total number of times the previous word occurs in the corpus)&lt;/p&gt;
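&lt;p&gt;The probability above can be computed with two counters. This is a small sketch in which the function name and the pre-tokenized input are assumptions.&lt;/p&gt;

```python
from collections import Counter


def bigram_probabilities(tokens):
    """P(word | previous word) = count(previous, word) / count(previous)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / unigram_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }
```

&lt;p&gt;In the token list of ‘red wine and red roses and red wine’ the word ‘red’ occurs three times and is followed by ‘wine’ twice, so the probability of the Bi-gram ‘red wine’ is 2/3.&lt;/p&gt;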
&lt;p&gt;You can detect N-grams with Gensim’s &lt;a href=&quot;https://radimrehurek.com/gensim/models/phrases.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Phrases-model&lt;/a&gt; or with the &lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;CountVectorizer&lt;/a&gt; from Scikit learn.&lt;/p&gt;
&lt;p&gt;N-grams are sometimes used for next-word prediction. This is a simple but computationally costly solution. A cheaper variant is character N-grams, but it’s better to use LSTMs.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[23 - Negation Recognizer]]></title><description><![CDATA[A negation is the denial of something and is very important to include in your analysis. When you count the word ‘talent’ in your document…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/23-negation-recognizer/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/23-negation-recognizer/</guid><pubDate>Thu, 24 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A negation is the denial of something and is very important to include in your analysis. When you count the word ‘talent’ in your document, you should also find the preceding word ‘without’. Ignoring negation will flip the polarity of your analysis incorrectly. &lt;/p&gt;
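&lt;p&gt;A deliberately simple sketch of the ‘without talent’ case: count a word separately for affirmed and negated mentions by looking for a negation cue in a small window before it. The cue list, the window size and the function names are assumptions. The window ignores sentence boundaries, which a real negation recognizer should not do.&lt;/p&gt;

```python
# Hypothetical and far from complete list of negation cues
NEGATION_CUES = {"no", "not", "without", "never"}


def is_negated(tokens, index, window=2):
    # Look for a negation cue in the few tokens preceding the target word
    start = max(0, index - window)
    return any(token.lower() in NEGATION_CUES for token in tokens[start:index])


def count_mentions(text, target):
    """Return (affirmed, negated) counts for the target word."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    affirmed = negated = 0
    for index, token in enumerate(tokens):
        if token == target:
            if is_negated(tokens, index):
                negated += 1
            else:
                affirmed += 1
    return affirmed, negated
```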
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*mnZbFKFBP0sxufoent1HNA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Negation Recognizer example (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;spaCy has a pipeline &lt;a href=&quot;https://github.com/jenojp/negspacy&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;plugin to detect negations&lt;/a&gt;. There is also &lt;a href=&quot;https://github.com/adityak6798/Transformers-For-Negation-and-Speculation&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;training code&lt;/a&gt; for NegBERT, but there is no ready-to-go model.&lt;/p&gt;
&lt;p&gt;The broader task behind Negation Recognition is finding the level of Certainty of an assertion, which ranges from completely Affirmed, through a Speculative form, to Negated. The cue words before or after a word determine the polarity of the assertion.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*OpNGEHlcGzN1BwhbguBr9A.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Examples of different levels of certainty (&lt;a href=&quot;https://books.google.nl/books?id=Z7taDwAAQBAJ&amp;pg=PA27&amp;lpg=PA27&amp;dq=Assertions+are+prepositions+that+have+some+sort+of+positive+or+negative+polarity.&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[22 - Spell Checker]]></title><description><![CDATA[Spell Checkers can recommend corrections on three levels: subword level, word level and sentence level. Spell Checkers evolved from rule…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/22-spell-checker/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/22-spell-checker/</guid><pubDate>Wed, 23 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Spell Checkers can recommend corrections on three levels: subword level, word level and sentence level. Spell Checkers evolved from rule-based to deep learning models. Spark NLP has a &lt;a href=&quot;https://nlp.johnsnowlabs.com/docs/en/annotators#context-spellchecker&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;contextual spell checker&lt;/a&gt; model and spaCy has a &lt;a href=&quot;https://github.com/R1j1t/contextualSpellCheck&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;contextual spell checker&lt;/a&gt; and a &lt;a href=&quot;https://spacy.io/universe/project/spacy_hunspell&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;spellchecker&lt;/a&gt; based on &lt;a href=&quot;http://hunspell.github.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Hunspell&lt;/a&gt;, which is the spell checker of Chrome, Firefox and OpenOffice.&lt;/p&gt;
&lt;p&gt;Words that should not be corrected can sometimes be defined as exceptions. This can be done with exception classes, which might overlap heavily with Named Entities. For example, the string ‘Aug-69’ belongs to the class Date and should not be corrected.&lt;/p&gt;
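&lt;p&gt;A minimal word-level sketch of this idea, using only Python’s standard library: words that match an exception class are returned untouched, and everything else is replaced by its closest vocabulary entry. The vocabulary, the date pattern and the similarity cutoff are all assumptions chosen for this example; none of the tools mentioned above works exactly like this.&lt;/p&gt;

```python
import difflib
import re

VOCABULARY = ["summer", "guitar", "played", "august"]  # toy word list
# Hypothetical exception class: dates written like 'Aug-69'.
DATE_PATTERN = re.compile(r"^[A-Z][a-z]{2}-\d{2}$")

def correct(word):
    """Return the closest vocabulary word, unless the word is an exception."""
    if DATE_PATTERN.match(word):
        return word  # exception class Date: leave as-is
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word

print([correct(w) for w in ["gitar", "sumer", "Aug-69"]])
```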
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*wmi321XUkX-VRzHIs_nMuQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Word Spell Checker example (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Spelling correction (but also Lemmatization, Stemming and Normalization) helps you decrease the number of unique tokens in your vocabulary, which improves performance in NLP tasks, especially when you have noisy textual data like tweets.&lt;/p&gt;
&lt;p&gt;An interesting application of Spell Checking is in OCR. The quality of text extracted with OCR can be improved by checking whether there are low-probability, out-of-context or out-of-vocabulary words that need to be corrected.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[21 - Normalization]]></title><description><![CDATA[Besides Stemming or Lemmatizing, there still might be a need to edit words to move to more default words.  For example, transform word…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/21-normalization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/21-normalization/</guid><pubDate>Tue, 22 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Besides Stemming or Lemmatizing, there might still be a need to rewrite words into a more standard form. &lt;/p&gt;
&lt;p&gt;For example: transform word numerals into numbers, handle emoji, expand contractions (I’m → I am), collapse repetitions (Yeaaaaaahhhh → Yeah), or neutralize gendered words to prevent gender bias in your model (map he, his, she, her, etc. to a default form). &lt;/p&gt;
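&lt;p&gt;Two of these steps, expanding contractions and collapsing repetitions, can be sketched with a lookup table and one regular expression. The contraction table is a tiny hand-made assumption, and lowercasing is an extra step added here to keep the lookup simple. Note that the repetition rule is crude: it also shortens legitimate words that are stretched in the text, so ‘cooool’ becomes ‘col’.&lt;/p&gt;

```python
import re

# Tiny, hand-made contraction table; a real one would be much larger.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}

def normalize(text):
    # Curly apostrophe to straight apostrophe, then lowercase for the lookup.
    text = text.replace("\u2019", "'").lower()
    words = [CONTRACTIONS.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Collapse any character repeated three or more times into a single one.
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(normalize("Yeaaaaaahhhh I\u2019m happy"))  # yeah i am happy
```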
&lt;p&gt;You can also normalize spèçíâl characters with Python’s &lt;a href=&quot;https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;unicodedata&lt;/a&gt;. Decomposing the text and dropping the combining marks has, for example, the effect that accents are removed. Curly quotes are not changed by Unicode normalization itself, so converting them to their ASCII equivalent takes an explicit character mapping. That conversion is an advantage for simplicity, although you lose the directionality of the quote.&lt;/p&gt;
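&lt;p&gt;A small sketch of accent and quote normalization with the standard library. The text is decomposed with &lt;code&gt;unicodedata.normalize&lt;/code&gt;, combining marks are dropped, and curly quotes are mapped through an explicit translation table, because normalization alone leaves them as they are. The function name is made up for this example.&lt;/p&gt;

```python
import unicodedata

# Explicit mapping for curly quotes; Unicode normalization does not touch them.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def to_ascii_like(text):
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(QUOTES)

print(to_ascii_like("sp\u00e8\u00e7\u00ed\u00e2l \u201cred wine\u201d"))  # special "red wine"
```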
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*EuBQDeKXIe8B4bx14YF9HQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Word Normalization example (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[20 - Lemmatization]]></title><description><![CDATA[Lemmatization usually refers to rewriting a word to its base form properly, with the use of a vocabulary and morphological analysis of words…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/20-lemmatization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/20-lemmatization/</guid><pubDate>Mon, 21 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Lemmatization usually refers to rewriting a word to its base form properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*_TkWyJTJkq-WrvIkQrFrAA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Word Lemmatization example (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[19 - Stemming]]></title><description><![CDATA[Stemming refers to a crude heuristic process that chops off the ends of words in the hope that words with the same meaning become words with…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/19-stemming/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/19-stemming/</guid><pubDate>Sun, 20 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Stemming refers to a crude heuristic process that chops off the ends of words in the hope that words with the same meaning end up with the same form, so that fewer unique words remain. This often includes the removal of derivational affixes.&lt;/p&gt;
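&lt;p&gt;To show how crude such a heuristic can be, here is a toy suffix stripper. It is not the Porter stemmer or any other published algorithm: the suffix list and the minimum stem length are assumptions picked for this example.&lt;/p&gt;

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, deliberately tiny suffix list

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["walking", "walked", "walks", "wine"]])
# ['walk', 'walk', 'walk', 'wine']
```

&lt;p&gt;The same rules also produce non-words, for example ‘wedding’ becomes ‘wedd’, which is exactly the kind of error a Lemmatizer avoids by using a vocabulary.&lt;/p&gt;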
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*lyePMa_7Y1LJ9Y_IkH1avA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Word Stemming example (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[18 - Dependency Parser]]></title><description><![CDATA[A Dependency Parser extracts a dependency graph from a sentence. In the graph the grammatical structure is represented and relationships are…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/18-dependency-parser/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/18-dependency-parser/</guid><pubDate>Sat, 19 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Dependency Parser extracts a dependency graph from a sentence. The graph represents the grammatical structure and defines relationships between head words and the words that modify these heads. For example, most sentences have a root word, which is the main verb, a subject and an object. All tags are defined on the &lt;a href=&quot;https://universaldependencies.org/u/dep/index.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Universal Dependency Relations&lt;/a&gt; website. You can also play with the &lt;a href=&quot;https://explosion.ai/demos/displacy&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;displaCy Dependency Visualizer demo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1200/1*_NDL8qhd3EMEkG6Ejx93Dg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;{ &lt;b&gt;subject&lt;/b&gt; } had { &lt;b&gt;object&lt;/b&gt; } example with displaCy Dependency Visualizer (&lt;a href=&quot;https://explosion.ai/demos/displacy?text=Me%20and%20some%20guys%20from%20school%20had%20a%20band.&amp;model=en_core_web_sm&amp;cpu=1&amp;cph=1&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Dependencies are also great for Information Extraction, because there often is an object-subject relation. spaCy is very useful for navigating the dependency parse tree, as you can see in this &lt;a href=&quot;https://spacy.io/usage/examples#entity-relations&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;example&lt;/a&gt;.&lt;/p&gt;
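&lt;p&gt;The sketch below shows what extracting a subject-verb-object relation from a dependency parse can look like. To stay self-contained it does not call a parser: the parse is written by hand as (token, label, head) triples, and the labels &lt;code&gt;nsubj&lt;/code&gt;, &lt;code&gt;dobj&lt;/code&gt; and &lt;code&gt;ROOT&lt;/code&gt; are assumed to follow the convention of spaCy’s English models (the Universal Dependencies scheme names the object &lt;code&gt;obj&lt;/code&gt;).&lt;/p&gt;

```python
# Hand-written parse of "Some guys had a band", NOT the output of a real parser.
# Each entry is (token, dependency label, head token).
PARSE = [
    ("Some", "det", "guys"),
    ("guys", "nsubj", "had"),
    ("had", "ROOT", "had"),
    ("a", "det", "band"),
    ("band", "dobj", "had"),
]

def subject_verb_object(parse):
    """Find the root verb and the subject and object attached to it."""
    root = next(tok for tok, dep, head in parse if dep == "ROOT")
    subject = next((tok for tok, dep, head in parse if dep == "nsubj" and head == root), None)
    obj = next((tok for tok, dep, head in parse if dep == "dobj" and head == root), None)
    return subject, root, obj

print(subject_verb_object(PARSE))  # ('guys', 'had', 'band')
```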
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[17 - Part-of-Speech Tagger]]></title><description><![CDATA[The Part-of-Speech (POS) tagger marks words with a POS-tag, based on its context and definition. So, for example, the word ‘answer’ is…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/17-part-of-speech-tagger/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/17-part-of-speech-tagger/</guid><pubDate>Fri, 18 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Part-of-Speech (POS) tagger marks words with a POS-tag, based on its context and definition. So, for example, the word ‘answer’ is tagged as a Noun or as a Verb, depending on its context.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*NKmFKvzDdSGF0vNutABWjQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Part-of-Speech tags (&lt;b&gt;S&lt;/b&gt;: sentence, &lt;b&gt;NP&lt;/b&gt;: noun phrase) (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;There are many variants of POS-tag schemes and their abbreviations. In some schemes (e.g. in the Penn Treebank below) the POS tag includes some morphological information. This (partly) depends on the morphological richness of a language. There is also a &lt;a href=&quot;https://universaldependencies.org/u/pos/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Universal Scheme for POS tags&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*JMU_Rps2ys6MmAo2UHWaCg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Penn Treebank for POS-tags (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;POS-tags are (or were) useful for building lemmatizers and NER systems, but also for information retrieval with rule-based token patterns. For example, see this &lt;a href=&quot;https://explosion.ai/demos/matcher&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;spaCy demo&lt;/a&gt; for rule-based search for token patterns.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[16 - Morphological Tagger]]></title><description><![CDATA[The task of the Morphological Tagger is to assign additional morphological information to each token. This can be the gender, case, person…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/16-morphological-tagger/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/16-morphological-tagger/</guid><pubDate>Thu, 17 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The task of the Morphological Tagger is to assign additional morphological information to each token. This can be the gender, case, person, tense, etc. &lt;a href=&quot;https://unimorph.github.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;UniMorph&lt;/a&gt; and &lt;a href=&quot;https://universaldependencies.org/u/feat/index.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Universal Dependencies&lt;/a&gt; are projects that describe all morphological features in a universal schema.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*L1d5A5zCvAgS9YGSR1-diA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;&lt;b&gt;FIN&lt;/b&gt; indicates a finite verb. &lt;b&gt;IND&lt;/b&gt; indicates the indicative mood. &lt;b&gt;PFV&lt;/b&gt; indicates the perfective aspect. &lt;b&gt;PST&lt;/b&gt; indicates the past tense. &lt;b&gt;2&lt;/b&gt; indicates the second person. &lt;b&gt;SG&lt;/b&gt; indicates the singular number. &lt;b&gt;INFM&lt;/b&gt; indicates the informal register. (&lt;a href=&quot;https://unimorph.github.io/schema/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The Spanish word ‘hablaste’ can be represented as the lemma ‘hablar’ plus the bundle [FIN;IND;PFV;PST;2;SG;INFM]. This bundle describes all grammatical features for the word. The [V] stands for Verb and is the Part-of-Speech.&lt;/p&gt;
&lt;p&gt;Some languages (like German and Arabic) mark a lot of information through morphology, giving them a rich morphology. Other languages (like Mandarin) have almost no morphology, so whatever needs to be encoded gets encoded through syntax. Syntax means: by adding more words and by changing word order.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[15 - Vocabulary Building]]></title><description><![CDATA[The goal of tokenization is a vocabulary. Some extra properties might be interesting to include in the vocabulary. Count the word types per…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/15-vocabulary-building/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/15-vocabulary-building/</guid><pubDate>Wed, 16 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The goal of tokenization is a vocabulary. Some extra properties might be interesting to include in the vocabulary. Count the word types per document and for the entire corpus for probability measures (like TF-IDF) to reflect how important the word type is.&lt;/p&gt;
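&lt;p&gt;As an illustration of such counts, the sketch below computes a basic TF-IDF score per word type from per-document counts and corpus-wide document frequencies. There are many TF-IDF variants; this one uses an unsmoothed logarithm, so a word type that occurs in every document scores zero. The function name and the toy documents are made up for this example.&lt;/p&gt;

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word type: score} dict per tokenized document."""
    # Document frequency: in how many documents does each word type occur?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [["red", "wine"], ["red", "beans"], ["summer", "wine"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

&lt;p&gt;In the last document ‘summer’ scores higher than ‘wine’, because ‘wine’ also occurs in another document.&lt;/p&gt;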
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[14 - Tokenization]]></title><description><![CDATA[Tokenization is the task of splitting raw text into smaller fundamental units; word tokens. This task is required for almost any NLP task…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/14-tokenization/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/14-tokenization/</guid><pubDate>Tue, 15 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Tokenization is the task of splitting raw text into smaller fundamental units; word tokens. This task is required for almost any NLP task. The goal is to build a vocabulary of word types (in spaCy: &lt;a href=&quot;https://spacy.io/api/lexeme&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;lexemes&lt;/a&gt;). A word type is a distinct word, in the abstract, rather than a specific instance. Word types are word tokens with no context. A word token is a word (string) observed in a piece of text.&lt;/p&gt;
&lt;p&gt;How large your vocabulary should be is a trade-off between memory limitations and coverage of word tokens. Each word token is converted to a unique id per word type. This is done for performance reasons, because many NLP tasks rely on intensive (matrix) calculations.&lt;/p&gt;
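&lt;p&gt;A minimal sketch of that conversion: the most frequent word types get an id, and every token outside the vocabulary falls back to a reserved unknown token. The function names, the size limit and the choice of id 0 for the unknown token are assumptions for this example.&lt;/p&gt;

```python
from collections import Counter

UNK = "-UNK-"  # placeholder for every token outside the vocabulary

def build_vocabulary(tokens, max_size):
    """Map the most frequent word types to ids; id 0 is reserved for unknowns."""
    most_common = [word for word, _ in Counter(tokens).most_common(max_size - 1)]
    return {word: idx for idx, word in enumerate([UNK] + most_common)}

def encode(tokens, vocabulary):
    """Convert tokens to ids, falling back to the id of the unknown token."""
    return [vocabulary.get(token, vocabulary[UNK]) for token in tokens]

tokens = "the band played the song the crowd loved".split()
vocab = build_vocabulary(tokens, max_size=3)
print(vocab)
print(encode(["the", "band", "sang"], vocab))
```

&lt;p&gt;With a limit of three entries only ‘the’ and ‘band’ fit next to the unknown token, so ‘sang’ is encoded as unknown.&lt;/p&gt;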
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*oOdNPzecjBzgbsFeBnVh0Q.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;spaCy’s rule-based Tokenizer (&lt;a href=&quot;https://spacy.io/usage/linguistic-features#tokenization&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;The text is generally split up into tokens that match words. But the text can also be split up into subwords or characters. The level of tokenization (or granularity) depends on the NLP-task and the target size for the total number of tokens for your vocabulary. The larger the vocabulary size the more common words you can tokenize and the more memory you need. The smaller the vocabulary size the more subword tokens you need to avoid having to use the -UNK- token (unknown).&lt;/p&gt;
&lt;p&gt;Above is an example of a word-level tokenizer. Techniques for subword tokenization are often used for training deep learning models and have names like WordPiece, Unigram and Byte Pair Encoding (BPE). For example, BPE ensures that the most common words will be represented in the new vocabulary as a single token, while less common words will be broken down into two or more subword tokens. More details &lt;a href=&quot;https://blog.floydhub.com/tokenization-nlp/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
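&lt;p&gt;The core of BPE training can be sketched as a loop that repeatedly merges the most frequent pair of adjacent symbols. This is a simplified illustration under several assumptions: words are split into single characters, there is no end-of-word marker, and ties are broken by first occurrence. Production tokenizers handle these details differently.&lt;/p&gt;

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent symbol pair."""
    corpus = Counter(tuple(word) for word in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i != len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "low", "lower", "newest"], num_merges=2)
print(merges)
print(corpus)
```

&lt;p&gt;After two merges the frequent word ‘low’ has become a single token, while the rarer ‘newest’ is still split into characters.&lt;/p&gt;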
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*jCHBfRMrbjgn5WJtGdigvQ.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Tokenization levels vs Vocabulary size (&lt;a href=&quot;https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Once you choose a tokenizer and train a model on the tokenized data, you should always use that same tokenizer when using the model!&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[13 - Rulebased Training Data]]></title><description><![CDATA[A solution to scale up your training data is by programmatically building training datasets without manual labeling. The idea is to define…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/13-rulebased-training-data/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/13-rulebased-training-data/</guid><pubDate>Mon, 14 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A solution to scale up your training data is by programmatically building training datasets without manual labeling. The idea is to define heuristic rules which are used in functions for labeling training data. &lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*mg2OsOn8BjFfAn03GYajHg.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Writing labeling functions in Snorkel (&lt;a href=&quot;https://www.snorkel.org/get-started/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Since the labeling functions have unknown accuracies and correlations, their output labels may overlap and conflict. By using a model to automatically estimate the accuracies and correlations, reweight and combine the labels, you can produce a final set of clean, integrated training labels.&lt;/p&gt;
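&lt;p&gt;The sketch below illustrates the principle without using Snorkel itself. Three hypothetical labeling functions vote on a message, and the votes are combined by a plain majority vote. That is a much simpler stand-in for the learned model described above, which also estimates how accurate and how correlated the functions are.&lt;/p&gt;

```python
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

# Hypothetical heuristics; each one may abstain when it has no opinion.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if 4 >= len(text.split()) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_label(text):
    """Combine noisy votes by simple majority; ties and empty votes abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN).most_common()
    if not counts or (len(counts) > 1 and counts[0][1] == counts[1][1]):
        return ABSTAIN
    return counts[0][0]

print(majority_label("Win a prize at http://example.com now friends"))  # 1 (SPAM)
print(majority_label("prize inside"))  # -1, the two votes conflict
```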
&lt;p&gt;The goal is to use the resulting labeled training data points to train a machine learning model that can generalize beyond the coverage of the labeling functions. &lt;a href=&quot;https://www.snorkel.org/use-cases/01-spam-tutorial&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;This Python tutorial&lt;/a&gt; uses Stanford’s &lt;a href=&quot;https://github.com/snorkel-team/snorkel&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Snorkel&lt;/a&gt; library for this purpose. The approach appears to be quite successful: the team is building a &lt;a href=&quot;https://www.snorkel.ai/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;business solution&lt;/a&gt; around the concept.&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[12 - Textual Data Augmentation]]></title><description><![CDATA[The amount of available textual (training) data influences the performance of many NLP tasks. If collecting more data is not an option…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/12-textual-data-augmentation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/12-textual-data-augmentation/</guid><pubDate>Sun, 13 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The amount of available textual (training) data influences the performance of many NLP tasks. If collecting more data is not an option, there are different techniques for boosting performance on your NLP task.&lt;/p&gt;
&lt;p&gt;Data augmentation is a standard part of Computer Vision tasks. However, due to the grammatical structure of language, the task is much more delicate for textual data and Natural Language Generation.&lt;/p&gt;
&lt;p&gt;Here are some examples of how the textual data is transformed by &lt;a href=&quot;https://arxiv.org/pdf/1901.11196v2.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Easy Data Augmentation&lt;/a&gt; (EDA) techniques and Back Translation:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Rdz9Mb2swB3gfE5Ewix60w.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Textual Data Augmentation Techniques (&lt;a href=&quot;http://www.innerdoc.com&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Data Augmentation might not help, but it’s worth a shot if you are stuck. Whatever you do, do not validate with augmented textual data!&lt;/p&gt;
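&lt;p&gt;Two of the EDA operations, random swap and random deletion, are easy to sketch with the standard library. The function signatures, the deletion probability and the fixed seed are assumptions for this example and are not taken from the EDA reference code. Synonym replacement and random insertion are left out because they need a synonym dictionary.&lt;/p&gt;

```python
import random

def random_swap(tokens, rng):
    """Swap two randomly chosen positions."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, but never return an empty list."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(list(tokens))]

rng = random.Random(42)  # fixed seed so that runs are repeatable
sentence = "me and some guys from school had a band".split()
print(random_swap(sentence, rng))
print(random_deletion(sentence, rng))
```

&lt;p&gt;Apply functions like these to the training split only, so that the validation data stays untouched.&lt;/p&gt;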
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[11 - Crowdsourcing Marketplace]]></title><description><![CDATA[Remote workers are often recruited to complete labor-intensive tasks like building training datasets. Amazons Mechanical Turk (MTurk) is the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/11-crowdsourcing-marketplace/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/11-crowdsourcing-marketplace/</guid><pubDate>Sat, 12 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Remote workers are often recruited to complete labor-intensive tasks like building training datasets. &lt;a href=&quot;https://www.mturk.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Amazon’s Mechanical Turk&lt;/a&gt; (MTurk) is the leading platform for this task. Work is defined in human intelligence tasks (HITs), which range in length, complexity and compensation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*yBYX_zFu0IOC02i-CkpqTw.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;MTurk mechanism (&lt;a href=&quot;https://www.mturk.com/&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Outsourcing work is great, but there are limits. While MTurk is often considered a cheap solution to data collection, there are actually many hidden costs. You must be very explicit in defining HIT descriptions. A lot of effort is required for managing the project and monitoring quality control.&lt;/p&gt;
&lt;p&gt;It is recommended to start by labeling yourself. You will go through a lot of trial and error in the labeling scheme and in the annotation and tag definitions. Only proceed once you have confidence in your labeling scheme and have applied it to build a Proof-of-Concept. So, experiment and fine-tune the training data definition yourself and then scale up to MTurk.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[10 - Training Data Provider]]></title><description><![CDATA[Gold data refers to data of very high quality, which is more or less as close as you can get to the ground truth. This is the data you want…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/10-training-data-provider/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/10-training-data-provider/</guid><pubDate>Fri, 11 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Gold data refers to data of very high quality, which is more or less as close as you can get to the ground truth. This is the data you want to use when training a new Language Model. Some Data Providers sell these high quality datasets. However, the use of an off-the-shelf dataset depends on the usability of the data. This depends on the NLP-Task, Language, Domain and Tag-schema. I’m skeptical of using third-party datasets as they almost never match your purpose. Unless you use it for a default NLP task or a demo purpose or in addition to your own training dataset.&lt;/p&gt;
&lt;p&gt;A good starting point for an overview of company- and research datasets for various tasks can be found in the &lt;a href=&quot;https://datasets.quantumstat.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Big Bad NLP database&lt;/a&gt;.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[09 - Annotation with Active Learning]]></title><description><![CDATA[It might not be useful to build a training dataset for Named Entity Recognition with 2000 annotations, including 100 occurrences where…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/09-annotation-with-active-learning/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/09-annotation-with-active-learning/</guid><pubDate>Thu, 10 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It might not be useful to build a training dataset for Named Entity Recognition with 2000 annotations, including 100 occurrences where ‘Barack Obama’ is tagged as a Person. You only want to annotate sentences where the model is least sure about the prediction. &lt;/p&gt;
&lt;p&gt;With active learning, the model chooses which sentences should be selected for annotating. Other sentences are skipped, because the model is more certain about those annotations. &lt;/p&gt;
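&lt;p&gt;One common way to make that choice is uncertainty sampling: rank the sentences by how confident the model is in its top prediction and annotate the least confident ones first. The sketch below assumes hypothetical model output in the form of label probabilities per sentence. It does not show how Prodi.gy or any other tool selects examples.&lt;/p&gt;

```python
def least_confident(predictions, k):
    """Return the k sentences whose top predicted probability is lowest."""
    ranked = sorted(predictions, key=lambda item: max(item[1].values()))
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical model output: a sentence plus its label probabilities.
predictions = [
    ("Barack Obama spoke today.", {"PERSON": 0.99, "OTHER": 0.01}),
    ("Paris arrived late.", {"PERSON": 0.55, "OTHER": 0.45}),
    ("Jordan is beautiful.", {"PERSON": 0.51, "OTHER": 0.49}),
]
print(least_confident(predictions, k=2))
# ['Jordan is beautiful.', 'Paris arrived late.']
```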
&lt;p&gt;The makers of spaCy made the annotation tool &lt;a href=&quot;https://prodi.gy/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Prodi.gy&lt;/a&gt;, which is powered by active learning (video below).&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[08 - Manual Annotation]]></title><description><![CDATA[The good old fashioned manual labor. Annotate a sentence, paragraph or document for your task. For example: tag a word with a Part-of-Speech…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/08-manual-annotation/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/08-manual-annotation/</guid><pubDate>Wed, 09 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The good old fashioned manual labor. Annotate a sentence, paragraph or document for your task. For example: tag a word with a &lt;a href=&quot;https://universaldependencies.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Part-of-Speech tag, or Dependency tag&lt;/a&gt;. Tag one or more words as a Named Entity. Or tag a sequence of words with a Category-tag.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*aRIQEMYe0p6LcQidrYztow.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Manually captured annotations (&lt;a href=&quot;https://arxiv.org/pdf/1704.02853.pdf&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Over the years, numerous annotation tools have been developed. Here is a &lt;a href=&quot;https://github.com/mariananeves/annotation-tools&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;list&lt;/a&gt; of almost a hundred annotation tools. Many of them have terribly inefficient user interfaces or are not actively developed.&lt;/p&gt;
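&lt;p&gt;Whatever tool you pick, a Named Entity annotation usually boils down to a small record per tagged span. The sketch below assumes a simple (start, end, label) format with character offsets; the format and the labels are illustrative and not tied to any particular tool.&lt;/p&gt;

```python
text = "Barack Obama visited Berlin in 2013."

# Hypothetical annotation format: (start, end, label) character offsets.
annotations = [(0, 12, "PERSON"), (21, 27, "LOCATION")]


def spans_to_strings(text, annotations):
    """Resolve each annotated span to the text it covers."""
    return [(text[start:end], label) for start, end, label in annotations]


tagged = spans_to_strings(text, annotations)
print(tagged)
```

&lt;p&gt;Resolving the offsets back to text like this is a cheap quality check: a span that starts or ends in the middle of a word shows up immediately.&lt;/p&gt;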
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[07 - Text Extraction and OCR]]></title><description><![CDATA[Extracting text from files is a fragile process if you don’t know what to expect from the format of the input files. The Java-based Apache…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/07-text-extraction-and-ocr/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/07-text-extraction-and-ocr/</guid><pubDate>Tue, 08 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Extracting text from files is a fragile process if you don’t know what to expect from the format of the input files. The Java-based &lt;a href=&quot;https://tika.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Tika&lt;/a&gt; can handle &lt;a href=&quot;https://tika.apache.org/1.24.1/formats.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;hundreds &lt;/a&gt;of file extensions, has a &lt;a href=&quot;https://github.com/chrismattmann/tika-python&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Python port&lt;/a&gt; and is your best option. If you are looking for other (Python) packages to extract text, always check the underlying required packages.&lt;/p&gt;
&lt;p&gt;However, Tika’s output is often not perfect. PDF is an especially challenging format. It was designed as an output format that yields a perfectly readable document, but it is difficult to reduce the file back to its source text. Broken lines in PDFs are especially difficult to clean. Should you remove the line ending to restore the sentence, or would that merge a title that has no closing period with the sentence that follows? PDFs with multiple columns on a page are also often parsed incorrectly.&lt;/p&gt;
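&lt;p&gt;To make the broken-line dilemma concrete, here is a rough heuristic sketch. It joins a line to the previous one unless the previous text already ends a sentence, or looks like a title (short, no closing punctuation, followed by a capitalized line). The function names and the 40-character title limit are assumptions; real documents will need more rules.&lt;/p&gt;

```python
def should_join(previous, line, max_title_length):
    """Decide whether `line` continues the text collected in `previous`."""
    ends_sentence = previous.endswith((".", "!", "?"))
    # A short line without closing punctuation, followed by a capitalized
    # line, is treated as a title and kept separate.
    looks_like_title = (
        max_title_length >= len(previous)
        and not ends_sentence
        and line[:1].isupper()
    )
    return not ends_sentence and not looks_like_title


def unwrap_lines(lines, max_title_length=40):
    """Merge hard-wrapped lines back into titles and sentences."""
    blocks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if blocks and should_join(blocks[-1], line, max_title_length):
            blocks[-1] = blocks[-1] + " " + line
        else:
            blocks.append(line)
    return blocks


extracted = [
    "Introduction",
    "Extracting text from PDF files is",
    "hard because lines are broken",
    "at arbitrary positions.",
    "Columns make it worse.",
]
cleaned = unwrap_lines(extracted)
print(cleaned)
```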
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*dGZB8buO0d6Y5Xrzi5fHXQ.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;OCR on a Polish receipt (&lt;a href=&quot;https://medium.com/hackernoon/simple-ocr-implementation-on-android-with-googles-ml-kit-ceb4cdd8d70c&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Older PDFs pose extra challenges, because they may contain images of text rather than digital text. For these you need an Optical Character Recognition (OCR) model that is trained for your language and font type. OCR involves many subtasks, such as rotating the scan to the right angle, converting the image from color to black and white, smoothing edges, normalizing the aspect ratio and analyzing the layout for zones like columns, paragraphs or tables. Widely used libraries are &lt;a href=&quot;https://github.com/skvark/opencv-python&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;OpenCV&lt;/a&gt; and &lt;a href=&quot;https://tesseract-ocr.github.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Tesseract-OCR&lt;/a&gt;, which was started decades ago by HP and is now maintained by Google. There is also a &lt;a href=&quot;https://github.com/madmaze/pytesseract&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Python wrapper&lt;/a&gt;.&lt;/p&gt;
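&lt;p&gt;As an illustration of one of these subtasks, the sketch below converts a tiny color image to black and white in plain Python. It assumes the common luminance weights for grayscale and a fixed threshold of 128. In practice you would leave this step to a library such as OpenCV, which also offers adaptive thresholds.&lt;/p&gt;

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to a single gray value from 0 to 255."""
    red, green, blue = pixel
    return round(0.299 * red + 0.587 * green + 0.114 * blue)


def binarize(image, threshold=128):
    """Map every pixel to 1 (white) or 0 (black) with a fixed threshold."""
    return [
        [1 if to_gray(pixel) >= threshold else 0 for pixel in row]
        for row in image
    ]


# A made-up 2x2 image: light pixels on the left, dark pixels on the right.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 200, 200), (50, 50, 50)],
]
black_and_white = binarize(image)
print(black_and_white)
```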
&lt;hr&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[06 - Text and File Scraping]]></title><description><![CDATA[If you need information from the web, but there is no API or other structured way of retrieving the data, then you might want to scrape…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/06-text-and-file-scraping/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/06-text-and-file-scraping/</guid><pubDate>Mon, 07 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you need information from the web, but there is no API or other structured way of retrieving the data, then you might want to scrape textual data, or files.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*Y-S9T8_UAGirgYttt2t4_A.png&quot;&gt;&lt;/p&gt;
&lt;p&gt;However, there are some challenges. When you visit a lot of webpages in the same domain in a short time, there’s a good chance your IP address gets blocked by the server. To prevent this, you probably want to handle request headers, manage cookies, log in automatically, handle popups and browse JavaScript-generated websites in ‘headless’ mode.&lt;/p&gt;
&lt;p&gt;After getting access to the page source of the webpage one has to define the logic to extract values from HTML. For example, retrieve all the text from the news article, but not the text from the advertisements or menu buttons.&lt;/p&gt;
&lt;p&gt;Very popular Python packages to get started with building successful scrapers are: &lt;a href=&quot;https://scrapy.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Scrapy&lt;/a&gt;, &lt;a href=&quot;https://www.crummy.com/software/BeautifulSoup/bs4/doc/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Beautiful Soup&lt;/a&gt;, &lt;a href=&quot;https://docs.python.org/3/library/urllib.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;urllib&lt;/a&gt; and &lt;a href=&quot;https://selenium-python.readthedocs.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Selenium&lt;/a&gt;.&lt;/p&gt;
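&lt;p&gt;A minimal sketch of two of the precautions mentioned above, using only the standard library: sending explicit request headers and pausing between requests to the same domain. The User-Agent string, the delay and the function names are illustrative assumptions. Cookies, logins and headless browsing are out of scope here.&lt;/p&gt;

```python
import time
import urllib.request

# Illustrative header; identify your scraper the way the target site expects.
HEADERS = {"User-Agent": "example-scraper/0.1 (contact: you@example.com)"}


def build_request(url):
    """Create a request object that carries our explicit headers."""
    return urllib.request.Request(url, headers=HEADERS)


def polite_fetch(urls, delay_seconds=2.0, opener=urllib.request.urlopen):
    """Fetch pages one by one, sleeping between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # spread requests over time
        with opener(build_request(url)) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
    return pages


request = build_request("https://example.com/news/1")
print(request.full_url, request.headers)
```

&lt;p&gt;The opener parameter only exists so the function can be tried without network access. Calling polite_fetch with a list of URLs uses the real urlopen by default.&lt;/p&gt;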
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[05 - Loading from API]]></title><description><![CDATA[An API serves as the interface between different applications. The requestor automatically gets access to data, with the benefit that the…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/05-loading-from-api/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/05-loading-from-api/</guid><pubDate>Sun, 06 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An API serves as the interface between different applications. The requestor automatically gets access to data, with the benefit that the source doesn’t have to know how the other system exactly works. Unfortunately, few organizations maintain their data for (free) external use by means of an API. Good examples are the &lt;a href=&quot;https://developer.twitter.com/en/docs/api-reference-index&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Twitter-API&lt;/a&gt; (streaming), &lt;a href=&quot;https://newsapi.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;NewsAPI&lt;/a&gt; and &lt;a href=&quot;https://open-platform.theguardian.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;GuardianAPI&lt;/a&gt; (batch requests).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*y7c627mWjqLuWXX9Eyhg4w.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;API scheme (&lt;a href=&quot;https://www.youtube.com/watch?v=-VnBJQ8pBj0&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
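&lt;p&gt;A batch request typically comes down to building a URL with query parameters and parsing the JSON response. The sketch below assumes a hypothetical endpoint, parameter names and response layout. Every real API, including the ones linked above, defines its own, so check its documentation.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://api.example.com/v1/articles"


def build_url(query, page=1, api_key="YOUR_KEY"):
    """Build the request URL for one page of results."""
    params = {"q": query, "page": page, "apiKey": api_key}
    return BASE_URL + "?" + urlencode(params)


def extract_texts(response_body):
    """Pull the raw article texts out of a JSON response body."""
    payload = json.loads(response_body)
    return [article["text"] for article in payload.get("articles", [])]


url = build_url("text mining")
# Made-up response body in the assumed layout.
sample = '{"articles": [{"text": "First document."}, {"text": "Second document."}]}'
texts = extract_texts(sample)
print(url)
print(texts)
```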
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[04 - Generating a Corpus]]></title><description><![CDATA[A Corpus is a language resource consisting of a structured set of documents and additional information. A corpus can consist of pre…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/04-generating-a-corpus/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/04-generating-a-corpus/</guid><pubDate>Sat, 05 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Corpus is a language resource consisting of a structured set of documents and additional information. A corpus can consist of pre-processed documents, so it might be the output of other NLP tasks. It’s important to know the Corpus Balance: what does the corpus represent, which languages, genres and domains does it cover, and for which NLP tasks can it be used? You can load, for example, the &lt;a href=&quot;http://www.nltk.org/nltk_data/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;corpora&lt;/a&gt; that are built into the NLTK library.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*0-4LSIkPyU7Y8ai-v-tErg.jpeg&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Wikipedia as source for corpus (&lt;a href=&quot;https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;Wikipedia is one of the most popular open data sources for building a corpus. Select and download a &lt;a href=&quot;https://dumps.wikimedia.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Wikipedia dump&lt;/a&gt;, clean the articles with &lt;a href=&quot;https://github.com/attardi/wikiextractor&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;wikiextractor&lt;/a&gt;, and you have a great starter corpus. &lt;a href=&quot;https://commoncrawl.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Common Crawl&lt;/a&gt; is also often used, but you have to be able to handle all those gigabytes.&lt;/p&gt;
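&lt;p&gt;Whichever source you use, it helps to check the balance of the resulting corpus. A minimal sketch, assuming every document carries metadata fields such as language and genre; the field names and the sample documents are made up.&lt;/p&gt;

```python
from collections import Counter


def balance(corpus, field):
    """Return the share of documents per value of a metadata field."""
    counts = Counter(document[field] for document in corpus)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Made-up corpus with hypothetical metadata fields.
corpus = [
    {"text": "Stocks fell sharply today.", "language": "en", "genre": "news"},
    {"text": "The court ruled in favor.", "language": "en", "genre": "legal"},
    {"text": "Rain is expected tomorrow.", "language": "en", "genre": "news"},
    {"text": "De beurs sloot hoger.", "language": "nl", "genre": "news"},
]
language_balance = balance(corpus, "language")
genre_balance = balance(corpus, "genre")
print(language_balance, genre_balance)
```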
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[03 - Loading Structured Datafile]]></title><description><![CDATA[If you are lucky there is a ready-to-use file in a valid CSV- or JSON-format with raw text documents in it. This means other people have…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/03-loading-structured-datafile/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/03-loading-structured-datafile/</guid><pubDate>Fri, 04 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you are lucky there is a ready-to-use file in a valid CSV- or JSON-format with raw text documents in it. This means other people have made an effort for you to save all retrieved documents into one file. This file is a good starting point for your task, provided that you know the constraints and filters that were applied when the structured datafile was created.&lt;/p&gt;
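&lt;p&gt;Loading such a file takes only a few lines with the Python standard library. The sketch below assumes the documents live in a column or key named text and, for JSON, that the file holds a list of objects. Adjust both to the file you actually received. The in-memory sample data stands in for real files.&lt;/p&gt;

```python
import csv
import io
import json


def load_csv_texts(handle, text_column="text"):
    """Read the text column from a CSV file that has a header row."""
    return [row[text_column] for row in csv.DictReader(handle)]


def load_json_texts(handle, text_key="text"):
    """Read the text field from a JSON file that holds a list of objects."""
    return [record[text_key] for record in json.load(handle)]


# In-memory stand-ins; in practice use open(path, encoding="utf-8", newline="").
csv_file = io.StringIO('id,text\n1,"First document, with a comma."\n2,Second document.\n')
json_file = io.StringIO('[{"id": 1, "text": "First document."}, {"id": 2, "text": "Second document."}]')

csv_texts = load_csv_texts(csv_file)
json_texts = load_json_texts(json_file)
print(csv_texts, json_texts)
```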
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[02 - Manual Typewriting]]></title><description><![CDATA[Why not just type your own text and process it in your NLP task? Creating your own content is the best test in demos. Hopefully, it…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/02-manual-typewriting/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/02-manual-typewriting/</guid><pubDate>Thu, 03 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Why not just type your own text and process it in your NLP task? Creating your own content is the best test in demos. Hopefully, it increases your confidence in the NLP-task that you are testing and lets you insert all the rare cases you can think of.&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[01 - Bits to Character Encoding]]></title><description><![CDATA[Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). It…]]></description><link>https://www.innerdoc.com/periodic-table-of-nlp-tasks/01-bits-to-character-encoding/</link><guid isPermaLink="false">https://www.innerdoc.com/periodic-table-of-nlp-tasks/01-bits-to-character-encoding/</guid><pubDate>Wed, 02 Sep 2020 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). It all starts with loading the textual data in the right encoding. The former CEO of StackOverflow wrote an interesting overview about &lt;a href=&quot;https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;The Absolute Minimum you should know about Unicode and Character Sets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/800/1*uTVVsoRElSpoYSO5JUy0pA.png&quot;&gt;&lt;/p&gt;
&lt;div style=&quot;color:grey; text-align:center;&quot;&gt;&lt;sup&gt;Non-ASCII characters as replacement (&lt;a href=&quot;https://www.datafix.com.au/BASHing/2019-10-18.html&quot; style=&quot;color:grey;&quot; target=&quot;_blank&quot;&gt;source&lt;/a&gt;)&lt;/sup&gt;&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;If you have problems with encodings, try the &lt;a href=&quot;https://ftfy.readthedocs.io/en/latest/#ftfy.fix_encoding&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ftfy&lt;/a&gt; library to fix them. If you don’t know what encoding your data has, try &lt;a href=&quot;https://github.com/chardet/chardet&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;chardet&lt;/a&gt; or &lt;a href=&quot;https://github.com/ousret/charset_normalizer&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;charset_normalizer&lt;/a&gt; to detect the encoding.&lt;/p&gt;
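&lt;p&gt;The sketch below shows what goes wrong when bytes are decoded with the wrong encoding, plus a simple fallback strategy. The helper name and the list of candidate encodings are assumptions. For messier data, the libraries above are the better choice.&lt;/p&gt;

```python
raw = "café".encode("utf-8")    # the bytes as they are stored in a file

right = raw.decode("utf-8")     # decoded with the correct encoding
wrong = raw.decode("latin-1")   # wrong encoding: the accent turns into two characters

# As long as no bytes were lost, the mistake can be reversed.
repaired = wrong.encode("latin-1").decode("utf-8")


def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(right, wrong, repaired)
print(decode_with_fallback(b"caf\xe9"))
```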
&lt;br&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;This article is part of the project &lt;a href=&quot;/periodic-table-of-nlp-tasks/&quot;&gt;Periodic Table of NLP Tasks&lt;/a&gt;. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks. &lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title><![CDATA[About Innerdoc]]></title><description><![CDATA[Helping you with Text Analytics tasks. Do you recognize this? You have an important business problem. All information is available. But…]]></description><link>https://www.innerdoc.com/about/</link><guid isPermaLink="false">https://www.innerdoc.com/about/</guid><pubDate>Sat, 01 Aug 2020 10:00:00 GMT</pubDate><content:encoded>&lt;h5&gt;Helping you with Text Analytics tasks.&lt;/h5&gt;
&lt;p&gt;Do you recognize this? You have an important business problem. All information is available. But … you have to collect, search and analyze it. And the volume … it is hidden in hundreds or thousands of documents. You certainly won’t make this a manual task, or spend a lot of time on it.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 166px; &quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/78ed8a3430bbc9fa52bd17c3af678d30/02ab0/innerdoc-icon-white-small.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 109.03614457831326%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAWCAYAAADAQbwGAAAACXBIWXMAAA7DAAAOwwHHb6hkAAADbElEQVQ4y5WU7UtbVxzHxXf+I/0XBmN7I7QvVjcVtHSg7o1PKPVFVnxoHSI+EUyKY9pF3YtQ2HCsNSmYrDNtx9Y6XVdKW7XKVhIV225NbtR6NTE393x27rl50BL78IMv555fzvnm+3s4vyIhBJYZhqHWYDBIRUUFra2tNDQ0KNTX19PY2KjQ3NxMVVUVc3Nz6nw6nSbLYVnR64Q1NTWUlpYyPj7O8PAwbrcbl8ulVmvv8XgoKytjdHQ0d+8I4WGyhYUFiouL8Xq9vMmmpqbo6+srTGhtsg5LXVNTk1JjWTKZVBeysPaWWeoHBwcLE2bVjYyM0N/fTygUoru72yZMSSJT5JDKnJ2YmGBgYOD4kOPxOOXl5WiaRl1dnczPWOZn82isIv1uhL29vVRWVlJbW4vLfUn9sJU08azqfP1EZ3RFx728y6Jmh+wZfwNhOBympKSEstOnWVxaOiIouJnAt57g+kaCq2v7hHcO7BxKhcfmsKenh7a2tpzjb01n7MkO3n92mV7T8VtY1/kpohPeTuYIh4aGChO2t7czOxvK5AgiUoQnDN51mIzY+E7isvStbNs5nZicxOl0Fibs7Ozk2rWrapPaWYPnAdBuQiyUh3ZbrrOgb6hz3459c3wfdnR04PdN288o+gDz8QDG0iUJVx4PezHud2DEHqpz09cDsvmvHHp6VnQir9Dv89n/pr/AfDmPGfsLM2rhnvy+r5pH3Untws4qYmtZfut2loSZVyjXDKGtkP/m4YEDHn8Fjy5KXJDfsslNeTn+CDFbCt+XwI2P4O4XiNXLGSLTVpgl9Pn9yh+TRVzeg9X9PFbkPqzL52nITeIlQkYgfq1EPPsF5psQS06lHtPIhzztswnvPN+j+Y8Y5+/FcfwZ50u5nlvQcC6+whC5RpCRyNH12xnbEToFr57mXpIdckYhaePYCaOGiJkmW09xpxb2nsm0yGqHf8hXuauri5mZGd7Z9jYQm7K1Qiel0ruycX+E38+C5dv/lyKHw5Gbf9FoVA2KwtCIb21jJGWlgx8gbnxsFyKdAP8JuPWJKk5RIBCgurqalpYWNeKzY/8wsv7GhnqeRjat3GDe+gxeyIaPTEmFnx+dNlZzWsPz4ODgrTDTKTuHsuL8/CHc/lQSJPNFMU2T9zWRaRGrjUjGM6PTnpX/A4sLC+eHtHLaAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;innerdoc logo&quot;
        title=&quot;innerdoc logo&quot;
        src=&quot;/static/78ed8a3430bbc9fa52bd17c3af678d30/02ab0/innerdoc-icon-white-small.png&quot;
        srcset=&quot;/static/78ed8a3430bbc9fa52bd17c3af678d30/02ab0/innerdoc-icon-white-small.png 166w&quot;
        sizes=&quot;(max-width: 166px) 100vw, 166px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Innerdoc’s services make use of Text Mining and Natural Language Processing techniques. We make your textual data useful again. We extract information from Word documents, PDFs, articles, emails, web-content, Electronic Health Records, you name it. Innerdoc will analyze the documents and present you with the hidden knowledge and information within them.&lt;/p&gt;
&lt;h5&gt;innerdoc&lt;br&gt;inner document&lt;br&gt;get value from documents&lt;/h5&gt;
&lt;p&gt;We extract all textual data from the documents and transform this unstructured content into structured information. Innerdoc provides a structured overview of all your documents. We provide all the information, but together we decide which problem we solve.&lt;/p&gt;
&lt;h5&gt;Search, Semantic Meaning, Interactive Apps&lt;br&gt;Topic Modeling, Data Quality and Linkage&lt;br&gt;Named Entities, Question Answering&lt;/h5&gt;
&lt;p&gt;We are ready to help you! Contact us at &lt;a href=&quot;mailto:hey@innerdoc.com&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;hey@innerdoc.com&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[A Full and Comprehensive Style Test]]></title><description><![CDATA[Below is just about everything you’ll need to style in the theme. Check the source code to see the many embedded elements within paragraphs…]]></description><link>https://www.innerdoc.com/style-guide/</link><guid isPermaLink="false">https://www.innerdoc.com/style-guide/</guid><pubDate>Sun, 30 Sep 2018 07:03:47 GMT</pubDate><content:encoded>&lt;p&gt;Below is just about everything you’ll need to style in the theme. Check the source code to see the many embedded elements within paragraphs.&lt;/p&gt;
&lt;hr&gt;
&lt;h1&gt;Heading 1&lt;/h1&gt;
&lt;h2&gt;Heading 2&lt;/h2&gt;
&lt;h3&gt;Heading 3&lt;/h3&gt;
&lt;h4&gt;Heading 4&lt;/h4&gt;
&lt;h5&gt;Heading 5&lt;/h5&gt;
&lt;h6&gt;Heading 6&lt;/h6&gt;
&lt;hr&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, &lt;a href=&quot;&quot;&gt;test link&lt;/a&gt; adipiscing elit. &lt;strong&gt;This is strong.&lt;/strong&gt; Nullam dignissim convallis est. Quisque aliquam. &lt;em&gt;This is emphasized.&lt;/em&gt; Donec faucibus. Nunc iaculis suscipit dui. 5&lt;sup&gt;3&lt;/sup&gt; = 125. Water is H&lt;sub&gt;2&lt;/sub&gt;O. Nam sit amet sem. Aliquam libero nisi, imperdiet at, tincidunt nec, gravida vehicula, nisl. &lt;cite&gt;The New York Times&lt;/cite&gt; (That’s a citation). &lt;span style=&quot;text-decoration:underline;&quot;&gt;Underline&lt;/span&gt;. Maecenas ornare tortor. Donec sed tellus eget sapien fringilla nonummy. Mauris a ante. Suspendisse quam sem, consequat at, commodo vitae, feugiat in, nunc. Morbi imperdiet augue quis tellus.&lt;/p&gt;
&lt;p&gt;HTML and CSS are our tools. Mauris a ante. Suspendisse quam sem, consequat at, commodo vitae, feugiat in, nunc. Morbi imperdiet augue quis tellus. Praesent mattis, massa quis luctus fermentum, turpis mi volutpat justo, eu volutpat enim diam eget metus. To copy a file type &lt;code class=&quot;language-text&quot;&gt;COPY filename&lt;/code&gt;. &lt;del&gt;Dinner’s at 5:00.&lt;/del&gt; &lt;span style=&quot;text-decoration:underline;&quot;&gt;Let’s make that 7&lt;/span&gt;. This &lt;del&gt;text&lt;/del&gt; has been struck.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Media&lt;/h2&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore.&lt;/p&gt;
&lt;h3&gt;Big Image&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;img/testimg1.jpg&quot; alt=&quot;Test Image&quot;&gt;&lt;/p&gt;
&lt;p&gt;Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.&lt;/p&gt;
&lt;h3&gt;Small Image&lt;/h3&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;img/testimg2.jpg&quot; alt=&quot;Small Test Image&quot;&gt;&lt;/p&gt;
&lt;p&gt;Labore et dolore.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;List Types&lt;/h2&gt;
&lt;h3&gt;Definition List&lt;/h3&gt;
&lt;p&gt;Definition List Title
: This is a definition list division.&lt;/p&gt;
&lt;p&gt;Definition
: An exact statement or description of the nature, scope, or meaning of something: &lt;em&gt;our definition of what constitutes poetry.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Ordered List&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;List Item 1&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;List Item 2&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Nested list item A&lt;/li&gt;
&lt;li&gt;Nested list item B&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;List Item 3&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Unordered List&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;List Item 1&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;List Item 2&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nested list item A&lt;/li&gt;
&lt;li&gt;Nested list item B&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;List Item 3&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;Table&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&quot;center&quot;&gt;Table Header 1&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Table Header 2&lt;/th&gt;
&lt;th align=&quot;center&quot;&gt;Table Header 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;Division 1&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 2&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;Division 1&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 2&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;center&quot;&gt;Division 1&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 2&lt;/td&gt;
&lt;td align=&quot;center&quot;&gt;Division 3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Preformatted Text&lt;/h2&gt;
&lt;p&gt;Typographically, preformatted text is not the same thing as code. Sometimes, a faithful execution of the text requires preformatted text that may not have anything to do with code. Most browsers use Courier and that’s a good default — with one slight adjustment, Courier 10 Pitch over regular Courier for Linux users.&lt;/p&gt;
&lt;h3&gt;Code&lt;/h3&gt;
&lt;p&gt;Code can be presented inline, like &lt;code class=&quot;language-text&quot;&gt;&amp;lt;?php bloginfo(&amp;#39;stylesheet_url&amp;#39;); ?&amp;gt;&lt;/code&gt;, or using &lt;a href=&quot;http://jekyllrb.com/docs/templates/#code-snippet-highlighting&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;jekyll’s highlight
filter&lt;/a&gt; to
highlight a block of code. Because we have more specific typographic needs for code, we’ll specify Consolas and Monaco ahead of the browser-defined monospace font.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;css&quot;&gt;&lt;pre class=&quot;language-css&quot;&gt;&lt;code class=&quot;language-css&quot;&gt;&lt;span class=&quot;token selector&quot;&gt;#container&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; left&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;margin&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 0 -240px 0 0&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;width&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; 100%&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;h2&gt;Blockquotes&lt;/h2&gt;
&lt;p&gt;Let’s keep it simple. Italics are good to help set it off from the body text. Be sure to style the citation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good afternoon, gentlemen. I am a HAL 9000 computer. I became operational at the H.A.L. plant in Urbana, Illinois on the 12th of January 1992. My instructor was Mr. Langley, and he taught me to sing a song. If you’d like to hear it I can sing it for you. — &lt;a href=&quot;http://en.wikipedia.org/wiki/HAL_9000&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;HAL 9000&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And here’s a bit of trailing text.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Text-level semantics&lt;/h2&gt;
&lt;p&gt;HTML elements&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;#&quot;&gt;a element&lt;/a&gt; example &lt;br&gt;
The &lt;abbr&gt;abbr element&lt;/abbr&gt; and &lt;abbr title=&quot;Title text&quot;&gt;abbr element with title&lt;/abbr&gt; examples &lt;br&gt;
The &lt;b&gt;b element&lt;/b&gt; example &lt;br&gt;
The &lt;cite&gt;cite element&lt;/cite&gt; example &lt;br&gt;
The &lt;code&gt;code element&lt;/code&gt; example &lt;br&gt;
The &lt;del&gt;del element&lt;/del&gt; example &lt;br&gt;
The &lt;dfn&gt;dfn element&lt;/dfn&gt; and &lt;dfn title=&quot;Title text&quot;&gt;dfn element with title&lt;/dfn&gt; examples &lt;br&gt;
The &lt;em&gt;em element&lt;/em&gt; example &lt;br&gt;
The &lt;i&gt;i element&lt;/i&gt; example &lt;br&gt;
The &lt;ins&gt;ins element&lt;/ins&gt; example &lt;br&gt;
The &lt;kbd&gt;kbd element&lt;/kbd&gt; example &lt;br&gt;
The &lt;mark&gt;mark element&lt;/mark&gt; example &lt;br&gt;
The &lt;q&gt;q element &lt;q&gt;inside&lt;/q&gt; a q element&lt;/q&gt; example &lt;br&gt;
The &lt;s&gt;s element&lt;/s&gt; example &lt;br&gt;
The &lt;samp&gt;samp element&lt;/samp&gt; example &lt;br&gt;
The &lt;small&gt;small element&lt;/small&gt; example &lt;br&gt;
The &lt;span&gt;span element&lt;/span&gt; example &lt;br&gt;
The &lt;strong&gt;strong element&lt;/strong&gt; example &lt;br&gt;
The &lt;sub&gt;sub element&lt;/sub&gt; example &lt;br&gt;
The &lt;sup&gt;sup element&lt;/sup&gt; example &lt;br&gt;
The &lt;var&gt;var element&lt;/var&gt; example &lt;br&gt;
The &lt;u&gt;u element&lt;/u&gt; example&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Embeds&lt;/h2&gt;
&lt;p&gt;Sometimes all you want to do is embed a little love from another location and set your post alive.&lt;/p&gt;
&lt;h3&gt;Video&lt;/h3&gt;
&lt;p&gt;Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.&lt;/p&gt;
&lt;p&gt;Culpa qui officia deserunt mollit anim id est laborum.&lt;/p&gt;
&lt;div class=&quot;gatsby-resp-iframe-wrapper&quot; style=&quot;padding-bottom: 56.166666666666664%; position: relative; height: 0; overflow: hidden; margin-bottom: 1rem&quot; &gt; &lt;iframe src=&quot;//player.vimeo.com/video/103224792&quot; frameborder=&quot;0&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen style=&quot; position: absolute; top: 0; left: 0; width: 100%; height: 100%; &quot;&gt;&lt;/iframe&gt; &lt;/div&gt;
&lt;h3&gt;Slides&lt;/h3&gt;
&lt;p&gt;Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.&lt;/p&gt;
&lt;script async class=&quot;speakerdeck-embed&quot; data-id=&quot;585245d01ee1013238737e42b879906f&quot; data-ratio=&quot;1.77777777777778&quot; src=&quot;//speakerdeck.com/assets/embed.js&quot;&gt;&lt;/script&gt;
&lt;p&gt;Culpa qui officia deserunt mollit anim id est laborum.&lt;/p&gt;
&lt;script async class=&quot;speakerdeck-embed&quot; data-id=&quot;6b7ef470b2e40131083f6ac252c60ef6&quot; data-ratio=&quot;1.33333333333333&quot; src=&quot;//speakerdeck.com/assets/embed.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;gatsby-resp-iframe-wrapper&quot; style=&quot;padding-bottom: 28.083333333333332%; position: relative; height: 0; overflow: hidden; margin-bottom: 1rem&quot; &gt; &lt;iframe src=&quot;https://share.streamlit.io/innerdoc/periodic-table-creator/main/periodic-table-creator/periodic_table_creator.py&quot; frameborder=&quot;0&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen style=&quot; position: absolute; top: 0; left: 0; width: 100%; height: 100%; &quot;&gt;&lt;/iframe&gt; &lt;/div&gt;
&lt;h3&gt;Audio&lt;/h3&gt;
&lt;p&gt;Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.&lt;/p&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;450&quot; scrolling=&quot;no&quot; frameborder=&quot;no&quot; src=&quot;https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/52891122&amp;amp;auto_play=false&amp;amp;hide_related=false&amp;amp;show_comments=true&amp;amp;show_user=true&amp;amp;show_reposts=false&amp;amp;visual=true&quot;&gt;&lt;/iframe&gt;
&lt;p&gt;Culpa qui officia deserunt mollit anim id est laborum.&lt;/p&gt;
&lt;h3&gt;Code&lt;/h3&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt.&lt;/p&gt;
&lt;p data-height=&quot;268&quot; data-theme-id=&quot;0&quot; data-slug-hash=&quot;bcqhe&quot; data-default-tab=&quot;result&quot; data-user=&quot;rglazebrook&quot; class=&apos;codepen&apos;&gt;See the Pen &lt;a href=&apos;http://codepen.io/rglazebrook/pen/bcqhe/&apos;&gt;Simple Rotating Spinner&lt;/a&gt; by Rob Glazebrook (&lt;a href=&apos;http://codepen.io/rglazebrook&apos;&gt;@rglazebrook&lt;/a&gt;) on &lt;a href=&apos;http://codepen.io&apos;&gt;CodePen&lt;/a&gt;.&lt;/p&gt;
&lt;script async src=&quot;//assets.codepen.io/assets/embed/ei.js&quot;&gt;&lt;/script&gt;
&lt;p&gt;Isn’t it beautiful?&lt;/p&gt;
&lt;p&gt;HTML: HyperText Markup Language&lt;br&gt;
CSS: Cascading Style Sheets&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Topics for Innerdoc posts]]></title><description><![CDATA[Topics NLP basics; tokenization, sentencizer, lemma, pos, dep, ontology making, schema.org, domain specific, input for NER NER, input for…]]></description><link>https://www.innerdoc.com/post-topics/</link><guid isPermaLink="false">https://www.innerdoc.com/post-topics/</guid><pubDate>Mon, 02 Feb 2015 23:46:37 GMT</pubDate><content:encoded>&lt;h3&gt;Topics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;NLP basics: tokenization, sentencizer, lemma, POS, dep&lt;/li&gt;
&lt;li&gt;ontology making, schema.org, domain-specific, input for NER&lt;/li&gt;
&lt;li&gt;NER, input for NEL, how to define a NER schema (schema.org, Ontotext)&lt;/li&gt;
&lt;li&gt;NEL, linked to a knowledge graph&lt;/li&gt;
&lt;li&gt;KG - generating, using/querying, examples GKG, Bing, linked data&lt;/li&gt;
&lt;li&gt;healthcare KG, SNOMED&lt;/li&gt;
&lt;li&gt;NER vs timelines: books and mentions of their main characters&lt;/li&gt;
&lt;li&gt;Abbreviations&lt;/li&gt;
&lt;li&gt;Definitions&lt;/li&gt;
&lt;li&gt;relation extraction, network of people within a text, Hearst patterns&lt;/li&gt;
&lt;li&gt;(key)word usage, scattertext, anti-wordclouds&lt;/li&gt;
&lt;li&gt;topic clustering, t-SNE&lt;/li&gt;
&lt;li&gt;Korfball blogposts&lt;/li&gt;
&lt;li&gt;NLP is hard: buffalo buffalo, Siri, I’m bleeding, bigram/trigram&lt;/li&gt;
&lt;li&gt;spider web visualisations&lt;/li&gt;
&lt;li&gt;web scraping&lt;/li&gt;
&lt;li&gt;document extraction&lt;/li&gt;
&lt;li&gt;story metrics&lt;/li&gt;
&lt;li&gt;sentiment&lt;/li&gt;
&lt;li&gt;wordnet&lt;/li&gt;
&lt;li&gt;MLML - &lt;a href=&quot;https://awesome-streamlit.azurewebsites.net/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://awesome-streamlit.azurewebsites.net/&lt;/a&gt;   &lt;a href=&quot;https://raw.githubusercontent.com/MarcSkovMadsen/awesome-streamlit/master/gallery/medical_language_learner/medical_language_learner.py&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://raw.githubusercontent.com/MarcSkovMadsen/awesome-streamlit/master/gallery/medical_language_learner/medical_language_learner.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;similarity - TF-IDF, gensim.summarization.bm25 - &lt;a href=&quot;https://github.com/jroakes/tech-seo-crawler/blob/master/lib/doc_frequency.py&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://github.com/jroakes/tech-seo-crawler/blob/master/lib/doc_frequency.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Resources&lt;/h3&gt;
&lt;iframe src=&apos;https://web.stanford.edu/~jurafsky/slp3/slides/21_SentLex.pdf&apos; height=&quot;550&quot;  width=&quot;100%&quot;&gt;&lt;/iframe&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://web.stanford.edu/~jurafsky/slp3/slides/21_SentLex.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://web.stanford.edu/~jurafsky/slp3/slides/21_SentLex.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://liwc.wpengine.com/compare-dictionaries/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;http://liwc.wpengine.com/compare-dictionaries/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Ejhfast/empath-client/blob/master/empath/data/categories.tsv&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://github.com/Ejhfast/empath-client/blob/master/empath/data/categories.tsv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://journals.sagepub.com/doi/full/10.1177/2056305118797724&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://journals.sagepub.com/doi/full/10.1177/2056305118797724&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.researchgate.net/publication/336496045_A_Survey_of_Computational_Approaches_and_Challenges_in_Multimodal_Sentiment_Analysis&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://www.researchgate.net/publication/336496045_A_Survey_of_Computational_Approaches_and_Challenges_in_Multimodal_Sentiment_Analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content:encoded></item></channel></rss>