01 - Bits to Character Encoding

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). Fix or load your data by choosing the right encoding.

Rob van Zoest
Founder @ innerdoc.com | NLP Expert-Engineer-Enthusiast | Writes about how to get value from textual data | Lives in the Netherlands | Loves to travel around the globe | Dutchman | rob@innerdoc.com
More posts by Rob van Zoest.

Rob van Zoest

02 Sep 2020• 1 min read

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). It all starts with loading the textual data in the right encoding. The former CEO of StackOverflow wrote an interesting overview about The Absolute Minimum you should know about Unicode and Character Sets.

^{Non-ASCII characters as replacement (source)}

If you have problems with encodings, try the ftfy library to fix them. If you don’t know what encoding your data has, try chardet or charset_normalizer to detect the encoding.

This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.

01 - Bits to Character Encoding

Rob van Zoest

Rob van Zoest

07 - Text Extraction and OCR

06 - Text and File Scraping

05 - Loading from API

About Innerdoc

02 - Manual Typewriting