Source Data Loading

01 - Bits to Character Encoding

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). Fix or load your data by choosing the right encoding.

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding (aka character set). It all starts with loading the textual data in the right encoding. The former CEO of StackOverflow wrote an interesting overview about The Absolute Minimum you should know about Unicode and Character Sets.

Non-ASCII characters as replacement (source)

If you have problems with encodings, try the ftfy library to fix them. If you don’t know what encoding your data has, try chardet or charset_normalizer to detect the encoding.



This article is part of the project Periodic Table of NLP Tasks. Click to read more about the making of the Periodic Table and the project to systemize NLP tasks.