Encoding was always such a pain in the ass when programming (specifically when scraping data from websites). I learned too late that it was such a pain because I just didn't understand it, so I decided to study it a bit (spoiler alert: I still don't really understand it).

Basics about encoding

There’s the character set, which says which characters exist and gives each one a number. Unicode is the big example. It assigns every character a number called a code point, so there is a single code point (U+0041) that stands for every A, no matter the font or where it shows up.
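Here's a tiny sketch of the "one number per character" idea, using Python's built-in `ord` and `chr`. The variable names are just mine for illustration.

```python
# A code point is just an integer that Unicode assigns to a character.
letter = "A"
code_point = ord(letter)           # character -> code point (an int)
label = f"U+{code_point:04X}"      # the conventional U+XXXX notation
round_trip = chr(code_point)       # code point -> character again

print(code_point, label, round_trip)
```

Nothing here says anything about bytes yet. That part is the encoding's job.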

Then, there’s the encoding, which is how you store the character set as bytes. UTF-8 is one example. Another is ASCII, an early character set covering English letters, digits, punctuation, and some control codes (65->A, 66->B, etc). ASCII is a 7-bit code, usually stored one character per 8-bit byte. In UTF-8, a single character takes between 1 and 4 bytes (the original design allowed up to 6, but the current standard caps it at 4). English characters each fit in one byte, and that byte is identical to the ASCII one, so UTF-8 and ASCII are interchangeable for plain English text. That means if a computer were to read English characters that were encoded in UTF-8 and interpret them as ASCII, all would be fine. It only goes wrong once a non-ASCII character shows up.
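To see the variable length for myself, here's a minimal sketch that encodes a few characters to UTF-8 and counts the bytes. The sample characters are my own picks, written as escapes so the file's own encoding can't get in the way.

```python
# Sample characters: A, e-acute, euro sign, and an emoji (my own choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# How many bytes does UTF-8 need for each one?
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# For plain English text, the UTF-8 bytes and the ASCII bytes are the same,
# so UTF-8 bytes can be decoded as ASCII without trouble.
english = "hello"
same_bytes = english.encode("utf-8") == english.encode("ascii")
decoded_as_ascii = english.encode("utf-8").decode("ascii")

print(byte_lengths, same_bytes, decoded_as_ascii)
```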

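In UTF-16, most characters take two bytes, but characters outside the Basic Multilingual Plane (many emoji, for example) take four bytes, stored as a pair of two-byte units called a surrogate pair. So UTF-16 is variable length too. It is also much less common than UTF-8 for web pages and files.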
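A quick sketch of UTF-16 sizes, assuming Python's `utf-16-le` and `utf-16-be` codecs (I use those rather than plain `utf-16` so no byte order mark gets added to the count). The emoji is just my example of a character outside the two-byte range.

```python
bmp_char = "A"                 # fits in one 16-bit unit
astral_char = "\U0001F600"     # an emoji, needs a surrogate pair

utf16_len_bmp = len(bmp_char.encode("utf-16-le"))
utf16_len_astral = len(astral_char.encode("utf-16-le"))

# Big-endian hex makes the two surrogate units easy to read.
surrogate_hex = astral_char.encode("utf-16-be").hex()

print(utf16_len_bmp, utf16_len_astral, surrogate_hex)
```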

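To deal with UTF-8 being variable length per character, the first few bits in every byte are reserved to say whether this byte continues a character or starts a new one. A byte starting with 0 is a whole one-byte character, a byte starting with 110, 1110, or 11110 starts a 2-, 3-, or 4-byte character, and a byte starting with 10 is a continuation.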
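Here's a rough sketch that labels each UTF-8 byte by its leading bits: a byte starting with 0 is a complete one-byte character, a byte starting with 10 is a continuation, and anything else starts a multi-byte character. The function name `describe_utf8_bytes` is made up for this post.

```python
def describe_utf8_bytes(text):
    """Return (bit string, kind) for each byte of text encoded as UTF-8."""
    kinds = []
    for byte in text.encode("utf-8"):
        bits = format(byte, "08b")
        if bits.startswith("0"):
            kind = "single"          # a whole one-byte character
        elif bits.startswith("10"):
            kind = "continuation"    # belongs to the preceding lead byte
        else:
            kind = "lead"            # first byte of a multi-byte character
        kinds.append((bits, kind))
    return kinds

# "A" followed by the euro sign (U+20AC), which takes three bytes.
result = describe_utf8_bytes("A\u20ac")
for bits, kind in result:
    print(bits, kind)
```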

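When you try to encode Unicode text into an encoding that doesn’t support its wide range of characters, like ASCII, the characters that don’t fit can’t be stored. Depending on the software, that’s when you get an error or a question mark/box/alien in place of the character.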
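A small sketch of where the question marks come from, assuming Python's standard `errors="replace"` handler. The sample string is mine. I also show the decoding direction, since that is where the replacement character (the "box") tends to come from when scraping.

```python
text = "caf\u00e9"   # "cafe" with an accented e, which ASCII doesn't have

# By default Python refuses to encode what ASCII can't hold.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace", the character that doesn't fit becomes "?".
replaced = text.encode("ascii", errors="replace")

# Going the other way: Latin-1 bytes read as UTF-8 are invalid, and
# errors="replace" substitutes U+FFFD, the replacement character.
decoded = b"caf\xe9".decode("utf-8", errors="replace")

print(strict_failed, replaced, decoded)
```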

UTF-7, 8, 16, and 32 can all represent every Unicode character. They just use different byte layouts to do it.
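To check that claim, here's a minimal sketch that round-trips the same string through Python's `utf-7`, `utf-8`, `utf-16`, and `utf-32` codecs. The test string is my own mix of one-, two-, three-, and four-byte (in UTF-8) characters.

```python
# A, e-acute, euro sign, and an emoji.
text = "A\u00e9\u20ac\U0001F600"

sizes = {}
round_trips = {}
for codec in ("utf-7", "utf-8", "utf-16", "utf-32"):
    encoded = text.encode(codec)
    sizes[codec] = len(encoded)                       # byte count differs per codec
    round_trips[codec] = encoded.decode(codec) == text  # but the text survives

print(sizes)
print(round_trips)
```

The byte counts differ, but every codec gives back the original text.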

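The meta tag that specifies the charset really should be the first thing in the head tag (the HTML spec wants it within the first 1024 bytes of the file), because the browser starts reading with a guessed encoding, and if this tag says something different it has to start over reading the HTML file with that encoding.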

Browsers will guess if they can’t find a charset declaration (in the Content-Type header or the meta tag), but don’t rely on that.
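For scraping, the same idea can be sketched by hand: look for a charset declaration near the top of the raw bytes before decoding. This is only a rough illustration, not how browsers actually do it. The function name `sniff_meta_charset`, the regex, the UTF-8 fallback, and the sample page are all my own assumptions.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(raw_html, default="utf-8"):
    """Return the charset declared in the first 1024 bytes, else default."""
    head = raw_html[:1024]
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return default   # assumption: fall back to UTF-8 when nothing is declared

# Hypothetical page whose body is Latin-1 encoded.
page = b'<html><head><meta charset="ISO-8859-1"><title>caf\xe9</title></head></html>'
charset = sniff_meta_charset(page)
text = page.decode(charset)

print(charset)
```

In real scraping code the Content-Type HTTP header should be checked first, and a proper HTML parser is a safer bet than a regex.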