#41: Unicode: can you see these: Æ, 爱 and 🚀?
Computers speak bits and bytes. Numbers in general. They don’t understand images, poems or JSON. When we say “hello”, it needs to be encoded into numbers. Conveniently, each character becomes one number. A number can then be stored, transferred and rendered on another computer. Therefore, everyone needs to agree which numbers represent which characters. The first commonly used attempt was called ASCII: American Standard Code for Information Interchange. In short, it’s a table of 128 symbols and their respective numbers, from 0 to 127. For example, lower-case “h” is 104, whereas the exclamation mark is 33. There’s one problem here. 128 symbols. 7 bits. Of course, it’s an American standard. So it ignores the existence of any other country and alphabet.
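If you want to peek at these numbers yourself, here’s a minimal Python sketch using the built-in ord() and chr() functions:

```python
# ASCII assigns each character a number that fits in 7 bits.
print(ord("h"))    # 104
print(ord("!"))    # 33
print(chr(104))    # 'h' - going back from the number to the character
```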
But Poland, Germany, Russia - they all have their own special characters. Over time, each country developed its own standard mapping those special characters to numbers. Sometimes a single country even had multiple competing standards that conflicted with each other. It was impossible to open a file correctly without knowing the source encoding. This was painful and caused a lot of websites and documents to be rendered incorrectly.
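As a quick illustration of the conflict, here’s a Python sketch decoding the very same byte with two legacy ISO-8859 code pages:

```python
# One byte, two meanings - it all depends on the assumed encoding.
raw = b"\xa3"
print(raw.decode("iso8859-1"))   # '£' - pound sign in the Western European code page
print(raw.decode("iso8859-2"))   # 'Ł' - Polish letter in the Central European code page
```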
We had to admit that one byte is not enough. Having the same number reused for different alphabets is tedious and error-prone. So let’s use two bytes! Here comes Unicode 1.0. A standard which mapped every character known to humankind to a 16-bit number. The original proposal from 1988 assumed that 65 thousand characters should be enough for everyone. Well, I have a feeling that China, Japan and Korea already existed in 1988. And their alphabets combined, known as CJK, are almost 100 thousand characters alone. Also, there are other character sets, like Braille, musical notes or Egyptian hieroglyphs. Suffice it to say, the current version 12.0 defines almost 138 thousand different characters.
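To see why 16 bits fell short, here’s a small Python sketch printing the code points of the characters from this episode’s title:

```python
# hex(ord(...)) reveals the Unicode code point of each character.
for ch in "Æ爱🚀":
    print(ch, hex(ord(ch)))
# Æ  0xc6     - fits in a single byte
# 爱 0x7231   - fits in 16 bits
# 🚀 0x1f680  - does not fit in 16 bits at all
```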
OK, so we can map national characters, dead alphabets and these funny emojis to numbers. Yes, emojis are part of the Unicode standard as well. But how do we actually encode these numbers into bytes? Superficially it’s simple. There are way more than 100 thousand characters, so we need at least 3 bytes. For practical reasons let’s use 4. Each character (officially named a code point) uses 4 bytes on disk and on the wire. This is rather wasteful, especially when 99% of the characters are plain Latin ones that need just one byte. But for the sake of simplicity, we can encode Unicode this way. It’s known as UTF-32.
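Here’s a minimal Python sketch of UTF-32 in action (using the big-endian variant so no byte order mark gets in the way):

```python
text = "hello"
encoded = text.encode("utf-32-be")   # 4 bytes per code point, big-endian
print(len(text), len(encoded))       # 5 characters, 20 bytes
print(encoded.hex(" "))              # Python 3.8+ accepts a separator
# 00 00 00 68 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f
```

Three out of every four bytes are zeros - that’s the waste we’re talking about.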
To save on storage we can use just two bytes for the majority of characters. But for the less frequent ones we use 4. So one character is sometimes encoded with 2, sometimes with 4 bytes. It’s more complicated, but saves a lot of space. Such an encoding is named UTF-16. However, the most commonly used encoding is UTF-8. Which also happens to be the most complex one. It uses 1, 2, 3 or 4 bytes to represent one character. The decoding algorithm is a bit convoluted but UTF-8 has one major advantage: it is backward compatible with ASCII. This means that a UTF-8 document using only American characters can be opened as ASCII. That’s because Latin characters A through Z are encoded using just one byte. Just like in ASCII. UTF-16 and UTF-32 files opened as ASCII will look like garbage.
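A rough comparison, again sketched in Python - the number of bytes each encoding spends on a single character:

```python
for ch in ["h", "Æ", "爱", "🚀"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3 or 4 bytes
          len(ch.encode("utf-16-be")),  # 2 or 4 bytes
          len(ch.encode("utf-32-be")))  # always 4 bytes
# h  1 2 4
# Æ  2 2 4
# 爱 3 2 4
# 🚀 4 4 4

# Pure ASCII text is valid UTF-8, byte for byte:
print("hello".encode("utf-8") == "hello".encode("ascii"))   # True
```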
To make matters worse, there are emojis. You know, tears of joy, heart and poo pictograms. They are standardized as well! But to make things more complex, some emojis use multiple code points. For example, there are skin tone modifiers that change the default yellow skin of emojis. Also, if you glue a woman 👩 and a rocket 🚀 emoji together with a special joiner character, you’ll get a single emoji. A lady astronaut 👩‍🚀. I kid you not! It gets better. Placing a man 👨, a woman 👩, a girl 👧 and a boy 👦 next to each other, again concatenated with that special joiner character, renders as one family emoji. This means that a single symbol can be encoded with as many as 7 code points. And 28 bytes in UTF-32. One emoji!
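Here’s a Python sketch building those joined emojis by hand - how they actually render depends on your font and terminal:

```python
ZWJ = "\u200d"   # zero-width joiner, the invisible glue between emojis

astronaut = "👩" + ZWJ + "🚀"                 # woman + rocket = woman astronaut
family = ZWJ.join(["👨", "👩", "👧", "👦"])   # man + woman + girl + boy = family

print(astronaut, family)
print(len(family))                            # 7 code points (4 emojis + 3 joiners)
print(len(family.encode("utf-32-be")))        # 28 bytes in UTF-32
```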
This topic is way more complex, but let’s stop here. Thanks for listening, bye!