
My thoughts and favorite points of someone else’s writing from the web:
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)
By: Niki Prokopov
I first read this post a few years ago and it immediately struck me as something that I should know more about. Niki is clearly extremely knowledgeable about Unicode and you should read his whole article.
But this is one of those topics that I have to keep coming back to lest it slip out of my brain. So for the re-reads, or those in a hurry, I have picked out just the “absolute minimum of the absolute minimum” to remember. Please don’t be mad at me Niki 😅.
Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.
In practice, Unicode is a table that assigns unique numbers to different characters.
For example:
- The Latin letter A is assigned the number 65.
- The Arabic Letter Seen س is 1587.
- The Katakana Letter Tu ツ is 12484.
- The Musical Symbol G Clef 𝄞 is 119070.
- 💩 is 128169.
Unicode refers to these numbers as code points.
Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.
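These assignments are easy to check from code. A quick sketch in Python, using the built-in `ord()`:

```python
# ord() returns the Unicode code point of a single character
for ch in "Aسツ𝄞💩":
    print(ch, ord(ch))
# A 65, س 1587, ツ 12484, 𝄞 119070, 💩 128169
```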
What does U+1F4A9 mean?
It’s a convention for how to write code point values. The prefix U+ means, well, Unicode, and 1F4A9 is a code point number in hexadecimal. Oh, and U+1F4A9 specifically is 💩.
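The hex convention maps directly onto Python’s number and character conversions; a small sketch:

```python
# U+1F4A9: the hex number 1F4A9 is 128169 in decimal
print(int("1F4A9", 16))   # 128169
print(chr(0x1F4A9))       # 💩 — code point number back to character
print(hex(ord("💩")))     # 0x1f4a9 — character back to hex code point
```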
What’s UTF-8 then?
UTF-8 is an encoding. Encoding is what you’ll actually deal with as a programmer. Encoding is how we store code points in memory and on disk; how we copy strings, send them over the network, etc.
The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers. So U+1F4A9 becomes 00 01 F4 A9, taking up four bytes. Any other code point in UTF-32 will also occupy four bytes. Since the highest defined code point is U+10FFFF, any code point is guaranteed to fit.
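Python can show the UTF-32 view directly; a sketch using the big-endian codec so the bytes read in the same order as above:

```python
# Every code point becomes exactly four bytes in UTF-32
print("💩".encode("utf-32-be").hex(" "))  # 00 01 f4 a9
print("A".encode("utf-32-be").hex(" "))   # 00 00 00 41
```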
UTF-16 and UTF-8 are less straightforward, but the ultimate goal is the same: to take a code point and encode it as bytes.
How many bytes are in UTF-8?
UTF-8 is a variable-length encoding. A code point might be encoded as a sequence of as few as one or as many as four bytes.
UTF-8 is byte-compatible with ASCII. The code points 0..127, the former ASCII, are encoded with one byte, and it’s the same exact byte. U+0041 (A, Latin Capital Letter A) is just 41, one byte.
Any pure ASCII text is also a valid UTF-8 text, and any UTF-8 text that only uses codepoints 0..127 can be read as ASCII directly.
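This round-trip property is easy to demonstrate:

```python
s = "hello"
# ASCII and UTF-8 produce the exact same bytes for code points 0..127
assert s.encode("ascii") == s.encode("utf-8")
print("A".encode("utf-8").hex())  # 41 — one byte, identical to ASCII
```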
Second, UTF-8 is space-efficient for basic Latin. That was one of its main selling points over UTF-16. It might not be fair for texts all over the world, but for technical strings like HTML tags or JSON keys, it makes sense. The trade-off of variable length is that:
- You CAN’T determine the length of the string by counting bytes.
- You CAN’T randomly jump into the middle of the string and start reading.
- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of a character.
Those who do will eventually meet this bad boy: �
� is the Replacement Character
U+FFFD, the Replacement Character, is simply another code point in the Unicode table. Apps and libraries can use it when they detect Unicode errors. If you cut half of a code point off, there’s not much left to do with the other half, except displaying an error. That’s when � is used.
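Here is a sketch of how that error shows up in Python when UTF-8 bytes are cut mid-character:

```python
data = "💩".encode("utf-8")  # b'\xf0\x9f\x92\xa9' — four bytes
broken = data[:2]            # cut in the middle of the code point
# Strict decoding would raise UnicodeDecodeError;
# lenient decoding substitutes U+FFFD (�) for the damaged sequence
print(broken.decode("utf-8", errors="replace"))
```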
A grapheme is what the user thinks of as a single character
You don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short. A grapheme is a minimally distinctive unit of writing in the context of a particular writing system. ö is one grapheme. é is one too. And 각.
For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!
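The two-code-point composition can be seen directly, and the standard library’s `unicodedata.normalize` can fuse it back into one:

```python
import unicodedata

s = "e\u0301"      # 'e' + combining acute accent
print(s, len(s))   # é 2 — two code points, one grapheme
composed = unicodedata.normalize("NFC", s)
print(composed, len(composed))  # é 1 — NFC fuses it into U+00E9
```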
- ☹️ is U+2639 + U+FE0F
- 👨🏭 is U+1F468 + U+200D + U+1F3ED
- 🚵🏻♀️ is U+1F6B5 + U+1F3FB + U+200D + U+2640 + U+FE0F
- y̖̠͍̘͇͗̏̽̎͞ is U+0079 + U+0316 + U+0320 + U+034D + U+0318 + U+0347 + U+0357 + U+030F + U+033D + U+030E + U+035E
What’s “🤦🏼♂️”.length?
Different programming languages will happily give you different answers.
Python 3:
>>> len("🤦🏼♂️")
5

JavaScript / Java / C#:
>> "🤦🏼♂️".length
7

Rust:
println!("{}", "🤦🏼♂️".len()); // => 17
As you can guess, different languages use different internal string representations (UTF-32, UTF-16, UTF-8) and report length in whatever units they store characters in (ints, shorts, bytes).
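All three answers can be reproduced from one language by encoding to each representation; a sketch in Python, spelling the emoji out by code point so no ZWJ gets lost in copy-paste:

```python
s = "\U0001F926\U0001F3FB\u200D\u2642\uFE0F"  # 🤦🏼‍♂️
print(len(s))                           # 5  — code points (Python 3)
print(len(s.encode("utf-16-le")) // 2)  # 7  — UTF-16 code units (JS/Java/C#)
print(len(s.encode("utf-8")))           # 17 — UTF-8 bytes (Rust)
```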
BUT! If you ask any normal person, one that isn’t burdened with computer internals, they’ll give you a straight answer: 1. The length of 🤦🏼♂️ string is 1.
That’s what extended grapheme clusters are all about: what humans perceive as a single character. And in this case, 🤦🏼♂️ is undoubtedly a single character.
The fact that 🤦🏼♂️ consists of 5 code points (U+1F926 U+1F3FB U+200D U+2642 U+FE0F
) is mere implementation detail. It should not be broken apart, it should not be counted as multiple characters, the text cursor should not be positioned inside it, it shouldn’t be partially selected, etc.
For all intents and purposes, this is an atomic unit of text. Internally, it could be encoded however, but for user-facing APIs, it should be treated as a whole.
Term Comparison
Views of 💩
💩 can be represented by a single Unicode code point, U+1F4A9:
| Concept | Value | Explanation |
|---|---|---|
| Grapheme Cluster | 💩 | A single grapheme cluster (no combining marks or ZWJ in this case) |
| Name | PILE OF POO | Unicode standard name |
| Integer code point | 128169 | |
| Hex code point | 1F4A9 | |
| Unicode Code Point | U+1F4A9 | The official Unicode code point |
| UTF-8 Encoding | F0 9F 92 A9 | 4 bytes. UTF-8 encodes code points using 1–4 bytes |
| UTF-16 Encoding | D83D DCA9 | 4 bytes using surrogate pairs (two 16-bit code units) |
| UTF-32 Encoding | 00 01 F4 A9 | 4 bytes (direct 1-to-1 mapping with code point) |
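The name and the three encoding rows of the table can be verified in a few lines:

```python
import unicodedata

s = "💩"
print(unicodedata.name(s))             # PILE OF POO
print(s.encode("utf-8").hex(" "))      # f0 9f 92 a9
print(s.encode("utf-16-be").hex(" "))  # d8 3d dc a9
print(s.encode("utf-32-be").hex(" "))  # 00 01 f4 a9
```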
Views of 👨❤️👨
👨❤️👨 is a more complex emoji that is actually composed of three emoji characters with two ZWJs (Zero Width Joiners) joining them:
- 👨 = U+1F468 (MAN)
- U+200D (ZWJ)
- ❤️ = U+2764 U+FE0F (HEAVY BLACK HEART + VARIATION SELECTOR-16 for emoji style)
- U+200D (ZWJ)
- 👨 = U+1F468 (MAN)
| Concept | Value | Explanation |
|---|---|---|
| Grapheme Cluster | 👨❤️👨 | Perceived as one emoji, but composed of multiple code points |
| Name | COUPLE WITH HEART: MAN, MAN | Unicode name is composite and informal (only the individual code points have formal Unicode names) |
| Unicode Code Points | U+1F468 U+200D U+2764 U+FE0F U+200D U+1F468 | Emoji + ZWJs + variation selector |
| UTF-8 Encoding | F0 9F 91 A8 E2 80 8D E2 9D A4 EF B8 8F E2 80 8D F0 9F 91 A8 | 20 bytes total |
| UTF-16 Encoding | D83D DC68 200D 2764 FE0F 200D D83D DC68 | 8 code units (16 bytes, with surrogate pairs for the two MAN code points) |
| UTF-32 Encoding | 0001F468 0000200D 00002764 0000FE0F 0000200D 0001F468 | 24 bytes (6 code points × 4 bytes) |
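The byte counts for this sequence can likewise be checked, again spelling the emoji out by code point:

```python
s = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"  # 👨‍❤️‍👨
print(len(s))                      # 6  code points
print(len(s.encode("utf-8")))      # 20 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 16 bytes (8 code units) in UTF-16
print(len(s.encode("utf-32-le")))  # 24 bytes in UTF-32
```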
How do I detect extended grapheme clusters then?
Answer: you need Unicode tables — in practice, a Unicode library. Unfortunately, most languages choose the easy way out and let you iterate through strings in 1-2-4-byte chunks, but not by grapheme clusters. It makes no sense and has no semantics, but since it’s the default, programmers don’t think twice, and we see corrupted strings as the result:

>> '👨❤️👨'.substring(3, 6)
'❤️'
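Python’s code-point slicing has the same failure mode; a sketch with slice indices chosen to land inside the cluster:

```python
s = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"  # 👨‍❤️‍👨
# Slicing by code point rips the grapheme cluster apart
print(s[2:4])  # '❤️' — just the heart, the couple is gone
```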
I live in the US/UK, should I even care?
Yes. Even pure English text uses lots of “typographical signs” that aren’t available in ASCII, like:
- quotation marks “ ” ‘ ’
- apostrophe ’
- dashes – —
- different variations of spaces (figure, hair, non-breaking)
- bullets • ■ ☞
- currency symbols other than the $ (kind of tells you who invented computers, doesn’t it?): € ¢ £
- mathematical signs: plus + and equals = are part of ASCII, but minus − and multiply × are not ¯\_(ツ)_/¯
- various other signs © ™ ¶ † §

Hell, you can’t even spell café, piñata, or naïve without Unicode. So yes, we are all in it together, even Americans.
Conclusion
To sum it up:
- Unicode has won.
- UTF-8 is the most popular encoding for data in transfer and at rest.
- UTF-16 is still sometimes used as an in-memory representation (e.g. in JavaScript)
- The two most important views for strings are:
- bytes (allocate memory/copy/encode/decode)
- extended grapheme clusters (all semantic operations)
- Using code points for iterating over a string is wrong. They are not the basic unit of writing. One grapheme could consist of multiple code points.
- To detect grapheme boundaries, you need Unicode tables.
- Use a Unicode library for everything Unicode, even boring stuff like strlen, indexOf and substring.
- Unicode updates every year, and rules sometimes change.
- Unicode strings need to be normalized before they can be compared.
- Unicode depends on locale for some operations and for rendering.
- All this is important even for pure English text.
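The normalization bullet above deserves one concrete sketch, since it bites even near-English text:

```python
import unicodedata

a = "caf\u00e9"    # café with precomposed é (U+00E9)
b = "cafe\u0301"   # café as e + combining acute accent
print(a == b)      # False — different code points, same rendered text
# Normalize both to NFC (or NFD) before comparing
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```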
Overall, yes, Unicode is not perfect, but the fact that
- an encoding exists that covers all possible languages at once
- the entire world agrees to use it
- we can completely forget about encodings and conversions and all that stuff
is a miracle.