
My thoughts and favorite points of someone else’s writing from the web:
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)
By: Niki Prokopov
I first read this post a few years ago and it immediately struck me as something that I should know more about. Niki is clearly extremely knowledgeable about Unicode and you should read his whole article.
But this is one of those topics that I have to keep coming back to lest it slip out of my brain. So for the re-reads, or those in a hurry, I have picked out just the “absolute minimum of the absolute minimum” to remember. Please don’t be mad at me Niki 😅.
Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.
In practice, Unicode is a table that assigns unique numbers to different characters.
For example:
- The Latin letter A is assigned the number 65.
- The Arabic Letter Seen س is 1587.
- The Katakana Letter Tu ツ is 12484.
- The Musical Symbol G Clef 𝄞 is 119070.
- 💩 is 128169.
Unicode refers to these numbers as code points.
Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.
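These assignments are easy to check from code. A quick sketch in Python, using the built-in `ord()`:

```python
# ord() returns the Unicode code point of a single character
for ch in "Aسツ𝄞💩":
    print(ch, ord(ch))
# A 65, س 1587, ツ 12484, 𝄞 119070, 💩 128169
```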
What does U+1F4A9 mean?
It’s a convention for how to write code point values. The prefix U+ means, well, Unicode, and 1F4A9 is a code point number in hexadecimal. Oh, and U+1F4A9 specifically is 💩.
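The hex convention maps directly onto Python’s number and character conversions; a small sketch:

```python
# U+1F4A9: the hex number 1F4A9 is 128169 in decimal
print(int("1F4A9", 16))   # 128169
print(chr(0x1F4A9))       # 💩 — code point number back to character
print(hex(ord("💩")))     # 0x1f4a9 — character back to hex code point
```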
What’s UTF-8 then?
UTF-8 is an encoding. Encoding is what you’ll actually deal with as a programmer. Encoding is how we store code points in memory and on disk; how we copy strings, send them over the network, etc.
The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers. So U+1F4A9 becomes 00 01 F4 A9, taking up four bytes. Any other code point in UTF-32 will also occupy four bytes. Since the highest defined code point is U+10FFFF, any code point is guaranteed to fit.
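Python can show the UTF-32 view directly; a sketch using the big-endian codec so the bytes read in the same order as above:

```python
# Every code point becomes exactly four bytes in UTF-32
print("💩".encode("utf-32-be").hex(" "))  # 00 01 f4 a9
print("A".encode("utf-32-be").hex(" "))   # 00 00 00 41
```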
UTF-16 and UTF-8 are less straightforward, but the ultimate goal is the same: to take a code point and encode it as bytes.
How many bytes are in UTF-8?
UTF-8 is a variable-length encoding. A code point might be encoded as a sequence of as few as one or as many as four bytes.
UTF-8 is byte-compatible with ASCII. The code points 0..127, the former ASCII, are encoded with one byte, and it’s the same exact byte. U+0041 (A, Latin Capital Letter A) is just 41, one byte.
Any pure ASCII text is also a valid UTF-8 text, and any UTF-8 text that only uses codepoints 0..127 can be read as ASCII directly.
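This round-trip property is easy to demonstrate:

```python
s = "hello"
# ASCII and UTF-8 produce the exact same bytes for code points 0..127
assert s.encode("ascii") == s.encode("utf-8")
print("A".encode("utf-8").hex())  # 41 — one byte, identical to ASCII
```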
Second, UTF-8 is space-efficient for basic Latin. That was one of its main selling points over UTF-16. It might not be fair for texts all over the world, but for technical strings like HTML tags or JSON keys, it makes sense. The trade-off of variable length is that:
- You CAN’T determine the length of the string by counting bytes.
- You CAN’T randomly jump into the middle of the string and start reading.
- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of a character.
Those who do will eventually meet this bad boy: �
� is the Replacement Character
U+FFFD, the Replacement Character, is simply another code point in the Unicode table. Apps and libraries can use it when they detect Unicode errors. If you cut half of a code point off, there’s not much left to do with the other half, except displaying an error. That’s when � is used.
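Here is a sketch of how that error shows up in Python when UTF-8 bytes are cut mid-character:

```python
data = "💩".encode("utf-8")  # b'\xf0\x9f\x92\xa9' — four bytes
broken = data[:2]            # cut in the middle of the code point
# Strict decoding would raise UnicodeDecodeError;
# lenient decoding substitutes U+FFFD (�) for the damaged sequence
print(broken.decode("utf-8", errors="replace"))
```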
A grapheme is what the user thinks of as a single character
You don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short. A grapheme is a minimally distinctive unit of writing in the context of a particular writing system. ö is one grapheme. é is one too. And 각.
For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!
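The two-code-point composition can be seen directly, and the standard library’s `unicodedata.normalize` can fuse it back into one:

```python
import unicodedata

s = "e\u0301"      # 'e' + combining acute accent
print(s, len(s))   # é 2 — two code points, one grapheme
composed = unicodedata.normalize("NFC", s)
print(composed, len(composed))  # é 1 — NFC fuses it into U+00E9
```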
- ☹️ is U+2639 + U+FE0F
- 👨🏭 is U+1F468 + U+200D + U+1F3ED
- 🚵🏻♀️ is U+1F6B5 + U+1F3FB + U+200D + U+2640 + U+FE0F
- y̖̠͍̘͇͗̏̽̎͞ is U+0079 + U+0316 + U+0320 + U+034D + U+0318 + U+0347 + U+0357 + U+030F + U+033D + U+030E + U+035E
What’s “🤦🏼♂️”.length?
Different programming languages will happily give you different answers.
Python 3:
>>> len("🤦🏼♂️")
5

JavaScript / Java / C#:
>> "🤦🏼♂️".length
7

Rust:
println!("{}", "🤦🏼♂️".len()); // => 17
As you can guess, different languages use different internal string representations (UTF-32, UTF-16, UTF-8) and report length in whatever units they store characters in (ints, shorts, bytes).
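All three answers can be reproduced from one language by encoding to each representation; a sketch in Python, spelling the emoji out by code point so no ZWJ gets lost in copy-paste:

```python
s = "\U0001F926\U0001F3FB\u200D\u2642\uFE0F"  # 🤦🏼‍♂️
print(len(s))                           # 5  — code points (Python 3)
print(len(s.encode("utf-16-le")) // 2)  # 7  — UTF-16 code units (JS/Java/C#)
print(len(s.encode("utf-8")))           # 17 — UTF-8 bytes (Rust)
```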
BUT! If you ask any normal person, one that isn’t burdened with computer internals, they’ll give you a straight answer: 1. The length of 🤦🏼♂️ string is 1.
That’s what extended grapheme clusters are all about: what humans perceive as a single character. And in this case, 🤦🏼♂️ is undoubtedly a single character.
The fact that 🤦🏼♂️ consists of 5 code points (U+1F926 U+1F3FB U+200D U+2642 U+FE0F
) is mere implementation detail. It should not be broken apart, it should not be counted as multiple characters, the text cursor should not be positioned inside it, it shouldn’t be partially selected, etc.
For all intents and purposes, this is an atomic unit of text. Internally, it could be encoded however, but for user-facing APIs, it should be treated as a whole.
Term Comparison
Views of 💩
💩 can be represented by a single Unicode code point, U+1F4A9:
| Concept | Value | Explanation |
|---|---|---|
| Grapheme Cluster | 💩 | A single grapheme cluster (no combining marks or ZWJ in this case) |
| Name | PILE OF POO | Unicode standard name |
| Integer code point | 128169 | |
| Hex code point | 1F4A9 | |
| Unicode Code Point | U+1F4A9 | The official Unicode code point |
| UTF-8 Encoding | F0 9F 92 A9 | 4 bytes. UTF-8 encodes code points using 1–4 bytes |
| UTF-16 Encoding | D83D DCA9 | 4 bytes using surrogate pairs (two 16-bit code units) |
| UTF-32 Encoding | 00 01 F4 A9 | 4 bytes (direct 1-to-1 mapping with code point) |
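The name and the three encoding rows of the table can be verified in a few lines:

```python
import unicodedata

s = "💩"
print(unicodedata.name(s))             # PILE OF POO
print(s.encode("utf-8").hex(" "))      # f0 9f 92 a9
print(s.encode("utf-16-be").hex(" "))  # d8 3d dc a9
print(s.encode("utf-32-be").hex(" "))  # 00 01 f4 a9
```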
Views of 👨❤️👨
👨❤️👨 is a more complex emoji that is actually composed of three emoji characters with two ZWJs (Zero Width Joiners) joining them:
- 👨 = U+1F468 (MAN)
- U+200D (ZWJ)
- ❤️ = U+2764 U+FE0F (HEAVY BLACK HEART + VARIATION SELECTOR-16 for emoji style)
- U+200D (ZWJ)
- 👨 = U+1F468 (MAN)
| Concept | Value | Explanation |
|---|---|---|
| Grapheme Cluster | 👨❤️👨 | Perceived as one emoji, but composed of multiple code points |
| Name | COUPLE WITH HEART: MAN, MAN | Unicode name is composite and informal (only the individual code points have formal Unicode names) |
| Unicode Code Points | U+1F468 U+200D U+2764 U+FE0F U+200D U+1F468 | Emoji + ZWJs + variation selector |
| UTF-8 Encoding | F0 9F 91 A8 E2 80 8D E2 9D A4 EF B8 8F E2 80 8D F0 9F 91 A8 | 20 bytes total |
| UTF-16 Encoding | D83D DC68 200D 2764 FE0F 200D D83D DC68 | 8 code units (16 bytes, with surrogate pairs for the two MAN code points) |
| UTF-32 Encoding | 0001F468 0000200D 00002764 0000FE0F 0000200D 0001F468 | 24 bytes (6 code points × 4 bytes) |
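The byte counts for this sequence can likewise be checked, again spelling the emoji out by code point:

```python
s = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"  # 👨‍❤️‍👨
print(len(s))                      # 6  code points
print(len(s.encode("utf-8")))      # 20 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 16 bytes (8 code units) in UTF-16
print(len(s.encode("utf-32-le")))  # 24 bytes in UTF-32
```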
How do I detect extended grapheme clusters then?
Answer: you need Unicode tables — in practice, a Unicode library. Unfortunately, most languages choose the easy way out and let you iterate through strings in 1-2-4-byte chunks, but not by grapheme clusters. It makes no sense and has no semantics, but since it’s the default, programmers don’t think twice, and we see corrupted strings as the result:

>> '👨❤️👨'.substring(3, 6)
'❤️'
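Python’s code-point slicing has the same failure mode; a sketch with slice indices chosen to land inside the cluster:

```python
s = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"  # 👨‍❤️‍👨
# Slicing by code point rips the grapheme cluster apart
print(s[2:4])  # '❤️' — just the heart, the couple is gone
```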
I live in the US/UK, should I even care?
Yes. Even pure English text uses lots of “typographical signs” that aren’t available in ASCII, like:
- quotation marks “ ” ‘ ’
- apostrophe ’
- dashes – —
- different variations of spaces (figure, hair, non-breaking)
- bullets • ■ ☞
- currency symbols other than the $ (kind of tells you who invented computers, doesn’t it?): € ¢ £
- mathematical signs: plus + and equals = are part of ASCII, but minus − and multiply × are not ¯\_(ツ)_/¯
- various other signs © ™ ¶ † §

Hell, you can’t even spell café, piñata, or naïve without Unicode. So yes, we are all in it together, even Americans.
Conclusion
To sum it up:
- Unicode has won.
- UTF-8 is the most popular encoding for data in transfer and at rest.
- UTF-16 is still sometimes used as an in-memory representation (e.g. in JavaScript)
- The two most important views for strings are:
- bytes (allocate memory/copy/encode/decode)
- extended grapheme clusters (all semantic operations)
- Using code points for iterating over a string is wrong. They are not the basic unit of writing. One grapheme could consist of multiple code points.
- To detect grapheme boundaries, you need Unicode tables.
- Use a Unicode library for everything Unicode, even boring stuff like strlen, indexOf and substring.
- Unicode updates every year, and rules sometimes change.
- Unicode strings need to be normalized before they can be compared.
- Unicode depends on locale for some operations and for rendering.
- All this is important even for pure English text.
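The normalization bullet above deserves one concrete sketch, since it bites even near-English text:

```python
import unicodedata

a = "caf\u00e9"    # café with precomposed é (U+00E9)
b = "cafe\u0301"   # café as e + combining acute accent
print(a == b)      # False — different code points, same rendered text
# Normalize both to NFC (or NFD) before comparing
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```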
Overall, yes, Unicode is not perfect, but the fact that
- an encoding exists that covers all possible languages at once
- the entire world agrees to use it
- we can completely forget about encodings and conversions and all that stuff
is a miracle.