At first I didn't understand exactly what was being analyzed in this topic, but after taking a look I more or less get it. If it is not a distraction, and if you do not mind, I would like to write this post for anyone who might be interested in the matter, mostly other visitors like me.
.
A code point is an abstract numeric value whose meaning (a character, a part of one, or a behavior) is determined by a coding standard.
It means that code points are not characters, and treating them as such is a huge error that other languages have made when trying to support Unicode.
In the Unicode standard, a code point is written in hexadecimal as U+ followed by four to six digits. When it is padded to six digits (U+HHHHHH), the first two positions determine a classification range called a plane, which has 17 possible values, from 0 to 16 (hexadecimal 00 to 10).
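As a rough illustration of the plane idea, here is a minimal Rust sketch. The function name `plane` is my own, not something from the standard library; it just takes the bits above the low 16 of the code point.

```rust
/// Plane of a code point: everything above the low 16 bits.
/// `plane` is a hypothetical helper name used only for this example.
fn plane(c: char) -> u32 {
    (c as u32) >> 16
}

fn main() {
    // 'a', hot coffee, crab, and the last valid code point.
    for c in ['a', '\u{2615}', '\u{1F980}', '\u{10FFFF}'] {
        println!("U+{:06X} is in plane {}", c as u32, plane(c));
    }
}
```

Since a Rust `char` can never be above U+10FFFF, the result always stays in the range 0 to 16.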
The code unit is the unit of storage. In Unicode it can be 8 bits (UTF-8), 16 bits (UTF-16, BE or LE) or 32 bits (UTF-32, BE or LE). A code unit may represent a full code point, or only an incomplete part of it.
The crab glyph [ 🦀 ] in Unicode has one code point: U+1F980
in UTF-8 it is stored in 4 code units: F0 9F A6 80
in UTF-16 BE it is stored in 2 code units: D8 3E DD 80
in UTF-32 BE it is stored in 1 code unit: 00 01 F9 80
The hot coffee glyph [ ☕ ] in Unicode has one code point: U+2615
in UTF-8 it is stored in 3 code units: E2 98 95
in UTF-16 BE it is stored in 1 code unit: 26 15
in UTF-32 BE it is stored in 1 code unit: 00 00 26 15
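The numbers above can be checked with a small Rust sketch. It only uses `char::encode_utf8` and `char::encode_utf16` from the standard library; the helper name `code_units` is my own. The values are code units, so byte order (BE or LE) does not appear here.

```rust
/// Returns the UTF-8, UTF-16 and UTF-32 code units of one code point.
/// `code_units` is a hypothetical helper written for this example.
fn code_units(c: char) -> (Vec<u8>, Vec<u16>, Vec<u32>) {
    let mut buf8 = [0u8; 4]; // a code point needs at most 4 UTF-8 code units
    let mut buf16 = [0u16; 2]; // and at most 2 UTF-16 code units
    let utf8 = c.encode_utf8(&mut buf8).as_bytes().to_vec();
    let utf16 = c.encode_utf16(&mut buf16).to_vec();
    let utf32 = vec![c as u32]; // always exactly one 32-bit code unit
    (utf8, utf16, utf32)
}

fn main() {
    for c in ['\u{1F980}', '\u{2615}'] {
        let (u8s, u16s, u32s) = code_units(c);
        println!("U+{:04X}", c as u32);
        println!("  UTF-8  ({} units): {:02X?}", u8s.len(), u8s);
        println!("  UTF-16 ({} units): {:04X?}", u16s.len(), u16s);
        println!("  UTF-32 ({} units): {:08X?}", u32s.len(), u32s);
    }
}
```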
.
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system.
The glyph [ å ] in Unicode is a grapheme with one code point: U+00E5
in UTF-8 it is stored in 2 code units: C3 A5
in UTF-16 BE it is stored in 1 code unit: 00 E5
in UTF-32 BE it is stored in 1 code unit: 00 00 00 E5
The glyph [ å ] in Unicode can also be a grapheme with two code points: U+0061 U+030A
[ a ] U+0061
UTF-8 61
UTF-16BE 00 61
UTF-32BE 00 00 00 61
[ ̊ ] U+030A 'Combining Ring Above'
UTF-8 CC 8A
UTF-16BE 03 0A
UTF-32BE 00 00 03 0A
in UTF-8 it is stored in 3 code units: 61 CC 8A
in UTF-16 BE it is stored in 2 code units: 00 61 03 0A
in UTF-32 BE it is stored in 2 code units: 00 00 00 61 00 00 03 0A
So a single Unicode grapheme, besides possibly being composed of more than one code point, can also take more than one code unit, even when UTF-32 is used.
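To make the two spellings of [ å ] concrete, here is a small Rust sketch that counts code points and code units for each. The helper name `counts` is my own. Note that the standard library has no grapheme iterator, so the "one grapheme" part is not something this code can check; it only shows that the counts differ.

```rust
/// Returns (code points, UTF-8 code units, UTF-16 code units) of a string.
/// `counts` is a hypothetical helper written for this example.
fn counts(s: &str) -> (usize, usize, usize) {
    (s.chars().count(), s.len(), s.encode_utf16().count())
}

fn main() {
    let precomposed = "\u{00E5}"; // å as one code point
    let decomposed = "a\u{030A}"; // a + combining ring above

    println!("precomposed: {:?}", counts(precomposed));
    println!("decomposed:  {:?}", counts(decomposed));
}
```

With UTF-32 the number of code units is simply the number of code points, so the decomposed form still needs two of them.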
The glyph [ ậ ] as one code point: U+1EAD
UTF-8: E1 BA AD (3 bytes); UTF-32: 00001EAD (4 bytes)
The glyph [ ậ ] as three code points: U+0061 + U+0302 + U+0323
UTF-8: 61 CC 82 CC A3 (5 bytes); UTF-32: 00000061 00000302 00000323 (12 bytes)
The glyph [ ậ ] as three code points: U+0061 + U+0323 + U+0302
UTF-8: 61 CC A3 CC 82 (5 bytes); UTF-32: 00000061 00000323 00000302 (12 bytes)
[ a ] U+0061
[ ̂ ] U+0302 'Combining Circumflex Accent'
[ ̣ ] U+0323 'Combining Dot Below'
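The three spellings of [ ậ ] can be compared with a short Rust sketch. The names `spellings` and `utf32_bytes` are my own helpers. The point is only that the three strings have different sizes and compare as different under a plain `==`, even though a reader sees the same thing.

```rust
/// The three spellings of the glyph listed above, written with escapes
/// so the difference is visible in the source.
fn spellings() -> [&'static str; 3] {
    ["\u{1EAD}", "a\u{0302}\u{0323}", "a\u{0323}\u{0302}"]
}

/// Size in bytes if every code point takes one 32-bit code unit.
fn utf32_bytes(s: &str) -> usize {
    s.chars().count() * 4
}

fn main() {
    for s in spellings() {
        println!("{} -> UTF-8: {} bytes, UTF-32: {} bytes", s, s.len(), utf32_bytes(s));
    }
    let [a, b, c] = spellings();
    // A plain comparison works on the stored code units, not on what is displayed.
    println!("a == b: {}, b == c: {}, a == c: {}", a == b, b == c, a == c);
}
```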
.
There are also code points that have no visible shape of their own and only define behaviors, such as No-break space U+00A0, Soft hyphen U+00AD, Zero width space U+200B, Zero width joiner U+200D, Word joiner U+2060, Left-To-Right embedding U+202A, Right-To-Left embedding U+202B, Left-To-Right override U+202D, Right-To-Left override U+202E, etc.
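Here is a small Rust sketch of how such code points still count as stored data even though nothing is drawn for them. The helpers `is_listed_control` and `hidden_count` are my own, and the list they check is just the one from this post, not an official Unicode category.

```rust
/// Hypothetical helper: true only for the behavior code points listed in this post.
fn is_listed_control(c: char) -> bool {
    matches!(
        c,
        '\u{00A0}' | '\u{00AD}' | '\u{200B}' | '\u{200D}' | '\u{2060}'
            | '\u{202A}' | '\u{202B}' | '\u{202D}' | '\u{202E}'
    )
}

/// Hypothetical helper: how many of the listed code points a string contains.
fn hidden_count(s: &str) -> usize {
    s.chars().filter(|&c| is_listed_control(c)).count()
}

fn main() {
    let plain = "ab";
    let with_zwsp = "a\u{200B}b"; // zero width space between 'a' and 'b'

    println!("code points: {} vs {}", plain.chars().count(), with_zwsp.chars().count());
    println!("UTF-8 bytes: {} vs {}", plain.len(), with_zwsp.len());
    println!("listed code points inside: {}", hidden_count(with_zwsp));
    println!("equal: {}", plain == with_zwsp);
}
```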
.
A glyph is an image representing a grapheme or part of one; its shape is determined by the selected font and by how the code points are interpreted.
.
assert!( "Café" == "Café" ); // look-alike example (same appearance, different code points), what would be expected?
'C' + 'a' + 'f' + 'e' + '́ ' << Unicode UTF-8 = 43 61 66 65 CC 81
'C' + 'a' + 'f' + 'é' << Unicode UTF-8 = 43 61 66 C3 A9
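For what it is worth, this is what a plain comparison does in Rust, as a minimal sketch. I write both strings with escapes so it is clear which is which; `utf8_bytes` is my own helper. Rust's `==` on `&str` compares the stored bytes, so the two spellings are not equal unless something normalizes them first, which the standard library does not do.

```rust
/// Hypothetical helper: the UTF-8 code units of a string, copied into a Vec.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    let decomposed = "Cafe\u{0301}"; // 'e' followed by combining acute accent
    let precomposed = "Caf\u{00E9}"; // single code point é

    println!("{} = {:02X?}", decomposed, utf8_bytes(decomposed));
    println!("{} = {:02X?}", precomposed, utf8_bytes(precomposed));
    println!("equal: {}", decomposed == precomposed);
}
```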