UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000 to U+007F, corresponding to the ASCII (and ISO 646:1991) values 00hex to 7Fhex, are represented by single octets of the same value. It is straightforward both to generate and to parse, and it produces reasonable compaction.
Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. In Table B.1, the length of the binary data representation of the character to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.
| Data type and length | Unicode address (binary format) | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|---|
| Up to 7 bits, encoded as 7-bit ASCII or ISO 646 | 000000000xxxxxxx | 0xxxxxxx | | | |
| 8 to 11 bits | 00000yyyyyxxxxxx | 110yyyyy | 10xxxxxx | | |
| 12 to 16 bits (BMP) | zzzzyyyyyyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 17 to 21 bits, Code Planes 1 to 16 | 000uuuuuzzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
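The bit layout of Table B.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the specification: the function name `utf8_encode` is hypothetical, and the sketch applies only the bit patterns of Table B.1 (it does not, for example, treat any address range within Code Planes 0 through 16 specially).

```python
def utf8_encode(address: int) -> bytes:
    """Encode one Unicode address using the bit patterns of Table B.1.

    Hypothetical helper for illustration; only the range check and the
    bit layout are implemented.
    """
    if not 0 <= address <= 0x10FFFF:
        raise ValueError("address outside Code Planes 0 through 16")

    if address <= 0x7F:
        # Up to 7 bits: 0xxxxxxx
        return bytes([address])
    if address <= 0x7FF:
        # 8 to 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (address >> 6),
                      0x80 | (address & 0x3F)])
    if address <= 0xFFFF:
        # Up to 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (address >> 12),
                      0x80 | ((address >> 6) & 0x3F),
                      0x80 | (address & 0x3F)])
    # Up to 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (address >> 18),
                  0x80 | ((address >> 12) & 0x3F),
                  0x80 | ((address >> 6) & 0x3F),
                  0x80 | (address & 0x3F)])
```

For example, `utf8_encode(0x20AC)` yields the three bytes E2 82 AC, since U+20AC needs 14 significant bits and therefore falls in the third row of the table.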
During decoding, the number of bytes in each UTF-8 byte sequence can be immediately determined from the first byte of each sequence.
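As a minimal sketch of this property, the following Python reads the sequence length from the leading bits of the first byte and uses it to split a byte string into individual sequences. The names `utf8_sequence_length` and `split_sequences` are hypothetical, and the sketch checks only the leading bit pattern; it does not verify that the sequence is otherwise legal.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by the first byte's leading bits."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110yyyyy
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110zzzz
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110uuu
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is not used at all.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


def split_sequences(data: bytes) -> list:
    """Split a byte string into per-character UTF-8 byte sequences."""
    sequences = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        sequences.append(data[i:i + n])
        i += n
    return sequences
```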
Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27 as summarized in Table B.2.
| Unicode address range | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| U+0000 to U+007F | 00…7F | | | |
| U+0080 to U+07FF | C2…DF | 80…BF | | |
| U+0800 to U+0FFF | E0 | A0…BF | 80…BF | |
| U+1000 to U+FFFF | E1…EF | 80…BF | 80…BF | |
| U+10000 to U+3FFFF | F0 | 90…BF | 80…BF | 80…BF |
| U+40000 to U+FFFFF | F1…F3 | 80…BF | 80…BF | 80…BF |
| U+100000 to U+10FFFF | F4 | 80…8F | 80…BF | 80…BF |
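A check against the byte ranges of Table B.2 could be sketched as follows. This is a hedged illustration rather than a conformance test: `LEGAL_RANGES` and `is_legal_utf8` are hypothetical names, and the second byte after F4 is limited to 80…8F here, because any larger value would produce an address above U+10FFFF. Like the table, the sketch applies no further restriction within U+1000 to U+FFFF.

```python
# Each entry: (1st byte low, 1st byte high, 2nd byte low, 2nd byte high, length).
# Third and fourth bytes, where present, are always in the range 80..BF.
LEGAL_RANGES = [
    (0x00, 0x7F, None, None, 1),   # U+0000   to U+007F
    (0xC2, 0xDF, 0x80, 0xBF, 2),   # U+0080   to U+07FF
    (0xE0, 0xE0, 0xA0, 0xBF, 3),   # U+0800   to U+0FFF
    (0xE1, 0xEF, 0x80, 0xBF, 3),   # U+1000   to U+FFFF
    (0xF0, 0xF0, 0x90, 0xBF, 4),   # U+10000  to U+3FFFF
    (0xF1, 0xF3, 0x80, 0xBF, 4),   # U+40000  to U+FFFFF
    (0xF4, 0xF4, 0x80, 0x8F, 4),   # U+100000 to U+10FFFF (assumed 80..8F)
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data fits a row of the table."""
    i = 0
    while i < len(data):
        first = data[i]
        for lo, hi, lo2, hi2, length in LEGAL_RANGES:
            if lo <= first <= hi:
                break
        else:
            return False                      # first byte matches no row
        seq = data[i:i + length]
        if len(seq) < length:
            return False                      # sequence is truncated
        if length > 1 and not lo2 <= seq[1] <= hi2:
            return False                      # second byte out of range
        if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
            return False                      # later byte out of range
        i += length
    return True
```

The restricted second-byte ranges are what reject over-long forms: for instance, the bytes E0 80 80 match the three-byte bit pattern but fail the A0…BF requirement on the second byte.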