
About UTF-8

Annex B: UTF-8

UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000 to U+007F, which correspond to the ASCII (and ISO 646:1991) values 00 hex to 7F hex, are represented by single octets of the same value. It is straightforward both to generate and to parse, and it produces reasonable compaction.

Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. In Table B.1, the length of the binary data representation of the characters to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.

Table B.1: UTF-8 byte sequences for Unicode character addresses

Data type and length                             Unicode address (binary format)   1st byte   2nd byte   3rd byte   4th byte
Up to 7 bits, encoded as 7-bit ASCII or ISO 646  000000000xxxxxxx                  0xxxxxxx
8 to 11 bits                                     00000yyyyyxxxxxx                  110yyyyy   10xxxxxx
12 to 16 bits (BMP)                              zzzzyyyyyyxxxxxx                  1110zzzz   10yyyyyy   10xxxxxx
17 to 21 bits, Code Planes 1-16                  000uuuuuzzzzyyyy yyxxxxxx         11110uuu   10uuzzzz   10yyyyyy   10xxxxxx
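The bit layouts of Table B.1 can be turned into a small encoder. The following is a minimal sketch rather than a reference implementation: the function name utf8_encode is hypothetical, and the sketch assumes that only the address range needs checking (it does not reject surrogate addresses, which a production encoder normally would).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode character address using the bit layouts of Table B.1.

    Hypothetical helper for illustration; surrogate addresses are not rejected.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("address outside Code Planes 0 through 16")

    # Length of the binary representation, ignoring leading zero bits.
    bits = code_point.bit_length()

    if bits <= 7:
        return bytes([code_point])                       # 0xxxxxxx
    if bits <= 11:
        return bytes([
            0xC0 | (code_point >> 6),                    # 110yyyyy
            0x80 | (code_point & 0x3F),                  # 10xxxxxx
        ])
    if bits <= 16:
        return bytes([
            0xE0 | (code_point >> 12),                   # 1110zzzz
            0x80 | ((code_point >> 6) & 0x3F),           # 10yyyyyy
            0x80 | (code_point & 0x3F),                  # 10xxxxxx
        ])
    return bytes([
        0xF0 | (code_point >> 18),                       # 11110uuu
        0x80 | ((code_point >> 12) & 0x3F),              # 10uuzzzz
        0x80 | ((code_point >> 6) & 0x3F),               # 10yyyyyy
        0x80 | (code_point & 0x3F),                      # 10xxxxxx
    ])
```

For addresses that are not surrogates, the result should agree with Python's built-in encoder, for example utf8_encode(0x20AC) == chr(0x20AC).encode("utf-8").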

During decoding, the number of bytes in each UTF-8 byte sequence can be immediately determined from the first byte of each sequence.
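As an illustration of this property, the sketch below classifies a first byte by its leading bits only. The function name utf8_sequence_length is hypothetical, and the check is deliberately limited to the bit patterns: it does not test whether the first byte is also legal under the stricter range rules for UTF-8 sequences.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by the leading bits of the first byte.

    Hypothetical helper; it looks at bit patterns only, not at legal byte ranges.
    """
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx: single byte
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110yyyyy: two bytes
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110zzzz: three bytes
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110uuu: four bytes
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is not used by these layouts.
    raise ValueError("byte cannot start a UTF-8 sequence")
```

A decoder can call this on the first byte, then read the remaining bytes of the sequence without scanning ahead.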

Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27, as summarized in Table B.2.

Table B.2: Unicode address ranges for legal UTF-8 byte sequences

Unicode address range   1st byte   2nd byte   3rd byte   4th byte
U+0000 to U+007F        00...7F
U+0080 to U+07FF        C2...DF    80...BF
U+0800 to U+0FFF        E0         A0...BF    80...BF
U+1000 to U+FFFF        E1...EF    80...BF    80...BF
U+10000 to U+3FFFF      F0         90...BF    80...BF    80...BF
U+40000 to U+FFFFF      F1...F3    80...BF    80...BF    80...BF
U+100000 to U+10FFFF    F4         80...8F    80...BF    80...BF
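A range table of this kind lends itself to a table-driven legality check. The sketch below is one possible way to do it, not a definitive validator. The names LEGAL_SEQUENCES and is_legal_utf8 are hypothetical. The single-value first bytes E0, F0 and F4 and the restricted second-byte range 80...8F after F4 are assumptions filled in from the standard legal-sequence table and from the bit layouts of Table B.1. Later versions of the Unicode Standard additionally exclude the surrogate range, which this sketch does not do.

```python
# Each row: (first-byte range, [allowed range for each following byte]).
# Hypothetical table for illustration; E0, F0, F4 and 80..8F are assumptions
# taken from the standard legal-sequence table, surrogates are not excluded.
LEGAL_SEQUENCES = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences allowed by the table."""
    i = 0
    while i < len(data):
        # Find the row whose first-byte range contains the current byte.
        for (first_lo, first_hi), tail in LEGAL_SEQUENCES:
            if first_lo <= data[i] <= first_hi:
                break
        else:
            return False  # byte cannot start a legal sequence

        following = data[i + 1 : i + 1 + len(tail)]
        if len(following) != len(tail):
            return False  # sequence is truncated
        for byte, (lo, hi) in zip(following, tail):
            if not lo <= byte <= hi:
                return False  # following byte outside its allowed range
        i += 1 + len(tail)
    return True
```

The restricted second-byte ranges are what reject overlong forms such as E0 80 80 and addresses above U+10FFFF such as F4 90 80 80.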
