
Differences Between the Encodings in C#

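I recently looked into how the different Encoding classes in C# differ and want to share what I found. If anything here is wrong, corrections are welcome.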

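Put simply, why do we need encoding? A computer has to represent letters such as 'a' and 'b', but how are they stored in memory? Memory holds data in binary, so letters, digits, and symbols must be converted into a binary form the computer can store. That conversion is the purpose of encoding.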

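Encoding a character into binary takes two steps. First, the character set is coded: every character is assigned a fixed number. Then that number is converted into a binary representation in memory.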

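Character encodings in common use fall into two main families: ASCII and Unicode.
1. ASCII
ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. It is a standard single-byte character encoding for text data. Standard ASCII uses 7 bits and defines 128 characters; 8-bit extensions of ASCII use the extra bit to represent up to 256 characters.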

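The biggest drawback of ASCII is that it covers only the letters, digits, and symbols common in American English. It cannot represent characters from other languages, such as Chinese characters.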

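2. Unicode
Unicode is an encoding scheme designed to hold the characters and symbols of all the world's writing systems, which meets cross-language and cross-platform needs. It was developed on the basis of the Universal Character Set (UCS) standard. Because Unicode covers such a wide range of characters, it is far more widely used, and plain ASCII is now rarely chosen on its own. The first 128 Unicode code points are identical to ASCII.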

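Both schemes define how characters are coded: each character is assigned a fixed code point (a number) for later use. For example, the Chinese character "字" has the Unicode code point 23383 (U+5B57).
Given these code points, the next step is to turn them into the binary form actually held in memory.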
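
As a small illustration of the code point mentioned above, the sketch below reads the code point of "字" from a C# string. The class and method names are made up for this example; only char.ConvertToUtf32 comes from the framework.

```csharp
using System;

static class CodePointDemo
{
    // Returns the Unicode code point of the character that starts at index 0.
    public static int FirstCodePoint(string text) => char.ConvertToUtf32(text, 0);

    static void Main()
    {
        // "\u5B57" is the character "字", written as an escape so that the
        // example does not depend on the encoding of the source file.
        int codePoint = FirstCodePoint("\u5B57");
        Console.WriteLine(codePoint);                       // 23383
        Console.WriteLine("U+" + codePoint.ToString("X4")); // U+5B57
    }
}
```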

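1. Encoding ASCII is simple. Every ASCII code fits in a single byte (at most 127 for standard ASCII, 255 for the 8-bit extensions), so it can be stored in memory as one byte with no special handling.
2. Encoding Unicode is more complicated, because Unicode has to represent the letters and symbols of every language.
The rest of this article covers how Unicode is encoded.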

The .NET Framework provides the following five encoding classes for Unicode text:
ASCIIEncoding
UTF7Encoding
UTF8Encoding
UnicodeEncoding
UTF32Encoding
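
To get a rough feel for how these classes differ, the sketch below counts the bytes each one produces for the same two-character string. It uses the shared instances exposed as static properties of Encoding; the helper name ByteCount is only illustrative. UTF-7 is left out because recent .NET versions mark it obsolete.

```csharp
using System;
using System.Text;

static class EncodingSizes
{
    // Number of bytes the given encoding needs for the text (no byte order mark).
    public static int ByteCount(Encoding encoding, string text) => encoding.GetByteCount(text);

    static void Main()
    {
        string text = "A\u5B57"; // "A" followed by "字"

        Console.WriteLine(ByteCount(Encoding.ASCII, text));   // 2 ("字" is replaced by "?")
        Console.WriteLine(ByteCount(Encoding.UTF8, text));    // 4 (1 byte + 3 bytes)
        Console.WriteLine(ByteCount(Encoding.Unicode, text)); // 4 (2 bytes + 2 bytes)
        Console.WriteLine(ByteCount(Encoding.UTF32, text));   // 8 (4 bytes + 4 bytes)
    }
}
```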

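The next section explains what an encoding is. The sections after it describe the strengths, weaknesses, and differences of each class.
Understanding Encoding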
Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.
The key sentence is: An encoding describes the rules by which an encoder or decoder operates.

A UTF is a method for encoding Unicode code points into the binary form held in memory. The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point.
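
The encoder and decoder roles described above can be sketched with GetBytes and GetString. This is a minimal example, not the only way to use these classes; RoundTripDemo and RoundTrip are names invented for the illustration.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Encodes the text to bytes and decodes those bytes back to a string.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text); // encode: UTF-16 string -> bytes
        return encoding.GetString(bytes);       // decode: bytes -> UTF-16 string
    }

    static void Main()
    {
        string text = "caf\u00E9 \u5B57"; // "café 字"

        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(utf8));            // 63-61-66-C3-A9-20-E5-AD-97
        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text); // True
    }
}
```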
Selecting an Encoding Class
When you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported). In particular, UTF8Encoding is preferred over ASCIIEncoding. If the content is ASCII, the two encodings are identical, but UTF8Encoding can also represent every Unicode character, while ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.
UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding are faster than operations performed with ASCIIEncoding. You should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding might still be a better choice. Assuming default settings, the following scenarios can occur:
If your application has content that is not strictly ASCII and encodes it with ASCIIEncoding, each non-ASCII character encodes as a question mark ("?"). If the application then decodes this data, the information is lost.
If the application has content that is not strictly ASCII and encodes it with UTF8Encoding, the result seems unintelligible if interpreted as ASCII. However, if the application then decodes this data, the data performs a round trip successfully.
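
The two scenarios above can be reproduced with a short sketch. It assumes the default settings of Encoding.ASCII and Encoding.UTF8; the class and method names are illustrative only.

```csharp
using System;
using System.Text;

static class AsciiVsUtf8
{
    // Encodes and decodes with ASCII; non-ASCII characters do not survive.
    public static string ViaAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

    // Encodes and decodes with UTF-8; every character survives.
    public static string ViaUtf8(string text) =>
        Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text));

    static void Main()
    {
        string text = "na\u00EFve"; // "naïve"

        Console.WriteLine(ViaAscii(text));        // na?ve
        Console.WriteLine(ViaAscii(text) == text); // False: the "ï" was lost
        Console.WriteLine(ViaUtf8(text) == text);  // True: full round trip
    }
}
```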

1. ASCIIEncoding
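ASCIIEncoding encodes each Unicode character it supports as a single byte.
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. ASCIIEncoding does not provide error detection. If your program needs error detection, UTF8Encoding, UnicodeEncoding, or UTF32Encoding is recommended instead.
UTF8Encoding, UnicodeEncoding, and UTF32Encoding are also better suited to building globalized applications.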

When selecting the ASCII encoding for your applications, consider the following:
The ASCII encoding is usually appropriate for protocols that require ASCII.
If your application requires 8-bit encoding, the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
Previous versions of .NET Framework allowed spoofing by merely ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.
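
The fallback behavior described above can be observed with a byte whose 8th bit is set. The sketch below assumes current .NET defaults, where Encoding.ASCII substitutes "?" for such bytes, and also shows one possible way to reject them by requesting exception fallbacks through Encoding.GetEncoding. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class AsciiDecodeDemo
{
    // Decodes with the default ASCII settings (replacement fallback).
    public static string DecodeLenient(byte[] bytes) => Encoding.ASCII.GetString(bytes);

    // Returns false if any byte falls outside the 7-bit ASCII range.
    public static bool IsStrictlyAscii(byte[] bytes)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // 0xC1 is 0x41 ('A') with the 8th bit set. Ignoring that bit would
        // turn it into a second 'A'; the fallback yields "?" instead.
        byte[] suspicious = { 0x41, 0xC1 };

        Console.WriteLine(DecodeLenient(suspicious));   // A?
        Console.WriteLine(IsStrictlyAscii(suspicious)); // False
    }
}
```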
2. UTF7Encoding
Represents a UTF-7 encoding of Unicode characters.
The UTF-7 encoding represents Unicode characters as sequences of 7-bit ASCII characters. This encoding supports certain protocols for which it is required, most often e-mail or newsgroup protocols. Since UTF-7 is not particularly secure or robust, and most modern systems allow 8-bit encodings, UTF-8 should normally be preferred to UTF-7.
UTF7Encoding does not provide error detection. For security reasons, the application should use UTF8Encoding, UnicodeEncoding, or UTF32Encoding and enable error detection.
Using UTF7Encoding is not recommended.
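
As an illustration of what "enable error detection" can look like, the sketch below builds a UTF8Encoding with throwOnInvalidBytes set to true and compares it with the default Encoding.UTF8 instance, which substitutes U+FFFD for invalid input. The helper name IsValidUtf8 is an assumption made for this example.

```csharp
using System;
using System.Text;

static class Utf8Validation
{
    // Returns true only if the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] good = { 0xE5, 0xAD, 0x97 }; // UTF-8 for "字"
        byte[] bad = { 0x41, 0xFF };        // 0xFF never appears in valid UTF-8

        Console.WriteLine(IsValidUtf8(good)); // True
        Console.WriteLine(IsValidUtf8(bad));  // False

        // Without error detection the invalid byte is silently replaced.
        Console.WriteLine(Encoding.UTF8.GetString(bad) == "A\uFFFD"); // True
    }
}
```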

3. UTF8Encoding
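UTF-8 encoding represents each code point as a sequence of one to four bytes.
UTF-8 works in units of bytes: characters in different ranges use encodings of different lengths, up to a maximum of four bytes.
According to the documentation quoted above, UTF8Encoding is tuned to be faster than any other encoding. Even when the content is entirely ASCII, encoding it with UTF8Encoding is faster than with ASCIIEncoding.
UTF8Encoding can also represent far more text than ASCIIEncoding, so UTF8Encoding is recommended over ASCIIEncoding.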
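
The variable length of UTF-8 can be checked directly. The sketch below counts the bytes for one sample character from each length class; the class and method names are illustrative only.

```csharp
using System;
using System.Text;

static class Utf8Lengths
{
    // Number of bytes UTF-8 needs for the given text.
    public static int Utf8ByteCount(string text) => Encoding.UTF8.GetByteCount(text);

    static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));          // 1 (U+0041, ASCII range)
        Console.WriteLine(Utf8ByteCount("\u00E9"));     // 2 (U+00E9, "é")
        Console.WriteLine(Utf8ByteCount("\u5B57"));     // 3 (U+5B57, "字")
        Console.WriteLine(Utf8ByteCount("\U0001F600")); // 4 (U+1F600, outside the BMP)
    }
}
```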

4. UnicodeEncoding
UnicodeEncoding works in units of 16-bit unsigned integers: each code point is encoded as one or two 16-bit integers.
The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point. The Unicode Standard uses the following UTFs:
UTF-8, which represents each code point as a sequence of one to four bytes.
UTF-16, which represents each code point as a sequence of one to two 16-bit integers.
UTF-32, which represents each code point as a 32-bit integer.
UnicodeEncoding is not compatible with ASCII, because even ASCII characters occupy two bytes. C# strings are stored internally as UTF-16, which is the encoding that UnicodeEncoding implements.
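
The one-or-two code unit rule, and the lack of ASCII compatibility, can be seen in the bytes that Encoding.Unicode (little-endian UTF-16) produces. This is a minimal sketch; the helper names CodeUnits and Hex are made up for the example.

```csharp
using System;
using System.Text;

static class Utf16Demo
{
    // string.Length counts UTF-16 code units, not code points.
    public static int CodeUnits(string text) => text.Length;

    // Hex dump of the little-endian UTF-16 bytes.
    public static string Hex(string text) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(text));

    static void Main()
    {
        Console.WriteLine(Hex("A"));      // 41-00 (two bytes, so not ASCII-compatible)
        Console.WriteLine(Hex("\u5B57")); // 57-5B (one 16-bit unit)

        string emoji = "\U0001F600"; // U+1F600 lies outside the Basic Multilingual Plane
        Console.WriteLine(CodeUnits(emoji));              // 2
        Console.WriteLine(char.IsSurrogatePair(emoji, 0)); // True
        Console.WriteLine(Hex(emoji));                    // 3D-D8-00-DE
    }
}
```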
5. UTF32Encoding
UTF32Encoding works in units of 32-bit unsigned integers: every code point is encoded as a single 32-bit integer.
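
The fixed width of UTF-32 can be confirmed the same way. The sketch below uses Encoding.UTF32, which is little-endian; the class and method names are illustrative only.

```csharp
using System;
using System.Text;

static class Utf32Demo
{
    // Hex dump of the little-endian UTF-32 bytes.
    public static string Hex(string text) =>
        BitConverter.ToString(Encoding.UTF32.GetBytes(text));

    static void Main()
    {
        Console.WriteLine(Hex("A"));          // 41-00-00-00
        Console.WriteLine(Hex("\u5B57"));     // 57-5B-00-00
        Console.WriteLine(Hex("\U0001F600")); // 00-F6-01-00

        // Two code points always take eight bytes, whatever the characters are.
        Console.WriteLine(Encoding.UTF32.GetByteCount("A\U0001F600")); // 8
    }
}
```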
