The UTF-8 encoding rules
1. Characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. Each of these characters takes exactly one byte.
2. All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
4. All possible 2^31 UCS codes can be encoded.
5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit characters (the ones implicitly supported in the UCS-2 encoding, and by Str Library) are at most three bytes long.
6. The sorting order of big-endian UCS-4 byte strings is preserved.
7. The bytes 0xFE and 0xFF are never used in this encoding.
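Rules 3 and 6 can be checked with a few lines of Python. The following is only a minimal sketch: the helper names is_continuation, resync and sort_order_preserved are made up for this illustration, and the sample string is arbitrary. It relies on Python's built-in UTF-8 encoder and on the fact that Python compares strings by code point.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes in the range 0x80..0xBF (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    Continuation bytes are skipped until a byte that can start a
    character (or the end of the data) is reached. No earlier bytes
    need to be inspected, which is what makes the encoding stateless.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def sort_order_preserved(strings: list) -> bool:
    """Check that sorting by code point equals sorting by UTF-8 bytes."""
    by_code_point = sorted(strings)
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


# "a", U+00E9 and U+20AC encode as: 61 | C3 A9 | E2 82 AC
sample = "a\u00e9\u20ac".encode("utf-8")

# Landing in the middle of a character: skip forward to the next boundary.
boundary = resync(sample, 2)
```

Here index 2 points at the continuation byte 0xA9, so resync moves forward to index 3, where the three-byte sequence for U+20AC begins.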
The following byte sequences are used to represent a character:
U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
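The table above can be turned into a small encoder. What follows is a hedged sketch rather than a reference implementation: the function names utf8_encode and sequence_length are invented for this example, and the code follows the six-row table as given here, including the five- and six-byte forms. Python's built-in encoder only accepts code points up to U+10FFFF, so it can serve as a cross-check only within that range.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one character code number using the shortest sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number out of range")
    if code <= 0x7F:
        return bytes([code])  # 0xxxxxxx
    # (upper limit of range, sequence length, marker bits of first byte)
    rows = (
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x
    )
    for limit, length, marker in rows:
        if code <= limit:
            break
    out = []
    # Fill continuation bytes from the right: 6 bits each, prefix 10.
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3F))
        code >>= 6
    # The remaining high bits go into the first byte.
    out.append(marker | code)
    return bytes(reversed(out))


def sequence_length(first_byte: int) -> int:
    """Sequence length = number of leading 1 bits in the first byte."""
    if first_byte < 0x80:
        return 1  # single-byte ASCII character
    count, mask = 0, 0x80
    while first_byte & mask:
        count += 1
        mask >>= 1
    if count == 1 or count > 6:
        raise ValueError("not a valid first byte")
    return count


euro = utf8_encode(0x20AC)          # bytes E2 82 AC
euro_len = sequence_length(euro[0])  # 3
```

For U+20AC (binary 0010 0000 1010 1100) the three-byte row applies: the bits are split as 0010 / 000010 / 101100, giving 11100010 10000010 10101100, that is E2 82 AC.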