Archive for the ‘Character Encoding’ Category

Confirmed my understanding

March 9, 2006

In the earlier post, the character was not displayed properly because of a browser issue; the code that deals with the supplementary character is correct. The link http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=10177 shows the character that the browser should display (the one present in the XML file when it is opened in a browser). So, to work with supplementary characters, create a new String either by passing a char[] containing the high and low surrogate pair, or by passing a byte[] together with its encoding. The byte[] for any code point can be obtained using the link:
Byte for codepoint
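
As a rough sketch of both approaches (assuming Java 7 or later for StandardCharsets; the class and method names are made up for illustration), U+10177 can be built either from its surrogate pair D800 DD77 or from its UTF-8 bytes F0 90 85 B7:

```java
import java.nio.charset.StandardCharsets;

class SupplementaryStringDemo {
    // Build the string from the UTF-16 surrogate pair of U+10177.
    static String fromSurrogates() {
        char[] pair = { '\uD800', '\uDD77' }; // high surrogate first, then low
        return new String(pair);
    }

    // Build the same string from the UTF-8 bytes of U+10177.
    static String fromUtf8Bytes() {
        byte[] utf8 = { (byte) 0xF0, (byte) 0x90, (byte) 0x85, (byte) 0xB7 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = fromSurrogates();
        String b = fromUtf8Bytes();
        System.out.println("same string    : " + a.equals(b));
        System.out.println("code point     : " + Integer.toHexString(a.codePointAt(0)));
        System.out.println("length in chars: " + a.length());
    }
}
```

Both routes should give a String holding two char values but a single code point.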

Some code to illustrate the unicode support by Character class

March 9, 2006

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class UnicodeTest {
    public static void main(String... args) throws IOException {
        // U+10FFFF is the last valid code point; 0x20FFFF is out of range.
        System.out.println("is valid codepoint : " + Character.isValidCodePoint(0x10FFFF));
        System.out.println("is valid codepoint : " + Character.isValidCodePoint(0x20FFFF));
        int cp = 0x10177;
        System.out.println("is valid codepoint : " + Character.isValidCodePoint(cp));
        // toChars returns the surrogate pair: high surrogate first, then low.
        char[] ch = Character.toChars(cp);
        int high = ch[0];
        int low = ch[1];
        System.out.println("High Surrogate : " + high + " Hexadecimal : " + Integer.toHexString(high) + " Binary String : " + Integer.toBinaryString(high));
        System.out.println("Low Surrogate : " + low + " Hexadecimal : " + Integer.toHexString(low) + " Binary String : " + Integer.toBinaryString(low));
        String st = new String(ch);
        // Write the character to a UTF-8 encoded file.
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("krishna.xml"), "UTF-8"));
        out.write("");
        out.write("");
        out.write(st);
        out.write("");
        out.close();
    }
}

In the above case, the supplementary character written to the XML file was not the one intended. I don't know whether this is a limitation of the browser in displaying supplementary characters or a mistake in the way the Java code handles them.

Some more understandings on java unicode support

March 9, 2006

From the start, Java has used the UTF-16 encoding for characters. In the early days, when the Unicode character set was limited to 16 bits, the Java char type supported it fully. Once Unicode was extended to cover the range up to U+10FFFF, a single 16-bit char could no longer represent characters beyond U+FFFF. Hence, J2SE 5 added supplementary character support through the Character class. The primitive char still supports only characters up to code point U+FFFF. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

* The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
* The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
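
A small sketch of the difference between the two overloads, using the same values as above (the class and method names are only illustrative):

```java
class IsLetterDemo {
    // char overload: a lone high surrogate is not treated as a letter.
    static boolean charOverload() {
        return Character.isLetter('\uD840');
    }

    // int overload: the full code point U+2F81A is a CJK ideograph.
    static boolean intOverload() {
        return Character.isLetter(0x2F81A);
    }

    public static void main(String[] args) {
        System.out.println("isLetter(char '\\uD840') : " + charOverload()); // false
        System.out.println("isLetter(int 0x2F81A)   : " + intOverload());  // true

        // The same contrast when the character sits inside a String.
        String s = new String(Character.toChars(0x2F81A));
        System.out.println("via charAt(0)      : " + Character.isLetter(s.charAt(0)));
        System.out.println("via codePointAt(0) : " + Character.isLetter(s.codePointAt(0)));
    }
}
```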

Back to Unicode support in java

March 9, 2006

I got confused about Unicode again. Some of the terms used are:
1. Coded Character Set
A character set (a collection of characters) in which each character has been assigned a unique number. E.g., the Unicode character set, where every character is assigned a number, usually written in hexadecimal.
2. Code Points
The numbers that can be used in a coded character set. Valid code points for the Unicode character set are U+0000 to U+10FFFF (Unicode 4.0 standard).
3. Supplementary Characters
Characters that could not be represented in the original 16-bit design of Unicode. U+0000 to U+FFFF is referred to as the Basic Multilingual Plane (BMP); the characters beyond it are supplementary characters.
4. Character Encoding Scheme
Mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. E.g., UTF-32, UTF-16, and UTF-8.
5. Character Encoding
Mapping from a set of characters to sequences of code units. E.g., UTF-8, ISO-8859-1, GB18030, Shift_JIS.
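
To make the difference between a code point and its encoded form a bit more concrete, here is a minimal sketch (assuming Java 7 or later for StandardCharsets; the class and helper names are hypothetical) that counts the bytes one character needs under different encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the given text occupies in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";                                   // BMP character U+00E9
        String supplementary = new String(Character.toChars(0x10177)); // U+10177

        System.out.println("U+00E9  ISO-8859-1: " + encodedLength(eAcute, StandardCharsets.ISO_8859_1));
        System.out.println("U+00E9  UTF-8     : " + encodedLength(eAcute, StandardCharsets.UTF_8));
        System.out.println("U+00E9  UTF-16BE  : " + encodedLength(eAcute, StandardCharsets.UTF_16BE));
        System.out.println("U+10177 UTF-8     : " + encodedLength(supplementary, StandardCharsets.UTF_8));
        System.out.println("U+10177 UTF-16BE  : " + encodedLength(supplementary, StandardCharsets.UTF_16BE));
    }
}
```

The code point stays the same in every case; only the sequence of code units changes with the encoding.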

UTF-16
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter “A” or be the second byte of a two-byte character.
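
The arithmetic behind a surrogate pair can be sketched as follows. The class and method names here are hypothetical; in real code the JDK's own Character.toChars and Character.toCodePoint do the same job.

```java
class SurrogateMath {
    // High surrogate: 0xD800 plus the top 10 bits of (cp - 0x10000).
    static char highSurrogate(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Low surrogate: 0xDC00 plus the bottom 10 bits of (cp - 0x10000).
    static char lowSurrogate(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    // Reverse direction: rebuild the code point from the pair.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x10177;
        char high = highSurrogate(cp);
        char low = lowSurrogate(cp);
        System.out.println("high : " + Integer.toHexString(high)); // d800
        System.out.println("low  : " + Integer.toHexString(low));  // dd77
        System.out.println("back : " + Integer.toHexString(combine(high, low))); // 10177

        // Cross-check against the JDK.
        char[] jdk = Character.toChars(cp);
        System.out.println("matches Character.toChars: " + (jdk[0] == high && jdk[1] == low));
    }
}
```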

Java Unicode Support

December 16, 2005

Check the JSR-204 for java unicode support : JSR-204

Supplementary Character Support Approach

  • Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class.
  • Interpret char sequences in all forms (char[], implementations of java.lang.CharSequence, implementations of java.text.CharacterIterator) as UTF-16 sequences, and promote their use in higher-level APIs.
  • Provide APIs to easily convert between various char and code point based representations.

Good blog on Unicode support in J2SE 5: John O'Conner blog
Highlights:
# char is a UTF-16 code unit, not a code point
# new low-level APIs use an int to represent a Unicode code point
# high level APIs have been updated to understand surrogate pairs
# a preference towards char sequence APIs instead of char based methods
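
A minimal sketch of what these points mean in practice: walking a char sequence one code point at a time instead of one char at a time. The class and method names are made up for illustration; the Character methods used were added in J2SE 5.

```java
class CodePointIteration {
    // Collect the code points of a char sequence, treating surrogate pairs as one unit.
    static int[] codePoints(CharSequence text) {
        int[] result = new int[Character.codePointCount(text, 0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            result[n++] = cp;
            i += Character.charCount(cp); // advance by 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "A" + new String(Character.toChars(0x10177)) + "B";
        System.out.println("char count       : " + text.length());
        System.out.println("code point count : " + codePoints(text).length);
        for (int cp : codePoints(text)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

A loop over charAt would see four values here, two of them surrogates that mean nothing on their own.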

UTF-8 Encoding Rules

August 25, 2005

The UTF-8 encoding rules

1. Characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. A single byte is needed for any of these characters.
2. All characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
4. All possible 2^31 UCS codes can be encoded.
5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit characters (the ones implicitly supported in the UCS-2 encoding, and by Str Library) are only up to three bytes long.
6. The sorting order of big-endian UCS-4 byte strings is preserved.
7. The bytes 0xFE and 0xFF are never used in this encoding.

The following byte sequences are used to represent a character:

U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
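
The bit-filling described above can be sketched by hand for the first four rows of the table, which is all that Unicode code points up to U+10FFFF need. This is only an illustration (the class name is hypothetical, and StandardCharsets assumes Java 7 or later); real code should simply use String.getBytes with a UTF-8 charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    // Encode one Unicode code point using the 1- to 4-byte rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {          // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {         // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {        // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x10177 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + mine.length + " byte(s), matches JDK: " + Arrays.equals(mine, jdk));
        }
    }
}
```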