Re: U+xxxx, U-xxxxxx, and the basics

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 08 2000 - 12:43:03 EST


Hi,
Interesting discussion... though I thought the books, the Technical Reports, and the mailing list had already given these answers.

Unicode does go beyond 0xffff, up to 0x10ffff, with its code points, or "Unicode scalar values". There are 128k private-use code points from 0xf0000 to 0x10ffff, and every code point whose lower 16 bits are 0xfffe or 0xffff is a non-character code point. As of Unicode 3.0 there are no other character assignments in 0x10000..0xeffff, but this will change in the next couple of years.
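
To make the ranges above concrete, here is a minimal sketch in Python 3. The function and constant names are my own for illustration and do not come from the standard or any library; the checks simply restate the ranges given in the paragraph above.

```python
# Sketch only: names below are illustrative, not from any Unicode library.
MAX_CODE_POINT = 0x10FFFF
SUPP_PRIVATE_START = 0xF0000  # start of the private-use range named above


def is_noncharacter_by_low_bits(cp: int) -> bool:
    """True if the lower 16 bits of cp are 0xFFFE or 0xFFFF."""
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_supplementary_private_use(cp: int) -> bool:
    """True if cp lies in 0xF0000..0x10FFFF and is not a non-character."""
    in_range = SUPP_PRIVATE_START <= cp <= MAX_CODE_POINT
    return in_range and not is_noncharacter_by_low_bits(cp)


# Size of the whole range 0xF0000..0x10FFFF, non-characters included.
PRIVATE_RANGE_SIZE = MAX_CODE_POINT - SUPP_PRIVATE_START + 1
```

The range 0xf0000..0x10ffff holds exactly 128 * 1024 values; the last two values of each of its two 64k blocks fall under the 0xfffe/0xffff rule, so "128k" is a round figure.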

UTF-8 is a character encoding form _and_ a character encoding scheme because it specifies serialization into octets.
UTF-16 is a character encoding form.
UTF-16BE & UTF-16LE are character encoding schemes.
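
The form/scheme distinction can be seen with Python 3's built-in codecs. This is only a sketch: the helper name utf16_code_units is mine, and it recovers the 16-bit code units by decoding the big-endian byte serialization.

```python
def utf16_code_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as integers (encoding form view)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


euro = "\u20ac"

# Encoding form: one 16-bit code unit, no byte order involved yet.
units = utf16_code_units(euro)

# Encoding schemes: the same code unit serialized into octets, two ways.
be_bytes = euro.encode("utf-16-be")  # b'\x20\xac'
le_bytes = euro.encode("utf-16-le")  # b'\xac\x20'

# UTF-8 code units are already octets, so form and scheme coincide.
utf8_bytes = euro.encode("utf-8")    # b'\xe2\x82\xac'
```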

U+xxxx has 4 digits.
U-xxxxxxxx has 8 digits.
6 digits are enough for the Unicode code point range of 0..0x10ffff, but I am not aware of a special notation for that. It has been suggested (in the Regular Expression UTR?) to use a \vxxxxxx notation with 6 digits in strings with escape sequences.
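
For illustration, the three notations could be produced like this in Python 3. The function names are mine, and the \v form is only the suggested 6-digit notation mentioned above, not an established convention.

```python
def u_plus(cp: int) -> str:
    """4-digit U+xxxx form; only fits values up to 0xFFFF."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("U+xxxx with 4 digits cannot hold this value")
    return f"U+{cp:04X}"


def u_minus(cp: int) -> str:
    """8-digit U-xxxxxxxx form."""
    if not 0 <= cp <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 8 hex digits")
    return f"U-{cp:08X}"


def v_escape(cp: int) -> str:
    """Suggested 6-digit escape; covers the range 0..0x10FFFF (hypothetical)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return f"\\v{cp:06X}"
```

For example, u_plus(0x41) gives "U+0041", u_minus(0x10FFFF) gives "U-0010FFFF", and v_escape(0x10FFFF) gives a backslash followed by "v10FFFF".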

"Surrogate" values are allowed only in UTF-16 as part of the encoding of code points >=0x10000. They are "characters" in the vague sense that they are "code units". They are never valid "code points", or values for "abstract characters".
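
A short sketch of how UTF-16 uses surrogates for code points >=0x10000, in Python 3. The function names are mine; the arithmetic is the standard UTF-16 surrogate-pair calculation.

```python
def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is in the surrogate range 0xD800..0xDFFF."""
    return 0xD800 <= unit <= 0xDFFF


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in 0x10000..0x10FFFF into two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points >= 0x10000 use surrogate pairs")
    offset = cp - 0x10000             # 20 significant bits
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The pair only has meaning together: each half on its own is a code unit, not a code point.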

All of this is much clearer in the more recent UTRs and the Unicode 3.0 book (and in the mailing list discussions of the last two years).

markus


