Re: Displaying Plane 1 characters (annotating the code table

From: Mark Davis
Date: Mon Nov 09 1998 - 16:29:12 EST

Even if it is a trivial algorithm, for human use of the charts it is
very handy to have the common representations on the page, rather
than having to puzzle them out or resort to a character code
calculator (like the one I had to wire up for myself).

Since the last digit is always the same in the main three formats,
it's easy to list the 3 prefixes in the header of each column. This
lets the reader combine that prefix with the row number to get the
value in each encoding. E.g.

column  0         1         2         ...  F
UCS-4   0001000x  0001001x  0001002x  ...  000100Fx
UTF-16  D800DC0x  D800DC1x  D800DC2x  ...  D800DCFx
UTF-8   F090808x  F090809x  F09080Ax  ...  F09083Bx
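
For anyone who wants to generate such column headers rather than work them out by hand, here is a minimal Python sketch. The function names encodings and column_prefixes are made up for this illustration, and it assumes a chart page that starts on a 256-code-point boundary outside the BMP. It is one way to do the arithmetic, not a description of how the charts are actually produced.

```python
def encodings(cp):
    """Return (UCS-4, UTF-16, UTF-8) as uppercase hex strings.

    Only handles supplementary code points (U+10000..U+10FFFF),
    since those are the ones that need a surrogate pair.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    ucs4 = f"{cp:08X}"
    # UTF-16: subtract 0x10000, then split the 20-bit value 10/10.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    utf16 = f"{high:04X}{low:04X}"
    # UTF-8: let the built-in codec do the bit packing.
    utf8 = chr(cp).encode("utf-8").hex().upper()
    return ucs4, utf16, utf8


def column_prefixes(page_base, column):
    """Header prefixes for one chart column.

    The last hex digit (the row number) is replaced by 'x'.
    """
    first_in_column = page_base + 16 * column
    return tuple(s[:-1] + "x" for s in encodings(first_in_column))


if __name__ == "__main__":
    for col in (0x0, 0x1, 0x2, 0xF):
        print(f"{col:X}", *column_prefixes(0x10000, col))
```

The reader then replaces the trailing x with the row number, which works because the low four bits of the code point end up unchanged as the last hex digit in all three forms.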

However, one has to try out different formats to see what is most
useful, and what fits on the page. For example, since a chart page
(columns 0 through F) never spans a byte boundary, the top 6 hex
digits of the UCS-4 and UTF-16 values remain constant across the
page, as do the top 5 hex digits of the UTF-8 value. Only the UTF-8
form has a sixth digit that varies by column, and it changes every
4 columns.
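
The constant-digit counts can be checked mechanically. The following Python sketch is only an illustration: page_common_prefixes is a hypothetical helper, and it assumes a page is 256 consecutive supplementary code points starting on a multiple of 0x100. It encodes every code point on the page and reports the longest shared leading run of hex digits for each form.

```python
import os.path


def page_common_prefixes(page_base):
    """Longest common leading hex digits per encoding over one page.

    Assumes page_base is a multiple of 0x100 in U+10000..U+10FF00.
    """
    forms = {"UCS-4": [], "UTF-16": [], "UTF-8": []}
    for cp in range(page_base, page_base + 0x100):
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)
        low = 0xDC00 | (v & 0x3FF)
        forms["UCS-4"].append(f"{cp:08X}")
        forms["UTF-16"].append(f"{high:04X}{low:04X}")
        forms["UTF-8"].append(chr(cp).encode("utf-8").hex().upper())
    # os.path.commonprefix compares character by character,
    # so it works on any list of strings, not just paths.
    return {name: os.path.commonprefix(values)
            for name, values in forms.items()}


if __name__ == "__main__":
    for name, prefix in page_common_prefixes(0x10000).items():
        print(f"{name:7}{prefix} ({len(prefix)} digits constant)")
```

For the page starting at U+10000 this gives 000100, D800DC and F0908, i.e. 6, 6 and 5 constant digits.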

Asmus pointed out, however, that this issue is moot for UCS-4 in
Unicode 3.0, since there are no non-BMP characters defined there.


---Markus Scherer wrote:
> Hello,
> I agree that all character code points in both standards should be
> presented in the plain 32-bit format, (with or) without the "U-".
> There is no need to add the UTF-16 (=surrogate pair) or any other
> representation because it can be determined with a trivial algorithm.
> markus
> Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
