Re: FW: 6 questions

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 18 2001 - 15:26:33 EDT


Bernard Miller asked:

> 1. Why does Unicode say that there are 63486 code
> values available to represent characters with single
> 16 bit values and 2048 available to represent an
> additional 1,048,544 characters as surrogates? 65536 -
> 2048 = 63488 (difference of 2) --I guess it's due to
> the 2 code values guaranteed not to be characters. But
> what about: 1024 x 1024 = 1,048,576 (difference of
> 32), what accounts for the 32?

There are 32 noncharacters on Planes 1..16:

1FFFE, 1FFFF
2FFFE, 2FFFF
...
10FFFE, 10FFFF

See the discussion of noncharacters in Unicode 3.1 (UAX #27
on the website). Note that an additional 32 noncharacters
have been designated on the BMP, which further reduces the
number of code points available for encoded characters on
the BMP, so the above figures will have to be revised yet again.
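The arithmetic behind those figures can be sketched as follows. This is only an illustrative check, not part of the standard; the variable names are made up for the example, and the range U+FDD0..U+FDEF for the additional BMP noncharacters is taken from Unicode 3.1 rather than stated in this message.

```python
# Sketch of the code point arithmetic discussed above (names are illustrative).
BMP_SIZE = 0x10000                      # 65536 code values in the BMP
SURROGATES = 0xE000 - 0xD800            # 2048 surrogate code values (D800..DFFF)
bmp_nonchars = [0xFFFE, 0xFFFF]         # the 2 values guaranteed not to be characters

bmp_available = BMP_SIZE - SURROGATES - len(bmp_nonchars)

# 1024 high surrogates x 1024 low surrogates address Planes 1..16.
supplementary_total = 1024 * 1024

# xFFFE and xFFFF on each of Planes 1..16: 1FFFE, 1FFFF, ..., 10FFFE, 10FFFF.
supplementary_nonchars = [
    (plane << 16) | low
    for plane in range(1, 17)
    for low in (0xFFFE, 0xFFFF)
]
supplementary_available = supplementary_total - len(supplementary_nonchars)

# Assumption from Unicode 3.1: the 32 additional BMP noncharacters are U+FDD0..U+FDEF.
extra_bmp_nonchars = range(0xFDD0, 0xFDF0)
bmp_available_revised = bmp_available - len(extra_bmp_nonchars)

print(bmp_available, supplementary_available, bmp_available_revised)
```

The first two printed values correspond to the 63486 and 1,048,544 quoted in the question; the third is the BMP figure after subtracting the additional 32 noncharacters.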

> 2. CNS = Chinese National Standard? Why is there a
> Chinese standard for Japanese small variant forms
> (ch14, page 334 of Unicode 3.0 book)?

They aren't for Japanese small variant forms. They are for
some otherwise unexplained small forms of punctuation from
CNS 11643. See U+FE50..U+FE6B.
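One way to look at that range is to list it from the Unicode Character Database bundled with Python. This is a minimal sketch using the standard `unicodedata` module; the helper name `small_form_rows` is invented for the example, and the exact names shown depend on the Unicode version your Python ships with.

```python
import unicodedata

def small_form_rows(first=0xFE50, last=0xFE6B):
    """Return (code point, name, decomposition) for each code point in the range.

    Unassigned code points in the range get the placeholder name '<unassigned>'.
    """
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "<unassigned>")
        decomp = unicodedata.decomposition(ch)  # '' if there is no decomposition
        rows.append((cp, name, decomp))
    return rows

for cp, name, decomp in small_form_rows():
    print(f"U+{cp:04X}  {name:32}  {decomp or '-'}")
```

The assigned entries are punctuation and symbols (for example U+FE50 SMALL COMMA), each with a `<small>` compatibility decomposition to the ordinary character; none of them are ideographs.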

> Do CJK
> ideographs have small variant forms?

No.

> Where are they?
>
> 3. Why don't "noBreak" formatted Unicode characters
> have a canonical decomposition (the compatibility
> decomposition surrounded by glue)?

A long story. But the short answer is that such a decomposition
would cause problems for implementations.
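For reference, the current state of affairs can be inspected with Python's standard `unicodedata` module. This is only a sketch using NO-BREAK SPACE (U+00A0) as one example of a "noBreak" character; it shows what the data says, not the reasoning behind it.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE, one character carrying the <noBreak> tag

# The decomposition field carries a compatibility tag, so it is not canonical.
decomposition = unicodedata.decomposition(NBSP)

# Canonical decomposition (NFD) leaves the character alone;
# compatibility decomposition (NFKD) maps it to an ordinary space.
nfd = unicodedata.normalize("NFD", NBSP)
nfkd = unicodedata.normalize("NFKD", NBSP)

print(repr(decomposition), repr(nfd), repr(nfkd))
```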

>
> 4. Greek final sigma is not considered a compatibility
> decomposition (word position variant) because its
> usage could also be dependent on spelling convention?

Roughly. The exact rules to predict it in all positions are
more complicated than it is worth. Unicode merely inherited the
legacy treatment of the Greek encoding, which gives the two
sigmas separate code points. (There are encodings which do it
the other way: encode one sigma and render with an appropriate
glyph in context, but their rules are complicated.)
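The separate treatment of the two sigmas is visible in the character data. A small sketch with Python's standard `unicodedata` module follows; the helper name `describe` is made up for this example.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA

def describe(ch):
    """Return (name, decomposition, unchanged_by_NFKC) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),          # '' means no decomposition at all
        unicodedata.normalize("NFKC", ch) == ch,
    )

for ch in (FINAL_SIGMA, SIGMA):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Neither character has a decomposition of any kind, so even compatibility normalization keeps them as two distinct code points.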

> Is that right? Even if so, isn't it more consistent to
> precede sigma with a non joiner if you don't want it
> to automatically be displayed as final sigma at the
> end of a word?

No. Greek implementations have traditionally not made use
of joiner/non-joiner mechanisms.

>
> 5. How come East Asian width type W and H are non
> starters for line breaking?

I'll let somebody else tackle that one.

>
> 6. Why does Unicode use "capital" vs "small letter"
> terminology instead of "uppercase" vs "lowercase"? It
> seems like lowercase is more descriptive than "small
> letter".

Inheritance from SC2 character encoding naming conventions.
All the Unicode names were harmonized to the 10646 names back
in 1993, but the Unicode Standard used "capital" and "small"
even before that. The use of "capital" and "small" has a
venerable tradition in character encoding that goes back to
ASCII and its precursors.

--Ken


