Re: Discrepancy in ch03.pdf?

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Apr 10 2002 - 11:21:35 EDT


Антон Тагунов <atagunov@online.ptt.ru> wrote regarding Definition D5:

> Every time I read the following passage in
> http://www.unicode.org/unicode/uni2book/ch03.pdf
> I get confused:
>
> - A single abstract character may correspond to more than one code
> value - ...
> - Multiple code values may be required to represent a single abstract
> character.

I don't see a discrepancy between these two statements, at least not one
that the phrase "more than one code value sequence" would clarify.

> For example, a byte is the code unit in SJIS:...
> ideographs require two code values

I do think the text here is unclear about "code values" and "code
units." It says the two are the same thing and then uses the terms
interchangeably, which is confusing in a standard.
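
As a rough illustration of the quoted SJIS example, here is a minimal
Python sketch. It assumes Python's built-in "shift_jis" codec is an
acceptable stand-in for SJIS, and the helper name sjis_code_units is
made up for this message. It only shows that a byte is the code unit
and that an ideograph takes two of them:

```python
# Sketch only: Python's "shift_jis" codec stands in for SJIS here,
# and sjis_code_units is a hypothetical helper name.
def sjis_code_units(ch):
    """Return the byte-sized code units that encode `ch` in Shift-JIS."""
    return list(ch.encode("shift_jis"))

latin = sjis_code_units("A")            # an ASCII letter
ideograph = sjis_code_units("\u6f22")   # U+6F22, a CJK ideograph

# One code unit for the letter, two for the ideograph.
print(len(latin), len(ideograph))
```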

To me, a more useful distinction is the one in Technical Report #17,
"Character Encoding Model"
<http://www.unicode.org/unicode/reports/tr17/> between "code point" and
"code unit." A code point is something like U+0410 for CYRILLIC CAPITAL
LETTER A. Code units are what a given encoding form uses to express that
code point: the two bytes 0xD0 0x90 in UTF-8, or the single 32-bit word
0x00000410 in UTF-32.
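
The distinction can be checked with a short Python sketch. The helper
name code_units and its "form" argument are invented for illustration
and are not terms from UTR #17; the sketch simply asks Python's own
codecs for the code units of U+0410 in each encoding form:

```python
import struct

def code_units(ch, form):
    """Return the code units that encode `ch` in the given encoding form.

    Hypothetical helper: "utf-8" yields 8-bit units, "utf-32" yields
    32-bit units.
    """
    if form == "utf-8":
        return list(ch.encode("utf-8"))
    if form == "utf-32":
        data = ch.encode("utf-32-be")   # big-endian, no byte order mark
        return list(struct.unpack(">%dI" % (len(data) // 4), data))
    raise ValueError("unsupported encoding form: %r" % form)

ch = chr(0x0410)                        # CYRILLIC CAPITAL LETTER A
utf8_units = code_units(ch, "utf-8")
utf32_units = code_units(ch, "utf-32")

print(" ".join("0x%02X" % u for u in utf8_units))    # 0xD0 0x90
print(" ".join("0x%08X" % u for u in utf32_units))   # 0x00000410
```

The code point stays the same in both cases; only the number and
width of the code units change.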

Incorporating the concepts from UTR #17 into the main text is one place
where the "language tightening" project for Unicode 4.0 should really
pay off.

-Doug Ewell
 Fullerton, California


