Re: UTF-16 clarification needed

From: Doug Ewell (dewell@roadrunner.com)
Date: Fri Jul 04 2008 - 16:15:49 CDT

Next message: philip chastney: "Re: how to add all latin (and greek) subscripts"

Previous message: Michael Everson: "Re: wikipedia unicode font."
Maybe in reply to: Jeroen Ruigrok van der Werven: "UTF-16 clarification needed"
Next in thread: Kenneth Whistler: "Re: UTF-16 clarification needed"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> Are you just considering the definitions, but also ignoring the
> conformance clauses that restrict them more precisely?

Jeroen asked a simple question, and I answered it:

"When you have the U+D800 - U+DFFF range for creating code points using
surrogate pairs and you take for example U+20045 it will be created as:
U+D840 U+DC45. Are these, by themselves only code units or are they also
code points?"

There is a lot of complexity to Unicode -- otherwise the book wouldn't
be 570 pages long BEFORE you get to the code charts -- but my answer was
not wrong. Yes, the values 0xD840 and 0xDC45 are code points. They are
surrogate code points, and they are only used in UTF-16, and they are
not Unicode scalar values, and they do not individually encode
characters, but that was not what Jeroen asked.
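
For anyone who wants to check the numbers in Jeroen's example, the
arithmetic behind the pair can be sketched in a few lines of Python.
This is only an illustration of the UTF-16 calculation; the function
name to_surrogate_pair is my own invention and is not a term from the
standard.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20045)
print(f"{high:04X} {low:04X}")  # D840 DC45
```

The two results are 16-bit code units in the UTF-16 encoding form;
whether the same numeric values are also code points is the question
under discussion.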

> Really, I prefer NEVER using the U+xxxx notation for anything else
> that is not mapped to a single code point, independently of the
> encoding form or encoding scheme where those code points may be mapped
> to ordered streams of code units or bytes.

You are correct about the notation. U+... notation is only for use with
code points, not code values. I did not perceive the notation as being
at the heart of Jeroen's question, and in private exchange, he confirmed
that it was not. But you are correct about the notation.

> And I don't make the confusion between code points and code units
> because they don't belong to the same space (even if they seem to
> intersect, they don't: code points are arbitrary elements without
> numeric capabilities, so without arithmetic, even if they are assigned
> several numeric properties like their nominal scalar value; code units
> have arithmetic properties, they are elements in a mathematical Galois
> field).

Well, gee, I don't like to "make the confusion" either, which is
probably why I opened the book before answering, instead of trusting my
instinct. Actually, my instinct was wrong on this: I was expecting to
see that the surrogates were not code points. In fact, they are code
points; what they are not is Unicode scalar values. That is why that
term was invented: (code points) - (surrogate code points) = (USVs).
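
That subtraction can be written out as a small Python sketch. The
ranges are the ones given in the standard (code points 0..10FFFF,
surrogates D800..DFFF); the function and variable names are just
labels I picked for illustration.

```python
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n: int) -> bool:
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    # Scalar values are all code points except the surrogates.
    return is_code_point(n) and not is_surrogate(n)


code_points = MAX_CODE_POINT + 1        # 1,114,112
surrogates = 0xDFFF - 0xD800 + 1        # 2,048
scalar_values = code_points - surrogates

print(scalar_values)                                   # 1112064
print(is_code_point(0xD840), is_scalar_value(0xD840))  # True False
```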

> UTF-16 defines an encoding form/scheme for conforming texts, not just
> for isolated characters.

Jeroen didn't ask about encoding U+D840 in isolation, or U+DC45 in
isolation.

> TUS is clear:
> "Each encoding form maps the Unicode code points U+0000..U+D7FF and
> U+E000..U+10FFFF to unique code unit sequences."
>
> This means that there's NO ***code points*** of the surrogates range
> U+D800..U+DFFF in any encoding form, so they can't occur as well in
> UTF-16 (as long as you are conforming to its rules).

But they do exist as ***code points***. TUS is clear there too, in
definitions D9 and D10.
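
As a rough illustration of how both statements can hold at once, here
is how one implementation, Python 3, behaves. This shows that
implementation's choices, not a ruling on the standard: its str type
can hold a lone surrogate code point, but its UTF-16 codec refuses to
encode one. The helper name encodes_as_utf16 is hypothetical.

```python
lone = chr(0xD840)   # a str holding a single surrogate code point


def encodes_as_utf16(text: str) -> bool:
    """Return True if Python's strict UTF-16 codec accepts the text."""
    try:
        text.encode("utf-16-be")
    except UnicodeEncodeError:
        return False
    return True


print(encodes_as_utf16(lone))                   # False
print(encodes_as_utf16(chr(0x20045)))           # True
print(chr(0x20045).encode("utf-16-be").hex())   # d840dc45
```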

I'd like to wait for Ken or Mark or somebody to issue a bull on this. I
think I gave the correct answer to the question Jeroen asked, and you
are giving the correct answer for the question you think Jeroen really
meant to ask.

--
Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages


This archive was generated by hypermail 2.1.5 : Fri Jul 04 2008 - 16:18:26 CDT