Re: UTF-16 clarification needed

From: Doug Ewell (dewell@roadrunner.com)
Date: Fri Jul 04 2008 - 16:15:49 CDT


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Are you just considering the definitions, but also ignoring the
    > conformance clauses that restrict them more precisely?

    Jeroen asked a simple question, and I answered it:

    "When you have the U+D800 - U+DFFF range for creating code points using
    surrogate pairs and you take for example U+20045 it will be created as:
    U+D840 U+DC45. Are these, by themselves only code units or are they also
    code points?"

    There is a lot of complexity to Unicode -- otherwise the book wouldn't
    be 570 pages long BEFORE you get to the code charts -- but my answer was
    not wrong. Yes, the values 0xD840 and 0xDC45 are code points. They are
    surrogate code points, and they are only used in UTF-16, and they are
    not Unicode scalar values, and they do not individually encode
    characters, but that was not what Jeroen asked.
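
As a minimal sketch of the arithmetic behind Jeroen's example (not part of the original exchange), the following Python shows how the supplementary code point U+20045 maps to the UTF-16 code units 0xD840 0xDC45. The function names `to_surrogate_pair` and `is_surrogate_code_point` are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high/low code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def is_surrogate_code_point(value: int) -> bool:
    """True for values in the surrogate range D800..DFFF."""
    return 0xD800 <= value <= 0xDFFF


high, low = to_surrogate_pair(0x20045)
print(f"{high:04X} {low:04X}")
```

Both resulting values fall in the surrogate range, which is why the same numbers can be read either as UTF-16 code units or as surrogate code points.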

    > Really, I prefer NEVER using the U+xxxx notation for anything else
    > that is not mapped to a single code point, independently of the
    > encoding form or encoding scheme where those code points may be mapped
    > to ordered streams of code units or bytes.

    You are correct about the notation. U+... notation is only for use with
    code points, not code units. I did not perceive the notation as being
    at the heart of Jeroen's question, and in private exchange, he confirmed
    that it was not. But you are correct about the notation.

    > And I don't make the confusion between code points and code units
    > because they don't belong to the same space (even if they seem to
    > intersect, they don't: code points are arbitrary elements without
    > numeric capabilities, so without arithmetic, even if they are assigned
    > several numeric properties like their nominal scalar value; code units
    > have arithmetic properties, they are elements in a mathematical Galois
    > field).

    Well, gee, I don't like to "make the confusion" either, which is
    probably why I opened the book before answering, instead of trusting my
    instinct. Actually, my instinct was wrong on this: I was expecting to
    see that the surrogates were not code points. In fact, they are not
    Unicode scalar values. That is why that term was invented: (code
    points) - (surrogate code points) = (USVs).
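
The subtraction above can be checked with a short sketch. The ranges are the ones quoted in this thread (code points 0..10FFFF, surrogates D800..DFFF); the names `CODE_POINTS`, `SURROGATES`, and `is_scalar_value` are hypothetical, introduced only for this illustration.

```python
# Code point space and surrogate range, as half-open Python ranges.
CODE_POINTS = range(0x0000, 0x110000)   # 0..10FFFF inclusive
SURROGATES = range(0xD800, 0xE000)      # D800..DFFF inclusive


def is_scalar_value(value: int) -> bool:
    """A Unicode scalar value is any code point that is not a surrogate."""
    return value in CODE_POINTS and value not in SURROGATES


total_code_points = len(CODE_POINTS)
surrogate_code_points = len(SURROGATES)
scalar_values = total_code_points - surrogate_code_points

print(total_code_points, surrogate_code_points, scalar_values)
```

Under this reading, 0xD840 is a member of `CODE_POINTS` but `is_scalar_value(0xD840)` is false, while 0x20045 satisfies both.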

    > UTF-16 defines an encoding form/scheme for conforming texts, not just
    > for isolated characters.

    Jeroen didn't ask about encoding U+D840 in isolation, or U+DC45 in
    isolation.

    > TUS is clear:
    > "Each encoding form maps the Unicode code points U+0000..U+D7FF and
    > U+E000..U+10FFFF to unique code unit sequences."
    >
    > This means that there's NO ***code points*** of the surrogates range
    > U+D800..U+DFFF in any encoding form, so they can't occur as well in
    > UTF-16 (as long as you are conforming to its rules).

    But they do exist as ***code points***. TUS is clear there too, in
    definitions D9 and D10.
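
One way to see how both statements can hold at once is to look at how a particular runtime treats them. The sketch below assumes CPython 3.4 or later, which is an assumption of this example and not something discussed in the thread: there, a lone surrogate is accepted as a code point in a string, but the UTF-16 codec refuses to encode it. The helper name `can_encode_utf16` is hypothetical.

```python
def can_encode_utf16(text: str) -> bool:
    """Report whether Python's strict UTF-16 codec accepts the text."""
    try:
        text.encode("utf-16-be")
    except UnicodeEncodeError:
        return False
    return True


# A surrogate code point can be represented as a code point...
lone_surrogate = chr(0xD840)
print(hex(ord(lone_surrogate)))

# ...but it is not mapped to any code unit sequence by the encoding form.
print(can_encode_utf16(lone_surrogate))

# A scalar value such as U+20045 encodes normally.
print(can_encode_utf16("\U00020045"))
```

This is only a demonstration of one implementation's behavior; the normative answer is in the standard's definitions and conformance clauses.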

    I'd like to wait for Ken or Mark or somebody to issue a bull on this. I
    think I gave the correct answer to the question Jeroen asked, and you
    are giving the correct answer for the question you think Jeroen really
    meant to ask.

    --
    Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

