Re: The result of the plane 14 tag characters review.

From: Doug Ewell (
Date: Wed Nov 13 2002 - 12:02:13 EST

    Dominikus Scherkl <Dominikus dot Scherkl at glueckkanja dot com> wrote:

    > Hm. <lang=en>...<\lang>
    > that are 9+7 = 16 characters to indicate the language (and end of tag)
    > All of them are ASCII, therefore encoded as 1 byte utf-8 each.
    > Plane 14 requires 4 byte utf-8 each, and at least 3 characters
    > (two tag-letters and the end-tag) - this is 12 bytes.
    > Ok, this is less heavy, but not very much.
    > Or what do you think what "weight" in this context means?!?

    I definitely *don't* mean the number of bytes in UTF-8, which is just
    one way of representing Unicode text. In UTF-16, each Plane 14 tag
    character requires only two 16-bit code units (cf. one for each Basic
    Latin character), while in SCSU each can take as little as one byte,
    after the initial 3-byte overhead to define the window and maybe
    another byte at the start of each tag (not each character) to switch
    to that window.
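
    To make the arithmetic concrete, here is a rough Python sketch that
    counts UTF-8 bytes and UTF-16 code units for both forms. The names
    ASCII_MARKUP, PLANE14_TAG, utf8_bytes and utf16_code_units are made
    up for illustration, and the markup string assumes the closing tag
    is spelled </lang>. SCSU is left out because the Python standard
    library has no codec for it.

```python
# Illustrative only: compare the size of an ASCII markup pair with a
# Plane 14 language tag in two Unicode encoding forms.

# Assumed markup form: opening tag (9 chars) plus closing tag (7 chars).
ASCII_MARKUP = "<lang=en></lang>"

# U+E0001 LANGUAGE TAG, then TAG LATIN SMALL LETTER E and N.
PLANE14_TAG = "\U000E0001\U000E0065\U000E006E"


def utf8_bytes(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    # "utf-16-be" emits no byte order mark, so bytes // 2 is the unit count.
    return len(text.encode("utf-16-be")) // 2


for label, text in (("ASCII markup", ASCII_MARKUP), ("Plane 14 tag", PLANE14_TAG)):
    print(f"{label}: {utf8_bytes(text)} UTF-8 bytes, "
          f"{utf16_code_units(text)} UTF-16 code units")
```

    In UTF-16 the markup pair takes 16 code units (32 bytes) against 6
    code units (12 bytes) for the tag, so the comparison depends heavily
    on which encoding form you happen to count in.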

    Interpretation of UTF-8 bytes needs to happen at the very earliest
    stages of text processing. It does not belong in the same stage as
    language tag interpretation, normalization, bidi, etc.

    Furthermore, with Plane 14 you don't always need a closing tag like
    </lang>, and if it's present you don't need to check it for syntax.
    (Well, I suppose you could have something illegal like U+E0001 U+E0065
    U+E006E U+E007F, but that's no worse than having to check for </lanf>.)
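
    As a sketch of how little machinery this needs, the following Python
    function separates Plane 14 tag characters from ordinary text in one
    pass. The function name split_language_tags and the shape of its
    return value are my own invention, not anything from the standard,
    and the handling of U+E007F is deliberately simplified: any cancel
    is recorded as None, with no syntax checking at all.

```python
# Illustrative only: a single-pass reader for Plane 14 language tags.
LANGUAGE_TAG = 0xE0001                   # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                     # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag characters mirroring ASCII 0x20..0x7E
TAG_OFFSET = 0xE0000                     # subtract to get the ASCII equivalent


def split_language_tags(text):
    """Return (plain_text, events).

    Each event is (index into plain_text, language string), or
    (index, None) where a cancel tag was seen. Hypothetical helper.
    """
    plain = []      # ordinary characters, in order
    events = []     # tag events, in order of appearance
    current = None  # identifier being collected, or None outside a tag

    def finish_tag():
        nonlocal current
        if current is not None:
            events.append((len(plain), "".join(current)))
            current = None

    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            finish_tag()
            current = []
        elif cp == CANCEL_TAG:
            if current:              # a non-empty identifier came first
                finish_tag()
            current = None
            events.append((len(plain), None))
        elif TAG_FIRST <= cp <= TAG_LAST:
            if current is not None:
                current.append(chr(cp - TAG_OFFSET))
            # Stray tag characters outside a tag are silently dropped.
        else:
            finish_tag()
            plain.append(ch)
    finish_tag()
    return "".join(plain), events


sample = "\U000E0001\U000E0065\U000E006EHello\U000E0001\U000E007F world"
print(split_language_tags(sample))
```

    Note that there is no name to match against and no closing syntax to
    verify: every code point is classified purely by its numeric range.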

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Wed Nov 13 2002 - 12:46:24 EST