Re: The result of the plane 14 tag characters review.

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 13 2002 - 11:49:46 EST


    Michael Everson <everson at evertype dot com> wrote:

    >> 3. Is there any method of tagging, anywhere, that is lighter-weight
    >> than Plane 14? (Corollary: Is "lightweight" important?)
    >
    > HTML and XML markup?

    and <Peter_Constable at sil dot org> replied:

    > Doug was already comparing the plane 14 characters to HTML and XML,
    > and clearly considers the latter to be relatively heavy -- and
    > certainly they are heavier.

    Certainly I don't want to claim, as some have, that HTML and XML and
    SGML are *very* heavy. But there is definitely a difference.

    HTML language tags (used here to include the slightly more complex XML
    syntax as well) are, in simplified form, <lang="xx">; strictly, lang
    is an attribute on an element, as in <span lang="xx">. Plane 14 tags
    are of the form ?xx, where ? represents U+E0001 and xx, the language
    identifier, is mapped to the corresponding Plane 14 tag characters.
    (HTML allows the alternative form <lang=xx> without quotation marks,
    but XML does not.) Either way, HTML clearly requires more parsing:

    * the spelling of the tag "lang" must be checked;
    * alternatively, it might be another type of tag altogether (not a
      language tag);
    * the equal sign = must be checked;
    * there must be exactly 0 (allowed in HTML only) or 2 quotation marks
      surrounding the identifier;
    * the greater-than sign > must be checked.

    Plane 14 tags begin with a single, dedicated code point that means
    "language tag," so no syntax checking is needed at that point. The
    language identifier itself is encoded by dedicated code points, so
    checking for "the end of the tag" is simpler: the tag ends at the
    first character outside the tag-character range, or at end of stream.
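
    A minimal sketch of that check in Python, for illustration only. The
    function names are hypothetical, and the sketch assumes the identifier
    is printable ASCII, so that each tag character is the ASCII code point
    plus 0xE0000 (U+E0020 through U+E007E):

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_FIRST = 0xE0020          # first tag character (TAG SPACE)
TAG_LAST = 0xE007E           # last tag character (TAG TILDE)
TAG_OFFSET = 0xE0000         # tag character = ASCII code point + offset


def encode_language_tag(identifier):
    """Build a Plane 14 language tag from an ASCII identifier such as 'en'."""
    return LANGUAGE_TAG + "".join(chr(ord(c) + TAG_OFFSET) for c in identifier)


def read_language_tag(text, start):
    """Read a language tag beginning at text[start].

    Returns (identifier, index just past the tag), or None if text[start]
    is not U+E0001. The tag ends at the first code point outside the
    tag-character range, or at the end of the string.
    """
    if start >= len(text) or text[start] != LANGUAGE_TAG:
        return None
    end = start + 1
    while end < len(text) and TAG_FIRST <= ord(text[end]) <= TAG_LAST:
        end += 1
    identifier = "".join(chr(ord(c) - TAG_OFFSET) for c in text[start + 1:end])
    return identifier, end
```

    There is no tag name to spell-check and no delimiter to match; the
    only decision is whether the next code point is still in the range.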

    Parsing the cancel tag is likewise simpler: </lang> vs. U+E0001
    U+E007F. For that matter, a Plane 14 cancel tag is not always
    necessary, because a new language tag simply replaces the previous
    one; that is not true in HTML.
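
    As a rough illustration, the Python sketch below splits a string into
    (language, text) runs. The function name and the run representation
    are hypothetical. It assumes that a new language tag replaces the
    current one, that U+E0001 U+E007F cancels the language, and that a
    bare U+E007F cancels all tags:

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"    # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000


def is_tag_char(ch):
    """True for the tag characters U+E0020..U+E007E."""
    return 0xE0020 <= ord(ch) <= 0xE007E


def language_runs(text):
    """Return a list of (language or None, plain text) runs."""
    runs, language, plain = [], None, []

    def flush():
        # Close the current run of plain text, if there is one.
        if plain:
            runs.append((language, "".join(plain)))
            plain.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == LANGUAGE_TAG:
            flush()
            i += 1
            if i < len(text) and text[i] == CANCEL_TAG:
                language = None  # U+E0001 U+E007F: cancel the language
                i += 1
            else:
                start = i
                while i < len(text) and is_tag_char(text[i]):
                    i += 1
                # A new tag replaces the old one; no cancel is needed.
                language = "".join(
                    chr(ord(c) - TAG_OFFSET) for c in text[start:i]
                )
        elif ch == CANCEL_TAG:
            flush()
            language = None  # bare U+E007F: cancel all tags
            i += 1
        elif is_tag_char(ch):
            i += 1  # stray tag character outside any tag: ignored here
        else:
            plain.append(ch)
            i += 1
    flush()
    return runs
```

    Switching from one language to another takes a single new tag; the
    cancel sequence is needed only to return to untagged text.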

    Any syntax checking of the identifier itself (e.g. "en" is valid but
    "em" is not) must be performed regardless of the mechanism, so neither
    approach holds an advantage there.

    Peter continued:

    >> 2. What extra processing is necessary to ignore Plane 14 tags that
    >> wouldn't be necessary to ignore any other Unicode character(s)?
    >
    > None. And if some form of light-weight markup were used, then there
    > would inevitably be a need for some kind of character escape mechanism,
    > so ignoring language tagging would still entail interpreting of the
    > escapes. E.g.
    >
    > #LT=en#This is English text, #LT=fr# et ce texte ci est en français.
    > #LT=en#To use the pound character in text, as in "He's in room ##4,"
    > you have to encode it twice.

    Exactly. With the dedicated code points in Plane 14, you don't need
    either the closing tag or the double-# escaping scheme.
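
    To make the difference concrete, here is a hedged Python sketch of
    "ignoring the tagging" under each scheme. Both function names are
    hypothetical, and the #LT=xx# syntax is only Peter's illustrative
    example above, not a real format. The Plane 14 version drops every
    code point in the block U+E0000..U+E007F; the ASCII version has to
    recognize the markup and also undo the ## escape:

```python
import re


def strip_plane14_tags(text):
    """Drop every code point in the Plane 14 tag block (U+E0000..U+E007F)."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


# Either an escaped pound sign (##) or a tag of the form #LT=xx#.
_HASH_MARKUP = re.compile(r"##|#LT=[^#]*#")


def strip_hash_markup(text):
    """Remove #LT=xx# tags and turn the ## escape back into a single #."""
    return _HASH_MARKUP.sub(
        lambda m: "#" if m.group() == "##" else "", text
    )
```

    The first function is a pure filter on code point values and leaves
    ordinary characters, including #, untouched. The second cannot ignore
    the tags without also interpreting the escapes.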

    I am not arguing that it takes Herculean effort to program a parser for
    ASCII-based language tags, only that Plane 14 tags are even simpler, and
    that some text applications call for the simpler mechanism.

    -Doug Ewell
     Fullerton, California


