Re: Questions on ZWNBS - for line initial holam plus alef

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Aug 13 2003 - 10:04:12 EDT

    Peter, in XML you really don't want to use attributes for any general
    text; there are too many restrictions on the content. For example, we
    never put translatable text into them. Attributes should really be
    treated more like sequences of symbols, with a constrained syntax.

    This is also not in violation of the Unicode conformance clause. A
    "space plus combining character" is a unit in some sense. That is, it
    is a combining character sequence (and grapheme cluster). However,
    there is no clause that says that such units cannot be changed, or
    that any particular sequence of characters cannot be changed;
    operations such as case mapping or normalization do just that: they
    change characters.
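
    A minimal sketch of that last point, using Python's standard
    unicodedata module. The sample strings are illustrative choices, not
    taken from the thread; each operation below legitimately replaces
    characters, including one that produces a space plus combining mark.

```python
import unicodedata

# SPACE + HEBREW POINT HOLAM: a space followed by a combining mark.
space_plus_mark = " \u05B9"
is_combining = unicodedata.combining(space_plus_mark[1]) != 0

# Normalization changes characters: e + COMBINING ACUTE ACCENT is
# replaced by the single precomposed character U+00E9 under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")

# Case mapping changes characters too: one sharp s becomes two letters.
upper = "\u00DF".upper()

# Compatibility normalization can even create "space plus combining":
# U+00A8 DIAERESIS decomposes to SPACE + U+0308 COMBINING DIAERESIS.
compat = unicodedata.normalize("NFKC", "\u00A8")

print(is_combining, repr(composed), repr(upper), repr(compat))
```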

    There are restrictions on what can be changed *if* a process purports
    to not modify the text (C10). But an XML parser is certainly capable
    of interpreting a sequence A B, and deciding that it wants to change A
    to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek
    Alpha, *that* would be a violation of C7. But interpreting a space as
    a space, then deciding to modify it, is perfectly legit.
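
    The attribute case discussed in the quoted text below can be sketched
    with Python's xml.etree.ElementTree (backed by expat). The element name
    "e", the attribute name "a", and the sample value are hypothetical;
    the sketch assumes the parser applies XML 1.0 attribute-value
    normalization, which turns a literal line break into a space but
    leaves one written as a character reference alone.

```python
import unicodedata
import xml.etree.ElementTree as ET

MARK = "\u05B9"  # HEBREW POINT HOLAM, a combining mark
ALEF = "\u05D0"  # HEBREW LETTER ALEF

# Literal line feed directly before the combining mark.
literal_doc = '<e a="x\n' + MARK + ALEF + '"/>'
# The same line feed written as a numeric character reference.
reference_doc = '<e a="x&#xA;' + MARK + ALEF + '"/>'

literal_value = ET.fromstring(literal_doc).get("a")
reference_value = ET.fromstring(reference_doc).get("a")

# The parser read LF as LF, then replaced it with a space, so the
# application sees SPACE + combining mark.
print(repr(literal_value))
# The character reference survives as a real line feed.
print(repr(reference_value))

# Splitting on white space, as for an NMTOKENS-style value, leaves a
# second token that begins with a combining mark.
tokens = literal_value.split()
starts_with_mark = unicodedata.combining(tokens[1][0]) != 0
print(tokens, starts_with_mark)
```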

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
    To: "John Cowan" <cowan@mercury.ccil.org>
    Cc: <unicode@unicode.org>
    Sent: Wednesday, August 13, 2003 05:09
    Subject: Re: Questions on ZWNBS - for line initial holam plus alef

    > On 12/08/2003 20:28, John Cowan wrote:
    >
    > >Peter Kirk scripsit:
    > >
    > >
    > >
    > >>>2) In attribute values, LF, CR, and TAB characters are normalized to
    > >>>spaces. Not relevant here.
    > >>>
    > >>>
    > >>This would be relevant if it is legal for the character after LF, CR,
    > >>and TAB to be a combining mark. Is this legal? In this case what was
    > >>previously a defective (but legal) combining sequence would turn into a
    > >>non-defective one, but the intended whitespace would be lost.
    > >>
    > >>
    > >
    > >The point is that there is no such thing as an *intended* line break in
    > >an attribute value; it will *always* be translated to a space before
    > >the application sees it. (More exactly, line-break characters can
    > >be inserted into attribute values, but only with the use of a numeric
    > >character reference such as "&#xA;".)
    > >
    > >
    > Sorry, I'm confused. Are you saying that the input processing will
    > translate line breaks into spaces within attribute values, unless
    > inserted as &#xA; ? Well, I suppose this is fair enough as it is up to
    > the user not to enter garbage.
    >
    > >
    > >
    > >>Not just a rendering glitch, I suspect. If the combining character is
    > >>combined with the separating space, the space loses many of its
    > >>separating functions, and perhaps keeps a confusing subset of them with
    > >>all sorts of possibilities of error.
    > >>
    > >>
    > >
    > >The space(s) will be used to separate individual tokens at processing
    > >time. No spacing diacritic (either single-character or space+combining)
    > >is permitted in a NMTOKEN.
    > >
    > >
    > OK if this is clearly illegal, but this might restrict use of some
    > languages in NMTOKEN. Would NBSP + combining be allowed?
    >
    > >
    > >
    > >>At best tokens beginning with
    > >>combining characters will be unusable. At worst they will crash the
    > >>implementation (and count on someone trying deliberately to do that!).
    > >>
    > >>
    > >
    > >In effect, the combining character will constitute a defective combining
    > >sequence at the beginning of the individual token.
    > >
    > >Stepping away from the letter of the standard for a moment, there is
    > >no real reason to begin a NMTOKEN with a combining character. It is
    > >only allowed as a result of the miscegenation of SGML concepts with
    > >Unicode ones.
    > >
    > >In SGML's original design of tokens, they consisted of letters and digits
    > >(and a few punctuation marks, which functioned as letters). There were
    > >four kinds: a NUMBER could contain only digits, a NAME could not begin
    > >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
    > >restrictions. ID and IDREF had the same syntax as NAME with additional
    > >semantics. Later, the categories "letter" and "digit" were generalized,
    > >by redefining the concrete syntax, to be whatever you wanted, and were
    > >renamed "name-start" and "name" characters (technically, a name character
    > >was a letter *or* a digit).
    > >
    > >When SGML was simplified to produce XML, only NMTOKEN, the most general
    > >type of token, was kept. However, in order to keep the semantics of
    > >"letter" and "digit" in the Unicode world, "letter" was extended to be any
    > >letter and "digit" to be any digit *or* combining character. That worked
    > >well for ID and IDREF, since treating combining characters as part of
    > >"digit" prevented them from appearing first, as was only sensible.
    > >
    > >Unfortunately, NMTOKENs, since there were no restrictions, became able
    > >to begin with a combining character, though that made no real sense.
    > >To write in a restriction would make it impossible to specify XML's
    > >concrete syntax in SGML terms, which did not allow for three different
    > >classes of characters within tokens. So we wound up with a basically
    > >useless capability that if used will only cause trouble.
    > >
    > >
    > >
    > There is some potential for real trouble here, if one process outputs an
    > NMTOKEN starting with a combining character preceded by a separating
    > space, or something else which is changed into a space, and another
    > process takes the new space plus combining character as a unit and so
    > doesn't recognise the separation. Any hackers and virus programmers
    > reading this will soon start flooding the Internet with tokens beginning
    > with combining characters in the hope of crashing implementations or
    > finding back doors. Of course this wouldn't have been a problem if
    > Unicode had never defined space plus combining character as legal and
    > meaningful. But this is not my problem!
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/