Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Dec 11 2004 - 04:20:20 CST

    "Philippe Verdy" <verdy_p@wanadoo.fr> writes:

    [...]
    > This was later amended in an errata for XML 1.0 which now says that
    > the list of code points whose use is *discouraged* (but explicitly
    > *not* forbidden) for the "Char" production is now:
    [...]

    Ugh, it's a mess...

    IMHO Unicode is partially to blame: by introducing various kinds of
    holes in code point numbering (non-characters, surrogates), by not
    being clear about when the unit of processing should be a code point
    and when a combining character sequence, and earlier by pushing UTF-16
    as the fundamental representation of text (which led to such horrible
    descriptions as http://www.xml.com/axml/notes/Surrogates.html).

    XML is just an example of a standard which must decide:
    A. What is the unit of text processing? (a code point? a combining
       character sequence? something else? hopefully not a UTF-16 code unit)
    B. Which (sequences of) characters are valid when present in the raw
       source, i.e. what UTF-n really means?
    C. Which (sequences of) characters can be formed by specifying a
       character number?

    A programming language must do the same.

    The language Kogut, which I'm designing and developing, uses Unicode
    as its string representation, but the details can still be changed.
    I want rules which are "correct" as far as Unicode is concerned, and
    which are simple enough to be practical (e.g. if a standard forced me
    to make the conversion from a code point number to an actual character
    contextual, or forced me to unconditionally unify precomposed and
    decomposed characters, then I would give up rather than support a
    broken standard).

    Internal text processing in a programming language can be more
    permissive than an application of such processing, like XML parsing:
    if a particular character is valid in UTF-8 but disallowed by XML,
    that's fine, it can be rejected at a later stage. It must not be more
    restrictive, however, as that would make it impossible to implement
    XML parsing in terms of string processing.
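
    For concreteness, here is roughly what I mean, sketched in Python
    rather than Kogut (the function name is invented; the ranges are the
    XML 1.0 "Char" production): the string layer stores any code point,
    and only the XML stage rejects what XML disallows.

        # Sketch: a permissive string layer plus a stricter XML check.
        # is_xml_char is a made-up name; the ranges are XML 1.0 "Char":
        # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        def is_xml_char(cp):
            return (cp in (0x09, 0x0A, 0x0D)
                    or 0x20 <= cp <= 0xD7FF
                    or 0xE000 <= cp <= 0xFFFD
                    or 0x10000 <= cp <= 0x10FFFF)

        text = "abc\x01\uffff"   # fine as a sequence of code points
        rejected = [c for c in text if not is_xml_char(ord(c))]
        # The string type happily holds U+0001 and U+FFFF; the XML
        # parsing stage is where they get rejected.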

    Regarding A, I see three choices:
    1. A string is a sequence of code points.
    2. A string is a sequence of combining character sequences.
    3. A string is a sequence of code points, but it's encouraged
       to process it in groups of combining character sequences.

    I'm afraid that anything other than a mixture of 1 and 3 is too
    complicated to be widely used. Almost everybody represents strings
    either as code points, or as even lower-level units like UTF-16 code
    units. And while 2 is nice from the user's point of view, it's a
    nightmare from the programmer's point of view:
    - Unicode character properties (like general category, character
      name, digit value) are defined in terms of code points. Choosing
      2 would immediately require two-stage processing: a string is
      a sequence of sequences of code points.
    - Unicode algorithms (like collation, case mapping, normalization)
      are specified in terms of code points.
    - Data exchange formats (UTF-n) are always closer to code points
      than to combining character sequences.
    - Code points have a finite domain, so you can make dictionaries
      indexed by code points; for combining character sequences we would
      be forced to write functions which *compute* the relevant property
      based on the structure of such a sequence.

    I don't believe 2 is workable at all. The question is how to make 3
    convenient enough to be used more often. Unfortunately it's much
    harder than 1, unless strings used completely different iteration
    protocols from other sequences. I don't know how to make 3
    convenient.
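
    To make that concrete (in Python rather than Kogut, and only as a
    rough approximation; real segmentation would have to follow the
    Unicode rules for combining character sequences, and the function
    name is invented): option 3 could be offered as an iterator over a
    plain code point string which groups a base character with the
    combining marks that follow it.

        # Rough sketch: iterate a code-point string in groups of
        # "base character plus following combining marks" (general
        # categories Mn, Mc, Me).  Not a full implementation of
        # combining character sequence segmentation.
        import unicodedata

        def combining_sequences(s):
            group = ""
            for ch in s:
                if group and unicodedata.category(ch) not in ("Mn", "Mc", "Me"):
                    yield group
                    group = ""
                group += ch
            if group:
                yield group

        list(combining_sequences("e\u0301a\u0308b"))
        # -> ['e\u0301', 'a\u0308', 'b']: three groups, five code points

    The underlying representation stays a sequence of code points (1),
    but code which cares can walk it in larger units (3).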

    Regarding B in the context of a programming language (not XML),
    chapter 3.9 of the Unicode standard version 4.0 excludes only
    surrogates; it does not exclude non-characters like U+FFFF.
    But non-characters must be excluded somewhere, because otherwise
    U+FFFE at the beginning of a stream would be mistaken for a BOM.
    I'm confused.
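
    What I imagine "excluded somewhere" could look like, as a sketch in
    Python (the names are invented): predicates for the two kinds of
    holes, which a decoder or a validating layer can apply at whatever
    stage the language settles on.

        # Sketch: the two kinds of holes in the code point space.
        def is_surrogate(cp):
            return 0xD800 <= cp <= 0xDFFF

        def is_noncharacter(cp):
            # U+FDD0..U+FDEF, plus the last two code points of each plane
            # (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
            return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

        # is_noncharacter(0xFFFE) is True; a leading U+FFFE that slipped
        # through would look like a byte-swapped BOM.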

    Regarding C, I'm confused too. Should a function which returns the
    character with a given number accept surrogates? I guess not.
    Should it accept non-characters? I don't know. I only know that
    it should not accept values above 0x10FFFF.
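
    As a sketch of the kind of function I mean (Python again; the name
    and the policy flag are invented), surrogates and values above
    0x10FFFF are always rejected, while rejecting non-characters is left
    as an explicit, still undecided policy choice:

        # Sketch of a "character from number" function (made-up name).
        def char_from_number(cp, reject_noncharacters=False):
            if cp < 0 or cp > 0x10FFFF:
                raise ValueError("not a Unicode code point: %#x" % cp)
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("surrogate code point: U+%04X" % cp)
            if reject_noncharacters and (
                    0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE):
                raise ValueError("non-character: U+%04X" % cp)
            return chr(cp)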

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

