Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Dec 11 2004 - 04:20:20 CST

    "Philippe Verdy" <verdy_p@wanadoo.fr> writes:

    [...]
    > This was later amended in an errata for XML 1.0 which now says that
    > the list of code points whose use is *discouraged* (but explicitly
    > *not* forbidden) for the "Char" production is now:
    [...]

    Ugh, it's a mess...

    IMHO Unicode is partially to blame: by introducing various kinds of
    holes in code point numbering (non-characters, surrogates), by not
    being clear about when the unit of processing should be a code point
    and when a combining character sequence, and earlier by pushing UTF-16
    as the fundamental representation of text (which led to such horrible
    descriptions as http://www.xml.com/axml/notes/Surrogates.html).

    XML is just an example of a standard which must decide:
    A. What is the unit of text processing? (a code point? a combining
       character sequence? something else? hopefully not a UTF-16 code unit)
    B. Which (sequences of) characters are valid when present in the raw
       source, i.e. what UTF-n really means?
    C. Which (sequences of) characters can be formed by specifying a
       character number?

    A programming language must do the same.

    The language Kogut, which I'm designing and developing, uses Unicode
    as its string representation, but the details can still be changed.
    I want rules which are "correct" as far as Unicode is concerned, and
    which are simple enough to be practical (e.g. if a standard forced me
    to make the conversion from a code point number to an actual character
    contextual, or forced me to unconditionally unify precomposed and
    decomposed characters, then I would give up rather than support a
    broken standard).

    Internal text processing in a programming language can be more
    permissive than an application of such processing, like XML parsing:
    if a particular character is valid in UTF-8 but disallowed by XML,
    that's fine, it can be rejected at a later stage. It must not be more
    restrictive, however, as that would make it impossible to implement
    XML parsing in terms of string processing.
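
    For concreteness, here is roughly what I mean, sketched in Python
    rather than Kogut (the function name is invented; the ranges are the
    XML 1.0 "Char" production): the string layer stores any code point,
    and only the XML stage rejects what XML disallows.

        # Sketch: a permissive string layer plus a stricter XML check.
        # is_xml_char is a made-up name; the ranges are XML 1.0 "Char":
        # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        def is_xml_char(cp):
            return (cp in (0x09, 0x0A, 0x0D)
                    or 0x20 <= cp <= 0xD7FF
                    or 0xE000 <= cp <= 0xFFFD
                    or 0x10000 <= cp <= 0x10FFFF)

        text = "abc\x01\uffff"   # fine as a sequence of code points
        rejected = [c for c in text if not is_xml_char(ord(c))]
        # The string type happily holds U+0001 and U+FFFF; the XML
        # parsing stage is where they get rejected.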

    Regarding A, I see three choices:
    1. A string is a sequence of code points.
    2. A string is a sequence of combining character sequences.
    3. A string is a sequence of code points, but it's encouraged
       to process it in groups of combining character sequences.

    I'm afraid that anything other than a mixture of 1 and 3 is too
    complicated to be widely used. Almost everybody represents strings
    either as code points, or as even lower-level units like UTF-16 code
    units. And while 2 is nice from the user's point of view, it's a
    nightmare from the programmer's point of view:
    - Unicode character properties (like general category, character
      name, digit value) are defined in terms of code points. Choosing
      2 would immediately require two-stage processing: a string is
      a sequence of sequences of code points.
    - Unicode algorithms (like collation, case mapping, normalization)
      are specified in terms of code points.
    - Data exchange formats (UTF-n) are always closer to code points
      than to combining character sequences.
    - Code points have a finite domain, so you can make dictionaries
      indexed by code points; for combining character sequences we would
      be forced to write functions which *compute* the relevant property
      based on the structure of such a sequence.

    I don't believe 2 is workable at all. The question is how to make 3
    convenient enough to be used more often. Unfortunately it's much
    harder than 1, unless strings used completely different iteration
    protocols from other sequences. I don't know how to make 3
    convenient.
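
    To make that concrete (in Python rather than Kogut, and only as a
    rough approximation; real segmentation would have to follow the
    Unicode rules for combining character sequences, and the function
    name is invented): option 3 could be offered as an iterator over a
    plain code point string which groups a base character with the
    combining marks that follow it.

        # Rough sketch: iterate a code-point string in groups of
        # "base character plus following combining marks" (general
        # categories Mn, Mc, Me).  Not a full implementation of
        # combining character sequence segmentation.
        import unicodedata

        def combining_sequences(s):
            group = ""
            for ch in s:
                if group and unicodedata.category(ch) not in ("Mn", "Mc", "Me"):
                    yield group
                    group = ""
                group += ch
            if group:
                yield group

        list(combining_sequences("e\u0301a\u0308b"))
        # -> ['e\u0301', 'a\u0308', 'b']: three groups, five code points

    The underlying representation stays a sequence of code points (1),
    but code which cares can walk it in larger units (3).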

    Regarding B in the context of a programming language (not XML),
    chapter 3.9 of the Unicode standard version 4.0 excludes only
    surrogates; it does not exclude non-characters like U+FFFF.
    But non-characters must be excluded somewhere, because otherwise
    U+FFFE at the beginning of a stream would be mistaken for a BOM.
    I'm confused.
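
    What I imagine "excluded somewhere" could look like, as a sketch in
    Python (the names are invented): predicates for the two kinds of
    holes, which a decoder or a validating layer can apply at whatever
    stage the language settles on.

        # Sketch: the two kinds of holes in the code point space.
        def is_surrogate(cp):
            return 0xD800 <= cp <= 0xDFFF

        def is_noncharacter(cp):
            # U+FDD0..U+FDEF, plus the last two code points of each plane
            # (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
            return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

        # is_noncharacter(0xFFFE) is True; a leading U+FFFE that slipped
        # through would look like a byte-swapped BOM.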

    Regarding C, I'm confused too. Should a function which returns the
    character with a given number accept surrogates? I guess not.
    Should it accept non-characters? I don't know. I only know that
    it should not accept values above 0x10FFFF.
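
    As a sketch of the kind of function I mean (Python again; the name
    and the policy flag are invented), surrogates and values above
    0x10FFFF are always rejected, while rejecting non-characters is left
    as an explicit, still undecided policy choice:

        # Sketch of a "character from number" function (made-up name).
        def char_from_number(cp, reject_noncharacters=False):
            if cp < 0 or cp > 0x10FFFF:
                raise ValueError("not a Unicode code point: %#x" % cp)
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("surrogate code point: U+%04X" % cp)
            if reject_noncharacters and (
                    0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE):
                raise ValueError("non-character: U+%04X" % cp)
            return chr(cp)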

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

