Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 16:41:25 CST

Next message: John Cowan: "Re: Nicest UTF"

Previous message: John Cowan: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
In reply to: D. Starner: "Re: Nicest UTF"
Next in thread: John Cowan: "Re: Nicest UTF"
Reply: John Cowan: "Re: Nicest UTF"

"D. Starner" <shalesller@writeme.com> writes:

> The semantics there are surprising, but that's true no matter what you
> do. An NFC string + an NFC string may not be NFC; the resulting text
> doesn't have N+M graphemes.

Which implies that automatically NFC-ing strings as they are processed
would be a bad idea. They can be NFC-ed at the end of processing if the
consumer of the data demands it, especially since other consumers
may want NFD instead.
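To make the quoted point concrete, here is a minimal sketch using Python's
standard unicodedata module. The helper is_nfc and the variable names are
made up for illustration; the only claim is that concatenation does not
preserve NFC.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """Return True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

left = "e"        # U+0065, NFC on its own
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT, also NFC on its own
joined = left + right

print(is_nfc(left), is_nfc(right))  # True True
print(is_nfc(joined))               # False: NFC would compose it to U+00E9
print(len(joined), len(unicodedata.normalize("NFC", joined)))  # 2 1
```

Normalizing once at the output boundary, in whichever form the consumer
asks for, avoids renormalizing after every concatenation.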

String equality in a programming language should not treat composed
and decomposed forms as equal. Not at this level of abstraction.
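A sketch of that layering in Python, whose built-in == already compares
code points. The function canonically_equal is a hypothetical helper, not
something from any library; canonical equivalence becomes an explicit,
separate operation that the caller asks for.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(composed == decomposed)                   # False: different code points
print(canonically_equal(composed, decomposed))  # True
```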

IMHO splitting into graphemes is the job of a rendering engine, not of
a function which extracts a part of a string which matches a regex.

> If you do so with a language that includes <, you violate the Unicode
> standard, because ≮ (not <) and < followed by U+0338 are canonically
> equivalent.

I think that Unicode tries to push implications of "equivalence"
too far.

They are supposed to be equivalent when they are actual characters.
What if they are numeric character references? Should "<&#824;"
(7 characters) represent a valid plain-text character or be a broken
opening tag?

Note that if it's a valid plain-text character, it's impossible
to represent isolated combining code points in XML, and thus it's
impossible to use XML for transportation of data which allows isolated
combining code points (except by introducing custom escaping of
course, e.g. transmitting decimal numbers instead of characters).
I expect breakage of XML-based protocols if implementations are
actually changed to conform to these rules (I bet they don't now).

OTOH if it's not a valid plain-text character, then conversion between
numeric character references and actual characters gets hairier.
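The interaction can be tried out with Python's standard library. This is
only a sketch: the element name doc is arbitrary, and it shows what one
parser (xml.etree.ElementTree) does today, not what any specification
requires.

```python
import unicodedata
import xml.etree.ElementTree as ET

# A numeric character reference expands to an isolated U+0338.
isolated = ET.fromstring("<doc>&#824;</doc>").text
print(isolated == "\u0338")  # True

# "&lt;" followed by the reference yields "<" + U+0338 as text, which
# NFC then composes into the single code point U+226E (NOT LESS-THAN).
text = ET.fromstring("<doc>&lt;&#824;</doc>").text
composed_text = unicodedata.normalize("NFC", text)
print(len(text), len(composed_text))  # 2 1

# With the combining character written literally, NFC-normalizing the
# serialized document fuses the ">" of the start tag with U+0338 into
# U+226F (NOT GREATER-THAN), so the markup no longer parses.
literal = "<doc>\u0338</doc>"
normalized = unicodedata.normalize("NFC", literal)
try:
    ET.fromstring(normalized)
    broke = False
except ET.ParseError:
    broke = True
print(broke)  # True
```

So whether a reference and the literal character behave alike depends on
whether normalization happens before or after parsing.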

> I'll see if I have time after finals to pound out a basic API that
> implements this, in Ada or Lisp or something.

My language is quite similar to Lisp semantically.

Implementing an API which works in terms of graphemes over an API
which works in terms of code points is more sane than the converse,
which suggests that the core API should use code points if both APIs
are sometimes needed at all.
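As a rough illustration of the layering, a grapheme-level view can be
built on top of a code point string in a few lines of Python. The function
names clusters and cluster_length are made up here, and the rule used (a
code point with a nonzero canonical combining class attaches to whatever
precedes it) is a deliberate simplification, not the full Unicode
segmentation algorithm.

```python
import unicodedata
from typing import Iterator

def clusters(s: str) -> Iterator[str]:
    """Yield each base code point together with the combining marks after it.

    Simplified: only canonical combining classes are consulted.
    """
    current = ""
    for ch in s:
        if current and unicodedata.combining(ch) != 0:
            current += ch      # attach the mark to the current cluster
        else:
            if current:
                yield current  # the previous cluster is complete
            current = ch       # start a new cluster
    if current:
        yield current

def cluster_length(s: str) -> int:
    """Count clusters rather than code points."""
    return sum(1 for _ in clusters(s))

s = "e\u0301x<\u0338"
print(len(s))             # 5 code points
print(cluster_length(s))  # 3 clusters
```

Going the other way, recovering code point operations from an API that
only exposes clusters, would need every cluster to be taken apart again.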

While I'm not obsessed with efficiency, it would be nice if changing
the API would not slow down string processing too much.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

