Re: Brahmic list ? (was: Oriya: mba / mwa ?)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 30 2003 - 15:15:59 EST

Next message: Michael Everson: "RE: Oriya: mba / mwa ?"

Previous message: Peter Constable: "RE: Oriya: mba / mwa ?"
In reply to: Philippe Verdy: "RE: Brahmic list ? (was: Oriya: mba / mwa ?)"
Next in thread: Philippe Verdy: "RE: Brahmic list ? (was: Oriya: mba / mwa ?)"
Reply: Philippe Verdy: "RE: Brahmic list ? (was: Oriya: mba / mwa ?)"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> Please don't use UTF-8 to encode anything other than Unicode code
>> points.
>
> As long as I use it internally for intermediate processing, I can do
> what I want. For now it is just a convenient way to represent variable
> size integers up to 31 bits (in fact I use it to represent 32-bit
> signed integers, but the two highest bits are equal).

As long as you are sure that this will not leak out into the outside
world, you are free to use the UTF-8 mechanism internally to represent
any type of 31-bit data you like, including this private replacement for
allkeys.txt. (You do know about allkeys.txt, don't you? And the fact
that UCA is heavily customizable?)

It would seem to make sense primarily for retaining ASCII compatibility
and representing smaller values in fewer bytes than larger values, so
you would want to be sure these are your design goals too.
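
To make the trade-off concrete, here is a minimal sketch of the "UTF-8 mechanism" applied to arbitrary integers below 2**31, using the original six-byte bit layout. The function name encode_varint_utf8_style is hypothetical and is not taken from Philippe's code; it only illustrates that small values stay small and ASCII values stay single bytes.

```python
def encode_varint_utf8_style(value: int) -> bytes:
    """Encode 0 <= value < 2**31 with the original (up to 6-byte) UTF-8 bit layout.

    Hypothetical helper for illustration only; the output is NOT guaranteed
    to be valid UTF-8, because values above 0x10FFFF are not code points.
    """
    if not 0 <= value < 2**31:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        # ASCII range: one byte, identical to the value itself.
        return bytes([value])
    # (payload bits available, lead-byte marker, number of continuation bytes)
    layouts = ((11, 0xC0, 1), (16, 0xE0, 2), (21, 0xF0, 3),
               (26, 0xF8, 4), (31, 0xFC, 5))
    for bits, lead, n_cont in layouts:
        if value < (1 << bits):
            # Lead byte carries the top bits; each continuation byte carries 6.
            out = [lead | (value >> (6 * n_cont))]
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    for v in (0x41, 0x20AC, 0x10FFFF, 0x110000, 2**31 - 1):
        print(hex(v), encode_varint_utf8_style(v).hex(" "))
```

For values that happen to be Unicode scalar values the bytes match real UTF-8, which is exactly why such data is easy to mistake for text once it escapes.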

But things like this do have a tendency to leak into the outside world,
and if this ever happens with your collation keys, you will have
unleashed something like CESU-8 that fails the "duck test": it walks and
talks like UTF-8, but it's not.
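
The duck test can be sketched in a few lines. The cesu8_encode helper below is a hypothetical illustration, assuming the usual description of CESU-8 (supplementary characters are split into a UTF-16 surrogate pair and each surrogate is written as a three-byte sequence). It is not a reference implementation.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the supplementary code point into a UTF-16 surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    sample = "A\U00010400"
    print("UTF-8 :", sample.encode("utf-8").hex(" "))
    print("CESU-8:", cesu8_encode(sample).hex(" "))
    try:
        cesu8_encode(sample).decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict UTF-8 decoder rejects it:", exc.reason)
```

For text confined to the BMP the two outputs are byte-for-byte identical, so the difference surfaces only when a supplementary character appears.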

> Of course if I still use it to represent something else than
> codepoints in some published data or text, I will rename it and won't
> keep the same charset label. But it's highly probable that this will
> not be the most efficient representation (due to its byte-oriented
> splitting), and a more compact or easier to process serialization
> could require an alternate encoding scheme (or transfer syntax).

This is a *much* better solution, whether it is the most efficient
representation or not. CESU-8 is a classic and notorious example of a
UTF-8-like encoding that could have been kept private and internal,
where it belonged, but instead was "leaked" forcefully into the outside
world, to the point where it was assigned an IANA charset label.

UTF-8 can be auto-detected more or less reliably, and has achieved
widespread use throughout the computing world. Please do not use it, or
any extension of it, for representing anything other than Unicode code
points.
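
As a rough illustration of "more or less reliably": one common heuristic, assumed here rather than prescribed by any standard, is simply to attempt a strict UTF-8 decode and treat success as a positive. The name looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: does the byte string decode as strict UTF-8?

    Hypothetical helper. Pure ASCII also passes, so a True result
    means "consistent with UTF-8", not proof of the sender's intent.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # well-formed UTF-8
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # stray 0xE9 byte
```

The heuristic works because text in legacy 8-bit encodings rarely forms valid multi-byte sequences by accident. Look-alike encodings that carry other data weaken that assumption.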

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/


This archive was generated by hypermail 2.1.5 : Sun Nov 30 2003 - 15:54:22 EST