Markus Kuhn wrote:
> What is this UTF-8S please? The term showed up on
> linux-utf8@nl.linux.org in
> http://mail.nl.linux.org/linux-utf8/2001-06/msg00037.html
I am the guilty one. Sorry for the inconvenience, but I did it precisely so
that someone on the Linux UTF-8 list would ask the question you just
asked.
Short answers are all four letters long, so I'll try a longer one.
UTF-8S is a proposal by Oracle, PeopleSoft, et al. to define a new UTF. It
looks like UTF-8, but characters above U+FFFF are represented as two 3-byte
sequences, one for each of the two UTF-16 surrogate code units.
The reason for this proposal, according to Oracle and PeopleSoft, is to
allow UTF-8 databases to have a binary sort order identical to that of
UTF-16 databases.
So, e.g., U+10000 in UTF-8S is <ED A0 80 ED B0 80>, which is a UTF-8-like
re-encoding of the UTF-16 surrogate pair <D800 DC00>.
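To make the byte layout concrete, here is a minimal Python sketch based only
on the description above (I have no specification to check it against). The
helper names utf16_units, encode_unit, utf8s_encode and hexdump are made up
for illustration. It also compares binary sort orders for one pair of
characters, U+10000 and U+FF5E, chosen as an example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of well-formed text as a list of ints."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def encode_unit(unit):
    """Encode one 16-bit code unit using the ordinary UTF-8 bit layout."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    # Surrogates D800..DFFF land here and become 3-byte sequences (ED xx xx).
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def utf8s_encode(text):
    """Assumed UTF-8S: every UTF-16 code unit is encoded on its own."""
    return b"".join(encode_unit(u) for u in utf16_units(text))


def hexdump(data):
    return " ".join("%02X" % b for b in data)


print(hexdump(utf8s_encode("\U00010000")))    # ED A0 80 ED B0 80
print(hexdump("\U00010000".encode("utf-8")))  # F0 90 80 80

# Binary sort order of a supplementary character versus U+FF5E.
pair = ["\U00010000", "\uFF5E"]
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf8s = sorted(pair, key=utf8s_encode)
print(by_utf8s == by_utf16)  # True: UTF-8S bytes sort like UTF-16
print(by_utf8 == by_utf16)   # False: genuine UTF-8 sorts by code point
```

For characters up to U+FFFF the output is byte-for-byte the same as genuine
UTF-8; only the supplementary characters differ.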
Lots of people on the Unicode List are arguing that this new UTF-8S will
soon be confused with "genuine" UTF-8, and that it will cause a lot of
problems for everybody.
In particular, UTF-8 to UTF-32 converters (which are at the core of, e.g.,
mbrtowc) may soon be presented with "irregular" UTF-8 data that is
actually mislabeled UTF-8S.
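As a rough illustration of what such a converter would face, here is a small
Python sketch. It uses Python's built-in codecs rather than mbrtowc, and the
function names strict_decode and lenient_decode are hypothetical. The strict
path treats the surrogate sequences as errors; the lenient path accepts them
and joins the two halves back into one character.

```python
# U+10000 as UTF-8S would encode it, labeled (wrongly) as UTF-8.
data = bytes.fromhex("EDA080EDB080")


def strict_decode(raw):
    """Decode as genuine UTF-8; return None if the data is rejected."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_decode(raw):
    """Accept encoded surrogates, then recombine them via UTF-16."""
    halves = raw.decode("utf-8", "surrogatepass")  # two lone surrogates
    as_utf16 = halves.encode("utf-16-be", "surrogatepass")
    return as_utf16.decode("utf-16-be")            # pairs are joined here


print(strict_decode(data))                   # None
print("U+%04X" % ord(lenient_decode(data)))  # U+10000
```

Which of the two behaviours a given converter should have is exactly what the
argument is about; the sketch only shows that they give different results for
the same bytes.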
To help this confusion along, the Oracle database labels UTF-8S as "UTF8",
while genuine UTF-8 is labeled with the weird acronym "AL32UTF8".
> without any link to a proposal document. Google and Altavista
> don't know the term either.
I wish I could see such a document myself. Ken Whistler said that trying to
find out what this proposal actually proposes is like "pulling teeth" out
of Oracle people's mouths.
> The unicode@unicode.org archive on
>
> ftp://ftp.unicode.org/Public/MailArchive/
>
> is utterly useless, [...]
This is better:
http://groups.yahoo.com/group/unicode/messages
Warning: it is JUST an archive! Don't post there.
The thread has been renamed several times; look for subjects containing
"UTF-8s", "UTF-8 syntax", and "AL32UTF8".
> Please do not forget to *always* include the original document URL or
> similar introductory information to cross-posts to other lists! Cut &
> paste of URLs really isn't that difficult, so make a habit of it,
> please!
I wish there were some URL to cut & paste. I am afraid that you'll have to
follow the issue as it comes.
> So is this UTF-8S something useful, or just yet another
> political-correctness exercise like UTF-32 was?
Useful? The reason I cross-posted to the Linux list was to warn you that, if
accepted, UTF-8S could potentially undermine all the effort being made to
Unicodicize Linux!
_ Marco
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT