From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 11 2006 - 19:28:19 CST
> From: "Kenneth Whistler" <kenw@sybase.com>
> >> Another related question: Why isn't there a standard 16-bit UTF
> >> that preserves the binary ordering of codepoints?
> >> (I mean for example UTF-16 modified simply by moving all
> >> code units or code points in E000..FFFF down to D800..F7FF
> >> and moving surrogate code units in D800..DFFF up to F800..FFFF).
> >
> > Huh? Because it would confuse the hell out of everybody and lead
> > to problems, just like any other putative fixes by proliferation
> > of UTFs.
> >
> > Sorting UTF-16 in binary order is easy. See "UTF-16 in UTF-8 Order",
> > p. 136 of TUS 4.0.
>
> I don't say it is not easy to do. What I just indicated is
> that there are applications where one really wants pure binary
> sort order, where it would also be good that it preserves the order
> of codepoints (like with UTF-8 and UTF-32, but not in UTF-16).
So?
Given that UTF-16 doesn't sort in binary code point order for
supplementary characters, you program around the problem if you
need to.
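For concreteness, here is the sort of fix-up that "UTF-16 in UTF-8
Order" is about, sketched in C (the function and variable names are
mine, not lifted from the book): the surrogate range and the
E000..FFFF range trade places on the fly inside the comparison, so
the stored data stays ordinary UTF-16.

    #include <stddef.h>
    #include <stdint.h>

    /* Remap a UTF-16 code unit so that unsigned comparison of the
     * remapped values yields code point (UTF-8/UTF-32 binary) order. */
    static uint16_t fixup(uint16_t u)
    {
        if (u >= 0xE000)
            return (uint16_t)(u - 0x0800);   /* E000..FFFF -> D800..F7FF */
        if (u >= 0xD800)
            return (uint16_t)(u + 0x2000);   /* D800..DFFF -> F800..FFFF */
        return u;
    }

    /* Compare two well-formed UTF-16 strings in code point order. */
    int utf16_cmp_cp_order(const uint16_t *a, size_t alen,
                           const uint16_t *b, size_t blen)
    {
        size_t i, n = (alen < blen) ? alen : blen;
        for (i = 0; i < n; i++) {
            uint16_t ua = fixup(a[i]), ub = fixup(b[i]);
            if (ua != ub)
                return (ua < ub) ? -1 : 1;
        }
        return (alen < blen) ? -1 : (alen > blen) ? 1 : 0;
    }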
Advocating changing the encoding of the *data* to work around a
limitation of an algorithm when dealing with that data strikes
me as just another invitation to bad engineering practice.
>
> Maybe what you are replying there is that Unicode does not want
> to add more standard UTFs,
Well, roughly, yes. More precisely, I would put it that the UTC
is absolutely on record as not wanting to modify (or add to) the
3 standard Unicode Encoding Forms (UTF-8, UTF-16, and UTF-32) in
any way, whatsoever, period, end of story.
> and instead prefer to insist that such UTFs
You are already off the rails here. "UTFs" indefinite plural is
an undefined concept, as far as the UTC is concerned. The Unicode
Standard doesn't define some generic concept of encoding bijection,
call them "UTFs", claim to be standardizing 3 of them, and then
invite others to make up however many more they want to.
The Unicode Standard specifies and standardizes 3 "Unicode Encoding
Forms", which are designed as bijections, and says to conform
to the standard, you use one of those, period.
> should remain private (requiring explicit agreements between users,
> or using private internal interfaces and APIs, so that no public
> standard will need to be standardized).
The UTC can't prevent people from doing whatever odd things pop
into their heads, but I can assure you there isn't any sentiment
on the UTC that implementers should be off making up more
"UTFs" in efforts to solve sorting problems by encoding, and then
exchanging such data over putatively "private", "internal" interfaces.
The chances are better than even that somewhere down the line
such data is going to leak into public contexts and create
data corruptions that somebody *else*, not the originator,
is going to have to deal with.
>
> It's just that alternative UTFs are still possible without
> affecting full conformance with the Unicode standard: with
> the same required properties for all UTFs that they MUST
> preserve the exact encoding of all valid codepoints between
> U+0000 and U+10FFFF, including non-characters,
You're just making this up, right?
> and that they must not change their relative encoding order
> in strings so that all normalization forms and denormalizations
> are preserved,
How are these connected? UTF-16 and UTF-32 don't have the same
"relative encoding order in strings", but do preserve normalization
forms. Again, you're just making this up, right?
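A concrete pair makes the difference plain. U+FF21 is the single
UTF-16 code unit FF21; U+10400 is the surrogate pair D801 DC00. In
UTF-32 (and UTF-8, and code point) order, FF21 sorts before 10400; in
a binary comparison of UTF-16 code units, D801 sorts before FF21, so
the two come out in the opposite order. Normalization, on the other
hand, is defined on code points, so it comes out the same whichever
encoding form carries the data.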
> all this meaning there must exist a bijection between all
> UTFs applied to all Unicode strings.
Who said?
>
> If this is still not clear enough, the standard should insist
> that it documents 3 UTFs explicitly with several byte ordering
> options for endianness,
It says perfectly clearly that it documents 3 Unicode Encoding Forms.
What is unclear about that?
> but this still does not restrict full conformance only to these.
Actually, it does.
> In fact Unicode also approves SCSU and BOCU-8, and because they
> respect the bijection rule, they are already compliant UTFs.
The UTC (not "Unicode") approved SCSU as a Unicode Technical Standard.
It is not part of the Unicode Standard, and it isn't a Unicode
Encoding Form. SCSU is losslessly convertible to/from Unicode,
does *not* sort in code point order, meets the criteria for an
IANA charset, and is not MIME-compatible. It is a stateful encoding,
and is non-deterministic (different encoders may produce different
actual SCSU sequences as output).
BOCU-*1* is not something approved by the UTC at all. It is an
independent specification (not a standard) developed by a member
of the Unicode Consortium. It is not part of the Unicode Standard,
and it isn't a Unicode Encoding Form. BOCU-1 is losslessly convertible
to/from Unicode, *does* sort in code point order, meets the
criteria for an IANA charset, and is MIME-compatible. It is a
stateful encoding, and has deterministic output.
> But it should be clear in the standard that they are just
> examples of valid UTFs,
No, that is not at all clear, nor is that the intent in the standard
whatsoever.
> recommended for interchange across heterogeneous systems or
> networks, and that applications can use their own alternate
> representation, as needed to comply with other needs
The last part of this is certainly true. The use of BOCU-1 as
a compression in a database would be an example of an application
using its own alternate representation of data.
> (for example any attempt to make any standard UTF fit on platforms
> with 64-bit or 80-bit word size would already require an extension,
> which cannot strictly be equal to any standardized UTF, even if
> it's just a simple zero-bit padding, that requires an additional
> specification for the validity of binary interfaces).
The implementation of encoding forms on platforms whose
native word sizes exceed the size of code units has never
been considered an issue of "requir[ing] an extension ...
to any standardized UTF". It is just a special case of the
very general issue (handled by compilers below the level that
most programmers have to worry about) of putting numbers of
defined sizes into registers of defined sizes.
On a Z80 8-bit computer, I would have represented ASCII
"cat" as an array <63 61 74> pushed through registers as:
01100011
01100001
01110100
On a 64-bit processor these days, I would represent ASCII
"cat" as an array <63 61 74> pushed through registers as:
00000000 00000000 00000000 00000000 00000000 00000000 00000000 01100011
00000000 00000000 00000000 00000000 00000000 00000000 00000000 01100001
00000000 00000000 00000000 00000000 00000000 00000000 00000000 01110100
It's still ASCII, and it's still handled logically as 8-bit characters,
although they may get pushed through big registers with lots of zeroes.
On a Z80 8-bit computer, I would have represented UTF-8 for
U+4E8C as an array <E4 BA 8C> pushed through registers as:
11100100
10111010
10001100
And likewise, on the 64-bit processor it would be:
00000000 00000000 00000000 00000000 00000000 00000000 00000000 11100100
00000000 00000000 00000000 00000000 00000000 00000000 00000000 10111010
00000000 00000000 00000000 00000000 00000000 00000000 00000000 10001100
In either case, it is just UTF-8, conformant to the specification
in the Unicode Standard, and neither I nor you should care how many
bits got set to zero in the register when the load register instruction
was executed by the hardware. The guys who write assembly code and
microcode on chips may need to care -- the rest of us don't.
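If you want to see that in ordinary C (a throwaway sketch, nothing
from the standard itself; the array name is mine), the zero extension
happens in the implicit conversion, and the bytes remain the same
UTF-8 regardless of register width:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint8_t utf8[] = { 0xE4, 0xBA, 0x8C };   /* UTF-8 for U+4E8C */
        size_t i;
        for (i = 0; i < sizeof utf8; i++) {
            uint64_t reg = utf8[i];    /* zero-extended into 64 bits */
            printf("%016llx\n", (unsigned long long)reg);
        }
        return 0;
    }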
--Ken