RE: Subject: Re: 32'nd bit & UTF-8

From: Richard T. Gillam (rgillam@las-inc.com)
Date: Fri Jan 21 2005 - 10:49:10 CST

Next message: Andy Heninger: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"

Previous message: Peter Kirk: "Re: Conformance (was UTF, BOM, etc)"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Conformance (Was: 32'nd bit & UTF-8)"
Reply: Hans Aberg: "Re: Conformance (Was: 32'nd bit & UTF-8)"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Subject: Re: 32'nd bit & UTF-8"

Hans--

>> Good grief. We seem to be going through another round of "night of
>> the living thread."
>
>Have you found out first now. :-)

No; I've been biting my tongue for several days.

>> Depending on your particular situation,
>> any of the three [UTFs] might be the best fit. There's a reason all
>> three exist.
>
>At least for now. UTF-16 cannot be extended beyond the current range,
>but UTF-8/32 can both be extended to 2^32 numbers, the size of a
>natural type. Even though UTF-16 has a distinct legacy advantage, it
>likely does not have that in the long run. So deprecating it seems to
>be a distinct possibility.

I really wish you'd quit saying this. This simply isn't true. Or, at
the very least, it is EXTREMELY unlikely and very far in the future. As
several other people have already pointed out to you, the Unicode
codespace contains room for about 1.1 million characters. Roughly
150,000 code positions have been set aside for private use or other
special purposes, leaving room for about 1 million actual characters.
Right now, after 15 years of encoding, about 95,000 of those spaces
have been assigned to characters. At the current rate of encoding,
it'll be centuries before the space fills up. If it ever does-- the
consensus seems to be that there just aren't that many things that will
ever merit encoding.
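
For anyone who wants to check those round numbers, here is a rough
tally in Python. It is only a sketch: the range boundaries are the ones
I understand the standard to define (17 planes, the surrogate block,
the three private use areas, and the 66 noncharacters), and the
variable names are my own.

```python
# Rough tally of the Unicode codespace, U+0000..U+10FFFF.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000

codespace = PLANES * CODE_POINTS_PER_PLANE

# Code points set aside and never assigned to ordinary characters.
surrogates = 0xDFFF - 0xD800 + 1              # reserved for UTF-16 pairs
bmp_private_use = 0xF8FF - 0xE000 + 1         # U+E000..U+F8FF
plane_private_use = 2 * (0xFFFD + 1)          # planes 15 and 16, xx0000..xxFFFD
noncharacters = 32 + 2 * PLANES               # U+FDD0..U+FDEF, plus xxFFFE/xxFFFF

set_aside = surrogates + bmp_private_use + plane_private_use + noncharacters
available = codespace - set_aside

print(f"codespace: {codespace:,}")   # codespace: 1,114,112
print(f"set aside: {set_aside:,}")   # set aside: 139,582
print(f"available: {available:,}")   # available: 974,530
```

The exact figures come out a little under the round numbers above, but
the picture is the same: close to a million code points for actual
characters, of which only about a tenth are in use.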

The only thing that would put the codespace in danger of filling up is a
sudden loss in discipline on the part of the committees that maintain
Unicode that turns Unicode into something other than what it's supposed
to be: for example, if people tried to turn Unicode into a generic glyph
registry, or tried to extend it to do styled text, or started allocating
code points for representation of non-text data. The current
committee is EXTREMELY vigilant and won't let these things happen.
People suggest this kind of thing all the time and routinely get slapped
down. It's not that there isn't a need for some of this stuff; it's
just that Unicode isn't the thing that should fill this need. Unicode
is a plain-text character encoding standard. Period. Trying to make it
something else would destroy it.

The space is not going to fill up, and UTF-16 will never have to be
deprecated. Get that notion out of your head once and for all.

>Well, in UTF-8 it has to go away as a requirement to be ignored in
>processes: Either Unicode removes it in the standard, or one will see
>that people just don't bother following the Unicode standard in that
>respect.

Again, many people have addressed this point and you're ignoring them.
UTF-8 HAS NO BOM. There is nothing in the Unicode standard mandating or
even encouraging the use of EF BB BF at the beginning of a UTF-8 file.
That sequence has no special meaning in UTF-8; it's just the UTF-8
encoding of U+FEFF, the zero-width no-break space. FE FF at the top of
a UTF-8 file is just flat illegal, since the bytes FE and FF never
occur in well-formed UTF-8.
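
Both points are easy to confirm. The following is a minimal Python
sketch; is_valid_utf8 is a helper name I made up for illustration, and
it simply relies on Python's strict UTF-8 decoder.

```python
import codecs

# U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded as UTF-8 is EF BB BF.
signature = "\ufeff".encode("utf-8")
assert signature == b"\xef\xbb\xbf" == codecs.BOM_UTF8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# EF BB BF is an ordinary, well-formed UTF-8 sequence.
print(is_valid_utf8(b"\xef\xbb\xbfhello"))  # True

# FE FF (the UTF-16BE byte order mark) is not valid UTF-8 at all.
print(is_valid_utf8(b"\xfe\xffhello"))      # False
```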

The practice of using EF BB BF as a signature to indicate that a
file is in UTF-8 is mentioned in one spot in the standard, but not
encouraged. Some applications (notably Notepad) do this; many do not.
You'll also see it from time to time coming out of an application that
doesn't handle UTF-16 or UTF-32 properly. So EF BB BF at the top of a
UTF-8 file does occur in practice and it's good for software to be aware
of it (but relatively harmless if it isn't). But the fact that it
occurs in practice is a VERY different thing from it being mandated by
Unicode, which it absolutely isn't.
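
"Being aware of it" can be as simple as dropping one leading EF BB BF
before decoding. Here is a sketch in Python; decode_utf8_tolerant is a
hypothetical name of mine, not anything from the standard or from a
library. Python's built-in "utf-8-sig" codec does the same job.

```python
import codecs


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8, dropping a single leading EF BB BF if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_signature = b"\xef\xbb\xbfhello"
without_signature = b"hello"

print(decode_utf8_tolerant(with_signature))     # hello
print(decode_utf8_tolerant(without_signature))  # hello

# A decoder that is unaware of the signature just keeps U+FEFF
# as an invisible first character.
print(repr(with_signature.decode("utf-8")))     # '\ufeffhello'
```

Input without the signature passes through unchanged, so the check
costs nothing for files that never had one.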

I'll respond to your more substantive note after I get back from
lunch...

--Rich Gillam
Language Analysis Systems, Inc.

This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 10:53:37 CST