Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Joel Rees (rees@server.mediafusion.co.jp)
Date: Thu Feb 22 2001 - 21:06:59 EST


Hi, Carl,

> Joel,
>
> Your comment about Microsoft having pie in its face is a bit puzzling. They
> based NT on Unicode 1.0, and Windows 2000, which was sent to manufacturing 15
> months ago, has surrogate support. For all its faults MS has been a big
> promoter of Unicode.

Sometimes I run off at the mouth. Casually including Microsoft in the same
aspersion as made against Sun was partially inaccurate. Sun's sins I could
count if I had reason to. Microsoft's no one can. So I get lazy. (Not that I
could count my own sins if I had to. ;-> )

What I was trying to wave my hands at is that the first thing an engineer
may want to do upon hearing the words "16 bit width" is start planning all
sorts of 257-element tables to be expanded to 65537 elements, with a feeling
of relief and confidence because the memory in today's machines can handle
such big tables. There is a certain technological momentum that needs to be
overcome. Those who have been with Unicode from the beginning seem to have
overcome certain parts of that momentum (like the hope that the C function
isspace() could be appropriately applied to international text). Those who
are just now getting introduced need a lot of patience, and a lot of
accurate answers in small bites. (And a lot of example source code that
really works.)
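
A rough sketch of the table trap, for illustration only. The class and
method names are made up for this example, and it leans on the code point
methods of java.lang.Character and String, which are assumed to be
available (they arrived in Java releases later than this message). It
builds exactly the kind of 65536-entry table described above, indexed by
16-bit unit, and compares it with a test made on whole code points:

```java
final class LetterTable {
    // The tempting design: one flag per 16-bit code unit, 65536 entries.
    private static final boolean[] TABLE_16 = new boolean[0x10000];

    static {
        for (int u = 0; u < TABLE_16.length; u++) {
            TABLE_16[u] = Character.isLetter((char) u);
        }
    }

    // Looks up a single UTF-16 code unit; surrogates are never letters here.
    static boolean isLetterByUnit(char unit) {
        return TABLE_16[unit];
    }

    // Counts letters one 16-bit unit at a time, as the table design invites.
    static int countLettersByUnit(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (TABLE_16[s.charAt(i)]) {
                n++;
            }
        }
        return n;
    }

    // Counts letters one code point at a time; a surrogate pair is one unit of text.
    static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp);
        }
        return n;
    }
}
```

Given a string holding "a" followed by a letter from outside plane 0
(U+10400, for instance), the per-unit count misses the second letter
because neither surrogate half is a letter on its own, while the per-code
point count finds both. No 16-bit table can fix that, however much memory
is available.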

I was trying to motivate the patience, but I see I probably missed my
target.

> What burns me up is Sun implementing a non-Unicode wchar_t or worse yet
> Oracle proposing to encode surrogates into UTF-8 as two UCS-2 characters.
> This bastardized UTF-8 will probably decode properly but it will not deal
> with properly encoded UTF-8. Also non-plane 0 characters will take 6 bytes
> instead of 4.

Patience. Have you tried showing them some real Java source using UTF-32 as a
pivot? If they can see that converting first to UTF-32 is not significantly
less efficient, and that the resulting source is more maintainable, and that
it gives a different answer, etc., it might be easier to push against the
desire to handle surrogate pairs too early in the transform parse.
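
Something along these lines might serve as a starting point. It is only a
sketch: the class and method names are invented for illustration, and
unpaired surrogates and out-of-range values are not validated, which real
code would have to do. Surrogate pairs are combined exactly once, in the
UTF-16 to UTF-32 step, so the UTF-8 step never has to think about them.
The last method shows, for contrast, what happens when each 16-bit unit is
pushed through the UTF-8 step on its own:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class Utf32Pivot {
    // Step 1: UTF-16 code units to UTF-32 code points.
    // This is the only place that knows about surrogate pairs.
    static int[] utf16ToUtf32(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char hi = units[i];
            boolean paired = hi >= 0xD800 && hi <= 0xDBFF
                    && i + 1 < units.length
                    && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
            if (paired) {
                char lo = units[++i];
                out[n++] = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            } else {
                out[n++] = hi; // plane 0; unpaired surrogates pass through unchecked
            }
        }
        return Arrays.copyOf(out, n);
    }

    // Step 2: UTF-32 code points to UTF-8 bytes, one to four bytes each.
    static byte[] utf32ToUtf8(int[] codePoints) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codePoints) {
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // The whole conversion, with UTF-32 as the pivot.
    static byte[] utf16ToUtf8(char[] units) {
        return utf32ToUtf8(utf16ToUtf32(units));
    }

    // For contrast only: each 16-bit unit encoded separately, so a
    // surrogate pair becomes two 3-byte sequences (6 bytes, not 4).
    static byte[] perUnitToUtf8(char[] units) {
        int[] asIs = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            asIs[i] = units[i];
        }
        return utf32ToUtf8(asIs);
    }
}
```

For U+10000 (surrogates D800 DC00), the pivot route gives the four bytes
F0 90 80 80, while the per-unit route gives ED A0 80 ED B0 80, six bytes
that a strict UTF-8 decoder should not accept.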

You wouldn't want to show them C source, especially if you use a macro to
optimize direct conversion. I'm sure that wouldn't go down well at all.

Joel Rees, Media Fusion
Amagasaki, Japan


