From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Dec 02 2003 - 19:05:09 EST
Doug Ewell wrote:
> Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
>
> Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
> fixed. Plain and simple. If a system like Tcl only supports the BMP,
> that is its choice, but it *must not* accept non-shortest UTF-8 forms or
> output CESU-8 disguised as UTF-8.
I agree with you. I just want to make the point that the implementation is
not "< 1%" of the work.
>
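As a rough illustration of what "valid UTF-8" demands of a decoder, here is a minimal Python sketch. It is not Tcl's code, and the function name decode_utf8_strict is hypothetical. It accepts 4-byte sequences and rejects non-shortest forms and encoded surrogates (the CESU-8 pattern).

```python
def decode_utf8_strict(data):
    """Decode UTF-8 bytes into a list of code points.

    Raises ValueError on ill-formed input: stray or missing continuation
    bytes, non-shortest forms, encoded surrogates, values above U+10FFFF.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point this length is allowed to encode.
        if b0 < 0x80:
            length, cp, min_cp = 1, b0, 0x0
        elif 0xC0 <= b0 <= 0xDF:
            length, cp, min_cp = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, min_cp = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            length, cp, min_cp = 4, b0 & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte near offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            raise ValueError("non-shortest form at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded surrogate (CESU-8 style) at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point above U+10FFFF at offset %d" % i)
        out.append(cp)
        i += length
    return out


if __name__ == "__main__":
    # U+1F600 as a real 4-byte UTF-8 sequence is accepted.
    print([hex(c) for c in decode_utf8_strict(b"\xf0\x9f\x98\x80")])
    # The same character as two 3-byte encoded surrogates is rejected.
    try:
        decode_utf8_strict(b"\xed\xa0\xbd\xed\xb8\x80")
    except ValueError as err:
        print("rejected:", err)
```

The decoder itself is small. What it does not show is everything downstream that assumes one 16-bit unit per character.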
> > If you still think adding 4-byte UTF-8 support is < 1% of the task,
> > then please join the Tcl project and help me fix that. I appreciate
> > your efforts there and I believe a lot of people will thank you for your
> > contribution.
>
> I'll be happy to supply UTF-8 code that handles 4-byte sequences. That
> is not the same thing as converting an entire system from 16-bit to
> 32-bit integers, or adding proper UTF-16 surrogate support to a
> UCS-2-only system. Of course that is more work.
Your view is based on the assumption that the internal code is UCS-4 instead
of UTF-16.
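When the internal code is UTF-16, every code point above U+FFFF has to be carried as a surrogate pair. A minimal Python sketch of that mapping follows; the function names to_surrogates and from_surrogates are hypothetical, and this only covers the conversion arithmetic, not the string APIs that would also need to become pair-aware.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) pair of UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    hi, lo = to_surrogates(0x1F600)
    print(hex(hi), hex(lo), hex(from_surrogates(hi, lo)))
```

A UCS-2-only system treats each of those two units as a separate character, which is why indexing, length, and substring operations all have to be revisited.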
>
> Remember, AGAIN, that this thread was originally about taking an
> application like MySQL that did not support Unicode at all, and adding
> Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I
> can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
> that you'll have to go back and fix them some day. That is certainly
> more work than adding support for the full Unicode range at once. I
> think you thought I said the opposite, that such retrofitting is easy,
> and are now trying hard to disprove it.
There is nothing wrong with people choosing UTF-16 instead of UCS-4 in the
API, even in 2003. Do you agree?
If people do use UTF-16 in the API, it is natural for those who care
about the BMP but not about Planes 1-16 to work only on the BMP, right? I am
not saying they are doing the right thing. I am saying they are doing the
"natural" thing. Remember, the text about surrogates in the Unicode 4.0
standard is probably only 5-10 pages in total out of that 1462-page standard.
For developers who are not going to implement the other 1000 pages correctly,
it is natural to think, "Why do I need to get these 10 pages right?"
>
> > double your memory cost and size from UTF-8. x4 of the size for your
> > ASCII data. Changing the implementation of an ASCII-compatible
> > application to UTF-16 is already hard, since people who only care about
> > ASCII will be upset that the data size is x2 for all "their" data. It is
> > already a hard battle most of the time for someone like me. If we tell
> > them to change to UCS-4, that means they need not only x2 the memory but
> > x4 the memory.
>
> I can't fight this battle with people who would rather stay with ASCII
> and 7/8 bits per character. They are not living in a Unicode world.
But what about the UTF-16 vs. UCS-4 battle?
>
> 1024 × 768 screen resolution takes 150% more display memory than 640 ×
> 480, too.
>
> > For web services or applications which spend multiple millions on
> > memory and databases, it means adding millions of dollars to their cost.
> > They may have to add some millions in cost to support international
> > customers by using UTF-16. They probably are willing to add multiple
> > millions of dollars in cost to change it to use UCS-4. In fact, there are
> > people who proposed storing UTF-8 in a hacky way in the database
> > instead of using UTF-16 or UCS-4 to save cost. They have to add
> > restrictions on using the API and build an upper-level API to do
> > conversion and hacky operations. That means it will introduce some fixed
> > (not dependent on the size of the data) development cost to the project,
> > but it will save millions of dollars of memory cost, which depends on the
> > size of the data. I don't like that approach, but usually my word and
> > what is "right" are less important than multiple millions of dollars for
> > a commercial company.
>
> I would truly be surprised if full 17-plane Unicode support in a single
> app could be demonstrated to be a matter of "multiple millions of
> dollars."
It is not the full 17-plane Unicode support that contributes to it.
What contributes to it is

  (number of ASCII-only records x sizeof(record in UCS-4))
    - (number of ASCII-only records x sizeof(record in ASCII))

compared to

  (number of ASCII-only records x sizeof(record in UTF-8))
    - (number of ASCII-only records x sizeof(record in ASCII))

or

  (number of ASCII-only records x sizeof(record in UTF-16))
    - (number of ASCII-only records x sizeof(record in ASCII))

The other comparison is

  (number of BMP-only records x sizeof(record in UCS-4))
    - (number of BMP-only records x sizeof(record in UTF-8))

  (number of BMP-only records x sizeof(record in UCS-4))
    - (number of BMP-only records x sizeof(record in UTF-16))

Of course, sizeof() here really means the "average size of a record with
that data".
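The comparisons above can be sketched in a few lines of Python. The sample records and the function names encoded_size and extra_cost are made up for illustration; "utf-32-le" stands in for UCS-4 and "utf-16-le" for UTF-16, both without a byte order mark. Real numbers would depend on the actual data.

```python
def encoded_size(records, encoding):
    """Total bytes needed to store all records in the given encoding."""
    return sum(len(r.encode(encoding)) for r in records)


def extra_cost(records, candidate, baseline):
    """Bytes added by storing the records in `candidate` rather than
    `baseline` (both are Python codec names)."""
    return encoded_size(records, candidate) - encoded_size(records, baseline)


# Hypothetical sample data, 19 ASCII characters in total.
ASCII_RECORDS = ["hello world", "MySQL", "Tcl"]
# Hypothetical BMP-only data: three CJK and eight Greek characters.
BMP_RECORDS = [
    "\u65e5\u672c\u8a9e",
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
]

if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print("ASCII-only records, extra bytes over ASCII,", enc, ":",
              extra_cost(ASCII_RECORDS, enc, "ascii"))
    for base in ("utf-8", "utf-16-le"):
        print("BMP-only records, extra bytes for UCS-4 over", base, ":",
              extra_cost(BMP_RECORDS, "utf-32-le", base))
```

Multiplying the per-record difference by the number of records gives the storage figure that the cost argument is about.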
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>
--
Frank Yung-Fong Tang
Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta   mailto:ytang0648@aol.com
Tel:650-937-2913
Yahoo! Msg: frankyungfongtan