RE: Cost of transition to UTF-8 for central census authorities

From: ktadenev@ups.com
Date: Mon Jan 12 2009 - 09:45:45 CST

Next message: Michael D'Errico: "Re: Emoji: emoticons vs. literacy"

Previous message: ktadenev@ups.com: "RE: Cost of transition to UTF-8 for central census authorities"
In reply to: Adam Twardoch: "Re: Cost of transition to UTF-8 for central census authorities"
Next in thread: Doug Ewell: "Re: Cost of transition to UTF-8 for central census authorities"

Major database management systems support UTF-8 and/or UTF-16, not UTF-32.
Java's external string representation is UTF-8 in most cases, and the default encoding for XML is UTF-8.
In essence, one needs to study the technology stack involved to pick the best Unicode implementation.

Konstantin Tadenev

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Adam Twardoch
Sent: Sunday, January 11, 2009 11:01 AM
To: Trond Trosterud; Unicode List
Subject: Re: Cost of transition to UTF-8 for central census authorities

Trond Trosterud wrote:
> 1. The field length in the database will be longer than the display
> field. So, given a surname "Årø", we will have a display length of 3
> (letters), as compared to the database length of 5 bytes.
> 2. There will have to be a new sorting routine, and a new search routine.
> 3. Programs may no longer search for characters as single bytes, but
> must in some cases allow for searching for sequences of bytes.

All of the above can be solved by using UTF-32 rather than UTF-8. Sure,
the size of the data will grow 4x but the software will be "easier". Or
at least, the software should be migrated to use UTF-32 (i.e.
scalar-based Unicode) *internally* and convert from UTF-8 as early as
possible, and convert to UTF-8 as late as possible.
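
To make the byte counts concrete, here is a small Python sketch (not part
of the original message). It uses the surname from Trond's example and
assumes the precomposed forms of the letters; the helper name first_letter
is made up for illustration. It also shows the "convert early, convert
late" idea: the helper works on code points internally and touches UTF-8
only at its edges.

```python
# "Årø", written with escapes so the precomposed code points are explicit.
surname = "\u00c5r\u00f8"

latin1 = surname.encode("iso-8859-1")
utf8 = surname.encode("utf-8")
utf32 = surname.encode("utf-32-le")  # explicit byte order, so no BOM is added

print(len(surname))  # 3 code points (the display length)
print(len(latin1))   # 3 bytes
print(len(utf8))     # 5 bytes
print(len(utf32))    # 12 bytes, i.e. 4x the ISO 8859-1 size


def first_letter(raw_utf8: bytes) -> bytes:
    """Hypothetical helper: return the first letter of a UTF-8 field."""
    text = raw_utf8.decode("utf-8")   # convert from UTF-8 as early as possible
    initial = text[:1]                # index by code point, not by byte
    return initial.encode("utf-8")    # convert to UTF-8 as late as possible


# Slicing the raw bytes would cut the two-byte letter in half.
print(utf8[:1])            # b'\xc3'
print(first_letter(utf8))  # b'\xc3\x85'
```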

The advantage of using UTF-32 in the "new" storage rather than UTF-8 is
that with UTF-8, it is relatively easy to confuse (for either software
or human) whether the data is actually UTF-8 or still ISO 8859-1. With
UTF-32, it is much more obvious and striking. I believe debugging code
that deals with UTF-32 is much easier than debugging code that deals
with UTF-8.
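
A rough Python illustration of that point (my own sketch, not taken from
any registry software): UTF-8 bytes misread as ISO 8859-1 decode without
any error and merely look slightly off, while UTF-32 bytes misread the
same way are full of NUL characters. The variable names are arbitrary.

```python
surname = "\u00c5r\u00f8"  # "Årø"

utf8_bytes = surname.encode("utf-8")
utf32_bytes = surname.encode("utf-32-le")  # explicit byte order, no BOM

# Misreading UTF-8 as ISO 8859-1 succeeds silently and yields 5 characters.
misread_utf8 = utf8_bytes.decode("iso-8859-1")
print(repr(misread_utf8))

# Misreading UTF-32 as ISO 8859-1 yields three NULs after every letter.
misread_utf32 = utf32_bytes.decode("iso-8859-1")
print(repr(misread_utf32))
print(misread_utf32.count("\x00"))  # 9

# Pure ASCII data is byte-for-byte identical in UTF-8 and ISO 8859-1,
# so such records give no hint about which encoding they are in.
ascii_same = "Hansen".encode("utf-8") == "Hansen".encode("iso-8859-1")
print(ascii_same)  # True
```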

For example, I've recently dealt with custom UTF-8 software solutions
and at some point I discovered that very rarely, problems were creeping
in because the scalar-to-UTF-8 conversion only worked well for BMP
scalar values.
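
For illustration only, a scalar-to-UTF-8 conversion in Python might look
like the sketch below. This is not the software mentioned above, and the
function name scalar_to_utf8 is hypothetical. A converter that stops at
the three-byte branch handles only BMP scalar values; the last branch is
the one that covers the supplementary planes.

```python
def scalar_to_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:  # 1 byte: U+0000..U+007F
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: U+0080..U+07FF
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: rest of the BMP
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: U+10000..U+10FFFF, the case a BMP-only converter misses
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in encoder, including non-BMP values.
for cp in (0x41, 0xC5, 0x20AC, 0x10000, 0x1F600, 0x10FFFF):
    assert scalar_to_utf8(cp) == chr(cp).encode("utf-8")
```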

> c. The problem with the discussion is that the experts within the
> registries are presenting their conclusions, and not the premises behind
> them. Politicians listening to them are thus lost. I am invited to
> comment on the process, but it is not easy, as I get so little information
> about the process. So, what kind of information is it that I need to
> evaluate these estimates?

I would illustrate the particular issue mentioned above this way:

Moving from ISO 8859-1 to UTF-8 is like changing the official color of a
flag from pine green to Shamrock green.

Moving from ISO 8859-1 to UTF-32 is like changing the official color of
a flag from pine green to dark blue.

The first approach can be done gradually, so the cost can be spread
over several years, but during the process it's very difficult to tell the
old flags and the new flags apart, and you run the risk of using the old
flag rather than the new flag on an official occasion, which would be
embarrassing. It may happen that some people won't be able to see the
difference, so they'll need to consult an expert, and it may even happen
that some stuff will be replaced twice or changed back and forth because
of the confusion.

The second approach needs longer preparation and more intense financing
in that phase, but then the switch can be done in a more decisive way,
and after the switch it's very easy to spot anything out of the
ordinary. So even an average office clerk will be able to tell early
that something's wrong.

Hope this helps,
Adam

--
Adam Twardoch
| Language Typography Unicode Fonts OpenType
| twardoch.com | silesian.com | fontlab.net
I hate to advocate drugs, alcohol, violence, or
insanity to anyone, but they've always worked for me.
(Hunter S. Thompson)


This archive was generated by hypermail 2.1.5 : Mon Jan 12 2009 - 09:48:51 CST