Re: Cost of transition to UTF-8 for central census authorities

From: Tim Greenwood (timothy.greenwood@gmail.com)
Date: Sun Jan 11 2009 - 11:12:48 CST

Next message: Erkki I. Kolehmainen: "RE: Cost of transition to UTF-8 for central census authorities"

Previous message: John Hudson: "Re: Flag Symbols"
In reply to: Trond Trosterud: "Cost of transition to UTF-8 for central census authorities"
Next in thread: ktadenev@ups.com: "RE: Cost of transition to UTF-8 for central census authorities"
Reply: ktadenev@ups.com: "RE: Cost of transition to UTF-8 for central census authorities"

Most databases still define the schema in terms of characters, not bytes. So
a varchar(3) is 3 characters (or perhaps code points) no matter whether the
database is storing it in Latin1 or UTF-8.
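To make the character/byte distinction concrete, here is a rough sketch in Python using the two names from the original message. It only illustrates the counting; it says nothing about how any particular database product declares or enforces its column lengths, and the helper name `lengths` is made up for this example.

```python
# Illustration only: character count vs. encoded byte count.
# "\u00c5r\u00f8" is the surname "Årø" written with escapes so the
# source file's own encoding cannot change the result.
def lengths(name: str) -> dict:
    return {
        "chars": len(name),                          # code points
        "latin1_bytes": len(name.encode("latin-1")),
        "utf8_bytes": len(name.encode("utf-8")),
    }

for name in ["\u00c5r\u00f8", "Trosterud"]:
    print(name, lengths(name))
```

A column declared in characters would count the first name as 3 either way; only a column declared in bytes sees the difference between 3 and 5.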

Is sorting and searching done inside the database? If so, then point 2 is a
no-op.
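For code that does sort or search outside the database, a small sketch may help judge how much really changes. It assumes a plain byte-wise comparison, which is my guess at what old code does, not something stated in the thread; the sample names are invented. Byte order of UTF-8 follows code point order, so for the Latin-1 repertoire the result is the same as before, and a substring search is still an ordinary byte search.

```python
# Hypothetical sample names, written with escapes for Ø, Å and Æ.
names = ["\u00d8rjan", "Aase", "\u00c5se", "\u00c6gir", "Trond"]

# Byte-wise sort under each encoding.
by_latin1 = sorted(names, key=lambda s: s.encode("latin-1"))
by_utf8 = sorted(names, key=lambda s: s.encode("utf-8"))
print(by_latin1 == by_utf8)

# Searching for "ø" in "Årø": a two-byte needle, found by plain byte search.
haystack = "\u00c5r\u00f8".encode("utf-8")
needle = "\u00f8".encode("utf-8")
offset = haystack.find(needle)
print(offset)  # byte offset, not character offset
```

Neither ordering is the Norwegian alphabetical order (which ends in æ, ø, å), so a routine that needs proper collation needed it under Latin-1 as well.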

All decent databases will convert output to the codeset required by the
client, converting in ODBC or similar. So conversion of client programs to
work with UTF-8, if needed at all, can be phased in.
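As a sketch of what that boundary conversion amounts to, assuming storage in UTF-8 and a legacy client that still expects Latin-1: the function name `to_client` and the choice to substitute "?" for unrepresentable characters are assumptions for illustration, and real drivers differ in how they handle that case.

```python
def to_client(stored_utf8: bytes, client_charset: str = "latin-1") -> bytes:
    """Re-encode stored UTF-8 bytes for a client using another charset."""
    text = stored_utf8.decode("utf-8")
    # Assumption: characters the client charset lacks become "?".
    return text.encode(client_charset, errors="replace")

print(to_client("\u00c5r\u00f8".encode("utf-8")))  # existing Latin-1 data
print(to_client("\u0111".encode("utf-8")))         # Sámi đ, not in Latin-1
```

An unconverted client keeps seeing the bytes it always saw for existing data; it only loses the new Sámi letters until it is converted.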

Tim

On Sun, Jan 11, 2009 at 10:02 AM, Trond Trosterud <
trond.trosterud@hum.uit.no> wrote:

> I have the following question to the list:
>
> In Norway, we have a large census database (https://infobank.edb.com)
> containing the names, social security numbers, addresses, cars, companies,
> boats, etc. of all Norwegian citizens. Today it is encoded, probably, in
> ISO 8859-1 (some old registries may be EBCDIC, but with the same character
> repertoire or a subset).
>
> Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
> letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and -10
> are possible, but a natural solution is UTF-8.
>
> Now, what will this cost?
>
> According to key personnel, this transition will require a transition
> period of approx. 10 years, and a relatively high cost (politeness towards
> the authors of the transition plans prevents me from quoting numbers).
>
> Governmental experts see the following drawbacks with UTF-8:
>
> 1. The field length in the database will be longer than the display field.
> So, given a surname "Årø", we will have a display length of 3 (letters), as
> compared to the database length of 5 bytes.
> 2. There will have to be a new sorting routine, and a new search routine.
> 3. Programs may no longer search for characters as single bytes, but must
> in some cases allow for searching for sequences of bytes.
> 4. Many common programs only support 8-bit character sets.
> 5. Data must be removed from registries, converted and replaced.
> 6. Millions of lines of code must be changed and tested.
>
> To me it seems most of these points are not real problems, but either a
> description of the conversion process, or unfounded fear.
>
> My question to the list is this:
>
> a. How can the variable field length be a problem? The field must in any
> case allow for longer names, e.g. the 9 letters of my name (Trosterud)
> require 9 bytes, more than the 5 of Årø. Can there be database solutions
> that generate database fields on the basis of the number of characters? The
> opposite (view fields on the basis of bytes) should be no problem, it will
> only give [Årø      ] and [Trosterud].
>
> b. Will it really be necessary to change millions of lines of code? How can
> even old, badly written code require such changes?
>
> c. The problem with the discussion is that the experts within the
> registries are presenting their conclusions, and not the premises behind
> them. Politicians listening to them are thus lost. I am invited to comment
> on the process, but it is not easy, as I get so little information about
> it. So, what kind of information do I need in order to evaluate these
> estimates?
>
> d. Other comments, or perhaps better: experiences from other countries?
>
> Trond Trosterud.
>
>
> ----------------------------------------------------------------------
> Trond Trosterud t +47 7764 4763
> Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
> N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
> Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
> dn------------------------------------------------------------------đŋ

This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 11:15:01 CST