Cost of transition to UTF-8 for central census authorities

From: Trond Trosterud (trond.trosterud@hum.uit.no)
Date: Sun Jan 11 2009 - 09:02:13 CST

Next message: Michael D'Errico: "Re: Emoji: emoticons vs. literacy"

Previous message: vunzndi@vfemail.net: "Re: Emoji: emoticons vs. literacy"
Next in thread: Don Osborn: "RE: Cost of transition to UTF-8 for central census authorities"
Reply: Don Osborn: "RE: Cost of transition to UTF-8 for central census authorities"
Reply: Adam Twardoch: "Re: Cost of transition to UTF-8 for central census authorities"
Reply: Doug Ewell: "Re: Cost of transition to UTF-8 for central census authorities"
Reply: Tim Greenwood: "Re: Cost of transition to UTF-8 for central census authorities"
Reply: Christopher Fynn: "Re: Cost of transition to UTF-8 for central census authorities"
Maybe reply: philip chastney: "Re: Cost of transition to UTF-8 for central census authorities"

I have the following question for the list:

In Norway, we have a large census database (https://infobank.edb.com)
containing the names, social security numbers, addresses, cars,
companies, boats, etc. of all Norwegian citizens. Today, it uses the
ISO/IEC 8859-1 character repertoire, probably encoded as 8859-1 (some
old registries may be EBCDIC, but with the same character repertoire
or a subset).

Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and
-10 are possible, but a natural solution is UTF-8.
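
To make the "6x2 letters" concrete, here is a minimal Python sketch. It
assumes the letters meant are the Northern Sámi Č, Đ, Ŋ, Š, Ŧ, Ž in
upper and lower case (the list is an assumption, it is not spelled out
above). The helper names are made up for illustration. The sketch
reports, for each letter, whether it fits in an 8859-1 field and how
many bytes it takes in UTF-8.

```python
import unicodedata

# Assumed letter list (Northern Sámi: Č č Đ đ Ŋ ŋ Š š Ŧ ŧ Ž ž), written as
# escapes so that this snippet itself stays pure ASCII.
SAMI_LETTERS = (
    "\u010C\u010D\u0110\u0111\u014A\u014B"
    "\u0160\u0161\u0166\u0167\u017D\u017E"
)


def fits_latin1(ch):
    """Return True if ch can be stored in an ISO 8859-1 field."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def describe(ch):
    """Return (code point, Unicode name, fits 8859-1?, UTF-8 byte count)."""
    return (ord(ch), unicodedata.name(ch), fits_latin1(ch),
            len(ch.encode("utf-8")))


if __name__ == "__main__":
    for cp, name, ok, nbytes in map(describe, SAMI_LETTERS):
        print(f"U+{cp:04X} {name:<42} 8859-1={ok} utf8_bytes={nbytes}")
```

Under that assumption, none of the twelve letters can be stored in
8859-1, and each takes two bytes in UTF-8. The letter á, also used in
Sámi, is already part of 8859-1.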

Now, what will this cost?

According to key personnel, this transition will require a transition
period of approx. 10 years, and a relatively high cost (politeness
towards the authors of the transition plans prevents me from quoting
numbers).

Governmental experts see 6 drawbacks with UTF-8:

1. The field length in the database will be longer than the display
field. So, given a surname "Årø", we will have a display length of 3
(letters), as compared to the database length of 5 bytes.
2. There will have to be a new sorting routine, and a new search routine
3. Programs may no longer search for characters as single bytes, but
must in some cases allow for searching for sequences of bytes.
4. Many common programs only support 8-bit character sets
5. Data must be removed from registries, converted and replaced
6. Millions of lines of code must be changed and tested
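
The arithmetic in point 1 and the conversion step in point 5 can be
sketched in a few lines of Python. The function names are hypothetical,
and a real registry would of course do this inside its own database
tooling; the sketch only shows what happens to the bytes.

```python
def latin1_to_utf8(raw):
    """Re-encode bytes from ISO 8859-1 to UTF-8.

    ASCII bytes (0x00-0x7F) pass through unchanged; only the upper
    half of 8859-1 changes representation.
    """
    return raw.decode("latin-1").encode("utf-8")


def lengths(name):
    """Return (characters, 8859-1 bytes, UTF-8 bytes) for a name."""
    return (len(name),
            len(name.encode("latin-1")),
            len(name.encode("utf-8")))


if __name__ == "__main__":
    for name in ("\u00c5r\u00f8", "Trosterud"):  # "Årø", "Trosterud"
        print(name, lengths(name))
```

For "Årø" this gives 3 characters, 3 bytes in 8859-1 and 5 bytes in
UTF-8; for "Trosterud" all three numbers are 9.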

To me it seems most of these points are not real problems, but either
a description of the conversion process, or unfounded fear.
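
Points 2 and 3 can be probed with a small Python sketch (helper names
are hypothetical). It relies on two properties of UTF-8: the encoded
form of one character never matches inside the encoded form of
another, so a plain byte-string search stays correct, and sorting
UTF-8 bytes gives the same order as sorting by code point, which for
8859-1 data is the order a plain byte sort gives today.

```python
def find_char(haystack_utf8, ch):
    """Byte offset of the first occurrence of ch in UTF-8 data, or -1."""
    return haystack_utf8.find(ch.encode("utf-8"))


def same_binary_order(names):
    """True if a byte-wise sort agrees between 8859-1 and UTF-8.

    Assumes every name is representable in 8859-1.
    """
    by_latin1 = sorted(names, key=lambda s: s.encode("latin-1"))
    by_utf8 = sorted(names, key=lambda s: s.encode("utf-8"))
    return by_latin1 == by_utf8


if __name__ == "__main__":
    record = "Trosterud;\u00c5r\u00f8".encode("utf-8")  # "Trosterud;Årø"
    print(find_char(record, "\u00f8"))   # byte offset of "ø"
    print(find_char(record, "\u00c3"))   # "Ã" does not occur
    print(same_binary_order(
        ["\u00d8ye", "\u00c5r\u00f8", "Trosterud", "\u00c5s", "Aas"]))
```

Note that a byte-wise sort is not Norwegian alphabetical order in
either encoding (it puts Å before Ø), so a routine that sorts
correctly today must already use a collation table, and that table
is what would need to learn the new letters.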

My questions to the list are these:

a. How can the variable field length be a problem? The field must in
any case allow for longer names, e.g. the 9 letters of my name
(Trosterud) require 9 bytes, more than the 5 of Årø. Can there be
database solutions that generate database fields on the basis of the
number of characters? The opposite (view fields on the basis of bytes)
should be no problem, it will only give [Årø ] and [Trosterud].
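
The two ways of sizing a field can be compared with a short Python
sketch. The field width of 9 and the function names are assumptions
made only for illustration.

```python
FIELD_WIDTH = 9  # hypothetical field width, chosen to fit "Trosterud"


def pad_to_bytes(name, width=FIELD_WIDTH):
    """Pad with spaces until the UTF-8 form fills exactly `width` bytes."""
    raw = name.encode("utf-8")
    if len(raw) > width:
        raise ValueError("name does not fit in the field")
    return raw + b" " * (width - len(raw))


def pad_to_chars(name, width=FIELD_WIDTH):
    """Pad with spaces until the name is exactly `width` characters."""
    return name.ljust(width)


if __name__ == "__main__":
    for name in ("\u00c5r\u00f8", "Trosterud"):  # "Årø", "Trosterud"
        by_bytes = pad_to_bytes(name).decode("utf-8")
        by_chars = pad_to_chars(name)
        print(f"[{by_bytes}] {len(by_bytes)} chars wide, "
              f"[{by_chars}] {len(by_chars.encode('utf-8'))} bytes long")
```

With a byte-sized field, "Årø" is shown 7 characters wide and
"Trosterud" 9; with a character-sized field both are 9 characters
wide, but "Årø" then occupies 11 bytes.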

b. Will it really be necessary to change millions of lines of code?
How can even old, badly written code require such changes?
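
One kind of code that plausibly does need attention is code that cuts
a field at a fixed byte position. The Python sketch below (function
names hypothetical, input assumed to be valid UTF-8) contrasts a naive
cut with one that steps back to a character boundary.

```python
def truncate_naive(raw, limit):
    """Cut at `limit` bytes, as code written for 8859-1 might do."""
    return raw[:limit]


def truncate_utf8(raw, limit):
    """Cut to at most `limit` bytes without splitting a character."""
    if len(raw) <= limit:
        return raw
    end = limit
    # UTF-8 continuation bytes look like 10xxxxxx; if the cut would land
    # on one, step back to the lead byte of that character.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]


if __name__ == "__main__":
    raw = "\u00c5r\u00f8".encode("utf-8")  # "Årø", 5 bytes
    print(truncate_utf8(raw, 4).decode("utf-8"))
    try:
        truncate_naive(raw, 4).decode("utf-8")
    except UnicodeDecodeError as err:
        print("naive cut is not valid UTF-8:", err.reason)
```

Code that only copies, compares or stores whole fields is not
affected by this; the risk is limited to places that index or cut
inside a field.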

c. The problem with the discussion is that the experts within the
registries are presenting their conclusions, and not the premises
behind them. Politicians listening to them are thus lost. I have been
invited to comment on the process, but it is not easy, as I get so
little information about it. So, what kind of information do I need
in order to evaluate these estimates?
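
One premise that could be asked for is how much the stored data would
actually grow. For 8859-1 data this can be computed without converting
anything: every byte in the range 0x80-0xFF becomes two bytes in
UTF-8, and every other byte stays as it is. A minimal Python sketch,
with a hypothetical function name:

```python
def utf8_size_of_latin1(raw):
    """Size in bytes that 8859-1 data would have after conversion to UTF-8."""
    return len(raw) + sum(1 for b in raw if b >= 0x80)


if __name__ == "__main__":
    sample = "Trond Trosterud, Troms\u00f8".encode("latin-1")
    print(len(sample), "->", utf8_size_of_latin1(sample))
```

Run over an export of a name field, a count like this would show how
many existing fields would exceed their current byte length.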

d. Other comments, or perhaps better: experiences from other countries?

Trond Trosterud.

----------------------------------------------------------------------
Trond Trosterud t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
dn------------------------------------------------------------------đŋ


This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 09:03:55 CST