From: Trond Trosterud (trond.trosterud@hum.uit.no)
Date: Sun Jan 11 2009 - 09:02:13 CST
I have the following question for the list:
In Norway, our large census databases (https://infobank.edb.com)
contain the names, social security numbers, addresses, cars, companies,
boats, etc., of all Norwegian citizens. Today the data is encoded with
the ISO/IEC 8859-1 repertoire, probably in 8859-1 itself (some old
registries may be in EBCDIC, but with the same character repertoire or
a subset).
Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and
-10 are possible, but a natural solution is UTF-8.
Now, what will this cost?
According to key personnel, this will require a transition period of
approx. 10 years, and a relatively high cost (politeness towards the
authors of the transition plans prevents me from quoting figures).
Governmental experts see 6 drawbacks with UTF-8:
1. The field length in the database will be longer than the display
field. So, given a surname "Årø", we will have a display length of 3
(letters), as compared to the database length of 5 bytes.
2. There will have to be a new sorting routine, and a new search routine
3. Programs may no longer search for characters as single bytes, but
must in some cases allow for searching for sequences of bytes.
4. Many common programs only support 8-bit character sets
5. Data must be removed from registries, converted and replaced
6. Millions of lines of code must be changed and tested
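
Points 1 and 3 can be made concrete with a minimal Python sketch. The
language, the sample record and the helper name find_bytes are
assumptions chosen for illustration only, not anything taken from the
registries. It compares the letter count of "Årø" with its byte count
in 8859-1 and in UTF-8, and shows that an ordinary byte-wise substring
search still locates a two-byte character.

```python
import unicodedata

# Illustrative sketch only; the sample data and names are hypothetical.
# Normalise to precomposed form so that each letter is one code point.
surname = unicodedata.normalize("NFC", "Årø")

char_count = len(surname)                          # letters on screen
latin1_bytes = len(surname.encode("iso-8859-1"))   # one byte per letter
utf8_bytes = len(surname.encode("utf-8"))          # Å and ø need two bytes


def find_bytes(haystack: bytes, needle: str) -> int:
    """Return the byte offset of needle in UTF-8 data, or -1 if absent."""
    return haystack.find(needle.encode("utf-8"))


record = unicodedata.normalize("NFC", "Årø, Tromsø").encode("utf-8")
offset_of_oslash = find_bytes(record, "ø")   # first ø, counted in bytes
offset_of_eng = find_bytes(record, "ŋ")      # not present in the record

print(char_count, latin1_bytes, utf8_bytes, offset_of_oslash, offset_of_eng)
```

The search works because no UTF-8 encoded character can appear inside
the encoding of another one, so a byte sequence match is always a
character match. Note that the offset returned is a byte position, not
a letter position.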
To me it seems most of these points are not real problems, but either
a description of the conversion process, or unfounded fear.
My question to the list is this:
a. How can the variable field length be a problem? The field must in
any case allow for longer names, e.g. the 9 letters of my name
(Trosterud) require 9 bytes, more than the 5 of Årø. Can there be
database solutions that size database fields on the basis of the
number of characters? The opposite (display fields sized on the basis
of bytes) should be no problem, it will only give [Årø ] and
[Trosterud].
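
The two ways of sizing a field can be sketched as follows. This is a
rough Python illustration under the assumption of a 9-unit field; the
function names fits_in_bytes, pad_chars and pad_bytes are hypothetical
and do not describe any actual registry software.

```python
# Illustrative sketch only; field width and function names are assumptions.
FIELD_WIDTH = 9


def fits_in_bytes(name: str, max_bytes: int = FIELD_WIDTH) -> bool:
    """True if the UTF-8 form of name fits in a field of max_bytes bytes."""
    return len(name.encode("utf-8")) <= max_bytes


def pad_chars(name: str, width: int = FIELD_WIDTH) -> str:
    """Pad with spaces so that the result is width characters long."""
    return name.ljust(width)


def pad_bytes(name: str, width: int = FIELD_WIDTH) -> str:
    """Pad with spaces so that the UTF-8 form is width bytes long."""
    encoded = name.encode("utf-8")
    padding = b" " * (width - len(encoded))   # empty if already too long
    return (encoded + padding).decode("utf-8")


for name in ("Årø", "Trosterud"):
    print("[" + pad_chars(name) + "]", "[" + pad_bytes(name) + "]")
```

With a byte-sized field, "Årø" is simply shown with fewer trailing
spaces than with a character-sized field, while "Trosterud" fills the
field either way.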
b. Will it really be necessary to change millions of lines of code?
How can even old, badly written code require such changes?
c. The problem with the discussion is that the experts within the
registries are presenting their conclusions, and not the premises
behind them. Politicians listening to them are thus lost. I have been
invited to comment on the process, but that is not easy, as I get so
little information about it. So, what kind of information do I need in
order to evaluate these estimates?
d. Other comments, or perhaps better: experiences from other countries?
Trond Trosterud.
----------------------------------------------------------------------
Trond Trosterud t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
dn------------------------------------------------------------------đŋ