Re: Cost of transition to UTF-8 for central census authorities

From: Tim Greenwood (timothy.greenwood@gmail.com)
Date: Sun Jan 11 2009 - 11:12:48 CST


    Most databases still define the schema in terms of characters, not bytes. So
    a varchar(3) is 3 characters (or perhaps code points) no matter whether the
    database is storing it in Latin1 or UTF-8.
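
    To make the character/byte distinction concrete, here is a minimal
    Python sketch (not tied to any particular database) that counts code
    points and encoded bytes for the two surnames from the original
    message. The helper name lengths is my own, purely for illustration.

```python
# Character count versus encoded byte count for the two surnames in the thread.
# Escapes are used so the example does not depend on this file's own encoding.
ARO = "\u00c5r\u00f8"      # the surname written with A-ring and o-slash
TROSTERUD = "Trosterud"

def lengths(name):
    """Return (code points, ISO 8859-1 bytes, UTF-8 bytes) for a name."""
    return (len(name), len(name.encode("latin-1")), len(name.encode("utf-8")))

for name in (ARO, TROSTERUD):
    chars, latin1_bytes, utf8_bytes = lengths(name)
    print(f"{name}: {chars} characters, "
          f"{latin1_bytes} bytes in 8859-1, {utf8_bytes} bytes in UTF-8")
```

    A column declared in characters sees 3 and 9 here whichever encoding
    is used for storage; only a column declared in bytes would notice
    the difference between 3 and 5.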

    Is sorting and searching done inside the database? If so, then point 2 is a
    no-op.
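
    Even where sorting is done outside the database with a plain byte
    comparison, the order should not change, because bytewise comparison
    of UTF-8 preserves code point order and the 8859-1 byte values are
    the code points themselves. A small Python sketch, using sample
    surnames I made up for illustration:

```python
# Sample surnames (hypothetical test data), written with escapes so the
# example does not depend on this file's own encoding.
names = [
    "\u00d8deg\u00e5rd",   # starts with O-slash
    "Aas",
    "\u00c5r\u00f8",       # starts with A-ring
    "Zahl",
    "\u00c6s\u00f8y",      # starts with AE ligature
]

# Sort by the raw bytes each encoding would store.
by_latin1_bytes = sorted(names, key=lambda s: s.encode("latin-1"))
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_latin1_bytes == by_utf8_bytes)
print(by_utf8_bytes)
```

    Note that neither order is the Norwegian alphabetical order (which
    ends in Æ, Ø, Å). Linguistically correct sorting needs a collation
    in either encoding, so that part is not a new cost of UTF-8.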

    All decent databases will convert output to the codeset required by the
    client, with the conversion done in the ODBC driver or a similar client
    layer. So conversion of client programs to work with UTF-8, if needed at
    all, can be phased in.
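
    As a rough sketch of what that boundary conversion amounts to, the
    Python below decodes a UTF-8 stored value and re-encodes it for a
    legacy 8859-1 client. The function to_client is hypothetical; a real
    driver does this internally, and its handling of unmappable
    characters may differ from the "?" substitution assumed here.

```python
def to_client(value: str, client_charset: str) -> bytes:
    """Encode text for a client, substituting '?' for unmappable characters.

    Hypothetical helper: stands in for the conversion a database driver
    would perform between the storage encoding and the client encoding.
    """
    return value.encode(client_charset, errors="replace")

# Assumption: the database stores UTF-8 and hands back raw bytes.
stored = "\u00c5r\u00f8".encode("utf-8")
text = stored.decode("utf-8")

# An unconverted 8859-1 client still receives the bytes it always did.
legacy_bytes = to_client(text, "latin-1")
print(legacy_bytes)

# A Sami letter such as eng (U+014B) has no 8859-1 byte, so the legacy
# client sees a substitute until it is converted.
sami_bytes = to_client("\u014b", "latin-1")
print(sami_bytes)
```

    Existing names therefore look unchanged to old clients; only records
    that actually use the new letters degrade, which is what makes a
    phased client migration workable.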

    Tim

    On Sun, Jan 11, 2009 at 10:02 AM, Trond Trosterud <
    trond.trosterud@hum.uit.no> wrote:

    > I have the following question to the list:
    >
    > In Norway, we have a large census database (https://infobank.edb.com) that
    > contains the names, social security numbers, addresses, cars, companies,
    > boats, etc., of all Norwegian citizens. Today it uses the ISO 8859-1
    > character repertoire, probably encoded as 8859-1 itself (some old
    > registries may be EBCDIC, but with the same character repertoire or a
    > subset).
    >
    > Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
    > letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and -10
    > are possible, but a natural solution is UTF-8.
    >
    > Now, what will this cost?
    >
    > According to key personnel, this transition will require a transition period
    > of approx. 10 years, and a relatively high cost (politeness towards the
    > authors of the transition plans prevents me from citing numbers).
    >
    > Governmental experts see 6 drawbacks with UTF-8:
    >
    > 1. The field length in the database will be longer than the display field.
    > So, given a surname "Årø", we will have a display length of 3 (letters), as
    > compared to the database length of 5 bytes.
    > 2. There will have to be a new sorting routine, and a new search routine.
    > 3. Programs may no longer search for characters as single bytes, but must
    > in some cases allow for searching for sequences of bytes.
    > 4. Many common programs only support 8-bit character sets
    > 5. Data must be removed from registries, converted and replaced
    > 6. Millions of lines of code must be changed and tested
    >
    > To me it seems most of these points are not real problems, but either a
    > description of the conversion process, or unfounded fear.
    >
    > My question to the list is this:
    >
    > a. How can the variable field length be a problem? The field must in any
    > case allow for longer names, e.g. the 9 letters of my name (Trosterud)
    > require 9 bytes, more than the 5 of Årø. Are there database solutions that
    > size database fields on the basis of the number of characters? The opposite
    > (sizing display fields on the basis of bytes) should be no problem; it will
    > only give [Årø ] and [Trosterud].
    >
    > b. Will it really be necessary to change millions of lines of code? How can
    > even old, badly written code require such changes?
    >
    > c. The problem with the discussion is that the experts within the
    > registries are presenting their conclusions, and not the premises behind
    > them. Politicians listening to them are thus lost. I have been invited to
    > comment on the process, but it is not easy, as I get so little information
    > about it. So, what kind of information do I need in order to evaluate
    > these estimates?
    >
    > d. Other comments, or perhaps better: experiences from other countries?
    >
    > Trond Trosterud.
    >
    >
    > ----------------------------------------------------------------------
    > Trond Trosterud t +47 7764 4763
    > Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
    > N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
    > Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
    > dn------------------------------------------------------------------đŋ


