Re: Nicest UTF

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Dec 02 2004 - 06:12:20 CST

    On Wednesday, December 01, 2004 22:40Z Theodore H. Smith wrote:

    > Assuming you had no legacy code. And no "handy" libraries either,
    > except for byte libraries in C (string.h, stdlib.h). Just a C++
    > compiler, a "blank page" to draw on, and a requirement to do a lot of
    > Unicode text processing.
    <...>
    > What would be the nicest UTF to use?

    There are other factors that might influence your choice.
    For example, the relative cost of using 16-bit entities: on a Pentium it is
    cheap, on more modern x86 processors the price is a bit higher, and on some
    RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
    in such a case, UTF-16 is not really a good choice). At the other extreme,
    you have processors where bytes are 16 bits; obviously again, UTF-8 is not
    optimum there. ;-)
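
    By the way, if you do settle on UTF-16, the "short may become 32 bits"
    trap can be avoided with the least-width types; here is a minimal
    sketch, assuming your compiler provides the C99 <stdint.h> header (the
    type and function names are my own):

        #include <stdint.h>
        #include <stddef.h>

        /* One UTF-16 code unit: at least 16 bits wide, whatever size
           short happens to be on the target processor. */
        typedef uint_least16_t utf16_unit;

        /* Length in code units of a zero-terminated UTF-16 string. */
        size_t utf16_len(const utf16_unit *s)
        {
            const utf16_unit *p = s;
            while (*p)
                ++p;
            return (size_t)(p - s);
        }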

    Also, it may matter whether you have write access to the sources of your
    library: if so, it could be possible (at a minimal adaptation cost) to
    make it handle 16-bit or 32-bit characters. Even more interesting, this
    might already exist, in the form of the wcs*() functions of the C95
    Standard.
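
    For instance, a sketch of counting line separators with the C95 wide
    functions, assuming your platform's wchar_t encoding is Unicode (true
    on most current systems; the function name is mine):

        #include <wchar.h>

        /* Count U+2028 LINE SEPARATOR occurrences using wcschr(). */
        size_t count_lsep(const wchar_t *s)
        {
            size_t n = 0;
            while ((s = wcschr(s, L'\x2028')) != 0) {
                ++n;
                ++s;
            }
            return n;
        }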

    It also depends, obviously, on the kind of processing you are doing. Some
    programs mainly handle strings, so the transformation format is not the
    most important thing. Others handle individual characters, and there UTF-8
    is less adequate because of the cost of seeking to a given character
    position. On the other hand, texts are stored in external files, and if
    the external format is UTF-8 or based on it, that might be a bias toward
    it.
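
    To make that seeking cost concrete, here is a minimal sketch (my own
    naming) of stepping to the nth code point of a UTF-8 string, which is
    O(n) where the same operation on UTF-32 is a simple array index:

        #include <stddef.h>

        /* Advance n code points into a well-formed UTF-8 string. */
        const char *utf8_index(const char *s, size_t n)
        {
            while (n > 0 && *s != '\0') {
                ++s;                           /* skip the lead byte     */
                while ((*s & 0xC0) == 0x80)    /* skip 10xxxxxx trailers */
                    ++s;
                --n;
            }
            return s;
        }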

    And finally it may depend on how many different architectures you need to
    deploy your programs on. C is great for its portability, yet portability
    is a tool, not a necessary target. A single user usually does not care how
    portable the program he is using is, provided it does the job and comes
    cheap (or not too expensive). I agree portability is a good point for IT
    managers (because it fosters competition, which is good for cutting
    costs). But on the other hand, too much portability can be
    counter-productive for everyone (for example, writing a text processor in
    C which allows characters to be stored directly either as 8-bit units or
    as UTF-16 code units; or using long for everything, in order to be
    potentially portable to 16-bit ints, even if the storage limitations would
    impede practical use).

    I believe the current availability of three competing formats is a fact
    that we have to accept. It is certainly not as optimal as the prevalence
    of ASCII may have been. It is certainly a bad thing for some suppliers,
    such as those writing those libraries, because it means three times the
    work for them and an increased price for their users (whether in sales
    price or in delayed availability of features/bug fixes/etc.) Moreover, the
    present existence of widely available yet incompatible installed bases for
    at least two of the formats (namely UTF-16 on Windows NT and UTF-8 in
    Internet protocols) means additional costs for nearly the whole industry.
    This may mean more workload for those actually working in this area ;-),
    but also more pressure on them from their managements, and it results in
    waste when seen from the client side, so it is not a good thing for
    marketing.
    Yet that is how it is, and I assume there is not much we can do to cure
    it.

    Now let's proceed to read the rest...

    > I think UTF8 would be the nicest UTF.

    So that is your point of view.

    > But does UTF32 offer simpler better faster cleaner code?

    Perhaps you can actually try to measure it.

    > A Unicode "character" can be decomposed. Meaning that a character
    > could still be a few variables of UTF32 code points! You'll still
    > need to carry around "strings" of characters, instead of characters.

    This syllogism assumes that all text handling requires decomposition. I
    disagree with that.

    > The fact that it is totally bloat worthy, isn't so great. Bloat
    > mongers aren't your friend.

    Again, do you care to offer us any figures?

    > The fact that it is incompatible with existing byte code doesn't help.

    See above.

    > UTF8 can be used with the existing byte libraries just fine.

    It depends on what you want to do. For example, using strchr()/strspn()
    and the like may be great if you are dealing with some sort of tagged
    format such as SGML; but if your text uses U+2028 as its end-of-line
    indicator, it suddenly becomes not so great...
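
    To illustrate (the function name is mine): U+2028 encodes as the three
    bytes E2 80 A8 in UTF-8, so strchr() cannot find it and you are reduced
    to a substring search:

        #include <string.h>

        /* Find the next U+2028 LINE SEPARATOR in a UTF-8 string. */
        const char *find_line_sep(const char *s)
        {
            return strstr(s, "\xE2\x80\xA8");
        }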

    > An accented A in UTF-8, would be 3 bytes decomposed.

    Or more.
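
    (For the record: decomposed <A, U+0301 COMBINING ACUTE ACCENT> is the
    byte sequence 41 CC 81 in UTF-8, three bytes, while the precomposed
    U+00C1 is just C3 81; each further combining mark adds two bytes or
    more.)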

    > In UTF32, thats 8 bytes!

    And so? Nobody is saying that UTF-32 is space-efficient. In fact, UTF-32
    specifically trades space for other advantages. If you are tight on space,
    then obviously UTF-32 is not an option. But that is another constraint,
    which you did not add to your list above.

    On the other hand, nowadays the general-purpose workstation used for text
    processing has several hundred megabytes of memory; that is, room for
    scores of megabytes of UTF-32 characters, decomposed and all.
    The biggest text I have at hand is below 15 MB, and when I have to deal
    with it, I am quite clearly I/O-bound, not memory-bound.

    > Also, UTF-8 is a great file format, or socket-transfer format.

    You are using sockets (inter-machine IPC) for intensive text processing?
    Do you really believe that is representative?

    > Not needing to convert is great.

    I must be missing something here. You started out in a perfect world, with
    no legacy; yet you need to convert for external interfacing...

    > Its also compatible with C strings.

    Specifically not a good point to mention here these days. :-)

    > Also, UTF-8 has no endian issues.

    And?
    If everyone were using network byte order everywhere, there would not be a
    problem with the other formats either. Alas, while Intel does provide
    elementary instructions to deal with byte order at the lowest level, their
    use is not practical and even less efficient. And I see nothing like
    bi-endianness coming in future processors from Intel.
    Also, if you are living purely in a Windows world, endianness is not an
    actual problem either, as far as I can see.
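
    For what it is worth, the portable way to swap one UTF-16 code unit
    looks like this (a sketch, with my own naming); a good compiler can
    reduce it to a single instruction, but you still pay for it at every
    load or store:

        #include <stdint.h>

        /* Reverse the two octets of a 16-bit code unit. */
        uint_least16_t swap16(uint_least16_t u)
        {
            return (uint_least16_t)(((u & 0xFFu) << 8) | ((u >> 8) & 0xFFu));
        }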

    > Also, UTF-8's compactness makes it great for processing large volumes
    > of UTF-8.

    Is it a real problem? Do you have figures?
    I happen to have suffered from this kind of problem some years ago. My box
    only had 32, then 64 MB, and I was dealing with multi-megacharacter texts.
    And UTF-8 proved to be an important constraining factor when compared with
    legacy encodings (here ISCII/CSX), when it comes to memory size...

    > I think that UTF16 is really bad. UTF16 is basically popular, because
    > so many people thought UCS2 was the answer to internationalisation.
    > UTF16 was kind of a "switch and bait" technique (unintentional of
    > course). Had it been known that we need to treat characters as
    > multiple units of variables, we might as well have gone for UTF8!

    You are perhaps missing something here.
    UCS-2 (that is, the difference between Unicode and DIS 10646) was viewed
    with much expectation back in the '90s, when engineers were tired of
    having to deal with multibyte encodings (including stateful ones), which
    fitted pretty badly within the scheme of the existing (ASCII- or
    EBCDIC-based) software. Of course, the result was not at the level of the
    most optimistic expectations. Part of this comes from the UTF-16 kludge,
    part from the problems related to decompositions, as you mentioned, and
    also from other problems, such as the lack of any easy transposition of
    established mechanisms like <ctype.h>; or the underlying fact that we
    should deal with strings rather than with the individual characters that C
    and the usual 3GL programming languages invite us to handle.

    However, UTF-8 is, on the other side, a step back toward multibyte
    encodings. Yes, it is stateless; it is easy to resynchronize; and in the
    meantime software engineers did learn, and much existing code is no longer
    multibyte-hostile. In other words, it is a very good variable-sized
    encoding. Which does not prevent it from being a variable-sized encoding.
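
    The resynchronization property, for the record, is this simple (a
    sketch, my own naming): from an arbitrary byte position, back up over
    the 10xxxxxx continuation bytes to the start of the current code point:

        /* Find the first byte of the code point containing p. */
        const char *utf8_sync_back(const char *p, const char *start)
        {
            while (p > start && (*p & 0xC0) == 0x80)
                --p;
            return p;
        }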

    > The people who like UTF16 because UTF8 takes 3 bytes where UTF16 takes
    > 2 for their favourite language... I can see their point. But even
    > then, with the prevalence of markup, and the prevalence of 1 byte
    > punctuation, the trade-off is really quite small.

    Figures?
    Also, you do use U+2028 as line separator, as Unicode mandates, don't you?

    > UTF-8 (byte) processing code is also more compatible with that Unicode
    > compression scheme whose acronym I forget (something like SCSU).

    I am not sure text processing should spend an appreciable part of its time
    doing compression and decompression. In fact, if it does, something seems
    wrong to me.

    > Its too bad MicroSoft and Apple didn't realise the same, before they
    > made their silly UCS-2 APIs.

    You began by positing the perfect world of pure text processing: so any
    argument related to file systems or to the use of string atoms in the APIs
    is deemed irrelevant, and I will abstain from bringing them in.
    However, what appears basically unacceptable to me is to bash Microsoft or
    Apple for lack of vision in the APIs they designed 10-15 years ago, yet to
    consider the ANSI ('89, not '95) C libraries on an octet-oriented machine
    as the only available alternative in 2004, when looking toward the future.

    Antoine


