RE: UTF8 vs. Unicode (UTF16) in code

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Mar 09 2001 - 13:24:09 EST


Thomas Chan wrote:
> > Does it exist at least one character > U+FFFF that is
> > commonly used in at least one modern language?
>
> How about music and math notation?

The music symbols in Unicode 3.1 are just the basic building blocks of music
notation. So I assume that handling surrogates (or UTF-32) would be the
minimum requirement for applications that support the special, complex
formatting that music needs.
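
To make "handling surrogates" concrete, here is a minimal sketch of how a
code point above U+FFFF maps to a UTF-16 surrogate pair and back. Python is
used only for brevity, and the function names are my own, not taken from any
library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E MUSICAL SYMBOL G CLEF, one of the Unicode 3.1 music symbols
print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

Any code that assumes one 16-bit unit per character has to learn this
mapping before it can handle such symbols at all.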

About the math symbols in Unicode 3.1, why should I be the one who breaks
the silence? :-)

> But, yes. U+21075,[1] gan, is an aspect marker in Cantonese,
> that when
> placed after a verb, denotes continuing action (roughly equivalent to
> <-ing> in English). I don't think anyone would dispute the
> indispensability or high frequency of this character.

This is exactly the kind of info that I was seeking, thanks.

It is not very clear to me what is included in Extension B: where can I find
out more about it?

> I probably wouldn't use "idiosyncratic" as an adjective to
> describe the *majority* of them, but "rare" and "ancient"
> (perhaps "historical"[2] would be a better word choice?)
> are correct.

Sorry, I probably misused the term. And I was assuming that HKSCS had been
unified with Extension B.

> [2] e.g., the "recently deceased", such as Vietnamese chu+~ no^m
> characters in Plane 2, or even Deseret in Plane 1.

Well, I guess that Chu-Nôm and Deseret are hardly known outside this mailing
list.

Clearly, it is worthwhile to implement specialized notations or historical
scripts in widely used software such as Internet browsers, e-mail clients,
word processors, etc.

But the discussion was about porting existing applications to Unicode for
the purpose of being able to localize/use them in new markets.

Consider a concrete case: I write software for the retail industry.

My managers could come and ask me to localize our solution for a retailer,
based in South China, who want their receipts and GUI messages to be in
Cantonese.

In *this* case I can push Unicode and fully justify the burden of UTF-16
support and, especially, the burden of checking that all programmers in the
team behave themselves with strings (e.g., that they don't truncate strings
blindly, leaving a lone high surrogate at the end).
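
As an illustration of the truncation problem, here is a minimal sketch that
cuts a sequence of UTF-16 code units to a maximum length without splitting a
surrogate pair. Python is used only for brevity; the helper names are
hypothetical, and the code assumes well-formed input and a non-negative
limit.

```python
def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def truncate_units(units, max_units):
    """Keep at most max_units code units, never ending on half a pair."""
    if len(units) <= max_units:
        return list(units)
    cut = max_units
    # If the cut falls between a high and a low surrogate, back up by one
    # unit so that the whole character is dropped instead of half of it.
    if (cut > 0 and is_high_surrogate(units[cut - 1])
            and is_low_surrogate(units[cut])):
        cut -= 1
    return list(units[:cut])


units = to_utf16_units("a\U00021075")   # "a" followed by U+21075
blind = units[:2]                        # ends with a lone high surrogate
safe = truncate_units(units, 2)          # drops the whole pair instead
```

The safe version may return fewer units than the limit allows, which is the
price of never emitting an unpaired surrogate.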

But you can imagine how convincing the argument for UTF-16 would be if it
were about printing musical staves on receipts (or algebraic formulae, or an
aborted orthography for English, or the script used in Viet-Nam centuries
ago)...

_ Marco


