Re: compatibility between unicode 2.0 and 3.0

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 03 2003 - 21:02:27 EST

Next message: Rick McGowan: "Re: Public Review Issues update"

Previous message: Peter_Constable@sil.org: "Re: LATIN LETTER N WITH DIAERESIS?"
Next in thread: Keyur Shroff: "Re: compatibility between unicode 2.0 and 3.0"
Reply: Keyur Shroff: "Re: compatibility between unicode 2.0 and 3.0"

Erik Ostermueller asked:

> We have a large amount of C++ that currently has Unicode 2.0 support.
>
> Could you all help me figure out what types of operations will fail
> if we attempt to pass Unicode 3.0 thru this code?
>
> I can start the list off with
>
> -sorting
> -searching for text

This depends greatly on what implementation you did for
sorting and searching, and how it handles unassigned code points
in your Unicode 2.0 code. If the code was designed to be
forward compatible, it should do reasonable things with
unassigned code points, and getting Unicode 3.0 data which
is actually using those code points should not disturb your
existing code. But, on the other hand, if you have built
in a bunch of range checks or have used tables which cannot
gracefully handle the appearance of unassigned code points
in your data, then it could well blow up.
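
As a rough illustration of the forward-compatible case, here is a minimal
sketch of a sort-weight lookup that tolerates code points missing from
its table. The names WeightTable, sortWeight and kFallbackBase are
hypothetical, and the fallback rule (order unknown code points after all
tabulated ones, by code point value) is just one possible policy, not
something any particular implementation is known to use.

```cpp
#include <cstdint>
#include <map>

// Hypothetical weight table, e.g. generated from Unicode 2.0 data.
using WeightTable = std::map<std::uint32_t, std::uint32_t>;

// Returns a sort weight for any code point. A code point that is not in
// the table (unassigned in the table's Unicode version) gets a weight
// derived from the code point itself, instead of triggering a failed
// range check or an out-of-bounds table access.
inline std::uint32_t sortWeight(const WeightTable& table, std::uint32_t cp) {
    // Assumption: every tabulated weight is smaller than this base.
    const std::uint32_t kFallbackBase = 0x80000000u;
    auto it = table.find(cp);
    if (it != table.end()) {
        return it->second;
    }
    return kFallbackBase + cp;
}
```

With a table like this, text containing Unicode 3.0 additions still
sorts deterministically; it just sorts those characters in code point
order until the table is regenerated.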

The Unicode Collation Algorithm was not defined until after
Unicode 2.0, and was first synched with Unicode 2.1. It has
also been considerably updated since then -- the current version
is aimed at Unicode 3.1. You should take a look at the
current version to check for gotchas you may have in your
current code.

> -text comparison

I assume here you are not talking about language-specific
collation comparisons, but just Unicode analogs of strcmp()
and the like. If so, those should behave well -- they aren't
usually programmed in ways which make them sensitive to
particular code point assignments.
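
To show why, here is a minimal sketch of a strcmp()-style comparison
over UTF-16 code units. It assumes a C++11 compiler and
std::u16string; the function name compareCodeUnits is hypothetical.
Nothing in it looks at which code points are assigned, so new
repertoire cannot change its behavior.

```cpp
#include <cstddef>
#include <string>

// Compares two UTF-16 strings code unit by code unit, in the manner of
// strcmp(). Returns a negative value, zero, or a positive value.
inline int compareCodeUnits(const std::u16string& a, const std::u16string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    // Equal up to the shorter length: the shorter string sorts first.
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```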

> -other character classification (isSpace, isDigit, etc...).

Again, these depend on what kinds of forward compatibility
assumptions your original code made. If it provides
meaningful results for unassigned code points in Unicode 2.0,
then tossing Unicode 3.0 data at such APIs shouldn't cause
any problem to existing code, other than not getting the
right results for Unicode 3.0 additions until you have
modified and updated your property tables.
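
A minimal sketch of what "meaningful results for unassigned code
points" could look like follows. The Category enum, PropertyTable type
and function names are hypothetical stand-ins for whatever property
tables your code actually uses.

```cpp
#include <cstdint>
#include <unordered_map>

enum class Category { Unassigned, DecimalDigit, Space, Other };

// Hypothetical property table, e.g. generated from Unicode 2.0 data.
using PropertyTable = std::unordered_map<std::uint32_t, Category>;

// Any code point absent from the table is reported as Unassigned rather
// than being treated as an error.
inline Category categoryOf(const PropertyTable& table, std::uint32_t cp) {
    auto it = table.find(cp);
    return it == table.end() ? Category::Unassigned : it->second;
}

inline bool isDigit(const PropertyTable& table, std::uint32_t cp) {
    return categoryOf(table, cp) == Category::DecimalDigit;
}

inline bool isSpace(const PropertyTable& table, std::uint32_t cp) {
    return categoryOf(table, cp) == Category::Space;
}
```

With this shape, a character added in Unicode 3.0 simply answers false
to isDigit() and isSpace() until the table is regenerated from newer
data.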

>
> I understand that these operations probably won't work in ALL cases.
> But how about basic plumbing code -- creating and copying strings?

Constructors and copy constructors ought to work fine, unless
you've done something odd.

What you should be more concerned about, however, is
how your code is going to get from Unicode 3.0 to
Unicode 3.1 (or higher), because then you will have to
deal with supplementary characters. Any assumptions that
characters don't lie outside the range U+0000..U+FFFF
will be broken. Whether this will be a small problem
or a big problem for your code depends on whether you
are effectively processing Unicode in UTF-8, UTF-16,
or UTF-32 (or combinations of those). The biggest hit,
when moving from Unicode 3.0 to Unicode 3.1 (or higher)
is for UTF-16 APIs. See Unicode Technical Note #7,
Migrating Software to Supplementary Characters, for some
ideas:
http://www.unicode.org/notes/tn7/
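
For a UTF-16 API, the core change is that one character may occupy two
code units (a surrogate pair). Here is a minimal sketch of decoding and
counting code points, assuming C++11 and std::u16string. The function
names are hypothetical, and passing an unpaired surrogate through
unchanged is only one possible policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Decodes the code point that starts at index i and advances i past it.
// Precondition: i < s.size(). An unpaired surrogate is returned as-is.
inline std::uint32_t nextCodePoint(const std::u16string& s, std::size_t& i) {
    const char16_t hi = s[i++];
    if (isHighSurrogate(hi) && i < s.size() && isLowSurrogate(s[i])) {
        const char16_t lo = s[i++];
        return 0x10000u
             + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
             + (static_cast<std::uint32_t>(lo) - 0xDC00u);
    }
    return hi;
}

// Counts code points, which can be fewer than s.size() code units.
inline std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ) {
        nextCodePoint(s, i);
        ++n;
    }
    return n;
}
```

Any loop that indexes a UTF-16 string one code unit at a time and
treats each unit as a character is a candidate for this kind of
rewrite.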

--Ken

>
> As I mentioned in my last post, I've enjoyed
> listening in on this forum -- I've learned a whole lot.
>
> Thanks,
>
> --Erik Ostermueller
>

