RE: collating sequence

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Thu Jun 28 2001 - 19:49:22 EDT


Markus,

>
> > Currently I am developing an easy to implement
> > interface (xIUA) that is also free Open Source code.
>
> I would like to point out that ICU itself has a fully functional
> and useful "interface" (API) for all of its services (conversion,
> collation, normalization, formatting, etc.).
What xIUA provides is a starter package for those folks who want an ICU
wrapper. If they don't want a wrapper then there is no need to use xIUA at
all.

Wrappers are very useful, especially when retrofitting existing code.
Imagine having every function pass the current locale, time zone, etc. to any
function that may at some point call ICU. Wrappers also standardize
calls to use the same parameters, and they can simplify the interface. This
type of wrapper code is designed to be tailored by the user and serves a
very different function from the base ICU code. It does things that should
never be implemented in ICU. Users should not normally touch ICU code.

ICU must be flexible enough to handle any possible set of parameters.
Development environments usually restrict these to the house standards, which
are often a small subset of what can be done. If there is a special
circumstance, developers can either implement a different calling API or
invoke ICU services directly.

>
> In particular, for as long as one works with UTF-16 strings, the
> ucol_strcoll() function is quite easy to use.
>

Even with UTF-16 strings you have to set up your collator. For example, a
typical collate call:

Open a collator and check to see if it opened without errors. Then set the
collator to use compatibility decomposition followed by canonical
composition. Next set UCOL_ALTERNATE_HANDLING to UCOL_NON_IGNORABLE,
UCOL_CASE_LEVEL to UCOL_ON and the strength to UCOL_TERTIARY.
Issue the collate, close the collator and check for errors.

With xIUA you call: xiua_strcoll(str1,str2);

If you want a bit more flexibility you can call:
xiua_strcollEx(str1,str2,XCOL_TERTIARY); or
xiua_strcollEx(str1,str2,XCOL_SECONDARY_CAN); if you want a secondary, case
insensitive, collate with canonical decomposition followed by canonical
composition.

If you want to tailor these settings, you can tailor xIUA to call ICU with
whatever you like.
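
One way a wrapper could make this kind of tailoring easy is a small table
that expands each preset code into the full list of collator settings. The
sketch below is only an illustration: the names (coll_preset, coll_settings,
preset_lookup, PRESET_*) are hypothetical and are not the actual xIUA or ICU
definitions, and the values in the table are assumptions that mirror the two
cases described above.

```c
#include <stddef.h>

/* Hypothetical preset codes; NOT the real XCOL_* values from xIUA. */
typedef enum { PRESET_TERTIARY, PRESET_SECONDARY_CAN, PRESET_COUNT } coll_preset;

typedef enum { STRENGTH_PRIMARY = 1, STRENGTH_SECONDARY, STRENGTH_TERTIARY } coll_strength;

/* Normalization applied before comparing (names are illustrative). */
typedef enum { NORM_CANONICAL_COMPOSE, NORM_COMPAT_COMPOSE } coll_norm;

typedef struct {
    coll_strength strength;
    coll_norm     normalization;
    int           case_level;   /* 1 = case level on, 0 = off */
} coll_settings;

/* The one place a site would edit to match its house standards. */
static const coll_settings preset_table[PRESET_COUNT] = {
    [PRESET_TERTIARY]      = { STRENGTH_TERTIARY,  NORM_COMPAT_COMPOSE,    1 },
    [PRESET_SECONDARY_CAN] = { STRENGTH_SECONDARY, NORM_CANONICAL_COMPOSE, 0 },
};

/* Returns the settings for a preset, or NULL for an unknown code. */
const coll_settings *preset_lookup(coll_preset p)
{
    if ((int)p < 0 || (int)p >= PRESET_COUNT)
        return NULL;
    return &preset_table[p];
}
```

A wrapper function would look up the preset, apply each field to the
collator it opens, and then issue the compare, so callers never see the
individual attributes.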

A user implementing a wrapper would only have to change the code in one
place to upgrade from ICU 1.6 to ICU 1.8.

If, for example, you make a lot of repeated calls to the collator and want to
set up a collator to use for something like sorting, then none of the xIUA
functions will be suitable. But you can use the code and other programs
like the ICU test and sample programs to develop your own functions. This is
why xIUA is a starter package.

> It may help in some applications to use whatever wrapper one
> likes, but it is not necessary to use a wrapper.
Very true. But if they want to write a wrapper this can save them
man-months of work.

>
> Also, if a wrapper library performs hidden string conversions,
> then a user needs to understand the impact on performance and memory use.
It should not perform unnecessary conversions. xIUA has a memory manager
for internal working memory. It keeps a small buffer that it uses and can
subdivide into pieces for working memory. Thus it saves the malloc/free
overhead that is likely to occur with explicitly implemented calls.
Combining calling sequences into a single function can reduce code size. To
improve speed, xIUA uses its own UTF-8 to UTF-16 and UTF-32 to UTF-16
conversion routines. You don't want the overhead of a full converter,
especially for converting lots of small fields.
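
To give an idea of how little code a dedicated conversion needs, here is a
minimal sketch of a UTF-8 to UTF-16 routine that writes into a buffer
supplied by the caller. It is not the xIUA routine; the function name and
the error convention (return -1) are my own assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-8 string into dst (capacity in 16-bit units).
 * Returns the number of UTF-16 units written, not counting the terminating
 * zero, or -1 on malformed input or if dst is too small. */
long utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    /* Smallest code point allowed for each sequence length (rejects overlongs). */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra, i;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;               /* stray continuation or invalid lead byte */
        s++;

        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)  /* also catches a premature NUL */
                return -1;
            cp = (cp << 6) | (*s & 0x3F);
        }

        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (n + 1 >= dst_cap) return -1;   /* keep room for the terminator */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= dst_cap) return -1;
            cp -= 0x10000;                     /* split into a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= dst_cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

Because the output goes into a caller-supplied buffer, a routine like this
can write straight into a piece of a preallocated working buffer with no
malloc/free per call.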

xIUA will also maintain an open ICU converter for translations to and from
code pages. Just keeping track of such a converter is not easy to retrofit
into existing applications. Thus a wrapper, if properly designed, can lower
overhead. xIUA is a starting point that users can tailor to be efficient in
their own environment.

>
> ICU will add some helper functions to allow users to explicitly
> convert in-process strings between UTFs. This is simple (even
> without such helper functions) and fast - but of course not as
> fast as staying with a single encoding.

Some functions like collate are easy to convert to UTF-16 and process.
Other functions like strtok don't work that way. They need a separate
implementation because they return pointers into the source string and
modify the string contents.
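
As an illustration of what such a separate implementation involves, here is
a rough sketch of a reentrant tokenizer that works directly on the UTF-8
bytes, so the pointers it returns refer to the caller's own buffer. It is
not the xIUA code; the names utf8_len, is_delim and utf8_strtok_r are
hypothetical. Unlike the C library strtok, each delimiter may be a
multi-byte character.

```c
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 character at p (p must not point at the NUL).
 * A truncated or invalid sequence is treated as shorter, never longer,
 * so the scan cannot run past the end of the string. */
static size_t utf8_len(const char *p)
{
    unsigned char c = (unsigned char)*p;
    size_t want = c < 0x80 ? 1
                : (c & 0xE0) == 0xC0 ? 2
                : (c & 0xF0) == 0xE0 ? 3
                : (c & 0xF8) == 0xF0 ? 4 : 1;
    size_t n = 1;
    while (n < want && ((unsigned char)p[n] & 0xC0) == 0x80)
        n++;
    return n;
}

/* True if the len-byte character at p is one of the characters in delims. */
static int is_delim(const char *p, size_t len, const char *delims)
{
    const char *d;
    for (d = delims; *d; d += utf8_len(d)) {
        if (utf8_len(d) == len && memcmp(p, d, len) == 0)
            return 1;
    }
    return 0;
}

/* Like strtok_r, but delims is a UTF-8 string of delimiter characters.
 * Modifies str in place and returns pointers into it. */
char *utf8_strtok_r(char *str, const char *delims, char **save)
{
    char *p = str ? str : *save;
    char *token;

    if (p == NULL)
        return NULL;

    while (*p) {                          /* skip leading delimiters */
        size_t len = utf8_len(p);
        if (!is_delim(p, len, delims))
            break;
        p += len;
    }
    if (*p == '\0') {
        *save = NULL;
        return NULL;
    }

    token = p;
    while (*p) {
        size_t len = utf8_len(p);
        if (is_delim(p, len, delims)) {
            *p = '\0';                    /* terminate the token in place */
            *save = p + len;              /* resume after the whole delimiter */
            return token;
        }
        p += len;
    }
    *save = NULL;                         /* last token; nothing left */
    return token;
}
```

Converting the input to UTF-16 first would not help here, because the
caller expects the tokens to sit in its original buffer.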

I have worked to implement ICU for clients and have put a lot of pro bono
work into this product because I have seen that ICU would be accepted by
more clients if they could save a lot of time and effort in implementing
ICU. This provides them with a way to speed up the process. It may not be
for everyone but I hope that it will help many.

Carl W. Brown
X.Net, Inc.

I am not a part of the ICU development team. xIUA is not supported by them.


