From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jun 08 2006 - 19:48:10 CDT
Mike wrote on Friday, June 09, 2006 at 12:02 AM
> Some of the recent discussions have led me to
> question my implementation of upper/lowercasing
> and case folding. Currently I simply iterate
> through a string exchanging characters with
> their replacements. I don't first normalize
> to any form, or do any reordering of combining
> marks afterward.
> My question is, should I be doing these things?
Like a lot of things, it depends on why you are doing them. If your clients
are dumb, Unicode-non-compliant processes that are only going to do binary
comparison on the outputs, the only normalisation you should do is to make
sure that when U+0345 COMBINING GREEK YPOGEGRAMMENI becomes an iota, it moves
to the end of the sequence of non-zero combining class characters (and two
Tibetan nasties) following. (The Unicode Standard is quite frankly unclear
on this - it tells you what to do in uppercasing if you want the
linguistically correct outcome, but leaves matters irritatingly vague for
someone trying to implement conversions in strict compliance with the
standard. Perhaps you should provide the option of a jobsworth
interpretation and a linguistically correct one. There is a third option -
not to convert the subscript to an iota, but that would be tailoring for a
specific variety of Greek. This is not what you want to hear, and I had
hoped to get some guidance from inner Unicode counsels before publicising
the problem.) This assumes that your clients wish to make a distinction
between one-character e-acute U+00E9 and the two-character e-acute <U+0065,
U+0301>. Note however that case folding necessarily does some partial
decomposition of composites.
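
As a rough illustration of the reordering described above, here is a minimal Python sketch. It assumes the simple reading of the rule: the capital iota produced from U+0345 is held back until the run of following characters with non-zero combining class has ended. The helper name upper_moving_iota is hypothetical, and the two Tibetan exceptions are deliberately not handled. The last lines show the partial decomposition that case folding performs by itself.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI
CAPITAL_IOTA = "\u0399"   # GREEK CAPITAL LETTER IOTA, its uppercase mapping


def upper_moving_iota(text):
    """Uppercase character by character, without normalising.

    Hypothetical helper: the only reordering done is to emit the iota
    that replaces U+0345 after any following characters whose canonical
    combining class is non-zero. The Tibetan special cases are ignored.
    """
    out = []
    pending = []  # capital iotas waiting for the combining run to end
    for ch in text:
        if ch == YPOGEGRAMMENI:
            pending.append(CAPITAL_IOTA)
            continue
        # A character of combining class 0 ends the combining run.
        if pending and unicodedata.combining(ch) == 0:
            out.extend(pending)
            pending.clear()
        out.append(ch.upper())
    out.extend(pending)
    return "".join(out)


# <alpha, ypogegrammeni, acute>: the acute stays on the alpha.
moved = upper_moving_iota("\u03b1\u0345\u0301")

# Case folding decomposes where no single folded character exists:
# U+01F0 (j with caron) folds to <U+006A, U+030C>.
folded = "\u01f0".casefold()
```

With the input above, a plain per-character replacement would leave the acute sitting after the new iota, which is the outcome the reordering is meant to avoid. Precomposed and decomposed e-acute pass through as they came in, so clients doing binary comparison still see the distinction.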
However, if your clients are going to interpret the sequences as text, it is
helpful if you can provide the output in NFC or NFD as required - and
occasionally NFKC and NFKD may be wanted. Too many processes demand NFC -
it's being proposed as an extension of ASCII for some Internet
applications - I presume for things like e-mail headers. For doing
user-customised collation, NFD may be better, for the collation weightings
are defined in terms of NFD.
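
One way to offer both behaviours is to make the normalisation form an optional argument. The sketch below uses Python's standard unicodedata module; the function name upper_normalized and its form parameter are assumptions made for illustration, not part of any existing interface.

```python
import unicodedata


def upper_normalized(text, form=None):
    """Uppercase text, optionally normalising the result.

    form is None to leave the composition of the result alone, or one
    of "NFC", "NFD", "NFKC", "NFKD" to normalise after case conversion.
    (Hypothetical helper for illustration.)
    """
    result = text.upper()
    if form is not None:
        result = unicodedata.normalize(form, result)
    return result


raw = upper_normalized("e\u0301")          # stays as two characters
nfc = upper_normalized("e\u0301", "NFC")   # composed to U+00C9
nfd = upper_normalized("\u00e9", "NFD")    # decomposed to <U+0045, U+0301>
```

Normalising after the case conversion rather than before keeps the default path free of any normalisation cost, and the caller decides which form the downstream process needs.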
One nasty practicality is that some fonts display canonically equivalent
sequences differently. In such cases, the dumb approach may be best. If
someone has painfully worked out the best way of expressing a
diacritic-laden grapheme cluster for the fonts at hand, your best bet on
changing its case would be to make as little change to its composition as
possible. I rather suspect the best-displayed form of capital A with
circumflex and dot below will be obtained with <U+00C2, U+0323>, even though
that is neither NFC nor NFD - the NFC form is <U+1EAC>, but a font may very
well not support it.
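
The claim about <U+00C2, U+0323> can be checked with Python's unicodedata module. This is only a small demonstration; the codepoints helper is a hypothetical convenience for printing.

```python
import unicodedata


def codepoints(s):
    """Render a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


seq = "\u00c2\u0323"  # capital A with circumflex, then combining dot below

nfc = unicodedata.normalize("NFC", seq)
nfd = unicodedata.normalize("NFD", seq)

# The sequence as typed matches neither normal form.
is_neither = seq != nfc and seq != nfd

# Changing case character by character keeps the composition as it was.
lowered = seq.lower()

print(codepoints(seq), "| NFC:", codepoints(nfc), "| NFD:", codepoints(nfd))
print("lowercased:", codepoints(lowered))
```

The lowercased result is <U+00E2, U+0323>, the same shape as the input, which is the "as little change as possible" behaviour suggested for font-sensitive text.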
Richard.