Re: FW:transform a (UNICODE) accented character to its equivalent (UNICODE) non-accented character

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 06 2003 - 06:37:56 EDT


    On Tuesday, August 05, 2003 9:34 PM, John Cowan <jcowan@reutershealth.com> wrote:

    > Magda Danish (Unicode) scripsit:
    >
    > > > I'm looking for the easiest and more stable way to transform
    > > > an (UNICODE) accented character to its equivalent (UNICODE)
    > > > non-accented character.
    >
    > The following mapping table is an approximation to that.
    >
    > 00C0;0041
    (snip)
    > 1D1C0;1D1BA

    Why such a table? The main UCD table already contains the canonical decompositions used by NFD, and removing accents is simply a matter of NFD decomposition followed by removal of the combining characters (those with combining class > 0). You can then tune the set of filtered diacritics, for example so as not to remove some Brahmic diacritics such as viramas, or the Hiragana/Katakana voicing marks. These are easy to identify from their low positive combining class values. They are not really accents, but they are important for correctly identifying vowels and consonants, and removing them would create too many ambiguities.
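
    A minimal sketch of that approach in Python, using only the standard unicodedata module. The function name strip_accents and the KEEP_CLASSES tuning set are illustrative assumptions, not part of any Unicode specification; the classes listed (7 for nuktas, 8 for Kana voicing marks, 9 for viramas) are the low combining class values mentioned above.

```python
import unicodedata

# Hypothetical tuning set: combining classes that are kept even though > 0.
# 7 = nukta, 8 = Kana voicing marks, 9 = virama.
KEEP_CLASSES = frozenset({7, 8, 9})

def strip_accents(text, keep_classes=KEEP_CLASSES):
    """Decompose to NFD, then drop combining marks unless their class is kept."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        ccc = unicodedata.combining(ch)  # canonical combining class, 0 for base characters
        if ccc == 0 or ccc in keep_classes:
            kept.append(ch)
    return "".join(kept)

print(strip_accents("Éléonore"))
```

    Note that the result is left in decomposed form: a kept voicing mark stays as a separate combining character until the string is recomposed with NFC.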

    Using the NFD/NFC algorithms is certainly the best and safest option, as normalization is stable across Unicode versions. The canonical mappings in the UCD also replace singleton characters (those encoded only as duplicates of another character) with their canonical equivalents; characters that have only a compatibility decomposition need NFKD instead. Some tuning may be required for Arabic, which includes precomposed sequences defined for compatibility and mapped only by NFKD, because they sometimes include more than a base character (possibly decomposable) and a single undecomposable diacritic. Other tuning may also be needed for Arabic and Hebrew (accents and points), depending on whether one wants to preserve the traditional vowels or use a "modern" simplified mapping without vowels.
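
    To illustrate the difference between canonical and compatibility mappings for Arabic, here is a small sketch. The two sample characters are my own choice: U+0623 (ALEF WITH HAMZA ABOVE) has a canonical decomposition, while U+FEFB (LIGATURE LAM WITH ALEF ISOLATED FORM) has only a compatibility one.

```python
import unicodedata

def codepoints(text):
    """Render a string as a list of U+XXXX labels for easy inspection."""
    return ["U+%04X" % ord(ch) for ch in text]

alef_hamza = "\u0623"    # canonical decomposition: U+0627 U+0654
lam_alef_lig = "\uFEFB"  # compatibility decomposition only: <isolated> U+0644 U+0627

for label, ch in (("alef+hamza", alef_hamza), ("lam-alef ligature", lam_alef_lig)):
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    print(label, "NFD:", codepoints(nfd), "NFKD:", codepoints(nfkd))
```

    NFD leaves the ligature untouched, so a filter built on NFD alone never sees its component letters; NFKD is needed for such presentation forms.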

    But the NFD mappings are already good for Han. Some compatibility decompositions (NFKD) may also be useful for East Asian text, notably to remove the narrow/wide (halfwidth/fullwidth) differences.
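
    A short sketch of both points, again with Python's unicodedata. The sample characters are assumptions chosen for illustration: U+F900 is a CJK compatibility ideograph with a canonical singleton mapping, and U+FF21 and U+FF76 are fullwidth and halfwidth forms with compatibility mappings. The helper name fold_width is hypothetical.

```python
import unicodedata

def fold_width(text):
    """Remove halfwidth/fullwidth distinctions via compatibility normalization.

    Caveat: NFKC applies every compatibility mapping, not only width
    variants, so it may change more than the width of a character.
    """
    return unicodedata.normalize("NFKC", text)

# A CJK compatibility ideograph is already handled by canonical NFD.
han_compat = "\uF900"
print("U+%04X" % ord(unicodedata.normalize("NFD", han_compat)))

# Fullwidth Latin and halfwidth Katakana need the compatibility forms.
print(fold_width("\uFF21\uFF22\uFF23"))
print("U+%04X" % ord(fold_width("\uFF76")))
```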

    If your intent is to remove only accents in alphabetic scripts, it's probably best to remove only the diacritics (combining class > 0) below U+0800, notably those in the U+0300..U+036F block (Combining Diacritical Marks), after the NFD decomposition, and to ensure that the resulting string is recomposed and reordered with the NFC rules. For Japanese, one may want to remap Katakana to Hiragana (while still keeping the Kana voicing marks), using a table currently not defined by Unicode, but documented in IBM's open-sourced ICU.
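
    The restricted variant could be sketched as follows. The names strip_low_accents and katakana_to_hiragana are hypothetical. The Kana remapping shown here is NOT the ICU table: it is a simplifying assumption that relies on the fixed offset of 0x60 between the Katakana range U+30A1..U+30F6 and the Hiragana range U+3041..U+3096.

```python
import unicodedata

ACCENT_LIMIT = 0x0800  # cut-off suggested in the message

def strip_low_accents(text):
    """NFD, drop combining marks (class > 0) below U+0800, then recompose with NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not (unicodedata.combining(ch) > 0 and ord(ch) < ACCENT_LIMIT)
    )
    return unicodedata.normalize("NFC", kept)

def katakana_to_hiragana(text):
    """Assumed offset mapping (not the ICU table): U+30A1..U+30F6 to U+3041..U+3096."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )

print(strip_low_accents("crème brûlée"))
print(katakana_to_hiragana(strip_low_accents("\u30AC")))
```

    Because the Kana voicing marks sit above U+0800, they survive the filter and are recomposed by NFC. The same cut-off does include the Hebrew points and Arabic vowel marks, so those scripts still need the separate tuning discussed earlier.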

    See UAX #15 (Unicode Normalization Forms).

    -- 
    Philippe.
    Spam is not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

