Re: Unicode->ASCII approximate conversion

From: jon@hackcraft.net
Date: Fri Dec 19 2003 - 07:13:17 EST

    Quoting Hallvard B Furuseth <h.b.furuseth@usit.uio.no>:

    > I need a function which converts Latin Unicode characters to the closest
    > equivalent ASCII characters, e.g. "é" -> "e".
    >
    > Before I reinvent the wheel, does any public domain or GPL code for this
    > already exist?
    >
    > If not,
    > for the most part I expect I can make the mapping from the character
    > names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
    > in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
    > Punctuation and other non-letters will be worse, but they are less
    > important to me anyway.
    >

    1. Produce the NFD normalisation of the text.
    2. Remove all characters with a non-zero combining class.
    3. Some non-ASCII characters may remain (particularly those from non-Latin
    scripts). Some of these can be handled nicely, but others may require you to
    raise an exception or output a replacement character.
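
    As a rough illustration, the three steps above could be sketched in Python
    with the standard unicodedata module. The function name to_ascii_approx and
    the "?" replacement character are assumptions made for this example only;
    raising an exception instead would be equally valid for step 3.

```python
import unicodedata

def to_ascii_approx(text, replacement="?"):
    """Approximate `text` in ASCII (hypothetical helper, not a library API)."""
    # Step 1: canonical decomposition, so "é" becomes "e" + COMBINING ACUTE.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        # Step 2: drop every character with a non-zero combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Step 3: anything still outside ASCII gets the replacement character.
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)

print(to_ascii_approx("é"))
print(to_ascii_approx("Óslo"))
```

    Letters such as "ø" have no canonical decomposition, so in this sketch they
    fall through to the replacement character unless handled separately.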

    This can be done efficiently with a streaming processor if the size of the
    source text is large.
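
    One possible shape for such a streaming processor, again only a sketch: the
    generator name strip_marks_stream and the chunk size are assumptions. It
    relies on the observation that canonical decomposition works per character
    and that only combining marks get reordered, so once those marks are
    dropped, chunk boundaries should not affect the result.

```python
import io
import unicodedata

def strip_marks_stream(reader, chunk_size=4096):
    """Yield chunks of text with combining marks removed (hypothetical helper)."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        # Decompose this chunk only; the whole text is never held in memory.
        decomposed = unicodedata.normalize("NFD", chunk)
        # Keep only characters whose combining class is zero.
        yield "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

source = io.StringIO("café résumé")
print("".join(strip_marks_stream(source, chunk_size=3)))
```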

    You may want to use NFKD rather than NFD. NFKD would, for example, convert the
    trademark symbol to "TM" and superscript 2 to "2". This would allow you to
    convert more characters, but the loss of semantics may be problematic
    depending on your application. Specialised handling of some characters is
    possible; for instance, you could convert the trademark sign to "(TM)" to
    avoid confusion. Of course, this wouldn't be possible with an existing
    normalisation API alone, though if the number of characters handled specially
    is small it would be possible to do that in a first pass.
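
    A minimal sketch of that first-pass idea in Python follows. The SPECIAL
    table and the name to_ascii_compat are hypothetical, and the table holds
    only the trademark example mentioned above; a real application would
    choose its own entries.

```python
import unicodedata

# Characters handled specially before normalisation (illustrative only).
SPECIAL = {"\u2122": "(TM)"}  # TRADE MARK SIGN

def to_ascii_compat(text, special=SPECIAL):
    """First pass for special cases, then NFKD and mark removal (sketch)."""
    # First pass: substitute the small set of specially handled characters.
    pre = "".join(special.get(ch, ch) for ch in text)
    # NFKD also applies compatibility mappings, e.g. superscript 2 -> "2".
    decomposed = unicodedata.normalize("NFKD", pre)
    # Drop combining marks, as in the NFD approach.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(to_ascii_compat("Acme™ x²"))
```

    Without the first pass, NFKD alone would turn the trademark sign into a
    bare "TM" that runs into the preceding word.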

    --
    Jon Hanna                   | Toys and books
    <http://www.hackcraft.net/> | for hospitals:
                                | <http://santa.boards.ie>
    

