Re: Unicode->ASCII approximate conversion

From: jon@hackcraft.net
Date: Fri Dec 19 2003 - 07:13:17 EST

Next message: Jungshik Shin: "Re: Unicode->ASCII approximate conversion"

Previous message: Marco Cimarosti: "RE: Unicode->ASCII approximate conversion"
In reply to: Hallvard B Furuseth: "Unicode->ASCII approximate conversion"
Next in thread: Jungshik Shin: "Re: Unicode->ASCII approximate conversion"
Reply: Jungshik Shin: "Re: Unicode->ASCII approximate conversion"

Quoting Hallvard B Furuseth <h.b.furuseth@usit.uio.no>:

> I need a function which converts Latin Unicode characters to the closest
> equivalent ASCII characters, e.g. "é" -> "e".
>
> Before I reinvent the wheel, does any public domain or GPL code for this
> already exist?
>
> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.
>

1. Produce the NFD normalisation of the text.
2. Remove all characters with a non-zero combining class.
3. Some non-ASCII characters may remain (particularly those from non-Latin
scripts). Some of these can be handled nicely, but others may require you to
raise an exception or output a replacement character.
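
The three steps above can be sketched in Python with the standard
unicodedata module. This is only an illustration, not a complete solution: the
function name to_ascii and the "?" replacement are assumptions of mine, and
Latin letters that have no canonical decomposition (for example "ø") fall
through to step 3.

```python
import unicodedata

def to_ascii(text, replacement="?"):
    """Approximate text in ASCII (hypothetical helper, for illustration)."""
    out = []
    # Step 1: canonical decomposition, e.g. "é" becomes "e" + U+0301.
    for ch in unicodedata.normalize("NFD", text):
        # Step 2: drop characters with a non-zero combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Step 3: anything still outside ASCII gets the replacement;
        # raising an exception here would be the stricter alternative.
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)

print(to_ascii("naïve café"))
```

Whether step 3 should substitute a character or raise depends on how much
silent loss the application can tolerate.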

This can be done efficiently with a streaming processor if the size of the
source text is large.
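
One possible shape for such a streaming processor, as a sketch: the generator
name ascii_chunks, the chunk size and the "?" replacement are my own
assumptions. It relies on the observation that canonical decomposition works
one code point at a time and that only combining marks are reordered, so once
the marks are discarded each chunk can be processed independently. The reader
is assumed to be a text-mode stream, so chunks always end on a code point
boundary.

```python
import io
import unicodedata

def ascii_chunks(reader, chunk_size=4096, replacement="?"):
    """Yield ASCII approximations of successive chunks from a text stream."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break  # end of stream
        yield "".join(
            ch if ord(ch) < 128 else replacement
            for ch in unicodedata.normalize("NFD", chunk)
            if unicodedata.combining(ch) == 0  # skip combining marks
        )

# A deliberately tiny chunk size, to show that chunk boundaries do not matter.
source = io.StringIO("café déjà vu")
print("".join(ascii_chunks(source, chunk_size=3)))
```

For a real file, something like open(path, encoding="utf-8") would take the
place of the StringIO object.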

You may want to use NFKD rather than NFD. NFKD would, for example, convert the
trademark symbol to "TM" and superscript 2 to "2" - this would allow you to
convert more characters, but the loss of semantics may be problematic depending
on your application. Specialised handling of some characters is possible; for
instance, you could convert the trademark sign to "(TM)" to avoid confusion. Of
course, this wouldn't be possible with an existing normalisation API alone,
though if the number of characters handled specially is small, it would be
possible to do that in a first pass.
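
A sketch of the NFKD variant with a special-case first pass, again in Python.
The SPECIAL table, the function name to_ascii_compat and the "?" replacement
are hypothetical; only the trademark entry comes from the suggestion above.

```python
import unicodedata

# Hypothetical special cases, applied before normalisation.
SPECIAL = {"\u2122": "(TM)"}  # TRADE MARK SIGN

def to_ascii_compat(text, special=SPECIAL, replacement="?"):
    """Approximate text in ASCII using NFKD plus a first-pass mapping."""
    # First pass: substitute the characters that need special handling.
    text = "".join(special.get(ch, ch) for ch in text)
    out = []
    # Second pass: compatibility decomposition, then drop combining marks.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:
            continue
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)

print(to_ascii_compat("Acme™ x²"))
```

With an empty special table the trademark sign falls through to NFKD and comes
out as plain "TM", which is the ambiguity the first pass is meant to avoid.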

--
Jon Hanna                   | Toys and books
<http://www.hackcraft.net/> | for hospitals:
                            | <http://santa.boards.ie>
