From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Dec 19 2003 - 07:01:59 EST
Hallvard B Furuseth wrote:
> I need a function which converts Latin Unicode characters to
> the closest equivalent ASCII characters, e.g. "é" -> "e".
>
> Before I reinvent the wheel, does any public domain or GPL
> code for this already exist?
I don't know, sorry.
> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
Why the name!?
The decomposition property (field 5 on each line, counting from 0) is much
better for this.
E.g.:
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
The decomposition field tells you that "é" (code 00E9 hex) is composed of
ASCII "e" (code 0065 hex) and the combining acute accent (code 0301 hex):
you keep the ASCII character and drop the combining accent.
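A minimal sketch of that idea in Python, for illustration only. The function names `decomposition_field` and `ascii_base` are made up for this example. It assumes that the decomposition data in Python's standard `unicodedata` module is an acceptable stand-in for parsing UnicodeData.txt yourself, and that characters with only a compatibility decomposition (the ones tagged like `<fraction>`) should be left alone at this stage.

```python
import unicodedata

# The sample record quoted above, as one line of UnicodeData.txt.
RECORD = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
          "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")


def decomposition_field(record):
    """Return the decomposition field (index 5, counting from 0)."""
    return record.split(";")[5]


def ascii_base(ch):
    """Return the ASCII base of ch, or None if the decomposition gives none.

    Hypothetical helper: it follows canonical decompositions recursively
    and drops every combining mark it meets on the way.
    """
    if ord(ch) < 128:
        return ch
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return None  # no decomposition, or a compatibility one
    kept = []
    for code in decomp.split():
        part = chr(int(code, 16))
        if unicodedata.combining(part):
            continue  # this is the accent: drop it
        base = ascii_base(part)
        if base is None:
            return None
        kept.append(base)
    return "".join(kept) or None


print(decomposition_field(RECORD))  # 0065 0301
print(ascii_base("\u00e9"))         # e
```

Characters without a canonical decomposition, such as "ß", come back as None here, so the caller still has to decide what to do with them.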
> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.
The result is much better if you allow the ASCII conversion to be a string.
This allows you to map, e.g., "©" -> "(c)", "½" -> "1/2", and so on. This is
also good for letters: "ß" -> "ss", "å" -> "aa", etc.
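As a rough sketch of such a string-valued mapping in Python: the table name `OVERRIDES`, the function `to_ascii` and the "?" placeholder are assumptions for this example, and the table holds only the four pairs mentioned above. Anything not in the table falls back to dropping combining marks after NFD normalization, which is one possible fallback among several.

```python
import unicodedata

# Hand-written replacements, taken from the examples in this message.
OVERRIDES = {
    "\u00a9": "(c)",  # copyright sign
    "\u00bd": "1/2",  # vulgar fraction one half
    "\u00df": "ss",   # sharp s
    "\u00e5": "aa",   # a with ring above
}


def to_ascii(text, placeholder="?"):
    """Convert text to ASCII, letting one character become a string."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        else:
            # Fallback: decompose and keep only the non-combining part.
            stripped = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
            if stripped and stripped.isascii():
                out.append(stripped)
            else:
                out.append(placeholder)
    return "".join(out)


print(to_ascii("\u00a9 2003 caf\u00e9"))  # (c) 2003 cafe
print(to_ascii("Stra\u00dfe"))            # Strasse
```

The table is checked before the fallback, so an entry like "å" = "aa" wins over the plain "a" that stripping the ring would give.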
_ Marco