From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Mar 04 2007 - 13:16:19 CST
Subject: RE: A .Net Unicode Puzzle
Kent Karlsson wrote on Sunday, March 04, 2007 1:49 PM:
> Converting to a non-Unicode codepage is one thing, removing all "accents"
> is a completely different thing.
The code doesn't even succeed in that. I gave up listing partway through the
IPA Extensions block, but most of the following incorporate diacritics:
0243;LATIN CAPITAL LETTER B WITH STROKE
0180;LATIN SMALL LETTER B WITH STROKE
023B;LATIN CAPITAL LETTER C WITH STROKE
023C;LATIN SMALL LETTER C WITH STROKE
0110;LATIN CAPITAL LETTER D WITH STROKE
0111;LATIN SMALL LETTER D WITH STROKE
0246;LATIN CAPITAL LETTER E WITH STROKE
0247;LATIN SMALL LETTER E WITH STROKE
0126;LATIN CAPITAL LETTER H WITH STROKE
0127;LATIN SMALL LETTER H WITH STROKE
0197;LATIN CAPITAL LETTER I WITH STROKE
0268;LATIN SMALL LETTER I WITH STROKE
0248;LATIN CAPITAL LETTER J WITH STROKE
0249;LATIN SMALL LETTER J WITH STROKE
025F;LATIN SMALL LETTER DOTLESS J WITH STROKE
0141;LATIN CAPITAL LETTER L WITH STROKE
0142;LATIN SMALL LETTER L WITH STROKE
00D8;LATIN CAPITAL LETTER O WITH STROKE
00F8;LATIN SMALL LETTER O WITH STROKE
024C;LATIN CAPITAL LETTER R WITH STROKE
024D;LATIN SMALL LETTER R WITH STROKE
0166;LATIN CAPITAL LETTER T WITH STROKE
0167;LATIN SMALL LETTER T WITH STROKE
023E;LATIN CAPITAL LETTER T WITH DIAGONAL STROKE
024E;LATIN CAPITAL LETTER Y WITH STROKE
024F;LATIN SMALL LETTER Y WITH STROKE
01B5;LATIN CAPITAL LETTER Z WITH STROKE
01B6;LATIN SMALL LETTER Z WITH STROKE
019B;LATIN SMALL LETTER LAMBDA WITH STROKE
0189;LATIN CAPITAL LETTER AFRICAN D
0182;LATIN CAPITAL LETTER B WITH TOPBAR
0183;LATIN SMALL LETTER B WITH TOPBAR
018B;LATIN CAPITAL LETTER D WITH TOPBAR
018C;LATIN SMALL LETTER D WITH TOPBAR
0181;LATIN CAPITAL LETTER B WITH HOOK
0253;LATIN SMALL LETTER B WITH HOOK
0187;LATIN CAPITAL LETTER C WITH HOOK
0188;LATIN SMALL LETTER C WITH HOOK
018A;LATIN CAPITAL LETTER D WITH HOOK
0257;LATIN SMALL LETTER D WITH HOOK
0256;LATIN SMALL LETTER D WITH TAIL
025D;LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
025A;LATIN SMALL LETTER SCHWA WITH HOOK
0191;LATIN CAPITAL LETTER F WITH HOOK
0192;LATIN SMALL LETTER F WITH HOOK
0193;LATIN CAPITAL LETTER G WITH HOOK
0260;LATIN SMALL LETTER G WITH HOOK
0266;LATIN SMALL LETTER H WITH HOOK
0198;LATIN CAPITAL LETTER K WITH HOOK
0199;LATIN SMALL LETTER K WITH HOOK
026D;LATIN SMALL LETTER L WITH RETROFLEX HOOK
0273;LATIN SMALL LETTER N WITH RETROFLEX HOOK
01A4;LATIN CAPITAL LETTER P WITH HOOK
01A5;LATIN SMALL LETTER P WITH HOOK
027B;LATIN SMALL LETTER TURNED R WITH HOOK
01AC;LATIN CAPITAL LETTER T WITH HOOK
01AD;LATIN SMALL LETTER T WITH HOOK
01B2;LATIN CAPITAL LETTER V WITH HOOK
01B3;LATIN CAPITAL LETTER Y WITH HOOK
01B4;LATIN SMALL LETTER Y WITH HOOK
01AB;LATIN SMALL LETTER T WITH PALATAL HOOK
01AE;LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
0224;LATIN CAPITAL LETTER Z WITH HOOK
0225;LATIN SMALL LETTER Z WITH HOOK
023D;LATIN CAPITAL LETTER L WITH BAR
019A;LATIN SMALL LETTER L WITH BAR
019F;LATIN CAPITAL LETTER O WITH MIDDLE TILDE
0255;LATIN SMALL LETTER C WITH CURL
0234;LATIN SMALL LETTER L WITH CURL
0235;LATIN SMALL LETTER N WITH CURL
0236;LATIN SMALL LETTER T WITH CURL
023F;LATIN SMALL LETTER S WITH SWASH TAIL
0240;LATIN SMALL LETTER Z WITH SWASH TAIL
0244;LATIN CAPITAL LETTER U BAR
024A;LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
024B;LATIN SMALL LETTER Q WITH HOOK TAIL
026B;LATIN SMALL LETTER L WITH MIDDLE TILDE
026C;LATIN SMALL LETTER L WITH BELT
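To illustrate why the letters above slip through, here is a minimal sketch in Python (not the .NET code under discussion). It assumes the stripper works by canonical decomposition followed by removal of nonspacing marks, which may not be exactly what the original code does. The function name is hypothetical.

```python
import unicodedata

def strip_nonspacing_marks(text: str) -> str:
    """Decompose to NFD, drop every nonspacing mark (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))

# A letter with a canonical decomposition loses its accent:
# U+00E9 decomposes to 'e' + U+0301, and U+0301 is then dropped.
assert strip_nonspacing_marks("\u00E9") == "e"

# Stroke, hook and bar letters have no canonical decomposition,
# so this kind of stripper returns them unchanged.
for ch in "\u00F8\u0142\u0111\u0253\u019A":
    unchanged = strip_nonspacing_marks(ch) == ch
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: unchanged={unchanged}")
```

Handling these letters would need an explicit mapping table, since nothing in the normalization data relates them to their base letters.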
These two contain a 'subtractive diacritic':
0131;LATIN SMALL LETTER DOTLESS I
0237;LATIN SMALL LETTER DOTLESS J
These are uncertain cases:
00D0;LATIN CAPITAL LETTER ETH
00F0;LATIN SMALL LETTER ETH
0138;LATIN SMALL LETTER KRA
This one, I think, does not really contain a diacritic:
0267;LATIN SMALL LETTER HENG WITH HOOK
A diacritic-stripper might need to decompose these ligatures:
00C6;LATIN CAPITAL LETTER AE
00E6;LATIN SMALL LETTER AE
0152;LATIN CAPITAL LIGATURE OE
0153;LATIN SMALL LIGATURE OE
0276;LATIN LETTER SMALL CAPITAL OE
00DF;LATIN SMALL LETTER SHARP S
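One way such a ligature step could look, sketched in Python with a hand-written table. The expansions chosen here are assumptions for illustration (in particular the treatment of U+0276), and the names LIGATURE_EXPANSIONS and expand_ligatures are hypothetical.

```python
# Illustrative expansions only; the appropriate choice may depend on language.
LIGATURE_EXPANSIONS = {
    "\u00C6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00E6": "ae",  # LATIN SMALL LETTER AE
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    "\u0276": "oe",  # LATIN LETTER SMALL CAPITAL OE (assumption: fold to lowercase)
    "\u00DF": "ss",  # LATIN SMALL LETTER SHARP S
}

# str.maketrans accepts a dict from single characters to replacement strings.
_TABLE = str.maketrans(LIGATURE_EXPANSIONS)

def expand_ligatures(text: str) -> str:
    """Replace each listed ligature with its two-letter expansion."""
    return text.translate(_TABLE)

print(expand_ligatures("Stra\u00DFe"))   # Strasse
print(expand_ligatures("\u0152uvre"))    # OEuvre
```

The second result shows a limitation of a context-free table: a word-initial capital ligature becomes "OE" rather than "Oe", so preserving capitalisation sensibly would need a look at the neighbouring letters.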
I'm not sure of the decomposition of these - I rather suspect they should
decompose to 'db' and 'qp'.
0238;LATIN SMALL LETTER DB DIGRAPH
0239;LATIN SMALL LETTER QP DIGRAPH
I'm not sure whether one should decompose the likes of eng (U+014B). If so,
the treatment could sensibly be context sensitive.
> The latter is a false start to just about everything. Note also that many,
> if not most, non-Unicode codepages are NOT "accent free".
And stripping non-spacing marks from Indic scripts is plain vandalism.
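A small Python sketch of the damage, assuming a stripper that removes every character of general category Mn after NFD normalization (the function name is hypothetical). In Devanagari, many dependent vowel signs are nonspacing marks, so removing them changes the vowel of the syllable rather than removing an "accent".

```python
import unicodedata

def strip_mn(text: str) -> str:
    """Remove all nonspacing marks (category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# KA + VOWEL SIGN U + LA, roughly "kul"
word = "\u0915\u0941\u0932"
stripped = strip_mn(word)

# U+0941 DEVANAGARI VOWEL SIGN U is category Mn, so it is removed,
# leaving KA + LA, roughly "kal": a different word.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
print(stripped == "\u0915\u0932")  # True
```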
A more useful exercise would be to do something like reducing strings to a
'minimal' string that collates equal at the first level - ideally it would,
in some sense, preserve capitalisation.
Richard.