Re: A .Net Unicode Puzzle

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Mar 04 2007 - 13:16:19 CST

    Kent Karlsson wrote on Sunday, March 04, 2007 1:49 PM
    Subject: RE: A .Net Unicode Puzzle

    > Converting to a non-Unicode codepage is one thing, removing all "accents"
    > is a completely different thing.

    The code doesn't even succeed in that. I gave up listing partway through
    the IPA Extensions block, but most of the following incorporate diacritics:

    0243;LATIN CAPITAL LETTER B WITH STROKE
    0180;LATIN SMALL LETTER B WITH STROKE
    023B;LATIN CAPITAL LETTER C WITH STROKE
    023C;LATIN SMALL LETTER C WITH STROKE
    0110;LATIN CAPITAL LETTER D WITH STROKE
    0111;LATIN SMALL LETTER D WITH STROKE
    0246;LATIN CAPITAL LETTER E WITH STROKE
    0247;LATIN SMALL LETTER E WITH STROKE
    0126;LATIN CAPITAL LETTER H WITH STROKE
    0127;LATIN SMALL LETTER H WITH STROKE
    0197;LATIN CAPITAL LETTER I WITH STROKE
    0268;LATIN SMALL LETTER I WITH STROKE
    0248;LATIN CAPITAL LETTER J WITH STROKE
    0249;LATIN SMALL LETTER J WITH STROKE
    025F;LATIN SMALL LETTER DOTLESS J WITH STROKE
    0141;LATIN CAPITAL LETTER L WITH STROKE
    0142;LATIN SMALL LETTER L WITH STROKE
    00D8;LATIN CAPITAL LETTER O WITH STROKE
    00F8;LATIN SMALL LETTER O WITH STROKE
    024C;LATIN CAPITAL LETTER R WITH STROKE
    024D;LATIN SMALL LETTER R WITH STROKE
    0166;LATIN CAPITAL LETTER T WITH STROKE
    0167;LATIN SMALL LETTER T WITH STROKE
    023E;LATIN CAPITAL LETTER T WITH DIAGONAL STROKE
    024E;LATIN CAPITAL LETTER Y WITH STROKE
    024F;LATIN SMALL LETTER Y WITH STROKE
    01B5;LATIN CAPITAL LETTER Z WITH STROKE
    01B6;LATIN SMALL LETTER Z WITH STROKE
    019B;LATIN SMALL LETTER LAMBDA WITH STROKE

    0189;LATIN CAPITAL LETTER AFRICAN D

    0182;LATIN CAPITAL LETTER B WITH TOPBAR
    0183;LATIN SMALL LETTER B WITH TOPBAR
    018B;LATIN CAPITAL LETTER D WITH TOPBAR
    018C;LATIN SMALL LETTER D WITH TOPBAR

    0181;LATIN CAPITAL LETTER B WITH HOOK
    0253;LATIN SMALL LETTER B WITH HOOK
    0187;LATIN CAPITAL LETTER C WITH HOOK
    0188;LATIN SMALL LETTER C WITH HOOK
    018A;LATIN CAPITAL LETTER D WITH HOOK
    0257;LATIN SMALL LETTER D WITH HOOK
    0256;LATIN SMALL LETTER D WITH TAIL
    025D;LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
    025A;LATIN SMALL LETTER SCHWA WITH HOOK
    0191;LATIN CAPITAL LETTER F WITH HOOK
    0192;LATIN SMALL LETTER F WITH HOOK
    0193;LATIN CAPITAL LETTER G WITH HOOK
    0260;LATIN SMALL LETTER G WITH HOOK
    0266;LATIN SMALL LETTER H WITH HOOK
    0198;LATIN CAPITAL LETTER K WITH HOOK
    0199;LATIN SMALL LETTER K WITH HOOK
    026D;LATIN SMALL LETTER L WITH RETROFLEX HOOK
    0273;LATIN SMALL LETTER N WITH RETROFLEX HOOK
    01A4;LATIN CAPITAL LETTER P WITH HOOK
    01A5;LATIN SMALL LETTER P WITH HOOK
    027B;LATIN SMALL LETTER TURNED R WITH HOOK
    01AC;LATIN CAPITAL LETTER T WITH HOOK
    01AD;LATIN SMALL LETTER T WITH HOOK
    01B2;LATIN CAPITAL LETTER V WITH HOOK
    01B3;LATIN CAPITAL LETTER Y WITH HOOK
    01B4;LATIN SMALL LETTER Y WITH HOOK
    01AB;LATIN SMALL LETTER T WITH PALATAL HOOK
    01AE;LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
    0224;LATIN CAPITAL LETTER Z WITH HOOK
    0225;LATIN SMALL LETTER Z WITH HOOK

    023D;LATIN CAPITAL LETTER L WITH BAR
    019A;LATIN SMALL LETTER L WITH BAR
    019F;LATIN CAPITAL LETTER O WITH MIDDLE TILDE

    0255;LATIN SMALL LETTER C WITH CURL
    0234;LATIN SMALL LETTER L WITH CURL
    0235;LATIN SMALL LETTER N WITH CURL
    0236;LATIN SMALL LETTER T WITH CURL

    023F;LATIN SMALL LETTER S WITH SWASH TAIL
    0240;LATIN SMALL LETTER Z WITH SWASH TAIL
    0244;LATIN CAPITAL LETTER U BAR

    024A;LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
    024B;LATIN SMALL LETTER Q WITH HOOK TAIL

    026B;LATIN SMALL LETTER L WITH MIDDLE TILDE
    026C;LATIN SMALL LETTER L WITH BELT
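
A minimal Python sketch of why a list like this defeats the usual approach. It assumes the code under discussion normalises to form D and then drops nonspacing marks (category Mn); that is an assumption about the code, and `strip_marks` is a hypothetical name used only for illustration. Letters such as U+0142 carry no decomposition mapping, so there is nothing for the filter to remove.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every nonspacing mark (category Mn).

    Hypothetical stand-in for the accent-stripping code under discussion.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# U+00E9 (e with acute) decomposes to e + U+0301, so the accent is removed.
print(strip_marks("\u00e9"))  # e

# U+0142 (l with stroke) and U+00F8 (o with stroke) have no decomposition
# mapping, so they pass through unchanged.
print(strip_marks("\u0142\u00f8") == "\u0142\u00f8")  # True
print(repr(unicodedata.decomposition("\u0142")))  # ''
```

Any treatment of the stroke, hook, bar and curl letters therefore has to come from a separate, hand-maintained table rather than from normalisation.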

    These two contain a 'subtractive diacritic':
    0131;LATIN SMALL LETTER DOTLESS I
    0237;LATIN SMALL LETTER DOTLESS J

    These are uncertain cases:
    00D0;LATIN CAPITAL LETTER ETH
    00F0;LATIN SMALL LETTER ETH
    0138;LATIN SMALL LETTER KRA

    This I think does not really contain a diacritic:
    0267;LATIN SMALL LETTER HENG WITH HOOK

    A diacritic-stripper might need to decompose these ligatures:
    00C6;LATIN CAPITAL LETTER AE
    00E6;LATIN SMALL LETTER AE
    0152;LATIN CAPITAL LIGATURE OE
    0153;LATIN SMALL LIGATURE OE
    0276;LATIN LETTER SMALL CAPITAL OE
    00DF;LATIN SMALL LETTER SHARP S
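
As a sketch of what such a decomposition step might look like in Python: the table below is hypothetical and written only for illustration, and the expansions chosen (in particular for U+0276 and U+00DF) are judgment calls rather than anything defined by Unicode. None of the six characters has a decomposition mapping in the character database, so a table of this kind is the only way to split them.

```python
import unicodedata

# Hypothetical expansion table for illustration; not a Unicode-defined mapping.
LIGATURE_EXPANSIONS = {
    "\u00c6": "AE",  # LATIN CAPITAL LETTER AE
    "\u00e6": "ae",  # LATIN SMALL LETTER AE
    "\u0152": "OE",  # LATIN CAPITAL LIGATURE OE
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    "\u0276": "oe",  # LATIN LETTER SMALL CAPITAL OE (a lowercase letter; assumption)
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S (assumption)
}


def expand_ligatures(text: str) -> str:
    """Replace each listed ligature with its assumed letter sequence."""
    return "".join(LIGATURE_EXPANSIONS.get(ch, ch) for ch in text)


# Normalisation alone cannot help: none of these has a decomposition mapping.
assert all(unicodedata.decomposition(ch) == "" for ch in LIGATURE_EXPANSIONS)

print(expand_ligatures("\u0153uvre, Stra\u00dfe, \u00e6on"))  # oeuvre, Strasse, aeon
```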

    I'm not sure of the decomposition of these - I rather suspect they should
    decompose to 'db' and 'qp'.
    0238;LATIN SMALL LETTER DB DIGRAPH
    0239;LATIN SMALL LETTER QP DIGRAPH

    I'm not sure whether one should decompose the likes of eng. If so, the
    treatment could sensibly be context sensitive.

    > The latter is a false start to just about everything. Note also that many,
    > if not most, non-Unicode codepages are NOT "accent free".

    And stripping non-spacing marks from Indic scripts is plain vandalism.
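
A small Python illustration of the damage, again assuming the stripper simply drops category Mn after normalising to form D (`strip_nonspacing` is a hypothetical name). In Devanagari the virama and several vowel signs are nonspacing marks, so removing them changes the word rather than tidying it.

```python
import unicodedata


def strip_nonspacing(text: str) -> str:
    """Drop every nonspacing mark (Mn) after NFD; hypothetical stripper."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "namaste" in Devanagari: NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
stripped = strip_nonspacing(word)

# Show which code points are letters (Lo) and which are marks (Mn).
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# The virama (U+094D) and the vowel sign (U+0947) are both Mn and are lost,
# which breaks the conjunct and removes the final vowel.
print(len(word), len(stripped))  # 6 4
```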

    A more useful exercise would be to do something like reducing strings to a
    'minimal' string that collates equal at the first level - ideally it would,
    in some sense, preserve capitalisation.

    Richard.


