RE: A .Net Unicode Puzzle

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 05 2007 - 16:39:18 CST


    With such an exclusion list, why not also remove U and W (variants of V),
    J (a variant of I), D (a variant of T), B (a variant of P)...?
    Removing "accents" as you do is not a good idea, because these are really
    different letters (the decomposability is superficial, and such removal
    unifies them in a way that breaks almost all linguistic rules).

    What a strange idea... which would have the bad effect of creating lots of
    ambiguities, or unpronounceable and unrecognizable words (it won't even help
    English users).

    With such "simplifications", the effect will be even worse than unifying
    doubled consonants into single consonants (like MM -> M). I see absolutely
    no point in a transform that just breaks text originally containing
    most of these letters.
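
    A minimal sketch of the kind of transform being criticised, assuming the
    stripper works by canonical decomposition (NFD) followed by removal of
    nonspacing marks (category Mn). The code under discussion is not shown in
    this message, so the function name strip_marks and the sample strings are
    illustrative assumptions, not the original implementation.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Naive 'accent removal': decompose to NFD, then drop nonspacing marks (Mn).

    Hypothetical helper for illustration; not the code from the thread.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Three distinct French words collapse into the same spelling.
words = ["pêche", "péché", "pêché"]
folded = {w: strip_marks(w) for w in words}

# Letters with a stroke have no canonical decomposition, so the
# stripper leaves them untouched: U+0141, U+00F8, U+0111.
stroke_letters = "\u0141\u00f8\u0111"
unchanged = strip_marks(stroke_letters)

# In Devanagari, the vowel sign U+0941 is a nonspacing mark, so the
# syllable "ku" (U+0915 U+0941) is silently reduced to the bare "ka".
devanagari_ku = "\u0915\u0941"
damaged = strip_marks(devanagari_ku)

if __name__ == "__main__":
    print(folded)
    print(unchanged == stroke_letters)
    print(damaged == "\u0915")
```

    The same function thus merges words that differ only in their accents,
    leaves letters such as those with a stroke as they were, and damages
    text in scripts where nonspacing marks carry vowels.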

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Richard Wordingham
    > Sent: Sunday, March 4, 2007 20:16
    > To: 'Unicode'
    > Subject: Re: A .Net Unicode Puzzle
    >
    > Kent Karlsson wrote on Sunday, March 04, 2007 1:49 PM
    > Subject: RE: A .Net Unicode Puzzle
    >
    >
    > > Converting to a non-Unicode codepage is one thing, removing all
    > > "accents" is a completely different thing.
    >
    > The code doesn't even succeed in that. I gave up listing in the IPA
    > extensions, but most of the following incorporate diacritics:
    >
    > 0243;LATIN CAPITAL LETTER B WITH STROKE
    > 0180;LATIN SMALL LETTER B WITH STROKE
    > 023B;LATIN CAPITAL LETTER C WITH STROKE
    > 023C;LATIN SMALL LETTER C WITH STROKE
    > 0110;LATIN CAPITAL LETTER D WITH STROKE
    > 0111;LATIN SMALL LETTER D WITH STROKE
    > 0246;LATIN CAPITAL LETTER E WITH STROKE
    > 0247;LATIN SMALL LETTER E WITH STROKE
    > 0126;LATIN CAPITAL LETTER H WITH STROKE
    > 0127;LATIN SMALL LETTER H WITH STROKE
    > 0197;LATIN CAPITAL LETTER I WITH STROKE
    > 0268;LATIN SMALL LETTER I WITH STROKE
    > 0248;LATIN CAPITAL LETTER J WITH STROKE
    > 0249;LATIN SMALL LETTER J WITH STROKE
    > 025F;LATIN SMALL LETTER DOTLESS J WITH STROKE
    > 0141;LATIN CAPITAL LETTER L WITH STROKE
    > 0142;LATIN SMALL LETTER L WITH STROKE
    > 00D8;LATIN CAPITAL LETTER O WITH STROKE
    > 00F8;LATIN SMALL LETTER O WITH STROKE
    > 024C;LATIN CAPITAL LETTER R WITH STROKE
    > 024D;LATIN SMALL LETTER R WITH STROKE
    > 0166;LATIN CAPITAL LETTER T WITH STROKE
    > 0167;LATIN SMALL LETTER T WITH STROKE
    > 023E;LATIN CAPITAL LETTER T WITH DIAGONAL STROKE
    > 024E;LATIN CAPITAL LETTER Y WITH STROKE
    > 024F;LATIN SMALL LETTER Y WITH STROKE
    > 01B5;LATIN CAPITAL LETTER Z WITH STROKE
    > 01B6;LATIN SMALL LETTER Z WITH STROKE
    > 019B;LATIN SMALL LETTER LAMBDA WITH STROKE
    >
    > 0189;LATIN CAPITAL LETTER AFRICAN D
    >
    > 0182;LATIN CAPITAL LETTER B WITH TOPBAR
    > 0183;LATIN SMALL LETTER B WITH TOPBAR
    > 018B;LATIN CAPITAL LETTER D WITH TOPBAR
    > 018C;LATIN SMALL LETTER D WITH TOPBAR
    >
    > 0181;LATIN CAPITAL LETTER B WITH HOOK
    > 0253;LATIN SMALL LETTER B WITH HOOK
    > 0187;LATIN CAPITAL LETTER C WITH HOOK
    > 0188;LATIN SMALL LETTER C WITH HOOK
    > 018A;LATIN CAPITAL LETTER D WITH HOOK
    > 0257;LATIN SMALL LETTER D WITH HOOK
    > 0256;LATIN SMALL LETTER D WITH TAIL
    > 025D;LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
    > 025A;LATIN SMALL LETTER SCHWA WITH HOOK
    > 0191;LATIN CAPITAL LETTER F WITH HOOK
    > 0192;LATIN SMALL LETTER F WITH HOOK
    > 0193;LATIN CAPITAL LETTER G WITH HOOK
    > 0260;LATIN SMALL LETTER G WITH HOOK
    > 0266;LATIN SMALL LETTER H WITH HOOK
    > 0198;LATIN CAPITAL LETTER K WITH HOOK
    > 0199;LATIN SMALL LETTER K WITH HOOK
    > 026D;LATIN SMALL LETTER L WITH RETROFLEX HOOK
    > 0273;LATIN SMALL LETTER N WITH RETROFLEX HOOK
    > 01A4;LATIN CAPITAL LETTER P WITH HOOK
    > 01A5;LATIN SMALL LETTER P WITH HOOK
    > 027B;LATIN SMALL LETTER TURNED R WITH HOOK
    > 01AC;LATIN CAPITAL LETTER T WITH HOOK
    > 01AD;LATIN SMALL LETTER T WITH HOOK
    > 01B2;LATIN CAPITAL LETTER V WITH HOOK
    > 01B3;LATIN CAPITAL LETTER Y WITH HOOK
    > 01B4;LATIN SMALL LETTER Y WITH HOOK
    > 01AB;LATIN SMALL LETTER T WITH PALATAL HOOK
    > 01AE;LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
    > 0224;LATIN CAPITAL LETTER Z WITH HOOK
    > 0225;LATIN SMALL LETTER Z WITH HOOK
    >
    > 023D;LATIN CAPITAL LETTER L WITH BAR
    > 019A;LATIN SMALL LETTER L WITH BAR
    > 019F;LATIN CAPITAL LETTER O WITH MIDDLE TILDE
    >
    > 0255;LATIN SMALL LETTER C WITH CURL
    > 0234;LATIN SMALL LETTER L WITH CURL
    > 0235;LATIN SMALL LETTER N WITH CURL
    > 0236;LATIN SMALL LETTER T WITH CURL
    >
    > 023F;LATIN SMALL LETTER S WITH SWASH TAIL
    > 0240;LATIN SMALL LETTER Z WITH SWASH TAIL
    > 0244;LATIN CAPITAL LETTER U BAR
    >
    > 024A;LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
    > 024B;LATIN SMALL LETTER Q WITH HOOK TAIL
    >
    > 026B;LATIN SMALL LETTER L WITH MIDDLE TILDE
    > 026C;LATIN SMALL LETTER L WITH BELT
    >
    > These two contain a 'subtractive diacritic':
    > 0131;LATIN SMALL LETTER DOTLESS I
    > 0237;LATIN SMALL LETTER DOTLESS J
    >
    > These are uncertain cases:
    > 00D0;LATIN CAPITAL LETTER ETH
    > 00F0;LATIN SMALL LETTER ETH
    > 0138;LATIN SMALL LETTER KRA
    >
    > This I think does not really contain a diacritic:
    > 0267;LATIN SMALL LETTER HENG WITH HOOK
    >
    > A diacritic-stripper might need to decompose these ligatures:
    > 00C6;LATIN CAPITAL LETTER AE
    > 00E6;LATIN SMALL LETTER AE
    > 0152;LATIN CAPITAL LIGATURE OE
    > 0153;LATIN SMALL LIGATURE OE
    > 0276;LATIN LETTER SMALL CAPITAL OE
    > 00DF;LATIN SMALL LETTER SHARP S
    >
    > I'm not sure of the decomposition of these - I rather suspect they should
    > decompose to 'gb' and 'kp'.
    > 0238;LATIN SMALL LETTER DB DIGRAPH
    > 0239;LATIN SMALL LETTER QP DIGRAPH
    >
    > I'm not sure whether one should decompose the likes of eng. If so, the
    > treatment could sensibly be context sensitive.
    >
    > > The latter is a false start to just about everything. Note also that
    > > many, if not most, non-Unicode codepages are NOT "accent free".
    >
    > And stripping non-spacing marks from Indic scripts is plain vandalism.
    >
    > A more useful exercise would be to do something like reducing strings to a
    > 'minimal' string that collates equal at the first level - ideally it
    > would, in some sense, preserve capitalisation.
    >
    > Richard.


