RE: Case mapping of dotless lowercase letters

From: Michael Everson (everson@evertype.com)
Date: Tue Dec 16 2003 - 11:21:07 EST


    At 16:48 +0100 2003-12-16, Philippe Verdy wrote:
    >Michael Everson wrote:
    >> At 11:03 +0100 2003-12-16, Philippe Verdy wrote:
    >> >Doug Ewell <dewell@adelphia.net> writes:
    >> > > > Wrong here: I have found occurrences of dotless lowercase i, used
    >> > > > instead of soft-dotted lowercase i, as base letters for diacritics
    >> > > > added above it (it was an acute accent...)
    >> > >
    >> > > Don't do that.
    >> >
    >> >What? This is VALID UNICODE to have texts coded like this.
    >>
    >> In Irish, it is INCORRECT to spell "físeán"
    >> 'video' with a DOTLESS I + COMBINING ACUTE. It is
    >> a spelling error, and will fail in
    >> spell-checking. The correct spelling is either I
    >> + COMBINING ACUTE or precomposed I WITH ACUTE.
    >
    >Spelling was not the issue there. Only Unicode validity.

    Apparently you should look up the word "valid".

    Any character can follow any other character and
    be "valid". Any combining character can be
    applied to any base character, regardless of
    script.
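A minimal sketch of the distinction drawn above, using Python's standard `unicodedata` module. It assumes nothing beyond the Unicode normalization data shipped with the interpreter; the helper name `is_canonically_equivalent` is made up for illustration. Both sequences are well-formed, but only the one built on the soft-dotted i composes to the precomposed letter, so the dotless variant of "físeán" is a different string as far as normalization is concerned.

```python
import unicodedata

DOTLESS_I = "\u0131"        # LATIN SMALL LETTER DOTLESS I
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
I_WITH_ACUTE = "\u00ed"     # LATIN SMALL LETTER I WITH ACUTE

# Both sequences are well-formed: any combining mark may follow any base.
dotted_seq = "i" + COMBINING_ACUTE
dotless_seq = DOTLESS_I + COMBINING_ACUTE

# NFC composes i + acute into the precomposed letter ...
nfc_dotted = unicodedata.normalize("NFC", dotted_seq)
# ... but leaves dotless i + acute untouched, since no precomposed
# character decomposes to that pair.
nfc_dotless = unicodedata.normalize("NFC", dotless_seq)


def is_canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


correct = "f" + I_WITH_ACUTE + "se\u00e1n"   # "físeán", precomposed
misspelt = "f" + dotless_seq + "se\u00e1n"   # dotless i + combining acute

print(nfc_dotted == I_WITH_ACUTE)
print(nfc_dotless == dotless_seq)
print(is_canonically_equivalent(correct, misspelt))
```

A spell-checker that normalizes its input would therefore still see the dotless form as a distinct, unknown word.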

    >> Texts which contain spelling errors. Or old IPA
    >> texts using any number of ad-hoc IPA font
    >> solutions. Those texts have to be transcoded to
    >> proper Unicode at some stage. What you suggest is
    >> Not Recommended.
    >
    >Not recommended but still valid (and actually used in Turkish as well!)

    Case folding in Turkish and Azeri is DIFFERENT
    from everywhere else and you have to have a local
    tailoring for it.
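A rough sketch of what such a tailoring might look like in Python, whose built-in `str.lower()` and `str.upper()` apply only the default, locale-independent mappings. The helpers `lower_tr` and `upper_tr` are hypothetical names invented here, not part of any library, and they cover only the basic i/I/ı/İ pairs; the special handling of a capital I followed by COMBINING DOT ABOVE is deliberately left out.

```python
CAPITAL_I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
SMALL_DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def lower_default(s: str) -> str:
    """Default (untailored) lowercasing as implemented by Python."""
    return s.lower()


def lower_tr(s: str) -> str:
    """Hypothetical Turkish/Azeri lowercasing: İ -> i and I -> ı."""
    # Map the two tailored letters first, then fall back to the default.
    s = s.replace(CAPITAL_I_DOT, "i").replace("I", SMALL_DOTLESS_I)
    return s.lower()


def upper_tr(s: str) -> str:
    """Hypothetical Turkish/Azeri uppercasing: i -> İ and ı -> I."""
    s = s.replace("i", CAPITAL_I_DOT).replace(SMALL_DOTLESS_I, "I")
    return s.upper()


print(lower_default("ISPARTA"))  # default mapping gives a dotted i
print(lower_tr("ISPARTA"))       # tailored mapping gives a dotless i
print(upper_tr("istanbul"))      # tailored mapping keeps the dot
```

The point of the sketch is only that the caller has to know the text is Turkish or Azeri before choosing `lower_tr` over `lower_default`; nothing in the characters themselves says so.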

    >used in some occasions because of defects in fonts that don't have a
    >precomposed glyph for letter i with the diacritic but have a separate glyph
    >for the combining diacritic and for the dotted and dotless letters i, or
    >that use renderers unable to remove the soft dot.

    What defects there are in FONTS without UNICODE CMAPS is of no concern to us.

    >The IPA-93 font is one such font, which allows good
    >typesetting, but which needs glyph processing to
    >select the appropriate base letter.

    It isn't a Unicode font, and so it doesn't
    matter. Data represented in it has to be
    transcoded to Unicode, and the font has to have
    the right thing in it.

    >My main issue, however, is with Turkish names found in environments where
    >language identification is not possible (for example a simple filename or a
    >locale-neutral database field or an international HTML form which requests
    >user names and uses them as case-insensitive identifiers); lowercase dotless
    >i does not work appropriately there.

    Oh well.

    >I think it is completely illogical to match together with case-insensitive
    >compares, the three letters:
    > LATIN SMALL LETTER I (dotted)
    > LATIN CAPITAL LETTER I (dotless)
    > LATIN CAPITAL LETTER I WITH DOT ABOVE
    >but not:
    > LATIN SMALL LETTER DOTLESS I
    >when using locale-neutral compares, given that the normative uppercase mapping
    >of this fourth letter is the second letter above.

    That is not what happens in locale-neutral comparisons, I believe.
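One way to check this is with Python's `str.casefold()`, which, as far as I know, implements the default (untailored) full case folding. The sketch below is an illustration under that assumption, and `caseless_match` is a hypothetical helper name. Under it, only small i and capital I fold together; capital I with dot above folds to i followed by COMBINING DOT ABOVE, and dotless i folds to itself.

```python
LETTERS = {
    "LATIN SMALL LETTER I": "i",
    "LATIN CAPITAL LETTER I": "I",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
}


def caseless_match(a: str, b: str) -> bool:
    """Hypothetical helper: locale-neutral caseless comparison."""
    return a.casefold() == b.casefold()


# Show the folded form of each letter as a list of code points.
for name, ch in LETTERS.items():
    folded = ch.casefold()
    print(name, [f"U+{ord(c):04X}" for c in folded])

print(caseless_match("i", "I"))        # small i vs capital I
print(caseless_match("i", "\u0130"))   # small i vs capital I with dot
print(caseless_match("I", "\u0131"))   # capital I vs dotless i
```

So the uppercase mapping of dotless i to capital I does not by itself make the two compare equal, because caseless comparison is defined by folding rather than by uppercasing.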

    -- 
    Michael Everson * Everson Typography * http://www.evertype.com
    

