RE: Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 16 2003 - 05:03:40 EST

Next message: Michael Everson: "Re: Stability of WG2 (was: Re: [OT] CJK -> CJC)"

Previous message: Christopher John Fynn: "Re: Stability of WG2"
In reply to: Doug Ewell: "Re: Case mapping of dotless lowercase letters"
Next in thread: Michael Everson: "RE: Case mapping of dotless lowercase letters"
Reply: Michael Everson: "RE: Case mapping of dotless lowercase letters"

Doug Ewell <dewell@adelphia.net> writes:
> > Wrong here: I have found occurrences of dotless lowercase i, used
> > instead of soft-dotted lowercase i, as base letters for diacritics
> > added above it (it was an acute accent...)
>
> Don't do that.

What? It is VALID UNICODE to have texts coded like this. The proposed
change for soft-dotted/dotless letters used with diacritics is still not in
the standard, and it only gives rendering hints so that both base letters
should render the same way, requiring an explicit dot when the
soft dot must be kept with the diacritic.

> > There were two sequences which looked identical when rendered,
> > and that were distinct after a case-folding comparison:
> >
> > (1) LATIN SMALL LETTER I, COMBINING ACUTE ACCENT
> > (2) LATIN SMALL LETTER DOTLESS I, COMBINING ACUTE ACCENT
> >
> > but were no longer distinct when converted to uppercase in a
> > locale-neutral environment not using the Turkic rules:
> >
> > (1') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
> > (2') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
>
> OK, so you want the default, locale-neutral case mapping tables to equate
> U+0069 with U+0131, right?

Yes. And I have good reasons for that, coming from the fact that the default
locale-neutral mapping tables already equate their uppercase versions, U+0049
and U+0130, by returning U+0069 for both of them.
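
The asymmetry between sequences (1) and (2) is easy to reproduce. Here is a
minimal sketch in Python, assuming that its built-in str.upper() and
str.casefold() follow the default locale-neutral full mappings (the variable
names are only illustrative):

```python
# Sequences (1) and (2): soft-dotted i and dotless i, each with an acute accent.
ACUTE = "\u0301"             # COMBINING ACUTE ACCENT
seq1 = "\u0069" + ACUTE      # LATIN SMALL LETTER I + acute
seq2 = "\u0131" + ACUTE      # LATIN SMALL LETTER DOTLESS I + acute

# Uppercasing without Turkic rules maps both base letters to U+0049.
same_upper = seq1.upper() == seq2.upper()

# Default case folding leaves U+0131 alone, so the folded forms differ.
same_fold = seq1.casefold() == seq2.casefold()

print(same_upper, same_fold)
```

If those assumptions hold, the two strings compare equal after uppercasing
but stay distinct after case folding, which is the inconsistency I am
complaining about.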

> This is close to being a spoofing problem, though. See TUS 4.0, page
> 141.

If you think this is a spoofing problem, then the existing locale-neutral
full case mapping of U+0130 is bogus and should not be U+0069....

> > The string (2) may have been produced to avoid displaying the dot
> > with some fonts that don't apply the soft-dotted rule when there's
> > an additional diacritic above...
>
> Don't do that. That's misusing the standard. The font should be fixed
> instead.

For whatever reason, encoded texts exist before correct fonts are used to
render them. So there do exist texts which use dotless lowercase i before
a diacritic above, simply because the author of the text did not want it to
be rendered with a superposed dot. These texts are clearly not Turkic (in
Turkish or Azeri, the dot of the soft-dotted i should have been displayed
with the diacritic above it, and the dotless i should have been used to
avoid it explicitly).

But this is not the only reason. I can give other examples which also have
security and filesystem impacts.

Suppose you have a database of user names or file names allowing
internationalized names coded along the recommended Unicode principles. But
these names are used in a way that makes it impossible to track the language
in which they were entered (file names, user names or address fields
in an entry form are such cases).

Now provide a facility that identifies and rejects duplicate
case-equivalents, using full mappings. Because you can't track the language,
you'll need to use the default locale-neutral full case mappings.

Now a Turkish user enters a name or address in an entry form, or creates
files with dotless lowercase i in their names, and later attempts to reenter
the case-equivalent (dotless) uppercase I. The system will not identify both
as being case equivalents, so it will accept both as if they were distinct.
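
To make the scenario concrete, here is a rough sketch in Python of such a
duplicate check. The register() helper and the dictionary-based registry are
hypothetical, and I am assuming that str.casefold() stands in for the default
locale-neutral full case folding:

```python
def register(registry, name):
    """Store name unless a case-equivalent entry already exists."""
    key = name.casefold()        # default folding, no Turkic tailoring
    if key in registry:
        return False             # rejected as a duplicate
    registry[key] = name
    return True

registry = {}
# A Turkish user first enters the lowercase form with dotless i (U+0131) ...
first = register(registry, "\u0131s\u0131")
# ... and later the form he considers its uppercase equivalent.
second = register(registry, "ISI")
# The soft-dotted spelling, however, collides with "ISI".
third = register(registry, "isi")

print(first, second, third)
```

Under those assumptions the first two names are both accepted as distinct,
while the third is rejected, which is the opposite of what the Turkish user
expects.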

The Turkish user or the system then attempts to list the files or database
fields matching some pattern like "i*" with the case-insensitive option, to
count the number of occurrences of the names containing a (soft-)dotted i
(or I). He will get all names containing one of three of the four letters
(dotted or dotless, lowercase or uppercase), and not the fourth one.
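
A small sketch in Python of what I mean, assuming the matcher compares the
default case-folded form of each name against the folded pattern prefix
(this is only an illustration of that assumption, not the behaviour of any
particular regular expression engine):

```python
# The four i-like letters, used here as one-letter names.
names = [
    "\u0069",   # LATIN SMALL LETTER I
    "\u0049",   # LATIN CAPITAL LETTER I
    "\u0130",   # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131",   # LATIN SMALL LETTER DOTLESS I
]

# Case-insensitive match for "i*": fold each name and test the prefix.
prefix = "i".casefold()
matched = [n for n in names if n.casefold().startswith(prefix)]
missed = [n for n in names if n not in matched]

print([hex(ord(n)) for n in matched], [hex(ord(n)) for n in missed])
```

With default folding, U+0069, U+0049 and U+0130 all end up starting with
U+0069, and only U+0131 is left out.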



This archive was generated by hypermail 2.1.5 : Tue Dec 16 2003 - 05:53:29 EST