From: Julian Bradfield (jcb+unicode@inf.ed.ac.uk)
Date: Tue Dec 15 2009 - 04:31:56 CST
On 2009-12-14, Michael Everson <everson@evertype.com> wrote:
> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
>>[...]
> Evidently I was not using [identify] in a technical sense.
The technical sense is also the normal English sense. Things are
"identical" if they're exactly the same.
>> What you presumably mean is "the space in which filenames live
>> *ought* to be the set of utf-8 strings quotiented by canonical
>> equivalence" (so that two canonically equivalent strings are
>> representatives of one and the same filename).
>
> No, that's not what I meant.
>
> I meant that é 00E9 and é 0065 0301 are the same platonic entity (acute
> e) in an intrinsic sense, whereas both are different from a Cyrillic
> lookalike, е́ 0435 0301.
>
> *That* kind of identity.
How does what you said differ from what I said, except that I said it
precisely? Your "platonic entity" is my "equivalence
class of UTF-8 strings under canonical equivalence". That defines an
identity on the "platonic entities", NOT on the UTF-8 strings.
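To make the distinction concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name canonical_key and the choice of NFC as the class representative are assumptions made for illustration; any normalization form that respects canonical equivalence would serve the same purpose.

```python
import unicodedata

precomposed = "\u00e9"      # é as the single code point U+00E9
decomposed = "e\u0301"      # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "\u0435\u0301"   # Cyrillic U+0435 followed by U+0301


def canonical_key(s: str) -> str:
    """Return one representative (here NFC, an illustrative choice) of the
    canonical-equivalence class that s belongs to."""
    return unicodedata.normalize("NFC", s)


# The two Latin spellings are different strings and different UTF-8 bytes...
assert precomposed != decomposed
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# ...but they fall in the same equivalence class.
assert canonical_key(precomposed) == canonical_key(decomposed)

# The Cyrillic lookalike is in a different class.
assert canonical_key(cyrillic) != canonical_key(precomposed)
```

The identity holds between the keys, that is, between the classes; the underlying strings remain distinct.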
As Asmus has pointed out, the question then is, do you ask users to
understand this, and magically know that two apparently different
strings are actually the same?
If they're Windows users, they're used to this, because of the mess
with case of filenames in FAT, but if they're Unix users, they're not
at all used to it.
On the other hand, the complexities of dealing with Unicode
equivalence are in a whole different league from dealing with simple case
collapsing.
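A rough Python illustration of the difference in complexity follows. The function ascii_case_key is a deliberately simplified stand-in for case collapsing (it is not how FAT actually handles case), and nfd_key is a hypothetical helper name. The point is only that the ASCII mapping works one character at a time, whereas canonical equivalence can reorder combining marks and change the length of the string.

```python
import unicodedata


def ascii_case_key(name: str) -> str:
    # Simplified case collapsing: map A-Z to a-z, character by character.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def nfd_key(name: str) -> str:
    # Canonical decomposition followed by canonical reordering of marks.
    return unicodedata.normalize("NFD", name)


# Case collapsing never changes length or order.
assert ascii_case_key("README.TXT") == "readme.txt"
assert len(ascii_case_key("README.TXT")) == len("README.TXT")

# Two orderings of the same marks on q: dot above (U+0307), dot below (U+0323).
a = "q\u0307\u0323"
b = "q\u0323\u0307"
assert a != b
assert nfd_key(a) == nfd_key(b)   # canonically equivalent after reordering

# Composition can also change the number of code points.
assert len("e\u0301") == 2
assert len(unicodedata.normalize("NFC", "e\u0301")) == 1
```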
I don't know what the right answer is - except to agree that it ought
to be possible for a file system to be marked as only allowing UTF-8
filenames, in some normalized form.
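As a sketch of what such a restriction might check, the following Python function accepts a raw filename only if it is valid UTF-8 and already in a chosen normalization form. The function name is_acceptable_filename and the default of NFC are assumptions for illustration; no existing file system is claimed to work this way.

```python
import unicodedata


def is_acceptable_filename(raw: bytes, form: str = "NFC") -> bool:
    """Hypothetical policy check: accept only byte strings that decode as
    strict UTF-8 and are unchanged by normalization to the given form."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False  # not UTF-8 at all
    return unicodedata.normalize(form, text) == text


# Precomposed é is already NFC, so it passes.
assert is_acceptable_filename("\u00e9.txt".encode("utf-8"))

# The decomposed spelling is valid UTF-8 but not NFC, so it is rejected.
assert not is_acceptable_filename("e\u0301.txt".encode("utf-8"))

# A Latin-1 encoded é (byte 0xE9) is not valid UTF-8.
assert not is_acceptable_filename(b"\xe9.txt")
```

Rejecting non-normalized names, rather than silently normalizing them, is only one possible design; it avoids handing back a name that differs from the one the user supplied.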
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.