From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Dec 14 2009 - 14:29:19 CST
Michael Everson wrote:
> On 14 Dec 2009, at 18:55, Peter Edberg wrote:
>
>>>> And should an OS treat "My file" and "My file" as the same file
>>>> name?
>>>
>>> This problem is with us already (on Apple systems, of all things).
>>> MacOS X decomposes Cyrillic Й and Ё in file names and treats
>>> файл and файл as the same file name
>>
>> Which seems appropriate, since they are canonically equivalent.
>
> I agree. Canonical equivalence is identity.
First, й (U+0439) and й (U+0438 U+0306) are not canonically equivalent, or
even compatibility equivalent. The character й (U+0439) has no
decomposition. This may be a design flaw, but anyway it’s how things are
defined in Unicode.
Second, canonical equivalence is not identity. For example, é (U+00E9) and
é (U+0065 U+0301) are not identical: the first one is one code point, the
second one is two code points. (Some programs, maybe even the one I’m using
now, might silently convert U+0065 U+0301 to U+00E9. This by no means proves
they’re identical, any more than other silent conversions make e.g.
hyphen-minus identical to en dash.)
(The letter Ё is comparable to the é case: it has canonical decomposition.
But it is still distinct from its decomposition.)
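
To make the é and Ё cases concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name canonically_equivalent is only an illustration, not part of any library. It shows that the two spellings of é differ as code point sequences yet compare equal once both are normalized.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# As strings (sequences of code points) the two are different.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 1 2

# They are nevertheless canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))  # True

# Ё (U+0401) has a canonical decomposition, like é.
print(unicodedata.decomposition("\u0401"))  # 0415 0308
```

Whether a program compares the raw strings or the normalized ones is its own decision; both comparisons are available.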
Canonical equivalence is a relation between sequences of code points.
Programs may ignore the distinction between canonically equivalent sequences,
but they may also make any distinction they like between them, and they may
even recognize just one of several canonically equivalent sequences. This is
not uncommon in older software, which may support e.g. é as a precomposed
character but not even recognize the combining acute accent.
Thus, although файл and файл are definitely different strings, programs may
and often do treat them as equivalent or, you might say, “identical” for
some definition of “identity”. But then it’s a definition external to
Unicode. Similarly, a file system might treat, say, “My file” and “Myfile”
and “MYFILE” and “My%20file” all as “identical” in the sense of naming the
same file, even though they are of course different as strings.
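
As a rough illustration of such an external definition, the Python sketch below invents a lookup key for file names. The function name_key and its rules (percent-decoding, normalization, case folding, dropping spaces) are assumptions made up for this example; no real file system is claimed to work this way.

```python
import unicodedata
from urllib.parse import unquote

def name_key(name: str) -> str:
    """Hypothetical lookup key; not the rule of any actual file system."""
    decoded = unquote(name)                              # "My%20file" -> "My file"
    normalized = unicodedata.normalize("NFD", decoded)   # fold canonical equivalents
    return normalized.casefold().replace(" ", "")        # ignore case and spaces

names = ["My file", "Myfile", "MYFILE", "My%20file"]

# Four different strings ...
print(len(set(names)))                  # 4
# ... but a single key, so this made-up system would see one file.
print({name_key(n) for n in names})     # {'myfile'}
```

The notion of sameness here lives entirely in name_key; the strings themselves stay distinct.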
> So long as fonts display
> the pre-composed glyph there should be no problem.
It’s mostly confusing to consider display issues here. Besides, you surely
know that fonts don’t do such things—rendering software might decide to
render a character sequence as a ligature, but that’s a different issue.
-- Yucca, http://www.cs.tut.fi/~jkorpela/