From: karl williamson (public@khwilliamson.com)
Date: Mon Mar 15 2010 - 21:15:00 CST
There are a couple of things going on here. Keep in mind that my
perspective is that of someone who is trying to implement what Unicode says.
First, part of the essence of a medial hyphen is that it not be adjacent
to white space. Therefore, to determine whether a hyphen is medial, one
must check for adjacent white space. But in the same sentence
that Unicode says that hyphens which are medial are to be ignored,
Unicode says that white space is also to be ignored. It is impossible
to both ignore and not ignore white space. The number of
implementations that do what Unicode says here is and will always be zero.
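To make the circularity concrete, here is a minimal Python sketch. The helper name medial_hyphen_positions is hypothetical, and "medial" is taken to mean only "not at either end and not adjacent to white space", as described above; this is an assumption for illustration, not Unicode's wording.

```python
def medial_hyphen_positions(name: str) -> list[int]:
    """Return indexes of hyphens that have a non-space character on both sides.

    Hypothetical helper: 'medial' here means only 'not at either end and
    not adjacent to white space'.
    """
    return [
        i
        for i, ch in enumerate(name)
        if ch == "-"
        and 0 < i < len(name) - 1
        and not name[i - 1].isspace()
        and not name[i + 1].isspace()
    ]


original = "TIBETAN MARK TSA -PHRU"
stripped = original.replace(" ", "")  # what "ignore white space" does first

print(medial_hyphen_positions(original))  # [] : the hyphen follows a space
print(medial_hyphen_positions(stripped))  # [14] : the same hyphen now looks medial
```

The answer depends on whether white space was discarded before or after the check, which is why the two instructions cannot both be followed.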
As an aside, it has been my experience that ignoring all white space
usually leads to unintended negative consequences. The 1966 ANSI
Fortran standard suffered from this (I don't know about later versions),
and it led to problems, with economic consequences. It is a pity that
this lesson did not get passed on to later generations. I doubt that
Unicode really wants 'S c r i p t' to mean 'Script', but that's what it
says. It would have been better, in my opinion, for it to say that
a run of white space is equivalent to a single white space character.
But it's probably too late for that, and I haven't thought of all the
implications either. Perhaps the simplest thing would be to change the
standard to say that white space not adjacent to hyphens is to be ignored.
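The three white space policies discussed above can be compared with a short Python sketch. All three function names are hypothetical, and the regular expressions are one possible reading of each rule, not text from TR18 or UAX44.

```python
import re


def ignore_all_whitespace(name: str) -> str:
    """Current wording: every white space character is dropped."""
    return re.sub(r"\s+", "", name).lower()


def collapse_whitespace(name: str) -> str:
    """Alternative: a run of white space counts as one space."""
    return re.sub(r"\s+", " ", name.strip()).lower()


def ignore_whitespace_not_by_hyphen(name: str) -> str:
    """Alternative: drop white space unless the run touches a hyphen."""
    return re.sub(r"(?<![-\s])\s+(?![-\s])", "", name).lower()


# Ignoring all white space makes 'S c r i p t' the same as 'Script'.
print(ignore_all_whitespace("S c r i p t") == ignore_all_whitespace("Script"))  # True
# Collapsing runs keeps them distinct.
print(collapse_whitespace("S c r i p t") == collapse_whitespace("Script"))      # False
# Keeping white space next to hyphens preserves the non-medial hyphen.
print(ignore_whitespace_not_by_hyphen("TIBETAN MARK TSA -PHRU"))  # tibetanmarktsa -phru
```

Under the third policy the space before "-PHRU" survives, so a later medial-hyphen check still has the information it needs.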
Asmus Freytag wrote:
> On 3/11/2010 10:12 PM, karl williamson wrote:
>> Andrew West wrote:
>>> On 11 March 2010 20:32, karl williamson <public@khwilliamson.com> wrote:
>>>> I think it is actually better to do the following:
>>>> 1. Remove all white space
>>>> 2. Collapse multiple hyphens in a row into one
>>>> 3. Lowercase
>>>> 4. If the result is one of the three problematic ones, we are done.
>>>> 5. Remove all hyphens
>>>>
>>>> Then, if the strings are the same after the transforms, they match.
>>>
>>> No, then "TIBETAN MARK TSA PHRU" would match "TIBETAN MARK TSA -PHRU",
>>> which may be what the user intended, but it is not what they asked
>>> for, and would be as bad as matching e.g. "PERCENT IGN" and "PERCENT
>>> SIGN".
This is a false analogy because Unicode has never said that 'S' is to be
ignored in loose matching. Unicode still says (in TR18) that all
hyphens (except in 3 cases) are to be ignored. If hyphens can be
significant parts of character names, Unicode should never have said
they effectively aren't.
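For reference, the five-step procedure quoted above can be sketched in Python as follows. The names loose_key, loose_match and PROBLEMATIC are hypothetical. The three entries in PROBLEMATIC are an assumption: they are the hyphenated names I believe TR18 singles out, written in their step-3 form, and should be checked against TR18 before being relied on.

```python
import re

# Assumption: normalized (steps 1-3) forms of the three hyphenated names
# believed to be the TR18 exceptions. Verify against TR18 before use.
PROBLEMATIC = frozenset({
    "tibetanletter-a",
    "tibetansubjoinedletter-a",
    "hanguljungseongo-e",
})


def loose_key(name: str) -> str:
    key = re.sub(r"\s+", "", name)    # 1. remove all white space
    key = re.sub(r"-{2,}", "-", key)  # 2. collapse multiple hyphens into one
    key = key.lower()                 # 3. lowercase
    if key in PROBLEMATIC:            # 4. a problematic name keeps its hyphen
        return key
    return key.replace("-", "")       # 5. remove all hyphens


def loose_match(a: str, b: str) -> bool:
    """Two names match if they are identical after the transforms."""
    return loose_key(a) == loose_key(b)


print(loose_match("TIBETAN MARK TSA PHRU", "TIBETAN MARK TSA -PHRU"))  # True
print(loose_match("TIBETAN LETTER A", "TIBETAN LETTER -A"))            # False
print(loose_match("PERCENT IGN", "PERCENT SIGN"))                      # False
```

The first result is the match Andrew objects to; the last shows that dropping a letter is never treated as equivalent, since only white space, case and hyphens are disregarded.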
>>>
>>> Andrew
>>>
>>
>> OK, but that is a change from what TR18 says: "names should use a
>> loose match, disregarding case, spaces and hyphen" except for the
>> three problematic situations it mentions. There is no character
>> TIBETAN MARK TSA PHRU,
> But it's a name that could be added to the standard at any moment,
> because it would be formally distinct from any existing
>
> TIBETAN MARK TSA -PHRU
I find this statement very disconcerting, because it means that I cannot
trust what Unicode says. For almost 7 years and 4 or so versions, TR18
has said that all hyphens (except for the 3 cases) can be
ignored. Now you're saying that Unicode feels free to add more such
cases, thus causing implementations that relied on Unicode's word to
fail. The failure will probably be subtle, so it won't be immediately
apparent.
Yes, it's true that backward compatibility cannot always be guaranteed;
but it should always be a goal, and the reasons for breaking it should
be compelling.
Unicode could choose names that don't violate TR18. Choosing ones that
do shows disrespect to your customers, in my opinion.
That said, I can also say that Perl 5 has not implemented loose matching
for character names, so it will not be affected by any immediate changes
to the rules. I also know that no one has strictly implemented Unicode's
definition of loose matching because it is impossible to do so. But I
don't know what any implementations actually have done.
>
> so you can't simply match according to what might be intended, because
> then, if such a character is later added, everything fails.
>> and I thought the whole point of loose matching is to follow the
>> intent of the user even in the face of certain missing or extraneous
>> punctuation and spacing characters, so even though it is not exactly
>> what they asked for, it is close enough by the traditional definition.
>>
>> I realize that TR18 is not an official part of the standard, and that
>> TR44 is now UAX44, which is. Therefore, this is a change in the
>> standard that I don't believe was listed as a delta.
>>
>>
>