On 4/Jun/2015 14:34 PM, Markus Scherer wrote:
>
> Looks all wrong to me.
>
Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your
points below.
> You can't use simple regular expressions to find word boundaries. That's
> why we have UAX #29.
>
And UAX #29 doesn't work for words which begin or end with apostrophes,
whether represented by U+0027 or U+2019. It erroneously thinks there's a
word boundary between the apostrophe and the rest of the word.
But UAX #29 *would* work if the apostrophes were represented by U+02BC,
which is what I'm suggesting.
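A rough sketch of the effect, assuming Python's built-in `re` module as a stand-in. This is not an implementation of UAX #29: `\w+` is cruder (it also splits at mid-word apostrophes, which UAX #29 does not), and the helper name `words` is made up for this illustration. It only shows how a letter-class apostrophe stays attached at the edge of a word while a punctuation-class one is dropped.

```python
import re

def words(text):
    # Hypothetical helper: split text into runs of "word" characters.
    # In Python 3, \w on str matches Unicode letters, digits and underscore.
    return re.findall(r"\w+", text)

# Leading apostrophe encoded three different ways.
print(words("'tis"))        # U+0027 APOSTROPHE            -> ['tis']
print(words("\u2019tis"))   # U+2019 RIGHT SINGLE QUOTATION MARK -> ['tis']
print(words("\u02bctis"))   # U+02BC MODIFIER LETTER APOSTROPHE  -> ['ʼtis']

# Trailing apostrophe: only U+02BC stays part of the word.
print(words("dogs\u02bc bowls"))  # -> ['dogsʼ', 'bowls']
```

With U+0027 or U+2019 the apostrophe is lost from the token; with U+02BC it is kept, because U+02BC is a letter as far as the regex engine is concerned.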
> Confusion between apostrophe and quoting -- blame the scribe who came up
> with the ambiguous use, not the people who gave it a number.
>
I'm not trying to blame anyone. I'm trying to fix the problem.
I know this problem has a long history.
> English is taught as that squiggle being punctuation, not a letter.
>
I think we need to make a distinction between the colloquial usage of the word
"punctuation" and the Unicode general category "punctuation", which has
specific technical implications.
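For concreteness, the general categories of the three apostrophe candidates can be looked up with Python's standard `unicodedata` module. This is a small sketch; the variable names `candidates` and `categories` are chosen for illustration only.

```python
import unicodedata

# The three characters under discussion.
candidates = ["\u0027", "\u2019", "\u02bc"]

# Map each code point to its Unicode general category.
categories = {f"U+{ord(c):04X}": unicodedata.category(c) for c in candidates}

for c in candidates:
    cp = f"U+{ord(c):04X}"
    print(cp, categories[cp], unicodedata.name(c))
# U+0027 Po APOSTROPHE
# U+2019 Pf RIGHT SINGLE QUOTATION MARK
# U+02BC Lm MODIFIER LETTER APOSTROPHE
```

`Po` and `Pf` are punctuation categories; `Lm` (modifier letter) is a letter category, which is what software keys on when it decides whether a character can be part of a word.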
I somewhat wish that Unicode had a separate category for "Things that look
like punctuation but behave like letters", which might clear up this
taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF
RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which is
actually modifiers, into that category too.) But we don't. And the English
apostrophe behaves like a letter, regardless of what your primary school
teacher might have told you, so with the options available in Unicode, it
needs to be classed as a letter.
> "don’t" is a contraction of two words, it is not one word.
>
This is utter nonsense. Should my spell-checker recognise "hasn't" as a
valid word? Or should it consider "hasn't" to be the word "hasn" followed
by the word "t", and then flag both of them as spelling errors?
Is "fo'c'sle" the three separate words "fo", "c", and "sle"?
The idea that words with apostrophes aren't valid words is a regrettable
myth about English. It has repeatedly made the apostrophe an afterthought
in computing, which is how we end up in situations like this one.
> If anything, Unicode might have made a mistake in encoding two of these
> that look identical. How are normal users supposed to find both U+2019 and
> U+02BC on their keyboards, and how are they supposed to deal with
> incorrect usage?
>
Yeah, and there are fonts where I can't tell the difference between capital
I and lower-case l. But my spell-checker will underline a word where I
erroneously use an I instead of an l, and I imagine spell-checkers of the
future could underline a word where I erroneously use a closing quote
instead of an apostrophe, or vice versa.
There are other possible solutions too, but I don't want to get into a
discussion about UI design. I'll leave that to UI designers.
- Ted
Received on Wed Jun 10 2015 - 17:12:11 CDT