Another take on the English apostrophe in Unicode
billposer2 at gmail.com
Thu Jun 11 12:47:39 CDT 2015
To add a factor that I think hasn't been mentioned, there are languages in
which apostrophe is used both as a letter by itself and as part of a
complex letter. Most of the native languages of British Columbia write
glottalized consonants as C+', e.g. <t'> for an ejective alveolar stop, and
many use apostrophe by itself for the glottal stop. (Another common
convention, which produces other difficulties, is to use the number <7> for
On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy <tclancy at mozilla.com> wrote:
> On 4/Jun/2015 14:34 PM, Markus Scherer wrote:
>> Looks all wrong to me.
> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your
> points below.
>> You can't use simple regular expressions to find word boundaries. That's
>> why we have UAX #29.
> And UAX #29 doesn't work for words which begin or end with apostrophes,
> whether represented by U+0027 or U+2019. It erroneously thinks there's a
> word boundary between the apostrophe and the rest of the word.
> But UAX #29 *would* work if the apostrophes were represented by U+02BC,
> which is what I'm suggesting.
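As a rough illustration of the point quoted above: Python's standard library has no UAX #29 segmenter, so the sketch below uses the regular-expression class `\w` as a stand-in. This is an assumption made for illustration only, and it is not the UAX #29 algorithm. It does show the same category-driven effect for leading and trailing apostrophes. The sample phrase and the names `rough_words` and `tokenize_with` are made up for this example.

```python
import re

# Three code points that can all render as an apostrophe-like mark.
APOSTROPHES = {
    "U+0027 APOSTROPHE": "\u0027",
    "U+2019 RIGHT SINGLE QUOTATION MARK": "\u2019",
    "U+02BC MODIFIER LETTER APOSTROPHE": "\u02bc",
}


def rough_words(text):
    """Return maximal runs of \\w characters (a stand-in, not UAX #29)."""
    return re.findall(r"\w+", text)


def tokenize_with(mark):
    # Hypothetical sample: one word starting and one word ending with the mark.
    sample = f"{mark}tis the dogs{mark} bowl"
    return rough_words(sample)


for name, mark in APOSTROPHES.items():
    print(name, tokenize_with(mark))
```

With U+0027 or U+2019 the mark is dropped from the tokens, while with U+02BC it stays attached to "tis" and "dogs", because `\w` treats a modifier letter as a word character.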
>> Confusion between apostrophe and quoting -- blame the scribe who came up
>> with the ambiguous use, not the people who gave it a number.
> I'm not trying to blame anyone. I'm trying to fix the problem.
> I know this problem has a long history.
>> English is taught as that squiggle being punctuation, not a letter.
> I think we need to make a distinction between the colloquial usage of the
> word "punctuation" and the Unicode general category "punctuation" which has
> specific technical implications.
> I somewhat wish that Unicode had a separate category for "Things that look
> like punctuation but behave like letters", which might clear up this
> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF
> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are
> actually modifiers, into that category too.) But we don't. And the English
> apostrophe behaves like a letter, regardless of what your primary school
> teacher might have told you, so with the options available in Unicode, it
> needs to be classed as a letter.
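The general categories discussed in the quoted paragraph can be inspected with Python's `unicodedata` module. This is only a small sketch: the list of code points is chosen to match the ones named in this thread, and `describe` is a helper name invented here.

```python
import unicodedata

# Code points mentioned in the thread.
CANDIDATES = ["\u0027", "\u2019", "\u02bc", "\u02be", "\u02bf"]


def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

U+0027 and U+2019 report punctuation categories (`Po` and `Pf`), while U+02BC, U+02BE and U+02BF report `Lm` (modifier letter), which is the technical sense of "letter" at issue here.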
>> "don’t" is a contraction of two words, it is not one word.
> This is utter nonsense. Should my spell-checker recognise "hasn't" as a
> valid word? Or should it consider "hasn't" to be the word "hasn" followed
> by the word "t", and then flag both of them as spelling errors?
> Is "fo'c'sle" the three separate words "fo", "c", and "sle"?
> The idea that words with apostrophes aren't valid words is a regrettable
> myth that exists in English, which has repeatedly led to the apostrophe
> being an afterthought in computing, leading to situations like this one.
>> If anything, Unicode might have made a mistake in encoding two of these
>> that look identical. How are normal users supposed to find both U+2019
>> and U+02BC on their keyboards, and how are they supposed to deal with
> Yeah, and there are fonts where I can't tell the difference between
> capital I and lower-case l. But my spell-checker will underline a word
> where I erroneously use an I instead of an l, and I imagine spell-checkers
> of the future could underline a word where I erroneously use a closing
> quote instead of an apostrophe, or vice versa.
> There are other possible solutions too, but I don't want to get into a
> discussion about UI design. I'll leave that to UI designers.
> - Ted