Re: PRI #186: Word-Joining Hyphen vs RIGHT SINGLE QUOTATION MARK from Philippe Verdy on 2011-07-04 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 5 Jul 2011 06:38:44 +0200

2011/7/4 Per Starbäck <starback_at_stp.lingfil.uu.se>:
> At http://www.unicode.org/review/pri186/ is a suggestion that U+2011
> NON-BREAKING HYPHEN should be given the word-break property MidLetter,
> one reason being that some languages use a hyphen character between
> syllables within a word where word breaking, such as by word-selection
> or move-to-next-word commands, should ignore these hyphens.
>
> # The advantage of making this change is that U+2011 NON-BREAKING HYPHEN
> # could be used in orthographies that contain interior hyphens. This
> # would avoid a requirement to encode yet another confusable
> # hyphen/dash/minus character to the over-a-dozen already in Unicode.
>
> The implication is that the alternative to the suggestion is to add a
> new character. I don’t see such a requirement! Yes, it’s sometimes hard
> to know where word boundaries are, and Unicode certainly helps, but that
> doesn’t mean the characters on their own have to completely solve that
> problem. Knowledge about the language being used can also be useful, for
> example.
>
> Compare this with RIGHT SINGLE QUOTATION MARK, used as quotation mark and
> apostrophe such that extra knowledge can be needed to know where the
> word divisions are:
>
> ’Tis just a highfalutin’ idea, reminding me of that ‘sublime
> masterwork’ L’Étranger that I don’t approve of.
>
> For instance a mark-word operation on "highfalutin’" should ideally
> include the apostrophe but not on "masterwork".
>
> It would help if the quotation mark and the apostrophe were seen as
> different characters here, even though they look the same, but for
> good reasons they are seen as the same character in Unicode. And
> certainly no one is suggesting different "characters" for joining
> and splitting apostrophes (using terminology from
> http://unicode.org/mail-arch/unicode-ml/y2002-m08/att-0428/01-cimaUTR29.html
> ).

Well, if you really need to disambiguate apostrophes that separate
words from those that occur within words, you can treat separating
apostrophes as if they were followed by a space; if you need to encode
that, a ZERO WIDTH SPACE could be used in French and Italian, for the
cases where it should separate words. It should not disturb spell
checkers, which would still detect French elided words like "l’".

Word breaking is anyway not as simple as you think, and the simple
character properties only perform very basic breaking. It works with
informal English words like ’tis occurring at the beginning of
sentences, and most often with "highfalutin’" as well: the apostrophe
is kept as part of the word and not as a trailing quote because
there's no matching starting quote in the parsing context (such a long
parsing context, however, is not handled in UAX#29, which only focuses
on very limited lengths, but it can certainly be implemented in any
good-enough word breaker).
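
To illustrate the kind of longer parsing context meant here, a minimal
sketch in Python follows. It is not part of UAX#29; the function name
and the "no unmatched opening quote means apostrophe" rule are
assumptions made for this example only, and a real word breaker would
need more than this.

```python
LSQ = "\u2018"  # LEFT SINGLE QUOTATION MARK
RSQ = "\u2019"  # RIGHT SINGLE QUOTATION MARK, also the curly apostrophe


def classify_trailing_marks(text):
    """Classify each U+2019 that ends a word (letter before, no letter after).

    Hypothetical heuristic: if an opening U+2018 is still unmatched, the
    mark closes that quotation; otherwise it is a word-final apostrophe.
    Marks in other positions (word-initial, word-internal) are skipped.
    """
    results = []
    open_quotes = 0  # number of U+2018 seen and not yet closed
    for i, ch in enumerate(text):
        if ch == LSQ:
            open_quotes += 1
        elif ch == RSQ:
            prev_is_letter = i > 0 and text[i - 1].isalpha()
            next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
            if prev_is_letter and not next_is_letter:
                if open_quotes > 0:
                    open_quotes -= 1
                    results.append((i, "quote"))
                else:
                    results.append((i, "apostrophe"))
    return results


sample = (
    "\u2019Tis just a highfalutin\u2019 idea, reminding me of that "
    "\u2018sublime masterwork\u2019 L\u2019\u00c9tranger that I don\u2019t approve of."
)
for index, kind in classify_trailing_marks(sample):
    print(index, kind)
```

On the sample sentence, the mark after "highfalutin" has no pending
opening quote and is kept with the word, while the one after
"masterwork" closes the quotation opened before "sublime".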

In fact in French, the apostrophe almost always binds to the letter on
its left and not to the letter on its right, with very few exceptions
(like "aujourd’hui", which used to be 4 words "au", "jour", "d’",
"hui", but the last word has completely lost its usage except in that
expression, which has been glued together and has long been treated as
a single word).

But the solution described in optional rule WB5a "Break between
apostrophe and vowels (French, Italian)" of UAX#29:

apostrophe ÷ vowels

is probably simpler and does not require such zero-width spaces to
work correctly, and a word-breaker or spell checker will easily spot
the exceptions, that is, the cases where a properly recognized
apostrophe (not a quotation mark) binds or does not bind to the
letter(s) on the left and/or right.

(Note that the definition of "vowels" should be better than what is
given in your cited email archive: vowels with accents, and "h",
should be included as well in French and Italian. See: "l’été",
"l’âtre", "n’être", "ç’aura", "d’à", "d’heures", "l’œuf", "l’île",
"d’Ô"... with two words in each of these French cases; conversely,
there is apparently no obvious case in French of a word-trailing
elision apostrophe before the French vowels [æÆèÈùÙÿŸ], which usually
don't start a French word).
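
As a rough sketch of that rule with the broader vowel set, assuming
Python: the VOWELS string, the EXCEPTIONS set and the function name
below are hypothetical choices for illustration, not definitions taken
from UAX#29, and the exception list would be much longer in practice.

```python
APOSTROPHES = "'\u2019"  # ASCII apostrophe and U+2019

# Assumed vowel set: plain and accented vowels, ligatures, plus "h".
VOWELS = "aeiouyh\u00e0\u00e2\u00e4\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc\u00ff\u0153\u00e6"

# Words where the apostrophe does not separate two words (illustrative only).
EXCEPTIONS = {"aujourd\u2019hui", "aujourd'hui"}


def split_elisions(token):
    """Split a token after each apostrophe that is followed by a vowel or "h".

    This mimics "apostrophe ÷ vowels": the apostrophe stays attached to
    the letter(s) on its left, and the break falls just after it.
    """
    if token.lower() in EXCEPTIONS:
        return [token]
    parts = []
    start = 0
    for i, ch in enumerate(token):
        if (
            ch in APOSTROPHES
            and i > 0
            and token[i - 1].isalpha()
            and i + 1 < len(token)
            and token[i + 1].lower() in VOWELS
        ):
            parts.append(token[start:i + 1])
            start = i + 1
    parts.append(token[start:])
    return parts


for word in ["l\u2019\u00e9t\u00e9", "d\u2019heures", "d\u2019\u00d4", "aujourd\u2019hui"]:
    print(word, split_elisions(word))
```

The break is only a first pass: anything the vowel test gets wrong has
to be caught by the exception list or by a later dictionary lookup.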

In addition, an apostrophe should almost never occur at the beginning
of a word, or at the end of a word without any letter just after it,
in French (unlike English, where this is much more frequent in less
formal language). This means that it is easy in French to know if it
represents a trailing quotation mark or not.

Moreover, if it's used as a quotation mark, it can only be a trailing
quotation mark, not a leading quotation mark (those single quotation
marks are not favored in French typography). So this offers good hints
for the curly apostrophe U+2019.
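
These hints can be turned into a very small classifier. The sketch
below, in Python, only encodes the two observations above for French
text (a letter right after U+2019 means elision apostrophe, anything
else means trailing quotation mark); the function name and labels are
invented for the example, and the rare exceptions are ignored.

```python
RSQ = "\u2019"  # RIGHT SINGLE QUOTATION MARK, also the curly apostrophe


def classify_french_u2019(text):
    """Label every U+2019 in French text using only the next character.

    Assumption: in French, an apostrophe is nearly always followed by a
    letter, so a U+2019 not followed by a letter is taken to be a
    trailing quotation mark.
    """
    labels = []
    for i, ch in enumerate(text):
        if ch != RSQ:
            continue
        followed_by_letter = i + 1 < len(text) and text[i + 1].isalpha()
        labels.append((i, "apostrophe" if followed_by_letter else "closing quote"))
    return labels


# "Il a dit ‘l’été’ hier." with one elision and one closing quote
print(classify_french_u2019("Il a dit \u2018l\u2019\u00e9t\u00e9\u2019 hier."))
```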

As the apostrophe is also commonly encoded as the ASCII vertical
single quote, such an encoding offers fewer hints and can be more
difficult to disambiguate (this is true as well in English or
Italian).

In all cases, you need knowledge of the language before trying to
implement a word-breaker for that language. The solution in UAX#29
will still provide some basic breaks to reduce the number of cases and
to more easily detect exceptions, and it can be a good first
processing step used in actually working word breakers for spell
checkers, grammatical analysis, and automated translators, and for
disambiguating leading and trailing apostrophes from leading and
trailing quotation marks.
Received on Mon Jul 04 2011 - 23:43:43 CDT
