PRI #186: Word-Joining Hyphen vs RIGHT SINGLE QUOTATION MARK

From: Per Starbäck <starback_at_stp.lingfil.uu.se>
Date: Mon, 04 Jul 2011 15:56:40 +0200

At http://www.unicode.org/review/pri186/ is a suggestion that U+2011
NON-BREAKING HYPHEN should be given the word-break property MidLetter.
One reason given is that some languages use a hyphen character between
syllables within a word, and word-based operations such as
word-selection or move-to-next-word commands should not treat these
hyphens as word boundaries.

# The advantage of making this change is that U+2011 NON-BREAKING HYPHEN
# could be used in orthographies that contain interior hyphens. This
# would avoid a requirement to encode yet another confusable
# hyphen/dash/minus character to the over-a-dozen already in Unicode.
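
A minimal sketch of what the proposed property change would mean for
word selection, assuming a much simplified version of the UAX #29 rule
that a MidLetter character between two letters does not end the word.
The function name, the two character sets (illustrative subsets, not
the full Unicode data) and the sample string (a made-up placeholder,
not real Iu Mien) are all assumptions for illustration:

```python
# Illustrative subsets only; the real MidLetter set is defined in the
# Unicode WordBreakProperty data file.
CURRENT_MID_LETTER = {":", "\u00b7"}
PROPOSED_MID_LETTER = CURRENT_MID_LETTER | {"\u2011"}  # add NON-BREAKING HYPHEN


def words(text, mid_letter):
    """Split text into words made of letters.

    A character from mid_letter is kept inside a word only when it sits
    directly between two letters; everywhere else it ends the word.
    """
    result, current = [], ""
    for i, ch in enumerate(text):
        between_letters = (
            ch in mid_letter
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        )
        if ch.isalpha() or between_letters:
            current += ch
        elif current:
            result.append(current)
            current = ""
    if current:
        result.append(current)
    return result


sample = "aa\u2011bb cc"  # hypothetical word with an interior U+2011
print(words(sample, CURRENT_MID_LETTER))   # hyphen splits the word
print(words(sample, PROPOSED_MID_LETTER))  # hyphen stays inside the word
```

Under the first set the hyphenated form is selected as two words; under
the second it is selected as one.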

The implication is that the alternative to the suggestion is to add a
new character. I don’t see such a requirement! Yes, it’s sometimes hard
to know where word boundaries are, and Unicode certainly helps, but that
doesn’t mean the characters on their own have to completely solve that
problem. Knowledge about the language being used can also be useful, for
example.

Compare this with RIGHT SINGLE QUOTATION MARK, which is used both as a
quotation mark and as an apostrophe, so that extra knowledge can be
needed to know where the word divisions are:

    ’Tis just a highfalutin’ idea, reminding me of that ‘sublime
    masterwork’ L’Étranger that I don’t approve of.

For instance, a mark-word operation on "highfalutin’" should ideally
include the apostrophe, but one on "masterwork" should not include the
closing quotation mark that follows it.
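
A rough sketch of why the character alone cannot settle this, assuming
the apostrophe is written as U+2019 and that a word is a run of letters
optionally joined by a single U+2019 between letters (a simplification,
not the full UAX #29 algorithm). The names below are illustrative:

```python
import re

APOSTROPHE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# A run of letters, optionally continued by U+2019 plus more letters.
# [^\W\d_] matches any Unicode letter in a Python 3 str pattern.
WORD = re.compile(rf"[^\W\d_]+(?:{APOSTROPHE}[^\W\d_]+)*")


def words(text):
    """Return the words found by the purely character-based rule."""
    return WORD.findall(text)


sample = (
    "\u2019Tis just a highfalutin\u2019 idea, reminding me of that "
    "\u2018sublime masterwork\u2019 L\u2019\u00c9tranger that I "
    "don\u2019t approve of."
)

for word in words(sample):
    print(word)
```

With this rule the interior apostrophes in "L’Étranger" and "don’t"
stay inside their words, but the trailing mark is dropped from both
"highfalutin’" and "masterwork’", and the leading one from "’Tis".
Telling the apostrophes from the quotation mark in those positions
takes knowledge of the language, not another character property.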

It would help if the quotation mark and the apostrophe were seen as
different characters here, even though they look the same, but for
good reasons they are seen as the same character in Unicode. And
certainly no one is suggesting different "characters" for joining
and splitting apostrophes (using terminology from
http://unicode.org/mail-arch/unicode-ml/y2002-m08/att-0428/01-cimaUTR29.html
).

I don’t know about the Iu Mien language mentioned in the PRI, but would
it even be correct to disallow *line* breaks with NON-BREAKING HYPHEN in
many of these cases? Wouldn’t it be acceptable to hyphenate some of
these words?

So I would say, don’t ‘fix’ this:

* hyphens are hyphens, even when they are used for slightly different
  reasons in different orthographies.
* word breaking is hard, and only partially solvable by Unicode.

Received on Mon Jul 04 2011 - 09:00:03 CDT
