Misc comments on TR29

L2/02-xxxx

Re:	Misc comments on TR29
From:	Mark Davis
Date:	2002-08-20

The following are comments on TR29 received on unicode/unicore. They are not edited; just copied for reference.

FYI:
There is an open issue regarding grapheme-cluster boundaries in Thai.

* SARA AM as an Other_Grapheme_Extend?

Should "0E33;THAI CHARACTER SARA AM" be a GraphemeExtend
character or not?

By Unicode definition, SARA AM is an Lo, not a combining
character. But many Thai applications (MS Office/ Windows/
OpenOffice.org) treat SARA AM like a combining character (unlike SARA
AA), i.e. the cursor always jumps over it. Whether this is right or not is
controversial but the fact is that Windows users are used to it.

My personal question is that, if it is favorable for Thai to treat
SARA AM as part of the previous grapheme cluster, is it possible for
UTC to consider adding SARA AM as an Other_Grapheme_Extend?
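
A minimal sketch of the question above, in Python. It is not the TR29
algorithm: the clusters() helper and its extra_extend parameter are
hypothetical names, and the only rule modelled is "a nonspacing or
enclosing mark, or any character listed in extra_extend, attaches to
the preceding cluster". It only illustrates what would change if SARA AM
were treated as a grapheme extender.

```python
import unicodedata

SARA_AM = "\u0E33"  # THAI CHARACTER SARA AM


def clusters(text, extra_extend=frozenset()):
    """Toy grapheme clustering (assumption: Mn/Me marks plus extra_extend
    attach to the preceding cluster; everything else starts a new one)."""
    out = []
    for ch in text:
        extends = unicodedata.category(ch) in ("Mn", "Me") or ch in extra_extend
        if out and extends:
            out[-1] += ch
        else:
            out.append(ch)
    return out


# NO NU + MAI THO + SARA AM, the Thai word for "water"
word = "\u0E19\u0E49\u0E33"

print(unicodedata.category(SARA_AM))       # general category of SARA AM
print(len(clusters(word)))                 # SARA AM starts its own cluster
print(len(clusters(word, {SARA_AM})))      # SARA AM joins the previous cluster
```

With the default settings the cursor would stop before SARA AM; with
SARA AM in extra_extend the whole word is a single cluster, which is the
behaviour described for the applications named above.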

My immediate reaction to this TR was that it was doomed, given how
difficult it is to tokenize text perfectly (I have written a number of
tokenizers for natural language processing, and they are never
complete). However, after reading the draft, I found myself agreeing
that it is reasonable to provide =some= guidance for the 80% solution.
So, I looked at the code for some of my tokenizers. Most of the special
cases covered there are not appropriate for the TR, but I do have the
following suggestion:

Consider adding U+0026 (ampersand) to the MidLetter class. I did a
quick scan through a few million words of New York Times data I have,
and found that most mid-word occurrences would probably not induce word
breaks, e.g.,

   Q&A
   R&R
   AT&T
   P&G
   ...

Exceptions included:

   Ben&Jerry
   How&Why

Perhaps a more conservative rule would involve only uppercase letters ....

A caveat: I am unfamiliar with analogous cases in languages other than
English.
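
The suggestion above can be tried out with a small sketch, in Python.
This is a toy splitter, not the TR29 rules: words(), mid_letter and
uppercase_only are hypothetical names, and "letter" here just means
str.isalpha(). It shows the effect of adding U+0026 to MidLetter, and of
the more conservative variant that joins only between uppercase letters.

```python
def words(text, mid_letter="'", uppercase_only=""):
    """Toy word splitter: a character in mid_letter joins two letters.
    A character in uppercase_only joins only if both neighbours are
    uppercase letters. Non-letters are otherwise dropped."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if not text[i].isalpha():
            i += 1
            continue
        j = i + 1
        while j < n:
            if text[j].isalpha():
                j += 1
            elif j + 1 < n and text[j + 1].isalpha() and (
                text[j] in mid_letter
                or (text[j] in uppercase_only
                    and text[j - 1].isupper() and text[j + 1].isupper())
            ):
                j += 2  # consume the joining character and the next letter
            else:
                break
        tokens.append(text[i:j])
        i = j
    return tokens


print(words("AT&T and Q&A"))                       # ampersand not in MidLetter
print(words("AT&T and Q&A", mid_letter="'&"))      # ampersand in MidLetter
print(words("Ben&Jerry", mid_letter="'&"))         # the exception case joins too
print(words("AT&T Ben&Jerry", uppercase_only="&")) # conservative variant
```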

> That being said, here are a few problematic cases for your proposal:
>
> "prud'homme" (a member of an industrial tribunal) is a single word, as
> are his relatives "prud'homal", and "prud'homie".

I believe TR29 gives a much more common example, « aujourd'hui » (today), and
admits that it would present a problem for word-breaking.

"Dans l'S, à une heure d'affluence..."

- Raymond Queneau, Exercices de Style (opening sentence).

At 00:15 15/08/02 -0700, Eric Muller wrote:
> > > Your definition of "LatinVowel" is problematic. Is "Y" only a vowel in
> > > French? In a word such as "yeux", it certainly is a consonant. Could
> > > this lead to problems?
> >
> > I don't think so, but I wait for the opinion of French speakers.
> >
> > What I can see is that things like "l'yaourt" [lja'ur] are normal in
> > French spelling, and sometimes are to be found also in Italian
> > ("l'yoghurt" ['ljogurt]).
>
>
>"y" is either a vowel or a semi-consonant. When a semi-consonant, an
>initial "y" does not cause elision, so "le yaourt". Of course, there are
>exceptions: "yeuse" (oak), "yèble" (?) and "yeux" (eyes). The usage is
>both ways for "yole" (skiff). There are a few words starting with a
>vowel "y": "y" (there), "ypérite" (mustard gas), "ytterbium" (?),
>"yttrium" (?). Finally, there is elision before most proper nouns
>starting with "Y": "Yonne" (a river), "York", etc.
>
>That being said, here are a few problematic cases for your proposal:
>
>"prud'homme" (a member of an industrial tribunal) is a single word, as
>are his relatives "prud'homal", and "prud'homie".
>
>Grevisse ("Le bon usage", "the" authority on French usage) gives five
>verbs which are considered a single word: "entr'aimer (s')",
>"entr'apercevoir", "entr'appeler (s')", "entr'avertir (s')",
>"entr'égorger (s')"; Le Petit Robert (1988, a well respected dictionary)
>gives only the second one.
>
>There is elision before the names of the consonants f, h, l, m, n, r, s,
>x: "admissible à l'X" (accepted at X = École Polytechnique), "devant
>l'n" (before the n).
>
>"grand'mère" is definitely one word for me, but "grand'rue",
>"grand'chose" are not so clear. All are archaic forms and Le Petit
>Robert does not list any of those (modern: "grand-mère", "rue
>principale", "grand chose").
>
>Then there is spoken French: "j'suis allé m'promener" for "je suis allé
>me promener" (I went for a walk). There are many such cases of elision
>before a consonant.
>
>This spoken French is of course very close to many dialects, or even
>close languages (e.g. Picard, spoken in the North of France).
>
>Did we mention that one never breaks a line after an apostrophe that
>represents elision?
>
>Speaking of French line break problems, there is also the case of the
>";", which takes a space before and after: "foo ; bar". Of course, one
>never breaks on the space just after "foo". Same for ":".
>
>Eric.

MC> Consonants [j] and [w] have the special status of "semivowels" in
MC> Romance languages, which means that they often behave as vowels
MC> do, including in the rules for elision.

One has to differentiate between phonemes and graphemes. Unicode, of
course, operates on the grapheme level, and thus you simply can't be
certain what a "y" actually stands for (vowel or semivowel).

MC> But, of course, I am aware that there are edge cases that will not
MC> be captured in the general case. I have named one of these edge
MC> cases (the Breton trigraph "c'h"), but it's not difficult to come
MC> up with more -- e.g., when the apostrophe is used as a diacritic
MC> applied to consonants (such as the Wade-Giles romanization of
MC> Chinese "K'ang-hsi").

Just to give another example: Uzbek in Latin script uses "o'" and "g'"
as opposed to "o" and "g", such as in the language designation
"O'zbek" where "o'" stands for the sound designated in Cyrillic script
by U+040E and "g'" is equivalent to U+0493.

MC> BTW, notice that I didn't include precomposed accented letters
MC> because I understand UTR#29 works on NFD normalized text.

Does NFD in this instance mean to include U+0080..00FF, i.e. the
former Latin-1 upper block? It would be of interest to us Germans :-)
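
As a concrete illustration of what NFD does to letters in that range, here
is a small sketch in Python using the standard unicodedata module. It
only shows the normalization itself; how TR29 then classifies the
resulting characters is not modelled here.

```python
import unicodedata

# u with diaeresis, e with acute, sharp s (all in U+0080..U+00FF)
for ch in "\u00FC\u00E9\u00DF":
    nfd = unicodedata.normalize("NFD", ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"U+{ord(ch):04X} -> {codes}")
```

The accented letters come out as a base letter followed by a combining
mark, while a letter with no canonical decomposition, such as the sharp s,
is left unchanged.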

MC> However, "ItalianFrenchVowel" doesn't include Esperanto, Occitan
MC> and many Italian and French dialects.

"RomanceVowel"? (Not a lot better.)

I had just added this in the list of possible property changes for discussion at the next UTC meeting. I agree that the right status is MidLetter, for the reasons Marco cites. Note that there is also a dot explicitly used for a hyphenation point, one that should also be included as a MidLetter.

U+00B7 ( · ) {MIDDLE DOT}
U+2027 ( ‧ ) {HYPHENATION POINT}

(Note to all, especially those familiar with non-Latin written languages: be sure to look over the character classes for words and sentences in http://www.unicode.org/reports/tr29/ to see if there are any other punctuation characters that should be included.)

I agree with Marco that on balance it is better to respect the Catalan use, especially since as a MidLetter the character will not interfere with the most common other usages (e.g. as a bullet). There is a character explicitly used for a mathematical dot:

U+22C5 ( ⋅ ) {DOT OPERATOR}

Note that we have a gazillion other dots already:

U+002E ( . ) {FULL STOP}
U+02D9 ( ˙ ) {DOT ABOVE}
U+2024 ( ․ ) {ONE DOT LEADER}
U+22C5 ( ⋅ ) {DOT OPERATOR}
U+FE52 ( ﹒ ) {SMALL FULL STOP}
U+FF0E ( ． ) {FULLWIDTH FULL STOP}

U+FF65 ( ･ ) {HALFWIDTH KATAKANA MIDDLE DOT}

U+2801 ( ⠁ ) {BRAILLE PATTERN DOTS-1}
U+2802 ( ⠂ ) {BRAILLE PATTERN DOTS-2}
U+2804 ( ⠄ ) {BRAILLE PATTERN DOTS-3}
U+2808 ( ⠈ ) {BRAILLE PATTERN DOTS-4}
U+2810 ( ⠐ ) {BRAILLE PATTERN DOTS-5}
U+2820 ( ⠠ ) {BRAILLE PATTERN DOTS-6}
U+2840 ( ⡀ ) {BRAILLE PATTERN DOTS-7}

U+0307 ( ◌̇ ) {COMBINING DOT ABOVE}
U+0323 ( ◌̣ ) {COMBINING DOT BELOW}

U+05C1 ( ◌ׁ ) {HEBREW POINT SHIN DOT}
U+05C2 ( ◌ׂ ) {HEBREW POINT SIN DOT}
U+05C4 ( ◌ׄ ) {HEBREW MARK UPPER DOT}
U+05B9 ( ◌ֹ ) {HEBREW POINT HOLAM}
U+05B4 ( ◌ִ ) {HEBREW POINT HIRIQ}
U+05BC ( ◌ּ ) {HEBREW POINT DAGESH OR MAPIQ}

U+302E ( ◌〮 ) {HANGUL SINGLE DOT TONE MARK}

U+073C ( ◌ܼ ) {SYRIAC HBASA-ESASA DOTTED}
U+073F ( ◌ܿ ) {SYRIAC RWAHA}
U+0740 ( ◌݀ ) {SYRIAC FEMININE DOT}
U+0741 ( ◌݁ ) {SYRIAC QUSHSHAYA}
U+0742 ( ◌݂ ) {SYRIAC RUKKAKHA}

U+093C ( ◌़ ) {DEVANAGARI SIGN NUKTA}

U+09BC ( ◌় ) {BENGALI SIGN NUKTA}
U+0A3C ( ◌਼ ) {GURMUKHI SIGN NUKTA}
U+0ABC ( ◌઼ ) {GUJARATI SIGN NUKTA}
U+0B3C ( ◌଼ ) {ORIYA SIGN NUKTA}

And these are just the obvious ones found with a quick search (and just for the single dots). There are probably more hiding out in little corners of scripts (it's a bit like "Where's Waldo" looking for them). Moreover, I believe we may even be adding more dots for UPA (http://www.unicode.org/unicode/alloc/Pipeline.html).

Perhaps we should have reserved a plane just for the darned dots; who knows how many we will end up with...

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----

From: "Marco Cimarosti" <marco.cimarosti@essetre.it>

To: "'John Cowan'" <jcowan@reutershealth.com>

Cc: <mark.davis@jtcsv.com>; <unicore@unicode.org>; <unicode@unicode.org>; <antonio@tuvalkin.web.pt>

Sent: Wednesday, August 14, 2002 06:23

Subject: RE: New version of TR29:

John Cowan wrote:
> Marco Cimarosti scripsit:
>
> > Moreover, as Martins-Tuválkin says, non-Catalan uses of U+00B7 are too
> > unusual and uninteresting to be taken as the default.
>
> You omit, however, its very common use as a sign of multiplication.

Actually, I don't see it very often.

> > BTW, notice that the most important of these non-Catalan usages work as
> > expected also if U+00B7 is a MidLetter:
>
> However, it prevents a·b (a times b) from being correctly split.

In algebra, multiplication operators are normally omitted in such cases: a
times b is spelled "ab"; twice a is spelled "2a".

A dot-shaped multiplication operator is only used when both operands are
numbers (in which case it would split correctly), and when both operands are
alphabetic but at least one of them is longer than one letter, e.g.:

x·sin 3

But it seems to me that these borderline cases are too rare to take
priority over the proper spelling of Catalan or the common notation of
hyphenation in dictionaries.

Moreover, TR29 can be customized for special needs, and math applications
already have lots of things to customize.

_ Marco

----- Original Message -----

From: "Marco Cimarosti" <marco.cimarosti@essetre.it>

To: "'Mark Davis'" <mark.davis@jtcsv.com>; <unicore@unicode.org>; <unicode@unicode.org>

Cc: "'Anto'nio Martins-Tuva'lkin'" <antonio@tuvalkin.web.pt>

Sent: Wednesday, August 14, 2002 04:18

Subject: RE: New version of TR29:

Mark Davis wrote:
> There is a new version of Unicode Technical Report #29: Text
> Boundaries on <http://www.unicode.org/reports/tr29/>,
> [...]
> Feedback that is received before the UTC meeting (starting
> August 20) can be
> made available for the discussion of TR29 at that meeting.

I think that the following comment by António Martins-Tuválkin, from the
thread titled "Is U+0140 (l with middle dot) ever used?", is relevant for
TR29:

| As for the nature of the middle dot, short of a specific code point
| attributed to LATIN LETTER CATALAN MIDDLE DOT, there should be
| something ensuring that this character can be treated as a letter
| for all things referring to word delimitation (smart select, line
| break, word count, etc.).
|
| I imagine that with 9 million native speakers Catalan may appear
| as a weak lobby to push to such a change in the standard, but note
| that while other uses of (non-letter) middle dot are marginal and
| scarcely content-bearing, Catalan middle dot is central and
| essential to quality textual content representation and encoding
| -- which AFAIK Unicode is all about.

I suggest that U+00B7 (·) be added to the "MidLetter" character class in
Table 2 ("Default Word Boundaries"). It would be inconvenient if U+0140
U+006C (ŀl) and U+006C U+00B7 U+006C (l·l) worked differently.

A proper Catalan behavior is desirable for Catalan itself, of course, but
also for any other language occasionally using Catalan loanwords or proper
names.

A web search on Google for "Paral·lel" (the most famous road in Barcelona)
shows that this name is more common in English web pages (about 16,000) than
in Catalan ones (about 7,500).

Moreover, as Martins-Tuválkin says, non-Catalan uses of U+00B7 are too
unusual and uninteresting to be taken as the default.

BTW, notice that the most important of these non-Catalan usages work as
expected also if U+00B7 is a MidLetter:

1) It works OK as a bullet: it correctly splits because it would never be
preceded by a letter;

2) It works OK as a Greek semicolon: it correctly splits because it
would always be followed by a space;

3) It works OK as a CJK separator for Western personal names: it correctly
splits because MidLetter is not involved in rules with katakana or
ideographs;

4) It works OK as a hyphenating separator in dictionaries: it correctly
joins as it does in Catalan.

_ Marco
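
The four cases listed above, plus the a·b objection raised in the reply,
can be checked against a toy model in Python. This is a sketch under
stated assumptions, not the TR29 rules: klass(), words() and MID_LETTER
are hypothetical names; "ALetter" is approximated as any alphabetic
character that is not kana or a CJK ideograph; and the only rules
modelled are "letters join letters", "katakana joins katakana" and
"ALetter MidLetter ALetter does not break".

```python
import unicodedata

MID_LETTER = {"'", "\u00B7"}  # assumption: U+00B7 has been added to MidLetter


def klass(ch):
    """Rough character class: 'K' katakana, 'A' other letters, 'O' the rest."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "K"
    if ch.isalpha() and not name.startswith(("CJK", "HIRAGANA")):
        return "A"
    return "O"


def words(text):
    """Toy segmentation: runs of one class stay together; a MidLetter
    joins only when it sits between two 'A' characters."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        k = klass(text[i])
        j = i + 1
        if k != "O":
            while j < n:
                if klass(text[j]) == k:
                    j += 1
                elif (k == "A" and text[j] in MID_LETTER
                      and j + 1 < n and klass(text[j + 1]) == "A"):
                    j += 2  # consume the MidLetter and the letter after it
                else:
                    break
        if not text[i].isspace():
            tokens.append(text[i:j])
        i = j
    return tokens


print(words("Paral\u00B7lel"))         # Catalan: stays one word
print(words("\u00B7 item"))            # 1) bullet: not preceded by a letter
print(words("ναι\u00B7 και"))          # 2) Greek semicolon: followed by a space
print(words("ジョン\u00B7スミス"))       # 3) separator between katakana names
print(words("hy\u00B7phen\u00B7a\u00B7tion"))  # 4) dictionary hyphenation
print(words("a\u00B7b"), words("2\u00B73"))    # multiplication: letters vs digits
```

In this model a·b is not split while 2·3 is, which matches the trade-off
discussed in the thread.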