L2/02-xxxx
Re: Misc comments on TR29
From: Mark Davis
Date: 2002-08-20
The following are comments on TR29 received on unicode/unicore. They are not edited; just copied for reference.
A.
FYI:
There is an open issue regarding grapheme-cluster boundaries in Thai.
* SARA AM as an Other_Grapheme_Extend?
Whether "0E33;THAI CHARACTER SARA AM" should be a GraphemeExtend
character or not?
By Unicode definition, SARA AM is an Lo, not a combining
character. But many Thai applications (MS Office, Windows,
OpenOffice.org) treat SARA AM like a combining character (unlike SARA
AA), i.e. the cursor always jumps over it. Whether this is right or not
is controversial, but the fact is that Windows users are used to it.
My personal question is this: if it is favorable for Thai to treat
SARA AM as part of the previous grapheme cluster, is it possible for
UTC to consider adding SARA AM as an Other_Grapheme_Extend?
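A minimal sketch of what this question amounts to, in Python. It is not the TR29 algorithm: it applies only the single rule "do not break before an extending character", it approximates Grapheme_Extend by the general categories Mn and Me, and the names `is_extend`, `clusters` and `other_grapheme_extend` are hypothetical, introduced here for illustration.

```python
import unicodedata

SARA_AM = "\u0e33"  # THAI CHARACTER SARA AM, general category Lo


def is_extend(ch, other_grapheme_extend=frozenset()):
    # Assumption: Grapheme_Extend is approximated as Mn/Me marks plus an
    # explicit extra set standing in for Other_Grapheme_Extend.
    return unicodedata.category(ch) in ("Mn", "Me") or ch in other_grapheme_extend


def clusters(text, other_grapheme_extend=frozenset()):
    # Attach every extending character to the cluster before it.
    out = []
    for ch in text:
        if out and is_extend(ch, other_grapheme_extend):
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "\u0e19\u0e49\u0e33"  # NO NU + MAI THO + SARA AM

print(len(clusters(word)))                        # SARA AM left as Lo only
print(len(clusters(word, frozenset({SARA_AM}))))  # SARA AM treated as extending
```

Under these assumptions the word splits into two clusters when SARA AM is only an Lo, and stays one cluster when SARA AM is added to the extra set, which matches the cursor behaviour described above.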
B.
My immediate reaction to this TR was that it was doomed, given how
difficult it is to tokenize text perfectly (I have written a number of
tokenizers for natural language processing, and they are never
complete). However, after reading the draft, I found myself agreeing
that it is reasonable to provide =some= guidance for the 80% solution.
So, I looked at the code for some of my tokenizers. Most of the special
cases covered there are not appropriate for the TR, but I do have the
following suggestion:
Consider adding U+0026 (ampersand) to the MidLetter class. I did a
quick scan through a few million words of New York Times data I have,
and found that most mid-word occurrences would probably not induce word
breaks, e.g.,
Q&A
R&R
AT&T
P&G
...
Exceptions included:
Ben&Jerry
How&Why
Perhaps a more conservative rule would involve only uppercase letters ....
A caveat: I am unfamiliar with analogous cases in languages other than
English.
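A rough sketch of the suggestion, in Python. It is a toy tokenizer rather than the TR29 rules: it keeps a MidLetter character inside a word only when it sits between two letters, and it assumes the apostrophe is the only MidLetter by default. The function name `word_breaks` and the `uppercase_only` flag (the more conservative variant) are hypothetical.

```python
def word_breaks(text, mid_letter=frozenset("'"), uppercase_only=False):
    """Split text into tokens, keeping Letter MidLetter Letter together."""

    def joins(i):
        # True if text[i] is a MidLetter sitting between two letters.
        if text[i] not in mid_letter or i == 0 or i + 1 >= len(text):
            return False
        before, after = text[i - 1], text[i + 1]
        if not (before.isalpha() and after.isalpha()):
            return False
        if uppercase_only:
            return before.isupper() and after.isupper()
        return True

    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isalpha() or joins(i):
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if not ch.isspace():
            tokens.append(ch)  # any other character is its own token
    if current:
        tokens.append(current)
    return tokens


with_amp = frozenset("'&")
print(word_breaks("AT&T"))                                       # default set
print(word_breaks("AT&T", with_amp))                             # & as MidLetter
print(word_breaks("Ben&Jerry", with_amp, uppercase_only=True))   # conservative
```

With the default set, "AT&T" falls apart into three tokens. With the ampersand added it stays whole, and so does "Ben&Jerry" unless the uppercase-only variant is used.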
C.
> That being said, here are a few problematic cases for your proposal:
>
> "prud'homme" (a member of an industrial tribunal) is a single word, as
> are his relatives "prud'homal", and "prud'homie".
I believe TR29 gives a much more common example « aujourd'hui » (today) and
admits that it would present a problem for word-breaking.
D.
"Dans l'S, à une heure d'affluence..."
- Raymond Queneau, Exercices de Style (opening sentence).
At 00:15 15/08/02 -0700, Eric Muller wrote:
> > > Your definition of "LatinVowel" is problematic. Is "Y" only a
> > > vowel in French? In a word such as "yeux", it certainly is a
> > > consonant. Could this lead to problems?
> >
> > I don't think so, but I wait for the opinion of French speakers.
> >
> > What I can see is that things like "l'yaourt" [lja'ur] are normal in
> > French spelling, and sometimes are to be found also in Italian
> > ("l'yoghurt" ['ljogurt]).
>
>
>"y" is either a vowel or a semi-consonant. When a semi-consonant, an
>initial "y" does not cause elision, so "le yaourt". Of course, there are
>exceptions: "yeuse" (oak), "yèble" (?) and "yeux" (eyes). The usage is
>both ways for "yole" (skiff). There are a few words starting with a
>vowel "y": "y" (there), "ypérite" (mustard gas), "ytterbium" (?),
>"yttrium" (?). Finally, there is elision before most proper nouns
>starting with "Y": "Yonne" (a river), "York", etc.
>
>That being said, here are a few problematic cases for your proposal:
>
>"prud'homme" (a member of an industrial tribunal) is a single word, as
>are his relatives "prud'homal", and "prud'homie".
>
>Grevisse ("Le bon usage", "the" authority on French usage) gives five
>verbs which are considered a single word: "entr'aimer (s')",
>"entr'apercevoir", "entr'appeler (s')", "entr'avertir (s')",
>"entr'égorger (s')"; Le Petit Robert (1988, a well respected dictionary)
>gives only the second one.
>
>There is elision before the names of the consonants f, h, l, m, n, r, s,
>x: "admissible à l'X" (accepted at X = École Polytechnique), "devant
>l'n" (before the n).
>
>"grand'mère" is definitely one word for me, but "grand'rue",
>"grand'chose" are not so clear. All are archaic forms and Le Petit
>Robert does not list any of those (modern: "grand-mère", "rue
>principale", "grand chose").
>
>Then there is spoken French: "j'suis allé m'promener" for "je suis allé
>me promener" (I went for a walk). There are many such cases of elision
>before a consonant.
>
>This spoken French is of course very close to many dialects, or even
>close languages (e.g. Picard, spoken in the North of France).
>
>Did we mention that one never breaks a line after an apostrophe that
>represents elision?
>
>Speaking of French line break problems, there is also the case of the
>";", which takes a space before and after: "foo ; bar". Of course, one
>never breaks on the space just after "foo". Same for ":".
>
>Eric.
E.
MC> Consonants [j] and [w] have the special status of "semivowels" in
MC> romance languages, which means that they often behave as vowels
MC> do, including in the rules for elision.
One has to differentiate between phonemes and graphemes. Unicode, of
course, operates on the grapheme level, and thus you simply can't be
certain what a "y" actually stands for (vowel or semivowel).
MC> But, of course, I am aware that there are edge cases that will not
MC> be captured in the general case. I have named one of these edge
MC> cases (the Breton trigraph "c'h"), but it's not difficult to come
MC> up with more -- e.g., when the apostrophe is used as a diacritic
MC> applied to consonants (such as the Wade-Giles romanization of
MC> Chinese "K'ang-hsi").
Just to give another example: Uzbek in Latin script uses "o'" and "g'"
as opposed to "o" and "g", such as in the language designation
"O'zbek" where "o'" stands for the sound designated in Cyrillic script
by U+040E and "g'" is equivalent to U+0493.
MC> BTW, notice that I didn't include precomposed accented letters
MC> because I understand UTR#29 works on NFD normalized text.
Does NFD in this instance mean to include U+0080..00FF, i.e. the
former Latin-1 upper block? It would be of interest to us Germans :-)
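For reference, a small Python check of how NFD treats a few letters from that range, using the standard `unicodedata` module. It only illustrates the normalization itself and makes no claim about how UTR#29 is meant to be applied; the helper name `nfd_codepoints` is hypothetical.

```python
import unicodedata


def nfd_codepoints(ch):
    # Return the NFD form of ch as a list of "U+XXXX" strings.
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


for ch in "\u00e4\u00f6\u00fc\u00df":  # a-umlaut, o-umlaut, u-umlaut, sharp s
    print(f"U+{ord(ch):04X} ->", " ".join(nfd_codepoints(ch)))
```

The umlauted vowels decompose into a base letter followed by U+0308 COMBINING DIAERESIS, so a rule written in terms of base letters still sees them. U+00DF has no decomposition and is left unchanged.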
MC> However, "ItalianFrenchVowel" doesn't include Esperanto, Occitan
MC> and many Italian and French dialects.
"RomanceVowel"? (Not a lot better.)
I had just added this in the list of possible property changes for discussion at the next UTC meeting. I agree that the right status is MidLetter, for the reasons Marco cites. Note that there is also a dot explicitly used for a hyphenation point, one that should also be included as a MidLetter.