From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 15 2009 - 14:38:55 CDT
Christoph Burgmer wrote:
> > How would we treat letter case as of UTR#21? Even using full stop for the
> > compulsory neutral tone turns up wrong title case (example in Python):
> >
> > >>> "bu jy.daw".title()
> > 'Bu Jy.Daw'
> >
> > Though in my eyes it should be
> > 'Bu Jy.daw'
> >
> > Would UTR#21 even handle those cases? Would such a character fall into the
> > "Letter Modifier" class?
>
> I'd like to re-raise this question more explicitly for the compulsory neutral
> tone, as its usage seems to be official.
>
> Would one map this glyph to the full stop U+002E, as Y.R. Chao probably
> designed it, and which is used in IPA to separate syllables, or rather look
> for a character falling in the class "case-ignorable" so that the titlecase
> algorithm from UTR#21 takes effect?
In addition to the points made by Asmus, I'll add my own
elaborations here:
1. U+002E FULL STOP already *is* in the class "case-ignorable",
as is made abundantly clear by the new derived case-related
property Case_Ignorable, now included in DerivedCoreProperties.txt
in the Unicode 5.2 beta. As of today, the file for review is:
http://www.unicode.org/Public/5.2.0/ucd/DerivedCoreProperties-5.2.0d11.txt
and the relevant entry is:
# Derived Property: Case_Ignorable (CI)
# As defined by Unicode Standard Definition D121
# C is defined to be case-ignorable if
# Word_Break(C) = MidLetter or MidNumLet, or
# General_Category(C) = Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf),
# Modifier_Letter (Lm), or Modifier_Symbol (Sk).
0027 ; Case_Ignorable # Po APOSTROPHE
002E ; Case_Ignorable # Po FULL STOP
...
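For anyone who wants to check this kind of entry programmatically, here is a
minimal sketch of reading lines in the "code point ; property # comment" layout
shown above. The function name parse_property_lines and the SAMPLE string are
my own illustration, not part of any Unicode tooling, and the sample holds only
the two lines quoted in this message, not the full data file.

```python
# Sample lines copied from the excerpt quoted in this message; a real
# program would read the complete UCD data file instead.
SAMPLE = """\
0027          ; Case_Ignorable # Po       APOSTROPHE
002E          ; Case_Ignorable # Po       FULL STOP
"""


def parse_property_lines(text, wanted):
    """Return the set of code points that carry the property value `wanted`."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue                              # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        # The first field is a single code point ("002E") or a range
        # ("0041..005A"), both written in hexadecimal.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


case_ignorable = parse_property_lines(SAMPLE, "Case_Ignorable")
print(ord(".") in case_ignorable)   # True
```

The same function should work on the WordBreakProperty.txt layout, since that
file uses the same field structure.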
2. U+002E is Case_Ignorable=True because of its
word-breaking behavior, which is defined in WordBreakProperty.txt.
As of today, the file for review is:
http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty-5.2.0d12.txt
and the relevant entry is:
0027 ; MidNumLet # Po APOSTROPHE
002E ; MidNumLet # Po FULL STOP
3. The impact of Word_Break=MidNumLet on the default word breaking algorithm
documented in UAX #29 is defined in WB6 and WB7 in that document.
As of today, the document for Unicode 5.2 beta review is:
http://www.unicode.org/reports/tr29/tr29-14.html
And the relevant summary of those rules is: "Do not break letters across
certain punctuation." In the case of these two characters, the
point of making them Word_Break=MidNumLet is so that the default
algorithm would not break across contractions or elisions that
use U+0027 (or U+2019) and would not break at a full stop used
in common constructions like decimal number representations: "25.6%"
and so on.
What that means, in turn, is that a default implementation of UAX #29
word breaking should identify word break boundaries in the string
"bu jy.daw" as #bu# #jy.daw#, since the "." in "jy.daw" is between
letters and would inhibit determination of a word break.
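To make the effect of those rules concrete, here is a rough sketch of the
"do not break letters across certain punctuation" idea. It is a heavy
simplification and not a conformant UAX #29 implementation: it treats anything
str.isalpha() accepts as a letter, ignores numbers, Extend and Format
characters, and takes the set of "mid" characters as a parameter. The name
word_segments and the default set are assumptions for illustration only.

```python
def word_segments(text, mid_chars=frozenset(".'\u2019")):
    """Split text into segments, keeping a run of letters together across a
    single character from mid_chars that has a letter on both sides."""
    segments = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        if text[i].isalpha():
            while j < n:
                if text[j].isalpha():
                    j += 1          # letter followed by letter: no break
                elif (text[j] in mid_chars and j + 1 < n
                      and text[j + 1].isalpha()):
                    j += 2          # letter, mid character, letter: no break
                else:
                    break
        segments.append(text[i:j])  # non-letters become one-character segments
        i = j
    return segments


print(word_segments("bu jy.daw"))   # ['bu', ' ', 'jy.daw']
```

A full stop at the end of a sentence is not followed by a letter, so it still
ends the word in this sketch.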
4. This treatment of U+002E FULL STOP in UAX #29 for word breaking behavior
is a relatively recent tweak to the algorithm. The apostrophe was
Word_Break=MidLetter as of Unicode 5.0, but full stop was not. The
introduction of Word_Break=MidNumLet and addition of full stop to
that class came in Unicode 5.1. What *that* means is that for this
particular edge case involving full stop, a conformant implementation
of UAX #29 default word breaking behavior would behave differently
for a Unicode 5.0 implementation than a Unicode 5.1 (or later)
implementation. So a Unicode 5.0 implementation of default word
breaking behavior would break around a full stop and in particular
would break "bu jy.daw" as #bu# #jy#.#daw#.
5. I know this is getting long-winded ;-) but what the last point means is that
any default titlecasing algorithm which itself is based on default
word boundary determination will end up titlecasing "bu jy.daw"
differently, depending on which version of Unicode it implements.
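A small sketch of that version dependence, using two regular expressions as
stand-ins for the two word-boundary behaviors. The patterns WORD_5_0 and
WORD_5_1 and the titlecase function are hypothetical names of mine; they handle
ASCII letters only and simply uppercase the first letter of each word, which is
a simplification of the default algorithm in Section 3.13.

```python
import re

# Hypothetical approximations of the two behaviors, ASCII letters only.
WORD_5_0 = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")      # only apostrophe joins letters
WORD_5_1 = re.compile(r"[A-Za-z]+(?:['.][A-Za-z]+)*")   # full stop joins letters too


def titlecase(text, word_pattern):
    """Uppercase the first letter of each matched word and lowercase the rest."""
    def fix(match):
        word = match.group(0)
        return word[0].upper() + word[1:].lower()
    return word_pattern.sub(fix, text)


print(titlecase("bu jy.daw", WORD_5_0))   # Bu Jy.Daw
print(titlecase("bu jy.daw", WORD_5_1))   # Bu Jy.daw
```

The first result matches what str.title() produced in the quoted example; the
second is the form Christoph was asking for.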
6. As a general principle, all discussion of Unicode casing behavior
should cease and desist from referring to UTR #21. As the
web site clearly indicates, UTR #21 has been superseded (as of
Unicode 4.0). This kind of discussion about default casing
behavior in the standard should definitely be referring instead
to Section 3.13 "Default Case Algorithms" in the standard itself.
For the Unicode 5.0 version online, see:
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
7. Titlecasing is, in general, inherently quite variable. Different
typographical traditions follow different rules, so best practice
requires being able to adjust it for specific conventions. Unicode
default titlecasing is an approximation at best, and there is
no way it can or should be expected to be correct for all strings
for all situations. In particular, for specialized orthographies
that use punctuation characters such as U+002E FULL STOP in
unusual contexts, there simply is no way for general software
to "get it right" out of the box for all users, because
these usages are inherently contradictory regarding issues such
as word boundaries.
8. And finally, Asmus is correct. There is no way that the UTC
will clone a U+002E FULL STOP character in an attempt to create
a new character that would guarantee correct titlecasing for
Gwoyeu Romatzyh. Tailoring of algorithms is the answer for such
a requirement. The good news, however, is that for *default*
titlecasing behavior, if applications are implementing the
Unicode default algorithms and move to Unicode 5.1 (or later),
you should end up with the titlecasing you want for
Gwoyeu Romatzyh without having to tailor it for the neutral
tone mark.
--Ken