Re: GR and letter case

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 15 2009 - 14:38:55 CDT

Next message: announcements@unicode.org: "[Unicode Announcement] 33rd Internationalization & Unicode Conference - Program Online"

Previous message: Asmus Freytag: "Re: GR and letter case"
Maybe in reply to: Christoph Burgmer: "GR and letter case"
Next in thread: Christoph Burgmer: "Re: GR and letter case"
Reply: Christoph Burgmer: "Re: GR and letter case"

Christoph Burgmer wrote:

> > How would we treat letter case as of UTR#21? Even using full stop for the
> > compulsory neutral tone turns up wrong title case (example in Python):
> >
> > >>> "bu jy.daw".title()
> > 'Bu Jy.Daw'
> >
> > Though in my eyes it should be
> > 'Bu Jy.daw'
> >
> > Would UTR#21 even handle those cases? Would such a character fall into the
> > "Letter Modifier" class?
>
> I'd like to re-raise this question more explicitly for the compulsory neutral
> tone, as its usage seems to be official.
>
> Would one map this glyph to the full stop U+002E, as Y.R. Chao probably
> designed it, and which is used in IPA to separate syllables, or rather look
> for a character falling in the class "case-ignorable" so that the titlecase
> algorithm from UTR#21 takes effect?

In addition to the points made by Asmus, I'll add my own
elaborations here:

1. U+002E FULL STOP already *is* in the class "case-ignorable",
   as is made abundantly clear by the new derived case-related
   property Case_Ignorable, now included in DerivedCoreProperties.txt
   in the Unicode 5.2 beta. As of today, the file for review is:

http://www.unicode.org/Public/5.2.0/ucd/DerivedCoreProperties-5.2.0d11.txt

   and the relevant entry is:

# Derived Property: Case_Ignorable (CI)
# As defined by Unicode Standard Definition D121
# C is defined to be case-ignorable if
# Word_Break(C) = MidLetter or MidNumLet, or
# General_Category(C) = Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf),
# Modifier_Letter (Lm), or Modifier_Symbol (Sk).

0027 ; Case_Ignorable # Po APOSTROPHE
002E ; Case_Ignorable # Po FULL STOP
...

2. U+002E is Case_Ignorable=True because of its
   word-breaking behavior, which is defined in WordBreakProperty.txt.
   As of today, the file for review is:

http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty-5.2.0d12.txt

   and the relevant entry is:

0027 ; MidNumLet # Po APOSTROPHE
002E ; MidNumLet # Po FULL STOP

3. The impact of Word_Break=MidNumLet on the default word breaking algorithm
   documented in UAX #29 is defined in WB6 and WB7 in that document.
   As of today, the document for Unicode 5.2 beta review is:

http://www.unicode.org/reports/tr29/tr29-14.html

   And the relevant summary of those rules is: "Do not break letters across
   certain punctuation." In the case of these two characters, the
   point of making them Word_Break=MidNumLet is so that the default
   algorithm would not break across contractions or elisions that
   use U+0027 (or U+2019) and would not break at a full stop used
   in common constructions like decimal number representations: "25.6%"
   and so on.

   What that means, in turn, is that a default implementation of UAX #29
   word breaking should identify word break boundaries in the string
   "bu jy.daw" as #bu# #jy.daw#, since the "." in "jy.daw" is between
   letters and would inhibit determination of a word break.

4. This treatment of U+002E FULL STOP in UAX #29 for word breaking behavior
   is a relatively recent tweak to the algorithm. The apostrophe was
   Word_Break=MidLetter as of Unicode 5.0, but full stop was not. The
   introduction of Word_Break=MidNumLet and addition of full stop to
   that class came in Unicode 5.1. What *that* means is that for this
   particular edge case involving full stop, a conformant implementation
   of UAX #29 default word breaking behavior would behave differently
   for a Unicode 5.0 implementation than a Unicode 5.1 (or later)
   implementation. So a Unicode 5.0 implementation of default word
   breaking behavior would break around a full stop and in particular
   would break "bu jy.daw" as #bu# #jy#.#daw#

5. I know this is getting long-winded ;-) but what the last point means is that
   any default titlecasing algorithm which itself is based on default
   word boundary determination will end up titlecasing "bu jy.daw"
   differently, depending on which version of Unicode it implements.

6. As a general principle, all discussion of Unicode casing behavior
   should cease and desist from referring to UTR #21. As the
   web site clearly indicates, UTR #21 has been superseded (as of
   Unicode 4.0). This kind of discussion about default casing
   behavior in the standard should definitely be referring instead
   to Section 3.13 "Default Case Algorithms" in the standard itself.
   For the Unicode 5.0 version online, see:

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

7. Titlecasing is, in general, inherently quite variable. Different
   typographical traditions follow different rules, so best practice
   requires being able to adjust it for specific conventions. Unicode
   default titlecasing is an approximation at best, and there is
   no way it can or should be expected to be correct for all strings
   for all situations. In particular, for specialized orthographies
   that use punctuation characters such as U+002E FULL STOP in
   unusual contexts, there is simply no way for general software
   to "get it right" out of the box for all users, because
   these usages are inherently contradictory regarding issues such
   as word boundaries.

8. And finally, Asmus is correct. There is no way that the UTC
   will clone a U+002E FULL STOP character in an attempt to create
   a new character that would guarantee correct titlecasing for
   Gwoyeu Romatzyh. Tailoring of algorithms is the answer for such
   a requirement. The good news, however, is that for *default*
   titlecasing behavior, applications that implement the
   Unicode default algorithms and move to Unicode 5.1 (or later)
   should end up with the titlecasing you want for
   Gwoyeu Romatzyh without having to tailor it for the neutral
   tone mark.
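
As a rough illustration of points 3 through 5, here is a minimal Python
sketch of titlecasing built on a much-simplified stand-in for the WB6/WB7
rules. It is not the UAX #29 algorithm. The function name titlecase, the
mid_chars parameter, and the two constants are hypothetical; the "mid" sets
hold only the characters discussed above rather than the full Word_Break
classes; and plain upper() stands in for the real titlecase mapping.

```python
import re

# One or more Unicode letters (word characters minus digits and underscore).
LETTERS = r"[^\W\d_]+"

# Assumption: only the characters discussed in this message, not the
# complete Word_Break=MidLetter / MidNumLet classes.
UNICODE_5_0_MID = "'"    # apostrophe only
UNICODE_5_1_MID = "'."   # apostrophe and full stop


def titlecase(text, mid_chars):
    """Titlecase each word in text.

    A character in mid_chars does not end a word when it sits directly
    between two letters (a simplified reading of WB6/WB7). mid_chars
    must be a non-empty string.
    """
    mid = "[" + re.escape(mid_chars) + "]"
    word = re.compile(LETTERS + "(?:" + mid + LETTERS + ")*")

    def fix(match):
        w = match.group(0)
        # Approximation: uppercase the first letter, lowercase the rest.
        return w[0].upper() + w[1:].lower()

    return word.sub(fix, text)


print(titlecase("bu jy.daw", UNICODE_5_0_MID))  # Bu Jy.Daw
print(titlecase("bu jy.daw", UNICODE_5_1_MID))  # Bu Jy.daw
```

The only difference between the two calls is whether full stop is allowed
to sit inside a word, which is the Unicode 5.0 versus 5.1 difference
described in point 4.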

--Ken

