Re: Unicode Case Mappings UTR #21

From: Mark Davis (mark@macchiato.com)
Date: Wed Nov 29 2000 - 10:59:55 EST


These are good points.

TR 21 deliberately does not specify the language conventions for using
titlecase, which as you note will change the effect of its use (see
http://www.unicode.org/unicode/reports/tr21/#TitlecaseCaveats). Most
products will have some smarts, but also leave it up to the user when to use
Titlecase since it is difficult or impossible to handle all cases
algorithmically. For example, in MS Word you get the following, where the
original is actually the desired one (by at least some people!):

1. "Taming of the Shrew" => "Taming Of The Shrew"

2. "About 'is Honor" => "About 'Is Honor"
(For non-native speakers, <'is> is a dialect form of <his>)

3. "My Name is d'Avis" => "My Name Is d'Avis" => "My Name Is D'avis"
(Interestingly, Word will do successive changes. For this case you get
different results the second time you apply it.)

4. "The Re-education of Rita" => "The Re-Education Of Rita"

5. "McGowan is Actually a McDonald!" => "Mcgowan Is Actually A Mcdonald"
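
The same behaviour is easy to reproduce with any purely algorithmic
titlecaser. As a rough illustration (Python's built-in str.title() is an
assumption chosen for convenience, not what Word does internally), a naive
routine gives much the same unwanted results:

```python
# Naive, language-neutral titlecasing: every run of letters is treated as a
# word, so small words, dialect forms and name prefixes are all recased.
examples = [
    "Taming of the Shrew",
    "About 'is Honor",
    "My Name is d'Avis",
    "The Re-education of Rita",
    "McGowan is Actually a McDonald!",
]

for original in examples:
    print(f"{original!r} => {original.title()!r}")
```

Note that str.title() treats the apostrophe as a word boundary, so its
output for example 3 is not identical to Word's.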

Some other comments below.

Mark

----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Tuesday, November 28, 2000 22:33
Subject: Unicode Case Mappings UTR #21

> I have found some problems trying to implement case mapping. I am making
> some assumptions and have some questions.
>
> #1 It is unclear other than Turkish which languages use the dotless I. I
> assume they are:
>
> Turkish, Azeri, Tatar, and Bashkir.

We don't have confirmation on the latter two. Once we do, we will add them
to SpecialCasing.
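
A minimal sketch of what the language-sensitive mapping might look like,
assuming only Turkish ("tr") and Azeri ("az") are treated specially. The
function names and the language-code set are hypothetical, and the sequence
<I, COMBINING DOT ABOVE> is deliberately left unhandled:

```python
# Languages assumed to use the dotted/dotless I distinction. Tatar and
# Bashkir are left out here because they are unconfirmed.
DOTLESS_I_LANGS = {"tr", "az"}

def to_lower(text, lang=""):
    """Lowercase text, mapping I -> dotless i and dotted I -> i if needed."""
    if lang in DOTLESS_I_LANGS:
        # Handle the two special letters before the default mapping runs.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()

def to_upper(text, lang=""):
    """Uppercase text, mapping i -> dotted I and dotless i -> I if needed."""
    if lang in DOTLESS_I_LANGS:
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()

print(to_upper("istanbul", "tr"))
print(to_lower("DIYARBAKIR", "tr"))
```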

>
> #2 What are the rules for Title case and spacing? I assume that a
> non-breaking space is a joiner and does not indicate that the following
> alpha character is a title case character. Also that the zero width
> non-breaking space (BOM) is neutral.

You shouldn't do that for a non-breaking space: it only controls line
breaking. I might have a phrase like "Process<NBSP>A" that I don't want
broken across lines -- that *doesn't* mean it should be treated as a single
word by any other process. ZWNBSP should be ignored whenever line breaking
is not at issue, so I assume that is what you mean by "neutral". Other
format characters (general category Cf) should also be ignored. These are
good points to add to the TR in the future.
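
As a rough sketch of those two rules (the function name is hypothetical, and
using whitespace as the only word separator is a simplifying assumption, not
something the TR prescribes): NBSP starts a new word like any other space,
while Cf characters such as ZWNBSP are passed through without affecting the
word state.

```python
import unicodedata

def simple_titlecase(text):
    """Titlecase the first letter of each whitespace-delimited word."""
    out = []
    at_word_start = True
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Format characters (ZWNBSP, ZWJ, ...) are transparent.
            out.append(ch)
        elif ch.isalpha():
            # str.title() on one character applies its titlecase mapping.
            out.append(ch.title() if at_word_start else ch.lower())
            at_word_start = False
        else:
            out.append(ch)
            # Any whitespace, including NBSP (U+00A0), begins a new word.
            at_word_start = ch.isspace()
    return "".join(out)

print(simple_titlecase("process\u00a0a"))
```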

>
> #3 French also has other articles such as d' are there prescribed rules for
> capitalization? Are there other languages to consider?

I suspect that pretty much every language will have its own set of exception
words (which may include conjugations), special cases, and special rules for
punctuation. A low-level routine would just handle the basics, and leave it
up to a higher-level process to decide exactly how to use the low-level
routine for specific languages.
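
One way that layering could look, as a sketch only: the low-level routine
knows nothing about language, and a higher-level wrapper supplies the
exception lists. All names here (titlecase_word, headline, small_words,
keep) are hypothetical, and the word lists in the usage lines are
illustrative rather than prescribed rules for any language.

```python
def titlecase_word(word):
    """Low-level routine: uppercase the first character, lowercase the rest."""
    return word[:1].upper() + word[1:].lower()

def headline(text, small_words=frozenset(), keep=frozenset()):
    """Higher-level process: apply per-language exceptions around the basics."""
    result = []
    for i, word in enumerate(text.split(" ")):
        if word in keep:
            # Words such as names with fixed casing are left untouched.
            result.append(word)
        elif i > 0 and word.lower() in small_words:
            # Articles, prepositions and the like stay lowercase mid-title.
            result.append(word.lower())
        else:
            result.append(titlecase_word(word))
    return " ".join(result)

print(headline("taming of the shrew", small_words={"of", "the"}))
print(headline("my name is d'Avis", small_words={"is"}, keep={"d'Avis"}))
```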

>
> #4 There is no mention of stop words.
>
> Carl


