Explicit Properties for Special Casing & Titlecase

L2/01-445
(updating L2/01-390)

Re:	Explicit Properties for Special Casing & Titlecase
To:	UTC
From:	Mark Davis, Markus Scherer
Date:	2001-11-05

A. Explicit Properties

In SpecialCasing.txt a specification is used to identify the characters that should be ignored when determining whether a letter is final or not. This is described in the comments of the file as:

# - An ignorable sequence is a sequence of *zero* or more characters from
# the set {HYPHEN, SOFT HYPHEN, general category = Mn}.

This same specification can be used for default titlecasing of strings, to ignore characters that should not have an effect in determining which letters are initial. However, this should be an explicit property in the UCD, rather than just in the documentation of the data file.

Similarly, the TYPE_i value listed there should have an explicit list.

The following new properties are proposed for addition to the next version of Unicode.

Note:

The name TYPE_i has been changed to Special_Dotted to conform to the naming conventions.
We have found that some adjustments to the composition of Case_Ignorable were needed. Hyphen needs to be removed, since it is significant in titlecasing names (a principal application of titlecasing); apostrophe, letter-modifiers, and format characters need to be ignored. As with other properties, the exceptional cases go into PropList, while the overall property goes into DerivedCoreProperties.

PropList.txt

0027          ; Other_Case_Ignorable # Po       APOSTROPHE
00AD          ; Other_Case_Ignorable # Pd       SOFT HYPHEN
2019          ; Other_Case_Ignorable # Pf       RIGHT SINGLE QUOTATION MARK

# Total code points: 3

DerivedCoreProperties.txt

# Derived Property: Special_Dotted
#  Generated from: characters whose canonical decompositions
#  end with a combining character sequence that
#  - starts with i or j
#  - has no combining marks above
#  - has no combining marks with zero canonical combining class

0069..006A    ; Special_Dotted # L&   [2] LATIN SMALL LETTER I..LATIN SMALL LETTER J
012F          ; Special_Dotted # L&       LATIN SMALL LETTER I WITH OGONEK
1E2D          ; Special_Dotted # L&       LATIN SMALL LETTER I WITH TILDE BELOW
1ECB          ; Special_Dotted # L&       LATIN SMALL LETTER I WITH DOT BELOW

# Total code points: 5

# ==================

# Derived Property: Case_Ignorable
#  Generated from: Other_Case_Ignorable + Lm + Mn + Me + Cf

0027          ; Case_Ignorable # Po       APOSTROPHE
00AD          ; Case_Ignorable # Pd       SOFT HYPHEN

...

E0001         ; Case_Ignorable # Cf       LANGUAGE TAG
E0020..E007F  ; Case_Ignorable # Cf  [96] TAG SPACE..CANCEL TAG

# Total code points: 657
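The two derivations above can be sketched in Python as follows. This is only an illustration, assuming the character data bundled with Python's unicodedata module; the function names are hypothetical and are not part of the proposal.

```python
import unicodedata

# Exceptional cases, taken from the proposed PropList.txt entries above.
OTHER_CASE_IGNORABLE = {"\u0027", "\u00AD", "\u2019"}

def is_case_ignorable(ch: str) -> bool:
    """Case_Ignorable = Other_Case_Ignorable + Lm + Mn + Me + Cf."""
    return (ch in OTHER_CASE_IGNORABLE
            or unicodedata.category(ch) in {"Lm", "Mn", "Me", "Cf"})

def is_special_dotted(ch: str) -> bool:
    """True if the canonical decomposition of ch ends with i or j followed
    only by marks that are neither above (ccc 230) nor of class zero."""
    nfd = unicodedata.normalize("NFD", ch)
    k = len(nfd) - 1
    # Skip backwards over trailing marks whose class is neither 0 nor 230.
    while k >= 0 and unicodedata.combining(nfd[k]) not in (0, 230):
        k -= 1
    # The character we stopped on must be the base letter i or j.
    return k >= 0 and nfd[k] in ("i", "j")
```

Counts obtained by running these functions over all code points will generally differ from the totals in the listing, since those depend on the version of the character database in use.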

B. Typos

In addition, there are some typos/problems in SpecialCasing.txt that need to be fixed.

1. Lo

# - A cased letter is any character with general category = Ll or Lo or Lt

Should be Lu, not Lo.
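
With that correction, the test for a cased letter might be sketched as below. This assumes Python's unicodedata module as the source of general categories; the function name is hypothetical.

```python
import unicodedata

def is_cased_letter(ch: str) -> bool:
    """A cased letter has general category Lu, Ll, or Lt (not Lo)."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}
```

For example, U+01C5 (a titlecase digraph, Lt) counts as cased, while U+05D0 HEBREW LETTER ALEF (Lo) does not.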

2. Context

Add a clarification to the header:

# The context is always the context of the characters in the original string,
# NOT in the resulting string.

3. Fix Problem with Turkish

For Turkish, we have the following in the file. It is intended to make sure that I and I+dot work the same:

# Remove spurious dot above small i's when lowercasing, if there are no more accents above:

0307; ; 0307; 0307; tr AFTER_i NOT_MORE_ABOVE # COMBINING DOT ABOVE

This rule is not sufficient, since (a) the context should be after an uppercase I, not a lowercase i, and (b) the preceding I would already have been transformed into a dotless lowercase i, which is incorrect. In addition, it is more general (as per Kent's document) to handle lowercasing of the dot_above as locale-independent.

To fix these issues:

1. Add the conditions:

#   AFTER_I:     The last preceding base character was an uppercase I, and
#                no combining character class 230 (above) has intervened.
#   BEFORE_DOT:  The character is followed by combining dot above (U+0307).
#                Any sequence of characters with a combining class that is
#                neither 0 nor 230 may intervene between the current character
#                and the combining dot above.

(Also, add a regular-expression formulation to each condition for clarity.)
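
As an illustration of the two conditions, here is a minimal Python sketch. It assumes that a "base character" is any character with combining class 0, and it takes combining classes from Python's unicodedata module; the function names are hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"

def after_I(text: str, index: int) -> bool:
    """AFTER_I: the last preceding base character is an uppercase I and no
    mark of combining class 230 lies between it and text[index]."""
    for ch in reversed(text[:index]):
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            return ch == "I"   # reached the last preceding base character
        if ccc == 230:
            return False       # an above mark has intervened
    return False               # no preceding base character at all

def before_dot(text: str, index: int) -> bool:
    """BEFORE_DOT: text[index] is followed by U+0307, with only marks of
    class other than 0 and 230 in between."""
    for ch in text[index + 1:]:
        if ch == DOT_ABOVE:    # checked first: U+0307 itself has class 230
            return True
        if unicodedata.combining(ch) in (0, 230):
            return False
    return False
```

Both predicates look only at the original string, in line with the clarification about context in item 2.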

2. Modify the rules:

# Remove spurious dot above small i's when lowercasing, if there are no more accents above:

0307; ; 0307; 0307; tr AFTER_i NOT_MORE_ABOVE # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NOT_MORE_ABOVE # COMBINING DOT ABOVE

# Fix case pairs

0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I

0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

to be:

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; AFTER_I # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr NOT_BEFORE_DOT; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az NOT_BEFORE_DOT; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
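
The lowercasing side of the revised rules can be sketched in Python as follows. This is an illustration under stated assumptions, not a full implementation of SpecialCasing: only the rules for U+0049 and U+0307 shown above are handled, every other character falls back to Python's own str.lower(), and the function names and the locale parameter are hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"

def _after_I(text, i):
    # Last preceding class-0 character is I, with no class-230 mark between.
    for ch in reversed(text[:i]):
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            return ch == "I"
        if ccc == 230:
            return False
    return False

def _before_dot(text, i):
    # U+0307 follows, separated only by marks of class other than 0 or 230.
    for ch in text[i + 1:]:
        if ch == DOT_ABOVE:
            return True
        if unicodedata.combining(ch) in (0, 230):
            return False
    return False

def lowercase_sketch(text: str, locale: str = "") -> str:
    out = []
    for i, ch in enumerate(text):
        # Contexts are evaluated on the original string, never on `out`.
        if ch == DOT_ABOVE and _after_I(text, i):
            continue                      # locale-independent: drop the dot
        if ch == "I" and locale in ("tr", "az") and not _before_dot(text, i):
            out.append("\u0131")          # dotless i
        else:
            out.append(ch.lower())        # assumed fallback for everything else
    return "".join(out)
```

Under these assumptions, lowercase_sketch("I\u0307", "tr") yields a plain "i", while lowercase_sketch("I", "tr") yields the dotless U+0131.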

The end result will be the following (where * represents a combining dot above, and ^ represents any other above accent). The important formal characteristic is that dotted-I and I + dot always produce the same results.

Table: Casing Behavior for I/i with/without dot (columns: Normal, tr & az)

4. NFC Versions

The mappings in SpecialCasing predate NFC, and some of them are not in NFC. While not formally incorrect, it would be better if these were changed to NFC. The list is the following (the ## lines are the old values):

## 0390; 0390; 0399 0308 0301; 0399 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
0390; 0390; 03AA 0301; 03AA 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

## 03B0; 03B0; 03A5 0308 0301; 03A5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
03B0; 03B0; 03AB 0301; 03AB 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

## 1FB7; 1FB7; 0391 0342 0345; 0391 0342 0399; # GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
1FB7; 1FB7; 1FBC 0342; 0391 0342 0399; # GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI

## 1FC7; 1FC7; 0397 0342 0345; 0397 0342 0399; # GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
1FC7; 1FC7; 1FCC 0342; 0397 0342 0399; # GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI

## 1FD2; 1FD2; 0399 0308 0300; 0399 0308 0300; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
1FD2; 1FD2; 03AA 0300; 03AA 0300; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA

## 1FD7; 1FD7; 0399 0308 0342; 0399 0308 0342; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
1FD7; 1FD7; 03AA 0342; 03AA 0342; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI

## 1FE2; 1FE2; 03A5 0308 0300; 03A5 0308 0300; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
1FE2; 1FE2; 03AB 0300; 03AB 0300; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA

## 1FE7; 1FE7; 03A5 0308 0342; 03A5 0308 0342; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
1FE7; 1FE7; 03AB 0342; 03AB 0342; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI

## 1FF7; 1FF7; 03A9 0342 0345; 03A9 0342 0399; # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
1FF7; 1FF7; 1FFC 0342; 03A9 0342 0399; # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
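
The changed values in the list above can be checked mechanically. The following Python sketch assumes the normalization data shipped with Python's unicodedata module; the helper names are hypothetical. Each pair holds an old value and its proposed replacement, copied from the list.

```python
import unicodedata

# (old value, proposed value) for the columns that change in the list above.
CHANGES = [
    ("0399 0308 0301", "03AA 0301"),  # 0390
    ("03A5 0308 0301", "03AB 0301"),  # 03B0
    ("0391 0342 0345", "1FBC 0342"),  # 1FB7, titlecase column only
    ("0397 0342 0345", "1FCC 0342"),  # 1FC7, titlecase column only
    ("0399 0308 0300", "03AA 0300"),  # 1FD2
    ("0399 0308 0342", "03AA 0342"),  # 1FD7
    ("03A5 0308 0300", "03AB 0300"),  # 1FE2
    ("03A5 0308 0342", "03AB 0342"),  # 1FE7
    ("03A9 0342 0345", "1FFC 0342"),  # 1FF7, titlecase column only
]

def to_str(hex_seq: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_seq.split())

def is_nfc_of(old: str, new: str) -> bool:
    """True if the proposed value is exactly the NFC form of the old value."""
    return unicodedata.normalize("NFC", to_str(old)) == to_str(new)

def all_changes_are_nfc() -> bool:
    return all(is_nfc_of(old, new) for old, new in CHANGES)
```

Because each proposed value is the NFC form of the old one, the old and new mappings are canonically equivalent, so the change does not alter the meaning of the data.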