L2/01-191R

Dotting the i's

Kent Karlsson and Vladas Tumasonis

2001-05-05

This is a proposal to update the SpecialCasing.txt data file in the Unicode Character Database. The current handling of dots above for lowercase i's and j's in SpecialCasing.txt for case mapping is not sufficient, in particular for Lithuanian where an explicit dot above sometimes needs to be introduced. This proposal also attempts a somewhat more systematic treatment of dots above lowercase i's and j's for other languages too.

The dots above lowercase i and lowercase j are 'soft' in the sense that they usually disappear upon uppercasing, as well as when accents are placed above the i or j. There are, however, exceptions to this. For these exceptions, where the dot is not 'soft', a 'hard dot above' (U+0307) is the best way to deal with the matter. For Turkish, the soft dot must be 'hardened' for uppercasing (when there are no accents above; otherwise the soft dot is already gone), whereas for Lithuanian it must be 'hardened' before accenting above, but not for uppercasing.
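The difference between the soft dot and the hard dot can be inspected with Python's standard unicodedata module. The following is only a minimal sketch; the helper name describe is an illustrative assumption and not part of this proposal.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, the 'hard' dot

def describe(s):
    """List (code point, canonical combining class) for each character of s."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in s]

# I WITH DOT ABOVE (U+0130) decomposes into capital I plus the hard dot.
hard_dotted = unicodedata.normalize("NFD", "\u0130")
print(describe(hard_dotted))

# An accented i carries no encoded dot at all: the soft dot belongs to the
# glyph of "i" and is simply not drawn once the acute is placed above.
soft_dotted = unicodedata.normalize("NFD", "\u00ed")
print(describe(soft_dotted))
```

The hard dot has combining class 230 (above), which is the class that the contexts in the formal tables test for.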

The tables in the exposition are not complete. The formal tables in the update to SpecialCasing.txt are, however, intended to be complete.


to upper and to title

Normal

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot, then uppercase. This removes any spurious dot above, a dot that is not recommended to be there in the first place.

 

i+dot (no more accents above)         ->  I
i-ogonek+dot (no more accents above)  ->  I-ogonek [etc.]
j+dot (no more accents above)         ->  J

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, even if there are more accents above on that base letter: remove the extra dot, then uppercase.

 

i+dot  ->  I
j+dot  ->  J
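The Normal and Lithuanian uppercasing rules above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the names to_upper and more_accents_above are hypothetical, the input is decomposed to NFD first so that every accent is a separate character, and the result is left in decomposed form.

```python
import unicodedata

DOT_ABOVE = "\u0307"
ABOVE = 230  # canonical combining class of marks placed above

def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] in the same combining sequence."""
    for c in s[pos + 1:]:
        ccc = unicodedata.combining(c)
        if ccc == 0:
            return False  # next base character reached
        if ccc == ABOVE:
            return True
    return False

def to_upper(text, lithuanian=False):
    """Hypothetical helper: drop the extra dot after i or j, then uppercase."""
    s = unicodedata.normalize("NFD", text)  # assumption: work on NFD input
    out = []
    after_soft = False  # last base was i or j and no mark above has intervened
    for pos, c in enumerate(s):
        ccc = unicodedata.combining(c)
        if ccc == 0:
            after_soft = c in "ij"
        elif c == DOT_ABOVE and after_soft and (
                lithuanian or not more_accents_above(s, pos)):
            after_soft = False
            continue  # remove the extra dot
        elif ccc == ABOVE:
            after_soft = False  # any further dot is blocked
        out.append(c)
    return "".join(out).upper()
```

With lithuanian=True the dot is removed even when more accents follow, so i+dot+acute becomes I+acute, whereas the Normal rule keeps the dot in that case.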

Turkish

An i with an unblocked extra dot above, if there are no more accents above on that base letter: keep the extra dot, but don't add another one (for the cases below), then uppercase. This, again, takes care of the spurious case where an explicit dot above is already present.

 

i (no more accents above)      ->  I-dot
i+dot (no more accents above)  ->  I-dot
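The Turkish uppercasing rule above can be sketched in the same way. The function name upper_turkish is an assumption made for illustration. The sketch decomposes to NFD, maps a bare i to I WITH DOT ABOVE (U+0130), leaves an already present hard dot alone, and recomposes to NFC so that both rows of the table yield the same precomposed result.

```python
import unicodedata

ABOVE = 230  # canonical combining class of marks placed above

def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] in the same combining sequence."""
    for c in s[pos + 1:]:
        ccc = unicodedata.combining(c)
        if ccc == 0:
            return False
        if ccc == ABOVE:
            return True
    return False

def upper_turkish(text):
    """Hypothetical helper for Turkish/Azeri uppercasing of i."""
    s = unicodedata.normalize("NFD", text)  # assumption: work on NFD input
    out = []
    for pos, c in enumerate(s):
        if c == "i" and not more_accents_above(s, pos):
            out.append("\u0130")  # harden the soft dot
        else:
            # An i followed by a hard dot or another accent above takes the
            # ordinary mapping; an existing U+0307 is kept, not duplicated.
            out.append(c.upper())
    return unicodedata.normalize("NFC", "".join(out))
```

Because an explicit dot counts as an accent above the i, i+dot goes through the ordinary mapping to I followed by U+0307, which NFC composes to U+0130.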

 


to lower

Normal

Any lowercase or uppercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot.

 

i+dot (no more accents above)         ->  i
i-ogonek+dot (no more accents above)  ->  i-ogonek
...                                   ->  ...
j+dot (no more accents above)         ->  j
I-dot (if more accents above)         ->  i -dot
I-dot (if no more accents above)      ->  i (already in UnicodeData.txt)
I -dot (if more accents above)        ->  i -dot (for NFD-NFC consistency; already in UnicodeData.txt)
I -dot (if no more accents above)     ->  i (for NFD-NFC consistency)
J -dot (if no more accents above)     ->  j (some degree of systematic...)

 

Lithuanian

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Uppercase I's and J's that have extra accents above must get an extra dot above inserted.

 

I (if more accents above)         ->  i -dot
J (if more accents above)         ->  j -dot
I-ogonek (if more accents above)  ->  i-ogonek -dot
I-grave                           ->  i -dot -grave
I-acute                           ->  i -dot -acute
I-tilde                           ->  i -dot -tilde

For NFD-NFC consistency, a number of 'I-letters' that are not used in Lithuanian must be handled too.
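The Lithuanian dot insertion on lowercasing can be sketched as below. The name lower_lithuanian is hypothetical, and the sketch covers only the insertion rule for capital I and J; it does not attempt the removal of spurious dots. Working on the NFD form is an assumption that lets the precomposed letters (I-grave, I-ogonek, and so on) be handled by a single test on the base letter.

```python
import unicodedata

DOT_ABOVE = "\u0307"
ABOVE = 230  # canonical combining class of marks placed above

def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] in the same combining sequence."""
    for c in s[pos + 1:]:
        ccc = unicodedata.combining(c)
        if ccc == 0:
            return False
        if ccc == ABOVE:
            return True
    return False

def lower_lithuanian(text):
    """Hypothetical helper: lowercase, hardening the dot under accents above."""
    s = unicodedata.normalize("NFD", text)  # assumption: work on NFD input
    out = []
    for pos, c in enumerate(s):
        out.append(c.lower())
        if c in "IJ" and more_accents_above(s, pos):
            out.append(DOT_ABOVE)  # insert the explicit dot before the accents
    return unicodedata.normalize("NFC", "".join(out))
```

For I-grave (U+00CC) this yields i, U+0307, U+0300, the same sequence that the formal table lists for that letter.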

 

Turkish

Any lowercase variant of i or j with an unblocked extra dot above, if there are no more accents above on that base letter: remove the extra dot. Turkish and Azeri (at least) use a dotless i as the lowercase of I. It should not be used if there are more accents above (then use an ordinary i, which then loses the dot...).

 

I (no more accents above)  ->  i-dotless
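A corresponding sketch for Turkish lowercasing follows. The name lower_turkish is an assumption for illustration; the sketch works on the NFD form, maps a bare I to dotless i (U+0131), removes an unblocked dot above after I or i when no more accents follow, and otherwise falls back to the ordinary mapping.

```python
import unicodedata

DOT_ABOVE = "\u0307"
ABOVE = 230  # canonical combining class of marks placed above

def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] in the same combining sequence."""
    for c in s[pos + 1:]:
        ccc = unicodedata.combining(c)
        if ccc == 0:
            return False
        if ccc == ABOVE:
            return True
    return False

def lower_turkish(text):
    """Hypothetical helper for Turkish/Azeri lowercasing of I."""
    s = unicodedata.normalize("NFD", text)  # assumption: work on NFD input
    out = []
    after_i = False  # last base was I or i and no mark above has intervened
    for pos, c in enumerate(s):
        ccc = unicodedata.combining(c)
        if c == "I" and not more_accents_above(s, pos):
            out.append("\u0131")  # I lowercases to dotless i
        elif c == DOT_ABOVE and after_i and not more_accents_above(s, pos):
            pass  # the hard dot becomes the soft dot of an ordinary i
        else:
            out.append(c.lower())
        if ccc == 0:
            after_i = c in "Ii"
        elif ccc == ABOVE:
            after_i = False
    return unicodedata.normalize("NFC", "".join(out))
```

I WITH DOT ABOVE (U+0130) decomposes to I followed by U+0307, so the I takes the ordinary mapping to i and the dot is then dropped, giving a plain i.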

 


Suggested changes to SpecialCasing.txt regarding dotting i's and j's

The exposition tables above were not intended to be complete. The formal tables below are intended to be complete enough to cover the orthographic requirements and to handle NFD and NFC consistently. Cases like barred i or j-crosstail are not covered. Review and comments are welcome. The intent is for these modifications to be included in Unicode 3.2, or if possible, in an update to Unicode 3.1.

Old lines (to remove)

1st-------------------
# characters where they are 1-1, and does not have locale-specific mappings.)
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more locales or contexts,
# separated by spaces.
3rd-------------------
# A locale is defined as:
# <locale> := <ISO_639_code> ( "_" <ISO_3166_code> ( "_" <variant> )? )?
# <ISO_3166_code> := 2-letter ISO country code,
# <ISO_639_code> := 2-letter ISO language code
4th-------------------
# A context is one of the following choices:
5th-------------------
# AFTER_i: The last base character was "i" 0069
6th-------------------
7th-------------------
# ================================================================================
# Locale-sensitive mappings
# ================================================================================
# Lithuanian
0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
# Turkish, Azeri
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
end-------------------
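For readers who want to experiment with the entries, a data line of the shape shown above (code; lower; title; upper; optional condition list; comment) can be read with a few lines of Python. This is only a sketch: the names CasingRule and parse_rule are hypothetical, and the parser tolerates a missing semicolon after the condition list.

```python
from typing import List, NamedTuple, Optional

class CasingRule(NamedTuple):
    code: str
    lower: str
    title: str
    upper: str
    conditions: List[str]
    comment: str

def _chars(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in field.split())

def parse_rule(line) -> Optional[CasingRule]:
    """Hypothetical parser for one line; returns None for comments and blanks."""
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 4:
        return None
    # The fifth field, when present, holds locales and contexts.
    conditions = fields[4].split() if len(fields) > 4 else []
    return CasingRule(_chars(fields[0]), _chars(fields[1]), _chars(fields[2]),
                      _chars(fields[3]), conditions, comment.strip())

rule = parse_rule("0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after \"i\"")
```

An empty mapping field, as in the title and upper fields of the Lithuanian line, parses to an empty string, meaning the character is removed by that mapping.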

New lines (to insert, replacing the old ones listed above)

1st-------------------
# characters where they are 1-1, and does not have language-specific mappings.)
#
# Note that when case mapping a string in a normal form,
# the result need not be in any normal form.
#
2nd-------------------
# The <condition_list> is optional. Where present, it consists of one or more
# contexts, one of which may be a language code, separated by spaces.
3rd-------------------
# A _subset_ of RFC 3066 conforming language codes, _sufficient for this file_,
# can be described as:
# <langcode> := two-letter ISO 639-1 language code
4th-------------------
# A context is a <langcode> or one of the following choices (test on original string):
5th-------------------
# AFTER_i: The last preceding base character was "i" (0069), "j" (006A),
# or has a canonical decomposition that begins with an "i" or "j" but has no
# combining characters above (i.e., i-ogonek (012F), i-tilde-below (1E2D),
# or i-dot-below (1ECB)); AND no combining character class 230 (above) has
# intervened. (Neither i-stroke (0268) nor j-crosstailed (029D) need be
# specially handled below, while they also have a soft dot above that
# is lost on normal uppercase or accenting above.)
#
# AFTER_CAP_I: The last preceding base character was "I" (0049), "J" (004A),
# or has a canonical decomposition that begins with an "I" or "J" but has no
# combining characters above (i.e., I-ogonek (012E), I-tilde-below (1E2C),
# or I-dot-below (1ECA)); AND no combining character class 230 (above) has
# intervened. (I-stroke (0197) need not be specially handled below, while
# it also has a soft dot above in lowercase form.)
#
# MORE_ACCENTS_ABOVE: The current combining sequence has at least one class 230
# (above) combining character after the currently considered character.
6th-------------------[no old text]
#-----
# Normal dotting/undotting of i's and j's (capital and small):
#-----
# Remove spurious explicit dot above small i or j when case mapping,
# if no more accents above:
0307; ; ; ; AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# Remove explicit dot above capital i or j when lowercasing,
# if no more accents above (mainly for NFC-NFD consistency for i--I-dot):
0307; ; 0307; 0307; AFTER_CAP_I NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# For NFC-NFD consistency for I-dot--i:
0130; 0069 0307; 0130; 0130; MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT
# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; [NON_MORE_ACCENTS_ABOVE] # LATIN CAPITAL LETTER I WITH DOT ABOVE
7th-------------------
# ================================================================================
# Language-sensitive mappings
# ================================================================================
#
# Lithuanian:
#
# Remove dot above small i's or j's when uppercasing,
# even if there are more accents above:
0307; 0307; ; ; lt AFTER_i # COMBINING DOT ABOVE
# Introduce an explicit dot above when lowercasing capital I's and J's
# if there are more accents above (grave, acute, tilde above, and ogonek
# occur in Lithuanian; the rest are just for consistency between NFC and NFD):
0049; 0069 0307; 0049; 0049; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt # LATIN CAPITAL LETTER I WITH TILDE
1E2C; 1E2D 0307; 1E2C; 1E2C; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH TILDE BELOW
1ECA; 1ECB 0307; 1ECA; 1ECA; lt MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I WITH DOT BELOW
00CE; 0069 0307 0302; 00CE; 00CE; lt # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0134; 006A 0307 0302; 0134; 0134; lt # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
012A; 0069 0307 0304; 012A; 012A; lt # LATIN CAPITAL LETTER I WITH MACRON
012C; 0069 0307 0306; 012C; 012C; lt # LATIN CAPITAL LETTER I WITH BREVE
01CF; 0069 0307 030C; 01CF; 01CF; lt # LATIN CAPITAL LETTER I WITH CARON
0208; 0069 0307 030F; 0208; 0208; lt # LATIN CAPITAL LETTER I WITH DOUBLE GRAVE
020A; 0069 0307 0311; 020A; 020A; lt # LATIN CAPITAL LETTER I WITH INVERTED BREVE
1E2E; 0069 0307 0308 0301; 1E2E; 1E2E; lt # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
1EC8; 0069 0307 0309; 1EC8; 1EC8; lt # LATIN CAPITAL LETTER I WITH HOOK ABOVE
#
# Turkish, Azeri:
#
# Remove spurious dot above small i's when lowercasing, if no more accents above:
0307; ; 0307; 0307; tr AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NON_MORE_ACCENTS_ABOVE # COMBINING DOT ABOVE
# I--i-dotless and I-dot--i-with-soft-dot are case pairs in Turkish and Azeri,
# when there are no more accents above (otherwise use the ordinary casing rules):
0069; 0069; 0130; 0130; tr NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az NON_MORE_ACCENTS_ABOVE # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; tr NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az NON_MORE_ACCENTS_ABOVE # LATIN CAPITAL LETTER I
end-------------
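The three contexts defined in the new lines (AFTER_i, AFTER_CAP_I and MORE_ACCENTS_ABOVE) are tests on the original string. One possible reading of them is sketched below; the function names are hypothetical, and using the NFD form of the preceding base character to test for "no combining characters above" is an assumption of this sketch.

```python
import unicodedata

ABOVE = 230  # canonical combining class of marks placed above

def _soft_base(ch, letters):
    """True if ch decomposes to one of `letters` with no mark above."""
    d = unicodedata.normalize("NFD", ch)
    return d[0] in letters and all(
        unicodedata.combining(c) != ABOVE for c in d[1:])

def _after(s, pos, letters):
    """Scan back from s[pos] to the last base; fail if a mark above intervenes."""
    for c in reversed(s[:pos]):
        ccc = unicodedata.combining(c)
        if ccc == ABOVE:
            return False
        if ccc == 0:
            return _soft_base(c, letters)
    return False

def after_i(s, pos):
    return _after(s, pos, "ij")

def after_cap_i(s, pos):
    return _after(s, pos, "IJ")

def more_accents_above(s, pos):
    """True if a class-230 mark follows s[pos] in the same combining sequence."""
    for c in s[pos + 1:]:
        ccc = unicodedata.combining(c)
        if ccc == 0:
            return False
        if ccc == ABOVE:
            return True
    return False
```

Under this reading i-ogonek (012F) satisfies AFTER_i for a following dot, while i-acute (00ED) does not, because its decomposition already carries a mark above. i-stroke (0268) has no canonical decomposition beginning with i, so it never matches, in line with the remark that it need not be specially handled.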