ISO/IEC JTC1/SC2/WG2 N1838
Date: 1998-09-15
Title: Proposal to add four binary completion letters to the BMP
Source: Mark Davis
Status: Expert Contribution
Action: For consideration by JTC1/SC2/WG2
This document contains the proposal summary (ISO/IEC JTC1/SC2/WG2 form N1352) and a full proposal for the encoding of four new characters in the BMP of ISO/IEC 10646.
1. | Title | Proposal to add four binary completion letters to the BMP |
2. | Requester's name | Mark Davis |
3. | Requester type | Expert contribution |
4. | Submission date | 1998-09-15 |
5. | Requester's reference | |
6a. | Completion | This is a complete proposal. |
6b. | More information to be provided? | No |
1a. | New script? Name? | No |
1b. | Addition of characters to existing block? Name? | Yes, to Latin. Suggested locations are U+1E9C through U+1E9F. However, the characters could be added at any reasonable place in the BMP. |
2. | Number of characters | 4 |
3. | Proposed category | Category A |
4. | Proposed level of implementation and rationale | Level 1 |
5a. | Character names included in proposal? | Yes |
5b. | Character names in accordance with guidelines? | Yes |
5c. | Character shapes reviewable? | Yes |
6a. | Who will provide computerized font? | Mark Davis (if necessary--it is a trivial modification of any font containing U+01E0, U+01E1, U+1E1C, U+1E1D) |
6b. | Font currently available? | No, but it can be generated quickly |
6c. | Font format? | TrueType |
7a. | Are references (to other character sets, dictionaries, descriptive texts, etc.) provided? | N/A--See below |
7b. | Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? | N/A--See below |
8. | Does the proposal address other aspects of character data processing? | Yes |
1. | Has this proposal been submitted before? | No |
2. | Contact with the user community? | N/A--See below |
3. | Information on the user community? | N/A--See below |
4a. | The context of use for the proposed characters? | N/A--See below |
4b. | Reference | N/A--See below |
5a. | Proposed characters in current use? | N/A--See below |
5b. | Where? | N/A--See below |
6a. | Characters should be encoded entirely in BMP? | Yes |
6b. | Rationale | Required for efficient normalization of Unicode/10646, as described below. |
7. | Should characters be kept in a continuous range? | It would be useful, but not absolutely necessary |
8a. | Can the characters be considered a presentation form of an existing character or character sequence? | To the same degree as: U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON |
8b. | Where? | N/A--See below |
8c. | Reference | N/A--See below |
9a. | Can any of the characters be considered to be similar (in appearance or function) to an existing character? | No |
9b. | Where? | |
9c. | Reference | |
10a. | Combining characters or use of composite sequences included? | No |
10b. | List of composite sequences and their corresponding glyph images provided? | No |
11. | Characters with any special properties such as control function, etc. included? | No |
To be completed by SC2/WG2
1. | Relevant SC 2/WG 2 document numbers: | |
2. | Status (list of meeting number and corresponding action or disposition) | |
3. | Additional contact to user communities, liaison organizations etc. | |
4. | Assigned category and assigned priority/time frame | |
5. | Other Comments |
The proposal is to add the following letters to the BMP:
X001 LATIN CAPITAL LETTER A WITH DOT ABOVE
X002 LATIN SMALL LETTER A WITH DOT ABOVE
X003 LATIN CAPITAL LETTER E WITH CEDILLA
X004 LATIN SMALL LETTER E WITH CEDILLA
While these characters may indeed occur in natural languages or academic use, the principal reason for this proposal has to do with the nature of normalization. There has been a great deal of interest in providing complete specifications for different normalized forms of Unicode/10646. (Cf. http://www.unicode.org/unicode/reports/techreports.html)
One normalization form of particular interest normalizes to precomposed forms: for example, it uses the single coded character U+00C0 LATIN CAPITAL LETTER A WITH GRAVE instead of the combining character sequence <U+0041 LATIN CAPITAL LETTER A, U+0300 COMBINING GRAVE ACCENT>. Such a form is of particular interest for systems supporting implementation Level 1.
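As a small illustration of what normalizing to precomposed forms means in practice, the following sketch uses Python's standard unicodedata module. Treating its "NFC" form as a stand-in for the normalization form discussed here is an assumption made for illustration only; this proposal does not itself define that form.

```python
import unicodedata

# Combining character sequence:
# <U+0041 LATIN CAPITAL LETTER A, U+0300 COMBINING GRAVE ACCENT>
decomposed = "\u0041\u0300"

# Normalize to the precomposed form. "NFC" is assumed here as a stand-in
# for the precomposed normalization form described in the text.
composed = unicodedata.normalize("NFC", decomposed)

# Show the resulting code points in U+XXXX notation.
print([f"U+{ord(c):04X}" for c in composed])
```

The two-character sequence collapses to the single coded character U+00C0.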
Implementations of such a normalization form can be particularly efficient if Unicode and 10646 are coded such that they always have binary canonical decompositions. (For more information on canonical decomposition, see The Unicode Standard, Version 2.0, Chapters 3 and 4.)
A composed character X has a binary canonical decomposition when X is canonically equivalent to the composed character sequence:
<B, C1, C2,...,Cn-1,Cn>
and there is another composed character Y which is canonically equivalent to the sequence without the final combining mark:
<B, C1, C2,...,Cn-1>.
In such a case, Y is called a canonical binary completion character for X. If X does not have a binary completion character, X is called incomplete.
Notice that only characters with two or more combining marks need to be checked for completeness.
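The definition above can be sketched as a simple table check. The following is a minimal illustration, not a real character database: the table is a hand-picked toy subset modelling the repertoire before the proposed additions, code points are written as plain hexadecimal strings, and the names FULL_DECOMPOSITION, completion_of and incomplete_characters are hypothetical.

```python
# Toy table of full canonical decompositions <B, C1, ..., Cn>.
# Illustrative data only: a handful of characters, modelling the
# repertoire before the additions proposed in this document.
FULL_DECOMPOSITION = {
    "00C0": ("0041", "0300"),          # A WITH GRAVE
    "00C4": ("0041", "0308"),          # A WITH DIAERESIS
    "01DE": ("0041", "0308", "0304"),  # A WITH DIAERESIS AND MACRON
    "01E0": ("0041", "0307", "0304"),  # A WITH DOT ABOVE AND MACRON
    "0114": ("0045", "0306"),          # E WITH BREVE
    "1E1C": ("0045", "0327", "0306"),  # E WITH CEDILLA AND BREVE
}


def completion_of(x, table):
    """Return a character Y equivalent to <B, C1, ..., Cn-1>, or None."""
    prefix = table[x][:-1]  # drop the final combining mark
    for y, sequence in table.items():
        if sequence == prefix:
            return y
    return None


def incomplete_characters(table):
    """List characters with two or more marks that lack a completion."""
    return sorted(
        x for x, sequence in table.items()
        if len(sequence) >= 3 and completion_of(x, table) is None
    )


print(incomplete_characters(FULL_DECOMPOSITION))
```

In this toy table, 01DE is complete because 00C4 supplies its prefix, while 01E0 and 1E1C are reported as incomplete. Note that 0114 does not complete 1E1C: it matches <0045, 0306>, not the canonical prefix <0045, 0327>.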
There are only four incomplete characters in 10646/Unicode:
U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
U+01E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
U+1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
U+1E1D LATIN SMALL LETTER E WITH CEDILLA AND BREVE
(Characters 1E1C and 1E1D can be produced by a binary decomposition, but not a canonical binary decomposition.)
The four characters proposed for addition to 10646/Unicode in this document are the canonical binary completion characters for these four incomplete characters.
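To suggest why completeness helps implementations, the sketch below composes a sequence using only pair lookups. It is a simplified illustration under stated assumptions: PAIRS and compose are hypothetical names, "X001" and "X003" are the placeholder code points used in this proposal, and the rules for canonical ordering and blocking of combining marks are deliberately ignored.

```python
# Hypothetical pair table: (character, combining mark) -> composed character.
# "X001" and "X003" are the placeholder code points from this proposal.
PAIRS = {
    ("0041", "0307"): "X001",  # A + COMBINING DOT ABOVE
    ("X001", "0304"): "01E0",  # A WITH DOT ABOVE + COMBINING MACRON
    ("0045", "0327"): "X003",  # E + COMBINING CEDILLA
    ("X003", "0306"): "1E1C",  # E WITH CEDILLA + COMBINING BREVE
}


def compose(sequence, pairs):
    """Compose <base, marks...> one mark at a time via pair lookups.

    Simplified: assumes the input is already in canonical order and
    ignores blocking between marks of the same combining class.
    """
    starter = sequence[0]
    leftover = []
    for mark in sequence[1:]:
        combined = pairs.get((starter, mark))
        if combined is not None:
            starter = combined     # absorbed into the precomposed form
        else:
            leftover.append(mark)  # no precomposed form; keep the mark
    return [starter] + leftover


print(compose(["0041", "0307", "0304"], PAIRS))
print(compose(["0045", "0327", "0306"], PAIRS))
```

With the completion characters present, every step is a single two-key lookup. Without them, reaching 01E0 or 1E1C would require a special case keyed on three characters.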
The value of all composed characters is fundamentally a product of their usefulness in implementations, since they could be expressed with composed character sequences. This is a special case where the addition of these characters is of particular value to a wide variety of implementations, ranging from XML parsers to programming language parsers.
It is particularly important that these characters be added before Unicode 3.0 is final, since that is likely to be the version used in normalization forms.
LATIN CAPITAL LETTER A WITH DOT ABOVE | |
LATIN SMALL LETTER A WITH DOT ABOVE | |
LATIN CAPITAL LETTER E WITH CEDILLA | |
LATIN SMALL LETTER E WITH CEDILLA |
X001;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;X002;
X002;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;X001;;X001
X003;LATIN CAPITAL LETTER E WITH CEDILLA;Lu;0;L;0045 0327;;;;N;;;;X004;
X004;LATIN SMALL LETTER E WITH CEDILLA;Ll;0;L;0065 0327;;;;N;;;X003;;X003
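The four records above follow the semicolon-separated layout of the Unicode character database file. As a reading aid, the sketch below splits such a record into named fields. The Python key names and the parse_record function are hypothetical labels chosen for this illustration; the field order is assumed to be the 15-field UnicodeData layout.

```python
# Assumed 15-field layout of a UnicodeData-style record.
# The key names are hypothetical labels chosen for this sketch.
FIELD_NAMES = [
    "code", "name", "general_category", "combining_class",
    "bidi_category", "decomposition", "decimal_digit", "digit",
    "numeric", "mirrored", "unicode_1_name", "comment",
    "uppercase", "lowercase", "titlecase",
]


def parse_record(line):
    """Split one semicolon-separated record into a dict of named fields."""
    fields = line.strip().split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


capital = parse_record(
    "X001;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;X002;"
)
small = parse_record(
    "X002;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;X001;;X001"
)

print(capital["decomposition"], capital["lowercase"])
print(small["decomposition"], small["uppercase"], small["titlecase"])
```

Read this way, each proposed character has a two-element canonical decomposition, and the case mappings pair X001 with X002 (and likewise X003 with X004).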