ISO/IEC JTC1/SC2/WG2 N2236
Date: 2000-08-09
Title: Proposal for addition of COMBINING GRAPHEME JOINER
Source: Unicode Technical Committee
Status: Liaison Communication
Action: For consideration by JTC1/SC2/WG2
Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters. They include, but are not limited to, combining character sequences such as (g + °), digraphs such as Slovak “ch”, or sequences with letter modifiers such as kʷ. Grapheme boundaries are important for collation, regular expressions, and counting “character” positions within text. The Unicode Standard specifies where the default grapheme boundaries fall in a string of characters. These default boundaries can be overridden for specific locales, as is done when collation tailoring tables provide contracting characters. For more information, see [Boundaries].
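The difference between counting code points and counting graphemes can be illustrated with a minimal Python sketch. It is a deliberate simplification, not the boundary algorithm of [Boundaries]: a grapheme is taken to be a base character plus any following combining marks, and the hypothetical `contractions` parameter stands in for a locale-specific override such as Slovak “ch”. The example assumes that the “°” in (g + °) stands for U+030A COMBINING RING ABOVE.

```python
import unicodedata


def default_graphemes(text, contractions=()):
    """Split text into approximate graphemes (simplified sketch).

    Default rule: a base character plus any following combining marks.
    `contractions` is a hypothetical locale override listing sequences
    that must be kept together, e.g. ("ch",) for Slovak.
    """
    # Try longer contractions first so the longest match wins.
    ordered = sorted(contractions, key=len, reverse=True)
    clusters = []
    i = 0
    while i < len(text):
        # Locale tailoring: a listed contraction starts the cluster.
        match = next((c for c in ordered if c and text.startswith(c, i)), None)
        end = i + len(match) if match else i + 1
        # Default rule: absorb the combining marks that follow.
        while end < len(text) and unicodedata.category(text[end]) in ("Mn", "Mc", "Me"):
            end += 1
        clusters.append(text[i:end])
        i = end
    return clusters


print(len("g\u030Aa"), len(default_graphemes("g\u030Aa")))   # 3 code points, 2 graphemes
print(default_graphemes("chata"))                            # ['c', 'h', 'a', 't', 'a']
print(default_graphemes("chata", contractions=("ch",)))      # ['ch', 'a', 't', 'a']
```

The override here applies to every occurrence of the listed sequence in the text, which is the locale-wide behavior described above.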
There are circumstances where even the locale-specific determination of grapheme boundaries may need to be overridden on a local basis. These include:
To address this issue, the UTC has approved the addition of a new character at U+0363, the COMBINING GRAPHEME JOINER. The properties of this character are chosen so that existing processes such as grapheme determination, line breaking, and collation handle it well. In terms of grapheme determination it functions like the virama. As with a virama, the grapheme joiner is only useful if immediately followed by a base character, so it should always be placed at the end of a combining character sequence. Thus a sequence of <base, grapheme joiner, base> will function as a single grapheme.
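The effect on grapheme determination can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a normative algorithm: graphemes are approximated as a base plus combining marks, the constant `GRAPHEME_JOINER` uses the code point proposed in this document, and the function names are hypothetical. The joiner is tested for explicitly rather than through character database properties.

```python
import unicodedata

# Code point proposed in this document; used here only as a constant.
GRAPHEME_JOINER = "\u0363"


def is_mark(ch):
    """True for ordinary combining marks (the joiner is handled separately)."""
    return ch != GRAPHEME_JOINER and unicodedata.category(ch) in ("Mn", "Mc", "Me")


def graphemes_with_joiner(text):
    """Approximate graphemes, letting the joiner bind the next character."""
    clusters = []
    i = 0
    while i < len(text):
        end = i + 1
        while end < len(text):
            ch = text[end]
            if is_mark(ch):
                end += 1          # ordinary mark: stays with its base
            elif ch == GRAPHEME_JOINER:
                # Joiner: keep it and the following character (if any)
                # in the current grapheme, like a virama.
                end += 2 if end + 1 < len(text) else 1
            else:
                break             # next base character starts a new grapheme
        clusters.append(text[i:end])
        i = end
    return clusters


J = GRAPHEME_JOINER
# <t, joiner, h, combining low line, e>: "th" plus underbar is one grapheme.
print(len(graphemes_with_joiner("t" + J + "h\u0332e")))   # 2
# Without the joiner the underbar attaches to "h" alone.
print(len(graphemes_with_joiner("th\u0332e")))            # 3
```

Unlike a locale-wide contraction, the joiner affects only the specific occurrence in which it is placed.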
In terms of line-break, the character is in the category GLUE (the same as a zero-width no-break space — see [LineBreak]). In collation, the grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element in the default collation table, like NULL (see [Collation]). However, it can be entered into the tailoring rules for any given language, using the UCA/14651 tailoring capabilities.
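The collation behavior described above can be sketched with a toy sort key in Python. This is not the UCA or ISO/IEC 14651: it uses a single level of weights derived from code point values, and the `tailoring` dictionary, its weights, and the function name are assumptions made for illustration. The joiner contributes nothing to the key unless a tailored mapping explicitly contains it.

```python
# Code point proposed in this document; used here only as a constant.
GRAPHEME_JOINER = "\u0363"


def sort_key(text, tailoring=None):
    """Toy single-level collation key (illustrative, not the UCA).

    `tailoring` is a hypothetical mapping from a character sequence to a
    numeric weight; sequences may include the grapheme joiner.
    """
    tailoring = tailoring or {}
    # Try longer tailored sequences first so the longest match wins.
    ordered = sorted(tailoring, key=len, reverse=True)
    key = []
    i = 0
    while i < len(text):
        match = next((s for s in ordered if s and text.startswith(s, i)), None)
        if match is not None:
            key.append(float(tailoring[match]))   # tailored collation element
            i += len(match)
        elif text[i] == GRAPHEME_JOINER:
            i += 1                                # completely ignorable by default
        else:
            key.append(float(ord(text[i])))       # toy default weight
            i += 1
    return tuple(key)


J = GRAPHEME_JOINER
words = ["idea", "c" + J + "hata", "hora", "cesta"]

# Default table: the joiner is ignored, so "c<joiner>h" sorts like "ch".
print(sorted(words, key=sort_key) == ["cesta", "c" + J + "hata", "hora", "idea"])

# Tailored: "c<joiner>h" is one element weighted between "h" and "i".
rules = {"c" + J + "h": ord("h") + 0.5}
print(sorted(words, key=lambda w: sort_key(w, rules))
      == ["cesta", "hora", "c" + J + "hata", "idea"])
```

Both comparisons print `True`. A real implementation would express the tailored element through the UCA/14651 tailoring syntax rather than through numeric weights.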
In terms of display, the grapheme joiner is an invisible combining character with a canonical combining class of zero. It binds adjacent characters into a single grapheme that serves as the base for combining marks, such as an underbar in "th". For any specified repertoire, implementation support for this capability can be provided by means of ligature tables in the font, or by means of special placement rules (see [OpenType]). Some display engines may be able to supply runtime generative support. As with other combining marks, there is considerable latitude for display depending on the environment (such as the choice of font). Some possibilities are:
The UTC urges WG2 to also approve this character for addition to ISO 10646. The character should be encoded in the BMP, since it is similar to other characters there.
For instructions and guidance for filling in the form please see the document "Principles and Procedures for Allocation of New Characters and Scripts" (http://www.dkuug.dk/JTC1/SC2/WG2/prot)
1. Title: COMBINING GRAPHEME JOINER
2. Requester's name: Unicode Technical Committee
3. Requester type (Member body/Liaison/Individual contribution): Liaison
4. Submission date: 2000-08-10
5. Requester's reference (if applicable):
6. (Choose one of the following:)
This is a complete proposal: Yes; or,
More information will be provided later:
1. (Choose one of the following:)
a. This proposal is for a new script (set of characters): No
Proposed name of script:
b. The proposal is for addition of character(s) to an existing block: Yes
Name of the existing block: 0300; 036F; Combining Diacritical Marks
2. Number of characters in proposal: One
3. Proposed category (see section II, Character Categories): Combining Mark
4. Proposed Level of Implementation (see clause 15, ISO/IEC 10646-1): Any level is acceptable
Is a rationale provided for the choice? N/A
If Yes, reference:
5. Is a repertoire including character names provided?: Yes
a. If YES, are the names in accordance with the 'character naming guidelines' in Annex K of ISO/IEC 10646-1? Yes
b. Are the character shapes attached in a reviewable form? N/A
6. Who will provide the appropriate computerized font (ordered preference: True Type, PostScript or 96x96 bit-mapped format) for publishing the standard? The Unicode Technical Committee
If available now, identify source(s) for the font (include address, e-mail, ftp-site, etc.) and indicate the tools used:
7. References:
a. Are references (to other character sets, dictionaries, descriptive texts etc.) provided? N/A
b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? N/A
8. Special encoding issues:
Does the proposal address other aspects of character data processing (if applicable) such as input, presentation, sorting, searching, indexing, transliteration etc. (if yes please enclose information): Yes, see ISO/IEC JTC1/SC2/WG2 N2236
1. Has this proposal for addition of character(s) been submitted before? No
If YES explain
2. Has contact been made to members of the user community (for example: National Body, user groups of the script or characters, other experts, etc.)? Yes
If YES, with whom? Unicode member companies (see http://www.unicode.org/unicode/consortium/memblogo.html)
If YES, available relevant documents?
3. Information on the user community for the proposed characters (for example: size, demographics, information technology use, or publishing use) is included? major IT industry leaders
Reference:
4. The context of use for the proposed characters (type of use; common or rare) YES
Reference: see ISO/IEC JTC1/SC2/WG2 N2236
5. Are the proposed characters in current use by the user community? N/A
If YES, where? Reference:
6. After giving due considerations to the principles in N 1352 must the proposed characters be entirely in the BMP? Yes
If YES, is a rationale provided? Yes
If YES, reference: Yes, see ISO/IEC JTC1/SC2/WG2 N2236
7. Should the proposed characters be kept together in a contiguous range (rather than being scattered)? N/A
8. Can any of the proposed characters be considered a presentation form of an existing character or character sequence? No
If YES, is a rationale for its inclusion provided?
If YES, reference:
9. Can any of the proposed character(s) be considered to be similar (in appearance or function) to an existing character? No
If YES, is a rationale for its inclusion provided?
If YES, reference:
10. Does the proposal include use of combining characters and/or use of composite sequences (see clause 4.11 and 4.13 in ISO/IEC 10646-1)? Yes
If YES, is a rationale for such use provided? Yes
If YES, reference: see ISO/IEC JTC1/SC2/WG2 N2236
Is a list of composite sequences and their corresponding glyph images (graphic symbols) provided? No
If YES, reference:
11. Does the proposal contain characters with any special properties such as control function or similar semantics? Yes
If YES, describe in detail (include attachment if necessary) see ISO/IEC JTC1/SC2/WG2 N2236
1. Relevant SC 2/WG 2 document numbers:
2. Status (list of meeting number and corresponding action or disposition):
3. Additional contact to user communities, liaison organizations etc:
4. Assigned category and assigned priority/time frame: