L2/01-224

Proposed Changes to East Asian Width                            

Asmus Freytag

5-24-2001

 

Related documents: L2/01-189, L2/01-223

 

Part I – General

Since the first classifications of characters by East Asian Width (the early drafts are now about five years old), the landscape has changed in two important ways:

 

  1. The nature of, and support for, non-Unicode East Asian character sets (legacy character sets) have changed.
  2. Common practice in supporting certain subsets of the legacy characters is evolving.

 

These changes need to be reflected in an adjustment to the EAW properties.

I.1      Character sets

Contrary to expectations, new legacy character sets are being created. The most important ones are JIS X0213 and GB 18030. GB 18030 adds the complete repertoire of the Unicode Standard, and the Chinese government requires all vendors to support it. Normally this would lead to a reclassification of all neutral characters to A, since formally all characters now occur in an East Asian legacy character set. However, doing so would clearly reduce the usefulness of the EAW property in practice (see also I.2).

 

At the same time, some character sets, such as JIS X0212, were included among the character sets used to determine the EAW assignments, but widespread support for them did not materialize, and the advent of JIS X0213 makes such support unlikely in the future.

 

As a result, we need to explicitly define the legacy character sets that we are using to create the mappings. See II.1 for a concrete proposal.

I.2      Changing practice

Western alphabets

Many EA legacy character sets contain a copy of the Greek and Cyrillic alphabets, and some contain a set of accented Latin characters, in addition to the set of Full-Width ASCII characters. As stated in document L2/01-223, it is becoming common practice to treat these alphabetic characters (with the exception of Full-Width ASCII) as narrow characters, i.e. to use Western fonts and line layout behaviors for them. In essence, their use as wide characters has been recognized as an artifact of their inclusion in the double-byte portion of legacy character sets, as opposed to an inherent property.

 

If we accept that this development is occurring, then we need to re-think the assignment of the ambiguous or A property. Currently, it expresses a simple set relation: “occurs in EA legacy sets as well as in non-EA sets”. For the implementer, it is clearly useful to further identify the subset that actually must be treated in a context-dependent manner (such as the ellipsis). See II.1 for a concrete proposal.
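The set relation behind the current A property can be illustrated with a minimal Python sketch. It is an approximation under stated assumptions: the codec lists `EA_CODECS` and `NON_EA_CODECS` and all helper names are hypothetical choices made for illustration, not the legacy sets actually used for the EAW assignments, and “occurs in an EA legacy set” is approximated as “encodes to two bytes in one of the listed Python codecs”.

```python
# Hypothetical stand-ins for the legacy sets; NOT the list used for EAW.
EA_CODECS = ("shift_jis", "gb2312", "big5", "euc_kr")
NON_EA_CODECS = ("latin_1", "iso8859_5", "iso8859_7", "cp1252")


def encoded_length(ch, codec):
    """Return the encoded byte length of ch, or None if the set lacks it."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def in_ea_double_byte(ch):
    """True if ch sits in the double-byte portion of any listed EA set."""
    return any(encoded_length(ch, codec) == 2 for codec in EA_CODECS)


def in_non_ea(ch):
    """True if ch can be encoded in any listed non-EA set."""
    return any(encoded_length(ch, codec) is not None for codec in NON_EA_CODECS)


def is_ambiguous(ch):
    """Approximate the A relation: in EA legacy sets as well as non-EA sets."""
    return in_ea_double_byte(ch) and in_non_ea(ch)
```

Under these assumptions, a Greek letter or the ellipsis satisfies the relation, while an ideograph (EA sets only), an ASCII letter (single-byte portion only), and a Full-Width ASCII letter (no non-EA set) do not.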

Experience with context-based disambiguation

Context-based disambiguation has proven more difficult in practice than first anticipated. Context information to guide software is commonly unavailable or unreliable. This is particularly true for web-based implementations, and is documented in more detail in L2/01-223. As a result, the support of ambiguous characters as narrow characters, where possible, has increased. For certain classes of characters, in particular punctuation, this is not possible. The goal should be to avoid introducing further ambiguous characters.
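The dependency on context can be sketched as follows. This is a minimal illustration, not a recommended algorithm: the function name `display_columns`, the `ea_context` parameter, and the narrow fallback policy are assumptions introduced here. Only `unicodedata.east_asian_width` is a real standard-library call, and it reports the property values of the Unicode version bundled with the Python runtime.

```python
import unicodedata


def display_columns(ch, ea_context=None):
    """Return 2 for wide treatment, 1 for narrow treatment.

    ea_context is True or False when the environment is known, and None
    when no reliable context information is available.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2  # inherently wide, no context needed
    if eaw == "A":
        # Hypothetical policy: wide only when an East Asian context is
        # positively known; otherwise fall back to narrow.
        return 2 if ea_context is True else 1
    return 1  # N, Na, H: narrow regardless of context
```

Only the A branch consults `ea_context`, so every character classified A is a point where missing or wrong context information changes the layout.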

Part II – Specific

II.1     Proposed overall changes to the EAW properties

 

  1. Remove JIS X0212 from the set of supported legacy sets. This affects most accented Latin characters, which would become N instead of A.
  2. State explicitly that we do not consider GB 18030, JIS X0221, and other similar sets that contain all or parts of Unicode to be legacy sets for the purpose of making EAW assignments.
  3. Split the A property into two subsets, an optional and an obligatory part. Their union is based on the set membership relation between supported legacy character sets, but the obligatory part is restricted to that subset of ambiguous characters that must be treated in a context-dependent manner, while the optional part identifies characters that are commonly treated as narrow in modern implementations. This affects all characters that belong to a non-EA script (such as Latin, Greek, Cyrillic). The Full-Width ASCII characters are not affected by this, since they are not ambiguous to start with. It also does not affect punctuation.
  4. EAW is a character property, not a code point property. If it is necessary to assign the property to code points that are not characters, a new X (does not apply) value should be created and assigned.
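Items 3 and 4 above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the labels `A-optional` and `A-obligatory` and the function name `proposed_class` are hypothetical, and because the standard `unicodedata` module exposes no script property, “belongs to a non-EA script” is approximated by “has a letter general category”. Unassigned code points and surrogates stand in for “code points that are not characters”.

```python
import unicodedata


def proposed_class(ch):
    """Map a code point to a hypothetical label under the proposed scheme."""
    category = unicodedata.category(ch)
    if category in ("Cn", "Cs"):
        # Item 4: unassigned code points and surrogates are not characters.
        return "X"
    eaw = unicodedata.east_asian_width(ch)
    if eaw != "A":
        return eaw  # W, F, H, Na, N are left unchanged
    if category.startswith("L"):
        # Item 3, optional part: letters (proxy for non-EA scripts),
        # commonly treated as narrow in modern implementations.
        return "A-optional"
    # Item 3, obligatory part: punctuation and other characters that
    # must still be resolved from context.
    return "A-obligatory"
```

With the property data shipped in current Python releases, a Greek small alpha falls in the optional part and the horizontal ellipsis in the obligatory part, while Full-Width ASCII keeps its F value.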

 

II.2     Proposed detail changes to the EAW properties

Additional detail changes from a recent review of EAW properties, as specified in document L2/01-189, “Updates to East Asian Width”, should be applied.