L2/01-224
Proposed Changes to East Asian Width
Asmus Freytag
5-24-2001
Related documents: L2/01-189, L2/01-223
Since the first classifications of characters by East Asian Width (the early drafts are now about five years old) the landscape has changed in two important ways:
These changes need to be reflected in an adjustment to the EAW properties.
Contrary to expectations, new legacy character sets are being created. The most important ones are JIS X0213 and GB 18030. GB 18030 adds the complete repertoire of the Unicode Standard, and the Chinese government requires all vendors to support it. Normally this would lead to a reclassification of all neutral characters to A, since formally all characters now occur in an East Asian legacy character set. However, doing so would clearly reduce the usefulness of the EAW property in practice (see also I.2).
At the same time, some character sets, such as JIS X0212, were included among the character sets used to determine the EAW assignments but never gained widespread support, and the advent of JIS X0213 makes such support unlikely in the future.
As a result, we need to explicitly define the legacy character sets that we are using to create the mappings. See II.1 for a concrete proposal.
Many EA legacy character sets contain a copy of the Greek and Cyrillic alphabets, and some contain a set of accented Latin characters, in addition to the set of Full-Width ASCII characters. As stated in document L2/01-223, it is becoming common practice to treat these alphabetic characters (with the exception of Full-Width ASCII) as narrow characters, i.e. to use Western fonts and line layout behaviors for them. In essence, their use as wide characters has been recognized as an artifact of their inclusion in the double-byte portion of legacy character sets, as opposed to an inherent property.
If we accept that this development is occurring, then we need to rethink the assignment of the ambiguous (A) property. Currently, it expresses a simple set relation: “occurs in EA legacy sets as well as in non-EA sets”. For the implementer, it is clearly useful to further identify the subset that actually must be treated in a context-dependent manner (such as the ellipsis). See II.1 for a concrete proposal.
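The set relation behind the current A property can be sketched in a few lines of Python. This is only an illustration: the codec lists below are assumptions chosen from codecs that ship with Python, not the legacy character sets proposed in II.1, and the helper names are hypothetical.

```python
# Illustrative stand-ins only; NOT the character sets defined in II.1.
EA_LEGACY = ("shift_jis", "gb2312", "big5", "euc_kr")
NON_EA = ("latin_1", "cp1252", "iso8859_5", "iso8859_7")


def encoded_length(ch, codec):
    """Return the encoded byte length of ch, or None if the codec lacks it."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def in_ea_double_byte(ch):
    """True if ch sits in the double-byte portion of any assumed EA set."""
    return any(encoded_length(ch, c) == 2 for c in EA_LEGACY)


def in_non_ea(ch):
    """True if ch occurs in any of the assumed non-EA sets."""
    return any(encoded_length(ch, c) is not None for c in NON_EA)


def is_ambiguous_by_set_relation(ch):
    """Current reading of A: occurs in EA legacy sets and in non-EA sets."""
    return in_ea_double_byte(ch) and in_non_ea(ch)
```

Under these assumed sets, Greek alpha and the ellipsis both satisfy the relation, although only the ellipsis belongs to the subset that needs context-dependent treatment. This is why the set relation alone is too coarse for implementers.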
Context-based disambiguation has proven more difficult in practice than first anticipated. Context information to guide software is commonly unavailable or unreliable. This is particularly true for web-based implementations, as documented in more detail in L2/01-223. As a result, support for ambiguous characters as narrow characters, where possible, has increased. For certain classes of characters, in particular punctuation, this is not possible. The goal should be to avoid introducing further ambiguous characters.
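A minimal sketch of the fallback behavior described above, assuming Python's standard `unicodedata` module as the source of EAW values: an ambiguous character is rendered wide only when the caller can positively assert an East Asian context, and narrow otherwise. The function name, the `context` parameter and its `"east_asian"` value are hypothetical and are not part of any proposal in this document.

```python
import unicodedata


def resolve_width(ch, context=None):
    """Return the display width of ch in cells (1 = narrow, 2 = wide).

    context is a hypothetical hint: "east_asian" when the caller knows the
    text is set in an East Asian context, None when nothing is known.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Full-Width characters are always wide.
        return 2
    if eaw == "A":
        # Ambiguous: wide only with reliable context, otherwise narrow.
        return 2 if context == "east_asian" else 1
    # Na, H and N are treated as narrow in this sketch.
    return 1
```

With no context, `resolve_width("…")` falls back to narrow. For punctuation such as the ellipsis, that fallback is exactly the case the text identifies as problematic, since such characters cannot simply be treated as narrow everywhere.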
Additional detail changes from a recent review of EAW properties, as specified in document L2/01-189, "Updates to East Asian Width", should be applied.