L2/01-223
From: "Michel Suignard" <michelsu@microsoft.com>
23 May 2001
New revised text:
------------------
The usage of character sets in East Asia has built a strong legacy, which is enshrined in the usage of characters based on their original character code. Characters used in that context can be divided into the following categories:
1. script letters that are unequivocally narrow and Western, and don't participate
in any CJK logic (e.g. ASCII letters)
2. symbols that are unequivocally narrow (e.g. ASCII symbols)
3. symbols that always participate in CJK logic in various subtle ways
There is a fourth category:
4. script letters that may participate in CJK logic (but my take, based on a
personal survey, is that this usage has been largely deprecated; more on this in
the following text)
The problem area is obviously distinguishing between categories 2 and 3. In trying to find a dramatic way to illustrate how these two categories differ, I found a fairly decent one, which is to invoke the vertical layout flow on East Asian fonts (with Asmus' Unibook tool, you just have to select a font with the '@' prefix to see the effect). There is a good correlation between a symbol belonging to category 3 and its being upright in vertical flow mode (it gets a bit more complicated for brackets and parentheses, but those typically belong to symbol groups where the other characters would be drawn according to a vertical flow layout).
The characters in category 3, along with standard CJK wide characters (ideographs, Hangul, Jamo, etc.), participate in CJK typography rules in the following ways:
- algorithmic kerning, that is, blank space within their advance width will
be removed in some precise situations (this is also known as Character Space
Control or CSC by some East Asian experts)
- start of line removal of the same blank space
- end of line removal (hanging punctuation)
- removal or addition of advance width on those 'blank' portions of the advance
width when doing line justification
- various glyph adjustments within the bounding box when going from horizontal
to vertical layout flow
- baseline alignment (ideographic instead of roman baseline in horizontal flow,
and center in vertical flow); note that the difference in baseline alignment
strategy typically implies a specific glyph
None of the characters in category 2 (narrow symbols) are affected by these
effects. For example, in vertical flow they would lie on their side,
and no algorithmic kerning would ever remove blank space from them (the white
space chars are an exception, but this is beside the point here).
This makes the distinction between cat 2 and cat 3 crucial, and as I said it
is based mostly on East Asian typography experience. I can categorize these
cat3 characters as follows:
Any symbols encoded in the following blocks by East Asian fonts:
2000-206F General punctuation
2100-214F Letterlike symbols
2460-24FF Enclosed alpha
25A0-25FF Geometric shapes
2600-267F Miscellaneous symbols
3000-303F CJK Symbols
3200-33FF Enclosed, CJK Compat
FE30-FE6F CJK Compat forms, Small form variants
FF00-FFEF Half and Full Width Forms
I have found a slight deviation in Taiwanese fonts, which also
categorize the following as cat3:
2500-257F Box drawing
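
As a rough illustration only, the block list above can be written as a small lookup. This is a sketch under my own assumptions: the names CAT3_BLOCKS, TAIWANESE_EXTRA_BLOCKS and cat3_block are hypothetical, and the function tests block membership alone. Whether a given East Asian font actually encodes the symbol, which is part of the criterion above, is not checked here.

```python
# Blocks whose symbols East Asian fonts tend to treat as category 3,
# copied from the list above (ranges are inclusive).
CAT3_BLOCKS = [
    (0x2000, 0x206F, "General punctuation"),
    (0x2100, 0x214F, "Letterlike symbols"),
    (0x2460, 0x24FF, "Enclosed alpha"),
    (0x25A0, 0x25FF, "Geometric shapes"),
    (0x2600, 0x267F, "Miscellaneous symbols"),
    (0x3000, 0x303F, "CJK Symbols"),
    (0x3200, 0x33FF, "Enclosed, CJK Compat"),
    (0xFE30, 0xFE6F, "CJK Compat forms, Small form variants"),
    (0xFF00, 0xFFEF, "Half and Full Width Forms"),
]

# Extra block observed as cat 3 in Taiwanese fonts only.
TAIWANESE_EXTRA_BLOCKS = [
    (0x2500, 0x257F, "Box drawing"),
]


def cat3_block(code_point, include_taiwanese=False):
    """Return the name of the cat 3 block containing code_point, or None.

    Hypothetical helper: block membership only, no font lookup.
    """
    blocks = list(CAT3_BLOCKS)
    if include_taiwanese:
        blocks += TAIWANESE_EXTRA_BLOCKS
    for first, last, name in blocks:
        if first <= code_point <= last:
            return name
    return None
```

With this table, cat3_block(0x3008) reports "CJK Symbols", while cat3_block(0x2329) returns None because 2329 sits in a block that is not on the list.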
As you can see, the situation is already not pretty, as a mixed flow
layout containing a math expression could produce a horrendous result.
However, the impact of these interpretations by the East Asian typography
process is limited by some factors:
- the bulk of the math symbols is not part of these cat3 characters
- the categorization splits go by block
- space adjustments concern only specific characters, and those characters are
typically only used in a CJK context. However, the glaring exceptions are the
bracket/parenthesis characters
From this you can see that the bracket unification between 2329-232A and
3008-3009 is devastating, as suddenly you get math characters from a non cat3
block transformed into some of the characters that are most influenced by East
Asian typography, that is, they are sensitive to:
- special rules about algorithmic kerning
- different shapes depending on flow layout
- line breaking rules, etc.
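
To make the effect of such a unification concrete, here is a minimal sketch. It assumes a Unicode character database in which 2329 and 232A carry canonical decompositions to 3008 and 3009, which is what the unicodedata module shipped with current Python versions reports. The helper name normalize_brackets is hypothetical.

```python
import unicodedata

MATH_LEFT_ANGLE = "\u2329"    # LEFT-POINTING ANGLE BRACKET (non cat3 block)
MATH_RIGHT_ANGLE = "\u232A"   # RIGHT-POINTING ANGLE BRACKET
CJK_LEFT_ANGLE = "\u3008"     # LEFT ANGLE BRACKET (CJK Symbols block)
CJK_RIGHT_ANGLE = "\u3009"    # RIGHT ANGLE BRACKET


def normalize_brackets(text):
    """Apply NFC; under the stated assumption this rewrites 2329/232A
    as 3008/3009, because singleton canonical decompositions are not
    recomposed."""
    return unicodedata.normalize("NFC", text)


expression = MATH_LEFT_ANGLE + "x" + MATH_RIGHT_ANGLE
normalized = normalize_brackets(expression)
```

After normalization the expression carries the CJK brackets, so a process keyed on the block list would start applying CJK bracket behavior to what was entered as math.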
Concerning the script letters that in the past participated in CJK logic
(these were mostly incomplete subsets of Greek and Cyrillic), their usage didn't
survive the full screen terminal mode of yesterday. If you look at modern East
Asian fonts, all these letters are variable width, do not get upright in
vertical flow and do not get involved in East Asian typography as 'Wide'
characters. So, although they may appear in these fonts, they really do not
behave differently than if they were included in a Western font. And typically,
because their hinting is not as good as in true Western fonts, they very often
get swapped in favor of the latter at rendering time. So despite the fact that
more and more Latin and possibly other 'narrow' scripts are showing up in East
Asian repertoires, it doesn't really mean that they are ever treated as 'wide'.
So there is no need to treat them as ambiguous.
This really means that, where width ambiguity is concerned, we should concentrate on symbols, not letters.
The last point I would like to mention is the complexity of the
disambiguation. In a tightly controlled document environment (Microsoft
Office is a good example), the language and other locale information are
typically well known within the context of the text and can be used to
successfully determine the Narrow or Wide nature of ambiguous symbols. But in
the Web context, the locale information is often missing or, even worse,
incorrect. So ambiguity is really bad in that context.
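
The dependence on locale information can be sketched as follows. This is only an illustration under my own assumptions: it reads the East Asian Width property through Python's unicodedata.east_asian_width, and the names EAST_ASIAN_LANGS and resolve_width, as well as the rule "ja, ko or zh means Wide", are hypothetical simplifications rather than anything a real layout engine is known to do.

```python
import unicodedata

# Hypothetical rule: these primary language tags imply an East Asian context.
EAST_ASIAN_LANGS = {"ja", "ko", "zh"}


def resolve_width(ch, lang=None):
    """Return 'wide', 'narrow', or 'unknown' for a single character.

    Only characters whose East Asian Width is 'A' (ambiguous) need the
    language hint; without a hint they cannot be resolved.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return "wide"
    if eaw != "A":
        # 'Na', 'H' and 'N' are all treated as narrow in this sketch.
        return "narrow"
    if not lang:
        return "unknown"
    primary = lang.lower().replace("_", "-").split("-")[0]
    return "wide" if primary in EAST_ASIAN_LANGS else "narrow"
```

For an ambiguous symbol such as U+2605 BLACK STAR, resolve_width returns a different answer for "ja-JP" and for "en", and 'unknown' when no language is supplied, which is the situation a Web page with missing locale information ends up in. A wrong language tag silently yields the wrong answer.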
What I am getting at is that the last thing we want to see is more
ambiguity. The current symbol ambiguity is already bad but is constrained
to some Unicode blocks. Opening it to the whole math bracket repertoire would
make a bad situation much worse.
Michel