On Sat, 29 Dec 2001 DougEwell2@cs.com wrote:
>Tex's example may or may not be realistic -- I have no way of knowing --
>but in suggesting a top-to-bottom directional override, I had hoped it
>would be possible to represent a run of text such as Tex describes
>without resorting to the infamous "higher protocol."
But it is. Unicode just does not take a stand on how it should be
formatted. See below.
>This may seem arbitrary to some; why should overrides of default
>horizontal directionality be a plain-text issue but overrides of default
>vertical directionality be a higher-level "formatting style" issue? I
>hope this discussion can shed some light on this question, and possibly
>help me see what I may be missing.
I think this has to do with the way people conceive of the term "plaintext"
-- anything beyond a simple line (or column) based flow layout will likely
be thought of as "rich" instead. The reason is both historical and
practical. Text is laid out like this in most cultures, and early
printing/computer/typewriter technology followed suit. The matter of mixed
writing directions is a relatively new one, and so isn't really covered by
the concept of "plaintext".
The practical reason is that comprehensive layout of fully free direction
text is really difficult, if not impossible, whereas writing systems with
identical line progression directions are more or less compatible, using a
simplish algorithm (Unicode BiDi). If you look at the way text is normally
displayed on 2D media, it's printed in a unidirectional stream and then
chopped into lines at the sheet edge. As long as the lines progress in the
same direction, you can always manipulate the order of the symbols within
the stream to get more or less correct display of mixed script
directionalities. (Yes, line breaking and deeply nested BiDi levels are
still troublesome.) This way, lr-tb is sorta compatible with rl-tb.
There are of course three more pairs, not counting boustrophedon and the
like, but AFAIK this is the most common combination.
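To make the "manipulate the order of the symbols within the stream" idea concrete, here is a rough Python sketch. It is NOT the Unicode BiDi algorithm: it assumes a left-to-right base direction, looks only at the strong R and AL bidi classes, and ignores neutrals, digits, embedding levels and proper line breaking. The function names (is_rtl, visual_line, layout) are made up for illustration; only unicodedata.bidirectional comes from the Python standard library.

```python
import unicodedata


def is_rtl(ch):
    # Strong right-to-left bidi classes from the Unicode Character Database.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_line(line):
    """Reorder one line from logical to visual order (LTR base assumed).

    Simplification: only maximal runs of strong RTL characters are
    reversed. Spaces, digits and nested levels are not handled.
    """
    out, run = [], []
    for ch in line:
        if is_rtl(ch):
            run.append(ch)
        else:
            out.extend(reversed(run))  # flush the pending RTL run, reversed
            run.clear()
            out.append(ch)
    out.extend(reversed(run))  # flush a run that ends the line
    return "".join(out)


def layout(stream, width):
    # Chop the logical stream at the "sheet edge" first, then reorder
    # the symbols within each line independently.
    lines = [stream[i:i + width] for i in range(0, len(stream), width)]
    return [visual_line(line) for line in lines]


# Latin, then three Hebrew letters (alef, bet, gimel), then Latin again.
logical = "abc\u05d0\u05d1\u05d2def"
print(layout(logical, 9) == ["abc\u05d2\u05d1\u05d0def"])  # True
```

Note that the line progression never enters into it: every line is still stacked top to bottom, and only the order inside each line changes. That is why this trick works for lr-tb mixed with rl-tb and says nothing about the harder cases.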
It's also where the ease stops. If you try to mix opposite line
progression directions, you will end up with something like the Unicode
BiDi algo, only applied at the paragraph level. That soon becomes
unreadable, and makes for really lousy APIs. (Even BiDi is difficult, as
one usually needs to render entire paragraphs at a time.) Mixing vertical
and horizontal writing modes is even more complicated since you cannot
think of the text as a directional, chopped-into-lines stream anymore.
You *can* use all sorts of funky heuristics, but keeping the text both
readable and "plain" is pretty much impossible. (If you don't believe
that, think about how you would format a string of 1000 lr-tb, 100 tb-lr,
100 rl-bt and 1000 bt-lr characters. This is not a realistic example, of
course, but illustrates the general point.)
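A small Python sketch of the bookkeeping behind that example. It assumes the XSL-FO style reading of the mode names, where the first half is the character progression and the second half the line progression, and it treats two modes as "compatible" only when their line progressions agree. The helper names (all_modes, line_progression, compatible_groups) are hypothetical.

```python
from collections import defaultdict
from itertools import product

HORIZONTAL = ("lr", "rl")
VERTICAL = ("tb", "bt")


def all_modes():
    # Character progression and line progression must lie on different
    # axes, which gives 2*2 + 2*2 = 8 writing modes.
    modes = [f"{c}-{l}" for c, l in product(HORIZONTAL, VERTICAL)]
    modes += [f"{c}-{l}" for c, l in product(VERTICAL, HORIZONTAL)]
    return modes


def line_progression(mode):
    # Assumed naming: "<character progression>-<line progression>".
    return mode.split("-")[1]


def compatible_groups(modes):
    # Group modes that share a line progression direction.
    groups = defaultdict(list)
    for mode in modes:
        groups[line_progression(mode)].append(mode)
    return dict(groups)


for direction, modes in compatible_groups(all_modes()).items():
    print(direction, modes)

# The four modes from the thought experiment above.
print(compatible_groups(["lr-tb", "tb-lr", "rl-bt", "bt-lr"]))
```

The eight modes fall into four pairs, lr-tb with rl-tb being one of them. The four modes in the thought experiment land in three different groups: only tb-lr and bt-lr share a line progression, so within-line reordering alone cannot reconcile the rest.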
Now, there are many ways to cope with simplified variations of the theme.
One is to rotate nested characters of foreign directionality so that the
character progression direction for all the scripts present remains the
same, no matter what the script. E.g. the XSL-FO documentation gives a number
of examples of this approach. Another is to force the character
progression direction to agree between scripts, without rotation. This
only works when characters are graphically separate, like they are in the
Latin script or scripts based on Han ideographs. Top-to-bottom Latin
within Japanese is a good example. (It also illustrates the effects on
readability of messing with the natural directionality of text.) You can
also print short spans of foreign text in its natural direction, within a
line of text of differing native directionality. Metric units, printed in
Latin within tb-rl traditional Japanese, are probably the most common
case. I'm sure that people on this list could cite countless weirder
examples.
The point is, all such solutions are for special cases. They do not solve
the problem of how to fit longer, nested spans with arbitrary
directionality on a page without in some cases making the text as a whole
illegible and/or unaesthetic. Hence, it's better to handle the special
cases as what they are, instead of bringing them all into Unicode and
forcing every Unicode-compatible application to incorporate a full page
layout engine. I think this is the ultimate reason why TUS 3.0 leaves this
stuff to those "higher level protocols".
We might in fact say that the Unicode Standard has two completely separate
parts. The first is the logical encoding of any character-based script as
a stream of character codes, the second is an actual 2D, line-based
rendering of the encoding for the very special case where two scripts of
identical line progression direction are mixed. Anything beyond this could
well be said to be beyond the scope of TUS. We might indeed go as far as
to say that certain combinations of scripts which *can* be encoded in
Unicode, *cannot* actually be consistently rendered on 2D graphical media.
(After all, one shouldn't neglect the possibility of there being a script
which cannot be line-broken at all. This would mean that it cannot be
printed on 2D media of fixed size, but can still be encoded in Unicode. A
cursive script written on an uninterrupted tape, like I hear some Tibetan
text is, could conceivably serve as an example. I think this sort of thing
does not point to a problem with Unicode, but rather further illustrates the
fact that a stream of Unicode characters and its rendering are separate
beings.)
BTW, something akin to the above should really go in a FAQ. Is there
anything resembling a Unicode FAQ in existence, anywhere?
>Actually, there is a more serious problem involved with vertical
>directional overrides: They would force the Unicode plain-text mechanism
>to become aware of both vertical directionality and directional
>priority. This sounds obvious, but in fact there are not two, but THREE
>issues involved with text directionality:
Beyond the fact that some characters are used in more than one writing
mode, I don't see a problem with incorporating such properties in the
character database. However, I don't think one should involve these
properties in any default plaintext rendering of Unicode. If anything, I'd
push all of the rendering details into a separate part of the Unicode
standard, and make it clear that not all streams of code points have a
consistent, readable rendering like one would expect.
Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2