From: Gregg Reynolds (unicode@arabink.com)
Date: Mon Aug 01 2005 - 23:16:03 CDT
Kenneth Whistler wrote:
>>I assumed that "inherent" Arabic bidirectionality was
>>invented in the wee hours of computer history, maybe in the early
>>sixties, so it never occurred to me that anybody on this list might take
>>it personally.
>
>
> Dear me, unexamined presuppositions can be a problem, can't they? ;)
>
Yeah, but they often come in very handy.
> Visual order Arabic and Hebrew implementations on computers were
> probably "invented" in the 70's, and saw fairly widespread use
> in that timeframe on mainframes and later in the 80's on PC's. A
> lot of that work was done by IBM. An inherent bidirectionality
I figured it was IBM, but I would have guessed the 60's. Now the
question is, why would one go for an MSD encoding design? My
speculation is that, as computation was expensive in those days, they
didn't want to mess with the math routines, and the encoding was
probably motivated primarily by number crunching (Banks, etc.) rather
than text processing.
> algorithm was invented at Xerox PARC in the 80's, I think, although
> others might have had an earlier hand in it. It was implemented
> on the Xerox Star system in that timeframe. You can see it
> discussed in Joe Becker's 1984 Scientific American article, for
> example. And that was the immediate precursor of Arabic and Hebrew
> support on the Macintosh, as well as the inspiration for the
> Unicode bidirectional algorithm.
>
> [Some historians on the list can, no doubt, nail this stuff down
> more precisely...]
That would be very interesting. I hope they do.
>> I really do
>>not understand the assertions that e.g. rtl digits would be a big
>>problem, for reasons that I've explained on other messages. Which makes
>>me think there's something I'm overlooking. That's all.
>
>
> Yes, you are.
>
> Cloning *any* common characters -- let alone all the digits, all
> the common punctuation, and SPACE -- on the basis of directionality
> differences, *would* wreak havoc on information processing. Many
> of the characters in question are in ASCII, which means they
> are baked into hundreds of formal languages, thousands of protocols
> and 10's of thousands of programs and software systems. They have
> been for decades now, and that *includes* Arabic and Hebrew
> information processing systems.
>
> Making the SPACE character in Arabic and Hebrew be something *other*
> than U+0020 SPACE, simply because it might make bidirectional
> editors easier to write if all characters were inherently RTL for
> Arabic, would have the effect of breaking nearly all Arabic
> and Hebrew information processing, deep down in the guts where
> end users can't get at it. The *only* way around it would be to
Hmm. I guess I'm still in the dark. Existing implementations would
still process "legacy" Unicode correctly, no? If new characters are
added - any new characters - software must adjust, *if* it wants to.
After all, Unicode does not require support of any particular block. So
why not let the market decide?
Isn't what you're saying a bit of a way of picking winners? That is, in
my naive way I assume that a well-designed piece of Unicode software
could easily adapt to new characters of whatever ilk. Bad software will
have a harder time. Software makers that want to service the RTL
language market may adapt to RTL 0-9 etc., and users may buy their
software. Software that doesn't care will just say "we don't support
that, just like we don't support Thai, or Limbu, etc." Software
makers that want to service the market can also just stick with legacy
Unicode digits. Let the buyers decide which products they prefer. I'm
confident that Arabic software that didn't have the cursor weirdness
imposed by Unicode would find a ready market. More importantly, we
would see much much more Arabic-enabled software without the bidi
requirement. Make that much much much more.
> introduce such things effectively all pre-deprecated with canonical
> equivalences to the existing characters, so that at least normalized
> data would behave correctly and be interpreted correctly. But then
> there would be no supportable reason for introducing them in
> the first place.
>
> And you haven't thought through the consequences of having duplicated
> digits with different directionality. You might think an end
Ahem, I think I have, at least for applications like a word processor.
But I don't have enough experience with the kinds of things you mention
below - computers passing text around - to really judge the impact. I
don't see any big problem, since they would be Unicode codepoints with
well-defined semantics, but again, in that area I can't speak with much
confidence.
> user has complete control over what they do, with their keyboard
> and their choice of characters -- but text is now *global* data,
> and much of what goes on with data is automated, and consists
> of programs talking to programs through protocols. Once you unleash
> different users using what claims to be the *same* character
> encoding, but with opposite conventions about *which* digits they
> use and what direction those flow, you will inevitably get
> into the situation where one process or another cannot reliably
> tell whether "1234" is to be interpreted as 1234 or 4321.
I don't see that. First of all, 3-RTL and 3-LTR are not the same
character. They look alike, and they are classified as numbers, but
that's all. The new 3-RTL is just another Unicode character with
various properties, just like any other.
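To make that concrete: a property entry for such a character, modeled
on the UnicodeData.txt format, might look something like this (the
code point - parked in the Private Use Area - and the name are of
course invented, purely for illustration):

    F0033;RTL DIGIT THREE;Nd;0;R;;3;3;3;N;;;;;

That is, general category Nd and numeric value 3, exactly like U+0033,
but bidirectional class R instead of EN.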
If you get the string 1234 in LTR digits you know the first digit is the
MSD and it should be typeset at the extreme left of the string. If they
are RTL digits then the first digit is the LSD, and it must be typeset
at the extreme right. Where is the havoc? Note, BTW, that Unicode
stipulates a default directionality for each character. It doesn't
(but should) stipulate that the first digit in a digit sequence is the
MSD. So adopting RTL digits would be accompanied by making this
explicit: LTR digit strings are MSD-first, no matter what script they
are in, and RTL digit strings are LSD-first. Even if Unicode doesn't
want to indicate mathematical
values for the characters, applications can know. Again, no different
than the current state of affairs, where applications must know how to
typeset Unicode chars based on their properties.
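Here's a little sketch in Python of the interpretation rule I have in
mind. No RTL digits exist, of course, so I've parked ten hypothetical
ones in the Private Use Area at U+F0030..U+F0039; the names and the
function are mine, purely for illustration:

    RTL_ZERO = 0xF0030  # hypothetical RTL DIGIT ZERO (invented)

    def digit_run_value(code_points):
        """Value of a digit run under the proposed convention:
        LTR runs are MSD-first, RTL runs are LSD-first."""
        if all(0x30 <= cp <= 0x39 for cp in code_points):
            # LTR digits: stored order is already MSD-first.
            digits = [cp - 0x30 for cp in code_points]
        elif all(RTL_ZERO <= cp <= RTL_ZERO + 9 for cp in code_points):
            # RTL digits: stored order is LSD-first, so reverse.
            digits = [cp - RTL_ZERO for cp in reversed(code_points)]
        else:
            raise ValueError("malformed digit run")
        value = 0
        for d in digits:
            value = 10 * value + d
        return value

    # "1234" as LTR digits: first code point is the MSD.
    assert digit_run_value([0x31, 0x32, 0x33, 0x34]) == 1234
    # The same number as RTL digits: first code point is the LSD.
    assert digit_run_value([RTL_ZERO + 4, RTL_ZERO + 3,
                            RTL_ZERO + 2, RTL_ZERO + 1]) == 1234

Either way a program recovers the value deterministically; I see no
room for a process to wonder whether it got 1234 or 4321.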
Malformed strings - mixing RTL and LTR digits - would be a problem, but
no more than any other malformed string, like a digit string with latin
chars interspersed, or too many decimal points, etc.
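And the well-formedness check is the same sort of thing parsers do
today (same hypothetical U+F0030..U+F0039 range as in the sketch
above):

    def is_well_formed_digit_run(code_points):
        """A run must be all LTR digits or all RTL digits; a mix is
        malformed, just as "12a4" or "1.2.3" is malformed today."""
        ltr = all(0x30 <= cp <= 0x39 for cp in code_points)
        rtl = all(0xF0030 <= cp <= 0xF0039 for cp in code_points)
        return ltr or rtl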
As far as keyboarding is concerned, well, the user doesn't know now how
code points are stored; why should it be any different with new RTL
chars?
This gets to an aspect of Unicode that I haven't personally seen
articulated, namely that it stipulates not only glyphs but also line
composition. It is as much a typesetting standard as a character standard.
> That alone
> is enough for the whole proposal to be completely dead in the water.
> All the proposal would accomplish is to create massive ambiguity
> about what the representation of a given piece of Hebrew or
> Arabic text should be -- and that is a *bad* thing in a character
> encoding.
I still have trouble seeing where the ambiguity is. If you can tell me
exactly what is ambiguous I would appreciate it. Each character has a
semantics - e.g. the number three - a glyph, and a typographic rule.
This is no different than any other character in Unicode. If you see a
glyph "3", it means three. If you see it to the left of two other
digits in an RTL context, it means "3 x 10^2"; ditto for an LTR
context. I
agree that if there is some ambiguity there that would be bad; I just
don't see the ambiguity. If I've misunderstood something - not
unlikely, as it all seems quite simple to me - I hope you can enlighten me.
The only real problem I see is mixing RTL and LTR digits, but that would
be easily handled, as it only affects typesetting.
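To show what I mean about typesetting, here's a companion sketch to
the one above: both encodings of 1234 come out in the same visual
order, MSD at the left, so the printed page is identical (again, the
U+F0030..U+F0039 range is invented):

    GLYPHS = "0123456789"

    def visual_glyphs(code_points):
        """Glyphs of a well-formed digit run in visual order,
        leftmost first."""
        if all(0x30 <= cp <= 0x39 for cp in code_points):
            # LTR run: lay out in stored order.
            return "".join(GLYPHS[cp - 0x30] for cp in code_points)
        # RTL run: stored LSD-first and laid out right to left, so
        # the left-to-right visual order is the reverse of storage.
        ordered = reversed(code_points)
        return "".join(GLYPHS[cp - 0xF0030] for cp in ordered)

    assert visual_glyphs([0x31, 0x32, 0x33, 0x34]) == "1234"
    assert visual_glyphs([0xF0034, 0xF0033, 0xF0032, 0xF0031]) == "1234"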
>
>
>>Then again, I
>>really do not understand why anybody would think RTL languages are
>>inherently bidi, so maybe there's no point
>
>
> Well, first of all, nobody has claimed that the Arabic *language*
> is inherently bidi. Nor has anybody claimed that the Arabic *script*
> is inherently bidi. So try understanding what the people implementing
> these systems *are* claiming.
Er, page 42 of the Unicode Standard:
"In Semitic scripts such as Hebrew and Arabic, characters are arranged
from right to left into lines, although digits run the other way, making
the scripts inherently bidirectional."
Now, call me nutty, but it looks to me like the official Unicode
position is that RTL "scripts are inherently bidirectional." There are
similar passages elsewhere. If this does not reflect the actual
semantics or intention of Unicode, by all means let's change this text.
It is untrue (meaningless, actually) and harmful, insofar as it
perpetuates a fundamental misunderstanding.
>
> Any functional information processing system concerned with
> textual layout that is aimed at the Hebrew or Arabic language
> markets *must* support bidirectional layout of text. That is
> simply a fact.
Oh come now. That is patently untrue. Or rather, it is a judgement
about sales possibilities. In my opinion a word processor that did
Arabic only, but did it very well, could sell quite well. I don't think
Unicode should be in the business of picking winners.
I sure wish we had some evidence about what users actually want. Sure,
pragmatically what you say may be true, but then what that really means
is such software "*must* support Unicode". But that's because of
Unicode's market clout, not because of its virtues *from the user
perspective*. And that may be simply a fact for a big multinational.
But what about the little company in Cairo that wants only to serve the
Arabic market? Why should they have to worry about bidi? The point is
that with a few additional codepoints life would be much easier for
them, which would make life easier for the RTL community as a whole. It
would be vastly easier to port open source software from e.g. English
to RTL languages if we could dispense with the bidi requirement.
One of the more harmful myths occasionally propagated about Arabic et
al. is that users of such RTL software use, or need, or must have, etc.
support for LTR latinate text. I have yet to see any evidence in
support of this assertion.
Ordinary users of RTL software have no need of bidi support. That
requirement comes from Unicode and multinationals who want to localize
generic software for the least money. It doesn't come from the users.
Naturally, I only have personal experience as evidence. I am unaware of
any scientifically valid survey of user needs in the RTL world. But I
can tell you that, in my experience, the lack of LTR latinate support
would be no
great loss. Of course there are niche markets where it is required,
just as there are niche markets in the West that require RTL support.
But the vast majority of documents in the Arab world get along just fine
w/out latinate characters. Furthermore, they *want* to get along in
Arabic only.
Take a look at Arabic websites. Even those with international
multilingual audiences use Arabic almost exclusively. For the content
of articles, you virtually never see latin characters. Arabic gets
along quite well with Arabic acronyms (like TCP/IP = تي سي بي/ أي بي
(There it is again; trying to type that little bit of Arabic with parens
that work defeated me. Ridiculous.) Take Al-Jazeera for example. I
would estimate 99.99% of the site is in Arabic.
>
> Furthermore, to do so interoperably -- that is, with the hope
> that Implementation A by Company X will lay out the same underlying
> text as Implementation B by Company Y in the same order, so that
> a human sees and reads it as the "same" text -- they depend on
> a well-defined encoding of the characters and a well-defined
> bidirectional layout algorithm.
Not if they use only monodirectional characters. They only need
a well-defined encoding, not bidi. That's the whole point. You simply do
not need bidi to do Arabic, given a sufficient repertory of RTL
characters. Sure, you have to have well-defined characters - glyph and
typographic rules - but not bidi. And this isn't just theory. You can
do Arabic just fine in Vim w/out bidi. And if you're a little nutty,
you can do Arabic just fine in Emacs (I do it all the time) which lays
out lines LTR, but words RTL. And you can run monodirectional Arabic
(latin transliteration or not) through TeX or Omega and come out just
fine. Conclusion: there is no need for bidi in order to support RTL
languages. It is purely an artifact of legacy encoding, with no
demonstrated need for it from the broad user community.
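For what it's worth, here is how simple line layout gets when every
character is monodirectional RTL - a toy sketch (it ignores Arabic
joining and shaping, which a real renderer handles separately), but
notice that no bidirectional algorithm appears anywhere:

    def render_rtl_line(logical_text, width):
        """Lay out a purely-RTL line: the visual order is just the
        stored (logical) order reversed, set flush right."""
        visual = logical_text[::-1]
        return visual.rjust(width)

    # Stored in logical order, first-typed character first.
    print(render_rtl_line("سلام", 20))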
> One possible choice is consistent
> visual ordering. One possible choice is consistent logical ordering
> and an inherent bidirectional algorithm. The Unicode Standard
> chose the latter, for a number of very good reasons. Trying
> to mix the two is a quick road to hell.
That's exactly my point. Mixing is the road to hell. Mixing = bidi.
Non-mixing works for English, for which Unicode imposes no bidi
requirement. Why are RTL languages/scripts singled out for this special
treatment?
Thanks for the response. It helps, although as I've noted, I don't see
any insuperable problems, and certainly not havoc. Maybe we're actually
talking about two different things. I have a sneaking suspicion that
you and I may be working from different definitions of some of this stuff.
Sincerely,
-gregg