Hebrew Computing FAQ

From: Shlomi Tal (shlompi@hotmail.com)
Date: Tue Apr 09 2002 - 13:48:46 EDT


I have a website at http://www.pcphobia.co.il/hebcomp/ called The Guide to
Hebrew Computing. It is meant for native users of Hebrew and is therefore
entirely in Hebrew (in two versions: UTF-8 encoded logical Hebrew and
ISO-8859-8 encoded visual Hebrew). For the basic questions about Hebrew,
especially the difference between visual and logical order, which people
have asked me about after seeing those options in Mozilla and Internet
Explorer, I have written this FAQ in English. Criticism and corrections are
gladly accepted.

--- BEGIN ---

Hebrew Computing FAQ

by Shlomi Tal (shlompi@hotmail.com)

Contents:

1. What is the difference between ISO-Visual and ISO-Logical?
2. How was Hebrew used on MS-DOS?
3. What is special about MS-Windows Hebrew (windows-1255) encoding?
4. Review of Standards
---------------------------------------------------------------------

1. What is the difference between ISO-Visual and ISO-Logical?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This question needs a long explanation going down to the very
rudiments of human handwriting. ISO is just an encoding scheme; the
difference between visual order and logical order has nothing to do
with the encoding itself (i.e. the numbers assigned to each letter), but
with the order in which those numbers are stored.

Let us review the writing of English text by hand. The hand holds the
pen near the top-left corner of the paper and then moves rightwards
constantly. When there is no more room on the paper to the right, the
hand moves back to the left edge and slides one row lower than before,
and then begins the rightwards movement again.

Writing Hebrew (and Arabic and other right-to-left scripts) by hand is a
different matter. The hand holds the pen near the top-right corner of
the paper and then moves leftwards, but only as long as the text is in
Hebrew. If numbers (or English text) are to be
written, the hand will move rightwards for them and then resume the
leftwards movement for Hebrew text again. In other words, writing
Hebrew involves bidirectional (left-to-right and right-to-left)
movement of the hand, in contrast to monodirectional English writing.
Finally, upon running out of room to move leftwards, the hand moves
back to the right edge and slides one row lower.

So much for human handwriting. Computers, however, know nothing about
directions. The numbers representing human letters are stored
sequentially on the media. Making them flow from left to right and
move on to the beginning of the next line is the job of software.

Since computer systems were designed around English, the
screen-handling routines have a uniform, clear rule for mimicking the
handwriting process: if a byte follows another byte, it will be
presented on the screen as a letter to the right of the letter that
the previous byte represents:

Sequential bytes:
0x48
0x65
0x6C
0x6C
0x6F

Letters displayed on screen:
Hello
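
The same mapping can be checked in a few lines. This is only a minimal
sketch in Python (the FAQ itself is language-neutral, and the variable
names are assumptions of the example):

```python
# The five sequential bytes from the example above.
stored = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F])

# Decoding walks the bytes in storage order; a left-to-right display
# then places each letter to the right of the previous one.
text = stored.decode("ascii")
print(text)
```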

In addition, for word-wrapping applications (such as text editors)
there is a routine for going to the beginning of the next line when
the row is full.

Displaying Hebrew on the screen is much more difficult. The display
mechanisms of computers were originally designed for English. They can
easily be adapted to other left-to-right scripts, or even to a
monodirectional right-to-left script by a simple display inversion,
but Hebrew is bidirectional and more complicated to display (Arabic is
even more complicated than Hebrew, but that's another story).

There are two options for dealing with Hebrew text display:

1) Forcing Hebrew to conform to the constraints of English text
display (i.e. treating Hebrew like a monodirectional LTR script).

2) Updating the display software to handle bidirectional display of
Hebrew text in a way akin to its flow in handwriting.

The first option is simple, easy to implement and does not require
large computing resources by the standards of early computing (which
for Hebrew means from the 1960s to the early 1980s). It requires only
an encoding and a font mapping: numbers assigned to Hebrew letters,
and Hebrew fonts for their display. However, it requires an effort on
the part of the writer, since all text, including Hebrew letters, is
written from left to right. Hebrew text must be written with the last
letter typed first, so that the left-to-right display of the text can
create the illusion of natural Hebrew flow. There were a few mechanisms
to aid writers, such as "pushing" input methods for typing the Hebrew
letters the natural way (from right to left), but editing, sorting,
copying and any other kind of manipulation remained painful tasks.
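
What visual storage means for a purely Hebrew line can be sketched as
follows. This is an illustration only, not a description of any
particular historical program; the function name to_visual is
hypothetical, and the sketch deliberately ignores digits and English,
which would have to keep their own left-to-right order:

```python
# The Hebrew word "shalom" in natural typing order:
# shin, lamed, vav, final mem.
typed = "\u05e9\u05dc\u05d5\u05dd"

def to_visual(line: str) -> str:
    """Reverse a purely Hebrew line so that an LTR-only display,
    which draws each stored letter to the right of the previous one,
    shows the word in its natural right-to-left reading order."""
    return line[::-1]

visual = to_visual(typed)

# Stored bytes in ISO-8859-8: the last letter typed comes first.
stored = visual.encode("iso8859_8")
print(stored.hex(" "))
```

Sorting or searching such text has to undo the reversal first, which is
one reason manipulation of visually stored Hebrew was so awkward.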

The second option, implemented for Arabic first and then for Hebrew,
relies on more intelligent software and therefore requires more resources.
The method assigns an implicit directionality to each character: LTR
for English and numbers, RTL for Hebrew letters and neutral for
punctuation marks. The Hebrew text is stored in the same sequential
order as it is written by hand: text to the right first, moving
leftwards. The display mechanism is programmed to display each
character according to its implicit directionality. A series of
numbers in the middle of the Hebrew letters would be displayed from
left to right. Editing, sorting and copying text become as natural as
in English, though a few problems remain (such as telephone numbers
with hyphens, which carry too little directional information to be
displayed satisfactorily, and the mirroring of brackets and other
paired signs, which sometimes goes wrong). Of course, storage of Hebrew
text in this
way requires a bidirectional display mechanism in order to be
intelligible at all, not just an encoding and a font mapping. Without
the display mechanism, the Hebrew text is displayed in reverse reading
order.
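
The idea of implicit directionality can be sketched with a toy
converter from logical to visual order. It is NOT the Unicode
Bidirectional Algorithm, only a rough illustration under simplifying
assumptions: one line, right-to-left base direction, and only Latin
letters and European digits treated as left-to-right. The names
is_ltr and logical_to_visual are hypothetical.

```python
import unicodedata

# Bidi classes treated as left-to-right in this toy model:
# "L" (e.g. Latin letters) and "EN" (European digits).
LTR_CLASSES = {"L", "EN"}

def is_ltr(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in LTR_CLASSES

def logical_to_visual(text: str) -> str:
    """Toy layout of one right-to-left line stored in logical order."""
    # Step 1: lay the whole line out from right to left.
    chars = list(text[::-1])
    # Step 2: put every run of LTR characters back into LTR order.
    i = 0
    while i < len(chars):
        if is_ltr(chars[i]):
            j = i
            while j < len(chars) and is_ltr(chars[j]):
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = j
        else:
            i += 1
    return "".join(chars)

# Alef, bet, gimel, a number, then dalet, he (logical order).
line = "\u05d0\u05d1\u05d2 123 \u05d3\u05d4"
shown = logical_to_visual(line)      # the digits stay as "123"

# A hyphenated number: the hyphen is not an LTR character here,
# so the two halves change places.
phone = logical_to_visual("03-1234567")
```

In this toy model the hyphenated number comes out as "1234567-03",
which is the kind of trouble with hyphenated telephone numbers
mentioned above. Real bidirectional software uses extra rules for
separators between digits.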

The first option is known as explicit or visual order: the text is stored
in the sequence in which it is displayed visually on the monodirectional
LTR display system, and right-to-left text needs to be explicitly
stored as right-to-left. The second option is known as implicit or
logical order: the text is stored according to typing sequence, and
the display system controls the flow of characters according to the
implicit directionality of each one.

Visual Hebrew was at first the only way to store Hebrew text, in the
days of scarce resources. It lived side by side with logical Hebrew in
the 1980s, in the MS-DOS and ISO encodings. Microsoft Windows uses
logical Hebrew, and so does the Unicode standard. Visual Hebrew is, in
essence, a kludge to support Hebrew display in systems that had not
been originally designed for it. It is now obsolescent, though still
found on many Hebrew websites (because it is more likely to be
displayed correctly on any system).

As for ISO, again, it is only an encoding: it maps the Hebrew letters
to the numbers 224 through 250 of an 8-bit code.
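
A short check of this point, assuming Python's built-in "iso8859_8"
codec as a stand-in for the ISO encoding (the variable names are
illustrative only):

```python
# Numbers 224..250 decode to the 27 Hebrew letters, alef to tav.
letters = bytes(range(224, 251)).decode("iso8859_8")
print(len(letters), hex(ord(letters[0])), hex(ord(letters[-1])))

# The same word in logical and in visual order uses exactly the
# same numbers; only the storage order differs.
logical = "\u05e9\u05dc\u05d5\u05dd".encode("iso8859_8")
visual = logical[::-1]
same_numbers = sorted(logical) == sorted(visual)
```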

2. How was Hebrew used on MS-DOS?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hebrew on MS-DOS was at first stored visually. A TSR ("Terminate and
Stay Resident") program was loaded that made it possible to type Hebrew
text in the natural right-to-left direction, but the storage was still
visual. Editors such as edit.com and word processors such as QText
stored Hebrew in visual order.

Logical Hebrew was implemented on MS-DOS in the course of the 1980s,
notably on the word processor named Einstein. In such applications,
Hebrew text manipulation was much easier than in other MS-DOS apps,
but it had the disadvantage that the Hebrew text was displayed in
reverse reading order under bare MS-DOS or edit.com.

MS-DOS Hebrew, whether visually or logically stored, was encoded from
number 128 to number 154 (in the C1 range) in the 8-bit IBM PC
Codepage 862.
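
Because Codepage 862 and ISO differ only in the numbers assigned to the
letters, converting between them is a matter of re-mapping numbers.
A minimal sketch, assuming Python's "cp862" and "iso8859_8" codecs
(variable names are illustrative):

```python
# Numbers 128..154 are the Hebrew letters in Codepage 862.
dos_bytes = bytes(range(128, 155))
letters = dos_bytes.decode("cp862")

# Re-encode the same letters in ISO-8859-8 and compare the numbers.
iso_bytes = letters.encode("iso8859_8")
offsets = {iso - dos for dos, iso in zip(dos_bytes, iso_bytes)}
print(offsets)   # a single constant offset between the two ranges
```

Such a conversion changes the numbers only. Text stored visually stays
visual and text stored logically stays logical.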

The current MS-DOS box in Windows 98 (Windows 2000 and XP do not
support DOS Hebrew in the box) handles Hebrew in visual order, unless a
logical-storage application is used. Internet Explorer 5 and later
can display MS-DOS Hebrew, but in logical order only.

3. What is special about MS-Windows Hebrew (windows-1255) encoding?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hebrew Microsoft Windows stores Hebrew in logical order, and encodes
the Hebrew letters in exactly the same places as ISO: number 224 to
number 250. The encoding, however, is a superset of ISO Hebrew in that
it allots numbers before 224 to Hebrew vowel-points. The vowel-points
are combining marks which appear above, below or inside a letter to
signify vowel values, zero vowel, geminate consonants and other
minutiae. They are not used in regular Hebrew texts, only in
children's and religious texts (the same is true for Arabic). There
are no places for Hebrew vowel-points in the ISO standard, so Hebrew
text with vowel-points produced on MS-Windows systems would appear as
blank squares or question marks on a system that does not support the
MS-Windows encoding.
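
The difference can be seen by encoding one pointed letter both ways.
This is a sketch assuming Python's "cp1255" and "iso8859_8" codecs;
the "replace" error handler stands in for whatever substitute a real
system would show:

```python
# Bet followed by two combining marks: dagesh, then sheva.
pointed = "\u05d1\u05bc\u05b0"

# windows-1255 has numbers for the letter and for both points.
win = pointed.encode("cp1255")
print(win.hex(" "))

# ISO-8859-8 has a number for the letter only; the points are
# replaced by question marks.
iso = pointed.encode("iso8859_8", errors="replace")
print(iso)
```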

The Hebrew vowel-points are placed in the Unicode standard in the same
relative positions as in the MS-Windows Hebrew encoding. Unicode,
however, is an even greater superset, in that it also includes Hebrew
cantillation marks for Biblical typesetting.
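
"Same relative positions" can be checked by comparing the distance
between Unicode number and windows-1255 number for a few characters.
A small sketch, again assuming Python's "cp1255" codec; the sample
characters are an arbitrary choice:

```python
marks = ["\u05b0", "\u05b8", "\u05bc"]   # sheva, qamats, dagesh
letters = ["\u05d0", "\u05ea"]           # alef, tav

# Distance between the Unicode number and the windows-1255 number.
offsets = {ord(ch) - ch.encode("cp1255")[0] for ch in marks + letters}
print(offsets)

# A cantillation mark (etnahta, U+0591) exists in Unicode only.
try:
    "\u0591".encode("cp1255")
    cantillation_fits = True
except UnicodeEncodeError:
    cantillation_fits = False
```

For these samples the distance is one constant, so points and letters
keep their relative layout when moving to Unicode.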

4. Review of Standards
^^^^^^^^^^^^^^^^^^^^^^

The first systems supporting Hebrew encoded the Hebrew letters instead
of lower-case Latin letters. Old terminals could be switched between
all-English mode (where Hebrew would appear as meaningless Latin
letters) and Hebrew/English mode (where lower-case Latin letters would
appear as meaningless Hebrew letters). The first national standard for
this was SI960, the Israeli national variant of ISO 646, in which the
Hebrew letters occupy the positions of the grave accent and the
lower-case ASCII letters, that is, number 96 to number 122. This Hebrew
encoding is known as "Old Code", and is still used in Hebrew teletext.
Storage order was visual only.
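
Python has no built-in codec for SI960, so the following sketch
builds the mapping by hand from the description above (alef at number
96, the remaining letters in the same order up to 122). The function
name and constants are hypothetical, and only 7-bit input is assumed:

```python
OLD_CODE_FIRST = 96    # number of alef, as described above
OLD_CODE_LAST = 122    # number of tav
ALEF = 0x05D0          # Unicode number of alef

def old_code_to_hebrew(data: bytes) -> str:
    """Read 7-bit bytes the way a terminal in Hebrew/English mode would."""
    out = []
    for b in data:
        if OLD_CODE_FIRST <= b <= OLD_CODE_LAST:
            out.append(chr(ALEF + b - OLD_CODE_FIRST))
        else:
            out.append(chr(b))    # digits, capitals, punctuation
    return "".join(out)

# In all-English mode these four bytes look like the letters "ylem";
# in Hebrew/English mode they are shin, lamed, vav, final mem.
word = old_code_to_hebrew(b"ylem")

# Adding 128 to each number gives the ISO-8859-8 number of the letter.
iso_word = bytes(b + 128 for b in b"ylem").decode("iso8859_8")
```

The bytes are shown here in typing order for readability; as noted
above, real Old Code text was stored visually.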

The ISO standard for Hebrew, ISO-8859-8, is an 8-bit extension of
SI960, with the Hebrew letters mapped in exactly the same places as in
SI960 with 128 added. Stored visually it is the standard for Unix
systems and Hebrew web pages. In logical storage it is registered as
ISO-8859-8-I ("i" for implicit) and is identical to MS-Windows or
Apple Macintosh Hebrew without vowel-points.

The IBM PC mapped the Hebrew letters in a non-standard way, on top of
the C1 controls. This mapping (Codepage 862) is now obsolete.

Microsoft Windows and Apple Macintosh employed Hebrew in logical
order, in the same encoding as ISO, but with the addition of
vowel-points. The numbers assigned to the vowel-points are not the
same in MS-Windows and the Macintosh, though.

The Unicode standard maps the Hebrew letters from number 1488 (U+05D0)
to number 1514 (U+05EA). The whole Hebrew block of Unicode stretches
from U+0590 to U+05FF, after the Armenian and before the Arabic block.
The Unicode standard mandates logical storage of Hebrew.
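
These numbers can be confirmed against the Unicode character database
that ships with Python. A brief sketch (variable names are
illustrative):

```python
import unicodedata

first, last = 1488, 1514          # U+05D0 .. U+05EA
names = [unicodedata.name(chr(cp)) for cp in range(first, last + 1)]
classes = {unicodedata.bidirectional(chr(cp))
           for cp in range(first, last + 1)}

print(hex(first), names[0])       # first letter of the range
print(hex(last), names[-1])       # last letter of the range
print(classes)                    # implicit directionality of the letters
```

The bidirectional class reported for each letter is the implicit
directionality that logical storage relies on.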

--- END ---



