Pierre,
Doug Felt here at Taligent was kind enough to take a pass at answering
your questions. His comments are marked with "**". I have added on in a
few places, marked with "@@", but haven't looked at the examples as
carefully as Doug.
Mark
===========================================
All,
I've been trying to get a clear picture of what a "plain-text unicode
file" should look like (wrt control chars, bidi markup, &c.).
By "plain-text unicode file" I mean something that would be output by a
plain-text editor, eg. a Unicode-capable vi (Unix) or brief (DOS). No HTML
or Web implications (altho such an editor could certainly be used to
prepare multi-lingual Web pages).
I have prepared a short text (not semantically very meaningful) with mixed
directionalities so I can ask some concrete questions. I took the liberty
to attach the GIF to this message (about same size as the text).
Postscript and GIF versions of this text can also be seen at URL
http://www.centrcn.umontreal.ca/~lewis/LJL/uniplain.html
Below, the text is shown in logical order (and all in English), with an
indication of the language in the postscript page (A=Arabic, E=English,
F=French, G=German, Y=Yiddish), and what I believe the levels should be.
Some examples of dates. In Yiddish, "Monday, the 24th February 1997".
1 E................................E Y............................Y
000000000000000000000000000000000000011111111111122111111111111222200
In German, "Monday, the 24th February 1997".
2 E.......E G............................G
00000000000022222222222222222222222222222200
In Arabic, "Saturday March 90\3\10" (March 10, 1990)
3 E.......E A....................A E............E
0000000000001111111111111112222222000000000000000000
"Shindler's List", so is called my favorite film. The jew has in the
4 E.............E Y...............................................Y
12222222222222221111111111111111111111111111111111111111111111111111
ring written: "All who preserve one soul of Israel the book makes up to
5 Y..........Y H......................................................H
11111111111111133333333333333333333333333333333333333333333333333333333
him as if he preserved a whole world.".
6 H..................................H
333333333333333333333333333333333333111
The guest has been in Berlin. He has said: "I am 49 years
7 Y........................................Y G...........G
111111111111111111111111111111111111111111112222222222222
old and am called Boutros". This means in Yiddish: "I am old 49 years and
8 G...............G A.....A Y...................Y Y...................Y
2222222222222222223333333111111111111111111111111111111111111221111111111
am called Boutros" (Pierre in French).
9 Y.......Y B.....B F....F Y.......Y
11111111113333333111222222111111111111
Notes:
o Translations are fairly literal (and not always very accurate): just
for general orientation. And there are surely imperfections in all
but the French (with just my name, I'm pretty safe here).
o line 3: I'm not too sure what the logical order of the date in Arabic
is. Could be 10\3\90 (levels 2212122 -- three level-2 numbers separated
by level-1 backslashes) or 90\3\10 (all at level 2). Not too sure of the
exact translation of words either.
** The logical order is, in general, the spoken order. The fields of the
** date would probably appear in the order the putative speaker would say
** them; however, this is one place where writing and speaking can
** diverge. Here it depends on the order in which the putative speaker
** would type them. My description of what follows assumes the order you
** present is correct, and the desired appearance is what you present on
** your web site.
**
** Now as to the levels: This is very long, bear with me.
**
** Solidus (Slash) U+002F is a European Number Separator (ES).
** Reverse Solidus (Backslash) U+005C is Other Neutral (ON). You use
** reverse solidus but I'm not sure if this is to represent mirroring
** (neither character is mirrored). Either way, neither is a strong
** directional character.
**
** If the digits are Roman, by rule P0 all these numbers are treated as
** Arabic Numerals because the preceding strong directional character
** is Arabic text (the 'h' in March). You may have intended them to be
** Arabic-Indic digits from the start. Either way, the digits are AN.
**
** If you intended Solidus (ES) this is converted to ON by rule P3. So
** either solidus or reverse solidus is ON.
**
** ON between AN is converted to R by rule N3(c).
**
** The quoted string on line 3 is thus "L R... AN AN R AN R AN AN L",
** where the L characters are the quote marks surrounding the text. The
** base line direction is LTR because of the initial L (Roman 'I'), so
** the base level is 0. In rule I1 the levels thus become
** "0 1... 2 2 1 2 1 2 2 0". By application of rule L2 this first
** becomes "Saturday March 09\3\01" as the level 2 runs are reversed,
** then "10\3\90 hcraM yadrutaS" as the levels 1&2 run is reversed.
**
** This is not consistent with the output on your web page. To force the
** date to be formatted left to right assuming this logical order, you'd
** need to force all date characters to L. This can be done either by
** using an LRM before the first Roman digit, if the digits are Roman, or
** by surrounding the date with LRO..PDF, if the digits are Arabic-Indic.
** Note that LRE won't work because the reverse solidus, being between
** two AN, would still convert to R, instead of L as desired.
**
** For example, using "Saturday March [LRE]90\3\10[PDF]",
** assuming Arabic-Indic digits, would resolve the levels to
** 01111111111111112443434420, progressively resulting in
** "Saturday March 09\3\01" -- level 4 reversed
** "Saturday March 10\3\90" -- levels 3 and above reversed
** "Saturday March 09\3\01" -- levels 2 and above reversed
** "10\3\90 hcraM yadrutaS" -- levels 1 and above reversed
** This is a direct result of the fact that the date is not a
** solid run of left-to-right text, because the solidus is still R.
**
** "Saturday March [LRO]90\3\10[PDF]" however would resolve to
** 01111111111111112222222220, progressively resulting in
** "Saturday March 01\3\09" -- level 2 reversed
** "90\3\10 hcraM yadrutaS" -- level 1 reversed.
o Quotes aren't the right ones (some should be low quotes, ...).
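
The progressive reversals in the line 3 discussion above (rule L2) can be
sketched in Python. This is only an illustration under stated assumptions:
the name reorder_line is made up for this sketch, the levels are assigned
by hand rather than resolved by the algorithm, and the quote marks and
formatting codes are left out of the strings.

```python
def reorder_line(text, levels):
    """Reverse runs from the highest level down to the lowest odd level.

    `levels` holds one hand-assigned embedding level per character of
    `text`; resolving the levels themselves is not attempted here.
    """
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return text  # nothing is right-to-left, so nothing moves
    chars = list(text)
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                # Find the end of this run of characters at `level` or higher.
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


TEXT = "Saturday March 90\\3\\10"
WORDS = [1] * 15  # "Saturday March " sits at level 1 in all three cases

implicit = reorder_line(TEXT, WORDS + [2, 2, 1, 2, 1, 2, 2])  # no codes
with_lre = reorder_line(TEXT, WORDS + [4, 4, 3, 4, 3, 4, 4])  # LRE..PDF
with_lro = reorder_line(TEXT, WORDS + [2] * 7)                # LRO..PDF
```

With these hand-assigned levels, implicit and with_lre both come out as
"10\3\90 hcraM yadrutaS", while with_lro gives "90\3\10 hcraM yadrutaS",
matching the three cases worked through above.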
Questions
1) Do the levels in the above make sense (plus/minus some punctuation)?
It may be that I've totally misunderstood levels.
** Generally, they make sense, see my discussion above. Text does not
** necessarily change level simply because of a quotation, or because of
** a change in language. So in line 2, the level wouldn't change simply
** because of a switch from English to German, since the German
** characters would be L. Only LRE or LRO would do that. Since you
** don't indicate strong formatting characters, I'd have to assume they
** were present to force the levels you indicate.
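
The point that German letters are simply L, like English ones, can be
checked against a character database. A minimal sketch using Python's
standard unicodedata module follows; the helper name bidi_classes is an
assumption for illustration. Note that the module reflects a later version
of the Unicode data than the one discussed in this thread, so some
classes (for instance that of U+002F) may differ from those cited here.

```python
import unicodedata


def bidi_classes(text):
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


german = bidi_classes("Montag")    # Latin letters
hebrew = bidi_classes("\u05d0")    # HEBREW LETTER ALEF
backslash = bidi_classes("\\")     # REVERSE SOLIDUS
digits = bidi_classes("9\u0669")   # ASCII digit, ARABIC-INDIC DIGIT NINE
```

Here every letter of "Montag" reports L, the Hebrew letter reports R, the
reverse solidus reports ON, and the two digits report EN and AN
respectively.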
2) When embedding L2R in L2R (eg German in English, line 2) or R2L in R2L
(eg. Arabic in Yiddish, line 9, or Hebrew in Yiddish, line 5), should
I use LRE/PDF and RLE/PDF (even though the direction doesn't change)?
** Generally, you wouldn't need to.
3) The second and third paragraphs are right-aligned (R2L main direction).
How do I indicate this? I thought of making each paragraph a block
(separating them with PS, paragraph separator), and starting each block
with a strong char of the appropriate directionality. In the second
paragraph, this would mean starting the block with RLM (since the first
letters are English). Ie. if base level is odd, main directionality is
R2L and the text is right aligned.
Or, other possibility, starting a right-adjusted paragraph with RLE?
But then what about a left-adjusted paragraph that starts with R2L text?
** Either way would work. Alignment depends on the base line direction,
** which is determined by the first strong character in the block. The
** explicit directional formatting codes LRE, RLE, LRO, RLO as well as
** RLM and LRM are all strong directional characters. LTR text within
** a RLE embedding will still format LTR, but the overall run of text
** within the embedding will be RTL.
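
The "first strong character" rule described in this answer can be sketched
as follows. This is a minimal illustration, not the algorithm itself: the
name base_level is hypothetical, and counting RLE/RLO/LRE/LRO as strong
follows the description given in this reply, which later revisions of the
bidi algorithm may not share.

```python
import unicodedata

# Classes treated as strong here, per the description in the reply above.
RTL_STRONG = {"R", "AL", "RLE", "RLO"}
LTR_STRONG = {"L", "LRE", "LRO"}


def base_level(block, default=0):
    """Return 1 (right-to-left) or 0 (left-to-right) for a block of text."""
    for ch in block:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_STRONG:
            return 1
        if bidi_class in LTR_STRONG:
            return 0
    return default  # no strong character found at all


plain = base_level("Some examples of dates.")
with_rlm = base_level("\u200f" + "English first, in an R2L paragraph")
with_rle = base_level("\u202b" + "English first, in an R2L paragraph")
```

In this sketch plain is 0, while both the RLM-prefixed and the RLE-prefixed
blocks get base level 1, which is the sense in which "either way would
work".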
4) What should I use to separate lines? LS or CR or LF or CR/LF? If I use
LS, which is a block separator, doesn't that interact negatively with bidi
markup (control chars), in particular embedding markups? Ie. I have to
reestablish the proper level at each line. And what happens with right
alignment?
Couldn't this cause confusion? If I have two lines (in logical order)
000 0000 00 00000 RLE 11 1111 LS | English RLE Yiddish LS
11 11111 1 11111 00 0000 ... | Yiddish English ...
and reissue an RLE at start of second, I can no longer tell whether
I have one embedded segment or two (with a 0-level space between, where
the LS is). Could be an issue if I later reformat (reflow) this text (as
I might want to do in an editor).
As a matter of fact, if the second line (after LS) starts with a strong
R2L character and I don't reissue RLE, won't the base level be set to 1?
This would put the following English at level 2 (not intended as the
English isn't embedded in the Yiddish here, but the other way around).
(I haven't read the recent thread on LS very carefully yet, but it's
not too reassuring: lots of opinions)
@@ The standard is pretty clear. Most of those opinions are from people
@@ who have not read it. Think of these characters in terms of what you
@@ use in a word processor.
@@ For Microsoft Word or FrontPage, think of LS as the
@@ character that you get with shift-Return
@@ (causing no paragraph spacing or indent),
@@ and PS as what you get with Return.
@@ (on the Mac, this would be option-Return).
** This is a good observation! We believe the current standard is in
** error and should categorize LS as whitespace instead of as a block
** separator.
**
** This would allow LS characters to be inserted wherever whitespace
** appears and not interfere with explicit formatting codes.
**
** That said, the explicit formatting codes are basically intended for
** static text interchange only. They pose several problems for editing.
** One is that it is easy to radically alter the text by inserting,
** copying, or deleting one of these codes. This can reorder the text
** within the block and completely change the text on several lines.
** Similarly, the default base line direction rule can be problematic, as
** changes to the text at the start of a block can change the base line
** direction. Users might have difficulty editing unless the editor
** provides some support (such as assisting the user to insert/delete
** explicit formatting codes and their matching PDFs as a unit).
@@ For actual editing of text with different directions, it is far easier
@@ to have out-of-band style information with explicit embedding levels,
@@ as mentioned briefly on page 3-22.
**
** Additionally, text reordering after levels are computed is done on a
** line by line basis. Depending on where line breaks occur, different
** text may appear on a line, and in different orders. This is
** independent of the issue of how to represent line breaks: if they are
** represented external to the text (a line break table, based on
** wrapping to some width or character count, say) this still happens.
** This makes rebreaking lines somewhat more of an issue than it is with
** ASCII text.
**
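
One small piece of the editor support mentioned above is noticing when an
edit has separated an explicit code from its PDF. A minimal sketch,
assuming a hypothetical helper named unmatched_codes that scans one block
of text:

```python
OPENERS = {
    "\u202a",  # LRE, LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RLE, RIGHT-TO-LEFT EMBEDDING
    "\u202d",  # LRO, LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RLO, RIGHT-TO-LEFT OVERRIDE
}
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def unmatched_codes(block):
    """Return (unclosed_openers, stray_pdfs) for one block of text."""
    depth = 0
    stray = 0
    for ch in block:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth:
                depth -= 1
            else:
                stray += 1  # a PDF with no embedding or override to close
    return depth, stray


balanced = unmatched_codes("English \u202bYiddish\u202c English")
lost_pdf = unmatched_codes("English \u202bYiddish English")
lost_rle = unmatched_codes("English Yiddish\u202c English")
```

An editor could run a check like this after each insert, copy, or delete
and warn when either count is non-zero; here balanced is (0, 0), lost_pdf
is (1, 0), and lost_rle is (0, 1).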
5) Does PS imply LS? Or would I end a paragraph with LS PS?
** Yes, use only PS to separate paragraphs.
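
To make the PS/LS relationship concrete, here is a minimal Python sketch
that splits a text into paragraphs on PS and each paragraph into lines on
LS. The name split_blocks and the sample string are assumptions for
illustration only.

```python
PS = "\u2029"  # PARAGRAPH SEPARATOR
LS = "\u2028"  # LINE SEPARATOR


def split_blocks(text):
    """Split text into paragraphs on PS, then each paragraph into lines on LS."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


sample = "first line" + LS + "second line" + PS + "next paragraph"
blocks = split_blocks(sample)
```

In the sample, the first paragraph ends with PS alone; no LS is placed
before it, and the split still yields two lines in the first paragraph and
one in the second.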
6) Imagine I want to start the third paragraph on a new page. Where do I
put the FF (wrt the LS/CR/LF/ and bidi markup in the vicinity)?
** FF is higher-level formatting; you'd have to interpret it separately.
@@ In particular, you would definitely interpret it as a block separator.
7) Any specific bidi markup required around the numerals?
In the Arabic date: if levels intended are 2212122, would I need extra
markup? I would think I would need:
LRO number PDF \ LRO number PDF \ LRO number PDF
(so that the \s, which are "other neutral", stay at level 1)?
** Almost, see my example above. In your example, the separate runs
** of LTR text would occur in RTL order, reversing the year and day of
** the date from what your example shows.
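
The difference between the two markups can be seen by building both
strings. A minimal Python sketch follows; the helper name force_ltr is
hypothetical, and the sketch only constructs the marked-up text without
resolving levels or reordering anything.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def force_ltr(run):
    """Wrap a run of text in a single LRO..PDF pair."""
    return LRO + run + PDF


# Markup proposed in the question: three separate overridden runs, with
# the backslashes left outside the overrides.
per_number = "\\".join(force_ltr(n) for n in ["90", "3", "10"])

# Markup from the earlier example: the whole date is one overridden run.
whole_date = force_ltr("90\\3\\10")
```

With per_number the three runs remain separate pieces that are laid out
relative to one another in the surrounding right-to-left order, which is
the year/day swap described in the reply; whole_date keeps the date as a
single left-to-right run.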
8) What is the intent (as opposed to the effect which the algo surely
makes clear) of RLE and LRE? When are they useful? (Relates to question
1).
** Quoted text where the text itself contains mixed directions is a
** common case. You can see it (implicitly) in the examples for rule L2.
** The quotes logically belong to the surrounding text, and the embedding
** codes are just inside the quotes.
@@ In the vast majority of cases, it is not necessary. The important
@@ cases are those that Doug mentioned.
@@ RLO and LRO are even more infrequent, and are designed to allow for
@@ cases such as part numbers with mixed numbers and letters, where the
@@ character order is forced.
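
The placement described above, with the quote marks belonging to the
surrounding text and the embedding codes just inside them, might be built
like this. The helper name quote_embedded and the use of plain ASCII quote
marks are assumptions made for this sketch.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def quote_embedded(quoted, rtl):
    """Return quoted text with the embedding codes just inside the quotes."""
    opener = RLE if rtl else LRE
    # The quote marks stay outside the embedding, so they take the level
    # of the text that introduces the quotation.
    return '"' + opener + quoted + PDF + '"'


rtl_quote = quote_embedded("mixed-direction quotation", rtl=True)
ltr_quote = quote_embedded("mixed-direction quotation", rtl=False)
```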
9) A typesetting question. Where do quotes belong in mixed-directionality
texts (eg. in line 7)? Should they be at the same level as the text
introducing the quote? Or at the level of the text being quoted? On
line 7, should the quote be at the end of the line instead of where I
put it (in the PS file)? Can't say I'm comfortable with either solution.
And what style of quotes does one use? That of the quoting or of the
quoted language?
** Quotes are at the same level as the text introducing the quote.
@@ In general, you expect the style of the quotes to be the same as the
@@ containing text, not the embedded text. However, that is up to the
@@ user's choice.
Thanks in advance for any clarifications.
Pierre
lew@nortel.ca