[unicode] Re: Unicode editing

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Mar 21 2001 - 13:23:52 EST


Edward Cherlin wrote:
> >But, in this case, each *single* character in the block must be
> >independently flagged with the property, so that it retains it also
> >if it is copied&pasted somewhere else: the actual start and end codes
> >will only be generated when rebuilding the Unicode string at the end
> >of editing.
>
> Definitely not the case. Copying would be a problem when using
> embedded codes in linear marked up text, like HTML, where you might
> have to search the whole text to determine what tags were active at a
> specific point.

I am not sure that I totally grasp what you mean.

This discussion did not necessarily involve any properties apart from
embedding levels.

I was mainly considering plain text -- the simplest case --, so I imagined
that each edit line could be an array of things like this:

        struct MyWysiwygGlyph
        {
                wchar_t GlyphCode;      // the character at this position
                int EmbeddingLevel;     // this character's own bidi level
        };

I think that Roozbeh had something quite similar in mind.
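
To make the idea concrete, here is a minimal sketch of how the start and end
codes could be regenerated from such an array at the end of editing. It rests
on assumptions that are not part of the discussion: the array is in logical
order, every level above the paragraph level is expressed with explicit codes
(implicit levels, such as those of numbers, are ignored), and no glyph has a
level below the paragraph level. The name RebuildString is made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct MyWysiwygGlyph
{
        wchar_t GlyphCode;
        int EmbeddingLevel;
};

// Unicode explicit bidi controls.
const wchar_t LRE = 0x202A; // LEFT-TO-RIGHT EMBEDDING: next greater even level
const wchar_t RLE = 0x202B; // RIGHT-TO-LEFT EMBEDDING: next greater odd level
const wchar_t PDF = 0x202C; // POP DIRECTIONAL FORMATTING: close last embedding

// Hypothetical helper: turn per-glyph levels back into a string with codes.
std::wstring RebuildString(const std::vector<MyWysiwygGlyph> & line,
                           int paragraphLevel)
{
        std::wstring out;
        std::vector<int> open; // levels of embeddings opened and not yet closed

        for (const MyWysiwygGlyph & g : line)
        {
                // Assumption: glyphs never sit below the paragraph level.
                assert(g.EmbeddingLevel >= paragraphLevel);

                // Close every embedding that is deeper than this glyph.
                while (!open.empty() && open.back() > g.EmbeddingLevel)
                {
                        out += PDF;
                        open.pop_back();
                }

                // Open embeddings until the glyph's level is reached.
                // A step of 2 keeps the direction, a step of 1 flips it.
                int current = open.empty() ? paragraphLevel : open.back();
                while (current < g.EmbeddingLevel)
                {
                        current += (g.EmbeddingLevel - current >= 2) ? 2 : 1;
                        out += (current % 2 != 0) ? RLE : LRE;
                        open.push_back(current);
                }

                out += g.GlyphCode;
        }

        // Close whatever is still open at the end of the line.
        out.append(open.size(), PDF);
        return out;
}
```

With a paragraph level of 0 and the levels 0, 0, 1, 1, 0 for the glyphs
"a", "b", "C", "D", "e", this produces "ab", RLE, "CD", PDF, "e".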

To extend this to a rich text context, I can imagine something much more
complex but not conceptually different:

        struct MyTag;
        struct MyTagList;

        struct MyWysiwygGlyph
        {
                wchar_t GlyphCode;
                int EmbeddingLevel;
                MyTagList * PointerToInnermostTag;
        };

        struct MyTagList
        {
                MyTag * ThisTag;
                MyTagList * PointerToNextTag;
        };

        struct MyTag
        {
                //... Whatever data may represent a tag internally.
        };
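
As a rough illustration of why such a structure makes copying easy, the sketch
below walks the list from the innermost tag outwards and collects the tags
that apply to one glyph, without looking at any other part of the text. The
Name field and the function ActiveTags are hypothetical; they are only there
to make the example complete.

```cpp
#include <string>
#include <vector>

struct MyTag
{
        std::string Name; // hypothetical payload, e.g. "bold"
};

struct MyTagList
{
        MyTag * ThisTag;
        MyTagList * PointerToNextTag; // next enclosing tag, or null
};

struct MyWysiwygGlyph
{
        wchar_t GlyphCode;
        int EmbeddingLevel;
        MyTagList * PointerToInnermostTag; // null if the glyph is untagged
};

// Collect the names of all tags that apply to a glyph, innermost first.
std::vector<std::string> ActiveTags(const MyWysiwygGlyph & glyph)
{
        std::vector<std::string> names;
        for (const MyTagList * node = glyph.PointerToInnermostTag;
             node != nullptr;
             node = node->PointerToNextTag)
        {
                names.push_back(node->ThisTag->Name);
        }
        return names;
}
```

For a glyph that is "bold" inside "italic", ActiveTags returns "bold" followed
by "italic"; a copied glyph keeps the same pointer and so the same tags.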

> Real rich text editors use *parallel* markup, where each tag is
> associated explicitly with a run of text. The tags can be kept doubly
> indexed. When you cut a section from within a tagged area, you can go
> out and find which tags to associate with the copy very quickly.

I would like to know more about what you are saying here. I am sure that I
have a lot of naive ideas in mind that a word processor specialist would
avoid.
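
For comparison, this is one possible reading of the "parallel" markup in the
quote above, sketched only to have something concrete to discuss. Tags are
kept in a table of runs, separate from the text, and the tags for a cut
selection are those whose run overlaps it. The names TagRun and
TagsForSelection are invented, and the plain linear scan stands in for
whatever indexing a real editor would use.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One tag applied to a run of characters, stored apart from the text.
struct TagRun
{
        std::string Tag;   // e.g. "bold"
        std::size_t Begin; // index of the first character covered
        std::size_t End;   // index one past the last character covered
};

// Return the tags whose run overlaps the selection [selBegin, selEnd).
std::vector<std::string> TagsForSelection(const std::vector<TagRun> & runs,
                                          std::size_t selBegin,
                                          std::size_t selEnd)
{
        std::vector<std::string> tags;
        for (const TagRun & run : runs)
        {
                if (run.Begin < selEnd && selBegin < run.End)
                        tags.push_back(run.Tag);
        }
        return tags;
}
```

With runs "italic" over [0, 20), "bold" over [5, 10) and "underline" over
[15, 18), a selection of [6, 9) picks up "italic" and "bold" only.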

But remember that Roozbeh, Peter and I are not designing any actual system
(well, not together at least): we were just discussing, in general terms,
how bidi editing could work.

I am sure that, when you cut a selection of text, there are many good ways
of retrieving the properties associated with that piece of text (e.g. "bold",
inside "italic", inside "font=Helvetica", inside "language=Italian", etc.)
and carrying them over to the clipboard. My word processor does this all the
time!

I am just not so sure that it should work the same way with bidi embedding
levels, because of a number of caveats:

1) Unicode bidi embedding levels are *numbered* (even numbers represent LTR
text, while odd numbers represent RTL text). On the other hand, there is no
such thing in the nesting of rich text properties: they simply sit in
different positions in a hierarchical structure.

2) Embedding levels have a maximum nesting depth (61). On the other hand,
rich text and SGML tags normally do not define any maximum depth of nesting.

3) The lowest level in each paragraph *must* be either 0 (for a LTR
paragraph) or 1 (for a RTL paragraph). I don't know how to parallel this to
any rich text feature.

4) Embedding levels are defined implicitly (e.g. a number in Arabic has an
embedding level higher than the surrounding text) or by means of explicit
bidi controls. In any case, they are *orthogonal* to markup tags. So, if you
have a tagging scheme that imposes that tags are nested into each other
(e.g. XML), embedding levels do not necessarily follow the rule. E.g., see
how tagging and Unicode embedding overlap in: "<BOLD> abc &RLE; def </BOLD>
ghi &PDF; ijk".
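
The caveats above can be put into a few lines of code. This is only a sketch
under my own assumptions: the function names and the Span type are invented,
the maximum level is passed in as a parameter rather than fixed, and spans
are half-open ranges of positions in the text.

```cpp
#include <vector>

// Caveat 1: the direction follows from the parity of the level.
bool IsRightToLeft(int level)
{
        return level % 2 != 0;
}

// Caveats 2 and 3: no level may go below the paragraph's base level
// (0 for LTR, 1 for RTL) or above the maximum depth.
bool LevelsAreValid(const std::vector<int> & levels,
                    bool rtlParagraph,
                    int maxLevel)
{
        const int base = rtlParagraph ? 1 : 0;
        for (int level : levels)
        {
                if (level < base || level > maxLevel)
                        return false;
        }
        return true;
}

// A half-open range [Begin, End) of positions in the text.
struct Span
{
        int Begin;
        int End;
};

// Caveat 4: two spans nest properly if they are disjoint or if one
// contains the other; otherwise they overlap like the example above.
bool ProperlyNested(Span a, Span b)
{
        const bool disjoint = a.End <= b.Begin || b.End <= a.Begin;
        const bool aInsideB = b.Begin <= a.Begin && a.End <= b.End;
        const bool bInsideA = a.Begin <= b.Begin && b.End <= a.End;
        return disjoint || aInsideB || bInsideA;
}
```

Numbering the words of the example as abc=0, def=1, ghi=2, ijk=3, the BOLD
tag covers [0, 2) and the embedding covers [1, 3), so ProperlyNested reports
false for that pair: the two ranges overlap without either containing the
other.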

_ Marco


