[REPOST, LONG] XML and tags (LONG) (derives from Re: Plane 14 Tag Deprecation Issue)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Feb 21 2003 - 10:44:36 EST


    I sent this message yesterday but I didn't see it on the Unicode list.
    Possibly, this was because the ZIP contained two executable programs: I
    have now removed them; anyway, the ZIP contains the source code.

    BTW, I took the occasion to correct a few grammar errors...

    _ Marco

    ---------------------

    (Warning: I have probably succeeded in the impossible task of being more
    verbose than Mr. Overington. Please start reading only if you have some
    free time... :-)

    William Overington wrote:
    >
    > [... an interesting bibliography about XML ...]
    >
    > The more I read about XML the less reason there seems to be to use XML
    > instead of tags!
    >
    > [... many interesting arguments ...]
    >
    > In particular, for the DVB-MHP (Digital Video Broadcasting -
    > Multimedia Home Platform) there is a need to keep the programs
    > as small as possible and to keep text files as small as
    > possible.
    >
    > [... more interesting arguments ...]
    >
    > How would that be done using XML? Would it be done better
    > using XML than using tags? Why, or why not?
    >
    > [... even more interesting arguments and polite greetings ...]

    I confess that I have not been patient enough to read *all* of Mr.
    Overington's post. So, I apologize in advance if I have missed part or all
    of William's point.

    My job is to implement software based on written specifications which
    represent my bosses' understanding of the requirements of our customers.
    Unfortunately, the specifications that I receive are often verbose and fuzzy
    like Mr. Overington's posts... :-) So I had to develop a survival strategy,
    which is to quickly pass through the specification documents in search of
    wording which might represent the core of what the customer actually wants.
    Sometimes this works, sometimes not...

    I will be pretending that William is "Overington Inc.", one of the key
    customers of the company I work with, and that they are asking me to
    implement a protocol to send text over the famous "Overington Multimedia
    Broadcasting (OMB)", with the following requirements:

            1. The text MUST be transmitted in UTF-8 (because the CEO of
    Overington Inc. thinks that UTF-8 is cute).

            2. The transmission protocol MUST implement some form of language
    tagging (the details of the protocol are up to me). Particularly, the system
    needs to distinguish English text from Italian text, because the two
    languages will be displayed in different colors (green and red,
    respectively).

            3. The OveringtonHomeBox(tm) can only accept UTF-8 plain text
    interspersed with escape sequences to change color. The escape sequences
    have the form "{{color=1}}", where "1" is the id of a color (blue, in this
    case).

            4. The text files being transmitted MUST be darn small (bandwidth is
    limited!).

            5. The processing program MUST be darn small (on-board memory is
    limited!).

            6. A working prototype must be ready by tomorrow.

    What I am asked to do is to define the protocol in point 2, and to implement
    a software filter to produce the plain-text stream in point 3. As
    development time is very short, I cannot lose much time thinking about
    it, so I have to choose one of the two solutions that are on top of my mind:

            P. Plane-14 language tags.

            X. XML.

    I instinctively decide for solution P (because I assume that it would be
    simpler and yield smaller files) and start defining my language tagging
    protocol:

            P.1. According to the intended usage of plane-14 tags, each language
    tag will be introduced by a u0E0001 (LANGUAGE TAG) and will terminate with a
    u0E007F (CANCEL TAG).

            P.2. Between each begin and end tag character, I will use a single
    tag character to identify the language, in order to save space (point 4):
             - u0E0065 (TAG LATIN SMALL LETTER E) switches to English;
             - u0E0069 (TAG LATIN SMALL LETTER I) switches to Italian;
             - u0E005E (TAG CIRCUMFLEX ACCENT) switches back to the previous
    language.

    Equipped with this simple protocol, I produce a sample text file: see
    <wo.txt> in the attached ZIP file, containing the following text:
     
        "Let's learn the week days in Italian: 'Monday' is 'lunedì', 'Tuesday'
    is" (...omitted...)

    The English sentence is surrounded by tags u0E0001+u0E0065+u0E007F ...
    u0E0001+u0E005E+u0E007F, while each embedded Italian word is surrounded by
    tags u0E0001+u0E0069+u0E007F ... u0E0001+u0E005E+u0E007F.

    Now I need to write a program that converts this file into a file containing
    color switching commands, such as:

        "{{color=2}}Let's learn the week days in Italian: 'Monday' is
    {{color=4}}'lunedì'{{color=2}}, 'Tuesday' is"...

    I begin writing a few utility functions to read and write UTF-8, to write
    the color escape sequences, and to handle a simple stack data structure,
    needed to implement tag u0E0001+u0E005E+u0E007F. See <wo_util.c>, in the
    attached ZIP file.

    Then, I implement my converter as a little program that reads the incoming
    language-tagged file from standard input and writes on standard output the
    plain text file containing the color escape sequences. See the source code
    for the program in <wo_txt.c> in the attached ZIP file.
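
    For readers without the ZIP at hand, here is a minimal sketch of what such
    a converter might look like. It is not the code from <wo_txt.c>: the
    function names (p14_to_color, utf8_decode, emit, emit_color), the
    buffer-based interface and the MAX_DEPTH limit are assumptions made for
    illustration. The color ids 2 and 4 come from the sample output above.
    The sketch also assumes that nothing is emitted when the outermost
    language is popped.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Color ids from the sample output in the post: 2 = English, 4 = Italian.
   MAX_DEPTH is an arbitrary limit chosen for this sketch. */
enum { COLOR_ENGLISH = 2, COLOR_ITALIAN = 4, MAX_DEPTH = 16 };

/* Decode one UTF-8 sequence of at most `avail` bytes (avail >= 1).
   Stores the code point in *cp and returns the bytes consumed.
   Only lightly validated: a bad lead byte or a truncated sequence is
   passed through as a single byte. */
static size_t utf8_decode(const unsigned char *s, size_t avail, unsigned long *cp)
{
    size_t need = s[0] < 0x80 ? 1 : (s[0] & 0xE0) == 0xC0 ? 2
                : (s[0] & 0xF0) == 0xE0 ? 3 : (s[0] & 0xF8) == 0xF0 ? 4 : 0;
    if (need <= 1 || need > avail) { *cp = s[0]; return 1; }
    *cp = s[0] & (0xFFu >> (need + 1));          /* payload bits of lead byte */
    for (size_t k = 1; k < need; k++) *cp = (*cp << 6) | (s[k] & 0x3F);
    return need;
}

/* Append n bytes to out, always leaving room for a terminating NUL;
   a chunk that does not fit is dropped. */
static void emit(char *out, size_t cap, size_t *o, const char *s, size_t n)
{
    if (*o + n < cap) { memcpy(out + *o, s, n); *o += n; }
}

/* Append a color escape sequence such as "{{color=2}}". */
static void emit_color(char *out, size_t cap, size_t *o, int color)
{
    char esc[32];
    int n = snprintf(esc, sizeof esc, "{{color=%d}}", color);
    if (n > 0) emit(out, cap, o, esc, (size_t)n);
}

/* Convert Plane-14 tagged UTF-8 text into text with color escapes.
   Returns the length of the NUL-terminated result written to out. */
size_t p14_to_color(const char *in, size_t len, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    int stack[MAX_DEPTH];          /* colors of the enclosing languages */
    int depth = 0, in_tag = 0, tag_chars = 0;
    unsigned long id = 0;          /* last tag character seen in this tag */
    size_t i = 0, o = 0;

    if (cap == 0) return 0;
    while (i < len) {
        unsigned long cp;
        size_t n = utf8_decode(s + i, len - i, &cp);
        if (cp == 0xE0001UL) {                       /* LANGUAGE TAG */
            in_tag = 1; tag_chars = 0;
        } else if (in_tag && cp >= 0xE0020UL && cp <= 0xE007EUL) {
            id = cp; tag_chars++;                    /* tag payload */
        } else if (in_tag && cp == 0xE007FUL) {      /* CANCEL TAG */
            in_tag = 0;
            if (tag_chars == 1 && (id == 0xE0065UL || id == 0xE0069UL)) {
                if (depth < MAX_DEPTH) {
                    stack[depth++] = (id == 0xE0065UL) ? COLOR_ENGLISH
                                                       : COLOR_ITALIAN;
                    emit_color(out, cap, &o, stack[depth - 1]);
                }
            } else if (tag_chars == 1 && id == 0xE005EUL && depth > 0) {
                depth--;                             /* back to previous */
                if (depth > 0) emit_color(out, cap, &o, stack[depth - 1]);
            }
            /* Any other tag is unknown and silently ignored. */
        } else {
            in_tag = 0;                              /* ordinary text */
            emit(out, cap, &o, in + i, n);
        }
        i += n;
    }
    out[o] = '\0';
    return o;
}
```

    A filter like the one described above would read standard input into a
    buffer, call p14_to_color() on it and write the result to standard
    output.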

    The resulting program, <wo_txt> (not included), can be run with the
    following command line:

            wo_txt < wo.txt > out.txt

    As I have a little more time before tomorrow, I also try to implement the
    XML solution, just for the sake of comparison. With XML, the protocol will
    be slightly different:

            X.1. I need to add some minimal syntactic paraphernalia to make the
    file XML-compliant. As a minimum, I need a "<?xml...?>" declaration at the
    beginning, and a root tag enclosing the whole text, which will be:
    "<wo>...</wo>".

            X.2. In order to save space, I will keep the same one-letter language
    ids that I used before:
                - "<e> ... </e>" will enclose English text;
                - "<i> ... </i>" will enclose Italian text;
                As the tags are already closed by "</e>" and "</i>", I don't need
    an equivalent of u0E0001+u0E005E+u0E007F.

    I convert the sample text file to XML (see <wo.xml> in the attached ZIP
    file), and here comes the first surprise: while the Plane-14 tagged file
    <wo.txt> was 445 bytes long, the XML file is only 322 bytes long!

    This seems strange, at first: because of the "/" each pair of my XML
    language tags is one character longer than the corresponding pair of
    Plane-14 tags. Moreover, the syntactical overhead in X.1 above cannot be
    less than 30 characters. Of course, the reason for the 123-byte saving is
    that, in UTF-8, the characters composing XML tags only take one byte each,
    while Plane-14 tag characters take four bytes each.
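
    The arithmetic can be checked with a small sketch (the helper names
    utf8_len and utf8_total are mine, chosen for illustration). An
    open/close pair of Plane-14 tags is six characters at four bytes each,
    i.e. 24 bytes, while the pair "<e></e>" is seven characters at one byte
    each, i.e. 7 bytes.

```c
#include <stddef.h>

/* Number of bytes UTF-8 needs for one code point (up to U+10FFFF). */
size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80) return 1;      /* ASCII, including '<', '/', '>' */
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;                     /* supplementary planes, e.g. Plane 14 */
}

/* Total UTF-8 size, in bytes, of an array of n code points. */
size_t utf8_total(const unsigned long *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) total += utf8_len(cps[i]);
    return total;
}
```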

    This little gain on point 4 of requirements prompts me to continue with the
    XML experiment. Of course, I guess implementing the converter program must
    be much more complicated for XML than it was for plain text...

    Surprisingly, it is not: the utility functions in <wo_util.c> are
    still useful, and only a handful of modifications to <wo_txt.c> are necessary
    to implement <wo_xml.c>.

    The code that I wrote to interpret a sequence like u0E0001+u0E0065+u0E007F
    ... u0E0001+u0E005E+u0E007F works equally well for a sequence like "<e> ...
    </e>". Moreover, the same code that I wrote to ignore unknown Plane-14
    tags (e.g., u0E0001+u0E0067+u0E007F) works equally well with unknown XML
    tags (e.g. "<wo>" or "<?xml version="1.0"?>").
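
    To illustrate why so little changes, here is a minimal sketch of the XML
    variant of the same loop. Again, this is not the code from <wo_xml.c>:
    xml_to_color, emit, emit_color and MAX_DEPTH are illustrative
    assumptions. Since "<" and ">" are ASCII and never occur inside a
    multi-byte UTF-8 sequence, the text can be copied byte by byte. The
    sketch assumes that no ignored tag contains a ">" of its own, and it
    leaves character entities undecoded.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Color ids from the sample output in the post: 2 = English, 4 = Italian.
   MAX_DEPTH is an arbitrary limit chosen for this sketch. */
enum { COLOR_ENGLISH = 2, COLOR_ITALIAN = 4, MAX_DEPTH = 16 };

/* Append n bytes to out, always leaving room for a terminating NUL;
   a chunk that does not fit is dropped. */
static void emit(char *out, size_t cap, size_t *o, const char *s, size_t n)
{
    if (*o + n < cap) { memcpy(out + *o, s, n); *o += n; }
}

/* Append a color escape sequence such as "{{color=2}}". */
static void emit_color(char *out, size_t cap, size_t *o, int color)
{
    char esc[32];
    int n = snprintf(esc, sizeof esc, "{{color=%d}}", color);
    if (n > 0) emit(out, cap, o, esc, (size_t)n);
}

/* Convert "<e>...</e>" / "<i>...</i>" markup into color escapes.
   Returns the length of the NUL-terminated result written to out. */
size_t xml_to_color(const char *in, size_t len, char *out, size_t cap)
{
    int stack[MAX_DEPTH];          /* colors of the enclosing elements */
    int depth = 0, in_tag = 0;
    char tag[2] = {0, 0};          /* first two characters of the tag */
    size_t tag_len = 0, o = 0;

    if (cap == 0) return 0;
    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        if (!in_tag) {
            if (c == '<') { in_tag = 1; tag_len = 0; }
            else emit(out, cap, &o, &c, 1);          /* ordinary text */
        } else if (c != '>') {
            if (tag_len < 2) tag[tag_len] = c;       /* remember the name */
            tag_len++;
        } else {
            int color = 0, pop = 0;
            in_tag = 0;
            if (tag_len == 1 && tag[0] == 'e') color = COLOR_ENGLISH;
            else if (tag_len == 1 && tag[0] == 'i') color = COLOR_ITALIAN;
            else if (tag_len == 2 && tag[0] == '/' &&
                     (tag[1] == 'e' || tag[1] == 'i')) pop = 1;
            if (color && depth < MAX_DEPTH) {
                stack[depth++] = color;
                emit_color(out, cap, &o, color);
            } else if (pop && depth > 0) {
                depth--;                             /* back to previous */
                if (depth > 0) emit_color(out, cap, &o, stack[depth - 1]);
            }
            /* Anything else ("<wo>", "<?xml ...?>") is silently ignored. */
        }
    }
    out[o] = '\0';
    return o;
}
```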

    The only complication that I had to introduce in <wo_xml.c> is a function to
    decode character entities such as "&gt;" or "&#x4e00;". But, after all,
    that's just a few lines of code. Not implementing this would have meant
    that the characters "<" and "&" could not be transmitted.
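
    A sketch of such a decoder, under my own assumptions about its shape
    (decode_entity and utf8_encode are illustrative names, not the functions
    in the ZIP): it recognizes the five predefined XML entities plus decimal
    and hexadecimal character references, and returns 0 for anything else so
    that the caller can copy the "&" through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Decode the entity starting at s (a NUL-terminated string whose first
   character should be '&'). On success, stores the code point in *cp and
   returns the number of characters consumed, including the ';'.
   Returns 0 if no known entity starts at s. */
size_t decode_entity(const char *s, unsigned long *cp)
{
    static const struct { const char *name; unsigned long cp; } named[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    for (size_t i = 0; i < sizeof named / sizeof named[0]; i++) {
        size_t n = strlen(named[i].name);
        if (strncmp(s, named[i].name, n) == 0) { *cp = named[i].cp; return n; }
    }
    if (s[0] == '&' && s[1] == '#') {        /* numeric reference */
        int hex = (s[2] == 'x');
        const char *p = s + (hex ? 3 : 2);
        unsigned long v = 0;
        int ndigits = 0;
        for (;; p++, ndigits++) {
            int d;
            if (*p >= '0' && *p <= '9') d = *p - '0';
            else if (hex && *p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
            else if (hex && *p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
            else break;
            v = v * (hex ? 16UL : 10UL) + (unsigned long)d;
            if (v > 0x10FFFFUL) return 0;    /* beyond the Unicode range */
        }
        if (ndigits > 0 && *p == ';') { *cp = v; return (size_t)(p - s) + 1; }
    }
    return 0;
}

/* Encode cp as UTF-8 into out (at least 4 bytes); returns bytes written. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

    In the converter, a "&" in the text would be handed to decode_entity();
    on success the resulting code point is written with utf8_encode() and
    the input position advances by the returned length.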

    The resulting program, <wo_xml> (not included), works in the same fashion as
    the other program, and gives the same output:

            wo_xml < wo.xml > out.txt

    Now I try to use those 123 spare bytes to add some more flesh. The
    attached ZIP file contains a second version of the sample XML text:
    <wo_cute.xml>.

    You may have noticed that, although <wo_cute.xml> is much bigger than
    <wo.xml> (but still slightly smaller than <wo.txt>!), passing it through
    <wo_xml.exe> results in exactly the same output file.

    This is because the extra declarations that I added ("encoding=...",
    "<?xml-stylesheet...?>", "<!DOCTYPE...>"), are simply skipped by the
    bare-bone XML processor embedded in our Overington Multimedia Broadcasting
    system.

    But this information, if present, allows you to publish the *same* material
    on both your proprietary system and on other media, such as the Web.

    If you put <wo_cute.xml>, <wo.css> and <wo.dtd> in the same directory, and
    open <wo_cute.xml> with a decently recent browser, you will see that the
    English text will automagically be displayed in green and Italian text in
    red, exactly as they are supposed to appear on the Overingtonian system.

    But size and Web-compatibility are not the only advantages of the XML
    solution:

            a. An XML file is human readable and may be edited with any text
    editor; although the Plane-14 file claims to be "plain text", each language
    tag appears as three black boxes in any UTF-8 editor (and as twelve
    random "accented" characters in a non-UTF-8 editor).

            b. There are plenty of out-of-the-box utilities that everybody can use
    to edit, view and validate XML files. Many of these utilities are free of
    charge, or come bundled with operating systems.

            c. Every bookshop round the world sells books about XML, and in
    every town in the world you can easily hire XML programmers. So you don't
    need a big effort to train your content engineers: just hire "XML people"
    and give them your DTD file...

            d. XML is built on top of Unicode, but *not* bound to it. Imagine
    that the CEO of Overington Inc. comes saying that he also wants support for
    ISO 8859-1, JIS X 0208 and a third encoding of my choice... With the XML
    solution, we just ask them to change the "encoding" declaration in the file,
    and to allow us a couple of days to change the software. If we used
    *Unicode* Plane-14 tags, what would we tell them?

            e. XML is extensible. Once we have fixed their silly requirement for
    green English and red Italian, they will perhaps ask for italic, bold,
    pictures, sounds, etc. etc. With XML, defining the protocol for such things
    is straightforward, so we only have to enhance our <wo_xml.exe> program (and
    our <wo.css> for the Web edition). With Plane-14, what are you going to do?
    Do you hope that the Unicode Consortium will accept adding u0E0002, u0E0003,
    u0E0004, and so on, just to match your needs? Or do you want to go on
    playing with your PUA experiments? And do you believe that all browser
    producers will promptly follow you?

    Ciao.

    _ Marco




