Re: Software support costs (was: Nicest UTF)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 16:18:33 CST

    From: "Carl W. Brown" <cbrown@xnetinc.com>
    > Philippe,
    >> Also a broken opening tag for HTML/XML documents
    >
    > In addition to not having endian problems UTF-8 is also useful when
    > tracing
    > intersystem communications data because XML and other tags are usually in
    > the ASCII subset of UTF-8 and stand out making it easier to find the
    > specific data you are looking for.

    If you are working on XML documents without parsing them first, at least at
    the DOM level (I don't mean after validation), then any generic string
    handling will likely fail, because you may break the well-formedness of the
    document.

    Note however that you are not required to split the document into many
    string objects: you could just as well create a DOM tree whose nodes
    reference pairs of offsets into the source document, if you did not also
    have to convert the numeric character references.
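
    For illustration only, here is a rough Java sketch of such a node (the
    class and field names are mine, not taken from any real parser API):

        // Hypothetical sketch: a DOM-like node that does not copy its text,
        // but only remembers where it starts and ends in the source document.
        final class OffsetNode {
            final int startOffset;   // inclusive offset into the source document
            final int endOffset;     // exclusive offset into the source document

            OffsetNode(int startOffset, int endOffset) {
                this.startOffset = startOffset;
                this.endOffset = endOffset;
            }

            // The text is materialized only on demand, from the shared source.
            String text(String sourceDocument) {
                return sourceDocument.substring(startOffset, endOffset);
            }
        }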

    If you don't do that, you'll need to create subnodes within text elements,
    i.e. work at a level below the normal leaf level of DOM. This is in any
    case what you need to do when references to named entities break up the
    text level; but for simplicity you would still need to parse CDATA
    sections so as to recreate single nodes that may have been split by the
    CDATA end/start markers inserted into a text stream containing the "]]>"
    sequence of three characters.
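
    To see why one logical text node can come out of several CDATA sections,
    here is a small Java sketch (my own helper, not part of any standard API)
    of the usual trick a writer has to use when the text itself contains "]]>":

        // Hypothetical helper: write arbitrary text as a CDATA section.
        // Because "]]>" may not appear inside CDATA, the text is split around
        // it, which is exactly why a parser later sees several adjacent
        // sections that it should merge back into one logical text node.
        final class CdataWriter {
            static String toCdata(String text) {
                // "]]>" becomes: close the section after "]]", reopen before ">".
                String escaped = text.replace("]]>", "]]]]><![CDATA[>");
                return "<![CDATA[" + escaped + "]]>";
            }
        }

    For example, toCdata("a]]>b") produces "<![CDATA[a]]]]><![CDATA[>b]]>",
    i.e. two adjacent CDATA sections that a parser should give back as the
    single text value "a]]>b".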

    Clearly, the normative syntax of XML comes before any other interpretation
    of the data in individual parsed nodes as plain text. So in this case
    you'll need to create new string instances to store the parsed XML nodes
    in the DOM tree. Under this consideration, the encoding of the XML document
    itself plays a very small role: since you'll need to create a separate
    "copy" for the parsed text, the encoding you choose for the parsed nodes
    from which you build the DOM tree can be independent of the encoding
    actually used in the source XML data, notably because XML allows many
    distinct encodings across multiple documents that cross-reference each
    other.

    This means that implementing a conversion of the source encoding to the
    working encoding for DOM tree nodes cannot be avoided, unless you are
    limiting your parser to handle only some classes of XML documents (remember
    that XML uses UTF-8 as the default encoding, so you can't ignore it in any
    XML parser, even if you later decide to handle the parsed node data with
    UTF-16 or UTF-32).
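
    As a minimal Java sketch of that unavoidable conversion step (ignoring the
    BOM and the full encoding-detection rules of the XML specification, and
    with names of my own choosing):

        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;

        // Hypothetical sketch: decode the raw document bytes into the parser's
        // working representation (here a Java String, i.e. UTF-16 code units).
        // "declaredEncoding" would come from the XML declaration, if present.
        final class SourceDecoder {
            static String decodeSource(byte[] rawBytes, String declaredEncoding) {
                Charset charset = (declaredEncoding != null)
                        ? Charset.forName(declaredEncoding)
                        : StandardCharsets.UTF_8;  // XML's default if nothing is declared
                return new String(rawBytes, charset);
            }
        }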

    A good question, then, is which preferred central encoding you'll use for
    the parsed nodes. This depends on the parser API you use: if the API is
    written for C with byte-oriented null-terminated strings, UTF-8 will be the
    best representation (you may also choose GB18030); if the API uses a
    wide-char C interface, UTF-16 or UTF-32 will most often be the only easy
    solution. In both cases, because the XML document may contain nodes with
    null bytes (represented by numeric character references like &#0;), your
    API will need to return an actual string length.

    Then what your application does with the parsed nodes (i.e. whether it
    builds a DOM tree, or uses the nodes on the fly to create another document)
    is the application's choice. If a DOM tree is built, an important factor
    will be the size of the XML documents that you can represent and work with
    in memory as a global DOM tree. Whether these nodes, built by the
    application, are left in UTF-8, UTF-16 or UTF-32, or stored in a more
    compact representation like SCSU, is an application design decision.

    If XML documents are very large, the size of the DOM tree will also become
    very large, and if your application then needs to perform complex
    transformations on the DOM tree, the constant need to navigate the tree
    will mean frequent random accesses to the tree nodes. If the whole tree
    does not fit well in memory, this may put a lot of pressure on the system
    memory manager, meaning many swaps to disk. Compressing nodes will help
    reduce the I/O overhead and improve data locality, meaning that the cost of
    decompression will become much lower than the performance gained from
    reduced system resource usage.
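
    As a rough Java sketch of this trade-off (using java.util.zip here purely
    as a stand-in for SCSU or any other compact representation):

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.DeflaterOutputStream;
        import java.util.zip.InflaterInputStream;

        // Hypothetical sketch: a text node kept compressed in memory and
        // decompressed only when the application actually reads it, so rarely
        // visited parts of a huge tree stay small.
        final class CompressedTextNode {
            private final byte[] compressed;

            CompressedTextNode(String text) throws IOException {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
                    out.write(text.getBytes(StandardCharsets.UTF_8));
                }
                this.compressed = buffer.toByteArray();
            }

            String text() throws IOException {
                try (InflaterInputStream in =
                         new InflaterInputStream(new ByteArrayInputStream(compressed))) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }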

    > However, within the program itself UTF-8 presents a problem when looking
    > for
    > specific data in memory buffers. It is nasty, time consuming and error
    > prone. Mapping UTF-16 to code points is a snap as long as you do not have
    > a
    > lot of surrogates. If you do then probably UTF-32 should be considered.

    This is not demonstrated by experience. Parsing UTF-8 or UTF-16 is not
    complex, even in the case of random accesses into the text data, because
    you always have a small, bounded limit on the number of steps needed to
    find the starting offset of a fully encoded code point: for UTF-16, this
    means at most 1 range test and 1 possible backward step; for UTF-8, the
    limit for random accesses is at most 3 range tests and 3 possible backward
    steps. UTF-8 and UTF-16 support backward and forward enumerators very
    easily; so what else do you need to perform any string handling?
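
    For illustration, a minimal Java sketch of those bounded backward scans
    (assuming the text is already valid UTF-16 or UTF-8; the names are mine):

        // Hypothetical sketches: given an arbitrary offset into the text,
        // step back to the start of the code point that covers it.
        final class CodePointBoundaries {

            // UTF-16 (array of char): at most one backward step, needed only
            // when the offset falls on a low (trailing) surrogate.
            static int startUtf16(char[] text, int index) {
                return Character.isLowSurrogate(text[index]) ? index - 1 : index;
            }

            // UTF-8 (array of bytes): at most three backward steps, skipping
            // continuation bytes of the form 10xxxxxx until the lead byte.
            static int startUtf8(byte[] text, int index) {
                while ((text[index] & 0xC0) == 0x80) {
                    index--;  // continuation byte, keep stepping back
                }
                return index; // lead byte (or a plain ASCII byte)
            }
        }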

    > From a cost to support there are valid reasons to use a mix of UTF
    > formats.

    For that I do agree, but not in the sense previously given in this list:

    Different UTF encodings should not be mixed within the same plain-text
    element. But you can perfectly well represent the various nodes (which are
    independent plain-text elements) of a built DOM tree with various
    encodings, to optimize their internal storage.

    You just need a common String interface (in the OO programming sense) to
    access these nodes, and an implementation (or class) of this interface for
    each candidate string format. What these classes use in their internal
    backing store will then be transparent to the application, which will just
    "see" Strings in a common unified (and most probably uncompressed)
    encoding.
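
    A minimal Java sketch of this one-interface/multiple-classes idea (the
    names are hypothetical, not from any existing library):

        import java.nio.charset.StandardCharsets;

        // The application only sees text through a common interface, while
        // each implementation is free to choose its own backing store.
        interface TextNode {
            String asString();  // uncompressed view in the common working encoding
        }

        // One possible backing store: raw UTF-8 bytes.
        final class Utf8TextNode implements TextNode {
            private final byte[] utf8;
            Utf8TextNode(String text) { this.utf8 = text.getBytes(StandardCharsets.UTF_8); }
            public String asString() { return new String(utf8, StandardCharsets.UTF_8); }
        }

        // Another backing store: the plain Java String (UTF-16 code units).
        final class Utf16TextNode implements TextNode {
            private final String text;
            Utf16TextNode(String text) { this.text = text; }
            public String asString() { return text; }
        }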

    You may as well reuse common strings using a hashset, or a non-broken
    java.lang.String.intern() transformation to atoms. Note however that, for
    now, the intern() method in Java is broken for this usage: it does not
    scale well to large numbers of different strings, because it uses a special
    fast hashmap with a fixed but too limited number of hash buckets, storing
    different strings with the same hash in a linked list. The same is true,
    and even worse, for the Windows CreateAtom() APIs, which don't support
    collision lists for each hash bucket. Once again, this technique is usable
    independently of the encoding you use for each string atom stored in the
    hashset, so atoms can still be stored in a compressed format with the
    one-interface/multiple-classes technique.
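
    For illustration, a minimal Java sketch of such a private interning pool
    (again with hypothetical names), built on an ordinary growing hash map
    instead of intern()'s fixed bucket table:

        import java.util.concurrent.ConcurrentHashMap;

        // Hypothetical replacement for String.intern(): the map's capacity
        // grows with the number of distinct strings, so lookups stay fast
        // even with millions of atoms.
        final class StringPool {
            private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

            String intern(String s) {
                String previous = pool.putIfAbsent(s, s);
                return (previous != null) ? previous : s;
            }
        }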


