From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 16:18:33 CST
From: "Carl W. Brown" <cbrown@xnetinc.com>
> Philippe,
>> Also a broken opening tag for HTML/XML documents
>
> In addition to not having endian problems UTF-8 is also useful when tracing
> intersystem communications data because XML and other tags are usually in
> the ASCII subset of UTF-8 and stand out making it easier to find the
> specific data you are looking for.
If you are working on XML documents without parsing them first, at least at
the DOM level (I don't mean after validation), then any generic string
handling will likely fail, because you may break the well-formedness of
the document.
Note however that you are not required to split the document into many
string objects: you could just as well create a DOM tree whose nodes
reference pairs of offsets in the source document, if you did not also have
to convert the numeric character references.
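To make the offset-pair idea concrete, here is a minimal sketch. The class
SpanNode and its fields are hypothetical names chosen for illustration, not
part of any DOM API; it assumes the source document stays in memory and that
the span contains no references needing conversion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: it stores offsets into the source text, not a copy.
final class SpanNode {
    final String name;   // element name, or "#text" for a text leaf
    final int start;     // inclusive offset into the source document
    final int end;       // exclusive offset into the source document
    final List<SpanNode> children = new ArrayList<>();

    SpanNode(String name, int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad span");
        }
        this.name = name;
        this.start = start;
        this.end = end;
    }

    // The text is only materialized when the application asks for it.
    CharSequence text(CharSequence source) {
        return source.subSequence(start, end);
    }
}
```

For the source "<a>hello</a>", a text leaf built as new SpanNode("#text", 3, 8)
would yield "hello" from text(source) without any string having been copied
at parse time.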
If you do not, you'll need to create subnodes within text elements, i.e.
work at a level below the normal leaf level in DOM. In any case, this is
what you need to do when references to named entities break up the text
level; and for simplicity, you would still need to parse CDATA sections to
recreate single nodes that may have been split by CDATA end/start markers
inserted in a text stream that contains the three-character sequence "]]>".
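As an illustration of the CDATA case, the sketch below shows both sides: a
writer that has to split the text wherever "]]>" occurs, and a parser-side
step that joins adjacent CDATA sections back into one node value. The class
and method names (CdataJoin, wrap, join) are invented for this example, and
join() assumes its input consists of nothing but consecutive CDATA sections.

```java
final class CdataJoin {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    // Writer side: each "]]>" in the text is cut after "]]", so that the
    // ">" starts a new CDATA section.
    static String wrap(String text) {
        return OPEN + text.replace(CLOSE, "]]" + CLOSE + OPEN + ">") + CLOSE;
    }

    // Parser side: concatenate the contents of adjacent CDATA sections
    // to recreate the single text value.
    static String join(String markup) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (markup.startsWith(OPEN, pos)) {
            int contentStart = pos + OPEN.length();
            int close = markup.indexOf(CLOSE, contentStart);
            if (close < 0) {
                throw new IllegalArgumentException("unterminated CDATA section");
            }
            out.append(markup, contentStart, close);
            pos = close + CLOSE.length();
        }
        if (pos != markup.length()) {
            throw new IllegalArgumentException("input is not only CDATA sections");
        }
        return out.toString();
    }
}
```

With this sketch, wrap("a]]>b") gives "<![CDATA[a]]]]><![CDATA[>b]]>", and
join() applied to that result gives back "a]]>b".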
Clearly, the normative syntax of XML comes first, before any other
interpretation of the data in individual parsed nodes as plain text. So in
this case, you'll need to create new string instances to store the parsed
XML nodes in the DOM tree. Under this consideration, the encoding of the XML
document itself plays a very small role. As you'll need to create a
separate "copy" for the parsed text, the encoding you choose for the parsed
nodes from which you build a DOM tree can be independent of the encoding
actually used in the source XML data, notably because XML allows
cross-referenced documents to use many distinct encodings.
This means that implementing a conversion of the source encoding to the
working encoding for DOM tree nodes cannot be avoided, unless you are
limiting your parser to handle only some classes of XML documents (remember
that XML uses UTF-8 as the default encoding, so you can't ignore it in any
XML parser, even if you later decide to handle the parsed node data with
UTF-16 or UTF-32).
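A minimal sketch of that conversion step, assuming the working encoding is
the UTF-16 of a Java String and that the source encoding name has already
been extracted from the XML declaration (null meaning none was declared).
The class and method names are hypothetical; BOM detection and malformed
input handling are left out (new String() silently substitutes U+FFFD for
undecodable bytes).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    // Decode source bytes into the working representation (a Java String).
    // When no encoding was declared, fall back to the XML default, UTF-8.
    static String toWorking(byte[] source, String declaredEncoding) {
        Charset cs = (declaredEncoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(declaredEncoding);
        return new String(source, cs);
    }
}
```

Two documents that encode the same text differently, say ISO-8859-1 and
UTF-8, end up as equal strings once decoded, which is what lets the rest of
the application ignore the source encoding.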
Then a good question is which preferred central encoding you'll use for the
parsed nodes: this depends on the parser API you use. If this API is
written for C with byte-oriented null-terminated strings, UTF-8 will be the
best representation (you may also choose GB18030). If this API uses a
wide-char C interface, UTF-16 or UTF-32 will most often be the only easy
solution. In both cases, because the XML document may contain nodes with
null bytes (represented by numeric character references like &#0;), your API
will need to return an actual string length.
Then what your application will do with the parsed nodes (i.e. whether it
will build a DOM tree, or use nodes on the fly to create another
document) is the application's choice. If a DOM tree is built, an important
factor will be the size of the XML documents that you can represent and work
with in memory for the global DOM tree nodes. Whether these nodes, built by
the application, are left in UTF-8, UTF-16 or UTF-32, or stored in a
more compact representation like SCSU, is an application design decision.
If XML documents are very large, the DOM tree will also become very large,
and if your application then needs to perform complex transformations on
the DOM tree, the constant need to navigate the tree will mean frequent
random accesses to the tree nodes. If the whole tree does not fit well in
memory, this may put a heavy load on the system memory manager, meaning a
lot of swapping to disk. Compressing nodes will help reduce the I/O overhead
and will improve data locality, so that the cost of decompression becomes
much lower than the performance gained from the reduced use of system
resources.
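To give an idea of what a compressed node store could look like, here is a
rough sketch. Note that it swaps in DEFLATE (java.util.zip) for SCSU, only
because the JDK does not ship an SCSU codec; DEFLATE carries a fixed
overhead and only pays off on larger or repetitive text, so this is an
illustration of the trade-off, not a recommendation. The class name
PackedText is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class PackedText {
    // Compress the UTF-8 form of the node text.
    static byte[] pack(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress on access; the application only ever sees a String.
    static String unpack(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```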
> However, within the program itself UTF-8 presents a problem when looking for
> specific data in memory buffers. It is nasty, time consuming and error
> prone. Mapping UTF-16 to code points is a snap as long as you do not have a
> lot of surrogates. If you do then probably UTF-32 should be considered.
Experience does not bear this out. Parsing UTF-8 or UTF-16 is not
complex, even in the case of random accesses to the text data, because
there is always a small, bounded number of steps needed to find the
starting offset of a fully encoded code point: for UTF-16, this means
at most 1 range test and 1 possible backward step. For UTF-8, the limit for
random accesses is at most 3 range tests and 3 possible backward steps.
UTF-8 and UTF-16 very easily support backward and forward enumerators;
so what else do you need to perform any string handling?
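The bounded backward scan can be sketched in a few lines. The names are
mine; the UTF-8 version assumes well-formed input (it caps the scan at 3
steps but does not validate the lead byte), and the UTF-16 version adds a
second test on the preceding unit so that an unpaired low surrogate is
left where it is.

```java
final class Boundary {
    // UTF-8: continuation bytes have the form 10xxxxxx. From any offset,
    // at most 3 backward steps reach the first byte of the code point.
    static int utf8Start(byte[] buf, int i) {
        int steps = 0;
        while (i > 0 && steps < 3 && (buf[i] & 0xC0) == 0x80) {
            i--;
            steps++;
        }
        return i;
    }

    // UTF-16: if the offset falls on a low surrogate that follows a high
    // surrogate, one backward step reaches the start of the pair.
    static int utf16Start(char[] buf, int i) {
        if (i > 0 && Character.isLowSurrogate(buf[i])
                && Character.isHighSurrogate(buf[i - 1])) {
            return i - 1;
        }
        return i;
    }
}
```

A forward or backward enumerator is then just this function combined with a
step of one code unit in the desired direction.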
> From a cost to support there are valid reasons to use a mix of UTF formats.
On that I do agree, but not in the sense previously given on this list:
different UTF encodings should not be mixed within the same plain-text
element. But you can just as well represent the various nodes (which are
independent plain-text elements) of a built DOM tree with various encodings,
to optimize their internal storage.
You just need a common String interface (OO programming term) to access
these nodes, and an implementation (or class) of this interface for each
candidate string format. What these classes use in their internal backing
store will then be transparent to the application that will just "see"
Strings in a common unified (and most probably uncompressed) encoding.
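As a sketch of this one-interface/multiple-classes arrangement, the example
below uses java.lang.CharSequence as the common interface. Latin1Text and
NodeText are hypothetical names; the factory stores one byte per character
when every character is at or below U+00FF and otherwise falls back to a
plain String (UTF-16). A real design could add further classes, e.g. for a
compressed form.

```java
import java.nio.charset.StandardCharsets;

// Compact backing store: one byte per character, for text within U+0000..U+00FF.
final class Latin1Text implements CharSequence {
    private final byte[] data;
    private final int off;
    private final int len;

    Latin1Text(byte[] data, int off, int len) {
        this.data = data;
        this.off = off;
        this.len = len;
    }

    @Override public int length() { return len; }

    @Override public char charAt(int i) {
        if (i < 0 || i >= len) throw new IndexOutOfBoundsException("index " + i);
        return (char) (data[off + i] & 0xFF);
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > len || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new Latin1Text(data, off + start, end - start);
    }

    @Override public String toString() {
        return new String(data, off, len, StandardCharsets.ISO_8859_1);
    }
}

final class NodeText {
    // Pick the backing class per node; callers only see a CharSequence.
    static CharSequence of(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s; // String is itself a CharSequence backed by UTF-16
            }
        }
        return new Latin1Text(s.getBytes(StandardCharsets.ISO_8859_1), 0, s.length());
    }
}
```

Code that walks the tree calls length(), charAt() and toString() and never
needs to know which class holds a given node.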
You may also reuse common strings using a hashset, or a non-broken
java.lang.String.intern() transformation to atoms. Note however that for now
in Java, the intern() method is broken for this usage: it does not scale
well with large numbers of different strings, because it uses a special fast
hashmap with a fixed but too limited number of hash buckets, which store
different strings with the same hash in a linked list. The same is true, and
even worse, for the Windows CreateAtom() APIs, which don't support collision
lists for each hash bucket. Once again, this technique can be used
independently of the encoding you use for each string atom stored in the
hashset, so the atoms can still be stored in compressed format with the
one-interface/multiple-classes technique.
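The hashset alternative to intern() can be sketched as follows. AtomTable is
a made-up name; it relies on java.util.HashMap, which resizes its bucket
array as the number of entries grows. The sketch is not thread-safe and
never evicts atoms.

```java
import java.util.HashMap;
import java.util.Map;

final class AtomTable {
    private final Map<String, String> atoms = new HashMap<>();

    // Return the canonical instance for s, registering s if it is new.
    String atom(String s) {
        String existing = atoms.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return atoms.size();
    }
}
```

Two equal strings parsed from different places in the document then share
one instance, so repeated element names or attribute values are stored once.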