From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Mon Feb 17 2003 - 05:28:55 EST
Two posts in the Unicode list in the last few days advocate using XML rather
than using plane 14 tags.
I knew very little XML so I started to learn some more so as to assess the
matter of whether there is any good reason for using XML rather than plane
14 tags. Certainly no reasons were stated in the posts.
I asked for XML at http://www.ask.co.uk and found, amongst other results,
the following.
http://hotwired.lycos.com/webmonkey/98/41/index1a.html?tw=xml
This is the first page of a set of seven pages introducing XML. I found
this set of documents very useful.
I also found the following.
http://www.cnet.com/Resources/Info/Glossary/Terms/xml.html
I have read both the above.
I also found the following FAQ.
The XML FAQ
http://www.ucc.ie:8080/cocoon/xmlfaq
I have had a look through that document.
I also found the following.
Extensible Markup Language (XML) 1.0 (Second Edition)
I have glanced at that but not yet in any depth.
The more I read about XML the less reason there seems to be to use XML
instead of tags!
I can understand that there are many possible applications for XML, but any
particular set of named elements used in an XML document seems by its nature
to offer the same level of standardization and interoperability from one
person to another as a Private Use Area collection of Unicode codes. That
is, it might well work and be beneficial amongst a group of people, yet any
particular chosen format (that is, any end user designed language produced
using the XML metalanguage) simply does not have the rigorous definition and
formal standardization needed to be useful generally throughout computing as
a standard for information exchange!
Large companies are, in my view, unlikely to accept as a standard a language
which is produced other than by a formal standards body.
Also, as a scientific applications programmer, I do not understand how
writing a Java program to act in response to a file of XML text could be
anything other than much harder than writing one for a file containing plane
14 tags, or individual codes such as in eutocode graphics. Am I missing
something? In particular, for the DVB-MHP (Digital Video Broadcasting -
Multimedia Home Platform) there is a need to keep the programs as small as
possible and to keep text files as small as possible. Plane 14 tags and a
eutocode graphics system offer ease of decoding, compactness of documents
and ease of preparation of documents starting from the main plain text body
of the document.
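To make the ease of decoding point concrete, here is a minimal Java sketch,
offered as an illustration rather than as a finished decoder. It assumes the
document is already held in a Java String and that the tag characters are
those of the plane 14 tag block, U+E0020 through U+E007E, each of which sits
exactly 0xE0000 above its ASCII counterpart. The class and method names are
hypothetical.

```java
// Sketch only: separates the visible text of a document from the ASCII
// equivalent of any plane 14 tag characters it contains.
class TagStripper {
    /** Returns { visible text, tag content as ASCII }. */
    static String[] split(String input) {
        StringBuilder text = new StringBuilder();
        StringBuilder tags = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp >= 0xE0020 && cp <= 0xE007E) {
                // Tag character: subtract 0xE0000 to get the ASCII character.
                tags.append((char) (cp - 0xE0000));
            } else if (cp < 0xE0000 || cp > 0xE007F) {
                // Ordinary character: keep it as displayable text.
                text.appendCodePoint(cp);
            }
            // Anything else in U+E0000..U+E007F is treated as a control
            // code and is neither displayed nor copied to the tag content.
        });
        return new String[] { text.toString(), tags.toString() };
    }
}
```

Calling TagStripper.split on a document returns the text to display and,
separately, the tag content as ordinary ASCII.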
I accept that generating information source documents might be easier using
XML for the more complex uses of markup, but a Java program can then be used
locally to convert them into the tag format for interchange or storage,
thereby cutting down on storage space and on the complexity of decoding at
the receiving end.
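Such a local conversion step might be sketched as follows. The element name
doc and the attribute names lang and dewey are invented purely for this
illustration and come from no standard. The use of U+E0002 to introduce a
subject classification follows the suggestion in item 1 of this posting and
is not an assigned Unicode character. The parsing itself uses the standard
javax.xml.parsers API.

```java
// Sketch only: converts <doc lang=".." dewey="..">text</doc> (a made-up
// format) into a plane 14 tag prefix followed by the plain text.
class XmlToTags {
    static String convert(String xml) {
        org.w3c.dom.Element root;
        try {
            root = javax.xml.parsers.DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new org.xml.sax.InputSource(new java.io.StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the XML input", e);
        }
        StringBuilder out = new StringBuilder();
        appendTags(out, 0xE0001, root.getAttribute("lang"));  // LANGUAGE TAG
        appendTags(out, 0xE0002, root.getAttribute("dewey")); // suggested, unassigned
        return out.append(root.getTextContent()).toString();
    }

    private static void appendTags(StringBuilder out, int introducer, String ascii) {
        if (ascii.isEmpty()) return; // attribute absent: emit nothing
        out.appendCodePoint(introducer);
        // Assumes printable ASCII; each character moves up by 0xE0000.
        ascii.chars().forEach(c -> out.appendCodePoint(0xE0000 + c));
    }
}
```

The receiving end then needs no XML parser at all, only the offset
arithmetic.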
Please know that I am not in any way criticising XML, nor suggesting that it
is anything other than useful in many situations. The matter
under discussion is as to why it is being claimed that XML is better than
tags for specific applications. My feeling is that there is room for both
tags and XML as facilities for people to use. Which is used in any
particular application depends upon the application. There is an overlap of
areas of application where both will do, yet there are areas of application
where each has its own particular advantages. I read recently of how XML is
being used to produce a format for marshalling content from content
originators to broadcasters of interactive television services: that seems a
good use of XML, as it gives an English-like layout of information for use
within a particular user community. Yet there is a qualitative difference between
that and having a tag system where everybody uses the same encoding for all
sorts of applications, such as finding all documents which have been tagged
with a particular Dewey Decimal Classification indicating the nature of the
subject area of the document.
Looking at the examples from my posting, namely the Cyrillic document, the
haiku and eutocode graphics, is using XML instead of my encoding methods in
any way whatsoever an improvement? I am genuinely puzzled over this. Am I
missing something, or are the suggestions to use XML rather than tags, my
suggested new tag types and plane 14 vector graphics codes unfounded?
1. Previously I wrote as follows.
quote
Suppose that there is a plain text document written in Cyrillic script. If
at the start of that document there is a U+E0001 character then some tag
characters indicating the language and then a U+E0002 character and then the
characters U+E0036 U+E0030 U+E0038 then someone could look at the document
using a suitable computer system and find out from the few plane 14
characters at the start of the document in which particular language the
document is written and also that the general topic area of the document is
inventions and patents. This being because 608 is the Dewey Decimal
Classification for inventions and patents. However, in an ordinary document
viewing package, the tags would not be displayed, so they would not get in
the way.
end quote
How would that be done using XML? Would it be done better using XML than
using tags? Why, or why not?
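As a concrete reference point for the tag side of that comparison, here is a
minimal Java sketch that builds and reads the prefix described in the quoted
passage. It assumes that a run of tag characters ends at the first code
point outside U+E0020 through U+E007E. U+E0001 is the assigned LANGUAGE TAG
character, whereas U+E0002 is only the suggestion made above. The class and
method names are hypothetical.

```java
// Sketch only: builds and reads a document prefix made of plane 14 tags.
class TagHeader {
    static final int LANGUAGE_TAG = 0xE0001; // assigned in Unicode
    static final int SUBJECT_TAG = 0xE0002;  // suggested above, not assigned

    /** Encodes printable ASCII as tag characters after an introducer. */
    static String encode(int introducer, String ascii) {
        StringBuilder sb = new StringBuilder().appendCodePoint(introducer);
        ascii.chars().forEach(c -> {
            if (c < 0x20 || c > 0x7E) {
                throw new IllegalArgumentException("not printable ASCII");
            }
            sb.appendCodePoint(0xE0000 + c);
        });
        return sb.toString();
    }

    /** Returns the ASCII content after the introducer, or "" if absent. */
    static String read(String doc, int introducer) {
        StringBuilder payload = new StringBuilder();
        boolean inside = false;
        for (int cp : doc.codePoints().toArray()) {
            if (cp == introducer) { inside = true; continue; }
            boolean tagChar = cp >= 0xE0020 && cp <= 0xE007E;
            if (inside && tagChar) {
                payload.append((char) (cp - 0xE0000));
            } else if (inside) {
                break; // any other code point ends the run
            }
        }
        return payload.toString();
    }
}
```

For example, encode(LANGUAGE_TAG, "ru") followed by encode(SUBJECT_TAG, "608")
gives a prefix of seven code points, which is 28 bytes in UTF-8, since every
plane 14 character takes four bytes in that encoding.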
2. Previously I wrote as follows.
quote
My suggestion for U+E0004 could be very useful. Suppose that the haiku
which I included at the end of the document had an International Literary
Work Number, if such a system of International Literary Work Numbers comes
into existence in the future. I could produce a plain text file which
starts with U+E0004 and a number of tag characters and then the text of the
haiku. I could place that file somewhere on the web. Search engines might
log it. If then someone is writing an article about the topic of poetry and
Unicode, then he or she might refer to that haiku and include a tag encoded
reference to it, using its International Literary Work Number. A reader of
that document could decide to have a look at the text and could then search
the internet for the text of the haiku, knowing that the search is made
easier due to the fact that the International Literary Work Number is unique
to that haiku, whereas searching for Phaistos Disc might not find it at all,
or might find it as but one of many search engine matches for the term
Phaistos Disc.
end quote
Please suppose, for the purposes of this discussion, that an International
Literary Work Number is expressed as 15 digits followed by a full stop
followed by 5 digits. (A real world implementation might add a space and a
check digit in the manner of International Standard Book Numbers, but the 21
character model will be adequate for this discussion, the idea here being
that anyone may obtain an ILWN 15 digit code from a web site which has a
database facility by choosing any 15 digit number not starting with a 0
character which number has not already been chosen by someone else, then
that person may allocate the 5 digits after the full stop as he or she
chooses.)
How would that be done using XML? Would it be done better using XML than
using tags? Why, or why not?
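For the tag side of that comparison, the 21 character model can be checked
and encoded with very little Java. The following is a sketch under the
assumptions stated above (15 digits not starting with 0, a full stop, then 5
digits, with no check digit). U+E0004 is the suggested introducer, not an
assigned Unicode character, and the class and method names are hypothetical.

```java
// Sketch only: validates the 21 character ILWN model and encodes it as
// plane 14 tag characters after the suggested U+E0004 introducer.
class Ilwn {
    static final int WORK_NUMBER_TAG = 0xE0004; // suggested above, not assigned

    /** True for 15 digits (first one not 0), a full stop, then 5 digits. */
    static boolean isValid(String s) {
        return s.matches("[1-9][0-9]{14}\\.[0-9]{5}");
    }

    /** Returns U+E0004 followed by the number as 21 tag characters. */
    static String toTags(String ilwn) {
        if (!isValid(ilwn)) {
            throw new IllegalArgumentException("not an ILWN: " + ilwn);
        }
        StringBuilder sb = new StringBuilder().appendCodePoint(WORK_NUMBER_TAG);
        // Digits and the full stop each move up by 0xE0000.
        ilwn.chars().forEach(c -> sb.appendCodePoint(0xE0000 + c));
        return sb.toString();
    }
}
```

A search tool could then look for the exact 22 code point sequence rather
than for words in the text of the work.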
3. Previously I wrote as follows.
quote
Looking further at the matter of plane 14, I am wondering whether there is
scope for the eventual production of a vector graphics system to be encoded
in plane 14. I have had some good success with my eutocode graphics system
which is produced using codes from the Private Use Area.
http://www.users.globalnet.co.uk/~ngo/ast03000.htm
http://www.users.globalnet.co.uk/~ngo/ast03100.htm
Eutocode graphics uses 10 bit data input. If a system in plane 14 were
produced, then 12 bit data input could be used, perhaps using all of the
codes U+E2000 through to U+E2FFF for data input. Some of the codes in the
range U+E1000 through to U+E1FFF could be used for control codes for the
system, though not that many of them. At its present stage of development
eutocode graphics uses only a few codes for control, all of them within the
range U+EB00 through to U+EBFF of the Private Use Area.
end quote
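The 12 bit data input suggested in the quoted passage amounts to a simple
offset mapping, since U+E2000 through U+E2FFF is exactly 4096 code points.
The following Java sketch shows only that mapping. It does not attempt to
reproduce how eutocode graphics actually combines data values into drawing
operations, and the class and method names are hypothetical.

```java
// Sketch only: maps 12 bit values (0..4095) onto U+E2000..U+E2FFF and back.
class Plane14Data {
    static final int DATA_BASE = 0xE2000; // start of the suggested data range
    static final int DATA_MAX = 0xFFF;    // largest 12 bit value

    static int encode(int value) {
        if (value < 0 || value > DATA_MAX) {
            throw new IllegalArgumentException("not a 12 bit value: " + value);
        }
        return DATA_BASE + value;
    }

    static String encodeAll(int... values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.appendCodePoint(encode(v));
        }
        return sb.toString();
    }

    /** Extracts the data values, skipping any code point outside the range. */
    static int[] decodeAll(String s) {
        return s.codePoints()
                .filter(cp -> cp >= DATA_BASE && cp <= DATA_BASE + DATA_MAX)
                .map(cp -> cp - DATA_BASE)
                .toArray();
    }
}
```

Because decodeAll skips everything outside the data range, the same method
works whether the data stands alone in a file or is embedded within ordinary
text.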
Please consider the graphic Winter Night in the second of the above named
web pages. This is a vector graphic. The Winter Night graphic can be a
stand alone graphic in a file or it can be embedded within a text file if
one is using a Java program in an interactive television system to process
text files to produce displays.
How would that be done using XML? Would it be done better using XML than
using tags? Why, or why not?
Your comments would be appreciated please. I recognize that this posting is
at length and that discussing the matter thoroughly may take considerable
time and effort. However, as the Unicode Technical Committee are heading
towards a decision which may have long-lasting and widespread effects
upon the way in which computing develops, I feel that it is important that
the matter is discussed fully and thoroughly. Plane 14 could become a
formally standardized area of futuristic development for the 21st Century
and beyond. I feel that that opportunity for progress should not be blocked
off now by a committee making a decision which prevents opportunities for
technological progress.
William Overington
Monday 17 February 2003