Viranga Ratnaike wrote:
>[...] And apologies for my previously vague questions.
As you have seen, both Misha and I thought that your question was very
clear, and yet we understood two totally different things. Moreover, even
after all the attempted explanations, we are both still convinced that we
understood and replied correctly.
So the problem is clearly not your question, but rather the terminology of
this field. It is not the first time that I have seen (or even been involved
in) similar arguments on this list about the meaning of some term.
> Tho' I must
> admit that, in hindsight, I'm glad the questions were open to
> interpretation, as I have learned much from the thread : )
And that is the good side of this kind of discussion.
> When I came across the weekly-euc-jp.xml document, I
> was rapt; an xml document with japanese tags. But when
> I looked at the underlying hex, it clearly wasn't
> "encoded" using a UTF.
At least this point seems uncontroversial: Unicode is not the only encoding
allowed in an XML document. (NOTE: the word "encoding" in the previous
sentence just means the binary representation of the source file's text!)
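Just to make this concrete, here is a little Python sketch (my own
illustration, nothing from the spec): the same text, stored in a non-UTF
encoding and declared as such, is perfectly legal XML, and the parser hands
it back as Unicode.

    import xml.etree.ElementTree as ET

    doc = '<?xml version="1.0" encoding="iso-8859-1"?><greeting>¡Hola!</greeting>'
    data = doc.encode("iso-8859-1")   # the file's binary representation
    print(ET.fromstring(data).text)   # the parser gives back Unicode: ¡Hola!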
> [...] Definition: A character is an atomic unit of text as
> specified by ISO/IEC 10646 [ISO/IEC 10646] [...]
I should not try to interpret XML specs. Anyway, my understanding is that
the XML legislators are simply saying that they adopt the Unicode definition
of "character", and the Unicode *set* (repertoire) of characters. They are
not saying that they mandate one of the Unicode encoding forms as the only
encoding for an XML source file.
> "The mechanism for encoding character code points into bit
> patterns may vary from entity to entity. All XML processors
> must accept the UTF-8 and UTF-16 encodings of 10646; the
> mechanisms for signaling which of the two is in use, or for
> bringing other encodings into play, are discussed later, in
> 4.3.3 Character Encoding in Entities."
Here my understanding is that UTF-8 and UTF-16 are the minimum requirement
(all parsers must accept them) and the default (binary) "encodings" of an
XML source file.
So, any XML parser must accept a source with no explicit "encoding=..."
declaration, and it must treat it as either UTF-8 or UTF-16. (NOTE: once
you know that the encoding is either UTF-8 or UTF-16, it is quite easy to
tell one UTF from the other.)
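For the curious, here is a minimal Python sketch of that detection, roughly
following what XML 1.0 (Appendix F) suggests for the first bytes of an
entity (the function name and exact structure are my own):

    def sniff_utf(first4: bytes) -> str:
        # A Byte Order Mark settles the question immediately...
        if first4.startswith(b"\xfe\xff"):
            return "utf-16-be (with BOM)"
        if first4.startswith(b"\xff\xfe"):
            return "utf-16-le (with BOM)"
        if first4.startswith(b"\xef\xbb\xbf"):
            return "utf-8 (with BOM)"
        # ...otherwise the '<' of the XML declaration betrays the layout.
        if first4.startswith(b"\x00<"):
            return "utf-16-be (no BOM)"
        if first4.startswith(b"<\x00"):
            return "utf-16-le (no BOM)"
        return "utf-8"  # UTF-8 needs no BOM and is the default

    print(sniff_utf('<?xml version="1.0"?>'.encode("utf-16-le")[:4]))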
> If the character set is specified as ISO/IEC 10646, in what
> circumstances would it be appropriate to use an "encoding"
> other than UTF-8 or UTF-16 ?
Whenever the encoding complies with the Unicode definition of "character"
and its repertoire of characters is identical to, or a subset of, Unicode's
repertoire. I.e., in practice, when that encoding can be converted to
Unicode without loss of meaning. I.e., even more practically, most existing
standards can be used, because one of the design goals of Unicode was to
allow round-trip conversion from a large set of pre-existing standards.
This means that an XML parser should convert the text to Unicode and use
Unicode internally, or operate as if it did.
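Here is a minimal sketch of that round trip, using Python's built-in
"euc_jp" codec (my own toy example, not from any spec):

    # Text in a pre-existing standard encoding converts to Unicode and
    # back without loss; this is what lets a parser work in Unicode
    # internally while the file stays in its native encoding.
    legacy_bytes = "漢字 and ASCII".encode("euc_jp")
    as_unicode = legacy_bytes.decode("euc_jp")            # "input" conversion
    assert as_unicode.encode("euc_jp") == legacy_bytes    # lossless round trip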
> Could I, theoretically, invent my own encoding and say that this
> is conformant XML?
I am curious about this question too. I bet that the answer is "yes", but I
also bet that "theoretically" should be underlined 100 times.
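Just to show how theoretical, here is a toy Python sketch: a home-made
"encoding" (ASCII with every byte XOR-ed with 0x01; the codec and the name
"toycodec" are entirely my invention). It maps cleanly onto Unicode
characters, so in principle it could qualify; in practice, of course, no XML
parser in the world would recognize the name.

    import codecs

    # Stateless decode/encode pair for the invented "toycodec".
    def _decode(data, errors="strict"):
        return bytes(b ^ 0x01 for b in data).decode("ascii", errors), len(data)

    def _encode(text, errors="strict"):
        return bytes(b ^ 0x01 for b in text.encode("ascii", errors)), len(text)

    codecs.register(lambda name: codecs.CodecInfo(_encode, _decode, name="toycodec")
                    if name == "toycodec" else None)

    print(b"Ih ".decode("toycodec"))   # -> Hi!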
> [...]
> <?xml version="1.0" encoding="euc-jp"?>
> Does '-jp' (or "euc-jp" collectively) imply JIS ?
Yes. Note however that "JIS" is a vague colloquial nickname for "Japanese
national encoding". EUC-JIS (aka EUC-JP) is a complex encoding which
combines several different character sets, as was explained in Jungshik
Shin's message:
" I'm afraid this is slightly misleading because EUC-JP encodes NOT
a *single* coded character set BUT *three* coded character sets,
US-ASCII/JIS X 201, JIS X 208 and JIS X 212. Moreover, calling Japanese
character sets as JIS, however common the practice might be, is not
strictly right. As you know too well, JIS just stands for Japanese
Industrial Standard under which there are numerous standards other than
coded character sets. "
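To see those character sets living together in a single byte stream, here is
a small demonstration using Python's built-in "euc_jp" codec (the sample
bytes are my own):

    samples = [
        (b"\x41",         "US-ASCII / JIS X 0201 roman: one byte below 0x80"),
        (b"\x8e\xb1",     "JIS X 0201 katakana: SS2 (0x8E) + one byte"),
        (b"\xb0\xa1",     "JIS X 0208: two bytes, both at or above 0xA1"),
        (b"\x8f\xb0\xa1", "JIS X 0212: SS3 (0x8F) + two bytes"),
    ]
    for raw, charset in samples:
        char = raw.decode("euc_jp")
        print(f"{raw!r:22} -> U+{ord(char):04X}  ({charset})")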
> We have seen references to JIS in his stuff, but would
> rather stick to interfacing with the Unicode stuff (mainly because
> it's so much easier supporting just the one thing internally, and
> we can deal with other character sets by either (converting to
> Unicode) or (promising only storage and retrieval of raw data w/o
> interpreting it in any way).
In fact, I think that this is the way most modern applications should work,
not just XML parsers: many encodings should be accepted, but they should be
converted to Unicode as soon as they come through the door. This "input"
conversion is normally lossless for most encodings.
Notice, however, that there might also be a need to convert outgoing Unicode
data to some different encoding. Generally speaking, this "output"
conversion is a problem because converting Unicode to another (more limited)
encoding is potentially lossy.
But this problem does *not* exist in XML, because a Unicode character which
is not representable in the (binary) encoding can be written as a numeric
character reference. Such a reference always denotes the *Unicode* character
value, never a value in the file's encoding, so an XML file can always be
losslessly expressed in any supported encoding.
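As a small illustration, Python's built-in 'xmlcharrefreplace' error handler
performs exactly this replacement when the target encoding falls short (the
strings are my own examples):

    text = "Grüße, 日本語"
    print(text.encode("iso-8859-1", errors="xmlcharrefreplace"))
    # -> b'Gr\xfc\xdfe, &#26085;&#26412;&#35486;'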
(Misha, I hope I finally succeeded in figuring out what you meant!)
Ciao.
_ Marco