Re: japanese xml

From: John Cowan (cowan@mercury.ccil.org)
Date: Tue Sep 04 2001 - 12:53:39 EDT


Marco Cimarosti scripsit:

> > [...] Definition: A character is an atomic unit of text as
> > specified by ISO/IEC 10646 [ISO/IEC 10646] [...]
>
> I should not try to interpret XML specs. Anyway, my understanding is that
> the XML legislators are simply saying that they adopt the Unicode definition
> of "character", and the Unicode *set* (repertoire) of characters. They are
> not saying that they mandate one of the Unicode forms as the only encoding
> for an XML source file.

Exactly correct.

> Here my understanding is that UTF-8 and UTF-16 are the minimum requirement
> (all parsers must accept them) and the default "encodings" (binary) of an XML
> source file.
>
> So, any XML parser must accept a source with no explicit "encoding=..."
> declaration, and it must treat it as either UTF-8 or UTF-16. (NOTE: once
> you know that the encoding is either UTF-8 or UTF-16, it is quite easy to
> tell one UTF from the other).

In particular, UTF-16 is only recognized without an encoding declaration
if the file begins with a BOM. UTF-8 files may or may not have a BOM.
(However, MIME Content-type headers override any internal indication
of character encoding.)
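
As a rough illustration of the BOM rule, here is a minimal Python sketch.
It assumes the entity has no encoding declaration and no external
(MIME) information, and it ignores UTF-32 and the other autodetection
cases. The function name sniff_encoding is my own invention, not part
of any parser's API.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Pick a Python codec name for an XML entity with no encoding declaration.

    Assumption: only UTF-8 and UTF-16 are possible, as discussed above.
    """
    if data.startswith(codecs.BOM_UTF8):
        # UTF-8 with an (optional) BOM; "utf-8-sig" strips it on decode.
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # UTF-16 must announce itself with a BOM; the "utf-16" codec
        # reads the BOM to choose the byte order.
        return "utf-16"
    # No BOM: treat the entity as UTF-8.
    return "utf-8"

utf16_doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
print(sniff_encoding(utf16_doc))
print(utf16_doc.decode(sniff_encoding(utf16_doc)))
```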

> > Could I, theoretically, invent my own encoding and say that this
> > is conformant XML?
>
> I am curious too about this question. I bet that the answer is "yes" but I
> also bet that "theoretically" should be underlined 100 times.

Yes, absolutely you can do this. However, you would be well advised
to provide your own parser in this case (you can modify an existing
Open Source parser, of which there are many).

You also need a distinct name for your character encoding, which
should either begin with "x-" or be registered by IANA.

> But this problem does *not* exist in XML because Unicode characters which
> would not be representable in the (binary) encoding can be represented by a
> numerical reference. This reference always refers to the *Unicode* character
> value, not to the file's encoding, so an XML file can always be losslessly
> expressed in any supported encoding.

Only the character content can be represented losslessly, not the
element type names, attribute names, enumerated attribute values,
comments, or processing instructions.
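
To make the distinction concrete, here is a small Python sketch using
the standard xml.etree.ElementTree module. The helper names
encode_content and parses are hypothetical, chosen for illustration.
Note that the "xmlcharrefreplace" handler only replaces characters the
target encoding cannot represent; it does not escape "&" or "<".

```python
import xml.etree.ElementTree as ET

def encode_content(text: str, encoding: str) -> bytes:
    """Encode character content, using numeric character references
    for anything the target encoding cannot represent."""
    return text.encode(encoding, "xmlcharrefreplace")

def parses(doc: bytes) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Character content survives a trip through plain ASCII.
content = encode_content("日本語", "ascii")
doc = b"<p>" + content + b"</p>"
round_tripped = ET.fromstring(doc).text

# A character reference is not allowed in an element type name.
name_ok = parses(b"<&#26085;/>")
```

Here round_tripped comes back as the original three characters, while
name_ok is False: the reference is only recognized in content and
attribute values, not in names.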

-- 
John Cowan           http://www.ccil.org/~cowan              cowan@ccil.org
Please leave your values        |       Check your assumptions.  In fact,
   at the front desk.           |          check your assumptions at the door.
     --sign in Paris hotel      |            --Miles Vorkosigan


