RE: japanese xml

From: Peter_Constable@sil.org
Date: Fri Aug 31 2001 - 00:52:26 EDT


Marco:

>> Furthermore, Viranga's context appears to be XML, in which
>> case it *is* possible to encode *all* Unicode code points
>> using EUC (or ISO-8859-1 or ASCII or ...)
>
>Yes, yes. XML documents can represent characters in at least two ways:

>2) By representing them with numeric references in the form "&#1234;"
etc...

>In the context of Unicode and, more generally, plain-text encoding "to
>encode" means only point 1 above, and "&1234;" is just a six-character
>string. BTW, this is also the interpretation of tools (text editor, etc.)
>used to manipulate XML files -- so it is not a pointless distinction for
>someone working in XML.
>
>Point 2, in Unicode speech, is defined a "higher level protocol",

I agreed with you earlier, but on the other hand, suppose we define
UTF-NCR8:

Unicode        bit        code      code      code      code      code
scalar value   pattern    unit 1    unit 2    unit 3    unit 4    unit 5

0020 - 0027    00wwwwww   00100110  00100011  00110011  0011xxxx  00111011
               where xxxx = wwwwww - 11110 (binary)

0028 - 0031    00wwwwww   00100110  00100011  00110100  0011xxxx  00111011
               where xxxx = wwwwww - 101000 (binary)

0032 - 003b    00wwwwww   00100110  00100011  00110101  0011xxxx  00111011
               where xxxx = wwwwww - 110010 (binary)

etc., but with a handful of exceptions, such as

U+0026:        00100110  01100001  01101101  01110000  00111011

U+003C:        00100110  01101100  01110100  00111011
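
To make the table concrete, here is a minimal sketch of a UTF-NCR8 encoder in Python. It assumes the "etc." extends the pattern to every code point as a decimal numeric character reference, and it implements only the two exceptions listed above. The function name utf_ncr8_encode is made up for illustration.

```python
# Hypothetical sketch of the "UTF-NCR8" idea: every character becomes
# an ASCII decimal numeric character reference, except for the
# exceptions listed in the message.
EXCEPTIONS = {
    0x26: b"&amp;",  # U+0026 AMPERSAND
    0x3C: b"&lt;",   # U+003C LESS-THAN SIGN
}

def utf_ncr8_encode(text: str) -> bytes:
    """Encode text as a sequence of 8-bit UTF-NCR8 code units."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in EXCEPTIONS:
            out += EXCEPTIONS[cp]
        else:
            # '&', '#', decimal digits of the scalar value, ';'
            out += f"&#{cp};".encode("ascii")
    return bytes(out)

# U+0020 comes out as the five code units of the first table row.
space_units = [format(b, "08b") for b in utf_ncr8_encode(" ")]
print(space_units)
```

For U+0020 the fourth code unit is 00110010, that is, xxxx = 100000 - 11110 = 0010.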

We can also define UTF-NCR16 in just the same way, except that its code units are
16-bit, zero-extended equivalents of the UTF-NCR8 code units. One of the
interesting aspects of these encodings is that XML parsers understand them
without requiring that the charset be declared, just like UTF-8 and
UTF-16.
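
A rough check of that claim, using Python's standard xml.etree.ElementTree parser. This is only a sketch: the helper names ncr8 and ncr16 are made up, the markup itself is left as ordinary ASCII so that only the character data is in "UTF-NCR8", and the big-endian byte order for UTF-NCR16 is an assumption, since none is specified above.

```python
import xml.etree.ElementTree as ET

def ncr8(text: str) -> bytes:
    """Hypothetical UTF-NCR8 encoder: decimal NCRs plus two exceptions."""
    special = {"&": "&amp;", "<": "&lt;"}
    return "".join(special.get(ch, f"&#{ord(ch)};") for ch in text).encode("ascii")

def ncr16(text: str) -> bytes:
    """Same code units zero-extended to 16 bits (big-endian is assumed)."""
    return ncr8(text).decode("ascii").encode("utf-16-be")

original = "日本語 & <XML>"

# No XML declaration and no charset information is supplied.
document = b"<p>" + ncr8(original) + b"</p>"
parsed = ET.fromstring(document).text

print(parsed == original)  # prints True
```

The parser recovers the original string from pure ASCII bytes, because resolving the references is part of XML processing rather than of charset decoding.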

Now, if someone interpreted Misha to mean one of these encodings, then he
would be talking about encoding in the same sense as you. :-)

Peter


