From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 21 2003 - 11:58:09 EST
From: Doug Ewell [mailto:dewell@adelphia.net]
> Unless GB 18030 prohibits invalid sequences the way Unicode does, I
> suppose there's no reason you couldn't map invalid GB 18030 sequences to
> PUA code points *within the privacy of your own application* if you
> really want to preserve them in some way, and have some idea what you
> want to do with them. You MAY NOT map them to Unicode noncharacters or
> anything outside the Unicode/10646 range (i.e. beyond U+10FFFD).
I did not propose to use such a map externally. An application or system
can use whatever internal encoding it thinks may be useful to handle
legacy cases, even invalid ones, provided that this internal encoding
is not used to create external data claiming to be Unicode. If that
module preserves the invalid sequences that were present in its input,
and provided that the input did not claim to be Unicode (which is the
case for GB18030), I don't think it violates Unicode conformance, simply
because there's no Unicode interface on this system.
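
As a rough illustration of such a purely internal mapping, here is a
minimal Python sketch. It assumes Python's built-in "gb18030" codec and
the standard codecs.register_error hook. The handler name
"gb18030-escape", the helper decode_preserving and the choice of
U+F0000 as an escape base are hypothetical conventions invented for this
example, not part of GB18030 or Unicode. The resulting strings are only
meaningful inside the application and must not be exported as Unicode.

```python
import codecs

# Hypothetical internal convention: each undecodable byte becomes the
# plane-15 private-use code point U+F0000 + byte value.
ESCAPE_BASE = 0xF0000


def _escape_invalid_bytes(err):
    """Codec error handler that preserves undecodable bytes as PUA code points."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join(chr(ESCAPE_BASE + b) for b in bad)
    # Resume decoding right after the offending bytes.
    return replacement, err.end


codecs.register_error("gb18030-escape", _escape_invalid_bytes)


def decode_preserving(data):
    """Decode GB18030 bytes, keeping invalid bytes as internal PUA escapes."""
    return data.decode("gb18030", errors="gb18030-escape")


internal_text = decode_preserving(b"A\xffB")
print([hex(ord(ch)) for ch in internal_text])
```

This sketch does nothing about PUA code points that valid GB18030 input
can itself produce, so the escapes are not guaranteed to be unambiguous.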
Such a system could be built explicitly to conform only to GB18030,
without claiming anything about Unicode. The internal use of
Unicode mappings for some sequences, and of extra mappings for characters
or sequences not in Unicode, is an internal decision that only influences
the design of the implementation: Unicode in that case is used as a
convenient tool to perform some things, but there's no required
dependency. Using Unicode algorithms or mappings internally just
eases the implementation of the other encoding.
The solution that would map invalid sequences to Unicode PUA code points
may have the problem of colliding with valid PUA code points already
used in GB18030. These invalid sequences may also carry information
that is not plain text for Unicode, such as markup or presentation
elements. This does not violate the Unicode model, which encodes ONLY
plain text and leaves other non-standard uses free for markup or
upper-layer protocols.
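
The collision concern can be checked with a short Python sketch. It
assumes Python's built-in "gb18030" codec, which can encode every
Unicode code point other than surrogates, private-use ones included.
The helper name pua_code_points is hypothetical. Any PUA code point it
reports came from perfectly valid input, so an escape scheme that reuses
the same code points for invalid bytes could not tell the two apart.

```python
# The three Unicode private-use ranges (BMP PUA, plane 15, plane 16).
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp):
    """Return True if the code point lies in a Unicode private-use range."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_code_points(data):
    """List the PUA code points produced by strictly valid GB18030 input.

    Raises UnicodeDecodeError if the input contains invalid sequences.
    """
    text = data.decode("gb18030")  # strict decoding
    return [ord(ch) for ch in text if is_private_use(ord(ch))]


# Valid GB18030 bytes that decode to private-use code points.
valid_input = "A\ue000\U000F0041".encode("gb18030")
print([hex(cp) for cp in pua_code_points(valid_input)])
```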
So my question remains: does GB18030 permanently bind out-of-range
or invalid sequences to non-characters? If not, GB18030 applications
may use them to encode something other than plain text, and there
will be a need to map them to extra planes if the internal handling
of text is best done with an extended Unicode encoding form like
UCS-4.
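
To make the "extra planes" idea concrete, here is a minimal Python
sketch of an internal, UCS-4-like form: a list of integers in which
invalid bytes are stored as values above U+10FFFF, so they cannot
collide with any Unicode code point, private-use or not. It assumes
Python's built-in "gb18030" codec. The names ESCAPE_BASE,
to_extended_ucs4 and from_extended_ucs4 and the base value 0x110000 are
hypothetical choices for this example. Such values are not Unicode and
are only usable inside the application.

```python
# Hypothetical convention: invalid byte b is stored as 0x110000 + b,
# which lies outside the Unicode code space (0..0x10FFFF).
ESCAPE_BASE = 0x110000


def to_extended_ucs4(data):
    """Convert GB18030 bytes to a list of ints, preserving invalid bytes."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.extend(ord(ch) for ch in data[pos:].decode("gb18030"))
            break
        except UnicodeDecodeError as err:
            # Keep the valid prefix that precedes the first error.
            good = data[pos:pos + err.start]
            out.extend(ord(ch) for ch in good.decode("gb18030"))
            # Escape the offending bytes one by one.
            end = max(err.end, err.start + 1)
            out.extend(ESCAPE_BASE + b for b in data[pos + err.start:pos + end])
            pos += end
    return out


def from_extended_ucs4(code_points):
    """Convert the internal form back to bytes, restoring escaped bytes."""
    out = bytearray()
    for cp in code_points:
        if cp >= ESCAPE_BASE:
            out.append(cp - ESCAPE_BASE)
        else:
            out += chr(cp).encode("gb18030")
    return bytes(out)


sample = "\u4e2d".encode("gb18030") + b"\xffA"
internal = to_extended_ucs4(sample)
print([hex(cp) for cp in internal])
```

Because the escaped values cannot be represented in UTF-16 or UTF-8,
this form stays internal by construction.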
Another solution could be that GB18030 mandates the mapping of invalid
sequences to a well-defined set of Unicode PUAs. This would allow them
to become usable in UTF-16 encoding forms. But as this mapping is not
done for now, the question of the current assignment of GB18030 invalid
sequences to non-characters remains open: is the mapping of GB18030
to Unicode completely closed, or left open for further applications
like markup (annotation or visual formatting and layout, font selection,
text alternatives, semantic or syntactic data, pointers or links to
associated information, images, custom bitmap-glyphs, sets of character
properties, phonetic variants...)?
This archive was generated by hypermail 2.1.5 : Fri Nov 21 2003 - 12:47:03 EST