From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Nov 21 2003 - 10:46:14 EST
On Fri, 21 Nov 2003 15:12:26 +0100, "Philippe Verdy" wrote:
>
> Could an editor loading such an incorrect but legacy GB-18030 file still
> load it and work with it using an internal-only UCS-4 mapping (or an
> extended UTF-8 mapping), to preserve those out-of-range sequences, as if
> they were mapped in an extra PUA range?
>
An editor which stored data internally as extended UTF-32 or extended UTF-8
could easily preserve such invalid codepoints. BabelPad stores data internally
as UTF-16, so it couldn't; and even if it could, it wouldn't, as it's a Unicode
editor, and codepoints beyond U+10FFFF are not Unicode (nor, for that matter,
are codepoints beyond <E3 32 9A 35> valid GB-18030, as far as I'm aware).
The first thing I'll do this evening is change BabelPad so that GB-18030
codepoints beyond <E3 32 9A 35> are converted to U+FFFD.
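
For illustration only, here is a rough Python sketch of the conversion described above. It is not BabelPad's code. It assumes the usual linear layout of four-byte GB-18030 sequences, with <90 30 81 30> corresponding to U+10000, and the function and constant names are made up for this example.

```python
REPLACEMENT = 0xFFFD               # U+FFFD REPLACEMENT CHARACTER
FIRST = (0x90, 0x30, 0x81, 0x30)   # assumed to correspond to U+10000
LAST = (0xE3, 0x32, 0x9A, 0x35)    # then corresponds to U+10FFFF


def linear_index(seq):
    """Position of a four-byte sequence, counting from <81 30 81 30>."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def is_four_byte_form(seq):
    """True if the bytes have the shape of a GB-18030 four-byte sequence."""
    return (
        len(seq) == 4
        and 0x81 <= seq[0] <= 0xFE
        and 0x30 <= seq[1] <= 0x39
        and 0x81 <= seq[2] <= 0xFE
        and 0x30 <= seq[3] <= 0x39
    )


def decode_supplementary(seq):
    """Map a four-byte sequence at or above <90 30 81 30> to a code point.

    Anything that lands beyond U+10FFFF becomes U+FFFD. Sequences below
    <90 30 81 30> need table lookups and are not handled by this sketch.
    """
    if not is_four_byte_form(seq):
        raise ValueError("not a well-formed four-byte sequence")
    offset = linear_index(seq) - linear_index(FIRST)
    if offset < 0:
        raise ValueError("below the supplementary range")
    code_point = 0x10000 + offset
    return code_point if code_point <= 0x10FFFF else REPLACEMENT
```

With these assumptions, <E3 32 9A 35> comes out as U+10FFFF, and the very next sequence, <E3 32 9A 36>, is the first one replaced by U+FFFD.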
> Of course saving the file into a UTF encoding would be forbidden, but saving
> the internal UCS-4 file back to GB-18030 would preserve those out-of-range
> GB-18030 sequences, without making any other interpretation, and without
> changing them arbitrarily into the GB18030 equivalent of U+FFFD?
>
> The editor could still use the Unicode rules for all valid GB18030
> sequences. And the invalid characters could then be represented, for example,
> with a colored/highlighted glyph such as <U+110000>. As both the input and
> output are not a Unicode scheme, I don't think this invalidates the Unicode
> conformance: the behavior would just be conforming to GB18030 or other
> legacy GB PUA mappings.
>
I'm pretty sure that there are no such legacy GB mappings, and I doubt that China
will ever want to map characters to extra-Unicode codepoints in GB-18030 ...
they seem far more interested in trying to force everyone else to accept their
unwanted characters in the BMP than putting them in some limbo beyond Plane 16.
Andrew