Here is the issue. Because UTF-16 is so prevalent, and to preserve
round-tripping between the other UTFs and UTF-16 (even UTF-16 containing
malformed text with non-characters and/or unpaired surrogates), a UTF
must always round-trip all code points from 0 to 10FFFF, inclusive.
It is of course permissible for a UTF converter to offer an option to
detect malformed text and throw an error.
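
As a rough illustration of the point above, here is a minimal sketch of a
UTF-8-style converter that round-trips every code point from 0 to 10FFFF,
with an optional strict mode that raises an error instead. The names
`encode`, `decode`, `is_flagged` and the `strict` parameter are made up for
this example and are not taken from any particular library. The flagged set
is limited to the cases discussed in this thread (surrogate code points and
U+xxFFFE / U+xxFFFF), and the decoder does not check for overlong forms.

```python
def is_flagged(cp):
    """True for surrogate code points and for U+xxFFFE / U+xxFFFF."""
    return 0xD800 <= cp <= 0xDFFF or (cp & 0xFFFE) == 0xFFFE


def encode(code_points, strict=False):
    """Encode integers 0..0x10FFFF using the UTF-8 bit layout."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not a code point: {cp:#x}")
        if strict and is_flagged(cp):
            raise ValueError(f"malformed text: U+{cp:04X}")
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


def decode(data, strict=False):
    """Inverse of encode(); overlong forms are not rejected in this sketch."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            trail, cp = 0, lead
        elif lead >> 5 == 0b110:
            trail, cp = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:
            trail, cp = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:
            trail, cp = 3, lead & 0x07
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        for k in range(1, trail + 1):
            if i + k >= len(data) or data[i + k] >> 6 != 0b10:
                raise ValueError(f"bad trail byte at offset {i + k}")
            cp = (cp << 6) | (data[i + k] & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError(f"value above 10FFFF at offset {i}")
        if strict and is_flagged(cp):
            raise ValueError(f"malformed text: U+{cp:04X}")
        code_points.append(cp)
        i += trail + 1
    return code_points
```

With the default `strict=False`, `decode(encode([0xD800, 0xFFFE]))` gives
back the same two values; with `strict=True` either call raises
`ValueError`.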
Mark
Doug Ewell wrote:
> Did I read recently (in a message that I shortsightedly deleted)
> something to the effect that a character encoding scheme (CES) or
> transfer encoding syntax (TES) needs to be able to encode the non-
> characters U+D800 through U+DFFF, and presumably U+xxFFFE and U+xxFFFF
> as well?
>
> I've been playing around with a TES (or maybe it's a CES; I'm still
> having a little trouble knowing exactly where to draw the line). Don't
> worry, I'm not going to propose it anywhere as Yet Another UTF. I'm
> just playing around with Unicode, and hopefully teaching myself
> something along the way.
>
> Anyway, my scheme encodes non-BMP characters not *as* surrogates, but
> using the surrogate mechanism in a slightly modified way. Like UTF-16,
> this makes it impossible to encode the BMP non-characters in the range
> U+D800 through U+DFFF. Normally I wouldn't think this was a problem,
> but I thought someone (Davis?) just said recently that it should be
> possible to round-trip these thingies, for some reason.
>
> The situation would be different in the case of U+xxFFFE and U+xxFFFF,
> because while the surrogates occupy entire ranges that can be utilized
> in a special way, you kind of have to *deliberately* exclude the FFFx
> characters. Nonetheless, the same question applies: Must these bogus
> code points be representable in a CES or TES, or can they be handled
> conformantly by raising an error or mapping them to U+FFFD?
>
> -Doug Ewell
> Fullerton, California
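
For reference, the standard UTF-16 surrogate mechanism that the quoted
message builds on, together with the handling choices it asks about, can be
sketched roughly as follows. The function name `to_utf16_units` and the
`on_surrogate` parameter are invented for this illustration. The sketch
covers only the surrogate code points; U+xxFFFE and U+xxFFFF need no special
treatment here because they encode like any other value.

```python
def to_utf16_units(code_points, on_surrogate="pass"):
    """Convert code points to 16-bit code units.

    on_surrogate chooses what happens to a code point in D800..DFFF:
      "pass"    - emit it unchanged as a single (unpaired) code unit
      "error"   - raise ValueError
      "replace" - substitute U+FFFD
    """
    units = []
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not a code point: {cp:#x}")
        if 0xD800 <= cp <= 0xDFFF:
            if on_surrogate == "error":
                raise ValueError(f"surrogate code point U+{cp:04X}")
            if on_surrogate == "replace":
                cp = 0xFFFD
        if cp < 0x10000:
            units.append(cp)
        else:
            # Split the 20 bits above 0x10000 into a high and a low half.
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units
```

Only the `"pass"` setting keeps an unpaired surrogate intact so that it can
be carried through to another UTF and back; `"error"` and `"replace"` both
give up that round trip.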
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:06 EDT