From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 12 2007 - 06:00:29 CDT
Mark E. Shoulson wrote:
> Doug Ewell wrote:
>
> > I'll see if I can find the thread where we talked about that, years ago.
> >
> > Somebody wanted to build that capability into an extension to UTF-8,
> > so it could faithfully represent invalid garbage. We were never able
> > to get him to work through what he wanted to do with the garbage thus
> > preserved.
> >
> Is there an obvious reason we couldn't just treat the garbage UTF-8 as a
> string of 8-bit characters (might be part of a binary file or something)
> and base-64 encode them? That'll definitely preserve round-trippedness.
Given the very strict conformance requirements of UTF-8, enough encoding
space is left unused for such an extension to exist without colliding with
standard UTF-8. So an extension of this kind is clearly possible in many
ways, as long as the strict conformance requirements of UTF-8 itself are
kept intact.
What this means is that UTF-8 itself does not need to be extended, and it
should not be: the extension would be private, and would need to be
labelled differently.
However, I would not use the name UTF-8 for such a thing. If one wants to
represent invalid UTF-8 sequences unchanged, the best thing to do is
NOTHING to those sequences, and simply to relabel the text with a distinct
charset identifier like "invalid-UTF-8", which would allow all valid UTF-8
sequences plus any byte sequences that are not valid UTF-8. In such an
extension charset, each invalid byte would be treated as a non-Unicode
codepoint, for example U+110000 plus the value of the byte.
Transcoding such text to strict UTF-32 would be impossible, but transcoding
it to "invalid-UTF-32" would be trivial. Transcoding it to UTF-16 would
likewise be impossible, but could be done using sequences forbidden in
standard UTF-16, such as unpaired surrogates.
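As a minimal sketch of such a decoder (in Python; the function name and
the exact mapping are illustrative assumptions, not a standard): since the
pseudo-codepoints U+110000..U+1100FF lie outside the Unicode range, the
decoder returns plain integers, which is essentially the "invalid-UTF-32"
form just described.

    def decode_invalid_utf8(data: bytes) -> list[int]:
        # Valid UTF-8 sequences become ordinary scalar values; every byte
        # of an invalid sequence becomes 0x110000 + the byte value.
        out = []
        i = 0
        while i < len(data):
            # Try the longest possible sequence first (4 bytes), then
            # progressively shorter prefixes, until one decodes cleanly
            # to exactly one scalar value.
            for length in (4, 3, 2, 1):
                chunk = data[i:i + length]
                try:
                    ch = chunk.decode('utf-8')
                except UnicodeDecodeError:
                    continue
                if len(ch) == 1:
                    out.append(ord(ch))
                    i += len(chunk)
                    break
            else:
                out.append(0x110000 + data[i])  # pseudo-codepoint
                i += 1
        return out

Note that the unpaired-surrogate route for UTF-16 mentioned above already
exists in the wild: Python's "surrogateescape" error handler maps each
invalid input byte 0xNN to the lone surrogate U+DC00+NN, precisely so the
original bytes can be recovered later.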
Document parsers using these "invalid-UTF-8", "invalid-UTF-16" or
"invalid-UTF-32" charsets would then need specific character handling to
recognize the invalid sequences and treat them as distinctive objects
(similar, but not equivalent, to valid characters). How these document
parsers treat such pseudo-characters is left to implementations, and
Unicode does not need to be updated. It is then up to applications to
decide how to treat these objects, exactly as it is left to applications
to decide what to do with documents that are not correctly encoded in the
standard UTF they are tagged with.
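For example (reusing the decode_invalid_utf8 sketch above, still purely
illustrative), an application could decide to merely report where the
preserved invalid bytes occur:

    def report_invalid_bytes(data: bytes) -> None:
        # Pseudo-codepoints above U+10FFFF mark the preserved invalid bytes.
        for index, cp in enumerate(decode_invalid_utf8(data)):
            if cp > 0x10FFFF:
                print('invalid byte 0x%02X at position %d'
                      % (cp - 0x110000, index))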
What is clear is that these documents won't be portable across systems in
a heterogeneous environment, but such use is still possible locally, and
their acceptance, use or non-use is left to applications and developers,
as long as the output of the programs accepting these documents is not
tagged as UTF-8 or another standard UTF while it still contains the
detected invalid sequences after internal processing of the accepted input
document.
As long as such an application can still correctly signal to the user that
the input documents were accepted despite containing invalid sequences,
this remains safe. (It is not safe, however, if the input document was
explicitly tagged with a standard UTF charset name and the application
accepts and processes it silently, producing valid UTF output without
signaling the interpretation caveat to the user.)
For example, an input filter could be invoked with:
$ someFilter -inputcharset "UTF-8" -outputcharset "UTF-8"
< someDocument.txt > result.UTF-8.txt
It should not generate the expected output (it should signal an error
instead) if there are invalid UTF-8 sequences in the input document, but
the same filter program could be built to accept:
$ someFilter -inputcharset "x-invalid-UTF-8" -outputcharset "UTF-8"
< someDocument.txt > result.UTF-8.txt
with the same input document and produce perfectly valid standard UTF-8
output in "result.UTF-8.txt", because it no longer pretends that
"someDocument.txt" is standard UTF-8 text (so this filter is still fully
conforming to Unicode, which does not dictate how filters should treat
other charsets that are not standard UTFs).
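As a sketch of how such a filter could work (in Python, using its real
"surrogateescape" error handler to play the role of the "x-invalid-UTF-8"
reader; the choice of substituting U+FFFD for each preserved invalid byte
is only one possible policy):

    import sys

    def filter_to_strict_utf8(raw: bytes) -> bytes:
        # Decode with "surrogateescape": each invalid byte 0xNN survives
        # as the lone surrogate U+DC00+NN instead of raising an error.
        text = raw.decode('utf-8', errors='surrogateescape')
        # Replace every smuggled byte (U+DC80..U+DCFF) with U+FFFD so the
        # result is strictly valid UTF-8.
        cleaned = ''.join('\uFFFD' if 0xDC80 <= ord(c) <= 0xDCFF else c
                          for c in text)
        return cleaned.encode('utf-8')

    if __name__ == '__main__':
        sys.stdout.buffer.write(
            filter_to_strict_utf8(sys.stdin.buffer.read()))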
Unicode will not say how the non-standard "extension" charset should be
named. In fact many schemes are possible, and each one defines a new,
separate charset; my opinion is that such "extension" charsets should
completely avoid containing "UTF" in their names, to avoid confusion.