From: Mark Davis (mark.davis@icu-project.org)
Date: Thu Sep 06 2007 - 14:34:13 CDT
Cc'ing the Unicode list in case anyone knows.
I don't know of any public ones. Years ago in ICU we tossed around the idea
of having something like that. It was roughly the following:
- Reserve 256 code points for "bytes that couldn't be converted"
- Reserve one code point for a "quote character"
When converting from a source, say possibly mangled UTF-8, convert all valid
sequences normally, except that a quote character is inserted before any
code point that happens to be one of the 256 above. Each byte of an invalid
sequence is converted to the corresponding one of the 256 code points. When
converting back, a quote character plus the following code point is converted
as an ordinary character, and any other (unquoted) occurrence of the 256 is
emitted as the raw byte it stands for. (The 257 code points could be private
use.)
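
A minimal Python sketch of the scheme described above, for X = UTF-8. The
specific code points (QUOTE at U+F6FF, the 256 byte escapes at
U+F700..U+F7FF) and the function names are assumptions chosen for
illustration; none of this was ever part of ICU. The sketch also quotes a
literal occurrence of the quote character itself, which the description above
does not spell out but which seems necessary for the reverse conversion to be
unambiguous.

```python
# Hypothetical private-use assignments; the original idea fixed no values.
QUOTE = 0xF6FF       # the "quote character"
BYTE_BASE = 0xF700   # BYTE_BASE + b stands for unconvertible byte b


def _is_reserved(cp: int) -> bool:
    """True for the quote character or any of the 256 byte escapes."""
    return cp == QUOTE or BYTE_BASE <= cp <= BYTE_BASE + 0xFF


def decode_lossless(data: bytes) -> str:
    """Bytes (possibly mangled UTF-8) -> Unicode, keeping invalid bytes."""
    out = []
    # surrogateescape turns each invalid byte b into the lone surrogate
    # U+DC00 + b; we remap those onto the reserved block.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(BYTE_BASE + (cp - 0xDC00)))
        elif _is_reserved(cp):
            # A genuine reserved code point in the source: quote it.
            out.append(chr(QUOTE))
            out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        cp = ord(ch)
        if cp == QUOTE:
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling quote character")
            out += nxt.encode("utf-8")      # quoted: ordinary character
        elif BYTE_BASE <= cp <= BYTE_BASE + 0xFF:
            out.append(cp - BYTE_BASE)      # unquoted escape: raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

For example, encode_lossless(decode_lossless(b"ok \xff caf\xc3\xa9 \xc3"))
gives back the original bytes, including the stray 0xFF and the truncated
sequence at the end.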
This would round-trip all bytes in a buffer between any single charset X and
Unicode. However, as soon as you get into a situation where you could be
outputting the resulting Unicode to a different charset Y, then it looked
like it started to break down. So it was little more than lunch
conversation.
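
One possible reading of that breakdown, sketched in Python. The escape block
at U+F700 is again a hypothetical choice, and the helper name is made up for
the example. A byte escape produced while reading charset X has no mapping in
a different charset Y, and writing the raw byte instead silently changes its
meaning.

```python
BYTE_BASE = 0xF700  # hypothetical escape block, as an assumption

# "caf" followed by a stray byte 0xE9 found in a UTF-8 (charset X) source.
escaped = "caf" + chr(BYTE_BASE + 0xE9)


def try_encode(text: str, charset: str):
    """Return the encoded bytes, or None if the charset cannot represent text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


# Charset Y = Latin-1 has no mapping for the private-use escape.
no_mapping = try_encode(escaped, "latin-1") is None

# Emitting the raw byte instead yields a valid Latin-1 character, so the
# "invalid byte" marker is lost and the text now reads as "café".
reinterpreted = (b"caf" + bytes([0xE9])).decode("latin-1")
```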
Mark
On 9/6/07, Steve Bush <Steve.Bush@neosys.com> wrote:
>
> I read somewhere that there were some proposals to work out a lossless
> scheme for round-tripping binary (i.e. all illegal UTF bytes/sequences) to
> UTF and back again.
>
> Can anyone point me in the direction of these efforts?
>
> Steve Bush
> NEOSYS Dubai.
>
> _______________________________________________
> icu-support mailing list - icu-support@lists.sourceforge.net
> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support
>
>
This archive was generated by hypermail 2.1.5 : Thu Sep 06 2007 - 14:38:05 CDT