Re: UTF-c, UTF-i

From: Doug Ewell (doug@ewellic.org)
Date: Sun Feb 27 2011 - 11:35:20 CST


    Thomas Cropley wrote:

    > Some time ago I read that only about half the pages on the internet
    > were encoded in UTF-8, and this got me wondering why. But then I
    > imagined myself as, say, a native Greek who knew that characters of
    > my native language could be encoded in one byte using a Windows or
    > ISO code-page, instead of two bytes each in UTF-8. In that
    > situation I would
    > choose to use a code-page encoding most of the time and only use UTF-8
    > if it was really needed.

    This presumes that people creating Web pages make a conscious effort
    to choose the smallest encoding. I'm not sure that's true, or even
    that most people are aware of the issue. Markup and embedded graphics
    dominate Web content anyway; pages with 100 KB of textual content are
    the exception.
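
    For a sense of the size difference being weighed here, a quick sketch
    in Python (the sample string and the codec choices are mine, purely
    for illustration):

        # Byte counts for a short Greek string in a legacy code page,
        # in UTF-8, and in UTF-16. Sample text and codecs are
        # illustrative assumptions, not from the original post.
        sample = "Καλημέρα κόσμε"
        for codec in ("cp1253", "utf-8", "utf-16-le"):
            print(codec, len(sample.encode(codec)), "bytes")

    The legacy code page comes out at roughly half the size of UTF-8 for
    pure Greek text, which is exactly the trade-off described above; the
    question is how much that matters next to the surrounding markup.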

    > It was obvious that it would be preferable to have as few character
    > encodings as possible, so the next step was to see if one encoding
    > could handle the full Unicode character set, and yet be as efficient
    > or almost as efficient as one byte per character code-pages. In other
    > words I was trying to combine the advantages of UTF-8 with
    > Windows/ISO/ISCII etc. code-pages.

    That's a pretty good description of what SCSU does.

    > My first attempt to solve this problem, which I called UTF-i, was a
    > stateful encoding that changed state whenever the first character of a
    > non-ASCII alphabetic script was encountered (hereafter I will call a
    > character from a non-ASCII alphabetic script a paged-character). It
    > didn’t require any special switching codes because the first
    > paged-character in a sequence was encoded in long form (i.e. two or
    > three bytes) and only the following paged-characters were encoded in
    > one byte. When a paged-character from a different page was
    > encountered, it would be encoded in long form, and the page state
    > variable would change. The ASCII page was always active so there was
    > no need to switch pages for punctuation or spaces.

    Again, this sounds very much like SCSU, although it is not clear why
    avoiding "special switching codes" was a goal, especially since doing
    so greatly compromises encoding efficiency for text that draws on
    more than one 64-block. I suppose you know that many languages
    require more than ASCII plus a single 64-block; try encoding Polish
    or Czech, for example.
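
    To make the "active page" idea concrete (this is only my guess at the
    general shape, not the actual UTF-i byte layout, which is not given
    here), a toy decoder might look like this; the byte ranges and the
    64-character page size are assumptions:

        # Toy decoder for a stateful, page-switching scheme: a long form
        # both yields a character and switches the active page; a single
        # high byte is then an offset into that page. Assumes well-formed
        # input. This is a sketch of the idea, NOT the real UTF-i format.
        def decode_paged(data: bytes) -> str:
            out = []
            page = 0                     # base of the active 64-block
            i = 0
            while i < len(data):
                b = data[i]
                if b < 0x80:             # ASCII page is always active
                    out.append(chr(b))
                    i += 1
                elif 0xC2 <= b <= 0xDF:  # long form: lead + trail byte
                    cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
                    page = cp & ~0x3F    # switch to that character's block
                    out.append(chr(cp))
                    i += 2
                else:                    # short form: page + low 6 bits
                    out.append(chr(page | (b & 0x3F)))
                    i += 1
            return "".join(out)

    The Polish/Czech point shows up immediately in such a scheme: the
    accented letters Polish uses fall into three different 64-blocks, so
    the page state keeps changing and the long form keeps reappearing.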

    > If I were writing an application like a text editor or browser
    > which could handle text encoded in multiple formats, the only
    > sensible approach would be to convert the encoded text to an easily
    > processed form (such as a 16-bit Unicode encoding) when the file is
    > read in, and to convert back again when the file is saved. It seems
    > to me that Microsoft has adopted this approach.

    Almost everybody has adopted this approach. It just makes sense.
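
    In sketch form, the pattern is simply decode on load, work on the
    decoded text, re-encode on save (file names and codec choices below
    are placeholders):

        # Decode-on-load / encode-on-save: all processing happens on the
        # decoded (internal Unicode) form, never on the raw bytes.
        with open("input.txt", "rb") as f:
            text = f.read().decode("utf-8")   # storage encoding -> Unicode

        text = text.upper()                   # any processing step

        with open("output.txt", "wb") as f:
            f.write(text.encode("utf-8"))     # Unicode -> storage encoding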

    > So the stateful encoded text only needs to be scanned in the forward
    > direction from beginning to end, and thus most of the difficulties are
    > avoided. The only inconvenience I can see is for text searching
    > applications which scan stored files for text strings.

    This is the tip of the problem with trying to develop a single compact
    encoding (I won't say "compressed") that meets all needs. As you have
    seen, many of the stated goals (yours and others) conflict with other
    goals.

    Here is a partial list of goals for encoding Unicode text that I have
    heard over the years, in no particular order. Try and see how many of
    these can be met by a single encoding, and how severe the penalty is in
    terms of the goals that are not met:

    • fixed-width for ease of processing
    • variable-width for shorter encoding of "common" or "likely" characters
    • ASCII compatibility (ASCII bytes are used for Basic Latin characters)
    • ASCII transparency (as above, plus ASCII bytes are not used for
    non-Basic Latin characters)
    • compatibility with Latin-1 or other legacy encoding
    • avoid NULL
    • avoid all C0 control values
    • avoid most C0 control values but encode CR, LF, tab, etc. as
    themselves
    • avoid C1 control values
    • avoid high-bit values
    • avoid specific target byte values, or constrain to a fixed set of byte
    values
    • discrete lead and trail byte values for error detection or backward
    parsing
    • retain binary order of code points (see the sketch after this list)
    • simplest algorithm to encode/decode
    • fastest algorithm to encode/decode
    • avoid "magic" tables in algorithms
    • optimized (size or speed) for arbitrary Unicode coverage
    • optimized (size or speed) for one or few blocks of code points
    • no inherent bias (size or speed) toward Basic Latin or any other block
    • avoid illegal sequences
    • avoid overlong sequences
    • avoid byte ordering issues
    • avoid signatures
    • use a signature to simplify detection
    • ability to insert or delete sequence without decoding first
    • use what the OS or programming environment prefers
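
    To make just one of these concrete: "retain binary order of code
    points" is a property UTF-8 has and UTF-16 loses once supplementary
    characters are involved. A quick check (the sample characters are
    arbitrary):

        # UTF-8 byte order matches code point order; UTF-16 does not,
        # because supplementary characters are stored as surrogates
        # (0xD800..0xDFFF), which compare lower than U+E000..U+FFFF.
        a, b = "\uFFFD", "\U00010000"    # U+FFFD < U+10000 as code points
        print(a.encode("utf-8") < b.encode("utf-8"))          # True
        print(a.encode("utf-16-be") < b.encode("utf-16-be"))  # False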

    > The other issue which needs to be addressed, is how well the text
    > encodings handle bit errors. When I designed UTF-c I assumed that bit
    > errors were extremely rare because modern storage devices and
    > transmission protocols use extensive error detection and correction.
    > Since then I have done some reading and it seems I may have been
    > wrong. Apparently present-day manufacturers of consumer electronics
    > try to save a few dollars by not including error detection for
    > memory. There also seem to be widely varying estimates of what
    > error rate you can expect. This is important because it would not
    > be prudent to design a less efficient encoding that tolerates bit
    > errors well if bit errors are extremely rare; on the other hand, if
    > errors are reasonably common, then the text encoding needs to be
    > able to localize the damage.

    I haven't done the reading and may be wrong, but if undetected errors
    in memory chips were a significant problem, I would expect to see a
    lot more error messages and a lot more trouble with unreadable
    graphics, sound, and executable files on a day-to-day basis than I
    do. Most
    transmission protocols have had excellent built-in mechanisms for error
    detection, correction, and retransmission for at least 20 years now. So
    I'm not sure that's an overriding concern either.
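
    For what it's worth, how well damage stays localized depends mostly
    on the byte structure of the encoding. UTF-8's distinct lead and
    trail byte ranges let a decoder resynchronize at the next lead byte,
    so one corrupted byte costs at most a character or two. A rough
    illustration (the sample text and the bit flipped are arbitrary):

        # Corrupt one byte of UTF-8 text: the damaged sequence decodes to
        # replacement characters, but the decoder resynchronizes at the
        # next lead byte and the rest of the text survives intact.
        data = bytearray("καλημέρα".encode("utf-8"))
        data[3] ^= 0x40                  # flip one bit in a trail byte
        print(bytes(data).decode("utf-8", errors="replace"))

    A stateful scheme like the one sketched earlier has a harder time: a
    corrupted long form can poison the page state, so every following
    short-form byte decodes into the wrong block until the next long form
    resets it.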

    > Finally I would prefer not to associate the word “compression” with
    > UTF-c or UTF-i. To most people compression schemes are inherently
    > complicated, so I would rather describe them as “efficient text
    > encodings”. Describing SCSU and BOCU-1 as compression schemes may be
    > the reason for the lack of enthusiasm for those encodings.

    Perhaps the name "compression" does discourage some implementers. I
    know that general-purpose compression algorithms that are used every day
    in a variety of environments are far more complicated than SCSU and
    BOCU-1, and that doesn't seem to stop implementers or users (the only
    tradeoffs you hear about in that realm are between size and speed). In
    any case, whatever terminological spin you put on UTF-c applies equally
    to SCSU and BOCU-1.

    > I have uploaded some sample UTF-c pages to this site
    > http://web.aanet.com.au/tec if you are interested in testing
    > whether your server and browser can accept them.

    Not surprisingly, my browser happily downloaded the pages but had no
    idea how to interpret the characters. Labeling the pages as
    "iso-8859-7" and "windows-1251" didn't help. And incompatibility with
    existing software is also part of the problem, as Asmus said.
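
    That is all the charset label does, after all: heuristics aside, the
    browser decodes the bytes with whatever codec the label names. If the
    bytes are not actually in that encoding, you just get different wrong
    characters. A rough illustration with ordinary UTF-8 bytes (the real
    UTF-c byte stream would differ, so this only shows the general
    effect):

        # Bytes in one encoding, decoded per a mismatched charset label:
        # the result is mojibake, which is why relabeling a page cannot
        # make an unsupported encoding display correctly.
        raw = "καλά".encode("utf-8")
        print(raw.decode("iso-8859-7"))  # garbled, not the original word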

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

