UTF-c, UTF-i

From: Thomas Cropley (tomcropley@gmail.com)
Date: Sat Feb 26 2011 - 21:37:12 CST

    Many thanks to everybody for their comments on UTF-c, especially to Philippe
    Verdy. I have been reading them all with much interest.

    First of all I would like to clarify what motivated me to develop UTF-c.
    Some time ago I read that only about half the pages on the internet were
    encoded in UTF-8, and this got me wondering why. But then I imagined
    myself as, say, a native Greek speaker who knew that characters of my
    native language could be encoded in one byte using a Windows or ISO
    code-page, instead of two bytes in UTF-8. In that situation I would
    choose to use a code-page encoding most of the time and only use UTF-8
    if it was really needed. It was obvious that it would be preferable to
    have as few character encodings as possible, so the next step was to see
    if one encoding could handle the full Unicode character set and yet be
    as efficient, or almost as efficient, as one-byte-per-character
    code-pages. In other words I was trying to combine the advantages of
    UTF-8 with Windows/ISO/ISCII etc. code-pages.

    My first attempt to solve this problem, which I called UTF-i, was a stateful
    encoding that changed state whenever the first character of a non-ASCII
    alphabetic script was encountered (hereafter I will call a character from a
    non-ASCII alphabetic script a paged-character). It didn't require any
    special switching codes because the first paged-character in a sequence was
    encoded in long form (i.e. two or three bytes) and only the following
    paged-characters were encoded in one byte. When a paged-character from a
    different page was encountered, it would be encoded in long form, and the
    page state variable would change. The ASCII page was always active so there
    was no need to switch pages for punctuation or spaces. So in effect it was a
    dynamic code-page switching encoding. It had the advantage of not requiring
    the use of C0 control characters or a file prefix (magic number, pseudo-BOM,
    signature). It didn't take me long to reject UTF-i though, because although
    it would have been suitable for storage and transmission purposes, trying to
    write software like an editor or browser for a stateful encoding like UTF-i
    would be a nightmare.
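
    To make the idea concrete, here is a rough sketch in Python of how a
    decoder for a dynamic code-page scheme of this kind might work. The byte
    layout below (ASCII below 0x80, one-byte offsets into the current page in
    the range 0x80-0xBF, and a two-byte UTF-8-style long form that also
    switches the active page) is purely illustrative; it is not the actual
    UTF-i format, only an assumed layout chosen to show how the long form can
    double as the page switch without any dedicated escape codes.

    def decode_dynamic_page(data: bytes) -> str:
        """Decode a toy dynamic code-page stream (illustrative, not real UTF-i)."""
        out = []
        page_base = 0          # base of the currently active page (none selected yet)
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:       # the ASCII page is always active
                out.append(chr(b))
                i += 1
            elif b < 0xC0:     # short form: one-byte offset into the current page
                out.append(chr(page_base + (b & 0x3F)))
                i += 1
            else:              # long form: two bytes, and it also switches the page
                cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
                out.append(chr(cp))
                page_base = cp & ~0x3F
                i += 2
        return "".join(out)

    # Greek "αβ γδ": only the first Greek letter needs the two-byte long form;
    # the following Greek letters, and the ASCII space, take one byte each.
    print(decode_dynamic_page(bytes([0xCE, 0xB1, 0xB2, 0x20, 0xB3, 0xB4])))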

    It now occurs to me that I may have been too hasty in my rejection of UTF-i.
    If I were writing an application like a text editor or browser which
    could handle text encoded in multiple formats, the only sensible approach
    would be to convert the encoded text to an easily processed form (such as
    a 16-bit Unicode encoding) when the file is read in, and to convert back
    again when the file is saved. It seems to me that Microsoft has adopted
    this approach.
    So the stateful encoded text only needs to be scanned in the forward
    direction from beginning to end, and thus most of the difficulties are
    avoided. The only inconvenience I can see is for text searching applications
    which scan stored files for text strings.
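
    Continuing the same illustrative sketch (the same assumed byte layout,
    not the real UTF-i format), the matching encoder only ever carries the
    page state forward, which is all that a read-convert-save pipeline of
    this kind requires:

    def encode_dynamic_page(text: str) -> bytes:
        """Encode text into the toy dynamic code-page form (illustrative only)."""
        out = bytearray()
        page_base = 0                          # no non-ASCII page selected yet
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:                      # the ASCII page is always active
                out.append(cp)
            elif cp <= 0x7FF and (cp & ~0x3F) == page_base:
                out.append(0x80 | (cp & 0x3F)) # same page: one byte
            elif cp <= 0x7FF:                  # new page: long form, switches the page
                out.append(0xC0 | (cp >> 6))
                out.append(0x80 | (cp & 0x3F))
                page_base = cp & ~0x3F
            else:
                raise ValueError("toy sketch only covers code points up to U+07FF")
        return bytes(out)

    # Only the first letter of a run costs two bytes; the rest cost one each.
    print(encode_dynamic_page("αβ γδ").hex(" "))   # ce b1 b2 20 b3 b4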

    The other issue which needs to be addressed is how well the text
    encodings handle bit errors. When I designed UTF-c I assumed that bit
    errors were extremely rare, because modern storage devices and
    transmission protocols use extensive error detection and correction.
    Since then I have done some reading and it seems I may have been wrong.
    Apparently, present-day manufacturers of consumer electronics try to save
    a few dollars by not including error detection for memory. There also
    seem to be widely varying estimates of what error rate you can expect.
    This is important because it would not be prudent to design a less
    efficient encoding that tolerates bit errors well if bit errors are
    extremely rare; on the other hand, if errors are reasonably common, then
    the text encoding needs to be able to localize the damage.
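
    As a small illustration of the "localize the damage" point, here is what
    a single flipped bit does to ordinary UTF-8 (used here only because a
    decoder for it ships with Python): the damage stays confined to one
    character, because the decoder resynchronizes at the next lead byte. In
    a stateful scheme, a flipped bit in the byte that selects the page could
    instead garble the whole run that follows it.

    good = "αβγ δεζ".encode("utf-8")
    bad = bytearray(good)
    bad[0] ^= 0x40              # flip one bit in the first lead byte
    # Only the damaged character is lost (it decodes as replacement
    # characters); the rest of the text comes through intact.
    print(bytes(bad).decode("utf-8", errors="replace"))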

    Finally, I would prefer not to associate the word "compression" with
    UTF-c or UTF-i. To most people, compression schemes are inherently
    complicated, so I would rather describe them as "efficient text
    encodings". Describing SCSU and BOCU-1 as compression schemes may be the
    reason for the lack of enthusiasm for those encodings.

    I have uploaded some sample UTF-c pages to this site
    http://web.aanet.com.au/tec if you are interested in testing whether your
    server and browser can accept them.

                  Tom


