Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 12:38:22 CST

Next message: Philippe VERDY: "Re: RE: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Doug Ewell: "Re: 32'nd bit & UTF-8"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: 32'nd bit & UTF-8"

On 2005/01/19 06:44, Doug Ewell at dewell@adelphia.net wrote:

> Hans Aberg <haberg at math dot su dot se> wrote:
>
>> The UTF-BSS ("UTF-8") is not sensitive to the big/little-endian issue. And
>> perhaps people might invent other, creative uses.
>
> Here's a creative use that shows how UTF-8 does NOT need to be overloaded
> in this way.
>
> I'm developing a "database" (not in the formal sense) of Unicode
> character names, and one of my design goals is to keep the size of the
> file down. I'm storing each word separately as a token, and using
> zero-terminated strings to store sequences of tokens.
>
> Obviously some words, such as LETTER, occur more often in character
> names than others, such as ZZURX, and so I wanted to be able to store
> commonly occurring tokens in fewer bytes than less common tokens. That
> initially pointed me toward using UTF-8 for the strings of tokens, even
> though the UTF-8 sequences wouldn't really be representing "characters"
> as such.
>
> But eventually, I realized that my requirements for this format aren't
> the same that drove the creation of UTF-8:

What you describe here is a special case of data compression. You may
benefit from looking up some of the existing algorithms. Of course, UTF-8 is
only one format, suited to communicating Unicode code points. Other
applications should use other formats.

The extension we have discussed here has one interesting property:
insensitivity to endianness. There are a number of binary formats that are
otherwise better suited to distributed code applications, such as the one
used by CORBA. But if one has a file of 32-bit values, and wants to put it up
on the Internet and be sure that the endianness comes out right, I just noted
that such a UTF-8 extension could be used for that. Most likely, people are
developing other such byte formats for special uses. This is probably not
really of much concern to Unicode. But if, for some unforeseen reason, one
would want to go beyond the 21-bit limit, it might be good to know what it
should look like. And in my regular expression generator, I can do whatever I
want once I go beyond the 21-bit limit -- I need only make sure that the user
finds it convenient.
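
To make the endianness point concrete, here is a minimal sketch in Python of
what such an extension might look like. It is not valid UTF-8 and not any
agreed format. It assumes the 5- and 6-byte forms of the original UTF-8
definition (lead bytes 0xF8 and 0xFC), plus a hypothetical 7-byte form with
lead byte 0xFE to reach bit 32. The names encode_word and decode_words are
made up for this illustration, and overlong forms are not rejected.

```python
# Hypothetical UTF-8-style encoding of 32-bit values (NOT valid UTF-8).
# Each row: (exclusive upper bound, lead-byte marker, continuation bytes).
_FORMS = [
    (1 << 7, 0x00, 0),
    (1 << 11, 0xC0, 1),
    (1 << 16, 0xE0, 2),
    (1 << 21, 0xF0, 3),   # standard UTF-8 stops within this form
    (1 << 26, 0xF8, 4),   # 5-byte form of the original UTF-8 definition
    (1 << 31, 0xFC, 5),   # 6-byte form of the original UTF-8 definition
    (1 << 36, 0xFE, 6),   # assumed 7-byte form, needed for the 32nd bit
]


def encode_word(value):
    """Encode one unsigned 32-bit value as a byte sequence."""
    if not 0 <= value < (1 << 32):
        raise ValueError("value does not fit in 32 bits")
    for limit, marker, ncont in _FORMS:
        if value < limit:
            break
    # Most significant 6-bit group first, so byte order is fixed by the format.
    tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(ncont - 1, -1, -1)]
    lead = marker | (value >> (6 * ncont))
    return bytes([lead] + tail)


def decode_words(data):
    """Decode a byte string produced by encode_word into a list of integers."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        ones = 0  # number of leading 1 bits in the lead byte
        while ones < 8 and lead & (0x80 >> ones):
            ones += 1
        if ones == 1 or ones == 8:
            raise ValueError("invalid lead byte at offset %d" % i)
        ncont = 0 if ones == 0 else ones - 1
        value = lead & (0xFF >> (ones + 1))
        chunk = data[i + 1 : i + 1 + ncont]
        if len(chunk) != ncont or any((b & 0xC0) != 0x80 for b in chunk):
            raise ValueError("bad continuation bytes at offset %d" % i)
        for b in chunk:
            value = (value << 6) | (b & 0x3F)
        values.append(value)
        i += 1 + ncont
    return values
```

For values up to 0x10FFFF (surrogates aside) the output coincides with
ordinary UTF-8, and since the stream is defined byte by byte, a reader on any
machine gets the same integers back without a byte-order mark or a swap.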

Hans Aberg

This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 12:39:36 CST