From: Hans Aberg (haberg@math.su.se)
Date: Fri Apr 29 2005 - 04:01:11 CST
At 21:26 -0700 2005/04/28, Asmus Freytag wrote:
>I think the encoding model used by Unicode is reasonably well
>presented in Unicode Technical Report #17: "Character Encoding
>Model" http://www.unicode.org/reports/tr17/. If you think that
>presentation should be improved, I invite you to file a specific
>suggestion using the online reporting form.
This is essentially OK. You have a practical problem of bringing it
out to the public, it seems. :-)
What I would have done somewhat differently, as a
mathematician, is to develop it around a group of separate concepts
and then link them together, rather than throwing the different pieces
all together in one lump.
For example, I would not have used the word "character" everywhere, and
I would have used the word "set" for a collection of something, rather
than different words like "repertoire". So, "abstract character set"
seems better than "abstract character repertoire" in a
technical definition, although the latter term might be used
informally. Then, by Bourbaki "abuse of language", agreeing to drop
the word "abstract" when the context is clear, what you call "Coded
Character Set", I would have called "character set numbering". There
is also a mathematical difference between
  a mapping from an abstract character repertoire to a set of
  nonnegative integers
and
  a mapping from an abstract character repertoire to the set of
  nonnegative integers
In modern formal mathematical language, a function comes with both
a domain and a codomain. Even though Unicode probably thinks of this
codomain as fixed and finite, it suffices in this context to let
it be the set of nonnegative integers (i.e., the set of natural
numbers). Then you have
  Character Encoding Form
    a mapping from a set of nonnegative integers that are elements
    of a CCS to a set of sequences of particular code units of some
    specified width, such as 32-bit integers
  Character Encoding Scheme
    a reversible transformation from a set of sequences of code
    units (from one or more CEFs) to a serialized sequence of bytes
Here I would have inserted the concept of an integer to binary
transformation (function, map), which as such does not have anything
to do with characters. One gets a character encoding when combining
the character numbering map with an integer to binary transformation.
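As a minimal sketch of that composition (in Python; the function names
are chosen here for illustration and do not come from TR17): `number`
stands in for the character numbering map, assumed to be Python's
built-in `ord`, and `to_binary` is one possible integer to binary
transformation, a fixed-width 32-bit big-endian one. The point is only
that the second map knows nothing about characters.

```python
import struct


def number(ch: str) -> int:
    """Character numbering map: abstract character -> nonnegative integer.

    Assumption: Python's built-in ord() stands in for the numbering.
    """
    return ord(ch)


def to_binary(n: int) -> bytes:
    """Integer to binary map: nonnegative integer -> 4 bytes, big-endian.

    It knows nothing about characters; it only needs 0 <= n < 2**32.
    """
    return struct.pack(">I", n)


def encode(text: str) -> bytes:
    """Character encoding: to_binary composed with number, per character."""
    return b"".join(to_binary(number(ch)) for ch in text)
```

With these choices, `encode("A")` yields the four bytes 00 00 00 41,
which agrees with Python's "utf-32-be" codec for this input. Swapping
in a different `to_binary` changes the encoding without touching the
numbering.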
Also, the wording "[the] integers that are elements of a CCS" is
formally incorrect, as they are part of the range (i.e., map image)
of the CCS; so it should have been "[the] integers that are in the
range of a CCS". From the definitions, it is hard to immediately see
the difference between "Character Encoding Form" and "Character
Encoding Scheme"; it appears that the former means that the codomain
of the character number map has been fixed, whereas the latter means
an integer to binary encoding with restricted domain. Also, does the
word "reversible" indicate that the map is invertible or injective
(one-to-one)? If the map is injective, then the inverse image of
every singleton is a singleton, so that the character sequence can be
extracted from the encoded text. Then, as there are many character
maps involved, I would have given your "character map" notion a more
descriptive name, "character [to binary] encoding", i.e., what you
get when combining the two maps, the character numbering map and the
integer to binary map. You can insert notions of domain and codomain
restrictions here, but the final map, the character encoding map,
will of course be the same if the original character set is not cut
off in the process.
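The injective versus invertible distinction can be made concrete on a
small finite example. The following is only a sketch (in Python; the
toy numbering and the helper names `is_injective`, `is_invertible` and
`inverse_on_image` are assumptions for illustration, not anything
defined in TR17). A map is represented as a dict, with the codomain
given explicitly.

```python
def is_injective(f: dict) -> bool:
    """True if no two arguments share a value."""
    return len(set(f.values())) == len(f)


def is_invertible(f: dict, codomain: set) -> bool:
    """True if f is injective and its image is the whole codomain."""
    return is_injective(f) and set(f.values()) == codomain


def inverse_on_image(f: dict) -> dict:
    """Inverse of an injective f, defined only on the image of f."""
    if not is_injective(f):
        raise ValueError("map is not injective; decoding would be ambiguous")
    return {value: key for key, value in f.items()}


# Toy numbering of a three-element character set (hypothetical values).
numbering = {"a": 0, "b": 1, "c": 5}
naturals_up_to_5 = set(range(6))  # a codomain larger than the image

decode = inverse_on_image(numbering)
```

Here `numbering` is injective, so `decode` recovers each character from
its number, but it is not invertible with respect to the codomain
{0, ..., 5}, since 2, 3 and 4 are not in its image. Restricting the
codomain to the image {0, 1, 5} makes the same assignment invertible.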
In short, there is nothing wrong with the model itself, but there are
some problems in focusing it logically, and in its definitions.
-- Hans Aberg