From: William J Poser (wjposer@ldc.upenn.edu)
Date: Sun May 31 2009 - 19:26:41 CDT
> There is only one UTF-8, the one defined by Unicode and ISO/IEC 10646,
> which maps valid Unicode/10646 scalar values to sequences of bytes.
> Anything else is not UTF-8. Keep repeating this to yourself.
If I understand Hans Aberg's point, he means that one can abstract
the mapping from the non-negative integers to byte sequences used by
UTF-8 away from Unicode and use it for other purposes. One could,
for example, have a "UTF-8" encoding of the TRON indexed character
set, or of Nelson numbers. In this sense, there is "UTF-8", the
integer->byte sequence mapping, and UTF-8, the Unicode transformation
format that uses this mapping. This seems to me to be a perfectly valid point.
However, so as to avoid confusion, we ought to call them different
things, and since the "U" of "UTF-8" stands for "Unicode", it is the
mapping in the abstract that ought to be given another name, perhaps
the "Thompson mapping" or "diner encoding".
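
To make the distinction concrete, here is a minimal sketch of the mapping in the abstract, assuming the original six-byte layout that covers integers up to 2**31 - 1. The name thompson_encode is hypothetical, borrowed from the suggestion above. It applies none of the Unicode restrictions (no upper limit of 0x10FFFF, no exclusion of surrogates), so it agrees with UTF-8 proper only on valid Unicode scalar values.

```python
def thompson_encode(n: int) -> bytes:
    """Map a non-negative integer below 2**31 to 1-6 bytes using the UTF-8 bit layout.

    Hypothetical helper for illustration; it is not UTF-8 proper, because it
    accepts integers that are not Unicode scalar values.
    """
    if not 0 <= n < 2**31:
        raise ValueError("value outside the 31-bit range of the six-byte layout")
    if n < 0x80:
        return bytes([n])  # one byte, high bit clear
    # (exclusive upper limit, total bytes, lead-byte marker bits)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if n < limit:
            break
    # Peel off six bits at a time for the continuation bytes (10xxxxxx),
    # lowest bits first, then reverse them into transmission order.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([lead | n] + tail[::-1])


if __name__ == "__main__":
    # A Unicode scalar value: matches Python's own UTF-8 encoder.
    print(thompson_encode(0x20AC) == "\u20ac".encode("utf-8"))
    # An integer beyond Unicode's range still has an image under the mapping.
    print(thompson_encode(0x110000).hex())
```

An index into some other character set, or any other non-negative integer in range, passes through the same function unchanged, which is the sense in which the mapping is independent of Unicode.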
Bill