From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Nov 25 2004 - 14:05:18 CST
From: "Antoine Leca" <Antoine10646@leca-marti.org>
> On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:
>>
>> I'm not seeing a lot in this thread that adds to the store of
>> knowledge on this issue, but I see a number of statements that are
>> easily misconstrued or misapplied, including the thoroughly
>> discredited practice of storing information in the high
>> bit, when piping seven-bit data through eight-bit pathways. The
>> problem  with that approach, of course, is that the assumption
>> that there were never going to be 8-bit data in these same pipes
>> proved fatally wrong.
>
> Since I was the person who did introduce this theme into the thread, I
> feel there is an important point that should be highlighted here. The
> "widely discredited practice of storing information in the high bit" is
> in fact like the Y2K problem, a bad consequence of past practices. Only
> difference is that we do not have a hard time limit to solve it.
Whether an application chooses to use the 8th (or even 9th...) bit of a 
storage, memory, or networking byte that also holds an ASCII-coded 
character as a zero, as an even or odd parity bit, or for any other purpose 
is the application's choice. It does not change the fact that this extra 
bit (or these extra bits) is not used to code the character itself.
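A minimal sketch of the practice described above: packing a 7-bit ASCII code together with an even parity bit in the high bit of a byte. The function names here are illustrative, not from any standard; the point is that stripping the extra bit recovers the character code unchanged.

```python
def with_even_parity(code: int) -> int:
    """Set bit 7 so the total number of 1 bits in the byte is even."""
    assert 0 <= code <= 0x7F, "ASCII codes occupy 0..127 only"
    parity = bin(code).count("1") % 2   # 1 if the 7-bit code has odd weight
    return code | (parity << 7)

def strip_parity(byte: int) -> int:
    """Recover the character code: the parity bit is not part of it."""
    return byte & 0x7F

# 0x41 ('A') has two 1 bits, so its parity bit is 0 and the byte is unchanged;
# 0x61 ('a') has three 1 bits, so the high bit is set to 1.
assert with_even_parity(ord("A")) == 0x41
assert with_even_parity(ord("a")) == 0xE1
assert strip_parity(0xE1) == ord("a")
```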
I see this usage as a data structure that *contains* (I don't say *is*) a 
character code. This is completely outside the scope of the ASCII encoding 
itself, which is concerned only with the codes assigned to characters, and 
only with characters.
In ASCII, as in all other ISO 646 charsets, code positions are ALL in the 
range 0 to 127. Nothing is defined outside this range, exactly as Unicode 
does not define or mandate anything for code points larger than 0x10FFFF, 
whether they are stored or handled in memory with 21-, 24-, 32-, or 64-bit 
code units, more or less packed according to architecture or network 
framing constraints.
So the question of whether an application can or cannot use the extra bits 
is left to the application, and it has no influence on the standard charset 
encoding or on the encoding of Unicode itself.
So a good question to ask is how to handle values of variables or instances 
that are supposed to contain a character code, but whose internal storage 
can accommodate values outside the designated range. For me this is left to 
the application, but many applications will simply assume that such a 
datatype holds exactly one code per designated character. Using the extra 
storage bits for something else breaks this legitimate assumption, so 
applications must be specially prepared to handle the case, by filtering 
values before checking for character identity.
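The "filter before checking" point above can be sketched as follows, under the assumption that an application lets the extra storage bits carry unrelated, application-private data. The mask and function name are illustrative:

```python
ASCII_MASK = 0x7F  # the definition domain of ASCII is 0..127

def is_same_char(stored_a: int, stored_b: int) -> bool:
    # Compare character identity only; ignore application-private bits.
    return (stored_a & ASCII_MASK) == (stored_b & ASCII_MASK)

# 0x41 and 0xC1 differ only in the high (non-character) bit,
# so both contain the character 'A':
assert is_same_char(0x41, 0xC1)
assert not is_same_char(0x41, 0x42)   # 'A' vs 'B'
```

Without the mask, a naive equality test on the raw storage units would wrongly report two copies of the same character as different.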
Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do 
there. The code positions or code points they define are *unique* only 
within their *definition domain*. If you use a larger domain for values, 
nothing in Unicode, ISO 646, or ASCII defines how to interpret them: these 
standards will NOT assume that the low-order bits can safely be used to 
index equivalence classes, because such equivalence classes cannot be 
defined strictly within the definition domain of these standards.
So I see no valid rationale for requiring applications to clear the extra 
bits, to leave them unaffected, or to necessarily interpret the low-order 
bits as valid code points.
We are outside the definition domain, so any larger domain is 
application-specific, and applications may as well use ASCII or Unicode 
within storage code units that add an offset, multiply the standard codes 
by a constant, or apply a reordering transformation (a permutation) to them 
and to other possible non-character values.
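A toy illustration of the claim above: any invertible, application-specific representation preserves the identity of codes within the definition domain. The offset value here is arbitrary, purely for illustration:

```python
OFFSET = 0x100  # arbitrary application-specific offset

def to_storage(code: int) -> int:
    """Store an ASCII code in a 16-bit unit, shifted by an offset."""
    assert 0 <= code <= 0x7F, "only the definition domain is representable"
    return code + OFFSET

def from_storage(unit: int) -> int:
    """Invert the mapping to recover the standard code."""
    return unit - OFFSET

# The round trip preserves the identity of every code in 0..127:
for c in range(128):
    assert from_storage(to_storage(c)) == c
```

The same argument applies to multiplication by a constant or any permutation: as long as the transformation is invertible over the definition domain, the representation is faithful.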
When ASCII, and ISO 646 in general, define a charset with 128 unique code 
positions, they do not say how this information will be stored: an 
application may just as well need 7 distinct bytes (or other 
structures...), not necessarily consecutive, to *represent* the unique 
codes that stand for ASCII or ISO 646 characters. Nor do they restrict the 
use of these codes separately from any other independent information (such 
as parity bits, or anything else). Any storage structure that preserves the 
identity and equivalences of the original standard codes within their 
definition domain is equally valid as a representation of the standard, but 
such a structure is out of scope of the charset definition.
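The parenthetical about "7 distinct bytes" can be made concrete with a deliberately exotic but still valid representation, sketched here with illustrative names: spread the 7 bits of an ASCII code across 7 separate bytes. It is faithful because it keeps the identity of every code in the 0..127 definition domain:

```python
def explode(code: int) -> list[int]:
    """Represent an ASCII code as 7 bytes, one bit per byte (LSB first)."""
    assert 0 <= code <= 0x7F
    return [(code >> i) & 1 for i in range(7)]

def implode(bits: list[int]) -> int:
    """Reassemble the standard code from its exploded representation."""
    return sum(b << i for i, b in enumerate(bits))

assert explode(1) == [1, 0, 0, 0, 0, 0, 0]
assert implode(explode(ord("Z"))) == ord("Z")
```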
This archive was generated by hypermail 2.1.5 : Thu Nov 25 2004 - 14:09:55 CST