From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 26 2004 - 09:23:38 CST
From: "Antoine Leca" <Antoine10646@leca-marti.org>
> On Thursday, November 25th, 2004 08:05Z, Philippe Verdy wrote:
>>
>> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
>> the range 0 to 127. Nothing is defined outside of this range, exactly
>> like Unicode does not define or mandate anything for code points
>> larger than 0x10FFFF, should they be stored or handled in memory with
>> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
>> architecture or network framing constraints.
>> So the question of whether an application can or cannot use the extra
>> bits is left to the application, and this has no influence on the
>> standard charset encoding or on the encoding of Unicode itself.
>
> What you seem to miss here is that, given that computers are nowadays
> based on 8-bit units, there was a strong move in the '80s and the '90s to
> _reserve_ ALL the 8 bits of the octet for characters. And what A. Freitag
> was asking was precisely to avoid introducing different ideas about
> possibilities to encode other classes of information inside the 8th bit
> of an ASCII-based storage of a character.
This is true, for example, of an API that simply says that a "char" (or 
whatever datatype is convenient in some language) contains an ASCII code or 
a Unicode code point, and expects that the datatype instance will be equal 
to that ASCII code or Unicode code point.
In that case, the assumption of such an API is that you can compare "char" 
instances directly for equality instead of first extracting the effective 
code points, and this greatly simplifies the programming.
So an API that says that a "char" will contain ASCII code positions should 
always assume that only the instance values 0 to 127 will be used; the same 
applies if an API says that an "int" contains a Unicode code point.
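To make that concrete, here is a minimal Java sketch (the class and method 
names are only illustrative, not from any existing API) of what such a 
contract buys: because the contract guarantees that every value is a plain 
ASCII code in 0..127, the implementation can compare values directly against 
character literals.

    public final class AsciiScanner {
        /** Returns the index of the first '@' in the given ASCII codes, or -1. */
        public static int indexOfAt(int[] asciiCodes) {
            for (int i = 0; i < asciiCodes.length; i++) {
                // Valid only because the contract guarantees 0..127: every '@'
                // is stored exactly as 64, never as 192 or any other value
                // with extra bits set.
                if (asciiCodes[i] == '@') {
                    return i;
                }
            }
            return -1;
        }
    }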
The problem arises only when the same datatype is also used to store 
something else (even if it's just a parity bit or a bit forced to 1).
As long as such a use is not documented with the API itself, it should not 
be made, in order to preserve the rational assumption that identities of 
chars match identities of codes.
So for me, a protocol that adds a parity bit to the ASCII code of a 
character is doing that on purpose, and this use should be isolated in a 
documented part of its API. If the protocol wants to send this data to an 
API or interface that does not document this use, it should remove/clear the 
extra bit, to make sure that the character identity is preserved and 
interpreted correctly (I can't see how such a protocol implementation could 
expect that a '@' character coded as 192 would be correctly interpreted by 
another, simpler interface that expects all '@' instances to be equal to 
64...)
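As a minimal sketch of that separation (all names here are hypothetical), a 
parity layer in Java could set and strip bit 7 itself, and only ever hand 
the plain 7-bit code to interfaces that do not document the parity bit:

    public final class ParityLayer {
        /** Strips the parity bit so only the 7-bit ASCII code remains. */
        public static int stripParity(int octetWithParity) {
            return octetWithParity & 0x7F;   // 192 ('@' with parity) -> 64 ('@')
        }

        /** Sets bit 7 so that the total number of 1-bits is even. */
        public static int addEvenParity(int asciiCode) {
            int ones = Integer.bitCount(asciiCode & 0x7F);
            return (asciiCode & 0x7F) | ((ones & 1) << 7);   // '@' (64) -> 192
        }
    }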
In safe programming, any unused field in a storage unit should be given a 
mandatory default. As the simplest form that preserves the code identity in 
ASCII, or the code point identity in Unicode, is the one that uses 0 as this 
default, extra bits should be cleared. If not, anything can happen within 
the recipient of the "character":
- the recipient may interpret the value as something other than a character, 
behaving as if the character data were absent (so there will be data loss, 
in addition to unexpected behavior). This is bad practice, given that it is 
not documented in the recipient's API or interface.
- the recipient may interpret the value as another character, or may not 
recognize the expected character. This is not clearly a bad programming 
practice for recipients, because it is the simplest form of handling for 
them. However, the recipient will not behave the way the sender expects, 
and that is the sender's fault, not the recipient's.
- the recipient may take additional unexpected actions in addition to the 
normal handling of the character without the extra bits. This would be a bad 
programming practice for recipients if this specific behavior is not 
documented, so senders should not need to care about it.
- the recipient may filter/ignore the value completely... resulting in data 
loss; this may sometimes be a good practice, but only if this recipient 
behavior is documented.
- the recipient may filter/ignore the extra bits (for example by masking); 
for me it's a bad programming practice for recipients...
- the recipient may substitute another value for the incorrect one (such as 
the ASCII SUB control, or the Unicode replacement character U+FFFD, to mark 
the presence of an error without changing the string length).
- an exception may be raised (so the interface call will fail) because the 
given value does not belong to the expected ASCII code range or Unicode code 
point range. This is the safest practice for recipients working under the 
"design by contract" model: check the domain value range of all incoming 
data or parameters, to force senders to obey the contract (see the sketch 
after this list).
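A minimal Java sketch of that last option, with hypothetical names: a 
recipient that checks its input domain and rejects everything outside the 
documented range, so senders are forced to respect the contract.

    public final class CodePointSink {
        /** Accepts one Unicode code point; anything else breaks the contract. */
        public void accept(int codePoint) {
            if (codePoint < 0 || codePoint > 0x10FFFF) {
                throw new IllegalArgumentException(
                    "Not a Unicode code point: 0x" + Integer.toHexString(codePoint));
            }
            // ... normal handling of the character goes here ...
        }
    }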
Don't blindly expect that any interface capable of accepting ASCII codes in 
8-bit code units will also transparently accept all values outside of the 
restricted ASCII code range, unless this behavior is explicitly documented, 
saying how the character will be handled and whether this extension adds 
some equivalences (for example when the recipient will discard the extra 
bits)...
The only safe way is then:
- to send only values in the definition range of the standard encoding.
- to not accept values out of this range, by raising a run-time exception. 
Run-time checking may sometimes be avoided in languages that support value 
ranges in their datatype definitions; but this requires a new API with 
datatypes more explicitly restricted than the basic character datatype (the 
Character class in Java is such a datatype: it can only hold a valid UTF-16 
code unit, and its code point methods restrict acceptable values to the 
Unicode range 0..0x10FFFF)...
- to create separate datatype definitions if one wants to pack more 
information in the same storage unit (for example by defining bitfield 
structures in C/C++, or by hiding this packing within the private 
implementation of the storage, not accessible directly without accessor 
methods, and not exposing these storage details to the public or protected 
interfaces), possibly with several constructors (provided that the API can 
also be used to determine whether an instance is a character or not), but 
with at least an API to retrieve the original unique standard code from the 
instance (a sketch follows this list).
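Here is a minimal Java sketch of such a separate datatype (every name is 
hypothetical): one extra flag is packed into the same int as the code point, 
but only inside the private representation, and the public accessor always 
returns the pure standard code point.

    public final class FlaggedCodePoint {
        private static final int CODE_POINT_MASK = 0x1FFFFF;  // 21 bits, covers 0..0x10FFFF
        private static final int FLAG_BIT = 1 << 21;          // outside the code point range
        private final int packed;

        public FlaggedCodePoint(int codePoint, boolean flagged) {
            if (codePoint < 0 || codePoint > 0x10FFFF) {
                throw new IllegalArgumentException("Invalid code point: " + codePoint);
            }
            this.packed = codePoint | (flagged ? FLAG_BIT : 0);
        }

        /** The original, unmodified standard code point. */
        public int getCodePoint() { return packed & CODE_POINT_MASK; }

        /** The extra, non-character information packed alongside it. */
        public boolean isFlagged() { return (packed & FLAG_BIT) != 0; }
    }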
For C/C++ programs that use the native "char" datatype along with C strings, 
the only safe way is to NOT put anything other than the pure standard code 
in the instance value, so that one can effectively rely on '@' == 64 in an 
interface that is expected to receive ASCII characters.
The same applies to Java, which assumes that all "char" instances are 
regular UTF-16 code units (this is less of a problem for UTF-16, because the 
whole 16-bit code unit space is valid and has a normative behavior in 
Unicode, even for surrogate and non-character code units), and to C/C++ 
programs using 16-bit wide code units.
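For illustration, a short Java example of that point: a supplementary 
character occupies two "char" code units (a surrogate pair), and the code 
point has to be reassembled from them; the Character methods used here were 
added in J2SE 5.0.

    public final class Utf16Demo {
        public static void main(String[] args) {
            String s = "\uD835\uDD38";              // U+1D538, stored as two code units
            char high = s.charAt(0), low = s.charAt(1);
            if (Character.isHighSurrogate(high) && Character.isLowSurrogate(low)) {
                int cp = Character.toCodePoint(high, low);
                System.out.println(Integer.toHexString(cp));   // prints "1d538"
            }
        }
    }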
For C/C++ programs that use the ANSI "wchar_t" datatype (which is not 
guaranteed to be 16 bits wide), no one should expect that the extra bits 
that may exist on some platforms will be usable.
For any language that uses a fixed-width integer to store UTF-32 code units, 
the definition domain should be checked by recipients, or recipients should 
document their behavior if other values are possible:
Many applications will accept not only valid code points in 0..0x10FFFF but 
also some "magic" values like -1, which have another meaning (such as the 
end of the input stream, or no character available yet). When this happens, 
the behavior is (or should be) documented explicitly, because the interface 
does not communicate with valid characters only.
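One well-known, documented case of such a "magic" value is 
java.io.Reader.read(): it returns an int that is either a UTF-16 code unit 
in 0..0xFFFF or -1 to signal the end of the stream, so every caller has to 
test for -1 before treating the value as a character. A minimal example:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    public final class ReadLoop {
        public static void main(String[] args) throws IOException {
            Reader in = new StringReader("abc");
            int value;
            while ((value = in.read()) != -1) {   // -1 means "no more characters"
                char c = (char) value;            // safe only after the -1 check
                System.out.println((int) c);
            }
        }
    }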