On Wed, May 24, 2000 at 03:44:16 -0800, Marco Cimarosti wrote:
> It is not correct to say that it adjusts to the underlying encoding:
> a C compiler knows no "underlying encoding", apart from the one the
> source itself is written in.
It knows both the source character set and the execution character
set, and it transcodes string literals from one to the other.
> The length of the \x escape sequence depends only on the characters
> following it [...] And this is precisely what I am not comfortable
> with, because it makes escape sequences ambiguous. Take for example
> "\x2Two": it expands to { 2, 'T', 'w', 'o', 0 }. But if you
> translate the "Two" in French, you get "\x2Deux" that expands to
> { 45, 'e', 'u', 'x', 0 }...
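You can split the string: "\x2""Two" -> "\x2""Deux". Since \x escapes
are variable-length, it seems a good idea to always split the string
right after an \x escape. (Strictly, 'e' is a hex digit too, so the
unsplit "\x2Deux" is read as \x2De followed by "ux", not as \x2D
followed by "eux".)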
From the ISO C draft (formerly known as C9x):
6.4.5 String literals
7 EXAMPLE This pair of adjacent character string literals
"\x12" "3"
produces a single character string literal containing the two
characters whose values are '\x12' and '3', because escape
sequences are converted into single members of the execution
character set just prior to adjacent string literal
concatenation.
Also, there *is* a fixed-length hex escape in C:
6.4.3 Universal character names
Syntax
[#1]
    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad
    hex-quad:
        hexadecimal-digit hexadecimal-digit
            hexadecimal-digit hexadecimal-digit
Constraints
[#2] A universal character name shall not specify a
character whose short identifier is less than 00A0 other
than 0024 ($), 0040 (@), or 0060 (`), nor one in the range
D800 through DFFF inclusive.61)
Description
[#3] Universal character names may be used in identifiers,
character constants, and string literals to designate
characters that are not in the basic character set.
Semantics
[#4] The universal character name \Unnnnnnnn designates the
character whose eight-digit short identifier (as specified
by ISO/IEC 10646) is nnnnnnnn.62) Similarly, the universal
character name \unnnn designates the character whose four-
digit short identifier is nnnn (and whose eight-digit short
identifier is 0000nnnn).
____________________
61)The disallowed characters are the characters in the basic
character set and the code positions reserved by
ISO/IEC 10646 for control characters, the character
DELETE, and the S-zone (reserved for use by UTF-16).
62)Short identifiers for characters were first specified in
ISO/IEC 10646-1/AMD9:1997.
SY, Uwe
--
uwe@ptc.spbu.ru               | Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/  | Ist zu Grunde gehen