Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Valeriy E. Ushakov (uwe@ptc.spbu.ru)
Date: Wed May 24 2000 - 09:15:24 EDT


On Wed, May 24, 2000 at 03:44:16 -0800, Marco Cimarosti wrote:

> It is not correct to say that it adjusts to the underlying encoding:
> a C compiler knows no "underlying encoding", apart from the one the
> source itself is written in.

It knows both the source character set and the execution character
set, and it transcodes string literals from the former to the latter.

> The length of the \x escape sequence depends only on the characters
> following it [...] And this is precisely what I am not comfortable
> with, because it makes escape sequences ambiguous. Take for example
> "\x2Two": it expands to { 2, 'T', 'w', 'o', 0 }. But if you
> translate the "Two" in French, you get "\x2Deux" that expands to
> { 45, 'e', 'u', 'x', 0 }...

You can split the string: "\x2""Two" -> "\x2""Deux". Since \x escapes
are variable-length, it seems a good idea to always split a string
right after an \x escape.
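
A minimal sketch of the split, assuming an ordinary hosted C compiler;
the names msg_en, msg_fr and split_literals_ok are made up for this
illustration. Note that 'D' and 'e' are both hex digits, so an unsplit
"\x2Deux" would read \x2De as a single escape with value 0x2DE, which
does not fit in an 8-bit char and which compilers typically diagnose.

```c
#include <assert.h>
#include <string.h>

/* Each literal is split right after the \x escape, so the escape ends at
   the closing quote and cannot absorb hex digits from the text after it. */
static const char msg_en[] = "\x2" "Two";
static const char msg_fr[] = "\x2" "Deux";

/* Returns 1 if both strings start with the byte value 2 followed by the
   translated word, 0 otherwise. (Hypothetical helper for illustration.) */
static int split_literals_ok(void)
{
    return msg_en[0] == 2 && strcmp(msg_en + 1, "Two") == 0
        && msg_fr[0] == 2 && strcmp(msg_fr + 1, "Deux") == 0;
}
```

With the split in place, translating the word changes only the second
literal; the leading byte stays 2 in both languages.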

From the ISO C draft (formerly known as C9x):

    6.4.5 String literals

    7 EXAMPLE This pair of adjacent character string literals

           "\x12" "3"

       produces a single character string literal containing the two
       characters whose values are '\x12' and '3', because escape
       sequences are converted into single members of the execution
       character set just prior to adjacent string literal
       concatenation.

Also, there *is* a fixed-length hex escape in C:

       6.4.3 Universal character names

       Syntax

       [#1]

               universal-character-name:
                       \u hex-quad
                       \U hex-quad hex-quad

               hex-quad:
                       hexadecimal-digit hexadecimal-digit
                                       hexadecimal-digit hexadecimal-digit

       Constraints

       [#2] A universal character name shall not specify a
       character whose short identifier is less than 00A0 other
       than 0024 ($), 0040 (@), or 0060 (`), nor one in the range
       D800 through DFFF inclusive.61)

       Description

       [#3] Universal character names may be used in identifiers,
       character constants, and string literals to designate
       characters that are not in the basic character set.

       Semantics

       [#4] The universal character name \Unnnnnnnn designates the
       character whose eight-digit short identifier (as specified
       by ISO/IEC 10646) is nnnnnnnn.62) Similarly, the universal
       character name \unnnn designates the character whose four-
       digit short identifier is nnnn (and whose eight-digit short
       identifier is 0000nnnn).

       ____________________

       61)The disallowed characters are the characters in the basic
          character set and the code positions reserved by
          ISO/IEC 10646 for control characters, the character
          DELETE, and the S-zone (reserved for use by UTF-16).

       62)Short identifiers for characters were first specified in
          ISO/IEC 10646-1/AMD9:1997.
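
A small sketch of why the fixed length matters, assuming a compiler
that accepts universal character names in wide string literals; the
names ecole and ecole_length are made up for this illustration. \u
consumes exactly four hex digits, so the 'c' after 00E9 is an ordinary
character and no string splitting is needed.

```c
#include <assert.h>
#include <wchar.h>

/* \u always takes exactly four hex digits, so the 'c' that follows 00E9
   is an ordinary character even though 'c' is itself a hex digit. */
static const wchar_t ecole[] = L"\u00E9cole";

/* Number of wide characters, not counting the terminator.
   (Hypothetical helper for illustration.) */
static size_t ecole_length(void)
{
    return wcslen(ecole);
}
```

Here ecole_length() yields 5: one character for U+00E9 plus the four
letters "cole". The numeric value stored for U+00E9 depends on the
implementation's wide execution character set, so the sketch makes no
assumption about it.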

SY, Uwe

-- 
uwe@ptc.spbu.ru                         |       Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/            |       Ist zu Grunde gehen


