Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Wed May 24 2000 - 20:31:57 EDT

Next message: Julie Doll Allen: "OSes w/ Unicode"
Previous message: Julie Doll Allen: "Unicode archive mail files, etc."
Maybe in reply to: Markus Scherer: "Question about \uxxxx etc. for 21-bit code points - need advice"
Next in thread: Antoine Leca: "Re: Question about \uxxxx etc. for 21-bit code points - need advice"

> and our test code contains strings like
> // display langage (French)
> { "anglais", "fran\\u00E7ais", "", "grec", "norv\\u00E9gien", "italien", "xx" },
>
> which, of course, need to be double-escaped so that the c compiler does not unescape
> them itself. they are unescaped at runtime by a library function.

Boy, are you in trouble! :-)

In C99 (newly published and implemented almost nowhere), these new \u and \U
escapes are **NOTHING LIKE** \x, \n etc, despite their very confusing visual
similarity. They are expanded EVERYWHERE in the source file, not just in strings (at
the same time as trigraphs); and they refer to characters in the SOURCE character
set, not the execution character set. But what happens to them depends on where they
appear in the source file.

If you want to spell "cafe" correctly as an identifier, you can do that:
|caf\u00E9| will do nicely. This is the same symbol as |café| and also as
|caf\U000000E9|, all of which you can use interchangeably within and between
source files. If you have an editor that understands the lexical structure of C, it
might represent them all visually in the same way.
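As a rough sketch of the identifier case, assuming a compiler that accepts universal character names in identifiers: the variable below is declared with the four-digit spelling and read back with the eight-digit one. The function name read_cafe and the value 42 are made up for illustration.

```c
/* Declared with the 4-digit spelling of U+00E9. */
static int caf\u00E9 = 42;

/* Hypothetical accessor: reads the same object through the
   8-digit spelling, which should name the same identifier. */
int read_cafe(void)
{
    return caf\U000000E9;
}
```

If the two spellings did not denote the same symbol, the function would fail to compile with an undeclared-identifier error.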

If you use \u, \U in a string, all the following would compare equal (with
strcmp(), in a Latin-1 environment):

"café", "caf\xE9", "caf\u00E9", "caf\U000000E9"

but in a Cyrillic (ISO 8859-5) execution environment, the first, third and fourth
should give an error---no such character in execution character set---because there
is no e-acute in ISO 8859-5; the second (if printed and read aloud) would have to be
pronounced something like "cafshcha", because the character at E9 in ISO 8859-5 is
shcha. The first could only be written at all if the compiler's source character set
was ISO 8859-1 or one of the others with an e-acute---cross-compilation is perfectly
well defined by the C standard---but the third and fourth can be written in any
conforming implementation of C. If you really want "cafshcha", you can write
"caf\u0449" in all cases---though again, if that character is not in the execution
character set, you shouldn't expect miracles.
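A small sketch of the string comparison, assuming a C99 compiler whose execution character set contains e-acute. The two function names are invented for this example. The first comparison should hold in any such environment; the second holds only where e-acute is the single byte 0xE9 (Latin-1, say), so its result is deliberately left to the environment.

```c
#include <string.h>

/* Returns 1 when the 4-digit and 8-digit universal character
   names for U+00E9 produce identical strings. */
int ucn_spellings_agree(void)
{
    return strcmp("caf\u00E9", "caf\U000000E9") == 0;
}

/* Returns 1 only when the execution character set encodes
   e-acute as the single byte 0xE9; returns 0 otherwise. */
int hex_escape_matches_ucn(void)
{
    return strcmp("caf\xE9", "caf\u00E9") == 0;
}
```

Printing hex_escape_matches_ucn() under different execution character sets is one way to see that \x names a byte value while \u names a character.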

So, if a C compiler saw your example, it would replace "fran\\u00E7ais" by
"fran\çais" when it processed \u and \U escapes---very early on. Later, it
would see \ç, which is not a valid backslash sequence, and give an error.

I guess, since you're doing this preprocessing yourself, you don't
really care about this (despite my emphatic first sentence); but I think
you would be guilty of a very confusing overloading of a new and hard-to-
understand notation if you did it the way you are proposing.
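For what it's worth, here is a minimal sketch of what a run-time unescaping function of the kind described in the quoted message might look like. It is not the actual library function; the name unescape_u4, the choice of UTF-8 output and the error convention are all assumptions of this sketch. It handles only a backslash, a lowercase u and exactly four hex digits, and makes no attempt to pair surrogates.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: copy src to dst, replacing each backslash-u
   escape with four hex digits by the UTF-8 encoding of that code
   point. Returns the number of bytes written (excluding the
   terminating NUL), or (size_t)-1 if dst is too small. */
size_t unescape_u4(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0) return (size_t)-1;

    while (*src != '\0') {
        unsigned long cp = 0;
        int is_escape = 0;

        if (src[0] == '\\' && src[1] == 'u') {
            int i;
            is_escape = 1;
            for (i = 0; i < 4; i++) {
                int h = hex_value(src[2 + i]);
                if (h < 0) { is_escape = 0; break; } /* malformed: copy as-is */
                cp = cp * 16 + (unsigned long)h;
            }
        }

        if (!is_escape) {
            /* Ordinary byte: copy it through unchanged. */
            if (out + 1 >= dst_size) return (size_t)-1;
            dst[out++] = *src++;
            continue;
        }

        /* Encode cp (at most 0xFFFF here) as 1 to 3 UTF-8 bytes. */
        if (cp < 0x80) {
            if (out + 1 >= dst_size) return (size_t)-1;
            dst[out++] = (char)cp;
        } else if (cp < 0x800) {
            if (out + 2 >= dst_size) return (size_t)-1;
            dst[out++] = (char)(0xC0 | (cp >> 6));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
        } else {
            if (out + 3 >= dst_size) return (size_t)-1;
            dst[out++] = (char)(0xE0 | (cp >> 12));
            dst[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (cp & 0x3F));
        }
        src += 6; /* skip the backslash, the u and four hex digits */
    }

    dst[out] = '\0';
    return out;
}
```

A caller could build the input as "fran\\" "u00E7ais", two adjacent literals that the compiler concatenates, so that no backslash-u sequence ever appears in the source text and the question of how the compiler treats it does not arise.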

        /|
o o o (_|/
        /|
       (_/


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT