From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Dec 09 2003 - 13:14:03 EST
> Hmm. Now here's some C++ source code (syntax colored as
> Philippe suggests, to imply that the text editor understands
> C++ at least well enough to color it)
>
> int n = wcslen(L"café");
>
> (That's int n = wcslen(L"café"); for those without HTML email)
>
> The L prefix on a string literal makes it a wide-character
> string, and wcslen() is simply a wide-character version of
> strlen(). (There is no guarantee that "wide character" means
> "Unicode character", but let's just assume that it does, for
> the moment).
Even assuming that you can assume that "wide characters" are Unicode, you
have not yet assumed which UTF they are in. (Don't assume that I am
deliberately making puns :-)
The only thing that the C(++) standards say about type "wchar_t" is that it
is not smaller than type "char", so a "wide character" could well be a byte,
and a "wide character string" could well be UTF-8, or even ASCII.
> So, should n equal four or five?
Why not six?
If, in our C(++) compiler, type "wchar_t" is an alias for "char", and "wide
character strings" are encoded in UTF-8, and the "é" is decomposed, then n
will be equal to 6.
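(If you want to check the byte counts yourself, here is a minimal sketch
using plain "char" strings with the UTF-8 bytes spelled out as escapes, so
that no editor or compiler can renormalize them behind our backs; the
variable names are mine:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char * composed   = "caf\xC3\xA9";  /* precomposed U+00E9: 5 bytes */
    const char * decomposed = "cafe\xCC\x81"; /* "e" + combining U+0301: 6 bytes */

    /* Prints "5 6": same word, same encoding, two lengths. */
    printf("%u %u\n", (unsigned) strlen(composed), (unsigned) strlen(decomposed));
    return 0;
}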
> The answer would appear to depend on whether or not the
> source file was saved in NFC or NFD format.
The answer is:
int n = wcslen(L"café");
That's why you go to the trouble of calling the "wcslen" library function
rather than assuming a hard-coded value such as:
int n = 4; // the length of string "café"
> There is more to consider than just how and whether a text
> editor normalizes.
Whatever the editor does, what if the *compiler* then normalizes it?
The source file and the compiled object file are not necessarily in the same
encoding and/or normalization.
A given compiler could accept a certain range of input encodings (perhaps
declared with a command-line parameter) and convert them all into a single
internal representation in the compiled object file (e.g., Unicode expressed
in a particular UTF and with a particular normalization).
That's why library functions such as "strlen" or "wcslen" exist. You don't
need to worry about what these functions will return in a particular
compiler or environment, as long as the following code is guaranteed to work:
/* Needs <stdlib.h> for malloc()/free() and <wchar.h> for wcslen()/wcscpy(). */
const wchar_t * myText = L"café";
/* One extra wchar_t for the terminating L'\0'; the cast keeps C++ happy. */
wchar_t * myBuffer = (wchar_t *) malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
if (myBuffer != NULL)
{
    wcscpy(myBuffer, myText);
    /* ... use myBuffer, then free(myBuffer) ... */
}
> If a text editor is capable of dealing with Unicode text,
> perhaps it should also be able to explicitly DISPLAY the
> actual composition form of every glyph.
Again, this is neither possible nor desirable, because a text editor is not
supposed to know how the compiler (or its runtime libraries) will transform
string literals.
> The question I posed in the previous paragraph should
> ideally be obvious by sight - if you see four characters,
> there are four characters; if you see five characters, there
> are five characters.
Provided that you can define what a "character" is... After a few years of
reading this mailing list, I haven't seen a single acceptable definition of
"character".
Moreover, I have come to the impression that such a definition is totally
irrelevant:
- as an end user, I am interested in a higher-level kind of object (let's
call them "graphemes", i.e. the things I see on the screen and can interact
with using my mouse);
- as a programmer, I am interested in a lower-level kind of object (let's
call them "encoding units", i.e. the things that I count when I have to
allocate memory for a string, or the like).
The term "character" is in a sort of conceptual limbo which makes it pretty
useless for everybody, IMHO.
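A minimal sketch of the gap between the two, assuming a UTF-8 "char" string
(count_code_points() is a throw-away helper of mine, not a library
function): for the decomposed "café" it reports 6 encoding units and 5 code
points, while the end user sees 4 graphemes; three different answers to "how
long is this string?".

#include <stdio.h>
#include <string.h>

/* Count Unicode code points in a UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx). */
static unsigned count_code_points(const char * s)
{
    unsigned n = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char) *s & 0xC0) != 0x80)
            ++n;
    return n;
}

int main(void)
{
    const char * s = "cafe\xCC\x81"; /* "café" with a decomposed "é" */
    printf("encoding units: %u, code points: %u\n",
           (unsigned) strlen(s), count_code_points(s));
    return 0;
}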
_ Marco