From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 27 2003 - 18:18:24 EDT
One minor correction:
> However, it's true that ECMAScript will allow you to create invalid
> Unicode strings.
More precisely, ECMAScript (and other systems) will allow you to
create 16-bit Unicode strings that are not UTF-16.
See Section 2.7 in http://www.unicode.org/book/preview/ch02.pdf.
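For example, the following is legal in any conforming ECMAScript
implementation (a minimal sketch, not tied to any particular engine);
the result is a valid 16-bit Unicode string, but it is not well-formed
UTF-16:

  // A lone high surrogate: allowed by the language,
  // but not well-formed UTF-16.
  var lone = String.fromCharCode(0xD800);

  // A proper surrogate pair for U+10400 is just as easy to build,
  // and the language treats both results as ordinary strings.
  var pair = String.fromCharCode(0xD801, 0xDC00);
  // lone.length == 1, pair.length == 2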
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: "Philippe Verdy" <verdy_p@wanadoo.fr>
To: <unicode@unicode.org>
Sent: Tuesday, May 27, 2003 14:49
Subject: Re: javascript and unicode
> From: "Markus Scherer" <markus.scherer@jtcsv.com>
> > Paul Hastings wrote:
> > > would it be correct to say that javascript "natively" supports
> > > unicode?
> >
> > ECMAScript, of which JavaScript and JScript are implementations, is
> > defined on 16-bit Unicode script source and on 16-bit Unicode
> > strings.
> >
> > In other words, the basic encoding support is there, but there are
> > basically no Unicode-specific APIs in the standard. No character
> > properties, no collation that is guaranteed to do more than strcmp,
> > etc. Script writers have to rely on implementation-specific
> > functions or supply their own.
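> >
> > For instance (a rough sketch; what localeCompare actually returns
> > is implementation-dependent):
> >
> >   // The relational operators compare strings by raw code unit
> >   // values, so U+00C4 (A-diaeresis) sorts after "Z" in every locale.
> >   var a = "\u00C4";
> >   ("Z" < a);               // true: 0x005A < 0x00C4
> >
> >   // localeCompare is in the standard, but only as an
> >   // implementation-dependent, locale-sensitive comparison.
> >   a.localeCompare("Z");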
>
> It would be more correct to say that ECMAScript handles text using
> the UTF-16 encoding form on most platforms, and so can handle any
> Unicode character. However, it's true that ECMAScript will allow you
> to create invalid Unicode strings, since it allows you to create
> strings in which surrogate code units do not pair.
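>
> For example (a small sketch, using an arbitrary supplementary
> character):
>
>   // U+10400 is stored as the surrogate pair D801 DC00.
>   var s = "\uD801\uDC00";
>   s.length;                   // 2: two code units for one character
>
>   // Nothing stops a script from splitting the pair, which leaves an
>   // unpaired surrogate behind; the result is still a legal string.
>   var half = s.substring(0, 1);
>   half.length;                // 1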
>
> This says nothing about the internal encoding of strings within ECMA
> engines: an engine could just as well use CESU-8 internally, but that
> internal encoding will be hidden.
>
> So the situation of ECMAScript is exactly like that of Java (in which
> the builtin type "char" is an unsigned 16-bit integer, and the String
> type is handled in terms of "char" code units, i.e. UTF-16). However,
> the serialization of compiled Java classes internally encodes these
> strings with a modified UTF-8, which is deserialized back to UTF-16
> when the class is loaded.
>
> You will find a similar situation on Windows with the Win32 API, and
> in its C/C++ binding using TCHAR (and the _T() macro for string
> constants) with the _UNICODE compile-time define, or on any system
> where the ANSI C type wchar_t is defined as a 16-bit integer.
>
> Note that we are speaking here about code units, not code points.
> Code units are what programming languages use to handle strings, not
> code points. Since code units are well defined in Unicode in relation
> to an encoding form, any language or system can be made fully
> Unicode-compliant, provided it also supplies library functions for
> string handling that implement the Unicode-defined algorithms (which
> are described in terms of code points).
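>
> As a rough sketch of the kind of helper such a library has to supply
> on top of the language's code units (the function name is made up):
>
>   // Count code points in a string by combining surrogate pairs.
>   function countCodePoints(s) {
>     var n = 0;
>     for (var i = 0; i < s.length; i++) {
>       var c = s.charCodeAt(i);
>       // A high surrogate followed by a low surrogate is one code point.
>       if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
>         var d = s.charCodeAt(i + 1);
>         if (d >= 0xDC00 && d <= 0xDFFF) i++;
>       }
>       n++;
>     }
>     return n;
>   }
>
>   countCodePoints("A\uD801\uDC00");   // 2 code points, 3 code units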
>
> It's up to the library (not the language) to make its code-unit-based
> implementation of Unicode comply with the standard algorithms, which
> are defined in terms of code points. Of course it is much easier to
> implement these algorithms with 16-bit code units than with 8-bit
> code units, but the language itself has no other special Unicode
> compliance characteristic.
This archive was generated by hypermail 2.1.5 : Tue May 27 2003 - 19:11:17 EDT