From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 00:54:28 CDT
Doug Ewell
> Sent: Saturday, October 13, 2007 06:01
> To: Unicode Mailing List
> Subject: Re: New FAQ page
>
> Peter Constable <petercon at microsoft dot com> wrote:
>
> > Actually, I think what's happening is that "?" is used as the default
> > code page mapping for characters not supported in a code page. So, if
> > an app takes Unicode data and pumps it into (say) code page 1252, then
> > a character like U+0915 will map into 0x3F "?".
>
> This is explicitly specified in the "best fit" mappings on the Unicode
> site, which are based on .NET behavior, as Peter knows:
>
> CPINFO 1 0x3f 0x003f ;Single Byte CP, Default Char = Question Mark
Note however that the code page conversion API (in the Windows SDK) allows an
application to specify the behaviour for unmapped characters. Mapping to a
question mark is only the default of this API; other behaviours are possible:
returning an error or raising an exception, or using another default mapping
(for example a SUB control).
For online information about the Windows SDK API, you now need to look in
the .Net documentation, where it is now documented in the "Microsoft.Win32"
namespace (plus many references to the "System" namespace for basic
datatypes and structures, and to many of its sub-namespaces for non-core
services). This makes the Windows API difficult to use if you don't want
.Net (for example when you just want to program in C or C++).
But if we just consider the .Net API, here is the relevant one in the
"System.Text" namespace:
namespace System.Text
{
    public abstract class Encoding
    {
        // Signature only; the body is omitted here.
        public static Encoding GetEncoding(
            int codepage,
            EncoderFallback encoderFallback,
            DecoderFallback decoderFallback);
    }
}
which is a static factory returning an encoder/decoder pair configured with
fallbacks of either of two kinds: replacement fallbacks (where the
replacement is not limited to one character, but may be any string that can
be encoded in the target encoding) and exception fallbacks.
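
The two kinds of fallback can be sketched as follows. This is only a minimal illustration, not taken from the MSDN samples: the class and method names (FallbackDemo, EncodeWithReplacement, CanEncode) are made up for the example, and code page 20127 (US-ASCII) is assumed instead of 1252 only because it is available on every runtime (on recent .Net runtimes, 1252 may first require registering the code pages encoding provider).

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes text into the given code page, substituting `replacement`
    // (any string, not just one character) for each unmappable character.
    public static byte[] EncodeWithReplacement(string text, int codepage, string replacement)
    {
        Encoding enc = Encoding.GetEncoding(
            codepage,
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));
        return enc.GetBytes(text);
    }

    // Returns true if every character of text is mappable in the code page.
    // The exception fallback turns an unmappable character into an exception.
    public static bool CanEncode(string text, int codepage)
    {
        Encoding enc = Encoding.GetEncoding(
            codepage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            enc.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+0915 (DEVANAGARI LETTER KA) has no mapping in US-ASCII.
        byte[] bytes = EncodeWithReplacement("a\u0915", 20127, "?");
        Console.WriteLine(BitConverter.ToString(bytes)); // 61-3F
        Console.WriteLine(CanEncode("a\u0915", 20127));  // False
    }
}
```

With "?" as the replacement string this reproduces the default question mark behaviour discussed above; passing a longer string, or using the exception fallback, gives the other behaviours.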
There are some examples in:
http://msdn2.microsoft.com/fr-fr/library/system.text.decoderreplacementfallback(VS.80).aspx
and in:
http://msdn2.microsoft.com/fr-fr/library/system.text.encoderreplacementfallback(VS.80).aspx
(But no information is given about how the .Net core library maps these
methods to the Win32 API, which has similar services that are now more
difficult to find except in the legacy header files provided with Visual
C/C++; it seems that Microsoft will progressively abandon the documentation
of the native core Windows API and move everything to .Net, which would
remain the only documented and stable/portable API, complicating the work of
C/C++ developers who don't know how .Net works.)
Note that .Net uses the (rather misleading) class name "UnicodeEncoding" for
what is actually UTF-16, little-endian by default; the big-endian variant
(UTF-16BE) is exposed as the "Encoding.BigEndianUnicode" property. The other
Unicode-defined encodings predefined in the Microsoft .Net core library are
the classes "UTF32Encoding", "UTF8Encoding" and "UTF7Encoding".
Nothing in the definition of the .Net library states which internal encoding
is used, because even the "Char" datatype is a type whose internal
representation is hidden. We only know the minimum and maximum values of
this datatype, expressed through some language-specific integer type
("wchar_t" in C/C++, "char" in J# and VB), which is not necessarily the same
internal datatype used for storing strings. However, I can't see how it
could store more than 16 unsigned bits, and the .Net documentation is really
misleading when it says that a "char" in .Net represents "a Unicode
character": in fact a single Unicode character outside the BMP cannot be
represented without using TWO "char" values in .Net (necessarily a surrogate
pair). This is reflected in the Length property of the System.String class,
which counts 16-bit "char" values rather than characters.
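
A short sketch makes the point concrete. The class and helper names (SurrogateDemo, CountCodePoints, Hex) are invented for the illustration; the library calls themselves (String.Length, Char.IsSurrogatePair, Char.ConvertToUtf32, Encoding.Unicode, Encoding.BigEndianUnicode) are standard .Net members.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Counts Unicode code points rather than 16-bit char values.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A well-formed surrogate pair is two chars but one code point.
            if (char.IsSurrogatePair(s, i))
                i++;
            count++;
        }
        return count;
    }

    // Hex dump of the bytes produced by an encoding (no byte order mark).
    public static string Hex(Encoding enc, string text)
    {
        return BitConverter.ToString(enc.GetBytes(text));
    }

    public static void Main()
    {
        // U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP.
        string s = "a\U00010400";
        Console.WriteLine(s.Length);                                // 3
        Console.WriteLine(CountCodePoints(s));                      // 2
        Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X")); // 10400

        // "Unicode" here means UTF-16LE; the surrogates are D801 DC00.
        Console.WriteLine(Hex(Encoding.Unicode, "\U00010400"));          // 01-D8-00-DC
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, "\U00010400")); // D8-01-DC-00
    }
}
```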
Really, if you read the .Net documentation, its terminology does not match
the Unicode definition of the same terms (the definition of the "char" and
"string" datatype being the most confusing).
The actual conversion from strings to arrays of bytes is now performed by
methods of the Encoding class (overridden in each of its implementation
classes, where fallbacks are used and can also be overridden). However, no
fallback should ever be needed when converting ***to*** one of the
Unicode-based encodings, i.e. in Unicode-based encoders, as long as the
string is well-formed (an unpaired surrogate is the one input they cannot
encode). The reverse is not true for decoders, which parse a sequence of
bytes into the internal sequence of chars and may meet invalid sequences.
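
The decoder side can be sketched like this, assuming UTF-8 as the Unicode-based encoding. The names DecoderFallbackDemo, DecodeUtf8Strict and DecodeUtf8Lenient are hypothetical; only the Encoding.GetEncoding overload and the fallback classes come from the library.

```csharp
using System;
using System.Text;

public static class DecoderFallbackDemo
{
    // Decodes bytes as UTF-8; returns null when the input is not valid UTF-8,
    // because the exception fallback raises DecoderFallbackException.
    public static string DecodeUtf8Strict(byte[] bytes)
    {
        Encoding utf8 = Encoding.GetEncoding(
            "utf-8",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            return utf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return null;
        }
    }

    // Decodes bytes as UTF-8, substituting `replacement` for invalid input.
    public static string DecodeUtf8Lenient(byte[] bytes, string replacement)
    {
        Encoding utf8 = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));
        return utf8.GetString(bytes);
    }

    public static void Main()
    {
        // 0xFF can never occur in well-formed UTF-8.
        byte[] bad = { 0x41, 0xFF };
        Console.WriteLine(DecodeUtf8Strict(bad) == null);  // True
        Console.WriteLine(DecodeUtf8Lenient(bad, "?"));    // A?
    }
}
```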
Follow the other links to get the list of other supported code pages (only
the Unicode-based encodings are part of the .Net core; all others are
supported through encoding definitions installed on the system, or defined
by the application by deriving its own classes from Encoding).