Unicode

From: Jean-Baptiste GAGELIN (jbgagelin@attmail.com)
Date: Wed May 22 1996 - 14:26:50 EDT


Hi all,
here is some information
that may help developers.
Feel free to send me feedback.

Regards,
JB Gagelin. jbgagelin@attmail.com

-------------------------------
Where to search:

- http://www.stonehand.com/unicode/ and
http://www.stonehand.com/unicode/resources.html
- Dejanews server (http://www.dejanews.com): combine search terms such as
UNICODE, UNIX, WINDOWS, EUC.....
- http://www.microsoft.com/devonly/tech/global/ (Good site!)
In http://www.microsoft.com/globaldev/gbl-gen/codesets/charsets.htm, retrieve
the two samples.
Warning: they are not exhaustive and can't be compiled as is, but they can
help with the use of WideCharToMultiByte/MultiByteToWideChar.
- Microsoft Win32 SDK Help: Win32/Overview/International Features. (45 pages)
- Nadine Kano's book "Developing International Software for W95 and NT",
Microsoft Press.
- Newsgroup: comp.software.international
- Unicode mailing list: write to unicode-request@unicode.org and send
"subscribe user@yourhost"
- http://www.ntt.jp/japan/note-on-JP/encoding.html

Miscellaneous defs:

EUC-JP (Extended Unix Code JaPan, not ISO)
ISO-2022-JP (multibyte for UNIX, ISO)
Shift-JIS / JIS (ISO-2022-JP) (multibyte for PC).
Unicode. (ISO/IEC 10646-1).
UJIS
KSC5601, GB2312, JIS0208
UTF-7, UTF-8 and UTF-16: encoding of Unicode.

The www.stonehand.com/unicode server provides mapping tables between Shift JIS
and UNICODE, UTF...

UNIX functions: mbtowc(), mblen(), wctomb(), wchrtbl(), mbstring(),
setlocale(), environ().

DBCS: DBCS stands for double-byte character set. In practice a character is
encoded in either one or two bytes (the first byte tells which), so it is not
a fixed 16-bit encoding. Modern writing systems used in the Far East typically
require a minimum of 3K-15K characters.

MBCS: Multi... used on UNIX.

NLSAPI: National Language Support API is a set of APIs supporting
internationalisation/localisation. Not all functions are available on all
platforms. NT 3.5 supports the whole API. The Win32s API is a subset of the
Win32 API: it covers the ANSI versions of the Win32 functions but excludes
wide character support.

wchar_t: The type of UNICODE characters. UNICODE strings are sequences of
wchar_t characters.
ANSI strings are sequences of 'char' type.

Shift JIS: CP932, MBCS type. The first byte determines whether the character
is complete in one byte or whether a second byte must be read. This reduces
the size of the data, since frequently used characters are one byte long.
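
The first-byte rule above can be sketched in C as follows. This is only an
illustration: the function names are made up for this note, and the lead-byte
ranges (0x81-0x9F and 0xE0-0xFC) are the ones commonly given for CP932, so
check them against the mapping tables you actually use.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if b is the first byte of a two-byte Shift JIS character.
   Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC (CP932). */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters (not bytes) in a NUL-terminated Shift JIS string. */
size_t sjis_char_count(const unsigned char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != 0) {
        /* A lead byte plus the byte after it form one character.
           A lead byte at the very end is counted as one malformed unit. */
        if (sjis_is_lead_byte(*s) && s[1] != 0)
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

For instance, the five bytes 41 93 FA 96 7B are counted as three characters:
one single-byte character followed by two double-byte ones.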

IME: Input method Editor (win95).

Backslash becomes the yen symbol on Japanese systems.
GetVersionEx.
0xFEFF: Unicode marker for the beginning of a file (byte order mark).
0xFFFE: the same marker read with its bytes reversed (High, Low swapped).
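
A minimal sketch of how a reader could check for this marker, assuming the
file starts with a 16-bit Unicode stream. The enum and function names are
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

enum bom_order { BOM_NONE, BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN };

/* Inspects the first two bytes of a buffer for the U+FEFF marker. */
enum bom_order detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_BIG_ENDIAN;     /* bytes FE FF: high byte first */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_LITTLE_ENDIAN;  /* bytes FF FE: low byte first */
    return BOM_NONE;               /* no marker found */
}
```

If the marker is read as 0xFFFE on the local machine, every following 16-bit
unit needs its two bytes swapped before use.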

Win32 SDK APIs: To convert strings to or from Unicode: MultiByteToWideChar/
WideCharToMultiByte are available on every Win32 system.

The ANSI mandated multibyte functions in the following list are compatible
with Win32s:
mbtowc
mbstowcs
wctomb
wcstombs
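
As an illustration of the string versions, here is a small round trip through
mbstowcs and wcstombs. It is a sketch only: the helper name and the 64-element
buffers are arbitrary choices, and the result depends on the current locale
(in the default "C" locale, plain ASCII text converts as expected).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Converts a multibyte string to wide characters and back.
   Returns 1 if the round trip reproduces the input, 0 otherwise. */
int roundtrip_ok(const char *mb)
{
    wchar_t wide[64];
    char back[64];
    size_t n;

    assert(mb != NULL);
    n = mbstowcs(wide, mb, 64);
    if (n == (size_t)-1 || n >= 64)
        return 0;                  /* invalid sequence or input too long */
    n = wcstombs(back, wide, sizeof back);
    if (n == (size_t)-1 || n >= sizeof back)
        return 0;                  /* unconvertible character or too long */
    return strcmp(mb, back) == 0;
}
```

Both functions return (size_t)-1 on a conversion error, so the result should
always be checked before the output buffer is used.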

Visual Basic support:
StrConv: Doublebyte <-> Single.
typelib & BSTR if you use MultiByteToWideChar

Main constraints on the development:

(Here it is assumed that:
"ANSI" covers ANSI, DBCS and MBCS: everything that deals with code pages.
"Unicode" and "wide char" are used interchangeably.)

General information:

With the Win32 API, each functionality that uses strings (let us call it
'Fctn') is provided as 3 functions:
        FctnA: The ANSI version. String arguments must be ANSI (char type)
        FctnW: The Unicode version (all are available on NT 3.5)
        Fctn: The generic function. This is a macro that is replaced at
compilation time by FctnA or FctnW.
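
The macro mechanism can be imitated in portable C as below. This is a sketch
of the idea, not the real Win32 headers: FctnA and FctnW are stand-ins that
just return a string length, and TEXT_ and TCHAR_ are names invented here to
avoid clashing with the real TEXT and TCHAR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for an ANSI/Unicode function pair (hypothetical). */
size_t FctnA(const char *s)    { assert(s != NULL); return strlen(s); }
size_t FctnW(const wchar_t *s) { assert(s != NULL); return wcslen(s); }

/* The generic name resolves at compile time, depending on UNICODE. */
#ifdef UNICODE
#define Fctn     FctnW
#define TEXT_(s) L##s
typedef wchar_t  TCHAR_;
#else
#define Fctn     FctnA
#define TEXT_(s) s
typedef char     TCHAR_;
#endif

/* Source written once against the generic names. */
size_t label_length(void)
{
    const TCHAR_ *label = TEXT_("Unicode");
    return Fctn(label);
}
```

Compiling the same source with and without -DUNICODE gives the two variants;
label_length() returns 7 in both.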

Windows provides four levels of compatibility with the NLSAPI.
1) Win32s release 1.2.
2) Windows NT 3.1.
3) Windows NT 3.5.
4) Windows 95.
Let us call them 1,2,3,4.

TOPICS COVERED:

Build code that will work with either Unicode or ANSI depending on
compilation:
(faster code)
For each Fctn, use its generic form.
Compile with or without UNICODE defined.
On Win 95 the build will be ANSI, on NT UNICODE.
=> 2 executables are generated. (optimised, but one can't work on the other)

Build Unicoded data on MS_multiplatform:
We try to build code that will work on every Windows platform, with the same
binary.
This describes how to process the system functions that take strings.
This gets its own chapter because Microsoft provides many functions dealing
with UNICODE (the W functions).
Do not define UNICODE at compile time.

For each Fctn, create a wrapper FctnWrap.
FctnWrap =
Find the real function available on the system:
If FctnW is provided, use it (NT only).
else
        if FctnA is provided (we have win95 or win3.1 (plus win32s))
        convert the arguments with the WideCharToMultiByte Win32 API or the
        mbtowc/wctomb functions
        call the A version of Fctn, FctnA.
        convert the results back with MultiByteToWideChar if needed.
else manage as you can... but you should avoid using this function: it is
system dependent.
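
The wrapper logic can be sketched in portable C like this. Everything here is
a stand-in: the table of function pointers simulates "what the system
provides" (on a real system one would test the platform, for example with
GetVersionEx), the demo functions only return a length, and wcstombs takes the
place of WideCharToMultiByte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

typedef size_t (*fctn_w_t)(const wchar_t *);
typedef size_t (*fctn_a_t)(const char *);

/* Hypothetical description of the running system.
   A NULL entry means "not available on this platform". */
struct fctn_table {
    fctn_w_t wide;   /* stands in for FctnW */
    fctn_a_t ansi;   /* stands in for FctnA */
};

size_t demo_FctnW(const wchar_t *s) { return wcslen(s); }
size_t demo_FctnA(const char *s)    { return strlen(s); }

/* FctnWrap: always takes Unicode input; returns (size_t)-1 on failure. */
size_t FctnWrap(const struct fctn_table *sys, const wchar_t *s)
{
    char narrow[256];
    size_t n;

    assert(sys != NULL && s != NULL);
    if (sys->wide != NULL)
        return sys->wide(s);              /* W version available: use it */
    if (sys->ansi != NULL) {
        n = wcstombs(narrow, s, sizeof narrow);
        if (n == (size_t)-1 || n >= sizeof narrow)
            return (size_t)-1;            /* conversion failed or too long */
        return sys->ansi(narrow);         /* fall back to the A version */
    }
    return (size_t)-1;                    /* neither version is available */
}
```

The caller keeps its data in Unicode and never names the A or W function
directly, which is what lets a single binary run on several platforms.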

Example: For CompareString,
The support of CompareString is the following:
CompareStringW is OK for level 2,3
CompareStringA is OK for level 1,3,4
(This can be found in Nadine Kano's book, p 523. Most of the time the
A-function is OK for 1,3,4 and the W-function is OK for 3 and sometimes 2.
Everything is OK for 3, always.)

So if the software runs on level 2 or 3, use CompareStringW.
If it runs on level 1 or 4, take your UNICODE data, convert it with
WideCharToMultiByte, and call CompareStringA.
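
The choice for CompareString can be written down as a small decision
function. This is a sketch: the enum names are invented for this note, and
the level numbers are the 1,2,3,4 defined earlier.

```c
#include <assert.h>

/* The four compatibility levels, numbered as in the text above. */
enum nls_level {
    LEVEL_WIN32S_1_2 = 1,
    LEVEL_NT_3_1     = 2,
    LEVEL_NT_3_5     = 3,
    LEVEL_WIN95      = 4
};

enum cs_variant { CS_USE_A, CS_USE_W };

/* Picks which CompareString variant the wrapper should call. */
enum cs_variant compare_string_variant(enum nls_level level)
{
    assert(level >= LEVEL_WIN32S_1_2 && level <= LEVEL_WIN95);
    switch (level) {
    case LEVEL_NT_3_1:
    case LEVEL_NT_3_5:
        return CS_USE_W;   /* Unicode data can be passed directly */
    default:
        return CS_USE_A;   /* levels 1 and 4: convert to ANSI first */
    }
}
```

Level 3 supports both variants; the W version is preferred there because it
avoids the conversion.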
Build Unicoded data on UNIX_multiplatform:
We try to build code that will work on every Windows platform, with the same
binary.


