Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

From: Joel Rees (rees@server.mediafusion.co.jp)
Date: Wed Feb 21 2001 - 03:40:32 EST


Hi.

I took several minutes to scan through your post and I am not sure what you
are asking. Would you like to see some examples, for instance, of real
(assigned) code points that require encoding by surrogate pairs to be
represented as Java char? Looking at what you are trying to do, I think I
would rather try to explain UTF-8, but you indicate you are using Java.

First, a link I couldn't find from the home page:

http://www.unicode.org/charts/draftunicode31

So we have the "musical symbol G clef" at code point 0x1d11e. (I want to say
\u1d11e, but I think that would require a change to Java syntax.) To encode
that in Java, we need two chars:

Subtract 0x10000:
    0xd11e (binary 1101 0001 0001 1110)

Split into two pieces of ten bits each by shifting off the bottom ten bits:
    (binary 11 0100 | 01 0001 1110)
    Hi half: 0x0034 (binary 00 0011 0100)
    Lo half: 0x011e (binary 01 0001 1110)

Add the base of the appropriate surrogate area:
    0xd800 + 0x0034 => 0xd834
    0xdc00 + 0x011e => 0xdd1e

Store these in two char:
    char[] GClefPair = { '\ud834', '\udd1e' };
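
For what it is worth, the steps above can be sketched in Java roughly as
follows. The class and method names are my own invention for illustration,
and the sketch assumes the caller only passes code points above 0xffff.

```java
// Illustrative sketch only: SurrogateSketch and toSurrogatePair are made-up names.
class SurrogateSketch {
    // Split a supplementary code point (0x10000..0x10ffff) into two Java chars.
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10ffff) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;   // at most 20 significant bits remain
        int hiHalf = offset >> 10;          // top ten bits
        int loHalf = offset & 0x3ff;        // bottom ten bits
        return new char[] {
            (char) (0xd800 + hiHalf),       // high (leading) surrogate
            (char) (0xdc00 + loHalf)        // low (trailing) surrogate
        };
    }

    public static void main(String[] args) {
        char[] gClefPair = toSurrogatePair(0x1d11e);
        System.out.println(Integer.toHexString(gClefPair[0]) + " "
                + Integer.toHexString(gClefPair[1]));   // prints "d834 dd1e"
    }
}
```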

Does this answer your question, and could someone check my math?
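
One way to check the math is to run the pair back through the reverse
calculation and see whether 0x1d11e comes out again. This is only a sketch;
the class and method names are hypothetical.

```java
// Illustrative sketch only: SurrogateCheck and toCodePoint are made-up names.
class SurrogateCheck {
    // Combine a high and a low surrogate back into one code point.
    static int toCodePoint(char hi, char lo) {
        if (hi < 0xd800 || hi > 0xdbff || lo < 0xdc00 || lo > 0xdfff) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        int hiHalf = hi - 0xd800;                  // recover the top ten bits
        int loHalf = lo - 0xdc00;                  // recover the bottom ten bits
        return 0x10000 + (hiHalf << 10) + loHalf;  // undo the earlier subtraction
    }

    public static void main(String[] args) {
        int codePoint = toCodePoint('\ud834', '\udd1e');
        System.out.println(Integer.toHexString(codePoint));   // prints "1d11e"
    }
}
```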

Hmm. I would still suggest you check out UTF-8 and see if that standard
transformation might make sense for your application.
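
To give a feel for what UTF-8 does with the same character, here is a rough
hand-rolled encoder. It is a sketch for illustration, not a replacement for
a library converter, and the class and method names are invented.

```java
// Illustrative sketch only: Utf8Sketch and encode are made-up names.
class Utf8Sketch {
    // Encode one code point as one to four UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp < 0x80) {                           // 7 bits: one byte, same as ASCII
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {                          // up to 11 bits: two bytes
            return new byte[] {
                (byte) (0xc0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3f)) };
        }
        if (cp < 0x10000) {                        // up to 16 bits: three bytes
            return new byte[] {
                (byte) (0xe0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3f)),
                (byte) (0x80 | (cp & 0x3f)) };
        }
        return new byte[] {                        // up to 21 bits: four bytes
            (byte) (0xf0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3f)),
            (byte) (0x80 | ((cp >> 6) & 0x3f)),
            (byte) (0x80 | (cp & 0x3f)) };
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(0x1d11e)) {
            hex.append(Integer.toHexString(b & 0xff)).append(' ');
        }
        System.out.println(hex.toString().trim());   // prints "f0 9d 84 9e"
    }
}
```

Note that no surrogates appear anywhere: the code point goes straight to
bytes, and plain 7 bit ascii text stays exactly as it is.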

Joel Rees, Media Fusion KK
Amagasaki, Japan

----- Original Message -----
From: "William Overington" <WOverington@ngo.globalnet.co.uk>
To: "Unicode List" <unicode@unicode.org>
Cc: <archive@ngo.globalnet.co.uk>
Sent: Wednesday, February 21, 2001 2:30 AM
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in
Unicode)

> The following statements have been made by participants in this thread.
>
> 1.
>
> A few days ago I said there was a "widespread belief" that Unicode is a
> 16-bit-only character set that ends at U+FFFF. A corollary is that the
> supplementary characters ranging from U+10000 to U+10FFFF are either
> little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.
>
> 2.
>
> Can we put this thread on a constructive footing? I am sure there is
> lots of outdated and/or incorrect information out there and I would
> like to preempt its being identified via numerous emails here.
> If the belief is there are misperceptions that need to be corrected, how
> should the problem be addressed? Bear in mind the volunteer nature of the
> organization....
>
> ----
>
> I wonder if some readers might like to have a look at a specific
> situation. This would certainly help me and might also provide a useful
> case study on the practical problems.
>
> I do not purport to be an expert in unicode. Unicode is but one of many
> interests. I do recognize that unicode is attempting to be a
> comprehensive standard system and I would like to do what I can within my
> own research to utilize the unicode system.
>
> As some readers may remember I am producing a computer language called
> 1456 object code (in speech, "fourteen fifty-six object code") which is a
> computer language expressible using 7 bit ascii printing characters and
> which may be included in the param statements of an applet call in an HTML
> page. The applet called then calls a Java class file named
> Engine1456.class and quite substantial computations with graphic output
> may be achieved using a combination of ready prepared standardized Java
> classes and programs written in 1456 object code using a text editor. The
> benefit is that people who either do not know Java or do not have Java
> compiling facilities available may reasonably straightforwardly produce,
> using just a text editor such as Notepad, quite elegant graphics programs
> with Java quality graphics. There is a speed overhead, but, even for fast
> running programs, a 1456 object code program can get up to about 40% of
> the speed of a specially written Java program. With programs that wait
> for user input, the difference in speed may not be noticeable.
>
> The system is fully described on www.users.globalnet.co.uk/~ngo which is
> our family webspace in England and readers are welcome to study it in full
> if they so wish, yet only a few documents need to be studied, and then
> only in part, for the purposes of this case study.
>
> The 1456 object code system relies for its underlying standardization that
> the software that interprets the 1456 object code (that is, the 1456
> engine) is written in Java. Therefore 1456 object code immediately fits
> in with being useable with a standard Java enabled browser on the internet
> and also to being useable on the JavaTV system as telesoftware. As JavaTV
> may well become a worldwide broadcasting standard there is practical
> importance in 1456 object code having full capability for being able to
> handle character strings in all languages that are encoded in unicode.
>
> Characters are introduced into the 1456 object code system documents in
> the document
>
> www.users.globalnet.co.uk/~ngo/14560600.htm
>
> where 1456 object code characters are said to be "represented using the 16
> bit unicode characters of Java."
>
> There are various registers explained. The key point for this discussion,
> though, is that one may load a character from the software into a
> register, as a sort of "load immediate" type instruction, in two ways.
>
> A 7 bit ascii printing character may be loaded using a two character
> sequence consisting of the ^ character followed by the desired character.
> For example, ^E can be used to encode the character U+0045 in the
> software.
>
> Any 16 bit unicode character may be loaded by a six character sequence
> consisting of 'u and four hexadecimal characters. So, the character
> U+0045 could be loaded using 'u0045 in the software.
>
> Clearly, the six character method can be used for more characters than the
> two character method, as the two character method can only be used for the
> characters that can be entered as 7 bit ascii printing characters from the
> keyboard when programming.
>
> Please note that when the 1456 object code is being obeyed the character
> that follows the ^ character is already existing as a 16 bit Java unicode
> character within the software, the conversion from 7 bit ascii to 16 bit
> unicode having taken place when it was loaded into the applet from the
> param statement of the applet call.
>
> The page
>
> www.users.globalnet.co.uk/~ngo/14560700.htm
>
> shows how the six character method using 'u may also be used in the entry
> of strings of characters.
>
> The next page that is needed for this case study is
>
> www.users.globalnet.co.uk/~ngo/14561100.htm
>
> and within that page the demo2.htm example.
>
> Within the source code of the demo2.htm file there are the following uses
> of the six character method.
>
> 'u00e9
>
> 'u0108
>
> 'u011d
>
> For example, the sequence
>
> [ Caf'u00e9]
>
> is used to load the four character string Cafe from the software where
> there is an acute accent on the e of the word Cafe.
>
> After that, the 'u method is used where needed to produce desired effects.
> It proved very useful to write the software that produced the diagram used
> in the document
>
> www.users.globalnet.co.uk/~ngo/14563100.htm
>
> later in the sequence. The diagram is near the end of the document.
>
> In that software, the characters
>
> 'u03b1
>
> 'u03b2
>
> 'u03b3
>
> 'u03be
>
> were used.
>
> The fonts that I have used are from Microsoft as mentioned in the document
>
> www.users.globalnet.co.uk/~ngo/14561100.htm
>
> mentioned previously. There are about 600 characters available, which is
> well less than the 65536 that the 'u command could produce. There are
> latin characters, greek characters and cyrillic characters and more.
>
> Having set the scene of how I apply unicode to my own application at
> present, the question arises as to how to proceed to use the full unicode
> system.
>
> I am quite happy to designate 'v followed by however many characters is
> judged necessary as being the way to load a however many bit unicode
> character into a register from the software. Perhaps that is 'v followed
> by eight hexadecimal characters, or maybe that is 'v followed by six
> hexadecimal characters. I can use 'V and 'v without any problem if that
> is what is needed.
>
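
As a rough illustration of the six-digit variant, a decoder along these
lines would turn such sequences into Java chars, using one char for code
points up to 0xffff and a surrogate pair above that. Everything here is
an assumption: the 'v syntax is only the form floated above, taken as
exactly six hexadecimal digits, and the class and method names are invented.

```java
// Illustrative sketch only. The 'v syntax (apostrophe, v, six hex digits)
// and the names VEscapeSketch and decode are hypothetical.
class VEscapeSketch {
    // Replace each 'vXXXXXX sequence in the source text with its character(s).
    static String decode(String source) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < source.length()) {
            if (source.startsWith("'v", i) && i + 8 <= source.length()) {
                int cp = Integer.parseInt(source.substring(i + 2, i + 8), 16);
                if (cp < 0 || cp > 0x10ffff) {
                    throw new IllegalArgumentException("code point out of range");
                }
                if (cp < 0x10000) {
                    out.append((char) cp);                        // fits in one char
                } else {
                    int offset = cp - 0x10000;
                    out.append((char) (0xd800 + (offset >> 10)));   // high surrogate
                    out.append((char) (0xdc00 + (offset & 0x3ff))); // low surrogate
                }
                i += 8;                                           // skip 'v + six digits
            } else {
                out.append(source.charAt(i));                     // ordinary character
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("'v01d11e");
        System.out.println(decoded.length());   // prints "2": one character, two chars
    }
}
```

The Java String ends up holding two chars for the one character, which is
the practical consequence of Java chars being 16 bits wide.
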
> Yet two further matters arise.
>
> 1. What about the fact that Java uses 16 bit characters?
>
> 2. Even if I code the extra characters using some system involving 'v and
> maybe 'V commands and however many hexadecimal characters following and
> storing them in the software, how am I supposed to display them on the
> screen? Are these characters available in font files? Suppose that I am
> needing to use an application where only, say, ten of these extra
> characters are used out of the large number of codes that are available,
> akin to the fact that the fonts that I am using have characters for only
> about 600 of the 65536 possible codes, can an ordinary font file be used
> to code these ten characters with the large code numbers? I would quite
> like to have a go at encoding the 'v and maybe 'V in a reasonable manner
> and trying it out with real data for real characters.
>
> I have tried in a posting, with reference to just a few web pages, to
> provide sufficient detail of the practical problem that I face in relation
> to the matters raised in this thread and wonder if the people who are
> specialist in unicode might like in their resolution of this thread to
> seek to prepare a document such that someone who is not a unicode
> specialist yet is trying to apply unicode to a real project where the
> unicode aspect is but one part of the project may straightforwardly find
> an explanation of the unicode system sufficient to be able to understand
> and program the underlying structure into software and apply that
> structure correctly using font files. Such a document would be very
> helpful. If it already exists, I would be pleased to know of a reference
> to it.
>
> William Overington
>
> 20 February 2001
>

===============================XML as Best Solution===
Joel Rees
  Media Fusion Co.,Ltd.
   Tokyo        TEL 03-5833-2965  FAX 03-5833-2972
   Head office  TEL 06-6415-2560  FAX 06-6415-2556
   http://www.mediafusion.co.jp
----------------------------------------------------
               Programmer -- XML Lab
====================================================


