Re: string vs. char [was Re: Java and Unicode]

From: addison@inter-locale.com
Date: Fri Nov 17 2000 - 11:44:01 EST


Thanks, Mark. I've looked extensively at the ICU code in doing much of the
design of this system. What my email didn't end up saying was, basically,
that the "char" functions decode each code point to its scalar value
internally, holding it in a 32-bit integer.

The question, I guess, boils down to: put it in the interface, or hide it
in the internals. ICU exposes it. My spec, up to this point, hides it,
because I think that programmers will be working with strings more often
than with individual characters and that perhaps this will seem more
"natural".

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Thu, 16 Nov 2000, Mark Davis wrote:

> We have found that it works pretty well to have a uchar32 datatype, with
> uchar16 storage in strings. In the ICU (C) version we use macros for
> efficient access; in the ICU (C++) version we use method calls; and in the
> ICU (Java) version we have a set of static utility methods (since we can't
> add to the Java String API).
>
> With these functions, the number of changes that you have to make to
> existing code is fairly small, and you don't have to change the way that
> loops are set up, for example.
>
> Mark
>
> ----- Original Message -----
> From: <addison@inter-locale.com>
> To: "Unicode List" <unicode@unicode.org>
> Sent: Thursday, November 16, 2000 13:24
> Subject: string vs. char [was Re: Java and Unicode]
>
>
> > Normally this thread would be of only academic interest to me...
> >
> > ...but this week I'm writing a spec for adding Unicode support to an
> > embedded operating system written in C. Thanks to Messrs. O'Conner and
> > Scherer's presentations at the most recent IUC, I was aware of the clash
> > between internal string representations and the Unicode Scalar Value
> > necessary for efficient lookup.
> >
> > Now I'm getting alarmed about the solution I've selected.
> >
> > The OS I'm working on is written in C. I considered, therefore, using
> > UTF-8 as the internal Unicode representation (because I don't have the
> > option of #defining Unicode and using wchar), but the storage expansion
> > and the fact that several existing modules grok UTF-16 (well, UCS-2), led
> > me to go in the direction of UTF-16.
> >
> > I also considered supporting only UCS-2. It's a bad bad bad idea, but it
> > gets me out of the following:
> >
> > I ended up deciding that the Unicode API for this OS will only work in
> > strings. CTYPE replacement functions (such as isalpha) and character based
> > replacement functions (such as strchr) will take and return strings for
> > all of their arguments.
> >
> > Internally, my functions convert the pointed-to character to its
> > scalar value (to look it up in the database most efficiently).
> >
> > This isn't very satisfying. It goes somewhat against the grain of 'C'
> > programming. But it's equally unsatisfying to use a 32-bit representation
> > for a character and a 16-bit representation for a string, because in 'C',
> > a string *is* an array of characters. Which is more
> > natural? Which is more common? Iterating across an array of 16-bit values
> > or across an array of 32-bit values?
> >



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:15 EDT