From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 07 2008 - 19:52:59 CDT
> >This isn't as much of an advantage as it sounds, since in most Unicode
> >processes you need to be prepared to deal with multiple characters at
> >once anyway.
>
> I don't get the point. Whether you're dealing with one character or
> many, life is simpler if they're all the same size.
I think the point that John was making is that if you are
constructing APIs, whether public APIs or internal ones, most
of the time you are better off defining them as a string interface
rather than a character interface.
Even if you "think" you are just dealing with a "character,"
it is often the case that what is of interest is actually
a combining character sequence, a grapheme cluster, a
collation contraction element, or some other significant
sequence of code points.
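For instance, here is a minimal sketch in Java (nothing
normative; the class name is hypothetical) that walks a string
by grapheme cluster boundaries, using BreakIterator's
"character" instance as an approximation of those boundaries:

    import java.text.BreakIterator;

    public class GraphemeWalk {
        public static void main(String[] args) {
            // "e" + U+0301 COMBINING ACUTE ACCENT: two code points,
            // but one user-perceived character (grapheme cluster)
            String s = "e\u0301x";
            BreakIterator bi = BreakIterator.getCharacterInstance();
            bi.setText(s);
            for (int start = bi.first(), end = bi.next();
                 end != BreakIterator.DONE;
                 start = end, end = bi.next()) {
                System.out.println("cluster: " + s.substring(start, end));
            }
        }
    }

A character-at-a-time interface would hand you the "e" and the
accent separately; the string-level view keeps them together.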
And if you already have a UTF-16 string API in place, the API doesn't
care whether it is getting a single-code-unit BMP character or a
two-code-unit SMP character.
Of course the code underneath, if it is actually parsing code points
from the string, needs to know the distinction and behave
correctly. But it is often the case that complex code can
be written much more cleanly if it is just passing string
pointers (or objects) up and down the stack, rather than
prematurely parsing out characters and passing them along
individually as parameters. For Unicode this is particularly
important, because there are so many complex conditions where
the behavior of a character depends on its string context.
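As a sketch of what I mean (Java again, names hypothetical):
the public surface takes a string, and the surrogate detection
lives in one place underneath, invisible to callers:

    public class Utf16Walk {
        // Count the code points in a UTF-16 string, pairing
        // surrogates as it goes.
        static int countCodePoints(CharSequence s) {
            int count = 0;
            int i = 0;
            while (i < s.length()) {
                // a high surrogate followed by a low surrogate is one
                // supplementary code point taking two UTF-16 code units
                if (Character.isHighSurrogate(s.charAt(i))
                        && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i += 2;
                } else {
                    i += 1;  // BMP code point (or unpaired surrogate)
                }
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            // "A" (BMP, one code unit) + U+10400 (SMP, a surrogate pair)
            System.out.println(countCodePoints("A\uD801\uDC00"));  // 2
        }
    }

Callers just pass strings around; only this one loop ever has
to know that supplementary characters take two code units.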
Yes, if you do *everything* in UTF-32, the same arguments
for string APIs would apply, without the need for surrogate
detection when parsing code point boundaries; but there are
a number of good reasons why people choose to (or have to)
process text in UTF-16 as well.
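In Java terms again, and purely as an illustration (assuming
Java 8's codePoints() accessor): the code point view of a
string is effectively UTF-32, so iterating it needs no
surrogate logic at all:

    public class Utf32View {
        public static void main(String[] args) {
            // the same string as above, seen as whole code points
            int[] utf32 = "A\uD801\uDC00".codePoints().toArray();
            for (int cp : utf32) {
                System.out.printf("U+%04X%n", cp);  // U+0041, U+10400
            }
        }
    }

The trade-off is four bytes per code point, which is one of
those reasons people stay with UTF-16.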
--Ken