Re: case conversion -> longer UTF8?

From: Steve Watt (swatt@progress.com)
Date: Fri Apr 02 1999 - 18:48:54 EST


This may be true for the informative case rules supplied by Unicode, but
it may not hold for other case conversions, such as those needed for
Turkish, which map the ASCII "i" (1 byte) to upper case I with dot above
(U+0130, 2 bytes), and the upper case "I" (1 byte) to lower case dotless i
(U+0131, 2 bytes). As a general rule, you should allow for expansion.
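
A minimal sketch of the Turkish case, in Python. The mapping tables and
helper names (TURKISH_UPPER, TURKISH_LOWER, turkish_upper, turkish_lower,
utf8_len) are hypothetical and written only for illustration; Python's
built-in str.upper() and str.lower() are locale-independent and do not
apply these rules on their own.

```python
# Hypothetical Turkish-specific overrides (illustrative only, not a
# complete locale implementation).
TURKISH_UPPER = {
    "i": "\u0130",  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",  # dotless i -> I
}
TURKISH_LOWER = {
    "I": "\u0131",  # I -> LATIN SMALL LETTER DOTLESS I
    "\u0130": "i",  # I with dot above -> i
}


def turkish_upper(s: str) -> str:
    """Upper-case s, applying the Turkish overrides first."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in s)


def turkish_lower(s: str) -> str:
    """Lower-case s, applying the Turkish overrides first."""
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in s)


def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


if __name__ == "__main__":
    for word in ("i", "dil"):
        up = turkish_upper(word)
        print(word, utf8_len(word), "->", up, utf8_len(up))
    low = turkish_lower("I")
    print("I", utf8_len("I"), "->", low, utf8_len(low))
```

In both directions a 1-byte ASCII letter becomes a 2-byte sequence, so a
buffer sized for the input string would be too small for the result.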

Keith Hafen wrote:
>
> >
> > Can a UTF-8 string ever become longer when it's converted to upper- or
> > lowercase?
> >
> > --
> > Hallvard
> >
>
> I assume that you are talking about the BMP characters. In that case,
> for upper case, the answer is no, assuming my conversion tables are correct.
> The key question is whether any conversions cross the following boundaries:
> 0x7f
> 0x7ff
> since that is where you go from 1 byte to 2 bytes and from 2 bytes to 3 bytes.
>
> The only character that crosses a boundary is 0x17f, which I have
> converting to 0x53, or 'S'. I do not have access to my Unicode book
> to see what character 0x17f is. All of the other characters that go
> to upper case stay in the same range as the lower case characters.
> In this one case converting to upper case will shrink the size of the string.
>
> Since we do not convert to lower case, I cannot speak for it, but based
> on the above boundaries, and the fact that they do not happen for upper
> casing, I doubt that lower casing a string would change the size.
>
> For the record, characters that have upper case conversions are in the
> following ranges:
> 0x0000 - 0x05ff
> 0x1e00 - 0x1fff
> 0x2100 - 0x21ff
> 0x2400 - 0x24ff
> 0xff00 - 0xffff
>
> Keith Hafen
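
The boundary argument in the quoted message can be sketched in Python as
follows. The helper names (utf8_width, crosses_boundary) are hypothetical.
The check only compares single code points, the way a one-to-one
conversion table would; it does not reproduce Keith's tables, and Python's
own str.upper() may map some characters to more than one character.

```python
def utf8_width(cp: int) -> int:
    """Bytes needed to encode code point cp in UTF-8."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def crosses_boundary(src: int, dst: int) -> int:
    """Change in encoded size when code point src is converted to dst.

    Negative means the string shrinks, positive means it grows,
    zero means both code points fall in the same width range.
    """
    return utf8_width(dst) - utf8_width(src)


if __name__ == "__main__":
    # The one crossing mentioned above: 0x17f converting to 0x53 ('S').
    print(crosses_boundary(0x17F, 0x53))
    # The width formula agrees with the actual encoder for this character.
    print(len(chr(0x17F).encode("utf-8")), len("S".encode("utf-8")))
```

For the 0x17f to 0x53 conversion the difference is negative, which matches
the observation that upper-casing shrinks the string in that one case.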


