>
> Can a UTF-8 string ever become longer when it's converted to upper- or
> lowercase?
>
> --
> Hallvard
>
I assume that you are talking about the BMP characters. In that case,
for upper case, the answer is no, assuming my conversion tables are correct.
The key question is whether any conversions cross the following boundaries:
0x7f
0x7ff
Those are the points where UTF-8 goes from 1 byte to 2 bytes and from
2 bytes to 3 bytes.
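
The two boundaries can be written down as a small function. This is only
a sketch in Python; the name utf8_len is made up for illustration, and it
covers the 4-byte case as well even though that lies outside the BMP.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs to encode the code point cp."""
    if cp <= 0x7F:
        return 1      # 0x0000 - 0x007f
    if cp <= 0x7FF:
        return 2      # 0x0080 - 0x07ff
    if cp <= 0xFFFF:
        return 3      # 0x0800 - 0xffff (rest of the BMP)
    return 4          # beyond the BMP


# Show the encoded size on either side of each boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800):
    print(hex(cp), utf8_len(cp))
```

A case conversion can only change the encoded size of a character if the
source and target code points give different results here.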
The only character that crosses a boundary is 0x17f, which I have
converting to 0x53, or 'S'. I do not have access to my Unicode book
to see what character 0x17f is. All of the other characters that go
to upper case stay in the same range as the lower case characters.
In this one case, converting to upper case will shrink the string.
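
The 0x17f case is easy to check by hand. The sketch below uses Python's
built-in case tables and its unicodedata module, which are an assumption
here and need not match the conversion tables described above. It looks
up the character's name and compares the encoded sizes.

```python
import unicodedata

long_s = "\u017f"                      # the character at 0x17f
upper = long_s.upper()                 # Python's case tables, not the poster's

name = unicodedata.name(long_s)        # official Unicode character name
before = len(long_s.encode("utf-8"))   # 0x17f is above 0x7f: 2 bytes
after = len(upper.encode("utf-8"))     # 'S' is 0x53: 1 byte

print(name, repr(upper), before, after)
```

Python reports 0x17f as LATIN SMALL LETTER LONG S and upper-cases it to
'S', so the string goes from 2 bytes to 1.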
Since we do not convert to lower case, I cannot speak for it, but based
on the above boundaries, and the fact that such crossings do not happen
for upper casing apart from the one case above, I doubt that lower casing
a string would change its size.
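
Anyone who wants to test that guess rather than take it on trust could
scan the BMP directly. The following is a rough sketch, assuming Python's
str.upper and str.lower as the conversions; the helper name
utf8_size_changes is hypothetical. Python applies full case mappings
(one character may map to several), and its tables follow whatever
Unicode version the interpreter ships, so the results need not agree
with the tables discussed in this message.

```python
def utf8_size_changes(convert, limit=0x10000):
    """Return (code point, bytes before, bytes after) for every BMP
    character whose UTF-8 length changes under convert."""
    changes = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        before = len(ch.encode("utf-8"))
        after = len(convert(ch).encode("utf-8"))
        if before != after:
            changes.append((cp, before, after))
    return changes


# Summarise both directions: how many characters change size, and how
# many of those get longer.
for label, convert in (("upper", str.upper), ("lower", str.lower)):
    changes = utf8_size_changes(convert)
    grew = [c for c in changes if c[2] > c[1]]
    print(label, "changed:", len(changes), "grew:", len(grew))
```

The "grew" count is the one that matters for the original question,
since it is the case where a fixed-size buffer would be too small.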
For the record, characters that have upper case conversions are in the
following ranges:
0x0000 - 0x05ff
0x1e00 - 0x1fff
0x2100 - 0x21ff
0x2400 - 0x24ff
0xff00 - 0xffff
Keith Hafen