From: Doug Ewell (dewell@adelphia.net)
Date: Mon May 29 2006 - 12:57:02 CDT
Theodore H. Smith <delete at elfdata dot com> replied to Cristian
Secară:
>> Every time I try to send a SMS message that includes accented
>> characters for my language (Romanian), I can't stop to blame those
>> who have established the SMS technical standard, because the fixed
>> 2-bytes character for Latin is pure waste of space (and money :).
>
> BOCU would have been more sensible. It can usually encode codepoints
> above 256 in one byte per character, and it can represent every code
> point.
Actually that's not the full story with BOCU-1, because it requires 2
bytes not only to encode a Latin character outside of ASCII but also 2
bytes to encode the next ASCII character (except space or controls).
BOCU-1 works better on text that fits within a single 128-character block.
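To make the 2-byte penalty concrete, here is a rough byte-count model in Python. It is not a BOCU-1 encoder: it only estimates output length, and it assumes the single-byte difference range (-64..63), the two-byte range (-10513..10512), and the "middle of the current 128-code-point block" state rule as I understand them from the BOCU-1 description. The function name is made up for illustration.

```python
def bocu1_cost_estimate(text):
    """Estimate BOCU-1 output length in bytes for Latin-range text.

    Simplified model (assumption, not a real encoder): covers only
    code points below U+3000 and differences that fit in two bytes.
    """
    prev = 0x40          # assumed initial state: middle of the ASCII block
    total = 0
    for ch in text:
        c = ord(ch)
        if c > 0x2FFF:
            raise ValueError("model only covers code points below U+3000")
        if c <= 0x20:
            # Space and C0 controls are single bytes; space leaves the
            # state alone, controls reset it to the ASCII block.
            total += 1
            if c != 0x20:
                prev = 0x40
            continue
        diff = c - prev
        if -64 <= diff <= 63:
            total += 1
        elif -10513 <= diff <= 10512:
            total += 2
        else:
            raise ValueError("difference outside the modelled range")
        # New state: middle of the 128-code-point block containing c.
        prev = (c & ~0x7F) + 0x40
    return total


for word in ("abc", "ăab", "ăă", "ă b"):
    print(word, bocu1_cost_estimate(word), len(word.encode("utf-8")))
```

Under this model an accented letter surrounded by ASCII letters costs 2 bytes and so does the ASCII letter after it, which is the effect described above.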
The Romanian translation of the Universal Declaration of Human Rights --
which is probably not representative of text that would be sent via
SMS -- yields the following sizes:
12,841 bytes in UTF-8
12,454 bytes in SCSU (3% decrease)
13,498 bytes in BOCU-1 (5% increase)
Cristian can probably supply a more appropriate sample text for
comparison.
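For anyone who wants to check the percentages, they can be re-derived from the three byte counts with a few lines of Python. The variable and function names are only illustrative; the numbers are the ones reported above, with UTF-8 taken as the baseline.

```python
# Byte counts reported for the Romanian UDHR text.
sizes = {"UTF-8": 12841, "SCSU": 12454, "BOCU-1": 13498}
baseline = sizes["UTF-8"]


def percent_change(size, base):
    """Signed size change relative to the baseline, in percent."""
    return (size - base) / base * 100


changes = {name: percent_change(size, baseline) for name, size in sizes.items()}

for name, pct in changes.items():
    print(f"{name}: {sizes[name]} bytes ({pct:+.1f}% vs UTF-8)")
```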
Additionally, BOCU-1 wasn't available when SMS was developed. And, like
SCSU or UTF-8, it requires an 8-bit byte, which represents a 14% size
increase for messages that fit wholly within the existing 7-bit GSM
scheme.
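The 14% figure is just 8/7, and it can be tied to message capacity with a short calculation. The sketch below assumes the usual 140-octet payload of a single SMS, which is a figure from the GSM specifications rather than from this thread, and it ignores concatenation headers and escape sequences.

```python
PAYLOAD_OCTETS = 140               # assumed single-SMS payload size
PAYLOAD_BITS = PAYLOAD_OCTETS * 8  # 1120 bits


def chars_per_message(bits_per_char):
    """Whole characters that fit in one payload at a fixed width."""
    return PAYLOAD_BITS // bits_per_char


capacity = {bits: chars_per_message(bits) for bits in (7, 8, 16)}
overhead_8_vs_7 = (8 / 7 - 1) * 100

for bits, count in capacity.items():
    print(f"{bits:2d} bits per character: {count} characters per message")
print(f"8-bit vs 7-bit size increase: {overhead_8_vs_7:.1f}%")
```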
In any case, either SCSU or BOCU-1 would have been a dramatic
improvement for Romanian over simply falling back to 16 bits.
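A quick way to see the cost of the 16-bit fallback is to compare encoded lengths of a short Romanian sentence. Python's standard library has no SCSU or BOCU-1 codec, so this sketch uses UTF-8 as a stand-in for a byte-oriented encoding and UTF-16 for the 2-bytes-per-character case. The sample sentence is my own choice and is no more representative of SMS text than the UDHR is.

```python
# Illustrative sample only: opening sentence of UDHR Article 1 in Romanian.
sample = ("Toate ființele umane se nasc libere și egale "
          "în demnitate și în drepturi.")

char_count = len(sample)
utf8_len = len(sample.encode("utf-8"))
utf16_len = len(sample.encode("utf-16-be"))   # no BOM, 2 bytes per BMP character

print(f"characters: {char_count}")
print(f"UTF-8 bytes: {utf8_len}")
print(f"UTF-16 bytes: {utf16_len}")
print(f"UTF-16 / UTF-8 ratio: {utf16_len / utf8_len:.2f}")
```

For mostly-ASCII Latin text like this, the 16-bit form comes out at close to twice the size of the byte-oriented one.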
-- Doug Ewell Fullerton, California, USA http://users.adelphia.net/~dewell/