Re: Unicode, SMS and year 2012 from Martin J. Dürst on 2012-04-28 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Sat, 28 Apr 2012 14:10:52 +0900

On 2012/04/28 7:29, Cristian Secară wrote:
> On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote:
>
>> Actually, if the goal is to get as many characters in as possible,
>> Punycode might be the best solution. That is the encoding used for
>> internationalized domains. In that form, it uses a smaller number of
>> bytes per character, but a parameterization allows use of all byte
>> values.
>
> I suspect the goal of punycode is to map a wide character set onto a
> restricted character set, without much concern for the resulting
> string length; if the original string is in a character set other
> than the target restricted one, then the string length increases too
> much to be of interest in the SMS discussion.

Not exactly. Compression was very much a goal when designing punycode.
It won against a number of other algorithms as the choice for IDNs and
is clearly very good for that purpose.

> Just do a test: write something in a non-Latin alphabetic script into
> this page here http://demo.icu-project.org/icu-bin/idnbrowser

Well, as a silly example, what about
ααααααααααααααααααααααααααααααααααααααααααααααααααααααααα?
(that's 57 α characters). The result is
xn--mxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,
which is 63 characters long.
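
For anyone who wants to check these numbers, here is a minimal sketch
using the "punycode" codec from Python's standard library. The helper
name to_ace_label is made up for illustration. It only adds the "xn--"
prefix and does not perform the Nameprep/IDNA processing a real IDN
tool would. The UTF-16 byte count is printed purely as an assumed
point of comparison for the 2-bytes-per-character case.

```python
# Sketch: reproduce the 57-alpha example with Python's built-in
# "punycode" codec. to_ace_label is a hypothetical helper, not part
# of any library.

def to_ace_label(label: str) -> str:
    """Punycode-encode one label and prepend the IDN "xn--" prefix.

    No Nameprep/IDNA mapping and no length check is applied here.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


text = "\u03b1" * 57          # 57 x GREEK SMALL LETTER ALPHA
ace = to_ace_label(text)

print(ace)
print(len(text))                        # 57 characters in
print(len(ace))                         # 63 ASCII characters out
print(len(text.encode("utf-16-be")))    # 114 bytes at 2 bytes per character
```

A string of one repeated character is of course the best case. Passing
ordinary text in a non-Latin script to to_ace_label in the same way
shows how the output length behaves on less artificial input.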

Regards, Martin.
Received on Sat Apr 28 2012 - 00:14:58 CDT