Actually, if the goal is to fit as many characters as possible into a
message, Punycode might be the best solution. It is the encoding used for
internationalized domain names. In that form it restricts itself to a small
set of byte values (ASCII letters, digits, and hyphen), but the algorithm is
parameterized, and a different parameterization would allow use of all byte
values.
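
As a rough illustration, here is a minimal Python sketch using the standard
library's "punycode" codec, which implements the IDN parameterization (ASCII
output only). The sample sentence and the helper name are assumptions made up
for this example, and the "packed" figure assumes the ASCII output could be
carried as raw 7-bit septets. A parameterization that uses all byte values
would need a custom encoder and is not shown.

```python
def encoded_sizes(text):
    """Return octet counts for text as UTF-16 and as Punycode."""
    utf16 = text.encode("utf-16-be")   # 2 octets per BMP character
    puny = text.encode("punycode")     # ASCII-only output (IDN parameters)
    # The encoding is lossless: decoding must give back the input.
    assert puny.decode("punycode") == text
    # Assumption: ASCII output is carried as 7-bit septets, 8 septets per 7 octets.
    packed = (len(puny) * 7 + 7) // 8
    return {"utf16": len(utf16), "punycode": len(puny), "punycode_packed": packed}


sample = "Bună ziua, ce mai faceți?"  # hypothetical Romanian sample
sizes = encoded_sizes(sample)
print(sizes)
```

The ASCII characters pass through unchanged and each non-ASCII character
costs a few extra ASCII digits at the end, so mostly-Latin text stays close
to one code unit per character.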
------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*
On Fri, Apr 27, 2012 at 11:21, Doug Ewell <doug_at_ewellic.org> wrote:
> Cristian Secară <orice at secarica dot ro> wrote:
>
> > It turned out that they (ETSI and its groups) created a way around the
> > 70-character limitation, namely the “National Language Single Shift”
> > and “National Language Locking Shift” mechanisms. These are described
> > in the 3GPP TS 23.038 standard and were introduced in Release 8. In
> > short, each is a character substitution table, defined per language
> > and applied either per character or per message.
> >
> > Personally I find this to be a stone-age approach, which in my
> > opinion does not work at all if I enter the message from my PC
> > keyboard via the phone's PC application (because the language cannot
> > always be predicted, mainly if I am using dead keys). It is true that
> > the actual SMS stream limit is not very generous, but I wonder whether
> > SCSU would have been a better approach in terms of i18n. I also don't
> > know whether SCSU requires a language to be declared in advance, or
> > whether it simply works out the required window for each character
> > by itself.
>
> I agree that treating character repertoire as simply a matter of
> language selection, and creating language-specific code pages, is a
> backward-looking solution. Not only is language tagging not always an
> option, as Cristian points out, but people don't want to be tied to the
> absolute minimum character repertoire that someone decided was necessary
> to write a given language, even in a text message. Just look at the rise
> of emoji in text messages.
>
> And, of course, I agree that SCSU would have been a much better
> solution. Most of the current arguments against SCSU wouldn't apply to
> SMS: the cross-site scripting argument wouldn't apply if SCSU were the
> only "extended" encoding, or if the protocol tagged it, and the
> complex-encoder argument wouldn't apply to any phone from the last 5
> years that can take pictures and shoot videos and scan bar codes and run
> numerous apps simultaneously. (SCSU doesn't require a complex encoder
> anyway, although it can benefit incrementally from one.)
>
> Interestingly, one of the first mentions I can find on the Unicode list
> of SCSU-like compression — actually a description of RCSU, the
> predecessor to SCSU — was in the context of SMS message compression:
>
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html
>
> Neither RCSU nor SCSU quite fits the original bill, which was to
> represent Unicode in 7 bits per character (with some overhead) and thus
> achieve 160 characters per message. Both schemes use 8-bit code units.
> Still, 140 characters is much better than 70.
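
The arithmetic behind these figures, as a small Python sketch. It assumes the
usual 140-octet SMS user-data payload, which is where the 160 and 70 limits
come from; the constant and function names are made up for this example.

```python
SMS_PAYLOAD_OCTETS = 140  # assumed user-data size of a single SMS


def max_code_units(bits_per_unit):
    """How many fixed-width code units fit in one SMS payload."""
    return SMS_PAYLOAD_OCTETS * 8 // bits_per_unit


for bits in (7, 8, 16):
    print(bits, "bits per unit:", max_code_units(bits), "units per message")
```

With 8-bit code units, any per-message overhead (such as an SCSU window tag)
comes out of the 140, one character per overhead byte.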
>
> > SCSU seems to be OK for my language, or Hungarian, or Bulgarian,
> > etc., but is it also OK for non-Latin and non-Cyrillic scripts?
> > Compare this with the language shift mechanism, which is still
> > 7-bit. Release 10 of that standard includes language locking shift
> > tables for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada,
> > Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu.
>
> SCSU works equally well, or almost so, with any text sample where the
> non-ASCII characters fit into a single block of 128 code points. For
> anything other than Latin-1 you need one byte of overhead, to switch to
> another window, and for many scripts you need two, to define a window
> and switch to it. But again, two bytes is not what's holding anyone up.
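
The overhead described here can be seen with a deliberately minimal SCSU
encoder sketch in Python. It handles only text whose non-ASCII characters all
fit in one 128-code-point window. The initial window positions and the
SCn/SDn tag values are taken from UTS #6 as I understand them, redefining
window 0 when no predefined window fits is an arbitrary choice, and the
function name is made up for this example. A real encoder would also handle
window switching, quoting, and Unicode mode.

```python
# Initial dynamic window start positions (UTS #6), windows 0..7.
DEFAULT_WINDOWS = (0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00)
SC0 = 0x10  # SCn: switch to predefined window n (one byte)
SD0 = 0x18  # SDn: define window n and switch to it (tag + one argument byte)


def scsu_single_window(text):
    """Encode text that needs at most one SCSU window; raise ValueError otherwise."""
    high = {ord(c) for c in text if ord(c) >= 0x80}
    prefix, offset = b"", DEFAULT_WINDOWS[0]
    if high:
        for index, start in enumerate(DEFAULT_WINDOWS):
            if all(start <= cp < start + 0x80 for cp in high):
                offset = start
                # Window 0 is active at start, so it needs no tag at all.
                prefix = b"" if index == 0 else bytes([SC0 + index])
                break
        else:
            # No predefined window fits: define one on a 128-aligned block.
            blocks = {cp // 0x80 for cp in high}
            block = blocks.pop()
            if blocks or not 0x01 <= block <= 0x67:
                raise ValueError("text does not fit a single simple window")
            offset = block * 0x80
            prefix = bytes([SD0, block])  # arbitrarily redefines window 0
    out = bytearray(prefix)
    for ch in text:
        cp = ord(ch)
        if cp in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
            out.append(cp)                  # ASCII passes through unchanged
        elif offset <= cp < offset + 0x80:
            out.append(0x80 + cp - offset)  # one byte inside the window
        else:
            raise ValueError("character needs quoting, not handled here")
    return bytes(out)


for word in ("Öl fließt", "Москва", "தமிழ்"):
    encoded = scsu_single_window(word)
    print(len(word), "characters ->", len(encoded), "bytes:", encoded.hex())
```

The three sample words cover the three cases: Latin-1 with no overhead,
Cyrillic with a one-byte switch to a predefined window, and Tamil with a
two-byte window definition.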
>
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
Received on Fri Apr 27 2012 - 14:29:41 CDT