RE: Surrogate pairs and UTF-8

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Jun 21 2006 - 22:33:14 CDT

Next message: Otto Stolz: "Re: Surrogate pairs and UTF-8"

Previous message: Pierpaolo BERNARDI: "Re: Surrogate pairs and UTF-8"
In reply to: Mike: "Re: Surrogate pairs and UTF-8"
Next in thread: Philippe Verdy: "Re: Surrogate pairs and UTF-8"
Reply: Philippe Verdy: "Re: Surrogate pairs and UTF-8"

I'm disturbed by something here.

Pavils wrote:

> I am a developer who needs to write UTF-8 encoder and decoder in
> JavaScript.

In JavaScript, Strings (and thus text) are sequences of UTF-16 code units.
Thus U+10000 is represented by the surrogate pair 0xD800 0xDC00. The String
class treats these as two "characters" in a String object (in methods such
as charCodeAt() or indexOf()).
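
To make the code-unit point concrete, here is a minimal sketch (the variable
names are mine, purely for illustration) showing how U+10000 appears to the
String methods and how the two code units recombine into one code point:

```javascript
// U+10000 written as the surrogate pair mentioned above.
const s = "\uD800\uDC00";

// length counts UTF-16 code units, not characters.
const unitCount = s.length;

// charCodeAt() returns one code unit at a time.
const high = s.charCodeAt(0);
const low = s.charCodeAt(1);

// Recombine the high and low surrogates into a single code point.
const codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

console.log(unitCount, high.toString(16), low.toString(16), codePoint.toString(16));
// prints: 2 d800 dc00 10000
```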

In JavaScript there is no such thing as an "encoding". Text in the DOM or in
documents, headers, and other text sources that you are manipulating is
converted to/from the internal String class by the JavaScript runtime, which
is paying attention to HTTP headers and what the browser thinks the encoding
of the JS source file or the document being read or written is. The
exception to this is when generating URIs from strings, for which there are
a variety of escape methods (escape, unescape, encodeURI,
encodeURIComponent, etc.). What I'm getting at here is: there are no data
types or methods for manipulating bytes or character encodings. There is no
JavaScript equivalent to the C char* or Java byte. There is no way that I'm
aware of to write a UTF-8 encoder or decoder (i.e. code that converts a
String to a UTF-8 byte sequence in an object or vice versa). There are
plenty of ways to put Strings into a UTF-8 file (or read from a UTF-8 file).
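
The URI escape functions are the one place where UTF-8 byte values become
visible to a script, because encodeURIComponent() percent-escapes the UTF-8
form of each character. The following is only a sketch of how one might peek
at those values; utf8BytesViaUriEscape is a hypothetical helper name, and the
result is an ordinary array of numbers, not a real byte type:

```javascript
// Hypothetical helper: list the UTF-8 byte values that encodeURIComponent()
// produces for a string. Note that encodeURIComponent() throws a URIError
// if the string contains an unpaired surrogate.
function utf8BytesViaUriEscape(str) {
  const escaped = encodeURIComponent(str);
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped.charAt(i) === "%") {
      // "%XX" is one byte written as two hex digits.
      bytes.push(parseInt(escaped.substring(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Characters left unescaped are plain ASCII, one byte each.
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesViaUriEscape("\uD800\uDC00")); // [ 240, 144, 128, 128 ]
```

That is, U+10000 comes out as the four bytes F0 90 80 80.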

There is usually something (else) wrong when a developer is trying to do
this in JavaScript.

Pavils: what is it you are trying to do that you think requires you to
encode or decode UTF-8?

Addison

Addison Phillips
Internationalization Architect - Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On Behalf Of Mike
> Sent: mercredi 21 juin 2006 16:34
> To: unicode@unicode.org
> Subject: Re: Surrogate pairs and UTF-8
>
> If you come across a surrogate in a UTF-8 stream, you should treat
> it as an error (e.g. throw an exception or something).
>
> However if you are converting from UTF-16 to UTF-8, then you will
> need two surrogates (a high surrogate and a low surrogate) to
> determine which character is encoded. Table 3-4 in the link you
> cited shows how to convert from surrogate pairs to codepoints.
>
> Once you know which codepoint is encoded, use Table 3-5 to compute
> the byte values in the UTF-8 sequence.
>
> Mike
>
> P.S. Here is an array you should find useful in determining how to
> decode UTF-8 sequences:
>
> const unsigned char Utf8Length[256] = {
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
> 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8
> };
>
> Use the first byte of the UTF-8 sequence as an index into this
> array. The value is the length of the UTF-8 sequence. If it is
> not in the range 1-4, then you have run into an error. Once you
> have the length n, if it is in the range 2-4, check that each of
> the next n-1 bytes maps to 0 in the array (i.e. is a trail byte).
>
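
The lead-byte approach quoted above might look roughly like this in
JavaScript, assuming the "bytes" are simply an array of integers 0-255. The
names utf8SequenceLength and decodeUtf8 are hypothetical. This sketch is also
stricter than the quoted table: it rejects the lead bytes C0, C1 and F5-F7,
overlong forms, and surrogates found in the UTF-8 data.

```javascript
// Hypothetical helper: sequence length implied by a lead byte, or 0 if the
// byte cannot start a well-formed sequence (trail bytes included).
function utf8SequenceLength(lead) {
  if (lead < 0x80) return 1;
  if (lead >= 0xC2 && lead <= 0xDF) return 2;
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  if (lead >= 0xF0 && lead <= 0xF4) return 4;
  return 0;
}

// Hypothetical helper: decode an array of byte values into a String.
function decodeUtf8(bytes) {
  // Smallest code point allowed for each sequence length (rejects overlongs).
  const minimum = [0, 0, 0x80, 0x800, 0x10000];
  let out = "";
  for (let i = 0; i < bytes.length; ) {
    const lead = bytes[i];
    const n = utf8SequenceLength(lead);
    if (n === 0 || i + n > bytes.length) {
      throw new Error("Invalid or truncated sequence at byte " + i);
    }
    // Keep only the payload bits of the lead byte.
    let cp = n === 1 ? lead : lead & (0xFF >> (n + 1));
    for (let k = 1; k < n; k++) {
      const b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) {
        throw new Error("Expected a trail byte at byte " + (i + k));
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < minimum[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw new Error("Ill-formed sequence at byte " + i);
    }
    if (cp >= 0x10000) {
      // Supplementary code point: emit a surrogate pair.
      const v = cp - 0x10000;
      out += String.fromCharCode(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
    i += n;
  }
  return out;
}
```

For example, decodeUtf8([0xF0, 0x90, 0x80, 0x80]) gives the two-unit String
"\uD800\uDC00", while decodeUtf8([0xED, 0xA0, 0x80]) throws because those
bytes would decode to the surrogate 0xD800.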
> Pavils Jurjans wrote:
> > Hello all,
> >
> > I am a developer who needs to write a UTF-8 encoder and decoder in
> > JavaScript. I've found the encoding form in the link
> > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703
> > and that is pretty much what I need to do the job. However, I am
> > completely lacking in-depth information about surrogate pairs and
> > how to handle them in UTF-8. So, here are the questions I am looking
> > to answer:
> > - I have read the theoretical definition of what a surrogate pair is.
> >   However, I have never seen any in "life". Can you give an example of
> >   some surrogate pairs, and what their respective characters look like?
> > - The guides on the unicode.org <http://unicode.org/> site talk only
> >   about surrogate pairs and UTF-16 conversion. How about UTF-8?
> >
> > Thank you for any clues.
> >
> > With kind regards,
> > Pavils Jurjans
> >
>
>
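
For the UTF-16 to UTF-8 direction described in the quoted message (combine
the surrogate pair into a code point, then apply the UTF-8 bit distribution),
a minimal sketch could look like the following. encodeUtf8 is a hypothetical
name, and the result is a plain array of numbers standing in for bytes, since
that is an assumption about how the caller wants to hold them:

```javascript
// Hypothetical helper: convert a String (UTF-16 code units) into an array
// of UTF-8 byte values. Unpaired surrogates are treated as errors.
function encodeUtf8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const low = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (low < 0xDC00 || low > 0xDFFF) {
        throw new Error("Unpaired high surrogate at index " + i);
      }
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      i++; // the low surrogate has been consumed
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new Error("Unpaired low surrogate at index " + i);
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18),
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}

console.log(encodeUtf8("\uD800\uDC00")); // [ 240, 144, 128, 128 ]
```

The pair 0xD800 0xDC00 becomes one four-byte sequence, F0 90 80 80, rather
than two three-byte sequences, which is the point of combining the
surrogates first.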


This archive was generated by hypermail 2.1.5 : Wed Jun 21 2006 - 23:12:03 CDT