Re: unicode Digest V6 #134

From: J Andrew Lipscomb (ewwa@chattanooga.net)
Date: Wed Jun 21 2006 - 18:46:08 CDT


    > I am a developer who needs to write a UTF-8 encoder and decoder in
    > JavaScript. I've found the encoding form at
    > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703
    > and that is pretty much what I need to do the job. However, I am
    > completely lacking in-depth information about surrogate pairs and
    > how to handle them in UTF-8. So, here are my questions:
    > - I have read the theoretical definition of what a surrogate pair is.
    > However, I have never seen any in "life". Can you give an example
    > of some surrogate pairs, and what their respective characters look
    > like?
    > - The guides on the unicode.org site talk only about surrogate pairs
    > and UTF-16 conversion. How about UTF-8?

    Surrogate pairs don't exist in UTF-8. Surrogate pairs are a concept
    unique to UTF-16 for representing characters beyond the Basic
    Multilingual Plane. To take an example, consider the character
    U+1D444 (mathematical italic capital Q). Since this requires more than
    16 bits, UTF-16 uses the surrogate pair concept, which works as follows:
    First, subtract &h10000 (using &h to denote hexadecimal), leaving
    &hD444.
    Next, write that out as a 20-bit binary number (0000 1101 0100 0100
    0100). Now, divide that into two groups of 10 bits (0000110101
    0001000100).
    Then stick 110110 in front of the high half (1101 1000 0011 0101),
    and convert to &hD835. That codepoint is the high surrogate.
    Similarly for the low half, using 110111 as the prefix, you get 1101
    1100 0100 0100 or &hDC44. Those two codepoints together represent
    U+1D444 in UTF-16.
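
    Since you are working in JavaScript, here is a minimal sketch of the
    same steps in code. The function name toSurrogatePair is just an
    illustrative choice of mine, not a standard API, and the sketch
    assumes the input is a code point in the range U+10000..U+10FFFF.

```javascript
// Hypothetical helper (name chosen for illustration only).
// Splits a supplementary code point into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value, e.g. 0xD444
  const high = 0xD800 | (offset >> 10);   // prefix 110110 + top 10 bits
  const low = 0xDC00 | (offset & 0x3FF);  // prefix 110111 + bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D444);
console.log(high.toString(16), low.toString(16)); // d835 dc44
```

    The constants 0xD800 and 0xDC00 are simply the prefixes 110110 and
    110111 followed by ten zero bits.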

    But in UTF-8, you just represent it in the same pattern as any other
    character. The bit pattern is 11101010001000100 (17 bits), which
    breaks into sixes, counting from the right, as 11101 010001 000100.
    But since the first part has five bits and the lead byte of a
    three-byte character in UTF-8 has room for only four, you have to
    make it a four-byte character, zero-padding to 000 011101 010001
    000100. Adding the appropriate prefixes (11110 for the lead byte, 10
    for each following byte) yields 11110000 10011101 10010001 10000100,
    which is the UTF-8 representation of the character (F0 9D 91 84).
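
    The same procedure might be sketched in JavaScript as follows. The
    function names encodeCodePointUtf8 and encodeStringUtf8 are
    hypothetical names of my own. The second function assumes an engine
    that supports for...of over strings and String.prototype.codePointAt
    (ES2015 or later), which walk a string by code point, so a surrogate
    pair in the UTF-16 string is recombined before it is encoded.

```javascript
// Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
function encodeCodePointUtf8(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp < 0x80) {
    return [cp];                                  // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6),                     // 110xxxxx
            0x80 | (cp & 0x3F)];                  // 10xxxxxx
  }
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12),                    // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18),                      // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// Hypothetical helper: encode a JavaScript (UTF-16) string as UTF-8.
// for...of yields one code point per step, so a valid surrogate pair
// arrives here as a single character; a lone surrogate makes
// encodeCodePointUtf8 throw.
function encodeStringUtf8(str) {
  const bytes = [];
  for (const ch of str) {
    bytes.push(...encodeCodePointUtf8(ch.codePointAt(0)));
  }
  return bytes;
}

const hex = encodeStringUtf8("\uD835\uDC44").map(b => b.toString(16));
console.log(hex.join(" ")); // f0 9d 91 84
```

    Note that the surrogates themselves (D835 and DC44) never get
    encoded individually. Only the combined code point U+1D444 does.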


