Re: How to encode Hex10FFFF characters with UTF-16??

From: Mike Ayers (mayers@celequest.com)
Date: Thu Mar 16 2006 - 16:14:33 CST


    Kornkreismuster@web.de wrote:
    > Hi! Here is a small discussion I had privately.
    >
    > I've got a problem to understand how it is possible to encode
    > Hex10FFFF characters with UTF-16. If I try to calculate the range of
    > UTF-16 I always get a maximum number of Hex10F7FF.
    >
    > Calculation:
    >
    > (DBFF - D7FF) * (DFFF - DBFF) + D7FF + FFFF - DFFF
    > (High Surr.) (Low Surr.) (0 to D7FF) (D800 to FFFF)
    >
    > Please tell me how to encode Hex10FFFF characters.
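
    The quoted arithmetic can be checked mechanically. The following is only a
    sketch in Python, with variable names made up for illustration. It counts
    each range inclusively (the quoted sum adds D7FF where the range 0 to D7FF
    actually holds D800 values) and starts the surrogate pairs at U+10000,
    which is where UTF-16 maps them. The count of encodable characters comes
    out as Hex10F800, while the highest code point is Hex10FFFF, because the
    Hex800 surrogate code points are skipped in the numbering.

```python
# Inclusive sizes of the ranges involved (names are illustrative only).
bmp_below_surrogates = 0xD7FF - 0x0000 + 1   # U+0000..U+D7FF
bmp_above_surrogates = 0xFFFF - 0xE000 + 1   # U+E000..U+FFFF
high_surrogates = 0xDBFF - 0xD800 + 1        # 0x400 lead code units
low_surrogates = 0xDFFF - 0xDC00 + 1         # 0x400 trail code units
pairs = high_surrogates * low_surrogates     # 0x100000 combinations

# Number of characters (scalar values) that UTF-16 can represent.
scalar_values = bmp_below_surrogates + bmp_above_surrogates + pairs

# Surrogate pairs are assigned to code points starting at U+10000,
# so the highest code point is not the same thing as the count.
highest_code_point = 0x10000 + pairs - 1

print(hex(scalar_values))        # 0x10f800
print(hex(highest_code_point))   # 0x10ffff
```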

            In hope of enlightenment, and to just generally spread the magic, I
    present the return of the Magic Pocket Encoders. I strongly encourage
    everyone who wants to be more familiar with the UTFs to print these and
    glue 'em together. I keep mine ever handy, and they have served me well
    over the years (except the UTF-32 MPE, whose sole purpose is to make the
    set look bigger and more impressive).

            Enjoy.

    /|/|ike

    P.S. Nonproportional font required

    ------------- Begin Forwarded Message -------------

    Date: Thu, 08 Jul 2004 16:22:14 +0200
    From: Otto Stolz <Otto.Stolz@uni-konstanz.de>
    To: Unicode List <unicode@unicode.org>
    Subject: UTF Magic Pocket Encoders

    Hello,

    Dominikus Scherkl (MGW) wrote about Cima's UTF-8 Magic Pocket Encoder:

    >> Oha?
    >> Updated without changing version and date?
    >> ;-)

    I had provided a Magic Pocket Encoder for UTF-16, and have since
    been made aware of some spelling and wording errors.

    Mike Ayers has contributed the crowning achievement: his
    UTF-32 Magic Pocket Encoder. This one is already perfect,
    hence it will probably never reach version 1.1 :-)

    Attached, you'll find the current versions of all three,
    in a somewhat enhanced typography: I have exploited box-drawing
    characters, arrows, and proper (typographical) apostrophes.
    While not being ASCII proper, these MPEs use only characters
    that were already present in CP 437 (the original PC code).

    I haven't changed the wording, of course, except the version
    number and date, and the reference to arrows (rather than
    exclamation points), as appropriate.

    I hope this will end the discussion on MPEs, which are toys,
    after all (though they could also be used to visualize the
    three UTF encodings).

    Cheers,
        Otto Stolz

    ------------- End Forwarded Message -------------

    Side 1 (print and cut out):

    ╔════════════╦═══════╦═══════════════════════╦══════╗
    ║ U+0000     ║ yy zz ║  Cima’s UTF-8 Magic   ║ Hex↔ ║
    ║ U+007F     ║ ↓  ↓  ║    Pocket Encoder     ║ B-4  ║
    ║ YZ         ║ _  _  ║                       ║      ║
    ╟────────────╫───────╚═══════╗   Vers. 1.1   ║ 0↔00 ║
    ║ U+0080     ║ 3x xy │ 2y zz ║   2004-06-30  ║ 1↔01 ║
    ║ U+07FF     ║ 3_ __ │ 2_ ↓  ║               ║ 2↔02 ║
    ║ XYZ        ║ _  _  │ _  _  ║     M.C.      ║ 3↔03 ║
    ╟────────────╫───────┼───────╚═══════╗       ║ 4↔10 ║
    ║ U+0800     ║ 32 ww │ 2x xy │ 2y zz ║       ║ 5↔11 ║
    ║ U+FFFF     ║ ↓  ↓  │ 2_ __ │ 2_ ↓  ║       ║ 6↔12 ║
    ║ WXYZ       ║ E  _  │ _  _  │ _  _  ║       ║ 7↔13 ║
    ╟────────────╫───────┼───────┼───────╚═══════╣ 8↔20 ║
    ║ U-00010000 ║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9↔21 ║
    ║ U-000FFFFF ║ ↓  0_ │ 2_ ↓  │ 2_ __ │ 2_ ↓  ║ A↔22 ║
    ║ VWXYZ      ║ F  _  │ _  _  │ _  _  │ _  _  ║ B↔23 ║
    ╟────────────╫───────┼───────┼───────┼───────╢ C↔30 ║
    ║ U-00100000 ║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D↔31 ║
    ║ U-0010FFFF ║ ↓  ↓  │ ↓  ↓  │ 2_ __ │ 2_ ↓  ║ E↔32 ║
    ║ WXYZ       ║ F  4  │ 8  _  │ _  _  │ _  _  ║ F↔33 ║
    ╚════════════╩═══════╧═══════╧═══════╧═══════╩══════╝

    Side 2 (print, cut out, and glue on back of side 1):

    ╔═══════════════════════════════════════════════════╗
    ║ Cima’s UTF-8 Magic Pocket Encoder - User’s Manual ║
    ║    (vers. 1.1, 2004-06-30, by Marco Cimarosti)    ║
    ║                                                   ║
    ║ - Left column: min and max Unicode scalar values: ║
    ║  pick the row that applies to the code point you  ║
    ║  want to convert to UTF-8. Letters V..Z mark the  ║
    ║  hexadecimal digits that have to be processed.    ║
    ║ - Right column: hexadecimal to base-4 table.      ║
    ║ - Central columns: work area to compute each octet║
    ║  (1 to 4) that constitute UTF-8 octet sequences.  ║
    ║  Convert each digit marked by V..Z from hex. to   ║
    ║  b.-4. Write b.-4 digits on the dots placed under ║
    ║  letters v..z (two b.-4 digits per hex. digit).   ║
    ║  Convert 2-digit base-4 number to hex. digits and ║
    ║  write them on the dots on the line. That is your ║
    ║  UTF-8 sequence in hex. ↓ Downwards arrows show   ║
    ║  passages that may be skipped, either because the ║
    ║  digit is hard-coded, or because it may be copied ║
    ║  directly from the scalar value.                  ║
    ╚═══════════════════════════════════════════════════╝
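
    For comparison with the card, here is a minimal Python sketch of the same
    conversion done with shifts and masks rather than base-4 digits (each
    base-4 digit on the card stands for two bits). The function name
    utf8_octets is invented for this illustration and is not part of the MPE;
    Python's built-in str.encode does the same job and is used below only as
    a cross-check.

```python
def utf8_octets(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:                      # row 1: one octet, copied as is
        return bytes([cp])
    if cp <= 0x7FF:                     # row 2: lead octet 110xxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # row 3: lead octet 1110xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # rows 4 and 5: lead octet 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_octets(0x20AC).hex())    # e282ac
print(utf8_octets(0x10FFFF).hex())  # f48fbfbf
```

    The last line matches the bottom row of the card: the first octet is the
    hard-coded F4 and the second starts with 8.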

    Obverse: Print with a fixed-width font, such as Lucida Console,
    and cut out.

    ╔════════════╦═════════════╦═════════════════════════════════╗
    ║ U+0000     ║ W  X  Y  Z  ║  Otto’s Magic Pocket Encoder    ║
    ║ U+D7FF     ║ ↓  ↓  ↓  ↓  ║  for UTF-16™╔═══════════════════╣
    ║ WXYZ       ║ _  _  _  _  ║             ║    V→vv │    V→vv ║
    ╟────────────╫─────────────╢ Version 1.1 ║    U→uu │    U→uu ║
    ║ U+E000     ║ W  X  Y  Z  ║ ©2004-07-05 ║ tt←T    │ tt←T    ║
    ║ U+FFFF     ║ ↓  ↓  ↓  ↓  ║             ║    _←__ │    _←__ ║
    ║ WXYZ       ║ _  _  _  _  ║             ║ ────────┼──────── ║
    ╟────────────╫─────────────╚═════════════╣    0↔00 │ 13←8↔20 ║
    ║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00←1↔01 │ 20←9↔21 ║
    ║ U-000FFFFF ║ ↓  2_ __ __ │ ↓  3_ ↓  ↓  ║ 01←2↔02 │ 21←A↔22 ║
    ║ TUVYZ      ║ D  _  _  _  │ D  _  _  _  ║ 02←3↔03 │ 22←B↔23 ║
    ╟────────────╫─────────────┼─────────────╢ 03←4↔10 │ 23←C↔30 ║
    ║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10←5↔11 │ 30←D↔31 ║
    ║ U-0010FFFF ║ ↓  ↓  3_ __ │ ↓  3_ ↓  ↓  ║ 11←6↔12 │ 31←E↔32 ║
    ║ UVYZ       ║ D  B  _  _  │ D  _  _  _  ║ 12←7↔13 │ 32←F↔33 ║
    ╚════════════╩═════════════╧═════════════╩═══════════════════╝

    ....:....1....:....2....:....3....:....4....:....5....:....6..

    Reverse: Cut out and paste on back of obverse.

    ╔════════════════════════════════════════════════════════════╗
    ║ Otto’s Magic Pocket Encoder for UTF-16         Version 1.1 ║
    ║ User’s Manual               (inspired by Cima’s UTF-8 MPE) ║
    ╠════════════════════════════════════════════════════════════╣
    ║• Left column: min and max Unicode scalar values: pick the  ║
    ║  row that applies to the code point to be converted.       ║
    ║  T…Z mark the hexadecadic digits that have to be processed.║
    ║• Central column: work area to compute UTF-16BE code units. ║
    ║• Right column: hexadecadic to quaternary conversion tables:║
    ║  ← for T to tt; ↔ for U/V to uu/vv (step 1) and for step 2.║
    ║1. Convert each digit marked by T…V from hex to quat. Write ║
    ║   quat digits on the underscores placed under letters t…v. ║
    ║2. Convert 2-digit quat numbers to hex digits or copy digits║
    ║   W…Z, as indicated, and write them on the underscores on  ║
    ║   the next line. That’s your UTF-16BE sequence in hex.     ║
    ║↓ Downwards arrows indicate shortcuts.                      ║
    ╚════════════════════════════════════════════════════════════╝
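
    The same steps can be sketched in Python with ordinary integer
    arithmetic. This is an illustration only: the function name
    utf16be_bytes is made up here and does not come from the card. The
    subtraction of 0x10000 appears to correspond to the "←" table on the
    card, which maps the digit T to the base-4 form of T minus 1.

```python
def utf16be_bytes(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-16BE (illustrative helper)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0xFFFF:
        # Rows 1 and 2 of the card: the four hex digits are copied as is.
        units = [cp]
    else:
        # Rows 3 and 4: 20 bits are split 10/10 over a surrogate pair.
        v = cp - 0x10000
        units = [0xD800 | (v >> 10),     # high (lead) surrogate
                 0xDC00 | (v & 0x3FF)]   # low (trail) surrogate
    # Big-endian: most significant byte of each code unit first.
    return b"".join(u.to_bytes(2, "big") for u in units)


print(utf16be_bytes(0x1D11E).hex())   # d834dd1e
print(utf16be_bytes(0x10FFFF).hex())  # dbffdfff
```

    The second line shows that U+10FFFF is reachable: it is the pair made
    of the last high surrogate (DBFF) and the last low surrogate (DFFF).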

    Enjoy.

    Side 1 (print and cut out):

    ╔════════════╦═══════════════════════╤═══════════════╗
    ║ This space ║  Mike’s UTF-32 Magic  │   Vers. 1.0   ║
    ║  for rent  ║    Pocket Encoder     │ 06 July 2004  ║
    ║            ║                       │               ║
    ╠════════════╬═══════╤═══════╤═══════╪═══════╗       ║
    ║ U-00000000 ║ 0  0  │ U  V  │ W  X  │ Y  Z  ║       ║
    ║ U-0010FFFF ║ ↓  ↓  │ ↓  ↓  │ ↓  ↓  │ ↓  ↓  ║       ║
    ║ UVWXYZ     ║ 0  0  │ _  _  │ _  _  │ _  _  ║       ║
    ╚════════════╩═══════╧═══════╧═══════╧═══════╩═══════╝

    Side 2 (print, cut out, and glue on back of side 1):

    ╔════════════════════════════════════════════════════╗
    ║ Mike’s UTF-32 Magic Pocket Encoder - User’s Manual ║
    ║      (vers. 1.0, 6 July 2004, by Mike Ayers)       ║
    ║                                                    ║
    ║ - Left column: min and max Unicode scalar values.  ║
    ║   Letters U..Z mark the hexadecimal digits to be   ║
    ║   processed. Read the bytes in the bottom row      ║
    ║   left to right, or right to left for UTF-32LE.    ║
    ╚════════════════════════════════════════════════════╝
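
    For completeness, the UTF-32 card amounts to writing the scalar value
    as four bytes. A minimal Python sketch follows; the function name
    utf32_bytes and its little_endian parameter are invented for this
    example.

```python
def utf32_bytes(cp: int, little_endian: bool = False) -> bytes:
    """Encode one Unicode scalar value as UTF-32BE or UTF-32LE."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    # The bottom row of the card, read left to right (BE) or
    # right to left (LE).
    return cp.to_bytes(4, "little" if little_endian else "big")


print(utf32_bytes(0x10FFFF).hex())                      # 0010ffff
print(utf32_bytes(0x10FFFF, little_endian=True).hex())  # ffff1000
```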


