Re: How to encode Hex10FFFF characters with UTF-16??

From: Mike Ayers (mayers@celequest.com)
Date: Thu Mar 16 2006 - 16:14:33 CST

Next message: Anto'nio Martins-Tuva'lkin: "New symbol for Russian rouble?"

Previous message: Kornkreismuster@web.de: "How to encode Hex10FFFF characters with UTF-16??"
In reply to: Kornkreismuster@web.de: "How to encode Hex10FFFF characters with UTF-16??"

Kornkreismuster@web.de wrote:
> Hi! Here is a small discussion I had privately.
>
> I have trouble understanding how it is possible to encode
> Hex10FFFF characters with UTF-16. If I try to calculate the range of
> UTF-16 I always get a maximum number of Hex10F7FF.
>
> Calculation:
>
> (DBFF - D7FF) * (DFFF - DBFF) + D7FF        + FFFF - DFFF
> (High Surr.)    (Low Surr.)     (0 to D7FF)   (D800 to FFFF)
>
> Please tell me how to encode Hex10FFFF characters.
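
A quick way to check the arithmetic above is to count the pieces directly. The following is only a sketch, and the variable names in it are illustrative rather than anything from the thread. It counts the surrogate pairs and the BMP values that stand for themselves, then asks where the numbering of the pairs ends:

```python
# Counting what UTF-16 can reach (illustrative names, not from the thread).
HIGH_SURROGATES = range(0xD800, 0xDC00)   # D800..DBFF, 0x400 values
LOW_SURROGATES = range(0xDC00, 0xE000)    # DC00..DFFF, 0x400 values

# Every (high, low) pair encodes one supplementary code point.
pairs = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

# BMP code units that stand for themselves: 0000..D7FF and E000..FFFF.
bmp_scalars = len(range(0x0000, 0xD800)) + len(range(0xE000, 0x10000))

# Total number of characters UTF-16 can encode.
total = pairs + bmp_scalars

# Supplementary code points are numbered from 0x10000 upward, so the
# surrogate values D800..DFFF are skipped in the numbering, not reused.
highest = 0x10000 + pairs - 1

print(hex(pairs), hex(bmp_scalars), hex(total), hex(highest))
# prints: 0x100000 0xf800 0x10f800 0x10ffff
```

The count of encodable characters comes to hex 10F800, one more than the figure in the question, because 0 to D7FF is D800 values rather than D7FF. The highest code point is a different number: surrogate pairs are assigned to U+10000 and up, so the D800 to DFFF gap is skipped rather than closed, and the last pair lands on hex 10FFFF.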

In hope of enlightenment, and to just generally spread the magic, I
present the return of the Magic Pocket Encoders. I strongly encourage
everyone who wants to be more familiar with the UTFs to print these and
glue 'em together. I keep mine ever handy, and they have served me well
over the years (except the UTF-32 MPE, whose sole purpose is to make the
set look bigger and more impressive).

Enjoy.

/|/|ike

P.S. Nonproportional font required

------------- Begin Forwarded Message -------------

Date: Thu, 08 Jul 2004 16:22:14 +0200
From: Otto Stolz <Otto.Stolz@uni-konstanz.de>
To: Unicode List <unicode@unicode.org>
Subject: UTF Magic Pocket Encoders

Hello,

Dominikus Scherkl (MGW) wrote about Cima's UTF-8 Magic Pocket Encoder:

>> Oha?
>> Updated without changing version and date?
>> ;-)

I had provided a Magic Pocket Encoder for UTF-16, and have since
been made aware of some spelling and wording errors.

Mike Ayers has contributed the crowning achievement: his
UTF-32 Magic Pocket Encoder. This one is already perfect,
hence it will probably never reach version 1.1 :-)

Attached, you'll find the current versions of all three,
in a somewhat enhanced typography: I have exploited box-drawing
characters, arrows, and proper (typographical) apostrophes.
Although not pure ASCII, these MPEs use only characters
that were already present in CP 437 (the original PC code page).

I haven't changed the wording, of course, except the version
number and date, and the reference to arrows (rather than
exclamation points), as appropriate.

I hope this will end the discussion on MPEs, which are toys,
after all (though they could also be used to visualize the
three UTF encodings).

Cheers,
Otto Stolz

------------- End Forwarded Message -------------

Side 1 (print and cut out):

╔════════════╦═══════╦═══════════════════════╦══════╗
║     U+0000 ║ yy zz ║  Cima’s UTF-8 Magic   ║ Hex↔ ║
║     U+007F ║ ↓  ↓  ║    Pocket Encoder     ║  B-4 ║
║         YZ ║ _  _  ║                       ║      ║
╟────────────╫───────╚═══════╗   Vers. 1.1   ║ 0↔00 ║
║     U+0080 ║ 3x xy │ 2y zz ║  2004-06-30   ║ 1↔01 ║
║     U+07FF ║ 3_ __ │ 2_ ↓  ║               ║ 2↔02 ║
║        XYZ ║ _  _  │ _  _  ║     M.C.      ║ 3↔03 ║
╟────────────╫───────┼───────╚═══════╗       ║ 4↔10 ║
║     U+0800 ║ 32 ww │ 2x xy │ 2y zz ║       ║ 5↔11 ║
║     U+FFFF ║ ↓  ↓  │ 2_ __ │ 2_ ↓  ║       ║ 6↔12 ║
║       WXYZ ║ E  _  │ _  _  │ _  _  ║       ║ 7↔13 ║
╟────────────╫───────┼───────┼───────╚═══════╣ 8↔20 ║
║ U-00010000 ║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9↔21 ║
║ U-000FFFFF ║ ↓  0_ │ 2_ ↓  │ 2_ __ │ 2_ ↓  ║ A↔22 ║
║      VWXYZ ║ F  _  │ _  _  │ _  _  │ _  _  ║ B↔23 ║
╟────────────╫───────┼───────┼───────┼───────╢ C↔30 ║
║ U-00100000 ║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D↔31 ║
║ U-0010FFFF ║ ↓  ↓  │ ↓  ↓  │ 2_ __ │ 2_ ↓  ║ E↔32 ║
║       WXYZ ║ F  4  │ 8  _  │ _  _  │ _  _  ║ F↔33 ║
╚════════════╩═══════╧═══════╧═══════╧═══════╩══════╝

Side 2 (print, cut out, and glue on back of side 1):

╔═══════════════════════════════════════════════════╗
║ Cima’s UTF-8 Magic Pocket Encoder - User’s Manual ║
║    (vers. 1.1, 2004-06-30, by Marco Cimarosti)    ║
║                                                   ║
║ - Left column: min and max Unicode scalar values: ║
║  pick the row that applies to the code point you  ║
║  want to convert to UTF-8. Letters V..Z mark the  ║
║  hexadecimal digits that have to be processed.    ║
║ - Right column: hexadecimal to base-4 table.      ║
║ - Central columns: work area to compute each octet║
║  (1 to 4) that constitute UTF-8 octet sequences.  ║
║  Convert each digit marked by V..Z from hex. to   ║
║  b.-4. Write b.-4 digits on the dots placed under ║
║  letters v..z (two b.-4 digits per hex. digit).   ║
║  Convert 2-digit base-4 number to hex. digits and ║
║  write them on the dots on the line. That is your ║
║  UTF-8 sequence in hex. ↓ Downwards arrows show   ║
║  passages that may be skipped, either because the ║
║  digit is hard-coded, or because it may be copied ║
║  directly from the scalar value.                  ║
╚═══════════════════════════════════════════════════╝
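
The card's procedure can also be followed in code as a cross-check. The sketch below is one reading of the layout above, not part of Cima's encoder: every octet is assembled from four base-4 digits, with the fixed lead digits 3, 32, or 33 on the first octet and 2 on each continuation octet. The names to_base4, from_base4, and utf8_via_base4 are hypothetical.

```python
def to_base4(value, width):
    """Return `value` as a list of `width` base-4 digits, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(value % 4)
        value //= 4
    return digits[::-1]


def from_base4(digits):
    """Turn a list of base-4 digits back into an integer."""
    number = 0
    for digit in digits:
        number = number * 4 + digit
    return number


def utf8_via_base4(code_point):
    """Encode one Unicode scalar value to UTF-8 the way the card does it."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return bytes([code_point])        # first row: copy the value as is
    if code_point <= 0x7FF:
        lead, payload_digits = [3], 6     # row "3x xy | 2y zz"
    elif code_point <= 0xFFFF:
        lead, payload_digits = [3, 2], 8  # row "32 ww | 2x xy | 2y zz"
    else:
        lead, payload_digits = [3, 3], 11 # rows starting "33 0v" and "33 10"
    quat = to_base4(code_point, payload_digits)
    in_lead = 4 - len(lead)               # payload digits that fit in octet 1
    octets = [lead + quat[:in_lead]]
    rest = quat[in_lead:]
    for i in range(0, len(rest), 3):      # continuation octets: 2 + 3 digits
        octets.append([2] + rest[i:i + 3])
    return bytes(from_base4(octet) for octet in octets)


print(utf8_via_base4(0x20AC).hex())    # prints: e282ac
print(utf8_via_base4(0x10FFFF).hex())  # prints: f48fbfbf
```

For any scalar value the result should match Python's own chr(cp).encode("utf-8"), which makes a convenient check on a hand-filled card.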

Obverse: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═════════════════════════════════╗
║     U+0000 ║ W  X  Y  Z  ║   Otto’s Magic Pocket Encoder   ║
║     U+D7FF ║ ↓  ↓  ↓  ↓  ║  for UTF-16™╔═══════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║    V→vv │    V→vv ║
╟────────────╫─────────────╢ Version 1.1 ║    U→uu │    U→uu ║
║     U+E000 ║ W  X  Y  Z  ║ ©2004-07-05 ║ tt←T    │ tt←T    ║
║     U+FFFF ║ ↓  ↓  ↓  ↓  ║             ║    _←__ │    _←__ ║
║       WXYZ ║ _  _  _  _  ║             ║ ────────┼──────── ║
╟────────────╫─────────────╚═════════════╣    0↔00 │ 13←8↔20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00←1↔01 │ 20←9↔21 ║
║ U-000FFFFF ║ ↓  2_ __ __ │ ↓  3_ ↓  ↓  ║ 01←2↔02 │ 21←A↔22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 02←3↔03 │ 22←B↔23 ║
╟────────────╫─────────────┼─────────────╢ 03←4↔10 │ 23←C↔30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10←5↔11 │ 30←D↔31 ║
║ U-0010FFFF ║ ↓  ↓  3_ __ │ ↓  3_ ↓  ↓  ║ 11←6↔12 │ 31←E↔32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 12←7↔13 │ 32←F↔33 ║
╚════════════╩═════════════╧═════════════╩═══════════════════╝

....:....1....:....2....:....3....:....4....:....5....:....6..

Reverse: Cut out and paste on back of obverse.

╔════════════════════════════════════════════════════════════╗
║ Otto’s Magic Pocket Encoder for UTF-16         Version 1.1 ║
║ User’s Manual             (inspired from Cima’s UTF-8 MPE) ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║
║  row that applies to the code point to be converted.       ║
║  T…Z mark the hexadecadic digits that have to be processed.║
║• Central column: work area to compute UTF-16BE code units. ║
║• Right column: hexadecadic to quaternary conversion tables:║
║  ← for T to tt; ↔ for U/V to uu/vv (step 1) and for step 2.║
║1. Convert each digit marked by T…V from hex to quat. Write ║
║   quat digits on the underscores placed under letters t…v. ║
║2. Convert 2-digit quat numbers to hex digits or copy digits║
║   W…Z, as indicated, and write them on the underscores on  ║
║   the next line. That’s your UTF-16BE sequence in hex.     ║
║↓ Downwards arrows indicate shortcuts.                      ║
╚════════════════════════════════════════════════════════════╝
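
For comparison with a hand-filled card, here is a small sketch of the same conversion in code. It uses the usual subtract, shift, and mask arithmetic rather than the card's base-4 route; subtracting hex 10000 plays the role of the card's T to tt table. The function names utf16_units and utf16be_bytes are hypothetical.

```python
def utf16_units(code_point):
    """Return the UTF-16 code unit(s) for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        return [code_point]               # BMP: one unit, copied unchanged
    offset = code_point - 0x10000         # 20 bits remain after the shift
    high = 0xD800 | (offset >> 10)        # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)       # bottom 10 bits go in the low surrogate
    return [high, low]


def utf16be_bytes(code_point):
    """Serialize the code units big-endian, as the card's bottom row reads."""
    return b"".join(unit.to_bytes(2, "big") for unit in utf16_units(code_point))


print([hex(unit) for unit in utf16_units(0x10FFFF)])  # prints: ['0xdbff', '0xdfff']
print(utf16be_bytes(0x10000).hex())                   # prints: d800dc00
```

The last pair of surrogates, DBFF DFFF, is what encodes U+10FFFF, which is the case the original question asked about.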

Enjoy.

Side 1 (print and cut out):

╔════════════╦═══════════════════════╤═══════════════╗
║ This space ║  Mike’s UTF-32 Magic  │   Vers. 1.0   ║
║  for rent  ║    Pocket Encoder     │ 06 July 2004  ║
║            ║                       │               ║
╠════════════╬═══════╤═══════╤═══════╪═══════╗       ║
║ U-00000000 ║ 0  0  │ U  V  │ W  X  │ Y  Z  ║       ║
║ U-0010FFFF ║ ↓  ↓  │ ↓  ↓  │ ↓  ↓  │ ↓  ↓  ║       ║
║     UVWXYZ ║ 0  0  │ _  _  │ _  _  │ _  _  ║       ║
╚════════════╩═══════╧═══════╧═══════╧═══════╩═══════╝

Side 2 (print, cut out, and glue on back of side 1):

╔════════════════════════════════════════════════════╗
║ Mike’s UTF-32 Magic Pocket Encoder - User’s Manual ║
║      (vers. 1.0, 6 July 2004, by Mike Ayers)       ║
║                                                    ║
║ - Left column: min and max Unicode scalar values.  ║
║   Letters U..Z mark the hexadecimal digits to be   ║
║   processed. Read the bytes in the bottom row      ║
║   left to right, or right to left for UTF-32LE.    ║
╚════════════════════════════════════════════════════╝
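
In the same spirit, a minimal sketch of what this card computes: the code point written as four bytes, read in one direction or the other. The function name utf32_bytes and its little_endian flag are hypothetical.

```python
def utf32_bytes(code_point, little_endian=False):
    """Return the four UTF-32 bytes for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Big-endian reads the card's bottom row left to right,
    # little-endian reads the same row right to left.
    return code_point.to_bytes(4, "little" if little_endian else "big")


print(utf32_bytes(0x10FFFF).hex())                      # prints: 0010ffff
print(utf32_bytes(0x10FFFF, little_endian=True).hex())  # prints: ffff1000
```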


This archive was generated by hypermail 2.1.5 : Thu Mar 16 2006 - 16:27:59 CST