Re: ISO 8859-11 (Thai) cross-mapping table

From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Tue Oct 08 2002 - 08:30:20 EDT

    At 6:51 AM -0400 10/8/02, John Cowan wrote:
    >Marco Cimarosti scripsit:
    >
    >> Talking about the format of mapping tables, I always wondered why not using
    >> ranges. In the case of ISO 8859-11, the table would become as compact as
    >> three lines:

    In XOM I currently do a quick initial test with an if statement for
    0x00 through 0xA0. This handles the very common case of ASCII quickly.
    (The C1 controls and the non-breaking space are gravy.) The remainder
    I do with a switch statement with one case per value. It's my
    recollection that Java compilers can compile this very efficiently
    using the table lookup instructions built into Java's virtual machine
    (tableswitch and lookupswitch). However, array lookup might be quicker
    still. One day I'll have to profile this and find out for sure.
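
For comparison, here is a minimal sketch of the range-based alternative raised earlier in the thread, combined with the same fast path for 0x00 through 0xA0. It is not XOM's code: the class name Iso8859_11Sketch and the method encode are hypothetical, and it assumes the usual ISO 8859-11 layout in which bytes 0xA1 through 0xDA and 0xDF through 0xFB correspond to U+0E01 through U+0E3A and U+0E3F through U+0E5B.

```java
// Hypothetical sketch; names are illustrative and not part of XOM.
final class Iso8859_11Sketch {

    // Offset between the Thai block and its ISO 8859-11 byte values
    // (U+0E01 - 0xA1 = 0x0D60).
    private static final int THAI_OFFSET = 0x0D60;

    // Returns the ISO 8859-11 byte value (0..255) for c,
    // or -1 if c has no mapping in that character set.
    static int encode(char c) {
        // Fast path: 0x00 through 0xA0 map to themselves
        // (ASCII, the C1 controls, and the non-breaking space).
        if (c <= 0x00A0) return c;
        // First Thai run: U+0E01..U+0E3A -> 0xA1..0xDA
        if (c >= 0x0E01 && c <= 0x0E3A) return c - THAI_OFFSET;
        // Second Thai run: U+0E3F..U+0E5B -> 0xDF..0xFB
        if (c >= 0x0E3F && c <= 0x0E5B) return c - THAI_OFFSET;
        return -1;
    }
}
```

Whether three comparisons like these beat a compiled switch or a plain array lookup is exactly the kind of question that needs profiling rather than guessing.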

    The Verifier class has a similar issue, though there it's a case of
    determining whether any given character is a legal XML character,
    name character, name-start character, etc. This is done with a trick
    introduced in JDOM where the code looks like this:

         public static boolean isXMLLetter(char c) {
             // Note that order is very important here. The search proceeds
             // from lowest to highest values, so that no searching occurs
             // above the character's value. BTW, the first line is equivalent to:
             // if (c >= 0x0041 && c <= 0x005A) return true;

             if (c < 0x0041) return false; if (c <= 0x005A) return true;
             if (c < 0x0061) return false; if (c <= 0x007A) return true;
             if (c < 0x00C0) return false; if (c <= 0x00D6) return true;
             if (c < 0x00D8) return false; if (c <= 0x00F6) return true;
             if (c < 0x00F8) return false; if (c <= 0x00FF) return true;
             if (c < 0x0100) return false; if (c <= 0x0131) return true;
             if (c < 0x0134) return false; if (c <= 0x013E) return true;
             // ... the remaining ranges are omitted from this excerpt ...

             return false;
         }

    This means ASCII and Latin-1 are pretty quick, but the further you go
    into Unicode the more checks have to be made.

    This almost certainly could be sped up with a table lookup, at the
    cost of carrying around a few static 65,536-element boolean arrays.
    (Anyone happen to know whether Java uses one byte per boolean in
    arrays or not?)
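
A minimal sketch of the table-lookup idea, built only from the seven ranges shown in the excerpt above (the real list is much longer). The class LetterTableSketch and its method names are hypothetical and are not part of JDOM or XOM. On the storage question: the Java Virtual Machine Specification notes that Sun's implementation stores boolean arrays as byte arrays, one byte per element, so each 65,536-element table costs roughly 64 KB there. A java.util.BitSet, which packs one bit per entry, is shown alongside as a smaller alternative.

```java
import java.util.BitSet;

// Hypothetical sketch; names are illustrative and not part of JDOM or XOM.
final class LetterTableSketch {

    // Inclusive [start, end] pairs taken from the excerpt above.
    // The full list of XML letter ranges is much longer.
    private static final int[] RANGES = {
        0x0041, 0x005A, 0x0061, 0x007A, 0x00C0, 0x00D6, 0x00D8, 0x00F6,
        0x00F8, 0x00FF, 0x0100, 0x0131, 0x0134, 0x013E
    };

    // One entry per possible char value.
    private static final boolean[] TABLE = new boolean[65536];

    // Same information packed at one bit per char value.
    private static final BitSet BITS = new BitSet(65536);

    static {
        for (int i = 0; i < RANGES.length; i += 2) {
            for (int c = RANGES[i]; c <= RANGES[i + 1]; c++) {
                TABLE[c] = true;
            }
            // BitSet.set(from, to) treats the upper bound as exclusive.
            BITS.set(RANGES[i], RANGES[i + 1] + 1);
        }
    }

    // Constant-time lookup regardless of how high the code point is.
    static boolean isLetterByArray(char c) {
        return TABLE[c];
    }

    // Same answer from the packed representation.
    static boolean isLetterByBitSet(char c) {
        return BITS.get(c);
    }
}
```

Either lookup costs the same for a Thai or CJK character as for ASCII, which is the point of the trade: memory for uniform speed.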

    -- 
    +-----------------------+------------------------+-------------------+
    | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
    +-----------------------+------------------------+-------------------+
    |          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
    |              http://www.cafeconleche.org/books/xian2/              |
    |  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
    +----------------------------------+---------------------------------+
    |  Read Cafe au Lait for Java news:  http://www.cafeaulait.org/      |
    |  Read Cafe con Leche for XML news: http://www.cafeconleche.org/    |
    +----------------------------------+---------------------------------+
    


    This archive was generated by hypermail 2.1.5 : Tue Oct 08 2002 - 09:13:45 EDT