algorithm implementation: coping with code properties

From: spir (denis.spir@free.fr)
Date: Thu Jan 28 2010 - 05:45:31 CST

  • Next message: Karl Pentzlin: "Re: Medievalist ligature character in the PUA"

    Hello,

    It seems code property files typically look like (example from GraphemeBreakProperty.txt):

    ================================
    ...
    007F..009F ; Control # Cc [33] <control-007F>..<control-009F>
    00AD ; Control # Cf SOFT HYPHEN
    0600..0603 ; Control # Cf [4] ARABIC NUMBER SIGN..ARABIC SIGN SAFHA
    06DD ; Control # Cf ARABIC END OF AYAH
    070F ; Control # Cf SYRIAC ABBREVIATION MARK
    ...
    0591..05BD ; Extend # Mn [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
    05BF ; Extend # Mn HEBREW POINT RAFE
    ...
    ================================

    Meaning each code or range is individually assigned a property. I imagine 2 ways to cope with such data in an application.

    -1- using the list as is -------

    Parse the list as is into a collection (eg associative array) that maps each code or range to its property value. Concretely, there can also be two separate collections for individual codes and ranges, but the principle remains. (Note: Even for individual codes, the collection cannot be a list, indeed, for there may be huge holes between code values).

    Let's say property values can be a,b,c. The algorithm's main loop content may look like:

        <get p from c>
        if p == a then ...
        elseif p == b then ...
        ...
        else ...

    (Actually, algorithms often require testing properties of a pair of successive codes in string, but this does not change the overall scheme.)

    The issue is to get p when c is not indvidually listed; this may be the common case, since ranges usually cover a greater number of code values. We then need to iterate over a *global* list of ranges, checking whether any of them holds c. (The code's property value is always searched before the series of tests.)

    -2- constructing categories -------

    An alternative is to create categories (see below *) A,B,C holding the codes or ranges which property is a,b,c, respectively.
    Then, the algorithm rather looks like (OO-like pseudo-code):

        if A.holds(c) then ...
        elseif B.holds(c) then ...
        ...
        else ...

    Again, if ever c is not individually listed, we need to iterate over ranges. But each category only holds a subset of the global list of ranges, and the search will stop as soon as one matches. (Note the property is not searched _before_ the series of tests.)

    Anyway, aside performance, I find this solution cleaner --for any reason.

    A small issue is it first requires grouping the source data under titles, eg:
    ================================
    Control:
    ...
    007F..009F
    00AD
    0600..0603
    06DD
    070F
    ...
    Extend:
    ...
    0591..05BD
    05BF
    ...
    ================================
    Or even more directly something like:
    ================================
    ControlCodes = (..., (007F..009F), 00AD, (0600..0603), 06DD, 070F, ...)
    ExtendCodes = (..., (0591..05BD), 05BF, ...)
    ================================
    But this needs to be done only once per data file and, indeed, the same data parsing/rewriting tool func can apply to all files.

    What do you think?

    Denis

    (*) Category
    A category conceptually is a list of ranges, plus a set of individual codes. Testing whether a code belongs to it is simply checking the set, then each range.
    A type can be built for this kind of object, or simply a factory func (that builds it from a list of codes and ranges) and an associated "holds" func.
    Example code available in Lua and Python.
    ________________________________

    la vita e estrany

    http://spir.wikidot.com/



    This archive was generated by hypermail 2.1.5 : Thu Jan 28 2010 - 05:51:48 CST