algorithm implementation: coping with code properties

From: spir ([email protected])
Date: Thu Jan 28 2010 - 05:45:31 CST

Next message: Karl Pentzlin: "Re: Medievalist ligature character in the PUA"

Previous message: Brett Zamir: "Uniocde protocol or URNs?"
Next in thread: spir: "Re: algorithm implementation: coping with code properties"
Reply: spir: "Re: algorithm implementation: coping with code properties"
Reply: Mark Davis ☕: "Re: algorithm implementation: coping with code properties"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Hello,

It seems code property files typically look like (example from GraphemeBreakProperty.txt):

================================
...
007F..009F ; Control # Cc [33] <control-007F>..<control-009F>
00AD ; Control # Cf SOFT HYPHEN
0600..0603 ; Control # Cf [4] ARABIC NUMBER SIGN..ARABIC SIGN SAFHA
06DD ; Control # Cf ARABIC END OF AYAH
070F ; Control # Cf SYRIAC ABBREVIATION MARK
...
0591..05BD ; Extend # Mn [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
05BF ; Extend # Mn HEBREW POINT RAFE
...
================================

Meaning each code or range is individually assigned a property. I imagine 2 ways to cope with such data in an application.

-1- using the list as is -------

Parse the list as is into a collection (eg associative array) that maps each code or range to its property value. Concretely, there can also be two separate collections for individual codes and ranges, but the principle remains. (Note: Even for individual codes, the collection cannot be a list, indeed, for there may be huge holes between code values).

Let's say property values can be a,b,c. The algorithm's main loop content may look like:

    <get p from c>
    if p == a then ...
    elseif p == b then ...
    ...
    else ...

(Actually, algorithms often require testing properties of a pair of successive codes in string, but this does not change the overall scheme.)

The issue is to get p when c is not indvidually listed; this may be the common case, since ranges usually cover a greater number of code values. We then need to iterate over a *global* list of ranges, checking whether any of them holds c. (The code's property value is always searched before the series of tests.)

-2- constructing categories -------

An alternative is to create categories (see below *) A,B,C holding the codes or ranges which property is a,b,c, respectively.
Then, the algorithm rather looks like (OO-like pseudo-code):

    if A.holds(c) then ...
    elseif B.holds(c) then ...
    ...
    else ...

Again, if ever c is not individually listed, we need to iterate over ranges. But each category only holds a subset of the global list of ranges, and the search will stop as soon as one matches. (Note the property is not searched _before_ the series of tests.)

Anyway, aside performance, I find this solution cleaner --for any reason.

A small issue is it first requires grouping the source data under titles, eg:
================================
Control:
...
007F..009F
00AD
0600..0603
06DD
070F
...
Extend:
...
0591..05BD
05BF
...
================================
Or even more directly something like:
================================
ControlCodes = (..., (007F..009F), 00AD, (0600..0603), 06DD, 070F, ...)
ExtendCodes = (..., (0591..05BD), 05BF, ...)
================================
But this needs to be done only once per data file and, indeed, the same data parsing/rewriting tool func can apply to all files.

What do you think?

Denis

(*) Category
A category conceptually is a list of ranges, plus a set of individual codes. Testing whether a code belongs to it is simply checking the set, then each range.
A type can be built for this kind of object, or simply a factory func (that builds it from a list of codes and ranges) and an associated "holds" func.
Example code available in Lua and Python.
________________________________

la vita e estrany

http://spir.wikidot.com/

Next message: Karl Pentzlin: "Re: Medievalist ligature character in the PUA"
Previous message: Brett Zamir: "Uniocde protocol or URNs?"
Next in thread: spir: "Re: algorithm implementation: coping with code properties"
Reply: spir: "Re: algorithm implementation: coping with code properties"
Reply: Mark Davis ☕: "Re: algorithm implementation: coping with code properties"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Thu Jan 28 2010 - 05:51:48 CST