From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue Oct 08 2002 - 11:59:38 EDT
We use this structure in ICU (in UnicodeSet). For a high-level explanation,
see my site (www.macchiato.com), "Bits of Unicode".
As to the binary search, we have used in various contexts before a
"completely unrolled" binary loop, following John Bentley's formulation. It
is quite an interesting structure. Here is a Java version: it is about 40%
faster than the straightforward binary search (this is not a production
version).
/**
* @return greatest index such that data[index] <= searchValue
* If there is no such index (e.g. searchValue < data[0]), then -1 is
returned
*/
public int highestIndexLEQ(int searchValue) {
if (!isValid) validate();
int temp;
// set up initial range to search. Each subrange is a power of two in
length
int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;
// Completely unrolled binary search, following "Programming Pearls"
// Each case deliberately falls through to the next
// Logically, data[-1] < all_search_values && data[count] >
all_search_values
// although the values -1 and count are never actually touched.
// The bounds at each point are low & high,
// where low == high - delta*2
// so high - delta is the midpoint
// The invariant AFTER each line is that data[low] < searchValue <=
data[high]
switch (power) {
//case 31: if (searchValue < data[temp = high-0x40000000]) high = temp; //
no unsigned int in Java
case 30: if (searchValue < data[temp = high-0x20000000]) high = temp;
case 29: if (searchValue < data[temp = high-0x10000000]) high = temp;
case 28: if (searchValue < data[temp = high- 0x8000000]) high = temp;
case 27: if (searchValue < data[temp = high- 0x4000000]) high = temp;
case 26: if (searchValue < data[temp = high- 0x2000000]) high = temp;
case 25: if (searchValue < data[temp = high- 0x1000000]) high = temp;
case 24: if (searchValue < data[temp = high- 0x800000]) high = temp;
case 23: if (searchValue < data[temp = high- 0x400000]) high = temp;
case 22: if (searchValue < data[temp = high- 0x200000]) high = temp;
case 21: if (searchValue < data[temp = high- 0x100000]) high = temp;
case 20: if (searchValue < data[temp = high- 0x80000]) high = temp;
case 19: if (searchValue < data[temp = high- 0x40000]) high = temp;
case 18: if (searchValue < data[temp = high- 0x20000]) high = temp;
case 17: if (searchValue < data[temp = high- 0x10000]) high = temp;
case 16: if (searchValue < data[temp = high- 0x8000]) high = temp;
case 15: if (searchValue < data[temp = high- 0x4000]) high = temp;
case 14: if (searchValue < data[temp = high- 0x2000]) high = temp;
case 13: if (searchValue < data[temp = high- 0x1000]) high = temp;
case 12: if (searchValue < data[temp = high- 0x800]) high = temp;
case 11: if (searchValue < data[temp = high- 0x400]) high = temp;
case 10: if (searchValue < data[temp = high- 0x200]) high = temp;
case 9: if (searchValue < data[temp = high- 0x100]) high = temp;
case 8: if (searchValue < data[temp = high- 0x80]) high = temp;
case 7: if (searchValue < data[temp = high- 0x40]) high = temp;
case 6: if (searchValue < data[temp = high- 0x20]) high = temp;
case 5: if (searchValue < data[temp = high- 0x10]) high = temp;
case 4: if (searchValue < data[temp = high- 0x8]) high = temp;
case 3: if (searchValue < data[temp = high- 0x4]) high = temp;
case 2: if (searchValue < data[temp = high- 0x2]) high = temp;
case 1: if (searchValue < data[temp = high- 0x1]) high = temp;
}
if (high == topOfHigh && searchValue >= data[high]) return high;
return high-1;
}
// NOTE: on some machines the above may not be optimal, if the size of the
function
// forces code out of the cache. For that case, it would be better for
program in a loop, like the following
public int highestIndexLEQ2(int searchValue) {
if (!isValid) validate();
int temp;
int high = searchValue < data[topOfLow] ? topOfLow : topOfHigh;
for (int delta = deltaStart; delta != 0; delta >>= 1) {
if (searchValue < data[temp = high-delta]) high = temp;
}
if (high == topOfHigh && searchValue >= data[high]) return high;
return high-1;
}
// ================ Privates ================
// data
int data[];
int count;
// validate internal parameters
private void validate() {
if (count <= 1) throw new IllegalArgumentException("Array must have at
least 2 elements");
// find greatest power of 2 less than or equal to count
for (power = exp2.length-1; power > 0 && exp2[power] > count; power--) {}
// determine the starting points
topOfLow = exp2[power] - 1;
topOfHigh = count - 1;
deltaStart = exp2[power-1];
isValid = true;
}
private boolean isValid = false;
private int topOfLow;
private int topOfHigh;
private int power;
private int deltaStart;
private static final int exp2[] = {
0x1, 0x2, 0x4, 0x8,
0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800,
0x1000, 0x2000, 0x4000, 0x8000,
0x10000, 0x20000, 0x40000, 0x80000,
0x100000, 0x200000, 0x400000, 0x800000,
0x1000000, 0x2000000, 0x4000000, 0x8000000,
0x10000000, 0x20000000 // , 0x40000000 // no unsigned int in Java
};
Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄
----- Original Message -----
From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
To: "John Cowan" <jcowan@reutershealth.com>
Cc: <unicode@unicode.org>; <xom-interest@lists.ibiblio.org>
Sent: Tuesday, October 08, 2002 06:25
Subject: Re: ISO 8859-11 (Thai) cross-mapping table
> At 8:44 AM -0400 10/8/02, John Cowan wrote:
>
>
> >The underlying data structure here is called a "range table", and is
> >a list of ranges in codepoint order, expressed thus:
> >
> > start of first range
> > end of first range + 1
> > start of second range
> > end of second range + 1
> >
> >etc. etc. What you are doing is equivalent to a linear search over
> >this range followed by loop unrolling. However, you can do better,
> >especially in complex cases, with a *binary* search over the range
> >followed by loop unrolling. The trick here is that if the binary
> >search returns an even value, it succeeds; an odd value fails.
> >
>
> Interesting. Do you have any references on that I can explore
> further? A quick google search didn't turn up anything relevant. I'm
> curious to see how the algorithm actually works.
> --
>
> +-----------------------+------------------------+-------------------+
> | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
> +-----------------------+------------------------+-------------------+
> | XML in a Nutshell, 2nd Edition (O'Reilly, 2002) |
> | http://www.cafeconleche.org/books/xian2/ |
> | http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/ |
> +----------------------------------+---------------------------------+
> | Read Cafe au Lait for Java news: http://www.cafeaulait.org/ |
> | Read Cafe con Leche for XML news: http://www.cafeconleche.org/ |
> +----------------------------------+---------------------------------+
>
>
This archive was generated by hypermail 2.1.5 : Tue Oct 08 2002 - 12:45:37 EDT