From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 19:55:54 CDT
They could have been defined in Table 4, but there was no need to exclude
them in this same table from the definition of the “Sep” property. The
proposed change in the SB* rules where “Sep” is referenced just make these
rules unnecessarily longer and more difficult to read.
May be this change is justified by the need to have a single property value
for the same character, so that the implementation of table-based tailored
breakers will be simpler: it will just be enough to define a single map from
characters to their Sentence_Break property.
So this means that CR and LF characters are assigned now a single CR and LF
Sentence_Break property value, instead of having subclasses in the “Sep”
class. From what you say, I see now the interest of the change: the table 4
provides now a complete partition of the UCS into separate classes, each
class being defined by the single property value mapped from each character.
But if you wanted to make implementations simpler, it should be noted that
some classes are unnecessarily split, despite they are currently assigned
distinct property values. This is visible when you look at the chart: there
are identical rows, and the associated columns also have identical contents.
The charts could be simpler by merging rows and columns that have identical
content, reducing the number of cases to consider within an implementation
(and possibly the number of bits needed to encode these property values in
the compiled map). This also means that the current set of rules could be
simplified.
Regarding the implementation, I tend to think that one way to compact the
compiled map is to reorder the rows and columns in the chart to produce a
triangular matrix: the result in the code implementing the algorithm is that
the code can be written very efficiently as “if… then… elseif… then…
elseif…then… elseif… then” without embedding multiple ifs, or if the “then…”
actions are not simple branches, using binary lookup by grouping the cases
according to their frequency (so that the minimum number of tests will be
computed at runtime for the most frequent cases, building the weighted
binary decision tree in a way similar to the Huffman compression of
characters into variable-width bitfields).
As the result of read from cells in the matrix is just “break here” or “no
break here”, this can finally turn into a simple boolean expression, fully
factorized producing the minimum number of branches in the compiled code,
and the fastest code (thanks to branch prediction), without even using a
bidimensional table lookup (which is not very efficient due to the needed
computing of cell addresses, to the extra indirection, and inefficient use
of the data cache for locality in processor).
I have already used this approach (transforming complex multidimensional
rules into a triangular matrix (or triangular hypermatrix if there are more
than 2 dimensions for the lookup) then transformed into a single Boolean
expression) and this really enhanced a lot the speed of the code at runtime
(less cache misses, better branch prediction that does not depend on runtime
profiling but is correct even at the beginning of evaluation, and finally a
better throughput) when working with very large volumes of data.
_____
De : mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] De la
part de Mark Davis
Envoyé : mercredi 26 septembre 2007 01:22
À : verdy_p@wanadoo.fr
Cc : Rick McGowan; unicode@unicode.org
Objet : Re: Public Review Issues update: UAX #31
Someone complained that while CR and LF are used in the rules, they were not
defined in the table of values for Sentence Break. So the UTC decided to
make this change for consistency. It doesn't make any change in the actual
outcome. (Personally, I don't think it was of much value, but don't feel
that strongly against it either.)
Mark
This archive was generated by hypermail 2.1.5 : Tue Sep 25 2007 - 19:57:06 CDT