RE: numeric ordering

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Sep 20 2001 - 05:31:18 EDT


Viranga Ratnaike wrote:
> [...]
> <UTR10>
> [...]
> 2. numeric formatting: numbers composed of a string of digits or
> other numerics will not necessarily sort in numerical order.

That's right. Unicode is a standard for encoding text, so its sorting
guidelines deal only with textual sorting.

This does not mean that mixed numerical/textual sorting cannot be
implemented with Unicode: it just means that specifying such a thing is
outside the scope of UTR#10.

> [...]
> 1. Is there another document/algorithm/table that does provide
> guidelines for sorting numbers within strings? Something
> that deals with different scripts?

I don't know; you can probably find something on the Internet. It is not a
Unicode-specific problem.

I can try to come up with some common-sense ideas for such an algorithm.
I think the first thing to do is to split your string into textual and
numerical segments, and compare each segment on its own.

Say that your string is "1.2.3 Sorting Techniques". You should split it
into six typed segments (types are N=numeric and T=textual):

1) N "1"
2) T "."
3) N "2"
4) T "."
5) N "3"
6) T " Sorting Techniques"

Notice that, in order to do such a segmentation, you must define your own
syntax for numbers. I.e., it is up to you to define whether "1,234" is
number 1234 or number one + "," + number 234.
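
A rough sketch of such a segmentation in Python, assuming the simplest
possible number syntax: a number is a run of decimal digits and nothing
else. The names `NUMBER` and `segment` are made up for this example.

```python
import re

# Assumed number syntax: a run of one or more decimal digits.
# Under this rule "1,234" is the number 1, the text ",", and the number 234.
NUMBER = re.compile(r"(\d+)")

def segment(s):
    """Split s into (type, text) pairs; type is "N" (numeric) or "T" (textual)."""
    parts = NUMBER.split(s)
    # Splitting on one capturing group alternates text, number, text, ...
    # so odd indexes hold the numeric runs; empty text pieces are dropped.
    return [("N" if i % 2 else "T", p) for i, p in enumerate(parts) if p]

segments = segment("1.2.3 Sorting Techniques")
```

With this rule, `segments` holds the six typed segments listed above. A
different number syntax (thousands separators, decimal points, signs) would
only require a different pattern.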

Then you can sort the text using a compare algorithm like this:

a) take the 1st segments of both strings;

b) if the two segments have different types, the N segment comes before (or
after) the T segment;

c) if both segments are N, compare them numerically (the smallest number
comes first);

d) if both segments are T, compare them textually (e.g., apply UTR#10);

e) if the two segments compare equal, and both strings have a further
segment, take the next segment of each and go back to point (b);

f) if all segments compared equal, forget the segments and compare the whole
string textually (e.g., apply UTR#10).
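
Steps (a) to (f) could be sketched in Python roughly as follows. This is
only an illustration under some assumptions: a number is a run of decimal
digits, N segments are placed before T segments, and plain code point
comparison stands in for a real UTR#10 collation. The names `segment`,
`cmp` and `compare` are made up for this example.

```python
import re
from functools import cmp_to_key

def segment(s):
    # Assumed number syntax: a run of decimal digits is one N segment.
    parts = re.split(r"(\d+)", s)
    return [("N" if i % 2 else "T", p) for i, p in enumerate(parts) if p]

def cmp(a, b):
    # Three-way comparison returning -1, 0 or 1.
    # For strings this is code point order, a stand-in for UTR#10.
    return (a > b) - (a < b)

def compare(x, y, text_cmp=cmp):
    # (a), (e): walk both segment lists in parallel while both have segments.
    for (tx, sx), (ty, sy) in zip(segment(x), segment(y)):
        if tx != ty:
            # (b): different types; here the N segment is placed first.
            return -1 if tx == "N" else 1
        if tx == "N":
            r = cmp(int(sx), int(sy))   # (c): numeric comparison
        else:
            r = text_cmp(sx, sy)        # (d): textual comparison
        if r:
            return r
    # (f): all compared segments were equal; compare the whole strings textually.
    return text_cmp(x, y)

titles = ["10. Z", "9. A", "1.10 B", "1.2 C"]
ordered = sorted(titles, key=cmp_to_key(compare))
```

Here `ordered` puts "1.2 C" before "1.10 B" and "9. A" before "10. Z",
which a purely textual sort would not do. A real collator can be plugged in
through the `text_cmp` parameter.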

> 2. In practice, are digits from different scripts ever mixed?

I don't think this normally happens.

E.g., imagine mixing Arabic-Indic digits with European digits: that would be
a mess for the reader because the Arabic-Indic digits "five" and "six" look
almost identical to the European digits "zero" and "seven".

However, it is common to mix European digits with non-digital numbering
systems, such as the Roman numerals. It is common to see section numbers in
books labeled like this: "VII.9.6".

These old numbering systems, however, have the additional problem that they
are not easily distinguished from other text. In most cases, these numbers
are not spelled with special numeric characters (such as the digits), but
rather use the normal letters or ideographs used to spell normal text. This
problem occurs with the old numbering systems of several scripts: Latin,
Greek, Armenian, Georgian, Hebrew, Arabic, and Chinese.

> If so, how do you sort two different digits which have the
> same numeric value?

I suggested point (f) in the algorithm above: if all else fails, revert to a
normal textual compare.
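
A small Python illustration of that tie-break, using the European and
Arabic-Indic digit "three". Plain code point order stands in for UTR#10
here, which is an assumption of this sketch: a real collation might order
the two characters differently.

```python
import unicodedata

european = "3"            # U+0033 DIGIT THREE
arabic_indic = "\u0663"   # U+0663 ARABIC-INDIC DIGIT THREE

# Both characters have the same decimal value, so a numeric compare ties.
same_value = unicodedata.digit(european) == unicodedata.digit(arabic_indic)

# The textual fallback then decides; code point order is only a stand-in.
ordered = sorted([arabic_indic, european])
```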

_ Marco


