From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Tue Nov 11 2003 - 07:02:23 EST
(long argument deleted)
If you are suggesting that the natural sort algorithm won't work without
separate codepoints for hex digits then you are of course correct, but
that is an argument in favor of hex-digit-characters, not against them.
Ordering natural numbers (whole numbers >= 0) expressed as numerals,
usually sequences of digits, can be made to work for any base as long as
one can write the digits in a convenient way. (That does NOT mean digit
clones of A-Z.)
If you like you can lobby OS/UI makers (or sort order implementation
providers in general) to supply a "hacker's option" where A-F and a-f
are regarded as digits (possibly with some heuristic to determine which
As are hex digits and which are not). I would have that "off" by default
though; most users would find hexadecimal very uncomfortable, and
indeed surprising. They would be even more surprised to find some
As not sorting like other As (if there were such clones) despite looking
just the same. Note that all the existing clones of A-Z and a-z are
ordered just
like the ordinary letters in the default order of the UCA (and the CTT
of 14651). Likewise the Roman numeral compatibility characters are
ordered as the letters that constitute them; not in any numeric order.
The natural sort algorithm works identically in all radices. There is
nothing special about radix ten. Furthermore, the same sort order is
guaranteed in all radices. An implementation of a natural sort algorithm
does NOT need to "know" the radix. It does not need to guess. It does
not need to assume. It does not need to infer. It does not even need to
care. All it needs are the functions IsDigit(codepoint) and
GetDigitValue(codepoint). The return value of the latter is only
required to be defined if the return value of the former is true. That's
ALL it needs.
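
As a rough illustration of the claim above, here is a minimal Python
sketch of a radix-agnostic comparison. The name natural_compare, the
helper functions and the test strings are all made up for this example;
is_digit and get_digit_value stand in for the IsDigit and GetDigitValue
functions mentioned in the text. The comparison strips leading zeros
from each digit run, treats the longer run as the larger number, and
otherwise compares digit values position by position, so it never needs
to be told the base.

```python
import unicodedata
from functools import cmp_to_key


def _strip_leading_zeros(values):
    """Drop leading zero digit values; an all-zero run becomes empty."""
    k = 0
    while k < len(values) and values[k] == 0:
        k += 1
    return values[k:]


def _digit_run(s, start, is_digit, get_digit_value):
    """Return (digit values of the run beginning at start, index after it)."""
    end = start
    while end < len(s) and is_digit(s[end]):
        end += 1
    return [get_digit_value(c) for c in s[start:end]], end


def natural_compare(a, b, is_digit=str.isdecimal,
                    get_digit_value=unicodedata.decimal):
    """Compare two strings, ordering embedded numerals by value.

    Hypothetical sketch: non-digit characters are compared by code
    point, which is NOT what UCA or ISO/IEC 14651 would do.
    """
    i = j = 0
    while i < len(a) and j < len(b):
        if is_digit(a[i]) and is_digit(b[j]):
            da, i = _digit_run(a, i, is_digit, get_digit_value)
            db, j = _digit_run(b, j, is_digit, get_digit_value)
            da, db = _strip_leading_zeros(da), _strip_leading_zeros(db)
            if len(da) != len(db):      # more digits means a larger number
                return -1 if len(da) < len(db) else 1
            if da != db:                # same length: first differing digit
                return -1 if da < db else 1
        else:
            if a[i] != b[j]:
                return -1 if a[i] < b[j] else 1
            i += 1
            j += 1
    rest_a, rest_b = len(a) - i, len(b) - j
    return (rest_a > rest_b) - (rest_a < rest_b)


# Assumed "hex option": A-F and a-f are regarded as digits too.
HEX_VALUES = {c: v for v, c in enumerate("0123456789abcdef")}


def is_hex_digit(ch):
    return ch.lower() in HEX_VALUES


def hex_digit_value(ch):
    return HEX_VALUES[ch.lower()]


decimal_sorted = sorted(["file10", "file9", "file2"],
                        key=cmp_to_key(natural_compare))
hex_sorted = sorted(
    ["x10", "xA", "x9"],
    key=cmp_to_key(lambda a, b: natural_compare(
        a, b, is_hex_digit, hex_digit_value)))
```

With the hex predicates switched on, the "f" and "e" in a name such as
"file10" would also count as digits, which is the ambiguity the
heuristic mentioned earlier would have to resolve.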
That's one way of doing it. Another is to prehandle the string, as
explained in annex C.3 of ISO/IEC 14651, and use suitable weighting
for the characters used in the numerals, and then just apply the
ordinary collation key calculation (by demand or complete) and compare
the strings as "usual" (for 14651 or UCA comparisons). Incidentally,
that annex also considers negative numerals, and numerals with a
fraction part. It only considers decimal base in the examples, but there
is no problem in generalising to other integer bases >= 2, just as long
as you have enough characters to express the digits (which could in
principle be expressed with multiple characters each, even a varying
number).
(If you use a base greater than decimal, then you're right that decimal
numerals order in the expected way, having done the prehandling,
as long as you stick to decimal digits in the actual strings.)
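
To make the prehandling idea concrete, here is a small Python sketch.
It is not the scheme given in annex C.3, only one simple way of getting
the same effect for non-negative decimal numerals: each numeral is
rewritten with a fixed-width digit count in front, so that an ordinary
left-to-right comparison of the rewritten strings orders the numerals
by value. The name prehandle and the width parameter are invented for
this example, and plain string comparison stands in for a real 14651 or
UCA collation key.

```python
import re

# Runs of ASCII decimal digits only; negative numerals and fraction
# parts, which the annex also covers, are not handled in this sketch.
_NUMERAL = re.compile(r"[0-9]+")


def prehandle(s, width=4):
    """Prefix each numeral with its zero-padded digit count.

    Assumes no numeral has more than 10**width - 1 significant digits.
    """
    def rewrite(match):
        digits = match.group(0).lstrip("0") or "0"
        return f"{len(digits):0{width}d}{digits}"

    return _NUMERAL.sub(rewrite, s)


names = ["file10", "file9", "file2"]
ordered = sorted(names, key=prehandle)
```

Here prehandle("file9") gives "file00019" and prehandle("file10") gives
"file000210", so the shorter numeral sorts first without the comparison
itself knowing anything about numbers.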
/kent k