Re: Processing Digit Variants from Philippe Verdy on 2013-03-20 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 20 Mar 2013 21:05:35 +0100

2013/3/20 Markus Scherer <markus.icu_at_gmail.com>:
> Numeric collation is actually much more limited than number parsing, to
> strictly strings of digits, not including sign (thus only non-negative),
> decimal, exponent, etc. More processing in the bowels of the collation code
> would be very complicated, and ambiguous: "file-5.txt" is probably file
> number 5 rather than file minus five.

File names are identifiers; they are not real human language, and they
do not obey the grammatical rules of any language. They may be named
according to some convention in a given language, but they are
frequently abbreviated and use a reduced set of characters.

So collation parsing of numbers for sorting filenames is in fact
collation parsing of technical identifiers. It would be different if
performing collation on true text such as a book, or even on an OCR'd
facsimile of accounting reports when preparing to rebuild a
spreadsheet from them.

Imagine you import a list of filenames into a spreadsheet: the column
type would be set to "text", not "number". In such cases, sorting as
"text" should use the sort options appropriate for sorting
identifiers. Values imported into a "number" column should be
converted as numbers, accepting signs and exponent notation, and
correctly filtering out format controls to compute the effective value.

So for converting formatted numbers to effective numeric values,
lenient parsing should be used (the numbers will then sort not by
collation, but by their effective numeric value after this
operation).
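
A minimal sketch of such lenient parsing, in Python. The function name
lenient_parse_number and the small table of sign look-alikes are
assumptions made for illustration, not part of any library; a real
implementation (for example in ICU) handles far more cases, such as
grouping separators and locale-specific decimal marks.

```python
import re
import unicodedata

# Sign look-alikes mapped to ASCII (illustrative, not an exhaustive set).
_SIGNS = {"\u2212": "-", "\uFF0D": "-", "\uFF0B": "+"}

# Plain decimal syntax with optional sign, fraction and exponent.
_NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?", re.ASCII)


def lenient_parse_number(text):
    """Return the numeric value of text as a float, or None if it is not a number."""
    cleaned = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            continue  # drop format controls such as U+200E LEFT-TO-RIGHT MARK
        digit = unicodedata.decimal(ch, None)
        if digit is not None:
            cleaned.append(str(digit))  # map any decimal digit variant to ASCII
        else:
            cleaned.append(_SIGNS.get(ch, ch))
    candidate = "".join(cleaned).strip()
    if _NUMBER.fullmatch(candidate) is None:
        return None
    return float(candidate)


print(lenient_parse_number("\u200e\u221212"))  # LRM, MINUS SIGN, "12"
print(lenient_parse_number("file-5.txt"))      # not a number
```

In a "number" column the ASCII hyphen-minus is accepted as a sign here,
because the whole cell must parse as one number; a value such as
"file-5.txt" is simply rejected.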

If lenient parsing of numbers fails, the column in the spreadsheet
will be treated as "text" and will sort with collation, but with a
reduced supported format for numbers (so effectively the ambiguous
ASCII hyphen-minus will be treated as hyphen punctuation, not as a
minus sign).
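
The fallback described above can be sketched as follows, in Python. The
names sort_column, text_key and simple_parse are hypothetical;
simple_parse is only a stand-in for a real lenient parser, and
casefolded string comparison stands in for a real collation algorithm.

```python
import re

_DIGIT_RUN = re.compile(r"(\d+)")


def text_key(value):
    """Identifier key: only runs of digits are numeric; '-' stays punctuation."""
    parts = _DIGIT_RUN.split(value)
    # re.split with a capture group alternates text, digits, text, ...
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


def simple_parse(value):
    """Hypothetical stand-in for a lenient number parser: float or None."""
    try:
        return float(value.replace("\u2212", "-"))
    except ValueError:
        return None


def sort_column(values, parse=simple_parse):
    """Sort as numbers if every cell parses, otherwise as identifier text."""
    if all(parse(v) is not None for v in values):
        return "number", sorted(values, key=parse)
    return "text", sorted(values, key=text_key)


print(sort_column(["10", "-5", "2.5"]))
print(sort_column(["file-10.txt", "file-5.txt"]))
```

With this key, "file-5.txt" sorts before "file-10.txt" because the
hyphen belongs to the text part "file-" and only "5" and "10" are
compared as numbers.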

If filenames have to be sorted according to the represented numeric
value, the ambiguous ASCII hyphen-minus should not be used, and the
mathematical MINUS SIGN character (U+2212) should be used in their
names (and it should remain interpreted as a sign in the more
restrictive collation parsing of numbers in identifiers).
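
As an illustration of that last point, here is a hedged Python sketch of
an identifier sort key in which U+2212 MINUS SIGN directly before a
digit run is read as a sign, while the ASCII hyphen-minus remains
punctuation. The name signed_identifier_key is hypothetical, and this is
not how any existing collation implementation is known to behave.

```python
import re

MINUS_SIGN = "\u2212"

# A digit run, optionally preceded by U+2212 (never by ASCII '-').
_SIGNED_RUN = re.compile("(" + MINUS_SIGN + r"?\d+)")


def signed_identifier_key(name):
    """Sort key where U+2212 plus digits is negative and '-' is plain text."""
    parts = _SIGNED_RUN.split(name)
    key = []
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold the captured (signed) digit runs
            key.append(int(part.replace(MINUS_SIGN, "-")))
        else:
            key.append(part.casefold())
    return key


names = ["t\u22121.log", "t3.log", "t-5.log", "t\u22125.log"]
print(sorted(names, key=signed_identifier_key))
```

Here the two names written with U+2212 sort as minus five and minus one,
ahead of "t3.log", while "t-5.log" is still treated as "t-" followed by
the number 5.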
Received on Wed Mar 20 2013 - 15:08:06 CDT

This archive was generated by hypermail 2.2.0 : Wed Mar 20 2013 - 15:08:06 CDT