Re: Processing Digit Variants

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 20 Mar 2013 11:06:16 +0100

2013/3/20 David Starner <prosfilaes_at_gmail.com>:
> On Tue, Mar 19, 2013 at 10:13 PM, Steven R. Loomis <srl_at_icu-project.org> wrote:
>> Richard,
>> For parse, it's pretty simple: U+0031 has a Unicode digit value. U+FE0E
>> does not. ( Nor is it part of the defined numbering systems in LDML - see
>> http://unicode.org/reports/tr35/#Numbering System Data )
>> So, U+FE0E is the end of the sequence - not a number. End of parsing.
>>
>>>
>>> > > 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
>>> > > likely to be rendered by a cursive Latin font
>>
>>
>> It's not reasonable for numeric parsing, however.
>
> Which is one of those things that frustrate people to no end.
> Invisible characters that mean that numbers aren't actually numbers
> will mean that somewhere, someone will beat their head against the
> desk and probably eventually work around a problem they will never
> understand.

I also disagree with the comment "not reasonable for numeric
parsing". It is evident that this string, even if it uses ZWJ to show
the alternate, joining form of digits, is still made of digits, that
the whole still means the same as 1000 to the reader, and that it
has an inherent numeric value.

The right question to ask is which kind of numeric parsing you want:
- strict, for pure numeric data independently of the rendering form
(programming languages, tabular data interchange)
- lenient, for correct interpretation of numbers found in text.

So I do not see joiners or variation selectors as blocking
numeric parsing in plain text. For me this is the same kind of
difference as between "words" in natural languages and "identifiers" in
programs/data: the forms adopted by numbers vary widely,
even if there are restrictions for programming languages and tabular
data (or in applications using input forms whose data should be
validated, possibly filtered transparently, and converted before being
sent or stored in some database).

The fact that joiners, variation selectors, and other controls are not
listed as "digits" or as part of a numbering system does not
block lenient parsing either (which is still needed when
extracting data from more general plain text not initially intended
to be used as data).
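
To make the strict/lenient distinction concrete, here is a minimal Python
sketch. The names parse_strict, parse_lenient and IGNORABLE are hypothetical,
and the set of characters treated as ignorable (ZWNJ, ZWJ, and the variation
selectors U+FE00..U+FE0F) is only an assumption for illustration, not
something defined by Unicode or LDML.

```python
import unicodedata

# Assumption: characters a lenient parser chooses to skip inside a number.
# ZWNJ (U+200C), ZWJ (U+200D), and variation selectors U+FE00..U+FE0F.
IGNORABLE = {0x200C, 0x200D} | set(range(0xFE00, 0xFE10))


def parse_strict(text):
    """Return the integer value of text, or None if any character
    lacks a Unicode decimal digit value (or the text is empty)."""
    if not text:
        return None
    value = 0
    for ch in text:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            return None  # e.g. a joiner or variation selector
        value = value * 10 + digit
    return value


def parse_lenient(text):
    """Drop the ignorable characters, then parse what is left strictly."""
    stripped = "".join(ch for ch in text if ord(ch) not in IGNORABLE)
    return parse_strict(stripped)
```

With this sketch, parse_strict("10\u200d0\u200d0") rejects the string because
of the joiners, while parse_lenient on the same string reads it as 1000.
Which characters belong in the ignorable set is exactly the policy question
under discussion.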

The current numeric properties concentrate only on a narrow, essential
subset needed to make the numbering system work. That does not mean that
other characters will not be inserted within numbers for legitimate
reasons.
Received on Wed Mar 20 2013 - 05:10:40 CDT
