Re: Processing Digit Variants from Richard Wordingham on 2013-03-19 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 19 Mar 2013 20:45:05 +0000

On Mon, 18 Mar 2013 17:28:30 -0700
"Steven R. Loomis" <srl_at_icu-project.org> wrote:
> On Monday, March 18, 2013, Richard Wordingham wrote:

> > The issue is rather with emphatically plain text <U+0031, U+FE0E,
> > U+0032, U+FE0E>.

> It's the same situation as something like an implementation of LDML
> number parsing. U+FE0E is not part of a number.

I agree that the same arguments are applicable to both parsing and
collating, though not necessarily with equal force.

Formally, <U+0031, U+FE0E, U+0032, U+FE0E> seems to be just as much a
number as <U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO>,
which the current LDML semantics do treat on an even footing with
"12". If the emoji digits had been encoded as new characters, ICU
would support them without batting an eyelid. Because the difference
does not merit full characterhood, they are encoded by a sequence
rather than a single character. Remember, all that U+FE0E does is to
request a particular glyph. In a sense, we have 20 new decimal digits,
<U+0030, U+FE0E> to <U+0039, U+FE0E> and <U+0030, U+FE0F> to <U+0039,
U+FE0F>.

So, why do you consider <U+0031, U+FE0E, U+0032, U+FE0E> not to be
a valid decimal number?
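
To make the question concrete, here is a minimal sketch in Python of a
parser that treats a digit followed by U+FE0E or U+FE0F as the same
digit. It is not ICU or LDML code; the name parse_decimal and the rule
that a selector is accepted only directly after a digit are assumptions
made for illustration.

```python
import unicodedata

# VS15 (text presentation) and VS16 (emoji presentation).
VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}


def parse_decimal(text):
    """Parse a run of decimal digits, allowing one selector after each digit.

    Hypothetical helper for illustration; not an ICU or LDML API.
    """
    value = 0
    digits_seen = 0
    after_digit = False  # True when the previous character was a digit
    for ch in text:
        if ch in VARIATION_SELECTORS:
            # A selector only makes sense directly after a base character.
            if not after_digit:
                raise ValueError(f"stray variation selector U+{ord(ch):04X}")
            after_digit = False
            continue
        # unicodedata.decimal covers every Nd character, fullwidth included.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}")
        value = value * 10 + digit
        digits_seen += 1
        after_digit = True
    if digits_seen == 0:
        raise ValueError("no digits")
    return value


print(parse_decimal("1\uFE0E2\uFE0E"))  # <U+0031, U+FE0E, U+0032, U+FE0E> -> 12
print(parse_decimal("\uFF11\uFF12"))    # fullwidth digits -> 12
```

Under these assumptions the variation sequences and the fullwidth
digits both parse to 12, which is the "even footing" at issue.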

> > 10<ZWJ>0<ZWJ>0 would be perfectly reasonable for text
> > likely to be rendered by a cursive Latin font.

> Identifying such an edge case does not prove that numeric tailoring is
> broken.

An 'edge case' is often just a case that shows that an algorithm that
often works has not been thought through thoroughly. Now, as CLDR
seems to value speed above perfect correctness, perhaps handling
variation sequences will be rejected on that basis. All I was trying
to find out on this list was whether <U+0031, U+FE0E, U+0032, U+FE0E>
should be regarded as a proper number.

Special characters intended for just one aspect of text processing
should not affect other aspects. Unfortunately, a parametric tailoring
to ignore irrelevant characters while complying with the UCA is not
quite as simple as just ignoring them. The issues arise with the
blocking of discontiguous contractions and the possibility that, for
example, one might wish to collate character variants differently. On
the other hand, ignoring variation selectors by default might be
excusable, for they should not occur where they might block canonical
reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4).
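
The point about blocking canonical reordering can be checked with a
short Python sketch. It relies only on the standard unicodedata module;
the helper strip_vs is a hypothetical name, and placing a variation
selector between two combining marks is a deliberately ill-advised
sequence chosen to show the effect, not something the standard
sanctions.

```python
import unicodedata

# BMP variation selectors VS1..VS16; the supplementary ones
# (U+E0100..U+E01EF) are left out of this sketch.
VARIATION_SELECTORS = {chr(cp) for cp in range(0xFE00, 0xFE10)}
VS15 = "\uFE0E"


def strip_vs(s):
    """Drop variation selectors (hypothetical helper for illustration)."""
    return "".join(ch for ch in s if ch not in VARIATION_SELECTORS)


# Variation selectors have canonical combining class 0, so they are starters.
print(unicodedata.combining(VS15))  # 0

plain = "a\u0301\u0323"                # a, acute (ccc 230), dot below (ccc 220)
blocked = "a\u0301" + VS15 + "\u0323"  # same marks with a selector in between

nfd_plain = unicodedata.normalize("NFD", plain)      # marks get reordered
nfd_blocked = unicodedata.normalize("NFD", blocked)  # selector blocks reordering

# Stripping after normalization does not give the same string as
# stripping before normalization.
print(strip_vs(nfd_blocked) == nfd_plain)                                # False
print(unicodedata.normalize("NFD", strip_vs(blocked)) == nfd_plain)      # True
```

So "just ignoring them" gives order-dependent results in such a case,
which is why it matters that selectors should not occur in that
position.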

Richard.
Received on Tue Mar 19 2013 - 15:49:34 CDT
