From: karl williamson (public@khwilliamson.com)
Date: Sun Jul 25 2010 - 17:00:14 CDT
CE Whitehead wrote:
> 
> 
> Sorry for my last email; I have that signature in hotmail and always 
> delete it but do have it for a few private emails; but sorry as I meant 
> to delete it but was very very tired.
>  
> --C. E. Whitehead
> cewcathar@hotmail.com
> ------------------------------------------------------------------------
> From: cewcathar@hotmail.com
> To: public@khwilliamson.com; verdy_p@wanadoo.fr
> CC: kent.karlsson14@telia.com; unicode@unicode.org
> Subject: RE: Reasonable to propose stability policy on numeric type = decimal
> Date: Sun, 25 Jul 2010 16:24:01 -0400
> 
> 
>  > . . . 
>  > Date: Sun, 25 Jul 2010 10:43:11 -0600
>  > From: public@khwilliamson.com
>  > To: verdy_p@wanadoo.fr
>  > CC: kent.karlsson14@telia.com; unicode@unicode.org
>  > Subject: Re: Reasonable to propose stability policy on numeric type = decimal
>  >
>  > Philippe Verdy wrote:
>  > > "Kent Karlsson" <kent.karlsson14@telia.com> wrote:
>  > >> Den 2010-07-25 03.09, skrev "Michael Everson" <everson@evertype.com>:
>  > >>> On 25 Jul 2010, at 02:02, Bill Poser wrote:
>  > >>>> As I said, it isn't a huge issue, but scattering the digits makes the
>  > >>>> programming a bit more complex and error-prone and the programs a little less
>  > >>>> efficient.
>  > >>> But it would still *work*. So my hyperbole was not outrageous. And nobody has
>  > >>> actually scattered them. Though there are various types of "runs" in existing
>  > >>> encoded digits and numbers.
>  > >> While not formally of general category Nd (they are "No"), the superscript
>  > >> digits are a bit scattered:
>  > >>
>  > >> 00B2;SUPERSCRIPT TWO
>  > >> 00B3;SUPERSCRIPT THREE
>  > >> 00B9;SUPERSCRIPT ONE
>  > >> 2070;SUPERSCRIPT ZERO
>  > >> 2074;SUPERSCRIPT FOUR
>  > >> ...
>  > >> 2079;SUPERSCRIPT NINE
>  > >>
>  > >> And there are situations where one wants to interpret them as in a
>  > >> decimal-position system.
>  > >
>  > > Scattering does not only affect decimal digits, but also mathematical
>  > > operators needed to represent:
>  > >
>  > > - the numeric sign (« - » or « + »), with at least two variants for
>  > > the same system to represent the minus sign (either the ambiguous
>  > > hyphen-minus, the only one supported in many text-to-number
>  > > conversions, or the true mathematical minus sign U+2212 « − » that has
>  > > the same width as the plus sign), including some « alternating signs »
>  > > that exist in two opposite versions (« ± », « ∓ »);
>  > >
>  > > - the characters that represent the decimal separator (« . » or « , »)
>  > > which is almost always needed but locale-specific (this is not just a
>  > > property of the script);
>  > >
>  > > - the optional character used to note exponential notations and used
>  > > in text-to-number conversion (usually « e » or « E »);
>  > >
>  > > - the optional characters used in the conventional formatting for
>  > > grouping digits (NNBSP alias « fine », with possible automatic
>  > > fallback to THINSP in font renderers and in rich-text documents
>  > > controlling the breaking property with separate style, or fallback to
>  > > NBSP in plain-text documents, or fallback to standard SPACE in
>  > > preformatted plain-text documents, « , », or « ' », and possibly other
>  > > punctuations in their « wide » form, for ideographic scripts).
>  > >
>  > > Some of them exist in exponential/superscript or index/subscript
>  > > versions (notably digits and decimal separators), but not all of them
>  > > (not all separators for grouping digits, using NNBSP may not be
>  > > appropriate as its width is not adjusted and it does not have the
>  > > semantic of a superscript or subscript).
>  > >
>  > > For generality, it seems better to assume that digits and other
>  > > characters needed to note numbers in the positional decimal system may
>  > > be scattered (libraries may still avoid the small overhead of
>  > > performing table lookups, by just inspecting a property of the
>  > > character '0' or of the convention in use, that will either say that it
>  > > starts a contiguous range, or that the complete sequence is stored in
>  > > a lookup array for the 10 digits).
>  > >
>  > > The general category "Nd" may not always be accurate to find all
>  > > digits usable in decimal notations of integers, because the sequence
>  > > may have been incomplete when it was first encoded, and completed
>  > > later in scattered positions.
>  > >
>  > > In this case, the digits will often have a general property of "No"
>  > > (or even "Nl") that will remain stable. What should also be stable is
>  > > their numeric value property (but I'm not sure that this is the case
>  > > of "Nl" digits, notably for script systems using letters in a way
>  > > similar to Greek or Hebrew letters as digits, even if Greek and Hebrew
>  > > digits are not encoded separately from the letters that these number
>  > > notations are borrowing).
>  > >
>  > > Also I'm not sure that scripts that define "half-digits", or digits
>  > > with higher numeric values than 9, are permitting the use of their
>  > > digits with a numeric value between 0 and 9, in a positional decimal
>  > > system. The Roman numeric system is such a numeric system (borrowing
>  > > some scattered Latin letters and adding a few other specific digits)
>  > > where this will be completely wrong.
>  > >
>  > > Or another base than 10 could be assumed by their positional system,
>  > > even if their digits are encoded in a contiguous range of characters
>  > > for the subset of values 0 to 9. This is probably no longer the case
>  > > with scripts that have modern use, but in historical scripts or in
>  > > historical texts using a modern script, the implied base may be
>  > > different and would have used more or less distinct digits. So instead
>  > > of guessing automatically from the encoded text, it may be preferable
>  > > to annotate the text (easy to insert if the conversion of the
>  > > historical text uses some rich-text format) to specify how to
>  > > interpret the numeric value of the original number.
>  > >
>  > > And sometimes, the conversion to superscripts/subscripts compatibility
>  > > characters will not be possible even if some of them may be converted
>  > > safely to their numeric value, after detecting that they are in
>  > > superscript/subscript and that they don't behave the same as normal
>  > > digits (16²⁰ must NOT be interpreted as the numeric value 1620, but
>  > > must be parsed as two successive numbers 16 and 20, where the second
>  > > one has the semantic of an exponent, as if there was an exponentiation
>  > > operator between the two numbers).
>  > >
>  > > It is also very frequent that only a few superscript digits will be
>  > > supported in one font, and other digits may be borrowed from another
>  > > font using a completely distinct style with distinct metrics or may
>  > > not be displayed at all (missing glyph). The result is then horrible
>  > > if you can't predict which font will be used that support the 10
>  > > digits in a contiguous range of values (even if they are scattered in
>  > > the code space).
>  > >
> This does seem relevant to me.
>  > > When converting numbers to text with exponential notations, the use of
>  > > superscripts should only be used with care, knowing that this won't be
>  > > possible in all scripts, and that only integers without grouping
>  > > separators can be used.
>  > >
>  > > Some writing systems (unified as « scripts » in Unicode) will still require one to:
>  > >
>  > > - either use rich-text styling for superscripts used in the
>  > > conventional notation of exponents,
>  > >
>  > > - or use an explicit exponentiation operator, such as the ASCII symbol
>  > > U+005E "^" (which is not the same as the modifier letter circumflex
>  > > U+02C6 "ˆ", and which many fonts render with a glyph size and position
>  > > different from those of the combining diacritic and of the modifier
>  > > letter), or a mathematical operator or modifier letter (like the
>  > > upward arrow head U+02C4 "˄" that some fonts render as the
>  > > mathematical wedge operator on the baseline U+2227 "∧", or the less
>  > > ambiguous upward arrow U+2191 "↑").
>  > >
>  > > Philippe.
>  > >
>  > >
>  > >
>  > That all may be true, but it is really beside the point.
>  >
>  > I'm considering extending an existing computer programming language
>  > which currently only understands numbers composed solely of the ASCII
>  > digits to also understand those from other scripts. I'm not going to
>  > do it unless it is easy within the existing implementation (not some
>  > theoretical better implementation) and efficient and not a security threat.
>  >
>  > The symbols for operators like exponentiation are already set in stone,
>  > and their being scattered isn't relevant. Likewise, non-decimal-digit
>  > numbers, like subscripts, are also not relevant.
>  >
>  > I found a way to do the implementation that meets all my criteria, but
>  > is based on the existing pattern of Gc=Nd (or Nt=De) code point
>  > assignments. The assignments have so far been prudent, to use Asmus'
>  > term. I was merely trying to see if this prudence could be codified so
>  > that my implementation wouldn't get obsoleted on a whim in some future
>  > Unicode release.
>  >
>  > I hadn't thought of the case where a zero is later found or its usage
>  > develops in a script, and suddenly all the digits in that script change
>  > from Nt=Di to Nt=De, which because of an existing stability policy would
>  > necessarily require their general category changing to Nd.
>  >
>  > Prudence would dictate, then, that when assigning code points to the
>  > numbers in a script, a contiguous block of 12-13 be reserved for
>  > them, such that the first one in the block is set aside for ZERO, the
>  > next for ONE, etc.
>  >
>  > My original question, then, comes down to this: would it be reasonable to
>  > codify this prudence? People have said it will never happen. But no
>  > one has said why that is.
>  >
>  > Obviously, things can happen that will mess this up--the Phaistos disk
>  > could turn out to be a base-46 numbering system, as an extremely
>  > unlikely example. But by dictating prudence now, most such eventualities
>  > wouldn't happen.
>  >
>  > I have since looked at the Nt=Di characters. The ones that aren't in
>  > contiguous runs are the superscripts and ones that would never be
>  > considered to be decimal digits, such as a circled ZERO.
> Hi
> Are you proposing that superscripts be in contiguous runs or not? 
I was not proposing that.  Just the codification of what existing 
practice has been for Numeric_Type=Decimal_Digit.  Superscripts are of 
Numeric_Type=Digit; the two names are too similar, and cause confusion.
I know of no general purpose programming language that figures out math 
equations with superscripts.  If you want exponentiation, you have to 
specify an exponentiation operator.
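
To make the Decimal_Digit versus Digit distinction concrete, here is a rough sketch in Python (chosen only for illustration; it is not the language I am considering extending). It assumes that the standard unicodedata module's decimal(), digit() and numeric() lookups are an adequate stand-in for the Numeric_Type property; the helper names numeric_type and in_contiguous_decimal_run are made up for this example. The second helper checks the pattern I would like codified: the ten digits sit in one run, with ZERO first.

```python
import unicodedata


def numeric_type(ch):
    """Approximate Numeric_Type from the stdlib lookups (an assumption,
    not an official property API)."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"   # Nt=De, general category Nd
    if unicodedata.digit(ch, None) is not None:
        return "Digit"     # Nt=Di, e.g. superscripts, circled digits
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"   # Nt=Nu, e.g. Roman numerals, fractions
    return "None"


def in_contiguous_decimal_run(ch):
    """True if ch is a decimal digit whose ZERO..NINE occupy ten
    consecutive code points, with ZERO first."""
    value = unicodedata.decimal(ch, None)
    if value is None:
        return False
    zero = ord(ch) - value  # where ZERO must be if the run is contiguous
    return all(
        unicodedata.decimal(chr(zero + i), None) == i for i in range(10)
    )


for ch in ["7", "\u0663", "\u00b2", "\u1369", "\u216b"]:
    print(f"U+{ord(ch):04X}", numeric_type(ch), in_contiguous_decimal_run(ch))
```

With this, ARABIC-INDIC DIGIT THREE (U+0663) comes out as Decimal and in a contiguous run, while SUPERSCRIPT TWO (U+00B2) and ETHIOPIC DIGIT ONE (U+1369) are only Digit, so a parser keyed on Nt=De never sees them.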
> Above
> you disallowed subscripts (although
> I think mathematically subscripts have some meaning in equations as do 
> superscripts and it might be worth converting them albeit separately from 
> other numbers; if these were converted it would allow complete equations 
> to be converted from character strings -- but with only digits 1-9 I do 
> not see that much of an issue; I'd personally like to find a subscript 
> i; but so far I've just looked at:  
> http://unicode.org/charts/PDF/U2070.pdf where the subscripts 0-9 are all 
> contiguous but the superscript 1, 2, and 3 are not; searching through 
> http://unicode.org/Public/UNIDATA/UnicodeData.txt that was all I found; 
> I then started going through code charts one by one and so far have 
> gotten as far as Old South Arabian and have not found superscript i or 
> more superscript decimal numbers though maybe I've missed something -- 
> the Arabic sukun is not going to be part of a series of superscripts in 
> any case). 
>  
>  > The only run
>  > in the BMP which doesn't have a zero is Ethiopic. It seems extremely
>  > unlikely to me that a zero will be discovered or come into use with that
>  > script. I'm guessing that they have adopted European numbers in order
>  > to have commerce with the rest of the world.
>  >
>  > There are several runs in the SMP, but the code point where a zero would
>  > go isn't assigned.
>  >
>  > I don't know for sure, but it appears to me that we are running out of
>  > non-dead scripts to encode. I see that draft 6.0 has only 544 BMP code
>  > points not in any block and not much in the pipeline. I would think
>  > that most any script yet to be encoded would have borrowed numbering
>  > systems from their neighbors.
>  >
>  > And there is still plenty of space in the SMP, so this proposal to
>  > require prudence should not use up too many precious unassigned code points.
>  >
> If it does not take up too much space; I support this proposal although 
> there is no way that characters are contiguous in any case -- so for 
> doing sorts and such this is not going to help really normally.
>  
> Best,
>  
> C. E. Whitehead
> cewcathar@hotmail.com
>  