From: Jim Allan (jallan@smrtytrek.com)
Date: Thu Nov 27 2003 - 12:01:39 EST
Arcane Jill wrote:
> But there doesn't seem to be any way of specifying operator precedence
> in Unicode text (by which I mean the precedence of ZWJ compared with the
> precedence of any modifier). I can see a case for "invisible brackets"
> here to control such precedence.
Unicode is intended to encode normal text as written or inscribed by
human beings and as read by human beings.
If the plain text is ambiguous (whether in operator precedence or in some
other way) it is normally not for Unicode to resolve the semantics of
the text.
Invisible characters that cannot be seen by human beings resolve nothing
when the text is viewed in a normal mode by human beings. The normal
purpose for which text is created is to be read by human beings.
Source code is not an exception.
Source code is intended primarily to allow instructions to a computer to
be created in a way that is more easily comprehended by human beings
than binary instructions. Invisible characters that affect the semantics
of source would only create unresolvable ambiguity for the human beings
who read the source code in normal display or who print it out.
If you want a notation in text to be unambiguous, make it unambiguous
using characters that can be seen by the humans who interpret it.
Unicode does provide the invisible operators U+2061 FUNCTION
APPLICATION, U+2062 INVISIBLE TIMES and U+2063 INVISIBLE SEPARATOR for
use specifically in text that is also intended to be used for
mathematical calculations. I would be surprised if the use of these
characters did not turn out to be very dangerous in practice.
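
As a rough illustration of why such characters are hard for a reader to
notice, here is a minimal Python sketch using only the standard
unicodedata module. The sample strings "2x" and "2<INVISIBLE TIMES>x" are
made up for this example; on most displays they look the same, yet they
are different strings.

```python
import unicodedata

# The three invisible mathematical operators named above.
INVISIBLE_OPERATORS = ["\u2061", "\u2062", "\u2063"]

for ch in INVISIBLE_OPERATORS:
    # Each is a format character (general category Cf) with no visible glyph.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")

# Hypothetical sample strings: "2x" without and with INVISIBLE TIMES.
plain = "2x"
marked = "2\u2062x"

print(plain == marked)          # False: the code points differ
print(len(plain), len(marked))  # 2 3
```

A human reading either string on screen or on paper has no way to tell
which one is in front of them.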
> The review on Ethiopic and Tamil non-decimal digits is interesting, but
> I can't help but feel it was a culturally biased decision (read:
> mistake) to EVER have had a "radix ten" property without any similar
> property for any other radix, thereby forcing non-decimal digits to end
> up being classified as No (Other_Number) instead of Nd (Number_Decimal).
> It's a mistake because, even in /my/ culture, digit one followed by
> digit two is not always interpreted as the number twelve. Phone numbers
> and PINs are one exception. Version numbers such as "version 12.12.12"
> are another exception. Octal is another
That a character has the property of being a decimal digit makes no
assertion that the character may not be used in other ways: octal digit,
base-25 digit, used as a letter with phonetic value in some
transliteration systems or used as part of a character description in
Rongorongo transliteration (see
http://www.rongorongo.org/corpus/codes.html). Unicode lays *no* limits
on how users may use any character. That is not Unicode's business.
The ASCII decimal digits (and their fullwidth forms) also have the
property Hex_Digit; decimal digits of other scripts do not. But such a
digit may in fact be used in ways that are neither decimal nor
hexadecimal. The properties only reflect what users of scripts that use
them see as the normal interpretation of such characters. They are only
useful hints.
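
How these properties look to a process can be sketched with Python's
standard unicodedata module. The three sample characters (ASCII digit
one, ARABIC-INDIC DIGIT ONE and ETHIOPIC DIGIT ONE) are chosen here only
for illustration, and the values printed depend on the Unicode data
shipped with the Python version in use.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = ["1", "\u0661", "\u1369"]

for ch in samples:
    name = unicodedata.name(ch)
    # General category: "Nd" for decimal digits, "No" for other numbers.
    category = unicodedata.category(ch)
    # Decimal-digit value, or None when the character has none assigned.
    decimal = unicodedata.decimal(ch, None)
    # Digit value, or None when the character has none assigned.
    digit = unicodedata.digit(ch, None)
    print(f"U+{ord(ch):04X} {name}: "
          f"category={category} decimal={decimal} digit={digit}")
```

The properties tell a process what the usual reading of a character is.
Nothing in them stops a user from writing the same character in an octal
number, a version string or a transliteration.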
I see no cultural bias in noting that certain characters in certain
scripts are primarily used as part of radix ten notation when that is
indeed the primary meaning of these characters.
> One implication is that hexadecimal numbers cannot be expressed in
> Unicode without violating this property. For instance, is the string
> "U+0012" valid Unicode, given that "the sequence of the ONE character
> followed by the TWO character is [NOT] interpreted as having the value
> of twelve"?
The string "U+0012" is valid Unicode.
Similarly the strings "U+0A53", "U+X&@2" and "+U0012" are valid Unicode.
The interpretation of strings produced by users is not Unicode's
business. That the meaning of a particular string is nonsense or
ambiguous is not Unicode's business. The probable meaning of the string
"&#1234;" versus the string "&#x1234;" is not Unicode's business. The
Unicode standard provides no instructions about necessary interpretation
of strings.
That "12" might be hexadecimal or octal or something else other than
decimal twelve in some contexts is outside of any Unicode specification.
Unicode's task is only to provide a coding that allows representation of
the string "12".
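
A small Python sketch of the same point: the two characters stay the
same, and only the convention applied by the reader or the program
changes. The variable names are invented for this example.

```python
text = "12"

# The same two characters read under three different radix conventions.
as_decimal = int(text, 10)
as_octal = int(text, 8)
as_hex = int(text, 16)

print(as_decimal, as_octal, as_hex)  # 12 10 18

# Read as part of a phone number or PIN, it is just a sequence of digits.
as_separate_digits = [int(ch) for ch in text]
print(as_separate_digits)  # [1, 2]
```

In every case the encoded string is identical. The choice among these
readings is made by the application, not by the character encoding.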
As an additional piece of usefulness the Unicode specification provides
properties that make it easier for processes to find and interpret
numeric quantities in text. But these properties are really only hints,
indicating the most common uses for such characters, certainly not
limiting them to such uses.
If someone wants to represent the medieval spelling of _knight_ by
_kni3t_ (using "3" instead of the proper yogh letter U+021D) because the
yogh is likely not to appear properly in many applications, they may
certainly do so even though "3" is not a letter.
> Perhaps it would have made sense to simply have different properties all
> round, such as: "number positional" for digits in any radix; "number
> integer" for integer types such as circled 2 which can't be used
> positionally; "number fraction" for fractions, and "number other" for
> everything else. Or maybe some other similar scheme. Is it too late to
> change things now?
Judging from the past, additional properties will be added to the
Unicode specification. The reason for new properties being added should
be that they are *generally useful for character handling* rather than
that they are useful to specialized applications. Specialized
applications can and should define their own properties for their own
needs or use.
As to "number positional" for digits in any radix, it might be useful
to add a property "possible positional digit for any radix up to 36" for
the normal ASCII digits and the uppercase and lowercase characters of
the normal twenty-six letters in the ASCII character set.
But is this generally useful enough to warrant it being part of the
Unicode specification?
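
For what it is worth, an application that needs such a property can
define it in a few lines. The following Python sketch is one possible
definition, not anything found in the Unicode specification; the names
ALPHABET and positional_digit_value are hypothetical.

```python
import string

# Hypothetical application-defined property: the value of a character as a
# positional digit in any radix from 2 to 36, using ASCII digits and letters.
ALPHABET = string.digits + string.ascii_lowercase  # "0123456789abc...xyz"


def positional_digit_value(ch, radix):
    """Return the digit value of ch in the given radix, or None if invalid."""
    if not 2 <= radix <= 36:
        raise ValueError("radix must be between 2 and 36")
    if len(ch) != 1 or not ch.isascii():
        return None
    # Uppercase and lowercase letters are treated as the same digit.
    value = ALPHABET.find(ch.lower())
    return value if 0 <= value < radix else None


print(positional_digit_value("7", 8))   # 7
print(positional_digit_value("8", 8))   # None: not an octal digit
print(positional_digit_value("f", 16))  # 15
print(positional_digit_value("Z", 36))  # 35
```

Python's built-in int(text, base) follows the same convention for bases
up to 36, which suggests that applications already manage this without
help from the character encoding.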
And is that not also culturally biased? But then all scripts are to
some degree bound to a particular culture or to particular cultures.
Jim Allan