Re: Rationale wanted for Unicode identifier rules

From: Timothy Partridge (timpart@perdix.demon.co.uk)
Date: Wed Mar 01 2000 - 14:36:11 EST


In message <200003011848.KAA29610@unicode.org> you recently said:

> (Still waiting for my bookstore to get 3.0 book.)
>
> Section 5.14 of 2.0 says:
>
> # The formal syntax provided here is intended to capture the general
> # intent that an identifier consists of a string of characters that starts
> # with a letter or an ideograph, and then follows with any number of letters,
> # ideographs, digits, or underscores.
>
> Can anyone give me a rationale for rejecting the following argument:
>
> > There are some [syntax] characters we know we need to prohibit [in
> > identifiers, such as +, -, etc.], as well as a couple of ranges of
> > control characters, but other than that I'm not sure why it's worth
> > bothering.
> >
> > [...] I don't see the need for prohibiting every possible
> > punctuation character or characters such as a smiley or a snow man,
> > even though I would probably not use them in an [identifier] myself. As
> > long as they don't conflict with the [rest of the] syntax, it makes no
> > difference [to the] parser.
>
> In other words, programming languages have historically tended to allow
> anything in an identifier that wasn't used for some syntactic purpose;
> leading digits were forbidden to make lexers simpler. What specific
> reason is there not to treat all hitherto-unknown Unicode characters
> as legitimate in identifiers, in the manner of the Plan9 C compiler
> (which extends C to treat everything from U+00A0 on up as valid)?
>
> I need this to help me write a draft standard, so I'm not asking out
> of randomness.

Identifiers are often noun phrases (for variables) or verb phrases /
sentences (for functions etc.). These are written in a human language.
Traditionally this has been English, so A to Z (and a to z if available) are
used, with 0 to 9 added for a bit of variety. There is sometimes a special
character which can be used to separate words (underscore in most recent
languages).

COBOL: A to Z, 0 to 9 and -. The - is also used as a minus sign.
BASIC: A to Z, 0 to 9. Nothing to separate words. $ and % have special
significance at the end of a variable name to indicate type.
FORTH: Anything you fancy apart from space or controls. (Numbers are flexible
in Forth too. Base 40 anyone? If a token isn't a known identifier, Forth tries
parsing it as a number in the current base.)
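
To make the Forth point concrete, here is a rough Python sketch of that
lookup order: dictionary first, number second. The function name
interpret_token and the dict standing in for the Forth dictionary are made up
for illustration, and Python's int() only handles bases 2 to 36, so base 40
is out of reach here.

```python
def interpret_token(token, dictionary, base=10):
    """Classify a token the way the text describes Forth doing it.

    `dictionary` is a plain dict standing in for the Forth word list
    (a hypothetical stand-in, not a real Forth data structure).
    """
    # 1. A known word always wins, even if it also looks like a number.
    if token in dictionary:
        return ("word", dictionary[token])
    # 2. Otherwise try to read it as a number in the current base.
    try:
        return ("number", int(token, base))
    except ValueError:
        # 3. Neither a word nor a number in this base.
        return ("error", token)


if __name__ == "__main__":
    print(interpret_token("FF", {}, base=16))
    print(interpret_token("FF", {"FF": "form-feed"}, base=16))
    print(interpret_token("FF", {}, base=10))
```

The consequence is that almost any string is a candidate identifier, and
whether a token is a name or a number depends on what has been defined so far
and on the current base.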

It's all down to your philosophy. COBOL identifiers aren't valid Unicode
ones according to the rules suggested, but are close. Forth identifiers
definitely don't conform, and someone just might program their radio
telescope to point at Mars with a subroutine named with the astrological
symbol for Mars. Would you want a chi-squared statistical subroutine
to have a two-character name (chi followed by a superscript two)?
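
As a rough illustration of how these names fare, here is a Python sketch of
the general intent quoted above (start with a letter or ideograph, continue
with letters, ideographs, digits or underscores). It is a simplification
based on Unicode general categories and not the formal syntax of section
5.14; the function names are made up for the example.

```python
import unicodedata


def is_identifier_start(ch):
    # Letters and ideographs: any general category beginning with "L"
    # (Lu, Ll, Lt, Lm, Lo). Ideographs are category Lo.
    return unicodedata.category(ch).startswith("L")


def is_identifier_continue(ch):
    # Letters/ideographs, decimal digits (Nd), or the underscore.
    return (
        is_identifier_start(ch)
        or unicodedata.category(ch) == "Nd"
        or ch == "_"
    )


def is_simple_identifier(name):
    # Assumption: this only approximates the quoted "general intent".
    if not name:
        return False
    return is_identifier_start(name[0]) and all(
        is_identifier_continue(ch) for ch in name[1:]
    )


if __name__ == "__main__":
    samples = [
        "TOTAL_AMOUNT",   # underscore-separated words
        "TOTAL-AMOUNT",   # COBOL style, hyphen is punctuation
        "\u2642",         # astrological symbol for Mars, category So
        "\u03c7\u00b2",   # chi plus superscript two, the latter is No
    ]
    for name in samples:
        print(ascii(name), is_simple_identifier(name))
```

Under this reading the COBOL-style name fails only because of the hyphen,
while the Mars symbol and the superscript two fail because they are a symbol
and a non-decimal number rather than a letter or a digit.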

   Tim

-- 
Tim Partridge. Any opinions expressed are mine only and not those of my employer


