Rationale wanted for Unicode identifier rules

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Mar 01 2000 - 13:53:17 EST


(Still waiting for my bookstore to get the 3.0 book.)

Section 5.14 of Unicode 2.0 says:

# The formal syntax provided here is intended to capture the general
# intent that an identifier consists of a string of characters that starts
# with a letter or an ideograph, and then follows with any number of letters,
# ideographs, digits, or underscores.
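
For concreteness, that rule amounts to something like the following C
sketch. This is my own illustration, not text from the standard: the
function names are mine, and it assumes a Unicode-aware locale in which
iswalpha() covers both letters and ideographs.

    #include <wctype.h>

    /* Sketch of the 5.14 rule: start with a letter or ideograph,
       continue with letters, ideographs, digits, or underscores. */
    int is_id_start(wint_t c)    { return iswalpha(c); }
    int is_id_continue(wint_t c) { return iswalpha(c) || iswdigit(c)
                                       || c == L'_'; }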

Can anyone give me a rationale for rejecting the following argument:

> There are some [syntax] characters we know we need to prohibit [in
> identifiers, such as +, -, etc.], as well as a couple of ranges of
> control characters, but other than that I'm not sure why it's worth
> bothering.
>
> [...] I don't see the need for prohibiting every possible
> punctuation character or characters such as a smiley or a snow man,
> even though I would probably not use them in an [identifier] myself. As
> long as they don't conflict with the [rest of the] syntax, it makes no
> difference [to the] parser.

In other words, programming languages have historically tended to allow
anything in an identifier that wasn't used for some syntactic purpose;
leading digits were forbidden to make lexers simpler. What specific
reason is there not to treat all hitherto-unknown Unicode characters
as legitimate in identifiers, in the manner of the Plan9 C compiler
(which extends C to treat everything from U+00A0 on up as valid)?
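
To make that alternative concrete, a Plan9-style identifier-character
test as described above would look roughly like this (a sketch of the
rule as I understand it, not the compiler's actual source; the function
name is mine):

    /* Plan9-style rule: ASCII letters, digits, and underscore, plus
       anything from U+00A0 upward.  The lexer would still reject a
       digit in the leading position, as noted above. */
    int plan9_id_char(unsigned long c) {
        return c >= 0xA0
            || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }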

I need this to help me write a draft standard, so I'm not asking idly.

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.          -- Coleridge (tr. Politzer)
