More on identifiers: TR 10176 Annex A, MIDDLE DOT

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 02 2000 - 15:02:03 EST


Kent Karlsson asked:

> For the record, on the Unicode list, can you please
> state the relationship between the identifier recommendation
> below and the one of 10176 revised Annex A? "Level 3"
> combining marks, formatting control, and some compatibility
> (decomposable) characters are not listed in the latter, partly
> to avoid problems if normalisation is not used (initially).
> 10176 revised Annex A does not state the relationship, maybe
> the Unicode 3.0 book does (but I too don't have a copy yet).

The revised Annex A of ISO TR 10176 is intended to be a strict subset
of what the Unicode Standard recommends as allowable for
identifiers. Thus if a programming language follows the
recommendations of 10176, identifiers constructed using the
characters in Annex A should also be considered valid identifiers
for a typical Unicode implementation.

The Unicode recommendation includes more, as you indicate, including
combining marks and formatting control characters such as ZWJ and
ZWNJ.

What the revision of Annex A was aiming to avoid was a situation
where Annex A allowed some characters that the Unicode recommendation
did not *and* the Unicode recommendation allowed some characters
that Annex A did not. Note that not only was Annex A of TR 10176
modified to accomplish this -- the properties of a few Unicode
characters were also modified by the UTC to help bring these two
recommendations into line.
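
To make the subset relationship concrete, here is a minimal Python sketch.
The two sets are toy stand-ins invented for illustration; they are not the
actual Annex A list or the actual Unicode identifier repertoire, and
compare_repertoires is a hypothetical helper name.

```python
def compare_repertoires(first, second):
    """Classify how two sets of allowed identifier characters relate."""
    only_first = first - second    # allowed by first, rejected by second
    only_second = second - first   # allowed by second, rejected by first
    if only_first and only_second:
        # The situation the Annex A revision was meant to avoid:
        # each side allows something the other does not.
        return "incomparable"
    if only_first:
        return "strict superset"
    if only_second:
        return "strict subset"
    return "equal"


# Toy stand-ins (assumptions, not the real repertoires).
annex_a_toy = {"a", "b", "_"}
# Add a combining mark (U+0301) and a format control (U+200D, ZWJ).
unicode_toy = annex_a_toy | {"\u0301", "\u200D"}

print(compare_repertoires(annex_a_toy, unicode_toy))
```

With a strict subset, every identifier accepted under the narrower list is
also accepted under the wider one, which is the property described above.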

>
> PS
> What happened with MIDDLE DOT? KATAKANA MIDDLE DOT
> is Pc, but MIDDLE DOT is Po.

The Unicode 2.0 recommendation for identifiers included U+00B7 MIDDLE DOT
in its role as an extender. (It is found as part of words in Catalan,
for example, and to mark length on vowels in many American orthographies.)

However, during the discussions about 10176 Annex A, problems with
MIDDLE DOT were specifically brought up. MIDDLE DOT has wide occurrence
as punctuation (as a small bullet). It also may be confused with or
used for the multiplicative operator. Some vendors and programming
language specialists were strongly of the opinion that it should be
excluded from identifiers. The UTC agreed, and the identifier
recommendation was modified to include only General Category Pc (connector
punctuation), and not all extenders.

U+30FB KATAKANA MIDDLE DOT is a different animal entirely. It is connecting
punctuation, used to bind together two-part (or multiple-part) katakana
representations of foreign words or names. In a programming context, it
can be seen as functioning something like the use of "_" in C to form
a multi_word_identifier, for example. The Japanese explicitly requested
that it be allowed in identifiers.
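
As a rough illustration of the effect, here is a Python sketch of a
category-based identifier check. It is a simplification under stated
assumptions, not the text of either recommendation: the START and EXTEND
sets are my shorthand for "letters" and "marks, digits, Pc, and format
controls", and the categories of the two middle dots are pinned to the
values given in this message, since later versions of the Unicode
Character Database may assign different ones. The names is_identifier,
category, and PINNED are hypothetical.

```python
import unicodedata

# Categories as stated in this message; pinned rather than looked up,
# because the installed Unicode database may differ (assumption).
PINNED = {
    "\u00B7": "Po",  # MIDDLE DOT
    "\u30FB": "Pc",  # KATAKANA MIDDLE DOT
}

# Simplified category sets (assumption, not a normative definition).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def category(ch):
    """General category, preferring the values pinned above."""
    return PINNED.get(ch, unicodedata.category(ch))


def is_identifier(text):
    """True if text starts with a letter and continues with allowed characters."""
    if not text:
        return False
    if category(text[0]) not in START:
        return False
    return all(category(ch) in START | EXTEND for ch in text[1:])


# "_" is Pc, so it joins words much as KATAKANA MIDDLE DOT does.
print(is_identifier("multi_word_identifier"))
# Katakana word pair joined by U+30FB.
print(is_identifier("\u30C7\u30FC\u30BF\u30FB\u30D9\u30FC\u30B9"))
# Catalan word containing U+00B7, which is Po and therefore excluded.
print(is_identifier("col\u00B7lecci\u00F3"))
```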

--Ken


