From: Hans Aberg (haberg@math.su.se)
Date: Thu Apr 09 2009 - 08:09:40 CDT
On 9 Apr 2009, at 08:47, William_J_G Overington wrote:
>> In fact I would like to see a
>> clear distinction between ASCII and Unicode so that
>> characters like "&" if typed as text (i.e.
>> Unicode) would NOT be interpreted as an ASCII interrupt
>> character in HTML/Java/PHP/.... etc.
>
> The problem arises because the Unicode characters are used to mean
> something other than their Unicode meanings in what is regarded as a
> mark-up format.
>
> There is a similar problem with XML where the Unicode < and >
> characters are used to mean things other than the defined Unicode
> meanings.
>
> However, as I understand the situation, at least in former times -
> maybe still now, the Unicode Technical Committee does not want to
> encode anything which could be regarded as mark-up and simply states
> that things considered as mark-up should be encoded using higher
> level protocols.
>
> A U-turn on this policy could be worth considering seriously. If an
> "escape ampersand open" and an "escape semicolon close" and an "xml
> bubble open" and an "xml bubble close" were encoded as regular
> Unicode characters, then various edge effects could be resolved for
> the future.
This problem is a problem of computer language design rather than of
the character set used. There is a tendency to set aside certain
character combinations (tokens) as context-independent keywords. For
example, C++ introduced "<" and ">" as matching pairs (like
parentheses) for templates. So one can write
template<class T, class Comp> class Sort {
  ...
};

void f (...) {
  ...
  Sort<int, Comparator<int> >::sort(vi);
  ...
}
Now, the problem is that one cannot write, as would be natural for
matched pairs,
Sort<int, Comparator<int>>::sort(vi);
because ">>" is a keyword: a reserved, context-free token. Further,
this is a legacy of the C syntax, where ">>" is the right-shift
operator.
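To make the legacy concrete, here is a minimal self-contained sketch
(using the hypothetical Sort and Comparator names from above) of the
two readings the tokenizer has to choose between:

#include <vector>

// ">>" was inherited from C as the right-shift operator:
int shift(int a, int b) { return a >> b; }

template<class T> struct Comparator { };

template<class T, class Comp> struct Sort {
    static void sort(std::vector<T>& v) { /* ... */ }
};

void f(std::vector<int>& vi) {
    // Pre-C++0x the space before the final ">" is mandatory;
    // without it the tokenizer reads ">>" as a right shift.
    Sort<int, Comparator<int> >::sort(vi);
}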
But from the point of view of computer language design, it is easy to
fix such problems; indeed, it was on the agenda for a C++ revision
(the C++0x drafts). But then the change has to be reconciled with
legacy code.
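For illustration, a sketch of how that fix behaves, assuming a
compiler that implements the C++0x "right angle brackets" change
(e.g. gcc with -std=c++0x): inside a template-argument list the
token ">>" is treated as two closing brackets, so both spellings
compile.

#include <vector>

std::vector<std::vector<int> > a; // accepted by C++98/03 and C++0x alike
std::vector<std::vector<int>> b;  // a syntax error in C++98/03, valid under C++0x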
So as such, introducing special Unicode characters will not solve the
problem of poor computer language syntax. And once a language syntax
has been fixed, it may be difficult to change, since changing it may
break legacy code. So one has to assess how much code would break,
how important that code is, and how likely it is to be rewritten.
Hans