> In XNS 1.0, XNS personal, business, and general names all
> follow the same normalization rules:
These normalization rules only work for ASCII, so why bother using Unicode?
After all, they can all keep on using ASCII (cmp.
> Names can be up to 64 characters of XML text (Unicode 2.0
> characters as
> defined by the W3C XML 1.0 specification).
I think this means that text is normalized by *composition*, right?
This means that letters with diacritics will be handled as completely
different from their base letter. This would be a nightmare for languages
where diacritics have an "optional status". A few examples of these funny
minority languages: English, Arabic, Italian, Hebrew (add also, e.g., French
and Spanish, if you consider the old deprecated usage of removing accents in
It means that, say, "www.coöperate.ut" and "www.cooperate.ut" would be
considered as different names, which is certainly not what most users want.
A better choice, IMHO, would be to normalize by *decomposition*. In this
way, the problem above would be addressed by rule 3 below.
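To make the difference concrete, here is a minimal sketch of the two choices. It assumes that "composition" and "decomposition" correspond to the NFC and NFD forms, and it uses Python's `unicodedata` module, which follows a much later Unicode version than 2.0. The helper name `significant` is mine, not part of XNS; it keeps only the categories listed in the uniqueness rule quoted further down.

```python
import unicodedata

# Categories the quoted rule treats as significant for uniqueness.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def significant(text, form):
    """Normalize to the given form, then keep only letters and decimal digits."""
    normalized = unicodedata.normalize(form, text)
    return "".join(c for c in normalized if unicodedata.category(c) in SIGNIFICANT)

with_diaeresis = "www.co\u00f6perate.ut"   # precomposed o-diaeresis
plain = "www.cooperate.ut"

# Composed: the precomposed letter is category "Ll" and survives the filter.
print(significant(with_diaeresis, "NFC") == significant(plain, "NFC"))  # False

# Decomposed: the letter becomes "o" + U+0308 (category "Mn"), which is dropped.
print(significant(with_diaeresis, "NFD") == significant(plain, "NFD"))  # True
```

With decomposition the diaeresis becomes a separate combining mark, so the "letters and digits only" rule removes it and both spellings compare equal.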
> For purposes of name representation, all characters are legal
> except the
> XNS global namespace prefix characters "=", "@", "+", the namespace
> delimiter character "/", and the XML markup tag delimiter
> characters "<"
> and ">".
Shouldn't ":" be out as well? It acts as the separator for the port number.
How do you distinguish a name like "www.unicode.org:80" (where ":80" is part
of the name) from "www.unicode.org" with a ":80" suffix?
And how about "?" and "~"?
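A small sketch of the ambiguity, under my own assumptions: `is_legal_name` encodes only the 64-character limit and the excluded characters quoted above, and `split_port` is a hypothetical reading of the string in the usual host:port style. Neither function comes from the XNS specification.

```python
# Characters the quoted rule excludes from names.
ILLEGAL = set("=@+/<>")

def is_legal_name(name):
    """True if the name fits the quoted length limit and avoids excluded characters."""
    return len(name) <= 64 and not any(c in ILLEGAL for c in name)

def split_port(text):
    """Hypothetical host:port reading: split on the last ':' if digits follow it."""
    host, sep, port = text.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return text, None

candidate = "www.unicode.org:80"
print(is_legal_name(candidate))   # True: ':' is not excluded
print(split_port(candidate))      # ('www.unicode.org', 80)
```

The same string is both a legal name in its entirety and a shorter name followed by a port, and nothing in the string itself says which reading was intended.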
> For purposes of name registration uniqueness, the only significant
> characters are numbers and letter as defined by the Java
> function returning TRUE. This function determines if a character is a
> letter or digit according to the Unicode 2.0 standard
> (category "Lu", "Ll",
> "Lt", "Lm", "Lo", or "Nd" in the Unicode specification data
> file). For the
> full specification, see Gosling, Joy, and Steele, The Java Language
I think that *much* more careful research should be carried out regarding
what characters are to be considered "top significance", and which ones
An example of characters that would be excluded from this rule:
- All vowels in Indian and South-East Asian languages! -- unless they
happen to occur at the beginning of words, in which case they are "Lo".
- Indic viramas! -- Removing viramas in Indic alphabets is like adding
random "a"'s to Western text.
- Tibetan subjoined consonants! -- which are consonants on the same grounds
as Tibetan "Lo"'s; they just happen not to be *preceded* by a vowel.
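The categories involved can be checked directly. This is only a sketch: it uses Python's `unicodedata` module (a later Unicode version than 2.0, though as far as I know these particular categories have not changed), and the sample characters and the helper `significant` are my own choices for illustration.

```python
import unicodedata

# Categories the quoted rule treats as significant for uniqueness.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def significant(text):
    """Keep only characters whose category the quoted rule calls significant."""
    return "".join(c for c in text if unicodedata.category(c) in SIGNIFICANT)

samples = [
    "\u0907",  # Devanagari independent vowel I (word-initial form)
    "\u093f",  # Devanagari dependent vowel sign I
    "\u094d",  # Devanagari virama
    "\u0f40",  # Tibetan letter KA
    "\u0f90",  # Tibetan subjoined letter KA
]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA, vowel sign II.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
stripped = significant(hindi)
print(len(hindi), len(stripped))  # 6 3
```

Only the three base consonants survive the filter, so any two words that share those consonants but differ in vowels or conjuncts would be registered as the same name.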
Moreover, why consider only "Nd" characters? All numerical ("N*")
characters represent numbers, and are significant to the same degree. I see
no reason why "www.number-1.ut" and "www.number-2.ut" should be considered
different names, while "www.number-I.ut" and "www.number-II.ut" should be
considered the *same* name (www.number.ut)!
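A sketch of this case follows. It assumes that the "I" and "II" above stand for the dedicated Roman numeral characters U+2160 and U+2161 (category "Nl"); plain ASCII letters "I" would be "Lu" and therefore significant. The helper `significant` is again my own illustration of the quoted rule, not XNS code.

```python
import unicodedata

# Categories the quoted rule treats as significant for uniqueness.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def significant(text):
    """Keep only characters whose category the quoted rule calls significant."""
    return "".join(c for c in text if unicodedata.category(c) in SIGNIFICANT)

arabic_1 = significant("www.number-1.ut")
arabic_2 = significant("www.number-2.ut")
roman_1 = significant("www.number-\u2160.ut")  # ROMAN NUMERAL ONE, category Nl
roman_2 = significant("www.number-\u2161.ut")  # ROMAN NUMERAL TWO, category Nl

print(arabic_1 == arabic_2)  # False: decimal digits are kept
print(roman_1 == roman_2)    # True: both numerals are dropped
print(roman_1)               # wwwnumberut
```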
> Letters in the ASCII range are normalized to lower case. (In
> XNS 1.0, case
> normalization is not applied to any other Unicode character range.)
This is the nicest one!!
Why should ASCII (a *part* of the Latin alphabet) be any different from
other cased alphabets (the *rest* of the Latin alphabet, Greek, Cyrillic,
I don't think I need any further explanation or example about this last
point. Could you please explain the reason behind this last rule, if any?
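For the record, a minimal sketch of what the quoted rule does in practice. The function `ascii_lower` is my own reading of "letters in the ASCII range are normalized to lower case", and Python's `str.lower()` stands in for a full Unicode case mapping; neither is taken from the XNS specification.

```python
def ascii_lower(text):
    """Lower-case A-Z only, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

# Pure ASCII: upper and lower case collapse to one name.
print(ascii_lower("ECOLE") == ascii_lower("ecole"))            # True

# Same word with its accent: E-acute is outside ASCII, so case still matters.
print(ascii_lower("\u00c9COLE") == ascii_lower("\u00e9cole"))  # False

# A full Unicode lower-casing treats both spellings alike.
print("\u00c9COLE".lower() == "\u00e9cole".lower())            # True
```

So whether two capitalizations of a name collide depends on whether its letters happen to fall inside ASCII.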