Re: Mixed up priorities

From: John Cowan (cowan@locke.ccil.org)
Date: Fri Oct 22 1999 - 11:04:29 EDT


G. Adam Stanislav scripsit:

> >In some non-Slavic language adaptations of the Cyrillic script, up to four
> >letters may be combined to represent a single sound, and these
> >'quadragraphs' are often listed as single letters of the alphabet and have
> >specific sorting and hyphenation rules. Are you suggesting that each of
> >these sequences _needs_ to be encoded as a precomposed character?
>
> I am not talking about transliteration. I am talking about native use.

This is not transliteration: many non-Slavic languages in the former
USSR have no other representation except the Cyrillic script.

> If some language natively considers a quadragraph a character in its own
> right, then yes, we need to encode it. Or we need to stop referring to
> Unicode as CHARACTER ENCODING. Either solution is acceptable.

Or let go of the notion that 1 letter = 1 character. "Character" is a
technical term anyway.

> Consistency. There is a DZ, for example. It is a character in several
> languages (Slovak included).

It exists to make a specific technical trick easy. That trick is now
basically irrelevant, and the characters are fairly useless (indeed, any
character with a canonical equivalent is implicitly deprecated).

> No, it is a standard for encoding _characters_. It states so quite explicitly.

What is a character?

> I have never asked to have the CH encoded right after the H and before the
> I. That would be sorting. I am not talking about sorting at all. I am
> talking about a separate character, which just happens to consist of two
> glyphs.

Or you are talking about a separate letter, which just happens to be
encoded as two characters. There is no reason why sorting algorithms
need to sort character-by-character, and for many languages that
algorithm does not work at all. Swedish w, for example, is sorted
as if it were v.
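
To make the point concrete, here is a minimal Python sketch of sorting
by letter rather than by character. Everything in it is an assumption
made for illustration: the alphabet is a simplified, unaccented
stand-in for Slovak with "ch" placed after "h", and the names LETTERS,
letter_key and swedish_key are hypothetical. Real collation (for
example the Unicode Collation Algorithm with locale tailoring) is
considerably more involved.

```python
import re

# Simplified, hypothetical alphabet: "ch" is one letter, sorted after "h".
LETTERS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(LETTERS)}

# Greedy tokenizer: try the digraph "ch" first, then any single character.
TOKEN = re.compile(r"ch|.", re.DOTALL)


def letter_key(word):
    """Sort key that ranks each letter, treating "ch" as a single letter."""
    tokens = TOKEN.findall(word.lower())
    # Tokens outside the alphabet sort after every known letter.
    return [RANK.get(tok, len(LETTERS)) for tok in tokens]


def swedish_key(word):
    """Sort key that treats w as if it were v (the older Swedish convention)."""
    return word.lower().replace("w", "v")


words = ["cena", "chata", "hora", "idea", "dom"]
by_character = sorted(words)
by_letter = sorted(words, key=letter_key)

swedish = sorted(["vik", "wall", "vas"], key=swedish_key)

print(by_character)  # ['cena', 'chata', 'dom', 'hora', 'idea']
print(by_letter)     # ['cena', 'dom', 'hora', 'chata', 'idea']
print(swedish)       # ['wall', 'vas', 'vik']
```

The stored text is the same in both orderings: "chata" is still the
characters c, h, a, t, a. Only the sort key knows that c followed by h
is one letter.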

> Yes, it is possible to encode the CH as the C followed by the H, and the N
> caron by the N followed by some connection code followed by a caron. And it
> is perfectly possible for software to handle it. But that would not be
> CHARACTER encoding. Unicode clearly states its goal to be the encoding of
> characters of all languages, existing and defunct. CH is a character in
> Slovak.

Semantics of the word "character". *shrug*
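
For what it is worth, no connection code is involved in the N caron
case: the letter can be written as n followed by U+030C COMBINING
CARON, and that sequence is canonically equivalent to the precomposed
U+0148. A small sketch using Python's standard unicodedata module (the
variable names are only illustrative):

```python
import unicodedata

precomposed = "\u0148"   # LATIN SMALL LETTER N WITH CARON, one code point
decomposed = "n\u030c"   # n followed by COMBINING CARON, two code points

# NFD splits the precomposed form; NFC recombines the sequence.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print(nfd == decomposed)        # True
print(nfc == precomposed)       # True
print(len(precomposed), len(decomposed))  # 1 2
```

Either spelling is the same letter to a reader; the difference is only
in how many encoded characters represent it.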

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


