From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Sep 10 2005 - 17:40:50 CDT
On Sat, 10 Sep 2005, Mark Davis wrote:
> Beyond that, there are problems in your formulation.
Certainly. It was just an initial attempt to list some issues, in
order to find a useful answer to the question "what languages does Unicode 
support". I'm assuming that the question is relatively frequent. It's 
important at least indirectly: there are many short descriptions of 
Unicode that make the unqualified, unrestricted claim that Unicode 
supports all languages, or something similar.
>> All living languages, and many dead languages, can be written in their
>> normal writing system(s) using Unicode characters.
>
> 1. It is not true of "all living languages"; there are some minority 
> languages that need additional characters.
My tentative answer was much simplified, and it plays with the word
"normal". What worries me is how this could be expressed so that the
common man, or maybe even a pointy-haired boss, would get roughly the
right idea. "All living languages" is too much, and "Almost all living
languages" is rather vague. "Minority language" here probably means a
language much smaller than the average person would think.
>> However, some
>> of their characters cannot be represented as single Unicode characters
>> but as combinations.
>
> 3. The 'however' is misleading. It is not a deficiency that some of what 
> users may perceive of as separate characters are encoded by sequences.
I know it can be misleading, but if we just say that Unicode provides a 
unique number for every character, it's misleading, too. And even 
incorrect. ( http://www.unicode.org/standard/WhatIsUnicode.html )
After one gets acquainted with Unicode, it probably comes as a surprise to 
most people that some characters (using a person's intuitive 
understanding of "character") cannot, after all, be represented
using a single Unicode code point.
As a thought experiment, let us suppose that the letter "w" had not been
included in ASCII or other character codes but was written as "vv", and
that Unicode had not changed this. If people then asked for "w", the
answer would be that it is just a typographic variant of "vv", a ligature
(as it historically is, in fact). Maybe after quite some debate we would
then be told to use the combination of three Unicode characters,
"v", word joiner, and "v". Could we then _really_ say that Unicode
supports the English alphabet, for example, and that a separate code
point is not needed for "w"?
Unicode contains _most_ accented letters used in human languages
as precomposed characters, but not all. There's a clear distinction here.
I'm not questioning the policy decision that effectively freezes the set 
of such precomposed characters. What I'm saying is that we should admit
that it implies differences in how languages are supported.
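
A minimal Python sketch of this distinction, assuming only the standard
unicodedata module. The helper name nfc_length and the two sample letters
are chosen purely for illustration: one base-plus-accent pair has a
precomposed character, the other does not, so normalization to NFC leaves
a different number of code points.

```python
import unicodedata

def nfc_length(base: str, mark: str) -> int:
    """Return how many code points remain after NFC composition."""
    return len(unicodedata.normalize("NFC", base + mark))

# "e" + COMBINING ACUTE ACCENT (U+0301) composes to the single
# precomposed character U+00E9.
print(nfc_length("e", "\u0301"))

# "m" + COMBINING TILDE (U+0303) has no precomposed counterpart,
# so the sequence stays as two code points.
print(nfc_length("m", "\u0303"))
```

Both sequences are equally valid Unicode text; the difference only shows
up in software that silently assumes one code point per letter.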
>> Some orthographic and typographic constructs, which
>> could in principle be expressed in plain text, cannot be expressed
>> in Unicode.
>
> 4. Also not a deficiency. If Unicode attempted to encode all typographic 
> constructs, it would be a horrible mess. It provides a foundation for other 
> mechanisms (CSS, etc) to build upon; they can provide typographical 
> constructs. And by 'orthographic constructs', you'd have to provide examples 
> of what you mean.
Here, too, my text was supposed to address people's intuitive expectations 
rather than take a position on what should or should not be encoded in 
Unicode. For example, since the acute accent used in French, the accent 
used in Polish, and the tonos in Greek normally look different from each 
other, it is natural to expect that they are treated as different marks.
Unicode may have made the right choice in unifying all of them as the
acute accent, but it still means that a distinction that could have been
made in plain text (and, in some people's opinion, should have been)
cannot be made in Unicode. If someone says that Unicode does not support
Polish
because Unicode does not recognize the Polish accent mark as distinct from 
the French mark, he might well be wrong by some criteria; but it's still 
an opinion that people have and that is ultimately not a completely 
objective matter.
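
The unification can be observed directly in the canonical decompositions.
The following Python sketch, which assumes only the standard unicodedata
module (the helper name combining_marks and the sample letters are
illustrative), decomposes one accented letter from each of the three
languages and lists the combining marks found:

```python
import unicodedata

def combining_marks(ch: str) -> list:
    """Return the names of the combining marks in the NFD form of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

samples = {
    "French e with acute": "\u00e9",
    "Polish n with acute": "\u0144",
    "Greek alpha with tonos": "\u03ac",
}

# Each letter decomposes to its base letter plus the same mark, U+0301.
for label, letter in samples.items():
    print(label, combining_marks(letter))
```

All three report COMBINING ACUTE ACCENT, so any difference in the shape
of the mark has to come from the font or from language information
supplied outside the character data.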
>> Some of the properties of characters as defined by the
>> Unicode Standard do not correspond to their behavior in different
>> languages.
>
> 5. Again, you'd have to provide examples to clarify what you mean.
For example, line breaking behavior. Unicode line breaking rules allow
a line break after ":" in a string like "YK:ssa". Yet that string is the
way to write an inflected form of an abbreviation in Finnish, and such an
expression should not be divided; if it really needs to be divided, the
break must not appear after the colon but at a syllable boundary
("YK:s-sa"). I take this example since I have had to fight with such
problems when fixing the typesetting of books, when the typesetting
program supported this part of the Unicode line breaking rules but
offered no way to override them except with awkward tricks.
In this example, the point is that the default line breaking rules break 
the processing of texts, by introducing allowed break points that might be 
OK in many contexts, but not in others. I know that such rules are not 
normative and they are meant to be defaults that can be overridden as 
needed, but that's largely just theory. The point is that by introducing 
such properties, Unicode has created, for work with text at the plain text 
level, problems that didn't exist in earlier character codes. (Of course, 
the default line breaking rules _solve_ many problems, too.)
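
One conceivable plain-text workaround is to insert WORD JOINER (U+2060),
which prohibits a line break on either side of it, right after the colon.
The Python sketch below is only an illustration under assumptions: the
function name protect_colon_forms is hypothetical, the pattern simply
treats any colon between two letters as such an inflected form, and
whether a given typesetting program honors U+2060 is not guaranteed.

```python
import re

WORD_JOINER = "\u2060"

# A colon with a letter (not a digit or underscore) on both sides.
_COLON_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_]):(?=[^\W\d_])")

def protect_colon_forms(text: str) -> str:
    """Insert WORD JOINER after colons that sit between two letters."""
    return _COLON_BETWEEN_LETTERS.sub(":" + WORD_JOINER, text)

# The colon in "YK:ssa" gets a WORD JOINER after it; "12:30" and a colon
# followed by a space are left alone.
print(ascii(protect_colon_forms("YK:ssa, klo 12:30, huom: ei")))
```

This only suppresses the unwanted break; it does not add the preferred
break at the syllable boundary, which would still need hyphenation
support in the typesetting program.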
>> Moreover, Unicode is meant to describe plain text only, so it generally
>> lacks any support that might be needed for display and processing of
>> text by language-specific rules.
>
> 6. Again, by design, to avoid above-mentioned horrible mess. If you want 
> language tagging so as to customize appearance for different languages, use 
> higher level markup or structure, such as xml:lang or equivalent.
My statement was meant to explain, and perhaps apologize a bit, rather
than to present the issue as a drawback of Unicode. Here, too, we have a
problem of expectations. Unicode has, after all, quite a few features
that actually operate on a level higher than plain text, such as
typographic variants encoded as characters, even variation selectors, and
language tags that are not recommended but still exist in Unicode.
The question then arises why this or that feature is absent.
And in reality, this means that the level of support for languages varies
by language. If someone expresses this by saying that Unicode does not
support this or that language, because Unicode does not recognize some
difference as a character difference but just as a glyph difference that
does not affect the coding of characters, I would say that the opinion
is not quite right - but it is understandable, and it shows how relative
the concept of "support for a language" is.
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/