Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 04 2006 - 16:19:30 CDT

Next message: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Previous message: Cristian Secară: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"
In reply to: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Reply: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Sunday, June 04, 2006 at 7:05 PM
> On 4 Jun 2006, at 16:59, Richard Wordingham wrote:
>> Theodore H. Smith wrote on Sunday, June 04, 2006 at 12:38 PM

> I don't understand NFD yet either :) All I know is that combining
> characters may cause problems for search functions, and that some
> re-ordering is necessary to fix this. But what kind of re-ordering
> exactly I do not know.

There are two reasons the need for something like NFD arises.

The first is that Unicode does not specify the order of diacritic marks. What is
felt to be the natural order in one language may not be the same in another
language. Moreover, there is no guarantee that everyone will type multiple
marks in the same order when they may be typed separately. Thus an 'a' with
a circumflex above and a dot below may be entered in the order <U+0061,
U+0302, U+0323> in Vietnamese, where the dot below is a tonemark. However,
there is an interpretation of the ISO 11940:1998 transliteration scheme for
Thai where the dot below would indicate that the vowel is implicit and the
circumflex indicates the tone mark. Especially for someone used to the Thai
typing rule of adding marks to a consonant from bottom to top, this would
naturally be written as <U+0061, U+0323, U+0302>.

The second reason is that, apparently contrary to the original conception of
Unicode, combinations of base letter and diacritic may be encoded as single
characters. (This practice of adding such codepoints to the standard has
now been largely discontinued.) Thus 'a' circumflex may be encoded as either
<U+0061, U+0302> or <U+00E2>.

There are in fact five different ways of encoding small 'a' with a circumflex
above and a dot below:

<U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW>
<U+00E2, U+0323>
<U+0061, U+0302, U+0323>
<U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW, U+0302>
<U+0061, U+0323, U+0302>
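
As a rough check of the list above, the spellings can be compared with
Python's standard unicodedata module. The choice of Python and the variable
names are assumptions made for illustration only; they are not part of the
discussion. The sketch relies on the fact that canonically equivalent strings
share one NFD form.

```python
import unicodedata

# The spellings of 'a' with circumflex above and dot below listed above.
spellings = [
    "\u1ead",
    "\u00e2\u0323",
    "\u0061\u0302\u0323",
    "\u1ea1\u0302",
    "\u0061\u0323\u0302",
]

# Collect the distinct NFD forms; equivalent spellings collapse to one.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

for s in spellings:
    print(" ".join(f"U+{ord(c):04X}" for c in s))
print("distinct spellings:", len(set(spellings)))
print("distinct NFD forms:", len(nfd_forms))
```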

The process of determining whether these are the same is:

1) Decompose, so we can see what we get.
2) Declare that the order of diacritics only matters if they compete for
position on paper, e.g. when diacritics stack vertically or are arranged in
order from left to right. This test is difficult to computerise, so it is
expressed as follows:
(a) Assign each diacritic a number, the same number if two diacritics
compete for position and different numbers if they do not. This number is
the diacritic's combining class.
(b) Identify non-diacritics by assigning them combining class zero. One
complication is that decomposition is not always into diacritic and
non-diacritic: two-part Indic vowels usually also have decompositions,
generally with both parts being of combining class zero.

Sequences which then have no differences that matter are 'canonically
equivalent'.

Finally, to reduce comparison of strings to comparison of codepoint
sequences, we define a normalised form, i.e. select a representative element
from each class of equivalent sequences. How do we do this? We define a
sequence to be in Normalization Form D (D for 'decomposed') (NFD) if:
(i) No character in it has a canonical decomposition;
(ii) Every sequence of characters of non-zero combining class is in
non-decreasing order of combining class.
Every canonical equivalence class of codepoint sequences has exactly one
member which is in NFD.

For the example above, it is <U+0061, U+0323, U+0302> that is NFD.
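
Conditions (i) and (ii) can be sketched directly in Python. This is only a
minimal sketch: the function names are hypothetical, and it ignores Hangul
syllables, whose decomposition is algorithmic and which the per-character
lookup used here is assumed not to cover. For real use,
unicodedata.normalize("NFD", s) == s is the safer test.

```python
import unicodedata

def has_canonical_decomposition(ch):
    # decomposition() returns "" when there is no mapping; compatibility
    # mappings start with a "<tag>", so only untagged ones are canonical.
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

def looks_like_nfd(s):
    prev = 0
    for ch in s:
        if has_canonical_decomposition(ch):
            return False              # violates condition (i)
        ccc = unicodedata.combining(ch)
        if ccc != 0 and prev > ccc:
            return False              # violates condition (ii)
        prev = ccc
    return True

candidates = [
    "\u1ead",
    "\u00e2\u0323",
    "\u0061\u0302\u0323",
    "\u1ea1\u0302",
    "\u0061\u0323\u0302",
]
for s in candidates:
    codes = " ".join(f"U+{ord(c):04X}" for c in s)
    print(codes, looks_like_nfd(s))
```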

The objection to NFD is that it is the form which uses the most codepoints.
Converting to NFC is the deterministic selection of a compact member of the
equivalence class to act as its representative. (There are probably
examples where it is not the most compact member, even without considering
'composition exclusions'.) I don't think I have anything useful to add to
the account of NFC I gave earlier.
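
For the 'a' with circumflex and dot below, the difference in length is easy
to see. The sketch below again assumes Python's unicodedata module and shows
only this one example; it says nothing about whether NFC is always the most
compact member.

```python
import unicodedata

s = "\u0061\u0302\u0323"
nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)

# Print each form with its length in codepoints.
for label, form in (("NFD", nfd), ("NFC", nfc)):
    codes = " ".join(f"U+{ord(c):04X}" for c in form)
    print(label, len(form), codes)
```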

> That is to say that for f(x)=y, you can get different values of y for the
> same value of x?

Writing ~ for canonical equivalence, and = for identity of codepoint
sequence, then for a Unicode-compliant transformation f, if a~b, then we
require f(a)~f(b). Uppercasing and lowercasing are required (defined?) to
be Unicode-compliant transformations. Inconveniently, even ignoring the
issues of locales, simply applying the casing data in UnicodeData.txt does
not result in a Unicode-compliant transformation. The problem is the
behaviour of subscript iota. When text is converted to uppercase, the
subscript iota becomes a full capital iota. Subscript iota has positive
combining class; capital iota has combining class 0.

> I do indeed get Ω Ι ̓ ͂ when trying to
> uppercase <U+03C9, U+0345, U+0313, U+0342>. Why? What's wrong with the
> result?

U+03A9 has combining class zero.
U+03C9 has combining class zero.
U+0399 has combining class zero.
U+0345 has combining class 240.
U+0313 has combining class 230.
U+0342 has combining class 230.
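
The classes of the characters involved in this example can be read from the
Unicode Character Database. A minimal sketch, assuming Python's unicodedata
module as the lookup tool:

```python
import unicodedata

# The characters of the omega example and of its uppercase form.
chars = ["\u03a9", "\u03c9", "\u0399", "\u0345", "\u0313", "\u0342"]

# Map each codepoint (in citation form) to its canonical combining class.
classes = {f"U+{ord(c):04X}": unicodedata.combining(c) for c in chars}

for c in chars:
    print(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
```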

<U+03C9, U+0345, U+0313, U+0342> is therefore canonically equivalent (~) to
<U+03C9, U+0313, U+0342, U+0345>,
~ <U+1F60 GREEK SMALL LETTER OMEGA WITH PSILI, U+0342, U+0345>
~ <U+1F66 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI, U+0345>
~ <U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND
YPOGEGRAMMENI>

But http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt states that the
upper case form of U+1FA6 is <U+1F6E, U+0399>. But
<U+1F6E, U+0399> ~ <U+03A9, U+0313, U+0342, U+0399>, which is not
canonically equivalent to <U+03A9, U+0399, U+0313, U+0342>. That is what is
wrong.
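
The failure can be reproduced in a few lines. In this sketch Python's
str.upper() stands in for uppercasing codepoint by codepoint; that it behaves
this way is an assumption about the interpreter, and canon_equiv and show are
hypothetical helper names.

```python
import unicodedata

def canon_equiv(a, b):
    # Two strings are canonically equivalent iff their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def show(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)

typed = "\u03c9\u0345\u0313\u0342"   # iota subscript before the other marks
composed = "\u1fa6"                  # the fully composed character

print("inputs equivalent: ", canon_equiv(typed, composed))

up_typed = typed.upper()
up_composed = composed.upper()
print(show(up_typed))
print(show(up_composed))
print("outputs equivalent:", canon_equiv(up_typed, up_composed))
```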

> If there is something wrong with the result, it could be that perhaps
> with smarter input UTF-8 conversion data tables, I can get correct
> result.

http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt does tell you what
you need to do - 'IMPORTANT-when capitalizing iota-subscript (0345), it MUST
be in normalized form--moved to the end of any sequence of combining marks.'

Essentially, before you apply simple capitalisation, you must ensure that
any character containing U+0345 in its decomposition is final or is followed
by a character of combining class zero. It is for this reason that U+0345
has the highest canonical combining class (240), and is the only character
in that class. One way of doing this - the one involving the least
programming effort - is to convert to NFD and then capitalise.
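
A minimal sketch of that approach follows. The name upper_via_nfd is
hypothetical, and Python's str.upper() is again assumed to apply the casing
data one codepoint at a time. Normalising first moves the iota subscript
behind the other marks, so all the equivalent spellings of the omega example
should uppercase to the same sequence.

```python
import unicodedata

def upper_via_nfd(s):
    # Decompose and reorder first, so that U+0345 ends each run of marks,
    # and only then apply the uppercase mapping.
    return unicodedata.normalize("NFD", s).upper()

variants = [
    "\u03c9\u0345\u0313\u0342",
    "\u03c9\u0313\u0342\u0345",
    "\u1f60\u0342\u0345",
    "\u1f66\u0345",
    "\u1fa6",
]

# Normalise the results as well, so that they are compared up to equivalence.
results = {unicodedata.normalize("NFD", upper_via_nfd(v)) for v in variants}

for v in variants:
    print(" ".join(f"U+{ord(c):04X}" for c in upper_via_nfd(v)))
print("distinct results:", len(results))
```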

>> So your process is not Unicode-compliant, for, to use the standard
>> citation form for Unicode codepoints, <U+0391, U+033D, U+0399> and
>> <U+0391, U+0399, U+033D> are not canonically equivalent, whereas the
>> inputs, <U+03B1, U+033D, U+0345> and <U+03B1, U+0345, U+033D>, are.

This of course is another example of exactly the same problem.

This contrived example demonstrates that NFC only works for normal Greek.
The NFC of <U+03B1, U+033D, U+0345> is <U+1FB3 GREEK SMALL LETTER ALPHA WITH
YPOGEGRAMMENI, U+033D>, and it would naively uppercase to <U+0391 GREEK
CAPITAL LETTER ALPHA, U+0399, U+033D>, which is not equivalent to the naive
upper case of the NFD form, <U+0391, U+033D, U+0399>. I raised this
combination as an aside because it did not seem semantically correct. An
even better example of the same thing is
ᾔ̲δ̲η̲ (with combining underline under all letters). In NFC it is

<U+1F94 GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI,
U+0332 COMBINING LOW LINE, U+03B4 GREEK SMALL LETTER DELTA, U+0332,
U+03B7 GREEK SMALL LETTER ETA, U+0332>. That capitalises by the rules (or
at least, if you first convert to NFD) to
Ἤ̲ΙΔ̲Η̲ (with just three of the four letters underlined!)

<U+1F2C GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA,
U+0332, U+0399 GREEK CAPITAL LETTER IOTA, U+0394 GREEK CAPITAL LETTER DELTA,
U+0332,
U+0397 GREEK CAPITAL LETTER ETA, U+0332>. Clearly underlining and
uppercasing do not commute!
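
Both examples can be checked with a short sketch. It assumes Python's
unicodedata module and str.upper(); underline_count is a hypothetical helper
that counts characters of combining class zero and how many of them are
followed by U+0332 before the next such character.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Alpha with x above and iota subscript: NFC first versus NFD first.
alpha = "\u03b1\u033d\u0345"
via_nfc = nfc(alpha).upper()
via_nfd = nfd(alpha).upper()
print("alpha results equivalent:", nfd(via_nfc) == nfd(via_nfd))

LOW_LINE = "\u0332"

def underline_count(s):
    # One flag per character of combining class zero; the flag is set when
    # a U+0332 follows before the next such character.
    flags = []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            flags.append(False)
        elif ch == LOW_LINE and flags:
            flags[-1] = True
    return len(flags), sum(flags)

word = "\u1f94\u0332\u03b4\u0332\u03b7\u0332"
upper_word = nfd(word).upper()
print("lower:", underline_count(word))
print("upper:", underline_count(upper_word))
print(" ".join(f"U+{ord(c):04X}" for c in nfc(upper_word)))
```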

Richard.


This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 16:32:40 CDT