From: verdy_p (verdy_p@wanadoo.fr)
Date: Sun Sep 06 2009 - 16:57:31 CDT
"Shriramana Sharma" wrote:
> Correct me if I am wrong, but the single Greek letter sigma is said to
> have two different forms, one in word-final position and the other elsewhere.
> These are encoded in Unicode as U+03C2 and U+03C3 respectively.
>
> Now are these two symbols not just two different ways of writing the
> same character? If yes, how can they be separately encoded? Is it only
> to keep compatibility with some earlier standard? Or can these two
> actually be considered as two different characters?
It would be simple if the correct letter form could be decided from the immediate context alone, looking at no more
than a few properties of the previous or the next character. In that case, a solution like the Arabic contextual
letter forms would work.
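To make that concrete, here is a minimal sketch (Python; the rule and the function name are only for illustration,
not part of any standard) of what such a purely contextual decision would look like for the sigma. The rest of this
message is about why it cannot always be right:

    # Purely illustrative: pick the sigma form from the next character only.
    # Abbreviations, deliberate spellings and legacy data all break this rule.
    def naive_sigma_form(text, i):
        """Form a context-only rule would pick for the sigma at index i."""
        assert text[i] in ('\u03C3', '\u03C2')
        followed_by_letter = i + 1 < len(text) and text[i + 1].isalpha()
        return '\u03C3' if followed_by_letter else '\u03C2'

    print(naive_sigma_form('\u03C3\u03BF\u03C6\u03CC\u03C2', 4))  # 'ς' at word end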
Even though there are more advanced text rendering engines and font formats that can manage more complex cases (with
substitution rules, which should not have to apply canonical reordering of "equivalent" encodings to cover all the
possibilities), you cannot depend only on these techniques.
For the same reason, in the Latin script there is the case of the long s (no longer used in modern languages), whose
position cannot be reliably determined by a simple algorithm: it has always depended on the author, and even a single
author was not always consistent within the same text.
The case of the Greek sigma is quite similar: there are tricky cases where the final form of the lowercase sigma
needs to appear in the middle of a word, or the non-final form at the end of one. There is also the need for backward
compatibility with legacy ISO encodings that treated these two characters as distinct (because they could not depend
on more advanced contextual rendering, given the simple one-to-one mappings from characters to glyphs in almost all
legacy fonts for Greek).
For these reasons, the two letters need to be considered as distinct in lowercase, even though they both capitalize
to the same capital Sigma.
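A quick illustration with the Python standard library (the lowercasing behaviour shown depends on the interpreter
implementing the Unicode context rules, so take it as indicative only):

    import unicodedata
    print(unicodedata.name('\u03C3'))          # GREEK SMALL LETTER SIGMA
    print(unicodedata.name('\u03C2'))          # GREEK SMALL LETTER FINAL SIGMA
    print('\u03C3'.upper(), '\u03C2'.upper())  # both print 'Σ' (U+03A3)
    # Recent CPython versions apply the Unicode final-sigma context rule when
    # lowercasing; older versions may not.
    print('\u039F\u0394\u03A5\u03A3\u03A3\u0395\u03A5\u03A3'.lower())  # 'οδυσσευς'
    print('\u03A3'.lower())                    # 'σ' -- no context, non-final form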
(In fact, to help prevent the loss of information, capitalization by mapping algorithms should no longer be performed
on texts at all when it is only needed as a rendering style or for collation and search purposes, unless the
capitalization is absolutely required by the standard orthography of the language: capitals should be treated as
distinct from the small letters. This is especially important for dictionaries, and can explain why, for example, the
Wiktionary.org instances were created with case being significant, including for the first letter of article names;
instead, the search facility copes with those differences and helps find the other articles, providing links to them
when appropriate.)
So if you accept that case is significant, you have to accept that other letter forms are also significant in
multicameral scripts like Latin and Greek (which are not purely bicameral). Similar considerations could have been
applied to Arabic, but for legacy reasons the contextual forms were not made distinct, and this creates additional
encoding difficulties: the letter forms have to be controlled with extra joiners/disjoiners that make no sense in
Arabic by themselves, unless you consider that letter + (dis)joiner is the way the letter forms are encoded
"atomically" (but this is not how it works: the (dis)joiners have to be inserted contextually, and this does not help
automate text input).
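To illustrate the last point, the joining controls are perfectly ordinary characters as far as the current
normalization forms are concerned (Python; the Persian word is only a common example where ZWNJ is required inside a
word):

    import unicodedata
    ZWNJ = '\u200C'  # ZERO WIDTH NON-JOINER: requests the unjoined letter forms
    ZWJ  = '\u200D'  # ZERO WIDTH JOINER: requests the joined letter forms
    word = '\u0645\u06CC' + ZWNJ + '\u062E\u0648\u0627\u0647\u0645'  # "mi-khaham"
    # NFC/NFD neither insert, remove nor reorder the joining controls, which is
    # exactly the gap discussed below.
    assert unicodedata.normalize('NFC', word) == word
    assert unicodedata.normalize('NFD', word) == word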
There should exist a way to remap a text that contains unnecessary (dis)joiners to its canonical form (according to
the existing joining rules), and to perform the reverse mapping without changing the text. But the joining properties
are not considered in the current specification of the standard canonicalization forms of Unicode. I think this
should be corrected by adding another canonicalization mapping specific to Arabic (even if this apparently "changes"
the normalized equivalences). Similar algorithms should be developed as well for other Asian scripts that use joining
controls. For Latin, the use of compatibility mappings (NFKC/NFKD) should also be deprecated in favor of the
systematic use of joiner controls where appropriate (for example for the ligatures fi/fl/ffi/ffl/..., by remapping
the ligature characters to equivalent sequences of normal letters and joiners, remapping the non-ligatured letters
with disjoiners, and making these equivalent under the new canonicalization schemes).
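A sketch of what such a remapping could look like (Python; the table is hypothetical, covers only a few Latin
ligatures, and is in no way a standard Unicode normalization), compared with NFKD, which simply discards the ligation
intent:

    import unicodedata
    ZWJ = '\u200D'
    # Hypothetical remapping table in the spirit described above.
    LIGATURE_TO_JOINED = {
        '\uFB00': 'f' + ZWJ + 'f',              # ff
        '\uFB01': 'f' + ZWJ + 'i',              # fi
        '\uFB02': 'f' + ZWJ + 'l',              # fl
        '\uFB03': 'f' + ZWJ + 'f' + ZWJ + 'i',  # ffi
        '\uFB04': 'f' + ZWJ + 'f' + ZWJ + 'l',  # ffl
    }
    def joiner_form(text):
        return ''.join(LIGATURE_TO_JOINED.get(c, c) for c in text)

    print(unicodedata.normalize('NFKD', '\uFB01'))  # 'fi' -- ligation intent lost
    print(joiner_form('\uFB01'))                    # 'f' + ZWJ + 'i' -- intent kept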
Note: I'm not advocating a change to the standard canonicalization algorithms, but the development of better (and
still safe) canonicalization algorithms, which must remain stable over the existing NFC/NFD equivalences. These would
greatly simplify the development of other tools such as rendering engines, fonts (fewer substitution rules needed in
font tables), input methods and keyboard drivers, and plain-text search (using NFKC/D there is really a mess, giving
too many false hits). And these algorithms should remain open to later changes.
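A few examples of the false hits I mean (Python standard library; these are the standard compatibility mappings):

    import unicodedata
    nfkc = lambda s: unicodedata.normalize('NFKC', s)
    print(nfkc('\u00B2'))          # '2'  -- a search for "2" now hits SUPERSCRIPT TWO
    print(nfkc('\u2460'))          # '1'  -- and CIRCLED DIGIT ONE
    print(nfkc('\u2122'))          # 'TM' -- and TRADE MARK SIGN when searching "TM"
    print(nfkc('\uFB01') == 'fi')  # True -- the fi ligature matches plain "fi"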
Under the new canonicalization schemes, we could also support the Hangul script in a much simpler way (the
distinction between initial and final consonants, used to exhibit the delimitation of the composition squares, is
artificial and does not work as intended with older Korean texts or complex syllables). With this tool, it would be
possible to reliably remap the much simpler alphabet (using a reduced set of jamos only) to the final form preferred
by modern usage of the script (using the full set of jamos and the precomposed syllables). And finally, we would be
able to assert that the collation tables are set up correctly and consistently, including after tailoring (something
that cannot be asserted today, except possibly with the default collation table, which has been tweaked manually,
sometimes with bugs not detected in early versions of the DUCET).
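For reference, this is the distinction and the composition I am referring to (Python standard library; purely
illustrative):

    import unicodedata
    # The same consonant is encoded twice, once as an initial and once as a
    # final jamo; NFC then composes a conjoining sequence into one precomposed
    # syllable character.
    print(unicodedata.name('\u1100'))    # HANGUL CHOSEONG KIYEOK  (initial G)
    print(unicodedata.name('\u11A8'))    # HANGUL JONGSEONG KIYEOK (final G)
    s = '\u1100\u1161\u11A8'             # initial G + vowel A + final G
    composed = unicodedata.normalize('NFC', s)
    print(composed, hex(ord(composed)))  # '각' 0xac01 -- a single syllable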