From: Peter Kirk (peterkirk@qaya.org)
Date: Fri May 14 2004 - 08:35:09 CDT
On 13/05/2004 14:33, Kenneth Whistler wrote:
>Peter Kirk noted:
>
>>>PS Multi-language bibliographies are common in Russian books. They are
>>>usually printed with the Latin script entries following the Cyrillic
>>>script ones, but I have seen interleaved ones.
>
>Chris Jacobs noted:
>
>>has an index in which greek and latin script are interleaved.
>>
>>The greek words are sorted according to their transliteration:
>>
>> ̔ sorts as h
>>φ sorts as ph
>
>These illustrate the typical situation with cross-script,
>cross-language interfiling: They are *custom* solutions for
>particular indexing problems. And they may involve issues of
>transliteration or other adaptation to make like match with
>like for the purposes of the people using the interfiled list.
>
>Such tasks should *not* be attributed to the default collation
>element table for the Unicode Collation Algorithm. ...
>
I agree that such situations are typical of cross-script interfiling,
and so I do not support any suggestion of including a general mechanism
for this in the default collation table. This table is not the place to
define general purpose transliteration schemes.

But there is an exceptional issue within the family of north-west
Semitic scripts, which may also apply to others, e.g. Greek, Coptic and
archaic Greek, and possibly also the Indic scripts. Within each of these
sets of scripts there is NO ambiguity about which characters correspond
to which, as the scripts have identical repertoires, apart from possible
additional letters in some of the scripts for which no equivalent can be
defined in the others. These are marginal cases where some users prefer
disunification and others prefer unification. Furthermore, they are
cases where texts originally in the same language and script are encoded
in Unicode in a variety of scripts, both because of changes in Unicode
(e.g. Coptic disunification) and because of different scholarly
preferences.
For such cases, in my opinion, a good case can be made for interfiling
the scripts in the default algorithm. The major advantage of doing this
is to allow integrated searching of text corpora in which texts have
been encoded in more than one script.
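
To make the idea of interfiling concrete, here is a minimal sketch in Python of a sort key that files Phoenician-encoded words among Hebrew-encoded ones. It is an illustration only, not the Unicode Collation Algorithm or a real tailoring of it: it sorts by plain code point after folding. It assumes that the Phoenician letters occupy U+10900..U+10915 in traditional alphabetical order and correspond one-to-one to the 22 non-final Hebrew letters; the names FOLD and interfile_key are made up for this example.

```python
# Assumption: Phoenician letters are U+10900..U+10915, in the same
# alphabetical order as the 22 non-final Hebrew letters.
# Hebrew final forms, each mapped to its ordinary (non-final) letter.
FINAL_FORMS = {
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
}

# The 22 non-final Hebrew letters, alef (U+05D0) through tav (U+05EA).
HEBREW_BASE = [cp for cp in range(0x05D0, 0x05EB) if cp not in FINAL_FORMS]

# Hypothetical folding table: Phoenician letter -> Hebrew letter,
# plus Hebrew final form -> ordinary form.
FOLD = dict(zip(range(0x10900, 0x10916), HEBREW_BASE))
FOLD.update(FINAL_FORMS)


def interfile_key(word):
    """Sort key that interfiles the two scripts letter by letter.

    The folded form decides the order; the original spelling is only a
    tie-breaker, so identical words in different scripts stay adjacent.
    """
    return (word.translate(FOLD), word)


if __name__ == "__main__":
    words = ["\u05D4\u05D5", "\U00010902\U00010903", "\u05D0\u05D1"]
    for w in sorted(words, key=interfile_key):
        print([f"U+{ord(c):04X}" for c in w])
```

Without the folding step the Phoenician word would sort after every Hebrew word, simply because its code points are higher; with it, the word is filed by its letters.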
>...
>
>Mike Ayers is on the right track here, I believe. The scenarios
>which people are adducing in arguing for interfiling should
>be addressed instead by appropriately designed normalizations --
>which can be implemented using fairly easy-to-program,
>reusable scripts. Then sort on the *normalized* data using
>a much, much simpler collation table to accomplish what you
>need.
>
Mike Ayers suggested that users should write Perl scripts. This is
something which computer geeks may be able to do, but it is simply
impossible for the rest of humanity, including scholars of ancient
languages. Perl is not "God's gift to academic researchers" in general,
although it may be God's gift to computer geeks.

The other problem with this is that the large corpora to be searched are
not necessarily directly available to the users for normalisation. I
can't normalise the whole Internet before doing a Google search for a
Coptic or Phoenician word. What I need is a search engine which can (at
least as a tailoring) collate together Coptic and Greek, Phoenician and
Hebrew.
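
As a rough sketch of what such a search would have to do, the Python below folds both the query and the stored text to a common form before matching, so that the corpus itself never needs to be re-encoded. It rests on the same assumption as any such scheme: that Phoenician letters sit at U+10900..U+10915 and map one-to-one, in order, onto the non-final Hebrew letters. The function names fold_to_hebrew and search are invented for illustration and do not describe any existing search engine.

```python
# Hebrew final forms mapped to their ordinary letters, so that a query
# typed with a final kaf still matches text that has no final forms.
_FINALS = {0x05DA: 0x05DB, 0x05DD: 0x05DE, 0x05DF: 0x05E0,
           0x05E3: 0x05E4, 0x05E5: 0x05E6}

# Assumption: Phoenician U+10900..U+10915 pairs off, in order, with the
# 22 non-final Hebrew letters in U+05D0..U+05EA.
_TABLE = dict(zip(range(0x10900, 0x10916),
                  (cp for cp in range(0x05D0, 0x05EB) if cp not in _FINALS)))
_TABLE.update(_FINALS)


def fold_to_hebrew(text):
    """Return text with Phoenician letters and Hebrew finals folded."""
    return text.translate(_TABLE)


def search(query, documents):
    """Return the documents containing query, ignoring script choice."""
    folded_query = fold_to_hebrew(query)
    return [doc for doc in documents if folded_query in fold_to_hebrew(doc)]


if __name__ == "__main__":
    # mem-lamed-kaf encoded once in Phoenician, queried in Hebrew.
    corpus = ["\U0001090C\U0001090B\U0001090A", "\u05D0\u05D1"]
    hits = search("\u05DE\u05DC\u05DA", corpus)
    print(len(hits))
```

The folding happens at query time on whatever text is being matched, which is the point: the user supplies only the query, and nothing has to be normalised in advance.
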
Ken wrote separately, to Dean Snyder:
>Nobody plans to take away your rights and ability to continue
>doing what you now do, if it works very well for you. Please,
>sir, continue doing what you are doing with your current data.
>
Understood, and I note the smiley. But if some people continue to do
what they are doing and others follow a new script, that is a recipe for
confusion. The whole point of Unicode is to bring some consistency into
the previous mess of different character encodings and masquerades. If
the Unicode staff are now saying that it is OK to write Phoenician
either with Hebrew characters masquerading as Phoenician or with the
proposed Phoenician block, that opens the way to perpetuation of the
confusion which existed before Unicode. It really would be far better,
in the long run, if you said openly that anyone who continues to write
Phoenician with Hebrew characters after the new block is accepted is
wrong and breaking the standard, and should change their practices
immediately.

But then, if you said that, you would of course add a lot more fuel to
the fire, and you would be forced to consider properly whether such
proposals as the separate Phoenician script have consensus support from
the majority of regular professional users of the script.

--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/