From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 22 2003 - 20:41:36 EST
Peter Kirk said:
> Anyway, I don't see the main purpose of
> collation as producing lists of legible words, but rather as matching in
> text and database searches.
Collation is used for both purposes, of course. And there is nothing
which requires you to use the same rules for sorting lists as for
matching for searches.
Just as a search might choose to ignore case, a search can be defined
which would ignore specific script differences via a tailored
weighting. Thus for instance you could, right now, choose to
implement a tailoring of the UCA default tables which would
give Syriac letters the same weights as [square] Hebrew letters.
You could then turn a search using that collation weighting loose
on a corpus of Aramaic data in both Hebrew and Syriac script and
get the kind of cross-script matching for identical Aramaic
"underlying forms" that you are looking for, I presume.
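To make the idea concrete, here is a minimal sketch in Python using only the standard library. It does not use ICU or the UCA tables; it only imitates the effect of such a tailoring by folding each basic Syriac letter onto its Hebrew counterpart before comparing. The names search_key and find_matches, the letter-for-letter pairing table, the folding of Hebrew final forms, and the decision to ignore all combining marks are assumptions made for illustration.

```python
import unicodedata

# Assumption: the 22 basic Syriac letters paired one-to-one, in alphabet
# order, with the 22 non-final Hebrew letters.
SYRIAC = ("\u0710\u0712\u0713\u0715\u0717\u0718\u0719\u071A\u071B\u071D\u071F"
          "\u0720\u0721\u0722\u0723\u0725\u0726\u0728\u0729\u072A\u072B\u072C")
HEBREW = ("\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9\u05DB"
          "\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8\u05E9\u05EA")

# Assumption: Hebrew final forms should match their non-final forms.
FINALS = "\u05DA\u05DD\u05DF\u05E3\u05E5"
NONFINALS = "\u05DB\u05DE\u05E0\u05E4\u05E6"

FOLD = str.maketrans(SYRIAC + FINALS, HEBREW + NONFINALS)


def search_key(text):
    """Return a key that is equal for the same letters in either script."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (points, vowel signs) so they are ignored.
    letters = "".join(ch for ch in decomposed
                      if unicodedata.category(ch) != "Mn")
    return letters.translate(FOLD)


def find_matches(query, corpus):
    """Return the corpus entries whose key equals the key of the query."""
    wanted = search_key(query)
    return [word for word in corpus if search_key(word) == wanted]
```

A query typed in Syriac script would then match the same consonantal string written in square Hebrew, pointed or unpointed. A real implementation would express the same equivalences as a collation tailoring, so that strength levels and the rest of the default table keep working.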
Of course, none of that would be free out of the box from any OS,
but with advanced tools like ICU it is not that difficult to
create specialized collations along these lines and then use
them to implement custom searches. It is a little more
difficult to integrate them into off-the-shelf databases, but
most databases implement some kind of capability for stored
procedures, and you can create indexes off stored key fields
that are built using such stored procedures. That should enable
arbitrarily defined searching into data stores.
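The stored-key approach can be sketched as follows, using Python's sqlite3 module in place of a database stored procedure. The table and column names (words, form, skey), the helpers add_word and lookup, and the simplified search_key function are all hypothetical; search_key here only folds Syriac letters onto Hebrew ones and stands in for whatever tailored collation key you would actually generate.

```python
import sqlite3

# Assumption: one-to-one pairing of basic Syriac and Hebrew letters.
SYRIAC = ("\u0710\u0712\u0713\u0715\u0717\u0718\u0719\u071A\u071B\u071D\u071F"
          "\u0720\u0721\u0722\u0723\u0725\u0726\u0728\u0729\u072A\u072B\u072C")
HEBREW = ("\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9\u05DB"
          "\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7\u05E8\u05E9\u05EA")
TO_HEBREW = str.maketrans(SYRIAC, HEBREW)


def search_key(text):
    """Hypothetical stand-in for a key built from a tailored collation."""
    return text.translate(TO_HEBREW)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (id INTEGER PRIMARY KEY, form TEXT, skey TEXT)")
# The index is built on the stored key field, not on the original text.
conn.execute("CREATE INDEX words_skey ON words (skey)")


def add_word(form):
    """Store the original form together with its precomputed key."""
    conn.execute("INSERT INTO words (form, skey) VALUES (?, ?)",
                 (form, search_key(form)))


def lookup(query):
    """Find stored forms whose key equals the key of the query."""
    rows = conn.execute(
        "SELECT form FROM words WHERE skey = ? ORDER BY id",
        (search_key(query),))
    return [row[0] for row in rows]
```

Because the key is computed once at insert time, the search itself is an ordinary indexed equality lookup, and the database never needs to know anything about the tailoring.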
> I think that it just might be acceptable to encode
> the various ancient Semitic scripts separately if they are unified for
> collation.
As Michael indicated, separate scripts defined and encoded in
the Unicode Standard will, in the default collation table, get
separate primary weighting. That is the basic pattern followed
in the table, and is the most conservative approach, since it
does not presume removal of distinctions for the default.
In my opinion, the structure of the collation table should not,
however, be the main consideration which goes into determining:
A. Whether a particular historic variant of some writing system
should be separately encoded. (Meaning does the graphological
analysis in the context of character encoding suggest that
separate encoding makes more sense than unification with
something else already encoded?)
B. Whether, given a technical determination in (A) that a
separate script encoding is warranted, it should be
encoded at all. (Meaning is there any actual scholarly need
for an encoding of that particular form, or would encoding
simply be an exercise in script coverage completeness,
without any actual application?)
For "Aramaic", it isn't clear to me that we have consensus
yet about either of these "shoulds".
> But if you are saying that it must be all or nothing, I will
> continue to fight on behalf of the users of these scripts for all of
> what they want, rather than what you have apparently unilaterally (on
> the basis of a book which describes glyph shape differences rather than
> the systematic differences which really distinguish scripts) decided
> that they ought to want and have written into your Roadmap.
Them's fightin' words. Howzabout, as Michael suggested, we
simply cool it a little about Aramaic? Ancient forms of Aramaic
aren't going to be taken up anytime soon for any consideration
for encoding. And the Roadmap cannot be taken as a predetermination
of the eventual decisions in this regard, in my opinion.
If there is, however, some consensus that Samaritan and
Manichaean *do* deserve separate encoding consideration, how
about pursuing the furthering of encoding proposals for those
as distinct scripts and then coming back around later to review
the ancient forms once again after some more of the
pieces have fallen into place?
In the meantime, rather than harumphalating that Aramaic
scholars are being confused by the Unicode Roadmap, I think
it would serve everyone much better if someone knowledgeable
about Aramaic scholars' text encoding needs and practices
(you and others contributing to this discussion on the Hebrew
list in particular?) would write up a "Guide to Best Practices for
Aramaic Text Representation Using Unicode" and publish
it as a Unicode Technical Note. Then people could refer to and
be referred to *that*, instead of puzzling over a bunch of
sketchy, possible script encoding assignments on the Roadmap
which may or may not represent anything that will ever actually be
encoded in this area.
--Ken