Re: Last Call: UTF-16

From: peter_constable@sil.org
Date: Wed Aug 18 1999 - 21:15:51 EDT


       fdc>The final objective is to cover all human languages and
       writing systems in the UCS. But it seems each one needs a
       trial run in a simpler, self-contained character set. Perhaps,
       then, it is not practical to avoid creating additional
       single-byte sets. So...

       fdc>Let's hope that these new 8-bit character sets can be
       published and shared so the same work does not need to be
       needlessly replicated, and to facilitate a review process that
       might ensure a better result for the final UCS versions.

       From this, I understand that you would like to see various
       8-bit character sets developed by SIL published in some form.
       I'm not sure whether you also want to see standards developed
       around such character sets, or their inclusion in some
       registry.

       I need to warn that there are literally many hundreds of these
       that have been developed in the 4+ decades in which SIL has
       been working with language data electronically. (While working
       on my MA thesis on a Mayan language in Mexico, I was fortunate
       to be able to get a sizeable electronic corpus of text from a
       closely related language. It came from the data fed into a
       typesetter for a publication done in 1955, was on paper tape,
       and was 7-bit; I don't recall to what extent escape sequences
       were involved, but I'm pretty sure there were some. There's
       *lots* more of this kind of stuff in our archives, though.)

       Also, it may not all be considered pretty by current standards.
       Often these have been developed by linguists who may not have
       had as much knowledge and skill in IT as in linguistics.
       Generally, all of these were created to get a job done, but
       those jobs may have been focussing on a particular process and
       using proprietary systems. (In 1955, was there anything that
       wasn't proprietary?) In some cases, though, there have been
       alternate or competing encodings for a single language -
       alternates may have been developed for different purposes,
       and, merely by historical accident, different researchers may
       have developed different encodings where a single encoding
       would have sufficed.

       I've been thinking for a while that there may be some value in
       my starting to collect info on SIL-developed encodings, but it
       would be an enormous undertaking, and in an organisation where
       the work far exceeds the available personnel, I can't say for
       sure how successful I'd be. If I have time, though, I may try
       to make a start anyway. If I do, I can certainly discuss making
       that info available to others who are interested.

       fdc>Let's hope that they are not designed around some
       particular proprietary architecture and that some consideration
       has been given to interchange, so that users of these writing
       systems have some choice about platforms (this might mean, for
       example, following the guidelines of ISO 4873).

       I'm still climbing the learning curve on all these IT
       standards, and this is one I don't recall encountering yet. Can
       you give a brief explanation of what 4873 is all about?

       fdc>And let's hope we can avoid the term "legacy" in this
       connection. There's nothing legacy about it. It's
       groundbreaking work. It's nothing to be ashamed of.

       I can accept that. I guess I use "legacy" from a feeling that I'd
       rather give up 8-bit encodings for good, as I mentioned
       earlier. But you're right, there's a lot that's still
       ground-breaking and not a cause for shame.

       fdc>I suspect that Michael's or SIL's website might be a good
       place from which to coordinate this activity, to whatever
       extent this is not being done already.

       Peter



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT