Marco Cimarosti writes:
> Tom Emerson wrote:
> > One gotcha, that I run into every six months or so, is forgetting that
> > the punctuation characters in the Basic Latin block are classified as
> > Latin script. This trips me up because most of my text processing work
> > involves CJK, so I'll write something to filter Latin characters with
> > (in Rosette notation):
>
> That must be a Rosette-specific behavior: in UTR#24 (and in its database
> <Scripts.txt>), the only ASCII-range code-points classified as "Latin" are
> the upper- and lower-case letters.
Indeed. It turns out that the Rosette script assignments (in the
version I'm using) predate UTR#24 by three or four years and are based
on the information in <Blocks.txt> with some hand editing by engineers
long past.

The next major Rosette release, which includes Unicode 3.1 support,
will use the data from UTR#24, and my problem will mostly go away.
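
To make the difference concrete, here is a rough Python sketch (not Rosette notation) of a script-based filter over the ASCII range under the UTR#24-style assignment, where only A-Z and a-z count as Latin. The names ascii_script and strip_latin are made up for illustration, and the "Common" label for the remaining ASCII characters is an assumption, not something taken from the Rosette data.

```python
# Illustrative sketch only: ASCII-range script lookup in the spirit of UTR#24,
# where only the letters A-Z and a-z are Latin. The "Common" label for every
# other ASCII character is an assumption made for this example.

def ascii_script(ch):
    """Return a script label for a single ASCII character."""
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("expected a single ASCII character")
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "Latin"
    return "Common"


def strip_latin(text):
    """Drop ASCII characters classified as Latin; keep everything else."""
    return "".join(
        ch for ch in text
        if ord(ch) > 0x7F or ascii_script(ch) != "Latin"
    )
```

With this assignment a "filter Latin" pass removes the ASCII letters but leaves commas, parentheses, and spaces in place, whereas a block-based assignment would treat that punctuation as Latin and remove it too.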
-tree
--
Tom Emerson                                Basis Technology Corp.
Sr. Computational Linguist                 http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
This archive was generated by hypermail 2.1.2 : Fri Nov 09 2001 - 17:32:54 EST