One gotcha that I run into every six months or so is forgetting that
the punctuation characters in the Basic Latin block are classified as
Latin script. This trips me up because most of my text processing work
involves CJK, so I'll write something like this to filter Latin
characters (in Rosette notation):
    if (UnicodeCharacter::GetScriptSystem(c) == ss_Latin) {
        // blah blah blah
    }
while what I really wanted to say is:
    if (UnicodeCharacter::GetScriptSystem(c) == ss_Latin &&
        !AnyPunctuation(c)) {
        // blah blah blah
    }
This is confusing because the ideographic punctuation is not
considered to be CJKScript. For example, U+3001 (IDEOGRAPHIC COMMA)
has undefined script, but U+002C (COMMA) is Latin script.
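
To make the pitfall concrete, here is a minimal self-contained sketch.
The enum and every function in it are hypothetical stand-ins that only
imitate the classification described above; they are not the real
Rosette API, and the set of code points treated as punctuation is an
assumption chosen for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins, NOT the real Rosette API.
enum ScriptSystem { ss_Undefined, ss_Latin, ss_CJK };

// Assumed punctuation set: ASCII punctuation and symbols, plus the
// ideographic comma (U+3001) and ideographic full stop (U+3002).
bool AnyPunctuation(char32_t c) {
    return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
           (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E) ||
           c == 0x3001 || c == 0x3002;
}

// Toy classifier mirroring the behaviour described in this post:
// Basic Latin letters AND punctuation come back as Latin, while
// ideographic punctuation comes back as undefined.
ScriptSystem GetScriptSystem(char32_t c) {
    bool asciiLetter = (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
    bool asciiPunct = c <= 0x7E && AnyPunctuation(c);
    if (asciiLetter || asciiPunct) return ss_Latin;
    if (c >= 0x4E00 && c <= 0x9FFF) return ss_CJK;  // CJK Unified Ideographs
    return ss_Undefined;
}

// The check I tend to write first: it also matches U+002C.
bool IsLatinNaive(char32_t c) {
    return GetScriptSystem(c) == ss_Latin;
}

// The check I actually wanted: Latin script, excluding punctuation.
bool IsLatinLetter(char32_t c) {
    return GetScriptSystem(c) == ss_Latin && !AnyPunctuation(c);
}

// Drop every character the given predicate considers Latin.
std::u32string StripLatin(const std::u32string& text,
                          bool (*isLatin)(char32_t)) {
    std::u32string out;
    for (char32_t c : text) {
        if (!isLatin(c)) out += c;
    }
    return out;
}
```

With the naive predicate, stripping "Latin" from mixed text removes the
U+002C commas along with the letters, while any U+3001 survives; with
the second predicate both kinds of comma are left alone.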
So my question is this: why (for I assume there is a Good Reason(tm)
for it) is Latin punctuation classified as Latin script, while CJK
punctuation is not classified as CJKScript?
After all, I use U+002C when writing in Cyrillic and in Han'gul, two
script systems I think we can all agree are not Latin.
Thanks.
-tree
--
Tom Emerson
Basis Technology Corp.
Sr. Computational Linguist
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"