Re: Arial Unicode MS and Code2000

From: James Kass (jameskass@worldnet.att.net)
Date: Fri Jul 06 2001 - 12:09:01 EDT


Rajesh Chandrakar wrote:

> >
> > Another problem has to do with searching/indexing. Search/index applications
> > are "broken" by non-Standard encodings.
>
> but how far searching and indexing is possible for encoded standards?
>

Hopefully, someone on our list with better knowledge of search
engine technology will respond to your question.

Here is a link which might also be helpful:
A Devanagari Search Engine for Unicode Documents with Compression
http://www.cse.iitk.ac.in/research/mtech1998/9811101.html

You can also see one problem with Private Use Area encoding
and searching on one of my pages:
http://home.att.net/~jameskass/tamiltutf.htm
At the bottom of that page, there is some Tamil text encoded
with correct Unicode and then duplicated using Private Use
Area Tamil glyphs as found in Code2000.

(Now I can see that there might be some typos on my page
in the Devanagari portion.)

Here is the title in proper Unicode (UTF-8):
மின்கணிப்பிற்கான தமிழ் தரங்கள்

If you copy/paste this into your search utility, you will find only
one match on that page, even though you may see this title displayed
twice (depending on font and operating system). This can be very
frustrating for the casual user, who can see the words plainly on the
screen but, being unaware of the underlying encoding, can't understand
why the search utility fails to find the search string.
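
To illustrate the single match, here is a minimal Python sketch. The
title string is the one above; the Private Use Area copy is only a
stand-in, since the code points below are invented for illustration and
are not Code2000's actual assignments. The names `page`, `pua_copy`,
and `count_matches` are likewise hypothetical.

```python
# Title in standard Unicode, as given in the message.
title = "மின்கணிப்பிற்கான தமிழ் தரங்கள்"

# HYPOTHETICAL stand-in for the Private Use Area copy of the title.
# These code points are invented; they are NOT Code2000's real mapping.
pua_copy = "".join(chr(0xE000 + i) for i in range(len(title)))

# A toy "page" holding both versions, as at the bottom of the real page.
page = "standard: " + title + "\n" + "private use: " + pua_copy


def count_matches(text, query):
    """Count non-overlapping occurrences of query, compared by code point."""
    return text.count(query)


matches = count_matches(page, title)
print(matches)
```

A font may draw both lines identically, but the search compares code
points, not glyphs, so only the standard copy is counted.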

On the system here, the title in correct Unicode looks exactly
the same as the title using the Private Use Area code points. But
when I made the page a few years back, the correct Unicode did
not display well. (I have since upgraded the operating system here.)
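
Since the two versions can look the same on screen, one way to tell
them apart is to inspect the code points. The sketch below uses
Python's standard `unicodedata` module, which reports Private Use
characters as General Category "Co". The function name
`private_use_chars` and the sample Private Use string are assumptions
made for illustration only.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, code point) pairs for characters in category Co."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# A word from the title, in standard Unicode Tamil.
standard = "தமிழ்"

# HYPOTHETICAL Private Use string; not Code2000's real assignments.
hypothetical_pua = "\ue000\ue001\ue002"

print(private_use_chars(standard))
print(private_use_chars(hypothetical_pua))
```

An empty result for a string means it contains no Private Use
characters, so a search utility has a chance of matching it against
standard Unicode input.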

Hope this helps.

Best regards,

James Kass.


