Re: Devanagari

From: James Kass (jameskass@worldnet.att.net)
Date: Sun Jan 20 2002 - 03:48:58 EST


Aman Chawla wrote,

> I would be grateful if I could get opinions on the following:

> 1. Which encoding/character set is most suitable for using Hindi/Marathi
> (both of which use Devanagari) on the internet as well as in databases, and
> why? In your response, please refer to:
> http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html,
> particularly the following paragraphs:
<snip>

Unicode is the best choice. It is the world's standard for character
encoding and, as such, offers the best chance that text can be exchanged
around the globe and across platforms.

The arguments about relative size are true, but these days they are
considered unimportant. Graphics and sound files are extremely large in
comparison with text files in any script. Each Devanagari character takes
three bytes in UTF-8. Four-byte UTF-8 sequences are so far only needed for
characters in Plane One and above (U+10000 and up).
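
To make the size point concrete, here is a small Python sketch that counts
the UTF-8 bytes for a few sample characters. The sample characters and the
helper name utf8_len are my own choices for illustration, not anything
taken from the page under discussion.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Sample characters chosen for illustration only.
samples = [
    ("U+0041 LATIN CAPITAL LETTER A", "\u0041"),
    ("U+00E9 LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),
    ("U+0915 DEVANAGARI LETTER KA", "\u0915"),
    ("U+10000 LINEAR B SYLLABLE B008 A", "\U00010000"),  # Plane One
]

for name, ch in samples:
    print(f"{name}: {utf8_len(ch)} byte(s) in UTF-8")

# "What is Unicode?" in Hindi, spelled with escapes so the code points are explicit.
phrase = ("\u092f\u0942\u0928\u093f\u0915\u094b\u0921 "
          "\u0915\u094d\u092f\u093e \u0939\u0948")
print(len(phrase), "characters,", utf8_len(phrase), "bytes in UTF-8,",
      len(phrase.encode("utf-16-le")), "bytes in UTF-16")
```

So Devanagari text costs more bytes in UTF-8 than in UTF-16 or an 8-bit
encoding, but the totals are still tiny next to any image or sound file.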

> 3. With reference to the previous question, can programs that convert
> the myriad Devanagari encodings in use today to a standard encoding
> (question 1) be made freely available, and how?

Yes, converters exist and are being distributed. Just go to the Google
search engine and input "character conversion Unicode" into the box.
Look for ICU and Rosette, to name a few. You might even run across
Mark Leisher's download page at:
  http://crl.nmsu.edu/~mleisher/download.html
and see the Perl script for converting the Naidunia Devanagari encoding
to UTF-16.
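
For a rough idea of what such a converter does, here is a minimal
table-driven sketch in Python. Please note that the byte values in the
table are placeholders I made up; they are NOT the real Naidunia layout or
any other font's layout, and the function name legacy_to_unicode is also
hypothetical. Real font encodings typically need reordering and conjunct
handling as well, which this sketch ignores.

```python
# HYPOTHETICAL mapping: placeholder byte values, not any real font encoding.
LEGACY_TO_UNICODE = {
    0xA1: "\u0915",  # DEVANAGARI LETTER KA
    0xA2: "\u092f",  # DEVANAGARI LETTER YA
    0xA3: "\u093e",  # DEVANAGARI VOWEL SIGN AA
    0xA4: "\u094d",  # DEVANAGARI SIGN VIRAMA
}

def legacy_to_unicode(data: bytes) -> str:
    """Map each legacy byte to Unicode; pass ASCII through unchanged."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            # Unknown bytes become U+FFFD REPLACEMENT CHARACTER.
            out.append(LEGACY_TO_UNICODE.get(b, "\ufffd"))
    return "".join(out)

# Four placeholder bytes that spell KA + VIRAMA + YA + AA under the table above.
legacy = bytes([0xA1, 0xA4, 0xA2, 0xA3])
text = legacy_to_unicode(legacy)

utf16 = text.encode("utf-16")  # includes a byte order mark
utf8 = text.encode("utf-8")
print(len(legacy), "legacy bytes ->", len(utf16), "bytes UTF-16,",
      len(utf8), "bytes UTF-8")
```

Once the text is in Unicode, choosing between UTF-16 and UTF-8 output is
just a matter of which encoder is called at the end.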

> 4. Is there any search engine on the internet that maintains an up to date
> index of sites in Devanagari? If not, what can be done to encourage
> proprietary search engines to support Hindi? Google supposedly has a
> Hindi language option, but surprise, it's in Roman script! Several emails
> to them have elicited the response: "At the moment we don't support
> Devanagari..."

This appears to be because Google converts UTF-8 strings entered in the
search box into decimal NCRs (numeric character references).

I pasted यूनिकोड क्या है into the Google search box, and it displayed fine.
Since the "What is Unicode?" pages are popular and have been up for a
while, I thought they would have a good chance of being indexed. But
there were no hits for the resulting search string:
&#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337;
&#2325;&#2381;&#2351;&#2366; &#2361;&#2376;
...which is not surprising since the actual page doesn't use NCRs.
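
The mismatch is easy to reproduce. The Python sketch below turns the same
phrase into decimal NCRs and shows that the NCR string does not literally
match text stored as real characters. The "page" string is a stand-in I
made up for an indexed page, and I am only guessing that Google's
conversion behaves like this one.

```python
import html

# "What is Unicode?" in Hindi, spelled with escapes so the code points are explicit.
text = ("\u092f\u0942\u0928\u093f\u0915\u094b\u0921 "
        "\u0915\u094d\u092f\u093e \u0939\u0948")

def to_decimal_ncrs(s: str) -> str:
    """Replace every non-ASCII character with a decimal NCR such as &#2351;."""
    return s.encode("ascii", "xmlcharrefreplace").decode("ascii")

query = to_decimal_ncrs(text)

# Hypothetical stand-in for a page that stores the text as real characters.
page = "<title>" + text + "</title>"

print(query)
print("literal match:", query in page)
print("match after decoding NCRs:", html.unescape(query) in page)
```

An index would need to decode the NCRs back to characters (or store both
forms the same way) before comparing, otherwise the search cannot hit.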

Best regards,

James Kass.


