From: James Kass (thunder-bird@earthlink.net)
Date: Sat Jan 03 2009 - 10:53:23 CST
Asmus Freytag wrote,
>> The existence of a private agreement is a given, otherwise
>> neither interpretation nor processing would be desired. In
>> contexts where the nature of the private agreement cannot
>> be determined, no interpretation is possible. Processing can
>> be done on uninterpreted strings. I don't need to be able to
>> speak Hindi in order to enter, store, search, and collate text
>> written in Devanagari, and neither does my plain-text editor.
>
>But your plain text database on your web server cannot present Hindi
>words in the order a user of your website in India would expect them,
>unless the text (that is its character codes) can be interpreted.
Method # 1 (easy way)
First, download any of several free online Hindi dictionaries.
Extract a word list to a database file. Just the one field:
VAR1, character field, length to be determined by the longest dict. word
then,
USE <mainfilename>
INDEX ON <fieldname> TO <indexfilename>
I'd store it as UTF-8 mojibake (raw UTF-8 bytes in a
byte-oriented character field), after first normalizing the
data in the same manner as will be applied to any incoming
word lists needing sorting. Then I'd store the incoming
data in a two-field database:
VAR1, character field, also UTF-8 mojibake after normalization
REFNO, numeric field
then,
CLEA ALL
SELE 1
USE <mainfilename> INDEX <indexfilename>
SELE 2
USE <newlistfilename>
x = SPACE(40)
y = 0
DO WHIL .NOT. EOF()
   x = TRIM(VAR1)
   SELE 1
   * SEEK takes an expression, so multibyte UTF-8 keys pass
   * through intact (FIND &x would trip over macro expansion)
   SEEK x
   IF FOUND()
      * RECNO() is the word's position in the dictionary file,
      * which is already in Hindi order; the index is only for lookup
      y = RECNO()
   ELSE
      * word not in the dictionary; flag it with zero
      y = 0
   ENDI
   SELE 2
   REPL REFNO WITH y
   SKIP
ENDD
CLOS DATA
USE <newlistfilename>
SORT ON REFNO TO <newsortedlistfilename>
CLEA ALL
USE <newsortedlistfilename>
* and you're ready to use your new sorted list.
* Caveat: the above is untested, but it seems straightforward.
That's a straight, exact binary comparison. The machine
doesn't need to interpret the text; the machine's programmer
does that. Someone would probably throw you a word that
wasn't in your dictionary, though (hence the zero flag above).
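The normalization step mentioned earlier might look like this
(a minimal sketch, in Python rather than xBase for brevity, and
assuming NFC as the chosen normalization form):

import unicodedata

def to_stored_form(word):
    # Normalize first, then keep the raw UTF-8 bytes; the
    # byte-oriented database sees "mojibake", but identical words
    # always yield identical byte strings, so exact matching works.
    return unicodedata.normalize("NFC", word).encode("utf-8")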
Method # 2 (easy way)
Download an existing set of collation rules for Hindi, study
them, then write your sorting code accordingly.
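The sorting code might be as simple as a weight table (again a
Python sketch; the six-vowel alphabet is illustrative, and a real
table would be built from the downloaded rules):

# Primary collation order taken from the downloaded rules,
# truncated to six vowels here for illustration.
ALPHABET = "\u0905\u0906\u0907\u0908\u0909\u090A"
WEIGHT = {ch: i for i, ch in enumerate(ALPHABET)}

def hindi_sort_key(word):
    # Characters missing from the table sort after everything known.
    return [WEIGHT.get(ch, len(ALPHABET)) for ch in word]

words = ["\u0907", "\u0905", "\u090A"]  # illustrative strings
words.sort(key=hindi_sort_key)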
Method # 3 (easy way)
Hire a database administrator fluent in Hindi, and pay your
new employee to write the code.
Method # 4 (easiest way)
Use somebody else's Hindi subroutine and/or list. (It has been
said that we all stand on the shoulders of giants.)
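For instance, assuming the ICU library's Hindi collator and its
Python bindings (PyICU) are available, somebody else's subroutine
does all the work:

from icu import Collator, Locale

collator = Collator.createInstance(Locale("hi_IN"))
words = ["\u0906\u092E", "\u0905\u0928\u093E\u091C"]  # illustrative
words.sort(key=collator.getSortKey)  # sorted per ICU's Hindi rules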
Now, suppose that, instead of Devanagari Hindi, it's Verdurian
ConScript PUA. It's *your* web server running *your* database.
So you ought to know all about Verdurian and write your code
accordingly.
If your web server running your database is storing Verdurian
from other web sites, you must first identify it as Verdurian.
(Font-face mark-up clues looked up in a researched database of
font names, linked to researched information in other databases;
frequency counts might be used to decide which fonts get
researched and which get disregarded. A sketch of this lookup
follows the alternatives below.)
or
(Use the hypothetical tag identifying the PUA scheme, go to
where the Verdurian PUA scheme information is hosted in
a hypothetically consistent fashion [from one scheme hosting
web site to the next] and get that information. Check periodically
for updates. The hypothetical tag includes information about
where to go for needed data, of course.)
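A sketch of the font-name approach; the table entries, the
pattern, and the function are all invented for illustration:

import re

# Hypothetical researched table: font name -> PUA scheme.
PUA_SCHEMES = {
    "Some Verdurian Font": "verdurian",  # invented entry
    "Some Tengwar Font": "tengwar",      # invented entry
}

def guess_scheme(markup):
    # Pull candidate names out of font-family declarations.
    for m in re.finditer(r'font-family:\s*"?([^";]+)', markup):
        scheme = PUA_SCHEMES.get(m.group(1).strip())
        if scheme is not None:
            return scheme
    return None  # unknown font: leave the PUA text uninterpreted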
If, instead of Verdurian, it's some other unknown PUA script
*without* any PUA scheme information (unknown font, or a
missing or unknown scheme tag), then the page isn't *about*
you; it's an example of public exchange of user-defined
characters whose meaning is a secret. You can still index this
stuff as binary strings for search/comparison purposes, but
that's about all you can do.
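That binary-string indexing, sketched (byte-for-byte exactness is
the whole point; nothing is interpreted anywhere):

index = {}  # raw UTF-8 byte strings -> pages they occur on

def add_occurrence(text, page_url):
    index.setdefault(text.encode("utf-8"), []).append(page_url)

def exact_search(query):
    # Hits only when the query is byte-identical to stored text.
    return index.get(query.encode("utf-8"), [])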
Even if your database is grabbing data straight from a cell
phone, the cell phone message protocol should be sufficient to
determine the vendor, and by extension, its scheme.
>> Success in interpreting the text, then, lies in determining the
>> nature of the private agreement. This is not a new concept,
>> it has been discussed here previously, unless I'm mistaken.
>> Mark-up was one method mentioned, if I recall correctly.
>> Search engines can interpret mark-up.
>>
>If that was as easy and straightforward, we wouldn't have a Unicode
>Standard.
It's as easy and straightforward to write first-time code to
interpret Klingon text as it was to write the first-time code
to interpret Telugu. (Indeed, it's even easier; Klingon is not
a complex script!)
We have a Unicode Standard for standardized plain-text
exchange. Search engines are mainly indexing rich-text
pages. Cell phone vendors in Japan are exchanging icons
using PUA plain-text as mark-up. The rich-text exchanged
by Japanese phone vendors sometimes ends up on rich-
text web pages where it might be grabbed by a rich-text
search engine.
Any perceived rich-text problem here requires a
rich-text solution.
The Japanese phone vendors may well continue to use their
PUA characters for icon exchange. Suppose they want to
enter a non-Japanese market, say, Latin America. Wouldn't
the new subscribers to their service want their *own* icons,
reflecting their own cultures? And wouldn't those vendors
whip some up and extend their proprietary user-defined
icon sets? How about expanding sales and service in
southeast Asia?
If UTC has a working relationship with these icon-making
vendors, wouldn't it be better to work with them to switch
from Shift-JIS to Unicode? To help them understand and
implement complex script shaping rules on their sets?
Imagine, if they switched to Unicode and got the complex
shaping worked out, they might have a good chance in
southeast Asian markets. Isn't *that* the proper role of
Unicode -- education about and promotion of the computer
plain-text encoding standard?
Or, as those vendors increase their icon sets when new markets
are added and existing fads change, would it be better to
eagerly await each icon addition so that it can be promoted
into Unicode?
>If I remember correctly, before Unicode, everybody had their own
>character sets, and in Japan, every vendor had their own. In order to
>communicate you had to know what character set the other party was
>using. ISO 2022 even had internal markup (control sequences) to allow
>switching of character sets on the fly.
>
>Interestingly enough, vendors, users and implementers voted with their
>feet to abandon such systems and go to a unified encoding where the
>semantics of each code are unambiguous on the character level, where
>there's no need to switch on the fly, and where the processes can be
>written without undue complication.
This is revisionist, in my opinion. It wasn't that easy;
it was an uphill climb. There are even still some people
out there stuck with Unix systems locked into ISO 8859-1.
Pasting a Malayalam Unicode text word into the search
box on Unicode's own mailing list archive page results in
some kind of conversion of the material into ASCII
mojibake, for which the expected match isn't found.
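My guess at the failure mode, sketched in Python (the Malayalam
word and the percent-encoded query string are illustrative): the
form handler decodes the submitted bytes as Latin-1/ASCII instead
of UTF-8.

from urllib.parse import unquote

raw = "%E0%B4%85%E0%B4%AE%E0%B5%8D%E0%B4%AE"  # a Malayalam word, percent-encoded UTF-8
good = unquote(raw, encoding="utf-8")    # round-trips to the original Malayalam
bad = unquote(raw, encoding="latin-1")   # Latin-1 mojibake; the search matches nothing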
>Suitability requirements are different between ordinary and
>compatibility characters - that's a long held design principle for the
>Unicode Standard.
These aren't compatibility characters unless the book definition
in Unicode 5.0 has been trashed or revised.
>> We shouldn't exclude text-like characters from being included
>> in a plain-text encoding standard as long as all the criteria are
>> met.
>
>Criteria for encoding are different between ordinary and compatibility
>characters. Requiring that the criteria for ordinary character are to be
>met, is tantamount to freezing all encoding of compatibility characters.
>That's not a useful starting point.
Compatibility characters are variants of characters already
encoded in Unicode, or they have a compatibility decomposition.
The definition has shifted around some over the years, but that's
it, isn't it?
>But the ones that are not ordinary characters are not immediately out of
>consideration. You need to triage these further and make a careful
>deliberation whether they qualify (or not) as compatibility characters.
Can you please point me to the new definition of compatibility
in this regard?
>> The vendors who invented this icon set should continue to use
>> the PUA to exchange them. They are icons/signage and are
>> being exchanged and interpreted by humans as icons/signage.
>> Any machine interpretation of them should emulate what
>> people are doing. It's OK for there to be some overlap between
>> icons/signage and plain-text characters, after all, many of
>> those icons are pictures of those characters.
>>
>This sounds like you are confusing the emoticon and the emoji discussion.
An alternative would be that the emoticon and emoji discussion
may be confusing.
An emoticon is plain text when it is a plain-text string, usually
ASCII, interpreted by the reader. Some emoticon plain-text
characters exist. An emoticon is rich text (*.ICO, *.GIF) when an
application replaces a text string with an icon or bypasses text
strings altogether. Emoji are rich-text only; I think the Japanese
use a different word for ASCII/text strings used as emoticons in
the plain-text sense.
Emoji and emoticon-as-rich-text are identical concepts. Do a web
search on keywords emoji emoticon and see how many pages equate
the two as opposed to how many distinguish them.
>The fact that the request to provide a solution using non-PUA character
>codes is so strongly supported by leading search engine manufacturer(s)
>should give you pause here.
Of course it does. Search engines deal with rich-text.
(Both leading search engine manufacturers are strongly
supporting this? Or just the one? How strongly?)
How much support do the phone vendors have for this?
Wouldn't they be chatting it up on their web sites?
How about the user community; has anyone ever taken a poll
to see whether they think they're picking icons from an
icon-picker and inserting them into their rich text,
or whether they think they're exchanging plain text?
Do they understand or care about the difference?
How about the designers of these icons and the programmers
who chose, sensibly, to exchange these icons referenced as
single user defined characters -- what is their conception?
How does the government of Japan regard these items?
Does the government of Japan push for plain-text encoding?
Did the government of Japan standardize these in JIS?
If the phone vendors want to work this out, they'll do
so. Apparently they already have. If the search engines
want to process/interpret emoji, they'll have to work it
out. Plenty of alternatives.
(To anyone who might have made it through, apologies
for length.)
Best regards,
James Kass