From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 22 2008 - 16:23:04 CST
> (If your e-mail system does not support UTF-8, how will
> inclusion of emoji in Unicode generate correct display
> on your system?)
The fact that I'm stuck with a corporate engineering Unix network
set to ISO 8859-1 is not relevant actually. If I wanted
to *display* the email, I would forward it to a Windows
Vista system.
> A search engine which disallows PUA string searches and fails
> to index PUA web pages does a disservice to its users, in my
> opinion.
Good luck with that. Private use is *private*, from the point
of view of global text search engines. If you want a search
engine that works with your particular private use, you
are always free to write one. ;-)
> It's been alleged that some of this material is leaking out of closed
> systems and infiltrating databases and so forth.
More than alleged. We have multiple vendor testimony that this
is becoming a major headache for them.
> Of course, one
> solution would be to find and plug those leaks.
Plug the leaks between iPhones and the internet? Umm, good
luck on that one, too.
> But, failing that,
> may we please see some concrete examples of this along with an
> explanation of why it is a problem?
The major search engines convert and process text data in
Unicode. Why? Should be pretty clear: it's the obviously best
solution for handling petabytes of text data, originating from
many diverse sources, including multiple character sets.
If the conversion results in private use codes, then you have
to guess on semantics based on source and context, but often
the source and context get separated from the text.
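To make that concrete, here is a minimal sketch (mine, not any vendor's actual pipeline) of what a text processor can and cannot learn once private use codes are in the data. The function names and the sample string are hypothetical, and U+E63E is just an arbitrary private use code point picked for illustration. The three ranges are the private use ranges defined by the Unicode Standard.

```python
# Private use ranges defined by the Unicode Standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch):
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_private_use(text):
    """List (index, 'U+XXXX') for every private use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


# Hypothetical message that arrived with one private use code in it.
sample = "See you at 7 \ue63e bring cake"
hits = find_private_use(sample)
print(hits)
```

The point of the sketch is the limit of what it reports: a position and a code point number, nothing about meaning. Whether that code was a sun, a heart, or a corporate logo depends on which private agreement the sender was using, and that information is not in the text.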
> The nature of these proposed characters strikes me as ephemeral.
> Text messages sent between cell phone users appear to be transient.
> (You send your friend a text message including emotional overtones
> indicated with emoticons. She reads your message, laughs or cries,
> then deletes it.)
You conveniently ignored the example that I was commenting on,
provided by Clark Cox. Those 4 emoji, even though used as
an illustration of the process and not for their actual semantic
values, were not transient text messages from one phone to
another, soon forgotten. They were in a phone-to-internet
email message, which was posted on a public email forum,
and which is then automatically *archived* into that email
forum message archive. And the archive is itself then
indexed by various automatic processes which further munge and
store its content in various *other* databases and archives.
This all happens automatically, without intervention by
some human monitor who then decides that the four emoji
themselves are too ephemeral and unimportant to matter for
the rest of the text processing.
And for that matter, where the heck do you think the phone
carriers actually *store* all those bezillions of text messages?
It's not as if those messages are insta-teleported from your
phone keyboard right to my phone screen. If all the messages
stay inside one network, then in principle they can store
everything in a known extension to SJIS, and everything works.
Except they don't, and it doesn't.
> The fact that this set is still evolving indicates that encoding may
> be premature.
The proposal still has problems, but they are not really the
result of *this* set being some day-to-day, evolving,
uncertain set. Examination of the table and some familiarity
with the process of researching it over the last couple of
years should make it clear that *this* set is actually the
cross-mapping of 3 well-defined, already deployed symbol sets.
Those aren't "still evolving" at all.
--Ken