Re: Emoji and Search Engines

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 05 2009 - 18:25:07 CST

    > Asmus Freytag wrote:
    >
    > > The fact that the request to provide a solution using non-PUA
    > > character codes is so strongly supported by leading search engine
    > > manufacturer(s) should give you pause here.

    John Hudson responded:

    > It did give me pause, and that was the point that I realised that it is
    > not in the interests of those companies to use any character codes for
    > emoji, PUA or otherwise.

    There is an ambiguity in John's response here. It is unclear whether
    by "those companies" he means DoCoMo, KDDI, and SoftBank, i.e. the
    Japanese wireless carriers who created these emoji sets in the first
    place and used character codes to transmit emoji, or whether he is
    responding to Asmus' contention in particular and means Google,
    Yahoo!, and Microsoft, i.e. the leading search engine companies who
    didn't *create* the emoji sets but who have to deal with converting
    this wireless data so that it functions with their search engine
    technology.

    The UTC has heard quite clearly from the search engine companies
    (and others) that having a standard character encoding for dealing
    with these existing characters in data is better than a PUA-based
    encoding -- and that that clearly *is* in their interests. Any
    such contention is quite different from the assessment as to
    whether the Japanese wireless carriers should have treated any
    of this stuff as SJIS extension gaiji characters in the first
    place.

    > Quite simply, Unicode character codes do not
    > appear to be up to the task of encoding an open ended set of images that
    > users might want to transmit to and from mobile devices,

    I certainly wouldn't quarrel with that statement, and I am
    pretty sure that would also be a consensus position among the
    UTC.

    > which is the
    > problem that those companies should be trying to solve,

    Again with the "those companies", however. The search engines
    aren't trying to solve the problem of transmitting an open-ended
    set of images to and from mobile devices. Nor do I think that
    really is their purview. They will, of course, want to be
    able to search and index such images where they occur archived
    in data on the internet, but that is a different problem.

    > rather than
    > hijacking a plain text encoding standard with an insufficient subset of
    > such images. I understand that it was easy and convenient to use text
    > character codes for this limited set of images to date, but it was not a
    > good idea: I'm tempted to say it was a lazy, stop-gap measure, lacking
    > in any sort of vision about the social use of technology that these
    > companies are supposed to understand.

    "These" companies? The lazy, stop-gap measure, if such it was,
    was perpetrated by the Japanese wireless companies, which sought
    an easy way to extend their character sets to make
    various culturally appropriate symbols, pictographs, and emoticons
    available quickly on phones. And they did it by a methodology
    that has a long history in Japan: gaiji. As I've pointed out
    before, this is just the latest example of this process in
    Japan, cycling around from when the Japanese OS companies did
    this kind of thing in extending JIS for Japanese computers back
    in the 1980s.

    The "characters" here are now de facto data and need some
    character encoding solution other than PUA, just as the need
    to interoperate with the earlier instantiations of gaiji made
    it necessary to give those extensions a standardized character
    encoding as well.

    > It is not simply that I think
    > these images do not belong in Unicode, but that adding them to Unicode
    > does not solve the real problem.

    And I think John has misidentified the real problem. Or rather,
    what he has identified as the "real" communication problem below
    ("the long future use of inline images in communications") is not
    the real problem that the UTC is trying to address for the search
    engine (and database) companies -- namely interoperating with the
    existing, de facto set of SJIS extensions used *as characters*
    by the wireless operators in Japan.

    > Since a proper solution, capable of
    > addressing the long future use of inline images in communications, would
    > by its nature also solve the present problem of non-standard handling of
    > current emoji sets, why spend so much time and energy forcing Unicode to
    > accept something that will be obsolete almost as soon as it completes
    > the ballot process?

    The "proper solution" envisioned here would *obsolete* the need to
    resort to character-based, non-extensible hacks for transmitting
    pictographic symbols in the way the wireless carriers in Japan now
    are doing -- but it would not solve the *present problem* of
    dealing with the de facto existing characters *as* characters,
    which is what we are up against here.

    By the way, one of the reasons I have spoken out strongly against
    turning the 10 flag-icon-based locale symbols in the emoji sets
    into an excuse for an open-ended scheme for encoding flags as
    characters is that I *agree* with John's general contention about
    the inappropriateness of using characters to represent entities
    that are essentially images, rather than text symbols.

    --Ken



    This archive was generated by hypermail 2.1.5 : Mon Jan 05 2009 - 18:28:20 CST