From: Michael D'Errico (mike-list@pobox.com)
Date: Mon Jan 05 2009 - 16:03:18 CST
> one possible scenario is that a vendor migrates all of their text
> messaging to being based on HTML rather than plain text, and the
> emoji get represented in messages as proprietary URL references.
Using URLs would suggest that the emoji are not plain text. Several
key Unicode Consortium members are vehemently arguing that they are,
in fact, plain text.
Although I think they are not plain text, I do believe that they can
be encoded in Unicode such that markup is not required. The thing
about encoding the emoji that I disagree with is that each one needs
its own code point. Think of the emoji as words and phrases; what is
needed is an alphabet to "write" them. An alphabet is a) small,
b) closed, and c) able to represent an unlimited number of "words".
Think of how much mileage we get out of 26 Latin letters.
The suggestion I made on the UnicoRe list was to provide the mobile
phone companies with 11 code points: an emoji_base character to
represent the start of an emoji, plus 10 special emoji digits to be
used to indicate an index into a list from the Emoji Specification
(to be published by the phone companies). You could eliminate the
base character if distinct emoji are separated by spaces.
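A minimal Python sketch of the 11-code-point scheme, for illustration
only. The code point values (EMOJI_BASE, EMOJI_DIGIT_ZERO) and the
function names are assumptions made up for this example; the suggestion
above does not assign any values.

```python
# Hypothetical code points: the suggestion does not assign any values.
EMOJI_BASE = 0xC0000        # assumed "start of emoji" marker
EMOJI_DIGIT_ZERO = 0xC0030  # assumed emoji digit 0; digits 1-9 follow it


def encode_emoji(index: int) -> str:
    """Spell an emoji as emoji_base followed by its index in emoji digits."""
    if index < 0:
        raise ValueError("index must be non-negative")
    digits = "".join(chr(EMOJI_DIGIT_ZERO + int(d)) for d in str(index))
    return chr(EMOJI_BASE) + digits


def decode_emoji(text: str) -> int:
    """Recover the list index from emoji_base plus one or more emoji digits."""
    if len(text) < 2 or ord(text[0]) != EMOJI_BASE:
        raise ValueError("expected emoji_base followed by emoji digits")
    index = 0
    for ch in text[1:]:
        digit = ord(ch) - EMOJI_DIGIT_ZERO
        if not 0 <= digit <= 9:
            raise ValueError("not an emoji digit")
        index = index * 10 + digit
    return index
```

The receiving side would then look the decoded index up in the Emoji
Specification list; that lookup is outside this sketch.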
After much thought, I think that this alphabet should be greatly
expanded to include not just digits, but letters and punctuation.
This would allow an emoji of a cow to be spelled emoji_C emoji_O
emoji_W. If the UTC developed, say, a 256-element Unicode subset
to be used for this Direct Unicode Markup "script" (Mike's DUM
Proposal), then you could replicate it 255 times in Plane D for
private use, and 255 times in Plane C for things such as emoji.
One side effect of creating a common alphabet for this is that an
application wouldn't need to know what any of them meant, yet the
example emoji above could be displayed as "[cow]" (much better
than "e-2FC"). This would be true even for the private use area.
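The fallback display could be sketched in Python roughly as follows.
It assumes the layout described in this message: each DUM character
sits in plane C or D, the second-lowest byte of its code point is the
prefix, and the lowest byte is the ordinary ASCII value. The function
names are hypothetical.

```python
from itertools import groupby


def dum_key(ch):
    """Return (plane, prefix) for a DUM character, or None for other text."""
    cp = ord(ch)
    if cp >> 16 in (0xC, 0xD):
        return (cp >> 16, (cp >> 8) & 0xFF)
    return None


def fallback_display(text):
    """Show each run of same-prefix DUM characters as bracketed plain text."""
    pieces = []
    for key, group in groupby(text, key=dum_key):
        chars = "".join(group)
        if key is None:
            pieces.append(chars)  # ordinary text passes through unchanged
        else:
            # Keep only the low byte, which is assumed to be the ASCII value.
            spelled = "".join(chr(ord(c) & 0xFF) for c in chars)
            pieces.append("[" + spelled + "]")
    return "".join(pieces)
```

The application never consults a table of meanings here; it only
strips the plane and prefix from each code point.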
If I were to design the script, I'd start with ASCII, dropping the
control characters other than TAB, CR, and LF, then add the most
common diacritics (no sense wasting space with composite characters),
and defer to the experts on what other common characters should be
included. Thus
you would have the letter "A" at U+D0041, U+D0141, U+D0241, etc. for
private use, and also at U+C0041, U+C0141, etc. to be assigned to
things like the emoji.
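The arithmetic behind those code points can be sketched in Python.
This is only an illustration: the names are made up, and numbering the
copies 0 through 254 is one assumed reading of "255 times" per plane.

```python
PRIVATE_PLANE = 0xD0000   # plane D: the private use copies
ASSIGNED_PLANE = 0xC0000  # plane C: copies assigned to things like emoji
MAX_PREFIX = 254          # assumption: 255 copies per plane, numbered 0..254


def dum_char(ch, prefix, plane=PRIVATE_PLANE):
    """Map one character of the 256-element subset into a prefixed copy."""
    code = ord(ch)
    if code > 0xFF:
        raise ValueError("character is outside the 256-element subset")
    if not 0 <= prefix <= MAX_PREFIX:
        raise ValueError("prefix out of range")
    # plane base + 256 * prefix + position within the subset
    return chr(plane + (prefix << 8) + code)


def dum_spell(word, prefix, plane=PRIVATE_PLANE):
    """Spell a whole identifier in one prefixed copy of the alphabet."""
    return "".join(dum_char(ch, prefix, plane) for ch in word)
```

With these definitions, dum_char("A", 1) is U+D0141 and
dum_char("A", 1, ASSIGNED_PLANE) is U+C0141, matching the examples
above.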
Companies could design their products using the Structured Private
Use Plane (plane D), with up to 255 custom object types whose
identifiers are all spelled using the DUM alphabet (differing only in
the prefix which identifies the object type). Then if the application
enjoys enough success to get an official Unicode rubber stamp, the
only thing that needs to be done is to change the prefix from one in
plane D to one in Unicode proper. The requirements for encoding
would be proper use of the DUM alphabet, evidence of enough use, and
an organization assigned to handle publication of the identifiers.
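As a rough Python sketch, the prefix change amounts to shifting every
code point of one 256-character block to another. The function name
and the prefix numbers in the usage note are hypothetical, and the
layout (plane D at U+D0000, plane C at U+C0000, prefix in the
second-lowest byte) is assumed from the examples in this message.

```python
def promote(text, private_prefix, assigned_prefix):
    """Move identifiers from a plane D prefix to a plane C prefix."""
    src = 0xD0000 + (private_prefix << 8)   # first code point of the old block
    dst = 0xC0000 + (assigned_prefix << 8)  # first code point of the new block
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp <= src + 0xFF:
            out.append(chr(dst + (cp - src)))  # same letter, new prefix
        else:
            out.append(ch)                     # everything else is untouched
    return "".join(out)
```

For example, promote(data, 2, 7) would rewrite U+D0241 as U+C0741 and
leave characters from any other prefix alone.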
A secondary use of the DUM alphabet would be to simply write text
using it (using a particular prefix). A map application might use
different alphabets to write the parts of an address: number, street,
city, state/province, country. ISO would love this, since it would
be similar to their X.520 naming system. The best thing about it in
my view is that it allows you to do things that currently you can
only do with XML. This scheme would bring plain text closer to the
power of XML, yet with much less complexity.
Getting back to the emoji -- with the DUM alphabet and a specification
from the phone companies, it is trivial to allow users to exchange
their own custom emoji. As an example, they might decide to use
base-64 encoding surrounded by curly braces. The base-64 data would
somehow encode the custom graphic image, yet would be transmitted as
plain text. Of course, all characters would come from the emoji
alphabet.
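One way this could look in Python, purely as a sketch. The block
location (CUSTOM_BLOCK) and the function names are assumptions, and
how the bytes describe the graphic is left open, as it is above.

```python
import base64

# Hypothetical: a plane C copy of the alphabet set aside for custom emoji.
CUSTOM_BLOCK = 0xC0100


def encode_custom_emoji(image_bytes: bytes) -> str:
    """Wrap base-64 data in curly braces, spelled in the emoji alphabet."""
    payload = "{" + base64.b64encode(image_bytes).decode("ascii") + "}"
    return "".join(chr(CUSTOM_BLOCK + ord(c)) for c in payload)


def decode_custom_emoji(text: str) -> bytes:
    """Reverse encode_custom_emoji, rejecting characters from other blocks."""
    ascii_chars = []
    for ch in text:
        offset = ord(ch) - CUSTOM_BLOCK
        if not 0 <= offset <= 0x7F:
            raise ValueError("character is not in the custom emoji block")
        ascii_chars.append(chr(offset))
    payload = "".join(ascii_chars)
    if len(payload) < 2 or payload[0] != "{" or payload[-1] != "}":
        raise ValueError("custom emoji must be wrapped in curly braces")
    return base64.b64decode(payload[1:-1])
```

Every character of the encoded string lies in the one assumed block,
so it travels as plain text like any other spelled emoji.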
And for people who absolutely *HATE* the emoji, or any of the other
yet-to-be-invented uses of the DUM alphabet, it would be a trivial
task to filter out those code points based on their prefix.
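Such a filter might be sketched in Python like this. The function
names are hypothetical, and the layout (plane number in the top bits,
prefix in the second-lowest byte) is assumed from the examples earlier
in this message.

```python
def strip_prefix(text, plane, prefix):
    """Drop every character belonging to one prefixed copy of the alphabet."""
    first = (plane << 16) + (prefix << 8)  # e.g. plane 0xC, prefix 1 -> U+C0100
    return "".join(ch for ch in text if not first <= ord(ch) <= first + 0xFF)


def strip_all_dum(text):
    """Drop everything in planes C and D, whatever the prefix."""
    return "".join(ch for ch in text if ord(ch) >> 16 not in (0xC, 0xD))
```

Neither function needs to know what the filtered identifiers mean.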
That's all I have time to write for now. I can clarify anything that
wasn't clear enough.
Note: If you think this is the dumbest thing you've ever heard and
want to comment on it, please provide a reason why you think it's
dumb instead of just stating so.
Mike