Dealing with Unencodeable Characters
irgendeinbenutzername at gmail.com
Thu Oct 6 09:54:07 CDT 2016
One of Unicode's goals is round-trip compatibility with legacy
character sets, which is why we have gathered many compatibility characters
over time that would normally have been out of scope for the standard. It's
why Zapf Dingbats and Arabic presentation forms are in Unicode, for example.
However, there are some characters that form part of these sets yet are
deliberately not encoded in Unicode because they were considered unsuitable
for inclusion. The two that come to mind are the Windows logo from
Wingdings and the Shibuya 109 emoji from the original Japanese vendor sets.
Given that these two have no Unicode equivalents, their source character
sets are not fully compatible with Unicode, i.e. there is going to be data
loss and confusion when trying to convert into or from Unicode.
If, hypothetically, I wanted to convert an old Shift JIS document containing
emoji to Unicode, how should I ideally handle Shibuya 109?
I remember the early emoji proposal documents originally contained "emoji
compatibility symbols", which were used to map to source characters that
weren't meant to be included with a specified semantic. I believe STATUE OF
LIBERTY was one of those characters and was simply called EMOJI
COMPATIBILITY SYMBOL-XX so that the specific landmark wouldn't strictly be
part of Unicode. Obviously this approach ultimately wasn't implemented,
but I wonder whether there could be designated compatibility characters for
this kind of issue. Private use characters are an obvious choice but of
course their meaning is user-defined, so while all other emoji in my Shift
JIS document would receive an unambiguous Unicode mapping, Shibuya 109
would remain vague and very limited in interchange options.
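
To make the private-use fallback concrete, here is a minimal Python sketch of
the kind of converter I have in mind. Everything in it is an assumption made
for illustration: the vendor symbol names are hypothetical labels rather than
real Shift JIS byte values, U+E000 is an arbitrarily chosen private-use code
point, and the convert function is not part of any existing library. It only
shows how one unmapped symbol ends up on a different path from the rest.

```python
# Hypothetical vendor symbol labels mapped to their standard Unicode characters.
# These labels stand in for vendor-specific Shift JIS codes (not real byte values).
VENDOR_TO_UNICODE = {
    "VENDOR_SUN": "\u2600",              # U+2600 BLACK SUN WITH RAYS
    "VENDOR_TOKYO_TOWER": "\U0001F5FC",  # U+1F5FC TOKYO TOWER
}

# Symbols with no Unicode equivalent, mapped to private-use code points.
# The choice of U+E000 is arbitrary and only meaningful by private agreement.
VENDOR_TO_PRIVATE_USE = {
    "VENDOR_SHIBUYA_109": "\uE000",
}

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def convert(symbols):
    """Convert a list of vendor symbol labels to a Unicode string.

    Returns the converted text plus the labels that had no standard mapping,
    so the caller can see exactly where interchange is not guaranteed.
    """
    out = []
    non_standard = []
    for name in symbols:
        if name in VENDOR_TO_UNICODE:
            out.append(VENDOR_TO_UNICODE[name])
        elif name in VENDOR_TO_PRIVATE_USE:
            out.append(VENDOR_TO_PRIVATE_USE[name])
            non_standard.append(name)
        else:
            out.append(REPLACEMENT)
            non_standard.append(name)
    return "".join(out), non_standard


text, non_standard = convert(["VENDOR_SUN", "VENDOR_SHIBUYA_109"])
```

The second return value lists the symbols that only survive under a private
agreement, which is exactly the interchange limitation described above: a
receiver without the same private-use table cannot tell what U+E000 means.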