From: Jeroen Ruigrok van der Werven (asmodai@in-nomine.org)
Date: Wed Jan 09 2008 - 11:38:08 CST
-On [20080109 18:16], Damon Anderson (damon@corigo.com) wrote:
>I may be an old dog trying to learn new tricks, but I simply can't
>understand how Unicode is implemented in GUI editors. From Word to
>OpenOffice to DreamWeaver, when I type Unicode characters and then go to
>look at the source I see nothing but a gobbledygook hodgepodge of odd ASCII
>characters or character pairs/groups.
ASCII uses just 128 positions of a byte (7 bits, to be precise), whereas with
Unicode you generally need multiple bytes (depending on the encoding format
chosen) to represent characters.
So yes, when you view Unicode data in an editor that does not understand it,
the editor will try to interpret it as ASCII or ISO-8859-1(5), and it looks
nonsensical. The same happens when trying to view, say, KOI8-R text: you get
back lots of accented characters.
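To make that concrete, here is a minimal Python 3 sketch (my example, not from the original mail) of exactly this misinterpretation: UTF-8 bytes read back under the wrong encoding turn into accented gibberish.

```python
# UTF-8 encodes "é" as the two bytes C3 A9.
data = "héllo".encode("utf-8")    # b'h\xc3\xa9llo'

# An editor that wrongly assumes Latin-1 shows one character per byte:
wrong = data.decode("latin-1")
print(wrong)                      # hÃ©llo -- the "gobbledygook" described above
```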
>I can, of course type into source directly a properly escaped HTML decimal
>unicode character and it will display in the UI correctly, but when I type in
>the UI and view the source I have no way to verify that the correct Unicode
>is being used, as no Unicode is apparent, no escape characters, no hex or
>dec. I am completely baffled.
Which is logical, actually. The editor understands Unicode and will show the
correct character associated with each codepoint. As such there are no escape
characters to see, since Unicode itself does not use any.
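A short Python 3 sketch (my illustration, with an arbitrary Vietnamese example character) of the point above: what lands on disk is encoded bytes, not escape sequences, so there is no `&#NNNN;` anywhere to inspect.

```python
text = "ỡ"                       # U+1EE1, a Vietnamese character
utf8_bytes = text.encode("utf-8")

# The raw bytes of the file -- no ampersands, no escapes:
print(utf8_bytes.hex(" "))       # e1 bb a1
```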
>What is happening and why/how is the Unicode being recoded or displayed in
>non-unicode format in the source? Is there a proper source editor that will
>display the actual Unicode encodings? Is the problem in my Unikey
>Vietnamese keyboard driver? Unikey seems to send HTML unicode, but that's
>not what Dreamweaver displays in the source.
You are really looking at it from the narrow view of ASCII, I am afraid. Just
as ASCII is a character encoding that maps A to hex 0x41 and Z to hex 0x5A,
Unicode uses a scheme with multiple byte values to encode its characters.
Using UTF-16, for example, an A is encoded as 0x0041. If you viewed such a
file with a binary (hex) editor you would see the byte sequence 00 41. But of
course, if the editor understands Unicode properly, it will simply interpret
such byte sequences as Unicode codepoints and show the relevant characters.
Nothing magical, and no different from how ASCII is shown, to be honest.
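The 0x41 versus 00 41 comparison above can be checked directly in Python 3 (a sketch of mine, using big-endian UTF-16 without a BOM so the bytes match the 00 41 given above):

```python
# The same character "A" in two encodings:
print("A".encode("ascii").hex())      # 41     -- one byte in ASCII
print("A".encode("utf-16-be").hex())  # 0041   -- the 00 41 a hex editor shows
```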
>Then there's OpenOffice... I have had to actually submit a bug to OOo
>because when I use it to read directly from my database which is storing
>correctly escaped HTML unicode it converts all of my ampersand escape
>characters to &amp; so &#7905; becomes &amp;#7905;. That one just baffles me,
>as they are supposed to be supporting Unicode, but convert my Unicode and
>then don't even convert it to Unicode but use &amp; instead.
This is a bit different. HTML supports encoding Unicode codepoints as
entities using the scheme &#NNNN;. The &...; combination is the standard way
of encoding HTML entities, so OpenOffice should not have messed with the & to
turn it into &amp;. It almost sounds as if they did not support Unicode
entities in the first place. For HTML you can use either entities or simply
type in the characters directly. But some editors translate such codepoints
to entities behind the scenes. Personally I dislike that.
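For the record, the entity round-trip described above can be sketched with Python 3's standard `html` module (my example; I use the same U+1EE1 codepoint as in the bug report):

```python
import html

# A decimal HTML entity and the literal character are interchangeable in HTML:
print(chr(7905))                  # the character for codepoint 7905 (U+1EE1)
print(html.unescape("&#7905;"))   # entity decoded back to the same character

# Re-escaping the ampersand itself is what mangles an already-escaped entity:
print(html.escape("&#7905;"))     # &amp;#7905; -- the double-escaping complained about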
Personally I am happy enough using (g)vim on Unix and Windows for my Unicode
needs, but you could also try out BabelPad by our very own Andrew West for an
editor with good Unicode support. Alternatively, there are plenty of other
editors that should be fine; Notepad2 also supports Unicode editing and has
syntax highlighting for various file formats (if you're on Windows).
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
When you have eliminated the impossible, whatever remains, however
improbable, must be the truth...
This archive was generated by hypermail 2.1.5 : Wed Jan 09 2008 - 11:40:06 CST