utf-8 != latin-1

From: Steven R. Loomis (srl@jtcsv.com)
Date: Fri Oct 13 2000 - 21:51:26 EDT

Next message: Doug Ewell: "Re: utf-8 != latin-1"
Previous message: John Jenkins: "Re: "Giga Character Set": Anything besides noise"
Next in thread: Doug Ewell: "Re: utf-8 != latin-1"
Maybe reply: Doug Ewell: "Re: utf-8 != latin-1"
Maybe reply: George Zeigler: "Re: utf-8 != latin-1"
Maybe reply: Steven R. Loomis: "Re: utf-8 != latin-1"
Maybe reply: Michael \(michka\) Kaplan: "Re: utf-8 != latin-1"
Maybe reply: Mark Davis: "Re: utf-8 != latin-1"

Here's a gotcha story...

Someone was working on documentation files in XML. The PDF generator
suddenly started choking, complaining about an "Illegal character
U+DC73" somewhere in the late stages of PDF generation. Well, a lone
low surrogate certainly didn't belong there. Software bug? Memory
corruption?

I converted the 1.1 MB intermediate file into literal \uXXXX notation and
searched for DC73. Sure enough, there was lower\uE54E\uDC73e (U+E54E, a
private-use character, and U+DC73) in place of what was "lower-case" in
the source text. Definitely memory corruption... But wait.
On a hunch, I deleted the hyphen and retyped it, and the error went away.
I was told that the text "lower-case" had been copied from another document.
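
For anyone who wants to repeat the \uXXXX trick, here is a minimal sketch in
Python. The function name to_u_escapes is made up for illustration, and this
is not the tool used above; it simply assumes the text has already been
decoded into a string and writes every non-ASCII code point as \uXXXX
(characters outside the BMP as a surrogate pair).

```python
def to_u_escapes(text: str) -> str:
    """Rewrite every non-ASCII code point as a literal \\uXXXX escape.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # BMP code points, including lone surrogates, become one escape.
            out.append(f"\\u{cp:04X}")
        else:
            # Code points above U+FFFF are written as a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


# The damaged text from the story, as it appeared after decoding.
damaged = "lower\ue54e\udc73e"
escaped = to_u_escapes(damaged)
print(escaped)            # the escapes are now searchable as plain ASCII
print("DC73" in escaped)
```

Once the file is in this form, an ordinary text search for DC73 finds the
spot.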

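Further inspection showed that the offending hyphen was actually \xAD,
"soft hyphen" (U+00AD in Latin-1). Since the XML document had no encoding
declaration, it defaults to ..... UTF-8! In UTF-8, AD is a continuation
byte and cannot start a character, so a strict decoder would have rejected
it. Instead, the sequence AD 63 61 73 was interpreted as U+E54E U+DC73..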
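
To make the byte-level problem concrete, here is a small Python sketch. It
assumes the pasted bytes were "lower", AD, "case", as described above, and it
uses Python's own strict decoder, which is not the decoder from the PDF
toolchain and does not reproduce the U+E54E U+DC73 result.

```python
# The bytes as they sat in the file: Latin-1 soft hyphen between the words.
raw = b"lower\xadcase"

# Read as Latin-1, byte AD is simply U+00AD SOFT HYPHEN.
as_latin1 = raw.decode("latin-1")
print(hex(ord(as_latin1[5])))        # code point of the pasted character

# In UTF-8, bytes 80..BF are continuation bytes and cannot begin a character.
is_continuation = 0x80 <= raw[5] <= 0xBF
print(is_continuation)

# A strict UTF-8 decoder therefore refuses the input and reports the offset.
try:
    raw.decode("utf-8")
    strict_error_offset = None
except UnicodeDecodeError as exc:
    strict_error_offset = exc.start
print(strict_error_offset)

# What the file should have contained: U+00AD encoded as the two bytes C2 AD.
correct_utf8 = as_latin1.encode("utf-8")
print(correct_utf8.hex(" "))
```

The last line shows the fix at the byte level: a soft hyphen in a UTF-8
document is C2 AD, never a bare AD.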

So the moral: BE CAREFUL when you are pasting text into UTF-8 documents.
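
One way to catch this before the PDF stage is to check the file with a
strict decoder. The following is a rough sketch, not a polished tool: the
function name find_invalid_utf8 and the sample document are invented for
illustration, and it assumes the whole file fits in memory.

```python
def find_invalid_utf8(data: bytes):
    """List (line_number, byte_offset, byte_value) for each byte at which
    Python's strict UTF-8 decoder gives up. Hypothetical helper."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # the rest of the data is clean
        except UnicodeDecodeError as exc:
            bad = pos + exc.start      # offset in the whole file
            line = data.count(b"\n", 0, bad) + 1
            problems.append((line, bad, data[bad]))
            pos = bad + 1              # skip the bad byte and keep scanning
    return problems


# A made-up two-line document with a Latin-1 soft hyphen pasted into line 2.
sample = b"<p>ok</p>\n<p>lower\xadcase</p>\n"
for line, offset, value in find_invalid_utf8(sample):
    print(f"line {line}, offset {offset}: byte {value:02X} is not valid UTF-8")
```

Running something like this over a document after pasting in foreign text
points straight at the stray byte, instead of leaving it to surface much
later as a mysterious surrogate.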

-steven


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT