From: Addison Phillips (addison@yahoo-inc.com)
Date: Sat Feb 10 2007 - 14:17:31 CST
Mike wrote:
>>> One problem is detecting text with the MS-DOS box-drawing characters
>
> You could look for long runs of the single and/or double horizontal
> box drawing characters. If you want to be extra careful, look at
> the previous/next character to see if it's a corner or T.
>
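Mike's run check could be sketched roughly as follows. This is only an illustration: the function name, the minimum run length of 4, and the choice of CP 850 as the decoding are my assumptions, and the corner/T check he mentions is left out.

```python
import re

# Single (U+2500) and double (U+2550) horizontal box-drawing characters.
# A minimum run of 4 is an arbitrary threshold chosen for this sketch.
HORIZONTAL_RUN = re.compile("[\u2500\u2550]{4,}")

def has_box_drawing_run(data: bytes, encoding: str = "cp850") -> bool:
    """Return True if the bytes, decoded as a DOS code page, contain a
    long run of horizontal box-drawing characters."""
    text = data.decode(encoding, errors="replace")
    return HORIZONTAL_RUN.search(text) is not None
```

For instance, the top edge of a double-line box (0xC9, a run of 0xCD, then 0xBB) would trigger it, while ordinary text with one accented letter would not.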
I suspect that Doug's real problem isn't so much with the box drawing
characters per se. They're rare in any real user-entered text.
The problem with the DOS code pages is that they use different byte
values from the more modern Windows encodings (which tend to be based on
standards such as the ISO 8859 series). In some ways, they kind of
resemble Shift-JIS, with the box drawing gunk in the middle of the
"extended ASCII" range and the accented letters appearing to one side or
the other of that range.
The problem here is more likely to be with letter pairs when guessing
the encoding. For most Western European languages, the majority of the
data will be 7-bit ASCII, and a smallish run of data might have only one
or two non-ASCII characters embedded in it to assist in guessing.
For example, in CP 850, U+00C8 (capital E with grave) is represented by
the byte value 0xD4. In CP 1252, this same character is represented by
the byte 0xC8 and 0xD4 represents U+00D4 (capital O with circumflex).
Finally, in CP 850, the byte 0xC8 represents a box drawing character.
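These mappings are easy to check with Python's built-in "cp850" and "cp1252" codecs. The helper name `describe` is just something made up for this sketch:

```python
import unicodedata

def describe(byte_value: int, codec: str) -> str:
    """Decode a single byte under the given code page and name the result."""
    ch = bytes([byte_value]).decode(codec)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for byte_value in (0xD4, 0xC8):
    for codec in ("cp850", "cp1252"):
        print(f"0x{byte_value:02X} in {codec}: {describe(byte_value, codec)}")

# 0xD4 in cp850:  U+00C8 LATIN CAPITAL LETTER E WITH GRAVE
# 0xD4 in cp1252: U+00D4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX
# 0xC8 in cp850:  U+255A BOX DRAWINGS DOUBLE UP AND RIGHT
# 0xC8 in cp1252: U+00C8 LATIN CAPITAL LETTER E WITH GRAVE
```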
The question is: given that I have a byte 0xD4, is it more likely to be
an E-grave or an O-circumflex? If I guess CP 850, then any bytes 0xC8 that
appear will be box drawing characters (that is, the "guess" is quite
likely to be wrong).
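One way to turn that observation into a heuristic is to decode the bytes under each candidate code page and prefer the one that yields the fewest box drawing characters. What follows is a rough sketch, not a real detector: the function names, the candidate list, and the tie-breaking rule (first candidate wins) are all assumptions of mine.

```python
def count_box_drawing(text: str) -> int:
    """Count characters in the Unicode Box Drawing block (U+2500..U+257F)."""
    return sum(1 for ch in text if "\u2500" <= ch <= "\u257f")

def guess_encoding(data: bytes, candidates=("cp1252", "cp850")):
    """Pick the candidate whose decoding has the fewest box-drawing
    characters. Ties go to the earlier candidate; returns None if no
    candidate can decode the data."""
    best = None
    for codec in candidates:
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue  # e.g. cp1252 leaves a few byte values undefined
        score = count_box_drawing(text)
        if best is None or score < best[0]:
            best = (score, codec)
    return best[1] if best else None
```

Note that a lone 0xD4 scores the same under both code pages, so this check by itself cannot settle the E-grave versus O-circumflex question; it only rules out guesses that turn accented letters into box drawing characters.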
Addison
-- Addison Phillips Globalization Architect -- Yahoo! Inc. Internationalization is an architecture. It is not a feature.