Re: ASCII control codes in sequences of multibyte character sets

From: Steffen <sdaoden_at_gmail.com>
Date: Thu, 05 Sep 2013 17:10:53 +0200

"Dreiheller, Albrecht" <albrecht.dreiheller_at_siemens.com> wrote:
 |In this context, it might be useful to know that some Chinese multi-byte
 |encodings contain code points whose trail byte is 0x5C, the same byte value
 |as the backslash "\".
 |This can cause problems in C-like string literals, where \ acts as a meta-character.
 |
 |Examples:
 |
 |in BIG5 (Win CP 950) Traditional Chinese
 |U+03B1 maps to A3 5C
 |U+4E48 maps to A4 5C
 |U+4FDF maps to AB 5C
 |
 |in GBK (Win CP 936) Simplified Chinese
 |U+2010 maps to A9 5C
 |U+2558 maps to A8 5C
 |U+4E57 maps to 81 5C

Thank you – well of course it is, for every very hungry caterpillar.

--steffen

attached mail follows:


From: Steffen Daode Nurpmeso, Saturday, August 31, 2013 4:37 PM

> Likewise, the byte values used to encode <period>, <slash>,
> <newline> and <carriage-return> shall not occur as part of any
> other character in any locale.
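
Whether a given encoding meets the quoted requirement can be probed with a
brute-force scan. The following is only a sketch: it covers the Basic
Multilingual Plane, relies on Python's "big5" and "gbk" codecs as stand-ins
for the real locale encodings, and the names PROTECTED and offenders are
invented here.

```python
# Sketch: scan the Basic Multilingual Plane and collect code points whose
# multi-byte encoding contains one of the protected byte values.
PROTECTED = frozenset({0x2E, 0x2F, 0x0A, 0x0D})  # '.', '/', LF, CR


def offenders(encoding, protected=PROTECTED):
    """Return code points whose multi-byte form contains a protected byte."""
    found = []
    for cp in range(0x80, 0x10000):
        try:
            data = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding (incl. surrogates)
        if len(data) > 1 and protected.intersection(data):
            found.append(cp)
    return found


for enc in ("big5", "gbk"):
    print(enc, "protected-byte offenders:", len(offenders(enc)))
```

Passing a different byte set, for example offenders("gbk", {0x5C}), lists the
characters whose encoding contains the backslash byte instead.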

In this context, it might be useful to know that some Chinese multi-byte
encodings contain code points whose trail byte is 0x5C, the same byte value
as the backslash "\".
This can cause problems in C-like string literals, where \ acts as a meta-character.

Examples:

in BIG5 (Win CP 950) Traditional Chinese
U+03B1 maps to A3 5C
U+4E48 maps to A4 5C
U+4FDF maps to AB 5C

in GBK (Win CP 936) Simplified Chinese
U+2010 maps to A9 5C
U+2558 maps to A8 5C
U+4E57 maps to 81 5C
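
To illustrate the kind of problem meant here, the sketch below runs a
deliberately naive, byte-wise escape processor over GBK-encoded text. The
function naive_unescape and its small escape table are hypothetical and do
not reproduce any particular compiler; the point is only that the trail byte
5C followed by "n" is taken for the escape \n.

```python
# Hypothetical byte-wise escape handling that knows nothing about
# multi-byte characters: every 0x5C byte is treated as a backslash.
ESCAPES = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C}


def naive_unescape(data):
    """Replace backslash escapes byte by byte, ignoring the encoding."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            # Unknown escapes simply keep the byte after the backslash.
            out.append(ESCAPES.get(data[i + 1], data[i + 1]))
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


source = "\u4e57n".encode("gbk")  # U+4E57 followed by "n": bytes 81 5C 6E
mangled = naive_unescape(source)  # 5C 6E is misread as the escape \n

print("source :", source.hex().upper())
print("mangled:", mangled.hex().upper())
```

Decoding the bytes to characters before looking for backslashes avoids the
misreading, since the decoded text contains no backslash at all.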


Received on Thu Sep 05 2013 - 10:13:24 CDT
