From: Deepak Chand Rathore (deepakr@aztec.soft.net)
Date: Wed Jan 14 2004 - 01:21:25 EST
Hi all,
It is great to hear so many views on detecting encodings.
I would also like to share something related to detecting UTF-8 encoding.
As most of you probably know, we can check whether a stream of bytes is
UTF-8 encoded by testing that every byte sequence in it matches one of the
following patterns.
If any sequence does not match, we simply consider the stream not to be UTF-8.
Unicode range              UTF-8 encoded bytes
U-00000000 - U-0000007F:   0xxxxxxx
U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
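The check described above could be sketched roughly as follows in Python. This is only an illustration, not a definitive detector: the function name is_utf8 is made up for this example, and the sketch assumes the stricter definition in RFC 3629, which stops at U+10FFFF, so it rejects the 5- and 6-byte patterns as well as overlong forms and surrogates. Note that a pure ASCII stream also passes, so a True result only means the bytes are consistent with UTF-8.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if every byte of data is part of a well-formed UTF-8 sequence.

    Hypothetical helper for illustration; follows the RFC 3629 limits (assumption).
    """
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # 0xxxxxxx
            length, cp, lowest = 1, lead, 0x0
        elif (lead & 0xE0) == 0xC0:          # 110xxxxx
            length, cp, lowest = 2, lead & 0x1F, 0x80
        elif (lead & 0xF0) == 0xE0:          # 1110xxxx
            length, cp, lowest = 3, lead & 0x0F, 0x800
        elif (lead & 0xF8) == 0xF0:          # 11110xxx
            length, cp, lowest = 4, lead & 0x07, 0x10000
        else:
            return False                     # stray 10xxxxxx byte, or a 5/6-byte lead
        if i + length > n:
            return False                     # sequence is cut off at end of stream
        for b in data[i + 1:i + length]:
            if (b & 0xC0) != 0x80:           # every trailing byte must be 10xxxxxx
                return False
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, values above U+10FFFF, and surrogates.
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False
        i += length
    return True
```

For example, is_utf8(b"caf\xc3\xa9") holds, while the Latin-1 bytes b"caf\xe9" are rejected because 0xE9 announces a 3-byte sequence that never arrives.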
Similarly, using the above principle, we can write our own functions that
convert wide characters to UTF-8 and vice versa.
As far as I can tell, this will work. (Am I right?)
This approach should help because we do not have to rely on the library (for
example, some UTF-8 functions require the locale to be set to an xxx.UTF-8
locale, which creates a dependency on that locale).
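A locale-independent conversion along these lines might look like the sketch below. It is written in Python for brevity, with a Python str standing in for a wide-character buffer (an assumption of this example); the names encode_code_point, utf8_encode and utf8_decode are hypothetical, and the sketch limits itself to the 1- to 4-byte forms (up to U+10FFFF).

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand using the bit patterns from the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_encode(text: str) -> bytes:
    """Encode a whole string, one code point at a time."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)


def utf8_decode(data: bytes) -> str:
    """Decode UTF-8 bytes by hand; raises ValueError on malformed input."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp, lowest = 1, lead, 0x0
        elif (lead & 0xE0) == 0xC0:
            length, cp, lowest = 2, lead & 0x1F, 0x80
        elif (lead & 0xF0) == 0xE0:
            length, cp, lowest = 3, lead & 0x0F, 0x800
        elif (lead & 0xF8) == 0xF0:
            length, cp, lowest = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)      # append 6 payload bits per byte
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong or out-of-range sequence at offset {i}")
        out.append(chr(cp))
        i += length
    return "".join(out)
```

Nothing here consults the locale: both directions are plain bit shifting and masking, which is the point of doing the conversion by hand.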
But there is one concern. In some cases the UTF-8 byte stream starts with a
BOM (for example, when we read bytes from a text file that was saved with
Notepad using the UTF-8 option on Windows 2000, the actual text starts only
after the first few bytes, I suppose the first 3 bytes).
So how do we detect whether the byte stream starts with a BOM?
In other words, do the first few bytes represent a BOM or the actual text?
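One possible way to approach this, sketched in Python: the BOM is the code point U+FEFF, which the table above encodes as the three bytes EF BB BF, so the start of the stream can be compared against exactly those bytes. The helper name strip_utf8_bom is hypothetical, and treating a leading U+FEFF as a signature rather than as text is the usual convention, assumed here rather than guaranteed.

```python
import codecs
from typing import Tuple

# U+FEFF encoded as UTF-8; the standard library exposes the same constant.
UTF8_BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_bom); hypothetical helper for illustration.

    Assumes a leading EF BB BF is a signature and not part of the text.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):], True
    return data, False
```

For instance, strip_utf8_bom(b"\xef\xbb\xbfhello") gives (b"hello", True), while a stream without the signature comes back unchanged with False. When decoding with Python's own codecs, the "utf-8-sig" codec skips a leading BOM in the same way.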
With regards,
(DC)
Deepak Chand Rathore