From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 12 2004 - 07:14:09 EST
On 12/01/2004 03:09, Marco Cimarosti wrote:
> ...
>
>It is extremely unlikely that a text file encoded in any single- or
>multi-byte encoding (including UTF-8) would contain a zero byte, so the
>presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
>UTF-32.
>
Is it not dangerous to assume that U+0000 is not used? It is a valid
character and is commonly used, e.g. as a string terminator. Perhaps it
should not appear in truly plain text, but it is likely to occur in
files which are basically text but include certain kinds of markup.
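
To illustrate the concern, here is a minimal Python sketch. The function name and the sample data are hypothetical, and the check is only a simplified reading of the quoted suggestion (any zero byte hints at UTF-16 or UTF-32), not anyone's actual detector. It shows that a UTF-8 file using U+0000 as a record terminator trips the check even though it is valid UTF-8.

```python
def looks_like_utf16_or_utf32(data: bytes) -> bool:
    """Simplified reading of the quoted heuristic: any zero byte hints at UTF-16/UTF-32."""
    return b"\x00" in data


# Hypothetical sample: two text records, each terminated by U+0000, stored as UTF-8.
sample = "first record\u0000second record\u0000".encode("utf-8")

# The heuristic fires, although the data decodes cleanly as UTF-8.
print(looks_like_utf16_or_utf32(sample))
print(sample.decode("utf-8").split("\u0000"))
```

A detector that wants to tolerate such files would need more evidence than the mere presence of a zero byte, for example the position and regularity of the zero bytes.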
>... This is due to the fact that, in any language, shared characters in
>the Latin-1 range (controls, space, digits, punctuation, etc.) should be
>more frequent than occasional code points of form <U+??00>. ...
>
This one also looks dangerous. Some scripts include their own digits and
punctuation; not all scripts use spaces; and controls are not
necessarily present if U+2028 LINE SEPARATOR is used for new lines.
Moreover, there may be some characters of the form U+??00 which are used
rather commonly in a particular script and so occur frequently in some
text files.
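
A small Python sketch of how this can go wrong. The byte-order guesser below is only one plausible reading of the quoted heuristic (the full proposal is elided above), and its name and the sample strings are hypothetical. It compares zero bytes at even and odd offsets: Latin-1 range characters put the zero in the high byte, while U+??00 characters put it in the low byte. The contrived CJK sample uses U+4E00 (the ideograph for "one") and U+3000 IDEOGRAPHIC SPACE, both of the form U+??00.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Hypothetical guesser: assume most zero bytes are high bytes of Latin-1 characters."""
    even_zeros = data[0::2].count(0)  # zero bytes at offsets 0, 2, 4, ...
    odd_zeros = data[1::2].count(0)   # zero bytes at offsets 1, 3, 5, ...
    if even_zeros > odd_zeros:
        return "big-endian"
    if odd_zeros > even_zeros:
        return "little-endian"
    return "undecided"


latin = "Hello, world 123"          # all characters in the Latin-1 range
cjk = "\u4e00\u3000\u4e00\u3000"    # only characters of the form U+??00

# Both samples are encoded big-endian; only the first is guessed correctly.
print(guess_utf16_byte_order(latin.encode("utf-16-be")))
print(guess_utf16_byte_order(cjk.encode("utf-16-be")))
```

Real text would mix both kinds of character, so the guess depends on which kind dominates in the particular file.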

--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/