Character counter?

From: Doug Ewell (dewell@compuserve.com)
Date: Sat Nov 11 2000 - 12:42:03 EST


11digitboy@bolt.com wrote:

> Is there a program that will count characters in
> a text file?

Since determining the number of *bytes* in a file is such a rudimentary
task, I will assume there is more to this question.

To count *characters* in a character-set-independent way, you have to
know the encoding. For ISO 8859-*, of course, this is the same as
counting bytes. For East Asian double-byte systems, you must treat
combinations of lead byte and trail byte as one character. If the file
is well-formed, you could do this by simply ignoring trail bytes in the
count.

For UTF-8, again you must treat multi-byte sequences as only one
character. As Markus Kuhn points out in his Unicode Web page, this can
be done by ignoring continuation bytes (0x80 through 0xBF) in the count,
but again this presumes that the file is well-formed. (And if it's not,
what are you supposed to do then?)
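A minimal sketch of the continuation-byte approach, assuming the input is well-formed UTF-8 (the function name is hypothetical, and no validation is attempted):

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx (0x80 through 0xBF).
    return sum(1 for byte in data if not 0x80 <= byte <= 0xBF)


sample = "naïve €5".encode("utf-8")
print(len(sample), "bytes,", count_utf8_chars(sample), "characters")
```

On malformed input this still returns a number, but that number no longer means anything in particular.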

For UCS-2, count the bytes and divide by two (same for UCS-4, but divide
by four). UTF-16 is handled just like UCS-2, except that once again,
you must treat surrogate pairs as single characters. And once again,
if you can assume that surrogate pairs are properly matched, then you
can ignore all 16-bit code units from 0xDC00 through 0xDFFF (the low
surrogates). But this does mean that you must actually examine the
code units, unlike the solutions for UCS-2 and UCS-4.
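The UTF-16 case might look something like this, assuming the byte order is known in advance, there is no byte order mark, and surrogate pairs are properly matched (the function name and the `byteorder` parameter are my own choices for illustration):

```python
def count_utf16_chars(data: bytes, byteorder: str = "little") -> int:
    """Count characters in well-formed UTF-16, skipping low surrogates."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        # The second half of each surrogate pair is not counted.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


sample = "a\U0001F600b".encode("utf-16-le")
print(len(sample) // 2, "code units,", count_utf16_chars(sample), "characters")
```

For UCS-2 the loop is unnecessary, since `len(data) // 2` already gives the answer.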

For UTF-EBCDIC... well, you get the idea.

-Doug Ewell
 Fullerton, California


