Re: Character counter?

From: Mark Davis (mark@macchiato.com)
Date: Sat Nov 11 2000 - 14:01:46 EST


Doug is right, if you are counting *encoded characters*. This is fine for
programmers, so if that is the purpose, you can use that method. (If the
text is not well-formed, then you probably want to filter out, i.e. not
count, isolated half-surrogates, ill-formed UTF-8, and noncharacters.)
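
As a rough illustration of that filtering, here is a minimal Python sketch. The
function names are made up for this example, and it assumes the text has
already been decoded into a Python str, where any surrogate code point that
remains is by definition unpaired. The noncharacter test uses the current
definition (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
def is_surrogate(cp: int) -> bool:
    # In a Python str, paired surrogates have already been merged into one
    # code point, so any surrogate seen here is an isolated half.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_encoded_characters(text: str) -> int:
    """Count code points, skipping isolated surrogates and noncharacters."""
    return sum(
        1
        for ch in text
        if not is_surrogate(ord(ch)) and not is_noncharacter(ord(ch))
    )
```

For example, count_encoded_characters("a\ud800b\ufffec") skips the lone
surrogate and U+FFFE and counts only a, b and c.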

However, if your target is end-users, what you want to count are
*graphemes*, not encoded characters. See UTR#18 for a discussion of that,
and of the difference between locale-independent graphemes and
locale-dependent graphemes. In such a case, you probably also want to filter
out alternate format characters (e.g. ZWJ) and non-whitespace control
characters, perhaps also collapsing whitespace.
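
A very rough approximation of such an end-user count can be sketched in
Python. This is not the grapheme definition from UTR#18: it only treats
combining marks as part of the preceding character, drops format and
non-whitespace control characters, and collapses whitespace runs. The function
name is hypothetical.

```python
import re
import unicodedata


def approx_user_characters(text: str) -> int:
    """Approximate the number of user-perceived characters in text."""
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    count = 0
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Mn", "Me", "Mc"):
            continue  # combining mark: belongs to the preceding base
        if cat == "Cf":
            continue  # format character such as ZWJ
        if cat == "Cc" and not ch.isspace():
            continue  # non-whitespace control character
        count += 1
    return count
```

With this sketch, "e" followed by U+0301 COMBINING ACUTE ACCENT counts as one,
and a ZWJ between two letters is not counted at all. Locale-dependent
graphemes are out of its reach.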

Mark

----- Original Message -----
From: "Doug Ewell" <dewell@compuserve.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Saturday, November 11, 2000 09:25
Subject: Character counter?

11digitboy@bolt.com wrote:

> Is there a program that will count characters in
> a text file?

Since determining the number of *bytes* in a file is such a rudimentary
task, I will assume there is more to this question.

To count *characters* in a character-set-independent way, you have to
know the encoding. For ISO 8859-*, of course, this is the same as
counting bytes. For East Asian double-byte systems, you must treat
combinations of lead byte and trail byte as one character. If the file
is well-formed, you could do this by simply ignoring trail bytes in the
count.
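
To make the double-byte case concrete, here is a small Python sketch for one
such encoding, Shift_JIS. The lead-byte ranges used (0x81-0x9F and 0xE0-0xFC)
are the commonly cited ones and are an assumption of this example, as is the
function name. Because a trail byte can fall in the ASCII range, the sketch
walks the data from the start and skips the byte after each lead byte rather
than testing bytes in isolation.

```python
def is_sjis_lead(b: int) -> bool:
    # Assumed Shift_JIS lead-byte ranges; other double-byte encodings differ.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def count_sjis_chars(data: bytes) -> int:
    """Count characters in well-formed Shift_JIS data."""
    count = 0
    i = 0
    while i < len(data):
        # A lead byte consumes itself and its trail byte; anything else
        # (ASCII, half-width katakana) is a single-byte character.
        i += 2 if is_sjis_lead(data[i]) else 1
        count += 1
    return count
```

As written it assumes well-formed input; a lead byte at the very end of the
data is still counted as one character.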

For UTF-8, again you must treat multi-byte sequences as only one
character. As Markus Kuhn points out in his Unicode Web page, this can
be done by ignoring continuation bytes (0x80 through 0xBF) in the count,
but again this presumes that the file is well-formed. (And if it's not,
what are you supposed to do then?)
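
The continuation-byte trick fits in a few lines of Python. This is only a
sketch under the same assumption as above, namely that the input is
well-formed UTF-8; the function name is invented for the example.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    # Every character contributes exactly one byte outside 0x80..0xBF:
    # either an ASCII byte or the lead byte of a multi-byte sequence.
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)
```

For instance, count_utf8_chars("héllo €".encode("utf-8")) gives 7 although the
encoded form is 10 bytes long. On ill-formed input the result is simply the
number of non-continuation bytes, which may or may not be what you want.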

For UCS-2, count the bytes and divide by two (same for UCS-4, but divide
by four). UTF-16 is handled just like UCS-2, except that once again,
you must treat surrogate pairs as single characters. And once again,
if you can assume that surrogate pairs are properly matched, then you
can ignore all 16-bit code units from 0xDC00 through 0xDFFF. But all
of this does imply that you must actually examine the bytes, unlike the
solutions for UCS-2 and UCS-4.
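
The same idea for UTF-16, again as a Python sketch with a made-up function
name. It assumes properly matched surrogate pairs, a known byte order, and no
byte order mark at the start of the data (a BOM would be counted as a
character).

```python
def count_utf16_chars(data: bytes, byteorder: str = "little") -> int:
    """Count characters in well-formed UTF-16 by skipping low surrogates."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        # 0xDC00..0xDFFF is the second half of a surrogate pair, so the
        # character was already counted at its first half.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count
```

For text with no surrogates this reduces to the byte count divided by two, as
in the UCS-2 case.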

For UTF-EBCDIC... well, you get the idea.

-Doug Ewell
 Fullerton, California


