From: Doug Ewell (doug@ewellic.org)
Date: Tue Apr 28 2009 - 22:29:15 CDT
Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
> For UTF-8 there are many tasks where conversion can be entirely
> avoided, or can be avoided for a large percentage of the data. In
> reading the Unihan Database, for example, each of the more than 1 million
> lines contains two ASCII-only fields. The character code, e.g.
> "U+4E00" and the tag name e.g. "kRSUnicode". Only the third field will
> contain unrestricted UTF-8 (depending on the tag).
>
> About 1/2 of the 28MB file therefore can be read as ASCII. Any
> conversion is wasted effort, and performance gains became visible the
> minute my tokenizer was retargeted to collect tokens in UTF-8.

I haven't benchmarked it, but I would have thought reading ASCII as
UTF-8 would be pretty efficient. Maybe I missed something.

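For concreteness, here is a minimal sketch of the approach quoted above: splitting a Unihan-style line at the byte level and decoding only the third field. It is not the tokenizer from the quoted message. The function name parse_unihan_line is hypothetical, and the sketch assumes tab-separated fields and "#" comment lines. Splitting on the tab byte is safe in UTF-8 because bytes below 0x80 never occur inside a multi-byte sequence.

```python
def parse_unihan_line(line: bytes):
    """Split one Unihan-style line into (code, tag, value).

    The first two fields are ASCII-only, so they are left as bytes.
    Only the third field, which may hold arbitrary UTF-8, is decoded.
    Returns None for blank lines and '#' comment lines.
    """
    line = line.rstrip(b"\r\n")
    if not line or line.startswith(b"#"):
        return None
    # Assumption: fields are separated by a single tab (0x09).
    code, tag, value = line.split(b"\t", 2)
    return code, tag, value.decode("utf-8")


if __name__ == "__main__":
    sample = "U+4E00\tkMandarin\tyī\n".encode("utf-8")
    print(parse_unihan_line(sample))
```

Whether skipping the decode of the first two fields saves measurable time depends on the language and runtime, so this shows the shape of the technique only.
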
> The point is, there are occasional scenarios where close attention to
> the cost of data conversion pays off. Piecemeal conversion (one line
> at a time) definitely is too coarse, and if you wrap it into a
> "getline" type API, that adds even more overhead. So, that's
> recommended only where text throughput is not critical.

Now I know I've missed something, because I definitely would not have
expected that translating UTF-8 bytes into Unicode code points would add
noticeable overhead to a "getline" function that reads data from
storage.

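One way to settle the question is to measure it. The sketch below is a hypothetical timing harness, not a benchmark taken from this thread. It compares a "getline"-style reader that hands back raw bytes with one that decodes every line. All names (lines_as_bytes, lines_as_text, make_sample, time_reader) are made up for illustration. The data is held in memory, so storage latency is deliberately excluded, and no results are claimed here.

```python
import io
import timeit


def lines_as_bytes(stream):
    """Yield each line as raw bytes, with no conversion."""
    for raw in stream:
        yield raw


def lines_as_text(stream):
    """Yield each line decoded from UTF-8, one line at a time."""
    for raw in stream:
        yield raw.decode("utf-8")


def make_sample(n_lines=100_000):
    """Build an in-memory buffer of identical Unihan-style lines."""
    line = "U+4E00\tkMandarin\tyī\n".encode("utf-8")
    return line * n_lines


def time_reader(reader, data, repeat=3):
    """Return the best wall-clock time, in seconds, for one full pass."""
    def run():
        for _ in reader(io.BytesIO(data)):
            pass
    return min(timeit.repeat(run, number=1, repeat=repeat))


if __name__ == "__main__":
    data = make_sample()
    print("bytes only :", time_reader(lines_as_bytes, data))
    print("decode each:", time_reader(lines_as_text, data))
```

Any difference will vary with the runtime and the mix of ASCII and non-ASCII data, and reading from real storage may well dominate both figures.
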
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages