From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Apr 26 2009 - 15:24:29 CDT
On 4/26/2009 8:40 AM, Doug Ewell wrote:
> From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
>
>> Now, if we replace each character by its UTF-8 encoding, we would
>> obtain a regular expression and corresponding automata that match the
>> same language, but would operate directly on bytes:
>>
>> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
>
> I know this isn't the answer you're looking for, but it almost always
> makes more sense to decode UTF-8 code units into Unicode code points
> FIRST and then apply other algorithms to operate on Unicode text,
> instead of trying to build UTF-8 decoding into every algorithm.
>
I respectfully disagree.
For small amounts of data, and for applications that need to handle
multiple data formats/encodings, it makes sense indeed to first convert
into a common format and then implement the algorithm only once.
However, when you need to scan (in real time) large amounts of data
known to be in UTF-8, the conversion costs will kill you. In my
consulting practice I've come across cases where that matters.
These days, I'm working on an upgrade to Unibook
(http://unicode.org/unibook) that can read information from the Unihan
data base (>1 million lines of UTF-8). Supporting UTF-8 by conversion
proved unacceptably slow for use in an interactive environment.
I investigated a number of optimizations. The big ones included
reimplementing the tokenizer to work directly on UTF-8, and limiting the
conversion to data that are later used as strings in formatting and
display. With that, Unibook can read an un-preprocessed Unihan DB fast
enough that loading can take place in the background during startup.
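
To make the idea concrete, here is a rough sketch (in Python, not Unibook's actual code) of tokenizing directly on UTF-8 bytes and decoding only the fields that are later needed as strings. It assumes the tab-separated Unihan line layout (code point, tag, value); the function names and the choice of which tags to keep are made up for illustration.

```python
def parse_unihan_line(line: bytes):
    """Split one raw UTF-8 line into (code_point, tag, raw_value).

    Nothing is decoded here: tabs, '#', 'U+' and hex digits are all ASCII,
    so they can be recognized directly in the byte stream.
    """
    if line.startswith(b"#"):
        return None  # comment line
    fields = line.rstrip(b"\r\n").split(b"\t", 2)
    if len(fields) != 3 or not fields[0].startswith(b"U+"):
        return None  # blank or malformed line
    code_point = int(fields[0][2:], 16)  # int() accepts ASCII bytes
    return code_point, fields[1], fields[2]


def load(lines, wanted_tags):
    """Build {code_point: {tag: str}}, decoding only the wanted values.

    `wanted_tags` is a hypothetical set of byte-string tags whose values
    are needed later as strings for formatting and display.
    """
    table = {}
    for line in lines:
        parsed = parse_unihan_line(line)
        if parsed is None:
            continue
        code_point, tag, raw_value = parsed
        if tag in wanted_tags:
            # The only place where UTF-8 is converted to a string.
            table.setdefault(code_point, {})[tag] = raw_value.decode("utf-8")
    return table


sample = [
    b"# comment\n",
    b"U+3400\tkCantonese\tjau1\n",
    b"U+3400\tkDefinition\t(same as U+4E18 \xe4\xb8\x98) hillock or mound\n",
    b"\n",
]
entries = load(sample, {b"kDefinition"})
```

Lines whose tag is not wanted are split and discarded without ever being converted, which is where the saving comes from in this sketch.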
If I understand him correctly, Bjoern also suggests that his method
offers yet another avenue for Unicode-enabling existing multi-byte-aware
applications. Depending on the circumstances in each case, such a retrofit
might make sense.
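
As a small illustration of the substitution Bjoern describes (each character replaced by its UTF-8 encoding), the following Python sketch builds a byte-level pattern from a set of characters and matches it against undecoded input. It is only my reading of the idea, not his implementation; `byte_pattern` and the sample character set are hypothetical.

```python
import re


def byte_pattern(chars):
    """Compile a regex over bytes that matches a run of the given characters.

    Each character becomes one alternative holding its UTF-8 byte sequence,
    e.g. U+00C0 becomes \\xC3\\x80. UTF-8 is prefix-free, so no alternative
    can be mistaken for the start of another.
    """
    alternatives = [re.escape(c.encode("utf-8")) for c in sorted(chars)]
    return re.compile(b"(?:" + b"|".join(alternatives) + b")+")


letters = byte_pattern("ABab\u00c0\u00e9")

data = "a\u00c0b \u00e9".encode("utf-8")  # raw bytes, never decoded below
match = letters.match(data)
matched_bytes = match.group()             # b"a\xc3\x80b"
```

The matcher sees only bytes, yet it accepts exactly the same language as the character-level expression; a stray continuation byte such as `\x80` on its own does not match.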
Having a larger toolbox is always nice.
A./