Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Apr 26 2009 - 15:24:29 CDT

Next message: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: John (Eljay) Love-Jensen: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

On 4/26/2009 8:40 AM, Doug Ewell wrote:
> From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
>
>> Now, if we replace each character by its UTF-8 encoding, we would
>> obtain a regular expression and corresponding automata that match the
>> same language, but would operate directly on bytes:
>>
>> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
>
> I know this isn't the answer you're looking for, but it almost always
> makes more sense to decode UTF-8 code units into Unicode code points
> FIRST and then apply other algorithms to operate on Unicode text,
> instead of trying to build UTF-8 decoding into every algorithm.
>
I respectfully disagree.

For small amounts of data, and for applications that need to handle
multiple data formats/encodings, it makes sense indeed to first convert
into a common format and then implement the algorithm only once.

However, when you need to scan (in real time) large amounts of data
known to be in UTF-8, the conversion costs will kill you. In my
consulting practice I've come across cases where that matters.

These days, I'm working on an upgrade to Unibook
(http://unicode.org/unibook) that can read information from the Unihan
database (more than 1 million lines of UTF-8). Supporting UTF-8 by conversion
proved unacceptably slow for use in an interactive environment.

I investigated a number of optimizations. The big ones included
reimplementing the tokenizer to work directly on UTF-8, and limiting the
conversion to data that are later used as strings in formatting and
display. With that, Unibook can read an un-preprocessed Unihan DB fast
enough that loading can take place in the background during startup.
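
To make the idea concrete, here is a minimal sketch of tokenizing directly on
UTF-8 bytes and decoding only the field that is later shown as a string. It is
not Unibook's actual code. The names parse_record and display_value are
hypothetical, and the record layout (three tab-separated fields, "#" comment
lines) is an assumption modeled on the Unihan data files. It relies on the
fact that in UTF-8 every byte below 0x80 is a complete ASCII character, so a
tab byte can never occur inside a multi-byte sequence.

```python
def parse_record(line: bytes):
    """Split one record on raw UTF-8 bytes, without decoding the whole line.

    Assumed layout (hypothetical): b"U+XXXX<TAB>field<TAB>value".
    Returns (code_point, field_name_bytes, value_bytes), or None for
    blank lines and '#' comment lines.
    """
    line = line.rstrip(b"\r\n")
    if not line or line.startswith(b"#"):
        return None
    # Splitting on the tab byte is safe: 0x09 never appears inside a
    # multi-byte UTF-8 sequence.
    code, field, value = line.split(b"\t", 2)
    # The code point field is pure ASCII, so it is parsed from bytes directly.
    code_point = int(code[2:], 16)
    return code_point, field, value


def display_value(value: bytes) -> str:
    """Decode lazily, only for data that ends up in formatting or display."""
    return value.decode("utf-8")


record = "U+3400\tkDefinition\t(same as 丘) hillock or mound\n".encode("utf-8")
code_point, field, value = parse_record(record)
print(hex(code_point), field, display_value(value))
```

The design choice is that field names and code points are compared and parsed
as bytes, and the conversion cost is paid only for the values that are
actually displayed.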

If I understand him correctly, Bjoern also suggests his method to give
yet another avenue for Unicode-enabling of existing multi-byte aware
applications. Depending on the circumstances in each case, such a retrofit
might make sense.
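
The substitution Bjoern describes in the quoted text can be sketched in a few
lines. This is only an illustration under stated assumptions, not his
implementation: the helper name utf8_alternation is hypothetical, the
character set is a toy example, and a real tool would build a compact DFA
from byte ranges rather than one alternative per character. Each character in
the set is replaced by its UTF-8 byte sequence, and the resulting pattern is
run on undecoded bytes.

```python
import re


def utf8_alternation(chars: str) -> bytes:
    """Build a byte-level alternation matching any one character in chars."""
    # Each character is replaced by its UTF-8 encoding. UTF-8 is
    # prefix-free, so no alternative is a prefix of another.
    encoded = sorted({ch.encode("utf-8") for ch in chars})
    return b"(?:" + b"|".join(re.escape(seq) for seq in encoded) + b")"


# Toy set: a few ASCII letters plus U+00C0 and U+00E9.
letters = "ABab\u00C0\u00E9"
pattern = re.compile(utf8_alternation(letters) + b"+")

# The input is never decoded; matching happens on raw bytes.
data = "xx\u00C0b\u00E9!".encode("utf-8")
match = pattern.search(data)
print(match.group())
```

Because the pattern is a bytes object, Python's re module matches byte by
byte, which is the same idea as an automaton whose alphabet is the 256 byte
values instead of Unicode code points.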

Having a larger toolbox is always nice.

A./


This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 15:28:13 CDT