From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Apr 27 2009 - 12:24:03 CDT
On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
> Hi Asmus,
>
>
>> I respectfully disagree.
>>
>> For small amounts of data, and for applications that need to handle
>> multiple data formats/encodings, it makes sense indeed to first convert
>> into a common format and then implement the algorithm only once.
>>
>> However, when you need to scan (in real time) large amounts of data
>> known to be in UTF-8, the conversion costs will kill you. In my
>> consulting practice I've come across cases where that matters.
>>
>
> Wouldn't it be prudent to have the regular expression expressed in Unicode,
> and then translate it (for performance, so that it operates on the data
> stream in that stream's own format) into a UTF-8 one, or UTF-16LE, or
> UTF-16BE, or UTF-32LE, or UTF-32BE, as appropriate?
>
> Rather than specifying the optimized regular expression in native UTF-8 in
> the first place, and perhaps another in UTF-16BE, and perhaps another in
> yada yada...
>
> That would avoid the brittleness issue raised by others.
>
That's a good point, and a bit orthogonal to what I was trying to
highlight. My focus was on calling attention to the fact that multi-step
implementations, with separate and independent phases for conversion and
for algorithmic text processing, can be cost-prohibitive in high-volume
(real-time) applications. Such application domains are real, even though
they are not the standard case.
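As a minimal sketch of why the direct approach pays off (the function
here is purely illustrative): UTF-8 guarantees that the bytes 0x00-0x7F
occur only as complete ASCII characters, never inside a multi-byte
sequence, so a scan for an ASCII delimiter can run on the raw bytes with
no conversion phase at all:

#include <stddef.h>
#include <string.h>

/* Illustrative only: count newlines in a buffer known to hold valid
 * UTF-8. Continuation bytes are always 0x80-0xBF, so '\n' (0x0A) can
 * be searched for directly -- no decode-to-UTF-32 pass is needed. */
static size_t count_lines_utf8(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            break;
        count++;
        p = nl + 1;
    }
    return count;
}

In a decode-first design, every one of those bytes would pass through a
UTF-8-to-UTF-32 converter before the comparison even starts; at high
data rates that difference dominates.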
> And you would get the high performance you are looking for -- working on the
> native data stream without decoding it into UTF-32 (platform-native 32-bit)
> characters.
>
> This puts the burden on the regular expression compiler, which would have to
> be Unicode-savvy, and able to optimize the regex for a particular Unicode
> transformation format.
>
That's a straightforward application of the principle that optimizations
should be encapsulated. I would certainly not disagree with that; it's
the way to go in new code.
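To make that concrete, here is a minimal sketch (the function name is my
own, not taken from any particular regex library) of the core step such
a compiler performs when it targets UTF-8: a literal code point in the
pattern is expanded into the exact byte sequence the matcher should look
for, so the matcher itself never has to decode the input:

/* Illustrative helper: expand one Unicode scalar value into its UTF-8
 * byte sequence. Returns the length (1-4), or 0 for values that are
 * not scalar values (surrogates, anything above U+10FFFF). */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                     /* surrogates are not encodable */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* beyond U+10FFFF */
}

Ranges compile the same way: the class [\u0080-\u07FF], for instance,
becomes the byte pattern [\xC2-\xDF][\x80-\xBF], and wider ranges split
into a handful of such byte-range alternatives.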
One of Bjoern's points was about retrofitting existing implementations.
You always have the choice between modifying them to handle UTF-8
directly (as yet another byte-oriented encoding) and converting them to
use UTF-16 or UTF-32 throughout internally. For existing implementations
the correct choice depends on a large number of variables, including the
expected lifetime of the application, whether or not the existing code
base already handles multi-byte encodings, what types of processing are
done on the data and how much, what external components need to be
interfaced with and in what encoding forms, how localized the text
handling is in the architecture, and so on.
Having UTF-8-direct implementations of core algorithms (character
classification, regex matching, and the like) at your command allows you
to fine-tune the retrofit. You may think that little code is left that
needs to be retrofitted, but in my consulting practice I keep coming
across fresh examples.
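By UTF-8-direct I mean primitives along these lines (a minimal sketch;
the names are mine, purely for illustration). Classification of the
ASCII cases needs no decoding at all, and non-ASCII characters are
stepped over in a single move from the lead byte:

/* Illustrative: length of the UTF-8 sequence starting with lead byte
 * b, assuming the buffer is already known to be valid UTF-8. */
static int utf8_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* ASCII */
    if (b < 0xE0) return 2;   /* lead bytes 0xC2-0xDF */
    if (b < 0xF0) return 3;   /* lead bytes 0xE0-0xEF */
    return 4;                 /* lead bytes 0xF0-0xF4 */
}

/* Illustrative: count ASCII digits in valid UTF-8 text without any
 * decoding phase; utf8_len() steps over non-ASCII characters. */
static size_t count_ascii_digits(const unsigned char *p,
                                 const unsigned char *end)
{
    size_t n = 0;
    while (p < end) {
        if (*p >= '0' && *p <= '9')
            n++;
        p += utf8_len(*p);
    }
    return n;
}

For classes that reach beyond ASCII, the same direct style works with a
first-byte dispatch table or a small trie in place of the byte
comparison.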
A./