From: Doug Ewell (doug@ewellic.org)
Date: Sun Apr 26 2009 - 10:40:25 CDT
From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
> Now, if we replace each character by its UTF-8 encoding, we would
> obtain a regular expression and corresponding automata that match the
> same language, but would operate directly on bytes:
>
> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
I know this isn't the answer you're looking for, but it almost always
makes more sense to decode UTF-8 code units into Unicode code points
FIRST and then apply other algorithms to operate on Unicode text,
instead of trying to build UTF-8 decoding into every algorithm.
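
To make the comparison concrete, here is a minimal sketch of both
approaches, assuming Python 3 and its standard re module. The sample
string and the choice of "letters" (ASCII plus the Latin-1 letters
U+00C0..U+00FF, excluding the multiplication and division signs) are
illustrative assumptions, not something taken from the quoted message.

```python
import re

# Hypothetical sample input: UTF-8 bytes as they might arrive from a file.
raw = "\u00C0 la carte, \u00DCn\u00EFcode".encode("utf-8")

# Byte-level approach (as in the quoted message): every non-ASCII letter
# is spelled out as its UTF-8 byte sequence. U+00C0..U+00FF all start
# with lead byte 0xC3; 0x97 (U+00D7) and 0xB7 (U+00F7) are skipped
# because they are not letters.
byte_pattern = re.compile(rb"(?:[A-Za-z]|\xC3[\x80-\x96\x98-\xB6\xB8-\xBF])+")

# Decode-first approach: convert bytes to code points once, then write
# the pattern in terms of code points.
text = raw.decode("utf-8")
text_pattern = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")

# Matches from the byte pattern are decoded only so the two results
# can be compared directly.
byte_words = [m.decode("utf-8") for m in byte_pattern.findall(raw)]
text_words = text_pattern.findall(text)

print(byte_words == text_words)
```

Both patterns accept the same words here, but the byte-level one has
to encode knowledge of UTF-8 lead and trail bytes in the pattern
itself, and it grows quickly once characters outside a single lead
byte are involved.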
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages