John Cowan wrote:
>
>
> In addition, in some applications those processing inefficiencies are
> not present, thanks to the self-segregating nature of UTF-8. For
> example, the Plan 9 "fgrep" program (which searches a stream of text
> for the presence of one or more of a list of strings) need never convert
> to UCS format at all; the strings are UTF-8 and so is the text, and
> in fact the program looks the same as the corresponding 8-bit program.
>
This is not completely true. To be Unicode compliant, fgrep must
deal correctly with combining characters. For example,
è (<LATIN SMALL LETTER E WITH GRAVE, U+00E8>) is canonically
equivalent to
<LATIN SMALL LETTER E, U+0065> <COMBINING GRAVE ACCENT, U+0300>
So fgrep should match <U+00E8> against <U+0065><U+0300> to be truly
Unicode compliant, and a byte-for-byte comparison of the UTF-8 text
will not do that.
See section 2.5 of "The Unicode Standard 2.0".
That is not to say fgrep isn't a good start ...
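
To make the point about combining characters concrete, here is a minimal
sketch in Python (not the Plan 9 fgrep, and not a full implementation of
Unicode-aware matching). It assumes the standard unicodedata module and
uses a hypothetical helper, contains_equivalent, that decomposes both the
pattern and the text before doing an ordinary substring search.

```python
import unicodedata


def contains_equivalent(text: str, pattern: str) -> bool:
    """Substring test that treats canonically equivalent sequences as equal.

    Hypothetical helper for illustration only.
    """
    # Canonical decomposition (NFD) turns U+00E8 into U+0065 U+0300,
    # so both spellings of the accented letter end up identical.
    norm_text = unicodedata.normalize("NFD", text)
    norm_pattern = unicodedata.normalize("NFD", pattern)
    return norm_pattern in norm_text


precomposed = "caf\u00e8"    # e with grave as a single code point
decomposed = "cafe\u0300"    # e followed by combining grave accent

# A plain comparison sees two different strings.
print(precomposed == decomposed)                    # False
# After decomposition, the precomposed pattern is found in the decomposed text.
print(contains_equivalent(decomposed, "\u00e8"))    # True
```

One limitation of this sketch: a pattern consisting of a bare "e" will
also match the base letter of a decomposed accented "e", so a real tool
would additionally have to respect combining-sequence boundaries.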