-----BEGIN PGP SIGNED MESSAGE-----
David Starner wrote:
> On Thu, Feb 14, 2002 at 03:15:24PM +0000, David Hopwood wrote:
> > [re: a hypothetical charset that has almost all the properties of UTF-8]
> >
> > (The exception is that naïve substring searching could find a
> > match starting part-way through a character - but it would be easy to
> > reject false matches by looking at the previous byte.)
>
> But the fact that systems that can search arbitrary 8-bit charsets can
> search UTF-8 has proven to be a useful ability in the Unix world.
Is not having to add a few more lines of code to grep and sed really a good
trade-off for a 50% penalty in encoding efficiency for Indic and Southeast
Asian scripts, Katakana, Hiragana and a few others? I don't think so.
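
To put a number on that penalty, here is a rough sketch in Python. It
assumes the hypothetical charset would spend 2 bytes on each Katakana
character; its actual byte layout is not given in this message, so the
2-byte figure is simply counted rather than produced by a real encoder.

```python
# Rough size comparison for a short Katakana string.
# Assumption: the hypothetical charset uses 2 bytes for each of these
# characters, so we count 2 per character instead of encoding for real.
text = "カタカナ"  # 4 Katakana characters

utf8_bytes = len(text.encode("utf-8"))  # UTF-8 needs 3 bytes for each
two_byte_bytes = 2 * len(text)          # assumed 2 bytes for each

# Extra space UTF-8 needs, relative to the assumed 2-byte form.
penalty = (utf8_bytes - two_byte_bytes) / two_byte_bytes
print(utf8_bytes, two_byte_bytes, f"{penalty:.0%}")
```

The same ratio applies to any script whose characters take 3 bytes in
UTF-8 but would fit in 2 under the alternative.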
In another post, you wrote:
> grep doesn't know if it's working on UTF-8 text or raw binary
It does know when it's working on raw binary (by default it guesses,
but the behaviour is different for binary vs text).
> or Latin-1 (I frequently do grep foo file | recode l1..utf-8), and it
> doesn't know whether its output is going to the screen or a file or
> the tail of a file or the input of another program.
If "foo" is a US-ASCII string, "grep foo file" will work fine with any
US-ASCII-superset charset for which non-ASCII characters do not use
bytes < 0x80, including the hypothetical one I described, with no
possibility of a false match. However "grep fóó file" will work only
if the current shell charset (i.e. of argv[1]) matches the encoding of
"file". So grep could safely assume by default that its input will be
in the current shell charset. (This is also true when the grep command
is in a script, since the charset of the script must be known by the
shell.)
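
A small Python sketch of the two cases, using a plain byte-wise substring
test as a stand-in for what a charset-unaware grep does. Latin-1 and UTF-8
are just convenient examples of ASCII-superset charsets here, and the
sample line is made up for illustration.

```python
# Byte-wise substring search, standing in for a charset-unaware grep.
line = "foo and fóó"  # made-up sample line

utf8_line = line.encode("utf-8")
latin1_line = line.encode("latin-1")

# An ASCII pattern has the same bytes in every ASCII-superset charset,
# so it is found whichever way the file is encoded.
ascii_pat = "foo".encode("ascii")
ascii_hits = (ascii_pat in utf8_line, ascii_pat in latin1_line)

# A non-ASCII pattern typed in a UTF-8 shell arrives as UTF-8 bytes and
# only matches a file that is also UTF-8.
utf8_pat = "fóó".encode("utf-8")
non_ascii_hits = (utf8_pat in utf8_line, utf8_pat in latin1_line)

print(ascii_hits, non_ascii_hits)
```

The second pair shows the mismatch: the pattern bytes from the shell simply
do not occur in the Latin-1 file, even though the text is "the same".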
That grep/egrep offers no way to specify the charset of its input
explicitly when it differs from the shell charset is arguably a bug. It is
not correct to say that grep doesn't need to know the charset: it needs to
even in simple cases, never mind when using more advanced regexp features
that rely on character properties.
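
As an illustration of why the charset matters once character properties
come into play, here is a minimal Python sketch. The grep() helper and its
input_charset parameter are hypothetical, invented for this example; they
are not an option of any real grep. Case-insensitive matching stands in for
"features that rely on character properties".

```python
import re

def grep(pattern, data, input_charset, flags=0):
    """Hypothetical helper: decode data using the charset named by
    input_charset, then return the lines that match pattern."""
    regex = re.compile(pattern, flags)
    text = data.decode(input_charset)
    return [line for line in text.splitlines() if regex.search(line)]

# A Latin-1 file containing an upper-case version of the search term.
data = "FÓÓ bar\nbaz\n".encode("latin-1")

# Told the charset, the search can case-fold non-ASCII letters.
matches = grep("fóó", data, "latin-1", re.IGNORECASE)

# Working on raw bytes, Python's re only case-folds ASCII letters, so
# the "ó" bytes in the pattern never match the "Ó" bytes in the data.
bytes_hit = re.search("fóó".encode("latin-1"), data, re.IGNORECASE)

print(matches, bytes_hit)
```

Only the charset-aware search finds the line; the byte-level one cannot
know that 0xD3 and 0xF3 are the same letter in different cases.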
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBPGz+lDkCAxeYt5gVAQFOHQf9FgddVwIBdntfS1librB2NwBL++/i0x2u
m0OpOSKHgjyrkXwz3OOzBv9arKaqgd6FK61iluOzOZ0ulpDjnbghf00+zQIwdgLk
E3+Unr02bHWt56lbGxyqTBRIWfwrgWOIttOukSf5qdNXBeCLplPw6OnsmgSQdWmD
FKzHOmq/rstgXXDCkdO5UM6edxMzUFX9bNccen5MTmHv+fLrFM9XiPZx2PkBitUH
YWjYbmIZwu59X0t/P1f2zAhc7Gx9MBG0JBflwU2JoVcMFnm1/UdHUDS6SsL3xJGy
dMtCrWsAQwYGmPjisgABQVVo48HJWWCNWFlgUl3Nixu7oWgiWldS6g==
=EXHN
-----END PGP SIGNATURE-----