>From: "Alain LaBonté" <alb@sct.gouv.qc.ca>
>Reply-To: unicode@unicode.org
>To: Unicode List <unicode@unicode.org>
>Date: Thu, 12 Mar 1998 06:33:52 -0800 (PST)
>Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
>
>At 04:17 98-03-12 -0800, Jeroen Hellingman wrote:
>>his field of knowledge, ASCII can be overseen, but Unicode is too large
>>for most people to oversee the effects of a range selection.
>
>[Alain] :
>Even ASCII range is problematic in English...
>
>"A to Z" does not imply "a to z", does it ?
>
>One should not expect the end-user to know what is under the hood!
>
>And "A to z" leads to no hit in EBCDIC, while "a to Z" leads to no hit
>in ASCII!
>
>Alain LaBonté
>Québec
>
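Alain's observation about the two encodings can be checked directly.
The following is only a minimal sketch: it assumes Python's cp037 codec
as a representative EBCDIC code page, and range_is_empty is a
hypothetical helper name, not part of any standard.

```python
def range_is_empty(lo, hi, encoding):
    """Return True if the range 'lo to hi' selects nothing when the
    endpoints are compared by their code values in the given encoding."""
    lo_code = lo.encode(encoding)[0]
    hi_code = hi.encode(encoding)[0]
    return lo_code > hi_code


# cp037 is assumed here as a stand-in for "EBCDIC"; other EBCDIC code
# pages place the basic Latin letters at the same positions.
print(range_is_empty("A", "z", "cp037"))  # True: 'A' (0xC1) sorts after 'z' (0xA9)
print(range_is_empty("a", "Z", "ascii"))  # True: 'a' (0x61) sorts after 'Z' (0x5A)
print(range_is_empty("A", "z", "ascii"))  # False
print(range_is_empty("a", "Z", "cp037"))  # False
```

The same range expression therefore means something different, or
nothing at all, depending on an encoding the end-user never sees.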
I believe that the POSIX approach of character class expressions
can shed some light on this area. The character class expressions
are based on examination of what end-users have traditionally intended
when expressions such as [a-z], [A-Z], [0-9], [a-zA-Z], etc. were
used. The examination concluded that the intent was generally:
[a-z] - lowercase
[A-Z] - uppercase
[0-9] - digits
[a-zA-Z] - alphabetics
This led to the notation:
[:alnum:]
[:alpha:]
[:blank:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
which allows end-users to obtain the necessary information without
having to "know what is under the hood."
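In POSIX regular expressions these names appear inside a bracket
expression, for example [[:upper:]][[:lower:]]+. As a rough
illustration, here is a sketch that expands such names into explicit
character lists for a regex engine without native support (Python's re
module has none). The membership table assumes the POSIX ("C") locale,
so only ASCII characters are classified; in other locales the classes
cover more. POSIX_CLASSES and expand_posix_classes are hypothetical
names introduced for this example.

```python
import re
import string

# Assumed membership of each class in the POSIX ("C") locale only.
# A real implementation would take this from the current locale.
POSIX_CLASSES = {
    "alnum": string.ascii_letters + string.digits,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "cntrl": "".join(map(chr, range(0x20))) + "\x7f",
    "digit": string.digits,
    "graph": "".join(map(chr, range(0x21, 0x7F))),
    "lower": string.ascii_lowercase,
    "print": "".join(map(chr, range(0x20, 0x7F))),
    "punct": string.punctuation,
    "space": string.whitespace,
    "upper": string.ascii_uppercase,
    "xdigit": string.hexdigits,
}


def expand_posix_classes(pattern):
    """Replace every [:name:] in the pattern with its explicit members.

    The caller is expected to write the class inside a bracket
    expression, as POSIX requires. An unknown name raises KeyError.
    """
    def members(match):
        # Escape each member so that ']', '-', '^' and '\' stay literal.
        return "".join(re.escape(c) for c in POSIX_CLASSES[match.group(1)])

    return re.sub(r"\[:([a-z]+):\]", members, pattern)


word = re.compile(expand_posix_classes("[[:upper:]][[:lower:]]+"))
print(bool(word.fullmatch("Unicode")))  # True
print(bool(word.fullmatch("unicode")))  # False: no leading uppercase letter
```

The user states the intent ("an uppercase letter, then lowercase
letters") and the table, not the user, carries the knowledge of which
code values qualify.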
-------------------------------------------------------------------------
Gary W. Miller Internet - gwm@austin.ibm.com
IBM JTMS/903 ZIP 9374 X/Open - g.miller@xopen.co.uk
11400 Burnet Road VNET - AUSTIN(GWM) / GWM at AUSTIN
Austin, Texas 78758 SENDFILE - GWM at AUSVM6
Phone: (512) 838-8297 Fax: (512) 838-0169
-------------------------------------------------------------------------