From: Dean Harding (dean.harding@dload.com.au)
Date: Mon Jun 12 2006 - 21:49:23 CDT
> > 1.Is it true that there are many ways of encoding the same character in
> > UTF-16?
>
> No. There is exactly one way of encoding each character in UTF-16. See
> TUS 4.0 Section 2.5 'Encoding Forms', especially p29.
I think this may be referring to canonically equivalent sequences (the ones
the normalization forms deal with) rather than to the encoding form itself.
For example, "e with an acute accent" could be <U+00E9> or it could be
<U+0065, U+0301>.
This CAN be a problem for regular expressions, unless they're designed with
it in mind. The simplest solution is to normalize the pattern and the input
strings to the same form before doing matching (for example, .NET provides the
String.Normalize [http://msdn2.microsoft.com/en-us/ebza6ck1.aspx] method).
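
To make that concrete, here is a rough sketch of the normalize-first approach.
It is in Python rather than .NET (my choice for illustration), with the
standard unicodedata.normalize standing in for String.Normalize. The helper
names nfc and search_normalized are hypothetical, and I am assuming the
pattern is a plain literal:

```python
import re
import unicodedata

composed = "caf\u00e9"      # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def search_normalized(pattern, text):
    """Hypothetical helper: bring pattern and input to NFC, then search."""
    return re.search(nfc(pattern), nfc(text))


# Without normalization, the composed pattern misses the decomposed input,
# because the regex engine compares code points, not canonical equivalents.
raw_match = re.search(composed, decomposed)

# With both sides in the same normalization form, the search succeeds.
normalized_match = search_normalized(composed, decomposed)
```

Any of the normalization forms would do here, as long as the pattern and the
input end up in the same one. Patterns that use character classes or
quantifiers on accented letters probably need more care than this.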
Dean.