From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Jun 12 2006 - 21:15:27 CDT
----- Original Message -----
From: "Magda Danish (Unicode)" <v-magdad@microsoft.com>
To: <unicode@unicode.org>
Cc: <wikoh@msn.com>
Sent: Monday, June 12, 2006 8:04 PM
Subject: FW: Other Question, Problem, or Feedback
>
> -----Original Message-----
> Date/Time: Sat Jun 10 14:54:43 CDT 2006
> Contact: wikoh@msn.com
> Name:
> Report Type: UTF-16 & UTF-32
>
> I haven't been able to find an answer in the FAQ or by googling the site
> to these questions...
>
> 1. Is it true that there are many ways of encoding the same character in
> UTF-16?
No. There is exactly one way of encoding each character in UTF-16. See TUS
4.0 Section 2.5 'Encoding Forms', especially p. 29.
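For illustration, a minimal sketch in Python (mine, not text from the
standard) of the UTF-16 encoding algorithm: each Unicode scalar value maps
to exactly one code unit sequence, a single 16-bit unit for the BMP or a
surrogate pair for supplementary characters.

    def utf16_code_units(cp):
        """Return the unique UTF-16 code unit sequence for a scalar value."""
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        if cp <= 0xFFFF:
            return [cp]                     # BMP: one 16-bit code unit
        cp -= 0x10000                       # supplementary plane
        return [0xD800 + (cp >> 10),        # high (lead) surrogate
                0xDC00 + (cp & 0x3FF)]      # low (trail) surrogate

    assert utf16_code_units(0x0041) == [0x0041]           # LATIN CAPITAL A
    assert utf16_code_units(0x10400) == [0xD801, 0xDC00]  # DESERET LONG I

There is no branch that could produce a second encoding for the same scalar
value, which is the point of the answer above.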
> Do you know if common regular expression search functions like those of
> .NET or Perl will find a character regardless of how it was encoded?
Since each character has exactly one UTF-16 encoding, this problem does not
arise.
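A quick check in Python (my example; the questioner asked about .NET and
Perl, but the principle is the same): regular expressions match decoded
characters, so a supplementary character is found with no concern for its
surrogate representation.

    import re

    text = "plain A and supplementary \U00010400"
    assert re.search("\U00010400", text) is not None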
> 2. Why is there now UTF-32?
Binarism. A 27-bit word is perfectly capable of representing any valid
code point; anything that can validly be done with UTF-32 can be done with
any word size from 21 bits upwards, and 32 is used simply because binary
machines favour power-of-two word sizes. (Anyone contemplating a
non-binary representation should consult the final part of TUS 4.0 Section
2.4 for the implications for Unicode data tables :-).
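To make that concrete, a sketch (my own): UTF-32 is simply the scalar
value zero-extended to a 32-bit code unit, so nothing about it actually
needs the top eleven bits.

    def utf32_code_unit(cp):
        """UTF-32 stores the scalar value itself in a 32-bit unit."""
        if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        return cp  # already fits in 21 bits: 0x10FFFF < 2**21

    assert utf32_code_unit(0x10FFFF) == 0x10FFFF
    assert (0x10FFFF).bit_length() == 21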
> Are there even that many characters in the world that they need a 32-bit
> representation?
If everyone invented a character and it were accepted, despite the alleged
rule against encoding novel or idiosyncratic characters ('Note, however,
that the Unicode Standard does not encode idiosyncratic, personal, novel,
or private-use characters, nor does it encode logos or graphics.' - TUS
4.0 Section 1.1 Paragraph 3), 32 bits would not be enough: the world's
population already exceeds 2^32, roughly 4.3 billion. However, it is
currently strenuously maintained that 21 bits will suffice. The range of
values is 0 to 0x10FFFF (TUS 4.0 Section 2.4 Paragraph 3).
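The arithmetic behind the quip, as a sketch (the population figure is my
rough assumption for 2006):

    world_population_2006 = 6_500_000_000  # rough estimate, assumption
    assert world_population_2006 > 2**32   # 2**32 == 4,294,967,296
    assert 0x10FFFF + 1 == 1_114_112       # total code points in the range
    assert 0x10FFFF < 2**21                # so 21 bits suffice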