Re: UCS-2

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 02 2000 - 14:40:17 EDT

Next message: John Jenkins: "Re: Encoding Bengali Vowel forms (again)"
Previous message: Michael Everson: "Indic"
Maybe in reply to: Samir.Mehrotra@mail.iflexsolutions.com: "UCS-2"

Antoine asked:

> Peter Constable wrote:
> >
> > You can, in fact, state this more strongly: *No characters will
> > ever be assigned* in Unicode that require the five-byte and
> > six-byte UTF-8 forms. Based on recent WG2 decisions (I think
> > they made this decision last month), the same is true for ISO
> > 10646. All that's left now would be to formally change the
> > definition for UTF-8 to eliminate the five- and six-byte forms.
>
> Do they intend to deprecate private use characters in the ranges 00E00000
> to 00FF0000 and 60000000 to 7FFFFFFF?

Yes. This decision has already been taken by WG2 in March at the Beijing
meeting:

Resolution M38.6 (Restriction of encoding space):

"WG2 accepts the proposal in document N2175 towards removing the provision
for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to
ensure internal consistency in the standard between UCS-4, UTF-8 and
UTF-16 encoding formats, and instructs its editor [to] prepare suitable
text for processing as a future Technical Corrigendum or an Amendment to
10646-1:2000."

Of course, it will take a while for the TC or AMD to proceed through its
approval balloting, but this resolution was passed unanimously, and
there is little reason to suppose that it will not pass in the balloting.

>
> As far as I know, for the moment they are available for use, at least
> with UCS-4 (I understand they should be avoided if using UTF-32).

True, but given the above action by WG2, it would not be wise to start
using those codepoints at this time.
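
To illustrate why those ranges are the ones affected, here is a minimal
Python sketch. It assumes the original six-byte UTF-8 length ranges (as
laid out in RFC 2279); the function names are hypothetical and chosen
only for this example.

```python
# Upper bound of the code point range covered by each UTF-8 length,
# under the original definition that allowed up to six bytes.
ORIGINAL_UTF8_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

# Highest code point reachable through UTF-16 surrogate pairs (end of Plane 16).
UTF16_MAX = 0x10FFFF


def utf8_length_original(cp: int) -> int:
    """Return the byte count for cp under the original six-byte UTF-8 scheme."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(ORIGINAL_UTF8_LIMITS, start=1):
        if cp <= limit:
            return length
    raise ValueError("code point is outside the 31-bit UCS-4 space")


def reachable_in_utf16(cp: int) -> bool:
    """True if cp lies within the range that UTF-16 can represent."""
    return 0 <= cp <= UTF16_MAX


if __name__ == "__main__":
    for cp in (0x10FFFD, 0x00E00000, 0x60000000, 0x7FFFFFFF):
        print(f"{cp:08X}: {utf8_length_original(cp)} bytes, "
              f"UTF-16 reachable: {reachable_in_utf16(cp)}")
```

Under these assumptions, the private use code points beyond Plane 16 are
exactly the ones that need the five- and six-byte forms and that UTF-16
cannot reach at all.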

As I have pointed out before, there are 131,068 private use characters
available in Planes 15 and 16. And those will remain. For most
applications, 131,068 (plus the 6400 available in the BMP) is clearly
sufficient. But in extreme cases, a simple extension mechanism of using
private use characters in pairs to reference entities would give you
the capability of encoding multiple billions of entities without
having to make use of codepoints beyond 0x10FFFD.
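
The arithmetic behind those figures can be checked with a short Python
sketch. The pairing scheme shown here is only one possible reading of
"private use characters in pairs", not a defined mechanism; the names
pair_encode and pair_decode are hypothetical.

```python
# Private use ranges that stay within U+0000..U+10FFFD.
BMP_PUA = range(0xE000, 0xF8FF + 1)            # BMP private use area
PLANE15_PUA = range(0xF0000, 0xFFFFD + 1)      # Plane 15, minus FFFE/FFFF
PLANE16_PUA = range(0x100000, 0x10FFFD + 1)    # Plane 16, minus FFFE/FFFF

SUPPLEMENTARY_PUA_COUNT = len(PLANE15_PUA) + len(PLANE16_PUA)
PAIR_CAPACITY = SUPPLEMENTARY_PUA_COUNT ** 2   # distinct ordered pairs


def _index_to_cp(i: int) -> int:
    """Map 0..SUPPLEMENTARY_PUA_COUNT-1 onto the Plane 15/16 private use code points."""
    if i < len(PLANE15_PUA):
        return PLANE15_PUA[i]
    return PLANE16_PUA[i - len(PLANE15_PUA)]


def _cp_to_index(cp: int) -> int:
    """Inverse of _index_to_cp."""
    if cp in PLANE15_PUA:
        return cp - PLANE15_PUA.start
    if cp in PLANE16_PUA:
        return len(PLANE15_PUA) + (cp - PLANE16_PUA.start)
    raise ValueError("not a Plane 15/16 private use code point")


def pair_encode(entity: int) -> tuple[int, int]:
    """Assumed scheme: represent an entity number as two private use code points."""
    if not 0 <= entity < PAIR_CAPACITY:
        raise ValueError("entity number out of range")
    hi, lo = divmod(entity, SUPPLEMENTARY_PUA_COUNT)
    return _index_to_cp(hi), _index_to_cp(lo)


def pair_decode(pair: tuple[int, int]) -> int:
    """Recover the entity number from a pair produced by pair_encode."""
    hi, lo = pair
    return _cp_to_index(hi) * SUPPLEMENTARY_PUA_COUNT + _cp_to_index(lo)


if __name__ == "__main__":
    print(len(BMP_PUA), SUPPLEMENTARY_PUA_COUNT, PAIR_CAPACITY)
```

With this particular pairing, the two supplementary planes alone give
131,068 squared, or roughly 17 billion, distinct pairs, all built from
code points at or below 0x10FFFD.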

--Ken

>
>
> Antoine
>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT