The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Question about invalid code points
Posted: Thu Dec 23, 2010 7:21 pm
Guest

Joined: Tue Dec 21, 2010 5:26 pm
Posts: 3
From the following question, which first appeared on the Unicode list, it appears there's some lingering confusion about invalid code points:
Quote:
I have long thought only values corresponding to surrogates were invalid codes. But I recently discovered ...some other codes are invalid, like fffe & ffff.
I'm a bit lost. What's true, then? And where can I find actual and *clear* definitions of code validity?
I also discovered ... that ICU does not reject unpaired surrogates...???


 Post subject: Re: Question about invalid code points
Posted: Thu Dec 23, 2010 7:56 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 185
There are 66 noncharacters in Unicode: the 34 code points ending in FFFE or FFFF (in other words, the last two code points of each of the 17 planes), plus a contiguous range of 32 noncharacter code points from U+FDD0 to U+FDEF in the BMP.
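
As a rough sketch of that rule (Python; the function name `is_noncharacter` is my own and not something defined by the standard), the following checks a code point against both groups and counts the matches across the whole codespace:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    # U+FDD0..U+FDEF: the contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each plane: the low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Count noncharacters over the full codespace 0..10FFFF.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)  # 66
```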

For more information, see Chapter 3 of the Unicode Standard, Version 6.0, especially the definitions of "well-formed" vs. "ill-formed".


 Post subject: Re: Question about invalid code points
Posted: Tue Mar 05, 2013 7:17 pm

Joined: Mon Mar 04, 2013 6:58 am
Posts: 6
Location: Metz, France
Moreover, as I understand it, there is no such concept as an invalid code‑point.

The Unicode Standard, in §2.4, says:
Unicode Standard 6.2 wrote:
The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.

In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.

A code‑point is just anything in the code‑space, which is the type's domain. If there are 1,114,112 elements in the code‑space and there are 1,114,112 code‑points, then they all are code‑points, which implies there is no such thing as an “invalid code‑point”.

However, some code‑points, the ones in the range D800h .. DFFFh, cannot be encoded in a UTF‑8/16/32 sequence (otherwise the sequence is ill‑formed). That implies these code‑points cannot be transmitted, but it still does not mean they are invalid code‑points.

A code‑point can only be valid or not valid with respect to a particular purpose or usage.
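
To illustrate, here is a small Python 3 sketch (the helper `can_encode` is a name I made up for this post). It relies on CPython's built-in codecs, which refuse to encode surrogate code points by default, while a noncharacter such as FFFEh encodes without complaint:

```python
def can_encode(cp: int, encoding: str = "utf-8") -> bool:
    """Return True if the code point cp can be serialised in the given encoding."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        # Python's strict error handler rejects surrogates (D800h..DFFFh).
        return False


print(can_encode(0xD800))    # False: a surrogate code point
print(can_encode(0xFFFE))    # True: a noncharacter, but still encodable
print(can_encode(0x10FFFF))  # True: the last code point of the codespace
```

So "cannot be encoded" and "noncharacter" are two different properties of a code‑point, and neither is the same thing as "invalid".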


 Post subject: Re: Question about invalid code points
Posted: Wed Mar 06, 2013 1:18 pm

Joined: Mon Mar 04, 2013 6:58 am
Posts: 6
Location: Metz, France
Hibou57 wrote:
However, some code‑points, the ones in the range D800h .. DFFFh, cannot be encoded in a UTF‑8/16/32 sequence (otherwise the sequence is ill‑formed).

Since good understanding goes hand in hand with appropriate naming: the serialisable code‑points are called Unicode scalar values.

Unicode at chapter §2.4 wrote:
Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values.
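
Put as arithmetic, the scalar values are the code‑space minus the surrogate range. A minimal Python sketch (the name `is_scalar_value` is mine, chosen for illustration):

```python
CODESPACE_SIZE = 0x110000  # code points 0 .. 10FFFF


def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a code point outside the surrogate range D800..DFFF."""
    return 0 <= cp < CODESPACE_SIZE and not 0xD800 <= cp <= 0xDFFF


surrogates = 0xDFFF - 0xD800 + 1
scalar_values = sum(is_scalar_value(cp) for cp in range(CODESPACE_SIZE))

print(CODESPACE_SIZE)  # 1114112 code points
print(surrogates)      # 2048 surrogate code points
print(scalar_values)   # 1112064 scalar values
```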

