Whether a string is valid or invalid depends on what the string is
supposed to be.
1. As Ken says, if a string is supposed to be in a given encoding form
(UTF) but consists of an ill-formed sequence of code units for that
encoding form, it is invalid. So an isolated surrogate (e.g. 0xD800) in
UTF-16, or any surrogate (e.g. 0x0000D800) in UTF-32, makes the string
invalid. For example, a Java String may be an invalid UTF-16 string. See
http://www.unicode.org/glossary/#unicode_encoding_form
2. However, a "Unicode X-bit string" does not have the same restrictions:
it may contain sequences that would be ill-formed in the corresponding UTF-X
encoding form. So a Java String is always a valid Unicode 16-bit string,
even when it is ill-formed as UTF-16 (see the first sketch after this list).
See http://www.unicode.org/glossary/#unicode_string
3. Noncharacters are also valid in interchange, depending on the sense of
"interchange". The Unicode Standard says: "In effect, noncharacters can be
thought of as application-internal private-use code points." If I couldn't
ever interchange them, even internally to my application, or between the
different modules that compose my application, they'd be pointless. They
are, however, strongly discouraged in *public* interchange (a quick test
for them is given in the second sketch after this list). The glossary entry
and some of the standard's text are a bit old here, and need to be
clarified.
4. The quotation "we select a substring that begins with a combining
character, this new string will not be a valid string in Unicode" is
wrong. It *is* a valid Unicode string (see the third sketch after this
list). It isn't particularly useful in isolation, but it is valid. For
some *specific purpose*, any particular string might be invalid. For
example, the string "mark#d" might be invalid as a password in some
systems, where # is disallowed, or where passwords are required to be at
least 8 characters long.
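
To make (1) and (2) concrete, here is a minimal Java sketch (the class
name SurrogateDemo is mine, just for illustration). The String below is a
perfectly good Unicode 16-bit string, but a fresh CharsetEncoder, which
reports malformed input by default, refuses to encode it, because it is
ill-formed as UTF-16:

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.StandardCharsets;

    public class SurrogateDemo {
        public static void main(String[] args) {
            // A lone high surrogate: fine as a Unicode 16-bit string,
            // ill-formed as UTF-16.
            String s = "a\uD800b";
            try {
                // A fresh CharsetEncoder REPORTs malformed input by default.
                StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
                System.out.println("well-formed");
            } catch (CharacterCodingException e) {
                System.out.println("ill-formed as UTF-16: " + e);
            }
        }
    }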
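
For (3), here is a quick way to test for the 66 noncharacters, U+FDD0..U+FDEF
plus the last two code points of each of the 17 planes; the helper name
isNoncharacter is made up, and this is just a sketch:

    public class Noncharacters {
        // The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code
        // points of each plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ...).
        static boolean isNoncharacter(int codePoint) {
            return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
        }

        public static void main(String[] args) {
            System.out.println(isNoncharacter(0xFFFF));   // true
            System.out.println(isNoncharacter(0x10FFFF)); // true
            System.out.println(isNoncharacter('A'));      // false
        }
    }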
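
And for (4): nothing in Java objects to a string that begins with a
combining mark. It is a valid Unicode string, just not very useful until
it is attached to a base character (again, the class name is mine):

    public class CombiningDemo {
        public static void main(String[] args) {
            String s = "e\u0301";         // 'e' + U+0301 COMBINING ACUTE ACCENT
            String tail = s.substring(1); // begins with the combining mark
            // tail is a valid Unicode string; no string API rejects it.
            System.out.println(tail.length());        // 1
            System.out.println((int) tail.charAt(0)); // 769, i.e. 0x0301
            System.out.println("x" + tail);           // usable once given a base
        }
    }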
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene (The best is the enemy of the good) —*
On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:
>
> A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
>> don't follow the specification for UTF-8, for example.
>>
> Given that answer, add "in UTF-32" to my email just now, for simplicity's
> sake. Or let's simply assume we're dealing with some sort of sequence of
> abstract integers from 0x0 to 0x10FFFF, to abstract away from "encoding
> form" issues.
>
> Stephan