RE: What does it mean to "not be a valid string in Unicode"?

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Sat, 5 Jan 2013 00:02:01 +0000

One of the reasons why the Unicode Standard avoids the term “valid string” is that it immediately begs the question: valid *for what*?

The Unicode string <U+0061, U+FFFF, U+0062> is just a sequence of 3 Unicode characters. It is valid *for* use in internal processing, because for my own processing I can decide I need to use the noncharacter value U+FFFF for some internal sentinel (or whatever). It is not, however, valid *for* open interchange, because there is no conformant way by the standard (by design) for me to communicate to you how to interpret U+FFFF in that string. However, the string <U+0061, U+FFFF, U+0062> is valid *as* a NFC-normalized Unicode string, because the normalization algorithm must correctly process all Unicode code points, including noncharacters.
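For the record, the noncharacters are a small, closed set: U+FDD0..U+FDEF plus the last two code points of every plane, 66 in all. A minimal C sketch of that test, assuming the value is already a code point in the range 0..10FFFF (ICU4C exposes an equivalent check as the U_IS_UNICODE_NONCHAR macro, if memory serves):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if cp is one of the 66 noncharacters: U+FDD0..U+FDEF,
       or any code point whose low 16 bits are FFFE or FFFF.
       Assumes cp is already in the range 0..0x10FFFF. */
    bool is_noncharacter(uint32_t cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || ((cp & 0xFFFE) == 0xFFFE);
    }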

The Unicode string <U+0061, U+E000, U+0062> contains a private use character, U+E000. That string is valid *for* open interchange, but it is not interpretable according to the standard itself. It requires an external agreement as to the interpretation of U+E000.
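The private-use ranges are likewise fixed by the standard: U+E000..U+F8FF in the BMP, plus planes 15 and 16. A companion sketch, with the same caveats as above; note that passing this test tells you nothing about what the character means, which only the external agreement can supply:

    #include <stdbool.h>
    #include <stdint.h>

    /* True if cp is a private-use code point: U+E000..U+F8FF,
       U+F0000..U+FFFFD, or U+100000..U+10FFFD. Interpretation is
       entirely a matter of private agreement, not of the standard. */
    bool is_private_use(uint32_t cp) {
        return (cp >= 0xE000   && cp <= 0xF8FF)
            || (cp >= 0xF0000  && cp <= 0xFFFFD)
            || (cp >= 0x100000 && cp <= 0x10FFFD);
    }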

The Unicode string <U+0061, U+002A, U+0062> (“a*b”) is not valid *as* an identifier, because it contains a pattern-syntax character, the asterisk. However, it is certainly valid *for* use as an expression, for example.
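To make the identifier case concrete, here is a rough sketch along the lines of the default identifier definition in UAX #31, using ICU4C's property API (u_hasBinaryProperty with UCHAR_ID_START and UCHAR_ID_CONTINUE). A real implementation would also have to settle profile questions such as medial characters and normalization, so treat this as a sketch only:

    #include <stdbool.h>
    #include <unicode/uchar.h>   /* u_hasBinaryProperty(), UCHAR_ID_START, ... */

    /* Rough UAX #31 default-identifier test over a code point array:
       first code point must be ID_Start, the rest ID_Continue. */
    bool is_default_identifier(const UChar32 *s, int32_t len) {
        if (len <= 0 || !u_hasBinaryProperty(s[0], UCHAR_ID_START))
            return false;
        for (int32_t i = 1; i < len; ++i)
            if (!u_hasBinaryProperty(s[i], UCHAR_ID_CONTINUE))
                return false;
        return true;   /* "a*b" fails: '*' is neither ID_Start nor ID_Continue */
    }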

And so on up the chain of potential uses to which a Unicode string could be put.

People (and particularly programmers) should not get too hung up on the notion of validity of a Unicode string, IMO. It is not some absolute kind of condition which should be tested in code with a bunch of assert() conditions every time a string hits an API. That way lies bad implementations of bad code. ;-)

Essentially, most Unicode string handling APIs just pass through string pointers (or string objects) the same way old ASCII-based programs passed around ASCII strings. Checks for “validity” are only done at points where they make sense, and where the context is available for determining what the conditions for validity actually are. For example, a character set conversion API absolutely should check for ill-formed UTF-8 and have appropriate error handling, as well as check for uninterpretable conversions (a mapping not in the table), again with appropriate error handling.
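As one hedged illustration of checking where the context tells you what validity means: ICU4C's u_strFromUTF8() converts UTF-8 to UTF-16 and reports ill-formed input through its UErrorCode out-parameter, so the check happens exactly at the conversion boundary and the caller decides what to do about a failure:

    #include <stdio.h>
    #include <unicode/ustring.h>   /* u_strFromUTF8(), u_errorName() */

    int main(void) {
        /* 0xC0 0x80 is an overlong encoding of U+0000: ill-formed UTF-8. */
        const char utf8[] = { 'a', (char)0xC0, (char)0x80, 'b', 0 };
        UChar utf16[16];
        int32_t len16 = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(utf16, 16, &len16, utf8, -1, &status);
        if (U_FAILURE(status)) {
            /* Conversion time is where this check belongs. */
            printf("ill-formed UTF-8: %s\n", u_errorName(status));
            return 1;
        }
        printf("converted to %d UTF-16 code units\n", (int)len16);
        return 0;
    }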

But, on the other hand, an API which converts Unicode strings between UTF-8 and UTF-16, for example, absolutely should not – must not – concern itself with the presence of a defective combining character sequence. If it doesn’t convert the defective combining character sequence in UTF-8 into the corresponding defective combining character sequence in UTF-16, then the API is just broken. Never mind the fact that the defective combining character sequence itself might not then be valid *for* some other operation, say a display algorithm which detects that as an unacceptable edge condition and inserts a virtual base for the combining mark in order not to break the display.
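A quick sketch of that pass-through behavior, again with ICU4C's u_strFromUTF8(): a lone combining mark is perfectly well-formed UTF-8, so a correct converter hands it through to UTF-16 without complaint:

    #include <stdio.h>
    #include <unicode/ustring.h>   /* u_strFromUTF8(), u_errorName() */

    int main(void) {
        /* 0xCC 0x81 is U+0301 COMBINING ACUTE ACCENT with no base character:
           a defective combining character sequence, but well-formed UTF-8. */
        const char utf8[] = { (char)0xCC, (char)0x81, 'b', 0 };
        UChar utf16[8];
        int32_t len16 = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(utf16, 8, &len16, utf8, -1, &status);
        /* Expect success: the converter's job is encoding form, not
           combining-sequence shape. utf16[0] should be 0x0301. */
        printf("%s, %d code units, first is U+%04X\n",
               u_errorName(status), (int)len16, (unsigned)utf16[0]);
        return 0;
    }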

--Ken




What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler ("intended for process-internal uses, but [...] not permitted for interchange"), what precisely does that mean? Naively, all strings over the alphabet {U+0000, ..., U+10FFFF} seem "valid", but section 16.7 clarifies that noncharacters are "forbidden for use in open interchange of Unicode text data". I'm assuming there is a set of isValidString(...)-type ICU calls that deals with this? Yes, I'm sure this has been asked before and ICU documentation has an answer, but this page
    http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets, and it's imo unclear how to add them up. An implementation can use characters that are "invalid in interchange", but I wouldn't expect implementation-internal aspects of anything to be subject to a standard in the first place (so why write this?). It also makes me wonder about the runtime of an algorithm that checks whether a Unicode string of a particular length is valid. Complexity-wise the answer is of course "linear", but since it, or a variation of it (depending on how one treats holes and noncharacters), depends on the positions of those special characters, how fast does this function perform in practice? This also relates to Markus Scherer's reply to the "holes" thread just now.
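To pin down what I have in mind: something like the following hypothetical scan (my own sketch, not an actual ICU entry point) is a single pass over UTF-32, so linear in the length, with only a few comparisons per code point no matter where the special characters sit:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical "valid for open interchange" scan over UTF-32:
       rejects out-of-range values, surrogate code points, and noncharacters.
       One pass, O(n), a handful of comparisons per code point. */
    bool valid_for_interchange(const uint32_t *s, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            uint32_t cp = s[i];
            if (cp > 0x10FFFF) return false;                /* not a code point   */
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate          */
            if (cp >= 0xFDD0 && cp <= 0xFDEF) return false; /* noncharacter       */
            if ((cp & 0xFFFE) == 0xFFFE) return false;      /* ...FFFE or ...FFFF */
        }
        return true;
    }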

Stephan