Below '#' is used to quote from the Unicode 3.2 standard as proposed
in PDUTR #28, and '>' is used to quote my suggested changes.
Patches on patches
Because PDUTR #28 is a delta to Unicode 3.1, three documents have
to be consulted in order to reconstruct the intended text. This gets
particularly convoluted for the conformance chapter. For example, the
single clause C10 is now defined partly by PDUTR #28, partly
by UAX #27, and partly by Unicode 3.0.
I think it would be much clearer to write out the clauses/definitions
that have changed since 3.0, including the descriptions of UTF-8,
UTF-16 and UTF-32, in full (with the definition of UTF-16 replacing
section 3.7).
In fact, there have been so many changes affecting the conformance
chapter since Unicode 3.0 that the best approach IMHO would be to
republish the whole chapter. I know that probably can't be done before
the March deadline for 3.2, but it should be done as a minor revision
(say 3.2.1) within the next few months.
A minor nit is that the standard doesn't use consistent terminology for
sequences of code units. Instead it uses any of the following terms:
"coded character representation"
"coded character sequence"
"code unit sequence"
"byte sequence" (in C11)
I'll consistently use "code sequence" below.
Irregular sequences in UTF-32
Suppose that a UTF-32 string is converted to UTF-16. UAX #19 does
not prohibit pairs of high and low surrogate code points in UTF-32,
so most implementations would probably convert those pairs as-is.
This means that the same kinds of potential security problems can
occur for irregular sequences in UTF-32 (and also UTF-EBCDIC), as
for UTF-8. Therefore, UTF-32 and UTF-EBCDIC should be modified to
prohibit code sequences that are currently defined as irregular.
(The specification of UTF-32 is short enough to write it out in
full as I suggested above; UTF-EBCDIC should be modified separately.)
That would make the definition of UTF-32 more consistent with ISO 10646,
and it would ensure the following properties:
- the set of strings that can be represented in all UTFs is the same.
- for each UTF with fixed byte order, there is only one way to
represent a given string.
- conversions between UTFs are bijective on the set of well-formed
code sequences, and produce errors for ill-formed code sequences.
Note that this change should not cause any compatibility problems,
because it has always been nonconformant (to both Unicode and
ISO 10646) to generate UTF-32/UCS-4 strings that include surrogate
codes.
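To make the problem concrete, here is a minimal sketch in Python (my
own illustration, not a conformant converter) of how a UTF-32 to UTF-16
converter that does not check for surrogate code points ends up with
two representations of the same UTF-16 string:

    def naive_utf32_to_utf16(code_points):
        # Convert 32-bit code points to UTF-16 code units *without*
        # rejecting surrogate code points (D800..DFFF).
        units = []
        for cp in code_points:
            if cp < 0x10000:
                units.append(cp)                     # surrogates pass through unchecked
            else:
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))    # high surrogate
                units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        return units

    # The ill-formed UTF-32 sequence <0000D800, 0000DC00> and the
    # well-formed sequence <00010000> produce identical UTF-16 output:
    # two representations of the "same" string.
    assert naive_utf32_to_utf16([0xD800, 0xDC00]) == naive_utf32_to_utf16([0x10000])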
Definitions
Many of the definitions relating to UTFs are now incorrect or inaccurate:
- The comment after D4 seems to assume UTF-16 is the only encoding
form.
- In D5, delete the sentence "These 16-bit code values are also
known simply as /Unicode values/." Also change the definition
to be of "code unit" with "code value" as a synonym, rather than
vice-versa.
- Delete the note after D6, which incorrectly says that surrogate
codes are "not currently used to represent any abstract characters".
- The note to D7 says:
# Unless specified otherwise for clarity, in the text of the Unicode
# Standard the term /character/ alone generally designates a coded
# character representation. Similarly, the term /character sequence/
# alone generally designates a coded character sequence.
I'm not convinced about that. AFAICS, "character" used alone almost
always means a Unicode abstract character (not including combining
sequences); it does *not* usually mean a character representation.
I suggest deleting this note, and replacing the definition with:
> D7 Code sequence: an ordered sequence of code units.
Then change the fourth note to D3, to:
> - The abstract characters encoded by the Unicode Standard are
> known as Unicode abstract characters. Unless otherwise
> specified for clarity, in the text of the Unicode Standard
> the term /character/ alone generally designates a Unicode
> abstract character.
- D29 says the following, which should be deleted since it is
no longer correct (at least for the "UTF-8 -> code point sequence"
mapping):
# To ensure that round-trip transcoding is possible, a UTF mapping
# /must also/ map invalid Unicode scalar values to unique code
# value sequences. These invalid scalar values include FFFE_16,
# FFFF_16, and unpaired surrogates.
If irregular sequences are also disallowed for UTF-32, then for
any UTF, round-trip transcoding is only possible for Unicode strings
that do not contain U+D800..DFFF (but that may contain noncharacters).
- D31 should be deleted. If all UTFs are one-to-one, then illegal
code sequences are now by definition the same thing as
ill-formed code sequences. Also, "filtering out" such sequences
should not be allowed (see my addition to C12 below).
- D32 should be deleted. (UAX#27 removed the second sentence about
UTF-8, but disallowing irregular UTF-32 also makes the remaining
part redundant.)
- D36 (as modified by UAX#27) still defines "illegal" and "irregular"
sequences for UTF-8, when these sequences should all now be called
"ill-formed".
Also, the term "Unicode value" is defined circularly: it is described
in D5 as a UTF-16 code unit, but is also used in D35 to define what
UTF-16 is. Nowhere (not even in section 3.7) is there an adequate
non-circular definition of it. I suggest removing this term; instead
"UTF-16 code unit" should be defined by giving the mapping of a valid
sequence of code points to valid UTF-16, and similarly for UTF-8 and
UTF-32. There is nothing special about UTF-16 that makes it different
from the other UTFs.
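To illustrate what I mean by defining UTF-16 directly as a mapping from
code points, here is a rough sketch in Python (the function name and
error handling are my own choices, not the standard's):

    def utf16_encode(code_points):
        # Map Unicode scalar values (code points excluding D800..DFFF)
        # to a sequence of UTF-16 code units.
        units = []
        for cp in code_points:
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("not a Unicode scalar value: %#x" % cp)
            if cp < 0x10000:
                units.append(cp)                     # one code unit
            else:
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))    # high surrogate
                units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        return units

    assert utf16_encode([0x0041, 0x10000]) == [0x0041, 0xD800, 0xDC00]

UTF-8 and UTF-32 can be specified as analogous mappings, with no need
for the term "Unicode value" at all.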
Conformance clauses
The text of C5 as given in Unicode 3.1, and of C10 as proposed for Unicode 3.2, is:
# C5 A process shall not interpret a noncharacter code point as an
# abstract character.
#
# - The code points may be used internally, such as for sentinel
# values or delimiters, but should not be exchanged publicly.
#
# C10 A process shall make no change in a valid coded character
# representation other than the following, if that process
# purports not to modify the interpretation of that coded
# character sequence:
# (a) the possible replacement of character sequences by their
# canonical-equivalent sequences, or
# (b) the deletion of noncharacter code points, or
# (c) the replacement of U+FEFF ZERO WIDTH NO-BREAK SPACE, where
# not used with signature semantics, by U+2060 WORD JOINER
#
# - If a noncharacter which does not have a specific internal use is
# unexpectedly encountered in processing, an implementation may signal
# an error or delete or ignore the noncharacter. If these options are
# not taken, the noncharacter should be treated as an unassigned code
# point. For example, an API that returned a character property value
# for a noncharacter would return the same value as the default value
# for an unassigned code point.
#
# [These notes were in Unicode 3.0; it's not clear to me whether
# Unicode 3.1 deleted them or not:]
#
# - Replacement of a character sequence by a compatibility-equivalent
# sequence does modify the interpretation of the text.
#
# - Replacement or deletion of a character sequence that the process
# cannot or does not interpret does modify the interpretation of the
# text.
#
# - Changing the bit or byte ordering when transforming between different
# machine architectures does not modify the interpretation of the text.
#
# - Transforming to a different encoding form does not modify the
# interpretation of the text.
C10 (c) is extremely ugly; why treat this case as an exception?
There are literally hundreds of other characters that are no longer,
or never were, the preferred encoding of a particular semantic.
IMHO, it makes no sense to have a special rule that treats changing
ZERO WIDTH NO-BREAK SPACE to WORD JOINER as "not a modification", when
changing any of these other characters to their preferred forms *is*
a modification.
On the contrary, changing ZERO WIDTH NO-BREAK SPACE to WORD JOINER
*must* be considered a modification to the string, since WORD JOINER
is not supported by a large amount of existing software (it will
display as an 'unknown glyph' box). A Unicode 3.2 process that may
output data to processes supporting only Unicode 3.1 or earlier will
have to take that into account.
C10 (b), allowing deletion of noncharacters, may lead to security
problems along the same lines as irregular encodings in UTFs - see
the example below. (This problem was introduced in Unicode 3.1;
Unicode 3.0 did not say anything about deletion of noncharacters
not modifying the interpretation of a string.)
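Here is a rough sketch of the kind of problem I mean, in Python (the
check and the later deletion step are hypothetical, but both behaviours
appear to be permitted by the 3.1 wording):

    def security_check(filename):
        # Reject names containing "..", without first deciding what
        # should happen to noncharacters.
        return ".." not in filename

    def later_layer(filename):
        # A downstream component that deletes noncharacters (permitted
        # by Unicode 3.1's C10 (b)) before using the name.
        return "".join(c for c in filename if ord(c) not in (0xFFFE, 0xFFFF))

    name = ".\uFFFE."                    # <U+002E, U+FFFE, U+002E>
    assert security_check(name)          # passes the ".." test...
    assert later_layer(name) == ".."     # ...but ends up being used as ".."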
This is what I think clauses C5 and C10 should be:
> C5 A process shall not interpret a noncharacter code point as an
> abstract character, and shall not generate noncharacters in
> code sequences that are transferred publicly.
>
> - If a noncharacter is encountered in a code sequence input from
> another process, then the process must either signal an error,
> or treat the noncharacter as it would an unassigned character.
> For example, an API that returns a character property value for
> a noncharacter would either signal an error, or return the same
> value as the default value for an unassigned code point, and
> should document which of these alternatives is used.
>
> - If a process inputs and then outputs part of a string without
> changing it, the fact that noncharacters may appear in the output
> if they appeared in the input does not imply that the process is
> non-conformant.
>
> C10 A process shall make no change in a valid code sequence other
> than the possible replacement of character sequences by their
> canonical-equivalent sequences, if that process purports not to
> modify the interpretation of that code sequence.
>
> - This requirement does not preclude replacing characters or character
> sequences with variants that more accurately reflect the intended
> meaning (for example, replacing U+FEFF ZERO WIDTH NO-BREAK SPACE,
> where not used as a byte order signature, with U+2060 WORD JOINER).
> Similarly, it does not preclude replacing deprecated characters
> or compatibility composites with a compatibility-equivalent sequence.
> However, such changes must be considered as modifying the
> interpretation of the text, since other processes that receive the
> modified string will potentially treat it differently to the
> original string.
>
> - Replacement or deletion of a character sequence that the process
> cannot or does not interpret does modify the interpretation of the
> text.
>
> - Note that version 3.1 of Unicode defined deletion of noncharacters
> as not changing the interpretation of text, but also allowed them
> to be treated as errors or to be ignored. This might introduce
> security problems in some situations.
>
> For example, suppose that a security check is performed
> that involves testing for the substring ".." (<U+002E, U+002E>) in
> a file name. If it is left ambiguous whether or not noncharacters
> are to be deleted, then the string <U+002E, U+FFFE, U+002E> could
> potentially pass this check, but still be treated as equivalent to
> ".." by the filesystem. Accordingly the required behaviour for
> processes that receive noncharacters in input has been changed (see
> C5 above): this should either cause an error, or the noncharacters
> should be treated as unassigned; they must not be automatically
> deleted. This does not preclude a higher-level protocol from
> specifying explicitly that a string should be modified by deleting
> noncharacters at a well-defined stage of its processing.
>
> - Changing the bit or byte ordering when transforming between different
> machine architectures does not modify the interpretation of the text.
>
> - Transforming to a different Unicode Transformation Format does not
> modify the interpretation of the text.
Add to the end of C9:
> - A higher-level protocol may require a particular normalization
> form to be used for a given data format. In that case a process
> that reads the format may assume that it is appropriately normalized;
> this may mean that it does not need to interpret canonically
> equivalent strings identically. However, failure of this assumption
> must not be allowed to lead to any security weakness. If a process
> is designed to perform strict validation of such a format, it
> should detect whether data is properly normalized.
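As a sketch of what "strict validation" could look like (Python; the
choice of NFC and the function name are mine, not anything required by
the standard):

    import unicodedata

    def validate_nfc(text):
        # Strict validation for a hypothetical format that requires NFC:
        # reject non-normalized input rather than silently renormalizing.
        if unicodedata.normalize("NFC", text) != text:
            raise ValueError("input is not in Normalization Form C")
        return text

    validate_nfc("caf\u00E9")            # precomposed e-acute: accepted
    try:
        validate_nfc("cafe\u0301")       # decomposed form: rejected
    except ValueError:
        pass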
UAX#27 added some notes to C12:
# - The definition of each UTF specifies the illegal code unit sequences
# in that UTF. For example, the definition of UTF-8 (D36) specifies
# that code unit sequences such as <C0 AF> are illegal.
#
# - Internally, a particular function might be used that does not check
# for illegal code unit sequences. However, a conformant process can
# use that function only on data that has already been certified to
# not contain any illegal code unit sequences.
Change "illegal" to "ill-formed" in the above two notes.
# - Processes that require unique representation must not interpret
# irregular UTF code unit sequences as characters. They may, for
# example, reject or remove those sequences.
#
# - Processes may transform irregular code unit sequences into the
# equivalent well-formed code unit sequences.
Delete these two notes. If UTF-32 is changed as I suggested then there
are no longer any irregular code sequences, and even if there were, it
would be incorrect and potentially insecure to "remove" (i.e. silently
delete) them. The second of these notes contradicts the new semantics
for UTF-8: "transforming irregular code unit sequences into the
equivalent well-formed code unit sequences" is the same thing as
interpreting them.
The fifth note doesn't need changing from the version in PDUTR #28,
but I'll write it out just for context:
# - Conformant processes cannot interpret ill-formed code unit
# sequences. However, the conformance clauses do not, for
# example, prevent utility programs from operating on "mangled"
# text. For example, a UTF-8 file could have had CRLF sequences
# introduced at every 80 bytes by a bad mailer program. This
# could result in some UTF-8 byte sequences being interrupted
# by CRLFs, producing ill-formed byte sequences. This mangled
# text is no longer UTF-8. It is permissible for a conformant
# program to repair such text, recognizing that the mangled
# text was originally well-formed UTF-8 byte sequences. However,
# such repair of mangled data is a special case, and must not be
# used in circumstances where it would cause security problems.
Add two more notes:
> - The error condition that results from an ill-formed code sequence
> need not cause all of the input in which the error occurs to be
> rejected. It is permitted to store text in a form that allows
> ill-formed code sequences to be regenerated when the text is output,
> but only if this output is in the same Unicode Transformation
> Format as the original ill-formed input.
The reason for requiring that ill-formed sequences be regenerated only
if the output UTF is the same as the input UTF is that otherwise we
would effectively be endorsing non-standard UTFs that are not bijective.
For example, consider the "UTF-8B" proposal for round-trip conversion
of UTF-8 -> UTF-16 -> UTF-8 (do a search in the archive of this list).
That method is fine *provided that the non-standard UTF-16 that it
produces can only appear internally*. If it can appear externally and
is interpreted by general-purpose UTF-16 -> UTF-8 encoders, then that
would create the same multiple representation problem for UTF-16 that
Unicode 3.2 is trying to fix for UTF-8, because the UTF-16 -> UTF-8B
conversion is not one-to-one.
The above text tries to prohibit this, but without prohibiting, say,
a UTF-16-based text-editor that uses UTF-8B in order to read and
write UTF-8 files without destroying ill-formed sequences. The latter
is harmless and does not create any multiple representation problems.
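As an aside, the behaviour I have in mind can be demonstrated with
Python's "surrogateescape" error handler, which works essentially like
UTF-8B: ill-formed bytes round-trip through the internal form, but a
strict encoder refuses to emit that internal form as standard UTF-16
(a sketch for illustration, not a conformance requirement):

    mangled = b"abc\xc0def"                  # ill-formed UTF-8 (stray 0xC0)

    # The bad byte is represented internally by a lone surrogate (U+DCC0).
    internal = mangled.decode("utf-8", errors="surrogateescape")

    # Re-encoding to the *same* UTF regenerates the original bytes...
    assert internal.encode("utf-8", errors="surrogateescape") == mangled

    # ...but the internal form must not escape as UTF-16: a strict
    # encoder rejects the lone surrogate.
    try:
        internal.encode("utf-16")
    except UnicodeEncodeError:
        pass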
> It is also permitted to
> replace an ill-formed code sequence by a code reserved by the
> implementation for that purpose, for example by a noncharacter code.
Should a specific code be reserved for this? It is not the same thing
as U+FFFD REPLACEMENT CHARACTER, even though that is what some
transcoders use. Plan-9 calls it "rune_error" and uses U+0080, IIRC.
I suggest U+FDEF.
> Ill-formed sequences should not be deleted, however, since that
> introduces similar security concerns to those described for
> noncharacters in the notes to clause C10.
> - Transformations between the Unicode 3.2 versions of UTF-8, UTF-16
> and UTF-32 are bijections between the corresponding sets of valid
> (i.e. not ill-formed) code sequences. Ill-formed code sequences
> detected during transformation are treated as error conditions
> as described above.
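The bijection property can be checked mechanically for any given
string; a trivial sketch in Python:

    def round_trips(text):
        # Well-formed text must survive conversion through each UTF and
        # back; text containing lone surrogates fails at the encode step.
        for utf in ("utf-8", "utf-16-le", "utf-32-le"):
            if text.encode(utf).decode(utf) != text:
                return False
        return True

    assert round_trips("A\u00E9\u4E2D\U0001D11E")   # BMP and supplementary characters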
At the end of the modified D21:
# - Replacing a compatibility composite by its compatibility decomposition
# may lose round-trip convertibility with a base standard.
add:
> In some cases it may also lose semantic distinctions, for example
> between a CJK radical and a corresponding ideograph, or between
> circled, subscripted, or superscripted characters and the
> corresponding plain character.
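For example, a quick check with Python's unicodedata module shows
these distinctions collapsing under compatibility normalization:

    import unicodedata

    # U+2F00 KANGXI RADICAL ONE folds to U+4E00, the corresponding
    # ideograph; the radical/ideograph distinction is lost.
    assert unicodedata.normalize("NFKC", "\u2F00") == "\u4E00"

    # Circled and superscripted characters fold to the plain characters.
    assert unicodedata.normalize("NFKC", "\u2460") == "1"   # CIRCLED DIGIT ONE
    assert unicodedata.normalize("NFKC", "\u00B2") == "2"   # SUPERSCRIPT TWO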
"Canonical Ordering" subsection of section 3.10
This has been superseded by UAX#15; it should be replaced by a
reference to UAX#15.
Section 3.12 (BiDi)
This has been superseded by UAX#9; it should be replaced by a
reference to UAX#9.
Phi glyphs
# With Unicode 3.0 and the concurrent second edition of ISO/IEC 10646-1,
# the reference glyphs for U+03C6 GREEK LETTER SMALL PHI and
# U+03D5 GREEK PHI SYMBOL were swapped.
I initially read this incorrectly, as saying that there was a mistake
in Unicode 3.0. It's actually saying that there was a correction
made in Unicode 3.0 (which hasn't been documented until now), but that
isn't clear until two paragraphs further down. It could be rewritten
more clearly.
PDUTR #28 HTML file
# In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through
# U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up
# large versions of the fences (, ), [, ], {, and }, and of the large
# operators &#x2211; and &#x222B;.
This is just a meta-comment on the HTML encoding: it's rather optimistic to
expect the NCRs ∑ and ∫ to be rendered correctly. Encoding
this as "... the large operators U+2211 (∑) and U+222B (∫).",
or as an image, would ensure that the meaning is not lost.
Standardized Variants HTML file
The description of the variant appearance for U+2269 is given
as "GREATER-THAN AND NOT DOUBLE EQUAL with vertical stroke".
It should be "GREATER-THAN BUT NOT EQUAL TO with vertical stroke".
--
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/