Re: [question] UTF-8 issue

From: Michael D'Errico (mike-list@pobox.com)
Date: Thu Oct 08 2009 - 11:56:57 CDT

    Chat,

    It is possible to incorrectly encode a character in UTF-8 using more
    bytes than necessary. For example, an ASCII character is in the range
    of 0 to 127, so it should be encoded into UTF-8 as a single byte with
    that value. But if you look at the way a character is encoded using
    2 bytes, it is possible to encode a number between 0 and 127 into the
    two-byte sequence:

       2-byte UTF-8: 110yyyyx 10xxxxxx
                             ^ ^^^^^^

    A number between 0 and 127 would require all of the 'y' bits to be 0
    with the value encoded in the 7 'x' bits. You could similarly encode
    the same numbers using a three-byte sequence or a four-byte sequence.
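
    As a concrete illustration, here is a small Java sketch (Java, since
    your question mentions the JDK's decoder) that builds the letter 'A'
    both ways and asks a strict CharsetDecoder to decode each one; the
    class and method names are made up for the example. On a JRE that
    rejects non-shortest form, such as the 1.6.0_11 you mention, the
    two-byte version fails:

       import java.nio.ByteBuffer;
       import java.nio.charset.CharacterCodingException;
       import java.nio.charset.Charset;
       import java.nio.charset.CharsetDecoder;
       import java.nio.charset.CodingErrorAction;

       public class OverlongDemo {
           public static void main(String[] args) {
               byte[] shortest = { 0x41 };                      // 'A' as one byte
               byte[] overlong = { (byte) 0xC1, (byte) 0x81 };  // same value, two bytes

               decode("shortest", shortest);
               decode("overlong", overlong);
           }

           static void decode(String label, byte[] bytes) {
               // REPORT makes the decoder throw instead of silently substituting
               CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT);
               try {
                   System.out.println(label + " -> \"" +
                                      dec.decode(ByteBuffer.wrap(bytes)) + "\"");
               } catch (CharacterCodingException e) {
                   System.out.println(label + " -> rejected (" + e + ")");
               }
           }
       }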

    The Unicode Consortium has declared these "non-shortest form"
    sequences to be illegal, and you must not treat them as valid. This
    is because it has been possible to trick some software into
    mishandling user input that contains these alternate encodings. If
    the software is trying to filter out certain harmful character
    sequences, for example, it might accidentally pass some malicious
    input through.

    To determine that the sequence above is illegal, you should note that
    all of the 'y' bits are 0, so the first byte of the sequence would be
    either C0 or C1 hex. So if your UTF-8 decoder ever sees C0 or C1 as a
    lead byte, it knows that it has found a non-shortest form sequence.
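
    In code that check is tiny; a minimal sketch (the helper name is
    made up for illustration):

       // A lead byte of C0 or C1 can only begin a non-shortest-form
       // two-byte sequence, so it can be rejected on sight.
       static boolean isOverlongTwoByteLead(int b) {
           b &= 0xFF;                     // treat the byte as unsigned
           return b == 0xC0 || b == 0xC1;
       }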

    Another way to do the check is to decode the UTF-8 sequence as if
    it were valid, then count the number of bits in the resulting value
    and check that it falls within the proper range (a short code
    sketch of this check follows the list):

       1-byte UTF-8: 7 bits or less
       2-byte UTF-8: between 8 and 11 bits
       3-byte UTF-8: between 12 and 16 bits
       4-byte UTF-8: between 17 and 21 bits

    (You don't need to check the upper bound since that is limited by the
    encoding.)
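
    A sketch of that range check, assuming the value has already been
    decoded and you know how many bytes the sequence used (the names
    MIN_VALUE and isShortestForm are made up here):

       // Smallest value that genuinely needs each sequence length:
       // 2 bytes => at least 0x80 (8 bits), 3 bytes => at least 0x800
       // (12 bits), 4 bytes => at least 0x10000 (17 bits).
       static final int[] MIN_VALUE = { 0, 0, 0x80, 0x800, 0x10000 };

       // 'value' is the number decoded as if the sequence were valid;
       // 'length' is how many bytes the sequence used (1 to 4).
       static boolean isShortestForm(int value, int length) {
           return value >= MIN_VALUE[length];
       }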

    Furthermore, you need to check that a 4-byte sequence does not encode
    a value beyond 10FFFF hex since those values are not characters. Also
    a 3-byte sequence must not encode a surrogate (D800 through DFFF hex).
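
    Both checks are easy to add once you have the decoded value; a
    minimal sketch:

       // Reject values that can never be characters: the surrogate
       // range D800..DFFF and anything beyond 10FFFF.
       static boolean isValidScalarValue(int value) {
           if (value >= 0xD800 && value <= 0xDFFF) return false;
           return value >= 0 && value <= 0x10FFFF;
       }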

    Mike

    ----
    http://mikestoolbox.com
    Chat S. Depasucat wrote:
    > 
    > I'm really thankful that I found this mailing list.
    > 
    > i have few UTF-8 issues that I wish somebody could give light on:
    > 
    > I understand that in UTF-8 encoding, Unicode characters can be
    > represented in more than one way. US-ASCII characters, for example,
    > can be represented in a "shortest form" and a "non-shortest form".
    > Because of these issues, java1.6.0_11 changed the UTF-8 charset
    > implementation to disregard the "non-shortest form".
    > 
    > Here are my questions:
    > 1. How does a UTF-8 decoder identify that a byte sequence is
    > illegal, i.e. that the sequence is in the non-shortest form?
    > 2. Who encodes characters in "non-shortest form", and how?
    >    For XML files, for example, who transforms the characters into
    > bytes that could end up in "shortest" or "non-shortest" form?
    >    When a program reads an XML file with UTF-8 encoding, is it
    > possible that the decoded bytes are in "non-shortest form"?
    > 
    > 
    > hope somebody could help me understand this.
    > thanks so much
    

