Re: [question] UTF-8 issue

From: Michael D'Errico (mike-list@pobox.com)
Date: Thu Oct 08 2009 - 11:56:57 CDT

    Chat,

    It is possible to incorrectly encode a character in UTF-8 using more
    bytes than necessary. For example, an ASCII character is in the range
    of 0 to 127, so it should be encoded into UTF-8 as a single byte with
    that value. But if you look at the way a character is encoded using
    2 bytes, it is possible to encode a number between 0 and 127 into the
    two-byte sequence:

       2-byte UTF-8: 110yyyyx 10xxxxxx
                             ^ ^^^^^^

    A number between 0 and 127 would require all of the 'y' bits to be 0
    with the value encoded in the 7 'x' bits. You could similarly encode
    the same numbers using a three-byte sequence or a four-byte sequence.
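
    As a concrete illustration, here is a small Java sketch (Java, since
    your question mentions the JDK's decoder) that builds the letter 'A'
    both ways and asks a strict CharsetDecoder to decode each one; the
    class and method names are made up for the example. On a JRE that
    rejects non-shortest form, such as the 1.6.0_11 you mention, the
    two-byte version fails:

       import java.nio.ByteBuffer;
       import java.nio.charset.CharacterCodingException;
       import java.nio.charset.Charset;
       import java.nio.charset.CharsetDecoder;
       import java.nio.charset.CodingErrorAction;

       public class OverlongDemo {
           public static void main(String[] args) {
               byte[] shortest = { 0x41 };                      // 'A' as one byte
               byte[] overlong = { (byte) 0xC1, (byte) 0x81 };  // same value, two bytes

               decode("shortest", shortest);
               decode("overlong", overlong);
           }

           static void decode(String label, byte[] bytes) {
               // REPORT makes the decoder throw instead of silently substituting
               CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT);
               try {
                   System.out.println(label + " -> \"" +
                                      dec.decode(ByteBuffer.wrap(bytes)) + "\"");
               } catch (CharacterCodingException e) {
                   System.out.println(label + " -> rejected (" + e + ")");
               }
           }
       }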

    The Unicode Consortium has declared these "non-shortest form"
    sequences to be illegal, and you must not treat them as valid. This
    is because it has been possible to trick some software into
    mishandling user input that contains these alternate encodings. If
    the software is trying to filter out certain harmful character
    sequences, for example, it might accidentally pass some malicious
    input through.

    To determine that the sequence above is illegal, you should note that
    all of the 'y' bits are 0, so the first byte of the sequence would be
    either C0 or C1 hex. So if your UTF-8 decoder ever sees C0 or C1 as a
    lead byte, it knows that it has found a non-shortest form sequence.
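
    In code that check is tiny; a minimal sketch (the helper name is
    made up for illustration):

       // A lead byte of C0 or C1 can only begin a non-shortest-form
       // two-byte sequence, so it can be rejected on sight.
       static boolean isOverlongTwoByteLead(int b) {
           b &= 0xFF;                     // treat the byte as unsigned
           return b == 0xC0 || b == 0xC1;
       }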

    Another way to do the check is to decode the UTF-8 sequence as if
    it were valid, then count the number of bits in the resulting value
    and check that it falls within the proper range (a short code
    sketch of this check follows the list):

       1-byte UTF-8: 7 bits or less
       2-byte UTF-8: between 8 and 11 bits
       3-byte UTF-8: between 12 and 16 bits
       4-byte UTF-8: between 17 and 21 bits

    (You don't need to check the upper bound since that is limited by the
    encoding.)
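
    A sketch of that range check, assuming the value has already been
    decoded and you know how many bytes the sequence used (the names
    MIN_VALUE and isShortestForm are made up here):

       // Smallest value that genuinely needs each sequence length:
       // 2 bytes => at least 0x80 (8 bits), 3 bytes => at least 0x800
       // (12 bits), 4 bytes => at least 0x10000 (17 bits).
       static final int[] MIN_VALUE = { 0, 0, 0x80, 0x800, 0x10000 };

       // 'value' is the number decoded as if the sequence were valid;
       // 'length' is how many bytes the sequence used (1 to 4).
       static boolean isShortestForm(int value, int length) {
           return value >= MIN_VALUE[length];
       }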

    Furthermore, you need to check that a 4-byte sequence does not encode
    a value beyond 10FFFF hex since those values are not characters. Also
    a 3-byte sequence must not encode a surrogate (D800 through DFFF hex).
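
    Both checks are easy to add once you have the decoded value; a
    minimal sketch:

       // Reject values that can never be characters: the surrogate
       // range D800..DFFF and anything beyond 10FFFF.
       static boolean isValidScalarValue(int value) {
           if (value >= 0xD800 && value <= 0xDFFF) return false;
           return value >= 0 && value <= 0x10FFFF;
       }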

    Mike

    ----
    http://mikestoolbox.com
    Chat S. Depasucat wrote:
    > 
    > I'm really thankful that I found this mailing list.
    > 
    > i have few UTF-8 issues that I wish somebody could give light on:
    > 
    > I understand that in UTF-8 encoding, Unicode characters can be
    > represented in more than one way. US-ASCII characters, for example,
    > can be represented in a "shortest form" and a "non-shortest form".
    > Because of these issues, java1.6.0_11 changed the UTF-8 charset
    > implementation to disregard the "non-shortest form".
    > 
    > Here are my questions:
    > 1. How does a UTF-8 decoder identify that a byte sequence is
    > illegal, i.e. that the sequence is in the non-shortest form?
    > 2. Who encodes characters in "non-shortest form", and how?
    >    For XML files, for example, who transforms the characters into
    > bytes that could end up in "shortest" or "non-shortest" form?
    >    When a program reads an XML file with UTF-8 encoding, is it
    > possible that the decoded bytes are in "non-shortest form"?
    > 
    > 
    > hope somebody could help me understand this.
    > thanks so much
    

