Re: Comments on <draft-ietf-acap-mlsf-00.txt>?

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Thu Jun 05 1997 - 18:11:00 EDT


On Thu, 5 Jun 1997, Timothy Pardridge wrote:

> In message <9706041245.AA11103@unicode.org> you recently said:
>
> > > Any comments on
> > > ftp://ds.internic.net/internet-drafts/draft-ietf-acap-mlsf-00.txt
> > > ?
> >
> > > Language tags are encoded by mapping them to upper-case, then
> > > adding hexadecimal A0 to each octet. The result is broken up into
> > > groups of five octets followed by a final group of five or fewer
> > > octets. Each group is prefixed by a UTF-8-style length count with
> > > the low bits set to 0.
> >
> > If I have not misunderstood UTF-8 or "MLSF" completely:
> >
> > A.
> > 1. A UTF-8-style length count with the low bits set to 0 is
> > **not** an "illegal" UTF-8 "start character code" octet.
>
> I think they are unusual though because the low order bits (except
> the highest one) will have at least one bit set because of the
> character being represented.

> e.g 00000yyyyyxxxxxx fits in two bytes. yyyyy is non zero since
> otherwise one byte could be used. The bytes are 110yyyyy and 10xxxxxx.
> MLSF would use 11000000 which can never occur in UTF-8.
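The two-byte argument above can be checked mechanically. The following is only a sketch: encode_two_byte is a hypothetical helper written for this illustration, not code from the draft or from any UTF-8 library. It encodes every code point that actually needs two bytes and collects the lead octets that result.

```python
def encode_two_byte(cp: int) -> bytes:
    """Encode cp in the two-byte UTF-8 form 110yyyyy 10xxxxxx."""
    if not 0 <= cp <= 0x7FF:
        raise ValueError("code point does not fit in 11 bits")
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

# Code points that genuinely need two bytes: U+0080 through U+07FF.
lead_octets = sorted({encode_two_byte(cp)[0] for cp in range(0x80, 0x800)})

# Prints the smallest and largest lead octet: 11000010 .. 11011111
print(f"{lead_octets[0]:08b} .. {lead_octets[-1]:08b}")
```

Under these assumptions the lead octets run from 11000010 to 11011111, so 11000000 (and 11000001) can only appear if a one-byte character is encoded with one byte more than it needs.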

We have just recently had a discussion initiated by a company that
wanted to have some of their implementation peculiarities standardized
as a UTF-8 variant. I don't yet know how this discussion has ended.
The standard and the code suggest that you accept encodings even
if they use one byte too many.
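To illustrate what accepting an encoding that is one byte longer than necessary might look like, here is a minimal sketch. decode_two_byte_lenient and strict_accepts are hypothetical names chosen for this example; the lenient decoder is an assumption about how such code could behave, not a quotation of any particular implementation. Python's built-in decoder is used as an example of a strict one.

```python
def decode_two_byte_lenient(octets: bytes) -> int:
    """Decode 110yyyyy 10xxxxxx without checking for shortest form."""
    lead, trail = octets
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 pattern")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)

def strict_accepts(octets: bytes) -> bool:
    """Report whether Python's built-in strict UTF-8 decoder accepts octets."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_a = bytes([0xC1, 0x81])  # "A" (U+0041) spread over two bytes
print(hex(decode_two_byte_lenient(overlong_a)))  # lenient decoder yields 0x41
print(strict_accepts(overlong_a))                # strict decoder refuses: False
```

The same two octets are a perfectly good "A" to one decoder and an error to the other, which is exactly the kind of disagreement that makes reusing these octets for something else risky.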

UTF-8 has some redundancy, and this is a very valuable thing.
It is obviously starting to become a favorite target for attacks.
But it should stay as is. If several parties bite a bit off here
and a bit there, chances are that we won't have anything left
in the end, and even worse, that those various parties will
badly bite each other.

> > 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
> > something that is an "illegal" UTF-8 continuation octet, but
> > *is* a legal "start character code" octet (111xxxxx, where
> > each x may be 1 or 0 independently of the others, with some
> > exclusions).
> >
> > I think this would confuse most UTF-8 decoders, and is unlikely
> > to be silently ignored.
>
> He may well be assuming an implementation where a count byte triggers
> a loop which reads a number of following bytes. As you say there are
> other ways of implementing a decoder.

What we have to assume is not one or another decoder, but the
total of all decoders. And that doesn't leave much room.
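As a rough illustration of point 2 in the quoted message, the sketch below applies only the "map to upper-case, add hexadecimal A0" step described in the draft excerpt and classifies each resulting octet by the role it would play in UTF-8 (including the 5- and 6-byte forms of the original definition). shift_tag and utf8_role are hypothetical helpers for this example; the grouping and length-count prefix of MLSF are deliberately left out.

```python
def utf8_role(octet: int) -> str:
    """Classify an octet by its bit pattern in (original, up to 6-byte) UTF-8."""
    if octet < 0x80:
        return "ascii"          # 0xxxxxxx
    if octet < 0xC0:
        return "continuation"   # 10xxxxxx
    if octet < 0xE0:
        return "lead-2"         # 110xxxxx
    if octet < 0xF0:
        return "lead-3"         # 1110xxxx
    if octet < 0xF8:
        return "lead-4"         # 11110xxx
    if octet < 0xFC:
        return "lead-5"         # 111110xx
    if octet < 0xFE:
        return "lead-6"         # 1111110x
    return "unused"             # FE and FF

def shift_tag(tag: str) -> bytes:
    """Upper-case a tag and add 0xA0 to each octet.

    Assumes the tag holds only letters, digits and hyphens, so that
    every shifted value still fits in one octet.
    """
    return bytes(b + 0xA0 for b in tag.upper().encode("ascii"))

for octet in shift_tag("en-US"):
    print(f"{octet:02X} {octet:08b} {utf8_role(octet)}")
```

Every shifted letter from A to Z lands in the range E1 to FA, so a decoder that sees one will expect continuation octets to follow and will not find them.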

Regards, Martin.


