Re: UTF-5 specification

From: Martin J. Duerst (duerst@w3.org)
Date: Sat Mar 04 2000 - 00:41:16 EST


As the inventor of both UTF-5 and its name, let me say a few things:

- The main purpose of the initial draft that contained UTF-5 was to
  show that it would be possible to add i18n domain names to the domain
  system without any changes to DNS or to any other protocols.

- The main reason for writing that draft was to silence some people
  who said that it might be possible to internationalize part of URIs,
  but never the domain name part. The idea was not to actually use that
  as a solution.

- Several other solutions for i18n domain names have been proposed
  and tested, all of them with their advantages and disadvantages.
  I personally clearly favor using UTF-8, but discussion on the
  requirements and proposals should be conducted on the list of the
  IETF idn WG, not here, so I refrain from further comments.

- In no way should UTF-5 be seen as a general scheme. If it has any
  utility, it is in the area of domain names and maybe related stuff.
  For everything else, there is clearly some better way to do things.

- I agree with Ken that the name 'UTF-5' is quite sloppy. The only thing
  I can say in my defense is that at the time I started to use it, UTF-7
  was still around.

- In case you really want to use UTF-5 (I suggest you wait for the
  IETF work to progress, or even better, contribute to it),
  you need some way to distinguish between UTF-5 and plain (ASCII)
  text. Although there might be some heuristics, it's not a good idea
  to rely on them in something as central as the DNS. The original draft
  proposed to use some top level domain such as .i or .i18n.int for
  this purpose. That means that either the whole domain name is in
  UTF-5, or the whole domain name is (straight) ASCII. According to
  DNS people, this already creates management headaches. If, in
  addition, you make the DNS suffix decide how the left part of an
  email address is encoded, that complicates things much more.
  I think this is something that should be addressed better in
  a future version of the current UTF-5 draft.

Regards, Martin.

At 07:30 00/03/02 -0800, Doug Ewell wrote:
> I've been working on implementing a UTF-5 encoder and decoder based on
> the specifications in the file
>
> http://ftp.univie.ac.at/netinfo/internet-drafts/draft-jseng-utf5-01.txt
>
> and I am running into problems with what I will call "UTF-5 mode,"
> which I apparently need to be able to switch into and out of, but which
> is not mentioned anywhere in the spec.
>
> Section 3, "Examples of UTF-5," states:
>
> > The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262,
> > 0391, 002E) may be encoded as follows:
> >
> > "K1I262J91IE"
>
> In this example, the two ASCII characters 'A' and '.' are encoded in
> UTF-5 along with the non-ASCII characters, U+2262 and U+0391.
>
> Section 4.b, "Internationalization of Simple Mail Transfer Protocol
> Address," states:
>
> > For example, an SMTP Email address for "yamaguchi@asahi.ninhon"
> > (5C71 53E3 '@' 671D 65E5 '.' 65E5 672C) can be represented in
> > UTF-5 "LC71L3E3@M71DM5E5.M5E5M72C". This is a valid [RFC822] Email
> > address which will not be rejected. It will then be the responsibility
> > of the user interface to render "LC71L3E3@M71DM5E5.M5E5M72C" properly
> > as "yamaguchi@asahi.ninhon".
>
> In this example, the two ASCII characters '@' and '.' are NOT encoded
> in UTF-5 along with everything else, but remain in ASCII.
>
> So in the same document, we are told first that the character U+002E
> should be encoded in UTF-5, and then that it should not. This creates
> a problem for encoders, since they must know when to encode characters
> like U+002E and when not to. It also creates a problem for decoders,
> which must figure out how and when to switch into and out of UTF-5 mode
> within a "UTF-5" string or document.
>
> This notion is inconsistent with Section 2.5, "Detecting a UTF-5
> string," which states:
>
> > Nevertheless, if the string is sufficiently long, it is possible to
> > do some detection of UTF-5 string based on the fact that
> > 1. UTF-5 strings only have characters within '0'-'9' and 'A'-'V'.
> > 2. UTF-5 strings have a well-defined initial octet of 'G' to 'V'.
> > 3. The 'G' character always occurs as the initial and only octet.
> > In other words, the shortest UTF-5 sequence is "G". For example,
> > "GF" is not a valid UTF-5 sequence.
>
> The encoded e-mail address "LC71L3E3@M71DM5E5.M5E5M72C" in Section 4.b
> violates rules 1 and 2, and thus would not qualify as a UTF-5 string
> according to these criteria.
>
> There are other potential ambiguities. The specification says that
> characters in the range U+0000 through U+000F are represented by
> quintets in the range 10000 through 11111 (binary), and converted
> thereby to characters in the range 'G' through 'V'. This would seem
> to imply that control characters like Carriage Return (U+000D) and
> Line Feed (U+000A) should be encoded as 'T' and 'Q' respectively. The
> extreme example of this is trying to store pure UTF-5 strings in C or
> C++ null-terminated character arrays, while encoding the null character
> U+0000 itself as the letter 'G' as specified in Section 3.
>
> It appears that UTF-5 was designed solely to allow non-ASCII characters
> in Internet domain names and e-mail addresses, and the problem of what
> to do about characters like '@' and '.' was ignored. But in a proper
> specification, these ambiguities should not exist. Compare the UTF-5
> document to the specification of UTF-7 (RFC 2152). In that document,
> it is specified clearly which characters are encoded and which are not,
> and when and by what means it is necessary to switch modes. (This is
> not to imply that UTF-7 is all that simple to implement, but at least
> the specification is complete.)
>
> In short, the UTF-5 specification needs to acknowledge the need to
> switch into and out of UTF-5 mode. It should specify when certain
> characters are to be left in ASCII rather than being encoded into
> UTF-5, and it should provide guidelines for decoders about how
> "invalid" UTF-5 characters ([^0-9A-V]) are to be handled in a UTF-5
> stream. These details must be covered explicitly by the spec, not
> left as "undefined" for each implementation to handle differently.
>
> -Doug Ewell
> Fullerton, California
>
>
>
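
For readers who want to experiment with the points raised in the quoted
message, here is a minimal Python sketch of the encoding as that message
describes it: each code point is written in hexadecimal without leading
zeros, and the first hex digit is replaced by a letter from 'G' (0) to
'V' (15). The function names (utf5_encode, utf5_decode,
utf5_decode_mixed, looks_like_utf5) are hypothetical and come from
neither the draft nor any library. Passing characters outside [0-9A-V]
through unchanged is only one possible reading of the e-mail example in
Section 4.b; it is an assumption, not something the draft specifies.

```python
import re

HEX_DIGITS = "0123456789ABCDEF"


def utf5_encode(text: str) -> str:
    """Encode every character, including ASCII ones, as in Section 3."""
    out = []
    for ch in text:
        digits = format(ord(ch), "X")  # hex, no leading zeros ("0" for U+0000)
        # The first hex digit becomes a letter 'G'..'V'; the rest stay as hex.
        out.append(chr(ord("G") + int(digits[0], 16)) + digits[1:])
    return "".join(out)


def utf5_decode(data: str) -> str:
    """Strict decoder: rejects anything that is not pure UTF-5."""
    chars = []
    value = None  # code point being assembled, or None before the first one
    for c in data:
        if "G" <= c <= "V":
            # A letter 'G'..'V' starts a new character.
            if value is not None:
                chars.append(chr(value))
            value = ord(c) - ord("G")
        elif c in HEX_DIGITS and value:
            # Continuation digit. A value of 0 means the character began
            # with 'G', which must stand alone (so "GF" is rejected).
            value = value * 16 + int(c, 16)
        else:
            raise ValueError(f"invalid UTF-5 input at {c!r}")
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)


def utf5_decode_mixed(data: str) -> str:
    """ASSUMPTION: decode runs of [0-9A-V] and copy everything else as-is.

    This is one guess at the "UTF-5 mode" switching that Section 4.b
    implies; the draft itself does not define it.
    """
    return re.sub(r"[0-9A-V]+", lambda m: utf5_decode(m.group()), data)


def looks_like_utf5(data: str) -> bool:
    """Apply the three detection rules quoted from Section 2.5."""
    return re.fullmatch(r"(?:G|[H-V][0-9A-F]*)+", data) is not None


print(utf5_encode("A\u2262\u0391."))
print(utf5_decode_mixed("LC71L3E3@M71DM5E5.M5E5M72C"))
print(looks_like_utf5("LC71L3E3@M71DM5E5.M5E5M72C"))
```

With these definitions, utf5_encode("\r\n") gives "TQ" and
utf5_encode("\0") gives "G", which illustrates the control-character
concern above. looks_like_utf5 accepts the Section 3 example but rejects
the Section 4.b address because of the '@' and '.', which is the
inconsistency under discussion.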

#-#-# Martin J. Dürst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org


