From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Tue Apr 01 2008 - 15:26:46 CST
Thomas Hühn wrote:
> Would a string like "Thomas Hühn" (Tho<U+200D>mas Hu<U+0308>h<U+200D>n) be
>
> (a) a valid Unicode string with some semantics?
I don't see why not, but the semantics is to be assigned at a higher
protocol level, apart from the technical aspects, e.g. "T" is definitely
an uppercase Latin letter and U+200D has certain defined properties and
meaning: it suggests ligature or cursive rendering.
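To illustrate what "defined properties" means at the character level, here is a minimal sketch that looks up the names and general categories of the characters in question. It assumes Python's standard unicodedata module; the helper name describe is just something made up for this illustration.

```python
import unicodedata

def describe(ch):
    """Return (Unicode name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

# "T" is an uppercase letter (Lu), U+200D is a format character (Cf),
# and U+0308 is a nonspacing combining mark (Mn).
print(describe("T"))        # ('LATIN CAPITAL LETTER T', 'Lu')
print(describe("\u200d"))   # ('ZERO WIDTH JOINER', 'Cf')
print(describe("\u0308"))   # ('COMBINING DIAERESIS', 'Mn')
```

These properties say nothing about what the string as a whole means; that part is still left to the higher-level protocol.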
> (b) a valid Unicode string that may be used to transmit the
> information that someone is called "Thomas Hühn"?
Yes. The U+200D character is basically typographic in nature. It is
generally pointless to use it that way (ligatures for "om" and "hn" are
not actually used and it is difficult to see how they _could_ be used),
but hey, it's a suggestion and can be ignored.
Representing ü as u followed by U+0308 is not common, but surely
possible, and it's just the canonically decomposed form of "ü". Many
programs will choke on it, but that's a different story. Beware that
although the rendering of the two representations of ü _should_
generally be the same, it often isn't. And you should not expect
programs to treat them as different, but neither should you rely on
their _not_ being treated as different.
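A small sketch of the difference between the two representations, assuming Python and its standard unicodedata module. The function name canonically_equal is hypothetical; it just normalizes both strings to NFC before comparing, which is one common way for a program to avoid treating them as different.

```python
import unicodedata

decomposed = "Thomas Hu\u0308hn"    # u followed by U+0308
precomposed = "Thomas H\u00fchn"    # precomposed U+00FC

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(decomposed == precomposed)                   # False: code points differ
print(len(decomposed), len(precomposed))           # 12 11
print(canonically_equal(decomposed, precomposed))  # True
```

A program that compares code points directly sees two different names; one that normalizes first sees one. You cannot know in advance which kind you are dealing with.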
> Question (b) aims at whether this string might be a valid From: in
> some Internet mail message (properly MIME-encoded, of course) or just
> a bunch of characters that just don't fit together semantically.
This really depends on the Internet message header specifications, i.e.
on higher level protocols. It is up to them to define which characters
are allowed in such contexts.
Many people still refrain from using any non-ASCII characters in
Internet message headers (including even Subject headers, resulting in
distortion of texts), and I can't really blame them, since I know that
they still cause trouble. (I have even seen an E-mail message bounce
back just because a recipient was specified in a Cc header so that his
name contained a non-ASCII letter, "ä", properly inside quotation marks
and with MIME encoding, and the bounce came from the primary recipient's
E-mail system...) And surely U+0308 and U+200D can be expected to be
more risky in message headers than the precomposed ü, U+00FC.
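For what "properly MIME-encoded" could look like in practice, here is a rough sketch using Python's standard email.header module, which produces RFC 2047 encoded-words. The choice of UTF-8 as the charset is an assumption made for the example. This only shows that the string survives an encode/decode round trip; it says nothing about how mail software along the way will treat it.

```python
from email.header import Header, decode_header, make_header

# The display name from the question: ZWJ after "Tho" and after the
# second "h", and ü written as u + U+0308.
name = "Tho\u200dmas Hu\u0308h\u200dn"

# Encode as an RFC 2047 encoded-word; the result is pure ASCII.
encoded = Header(name, "utf-8").encode()
print(encoded)

# Decode it again, as a conforming receiver might.
decoded = str(make_header(decode_header(encoded)))
print(decoded == name)
```

The encoded form carries U+200D and U+0308 through unchanged, so whether they cause trouble depends entirely on the receiving software.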
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/