[Note -- I'm redirecting this from the ACAP list to the IETF-languages
list (for discussion of RFC 1766 and applications thereof). Subscription
is by sending "subscribe ietf-languages" in the body of a message to
<majordomo@uninett.no>.]
On Wed, 4 Jun 1997, Kent Karlsson wrote:
> If I have not misunderstood UTF-8 or "MLSF" completely:
>
> A.
> 1. A UTF-8-style length count with the low bits set to 0 is
> **not** an "illegal" UTF-8 "start character code" octet.
>
> 2. Adding hexadecimal A0 to the "ASCII" codes for A-Z produces
> something that is an "illegal" UTF-8 continuation octet, but
> *is* a legal "start character code" octet (111xxxxx, where
> each x may be 1 or 0 independently of the others, with some
> exclusions).
This is correct. Using this technique allows language tags to be skipped
more easily and does less damage to the heuristic detection feature of
UTF-8 than encoding the tags in the single-octet 80-BF range would.
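
To make the bit patterns in point 2 concrete, here is a minimal Python
sketch. It only illustrates the arithmetic (adding hex A0 to A-Z) and
classifies the resulting octets under the original UTF-8 definition, which
allowed sequences of up to six octets. The function names are hypothetical,
and this is not an MLSF encoder: the length-count octet from point 1 is not
modeled.

```python
def classify_octet(b: int) -> str:
    """Classify one octet by its bit pattern in the original (up to 6-octet) UTF-8."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx (80-BF)
    if b & 0xE0 == 0xC0:
        return "lead-2"         # 110xxxxx
    if b & 0xF0 == 0xE0:
        return "lead-3"         # 1110xxxx
    if b & 0xF8 == 0xF0:
        return "lead-4"         # 11110xxx
    if b & 0xFC == 0xF8:
        return "lead-5"         # 111110xx
    if b & 0xFE == 0xFC:
        return "lead-6"         # 1111110x
    return "invalid"            # FE and FF never appear in UTF-8


def shift_tag_letters(tag: str) -> bytes:
    """Add hex A0 to each ASCII letter A-Z (hypothetical helper for illustration)."""
    for c in tag:
        if not ("A" <= c <= "Z"):
            raise ValueError("only ASCII A-Z is handled in this sketch")
    return bytes(ord(c) + 0xA0 for c in tag)


if __name__ == "__main__":
    for octet in shift_tag_letters("EN"):
        print(f"{octet:02X} {octet:08b} {classify_octet(octet)}")
```

Every letter A-Z lands in the range E1-FA, so each shifted octet has the
form 111xxxxx and none falls in the continuation range 80-BF.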
> I think this would confuse most UTF-8 decoders, and is unlikely
> to be silently ignored.
I'm not sure about the confusion part, but I'd have to agree that almost
no scheme will be silently ignored by all decoders. On the other hand, the
modification needed to make a decoder silently ignore the tags is trivial.
> B. This trick is designed for UTF-8 only, and does *not* work for
> Unicode/ISO/IEC10646 in general, which means it **cannot** be
> transformed into UTF-16 (nor UCS-4), without using some
> *other* way of representing the language tags.
As Mark Crispin mentioned, this can be a desirable feature in many
circumstances. Language-tagged plain text probably needs to be at a
different level from a simple character string.
> C. "Higher level protocols" (e.g. MS-doc/RTF, HTML, etc., etc.)
> seems to be a more suitable place for handling language tags
> (and is where they are handled now).
I disagree strongly. These are needed in searchable attribute values,
protocol error strings, and all sorts of other places where rich text is
entirely unacceptable due to its complexity and lack of searchability.
If the only alternative is rich text, I will use MLSF.
- Chris