Re: Comments on <draft-ietf-acap-mlsf-00.txt>

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Fri Jun 06 1997 - 15:58:29 EDT

Next message: CN=Lisa Moore/OU=Santa Teresa/O=IBM: "Re: Yet another Unihan Q"
Previous message: jenkins: "Re: Yet another Unihan Q"
Maybe in reply to: Martin J. Duerst: "Re: Comments on <draft-ietf-acap-mlsf-00.txt>"
Next in thread: Martin J. Duerst: "Re: Comments on <draft-ietf-acap-mlsf-00.txt>"

Hello Ned,

You may not be too pleased with the first part of this mail.
But please read all of it before you start to write your answer.

On Fri, 6 Jun 1997, Ned Freed wrote:

> I agree with Mark about this. Like it or not, one thing that came out of the
> IAB charset workshop was a clear statement that language tagging facilities
> provide critical functionality and therefore must be part of protocols
> developed in the IETF. (RFC2130, section 3.1.1.4.) And because of this
> designers of Internet protocols either have to provide language labelling
> facilities (or something comparable) or else risk having their protocol designs
> rejected by the IESG.

Let's have a look at the relevant passages of RFC 2130. There are three
paragraphs about language.

<<>> 3.1.1.4: Language
<<>>
<<>> This component specifies the language of the transmitted text. At
<<>> times and in specific cases, language information may be required to
<<>> achieve a particular level of quality for the purpose of displaying a
<<>> text stream. For example, UTF-8 encoded Han may require transmission
<<>> of a language tag to select the specific glyphs to be displayed at a
<<>> particular level of quality.

Note the "at times and in specific cases", and "at a particular level of
quality". No hard requirement in sight. And I would very well agree if
the ACAP specification said, in a reasonable way, that language tags
are not necessary.

<<>> Note that information other than language may be used to achieve the
<<>> required level of quality in a display process. In particular, a
<<>> font tag is sufficient to produce identical results. However, the
<<>> association of a language with a specific block of text has
<<>> usefulness far beyond its use in display. In particular, as the
<<>> amount of information available in multiple languages on the World
<<>> Wide Web grows, it becomes critical to specify which language is in
<<>> use in particular documents, to assist automatic indexing and
<<>> retrieval of relevant documents.

Very nice. I have helped do the job for HTML. The situation for other
protocols is different, and may require different solutions. Having
a seemingly "solve-all" solution will only detract from the careful
analysis of what's really needed. For example, there is a big danger
that language alternatives are used as an alternative to proper
language negotiation for user-targeted messages. Language alternatives
don't scale, but that may only be found out when it's already too late.

<<>> The term 'language tag' should be reserved for the short identifier
<<>> of RFC 1766 [RFC-1766] that only serves to identify the language.
<<>> While there may be other text attributes intimately associated with
<<>> the language of the document, such as desired font or text direction,
<<>> these should be specified with other identifiers rather than
<<>> overloading the language tag.

One could discuss whether using language tags for disambiguating typographical
traditions (which in particular are rather orthogonal to language) doesn't
constitute an overloading. But I guess this is a detail.

> So it really doesn't matter whether or not some people don't believe there's a
> problem to be solved here or whether or not some people believe it to be
> unimportant, because the IETF has already effectively concluded there is a
> problem and it is important. I therefore regard the basic issue of whether or
> not what we're trying to do here is even necessary as closed.

I have rechecked section 8, recommendations. Neither there nor
in section 3.1.1.4, which is fully copied above, can I see any text that
would justify your statement (repeated from above):

> Like it or not, one thing that came out of the
> IAB charset workshop was a clear statement that language tagging facilities
> provide critical functionality and therefore must be part of protocols
> developed in the IETF. (RFC2130, section 3.1.1.4.)

In particular, I have problems with the words "critical" and "must".
If I have missed something in RFC 2130, then please tell me.

The situation is very similar to security. What we want protocol
developers and implementers to do is to seriously analyze and consider
the respective needs (security needs and internationalization needs).
A statement such as "there are no security issues" is as unacceptable
as a statement saying "there are no internationalization issues".
On the other hand, there is no requirement that all internet protocols
must provide authentication or other security features.

> This is the issue we should be discussing.

As should be clear from above, both issues should be discussed.

> As I see it there are several
> alternative approaches we can take to adding language tags. Here's my current
> list along with what I see as the advantages and disadvantages of each
> approach:
>
> (0) Define new character code points for the tags.
>
> + Works with UCS-16 and UTF-16 as well as UTF-8.
> - Ends up with codepoints that aren't characters.
> - Requires support of codepoints outside of BMP.
> - May potentially conflict with future UTC codepoint assignment or with
> private use assignments (depending on the region used).
>
> (1) Embed the tags using illegal UTF-8 sequences. (MLSF is one such scheme;
> there are others.)
>
> + Very easy to parse.
> + Invisible at the codepoint level; keeps tags out of the character stream.
> + The additional level added is very lightweight.
> - Not completely compatible with UTF-8.
> - Cannot be used in conjunction with UCS-16 or UTF-16.
>
> (2) Use some form of rich text.
>
> + Conceptually simpler than any other scheme.
> + Lots of experience with systems of this sort; we know they work.
> - Very heavy compared to (1).

This is a very good overview. The second point of (2) takes up what
I was saying earlier when referring to IETF engineering principles.
What I don't agree with is the second point in (1). Whether these
codes are invisible, whether they will turn up as something else
(whatever that might be) or whether they will break an application
is totally open. Anything may happen, and will happen. That
the tags are not conforming to the UTF-8 syntax neither makes
them invisible (unless you strip them, which you can do with
all other solutions) nor does it keep them out of the
character stream.
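
To illustrate the point, here is a minimal Python sketch of what a
receiver might do with a byte that is not legal UTF-8. The byte 0xFF is
only a stand-in for "some illegal sequence" and is not the actual MLSF
encoding; the function name decode_outcomes is made up for this example.
The three error policies are those of Python's built-in bytes.decode.

```python
def decode_outcomes(data: bytes) -> dict:
    """Decode the same bytes under three common error policies."""
    outcomes = {}
    for policy in ("strict", "replace", "ignore"):
        try:
            outcomes[policy] = data.decode("utf-8", errors=policy)
        except UnicodeDecodeError as exc:
            # A strict decoder rejects the whole string.
            outcomes[policy] = "error: " + exc.reason
    return outcomes


# 0xFF can never appear in well-formed UTF-8; it stands in for a tag byte.
sample = b"abc\xffdef"

for policy, result in decode_outcomes(sample).items():
    print(policy, "->", repr(result))
```

Only the "ignore" policy makes the byte disappear, and that is exactly
the stripping step that is available with every other scheme as well.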

> There are also other approaches that combine the characteristics of each of
> the schemes I've presented, such as defining a single new codepoint to introduce
> language tag.

That is definitely the best idea I have seen so far!
Not that this means that I give up my general scepticism I
have expressed in many other mails, but that, combined
with the use of plaintext for the tag itself, and a suitable way
to determine the end of the tag (an arbitrary ASCII character
not allowed in language tags will do), would be my preferred solution
if one is needed. In a real plain text editor, that tag
could be displayed with a special glyph (what about a reversal
of the generic currency identifier, a circle with four little
strokes inside instead of outside?). Seen like this, it would
essentially be (2), with all its advantages. On the other hand,
the advantages of (1) and (0) are still nicely met.
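
A minimal Python sketch of how such a scheme could be parsed, under
stated assumptions: the introducer is represented by the private-use
code point U+E000 as a placeholder (no code point has been assigned for
this purpose), the terminator is ";" (one possible ASCII character that
cannot occur in an RFC 1766 tag), and the names split_runs and
strip_tags are hypothetical.

```python
TAG_INTRO = "\uE000"  # placeholder introducer (assumption, private use)
TAG_END = ";"         # assumed terminator; not legal inside an RFC 1766 tag


def split_runs(text, default_lang=None):
    """Split tagged plain text into (language, run) pairs."""
    runs = []
    lang = default_lang
    pos = 0
    while True:
        start = text.find(TAG_INTRO, pos)
        if start == -1:
            # No further tags: the rest belongs to the current language.
            if pos < len(text):
                runs.append((lang, text[pos:]))
            return runs
        if start > pos:
            runs.append((lang, text[pos:start]))
        end = text.find(TAG_END, start + 1)
        if end == -1:
            raise ValueError("unterminated language tag")
        lang = text[start + 1:end]  # the tag itself is plain text
        pos = end + 1


def strip_tags(text):
    """Remove all tags, keeping only the untagged character data."""
    return "".join(run for _, run in split_runs(text))


sample = TAG_INTRO + "en;Hello " + TAG_INTRO + "de;Welt"
print(split_runs(sample))
print(strip_tags(sample))
```

Because the tag is ordinary text after a single introducer, the same
routine works unchanged on any encoding that can carry the introducer.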

As for code positions, U+036F (the end of the column with the
two double diacritics) or U+2069 (just before the bastards
like Inhibit Symmetric Swapping) might be likely candidates.
As for political issues, in the worst case, WG2 might be convinced
to declare this codepoint as "not a character", or just give
it a name and not worry about it anymore. Unicode might have
several choices for wording. I would probably prefer something
like "reserved for use in certain protocols, not recommended
for general use".

The only problem with searching would be false positives
inside the language tags, which are very rare and easy to
eliminate.
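
As a rough sketch of how such false positives could be eliminated,
again assuming a placeholder introducer U+E000 and a ";" terminator
(neither is an assigned value), a search can simply discard hits that
overlap a tag. The function name find_outside_tags is hypothetical.

```python
import re

TAG_INTRO = "\uE000"  # placeholder introducer (assumption, private use)
TAG_END = ";"         # assumed terminator; not legal inside a language tag

# One tag = introducer, any run of non-terminator characters, terminator.
TAG_RE = re.compile(
    re.escape(TAG_INTRO) + "[^" + re.escape(TAG_END) + "]*" + re.escape(TAG_END)
)


def find_outside_tags(text, needle):
    """Return start offsets of needle that do not overlap any tag."""
    tag_spans = [m.span() for m in TAG_RE.finditer(text)]
    hits = []
    pos = text.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # Keep the hit only if it overlaps no tag span.
        if not any(s < end and pos < e for s, e in tag_spans):
            hits.append(pos)
        pos = text.find(needle, pos + 1)
    return hits


sample = TAG_INTRO + "en;open the pen"
print(find_outside_tags(sample, "en"))
```

In the sample, the "en" inside the tag is skipped and only the two
occurrences in the text itself are reported.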

> I currently like (1) the best, but like Mark I could be convinced to go
> with some other approach. The one approach that isn't acceptable to me is
> to say that there's no problem we have to solve here.

It may sometimes have sounded like "there is no problem".
But it was usually "what exactly is your problem", with detailed
analysis of the answers and very valid arguments.

> Finally, there's also the question of whether or not we want a way to present
> alternatives as part of this system. My current inclination is that this is
> much too complex for many of the situations where we want to be able to use
> this mechanism, but again I could be convinced otherwise.

This is a very good question. Complexity is one problem. Scalability
is another, and is probably worse, because complexity may get mastered
with time, whereas scalability problems will increase with more and
more languages being integrated.

Regards, Martin.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT