Re: Plane 14 codes for language tagging?

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Sat Jun 07 1997 - 10:36:02 EDT


On Fri, 6 Jun 1997, Chris Newman wrote:

> On Fri, 6 Jun 1997, David Goldsmith wrote:
> > 2. Language tagging is part of a higher level protocol. In that case,
> > there are plenty of existing characters in the standard which can be used
> > to implement it.
> >
> > My personal preference is for number 2. I kind of like Martin's proposal
> > for introducing a plain-text language tag using a control code, and I
> > think the existing control codes are fine.

Good idea. Indeed, the C1 area is not used on the Internet, as far as I know.

> > You do need to handle it
> > specially during searches and such, but the code to do so is not going to
> > be hundreds of lines. Someone would have to make a pretty detailed
> > argument to convince me this is substantially different from the code to
> > handle MLSF during searching.
>
> I'll do my best. I'm a big proponent of the K.I.S.S. principle. In other
> words, if there is a way to avoid unnecessary work, then do it.

I understand. Your comment about searching in RFC 1522 was very good.
But there is a big difference between RFC 1522 and the various
alternatives we are discussing here: the difference amounts to only
two or three lines of code. I wrote the "remove language tags" function
yesterday for my proposal. It's one or two lines longer than yours,
but I don't need the 256-entry UTF-8 table, so I save there.

> There are two places where an escaped scheme causes unnecessary complexity
> relative to an out-of-band scheme. First, take a server which is
> searching through a large list of small UTF8 text strings. With an
> out-of-band scheme, the server can simply perform the search as if the
> text had no markup.

It's not out of band. What if I search for two words in a text, and
by chance there is a language marker between them? You get a false
negative.
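The false negative is easy to demonstrate. In the sketch below (again using the Plane 14 tag range purely for illustration), a naive substring search fails when a tag happens to sit between the two words, but succeeds once tags are stripped first:

```python
# Sketch: a language tag between two words defeats naive search.
TAG_LO, TAG_HI = 0xE0000, 0xE007F

def strip_tags(text):
    return "".join(c for c in text
                   if not (TAG_LO <= ord(c) <= TAG_HI))

# A language marker happens to fall between the two words.
text = "hello \U000E0001\U000E0066\U000E0072world"

print("hello world" in text)              # False: naive search misses it
print("hello world" in strip_tags(text))  # True once tags are stripped
```

So even an "out-of-band" tag stream still forces the search code to be tag-aware in some way.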

> With an escaped scheme, the server has to either
> perform a decode step before searching, or on every match back up and make
> sure the match didn't occur in an escaped sequence. This is unnecessary
> complexity, and if it can be avoided, it should be.

I understand that you care about searching and a small number of other
operations. But if you look beyond those operations, you actually
introduce complexity. For example, assume I want to prepare some data
to be loaded into an ACAP server by hand (i.e., with the real plain-text
editor I have now, which can produce UTF-8). How am I supposed to
insert the illegal MLSF sequences? Here MLSF adds considerable
complexity.

> But the even bigger problem is the client side. Clients need to be as
> simple and stupid as possible -- otherwise people won't write clients
> (witness the number of SMTP vs. X.400 clients). With ACAP, the numbers
> rolling around are a target of 100:1 client to server ratio. Most clients
> simply don't care about language tags and want to have nothing to do with
> them. Now adding UTF-8 support requires that clients deal with the
> problem of unknown/unsupported characters. This is a fairly high cost as
> it requires education of client authors. But it's well worth it to be
> able to have a proper international character set.

Very true indeed.

> But language tags
> aren't valuable enough to merit adding more complexity. So it has to be
> possible to fold the logic to ignore language tags into the same logic
> necessary to deal with unknown/unsupported characters. Both MLSF and the
> codepage 14 proposal meet this criterion. Clients get to remain dirt
> stupid. An escaping scheme requires the addition of an extra decoding
> step with separate ignore logic.

I have difficulty understanding this. Unknown/unsupported characters
are usually handled by the display logic. I don't think you want to
do the display logic yourself in your ACAP client; otherwise you
wouldn't complain about a few extra lines in a searching algorithm.
Unknown characters are, in many cases, indeed displayed. My editor,
for example, displays little hex numbers in a 2-by-2 block (plus one
leading number in front for things beyond the BMP).
Now if you have display logic you want to rely on, you will have to
send it something really standard. For various good reasons, in many
cases this will be UTF-16 or UCS-4, and not UTF-8, although I can
imagine that interfaces with UTF-8 may also be available in some places.
Mark Crispin's idea of taking whatever you have and just sending it to
the display doesn't work so easily.
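A fallback renderer along those lines could look like the following sketch (hypothetical code, not the editor's actual implementation): the unknown code point's hex digits are arranged two per row, with any extra leading digits on top for characters beyond the BMP:

```python
# Sketch: render an unknown code point as hex digits in a 2x2 block,
# with extra leading digits on a top row for non-BMP characters.
def hex_block(cp):
    digits = "%04X" % cp            # at least four hex digits
    extra, last4 = digits[:-4], digits[-4:]
    rows = []
    if extra:
        rows.append(extra)          # leading digits, only beyond the BMP
    rows.append(last4[:2])          # top half of the 2x2 block
    rows.append(last4[2:])          # bottom half of the 2x2 block
    return "\n".join(rows)
```

For example, a BMP character yields two rows of two hex digits, while a supplementary-plane character gains a short leading row.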

> A rough principle is that humans can easily deal with about four layers of
> abstraction. Requiring another layer just to deal with language tags is
> not worth it.

Just a moment. Didn't we agree that language tagging would be on
a separate layer, and didn't you claim that MLSF was a separate
layer?
I myself think that it is much better to have separate things
on separate layers, clearly visible as such. It's much better
to have five layers if you need five than to squeeze one of the
layers into a third of another, confusing people as to whether
it is separate or the same. It's this confusion that we have to
avoid at all costs.

> Please don't interpret this to mean I think language tagging is useless.
> I can see it being very helpful to a multi-lingual blind person and in
> other contexts. ACAP is a container -- I want to be able to fill it with
> useful things. But it's important that the common case (ignore language
> tags) is simple and that the uncommon case (use language tags) is
> possible.

To make the implementation easy for programmers, why don't you
add an option to ACAP, so that clients can tell the server whether
they need language tags or not? Filtering them out on the server
side, even if they have the complexity of, say, text/enriched,
should be easy. And those clients that really need the tags and
know what to do with them shouldn't have problems with that amount
of complexity.
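Such a negotiation could be as simple as the sketch below (a hypothetical API, not part of ACAP; the tag range is again the Plane 14 block, used only for illustration): the client states once whether it wants language tags, and the server strips them from values it sends otherwise.

```python
# Sketch of a hypothetical server-side option: strip language tags
# (here, Plane 14 tag characters) unless the client asked for them.
TAG_LO, TAG_HI = 0xE0000, 0xE007F

class Session:
    def __init__(self, wants_language_tags=False):
        # Set once per connection, e.g. from a client option.
        self.wants_language_tags = wants_language_tags

    def prepare(self, value):
        """Filter a stored value before sending it to the client."""
        if self.wants_language_tags:
            return value
        return "".join(c for c in value
                       if not (TAG_LO <= ord(c) <= TAG_HI))
```

The common case (tag-ignorant client) then costs the client nothing at all, while the filtering burden falls on the far smaller number of servers.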

Regards, Martin.



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT