Thanks to Kent Karlsson, some technical problems in the proposed
MLSF have been discovered. People have also worried about:
- The proposed solution actually being disguised in-stream
language codes, with all their disadvantages.
- The proposed solution looking like "yet another UTF", which
we don't need.
- The proposed solution looking too close to UTF-8, with the
danger of being confused.
Here is a proposal for an alternative solution that will avoid
all these problems. It is based on a very old Internet principle
(at the application level, that is): keep things as readable as
possible, and do not use control codes.
Limiting myself to just ASCII for the moment, we could just
select two characters (one for language tags, and one for
alternatives) as being special. Let's take @ and %.
An example could then look as follows:
@fr-fr@soixante-dix%@fr-ch@septante%@en@seventy%@de@siebzig
Details will have to be discussed.
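
As a rough illustration of how little code the example above needs, here is a minimal Python sketch. It assumes that "%" separates alternatives and that each alternative has the form @tag@text; the function name parse_alternatives is hypothetical, and escaping is not handled, since the details are still open.

```python
def parse_alternatives(s):
    """Split '@tag@text%@tag@text...' into (tag, text) pairs.

    Assumed grammar (not settled in the proposal): '%' separates
    alternatives, and each alternative is '@' tag '@' text.
    Escaped specials are not handled in this sketch.
    """
    pairs = []
    for alt in s.split("%"):
        if not alt.startswith("@"):
            raise ValueError("alternative does not start with '@': %r" % alt)
        # Everything between the first and second '@' is the language tag.
        tag, sep, text = alt[1:].partition("@")
        if not sep:
            raise ValueError("missing closing '@' after tag: %r" % alt)
        pairs.append((tag, text))
    return pairs


example = "@fr-fr@soixante-dix%@fr-ch@septante%@en@seventy%@de@siebzig"
for tag, text in parse_alternatives(example):
    print(tag, text)
```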
The above is plain ASCII only, and easily readable and editable.
Now the problem with this is that the special characters in
the ASCII range are heavily used for all kinds of purposes, which
creates problems for searchability and so on. Still, with "@@"
as an escape code for a real "@", and "%%" for "%", it could work.
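
The doubling escape can be sketched as follows. This is only an illustration under assumptions: the names escape and tokenize are hypothetical, and a doubled special is always read as a literal, so an empty text directly followed by another special (for example "@en@" followed by "@fr@") would be ambiguous and would need a rule of its own.

```python
def escape(text):
    """Double the two specials so they survive as literal characters."""
    return text.replace("@", "@@").replace("%", "%%")


def tokenize(s):
    """Split s into ('delim', c) and ('text', str) tokens.

    A doubled '@@' or '%%' is treated as one literal character;
    a single '@' or '%' is treated as a delimiter.
    """
    tokens = []
    buf = []
    i = 0
    while i < len(s):
        c = s[i]
        if c in "@%":
            if i + 1 < len(s) and s[i + 1] == c:
                buf.append(c)  # escaped special: keep one literal copy
                i += 2
                continue
            if buf:
                tokens.append(("text", "".join(buf)))
                buf = []
            tokens.append(("delim", c))
        else:
            buf.append(c)
        i += 1
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens


print(tokenize("@en@" + escape("50% off @ noon")))
```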
But there is another solution. Unicode contains tons of
special characters. The best candidates for our purpose
are in the two rows in the Latin-1 area. They are available
for display on most systems, and they are below 0x800
(thus need two bytes in UTF-8).
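
The two-byte claim is easy to check. The short Python sketch below (the helper name utf8_len is hypothetical) confirms that every code point from 0x80 up to, but not including, 0x800 takes exactly two bytes in UTF-8, which covers the whole Latin-1 area above ASCII.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# ASCII stays at one byte; 0x80..0x7FF need two; 0x800 is the first to need three.
print(utf8_len(0x7F), utf8_len(0x80), utf8_len(0x7FF), utf8_len(0x800))

# The Latin-1 area above ASCII (0xA0..0xFF) lies entirely in the two-byte range.
print(all(utf8_len(cp) == 2 for cp in range(0xA0, 0x100)))
```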
Such a solution has various advantages:
- It's plain UTF-8 text, not a new encoding or something
that looks like a new encoding. It clearly shows
that an application-level problem is solved with
a well-known IETF application-level solution.
- It's very easily parsable. I haven't written code yet,
but I guess it's shorter than what we have now.
- It's more understandable even on systems that can't
display anything other than ASCII (the language
tags are still plain text), or if it gets
interpreted as something other than UTF-8.
- It's in no way going to mess up UTF-8 heuristic detection,
or give errors to receivers that expect correct
UTF-8.
- It can be used in UTF-16, UTF-7, and (depending on the
specials chosen) even in other encodings.
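
To make the last few points concrete, here is a small Python sketch. The two specials used, U+00A7 (section sign) for language tags and U+00A6 (broken bar) for alternatives, are purely an assumption for illustration; no specific characters have been chosen yet.

```python
# Hypothetical choice of specials from the Latin-1 area (an assumption).
TAG = "\u00a7"  # section sign, marks language tags
ALT = "\u00a6"  # broken bar, separates alternatives

text = TAG + "fr-ch" + TAG + "septante" + ALT + TAG + "en" + TAG + "seventy"

# The same string round-trips through several encodings unchanged.
for codec in ("utf-8", "utf-16", "utf-7", "latin-1"):
    assert text.encode(codec).decode(codec) == text

# Misread as Latin-1, the UTF-8 bytes still show the tags and text as ASCII.
misread = text.encode("utf-8").decode("latin-1")
print(misread)
```

Even in the misread form, only the specials are garbled; "fr-ch", "septante", "en" and "seventy" remain readable.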
Looking forward to your comments,
Martin.