Re: UTF-7,5

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Tue Jul 15 1997 - 13:31:52 EDT


KNAPPEN@MZDMZA.ZDV.UNI-MAINZ.DE wrote on 1997-07-15 09:25 UTC:
> >And on those systems, *ALL* non-ASCII characters are messed
> >make things any worse. Broken is broken.
>
> But containing C1 control characters the file is reasonably more broken
> than a file containing only Latin-1 characters. The latter I can view (
> because there is a latin-1 font on my system), I can even analyse a suspect
> spot, I can edit it and I can store it as a text file.

These are all tasks that only expert users like you and me can
perform. But an expert user can just as easily install GNU recode
and do the conversion on virtually any system in a few minutes.
For your convenience, I append my utf8tolat1.c tool below,
a simple wrapper around the GNU recode conversion procedure.

The non-expert user will not be able to distinguish between "almost
human readable" and "umlaut mess". The difference between "Where do
those strange pound signs in front of my umlauts come from and how do
I get rid of them?" and "Where are my umlauts?" is IMHO not that
significant. They will call support and you'll have to fix it in
both cases.
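
For reference, here is a minimal sketch of where the C1 bytes in a
UTF-8 file come from when the text is plain Latin-1. It is not part of
utf8tolat1.c; the names latin1_to_utf8 and is_c1 are invented for this
illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Latin-1 byte (0x00..0xFF) as UTF-8 into out[0..1].
 * Returns the number of bytes written (1 or 2).
 * Hypothetical helper, for illustration only. */
static size_t
latin1_to_utf8 (unsigned char c, unsigned char out[2])
{
  assert (out != NULL);
  if (c < 0x80) {
    /* plain ASCII is copied unchanged */
    out[0] = c;
    return 1;
  }
  out[0] = 0xc0 | (c >> 6);    /* start byte 110000xx: 0xc2 or 0xc3 */
  out[1] = 0x80 | (c & 0x3f);  /* continuation byte 10xxxxxx */
  return 2;
}

/* Nonzero if the byte lies in the C1 control range 0x80..0x9F. */
static int
is_c1 (unsigned char b)
{
  return b >= 0x80 && b <= 0x9f;
}
```

With these, the Latin-1 character 0xC4 (capital A with diaeresis)
becomes 0xC3 0x84, whose second byte is a C1 code, whereas 0xE4 (small
a with diaeresis) becomes 0xC3 0xA4, which contains no C1 byte at all.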

I know of very few tools that do anything with C1 characters other than
ignore them or replace them with hexadecimal escapes, since those
tools have also had to deal very frequently with accidentally sent
binary data in the past. On Windows and Macs you simply see ordinary
printable characters in those positions; under Unix you normally use a
pager like "less" that escapes C1 characters. Systems other
than Windows, Mac, and Unix have less than 5% market relevance anyway and
have similar solutions in place. As I asked before: Where are those systems
that are endangered by C1? Yes, a very few terminal emulators will get
a hiccup, but most applications will filter out the C1 characters before
display anyway as a precaution against accidentally displayed binary
files. I am convinced that this so-called C1 problem is a non-issue.
That was the reason why UTF-1 was removed from ISO 10646.
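
The kind of filtering I mean can be sketched in a few lines. This is
only an illustration under my own assumptions, not the code of "less"
or of any other real tool; the function name escape_c1 and the <XX>
escape notation are chosen just for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Copy len bytes from src to dst as a NUL-terminated string, replacing
 * every C1 byte (0x80..0x9F) with a visible escape of the form <XX>.
 * At most cap bytes (including the NUL) are written; copying stops
 * early if dst is too small.  Returns the number of bytes escaped.
 * Hypothetical helper, for illustration only. */
static size_t
escape_c1 (const unsigned char *src, size_t len, char *dst, size_t cap)
{
  size_t i, used = 0, escaped = 0;

  assert (src != NULL && dst != NULL && cap > 0);
  for (i = 0; i < len; i++) {
    if (src[i] >= 0x80 && src[i] <= 0x9f) {
      if (cap - used < 5)       /* need 4 characters "<XX>" plus NUL */
        break;
      sprintf (dst + used, "<%02X>", (unsigned) src[i]);
      used += 4;
      escaped++;
    } else {
      if (cap - used < 2)       /* need 1 character plus NUL */
        break;
      dst[used++] = (char) src[i];
    }
  }
  dst[used] = '\0';
  return escaped;
}
```

Fed the four bytes 'a', 0xC3, 0x84, 'b', this leaves 0xC3 alone and
turns 0x84 into the four characters <84>, so nothing in the C1 range
ever reaches the terminal.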

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu

#include <stdio.h>

static int
utf8_to_ucs (FILE *input_file, FILE *output_file)
{
  int reversible; /* reversibility of recoding */
  int input_char; /* current character */
  int output_bits; /* 8 = Latin-1, 16 = UCS-2, 32 = UCS-4 */
  unsigned long ucs;
  int i, count;

  output_bits = 8;

  reversible = 1;
  while (input_char = getc (input_file), input_char != EOF) {
    if (input_char < 0x80) {
      /* plain ASCII is just copied */
      for (i = 8; i < output_bits; i += 8)
        putc (0, output_file);
      putc (input_char, output_file);
    } else {
      /* read start byte of multi-byte sequence */
      if ((input_char & 0xe0) == 0xc0) {
        count = 1;
        ucs = (input_char & 0x1f);
      } else if ((input_char & 0xf0) == 0xe0) {
        count = 2;
        ucs = (input_char & 0x0f);
      } else if ((input_char & 0xf8) == 0xf0) {
        count = 3;
        ucs = (input_char & 0x07);
      } else if ((input_char & 0xfc) == 0xf8) {
        count = 4;
        ucs = (input_char & 0x03);
      } else if ((input_char & 0xfe) == 0xfc) {
        count = 5;
        ucs = (input_char & 0x01);
      } else {
        /* we have encountered 0xfe, 0xff, or an unexpected 10xxxxxx byte */
        reversible = 0;
        continue;
      }
      /* read following 10xxxxxx bytes */
      while (count--) {
        input_char = getc (input_file);
        if (input_char == EOF) {
          reversible = 0;
          return 0;
        }
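        if ((input_char & 0xc0) != 0x80) {
          /* byte 10xxxxxx expected, but not encountered; push the byte
             back so it is processed as the start of the next character
             instead of being silently dropped */
          ungetc (input_char, input_file);
          reversible = 0;
          break;
        }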
        ucs = (ucs << 6) | (input_char & 0x3f);
      }
      if (count == -1) {
        /* write output character */
        if (output_bits < 32 && (ucs >> output_bits) != 0)
          reversible = 0;
        else
          for (i = output_bits - 8; i >= 0; i -= 8)
            putc ((ucs >> i) & 0xff, output_file);
      }
    }
  }

  return reversible;
}

int main(void)
{
  if (!utf8_to_ucs(stdin, stdout))
    fprintf(stderr, "Conversion not reversible.\n");

  return 0;
}


