RE: Plain Text

From: Paul Dempsey (Exchange) (paulde@exchange.microsoft.com)
Date: Sun Jul 04 1999 - 15:49:59 EDT


> > Frank da Cruz:
> > So at minimum, a text file should be tagged according to character set.
> > To my knowledge, this has never been done at the file-system level.
> John Cowan:
> Either that, or there needs to be only one character set! :-)

We'll have to deal with multiple untagged codepages/encodings/charsets for a
long time yet. It's unlikely we'll get file systems to carry any
meta-information beyond the filename in any portable way and certainly not
retroactively.

What we CAN do is use encoding signatures for all Unicode files. The various
forms of Unicode are still relatively new and we still have a chance to
establish the conventions.

The Unicode standard lists signatures for _some_ Unicode encodings, in
section 13.6 Specials, Encoding Form Signature:

UCS-2 (UTF-16) FE FF
UCS-4 00 00 FE FF

However, this is incomplete. The most important thing we're missing from the
standard is:

UTF-8 EF BB BF

These are all ZERO WIDTH NO-BREAK SPACE (a.k.a. BYTE ORDER MARK), U+FEFF, in
the corresponding representation.
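
To check the three byte sequences above, here is a small Python sketch that
encodes U+FEFF in each form. It is only an illustration, and it makes two
assumptions: that Python's "utf-32-be" codec is an acceptable stand-in for
UCS-4, and that the big-endian byte order is the one meant by the table. The
name `signatures` is made up for this example.

```python
# U+FEFF is ZERO WIDTH NO-BREAK SPACE, used as the byte order mark.
BOM = "\ufeff"

# Assumption: big-endian forms, with utf-32-be standing in for UCS-4.
signatures = {
    "UCS-2 (UTF-16)": BOM.encode("utf-16-be"),
    "UCS-4": BOM.encode("utf-32-be"),
    "UTF-8": BOM.encode("utf-8"),
}

for name, raw in signatures.items():
    # bytes.hex() with a separator needs Python 3.8 or later.
    print(f"{name:15} {raw.hex(' ').upper()}")
```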

Without a signature for UTF-8, you can't reliably tell that you're working
with UTF-8 rather than some other multibyte character set (MBCS). A number of
Microsoft programs (Notepad, Visual Studio, RichEdit) are using this signature
for UTF-8.
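
As a rough sketch of how a reader of untagged files might use these
signatures, the Python below checks the leading bytes of a buffer against the
three sequences listed in this message. The names `SIGNATURES` and
`sniff_signature` are hypothetical, and the sketch deliberately covers only
the byte orders shown above; it does not describe how Notepad or any other
program actually does it.

```python
# Signatures listed in this message, checked longest first.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UCS-2/UTF-16"),
]


def sniff_signature(data: bytes):
    """Return (label, signature_length), or (None, 0) if no signature."""
    for sig, label in SIGNATURES:
        if data.startswith(sig):
            return label, len(sig)
    # No signature: the caller has to fall back on a guess or a default
    # codepage, which is exactly the problem described above.
    return None, 0


label, skip = sniff_signature(b"\xef\xbb\xbfplain text")
print(label, skip)
```

The returned length tells the caller how many bytes to skip before decoding
the rest of the file.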

For the rest of what constitutes "plain text", the Unicode standard covers
most of the issues, but not explicitly in one place. The grayer part of
this discussion is about what constitutes "preformatted plain text". I don't
think this can be standardized to practical effect. That is, you could write
a standard, but would anyone use it? This quickly gets into the domain of
presentation and document structure, which is beyond the scope of the
Unicode standard proper. It is still worthwhile to capture the common
conventions and make recommendations.

--- Paul Chase Dempsey
Microsoft Visual Studio Text Editor Development


