emails: utf-8 vs. ls & ps

From: schererm@us.ibm.com
Date: Mon Feb 15 1999 - 10:45:29 EST


Hello,

I am interested in a clarification of the following:

Line ends in Unicode may be unambiguously coded with LS (Line Separator,
U+2028) and PS (Paragraph Separator, U+2029) characters, see TR 13.

For emails in UTF-8, this means that a message might not be "well-formed",
because its text might not contain any CR (13) and/or LF (10) ASCII line ends.
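
To make the problem concrete, here is a minimal Python sketch. The sample
string is my own invention for illustration; it shows that text broken only
with LS and PS encodes to UTF-8 bytes that contain neither CR nor LF.

```python
# Hypothetical sample text: a line break and a paragraph break,
# expressed only with LS (U+2028) and PS (U+2029).
text = "first line\u2028second line\u2029new paragraph"
encoded = text.encode("utf-8")

# Each separator becomes a three-byte UTF-8 sequence.
LS_BYTES = "\u2028".encode("utf-8")  # E2 80 A8
PS_BYTES = "\u2029".encode("utf-8")  # E2 80 A9

# No ASCII line end appears anywhere in the encoded body.
has_ascii_line_end = b"\r" in encoded or b"\n" in encoded

print(LS_BYTES.hex(), PS_BYTES.hex(), has_ascii_line_end)
```

A mail transport that expects CR/LF-delimited lines would see such a body
as one arbitrarily long line.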

I believe there are (at least) three ways to deal with this, and I would
like to know which one(s) is (are) recommended or used:

1) Disregard TR13 for emails and write only ASCII-style (LF, CR, CRLF)
line ends.

2) Write Unicode email bodies with a modified or new encoding that breaks
    lines with LF... that are not part of the Unicode text, and encode the
    text itself in one of these ways:
2a) UTF-8 that disregards the minimum-length rule and encodes U+0000 to
    U+001F with (otherwise UTF-8-compliant) two-byte codes
2b) binary/base64-encoded UTF-16
2c) an email-only variable-length encoding with 7 bits/email-byte
2d) ?

3) Do not use LS and PS but instead require Unicode email bodies to use
    HTML or similar, and use <br> and <p>;
    as in (2), old-style line ends are inserted only for the sake of
    protocol conformance and are not part of the displayed text
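
As an illustration of what (2a) might look like, here is a rough Python
sketch. The function names (encode_2a, wrap_for_transport, decode_2a) and
the wrap width are hypothetical choices of mine, not an existing encoding
or API. Control characters U+0000 to U+001F are written as overlong
two-byte sequences, so any LF in the transmitted bytes belongs to the
transport and not to the text. Note that this output is deliberately not
conformant UTF-8.

```python
def encode_2a(text: str) -> bytes:
    """UTF-8, except U+0000..U+001F become overlong two-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            # Two-byte pattern 110xxxxx 10xxxxxx; for cp < 0x20 the lead
            # byte is always 0xC0, which never occurs in conformant UTF-8.
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def wrap_for_transport(payload: bytes, width: int = 76) -> bytes:
    """Insert LF breaks that are part of the transport, not the text."""
    chunks = [payload[i:i + width] for i in range(0, len(payload), width)]
    return b"\n".join(chunks)


def decode_2a(wire: bytes) -> str:
    """Drop transport line ends, then undo the overlong control codes."""
    payload = wire.replace(b"\r", b"").replace(b"\n", b"")
    out = bytearray()
    i = 0
    while i < len(payload):
        if payload[i] == 0xC0 and i + 1 < len(payload):
            out.append(payload[i + 1] & 0x3F)  # recover the control character
            i += 2
        else:
            out.append(payload[i])
            i += 1
    return out.decode("utf-8")


# Hypothetical sample with LS, PS, and a real TAB inside the text.
sample = "line one\u2028line two\twith tab\u2029new paragraph"
wire = wrap_for_transport(encode_2a(sample), width=16)
assert decode_2a(wire) == sample
```

A strict UTF-8 decoder, such as Python's built-in one, rejects these
overlong sequences, so both ends would have to agree on the modified
encoding.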

I guess that (1) and (3) would be the most popular choices.

Sincerely,

markus

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com
                        Unicode is here! --> http://www.unicode.org/


