Re: Best practice of using regex on identify none-ASCII email address from Steffen on 2013-11-02 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Sat, 02 Nov 2013 13:49:49 +0100

RFC 5322 specifies the format of internet messages, and the
RFC 6530-32 series simply redefines that format to be UTF-8 aware,
with its limits counted in characters rather than octets (multiply
line lengths etc. by 4).
These UTF-8 extensions can only be used when talking directly to
an SMTP server (or POP3 / IMAP; RFCs 6856 and 6855, I think, cover
those). And then there are

  rfc6857.txt Post-Delivery Message Downgrading for
              Internationalized Email Messages
  rfc6858.txt Simplified POP and IMAP Downgrading for
              Internationalized Email

which describe the "necessary limitations" of the entire RFC
6530-32 and RFC 6855-58 complex.
Thus, either a message conforms to RFC 5322 (possibly including
"downgraded" headers, in case someone (already) cares about
those), or both sides agree to use the UTF-8 extension.

So, today. Since RFC 5322 states very clearly:

  addr-spec     = local-part "@" domain
  local-part    = dot-atom / quoted-string / obs-local-part

  dot-atom      = [CFWS] dot-atom-text [CFWS]
  dot-atom-text = 1*atext *("." 1*atext)
  qcontent      = qtext / quoted-pair
  qtext         = %d33 /          ; Printable US-ASCII
                  %d35-91 /       ; characters not including
                  %d93-126 /      ; "\" or the quote character
                  obs-qtext
  quoted-string = [CFWS]
                  DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                  [CFWS]

any octet with a high bit set is not allowed in the local part.
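
As a rough illustration of that rule, here is a minimal Python sketch. It
only covers the dot-atom-text form (no CFWS, no quoted-string, no obsolete
syntax), the atext set is taken from RFC 5322 section 3.2.3, and the
function names are made up for this example.

```python
import re

# atext per RFC 5322 section 3.2.3: ALPHA / DIGIT and these printable specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom-text = 1*atext *("." 1*atext); CFWS and quoted-string are ignored here.
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_ascii_dot_atom(local_part: str) -> bool:
    """True if local_part is a plain RFC 5322 dot-atom-text (hypothetical helper)."""
    return DOT_ATOM_TEXT.fullmatch(local_part) is not None


def has_high_bit_octet(raw: bytes) -> bool:
    """True if any octet in raw has the high bit set (value >= 0x80)."""
    return any(b >= 0x80 for b in raw)
```

A local part that fails the first check is not necessarily invalid (it may
be a quoted-string), but one that passes the second check cannot be a
plain RFC 5322 local part.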

|>>> That being true, I wish that industry could come to consensus about
|>>> requiring everything outside of a well-defined, backwards-compatible set of
|>>> characters to be expressed as UTF-8 percent-escaped characters in these
|>>> fields when they are expressed as plaintext.
|>>>
|>>
|>> If there is not already a convention for percent-escaped UTF-8 in email
|>> addresses, then please let's not add one like that but rather escape *code
|>> points*.

What about UTF-7 (RFC 2152):

   We also feel that UTF-8 in Base64 has high expansion for non-
   Western-European users, and is less desirable because it cannot
   be read directly, even when the content is largely US-ASCII.
   The base encoding of UTF-7 gives competitive results and is
   readable for ASCII text.
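
To get a feeling for that claim, one can compare the two encodings with
Python's built-in "utf-7" codec. This is only a sketch for experimenting;
the helper name is invented here and nothing in it is specific to email
addresses.

```python
import base64


def encoded_sizes(text: str) -> dict:
    """Return the encoded length of text as UTF-7 and as Base64 of UTF-8."""
    return {
        "utf-7": len(text.encode("utf-7")),
        "base64(utf-8)": len(base64.b64encode(text.encode("utf-8"))),
    }


# ASCII letters pass through UTF-7 unchanged, so the ASCII part of a
# mostly-ASCII string stays readable; Base64 hides all of it.
print("café".encode("utf-7"))
print(base64.b64encode("café".encode("utf-8")))
print(encoded_sizes("hello"))
```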

Since MIME encoding cannot be used in the local-part (most likely
because of RFC 3986, if that matters), the following from [1] may
need to be rethought:

  It should be noted that the Unicode Standard also defines the
  UTF-7 charset, which was intended for Internet mail. However, MIME
  is quite capable of carrying UTF-8, and UTF-8 is expected to be
  used in many protocols, not just Internet mail. Fortunately, very
  few vendors implemented UTF-7, and its use is strongly discouraged
  in Internet mail.

[1] <http://www.imc.org/imcr-010.html>

--steffen

attached mail follows:

Me too... only raw bytes are accepted by the SMTP or POP3 protocols.
This does not mean that within URLs they cannot (should not) be escaped!

Of course they should be escaped, because raw bytes can't be used reliably
if they can be transformed depending on how the URL (or IRI, if the domain
name part is internationalized and written in possibly unescaped form using
IDNA) is encoded. Note that IDNA is also NOT usable at all for the local part.

However, this is still not specified in any standard for URLs, meaning that
you cannot safely embed any email address in **any** plain-text document if
the local part contains non-ASCII byte values (I say "byte values" and not
"characters" because we absolutely don't know if these bytes represent
characters or not, and can't break them into elementary subsequences
representing a single abstract character).

For such an application, where these byte values (0x80 to 0xFF
inclusive) are used in the local part of an email address for which the
binary encoding must be preserved (even if the container plain-text
document is reencoded), I see no other solution than using escaping. Note
that no escaping is needed for printable ASCII bytes, even if they are
reencoded by the container document (e.g. in EBCDIC): to get back the
correct ASCII encoding expected by SMTP and POP3, you have to reconvert
this container encoding back to ASCII (this will preserve the escaping of
other byte values).
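
A minimal Python sketch of this kind of escaping, where only bytes 0x80 to
0xFF are escaped and printable ASCII is left alone. The percent notation
and the function name are assumptions made for this illustration (no
standard defines this for local parts), and "%" itself is also escaped,
which is an extra assumption needed so the result can be decoded
unambiguously. Control characters are not handled.

```python
from urllib.parse import unquote_to_bytes


def escape_high_bytes(raw: bytes) -> str:
    """Percent-escape bytes >= 0x80 (and "%" itself), keep other ASCII as-is."""
    out = []
    for b in raw:
        if b >= 0x80 or b == 0x25:  # 0x25 is "%": escaped so decoding is unambiguous
            out.append(f"%{b:02X}")
        else:
            out.append(chr(b))
    return "".join(out)


# The escaped form is pure ASCII, so reencoding the container document
# cannot change which bytes it stands for.
escaped = escape_high_bytes(b"caf\xe9")
original = unquote_to_bytes(escaped)  # standard-library decoder for %XX
```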

Another way to allow the encoding to be preserved, while still allowing
the local part to be readable, would be to use "quoted-printable"
encoding with a prefix specifying the encoding expected by the target SMTP
server.

E.g. suppose you want to write to "café@example.net", whose SMTP server
expects the non-ASCII "é" to be encoded with 1 byte=0xE9 (because it was
expecting usernames to have been created in ISO-8859-1 or windows-1252).

Then in a URL or in any plaintext document it should be escaped:
<?Q?windows-1252?café?@example.net> or
<a href="mailto:?Q?ISO-8859-1?café?@example.net">
**even** if the container document is encoded in the same specified encoding.

If the text document is reencoded to some UTF, the "é" will be preserved,
just like the quoted-printable prefix indicator specifying the expected
target encoding. In that document the "é" may be in UTF-8 in the URL as
well, but converting that URL back to an address usable in SMTP will require
reconverting this UTF-8 encoding back to the original encoding.
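
In Python terms, that reconversion step could look roughly like the sketch
below. The function name is hypothetical, and it assumes the target charset
has already been read from the prefix; parsing the "?Q?charset?...?"
notation itself is not attempted here.

```python
def recode_local_part(doc_bytes: bytes, doc_charset: str, target_charset: str) -> bytes:
    """Convert a local part from the document's encoding to the server's encoding."""
    # Decode using the container document's encoding, then reencode with the
    # charset named in the prefix. Raises UnicodeEncodeError if a character
    # does not exist in the target charset.
    return doc_bytes.decode(doc_charset).encode(target_charset)


# "café" as stored in a UTF-8 document: "é" is the two bytes 0xC3 0xA9.
in_document = "café".encode("utf-8")
# What the windows-1252 server from the example expects: "é" as 0xE9.
on_the_wire = recode_local_part(in_document, "utf-8", "windows-1252")
```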

If the text document is converted to ASCII-only, quoted-printable will need
to be replaced by base-64, but the encoding will remain in the prefix
"?B?ISO-8859-1?"

A mailto URL or embedded email address that does not specify the target
encoding (in "quoted-printable" or base-64, like in MIME) is NOT safe to use
if it contains ANY non-ASCII character.

2013/11/2 Buck Golemon <buck_at_yelp.com>

>
>
>
> On Fri, Nov 1, 2013 at 8:40 AM, Markus Scherer <markus.icu_at_gmail.com> wrote:
>
>> On Fri, Nov 1, 2013 at 1:37 AM, Mark Davis ☕ <mark_at_macchiato.com> wrote:
>>
>>> That being true, I wish that industry could come to consensus about
>>> requiring everything outside of a well-defined, backwards-compatible set of
>>> characters to be expressed as UTF-8 percent-escaped characters in these
>>> fields when they are expressed as plaintext.
>>>
>>
>> If there is not already a convention for percent-escaped UTF-8 in email
>> addresses, then please let's not add one like that but rather escape *code
>> points*.
>>
>> markus
>>
>
> In my own trials, percent-escaped utf-8 does not work for the local part
> of the email.
> I found that only raw bytes (utf8 in my case) work acceptably.
>
Received on Sat Nov 02 2013 - 07:52:48 CDT
