Re: Best practice of using regex on identify none-ASCII email address from Mark Davis ☕ on 2013-11-01 (Unicode Mail List Archive)

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Fri, 1 Nov 2013 14:19:13 +0100

I'm not saying that what is sent to the server has to be those bytes; I'm
saying that if we use the convention that punctuation, whitespace, etc. get
escaped, it would allow us to recognize the boundaries of the local part in
plain text.
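
To make the convention concrete, here is a rough Python sketch, not a proposed spec. It assumes the literal set is XID_Continue plus a few exceptions that I picked only for illustration (".", "-", "+"); everything else is written as percent-escaped UTF-8. The names escape_local_part and find_local_part are hypothetical.

```python
# Assumption: characters kept as-is are XID_Continue plus these illustrative
# exceptions; the thread does not define the actual exception list.
EXTRA_LITERALS = set(".-+")


def is_literal(ch: str) -> bool:
    # str.isidentifier() follows XID_Start / XID_Continue, so prefixing a
    # known start character ("a") tests XID_Continue for ch.
    return ch in EXTRA_LITERALS or ("a" + ch).isidentifier()


def escape_local_part(local: str) -> str:
    """Percent-escape (as UTF-8) every character outside the literal set."""
    out = []
    for ch in local:
        if is_literal(ch):
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def find_local_part(text: str, at_index: int) -> str:
    """Walk left from '@' while characters are literals or part of an escape."""
    start = at_index
    while start > 0 and (is_literal(text[start - 1]) or text[start - 1] == "%"):
        start -= 1
    return text[start:at_index]


escaped = escape_local_part("john smith")      # "john%20smith"
text = f"Mail {escaped}@example.com today"
local = find_local_part(text, text.index("@"))
```

Because whitespace and punctuation never appear unescaped, the leftward scan stops at the first such character, which is what makes the boundary recoverable from plain text.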

I think what you mention is part of a more general problem. Let's suppose
that I have an email address where the bytes that the server recognizes for
the local part are <61 B3>@foo.com. I convert that using ISO 8859-14 to
aġ@foo.com. I send it in an email to you, and you receive it as UTF-8. You see
aġ@foo.com, but underneath the covers it is bytes <61 C4 A1>. But then you
send to the server <61 C4 A1>@foo.com, and it fails. Or worse yet, reaches
someone whose email is aÄ¡@foo.com. (Ok, I could have poked around and
found a more compelling example, but you see the point).
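
The round trip above can be reproduced in a few lines of Python. This is only an illustration using the standard codecs; it assumes the sender's charset is ISO 8859-14 (where byte B3 is ġ) and that the wrong mailbox is produced by a Latin-1 reading of the UTF-8 bytes.

```python
# Bytes the server actually recognizes for the local part.
raw = bytes([0x61, 0xB3])

# The sender displays them using ISO 8859-14 (assumed charset).
shown = raw.decode("iso8859_14")        # "aġ"

# The recipient's software stores and resends the text as UTF-8.
resent = shown.encode("utf-8")          # bytes 61 C4 A1

# The server compares bytes, so the address no longer matches.
matches_original = resent == raw        # False

# Read as Latin-1, the resent bytes name a different local part entirely.
misread = resent.decode("latin-1")      # "aÄ¡"
```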

If I really wanted to be absolutely certain that my email wouldn't be
munged by a conversion, I'd never convert from bytes: we'd never see
"mark@foo.com", we'd always see the equivalent of %6D%61%72%6B@foo.com.
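
A minimal sketch of that byte-only form, assuming every byte of the local part is escaped regardless of what character it might represent. The helper names percent_encode_all and percent_decode_all are made up for this example.

```python
def percent_encode_all(raw: bytes) -> str:
    """Write every byte as %XX so no charset conversion can alter it."""
    return "".join(f"%{b:02X}" for b in raw)


def percent_decode_all(escaped: str) -> bytes:
    """Inverse of percent_encode_all; expects only %XX sequences."""
    return bytes.fromhex(escaped.replace("%", ""))


ascii_form = percent_encode_all(b"mark")              # "%6D%61%72%6B"
legacy_form = percent_encode_all(bytes([0x61, 0xB3]))  # "%61%B3"
round_trip = percent_decode_all(legacy_form)           # original bytes back
```

The escaped string is pure ASCII, so it survives any charset conversion unchanged, at the cost of being unreadable.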

Mark <https://google.com/+MarkDavis>
*— Il meglio è l’inimico del bene —*

On Fri, Nov 1, 2013 at 1:36 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

>
>
> 2013/11/1 Mark Davis ☕ <mark_at_macchiato.com>
>
>> These are two well-known serious flaws in EAI and URLs; there is no
>> useful syntactic limit on what is in the query part of a URL or on the
>> local part of an email address that would allow their boundaries to be
>> detected in plaintext.
>>
>> No use complaining about them, because people are concerned with
>> backwards compatibility, and wouldn't change the underlying specs.
>>
>> That being true, I wish that industry could come to consensus about
>> requiring everything outside of a well-defined, backwards-compatible set of
>> characters to be expressed as UTF-8 percent-escaped characters in these
>> fields when they are expressed as plaintext. (Something like XID_Continue ±
>> exceptions.) That would allow for unambiguous parsing in plaintext.
>>
>
> Why "UTF-8" only? Email accounts already exist that were created with
> various ISO 8859-* or Windows code pages, or KOI8-R (or KOI8-U). And none of
> these addresses are aliased with a UTF-8 encoded account name reaching the
> same mailbox (creating these aliases would help users with such accounts
> to protect their privacy; however, there may be rare cases where these
> aliases would conflict with distinct mail accounts).
>
Received on Fri Nov 01 2013 - 08:22:00 CDT
