We all know that a regex to validate emails properly would be quite complicated. However, jQuery's validation plugin has a shorter regex (contributed by Scott Gonzalez), spanning only a few lines:

/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|
((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|
[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]
|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/

Why is this so 'simple' compared to the more well-known monstrosity? Are there cases where one regex would fail and the other would succeed (whether the cases are valid or invalid emails)?

2 Answers

#1



The regex is a custom combination of:

  • RFC 2234 ABNF
  • RFC 2396 URI Generic Syntax (obsoleted by RFC 3986)
  • RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1
  • RFC 2822 Internet Message Format
  • RFC 3987 IRI
  • RFC 3986 URI Generic Syntax

I wrote the regex when Web Forms 2.0 was being drafted and RFC 5322 did not exist. If you look at the order in which the RFCs were written, you'll notice that the definition for IRI and URI changed after Internet Message Format was written. This means that RFC 2822 does not support current IRI definitions. Unfortunately, it wasn't a simple task of just substituting definitions, so I had to pick and choose which definitions to use from which RFCs. I also made choices about what to remove (like support for comments).

The regex is not fully hand-written. While I did manually write every section of the regex, I scripted the "glue". Each definition from the RFCs is stored in a variable, with compound definitions utilizing the variables that store the simpler definitions (@Walf: this is why there are so many subpatterns and ors).
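As a rough illustration of that "glue" approach, here is a minimal sketch in JavaScript. The fragment names (`alnum`, `atext`, `dotAtom`, `label`, `domain`) and their definitions are simplified assumptions made up for this example; they are not the definitions used in the original script or in the jQuery Validation plugin.

```javascript
// Hypothetical, heavily simplified fragments. Each one is a plain string,
// and compound definitions are built by interpolating simpler ones.
const alnum = "[a-z\\d]";
const atext = "[a-z\\d!#$%&'*+/=?^_`{|}~-]";

// dot-atom: one or more atext runs separated by single dots
const dotAtom = `${atext}+(?:\\.${atext}+)*`;

// a domain label: starts and ends with an alphanumeric, hyphens allowed inside
const label = `${alnum}(?:[a-z\\d-]*${alnum})?`;

// this sketch requires at least two labels (an assumption, not an RFC rule)
const domain = `${label}(?:\\.${label})+`;

// The "glue": assemble the final anchored, case-insensitive pattern.
const addrSpec = new RegExp(`^${dotAtom}@${domain}$`, "i");

console.log(addrSpec.test("john.doe@example.com"));
console.log(addrSpec.test("john..doe@example.com"));
```

Because every compound definition wraps its parts in a group, the generated source ends up with many nested subpatterns and alternations, which is the effect described above.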

To complicate the matter, the version of the regex that is used in the jQuery Validation plugin is modified even further to account for differences between spec-valid addresses and user expectation of a valid address. I have no recollection of what modifications I made. I promised Jörn Zaefferer (the author of the validation plugin) that I would write a newer script to generate the regex. The new script would allow you to specify options for what you do and don't want to support (required TLD, specific TLDs, IPv6, comments, obsolete definitions, quoted local names, etc.). That was 5 years ago. I started it once, but never finished. Maybe one day I will. What I have so far is hosted on GitHub: https://github.com/scottgonzalez/regex-builder

If you want a regex for validating email addresses, I'd suggest the following regex which is included in the HTML5 specification:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
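A quick way to get a feel for what this pattern accepts and rejects is to run it against a few addresses. The sample addresses below are my own picks for illustration; the regex is copied unchanged from above.

```javascript
// The HTML5 regex quoted above, unchanged.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Illustrative samples (assumptions chosen for this example).
const samples = [
  "user@example.com",       // ordinary address
  "user@localhost",         // a dot in the domain is not required
  "a..b@example.com",       // consecutive dots in the local part are allowed
  '"quoted"@example.com',   // quoted local parts are not supported
  "user@-example.com",      // a label may not start with a hyphen
  "user@example.com.",      // a trailing dot is not allowed
];

for (const address of samples) {
  console.log(address, html5Email.test(address));
}
```

Note that the pattern is deliberately more permissive than the RFCs in some places (consecutive dots) and stricter in others (no quoted strings, no comments).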

If you use regex-builder and turn off all the options, you'll get something similar. But it's been about a year since I looked at that, so I don't remember what the differences are.


I'd also like to point out that the link in the original question specifically mentions RFC 822. While it's great that RFC 822 advanced us from Arpanet to the ARPA Internet, this isn't exactly current. The Internet has made a few advances in the past three decades and this RFC has been superseded twice. I'd like to see any new work following the latest standards.


UPDATE:

A friend asked me why the HTML5 regex doesn't support UTF-8. I've never asked Hixie about it, but I assume this is the reason: Even though some TLDs started to support IDNs (Internationalized Domain Names) in 2000 and RFC 3987 (IRI) was written in 2005, when RFC 5322 was written in 2008 it only listed characters in the ranges 33-90 and 94-126 as valid dtext (characters allowed for use in a domain literal). HTML5 is based on RFC 5322 and as a result there is no UTF-8 support. It certainly seems strange that RFC 5322 doesn't account for IDNs, but it's worth noting that even in 2008 IDNs weren't actually usable. It wasn't until 2010 that ICANN approved the first set of IDNs. However, even today if you want to use an IDN, you pretty much need to completely destroy your domain name using Punycode if you actually want things like email and DNS to work globally.
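To make the Punycode point concrete, here is a small sketch that converts the domain part of an address to its ASCII form before validating it. The helper name `toAsciiEmail` is hypothetical, and leaning on the WHATWG `URL` parser for the conversion is my own shortcut, not something the HTML5 spec prescribes for email. The parser also lowercases the host and does no further validation of the input.

```javascript
// Hypothetical helper: converts only the domain part to Punycode (ASCII),
// leaving the local part untouched. Returns null if it cannot be parsed.
function toAsciiEmail(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return null; // missing "@" or empty local part
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  try {
    // The URL parser applies IDNA processing to the host name.
    return `${local}@${new URL(`http://${domain}`).hostname}`;
  } catch {
    return null; // the domain was empty or not a parseable host
  }
}

// The HTML5 regex from the answer, unchanged.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

const raw = "user@bücher.example";
const ascii = toAsciiEmail(raw);

console.log(html5Email.test(raw));   // the non-ASCII domain is rejected
console.log(ascii);                  // user@xn--bcher-kva.example
console.log(html5Email.test(ascii)); // the Punycode form is accepted
```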

UPDATE 2:

Updated HTML5 regex to match the updated spec, which changed label length limits from 255 characters to 63 characters, as specified in RFC 1034 section 3.5.
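The 63-character limit comes from the `{0,61}` quantifier sitting between two mandatory alphanumeric characters (1 + 61 + 1). A short check of the boundary, using made-up test addresses:

```javascript
// The HTML5 regex from the answer, unchanged.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Illustrative addresses with a 63- and a 64-character first label.
const maxLabel = `user@${"a".repeat(63)}.com`;
const tooLong = `user@${"a".repeat(64)}.com`;

console.log(html5Email.test(maxLabel)); // 63 characters: accepted
console.log(html5Email.test(tooLong));  // 64 characters: rejected
```

The limit applies per label, so the pattern itself does not cap the total length of the domain.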
