reptile7's JavaScript blog: So, You Want to Validate an Email Address, Huh?

Saturday, September 09, 2006

So, You Want to Validate an Email Address, Huh?
Blog Entry #50

The following code for validating an email address appears in HTML Goodies' "JavaScript Basics Part 3" primer:

<script type="text/javascript">
function validateForm( ) {
    var email = document.forms.tutform.elements.email.value;
    /* the line above can alternately be written as:
    var email = document.tutform.email.value; */
    if (!/^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/.test(email)) {
        alert("Please enter a valid e-mail address.");
        return false;
    }
    return true;
}
</script>
<form name="tutform" onsubmit="return validateForm( );">
Email Address: <input name="email"><p>
<input type="submit" value="Submit Form">
<input type="reset" value="Reset Form">
</form>

The user enters an email address into the email field and clicks the "Submit Form" button, triggering the validateForm( ) function. The value of the email field is assigned to the variable email, which is then tested via the test( ) method of the RegExp object (discussed in the previous post) against the following regexp pattern:

/^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/

Before we dissect this thing, we should note that a majority of email addresses conform to the following 'anatomy':

local-part@second-level-domain.top-level-domain

It is, of course, not at all unusual for an email address to have more than two domains to the right of the commercial at (@) separator; I myself once had a reptile7@mailhost.tcs.tulane.edu email address. An email address can also end in a country-code top-level domain preceded by a category-style second-level domain, e.g., feedback@uq.edu.au, in which au is the top-level domain and edu is a second-level domain beneath it.

Wikipedia details the character limitations of the various parts of an email address. In brief:
(1) The local-part can contain letters, numbers, and the following characters:
! # $ % & ' * + - / = ? ^ _ ` { | } ~
Internal and nonconsecutive periods are also allowed, e.g., x.y@some-domain.com and x.y.z@some-domain.com are OK, but .xy.@some-domain.com and x..y@some-domain.com are not.
In general, the following printable ASCII characters are not usable in the local-part (there are special situations that allow them, however - we'll see one later):
" ( ) , : ; < > @ [ \ ]
Spaces via the space bar are also not OK.
(2) The domains after the @ symbol can contain letters, numbers, and internal hyphens.
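
The rules above can be roughly sketched in code. What follows is only a minimal sketch, not a full implementation of any standard: the names atom, localPartPattern, labelPattern, isPlausibleLocalPart, and isPlausibleDomainLabel are my own hypothetical choices, and the sketch ignores the special situations that allow the otherwise unusable characters.

```javascript
// Hypothetical helpers: a rough sketch of the character rules listed above.
// An "atom" is one or more letters, digits, or permitted punctuation characters.
var atom = "[a-z\\d!#$%&'*+\\-/=?^_`{|}~]+";

// Atoms separated by single periods: no leading, trailing, or consecutive periods.
var localPartPattern = new RegExp("^" + atom + "(?:\\." + atom + ")*$", "i");

// A single domain label: letters and digits, with hyphens allowed only internally.
var labelPattern = /^[a-z\d](?:[a-z\d-]*[a-z\d])?$/i;

function isPlausibleLocalPart(str) {
    return localPartPattern.test(str);
}

function isPlausibleDomainLabel(str) {
    return labelPattern.test(str);
}

console.log(isPlausibleLocalPart("x.y"));           // true
console.log(isPlausibleLocalPart("x.y.z"));         // true
console.log(isPlausibleLocalPart(".xy."));          // false
console.log(isPlausibleLocalPart("x..y"));          // false
console.log(isPlausibleLocalPart("x y"));           // false (space)
console.log(isPlausibleDomainLabel("some-domain")); // true
console.log(isPlausibleDomainLabel("-bad"));        // false (leading hyphen)
```

Building the local-part pattern from a string keeps the long character class in one place; the price is that each backslash must be doubled.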

We're ready now to go through the above regexp pattern from left to right.
• After the ^ start-of-string anchor, [a-zA-Z] matches a single letter character, lowercase or uppercase. (Note that the pattern does not have an i flag.)
• [\w\.-]* matches zero or more alphanumeric characters, underscores, periods, or hyphens; \w is equivalent to [a-zA-Z0-9_]. The backslash preceding the period here is unnecessary (see the update towards the end of Blog Entry #48).
• [a-zA-Z0-9] matches a single alphanumeric character; this concludes the local-part of the email address.
• Next is the @ separator.
• [a-zA-Z0-9] matches a single alphanumeric character.
• [\w\.-]* matches zero or more alphanumeric characters, underscores, periods, or hyphens.
• [a-zA-Z0-9] matches a single alphanumeric character; this concludes the second-level-domain part of the email address.
• \. matches the period separating the second-level domain from the top-level domain.
• [a-zA-Z] matches a single letter character.
• [a-zA-Z\.]* matches zero or more letter characters or periods.
• [a-zA-Z] matches a single letter character; the pattern ends with the $ end-of-string anchor.

At this point, we can rewrite the pattern in a shorter form:

var email_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

Clearly, there are a great many valid email addresses that the pattern will match but there are others it won't; for example, the pattern mandates that the first character be a letter and thus would not match 123joe@some-domain.com. Moreover, there are invalid email addresses that the pattern will match, e.g., joe...burns@some-domain.com.
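
The hits and misses described above are easy to check for yourself. The snippet below is a small sketch that runs both the original pattern and the shortened form against a few sample addresses; the names long_regexp, samples, and results are hypothetical, and the j@some-domain.com sample is my own addition.

```javascript
// The original HTML Goodies pattern and the shortened form given above.
var long_regexp = /^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/;
var email_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

var samples = [
    "joe@some-domain.com",         // valid, and matched
    "123joe@some-domain.com",      // valid, but not matched: the first character is a digit
    "j@some-domain.com",           // valid, but not matched: the local-part needs two or more characters
    "joe...burns@some-domain.com"  // invalid (consecutive periods), yet matched
];

var results = [];
for (var i = 0; i < samples.length; i++) {
    // Record the verdict of each form of the pattern for this sample.
    results.push([long_regexp.test(samples[i]), email_regexp.test(samples[i])]);
    console.log(samples[i] + ": " + results[i][0] + ", " + results[i][1]);
}
```

The two forms agree on every sample here, which is what we'd expect if the shortening didn't change the pattern's behavior.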

On the plus side, the [\w.-]* section of the pattern to the right of the @ symbol gives the pattern the flexibility to match email addresses with multiple domains. Both the reptile7@mailhost.tcs.tulane.edu and feedback@uq.edu.au email addresses given above are matched by the pattern, for example:

[a-z] matches r
[\w.-]* matches eptile
[a-z\d] matches 7
@ matches @
[a-z\d] matches m
[\w.-]* matches ailhost.tcs.tulan
[a-z\d] matches e
\. matches .
[a-z] matches e
[a-z.]* matches d
[a-z] matches u

You may be wondering why the post-@ [\w.-]* section matches ailhost.tcs.tulan and not ailhos; this is because the * quantifier is "greedy" and thus returns "the leftmost longest match," as explained in detail here: it first grabs as many characters as it can and then gives characters back only as far as is needed for the rest of the pattern to match. In fact, because the greedy post-@ [\w.-]* section can already absorb any internal periods, the period in the pattern's penultimate [a-z.]* section is unnecessary - you can verify this for yourself by applying the pattern in the manner above to the feedback@uq.edu.au email address.
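
One way to watch the greedy matching at work is to wrap each section of the pattern in parentheses and let the exec( ) method of the RegExp object report what each section matched. This is only an illustrative sketch: the parentheses and the names grouped_regexp, no_period_regexp, m1, and m2 are my additions and are not part of the HTML Goodies code.

```javascript
// email_regexp with each section parenthesized so that exec() "remembers"
// the part of the string that the section matched.
var grouped_regexp = /^([a-z])([\w.-]*)([a-z\d])@([a-z\d])([\w.-]*)([a-z\d])\.([a-z])([a-z.]*)([a-z])$/i;

var m1 = grouped_regexp.exec("reptile7@mailhost.tcs.tulane.edu");
console.log(m1.slice(1));
// ["r", "eptile", "7", "m", "ailhost.tcs.tulan", "e", "e", "d", "u"]

var m2 = grouped_regexp.exec("feedback@uq.edu.au");
console.log(m2.slice(1));
// ["f", "eedbac", "k", "u", "q.ed", "u", "a", "", "u"]

// The same pattern without the period in the penultimate section.
var no_period_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z]*[a-z]$/i;
console.log(no_period_regexp.test("reptile7@mailhost.tcs.tulane.edu")); // true
console.log(no_period_regexp.test("feedback@uq.edu.au"));               // true
```

For feedback@uq.edu.au, the penultimate section ends up matching an empty string, so its period never comes into play.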

Before we move on, let's finish our deconstruction of the script and its validateForm( ) function. If email and email_regexp don't match, then email_regexp.test(email) returns false and thus the if condition, !email_regexp.test(email), returns true; a "Please enter a valid e-mail address" alert( ) pops up and false is returned to the onSubmit event handler in the <form> tag:

<form name="tutform" onsubmit="return validateForm( );"> becomes
<form name="tutform" onsubmit="return false;">

this cancels the submit event, i.e., the user's input is not sent to the form's processing agent. (We learned in Blog Entry #47 that click events are also cancelable via return false statements.)

If email and email_regexp do match, then email_regexp.test(email) returns true and thus the if condition returns false; in this case, validateForm( ) returns true to the onSubmit function call and the form is submitted.
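
The true/false bookkeeping above can be tried out away from a form. The following is a minimal sketch that assumes the email string is passed in as an argument and that the alert( ) is replaced by a caller-supplied function; the names validateEmailValue, notify, record, and messages are hypothetical stand-ins and do not appear in the HTML Goodies code.

```javascript
var email_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

// Hypothetical, DOM-free stand-in for validateForm( ).
function validateEmailValue(email, notify) {
    if (!email_regexp.test(email)) {
        // No match: test( ) returned false, so the negated condition is true.
        notify("Please enter a valid e-mail address.");
        return false; // in the form, this would cancel the submit event
    }
    return true; // in the form, this would let the submission go ahead
}

// Collect the messages instead of popping up alert( ) boxes.
var messages = [];
function record(msg) {
    messages.push(msg);
}

console.log(validateEmailValue("joe@some-domain.com", record)); // true
console.log(validateEmailValue("joe@some-domain", record));     // false (no top-level domain)
console.log(messages.length);                                   // 1
```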

Other email address regexp patterns

Regular-Expressions.info

Jan Goyvaerts, the architect of http://www.regular-expressions.info/, has posted a carefully-thought-out essay on the validation of email addresses here in which he offers the following regexp pattern for validating an email address:

/^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

For the local-part:
• [A-Z0-9._%-]+ matches one or more letters, numbers, periods, underscores, percent signs, or hyphens.
For the domain name:
• [A-Z0-9.-]+ matches one or more letters, numbers, periods, or hyphens. Like the *, the + quantifier is "greedy"; consequently, the combination of the period and the + again allows the pattern to match email addresses with multiple domains.
• For the top-level domain, [A-Z]{2,4} matches a minimum of two letters and a maximum of four letters (e.g., .ca, .gov, .name).

Mr. Goyvaerts claims that his regexp pattern "matches 99% of the email addresses in use today," although, like the HTML Goodies regexp pattern discussed above, his pattern won't match some valid email addresses. Anticipating just such an objection, he says, "[A] regexp to match truly any possible email address is not only hideously complex, it's also totally useless." Complementarily, he also warns, "Don't go overboard in trying to eliminate invalid email addresses with your regular expression." His article concludes with a humongously long regexp pattern for validating email addresses that conform to the RFC 822 standard, which, Wikipedia notes, "was obsoleted in April 2001 by RFC 2822."

Mr. Goyvaerts' pattern does match reptile7_@excite.com, my longest-standing email address, which is not matched by the HTML Goodies pattern because its local-part ends with an underscore.
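
Here's a quick side-by-side sketch of the two patterns. The names goodies_regexp, goyvaerts_regexp, addresses, and verdicts are hypothetical, and the joe+news@example.com and joe@example.museum samples are my own test cases, not addresses taken from either source.

```javascript
// The shortened HTML Goodies pattern and Mr. Goyvaerts' pattern.
var goodies_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;
var goyvaerts_regexp = /^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i;

var addresses = [
    "reptile7_@excite.com",   // local-part ends with an underscore
    "123joe@some-domain.com", // local-part begins with a digit
    "joe+news@example.com",   // + is not in the [A-Z0-9._%-] class
    "joe@example.museum"      // six-letter top-level domain exceeds {2,4}
];

var verdicts = {};
for (var i = 0; i < addresses.length; i++) {
    verdicts[addresses[i]] = {
        goodies: goodies_regexp.test(addresses[i]),
        goyvaerts: goyvaerts_regexp.test(addresses[i])
    };
    console.log(addresses[i], verdicts[addresses[i]]);
}
// reptile7_@excite.com:   goodies false, goyvaerts true
// 123joe@some-domain.com: goodies false, goyvaerts true
// joe+news@example.com:   goodies false, goyvaerts false
// joe@example.museum:     goodies true,  goyvaerts false
```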

WebReference.com

The 10th part of WebReference.com's "Pattern Matching and Regular Expressions" tutorial addresses email address validation and offers the following regexp pattern for validating an email address:

var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

Comments:
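• As noted in Blog Entry #48, a period outside of square brackets matches any character except a newline (\n) character; the opening .+ therefore accepts any run of one or more such characters, spaces included, as the local-part.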
• The @ symbol is not a regexp metacharacter and does not need to be escaped with a backslash.
For the domain name:
• Neither the hyphen nor the period in the [a-zA-Z0-9\-\.]+ section needs to be escaped with a backslash.
• The [a-zA-Z]{2,3} section leaves four-letter top-level domains (.aero, .arpa, .coop...) out in the cold. Brrrr!

The reg2 pattern is designed to match email addresses ending in IP addresses as well as those that end in domain names, and contains three parenthesized clauses towards this end:
(\[?)
([a-zA-Z]{2,3}|[0-9]{1,3})
(\]?)
In this regard, however, the "Restrictions on email addresses" section of the RFC 3696 standard notes, "The domain name can also be replaced by an IP address in square brackets, but that form is strongly discouraged except for testing and troubleshooting purposes."

In regular expressions, the opening and closing parentheses are metacharacters that are used to create groupings for various reasons; for example, they can serve as containers for | OR alternatives, as in the ([a-zA-Z]{2,3}|[0-9]{1,3}) clause above, or they can be added to a regexp pattern simply to improve its readability (this seems to be their purpose in the (\[?) and (\]?) clauses). The 5th part of the WebReference.com tutorial notes that parentheses can also be used to "remember" for subsequent use parts of a string matched by a regexp pattern.

(BTW, outside of a character class, the closing square bracket ] is not a regexp metacharacter and does not need to be escaped with a backslash.)
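
To see what reg2 does and doesn't let through, and what its parentheses "remember," here's a short sketch. The sample addresses and the name m are my own; only the reg2 pattern itself comes from WebReference.com.

```javascript
var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

console.log(reg2.test("joe@example.com"));       // true
console.log(reg2.test("joe@[192.168.0.1]"));     // true: IP address in square brackets
console.log(reg2.test("joe@example.info"));      // false: four-letter top-level domain
console.log(reg2.test("joe@[example.com"));      // true: the two brackets are optional independently
console.log(reg2.test("joe burns@example.com")); // true: .+ accepts the space

// exec( ) returns the parts of the string matched by the three parenthesized clauses.
var m = reg2.exec("joe@[192.168.0.1]");
console.log(m[1]); // "["
console.log(m[2]); // "1"
console.log(m[3]); // "]"
```

Note that nothing in the pattern ties the closing bracket to the opening one, so an unbalanced bracket slips through.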

The reg2 pattern is augmented with a second pattern that matches invalid email addresses:

var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/;

• The first clause, (@.*@), catches email addresses with two @ symbols. Playing the Devil's advocate, however, I see from the RFC 3696 standard that Abc\@def@example.com is a valid email address.
• The second clause, (\.\.), catches email addresses with two (or more) consecutive periods.
• The third and fourth clauses respectively catch email addresses with periods immediately following and preceding the @ separator.
• The fifth clause catches email addresses that begin with a period.
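
One plausible way to put the two patterns together is to accept an address only if reg1 does not match it and reg2 does. I should stress that this is an assumption on my part: the looksLikeEmail function below is a hypothetical sketch, and WebReference.com's actual code may combine the patterns differently.

```javascript
var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/;
var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

// Hypothetical helper: reject anything reg1 flags as invalid,
// then require the overall shape that reg2 describes.
function looksLikeEmail(str) {
    return !reg1.test(str) && reg2.test(str);
}

console.log(looksLikeEmail("joe@example.com"));         // true
console.log(looksLikeEmail("joe..burns@example.com"));  // false: consecutive periods
console.log(looksLikeEmail("joe@.example.com"));        // false: period right after the @
console.log(looksLikeEmail("joe.@example.com"));        // false: period right before the @
console.log(looksLikeEmail(".joe@example.com"));        // false: leading period
console.log(looksLikeEmail("Abc\\@def@example.com"));   // false: two @ symbols
```

The last sample is the Abc\@def@example.com address from RFC 3696 (the backslash is doubled inside the JavaScript string literal); the first clause of reg1 rejects it.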

I leave it to you to deconstruct the rest of WebReference.com's code.

Obviously, all of the regular expressions above for validating email addresses leave at least some room for improvement. I'm not going to offer you my own email address regexp pattern but will throw in my lot with Mr. Goyvaerts' pattern because of its simplicity.

RegExLib.com maintains a large collection of email address regexp patterns.

The next post will conclude our discussion of regular expressions with a brief look at password validation.

reptile7
- posted by A. Peak @ 12:53 PM
