reptile7's JavaScript blog
Tuesday, September 26, 2006
 
First Intermission
Blog Entry #52

HTML Goodies' JavaScript Primers #30 contains no scripts but is a 'commencement address' of sorts in which Joe waxes philosophical on the primers series and on teaching and learning in general. In Primer #30 Joe also hawks his book JavaScript Goodies, although his link to it is broken; the current Barnes & Noble* page for the second (2001) edition of JavaScript Goodies is here.
(*Joe's link points to a page at Fatbrain.com, which was purchased by Barnes & Noble in 2000.)

With respect to further JavaScript instruction, Joe recommends that we tuck into HTML Goodies' Beyond HTML : JavaScript tutorials. We may well tackle some of these tutorials in the future, but for the time being, we will move next to a discussion of the HTML Goodies JavaScript Script Tips, as parenthetically noted in my first blog entry.

The Script Tips examine, in varying degrees of detail, about 30 JavaScript scripts over the course of 93 "tip" articles; each script spans 1-5 tips. More so than the primers we've covered heretofore, these tips are meant to bring us out of the ivory tower and provide real-world examples of useful things we can do with JavaScript. The tip scripts include, for example, a guestbook script, a calculator script, a guitar chord script, and two different digital clock scripts - some good stuff is on tap here, even if it needs to be cleaned up a bit.

In the next post, then, we'll start our Script Tip odyssey by checking over Script Tips #1-4, which collectively discuss two small snippets of code. These first tips begin very simply, but IMO, it will still be worthwhile to catalogue their contents, comment on them, and review some fundamental aspects of JavaScript that I myself was unaware of when I launched this blog.

reptile7

Tuesday, September 19, 2006
 
Channeling Allen Ludden
Blog Entry #51

In recent entries, we've addressed various aspects of the validation of text field input. We turn our attention in this post to a related and complementary topic: the validation of user input into password fields. In a sense, validating a password requires an approach opposite to that for validating a name, ZIP code, etc.: for a password, you deliberately want the user to enter something weird, i.e., as difficult to crack as possible.

There is no universal standard for a strong password. A Google search for "password requirements" generates almost 100,000 hits from all types of organizations, each having its own password policy. For this entry, I arbitrarily chose as a benchmark the "Minimum Password Complexity Standards" recommended by the University of California at Berkeley, according to which a password MUST:
(1) contain eight characters or more; and
(2) contain characters from two of the following three character classes:
(a) letters;
(b) numbers;
(c) all other printable ASCII characters (! @ # $ % ^ & * ( ) _ + | ~ - = \ ` { } [ ] : " ; ' < > ? , . /).

Now then: how do we ensure that a user's password conforms to these guidelines?

(We will not discuss Berkeley's "The password MUST NOT be..." guidelines, which are not validatable as far as I am aware.)

Solution #1

<script type="text/javascript">
function validpw( ) {
    var userpw = document.fpw.pw.value;
    // Length requirement: at least 8 characters
    if (userpw.length < 8) {
        window.alert("Your password must have at least 8 characters.");
        document.fpw.pw.focus( );
    }
    else {
        // One regexp per Berkeley character class
        var lett = /[a-z]/i;
        var num = /\d/;
        var nonalphanum = /[_\W]/;
        // At least two of the three classes must be present
        if ( (lett.test(userpw) && num.test(userpw)) || (lett.test(userpw) && nonalphanum.test(userpw)) || (num.test(userpw) && nonalphanum.test(userpw)) )
            window.alert("Thank you.");
        else {
            window.alert("Please choose a broader range of characters for your password.");
            document.fpw.pw.focus( );
        }
    }
}
</script>
<form name="fpw">
Enter your password, please:
<input type="password" name="pw"><p>
<input type="button" value="Submit" onclick="validpw( );"> <input type="Reset">
</form>

Comments

In the validpw( ) function, the user's password input is assigned to the identifier userpw. Subsequently, the script uses a simple comparison (à la the Primer #29 Script) to address userpw's ≥8-character length requirement.

The lett, num, and nonalphanum regular expressions correlate with the three character classes listed above. In complement to \w (defined in the previous post), \W matches any character that is not a letter, a number, or an underscore. \W will match a space character, but I can't think of any reason why a password shouldn't contain spaces.

Finally, I addressed the 2-out-of-3 character class requirement by stringing together a series of regexp_name.test(userpw) commands with the && and || logical operators.
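
The 2-out-of-3 requirement can also be checked by counting how many character classes are present, which avoids spelling out each pairing. What follows is only a minimal sketch and not part of the script above: the function name meetsBerkeley and the idea of returning a boolean (instead of popping up an alert) are assumptions made for illustration.

```javascript
// Hypothetical helper (not in the original script): counts the character
// classes that appear in pw instead of enumerating the three pairings.
function meetsBerkeley(pw) {
  if (pw.length < 8) return false;          // length requirement
  var classes = [/[a-z]/i, /\d/, /[_\W]/];  // letters, numbers, everything else
  var hits = 0;
  for (var i = 0; i < classes.length; i++) {
    if (classes[i].test(pw)) hits++;
  }
  return hits >= 2;                         // 2-out-of-3 requirement
}

console.log(meetsBerkeley("abcdefgh"));  // false: letters only
console.log(meetsBerkeley("abcdefg1"));  // true: letters + a number
console.log(meetsBerkeley("pass word")); // true: letters + a space (matched by \W)
console.log(meetsBerkeley("ab1!"));      // false: too short
```

The same regexps are used as in Solution #1, so the two approaches should agree on any input.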


Solution #2

Naturally, I wondered, "Is there any way we can represent the Berkeley standards with a single regular expression?" Indeed there is:

<script type="text/javascript">
function validpw2( ) {
    var userpw2 = document.fpw2.pw2.value;
    var pw_regexp = /(?=^[\s\S]{8,}$)((?=[\s\S]*[a-z])(?=[\s\S]*\d)|(?=[\s\S]*[a-z])(?=[\s\S]*[_\W])|(?=[\s\S]*\d)(?=[\s\S]*[_\W]))/i;
    if (pw_regexp.test(userpw2)) window.alert("Thank you.");
    else {
        window.alert("Your password does not meet our length and/or character requirements.");
        document.fpw2.pw2.focus( );
    }
}
</script>
<form name="fpw2">
Enter your password, please:
<input type="password" name="pw2"><p>
<input type="button" value="Submit" onclick="validpw2( );"><input type="Reset">
</form>

The star of the script above is the regexp pattern pw_regexp, which makes extensive use of a "positive lookahead" construct having the following general syntax:

x(?=regexp_pattern)y

As explained by Regular-Expressions.info here and here, the browser:
(a) matches x if x is followed by a match of the regexp_pattern in the parentheses, but then
(b) discards the regexp_pattern part of the match, and
(c) returns to the dividing line between x and the character following x in attempting to match y.

A series of positive lookaheads

x(?=regexpA)(?=regexpB)(?=regexpC)(?=regexpD)...

thus allows us to compare a series of regexp patterns with the same (sub)string, because with each matching lookahead the browser will return to the dividing line between x and the character following x.
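
To see the "return to the dividing line" behavior in action, here is a small sketch that can be run in any JavaScript engine that supports lookaheads. The variable name both and the sample strings are made up for illustration.

```javascript
// Both lookaheads are tried from the same position (the start of the string);
// neither one consumes any input.
var both = /^(?=[\s\S]*\d)(?=[\s\S]*[a-z])/i;

console.log(both.test("abc123")); // true: a digit and a letter both lie ahead
console.log(both.test("123456")); // false: no letter anywhere ahead
console.log(both.test("abcdef")); // false: no digit anywhere ahead

// The overall match is empty, because lookaheads are zero-width.
console.log(both.exec("abc123")[0].length); // 0

// x(?=pattern): the q matches only if a u follows, and the u is not part of the match.
console.log(/q(?=u)/.exec("quit")[0]); // q
console.log(/q(?=u)/.test("qatar"));   // false
```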

Let's briefly look at the positive lookaheads that compose the pw_regexp pattern.

• (?=^[\s\S]{8,}$) matches any input of ≥8 characters; we learned in Blog Entry #48 that the [\s\S] pattern matches any single character. The {8,} quantifier format is discussed by Regular-Expressions.info in the "Limiting Repetition" section of this page. Nothing precedes (?=^[\s\S]{8,}$) in the pw_regexp pattern; consequently, assuming that the user's input userpw2 comprises at least 8 characters, the browser returns to the dividing line between the void to the left of userpw2 and the first character of userpw2 before comparing the rest of pw_regexp with userpw2.

• (?=[\s\S]*[a-z]) checks userpw2 for the presence of at least one letter character, lowercase or uppercase (note pw_regexp's i flag); it specifically matches a letter character preceded by zero or more characters (regardless of what they are).

• Similarly, (?=[\s\S]*\d) and (?=[\s\S]*[_\W]) check userpw2 for the presence of at least one digit and at least one nonalphanumeric character, respectively.

Like the test( ) commands of Solution #1 above, the (?=[\s\S]*[a-z]), (?=[\s\S]*\d), and (?=[\s\S]*[_\W]) lookaheads are combined and alternated so as to satisfy the 2-out-of-3 character class requirement.
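
The pw_regexp pattern can be exercised outside of a form as well. The sketch below copies the pattern verbatim from the script above; the wrapper function checkPw and the sample passwords are hypothetical and exist only for this demonstration.

```javascript
// Pattern copied verbatim from Solution #2.
var pw_regexp = /(?=^[\s\S]{8,}$)((?=[\s\S]*[a-z])(?=[\s\S]*\d)|(?=[\s\S]*[a-z])(?=[\s\S]*[_\W])|(?=[\s\S]*\d)(?=[\s\S]*[_\W]))/i;

// Hypothetical wrapper: returns true or false instead of showing an alert.
function checkPw(pw) {
  return pw_regexp.test(pw);
}

console.log(checkPw("abcdefg1"));    // true: 8 characters, letters + a digit
console.log(checkPw("hello_world")); // true: letters + an underscore
console.log(checkPw("1234567!"));    // true: digits + a nonalphanumeric character
console.log(checkPw("abcdefgh"));    // false: only one character class
console.log(checkPw("12345678"));    // false: only one character class
console.log(checkPw("abc 1"));       // false: fewer than 8 characters
```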

On my computer, positive lookaheads are supported by Netscape 7.02 but not by MSIE 5.1.6, which promptly throws an "Unexpected quantifier" compilation error when it hits the ? character in the first lookahead.

Giving credit where credit is due, my pw_regexp pattern borrows from some of the password patterns posted at RegExLib.com - here is a typical example:

var J_Samuel = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{4,8}$/;

As shown above, these patterns contain lookaheads using a period to represent a generic character; my tests of the patterns at the RegExLib.com site were successful but I couldn't get them to work in a SimpleText file on my hard disk until I substituted [\s\S]'s for the periods. (FYI: the use of 's for forming plurals in isolated cases is discussed here.)
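
For reference, here is a small sketch of what the J_Samuel pattern accepts and rejects, together with the one difference between the period and [\s\S] noted elsewhere in this blog (the period does not match a newline). The sample strings are made up, and the sketch makes no claim to explain the SimpleText behavior described above.

```javascript
// Pattern copied verbatim from above: 4-8 characters, with at least one digit,
// one lowercase letter, and one uppercase letter, and no whitespace.
var J_Samuel = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{4,8}$/;

console.log(J_Samuel.test("Passw0rd"));  // true
console.log(J_Samuel.test("password"));  // false: no digit, no uppercase letter
console.log(J_Samuel.test("Pa s0"));     // false: contains a space
console.log(J_Samuel.test("Passw0rd1")); // false: more than 8 characters

// The period does not match a newline; [\s\S] matches any single character.
console.log(/^.$/.test("\n"));      // false
console.log(/^[\s\S]$/.test("\n")); // true
```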

As a final aside, I sent to FirstGov.gov, "The U.S. Government's Official Web Portal," an email asking, "Has the federal government ever issued official recommendations for choosing strong passwords for computer accounts?" I was directed to this page hosted by the United States Computer Emergency Readiness Team (but originating from Carnegie Mellon University).

OK, that'll do it for our discussion of data validation, at least for the time being. In the next post, we'll return to the HTML Goodies JavaScript Primers series and its final Primer #30.

reptile7

Saturday, September 09, 2006
 
So, You Want to Validate an Email Address, Huh?
Blog Entry #50

The following code for validating an email address appears in HTML Goodies' "JavaScript Basics Part 3" primer:

<script type="text/javascript">
function validateForm( ) {
    var email = document.forms.tutform.elements.email.value;
    /* the line above can alternately be written as:
    var email = document.tutform.email.value; */
    if (!/^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/.test(email)) {
        alert("Please enter a valid e-mail address.");
        return false;
    }
    return true;
}
</script>
<form name="tutform" onsubmit="return validateForm( );">
Email Address: <input name="email"><p>
<input type="submit" value="Submit Form">
<input type="reset" value="Reset Form">
</form>

The user enters an email address into the email field and clicks the "Submit Form" button, triggering the validateForm( ) function. The value of the email field is assigned to the identifier email, which is then compared via the test( ) method of the RegExp object (discussed in the previous post) to the following regexp pattern:

/^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/

Before we dissect this thing, we should note that a majority of email addresses conform to the following 'anatomy':

local-part@second-level-domain.top-level-domain

It is, of course, not at all unusual for an email address to have more than two domain labels to the right of the commercial at (@) separator; I myself once had a reptile7@mailhost.tcs.tulane.edu email address. The same goes for addresses under a country-code top-level domain, e.g., feedback@uq.edu.au, in which au is the top-level domain and edu is a second-level domain beneath it.

Wikipedia details here character limitations of the various parts of an email address. In brief:
(1) The local-part can contain letters, numbers, and the following characters:
! # $ % & ' * + - / = ? ^ _ ` { | } ~
Internal and nonconsecutive periods are also allowed, e.g., x.y@some-domain.com and x.y.z@some-domain.com are OK, but .xy.@some-domain.com and x..y@some-domain.com are not.
In general, the following printable ASCII characters are not usable in the local-part (there are special situations that allow them, however - we'll see one later):
" ( ) , : ; < > @ [ \ ]
Spaces via the space bar are also not OK.
(2) The domains after the @ symbol can contain letters, numbers, and internal hyphens.
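
The local-part rules in (1) can be sketched as a regexp of their own. This is only an illustration of the character list and the period rule summarized above, not a full implementation of any standard; the names atom and localPart are hypothetical, and the "special situations" mentioned above are deliberately ignored.

```javascript
// One or more of the allowed non-period characters listed in (1).
var atom = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";

// Runs of allowed characters separated by single periods: periods must be
// internal and nonconsecutive.
var localPart = new RegExp("^" + atom + "(\\." + atom + ")*$", "i");

console.log(localPart.test("x.y"));       // true
console.log(localPart.test("x.y.z"));     // true
console.log(localPart.test(".xy."));      // false: leading and trailing periods
console.log(localPart.test("x..y"));      // false: consecutive periods
console.log(localPart.test("joe smith")); // false: contains a space
```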

We're ready now to go through the above regexp pattern from left to right.
• After the ^ start-of-string anchor, [a-zA-Z] matches a single letter character, lowercase or uppercase. (Note that the pattern does not have an i flag.)
• [\w\.-]* matches zero or more alphanumeric characters, underscores, periods, or hyphens; \w is equivalent to [a-zA-Z0-9_]. The backslash preceding the period here is unnecessary (see the update towards the end of Blog Entry #48).
• [a-zA-Z0-9] matches a single alphanumeric character; this concludes the local-part of the email address.
• Next is the @ separator.
• [a-zA-Z0-9] matches a single alphanumeric character.
• [\w\.-]* matches zero or more alphanumeric characters, underscores, periods, or hyphens.
• [a-zA-Z0-9] matches a single alphanumeric character; this concludes the second-level-domain part of the email address.
• \. matches the period separating the second-level domain from the top-level domain.
• [a-zA-Z] matches a single letter character.
• [a-zA-Z\.]* matches zero or more letter characters or periods.
• [a-zA-Z] matches a single letter character; the pattern ends with the $ end-of-string anchor.

At this point, we can rewrite the pattern in a shorter form:

var email_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

Clearly, there are a great many valid email addresses that the pattern will match but there are others it won't; for example, the pattern mandates that the first character be a letter and thus would not match 123joe@some-domain.com. Moreover, there are invalid email addresses that the pattern will match, e.g., joe...burns@some-domain.com.
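
As a quick check of the claims above, the following sketch runs the original pattern and the shortened email_regexp pattern against the addresses mentioned in this post. The variable name original_regexp and the samples array are assumptions made for the demonstration.

```javascript
// The pattern as it appears in the HTML Goodies script...
var original_regexp = /^[a-zA-Z][\w\.-]*[a-zA-Z0-9]@[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z]$/;
// ...and the shortened form given above.
var email_regexp = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

var samples = [
  "reptile7@mailhost.tcs.tulane.edu", // matched
  "feedback@uq.edu.au",               // matched
  "123joe@some-domain.com",           // not matched: first character is a digit
  "joe...burns@some-domain.com"       // matched, although the address is invalid
];

for (var i = 0; i < samples.length; i++) {
  // Print the address followed by the result from each pattern.
  console.log(samples[i], original_regexp.test(samples[i]), email_regexp.test(samples[i]));
}
```

For each of these samples, the two patterns return the same result.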

On the plus side, the [\w.-]* section of the pattern to the right of the @ symbol gives the pattern the flexibility to match email addresses with multiple domains. Both the reptile7@mailhost.tcs.tulane.edu and feedback@uq.edu.au email addresses given above are matched by the pattern, for example:

[a-z] matches r
[\w.-]* matches eptile
[a-z\d] matches 7
@ matches @
[a-z\d] matches m
[\w.-]* matches ailhost.tcs.tulan
[a-z\d] matches e
\. matches .
[a-z] matches e
[a-z.]* matches d
[a-z] matches u

You may be wondering why the post-@ [\w.-]* section matches ailhost.tcs.tulan and not ailhos; this is because the * quantifier is "greedy" and thus returns "the leftmost longest match," as explained in detail here: it first grabs as many characters as it can and then gives characters back only until the rest of the pattern can match. In fact, the "greediness" of the * means that the period in the pattern's penultimate [a-z.]* section is not needed for matching multi-domain addresses such as these - you can verify this for yourself by applying the pattern, in the manner above, to the feedback@uq.edu.au email address.
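
One way to see which part of an address each section of the pattern ends up matching is to wrap every section in parentheses and inspect the captured groups. This is a sketch for illustration only: the parenthesized pattern sections and the helper function splitAddress are assumptions, not part of the HTML Goodies script.

```javascript
// email_regexp with each of its nine sections wrapped in a capturing group.
var sections = /^([a-z])([\w.-]*)([a-z\d])@([a-z\d])([\w.-]*)([a-z\d])\.([a-z])([a-z.]*)([a-z])$/i;

// Hypothetical helper: returns the nine captured pieces as a JSON string,
// or "null" if the address does not match.
function splitAddress(address) {
  var m = sections.exec(address);
  return JSON.stringify(m ? m.slice(1) : null);
}

console.log(splitAddress("reptile7@mailhost.tcs.tulane.edu"));
// ["r","eptile","7","m","ailhost.tcs.tulan","e","e","d","u"]

console.log(splitAddress("feedback@uq.edu.au"));
// ["f","eedbac","k","u","q.ed","u","a","","u"]
```

In the second result, the [a-z.]* section captures an empty string: the greedy post-@ [\w.-]* section has already absorbed the earlier period.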

Before we move on, let's finish our deconstruction of the script and its validateForm( ) function. If email and email_regexp don't match, then email_regexp.test(email) returns false and thus the if condition, !email_regexp.test(email), returns true; a "Please enter a valid e-mail address" alert( ) pops up and false is returned to the onSubmit event handler in the <form> tag:

<form name="tutform" onsubmit="return validateForm( );"> becomes
<form name="tutform" onsubmit="return false;">

this cancels the submit event, i.e., the user's input is not sent to the form's processing agent. (We learned in Blog Entry #47 that click events are also cancelable via return false statements.)

If email and email_regexp do match, then email_regexp.test(email) returns true and thus the if condition returns false; in this case, validateForm( ) returns true to the onSubmit function call and the form is submitted.

Other email address regexp patterns

Regular-Expressions.info

Jan Goyvaerts, the architect of http://www.regular-expressions.info/, has posted a carefully-thought-out essay on the validation of email addresses here in which he offers the following regexp pattern for validating an email address:

/^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

For the local-part:
• [A-Z0-9._%-]+ matches one or more letters, numbers, periods, underscores, percent signs, or hyphens.
For the domain name:
• [A-Z0-9.-]+ matches one or more letters, numbers, periods, or hyphens. Like the *, the + quantifier is "greedy"; consequently, the combination of the period and the + again allows the pattern to match email addresses with multiple domains.
• For the top-level domain, [A-Z]{2,4} matches a minimum of two letters and a maximum of four letters (e.g., .ca, .gov, .name).

Mr. Goyvaerts alleges that his regexp pattern "matches 99% of the email addresses in use today," although, as with the HTML Goodies regexp pattern discussed above, there are valid email addresses that his pattern won't match. Anticipating just such an objection, he says, "[A] regexp to match truly any possible email address is not only hideously complex, it's also totally useless." Complementarily, he also warns, "Don't go overboard in trying to eliminate invalid email addresses with your regular expression." His article concludes with a humongously long regexp pattern for validating email addresses that conform to the RFC 822 standard, which, Wikipedia notes, "was obsoleted in April 2001 by RFC 2822."

Mr. Goyvaerts' pattern does match reptile7_@excite.com, my longest extant email address, which is not matched by the HTML Goodies pattern.
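
The comparison can be reproduced with a short sketch. Both patterns are copied from above; the variable names goyvaerts and goodies are hypothetical, and the .museum sample is an extra illustration of the {2,4} limit on the top-level domain.

```javascript
// Mr. Goyvaerts' pattern and the shortened HTML Goodies pattern, as given above.
var goyvaerts = /^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i;
var goodies = /^[a-z][\w.-]*[a-z\d]@[a-z\d][\w.-]*[a-z\d]\.[a-z][a-z.]*[a-z]$/i;

// The local-part ends with an underscore, which the HTML Goodies pattern
// does not allow immediately before the @ separator.
console.log(goyvaerts.test("reptile7_@excite.com")); // true
console.log(goodies.test("reptile7_@excite.com"));   // false

// A top-level domain longer than four letters falls outside {2,4}.
console.log(goyvaerts.test("joe@example.museum"));   // false
```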

WebReference.com

The 10th part of WebReference.com's "Pattern Matching and Regular Expressions" tutorial addresses email address validation and offers the following regexp pattern for validating an email address:

var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

Comments:
• As noted in Blog Entry #48, a period outside of square brackets matches any character except a newline (\n) character.
• The @ symbol is not a regexp metacharacter and does not need to be escaped with a backslash.
For the domain name:
• Neither the hyphen nor the period in the [a-zA-Z0-9\-\.]+ section needs to be escaped with a backslash.
• The [a-zA-Z]{2,3} section leaves four-letter top-level domains (.aero, .arpa, .coop...) out in the cold. Brrrr!

The reg2 pattern is designed to match email addresses ending in IP addresses as well as those that end in domain names, and contains three parenthesized clauses towards this end:
(\[?)
([a-zA-Z]{2,3}|[0-9]{1,3})
(\]?)
In this regard, however, the "Restrictions on email addresses" section of the RFC 3696 standard notes, "The domain name can also be replaced by an IP address in square brackets, but that form is strongly discouraged except for testing and troubleshooting purposes."

In regular expressions, the opening and closing parentheses are metacharacters that are used to create groupings for various reasons; for example, they can serve as containers for | OR alternatives, as in the ([a-zA-Z]{2,3}|[0-9]{1,3}) clause above, or they can be added to a regexp pattern simply to improve its readability (this seems to be their purpose in the (\[?) and (\]?) clauses). The 5th part of the WebReference.com tutorial notes that parentheses can also be used to "remember" for subsequent use parts of a string matched by a regexp pattern.

(BTW, the closing square bracket ] is not a regexp metacharacter and does not need to be escaped with a backslash.)
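
Here is a sketch of how the reg2 pattern behaves with domain names, IP addresses, and the "remembered" parenthesized parts. The pattern is copied verbatim from above; the helper function groupsOf and the sample addresses are made up for illustration.

```javascript
var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

// Hypothetical helper: returns the three remembered substrings as a JSON
// string, or "null" if the address does not match.
function groupsOf(address) {
  var m = reg2.exec(address);
  return JSON.stringify(m ? m.slice(1) : null);
}

console.log(groupsOf("joe@example.com"));   // ["","com",""]
console.log(groupsOf("joe@[192.168.0.1]")); // ["[","1","]"]
console.log(groupsOf("joe@example.info"));  // null: four-letter top-level domain

// The two bracket clauses are independent of each other, so an unbalanced
// bracket is also accepted.
console.log(groupsOf("joe@[example.com"));  // ["[","com",""]
```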

The reg2 pattern is augmented with a second pattern that matches invalid email addresses:

var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/;

• The first clause, (@.*@), catches email addresses with two @ symbols. Playing the Devil's advocate, however, I see from the RFC 3696 standard that Abc\@def@example.com is a valid email address.
• The second clause, (\.\.), catches email addresses with two (or more) consecutive periods.
• The third and fourth clauses respectively catch email addresses with periods immediately following and preceding the @ separator.
• The fifth clause catches email addresses that begin with a period.
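
The five clauses can be tried out one at a time with the sketch below. The function looksValid shows one plausible way of combining reg1 with reg2; it is a hypothetical helper and is not taken from the WebReference.com code.

```javascript
var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/;
var reg2 = /^.+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?)$/;

// Hypothetical combination: reject anything reg1 flags, then require reg2.
function looksValid(address) {
  return !reg1.test(address) && reg2.test(address);
}

console.log(reg1.test("a@b@c.com"));            // true: two @ symbols
console.log(reg1.test("x..y@some-domain.com")); // true: consecutive periods
console.log(reg1.test("joe@.example.com"));     // true: period right after the @
console.log(reg1.test("joe.@example.com"));     // true: period right before the @
console.log(reg1.test(".joe@example.com"));     // true: leading period

console.log(looksValid("joe@example.com"));     // true

// The RFC 3696 example discussed above is flagged by the first clause.
console.log(looksValid("Abc\\@def@example.com")); // false
```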

I leave it to you to deconstruct the rest of WebReference.com's code.

Obviously, all of the regular expressions above for validating email addresses leave at least some room for improvement. I'm not going to offer you my own email address regexp pattern but will throw in my lot with Mr. Goyvaerts' pattern because of its simplicity.

RegExLib.com maintains here a site with a large number of email address regexp patterns.

The next post will conclude our discussion of regular expressions with a brief look at password validation.

reptile7

