reptile7's JavaScript blog
Thursday, August 11, 2011
 
The RegExp True Connection
Blog Entry #223

Today's post will wrap up our discussion of HTML Goodies' "Checking Your Work: Validating Input -- Getting Started" and "Checking Your Work: Validating Input -- The Cool Stuff" tutorials and their validations of field values for a guestbook form.

Computer OK

In the previous entry we examined the "Cool Stuff" tutorial's validation of the guestbook form's First Name and Last Name fields via a noNumbersExpression regexp pattern that flags name inputs containing any digit or symbol characters. Not quite complementarily, the "Cool Stuff" tutorial also provides a numbersOnlyExpression regexp pattern that can flag inputs that do not consist solely of digits and is used to validate the form's How many years have you been using the Internet? field.

var numbersOnlyExpression = /^[0-9]+$/;
// Validate years is a whole number
if (!numbersOnlyExpression.test(document.getElementById("TextInternetYears").value)) {
    validationMessage += " - Years using Internet must be a whole number\n";
    valid = false;
}


As we are only interested in a true/false comparison between the numbersOnlyExpression pattern and the years input, the above code employs a regexpObject.test( )-based conditional vis-à-vis the tutorial's stringObject.match( )-based conditional.
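To make the test( )/match( ) distinction concrete, here is a minimal sketch; the yearsInput value is a made-up stand-in for the field's value and is not part of the tutorial code.

```javascript
var numbersOnlyExpression = /^[0-9]+$/;
var yearsInput = "12"; // hypothetical field value, for illustration only

// test( ) hands back a plain boolean, which is all a conditional needs
var testResult = numbersOnlyExpression.test(yearsInput);

// match( ) hands back an array of match data, or null if there is no match
var matchResult = yearsInput.match(numbersOnlyExpression);
var noMatchResult = "hi".match(numbersOnlyExpression);
```

Both approaches work in an if condition because a match array is truthy and null is falsy; test( ) just says what we mean more directly.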

The ^[0-9]+$ pattern matches a string that, from start to finish, comprises one or more digit characters, and can be shorthanded to ^\d+$.
• The ^ and $ anchors tie the match to the start and the end of the string, respectively.
• The + operator matches one or more occurrences of the token that precedes it.
• Square brackets delimit a regexp character class; [0-9] matches any single digit.

If the user types hi into the How many years ... field, then that input will be intercepted by the numbersOnlyExpression conditional. Per its // Validate years is a whole number comment, the conditional will also flag a floating-point number input such as 5.5. On the minus side, the conditional will accept a number like 500, which is not exactly a meaningful input given that the Internet didn't exist 500 years ago.
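The behavior described above can be checked with a short loop. This is only a sketch: the sampleInputs array and the results object are my own illustrative names.

```javascript
var numbersOnlyExpression = /^[0-9]+$/;

// Illustrative inputs: a word, a floating-point number, an implausibly
// large number, a sensible number, and an empty string
var sampleInputs = ["hi", "5.5", "500", "12", ""];
var results = {};
for (var i = 0; i < sampleInputs.length; i++) {
    results[sampleInputs[i]] = numbersOnlyExpression.test(sampleInputs[i]);
}
// "hi", "5.5", and "" fail the pattern; "500" and "12" pass it
```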

So we need a new-and-improved numbersOnlyExpression pattern that puts a reasonable cap on the How many years ... value. A bit arbitrarily - granting that there is probably some disagreement as to when "the Internet" effectively began - I am going to set that cap to 25 years because 1986 was the year that the Internet Engineering Task Force came into being. Accordingly, here's the validation pattern I would use:

var digits0to25Expression = /^(1?\d|2[0-5])$/;

The 1?\d subpattern handles the 0-19 range of inputs; the ? operator optionalizes the tens-place 1 so that the \d matches the ones-place digit for both the 0-9 and 10-19 ranges. In turn, the 2[0-5] subpattern handles the 20-25 range of inputs. The subpatterns are separated by a vertical bar, which serves as a boolean OR operator. The subpattern composite must be wrapped in parentheses because the ^ and $ anchors as operators have a higher precedence than does the | operator; sans parentheses ^1?\d|2[0-5]$ would match strings such as 19apples and oranges20.
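Here is a small sketch that puts the grouped pattern side by side with its parentheses-less counterpart; ungroupedExpression is a name I've made up for the comparison.

```javascript
var digits0to25Expression = /^(1?\d|2[0-5])$/;
// Without the parentheses, ^ belongs only to the first alternative
// and $ belongs only to the second alternative
var ungroupedExpression = /^1?\d|2[0-5]$/;

var groupedResults = {
    "0": digits0to25Expression.test("0"),                 // true
    "19": digits0to25Expression.test("19"),               // true
    "25": digits0to25Expression.test("25"),               // true
    "26": digits0to25Expression.test("26"),               // false
    "19apples": digits0to25Expression.test("19apples"),   // false
    "oranges20": digits0to25Expression.test("oranges20")  // false
};

var ungroupedResults = {
    "19apples": ungroupedExpression.test("19apples"),     // true
    "oranges20": ungroupedExpression.test("oranges20")    // true
};
```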

(The original numbersOnlyExpression allows for numbers with leading zeroes whereas my digits0to25Expression does not. But the leading zero thing now strikes me as a non-issue. I mean, why would users prepend zeroes to their inputs unless specifically directed to do so? And it's not as though leading zeroes invalidate the digits that follow them - at worst they make a number look a bit weird.)

Hangin' on the telephone

Let's move on to the harder stuff. The "Cool Stuff" tutorial vets the guestbook form's Telephone field via a telephoneExpression regexp pattern:

var telephoneExpression = /^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$/;

The telephoneExpression pattern accommodates the following telephone number formats:
(123) 456-7890
(123)456-7890
123-456-7890
456-7890

The \d{3}-\d{4} subpattern preceding the $ end-of-string anchor matches three digits followed by a hyphen followed by four digits, and is for the local part of the phone number.

The ((\(\d{3}\) ?)|(\d{3}-))? subpattern is for the area code part of the phone number; per the concluding ? operator, this part of the match is optional. The area code subpattern can be divided into two parts:

(\d{3}-) matches three digits followed by a hyphen.

(\(\d{3}\) ?) matches a ( left parenthesis followed by three digits followed by a ) right parenthesis followed by an optional space. The left parenthesis and the right parenthesis are both regexp metacharacters and must be literalized via a preceding \ backslash operator. A space in a regexp pattern actually matches a literal space character and is not just there to separate other regexp tokens.

The outer parentheses are unnecessary in both cases although they do improve the subpattern's readability a bit.
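As a quick check on the above, the following sketch runs the tutorial's pattern and a version without the inner parentheses (trimmedExpression, my own name) against the same inputs; the phoneSamples array is illustrative.

```javascript
var telephoneExpression = /^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$/;
// Same pattern minus the parentheses around each area code alternative
var trimmedExpression = /^(\(\d{3}\) ?|\d{3}-)?\d{3}-\d{4}$/;

var phoneSamples = [
    "(123) 456-7890",   // accepted
    "(123)456-7890",    // accepted
    "123-456-7890",     // accepted
    "456-7890",         // accepted
    "123 456-7890",     // rejected: space instead of hyphen after the area code
    "(123)-456-7890",   // rejected: hyphen after the right parenthesis
    "1234567890"        // rejected: no separators at all
];

var phoneResults = [];
var patternsAgree = true;
for (var i = 0; i < phoneSamples.length; i++) {
    var original = telephoneExpression.test(phoneSamples[i]);
    var trimmed = trimmedExpression.test(phoneSamples[i]);
    phoneResults.push(original);
    if (original !== trimmed) { patternsAgree = false; }
}
```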

Instead of using one field to hold the user's phone number, it would be better to offer a three-field arrangement that moves focus from field to field as the phone number is filled in; we previously coded just such an arrangement in the Demo section of Blog Entry #173. If we stick with a single Telephone field, then we should indicate on the form the specific phone number format(s) that we want (the user shouldn't have to guess in this regard), for example:

Telephone (123-456-7890):

One format is good enough for me, and the 123-456-7890 format, which would allow us to simplify the telephoneExpression pattern to ^\d{3}-\d{3}-\d{4}$, is as good as any of them.
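The simplified pattern would also serve a three-field arrangement if the three values were joined with hyphens before testing. The sketch below assumes as much; joinPhoneParts( ) is a hypothetical helper and is not taken from the tutorial or from Blog Entry #173.

```javascript
var simpleTelephoneExpression = /^\d{3}-\d{3}-\d{4}$/;

// Hypothetical helper: joins the values of three separate fields
// into the 123-456-7890 format
function joinPhoneParts(areaCode, exchange, lineNumber) {
    return [areaCode, exchange, lineNumber].join("-");
}

var joined = joinPhoneParts("123", "456", "7890");
var joinedIsValid = simpleTelephoneExpression.test(joined);
var shortIsValid = simpleTelephoneExpression.test(joinPhoneParts("12", "456", "7890"));
var parenIsValid = simpleTelephoneExpression.test("(123) 456-7890");
```

Only the first of the three tests passes; a two-digit area code and the parenthesized format are both turned away.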

Email my

The "Cool Stuff" tutorial lastly provides an emailExpression regexp pattern for vetting the guestbook form's Email field.

var emailExpression = /^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-z0-9]{2,4}$/;

Crafting a regular expression for validating an email address is necessarily an exercise in compromise; as a matter of course a practicable email regexp pattern will fail to match some valid email addresses and will match some invalid email addresses, and the emailExpression pattern is no exception. With that caveat out of the way, here's what we've got in the emailExpression pattern:

• In JavaScript, the \w regexp token is a shorthand for the [a-zA-Z0-9_] character class. [\w\-\.\+]+ matches one or more letters case-insensitively, digits, underscores, hyphens, periods, and/or plus signs, and is for the local part of an email address. As part of a character class, the period and the plus sign do not need to be literalized with a backslash - see the Metacharacters Inside Character Classes section of the aforelinked Regular-Expressions.info "Character Classes" page - and this is also true for a hyphen that cannot be read as a range operator; consequently, the local-part subpattern can be written as [\w.+-]+, with the hyphen placed last so that it is unambiguously literal.

• \@ matches the @ symbol that separates the address's left-hand local part and its right-hand domains. Even outside of a character class, @ is not a regexp metacharacter and does not need to be literalized with a backslash.

• [a-zA-Z0-9\.\-]+ matches one or more letters case-insensitively, digits, periods, and/or hyphens (the backslash operators are again unnecessary), and takes care of everything between the @ separator and the period that precedes the address's final/top-level domain; for example, this subpattern would match the uq.edu part of feedback@uq.edu.au.

• \. matches the period that precedes the address's final/top-level domain (the backslash is needed here).

• [a-zA-z0-9]{2,4} matches two-to-four letters case-insensitively and/or digits, and is for the address's final/top-level domain. Note that the class's second range is written A-z, not A-Z; in addition to the letters, an A-z range spans the six ASCII characters that sit between Z and a ([, \, ], ^, _, and `), so the subpattern as written would also let those characters through. The range was presumably meant to be A-Z.
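The breakdown above can be exercised with a few sample addresses. In this sketch, correctedExpression is my own variant with the backslashes trimmed and the final range written as A-Z; the test addresses other than feedback@uq.edu.au are made up.

```javascript
var emailExpression = /^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-z0-9]{2,4}$/;
// Assumed correction: A-Z in place of A-z in the top-level domain class
var correctedExpression = /^[\w.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4}$/;

var emailResults = {
    multiDomain: emailExpression.test("feedback@uq.edu.au"),          // true
    plusSign: emailExpression.test("first.last+tag@example.com"),     // true
    noAtSign: emailExpression.test("no-at-sign.example.com"),         // false
    longTld: emailExpression.test("a@b.museum"),                      // false: six-letter TLD
    // An A-z range also admits the underscore, which sits between Z and a in ASCII
    underscoreTld: emailExpression.test("a@b.c_m"),                   // true
    underscoreTldCorrected: correctedExpression.test("a@b.c_m")       // false
};
```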

Commenter kburger raises two important issues regarding the [a-zA-z0-9]{2,4} subpattern:
If I'm reading it correctly it expects a TLD of between 2 and 4 characters. I used to use something similar to that until they allowed a couple (maybe more?) of new TLD's: .travel .museum And then it also looks like your regex would accept TLD's containing numbers. I don't think that's allowed, is it?
As of this writing:
(1) .museum and .travel are the only longer-than-four-letters generic top-level domains; accommodating them is a simple matter of changing the {2,4} operator to {2,6}.
(2) Not counting the ASCII versions of internationalized country code top-level domains, it is true that there are no top-level domains that contain digits. According to the Restrictions on domain (DNS) names section of RFC 3696 ("Application Techniques for Checking and Transformation of Names"), however, a top-level domain name can include but not consist solely of digits (and thus an a@b.99 input, which would pass validation, is indeed "not allowed").
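Both points can be folded into the validation along the following lines. This is a sketch under my own assumptions: the emailExpression here is a case-insensitive {2,6} variant, and allDigitTld and hasAcceptableTld( ) are hypothetical names.

```javascript
// {2,6} makes room for .museum and .travel
var emailExpression = /^[\w+%.-]+@[a-z\d.-]+\.[a-z\d]{2,6}$/i;
// Hypothetical extra check: flags a final domain that consists solely of digits
var allDigitTld = /\.\d+$/;

function hasAcceptableTld(address) {
    return emailExpression.test(address) && !allDigitTld.test(address);
}

var museumOk = hasAcceptableTld("curator@example.museum"); // true
var travelOk = hasAcceptableTld("agent@example.travel");   // true
var mixedOk = hasAcceptableTld("a@b.c9");                   // true: digits plus a letter
var digitsOk = hasAcceptableTld("a@b.99");                  // false: digits only
```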

Imperfect as it may be, the emailExpression pattern does match the overwhelming majority of valid email addresses out there. The emailExpression pattern is very similar to the /^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i pattern that Jan Goyvaerts of Regular-Expressions.info uses to validate email addresses; the latter pattern requires an alphabetic final/top-level domain and augments the local-part [\w.-] character class with a percent sign vis-à-vis a plus sign, but otherwise the two patterns are essentially identical. I'm not sure I've ever seen a + or a % in an email address but maybe you have, and we should probably put both of them in the emailExpression local part to be on the safe side.

The "A Feedback Form" page of WebReference.com's "JavaScript Regular Expressions" tutorial offers a reg1 regexp pattern that can be used to fend off several types of ordinarily invalid email addresses (those containing two or more @ characters, those with consecutive periods, etc.):

var emailExpression = /^[\w+%.-]+@[a-z\d.-]+\.[a-z\d]{2,6}$/i;
var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/; // not valid
// Validate email formatted properly
var emailValue = document.getElementById("TextEmail").value;
if (!emailExpression.test(emailValue) || reg1.test(emailValue)) {
    validationMessage += " - Your email address appears invalid\n";
    valid = false;
}
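To see what the reg1 pattern adds, here is the same two-pattern test wrapped in a function that can be run outside of the form; isPlausibleEmail( ) is a hypothetical name and the sample addresses are made up.

```javascript
var emailExpression = /^[\w+%.-]+@[a-z\d.-]+\.[a-z\d]{2,6}$/i;
var reg1 = /(@.*@)|(\.\.)|(@\.)|(\.@)|(^\.)/; // not valid

// Hypothetical wrapper: an address passes if it matches emailExpression
// and does not match any of the reg1 alternatives
function isPlausibleEmail(emailValue) {
    return emailExpression.test(emailValue) && !reg1.test(emailValue);
}

var plausibility = {
    ordinary: isPlausibleEmail("john.doe@example.com"),          // true
    twoAtSigns: isPlausibleEmail("a@b@example.com"),             // false
    consecutivePeriods: isPlausibleEmail("john..doe@example.com"), // false
    periodBeforeAt: isPlausibleEmail("john.@example.com"),       // false
    periodAfterAt: isPlausibleEmail("john@.example.com"),        // false
    leadingPeriod: isPlausibleEmail(".john@example.com")         // false
};
```

Of these, the two-@ address is already stopped by emailExpression on its own; the other four rejections are reg1's doing.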


In the name of completeness

The Restrictions on email addresses section of RFC 3696 states:
In addition to restrictions on syntax, there is a length limit on email addresses. That limit is a maximum of 64 characters (octets) in the "local part" (before the "@") and a maximum of 255 characters (octets) in the domain part (after the "@") for a total length of 320 characters. Systems that handle email should be prepared to process addresses which are that long, even though they are rarely encountered.
It is left to you to incorporate local-part/domain/overall length testing into the preceding code should you feel the need to do so.
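For what it's worth, a length test might look something like the sketch below. The withinRfc3696Lengths( ) name is hypothetical, and the function counts JavaScript string characters, which equate to octets only for plain ASCII addresses.

```javascript
// Hypothetical helper: applies the 64-character local-part limit and the
// 255-character domain limit quoted above; an address that meets both
// cannot exceed 320 characters overall (64 + 1 + 255)
function withinRfc3696Lengths(address) {
    var atIndex = address.lastIndexOf("@");
    if (atIndex < 0) { return false; } // no @, so no local part/domain split
    var localPart = address.slice(0, atIndex);
    var domainPart = address.slice(atIndex + 1);
    return localPart.length <= 64 && domainPart.length <= 255;
}
```

The function could be ANDed with the emailExpression and reg1 tests; it does not check the address's syntax itself.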

More, more, more

The "Getting Started"/"Cool Stuff" guestbook form is rendered in the div below and is ready for your inputs. Clicking the button will trigger a revamped validateMyForm( ) function that uses the tests we have developed in the last three entries to vet values for the form's fields excepting the Country selection list, which is wired to a new !document.getElementById("SelectCountry").selectedIndex test that checks if the user has not selected a country. 'Blank' inputs for the text fields and the Country selection list will turn on "*Please provide ..." error messages next to those fields on the form. In all cases false is returned to the form's onsubmit event handler; your data set won't be sent to me or anyone else.
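A word on the selectedIndex test: it relies on 0 being the only falsy index, so it presumes that the list's first option is a "no selection" placeholder. The sketch below uses plain objects as stand-ins for the SelectCountry element; the function names and the stricter variant are my own assumptions and are not taken from the demo's code.

```javascript
// Hypothetical stand-ins for the select element in three states
var placeholderShowing = { selectedIndex: 0 };
var countryChosen = { selectedIndex: 3 };
var nothingSelected = { selectedIndex: -1 }; // no option selected at all

// Mirrors the !selectedIndex test: true only when selectedIndex is 0
function countryNotSelected(selectElement) {
    return !selectElement.selectedIndex;
}

// Assumed stricter variant that also treats -1 as "not selected"
function countryNotSelectedStrict(selectElement) {
    return selectElement.selectedIndex <= 0;
}

var looseResults = [
    countryNotSelected(placeholderShowing), // true
    countryNotSelected(countryChosen),      // false
    countryNotSelected(nothingSelected)     // false: -1 is truthy
];
var strictResults = [
    countryNotSelectedStrict(placeholderShowing), // true
    countryNotSelectedStrict(countryChosen),      // false
    countryNotSelectedStrict(nothingSelected)     // true
];
```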

First Name: *Please provide your first name.
Last Name: *Please provide your last name.
Telephone (123-456-7890): *Please provide your phone number.
Email: *Please provide your email address.
Country: *Please select a country.
How many years have you been using the Internet? *Please provide an integer in the range 0-25, inclusive.
Add to mailing list? Yes No
Favorite websites? HTMLGoodies.com WDVL.com Internet.com JavaScript.com None


In the following entry we'll take on the next Beyond HTML : JavaScript sector tutorial, "Making a Wizard with JavaScript".

reptile7
