Saturday, August 19, 2006
A Tourist's Guide to Regular Expressions
Blog Entry #48
In today's post, we venture into the wild and woolly world of regular expressions - a relatively advanced topic that would seem to separate a real programmer from an amateur such as myself - but there's nothing stopping us from giving it a go, is there now? A full-fledged presentation on regular expressions is definitely beyond the scope of this blog; instead, to keep things at a fairly basic level, we will apply some simple regular expressions to user inputs into the text fields of the Primer #29 Script, which we discussed in detail in Blog Entry #46, and I'll explain as best I can what I'm doing as we go along.
References
Various regular expressions resources can be found on the Web; here are my 'picks':
(1) If you're new to regular expressions, then a good starting point is JavaScript Kit's "Introductory Guide to Regular Expressions" tutorial.
(2) A comprehensive and more general (not limited to JavaScript) treatment of regular expressions appears at http://www.regular-expressions.info/.
Also deserving of mention:
(3) WebReference.com (which, like HTML Goodies, is part of the JupiterWeb 'empire') offers an 11-part "Pattern Matching and Regular Expressions" tutorial.
(4) Netscape provides an overview of regular expressions as well as a discussion of the core RegExp object. And to be up-to-date about all of this, you may want to check out Appendix B of Netscape's JavaScript 1.5 Core Reference, whose list of deprecated features pertains mostly to the RegExp object.
At HTML Goodies itself, the regular expressions topic briefly crops up in two articles - "Validating Special Numbers" and the "JavaScript Basics Part 3" primer - and that's about it, I'm sorry to say. There is no information on the RegExp object, its properties, or its methods on any of HTML Goodies' JavaScript References pages.
General remarks/syntax
So, what is a regular expression and what is it good for? To use a simplified analogy, applying a regular expression is a bit like doing a word search puzzle. We're going to search a string (as opposed to a grid of seemingly random letters) for one or more occurrences of a character pattern, which, if found, will appear horizontally left-to-right (not vertically, diagonally, or backwards) in the string. The character pattern, termed a regular expression (or regexp for short), can be wildly complex or as simple as a single letter; yes, as in a word search, the pattern can also be an ordinary word.
There are two syntaxes for creating a regular expression:
(1) A 'literal' syntax that delimits the pattern with forward slashes:
var regexp_name = /pattern/flags;
Optionally following the pattern are one or more modifiers termed flags. In JavaScript 1.5, there are three regular expression flags - i, g, and m. Most important are the i flag, which renders our pattern search case-insensitive, and the g flag, which allows us to globally match all occurrences (not just the first occurrence) of the pattern in the string.
(2) We can also use a new RegExp( ) constructor statement:
var regexp_name = new RegExp("pattern", "flags");
Note that both RegExp( ) arguments appear in quotes. (I was going to say at this point, "After all, regular expressions are themselves strings"; however, typeof regexp_name returns object.)
With respect to these two syntaxes, JavaScript Kit's "Programmer's Guide to Regular Expressions" tutorial says, "In almost all cases you can use either way to define a regular expression, and they will be handled in exactly the same way no matter how you declare them." I for my part will use the literal regular expression syntax in this post.
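As a quick check on the two syntaxes, here is a minimal sketch that builds the same pattern both ways and then tries out the i and g flags. The variable names and sample strings are my own inventions for illustration; \d (which matches a single digit) and the match( ) method of the String object are standard JavaScript.

```javascript
// Same pattern, two syntaxes. In the constructor form the pattern is a string,
// so each backslash must itself be escaped: "\\d" in the string becomes \d in the pattern.
var literal = /\d+/g;
var constructed = new RegExp("\\d+", "g");

console.log(literal.source === constructed.source);  // true
console.log(constructed.global);                      // true

// Without the g flag, match( ) reports only the first occurrence;
// with the g flag, match( ) returns an array of every occurrence.
var firstOnly = "a1b22c333".match(/\d+/);
var allDigits = "a1b22c333".match(/\d+/g);
console.log(firstOnly[0]);       // 1
console.log(allDigits.length);   // 3 (the matches are 1, 22, and 333)

// The i flag makes the comparison case-insensitive.
console.log("HELLO".search(/hello/));    // -1
console.log("HELLO".search(/hello/i));   // 0
```

The one practical difference worth remembering is the doubled backslash in the constructor's string argument.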
And how do we put these things into practice? As noted on the "String and Regular Expression Methods" page of the first JavaScript Kit tutorial linked above, there are four String object methods and two RegExp object methods for comparing a regular expression with a target string. Of these six methods, the one I will use most often in our discussion of the Primer #29 Script is the search( ) method of the String object. It works much the same way as the indexOf( ) method of the String object (discussed in Blog Entry #46), except that search( ) takes a regular expression argument whereas indexOf( ) takes a string argument:
string_object.search(regexp_pattern);
/* "If successful, search( ) returns the index of the regular expression inside the string. Otherwise, it returns -1," quoting Netscape. */
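To make the search( )/indexOf( ) comparison concrete, here is a small sketch; the target string is just a sample I picked for illustration.

```javascript
var target = "Enter First Name";

// indexOf( ) looks for an exact, case-sensitive substring.
console.log(target.indexOf("first"));   // -1
console.log(target.indexOf("First"));   // 6

// search( ) takes a regular expression, so flags and metacharacters are available.
console.log(target.search(/first/i));   // 6
console.log(target.search(/xyz/));      // -1
```

Both methods return a zero-based index on success and -1 on failure, which is why the validation functions below can compare the search( ) return with -1.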
Let's turn now to the input fields of the Primer #29 Script...
Your first name, please
<form name="dataentry">
Enter First Name:<br>
<input type="text" name="fn" onblur="validfn(this.value);">
We're ready to code a new-and-improved, regular-expression-based validfn( ) function. What do we put in it?
Don't leave it blank, part 3
If all you want to do is to ensure that a user doesn't leave the "Enter First Name" field blank - and as far as regular expressions go, this is setting the bar rather low, as we'll see in a bit - then this can be easily done with the following function:
function validfn(fnm) {
    var notblank = /[\s\S]/;
    if (fnm.search(notblank) == -1) {
        window.alert("First name is required.");
        document.dataentry.fn.focus( );
    }
}
Let's look at the regular expression that is assigned to the identifier notblank.
(You may want to follow along at JavaScript Kit's "Categories of Pattern Matching Characters" page.)
• \s matches any whitespace character (a space, a tab, a line break, etc.).
• \S matches any non-whitespace character.
Without the backslashes, s and S match themselves.
• The square brackets apply a logical OR to the matching process; [\s\S] matches either a single whitespace character or a single non-whitespace character. Without the square brackets, \s\S would match a whitespace character followed by a non-whitespace character.
In sum, the /[\s\S]/ pattern matches any single character, because all characters are either whitespace or non-whitespace, as the "Dot" page of http://www.regular-expressions.info/ points out.
We then search( ) fnm, the value (user input) of the "Enter First Name" field, for an occurrence of notblank (any character), and compare the return with -1 in the condition of the subsequent if statement:
if (fnm.search(notblank) == -1)
If the user has left the field blank, i.e., if no characters are present, then the if condition returns true; a "First name is required" alert( ) message pops up and focus is returned to the fn field.
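Here is a quick sketch of what the notblank search returns for a few inputs. The last pattern, hasVisibleChar, is a hypothetical stricter alternative of my own and is not part of the Primer #29 Script.

```javascript
var notblank = /[\s\S]/;

console.log("".search(notblank));      // -1: no characters at all, so the alert would fire
console.log("Joe".search(notblank));   // 0
console.log("   ".search(notblank));   // 0: three spaces still count as "not blank"

// Hypothetical stricter check: require at least one non-whitespace character.
var hasVisibleChar = /\S/;
console.log("   ".search(hasVisibleChar));    // -1
console.log(" Joe".search(hasVisibleChar));   // 1
```

In other words, notblank only guards against a completely empty field; an input consisting solely of spaces gets through.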
A regexp pattern for 'normal' names
I recognize that there is as much variation to first names as there is to language itself. First names can be multi-part ("Mary Ann"), be hyphenated ("Jean-Paul"), contain accents, umlauts, apostrophes, etc.; some of these situations are easily accommodated by a regular expression, others less so. We can begin by using a regular expression to keep nonletter characters out of the user's fn input:
function validfn(fnm) {
    var normalname = /^[a-z]+$/i;
    if (fnm.search(normalname) == -1) {
        window.alert("Please enter a proper name.");
        document.dataentry.fn.focus( );
    }
}
Let's look at the normalname pattern.
• Note the i flag after the second forward slash; the subsequent search( ) will be case-insensitive.
• The normalname pattern begins with a caret (^), which does not match itself but is a metacharacter that matches, to be precise, the dividing line between the first character of a string and the 'void' to the left of the string, according to the "Anchors" page of http://www.regular-expressions.info/. In other words, what follows the ^ must appear at the very beginning of a string. (If normalname also had an m (multiline) flag, then the ^ would also match the dividing line between a newline (\n) character and the following line.)
• [a-z] matches a single character in the 26-letter a-to-z character set; use of a hyphen between a and z allows us to specify a-to-z as a range of characters so we don't have to type out all 26 letters; in contrast, [az-] would match a single a, a single z, or a single -. Without the i flag, [a-z] would match a single lowercase a-to-z character.
• The plus sign (+) does not match itself but is another metacharacter that serves as a quantifier; + matches the preceding character or entity one or more times. [a-z]+ thus matches a string of ≥1 a-to-z characters.
• Finally, the $ does not match itself but is another metacharacter that matches the dividing line between the end of a string and the 'void' to the right of the string; in other words, what precedes the $ must appear at the very end of a string. (If normalname had an m flag, then the $ would also match the dividing line between a \n and the preceding line.)
As before, we then search( ) fnm for an occurrence of normalname and compare the return with -1 in the condition of the subsequent if statement. If the user has entered any nonletter characters (e.g., numbers, nonalphanumeric characters, whitespace), then the if condition returns true; a "Please enter a proper name" alert( ) message pops up and focus is returned to the fn field.
Note that without the ^ and $ metacharacters, a user input of #Joe% would match the normalname pattern.
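The effect of the ^ and $ anchors, and of the m flag, can be seen in the following sketch. The unanchored and multiline patterns and the sample inputs are mine, added only for comparison.

```javascript
var normalname = /^[a-z]+$/i;
var unanchored = /[a-z]+/i;   // same pattern minus ^ and $, for comparison only

console.log("Joe".search(normalname));        // 0
console.log("#Joe%".search(normalname));      // -1
console.log("#Joe%".search(unanchored));      // 1: the "Joe" inside the input is matched
console.log("Mary Ann".search(normalname));   // -1: a space is not a letter
console.log("".search(normalname));           // -1: + requires at least one letter

// With the m flag, ^ and $ also match at line breaks inside the string.
var multiline = /^ann$/im;
console.log("Joe\nAnn".search(/^ann$/i));    // -1
console.log("Joe\nAnn".search(multiline));   // 4
```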
Perhaps you are a stickler for capitalization; the following code will let sloppy users know, in no uncertain terms, that inputs beginning with lowercase letters are simply out of order:
var firstcap = /^[A-Z]/;
if (fnm.search(firstcap) == -1) {
    window.alert("The first letter of your name should be capitalized.");
    document.dataentry.fn.focus( );
}
Admittedly, the patterns above won't stop the user from entering Lkjhgf into the fn field; however, we can make sure that the user's input contains at least one vowel with the following code:
var vowel = /[aeiouy]/i;
if (fnm.search(vowel) == -1) {
    window.alert("There are no vowels in your name! Please enter a proper name.");
    document.dataentry.fn.focus( );
}
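If you want all three 'normal name' checks at once, they can be chained in a single function. What follows is only a sketch: checkFirstName is a hypothetical helper of my own, and it returns the message text (or an empty string if the input passes) instead of calling window.alert( ), so that it can be tried outside of a browser.

```javascript
// Hypothetical helper: runs the letters-only, capitalization, and vowel checks in order
// and returns the first applicable message, or "" if the input passes all three.
function checkFirstName(fnm) {
    var normalname = /^[a-z]+$/i;
    var firstcap = /^[A-Z]/;
    var vowel = /[aeiouy]/i;
    if (fnm.search(normalname) == -1) {
        return "Please enter a proper name.";
    }
    if (fnm.search(firstcap) == -1) {
        return "The first letter of your name should be capitalized.";
    }
    if (fnm.search(vowel) == -1) {
        return "There are no vowels in your name! Please enter a proper name.";
    }
    return "";
}

console.log(checkFirstName("Joe"));      // (empty string: all checks pass)
console.log(checkFirstName("Joe7"));     // Please enter a proper name.
console.log(checkFirstName("joe"));      // The first letter of your name should be capitalized.
console.log(checkFirstName("Lkjhgf"));   // There are no vowels in your name! Please enter a proper name.
```

In the Primer #29 Script itself, the returned message would be handed to window.alert( ) and focus returned to the fn field, as in the snippets above.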
A regexp pattern for not-so-normal names
The following pattern can handle a name with one space, hyphen, or apostrophe; it can also accommodate first-initial-middle-name-last-name people ("J. Paul Getty") who might want to enter just a first initial and a period:
var notsonormal = /^[a-z]*[\s\.'-]?[a-z]*$/i;
• The asterisk (*) does not match itself but is a metacharacter that serves as a quantifier; * matches the preceding character or entity zero or more times - this will allow the notsonormal pattern to match 'Aisha as well as Ze'ev.
• In a regular expression, a period ("dot") is ordinarily a metacharacter that matches any single character except a newline character; a preceding backslash "escapes" a period back to its literal identity, i.e., an actual period.
• The ? does not match itself but is a metacharacter that serves as a quantifier; ? matches the preceding character or entity zero or one time(s).
(Update: according to the "Use The Dot Sparingly" section of http://www.regular-expressions.info/dot.html, a period inside of square brackets is not a metacharacter and does not in this case need to be escaped with a backslash. I find on my computer that notsonormal will match A. but not A# with or without the backslash.)
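The following sketch runs notsonormal against a handful of sample inputs and repeats the test with an unescaped period inside the square brackets. The unescapedDot variable, the matches( ) helper, and the sample list are my own additions.

```javascript
var notsonormal = /^[a-z]*[\s\.'-]?[a-z]*$/i;
var unescapedDot = /^[a-z]*[\s.'-]?[a-z]*$/i;   // same pattern, period not escaped

// Hypothetical helper: true if the input matches the pattern, false otherwise.
function matches(pattern, input) {
    return input.search(pattern) != -1;
}

var samples = ["Mary Ann", "Jean-Paul", "Ze'ev", "'Aisha", "J.", "A.", "A#", "Abd-Al-Rahman"];
for (var i = 0; i < samples.length; i++) {
    // The two patterns agree on every sample:
    // true for the first six inputs, false for A# and Abd-Al-Rahman.
    console.log(samples[i], matches(notsonormal, samples[i]), matches(unescapedDot, samples[i]));
}

// Every part of the pattern is optional, so an empty input also matches.
console.log(matches(notsonormal, ""));   // true
```

Because an empty string matches, notsonormal would need to be paired with a check like notblank if the field is required.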
The notsonormal pattern will not match Abd-Al-Rahman; can you write a pattern that does?
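If you'd like to compare notes, here is one possible answer, offered as a sketch rather than as the definitive solution. The multipart pattern is my own; it uses a non-capturing group, (?: ), to let a separator-plus-letters unit repeat any number of times.

```javascript
// One possible pattern (an assumption, not from the Primer #29 Script):
//   '?                  an optional leading apostrophe
//   [a-z]+              one or more letters
//   (?:[\s'-][a-z]+)*   zero or more repeats of: a space, apostrophe, or hyphen, then letters
//   \.?                 an optional trailing period
var multipart = /^'?[a-z]+(?:[\s'-][a-z]+)*\.?$/i;

console.log("Abd-Al-Rahman".search(multipart));   // 0
console.log("Mary Ann".search(multipart));        // 0
console.log("Ze'ev".search(multipart));           // 0
console.log("'Aisha".search(multipart));          // 0
console.log("J.".search(multipart));              // 0
console.log("A#".search(multipart));              // -1
console.log("Jean--Paul".search(multipart));      // -1: separators can't be doubled
console.log("".search(multipart));                // -1: at least one letter is required
```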
And what if a Günther, a Søren, or a Thérèse visits your site? We certainly don't want to leave anyone out if at all possible. Names like these can be matched by putting the Unicode escape sequences for their special characters in your regexp pattern, e.g.:
var specialchar = /^[a-z]+[\u00fc\u00f8\u00e9][a-z]+[\u00e8]?[a-z]*$/i;
• \u00fc encodes ü
• \u00f8 encodes ø
• \u00e9 encodes é
• \u00e8 encodes è
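Here is a short sketch that tries the specialchar pattern on the three names above. The names are written with \u escapes in the test strings so that the result doesn't depend on the encoding of the script file; that choice is mine.

```javascript
var specialchar = /^[a-z]+[\u00fc\u00f8\u00e9][a-z]+[\u00e8]?[a-z]*$/i;

console.log("G\u00fcnther".search(specialchar));        // 0 (Günther)
console.log("S\u00f8ren".search(specialchar));          // 0 (Søren)
console.log("Th\u00e9r\u00e8se".search(specialchar));   // 0 (Thérèse)

// The pattern requires one of ü, ø, or é after the initial letters,
// so an unaccented name does not match it.
console.log("Joe".search(specialchar));                 // -1
```

Since specialchar rejects plain a-to-z names, it would have to be used alongside a pattern like normalname rather than in place of it.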
I think that's enough for this entry - we'll look at a regular expression pattern for a zip code in the next post.
^reptile7$