reptile7's JavaScript blog: Tag Free

Saturday, December 22, 2007

Tag Free
Blog Entry #98

In this post, we'll examine HTML Goodies' JavaScript Script Tips #79 and #80 and their script that removes the markup tags, and thus extracts the #PCDATA text, of a document or document fragment. The Script Tips #79-80 Script was originally designed to work with HTML document code but should also be applicable to code written in XML or in any other markup language employing SGML-type element start-tags and end-tags.

When I worked through it recently, the Script Tips #79-80 Script initially rubbed me the wrong way philosophically; I thought, "Man, the scripts we've discussed up to this point would generally be improved by more structural markup, not less." But it subsequently occurred to me that the script could be useful for removing the presentational markup that clogs many of these same scripts like artery plaques. Joe for his part states at the beginning of Script Tip #79 that he has "used [the script] to death", although he doesn't say specifically what he's used it for.

The Script Tips #79-80 Script can be accessed by following the "Here's the Code" links in both script tips and is reproduced below:

<html>
<head>
<title>The Script Tips #79-80 Script Demo Page</title>

<script language="javascript" type="text/javascript">

function DelHTML(HTMLWord)
{
a = HTMLWord.indexOf("<");
b = HTMLWord.indexOf(">");

HTMLlen = HTMLWord.length;

c = HTMLWord.substring(0, a);

if (b == -1)
b = a;

d = HTMLWord.substring((b + 1), HTMLlen);

Word = c + d;

tmp = Word.indexOf("<");

if (tmp != -1)
Word = DelHTML(Word);

return Word;
}

function doit( )
{
ToCheck = window.document.forms["Check"].elements["Input"].value;

Checked = DelHTML(ToCheck);

window.document.forms["Check"].elements["Output"].value = Checked;
}
</script>
</head>

<body bgcolor="#FFFFFF">

<form name="Check">
<textarea cols="50" name="Input" rows="6">
</textarea>
<input onclick="doit( );" type="button" value="Remove all HTML Tags" />
<textarea cols="50" name="Output" rows="6">
</textarea> 
</form>

</body>

</html>

The Check form

The Script Tips #79-80 Script's display is housed in a form named Check. As shown at the script's demo page, the Check form comprises three controls, in source order:
(1) a textarea box, named Input, into which the user inputs a document or document fragment;
(2) a Remove all HTML Tags button, which when clicked sends the user's input to the doit( ) function in the script's script element for tag removal; and
(3) a textarea box, named Output, to which doit( ) writes the tag-removed document or document fragment.

Validation notes
• The Check form does not have an action attribute, which has a #REQUIRED designation even in the HTML 4.01 Transitional DTD.
• XHTML 1.0 deprecates the name attribute of the form element.
To validate the script document against the XHTML 1.0 Strict DTD, then, recast the form element start-tag as <form action=""> and use document.forms[0] to reference the form in the doit( ) function. This assumes that you'd like to keep the form element container: modern browsers don't need it, but at least some older browsers won't render controls outside of a parent form.
• In the original script, all but one of the attributes of the Check form and its controls are unquoted; quote them for XHTML-compliance.

One more point before moving on. In Script Tip #79, Joe says, "Notice the order in which [the Check controls] are written to the page. Keep them in that order: box, button, box. It helps the script to run correctly." Actually, the associative formObject.elements["controlName"] references for the Input and Output boxes in the doit( ) function are unrelated to the positions of these controls within the Check form; consequently, an alternate control order - box, box, button, or perhaps button, box, box - wouldn't cause any problems vis-à-vis the script's execution. (However, had the script's author, "CompuH@cker", used ordinal formObject.elements[0] and formObject.elements[2] references for the Input and Output boxes, respectively, then Joe would be correct.)

Deconstructing the Script Tips #79-80 Script

We're ready to deconstruct the JavaScript part of the Script Tips #79-80 Script, which is for the most part quite straightforward. As a test case, let's suppose that we enter the string

<span style="color:brown;">Brown text</span> looks really cool on a Web page

into the Input box and then click the Remove all HTML Tags button, triggering the doit( ) function. (Contra Script Tip #80, the doit( ) function is not actually sitting inside of a larger function named DelHTML( ), which we'll get to shortly; rather, doit( ) is external to DelHTML( ).)

function doit( ) {
ToCheck = window.document.forms["Check"].elements["Input"].value;

Our test string is assigned to the variable ToCheck.

Checked = DelHTML(ToCheck);

This line calls the DelHTML( ) function, which precedes doit( ) in the script element, and passes ToCheck thereto. Later, DelHTML( )'s output will be assigned to the variable Checked.

function DelHTML(HTMLWord) {

ToCheck, the input string, is given a new identifier, HTMLWord.

a = HTMLWord.indexOf("<");
b = HTMLWord.indexOf(">");

The index of the first < character in the HTMLWord string, 0, is assigned to the variable a; similarly, the index of the first HTMLWord > character, 26, is assigned to the variable b.

HTMLlen = HTMLWord.length;

The HTMLWord length value, 76, is assigned to the variable HTMLlen.

c = HTMLWord.substring(0, a);

HTMLWord.substring(0, 0) returns an empty string, which is assigned to the variable c.

if (b == -1) b = a;

In Script Tip #80, Joe poses (but does not satisfactorily answer) a "What if there's no >?" question, to which the above if statement would pertain. We'll address this situation, and Joe's follow-up "What if there are no tags?" question, in the "> without <, and vice versa" section below. For now, the if condition returns false and the browser moves on to...

d = HTMLWord.substring((b + 1), HTMLlen);

HTMLWord.substring(27, 76) returns the string
Brown text</span> looks really cool on a Web page,
which is assigned to the variable d.

Word = c + d;

c and d are concatenated to give the string
Brown text</span> looks really cool on a Web page,
which is assigned to the variable Word.

tmp = Word.indexOf("<");

The index of the first < character in the Word string, 10, is assigned to the variable tmp.

if (tmp != -1)
[Word = ]DelHTML(Word); /* The part of this line in square brackets is unnecessary. */

The if condition above returns true, so DelHTML( ) is re-called and the Word string is passed thereto.
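The bracketed assignment can be dropped only because none of the script's variables is declared with var: Word is a single global shared by every call to DelHTML( ), and the innermost call has already stored the fully detagged string in it by the time the outer calls reach return Word. The sketch below imitates that behavior with an explicitly shared variable; sharedWord and stripShared are hypothetical names, not part of the original script.

```javascript
// Hypothetical stand-in for the script's implicit global Word.
let sharedWord;

function stripShared(input) {
  const a = input.indexOf("<");
  let b = input.indexOf(">");
  if (b === -1) b = a;
  sharedWord = input.substring(0, a) + input.substring(b + 1);
  if (sharedWord.indexOf("<") !== -1) {
    stripShared(sharedWord); // return value deliberately ignored
  }
  return sharedWord; // already overwritten by the deepest call
}

console.log(stripShared("<b>bold</b> text"));
```

Had the variables been declared locally, the assignment would be required, so keeping it is the safer habit.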

It is left to the reader to verify that on our second run through the DelHTML( ) function:
(1) the Word string is given the HTMLWord identifier;
(2) the a index will be 10;
(3) the b index will be 16;
(4) the HTMLlen length will be 49;
(5) the c substring will be Brown text;
(6) the d substring will be looks really cool on a Web page (d will begin with a space character);
(7) the Word string will be Brown text looks really cool on a Web page;
(8) tmp will be -1; the tmp != -1 condition is now false, so the browser moves to...

return Word;

The now-detagged Word string is returned to the doit( ) function and given the identifier Checked, as noted earlier.

window.document.forms["Check"].elements["Output"].value = Checked;

Finally, the Checked string is loaded into the Output box.
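
To check the values quoted in the walkthrough, the following sketch re-implements DelHTML( ) with local variables and records the intermediate values of each pass. traceDelHTML and its passes array are illustrative names only; the logic is meant to mirror the original function, not to replace it.

```javascript
// Same steps as DelHTML(), but with local variables and a log of each pass.
function traceDelHTML(htmlWord, passes = []) {
  const a = htmlWord.indexOf("<");
  let b = htmlWord.indexOf(">");
  const c = htmlWord.substring(0, a);
  if (b === -1) b = a;
  const d = htmlWord.substring(b + 1, htmlWord.length);
  const word = c + d;
  passes.push({ a, b, length: htmlWord.length, c, d });
  if (word.indexOf("<") !== -1) return traceDelHTML(word, passes);
  return { word, passes };
}

const testString =
  '<span style="color:brown;">Brown text</span> looks really cool on a Web page';
const { word, passes } = traceDelHTML(testString);
console.table(passes); // one row per run through the function
console.log(word);
```

Each row of the table corresponds to one run through the function, so the a, b, c and d values can be compared against the list above.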

> without <, and vice versa

So, what if there's no >, huh? Let's suppose that our input string is
Experts agree that 4<5 most of the time;
what happens? A trial run at the script demo page shows that the script merely subtracts the < character and outputs
Experts agree that 45 most of the time.
Complementarily, let's suppose that our input string is
Studies show that 4>5 for large values of 4;
this time, the script chops off the > character plus the substring that precedes it, and outputs
5 for large values of 4.
Probably not what you would want, is it?

However, if we input a tagless string lacking </> characters, for example,
It's time for another cup of the hot, black liquid,
then the script does at least return
It's time for another cup of the hot, black liquid
without incident.
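
These three results follow from how substring( ) treats the -1 that indexOf( ) returns for a missing character: a negative index is clamped to 0. A small sketch, using a local-variable copy of the original logic under the hypothetical name delHTMLOriginal:

```javascript
function delHTMLOriginal(htmlWord) {
  const a = htmlWord.indexOf("<");
  let b = htmlWord.indexOf(">");
  const c = htmlWord.substring(0, a); // substring(0, -1) acts like substring(0, 0)
  if (b === -1) b = a;
  const d = htmlWord.substring(b + 1, htmlWord.length);
  const word = c + d;
  return word.indexOf("<") !== -1 ? delHTMLOriginal(word) : word;
}

// Lone "<": b falls back to a, so only the "<" itself is dropped.
console.log(delHTMLOriginal("Experts agree that 4<5 most of the time"));
// Lone ">": c is empty, so everything up to and including the ">" is dropped.
console.log(delHTMLOriginal("Studies show that 4>5 for large values of 4"));
// Neither character: a and b are both -1, so d is the whole string.
console.log(delHTMLOriginal("It's time for another cup of the hot, black liquid"));
```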

Of course, all inputted strings without tags should be outputted unchanged; this is easily achievable via the following minor script modifications:

(1) Replace the DelHTML( ) function's if (b == -1) b = a; conditional with

if (a == -1 || b == -1) return HTMLWord;

(2) Wrap DelHTML( )'s subsequent statements in an else { ... } container.
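
Put together, the two modifications might look like the sketch below. It uses local variables and a different function name (delHTMLGuarded) so that it can be run on its own; those choices are mine, not the original script's.

```javascript
function delHTMLGuarded(htmlWord) {
  const a = htmlWord.indexOf("<");
  const b = htmlWord.indexOf(">");
  if (a === -1 || b === -1) {
    // No complete tag is possible: hand the string back unchanged.
    return htmlWord;
  } else {
    const c = htmlWord.substring(0, a);
    const d = htmlWord.substring(b + 1, htmlWord.length);
    let word = c + d;
    if (word.indexOf("<") !== -1) word = delHTMLGuarded(word);
    return word;
  }
}

console.log(delHTMLGuarded("Experts agree that 4<5 most of the time"));
console.log(delHTMLGuarded("Studies show that 4>5 for large values of 4"));
console.log(delHTMLGuarded("<b>Bold</b> claims"));
```

Like the original, this sketch still assumes that the first > in the string closes the first <.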

Speaking of unbalanced </> situations:

• It should be emphasized that the script won't work properly if the inputted code's tags contain any errors with respect to their delimiting < and > characters; e.g.,

<a href="http://www.someWebSite.com/" This is a link.</a>

will not return its link text.

• You may be wondering, "What if the inputted code itself has a script element containing one or more < and/or > characters?" In this case, you should externalize the script element code before applying the Script Tips #79-80 Script to the remaining code.
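
If externalizing by hand is impractical, one alternative, offered here as an assumption and not as anything the Script Tips suggest, is to delete each script element, content included, before removing the remaining tags. The removeScriptElements name and the scriptElement pattern below are hypothetical, and the pattern only handles ordinary <script>...</script> pairs.

```javascript
// Assumption: script elements are discarded, not moved to an external file.
const scriptElement = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

function removeScriptElements(code) {
  return code.replace(scriptElement, "");
}

const page = "<p>Hi</p><script>if (a < b) { x = 1; }</script><p>Bye</p>";
console.log(removeScriptElements(page)); // the two p elements remain
```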

A non-iterative 'detagification'

My simple deconstruction example contained only two HTML tags. But what if we were to input an entire document with dozens or even hundreds of tags? For the DelHTML( ) function to run through such a document over and over and over again - oh, my aching CPU! - is rather inefficient, needless to say. Isn't there some way to pick off all those tags in one sweep? It turns out that we can replace the DelHTML( ) function with two lines of regular expression-based code that will allow us to do just that; recast the doit( ) function as:

var ToCheck, SGMLTag, Checked;
function doit( ) {
ToCheck = document.forms[0].Input.value;
SGMLTag = /<[^>]+>/g;
Checked = ToCheck.replace(SGMLTag, "");
document.forms[0].Output.value = Checked;
}

The SGMLTag <[^>]+> regexp pattern matches any tag that does not contain a > character inside an attribute value:
• The starting < and ending > tag characters appear literally; they are not regexp metacharacters and do not need to be escaped with backslashes.
• The [^>] negated character class matches one character that is not a >.
• The [^>] class is matched one or more times via the + quantifier.
The accompanying g flag 'globalizes' the matching process, i.e., every SGMLTag match in the document is replaced (without the g, the browser would only replace the first SGMLTag match).

The replace( ) method of the String object is documented by Mozilla. The above replace( ) command replaces each SGMLTag match in the inputted ToCheck string with an empty string, in effect subtracting the document's tags.
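
To try the regexp approach outside of a form, the replace( ) call can be wrapped in a plain function; stripTags is a hypothetical name used here for illustration. The second call shows what happens when the g flag is left off.

```javascript
const SGMLTag = /<[^>]+>/g;

function stripTags(code) {
  return code.replace(SGMLTag, "");
}

const sample =
  '<span style="color:brown;">Brown text</span> looks really cool on a Web page';

// With the g flag, both the start-tag and the end-tag are removed.
console.log(stripTags(sample));

// Without the g flag, only the first match (the start-tag) is removed.
console.log(sample.replace(/<[^>]+>/, ""));
```

A string containing a lone < or a lone > has no match for the pattern, so stripTags( ) returns it unchanged.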

I've taken the <[^>]+> regexp pattern from Jan Goyvaerts' Regular-Expressions.info Web site, a regular expressions resource that I highly recommend.

In the following entry, we'll check over the Script Tips #81-83 Script, which creates various multicolor text strings.

reptile7
- posted by A. Peak @ 3:26 PM
