Tuesday, October 23, 2007
Bouncer Script B: XHTML Validation
Blog Entry #92
From HTML to XHTML
The basic differences between XHTML and HTML are listed in Chapter 4 of the XHTML 1.0 Specification; the following sections thereof are most relevant to the Script Tip #74 Script:
4.2. Element and attribute names must be in lower case
With only a few exceptions, the HTML element and attribute names in the original Script Tip #74 Script are written in uppercase letters; however, I lowercased these names for the script version that I put in a div two entries ago.
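For a longer document, the lowercasing could be automated. What follows is only a rough sketch and not what I actually did: lowercaseNames is a hypothetical helper, and it assumes that no attribute value contains a > character or a quoted "word=" sequence.

```javascript
// Hypothetical helper: lowercases element and attribute names in a markup string.
// Assumption: attribute values contain no ">" and no whitespace-preceded "word=" text.
function lowercaseNames(markup) {
  // Match start-tags and end-tags; declarations such as <!DOCTYPE ...> are skipped
  // because "<" must be followed by an optional "/" and then a letter.
  return markup.replace(/<(\/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>/g, function (match, slash, name, rest) {
    // An attribute name is a word preceded by whitespace and followed by "=".
    var attrs = rest.replace(/(\s)([A-Za-z][A-Za-z0-9-]*)(\s*=)/g, function (m, space, attr, eq) {
      return space + attr.toLowerCase() + eq;
    });
    return "<" + slash + name.toLowerCase() + attrs + ">";
  });
}
```

Attribute values are left alone, so a value such as "iName" keeps its capital N.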
4.1. Documents must be well-formed
4.3. For non-empty elements, end tags are required
There are no overlapping (improperly nested) elements in the Script Tip #74 Script - so far, so good. As for the script's non-empty elements, five tags are missing:
(1-2) The html element start-tag and end-tag
(3) The head element start-tag
(4) The form element end-tag
(5) The body element end-tag
The W3C informs us:
In SGML-based HTML 4 certain elements were permitted to omit the end tag; with the elements that followed implying closure. XML does not allow end tags to be omitted. All elements other than those declared in the DTD as EMPTY must have an end tag.
The HTML 4 Index of Elements shows that there are four elements whose start-tags and end-tags are both optional: html, head, body, and tbody. Section 3.1.1 of the XHTML 1.0 Specification states that a strictly conforming XHTML document necessarily has an html element start-tag equipped with an xmlns="http://www.w3.org/1999/xhtml" attribute. A literal reading of Section 4.3's tag requirement thus suggests that a valid XHTML document can still omit the start-tags of the head and body (and, when relevant, tbody) elements as long as the end-tags for these elements are present.
The Abstract for the XHTML 1.0 Specification proclaims that
XHTML 1.0 [is] a reformulation of HTML 4 as an XML 1.0 application,
however, and the element production in the XML 1.0 Specification
element ::= EmptyElemTag | STag content ETag
clearly shows that if an element is present in an X(HT)ML document, then both its start-tag and end-tag are required (i.e., there's no ? following STag or ETag). I also note that, unlike their HTML counterparts, the element declarations in the XHTML DTDs do not contain the - and O indicators for required and optional tags, respectively; for example:
<!ELEMENT p %Inline;> (XHTML)
<!ELEMENT P - O (%inline;)* -- paragraph --> (HTML)
The XHTML declarations do not distinguish between required and optional tags because, well, there are no optional tags in XHTML. So for validation purposes, the missing tags listed above must all be added to the Script Tip #74 Script. (But we don't need to add <tbody> and </tbody> tags, even though the script contains a table, for a reason discussed below.)
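The "every start-tag needs its end-tag" rule can be spot-checked with a stack. This is a minimal sketch and no substitute for a real validator: findTagProblems is a hypothetical name, and the function assumes that comments and script content have been stripped and that no attribute value contains a > character.

```javascript
// Hypothetical helper: reports end-tags that are missing or unexpected.
// Assumptions: comments and script content removed; no ">" inside attribute values.
function findTagProblems(markup) {
  var problems = [];
  var stack = []; // names of currently open elements
  var tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(\/?)>/g;
  var m;
  while ((m = tagPattern.exec(markup)) !== null) {
    var isEndTag = m[1] === "/";
    var name = m[2];
    var isEmptyElemTag = m[3] === "/"; // e.g. <br />
    if (isEmptyElemTag) { continue; }
    if (!isEndTag) { stack.push(name); continue; }
    var idx = stack.lastIndexOf(name);
    if (idx === -1) {
      problems.push("unexpected end tag </" + name + ">");
      continue;
    }
    // Anything opened after the matching start-tag was never closed.
    while (stack.length - 1 > idx) {
      problems.push("missing end tag </" + stack.pop() + ">");
    }
    stack.pop();
  }
  // Whatever is still open at the end of the document lacks an end-tag.
  while (stack.length > 0) {
    problems.push("missing end tag </" + stack.pop() + ">");
  }
  return problems;
}
```

Overlapping elements show up here as missing end-tags too, so the messages are a hint about where to look and not a precise diagnosis.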
4.6. Empty Elements
(See also Appendix C's C.2. Empty Elements)
I trust y'all know that with respect to empty elements and XHTML compliance you're supposed to add a space and a slash before the final > character, e.g.:
• Convert <br> to <br />
• Convert <hr> to <hr />
• Convert <input id="input0" name="iName" maxlength="15"> to
<input id="input0" name="iName" maxlength="15" /> - etc.
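The conversions above are mechanical enough to script. Here is a sketch, assuming the ten empty elements of the XHTML 1.0 Strict DTD and attribute values that contain no > character; closeEmptyElements and EMPTY_ELEMENTS are hypothetical names of my own.

```javascript
// Elements declared EMPTY in the XHTML 1.0 Strict DTD.
var EMPTY_ELEMENTS = ["area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"];

// Hypothetical helper: rewrites <br> (or <br/>) as <br />, and likewise for the
// other empty elements. Assumption: no ">" inside attribute values.
function closeEmptyElements(markup) {
  var pattern = new RegExp("<(" + EMPTY_ELEMENTS.join("|") + ")\\b([^>]*?)\\s*/?>", "gi");
  // $1 is the element name, $2 is the attribute text (if any).
  return markup.replace(pattern, "<$1$2 />");
}
```

Running the function on already-converted markup leaves it unchanged, so it is safe to apply more than once.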
The document type declaration
A valid (X)HTML document must include a document type (DOCTYPE) declaration, which is placed at the top of the document prior to the html element start-tag. (It follows that if a document doesn't have a DOCTYPE declaration - and there are quite a few Web pages out there that don't have them - then it hasn't been validated, even if it otherwise conforms to one of the W3C's DTDs.) The DOCTYPE declaration for an XHTML 1.0 Strict document is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
The MSDN Library's !DOCTYPE page provides the best 'anatomy' of a DOCTYPE declaration that I've seen.
Case-sensitivity is important! I can confirm that DOCTYPE must be in uppercase, html must be in lowercase*, and PUBLIC must be in uppercase; otherwise, a validator will determine that the document is not valid.
(*As shown below, an uppercase HTML is OK for HTML validation.)
According to the W3C, the http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd URL, termed the system identifier,
allows user agents to download the DTD and any entity sets that are needed,
i.e., if the user is online, the browser 'patches in' to the system identifier resource (the DTD) in much the same way that a script element's src="myScript.js" attribute connects the browser to an external script. Does this happen in practice? I don't know. It is clear that the HTML 4.01 Transitional DTD (or its equivalent) is already hard-coded into current browsers given that they can render (X)HTML documents without being connected to the Web.
Many Web pages have DOCTYPE declarations without system identifiers; for example, at the top of the source of EarthLink's home page, we find:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
This declaration is acceptable for HTML validation but not for XHTML validation. To validate an XHTML document against one of the XHTML DTDs at the W3C's Web site, the PUBLIC availability keyword, the -//W3C//DTD XHTML 1.0 (Strict | Transitional | Frameset)//EN public identifier, and the system identifier must all appear in the declaration.
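The case and completeness requirements for the XHTML 1.0 Strict declaration can be captured in one case-sensitive regular expression. This is a sketch only: hasXhtmlStrictDoctype is a hypothetical name, and the pattern assumes that the declaration is the first thing in the source (no XML declaration or comment ahead of it).

```javascript
// Case-sensitive on purpose: DOCTYPE and PUBLIC uppercase, html lowercase.
// Both the public identifier and the system identifier must be present.
var XHTML_STRICT_DOCTYPE = /^\s*<!DOCTYPE\s+html\s+PUBLIC\s+"-\/\/W3C\/\/DTD XHTML 1\.0 Strict\/\/EN"\s+"http:\/\/www\.w3\.org\/TR\/xhtml1\/DTD\/xhtml1-strict\.dtd"\s*>/;

// Hypothetical helper: true if the source begins with the XHTML 1.0 Strict DOCTYPE.
function hasXhtmlStrictDoctype(source) {
  return XHTML_STRICT_DOCTYPE.test(source);
}
```

A declaration with a lowercase doctype keyword, or one lacking the system identifier like the EarthLink example, makes the function return false.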
From invalid to valid
In summary, our necessary pre-validation changes are:
(1) Subtract the script's presentational elements and attributes and replace them with style rules (we did this two entries ago).
(2) Add the proper DOCTYPE declaration to the top of the document.
(3) Make the document well-formed (i.e., correctly tag the start and end of all elements).
(4) Lowercase the names of elements and attributes.
So, I uploaded an appropriately modified Script Tip #74 Script document to my EarthLink server space. I surfed over to the W3C's Markup Validation Service and entered the document's URL into the Address: field in the Validate by URI section. Lights...camera...action!! I clicked the Check button; up popped a "This page is not Valid XHTML 1.0 Strict!" page detailing a "problem" and three "errors" in my document:
Problem: "No Character Encoding Found! Falling back to UTF-8."
Man, you'd think EarthLink would deal with this on the server side; evidently not. Per the W3C's recommendation, I added the meta element below to the head of my document:
<!-- Follow the title element with: -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
Character encodings and the setting thereof are treated here in the HTML 4.01 Specification.
Error #1: "The [first-in-source-order br] element is not allowed to appear in the context in which you've placed it."
Regarding the large Login heading that begins the document body, my document replaced
With this formulation, the br element is now a child element of the body element, which is a violation of the body element's content model. To make a long story short, in the DTD the br element is a member of the %special.pre; set of elements, which is not part of the %Block; set of elements that can be body element children. Anyway, I thought, "Is this br element really necessary? Let's get rid of it," and I accordingly commented it out.
Error #2: "<form name='iAccInput'>: there is no attribute 'name'"
Section 4.10 of the XHTML 1.0 Specification declares,
Note that in XHTML 1.0, the name attribute of [the a, applet, form, frame, iframe, img, and map] elements is formally deprecated, and will be removed in a subsequent version of XHTML
- very sloppy of me to have missed that. I removed the name="iAccInput" attribute and recast the script element's
iName = document.iAccInput.iName.value;
AccId = document.iAccInput.iAccID.value;
as
iName = document.forms[0].iName.value;
AccId = document.forms[0].iAccID.value;
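The form-based lookup can also be wrapped in a small function that goes through the document.forms[0].elements collection, which makes the dependence on the form's position in the document explicit. This is a sketch under the assumption that the login form is the first form in the document; readLoginFields is a hypothetical name, and doc stands in for the browser's document object.

```javascript
// Hypothetical helper: reads the two login fields without relying on a form name.
// Assumption: the login form is the first (index 0) form in the document.
function readLoginFields(doc) {
  var form = doc.forms[0];
  return {
    iName: form.elements["iName"].value,  // control with name="iName"
    AccId: form.elements["iAccID"].value  // control with name="iAccID"
  };
}
```

In a browser the call would be readLoginFields(document); if another form were ever placed ahead of the login form, the index would have to change.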
Error #3: "<form name='iAccInput'>: required attribute 'action' not specified"
With respect to the script's execution, there's no need for the form element to have an action attribute because we're not submitting the form to a processing agent; checking the DTD, however, we see that the action attribute of the form element does indeed have a #REQUIRED default value designation. The %URI; action value's replacement text is "CDATA" - CDATA is defined here in the HTML 4.01 Specification as
a sequence of characters from the document character set and may include character entities
- so, reasoning that an empty string is still a string, I simply added action="" to the form element start-tag.
Alternatively, we could remove the form element container and reference the iAccInput controls in the script element via document.getElementById("input#") expressions:
iName = document.getElementById("input0").value;
AccId = document.getElementById("input1").value;
Having made the above changes, I reuploaded my script document and resubmitted it to the W3C validator; this time I was greeted with a "This Page Is Valid XHTML 1.0 Strict!" page! Success!
XHTML and the tbody element
So why don't we need to add a tbody element to the Script Tip #74 Script? In HTML, the tbody element is necessarily a child element of the table element:
<!ELEMENT TABLE - - (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>
However, XHTML revises the content model of the table element so that the tbody element is an optional child of the table element:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
We'll go through Script Tip #75's password-protection script in the following entry.