The following notes are copied from an article posted on the WebTechniques website ( http://www.webtechniques.com ).
Web Techniques Magazine
August 1998
Volume 3, Issue 8
Beyond HTML
Parsing XML in IE4 with JScript
By Michael Floyd
One bit of fallout from the Artificial Intelligence craze of the mid 1980s was accelerated research into natural-language processing. During that time period, I wrote several natural-language parsers that would take English syntax and convert it into commands that computers could understand. Most of the parsers were used in database applications and accepted English queries like "Give me all the dish on the 1995 Microsoft consent decree." The system would dutifully return all records related to the landmark case between Microsoft and the Department of Justice.
While it may seem like smoke and mirrors to some, such parsers are actually quite simple to write. They take a string (a sentence, in this case) and break it up into a list of tokens. The tokens are placed into a tree structure, which can be traversed by an application. This process is simplified in database applications by the fact that many tokens--words like "and," "the," "me," and "dish"--are unnecessary for the query and can be tossed out. Parsers are used for many types of applications such as compilers, interpreters, and other language processors, including browsers.
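The tokenizing step described above can be sketched in a few lines of script. This is only an illustration, not one of the parsers mentioned here: the function names and the stop-word list are assumptions (the text names only "and," "the," "me," and "dish"), and the sketch stops at a flat token list rather than building a tree.

```javascript
// Words the query does not need. Only "and", "the", "me", and "dish" come
// from the text; the other entries are assumptions added for this example.
var STOP_WORDS = {
  "and": true, "the": true, "me": true, "dish": true,
  "give": true, "all": true, "on": true
};

// Break a sentence into lowercase word tokens.
function tokenize(sentence) {
  var matches = sentence.toLowerCase().match(/[a-z0-9]+/g);
  return matches === null ? [] : matches;
}

// Keep only the tokens that matter to the query.
function significantTokens(sentence) {
  var tokens = tokenize(sentence);
  var kept = [];
  for (var i = 0; i < tokens.length; i++) {
    if (!STOP_WORDS.hasOwnProperty(tokens[i])) {
      kept.push(tokens[i]);
    }
  }
  return kept;
}

var query = "Give me all the dish on the 1995 Microsoft consent decree.";
console.log(significantTokens(query).join(" "));
```

A database front end would then match the surviving tokens against its records.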
<OBJECT
ACCESSKEY=key
ALIGN=ABSBOTTOM | ABSMIDDLE | BASELINE | BOTTOM
| LEFT | MIDDLE | RIGHT | TEXTTOP | TOP
CLASS=classname
CLASSID=id
CODE=url
CODEBASE=url
CODETYPE=media-type
DATA=url
DATAFLD=colname *
DATASRC=#ID *
HEIGHT=n
ID=value
LANG=language
LANGUAGE=JAVASCRIPT | JSCRIPT | VBSCRIPT | VBS
NAME=name
STYLE=css1-properties
standby **
hspace **
vspace **
TABINDEX=n
TITLE=text
TYPE=MIME-type
WIDTH=n
event = script
>
* This attribute is Microsoft-specific.
** Included in the HTML 4 specification, but not supported in Internet Explorer.
Example 1: General form for the Object element.
Statistics for file:///d:/north.xml
XML Version: 1.0
Character set supported: UTF-8
Document type: Article
Example 2: Report generated by Listing One.
Last month I mentioned that several XML parsers are beginning to appear around the Net. Most of these parsers are written in C++ or Java, and will likely be used by programmers to create the next generation of XML tools and applications. They often incorporate a command-line interface, and in most cases are poorly documented. The good news is that you don't have to be a hard-core C++ or Java developer to use XML on your Web site. That's because Microsoft Internet Explorer 4.0 includes an XML parser that's accessible through scripting--if you have IE 4, you can use JScript to experiment with XML now. This month I'll look at the API for the MSXML parser, and in the process show how you can pass an XML document to the parser and get back the document's structure.
Document Objects
As you likely know, the Document Object Model (DOM) is a W3C specification that describes a platform-independent and language-neutral interface for accessing structured documents including those in HTML, CSS, and XML formats. The core DOM specification describes a set of object definitions that let you represent the objects within a document. The core specification has also been extended for XML so that document type definitions, entities, and CDATA sections can also be represented. The objects and interfaces, sometimes called APIs, are referred to as object models.
Microsoft defines its object model in terms of a "document" object, an "element" object, and a collection class. Each of these classes has a set of properties and methods you can use to manipulate XML documents. That is, when you take an XML document and run it through Internet Explorer's parser, you'll get back a tree structure that you can then traverse. In this first example, I'll show you how to access the document object and retrieve information about the document. First, I'll create an HTML page that contains a form, our document object, and the parser script. The idea is to put up an edit control that can be used to enter the name of a valid XML file. The script takes that file and passes it to Internet Explorer, which will parse the file and return the results in our document object. I'll then query the document object's properties and report on this document.
Listing One starts off by creating an object using the new HTML 4 Object element. This new element is a generalized mechanism for inserting things like multimedia objects, plug-ins, Java applets, and COM objects into an HTML document. The syntax for the Object tag is shown in Example 1. The key attributes for our purposes are ClassID, ID, and Name. In general, ClassID contains a URL that identifies the implementation of an object. In Internet Explorer, ClassID acts as an identifier for the object type. The long string of characters assigned to the ClassID attribute identifies the object implementation to Internet Explorer. The clsid: portion of the string tells Internet Explorer that the rest of the string refers to an ActiveX control. The ID attribute is a unique identifier that will be used to reference this object from within our script.
According to the HTML 4 specification, the browser must pass the contents of the Name attribute along with any data from the object if there's no accompanying DECLARE attribute. So, I use the Name attribute to submit, or pass, the object in a form.
The next step is to create the form, which will be used to input the name of the XML file to be parsed. Listing One uses a simple Input element to place an edit field on the page. Because no Type attribute is specified, a text field is assumed. A second Input element is used to create the "Parse", or submit, button. The Name attribute acts as an identifier and allows us to reference the button from the script. When the onClick event occurs, the value in the Filename edit field is retrieved and passed to the Parse() function.
Parsing the Document
Microsoft's Internet Client SDK (see "Online") describes ten API calls that let you set and retrieve various properties for an XML document. These include the ability to retrieve the version of XML the document supports, the character set supported, contents of the !DOCTYPE element, and the document's root element. There are also a number of properties that are documented, but were not implemented at the time of this writing. I stumbled into one other property, READYSTATE, which appears to be undocumented for the parser (although it is covered in the DHTML documentation). The complete list of properties and methods is detailed in Table 1.
Method/Property     Description
Charset             Returns the character set supported by the document.
createElement()     Takes a type and tagname as parameters and returns a newly created element.
doctype             Returns the contents of !DOCTYPE as listed in the XML document.
fileModifiedDate    Returns the date the file was previously modified.*
fileSize            Returns the XML document's file size.*
fileUpdatedDate     Returns the date the file was previously updated.*
mimeType            Returns a MIME type if specified.*
root                Returns the root element for the document. This is used to traverse the rest of the document tree.
URL                 Sets or returns the path (usually a URL) of the document.
version             Returns the supported version of XML as listed in the <?XML> command.
* Documented but not supported for the XML document object.
Table 1: Properties and methods for the XML document object.
My Parse() method in Listing One uses most of the properties and methods in Table 1 to report on the document object. (Examples for the unsupported calls are included for completeness, but are commented out.) Parse() begins by opening a browser window and writing out the preliminary HTML tags needed to display the results. Next, the root property, which stores the document's root element, is retrieved and assigned to the DocumentRoot variable. This is the starting element we'll use to traverse the tree structure of elements; see Listing Two. The remaining code queries the other properties and reports the results in the browser window.
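As a rough sketch of the reporting step just described (not the code from Listing One), the following script builds report lines in the style of Example 2. The xmlDocument object is a hand-built stand-in so the sketch runs outside Internet Explorer; its property names follow Table 1, its values are hard-coded assumptions, and the parseReport() name and the "Root element" line are inventions for this example.

```javascript
// Stand-in for the parser object declared with the Object element.
// Property names follow Table 1; the values are hard-coded here, whereas
// the real parser would fill them in after loading the file.
var xmlDocument = {
  URL: "",
  version: "1.0",
  charset: "UTF-8",
  doctype: "Article",
  root: { type: 0, tagName: "ARTICLE", children: null }
};

// Point the document object at a file and collect the report lines.
function parseReport(doc, filename) {
  doc.URL = filename;            // Table 1: URL sets or returns the document path
  var documentRoot = doc.root;   // starting element for a later tree traversal
  var lines = [];
  lines.push("Statistics for " + doc.URL);
  lines.push("XML Version: " + doc.version);
  lines.push("Character set supported: " + doc.charset);
  lines.push("Document type: " + doc.doctype);
  lines.push("Root element: " + documentRoot.tagName);
  return lines;
}

console.log(parseReport(xmlDocument, "file:///d:/north.xml").join("\n"));
```

In the browser, each line would be written to the results window with writeln rather than to the console.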
When you load Listing One into Internet Explorer, the Edit control and Parse button appear in the window. To test the code, I've used the XML file we created in last month's column (see "Beyond HTML," July 1998), which is an excerpt from Ken North's March 1998 "Database Developer" column. When you enter a filename, a second browser window pops up and displays the XML version as defined in the <? XML Version ?> prolog. The results of running north.xml through the parser are shown in Example 2.
Traversing the Tree
Listing Two presents a second method, displayTree(), which reports on elements within the document tree. The displayTree() method outputs the element details in a visual manner that mimics the structure of the document. If you were to run this code you would see that ArticleText is a child node of Article and a parent node of SubHead.
The displayTree() method is designed to be called from Parse() and takes an object from the document tree and an integer value (N) as its parameters. For the first call, you get the object by querying the document's root property, as described previously. The purpose of N is to keep track of our level in the hierarchy and to indent the child nodes appropriately. displayTree() begins by ensuring that there is, indeed, an object to work with. If not, the routine issues an error and bails out. Assuming we have a valid object, displayTree() creates the indent string used to indent child nodes from their parent.
You can get to the child nodes through a collection class called children. This class provides an item() method to retrieve elements from the collection and a length property, which lets you determine the number of items in the collection. Collections are a very powerful feature of Microsoft's Dynamic HTML, but a full discussion is beyond the scope of this month's column. For our purposes, displayTree() uses the length property simply to set the value for N when indenting the output and to display the number of child elements in the output for element detail.
The next series of statements prints out the element detail in the browser. The first step is to check the current object type using the type property. This property contains an integer value that represents one of five types: element (return value is equal to 0), text (value = 1), comment (2), document (3), or DTD (4). For completeness, I've mapped these values to their string equivalents; see the GetTypeStr() method in Listing Two. In our case, we're just interested in elements. So, displayTree() checks for an element type equal to zero. If found, it writes out the element type, its tag name, and any attributes contained within the element. Also note that the indent string is incorporated into the detail output.
The final step is to check to see if there are any child nodes. If there are, we must perform the entire process again. I do this by iterating through each element in the collection and calling displayTree() recursively. This has the effect of performing a "depth-first search," where each branch of the tree is fully explored before moving on to the next branch. Note here that N is incremented only if its value is different from the current branch level within the tree.
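The mapping from type codes to names could look something like the sketch below. The five codes are the ones listed above; the function name getTypeStr(), the uppercase strings, and the "UNKNOWN" fallback are assumptions for illustration and may differ from GetTypeStr() in Listing Two.

```javascript
// Index in this array equals the type code: 0 element, 1 text,
// 2 comment, 3 document, 4 DTD.
var TYPE_NAMES = ["ELEMENT", "TEXT", "COMMENT", "DOCUMENT", "DTD"];

// Translate a numeric type code into a readable name.
function getTypeStr(typeCode) {
  if (typeof typeCode === "number" && TYPE_NAMES[typeCode] !== undefined) {
    return TYPE_NAMES[typeCode];
  }
  return "UNKNOWN"; // fallback chosen for this sketch
}

console.log(getTypeStr(0));
console.log(getTypeStr(4));
```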
Since this is a depth-first search and we are traversing the tree from top to bottom, I need to increment N only when a child branch is encountered. And because of the nature of recursion, I don't have to worry about restoring (or decrementing) the value of N after searching the branch. The reason is that as the recursion "unwinds" to the previous level, the value of N is restored automatically.
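The recursion can be sketched as follows. This is a simplified stand-in for Listing Two, not a copy of it: the element and collection objects are built by hand so the script runs outside Internet Explorer, the tree contents beyond the Article, ArticleText, and SubHead names are made up, and the level is simply passed along as n + 1 instead of being updated the way the listing does it.

```javascript
// Hand-built stand-ins for the parser's collection and element objects.
function makeCollection(items) {
  return { length: items.length, item: function (i) { return items[i]; } };
}
function makeElement(tagName, kids) {
  return {
    type: 0, // 0 marks an element
    tagName: tagName,
    children: kids && kids.length ? makeCollection(kids) : null
  };
}

// Walk the tree depth first, collecting one indented line per element.
function displayTree(elem, n, out) {
  if (elem == null) {
    out.push("Error: no object to display");
    return out;
  }
  var indent = "";
  for (var i = 0; i < n; i++) {
    indent += "  ";
  }
  var kids = elem.children;
  var count = kids == null ? 0 : kids.length;
  if (elem.type === 0) {
    out.push(indent + elem.tagName + " (children: " + count + ")");
  }
  // Each recursive call gets its own copy of the level, so nothing has
  // to be decremented when the recursion unwinds.
  for (var k = 0; k < count; k++) {
    displayTree(kids.item(k), n + 1, out);
  }
  return out;
}

var root = makeElement("ARTICLE", [
  makeElement("ARTICLETEXT", [makeElement("SUBHEAD"), makeElement("SUBHEAD")])
]);
console.log(displayTree(root, 0, []).join("\n"));
```

Because the level is a parameter, the second SubHead comes out at the same indent as the first, which is the "restored automatically" behavior described above.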
I've separated the methods in Listing Two from Listing One for purposes of clarity. In practice, you'll want to drop these methods into Listing One, then add a call
displayTree(xmlDocument.root, 0)
to the Parse() function (just before the writeln statements that add the closing </BODY> and </HTML> tags). Finally, the complete source code and the XML file used for this example are available electronically.
Tool of the Month
Many of the tools available for editing, publishing, and viewing XML files are actually SGML tools. Such is the case with SoftQuad's Panorama Viewer, a browser plug-in that lets you view SGML documents. The company has announced a new version supporting XML, which should be available by the time this reaches print. In the meantime, since XML is a subset of SGML, you can also view XML files as long as they're both "well formed" (they conform to XML guidelines) and valid (they include a DTD and conform to it). The Panorama Viewer is currently available for the Windows 95/NT/3.x, Macintosh, and UNIX platforms.
The Panorama Viewer includes a number of features: You can use annotations to add notes and comments to SGML documents. Panorama Viewer supports multiple style sheets to control the display of documents. Panorama also provides navigators--multipane windows that let you navigate through documents. The Panorama Viewer carries a retail price of $49, but a 60-day evaluation copy is available for download from the SoftQuad Web site; see "Online." A complementary product, Panorama Publisher, is also available for $195.
ONLINE
DOM Specification
www.w3.org/TR/WD-DOM/
Microsoft Internet Explorer Client SDK
www.microsoft.com/ie/
SoftQuad Product Catalogue: Panorama Viewer
www.softquad.com/products/pc-pview.htm
Putting It to Work
Now that you can parse and view XML files, you may be wondering how XML can be used on your site. You can use XML to make your site more accessible to the disabled, create channels using Microsoft's Channel Definition Format, support incremental downloads using Marimba's Open Software Description (OSD) format, or let your visitors view molecular structures with the Chemical Markup Language.
One intriguing idea is to create an XML-aware search engine that lets you search for text more efficiently and at a finer granularity. For example, I could tag this column in every place where a word or phrase is defined. Then I could search on <definition>= "object" and get the definition of "object model" without getting every occurrence of object that appears in the column. The ability to search on tags, and text within tags, is very powerful. If you come up with other ideas for using XML, drop me a line at the address below.
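To make the tag-scoped search idea concrete, here is a small sketch that walks a parsed tree and returns only the text inside elements with a given tag name. Everything in it is hypothetical: the searchByTag() function, the hand-built element objects (including their text property), and the sample column content are assumptions for illustration, not an existing search engine or parser API.

```javascript
// Hypothetical stand-ins for parsed elements: type 0 marks an element,
// text holds its character data, children is a collection or null.
function makeCollection(items) {
  return { length: items.length, item: function (i) { return items[i]; } };
}
function makeElement(tagName, text, kids) {
  return {
    type: 0,
    tagName: tagName,
    text: text,
    children: kids && kids.length ? makeCollection(kids) : null
  };
}

// Collect the text of every element named tagName whose text contains phrase.
function searchByTag(elem, tagName, phrase, hits) {
  if (elem == null) {
    return hits;
  }
  var wanted = phrase.toLowerCase();
  if (elem.type === 0 && elem.tagName === tagName &&
      elem.text.toLowerCase().indexOf(wanted) !== -1) {
    hits.push(elem.text);
  }
  var kids = elem.children;
  var count = kids == null ? 0 : kids.length;
  for (var i = 0; i < count; i++) {
    searchByTag(kids.item(i), tagName, phrase, hits);
  }
  return hits;
}

var column = makeElement("COLUMN", "", [
  makeElement("PARA", "Each object has properties and methods."),
  makeElement("DEFINITION", "Object model: the objects and interfaces that represent a document.")
]);
console.log(searchByTag(column, "DEFINITION", "object", []).join("\n"));
```

The paragraph that merely mentions "object" is skipped because its tag name does not match, which is the finer granularity the search is after.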
Michael Floyd is a consultant, freelance writer, and Web Techniques' editor at large. He can be reached via email at mfloyd@BeyondHTML.com.
Copyright © Web Techniques. All rights reserved.