HTML Plus!, Chapter 2: HTML -- the HyperText Markup Language

HTML Plus! (0-534-51626-2)

James Powell, Virginia Polytechnic Institute & State University
Copyright © 1997

Chapter 2: HTML -- the HyperText Markup Language

CHAPTER AT A GLANCE:

Text markup
HTML tags
Generalized and specific markup
Character entities
Table 2-1: HTML Special Characters
What Is an HTML "Editor"?
The basic structure of an HTML document

Markup

HTML is a text markup language. If you've ever peeked at the codes behind a WordPerfect document or used an old-style text formatting system, then you are probably familiar with markup. Markup consists of the hidden codes or tags used to mark the insertion point for a graphic, or to indicate that a certain piece of text should be bold, italic, indented, set as a paragraph, or shown in a different font size. Some types of markup are like road signs. They are equivalent to the signs on a highway where the speed limit changes with the terrain. Sometimes the sign warns the driver to proceed more slowly due to numerous intersections or hairpin curves, while later signs encourage drivers to return to a higher speed (and some drivers add five or ten miles/kilometers to this amount!). These sign-style tags set a mode that stays in effect until a new tag is encountered. Once a bold tag is located, all text following it is bolded until some other font style tag is encountered.

Script/GML is a type of markup used on IBM mainframes to generate formatted documents. It is a sign-style language where one type style is set and stays set until another tag is encountered. Here is an example of specific markup using the Script/GML language:

  .bChapter 1
  .iIntroduction to markup

Here each tag is preceded with a period and immediately precedes document text. The ".b" tag causes all text that follows to be displayed in bold. The ".i" tag later changes the font style to italics. Script/GML has fallen out of favor over the years as word processors with WYSIWYG ("What You See Is What You Get") capabilities have proliferated. But ironically it is a direct ancestor of HTML.

HTML markup is a container-style markup. Just as most food products in a grocery store are wrapped, boxed or bagged, pieces of text in a document are enclosed, or tagged, by HTML markup. Without proper containers, text in an HTML document might look fine or, like milk without a container, pour all over the place, making the page unreadable. Unlike food containers, however, HTML tags are hidden from the user when the finished document is displayed by a Web browser. They are only visible during the editing process.

Specific versus Generalized Tags

HTML contains markup for encoding both specific and generalized information about a document. Specific markup is used to control exactly how a piece of text should look when displayed. Script/GML is an example of a specific markup language, where tags correspond to formatting styles such as bold, italics, or underlining.

Here is an example of specific markup in HTML:

  <B>Chapter 1</B>
  <I>Introduction to markup</I>

Notice that each segment of text is contained by matched tags, rather than simply preceded by a tag that specifies a mode switch.
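
The difference between the two styles can be sketched in a few lines of Python. This is only an illustration: the mapping table and the function name sign_to_container are invented for this example, and it handles just the two codes shown above, not the real Script/GML language. It remembers the mode set by the most recent sign-style code and wraps each line in the matching HTML container.

```python
# Hypothetical mapping from the two sign-style codes used in the example
# above to HTML container tags. Real Script/GML has many more codes.
SIGN_TO_TAG = {".b": "B", ".i": "I"}


def sign_to_container(lines):
    """Wrap each line in the HTML tag for the sign-style mode in effect."""
    out = []
    current = None  # mode in effect; None until the first code is seen
    for line in lines:
        code = line[:2]
        if code in SIGN_TO_TAG:
            current = SIGN_TO_TAG[code]  # a new sign switches the mode
            line = line[2:]
        # The mode stays set for later lines until another code appears.
        out.append(f"<{current}>{line}</{current}>" if current else line)
    return out


converted = sign_to_container([".bChapter 1", "continued", ".iIntroduction to markup"])
print(converted)
# ['<B>Chapter 1</B>', '<B>continued</B>', '<I>Introduction to markup</I>']
```

Notice that the second line has no code of its own, yet it is still bold: the sign-style mode carried over, and the container version has to say so explicitly.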

In both examples, tags are added to the document to encode additional information about the document text. The process of adding these tags is referred to as marking up or tagging a document. The markup is interpreted by the computer or other output device and converted to formatting information which is then applied to the marked text. In this case, Chapter 1 is bold and Introduction to markup is italicized. In each example the computer or laser printer has little choice about how to display the marked text, since there are definite rules that determine how bold or italicized text should look. This is the essence of specific markup. It is inflexible and has no relationship with the structure of the document in which it occurs, only the appearance. If HTML were just a specific markup language, you would never need to see or learn about its tags.

HTML is also a generalized markup language. Generalized markup is not always obvious to the user. It encloses document structures such as paragraphs, lists and headings. Because of these markup elements, an HTML author needs to be familiar with the tags and know how to insert them with an editor. Since generalized markup encapsulates structures within a document, it is left up to the display device (in most cases a computer with a World Wide Web browser) to determine how to display the tagged structures. Here is an example of generalized markup in HTML:

  <HTML>
  <HEAD><TITLE>Chapter 1</TITLE></HEAD>
  <BODY>
  <H1>Chapter 1: Introduction to markup</H1>
  </BODY>
  </HTML>

In this example, the entire document is enclosed by the <HTML> tag. The first <HTML> tag is called the start tag. Since HTML tags are containers, it is paired with an end tag, beginning with a forward slash, that occurs after the text that it marks: </HTML>. This is how all HTML tags look. In a few cases, an end tag can be omitted, but it is good practice to always include an end tag even when minimization is allowed. Other tags in this example include the <HEAD> tag, which encloses the document header. An HTML document has only one header, which is located at the top of the document. The actual textual content of the document starts after the <BODY> tag. Here tags such as <H1> can be used to mark headings. Headings mark section titles such as chapters or parts.

It is up to the display software (the Web browser) to decide how to display the text marked with these tags. The contents of the <TITLE> tag in the <HEAD> section are typically displayed in the Web browser's title bar when it retrieves the document. The document contents start with the <BODY> tag; anything other than a comment between the </HEAD> end tag and the <BODY> start tag is not valid HTML, and will be ignored. Headings (such as <H1>) can be displayed in a variety of ways and, in fact, are usually under the control of the reader. Browsers allow the reader to specify the size and even the font style for each HTML heading level.
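
Because generalized markup records structure rather than appearance, software can pull that structure out without knowing anything about fonts. The following is a minimal sketch using the HTMLParser class from Python's standard html.parser module. The class name OutlineParser is invented for this example, and it assumes the title and headings are not nested inside one another.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect the <TITLE> text and the text of every <H1>..<H6> heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # list of (tag, text) pairs
        self._current = None  # structural tag we are currently inside

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> look the same here.
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            if tag != "title":
                self.headings.append((tag, ""))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADINGS:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data)


document = """<HTML>
<HEAD><TITLE>Chapter 1</TITLE></HEAD>
<BODY>
<H1>Chapter 1: Introduction to markup</H1>
</BODY>
</HTML>"""

parser = OutlineParser()
parser.feed(document)
parser.close()
print(parser.title)     # Chapter 1
print(parser.headings)  # [('h1', 'Chapter 1: Introduction to markup')]
```

A browser does something similar: it finds the title and the headings first, and only then decides how each should look.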

It is difficult to avoid mixing specific and generalized markup when using HTML, but it is helpful to be aware of the differences. Authors who are not aware of the differences typically make mistakes such as relying on headings to emphasize text rather than to tag structures. The reader might have selected a very small font for that heading level in their browser. The impact of the document is then greatly reduced, because the author did not use specific markup, such as the bold tag, to control the appearance of this portion of text.

Attributes

So far, we've seen two types of HTML markup: start and end tags. All HTML tags look like those in the examples above. Each starts with a less than sign (<), followed by the tag text, followed by a greater than sign (>). Many HTML tags contain additional information between the "<>" that further defines the tag. These pieces of information are called attributes. Attributes define additional characteristics of a tag. Some have predefined values, or only one possible value, whereas you can define the values of others. An attribute is assigned a value by including the attribute name followed by an equal sign (=) and the value. If the attribute allows you (the author) to define a value, then the value should be enclosed in double quotes (""):

  <IMG SRC="worldmap.jpg" ISMAP>

In this example, the <IMG> tag's SRC attribute is assigned the value "worldmap.jpg" by the author. The ISMAP attribute has only one possible value, ISMAP=ISMAP, so you can abbreviate it by simply listing the attribute name. Adding the ISMAP attribute to this image tag makes it an image map instead of a regular inline graphic.
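
To see how software reads these attributes, here is a small sketch using Python's standard html.parser module (the class name AttributeLister is made up for this example). The parser reports each attribute as a name and value pair, and an abbreviated attribute such as ISMAP arrives with no value at all.

```python
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Record every start tag together with its list of attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased and
        # an attribute given without a value is reported with value None.
        self.seen.append((tag, attrs))


lister = AttributeLister()
lister.feed('<IMG SRC="worldmap.jpg" ISMAP>')
lister.close()
print(lister.seen)
# [('img', [('src', 'worldmap.jpg'), ('ismap', None)])]
```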

End Tags

As we saw in our first example, HTML start tags have corresponding end tags. An end tag looks like the start tag, but a forward slash follows the less than sign, like this: </HTML>. End tags close a section of markup. If a bold <B> tag is not closed with a </B> tag, then all remaining text in the document will be bolded (like milk without a container!). Some end tags are not often used because the next tag makes it clear that the previous tag has ended. The paragraph tag <P> is typical of this. One almost never sees a </P> tag because the next <P> tag makes it clear to the software that the previous paragraph has ended. In practice, it is best to insert both the start and end tags in all cases, unless you are using an HTML markup tool to enter tags, in which case doing so is often impractical and too time consuming. Also, if you are hand tagging a document or part of a document, it is helpful to capitalize the text of the HTML tag: <P> rather than <p>, since this makes the markup easier to distinguish from the text. However, both capital and lowercase make valid HTML tags. A few tags do not have end tags, such as the <IMG> tag, since it is not used to mark up text but rather to specify that an image should be included at this point in the document.
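
If you want to check a hand-tagged file for forgotten end tags, a rough count is often enough. The sketch below is one possible approach, not a real validator: the names NO_END_TAG, TagCounter and unclosed_tags are invented here, and the list of tags without end tags is a short illustrative subset. It counts start and end tags by name, ignoring case, and reports any tag that was opened more often than it was closed.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never take an end tag. An illustrative subset, not a complete list.
NO_END_TAG = {"img", "br", "hr"}


class TagCounter(HTMLParser):
    """Keep a running balance of start tags minus end tags for each tag name."""

    def __init__(self):
        super().__init__()
        self.balance = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in NO_END_TAG:
            self.balance[tag] += 1

    def handle_endtag(self, tag):
        self.balance[tag] -= 1


def unclosed_tags(html_text):
    """Return {tag: count} for tags opened more often than they were closed."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return {tag: n for tag, n in counter.balance.items() if n > 0}


print(unclosed_tags("<P>First paragraph<P>Second with <B>bold text"))
# {'p': 2, 'b': 1}
print(unclosed_tags("<p>Mixed <B>case</b> is fine</P>"))
# {}
```

The unclosed <P> tags in the first call are legal, since that end tag may be omitted, but the unclosed <B> is the kind of slip that turns the rest of a page bold.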

Character Entities

Another type of HTML markup is the character entity. You can use any ASCII character (A-Z, a-z, 0-9 plus a few standard punctuation and mathematical symbols) in an HTML document by simply typing it. To use characters not found in ASCII such as characters from the ISO Latin-1 character set (characters used in many western European languages), you insert character entities. Character entities are markup which indicates that a special character should be inserted where the entity occurs in the document. Character entities start with an ampersand (&) and end with a semicolon (;). For example,

  the conference will be held in S&atilde;o Paulo this year

is displayed as

  the conference will be held in São Paulo this year

Character entity text should be lower case, unless you are specifying an entity for an upper case letter, in which case only the first letter should be capitalized. &ATILDE; would be incorrect in the above example. A character entity is treated exactly as a single ASCII character, so it can be used anywhere an ASCII character can appear in a document and can be bold, italicized, etc.
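
The capitalization rule is easy to try out with the unescape function from Python's standard html module, which decodes entities much as a browser would. The sample words are arbitrary, and this is only a quick check rather than a statement about how any particular browser behaves.

```python
import html

lower = html.unescape("caf&eacute;")         # lower-case entity name
capital = html.unescape("&Eacute;cole")      # only the first letter capitalized
all_caps = html.unescape("&EACUTE;cole")     # all capitals: not a defined name

print(lower)     # café
print(capital)   # École
print(all_caps)  # &EACUTE;cole  (left undecoded)
```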

Since the less than, greater than and ampersand are part of HTML markup, they also have character entities which must be used when you want to include one of these characters in a document:

  The HTML tags for bold and italics are &lt;B&gt; &amp; &lt;I&gt;

is displayed as

  The HTML tags for bold and italics are <B> & <I>

If these characters were inserted into the body of the text unencoded, they might be misinterpreted as part of a tag. Then when no matching end tag is found, the entire document might be improperly displayed. Occasionally you might slip up and include one of these characters, only to find when you view the document that you have a hypertext anchor spanning many lines, or text has disappeared. Double quotes can often be inserted with no ill effect but if they appear within a tag, weird things can happen. It is good practice to always insert entities for characters used to construct HTML markup whenever you are hand-editing a file. Table 2-1 lists some of these characters, their meaning in HTML, and the character entity you should use whenever you want to include one of these characters in your document text.

Table 2-1 HTML Special Characters

Character	Meaning	Entity Equivalent
`<`	Opens an HTML tag	`&lt;`
`>`	Closes an HTML tag	`&gt;`
`&`	Starts a character entity	`&amp;`
`"`	Encloses attribute values	`&quot;`
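
If you generate HTML from a program rather than by hand, you do not have to make these substitutions yourself. As one example, Python's standard html module provides escape and unescape functions; the sketch below applies them to the sentence used earlier. The variable names are arbitrary.

```python
import html

text = "The HTML tags for bold and italics are <B> & <I>"

# escape() replaces <, > and & (and, by default, quotation marks) with entities.
encoded = html.escape(text)
print(encoded)
# The HTML tags for bold and italics are &lt;B&gt; &amp; &lt;I&gt;

# unescape() reverses the substitution.
print(html.unescape(encoded) == text)  # True

quoted = html.escape('ALT="World map"')
print(quoted)
# ALT=&quot;World map&quot;
```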

You may also insert characters, including non-ASCII ones, by using numeric entities. Numeric entities are similar to character entities, but the character's numeric code (its ASCII or ISO Latin-1 code) is used instead of the abbreviated name. For example, &#123; is the entity for a left curly bracket ({). A complete list of character and numeric entities is available in Appendix B.
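
A numeric entity is simply the character's code wrapped in &# and a semicolon, so it can be computed. Here is a minimal sketch in Python; the helper name to_numeric_entity is invented for this example, and it assumes you pass it a single character.

```python
import html


def to_numeric_entity(ch):
    """Return the numeric entity for a single character, e.g. '{' -> '&#123;'."""
    return f"&#{ord(ch)};"


print(to_numeric_entity("{"))     # &#123;
print(to_numeric_entity("é"))     # &#233;
print(html.unescape("&#123;"))    # {
print(html.unescape("&#233;"))    # é
```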

HTML editing tools can even do some of the work for you. One extremely useful tool, rtf2html, converts documents saved in Microsoft's Rich Text Format to HTML. It constructs a valid document structure, and converts as many formatting attributes as it can to HTML tags. Newer tools incorporate conversion engines like these directly into word processors and page-layout tools. Software packages such as PageMaker and WordPerfect can already save files in HTML. They can recognize a paragraph or bolded text and perform the markup for you. But no such tool is likely to work 100% of the time when attempting to perform markup on citations, lists, blockquotes, etc., because there are several accepted standards for formatting this type of information. That's why it is important to learn HTML.

What Is an HTML "Editor"?

HTML was developed in 1991, long after word processors had become fairly common. So naturally, the first World Wide Web browser was also an HTML authoring tool -- an HTML editor. Most HTML editors list tags that are available and allow you to select portions of text around which the tags should be inserted. Once you select some text, you usually click a button for the appropriate tag or select it from a pull-down list. The first HTML editor, written for computers running the NEXTSTEP operating system, had a feature that allowed you to build hypertext links between documents. Once you'd selected the text for a link, you browsed through your files and selected a file to which you wanted to link this text. Of course, the file to which you were linking had to be on the same machine as the HTML document you were creating, which was not always convenient. Other editors have a "validation" feature which attempts to make sure you use tags correctly by looking for improperly nested tags, invalid attributes, missing end tags, etc.
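
At its core, the "select some text and click a tag button" feature is a simple string operation. The sketch below shows the idea in Python; the function wrap_selection and its parameters are hypothetical and do not describe how any particular editor is implemented.

```python
def wrap_selection(text, start, end, tag):
    """Return text with the selection text[start:end] enclosed in <TAG>...</TAG>."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("selection is outside the text")
    tag = tag.upper()  # capitalized tags stand out from the document text
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


# Select the first nine characters ("Important") and apply the bold tag.
print(wrap_selection("Important Announcement", 0, 9, "b"))
# <B>Important</B> Announcement
```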

Figure: Microsoft Word document window. Word processors such as Microsoft Word can be used as HTML editors.

Nesting

If two or more tags are to be applied to the same text, they should always be nested within one another. It is incorrect to have one tag start within another and end outside it. For example,

  <B><I>Important Announcement</B></I>

should instead be:

  <B><I>Important Announcement</I></B>

If you find this difficult to remember, think of Russian nesting dolls (matryoshka, often called "babushka" dolls). These small wooden ornaments are actually containers for smaller dolls. Each tag is a doll, and your text is on a slip of paper inside the smallest doll.
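
The nesting rule is the same one a stack enforces: the last tag opened must be the first one closed. As a rough sketch of how a validating editor might check this, the Python below pushes each start tag on a stack and expects each end tag to match the top. The names NestingChecker and is_properly_nested are invented here, and the sketch assumes every tag has an end tag, so it would need extra rules for tags like <IMG> or an omitted </P>.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack; an end tag must match the most recent start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []  # (end tag found, tag that was expected) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            expected = self.stack[-1] if self.stack else None
            self.errors.append((tag, expected))


def is_properly_nested(html_text):
    """Return True if every end tag closes the most recently opened tag."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    return not checker.errors and not checker.stack


print(is_properly_nested("<B><I>Important Announcement</B></I>"))  # False
print(is_properly_nested("<B><I>Important Announcement</I></B>"))  # True
```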

Of course, not all tags can be nested. For example, the <HEAD> tag cannot occur within <BODY>. Conversely, many tags can only occur between certain other tags. Tags are said to be elements of other tags when they can only occur between them. All HTML tags are elements of the <HTML> tag as this tag encapsulates the entire document. The <TITLE> tag is an element of the <HEAD> tag as it cannot occur outside the document header.

HTML and SGML

HTML is defined using the SGML (Standard Generalized Markup Language) meta-language. SGML not only defines the tags, attributes and their allowed values, and the character entities, but also where these items can occur in a document; thus SGML defines the structure of the document. All of this information is defined in a document called a document type definition (DTD). Meant to be human readable, DTDs are often too dense and abstruse to be digested in one or two readings (far better to find a book like this that tells you what tags to use where!). Since HTML is defined using SGML, it is often referred to as SGML. In fact, it is more precisely an application of SGML. But you will be forgiven if you persist in generically referring to it as SGML.

SGML utilizes some standard terminology for describing tags and document structure. The places where a tag may be used in a document are, in SGML parlance, its permitted context (Figure 2-1). For example, the permitted context of a title tag is the header of a document. Paragraph and list tags are examples of HTML tags that have multiple permitted contexts, for they may be used anywhere in the body of a document, whether inside an HTML form or on their own:

  <BODY>
  <P>HTML markup is easy
  <DIV CLASS="abstract">
  <P>It only takes a few hours to begin creating documents in HTML. HTML is the
  best tool currently available for communicating information on the Internet.  This
  guide aims to show you the language and how to use it effectively.
  </DIV>
  <H1>HTML Tags</H1>

In this example, the permitted context of the HTML 3 <DIV> tag is anywhere within the body of a document, that is, between the <BODY> start and end tags. Using a <DIV> tag in the header section is not allowed according to this rule. Elements and permitted context are synonymous.
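
One way to picture permitted context is as a lookup table from each tag to the tags it may appear inside. The table below is a hypothetical and heavily simplified sketch written in Python; it is not taken from the HTML DTD, and both PERMITTED_CONTEXT and allowed_in are names invented for this example.

```python
# Hypothetical, simplified table: tag -> the parent tags it may appear inside.
# The real rules live in the HTML DTD and cover many more tags.
PERMITTED_CONTEXT = {
    "head": {"html"},
    "body": {"html"},
    "title": {"head"},
    "div": {"body", "div"},
    "h1": {"body", "div"},
    "p": {"body", "div"},
}


def allowed_in(tag, parent):
    """Return True if this simplified table permits tag directly inside parent."""
    return parent.lower() in PERMITTED_CONTEXT.get(tag.lower(), set())


print(allowed_in("DIV", "BODY"))    # True
print(allowed_in("DIV", "HEAD"))    # False
print(allowed_in("TITLE", "HEAD"))  # True
print(allowed_in("TITLE", "BODY"))  # False
```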

Figure 2-1 Permitted Context and Content Model

Tags also have a content model (Figure 2-1) which specifies what types of tags and data may appear within them. A paragraph's content model is text, which means document text (ASCII characters and character entities) and textual and character markup (such as bold and italics tags) may occur between a paragraph start and end tag:

  <P>
  The <B>header</B> of a document is a separate structure from the <I>body</I>.
  The header &amp; body are the content model for the &lt;HTML&gt; tag.
  </P>

Let's review what we've learned so far. Tags are text containers that come in two flavors: generalized tags that record what kind of text is tagged (heading, title, body) and specific tags that tell the Web browser how a piece of text should look when it is displayed. Attributes refine tags. Character entities let you safely include non-ASCII characters and characters that normally make up tags in a document. Putting the pieces together, HTML markup consists of tags, attributes and character entities that you (the author) use to encode the structure and some basic formatting information for presenting a document on the World Wide Web. Now on to the tags!
