doxml Manual Version 0.5: XML Syntax

$Id: syntax.html,v 1.16 1999/07/24 22:04:05 francis Exp $

This is a brief introduction to XML syntax.

Introduction

XML, like HTML, is an application of SGML, the Standard Generalized Markup Language. SGML has been around since 1986; it was designed to be a common syntax for defining markup languages. The theory was that a language could be defined in terms of a DTD (Document Type Definition), and any SGML parser could process any SGML document, given its DTD. SGML parsers could then be used to build applications.

Like many standards, SGML grew more and more complex over the years, until a complete SGML implementation was a massive undertaking. So incomplete implementations proliferated. Commonly available SGML parsers did not implement useful features of newer versions of SGML, so that someone defining a new SGML-based language had to choose between convenient features and deployability. In addition, experience with HTML showed that there were difficulties in using SGML to define a language that could be extended over time. Consider the <br> tag; suppose that it were not defined in HTML 1.0, and you wanted to add it to HTML 2.0, without confusing HTML 1.0 applications. How would an HTML 1.0 application know that <br> is not supposed to be terminated by </br>? But this sort of extensibility is exactly what is needed in a standard language.

So the W3C started an effort to define XML, the Extensible Markup Language. XML documents are SGML documents, and can be processed by SGML parsers which implement the appropriate SGML features; but not all SGML features are legal in XML. For one thing, all XML tags are terminated explicitly. Either the start tag is paired with an end tag (as in <a href="foo.html">foo</a>), or the start tag ends with "/>" to indicate that it is self-contained (as in <a name="bar"/>).

Lots more information on XML.

Documents

An XML document is a single element, called the root element, optionally preceded by a sequence of processing instructions, called the prolog.

XML Declaration

An XML document should include an XML Declaration, which appears syntactically as a processing instruction whose name is "xml", and whose value is formatted like a string of attributes. Three such pseudo-attributes are defined at this time: version, encoding, and standalone. version should always be 1.0 (for the present); encoding specifies the character encoding (not that doxml can handle anything other than UTF-8 or a subset thereof); and standalone indicates whether there are any markup declarations (see Document Type Declaration) external to the document. An example would be:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

If the XML Declaration is present, it must be at the very start of the document: the < that opens it must be the first byte. This is so that the XML Declaration can be used as a quick means of identifying XML documents in the absence of anything like a MIME Content-Type. (For example, the Unix file command, which inspects the start of a file to determine its type, can easily be configured to look for the XML Declaration.)
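The "first byte" rule is what makes this kind of sniffing cheap. As a minimal sketch in Python (not part of doxml; the function name looks_like_xml is made up for this illustration, and the check assumes an ASCII-compatible encoding such as UTF-8 with no byte order mark):

```python
def looks_like_xml(data: bytes) -> bool:
    """Return True if the data opens with an XML Declaration.

    Assumption: UTF-8 or another ASCII-compatible encoding, no byte
    order mark, so "<?xml" plus whitespace are literally the first bytes.
    """
    # The PI name must be exactly "xml", so the next byte has to be
    # whitespace; this rules out names such as "xml-stylesheet".
    return data[:5] == b"<?xml" and data[5:6] in (b" ", b"\t", b"\r", b"\n")


if __name__ == "__main__":
    print(looks_like_xml(b'<?xml version="1.0"?><doc/>'))
    print(looks_like_xml(b'  <?xml version="1.0"?><doc/>'))
```

Leading whitespace defeats the check, which mirrors the rule that nothing may come before the declaration.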

Document Type Declaration

An XML document may contain a Document Type Declaration, which appears as a pseudoelement whose name is "!DOCTYPE". An example:

<!DOCTYPE Test PUBLIC "PublicTest" 'test:"publicTest"'>

The details of the Document Type Declaration are pretty hairy, so I'm going to put off documenting it right now. For now, please see the spec.

Elements

An XML element is a piece of markup data. Its syntax takes one of two forms:

<[namespace:]tag attr*>xml-text</[namespace:]tag>
<[namespace:]tag attr*/>

...where attr* means "zero or more attributes", [foo] means foo is optional, and xml-text is simply any sequence of XML elements interspersed with literal text. Tags and namespaces can contain pretty much any character that would not be ambiguous (for an exact list, see the spec).
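To see that the paired form and the self-contained form describe the same element, and that a tag which is never terminated is rejected, here is a small sketch using Python's standard xml.etree.ElementTree parser. It stands in for doxml here; the helper names same_element and is_well_formed are invented for the example.

```python
import xml.etree.ElementTree as ET

# Form 1: a start tag paired with an end tag (no content in between).
paired = ET.fromstring('<a name="bar"></a>')
# Form 2: a self-contained tag ending in "/>".
empty = ET.fromstring('<a name="bar"/>')


def same_element(x, y):
    """Compare tag name, attributes, and text (treating None as empty)."""
    return (x.tag, x.attrib, x.text or "") == (y.tag, y.attrib, y.text or "")


def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(same_element(paired, empty))
    print(is_well_formed("<p>one<br/>two</p>"))
    print(is_well_formed("<p>one<br>two</p>"))  # HTML-style <br> is never closed
```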

Attributes

Attributes take the form name="value". Unlike in HTML, all attribute values must be enclosed in quotes (single or double; it makes no difference).
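A quick way to check the quoting rule is to run a few attribute spellings through a parser. The sketch below uses Python's xml.etree.ElementTree rather than doxml, and the helper name parses is made up for the example.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are interchangeable.
single = ET.fromstring("<a href='foo.html'/>")
double = ET.fromstring('<a href="foo.html"/>')

# The other kind of quote may appear unescaped inside the value.
nested = ET.fromstring("<a title='say \"hi\"'/>")


def parses(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(single.attrib == double.attrib)
    print(nested.attrib["title"])
    print(parses("<a href=foo.html/>"))  # unquoted value, as HTML allows
```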

Namespaces

Namespaces are an extension to XML. The core XML spec does not include any sort of structure on the names used for tags and attributes, which means that, while everyone is free to define their own XML tags, there is always a risk that two people will define the same tag to mean different things, because the namespace is flat.

XML namespaces are a way of carving up the flat namespace and letting people define their own sets of names independently, without breaking compatibility with XML 1.0. A namespace is identified by a URI, so anybody can define their own, either by basing it on the URL of their Web site or by generating a UUID and using the uuid: URI scheme.

Of course, it would be highly inefficient to somehow concatenate a long URI onto every name in a document. So, instead, the approach is to define a short prefix that can be used in place of the URI. For example:

<s:ellipse xmlns:s="http://www.example.com/shapes/">
<width s:units="inch">17</width>
<height units="cm">25</height>
<text>
<html xmlns="http://www.w3.org/TR/REC-html40">
<h1>Ellipse</h1>
</html>
</text>
</s:ellipse>

In this example, the s:ellipse tag is mapped to the ellipse tag from the namespace http://www.example.com/shapes/. The width, height, and text tags carry no prefix, and no default namespace is in scope where they appear, so under the Namespaces in XML Recommendation they belong to no namespace at all; they do not inherit the namespace of the ellipse tag that contains them. The html tag declares http://www.w3.org/TR/REC-html40 (the namespace for HTML-in-XML) as the default namespace by means of the xmlns attribute, and the unprefixed h1 tag inside picks up that default.

Attributes can also have namespaces associated. In this example, the width tag has a units attribute which is specified, through the s: prefix, to be from the shapes namespace. The units attribute on the height tag has no prefix; unprefixed attributes never take a default namespace, so it is in no namespace and is simply interpreted in the context of its element.
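One way to check how a particular parser resolves these names is to feed it the example. The sketch below uses Python's xml.etree.ElementTree, not doxml, so doxml's behavior may differ. ElementTree follows the Namespaces in XML Recommendation: a prefixed name takes the namespace bound to its prefix, an unprefixed element takes the default namespace declared with xmlns if one is in scope, and an unprefixed attribute is in no namespace. It reports resolved names in the form {uri}local.

```python
import xml.etree.ElementTree as ET

DOC = """\
<s:ellipse xmlns:s="http://www.example.com/shapes/">
<width s:units="inch">17</width>
<height units="cm">25</height>
<text>
<html xmlns="http://www.w3.org/TR/REC-html40">
<h1>Ellipse</h1>
</html>
</text>
</s:ellipse>
"""

root = ET.fromstring(DOC)

# Element names in document order, with namespaces resolved.
tags = [el.tag for el in root.iter()]

# The unprefixed children are found by their bare names.
width = root.find("width")
height = root.find("height")

if __name__ == "__main__":
    for tag in tags:
        print(tag)
    print(width.attrib)
    print(height.attrib)
```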

Processing Instructions

A processing instruction (PI) is a piece of out-of-band information embedded in the XML document; it provides extra hints to the application. Why this is necessary is not actually clear to me; I suspect it was something that some members of the working group wanted, and the rest could not come up with a strong reason against. A PI takes the form:

<?name value?>

where name is just like an element name, and value is any string which does not contain the substring "?>". Note that name may not start with "xml" (case-insensitive), except for the XML declaration at the start of the document.
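As an illustration of how a parser hands a PI to the application, the sketch below uses Python's xml.dom.minidom (not doxml), which exposes the name as target and the value as data. The PI name "hint" and its value are invented for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString('<doc><?hint cache="no"?></doc>')
pi = doc.documentElement.firstChild  # the PI is the only child of <doc>


def accepts(text):
    """Return True if minidom parses the text without error."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


if __name__ == "__main__":
    print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
    print(pi.target, "/", pi.data)
    # The reserved name "xml" is refused anywhere but the document start.
    print(accepts('<doc><?xml version="1.0"?></doc>'))
```

Note that the value is delivered as one opaque string; even when it looks like attributes, splitting it up is left to the application.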

Text

Text is pretty much as you'd expect: strings of characters (other than < and &, which introduce markup and entities respectively). The only complications are entities (which, if you don't use a Document Type Declaration, are almost exactly as in HTML) and CDATA blocks (which are optional).

Entities

An entity starts with an ampersand ('&') and ends with a semicolon (';'). It is either a parsed entity (one which gets expanded into text) or an unparsed entity (one which gets reported to the application as a special atom). The following parsed entities are predefined by the XML standard:

Entity    Text value
&amp;     &
&lt;      <
&gt;      >
&apos;    '
&quot;    "

The XML spec also defines a type of parsed entity called a character reference: much as in HTML, you can write an expression such as &#12345; to mean the character number 12345 (decimal), or &#x3F0; to mean the character number $3F0 ($ for hexadecimal).

All other entities must be declared in the Document Type Declaration.
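The rules above can be checked against a parser. This sketch uses Python's xml.etree.ElementTree as a stand-in for doxml; the helper name parses is made up, and &nbsp; is used only as a familiar HTML entity that XML does not predefine.

```python
import xml.etree.ElementTree as ET

# The five predefined entities expand to single characters.
predefined = ET.fromstring("<t>&amp; &lt; &gt; &apos; &quot;</t>").text

# Character references: decimal 12345 and hexadecimal 3F0.
numeric = ET.fromstring("<t>&#12345; &#x3F0;</t>").text


def parses(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(predefined)
    print([hex(ord(ch)) for ch in numeric if ch != " "])
    # Without a declaration in a Document Type Declaration, this fails.
    print(parses("<t>&nbsp;</t>"))
```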

CDATA blocks

A CDATA block is simply text which is escaped so as not to be interpreted as XML; it can safely contain characters that look like XML markup. A CDATA block takes the form:

<![CDATA[text]]>

Obviously, text cannot contain the string "]]>". CDATA blocks are simply a convenience; the same results could be obtained via the use of &lt;, &gt;, etc. (and, in fact, doxml_write() uses such entities on output, even if the input was in CDATA form).
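The equivalence between CDATA and entity escaping can be seen with any parser that discards the distinction. The sketch below uses Python's xml.etree.ElementTree, which happens to behave the way the text describes for doxml_write(): it also writes entities on output. It is an illustration only and does not call doxml.

```python
import xml.etree.ElementTree as ET

# The same text, once in a CDATA block and once with entities.
from_cdata = ET.fromstring("<t><![CDATA[if (a < b && c > d)]]></t>")
from_entities = ET.fromstring("<t>if (a &lt; b &amp;&amp; c &gt; d)</t>")

# Serializing the CDATA version yields entities, not a CDATA block.
serialized = ET.tostring(from_cdata, encoding="unicode")

if __name__ == "__main__":
    print(from_cdata.text == from_entities.text)
    print(serialized)
```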

Comments

A comment is text which is ignored; it never shows up at all in the data structures assembled by the parser. A comment takes the form:

<!--text-->

Obviously, text cannot contain the string "-->". In fact, the XML spec specifies that text cannot contain the string "--" at all. I'm not sure why this is the case, but doxml enforces it.
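Both points, that comments vanish from the parsed data and that "--" is refused inside them, can be demonstrated with Python's xml.etree.ElementTree. This is a stand-in for doxml, using the parser's default settings; the helper name parses is invented for the example.

```python
import xml.etree.ElementTree as ET

# By default the comment leaves no node behind; only the text survives.
elem = ET.fromstring("<t>a<!-- ignored -->b</t>")
text = "".join(elem.itertext())


def parses(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(text, len(elem))
    print(parses("<t><!-- a - b --></t>"))   # a single hyphen is fine
    print(parses("<t><!-- a -- b --></t>"))  # "--" inside the comment
```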