Basic XML Syntax and Well-Formedness
by K. Yue
August 2002
copyright 2002
1. Introduction
- XML has stricter syntax than HTML.
- to avoid ambiguity.
- to simplify processing.
- All XML documents must satisfy the XML syntax specification.
- XML documents must be well-formed.
- You can check for well-formedness by using my very simple XML
syntax checker.
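A well-formedness check can also be sketched in a few lines of Python. This is not the checker mentioned above, only a minimal illustration that assumes the standard library parser xml.etree.ElementTree; the function name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it finds.
        return False, str(err)


print(is_well_formed("<note><to>Yue</to></note>"))
print(is_well_formed("<note><to>Yue</note>"))
```

The second call fails because the <to> start tag has no matching end tag.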
2. Elements and Attributes
- Similar to HTML, XML elements (tags) are the basic building blocks of an
XML document.
- There are restrictive XML syntactic requirements on XML tags.
- Tags are case sensitive.
- Note that XML standards have restrictions on what can be used as names
for elements and attributes. W3C's definition: A Name is a token beginning
with a letter or one of a few punctuation characters, and continuing with
letters, digits, hyphens, underscores, colons, or full stops, together known
as name characters.
Example (case sensitivity):
HTML: <IMG ...> </img> OK.
XML: must be <img ...> </img>.
- Every start tag of a non-empty element must have a corresponding end tag.
Example:
HTML: <p>first<p>second OK
XML: must be <p>first</p><p>second</p>.
- There is one and only one root element. For example, the root element is
<wml> for WML and <xsl:stylesheet> for XSLT.
Example:
HTML: the following file is OK (missing <html> element):
<head>
</head><body>
</body>
XML: must have exactly one root element.
- An empty element uses the following format: <tag />, or <tag></tag>.
- It is better to use <tag /> for an empty element.
Example:
HTML: an empty element simply has no end tag: <br>, <hr>, etc.
XML: must use <br/> or <br></br>.
- An attribute value must be enclosed by quotes: single or double quotes.
Example:
HTML: <IMG SRC=abc.gif> OK
XML: must be <img src="abc.gif"/> or <img src='abc.gif'/>
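The rules illustrated so far (case-sensitive tags, required end tags, a single root element, empty-element syntax, quoted attribute values) can be tried out with a short Python sketch. It assumes the standard library parser xml.etree.ElementTree; the helper name parses and the list of cases are made up for this illustration, with snippets adapted from the examples above.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """True if the text is accepted by Python's (non-validating) XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# (snippet, description) pairs; each bad snippet is followed by its fix.
cases = [
    ('<IMG src="abc.gif"></img>', "tag case mismatch"),
    ('<img src="abc.gif"></img>', "matching case"),
    ("<body><p>first<p>second</body>", "missing end tags"),
    ("<body><p>first</p><p>second</p></body>", "every start tag closed"),
    ("<head></head><body></body>", "two root elements"),
    ("<html><head></head><body></body></html>", "single root element"),
    ("<p>line<br></p>", "empty element without end tag"),
    ("<p>line<br/></p>", "empty element written as <br/>"),
    ("<img src=abc.gif/>", "unquoted attribute value"),
    ("<img src='abc.gif'/>", "quoted attribute value"),
]

for snippet, description in cases:
    print(f"{parses(snippet)!s:5}  {description}: {snippet}")
```

Each odd-numbered case is acceptable to a lenient HTML browser but is rejected by the XML parser; the case after it shows the well-formed version.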
- XML syntax is more restrictive than HTML:
- Less ambiguity; more portability.
- Easier to process.
- Easy generic parser implementation.
- Unlike HTML, white space within XML content is usually preserved by
XML parsers.
- However, extra white space inside tags and attribute values may be
removed or normalized by XML parsers.
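The white space behavior can be observed with a small Python sketch. It assumes the standard library parser xml.etree.ElementTree, which is built on expat; other parsers may differ in detail, and the element and attribute names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# White space in element content is kept exactly as written.
elem = ET.fromstring("<msg>  two  spaces\n  and a new line  </msg>")
print(repr(elem.text))

# Extra white space inside the tags themselves is ignored, and tabs or
# line breaks inside an attribute value are normalized to single spaces.
elem = ET.fromstring('<img   src = "a\tb\nc"   ></img   >')
print(repr(elem.get("src")))
```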
3. XML Entities
- XML Entities are chunks of content that can be referred to by the syntax:
&name;
- Character entities refer to a single character.
- Predefined character entities are:
&lt;: <
&gt;: >
&amp;: &
&apos; (apostrophe): '
&quot; (quote): "
- XML does not allow < and & in the text. They must be represented by &lt;
and &amp; respectively.
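- Numeric character references have the form &#number; (decimal) or
&#xhex; (hexadecimal), where the number is the Unicode code point of the
character. The code point must be a character allowed in XML, so 0 is excluded
and the upper limit is 1,114,111 (hexadecimal 10FFFF).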
- Additional user-defined entities (internal or external)
can be declared in the Document Type Definition (DTD) used by the XML document.
Example:
If the named entity "yue" is defined as "Kwok-Bun Yue"
in the DTD used by the XML document, then every occurrence of "&yue;"
within the XML document will be replaced by "Kwok-Bun Yue"
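Entity replacement can be seen in action with a short Python sketch. It assumes the standard library parser xml.etree.ElementTree and declares the entity "yue" in an internal DTD subset; the element names t and note are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Predefined entities and numeric character references are always available.
elem = ET.fromstring("<t>&lt;p&gt; &amp; &apos;x&apos; &quot;y&quot; &#65;&#x42;</t>")
print(elem.text)

# An entity declared in the internal DTD subset is expanded by the parser.
doc = """<!DOCTYPE note [
  <!ENTITY yue "Kwok-Bun Yue">
]>
<note>Written by &yue;.</note>"""
print(ET.fromstring(doc).text)

# Without a declaration, the same reference is a well-formedness error.
try:
    ET.fromstring("<note>Written by &yue;.</note>")
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False
print(undefined_ok)
```

Note that a parser which expands entities from untrusted documents can be abused with deeply nested entity declarations, so this sketch is meant for trusted input only.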
4. Miscellaneous Markups
4.1 Comments
- Comments are enclosed by <!-- and -->
- The string "--" must not appear within a comment.
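The restriction on "--" can be checked with a brief Python sketch, assuming the standard library parser xml.etree.ElementTree; the helper name parses is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses("<a><!-- a single hyphen - is fine --></a>"))
print(parses("<a><!-- a double hyphen -- is not --></a>"))
```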
4.2 Strings and Characters
- String literals are enclosed by quotes.
- They are used as attribute values.
Example:
"I'm fine."
'The quote: "I&apos;m fine." is for you'
"Quote using entity references: &quot;."
- Character data: text that is not marked up.
- Since < and & are markup delimiters, they must be referred to by
using entity references.
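When XML is generated by a program, the escaping is usually left to a library. The sketch below assumes Python's standard helpers xml.sax.saxutils.escape and quoteattr; the element name p and attribute name note are made up for the illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Character data: < and & must become entity references.
text = escape("if a < b & b > c")
print(text)

# Attribute values: quoteattr picks the quote style and escapes as needed.
print(quoteattr("I'm fine."))
print(quoteattr('The quote: "I\'m fine." is for you'))

# Round trip: the parser turns the references back into the characters.
elem = ET.fromstring(f"<p note={quoteattr('a < b')}>{text}</p>")
print(elem.get("note"), "|", elem.text)
```

escape also converts > to &gt;, which XML permits but does not require in most positions.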
4.3 CDATA Sections
- XML allows CDATA sections for storing text that is not to be interpreted
as markup. Their content is not parsed for markup.
- Syntax: <![CDATA[ ... ]]>.
- Any data can be included inside CDATA sections, except the string "]]>".
- CDATA sections are sometimes easier to generate and read.
Example:
<code>
<![CDATA[
<%
Response.write "<td>hi</td>"
%>
]]>
</code>
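A parser hands the content of a CDATA section back as plain text. The sketch below feeds the example above to Python's standard library parser xml.etree.ElementTree (an assumption of this illustration) and shows that <td> is not treated as an element.

```python
import xml.etree.ElementTree as ET

doc = """<code>
<![CDATA[
<%
Response.write "<td>hi</td>"
%>
]]>
</code>"""

elem = ET.fromstring(doc)
print(len(elem))   # number of child elements found inside <code>
print(elem.text)   # the CDATA content, delivered as ordinary text
```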
5. Well-Formedness
- All XML documents must be well-formed:
- Syntax should conform to the XML specification.
- If no DTD is provided, there can be no references to user-defined entities.
Only the five predefined ones can be used.
- If an XML document is not well-formed, XML parsers:
- will report an error.
- will not attempt to recover from the error.
- Do not confuse the concept of "well-formedness" with "validity".
- All XML documents must be well-formed.
- An XML document may or may not be a valid XML document.
- Validation will be discussed further with DTD.
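The difference between the two concepts can be illustrated with a document that is well-formed but not valid. The sketch assumes Python's standard library parser xml.etree.ElementTree, which is non-validating; the DTD and element names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The DTD says <note> must contain exactly one <to> element, so this
# document is well-formed but NOT valid: it contains <from> instead.
doc = """<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Yue</from></note>"""

# ElementTree's parser is non-validating: it checks well-formedness only,
# so it accepts the document without complaint.
root = ET.fromstring(doc)
print(root.tag, [child.tag for child in root])
```

A validating parser given the same document would report that <from> is not allowed inside <note>.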
Dr. Kwok-Bun Yue
Professor, Computer Science and Computer Information Systems
Chair, Division of Computing and Mathematics
University of Houston-Clear Lake
2700 Bay Area Boulevard
Houston, TX 77058
yue@uhcl.edu
281-283-3864