An Introduction to XML
K. Yue copyright 2000
Created: November 22, 2000
Introduction
- XML stands for eXtensible Markup Language.
- XML is a system for defining, validating, and sharing document formats.
- Standard organization: World Wide Web Consortium (W3C): http://www.w3.org/XML/.
- XML is a simplified version of the Standard Generalized Markup Language
(SGML).
- Both XML and SGML may be considered as meta-languages for describing markup
languages.
- A valid XML document is also an SGML document.
- XML uses tag elements to distinguish document structures, and attributes
to describe their properties in a way similar to HTML.
- Unlike HTML, XML is extensible and authors can define and share their own
elements.
- By comparison, HTML has a fixed set of elements: <p>, <body>,
<li>, etc., but not <chapter>, <author> or <bun>.
Example:
<?xml version="1.0"?>
<memo priority="very high">
<from>Bun Yue</from>
<to>Everybody</to>
<body>
Hello, welcome!
</body>
</memo>
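As a minimal sketch of how a program might read the memo above, the following uses Python's standard xml.etree.ElementTree module. The variable names are illustrative assumptions and are not part of the original notes.

```python
import xml.etree.ElementTree as ET

# The memo document from the example above, unchanged.
MEMO = """<?xml version="1.0"?>
<memo priority="very high">
<from>Bun Yue</from>
<to>Everybody</to>
<body>
Hello, welcome!
</body>
</memo>"""

root = ET.fromstring(MEMO)

# Attributes are read from the element; child content is looked up by tag name.
priority = root.get("priority")
sender = root.findtext("from")
recipient = root.findtext("to")
body = root.findtext("body").strip()  # drop the surrounding newlines

print(root.tag, priority, sender, recipient, body)
```

The element names chosen by the author (memo, from, to, body) become the vocabulary the program queries.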
Comparing XML and HTML
XML | HTML
A meta-language for defining markup languages | A markup language
For data only | For data and display
Describes application structures; no intrinsic display instructions | Describes display instructions
Users define their own tags | Fixed set of tags
Web and non-Web | Web only
Used on both the client side and the server side | Used in clients (browsers)
Problems with HTML
- HTML provides information for display. It is difficult to extract the semantic
meaning of the data (HTML scraping).
Example:
In HTML:
<h2>Bun Yue</h2>
<h2>Introduction to XML</h2>
It is difficult for the HTML browser to get the meaning of the data "Bun
Yue" and "Introduction to XML".
- HTML elements do not provide meaning.
- With JavaScript embedded, traditional HTML files mix the following three
major components in one place:
- Data contents
- Presentation design
- Programming logic
- However, it may be desirable to have three different development teams
to create the three components.
- Result: difficult to create and maintain.
XML advantages
XML provides semantic meaning.
Example:
<author>Bun Yue</author>
<booktitle>Introduction to XML</booktitle>
- XML documents may be created by domain experts.
- XML assists in the separation of data contents, presentation design and
programming logic.
- XML is usually used for storing and transmitting data contents only.
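To illustrate the point about semantic meaning, here is a small sketch in Python that turns the author and book title elements into a labelled record. The <book> wrapper element is a hypothetical addition, needed only so that the fragment has a single root element.

```python
import xml.etree.ElementTree as ET

# <book> is a hypothetical wrapper; the two child elements come from the example.
doc = ET.fromstring(
    "<book>"
    "<author>Bun Yue</author>"
    "<booktitle>Introduction to XML</booktitle>"
    "</book>"
)

# Each element name says what its content means, so no guessing is required.
record = {child.tag: child.text for child in doc}
print(record)
```

With the HTML version, both strings would arrive as <h2> content, and a program could not tell which one is the author.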
Core XML Technologies
- XML Specification: XML syntax.
- XML Namespaces: qualifying element and attribute names to avoid naming conflicts.
- XLink: linking between XML documents.
- XPointer: referring to parts of an XML document.
- DTD (Document Type Definition): Specifying grammars of XML vocabularies.
- XML Schema: an improved, XML-based way to define grammars (a W3C Recommendation since 2001).
- DOM (Document Object Model): Object and interface collections for manipulating
XML documents.
- XSL (eXtensible Stylesheet Language): for displaying and transforming XML
documents.
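As one illustration of the DOM entry in the list above, the sketch below uses Python's standard xml.dom.minidom module to read and modify a small document. The <cc> element and the new text values are hypothetical, chosen only for this example.

```python
from xml.dom import minidom

# Parse a small document into a DOM tree.
dom = minidom.parseString('<memo priority="very high"><to>Everybody</to></memo>')
memo = dom.documentElement

# Navigate by tag name and change the text node of <to>.
to = memo.getElementsByTagName("to")[0]
to.firstChild.data = "Students"

# Create a new element (hypothetical <cc>) and attach it to the root.
cc = dom.createElement("cc")
cc.appendChild(dom.createTextNode("Bun Yue"))
memo.appendChild(cc)

result = memo.toxml()
print(result)
```

The same interfaces (createElement, appendChild, getElementsByTagName) are defined by the DOM for other languages as well, which is the point of a language-neutral object model.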
XML Syntax
- Similar to HTML, XML elements (tags) are the basic building blocks of an
XML document.
- XML places stricter syntactic requirements on tags than HTML does.
- Tags are case-sensitive.
Example:
HTML: <IMG ...> </img> ok.
XML/WML: must be <img ...> </img>.
- Every start tag of a non-empty element must have a corresponding end tag.
Example:
HTML: <p>first<p>second ok.
XML/WML: must be <p>first</p><p>second</p>.
- There is exactly one root element (in WML, the <wml> element).
Example:
HTML: the following file is ok:
<head>
</head><body>
</body>
XML: must have exactly one root element.
- An empty element uses the following format: <tag />.
Example:
HTML: an empty element simply has no end tag: <br>, <hr>.
XML/WML: <br/> or <br></br>.
- An attribute value must be enclosed by quotes.
Example:
HTML: <IMG SRC=abc.gif> ok.
XML/WML: must be <img src="abc.gif"/>.
- XML syntax is more restrictive than HTML:
- Less ambiguity; more portability.
- Easier to process.
- Easy generic parser implementation.
- Unlike HTML browsers, XML parsers usually preserve white space within
element content.
- White space within tags and attribute values may be normalized or removed.
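The syntax rules above can be checked with any conforming parser. The following sketch, assuming Python's standard xml.etree.ElementTree module, feeds it the HTML-style fragments from the examples and one corrected document. The is_well_formed helper and the <doc> wrapper element are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "mismatched case": "<IMG src='abc.gif'></img>",
    "missing end tag": "<doc><p>first<p>second</doc>",
    "two root elements": "<head></head><body></body>",
    "unquoted attribute": "<img src=abc.gif/>",
    "corrected": '<doc><p>first</p><p>second</p><br/><img src="abc.gif"/></doc>',
}
results = {name: is_well_formed(text) for name, text in samples.items()}

# White space inside element content is handed to the application unchanged.
padded = ET.fromstring("<p>  two  spaces  </p>").text

for name, ok in results.items():
    print(name, ok)
```

Only the corrected document is accepted; each of the other fragments violates one of the rules listed above.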
XML Entities
- XML entities are chunks of content that can be referred to by the syntax:
&name;
- Predefined entities are:
&lt;: <
&gt;: >
&amp;: &
&apos;: ' (apostrophe)
&quot;: " (quote)
Strings and Characters
String literals are enclosed by single or double quotes. They are used as attribute values.
Example:
"I'm fine."
'The quotes: "I&apos;m fine." is for you'
"Quotes using entity references: &quot;."
- Character data: text that is not marked up.
- Since < and & are markup delimiters, they must be written in character
data as the entity references &lt; and &amp;.
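When XML is generated by a program, the escaping described above is usually delegated to a library. The sketch below assumes Python's standard xml.sax.saxutils helpers escape and quoteattr. The <note> element and its text attribute are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Character data: escape() replaces &, < and > with entity references.
raw_text = "if a < b && c > d"
escaped_text = escape(raw_text)

# Attribute value: quoteattr() adds the enclosing quotes and escapes
# any quote character that would otherwise end the literal early.
raw_attr = 'The quotes: "I\'m fine." is for you'
quoted_attr = quoteattr(raw_attr)

# <note> and its text attribute are hypothetical names for this sketch.
document = "<note text=%s>%s</note>" % (quoted_attr, escaped_text)

# A parser turns the entity references back into the original characters.
note = ET.fromstring(document)
print(document)
```

Parsing the generated document returns the original strings, so the entity references are purely a transport encoding.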
CDATA
- XML allows CDATA sections for storing text not to be interpreted as markup.
- Syntax: <![CDATA[ ... ]]>.
Example:
<code>
<![CDATA[
<%
Response.write "<td>hi</td>"
%>
]]>
</code>
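To show how a parser treats the CDATA section above, here is a short sketch using Python's standard xml.etree.ElementTree module. It is an illustration only; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The CDATA example from above, unchanged.
doc = """<code>
<![CDATA[
<%
Response.write "<td>hi</td>"
%>
]]>
</code>"""

element = ET.fromstring(doc)

# The <td> markup inside the CDATA section arrives as plain text,
# so <code> has no child elements.
text = element.text
child_count = len(element)

# When written back out, the same text is escaped with entity references.
serialized = ET.tostring(element, encoding="unicode")
print(text)
```

Note that the parser does not record that the text came from a CDATA section, which is why the serialized form uses &lt; and &gt; instead.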
Well-Formedness
- All XML documents must be well-formed:
- The syntax must conform to the XML specification.
- If no DTD is provided, the document cannot declare its own entities;
only the five predefined ones can be referenced.
- If a data object is not well-formed, an XML parser processing it:
- must report the error.
- must not attempt to recover from the error and continue normal processing.
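The behaviour described above can be observed with a short sketch, again assuming Python's standard xml.etree.ElementTree module. The parse_or_report helper is a hypothetical name, and &copy; stands in for any entity that is not predefined.

```python
import xml.etree.ElementTree as ET

def parse_or_report(text):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, str(err)

# The five predefined entities need no DTD.
ok, _ = parse_or_report("<memo>&amp; &lt; &gt; &apos; &quot;</memo>")

# &copy; is not predefined and no DTD declares it, so parsing fails.
_, undefined = parse_or_report("<memo>&copy; 2000</memo>")

# A missing end tag is also an error; the parser returns no partial tree.
_, unclosed = parse_or_report("<memo><to>Everybody</memo>")

print(ok.text)
print(undefined)
print(unclosed)
```

In both failing cases the parser stops and reports the problem; it does not guess at what the author meant, as an HTML browser would.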