Introduction to XPath
by K. Yue
1. Introduction
Basics:
- XPath is used to address parts of an XML document.
- XPath is a W3C recommendation.
- The newest version is 3.1, which is largely backward compatible with earlier versions.
- XPath is used by XPointer, XSLT and XQuery.
- XPath is designed to select existing nodes, not to create new ones.
- It is designed to be embedded in a host language, such as XSLT or XQuery.
- XQuery is a superset of XPath.
- The result is a sequence, which is an ordered list of items.
- An item is either a node or an atomic value.
- There are 7 node types:
- Document: represents an entire XML document
- Element
- Attribute
- Comment
- Text
- Processing Instruction
- Namespace
2. Path Expression
- XPath uses path expressions, called location paths, to address parts of a document.
- A location path is composed of a sequence of location steps, separated by a '/'.
- A location path can be absolute or relative.
- An absolute location path starts with '/', which denotes the document root.
- A relative location path does not start with '/'. It is evaluated relative to a context node.
- This is similar to absolute and relative paths in the Unix directory system.
Example:
Consider film.xml (with data extracted from Sakila).
//films/film
- The XPath expression selects all film elements whose parent is a films element, anywhere in the XML document.
- It is an absolute location path.
- In Editix, use "View > Windows > XPath View" to execute XPath expressions. You should select XPath 2.0 instead of XPath 1.0.
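The role of the context node in a relative path can also be tried outside Editix. The following is a minimal sketch in Python, assuming a small made-up document (the titles, names and ids are invented, and the layout of film.xml is guessed from the expressions in these notes). It uses the standard xml.etree.ElementTree module, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""

films_elem = ET.fromstring(SAMPLE)  # the <films> document element

# Relative path "film", evaluated with <films> as the context node.
films = films_elem.findall("film")

# The same relative path "actor" gives different results
# for different context nodes.
actors_of_first = [a.text for a in films[0].findall("actor")]
actors_of_second = [a.text for a in films[1].findall("actor")]

print(len(films))          # 3
print(actors_of_first)     # ['ANN LEE', 'BOB RAY', 'AL KIM']
print(actors_of_second)    # ['AL KIM', 'CY DEE']
```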
Location step requirements:
- A location step is composed of three parts:
- a node axis (required in the unabbreviated syntax): the direction of navigation from the context node,
- a node test (required): the name or kind of the nodes to select, and
- zero or more node predicates (optional): additional conditions a node must satisfy to be included.
Example:
//films/child::film[actor/@id='162']
Consider the location step:
child::film[actor/@id='162']
- Node axis: child
- Node test: film
- Node predicate: [actor/@id='162']
The XPath expression lists all <film> elements that:
- are a child node of a <films> element in the document, and
- have a child <actor> which has an attribute id with the value of '162'.
Note that actor/@id is a relative path, evaluated relative to the context node, which here is the film node being tested.
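The three parts of the step child::film[actor/@id='162'] can be sketched as three filtering stages. This is an illustrative emulation in Python, not how an XPath engine is implemented; the helper name child_film_with_actor and the sample data are invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""


def child_film_with_actor(context, actor_id):
    """Emulate child::film[actor/@id=actor_id] for one context node."""
    # Axis "child": the direct children of the context node.
    on_axis = list(context)
    # Node test "film": keep only elements named film.
    passed_test = [n for n in on_axis if n.tag == "film"]
    # Predicate: actor/@id is evaluated relative to each candidate film.
    return [
        n for n in passed_test
        if any(a.get("id") == actor_id for a in n.findall("actor"))
    ]


films_elem = ET.fromstring(SAMPLE)
titles = [f.findtext("title") for f in child_film_with_actor(films_elem, "162")]
print(titles)  # ['ALPHA DAWN', 'BRAVO NIGHT']
```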
Node Axes:
- An axis is the first part of the location step and is followed by :: before the node test and predicates.
- It indicates the direction to go for the next location step.
- There are 13 axes. Note the classification of axes into 'forward axis' and 'reverse axis'.
- In general, forward axes are preferable.
From XPath 3.0:
[40] ForwardAxis ::= ("child" "::")
| ("descendant" "::")
| ("attribute" "::")
| ("self" "::")
| ("descendant-or-self" "::")
| ("following-sibling" "::")
| ("following" "::")
| ("namespace" "::")
[43] ReverseAxis ::= ("parent" "::")
| ("ancestor" "::")
| ("preceding-sibling" "::")
| ("preceding" "::")
| ("ancestor-or-self" "::")
- The default axis is child.
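To make the forward and reverse directions concrete, a few axes can be emulated on an ElementTree document. This is only a sketch under assumptions: the function names and sample data are invented, the document node above the root element is not modelled, and only element nodes are considered.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)

# ElementTree elements have no parent pointer, so the reverse axes
# need a child-to-parent lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}


def child_axis(node):
    """Forward axis: direct child elements."""
    return list(node)


def descendant_axis(node):
    """Forward axis: all elements below the node, in document order."""
    return [n for n in node.iter() if n is not node]


def following_sibling_axis(node):
    """Forward axis: later children of the same parent."""
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def ancestor_axis(node):
    """Reverse axis: parent, grandparent, and so on (nearest first)."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found


first_actor = root.find("film/actor")
print([n.tag for n in ancestor_axis(first_actor)])                 # ['film', 'films']
print([n.get("id") for n in following_sibling_axis(first_actor)])  # ['20', '162']
print(len(child_axis(root)), len(descendant_axis(root)))           # 3 13
```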
Node test:
- Node test is the second part of a location step.
- It is required.
- There are two kinds of node tests:
- NameTest: the name of an element or attribute node.
- * is a wildcard character matching any name. It is a name test.
- NodeType test (called KindTest in XPath 2.0 and later):
- node(): any node. On the child axis this covers elements, text, comments and PIs, but not attributes or the document root.
- text(): a text node
- comment(): a comment node
- processing-instruction('pi-name'): a processing instruction with the given name
Node Predicate:
- Predicate tests are the last part of a location step.
- They are enclosed by [] and are optional.
- A location step may have more than one predicate.
- XPath expressions and built-in functions can be used to build the predicate's boolean expression, which acts as an additional condition for inclusion.
- The boolean operators and and or can be used, as well as the function not().
Shorthand:
- . is the shorthand for self::node()
- .. is the shorthand for parent::node().
- // is the shorthand for /descendant-or-self::node()/
- @ is the shorthand for attribute::
Example:
//text()
or
/descendant::*/text()
Both list all text nodes.
Note the difference between
//actor/@id[.='20']
//actor[@id='20']
- The first expression returns a sequence of id attributes.
- The second expression returns a sequence of actor elements.
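Some of the shorthand forms, and the difference between selecting elements and selecting attributes, can be tried with Python's xml.etree.ElementTree. This is a sketch on invented sample data. ElementTree implements only a subset of XPath: a path starting with // must be written as .// when evaluated from an element, and attribute nodes cannot be returned, so the attribute values are read with get() instead.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)

# "//" : actor elements at any depth below the context node.
all_actors = root.findall(".//actor")

# "@" : like //actor[@id='20'], the result is a list of actor ELEMENTS.
actors_20 = root.findall(".//actor[@id='20']")

# Like //actor/@id[.='20'], the result is the id ATTRIBUTE values.
ids_20 = [a.get("id") for a in all_actors if a.get("id") == "20"]

# ".." : the parent of each selected actor (each parent is listed once).
parents_20 = root.findall(".//actor[@id='20']/..")

print(len(all_actors))                           # 7
print([a.text for a in actors_20])               # ['BOB RAY', 'BOB RAY']
print(ids_20)                                    # ['20', '20']
print([p.findtext("title") for p in parents_20]) # ['ALPHA DAWN', 'AMBER ROAD']
```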
//film/actor [position()=2]
or
//film/actor[2]
- position() returns the position of the context item within the sequence being filtered, so [2] is shorthand for [position()=2].
- Thus, this returns the second child <actor> element of each <film> element.
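A positional predicate counts within each parent, not across the whole document. The sketch below, on invented sample data, contrasts //film/actor[2] with taking the second actor overall, which in XPath would be written (//film/actor)[2]. ElementTree supports the [2] form directly.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)

# //film/actor[2]: the second actor child of EACH film.
second_per_film = [a.text for a in root.findall("./film/actor[2]")]

# (//film/actor)[2]: the second actor in the whole document.
second_overall = root.findall("./film/actor")[1].text

print(second_per_film)  # ['BOB RAY', 'CY DEE', 'ANN LEE']
print(second_overall)   # BOB RAY
```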
3. Sequence Expressions
- In XPath 1.0, node set is the main data type of the returned result.
- In XPath 2.0, sequence is the main data type of the returned result.
- A sequence is an ordered heterogeneous collection of items.
- An item can be a node or an atomic value, but not a sequence.
- A sequence may contain duplicate atomic values or nodes.
Example:
(1, 5 to 8, "Bun Yue", 2.1)
(1+2, 5)
(1 to 50)[. mod 3 = 1]
//film/* | //film
(1, 2, (3, (4, 5))) is (1,2,3,4,5)
- XPath 2.0 results are sequences. A single atomic value or node is identical to a sequence containing just that item.
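The flattening rule and the range and filter examples above can be mimicked with Python lists. This is only an analogy: the helper names to and flatten are invented here, and Python's own lists do nest, which is why the flattening has to be done explicitly.

```python
def to(first, last):
    """Emulate the XPath range expression 'first to last' (inclusive)."""
    return tuple(range(first, last + 1))


def flatten(items):
    """Flatten nested tuples/lists, as XPath does for nested sequences."""
    flat = []
    for item in items:
        if isinstance(item, (tuple, list)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# (1, 5 to 8, "Bun Yue", 2.1)
seq1 = flatten((1, to(5, 8), "Bun Yue", 2.1))

# (1, 2, (3, (4, 5)))
seq2 = flatten((1, 2, (3, (4, 5))))

# (1 to 50)[. mod 3 = 1]
seq3 = [n for n in to(1, 50) if n % 3 == 1]

print(seq1)      # [1, 5, 6, 7, 8, 'Bun Yue', 2.1]
print(seq2)      # [1, 2, 3, 4, 5]
print(seq3[:4])  # [1, 4, 7, 10]
```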
4. Other Expressions
- XPath 2.0 supports many types of expressions not supported in XPath 1.0.
Primary expressions
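- These include literals (constants), variable references, function calls, parenthesized expressions, and the context item expression (i.e.: .).
- There are many built-in functions.
- Built-in functions belong to a namespace conventionally bound to the prefix fn. The prefix can usually be omitted.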
- See http://www.w3.org/TR/xpath-functions/.
Example:
//film[count(actor)>=10]
or
//film[count(./actor)>=10]
returns all <film> nodes with 10 or more actors.
//film/actor[position()=2]
//film/actor[fn:position()=2]
or
//film/actor[2]
- All three forms are equivalent: position() gives the position of the context item, and fn is the optional prefix for built-in functions.
- Each returns the second child <actor> element of each <film> element.
//film[starts-with(title/text(),'A')]
gives all <film> elements whose titles start with 'A'.
distinct-values(//film/actor[starts-with(text(),'A')])
gives a sequence of the distinct actor names starting with 'A'.
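The functions used above have close Python counterparts, sketched below on invented sample data: count() corresponds to len(), starts-with() to str.startswith(), and distinct-values() to de-duplication. The threshold is lowered from 10 to 3 to suit the tiny sample. Note that this sketch keeps first-occurrence order, whereas the order returned by distinct-values() is implementation dependent.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)
films = root.findall("./film")

# //film[count(actor) >= 3]
big_casts = [f.findtext("title") for f in films if len(f.findall("actor")) >= 3]

# //film[starts-with(title/text(), 'A')]
a_titles = [
    f.findtext("title") for f in films
    if (f.findtext("title") or "").startswith("A")
]

# distinct-values(//film/actor[starts-with(text(), 'A')])
a_names = list(dict.fromkeys(
    a.text for a in root.findall("./film/actor")
    if (a.text or "").startswith("A")
))

print(big_casts)  # ['ALPHA DAWN']
print(a_titles)   # ['ALPHA DAWN', 'AMBER ROAD']
print(a_names)    # ['ANN LEE', 'AL KIM']
```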
Arithmetic, Comparison and Logical Expressions
- These are similar to the operators found in other languages.
- However, XML Schema data types differ considerably from the types of most other languages, so be careful with comparisons and arithmetic.
For Expression and Variable Binding:
- The let expression (added in XPath 3.0) defines one or more variables, which can then be used in the expression returned.
- Format: let $var1 := (expression-1), $var2 := (expression-2), ... return (expression)
- This allows an expression to be bound to a variable once and used several times.
- This is especially useful in XQuery.
- The for expression binds one or more variables to each item of a sequence expression in turn, and evaluates the returned expression once per binding.
Example:
for $film in (//film) return $film/actor
is the same as:
//film/actor
- There is actually no need to use the for expression in the example above.
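The equivalence between the for expression and the plain path can be checked with a Python comprehension, which iterates and concatenates results in much the same way. This is a sketch on invented sample data using xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)

# for $film in (//film) return $film/actor
via_for = [actor for film in root.iter("film") for actor in film.findall("actor")]

# //film/actor
via_path = root.findall(".//film/actor")

# Both hold the same element objects in the same order.
print(via_for == via_path)        # True
print([a.text for a in via_for])  # ['ANN LEE', 'BOB RAY', 'AL KIM', 'CY DEE']
```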
List the names of all actors who appeared in more than 35 films by using the for expression in XPath:
fn:distinct-values(//film/actor[for $a in . return count(//film/actor[@id = $a/@id]) > 35]/text())
This can be slow, since the inner count() may scan every actor in the document once for each actor tested.
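The cost of the nested count can be seen by writing the same logic in Python. The first function below mirrors the XPath expression with one full scan per actor; the second counts all ids once and then filters. Both are sketches on invented sample data, with the threshold lowered from 35 to 1, and both assume an actor is listed at most once per film, so that counting actor elements equals counting films.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>AMBER ROAD</title>
    <actor id="20">BOB RAY</actor>
    <actor id="4">ANN LEE</actor>
  </film>
</films>
"""


def names_nested_scan(root, threshold):
    """Mirror the XPath query: rescan all actors for each actor."""
    actors = root.findall("./film/actor")
    names = []
    for a in actors:
        appearances = sum(1 for b in actors if b.get("id") == a.get("id"))
        if appearances > threshold:
            names.append(a.text)
    return list(dict.fromkeys(names))  # distinct, first occurrence first


def names_single_pass(root, threshold):
    """Count every id once, then filter."""
    actors = root.findall("./film/actor")
    counts = Counter(a.get("id") for a in actors)
    return list(dict.fromkeys(
        a.text for a in actors if counts[a.get("id")] > threshold
    ))


root = ET.fromstring(SAMPLE)
print(names_nested_scan(root, 1))  # ['ANN LEE', 'BOB RAY', 'AL KIM']
print(names_single_pass(root, 1))  # ['ANN LEE', 'BOB RAY', 'AL KIM']
```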
Conditional Expressions:
Example:
if (//film[title/text()='ADAPTATION HOLES']) then 'found Holes' else 'no Holes'
Quantified Expressions
- The existential (some) and universal (every) quantifiers.
Example:
All film elements with at least one actor whose id is 4 or less:
//film[some $a in actor satisfies $a/@id <= 4]
same as:
//film[actor/@id <= 4]
All film elements whose actors all have an id > 150 (a film with no actors also qualifies):
//film[every $a in actor satisfies $a/@id > 150]
For filmActor.xml:
//film[every $a in //film[@id=937]/actorIds/actorId/@actorId satisfies ./actorIds/actorId/@actorId = $a]
returns all film elements whose cast includes every actor who appeared in the film with id 937.
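The some and every quantifiers behave like Python's any() and all(). The sketch below applies them to invented sample data in the film.xml layout assumed from the earlier expressions; the film with no actors is included to show that every, like all(), is true for an empty sequence.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real film.xml may be structured differently.
SAMPLE = """
<films>
  <film>
    <title>ALPHA DAWN</title>
    <actor id="4">ANN LEE</actor>
    <actor id="20">BOB RAY</actor>
    <actor id="162">AL KIM</actor>
  </film>
  <film>
    <title>BRAVO NIGHT</title>
    <actor id="162">AL KIM</actor>
    <actor id="200">CY DEE</actor>
  </film>
  <film>
    <title>EMPTY CAST</title>
  </film>
</films>
"""

root = ET.fromstring(SAMPLE)
films = root.findall("./film")


def actor_ids(film):
    """Numeric ids of the actor children of one film."""
    return [int(a.get("id")) for a in film.findall("actor")]


# //film[some $a in actor satisfies $a/@id <= 4]
some_le_4 = [
    f.findtext("title") for f in films
    if any(i <= 4 for i in actor_ids(f))
]

# //film[every $a in actor satisfies $a/@id > 150]
every_gt_150 = [
    f.findtext("title") for f in films
    if all(i > 150 for i in actor_ids(f))
]

print(some_le_4)     # ['ALPHA DAWN']
print(every_gt_150)  # ['BRAVO NIGHT', 'EMPTY CAST']
```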