XPath is a formal language for navigating and querying the elements and attributes of XML documents. Although the notation is used mainly within XSL and XQuery, it is also very useful for accessing and extracting data from a DOM. With XPath, both XML and HTML/XHTML documents can be queried once they have been parsed into a DOM tree.
What is XPath?
The XML Path Language (XPath) is a set of syntax and semantics for referring to portions of XML documents. XPath is intended to be used by other specifications, such as XSL Transformations (XSLT), the XML Pointer Language (XPointer) and XML Query (XQuery). An XPath expression identifies a node or a set of nodes in an XML/HTML document for further processing.
In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. XML documents are treated as trees of nodes. The topmost element of the tree is called the root element. An expression that selects nodes returns a node-set, which can contain zero or more nodes.
Look at the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<list>
  <item new="true">
    <title lang="en">DVD Music</title>
    <author>K. A. Bred</author>
    <price currency="USD">29.99</price>
  </item>
</list>
Example of nodes in the XML document above:
<list> (root element node)
<author>K. A. Bred</author> (element node)
lang="en", new='true' (attribute nodes)
Atomic values are nodes with no children or parent. Items are atomic values or nodes.
Parent. Each element and attribute has one parent. In the example above the item element is the parent of the title, author, year, and price.
Children. Element nodes may have zero, one or more children. In the example above, the title, author, year and price elements are all children of the item element.
Siblings are nodes that have the same parent. The title, author, year, and price nodes are all siblings.
Ancestors are a node’s parent, parent’s parent, etc. In the example above, the ancestors of the title element are the item element and the list element.
Descendants are a node’s children, children’s children, etc. In the example, the descendants of the list element are the item, title, author, year, and price nodes.
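The relationships described above can be explored with Python's standard xml.etree.ElementTree module. This is only a minimal sketch: ElementTree keeps no parent links, so a parent map is built by hand, and the year value in the sample document is an assumed placeholder.

```python
import xml.etree.ElementTree as ET

# Sample document; the <year> value is an assumed placeholder.
XML = """<list>
  <item new="true">
    <title lang="en">DVD Music</title>
    <author>K. A. Bred</author>
    <year>2012</year>
    <price currency="USD">29.99</price>
  </item>
</list>"""

root = ET.fromstring(XML)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

title = root.find("item/title")
item = parent_of[title]

children = [child.tag for child in item]
siblings = [child.tag for child in item if child is not title]

# Walk upwards to collect the ancestors, nearest first.
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# iter() yields the element itself first, so skip it to keep descendants only.
descendants = [el.tag for el in root.iter()][1:]

print(item.tag, children, siblings, ancestors, descendants)
```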
To learn how to apply these notations in real xpaths, refer to XPath in examples.
XPath expressions can refer to attributes as well as elements in an XML/HTML document. When referring to an attribute, the “@” character is used. For example, the following XPath expression identifies price elements whose currency attribute has the value USD:
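As an illustration of attribute tests, here is a small sketch using Python's standard xml.etree.ElementTree, which implements a subset of XPath. The second item and the leading `.//` are assumptions made for the example, not part of the original document.

```python
import xml.etree.ElementTree as ET

# The second <item> is hypothetical, added so the test has something to reject.
XML = """<list>
  <item><title>DVD Music</title><price currency="USD">29.99</price></item>
  <item><title>CD Jazz</title><price currency="EUR">12.50</price></item>
</list>"""
root = ET.fromstring(XML)

# .//price[@currency='USD']: price elements anywhere below the context node
# whose currency attribute equals 'USD'.
usd_prices = root.findall(".//price[@currency='USD']")
print([p.text for p in usd_prices])

# [@currency] alone keeps every element that merely has the attribute.
with_currency = root.findall(".//price[@currency]")
print([p.get("currency") for p in with_currency])
```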
XPath evaluation relative to a context
XPath evaluates expressions relative to a context. The context is usually specified by the technologies that build on XPath, such as XSLT or XPointer. An XPath context includes a context node, a context size, a context position, and other context data. For scraping, the context node is of most interest, since scraping extracts data from node trees and every relative path is resolved against the context node.
A location path is a specialization of an XPath expression. A location path identifies a set of XPath nodes relative to its context. XPath defines two syntaxes: the abbreviated syntax and the unabbreviated syntax. We consider only the abbreviated syntax because it is the most widely used; the unabbreviated syntax is more verbose.
The two types of location paths are relative and absolute.
A relative location path is a sequence of location steps separated by “/”, for example:
A location step consists of:
- an axis (defines the tree-relationship between the selected nodes and the current node)
- a node-test (identifies a node within an axis)
- zero or more predicates (to further refine the selected node-set)
In the above example, the relative location path consists of three location steps: the first, list, selects a set of nodes relative to the context node; the second, item, selects a set of nodes within the set identified by the first step; and so on. Note that there is no “/” at the beginning of the path.
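The step-by-step evaluation described above can be sketched in a few lines of Python. The helper `evaluate` is hypothetical and handles only child-axis name tests (no predicates, no other axes); the third step, title, and the sample data are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<list>
  <item><title>DVD Music</title></item>
  <item><title>CD Jazz</title></item>
</list>"""

def evaluate(context_nodes, path):
    """Evaluate a relative path made only of child-axis name tests."""
    nodes = list(context_nodes)
    for step in path.split("/"):
        # Each step selects the matching children of every node kept so far.
        nodes = [child for node in nodes for child in node if child.tag == step]
    return nodes

root = ET.fromstring(XML)

# Stand-in for the document node, which ElementTree does not expose.
document = ET.Element("document")
document.append(root)

# The same relative path gives different results for different context nodes.
from_document = evaluate([document], "list/item/title")
from_list = evaluate([root], "list/item/title")  # <list> has no <list> child
print([t.text for t in from_document], from_list)
```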
An absolute location path starts with “/”, optionally followed by a relative location path; the initial “/” refers to the root node. An absolute location path is basically a relative location path evaluated in the context of the root node, for example:
/list/item[price<20]
In this example, we get every item with a price below 20 that is a child of list, which in turn must be the root element. With absolute location paths, the context node (current node-set) isn’t meaningful because the path is always evaluated from the root node.
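Python's ElementTree supports neither absolute paths on elements nor numeric comparisons in predicates, so the sketch below emulates /list/item[price<20] by hand. The function name and sample data are hypothetical, and prices are assumed to be numeric; a full XPath 1.0 engine would evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue with one cheap and one expensive item.
XML = """<list>
  <item><title>DVD Music</title><price>29.99</price></item>
  <item><title>CD Jazz</title><price>12.50</price></item>
</list>"""
root = ET.fromstring(XML)

def items_cheaper_than(root, limit):
    """Emulate /list/item[price < limit]; the walk always starts at the root."""
    # First step: the root element must be <list>.
    if root.tag != "list":
        return []
    # Second step plus predicate: keep an item if any price child is below limit.
    return [item for item in root.findall("item")
            if any(float(price.text) < limit for price in item.findall("price"))]

cheap = items_cheaper_than(root, 20)
print([item.findtext("title") for item in cheap])
```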
XPath expressions can include predicates. Predicates are used in location paths to filter the current set of nodes. A predicate contains a Boolean expression (or an expression that can be converted to Boolean). Each member of the current node-set is tested against the Boolean expression and kept if the expression is true; otherwise, it is rejected. A predicate is enclosed in square brackets, [ ]. Predicates are useful in refining the resulting node-set. For example, the first XPath expression identifies only the items whose price is greater than 300, while the second location path selects price nodes whose currency is 'USD':
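The filtering behaviour of predicates can be tried out with ElementTree, which understands a few predicate forms natively. The sample items are hypothetical, and the numeric comparison is done in Python because it lies outside ElementTree's XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: two priced items and one without a price.
XML = """<list>
  <item><title>Projector</title><price currency="USD">349.00</price></item>
  <item><title>DVD Music</title><price currency="USD">29.99</price></item>
  <item><title>Poster</title></item>
</list>"""
root = ET.fromstring(XML)

# item[price]: keep items that have at least one price child.
priced = root.findall("item[price]")

# item[title='DVD Music']: keep items whose title text equals the string.
dvd = root.findall("item[title='DVD Music']")

# item[price > 300] is not supported by ElementTree, so each member of the
# node-set is tested in Python and kept only when the test is true.
expensive = [item for item in priced if float(item.findtext("price")) > 300]

print(len(priced), dvd[0].findtext("price"), expensive[0].findtext("title"))
```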
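Predicates allow the relational operators =, !=, >, <, >=, and <=, as well as the Boolean operators and and or. Because XSL stylesheets are XML documents, a literal < cannot appear inside an attribute value; write &lt; instead (and, optionally, &gt; for >), for example price &lt; 20.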
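The escaping rule can be checked with Python's standard XML tools. In this sketch the element name `rule` is a hypothetical stand-in for a stylesheet instruction; only the attribute handling matters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The attribute carries an XPath test; "<" must be written as "&lt;" in XML.
fragment = '<rule test="price &lt; 20 and price &gt;= 5"/>'
xpath_test = ET.fromstring(fragment).get("test")
print(xpath_test)  # the parser hands back the unescaped expression

# A raw "<" inside the attribute value makes the document ill-formed.
try:
    ET.fromstring('<rule test="price < 20"/>')
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# quoteattr() performs the escaping when such attributes are generated.
escaped = quoteattr("price < 20")
print(raw_ok, escaped)
```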
|Expression|Description|
|---|---|
|name|Selects all nodes named “name”|
|/|Selects from the root node (also the step delimiter)|
|//|Selects nodes in the document from the current node that match the selection, no matter where they are|
|.|Selects the current node (the context node)|
|..|Selects the parent of the current node (the context node)|
And some examples…
@ is used to refer to attributes.
* (asterisk) is used to refer to all the elements that are children of the context node. Here the first XPath refers to all child nodes of the item nodes; the second gets all the nodes:
[] (square brackets) can also be used to refer to a specific element in an ordered sequence. The following example refers to the second item element:
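A short ElementTree sketch of the wildcard and the positional predicate follows. The two-item document is hypothetical; the paths are written relative to the root element because that is how ElementTree evaluates them.

```python
import xml.etree.ElementTree as ET

XML = """<list>
  <item><title>DVD Music</title><price>29.99</price></item>
  <item><title>CD Jazz</title><price>12.50</price></item>
</list>"""
root = ET.fromstring(XML)

# item/*: every element child of every item.
item_children = [el.tag for el in root.findall("item/*")]

# .//*: every element below the context node (here the root element).
all_below = [el.tag for el in root.findall(".//*")]

# item[2]: the second item child; XPath positions start at 1, not 0.
second = root.find("item[2]")

print(item_children, len(all_below), second.findtext("title"))
```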
// is used to refer to all descendants of the context node, at any depth. The first XPath chooses all item elements, while the second refers to all item elements that have a list parent:
. (dot) is used to refer to the context node itself (current node). The following xpath refers to all the item nodes that are children of the context node:
.. (double dot) is used to refer to the parent of the context node. This XPath refers to the item elements from the context of the author nodes (since item is the parent of author):
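The difference between a child step, the descendant shorthand, the dot and the double dot can be seen in this ElementTree sketch. The nested bundle element and the second author are hypothetical additions that place an item deeper in the tree.

```python
import xml.etree.ElementTree as ET

# <bundle> is a hypothetical wrapper so that one item is not a direct child.
XML = """<list>
  <item><author>K. A. Bred</author></item>
  <bundle><item><author>J. Doe</author></item></bundle>
</list>"""
root = ET.fromstring(XML)

direct_items = root.findall("item")      # child step: direct children only
items_anywhere = root.findall(".//item") # descendant shorthand: any depth
context = root.find(".")                 # the context node itself
parents = root.findall(".//author/..")   # parent of each author element

print(len(direct_items), len(items_anywhere), context.tag,
      [p.tag for p in parents])
```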
The | (pipe) operator combines multiple node-sets into one. For our example XML, the following XPath gets the node-set with all author and year elements:
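ElementTree has no union operator, so the sketch below emulates it with a hypothetical helper named `union` that merges the results of several paths, drops duplicates and restores document order. The sample values are assumed.

```python
import xml.etree.ElementTree as ET

# Author and year values are placeholders chosen for the example.
XML = """<list>
  <item><author>K. A. Bred</author><year>2012</year><price>29.99</price></item>
  <item><author>J. Doe</author><year>2010</year><price>12.50</price></item>
</list>"""
root = ET.fromstring(XML)

def union(context, *paths):
    """Emulate path1 | path2: matching nodes, deduplicated, in document order."""
    wanted = {el for path in paths for el in context.findall(path)}
    return [el for el in context.iter() if el in wanted]

both = union(root, ".//author", ".//year")
print([el.tag for el in both])
```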
Numeric operators provided by XPath are: + (addition), - (subtraction), * (multiplication), div (division), and mod (remainder from truncating division).
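The mod operator is the one most likely to surprise, because its result takes the sign of the left operand. A small Python comparison, offered as a sketch: `xpath_mod` is a hypothetical helper, and unlike XPath (which yields NaN) `math.fmod` raises an error when the divisor is zero.

```python
import math

def xpath_mod(a: float, b: float) -> float:
    """XPath mod: remainder of a truncating division, with the sign of a."""
    return math.fmod(a, b)

# Truncating remainder, as in XPath.
print(xpath_mod(5, 2), xpath_mod(-5, 2), xpath_mod(5, -2))

# Python's % operator floors instead, so the sign follows the divisor.
print(-5 % 2)

# div is plain floating-point division, like Python's / operator.
print(7 / 2)
```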
Strings in XPath are enclosed in quotation marks (' or "). When an XPath expression is embedded in an XML document and itself contains quotation marks, you have two options:
- Escape them as &apos; or &quot; respectively. For example:
notes = 'New Laptop&quot;s ;USB&quot;MP3 player&quot;'
- Use single quotation marks (') if the expression is enclosed in double quotation marks ("), and vice versa. For example:
select = "item[@private='false' or price/@currency='USD']"
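Both quoting options can be verified with ElementTree. In this sketch the `rule` element and the `private` attribute values are hypothetical; the point is that the XML parser hands the same usable XPath string to the caller either way.

```python
import xml.etree.ElementTree as ET

# Alternating quotes: double quotes for the XML attribute, single for XPath.
nested = ET.fromstring("""<rule select="item[@private='false']"/>""")

# Escaping: the inner double quotes are written as the entity &quot;
# and decoded by the XML parser.
escaped = ET.fromstring("""<rule select="item[@private=&quot;false&quot;]"/>""")

root = ET.fromstring('<list><item private="false"/><item private="true"/></list>')

for rule in (nested, escaped):
    expression = rule.get("select")
    print(expression, len(root.findall(expression)))
```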
XPath defines a set of functions called the core function library. Each function is identified by a function name, return type (must not be void) and the type of the arguments (zero or more, mandatory, or optional).
Functions are used inside predicates and expressions. XSLT extends this function set. The most used functions are divided into four groups:
Node-set functions provide information on a set of nodes (one or more nodes). Some node-set functions are:
- last() – Returns a number called the context size, which is the number of nodes in a given context. Note that it returns a number, not the last node itself.
- position() – Returns a number called the context position, which is the position of the current node in the set (list) of nodes in a given context. For example, you can test whether you are dealing with the last node of a set with the expression position()=last().
- count(node-set) – Returns the number of nodes in the argument node-set.
- id(object) – Returns a node-set, the result of selecting elements by their unique id declared as type ID in a DTD (Document Type Definition).
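The node-set functions map onto simple Python operations, as this sketch with a hypothetical three-item list shows. ElementTree understands last() inside a predicate; count() and position() are expressed here with len() and enumerate().

```python
import xml.etree.ElementTree as ET

XML = """<list>
  <item><title>DVD Music</title></item>
  <item><title>CD Jazz</title></item>
  <item><title>Vinyl Blues</title></item>
</list>"""
root = ET.fromstring(XML)
items = root.findall("item")

# count(item): the size of the node-set.
count = len(items)

# position() is 1-based; last() is the context size, the same for every node.
last = len(items)
positions = [(pos, item.findtext("title"), pos == last)
             for pos, item in enumerate(items, start=1)]

# ElementTree supports the predicates [last()] and [last()-1] directly.
final = root.find("item[last()]")
before_final = root.find("item[last()-1]")

print(count, positions, final.findtext("title"), before_final.findtext("title"))
```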
String functions process strings. Useful string functions are:
- string string(object?) – Converts the argument object or the context node to a string. Valid arguments are a node-set, a number, a Boolean, or any other type. A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. Note that this function is not intended for converting numbers into strings for presentation to users. For that, use format-number function and xsl:number element.
- string concat(string, string, string*) – Takes two or more strings as arguments and returns their concatenation. For example, concat("First ", "mile of ", "Orient Express") returns "First mile of Orient Express".
- boolean starts-with(string, string) – Returns true if the first argument string starts with the second argument string. Otherwise, it returns false. For example, starts-with("Miles Smiles album, CD", "Miles") returns true.
- boolean contains(string, string)– Returns true if the first argument string contains the second argument string. Otherwise, it returns false. For example, contains("Miles Smiles album, CD", "album") returns true.
- string normalize-space(string?) – Returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
Other XPath string functions are substring(), substring-before(), substring-after(), string-length(), and translate(). Refer to section 4.2 of the XPath 1.0 specification for details.
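For readers who post-process scraped text in Python, the string functions above have close equivalents. The helpers below are hypothetical names written as a sketch, not a complete port of the XPath library.

```python
import re

def concat(first: str, second: str, *rest: str) -> str:
    """concat(string, string, string*): at least two arguments are required."""
    return "".join((first, second, *rest))

def starts_with(text: str, prefix: str) -> bool:
    """starts-with(string, string)."""
    return text.startswith(prefix)

def contains(text: str, part: str) -> bool:
    """contains(string, string)."""
    return part in text

def normalize_space(text: str) -> str:
    """Collapse runs of XML whitespace to one space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

print(concat("First ", "mile of ", "Orient Express"))
print(starts_with("Miles Smiles album, CD", "Miles"))
print(contains("Miles Smiles album, CD", "album"))
print(normalize_space("  Miles \n Smiles\t album "))
```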
Boolean functions are used to convert an object or a string to either true or false, or to get the true or false values directly.
- boolean boolean(object) – Returns the conversion to boolean of the object passed as an argument, according to the following rules: a number is true if and only if it is neither zero nor NaN; a node-set or a string is true if and only if it is not empty.
- boolean not(boolean) – Returns true if the boolean passed as argument is false; false otherwise.
- boolean true() and boolean false() – Return true or false, respectively. These functions are needed because a bare true or false in an XPath expression is read as an element name, not as a Boolean value.
- boolean lang(string) – Returns true if the language of the context node is the same as, or a sublanguage of, the language specified by the string argument; false otherwise. The language of the context node is defined by the value of the xml:lang attribute on the node or on its nearest ancestor that has one. If no xml:lang attribute is specified, lang("en") returns false on any node of the tree.
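The conversion rules of boolean() differ from Python's own truthiness in one place (NaN), which the following sketch makes explicit. The function `xpath_boolean` is hypothetical, and a node-set is modelled as a plain Python list.

```python
import math

def xpath_boolean(value) -> bool:
    """Sketch of XPath 1.0 boolean() for Python stand-ins of the four types."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Numbers: true unless zero or NaN.
        return not (value == 0 or math.isnan(value))
    if isinstance(value, (str, list)):
        # Strings and node-sets (modelled as lists): true when non-empty.
        return len(value) > 0
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(xpath_boolean(0.0), xpath_boolean(float("nan")), xpath_boolean(-1))
print(xpath_boolean("false"))  # a non-empty string is true, whatever it says
print(xpath_boolean([]))
```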
Number functions are XPath’s numeric functions, and they all return numbers. They are:
- number() – Converts the optional object argument (or the context node if no argument is specified) to a number, according to the following rules:
  - Boolean true is converted to 1, false to 0.
  - A string that consists of a valid number (with optional surrounding whitespace) is converted to that number; any other string is converted to NaN.
  - A node-set is first converted to a string and then the string is converted to a number.
In the following example we address the item node-set where the value of the time node is not greater than 60:
- sum() – Returns the sum of all nodes in the node-set argument after the number() function has been applied to them.
- floor() – Returns the largest integer number that is not greater than the number argument. For example, floor(1.7) returns 1.
- ceiling() – Returns the smallest integer number that is not less than the number argument. For example, ceiling(3.65) returns 4.
- round() – Returns the integer number that is closest to the number argument. For example, round(6.51) returns 7.
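The numeric functions can be approximated in Python as sketched below. The helper names and the sample document with time elements are assumptions; note in particular that XPath's round() resolves ties toward positive infinity, unlike Python's built-in round().

```python
import math
import re
import xml.etree.ElementTree as ET

# XPath 1.0 number syntax: optional minus, digits, optional fraction.
_NUMBER = re.compile(r"[ \t\r\n]*-?([0-9]+(\.[0-9]*)?|\.[0-9]+)[ \t\r\n]*")

def xpath_number(value) -> float:
    """Sketch of number() for booleans, numbers and strings."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if _NUMBER.fullmatch(value):
        return float(value)
    return math.nan  # any other string is not a number

def xpath_round(x: float) -> float:
    """Closest integer; ties go toward positive infinity."""
    if math.isnan(x) or math.isinf(x):
        return x
    lower = math.floor(x)
    return float(lower + 1 if x - lower >= 0.5 else lower)

# Hypothetical data: playing times in minutes, one of them not numeric.
XML = """<list>
  <item><title>DVD Music</title><time>45</time></item>
  <item><title>CD Jazz</title><time>74</time></item>
  <item><title>Tape</title><time>n/a</time></item>
</list>"""
root = ET.fromstring(XML)

# Emulates item[number(time) <= 60]; comparisons with NaN are always false.
short = [item.findtext("title") for item in root.findall("item")
         if xpath_number(item.findtext("time")) <= 60]

# sum() applies number() to every node; a single NaN makes the total NaN.
total = sum(xpath_number(t.text) for t in root.findall("item/time"))

print(short, total, math.floor(1.7), math.ceil(3.65), xpath_round(2.5))
```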
The W3C specifications define further types of functions beyond these four groups.
|Path expression|Result|
|---|---|
|item|Selects all nodes with the name item|
|/list|Selects the root element list|
|item/title|Selects all title elements that are children of item|
|//item|Selects all item nodes no matter where they are in the document|
|item//price|Selects all price nodes that are descendants of the item nodes, no matter where they are under the item nodes|
|//@lang|Selects all attributes that are named lang|
For more examples visit XPath in examples.
XPath in screen scraping
Transform HTML into XML
In web scraping, we mostly work with existing HTML pages (besides RSS feeds), which are often malformed. For XPath to evaluate such pages, the HTML text must first be converted into well-formed XML so that XSL, XSLT and XQuery processing can be applied properly. One tool for this job is the HTML Agility Pack library. Some articles on the subject are collected in the HtmlAgilityPack Article Series.
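HTML Agility Pack is a .NET library; to illustrate the same idea in Python, the sketch below builds a well-formed tree from sloppy HTML using only the standard library. The class name, the short list of void elements and the sample markup are all assumptions, and real pages are better served by a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A few elements that never take an end tag (deliberately incomplete).
VOID = {"br", "hr", "img", "input", "link", "meta"}

class TreeFromHtml(HTMLParser):
    """Build an ElementTree from loosely written HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text that follows a child element belongs in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data

# Unquoted attributes, an unclosed <span>, a void <br>, and no </html>.
html = ("<html><body><div class=item><b>DVD Music</b><br>"
        "<span class=price>29.99</div></body>")

parser = TreeFromHtml()
parser.feed(html)
parser.close()
root = parser.root

# The repaired tree can now be queried with XPath-style expressions.
print(root.findtext(".//span[@class='price']"))
print(ET.tostring(root, encoding="unicode"))
```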