The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I wrote a simple PHP/cURL scraper using regex, some readers reasonably asked for a more efficient scrape with XPath. So, instead of parsing the content with regex, I used the DOMXPath class methods.

Parsing content with XPath takes more preparation of the content, I think. In return, XPath parsing of HTML/XML structures consumes much less time and fewer resources than regex parsing.

If you have a small set of HTML pages that you want to scrape data from and then to stuff into a database, Regexes might work fine… this works well for a limited, one-time job (from community Wiki).

If we are to apply XPath methods, then after we load the content we had better clean it up to prepare it for loading into DOMDocument and DOMXPath objects.

Here are the basic steps for using the DOMXPath class:
  1. Initialize a DOMDocument class instance from page content (work with HTML as with XML).
  2. Initialize a DOMXPath class instance from DOMDocument class instance.
  3. Parse the DOMXPath object.

1. Initializing a DOMDocument class instance from page content

  • create a new DOMDocument class instance
When using this function, be sure to clear the internal error buffer with libxml_clear_errors(). If you don’t, and you use it in a long-running process, you may find that all your memory is used up. (This advice is borrowed from another source.) See the ‘enable user error handling’ bullet point.
  • load the HTML text into the DOMDocument object
  • enable user error handling
Now the DOMDocument object (named ‘$DOM’) contains all the target text as an HTML DOM structure. It is ready for its various methods and properties to be used.
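
The bullets above can be sketched roughly as follows. This is a minimal sketch, not the original listing: the `$page` markup is a hypothetical stand-in for the downloaded page, and user error handling is switched on before `loadHTML()` so that libxml collects any parse problems instead of emitting PHP warnings.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page content.
$page = '<html><body><div class="product"><p>Name 1</p></div></body></html>';

// Create a new DOMDocument class instance.
$DOM = new DOMDocument();

// Enable user error handling: libxml stores parse errors internally
// instead of raising PHP warnings. The previous setting is remembered.
$previous = libxml_use_internal_errors(true);

// Load the HTML text into the DOMDocument object.
$loaded = $DOM->loadHTML($page);

// Keep the collected errors for inspection, then clear the buffer so it
// does not keep growing in a long-running process.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    exit("Could not load the HTML\n");
}

// The document is now a DOM tree that can be traversed.
echo $DOM->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Restoring the previous error-handling setting is optional; it only matters when other code in the same process relies on the default behaviour.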

2. Initializing a DOMXPath object from the DOMDocument object

  • initialize a DOMXPath object for further parsing
Now XPath methods are applicable to the content.
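
As an illustration of this step, here is a short sketch under assumed input: the markup and the `case1` id are hypothetical and are not taken from the test page. It binds a DOMXPath object to the document and uses `evaluate()`, which can return a scalar result such as a node count.

```php
<?php
// Hypothetical markup used only for this illustration.
$DOM = new DOMDocument();
libxml_use_internal_errors(true);
$DOM->loadHTML('<div id="case1"><p>first</p><p>second</p></div>');
libxml_clear_errors();

// Initialize a DOMXPath object bound to the DOMDocument object.
$xpath = new DOMXPath($DOM);

// evaluate() returns a typed result where possible; count() yields a number.
$count = $xpath->evaluate('count(//div[@id="case1"]/p)');

echo $count, "\n";
```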

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote code that uses XPath to retrieve the data.
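
The markup of the test page is not reproduced here, so the following is only a sketch of the general pattern on hypothetical markup: the `prod` attribute and the `price` class are assumptions made for the example. It runs one `query()` for the product blocks and a second, relative `query()` inside each block.

```php
<?php
// Hypothetical markup; the real test page's structure may differ.
$html = <<<HTML
<div class="block" prod="name1"><span class="price">10.50</span></div>
<div class="block" prod="name2"><span class="price">7.25</span></div>
HTML;

$DOM = new DOMDocument();
libxml_use_internal_errors(true);
$DOM->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($DOM);

$items = [];
// query() returns a DOMNodeList of the nodes matching the expression.
foreach ($xpath->query('//div[@prod]') as $div) {
    // A relative expression is evaluated against the context node $div.
    $price = $xpath->query('.//span[@class="price"]', $div)->item(0);
    $items[$div->getAttribute('prod')] = $price ? $price->textContent : null;
}

print_r($items);
```

Passing the context node as the second argument keeps each price lookup inside its own block, which avoids matching prices from neighbouring blocks.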

How the libxml library reacts to malformed HTML

The libxml library gave no warning about malformed HTML that lay outside the part of the DOM structure being parsed, yet it did issue an error for malformed HTML that was the direct subject of the parse:

  • No warning for this case: <p><p><p>
  • For a missing bracket, <div prod='name1' <div …>, and then for an extra opening tag, <div prod='name1' ><div>, the library issued an exception for the DOMXPath ‘query’ method.
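
One way to see what libxml reports for such input is to read its internal error buffer after loading. This is a sketch on a hypothetical fragment modelled on the missing-bracket case; which messages appear, and how many, depends on the libxml version, so no output is shown.

```php
<?php
// Hypothetical malformed fragment: the first <div> lacks its closing bracket.
$html = "<div prod='name1' <div><p>text</p></div>";

$DOM = new DOMDocument();
libxml_use_internal_errors(true);
$DOM->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();

// Clear the buffer once the errors have been copied out.
libxml_clear_errors();

foreach ($errors as $error) {
    printf(
        "line %d, code %d: %s\n",
        $error->line,
        $error->code,
        trim($error->message)
    );
}
```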

The whole Scraper Listing