WP Web Scraper is a WordPress plugin that extracts web data into custom WordPress pages. It uses the cURL library to fetch pages and phpQuery to parse the HTML. It is a reliable choice among scraping tools and plugins.

Plugin Usage

The plugin can be used in two ways: called from a WordPress theme template, or inserted as a shortcode directly into a page's HTML.

Shortcode use is simple. Insert the plugin shortcode into the page HTML:

[wpws url="…" selector="…" {other optional arguments here}]

For the PHP implementation insert the following tag into a template:

<?php echo wpws_get_content($url, $selector, $xpath, $wpwsopt)?>

Plugin features

The scraper plugin is rich in features, giving much more flexibility compared to its counterpart, Web Scraper Shortcode.

  1. Scraped data is cached; the cache timeout is set in minutes.
  2. The user-agent string can be customized (see the example).
  3. Error handling is configurable: fail silently, display the error, display a custom error message, or display the expired cache (see the example).
  4. A regex pattern can be cleared from, or replaced in, the extracted content before it is output to the WordPress page (clear_regex and replace_regex parameters).
  5. The basehref parameter converts relative links in the extracted data into absolute links. For example, basehref="http://yahoo.com" converts all relative links to absolute ones by prepending http://yahoo.com to all href and src values. Note: basehref must be a complete URL (including http) with no trailing slash (see the example).
  6. POST arguments can be passed to the URL being scraped:
    postargs='name1=value1&name2=value2'
  7. Scraped data can be converted on the fly into a specified character encoding (using iconv), which makes it possible to scrape a site that uses a different charset.
  8. For advanced use, scrape pages can be created on the fly: the URL to scrape or the POST arguments are generated dynamically from your page's GET or POST arguments.
  9. A callback function can be invoked to parse the scraped data.
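
To make the clear_regex and replace_regex feature concrete, here is a minimal sketch of what clearing and replacing by regex amount to in plain PHP. The function names, the sample string and the way the replacement text is supplied are assumptions made for illustration; this is not the plugin's own code.

```php
<?php
// Hypothetical helpers (not part of the plugin): "clear" removes every match,
// "replace" swaps every match for the given replacement text.
function sketch_clear_regex(string $content, string $pattern): string
{
    // preg_replace() returns null on a regex error; keep the original then.
    return preg_replace($pattern, '', $content) ?? $content;
}

function sketch_replace_regex(string $content, string $pattern, string $replacement): string
{
    return preg_replace($pattern, $replacement, $content) ?? $content;
}

// Made-up scraped fragment used only for this illustration.
$scraped = '<p>Price: 12.50 USD <!-- ad slot --></p>';
$scraped = sketch_clear_regex($scraped, '/<!--.*?-->/s');        // strip HTML comments
$scraped = sketch_replace_regex($scraped, '/\bUSD\b/', 'US dollars');
echo $scraped;
```

With the sample string above, the comment is removed and "USD" becomes "US dollars" before the content would be written to the page.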

Example

An example of the plugin in use:

  1. We want to extract a piece of HTML from this URL:
    http://ca.finance.yahoo.com/q?s=rab.v&ql=1
  2. prepend the base to the relative links:
    basehref="http://ca.finance.yahoo.com"
  3. send a custom user-agent:
    user_agent="bot at scraping.pro"
  4. display errors instead of failing silently:
    on_error="error_show"
  5. select the second div inside the element with id='yfi_comparison' (eq() counts from zero):
    selector="#yfi_comparison div:eq(1)"
  6. output the result as HTML (the default) rather than plain text:
    output="html"

The resulting shortcode:

[wpws url="http://ca.finance.yahoo.com/q?s=rab.v&ql=1" basehref="http://ca.finance.yahoo.com" user_agent="bot at scraping.pro" on_error="error_show" selector="#yfi_comparison div:eq(1)"]

_________________________________
Note that because we applied the basehref argument, the links in the scraped content are absolute, not relative.
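
The effect of basehref can be sketched in a few lines of PHP. This is an illustration of the idea only, under the assumption that relative href and src values simply get the base URL put in front of them; the function name is hypothetical and the plugin's actual implementation may differ.

```php
<?php
// Hypothetical helper, not the plugin's code: prefix relative href/src values
// with a base URL that has no trailing slash.
function sketch_apply_basehref(string $html, string $base): string
{
    return preg_replace_callback(
        '/\b(href|src)=(["\'])(.*?)\2/i',
        function (array $m) use ($base): string {
            $value = $m[3];
            // Leave absolute URLs (scheme:), protocol-relative URLs (//) and
            // in-page anchors (#) untouched.
            if (preg_match('~^([a-z][a-z0-9+.-]*:|//|#)~i', $value)) {
                return $m[0];
            }
            // The base has no trailing slash, so the path must start with one.
            $path = ($value !== '' && $value[0] === '/') ? $value : '/' . $value;
            return $m[1] . '=' . $m[2] . $base . $path . $m[2];
        },
        $html
    );
}

// Made-up fragment: one relative link, one already absolute image URL.
$html = '<a href="/q?s=rab.v">RAB</a> <img src="http://example.com/a.png">';
echo sketch_apply_basehref($html, 'http://ca.finance.yahoo.com');
```

Only the relative link is rewritten; the image URL already carries a scheme and is left as it is.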

This scraper’s parameters array is dumped here:


Decoding option

You can specify a charset for the iconv conversion of scraped content through the htmldecode parameter. The value must be the charset of the source URL you are scraping from. If the parameter is omitted, the default encoding of your blog is used.
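
As a rough illustration of the conversion step, the sketch below calls PHP's iconv() directly. The helper name is hypothetical, and UTF-8 as the blog's encoding is an assumption; it shows the kind of conversion involved, not the plugin's internals.

```php
<?php
// Hypothetical helper, not the plugin's code: convert scraped bytes from the
// source site's charset to the charset the blog uses (UTF-8 is assumed here).
function sketch_convert_charset(string $content, string $sourceCharset, string $blogCharset = 'UTF-8'): string
{
    $converted = iconv($sourceCharset, $blogCharset, $content);
    // iconv() returns false when the input is not valid in the source charset;
    // fall back to the untouched content in that case.
    return $converted === false ? $content : $converted;
}

// "café" as served by a site that uses ISO-8859-1 (é is the single byte 0xE9).
$scraped = "caf\xE9";
echo sketch_convert_charset($scraped, 'ISO-8859-1');
```

Without the conversion, the lone 0xE9 byte would show up as a broken character on a UTF-8 page.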

Callback

Using the callback function, you can extend the plugin to do advanced parsing. Simply put, the callback function will parse the extracted value and return the required data. Your callback function can reside in the functions.php of your theme. The function will take a single string parameter, parse it and return a string as output.
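
A callback of the shape described above (one string in, one string out) might look like the following sketch. The function name and the sample input are hypothetical, and how the callback is registered with the plugin is not shown here.

```php
<?php
// Hypothetical callback for a theme's functions.php: takes the scraped
// string and returns a string, as the plugin expects of a callback.
function mytheme_wpws_first_number(string $scraped): string
{
    // Drop any markup, then look for the first number such as "1,234.56".
    $text = strip_tags($scraped);
    if (preg_match('/\d[\d,]*(?:\.\d+)?/', $text, $m)) {
        // Remove thousands separators so the value is easy to reuse.
        return str_replace(',', '', $m[0]);
    }
    // No number found: return an empty string rather than the raw markup.
    return '';
}

// Made-up scraped fragment used only for this illustration.
echo mytheme_wpws_first_number('<div>Last trade: <b>1,234.56</b> CAD</div>');
```

For the sample fragment this prints 1234.56, so the page receives only the value rather than the whole scraped block.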

Special notes about the scrape plugin

  1. The plugin does not work with more than one selector.
  2. Use the plugin with caution: it stores the scraped pages as temporary data in the wp_options table of the MySQL database and does not flush this intermediate data. If you extract large chunks of data, the table may grow large enough to bring MySQL down.

Summary

This plugin is a convenient tool for quickly scraping dynamic web data. It offers a lot of flexibility: filtering results with selectors and regex patterns, converting relative links to absolute ones, post-processing through a callback function, and more. I would recommend it for building data-rich web pages of all kinds.