Web Content Extractor

Web Content Extractor is a visual, user-oriented tool for scraping typical web pages. Its simplicity makes for a quick start in data extraction.

Overview

Web Content Extractor (WCE) is a simple, user-oriented application that scrapes web pages and parses data from them. The program is easy to use, and it lets you save every project for future (for example, daily) reuse. The trial version works with only 150 records per scrape project. When it comes to exporting and formatting data, Web Content Extractor covers a wide range of targets: Excel, text, HTML, MS Access DB, SQL Script File, MySQL Script File, XML file, HTTP submit form and ODBC Data source. While remaining a simple and user-friendly application, it steadily grows in practical functionality for complex scrape cases.

Characteristics

Usability
Functionality
Easy to learn
Customer support: email; the support staff are quick to help.
Price: $75; $50 for more than 2 licences.
Trial version/Free version: 14 days
OS (Specifications): Windows
Data Export formats: Excel, text, HTML, MS Access DB, SQL Script File, MySQL Script File, XML file, HTTP submit form, ODBC Data source
Multi-threading: yes (up to 20 threads)
API: no
Scheduling: through the Windows Task Scheduler

Workflow

Let’s see how to scrape data from londonstockexchange.com using Web Content Extractor. First, you need to open the starting page in the internal browser:

Then, you need to define “crawling rules” in order to iterate through all the records in the stock table:

Crawling

Also, as you need to process all the pages, set the scraper to follow the “Next” link on every page:

Paging settings
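
For readers who prefer to think in code, the crawling rules and paging settings above boil down to the following loop, shown here as a rough Python sketch (requests + BeautifulSoup); the start URL and CSS selectors are placeholders, since they depend on the actual londonstockexchange.com markup:

# Rough Python equivalent of the "crawling rules" and "paging settings" above:
# collect links to each stock's detail page and follow the "Next" link until
# the last page. The selectors below are assumptions, not the real page markup.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_detail_links(start_url):
    url, links = start_url, []
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # crawling rule: pick up every row link in the stock table (selector assumed)
        links += [urljoin(url, a["href"]) for a in soup.select("table.stocks a.instrument")]
        # paging setting: follow the "Next" link if one exists (selector assumed)
        nxt = soup.select_one("a.next")
        url = urljoin(url, nxt["href"]) if nxt else None
    return links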

After that, drill down into each stock table row and extract information from the “Summary” section. This is done by defining an “Extraction pattern” for getting data fields:

And finally, when you’re done with all the rules and patterns, run the web scraping session. You may track the scraped data at the bottom:

Extracting
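
The extraction pattern defined above is essentially a mapping from page elements to named data fields. A hand-written counterpart for a single detail page might look like the sketch below, where the selectors and field names are assumptions for illustration:

# Rough counterpart of the "Extraction pattern": pull named fields out of the
# "Summary" section of one detail page. Selectors and field names are assumed.
import requests
from bs4 import BeautifulSoup

def extract_summary(detail_url):
    soup = BeautifulSoup(requests.get(detail_url).text, "html.parser")
    summary = soup.select_one("div.summary")  # assumed container element
    return {
        "name": summary.select_one("h1").get_text(strip=True),
        "price": summary.select_one("span.price").get_text(strip=True),
        "volume": summary.select_one("span.volume").get_text(strip=True),
    }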

As soon as you get all the web data scraped, export it into the desired destination:

Saving results
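
For comparison, exporting to the simplest of those targets, a plain text/CSV file, takes only a few lines of Python with the standard csv module; the field names follow the assumed extraction sketch above:

# Writing the scraped records to a CSV file, a rough analogue of WCE's
# export-to-text step. Field names follow the assumed extraction sketch.
import csv

def export_csv(records, path="stocks.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "volume"])
        writer.writeheader()
        writer.writerows(records)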

Dynamic Elements Extraction

Scraping dynamic web page elements (such as popups and AJAX-driven snippets) is not an easy task. At first we didn't expect Web Content Extractor to cope with this, but with a URL transformation script it is possible. Thanks to the Newprosoft support center, we got help with crawling popups on a certain web page.

Go to Project->Properties->Crawling Rules->URL Transformation Script, where you may compose a script that changes the default crawl behavior into a customized one:

The task of building a transformation script is not a trivial one. Yes, it is possible to make a project crawl through dynamic elements, but, practically speaking, you'll need to know some web programming (XPath, regular expressions, JavaScript, VBScript, jQuery).

Here is an example of such a URL transformation script for scraping popups from http://www.fox.com/schedule/ (the script was composed by a Newprosoft specialist, so in difficult cases you'll need to turn to them):
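
Conceptually, a URL transformation script rewrites the URLs the crawler would normally visit into the URLs that the popups or AJAX snippets actually load their data from. As a rough illustration of that idea only (plain Python, not WCE's scripting syntax, with a made-up URL pattern), the sketch below shows the kind of rewriting involved:

# Generic illustration of URL transformation: rewrite a crawled listing URL
# into the URL the popup actually loads its data from. The patterns and the
# endpoint below are made up for illustration.
import re

def transform_url(listing_url):
    match = re.search(r"[?&]show=(\d+)", listing_url)
    if match:
        # hypothetical endpoint that would serve the popup's content
        return "http://www.example.com/popup?id=" + match.group(1)
    return listing_url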

Multi-Threading and More

As far as multi-threading goes, Web Content Extractor sends several server requests at the same time (up to 20), but remember that each session runs with only one extraction pattern. Filtering helps with sifting through the results.
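
In code terms, “up to 20 threads” simply means issuing up to 20 requests in parallel; a minimal Python analogue using a thread pool (reusing the assumed extract_summary() from the sketch above) could look like this:

# Fetching detail pages in parallel, roughly what WCE's multi-threading does.
# extract_summary() is the assumed function from the extraction sketch above.
from concurrent.futures import ThreadPoolExecutor

def scrape_parallel(detail_links, workers=20):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_summary, detail_links))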

Summary

Web Content Extractor is a tool for getting the data you need in “5 clicks” (we completed the example task within 15 minutes). It works well if you scrape simple pages with minimal complications, for personal or small-business purposes.