The next stage of the Web Scraper Testing Drive is on: the AJAX upload. Here we check whether the scrapers are able to extract AJAX-supplied data. This is not an easy task for scraper software.

After some struggles and a bit of pestering of the scrapers’ support teams, I was able to make ALL the scrapers extract the Ajax-driven data. For some it was a piece of cake, while others needed some fine-tuning for this tricky data harvest.

The scrapers’ Ajax results table

                          HTML via AJAX   XML via AJAX   JSON via AJAX
Web Content Extractor           ✓              ✓               ✓
Mozenda                         ✓              ✓               ✓
Screen Scraper                  ✓              ✓               ✓
Visual Web Ripper               ✓              ✓               ✓
Content Grabber                 ✓              ✓               ✓
OutWit Hub                      ✓              ✓               ✓
Helium Scraper                  ✓              ✓               ✓
WebSundew Extractor             ✓              ✓               ✓
Easy Web Extractor              ✓              ✓               ✓

Mozenda

Mozenda did very well. In the Agent Builder, upon loading the page, I clicked on the XML and JSON links. The scraper saved these actions as clicks with the “Wait for Ajax” option (marked in red in the picture)! Then I made a capture list for all the names. When I ran the Agent, it clicked and waited for the Ajax data load perfectly.

Web Content Extractor

All that is needed for the Ajax data load in WCE is the “Enable Javascript” option (Tools -> Settings); just put a check in this box. Then you define the crawling rules to follow the Click to get XML through AJAX and Click to get JSON through AJAX links (see those paths in red in the picture below). The last step is to define the extraction pattern for the area with id=”case_ajax”, or for each of its subareas.
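For readers who prefer code, the same extraction idea — grab everything inside the element with id="case_ajax" — can be sketched with Python’s standard HTML parser. The markup below is a hypothetical stand-in for the test page’s structure, not its actual source:

```python
from html.parser import HTMLParser

# Hypothetical markup -- the real test page's structure may differ.
PAGE = '<div id="case_ajax"><ul><li>George</li><li>Eric</li><li>Alice</li></ul></div>'

class CaseAjaxExtractor(HTMLParser):
    """Collects the text found inside the element with id="case_ajax"."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth while inside the target element
        self.names = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif ("id", "case_ajax") in attrs:
            self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.names.append(data.strip())

parser = CaseAjaxExtractor()
parser.feed(PAGE)
```

This mirrors what the GUI extraction pattern does: scope the harvest to one container element and collect each data node within it.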

Visual Web Ripper

My first trial with Ajax data in VWR was not successful, so I re-watched the video on scraping an “Ajax enabled website”. It helped me a lot.

First I clicked the links to load the XML or JSON content, and for each of them I defined a ‘Link’ type template. Then, in the options area for that template, I went to the Action tab and checked the JavaScript (including Ajax) and AJAX radio buttons. That done, I saved the template, and the scraper automatically performed the content load.

Content Grabber

With Content Grabber it took me less than 5 minutes to fetch all the data behind the Ajax links. Since Ajax calls/requests are asynchronous by nature, I set the link command attribute (on the Properties tab) to the Asynchronous Processing mode (see the figure below). The rest was a piece of cake.
Content Grabber asynchronous processing mode

WebSundew Extractor

WebSundew Extractor did the task with excellence. You just click on the links, and the scraper records the corresponding “Click Node” events. Then you define the iterator pattern on the loaded data. Simple.

Helium Scraper

With Helium Scraper I first created ‘kinds’ for the links that load data through Ajax (the xmlLink and ajaxLink kinds).

Then I added the actions to navigate to them.

Besides the link kinds, I composed a ‘Names’ kind for gathering the target names.

You also need to check the “Wait for Ajax” box in Project -> Options:

The action tree will look like this:


Now with this project I successfully extracted the Ajax loaded names.

Screen Scraper

With Screen Scraper, the secret to extracting those Ajax XMLHttpRequests is to define them in a Proxy session and generate scrapeable files out of them:

Lines 2, 3 and 6 of the Proxy session correspond to the Ajax requests for the HTML, XML and JSON data respectively. I have circled in red the vital info for the JSON request in the HTTP request headers:

Accept: application/json

X-Requested-With: XMLHttpRequest

To check the response for each request, choose the “Response” tab in the HTTP transaction sniffer.

It is out of those recorded requests that we produce the scrapeable files from which we extract data. That is, we imitate those requests and parse the responses to them, rather than the resulting HTML as other scrapers do. For each request-response pair we need to produce a separate scrapeable file in the Scraping Session, with its own extraction pattern. I named the scrapeable files html, xml and json respectively (see the picture below).
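Outside Screen Scraper, imitating such a recorded request is straightforward. Here is a minimal Python sketch using only the standard library; the URL is hypothetical — the real one comes from the Proxy session log — and the headers are the ones circled above:

```python
import urllib.request

# Hypothetical endpoint -- the real URL is taken from the Proxy session log.
URL = "http://testing-ground.example.com/ajax/json"

# Reproduce the headers recorded for the JSON Ajax request.
request = urllib.request.Request(URL, headers={
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
})

# urllib.request.urlopen(request) would then fetch the same payload the
# browser's XMLHttpRequest received -- no JavaScript engine required.
```

Replaying the request directly like this is exactly why Screen Scraper needs no “Wait for Ajax” delay: it never runs the page’s JavaScript at all.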

For example, the JSON data come in this format: {"names":["George","Eric","Alice"]}, so the resulting extraction pattern will look like this (sub-patterns are not shown in the figure, though they need to be defined):
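As a rough illustration of how such a pattern with sub-patterns works (the regexes here are my own sketch, not Screen Scraper’s actual syntax), an outer pattern captures the names array and a sub-pattern pulls out each name:

```python
import re

json_text = '{"names":["George","Eric","Alice"]}'

# Outer pattern: capture the contents of the "names" array.
array_body = re.search(r'\{"names":\[(.*?)\]\}', json_text).group(1)

# Sub-pattern: pull out each quoted name from the array body.
names = re.findall(r'"([^"]+)"', array_body)
```

For anything beyond a toy payload, a real JSON parser would of course be safer than regexes, but this mirrors the pattern/sub-pattern structure the tool uses.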

OutWit Hub

OutWit Hub has again proved good for scraping sophisticated cases. Initially I did not have a clue how to proceed with the JavaScript links, but their support provided the solution.

We need to compose the scraper with some directives (this works with OutWit Hub Pro only).

The #start# directive points to the place where the scraper starts extracting content.

The #addToQueue# directive adds a scraped link to the queue of pages to visit. It is usually used with HTML URL links, but in this case the links are JavaScript calls.

The #nextPage# directive generates the event that replaces the onClick and allows the scraper to reach the dynamic nodes.

Note that I chose the “Dynamic” Source type for this scraper automator (in red):

Where did I get those javascript:getXml() and getJson() functions for the #addToQueue# directives? From the page’s HTML code; they sit right at the bottom, between the <body></body> tags:

Those JavaScript functions are normally invoked through Ajax in the browser, but here we force them to execute with the #addToQueue# directives. Thus we bypass the Ajax logic when extracting dynamic pages in OutWit.

Easy Web Extractor

Initially I had some difficulties setting up a project for scraping Ajax. Support helped by directing me to create a column of CLICK type (see in red in the figure). As the scraper runs the project, it ‘clicks’ on the links to load the data through Ajax and then harvests them.

Conclusion

The Ajax Test Drive stage has established that even scrapers armed with simple XPath and Regex are able to parse Ajax-driven data. With some scrapers, the Ajax requests are saved, replayed and their responses parsed; with others, clicking the links is simulated and a “Wait for Ajax” delay is set; in another scraper, the getXml() and getJson() JavaScript functions are invoked directly through directives, though only in the professional version. Modern scrapers are well able to rip Web 2.0 and extract data on the fly.