Here we come to a new milestone: scraping JavaScript-driven websites.

Recently a friend of mine got stumped trying to get the content of a website using the PHP simplehtmldom library. He kept failing and finally found out the site was saturated with JavaScript code. The anti-scrape JavaScript insertions do a tricky check to see if the page is requested and processed by a real browser, and only if that is true will the site render the rest of the page’s HTML code.

How to find out if a site is JS-stuffed?

When you approach a target page, you won’t necessarily be able to tell whether or not it is locked behind JS protection. It might take you some time and a few unsuccessful trials before you begin to suspect something is wrong, especially since there’s no meaningful output at the scraper’s end. So, prior to starting web scraping, it’s wise to use a web sniffer (also called a network analyzer) to watch the network activity during the target page’s load. Nowadays every decent browser has a built-in developer tool that includes a web sniffer; in this example I used a Chrome web sniffer extension. In the following picture you can see the multiple JS script loads versus only one mainframe load (marked with a red box by me) on every page load.

JS-protected content scrape

JS protection logic

First of all, let’s consider what logic the JS uses to prevent scraping.

When your browser requests the HTML content, the JS code arrives first (or at least simultaneously). This code calculates a value and returns it to the target server. Based on this calculation, the HTML code might not be de-obfuscated, or might not be sent to the requester at all, thus leaving scraping efforts thwarted.
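To make that round trip concrete, here is a purely illustrative Python sketch of the flow. The URL, the inline-script pattern, and the js_token cookie name are all hypothetical; real challenge formats differ from site to site:

    import re
    import requests

    session = requests.Session()

    # First request: the server answers with a JS challenge instead of content.
    first = session.get("http://example.com/page")  # hypothetical URL
    challenge = re.search(r"var answer = (.*?);", first.text)  # hypothetical inline script

    # A real browser would execute that script automatically; a scraper has to
    # evaluate it itself (see the PyV8 option below) and send the result back,
    # for example as a cookie the server checks on the next request.
    session.cookies.set("js_token", "computed-value-goes-here")  # hypothetical cookie name

    # Second request: with a valid token, the server returns the real HTML.
    unlocked = session.get("http://example.com/page")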

Ways to break thru the anti-scrape JS logic

  1. Simulate a real browser for the scrape. The most-used tools for that are Selenium and iMacros. There is also a recently emerged web IDE called WebRobots that drives the Chrome browser thru JavaScript robots. Read how to leverage Selenium with a headless FF browser; a sketch follows this list.
  2. Use JS plugins to execute the JS logic and return the needed value to the target server, unlocking the content for the scrape. I found the v8js PHP plugin and its Python analog, PyV8; a sketch of that follows this list as well.
  3. Use specific libraries (toolkits) as add-ons for scripting languages. Below is an example of using such a library with Python in web scraping.
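Here is a minimal sketch of option 1 with Selenium driving headless Firefox. It assumes Firefox and geckodriver are installed; the target URL is hypothetical:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    opts = Options()
    opts.add_argument("--headless")  # run Firefox without a visible window

    driver = webdriver.Firefox(options=opts)
    try:
        # The browser fetches the page, runs the anti-scrape JS, and
        # receives the unlocked HTML just like a human visitor would.
        driver.get("http://example.com/js-protected-page")  # hypothetical URL
        html = driver.page_source
    finally:
        driver.quit()

    print(html[:500])

And a sketch of option 2 with PyV8: pull the challenge script out of the page and evaluate it in an embedded V8 engine. The script body below is a stand-in; a real one would be scraped from the target page:

    import PyV8

    # Stand-in for the challenge script extracted from the target page.
    challenge_js = "var answer = 21 * 2; answer;"

    ctxt = PyV8.JSContext()
    ctxt.enter()
    try:
        token = ctxt.eval(challenge_js)  # run the JS and capture its value
    finally:
        ctxt.leave()

    # 'token' would then be sent back to the server to unlock the content.
    print(token)  # 42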

Example use of dryscrape library

I wanna show you a code example of how to leverage the dryscrape library to evaluate the scraped JS and get at JavaScript-protected content.

Python code with only the requests library

Since the requests library does not support JS evaluation, it fails to return the JS-protected content:
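A minimal sketch of the failing attempt; the URL is hypothetical:

    import requests

    response = requests.get("http://example.com/js-protected-page")  # hypothetical URL

    # requests only fetches the raw response; the JS challenge never runs,
    # so the body holds the challenge script (or a stub), not the content.
    print(response.text)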

Scraping code with JS support thru the dryscrape library
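A minimal sketch with dryscrape; the URL is hypothetical, and on a headless Linux box you may need the xvfb helper that dryscrape ships with:

    import dryscrape

    dryscrape.start_xvfb()  # only needed on machines without an X display

    session = dryscrape.Session()
    session.visit("http://example.com/js-protected-page")  # hypothetical URL

    # dryscrape drives a WebKit instance under the hood, so the anti-scrape
    # JS gets executed and the unlocked HTML becomes available.
    html = session.body()
    print(html)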

The difference is essential: the library does all the work of fetching and evaluating the JS and providing access to the desired HTML.

If you know a better way to fight JS prevention methods, let me know in the comments.