Since Selenium WebDriver is created for browser automation, it can be easily used for scraping data from the web. In this post we will consider some advantages and drawbacks of using WebDriver for web scraping.

The Advantages

1. WebDriver can simulate a real user working with a browser

Since WebDriver drives a real web browser to access the web site, its activity does not differ from that of an ordinary user surfing the web. When you load a web page using WebDriver, the browser loads all the web site resources (JavaScript files, images, CSS files and so on) and executes all the JavaScript on the page. At the same time it stores all the cookies set by the web site and sends complete HTTP headers, as all browsers do. This makes it very hard to determine whether a real person or a robot is accessing the web site. While it’s really burdensome to simulate all these actions in a program that sends “handmade” HTTP requests to the server, with WebDriver you can do it in several simple steps.
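To give a feel for those “several simple steps”, here is a minimal sketch in Python. It assumes the Selenium Python bindings; the helper name snapshot_page and the example URL are made up for illustration, and the function only relies on the driver’s get, title, page_source and get_cookies members.

```python
def snapshot_page(driver, url):
    """Load `url` in the browser behind `driver` and collect what it ended up with."""
    # The browser fetches the page together with its scripts, styles and images.
    driver.get(url)
    return {
        "title": driver.title,
        # The page source as the browser currently reports it.
        "html": driver.page_source,
        # Cookies the browser stored for this site, as a name -> value mapping.
        "cookies": {c["name"]: c["value"] for c in driver.get_cookies()},
    }


# Usage with a real browser (needs the selenium package and an installed browser):
#
#     from selenium import webdriver
#     driver = webdriver.Chrome()
#     try:
#         page = snapshot_page(driver, "https://example.com/")
#         print(page["title"], page["cookies"])
#     finally:
#         driver.quit()
```

Passing the driver in as a parameter, rather than creating it inside the function, lets one browser session be reused across many pages.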

2. WebDriver can scrape a web site using a specific browser

While many web scraping programs do use a real web browser for data extraction, in most cases the browser they use is the WebBrowser control, which is based on Internet Explorer. WebDriver, however, works not only with Internet Explorer but also with a variety of browsers such as Google Chrome, Firefox, Opera, HtmlUnit and even Android and iOS.
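As a rough illustration, choosing the browser can be a one-line decision in your code. The sketch below assumes the Selenium Python bindings and covers only Chrome and Firefox; the function name make_driver and the SUPPORTED_BROWSERS tuple are hypothetical names introduced for this example.

```python
SUPPORTED_BROWSERS = ("chrome", "firefox")


def make_driver(browser="chrome"):
    """Start the named browser and return a WebDriver session for it."""
    name = browser.strip().lower()
    if name not in SUPPORTED_BROWSERS:
        raise ValueError(
            f"unsupported browser {browser!r}; expected one of {SUPPORTED_BROWSERS}"
        )

    # Imported here so that the name check above works even on a machine
    # where the selenium package is not installed.
    from selenium import webdriver

    factories = {"chrome": webdriver.Chrome, "firefox": webdriver.Firefox}
    return factories[name]()
```

The rest of the scraper can then stay the same whichever browser is picked, since both classes expose the same WebDriver interface.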

3. WebDriver can scrape complicated web pages with dynamic content

Sometimes the data you need to extract is not in the raw HTML you get from a plain HTTP request. It may be generated dynamically (using AJAX and JavaScript, as in our test case). Though it is still possible to get this data with HTTP requests alone (by analyzing the traffic and the JavaScript code that processes the data), it’s often much easier to let a web browser do it for you. In this case WebDriver comes to the rescue.
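With dynamic pages the practical question is when the data is actually there. A minimal sketch of one way to handle it, assuming the Selenium Python bindings: keep asking the browser for matching elements until some appear or a timeout expires. The helper name wait_for_texts and its parameters are invented for this example. Selenium also ships its own WebDriverWait class for this purpose, which is the usual choice; the hand-written loop below only keeps the sketch free of imports.

```python
import time

# Same string value as selenium's By.CSS_SELECTOR constant.
CSS_SELECTOR = "css selector"


def wait_for_texts(driver, selector, timeout=10.0, poll=0.5):
    """Poll the page until `selector` matches, then return the elements' texts."""
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches.
        elements = driver.find_elements(CSS_SELECTOR, selector)
        if elements:
            return [element.text for element in elements]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element matched {selector!r} within {timeout} seconds"
            )
        time.sleep(poll)


# Usage, after driver.get(...) on a page that fills a list via AJAX
# (the selector is hypothetical):
#
#     names = wait_for_texts(driver, "ul#results li", timeout=15)
```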

4. WebDriver is able to take screenshots of the webpage

If you need to see what a web page looks like, you need a browser that can render it. WebDriver is a very convenient way to get those screenshots when you need them.
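Taking a screenshot might look like the sketch below. It assumes the Selenium Python bindings, whose driver objects offer get_screenshot_as_png; the helper name capture_screenshot and the file path in the usage comment are made up for illustration.

```python
from pathlib import Path


def capture_screenshot(driver, url, out_path):
    """Load `url` and save a PNG of the rendered page to `out_path`."""
    driver.get(url)
    # The browser returns the screenshot as raw PNG bytes.
    png_bytes = driver.get_screenshot_as_png()
    path = Path(out_path)
    # Create the target folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(png_bytes)
    return path


# Usage with a real browser session:
#
#     capture_screenshot(driver, "https://example.com/", "shots/example.png")
```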

The Drawbacks

1. The program becomes quite large

Even if you need to scrape a small portion of data, your program needs to be linked with all the Selenium WebDriver libraries (about 4-5 MB in total), and a driver executable needs to be installed for each browser you want to use during scraping (that may be about another 6 MB, at least in the case of ChromeDriver). Therefore your program may grow from 10 KB to 10 MB!

2. A browser application needs to be started

When you use WebDriver to scrape web pages you load a whole web browser into the system memory. This not only takes time and consumes system resources, but may also cause your security software to react (and even prevent your program from running).

3. The scraping process is slower

Since a browser waits until the whole web page is loaded, and only then allows you to access its elements, the scraping process may take longer in comparison with making simple HTTP requests to the web server.

4. The browser generates more network traffic

Web browsers load a lot of supplementary files that may be of no value to you (like CSS, JavaScript and image files). This may generate far more traffic than requesting only the resources you really need (using separate HTTP requests).

5. The scraping can be detected by such simple means as Google Analytics

If you scrape too many pages using WebDriver, you can be easily detected by any JavaScript-based traffic-tracking tool (like Google Analytics). The web site owner does not even need to install any sophisticated bot detection mechanism!

Conclusion

All the drawbacks mentioned above follow from the fact that Selenium WebDriver is not primarily intended for web scraping (its sphere is browser automation). Still, as web scraping specialists, we can benefit greatly from having it in our tool set as a powerful scraping tool. It is really not hard to integrate it into almost any web scraping solution written in Java, C#, Ruby, Python, JavaScript (Node.js) and even PHP, but in the end it is up to you whether to use it or not. I hope that this article will be helpful in making the right decision.