Scrapinghub Review: 4 Tools in 1

Scrapinghub is a developer-focused web scraping platform that provides tools and services for extracting structured data from online sources. The platform is built around four major tools: Scrapy Cloud, Portia, Crawlera, and Splash. We decided to try the service; in this post we review its main functionality and share our experience with Scrapinghub.

Scrapy Cloud

If you have been involved with the web scraping industry (especially on the development side), you have probably heard of Scrapy, the open source data extraction framework. With Scrapy, one can easily create, run, and manage web crawlers. For the heavy lifting around scraping (manual server operations, periodic scheduling, maintenance, etc.), Scrapinghub's Scrapy Cloud automates and visualizes your Scrapy spiders' activity.

However, Scrapy Cloud limits how you can scrape data from websites: it has some built-in tools you can use to extract information, and Portia (its UI scraping tool, covered below) offers only a limited set of features. If you host Scrapy yourself, you can use the Python-based framework directly to write and run spiders more effectively, programming them with your own conditions and filters for a more customized outcome.
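As an illustration, here is a minimal sketch of the kind of spider a self-hosted Scrapy deployment (or a Scrapy Cloud project) lets you write; the URL and CSS selectors are placeholders, not a real site:

```python
import scrapy

class FilteredSpider(scrapy.Spider):
    """Minimal spider with a custom filter condition."""
    name = "filtered"
    start_urls = ["http://example.com/products"]  # placeholder URL

    def parse(self, response):
        for product in response.css("div.product"):  # placeholder selector
            price = product.css("span.price::text").get()
            # Custom filter: skip products that have no listed price.
            if price is None:
                continue
            yield {
                "title": product.css("h2::text").get(),
                "price": price,
            }
        # Follow pagination, if the page links to a next page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Once such a project runs locally, it can be pushed to Scrapy Cloud with Scrapinghub's shub command-line tool (`shub deploy`), and the same code keeps working in the cloud.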

Scrapy Cloud Pricing

Scrapy Cloud pricing ranges from free to $350 per month.

  • The free plan allows you to run only one concurrent crawl.
  • The $25 and $50 plans support 4 concurrent crawls; this scales to 8 and 16 concurrent crawls at $150 and $350 per month respectively. Higher-priced packages come with additional benefits.
  • CPU and RAM allocations vary from plan to plan. For example, the $25/mo plan only gives you shared access to the server's RAM, while the $50/mo plan provides 1.3 GB of RAM; each plan gets a different amount of resources.
  • The free plan retains your scraped data for 7 days. You can extend this period to 120 days by purchasing any paid plan.

Portia

Web scraping normally involves coding and programming crawlers. If you are a non-coder, Portia can help you extract web content easily: this Scrapinghub tool provides a point-and-click UI for annotating (selecting) web content so it can be scraped and stored. I'll go deeper into Portia later in this post. Portia can be used within a Scrapinghub account as a free add-on.

Crawlera

During crawling, your spiders may face bans from some web servers, which is frustrating because it hampers data extraction. Scrapinghub's Crawlera is a solution to the IP ban problem: the service routes your spiders' requests through thousands of different IP addresses, drawn from a pool covering more than 50 countries. If a request gets banned from a specific IP, Crawlera retries it from another IP, so crawling keeps going. Crawlera can detect 130+ ban types, server responses, and captchas, and takes the appropriate action (changing IPs, slowing down the crawl, etc.), adapting its behavior to minimize IP bans. In the worst case (when the target server continuously rejects crawling requests), the system halts crawling.

We couldn't find out exactly how many failed extraction attempts lead to Crawlera giving up; generally it depends on the overall setup (e.g. the Splash browser timeout limit, the Scrapy Cloud package's specifications, etc.).

Crawlera supports both HTTP and HTTPS proxies. The service is available as an add-on in Scrapy Cloud, and its cost ranges from $25 to $500 per month, with negotiable enterprise pricing available.
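To give an idea of the integration effort, here is a sketch of routing a plain request through Crawlera, assuming the standard proxy endpoint and API-key-as-username authentication from Scrapinghub's documentation; the key is a placeholder:

```python
import requests

# Crawlera acts as a regular HTTP proxy: the API key is the proxy
# username, the password is empty (per Scrapinghub's docs).
CRAWLERA_APIKEY = "<YOUR_API_KEY>"  # placeholder
proxies = {"http": f"http://{CRAWLERA_APIKEY}:@proxy.crawlera.com:8010/"}

# Each request goes out through one of Crawlera's IPs; bans are
# detected and retried from another IP on Crawlera's side.
resp = requests.get("http://example.com/", proxies=proxies)
print(resp.status_code)
```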

Splash

Splash is another Scrapinghub tool: an open source JavaScript rendering service developed by the company. Web pages that rely on JS can be scraped much more reliably through the Splash browser, which can also process multiple pages in parallel.

Using Splash you can:

  • Process HTML requests
  • Write scripts in the Lua programming language for more customized browsing
  • Take screenshots, etc.

Splash also supports ad-blocking rules to speed up rendering. Splash functions run in a sandboxed environment by default, but you can disable these restrictions with a simple command. The Splash browser's default timeout is 30 seconds, which can cause problems with longer scripts and slower websites; this limit can be changed as well.
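For instance, here is a sketch of fetching a JS-rendered page through Splash's HTTP API, assuming a Splash instance running locally on its default port; the target URL is a placeholder, and the timeout parameter raises the 30-second default mentioned above:

```python
import requests

# Ask a local Splash instance (default port 8050) to render a
# JS-heavy page and return the resulting HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "http://example.com/js-heavy-page",  # placeholder
        "wait": 2,      # seconds to let on-page scripts finish
        "timeout": 60,  # raise the default 30 s limit for slow sites
    },
)
html = resp.text  # fully rendered markup, ready for parsing
```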

You can find more details about this headless, scriptable browser with an HTTP API on its official page. A premium subscription is required to use Splash on Scrapy Cloud.

My experience with Portia as a common user

I tried Portia, the GUI-based content extractor by Scrapinghub. It provides basic point-and-click tools to grab content from websites. To use Portia, first add the service as an add-on to your Scrapinghub project, then navigate to a target site inside Portia (enter the site URL in the upper input field).

Annotation

Once the web page loads, the annotation interface appears and you can start annotating its content. Clicking the “Annotate this page” button opens the site in annotation mode. Hovering the mouse over content highlights it, and clicking a highlighted item opens a small popup window with additional options (i.e. selecting a specific piece of content and assigning it to a new or existing field).

After annotating a page's contents, you need to save a sample and publish the changes in Portia's interface to get ready for the final scraping stage.

Running a Spider

Visit the Scrapinghub dashboard, open the published project, and run it using the Run Spider button at the top right corner. The button brings up a popup window where you search for the target site in Scrapinghub's database; pressing the Run button then starts the spider.

The page then shows the ongoing job's progress. Once it signals completion, the scraped data appears in the Items field as a number representing how many items were extracted; clicking that number shows the extracted dataset.
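The same dashboard steps can also be scripted. A sketch using Scrapinghub's python-scrapinghub client library (its 2.x API), with a placeholder API key, project ID, and spider name:

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("<YOUR_API_KEY>")  # placeholder key
project = client.get_project(12345)           # placeholder project ID

# Schedule the published spider; items become readable once the
# job has finished (poll job.metadata to wait for completion).
job = project.jobs.run("my-portia-spider")    # placeholder spider name
for item in job.items.iter():
    print(item)
```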

How deep is the crawling?

Once started, Portia keeps scraping content and tends to extract a site's data as deeply as it can. A Scrapinghub support person stated: “The spider can run for as long as it is able to extract the items or until the item, page or time limit you set in your project has been reached.” There is an official video on how this process works.
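If you manage the spider as a regular Scrapy project rather than through Portia, those limits map onto standard Scrapy settings; a sketch with illustrative values:

```python
# settings.py -- stop conditions for a crawl (standard Scrapy settings)
CLOSESPIDER_ITEMCOUNT = 5000   # close after this many scraped items
CLOSESPIDER_PAGECOUNT = 20000  # ...or after this many crawled pages
CLOSESPIDER_TIMEOUT = 3600     # ...or after one hour, whichever comes first
DEPTH_LIMIT = 5                # don't follow links deeper than 5 hops
```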

Extracted data can be downloaded in CSV, XML, JSON, and JSON Lines formats.
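These downloads are also exposed programmatically. A sketch assuming Scrapy Cloud's items API endpoint and format parameter (the path components and key are placeholders):

```python
import requests

# Fetch a finished job's items in a chosen format (json, jl, xml, ...).
# The path is <project_id>/<spider_id>/<job_id>; all placeholders here.
resp = requests.get(
    "https://storage.scrapinghub.com/items/12345/1/7",
    params={"format": "json"},
    auth=("<YOUR_API_KEY>", ""),  # API key as user, empty password
)
with open("items.json", "wb") as f:
    f.write(resp.content)
```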

Simple HTML markup case

Portia's UI worked well for testing-ground/blocks, yet it was not sufficient for extracting detailed info from the HTML markup: the UI failed to distinguish (put into separate fields) a description (in red) from the title item in bold above it.

Images scraping

I was able to scrape images from scraping.pro and ergonotes.com, but scraping allrecipes.com for images failed. I think this is related to on-page JavaScript that prevents image grabbing; the overall scraping process was otherwise mostly smooth.

JS on-page support toggle

To make Portia work with JS-stuffed pages, you can enable or disable JavaScript in the sidebar of the Portia interface. If you face problems with annotation, try toggling JS on and off. For example, if the annotator wrongly selects wider or narrower areas than intended, you may turn JS on to highlight and select content with the mouse cursor (similar to selecting text in MS Word).

Ebay case

I also tried to scrape data from eBay, and that experiment went well. Portia was able to catch the image in the sample item, but in the final output it provided the image URL only. Pagination seemingly worked for eBay as well: in my experiment, Portia scraped more content than the homepage contained.

I annotated a single product (an electronic gadget) as a sample item and then ran the spider. The spider made a total of 35580 requests and collected 7163 items (smartphones, accessories, and other gadgets). I checked the extracted items randomly: as mentioned earlier, images were not scraped, but the other information (title, short description, price, and URL) was collected.

The spider stopped after scraping 7163 items.

Handling Captchas

I contacted Scrapinghub support regarding captchas. They said that the service currently doesn't support solving them. With some web programming outside of Portia, you may be able to submit the captcha image [to a captcha solving service] before starting annotation. Captcha-protected web pages thus remain harder to scrape.
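Outside Portia, the usual workaround is wiring a third-party solving service into your own spider code. A rough sketch against a hypothetical solver API (the endpoint, field names, and response shape are invented for illustration):

```python
import requests

def solve_captcha(image_bytes: bytes) -> str:
    """Send a captcha image to a (hypothetical) solving service
    and return the text it recognizes."""
    resp = requests.post(
        "https://solver.example.com/solve",  # hypothetical endpoint
        files={"image": ("captcha.png", image_bytes)},
        data={"key": "<YOUR_SOLVER_KEY>"},   # hypothetical API field
    )
    return resp.json()["text"]               # hypothetical response field
```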

Handling login forms

There is a built-in login option in the Portia sidebar. I tried another site to see how login actions are handled by this UI scraping tool. I entered the login information there, but it didn't work with the testing-ground site. I then tried logging in by manually entering the experimental site's credentials in the Portia interface, and that worked well (shown at the bottom of the “Handling Login Form on Portia” screenshot).

However, on the annotation page it generated login errors like “REDIRECTING…” and “THE SESSION COOKIE IS MISSING OR HAS A WRONG VALUE!”. Since you cannot scrape information when the annotation stage is missing, Portia's login handler failed in my experiments with the test site.
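When Portia's login handler falls short, a hand-written Scrapy spider can still log in using FormRequest. A sketch with placeholder URL, form field names, and credentials:

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login-demo"
    start_urls = ["http://example.com/login"]  # placeholder

    def parse(self, response):
        # Submit the login form found on the page; the field names
        # are placeholders and must match the real form's inputs.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"wrong value" in response.body.lower():
            self.logger.error("Login failed")
            return
        # Logged in: scrape member-only content from here on.
        yield {"title": response.css("title::text").get()}
```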

Using Common Web Development Methods with Portia

I asked Portia's team whether we can take an existing Portia project (saved in our account) and alter its code files for further development. Portia's support team replied that this feature is currently only partially available: “The script is there but not yet made public as there are still some bugs to be fixed. The script will work only for Portia 2.0. For now, if required, Portia team can port the spider to scrapy spider for you.”

Pros and Cons of Scrapinghub with Portia UI

Pros

  • Scrapinghub offers the GUI-based data extraction tool Portia for free (which I liked the most).
  • Although premium-only, Crawlera can simplify web scraping by avoiding IP bans (through IP rotation), though it is not sufficient for scraping huge business directories.
  • The free plan retains extracted data in the cloud for 7 days.

Cons

  • Portia lags when initially loading the target site link and when using the annotation features.
  • Portia is a JS-based UI tool, so it is highly dependent on browser settings; page element annotation might therefore not be smooth on some sites.
  • There is no trial period available for the premium features.

Scrapinghub Support

Scrapinghub offers several support channels: email, forums, Twitter, and dashboard-based messaging. I mailed them some questions, but support didn't respond even after two weeks. I also visited their support forums: of the 15 posts on the first page, only 3 had comments, and most questions got replies only about a week after being posted. A message I sent to the Scrapinghub team via the dashboard messaging system took a day to get a response. They seem to pay more attention to development than to support. :-)

Conclusion

Scrapinghub, as a web service for web developers, is a good playground to host and run custom scrapers. I wish the service had better documentation and quicker forum support. The paid features (JS rendering with Splash, IP rotation with Crawlera) let you assess the full strength of Scrapinghub and enrich the web scraping experience.

Portia, as part of Scrapinghub's platform, is a good tool for beginners who have little knowledge of web data extraction (and no coding skills). As for complex sites, Portia is really only convenient for basic scraping: it works well on structured pages that are not stuffed with JS and other scrape-proofing.