Over the last year or two, visual web scrapers have matured considerably. New companies like ParseHub, ScrapingHub and Kimono are bringing new tools to market, while industry veterans like OutWit Hub, Visual Web Ripper and Mozenda continue to refine their tooling for annotating/training scrapers and extracting web data.
Interestingly, something has now changed. Import.io has created a new tool that is only a little different on the surface but, having spoken to the team, a LOT different under the hood.
Crawlera by ScrapingHub
I came across this tool a few weeks ago and wanted to share it with you. I have not tested it myself yet, but the concept is simple: safely download web pages without the fear of overloading websites or getting banned. You write a crawler script using ScrapingHub, and they route it through their IP proxies and take care of the technical problems of crawling.
What is Crawlera?
Crawlera is a smart HTTP/HTTPS downloader designed specifically for web crawling and scraping. It routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains or have other problems. As a scraping user, you no longer have to worry about tinkering with download delays, concurrent requests, user agents, cookies or referrers to avoid getting banned; you simply use Crawlera to download pages instead. Some plans provide a standard HTTP proxy API, so you can configure it in your crawler of choice and start crawling.
Using Crawlera, you should be able to mitigate the problems and overheads associated with crawling websites, which is the beauty of this tool for ScrapingHub users.
- No need to think about the number of IPs or delays; just fetch clean pages as fast as possible
- Automatic retrying and throttling to crawl politely and prevent bans
- Add your own proxies if you need more bandwidth
- HTTPS support
- POST support
- HTTP proxy API for seamless integration
The service is also available via Mashape.com as an API.
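Since Crawlera exposes a standard HTTP proxy API on some plans, pointing an existing crawler at it is mostly a matter of configuring a proxy URL. Below is a minimal sketch using only the Python standard library; the host, port, and API-key-as-username convention are assumptions for illustration, so check your Crawlera account for the actual connection details.

```python
import urllib.request

def crawlera_proxy_url(api_key, host="proxy.crawlera.com", port=8010):
    """Build the proxy URL, with the API key as the username (assumed convention)."""
    return "http://%s:@%s:%d" % (api_key, host, port)

# Route all HTTP/HTTPS traffic through the proxy endpoint.
proxy = crawlera_proxy_url("YOUR_API_KEY")
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy, "https": proxy})
)
# page = opener.open("http://example.com").read()  # fetched via the rotating IP pool
```

Because it is just a proxy, the same one-line configuration works in curl, Scrapy, or any other HTTP client that honors proxy settings.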
As I've said, I have not used this service myself. If you want to try it, there is a pricing tool on their site you can use to estimate costs.
If you have experience using this tool, please leave a comment with your experiences.
This is part 1 of a series dedicated to getting novices started using a simple web scraping framework using python.
In this post we will get up and running with simple web scraping using Python, specifically the Scrapy Framework. more…
Web scraping data platform import.io announced last week that it has secured $3M in funding from investors that include the founders of Yahoo! and MySQL.
They also released a new beta version of the tool that is essentially a better version of their extraction tool, with some new features and a much cleaner and faster user experience. more…
ProxyMesh is another rotating anonymous proxy service that helps users stay anonymous via a network of continuously rotated IP proxy servers. It requires no software download and can easily be used in conjunction with the Visual Web Ripper software. more…
I often receive requests asking about email crawling. It is evident that this topic is quite interesting for those who want to scrape contact information from the web (like direct marketers), and previously we have already mentioned GSA Email Spider as an off-the-shelf solution for email crawling. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. This crawler is simple, but you can learn many things from this example (especially if you’re new to scraping in Python). more…
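The core of such a crawler is the extraction step: pulling address-like strings out of fetched page text. Here is a minimal, self-contained sketch of that step; the regex and the `extract_emails` helper are illustrative choices, not the exact code from the linked article, and a real crawler would pair this with a URL frontier and polite fetching.

```python
import re

# A deliberately simple pattern for address-like strings; it will not
# cover every valid email form, but works for common cases.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(text):
    """Return the unique email-like strings found in text, sorted."""
    return sorted(set(EMAIL_RE.findall(text)))

sample = "Contact sales@example.com or support@example.org for details."
print(extract_emails(sample))  # ['sales@example.com', 'support@example.org']
```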
Recently, import.io (a free online scraping tool) announced that they are adding another way to get data from the web: they’ll build it for you. This new “Data as a Service” program is targeted at businesses and organizations who need data but don’t have the time or resources to devote to using the import.io tool to build it themselves. For these clients, import.io will curate custom datasets based on their specific requirements, as well as develop custom data implementation solutions based on the organization’s in-house software. more…
Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services. more…
Recently I was asked to look at a brand-new online regex tester, regviz.org, developed as a collaboration of VISUS, University of Stuttgart and University of Trier. Though there are a lot of online regex testers on the market today, and many of them are quite good, let’s look at what is special about regviz.org and what it lacks. more…
For over four decades now, Relational Database Management Systems (RDBMS) have dominated the enterprise market. However, this trend seems to be changing with the introduction of NoSQL databases. In this article, we are going to highlight practical examples where NoSQL systems have been deployed. We will also go further and point out other applications where implementation of such systems might be necessary. more…