The present trends in web scraping tools

Recently I got a question from one of the blog readers. After I replied to it, I decided to share it with a wider audience.
Question:

Hi,

I found your scraping.pro site and found it very helpful, then realized the web scraper solutions rating was from 2014.  What is the best solution for today?   I have lots of sites I need to scrape, mainly search then drill-down sites.   I would like to be able to schedule the scraping to run on a daily basis.  Is there a direction you could point me?  I’m a seasoned developer by trade but am seeing all these point and click solutions (e.g. import.io) and am wondering if I should stick with Node.JS or .NET or if I should investigate some of these GUI scrapers of today.

more…

Octoparse – a scraping tool designed for non-programmers

Octoparse is an easy and powerful visual web scraper enabling anyone, even those without much programming background, to collect and extract data from the web. Octoparse is designed in a way to help users easily deal with complex website structures, such as those with JavaScript; it can be compared to other web scraping tools such as Import.io and Mozenda.

Octoparse 2nd Anniversary Sale – Up to 40% Off!

more…

Php Curl download file

We want to show how one can make a Curl download file from a server. See comments in the code as explanations.

 

Hotel: scrape prices, Q&A

Question

I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2

I am a small hotel owner and want those info quite often, and hope I can do it with codes automatically in someway.  You are expert in this field, what is the easiest ways to get those information?  Can you give me some example codes? more…

Dexi.io – how to improve performance

dexi-improve-speedIntro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites which require running JavaScript code in order to generate a user-facing content. It will also be harder for anti-bot mechanisms to detect and block it. more…

Back to top