web scraping

Crawling web pages with Netpeak Spider in conjunction with NetNut and GeoSurf proxies

Agreed, it’s hard to overestimate the importance of information: “Master of information, master of situation”. Nowadays, we have everything we need to become a “master of situation”, including tools such as spiders and parsers that can scrape various data from websites. Today we will look at scraping Amazon with a web spider equipped with proxy services. more…

Web Scraping with Node.js

Web scraping has been steadily growing in popularity for years now. Freelance sites are overcrowded with orders for this controversial data extraction process. Today we will combine two new and revolutionary directions in web development. So, let’s consider an elegant and modern way to scrape data from websites with Node.js! more…

JavaScript rendering library for scraping JavaScript sites

Can you imagine how many scraping instruments are at our service? Though it has a long history, scraping has at last become a multi-language and simple approach. Unfortunately, there is still a list of non-trivial tasks which can’t be resolved in a snap.

One of these tasks is scraping JavaScript sites, those that output data using JavaScript. Faced with this task, classic scrapers (not all of them, though) ignore JS-generated data and carry on with their own life cycle. However, when this little defect becomes big trouble, developers all over the world take measures. And they did it! Today we consider one of the most awesome tools for scraping JS-generated data: Splash. more…
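As a rough illustration of how a rendering service like Splash is usually driven, here is a minimal Python sketch. It is not taken from the post itself. It assumes a Splash instance listening on its default port 8050 on localhost and uses Splash's `render.html` HTTP endpoint with the `url` and `wait` parameters. The helper names `splash_render_url` and `fetch_rendered_html` are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=2.0):
    """Build a request URL for Splash's render.html endpoint.

    splash_base is an assumption: a local Splash instance on its default port.
    wait is the number of seconds Splash lets the page's JavaScript run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, **kwargs):
    """Ask Splash to load the page, run its JavaScript, and return the HTML.

    Requires a running Splash instance; nothing is fetched until this is called.
    """
    with urlopen(splash_render_url(page_url, **kwargs), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned HTML already contains the JS-generated data, so it can be handed to any ordinary HTML parser afterwards.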

Octoparse 7.0 – a free web scraping tool for non-developers

Octoparse has recently launched a brand new version, 7.0, which has turned out to be the most revolutionary upgrade in the past two years, with not only a more user-friendly UI but also some advanced features that make web scraping even easier. In this post, I will walk through some of the new features and changes in this version, with respect to how a beginner, even one without any coding background, can approach this web scraping tool. more…

Web Scraping with Java and HtmlUnit

Web scraping or crawling is the act of fetching data from a third party website by downloading and parsing the HTML code to extract the data you want. It can be done manually, but generally this term refers to the automated process of downloading the HTML content of a page, parsing/extracting the data, and saving it into a database for further analysis or use. more…
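The download, parse and save steps described above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the Java/HtmlUnit approach of the post. The table name `links`, the choice of extracting anchor links, and all helper names are assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1: download the HTML content of a page (not called automatically)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect (href, text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save_links(conn, links):
    """Step 3: save the extracted data into a database for later analysis."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    conn.executemany("INSERT INTO links (href, title) VALUES (?, ?)", links)
    conn.commit()
```

A typical run would chain the three steps, for example `save_links(sqlite3.connect("scrape.db"), extract_links(download(url)))`, where `scrape.db` is a placeholder file name.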

The present trends in web scraping tools

Recently I got a question from one of the blog readers. After I replied to it, I decided to share it with a wider audience.
Question:

Hi,

I found your scraping.pro site and found it very helpful, then realized the web scraper solutions rating was from 2014.  What is the best solution for today?   I have lots of sites I need to scrape, mainly search then drill-down sites.   I would like to be able to schedule the scraping to run on a daily basis.  Is there a direction you could point me?  I’m a seasoned developer by trade but am seeing all these point and click solutions (e.g. import.io) and am wondering if I should stick with Node.JS or .NET or if I should investigate some of these GUI scrapers of today.

more…

Octoparse – a scraping tool designed for non-programmers

Octoparse is an easy and powerful visual web scraper enabling anyone, even those without much programming background, to collect and extract data from the web. Octoparse is designed in a way to help users easily deal with complex website structures, such as those with JavaScript; it can be compared to other web scraping tools such as Import.io and Mozenda.

Octoparse 2nd Anniversary Sale – Up to 40% Off!

more…

Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript code in order to generate user-facing content. It is also harder for anti-bot mechanisms to detect and block. more…

Octoparse – March 2017 Release

Octoparse is a new, modern, visual web data extraction tool. Its team has always been committed to providing users with a more professional data scraping service and to making it one of the most popular web scraping tools.

Version 6.4.1 of the tool was released in March 2017, with some new features and a much faster and better user experience. more…
