It’s hard to overestimate the importance of information: “Master of information, master of situation”. Nowadays we have everything we need to become a “master of situation”, including tools such as spiders and parsers that can scrape all kinds of data from websites. Today we will look at scraping Amazon with a web spider that works through proxy services.
Why should we use a proxy?
A few words about the proxies we will use, and about proxies in general. A proxy server is an intermediary between you and the World Wide Web: you connect to the proxy server, and through it you surf the internet (in our case, scrape data from the web resources we need). First, some proxies can compress the data they pass on, which saves traffic. Second, a proxy gives you anonymity. If we send a lot of requests, the target site may ban the requesting IP address for a while. Fortunately, when we work through a proxy server, it is the proxy’s IP that gets banned, not yours.
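To make the idea concrete, here is a minimal Python sketch of routing requests through a proxy using only the standard library. The host, port and credentials are placeholders (assumptions, not real GeoSurf or NetNut endpoints), and no request is actually sent.

```python
import urllib.request
from urllib.parse import quote


def build_proxy_opener(host, port, user, password):
    """Return (proxy_url, opener); the opener sends HTTP and HTTPS traffic via the proxy."""
    # Percent-encode the credentials so characters like '@' or ':' do not break the URL.
    proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return proxy_url, urllib.request.build_opener(handler)


# Hypothetical proxy details, for illustration only.
proxy_url, opener = build_proxy_opener("proxy.example.com", 8080, "user", "secret")
print(proxy_url)

# With a real proxy, opener.open("https://www.amazon.com/gp/bestsellers/books")
# would make the target site see the proxy's IP address instead of yours.
```

Netpeak Spider does the same job through its proxy settings; the sketch only shows what "connecting through a proxy" means at the request level.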
In our test we will use two proxy services:
- GeoSurf – a proxy network with a wide ecosystem. Its desktop and mobile VPN clients are fast and reliable.
- NetNut – a very fast and highly secure proxy network. Its servers are located on major internet routes or at ISP network connectivity points that are completely controlled by NetNut.
Both of them provided us with a residential proxy network to use in tandem with Netpeak Spider. So it is residential proxies that we will use to scrape Amazon, a data aggregator.
Netpeak Spider at first glance
The proxy servers play a supporting role. The lead character is Netpeak Spider. This tool was developed for complex SEO analysis, and it also works well as a scraper; we will see how a little later.
We are going to scrape data from amazon.com, specifically the book bestsellers: each book’s title and its author’s name.
Before we start scraping, let’s quickly configure our project in Netpeak Spider. First, let’s connect the proxy to the spider. Once you have the login data for your proxy service, go to Settings -> Proxy and enter it. And don’t forget to check the connection!
What else should we configure in Netpeak Spider?
- Number of active threads – it defines how fast the crawler will work. The greater the number of threads, the higher the chance that the crawler will be banned by Amazon. Let it be 10 threads.
- URL – we want to crawl information only about books, so we need to set an initial URL – https://www.amazon.com/gp/bestsellers/books
- Parameters – Netpeak Spider returns a lot of information about the parsed URLs, such as response type, URL depth, head tags and so on. For now, we don’t need it. To avoid the extra load and make the spider work faster, you can leave only the status code parameter enabled in the general parameters.
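The thread setting above can be pictured as a worker pool of fixed size. The following sketch is not how Netpeak Spider is implemented; it is a small illustration that assumes a placeholder `fetch_status` function (no real requests are made) and hypothetical example.com URLs.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 10  # the same value we chose in the spider's settings


def fetch_status(url):
    """Placeholder for a real HTTP request; always reports status code 200."""
    return url, 200


# Hypothetical URLs, for illustration only.
urls = [f"https://example.com/page/{n}" for n in range(1, 26)]

# At most MAX_THREADS URLs are processed at the same time;
# pool.map returns the results in the same order as the input.
with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    results = list(pool.map(fetch_status, urls))

print(len(results), "URLs checked")
```

Raising `MAX_THREADS` speeds the crawl up, but it also means more simultaneous requests to the target site, which is exactly what increases the chance of a ban.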
Now let’s test the spider by crawling about 50 thousand URLs! The spider will collect the URLs that are linked from the starting point (the URL mentioned above) and return some information about them. Let’s check how fast the spider works with both NetNut and GeoSurf.
Even for a spider, analyzing 50 thousand links is a hard task that takes a long time. We armed ourselves with patience and waited until the spider finished working. If you don’t have much time to wait, you can save and close the project and later continue the spider’s work from the saved point. A good feature, I think.
Here is the result of the work in one picture:
Here we can see detailed information about the parsed URLs: how many of them are broken links, how many returned 3xx and 5xx codes, how many were skipped and so on. On the All Results tab you can easily get the data you need about every parsed link, for example the source code of the page or its issues.
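If you export the status codes, a summary like the one in the picture can be reproduced in a few lines. This is a minimal sketch with made-up sample codes, not the actual crawl data.

```python
from collections import Counter


def status_class(code):
    """Map an HTTP status code to its class, e.g. 301 -> '3xx'."""
    return f"{code // 100}xx"


# Made-up status codes, for illustration only.
codes = [200, 200, 301, 404, 503, 200, 302]

summary = Counter(status_class(code) for code in codes)
for cls in sorted(summary):
    print(cls, summary[cls])
```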
So, what about the results from scraping 50K URLs?
|Time spent||Scraped data amount|
|5 ½ hours||5.1 GB|
|3 ¾ hours||5.8 GB|
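To compare the two runs more easily, we can convert the times from the table into crawl speed. The sketch below assumes roughly 50,000 URLs per run, as stated above; the labels "run A" and "run B" are placeholders for the two table rows.

```python
URLS = 50_000  # approximate number of URLs crawled in each run

# Hours spent, taken from the two rows of the table above.
runs = {"run A": 5.5, "run B": 3.75}


def urls_per_second(urls, hours):
    """Average crawl speed in URLs per second."""
    return urls / (hours * 3600)


speeds = {name: urls_per_second(URLS, hours) for name, hours in runs.items()}
for name, speed in speeds.items():
    print(f"{name}: {speed:.2f} URLs per second")
```

That works out to roughly 2.5 and 3.7 URLs per second on average.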
Pause, close and play over again
That takes a lot of time, doesn’t it? 50 thousand URLs is a complex task even for the best spiders. Fortunately, our spider has a very useful feature: you can pause the site analysis, close the spider and continue its work at any time.
We have got acquainted with the spider, so now let’s do something more engaging. We will try to scrape data about every book, in particular the book’s title and author. First, open the bestsellers page and press F12 to look at its source code.
To get the book’s title and the author’s name, we need to find this information in the source code. Of course, we won’t dig through the whole code. All we need to do is click on the first icon in the toolbar (“Select an item…”) and then click on the information we need on the page. In our case, this is the book title.
Now we need to copy the element’s location for scraping. We will use XPath. Right-click on the highlighted code -> Copy -> Copy XPath. The same thing works for the author’s name.
After we’ve found the elements we need, we should configure the spider for scraping. Go to Settings -> Scraping and tick “Use HTML scraping”. Now we can add the fields that we will scrape. For each field, choose XPath, assign a name and paste the XPath expression that you’ve just copied.
Then, in “Parameters”, you will see a new checkbox called “Scraping”. Tick this checkbox. It will allow us to see the book’s title and author in the next site analysis.
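To show what an XPath field actually does, here is a small Python sketch that applies two XPath expressions to a simplified page fragment. The markup, class names and book data are invented for illustration; they are not Amazon’s real HTML, and the real expressions are the ones you copy from the browser.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (an assumption, not Amazon's real page).
SAMPLE = """
<ol>
  <li class="item">
    <div class="title">Example Book One</div>
    <span class="author">Jane Doe</span>
  </li>
  <li class="item">
    <div class="title">Example Book Two</div>
    <span class="author">John Roe</span>
  </li>
</ol>
""".strip()

root = ET.fromstring(SAMPLE)

# One XPath expression per scraped field, just like the fields in the spider's settings.
titles = [el.text for el in root.findall(".//div[@class='title']")]
authors = [el.text for el in root.findall(".//span[@class='author']")]

for title, author in zip(titles, authors):
    print(f"{title} by {author}")
```

Note that `xml.etree.ElementTree` supports only a subset of XPath and expects well-formed markup, so it is used here purely to demonstrate the principle.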
Everything seems to be fine, but in fact we have some problems. Let’s look at the links that we have parsed. In the right panel, open the tab “Site structure”.
After a little investigation, we found out that not all of the pages we parsed are bestseller pages. Strangely, you would expect the bestselling books to be located at …/gp/bestsellers/books/book-name. Instead, the books are located at …gp/product/book-name. Moreover, we wanted to parse only books, but we have parsed a lot of unnecessary links. Why?
That’s caused by Amazon’s complex site structure. The site categories are not organized in the traditional way (sitename/category/subcategory/product-name), and as a result it’s very hard to scrape such a huge site. If the structure is ordinary, the spider goes deeper from category to subcategory by default. Try Netpeak Spider on a small online store and you will see the difference.
Let’s look at our table of “book URLs”. In the Site structure tab, find “product”.
This is the list of URLs we should parse. Fortunately, that is very easy to do with Netpeak Spider. Right-click on any URL, choose Current table -> Scan table, and that’s it!
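The same selection can be expressed in code: keep only the URLs whose path starts with /gp/product/. This is a minimal sketch; the URLs below are hypothetical examples, and the `is_product_url` helper is our own name, not part of Netpeak Spider.

```python
from urllib.parse import urlparse


def is_product_url(url):
    """Return True if the URL path points to a product page (/gp/product/...)."""
    return urlparse(url).path.startswith("/gp/product/")


# Hypothetical crawl results, for illustration only.
crawled = [
    "https://www.amazon.com/gp/bestsellers/books",
    "https://www.amazon.com/gp/product/book-name-1",
    "https://www.amazon.com/gp/help/some-page",
    "https://www.amazon.com/gp/product/book-name-2",
]

book_urls = [url for url in crawled if is_product_url(url)]
print(book_urls)
```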
And here are our results! In some cases, there are even two results for the author. Click on the highlighted digit and you will see the actual result.
The test has shown that Netpeak Spider handles 50K URLs at a decent speed and with minimal effort. As you can see, spiders make scraping much easier and faster, and proxy servers make the process more reliable and more resistant to bans. Now it’s your turn: try the spider for your own needs. Change the configuration, scrape data from different sites, adjust the output and export the results to a file. And don’t forget about the proxy!