As the biggest scraper of all, Google itself doesn't like it when somebody scrapes it. This makes life difficult for Google scrapers.
In this post I offer several hints on how to scrape Google safely (if you have still decided to do it).
The first thing Google scrapers need is a reliable proxy source, which lets you change your IP address. It goes without saying that any proxy you choose needs to be highly anonymous. You also need to be certain that the proxy is fast and that it has not previously been used to abuse Google.
Use anywhere from 50 to 150 proxies for continuous scraping. The exact number depends on the average size of the result set for each search query. Some projects will inevitably require more proxies.
Choose the right time to change your IP; this is critical to scraping successfully. If you are receiving 300 to 1,000 results for each keyword, change your IP after every keyword switch. If you are receiving fewer than 300 results, a single IP can be used to scrape several keywords. However, you may need to add a delay or increase the number of proxies you are using.
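The rotation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `ProxyRotator` class, its method names, and the proxy addresses are hypothetical, and the 300-result threshold is simply the rule of thumb from the paragraph above.

```python
from itertools import cycle

# Rule of thumb from the text: rotate once a keyword returns 300+ results.
ROTATE_THRESHOLD = 300


class ProxyRotator:
    """Hypothetical helper that decides which proxy serves the next keyword."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)       # endless round-robin over the pool
        self.current = next(self._pool)   # proxy used for the first keyword

    def after_keyword(self, result_count):
        """Call when a keyword is finished; returns the proxy for the next one."""
        if result_count >= ROTATE_THRESHOLD:
            self.current = next(self._pool)
        return self.current


# Placeholder addresses, not real proxies.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])
```

A keyword that returned 120 results keeps the current proxy, while one that returned 300 or more moves on to the next proxy in the pool.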
Clear all of your cookies after every IP change, or disable them entirely.
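One way to guarantee that no cookies survive an IP change is to build a brand-new client, with an empty cookie jar, for every proxy. The sketch below uses only the Python standard library; the function name `opener_for_proxy` and the proxy addresses are assumptions made for illustration, and nothing is sent over the network here.

```python
import urllib.request
from http.cookiejar import CookieJar


def opener_for_proxy(proxy_url):
    """Build an opener bound to one proxy with its own, empty cookie jar."""
    jar = CookieJar()  # fresh jar, so nothing carries over from the last IP
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPCookieProcessor(jar),
    )
    return opener, jar


# Placeholder addresses, not real proxies.
opener_a, jar_a = opener_for_proxy("http://proxy-a.example:8080")
opener_b, jar_b = opener_for_proxy("http://proxy-b.example:8080")
```

Discarding the old opener on every IP change has the same effect as clearing cookies, and it avoids having to remember to do so.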
Google scrapers should never use threads unless they are needed. Threads here means multiple scraping processes running at the same time. It is possible to scrape millions of results every day without threads.
Add &num=100 to the search URL to raise the number of results per page to the maximum of 100.
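As a small illustration, the URL can be assembled with the standard library so that the query is escaped properly. The function name `build_search_url` is hypothetical, and the base URL and the `q` parameter are assumed here; only `num=100` comes from the tip above.

```python
from urllib.parse import urlencode

# Assumed base URL for the search endpoint.
SEARCH_URL = "https://www.google.com/search"


def build_search_url(query, num=100):
    """Return a search URL that asks for `num` results on one page."""
    return SEARCH_URL + "?" + urlencode({"q": query, "num": num})


url = build_search_url("web scraping")
```

`urlencode` takes care of spaces and special characters, so keywords can be passed in as plain text.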
Append other keywords to your main search. Google makes it difficult to obtain more than 1,000 results for a single topic. However, by running the main search again with different keywords appended, it is possible to obtain almost all URLs.
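The keyword trick can be sketched as two steps: generate one query per appended keyword, then merge the result lists while dropping duplicates. Both function names and the sample data below are made up for illustration, and which keywords to append depends entirely on the topic.

```python
def expand_query(main_query, extra_keywords):
    """Return one search query per appended keyword."""
    return [f"{main_query} {keyword}" for keyword in extra_keywords]


def merge_unique(result_lists):
    """Merge several lists of URLs, keeping first-seen order without repeats."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


queries = expand_query("proxy server", ["free", "list", "review"])

# Made-up result sets standing in for two scraped queries.
urls = merge_unique([
    ["http://a.example/", "http://b.example/"],
    ["http://b.example/", "http://c.example/"],
])
```

Each expanded query gets its own result set, so the combined list can grow well past what the main search alone returns.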
To scrape reliably, avoid getting gray- or blacklisted. Never send more than 500 requests per IP address during a 24-hour period.
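The 500-requests-per-day limit is easy to enforce with a small counter per IP. The following is a minimal sketch, assuming a sliding 24-hour window; the class name `PerIpBudget` and its interface are hypothetical, and the clock is injectable only so the logic can be checked without waiting a day.

```python
import time
from collections import deque


class PerIpBudget:
    """Allow at most `limit` requests per IP within a sliding time window."""

    def __init__(self, limit=500, window_seconds=24 * 60 * 60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent = {}  # ip -> deque of timestamps of recent requests

    def try_acquire(self, ip):
        """Record a request for `ip` and return True, or return False if over budget."""
        now = self._clock()
        sent = self._sent.setdefault(ip, deque())
        # Forget requests that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.limit:
            return False
        sent.append(now)
        return True


budget = PerIpBudget()  # defaults follow the limit given in the text
```

Spread evenly, 500 requests in 24 hours works out to one request every 172.8 seconds per IP, so with 100 proxies the ceiling is 50,000 requests per day.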
If you get a captcha or a virus warning, stop what you are doing right away. A captcha indicates that Google has detected your scraping activity. Increase the number of proxies. If you are already using more than 100, it might be necessary to switch to a different source for your IPs. Use the private proxy source listed above. With these precautions, it is possible to scrape Google constantly without ever being detected.
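The "stop right away" rule can be built into the scraper so that it never keeps hammering after a warning. This is a rough sketch only: the status codes and text markers below are assumptions about what a block page looks like, not a definitive list, and the names `ScrapeBlocked`, `looks_blocked`, and `check_response` are made up for illustration.

```python
class ScrapeBlocked(Exception):
    """Raised when a response looks like a captcha or block page."""


# Assumed signals of a block; adjust them to what you actually observe.
BLOCK_STATUS_CODES = (429, 503)
BLOCK_MARKERS = ("captcha", "unusual traffic", "/sorry/")


def looks_blocked(status_code, final_url, body):
    """Return True if the response carries any of the assumed block signals."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    haystack = (final_url + " " + body).lower()
    return any(marker in haystack for marker in BLOCK_MARKERS)


def check_response(status_code, final_url, body):
    """Return the body unchanged, or raise ScrapeBlocked so the run halts."""
    if looks_blocked(status_code, final_url, body):
        raise ScrapeBlocked(f"possible captcha or block at {final_url}")
    return body
```

Letting the exception propagate to the top of the scraping loop ends the whole run, which is the behaviour the tip asks for; the IP that triggered it can then be retired before starting again.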