In this article I’d like to review a few well-known methods of protecting website content from automated scraping. Each one has its advantages and disadvantages, so you need to choose based on your particular situation. None of these methods is foolproof, and each has workarounds, which I mention below.
1. IP-address ban
The easiest and most common way to detect scraping attempts is to analyze the frequency of requests to the server. If requests from a certain IP address arrive too often or in too great a number, the address may be blocked, and the visitor is often asked to solve a captcha to get unblocked.
The most important thing in this protection method is to find the boundary between a normal frequency and number of requests and a scraping attempt, so that ordinary users are not blocked. This boundary can usually be determined by analyzing the behavior of ordinary users.
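The frequency check described above can be sketched as a sliding-window counter per IP address. This is only a minimal illustration, not how any particular site does it: the class name `SlidingWindowLimiter`, its parameters and the threshold values are assumptions made for the example, and a real threshold should come from observing ordinary users.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flags an IP once it exceeds max_requests within window_seconds.

    Hypothetical helper: the default limits are illustrative, not recommended values.
    """

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Record one request; return False if the IP is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller could block the IP or show a captcha here
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` on every request and fall back to a captcha page when it returns `False`.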
Example: Google is a good example of this method: it tracks the number of requests from a given IP address, warns that the address may be blocked, and prompts you to enter a captcha.
Tools: Some services (like distilnetworks.com) let you automate the tracking of suspicious activity on your site and even offer to verify users with a captcha.
Bypassing: This protection can be bypassed by using multiple proxies to hide the scraper’s real IP address. Examples are BestProxyAndVPN, which offers affordable proxies, and SwitchProxy, which is more expensive but is designed specifically for automated scrapers and withstands heavy loads. Another option is to use a rotating proxy service.
2. Using different accounts
With this protection method, the data can be accessed by authorized users only. This makes it easier to monitor user behavior and to block suspicious accounts regardless of the IP address the client is working from.
Example: Facebook is a good example, as it constantly monitors user activity and blocks suspicious accounts.
Bypassing: This protection can be bypassed by creating a set of accounts, including automatically created ones. Some services (like bulkaccounts.com) sell accounts on well-known social networks. Verifying an account by phone (a so-called PVA, Phone Verified Account) can make automatic account creation considerably harder, although this too can be bypassed using disposable SIM cards.
3. Usage of CAPTCHA
This is another popular way of protecting data from web scraping. The user is asked to type the captcha text to get access to the website. The significant disadvantage of this method is the inconvenience to regular users, who are forced to enter captchas. It is therefore mostly applicable to systems where data is accessed infrequently and through individual requests.
Example: Services that check a website’s position in the SERP (e.g. http://smallseotools.com/keyword-position/) are a good example of using captcha to prevent automated querying.
Bypassing: Captcha can be bypassed using captcha-recognition software and services. These fall into two main categories: automatic recognition without human labor (OCR, such as GSA Captcha Breaker) and recognition by human workers who process image-recognition requests online (for example, the Bypass CAPTCHA service). The human-based option is usually more effective, but you pay per captcha recognized, whereas software is a one-time purchase.
Example: Facebook is a good example of this type of protection against web scraping.
EDIT: As Us0r noted in the comments, this can also be bypassed.
5. Frequent update of the page structure
One of the most effective ways to protect a website against automated scraping is to change its structure frequently. This can apply not only to the names of HTML element identifiers and classes, but even to the entire hierarchy. It makes writing a scraper very complicated, although it also burdens the website code and, sometimes, the entire system.
On the other hand, these changes can be made manually once a month (or once every several months). Even that makes scrapers’ lives tough.
Bypassing: To bypass protection like this, a more flexible and “intelligent” scraper is required, or the scraper simply has to be corrected manually whenever these changes occur.
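One possible way to rotate class names without editing templates by hand is to derive them from a secret and the current period. The sketch below is an assumption-laden illustration, not a method described in this article: the function names `obfuscated_class` and `render_price`, the `secret` value and the period format (such as a year and month string) are all hypothetical.

```python
import hashlib
import hmac
import html


def obfuscated_class(logical_name, secret, period):
    """Map a stable logical name to a class name that changes each period."""
    message = f"{period}:{logical_name}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Prefix with a letter so the result is always a valid CSS class name.
    return "c" + digest[:10]


def render_price(price, secret, period):
    """Render a price fragment whose class name depends on the period."""
    cls = obfuscated_class("price", secret, period)
    return f'<span class="{cls}">{html.escape(str(price))}</span>'
```

The stylesheet would have to be generated with the same function, so that the site’s own CSS keeps matching while a scraper’s hard-coded selectors break at each rotation.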
6. Limitation of the frequency of requests and downloadable data allowance
This makes scraping large amounts of data very slow and therefore impractical. At the same time, the restrictions must take the needs of ordinary users into account, so that they do not reduce the overall usability of the site.
Bypassing: This can be bypassed by accessing the website from different IP addresses or accounts (simulating multiple users).
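A common way to implement such a limit is a token bucket kept per user or per IP address. The following is a minimal sketch under stated assumptions: the class name `TokenBucket`, the refill rate and the capacity are illustrative, and `cost` could stand for a request count or for kilobytes downloaded.

```python
import time


class TokenBucket:
    """Allows short bursts up to capacity, then throttles to rate_per_second."""

    def __init__(self, rate_per_second, capacity, now=None):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def try_consume(self, cost=1.0, now=None):
        """Spend `cost` tokens if available; return False when throttled."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets how large a burst an ordinary user can make before being slowed down, which is the usability trade-off mentioned above.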
7. Displaying important data as images
This method of content protection makes automatic data collection more complicated while keeping the content visible to ordinary users. Images often replace prices, e-mail addresses and phone numbers, and some websites even manage to replace random letters in the text. Although nothing prevents you from displaying a website’s entire content in graphic form (e.g. using Flash or HTML5), doing so can significantly hurt its indexability by search engines.
Drawbacks: The downsides of this method are that not all the content is indexed by search engines and that users cannot copy the data to the clipboard.
Bypassing: This protection is hard to bypass, since it requires automatic or manual image recognition, similar to that used for captchas.