Web data extraction is critical to the online operations of companies across the globe. As more data is scraped every day, websites increasingly implement techniques to block extraction efforts.

Blocking techniques

Common IP-based blocking techniques include tracking IP location and blocking entire geolocations.

Some sites even block data center IPs altogether by purchasing lists of known data center IP addresses and denying or flagging them outright.  
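As a rough illustration of how a site might check visitors against such a list, here is a minimal Python sketch. The ranges and the function name are assumptions made for the example (the ranges are reserved documentation addresses, not real data center blocks), and real sites may work differently.

```python
import ipaddress

# Hypothetical block list. These are reserved documentation ranges,
# standing in for a purchased list of data center ranges.
DATA_CENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_data_center_ip(address: str) -> bool:
    """Return True if the address falls inside any listed range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DATA_CENTER_RANGES)


if __name__ == "__main__":
    print(is_data_center_ip("192.0.2.17"))   # inside a listed range
    print(is_data_center_ip("203.0.113.5"))  # not listed
```

A request from a residential IP would not match any listed range, which is why switching IP type helps against this kind of block.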

Other blocking techniques include rate-limiting, which caps the number of requests allowed per IP per second, and bot detection, which inspects the user agent to differentiate crawlers from real users.
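To make rate-limiting concrete, the sketch below shows one way a site could count requests per IP inside a one-second window. The class name, parameters and sliding-window approach are assumptions for illustration, not a description of any particular site.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this IP is over its limit
        hits.append(now)
        return True
```

Because the counter is kept per IP, a client that changes IP between requests never accumulates enough hits to be refused.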

In order to pass these obstacles, start by using the right proxy network

To overcome IP tracking and data center IP blocks, begin by using a residential proxy network.

Residential IP – an IP address assigned by an Internet Service Provider to a user.

Data center IP – a static IP sold by companies that own servers holding many consecutive IP addresses.

Residential IPs

Luminati provides residential IPs in any country or city in the world, allowing you to emulate a real user in any location and helping you overcome many common IP-based blocking techniques.


Software

The right software is the next step to overcome more sophisticated blocking techniques.

The Luminati Proxy Manager (LPM) is free, open-source software created with built-in scraping features. These features, if set up correctly, automatically overcome common request-based blocking techniques such as rate-limiting, and bot-blocking techniques such as fingerprint detection.

This is accomplished with built-in features that automate:

  • IP rotation
  • Auto Retry
  • Limiting requests
  • Routing Requests
  • Bandwidth reduction
  • Random User-Agent
  • Override headers

The LPM can be installed using any of the following:

Windows Installer: https://luminati.io/lpm#installation

BASH install script (Mac OS/Linux): curl -L https://luminati.io/static/lpm/luminati-proxy-latest-setup.sh | bash

NPM Package: sudo npm install -g @luminati-io/luminati-proxy

Docker Image:  docker pull luminati/luminati-proxy

GitHub Source Code: https://github.com/luminati-io/luminati-proxy

 

To overcome geolocation blocking, send requests through the residential proxy network using a country-targeted IP. If an issue arises, such as a 4xx or 5xx error code, the same request can be retried automatically through the residential proxy network using a city-targeted IP.

Issues can include any unwanted:

  • Status code
  • URL
  • Body element
  • Request time

All of these are handled automatically by Luminati Proxy Manager rules, which pair a trigger (the issue) with an action to take when that issue arises.

The actions that can be taken consist of:

  • Retry with new IP
  • Retry with new port
  • Ban IP
  • Ban IP per Domain
  • Refresh IP
  • Save IP to a reserve pool
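The trigger-and-action idea can be sketched in a few lines of Python. This is not the LPM's rule format; the Response and Rule types, the thresholds and the action labels are all hypothetical and exist only to show how a trigger on status code, URL, body element or request time could map to an action.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    status: int
    url: str
    body: str
    elapsed_seconds: float


@dataclass
class Rule:
    name: str
    trigger: Callable[[Response], bool]  # returns True when the issue is present
    action: str                          # hypothetical action label


# Illustrative rules, one per trigger type listed above.
RULES = [
    Rule("bad status", lambda r: r.status >= 400, "retry_with_new_ip"),
    Rule("redirected to captcha", lambda r: "/captcha" in r.url, "ban_ip"),
    Rule("missing price element", lambda r: 'class="price"' not in r.body, "refresh_ip"),
    Rule("slow response", lambda r: r.elapsed_seconds > 10.0, "retry_with_new_ip"),
]


def first_action(response: Response) -> Optional[str]:
    """Return the action of the first rule whose trigger fires, else None."""
    for rule in RULES:
        if rule.trigger(response):
            return rule.action
    return None
```

Rules are checked in order, so the most important triggers would go first in a setup like this.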

 

Request-based blocking techniques

Rate-based blocking: Limits the number of requests allowed per IP per second. Using a large network that allows continuous rotation of the IP address is an easy way around this restriction. Within the LPM, go to the IP control tab and set ‘Max Request’ to 1; the IP will then rotate automatically with every request.

Bot-based blocking: Uses the user-agent to differentiate crawlers from real users. When a visitor enters a website, the site collects information about the language, operating system, screen size and more in order to deliver the right content. By paying attention to the user-agent and other request headers, this common blocking technique can be avoided altogether. Under the ‘headers’ tab in the LPM are options to use a Random User-Agent and to Override headers for every request. Click ‘yes’ beside these options and, after each request, the session is terminated and all variables are changed.
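The effect of a random user-agent combined with header overrides can be pictured with the following Python sketch. The user-agent strings are sample values and build_headers is a hypothetical helper, not part of the LPM; the LPM applies the equivalent behaviour through its own settings.

```python
import random

# Sample browser user-agent strings, used here only for illustration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(base_headers: dict, overrides: dict, rng=random) -> dict:
    """Copy the base headers, pick a random User-Agent, then apply overrides."""
    headers = dict(base_headers)
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    # Overrides are applied last, so they win over any earlier value.
    headers.update(overrides)
    return headers
```

Calling build_headers once per request gives each request a fresh combination of headers, which is the behaviour described above.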

Presets

The LPM also contains preset configurations already programmed for specific use-cases.

Round-robin config

One of the most common LPM presets is the Round-Robin configuration. To scrape specific data elements, simply set up a proxy port with the Round-Robin preset. The preset automatically creates a round-robin pool type, which rotates the IP address with every request. It also disables the ‘multiply’ options, sets ‘Pool size’ to 10 and sets ‘max requests’ to 1.

Pool size refers to the group of IPs allocated to your port, which in this preset is 10 (and can be changed as needed). These 10 IPs are rotated continuously, with each request using one IP before switching to the next. How often the IP switches is controlled by the ‘max request’ setting, which is set to 1 here but can be set to any number required.
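The interaction between pool size and max requests can be sketched as follows. RotatingPool is a hypothetical Python class written for this explanation, not LPM code; it simply cycles through a fixed group of IPs and switches after a set number of requests.

```python
class RotatingPool:
    """Cycle through a fixed pool of IPs, switching after max_requests uses."""

    def __init__(self, ips, pool_size: int = 10, max_requests: int = 1):
        if pool_size < 1 or max_requests < 1:
            raise ValueError("pool_size and max_requests must be at least 1")
        self.pool = list(ips)[:pool_size]  # keep only pool_size IPs
        if not self.pool:
            raise ValueError("at least one IP is required")
        self.max_requests = max_requests
        self._index = 0  # position of the current IP in the pool
        self._used = 0   # requests already sent with the current IP

    def next_ip(self) -> str:
        """Return the IP to use for the next request."""
        if self._used >= self.max_requests:
            # Current IP has reached its quota: move on, wrapping at the end.
            self._index = (self._index + 1) % len(self.pool)
            self._used = 0
        self._used += 1
        return self.pool[self._index]
```

With max_requests set to 1, every call returns the next IP in the pool; with a higher value, each IP is reused that many times before the switch.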

In the examples below, the request is first tried using a residential country-targeted IP and, if the request fails (returns an error code), it is automatically retried with a city-targeted IP. This is referred to as the ‘Waterfall Method’ and consists of automatically resending the same request on failure, using a different IP type.
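Independently of any LPM configuration file, the logic of the Waterfall Method can be sketched in Python. The send callable is an assumption: it stands for whatever actually performs the request through a given IP type and returns the HTTP status code. The function name and the IP type labels are also hypothetical.

```python
from typing import Callable, Sequence, Tuple


def waterfall_request(
    url: str,
    send: Callable[[str, str], int],
    ip_types: Sequence[str] = ("country", "city"),
) -> Tuple[str, int]:
    """Try each IP type in order until one returns a non-error status.

    Returns the IP type used for the last attempt and its status code.
    """
    if not ip_types:
        raise ValueError("at least one IP type is required")
    status = 0
    for ip_type in ip_types:
        status = send(url, ip_type)
        if status < 400:  # anything below 4xx counts as success here
            return ip_type, status
    # Every tier failed: report the last attempt.
    return ip_types[-1], status
```

If the country-targeted attempt succeeds, the city-targeted tier is never used, so the fallback is only exercised when it is needed.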

Here is an example of the Manual Configuration file in the LPM for the round-robin configuration with the Waterfall method:

 

Online Shopping Preset 

The Online Shopping Preset is configured for shopping pages and automatically creates a round-robin pool type, which rotates the request IP address with every request. It sets DNS to resolve remotely, generates a random user-agent for each request, creates an explanatory rule for post-processing each request to scrape the data required, and enables SSL analyzing.

Below is an example of the Manual Configuration file in the LPM for the Online Shopping preset with rules and the Waterfall method.

Wrap up

The Luminati proxy network provides 39+ million residential IPs across the globe, and with the Luminati Proxy Manager, collecting accurate worldwide pricing data is simple.

If you are interested in downloading our free, open-source Luminati Proxy Manager, the installation options are listed at https://luminati.io/lpm#installation

If you would like to learn more about opening an account with Luminati Networks please click here to be connected to one of our dedicated business development managers.