Recently I received a question in my mailbox about scraping data aggregator sites (aka yellow pages), or business directories.
I replied to him directly, but our conversation on business directories was an interesting one that I thought you guys would find useful.
Here’s the question:
I am interested in scraping the database in such a website www.1881.no. My guess is that I would need a webdriver, like Selenium to do the job. I am very newbie to this field, but I believe if given some pointers, I can get some data out.
Could you please provide me with pointers on how to extract data from this website.
As a generic answer, I’ll provide you with some basics of scraping such business (and private life) directories.
First of all, you need to be clear about the characteristics of data aggregators:
- These kinds of services aggregate a huge amount of data, and the total is often hard to estimate. Most likely you will need to develop and run a special script just to get a rough idea of how much data the site holds.
- You need to query the site to fetch the data, since there are no predefined pages to crawl. An example query is http://www.1881.no/?query=car. So you should prepare a suitable list of search terms, e.g. [car, pizza, home, etc.], to query against the site. Study up on GET and POST HTTP requests.
- The data in these aggregators changes over time, so you should set up a scraper/crawler to detect outdated info. That usually involves a special algorithm and is thus much harder than a straightforward scrape.
- These kinds of sites are especially vigilant and use anti-scraping measures to prevent data leaks. So be ready for unexpected pitfalls and seemingly unbreakable firewalls. You might want to read some of my previous posts about anti-scraping tools to get a better understanding of some of them.
- Because the amount of data on these sites is so huge, you’ll want to store it in an appropriate database. Setting up a database will make sure your data is easily accessible later.
- To remain undetected (unbanned) by such aggregators you’ll need to adopt these two scraping methods:
  - IP-proxying. See some posts on using IP-proxying with scraping software, especially Reliable rotating proxies for business directories scrape.
  - Imitating human behaviour by using some browser automation tools (Selenium, iMacros and others).
- There are some off-the-shelf scraping tools that are pretty well suited for such fine and tedious tasks. You might look at an example of such scraping software that accommodates a free proxy network account for scraping business directories.
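
To make a few of the points above more concrete, here is a minimal Python sketch of the bookkeeping around such a scrape: turning search terms into query URLs, rotating through a proxy pool, and keeping results in a database with a timestamp so stale records can be found later. It deliberately leaves out the actual fetching and parsing, since that depends on the site's markup. The proxy addresses, the `listings` table and its `name`/`phone` columns, and all function names are assumptions made up for illustration, and the timestamp check is only a simple stand-in for a real outdated-data algorithm.

```python
import itertools
import sqlite3
import time
from urllib.parse import urlencode

BASE_URL = "http://www.1881.no/"         # query endpoint from the example above
SEARCH_TERMS = ["car", "pizza", "home"]  # sample terms; extend for real coverage
# Hypothetical proxy pool; replace with addresses from your proxy provider.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]


def build_query_url(term):
    """Return the GET URL that searches the directory for one term."""
    return BASE_URL + "?" + urlencode({"query": term})


def plan_requests(terms, proxies):
    """Pair every search URL with a proxy, rotating through the pool."""
    pool = itertools.cycle(proxies)
    return [(build_query_url(term), next(pool)) for term in terms]


def open_store(path=":memory:"):
    """Open (or create) the SQLite store; the schema is an assumed example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        " name TEXT, phone TEXT, last_seen REAL,"
        " PRIMARY KEY (name, phone))"
    )
    return conn


def save_listing(conn, name, phone, seen_at=None):
    """Insert a record, or refresh last_seen if it is already stored."""
    seen_at = time.time() if seen_at is None else seen_at
    conn.execute(
        "INSERT OR REPLACE INTO listings (name, phone, last_seen)"
        " VALUES (?, ?, ?)",
        (name, phone, seen_at),
    )
    conn.commit()


def stale_listings(conn, max_age_seconds, now=None):
    """Return (name, phone) rows not seen within max_age_seconds."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT name, phone FROM listings WHERE last_seen < ? ORDER BY name",
        (now - max_age_seconds,),
    )
    return rows.fetchall()
```

In a real run you would loop over `plan_requests(SEARCH_TERMS, PROXIES)`, fetch each URL through its paired proxy (with pauses between requests), parse the results, and call `save_listing` for every record found. Records returned by `stale_listings` are candidates for a re-check.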