business directory

Linkedin lost in court to data analytic company that scrapes Linkedin’s public profiles info

On September 9th, 2019 the UNITED STATES COURT OF APPEALS 1 has affirmed the former district court’s determination that a certain [data] analytic company is lawful to scrape [perform automated gathering] LinkedIn’s public profiles info. Now the historical event has happened in which a court is protecting a data extractor’s right for mass gathering openly presented business directory information. more…

What are the best online resources to acquire data?

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scrape are data aggregators. Why are they top in data extraction?
They are top because they provide the fullest, most comprehensive data [sets]. The data in them are highly categorized. Therefore you do not need to crawl and fetch other resources and then combine multiple-resource data.

Those sites fall into 2 categories:

1. Goods and services aggregators. Eg. AliExpress, Amazon, Craiglist.
2. Personal data and companies data aggregators. Eg. Linkedin, Xing, YellowPages. For such aggregators another name is business directories.

The first category of sites and services is quite wide-spread. These sites and services promote their goods with the goal of being well-known online, to have as many backlinks as possible to them.

The second category, the business directories, does not tend to reveal its data to the public. These directories rather promote their brand and give scraping bots minimum opportunity for data acquiring*.

Consider the following picture where a company’s data aggregator gives to the user only 2 input fields: what and where.

what_where_data_aggregator
You can find more of how to scrape data aggregators in this post.

————–
*You have to adhere to the ToS of each particular website/web service when you perform its data scraping.

Crawling web pages with Netpeak Spider in conjunction with NetNut and GeoSurf proxies

NetpeakSpider-logo-owlAgreed, it’s hard to overestimate the importance of information – “Master of information, master of situation”. Nowadays, we have everything we need to become a “master of situation”. We have all the needed tools like spiders and parsers that can scrape various data from websites. Today we will consider scraping Amazon with a web spider equipped with proxy services. more…

Hotel: scrape prices, Q&A

Question

I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2

I am a small hotel owner and want those info quite often, and hope I can do it with codes automatically in someway.  You are expert in this field, what is the easiest ways to get those information?  Can you give me some example codes? more…

Reliable rotating proxies for business directories scrape

logo_rotating_proxiesWe’ve already written about suitable proxy servers for web scraping. Now we want to focus our readers on those for the huge/mass quantities data records scrape, particulary from the business directories. When scraping busines directories, their web servers can identify repetative requesting and put you on hold by looking at the IP address that is used for frequent http requests. Proxy rotation web service is the means for repeatedly changing IP address. Thus, target web server can only see the random IP addresses from rotatign proxies pool at each request. more…

Back to top