Professional data extraction requires adequate proxying to keep scraping robots anonymous. When extracting large data sets (over 1M records, e.g. business directories), a reliable and fast proxy service is essential.
Sequentum has released the Nohodo proxy service integration for Content Grabber. Nohodo provides a free account for Content Grabber users (up to 5000 requests per month for free). The feature is available to both trial users and regular customers. Here’s how it works…
Register free account at Nohodo
- Upgrade your version of Content Grabber. You can do this from the Help menu.
- Sign up for the free Nohodo proxy account: log into the Content Grabber website and visit your Account page.
- You’ll be taken to a Nohodo page. Scroll down to choose the Free account, then scroll further down to reach the form. Upon submitting the form, check your mailbox for a sign-up code (sent by support), enter it into the same form and resubmit it to finish creating your Nohodo account.
Configure Nohodo at Content Grabber
Then you will need to configure the Nohodo proxy service in Content Grabber (following the steps given there). Once that’s done, you are ready to use Content Grabber with the Nohodo proxy network. This helps your agents stay undetected even during high-volume web requests, such as scraping YellowPages and other data aggregation sites.
- The free account is limited to 5000 web requests per month (roughly 500 MB), but paid packages are available to support larger traffic requirements.
- The free Nohodo accounts only have access to US proxies.
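Proxy services like Nohodo are typically addressed as an authenticated host:port endpoint. As a rough illustration of what such a configuration amounts to (in Python rather than inside Content Grabber, and with a placeholder hostname and credentials — the real endpoint details come from your own proxy account), routing requests through a proxy looks like this:

```python
from urllib.parse import quote
import urllib.request

def proxy_url(host, port, user=None, password=None):
    """Build a proxy URL usable with urllib's ProxyHandler,
    URL-encoding the credentials if present."""
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}@" if user else ""
    return f"http://{auth}{host}:{port}"

# Placeholder endpoint and credentials -- substitute the values
# from your own proxy account.
proxy = proxy_url("proxy.example.com", 8080, "user", "secret")
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy, "https": proxy})
)
# opener.open("http://example.com/")  # traffic would now go via the proxy
```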
The Content Grabber & Nohodo integration seems to be sound. It’s a modern web scraping software combined with a reliable IP anonymity service. The Nohodo IP rotation network, which operates thousands of high-performance proxies, should meet the IP anonymity demands of the professional web scraper.
We’ve already introduced you to the theory behind the new NO CAPTCHA reCAPTCHA, but now we come to the practical integration part. Here we’ll share how to insert and configure “NO CAPTCHA reCAPTCHA” into a web page. more…
Sooner or later a new generation of spam protection methods will emerge to block all unwanted site visitors. The recently launched Google No CAPTCHA reCAPTCHA could be just such a method. This new “behaviour analysis” tool is getting more and more attention, both from site owners and from scraping engines trying to break it. Since Google does not reveal the secrets of its operation, we want to share with you the techniques this new smart analysis CAPTCHA uses to distinguish between bot and human. Let’s look inside. more…
Consistent web scraping requires multiple rotating proxies to prevent blocking and throttling by your target website. Let’s take Content Grabber, a visual scraper, together with the Proxy-Connect rotating proxy service, for an example scrape.
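The core of any rotating-proxy setup is simple: spread successive requests across a pool of endpoints so no single IP hits the target too often. A minimal round-robin sketch in Python (the proxy addresses are placeholders, and the actual HTTP call is left out):

```python
import itertools
import time

# Placeholder pool -- in practice these come from your proxy provider.
PROXY_POOL = [
    "http://p1.example.com:8080",
    "http://p2.example.com:8080",
    "http://p3.example.com:8080",
]

_rotation = itertools.cycle(PROXY_POOL)

def next_proxy():
    """Return the next proxy in round-robin order, wrapping around."""
    return next(_rotation)

def polite_fetch(url, delay=1.0):
    """Pick a proxy for this request and pause between requests;
    the HTTP call itself is omitted from this sketch."""
    proxy = next_proxy()
    time.sleep(delay)
    return url, proxy
```

Real rotation services handle this server-side behind a single endpoint, but the effect is the same: each request leaves from a different IP.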
In the age of the modern web there are a lot of data hunters – people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site are incredibly varied, but regardless it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take necessary action to make sure they aren’t bringing down your site. Not all bots are malicious (think search engine bots) so I’ve outlined some criteria for site owners and developers to use to identify if and how their site is being scraped.
Suppose I run a query to the import.io API:
$url = "https://query.import.io/store/connector/" . $connectorGuid . "/_query?_user=" . urlencode($userGuid) . "&_apikey=" . urlencode($apiKey);
“HI there can you please tell me that what are connector-guid, user-guid and api key in below given code and how to get them for any website?”
I came across this question on StackOverflow, and as an avid import.io user I thought I’d answer it here as well, in case any of you have the same issue.
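For reference, the query URL from the PHP snippet above can be assembled in any language; here is the equivalent in Python, with the GUIDs and key as obvious placeholders (the real values come from your import.io account, which is exactly what the question is about):

```python
from urllib.parse import quote

def import_io_query_url(connector_guid, user_guid, api_key):
    """Mirror of the PHP snippet: build the connector query URL,
    URL-encoding the user GUID and API key."""
    return ("https://query.import.io/store/connector/" + connector_guid
            + "/_query?_user=" + quote(user_guid, safe="")
            + "&_apikey=" + quote(api_key, safe=""))

# Placeholder values -- replace with your own GUIDs and key.
url = import_io_query_url("my-connector-guid", "my-user-guid", "my-api-key")
```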
Recently Import.io introduced a new extraction technique called Magic. The Magic method works by attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year and noted a few issues:
- The scraper only works on pages with more than one row of data, such as search results pages or category pages.
But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes. more…
In this post I want to tell you how I managed to complete the challenge of scraping a site with Google Apps Script (GAS).
The challenge was to scrape arbitrary sites and save all of a site’s pure text (stripping all the HTML markup) into a single file. Originally I was going to use Python or PHP, but then I thought I’d try Google Apps Script instead. It turned out pretty well. more…
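The original solution used Google Apps Script, but the markup-stripping step itself is language-agnostic. Here is a sketch of the same idea using Python's standard-library HTML parser (not the GAS code from the post): collect the text nodes, skipping script and style contents.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Strip all markup from an HTML string, returning pure text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```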
File converting with Google Apps Script
The other day, following the web scraping project with Google Apps Script (GAS), I was challenged to do some cloud converting: namely, to take a Google Doc file and convert it into MS Word format. more…
Many developers get stuck with cookie handling in web scraping. Sure, it’s a tricky thing, and it was once my stumbling block too. So here, mainly for new scraping engineers, I’d like to share how to handle cookies in web scraping with PHP. We’ve already done a post on scraping with cURL in PHP, so here we’ll focus only on the cookie side. more…
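In PHP, cURL’s CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR options do the heavy lifting (that’s what the full post covers). What they automate is this: collect the Set-Cookie headers from each response and replay them in the Cookie header of the next request. A minimal sketch of that idea, shown in Python for illustration:

```python
from http.cookies import SimpleCookie

def cookie_header(set_cookie_headers):
    """Parse Set-Cookie response headers and build the Cookie header
    value to send back on the next request (ignores expiry, path and
    domain scoping for brevity)."""
    jar = SimpleCookie()
    for header in set_cookie_headers:
        jar.load(header)  # later headers overwrite same-named cookies
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

This is exactly the round-trip a cookie jar maintains for you: losing it between requests is why scrapers get logged out or served fresh sessions on every page.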