Development

Various topics related to Web Scraper, Web Crawler and Data Processing development

How to detect your site is being scraped?

In the age of the modern web there are a lot of data hunters – people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site are incredibly varied, but regardless, it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take the necessary action to make sure they aren’t bringing down your site. Not all bots are malicious (think search engine bots), so I’ve outlined some criteria for site owners and developers to use to identify if and how their site is being scraped.
more…

Import.io: Connector-GUIDs, User-GUIDs, API keys and how to get them?

Suppose I run a query to import.io API:

“HI there can you please tell me that what are connector-guid, user-guid and api key in below given code and how to get them for any website?”

I came across this question on StackOverflow, and as an avid import.io user I thought I’d answer it here as well, in case any of you have the same issue. 
more…

Import.Io Magic Method API

Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works by attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year. When we covered it back then, we noted a few issues:

  • The scraper only works on pages with more than one row of data, such as search results pages or category pages.
  • It seems to have trouble with some JavaScript pages.

But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes. more…

Scrape with Google Apps Script

In this post I want to show you how I’ve managed to complete the challenge of scraping a site with Google Apps Script (GAS).

The Challenge

The challenge was to scrape arbitrary sites and save all of each site’s pure text (stripping all the HTML markup) into a single file. Originally I was going to use Python and PHP solutions, but then I thought I’d try using Google Apps Script instead. And it turned out pretty well. more…
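
To give a feel for the "pure text" part of the challenge, here is a minimal sketch in Python (not the Google Apps Script solution from the post). It assumes the page HTML is already available as a string; the names `TextExtractor` and `html_to_text` are hypothetical helpers, and skipping `script` and `style` content is my own assumption about what counts as markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags, scripts and stylesheets."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text()` on each fetched page and appending the results to one file would roughly cover the "single file" requirement; fetching the pages themselves is left out of this sketch.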

A Simple Email Crawler in Python

I often receive requests asking about email crawling. It is evident that this topic is quite interesting for those who want to scrape contact information from the web (like direct marketers), and previously we have already mentioned GSA Email Spider as an off-the-shelf solution for email crawling. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. This crawler is simple, but you can learn many things from this example (especially if you’re new to scraping in Python). more…
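
As a rough idea of what such a crawler can look like, here is a minimal sketch using only the standard library. It is not the code from the full article: the function names, the regular expressions and the `max_pages` limit are all assumptions made for illustration, and the email pattern is deliberately loose rather than a complete address validator.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Loose patterns, good enough for a sketch (assumptions, not a full parser).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_emails(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url, collecting email addresses."""
    queue = deque([start_url])
    seen = {start_url}
    emails = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        pages += 1
        emails.update(EMAIL_RE.findall(html))
        for href in LINK_RE.findall(html):
            # Resolve relative links and drop #fragments before queueing.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return emails
```

The `fetch_page` parameter is there so the crawl logic can be tried against canned pages without touching the network. A real crawler would probably also want to stay on one domain, respect robots.txt and throttle its requests.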
