No CAPTCHA reCaptcha challenge

No CAPTCHA reCAPTCHA solvingSooner or later a new generation of spam protection methods will emerge to block all unwanted site visitors. The recently launched Google No CAPTHCA reCaptcha could just be such a method. This new “behaviour analysis” tool is getting more and more attention both from the site owners and from scraping engines who are trying to break it. Since Google does not reveal any secrets of its operation, we want to share with you the techniques used in this new smart analysis CAPTCHA that determines between bot and human. Let’s look inside. more…

How to detect your site is being scraped?

detect scrapeIn the age of the modern web there are a lot of data hunters – people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site are incredibly varied, but regardless it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take necessary action to make sure they aren’t bringing down your site.  Not all bots are malicious (think search engine bots) so I’ve outlined some criteria for site owners and developers to use to identify if and how their site is being scraped.
more…

Import.io: Connector-GUIDs, User-GUIDs, API keys and how to get them?

Suppose I run a query to import.io API:

“HI there can you please tell me that what are connector-guid, user-guid and api key in below given code and how to get them for any website?”

I came across this question on StackOverflow, and as an avid import.io user I thought I’d answer it here as well, in case any of you have the same issue. 
more…

EndCaptcha for fast CAPTCHA solving

endCaptcha Captcha solveFrom time to time, web users struggle with “CAPTCHA services” such as DeCaptcher and DBC. And although those services are reliable, often times they’re “overloaded”, meaning the images to be solved get rejected or it takes a lot of time to be decoded (some services might even take 50 seconds to solve a single image!).

But, I recently came across a new service that hopes to fill this (fast CAPTCHA solving) gap. EndCaptcha.com, is a new image digitization service that was built to satisfy the needs of the most demanding consumers. It uses a dedicated team of operators assisted by a smart OCR system. That’s why it’s being considered a Premium CAPTCHA service.  more…

Import.Io Magic Method API

Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works be attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year. When we covered it back then, we noted a few issues:

  • The scraper only works on pages with more than one row of data like a search results page, category pages and etc.
  • It seems to have trouble with some javascript pages.

But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes. more…

Turn any interactive website into an API with ParseHub

parsehub
Anyone should be able to pull data from the web and access it in the format they want. If a website does not have an API available, scraping is one of the only options to get the data you need. But figuring out how to scrape data in the complicated HTML is a pain.

ParseHub is a new web browser extension that you can use to turn any dynamic and poorly structured website into an API, without writing code. ParseHub is a scraping tool that is designed to work on websites with JavaScript and Ajax; it is similar to web scraping tools such as Import.io and Kimono Labs.
more…

My site is being scraped, how can I prevent being scraped?

As anyone who’s spent any time on the scraping field will know, there are plenty of anti-scraping techniques on the market. And since I regularly get asked what the best way to prevent someone from scraping a site, I thought I’d do a post rounding up some of the most popular methods. If you think I’ve missed any out, please let me know in the comments below!

If you are interesting of how to find out if your site is being scraped, then turn to this post: How to detect your site is being scraped?

more…

Back to top