Miscellaneous

Other posts not belonging to any specific category

CloudScrape to transform into Dexi.io

We have already written several posts on CloudScrape, a web scraping startup based in Copenhagen, Denmark. The service now has a new look and new features for data extraction and business intelligence, launched under a new name: Dexi.io.

Pipes for aggregation and post-processing

Dexi.io has relaunched and rebranded from its early-stage name, CloudScrape. The company has also released a new product, Pipes, which adds intelligent data transformation to complement the point-and-click data extraction service. In a nutshell, Pipes is a data integration and post-processing engine inside Dexi.io that can aggregate and sanitize extracted data, and a lot more. We’ll share more on it in upcoming posts.

Driving Innovation is the Key to Success

Stefan Avivson, CEO of Dexi.io, explains: “Although Robotic Process Automation (RPA) is not a new concept, service providers have been using so-called Robotic Processing for a decade now, but the amount of available data and the technology to process it has evolved tremendously over the past two years. There is basically no real limitation to the use of Big Data and there is no real effort in convincing people that RPA is the future. It’s more a question of knowing how! Utilizing the resources of our innovation team for our clients and partners has without doubt been one of the key drivers to our success!”

Let’s see what the future holds for this cutting-edge cloud scraping service.

Web scraping with JavaScript

Is it possible to scrape an HTML page with JavaScript from inside of a web browser?

To be perfectly honest, I wasn’t sure, so I decided to try it out.

Full disclaimer here, I didn’t actually succeed. However, it was a great learning experience for me and I think you guys could benefit from seeing what I did and where I went wrong. Who knows, maybe you can take what I’ve done and figure it out for yourself!

Content Grabber self-contained (standalone) agent

As web scraping becomes easier to use, more and more people are able to leverage the world’s web resources. As this trend grows, structured data from the web empowers businesses and enables a wave of new business ideas to become a reality. Now there is a new technology on the market called “self-contained agents” that might just turn this wave into a tsunami!

Import.io: Connector-GUIDs, User-GUIDs, API keys and how to get them?

Suppose I run a query to import.io API:

“HI there can you please tell me that what are connector-guid, user-guid and api key in below given code and how to get them for any website?”

I came across this question on StackOverflow, and as an avid import.io user I thought I’d answer it here as well, in case any of you have the same issue. 

EndCaptcha for fast CAPTCHA solving

From time to time, web users struggle with “CAPTCHA services” such as DeCaptcher and DBC. Although those services are reliable, they are often “overloaded”, meaning the images to be solved get rejected or take a long time to be decoded (some services might even take 50 seconds to solve a single image!).

But I recently came across a new service that hopes to fill this gap in fast CAPTCHA solving. EndCaptcha.com is a new image digitization service that was built to satisfy the needs of the most demanding consumers. It uses a dedicated team of operators assisted by a smart OCR system. That’s why it’s being considered a premium CAPTCHA service.

Import.io Magic Method API

Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works by attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year, and noted a few issues at the time:

  • The scraper only works on pages with more than one row of data, such as search results pages, category pages, etc.
  • It seems to have trouble with some JavaScript pages.

But now Import.io has released a second version of Magic, which seems to have dealt with those obstacles. Not only that, but they have also released an API for Magic that lets you see what’s going on behind the scenes.

My site is being scraped, how can I prevent being scraped?

As anyone who’s spent any time in the scraping field will know, there are plenty of anti-scraping techniques on the market. And since I regularly get asked what the best way to prevent someone from scraping a site is, I thought I’d do a post rounding up some of the most popular methods. If you think I’ve missed any, please let me know in the comments below!

If you are interested in finding out whether your site is being scraped, see this post: How to detect your site is being scraped?

