Web Scraper Test Drive!

As we have worked with different web scrapers, a recurring problem has emerged: while a custom scraper can be tailored to your specific needs and scrape virtually anything, off-the-shelf web scrapers are often quite generic and mostly designed for common, simple tasks. In other words, they may not be as flexible and universal as you’d expect. Of course, every web scraper developer tries to make their product handle all kinds of web pages, but we have found that some tools are better suited to one type of task, and others to another.

Which web scraper should you buy so that it keeps serving you across different future tasks? This question spurred us to start the “Web Scraper Test Drive” project. Here are the main points of the project:

  1. Ten companies kindly provided us with their web scrapers for testing.
  2. We created a special testing ground with several difficult cases we have encountered in our web harvesting practice (some of them are not available on that page yet, but we’ll open them as our testing goes on).
  3. We tried to scrape each of those cases with each of the web scrapers and posted the results of every trial.
  4. When we couldn’t complete a case ourselves, we contacted the scraper’s developers to see how they could help.

We’ve done all the hard work, so you can simply examine the results and decide which scraper best fits your needs. Stay tuned!

Ready… Set… GO!

Scrapers under test (each with its own review and an average rating): Content Grabber, Visual Web Ripper, Helium Scraper, Screen Scraper, OutWit Hub, Mozenda, WebSundew Extractor, Web Content Extractor, Easy Web Extractor.

Test cases (each test page links to the per-scraper results): Table Report, Block Layout, Text List, Invalid HTML, Login Form, AJAX, CAPTCHA.
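As a quick illustration of what the “Invalid HTML” test case probes (none of the scrapers above is shown here), the sketch below uses Python’s standard-library html.parser, which is deliberately lenient about malformed markup; the broken HTML string is made up for the example.

```python
from html.parser import HTMLParser

# A tiny extractor built on Python's lenient stdlib parser.
# html.parser tolerates unclosed and mismatched tags, which is
# exactly the kind of markup an "Invalid HTML" test throws at a scraper.
class LinkTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.links.append(data.strip())

# Deliberately broken markup: unclosed <p>, a stray </div>,
# and a second <a> that is never closed.
broken = "<body><p>Intro<a href='/x'>First link</a></div><a href='/y'>Second</p>"
parser = LinkTextExtractor()
parser.feed(broken)
print(parser.links)  # both link texts are recovered despite the broken tags
```

A robust commercial scraper has to cope with pages like this one; a tool that insists on well-formed markup fails this test immediately.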

Those who have left the Scraper Test Drive

The Test Drive is open to any web scraping software, but some participants have left the contest along the way. As the Test Drive goes on, we want to share why some scrapers have dropped out of the race.

Web Data Extractor

In the very first test with Web Data Extractor, we ran into a problem grouping the scraped data cells into a result table. Support told us: “Information in any case will be given as list of items.” This led us to conclude that this scraper is designed solely for mass gathering of emails, URLs, plain text and the like, even though it has since gained a visual user interface. We therefore excluded it from the Test Drive.

WebHarvy

After the first stage (the Table Report scrape), the developers were unhappy with their scraper’s performance, so we asked them to state why they were leaving the contest. They said: “We are leaving the Test Drive since WebHarvy is not a general purpose web scraper and not suited to scrape data as per the requirements in many of the test cases. WebHarvy has been built to enable users to scrape data from ‘well formatted’ paginated lists, with minimum amount of interaction from the user’s part.”

If you have any questions or suggestions about the Test Drive, feel free to comment below.

20 Comments

  1. Dan

    Great info thanks.

    There is a slight niggle though: I peeked at the publish date and noticed it’s a bit out of date now. Are you thinking about updating this review in 2014?

    It would be good to see whether the newer generic tools like ‘Kimono Labs’ and/or Import.io stack up against the traditional ones yet for most things.


    • Michael Shilov

      Hi Dan,

      You’re right. Soon I’m going to start another cycle of web scraper testing, and I’ll include new software and services there. Stay tuned!

      Mike


    • Michael Shilov

      BTW, you can find our review of Kimono Labs here: http://scraping.pro/kimono-labs-review/


  2. andrea lanzoni

    I’m interested in starting to scrape with (fully) open source products. Among them I’m evaluating iRobot and VietSpider. I would be interested to read an independent review from you, hopefully highlighting the usability for users with no programming background and the range of sites these tools can scrape.
    Thank you
    Andrea Lanzoni


  3. Fernando Almeida

    I tried almost all of these tools and all of them fail at some point. By far the best service I got to use for my needs is scrapinghub.com.
    I can’t really understand how you’ve missed it, because it should be taken as a reference by all other software/services.


    • Igor Savinkin

      Fernando, Scrapinghub.com is not pure web scraping software. It’s a scraping environment/framework where developers (not ordinary users) can develop and run [cloud] scripted scrapers. In practice you can solve almost any task and bypass almost any wall or pit through a custom scripted scraper. That’s the reason it’s not included in the Scraper Test Drive.


  4. Keval

    Hey,
    Why is FMiner not included in the test drive?
    Could you run the tests for it as well?
    Thanks.
    Keval


  5. Nikhil

    Hi,

    I tried contacting customer support for “OutWit Hub” through email. It was the worst experience I’ve ever had. They replied very rudely, and it was of no help at all.
    Thumbs down to them


  6. Kasper

    Michael,

    Have you looked at http://www.djuggler.com/ ? I would be interested to see how it fares against other web scraping software.

    regards,
    Kasper


    • Igor Savinkin

      Kasper, I couldn’t open that page: “ERR_NAME_NOT_RESOLVED”.


  7. Andreas

    Hi!

    We have an online scraping tool called APFy.me (http://www.apfy.me) which allows you to transform a website into a nicely formatted API that you can consume. You’re welcome to check it out.


  8. MatrixView

    Hi Michael,

    …Just found your scraper blog and subscribed to your RSS. Great stuff!

    I was wondering, have you ever considered reviewing the following webscraping tools?

    1) The command-line tool “Xidel” to download/extract/follow (HTML/XML/JSON/TXT/CSS3/XPath2/XQuery/HTML-templates/RegEx/XSLT/…)
    In bash and batch scripts this tool can do wonderful things, e.g. loading environment variables with extracted values.
    You can find the tool here: http://videlibri.sourceforge.net/xidel.html [Source and binaries (mac/win/ux) available]
    I ran into this Swiss-army-knife of a tool on Stack Overflow, where it is usually the shortest command-based solution to many scraping questions.

    2) GUI/Scripting/web automation tool “iRobotSoft” that can be used visually and script-wise (HTQL-python)
    You can find it here: http://www.irobotsoft.com/
    It’s free, and has a forum and demos.

    Cheers! MatrixView


    • Igor Savinkin

      MatrixView,
      1) Xidel seems to be a command-line utility for technical people only. Can it bypass CAPTCHAs, log in, evaluate embedded JS code, etc.?
      2) The irobotsoft.com site seems to be outdated; some features are off…


  9. Stefan Avivson

    I was wondering why CloudScrape didn’t make it onto this list?


    • Igor Savinkin

      Stefan, I’m limited in time and manpower, so please be patient with me. I hope we’ll get to it.


  10. Tiago Oliveira

    What is the best option for scraping “Tripadvisor”?


    • Igor Savinkin

      Tripadvisor is stuffed with JS to protect its data, so I’d recommend powerful software that runs a full browser stack (including JS support) for scraping: Content Grabber and Visual Web Ripper.


  11. Tiago Oliveira

    Igor,

    Thanks !


  12. Zone Téléchargement

    Is it possible with this software to duplicate an entire CMS with all its functions?
    Regards,
    Rico


    • Igor Savinkin

      I think you can fetch all the data you need, but to “duplicate a CMS with all functions” is not a scraper’s job; it’s a web developer’s job. A scraper only gathers data; a developer builds the [online] service.

