A new stage of the Scraper Test Drive is under way: CAPTCHA. What can the scrapers do to get past the “robot fighters”? Off-the-shelf scrapers are not designed to solve CAPTCHAs by default. Furthermore, some stated that “bypassing Captchas was compatible with Internet good ethics”. I agree with this, but for the full Scraper Test Drive taste, we still want to try out the scrapers.

CAPTCHA categories

The most commonly used CAPTCHAs fall into two categories:

  • Pure character recognition from an image (1st CAPTCHA)
  • The Draggable and ‘Drag & Drop’ (2nd and 3rd CAPTCHAs)

The latter category is powered by JavaScript, so these CAPTCHAs can be solved “fairly” easily compared to image recognition.


Bypass by sending POST submit form

We included two JavaScript-driven CAPTCHAs in the CAPTCHA Test Drive, and after some consideration we figured out how to get past them. Here is how these JavaScript, or event-driven, CAPTCHAs can be bypassed.

Draggable:

  1. Send request: URL = http://testing-ground.scraping.pro/captcha?qaptcha
    POST parameter action=qaptcha
    POST parameter qaptcha_key=qaptcha_crack
  2. Submit the form with empty qaptcha_crack parameter:
    URL = http://testing-ground.scraping.pro/captcha?qsubmit
    POST parameter qaptcha_crack=<empty>
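
The two Draggable steps can be sketched in Python as follows. This is only an illustration under assumptions: the function name is hypothetical, the requests are built but not sent, and the sketch presumes nothing about how the server answers.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://testing-ground.scraping.pro/captcha"


def build_draggable_requests():
    """Build (but do not send) the two POST requests from the steps above."""
    # Step 1: the request carrying action=qaptcha and qaptcha_key=qaptcha_crack.
    unlock_body = urlencode({"action": "qaptcha", "qaptcha_key": "qaptcha_crack"})
    unlock = Request(
        BASE_URL + "?qaptcha",
        data=unlock_body.encode("ascii"),
        method="POST",
    )
    # Step 2: the form submission with an empty qaptcha_crack parameter.
    submit_body = urlencode({"qaptcha_crack": ""})
    submit = Request(
        BASE_URL + "?qsubmit",
        data=submit_body.encode("ascii"),
        method="POST",
    )
    return unlock, submit
```

To actually send them, both requests would presumably have to go through the same cookie-aware session (for example an opener built with `urllib.request.build_opener` and `HTTPCookieProcessor`), on the assumption that the server ties the two steps together with a session cookie.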

Drag&Drop:

  1. Send request to http://testing-ground.scraping.pro/fancy-captcha/captcha.php and save the number it returned
  2. Send this number in a POST parameter named “captcha” when submitting the form.
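
A minimal Python sketch of the Drag&Drop steps is given here. It assumes that captcha.php answers with nothing but the digits of the number, and it takes the form URL as an argument rather than guessing it; both function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

CAPTCHA_URL = "http://testing-ground.scraping.pro/fancy-captcha/captcha.php"


def parse_captcha_number(body: bytes) -> str:
    """Return the number from the captcha.php response body.

    Assumption: the body is plain digits, possibly padded with whitespace.
    """
    text = body.decode("ascii").strip()
    if not text.isdigit():
        raise ValueError("unexpected captcha.php response: %r" % text)
    return text


def build_submit_request(form_url: str, number: str) -> Request:
    """Build (but do not send) the form POST with the 'captcha' parameter."""
    body = urlencode({"captcha": number}).encode("ascii")
    return Request(form_url, data=body, method="POST")
```

In use, one would fetch `CAPTCHA_URL`, pass the response body to `parse_captcha_number`, and send the request from `build_submit_request` within the same session.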

That’s it! You can read more about these unreliable CAPTCHAs here.

The CAPTCHA solving Results table

Here are the overall results for the off-the-shelf scrapers solving CAPTCHAs.

Character Captcha | Draggable Captcha | Drag & Drop Captcha
Content Grabber
Helium Scraper
Screen Scraper
Visual Web Ripper
OutWit Hub
Web Content Extractor
Easy Web Extractor
Mozenda
WebSundew Extractor

Visual Web Ripper

Visual Web Ripper solved the image recognition CAPTCHA (1st CAPTCHA) by connecting to a third-party service. The 2nd and 3rd CAPTCHAs can be solved by sending a POST form submission, which is rather complicated for an ordinary user.

Automated CAPTCHA bypass project (for 1st CAPTCHA, Securimage)

A standard CAPTCHA bypass project is built as follows:

  1. Add a content element that selects the CAPTCHA image. Then use the Misc options tab to uncheck the Save content option.
  2. Add a FormField element that selects the CAPTCHA input field. Then use the AdvancedOptions tab to select the image element as a CAPTCHA element.
  3. Use the AdvancedOptions tab to add a Decode CAPTCHA script to the FormField element that selects the CAPTCHA input field. (see the picture below)
  4. Add a FormSubmit template that submits the CAPTCHA form. You may need to set the Misc option to Optional template if the CAPTCHA form is not always displayed.

The default decode CAPTCHA script is designed to work with the www.deathbycaptcha.com service. If you are using this service, you only need to add your login name and password.

A decode CAPTCHA script can be written in C# or VB.NET. I have no account at deathbycaptcha.com, but I got a trial account from bypasscaptcha.com. So I needed to change the script while leaving its shell (title, input and output) untouched. I downloaded the C# API from bypasscaptcha.com and started to bind it to Visual Web Ripper.

It took me a while (you really need to know C#), but the “Compile and Validate Script” function and its button were a great help. The script does need a valid service key.

The scraper passed the test:

Note: In some cases a CAPTCHA may appear unexpectedly as an extra anti-bot page or screen (after a robot has processed several hundred pages). In this case neither a semi-automatic nor a fully automatic CAPTCHA bypass project can be composed, since it is not evident what the CAPTCHA is or where it will appear.

POST requests to solve JavaScript CAPTCHAs

We asked support whether the software can solve CAPTCHAs by sending a POST form submission, and support replied:
“The software isn’t really designed for this sort of processing, but with a little bit of page transformation and JavaScript injection it can be done. It’s probably beyond what most users would be able to accomplish on their own”.

Still here are the steps required to do it:

  1. Add a page transformation element that adds or modifies the CAPTCHA form, so that it submits the required values.
  2. Add a Link template to execute a JavaScript that makes the asynchronous request back to the server. The JavaScript also sets form fields if required.
  3. Add a submit template that submits the form from step 1.

Since the CAPTCHA test has proved to be an issue for most of the scrapers, we also want to cover the cases that do not meet the test requirement of a fully automatic solution. The special features for semi-automatic passing may still be helpful.

Semi-automated extraction for JavaScript CAPTCHAs (for 2nd and 3rd CAPTCHAs)

VWR supports semi-automated data extraction. This means that when a CAPTCHA puzzle is encountered, the project is paused and the user is prompted to solve the CAPTCHA manually and then continue the project execution. A CAPTCHA demo project of this kind is available. Visit this link (for registered users only) to see the simple steps to compose such a project and/or download the demo project.

Note: Trial account holders also count as registered users.

This kind of extraction can be done only in the Web Browser mode with the options “View browser” and “Debugging” checked.
Here’s how to get past the two JavaScript CAPTCHAs:

  1. Create a new project and load the CAPTCHA page.
  2. Switch to Navigation mode and manually process the CAPTCHA. You’ll now see the text “THE FORM WAS SUCCESSFULLY SUBMITTED”.
  3. Add a Content element that extracts the text “THE FORM WAS SUCCESSFULLY SUBMITTED”.
  4. Edit the Content element and open Advanced options. Set the option “Pause when missing”.
  5. Run the project and make sure “View browser” and “Debugging” are enabled.
  6. The project will now load the CAPTCHA page and pause.
  7. Manually process the CAPTCHA in the web browser.
  8. Click the Continue button in the VWR debug window.
  9. The project will now continue and extract the content.
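
The “Pause when missing” behaviour in these steps can be modelled with a short Python sketch. It is not VWR’s implementation: the function and its callbacks (`load_page`, `wait_for_user`) are hypothetical stand-ins for the browser and the Continue button.

```python
SUCCESS_TEXT = "THE FORM WAS SUCCESSFULLY SUBMITTED"


def extract_with_pause(load_page, wait_for_user, max_pauses=3):
    """Return SUCCESS_TEXT once it appears on the page, else None.

    load_page() returns the initial page text. wait_for_user() blocks until
    the user has handled the CAPTCHA and pressed Continue, then returns the
    current page text.
    """
    page = load_page()
    for _ in range(max_pauses):
        if SUCCESS_TEXT in page:
            return SUCCESS_TEXT
        # The expected content is missing: pause and let a human solve it.
        page = wait_for_user()
    # Final check after the last pause.
    return SUCCESS_TEXT if SUCCESS_TEXT in page else None
```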

Here is the debugging info for semi-automated CAPTCHA solving:

Visual Web Ripper, being a powerful scraper, completed the character recognition CAPTCHA. The event-driven draggable CAPTCHAs, however, require some extra coding.

Content Grabber

Drag-&-Drop and Draggable

As we’ve mentioned, the Drag-&-Drop and Draggable captchas are JS captchas, solvable through GET and POST requests.

So we just need to create “Navigate URL” Content Grabber commands to execute those requests.

Content Grabber navigate URL command

Inside this command we set the common attributes by defining the Data provider and Data Column as follows:

Content Grabber Navigate URL command setting 1

Then, on the Data tab, we define a script. This script requests the captcha code service and replies with the POST data:

Content Grabber Navigate URL command setting 2

Note: It’s possible to do this without any scripting, but you’ll end up with quite a few commands for something that’s easily done with a small script.

Drag and Drop script

For the “Drag and Drop” Captcha we got help from tech support, who provided a custom C# script to insert into the command. The command uses a very simple data provider script to post the required values and retrieve the number, which is then used to generate the final URL.

Note: It is the post= parameter inside the second URL in the script that makes Content Grabber send a POST request with the captcha credentials.

Draggable captcha script

Image Captcha

I followed the software manual to build commands that capture the Captcha image, OCR it and submit the solved text. To open the Content Grabber manual, go to Help -> Software Manual in the main menu.

To solve an image captcha, follow these steps:

  1. Select the Captcha image and make a new command.
  2. In this command, open the OCR tab and tick Convert image to text.
  3. Press the Script button to add a script.
    Figure: Choose image captcha to solve with OCR in Content Grabber
  4. Add the script for your Captcha solving service into the window. I used the DeathByCaptcha OCR service.
    Note: scripts for some popular services can be found in the CAPTCHA section of the Content Grabber software manual.
    Save the command.
  5. Now select the text input field and make a new command; call it Captcha Code.
  6. In this command, choose Captured data as the Data provider and Captcha as the Data Column to feed into this field. Save the command.
    Figure: Get decoded value into the input field for captcha by Content Grabber
  7. Finally, select the Submit form button, make a Navigate Link command and save it.

Testing this with a working DeathByCaptcha account produced the desired result:

Image captcha solution success with Content Grabber


Helium Scraper

Premade for solving CAPTCHA

As soon as I opened Helium Scraper I turned to the Premades, since I knew that the developers have put many useful things there. Just go to File -> Online Premades -> Communicate with DeCaptcher to import the premade projects.

The premade contains two projects that communicate with the third-party De-Captcher.com service to solve CAPTCHA codes automatically. However, this service, which we have reviewed, relies on Optical Character Recognition, so it applies only to the first CAPTCHA in the test and not to the event-driven CAPTCHAs.

We created the Captcha IMAGE, Captcha INPUT and Captcha SUBMIT kinds. Then, following the premade description, we filled out one of the action trees, ‘Solve Captcha If Needed’ or ‘Fill Up Captcha’, and configured it with the following parameters: the kinds defining a CAPTCHA, the De-Captcher account username and password, and the maximum number of tries (see a picture at right). These trees first check whether the CAPTCHA picture is present and, if it is, try to solve it and submit the result. They repeat this until the CAPTCHA picture is no longer found or the maximum number of attempts has been reached.
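
The retry logic of these action trees can be sketched in Python. This is a model of the described behaviour, not Helium Scraper’s code: the function name and the three callbacks are hypothetical placeholders for the kinds and the solving service.

```python
def solve_captcha_if_needed(captcha_present, solve, submit, max_tries):
    """Solve and submit while a CAPTCHA image is present, up to max_tries.

    captcha_present() -> bool : is the CAPTCHA image on the page?
    solve() -> str            : ask the solving service for the text
    submit(answer)            : type the answer and submit the form
    Returns (solved, tries), where solved is True if no CAPTCHA remains.
    """
    tries = 0
    while tries < max_tries and captcha_present():
        answer = solve()
        submit(answer)
        tries += 1
    return (not captcha_present(), tries)
```

Capping the number of tries matters because each call to a solving service typically costs money and a wrong answer simply brings up a new image.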

The Helium Scraper passed the image recognition CAPTCHA.

Solving the JavaScript-driven CAPTCHAs

For these tricky event-driven CAPTCHAs we contacted scraper support. Our inquiry inspired the support team to write a new premade called Simulate DragDrop (now available for the updated version). It simulates a drag-and-drop action from a source element into a target element. You can find it at File -> Online Premades. Also a Boolean Text Gatherer (Project -> Boolean Text Gatherers) was created that helps the CurrentThing kind identify the element it needs to select, which is the item you’d drag-and-drop for the AJAX FANCY CAPTCHA.

Support then provided a project with which we could solve the last two CAPTCHAs. See the results here:

Of all the scrapers, Helium Scraper completed this task with the least time and effort, if one does not count the support team’s work on creating a new JS gatherer and the additional boolean element.

Screen Scraper

Screen Scraper support supplied us with a CAPTCHA-solving Example Project that works on ReCaptcha. The scraper can connect to the DeathByCaptcha and/or DeCaptcher services through custom scripts that use the corresponding service’s API:

If the scraping session is run on the workbench, the script just pops up the CAPTCHA to be filled in manually. The workbench is only for developing the routine; once that is done, the script can be run in server mode, where it will submit the CAPTCHA to the service.

Solving Draggable and Drag & Drop CAPTCHAs

As for the drag & drop CAPTCHAs, Screen Scraper is able to send POST requests, so it can solve these CAPTCHAs once additional scripting is added. This scripting must imitate the HTTP session in order to reproduce the requests described in the “Bypass by sending POST submit form” section. See the Screen Scraper documentation and search for the term “POST request” for further reference.

Web Content Extractor

This scraper is meant for purely straightforward scraping. So even though I tried my best, I failed to find any way to solve CAPTCHAs automatically with Web Content Extractor. Yet at the last minute support provided a project for a semi-automated CAPTCHA solution.
When the project runs, one needs to enter the CAPTCHA text manually or make a ‘smart move’ and click “Submit form”. This solution does not meet the Scraper Test Drive requirements, yet we mention it as the only way offered and give some credit for it.

Mozenda

At this time, Mozenda does not support automated or manual workarounds of CAPTCHA.

Easy Web Extract

After a short search for the CAPTCHA solving techniques, I found that this scraper is not designed to do this tricky task.

OutWit Hub

Sending POST request for 3rd CAPTCHA solution

In this project, support showed us how to use an automator, called a scraper, in OutWit Hub. Since in the ‘Drag & Drop’ CAPTCHA the index of the item to drag and drop is visible in the HTML (it is followed by the ajax-fc-highlighted ui-draggable class names; see a picture below), why not scrape it, save it into a variable and send it in the POST request?


To save data into a custom variable, one can use the following directive:

#variable#<VariableName># directive declares and sets the value of the variable (<VariableName>). Occurrences of the variable are then replaced, at run time, by the scraped value.

So we defined the variable named position and the scraper will scrape the digit (\d – regex) prior to the Marker After (see the picture below).
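
The same scraping idea can be illustrated in Python. The sample HTML in the usage note and the exact Marker After string are assumptions made for this sketch; the real page markup may differ.

```python
import re

# Assumed "Marker After": the class names that follow the digit.
MARKER_AFTER = " ajax-fc-highlighted ui-draggable"

# A single digit (\d) immediately before the marker.
POSITION_RE = re.compile(r"(\d)" + re.escape(MARKER_AFTER))


def scrape_position(html):
    """Return the digit preceding the marker, or None if it is not found."""
    match = POSITION_RE.search(html)
    return match.group(1) if match else None
```

For example, with a hypothetical element such as `<div class="ajax-fc-4 ajax-fc-highlighted ui-draggable">`, `scrape_position` returns `"4"`.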

The next directive sends the POST request:
The #nextPage# directive tells the scraper how to find the link to the next page to use in an automatic browse process.

It is used here to send the POST request that solves the 3rd CAPTCHA, as described above for JS-driven CAPTCHAs. Note that the ‘captcha’ POST parameter is passed with the value of the position variable (see the picture below).

Here is the request, with its POST parameters marked by #POST#: http://testing-ground.extract-web-data.com/captcha?fsubmit&#POST#submit=Submit+form&#POST#captcha=#position#
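
To make the structure of that request explicit, here is a Python sketch that splits such a string into the target URL and its POST fields. It reflects only my reading of the #POST# markers in the example above, not OutWit Hub’s actual parser, and the function name is hypothetical.

```python
from urllib.parse import unquote_plus


def split_post_url(url, variables):
    """Split an OutWit-style URL into (target_url, post_fields).

    Everything before the first '&#POST#' is the target URL; each later
    chunk is a name=value POST field. '#name#' placeholders in values are
    replaced from the variables dict before URL-decoding.
    """
    target, *post_parts = url.split("&#POST#")
    fields = {}
    for part in post_parts:
        name, _, value = part.partition("=")
        for var_name, var_value in variables.items():
            value = value.replace("#" + var_name + "#", var_value)
        fields[name] = unquote_plus(value)
    return target, fields
```

Called with the request above and `{"position": "3"}`, it yields the `captcha?fsubmit` URL as the target, plus the fields `submit` ("Submit form") and `captcha` ("3").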

Just run (execute) the scraper, and do not forget to press the “Next Page” or “Browse” button to send the POST request.

So far the scraper has solved only the last CAPTCHA, yet I believe it will grow in skill and strength.

WebSundew Extractor

Their support promised to configure the agents for CAPTCHA solving, but so far we have had no reply from them. I do not think this scraper is able to pass the CAPTCHA Test Drive.

Conclusion

The Test Drive has shown that some of the scrapers have built gateways to third-party services for CAPTCHA recognition. Some also offer scripting that makes it possible to solve ‘drag & drop’ type CAPTCHAs.

The Test Drive results are helpful in exposing the scrapers’ abilities. Three of the scrapers are low-level web data extractors that allow scripting and more. For example, Visual Web Ripper allows semi-automated CAPTCHA solving: the scraper pauses for the user to solve the CAPTCHA manually and then resumes extraction. Helium Scraper handles JavaScript work well, since it extracts using Kinds (element associations) and JavaScript gatherers. Content Grabber performed excellently, sending GET and POST requests to solve the JS captchas and plugging in a Captcha-solving service for image Captcha recognition.