Recently I’ve been getting requests for a tutorial showing how to solve Google’s No CAPTCHA reCaptcha. I’ve introduced it before and promised to work out a script to automate solving it. Here’s what I’ve come up with.

Disclaimer. Since we published this post, Google has made the form considerably more complicated:

  1. Google engineers have removed the iframe’s name attribute that we relied on.
  2. They’ve changed the HTML markup of the image puzzle. The table layout is now nested inside a block layout, and the table ranges from 3×3 to 6×6 tiles. As a result, the probability of solving the puzzle by random clicks has dropped.
    Thus the script’s solution time has increased drastically. Counted the same way as the original 2.8% figure, the probability of solving a 4×4 puzzle in a single attempt is now 2/16 × 1/15 ≈ 0.83% when 2 tiles are correct, or 3/16 × 2/15 × 1/14 ≈ 0.18% when 3 tiles are correct. That is roughly 3 to 16 times lower than the original 2.8%.
  3. Google has also set a session timeout, so the session expires after a certain time.
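
To see how quickly the odds shrink as the grid grows, here is a minimal sketch that treats a random pick as an unordered set of tiles (the convention behind the 2.8% figure; counting ordered clicks instead gives smaller values). The function names are made up for this illustration, and it assumes exactly 2 or 3 tiles are correct.

```javascript
// Binomial coefficient C(n, k): number of ways to choose k tiles out of n.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return Math.round(result);
}

// Probability that one random pick of `correct` distinct tiles
// is exactly the required set on a side x side grid.
function randomHitProbability(side, correct) {
  return 1 / choose(side * side, correct);
}

// Print the single-attempt odds for every grid size mentioned above.
for (const side of [3, 4, 5, 6]) {
  for (const correct of [2, 3]) {
    const p = randomHitProbability(side, correct);
    console.log(`${side}x${side}, ${correct} tiles: ${(p * 100).toFixed(3)}%`);
  }
}
```

For the 3×3 grid with 2 correct tiles this reproduces 1/36; every larger grid pushes the single-attempt probability well below that.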

We are still working on improving the code to beat reCaptcha. Watch for new posts where we’ll publish updated code!

Brute force cracking

After some unsuccessful attempts to imitate user behaviour with Selenium, I got help from my friend Egor Homakov and his post on reCaptcha. Specifically, this paragraph, where he describes reCaptcha’s vulnerability:

“Client (website with CAPTCHA) knows how many wrong attempt you made (because verification is server side) but doesn’t know how many challenges you actually received (because User gets challenge with JavaScript, Client isn’t involved). Getting a challenge and verifying a challenge are loosely coupled events”.

So it seems that brute force might be the only way forward: the Client does not know how many picture challenges a macro goes through in a single attempt, and Google serves as many challenges as needed without banning. I’ve calculated the probability that a single trial randomly picks the right 2 out of 9 pictures: 2/9 × 1/8 = 1/36 ≈ 2.8%. Not a bad probability, provided Google asks for only 2 pictures out of 9 in each challenge.
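
The 1/36 figure can be sanity-checked with a quick simulation. This is only a sketch under the same assumption as the calculation (exactly 2 of the 9 tiles are correct and picks are uniformly random); the function names and the choice of tiles 0 and 4 as the "correct" ones are arbitrary.

```javascript
// Pick `count` distinct tile indices out of `total` uniformly at random
// (partial Fisher-Yates shuffle).
function pickRandomTiles(total, count, random = Math.random) {
  const tiles = Array.from({ length: total }, (_, i) => i);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (total - i));
    [tiles[i], tiles[j]] = [tiles[j], tiles[i]];
  }
  return tiles.slice(0, count);
}

// Estimate how often a random pick equals the required set of tiles.
function estimateHitRate(total, required, trials, random = Math.random) {
  const target = new Set(required);
  let hits = 0;
  for (let t = 0; t < trials; t++) {
    const pick = pickRandomTiles(total, required.length, random);
    // The pick has distinct tiles, so "all in target" means an exact match.
    if (pick.every((tile) => target.has(tile))) hits++;
  }
  return hits / trials;
}

const estimate = estimateHitRate(9, [0, 4], 200000);
console.log(`estimated: ${(estimate * 100).toFixed(2)}%, exact: ${(100 / 36).toFixed(2)}%`);
```

If every trial were independent with probability 1/36, the expected number of trials until the first hit would be 36.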

If you are eager for the CAPTCHA solution code you can jump directly to it. Otherwise, read on to see how to find the reCaptcha HTML frame numbers and insert them into the code.

iMacros

The reCaptcha is essentially a puzzle that evaluates user behaviour. iMacros cannot reproduce real user behaviour, so we need to resort to brute force to break it.

Algorithm

  1. The macro clicks the reCaptcha checkbox.
  2. Google’s assessment engine on the server decides whether the right clues are present and returns the result to the web page.
  3. The macro checks whether the reCaptcha checkbox is ticked. If so (the captcha is solved) => finish the macro.
  4. [If not] Google’s server provides a picture puzzle to be solved.
  5. The macro ticks randomly selected pictures* and submits the solution to Google’s server.
  6. Go to step 2.
*If the macro finds that some pictures are already ticked at this point in the loop, it ticks only one more tile (picture). This gives a better chance of solving puzzles that require a third or fourth picture.
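
The control flow of the steps above can be sketched in plain JavaScript. This is not the actual macro code: the `steps` object and all of its method names are hypothetical placeholders for the submacros, and the mock at the bottom exists only to exercise the loop without a browser.

```javascript
// Runs the loop described above. `steps` is a HYPOTHETICAL object whose
// methods stand in for the submacros; none of these names come from iMacros.
// Returns the number of submitted attempts, or -1 if maxAttempts ran out.
function runSolveLoop(steps, maxAttempts) {
  let attempts = 0;
  steps.clickCheckbox(); // step 1
  while (attempts < maxAttempts) {
    if (steps.isSolved()) return attempts; // steps 2-3
    // Footnote rule: one tile if some are already ticked, otherwise two.
    const tilesToTick = steps.hasTickedTiles() ? 1 : 2;
    steps.tickRandomTiles(tilesToTick); // step 5
    steps.submit();
    attempts++; // step 6: back to the check
  }
  return steps.isSolved() ? attempts : -1;
}

// Mock steps for a dry run: the "puzzle" counts as solved after a fixed
// number of submissions. Purely illustrative.
function makeMockSteps(solveAfter) {
  let submissions = 0;
  const log = [];
  return {
    log,
    clickCheckbox: () => log.push("click"),
    isSolved: () => submissions >= solveAfter,
    hasTickedTiles: () => false,
    tickRandomTiles: (n) => log.push(`tick ${n}`),
    submit: () => { submissions++; },
  };
}

console.log(runSolveLoop(makeMockSteps(3), 100)); // 3
```

Passing the steps in as callbacks keeps the loop logic separate from the browser automation, so the same loop can be dry-run against a mock like this one.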

The algorithm can safely be repeated an unlimited number of times, so it eventually hits the bull’s-eye.

JavaScript wrapping

Since only part of the macro code needs to be repeated, I’ve wrapped the iMacros code in JavaScript (scripting). Read more here.

Identifying frames

At the recording stage, iMacros identifies the reCaptcha HTML frames by name, which is a random number set by Google’s server. That does not work for addressing them repeatedly. Instead, you’ll need to find out which frame numbers belong to the main CAPTCHA code and to the loaded picture puzzle. Through trial and error I found that on the WordPress register page (https://wordpress.org/support/register.php) the main reCaptcha box and the picture puzzle box are frames 5 and 6 respectively.

Here is simple code for getting the frame number of the main captcha HTML frame. Note the number that is prompted after the checkbox is ticked. The picture puzzle frame number should be the next number after the main captcha frame number:

Play this code in a loop. Once you have captured the main captcha frame number, substitute it, along with the picture puzzle frame number, into the final macro code; each appears twice in the code:

FRAME F=<your number>  // captcha main frame number

FRAME F=<your number>+1  // picture puzzle frame number

Submacros

Basically there are 4 macros, each doing its own job:

1. Initial click on the checkbox

2. Check if captcha is solved (checkbox is ticked)

We check whether the checkbox is ticked by hovering over it (CONTENT=EVENT:MOUSEOVER) and reading the return code. See the iMacros Error and Return Codes page.

3. Check whether some pictures are already ticked

If so, we click only one more picture per step (rather than 2 pictures as on the initial puzzle load).

4. Main macro

It clicks one or two pictures (tiles) in a puzzle. This way we proceed with a better chance when a third or fourth picture must be ticked.

The whole iMacro code

Now we put these macros together in a JavaScript wrapper and play them inside a JavaScript loop through the iimPlay() function. I’ve added debugging (iimDisplay()) and benchmarking (start, end vars) code here. Feel free to remove it.

Execution statistics

The macro execution time follows a roughly normal distribution. Over more than 50 samples (trials) I got these results:

  • mean value – 47 loops (~3 min)
  • standard deviation – 21 loops
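
For readers who want to collect the same statistics from their own runs, here is a minimal sketch of the two calculations. The loop counts in the array are hypothetical placeholders, not the measurements behind the figures above, and the function names are made up for this illustration.

```javascript
// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (divides by n - 1).
function sampleStdDev(values) {
  const m = mean(values);
  const squared = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(squared / (values.length - 1));
}

// HYPOTHETICAL loop counts, for illustration only; replace with your own runs.
const loopCounts = [10, 20, 30, 40, 50];
console.log(`mean: ${mean(loopCounts)} loops`);
console.log(`std dev: ${sampleStdDev(loopCounts).toFixed(1)} loops`);
```

As a rough cross-check of the figures above, 47 loops in about 3 minutes works out to a little under 4 seconds per loop.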

Obviously, solving reCaptcha with iMacros code is not optimal in terms of time, but it proves that reCaptcha is not a break-proof anti-scraping solution.

Integration

Some readers may ask whether it’s possible to integrate this solution into a real app. Sure. You can use the iMacros Scripting Interface from many languages (Windows-based).

The iMacros Browser, Internet Explorer (with iMacros Add-on), Google Chrome (with iMacros Add-on) and Firefox (with iMacros Add-On) can be controlled with any Windows programming or scripting language.

As for Linux integration, iMacros “no longer support or update the Linux version of the Scripting Interface” (2013), but you can try to get going with this existing approach.

Conclusion

So far, the brute force approach is the only one I’ve been able to use successfully to break such remotely managed CAPTCHAs. The Client does not know how many challenges the remote CAPTCHA provider (in this case, Google) has issued until the user solves the CAPTCHA. And the provider does not care, since it generates millions of challenges for its host of remote CAPTCHA clients (holders). I feel browser automation is the best way to solve these new, sophisticated bot protection puzzles!

I welcome your comments and questions!