I’ve already written about how the new No CAPTCHA ReCaptcha works, and even had some success breaking it with iMacros browser automation. But the latest scraping tools are, for the most part, driven by Python, so now I want to try the same experiment with Selenium + Python.

Disclaimer. After we published the post, Google drastically complicated reCaptcha.

  1. Google engineers have removed the iframe’s name attribute that we relied on.
  2. They’ve changed the HTML markup of the image puzzle. The table layout is now nested inside a block layout, and the table ranges from 3×3 to 6×6 tiles. The probability of solving it by random clicks has therefore decreased.
    Thus the scripted solution time has drastically increased. The probability of solving a 4×4 puzzle in a single attempt, assuming 2 tiles have to be checked, is now 2/16 × 1/15 × 100% ≈ 0.8% (with 3 tiles it is lower still). That is more than three times lower than the original 2.8%.
  3. Google has also set a session timeout limit. After a certain time, the reCaptcha solution session times out.
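
The single-attempt figures can be checked with a few lines of Python. This is only a sketch, assuming the script ticks exactly as many distinct tiles at random as the puzzle expects; the function name random_guess_probability is hypothetical and not part of the original script.

```python
from math import comb


def random_guess_probability(grid_side: int, correct_tiles: int) -> float:
    """Chance that a random choice of `correct_tiles` distinct tiles
    matches the expected set on a grid_side x grid_side puzzle."""
    total_tiles = grid_side * grid_side
    # Exactly one winning combination out of C(total_tiles, correct_tiles).
    return 1 / comb(total_tiles, correct_tiles)


if __name__ == "__main__":
    for side, tiles in [(3, 2), (4, 2), (4, 3)]:
        p = random_guess_probability(side, tiles)
        print(f"{side}x{side} grid, {tiles} tiles: {p * 100:.2f}%")
```

For a 4×4 grid with 2 tiles this gives 1/120, the same value as 2/16 × 1/15. A 3×3 grid with 2 tiles gives 1/36 ≈ 2.8%, which is consistent with the original figure.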

In practice, we are still working to improve the code so that it beats reCaptcha. I’ve now updated the post with the new code!

Brute force works

The brute force approach works best for cracking this remotely supplied (third-party) CAPTCHA. In a previous post I mentioned that the client (the website with the CAPTCHA) does not control how many [picture puzzle] challenges the user has to take before passing the reCaptcha. So if one iterates over the image puzzles, checking pictures at random and submitting the result to the CAPTCHA provider (Google), the probability of solving it with a single submission is 2.8%. This value was valid for the 2015 reCaptcha; since the picture puzzle was made more complex, the probability has dropped to less than 1%!
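
To get a feel for what these percentages mean in practice, here is a rough sketch that turns a single-submission probability into an expected number of submissions. It assumes every submission is independent and has the same success probability, which is a simplification (reCaptcha may well adapt the puzzles between attempts). Both function names are hypothetical.

```python
def expected_submissions(p: float) -> float:
    """Mean number of submissions until the first success (geometric distribution)."""
    return 1 / p


def chance_within(p: float, attempts: int) -> float:
    """Probability of at least one success within `attempts` submissions."""
    return 1 - (1 - p) ** attempts


if __name__ == "__main__":
    for p in (0.028, 0.008):
        print(f"p={p:.3f}: ~{expected_submissions(p):.0f} submissions on average, "
              f"{chance_within(p, 50) * 100:.0f}% chance within 50 submissions")
```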

Read more of the theoretical part here.

So we need to program Selenium to automate the moves and clicks and to fetch the right reCaptcha elements: tiles (pictures), buttons, and the checkbox (which is itself just an HTML block element).

Let’s get started.

Code in pieces

You can jump directly to the complete updated code (incl. all the imports), but here it is broken down into sections.

This first piece of code launches a basic Firefox browser instance, loads the content from a URL, saves the main window handle mainWin for later use, and identifies the main captcha frame. We identify the iframe by its tag name, so the following code moves the driver to the first iframe:

If your page contains more than just the reCaptcha frames, you will need to work out the frame indexes. Tip: use iMacros for that.

Note: driver.switch_to_frame() is deprecated. Replace it with driver.switch_to.frame().
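
If the same script has to run against both old and new Selenium versions, a small wrapper can hide the difference. This is a minimal sketch; the helper name switch_frame is hypothetical, and driver is assumed to be a Selenium WebDriver instance.

```python
def switch_frame(driver, frame_ref):
    """Switch `driver` into a frame given by index, name, or element.

    Prefers the current driver.switch_to.frame() call and falls back to
    the deprecated driver.switch_to_frame() when that is all there is.
    """
    switch_to = getattr(driver, "switch_to", None)
    if switch_to is not None and hasattr(switch_to, "frame"):
        switch_to.frame(frame_ref)          # current API
    else:
        driver.switch_to_frame(frame_ref)   # deprecated fallback
```

Calling switch_frame(driver, 0) would then move into the first iframe regardless of which of the two methods the installed bindings provide.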

Now we click the checkbox, wait until the picture puzzle appears (it is loaded by reCaptcha’s API), and jump to the second frame, which contains the puzzle itself.

Next, we keep iterating until the reCaptcha picture puzzle is solved. The write_stat procedure writes information about each completed attempt to a CSV file for further statistical analysis.
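
The original write_stat listing is not reproduced here. The following is only a sketch of what such a procedure could look like; the signature and the column names (timestamp, attempt, solved, elapsed_seconds) are assumptions, not the original ones.

```python
import csv
import os
from datetime import datetime

# Hypothetical column layout for the statistics file.
FIELDS = ["timestamp", "attempt", "solved", "elapsed_seconds"]


def write_stat(path, attempt, solved, elapsed_seconds):
    """Append one attempt record to a CSV file, adding a header to a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "attempt": attempt,
            "solved": int(solved),
            "elapsed_seconds": round(elapsed_seconds, 2),
        })
```

Appending one row per attempt keeps the file usable even if the script is interrupted mid-run, and the resulting CSV can be loaded straight into a spreadsheet to estimate the success rate.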

The whole code

Timeout limits

reCaptcha is now sensitive to session timeouts. If the brute force fails to solve it within a certain period of time, reCaptcha’s JS algorithm deliberately stops any further interaction:
– it shows the reCaptcha checkbox as ticked
– the Google server returns {"success":"false"} upon siteverify (point 3: ‘decode the response’).
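
On the website’s side, decoding that response could look roughly like the sketch below. The helper name is_verified is hypothetical, and accepting both a JSON boolean and the quoted string form shown above is an assumption made for robustness.

```python
import json


def is_verified(response_body: str) -> bool:
    """Return True only if the siteverify JSON body reports success."""
    data = json.loads(response_body)
    success = data.get("success", False)
    if isinstance(success, str):
        # Tolerate the string form, e.g. {"success": "false"}.
        return success.strip().lower() == "true"
    return success is True
```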

I’ve found the reCaptcha timeout to be approx. 2.5-3 min.
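
Given that window, it makes sense to stop iterating once the session can no longer succeed and to start over with a fresh one. Below is a minimal sketch of such a time-boxed loop. The function name, the attempt callable (standing for one puzzle submission that returns True on success), and the default budget of 150 seconds (the lower end of the observed timeout) are all assumptions.

```python
import time


def brute_force_with_deadline(attempt, budget_seconds=150.0, clock=time.monotonic):
    """Call attempt() until it returns True or the time budget runs out.

    Returns a (solved, attempts_made) tuple.
    """
    deadline = clock() + budget_seconds
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if attempt():
            return True, attempts
    # Budget exhausted: the caller should reload the page and start a new session.
    return False, attempts
```

The clock parameter exists only so that the loop can be exercised with a fake clock; in normal use the default time.monotonic is fine.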

Proxy usage

Some info from Lanre (our reader and contributor):

          High load sites challenge

While working on a site where I was filling in a form for a few thousand people, I also discovered that this new captcha concept is linked to the IP address: as you progress through the iterations, your chances of getting verified decrease, and beyond 50-80 iterations your session times out and the captcha is no longer valid.

Dynamic IP

I wasn’t able to solve the captchas, I guess because my ISP gives me a static IP. However, whenever I used DHCP on another ISP, I was able to solve the captchas, but only once; I then needed to restart my router before another captcha could be solved.

So a question arose: is a proxy the way to resolve this challenge?
I think it’s worth having a new IP each time you reach a site with reCaptcha, so that Google has no negative history of brute force attempts to solve reCaptcha. Otherwise, the Google reCaptcha algorithm accumulates negative solution/timeout history for that particular IP and makes the subsequent picture puzzles more complex. Using a proxy does not by itself eliminate reCaptcha’s bot suspicion. Do not forget to spoof the user-agent string when going through a proxy.

Conclusion

The remotely managed puzzle CAPTCHA turned out to be vulnerable to brute force, yet since Google enhanced it, it is no longer that simple to crack. There is little correlation between the number of attempts a user submits (submitting the form with the CAPTCHA) and the number of picture puzzle challenges set for that user. That is why browser-automation brute force managed to break a seemingly deadlocked reCaptcha. Because of the increased puzzle complexity, however, the brute force may fail to solve it within the reCaptcha session time, so the session timeout limits how well this kind of solution works. The success rate is now (Apr. 2016) ~30%. The average timeout is 3 min.

Comments or algorithm improvement suggestions welcome!