In this blog post we are going to show how you can solve [Re]captcha with Java and some third-party APIs, and why you should probably avoid captchas in the first place.
For the Python code (+ captcha API) see that post.

The post author is Kevin Sahin from ScrapingNinja.co.

Captcha solving

Completely Automated Public Turing test to tell Computers and Humans Apart is what captcha stands for. Captchas are used to prevent bots from accessing and performing actions on websites or applications. There are dozens of different captcha types, but you likely have seen at least these two:

[Image: captcha_v1]

And this one:

The last one is the most used captcha mechanism, Google ReCaptcha v2. That’s why we are going to see how to “break” these captchas.

The only thing the user has to do is click inside the checkbox. The service then analyzes many factors to determine whether it is dealing with a real user or a bot. We don’t know exactly how this is done, since Google didn’t disclose it for obvious reasons, but there has been a lot of speculation:

  • Clicking behavior analysis: “where did the user click?”, cursor acceleration, etc.
  • Browser fingerprinting
  • Click location history (do you always click straight on the center, or is it random, like a normal user?)
  • Browser history and cookies

For old captchas like the first one, Optical Character Recognition and recent machine-learning frameworks offer excellent solving accuracy (sometimes better than humans…), but for Recaptcha v2 the easiest and most accurate way is to use third-party services.

Many companies offer captcha-solving APIs that use real human operators to solve captchas. I don’t recommend one in particular, but I have found 2captcha.com easy to use and reliable, although relatively expensive ($2.99 for 1000 recaptchas).

Under the hood, these APIs need the specific site-key and the target website URL; with this information they are able to get a human operator to solve the captcha.

Technically the Recaptcha challenge is an iFrame with some magical Javascript code and some hidden input. When you “solve” the challenge, by clicking or solving an image problem, the hidden input is filled with a valid token.

It is this token that interests us, and the 2captcha API will send it back. We then need to fill the hidden input with this token and submit the form.

The first thing you will need to do is to create an account on 2captcha.com and add some funds. You will then find your API key on the main dashboard.

We have set up an example webpage with a simple form with one input and a Recaptcha to solve:

[Image: captcha_sandbox]

We are going to use Chrome in headless mode to post this form and HtmlUnit to make the API calls to 2captcha (we could use any other HTTP client for this). Now let’s code.

Here is some boilerplate code to instantiate both WebDriver and WebClient, along with the API URL and key.

Then we have to call the 2captcha API with the site-key, your API key, and the website URL, as documented here. The API is supposed to respond with the following format: OK|123456.
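
As an illustration of this step, here is a minimal sketch that uses only the JDK's java.net.http.HttpClient instead of HtmlUnit (any HTTP client would do). The in.php endpoint and the key, method=userrecaptcha, googlekey and pageurl parameters reflect our reading of the 2captcha documentation, so treat them as assumptions and check them against the current docs. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for this post, not an official 2captcha client.
final class CaptchaSubmit {

    // Assumed submit endpoint; verify it in the 2captcha documentation.
    static final String SUBMIT_URL = "https://2captcha.com/in.php";

    // Builds the GET URL that asks 2captcha to solve a ReCaptcha v2.
    static String buildSubmitUrl(String apiKey, String siteKey, String pageUrl) {
        return SUBMIT_URL
                + "?key=" + encode(apiKey)
                + "&method=userrecaptcha"
                + "&googlekey=" + encode(siteKey)
                + "&pageurl=" + encode(pageUrl);
    }

    // Extracts the job ID from a response body such as "OK|123456".
    static String parseJobId(String body) {
        String trimmed = body.trim();
        if (!trimmed.startsWith("OK|")) {
            // Anything else is treated as an error code returned by the API.
            throw new IllegalStateException("Unexpected 2captcha response: " + trimmed);
        }
        return trimmed.substring("OK|".length());
    }

    // Sends the request and returns the job ID. This performs a real, billed API call.
    static String submit(String apiKey, String siteKey, String pageUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(buildSubmitUrl(apiKey, siteKey, pageUrl)))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return parseJobId(response.body());
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

Keeping the URL building and the response parsing in separate methods makes it easy to check both without spending credits on a real request.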

 

Now that we have the job ID, we have to poll another API route to know when the ReCaptcha is solved and get the token, as explained in the documentation. It returns CAPCHA_NOT_READY while the token is not ready yet, and OK|TOKEN once it is:

Note that it can take up to 1 minute based on my experience. It could be a good idea to implement a safeguard/timeout in the loop, because on rare occasions the captcha never gets solved.
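
Here is one way the polling loop and its safeguard could look. This is only a sketch: the HTTP call is passed in as a Supplier so that the loop does not depend on a particular HTTP client. In practice it would call the res.php route with action=get and the job ID, which is again our reading of the 2captcha documentation. The class name, the interval and the timeout are assumptions.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper for this post, not an official 2captcha client.
final class CaptchaPoller {

    // Spelled exactly as the API returns it.
    static final String NOT_READY = "CAPCHA_NOT_READY";

    // Calls poll until it yields "OK|TOKEN" or the timeout expires.
    // Returns the token, or an empty Optional when we gave up.
    static Optional<String> waitForToken(Supplier<String> poll,
                                         Duration interval,
                                         Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String body = poll.get().trim();
            if (body.startsWith("OK|")) {
                return Optional.of(body.substring("OK|".length()));
            }
            if (!body.equals(NOT_READY)) {
                // Any other body is treated as an API error code.
                throw new IllegalStateException("Unexpected 2captcha response: " + body);
            }
            // Safeguard: stop if the next attempt would start after the deadline.
            if (System.nanoTime() + interval.toNanos() > deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return Optional.empty();
            }
        }
    }
}
```

With an interval of a few seconds and a timeout of a couple of minutes, an empty result simply means the job took too long, and you can decide whether to submit a new one.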

Now that we have the magic token, we just have to find the hidden input, fill it with the token, and submit the form.

The Selenium API cannot fill a hidden input, so we have to manipulate the DOM to make the input visible, fill it, and make it hidden again before we click on the submit button:
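
The DOM manipulation can be sketched as follows. The helper below only builds the JavaScript snippets, so it runs without Selenium; the comment at the bottom shows how they would typically be passed to Selenium's JavascriptExecutor. We assume the hidden field has the id g-recaptcha-response, as in standard ReCaptcha v2 markup, and the submit button id is purely hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for this post.
final class RecaptchaField {

    // Assumption: standard ReCaptcha v2 markup uses this id for the hidden field.
    static final String RESPONSE_FIELD_ID = "g-recaptcha-response";

    // Only accept simple ids so nothing unexpected ends up in the script.
    private static final Pattern SAFE_ID = Pattern.compile("[A-Za-z0-9_-]+");

    // Returns a JavaScript statement that shows or hides the given element.
    static String displayScript(String fieldId, boolean visible) {
        if (!SAFE_ID.matcher(fieldId).matches()) {
            throw new IllegalArgumentException("Unsupported element id: " + fieldId);
        }
        String display = visible ? "block" : "none";
        return "document.getElementById('" + fieldId + "').style.display = '" + display + "';";
    }
}

// With Selenium, assuming a WebDriver named driver and the token from 2captcha:
//   JavascriptExecutor js = (JavascriptExecutor) driver;
//   js.executeScript(RecaptchaField.displayScript(RecaptchaField.RESPONSE_FIELD_ID, true));
//   driver.findElement(By.id(RecaptchaField.RESPONSE_FIELD_ID)).sendKeys(token);
//   js.executeScript(RecaptchaField.displayScript(RecaptchaField.RESPONSE_FIELD_ID, false));
//   driver.findElement(By.id("submit")).click(); // "submit" is a hypothetical button id
```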

And that’s it :-). You can find the whole Java code here.

Generally, websites don’t use ReCaptcha for each HTTP request, but only for suspicious ones, or for specific actions like account creation, etc. You should always try to figure out if the website is showing you a [Re]captcha because you made too many requests with the same IP address or the same user-agent, or maybe you made too many requests per second.

As you can see, “Recaptcha solving” is quite slow, so the best way to “solve” this problem is to avoid captchas in the first place! To do so, we recommend the article How to scrape websites without getting blocked. Check it out!

Reducing the chance of getting a captcha is better than solving it: it is cheaper and much faster. Sometimes that is not possible, because the web page shows a captcha 100% of the time, but in many cases you can bypass it by being smart with your scrapers.