I want to share how I built an audio captcha recognizer. It was designed to solve captchas back in 2012.

The task

The captchas at Xbox were built from recordings of 5 speakers, each pronouncing the ten digits 0-9, so there were 5*10=50 recorded samples in total. The site's captcha generation algorithm produced a random sequence of 6-8 sounds from these 50 pre-recorded samples, with a random distance between the sounds. The algorithm also added background noise, probably taken from radio recordings, and mixed it into the digit signal.
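To make the task concrete, here is a minimal sketch of a generator following the description above. It is only a model for experiments, not the site's actual algorithm: the function name `make_captcha`, the `(speaker, digit)` dictionary layout, the gap length, and the uniform noise standing in for the radio-like background are all assumptions.

```python
import random

def make_captcha(samples, rng, min_len=6, max_len=8, max_gap=40, noise_amp=0.5):
    """Build a synthetic captcha: random sounds, random gaps, additive noise.

    `samples` maps (speaker, digit) -> list of floats (a hypothetical layout).
    Returns (signal, answer), where answer is the list of chosen digits.
    """
    keys = sorted(samples)
    count = rng.randint(min_len, max_len)
    signal, answer = [], []
    for _ in range(count):
        # Random stretch of silence before each sound.
        signal.extend([0.0] * rng.randint(1, max_gap))
        speaker, digit = rng.choice(keys)
        signal.extend(samples[(speaker, digit)])
        answer.append(digit)
    signal.extend([0.0] * rng.randint(1, max_gap))
    # Uniform noise is a stand-in; the real background was speech-like.
    noisy = [s + rng.uniform(-noise_amp, noise_amp) for s in signal]
    return noisy, answer

rng = random.Random(0)
# 5 speakers x 10 digits = 50 placeholder "recordings" of 30 random values each.
samples = {(sp, d): [rng.uniform(-1, 1) for _ in range(30)]
           for sp in range(5) for d in range(10)}
captcha, answer = make_captcha(samples, rng)
```

A generator like this is handy for testing a recognizer, because the correct answer is known for every synthetic captcha.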


I found that the 6-8 relevant sounds could not be separated from this speech-like noise with traditional band-pass filters, so another method was devised. The method consisted of two parts:

  1. Extract all 50 audio samples
  2. Create a recognizer that compares the known sounds with the fetched signal

Step 1. Getting audio samples

For this step, several hundred unrecognized captchas were downloaded with a script. Then the correlations between these sounds were calculated using a Matlab program. If the same digit spoken by the same speaker appears in two different captchas, the correlation between those fragments must be high. We decreased the noise by summing up several captchas:
[Figure: summing up several audio captchas]
Because of the summation, the sound amplitude grows as N (the number of captchas summed), while the noise amplitude grows only as √N. This gives a relative decrease of the noise level. So after the first step, there were 50 noise-reduced audio samples ready to be used in the recognition phase (the 2nd step).
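The N versus √N argument can be checked numerically. The sketch below is an illustration only: it assumes the copies are already aligned, uses a sine wave as a stand-in for a spoken digit, and uses independent Gaussian noise, which is simpler than the speech-like noise in the real captchas. The names `rms` and `sum_aligned` are made up for this example.

```python
import math
import random

def rms(values):
    """Root-mean-square amplitude of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def sum_aligned(copies):
    """Sample-wise sum of equally long, already aligned recordings."""
    return [sum(column) for column in zip(*copies)]

rng = random.Random(1)
clean = [math.sin(2 * math.pi * i / 50) for i in range(2000)]  # stand-in "digit"
n = 16
copies = [[s + rng.gauss(0, 1.0) for s in clean] for _ in range(n)]

total = sum_aligned(copies)
signal_part = [n * s for s in clean]          # the digit adds up coherently
noise_part = [t - sp for t, sp in zip(total, signal_part)]

snr_single = rms(clean) / 1.0                 # noise standard deviation is 1.0
snr_summed = rms(signal_part) / rms(noise_part)
gain = snr_summed / snr_single                # expected to be close to sqrt(16) = 4
```

With 16 summed copies the signal-to-noise ratio improves by roughly a factor of 4, in line with N/√N = √N.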

Step 2. Recognition by a Sparse filter

Recognition by simple correlation turned out to be too slow (it takes several seconds), so another method was devised: a sparse filter. The task of the sparse filter is to find a known signal inside an unknown one. First, the local extrema of the known signal have to be found:
[Figure: audio captcha sparse filter]
The sparse filter is a finite impulse response (FIR) filter whose coefficients are almost all zero, except at the positions of the local extrema, where they equal +1 or -1.
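As a rough sketch of how such a filter could be built, the function below stores only the non-zero coefficients as (offset, sign) pairs. The sign convention (+1 at local maxima, -1 at local minima), the name `build_sparse_filter`, and the `min_abs` threshold are assumptions for illustration; the original implementation may differ.

```python
def build_sparse_filter(template, min_abs=0.0):
    """Return [(offset, sign)] taps: +1 at local maxima, -1 at local minima.

    All other filter coefficients are implicitly zero. `min_abs` is a
    hypothetical threshold that drops extrema with small amplitude.
    """
    taps = []
    for i in range(1, len(template) - 1):
        prev, cur, nxt = template[i - 1], template[i], template[i + 1]
        if abs(cur) < min_abs:
            continue
        if cur > prev and cur >= nxt:
            taps.append((i, +1))
        elif cur < prev and cur <= nxt:
            taps.append((i, -1))
    return taps

template = [0.0, 0.8, 0.1, -0.9, -0.2, 0.5, 0.0]
taps = build_sparse_filter(template)
# taps == [(1, 1), (3, -1), (5, 1)]
```

Storing only the taps keeps the filter small: a template with thousands of samples reduces to a short list of positions and signs.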
Applying the sparse filter picks out only the important sound and suppresses the noise. Here is the result of applying a sparse filter:
[Figure: result of applying the sparse filter]
The peak shows that the signal has been found and gives its position. The sparse filter works much faster than the correlation method because it uses no multiplications, only additions and subtractions.
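A minimal sketch of the filtering step, assuming the filter is stored as (offset, sign) pairs with +1 at local maxima and -1 at local minima of the known sound. The function name and the toy data are hypothetical; the point is that the inner loop contains only additions and subtractions.

```python
def apply_sparse_filter(signal, taps, length):
    """Slide the sparse filter over `signal` using only additions and subtractions.

    `taps` is a list of (offset, sign) pairs; `length` is the template length.
    Returns one score per candidate start position.
    """
    scores = []
    for start in range(len(signal) - length + 1):
        acc = 0.0
        for offset, sign in taps:
            if sign > 0:
                acc += signal[start + offset]
            else:
                acc -= signal[start + offset]
        scores.append(acc)
    return scores

template = [0.0, 0.8, 0.1, -0.9, -0.2, 0.5, 0.0]
taps = [(1, +1), (3, -1), (5, +1)]      # extrema of the template above
signal = [0.05, -0.1, 0.0, 0.1] + template + [0.1, -0.05, 0.0]
scores = apply_sparse_filter(signal, taps, len(template))
best = max(range(len(scores)), key=scores.__getitem__)
# best == 4: the peak is at the position where the template was embedded
```

The cost per position is proportional to the number of taps rather than to the template length, which is where the speed-up over full correlation comes from.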
An additional check with classic correlation was also done at each place (digit) found. This extra check decreases false positives.
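The verification step could look roughly like the sketch below, which computes a normalized correlation only at the position already found by the sparse filter. The function names and the 0.7 acceptance threshold are assumptions for illustration, not values from the original recognizer.

```python
import math

def normalized_correlation(segment, template):
    """Pearson-style correlation between two equally long sequences (-1..1)."""
    n = len(template)
    mean_s = sum(segment) / n
    mean_t = sum(template) / n
    num = sum((s - mean_s) * (t - mean_t) for s, t in zip(segment, template))
    den = math.sqrt(sum((s - mean_s) ** 2 for s in segment)
                    * sum((t - mean_t) ** 2 for t in template))
    return num / den if den else 0.0

def confirm(signal, start, template, threshold=0.7):
    """Accept a sparse-filter hit only if full correlation at `start` is high."""
    segment = signal[start:start + len(template)]
    if len(segment) < len(template):
        return False
    return normalized_correlation(segment, template) >= threshold

template = [0.0, 0.8, 0.1, -0.9, -0.2, 0.5, 0.0]
signal = [0.05, -0.1, 0.0, 0.1] + template + [0.1, -0.05, 0.0]
hit_ok = confirm(signal, 4, template)    # true position of the template
miss_ok = confirm(signal, 0, template)   # a wrong position
```

Because the expensive correlation runs only at a handful of candidate positions, it adds little to the total solving time.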

Implementation & results

The recognizer was written in C++ for speed and so that it could be compiled with GCC (GNU Compiler Collection) for Linux servers. The recognition rate achieved is 38%, and the captcha solving time is 0.2 sec.
The post was written by Maxim Vedenev.