Reverse Engineering CAPTCHAs
Abram Hindle, Michael W. Godfrey, Richard C. Holt
Software Architecture Group (SWAG)
University of Waterloo
Waterloo, Ontario, CANADA
CAPTCHAs are automated Turing tests used to determine if the end-user is human and not an automated program. Users are asked to read and answer visual CAPTCHAs, which often appear as bitmaps of text characters, in order to gain access to a low-cost resource such as webmail or a blog. CAPTCHAs are generated by software, and the structure of a CAPTCHA gives hints to its implementation. Because of these properties of image processing and image composition, the process that creates CAPTCHAs can often be reverse engineered. Once the implementation strategy of a family of CAPTCHAs has been reverse engineered, CAPTCHA instances may be solved automatically by leveraging weaknesses in the creation process or by comparing a CAPTCHA's output against itself. In this paper, we present a case study in which we reverse engineer and solve real-world CAPTCHAs using simple image processing techniques such as bitmap comparison, thresholding, flood-fill segmentation, dilation, and erosion. We present black-box and white-box methodologies for reverse engineering and solving CAPTCHAs. We also provide an open source toolkit for solving CAPTCHAs that we have used with success rates of 99%, 95%, 61%, 30%, and 27% on hundreds of CAPTCHAs from five real-world examples.
1. Introduction
CAPTCHAs are often the lone sentry that guards potentially valuable or abusable resources. CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. Many websites and services use CAPTCHAs as a Turing test for their clients, to ensure that each request comes from an individual human user and is not an attempt by an automated program to acquire resources in bulk (Figure 1). Essentially, a CAPTCHA is a strange kind of lock, one that is designed to be easily broken by human visual pattern matching, but not by automated software. That is, CAPTCHA design is an exercise in creating a breakable lock.
Unfortunately, CAPTCHAs do not only keep automated spam bots from using resources; they are also an impediment to anyone who is visually impaired, or who has to rely on audio screen readers, braille screen readers, or alternative browsers. Even if an audio CAPTCHA is provided, there are still users who are discriminated against. Thus there are legitimate reasons to automate the solving of these CAPTCHAs, whether the purpose is humanistic or malicious.
Solving CAPTCHAs is a Hard AI image processing problem in the general case; thus it is hard to create an all-encompassing CAPTCHA solver, but it is easy to create a solver for a family of CAPTCHAs. There are an infinite number of methods to produce CAPTCHAs, but CAPTCHA generation is also a Hard AI problem (in the general case); this makes CAPTCHAs similar to the spam problem: a continuous interaction between adversarial parties, each tuning its techniques to defeat the other's defences.
The CAPTCHAs we consider in this paper are primarily bitmap depictions of text that the user is meant to recognize and reproduce. Audio CAPTCHAs and picture CAPTCHAs are not dealt with in this study. For the sake of brevity, within this paper we will use the term captcha to refer only to visual textual bitmapped CAPTCHAs.
Captchas¹ are generated by software, and thus they are relatively deterministic. Their output might vary wildly, but the output, the instance of the captcha, was created by following a deterministic process. As we mentioned above, general captcha generation is a Hard AI problem; captcha generators avoid this complexity by generating specific families of captchas by following a deterministic process. This process and this software can be reverse engineered in both black-box and white-box (source available) settings. Thus target captchas can be solved by reverse engineering and re-implementing the captcha generator. A target captcha may also be solved by exploiting the weaknesses of the operations identified while reverse engineering the captcha.
¹ From this point on, we will use the more common lower-cased spelling of the term.
Our contributions include:
A methodology for reverse engineering the process of captcha creation
A general methodology for solving captchas
A method of solving captchas by leveraging an existing implementation
A set of tools for captcha solving
1.1. Previous Work
Von Ahn et al. formalized the idea of CAPTCHAs as an automated Turing test that leverages Hard AI problems. Our previous work briefly describes some of our captcha solving efforts. Much of our work was motivated by work on the application of shape matching to the EZ-Gimpy captcha. The shape matching algorithm was used for character recognition. It relied on matching the points of the contour of a shape to points of shapes in the database, then finding the best permutation of matching points, measuring the distance, and choosing the class of the closest matching candidate.
Other work focused solely on segmentation, which dealt with noise removal via supervised learning and rules. Others further applied machine learning techniques to aid segmentation and character recognition. We have adopted the use of K-Means clustering to segment captchas as described by our colleagues Caine et al.
2. Common Properties of Captchas
The constraints that are important to every captcha are:
Readable: The captcha must be easily read and decoded by humans.
Unguessable: The captcha message cannot be guessed at random with any real confidence.
Order-able: Characters are read left to right, top to bottom (exceptions could include Hebrew or Arabic captchas). If a captcha is readable, its character ordering should be apparent.
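The unguessable constraint can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each character is drawn uniformly and independently from an alphabet of known size and that every guess is made against a fresh captcha; the function name and its parameters are hypothetical.

```python
def guess_probability(alphabet_size, length, attempts=1):
    """Chance that at least one of `attempts` blind guesses is correct.

    Assumes uniform, independent characters and a fresh captcha per guess.
    """
    p_single = alphabet_size ** -length      # one blind guess
    return 1 - (1 - p_single) ** attempts    # at least one success


if __name__ == "__main__":
    # Hypothetical example: 5 characters over 26 letters plus 10 digits.
    print(guess_probability(36, 5))
    print(guess_probability(36, 5, attempts=1_000_000))
```

Under these assumptions, a five-character captcha over 36 symbols gives a single guess a chance of 1 in 36^5 = 60,466,176, while a short captcha over a small alphabet is quickly worn down by repeated attempts.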
These constraints are important because difficult captchas can dissuade potential customers, which is not the intent of using captchas. Given these constraints, some features and properties of common captcha implementations include:
Bitmap fonts in fixed positions: Many captchas consist of text in a fixed position written in a bitmap font. Although they are easy to break, they are easy to generate and thus very common, while still serving as a minor deterrent to spammers.
Static background: Captchas that use static backgrounds are easy to spot because the background is always the same color or texture.
Random placement: Many captchas use randomly placed characters instead of characters in static positions, as this can complicate segmentation and recognition.
Background/foreground noise: Captchas often deploy background noise and overlapping foreground noise. Noise often consists of random pixels, lines, and other objects. This makes comparison less deterministic.
Linear transformations: Captchas often warp their characters via linear transformations such as rotation or skew. These characters are more difficult to match because they need to be normalized first.
Non-linear transformations: More advanced captchas rely on non-linear transformations and warping because they are harder to normalize. Non-linear transforms include projecting characters onto surfaces, smearing, and warping.
Dripping/Fuzzy text: Some captchas employ creative distortions, obscuring individual characters with smears and spikes that often render the text difficult to read and to match.
Layering: Most importantly, many of these captchas follow some general pattern of composition: the layering of the above techniques, the use of overlays, and transformations. Identifying how a captcha is layered is similar to discovering how the captcha was created.
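As a minimal sketch of how layering mirrors generation, the following Python models a toy captcha built from three layers (a static background, fixed-position bitmap text, and foreground noise) and shows a counter-technique for the lowest layer. The font, image size, pixel values, and all function names are illustrative assumptions and are not taken from any real captcha generator.

```python
import random

WIDTH, HEIGHT = 12, 5

# Hypothetical 3x5 bitmap font; "1" marks an ink pixel.
FONT = {
    "A": ["010", "101", "111", "101", "101"],
    "B": ["110", "101", "110", "101", "110"],
}


def background_layer():
    """Static background: the same faint checker texture for every instance."""
    return [[((x + y) % 2) * 40 for x in range(WIDTH)] for y in range(HEIGHT)]


def text_layer(image, text, left=1, ink=255):
    """Draw each glyph at a fixed position on top of the lower layers."""
    for i, ch in enumerate(text):
        for y, row in enumerate(FONT[ch]):
            for x, bit in enumerate(row):
                if bit == "1":
                    image[y][left + 4 * i + x] = ink
    return image


def noise_layer(image, rng, count=4, value=128):
    """Foreground noise: a few random grey pixels drawn over everything else."""
    for _ in range(count):
        image[rng.randrange(HEIGHT)][rng.randrange(WIDTH)] = value
    return image


def generate(text, seed):
    """Compose the layers cumulatively: background, then text, then noise."""
    rng = random.Random(seed)
    return noise_layer(text_layer(background_layer(), text), rng)


def strip_background(image):
    """Counter-technique for a static background: blank every pixel matching it."""
    bg = background_layer()
    return [[0 if image[y][x] == bg[y][x] else image[y][x]
             for x in range(WIDTH)] for y in range(HEIGHT)]


if __name__ == "__main__":
    for row in strip_background(generate("AB", seed=7)):
        print("".join("#" if v == 255 else "+" if v == 128 else "." for v in row))
```

In this toy model, knowing that the background is static and sits below the text is enough to remove it exactly; the noise layer, which was applied last, would still have to be handled by a separate step.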
The main methodology for solving a captcha is first to analyze the captcha and determine the steps of its generation process, then to find relevant counter-techniques, and finally to configure a specific solver from this information. The specific captcha solver will exploit the weaknesses of the methods used to generate that family of captchas.
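To illustrate what configuring a specific solver might look like, the sketch below chains thresholding, flood-fill segmentation, and left-to-right ordering as a list of steps. It assumes a grayscale image stored as nested lists, bright text on a darker background, and 4-connected segments; the function names and the cutoff value are hypothetical, and character recognition is left out.

```python
def threshold(image, cutoff=200):
    """Binarise: keep only pixels at least as bright as the cutoff."""
    return [[1 if v >= cutoff else 0 for v in row] for row in image]


def flood_fill_segments(binary):
    """Group 4-connected ink pixels into segments with an iterative flood fill."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    segments = []
    for y0 in range(height):
        for x0 in range(width):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            stack, pixels = [(x0, y0)], []
            seen[y0][x0] = True
            while stack:
                x, y = stack.pop()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            segments.append(pixels)
    return segments


def order_left_to_right(segments):
    """Order segments by their leftmost pixel, matching the reading order."""
    return sorted(segments, key=lambda seg: min(x for x, _ in seg))


def solve_pipeline(image, steps):
    """Apply the configured counter-techniques in sequence."""
    data = image
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    image = [
        [0, 255, 0, 0, 255, 255],
        [0, 255, 0, 90, 0, 255],
        [0, 0, 0, 0, 0, 0],
    ]
    steps = [threshold, flood_fill_segments, order_left_to_right]
    for segment in solve_pipeline(image, steps):
        print(sorted(segment))
```

Keeping the steps in a list means that a solver for a different family of captchas can be configured by swapping or reordering steps rather than rewriting the solver.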
3.1. Reverse Engineering a Captcha
Knowledge of common captcha implementation strategies allows us to reverse engineer the process of captcha creation. This helps us choose and configure the appropriate tools to solve the captcha. In some cases, if a captcha can be reproduced, then instances can also be solved by comparing them to instances of the captcha generated during a parameter search (as explained in Section 3.3). A general methodology for reverse engineering a captcha is to identify the following operations and order them by their layering:
Figure 1. CAPTCHA Creation and Human Deciphering
Figure 2. Model of Computer CAPTCHA Solving
Layering: discover operations and overlays
Background: discovery and filtering
Noise: method and possible removal
Text: method and properties of the text
Transforms: method, what objects or layers are transformed, and possible reverse transforms
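Once these operations have been identified, a re-implemented generator can be used to solve instances by comparison, as mentioned earlier in this section. The sketch below is one hedged illustration: `render` stands in for a re-implemented generator with a hypothetical 3x3 font and a single unknown parameter (the left offset), and the search keeps the candidate whose bitmap differs least from the target. None of these names or values come from a real captcha.

```python
from itertools import product

# Hypothetical 3x3 bitmap font; "1" marks an ink pixel.
GLYPHS = {
    "0": ["111", "101", "111"],
    "1": ["010", "010", "010"],
    "7": ["111", "001", "001"],
}


def render(text, left, width=10, height=3):
    """Stand-in for a re-implemented generator: fixed font, variable offset."""
    image = [[0] * width for _ in range(height)]
    for i, ch in enumerate(text):
        for y, row in enumerate(GLYPHS[ch]):
            for x, bit in enumerate(row):
                if bit == "1":
                    image[y][left + 4 * i + x] = 1
    return image


def difference(a, b):
    """Bitmap comparison: number of pixels on which two images disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def solve_by_comparison(target, length=2, offsets=range(0, 3)):
    """Search over texts and offsets; return (score, text, left) of the best match."""
    best = None
    for chars in product(GLYPHS, repeat=length):
        text = "".join(chars)
        for left in offsets:
            score = difference(render(text, left), target)
            if best is None or score < best[0]:
                best = (score, text, left)
    return best


if __name__ == "__main__":
    target = render("71", left=2)
    target[2][0] = 1  # one pixel of noise that the generator cannot reproduce
    print(solve_by_comparison(target))
```

An exhaustive search like this is only practical when the text is short and the parameter space is small; it is meant to show the principle of comparing generated instances against a target, not a scalable search strategy.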
Layering refers to the general pattern of captcha construction. Layers consist of bitmaps, text, vector graphics, transforms, and noise. Layers are usually placed on top of each other cumulatively (assuming some transparency or opacity) or, in the case of transforms, as operations in a sequence. In many cases the sequence of operations applied is analogous to layering. This is similar to how layering is used in photo-editing software like Adobe Photoshop or the GIMP. One difficulty is in determining when the text is added, although if one assumes that layers are added cumulatively, it is often not hard to see if the text is overlapped or transformed by anything such as line noise or a rotation. Layering is often given away by