Machine Learning Methods for CAPTCHA Recognition Rachel Shadoan Zachery Tidwell, II

Upload: rachelshadoan

Post on 05-Dec-2014


DESCRIPTION

This presentation describes several approaches to segmenting and solving CAPTCHAs.

TRANSCRIPT

Page 1: Machine Learning Methods For Captcha Recognition

Machine Learning Methods for CAPTCHA Recognition

Rachel Shadoan
Zachery Tidwell, II

Page 2: Machine Learning Methods For Captcha Recognition

CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart

Why are they interesting?
o Harder than normal text recognition
    On par with handwriting recognition, reading damaged text
o Techniques translate well to other problems
    Facial recognition (Gonzaga, 2002)
    Weed identification (Yang, 2000)
o Near infinite data sets
    Easier to avoid over-fitting

Page 3: Machine Learning Methods For Captcha Recognition

Hypothesis

CAPTCHA recognition can be accomplished to a high degree of accuracy using machine learning methods with minimal preprocessing of inputs.

Page 4: Machine Learning Methods For Captcha Recognition

Methods

Learning Methods
o Feed-forward Neural Nets
o Self-Organizing Maps
o K-Means
o Cluster Classification

Segmentation Methods
o Overlapping
o Whitespace
o K-Means

Tools
o JCaptcha
o Image Processing

Page 5: Machine Learning Methods For Captcha Recognition

JCaptcha

o Open-source CAPTCHA generation software

o Highly configurable
    Can produce CAPTCHAs of many levels of difficulty

o Check it out at: http://jcaptcha.sourceforge.net

Page 6: Machine Learning Methods For Captcha Recognition

Image Processing

Sparse Image
Represents images as an unbounded set of pixels
Each pixel is a value between 0 and 1 and a coordinate pair
Center each image before turning it into a matrix of 0s and 1s

[Figure: Original vs. After Transformation]
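The centering step above can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the function name `center_to_matrix`, the dictionary representation of the sparse image, the 0.5 threshold, and the convention that values at or above the threshold count as ink are all assumptions. The 20 x 20 default is chosen only because it yields 400 values, matching the 400 network inputs listed on later slides.

```python
def center_to_matrix(pixels, size=20, threshold=0.5):
    """Turn a sparse image into a centered size x size matrix of 0s and 1s.

    pixels: dict mapping (x, y) -> value in [0, 1]. Coordinates may be
    anywhere (the set is unbounded); values >= threshold are treated as ink.
    Both conventions are assumptions made for this sketch.
    """
    matrix = [[0] * size for _ in range(size)]
    ink = [(x, y) for (x, y), v in pixels.items() if v >= threshold]
    if not ink:
        return matrix
    xs = [x for x, _ in ink]
    ys = [y for _, y in ink]
    # Shift so the bounding box of the ink sits in the middle of the grid.
    dx = (size - (max(xs) - min(xs) + 1)) // 2 - min(xs)
    dy = (size - (max(ys) - min(ys) + 1)) // 2 - min(ys)
    for x, y in ink:
        col, row = x + dx, y + dy
        if 0 <= col < size and 0 <= row < size:  # clip ink that does not fit
            matrix[row][col] = 1
    return matrix
```

Flattening the rows of the result gives a fixed-length input vector regardless of where the character sat in the original image.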

Page 7: Machine Learning Methods For Captcha Recognition

Feed-Forward Neural Nets

As covered in class

Page 8: Machine Learning Methods For Captcha Recognition

Self-Organizing Maps

Training

Initialize N buckets to random values

For each input

Find the bucket that is “closest” to the input

Adjust the “closest” bucket to more closely match the input using an exponential average

Collection

For many inputs

Sort each input into the bucket it most closely matches

For each bucket and each character

Calculate the probability of that character going into that bucket.
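The training and collection steps above might look like the sketch below. It follows the slide literally, so only the winning bucket is updated; a textbook Kohonen map would also update the winner's neighbours. The squared Euclidean distance, the learning rate of 0.1, the function names, and the reading of "probability" as the fraction of a bucket's inputs that carry each character are assumptions.

```python
import random
from collections import Counter, defaultdict

def distance(a, b):
    # Squared Euclidean distance; the slides do not say which metric was used.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(buckets, vec):
    return min(range(len(buckets)), key=lambda i: distance(buckets[i], vec))

def train(inputs, n_buckets, rate=0.1, seed=0):
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Initialize N buckets to random values.
    buckets = [[rng.random() for _ in range(dim)] for _ in range(n_buckets)]
    for vec in inputs:
        winner = buckets[closest(buckets, vec)]
        # Exponential average: move the winner a fraction `rate` toward the input.
        for j, x in enumerate(vec):
            winner[j] = (1 - rate) * winner[j] + rate * x
    return buckets

def collect(buckets, labeled_inputs):
    """For each bucket, estimate the share of its inputs that is each character."""
    counts = defaultdict(Counter)
    for vec, label in labeled_inputs:
        counts[closest(buckets, vec)][label] += 1
    return {
        i: {ch: n / sum(c.values()) for ch, n in c.items()}
        for i, c in counts.items()
    }
```

To classify a new chunk under these assumptions, find its closest bucket and pick the character with the highest share in that bucket.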

Page 9: Machine Learning Methods For Captcha Recognition

K-Means
• Very similar to Self‐Organizing Maps (SOMs)

• Can use the same classifying mechanism as used for SOM
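For comparison, a minimal batch K-Means sketch is shown here. The main difference from the update described for SOMs is that each center is recomputed as the mean of all points assigned to it, rather than nudged toward one input at a time. The function name, the iteration count, and initializing centers from randomly sampled points are assumptions.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: start from k distinct input points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], p)),
            )
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers
```

The resulting centers play the same role as SOM buckets, so the same count-characters-per-bucket classification can be applied to them.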

Page 10: Machine Learning Methods For Captcha Recognition

Overlapping Segmentation
• Divide image into fixed number of overlapping tiles of the same size

• In our case, 20 x 20 pixels with a 50% overlap

• Discard chunks under a certain size and chunks that are all white

Note: This is a B with part of it cut off, not an E. Therein lies the rub.
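A rough sketch of the tiling described on this slide follows. The 20 x 20 tile size and 50% overlap come from the slide; the function name, the `min_side` cutoff used for "under a certain size", and the convention that 1 means ink and 0 means white are assumptions.

```python
def overlapping_tiles(image, tile=20, overlap=0.5, min_side=10):
    """image: list of rows of 0/1 values (1 = ink). Returns (top, left, chunk) tuples."""
    step = max(1, round(tile * (1 - overlap)))  # 10 pixels for 20 px tiles at 50%
    kept = []
    for top in range(0, len(image), step):
        for left in range(0, len(image[0]), step):
            chunk = [row[left:left + tile] for row in image[top:top + tile]]
            # Tiles at the right and bottom edges can come out truncated.
            too_small = len(chunk) < min_side or len(chunk[0]) < min_side
            all_white = not any(any(row) for row in chunk)
            if not (too_small or all_white):
                kept.append((top, left, chunk))
    return kept
```

Because the tile positions ignore letter boundaries, some kept chunks will contain only part of a letter or parts of two letters, which is the problem the note above points out.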

Page 11: Machine Learning Methods For Captcha Recognition

Whitespace Segmentation

• Iterate through the image from left to right; segment when a full column of whitespace is encountered

• Works perfectly for well-spaced text
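The column scan can be sketched as below, assuming a binary image stored as a list of rows in which 1 means ink and 0 means white. The function name and the (start, end) return format are illustrative choices.

```python
def whitespace_segments(image):
    """Return (start, end) column ranges, end exclusive, for each run of inked columns."""
    width = len(image[0])
    segments = []
    start = None  # first column of the segment currently being scanned
    for col in range(width):
        has_ink = any(row[col] for row in image)
        if has_ink and start is None:
            start = col
        elif not has_ink and start is not None:
            segments.append((start, col))  # a fully white column closes the segment
            start = None
    if start is not None:
        segments.append((start, width))
    return segments
```

If two letters touch or overlap horizontally, no fully white column separates them and they come back as a single segment, which is why this approach depends on well-spaced text.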

Page 12: Machine Learning Methods For Captcha Recognition

K-Means Segmentation
• Performs better than heuristic segmentation on closely-packed inputs
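The slide does not say what is being clustered. One plausible reading, sketched here purely as an assumption, is to cluster the (x, y) coordinates of ink pixels into k groups, with k set to the known number of characters. The function name and the evenly spaced starting centers are also assumptions.

```python
def kmeans_segment(ink, k, iterations=20):
    """ink: list of (x, y) ink-pixel coordinates. Returns k pixel groups, left to right."""
    xs = [x for x, _ in ink]
    mean_y = sum(y for _, y in ink) / len(ink)
    lo, hi = min(xs), max(xs)
    # Start the centers evenly spaced across the horizontal extent of the ink.
    centers = [(lo + (hi - lo) * (i + 0.5) / k, mean_y) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x, y in ink:
            nearest = min(
                range(k),
                key=lambda i: (centers[i][0] - x) ** 2 + (centers[i][1] - y) ** 2,
            )
            groups[nearest].append((x, y))
        # Recompute each center as the mean of its pixels; keep it if the group is empty.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    order = sorted(range(k), key=lambda i: centers[i][0])
    return [groups[i] for i in order]
```

Unlike the whitespace scan, this does not need an empty column between letters, since pixels are assigned to whichever center is nearest.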

Page 13: Machine Learning Methods For Captcha Recognition

Segmentation Comparison

[Figure: example segmentations labeled Even-width, K-Means, and Whitespace]

Page 14: Machine Learning Methods For Captcha Recognition

Experiment 1

Machine Learning Method: Self-Organizing Map

Topology: 200 buckets, initialized randomly

Inputs:
    3 letter CAPTCHAs
    Random fonts
    Letters A-G
    “Chunked” using overlapping segmentation

Page 15: Machine Learning Methods For Captcha Recognition

Experiment 1 Results

Buckets fell into three primary categories:

Distinguishable letters

Chunks with halves of two letters

Indistinguishable noise

Page 16: Machine Learning Methods For Captcha Recognition

Experiment 1 Results

Page 17: Machine Learning Methods For Captcha Recognition

Experiment 2

ML Method: Neural Net

Topology:
    Fully connected
    400 inputs
    50 node hidden layer
    7 outputs

Inputs:
    Single letter CAPTCHAs
    Random fonts
    Letters A-G

[Figure: network diagram with 400 input nodes, 50 hidden nodes, and 7 output nodes. Each output answers "Contains … ?" for one letter, A through G, with 0 or 1.]
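To make the 400-50-7 topology concrete, here is a minimal forward pass through such a network. It is only a sketch: the sigmoid activation, the bias terms, the weight range, and all function names are assumptions, the weights are random rather than trained, and the training step is omitted entirely.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    # One weight row per output node; the last entry of each row is the bias.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    # zip stops at len(inputs), so the bias is added separately via row[-1].
    return [
        1 / (1 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + row[-1])))
        for row in layer
    ]

def predict(hidden, output, pixels, letters="ABCDEFG"):
    scores = forward(output, forward(hidden, pixels))
    # Each score is the network's answer to "contains this letter?"; take the largest.
    return letters[scores.index(max(scores))], scores

rng = random.Random(0)
hidden = make_layer(400, 50, rng)   # 400 inputs -> 50 hidden nodes
output = make_layer(50, 7, rng)     # 50 hidden nodes -> 7 outputs (A-G)
letter, scores = predict(hidden, output, [0] * 400)
```

With a flattened 20 x 20 binary image as `pixels`, the seven scores line up with the letters A through G in order.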

Page 18: Machine Learning Methods For Captcha Recognition

Experiment 2 Results

[Figure: Neural Net Learning Curve]

Page 19: Machine Learning Methods For Captcha Recognition

Experiment 2 Results

Neural Net Accuracy vs. Size of Hidden Layer

Past a certain number of nodes in the hidden layer, adding more nodes has little further effect on accuracy.

Page 20: Machine Learning Methods For Captcha Recognition

Experiment 3

ML Method: Neural Net

Topology:
    Fully connected
    400 inputs
    1000 node hidden layer
    7 outputs

ML Method: SOM

Topology: 500 buckets

Inputs:
    4 letter CAPTCHAs
    Random fonts
    Letters A-G

Page 21: Machine Learning Methods For Captcha Recognition

Experiment 3

Neural Net vs. SOM on CAPTCHAs Length 4, Letters A‐G

Page 22: Machine Learning Methods For Captcha Recognition

Experiment 4

ML Method: Neural Net

Topology:
    Fully connected
    400 inputs
    1000 node hidden layer
    7 outputs

ML Method: SOM

Topology: 500 buckets

Inputs:
    4 letter CAPTCHAs
    Random fonts
    Letters A-Z

Page 23: Machine Learning Methods For Captcha Recognition

Experiment 4

Neural Net vs. SOM on CAPTCHAs Length 4, Letters A‐Z

Page 24: Machine Learning Methods For Captcha Recognition

Experiment 5

ML Method: Neural Net

Topology:
    Fully connected
    400 inputs
    1000 node hidden layer
    7 outputs

ML Method: SOM

Topology: 500 buckets

Inputs:
    5 letter CAPTCHAs
    Random fonts
    Letters A-Z

Page 25: Machine Learning Methods For Captcha Recognition

Experiment 5

Neural Net vs. SOM on CAPTCHAs Length 5, Letters A-Z

Page 26: Machine Learning Methods For Captcha Recognition

What it all means
• Increasing the number of characters dramatically decreases total accuracy because segmentation quality decreases

• True positive rate goes down when segmentation quality decreases

• Hence, better segmentation is the key
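As a side illustration of why total accuracy is so sensitive to per-character errors, the sketch below shows how accuracy compounds with length. It assumes every character is recognized independently with the same accuracy, which the slides do not claim, and the 0.9 in the comments is a made-up value, not a result from these experiments.

```python
def whole_captcha_accuracy(per_char_accuracy, length):
    # Assumption: characters are recognized independently, so the chance of
    # getting all of them right is the product of the per-character chances.
    return per_char_accuracy ** length

# Hypothetical per-character accuracy of 0.9 (not a measured figure):
four_letters = whole_captcha_accuracy(0.9, 4)  # about 0.656
five_letters = whole_captcha_accuracy(0.9, 5)  # about 0.590
```

Under this simplified model, any drop in per-character accuracy caused by poor segmentation is multiplied across every character in the CAPTCHA.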

Page 27: Machine Learning Methods For Captcha Recognition

Future Work

Improved Segmentation
o Wirescreen segmentation
o Ensemble techniques

Improved True Positive Rates with Current System
o Ensemble techniques

New problems
o Handwriting recognition
o Bot net of doom

Page 28: Machine Learning Methods For Captcha Recognition

Questions?