Machine Learning Methods for CAPTCHA Recognition

Rachel Shadoan
Zachery Tidwell, II

CAPTCHA

Completely Automated Public Turing Test to tell Computers and Humans Apart

Why are they interesting?

o Harder than normal text recognition
    On par with handwriting recognition, reading damaged text

o Techniques translate well to other problems
    Facial recognition (Gonzaga, 2002)
    Weed identification (Yang, 2000)

o Near-infinite data sets
    Easier to avoid over-fitting

Hypothesis

CAPTCHA recognition can be accomplished to a high degree of accuracy using machine learning methods with minimal preprocessing of inputs.

Methods

Learning Methods
o Feed-forward Neural Nets
o Self-Organizing Maps
o K-Means
o Cluster Classification

Segmentation Methods
o Overlapping
o Whitespace
o K-Means

Tools
o JCaptcha
o Image Processing

JCaptcha

o Open-source CAPTCHA generation software

o Highly configurable
    Can produce CAPTCHAs of many levels of difficulty

o Check it out at: http://jcaptcha.sourceforge.net

Image Processing

Sparse Image
    Represents images as an unbounded set of pixels
    Each pixel is a value between 0 and 1 and a coordinate pair
    Center each image before turning it into a matrix of 0s and 1s

[Figure: Original / After Transformation]
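The sparse-image steps above can be sketched in Python. This is a minimal sketch, not the authors' implementation: it assumes a sparse image is a dict mapping (x, y) to an intensity in [0, 1] with 1 meaning ink, and the function name, the 20 x 20 grid size, and the 0.5 threshold are all illustrative assumptions.

```python
def to_centered_matrix(pixels, size=20, threshold=0.5):
    """Center a sparse image and rasterize it into a size x size 0/1 matrix.

    pixels: dict mapping (x, y) -> intensity in [0, 1] (assumed: 1 = ink).
    """
    matrix = [[0] * size for _ in range(size)]
    ink = [(x, y) for (x, y), v in pixels.items() if v >= threshold]
    if not ink:
        return matrix
    xs = [x for x, _ in ink]
    ys = [y for _, y in ink]
    # Shift so the center of the ink's bounding box lands on the grid center.
    dx = (size - 1) // 2 - (min(xs) + max(xs)) // 2
    dy = (size - 1) // 2 - (min(ys) + max(ys)) // 2
    for x, y in ink:
        cx, cy = x + dx, y + dy
        if 0 <= cx < size and 0 <= cy < size:  # drop pixels that fall off the grid
            matrix[cy][cx] = 1
    return matrix
```

Because the pixel set is unbounded, the coordinates can be arbitrarily large; centering on the bounding box is one simple way to make the matrix independent of where the character sat in the original image.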

Feed-Forward Neural Nets

As covered in class

Self-Organizing Maps

Training
    Initialize N buckets to random values
    For each input
        Find the bucket that is “closest” to the input
        Adjust the “closest” bucket to more closely match the input using an exponential average

Collection
    For many inputs
        Sort each input into the bucket it most closely matches
    For each bucket and each character
        Calculate the probability of that character going into that bucket
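The training and collection steps above can be sketched in Python. This is a minimal sketch under stated assumptions: squared Euclidean distance stands in for "closest", the smoothing factor alpha is a hypothetical parameter, and the collection step is read as estimating P(character | bucket). As on the slide, only the winning bucket is adjusted; no neighborhood update is included.

```python
import random
from collections import Counter, defaultdict


def distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(buckets, vec):
    """Index of the bucket nearest to vec."""
    return min(range(len(buckets)), key=lambda i: distance(buckets[i], vec))


def train(inputs, n_buckets, alpha=0.1, seed=0):
    """Training: move only the winning bucket toward each input."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    buckets = [[rng.random() for _ in range(dim)] for _ in range(n_buckets)]
    for vec in inputs:
        b = buckets[closest(buckets, vec)]
        for j, x in enumerate(vec):
            # Exponential average: new = (1 - alpha) * old + alpha * input
            b[j] = (1 - alpha) * b[j] + alpha * x
    return buckets


def collect(buckets, labeled_inputs):
    """Collection: estimate P(character | bucket) from labeled inputs."""
    counts = defaultdict(Counter)
    for vec, char in labeled_inputs:
        counts[closest(buckets, vec)][char] += 1
    return {
        i: {ch: n / sum(c.values()) for ch, n in c.items()}
        for i, c in counts.items()
    }
```

To classify a new chunk, find its closest bucket and read off the most probable character from the table that collect returns.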

K-Means

• Very similar to Self-Organizing Maps (SOMs)

• Can use the same classifying mechanism as used for SOMs

Overlapping Segmentation

• Divide image into a fixed number of overlapping tiles of the same size

• In our case, 20 x 20 pixels with a 50% overlap

• Discard chunks under a certain size and chunks that are all white

Note: This is a B with part of it cut off, not an E. Therein lies the rub.
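The tiling described above can be sketched in Python. This is a minimal sketch, assuming the image is a list of rows of 0/1 values with 1 meaning ink. The slide does not say how chunk "size" is measured, so the min_ink parameter (a minimum count of ink pixels) is an assumption; the function name is hypothetical.

```python
def overlapping_tiles(image, tile=20, overlap=0.5, min_ink=1):
    """Cut a 0/1 image (list of rows, 1 = ink) into overlapping square tiles.

    Returns (x, y, tile_rows) for every tile that survives the filter.
    """
    step = max(1, int(tile * (1 - overlap)))  # 20 px at 50% overlap -> step of 10
    height, width = len(image), len(image[0])
    kept = []
    for y in range(0, max(1, height - tile + 1), step):
        for x in range(0, max(1, width - tile + 1), step):
            rows = [row[x:x + tile] for row in image[y:y + tile]]
            ink = sum(map(sum, rows))
            if ink >= min_ink:  # min_ink=1 drops tiles that are all white
                kept.append((x, y, rows))
    return kept
```

Because the tile boundaries ignore where letters start and end, a tile can contain only part of a letter, which is exactly the cut-off B problem noted above.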

Whitespace Segmentation

• Iterate through the image from left to right and segment when a full column of whitespace is encountered

• Works perfectly for well-spaced text
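The left-to-right column scan can be sketched in Python. This is a minimal sketch, assuming the image is a list of rows of 0/1 values with 1 meaning ink; the function name is hypothetical.

```python
def whitespace_segments(image):
    """Split a 0/1 image (list of rows, 1 = ink) at all-white columns.

    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(image[0])
    segments, start = [], None
    for x in range(width):
        blank = all(row[x] == 0 for row in image)
        if not blank and start is None:
            start = x                      # entering a run of inked columns
        elif blank and start is not None:
            segments.append((start, x))    # a full white column ends the segment
            start = None
    if start is not None:
        segments.append((start, width))    # segment touching the right edge
    return segments
```

If two letters touch or overlap, no all-white column separates them and they come out as a single segment, which is why this approach depends on well-spaced text.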

K-Means Segmentation

• Performs better than heuristic segmentation on closely-packed inputs
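One way K-Means can be used for segmentation is sketched below in Python. The slides do not say which features were clustered, so this sketch assumes the ink pixels are clustered on their x-coordinate alone, with k set to the expected number of characters. The function name, the evenly spread starting centers, and the iteration count are all assumptions.

```python
def kmeans_segments(image, k, iterations=20):
    """Group ink pixels of a 0/1 image into k clusters by x-coordinate.

    Returns k lists of (x, y) pixels, ordered left to right.
    """
    pixels = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    if not pixels:
        return [[] for _ in range(k)]
    xs = sorted(x for x, _ in pixels)
    # Deterministic start: spread the initial centers evenly over the ink.
    centers = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda i: abs(p[0] - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean x of its pixels (keep it if empty).
        centers = [sum(x for x, _ in g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    order = sorted(range(k), key=lambda i: centers[i])
    return [groups[i] for i in order]
```

Unlike the whitespace scan, this still produces k segments when the letters touch, since it needs no blank column between them.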

Segmentation Comparison

[Figure: even-width, K-Means, and whitespace segmentation compared]

Experiment 1

Machine Learning Method:
    Self-Organizing Map

Topology:
    200 buckets, initialized randomly

Inputs:
    3-letter CAPTCHAs
    Random fonts
    Letters A-G
    “Chunked” using overlapping segmentation

Experiment 1 Results

Buckets fell into three primary categories:

Distinguishable letters

Chunks with halves of two letters

Indistinguishable noise


Experiment 2

ML Method:
    Neural Net

Topology:
    Fully connected
    400 inputs
    50-node hidden layer
    7 outputs

Inputs:
    Single-letter CAPTCHAs
    Random fonts
    Letters A-G

[Figure: network diagram with 400 input nodes, 50 hidden nodes, and 7 output nodes]

Contains … ?
    A: 0 or 1
    B: 0 or 1
    C: 0 or 1
    D: 0 or 1
    E: 0 or 1
    F: 0 or 1
    G: 0 or 1
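The 400-50-7 topology can be sketched in Python as a forward pass. This is a minimal sketch with untrained random weights: the sigmoid activation, the weight range, and all function names are assumptions, and the training step (for example backpropagation) is omitted.

```python
import math
import random

LETTERS = "ABCDEFG"


def make_layer(n_in, n_out, rng):
    """One fully connected layer: n_out rows of n_in weights plus a bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    """Sigmoid of the weighted sum for every node in the layer."""
    out = []
    for weights in layer:
        # weights[-1] is the bias; zip stops before it.
        total = weights[-1] + sum(w * x for w, x in zip(weights, inputs))
        out.append(1.0 / (1.0 + math.exp(-total)))
    return out


def predict(hidden, output, pixels):
    """Run a 400-value image through the net; return per-letter scores."""
    scores = forward(output, forward(hidden, pixels))
    return dict(zip(LETTERS, scores))


rng = random.Random(0)
hidden = make_layer(400, 50, rng)   # 400 inputs -> 50 hidden nodes
output = make_layer(50, 7, rng)     # 50 hidden nodes -> 7 outputs (A-G)
scores = predict(hidden, output, [0] * 400)
```

The 400 inputs correspond to a flattened 20 x 20 matrix of 0s and 1s. Each of the 7 outputs is a score between 0 and 1 for "contains this letter", which can be rounded to the 0 or 1 shown on the slide.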

Neural Net Learning Curve



Neural Net Accuracy vs. Size of Hidden Layer

Past a certain number of nodes in the hidden layer, the topology ceases to have a huge impact on accuracy.




Experiment 3

ML Method:
    SOM

Topology:
    500 buckets

Inputs:
    4-letter CAPTCHAs
    Random fonts
    Letters A-G

Neural Net vs. SOM on CAPTCHAs Length 4, Letters A-G




Experiment 4

ML Method:
    SOM

Inputs:
    4-letter CAPTCHAs
    Random fonts
    Letters A-Z

Neural Net vs. SOM on CAPTCHAs Length 4, Letters A-Z




Experiment 5

ML Method:
    SOM

Inputs:
    5-letter CAPTCHAs
    Random fonts
    Letters A-Z

Neural Net vs. SOM on CAPTCHAs Length 5, Letters A-Z

What it all means

• Increasing the number of characters dramatically decreases total accuracy because segmentation quality decreases

• True positive rate goes down when segmentation quality decreases

• Hence, better segmentation is the key

Future Work

Improved Segmentation
o Wirescreen segmentation
o Ensemble techniques

Improved True Positive Rates with Current System
o Ensemble techniques

New problems
o Handwriting recognition
o Bot net of doom

Questions?
