Machine Learning Methods for CAPTCHA Recognition
Rachel Shadoan
Zachery Tidwell, II
CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart
Why are they interesting?
o Harder than normal text recognition
  On par with handwriting recognition, reading damaged text
o Techniques translate well to other problems
  Facial recognition (Gonzaga, 2002)
  Weed identification (Yang, 2000)
o Near infinite data sets
  Easier to avoid over-fitting
Hypothesis
CAPTCHA recognition can be accomplished to a high degree of accuracy using machine learning methods with minimal preprocessing of inputs.
Methods
Learning Methods
o Feed-forward Neural Nets
o Self-Organizing Maps
o K-Means
o Cluster Classification
Segmentation Methods
o Overlapping
o Whitespace
o K-Means
Tools
o JCaptcha
o Image Processing
JCaptcha
o Open-source CAPTCHA generation software
o Highly configurable
  Can produce CAPTCHAs of many levels of difficulty
o Check it out at: http://jcaptcha.sourceforge.net
Image Processing
Sparse Image
o Represents images as an unbounded set of pixels
o Each pixel is a value between 0 and 1 and a coordinate pair
o Center each image before turning it into a matrix of 0s and 1s
Figure panels: Original, After Transformation
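The sparse-image steps above can be sketched roughly as follows. This is an illustration only: the 20 x 20 grid, the 0.5 threshold, the convention that a higher value means more ink, and the name `center_and_rasterize` are all assumptions, not details taken from the original implementation.

```python
import numpy as np

def center_and_rasterize(pixels, size=20, threshold=0.5):
    """pixels: iterable of (x, y, value) with integer x, y and value in [0, 1]."""
    # Keep only pixels dark enough to count as ink (assumed: higher value = more ink).
    ink = [(x, y) for x, y, v in pixels if v >= threshold]
    grid = np.zeros((size, size), dtype=np.uint8)
    if not ink:
        return grid
    xs = np.array([p[0] for p in ink])
    ys = np.array([p[1] for p in ink])
    # Shift so the bounding box of the ink is centered in the grid.
    dx = (size - (xs.max() - xs.min() + 1)) // 2 - xs.min()
    dy = (size - (ys.max() - ys.min() + 1)) // 2 - ys.min()
    for x, y in zip(xs + dx, ys + dy):
        if 0 <= x < size and 0 <= y < size:  # drop pixels that fall outside the grid
            grid[y, x] = 1
    return grid
```

Because the input is a set of coordinate pairs rather than a fixed canvas, the same function works wherever the character sits in the source image.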
Feed-Forward Neural Nets
As covered in class
Self-Organizing Maps
Training
Initialize N buckets to random values
For each input
Find the bucket that is “closest” to the input
Adjust the “closest” bucket to more closely match the input using exponential average
Collection
For many inputs
Sort each input into the bucket it most closely matches
For each bucket and each character
Calculate the probability of that character going into that bucket.
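A minimal sketch of the training and collection steps as listed above, in which only the winning bucket is updated. Several details are assumptions made for illustration: "closest" is taken to mean Euclidean distance, the averaging rate `alpha` is arbitrary, and the probability is interpreted as P(character | bucket). The function names are hypothetical.

```python
import numpy as np
from collections import Counter, defaultdict

def closest_bucket(buckets, x):
    # "Closest" is assumed here to mean smallest Euclidean distance.
    return int(np.argmin(np.linalg.norm(buckets - x, axis=1)))

def train(inputs, n_buckets, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    buckets = rng.random((n_buckets, inputs.shape[1]))  # random initialization
    for x in inputs:
        b = closest_bucket(buckets, x)
        # Exponential average: move the winning bucket a fraction alpha toward x.
        buckets[b] = (1 - alpha) * buckets[b] + alpha * x
    return buckets

def collect(buckets, inputs, labels):
    counts = defaultdict(Counter)
    for x, label in zip(inputs, labels):
        counts[closest_bucket(buckets, x)][label] += 1
    # P(character | bucket), estimated from the labeled inputs sorted into each bucket.
    return {b: {ch: n / sum(c.values()) for ch, n in c.items()}
            for b, c in counts.items()}
```

To classify a new chunk under this sketch, one would find its closest bucket and read off the most probable character from the table returned by `collect`.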
K-Means
• Very similar to Self-Organizing Maps (SOMs)
• Can use the same classifying mechanism as used for SOM
Overlapping Segmentation
• Divide image into a fixed number of overlapping tiles of the same size
• In our case, 20 x 20 pixels with a 50% overlap
• Discard chunks under a certain size and chunks that are all white
Note: This is a B with part of it cut off, not an E. Therein lies the rub.
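The tiling described on this slide might look something like the sketch below. The 20-pixel tile and 50% overlap come from the slide; the ink-equals-1 convention, the `min_ink` parameter, and the reading of "under a certain size" as "smaller than a full tile" are assumptions.

```python
import numpy as np

def overlapping_tiles(image, tile=20, overlap=0.5, min_ink=1):
    """image: 2-D array with 1 for ink and 0 for white."""
    step = max(1, int(tile * (1 - overlap)))  # 20 px tiles, 50% overlap -> step of 10
    height, width = image.shape
    chunks = []
    for top in range(0, max(height - tile, 0) + 1, step):
        for left in range(0, max(width - tile, 0) + 1, step):
            chunk = image[top:top + tile, left:left + tile]
            # Skip undersized chunks and chunks with too little ink (all white when min_ink=1).
            if chunk.shape != (tile, tile) or chunk.sum() < min_ink:
                continue
            chunks.append(chunk)
    return chunks
```

Since tile boundaries ignore where characters actually start and end, some chunks will contain only part of a letter, which is the ambiguity the note above points out.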
Whitespace Segmentation
• Iterate through the image from left to right, segmenting when a full column of whitespace is encountered
• Works perfectly for well-spaced text
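As a rough illustration of whitespace segmentation, the sketch below scans columns left to right and cuts at every column with no ink. It assumes a binary image with 1 for ink and 0 for white; the function name is hypothetical.

```python
import numpy as np

def whitespace_segments(image):
    """image: 2-D array with 1 for ink and 0 for white; returns one sub-image per segment."""
    has_ink = image.any(axis=0)  # True for every column containing at least one ink pixel
    segments, start = [], None
    for col, ink in enumerate(has_ink):
        if ink and start is None:
            start = col  # a new character begins
        elif not ink and start is not None:
            segments.append(image[:, start:col])  # full white column: cut here
            start = None
    if start is not None:  # character touching the right edge
        segments.append(image[:, start:])
    return segments
```

Two characters that touch or overlap horizontally never produce a white column between them, so this approach returns them as a single segment.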
K-Means Segmentation
• Performs better than heuristic segmentation on closely-packed inputs
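The slide does not say which features were clustered, so the following is only one plausible sketch: it runs k-means on the x-coordinates of the ink pixels, with k set to the expected number of characters. The evenly spaced initial centers, the iteration count, and the function name are assumptions.

```python
import numpy as np

def kmeans_segments(image, k, iterations=20):
    """Group ink pixels into k clusters by x-coordinate; returns a list of (ys, xs) per cluster."""
    ys, xs = np.nonzero(image)
    if xs.size == 0:
        return []
    # Assumed initialization: centers spread evenly across the inked width.
    centers = np.linspace(xs.min(), xs.max(), k)
    labels = np.zeros(xs.size, dtype=int)
    for _ in range(iterations):
        # Assignment step: each pixel joins the nearest center.
        labels = np.argmin(np.abs(xs[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean x of its pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = xs[labels == j].mean()
    return [(ys[labels == j], xs[labels == j]) for j in range(k)]
```

Unlike a whitespace scan, this does not need an empty column between characters, which is one way to read the claim about closely-packed inputs.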
Segmentation Comparison
Figure panels: Even-width, K-Means, Whitespace
Experiment 1
Machine Learning Method: Self-Organizing Map
Topology: 200 buckets, initialized randomly
Inputs: 3-letter CAPTCHAs, random fonts, letters A-G, “chunked” using overlapping segmentation
Experiment 1 Results
Buckets fell into three primary categories:
Distinguishable letters
Chunks with halves of two letters
Indistinguishable noise
Experiment 2
ML Method: Neural Net
Topology: Fully connected, 400 inputs, 50 node hidden layer, 7 outputs
Inputs: Single-letter CAPTCHAs, random fonts, letters A-G
Network diagram: 400 Nodes, 50 Nodes, 7 Nodes
Contains … ?
A: 0 or 1
B: 0 or 1
C: 0 or 1
D: 0 or 1
E: 0 or 1
F: 0 or 1
G: 0 or 1
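To make the 400-50-7 topology concrete, here is a sketch of a forward pass through such a network. Only the layer sizes and the seven yes/no outputs come from the slides; the sigmoid activations, the random untrained weights, the 0.5 decision threshold, and the assumption that the 400 inputs are a flattened 20 x 20 image are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(n_in=400, n_hidden=50, n_out=7, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def forward(net, x):
    """x: 400 input values; returns 7 scores in (0, 1), one per letter A-G."""
    hidden = sigmoid(x @ net["W1"] + net["b1"])     # fully connected hidden layer
    return sigmoid(hidden @ net["W2"] + net["b2"])  # fully connected output layer

LETTERS = "ABCDEFG"
net = init_network()
scores = forward(net, np.zeros(400))
# Threshold each score to get the "A: 0 or 1", "B: 0 or 1", ... answers.
contains = {ch: int(s > 0.5) for ch, s in zip(LETTERS, scores)}
```

With untrained weights the answers are meaningless; the point is only the shape of the computation, with one independent yes/no output per letter.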
Experiment 2 Results
Figure: Neural Net Learning Curve
Figure: Neural Net Accuracy vs. Size of Hidden Layer
Past a certain number of nodes in the hidden layer, the topology ceases to have a large impact on accuracy.
Experiment 3
ML Method: Neural Net
Topology: Fully connected, 400 inputs, 1000 node hidden layer, 7 outputs
ML Method: SOM
Topology: 500 buckets
Inputs: 4-letter CAPTCHAs, random fonts, letters A-G
Experiment 3
Neural Net vs. SOM on CAPTCHAs Length 4, Letters A‐G
Experiment 4
ML Method: Neural Net
Topology: Fully connected, 400 inputs, 1000 node hidden layer, 7 outputs
ML Method: SOM
Topology: 500 buckets
Inputs: 4-letter CAPTCHAs, random fonts, letters A-Z
Experiment 4
Neural Net vs. SOM on CAPTCHAs Length 4, Letters A‐Z
Experiment 5
ML Method: Neural Net
Topology: Fully connected, 400 inputs, 1000 node hidden layer, 7 outputs
ML Method: SOM
Topology: 500 buckets
Inputs: 5-letter CAPTCHAs, random fonts, letters A-Z
Experiment 5
Neural Net vs. SOM on CAPTCHAs Length 5, Letters A-Z
What it all means
• Increasing the number of characters dramatically decreases total accuracy because segmentation quality decreases
• True positive rate goes down when segmentation quality decreases
• Hence, better segmentation is the key
Future Work
Improved Segmentation
o Wirescreen segmentation
o Ensemble techniques
Improved True Positive Rates with Current System
o Ensemble techniques
New problems
o Handwriting recognition
o Bot net of doom
Questions?