Captcha Synopsis


REVERSE ENGINEERING CAPTCHAs

ABSTRACT:

Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) are
commonly used to determine whether an end user is a human or an automated program. They are
primarily images consisting of noise along with content to be identified, which are presented to
the user, who is expected to identify the content. Humans can identify the content of these
images thanks to their visual capabilities, but automated systems generally cannot distinguish
the content from the noise, and hence CAPTCHAs guard against unwanted access by automated
systems. To recognize the content of such an image automatically, a CAPTCHA solver is needed.
The solver uses image processing techniques to separate the content from the noise. The entire
process executed by the solver is called reverse engineering the CAPTCHA. Since CAPTCHAs are of
different types, different solvers employing various techniques can be implemented using a
variety of image processing algorithms.

In this project we present the idea of one such CAPTCHA solver, which may be employed to
solve a generic set of text-based visual CAPTCHAs. The project proposes five steps to implement
it. Each step is proposed to use a combination of one or more computer graphics algorithms to
process its input and output a simpler version of the original image for use in subsequent
steps, ultimately outputting the content recognized from the image, which may be compared with
the input provided by the user to declare success or failure.

Keywords: CAPTCHA, ANN.

DEPT. OF COMPUTER ENGINEERING 1


INTRODUCTION:

Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) are a
measure for increasing the security of websites, which are now commonly used as a medium of
information and corporate exchange. They enhance the reliability of a website. They are mainly
images containing characters, which constitute the content, along with a lot of background
noise. They are dynamically generated and used for security purposes, and hence should be
unique: once a CAPTCHA image is generated, it should never be repeated. CAPTCHAs are mainly of
two types:

1. Visual CAPTCHAs

2. Audio CAPTCHAs

Visual CAPTCHAs are simply images as described above, while audio CAPTCHAs are similar tests in
which the characters are read out to the user. Audio CAPTCHAs are mainly designed to let blind
and visually impaired users access the same websites despite the use of CAPTCHAs.

CAPTCHAs differ in their type, their backgrounds, and the noise added to each of them. They
provide security in the sense that the system can tell human users apart from automated
systems, where the latter can be used to hack websites or retrieve personal information from
them.


Fig 1: An example of visual CAPTCHA for Facebook

Fig 2: An example of audio CAPTCHA for passport.


Thus, wherever CAPTCHAs are used, breaking a visual CAPTCHA requires a program that can
recognize the content and separate it from the noise. Such a program is known as a CAPTCHA
solver. As different types of CAPTCHAs are available, different solvers are built according to
these types. They may employ computer graphics algorithms or artificial neural networks (ANN)
for the solving process.

The solving process mainly consists of the following 5 stages:

1. Input

2. Preprocessing

3. Segmentation

4. Feature extraction

5. Pattern matching or classification

Whenever the user gives an input in response to a CAPTCHA, it is checked against the
characters recognized by the solver for the same CAPTCHA. If the two match, the input is
accepted; otherwise it is rejected.


MOTIVATION:

Networking and the use of the World Wide Web (WWW) have increased manyfold in the last few
years. The internet is now easily available at cheap rates and is used by all for various
purposes. Along with this, our way of life has also changed. E-business, along with electronic
transactions, has led us into an era of improved technology.

Despite improvements in technology, not all means of information exchange are safe. Owing to
this very technology, information is available at our fingertips, in abundance and with
considerable ease. Not all of this information is put to good and ethical use; it may be
misused by anyone.

Thus there is a need for security measures for websites and other software, so as to provide a
relatively safe environment in which to use the available resources. These security measures
should be hassle-free, so as to give the user a stress-free and uncomplicated environment to
work in. CAPTCHA is one such security measure.

The use of these different CAPTCHAs and their importance in today's networking led us to
research and study them. Although a lot of work has been researched and implemented in this
area, a lot still remains to be done. This gave us the idea to try to implement these hitherto
unimplemented areas of the field. The possibility of using artificial neural networks has also
given us a push in the right direction to work in this domain.


PROPOSED WORK:

In this project we propose to develop a solver for text-based visual CAPTCHAs. The solver is
intended to consist of five stages. These stages and their working can be summarized as
follows:

1. Input:

The solver takes input in the form of an image containing certain characters and/or

numbers along with digital noise. This image is taken in the form of a standard image file

which is then used as an input by the later stages of the solver.

2. Preprocessing:

This step directly follows the input step and is the first stage where actual processing is
done on the original input file. This stage mainly does the work of background noise removal,
rendering the image in binary format. This can be done stepwise, in which case a grayscale
image is created before the binary image. The background noise may be of different types, and
thus a different algorithm for removing each type of noise has to be implemented to make the
noise removal versatile.
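
The stepwise path described for this stage (colour image to grayscale to binary) might be
sketched as follows. This is only a minimal illustration, not the project's implementation. It
assumes the image is already loaded as rows of (r, g, b) tuples, uses the common luminance
weights 0.299, 0.587 and 0.114, and applies a fixed global threshold of 128; the function names
and all of these values are assumptions made for the example. It does not remove structured
noise such as lines or arcs, which would need dedicated algorithms.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def to_binary(gray_image, threshold=128):
    """Mark dark pixels (below the assumed threshold) as 1 (ink), the rest as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_image]


# Hypothetical 2x3 image: dark text pixels on a light, slightly uneven background.
image = [
    [(250, 250, 250), (10, 10, 10), (200, 210, 190)],
    [(20, 30, 25), (240, 240, 240), (180, 180, 180)],
]
binary = to_binary(to_grayscale(image))
print(binary)  # [[0, 1, 0], [1, 0, 0]]
```

A fixed threshold is the simplest choice; an adaptive threshold would be a natural substitute
when the background brightness varies across the image.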

3. Segmentation:

This forms the second step in processing the input image and bringing it one step closer

to the output. This stage segments the input image into a number of glyphs such that each

one of them represents a single character or number. These glyphs can be obtained using

a range of segmentation algorithms readily available, according to the type of input.
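
One simple way to obtain such glyphs is a vertical projection: count the ink pixels in every
column and cut the image wherever a column is empty. The sketch below illustrates that idea
only. It assumes a non-empty binary image (rows of 0/1 values, 1 meaning ink) whose characters
do not touch or overlap; `segment_columns` is a hypothetical name, and touching characters
would need a more elaborate algorithm.

```python
def segment_columns(binary_image):
    """Split a binary image into glyphs at columns that contain no ink."""
    width = len(binary_image[0])
    # Vertical projection: number of ink pixels in each column.
    projection = [sum(row[x] for row in binary_image) for x in range(width)]

    glyphs, start = [], None
    # The trailing 0 is a sentinel that closes a glyph touching the right edge.
    for x, count in enumerate(projection + [0]):
        if count > 0 and start is None:
            start = x  # first inked column of a new glyph
        elif count == 0 and start is not None:
            glyphs.append([row[start:x] for row in binary_image])
            start = None
    return glyphs


# Hypothetical 2x5 image holding two glyphs separated by two empty columns.
image = [
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
glyphs = segment_columns(image)
print(len(glyphs))  # 2
```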

4. Feature extraction:

This stage strives to bring about uniformity in the storage space used for each character in a
given language. To attain this, thinning algorithms are applied so as to keep a glyph's
information within the minimum possible amount of storage memory. It may also employ other
techniques such as skeletonization and scaling to bring about uniformity of storage. It also
computes a probability value for the glyph being a certain character.
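
Of the techniques named for this stage, scaling is the easiest to illustrate. The sketch below
crops a glyph to its ink and resamples it to a fixed grid, so that every glyph yields a feature
vector of the same length. It is a hedged example only: the function names, the
nearest-neighbour resampling and the default grid size of 8 are assumptions, and thinning,
skeletonization and the probability computation are not shown.

```python
def crop_to_ink(glyph):
    """Trim empty rows and columns around the ink pixels of a binary glyph."""
    rows = [y for y, row in enumerate(glyph) if any(row)]
    cols = [x for x in range(len(glyph[0])) if any(row[x] for row in glyph)]
    if not rows:
        return glyph  # blank glyph: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in glyph[rows[0]:rows[-1] + 1]]


def scale_nearest(glyph, size=8):
    """Resample a glyph to size x size using nearest-neighbour sampling."""
    height, width = len(glyph), len(glyph[0])
    return [
        [glyph[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def extract_features(glyph, size=8):
    """Return a flat, fixed-length 0/1 feature vector for one glyph."""
    scaled = scale_nearest(crop_to_ink(glyph), size)
    return [pixel for row in scaled for pixel in row]


# Hypothetical 3x4 glyph with an empty border on the right and bottom.
glyph = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
features = extract_features(glyph, size=4)
print(len(features))  # 16
```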

5. Classification/Pattern matching:

This is the last working stage of the proposed system. This stage takes as input the
probability acquired in the above stage for each glyph created in the segmentation stage. The
probability of each glyph is separately compared against each of the letters stored in the
database, and the letter with the highest probability is chosen as the recognized letter.
Alternatively, thresholding can be used to decide upon the cut-off limit for pattern matching.
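
The selection rule described for this stage (take the best-scoring stored letter, optionally
rejecting it below a cut-off) could look roughly like the sketch below. It is an illustration
under stated assumptions: the score is a plain pixel-agreement ratio rather than a true
probability, the `templates` dictionary stands in for the database of stored letters, and the
cut-off of 0.75 is an arbitrary example value.

```python
def similarity(features, template):
    """Fraction of positions where two equal-length 0/1 vectors agree."""
    matches = sum(1 for a, b in zip(features, template) if a == b)
    return matches / len(template)


def classify(features, templates, cutoff=0.75):
    """Return (letter, score) for the best template, or (None, score) below cutoff."""
    best_letter, best_score = None, 0.0
    for letter, template in templates.items():
        score = similarity(features, template)
        if score > best_score:
            best_letter, best_score = letter, score
    if best_score < cutoff:
        return None, best_score  # no stored letter matches well enough
    return best_letter, best_score


# Hypothetical 2x2 templates, flattened row by row.
templates = {"I": [1, 0, 1, 0], "L": [1, 0, 1, 1]}

print(classify([1, 0, 1, 1], templates))  # ('L', 1.0)
print(classify([0, 1, 0, 0], templates))  # (None, 0.25)
```

Returning `None` below the cut-off lets the caller treat a poorly matched glyph as
unrecognized instead of forcing a guess.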

SCHEDULE OF WORK:


Phase                              Start Date   End Date   Time in Hours   Remarks

Analysis
  Need Analysis                    21/06/10     30/06/10   40
  Feasibility study                01/07/10     07/07/10   22
  Scope determination              07/07/10     10/07/10   08
  Literature survey                11/07/10     24/07/10   32
  Scripting determination          25/07/10     26/07/10   05
  Documentation                    26/07/10     30/07/10   12

Design
  Functional requirements          01/08/10     02/08/10   08
  Database design                  03/07/10     24/08/10   80
  Detail Design                    26/08/10     23/09/10   80
  Review                           24/09/10     26/09/10   16

Project Management
  Task Tracking                    27/09/10     02/10/10   40
  Status Reporting                 03/10/10     07/10/10   24
  Change and Scope mgmt            08/10/10     14/10/10   40

Development
  Module coding                    15/12/10     25/01/11   160
  Unit testing                     01/02/11     22/02/11   60
  Test Data                        23/02/11     28/02/11   40
  Integration with other module    02/03/11     30/03/11   80

Testing
  Black Box Integration Testing    01/04/11     06/04/11   60

Presentation
  Presentation to internals        07/04/11     10/04/11   16

Documentation
  Project documentation            11/04/11     27/04/11   30
