SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS (pages.cs.wisc.edu/~paulgc/Thesis.pdf)
SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN
WEB APPLICATIONS

by

PAUL SUGANTHAN G C 20084053
MUTHUKUMAR V 20084041
NANDHAKUMAR B 20084043
A project report submitted to the
FACULTY OF INFORMATION AND
COMMUNICATION ENGINEERING
in partial fulfillment of the requirements
for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY CHENNAI
CHENNAI - 600025
MAY 2012
CERTIFICATE
Certified that this project report titled “SEARCH ENGINE ENHANCEMENT BY
EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS” is the
bonafide work of PAUL SUGANTHAN G C (20084053), MUTHUKUMAR V
(20084041) and NANDHAKUMAR B (20084043), who carried out the project
work under my supervision, for the fulfillment of the requirements for the award
of the degree of Bachelor of Engineering in Computer Science and Engineering.
Certified further that, to the best of my knowledge, the work reported herein does
not form part of any other thesis or dissertation on the basis of which a degree or
an award was conferred on an earlier occasion on this or any other candidate.
Place: Chennai
Date:

Dr. V Vetriselvi
Project Guide,
Designation,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai - 600025
COUNTERSIGNED
Head of the Department,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai – 600025
ACKNOWLEDGEMENTS
We express our deep gratitude to our guide, Dr. V VETRISELVI for guiding us
through every phase of the project. We appreciate her thoroughness, tolerance and
ability to share her knowledge with us. We thank her for being easily approachable
and quite thoughtful. Apart from adding her own input, she has encouraged us to
think on our own and give form to our thoughts. We owe her for harnessing our
potential and bringing out the best in us. Without her immense support through
every step of the way, we could never have made it to this extent.
We are extremely grateful to Dr. K.S. EASWARAKUMAR, Head of the
Department of Computer Science and Engineering, Anna University, Chennai
600025, for extending the facilities of the Department towards our project and for
his unstinting support.
We express our thanks to the panel of reviewers, Dr. ARUL SIROMONEY,
Dr. A.P. SHANTHI and Dr. MADHAN KARKY, for their
valuable suggestions and critical reviews throughout the course of our project.
We thank our parents, family, and friends for bearing with us throughout the course
of our project and for the opportunity they provided us in undergoing this course
in such a prestigious institution.
Paul Suganthan G C Muthukumar V Nandhakumar B
ABSTRACT
Current search engines such as Google and Yahoo! are widely used for searching
the Web. Search over dynamic client-side Web pages is, however, either
nonexistent or far from perfect, and not addressed by existing work, for example
on the Deep Web.
This is a real impediment since AJAX and Rich Internet Applications are already
very common in the Web. AJAX applications are composed of states which can
be seen by the user, but not by the search engine, and changed by the user using
client-side events. Current search engines either ignore AJAX applications or
produce false negatives. The reason is that crawling client-side code is a difficult
problem that cannot be solved naively by invoking user events.
The project aims to propose a solution for crawling and extracting hidden AJAX
content, thus enabling search engines to enhance their search result quality by
indexing dynamic AJAX content. Though AJAX content can be crawled by
manually invoking client-side events in a browser, enhancing a search engine to
crawl AJAX content automatically, as it crawls traditional web applications, has
not been achieved.
This report describes the design and implementation of an AJAX Crawler and the
process of enabling a search engine to index the crawled states of an AJAX page.
The performance of the AJAX Crawler is evaluated and compared with a
traditional crawler. Possible issues in crawling AJAX content and future
optimizations are also analysed.
ABSTRACT (TAMIL)
Existing search engines do not search web pages on the Internet whose text
changes frequently, so much of the text on the Web goes unnoticed by users. The
aim of this project is to make such hidden text visible to search engines; even
leading search engines such as Google and Yahoo! overlook a great deal of this
content. Through this project, hidden text on the Web can be discovered by search
engines, reducing the amount of content that remains hidden and increasing the
effectiveness of search engines.
Contents
CERTIFICATE i
ACKNOWLEDGEMENTS ii
ABSTRACT(ENGLISH) iii
ABSTRACT(TAMIL) iv
LIST OF FIGURES viii
LIST OF TABLES ix
LIST OF ABBREVIATIONS x
1 INTRODUCTION 1
  1.1 AJAX 1
  1.2 Crawler 2
  1.3 Problem Definition 3
  1.4 Scope of the Project 4
  1.5 Organisation of this Report 4

2 RELATED WORK 5
  2.1 Crawling AJAX 5
  2.2 Finite State Machine 8
  2.3 Google's AJAX Crawling Scheme 8

3 REQUIREMENTS ANALYSIS 11
  3.1 Functional Requirements 11
  3.2 Non-Functional Requirements 12
    3.2.1 User Interface 12
    3.2.2 Hardware Considerations 12
    3.2.3 Performance Characteristics 12
    3.2.4 Security Issues 13
    3.2.5 Safety Issues 13
  3.3 Constraints 13
  3.4 Assumptions 14

4 SYSTEM DESIGN 15
  4.1 System Architecture 15
    4.1.1 Architecture Diagram 15
  4.2 Module Descriptions 17
    4.2.1 Identification of Clickables 17
    4.2.2 Event Invocation 19
    4.2.3 State Machine representation of AJAX website 19
      4.2.3.1 Visualizing the State Machine 21
    4.2.4 Indexing 22
    4.2.5 Searching 22
    4.2.6 Reconstruction of state 22
  4.3 User Interface Design 22
  4.4 UseCase Model 23
    4.4.1 UseCase Diagram 23
  4.5 System Sequence Diagram 24
    4.5.1 Event Invocation 24
    4.5.2 Searching 24
  4.6 Data Flow Model 25
    4.6.1 Data Flow Diagram 25

5 SYSTEM DEVELOPMENT 28
  5.1 Implementation 28
    5.1.1 Tools Used 28
    5.1.2 Implementation Description 28
      5.1.2.1 Ajax Crawling Algorithm 29
      5.1.2.2 State Machine 32
      5.1.2.3 Indexing 34
      5.1.2.4 Searching 35
      5.1.2.5 Reconstruction of a particular state after crawling 36

6 RESULTS AND DISCUSSION 37
  6.1 Results 37
  6.2 Performance Evaluation 39
    6.2.1 Crawling Time 39
      6.2.1.1 Number of States Vs Crawling Time 40
    6.2.2 Clickable Selection Policy 41
      6.2.2.1 Number of AJAX Requests Vs Probable Clickables 42
      6.2.2.2 Probable Clickables Vs Detected Clickables 43
    6.2.3 Clickable Selection Ratio Vs Crawling Time 44
  6.3 Search Result Quality 45
  6.4 Observations 49

7 CONCLUSIONS 50
  7.1 Contributions 50
  7.2 Future Work 50

A Snapshots 52
  A.1 Search Interface 52
  A.2 Google Bot and AJAX Crawler 54

B DOM 58
  B.1 DOM - Document Object Model 58
  B.2 DOM Tree Representation 58

References 60
List of Figures
1.1 Crawler Architecture 3

2.1 AJAX Crawling Scheme 9
2.2 Control Flow 10

4.1 Architecture Diagram 16
4.2 Visualizing State Machine 21
4.3 UseCase Diagram 23
4.4 Sequence Diagram - Event Invocation 24
4.5 Sequence Diagram - Searching 25
4.6 Level 0 Data Flow Diagram 25
4.7 Level 1 Data Flow Diagram 26
4.8 Level 1 Data Flow Diagram 27

6.1 Number of States Vs Crawling Time (in minutes) 40
6.2 Number of AJAX Requests Vs Probable Clickables 42
6.3 Probable Clickables Vs Detected Clickables 43
6.4 Clickable Selection Ratio Vs Crawling Time per state (in minutes) 44

A.1 Interface I 52
A.2 Interface II 53
A.3 Fetched By Google Bot 54
A.4 Fetched By Google Bot 55
A.5 Fetched By AJAX Crawler 56
A.6 Fetched By AJAX Crawler 57

B.1 DOM Tree 59
List of Tables
5.1 Tools Used 28

6.1 Test Cases 37
6.2 Experimental Results 38
6.3 Crawling Time 39
6.4 Clickable Selection Policy 41
LIST OF ABBREVIATIONS
Acronym  Expansion
AJAX     Asynchronous JavaScript and XML
CSS      Cascading Style Sheets
DOM      Document Object Model
HTML     HyperText Markup Language
JS       JavaScript
JUNG     Java Universal Network/Graph Framework
URL      Uniform Resource Locator
XML      Extensible Markup Language
CHAPTER 1
INTRODUCTION
Web applications are increasingly replacing desktop applications. In this chapter
we introduce the techniques that support this change, and we give an outline of
this report. The first section presents AJAX, the major new technique and
architectural change for web applications of the past years. Section 1.2 explains
the operation of a crawler. Section 1.3 presents the research problems of this
report. Section 1.4 presents the scope of the project. Section 1.5 discusses the
organisation of this report.
1.1 AJAX
AJAX is an acronym for Asynchronous JavaScript and XML. AJAX is a
technique whereby a website can update part of a page without refreshing the
whole content. This saves bandwidth and provides for a more interactive user
experience. In other words, changes that a user makes appear more quickly on the
screen, and the website seems to respond much faster. This improved
responsiveness increases the interactivity of websites and makes the user
experience much more enjoyable. It should be noted that AJAX is not a
technology in its own right; rather, it is a technique that utilizes other
technologies. AJAX is considered one of the core techniques behind Web 2.0
applications.
AJAX is a clever combination of using the client-side JavaScript engine [11] to
update small parts of the Document Object Model (DOM) with information
retrieved by asynchronous server communication. By using AJAX technology
developers can create applications in which the page does not have to be
re-rendered again every time an interaction has taken place; only small sub-sets of
the page need to get updated.
A common problem with AJAX applications is that they break the web browser's
Back button. In a normal non-AJAX application, every webpage has a unique
URL, so a user can hit the Back button to return to the previous URL, which is
the state the browser was in before the user's last action. This can be seen as a
sort of Undo operation. With AJAX, however, the URL of the webpage does not
change every time the state of the web application changes. Therefore a press of
the Back button may bring the user to a state much further back than he intended.
Also, page bookmarking depends upon the URL of the page in question, so pages
created by AJAX will not be bookmarkable.
1.2 Crawler
A Web crawler [7] is a computer program that browses the World Wide Web in a
methodical, automated manner or in an orderly fashion. Web crawlers are mainly
used to create a copy of all the visited pages for later processing by a search
engine that will index the downloaded pages to provide fast searches. Figure 1.1
depicts high-level architecture of a standard Web Crawler.
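The fetch-and-enqueue cycle behind Figure 1.1 can be sketched as a minimal crawl loop. This is an illustration only, not the project's crawler: the fetch function is an injected stand-in for the downloader and link extractor, and a real crawler would add HTTP fetching, politeness delays and URL normalisation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Minimal sketch of a crawl loop: a frontier queue, a visited set, and a
// pluggable fetch function that maps a URL to the out-links found on that page.
public class CrawlLoop {
    public static List<String> crawl(String seed, Function<String, List<String>> fetch) {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);                        // page would be stored for indexing here
            for (String link : fetch.apply(url))   // extracted out-links
                if (visited.add(link))             // schedule each URL at most once
                    frontier.add(link);
        }
        return order;
    }
}
```

Driving the loop with a fake link graph shows the breadth-first visiting order without any network access.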
FIGURE 1.1: Crawler Architecture
1.3 Problem Definition
With the advent of Web 2.0 , AJAX is being used widely to enhance interactivity
and user experience. Also standalone AJAX applications are also being
developed. For eg Google Maps, Gmail and Yahoo! Mail are classic examples of
AJAX applications. Current crawlers ignore AJAX content as well as dynamic
content added through client side script. Thus most of the dynamic content is still
hidden. We have considered two problems in our project.
1. Crawling AJAX Content in websites
2. Making the crawled AJAX Content indexable and searchable
1.4 Scope of the Project
The project enables hidden dynamic content to be visible to search engines. Thus
the hidden web can be explored to a great extent. The project describes the design
and development of an AJAX Crawler and building an AJAX Search Engine to
search through the crawled states. Finally we evaluate the performance of the
AJAX Crawler. However, the scope of the project is limited by the fact that
crawling AJAX content is time consuming, since, unlike traditional crawling, it
requires executing JavaScript.
1.5 Organisation of this Report
This report is organized as follows. Chapter 2 discusses the related work done in
this area. Chapter 3 describes the requirement analysis of the system. Chapter 4
elaborates on the design of the system. Chapter 5 details about the development
of the system. Chapter 6 describes the results obtained from our system and also
provides an analysis of the results. Finally, Chapter 7 summarizes the work we
have completed and presents pointers for future work.
CHAPTER 2
RELATED WORK
2.1 Crawling AJAX
Ajax (Asynchronous JavaScript and XML) is one of the most promising and
fastest-rising techniques in the web application development area of the past few
years. Collapsing the traditional multi-page application into a single page
increases the responsiveness and interactivity of the user experience. Users do not
have to click-and-wait any more, and the page does not have to be re-rendered
every time an interaction takes place; only small sub-sets of the page need to be
updated [14]. With these new dynamic applications a new term has emerged,
Web 2.0, which is used to mark the changes in web applications in facilitating
communication, information sharing, interoperability, and collaboration. The
term Web 2.0 is used to denote Ajax applications but is also, and more
commonly, used to denote user-generated content.
The addition of the responsiveness brought by the AJAX technique makes it
possible to operate applications on a web server and inside a browser as if they
are desktop applications. Currently the web application market is becoming
increasingly dominant and there are operating systems designed around them
such as the Chrome OS from Google and the WebOS from Palm. This shows the
importance of web applications as a replacement of ordinary applications.
AJAX is a clever combination of using the client-side JavaScript engine [11] to
update small parts of the Document Object Model (DOM) with information
retrieved by asynchronous server communication. By using AJAX technology
developers can create applications in which the page does not have to be
re-rendered again every time an interaction has taken place; only small sub-sets of
the page need to get updated. Therefore the users experience a very fast
responsive application inside the web-browser. The application is available
everywhere the user connects to the internet, and is accessible with every browser.
This eliminates the main disadvantages of having to install a full blown desktop
application on a computer with a certain amount of computational capacity and
the troubles of sharing files with people or other locations. This makes the use of
cloud computing interesting. Cloud computing is the term used to describe the
trend in the computing world in moving away from desktop applications to
on-line services. Although web applications are not a new phenomenon, the use
of AJAX techniques is. These new techniques also require a good quality of
service, of which testing is an important aspect.
The new AJAX technology does not have the property that a unique URL
represents a unique state in an application. Due to the lack of an externally
reachable unique state, i.e., a state reachable by URL, crawlers are not able to
access the full content of an AJAX application without the use of a
pre-programmed JavaScript engine [7]. This problem of not having a state
reachable by URL occurs both when crawling and when testing an AJAX
application [9].
The first major work in crawling AJAX was done by Duda et al. [10], who
suggested a way to crawl the dynamic comments pages on YouTube. They
modelled an AJAX website as a state machine and developed the first AJAX
crawling algorithm, which this project uses as a base. They also indexed the
YouTube dynamic comments pages and made them searchable.
To circumvent this problem, Mesbah et al. [15] proposed to crawl AJAX
applications by inferring user interface state changes. Their technique centres on
a state machine which stores the actions a user executes on a web page inside a
real browser, starting from the root (the index state) and following traces down to
the final state of a certain path. These states are discovered by searching the
current DOM tree for elements on which events can be fired, for example
onClick, onMouseOver or onMouseOut; firing the events on all the candidate
elements may result in new states. The result of an event, the DOM tree, is
compared with the DOM tree from before the execution. If the DOM tree has
changed, a new state of the application is added to the state machine, linked to its
predecessor state. The edge between the two states represents the element and
event combination that produces the new state from the previous one. By storing
the combination of an element and an event, the crawler is able to repeat the flow
of actions that results in a given state. Using this information it is possible to
bring an AJAX application to a given state, and this makes an AJAX application
state aware by adding an external indexing shell.
2.2 Finite State Machine
Finite state machines are used to describe the behaviour of a system by recording
the transitions from one state to another state. This method is mostly used in
verifying software systems or software protocols [5].
The state machine used inside a crawler is not a fully specified state machine, but
an incompletely specified one. A completely specified state machine is one where
every transition results in a unique new state [16]. When examining an Ajax
application it is possible to have multiple transitions resulting in the same state,
e.g., two different links can lead to the same page. This observation leads to the
fact that the state machine used is an incompletely specified state machine. The
minimal version of a completely specified state machine can be found in
polynomial time [13]. For incompletely specified state machines, finding the
minimal state machine is proved to be NP-complete [12]. This means that no
algorithm is known that minimises an incompletely specified state machine in
polynomial time.
2.3 Google’s AJAX Crawling Scheme
Google proposed its own scheme for crawling AJAX [1]. AJAX websites which
conform to this scheme will be crawled by the Google Bot. Google's AJAX
crawling scheme proposes to mark the addresses of all pages that load AJAX
content with specific characters. The idea is to use a special hash fragment (#!) in
the URLs of those pages to indicate that they load AJAX content. When Google
finds a link that points to an AJAX URL, for example
http://example.com/page?query#!state, it automatically interprets (escapes) it as
http://example.com/page?query&_escaped_fragment_=state.
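The mapping between the pretty and the escaped URL can be sketched as a small helper. This is an illustrative sketch of the scheme's URL rewriting, not Google's implementation; percent-encoding of special characters inside the fragment is omitted for brevity.

```java
// Sketch of the URL rewriting in Google's AJAX crawling scheme: a pretty
// URL containing a #! fragment is mapped to the "ugly" URL that the
// crawler actually requests.
public class EscapedFragment {
    public static String toCrawlerUrl(String prettyUrl) {
        int bang = prettyUrl.indexOf("#!");
        if (bang < 0) return prettyUrl;              // not an AJAX URL
        String base = prettyUrl.substring(0, bang);
        String state = prettyUrl.substring(bang + 2);
        // use & if the URL already has a query string, ? otherwise
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + state;
    }
}
```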
FIGURE 2.1: AJAX Crawling Scheme
The programmer is forced to change his/her website architecture in order to
handle the above requests: when Google sends a web request for the escaped
URL, the server must be able to return the same HTML code as the one that is
presented to the user when the AJAX function is called.
After Google sees the AJAX URL and interprets (escapes) it, it grabs the content
of the page and indexes it. Finally, when the indexed page is presented in the
search results, Google shows the original AJAX URL to the user instead of the
escaped one. As a result, the programmer should be able to handle the user's
request and present the appropriate content when the page loads.
FIGURE 2.2: Control Flow
The implementation of Google's AJAX crawling scheme imposes some
constraints on developers. Moreover, a site with only a small amount of AJAX
content may not be worth converting to this scheme just for crawling. This thesis
therefore proposes a way to crawl AJAX sites by constructing the state machine
of the site, which does not impose any constraints on developers. We view every
site as an AJAX site and start crawling by invoking JavaScript events. If there is
any change in the DOM, we record it in the state machine. Once the state machine
of a URL is generated, we index the states to enable searching.
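The crawl step just described, invoke an event, diff the DOM, and record a state only when something changed, can be sketched as follows. The DOM is abstracted to a plain string and the state machine to maps; both are simplifying assumptions made for illustration, not the project's data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy state machine that records a new edge whenever an invoked event
// changes the DOM. DOMs that are textually identical map to the same
// state id, so duplicate states are never added twice.
public class StateRecorder {
    public final Map<String, Integer> stateIds = new HashMap<>(); // DOM -> state id
    public final List<int[]> edges = new ArrayList<>();           // {source, target}
    public final List<String> edgeEvents = new ArrayList<>();     // label per edge

    public int stateOf(String dom) {
        return stateIds.computeIfAbsent(dom, d -> stateIds.size());
    }

    // Returns true if the event produced a state transition.
    public boolean record(String domBefore, String domAfter, String event) {
        if (domBefore.equals(domAfter)) return false; // no DOM change: nothing recorded
        int src = stateOf(domBefore);
        int dst = stateOf(domAfter);
        edges.add(new int[]{src, dst});
        edgeEvents.add(event);
        return true;
    }
}
```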
CHAPTER 3
REQUIREMENTS ANALYSIS
In this chapter, we provide an overview of the requirements and the functionalities
of the system.
3.1 Functional Requirements
The project aims to crawl AJAX content in web applications and make it
searchable. The abstract modular view of the project is given by the following
steps.
1. Identification of Clickables
2. Invocation of events
3. Representing AJAX website as State Machine
4. Indexing the crawled states
5. Searching through the indexed content
6. Reconstruction of a particular state in browser
3.2 Non-Functional Requirements
3.2.1 User Interface
User Interface is provided for searching. The User Interface is developed using
HTML and PHP. The user enters the query in a text box and performs the search.
For the browser driven UI, any standard web browser like Mozilla Firefox or
Internet Explorer is required.
3.2.2 Hardware Considerations
The project requires a computer with the Windows operating system. The system
used in our experiments has a 320 GB hard disk drive, 2 GB RAM and a 1.2 GHz
processor.
3.2.3 Performance Characteristics
As performance is an important parameter of this project, there are a number of
performance considerations. AJAX crawling is compared with traditional
crawling. The factors are:
1. Crawling Time
2. Search Result Quality
3. Clickable Selection Policy
3.2.4 Security Issues
As the project is fully software based, there are no security issues concerning this
project.
3.2.5 Safety Issues
There are no particular safety issues concerning this project.
3.3 Constraints
• Javascript execution
A crawler capable of crawling AJAX requires the capability to execute
Javascript.
• Duplicate State
Multiple events may lead to the same state. Thus we need to avoid adding
duplicate states.
• Infinite State Change
If the same events can be invoked indefinitely on the same state, the
application model can explode.
• Numerous ways of adding event handlers
A JavaScript event can be added to a particular HTML element in many
ways. Thus events assigned in all of these ways should be handled
properly.
3.4 Assumptions
• No Forms
The AJAX Crawler doesn't handle forms, because handling forms is
complex: it requires appropriate test data for submission, it is not possible
to submit a form when a captcha is present, and a form may consist of
different types of input elements such as checkboxes, select boxes and
radio buttons. Deep web crawling by handling forms is itself a separate
research problem.
• Limiting the number of states
The Crawler limits the number of states to prevent state explosion.
• Only Click Event
The Crawler invokes only the click event on HTML elements during
crawling. The elements which can be clicked are termed clickables.
• Only Text based retrieval
The Crawler handles only text based changes. Image based changes like in
Google maps are not considered.
CHAPTER 4
SYSTEM DESIGN
In this chapter, we describe the design issues considered in the software
development process.
4.1 System Architecture
4.1.1 Architecture Diagram
Figure 4.1 depicts the architecture diagram of the entire system. The set of
modules, along with the control flow between them is depicted.
FIGURE 4.1: Architecture Diagram
4.2 Module Descriptions
4.2.1 Identification of Clickables
Identification of clickables is the first phase of an Ajax Crawler. It involves
identifying clickables that would modify the current DOM. The main issue is that
a click event may be added to an HTML element in many ways. Several ways to
add an event listener are shown below:
• <div id="test" onclick="test_function();">
• test.onclick = test_function;
• test.addEventListener('click', test_function, false);
• Using the jQuery JavaScript library,
$('#test').click(function()
{
test_function();
});
All the above four methods perform the same function of adding the onclick
event on the element test.
Thus clickables cannot be identified in a standard way, because numerous
JavaScript libraries exist and each has its own way of defining event handlers. So
the approach of clicking all potentially clickable elements is followed. The list
of clickable HTML elements is shown below.
<a>, <address>, <area>, <b>, <bdo>, <big>, <blockquote>, <body>,
<button>, <caption>, <cite>, <code>, <dd>, <dfn>, <div>, <dl>, <dt>,
<em>, <fieldset>, <form>, <h1> to <h6>, <hr>, <i>, <img>, <input>,
<kbd>, <label>, <legend>, <li>, <map>, <object>, <ol>, <p>, <pre>,
<samp>, <select>, <small>, <span>, <strong>, <sub>, <sup>, <table>,
<tbody>, <td>, <textarea>, <tfoot>, <th>, <thead>, <tr>, <tt>, <ul>,
<var>
Though this approach is time consuming and can cause sub-elements to be
clicked repeatedly, it has the advantage that all possible states are reached.
XPath is used to retrieve the clickable elements. The XPath expression to retrieve
all clickable elements in the document is shown below.
//a | //address | //area | //b | //bdo | //big | //blockquote | //body | //button |
//caption | //cite | //code | //dd | //dfn | //div | //dl | //dt | //em | //fieldset |
//form | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //hr | //i | //img | //input |
//kbd | //label | //legend | //li | //map | //object | //ol | //p | //pre | //samp |
//select | //small | //span | //strong | //sub | //sup | //table | //tbody | //td |
//textarea | //tfoot | //th | //thead | //tr | //tt | //ul | //var
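Rather than hard-coding the query string, the union expression can be generated from the tag list. The helper below is a hypothetical illustration, not code from the project:

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds an XPath union expression ("//a | //div | ...") from a list of
// clickable tag names, so the tag list stays the single source of truth.
public class ClickableXPath {
    public static String unionExpression(List<String> tags) {
        return tags.stream()
                   .map(t -> "//" + t)            // prefix each tag with //
                   .collect(Collectors.joining(" | "));
    }
}
```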
4.2.2 Event Invocation
The HtmlUnit [2] library provides the ability to invoke any event on an HTML
element. The event is invoked on every element retrieved by the XPath
expression. After an event is invoked on an element, we need to wait for
background JavaScript execution to finish.
4.2.3 State Machine representation of AJAX website
An Ajax website can be represented by a State machine. Thus the navigation
model of an Ajax driven website can be visualized as a State Machine. The state
machine can be viewed as a Directed Multigraph. JUNG (Java Universal
Network/Graph Framework) [3] is used for building the state machine.
Nodes represent application states.
Edges represent transitions between states.
In each node of a state machine, the DOM of the corresponding state is stored. In
each edge, we store the event type and XPath expression of the element on which
the event has to be invoked.
The State machine is represented as a Directed Multigraph. The State machine is
stored in graphML format.
<?xml version="1.0" encoding="UTF-8"?>
<graphml
    xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
  <key id="event" for="edge"><desc>Event type</desc></key>
  <key id="target" for="edge"><desc>Event generating element</desc></key>
  <graph edgedefault="directed">
    <node id="0"/><node id="1"/><node id="2"/><node id="3"/>
    <edge source="0" target="1">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[3]/div[1]</data>
    </edge>
    <edge source="0" target="2">
      <data key="event">onclick</data>
      <data key="target">/html/body/div</data>
    </edge>
    <edge source="0" target="3">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[1]/div/div[4]</data>
    </edge>
    <edge source="1" target="2">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[1]/div/div[8]/strong</data>
    </edge>
  </graph>
</graphml>
From the above graphML file, the following inferences can be derived:
• Number of nodes (application states) = 4
• The application state changes from source to target on clicking the element
identified by the XPath expression stored in the key named target. Thus, from the
graphML file, the path from one state to another can be obtained.
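These inferences can also be derived programmatically. The sketch below is an illustration using the JDK's built-in DOM parser rather than the project's code; it counts the states and transitions in a trimmed-down graphML document of the same shape as the one above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Parses a graphML string with the JDK DOM parser and counts the
// <node> and <edge> elements, i.e. the application states and transitions.
public class GraphMLInfo {
    public static int[] countNodesAndEdges(String graphml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(graphml.getBytes(StandardCharsets.UTF_8)));
            return new int[]{doc.getElementsByTagName("node").getLength(),
                             doc.getElementsByTagName("edge").getLength()};
        } catch (Exception e) {
            throw new RuntimeException("failed to parse graphML", e);
        }
    }
}
```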
4.2.3.1 Visualizing the State Machine
The state machine of the sample site at http://test.thurls.com/ajax/home.php can be
visualized as shown in Figure 4.2. We can infer that there are 8 states in total.
FIGURE 4.2: Visualizing State Machine
4.2.4 Indexing
While the state machine is being constructed by the AJAX Crawler, the crawled
states have to be indexed simultaneously to enable searching through them. The
project uses the Lucene open-source API for indexing.
4.2.5 Searching
Searching involves taking an input query and returning suitable results by
reading the index files. The Lucene Search API is used to perform the search
operation and return the results.
4.2.6 Reconstruction of a State
Once the user searches for a query and the results are displayed, we need to
navigate directly to a particular state and display it in the browser when the
user views a result. We use the Selenium Web Driver to navigate to a particular
state by finding the path between the initial state and the target state in the
State Machine, and then invoking the events along that path.
4.3 User Interface Design
A user interface is provided for users to perform searches. The user enters the
search query in a text box and performs the search. Snapshots of the user
interface are provided in Appendix A.1.
4.4 UseCase Model
4.4.1 UseCase Diagram
Figure 4.3 denotes the control flow pattern of our algorithm, also showing how
the various software and hardware components are involved in each section of
the flow. The actors represent these components and the use cases represent the
functionality.
FIGURE 4.3: UseCase Diagram
4.5 System Sequence Diagram
4.5.1 Event Invocation
Figure 4.4 shows the Sequence Diagram for Event Invocation. It shows the
sequence of events involved in invoking events and updating the DOM.
FIGURE 4.4: Sequence Diagram - Event Invocation
4.5.2 Searching
Figure 4.5 shows the Sequence Diagram for Searching the crawled states. It shows
the sequence of events involved in searching and reconstruction of result state.
FIGURE 4.5: Sequence Diagram - Searching
4.6 Data Flow Model
Figure 4.6 shows the Level 0 DFD of the system. The Level 1 DFDs are shown in
Figure 4.7 and Figure 4.8.
4.6.1 Data Flow Diagram
FIGURE 4.6: Level 0 Data Flow Diagram
FIGURE 4.7: Level 1 Data Flow Diagram
FIGURE 4.8: Level 1 Data Flow Diagram
CHAPTER 5
SYSTEM DEVELOPMENT
5.1 Implementation
5.1.1 Tools Used
The following tools were employed to implement the project.
Operating System: Windows 7
Languages used for development: Java, PHP, JSP
Libraries: HtmlUnit, JSoup, Lucene, JUNG, Selenium Web Driver
Database: MySQL
IDE: NetBeans
User Interface: Mozilla Firefox
Performance Visualization (Graphs): Google Charts, PowerPoint
TABLE 5.1: Tools Used
5.1.2 Implementation Description
This section provides the detailed implementation of the complete system and
discusses all the algorithms used in the project, with a brief explanation of
each algorithm and the need for it. The AJAX Crawling algorithm forms the
basis of the AJAX Crawler and is described as follows:
5.1.2.1 Ajax Crawling Algorithm
The first step in crawling is to load the initial state and then wait for background
Javascript execution (this handles the case when an AJAX call is made from the
onload event). Then all clickables in the initial state are found and the event is
invoked on each. The clickables are extracted using an XPath expression; the
elements matching the XPath expression are clicked. If there are DOM changes,
the state machine is updated. The crawling is done in a breadth-first manner:
first all states originating from the initial state are found, then each state is
crawled in a similar fashion. The HtmlUnit [2] Java library is used to implement
the AJAX Crawling Algorithm. A WebClient object can be viewed as a browser
instance, which covers the requirement that an AJAX crawler be capable of
executing Javascript.
Algorithm 1 Ajax Crawling algorithm
1: procedure CRAWL(url)
2:   Load url in HtmlUnit WebClient
3:   Wait for background Javascript execution
4:   StateMachine ← Initialize state machine
5:   StateMachine.add(initial_state)
6:   while still some state uncrawled do
7:     current_state ← find some uncrawled state to crawl
8:     webclient ← get_web_client(current_state, StateMachine, url)
9:     while current_state still uncrawled do
10:      crawl_state(webclient, current_state, StateMachine)
11:      webclient ← get_web_client(current_state, StateMachine, url)
12:    end while
13:  end while
14:  save the StateMachine
15: end procedure
Algorithm 2 Ajax Crawling algorithm (Continued)
1: procedure GET_WEB_CLIENT(current_state, StateMachine, url)
2:   webclient ← Load url in HtmlUnit WebClient
3:   Wait for background Javascript execution
4:   path ← Find shortest path from initial state to current_state
5:   while current_state not reached do
6:     xpath ← Get Xpath to traverse to next state in path
7:     Generate the click event on the element retrieved by xpath
8:     Wait for background Javascript execution
9:   end while
10:  return webclient
11: end procedure
One of the problems with the HtmlUnit WebClient is that once a DOM change
occurs and another state is reached, we cannot go back to the source state to
continue the breadth-first crawling process. We need to traverse again from the
initial state to the source state to continue crawling. This is done by the
function GET_WEB_CLIENT: we find the path from the initial state to the current
state to be crawled, then invoke the events along that path to reach the current
state. Another issue with the WebClient is that it cannot be serialized and
stored; thus each time there is a DOM change, we need to traverse from the
initial state to the current state.
The algorithm for crawling an individual state is described by the function
CRAWL_STATE. Special care must be taken to avoid regenerating states that have
already been crawled (i.e., duplicate elimination). This problem is also
encountered in traditional search engines; however, traditional crawling can
usually solve it by comparing the URLs of the given pages, which is a quick
operation. An AJAX crawler cannot count on that, since all AJAX states have the
same URL. Currently, we compare the DOM tree as a whole to check if two states
are the same.
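One way to make the whole-DOM comparison cheap is to keep a set of digests of the DOM strings seen so far; the MD5 hashing below is an illustrative choice for this sketch, not necessarily how the project compares DOM trees internally:

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class StateDeduper {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this DOM string has not been crawled before.
    public boolean isNewState(String domXml) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(domXml.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return seen.add(hex.toString()); // Set.add() is false for duplicates
    }

    public static void main(String[] args) throws Exception {
        StateDeduper d = new StateDeduper();
        System.out.println(d.isNewState("<html><body>A</body></html>")); // true
        System.out.println(d.isNewState("<html><body>B</body></html>")); // true
        System.out.println(d.isNewState("<html><body>A</body></html>")); // false
    }
}
```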
Algorithm 3 Ajax Crawling algorithm (Continued)
1: procedure CRAWL_STATE(webclient, current_state, StateMachine)
2:   elements ← Get all clickable elements using Xpath
3:   while still an element remaining do
4:     xpath ← Get Xpath of the current element
5:     if current element is already clicked in the current state then
6:       continue
7:     end if
8:     if current element is an anchor element then
9:       href ← Get href attribute of the current element
10:      if href is null then
11:        Generate the click event on current element
12:        Wait for background Javascript execution
13:        if dom is changed then
14:          if new state is not already present in StateMachine then
15:            Add the new state to StateMachine
16:            Add a transition from current_state to new state
17:          end if
18:          return
19:        end if
20:      end if
21:    else
22:      Generate the click event on current element
23:      Wait for background Javascript execution
24:      if dom is changed then
25:        if new state is not already present in StateMachine then
26:          Add the new state to StateMachine
27:          Add a transition from current_state to new state
28:        end if
29:        return
30:      end if
31:    end if
32:  end while
33: end procedure
5.1.2.2 State Machine
The algorithm for maintaining the State Machine is shown below.
Algorithm 4 State Machine Representation
1: transition ← Initialize a MultiKey Map
2: crawl_status ← Initialize a Bit Vector
3: graph ← Initialize a Directed Multi Graph
4: states ← Initialize an Array List
5: url ← url currently being crawled
6: procedure ADD_NEW_STATE(dom_xml)
7:   if dom_xml NOT IN states then
8:     state_id = states.size();
9:     states.add(dom_xml);
10:    doc_id = md5(url);
11:    index_state(dom_xml, url, doc_id, state_id);
12:    graph.addVertex(state_id);
13:  end if
14: end procedure
15: procedure ADD_TRANSITION(start_state, end_state, event, target_xpath)
16:  if (start_state, event, target_xpath) NOT IN transition then
17:    transition.put(start_state, event, target_xpath, end_state);
18:    graph.addEdge(start_state, end_state, event, target_xpath);
19:  end if
20: end procedure
21: procedure UPDATE_CRAWL_STATUS(state_id)
22:  crawl_status.set(state_id);
23: end procedure
24: procedure CHECK_CRAWL_STATUS
25:  num_states = states.size() - 1;
26:  for i = 0 → num_states do
27:    if !crawl_status.get(i) then
28:      return false;
29:    end if
30:  end for
31:  return true;
32: end procedure
Algorithm 5 State Machine Representation (Continued)
1: procedure GET_NEXT_STATE_TO_CRAWL
2:   num_states = states.size() - 1;
3:   for i = 0 → num_states do
4:     if !crawl_status.get(i) then
5:       return i;
6:     end if
7:   end for
8:   return -1;
9: end procedure
10:
11: procedure CHECK_STATE_CRAWL_STATUS(state_id)
12:  if crawl_status.get(state_id) then
13:    return true;
14:  end if
15:  return false;
16: end procedure
17:
18: procedure SAVE_STATE_MACHINE
19:  layout ← Initialize a Circle Layout of graph
20:  graphWriter ← Initialize a Graph Writer
21:  output ← Initialize a Print Writer
22:  Add event type custom data for each edge in graph
23:  Add target XPath custom data for each edge in graph
24:  graphWriter.save(graph, output);
25: end procedure
Thus we represent the State Machine as a Directed Multigraph in JUNG (Java
Universal Network/Graph Framework) [3]. Each time a new state is added, we
check whether its DOM is already in the state machine. Likewise, each time a
transition is added, we check that it is not a duplicate. The State Machine is
saved in graphML format.
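The project stores this structure in JUNG; purely to illustrate the bookkeeping of Algorithm 4, the sketch below mirrors ADD_NEW_STATE and ADD_TRANSITION with standard-library collections (a map keyed on the triple (source, event, XPath) stands in for the MultiKey Map):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateMachineSketch {
    private final List<String> states = new ArrayList<>();          // index = state id
    private final Map<String, Integer> transition = new HashMap<>(); // (src,event,xpath) -> dst

    // Mirrors ADD_NEW_STATE: add only if this DOM is unseen; return the state id.
    public int addNewState(String domXml) {
        int idx = states.indexOf(domXml);
        if (idx >= 0) return idx;
        states.add(domXml);
        return states.size() - 1;
    }

    // Mirrors ADD_TRANSITION: keyed on (source, event, xpath) so duplicates are skipped.
    public void addTransition(int src, int dst, String event, String xpath) {
        transition.putIfAbsent(src + "|" + event + "|" + xpath, dst);
    }

    public int numStates() { return states.size(); }
    public int numTransitions() { return transition.size(); }

    public static void main(String[] args) {
        StateMachineSketch sm = new StateMachineSketch();
        int s0 = sm.addNewState("<html>0</html>");
        int s1 = sm.addNewState("<html>1</html>");
        sm.addTransition(s0, s1, "onclick", "/html/body/div");
        sm.addTransition(s0, s1, "onclick", "/html/body/div"); // duplicate, ignored
        System.out.println(sm.numStates() + " " + sm.numTransitions()); // "2 1"
    }
}
```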
5.1.2.3 Indexing
Indexing is the process of extracting text from web pages, tokenizing it and then
creating an index structure (an inverted index) that can be used to quickly find
which pages contain a particular word. The purpose of storing an index is to
optimize speed and performance in finding relevant documents for a search query.
Without an index, the search engine would have to scan every document in the
corpus, which would require considerable time and computing power. The project
uses the open source Lucene API [4] for indexing the crawled states. Another
advantage of Lucene is that it supports incremental indexing, so there is no need
to re-index all documents from the beginning each time; the index files can
simply be updated. Only the text part of the DOM is indexed. In the inverted
file, we store the URL, DOC ID and STATE ID. The algorithm for indexing is
given below.
Algorithm 6 Indexing crawled states using Lucene
1: procedure INDEX_STATE(dom_xml, doc_id, url, state_id)
2:   indexWriter = new IndexWriter(path_to_index_files, new SimpleAnalyzer(), false);
3:   Document doc = new Document();
4:   doc.add(new Field("content", dom_xml, Field.Store.YES, Field.Index.TOKENIZED));
5:   doc.add(new Field("url", url, Field.Store.YES, Field.Index.NO));
6:   doc.add(new Field("docid", doc_id, Field.Store.YES, Field.Index.NO));
7:   doc.add(new Field("state", state_id, Field.Store.YES, Field.Index.NO));
8:   indexWriter.addDocument(doc);
9:   indexWriter.optimize();
10:  indexWriter.close();
11: end procedure
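The inverted-index idea that Lucene implements can be illustrated without Lucene itself. This toy standard-library version (our illustration, not the project's implementation) maps each token to the set of state ids containing it:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // word -> sorted set of state ids whose text contains the word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void indexState(int stateId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(stateId);
        }
    }

    // Lookup is a single map access: no scan over the corpus.
    public Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.indexState(0, "Wonders of the World");
        idx.indexState(1, "The Great Wall of China");
        System.out.println(idx.search("wall")); // prints [1]
        System.out.println(idx.search("the"));  // prints [0, 1]
    }
}
```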
5.1.2.4 Searching
Once the indexing is done, the inverted files are saved. Searching involves
searching through the indexed content; the project uses the Lucene Search API
[4] for this. Since the search is done in PHP, the project needs a way to access
the Lucene index files from PHP. The Zend PHP Framework includes a Lucene
search library, which allows Lucene index files generated in Java to be read in
PHP. The search results are returned as an associative array consisting of the
values of the parameters we specified during indexing. For each result, Lucene
assigns a score based on the frequency of each word in the document; the higher
the score, the more relevant the result is to the search query. The code snippet
for searching in PHP using the Lucene search library in the Zend Framework is
shown below.
Algorithm 7 Searching Lucene indexed files in PHP
1: procedure SEARCH(query)
2:   $index = new Zend_Search_Lucene(path_to_index_files);
3:   $hits = $index->find($query);
4:   foreach ($hits as $hit)
5:   {
6:     echo $hit->score;
7:     echo $hit->docid;
8:     echo $hit->url;
9:     echo $hit->state;
10:  }
11: end procedure
5.1.2.5 Reconstruction of a particular state after crawling
After crawling the states of a particular URL, the states should be indexed so
that they can be searched by the search engine [8]. Thus a state needs to be
reconstructed in order to be displayed in the search results. A web browser can
load only the initial state of a URL, but we need to load subsequent states,
which in a browser occur only after a sequence of Javascript events has been
invoked. The project therefore uses the Selenium Web Driver [6] to load a
particular state in the browser directly. A Web Driver can be viewed as a
browser that can be controlled through code. The project finds the path from the
initial state to the state to be loaded, initially loads the initial state in
the Web Driver, and then invokes the Javascript events along the path in the Web
Driver until the required state is reached. The required state is thus loaded in
the browser to be viewed by the user, who can then continue browsing from that
state as in a normal browser.
Algorithm 8 Reconstruction of a particular state after crawling
1: procedure RECONSTRUCT_STATE(state)
2:   Read the graphML file of the corresponding URL and construct a Directed Multigraph
3:   path ← Find shortest path from initial state to the state to be constructed (Dijkstra's Algorithm)
4:   Load the initial state in a Web Driver like Selenium
5:   while state not reached do
6:     xpath ← Get Xpath expression of the element to be clicked next
7:     Generate the click event on the element retrieved by xpath
8:     Wait for background Javascript execution
9:   end while
10:  The required state is currently loaded in the Web Driver
11: end procedure
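The path-finding step can be sketched with a plain breadth-first search, which coincides with Dijkstra's algorithm when every transition has unit cost (the adjacency map below is a hypothetical state graph for illustration, not one of the crawled sites). The events stored on each edge of the returned path would then be replayed in the Web Driver:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class StatePath {
    // BFS shortest path over state ids; unit edge costs make this equivalent to Dijkstra.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> graph, int start, int goal) {
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(start, start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int cur = queue.poll();
            if (cur == goal) break;
            for (int next : graph.getOrDefault(cur, Collections.emptyList())) {
                if (!parent.containsKey(next)) { parent.put(next, cur); queue.add(next); }
            }
        }
        if (!parent.containsKey(goal)) return Collections.emptyList(); // unreachable
        LinkedList<Integer> path = new LinkedList<>();
        for (int at = goal; ; at = parent.get(at)) {
            path.addFirst(at);
            if (at == start) break;
        }
        return path;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(0, Arrays.asList(1, 2, 3)); // transitions out of the initial state
        g.put(1, Arrays.asList(2));
        System.out.println(shortestPath(g, 0, 2)); // prints [0, 2]
    }
}
```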
CHAPTER 6
RESULTS AND DISCUSSION
In this chapter, we report the significant results obtained in our experiments.
6.1 Results
Table 6.1 lists the sample test cases used for evaluating the performance of the
AJAX Crawler.
Case  AJAX Site
C1    http://test.thurls.com/ajax/home.php
C2    http://spci.st.ewi.tudelft.nl/demo/aowe/
C3    http://www.itrix.co.in/
C4    http://demo.tutorialzine.com/2009/09/simple-ajax-website-jquery/demo.html
C5    http://test.thurls.com/ajax/home1.php
TABLE 6.1: Test Cases
Some of the sample clickables in each of the test cases are shown below.
• Sample Clickables in C1
<div onclick='load_content(1)'>Great Wall of China</div>
<div onclick='load_content(2)'>Petra</div>
• Sample Clickables in C2
<b>Home</b>
<b>Workshop Organizers</b>
<b>Program Committee</b>
<b>Call for Papers</b>
• Sample Clickables in C3
<p id="hel">About Us</p>
<p id="hel">Sponsors</p>
• Sample Clickables in C4
<a href="#page1">Page 1</a>
<a href="#page">Page 2</a>
• Sample Clickables in C5
<div onclick='load_content(24)'>Test 16</div>
<div onclick='load_content(25)'>Test 17</div>
Table 6.2 contains the experimental results obtained for the sample test cases.
Probable Clickables are those elements in the DOM which can be clicked.
Detected Clickables are those that actually trigger AJAX requests.
Case  Maximum DOM String Size (bytes)  Probable Clickables  Detected Clickables  Number of States
C1    5829    24    8    8
C2    6378    61    11   11
C3    17422   167   27   27
C4    2159    23    5    5
C5    8233    58    26   26
TABLE 6.2: Experimental Results
6.2 Performance Evaluation
The performance of an AJAX crawler depends on
1. Crawling Time
2. Clickable Selection Policy
3. Search Result Quality
6.2.1 Crawling Time
In traditional crawling,
Crawling time of a page = network latency + server response time
In AJAX crawling,
Crawling time of a state = network latency + server response time + AJAX request time
The crawl time of a page in traditional crawling is on the order of milliseconds,
whereas in AJAX crawling the crawl time of a state is on the order of minutes.
This is due to the time spent in executing Javascript. Table 6.3 contains the
crawling time for each test case; the crawl time per state is also shown.
Case  Number of States  Total Crawling Time (in mins)  Crawling Time per State (in mins)
C1    8     11.44    1.43
C2    11    216.45   19.68
C3    27    607.5    22.5
C4    5     34.9     6.98
C5    26    103.13   3.97
TABLE 6.3: Crawling Time
6.2.1.1 Number of States Vs Crawling Time
Figure 6.1 shows the plot between number of states and Crawling time (in
minutes).
FIGURE 6.1: Number of States Vs Crawling Time(in minutes)
Inferences from the graph
• The variation between Crawling time and number of states is not uniform.
• Crawling time doesn’t depend directly on the number of states.
• For the same website, crawling time is not constant when measured at
different instances.
• Network latency and server response time are not constant.
• Crawling time doesn't depend only on the number of states. It is a weighted
measure of network latency, server response time, AJAX request time and also
the number of states.
6.2.2 Clickable Selection Policy
Clickable selection refers to the process of identifying clickables for invoking
events. An ideal clickable selection policy should identify clickables in an
optimal way such that most of them trigger AJAX requests or cause a change in
DOM. A better clickable selection policy can reduce the Javascript wait time ,
thus decreasing crawling time.We define a ratio called Clickable Selection Ratio
, which can defined as a ratio of Number of AJAX Requests to that of number of
probable Clickables.Table 6.4 contains the Clickable Selection Ratio for the
sample test cases.
Clickable Selection ratio = No. of AJAX Requests / No. of Probable Clickables
Case  Number of Clickables  Number of AJAX Requests  Clickable Selection Ratio
C1    24    8     0.33
C2    61    11    0.18
C3    167   27    0.16
C4    23    5     0.21
C5    58    26    0.45
TABLE 6.4: Clickable Selection Policy
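The ratio column of Table 6.4 can be reproduced (up to rounding in the last digit) from the clickable and request counts; a small sketch:

```java
import java.util.Locale;

public class SelectionRatio {
    public static void main(String[] args) {
        // Counts taken from Tables 6.2 and 6.4 for cases C1..C5.
        int[] probable = {24, 61, 167, 23, 58}; // probable clickables
        int[] requests = {8, 11, 27, 5, 26};    // AJAX requests
        for (int i = 0; i < probable.length; i++) {
            double ratio = (double) requests[i] / probable[i];
            // Locale.US keeps the decimal point regardless of the system locale.
            System.out.printf(Locale.US, "C%d %.2f%n", i + 1, ratio);
        }
    }
}
```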
6.2.2.1 Number of AJAX Requests Vs Probable Clickables
Figure 6.2 shows the plot between Number of AJAX Requests and Probable
Clickables.
FIGURE 6.2: Number of AJAX Requests Vs Probable Clickables
Inferences from the graph
• The variation between Probable Clickables and Number of AJAX Requests
is not uniform.
• The number of clickables depends on the structure of the web page.
• The number of AJAX requests cannot be directly related to the number of
probable clickables.
6.2.2.2 Probable Clickables Vs Detected Clickables
Figure 6.3 shows the plot between Probable Clickables and Detected Clickables
FIGURE 6.3: Probable Clickables Vs Detected Clickables
Inferences from the graph
• The variation between Probable Clickables and Detected Clickables is not
uniform.
• The number of Detected Clickables depends on the structure of the web page
rather than on the number of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of AJAX Requests.
• Number of AJAX Requests <= Number of Detected Clickables
6.2.3 Clickable Selection Ratio Vs Crawling Time
Figure 6.4 shows the plot of Clickable Selection Ratio versus Crawling Time per
state (in minutes).
FIGURE 6.4: Clickable Selection Ratio Vs Crawling time per state(in minutes)
Inferences from the graph
• The variation between Clickable Selection Ratio and Crawling Time is
uniform.
• Crawling time is inversely proportional to Clickable Selection Ratio
6.3 Search Result Quality
Search result quality is improved by indexing hidden AJAX content. This content
is not visible to traditional crawlers and hence is not indexed by them. With
the AJAX Crawler, therefore, the quality of search results improves compared
with traditional crawlers. We will now see how the Google Bot and the AJAX
Crawler fetch http://test.thurls.com/ajax/home.php (C1). Screenshots are
provided in Appendix A.2.
This is how the Google Bot fetches http://test.thurls.com/ajax/home.php (C1):
HTTP/1.1 200 OK
Date: Fri, 20 Apr 2012 19:08:29 GMT
Content-Type: text/html
Connection: close
Server: Nginx / Varnish
X-Powered-By: PHP/5.2.17
Content-Length: 1180
<html>
<head>
<title>Ajax Crawling </title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js">
</script><script type=”text/javascript”>
function load_content(key)
{
$.get('getcontent.php', {'key' : key}, function(data)
{
document.getElementById(’result’).innerHTML=data.result;
},’json’);
}
</script></head>
<body>
<center><h2>Wonders of the World</h2></center>
<br><br><table border="0"><tr><td width="250px" style="position:fixed;">
<ul><li><div onclick='load_content(1)'>Great Wall of China</div></li>
<li><div onclick='load_content(2)'>Petra</div></li>
<li><div onclick='load_content(3)'>Christ the Redeemer</div></li>
<li><div onclick='load_content(4)'>Machu Picchu</div></li>
<li><div onclick='load_content(5)'>Chichen Itza</div></li>
<li><div onclick='load_content(6)'>Colosseum</div></li>
<li><div onclick='load_content(7)'>Taj Mahal</div></li>
<li><div onclick='load_content(8)'>Great Pyramid of Giza</div></li>
</ul></td>
<td style="padding-left:350px;">
<div id="result"><script>load_content(1);</script></div>
</td></tr></table></body></html>
Here the content inside the division called 'result' is not loaded, and the
script code is fetched as-is by the Google Bot without being executed.
This is how the AJAX Crawler fetches http://test.thurls.com/ajax/home.php (C1):
<html>
<head>
<title>Ajax Crawling </title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js">
</script><script type=”text/javascript”>
function load_content(key)
{
$.get('getcontent.php', {'key' : key}, function(data)
{
document.getElementById(’result’).innerHTML=data.result;
},’json’);
}
</script></head>
<body>
<center><h2>Wonders of the World</h2></center>
<br><br><table border="0"><tr><td width="250px" style="position:fixed;">
<ul><li><div onclick='load_content(1)'>Great Wall of China</div></li>
<li><div onclick='load_content(2)'>Petra</div></li>
<li><div onclick='load_content(3)'>Christ the Redeemer</div></li>
<li><div onclick='load_content(4)'>Machu Picchu</div></li>
<li><div onclick='load_content(5)'>Chichen Itza</div></li>
<li><div onclick='load_content(6)'>Colosseum</div></li>
<li><div onclick='load_content(7)'>Taj Mahal</div></li>
<li><div onclick='load_content(8)'>Great Pyramid of Giza</div></li>
</ul></td>
<td style="padding-left:350px;">
<div id="result">
The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and
other materials, generally built along an east to west line across the historical northern borders of
China in part to protect the Chinese Empire or its prototypical states against intrusions by various
nomadic groups or military incursions by various warlike peoples or forces. Several walls had
already been begun to be built beginning around the 7th century BC;these, later joined together and
made bigger, stronger, and unified are now collectively referred to as the Great Wall. Especially
famous is the wall built between 220 to 206 BC by the first Emperor of China, Qin Shi Huang. Little
of that wall remains. Since then, the Great Wall has on and off been rebuilt, maintained, enhanced;
the majority of the existing wall was reconstructed during the Ming Dynasty. <br/><br/>
Other purposes of the Great Wall have included allowing for border control practices, such as check
points allowing for the various imperial governments of China to tariff goods transported along the
Silk Road, to regulate or encourage trade (for example trade between horses and silk products), as
well as generally to control immigration and emigration. Furthermore, the defensive characteristics
of the Great Wall were enhanced by the construction of watch towers, troop barracks, garrison
stations, signaling capabilities through the means smoke or fire, and the fact that the path of the
Great Wall also served as a transportation corridor. <br/><br/>
The Great Wall stretches from Shanhaiguan in the east, to Lop Lake in the west, along an arc that
roughly delineates the southern edge of Inner Mongolia. The most comprehensive archaeological
survey, using advanced technologies, has concluded that all the walls measure 8,851.8 km
(5,500.3 mi).This is made up of 6,259.6 km (3,889.5 mi) sections of actual wall, 359.7 km (223.5
mi) of trenches and 2,232.5 km (1,387.2 mi) of natural defensive barriers such as hills and rivers.
</div></td></tr></table></body></html>
Here the content inside the division called 'result', loaded through AJAX, is
fetched by waiting for the Javascript execution to complete. Thus we see that
the DOM crawled by the AJAX Crawler has the initial AJAX content loaded into it,
whereas the Google Bot does not see the initial AJAX content. Thus the quality
of results is improved by crawling AJAX content.
6.4 Observations
The following important observations have been made based on the analysis of
the results.
• Crawling time of an AJAX page is on the order of minutes.
• Crawling time doesn't depend only on the number of states. It is a weighted
measure of network latency, server response time, AJAX request time and also
the number of states.
• Crawling time is inversely proportional to clickable selection ratio
• Number of clickables in a DOM depends on the DOM structure rather than
DOM size.
• Quality of search results is improved by AJAX Crawler.
CHAPTER 7
CONCLUSIONS
7.1 Contributions
In this chapter, we summarize the significant contributions of our work. These are:
1. Implementing an AJAX Crawler
2. Making the crawled AJAX states searchable
3. Analyzing the performance of AJAX Crawler
The results indicate that crawling AJAX content improves the quality of search
results at the cost of a large crawling time. Further optimization is therefore
needed to make the AJAX crawling time comparable with traditional crawling.
7.2 Future Work
The following are some of the possible extensions that can be done to the system.
• Multi Threading
This can be done by having a separate HtmlUnit WebClient in each thread. To
make sure that multiple threads don't crawl the same path, the state machine
needs to be synchronized between the threads.
• Using the DOM change statistics between states
Consider a transition from state 1 to state 2. There may be many static
elements common to states 1 and 2, so the effect of invoking events on these
static elements is the same in both states. Thus, only the elements which get
added or changed in the DOM when the state changes from 1 to 2 need to be
found, and events need to be invoked only on those elements in state 2. The
remaining transitions are the same as in state 1.
• Avoid invoking events on nested elements
When the event has already been invoked on an element, the event need not be
invoked again on the enclosing parent element. For example, consider the
following HTML element,
<div><b>test</b></div>
Here, clicking the element <b>test</b> has the same effect as clicking
<div><b>test</b></div>. Thus the event need not be invoked on the enclosing
parent element. This would reduce the number of duplicate transitions in the
state machine.
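The proposed check could be implemented by comparing XPaths: an element can be skipped if the XPath of an already-clicked element extends its own. A small sketch (the helper name is ours, purely illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NestedClickFilter {
    // Skip an element when some already-clicked element lies strictly inside it,
    // i.e. the clicked element's XPath has this element's XPath as a proper prefix.
    static boolean isEnclosingParent(String xpath, Set<String> clicked) {
        for (String c : clicked) {
            if (!c.equals(xpath) && c.startsWith(xpath + "/")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> clicked = new HashSet<>(Arrays.asList("/html/body/div/b"));
        // The <div> encloses the already-clicked <b>, so it can be skipped.
        System.out.println(isEnclosingParent("/html/body/div", clicked)); // true
        // An unrelated element is not skipped.
        System.out.println(isEnclosingParent("/html/body/p", clicked));   // false
    }
}
```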
APPENDIX A
SNAPSHOTS
A.1 Search Interface
Entering Search query
FIGURE A.1: Interface I
Search results being displayed
FIGURE A.2: Interface II
A.2 Google Bot and AJAX Crawler
This is how Google Bot fetches http://test.thurls.com/ajax/home.php
FIGURE A.3: Fetched By Google Bot
This is how Google Bot fetches http://test.thurls.com/ajax/home.php
FIGURE A.4: Fetched By Google Bot
This is how AJAX Crawler fetches http://test.thurls.com/ajax/home.php
FIGURE A.5: Fetched By AJAX Crawler
This is how AJAX Crawler fetches http://test.thurls.com/ajax/home.php
FIGURE A.6: Fetched By AJAX Crawler
APPENDIX B
DOM
B.1 DOM - Document Object Model
The Document Object Model (DOM) is an application programming interface
(API) for valid HTML and well-formed XML documents. It defines the logical
structure of documents and the way a document is accessed and manipulated. In
the DOM specification, the term "document" is used in the broad sense -
increasingly, XML is being used as a way of representing many different kinds of
information that may be stored in diverse systems, and much of this would
traditionally be seen as data rather than as documents. Nevertheless, XML
presents this data as documents, and the DOM may be used to manage this data.
With the Document Object Model, programmers can build documents, navigate
their structure, and add, modify, or delete elements and content. Anything found in
an HTML or XML document can be accessed, changed, deleted, or added using the
Document Object Model, with a few exceptions - in particular, the DOM interfaces
for the XML internal and external subsets have not yet been specified.
B.2 DOM Tree Representation
Every valid HTML/XML document can be represented by a DOM Tree. Consider
the following HTML code snippet,
<table>
<tbody>
<tr>
<td>Shady Grove</td>
<td>Aeolian</td>
</tr>
<tr>
<td>Over the River, Charlie</td>
<td>Dorian</td>
</tr>
</tbody>
</table>
The DOM tree for the above code snippet is shown below.
FIGURE B.1: DOM Tree
References
[1] “Google’s AJAX Crawling Scheme,” https://developers.google.com/webmasters/ajax-
crawling/, 2010.
[2] “HtmlUnit,” http://htmlunit.sourceforge.net/, 2011.
[3] “JUNG,” http://jung.sourceforge.net/, 2010.
[4] “Apache Lucene,” http://lucene.apache.org/core/, 2011.
[5] “Book review: Design and validation of computer protocols by gerard
j. holzmann (prentice hall, 1991),” SIGCOMM Comput. Commun. Rev.,
vol. 21, no. 2, pp. 14–, Apr. 1991, reviewer-Fredlund, Lars-Ake. [Online].
Available: http://doi.acm.org/10.1145/122419.1024051
[6] “Selenium Web Driver,” http://seleniumhq.org/docs/03_webdriver.html,
2011.
[7] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “Ubicrawler: a scalable
fully distributed web crawler,” Softw. Pract. Exper., vol. 34, no. 8, pp.
711–726, Jul. 2004. [Online]. Available: http://dx.doi.org/10.1002/spe.587
[8] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search
engine,” Comput. Netw. ISDN Syst., vol. 30, no. 1-7, pp. 107–117, Apr. 1998.
[Online]. Available: http://dx.doi.org/10.1016/S0169-7552(98)00110-X
[9] A. Deursen and A. Mesbah, “Research issues in the automated testing
of ajax applications,” in Proceedings of the 36th Conference on Current
Trends in Theory and Practice of Computer Science, ser. SOFSEM ’10.
Berlin, Heidelberg: Springer-Verlag, 2010, pp. 16–28. [Online]. Available:
http://dx.doi.org/10.1007/978-3-642-11266-9_2
[10] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl:
Making ajax applications searchable,” in Proceedings of the 2009 IEEE
International Conference on Data Engineering, ser. ICDE ’09. Washington,
DC, USA: IEEE Computer Society, 2009, pp. 78–89. [Online]. Available:
http://dx.doi.org/10.1109/ICDE.2009.90
[11] J. Eichorn, Understanding AJAX: Using JavaScript to Create Rich Internet
Applications. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2006.
[12] S. Goren and F. J. Ferguson, “On state reduction of incompletely specified
finite state machines,” Comput. Electr. Eng., vol. 33, no. 1, pp. 58–69, Jan.
2007. [Online]. Available: http://dx.doi.org/10.1016/j.compeleceng.2006.06.001
[13] J. E. Hopcroft, “An n log n algorithm for minimizing states in a finite
automaton,” Stanford, CA, USA, Tech. Rep., 1971.
[14] A. Mesbah and A. van Deursen, “Invariant-based automatic testing
of ajax user interfaces,” in Proceedings of the 31st International
Conference on Software Engineering, ser. ICSE ’09. Washington, DC,
USA: IEEE Computer Society, 2009, pp. 210–220. [Online]. Available:
http://dx.doi.org/10.1109/ICSE.2009.5070522
[15] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web
applications through dynamic analysis of user interface state changes,” ACM
Trans. Web, vol. 6, no. 1, pp. 3:1–3:30, Mar. 2012. [Online]. Available:
http://doi.acm.org/10.1145/2109205.2109208
[16] T. Villa, T. Kam, R. K. Brayton, and A. Sangiovanni-Vincentelli, Synthesis
of finite state machines: logic optimization. Norwell, MA, USA: Kluwer
Academic Publishers, 1997.