SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS (pages.cs.wisc.edu/~paulgc/Thesis.pdf)
SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN
WEB APPLICATIONS

by

PAUL SUGANTHAN G C 20084053
MUTHUKUMAR V 20084041
NANDHAKUMAR B 20084043
A project report submitted to the
FACULTY OF INFORMATION AND
COMMUNICATION ENGINEERING
in partial fulfillment of the requirements
for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY CHENNAI
CHENNAI - 600025
MAY 2012
CERTIFICATE
Certified that this project report titled “SEARCH ENGINE ENHANCEMENT BY
EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS” is the
bonafide work of PAUL SUGANTHAN G C (20084053), MUTHUKUMAR V
(20084041) and NANDHAKUMAR B (20084043), who carried out the project
work under my supervision, for the fulfillment of the requirements for the award
of the degree of Bachelor of Engineering in Computer Science and Engineering.
Certified further that, to the best of my knowledge, the work reported herein does
not form part of any other thesis or dissertation on the basis of which a degree or
an award was conferred on an earlier occasion on this or any other candidate.
Place: Chennai
Date:

Dr. V Vetriselvi
Project Guide,
Designation,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai - 600025
COUNTERSIGNED
Head of the Department,
Department of Computer Science and Engineering,
Anna University Chennai,
Chennai – 600025
ACKNOWLEDGEMENTS
We express our deep gratitude to our guide, Dr. V VETRISELVI for guiding us
through every phase of the project. We appreciate her thoroughness, tolerance and
ability to share her knowledge with us. We thank her for being easily approachable
and quite thoughtful. Apart from adding her own input, she has encouraged us to
think on our own and give form to our thoughts. We owe her for harnessing our
potential and bringing out the best in us. Without her immense support through
every step of the way, we could never have made it to this extent.
We are extremely grateful to Dr. K.S. EASWARAKUMAR, Head of the
Department of Computer Science and Engineering, Anna University, Chennai
600025, for extending the facilities of the Department towards our project and for
his unstinting support.
We express our thanks to the panel of reviewers, Dr. ARUL SIROMONEY,
Dr. A.P. SHANTHI and Dr. MADHAN KARKY, for their
valuable suggestions and critical reviews throughout the course of our project.
We thank our parents, family, and friends for bearing with us throughout the course
of our project and for the opportunity they provided us in undergoing this course
in such a prestigious institution.
Paul Suganthan G C Muthukumar V Nandhakumar B
ABSTRACT
Current search engines such as Google and Yahoo! are widely used for searching
the Web. Search over dynamic client-side Web pages is, however, either
nonexistent or far from perfect, and not addressed by existing work, for example
on the Deep Web.
This is a real impediment since AJAX and Rich Internet Applications are already
very common in the Web. AJAX applications are composed of states which can
be seen by the user, but not by the search engine, and changed by the user using
client-side events. Current search engines either ignore AJAX applications or
produce false negatives. The reason is that crawling client-side code is a difficult
problem that cannot be solved naively by invoking user events.
The project aims to propose a solution for crawling and extracting hidden AJAX
content, thus enabling search engines to enhance their search result quality by
indexing dynamic AJAX content. Though AJAX content can be crawled by
manually invoking client-side events in a browser, enhancing a search engine to
crawl AJAX content automatically, as it crawls traditional web applications, has
not been achieved.
This report describes the design and implementation of an AJAX Crawler and the
process of enabling a search engine to index the crawled states of an AJAX page.
The performance of the AJAX Crawler is evaluated and compared with a
traditional crawler. Possible issues in crawling AJAX content and future
optimizations are also analysed.
ABSTRACT (TAMIL)
Existing search engines do not search web pages on the Internet whose text
changes frequently, so much of the text on the Web goes unnoticed by users. The
aim of this project is to make such hidden text visible to search engines; even
leading search engines such as Google and Yahoo! overlook a great deal of this
content. Through this project, hidden text on the Web can be discovered by search
engines, reducing the amount of content that remains hidden and increasing the
effectiveness of search engines.
Contents
CERTIFICATE i
ACKNOWLEDGEMENTS ii
ABSTRACT(ENGLISH) iii
ABSTRACT(TAMIL) iv
LIST OF FIGURES viii
LIST OF TABLES ix
LIST OF ABBREVIATIONS x
1 INTRODUCTION 1
  1.1 AJAX 1
  1.2 Crawler 2
  1.3 Problem Definition 3
  1.4 Scope of the Project 4
  1.5 Organisation of this Report 4

2 RELATED WORK 5
  2.1 Crawling AJAX 5
  2.2 Finite State Machine 8
  2.3 Google's AJAX Crawling Scheme 8

3 REQUIREMENTS ANALYSIS 11
  3.1 Functional Requirements 11
  3.2 Non-Functional Requirements 12
    3.2.1 User Interface 12
    3.2.2 Hardware Considerations 12
    3.2.3 Performance Characteristics 12
    3.2.4 Security Issues 13
    3.2.5 Safety Issues 13
  3.3 Constraints 13
  3.4 Assumptions 14

4 SYSTEM DESIGN 15
  4.1 System Architecture 15
    4.1.1 Architecture Diagram 15
  4.2 Module Descriptions 17
    4.2.1 Identification of Clickables 17
    4.2.2 Event Invocation 19
    4.2.3 State Machine representation of AJAX website 19
      4.2.3.1 Visualizing the State Machine 21
    4.2.4 Indexing 22
    4.2.5 Searching 22
    4.2.6 Reconstruction of state 22
  4.3 User Interface Design 22
  4.4 UseCase Model 23
    4.4.1 UseCase Diagram 23
  4.5 System Sequence Diagram 24
    4.5.1 Event Invocation 24
    4.5.2 Searching 24
  4.6 Data Flow Model 25
    4.6.1 Data Flow Diagram 25

5 SYSTEM DEVELOPMENT 28
  5.1 Implementation 28
    5.1.1 Tools Used 28
    5.1.2 Implementation Description 28
      5.1.2.1 Ajax Crawling Algorithm 29
      5.1.2.2 State Machine 32
      5.1.2.3 Indexing 34
      5.1.2.4 Searching 35
      5.1.2.5 Reconstruction of a particular state after crawling 36

6 RESULTS AND DISCUSSION 37
  6.1 Results 37
  6.2 Performance Evaluation 39
    6.2.1 Crawling Time 39
      6.2.1.1 Number of States Vs Crawling Time 40
    6.2.2 Clickable Selection Policy 41
      6.2.2.1 Number of AJAX Requests Vs Probable Clickables 42
      6.2.2.2 Probable Clickables Vs Detected Clickables 43
    6.2.3 Clickable Selection Ratio Vs Crawling Time 44
  6.3 Search Result Quality 45
  6.4 Observations 49

7 CONCLUSIONS 50
  7.1 Contributions 50
  7.2 Future Work 50

A Snapshots 52
  A.1 Search Interface 52
  A.2 Google Bot and AJAX Crawler 54

B DOM 58
  B.1 DOM - Document Object Model 58
  B.2 DOM Tree Representation 58

References 60
List of Figures
1.1 Crawler Architecture 3

2.1 AJAX Crawling Scheme 9
2.2 Control Flow 10

4.1 Architecture Diagram 16
4.2 Visualizing State Machine 21
4.3 UseCase Diagram 23
4.4 Sequence Diagram - Event Invocation 24
4.5 Sequence Diagram - Searching 25
4.6 Level 0 Data Flow Diagram 25
4.7 Level 1 Data Flow Diagram 26
4.8 Level 1 Data Flow Diagram 27

6.1 Number of States Vs Crawling Time (in minutes) 40
6.2 Number of AJAX Requests Vs Probable Clickables 42
6.3 Probable Clickables Vs Detected Clickables 43
6.4 Clickable Selection Ratio Vs Crawling Time per state (in minutes) 44

A.1 Interface I 52
A.2 Interface II 53
A.3 Fetched By Google Bot 54
A.4 Fetched By Google Bot 55
A.5 Fetched By AJAX Crawler 56
A.6 Fetched By AJAX Crawler 57

B.1 DOM Tree 59
List of Tables
5.1 Tools Used 28

6.1 Test Cases 37
6.2 Experimental Results 38
6.3 Crawling Time 39
6.4 Clickable Selection Policy 41
LIST OF ABBREVIATIONS
Acronym  Expansion
AJAX     Asynchronous JavaScript and XML
CSS      Cascading Style Sheets
DOM      Document Object Model
HTML     HyperText Markup Language
JS       JavaScript
JUNG     Java Universal Network/Graph Framework
URL      Uniform Resource Locator
XML      Extensible Markup Language
CHAPTER 1
INTRODUCTION
Web applications are increasingly replacing desktop applications. In this chapter
we introduce the techniques that support this change, and we give an outline of
this report. The first section presents AJAX, the major new technique and
architectural change for web applications of the past years. Section 1.2 explains
the operation of a crawler. Section 1.3 presents the research problems of this
report. Section 1.4 presents the scope of the project. Section 1.5 discusses the
organisation of this report.
1.1 AJAX
AJAX is an acronym for Asynchronous JavaScript and XML. AJAX is a
technique whereby a website can update part of a page without refreshing the
whole content. This saves bandwidth and provides for a more interactive user
experience. In other words, changes that a user makes appear more quickly on the
screen, and the website seems to respond much faster. This improved
responsiveness increases the interactivity of websites and makes the user
experience much more enjoyable. It should be noted that AJAX is not a
technology in its own right; rather, it is a technique that utilizes other
technologies. AJAX is considered one of the core techniques behind Web 2.0
applications.
AJAX is a clever combination of using the client-side JavaScript engine [11] to
update small parts of the Document Object Model (DOM) with information
retrieved by asynchronous server communication. By using AJAX technology
developers can create applications in which the page does not have to be
re-rendered again every time an interaction has taken place; only small sub-sets of
the page need to get updated.
A common problem with AJAX applications is that they break the web browser's
Back button. In a normal non-AJAX application, every webpage has a unique
URL, so a user can hit the Back button to return to the previous URL, which is
the state the browser was in before the user's last action. This can be seen as a
sort of Undo operation. With AJAX, however, the URL of the webpage does not
change every time the state of the web application changes. Therefore a press of
the Back button may bring the user to a state much further back than he intended.
Also, page bookmarking depends upon the URL of the page in question, so pages
created by AJAX will not be bookmarkable.
1.2 Crawler
A Web crawler [7] is a computer program that browses the World Wide Web in a
methodical, automated manner or in an orderly fashion. Web crawlers are mainly
used to create a copy of all the visited pages for later processing by a search
engine that will index the downloaded pages to provide fast searches. Figure 1.1
depicts high-level architecture of a standard Web Crawler.
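The fetch-and-enqueue cycle behind Figure 1.1 can be sketched as a minimal crawl loop. This is an illustration only, not the project's crawler: the fetch function is an injected stand-in for the downloader and link extractor, and a real crawler would add HTTP fetching, politeness delays and URL normalisation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Minimal sketch of a crawl loop: a frontier queue, a visited set, and a
// pluggable fetch function that maps a URL to the out-links found on that page.
public class CrawlLoop {
    public static List<String> crawl(String seed, Function<String, List<String>> fetch) {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);                        // page would be stored for indexing here
            for (String link : fetch.apply(url))   // extracted out-links
                if (visited.add(link))             // schedule each URL at most once
                    frontier.add(link);
        }
        return order;
    }
}
```

Driving the loop with a fake link graph shows the breadth-first visiting order without any network access.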
FIGURE 1.1: Crawler Architecture
1.3 Problem Definition
With the advent of Web 2.0 , AJAX is being used widely to enhance interactivity
and user experience. Also standalone AJAX applications are also being
developed. For eg Google Maps, Gmail and Yahoo! Mail are classic examples of
AJAX applications. Current crawlers ignore AJAX content as well as dynamic
content added through client side script. Thus most of the dynamic content is still
hidden. We have considered two problems in our project.
1. Crawling AJAX Content in websites
2. Making the crawled AJAX Content indexable and searchable
1.4 Scope of the Project
The project enables hidden dynamic content to be visible to search engines. Thus
the hidden web can be explored to a great extent. The project describes the design
and development of an AJAX Crawler and building an AJAX Search Engine to
search through the crawled states. Finally we evaluate the performance of the
AJAX Crawler. However, the scope of the project is limited by the fact that
crawling AJAX content is time consuming, since, unlike traditional crawling, it
requires executing JavaScript.
1.5 Organisation of this Report
This report is organized as follows. Chapter 2 discusses the related work done in
this area. Chapter 3 describes the requirement analysis of the system. Chapter 4
elaborates on the design of the system. Chapter 5 details about the development
of the system. Chapter 6 describes the results obtained from our system and also
provides an analysis of the results. Finally, Chapter 7 summarizes the work we
have completed and presents pointers for future work.
CHAPTER 2
RELATED WORK
2.1 Crawling AJAX
Ajax (Asynchronous JavaScript and XML) is one of the most promising and
fastest-rising techniques in the web application development area of the past few
years. Collapsing the traditional multi-page application into a single page
increases the responsiveness and interactivity of the user experience. Users do not
have to click-and-wait any more, and the page does not have to be re-rendered
every time an interaction takes place; only small sub-sets of the page need to be
updated [14]. With these new dynamic applications a new term has emerged,
Web 2.0, which is used to mark the changes in web applications in facilitating
communication, information sharing, interoperability, and collaboration. The
term Web 2.0 is used to denote Ajax applications but is also, and more
commonly, used to denote user-generated content.
The addition of the responsiveness brought by the AJAX technique makes it
possible to operate applications on a web server and inside a browser as if they
are desktop applications. Currently the web application market is becoming
increasingly dominant and there are operating systems designed around them
such as the Chrome OS from Google and the WebOS from Palm. This shows the
importance of web applications as a replacement of ordinary applications.
AJAX is a clever combination of using the client-side JavaScript engine [11] to
update small parts of the Document Object Model (DOM) with information
retrieved by asynchronous server communication. By using AJAX technology
developers can create applications in which the page does not have to be
re-rendered again every time an interaction has taken place; only small sub-sets of
the page need to get updated. Therefore the users experience a very fast
responsive application inside the web-browser. The application is available
everywhere the user connects to the internet, and is accessible with every browser.
This eliminates the main disadvantages of having to install a full blown desktop
application on a computer with a certain amount of computational capacity and
the troubles of sharing files with people or other locations. This makes the use of
cloud computing interesting. Cloud computing is the term used to describe the
trend in the computing world in moving away from desktop applications to
on-line services. Although web applications are not a new phenomenon, the use
of AJAX techniques is. These new techniques also require a good quality of
service, of which testing is an important aspect.
The new AJAX technology does not have the property that a unique URL
represents a unique state in an application. Due to the lack of an externally
reachable unique state, i.e., a state reachable by URL, crawlers are not able to
access the full content of an AJAX application without the use of a
pre-programmed JavaScript engine [7]. This problem of not having a state
reachable by URL occurs both when crawling and when testing an AJAX
application [9].
The first major work in crawling AJAX was done by Duda et al. [10], who
suggested a way to crawl the dynamic comments pages on YouTube. They
modelled an AJAX website as a state machine and developed the first AJAX
crawling algorithm, which this project uses as a base. They also indexed the
YouTube dynamic comments pages and made them searchable.
To circumvent this problem, Mesbah et al. [15] proposed to crawl AJAX
applications by inferring user interface state changes. Their technique centres on
a state machine which stores the actions a user executes on a web page inside a
real browser, starting from the root (the index state) and following traces down to
the final state of a certain path. These states are discovered by searching the
current DOM tree for elements on which events can be fired, for example
onClick, onMouseOver or onMouseOut; firing the events on all the candidate
elements may result in new states. The result of an event, the DOM tree, is
compared with the DOM tree from before the execution. If the DOM tree has
changed, a new state of the application is added to the state machine, linked to its
predecessor state. The edge between the two states represents the element and
event combination that produces the new state from the previous one. By storing
the combination of an element and an event, the crawler is able to repeat the flow
of actions that results in a given state. Using this information it is possible to
bring an AJAX application to a given state, and this makes an AJAX application
state aware by adding an external indexing shell.
2.2 Finite State Machine
Finite state machines are used to describe the behaviour of a system by recording
the transitions from one state to another state. This method is mostly used in
verifying software systems or software protocols [5].
The state machine used inside a crawler is not a fully specified state machine, but
an incompletely specified one. A completely specified state machine is one where
every transition results in a unique new state [16]. When examining an Ajax
application it is possible to have multiple transitions resulting in the same state,
e.g., two different links can lead to the same page. This observation leads to the
fact that the state machine used is an incompletely specified state machine. The
minimal version of a completely specified state machine can be found in
polynomial time [13]. For incompletely specified state machines, finding the
minimal state machine is proved to be NP-complete [12]. This means that no
algorithm is known that minimises an incompletely specified state machine in
polynomial time.
2.3 Google’s AJAX Crawling Scheme
Google proposed its own scheme for crawling AJAX [1]. AJAX websites which
conform to this scheme will be crawled by the Google Bot. Google's AJAX
crawling scheme proposes to mark the addresses of all pages that load AJAX
content with specific characters. The idea is to use a special hash fragment (#!) in
the URLs of those pages to indicate that they load AJAX content. When Google
finds a link that points to an AJAX URL, for example
http://example.com/page?query#!state, it automatically interprets (escapes) it as
http://example.com/page?query&_escaped_fragment_=state.
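The mapping between the pretty and the escaped URL can be sketched as a small helper. This is an illustrative sketch of the scheme's URL rewriting, not Google's implementation; percent-encoding of special characters inside the fragment is omitted for brevity.

```java
// Sketch of the URL rewriting in Google's AJAX crawling scheme: a pretty
// URL containing a #! fragment is mapped to the "ugly" URL that the
// crawler actually requests.
public class EscapedFragment {
    public static String toCrawlerUrl(String prettyUrl) {
        int bang = prettyUrl.indexOf("#!");
        if (bang < 0) return prettyUrl;              // not an AJAX URL
        String base = prettyUrl.substring(0, bang);
        String state = prettyUrl.substring(bang + 2);
        // use & if the URL already has a query string, ? otherwise
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + state;
    }
}
```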
FIGURE 2.1: AJAX Crawling Scheme
The programmer is forced to change his/her website architecture in order to
handle the above requests: when Google sends a web request for the escaped
URL, the server must be able to return the same HTML code as the one that is
presented to the user when the AJAX function is called.
After Google sees the AJAX URL and interprets (escapes) it, it grabs the content
of the page and indexes it. Finally, when the indexed page is presented in the
search results, Google shows the original AJAX URL to the user instead of the
escaped one. As a result, the programmer should be able to handle the user's
request and present the appropriate content when the page loads.
FIGURE 2.2: Control Flow
The implementation of Google's AJAX crawling scheme imposes some
constraints on developers. Moreover, a site with only a small amount of AJAX
content may not be worth converting to this scheme just for crawling. This thesis
therefore proposes a way to crawl AJAX sites by constructing the state machine
of the site, which does not impose any constraints on developers. We view every
site as an AJAX site and start crawling by invoking JavaScript events. If there is
any change in the DOM, we record it in the state machine. Once the state machine
of a URL is generated, we index the states to enable searching.
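The crawl step just described, invoke an event, diff the DOM, and record a state only when something changed, can be sketched as follows. The DOM is abstracted to a plain string and the state machine to maps; both are simplifying assumptions made for illustration, not the project's data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy state machine that records a new edge whenever an invoked event
// changes the DOM. DOMs that are textually identical map to the same
// state id, so duplicate states are never added twice.
public class StateRecorder {
    public final Map<String, Integer> stateIds = new HashMap<>(); // DOM -> state id
    public final List<int[]> edges = new ArrayList<>();           // {source, target}
    public final List<String> edgeEvents = new ArrayList<>();     // label per edge

    public int stateOf(String dom) {
        return stateIds.computeIfAbsent(dom, d -> stateIds.size());
    }

    // Returns true if the event produced a state transition.
    public boolean record(String domBefore, String domAfter, String event) {
        if (domBefore.equals(domAfter)) return false; // no DOM change: nothing recorded
        int src = stateOf(domBefore);
        int dst = stateOf(domAfter);
        edges.add(new int[]{src, dst});
        edgeEvents.add(event);
        return true;
    }
}
```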
CHAPTER 3
REQUIREMENTS ANALYSIS
In this chapter, we provide an overview of the requirements and the functionalities
of the system.
3.1 Functional Requirements
The project aims to crawl AJAX content in web applications and make it
searchable. The abstract modular view of the project is given by the following
steps.
1. Identification of Clickables
2. Invocation of events
3. Representing AJAX website as State Machine
4. Indexing the crawled states
5. Searching through the indexed content
6. Reconstruction of a particular state in browser
3.2 Non-Functional Requirements
3.2.1 User Interface
User Interface is provided for searching. The User Interface is developed using
HTML and PHP. The user enters the query in a text box and performs the search.
For the browser driven UI, any standard web browser like Mozilla Firefox or
Internet Explorer is required.
3.2.2 Hardware Considerations
The project requires a computer with the Windows operating system. The system
used in our experiments has a 320 GB hard disk drive, 2 GB RAM and a 1.2 GHz
processor.
3.2.3 Performance Characteristics
As performance is an important parameter of this project, there are a number of
performance considerations. AJAX crawling is compared with traditional
crawling. The factors are:
1. Crawling Time
2. Search Result Quality
3. Clickable Selection Policy
3.2.4 Security Issues
As the project is fully software based, there are no security issues concerning this
project.
3.2.5 Safety Issues
There are no particular safety issues concerning this project.
3.3 Constraints
• Javascript execution
A crawler capable of crawling AJAX requires the capability to execute
Javascript.
• Duplicate State
Multiple events may lead to the same state. Thus we need to avoid adding
duplicate states.
• Infinite State Change
If the same events can be invoked indefinitely on the same state, the
application model can explode.
• Numerous ways of adding event handlers
A JavaScript event can be added to a particular HTML element in many
ways. Thus events assigned in all of these ways should be handled
properly.
3.4 Assumptions
• No Forms
The AJAX Crawler doesn't handle forms, because handling forms is
complex: it requires appropriate test data for submission, it is not possible
to submit a form when a captcha is present, and a form may consist of
different types of input elements such as checkboxes, select boxes and
radio buttons. Deep web crawling by handling forms is itself a separate
research problem.
• Limiting the number of states
The Crawler limits the number of states to prevent state explosion.
• Only Click Event
The Crawler invokes only the click event on HTML elements during
crawling. The elements which can be clicked are termed clickables.
• Only Text based retrieval
The Crawler handles only text based changes. Image based changes like in
Google maps are not considered.
CHAPTER 4
SYSTEM DESIGN
In this chapter, we describe the design issues considered in the software
development process.
4.1 System Architecture
4.1.1 Architecture Diagram
Figure 4.1 depicts the architecture diagram of the entire system. The set of
modules, along with the control flow between them is depicted.
FIGURE 4.1: Architecture Diagram
4.2 Module Descriptions
4.2.1 Identification of Clickables
Identification of clickables is the first phase of an Ajax Crawler. It involves
identifying clickables that would modify the current DOM. The main issue is that
a click event may be added to an HTML element in many ways. Several ways to
add an event listener are shown below:
• <div id="test" onclick="test_function();">
• test.onclick = test_function;
• test.addEventListener('click', test_function, false);
• Using the jQuery JavaScript library,
$('#test').click(function()
{
test_function();
});
All the above four methods perform the same function of adding the onclick
event on the element test.
Thus clickables cannot be identified in a standard way, because numerous
JavaScript libraries exist and each has its own way of defining event handlers. So
the approach of clicking all potentially clickable elements is followed. The list
of clickable HTML elements is shown below.
<a>, <address>, <area>, <b>, <bdo>, <big>, <blockquote>, <body>,
<button>, <caption>, <cite>, <code>, <dd>, <dfn>, <div>, <dl>, <dt>,
<em>, <fieldset>, <form>, <h1> to <h6>, <hr>, <i>, <img>, <input>,
<kbd>, <label>, <legend>, <li>, <map>, <object>, <ol>, <p>, <pre>,
<samp>, <select>, <small>, <span>, <strong>, <sub>, <sup>, <table>,
<tbody>, <td>, <textarea>, <tfoot>, <th>, <thead>, <tr>, <tt>, <ul>,
<var>
Though this approach is time consuming and can cause sub-elements to be
clicked repeatedly, it has the advantage that all possible states are reached.
XPath is used to retrieve the clickable elements. The XPath expression to retrieve
all clickable elements in the document is shown below.
//a | //address | //area | //b | //bdo | //big | //blockquote | //body | //button |
//caption | //cite | //code | //dd | //dfn | //div | //dl | //dt | //em | //fieldset |
//form | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //hr | //i | //img | //input |
//kbd | //label | //legend | //li | //map | //object | //ol | //p | //pre | //samp |
//select | //small | //span | //strong | //sub | //sup | //table | //tbody | //td |
//textarea | //tfoot | //th | //thead | //tr | //tt | //ul | //var
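Rather than hard-coding the query string, the union expression can be generated from the tag list. The helper below is a hypothetical illustration, not code from the project:

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds an XPath union expression ("//a | //div | ...") from a list of
// clickable tag names, so the tag list stays the single source of truth.
public class ClickableXPath {
    public static String unionExpression(List<String> tags) {
        return tags.stream()
                   .map(t -> "//" + t)            // prefix each tag with //
                   .collect(Collectors.joining(" | "));
    }
}
```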
4.2.2 Event Invocation
The HtmlUnit [2] library provides the ability to invoke any event on an HTML
element. The event is invoked on every element retrieved by the XPath
expression. After an event is invoked on an element, we need to wait for
background JavaScript execution to finish.
4.2.3 State Machine representation of AJAX website
An Ajax website can be represented by a State machine. Thus the navigation
model of an Ajax driven website can be visualized as a State Machine. The state
machine can be viewed as a Directed Multigraph. JUNG (Java Universal
Network/Graph Framework) [3] is used for building the state machine.
Nodes represent application states.
Edges represent transitions between states.
In each node of a state machine, the DOM of the corresponding state is stored. In
each edge, we store the event type and XPath expression of the element on which
the event has to be invoked.
The State machine is represented as a Directed Multigraph. The State machine is
stored in graphML format.
<?xml version="1.0" encoding="UTF-8"?>
<graphml
    xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
  <key id="event" for="edge"><desc>Event type</desc></key>
  <key id="target" for="edge"><desc>Event generating element</desc></key>
  <graph edgedefault="directed">
    <node id="0"/><node id="1"/><node id="2"/><node id="3"/>
    <edge source="0" target="1">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[3]/div[1]</data>
    </edge>
    <edge source="0" target="2">
      <data key="event">onclick</data>
      <data key="target">/html/body/div</data>
    </edge>
    <edge source="0" target="3">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[1]/div/div[4]</data>
    </edge>
    <edge source="1" target="2">
      <data key="event">onclick</data>
      <data key="target">/html/body/div/table/tbody/tr[1]/td[1]/div/div[8]/strong</data>
    </edge>
  </graph>
</graphml>
From the above graphML file, the following inferences can be derived:
• Number of nodes (application states) = 4
• The application state changes from source to target on clicking the element
identified by the XPath expression stored in the key named target. Thus, from the
graphML file, the path from one state to another can be obtained.
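These inferences can also be derived programmatically. The sketch below is an illustration using the JDK's built-in DOM parser rather than the project's code; it counts the states and transitions in a trimmed-down graphML document of the same shape as the one above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Parses a graphML string with the JDK DOM parser and counts the
// <node> and <edge> elements, i.e. the application states and transitions.
public class GraphMLInfo {
    public static int[] countNodesAndEdges(String graphml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(graphml.getBytes(StandardCharsets.UTF_8)));
            return new int[]{doc.getElementsByTagName("node").getLength(),
                             doc.getElementsByTagName("edge").getLength()};
        } catch (Exception e) {
            throw new RuntimeException("failed to parse graphML", e);
        }
    }
}
```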
4.2.3.1 Visualizing the State Machine
The state machine of the sample site at http://test.thurls.com/ajax/home.php can be
visualized as shown in Figure 4.2. We can infer that there are 8 states in total.
FIGURE 4.2: Visualizing State Machine
4.2.4 Indexing
While the state machine is being constructed by the AJAX Crawler, the crawled
states have to be indexed simultaneously to enable searching through them. The
project uses the Lucene open-source API for indexing.
4.2.5 Searching
Searching involves taking an input query and returning suitable results by
reading the index files. The Lucene Search API is used to perform the search
operation and return the results.
4.2.6 Reconstruction of a State
Once the user searches for a query and the results are displayed, we need to
navigate directly to a particular state and display it in the browser when the
user views a result. We use the Selenium Web Driver to navigate to a particular
state by finding the path between the initial state and the target state in the
State Machine, and then invoking the events along that path.
4.3 User Interface Design
A user interface is provided for users to perform searches. The user enters the
search query in a text box and performs the search. Snapshots of the user
interface are provided in Appendix A.1.
4.4 UseCase Model
4.4.1 UseCase Diagram
Figure 4.3 denotes the control flow pattern of our algorithm, also showing how
the various software and hardware components are involved in each section of
the flow. The actors represent these components and the use cases represent the
functionality.
FIGURE 4.3: UseCase Diagram
4.5 System Sequence Diagram
4.5.1 Event Invocation
Figure 4.4 shows the Sequence Diagram for Event Invocation. It shows the
sequence of events involved in invoking events and updating the DOM.
FIGURE 4.4: Sequence Diagram - Event Invocation
4.5.2 Searching
Figure 4.5 shows the Sequence Diagram for Searching the crawled states. It shows
the sequence of events involved in searching and reconstruction of result state.
FIGURE 4.5: Sequence Diagram - Searching
4.6 Data Flow Model
Figure 4.6 shows the Level 0 DFD of the system. The Level 1 DFDs are shown in
Figure 4.7 and Figure 4.8.
4.6.1 Data Flow Diagram
FIGURE 4.6: Level 0 Data Flow Diagram
FIGURE 4.7: Level 1 Data Flow Diagram
FIGURE 4.8: Level 1 Data Flow Diagram
CHAPTER 5
SYSTEM DEVELOPMENT
5.1 Implementation
5.1.1 Tools Used
The following tools were employed to implement the project.
Operating System: Windows 7
Languages used for development: Java, PHP, JSP
Libraries: HtmlUnit, JSoup, Lucene, JUNG, Selenium Web Driver
Database: MySQL
IDE: NetBeans
User Interface: Mozilla Firefox
Performance Visualization (Graphs): Google Charts, PowerPoint
TABLE 5.1: Tools Used
5.1.2 Implementation Description
This section provides the detailed implementation of the complete system and
discusses all the algorithms used in the project, with a brief explanation of
each algorithm and the need for it. The AJAX Crawling algorithm forms the
basis of the AJAX Crawler and is described as follows:
5.1.2.1 Ajax Crawling Algorithm
The first step in crawling is to load the initial state and then wait for background
Javascript execution (this handles the case when an AJAX call is made from the
onload event). Then all clickables in the initial state are found and the event is
invoked on each. The clickables are extracted using an XPath expression; the
elements matching the XPath expression are clicked. If there are DOM changes,
the state machine is updated. The crawling is done in a breadth-first manner:
first all states originating from the initial state are found, then each state is
crawled in a similar fashion. The HtmlUnit [2] Java library is used to implement
the AJAX Crawling Algorithm. A WebClient object can be viewed as a browser
instance, which covers the requirement that an AJAX crawler be capable of
executing Javascript.
Algorithm 1 Ajax Crawling algorithm
1: procedure CRAWL(url)
2:   Load url in HtmlUnit WebClient
3:   Wait for background Javascript execution
4:   StateMachine ← Initialize state machine
5:   StateMachine.add(initial_state)
6:   while still some state uncrawled do
7:     current_state ← find some uncrawled state to crawl
8:     webclient ← get_web_client(current_state, StateMachine, url)
9:     while current_state still uncrawled do
10:      crawl_state(webclient, current_state, StateMachine)
11:      webclient ← get_web_client(current_state, StateMachine, url)
12:    end while
13:  end while
14:  save the StateMachine
15: end procedure
Algorithm 2 Ajax Crawling algorithm (Continued)
1: procedure GET_WEB_CLIENT(current_state, StateMachine, url)
2:   webclient ← Load url in HtmlUnit WebClient
3:   Wait for background Javascript execution
4:   path ← Find shortest path from initial state to current_state
5:   while current_state not reached do
6:     xpath ← Get Xpath to traverse to next state in path
7:     Generate the click event on the element retrieved by xpath
8:     Wait for background Javascript execution
9:   end while
10:  return webclient
11: end procedure
One of the problems with the HtmlUnit WebClient is that once a DOM change
occurs and another state is reached, we cannot go back to the source state to
continue the breadth-first crawling process. We need to traverse again from the
initial state to the source state to continue crawling. This is done by the
function GET_WEB_CLIENT: we find the path from the initial state to the current
state to be crawled, then invoke the events along that path to reach the current
state. Another issue with the WebClient is that it cannot be serialized and
stored; thus each time there is a DOM change, we need to traverse from the
initial state to the current state.
The algorithm for crawling an individual state is described by the function
CRAWL_STATE. Special care must be taken to avoid regenerating states that have
already been crawled (i.e., duplicate elimination). This problem is also
encountered in traditional search engines; however, traditional crawling can
usually solve it by comparing the URLs of the given pages, which is a quick
operation. An AJAX crawler cannot count on that, since all AJAX states have the
same URL. Currently, we compare the DOM tree as a whole to check if two states
are the same.
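One way to make the whole-DOM comparison cheap is to keep a set of digests of the DOM strings seen so far; the MD5 hashing below is an illustrative choice for this sketch, not necessarily how the project compares DOM trees internally:

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class StateDeduper {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this DOM string has not been crawled before.
    public boolean isNewState(String domXml) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(domXml.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return seen.add(hex.toString()); // Set.add() is false for duplicates
    }

    public static void main(String[] args) throws Exception {
        StateDeduper d = new StateDeduper();
        System.out.println(d.isNewState("<html><body>A</body></html>")); // true
        System.out.println(d.isNewState("<html><body>B</body></html>")); // true
        System.out.println(d.isNewState("<html><body>A</body></html>")); // false
    }
}
```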
Algorithm 3 Ajax Crawling algorithm (Continued)
1: procedure CRAWL_STATE(webclient, current_state, StateMachine)
2:   elements ← Get all clickable elements using Xpath
3:   while still an element remaining do
4:     xpath ← Get Xpath of the current element
5:     if current element is already clicked in the current state then
6:       continue
7:     end if
8:     if current element is an anchor element then
9:       href ← Get href attribute of the current element
10:      if href is null then
11:        Generate the click event on current element
12:        Wait for background Javascript execution
13:        if dom is changed then
14:          if new state is not already present in StateMachine then
15:            Add the new state to StateMachine
16:            Add a transition from current_state to new state
17:          end if
18:          return
19:        end if
20:      end if
21:    else
22:      Generate the click event on current element
23:      Wait for background Javascript execution
24:      if dom is changed then
25:        if new state is not already present in StateMachine then
26:          Add the new state to StateMachine
27:          Add a transition from current_state to new state
28:        end if
29:        return
30:      end if
31:    end if
32:  end while
33: end procedure
5.1.2.2 State Machine
The algorithm for maintaining the State Machine is shown below.
Algorithm 4 State Machine Representation
1: transition ← Initialize a MultiKey Map
2: crawl_status ← Initialize a Bit Vector
3: graph ← Initialize a Directed Multi Graph
4: states ← Initialize an Array List
5: url ← url currently being crawled
6: procedure ADD_NEW_STATE(dom_xml)
7:   if dom_xml NOT IN states then
8:     state_id = states.size();
9:     states.add(dom_xml);
10:    doc_id = md5(url);
11:    index_state(dom_xml, url, doc_id, state_id);
12:    graph.addVertex(state_id);
13:  end if
14: end procedure
15: procedure ADD_TRANSITION(start_state, end_state, event, target_xpath)
16:  if (start_state, event, target_xpath) NOT IN transition then
17:    transition.put(start_state, event, target_xpath, end_state);
18:    graph.addEdge(start_state, end_state, event, target_xpath);
19:  end if
20: end procedure
21: procedure UPDATE_CRAWL_STATUS(state_id)
22:  crawl_status.set(state_id);
23: end procedure
24: procedure CHECK_CRAWL_STATUS
25:  num_states = states.size() - 1;
26:  for i = 0 → num_states do
27:    if !crawl_status.get(i) then
28:      return false;
29:    end if
30:  end for
31:  return true;
32: end procedure
Algorithm 5 State Machine Representation (Continued)
1: procedure GET_NEXT_STATE_TO_CRAWL
2:   num_states = states.size() - 1;
3:   for i = 0 → num_states do
4:     if !crawl_status.get(i) then
5:       return i;
6:     end if
7:   end for
8:   return -1;
9: end procedure
10:
11: procedure CHECK_STATE_CRAWL_STATUS(state_id)
12:  if crawl_status.get(state_id) then
13:    return true;
14:  end if
15:  return false;
16: end procedure
17:
18: procedure SAVE_STATE_MACHINE
19:  layout ← Initialize a Circle Layout of graph
20:  graphWriter ← Initialize a Graph Writer
21:  output ← Initialize a Print Writer
22:  Add event type custom data for each edge in graph
23:  Add target XPath custom data for each edge in graph
24:  graphWriter.save(graph, output);
25: end procedure
Thus we represent the State Machine as a Directed Multigraph in JUNG (Java
Universal Network/Graph Framework) [3]. Each time a new state is added, we
check whether its DOM is already in the state machine. Likewise, each time a
transition is added, we check that it is not a duplicate. The State Machine is
saved in graphML format.
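The project stores this structure in JUNG; purely to illustrate the bookkeeping of Algorithm 4, the sketch below mirrors ADD_NEW_STATE and ADD_TRANSITION with standard-library collections (a map keyed on the triple (source, event, XPath) stands in for the MultiKey Map):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateMachineSketch {
    private final List<String> states = new ArrayList<>();          // index = state id
    private final Map<String, Integer> transition = new HashMap<>(); // (src,event,xpath) -> dst

    // Mirrors ADD_NEW_STATE: add only if this DOM is unseen; return the state id.
    public int addNewState(String domXml) {
        int idx = states.indexOf(domXml);
        if (idx >= 0) return idx;
        states.add(domXml);
        return states.size() - 1;
    }

    // Mirrors ADD_TRANSITION: keyed on (source, event, xpath) so duplicates are skipped.
    public void addTransition(int src, int dst, String event, String xpath) {
        transition.putIfAbsent(src + "|" + event + "|" + xpath, dst);
    }

    public int numStates() { return states.size(); }
    public int numTransitions() { return transition.size(); }

    public static void main(String[] args) {
        StateMachineSketch sm = new StateMachineSketch();
        int s0 = sm.addNewState("<html>0</html>");
        int s1 = sm.addNewState("<html>1</html>");
        sm.addTransition(s0, s1, "onclick", "/html/body/div");
        sm.addTransition(s0, s1, "onclick", "/html/body/div"); // duplicate, ignored
        System.out.println(sm.numStates() + " " + sm.numTransitions()); // "2 1"
    }
}
```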
5.1.2.3 Indexing
Indexing is the process of extracting text from web pages, tokenizing it and then
creating an index structure (an inverted index) that can be used to quickly find
which pages contain a particular word. The purpose of storing an index is to
optimize speed and performance in finding relevant documents for a search query.
Without an index, the search engine would have to scan every document in the
corpus, which would require considerable time and computing power. The project
uses the open source Lucene API [4] for indexing the crawled states. Another
advantage of Lucene is that it supports incremental indexing, so there is no need
to re-index all documents from the beginning each time; the index files can
simply be updated. Only the text part of the DOM is indexed. In the inverted
file, we store the URL, DOC ID and STATE ID. The algorithm for indexing is
given below.
Algorithm 6 Indexing crawled states using Lucene
1: procedure INDEX_STATE(dom_xml, doc_id, url, state_id)
2:   indexWriter = new IndexWriter(path_to_index_files, new SimpleAnalyzer(), false);
3:   Document doc = new Document();
4:   doc.add(new Field("content", dom_xml, Field.Store.YES, Field.Index.TOKENIZED));
5:   doc.add(new Field("url", url, Field.Store.YES, Field.Index.NO));
6:   doc.add(new Field("docid", doc_id, Field.Store.YES, Field.Index.NO));
7:   doc.add(new Field("state", state_id, Field.Store.YES, Field.Index.NO));
8:   indexWriter.addDocument(doc);
9:   indexWriter.optimize();
10:  indexWriter.close();
11: end procedure
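The inverted-index idea that Lucene implements can be illustrated without Lucene itself. This toy standard-library version (our illustration, not the project's implementation) maps each token to the set of state ids containing it:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // word -> sorted set of state ids whose text contains the word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void indexState(int stateId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(stateId);
        }
    }

    // Lookup is a single map access: no scan over the corpus.
    public Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.indexState(0, "Wonders of the World");
        idx.indexState(1, "The Great Wall of China");
        System.out.println(idx.search("wall")); // prints [1]
        System.out.println(idx.search("the"));  // prints [0, 1]
    }
}
```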
5.1.2.4 Searching
Once the indexing is done, the inverted files are saved. Searching involves
searching through the indexed content; the project uses the Lucene Search API
[4] for this. Since the search is done in PHP, the project needs a way to access
the Lucene index files from PHP. The Zend PHP Framework includes a Lucene
search library, which allows Lucene index files generated in Java to be read in
PHP. The search results are returned as an associative array consisting of the
values of the parameters we specified during indexing. For each result, Lucene
assigns a score based on the frequency of each word in the document; the higher
the score, the more relevant the result is to the search query. The code snippet
for searching in PHP using the Lucene search library in the Zend Framework is
shown below.
Algorithm 7 Searching Lucene indexed files in PHP
1: procedure SEARCH(query)
2:   $index = new Zend_Search_Lucene(path_to_index_files);
3:   $hits = $index->find($query);
4:   foreach ($hits as $hit)
5:   {
6:     echo $hit->score;
7:     echo $hit->docid;
8:     echo $hit->url;
9:     echo $hit->state;
10:  }
11: end procedure
5.1.2.5 Reconstruction of a particular state after crawling
After crawling the states of a particular URL, the states should be indexed so
that they can be searched by the search engine [8]. Thus a state needs to be
reconstructed in order to be displayed in the search results. A web browser can
load only the initial state of a URL, but we need to load subsequent states,
which in a browser occur only after a sequence of Javascript events has been
invoked. The project therefore uses the Selenium Web Driver [6] to load a
particular state in the browser directly. A Web Driver can be viewed as a
browser that can be controlled through code. The project finds the path from the
initial state to the state to be loaded, initially loads the initial state in
the Web Driver, and then invokes the Javascript events along the path in the Web
Driver until the required state is reached. The required state is thus loaded in
the browser to be viewed by the user, who can then continue browsing from that
state as in a normal browser.
Algorithm 8 Reconstruction of a particular state after crawling
1: procedure RECONSTRUCT_STATE(state)
2:   Read the graphML file of the corresponding URL and construct a Directed Multigraph
3:   path ← Find shortest path from initial state to the state to be constructed (Dijkstra's Algorithm)
4:   Load the initial state in a Web Driver like Selenium
5:   while state not reached do
6:     xpath ← Get Xpath expression of the element to be clicked next
7:     Generate the click event on the element retrieved by xpath
8:     Wait for background Javascript execution
9:   end while
10:  The required state is currently loaded in the Web Driver
11: end procedure
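The path-finding step can be sketched with a plain breadth-first search, which coincides with Dijkstra's algorithm when every transition has unit cost (the adjacency map below is a hypothetical state graph for illustration, not one of the crawled sites). The events stored on each edge of the returned path would then be replayed in the Web Driver:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class StatePath {
    // BFS shortest path over state ids; unit edge costs make this equivalent to Dijkstra.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> graph, int start, int goal) {
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(start, start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int cur = queue.poll();
            if (cur == goal) break;
            for (int next : graph.getOrDefault(cur, Collections.emptyList())) {
                if (!parent.containsKey(next)) { parent.put(next, cur); queue.add(next); }
            }
        }
        if (!parent.containsKey(goal)) return Collections.emptyList(); // unreachable
        LinkedList<Integer> path = new LinkedList<>();
        for (int at = goal; ; at = parent.get(at)) {
            path.addFirst(at);
            if (at == start) break;
        }
        return path;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(0, Arrays.asList(1, 2, 3)); // transitions out of the initial state
        g.put(1, Arrays.asList(2));
        System.out.println(shortestPath(g, 0, 2)); // prints [0, 2]
    }
}
```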
CHAPTER 6
RESULTS AND DISCUSSION
In this chapter, we report the significant results obtained in our experiments.
6.1 Results
Table 6.1 lists the sample test cases used for evaluating the performance of the
AJAX Crawler.
Case  AJAX Site
C1    http://test.thurls.com/ajax/home.php
C2    http://spci.st.ewi.tudelft.nl/demo/aowe/
C3    http://www.itrix.co.in/
C4    http://demo.tutorialzine.com/2009/09/simple-ajax-website-jquery/demo.html
C5    http://test.thurls.com/ajax/home1.php
TABLE 6.1: Test Cases
Some of the sample clickables in each of the test cases are shown below.
• Sample Clickables in C1
<div onclick='load_content(1)'>Great Wall of China</div>
<div onclick='load_content(2)'>Petra</div>
• Sample Clickables in C2
<b>Home</b>
<b>Workshop Organizers</b>
<b>Program Committee</b>
<b>Call for Papers</b>
• Sample Clickables in C3
<p id="hel">About Us</p>
<p id="hel">Sponsors</p>
• Sample Clickables in C4
<a href="#page1">Page 1</a>
<a href="#page">Page 2</a>
• Sample Clickables in C5
<div onclick='load_content(24)'>Test 16</div>
<div onclick='load_content(25)'>Test 17</div>
Table 6.2 contains the experimental results obtained for the sample test cases.
Probable Clickables are those elements in the DOM which can be clicked.
Detected Clickables are those that actually trigger AJAX requests.
Case  Maximum DOM String Size (bytes)  Probable Clickables  Detected Clickables  Number of States
C1    5829    24    8    8
C2    6378    61    11   11
C3    17422   167   27   27
C4    2159    23    5    5
C5    8233    58    26   26
TABLE 6.2: Experimental Results
6.2 Performance Evaluation
The performance of an AJAX crawler depends on
1. Crawling Time
2. Clickable Selection Policy
3. Search Result Quality
6.2.1 Crawling Time
In traditional crawling,
Crawling time of a page = network latency + server response time
In AJAX crawling,
Crawling time of a state = network latency + server response time + AJAX request time
The crawl time of a page in traditional crawling is on the order of milliseconds,
whereas in AJAX crawling the crawl time of a state is on the order of minutes.
This is due to the time spent in executing Javascript. Table 6.3 contains the
crawling time for each test case; the crawl time per state is also shown.
Case  Number of States  Total Crawling Time (in mins)  Crawling Time per State (in mins)
C1    8     11.44    1.43
C2    11    216.45   19.68
C3    27    607.5    22.5
C4    5     34.9     6.98
C5    26    103.13   3.97
TABLE 6.3: Crawling Time
6.2.1.1 Number of States Vs Crawling Time
Figure 6.1 shows the plot between number of states and Crawling time (in
minutes).
FIGURE 6.1: Number of States Vs Crawling Time(in minutes)
Inferences from the graph
• The variation between Crawling time and number of states is not uniform.
• Crawling time doesn’t depend directly on the number of states.
• For the same website, crawling time is not constant when measured at
different instances.
• Network latency and server response time are not constant.
• Crawling time doesn't depend only on the number of states. It is a weighted
measure of network latency, server response time, AJAX request time and also
the number of states.
6.2.2 Clickable Selection Policy
Clickable selection refers to the process of identifying clickables for invoking
events. An ideal clickable selection policy should identify clickables in an
optimal way such that most of them trigger AJAX requests or cause a change in
DOM. A better clickable selection policy can reduce the Javascript wait time ,
thus decreasing crawling time.We define a ratio called Clickable Selection Ratio
, which can defined as a ratio of Number of AJAX Requests to that of number of
probable Clickables.Table 6.4 contains the Clickable Selection Ratio for the
sample test cases.
Clickable Selection ratio = No. of AJAX Requests / No. of Probable Clickables
Case  Number of Clickables  Number of AJAX Requests  Clickable Selection Ratio
C1    24    8     0.33
C2    61    11    0.18
C3    167   27    0.16
C4    23    5     0.21
C5    58    26    0.45
TABLE 6.4: Clickable Selection Policy
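The ratio column of Table 6.4 can be reproduced (up to rounding in the last digit) from the clickable and request counts; a small sketch:

```java
import java.util.Locale;

public class SelectionRatio {
    public static void main(String[] args) {
        // Counts taken from Tables 6.2 and 6.4 for cases C1..C5.
        int[] probable = {24, 61, 167, 23, 58}; // probable clickables
        int[] requests = {8, 11, 27, 5, 26};    // AJAX requests
        for (int i = 0; i < probable.length; i++) {
            double ratio = (double) requests[i] / probable[i];
            // Locale.US keeps the decimal point regardless of the system locale.
            System.out.printf(Locale.US, "C%d %.2f%n", i + 1, ratio);
        }
    }
}
```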
6.2.2.1 Number of AJAX Requests Vs Probable Clickables
Figure 6.2 shows the plot between Number of AJAX Requests and Probable
Clickables.
FIGURE 6.2: Number of AJAX Requests Vs Probable Clickables
Inferences from the graph
• The variation between Probable Clickables and Number of AJAX Requests
is not uniform.
• The number of clickables depends on the structure of the web page.
• The number of AJAX requests cannot be directly related to the number of
probable clickables.
6.2.2.2 Probable Clickables Vs Detected Clickables
Figure 6.3 shows the plot between Probable Clickables and Detected Clickables
FIGURE 6.3: Probable Clickables Vs Detected Clickables
Inferences from the graph
• The variation between Probable Clickables and Detected Clickables is not
uniform.
• The number of Detected Clickables depends on the structure of the web page
rather than on the number of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of Probable Clickables.
• The number of Detected Clickables cannot be directly related to the number
of AJAX Requests.
• Number of AJAX Requests <= Number of Detected Clickables
6.2.3 Clickable Selection Ratio Vs Crawling Time
Figure 6.4 shows the plot of Clickable Selection Ratio versus Crawling Time per
state (in minutes).
FIGURE 6.4: Clickable Selection Ratio Vs Crawling time per state(in minutes)
Inferences from the graph
• The variation between Clickable Selection Ratio and Crawling Time is
uniform.
• Crawling time is inversely proportional to Clickable Selection Ratio
6.3 Search Result Quality
Search result quality is improved by indexing hidden AJAX content. This content
is not visible to traditional crawlers and hence is not indexed by them. With
the AJAX Crawler, therefore, the quality of search results improves compared
with traditional crawlers. We will now see how the Google Bot and the AJAX
Crawler fetch http://test.thurls.com/ajax/home.php (C1). Screenshots are
provided in Appendix A.2.
This is how the Google Bot fetches http://test.thurls.com/ajax/home.php (C1):
HTTP/1.1 200 OK
Date: Fri, 20 Apr 2012 19:08:29 GMT
Content-Type: text/html
Connection: close
Server: Nginx / Varnish
X-Powered-By: PHP/5.2.17
Content-Length: 1180
<html>
<head>
<title>Ajax Crawling </title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js">
</script><script type=”text/javascript”>
function load_content(key)
{
$.get('getcontent.php', {'key' : key}, function(data)
{
document.getElementById(’result’).innerHTML=data.result;
},’json’);
}
</script></head>
<body>
<center><h2>Wonders of the World</h2></center>
<br><br><table border="0"><tr><td width="250px" style="position:fixed;">
<ul><li><div onclick='load_content(1)'>Great Wall of China</div></li>
<li><div onclick='load_content(2)'>Petra</div></li>
<li><div onclick='load_content(3)'>Christ the Redeemer</div></li>
<li><div onclick='load_content(4)'>Machu Picchu</div></li>
<li><div onclick='load_content(5)'>Chichen Itza</div></li>
<li><div onclick='load_content(6)'>Colosseum</div></li>
<li><div onclick='load_content(7)'>Taj Mahal</div></li>
<li><div onclick='load_content(8)'>Great Pyramid of Giza</div></li>
</ul></td>
<td style="padding-left:350px;">
<div id="result"><script>load_content(1);</script></div>
</td></tr></table></body></html>
Here the content inside the division called 'result' is not loaded, and the
script code is fetched as-is by the Google Bot without being executed.
This is how the AJAX Crawler fetches http://test.thurls.com/ajax/home.php (C1):
<html>
<head>
<title>Ajax Crawling </title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js">
</script><script type=”text/javascript”>
function load_content(key)
{
$.get('getcontent.php', {'key' : key}, function(data)
{
document.getElementById(’result’).innerHTML=data.result;
},’json’);
}
</script></head>
<body>
<center><h2>Wonders of the World</h2></center>
<br><br><table border="0"><tr><td width="250px" style="position:fixed;">
<ul><li><div onclick='load_content(1)'>Great Wall of China</div></li>
<li><div onclick='load_content(2)'>Petra</div></li>
<li><div onclick='load_content(3)'>Christ the Redeemer</div></li>
<li><div onclick='load_content(4)'>Machu Picchu</div></li>
<li><div onclick='load_content(5)'>Chichen Itza</div></li>
<li><div onclick='load_content(6)'>Colosseum</div></li>
<li><div onclick='load_content(7)'>Taj Mahal</div></li>
<li><div onclick='load_content(8)'>Great Pyramid of Giza</div></li>
</ul></td>
<td style="padding-left:350px;">
<div id="result">
The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and
other materials, generally built along an east to west line across the historical northern borders of
China in part to protect the Chinese Empire or its prototypical states against intrusions by various
nomadic groups or military incursions by various warlike peoples or forces. Several walls had
already been begun to be built beginning around the 7th century BC;these, later joined together and
made bigger, stronger, and unified are now collectively referred to as the Great Wall. Especially
famous is the wall built between 220 to 206 BC by the first Emperor of China, Qin Shi Huang. Little
of that wall remains. Since then, the Great Wall has on and off been rebuilt, maintained, enhanced;
the majority of the existing wall was reconstructed during the Ming Dynasty. <br/><br/>
Other purposes of the Great Wall have included allowing for border control practices, such as check
points allowing for the various imperial governments of China to tariff goods transported along the
Silk Road, to regulate or encourage trade (for example trade between horses and silk products), as
well as generally to control immigration and emigration. Furthermore, the defensive characteristics
of the Great Wall were enhanced by the construction of watch towers, troop barracks, garrison
stations, signaling capabilities through the means smoke or fire, and the fact that the path of the
Great Wall also served as a transportation corridor. <br/><br/>
The Great Wall stretches from Shanhaiguan in the east, to Lop Lake in the west, along an arc that
roughly delineates the southern edge of Inner Mongolia. The most comprehensive archaeological
survey, using advanced technologies, has concluded that all the walls measure 8,851.8 km
(5,500.3 mi).This is made up of 6,259.6 km (3,889.5 mi) sections of actual wall, 359.7 km (223.5
mi) of trenches and 2,232.5 km (1,387.2 mi) of natural defensive barriers such as hills and rivers.
</div></td></tr></table></body></html>
Here the content inside the division called 'result', loaded through AJAX, is
fetched by waiting for the Javascript execution to complete. Thus we see that
the DOM crawled by the AJAX Crawler has the initial AJAX content loaded into it,
whereas the Google Bot does not see the initial AJAX content. Thus the quality
of results is improved by crawling AJAX content.
6.4 Observations
The following important observations have been made based on the analysis of
the results.
• Crawling time of an AJAX page is on the order of minutes.
• Crawling time doesn't depend only on the number of states. It is a weighted
measure of network latency, server response time, AJAX request time and also
the number of states.
• Crawling time is inversely proportional to clickable selection ratio
• Number of clickables in a DOM depends on the DOM structure rather than
DOM size.
• Quality of search results is improved by AJAX Crawler.
CHAPTER 7
CONCLUSIONS
7.1 Contributions
In this chapter, we summarize the significant contributions of our work. These are:
1. Implementing an AJAX Crawler
2. Making the crawled AJAX states searchable
3. Analyzing the performance of AJAX Crawler
The results indicate that crawling AJAX content improves the quality of search
results at the cost of a large crawling time. Further optimization is therefore
needed to make the AJAX crawling time comparable with traditional crawling.
7.2 Future Work
The following are some of the possible extensions that can be done to the system.
• Multi Threading
This can be done by having a separate HtmlUnit WebClient in each thread. To
make sure that multiple threads don't crawl the same path, the state machine
needs to be synchronized between the threads.
• Using the DOM change statistics between states
Consider a transition from state 1 to state 2. There may be many static
elements common to states 1 and 2, so the effect of invoking events on these
static elements is the same in both states. Thus, only the elements which get
added or changed in the DOM when the state changes from 1 to 2 need to be
found, and events need to be invoked only on those elements in state 2. The
remaining transitions are the same as in state 1.
• Avoid invoking events on nested elements
When the event has already been invoked on an element, the event need not be
invoked again on the enclosing parent element. For example, consider the
following HTML element,
<div><b>test</b></div>
Here, clicking the element <b>test</b> has the same effect as clicking
<div><b>test</b></div>. Thus the event need not be invoked on the enclosing
parent element. This would reduce the number of duplicate transitions in the
state machine.
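The proposed check could be implemented by comparing XPaths: an element can be skipped if the XPath of an already-clicked element extends its own. A small sketch (the helper name is ours, purely illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NestedClickFilter {
    // Skip an element when some already-clicked element lies strictly inside it,
    // i.e. the clicked element's XPath has this element's XPath as a proper prefix.
    static boolean isEnclosingParent(String xpath, Set<String> clicked) {
        for (String c : clicked) {
            if (!c.equals(xpath) && c.startsWith(xpath + "/")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> clicked = new HashSet<>(Arrays.asList("/html/body/div/b"));
        // The <div> encloses the already-clicked <b>, so it can be skipped.
        System.out.println(isEnclosingParent("/html/body/div", clicked)); // true
        // An unrelated element is not skipped.
        System.out.println(isEnclosingParent("/html/body/p", clicked));   // false
    }
}
```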
APPENDIX A
SNAPSHOTS
A.1 Search Interface
Entering Search query
FIGURE A.1: Interface I
Search results being displayed
FIGURE A.2: Interface II
A.2 Google Bot and AJAX Crawler
This is how Google Bot fetches http://test.thurls.com/ajax/home.php
FIGURE A.3: Fetched By Google Bot
This is how Google Bot fetches http://test.thurls.com/ajax/home.php
FIGURE A.4: Fetched By Google Bot
This is how AJAX Crawler fetches http://test.thurls.com/ajax/home.php
FIGURE A.5: Fetched By AJAX Crawler
This is how AJAX Crawler fetches http://test.thurls.com/ajax/home.php
FIGURE A.6: Fetched By AJAX Crawler
APPENDIX B
DOM
B.1 DOM - Document Object Model
The Document Object Model (DOM) is an application programming interface
(API) for valid HTML and well-formed XML documents. It defines the logical
structure of documents and the way a document is accessed and manipulated. In
the DOM specification, the term "document" is used in the broad sense -
increasingly, XML is being used as a way of representing many different kinds of
information that may be stored in diverse systems, and much of this would
traditionally be seen as data rather than as documents. Nevertheless, XML
presents this data as documents, and the DOM may be used to manage this data.
With the Document Object Model, programmers can build documents, navigate
their structure, and add, modify, or delete elements and content. Anything found in
an HTML or XML document can be accessed, changed, deleted, or added using the
Document Object Model, with a few exceptions - in particular, the DOM interfaces
for the XML internal and external subsets have not yet been specified.
B.2 DOM Tree Representation
Every valid HTML/XML document can be represented by a DOM Tree. Consider
the following HTML code snippet,
<table>
<tbody>
<tr>
<td>Shady Grove</td>
<td>Aeolian</td>
</tr>
<tr>
<td>Over the River, Charlie</td>
<td>Dorian</td>
</tr>
</tbody>
</table>
The DOM tree for the above code snippet is shown below.
FIGURE B.1: DOM Tree
References
[1] “Google’s AJAX Crawling Scheme,” https://developers.google.com/webmasters/ajax-
crawling/, 2010.
[2] “HtmlUnit,” http://htmlunit.sourceforge.net/, 2011.
[3] “JUNG,” http://jung.sourceforge.net/, 2010.
[4] “Apache Lucene,” http://lucene.apache.org/core/, 2011.
[5] “Book review: Design and validation of computer protocols by gerard
j. holzmann (prentice hall, 1991),” SIGCOMM Comput. Commun. Rev.,
vol. 21, no. 2, pp. 14–, Apr. 1991, reviewer-Fredlund, Lars-Ake. [Online].
Available: http://doi.acm.org/10.1145/122419.1024051
[6] “Selenium Web Driver,” http://seleniumhq.org/docs/03_webdriver.html,
2011.
[7] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “Ubicrawler: a scalable
fully distributed web crawler,” Softw. Pract. Exper., vol. 34, no. 8, pp.
711–726, Jul. 2004. [Online]. Available: http://dx.doi.org/10.1002/spe.587
[8] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search
engine,” Comput. Netw. ISDN Syst., vol. 30, no. 1-7, pp. 107–117, Apr. 1998.
[Online]. Available: http://dx.doi.org/10.1016/S0169-7552(98)00110-X
[9] A. Deursen and A. Mesbah, “Research issues in the automated testing
of ajax applications,” in Proceedings of the 36th Conference on Current
Trends in Theory and Practice of Computer Science, ser. SOFSEM ’10.
Berlin, Heidelberg: Springer-Verlag, 2010, pp. 16–28. [Online]. Available:
http://dx.doi.org/10.1007/978-3-642-11266-9_2
[10] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl:
Making ajax applications searchable,” in Proceedings of the 2009 IEEE
International Conference on Data Engineering, ser. ICDE ’09. Washington,
DC, USA: IEEE Computer Society, 2009, pp. 78–89. [Online]. Available:
http://dx.doi.org/10.1109/ICDE.2009.90
[11] J. Eichorn, Understanding AJAX: Using JavaScript to Create Rich Internet
Applications. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2006.
[12] S. Goren and F. J. Ferguson, “On state reduction of incompletely specified
finite state machines,” Comput. Electr. Eng., vol. 33, no. 1, pp. 58–69, Jan.
2007. [Online]. Available: http://dx.doi.org/10.1016/j.compeleceng.2006.06.001
[13] J. E. Hopcroft, “An n log n algorithm for minimizing states in a finite
automaton,” Stanford, CA, USA, Tech. Rep., 1971.
[14] A. Mesbah and A. van Deursen, “Invariant-based automatic testing
of ajax user interfaces,” in Proceedings of the 31st International
Conference on Software Engineering, ser. ICSE ’09. Washington, DC,
USA: IEEE Computer Society, 2009, pp. 210–220. [Online]. Available:
http://dx.doi.org/10.1109/ICSE.2009.5070522
[15] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web
applications through dynamic analysis of user interface state changes,” ACM
Trans. Web, vol. 6, no. 1, pp. 3:1–3:30, Mar. 2012. [Online]. Available:
http://doi.acm.org/10.1145/2109205.2109208
[16] T. Villa, T. Kam, R. K. Brayton, and A. Sangiovanni-Vincentelli, Synthesis
of finite state machines: logic optimization. Norwell, MA, USA: Kluwer
Academic Publishers, 1997.