Crawling Rich Internet Applications: The State of the Art Software Security Research Group (SSRG)...

Crawling Rich Internet Applications:

The State of the Art

Software Security Research Group (SSRG), University of Ottawa

In collaboration with IBM

Suryakant Choudhary, M. Emre Dincturk, Seyed M. Mirtaheri, Ali Moosavi, Gregor von Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut

CASCON 2012, November 5, 2012

Overview

• Introduction
▫ The evolution of Web applications and crawling
• Crawling RIAs
▫ Challenges and Common Assumptions
• Research on Crawling RIAs
▫ Crawling for Indexing
▫ Crawling for Testing
▫ Research on Crawling Strategies: Greedy Strategy; Model-based Crawling (Hypercube Strategy, Probability Strategy, Menu Strategy)
• Experimental Results
• Future of RIA Crawling

Introduction - Traditional Web Applications

▫ HTML pages identified by a URL
▫ Synchronous communication

[Figure: Traditional Synchronous Communication Pattern. Each User Interaction sends a Request; the user waits (User Waiting) during Server Processing; the Response causes a Full Page Refresh before the next User Interaction.]

Introduction - Rich Internet Applications (1)

• Client-side code (JavaScript) execution
• The page can be modified by the client-side code.
• Document Object Model (DOM): A tree data structure that represents the page in the client.
• Events: Occurrences that cause code execution (mouse click, timeout, etc.)

Introduction - Rich Internet Applications (2)

• Asynchronous Communication (AJAX)

[Figure: Asynchronous Communication Pattern (in RIAs). During User Interaction, Requests are sent to the server; after Server Processing, each Response results in a Partial Page Update.]

Introduction - Crawling (1)

• Crawling: Exploring an application automatically
• Motivations
▫ Content indexing (by search engines)
▫ Testing (for security, accessibility, functionality)
• Objectives
▫ Find all (or ‘important’) pages
▫ Find the connections between the pages (obtaining a complete model of the application, for example for page ranking)

Introduction - Crawling (2)

• Crawling extracts “a model” of the application
▫ States are the “distinct” pages
▫ Transitions are the connections between the states
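The model described on this slide can be pictured as a directed graph. The following is only a minimal sketch of such a structure; the class name CrawlModel, its methods, and the example state and event names are hypothetical and do not come from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrawlModel:
    """Directed graph: states are distinct pages, transitions connect them."""

    initial: str
    # transitions[state][event] -> state reached by executing the event there
    transitions: dict[str, dict[str, str]] = field(default_factory=dict)

    def add_state(self, state: str) -> bool:
        """Register a state; return True if it was not known before."""
        if state in self.transitions:
            return False
        self.transitions[state] = {}
        return True

    def add_transition(self, source: str, event: str, target: str) -> None:
        """Record that executing `event` in `source` leads to `target`."""
        self.add_state(source)
        self.add_state(target)
        self.transitions[source][event] = target

    @property
    def states(self) -> set[str]:
        return set(self.transitions)


# Hypothetical two-page application.
model = CrawlModel(initial="home")
model.add_transition("home", "click_news", "news")
model.add_transition("news", "click_home", "home")
```

A nested dictionary keeps the lookup "which state does this event lead to from here" cheap, which is the question a crawler asks most often.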

Crawling RIAs

• RIAs have events that change the page without changing the URL.
▫ One URL → many states
• The aim is to find all the states reachable from a given URL by executing events.
▫ The Initial State: The state reached by loading the URL
▫ Reset: Loading the URL to go back to the initial state.
• An event’s behaviour may depend on the state in which it is executed. We have to execute, in each state, all the events enabled in that state.
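To make the roles of the initial state, resets, and enabled events concrete, here is a deliberately naive sketch. It assumes a simulated application given as a dictionary (APP, INITIAL, and the state and event names are all hypothetical), and it resets before every single event, which a real crawler would try to avoid.

```python
from collections import deque

# Hypothetical RIA behind one URL: state -> {enabled event: resulting state}.
APP = {
    "s0": {"a": "s1", "b": "s2"},
    "s1": {"b": "s3"},
    "s2": {"a": "s3"},
    "s3": {},
}
INITIAL = "s0"


def crawl(app, initial):
    """Execute every enabled event in every reachable state, breadth-first."""
    paths = {initial: []}  # event sequence reaching each known state from the initial state
    model = {}  # discovered transitions: (state, event) -> resulting state
    queue = deque([initial])
    events_executed = 0
    resets = 0
    while queue:
        state = queue.popleft()
        for event in app[state]:
            # Reset (reload the URL), then replay the known path to `state`.
            resets += 1
            current = initial
            for step in paths[state]:
                current = app[current][step]
                events_executed += 1
            # Now execute the event being explored.
            target = app[current][event]
            events_executed += 1
            model[(state, event)] = target
            if target not in paths:
                paths[target] = paths[state] + [event]
                queue.append(target)
    return model, events_executed, resets


model, events_executed, resets = crawl(APP, INITIAL)
```

Even on this four-state example, part of the work is spent replaying paths after a reset rather than exploring new events.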

Crawling RIAs – Challenges and Assumptions

• State Identification
▫ A state needs to be identified by its DOM.
▫ A DOM Equivalence Relation is needed.
• Event Identification
• Assumption: No Server-side States
• Assumption: Finite Representative User Inputs
• Intermediate States
• Efficiency of Crawling Strategies
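One possible DOM equivalence relation is sketched below. It is an assumption chosen for illustration, not the relation used by the authors: two DOMs are treated as the same state when their tag structure and attributes match, ignoring all text content. The function name state_id is hypothetical; the parsing relies on Python's standard html.parser module.

```python
import hashlib
from html.parser import HTMLParser


class _Skeleton(HTMLParser):
    """Collects tag names and sorted attributes, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(sorted(f"{name}={value}" for name, value in attrs))
        self.parts.append(f"<{tag} {attr_text}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def state_id(dom_html: str) -> str:
    """Return an identifier shared by all DOMs with the same tag skeleton."""
    parser = _Skeleton()
    parser.feed(dom_html)
    parser.close()
    skeleton = "".join(parser.parts)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


same_a = state_id('<div id="a"><p>Hello</p></div>')
same_b = state_id('<div id="a">\n  <p>Bye</p>\n</div>')
different = state_id('<div id="a"><p>Hello</p><p>More</p></div>')
```

How strict the relation is matters: too strict and the crawler sees an explosion of near-identical states, too loose and it merges pages that behave differently.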

Crawling RIAs for Indexing

• Duda et al. [1][2][3]
▫ Uses a Breadth-First crawling strategy
▫ Introduced AjaxRank [3]: Adaptation of PageRank to RIAs to sort the results of a search query
• Mesbah et al. [4][5] introduced “Crawljax”
▫ Uses a strategy similar to Depth-First
▫ Outputs static HTML snapshots of the discovered DOMs, which can be indexed by search engines

Crawling RIAs for Testing

• Crawljax is also used for testing RIAs
▫ Regression Testing of Ajax Applications [6]
▫ Security Testing of web widget interactions [7]
▫ Invariant-based Testing of Ajax Applications [8]
• Marchetto et al. [9]: testing to reveal faulty behaviour
▫ Combines analysis of user traces, static analysis of the code and human validation to produce a model of an application
• Amalfitano et al. [10][11][12][13]
▫ Modeling and testing based on user execution traces obtained from user sessions and/or automated trace generation using a Depth-First strategy

Crawling Strategies for RIAs

• Crawling Strategy: an algorithm that decides which event should be explored next.
▫ An efficient strategy discovers the states as soon as possible (our definition)
▫ Time to find all the states ~ the number of events executed and the resets used during crawling
• The standard strategies used in the research mentioned above, Breadth-First and Depth-First, are not efficient for RIAs.
▫ No predictions of event outcomes.
▫ A strict order of state exploration: leads to an increased number of event executions and resets (used to transfer from the current state to the state being explored).

Research on Crawling Strategies for RIAs

• Greedy Strategy [14]
▫ A simple strategy that gives priority to the event closest to the current state
▫ Tries to minimize the transfer sequences, but still makes no prediction of event outcomes
• Model-Based Crawling Strategies
▫ Hypercube Strategy [15]
▫ Probability Strategy [16]
▫ Menu Strategy [17]
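The greedy choice can be sketched as a breadth-first search over the transitions discovered so far. This is an illustrative reading of "the event closest to the current state", not the implementation from [14]; the function name nearest_unexecuted and the data layout are hypothetical.

```python
from collections import deque


def nearest_unexecuted(known, unexecuted, current):
    """Find the closest state that still has an unexecuted event.

    known:      state -> {event: target} for transitions already explored
    unexecuted: state -> set of events not yet executed in that state
    Returns (transfer_sequence, state, event), or None if no such state is
    reachable through known transitions (a reset would then be needed).
    """
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            # Ties between events of the same state are broken alphabetically.
            return path, state, sorted(unexecuted[state])[0]
        for event, target in known.get(state, {}).items():
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [event]))
    return None


# Hypothetical partial model: everything in s0 and s1 has been executed.
known = {"s0": {"a": "s1"}, "s1": {"b": "s2"}}
unexecuted = {"s0": set(), "s1": set(), "s2": {"c", "d"}}
choice = nearest_unexecuted(known, unexecuted, "s0")
```

The returned transfer sequence is the list of already known events the crawler replays to reach the chosen state; the strategy keeps it short but has no opinion on which event is more likely to find a new state.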

Model-Based Crawling

•Meta-model: assumed structure of the application

•Crawling strategy is optimized for the case that the application follows these assumptions

•Adaptation of the strategy: the crawling strategy must be able to deal with applications that do not satisfy these assumptions


The Hypercube Strategy

• The Hypercube Meta-Model assumes that the application has a hypercube model.
• The Hypercube strategy is an “optimal” strategy for this meta-model.

[Figure: Example of a 4-Dimensional Hypercube]
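As a rough illustration of what a hypercube-shaped model looks like, the sketch below builds one under an assumption the slide does not spell out: the initial state has n independent events, and each state is fully determined by the set of events executed so far. The function name hypercube_model and the event names are hypothetical.

```python
from itertools import combinations


def hypercube_model(events):
    """Build a hypercube model: each state is the set of events executed so far."""
    events = sorted(events)
    states = [
        frozenset(subset)
        for size in range(len(events) + 1)
        for subset in combinations(events, size)
    ]
    # Executing a not-yet-executed event adds it to the set, in any order.
    transitions = {
        (state, event): state | {event}
        for state in states
        for event in events
        if event not in state
    }
    return states, transitions


# 4-dimensional example, matching the dimension shown on the slide.
states, transitions = hypercube_model(["e1", "e2", "e3", "e4"])
```

Under this assumption a hypercube of dimension n has 2^n states and n * 2^(n-1) transitions, so the 4-dimensional case has 16 states and 32 transitions.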

The Probability Strategy

• Prioritizes events based on their probability of discovering a new state
▫ N(e) = number of executions of event e
▫ S(e) = number of new states found by e
▫ P(e) is estimated with a Bayesian formula; with pS = 1 and pN = 2, the initial probability is 0.5
• Aim: Choose an event e to explore such that
▫ P(e) is high
▫ The transfer sequence from the current state to a state where e is unexecuted is short
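The Bayesian formula itself did not survive in this transcript. One form that is consistent with the stated values (pS = 1, pN = 2, initial probability 0.5) is P(e) = (S(e) + pS) / (N(e) + pN); the sketch below assumes that form, and the function name event_probability is hypothetical.

```python
def event_probability(n_executions, n_new_states, p_s=1.0, p_n=2.0):
    """Estimate the probability that an event discovers a new state.

    Assumed form: (S(e) + pS) / (N(e) + pN). With no observations this
    gives pS / pN, i.e. 0.5 for the default values.
    """
    return (n_new_states + p_s) / (n_executions + p_n)


initial = event_probability(0, 0)      # never executed
productive = event_probability(4, 3)   # found a new state 3 times out of 4
barren = event_probability(4, 0)       # never found a new state in 4 tries
```

With this estimate an event that keeps finding new states is ranked above a fresh event, and a fresh event above one that has repeatedly led to known states.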

The Menu Strategy

• The Menu Meta-Model defines three categories of events:
▫ 1. Menu Event: Leads to the same state regardless of the state in which it is executed. (e1 and e2)
▫ 2. Self-Loop Event: Does not cause any state change. (e3)
▫ 3. Other Event: An event that is neither of the above.

[Figure: simple example of a model with menu events e1 and e2 and a self-loop event e3]
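The three categories can be told apart from the executions of an event observed so far. The sketch below is one simple way to do so and is not taken from [17]; in particular, requiring at least two observations before calling an event a menu event is an assumption, and the function name classify_event is hypothetical.

```python
def classify_event(observations):
    """Classify one event from its observed executions.

    observations: list of (source_state, target_state) pairs.
    """
    if not observations:
        raise ValueError("at least one observed execution is required")
    if all(source == target for source, target in observations):
        return "self-loop"
    targets = {target for _source, target in observations}
    # Assumption: one execution is not enough evidence for a menu event.
    if len(observations) > 1 and len(targets) == 1:
        return "menu"
    return "other"


menu = classify_event([("s0", "home"), ("s1", "home"), ("home", "home")])
loop = classify_event([("s0", "s0"), ("s1", "s1")])
other = classify_event([("s0", "s1"), ("s1", "s2")])
```

A classification like this is only a working hypothesis during crawling; a later execution can contradict it, and the strategy then has to adapt.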

Experimental Results - Strategies

• We compare the performance of the model-based strategies with
▫ The Optimized Breadth-First Strategy
▫ The Optimized Depth-First Strategy
▫ The Greedy Strategy (explore the event closest to the current state)
▫ Optimal (calculated when the model is known)

Experimental Results - Applications

• Real Applications
▫ Periodic Table (Local version: http://ssrg.eecs.uottawa.ca/periodic/)
▫ Clipmarks (Local version: http://ssrg.eecs.uottawa.ca/clipmarks/)
• Test Applications
▫ TestRIA (http://ssrg.eecs.uottawa.ca/TestRIA/)
▫ Altoro Mutual (http://www.altoromutual.com/)

Experimental Results – Measuring Efficiency

• Efficiency of a strategy is measured by the cost of discovering the states, which is based on the number of events executed and the resets used.
• Before crawling an application we measure average event execution time and average reset time for the application.
• For simplicity, we assume each event has the same unit cost (which is the average event execution time).
• The cost of reset is defined in terms of event execution cost.
• The cost of a strategy is calculated as (#events executed) + (#resets used) × (cost of reset)
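The cost measure above is simple enough to write down directly. In this sketch the function names and the example timings are hypothetical, and expressing the reset cost as the ratio of the two measured averages is one reading of "defined in terms of event execution cost".

```python
def reset_cost_in_events(avg_reset_time_ms, avg_event_time_ms):
    """Express one reset as a multiple of the average event execution time."""
    return avg_reset_time_ms / avg_event_time_ms


def crawl_cost(events_executed, resets_used, reset_cost):
    """(#events executed) + (#resets used) * (cost of reset)."""
    return events_executed + resets_used * reset_cost


# Hypothetical measurements: a reset takes 800 ms, an event 100 ms on average.
reset_cost = reset_cost_in_events(800, 100)
total = crawl_cost(events_executed=100, resets_used=5, reset_cost=reset_cost)
```

Because a reset is weighted by its relative cost, a strategy that saves a few resets can come out ahead even if it executes somewhat more events.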

Experimental Results – Crawling Efficiency

[Figure: four plots of crawling efficiency, in logarithmic scale. Panel labels: Cost of Reset: 8; Cost of Reset: 18; Cost of Reset: 2; Cost of Reset: 2.]

Future of RIA Crawling

• Avoid New States Without New Information
▫ Automatically identify the parts of a page that can be crawled independently to reduce the state space explosion
• Adaptive Crawling
▫ Decide the meta-model for the application during the crawling
• Greater Diversity
▫ Try to get a bird’s-eye view of the application model as soon as possible
• Distributed Crawling
▫ Crawl applications using multiple processes running concurrently to reduce crawling time

References

[1] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl: Making ajax applications searchable,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89, IEEE Computer Society, 2009.
[2] C. Duda, G. Frey, D. Kossmann, and C. Zhou, “Ajax search: crawling, indexing and searching web 2.0 applications,” Proc. VLDB Endow., vol. 1, pp. 1440–1443, Aug. 2008.
[3] G. Frey, “Indexing ajax web applications,” Master’s thesis, ETH Zurich, 2007.
[4] A. Mesbah, E. Bozdag, and A. v. Deursen, “Crawling ajax by inferring user interface state changes,” in Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134, IEEE Computer Society, 2008.
[5] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web applications through dynamic analysis of user interface state changes,” TWEB, vol. 6, no. 1, p. 3, 2012.
[6] D. Roest, A. Mesbah, and A. van Deursen, “Regression testing ajax applications: Coping with dynamism,” in ICST, pp. 127–136, IEEE Computer Society, 2010.
[7] C.-P. Bezemer, A. Mesbah, and A. van Deursen, “Automated security testing of web widget interactions,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09, 2009.
[8] A. Mesbah and A. van Deursen, “Invariant-based automatic testing of ajax user interfaces,” in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pp. 210–220, May 2009.
[9] A. Marchetto, P. Tonella, and F. Ricca, “State-based testing of ajax web applications,” in Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130, IEEE Computer Society, 2008.
[10] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Reverse engineering finite state machines from rich internet applications,” in Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73, IEEE Computer Society, 2008.
[11] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Rich internet application testing using execution trace data,” in Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283, IEEE Computer Society, 2010.
[12] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “An iterative approach for the reverse engineering of rich internet application user interfaces,” in Proceedings of the 2010 Fifth International Conference on Internet and Web Applications and Services, ICIW ’10, pp. 401–410, IEEE Computer Society, 2010.
[13] D. Amalfitano, A. R. Fasolino, A. Polcaro, and P. Tramontana, “Dynaria: A tool for ajax web application comprehension,” in ICPC, pp. 46–47, IEEE Computer Society, 2010.
[14] Z. Peng, N. He, C. Jiang, Z. Li, L. Xu, Y. Li, and Y. Ren, “Graph-based ajax crawl: Mining data from rich internet applications,” in Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on, vol. 3, pp. 590–594, March 2012.
[15] K. Benjamin, G. v. Bochmann, M. E. Dincturk, G.-V. Jourdan, and I. V. Onut, “A strategy for efficient crawling of rich internet applications,” in Proceedings of the 11th international conference on Web engineering, ICWE ’11, 2011.
[16] M. E. Dincturk, S. Choudhary, G. v. Bochmann, G.-V. Jourdan, and I. V. Onut, “A statistical approach for efficient crawling of rich internet applications,” in Proceedings of the 12th international conference on Web engineering, ICWE ’12, 2012.
[17] S. Choudhary, “M-crawler: Crawling rich internet applications using menu meta-model,” Master’s thesis, EECS - University of Ottawa, 2012.

Thank You
