Crawling Rich Internet Applications: The State of the Art Software Security Research Group (SSRG) University of Ottawa In collaboration with IBM Suryakant Choudhary, M. Emre Dincturk, Seyed M. Mirtaheri, Ali Moosavi, Gregor von Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut CASCON 2012 November 5, 2012

Crawling Rich Internet Applications:

The State of the Art

Software Security Research Group (SSRG) University of Ottawa

In collaboration with IBM

Suryakant Choudhary, M. Emre Dincturk, Seyed M. Mirtaheri, Ali Moosavi, Gregor von Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut

CASCON 2012 November 5, 2012

Overview

• Introduction
  ▫ The evolution of Web applications and crawling
• Crawling RIAs
  ▫ Challenges and Common Assumptions
• Research on Crawling RIAs
  ▫ Crawling for Indexing
  ▫ Crawling for Testing
  ▫ Research on Crawling Strategies
    Greedy Strategy
    Model-based Crawling: Hypercube Strategy, Probability Strategy, Menu Strategy
• Experimental Results
• Future of RIA Crawling

Introduction - Traditional Web Applications

▫ HTML pages identified by a URL
▫ Synchronous communication

[Figure: Traditional Synchronous Communication Pattern. Each user interaction sends a request; the user waits during server processing; the response causes a full page refresh.]

Introduction - Rich Internet Applications (1)

• Client-side code (JavaScript) execution
• The page can be modified by the client-side code.
• Document Object Model (DOM): A tree data structure to represent the page in the client.
• Events: Occurrences that cause code execution (mouse click, timeout, etc.)

Introduction - Rich Internet Applications (2)

• Asynchronous Communication (AJAX)

[Figure: Asynchronous Communication Pattern (in RIAs). User interaction continues while requests are sent; server processing and responses lead to partial page updates.]

Introduction - Crawling (1)

• Crawling: Exploring an application automatically
• Motivations
  ▫ Content indexing (by search engines)
  ▫ Testing (for security, accessibility, functionality)
• Objectives
  ▫ Find all (or ‘important’) pages
  ▫ Find the connections between the pages (obtaining a complete model of the application, for example for page ranking)

Introduction - Crawling (2)

• Crawling extracts “a model” of the application
  ▫ States are the “distinct” pages
  ▫ Transitions are the connections between the states

Crawling RIAs

• RIAs have events that change the page without changing the URL.
  ▫ One URL -> many states
• The aim is to find all the states reachable from a given URL by executing events.
  ▫ The Initial State: The state reached by loading the URL
  ▫ Reset: Loading the URL to go back to the initial state
• An event’s behaviour may depend on the state in which it is executed. We therefore have to execute, in each state, all the events enabled in that state.
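
The exploration described on this slide can be sketched as follows. This is a minimal, deliberately naive illustration, not the authors' crawler: the application is replaced by a hypothetical dictionary `TOY_APP` that maps each state to its enabled events, and every event execution is preceded by a reset and a replay of a known path. All names are assumptions made for the example.

```python
from collections import deque

# Hypothetical toy application: state -> {event: next_state}.
# In a real RIA these transitions are only discovered by executing events in a browser.
TOY_APP = {
    "s0": {"e1": "s1", "e2": "s2"},
    "s1": {"e1": "s1", "e2": "s3"},
    "s2": {"e1": "s1"},
    "s3": {},
}

def crawl(app, initial="s0"):
    """Execute every enabled event in every reachable state and record the model."""
    transitions = {}           # (state, event) -> resulting state
    paths = {initial: []}      # event sequence reaching each known state from the initial state
    queue = deque([initial])
    resets = 0
    while queue:
        state = queue.popleft()
        for event in app[state]:
            # Reset (reload the URL), then replay the known path back to `state`.
            resets += 1
            current = initial
            for step in paths[state]:
                current = app[current][step]
            # Execute the event under exploration and record the transition.
            target = app[current][event]
            transitions[(state, event)] = target
            if target not in paths:
                paths[target] = paths[state] + [event]
                queue.append(target)
    return set(paths), transitions, resets

states, transitions, resets = crawl(TOY_APP)
```

Resetting before every event is wasteful on purpose: it makes visible why the number of events executed and resets used is the quantity a crawling strategy tries to reduce.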

Crawling RIAs – Challenges and Assumptions

• State Identification
  ▫ A state needs to be identified by its DOM.
  ▫ A DOM Equivalence Relation is needed.
• Event Identification
• Assumption: No Server-side States
• Assumption: Finite Representative User Inputs
• Intermediate States
• Efficiency of Crawling Strategies
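
One possible DOM equivalence relation is sketched below. It is an assumption made for illustration, not the relation used in the cited work: two DOMs are treated as the same state when they have the same tag structure and element ids, and text content is ignored. The function name `state_id` is hypothetical; parsing uses the standard library's `html.parser`.

```python
import hashlib
from html.parser import HTMLParser

class _StructureCollector(HTMLParser):
    """Collects opening and closing tags, ignoring text and all attributes except id."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("id") or ""
        self.tokens.append(f"<{tag}#{ident}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

def state_id(dom_html: str) -> str:
    """Return a hash identifying the state; equal hashes mean equivalent DOMs."""
    collector = _StructureCollector()
    collector.feed(dom_html)
    collector.close()
    return hashlib.sha256("".join(collector.tokens).encode("utf-8")).hexdigest()

a = state_id('<div id="menu"><p>Hello</p></div>')
b = state_id('<div id="menu"><p>Goodbye</p></div>')
c = state_id('<div id="menu"><p>Hello</p><p>More</p></div>')
```

Under this relation `a` and `b` are the same state and `c` is a different one. A stricter or looser relation changes how many states the crawler sees, which is why the choice matters.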

Crawling RIAs for Indexing

• Duda et al. [1][2][3]
  ▫ Use a Breadth-First crawling strategy
  ▫ Introduced AjaxRank [3]: an adaptation of PageRank to RIAs to sort the results of a search query
• Mesbah et al. [4][5] introduced “Crawljax”
  ▫ Uses a strategy similar to Depth-First
  ▫ Outputs static HTML snapshots of the discovered DOMs, which can be indexed by search engines

Crawling RIAs for Testing

• Crawljax is also used for testing RIAs
  ▫ Regression Testing of Ajax Applications [6]
  ▫ Security Testing of web widget interactions [7]
  ▫ Invariant-based Testing of Ajax Applications [8]
• Marchetto et al. [9]: testing to reveal faulty behaviour
  ▫ Combines analysis of user traces, static analysis of the code and human validation to produce a model of an application
• Amalfitano et al. [10][11][12][13]
  ▫ Modeling and testing based on user execution traces obtained from user sessions and/or automated trace generation using a Depth-First strategy

Crawling Strategies for RIAs

• Crawling Strategy: an algorithm that decides which event should be explored next.
  ▫ An efficient strategy discovers the states as soon as possible (our definition)
  ▫ Time to find all the states ~ the number of events executed and the resets used during crawling
• The standard strategies used in the research mentioned above, Breadth-First and Depth-First, are not efficient for RIAs.
  ▫ No predictions for the event outcomes.
  ▫ A strict order of state exploration: leads to an increased number of event executions and resets (used to transfer from the current state to the state being explored).

Research on Crawling Strategies for RIAs

• Greedy Strategy [14]
  ▫ A simple strategy that gives priority to the event closest to the current state
  ▫ Tries to minimize the transfer sequences, but still makes no prediction of event outcomes
• Model-Based Crawling Strategies
  ▫ Hypercube Strategy [15]
  ▫ Probability Strategy [16]
  ▫ Menu Strategy [17]
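
The greedy choice described above can be sketched as a breadth-first search over the part of the model discovered so far. This is an illustrative sketch under assumed data structures (`known` and `unexecuted` are hypothetical names), not the implementation from [14].

```python
from collections import deque

def closest_unexecuted(current, known, unexecuted):
    """Return (transfer_sequence, state, event) for the nearest unexecuted event.

    known:      state -> {event: next_state} for events already executed
    unexecuted: state -> set of events not yet executed in that state
    """
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            # Ties within a state are broken alphabetically (an arbitrary choice).
            return path, state, sorted(unexecuted[state])[0]
        for event, nxt in known.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None  # nothing reachable through known transitions; a reset would be needed

known = {"s0": {"e1": "s1"}, "s1": {"e2": "s2"}}
unexecuted = {"s2": {"e3"}}
choice = closest_unexecuted("s0", known, unexecuted)
```

Here `choice` holds the transfer sequence `["e1", "e2"]`, the target state `"s2"` and the event `"e3"`. The search uses only distances in the known graph, so it shortens transfers but says nothing about which event is likely to reveal a new state.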

Model-Based Crawling

•Meta-model: assumed structure of the application

•Crawling strategy is optimized for the case that the application follows these assumptions

•Adaptation of the strategy: the crawling strategy must be able to deal with applications that do not satisfy these assumptions

The Hypercube Strategy

• The Hypercube Meta-Model assumes that the application has a hypercube model.

• The Hypercube strategy is an “optimal” strategy for this meta-model.

[Figure: Example of a 4-dimensional hypercube]
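
To make the shape of such a model concrete, the sketch below builds a hypercube under one assumed reading of the meta-model: the events enabled in the initial state are independent, and a state is characterized by the set of events executed so far. The function name and event names are hypothetical.

```python
from itertools import combinations

def hypercube_model(events):
    """States are subsets of executed events; each transition executes one more event."""
    events = sorted(events)
    states = [frozenset(c)
              for r in range(len(events) + 1)
              for c in combinations(events, r)]
    transitions = {(s, e): s | {e} for s in states for e in events if e not in s}
    return states, transitions

states, transitions = hypercube_model(["e1", "e2", "e3", "e4"])
```

For four events this gives 2^4 = 16 states and 4 * 2^3 = 32 transitions, which matches the size of a 4-dimensional hypercube.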

The Probability Strategy

• Prioritizes events based on their probability of discovering a new state
  ▫ N(e) = number of executions of event e
  ▫ S(e) = number of new states found by e
  ▫ Bayesian formula: P(e) = (S(e) + pS) / (N(e) + pN), with pS = 1 and pN = 2, so the initial probability is 0.5
• Aim: Choose an event e to explore such that
  ▫ P(e) is high
  ▫ The transfer sequence from the current state to a state where e is unexecuted is short
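
A small sketch of the bookkeeping behind this strategy follows. It assumes the estimate P(e) = (S(e) + pS) / (N(e) + pN), which agrees with the stated initial probability of 0.5 for pS = 1 and pN = 2. The selection rule in `pick_event` (highest probability first, shortest transfer as tie-breaker) is a hypothetical simplification, not the rule from [16].

```python
from dataclasses import dataclass

P_S, P_N = 1, 2  # prior parameters given on the slide

@dataclass
class EventStats:
    executions: int = 0   # N(e)
    new_states: int = 0   # S(e)

    def probability(self) -> float:
        # Assumed estimate: P(e) = (S(e) + pS) / (N(e) + pN)
        return (self.new_states + P_S) / (self.executions + P_N)

    def record(self, found_new_state: bool) -> None:
        self.executions += 1
        if found_new_state:
            self.new_states += 1

def pick_event(candidates):
    """candidates: list of (event_name, EventStats, transfer_length).

    Hypothetical rule: prefer the highest P(e), then the shortest transfer sequence.
    """
    return max(candidates, key=lambda c: (c[1].probability(), -c[2]))[0]

fresh = EventStats()
productive = EventStats()
productive.record(found_new_state=True)
chosen = pick_event([("e1", fresh, 0), ("e2", productive, 3)])
```

With these numbers `fresh.probability()` is 0.5 and `productive.probability()` is 2/3, so `chosen` is `"e2"` even though it is further away. A real strategy has to trade probability against transfer length rather than rank one strictly above the other.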

The Menu Strategy

• The Menu Meta-Model defines three categories of events:
  ▫ 1. Menu Event: Leads to the same state regardless of where it is executed. (e1 and e2)
  ▫ 2. Self-Loop Event: Does not cause any state change. (e3)
  ▫ 3. Other Event: An event that is neither of the above.

[Figure: simple example model with menu events e1 and e2 and self-loop event e3]
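
The three categories can be illustrated with a small classifier over observed executions. This is a sketch under assumptions: the function name, the input format and the `"unclassified"` label are hypothetical, and a classification based on a few observations is only tentative, since a later execution may contradict it.

```python
def classify_event(observations):
    """Classify one event from its observed (source_state, target_state) pairs."""
    if not observations:
        return "unclassified"  # hypothetical label: the event was never executed
    if all(src == dst for src, dst in observations):
        return "self-loop"     # never changed the state
    sources = {src for src, _ in observations}
    targets = {dst for _, dst in observations}
    if len(targets) == 1 and len(sources) >= 2:
        return "menu"          # same target from at least two different states
    return "other"

menu = classify_event([("s1", "s5"), ("s2", "s5")])
loop = classify_event([("s1", "s1"), ("s2", "s2")])
other = classify_event([("s1", "s2"), ("s2", "s3")])
```

Once an event is tentatively classified as a menu event or a self-loop, its outcome in an unvisited state can be predicted, which is the kind of prediction that Breadth-First and Depth-First crawling do not make.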

Experimental Results - Strategies

• We compare the performance of the model-based strategies with
  ▫ The Optimized Breadth-First Strategy
  ▫ The Optimized Depth-First Strategy
  ▫ The Greedy Strategy (explore the event closest to the current state)
  ▫ Optimal (calculated when the model is known)

Experimental Results - Applications

• Real Applications
  ▫ Periodic Table (Local version: http://ssrg.eecs.uottawa.ca/periodic/)
  ▫ Clipmarks (Local version: http://ssrg.eecs.uottawa.ca/clipmarks/)
• Test Applications
  ▫ TestRIA (http://ssrg.eecs.uottawa.ca/TestRIA/)
  ▫ Altoro Mutual (http://www.altoromutual.com/)

Experimental Results – Measuring Efficiency

• Efficiency of a strategy is measured by the cost of discovering the states, which is based on the number of events executed and the resets used.
• Before crawling an application, we measure the average event execution time and the average reset time for the application.
• For simplicity, we assume each event has the same unit cost (which is the average event execution time).
• The cost of a reset is defined in terms of event execution cost.
• The cost of a strategy is calculated as (#events executed) + (#resets used) * (cost of reset)
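
The cost measure above can be written out directly. The function names and the sample numbers are hypothetical; only the formula comes from the slide.

```python
def reset_cost(avg_reset_time: float, avg_event_time: float) -> float:
    """Express the cost of one reset in units of one event execution."""
    return avg_reset_time / avg_event_time

def strategy_cost(events_executed: int, resets_used: int, cost_of_reset: float) -> float:
    """(#events executed) + (#resets used) * (cost of reset)."""
    return events_executed + resets_used * cost_of_reset

# Hypothetical measurements: a reset takes 4.0 s and an event takes 0.5 s on average.
unit_reset = reset_cost(4.0, 0.5)
total = strategy_cost(events_executed=100, resets_used=5, cost_of_reset=unit_reset)
```

With these numbers a reset costs 8 event executions, so 100 events and 5 resets give a total cost of 140.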

Experimental Results – Crawling Efficiency

[Figure: crawling cost of each strategy, one plot per application. Plots are in logarithmic scale. Cost of reset shown on the four plots: 8, 18, 2 and 2.]

Future of RIA Crawling

• Avoid New States Without New Information
  ▫ Automatically identify the parts of a page that can be crawled independently to reduce the state space explosion
• Adaptive Crawling
  ▫ Decide the meta-model for the application during the crawling
• Greater Diversity
  ▫ Try to get a bird’s-eye view of the application model as soon as possible
• Distributed Crawling
  ▫ Crawl applications using multiple processes running concurrently to reduce crawling time

References

[1] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl: Making ajax applications searchable,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89, IEEE Computer Society, 2009.
[2] C. Duda, G. Frey, D. Kossmann, and C. Zhou, “Ajax search: crawling, indexing and searching web 2.0 applications,” Proc. VLDB Endow., vol. 1, pp. 1440–1443, Aug. 2008.
[3] G. Frey, “Indexing ajax web applications,” Master’s thesis, ETH Zurich, 2007.
[4] A. Mesbah, E. Bozdag, and A. v. Deursen, “Crawling ajax by inferring user interface state changes,” in Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134, IEEE Computer Society, 2008.
[5] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web applications through dynamic analysis of user interface state changes,” TWEB, vol. 6, no. 1, p. 3, 2012.
[6] D. Roest, A. Mesbah, and A. van Deursen, “Regression testing ajax applications: Coping with dynamism,” in ICST, pp. 127–136, IEEE Computer Society, 2010.
[7] C.-P. Bezemer, A. Mesbah, and A. van Deursen, “Automated security testing of web widget interactions,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09, 2009.
[8] A. Mesbah and A. van Deursen, “Invariant-based automatic testing of ajax user interfaces,” in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pp. 210–220, May 2009.
[9] A. Marchetto, P. Tonella, and F. Ricca, “State-based testing of ajax web applications,” in Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130, IEEE Computer Society, 2008.

References

[10] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Reverse engineering finite state machines from rich internet applications,” in Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73, IEEE Computer Society, 2008.
[11] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Rich internet application testing using execution trace data,” in Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283, IEEE Computer Society, 2010.
[12] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “An iterative approach for the reverse engineering of rich internet application user interfaces,” in Proceedings of the 2010 Fifth International Conference on Internet and Web Applications and Services, ICIW ’10, pp. 401–410, IEEE Computer Society, 2010.
[13] D. Amalfitano, A. R. Fasolino, A. Polcaro, and P. Tramontana, “Dynaria: A tool for ajax web application comprehension,” in ICPC, pp. 46–47, IEEE Computer Society, 2010.
[14] Z. Peng, N. He, C. Jiang, Z. Li, L. Xu, Y. Li, and Y. Ren, “Graph-based ajax crawl: Mining data from rich internet applications,” in Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on, vol. 3, pp. 590–594, March 2012.
[15] K. Benjamin, G. v. Bochmann, M. E. Dincturk, G.-V. Jourdan, and I. V. Onut, “A strategy for efficient crawling of rich internet applications,” in Proceedings of the 11th international conference on Web engineering, ICWE ’11, 2011.
[16] M. E. Dincturk, S. Choudhary, G. v. Bochmann, G.-V. Jourdan, and I. V. Onut, “A statistical approach for efficient crawling of rich internet applications,” in Proceedings of the 12th international conference on Web engineering, ICWE ’12, 2012.
[17] S. Choudhary, “M-crawler: Crawling rich internet applications using menu meta-model,” Master’s thesis, EECS, University of Ottawa, 2012.

Thank You
