HT2010 Paper Presentation

Providing Resilient Xpaths for External Adaptation Engines Iñaki Paz LKS, S. Coop. ONEKIN Research Group – UPV/EHU Donostia - San Sebastián, Spain June 14th, 2010

description

Providing resilient XPaths for external adaptation engines Session 3: Adaptation, June 14 3pm Northrop Frye Hall

Transcript of HT2010 Paper Presentation

Page 1: HT2010 Paper Presentation

Providing Resilient XPaths for External Adaptation Engines

Iñaki Paz

LKS, S. Coop.

ONEKIN Research Group – UPV/EHU

Donostia - San Sebastián, Spain

June 14th, 2010

Page 2: HT2010 Paper Presentation

Iñaki Paz 2

Index

Introduction

XPath Expressions to select contents

Web pages get changed!!!!
• In Space
• In Time

Evaluation

Conclusions

Introduction

Page 3: HT2010 Paper Presentation

Adaptation-Aware Web Applications

[Figure: architecture diagram. Labels: Server, Browser, HTTP (URL + Params), Content, Adaptation Rules]

Depending on user profile and context, the Web application reacts by executing adaptation rules that provide personalized content.

RULES “kind of” CONFIGURE ADAPTATION

Rules address what is adapted and how, based on user profile and context.

Page 4: HT2010 Paper Presentation

Adaptation-Aware Applications

Adaptation cases and rules are foreseen at application development time.

New, unforeseen adaptation needs may appear over time.

Possible new adaptation needs:

• A new interaction protocol (FTP) to handle application documents.

• A new communication language (RSS) to present data.

• A RESTful interface to application concepts.

• New data filters on searches for a given user.

• External mashups related to certain content.

Page 5: HT2010 Paper Presentation

Adaptation as an Application Layer

[Figure: architecture diagram. Labels: Application Layer, Adaptation Layer, Adaptation Rules, Browser, Content, Adapted Content, Protocol, HTTP / HTML?]

The Adaptation Layer can be inside the application:

• It may access the application’s business logic and APIs

• Complex adaptations are possible

The Adaptation Layer can be EXTERNAL to the application:

• The Adaptation Layer works like any other browser (HTTP + HTML)

• More flexible: adaptation is FULLY independent from the application

Page 6: HT2010 Paper Presentation

External Adaptation

[Figure: architecture diagram. Labels: Application Layer, Content (HTML Pages), HTTP / HTML, Adaptation Layer, Adapted Communication Protocol, Adapted Content, HTTP / HTML?, Browser]

• http://www.dapper.net/open/: Web Page => RSS, Google Gadget

• GreaseMonkey Scripts: JS scripts for the browser to personalize the app.

Page 7: HT2010 Paper Presentation

External Adaptation

[Figure: diagram. Labels: Application Layer, Content (HTML Pages), Adaptation Layer]

Adaptation rules need to specify WHICH elements on the page the adaptation affects.

Distinct technologies are available to select elements on pages:

• Text Patterns

• Regular Expressions

• Complex Expression Languages

This work focuses on XPath:

• Most browsers support the DOM Level 3 XPath specification

• It is easy to transform HTML to XHTML (e.g. JTidy)

Page 8: HT2010 Paper Presentation


XPath to Select Contents

Page 9: HT2010 Paper Presentation

External Adaptation

XPath is a language to select nodes in XML documents.

XPath is based on the TREE structure of documents.

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]

Page 10: HT2010 Paper Presentation

Web App Pages Change!!!

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]

If the page changes, the wanted element may no longer be selected correctly.
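The fragility of positional paths can be reproduced in a few lines. The following is only an illustrative sketch, not the tool from the talk. It uses Python's standard `xml.etree.ElementTree`, which accepts only paths relative to the root element, so `./body[1]/...` stands in for `/html/body[1]/...`. The `page` helper and the row labels are made up for the example.

```python
import xml.etree.ElementTree as ET


def page(rows):
    """Build a toy XHTML page with one table and one single-cell row per label."""
    cells = "".join(f"<tr><td>{label}</td></tr>" for label in rows)
    return ET.fromstring(f"<html><body><table>{cells}</table></body></html>")


def cell_texts(root, path):
    """Return the text of every cell in the rows selected by `path`."""
    return [td.text for tr in root.findall(path) for td in tr]


# Relative counterpart of the absolute path /html/body[1]/table[1]/tr[2].
POSITIONAL = "./body[1]/table[1]/tr[2]"

before = page(["news", "BANNER", "sports"])
after = page(["breaking", "news", "BANNER", "sports"])  # one row added on top

print(cell_texts(before, POSITIONAL))
print(cell_texts(after, POSITIONAL))
```

The same expression picks the banner row on the first page and a different row on the second, because the inserted row shifts every position below it.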

Page 11: HT2010 Paper Presentation


OUR OBJECTIVE IS TO OBTAIN CHANGE-RESILIENT XPATH EXPRESSIONS

Page 12: HT2010 Paper Presentation

Web App Pages Change!!!

Given the XPaths:

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]

The XPath:

//table[@cellpadding='2']/tr[count(*)=1]

would select the same elements.

Notice that this XPath characterizes the banner as those ROWS with only ONE column in a table whose cellpadding is '2'.

Obtaining these XPath expressions by hand is cumbersome and error-prone. A tool has been developed to obtain a node’s absolute XPath expression and then generate an optimized XPath.

Firefox plugins like XPather or XPath Checker (among others) enable obtaining a node’s absolute XPath.
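As a rough illustration of why the expression `//table[@cellpadding='2']/tr[count(*)=1]` survives row insertions, the sketch below evaluates an equivalent selection in Python. It is an assumption-laden stand-in. The standard `xml.etree.ElementTree` module does not implement `count()`, so the `count(*)=1` condition is emulated with `len(tr) == 1`. The sample markup and the `banner_rows` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<table cellpadding="0"><tr><td>menu</td></tr></table>
<table cellpadding="2">
  <tr><td>car</td><td>price</td></tr>
  <tr><td>BANNER 1</td></tr>
  <tr><td>car</td><td>price</td></tr>
  <tr><td>BANNER 2</td></tr>
</table>
</body></html>"""


def banner_rows(root):
    """Emulate //table[@cellpadding='2']/tr[count(*)=1]."""
    rows = root.findall(".//table[@cellpadding='2']/tr")
    # count(*)=1 is not available in ElementTree, so filter on child count.
    return [tr for tr in rows if len(tr) == 1]


def banner_texts(root):
    return [tr[0].text for tr in banner_rows(root)]


root = ET.fromstring(DOC)
texts_before = banner_texts(root)

# Simulate a page change: a new two-cell result row on top of the table.
table = root.findall(".//table")[1]
new_row = ET.Element("tr")
ET.SubElement(new_row, "td").text = "car"
ET.SubElement(new_row, "td").text = "price"
table.insert(0, new_row)

texts_after = banner_texts(root)
print(texts_before, texts_after)
```

The selection depends on an attribute and on the shape of the row, not on its position, so shifting rows does not affect it.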

Page 13: HT2010 Paper Presentation

Web App Pages are different!!!

• Distinct Pages => Distinct Structure, Distinct Contents => Distinct XPaths

• XPaths are patterns to be applied over a Page Class set.

• Page Class = the SET of pages that describe the same type of information and have a similar page structure.

Page 14: HT2010 Paper Presentation


Web Pages get changed!!!!

Page 15: HT2010 Paper Presentation

Variability in Space

Variability in Space denotes the distinct running versions of a given page that are accessible at a given time.

Web application pages change their contents!!!

• Different searches provide different results

• Information expires

• Adverts are introduced

• User and context adaptations the application is already aware of

An XPath that works on one page of a given class may not work on another page of the same class.

We need to induce an XPath robust to those changes from a Page Class set containing most of the page variants.

Page 16: HT2010 Paper Presentation

XPath Induction

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[3]

/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]

Each STEP in an absolute XPATH selects one and only one ELEMENT

Page 17: HT2010 Paper Presentation

Induction: Differences on Paths

Three main difference types may be found:

Position
/a[n]/b[m]/c[o]
/a[n]/b[m]/c[p]
----------------------
/a[n]/b[m]/c[conds]

Node (e.g. div vs. span)
/a[n]/b/c[m]
/a[n]/d/c[m]
------------------
/a[n]/*[conds]/c[m]

Depth
/a[n]/b[m]/c[o]
/a[n]/d/b[m]/c[o]
-----------------------
/a[n]//b[conds]/c[o]

Page 18: HT2010 Paper Presentation

Induction: Differences on Paths

These types may appear combined.

Position & node combination
/a[n]/b[m]/c[o]
/a[n]/d[m]/c[p]
------------------
/a[n]/*[conds]/c[conds]

Sample on http://www.carsearch.com: two Position differences
/html/body[1]/table[2]/tr[1]/…/table[3]/tr[7]
/html/body[1]/table[2]/tr[1]/…/table[2]/tr[10]
------------------
/html/body[1]/table[2]/tr[1]/ … /table[@width='100%'][@border='0'][@cellpadding='2'][@cellspacing='0'][tr]/tr[count(*)=1][count(td)=1]
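The difference types (position, node, depth, and their combinations) can be sketched as a small comparison of two absolute paths. This is a simplified illustration rather than the induction algorithm from the talk. The `parse` and `generalize` names are hypothetical, only plain `tag[number]` steps are handled, and the literal `[conds]` placeholder marks where attribute conditions would still have to be induced from the sample pages.

```python
import re

STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def parse(xpath):
    """Split an absolute XPath into (tag, position) pairs; position may be None."""
    steps = []
    for raw in xpath.strip("/").split("/"):
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"unsupported step: {raw!r}")
        steps.append(match.groups())
    return steps


def fmt(step):
    tag, pos = step
    return tag if pos is None else f"{tag}[{pos}]"


def generalize(xpath_a, xpath_b):
    """Merge two absolute XPaths into one pattern with [conds] placeholders."""
    a, b = parse(xpath_a), parse(xpath_b)
    if len(a) == len(b):
        out = []
        for step_a, step_b in zip(a, b):
            if step_a == step_b:
                out.append(fmt(step_a))
            elif step_a[0] == step_b[0]:
                out.append(f"{step_a[0]}[conds]")  # position difference
            else:
                out.append("*[conds]")             # node difference
        return "/" + "/".join(out)
    # Depth difference: the longer path has extra steps after a common prefix.
    short, long_ = sorted((a, b), key=len)
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    tail = short[i:]
    if not tail or long_[len(long_) - len(tail):] != tail:
        raise ValueError("combination not handled by this sketch")
    prefix = "".join("/" + fmt(step) for step in short[:i])
    rest = [f"{tail[0][0]}[conds]"] + [fmt(step) for step in tail[1:]]
    return prefix + "//" + "/".join(rest)


print(generalize("/a[1]/b[2]/c[3]", "/a[1]/b[2]/c[5]"))
print(generalize("/a[1]/b/c[2]", "/a[1]/d/c[2]"))
print(generalize("/a[1]/b[2]/c[3]", "/a[1]/d/b[2]/c[3]"))
```

A depth difference combined with other differences raises `ValueError` here; a fuller version would need an alignment step before comparing.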

Page 19: HT2010 Paper Presentation

Induction Algorithm

LOOP on the XPaths, resolving the differences not yet considered.

Problems:
• /…/table[@class]
• /…/table[@style]

Induction provides an XPath that works on all the samples, but does not optimize it.

It ends up with expressions like:

/html/body[1]/table[2]/tr[1]/ … /table[@width='100%'][@border='0'][@cellpadding='2'][@cellspacing='0'][tr]/tr[count(*)=1][count(td)=1]

Page 20: HT2010 Paper Presentation

Web Pages Evolve in time!!!

What is the problem?
• XPath is based on structure.
• Small changes may affect structure.

Solution:
• Remove as much structural information as possible while keeping equivalence with the original XPath.

Page 21: HT2010 Paper Presentation

Web Pages Evolve in time!!!

Definition:
• Two XPaths are equivalent if they recover the same nodes. [Miklau 2004] has shown that this problem is NP-complete for a subset of XPath.

Definition:
• An XPath is resilient to a change C if the set of recovered nodes is the same whether or not change C is made.
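Both definitions can be approximated on concrete sample pages. The sketch below is only that: it checks equivalence on a finite set of samples, not in general, and it uses a hypothetical marker attribute (`data-expected`) to track the wanted nodes across a change. All names are illustrative, and the paths are relative because `xml.etree.ElementTree` does not accept absolute ones.

```python
import copy
import xml.etree.ElementTree as ET

MARK = "data-expected"  # hypothetical marker attribute, not part of the talk


def equivalent_on(samples, path_a, path_b):
    """True if both paths select exactly the same nodes on every sample page."""
    return all(
        {id(e) for e in root.findall(path_a)} == {id(e) for e in root.findall(path_b)}
        for root in samples
    )


def is_resilient(root, path, change):
    """True if `path` still selects the same logical nodes after `change`."""
    marked = copy.deepcopy(root)
    wanted = marked.findall(path)
    for element in wanted:
        element.set(MARK, "1")
    change(marked)  # mutates the copy in place
    survivors = [e for e in marked.iter() if e.get(MARK) == "1"]
    selected = marked.findall(path)
    return len(survivors) == len(wanted) and (
        {id(e) for e in selected} == {id(e) for e in survivors}
    )


PAGE = ET.fromstring(
    "<html><body><table>"
    "<tr><td>news</td></tr>"
    "<tr class='banner'><td>ad</td></tr>"
    "</table></body></html>"
)


def add_row_on_top(root):
    """Example change C: a new empty row becomes the first row of the table."""
    root.find("./body/table").insert(0, ET.Element("tr"))


print(equivalent_on([PAGE], "./body/table/tr[2]", ".//tr[@class='banner']"))
print(is_resilient(PAGE, "./body/table/tr[2]", add_row_on_top))
print(is_resilient(PAGE, ".//tr[@class='banner']", add_row_on_top))
```

The two paths are equivalent on the sample, yet only the attribute-based one is resilient to this particular change.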

Page 22: HT2010 Paper Presentation

Web Pages Evolve in time!!!

An example:
• Which XPath seems more robust?
  • /html/body/table/tr/td/span
  • /html//span

The optimum for one change may not be the optimum for another. But the probability of being affected by a change IS different.

Page 23: HT2010 Paper Presentation

Simulated Annealing

A generic probabilistic heuristic approach for global optimization problems.

Iteration starting from a solution:
• Get a new valid neighbor solution (RANDOM)
• Test whether the new solution improves on the old one, based on an energy calculation function
• Otherwise, check whether the solution is accepted probabilistically (RANDOM)
• Iterate until the solution is good enough or the computation budget has been exhausted

Simulated annealing has been used with this function: F(XPath) = a * #steps + b * #wildcards + c * #conditions

Page 24: HT2010 Paper Presentation

Simulated Annealing

Selecting a neighbor solution:

• Solutions are obtained by modifying an XPath step
• The solution resulting from the modification must be equivalent (select the same nodes). This is checked during SA execution.
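The iteration and the neighbor rule can be put together in a compact sketch. Everything specific here is an assumption made for illustration: the sample page, the three moves (drop a step, drop a condition, restore the original step), the weights, the geometric cooling schedule, and the use of `xml.etree.ElementTree` with relative paths. Equivalence is checked only against one sample page.

```python
import math
import random
import xml.etree.ElementTree as ET

SAMPLE = ET.fromstring(
    "<html><body>"
    "<table><tr><td>menu</td></tr></table>"
    "<table cellpadding='2'>"
    "<tr><td>a</td><td>b</td></tr>"
    "<tr class='banner'><td>ad</td></tr>"
    "</table></body></html>"
)


def to_path(steps):
    """Join steps; a dropped step ('') renders as a '//' descendant gap."""
    parts = []
    for step in steps:
        if step == "" and parts and parts[-1] == "":
            continue  # collapse consecutive gaps
        parts.append(step)
    return "./" + "/".join(parts)


def select(steps):
    return {id(e) for e in SAMPLE.findall(to_path(steps))}


def energy(steps, a=1.0, c=0.5):
    """F = a * #steps + c * #conditions (wildcards are not used in this sketch)."""
    return a * sum(1 for s in steps if s) + c * sum(s.count("[") for s in steps)


def neighbor(steps, original, rng):
    """Modify one step: drop it, drop its last condition, or restore it."""
    new = list(steps)
    i = rng.randrange(len(new))
    move = rng.choice(("drop_step", "drop_condition", "restore"))
    if move == "drop_step" and i < len(new) - 1:  # never drop the last step
        new[i] = ""
    elif move == "drop_condition" and "[" in new[i]:
        new[i] = new[i][: new[i].rindex("[")]
    else:
        new[i] = original[i]
    return new


def anneal(start, iterations=500, t0=1.0, cooling=0.99, seed=0):
    rng = random.Random(seed)
    target = select(start)
    current = best = list(start)
    for k in range(iterations):
        temperature = t0 * cooling ** k
        candidate = neighbor(current, start, rng)
        if select(candidate) != target:
            continue  # not equivalent on the sample: invalid neighbor
        delta = energy(candidate) - energy(current)
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if energy(current) < energy(best):
                best = current
    return best


START = ["body", "table[@cellpadding='2']", "tr[@class='banner']"]
best = anneal(START)
print(to_path(START), "->", to_path(best))
```

On this sample the lowest-energy expression that still selects the banner row is `.//tr[@class='banner']`; a given run is not guaranteed to reach it, which is why the best solution seen so far is tracked separately from the current one.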

Page 25: HT2010 Paper Presentation

Simulated Annealing

How to characterize an XPath?

Parts of an XPath:
• Steps (/table): fix a structure element on the path
• Wildcards (/*): fix an undetermined structure element on the path
• Conditions: fix a condition over an element’s attribute

Conditions:
• Style (@width) vs. description (@class, @id, @alt)
• Change likelihood vs. condition singularity

Energy function characterization: F(xpath) = a*steps + b*wildcards + c*styleConds + d*descrConds
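A possible reading of the energy function, written out as code. The weights `a`, `b`, `c`, `d` below are placeholder values, not the ones used in the work. Treating every attribute other than `@class`, `@id` and `@alt` as a style condition is an assumption, and predicates without attributes (positions, `count(...)`) are simply ignored by this sketch.

```python
import re

PREDICATE = re.compile(r"\[([^\]]*)\]")
ATTRIBUTE = re.compile(r"@([\w-]+)")
DESCRIPTIVE = {"class", "id", "alt"}  # description attributes named on the slide


def count_parts(xpath):
    """Return (steps, wildcards, style_conds, descr_conds) for an XPath string."""
    style = descr = 0
    for body in PREDICATE.findall(xpath):
        for name in ATTRIBUTE.findall(body):
            if name in DESCRIPTIVE:
                descr += 1
            else:
                style += 1  # assumption: any other attribute counts as style
    # Strip predicates first so '/' or '*' inside them is not miscounted.
    names = [s.strip() for s in PREDICATE.sub("", xpath).split("/") if s.strip()]
    wildcards = names.count("*")
    steps = len(names) - wildcards
    return steps, wildcards, style, descr


def energy(xpath, a=1.0, b=0.5, c=0.8, d=0.3):
    """F(xpath) = a*steps + b*wildcards + c*styleConds + d*descrConds."""
    steps, wildcards, style, descr = count_parts(xpath)
    return a * steps + b * wildcards + c * style + d * descr


for candidate in ("/html/body/table/tr/td/span", "/html//span"):
    print(candidate, count_parts(candidate), energy(candidate))
```

With any positive weights, the shorter expression gets the lower energy, which matches the intuition that it carries less structural information.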

Page 26: HT2010 Paper Presentation

Simulated Annealing

Sample on CarSearch
• Area to be adapted: BANNERS

Page 27: HT2010 Paper Presentation

Simulated Annealing

Sample on CarSearch
• Area to be adapted: BANNERS

Note that optimized XPaths somehow determine WHAT characterizes the selection in the document.

Page 28: HT2010 Paper Presentation


Evaluation

Page 29: HT2010 Paper Presentation

Evaluation

How to obtain page evolution for a Web app?
• Select apps and watch whether and how they change
• Consult the archived home pages of Web sites on archive.org

www.yahoo.com || www.elmundo.es

Tests:

• One page every 10 days.

• All pages analyzed for changes.

• Changes => milestones

• 2 or 3 different pages between milestones used to generate the XPath

• Tested with pages AFTER the milestone.

Page 30: HT2010 Paper Presentation

Evaluation

Changes evaluated as:
• Minor: small changes in aesthetics and basic structure (e.g. adding rows to a table)
• Major: app redesign, new layout, etc.

Results:
• 90% of XPaths were resilient to Minor Changes
• 10% of XPaths were resilient to Major Changes

Conclusion:

The approach works for evolutionary changes, not revolutionary ones.

Page 31: HT2010 Paper Presentation


Conclusions

Page 32: HT2010 Paper Presentation

Conclusions

External adaptation tools have appeared.

They require selection patterns, such as XPath.

Pattern resilience to Web app changes is important.

Induction and SA techniques have been applied to obtain resilient patterns.

Further language-specific knowledge (e.g. a table always contains rows and columns) should be taken into account in the energy function.

Page 33: HT2010 Paper Presentation


Contact

Iñaki [email protected]

http://www.lks.es

http://www.onekin.org