HT2010 Paper Presentation
Providing Resilient XPaths for External Adaptation Engines
Iñaki Paz
LKS, S. Coop.
ONEKIN Research Group – UPV/EHU
Donostia - San Sebastián, Spain
June 14th, 2010
Index
Introduction
XPath Expressions to select contents
Web pages get changed!!!!
• In Space
• In Time
Evaluation
Conclusions
Introduction
Adaptation aware Web Applications
Architecture (diagram): Browser, HTTP (URL + Params), Server (with Adaptation Rules), Content
Depending on the user profile and context, the Web application reacts by executing adaptation rules that provide personalized content.
RULES “kind of” CONFIGURE ADAPTATION
Rules address what is adapted and how, based on user profile and context
Adaptation Aware Applications
Adaptation cases / rules are foreseen at application development time
New, unforeseen adaptation needs may appear over time
New possible adaptation needs:
• New interaction protocol (FTP) to handle application docs.
• New comm. language (RSS) to present data.
• Provide a RESTful interface to application concepts
• New data filters on searches for given user.
• Add external mashups related to certain content.
Adaptation as an Application Layer
Architecture (diagram): Application Layer, Content, Adaptation Layer (with Adaptation Rules), Adapted Content, Protocol (HTTP / HTML?), Browser
The Adaptation Layer can be inside the application
• May access the application’s business logic and APIs
• Complex adaptations
The Adaptation Layer can be EXTERNAL to the application
• The Adaptation Layer works like any other browser (HTTP + HTML)
• More flexible: adaptation is FULLY independent of the application
External Adaptation
Architecture (diagram): Application Layer, HTTP / HTML, Content (HTML Pages), Adaptation Layer, Adapted Communication Protocol (HTTP / HTML?), Adapted Content, Browser
• Dapper (http://www.dapper.net/open/): Web Page => RSS, Google Gadget
• GreaseMonkey scripts: JS scripts for the browser to personalize an app
External Adaptation
Architecture (diagram): Application Layer, Content (HTML Pages), Adaptation Layer
Adaptation rules need to specify WHICH elements on the page the adaptation affects.
Distinct technologies are available to select elements on pages:
• Text Patterns
• Regular Expressions
• Complex Expression Languages
This work focuses on XPath
• Most browsers support the DOM Level 3 XPath specification
• It is easy to transform HTML to XHTML (e.g. JTidy)
XPath to Select Contents
External Adaptation
XPath is a language to select nodes in XML documents
XPath is based on the TREE structure of documents
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]
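As a rough illustration of how an absolute, position-based XPath like the one above follows from the document tree, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample page and the helper name `absolute_xpath` are assumptions made for this example; they are not the tool described in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for a real site.
PAGE = """<html><body>
<table><tr><td>menu</td></tr></table>
<table><tr><td>news</td></tr><tr><td>banner</td></tr></table>
</body></html>"""

def absolute_xpath(root, target):
    """Return the absolute XPath of `target`, one positional step per level."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # Position is counted among siblings that share the same tag name.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))

root = ET.fromstring(PAGE)
banner_row = root.findall("./body/table[2]/tr[2]")[0]
path = absolute_xpath(root, banner_row)
print(path)  # /html/body[1]/table[2]/tr[2]
```

Every step pins one element by tag and position, which is why such paths are easy to generate but fragile.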
Web App Pages Change!!!
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]
If the page changes, the wanted element may no longer be selected correctly
OUR OBJECTIVE IS TO OBTAIN CHANGE-RESILIENT XPATH EXPRESSIONS
Web App Pages Change!!!
Given the XPaths:
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[4]
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]
The XPath:
//table[@cellpadding='2']/tr[count(*)=1]
would select the same elements.
Notice that this XPath characterizes the banner as those ROWS with only ONE column in a table whose cellpadding is '2'.
Obtaining these XPath expressions by hand is cumbersome and error-prone. A tool has been developed to obtain a node’s absolute XPath expression and then generate an optimized XPath.
Firefox plugins like XPather or XPath Checker (among others) make it possible to obtain a node’s absolute XPath.
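The difference between a positional path and the attribute-based one can be sketched on a made-up page. This is only an illustration under stated assumptions: the page builder `page` and both selector functions are hypothetical, and because the standard-library `ElementTree` does not implement `count()`, the `tr[count(*)=1]` test is emulated in Python.

```python
import xml.etree.ElementTree as ET

def page(rows):
    """Build a made-up page whose third table lists `rows` (banner = 1 cell)."""
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return ET.fromstring(
        "<html><body><table/><table/>"
        f"<table cellpadding='2'>{body}</table></body></html>"
    )

def positional(root):
    # Counterpart of an absolute XPath: always the 2nd row of the 3rd table.
    return root.findall("./body/table[3]/tr[2]")

def resilient(root):
    # Emulates //table[@cellpadding='2']/tr[count(*)=1]; ElementTree has no
    # count(), so the one-child test is done in Python.
    return [tr for tr in root.findall(".//table[@cellpadding='2']/tr")
            if len(tr) == 1]

before = page([("car", "price"), ("BANNER",), ("car", "price")])
# The same page later, with one more result row above the banner.
after = page([("car", "price"), ("car", "price"), ("BANNER",), ("car", "price")])

for name, root in (("before", before), ("after", after)):
    print(name,
          [tr[0].text for tr in positional(root)],
          [tr[0].text for tr in resilient(root)])
```

On the changed page the positional selector lands on an ordinary result row, while the attribute-based one still finds the banner.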
Web App Pages are different!!!
• Distinct pages => distinct structure, distinct contents => distinct XPaths
• XPaths are patterns to be applied over a pageClass set.
• Page Class = the SET of pages that describe the same type of information and have a similar page structure.
Web Pages get changed!!!!
Variability in Space
Variability in Space denotes the distinct running versions of a given page accessible at a given time.
Web application pages change their contents!!!
• Different searches provide different results
• Information expiry
• Advert introduction
• User and context adaptations the application is aware of
An XPath that works on one page of a given class may not work on another page of the same class
We need to induce an XPath robust to those changes from a pageClass set containing most of the page variants
XPath Induction
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[3]
/html/body[1]/table[2]/tr[1]/td[3]/table[1]/tr[1]/td[2]/table[3]/tr[6]
Each STEP in an absolute XPATH selects one and only one ELEMENT
Induction: Differences on Paths
3 main difference types may be found
Position
/a[n]/b[m]/c[o]
/a[n]/b[m]/c[p]
----------------------
/a[n]/b[m]/c[conds]
Node (e.g. div vs. span)
/a[n]/b/c[m]
/a[n]/d/c[m]
------------------
/a[n]/*[conds]/c[m]
Depth
/a[n]/b[m]/c[o]
/a[n]/d/b[m]/c[o]
-----------------------
/a[n]//b[conds]/c[o]
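The three difference types can be sketched as a small path-merging routine. This is a minimal illustration, not the induction algorithm from the talk: the function names are hypothetical, `[conds]` is left as a placeholder exactly as on the slide, and depth differences are aligned with `difflib.SequenceMatcher`, which is an assumption of this example.

```python
import re
from difflib import SequenceMatcher

STEP = re.compile(r"^([\w*]+)(?:\[(.*)\])?$")

def split_steps(xpath):
    """'/a[1]/b' -> [('a', '1'), ('b', None)] (tag, position or None)."""
    return [STEP.match(s).groups() for s in xpath.strip("/").split("/")]

def plain(tag, pos):
    return f"/{tag}" + (f"[{pos}]" if pos else "")

def generalize(p1, p2):
    """Merge two absolute XPaths; '[conds]' marks predicates still to induce."""
    s1, s2 = split_steps(p1), split_steps(p2)
    out = []
    if len(s1) == len(s2):
        for (tag1, pos1), (tag2, pos2) in zip(s1, s2):
            if tag1 != tag2:                 # node difference
                out.append("/*[conds]")
            elif pos1 != pos2:               # position difference
                out.append(f"/{tag1}[conds]")
            else:
                out.append(plain(tag1, pos1))
        return "".join(out)
    # Depth difference: steps found in only one path collapse into '//'.
    gap = False
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op != "equal":
            gap = True
            continue
        for tag, pos in s1[i1:i2]:
            out.append(f"//{tag}[conds]" if gap else plain(tag, pos))
            gap = False
    return "".join(out)

print(generalize("/a[1]/b[2]/c[3]", "/a[1]/b[2]/c[5]"))    # /a[1]/b[2]/c[conds]
print(generalize("/a[1]/b/c[2]", "/a[1]/d/c[2]"))          # /a[1]/*[conds]/c[2]
print(generalize("/a[1]/b[2]/c[3]", "/a[1]/d/b[2]/c[3]"))  # /a[1]//b[conds]/c[3]
```

Merging more than two sample paths could be done by folding the pairwise merge over the set, although a generalized path would then need its own parsing rules.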
Induction: Differences on Paths
These types may appear combined
Position & node combination
/a[n]/b[m]/c[o]
/a[n]/d[m]/c[p]
------------------
/a[n]/*[conds]/c[conds]
Sample on http://www.carsearch.com: 2 of Position
/html/body[1]/table[2]/tr[1]/…/table[3]/tr[7]
/html/body[1]/table[2]/tr[1]/…/table[2]/tr[10]
------------------
/html/body[1]/table[2]/tr[1]/ … /table[@width='100%'][@border='0'][@cellpadding='2'][@cellspacing='0'][tr]/tr[count(*)=1][count(td)=1]
Induction Algorithm
LOOP on XPaths resolving unconsidered differences
Problems:
• /…/table[@class]
• /…/table[@style]
Induction provides an XPath that works on all the samples, but does not optimize it
It ends up with expressions like:
/html/body[1]/table[2]/tr[1]/ … /table[@width='100%'][@border='0'][@cellpadding='2'][@cellspacing='0'][tr]/tr[count(*)=1][count(td)=1]
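The slides do not detail how the conditions in such an expression are derived. One plausible reading, sketched here purely as an assumption, is to keep the predicates that hold for the target element in every sample page: attributes with identical values and, when it is constant, the number of children. The function name and sample tables are hypothetical.

```python
import xml.etree.ElementTree as ET

def common_conditions(elements):
    """Predicates (as XPath text) that hold for every element in `elements`."""
    first, rest = elements[0], elements[1:]
    conds = [
        f"[@{name}='{value}']"
        for name, value in sorted(first.attrib.items())
        if all(e.get(name) == value for e in rest)
    ]
    child_counts = {len(e) for e in elements}
    if len(child_counts) == 1:
        conds.append(f"[count(*)={child_counts.pop()}]")
    return "".join(conds)

# Two made-up variants of the same table, taken from two sample pages.
t1 = ET.fromstring(
    "<table width='100%' border='0' cellpadding='2' class='x1'>"
    "<tr><td/></tr></table>")
t2 = ET.fromstring(
    "<table width='100%' border='0' cellpadding='2' class='x2'>"
    "<tr><td/></tr><tr><td/><td/></tr></table>")

print("table" + common_conditions([t1, t2]))
# table[@border='0'][@cellpadding='2'][@width='100%']
print("tr" + common_conditions([t1[0], t2[0]]))
# tr[count(*)=1]
```

Keeping every shared predicate gives an expression that fits all samples but is far from minimal, which matches the remark that induction does not optimize.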
Web Pages Evolve in time!!!
What is the problem?
• XPath is based on structure.
• Small changes may affect structure.
Solution:
• Remove as much structural information as possible while keeping equivalence with the original XPath.
Web Pages Evolve in time!!!
Definition:
• Two XPaths are equivalent if they recover the same nodes. [Miklau 2004] showed that this problem is coNP-complete for a fragment of XPath.
Definition:
• An XPath is resilient to change C if the set of recovered nodes is the same whether or not change C is made.
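Both definitions can be checked empirically on sample pages, which sidesteps the general decision problem. The sketch below is an illustration under assumptions: equivalence is tested only on the given samples, a change is modelled as a hypothetical in-place edit function, and paths are written relative to the `html` root because `ElementTree` does not accept absolute paths on an element.

```python
import copy
import xml.etree.ElementTree as ET

def same_nodes(xs, ys):
    return len(xs) == len(ys) and all(x is y for x, y in zip(xs, ys))

def equivalent(x1, x2, samples):
    """True if both XPaths recover the same nodes on every sample page."""
    return all(same_nodes(r.findall(x1), r.findall(x2)) for r in samples)

def resilient(xpath, root, change):
    """True if applying `change` leaves the recovered nodes untouched."""
    work = copy.deepcopy(root)          # keep the caller's page intact
    before = work.findall(xpath)
    change(work)                        # in-place edit of the copy
    return same_nodes(before, work.findall(xpath))

def wrap_table_in_div(root):
    """Example change C: the table gets wrapped in a new <div>."""
    body = root.find("body")
    table = body.find("table")
    body.remove(table)
    ET.SubElement(body, "div").append(table)

root = ET.fromstring(
    "<html><body><table><tr><td><span>price</span></td></tr></table>"
    "</body></html>")
X_ABS = "./body/table/tr/td/span"       # stands for /html/body/table/tr/td/span
X_SHORT = ".//span"                     # stands for /html//span

print(equivalent(X_ABS, X_SHORT, [root]))           # True
print(resilient(X_ABS, root, wrap_table_in_div))    # False
print(resilient(X_SHORT, root, wrap_table_in_div))  # True
```

The two expressions are equivalent on this page, yet only the shorter one survives the change.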
Web Pages Evolve in time!!!
An example:
• Which XPath seems more robust?
  • /html/body/table/tr/td/span
  • /html//span
The optimum for one change may not be the optimum for another. But the probability of being affected by a change IS different.
Simulated Annealing
A generic probabilistic heuristic approach for global optimization problems.
Iteration starting from a solution:
• Get a new valid neighbor solution (RANDOM)
• Test whether the new solution improves on the old one, based on an energy calculation function
• Else, check whether the solution is accepted probabilistically (RANDOM)
• Iterate until the solution is good enough or the computation budget has been exhausted
Simulated annealing has been used with this function: F(XPath) = a * #steps + b * #wildcards + c * #conditions
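The loop above can be sketched as follows. This is a generic simulated annealing skeleton, not the implementation from the talk: the weights `a`, `b`, `c`, the cooling schedule, and the hand-written pool of XPaths (assumed to be equivalent on the sample pages) are all assumptions made for the example.

```python
import math
import random
import re

def energy(xpath, a=1.0, b=1.5, c=0.25):
    """F(XPath) = a*#steps + b*#wildcards + c*#conditions (weights assumed)."""
    steps = [s for s in re.split(r"/+", xpath) if s]
    wildcards = sum(1 for s in steps if s.startswith("*"))
    conditions = xpath.count("[")
    return a * (len(steps) - wildcards) + b * wildcards + c * conditions

def anneal(start, neighbor, energy, t0=1.0, cooling=0.95, budget=200, rng=None):
    rng = rng or random.Random(0)
    current = best = start
    t = t0
    for _ in range(budget):                       # computation budget
        cand = neighbor(current, rng)             # random neighbor
        delta = energy(cand) - energy(current)
        # Accept improvements; accept worse solutions with prob. exp(-delta/t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = cand
            if energy(current) < energy(best):
                best = current
        t *= cooling                              # cool down
    return best

# Hypothetical XPaths assumed to select the same nodes on the samples.
POOL = [
    "/html/body/table[2]/tr[1]/td[3]/table[@cellpadding='2']/tr[count(*)=1]",
    "/html/body//table[@cellpadding='2']/tr[count(*)=1]",
    "//table[@cellpadding='2']/tr[count(*)=1]",
    "//*[@cellpadding='2']/tr[count(*)=1]",
]

def pick_other(current, rng):
    return rng.choice([x for x in POOL if x != current])

best = anneal(POOL[0], pick_other, energy)
print(best, energy(best))
```

With these assumed weights a wildcard costs more than a named step, so the search settles on the short expression that keeps the tag names.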
Simulated Annealing
Selecting a neighbor solution:
• Solutions are obtained by modifying an XPath step
• The solution resulting from the modification must be equivalent (select the same nodes). This is checked during SA execution.
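A neighbor function along these lines might look like the sketch below. The three moves (turn a step into a descendant jump, replace a tag with a wildcard, drop a step's predicates) are assumptions of this example rather than the moves used in the talk. XPaths are kept as step lists relative to the root so that the standard-library `ElementTree` can evaluate them.

```python
import random
import xml.etree.ElementTree as ET

def render(steps):
    """['body', '', 'tr'] -> './body//tr' (an empty step means '//')."""
    return "./" + "/".join(steps)

def mutate(steps, rng):
    """Return a copy of `steps` with one randomly chosen step modified."""
    steps = list(steps)
    i = rng.randrange(len(steps))
    move = rng.choice(["descendant", "wildcard", "drop_condition"])
    if move == "descendant" and i < len(steps) - 1:
        steps[i] = ""                          # a/b/c -> a//c
    elif move == "wildcard" and steps[i]:
        _, bracket, rest = steps[i].partition("[")
        steps[i] = "*" + bracket + rest        # table[...] -> *[...]
    elif "[" in steps[i]:
        # Chosen move was drop_condition, or the other move did not apply.
        steps[i] = steps[i].split("[", 1)[0]
    # Collapse runs of empty steps so '//' never becomes '///'.
    return [s for j, s in enumerate(steps) if s or j == 0 or steps[j - 1]]

def nodes(root, steps):
    return root.findall(render(steps))

def neighbor(steps, rng, samples, tries=20):
    """Random modification that still selects the same nodes on all samples."""
    for _ in range(tries):
        cand = mutate(steps, rng)
        if all(nodes(r, cand) == nodes(r, steps) for r in samples):
            return cand
    return steps      # no equivalent neighbor found within `tries`

root = ET.fromstring(
    "<html><body><div><table cellpadding='2'>"
    "<tr><td>BANNER</td></tr></table></div>"
    "<table><tr><td>other</td></tr></table></body></html>")
START = ["body", "div", "table[@cellpadding='2']", "tr"]

rng = random.Random(1)
steps = START
for _ in range(10):
    steps = neighbor(steps, rng, [root])
print(render(steps))
```

Whatever path the random walk ends on, it selects the same banner row as `START` on the sample page, because non-equivalent candidates are rejected.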
Simulated Annealing
How to characterize an XPath?
Parts of an XPath:
• Steps (/table): FIX a structure element on the path
• Wildcards (/*): FIX an undetermined structure element on the path
• Conditions: FIX a condition over an element’s attribute
Conditions:
• Style (@width) vs. description (@class, @id, @alt)
• Change likelihood vs. condition singularity
Energy function characterization: F(xpath) = a*steps + b*wildcards + c*styleConds + d*descrConds
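The refined energy function can be sketched by classifying the attributes used in conditions. Only `@width` (style) and `@class`, `@id`, `@alt` (description) come from the slide; the remaining attribute names and all four weights are assumptions made for this example.

```python
import re

# Slide examples plus a few assumed additions.
STYLE_ATTRS = {"width", "height", "border", "cellpadding", "cellspacing"}
DESCR_ATTRS = {"class", "id", "alt"}

def features(xpath):
    """Return (#steps, #wildcards, #styleConds, #descrConds) for an XPath."""
    steps = [s for s in re.split(r"/+", xpath) if s]
    wildcards = sum(s.startswith("*") for s in steps)
    attrs = re.findall(r"@([\w:-]+)", xpath)
    style = sum(a in STYLE_ATTRS for a in attrs)
    descr = sum(a in DESCR_ATTRS for a in attrs)
    # Attributes in neither set are ignored in this sketch.
    return len(steps) - wildcards, wildcards, style, descr

def energy(xpath, a=1.0, b=1.5, c=0.5, d=0.2):
    """F(xpath) = a*steps + b*wildcards + c*styleConds + d*descrConds."""
    steps, wildcards, style, descr = features(xpath)
    return a * steps + b * wildcards + c * style + d * descr

by_style = "//table[@width='100%'][@cellpadding='2']/tr"
by_descr = "//table[@id='banner']/tr"
print(features(by_style), energy(by_style))
print(features(by_descr), energy(by_descr))
```

Giving style conditions a higher weight than descriptive ones encodes the assumption that presentational attributes are more likely to change.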
Simulated Annealing
Sample on CarSearch
• Area to be adapted: BANNERS
Note that optimized XPaths somehow determine WHAT characterizes the selection in the document
Evaluation
How to obtain the evolution of a Web app's pages?
• Select apps and watch whether and how they change
• Consult the home pages of Web sites archived at archive.org: www.yahoo.com || www.elmundo.es
Tests:
• One page every 10 days.
• All pages analyzed for changes.
• Changes => milestones
• 2 or 3 different pages between milestones used to generate the XPath
• Tested with pages AFTER the milestone.
Evaluation
Changes evaluated as:
• Minor: small changes in aesthetics and basic structure (e.g. adding rows to a table)
• Major: app redesign, new layout, etc.
Results:
• 90% of XPaths were resilient to Minor Changes
• 10% of XPaths were resilient to Major Changes
Conclusion:
The approach works for evolutionary changes, not revolutionary ones
Conclusions
External Adaptation Tools have appeared
Require selection patterns, such as XPath
Pattern Resilience to Web App Changes is important
Application of Induction and SA techniques
Further language-specific treatments (e.g. a table always contains rows and columns) should be taken into account in the energy function.