Nilesh Dalvi, Philip Bohannon, Fei Sha

Presented by Vinay Rambhia

Script-generated websites have an HTML tree structure.

Wrappers are used to extract information. For example, an XPath expression to extract director information:

w1 = /html/body/div[2]/table/td[2]/text()

This works for pages with a similar structure.

As pages evolve, wrappers break, so they need a lot of maintenance.

Other wrappers:

w2 = //div[@class='content']/*/td[2]/text()

w3 = //table[@width='80%']/td[2]/text()

w4 = //text()[psib::*[1][text()='director']]
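The brittleness of a purely positional wrapper such as w1 can be illustrated with a minimal sketch. The two page versions, the director name and the helper `extract` are hypothetical and not taken from the paper. Python's standard `xml.etree.ElementTree` is used here, which supports only a subset of XPath, so paths are written relative to the root element and the element's `.text` is read instead of calling `text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical old snapshot of a movie page.
OLD_PAGE = """<html><body>
<div class="nav">menu</div>
<div class="content"><table><tr><td>Director</td><td>Jane Doe</td></tr></table></div>
</body></html>"""

# Hypothetical new snapshot: an extra div was inserted at the top.
NEW_PAGE = """<html><body>
<div class="ad">banner</div>
<div class="nav">menu</div>
<div class="content"><table><tr><td>Director</td><td>Jane Doe</td></tr></table></div>
</body></html>"""

# Positional wrapper in the spirit of w1, attribute-based wrapper in the spirit of w2.
POSITIONAL = "./body/div[2]/table/tr/td[2]"
ATTRIBUTE = ".//div[@class='content']/table/tr/td[2]"


def extract(page, wrapper):
    """Return the text of every node that the wrapper selects on the page."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(wrapper)]


print(extract(OLD_PAGE, POSITIONAL))  # positional wrapper on the old page
print(extract(NEW_PAGE, POSITIONAL))  # same wrapper after the page changed
print(extract(NEW_PAGE, ATTRIBUTE))   # attribute-based wrapper on the new page
```

In this sketch the inserted div shifts the position of the content div, so the positional wrapper no longer selects the director cell, while the attribute-based wrapper still does.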

This paper discusses:

• using temporal snapshots of web pages to develop a probabilistic tree-edit model

• using this model to improve wrapper construction

The method performs its estimation efficiently, in time quadratic in the size of the tree.

When applied to IMDB, the approach was 86% robust, whereas traditional wrappers were 40% robust.

The change model is defined in terms of a conditional transducer process π.

When a forest S is given to the π process, it is converted into a forest T.

The π process is composed of two sub-processes, $\pi_{\mathrm{ins}}$ and $\pi_{\mathrm{ds}}$.

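To summarize, the generative process π is characterized by the following parameters:

\[ \theta = \bigl(p_{\mathrm{stop}},\ \{p_{\mathrm{del}}(l)\},\ \{p_{\mathrm{ins}}(l)\},\ \{p_{\mathrm{sub}}(l_1, l_2)\}\bigr) \]

for $l, l_1, l_2 \in \Sigma$, along with the following conditions:

• $0 < p_{\mathrm{stop}} < 1$

• $0 \le p_{\mathrm{del}}(l) \le 1$

• $p_{\mathrm{ins}}(l) \ge 0$, $\sum_{l \in \Sigma} p_{\mathrm{ins}}(l) = 1$

• $p_{\mathrm{sub}}(l_1, l_2) \ge 0$, $\sum_{l_2 \in \Sigma} p_{\mathrm{sub}}(l_1, l_2) = 1$

These conditions are referred to as Eq(A).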
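The conditions of Eq(A) can be checked mechanically. The following is a minimal sketch, assuming the parameters are stored in plain dictionaries keyed by label. The class name `ChangeModelParams` and the method `violations` are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ChangeModelParams:
    """Hypothetical container for theta = (p_stop, p_del, p_ins, p_sub)."""
    p_stop: float
    p_del: dict  # label -> deletion probability
    p_ins: dict  # label -> insertion probability, sums to 1 over labels
    p_sub: dict  # (l1, l2) -> substitution probability, sums to 1 over l2

    def violations(self):
        """Return a list of the Eq(A) conditions that do not hold."""
        problems = []
        if not 0.0 < self.p_stop < 1.0:
            problems.append("p_stop must lie strictly between 0 and 1")
        for label, p in self.p_del.items():
            if not 0.0 <= p <= 1.0:
                problems.append(f"p_del({label}) must lie in [0, 1]")
        if any(p < 0.0 for p in self.p_ins.values()):
            problems.append("p_ins must be non-negative")
        if not math.isclose(sum(self.p_ins.values()), 1.0):
            problems.append("p_ins must sum to 1 over all labels")
        row_totals = {}
        for (l1, _l2), p in self.p_sub.items():
            if p < 0.0:
                problems.append(f"p_sub({l1}, .) must be non-negative")
            row_totals[l1] = row_totals.get(l1, 0.0) + p
        for l1, total in row_totals.items():
            if not math.isclose(total, 1.0):
                problems.append(f"p_sub({l1}, .) must sum to 1")
        return problems


labels = "ab"
theta = ChangeModelParams(
    p_stop=0.5,
    p_del={"a": 0.2, "b": 0.3},
    p_ins={l: 0.5 for l in labels},
    p_sub={(l1, l2): 0.5 for l1 in labels for l2 in labels},
)
print(theta.violations())
```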

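Archival data contains (S, T) pairs, where S is an old version of a page and T is a new version.

The model is specified in terms of a set of parameters θ.

We want to find θ*:

\[ \theta^{*} = \arg\max_{\theta} \prod_{(S,T) \in \text{ArchivalData}} P_{\theta}(T \mid S) \]

Here $P_{\theta}(T \mid S)$ is the probability that the model transforms S into T.

Computing Transformation Probability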

The transducer π performs a sequence of edit operations consisting of insertions, deletions and substitutions to transform a tree S into another tree T.

We use dynamic programming to compute this probability, because there are many different edit sequences that transform S into T.

Let $DP_1(F_s, F_t)$ denote the probability that $\pi(F_s) = F_t$. Let v be a root node of $F_t$. It was produced either by an insertion or by a substitution, which gives two cases:

i. The node v was the result of an insertion by the $\pi_{\mathrm{ins}}$ operator. Let p be the probability that $\pi_{\mathrm{ins}}$ inserts the node v in $F_t - v$ to form $F_t$. Then the probability of this case is $DP_1(F_s, F_t - v) \cdot p$.

ii. The node v was the result of a substitution. The probability of this case is $DP_2(F_s, F_t)$.

Hence, we have

\[ DP_1(F_s, F_t) = DP_2(F_s, F_t) + p \cdot DP_1(F_s, F_t - v) \tag{1} \]

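Let $DP_2(F_s, F_t)$ denote the probability that $\pi(F_s) = F_t$ and that v was generated by a substitution. Let u be a root node of $F_s$. Here $F - [u]$ denotes F with the whole subtree rooted at u removed, $F - u$ denotes F with only the node u removed, and $\lfloor u \rfloor$ denotes the forest of subtrees below u. Again there are two cases:

i. v was substituted for u. In this case, $F_s - [u]$ must transform to $F_t - [v]$ and $\lfloor u \rfloor$ must transform to $\lfloor v \rfloor$. Denoting $p_{\mathrm{sub}}(\mathrm{label}(u), \mathrm{label}(v))$ by $p_1$, the total probability of this case is $p_1 \cdot DP_1(F_s - [u], F_t - [v]) \cdot DP_1(\lfloor u \rfloor, \lfloor v \rfloor)$.

ii. v was substituted for some node other than u, so u was deleted. Denoting $p_{\mathrm{del}}(\mathrm{label}(u))$ by $p_2$, the probability of this case is $p_2 \cdot DP_2(F_s - u, F_t)$.

Hence, we have

\[ DP_2(F_s, F_t) = p_1 \, DP_1(F_s - [u], F_t - [v]) \, DP_1(\lfloor u \rfloor, \lfloor v \rfloor) + p_2 \, DP_2(F_s - u, F_t) \tag{2} \]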

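Let $T_1$ be the tree with the nodes a and b (a is the root and b is its child), and let $T_2$ be the tree with the single node c. Let us compute the probability that $\pi(T_1) = T_2$, which is denoted by $DP_1(T_1, T_2)$. Applying Eq (1), we get

\[ DP_1(T_1, T_2) = DP_2(T_1, T_2) + p_{\mathrm{ins}}(c) \cdot DP_1(T_1, \emptyset) \]

Let $T_3$ denote the tree with the single node b. Then, by Eq (2),

\[ DP_2(T_1, T_2) = p_{\mathrm{sub}}(a, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(T_3, \emptyset) + p_{\mathrm{del}}(a) \cdot DP_2(T_3, T_2) \]

To compute $DP_2(T_3, T_2)$, we apply Eq (2) again:

\[ DP_2(T_3, T_2) = p_{\mathrm{sub}}(b, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(\emptyset, \emptyset) + p_{\mathrm{del}}(b) \cdot DP_2(\emptyset, T_2) \]

The total probability is

\[ DP_1(T_1, T_2) = p_{\mathrm{sub}}(a, c)\, p_{\mathrm{del}}(b)\, p_{\mathrm{stop}}^{2} + p_{\mathrm{sub}}(b, c)\, p_{\mathrm{del}}(a)\, p_{\mathrm{stop}}^{2} + p_{\mathrm{del}}(a)\, p_{\mathrm{del}}(b)\, p_{\mathrm{ins}}(c)\, p_{\mathrm{stop}} \]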
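The recurrences Eq (1) and Eq (2) can be sketched as a small memoized program and checked against the worked example. This is only a sketch under several assumptions that the slides do not state. It takes u and v to be the roots of the rightmost trees. It uses the base cases that the example's final result implies: DP1(∅, ∅) = pstop, DP1 of a non-empty source against an empty target deletes every remaining node, and DP2 with an empty source is 0. It also simplifies the insertion probability p to pins(label(v)), as in the example. The function name `transformation_probability` and the tuple-based tree encoding are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children); a forest is a tuple of trees.
# Assumption: u and v are the roots of the rightmost trees of fs and ft.


def transformation_probability(fs, ft, p_stop, p_del, p_ins, p_sub):
    """Probability that pi(fs) = ft under the recurrences Eq(1) and Eq(2)."""

    @lru_cache(maxsize=None)
    def dp1(fs, ft):
        if not ft:
            if not fs:
                return p_stop  # base case inferred from the worked example
            label_u, kids_u = fs[-1]
            # Every remaining source node must be deleted; fs - u promotes u's children.
            return p_del.get(label_u, 0.0) * dp1(fs[:-1] + kids_u, ft)
        label_v, kids_v = ft[-1]
        p = p_ins.get(label_v, 0.0)  # simplification: p = p_ins(label(v))
        # Eq(1): v came from a substitution, or v was inserted into ft - v.
        return dp2(fs, ft) + p * dp1(fs, ft[:-1] + kids_v)

    @lru_cache(maxsize=None)
    def dp2(fs, ft):
        if not fs or not ft:
            return 0.0  # no source node is left to substitute for v
        label_u, kids_u = fs[-1]
        label_v, kids_v = ft[-1]
        p1 = p_sub.get((label_u, label_v), 0.0)
        p2 = p_del.get(label_u, 0.0)
        # Eq(2): v replaces u, or u is deleted and v replaces some other node.
        return (p1 * dp1(fs[:-1], ft[:-1]) * dp1(kids_u, kids_v)
                + p2 * dp2(fs[:-1] + kids_u, ft))

    return dp1(fs, ft)


# Worked example: T1 is a with child b, T2 is the single node c.
T1 = (("a", (("b", ()),)),)
T2 = (("c", ()),)
labels = "abc"
p_stop = 0.5
p_del = {"a": 0.2, "b": 0.3, "c": 0.1}
p_ins = {l: 1 / 3 for l in labels}
p_sub = {(l1, l2): 1 / 3 for l1 in labels for l2 in labels}

prob = transformation_probability(T1, T2, p_stop, p_del, p_ins, p_sub)
closed_form = (p_sub[("a", "c")] * p_del["b"] * p_stop ** 2
               + p_sub[("b", "c")] * p_del["a"] * p_stop ** 2
               + p_del["a"] * p_del["b"] * p_ins["c"] * p_stop)
print(prob, closed_form)
```

The parameter values are arbitrary illustrative numbers. The point of the comparison is that the recursion reproduces the three-term closed form of the example.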

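\[ \theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log P_{\theta}(T_n \mid S_n) \]

It is difficult to compute θ* directly, so we compute it by gradient ascent:

\[ \theta_{t+1} = \theta_t + \eta \, g(\theta_t) \tag{3} \]

where

\[ g(\theta) = \frac{\partial \log \ell(\theta)}{\partial \theta} = \sum_{n=1}^{N} \frac{\partial \log P_{\theta}(T_n \mid S_n)}{\partial \theta} \]

θ has to satisfy Eq(A), so we use the variable reparameterization

\[ \theta_{ij} = \frac{e^{\alpha_{ij}}}{\sum_{j'=1}^{N} e^{\alpha_{ij'}}} \]

Eq (3) then becomes

\[ \alpha_{t+1} = \alpha_t + \eta \, g(\alpha_t) \]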
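The effect of the reparameterization can be seen on a toy problem. The sketch below does not use the tree-edit likelihood. As a stand-in, it fits a single categorical distribution θ to observed counts by gradient ascent on unconstrained variables α, with θ obtained from α by the exponential normalization above. The function names, the step size and the number of steps are assumptions chosen for illustration.

```python
import math


def softmax(alpha):
    """Map unconstrained alpha to theta with theta_j > 0 and sum(theta) = 1."""
    m = max(alpha)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def fit_categorical(counts, eta=0.05, steps=5000):
    """Gradient ascent on alpha for the log-likelihood sum_j counts[j] * log(theta_j)."""
    alpha = [0.0] * len(counts)
    n = sum(counts)
    for _ in range(steps):
        theta = softmax(alpha)
        # d/d alpha_j of sum_k counts[k] * log(theta_k) equals counts[j] - n * theta_j.
        grad = [c - n * t for c, t in zip(counts, theta)]
        alpha = [a + eta * g for a, g in zip(alpha, grad)]
    return softmax(alpha)


theta = fit_categorical([6, 3, 1])
print(theta)
```

Because the update is applied to α, every iterate of θ stays non-negative and sums to 1 without any projection step, which is the purpose of the reparameterization.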

We use a bottom-up algorithm that starts from a general XPath and specializes it until it matches only the target node. For example, starting from

w0 = //table/*/td/text()

a specialization step can produce

//table/tr/td/text()

//table[@bgcolor='red']/*/td/text()

//table/*/td[2]/text()

The algorithm maintains a set P of partial wrappers, which have recall = 1 and precision < 1.

The algorithm applies specialization steps to the XPaths in P to obtain new XPaths, until the precision becomes 1.
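The search over partial wrappers can be sketched as follows. This is a simplified illustration, not the paper's algorithm. The page markup is hypothetical, the specialization steps are supplied as a hand-written table rather than generated, and the function names `precision_recall` and `generate_wrappers` are assumptions. Python's `xml.etree.ElementTree` does not support `text()`, so the wrappers select elements and are written relative to the root.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical page; the target is the second cell of the row.
PAGE = ('<html><body><table bgcolor="red"><tr>'
        '<td>Director</td><td>Jane Doe</td>'
        '</tr></table></body></html>')

START = ".//table/*/td"
# Hand-written specialization steps, mirroring the example wrappers above.
SPECIALIZATIONS = {
    START: [
        ".//table/tr/td",
        ".//table[@bgcolor='red']/*/td",
        ".//table/*/td[2]",
    ],
}


def specialize(wrapper):
    """Return the one-step specializations of a wrapper (empty if none are listed)."""
    return SPECIALIZATIONS.get(wrapper, [])


def precision_recall(root, wrapper, targets):
    """Precision and recall of the nodes selected by the wrapper against the targets."""
    matched = root.findall(wrapper)
    hits = sum(1 for node in matched if node in targets)
    precision = hits / len(matched) if matched else 0.0
    recall = hits / len(targets)
    return precision, recall


def generate_wrappers(root, targets, start, specialize):
    """Specialize partial wrappers (recall 1, precision < 1) until precision is 1."""
    frontier = deque([start])
    seen = {start}
    complete = []
    while frontier:
        wrapper = frontier.popleft()
        precision, recall = precision_recall(root, wrapper, targets)
        if recall < 1.0:
            continue  # the wrapper misses a target, so it is discarded
        if precision == 1.0:
            complete.append(wrapper)  # matches exactly the target nodes
            continue
        for candidate in specialize(wrapper):
            if candidate not in seen:
                seen.add(candidate)
                frontier.append(candidate)
    return complete


root = ET.fromstring(PAGE)
targets = [root.findall(".//td")[1]]
wrappers = generate_wrappers(root, targets, START, specialize)
print(wrappers)
```

On this page only the positional specialization reaches precision 1. The other two candidates still match both cells and have no further specializations in the table, so the search drops them.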

\[ \mathrm{Rob}_{X,\theta}(\varphi) = \sum_{Y \,:\, (X, Y) \models \varphi} P_{\theta}(Y \mid X) \]

Algorithm for calculating robustness

Change model

Generating Robust Wrappers

Evaluation of Model Learner