Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints

Shaoxu Song (Tsinghua University, Beijing, China)
Lei Chen (The Hong Kong University of Science and Technology)
Hong Cheng (The Chinese University of Hong Kong)

ICDE 2012

Data Dependencies

Data dependencies have recently been used to capture inconsistencies in data.

fd1: [Address] → [Region]

t5 and t6 have equal values on Address but different values of Region.

Example

Tuple  ID  Name             Address                    Region
t1     01  West Wood Hotel  Fifth Avenue, 61st Street  Chicago
t2     01  West Wood        Fifth Avenue, 61st Street  Chicago, IL
t3     01  West Wood (61)   5th Avenue, 61st St.       Chicago, IL
t4     22  St. Regis Hotel  No.3, West Lake Road.      Boston, MA
t5     22  St. Regis Hotel  #3, West Lake Rd.          Boston
t6     22  St. Regis        #3, West Lake Rd.          Chicago, MA

Tolerance to Variations

Real-world information often comes in various representation formats.

The strict equality function limits the usefulness of fds.

fd1: [Address] → [Region]

t1 and t2 are detected as a “violation” by mistake: “Chicago” and “Chicago, IL” denote the same region.

t4 and t6 are true violations, but fd1 cannot detect them because their Address values are not equal.

Metric Distance Constraints

Metric distance constraints are designed to tolerate small variations.

Differential dependencies (dds) declare dependencies with respect to metric distances: (X → Y, ϕ)

dd1: ([Address] → [Region], <8, 3>)

<8, 3> is a pattern ϕ of distance thresholds on Address and Region, respectively.

dd1 states a constraint on metric distance:

if any two tuples have a distance on Address within the threshold (≤ 8),

then their Region values should be similar as well, i.e., the edit distance on Region is within the corresponding threshold (≤ 3).
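A minimal Python sketch of how a dd with the shape of dd1 could be checked, assuming plain Levenshtein edit distance as the metric on both attributes. The helper names (edit_distance, dd_violations) and the three toy rows are hypothetical and are not taken from the paper. A tuple pair is reported when its Address distance is within the first threshold while its Region distance exceeds the second.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def dd_violations(rows, lhs, rhs, thresholds):
    """Return label pairs that are close on `lhs` but too far apart on `rhs`."""
    t_lhs, t_rhs = thresholds
    found = []
    for (la, ra), (lb, rb) in combinations(rows.items(), 2):
        close_on_lhs = edit_distance(ra[lhs], rb[lhs]) <= t_lhs
        far_on_rhs = edit_distance(ra[rhs], rb[rhs]) > t_rhs
        if close_on_lhs and far_on_rhs:
            found.append((la, lb))
    return found


# Hypothetical toy rows, not the hotel table from the slides.
rows = {
    "a": {"Address": "5th Avenue", "Region": "Chicago"},
    "b": {"Address": "5th Ave.", "Region": "Chicago"},
    "c": {"Address": "5th Ave", "Region": "Boston"},
}
violations = dd_violations(rows, "Address", "Region", (8, 3))
print(violations)  # [('a', 'c'), ('b', 'c')]
```

All three addresses are within distance 8 of each other, so only the Region comparison decides: rows a and b agree on Region, while c differs from both by more than 3 edits.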

Motivation of This Work

It is difficult to determine proper settings of distance thresholds for metric distance constraints.

fds need no such setting, since they already imply the equality function.

a very tight threshold (≈ 0, as in fds) is too strict to tolerate various information formats

a loose threshold (≈ dmax, the maximum distance value) is meaningless, since any data can satisfy it

In this study, we

employ statistical measures to evaluate the utility of distance threshold patterns, e.g., support, confidence and dependent quality

aim to automatically determine the best settings of distance thresholds, i.e., those with higher statistical measures.

Applicable to Other Types

Metric functional dependencies (mfds)

X →^δ A

equality operator on the left-hand side

metric distance operator on the right-hand side

for violation detection

e.g., manu →^2 addr

Matching dependencies (mds)

[X ≈ X] → [A ⇋ A]

similarity operator on the left-hand side

matching operator on the right-hand side

for record matching

e.g., [addr ≈ addr] → [tel ⇋ tel]

Outline

Introduction

Preliminary

Determination Algorithm

Experiment

Summary

Statistical Measures

Support of ϕ:

the proportion of tuple pairs whose distances satisfy the thresholds in ϕ[XY].

a ϕ with high support is preferred in order to detect more violations.

Confidence of ϕ:

the ratio of tuple pairs satisfying ϕ[XY] to the pairs satisfying ϕ[X].

a ϕ with high confidence is preferred in order to detect violations more precisely.

Dependent quality of ϕ:

the quality of tolerance on the dependent attributes Y, i.e., how close the distance threshold ϕ[Y] is to equality.

if the dependent quality is low (i.e., ϕ[Y] is far from equality), the constraint is meaningless and useless.
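The three measures can be sketched in Python as follows. Each tuple pair is represented by its per-attribute distances on X and on Y. The function names and the toy distances are hypothetical. The formula used for dependent quality, Q(ϕ) = Σ over A ∈ Y of (dmax − ϕ[A]) divided by |Y|·dmax, is an assumption: the slides only describe Q informally, and this is one definition that equals 1 at equality and 0 at dmax.

```python
def satisfies(dists, thresholds):
    """True if every per-attribute distance is within its threshold."""
    return all(d <= t for d, t in zip(dists, thresholds))


def measures(pairs, phi_x, phi_y, d_max):
    """Return (support, confidence, dependent quality) of the pattern (phi_x, phi_y).

    pairs: list of (dists_on_X, dists_on_Y), one entry per tuple pair.
    """
    in_x = [(dx, dy) for dx, dy in pairs if satisfies(dx, phi_x)]
    in_xy = [p for p in in_x if satisfies(p[1], phi_y)]
    support = len(in_xy) / len(pairs)
    confidence = len(in_xy) / len(in_x) if in_x else 0.0
    # Assumed definition of Q: average normalised closeness of phi[Y] to equality.
    quality = sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)
    return support, confidence, quality


# Hypothetical distances for four tuple pairs, one attribute on each side.
pairs = [((2,), (1,)), ((5,), (4,)), ((9,), (0,)), ((7,), (2,))]
support, confidence, quality = measures(pairs, phi_x=(8,), phi_y=(3,), d_max=10)
print(support, round(confidence, 4), quality)  # 0.5 0.6667 0.7
```

Loosening ϕ[Y] to dmax in this sketch pushes the confidence to 1.0 and the assumed quality to 0.0, which is the trade-off between the two measures.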

Interaction of Measures

If the dependent quality is set too high

e.g., ϕ[Y] = 0, the equality used in fds

the constraint is too strict and may identify violations by mistake

the confidence measure will be low

Conversely, consider a ϕ with the lowest dependent quality

i.e., ϕ[Y] = dmax, the maximum distance value

it has the highest confidence, 1.0, since any tuple pair always has distance ≤ dmax on Y

it misses all the violations and is useless

For example, ([Address] → [Region], <8, dmax>)

any pair of tuples always has distance ≤ dmax on Region

the confidence is 1.0

the violation between t4 and t6 cannot be detected by such a dd

Parameter-free Determination

To determine ϕ

applications prefer metric distance constraints with high statistical measures

it is difficult to set minimum support, confidence and dependent quality as separate parameters

setting the requirements on some measures too high makes the others low

A parameter-free style

automatically return the best ϕ

such that no other setting has higher support, confidence and dependent quality than the returned results at the same time.

Assuring the Utility

To avoid tuning parameters manually, we are interested in an overall evaluation of utility.

Let b be the matching distance of any tuple pair.

U(ϕ) = Pr(b ⊨ ϕ[Y], Q(ϕ) is high | b ⊨ ϕ[X])

This is the conditional probability that b satisfies ϕ[Y] with high dependent quality, given that b satisfies ϕ[X].

to accurately detect violations with small distances, we expect this probability to be high for a ϕ.

U(ϕ) roughly captures the utility of confidence and dependent quality, but does not account for support.

Expected Utility

Compute an expected utility that refines U(ϕ), which covers confidence and dependent quality, by also using support:

U(ϕ) = E(U | C(ϕ), D(ϕ), Q(ϕ)),

where U on the right-hand side is the utility treated as a random variable; from here on, U(ϕ) denotes this expected utility.

C(ϕ), D(ϕ) and Q(ϕ) are statistics observed from data.

C(ϕ) is the confidence measure

D(ϕ) is the support of ϕ[X], i.e., the proportion of tuple pairs whose distances satisfy ϕ[X]

the support of ϕ is C(ϕ)D(ϕ)

Q(ϕ) is the dependent quality

Computation of Expected Utility

The computation is derived by applying Bayes' rule and the Binomial distribution.

\[
U(\phi) = E(U \mid C, D, Q)
        = \int u \, P(U = u \mid C, D, Q) \, du
        = \dots
        = \frac{\int u \, f(DCQ; D, u) \, \pi(u) \, du}{\int f(DCQ; D, u) \, \pi(u) \, du}.
\]

f(k; n, p) is the probability mass function of the Binomial distribution.
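The ratio of integrals above can be evaluated numerically. The Python sketch below does this with a midpoint rule, under two assumptions that the slide does not spell out: π(u) is read as a prior density on u and taken to be uniform on [0, 1], and the binomial arguments DCQ and D are supplied as integer counts k and n. The function names are hypothetical. Under a uniform prior the result has the closed form (k + 1)/(n + 2), which the sketch reproduces.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability mass function f(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_utility(k, n, prior=lambda u: 1.0, steps=10_000):
    """Posterior mean of u given k successes in n trials.

    Evaluates  (integral of u f(k; n, u) prior(u) du) / (integral of f(k; n, u) prior(u) du)
    over [0, 1] with the midpoint rule. The uniform default prior is an assumption.
    """
    num = den = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps          # midpoint of the i-th sub-interval
        w = binom_pmf(k, n, u) * prior(u)
        num += u * w
        den += w
    return num / den


print(round(expected_utility(3, 10), 6))  # 0.333333, i.e. (3 + 1) / (10 + 2)
```

With this reading, more successes for the same number of trials, or the same successes with fewer trials, both raise the expected utility.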

Property of Expected Utility

According to the formula for U(ϕ):

Theorem
For any ϕ1, ϕ2, if ϕ1 has higher support than ϕ2, denoted by S(ϕ1)/S(ϕ2) = ρ with ρ ≥ 1, and the confidence and dependent quality of ϕ1 are higher than those of ϕ2 as follows: C(ϕ1)/C(ϕ2) ≥ ρ and Q(ϕ1)/Q(ϕ2) ≥ 1/ρ, then we have U(ϕ1) ≥ U(ϕ2).

This conclusion verifies our intuition that higher support, confidence and dependent quality contribute to a larger expected utility.

Definition
The distance threshold determination problem is to find a distance threshold pattern ϕ for the dd on X → Y with the maximum expected utility U(ϕ).

Overview

The determination process for the maximum U(ϕ) has two steps:

(i) find the best ϕ[Y] for a given fixed ϕ[X];

(ii) find the desired ϕ[X] together with its best ϕ[Y].

Candidates of distance threshold patterns, e.g., for ϕ[Y]:

for each A ∈ Y, consider the search space of the distance threshold ϕ[A] from 0 to dmax.

enumerate all the distance thresholds ϕ[A] for all the dependent attributes A ∈ Y; the candidates form the set CY.

each node in the graph of candidates, such as <1, 1>, corresponds to a ϕ[Y] ∈ CY

Determination for Dependent Attributes (PA)

Given a fixed ϕ[X], find the corresponding best ϕ[Y] on the dependent attributes Y with the maximum U(ϕ).

The D(ϕ) value is the same for any ϕ with the same ϕ[X].

So we study the other two measures, C(ϕ) and Q(ϕ), in terms of their contributions to U(ϕ).

Theorem
Consider any two ϕ1, ϕ2 having the same D(ϕ1) = D(ϕ2) = D. If their confidence and dependent quality satisfy C(ϕ1)Q(ϕ1) ≥ C(ϕ2)Q(ϕ2), then we have U(ϕ1) ≥ U(ϕ2).

Hence, for a fixed ϕ[X], finding a ϕ with the maximum U(ϕ) is equivalent to finding the one with the maximum C(ϕ)Q(ϕ).
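A brute-force Python sketch of this step: for a fixed ϕ[X], score every candidate ϕ[Y] by C(ϕ)Q(ϕ) and keep the best. It assumes integer thresholds from 0 to dmax and the same assumed quality formula, Σ(dmax − ϕ[A]) / (|Y|·dmax), which the slides do not state explicitly. The function name and toy distances are hypothetical.

```python
from itertools import product


def best_phi_y(pairs, phi_x, n_dep, d_max):
    """Exhaustively score every phi[Y] in {0..d_max}^n_dep by C*Q for a fixed phi[X]."""
    # Keep only the Y-distances of pairs that satisfy phi[X]; D is fixed from here on.
    sat_x = [dy for dx, dy in pairs if all(d <= t for d, t in zip(dx, phi_x))]
    best, best_score = None, -1.0
    for phi_y in product(range(d_max + 1), repeat=n_dep):
        hits = sum(all(d <= t for d, t in zip(dy, phi_y)) for dy in sat_x)
        c = hits / len(sat_x) if sat_x else 0.0
        q = sum(d_max - t for t in phi_y) / (n_dep * d_max)  # assumed Q definition
        if c * q > best_score:
            best, best_score = phi_y, c * q
    return best, best_score


# Hypothetical pairs: (distances on X, distances on Y); the last one fails phi[X].
pairs = [((1,), (0,)), ((2,), (1,)), ((0,), (1,)), ((3,), (3,)), ((9,), (0,))]
print(best_phi_y(pairs, phi_x=(5,), n_dep=1, d_max=4))  # ((1,), 0.5625)
```

Every candidate costs one pass over the data to obtain C(ϕ), which is the expense the pruning rules on the following slides try to avoid.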

Dominant Relationship for Pruning

Pruning idea

Q(ϕ) is computed directly from a given ϕ[Y]

C(ϕ) is costly to compute, since it requires statistics from the data

the goal is to avoid evaluating C(ϕ) for all possible candidates

Definition
For any ϕ1, ϕ2, if ϕ1[A] ≥ ϕ2[A] for all A ∈ Z, then we say that ϕ1[Z] dominates ϕ2[Z], denoted by ϕ1[Z] ⋖ ϕ2[Z].

Any tuple pair satisfying ϕ2[Z] will always satisfy ϕ1[Z].

Lemma
For any two ϕ1, ϕ2 having ϕ1[X] = ϕ2[X] and ϕ1[Y] ⋖ ϕ2[Y], we have C(ϕ1) ≥ C(ϕ2) and Q(ϕ1) ≤ Q(ϕ2).

In a downward traversal of the candidates in the dominant graph,

the dependent quality increases from 0 to 1

the confidence decreases from 1 to 0
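The dominance test and the lemma can be illustrated with a few lines of Python. The names and toy distances are hypothetical, the pairs are assumed to already satisfy a common ϕ[X], and the quality formula is the same assumed definition, Σ(dmax − ϕ[A]) / (|Y|·dmax).

```python
def dominates(phi1, phi2):
    """phi1 dominates phi2 if it is at least as loose on every attribute."""
    return all(a >= b for a, b in zip(phi1, phi2))


def confidence(dists_y, phi_y):
    """Fraction of pairs (all assumed to satisfy phi[X]) that also satisfy phi[Y]."""
    hits = sum(all(d <= t for d, t in zip(dy, phi_y)) for dy in dists_y)
    return hits / len(dists_y)


def quality(phi_y, d_max):
    """Assumed dependent quality: 1 at equality, 0 when every threshold is d_max."""
    return sum(d_max - t for t in phi_y) / (len(phi_y) * d_max)


# Hypothetical Y-distances (two dependent attributes) of pairs satisfying phi[X].
dists_y = [(0, 1), (2, 2), (3, 0), (4, 4)]
loose, tight = (3, 2), (1, 2)

print(dominates(loose, tight))                                   # True
print(confidence(dists_y, loose), confidence(dists_y, tight))    # 0.75 0.25
print(quality(loose, 4), quality(tight, 4))                      # 0.375 0.625
```

The looser pattern has the higher confidence and the lower quality, as the lemma states.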

Pruning of Candidate Patterns (PAP)

Consider the current ϕi in the traversal of CY.

i) Pruning by ϕmax.

The first pruning opportunity comes from ϕmax, the best of the previously processed i − 1 candidates.

Let Vmax denote the maximum value of C(ϕ)Q(ϕ) among the first i − 1 candidates, i.e.,

\[ V_{\max} = \max_{j=1}^{i-1} C(\phi_j)\, Q(\phi_j) \]

The set S0 = {ϕk | Q(ϕk) ≤ Vmax, ϕk[Y] ∈ CY} can be pruned:

for any ϕk[Y] ∈ CY with Q(ϕk) ≤ Vmax,

C(ϕk)Q(ϕk) ≤ Q(ϕk) ≤ Vmax, since C(ϕk) ≤ 1,

and therefore U(ϕmax) ≥ U(ϕk).

ii) Pruning by ϕi.

The second pruning opportunity comes from the current ϕi in the i-th step.

The set S1 = {ϕk | ϕi ⋖ ϕk, Q(ϕk) ≤ Vmax / C(ϕi), ϕk[Y] ∈ CY} can be pruned:

for any ϕk[Y] ∈ CY with ϕi[Y] ⋖ ϕk[Y] and Q(ϕk) ≤ Vmax / C(ϕi),

ϕi[Y] ⋖ ϕk[Y] implies C(ϕk) ≤ C(ϕi),

so it follows that C(ϕk)Q(ϕk) ≤ C(ϕi)Q(ϕk) ≤ Vmax,

and we have U(ϕmax) ≥ U(ϕk).

Any ϕk in S0 or S1 can be safely pruned without computing C(ϕk).

Vmax is initialized to 0.
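The two pruning rules can be sketched in Python as a single traversal. This is an illustrative reading, not the authors' implementation: candidates are visited from loose to tight thresholds (descending threshold sum, which is one valid downward order of the dominance graph), the quality formula is the assumed Σ(dmax − ϕ[A]) / (|Y|·dmax), and the function name, arguments and toy data are hypothetical. The S1 test is written as Q(ϕk)·C(ϕi) ≤ Vmax to avoid dividing by a zero confidence.

```python
from itertools import product


def pap(dists_y, n_dep, d_max, v_max=0.0):
    """Find phi[Y] maximising C*Q, skipping candidates excluded by S0 or S1.

    dists_y: Y-distances of the tuple pairs that satisfy the fixed phi[X].
    Returns (best pattern or None, best C*Q, number of confidence evaluations).
    """
    def quality(phi):
        return sum(d_max - t for t in phi) / (n_dep * d_max)

    def confidence(phi):  # the costly step: one pass over the data
        hits = sum(all(d <= t for d, t in zip(dy, phi)) for dy in dists_y)
        return hits / len(dists_y)

    def dominates(p1, p2):
        return all(a >= b for a, b in zip(p1, p2))

    candidates = sorted(product(range(d_max + 1), repeat=n_dep), key=sum, reverse=True)
    best, evaluated, seen = None, 0, []   # seen holds (phi_i, C(phi_i)) pairs
    for phi in candidates:
        q = quality(phi)
        if q <= v_max:                    # S0: C*Q <= Q <= Vmax
            continue
        if any(dominates(p, phi) and q * c <= v_max for p, c in seen):
            continue                      # S1: C(phi) <= C(p), so C*Q <= Vmax
        c = confidence(phi)
        evaluated += 1
        seen.append((phi, c))
        if c * q > v_max:
            best, v_max = phi, c * q
    return best, v_max, evaluated


print(pap([(0,), (2,), (2,), (2,), (2,)], n_dep=1, d_max=4))  # ((2,), 0.5, 3)
```

In this toy run only three of the five candidates have their confidence computed: the pattern with threshold 4 falls into S0 and the pattern with threshold 0 falls into S1.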

Determination for Determinant Attributes (DA)

To find a ϕ with the maximum U(ϕ)

consider all possible distance threshold patterns of the determinant attributes X, say CX.

The straightforward approach is to compute the best ϕ[Y] for each ϕ[X] ∈ CX.

The most costly part is still the computation of the best ϕi[Y] for each ϕi[X], by either pa or pap.

To improve the pruning power of pap, we would like to find a larger pruning bound Vmax.

Pruning of Candidate Patterns

Pruning candidates with different ϕ[X ]

Theorem
Consider any two ϕ1, ϕ2 having D(ϕ1) ≥ D(ϕ2). If their confidence and dependent quality satisfy

\[ C(\phi_2)\, Q(\phi_2) \le 1 - \frac{D(\phi_1)}{D(\phi_2)} \bigl( 1 - C(\phi_1)\, Q(\phi_1) \bigr) \]

then we have U(ϕ1) ≥ U(ϕ2).

We can prune those ϕ2 whose C(ϕ2)Q(ϕ2) is no higher than 1 − (D(ϕ1)/D(ϕ2))(1 − C(ϕ1)Q(ϕ1)).

To apply this pruning bound, we require the precondition D(ϕ1) ≥ D(ϕ2).

Advanced Pruning Bound (DAP)

We process CX in descending order of D(ϕ) values.

Let ϕmax be the current result with the maximum expected utility after evaluating the first i − 1 candidates in CX.

For the next ϕi, we have D(ϕmax) ≥ D(ϕi).

An advanced pruning bound for computing ϕi[Y]:

\[ V_{\max} = 1 - \frac{D(\phi_{\max})}{D(\phi_i)} \bigl( 1 - C(\phi_{\max})\, Q(\phi_{\max}) \bigr) \]

in the original pap, Vmax is initialized to 0;

dap replaces this initial value with the bound above, which may be larger.
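The bound itself is a one-line computation. The Python sketch below evaluates it for a few hypothetical values of D and C·Q. The function name is hypothetical, and the clamp at 0 simply keeps the bound no smaller than the initial value 0 used by the basic pap.

```python
def dap_bound(d_best, cq_best, d_next):
    """Initial Vmax for the next phi[X] candidate.

    d_best:  D(phi_max) of the best pattern found so far
    cq_best: C(phi_max) * Q(phi_max) of that pattern
    d_next:  D(phi_i) of the candidate about to be processed
    """
    if not d_best >= d_next > 0:
        raise ValueError("requires D(phi_max) >= D(phi_i) > 0")
    # Never fall below 0, the initial value used by the basic pap.
    return max(0.0, 1.0 - (d_best / d_next) * (1.0 - cq_best))


print(round(dap_bound(0.5, 0.6, 0.4), 6))  # 0.5: a useful head start for pruning
print(round(dap_bound(0.4, 0.3, 0.4), 6))  # 0.3: equal D reduces to C*Q of phi_max
print(dap_bound(0.9, 0.1, 0.3))            # 0.0: negative bound clamped to 0
```

Processing CX in descending order of D guarantees the precondition checked at the top of the function.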

Analysis of Pruning

Practically, the worst case of dap is exactly the basic da, when working together with pap.

If the calculated bound Vmax is less than 0, we can simply assign 0 to it.

Once the bound is Vmax > 0, dap achieves a tighter pruning bound than the initial value of 0.

Theoretically, the theorem for advanced pruning is a generalization of the theorem for basic pruning:

when D(ϕ1) = D(ϕ2),

\[ 1 - \frac{D(\phi_1)}{D(\phi_2)} \bigl( 1 - C(\phi_1)\, Q(\phi_1) \bigr) = C(\phi_1)\, Q(\phi_1) \]

Our experiments also verify that dap+pap is no worse than da+pap.

Settings

Preprocessing of three real data sets

pre-compute the edit distances of all tuple pairs

store the distance results as up to 1,000,000 matching tuples

the proposed techniques are then evaluated on the prepared matching tuples

Distance thresholds are determined for the following rules:

Rule 1: cora(author, title → venue, year)

Rule 2: cora(venue → address, publisher, editor)

Rule 3: restaurant(name, address → city, type)

Rule 4: citeseer(address, affiliation, description → subject)

where Rule 2 has a larger Y while Rule 4 has a larger X.

Example Results

The results also verify our property analysis of expected utility:

higher support, confidence and dependent quality yield higher expected utility

e.g., U(ϕ2) ≥ U(ϕ4)

there does not exist any ϕ that has higher support, confidence and dependent quality at the same time than the returned ϕ1 with the maximum expected utility

the expected utility can reflect the usefulness in applications

      ϕ[X]           ϕ[Y]          Measures                        Violation Detection
      author  title  venue  year   S(ϕ)    C(ϕ)    Q(ϕ)  U(ϕ)    Precision  Recall  F-measure
ϕ1    4       1      3      1      0.1529  0.3760  0.80  0.2325  0.3725     0.5425  0.4418
ϕ2    5       2      3      1      0.1764  0.3667  0.80  0.2296  0.3718     0.6266  0.4667
ϕ3    5       1      3      2      0.1632  0.3774  0.75  0.2232  0.3179     0.4492  0.3723
ϕ4    4       2      3      2      0.1657  0.3657  0.75  0.2188  0.3073     0.4457  0.3638
ϕ5    4       1      4      2      0.1529  0.3852  0.70  0.2108  0.2654     0.3267  0.2928
ϕ6    5       2      5      2      0.1764  0.3985  0.65  0.2106  0.2459     0.3337  0.2831
fd    0       0      0      0      0.0064  0.3595  1.00  0.1064  0.4315     0.0735  0.1256
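As a sanity check on the table above, the Python sketch below recomputes two of its columns from the others. It assumes that F-measure is the usual harmonic mean 2PR/(P + R), and it assumes one possible definition of dependent quality, Q(ϕ) = Σ over A ∈ Y of (dmax − ϕ[A]) divided by |Y|·dmax, with dmax = 10. Neither that definition nor the value of dmax is stated on these slides; they are chosen here because they reproduce the Q column.

```python
D_MAX = 10  # assumption: chosen because it reproduces the Q column of the table

# (venue threshold, year threshold, Q, precision, recall, F-measure), copied from the table
rows = {
    "phi1": (3, 1, 0.80, 0.3725, 0.5425, 0.4418),
    "phi2": (3, 1, 0.80, 0.3718, 0.6266, 0.4667),
    "phi3": (3, 2, 0.75, 0.3179, 0.4492, 0.3723),
    "phi4": (3, 2, 0.75, 0.3073, 0.4457, 0.3638),
    "phi5": (4, 2, 0.70, 0.2654, 0.3267, 0.2928),
    "phi6": (5, 2, 0.65, 0.2459, 0.3337, 0.2831),
    "fd":   (0, 0, 1.00, 0.4315, 0.0735, 0.1256),
}


def quality(thresholds, d_max=D_MAX):
    """Assumed dependent quality: normalised closeness of phi[Y] to equality."""
    return sum(d_max - t for t in thresholds) / (len(thresholds) * d_max)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# For each row: does the recomputed Q match, and does the recomputed F match
# within the rounding error of four-decimal inputs?
checks = {
    name: (abs(quality((venue, year)) - q) < 1e-9,
           abs(f_measure(p, r) - f) < 2e-4)
    for name, (venue, year, q, p, r, f) in rows.items()
}
print(all(q_ok and f_ok for q_ok, f_ok in checks.values()))  # True
```

The F-measure comparison uses a small tolerance because the precision and recall inputs are themselves rounded to four decimals.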

Pruning Evaluation

Performance of da and dap for the determinant side X, and of pa and pap for the dependent side Y

dap and pap outperform da and pa, respectively

Rule 1 shows the best performance when applying dap+pap

the dap+pap approach provides a pruning bound that is no worse than that of da+pap

Rule 3 verifies that dap is no worse than da

[Figure: time cost (s) versus data size (100k to 1m) for Rule 1 and Rule 3, comparing DA+PA, DA+PAP and DAP+PAP]

Rule 2 has a larger Y while Rule 4 has a larger X

Rule 2, which has more attributes on the dependent side, may offer more opportunities for pruning by pap

pap achieves a significant improvement on Rule 2

for Rule 4, with a smaller Y, the improvement by pap is not as significant as for Rule 2

dap does help by providing an advanced pruning bound for pap

[Figure: time cost (s) versus data size (100k to 1m) for Rule 2 and Rule 4, comparing DA+PA, DA+PAP and DAP+PAP]

Conclusion

We study the problem of determining the distance thresholds for metric distance constraints

it is difficult to manually specify requirements on the various statistical measures

we conduct the determination in a parameter-free style

i.e., compute an expected utility of the distance threshold pattern and return the results with the maximum expected utility

several advanced pruning algorithms are developed to efficiently find the desired distance thresholds