Circumventing Data Quality Problems
Using Multiple Join Paths
Yannis Kotidis, Athens University of Economics and Business
Amélie Marian, Rutgers University
Divesh Srivastava, AT&T Labs-Research
9/11/2006 Amélie Marian - Rutgers University 2
Motivating Example
  [Figure: schema of four applications (Sales, Ordering, Provisioning, Inventory) whose fields TN, BAN, CustName, ORN, PON, SubPON, and CircuitID are linked by join edges]
  TN: Telephone Number
  ORN: Order Number
  BAN: Billing Account Number
  PON: Provisioning Order Number
  SubPON: Related PON
  What is the Circuit ID associated with a Telephone Number that appears in SALES?
Motivations
  Data applications with overlapping features
    Data integration
    Web sources
  Data quality issues (duplicate, null, default values, data inconsistencies)
    Data-entry problems
    Data integration problems
Contributions
  Multiple Join Path (MJP) framework
    Quantifies answer quality
    Takes corroborating evidence into account
    Agglomerative scoring of answers
  Answer computation techniques
    Designed for MJP scoring methodologies
    Several output options (top-k, top-few)
  Experimental evaluation on real data
    VIP integration platform
    Quality of answers
    Efficiency of our techniques
Outline
  Multiple Join Path Framework
    Problem Definition
  Our Approach
    Scoring Answers
    Computing Answers
  Experimental Evaluation
  Related Work
Multiple Join Path Framework: Problem Definition
  Query of the form: “Given X=a, find the value of Y”
  Examples:
    Given a telephone number of a customer, find the ID of the circuit to which the telephone line is attached. One answer expected.
    Given a circuit ID, find the names of customers whose telephones are attached to that circuit. Possibly several answers.
Schema Graph
  Directed acyclic graph
  Nodes are field names
  Intra-application edge
    Links fields in the same application
  Inter-application edge
    Links fields across applications
  All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications
Data Graph
  Given a specific value of the source node X, what are the values of the sink node Y?
  Considers all join paths from X to Y in the schema graph
  [Figure: example data graph; some join paths are marked as broken (e.g., no corresponding SALES.BAN)]
  Example: two paths lead to answer c1
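As an illustration of "considers all join paths from X to Y", the sketch below enumerates every source-to-sink path in a small schema graph with a depth-first traversal. The graph edges and the qualified field names (for example `ORDERING.ORN`) are hypothetical, loosely inspired by the motivating example; they are not the authors' actual schema, and `all_join_paths` is an illustrative helper name.

```python
from typing import Dict, List

# Hypothetical schema graph (assumption): each key is a field, each value
# lists the fields reachable through one intra- or inter-application edge.
SCHEMA: Dict[str, List[str]] = {
    "SALES.TN": ["INVENTORY.TN", "ORDERING.TN"],
    "INVENTORY.TN": ["INVENTORY.CircuitID"],
    "ORDERING.TN": ["ORDERING.ORN"],
    "ORDERING.ORN": ["PROVISIONING.ORN"],
    "PROVISIONING.ORN": ["PROVISIONING.PON"],
    "PROVISIONING.PON": ["INVENTORY.PON"],
    "INVENTORY.PON": ["INVENTORY.CircuitID"],
}


def all_join_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Enumerate every path from source to sink in a directed acyclic graph."""
    paths: List[List[str]] = []

    def visit(node: str, prefix: List[str]) -> None:
        prefix = prefix + [node]  # copy so sibling branches do not share state
        if node == sink:
            paths.append(prefix)
            return
        for nxt in graph.get(node, []):
            visit(nxt, prefix)

    visit(source, [])
    return paths


paths = all_join_paths(SCHEMA, "SALES.TN", "INVENTORY.CircuitID")
for p in paths:
    print(" -> ".join(p))
```

Because the schema graph is acyclic, the recursion terminates without a visited set; a graph with cycles would need one.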
Scoring Answers
  Which are the correct values?
    Unclean data
    No a priori knowledge
  Technique to score data edges
    What is the probability that the fields associated by the edge are correct?
  Probabilistic interpretation of data edge scores to score full join paths
    Edge score aggregation
    Independent of the length of the path
Scoring Data Edges
  Rely on functional dependencies (we are considering fields that are keys)
  Data edge scores model the error in the data
  Inter-application edge
    Score equals 1, unless approximate matching is used
  Intra-application edge
    Fields A and B within the same application
    A -> B (and symmetrically for B -> A):
    $$\mathrm{score}(a, b_i) = \frac{1}{|\{(a, b_i),\ i = 1, \ldots, n\}|}$$
    where the b_i are the values instantiated from querying the application with value a
    A -> B and B -> A:
    $$\mathrm{score}(a_i, b_j) = \frac{1}{|\{(a_i, B.*) \cup (A.*, b_j)\}|}$$
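A minimal sketch of intra-application edge scoring, under one reading of this slide: the score of an edge is the reciprocal of the number of competing pairs the application returns. For a dependency A -> B that means the distinct B values seen with a; when the dependency holds in both directions it means the distinct pairs that match a on A or b on B. The function names, the sample pairs, and the choice to return 0.0 when nothing matches are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # one (A value, B value) row returned by an application


def score_fd(a: str, pairs: Iterable[Pair]) -> float:
    """A -> B: 1 / number of distinct B values associated with a."""
    bs = {b for (x, b) in pairs if x == a}
    return 1.0 / len(bs) if bs else 0.0  # 0.0 for "no match" is an assumption


def score_bidirectional(a: str, b: str, pairs: Iterable[Pair]) -> float:
    """A -> B and B -> A: 1 / number of distinct pairs matching a on A or b on B."""
    matching = {(x, y) for (x, y) in pairs if x == a or y == b}
    return 1.0 / len(matching) if matching else 0.0


# Hypothetical rows: telephone number tn1 is recorded with two circuit IDs.
PAIRS = [("tn1", "c1"), ("tn1", "c2"), ("tn2", "c1")]

print(score_fd("tn1", PAIRS))                   # two candidate circuits for tn1
print(score_fd("tn2", PAIRS))                   # a single, unambiguous circuit
print(score_bidirectional("tn1", "c1", PAIRS))  # three pairs share tn1 or c1
```

With clean data every key maps to exactly one value and each score is 1; every conflicting row lowers the score of all edges involved in the conflict.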
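Scoring Data Paths
  A single data path is scored using a simple sequential composition of its data edge probabilities:
  $$\mathrm{pathScore} = \prod_{i=1}^{n} \mathrm{edgeScore}_i$$
    Example: path X -> a -> b -> Y with edge scores 0.5, 0.8, 0.6
    pathScore = 0.5 * 0.8 * 0.6 = 0.24
  Data paths leading to the same answer are scored using parallel composition (independence assumption):
  $$\mathrm{parallelpathScore} = \mathrm{pathScore}_1 + \mathrm{pathScore}_2 - \mathrm{pathScore}_1 \cdot \mathrm{pathScore}_2$$
    Example: a second path through c, with edge scores 0.4 and 0.5 (pathScore = 0.2), leads to the same answer
    pathScore = 0.24 + 0.2 - (0.24 * 0.2) = 0.392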
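The two compositions can be checked numerically with a short sketch. Sequential composition is the product of the edge scores along one path. Parallel composition of two independent paths, p1 + p2 - p1*p2, equals 1 - (1 - p1)(1 - p2), and the code uses that form so that it also accepts more than two paths. The extension beyond two paths and the function names are assumptions; the slide only shows the two-path case.

```python
import math
from typing import Iterable


def path_score(edge_scores: Iterable[float]) -> float:
    """Sequential composition: product of the edge scores along one path."""
    return math.prod(edge_scores)


def parallel_score(path_scores: Iterable[float]) -> float:
    """Parallel composition under the independence assumption.

    For two paths this is p1 + p2 - p1 * p2; the 1 - prod(1 - p) form
    generalizes it to any number of paths (an assumption, see lead-in).
    """
    return 1.0 - math.prod(1.0 - p for p in path_scores)


first = path_score([0.5, 0.8, 0.6])    # the slide's path X -> a -> b -> Y
second = path_score([0.4, 0.5])        # the slide's second path through c
combined = parallel_score([first, second])

print(round(first, 3), round(second, 3), round(combined, 3))
```

Each additional corroborating path can only raise the combined score, which is why the later slides note that answer scores can always be increased by new information.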
Identifying Answers
  Only interested in best answers
  Standard top-k techniques do not apply
    Answer scores can always be increased by new information
    We keep score range information
  Return top answers when identified, may not have complete scores
  Two return strategies
    Top-k
    Top-few (weaker stop condition)
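One way to picture "return top answers when identified" is a stop test over score ranges. In the sketch below each candidate answer carries a lower bound (the score from paths explored so far) and an upper bound (the best score it could still reach). The top-k set is considered identified once the k-th best lower bound is at least as large as every other candidate's upper bound. This stop condition, the function name, and the sample numbers are assumptions for illustration; the slides do not spell out the authors' exact conditions, answers not yet discovered are ignored here, and the weaker top-few condition is not modeled.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]  # (lower bound, upper bound) on an answer's score


def identified_top_k(ranges: Dict[str, Range], k: int) -> Optional[List[str]]:
    """Return the top-k answers if they are already separated from the rest.

    Returns None while some answer outside the current top-k could still
    overtake the k-th answer, or while fewer than k answers are known.
    """
    ranked = sorted(ranges, key=lambda ans: ranges[ans][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    if len(top) < k:
        return None
    kth_lower = ranges[top[-1]][0]
    best_rest_upper = max((ranges[ans][1] for ans in rest), default=0.0)
    return top if kth_lower >= best_rest_upper else None


# Hypothetical score ranges before and after more probes tighten c3's bound.
before = {"c1": (0.39, 0.60), "c2": (0.20, 0.35), "c3": (0.05, 0.50)}
after = {"c1": (0.39, 0.60), "c2": (0.20, 0.35), "c3": (0.05, 0.30)}

print(identified_top_k(before, 1))  # c3 could still reach 0.50, so undecided
print(identified_top_k(after, 1))   # c1 is returned with an incomplete score
```

In the second call c1 is returned even though its own score is still only known as a range, which matches the slide's remark that returned answers may not have complete scores.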
Computing Answers
  Take advantage of early pruning
    Only interested in best answers
  Incremental data graph computation
    Probes to each application
    Cost model is number of probes
  Standard graph searching techniques (DFS, BFS) do not take advantage of score information
  We propose a technique based on the notion of maximum benefit
Maximum Benefit
  Benefit computation of a path uses two components
    Known scores of the explored data edges
    Best way to augment an answer’s scores
      Uses residual benefit of unexplored schema edges
  Our strategy makes choices that aim at maximizing this benefit metric
VIP Experimental Platform
  Integration platform developed at AT&T
  30 legacy systems
  Real data
  Developed as a platform for resolving disputes between applications that are due to data inconsistencies
  Front-end web interface
VIP Queries
  Random sample of 150 user queries.
  Analysis shows that queries can be classified according to the number of answers they retrieve:
    noAnswer(nA): 56 queries
    anyAnswer(aA): 94 queries
      oneLarge(oL): 47 queries
      manyLarge(mL): 4 queries
      manySmall(mS): 8 queries
      heavyHitters(hH): 10 queries that returned between 128 and 257 answers per query
VIP Schema Graph
  Paths leading to an answer / paths leading to top-1 answer (94 queries)
Not considering all paths may lead to missing top-1 answers
Number of Parallel Paths Contributing to the Top-1 Answer
[Figure: histogram; x-axis: Number of Parallel Paths, y-axis: Frequency Count]
Average of 10 parallel paths per answer, 2.5 significant
Cost of Execution
Related Work
  Keyword Search in DBMS (BANKS, DBXPlorer, DISCOVER, ObjectRank)
    Query is set of keywords
    Top-k query model
    DB as data graph
    Do not agglomerate scores
  Top-k query evaluation (TA, MPro, Upper)
    Consider tuples as an entity
    Wait for exact answer (except for NRA)
    Do not agglomerate scores
  Probabilistic ranking of DB results
    Queries not selective, large answer set
  We take corroborative evidence into account to rank query results
Conclusion
  Multiple Join Path Framework
    Uses corroborating evidence to identify high quality results
    Looks at all paths in the schema graph
  Scoring mechanism
    Probabilistic interpretation
    Takes schema information into account
  Techniques to compute answers
    Take into account agglomerative scoring
    Top-k and top-few