1 Bellman: A Data Quality Browser Tamraparni Dasu Theodore Johnson S. Muthukrishnan Vladislav...
-
date post
20-Dec-2015 -
Category
Documents
-
view
216 -
download
0
Transcript of 1 Bellman: A Data Quality Browser Tamraparni Dasu Theodore Johnson S. Muthukrishnan Vladislav...
1
Bellman: A Data Quality Browser
Tamraparni DasuTheodore JohnsonS. Muthukrishnan
Vladislav ShkapenyukAmit Marathe
Contact: [email protected]
2
Exploring Large and Complex Databases
• AT&T has a lot of large databases.– Many types of offerings (business/consumer/hosting, voice:
LD/local, data: frame/ip/vpn/voip, etc).– Many aspects of providing a service
(sales/provisioning/billing/network maintenance/customer care).– Many databases, each with 100’s to 1000’s of tables.
• These databases are complex– Legacy systems and procedures (AT&T is an old company).– AT&T’s scale requires local autonomy.
• Different conventions and encodings in different domains
– Complex interdependencies.• A user’s order touches many databases at some point or other.
• A customer may order many types of services, etc.
3
Our Problem• We provide special consulting services to AT&T
Business units.– Data mining studies.– Targeted data cleanup.– Help with large scale database integration and
cleanup efforts.
• “Special services” means that we are always looking at new data sets.– ODBC access to databases.– Web scraping– Delimited ascii dumps.
• We need help …
4
What is Bellman?• A “data quality browser”• A platform for integrating both simple and
advanced data browsing tools.• A platform for integrating data browsing and data
cleanup tools.• A platform for browsing and correlating multiple
disparate databases and data sets.
• Current status: the platform and the tools are in good shape, the UI and integration could use some work.
5
Database Profiling
• Collect summaries of the tables, fields, and data in the databases.
• Store in a supplementary database.• Use these summaries to
– Supplement database browsing.– Answer sophisticated queries about database
structure.– Support data mining on the database structure.
6
Bellman History
• Java-Bellman (first version)– Java/JDBC client software.– Collect profiles using C++/ODBC
• Oracle, Sybase, Teradata, Informix, SQL Server
– Problem : installation is difficult, especially to configure the ODBC drivers.
• Web-Bellman (current version)– ASP .NET/C# front end, C++ profiling.– No client software installation required.– Centralized management of ODBC drivers
and some advanced tools.
7
Bellman Screen Shots
• Bellman is best shown through some examples.
• Problem: – Its only interesting with real databases.– But the data is AT&T Proprietary
• We’ll do our best to show Bellman without getting ourselves fired.
8
Development site
Bellman is the browser
Spider is an applicationof our approximate stringmatching tools.
9
Database Connections
Ted’s account is configured to access these databases.Profile queries are restricted to them.
12
Find Similar
• What other fields in the database contain similar data?– Many values in common (exact match) : resemblance– “Textually similar”
• Many q-grams in common• Distribution of q-grams is similar.
– The fields have a similar name.
• Uses– Finding join paths– Finding join paths which require field transforms– Finding other fields with names/IDs/etc.
19
Algorithms
• Field similarity by exact match– Min Hash signatures to compute resemblance
• Field similarity by substring similarity– Resemblance of q-grams– Distribution of q-grams
• Efficient key finding
20
Set Resemblance
• The resemblance of two sets A, B is, |A
• After some algebra, we find that |A,(|A|+|B|)/(1+
,)• There are fast algorithms to compute …• Application to Bellman
– The sets in question are the unique values in the fields
– The resemblance of two fields is a good measure of their similarity.
21
Min Hash Signatures• h(a) is a hash function.• m(A) = min(h(a), a in A).
• Observation:Pr[m(A) = m(B)] = ,
• Why?– Universe restricted to AB– Equality only in AB
• Repeat the experiment many times.
• Signature:(m1(A), m2(A), …, mk(A))
A
B
ignoredm(A)<m(B)
m(A)>m(B)
m(A)=m(B)
22
Min Hash Implementation
• (Schema, Table, Field) maps to unique ID• 256 hash functions• Bellman_Field_Sig table has the schema
(ID, HashNum, MinHash)• MinHash is a 32-bit integer (collisions become an issue
for large sets)
23
Min Hash Implementation (cont.)
• SQL to find fields similar to field w/ ID 23 select S2.ID, count(*)
from Bellman_Field_Sig S1, Bellman_Field_Sig S2
where S1.ID=23 and
S1.HashNum = S2.HashNum and S1.MinHash=S2.MinHash
group by S2.ID
order by count(*) desc
24
Textual similarity
• Q-gram : collection of consecutive q-letter substrings.
• Example: 3-grams of “Bellman”– Bel, ell, llm, lma, man
• Q-gram resemblance : compute signatures of the set of q-grams of a field instead of field values.
• Q-gram similarity : L2 distance of the frequency distribution of the q-grams of a field.– Use sketches to store a compact approximate
summary of the q-gram distribution.
25
Q-gram Sketches
• V is a d dimensional vector• Sk(V) is the k dimensional vector (V.X1, V.X2, …, V.Xk) where Xi are random d dimensional vectors• Typically, k << d• L2(V1, V2) is approximated by L2(Sk(V1), Sk(V2))
26
Q-gram Sketch Implementation
• 256 random sketch vectors• Bellman_Field_QSketch
(ID, SkNum, SkVal)• Tuples correspnding to a particular ID constitute the
normalized sketch vector for that field• SkVal has datatype float
27
Q-gram Sketch Implementation
• SQL to find fields similar to field w/ ID 23 select S2.ID, sum((S1.SkVal – S2.SkVal) ** 2)
from Bellman_Field_QSketch S1, Bellman_Field_QSketch S2
where S1.ID = 23
and S1.SkVal = S2.SkVal
group by S2.ID
order by sum((S1.SkVal – S2.SkVal) ** 2) desc
28
Finding Keys
• Key a (minimal) set of fields whose composite value is unique in every record.
• Problem: finding all keys of length up to k in a table with F fields requires O(Fk) expensive count distinct queries.
• Solution:– Eliminate “bad” fields : floats, mostly NULL, etc.– Collect an in-memory sample
• Can store hashes of long strings.
– Compute count distinct queries on the sample• Verify keys by query against database
– Use Tane-style level-wise pruning.
29
Finding Keys (cont.)
• Upto 4-way keys found by Bellman• Candidate-1 keys are all the “good” fields• In iteration i+1, consider the cross product of candidate-i
keys and candidate-1 keys• Classify each set of i+1 fields so obtained as an
approximate key, candidate-(i+1) key or a functional dependency.
• Thresholds: key_threshold, min_improvement