TDD: Topics in Distributed Databases
(Querying and cleaning big data)
Wenfei Fan
University of Edinburgh
What is big data?
Big data: What is it anyway?
Everyone talks about big data. But what is it?
A departure from our familiar data management!
Volume: horrendously large
• PB (10^15 B)
• EB (10^18 B)
Variety: heterogeneous, semi-structured or unstructured
• 9:1 ratio of unstructured vs. structured data
• covering 95% of restaurants requires at least 5,000 sources
Velocity: dynamic• think of the Web and Facebook, …
Veracity: trust in its quality• real-life data is typically dirty!
cf. Online Ordering of Overlapping Data Sources, PVLDB 7(3), 2013, Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras
Why is the data so big?
Big data is a relative notion: 1TB is already too big for your laptop
Worldwide information volume is growing annually at a minimum rate of 59%
A single jet engine produces 20TB (10^12 B) of data per hour
Facebook has 1.38 billion users, 140 billion links, about 300 PB of data
Genome of human: sampling, biochemistry, immunology, imaging, genetic, phenotypic data
• 1 person: 1 PB (10^15 B)
• 1,000 people: 1 EB (10^18 B)
• 1 billion people: 1 YB (10^24 B)
Gartner 2011
Why do we care about big data?
Example: Medicare
A new game: large number of data sources of big volume
Nature, 2009
Big data is needed everywhere
The world is becoming data-driven, like it or not!
Social media marketing:
• 78% of consumers trust peer (friend, colleague and family member) recommendations; only 14% trust ads
• if three close friends of person X like items P and W, and X also likes P, then chances are that X likes W too
Social event monitoring:
• prevent terrorist attacks
• The Net Project, Shenzhen, China (Audaque)
Scientific research:
• a new and more effective way to develop theory, by exploring and discovering correlations among seemingly disconnected factors
The big data market is BIG
Big Data: The next frontier for innovation, competition and productivity
• US health care: increase industry value per year by $300B
• US retail: increase net margin by 60+%
• Manufacturing: decrease development and assembly costs by 50%
• Global personal location data: increase service provider revenue by $100B
• Europe public sector administration: increase industry value per year by 250B Euro
McKinsey Global Institute, May 2011
Why study big data?
Want to find a job?
• Research and development of big data systems: ETL, distributed systems (e.g., Hadoop), visualization tools, data warehouses, OLAP, data integration, data quality control, …
• Big data applications: social marketing, healthcare, …
• Data analysis: getting value out of big data; discovering and applying patterns, predictive analytics, business intelligence, privacy and security, …
Prepare you for
• graduate study: current research and practical issues
• the job market: skills/knowledge in demand
Big data = Big $$$
complexity theory, distributed databases, query answering, algorithms, data quality
What challenges are introduced by big data?
Big data: Through the eyes of computation
Computer science is, at its core, about the computation of a function f(x)
Big data: the data parameter x is horrendously large: PB or EB
Are these true?
What is the challenge introduced to query answering?
Fallacies:
• Big data introduces no fundamental problems
• Big data = MapReduce (Hadoop)
• Big data = data quantity (scalability)
Flashback: Relational queries
Questions: What is a relational schema? A relation? A relational database? What is a query? What is relational algebra? What does relational completeness mean? What is a conjunctive query?
(Figure: a DBMS stores data in DB and processes queries, answers and updates)
The bible for database researchers: Foundations of Databases
Traditional database management systems
A database is a collection of data, typically containing the information about one or more related organizations.
A database management system (DBMS) is a software package designed to store and manage databases.
Database: local. DBMS: centralized; single processor (CPU); managing local databases (single memory, disk)
Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013
• friend(pid1, pid2)
• person(pid, name, city)
• dine(pid, rid, dd, mm, yy)
SQL query (in fact, a conjunctive query, or an SPC query)
select rid
from friend, person, dine
where friend.pid1 = p0 and friend.pid2 = person.pid and
      friend.pid2 = dine.pid and person.city = 'NYC' and dine.yy = 2013
Is it feasible on big data?
Facebook: more than 1.38 billion nodes, and over 140 billion links
Example queries: Graph pattern matching
Input: A pattern graph Q and a graph G
Output: All the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
Applications
• pattern recognition
• intelligence analysis
• transportation network analysis
• Web site classification
• social position detection
• user-targeted advertising
• knowledge base disambiguation
…
a bijective function f on nodes:
(u,u’ ) ∈ Q iff (f(u), f(u’)) ∈ G
What other graph queries do you know?
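Under this definition, a brute-force matcher can be sketched in a few lines of Python (the function and variable names are illustrative, not from the course; enumerating all injective node mappings is exponential in the pattern size, which is exactly why this is infeasible on large graphs):

```python
from itertools import permutations

def pattern_matches(q_edges, g_edges):
    """All matches of pattern Q in graph G via subgraph isomorphism:
    an injective f on Q's nodes with (u, u') in Q iff (f(u), f(u')) in G,
    checked over the image of f. Brute force: exponential in |Q|."""
    q_nodes = sorted({n for e in q_edges for n in e})
    g_nodes = sorted({n for e in g_edges for n in e})
    q_set, g_set = set(q_edges), set(g_edges)
    found = []
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))  # a candidate injective mapping
        if all(((u, v) in q_set) == ((f[u], f[v]) in g_set)
               for u in q_nodes for v in q_nodes if u != v):
            found.append(f)
    return found

# A directed triangle matched against itself yields its 3 rotations:
triangle = {("p", "q"), ("q", "r"), ("r", "p")}
print(len(pattern_matches(triangle, {(1, 2), (2, 3), (3, 1)})))  # 3
```

Even this tiny enumerator visits |G|!/(|G|-|Q|)! candidate mappings; on a graph with billions of nodes the search space is astronomically large, hence the NP-completeness discussion on the following slides.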
Find all matches of a pattern in a graph
Graph pattern matching
Identify suspects in a drug ring
“Understanding the structure of drug trafficking organizations”
pattern graph
(Figure: the pattern graph, with nodes labeled B, A1 … Am, S, and W)
Is this feasible? Facebook: more than 1.38 billion nodes, and over 140 billion links
Querying big data: New challenges
A departure from classical theory and traditional techniques
Given a query Q and a dataset D, compute Q(D)
What are new challenges introduced by querying big data?
Does querying big data introduce new fundamental problems?
What new methodology do we need to cope with the sheer size of big data D?
(Figure: computing Q(D) over a traditional database vs. over big data D of PB or EB size)
Why?
The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Polynomial time queries become intractable on big data!
What happens when it comes to big data?
Using an SSD of 6GB/s, a linear scan of a data set D would take:
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
How long does it take?
What query is this?
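These back-of-the-envelope figures (and the 10,000-SSD variants a few slides later) are plain arithmetic; the sketch below reproduces them under the slide's idealized assumptions of a sustained 6GB/s scan rate and perfectly even data partitioning:

```python
SECS_PER_DAY = 86_400
SECS_PER_YEAR = 365 * SECS_PER_DAY
PB, EB = 1e15, 1e18  # sizes in bytes

def scan_seconds(num_bytes, rate_bps=6e9, drives=1):
    """Idealized linear-scan time: total bytes over aggregate bandwidth,
    assuming the data is evenly partitioned across `drives` SSDs."""
    return num_bytes / (rate_bps * drives)

print(scan_seconds(PB) / SECS_PER_DAY)         # ~1.93 days
print(scan_seconds(EB) / SECS_PER_YEAR)        # ~5.28 years
print(scan_seconds(PB, drives=10_000))         # ~16.7 seconds
print(scan_seconds(EB, drives=10_000) / 3600)  # ~4.63 hours
```

Note that the 1EB figure with 10,000 drives works out to hours, not days: 5.28 years divided by 10,000 is roughly 4.6 hours.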
Tractability revisited for big data
BD-tractable queries: properly contained in P unless P = NC
(Figure: complexity classes NP and beyond, and P; within P, BD-tractable queries sit at the parallel polylog-time level, while the rest of P is not BD-tractable)
Yes, querying big data comes with new and hard fundamental problems
Challenges: query evaluation is costly
Already beyond reach in practice when the data is not very big
Graph pattern matching by subgraph isomorphism
• NP-complete to decide whether there exists a match
• possibly exponentially many matches
intractable even in the traditional complexity theory
Membership problem for relational queries
Input: a query Q, a database D, and a tuple t
Question: Is t in Q(D)?
• NP-complete if Q is a conjunctive query (SPC)
• PSPACE-complete if Q is in relational algebra (SQL)
What is the complexity?
(Figure: a shared-nothing parallel architecture: 10,000 processors P, each with its own memory M and database DB, connected by an interconnection network)
Using 10,000 SSDs of 6GB/s, a linear scan of D might take:
• 1.9 days / 10,000 ≈ 16 seconds when D is of 1PB (10^15 B)
• 5.28 years / 10,000 ≈ 4.6 hours when D is of 1EB (10^18 B)
Only ideally!
Is it still feasible to query big data?
Can we do better if we are given more resources?
Parallel and distributed query processing – TDD
Yes, parallel query processing. But how?
The two sides of a coin
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
• What capacity must a system provide to cope with the sheer size of the data?
• Is a query feasible on big data within our available resources?
• How can we make our queries tractable on big data?
. . .
The study of data quality is as important as data quantity
Can we trust the answers to our queries?
Dirty data routinely leads to misleading financial reports and strategic business planning decisions; loss of revenue, credibility and customers; and disastrous consequences
Veracity!
Data consistency
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC
Q1: how many employees are in the NY office?
3 may not be the correct answer: the AC and city in the first tuple are inconsistent!
Error rates: 10% - 75% (telecommunication)
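The inconsistency in the first tuple can be caught mechanically with a rule of the form "area code determines city"; a minimal sketch, where the rule table (including 908 being a New Jersey code, abbreviated NJ here) is an assumption hard-coded for illustration:

```python
# Sample tuples from the slide: (FN, LN, address, AC, city)
employees = [
    ("Mary",   "Smith",  "2 Small St", "908", "NYC"),
    ("Mary",   "Dupont", "10 Elm St",  "610", "PHI"),
    ("Mary",   "Dupont", "6 Main St",  "212", "NYC"),
    ("Bob",    "Luth",   "8 Cowan St", "215", "PHI"),
    ("Robert", "Luth",   "6 Drum St",  "212", "NYC"),
]

# Assumed rule: AC functionally determines city (908 is a NJ code)
AC_TO_CITY = {"212": "NYC", "215": "PHI", "610": "PHI", "908": "NJ"}

def violations(tuples, rule):
    """Tuples whose city conflicts with the city their AC determines."""
    return [t for t in tuples if rule.get(t[3], t[4]) != t[4]]

print(violations(employees, AC_TO_CITY))  # flags only the first tuple
```

With the rule in place, counting NYC employees over the repaired data no longer includes the first tuple, which is why 3 may not be the correct answer to Q1.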
Information completeness
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC
Q2: how many distinct employees have first name Mary?
3 may not be the correct answer:
• The first three tuples refer to the same person
• The information may be incomplete
“information perceived as being needed for clinical decisions was unavailable 13.6%-81% of the time” (2005)
Data currency
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q3: what is Mary’s current salary?
In a customer file, within two years about 50% of records may become obsolete (2002)
Entities: Mary, Robert
80k: in the real world, salary is monotonically increasing
Consistent, complete, and once correct
Data fusion
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q4: what is Mary’s current last name?
Deduce the true values of an entity
Dupont
In real life:
• marital status only changes from single → married → divorced
• tuples with the most current marital status also have the most current last name
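The deductions on this slide and the previous one can be sketched by hard-coding the slide's two assumed rules (marital status only moves forward along single → married → divorced, and salary is monotonically increasing); the entity and its values are from the running example:

```python
# Tuples believed to refer to the same real-world entity, "Mary"
mary = [
    {"LN": "Smith",  "salary": 50, "status": "single"},
    {"LN": "Dupont", "salary": 50, "status": "married"},
    {"LN": "Dupont", "salary": 80, "status": "married"},
]

# Assumed currency order on marital status
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

def current_salary(tuples):
    # salary only grows over time, so the most current value is the max
    return max(t["salary"] for t in tuples)

def current_last_name(tuples):
    # the tuple with the most current status carries the current name
    return max(tuples, key=lambda t: STATUS_ORDER[t["status"]])["LN"]

print(current_salary(mary), current_last_name(mary))  # 80 Dupont
```

Both answers fall out of the partial orders alone; no tuple has to be marked as "the" current one in advance.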
Data in real-life is often dirty
Dirty data: inconsistent, inaccurate, incomplete, stale
500,000 dead people retain active Medicare cards
Pentagon asked 200+ dead officers to re-enlist
81 million National Insurance numbers but only 60 million eligible citizens
Data error rates in industry: 1% - 30% (Redman, 1998)
98000 deaths each year, caused by errors in medical data
Dirty data are costly
Poor data cost US businesses $611 billion annually
Erroneously priced data in retail databases cost US customers $2.5 billion each year
1/3 of system development projects were forced to delay or cancel due to poor data quality
30%-80% of the development time and budget for data warehousing goes to data cleaning
CIA dirty data about WMD in Iraq!
The scale of the data quality problem is far worse on big data!
Can we trust answers to our queries over dirty data?
What does this course cover?
Big data = quantity + quality
Volume (quantity) Veracity (quality)
Basic topic 1: Parallel database management systems
Recall traditional DBMS:
• Database: “single” memory, disk
• DBMS: centralized; single processor (CPU)
Can we do better provided with multiple processors?
Parallel DBMS: exploiting parallelism
• improve performance
• reliability and availability
(Figure: a parallel DBMS: processors P, each with memory M and database DB, connected by an interconnection network; cf. MapReduce)
Basic topic 2: Distributed databases
Data is stored at several sites, each with an independent DBMS
• local ownership: data physically stored across different sites
• increased availability and reliability
• performance
(Figure: a distributed database: multiple sites, each with its own DBMS, DB and local schema, connected by a network; queries are answered through a global schema. Cf. cloud computing)
Advanced topic 1: MapReduce
A programming model with two primitive functions:
Applications in cloud computing
Connection between MapReduce and parallel query processing
Other parallel programming models
• BSP (Bulk Synchronous Parallel)
• Vertex-centric
• Partial evaluation
Map: <k1, v1> → list(k2, v2)
Reduce: <k2, list(v2)> → list(k3, v3)
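A single-process word-count sketch of this model (the runner below simulates the map, shuffle and reduce phases in memory; a real MapReduce engine distributes each phase across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):           # <k1, v1> -> list(k2, v2)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):        # <k2, list(v2)> -> list(k3, v3)
    return [(word, sum(counts))]

def run_mapreduce(inputs, mapper, reducer):
    # map phase: apply the mapper to every input pair
    pairs = [kv for k1, v1 in inputs for kv in mapper(k1, v1)]
    pairs.sort(key=itemgetter(0))   # shuffle phase: group by key k2
    out = []
    for k2, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(k2, [v for _, v in group]))
    return out

print(run_mapreduce([("d1", "big data big deal")], map_fn, reduce_fn))
# [('big', 2), ('data', 1), ('deal', 1)]
```

The sort-then-group step is the in-memory stand-in for the shuffle: in a real engine each reducer receives all values sharing a key k2, regardless of which mapper emitted them.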
Advanced topic 2: Querying big data
Foundations for querying big data
Tractability revisited for querying big data
Parallel scalability
Bounded evaluability of queries
Querying big data: theory and practice
Techniques for querying big data
Develop parallel algorithms for querying big data
Bounded evaluability and access constraints
Query preserving compression
Query answering using views
Bounded incremental query processing
Central issues for data quality
• Object identification (data fusion): do two objects refer to the same real-world entity? What is the true value of the entity?
• Data consistency: do our data values have conflicts?
• Data accuracy: is one value more accurate than another for a real-world entity?
• Data currency: is our data out of date?
• Information completeness: does D have enough information to answer our queries?
Make our data consistent, accurate, complete and up to date!
Big data = quantity + quality!
TDD: the Veracity of big data
Advanced topic 3: Data quality management
Advanced topic 4: Dependencies as data quality rules
Fundamental problems for data quality rules:
• consistency: are the data quality rules “dirty” themselves?
• implication: can we optimize the rules by removing redundant ones?
A revision of classical dependencies
Data quality rules:
• conditional (functional and inclusion) dependencies to capture data inconsistencies
• matching dependencies for record matching
Data consistency: do our data values have conflicts?
• there are also quality rules for data accuracy, data currency and information completeness – in the textbook
A uniform logic framework for improving data quality
(Figure: the data cleaning cycle: discover rules, reasoning, detect errors, repair)
Advanced topic 5: Data cleaning
Semi-automated systems for improving data quality
• Discover data quality rules
• Validate the rules discovered
• Detect errors with rules
• Repair data with rules
• Certain fixes
• Deduce the true values of entities
Putting it together
Basic technology
• Parallel DBMS: architectures, data partitioning, (intra/inter-) operator parallelism, parallel query processing and optimization
• Distributed DBMS: architectures, fragmentation, replication
Advanced topics
Big data: the Volume
– MapReduce and other parallel programming models
– Querying big data: theory and practice
Big data: the Veracity
– Central issues for data quality
– Dependencies as data quality rules
– Cleaning distributed data: rule discovery, rule validation, error detection, data repairing, certain fixes
Prerequisites
Volume (quantity), Veracity (quality)
• Variety (entity resolution, conflict resolution)
• Velocity (incremental computation)
relational algebra/SQL, query processing, basic complexity and algorithmic background (e.g., NP, undecidability)
Course format
Basic information
Web site:
http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
– Syllabus
– Announcements
– Lecture notes
– Deadlines
TA: Chao Tian– [email protected]
Office hours:– Informatics Forum 5.23, 11:00-12:00, Thursday
Course format
Seminar course: there will be no exam!
– Lectures: background
http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
– Textbooks:
R. Ramakrishnan, J. Gehrke: Database Management Systems. WCB/McGraw-Hill 2003 (3rd edition). Chap 22
Database System Concept, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012 (Chapters 1-4; e-copy available upon request)
– Research papers or chapters related to the topics (3-4 each)
• at the end of ln3-ln8
Grading
Reviews of research papers (8 in total): 40%
Project (report): 45%
Project presentation: 15%
Homework:
Four sets of homework, starting from week 4; deadlines:
• 9am, Thursday, February 5, week 4
• 9am, Thursday, February 19, week 6
• 9am, Thursday, March 5, week 8
• 9am, Thursday, March 19, week 10
– Papers: choose two each time (two reviews) – not chapters
– 5% for each paper, and 10% for each homework
down from 12 in 2012
Review Evaluation
Pick 2 research papers each time from the lecture notes to be covered in the next two weeks, starting from Week 4.
Write a one-page review for each of the papers (10 marks)
Summary: 2 marks
• a clear problem statement: input, question/output
• the need for this line of research: motivation
• a summary of key ideas, techniques and contributions
Evaluation: 5 marks
– criteria for the line of research (e.g., expressive power, complexity, accuracy, scalability, etc.)
– evaluation based on your criteria; justify your evaluation
• 3 strong points
• 3 weak points
Suggest possible extensions: 3 marks
Project – Research and development (recommended)
Research and development:
– Topic: pick one from the lecture notes (ln3 – ln8)
Example: A MapReduce algorithm for graph simulation
Development:
– pick a research paper from the reading list of ln3 – ln8
– implement its main algorithms
– conduct its experimental study
Multiple people may work on the same project independently
You are encouraged to come up with your own project – talk to me first
Start early!
Grading – design and development
Distribution:
– algorithms: technical depth, performance guarantees 20%
– proofs of correctness, complexity analysis and performance guarantees of your algorithms 15%
– justification (experimental evaluation) 10%
Report: in the form of a technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustrated via intuitive examples
– Correctness/complexity/property proofs
– Experimental evaluation
– Possible extensions
Project – survey
Topic: pick one topic from a lecture note (ln3 – ln8)
Example: techniques for conflict resolution
Distribution:
– select 5-6 representative papers, independently 10%
– develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria 10%
– evaluate each of the papers based on your criteria 15%
– a table summarizing the assessment based on your criteria; draw and justify your conclusion and recommendation for various applications 10%
Sample survey: A Brief Survey of Automatic Methods for Author Name Disambiguation
Find and download it from Google
Your understanding of the topic
Project report and presentation – 15%
A clear problem statement
Motivation and challenges
Key ideas, techniques/approaches
Key results – what you have got, intuitive examples
Findings/recommendations for different applications
Demonstration: a must if you do a development project
Presentation: question handling (show that you have developed a good understanding of the line of work)
Learn how to present your work
Summary and Review
What is big data?
What is the volume of big data? Variety? Velocity? Veracity?
Why do we care about big data?
Is there any fundamental challenge introduced by querying big data?
Why study data quality?
What is consistency? Information completeness? Data currency? Data accuracy? Object identification?
Reading list
For next week, parallel databases, before the next lecture:
– Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22
– Database System Concept, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
About relational databases:
– Foundations of Databases, S. Abiteboul, R. Hull, V. Vianu
About big data
– W. Fan and J. Huai. Querying Big Data: Theory and Practice, JCST 2014
http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf