TDD: Topics in Distributed Databases
(Querying and cleaning big data)
Wenfei Fan
University of Edinburgh
What is big data?
Big data: What is it anyway?
Everyone talks about big data. But what is it?
A departure from our familiar data management!
Volume: horrendously large
• PB (10^15 B)
• EB (10^18 B)
Variety: heterogeneous, semi-structured or unstructured
• 9:1 ratio of unstructured vs. structured data
• covering 95% of restaurants requires at least 5,000 sources
Velocity: dynamic• think of the Web and Facebook, …
Veracity: trust in its quality• real-life data is typically dirty!
cf. Online Ordering of Overlapping Data Sources, PVLDB 7(3), 2013, Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras
Why is the data so big?
Big data is a relative notion: 1TB is already too big for your laptop
Worldwide information volume is growing annually at a minimum rate of 59%
A single jet engine produces 20TB (10^12 B) of data per hour
Facebook has 1.38 billion users, 140 billion links, about 300 PB of data
Genome of human: sampling, biochemistry, immunology, imaging, genetic, phenotypic data
• 1 person: 1 PB (10^15 B)
• 1,000 people: 1 EB (10^18 B)
• 1 billion people: 1 YB (10^24 B)
Gartner 2011
Why do we care about big data?
Example: Medicare
A new game: large number of data sources of big volume
Nature, 2009
Big data is needed everywhere
The world is becoming data-driven, like it or not!
Social media marketing:
• 78% of consumers trust peer (friend, colleague and family member) recommendations; only 14% trust ads
• if three close friends of person X like items P and W, and X also likes P, then chances are that X likes W too
Social event monitoring:
• prevent terrorist attacks
• The Net Project, Shenzhen, China (Audaque)
Scientific research:
• a new and more effective way to develop theory, by exploring and discovering correlations among seemingly disconnected factors
The big data market is BIG
Big Data: The next frontier for innovation, competition and productivity
• US health care: increase industry value per year by $300B
• US retail: increase net margin by 60+%
• Manufacturing: decrease development and assembly costs by 50%
• Global personal location data: increase service provider revenue by $100B
• Europe public sector administration: increase industry value per year by 250B Euro
McKinsey Global Institute, May 2011
Why study big data?
Want to find a job?
• Research and development of big data systems: ETL, distributed systems (e.g., Hadoop), visualization tools, data warehouses, OLAP, data integration, data quality control, …
• Big data applications: social marketing, healthcare, …
• Data analysis: getting value out of big data; discovering and applying patterns, predictive analytics, business intelligence, privacy and security, …
Prepare you for
• graduate study: current research and practical issues
• the job market: skills/knowledge in demand
Big data = Big $$$
complexity theory, distributed databases, query answering, algorithms, data quality
What challenges are introduced by big data?
Big data: Through the eyes of computation
Computer science is, at its core, about the computation of a function f(x)
Big data: the data parameter x is horrendously large: PB or EB
Are these true?
What is the challenge introduced to query answering?
Fallacies:
• Big data introduces no fundamental problems
• Big data = MapReduce (Hadoop)
• Big data = data quantity (scalability)
Flashback: Relational queries
Questions: What is a relational schema? A relation? A relational database? What is a query? What is relational algebra? What does relational completeness mean? What is a conjunctive query?
(Figure: a DBMS stores data in DB and processes queries, answers and updates)
The bible for database researchers: Foundations of Databases
Traditional database management systems
A database is a collection of data, typically containing the information about one or more related organizations.
A database management system (DBMS) is a software package designed to store and manage databases.
Database: local. DBMS: centralized; single processor (CPU); managing local databases (single memory, disk)
Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013
• friend(pid1, pid2)
• person(pid, name, city)
• dine(pid, rid, dd, mm, yy)
SQL query (in fact, a conjunctive query, or an SPC query)
select rid
from friend, person, dine
where friend.pid1 = p0 and friend.pid2 = person.pid and
      friend.pid2 = dine.pid and person.city = 'NYC' and dine.yy = 2013
Is it feasible on big data?
Facebook: more than 1.38 billion nodes, and over 140 billion links
Example queries: Graph pattern matching
Input: A pattern graph Q and a graph G
Output: All the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
Applications
• pattern recognition
• intelligence analysis
• transportation network analysis
• Web site classification
• social position detection
• user-targeted advertising
• knowledge base disambiguation
…
a bijective function f on nodes:
(u,u’ ) ∈ Q iff (f(u), f(u’)) ∈ G
What other graph queries do you know?
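Under this definition, a brute-force matcher can be sketched in a few lines of Python (the function and variable names are illustrative, not from the course; enumerating all injective node mappings is exponential in the pattern size, which is exactly why this is infeasible on large graphs):

```python
from itertools import permutations

def pattern_matches(q_edges, g_edges):
    """All matches of pattern Q in graph G via subgraph isomorphism:
    an injective f on Q's nodes with (u, u') in Q iff (f(u), f(u')) in G,
    checked over the image of f. Brute force: exponential in |Q|."""
    q_nodes = sorted({n for e in q_edges for n in e})
    g_nodes = sorted({n for e in g_edges for n in e})
    q_set, g_set = set(q_edges), set(g_edges)
    found = []
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))  # a candidate injective mapping
        if all(((u, v) in q_set) == ((f[u], f[v]) in g_set)
               for u in q_nodes for v in q_nodes if u != v):
            found.append(f)
    return found

# A directed triangle matched against itself yields its 3 rotations:
triangle = {("p", "q"), ("q", "r"), ("r", "p")}
print(len(pattern_matches(triangle, {(1, 2), (2, 3), (3, 1)})))  # 3
```

Even this tiny enumerator visits |G|!/(|G|-|Q|)! candidate mappings; on a graph with billions of nodes the search space is astronomically large, hence the NP-completeness discussion on the following slides.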
Find all matches of a pattern in a graph
Graph pattern matching
Identify suspects in a drug ring
“Understanding the structure of drug trafficking organizations”
pattern graph
(Figure: the pattern graph, with nodes labeled B, A1 … Am, S, and W)
Is this feasible? Facebook: more than 1.38 billion nodes, and over 140 billion links
Querying big data: New challenges
A departure from classical theory and traditional techniques
Given a query Q and a dataset D, compute Q(D)
What are new challenges introduced by querying big data?
Does querying big data introduce new fundamental problems?
What new methodology do we need to cope with the sheer size of big data D?
(Figure: computing Q(D) over a traditional database vs. over big data D of PB or EB size)
Why?
The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Polynomial time queries become intractable on big data!
What happens when it comes to big data?
Using an SSD of 6GB/s, a linear scan of a data set D would take:
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
How long does it take?
What query is this?
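These back-of-the-envelope figures (and the 10,000-SSD variants a few slides later) are plain arithmetic; the sketch below reproduces them under the slide's idealized assumptions of a sustained 6GB/s scan rate and perfectly even data partitioning:

```python
SECS_PER_DAY = 86_400
SECS_PER_YEAR = 365 * SECS_PER_DAY
PB, EB = 1e15, 1e18  # sizes in bytes

def scan_seconds(num_bytes, rate_bps=6e9, drives=1):
    """Idealized linear-scan time: total bytes over aggregate bandwidth,
    assuming the data is evenly partitioned across `drives` SSDs."""
    return num_bytes / (rate_bps * drives)

print(scan_seconds(PB) / SECS_PER_DAY)         # ~1.93 days
print(scan_seconds(EB) / SECS_PER_YEAR)        # ~5.28 years
print(scan_seconds(PB, drives=10_000))         # ~16.7 seconds
print(scan_seconds(EB, drives=10_000) / 3600)  # ~4.63 hours
```

Note that the 1EB figure with 10,000 drives works out to hours, not days: 5.28 years divided by 10,000 is roughly 4.6 hours.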
Tractability revisited for big data
BD-tractable queries: properly contained in P unless P = NC
(Figure: complexity classes NP and beyond, and P; within P, BD-tractable queries sit at the parallel polylog-time level, while the rest of P is not BD-tractable)
Yes, querying big data comes with new and hard fundamental problems
Challenges: query evaluation is costly
Already beyond reach in practice when the data is not very big
Graph pattern matching by subgraph isomorphism
• NP-complete to decide whether there exists a match
• possibly exponentially many matches
intractable even in the traditional complexity theory
Membership problem for relational queries
Input: a query Q, a database D, and a tuple t
Question: Is t in Q(D)?
• NP-complete if Q is a conjunctive query (SPC)
• PSPACE-complete if Q is in relational algebra (SQL)
What is the complexity?
(Figure: a shared-nothing parallel architecture: 10,000 processors P, each with its own memory M and database DB, connected by an interconnection network)
Using 10,000 SSDs of 6GB/s, a linear scan of D might take:
• 1.9 days / 10,000 ≈ 16 seconds when D is of 1PB (10^15 B)
• 5.28 years / 10,000 ≈ 4.6 hours when D is of 1EB (10^18 B)
Only ideally!
Is it still feasible to query big data?
Can we do better if we are given more resources?
Parallel and distributed query processing – TDD
Yes, parallel query processing. But how?
The two sides of a coin
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
• What capacity must a system provide to cope with the sheer size of the data?
• Is a query feasible on big data within our available resources?
• How can we make our queries tractable on big data?
. . .
The study of data quality is as important as data quantity
Can we trust the answers to our queries?
Dirty data routinely leads to misleading financial reports and strategic business planning decisions; loss of revenue, credibility and customers; and disastrous consequences
Veracity!
Data consistency
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC
Q1: how many employees are in the NY office?
3 may not be the correct answer: the AC and city in the first tuple are inconsistent!
Error rates: 10% - 75% (telecommunication)
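The inconsistency in the first tuple can be caught mechanically with a rule of the form "area code determines city"; a minimal sketch, where the rule table (including 908 being a New Jersey code, abbreviated NJ here) is an assumption hard-coded for illustration:

```python
# Sample tuples from the slide: (FN, LN, address, AC, city)
employees = [
    ("Mary",   "Smith",  "2 Small St", "908", "NYC"),
    ("Mary",   "Dupont", "10 Elm St",  "610", "PHI"),
    ("Mary",   "Dupont", "6 Main St",  "212", "NYC"),
    ("Bob",    "Luth",   "8 Cowan St", "215", "PHI"),
    ("Robert", "Luth",   "6 Drum St",  "212", "NYC"),
]

# Assumed rule: AC functionally determines city (908 is a NJ code)
AC_TO_CITY = {"212": "NYC", "215": "PHI", "610": "PHI", "908": "NJ"}

def violations(tuples, rule):
    """Tuples whose city conflicts with the city their AC determines."""
    return [t for t in tuples if rule.get(t[3], t[4]) != t[4]]

print(violations(employees, AC_TO_CITY))  # flags only the first tuple
```

With the rule in place, counting NYC employees over the repaired data no longer includes the first tuple, which is why 3 may not be the correct answer to Q1.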
Information completeness
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC
Q2: how many distinct employees have first name Mary?
3 may not be the correct answer:
• The first three tuples refer to the same person
• The information may be incomplete
“information perceived as being needed for clinical decisions was unavailable 13.6%-81% of the time” (2005)
Data currency
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q3: what is Mary’s current salary?
In a customer file, within two years about 50% of records may become obsolete (2002)
Entities: Mary, Robert
80k: in the real world, salary is monotonically increasing
Consistent, complete, and once correct
Data fusion
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q4: what is Mary’s current last name?
Deduce the true values of an entity
Dupont
In real life:
• marital status only changes from single → married → divorced
• tuples with the most current marital status also have the most current last name
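The deductions on this slide and the previous one can be sketched by hard-coding the slide's two assumed rules (marital status only moves forward along single → married → divorced, and salary is monotonically increasing); the entity and its values are from the running example:

```python
# Tuples believed to refer to the same real-world entity, "Mary"
mary = [
    {"LN": "Smith",  "salary": 50, "status": "single"},
    {"LN": "Dupont", "salary": 50, "status": "married"},
    {"LN": "Dupont", "salary": 80, "status": "married"},
]

# Assumed currency order on marital status
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

def current_salary(tuples):
    # salary only grows over time, so the most current value is the max
    return max(t["salary"] for t in tuples)

def current_last_name(tuples):
    # the tuple with the most current status carries the current name
    return max(tuples, key=lambda t: STATUS_ORDER[t["status"]])["LN"]

print(current_salary(mary), current_last_name(mary))  # 80 Dupont
```

Both answers fall out of the partial orders alone; no tuple has to be marked as "the" current one in advance.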
Data in real-life is often dirty
Dirty data: inconsistent, inaccurate, incomplete, stale
500,000 dead people retain active Medicare cards
Pentagon asked 200+ dead officers to re-enlist
81 million National Insurance numbers but only 60 million eligible citizens
Data error rates in industry: 1% - 30% (Redman, 1998)
98000 deaths each year, caused by errors in medical data
Dirty data are costly
Poor data cost US businesses $611 billion annually
Erroneously priced data in retail databases cost US customers $2.5 billion each year
1/3 of system development projects were forced to delay or cancel due to poor data quality
30%-80% of the development time and budget for data warehousing goes to data cleaning
CIA dirty data about WMD in Iraq!
The scale of the data quality problem is far worse on big data!
Can we trust answers to our queries over dirty data?
What does this course cover?
Big data = quantity + quality
Volume (quantity) Veracity (quality)
Basic topic 1: Parallel database management systems
Recall traditional DBMS:
• Database: “single” memory, disk
• DBMS: centralized; single processor (CPU)
Can we do better provided with multiple processors?
Parallel DBMS: exploiting parallelism
• improve performance
• reliability and availability
(Figure: a parallel DBMS: processors P, each with memory M and database DB, connected by an interconnection network; cf. MapReduce)
Basic topic 2: Distributed databases
Data is stored at several sites, each with an independent DBMS
• local ownership: data physically stored across different sites
• increased availability and reliability
• performance
(Figure: a distributed database: multiple sites, each with its own DBMS, DB and local schema, connected by a network; queries are answered through a global schema. Cf. cloud computing)
Advanced topic 1: MapReduce
A programming model with two primitive functions:
Applications in cloud computing
Connection between MapReduce and parallel query processing
Other parallel programming models
• BSP (Bulk Synchronous Parallel)
• Vertex-centric
• Partial evaluation
Map: <k1, v1> → list(k2, v2)
Reduce: <k2, list(v2)> → list(k3, v3)
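A single-process word-count sketch of this model (the runner below simulates the map, shuffle and reduce phases in memory; a real MapReduce engine distributes each phase across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):           # <k1, v1> -> list(k2, v2)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):        # <k2, list(v2)> -> list(k3, v3)
    return [(word, sum(counts))]

def run_mapreduce(inputs, mapper, reducer):
    # map phase: apply the mapper to every input pair
    pairs = [kv for k1, v1 in inputs for kv in mapper(k1, v1)]
    pairs.sort(key=itemgetter(0))   # shuffle phase: group by key k2
    out = []
    for k2, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(k2, [v for _, v in group]))
    return out

print(run_mapreduce([("d1", "big data big deal")], map_fn, reduce_fn))
# [('big', 2), ('data', 1), ('deal', 1)]
```

The sort-then-group step is the in-memory stand-in for the shuffle: in a real engine each reducer receives all values sharing a key k2, regardless of which mapper emitted them.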
Advanced topic 2: Querying big data
Foundations for querying big data
Tractability revisited for querying big data
Parallel scalability
Bounded evaluability of queries
Querying big data: theory and practice
Techniques for querying big data
Develop parallel algorithms for querying big data
Bounded evaluability and access constraints
Query preserving compression
Query answering using views
Bounded incremental query processing
Central issues for data quality
• Object identification (data fusion): do two objects refer to the same real-world entity? What is the true value of the entity?
• Data consistency: do our data values have conflicts?
• Data accuracy: is one value more accurate than another for a real-world entity?
• Data currency: is our data out of date?
• Information completeness: does D have enough information to answer our queries?
Make our data consistent, accurate, complete and up to date!
Big data = quantity + quality!
TDD: the Veracity of big data
Advanced topic 3: Data quality management
Advanced topic 4: Dependencies as data quality rules
Fundamental problems for data quality rules:
• consistency: are the data quality rules “dirty” themselves?
• implication: can we optimize the rules by removing redundant ones?
A revision of classical dependencies
Data quality rules:
• conditional (functional and inclusion) dependencies to capture data inconsistencies
• matching dependencies for record matching
Data consistency: do our data values have conflicts?
• there are also quality rules for data accuracy, data currency and information completeness – in the textbook
A uniform logic framework for improving data quality
(Figure: the data cleaning cycle: discover rules, reasoning, detect errors, repair)
Advanced topic 5: Data cleaning
Semi-automated systems for improving data quality
• Discover data quality rules
• Validate the rules discovered
• Detect errors with rules
• Repair data with rules
• Certain fixes
• Deduce the true values of entities
Putting it together
Basic technology
• Parallel DBMS: architectures, data partitioning, (intra/inter-) operator parallelism, parallel query processing and optimization
• Distributed DBMS: architectures, fragmentation, replication
Advanced topics
Big data: the Volume
– MapReduce and other parallel programming models
– Querying big data: theory and practice
Big data: the Veracity
– Central issues for data quality
– Dependencies as data quality rules
– Cleaning distributed data: rule discovery, rule validation, error detection, data repairing, certain fixes
Prerequisites
Volume (quantity), Veracity (quality)
• Variety (entity resolution, conflict resolution)
• Velocity (incremental computation)
relational algebra/SQL, query processing, basic complexity and algorithmic background (e.g., NP, undecidability)
Course format
Basic information
Web site:
http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
– Syllabus
– Announcements
– Lecture notes
– Deadlines
TA: Chao Tian– [email protected]
Office hours:– Informatics Forum 5.23, 11:00-12:00, Thursday
Course format
Seminar course: there will be no exam!
– Lectures: background
http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
– Textbooks:
R. Ramakrishnan, J. Gehrke: Database Management Systems. WCB/McGraw-Hill 2003 (3rd edition). Chap 22
Database System Concept, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012 (Chapters 1-4; e-copy available upon request)
– Research papers or chapters related to the topics (3-4 each)
• at the end of ln3-ln8
Grading
Reviews of research papers (8 in total): 40%
Project (report): 45%
Project presentation: 15%
Homework:
Four sets of homework, starting from week 4; deadlines:
• 9am, Thursday, February 5, week 4
• 9am, Thursday, February 19, week 6
• 9am, Thursday, March 5, week 8
• 9am, Thursday, March 19, week 10
– Papers: choose two each time (two reviews) – not chapters
– 5% for each paper, and 10% for each homework
down from 12 in 2012
Review Evaluation
Pick 2 research papers each time from the lecture notes to be covered in the next two weeks, starting from Week 4.
Write a one-page review for each of the papers (10 marks)
Summary: 2 marks
• a clear problem statement: input, question/output
• the need for this line of research: motivation
• a summary of key ideas, techniques and contributions
Evaluation: 5 marks
– criteria for the line of research (e.g., expressive power, complexity, accuracy, scalability, etc.)
– evaluation based on your criteria; justify your evaluation
• 3 strong points
• 3 weak points
Suggest possible extensions: 3 marks
Project – Research and development (recommended)
Research and development:
– Topic: pick one from the lecture notes (ln3 – ln8)
Example: A MapReduce algorithm for graph simulation
Development:
– pick a research paper from the reading list of ln3 – ln8
– implement its main algorithms
– conduct its experimental study
Multiple people may work on the same project independently
You are encouraged to come up with your own project – talk to me first
Start early!
Grading – design and development
Distribution:
– algorithms: technical depth, performance guarantees 20%
– proofs of correctness, complexity analysis and performance guarantees of your algorithms 15%
– justification (experimental evaluation) 10%
Report: in the form of a technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustrated via intuitive examples
– Correctness/complexity/property proofs
– Experimental evaluation
– Possible extensions
Project – survey
Topic: pick one topic from a lecture note (ln3 – ln8)
Example: techniques for conflict resolution
Distribution:
– select 5-6 representative papers, independently 10%
– develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria 10%
– evaluate each of the papers based on your criteria 15%
– a table summarizing the assessment based on your criteria; draw and justify your conclusion and recommendation for various applications 10%
Sample survey: A Brief Survey of Automatic Methods for Author Name Disambiguation
Find and download it from Google
Your understanding of the topic
Project report and presentation – 15%
A clear problem statement
Motivation and challenges
Key ideas, techniques/approaches
Key results – what you have got, intuitive examples
Findings/recommendations for different applications
Demonstration: a must if you do a development project
Presentation: question handling (show that you have developed a good understanding of the line of work)
Learn how to present your work
Summary and Review
What is big data?
What is the volume of big data? Variety? Velocity? Veracity?
Why do we care about big data?
Is there any fundamental challenge introduced by querying big data?
Why study data quality?
What is consistency? Information completeness? Data currency? Data accuracy? Object identification?
Reading list
For next week, parallel databases, before the next lecture:
– Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22
– Database System Concept, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
About relational databases:
– Foundations of Databases, S. Abiteboul, R. Hull, V. Vianu
About big data
– W. Fan and J. Huai. Querying Big Data: Theory and Practice, JCST 2014
http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf