
Distributed Data Management Summer Semester 2013

TU Kaiserslautern

Dr.-Ing. Sebastian Michel

[email protected]

Distributed Data Management, SoSe 2013, S. Michel 1


MOTIVATION AND OVERVIEW Lecture 1


Distributed Data Management

• What does “distributed” mean?

• And why would we want/need to do things in a distributed way?


Reason: Federated Data

• Data is per se hosted at different sites

• Autonomy of sites • Maintained by diff. organizations • Mashups over such independent sources • Linked Open Data (LOD)


Reason: Sensor Data

• Data originates at different sensors

• Spread across the world

• Health data from mobile devices


Continuous queries!


Reason: Network Monitoring

E.g. find clients that cause high network traffic.

IP            Bytes in kB
192.168.1.7   31 kB
192.168.1.3   23 kB
192.168.1.4   12 kB

IP            Bytes in kB
192.168.1.8   81 kB
192.168.1.3   33 kB
192.168.1.1   12 kB

IP            Bytes in kB
192.168.1.4   53 kB
192.168.1.3   21 kB
192.168.1.1   9 kB

IP            Bytes in kB
192.168.1.1   29 kB
192.168.1.4   28 kB
192.168.1.5   12 kB
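The four tables above only answer the question once they are combined. A minimal Python sketch of that merge, using the numbers from the slide; the name `local_tables` and the reading that each table comes from a separate monitoring point are assumptions for illustration.

```python
from collections import Counter

# Traffic per client IP in kB, one dict per table on the slide.
local_tables = [
    {"192.168.1.7": 31, "192.168.1.3": 23, "192.168.1.4": 12},
    {"192.168.1.8": 81, "192.168.1.3": 33, "192.168.1.1": 12},
    {"192.168.1.4": 53, "192.168.1.3": 21, "192.168.1.1": 9},
    {"192.168.1.1": 29, "192.168.1.4": 28, "192.168.1.5": 12},
]

def merge_traffic(tables):
    """Sum the per-IP traffic over all local tables."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds the counts
    return total

total_traffic = merge_traffic(local_tables)
top_clients = total_traffic.most_common(3)
print(top_clients)
```

Shipping every local table to one place is the naive approach; it is used here only because it makes the global result easy to see.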


Reason: Individuals as Providers/Consumers

• Don’t want single operator with global knowledge -> better decentralized?

• Distributed search engines

• Data on mobile phones

• Peer-to-Peer (P2P) systems

• Distributed social networks

• Leveraging idle resources


Example: SETI@Home

• Distributed Computing

• Donate idle time of your personal computer

• Analyze extraterrestrial radio signals when screensaver is running


Example: P2P Systems: Napster

(Figure: File Download)

• Central server (index)

• Client software sends information about users’ contents to the server.

• Users send queries to the server.

• Server responds with the IPs of users that store matching files.

Peer-to-Peer file sharing!

• Developed in 1998.

• First P2P file-sharing system

Pirate-to-Pirate?
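The central-index idea can be sketched in a few lines of Python. This is only an illustration of the steps listed above, not Napster's actual protocol; the class name `CentralIndex`, the substring matching, and the example IPs and file names are all hypothetical.

```python
class CentralIndex:
    """Hypothetical central server: knows who stores what, stores no files."""

    def __init__(self):
        self._files = {}  # file name -> set of peer IPs

    def publish(self, peer_ip, file_names):
        """A client reports the files its user shares."""
        for name in file_names:
            self._files.setdefault(name, set()).add(peer_ip)

    def search(self, keyword):
        """Return the IPs of peers that store a matching file."""
        keyword = keyword.lower()
        hits = set()
        for name, peers in self._files.items():
            if keyword in name.lower():
                hits |= peers
        return sorted(hits)

index = CentralIndex()
index.publish("10.0.0.1", ["one_ring.mp3"])
index.publish("10.0.0.2", ["one_ring.mp3", "shire.mp3"])
print(index.search("ring"))
```

The download itself would then happen directly between the peers, which is the only peer-to-peer part of this design.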


Example: Self Organization & Message Flooding

(Figure: message flooding in an overlay network; the forwarded messages are labeled with TTL values from 3 down to 0.)
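TTL-limited flooding can be sketched as a breadth-first traversal. The sketch below assumes one common convention (the TTL is decremented on every forward, and a message with TTL 0 is not forwarded further); the overlay graph and the function name `flood` are made up for illustration.

```python
from collections import deque

def flood(neighbors, start, ttl):
    """Return {peer: remaining TTL when the query first arrived}."""
    reached = {start: ttl}
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL used up: do not forward any further
        for nxt in neighbors[peer]:
            if nxt not in reached:  # peers drop duplicate messages
                reached[nxt] = remaining - 1
                queue.append((nxt, remaining - 1))
    return reached

# Hypothetical overlay: each peer only knows its direct neighbors.
overlay = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d", "f"],
    "f": ["e", "g"],
    "g": ["f"],
}
reached = flood(overlay, "a", ttl=3)
print(reached)
```

Peers `f` and `g` are never asked, so a matching file stored there would not be found: the TTL bounds the message cost at the price of incomplete results.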


Example: Structured Overlay Networks

• Logarithmic cost with routing tables (not shown here)

• Self organizing

• Will see later twice: NoSQL KeyValue stores and P2P Systems

(Figure: ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54)
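The slide does not state how keys are assigned to peers. A minimal sketch, assuming the Chord-style rule that a key is stored on the first peer whose id is greater than or equal to the key id, wrapping around the ring; the function name `responsible_peer` is hypothetical.

```python
import bisect

peers = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # peer ids from the figure, sorted
keys = [10, 24, 30, 38, 54]                     # key ids from the figure

def responsible_peer(key, sorted_peers):
    """First peer id >= key, wrapping around to the smallest id."""
    i = bisect.bisect_left(sorted_peers, key)
    return sorted_peers[i % len(sorted_peers)]

assignment = {f"k{k}": f"p{responsible_peer(k, peers)}" for k in keys}
print(assignment)
```

This version cheats by knowing all peers. In a real overlay each peer only keeps a small routing table, which is where the logarithmic lookup cost comes from.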


Reason: Size


Showcase Scenario

• Assume you have 10 TB of data on disk

• Now, do some analysis of it

• With a 100 MB/s disk, reading alone takes

– 100,000 seconds

– about 1,667 minutes

– about 28 hours
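The arithmetic behind this scenario, as a small Python sketch with decimal units (1 TB = 10^12 bytes). The `disks` parameter is an added assumption: it models the data being spread evenly over several disks that are read in parallel.

```python
TB = 10**12  # bytes, decimal units
MB = 10**6

def scan_seconds(data_bytes, bytes_per_second, disks=1):
    """Time to read everything once, with `disks` equal disks in parallel."""
    return data_bytes / (bytes_per_second * disks)

single = scan_seconds(10 * TB, 100 * MB)
parallel = scan_seconds(10 * TB, 100 * MB, disks=100)

print(f"1 disk:    {single:,.0f} s = {single / 3600:.1f} h")
print(f"100 disks: {parallel:,.0f} s = {parallel / 60:.1f} min")
```

Under these assumptions, a hundred disks bring the scan from more than a day down to under 17 minutes.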


Huge Amounts of Data

• Google:

– Billions of web pages (around 50 billion, Spring 2013)

– TBs of data

• Twitter:

– Hundreds of millions of tweets per day

• CERN’s LHC

– 25 Petabytes of data per year


Huge Amounts of Data(2)

• Megaupload

– 28 PB of data

• AT&T (US Telecomm. Provider)

– 30 PB of data through its networks each day

• Facebook

– 100 PB Hadoop cluster


http://en.wikipedia.org/wiki/Petabyte


Need to do something about it

http://flickr.com/photos/jurvetson/157722937/

http://www.google.com/about/datacenter


Scale-Out vs. Scale-Up

• Scale-Out (many servers -> distributed)

• As opposed to Scale-Up (a single, more powerful machine)


Scale-Out

• Common technique is scale-out

– Many machines

– Amazon’s EC2 cloud, around 400,000 machines

• Commodity machines (many but not individually super fast)

• Failures happen virtually at any time.

• Electricity is an issue (particularly for cooling)


http://huanliu.wordpress.com/2012/03/13/amazon-data-center-size/


Hardware Failures

• Lots of machines (commodity hardware): failure is not the exception but very common

• P[machine fails today] = 1/365

• n machines (assuming independent failures): P[failure of at least 1 machine] = 1-(1-P[machine fails today])^n

– for n=1: 0.0027

– for n=10: 0.02706

– for n=100: 0.240

– for n=1000: 0.9357

– for n=10 000: ~ 1.0
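The numbers above can be reproduced with a few lines of Python. The sketch assumes, like the formula, that machines fail independently and that each fails on a given day with probability 1/365; the function name `p_any_failure` is chosen for illustration.

```python
def p_any_failure(n, p_single=1 / 365):
    """P[at least one of n machines fails today], independent failures."""
    return 1 - (1 - p_single) ** n

table = {n: p_any_failure(n) for n in (1, 10, 100, 1000, 10_000)}
for n, p in table.items():
    print(f"n={n:>6}: {p:.5f}")
```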


Failure Handling & Recovery

• Hardware failures happen virtually at any time

• Algorithms/Infrastructures have to compensate for that

• Replication of data, logging of state, also redundancy in task execution


Cost Numbers (=> Complex Cost Model)

Operation                              Cost
L1 cache reference                     0.5 ns
L2 cache reference                     7 ns
Main memory reference                  100 ns
Compress 1K bytes with Zippy           10,000 ns
Send 2K bytes over 1 Gbps network      20,000 ns
Read 1 MB sequentially from memory     250,000 ns
Round trip within same datacenter      500,000 ns
Disk seek                              10,000,000 ns
Read 1 MB sequentially from network    10,000,000 ns
Read 1 MB sequentially from disk       30,000,000 ns
Send packet CA->Netherlands->CA        150,000,000 ns

Numbers source: Jeff Dean

1 ns = 10^-6 ms


Map Reduce

• “Novel” computing paradigm introduced by Google in 2004.

• Have many machines in a data center.

• Don’t want to care about implementation details like data placement, failure handling, cost models.

• Abstract computation to two basic functions: map and reduce

• Think “functional programming” with map and fold (reduce), but

– Distributed and

– Large scale


Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150


Map Reduce: Example Map + Count

• Line 1

– “One ring to rule them all, one ring to find them,”

• Line 2

– “One ring to bring them all and in the darkness bind them.”


Map Line to Terms and Counts

Line 1

{"one"=>["1", "1"], "ring"=>["1", "1"], "to"=>["1", "1"], "rule"=>["1"], "them"=>["1", "1"], "all"=>["1"], "find"=>["1"]}

Line 2

{"one"=>["1"], "ring"=>["1"], "to"=>["1"], "bring"=>["1"], "them"=>["1", "1"], "all"=>["1"], "and"=>["1"], "in"=>["1"], "the"=>["1"], "darkness"=>["1"], "bind"=>["1"]}


Group by Term

Line 1: {"one"=>["1", "1"], "ring"=>["1", "1"], …

Line 2: {"one"=>["1"], "ring"=>["1"], …

Grouped: {"one"=>[["1", "1"], ["1"]], "ring"=>[["1", "1"], ["1"]], …


Sum Up

{"one"=>[["1", "1"], ["1"]], "ring"=>[["1", "1"], ["1"]], …

{"one"=>["3"], "ring"=>["3"], …
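The three steps (map, group by term, sum up) can be sketched in plain Python. This runs in a single process and is not Hadoop or Google code; the function names are chosen for illustration, and integers are used where the slides show the strings "1" and "3".

```python
import re
from collections import defaultdict

def map_line(line):
    """Map: emit a (term, 1) pair for every term in the line."""
    return [(term, 1) for term in re.findall(r"[a-z]+", line.lower())]

def group_by_term(pairs):
    """Group: collect all emitted counts per term."""
    groups = defaultdict(list)
    for term, count in pairs:
        groups[term].append(count)
    return groups

def reduce_counts(term, counts):
    """Reduce: sum up the counts of one term."""
    return term, sum(counts)

lines = [
    "One ring to rule them all, one ring to find them,",
    "One ring to bring them all and in the darkness bind them.",
]
pairs = [pair for line in lines for pair in map_line(line)]
word_counts = dict(reduce_counts(t, c) for t, c in group_by_term(pairs).items())
print(word_counts)
```

In a distributed setting the map calls would run on different machines, and the grouping step would move all pairs with the same term to the same reducer.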


Application: Computing PageRank

• Link analysis model proposed by Brin & Page

• Compute authority scores

• In terms of: incoming links (weights) from other pages

• “Random surfer model”
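A minimal, single-machine sketch of the random surfer idea as a power iteration. The damping factor 0.85, the fixed number of iterations, the handling of pages without out-links, and the four-page link graph are all assumptions for illustration; the slide gives none of these details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; returns {page: score}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: with probability (1 - damping) go to any page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Follow a link: pass on the score in equal shares.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical tiny web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print({page: round(score, 3) for page, score in ranks.items()})
```

Page C collects links from A, B, and D and ends up with the highest score, while D, which nobody links to, only keeps the random-jump share.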


S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.


New Requirements

• Map Reduce is one prominent example showing that novel businesses have new requirements.

• Moving away from traditional RDBMS.

• Addressing huge data volumes, processed in multiple, distributed (widespread) data centers.


New Requirements (Cont’d)

• Massive amounts of unstructured (text) data

• Processed often in batches (with MapReduce).

• Huge graphs like Facebook’s friendship graph

• Often enough to store (key, value) pairs

• No need for RDBMS overhead

• Often wanted: open source, or at least not bound to a particular commercial product (vendor).


Wish List

• Data should always be consistent

• The provided service should always respond quickly to requests

• Data can be (is) distributed across many machines (partitions)

• Even if some machines fail, the system should be up and running


CAP Theorem (Brewer's Theorem)

• A distributed system cannot provide all three properties at the same time:

– Consistency

– Availability

– Partition Tolerance

(Figure: C, A, and P, with the combinations C+P and A+P)

http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf


With Huge Data Sets ….

• Partition tolerance is strictly required

• That leaves trading off consistency and availability


Best effort: BASE

• Basically Available

• Soft State

• Eventual Consistency


see http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

W. Vogels. Eventually Consistent. ACM Queue vol. 6, no. 6, December 2008.


The NoSQL “Movement”

• No one-size-fits-all

• Not only SQL (not necessarily “no” SQL at all)

• Umbrella term for a group of non-traditional DBMS (not relational, often no SQL), built for different purposes

– key value stores

– graph databases

– document stores


Example: Key Value Stores

• Like Apache Cassandra, Amazon’s Dynamo, Riak

• Handling of (K,V) pairs

• Consistent hashing of values to nodes based on their keys

• Simple CRUD operations (create, read, update, delete) (no SQL, or at least not full)
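A toy sketch of consistent hashing combined with CRUD operations. It is not how Cassandra, Dynamo, or Riak are implemented: there is no replication and there are no virtual nodes. The class name `TinyKeyValueStore`, the node names, and the use of MD5 as the ring hash are assumptions for illustration.

```python
import bisect
import hashlib

def ring_position(name):
    """Map a node name or a key to a position on the hash ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)

class TinyKeyValueStore:
    def __init__(self, node_names):
        self._ring = sorted((ring_position(n), n) for n in node_names)
        self._positions = [pos for pos, _ in self._ring]
        self._data = {n: {} for n in node_names}  # one dict per "node"

    def node_for(self, key):
        """First node clockwise from the key's ring position."""
        i = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[i % len(self._ring)][1]

    def put(self, key, value):  # create and update
        self._data[self.node_for(key)][key] = value

    def get(self, key):  # read
        return self._data[self.node_for(key)].get(key)

    def delete(self, key):  # delete
        self._data[self.node_for(key)].pop(key, None)

store = TinyKeyValueStore(["node-a", "node-b", "node-c"])
store.put("user:42", "Frodo")
print(store.node_for("user:42"), store.get("user:42"))
```

The point of hashing nodes and keys onto the same ring is that adding a node only takes over keys from its ring neighbor; all other keys stay where they are.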


Criticisms

• Some DB folks say “Map Reduce is a major step backward”.

• And that NoSQL is too basic and will end up re-inventing DB standards (once they are needed).

• Will ask in a few weeks: What do you think?


Cloud Computing

• On demand hardware

– rent your computing machinery

– virtualization

• Google App Engine, Amazon AWS, Microsoft Azure

– Infrastructure as a Service (IaaS)

– Platform as a Service (PaaS)

– Software as a Service (SaaS)


Cloud Computing (Cont’d)

• Promises “no” startup cost for your own business in terms of hardware you need to buy

• Scalability: just rent more machines when you need them

• And return them when there is no demand

• Prominent showcase: Animoto, in Amazon’s EC2. From 50 to 3,500 machines in a few days.

• But also problematic:

– fully dependent on a vendor’s hardware/service

– sensitive data (all your data) is with the vendor, maybe (likely) stored in a different country


Dynamic Big Data

• Scalable, continuous processing of massive data streams

• Twitter’s Storm, Yahoo! (now Apache) S4


http://storm-project.net/


Last but not least: Fallacies of Distributed Computing

1. The network is reliable

2. Latency is zero

3. Bandwidth is infinite

4. The network is secure

5. Topology doesn't change

6. There is one administrator

7. Transport cost is zero

8. The network is homogeneous


source: Peter Deutsch and others at Sun


LECTURE: CONTENT & REGULATIONS


What you will learn in this Lecture

• Most of the lecture is on processing big data

– Map Reduce, NoSQL, Cloud computing

• Will operate on state of the art research results and tools

• Middle way between pure systems/tools discussion and learning how to build algorithms on top of them (see Joins over MR, n-grams, etc.)

• But also basic (important) techniques, like consistent hashing, PageRank, Bloom filters

• Very relevant stuff. Think “CV” ;)


• We will critically discuss techniques (philosophies).


Prerequisites

• Successfully attended information systems or database lectures.

• Practical exercises require solid Java skills

• Working with systems/tools requires the willingness to dive into APIs and installation procedures


Schedule of Lectures (Topics Tentative)

• Lecture 1 (April 18): Motivation, Regulations, Big Data
• Lecture 2 (April 25): Map Reduce 1
• Lecture 3 (May 2): Map Reduce 2
• No lecture (May 9) (Himmelfahrt, Ascension Day)
• Lecture 4 (May 16): NoSQL 1
• Lecture 5 (May 23): NoSQL 2
• No lecture (May 30) (Fronleichnam, Corpus Christi)
• Lecture 7 (June 6): Cloud Computing
• Lecture 8 (June 13): Stream Processing
• Lecture 9 (June 20): Distributed RDBMS 1
• Lecture 10 (June 27): Distributed RDBMS 2
• Lecture 11 (July 4): Peer to Peer Systems
• Lecture 12 (July 11): Open Topic 1
• Lecture 13 (July 18): Last Lecture / Oral exams


Lecturer and TA

• Lecturer: Sebastian Michel (Uni Saarland)

– smichel (at) mmci.uni-saarland.de

– Building E 1.7, Room 309 (Uni Saarland)

– Phone: 0681 302 70803

– or better, catch me after lecture!

• TA: Johannes Schildgen

– schildgen (at) cs.uni-kl.de

– Room: 36/340


Organization & Regulations

• Lecture:

– Thursday

– 11:45 - 13:15

– Room 48-379

• Exercise:

– Tuesday (bi-weekly)

– 15:30 - 17:00

– Room 52-203

– First session: May 7th.


Lecture Organization

• New Lecture (almost all slides are new).

• On topics that are often brand new.

• Later topics are still tentative.

• Please provide feedback. E.g., too slow / too fast? Important topics you want to be addressed?


Exercises

• Assignment sheet, every two weeks

• Sheet + TA session by Johannes Schildgen

• Mixture of:

– Practical: Implementation (e.g., Map Reduce)

– Practical: Algorithms on “paper”

– Theory: Where appropriate (show that …)

– Brief Essay: Explain the difference between x and y (short summary)

• Active participation wanted!


Exam

• Oral Exam at the end of semester/early in semester break.

• Around 20min

• Topics covered will be announced a few (1-2) weeks before the exams

• We assume you actively participated in the exercises.


Registration

• Please register by email to

– Sebastian Michel and Johannes Schildgen

– Use subject prefix: [ddm13]

– With content:

• Your name

• Matriculation number

• In particular to receive announcements/news


BIG DATA


source: Dilbert by Scott Adams (cropped)

(The Big data Challenge)


What is Big Data?

• Massive amounts of data from a variety of sources

– Web search logs

– social networks and blogs

– RFID and other sensor data

– sales data

– scientific data

& it is a big buzzword!


What is Big Data? (Cont’d)


• Big data is often associated with NoSQL and MapReduce tools to process it.

• Processed in and across gigantic data centers

• The term “Big Data” denotes not only size but also the things we want to/can do with it (benefits)


Traditional Handling

• Data warehousing, e.g., at Walmart, Ebay, etc. Also super big and constantly growing.

• But you know your data, know what you are looking for

• Schema is “small” enough to allow human input (admin)

• It is “just” YOUR data


“Simple” Case: Shopping Patterns

• Famous story:

– statistician at target.com (large retailer in the US)

– task: figure out whether a woman is pregnant, even if she doesn’t want them to know

– even more: roughly which week/month

– Why? To sell products!


Read more: e.g., http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0


“Simple” Case: Use of Search Logs

• Swine Flu epidemic of 2009

• Google tracks epidemic by following searches for flu-related topics.

source: Google


What is different now?

• Large amounts of heterogeneous data

• Take all the PBs together, not only your own (from TB to PB and EB)

• Manual input by humans hardly scales

• Who understands complex data and schema anyway (if there is one)?

• It is now beyond asking SQL queries.


Data Science: What it takes

• many fields touched

– math, statistics

– data engineering

– pattern recognition and learning

– natural language processing

– visualization

– uncertainty modeling

– data warehousing

– high performance computing


The BIG Data Challenge: The 4 Vs

• Volume

– Lots of data

• Velocity

– Changing / growing data

• Variety

– Heterogeneity

• Verity

– True or not?


Addressed in this lecture

According to Gartner and others.


Example: Trend Mining in Twitter


• Mine trends in text streams (Twitter, RSS feeds, etc.)

• No human input. Massive amount of noisy unstructured text data.

• Want to find trends like:

#benedictXVI #retirement

#schavan #guttenberg

#armstrong #doping

#cyprus #bankruptcy


Sliding Window Model and Objective

• Data is valid for a certain time

• Now: detect a change in co-occurrence, and thus an emerging trend!

(Figure: co-occurrence of tag A and tag B in a sliding window over evolving time)


Prediction Model and Trend Ranking

• Exponential smoothing forecast

• Intensity of trend as prediction error

(Figure: plot of Correlation, Prediction, and Error over time steps 1 to 10, with values between 0 and 1)
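One way to read "intensity of trend as prediction error" is sketched below: forecast the next correlation value of a tag pair with simple exponential smoothing and use the absolute forecast error as the trend score. The smoothing factor `alpha = 0.5`, the function name `trend_intensity`, and the example series are assumptions; the slide does not give the exact model.

```python
def trend_intensity(correlations, alpha=0.5):
    """Absolute one-step forecast errors of simple exponential smoothing."""
    forecast = correlations[0]  # start with the first observation
    errors = []
    for observed in correlations[1:]:
        errors.append(abs(observed - forecast))
        # Blend the new observation into the forecast.
        forecast = alpha * observed + (1 - alpha) * forecast
    return errors

# Hypothetical correlation of two tags per time step: flat, then a jump.
series = [0.1, 0.1, 0.1, 0.1, 0.8]
errors = trend_intensity(series)
print([round(e, 2) for e in errors])
```

As long as the co-occurrence behaves as predicted, the error stays near zero. The sudden jump in the last step is not predicted, so it produces a large error, which is what would rank the tag pair as an emerging trend.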


Data Sources are Heterogeneous

super fast not controlled (noisy) text little structure

super fast structured

static structured administered


… so is the Data

Music

Publications

Health Data

KB of Entire Wikipedia


Why is Big Data Interesting?

• Novel insights about customers

– Beyond pure shopping cart analyses and purchase history

– Beyond running separate surveys/polls

• Social media involvement

• Demographic data

• (Purchase) trend prediction in social media (=> investment)

• Why? Money


Need to be Careful


• Not only are facts often wrong

• Statistics can also give misleading clues.

• With enough data you can “tell” anything


Recap of Today’s Lecture

• Teaser for content addressed in coming lectures:

– Hot topics (Map Reduce, NoSQL, Cloud Computing, Big Data)

– and fundamental techniques

• Lecture regulations

• Short excerpt on “Big Data”


Next few lectures are on

Map Reduce


Summary: Papers/Books/Articles

• Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150

• W. Vogels. Eventually Consistent. ACM Queue vol. 6, no. 6, December 2008.

• Nancy Lynch and Seth Gilbert, “Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59.

• In general, for NoSQL references: http://nosql-database.org/

• Hadoop (Map Reduce): Tom White. Hadoop: The Definitive Guide. 3rd edition.

• http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0
