Delip Rao delip@jhu.edu




Transcript of Delip Rao delip@jhu.edu

Page 1:
Page 2:

Delip Rao delip@jhu.edu

Page 3:

What is the typical size of data you deal with on a daily basis?

Page 4:

• Google processes 20 petabytes of raw data a day
• That works out to roughly 231 GB per second!
• Numbers from 2008; the volume grows by the day.
  http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html

Page 5:

• 200 GB per day (March 2008)
• 2+ TB (compressed) per day in April 2009
• 4+ TB (compressed) per day in Oct 2009
  – 15 TB (uncompressed) a day (2009)

Page 6:

And Many More …

• eBay: 50 TB / day
• NYSE: 1 TB / day
• CERN LHC: 15 PB / year

Page 7:

Storage and Analysis of Tera-scale Data: 1 of 2

415 Database Class, 11/17/09

Page 8:

In Today’s Class We Will …

• Deal with scale from a completely different perspective
• Discuss problems with traditional approaches
• Discuss how to analyze large quantities of data
• Discuss how to store (physically) huge amounts of data in a scalable, reliable fashion
• Discuss a simple, effective approach to storing record-like data

Page 9:

Dealing with scale

• MySQL will crawl with 500 GB of data
• “Enterprise” databases (Oracle, DB2)
  – Expensive $$$$$$$$
  – Do not scale well
    • Indexing is painful
    • Aggregate operations (SELECT COUNT(*) …) become almost impossible
• Distributed databases
  – Expensive $$$$$$$$$$$$$$$$$$$$$$$$
  – Don’t scale well either

New approaches required!

Page 10:

Large Scale Data: Do we need databases?

• Traditional database design is informed by decades of research on storage and retrieval.
• Complicated database systems == more tuning
  – A whole industry of “Database Administrators”
  – Result: increased operational expenses
• Complicated indexing and transaction-processing algorithms are unnecessary if all we care about is analyzing the data

Page 11:

Parallelize both data access and processing

• Over time, processing capacity has grown much faster than
  – Disk transfer time (slow)
  – Disk seek time (even slower)
• Solution:
  – Process the data using a cluster of nodes with independent CPUs and independent disks.

Page 12:

Overview

• MapReduce is a design pattern:
  – Manipulate large quantities of data
  – Abstract away system-specific issues
  – Encourage cleaner software engineering
  – Inspired by functional programming primitives

Page 13:

MapReduce by Example

• Output: word frequency histogram
• Input: text, read one line at a time
• Single-core design: use a hash table
• MapReduce:

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
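To make the pseudocode concrete, here is a minimal, runnable Python simulation of the same job; the function names mirror the slide’s pseudocode, and a defaultdict stands in for the framework’s shuffle step:

from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all the counts collected for one word.
    return (key, sum(values))

def run_wordcount(lines):
    # "Shuffle": group every mapper output pair by its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key group independently.
    return [reducer(key, values) for key, values in sorted(groups.items())]

lines = ["the quick brown fox",
         "the fox ate the rabbit",
         "the brown rabbit"]
print(run_wordcount(lines))
# [('ate', 1), ('brown', 2), ('fox', 2), ('quick', 1), ('rabbit', 2), ('the', 4)]

In a real MapReduce run, the three lines would go to different mappers and the key groups to different reducers; this sequential simulation only checks the logic.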

Page 14:

Word Frequency Histogram (contd)

the quick brown fox

the fox ate the rabbit

the brown rabbit

Page 15:

Word Frequency Histogram (contd)

Input (one line per mapper):
  the quick brown fox
  the fox ate the rabbit
  the brown rabbit

MAPPERs emit a (word, 1) pair for every word:
  (the, 1) (the, 1) (the, 1) (the, 1) …

SHUFFLE groups the pairs by key:
  (the, (1, 1, 1, 1)) …

REDUCERs sum each group:
  (the, 4) (ate, 1) (brown, 2) (fox, 2) (quick, 1) (rabbit, 2)

Page 16:

WordCount review

• Output: word frequency histogram
• Input: text, read one line at a time
  – Key: ignored; Value: a line of text

def mapper(key, value):
    for word in value.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Page 17:

WordCount: In actual code

• Mapper
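The Java mapper on the original slide is an image and is not reproduced in this transcript. As a hedged stand-in, here is an equivalent mapper for Hadoop Streaming in Python; Streaming runs any executable that reads records on stdin and writes tab-separated key-value pairs on stdout:

#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))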

Page 18:

WordCount: In actual code

• Reducer
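The Java reducer image is likewise missing; the Streaming stand-in below relies on Hadoop sorting the mapper output by key, so all pairs for one word arrive on consecutive lines and a running total suffices:

#!/usr/bin/env python
# reducer.py: sum the counts of consecutive identical keys on stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))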

Page 19:

WordCount: In actual code

• Driver (main) method

Observe the benefits of abstraction: the framework handles hardware dependence, reliability, and job distribution.
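(The driver code on the slide is also an image. For the Streaming stand-ins above, the driver’s role is played by the job-submission command, roughly: hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. The Java driver configures the same things: input and output paths plus the mapper and reducer classes.)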

Page 20:

“Thinking” in MapReduce

• Input is a sequence of key-value pairs (records)
• Processing of any record is independent of the others
• Need to recast algorithms, and sometimes data, to fit this model
  – Think of structured data (graphs!)

Page 21:

Example: Inverted Indexing

• Say you have a large collection (billions) of documents
• How do you efficiently find all documents that contain a certain word?
• Database solution:
    SELECT doc_id FROM doc_table WHERE doc_text CONTAINS ‘word’;
• Forget scalability; this is very inefficient.
• Another demonstration of when not to use a DB

Page 22:

Example: Inverted Indexing

• A well-studied problem in the Information Retrieval community
• More about this in the 600.466 course (Spring)
• For now, we will build a simple index
  – Scan all documents in the collection
  – For each word, record the document in which it appears
• Can write a few lines of Perl/Python to do it
  – Simple, but will take forever to finish

What is the complexity of this code?

Page 23:

“Thinking” in MapReduce (contd)

• Building inverted indexes
• Input: a collection of documents
• Output: for each word, find all documents containing the word

def mapper(filename, content):
    for word in content.split():
        output(word, filename)

def reducer(key, values):
    output(key, unique(values))

What is the latency of this code?
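As with word count, the pseudocode can be sanity-checked locally. A minimal Python sketch (hypothetical file names; a set plays both the shuffle’s grouping and the reducer’s unique()):

from collections import defaultdict

def build_inverted_index(docs):
    # docs: a mapping of filename -> document text.
    index = defaultdict(set)
    for filename, content in docs.items():   # map phase
        for word in content.split():
            index[word].add(filename)        # shuffle + unique()
    return {word: sorted(names) for word, names in index.items()}

docs = {"doc1.txt": "the quick brown fox",
        "doc2.txt": "the fox ate the rabbit"}
print(build_inverted_index(docs)["fox"])  # ['doc1.txt', 'doc2.txt']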

Page 24:

Suggested Exercise

• Twitter has data in the following format:
    <user_id, tweet_text, timestamp>
• Write map-reduces for
  – Finding all users who tweeted on “Comic Con”
  – Ranking all users by frequency of their tweets
  – Finding how the number of tweets containing “iPhone” varies with time

Page 25:

MapReduce vs RDBMS

            Traditional RDBMS         MapReduce
Data size   Gigabytes                 Petabytes
Access      Interactive & batch       Batch
Updates     Read & write many times   Write once, read many times
Structure   Static schema             Dynamic schema
Integrity   High                      Low
Scaling     Nonlinear                 Linear

Page 26:

The Apache Hadoop Zoo

PIG CHUKWA HIVE HBASE

MAPREDUCE HDFS ZOOKEEPER

COMMON AVRO

Page 27:

Storing Large Data: HDFS

• Hadoop Distributed File System (HDFS)
• A very large distributed file system (~10 PB)
• Assumes commodity hardware
  – Replication
  – Failure detection & recovery
• Optimized for batch processing
• Single namespace for the entire cluster:
    hdfs://node-21/user/smith/job21/input01.txt
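As an illustration of the single namespace, here is a sketch using the third-party Python hdfs package (a WebHDFS client); the namenode host, port, and user are hypothetical, echoing the slide’s example path:

from hdfs import InsecureClient  # third-party WebHDFS client

# Point the client at the namenode's WebHDFS endpoint (hypothetical).
client = InsecureClient("http://node-21:50070", user="smith")

# The whole cluster is addressed through one filesystem namespace.
print(client.list("/user/smith/job21"))
with client.read("/user/smith/job21/input01.txt") as reader:
    data = reader.read()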

Page 28:

HDFS Concepts

• Blocks: the single unit of storage
• Namenode (master)
  – Manages the namespace
    • Filesystem namespace tree + metadata
  – Maintains the file-to-block mapping
• Datanodes (workers)
  – Perform block-level operations
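A toy Python model of this bookkeeping (all names invented for illustration): the namenode holds only metadata, while the blocks themselves live on datanodes:

# Namenode metadata: file -> ordered list of block ids.
namespace = {
    "/user/smith/job21/input01.txt": ["blk_001", "blk_002"],
}
# Namenode metadata: block id -> datanodes holding a replica (3x replication).
block_locations = {
    "blk_001": ["datanode-3", "datanode-7", "datanode-9"],
    "blk_002": ["datanode-1", "datanode-3", "datanode-8"],
}

def read_file(path):
    # A client asks the namenode for block locations, then fetches each
    # block directly from a datanode (a block-level operation).
    for block in namespace[path]:
        print("fetch %s from %s" % (block, block_locations[block][0]))

read_file("/user/smith/job21/input01.txt")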

Page 29:

Storing record data

• HDFS is a filesystem: an abstraction for files of raw bytes, with no structure
• A lot of real-world data occurs as tuples
  – Hence RDBMSs. If only they were scalable …
• Google’s solution: BigTable (2004)
  – A scalable distributed multi-dimensional sorted map
• Currently used in 100+ projects inside Google
  – 70+ PB of data; 30+ GB/s I/O (Jeff Dean, LADIS ’09)

Page 30:

Storing record data: HBase

• Open-source clone of Google’s Bigtable
• Originally created at Powerset in 2007
• Used at Yahoo, Microsoft, Adobe, Twitter, …
• A distributed column-oriented database on top of HDFS
• Real-time read/write random access
• Not relational, and does not support SQL
• But can work with very large datasets
  – Billions of rows, millions of columns

Page 31:

HBase: Data Model

• Data is stored in labeled tables
  – A multi-dimensional sorted map
• A table has rows and columns
• Cell: the intersection of a row and a column
  – Cells are versioned (timestamped)
  – A cell contains an uninterpreted array of bytes (no type information)
• Primary key: the row key, which uniquely identifies a row
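The “multi-dimensional sorted map” can be pictured as nested dictionaries; a toy Python sketch (invented data) keyed by row, then column, then timestamp:

import time

table = {}  # row key -> column -> timestamp -> raw bytes

def put(row, column, value):
    versions = table.setdefault(row, {}).setdefault(column, {})
    versions[time.time()] = value  # every write adds a new version

def get(row, column):
    versions = table[row][column]
    return versions[max(versions)]  # the newest timestamp wins

put(b"row1", b"temperature:air", b"21.5")
put(b"row1", b"temperature:air", b"22.0")
print(get(b"row1", b"temperature:air"))  # b'22.0'
# (A real store also keeps rows sorted by key, which a plain dict ignores.)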

Page 32:

HBase: Data model (contd)

• Columns are grouped into column families
  – E.g., temperature:air, temperature:dew_point
• Thus a column name has the form family_name:identifier
• The column families are assumed to be known a priori
• However, new columns can be added to an existing family at run time

ROW          COLUMN FAMILIES
             temperature:            humidity:               …
location_id  temperature:air         humidity:absolute
             temperature:dew_point   humidity:relative
                                     humidity:specific

Page 33:

HBase: Data model (contd)

• Tables are partitioned into regions
• Region: a subset of rows
• Regions are the units that get distributed across a cluster
• Locking
  – Row updates are atomic
  – Updating a cell locks the entire row
  – Simple to implement, and efficient
  – Also, updates are rare

Page 34:

HBase vs. RDBMS

• Scale: billions of rows and millions of columns
• Traditional RDBMS:
  – Fixed schema
  – Good for small- to medium-volume applications
  – Scaling an RDBMS involves violating Codd’s rules and loosening ACID properties

Page 35:

HBase schema design case study

• Store information about students, courses, course registration

• Relationships (two one-to-many)
  – A student can take multiple courses
  – A course is taken by multiple students

Page 36:

HBase schema design case study

• RDBMS solution

STUDENTS            REGISTRATION     COURSES
id (primary key)    student_id       id (primary key)
name                course_id        title
department_id       type             faculty_id

Page 37:

HBase schema design case study

• HBase solution

ROW           COLUMN FAMILIES
              info:                 course:
<student_id>  info:name             course:course_id=type
              info:department_id

ROW           COLUMN FAMILIES
              info:                 student:
<course_id>   info:title            student:student_id=type
              info:faculty_id
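A sketch of populating the student table with the third-party happybase Python client (hypothetical host and data; the table is assumed to already exist with its info and course column families):

import happybase  # third-party Thrift-based HBase client

connection = happybase.Connection("hbase-host")  # hypothetical host
students = connection.table("students")

# One row per student; each course becomes a column in the course family,
# so a student can take any number of courses with no schema change.
students.put(b"student_42", {
    b"info:name": b"Ada Lovelace",
    b"info:department_id": b"cs",
    b"course:db415": b"enrolled",  # course:course_id = type
    b"course:ir466": b"audit",
})
print(students.row(b"student_42"))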

Page 38:

HBase: A real example

• Search Engine Query Log

ROW                 COLUMN FAMILIES
                    query:        cookie:      request:
request_md5_hash    query:text    cookie:id    request:user_agent
                    query:lang                 request:ip_addr
                    ...                        request:timestamp

It is common practice to use a hash of the row as the key when no natural primary key exists. This is okay when the data is accessed sequentially, with no need to “look up” individual rows.

Page 39:

Suggested Exercise

• Write an RDBMS schema to model User-Follower network in Twitter

• Now, write its HBase equivalent

Page 40:

Access to HBase

• Via the Java API
  – Map semantics: Put, Get, Scan, Delete
  – Versioning support
• Via the HBase shell:

$ hbase shell
...
hbase> create 'test', 'data'
hbase> list
test
...
hbase> put 'test', 'row1', 'data:1', 'value1'
hbase> put 'test', 'row2', 'data:1', 'value1'
hbase> put 'test', 'row2', 'data:2', 'value2'
hbase> put 'test', 'row3', 'data:3', 'value3'
hbase> scan 'test'
...
hbase> disable 'test'
hbase> drop 'test'

The shell is not the preferred way to access HBase; typically the API is used. It is not a real query language, but it is useful for “inspecting” a table.
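The shell session maps almost one-to-one onto the client API. Here is the same sequence sketched with the third-party happybase Python client (hypothetical host; create_table and delete_table stand in for the shell’s create and disable + drop):

import happybase

connection = happybase.Connection("hbase-host")    # hypothetical host
connection.create_table("test", {"data": dict()})  # create 'test', 'data'

test = connection.table("test")
test.put(b"row1", {b"data:1": b"value1"})          # put 'test', 'row1', ...
test.put(b"row2", {b"data:1": b"value1", b"data:2": b"value2"})
test.put(b"row3", {b"data:3": b"value3"})

for row_key, columns in test.scan():               # scan 'test'
    print(row_key, columns)

connection.delete_table("test", disable=True)      # disable + drop 'test'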

Page 41:

A Word on HBase Performance

• The original HBase had performance issues
• HBase 0.20 (the latest release) is much faster
  – Open-source development!
• Performance analysis by StumbleUpon.com
  – The website uses over 9 billion rows in a single HBase table
  – 1.2 million row reads/sec using just 19 nodes
  – Scalable with more nodes
  – Caching (a new feature) further improves performance

Page 42:

Are traditional databases really required?

• Will a bank store all its data in HBase or its equivalents?
• Unlikely, because Hadoop
  – Has no notion of a transaction
  – Has no security or access control like databases do
• Fortunately, batch processing of large amounts of data does not require such guarantees
• Hot research topic: integrating databases and MapReduce, i.e., “in-database MapReduce”

Page 43:

Summary

• (Traditional) databases are not Swiss Army knives
• Large-data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce is a framework for building reliable distributed data processing applications
• Storing large data requires a redesign from the ground up, i.e., of the filesystem (HDFS)

Page 44:

Summary (contd)

• HDFS: a reliable, open-source distributed file system
• HBase: a sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)
• Using MapReduce + HBase involves a fair bit of programming experience
  – Next class we will study Pig and Hive: a “data analyst friendly” interface to processing large data.

Page 45:

Suggested Reading

• Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”
• DeWitt and Stonebraker, “MapReduce: A major step backwards”
• Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver”
• DeWitt and Stonebraker, “MapReduce II”

Page 46:

Suggested Reading (contd)

• Hadoop Overview: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
• Who uses Hadoop? http://wiki.apache.org/hadoop/PoweredBy
• HDFS Architecture: http://hadoop.apache.org/common/docs/current/hdfs_design.html
• Chang et al., “Bigtable: A Distributed Storage System for Structured Data”: http://labs.google.com/papers/bigtable.html
• HBase: http://hadoop.apache.org/hbase/