10/11/2010 AIT, Athens Ioannis Konstantinou
Open source large scale distributed data management with Google's MapReduce and Bigtable
Email: [email protected] | Web: http://www.cslab.ntua.gr/~ikons
Computing Systems Laboratory, School of Electrical and Computer Engineering
National Technical University of Athens
Ioannis Konstantinou
Big Data
• Facebook: 20TB/day compressed
• CERN/LHC: 40TB/day (15PB/year)
• NYSE: 1TB/day
• 2009 IDC Digital Universe study: 800,000 petabytes (0.8 zettabytes)
• A "Moore's Law" for data: volumes double roughly every 18 months
• Prediction for 2020: 35 zettabytes (44 times the 2009 volume)
What is Hadoop?
• It’s a distributed framework for large-scale data processing
• Inspired by Google’s architecture: MapReduce and the Google File System
• Can scale to thousands of nodes and petabytes of data
• A top-level Apache project (since 2008): Hadoop is open source
• Written in Java, plus a few shell scripts
Why Hadoop?
• Hadoop is designed to run on cheap commodity hardware
• Fault-tolerant hardware is expensive
• It automatically handles data replication and node failure
• It does the hard work – you can focus on processing data
When to use Hadoop?
• There is access to lots of commodity hardware
• The processing can be easily parallelized
• Need to process lots of unstructured data
– Data intensive applications
• It is ok to run batch jobs (no need for interactive results)
Architecture
• HDFS: Distributed file system
– Hard to store a PB
– Based on Google FS
– Fault-tolerant: handles replication, node failure, etc
• MapReduce: Data-aware parallel computation framework
– Even harder to process a PB
– Based on a research paper by Google
Hadoop Distributed File System 1/3
• Master/Slave Architecture
• Files are split into one or more blocks and these blocks are stored in a set of DataNodes
• A Master NameNode
– a master server that manages the file system namespace and regulates access to files by clients
– determines the mapping of blocks to DataNodes
• Many DataNodes
– Serve client read/write requests
– Create/delete/replicate blocks
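The NameNode's block-to-DataNode mapping described above can be sketched in a few lines of Python. This is a toy illustration only, not the real NameNode logic; the function name `place_blocks` and the round-robin placement are assumptions for the example (real HDFS placement is rack-aware).

```python
# Toy sketch of HDFS block placement (illustration only).
# A file is split into fixed-size blocks; each block is replicated
# on several DataNodes, here chosen round-robin.

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size at the time: 64 MB
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, datanodes,
                 block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a mapping block_index -> list of DataNodes holding a replica."""
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    mapping = {}
    for b in range(n_blocks):
        # round-robin replica placement (real HDFS is rack-aware)
        mapping[b] = [datanodes[(b + r) % len(datanodes)]
                      for r in range(replication)]
    return mapping

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(200 * 1024 * 1024, nodes)   # a 200 MB file -> 4 blocks
```

Losing one DataNode in this scheme still leaves two replicas of every block, which is why the NameNode can transparently re-replicate after a failure.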
Hadoop Distributed File System 2/3
[Figure: HDFS architecture diagram]
Hadoop Distributed File System 3/3
• HDFS is good for storing large amounts of data, but what about:
– Transactional data? (e.g. concurrent reads and writes to the same data)
– Structured data? (e.g. record oriented views, columns)
– Relational data? (e.g. indexes)
• HDFS does not support these features
What is HBase?
• Open source implementation of Google’s BigTable
• Distributed Storage system for structured data
• Scales to petabytes of data across thousands of commodity servers
• Offers only primitive relational and transactional operations (NoSQL)
• Built on top of Hadoop’s HDFS
– HMaster co-exists with NameNode and knows table locations
– RegionServers co-exist with DataNodes and are responsible for table regions
Data model - Conceptual view
• A sparse, distributed, multidimensional sorted map.
– Map is indexed by row key, column key and timestamp
– Lexicographic sorting per row key
– Column key consists of <column family>:<column label>
– Billions of rows, billions of column labels, hundreds of column families
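The sorted multidimensional map above can be made concrete with a toy Python sketch. The class name `ToyTable` and its methods are illustrative only, not the HBase API; the point is the index structure: (row key, column key, timestamp) → value, with row keys kept in lexicographic order.

```python
# Toy sketch of the Bigtable/HBase data model: a sparse, sorted map keyed
# by (row key, column key, timestamp). Names are illustrative only.
import bisect

class ToyTable:
    def __init__(self):
        self.cells = {}       # (row, column) -> {timestamp: value}
        self.row_keys = []    # kept lexicographically sorted

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value
        i = bisect.bisect_left(self.row_keys, row)
        if i == len(self.row_keys) or self.row_keys[i] != row:
            self.row_keys.insert(i, row)

    def get(self, row, column):
        """Latest version wins, as in an HBase read with default settings."""
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None

t = ToyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.put("com.cnn.www", "anchor:cnnsi.com", 8, "old CNN")
```

Sparseness falls out naturally: a cell that was never written simply has no entry, so a row with values in only two columns costs only two map entries.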
KeyValue Physical Storage View
KeyValue 1
Row Key         Time Stamp   Column "contents:"
"com.cnn.www"   t6           "<html>..."
"com.cnn.www"   t5           "<html>..."
"com.cnn.www"   t3           "<html>..."

KeyValue 2
Row Key         Time Stamp   Column "anchor:"
"com.cnn.www"   t9           "anchor:cnnsi.com" = "CNN"
"com.cnn.www"   t8           "anchor:my.look.ca" = "CNN.com"

• One KeyValue per column family (contents and anchor)
• Sparse: only columns with values are included per key
From KeyValues to HFiles
• An HFile contains many sorted KeyValues
• It has a fixed upper size in MB
• An index located at the end of the HFile is used to quickly locate a single KeyValue
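A toy sketch of that layout: sorted records written first, index at the end, lookups going trailer → index → record. The file format here (JSON records plus an 8-byte trailer) is invented for illustration; real HFiles use a binary block format, but the lookup path is the same idea.

```python
# Toy sketch of an HFile-like layout: sorted key/value records followed by
# an index of byte offsets at the end of the file. Illustrative only.
import io
import json

def write_hfile(buf, records):
    """records: dict of key -> value, written in sorted key order."""
    index = {}
    for key in sorted(records):
        index[key] = buf.tell()                 # where this record starts
        buf.write((json.dumps([key, records[key]]) + "\n").encode())
    index_offset = buf.tell()
    buf.write(json.dumps(index).encode())       # index goes at the end...
    buf.write(index_offset.to_bytes(8, "big"))  # ...plus a fixed-size trailer

def read_value(buf, key):
    """Locate a single KeyValue via the trailing index, HFile-style."""
    buf.seek(-8, io.SEEK_END)
    index_offset = int.from_bytes(buf.read(8), "big")
    end = buf.seek(0, io.SEEK_END) - 8
    buf.seek(index_offset)
    index = json.loads(buf.read(end - index_offset))
    buf.seek(index[key])                        # jump straight to the record
    return json.loads(buf.readline())[1]

buf = io.BytesIO()
write_hfile(buf, {"w2": "beta", "w1": "alpha"})
```

Putting the index at the end is what lets the writer stream records sequentially without knowing their offsets in advance.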
From HFiles to HTables
• Many HFiles create an HRegion
– Region is identified by start and end key
– When HRegions get too large, they are split and two new regions are created
• Many HRegions create an HTable
– HMaster knows the locations of HRegions
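Region splitting can be sketched as cutting a key range at its middle key. This is a simplification with invented names (`split_region`, a tuple region representation); real HBase splits by HFile size and midkey, but the start/end-key bookkeeping is the same.

```python
# Toy sketch of HRegion splitting: when a region holds too many keys,
# split it at the middle key into two new regions. Illustrative only.
def split_region(region, max_keys=4):
    """region: (start_key, end_key, sorted list of row keys)."""
    start, end, keys = region
    if len(keys) <= max_keys:
        return [region]          # small enough, leave as-is
    mid = keys[len(keys) // 2]   # split point: the middle row key
    return [(start, mid, [k for k in keys if k < mid]),
            (mid, end, [k for k in keys if k >= mid])]

parts = split_region(("a", "z", ["b", "c", "d", "e", "f", "g"]))
```

After a split the HMaster only needs to record the two new (start, end) key ranges; clients locate a row by finding the region whose range contains its key.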
HBase Architecture
Taken from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
• HBase uses HDFS for data access
HBase Operations
• Supports basic DBMS operations
– Put(row_key, column_key, timestamp, value)
– Get(row_key), optionally narrowed by column_key and timestamp
– Scan(start_row_key, end_row_key)
• No table joins!!!
• No multi-row transactions
– Atomic single-row writes
– Optional atomic single-row reads
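The three operations above can be sketched over a sorted row-key space in plain Python. This is illustrative only, not the HBase client API; real HBase routes each call through the RegionServer owning the row's region.

```python
# Toy sketch of HBase's basic operations over a sorted row-key space.
import bisect

rows = {}          # row_key -> {column_key: value}
row_index = []     # sorted row keys, which is what enables range scans

def put(row_key, column_key, value):
    if row_key not in rows:
        bisect.insort(row_index, row_key)
        rows[row_key] = {}
    rows[row_key][column_key] = value   # single-row writes are atomic in HBase

def get(row_key):
    return rows.get(row_key)

def scan(start_row_key, end_row_key):
    """All rows with start <= key < end, in lexicographic order."""
    lo = bisect.bisect_left(row_index, start_row_key)
    hi = bisect.bisect_left(row_index, end_row_key)
    return [(k, rows[k]) for k in row_index[lo:hi]]

put("com.cnn.www", "contents:", "<html>...")
put("com.bbc.www", "contents:", "<html>...")
put("org.apache.hbase", "contents:", "<html>...")
```

Note how the reversed-domain row keys cluster all "com." sites into one contiguous scan range; that is why Bigtable-style schemas choose row keys this way.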
Other NoSQL alternatives
• Cassandra
• Voldemort
• Dynamo
• CouchDB, MongoDB, SimpleDB, Hypertable, and many more: see http://en.wikipedia.org/wiki/NoSQL
MapReduce 1/3
• A programming model
• A software framework
• for writing applications that
– rapidly process vast amounts of data in parallel
– on large clusters of compute nodes
• Utilizes HDFS for input/output
– HDFS stores and MapReduce processes.
MapReduce 2/3
• The problem is separated into two phases: the Map phase and the Reduce phase.
• Map: non-overlapping chunks of input data (<key,value> records) are assigned to separate processes (mappers), each of which emits a set of intermediate <key,value> results
• Reduce: the map results are fed to a (usually smaller) number of processes called reducers, which “summarize” their input into a smaller set of <key,value> results
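The two phases, including the grouping-by-key (shuffle) that the framework performs between them, fit in a few lines of Python. The helper name `run_mapreduce` and its signature are invented for this sketch; they are not part of any Hadoop API.

```python
# Minimal sketch of the MapReduce programming model (illustrative only):
# mappers emit intermediate (key, value) pairs, the framework groups them
# by key, and reducers summarize each group.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each input record produces intermediate pairs.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)          # shuffle: group values by key
    # Reduce phase: one summarized result per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Example: sum values per key.
result = run_mapreduce([("a", 1), ("a", 2), ("b", 5)],
                       map_fn=lambda k, v: [(k, v)],
                       reduce_fn=lambda k, vs: sum(vs))
```

Because mappers work on disjoint chunks and reducers on disjoint keys, both phases parallelize across machines with no coordination beyond the shuffle.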
MapReduce 3/3
[Figure: MapReduce execution flow diagram]
Example: Word Count 1/3
• Count the number of times each word appears in a large set of documents
• Possible usage: find popular URLs in log files
• Work Plan:
– Upload documents to HDFS
– Write a map and a reduce function
– Execute MapReduce job in Hadoop
– Get the job output in HDFS
Example: Word Count 2/3
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
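The pseudocode above translates directly into runnable Python, with the shuffle step written out explicitly (illustrative only; real Hadoop jobs implement Mapper and Reducer classes in Java, and the framework does the shuffle):

```python
# Word count: the map/reduce pseudocode as plain Python.
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterator over counts
    yield (word, sum(counts))

docs = [("d1", "w1 w2 w4"), ("d2", "w1 w2 w3 w4")]

# Shuffle: group the mappers' intermediate pairs by word.
groups = defaultdict(list)
for name, text in docs:
    for word, one in map_fn(name, text):
        groups[word].append(one)

counts = dict(kv for word, vs in groups.items()
              for kv in reduce_fn(word, vs))
```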
Example: Word Count 3/3
[Figure: word-count data flow for documents d1–d10 with M=3 mappers and R=2 reducers; each mapper emits per-split word counts, and the reducers combine them into the final totals shown: (w1,7), (w2,15), (w3,8), (w4,7)]
When should I use it?
• Good choice for
– Indexing log files
– Sorting vast amounts of data
– Image analysis
• Bad choice for
– Figuring π to 1,000,000 digits
– Calculating Fibonacci sequences
– MySQL replacement
Typical problems
• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g. for compliance
Hadoop MapReduce
• Master/Slave architecture
• A JobTracker Master
– Runs together with NameNode
– Receives client job requests
– Schedules and monitors MR jobs
• Move computation near the data
• Speculative execution
• Many TaskTrackers
– Run together with DataNodes
– Perform I/O operations with DataNodes
Use cases 1/3
• Large Scale Image Conversions
• 100 Amazon EC2 instances, 4 TB of raw TIFF data
• 11 million PDFs in 24 hours, for $240
• Internal log processing
• Reporting, analytics and machine learning
• Cluster of 1100 machines, 8800 cores and 12 PB raw storage
• Open source contributors (Hive)
• Store and process tweets, logs, etc
• Open source contributors (hadoop-lzo)
Use cases 2/3
• 100,000 CPUs in 25,000 computers
• Content/Ads Optimization, Search index
• Machine learning (e.g. spam filtering)
• Open source contributors (Pig)
• Natural language search (through Powerset)
• 400 nodes in EC2, storage in S3
• Open source contributors (!) to HBase
• ElasticMapReduce service
• On demand elastic Hadoop clusters for the Cloud
Use cases 3/3
• ETL processing, statistics generation
• Advanced algorithms for behavioral analysis and targeting
• Used for discovering People You May Know, and for other apps
• 3×30-node clusters with 16 GB RAM and 8 TB storage
• Leading Chinese language search engine
• Search log analysis, data mining
• 300TB per week
• 10 to 500 node clusters
Questions