
Page 1: Programming on  Hadoop

Programming on Hadoop

Page 2: Programming on  Hadoop

Outline

• Different perspectives on Cloud Computing
• The Anatomy of a Data Center
• The Taxonomy of Computation
  – Computation intensive
  – Data intensive
• The Hadoop Eco-system
• Limitations of Hadoop

Page 3: Programming on  Hadoop

Cloud Computing

• From the user perspective
  – A service which enables users to run their applications on the Internet
• From the service provider perspective
  – A resource pool which is used to deliver cloud services through the Internet
  – The resource pool is hosted in an on-premise data center
  – What does the data center (DC) look like?

Page 4: Programming on  Hadoop

An Example of DC

• Google’s Data Center in 2009.

From Jeffrey Dean’s talk at WSDM 2009

Page 5: Programming on  Hadoop

A Closer Look at DC – Overview

Figure is copied from [4]

Page 6: Programming on  Hadoop

A Closer Look at DC – Cooling

Figure is copied from [4]

Page 7: Programming on  Hadoop

A Closer Look at DC – Computing Resources

Figure is copied from [4]

Page 8: Programming on  Hadoop

The Commodity Server

• A commodity server is NOT a low-end server
  – Standard components vs. proprietary hardware
• Common configuration in 2008
  – Processor: 2 quad-core Intel Xeon 2.0 GHz CPUs
  – Memory: 8 GB ECC RAM
  – Storage: 4 × 1 TB SATA disks
  – Network: Gigabit Ethernet

Page 9: Programming on  Hadoop

Approaches to Deliver Service

• The dedicated approach
  – Serve each customer with dedicated computing resources
• The shared approach (multi-tenant architecture)
  – Serve customers with the shared resource pool

Page 10: Programming on  Hadoop

The Dedicated Approach

• Pros:
  – Easy to implement
  – Performance & security guarantees
• Cons:
  – Painful for customers to scale their applications
  – Poor resource utilization

Page 11: Programming on  Hadoop

The Shared Approach

• Pros:
  – No pain for customers to scale their applications
  – Better resource utilization
  – Better performance in some cases
  – Low service cost per customer
• Cons:
  – Needs a complicated software layer
  – Performance isolation/tuning may be complicated
    • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree

Page 12: Programming on  Hadoop

The Hadoop Eco-system

• A software infrastructure that delivers a DC as a service through the shared-resource approach
  – Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud
• We focus on the Hadoop core in this lecture
  – Hadoop == Hadoop-core hereafter

Core: Hadoop Distributed File System (HDFS), MapReduce
Extensions: HBase, Chukwa, Hive, Pig

Page 13: Programming on  Hadoop

The Taxonomy of Computations

• Computation-intensive tasks
  – Small data (in-memory), lots of CPU cycles per data item processed
  – Examples: machine learning
• Data-intensive tasks
  – Large-volume data (on-disk), relatively few CPU cycles per data item processed
  – Examples: DBMS

Page 14: Programming on  Hadoop

The Data-intensive Tasks

• Streaming-oriented data access
  – Read/write a large portion of the dataset in a streaming manner (sequentially)
  – Characteristics:
    • No seeks, high throughput
    • Optimized for a high data transfer rate
• Random-oriented data access
  – Read/write a small number of data items randomly located in the dataset
  – Characteristics:
    • Seek-oriented
    • Optimized for low-latency access to each data item

Page 15: Programming on  Hadoop

What Hadoop does & doesn’t

• Hadoop can perform
  – High-throughput streaming data access
  – Limited low-latency random data access through HBase
  – Large-scale analysis through MapReduce
• Hadoop cannot
  – Perform transactions
  – Serve certain time-critical applications

Page 16: Programming on  Hadoop

Hadoop Quick Start

• Very simple
  – Download the Hadoop package from Apache
    • http://hadoop.apache.org/
  – Unpack it into a folder
  – Do some configuration in hadoop-site.xml (see the sketch below)
    • fs.default.name selects the default file system (e.g., HDFS)
    • mapred.job.tracker points to the JobTracker of the MapReduce cluster
  – Start
    • Format the file system only once (in a fresh installation)
      – bin/hadoop namenode -format
    • Launch the HDFS & MapReduce cluster
      – bin/start-all.sh
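
As a reference, a minimal hadoop-site.xml for a single-node setup might look like the sketch below; the host/port values (localhost:9000, localhost:9001) are placeholders, not from the slides.

<?xml version="1.0"?>
<configuration>
  <!-- Default file system: point clients at the HDFS NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Address of the MapReduce JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>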

Page 17: Programming on  Hadoop

The Launched HDFS Cluster

Page 18: Programming on  Hadoop

The Launched MapReduce Cluster

Page 19: Programming on  Hadoop

The Hadoop Distributed Filesystem

• Wraps the DC as a resource pool and provides a set of APIs to let users read/write data from/into the DC sequentially

Page 20: Programming on  Hadoop

A Closer Look at the API

• Aha, writing “Hello World!”
  – bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}

Page 21: Programming on  Hadoop

A Closer Look at the API (cont.)

• Reading data from the HDFS

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}

Page 22: Programming on  Hadoop

Inside HDFS

• A single-NameNode, multiple-DataNode architecture (see [5] for reference)
  – Chops each file into a set of fixed-size blocks and stores those data blocks on all available DataNodes
  – The NameNode hosts all file system metadata (file-to-block mapping, block locations, etc.) in memory (see the sketch below)
  – The DataNodes host all file data for reading/writing
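
To make the block/location metadata concrete, here is a hedged sketch (not from the slides) that asks the NameNode for the block locations of the "testFile" written earlier; it assumes that file exists.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("testFile"));
    // The NameNode answers this query from its in-memory metadata.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the DataNodes holding a replica.
      System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
    }
  }
}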

Page 23: Programming on  Hadoop

Inside HDFS – Architecture

• Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html

Page 24: Programming on  Hadoop

Inside HDFS – Writing Data

Figure is copied from [2]

Page 25: Programming on  Hadoop

Inside HDFS – Reading Data

• What is the problem with reading/writing?

Figure is copied from [2]

Page 26: Programming on  Hadoop

The HDFS Cons

• Single reader/writer
  – Reads/writes a single block at a time
  – Touches only ONE DataNode
  – Data transfer rate == disk bandwidth of a SINGLE node
  – Too slow for a large file
    • Suppose disk bandwidth == 100 MB/sec
    • Reading/writing a 1 TB file requires ~3 hrs (1 TB / 100 MB/sec = 10,000 sec ≈ 2.8 hrs)
  – How to fix it?

Page 27: Programming on  Hadoop

Multiple Readers/Writers

• Reading/writing a large data set using multiple processes
  – Each process reads/writes a subset of the whole data set and materializes the sub-data set as a file
  – The whole data set becomes a collection of files
    • Typically, the file collection is stored in a directory named after the data set (see the sketch below)
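
As an illustration (not from the slides), one of the N writer processes might look like the following sketch; the path /root/datasetA and the record contents are placeholders following the naming convention used on the next slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class PartWriter {
  public static void main(String[] args) throws Exception {
    int part = Integer.parseInt(args[0]);   // which sub-set this process owns
    FileSystem fs = FileSystem.get(new Configuration());
    Path out = new Path(String.format("/root/datasetA/part-%04d", part));
    FSDataOutputStream fsOut = fs.create(out);
    // ... write this process's sub-set of records sequentially ...
    fsOut.writeBytes("records for sub-set " + part + "\n");
    fsOut.close();
  }
}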

Page 28: Programming on  Hadoop

Multiple Readers/Writers (cont.)

• Question: what is the proper number of readers and writers?

[Figure: Data set A is split into sub-sets 1–4; processes 1–4 each write one sub-set as files part-0001 … part-0004 under /root/datasetA]

Page 29: Programming on  Hadoop

Multiple Readers/Writers (cont.)

• Reading/writing a large data set using multiple readers/writers and materializing the data set as a collection of files is a common pattern in HDFS
• But it is too painful!
  – Invocation of multiple readers/writers in the cluster
  – Coordination of those readers/writers
  – Machine failure
  – ...
• Rescue: MapReduce

Page 30: Programming on  Hadoop

The MapReduce System

• MapReduce is a programming model and its associated implementation for processing and generating large data sets [1]
• The computation performs key/value-oriented operations and consists of two functions
  – Map: transforms an input key/value pair into a set of intermediate key/value pairs
  – Reduce: merges intermediate key/value pairs with the same key and produces another key/value pair

Page 31: Programming on  Hadoop

The MapReduce Programming Model

• Map: (k0, v0) -> [(k1, v1)]
• Reduce: (k1, [v1]) -> (k2, v2)
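
As an illustration (not on the original slide), for word count the model instantiates as:

Map:    (line offset, "to be or not to be") -> [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
Reduce: ("be", [1, 1]) -> ("be", 2)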

Page 32: Programming on  Hadoop

The System Architecture

• One JobTracker for job submission
• Multiple TaskTrackers for invocation of mappers or reducers

Figure is from Google Images

Page 33: Programming on  Hadoop

The Mapper Interface

• Mapper/Reducer is defined as a generic Java interface in Hadoop

public interface Mapper<K1, V1, K2, V2> {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter);
}

public interface Reducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter);
}
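
To show how these interfaces are typically filled in, here is a minimal word-count sketch (not from the slides) against the classic org.apache.hadoop.mapred API; the class names WordCountMapper/WordCountReducer are made up for illustration.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Emit (word, 1) for every word in the input line.
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}

class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the 1s emitted for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}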

Page 34: Programming on  Hadoop

The Data Types of MapReduce

• MapReduce makes no assumption about the data type
  – It does not know what constitutes a key/value pair
• Users must decide what the appropriate input/output data types are
  – The runtime data-interpreting pattern
  – Achieved by implementing two Hadoop interfaces
    • RecordReader<K, V> for parsing input key/value pairs
    • RecordWriter<K, V> for serializing output key/value pairs

Page 35: Programming on  Hadoop

The RecordReader/Writer Interface

interface RecordReader<K, V> {
  // Other functions omitted
  boolean next(K key, V value);
}

interface RecordWriter<K, V> {
  // Other functions omitted
  void write(K key, V value);
}
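
As a hedged sketch (not from the slides) of how a line-oriented reader might fill the key/value pair, the class below follows the simplified interface above rather than the full org.apache.hadoop.mapred.RecordReader, which also declares createKey(), createValue(), getPos(), getProgress(), and close().

import java.io.BufferedReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

class SimpleLineReader {
  private final BufferedReader in;
  private long lineNo = 0;

  SimpleLineReader(BufferedReader in) {
    this.in = in;
  }

  // Fills key with the line number and value with the line contents;
  // returns false when the underlying stream is exhausted.
  boolean next(LongWritable key, Text value) throws IOException {
    String line = in.readLine();
    if (line == null) {
      return false;
    }
    key.set(lineNo++);
    value.set(line);
    return true;
  }
}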

Page 36: Programming on  Hadoop

The Overall Picture

• The data set is split into many parts (InputSplits)
• Each part is processed by one mapper
• The intermediate results are processed by reducers
• Each reducer writes its results as a file

InputSplit-n -> RecordReader -> map -> shuffle/merge -> reduce -> RecordWriter -> part-000n
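
To tie the pieces together, a minimal job driver sketch (not from the slides) using the classic org.apache.hadoop.mapred API might look like this; WordCountMapper/WordCountReducer refer to the earlier sketch, and the input/output paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(WordCountJob.class);
    job.setJobName("wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    // Each block of the input typically becomes one InputSplit / one map task.
    FileInputFormat.setInputPaths(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    JobClient.runJob(job);   // submits to the JobTracker and waits for completion
  }
}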

Page 37: Programming on  Hadoop

Performance Tuning

• A lot of factors ...
• At the architecture level
  – Record parsing, map-side sorting, ..., see [3]
  – Shuffling: see many research papers in VLDB, SIGMOD
• Parameter tuning (see the sketch below)
  – Memory buffer for mappers/reducers
  – The rule of thumb for concurrent mappers and reducers
    • Map: one map per file block
    • Reduce: a small multiple of the available TaskTrackers
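
As a hedged sketch of parameter tuning (not from the slides): the property io.sort.mb (map-side sort buffer) and JobConf.setNumReduceTasks() exist in Hadoop releases of this era, but the concrete values below are placeholders to be sized per cluster.

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static JobConf configure(int taskTrackers) {
    JobConf conf = new JobConf();
    // Map-side sort buffer in MB (placeholder value; size against node memory).
    conf.setInt("io.sort.mb", 200);
    // Rule of thumb from the slide: a small multiple of the available TaskTrackers.
    conf.setNumReduceTasks(2 * taskTrackers);
    return conf;
  }
}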

Page 38: Programming on  Hadoop

Limitations of Hadoop

• HDFS
  – No reliable appending yet
  – Files are immutable
• MapReduce
  – Basically row-oriented
  – Support for complicated computation is not strong

Page 39: Programming on  Hadoop

Reference

• [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
• [2] Tom White. Hadoop: The Definitive Guide
• [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study
• [4] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
• [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System

Page 40: Programming on  Hadoop

Thank You!