
Page 1: Programming on  Hadoop

Programming on Hadoop

Page 2: Programming on  Hadoop

Outline

• Different perspectives on Cloud Computing
• The Anatomy of a Data Center
• The Taxonomy of Computation
  – Computation intensive
  – Data intensive
• The Hadoop Eco-system
• Limitations of Hadoop

Page 3: Programming on  Hadoop

Cloud Computing

• From the user perspective
  – A service which enables users to run their applications on the Internet
• From the service provider perspective
  – A resource pool which is used to deliver cloud services through the Internet
  – The resource pool is hosted in an on-premise data center
  – What does the data center (DC) look like?

Page 4: Programming on  Hadoop

An Example of DC

• Google’s Data Center in 2009.

From Jeffrey Dean’s talk at WSDM 2009

Page 5: Programming on  Hadoop

A Closer Look at DC – Overview

Figure is copied from [4]

Page 6: Programming on  Hadoop

A Closer Look at DC – Cooling

Figure is copied from [4]

Page 7: Programming on  Hadoop

A Closer Look at DC – Computing Resources

Figure is copied from [4]

Page 8: Programming on  Hadoop

The Commodity Server

• A commodity server is NOT a low-end server
  – Standard components vs. proprietary hardware
• Common configuration in 2008
  – Processor: 2 quad-core Intel Xeon 2.0 GHz CPUs
  – Memory: 8 GB ECC RAM
  – Storage: 4 × 1 TB SATA disks
  – Network: Gigabit Ethernet

Page 9: Programming on  Hadoop

Approaches to Deliver Service

• The dedicated approach
  – Serve each customer with dedicated computing resources
• The shared approach (multi-tenant architecture)
  – Serve customers with the shared resource pool

Page 10: Programming on  Hadoop

The Dedicated Approach

• Pros:
  – Easy to implement
  – Performance & security guarantees
• Cons:
  – Painful for customers to scale their applications
  – Poor resource utilization

Page 11: Programming on  Hadoop

The Shared Approach

• Pros:
  – No pain for customers to scale their applications
  – Better resource utilization
  – Better performance in some cases
  – Low service cost per customer
• Cons:
  – Needs a complicated software layer
  – Performance isolation/tuning may be complicated
    • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree

Page 12: Programming on  Hadoop

The Hadoop Eco-system

• A software infrastructure that delivers a DC as a service through the shared-resource approach
  – Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud
• We focus on the Hadoop core in this lecture
  – Hadoop == Hadoop-core hereafter

Core: Hadoop Distributed File System (HDFS), MapReduce
Extensions: HBase, Chukwa, Hive, Pig

Page 13: Programming on  Hadoop

The Taxonomy of Computations

• Computation-intensive tasks
  – Small data (in-memory), lots of CPU cycles per data item processed
  – Examples: machine learning
• Data-intensive tasks
  – Large-volume data (on-disk), relatively few CPU cycles per data item processed
  – Examples: DBMS

Page 14: Programming on  Hadoop

The Data-intensive Tasks

• Streaming-oriented data access
  – Read/write a large portion of the dataset in a streaming manner (sequentially)
  – Characteristics:
    • No seeks, high throughput
    • Optimized for a high data transfer rate
• Random-oriented data access
  – Read/write a small number of data items randomly located in the dataset
  – Characteristics:
    • Seek-oriented
    • Optimized for low-latency access to each data item

Page 15: Programming on  Hadoop

What Hadoop does & doesn’t

• Hadoop can perform
  – High-throughput streaming data access
  – Limited low-latency random data access through HBase
  – Large-scale analysis through MapReduce
• Hadoop cannot
  – Perform transactions
  – Serve certain time-critical applications

Page 16: Programming on  Hadoop

Hadoop Quick Start

• Very simple
  – Download the Hadoop package from Apache
    • http://hadoop.apache.org/
  – Unpack it into a folder
  – Do some configuration in hadoop-site.xml (see the sketch below)
    • fs.default.name selects the default file system (e.g., HDFS)
    • mapred.job.tracker points to the JobTracker of the MapReduce cluster
  – Start
    • Format the file system only once (in a fresh installation)
      – bin/hadoop namenode -format
    • Launch the HDFS & MapReduce cluster
      – bin/start-all.sh
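
As a reference, a minimal hadoop-site.xml for a single-node setup might look like the sketch below; the host/port values (localhost:9000, localhost:9001) are placeholders, not from the slides.

<?xml version="1.0"?>
<configuration>
  <!-- Default file system: point clients at the HDFS NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Address of the MapReduce JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>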

Page 17: Programming on  Hadoop

The Launched HDFS Cluster

Page 18: Programming on  Hadoop

The Launched MapReduce Cluster

Page 19: Programming on  Hadoop

The Hadoop Distributed Filesystem

• Wraps the DC as a resource pool and provides a set of APIs to let users read/write data from/into the DC sequentially

Page 20: Programming on  Hadoop

A Closer Look at the API

• Aha, writing “Hello World!”
  – bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}

Page 21: Programming on  Hadoop

A Closer Look at the API (cont.)

• Reading data from the HDFS

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
    fsIn.close();
  }
}

Page 22: Programming on  Hadoop

Inside HDFS

• A single-NameNode, multiple-DataNode architecture (see [5] for reference)
  – Chops each file into a set of fixed-size blocks and stores those data blocks on all available DataNodes
  – The NameNode hosts all file system metadata (file-to-block mapping, block locations, etc.) in memory (see the sketch below)
  – The DataNodes host all file data for reading/writing
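
To make the block/location metadata concrete, here is a hedged sketch (not from the slides) that asks the NameNode for the block locations of the "testFile" written earlier; it assumes that file exists.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("testFile"));
    // The NameNode answers this query from its in-memory metadata.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the DataNodes holding a replica.
      System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
    }
  }
}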

Page 23: Programming on  Hadoop

Inside HDFS – Architecture

• Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html

Page 24: Programming on  Hadoop

Inside HDFS – Writing Data

Figure is copied from [2]

Page 25: Programming on  Hadoop

Inside HDFS – Reading Data

• What is the problem with reading/writing?

Figure is copied from [2]

Page 26: Programming on  Hadoop

The HDFS Cons

• Single reader/writer
  – Reads/writes a single block at a time
  – Touches only ONE DataNode
  – Data transfer rate == disk bandwidth of a SINGLE node
  – Too slow for a large file
    • Suppose disk bandwidth == 100 MB/sec
    • Reading/writing a 1 TB file requires ~3 hrs (1 TB / 100 MB/sec = 10,000 sec ≈ 2.8 hrs)
  – How to fix it?

Page 27: Programming on  Hadoop

Multiple Readers/Writers

• Reading/writing a large data set using multiple processes
  – Each process reads/writes a subset of the whole data set and materializes the sub-data set as a file
  – The whole data set becomes a collection of files
    • Typically, the file collection is stored in a directory named after the data set (see the sketch below)
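
As an illustration (not from the slides), one of the N writer processes might look like the following sketch; the path /root/datasetA and the record contents are placeholders following the naming convention used on the next slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class PartWriter {
  public static void main(String[] args) throws Exception {
    int part = Integer.parseInt(args[0]);   // which sub-set this process owns
    FileSystem fs = FileSystem.get(new Configuration());
    Path out = new Path(String.format("/root/datasetA/part-%04d", part));
    FSDataOutputStream fsOut = fs.create(out);
    // ... write this process's sub-set of records sequentially ...
    fsOut.writeBytes("records for sub-set " + part + "\n");
    fsOut.close();
  }
}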

Page 28: Programming on  Hadoop

Multiple Readers/Writers (cont.)

• Question: what is the proper number of readers and writers?

[Figure: Data set A is split into sub-sets 1–4; processes 1–4 each write one sub-set as files part-0001 … part-0004 under /root/datasetA]

Page 29: Programming on  Hadoop

Multiple Readers/Writers (cont.)

• Reading/writing a large data set using multiple readers/writers and materializing the data set as a collection of files is a common pattern in HDFS
• But it is too painful!
  – Invocation of multiple readers/writers in the cluster
  – Coordination of those readers/writers
  – Machine failure
  – ...
• Rescue: MapReduce

Page 30: Programming on  Hadoop

The MapReduce System

• MapReduce is a programming model and its associated implementation for processing and generating large data sets [1]
• The computation performs key/value-oriented operations and consists of two functions
  – Map: transforms an input key/value pair into a set of intermediate key/value pairs
  – Reduce: merges intermediate key/value pairs with the same key and produces another key/value pair

Page 31: Programming on  Hadoop

The MapReduce Programming Model

• Map: (k0, v0) -> [(k1, v1)]
• Reduce: (k1, [v1]) -> (k2, v2)
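
As an illustration (not on the original slide), for word count the model instantiates as:

Map:    (line offset, "to be or not to be") -> [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
Reduce: ("be", [1, 1]) -> ("be", 2)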

Page 32: Programming on  Hadoop

The System Architecture

• One JobTracker for job submission
• Multiple TaskTrackers for invocation of mappers or reducers

Figure is from Google Images

Page 33: Programming on  Hadoop

The Mapper Interface

• Mapper/Reducer is defined as a generic Java interface in Hadoop

public interface Mapper<K1, V1, K2, V2> {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter);
}

public interface Reducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter);
}
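
To show how these interfaces are typically filled in, here is a minimal word-count sketch (not from the slides) against the classic org.apache.hadoop.mapred API; the class names WordCountMapper/WordCountReducer are made up for illustration.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Emit (word, 1) for every word in the input line.
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}

class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the 1s emitted for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}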

Page 34: Programming on  Hadoop

The Data Types of MapReduce

• MapReduce makes no assumption about the data type
  – It does not know what constitutes a key/value pair
• Users must decide what the appropriate input/output data types are
  – The runtime data-interpreting pattern
  – Achieved by implementing two Hadoop interfaces
    • RecordReader<K, V> for parsing input key/value pairs
    • RecordWriter<K, V> for serializing output key/value pairs

Page 35: Programming on  Hadoop

The RecordReader/Writer Interface

interface RecordReader<K, V> {
  // Other functions omitted
  boolean next(K key, V value);
}

interface RecordWriter<K, V> {
  // Other functions omitted
  void write(K key, V value);
}
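
As a hedged sketch (not from the slides) of how a line-oriented reader might fill the key/value pair, the class below follows the simplified interface above rather than the full org.apache.hadoop.mapred.RecordReader, which also declares createKey(), createValue(), getPos(), getProgress(), and close().

import java.io.BufferedReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

class SimpleLineReader {
  private final BufferedReader in;
  private long lineNo = 0;

  SimpleLineReader(BufferedReader in) {
    this.in = in;
  }

  // Fills key with the line number and value with the line contents;
  // returns false when the underlying stream is exhausted.
  boolean next(LongWritable key, Text value) throws IOException {
    String line = in.readLine();
    if (line == null) {
      return false;
    }
    key.set(lineNo++);
    value.set(line);
    return true;
  }
}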

Page 36: Programming on  Hadoop

The Overall Picture

• The data set is split into many parts (InputSplits)
• Each part is processed by one mapper
• The intermediate results are processed by reducers
• Each reducer writes its results as a file

InputSplit-n -> RecordReader -> map -> shuffle/merge -> reduce -> RecordWriter -> part-000n
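
To tie the pieces together, a minimal job driver sketch (not from the slides) using the classic org.apache.hadoop.mapred API might look like this; WordCountMapper/WordCountReducer refer to the earlier sketch, and the input/output paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(WordCountJob.class);
    job.setJobName("wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    // Each block of the input typically becomes one InputSplit / one map task.
    FileInputFormat.setInputPaths(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    JobClient.runJob(job);   // submits to the JobTracker and waits for completion
  }
}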

Page 37: Programming on  Hadoop

Performance Tuning

• A lot of factors ...
• At the architecture level
  – Record parsing, map-side sorting, ..., see [3]
  – Shuffling: see many research papers in VLDB, SIGMOD
• Parameter tuning (see the sketch below)
  – Memory buffer for mappers/reducers
  – The rule of thumb for concurrent mappers and reducers
    • Map: one map per file block
    • Reduce: a small multiple of the available TaskTrackers
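
As a hedged sketch of parameter tuning (not from the slides): the property io.sort.mb (map-side sort buffer) and JobConf.setNumReduceTasks() exist in Hadoop releases of this era, but the concrete values below are placeholders to be sized per cluster.

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static JobConf configure(int taskTrackers) {
    JobConf conf = new JobConf();
    // Map-side sort buffer in MB (placeholder value; size against node memory).
    conf.setInt("io.sort.mb", 200);
    // Rule of thumb from the slide: a small multiple of the available TaskTrackers.
    conf.setNumReduceTasks(2 * taskTrackers);
    return conf;
  }
}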

Page 38: Programming on  Hadoop

Limitations of Hadoop

• HDFS
  – No reliable appending yet
  – Files are immutable
• MapReduce
  – Basically row-oriented
  – Support for complicated computation is not strong

Page 39: Programming on  Hadoop

Reference

• [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
• [2] Tom White. Hadoop: The Definitive Guide
• [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study
• [4] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
• [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System

Page 40: Programming on  Hadoop

Thank You!