Hadoop Introduction
Transcript of Hadoop Introduction

Page 1: Hadoop  Introduction

Hadoop Introduction

CS237 Presentation – Yiping Wan

Page 2: Hadoop  Introduction

Outline
- Basic Idea
- Architecture
- Closer Look
- Demo
- Project Plan

Page 3: Hadoop  Introduction

What is Hadoop?
Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.

Page 4: Hadoop  Introduction

Hadoop Highlights
- Distributed File System
- Fault Tolerance
- Open Data Format
- Flexible Schema
- Queryable Database

Page 5: Hadoop  Introduction

Why use Hadoop?
- Need to process multi-petabyte datasets
- Data may not have a strict schema
- Expensive to build reliability into each application
- Nodes fail every day
- Need a common infrastructure

Page 6: Hadoop  Introduction

Who uses Hadoop?
- Amazon
- Facebook
- Google
- Yahoo!
- …

Page 7: Hadoop  Introduction

Goals of HDFS
- Very large distributed file system
- Assumes commodity hardware
- Optimized for batch processing
- Runs on heterogeneous OSes

Page 8: Hadoop  Introduction

HDFS Architecture

Page 9: Hadoop  Introduction

NameNode
Metadata in memory
- The entire metadata is kept in main memory
Types of metadata
- List of files
- List of blocks for each file
- List of DataNodes for each block
- File attributes, e.g. creation time, replication factor
A transaction log
- Records file creations, deletions, etc.
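
To make this concrete, here is a toy sketch of these in-memory tables; the types and field names are illustrative, not the NameNode's actual classes:

import java.util.List;
import java.util.Map;

// Toy model of the NameNode's in-memory metadata; illustrative only.
class FileAttrs {
    long creationTime;       // e.g. millis since epoch
    short replicationFactor;
}

class NameNodeMetadata {
    Map<String, List<Long>> blocksOfFile;  // file path -> IDs of its blocks
    Map<Long, List<String>> nodesOfBlock;  // block ID -> DataNodes holding a replica
    Map<String, FileAttrs> attrsOfFile;    // file path -> attributes
    List<String> transactionLog;           // file creations, deletions, etc.
                                           // (persisted to disk in reality)
}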

Page 10: Hadoop  Introduction

DataNode
A block server
- Stores data in the local file system
- Stores metadata of a block, e.g. the checksum
- Serves data and metadata to clients
Block report
- Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
- Forwards data to other specified DataNodes

Page 11: Hadoop  Introduction

Block Placement
Replication strategy
- One replica on the local node
- Second replica on a remote rack
- Third replica on the same remote rack
- Additional replicas are randomly placed
Clients read from the nearest replica
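
A minimal sketch of this placement rule, assuming a simple rack-to-nodes map; all names are illustrative, and this is not HDFS's actual placement code:

import java.util.*;

// Illustrative: pick replica locations per the 3-replica rule above.
class ReplicaPlacement {
    static List<String> place(Map<String, List<String>> rackToNodes,
                              String localRack, String localNode, Random rnd) {
        List<String> replicas = new ArrayList<>();
        replicas.add(localNode);                     // 1st: local node
        List<String> remoteRacks = new ArrayList<>(rackToNodes.keySet());
        remoteRacks.remove(localRack);
        String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
        List<String> remoteNodes = new ArrayList<>(rackToNodes.get(remoteRack));
        Collections.shuffle(remoteNodes, rnd);
        replicas.add(remoteNodes.get(0));            // 2nd: a node on a remote rack
        if (remoteNodes.size() > 1)
            replicas.add(remoteNodes.get(1));        // 3rd: another node, same remote rack
        return replicas;
    }
}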

Page 12: Hadoop  Introduction

Data Correctness
Use checksums to validate data (CRC32)
File creation
- Client computes a checksum per 512 bytes
- DataNode stores the checksums
File access
- Client retrieves the data and checksums from the DataNode
- If validation fails, the client tries other replicas
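
As an illustration of the client-side step, a minimal sketch that computes one CRC32 per 512-byte chunk using plain java.util.zip (not HDFS's actual checksum classes):

import java.util.zip.CRC32;

// Compute one CRC32 checksum per 512-byte chunk of a block's data.
class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;

    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.update(data, off, len);
            sums[i] = crc.getValue(); // stored by the DataNode, re-verified on read
        }
        return sums;
    }
}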

Page 13: Hadoop  Introduction

Data Pipelining
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the next DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block of the file
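
A minimal sketch of the forwarding chain, modeling each DataNode as a stage that persists a packet and passes it on; illustrative only, not HDFS's pipeline code:

import java.io.ByteArrayOutputStream;

// Illustrative model of the write pipeline: each stage persists a packet
// locally, then forwards it to the next replica in the chain.
class DataNodeStage {
    private final DataNodeStage next;  // null for the last replica
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream();

    DataNodeStage(DataNodeStage next) { this.next = next; }

    void write(byte[] packet) {
        disk.write(packet, 0, packet.length);  // persist locally
        if (next != null) next.write(packet);  // forward downstream
    }
}

The client writes each packet only to the head of the chain; once every stage has the block, it moves on to the next block.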

Page 14: Hadoop  Introduction

Hadoop MapReduce
MapReduce programming model
- Framework for distributed processing of large data sets
- Pluggable user code runs in a generic framework
A common design pattern in data processing:
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
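
To make the model concrete, here is the classic WordCount in the old org.apache.hadoop.mapred API that the following slides walk through (the wordcount example bundled in hadoop-0.20.2-examples.jar is essentially this):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Classic WordCount in the old org.apache.hadoop.mapred API.
public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // map: emit (word, 1) for every token in the input line
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce: sum the counts for each word
        // (values arrive grouped by key after the shuffle)
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}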

Page 15: Hadoop  Introduction

MapReduce Usage
- Log processing
- Web search indexing
- Ad-hoc queries

Page 16: Hadoop  Introduction

MapReduce Architecture

Page 17: Hadoop  Introduction

Closer Look
MapReduce components
- JobClient
- JobTracker
- TaskTracker
- Child
Job creation/execution process

Page 18: Hadoop  Introduction

MapReduce Process (org.apache.hadoop.mapred)

- JobClient: submits the job
- JobTracker: manages and schedules the job, splits the job into tasks
- TaskTracker: starts and monitors task execution
- Child: the process that actually executes the task

Page 19: Hadoop  Introduction

Inter-Process Communication
IPC/RPC (org.apache.hadoop.ipc)
Protocols
- JobClient <--> JobTracker: JobSubmissionProtocol
- TaskTracker <--> JobTracker: InterTrackerProtocol
- TaskTracker <--> Child: TaskUmbilicalProtocol
The JobTracker implements both the JobSubmissionProtocol and the InterTrackerProtocol, and acts as the server in both IPCs. The TaskTracker implements the TaskUmbilicalProtocol; the Child gets task information and reports task status through it.
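
To illustrate the pattern, a hypothetical EchoProtocol in the org.apache.hadoop.ipc style (Hadoop 0.20 era): the interface extends VersionedProtocol, the server passes an implementation to RPC.getServer, and the client obtains a proxy via RPC.getProxy, mirroring the calls shown on the following slides:

import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical protocol; not one of Hadoop's real protocols.
interface EchoProtocol extends VersionedProtocol {
    long versionID = 1L;
    String echo(String msg) throws IOException;
}

public class EchoServer implements EchoProtocol {
    public String echo(String msg) { return msg; }

    // Required by VersionedProtocol: report the protocol version we speak.
    public long getProtocolVersion(String protocol, long clientVersion) {
        return versionID;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Server side: expose this object's methods over IPC.
        RPC.Server server = RPC.getServer(new EchoServer(), "0.0.0.0", 9000, conf);
        server.start();
        // Client side: obtain a proxy and invoke the method remotely.
        EchoProtocol proxy = (EchoProtocol) RPC.getProxy(EchoProtocol.class,
                EchoProtocol.versionID, new InetSocketAddress("localhost", 9000), conf);
        System.out.println(proxy.echo("ping")); // prints "ping"
    }
}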

Page 20: Hadoop  Introduction

JobClient.submitJob - 1

Check input and output, e.g. whether the output directory already exists:

job.getInputFormat().validateInput(job);
job.getOutputFormat().checkOutputSpecs(fs, job);

Get the InputSplits, sort them, and write them to HDFS:

InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split

Page 21: Hadoop  Introduction

JobClient.submitJob - 2

The jar file and configuration file are uploaded to the HDFS system directory:

job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
JobStatus status = jobSubmitClient.submitJob(jobId);

The submitJob call is an RPC invocation; jobSubmitClient is a proxy created during initialization.
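
From the user's side, all of this is triggered by a small driver. A minimal sketch using the old API, reusing the WordCount Map/Reduce classes sketched earlier; JobClient.runJob wraps submitJob and polls until the job completes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Minimal old-API driver; JobClient.runJob submits the job and polls.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.Map.class);      // from the earlier sketch
        conf.setReducerClass(WordCount.Reduce.class);  // from the earlier sketch
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}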

Page 22: Hadoop  Introduction

Job initialization on JobTracker - 1

JobTracker.submitJob(jobId) receives the RPC invocation request:

JobInProgress job = new JobInProgress(jobId, this, this.conf);

Add the job to the job queues:

jobs.put(job.getProfile().getJobId(), job);
jobsByPriority.add(job);
jobInitQueue.add(job);

Page 23: Hadoop  Introduction

Job initialization on JobTracker - 2

Sort by priority:

resortPriority(); // compare JobPriority first, then job submission time

Wake the JobInitThread:

jobInitQueue.notifyAll();
job = jobInitQueue.remove(0);
job.initTasks();
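
A minimal sketch of the ordering that resortPriority() implies, with illustrative types and field names rather than JobTracker's actual code:

import java.util.Comparator;

// Illustrative ordering: priority first (VERY_HIGH sorts before VERY_LOW),
// then the earlier submission time wins.
class JobOrdering {
    enum JobPriority { VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW }

    static class Job {
        final JobPriority priority;
        final long submissionTime; // millis since epoch
        Job(JobPriority p, long t) { priority = p; submissionTime = t; }
    }

    static final Comparator<Job> BY_PRIORITY_THEN_TIME =
            Comparator.comparing((Job j) -> j.priority)
                      .thenComparingLong(j -> j.submissionTime);
}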

Page 24: Hadoop  Introduction

JobInProgress - 1

JobInProgress(String jobid, JobTracker jobtracker, JobConf default_conf);
JobInProgress.initTasks();

DataInputStream splitFile = fs.open(new Path(conf.get("mapred.job.split.file")));
// mapred.job.split.file --> $SYSTEMDIR/$JOBID/job.split

Page 25: Hadoop  Introduction

JobInProgress - 2

splits = JobClient.readSplitFile(splitFile);
numMapTasks = splits.length;
maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);
reduces[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);

JobStatus --> JobStatus.RUNNING

Page 26: Hadoop  Introduction

JobTracker Task Scheduling - 1

Task getNewTaskForTaskTracker(String taskTracker)

Compute the maximum number of tasks that can run on this TaskTracker:

int maxCurrentMapTasks = tts.getMaxMapTasks();
int maxMapLoad = Math.min(maxCurrentMapTasks,
    (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));

Page 27: Hadoop  Introduction

JobTracker Task Scheduling - 2

int numMaps = tts.countMapTasks(); // number of currently running map tasks

If numMaps < maxMapLoad, more tasks can be allocated: based on priority, pick the first job from the jobsByPriority queue, create a task, and return it to the TaskTracker:

Task t = job.obtainNewMapTask(tts, numTaskTrackers);
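
A worked example of the load cap with illustrative numbers:

// Illustrative: 4 map slots on this TaskTracker, 10 map tasks still to run
// cluster-wide, 3 TaskTrackers in the cluster.
int maxCurrentMapTasks = 4;
int remainingMapLoad = 10;
int numTaskTrackers = 3;
int maxMapLoad = Math.min(maxCurrentMapTasks,
    (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
// ceil(10 / 3.0) = 4, so maxMapLoad = min(4, 4) = 4: this tracker may run
// up to 4 map tasks, spreading the remaining load evenly across trackers.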

Page 28: Hadoop  Introduction

Start TaskTracker - 1

initialize()
- Remove the original local directory
- RPC initialization:

taskReportServer = RPC.getServer(this, bindAddress, tmpPort, max, false, this.fConf);
InterTrackerProtocol jobClient = (InterTrackerProtocol) RPC.waitForProxy(
    InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, this.fConf);

Page 29: Hadoop  Introduction

Start TaskTracker - 2

run();
offerService();

The TaskTracker periodically talks to the JobTracker with a heartbeat message:

HeartbeatResponse heartbeatResponse = transmitHeartBeat();

Page 30: Hadoop  Introduction

Run Task on TaskTracker - 1

TaskTracker.localizeJob(TaskInProgress tip);
launchTasksForJob(tip, new JobConf(rjob.jobFile));

tip.launchTask(); // TaskTracker.TaskInProgress
tip.localizeTask(task); // create folder, symbolic link
runner = task.createRunner(TaskTracker.this);
runner.start(); // start TaskRunner thread

Page 31: Hadoop  Introduction

Run Task on TaskTracker - 2

TaskRunner.run();

Configure the child process's JVM parameters, i.e. classpath, taskid, and the taskReportServer's address & port. Then start the child process:

runChild(wrappedCommand, workDir, taskid);
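
A hypothetical sketch of what assembling and starting that child command amounts to; the argument names and order are illustrative, and the real logic lives in TaskRunner:

import java.io.File;
import java.util.Arrays;
import java.util.List;

// Illustrative only: assemble a child JVM command line and launch it.
class ChildLauncher {
    static Process launch(String classpath, String host, int port,
                          String taskid, File workDir) throws Exception {
        List<String> cmd = Arrays.asList(
            "java", "-classpath", classpath,
            "org.apache.hadoop.mapred.Child",    // the child's main class
            host, String.valueOf(port), taskid); // taskReportServer address + task id
        return new ProcessBuilder(cmd).directory(workDir).start();
    }
}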

Page 32: Hadoop  Introduction

Child.main()

Create the RPC proxy and execute RPC invocations through it:

TaskUmbilicalProtocol umbilical = (TaskUmbilicalProtocol) RPC.getProxy(
    TaskUmbilicalProtocol.class, TaskUmbilicalProtocol.versionID, address, defaultConf);
Task task = umbilical.getTask(taskid);

task.run(); // MapTask.run / ReduceTask.run

Page 33: Hadoop  Introduction

Finish Job - 1

Child:

task.done(umbilical);
// RPC call: umbilical.done(taskId, shouldBePromoted)

TaskTracker:

done(taskId, shouldPromote);
TaskInProgress tip = tasks.get(taskid);
tip.reportDone(shouldPromote);
taskStatus.setRunState(TaskStatus.State.SUCCEEDED);

Page 34: Hadoop  Introduction

Finish Job - 2

JobTracker:

TaskStatus report: status.getTaskReports();
TaskInProgress tip = taskidToTIPMap.get(taskId);

JobInProgress updates the JobStatus:

tip.getJob().updateTaskStatus(tip, report, myMetrics);

When one task of the current job finishes:

completedTask(tip, taskStatus, metrics);
if (this.status.getRunState() == JobStatus.RUNNING && allDone) {
    this.status.setRunState(JobStatus.SUCCEEDED);
}

Page 35: Hadoop  Introduction

Demo

Word count:

hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>

Hive:

hive -f pagerank.hive

Page 36: Hadoop  Introduction

Project Plan

Hadoop Load Balance Enhancement
- Static Load Balancing Rules
- Dynamic Process Migration

Page 37: Hadoop  Introduction

Q & A