Tom Wheeler: Introduction to Apache Hadoop (St. Louis Java ...files.meetup.com/1566558/2014-02...
Introduction to Apache Hadoop
Tom Wheeler | St. Louis Java User Group
About the Presenter…
Tom Wheeler, Software Engineer | Greater St. Louis Area | Information Technology and Services

Current:
• Senior Curriculum Developer at Cloudera

Past:
• Principal Software Engineer at Object Computing, Inc. (Boeing)
• Software Engineer, Level IV at WebMD
• Senior Software Engineer at Teralogix
• Senior Programmer/Analyst at A.G. Edwards and Sons, Inc.

Distant Past:
• Contract Web development, full-time instructor, worked at an ISP
About Cloudera

• Helps companies profit from their data
• Founded in 2008 to commercialize Hadoop
• Staff includes committers to Hadoop and related projects
• Active contributors to the Apache Software Foundation

• Our mission is to help organizations profit from their data
• Software (CDH distribution and Cloudera Manager)
• Consulting and support services
• Training and certification
About the Presentation…
• What’s ahead
• Fundamental Concepts
• HDFS: The Hadoop Distributed File System
• Data Processing with MapReduce
• The Hadoop Ecosystem
• Conclusion + Q&A
Fundamental Concepts
Why the World Needs Hadoop
Volume
• Every day… • More than 1.5 billion shares are traded on the NYSE
• Every minute… • TransUnion makes nearly 70,000 updates to credit files
• Every second… • Banks process more than 10,000 credit card transactions
Velocity

• We are generating data faster than ever
• People are increasingly interacting online
• Processes are increasingly automated
• Systems are increasingly interconnected

Variety
• We’re producing a variety of data, including
• Audio
• Video
• Images
• Log files
• Web pages
• Product ratings
• Social network connections
• Not all of this maps cleanly to the relational model
Big Data Can Mean Big Opportunity
• One tweet is an anecdote • But a million tweets may signal important trends
• One person’s product review is an opinion • But a million reviews might reveal a design flaw
• One person’s diagnosis is an isolated case • But a million medical records could lead to a cure
We Just Need a System that Scales
• Too much data for traditional tools
• Two key problems
• How to reliably store this data at a reasonable cost
• How do we process all the data we’ve stored
What is Apache Hadoop?
• Scalable data storage and processing
• Distributed and fault-tolerant
• Runs on standard hardware
• Two main components
• Storage: Hadoop Distributed File System (HDFS)
• Processing: MapReduce
• Hadoop clusters are composed of computers called nodes • Clusters range in size from one to thousands of nodes
How Did Apache Hadoop Originate?
• Spun off from an open source search engine project • Papers published by Google strongly influenced its design
• Other Web companies quickly saw the benefits
• Early adoption by Yahoo, Facebook and others
2002: Nutch spun off from Lucene
2003: Google publishes GFS paper
2004: Google publishes MapReduce paper
2005: Nutch rewritten for MapReduce
2006: Hadoop becomes Lucene subproject
2007: 1,000-node cluster at Yahoo
How Are Organizations Using Hadoop?
• Let’s look at a few common uses…
Hadoop Use Case: Customer Analytics
Example Inc. Public Web Site (February 9 - 15)

Category     Unique Visitors  Page Views  Bounce Rate  Conversion Rate  Average Time on Page
Television   1,967,345        8,439,206   23%          51%              17 seconds
Smartphone   982,384          3,185,749   47%          41%              23 seconds
MP3 Player   671,820          2,174,913   61%          12%              42 seconds
Stereo       472,418          1,627,843   74%          19%              26 seconds
Monitor      327,018          1,241,837   56%          17%              37 seconds
Tablet       217,328          816,545     48%          28%              53 seconds
Printer      127,124          535,261     27%          64%              34 seconds
Hadoop Use Case: Sentiment Analysis
[Chart: References to Product in Social Media (Hourly); hourly mention counts ranging from 10,000 to 80,000 across hours 00 through 23, broken down into Negative, Neutral, and Positive]
Hadoop Use Case: Product Recommendations
Tom, we suggest the following...
• Grand Theft Auto - V: $54.99 (7,268 ratings)
• Death of a Decade (CD): $11.99 (372 ratings)
• St. Louis Cardinals Hat: $15.99 (15,916 ratings)
Hadoop Use Case: Fraud Detection
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox”

— Grace Hopper, early advocate of distributed computing
Comparing Hadoop to Other Systems
• Monolithic systems don’t scale
• Modern high-performance computing systems are distributed
• They spread computations across many machines in parallel
• Widely used for scientific applications
• Let’s examine how a typical HPC system works
Architecture of a Typical HPC System

An HPC cluster connects dedicated compute nodes to a shared storage system over a fast network. A typical job runs in three steps:

Step 1: Copy input data from the storage system to the compute nodes
Step 2: Process the data on the compute nodes
Step 3: Copy output data back to the storage system
You Don’t Just Need Speed…
• The problem is that we have way more data than code
$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
You Need Speed At Scale
[Diagram: the fast network between the storage system and the compute nodes becomes the bottleneck]
Hadoop Design Fundamental: Data Locality
• This is a hallmark of Hadoop’s design
• Don’t bring the data to the computation
• Bring the computation to the data!

• Hadoop uses the same machines for storage and processing
• Significantly reduces need to transfer data across network
The Hadoop Distributed Filesystem
HDFS
HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Reliable, low-cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
• Hierarchical, with a root directory
• UNIX-style paths (e.g., /sales/west/accounts.txt)
• UNIX-style file ownership and permissions
HDFS: Hadoop Distributed File System
• There are also some major deviations from UNIX filesystems
• Designed for sequential access to large files
• Cannot modify file content once written
• HDFS is implemented as a user-space Java process
• Requires the use of special commands and APIs
Accessing HDFS via the Command Line
• Users typically access HDFS via the hadoop fs command
• Actions specified with subcommands (prefixed with a minus sign)
• Most are immediately familiar to UNIX users
$ hadoop fs -ls /sales
$ hadoop fs -cat /sales/billing_data.csv
$ hadoop fs -rm -r -f /webdata/logfiles
$ hadoop fs -mkdir /reports/marketing
Copying Data To and From HDFS
• HDFS is distinct from your local filesystem
• hadoop fs -put copies local files to HDFS
• hadoop fs -get fetches a local copy of a file from HDFS
$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt

[Diagram: files move between the client machine and the Hadoop cluster]
HDFS Blocks
• Files added to HDFS are split into fixed-size blocks
• Block size is configurable, but defaults to 64 megabytes
• Each block is then replicated to three nodes (by default)...
150 MB input file:
Block #1 (64 MB)
Block #2 (64 MB)
Block #3 (remaining 22 MB)
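The split arithmetic above can be sketched in plain Java. This is only an illustration of how a file size maps onto fixed-size blocks (the class and method names are invented; no Hadoop code is involved):

```java
// Sketch of how HDFS divides a file into fixed-size blocks.
// All sizes are in megabytes; 64 MB is the default mentioned above.
public class BlockSplit {

    // Returns the size of each block for a file of the given size.
    static long[] blockSizes(long fileSizeMb, long blockSizeMb) {
        int fullBlocks = (int) (fileSizeMb / blockSizeMb);
        long remainder = fileSizeMb % blockSizeMb;
        int total = fullBlocks + (remainder > 0 ? 1 : 0);
        long[] blocks = new long[total];
        for (int i = 0; i < fullBlocks; i++) {
            blocks[i] = blockSizeMb;
        }
        if (remainder > 0) {
            blocks[total - 1] = remainder;  // the final, partial block
        }
        return blocks;
    }

    public static void main(String[] args) {
        long[] blocks = blockSizes(150, 64);
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block #" + (i + 1) + ": " + blocks[i] + " MB");
        }
        // Prints blocks of 64 MB, 64 MB, and 22 MB, matching the slide.
    }
}
```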
HDFS Replication (cont’d)

[Diagram, shown in stages: blocks #1, #2, and #3 are each replicated to three of the five nodes A through E]
HDFS Replication (cont’d)
• Copies of the block remain even if a node fails
• Data will be re-replicated to other nodes automatically
• Replication improves both reliability and performance
[Diagram: the failed node held blocks #1 and #3; both blocks remain available on the surviving nodes]
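The availability argument can be sketched in plain Java. The node letters and the block placement below are invented to mirror the diagram, not taken from a real cluster:

```java
import java.util.*;

// Sketch of the replication idea: each block lives on several nodes,
// so losing any single node loses no block.
public class ReplicationDemo {

    // Block name -> set of nodes currently holding a replica
    // (hypothetical placement with three replicas per block).
    static Map<String, Set<String>> placement() {
        Map<String, Set<String>> p = new HashMap<>();
        p.put("block1", new HashSet<>(Arrays.asList("A", "C", "E")));
        p.put("block2", new HashSet<>(Arrays.asList("B", "C", "D")));
        p.put("block3", new HashSet<>(Arrays.asList("A", "D", "E")));
        return p;
    }

    // Simulate a node failure and report whether every block survives.
    static boolean allBlocksAvailableAfterFailure(Map<String, Set<String>> placement,
                                                  String failedNode) {
        for (Set<String> nodes : placement.values()) {
            nodes.remove(failedNode);
            if (nodes.isEmpty()) {
                return false;  // every replica of some block was lost
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Node A held blocks #1 and #3; replicas on C, D, and E remain.
        System.out.println(allBlocksAvailableAfterFailure(placement(), "A"));
        // Prints: true
    }
}
```

In a real cluster the NameNode would additionally re-replicate the under-replicated blocks to restore the replication factor.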
A Scalable Data Processing Framework
Data Processing with MapReduce
What is MapReduce?
• MapReduce is a programming model
• It’s a way of processing data
• You can implement MapReduce in any language
Understanding Map and Reduce
• You supply two functions to process data: Map and Reduce
• Each piece is simple, but can be powerful when combined

• Map: typically used to transform, parse, or filter data
• The map function always runs first

• Reduce: typically used to summarize results from the map function
• The reduce function is optional
MapReduce Benefits
• Scalability
• Hadoop divides the processing job into individual tasks
• Tasks execute in parallel (independently) across the cluster
• Hadoop provides job scheduling and other infrastructure

• Simplicity
• Processes one record at a time
• No explicit I/O
• Far simpler for developers than typical distributed computing
MapReduce in Hadoop
• MapReduce processing in Hadoop is batch-oriented
• High throughput, but also high latency

• A MapReduce job is broken down into smaller tasks
• A task runs either the map function or reduce function
• Each processes a subset of the overall input data

• MapReduce code for Hadoop is usually written in Java
• Can do basic MapReduce in other languages too
Visual Overview (1 - Map Phase)

Input Data for Job:
Soft kitty, warm kitty, little ball of fur. Happy kitty, Sleepy kitty, Purr Purr Purr.

Input Split #1: "Soft kitty, warm kitty, little ball of fur."
Input Split #2: "Happy kitty, Sleepy kitty, Purr Purr Purr."

Map Task #1 output: (soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
Map Task #2 output: (happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)
Visual Overview (2 - Shuffle and Sort Phase)

Map output:
(soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
(happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)

Keys are shuffled, sorted, and grouped:
(ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
(happy, [1]) (kitty, [1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])
Visual Overview (3 - Reduce Phase)

Reduce Task #1 input: (ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
Reduce Task #2 input: (happy, [1]) (kitty, [1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])

Final Output:
ball 1, little 1, of 1, purr 3, sleepy 1
happy 1, kitty 4, fur 1, soft 1, warm 1
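The whole three-phase flow can be simulated with ordinary Java collections, with no Hadoop involved. This is only a sketch of the data flow (class and method names are invented); note that for this exact input the word "kitty" occurs four times and "purr" three times:

```java
import java.util.*;

// A plain-Java simulation of the word-count flow:
// map emits (word, 1), shuffle-and-sort groups by key, reduce sums.
public class WordCountSim {

    // Map phase: split each line into words, emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle and sort: group values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values for each key.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Soft kitty, warm kitty, little ball of fur.",
                "Happy kitty, Sleepy kitty, Purr Purr Purr.");
        System.out.println(reduce(shuffle(map(input))));
        // kitty maps to 4 and purr to 3; every other word maps to 1
    }
}
```

On a real cluster the same three steps run across many machines, with the framework handling the shuffle between map and reduce tasks.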
Can Counting Words Really Be Useful?
• Log file analysis is a common task in Hadoop
• Relational databases aren't good at this
• Not good at parsing semi-structured formats
• Content does not fit into the relational model
• Cannot handle large volumes of data at low cost

• Application logs are a valuable source of data
• Let's see how we could summarize log events in MapReduce
Map Phase
• The input to our job is in a typical Log4J format
• The map function parses each line to extract the level
• It then emits the level and a literal 1

2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
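The parsing step itself is plain string handling, independent of any Hadoop types. A minimal sketch (the class name is invented for illustration):

```java
// Sketch of the parsing step the map function performs: given a
// Log4J-style line, pull out the fourth whitespace-delimited field
// (the log level) so it can be emitted with a count of 1.
public class LogLevelParser {

    // Returns the level field of a line like:
    // 2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
    static String parseLevel(String line) {
        String[] fields = line.split(" ");
        return fields[3];
    }

    public static void main(String[] args) {
        String line = "2014-02-12 22:16:54.276 CDT WARN \"Uh oh. This seems bad\"";
        // Emit the level together with a literal 1, as the mapper does.
        System.out.println(parseLevel(line) + " 1");
        // Prints: WARN 1
    }
}
```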
Java Code for Map Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: input key and value types, then
// output key and value types
public class LogEventParseMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
Java Code for Map Function (2)

    private enum Level { TRACE, DEBUG, INFO, WARN, ERROR, FATAL };

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

        // Convert the Text value to a String and split into fields
        String line = value.toString();
        String[] fields = line.split(" ");
        String levelField = fields[3];

        for (Level level : Level.values()) {
            String levelName = level.name();
            if (levelName.equalsIgnoreCase(levelField)) {
                // Emit the key (level) and value (a literal 1)
                context.write(new Text(levelName), new IntWritable(1));
            }
        }
    }
}
The “Shuffle and Sort”
• Remember, this phase is automatic
• The result is passed as input to the reduce function

Map Output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce Input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]
Reduce Phase
• The reduce function can iterate over all values for a key
• It simply sums them up to produce the final results

Reduce input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]

Final output:
ERROR 1
INFO 4
WARN 2
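Since every map output value is a literal 1, grouping then summing is equivalent to counting occurrences per level. That can be sketched locally in plain Java (the class name is invented; no Hadoop involved):

```java
import java.util.*;

// Local sketch of shuffle-and-sort plus reduce for the log example:
// group the map output keys by level, then sum each group.
public class LevelCounter {

    static SortedMap<String, Integer> countLevels(List<String> mapOutputKeys) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (String level : mapOutputKeys) {
            // Each map output pair was (level, 1); summing the 1s per
            // group is the same as counting occurrences of each level.
            totals.merge(level, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList(
                "INFO", "INFO", "WARN", "INFO", "WARN", "INFO", "ERROR");
        System.out.println(countLevels(mapOutput));
        // Prints: {ERROR=1, INFO=4, WARN=2}
    }
}
```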
Java Code for Reduce Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: input key and value types, then
// output key and value types
public class LogEventSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
Java Code for Reduce Function (2)

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Our output is the event type (key) and the sum (value)
        context.write(key, new IntWritable(sum));
    }
}
Complete Data Flow Diagram
Map input:
2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Shuffle & Sort output / Reduce input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]

Reduce output (final):
ERROR 1
INFO 4
WARN 2
Java Code for Driver Function (1)

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class Driver {

    public static void main(String[] args) throws Exception {
Java Code for Driver Function (2)

        // Validate command-line arguments (we require the user
        // to specify the HDFS paths to use for the job; see below)
        if (args.length != 2) {
            System.out.printf("Usage: Driver <input dir> <output dir>\n");
            System.exit(-1);
        }

        // Instantiate a Job object for our job's configuration
        Job job = new Job();

        // Configure input and output paths based on supplied arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
Java Code for Driver Function (3)

        // Tell Hadoop to copy the JAR containing this class
        // to cluster nodes, as required to run this job
        job.setJarByClass(Driver.class);

        // Give the job a descriptive name. This is optional, but
        // helps us identify this job on a busy cluster
        job.setJobName("Log Event Counter Driver");

        // Specify which classes to use for the Mapper and Reducer
        job.setMapperClass(LogEventParseMapper.class);
        job.setReducerClass(LogEventSumReducer.class);
Java Code for Driver Function (4)

        // Specify the job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Start the MapReduce job and wait for it to finish.
        // If it finishes successfully, exit with 0; otherwise 1.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
Packaging, Deployment, and Execution

1. The mapper, reducer, and driver are compiled
2. The resulting classes are packaged into a JAR
3. Execute using the hadoop jar command

$ hadoop jar logsummary.jar \
    com.cloudera.example.Driver \
    /data/logs \
    /data/logsummary
Open Source Tools that Complement Hadoop
The Hadoop Ecosystem
The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
• These are the kernel of a much broader platform
• Hadoop has many related projects • Some help you integrate Hadoop with other systems • Others help you analyze your data
• These are not considered “core Hadoop” • Rather, they’re part of the Hadoop ecosystem • Many are also open source Apache projects
Apache Sqoop
• Sqoop exchanges data between RDBMS and Hadoop
• Can import an entire database, a single table, or a table subset into HDFS
• Does this very efficiently via a map-only MapReduce job
• Can also export data from HDFS back to the database

[Diagram: data flows between the database and the Hadoop cluster]
Apache Flume
• Flume imports data into HDFS as it is being generated by various sources, including:
• Program output
• UNIX syslog
• Log files
• Custom sources
• And many more...
Apache Mahout
• Mahout is a scalable machine learning library
• Support for several categories of problems, including
• Classification
• Clustering
• Collaborative filtering
• Many algorithms implemented in MapReduce
• Can parallelize inexpensively with Hadoop
Apache Pig
• Pig offers high-level data processing on Hadoop
• An alternative to writing low-level MapReduce code

people = LOAD '/data/customers' AS (cust_id, name);
orders = LOAD '/data/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

• Pig turns this into MapReduce jobs that run on Hadoop
Apache Hive
• Hive is another abstraction on top of MapReduce
• Like Pig, it also reduces development time
• Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;
Cloudera Impala
• Massively parallel SQL engine that runs on a Hadoop cluster • Inspired by Google’s Dremel project • Can query data stored in HDFS or HBase tables
• High performance / low latency
• Typically 10x-50x faster than Pig or Hive
• Query syntax is virtually identical to HiveQL / SQL

• Impala is 100% open source (Apache-licensed)
Visual Overview of a Complete Workflow

[Diagram: a Hadoop cluster running Impala, surrounded by these activities:]
• Import transaction data from an RDBMS
• Sessionize Web log data with Pig
• Sentiment analysis on social media with Hive
• Generate nightly reports using Pig, Hive, or Impala
• Build product recommendations for the Web site
• Analyst uses Impala for business intelligence
Conclusion
Key Points
• Hadoop supports large-scale data storage and processing
• Heavily influenced by Google's architectural papers
• Already in production at thousands of companies
• HDFS is Hadoop's storage layer
• MapReduce is Hadoop's processing framework

• Many ecosystem projects complement Hadoop
• Some help you to integrate Hadoop with existing systems
• Others help you analyze the data you’ve stored
Questions?

• Thank you for attending!
• I’ll be happy to answer any additional questions now…

• Want to learn even more?
• Cloudera training: developers, analysts, sysadmins, and more
• Offered in more than 50 cities worldwide, and online too!
• See http://university.cloudera.com/ for more info
Highly Recommended Book
Hadoop: The Definitive Guide
Author: Tom White
ISBN: 1-449-31152-0