Tom Wheeler: Introduction to Apache Hadoop (St. Louis Java ...files.meetup.com/1566558/2014-02...
Introduction to Apache Hadoop
Tom Wheeler | St. Louis Java User Group
About the Presenter…
Tom Wheeler, Software Engineer | Greater St. Louis Area | Information Technology and Services

Current:
• Senior Curriculum Developer at Cloudera

Past:
• Principal Software Engineer at Object Computing, Inc. (Boeing)
• Software Engineer, Level IV at WebMD
• Senior Software Engineer at Teralogix
• Senior Programmer/Analyst at A.G. Edwards and Sons, Inc.

Distant Past:
• Contract Web development, full-time instructor, worked at an ISP
About Cloudera

• Helps companies profit from their data
• Founded in 2008 to commercialize Hadoop
• Staff includes committers to Hadoop and related projects
• Active contributors to the Apache Software Foundation

• Our mission is to help organizations profit from their data
• Software (CDH distribution and Cloudera Manager)
• Consulting and support services
• Training and certification
About the Presentation…
• What’s ahead
• Fundamental Concepts
• HDFS: The Hadoop Distributed File System
• Data Processing with MapReduce
• The Hadoop Ecosystem
• Conclusion + Q&A
Fundamental Concepts
Why the World Needs Hadoop
Volume
• Every day… • More than 1.5 billion shares are traded on the NYSE
• Every minute… • TransUnion makes nearly 70,000 updates to credit files
• Every second… • Banks process more than 10,000 credit card transactions
Velocity

• We are generating data faster than ever
• People are increasingly interacting online
• Processes are increasingly automated
• Systems are increasingly interconnected

Variety
• We’re producing a variety of data, including
• Audio
• Video
• Images
• Log files
• Web pages
• Product ratings
• Social network connections
• Not all of this maps cleanly to the relational model
Big Data Can Mean Big Opportunity
• One tweet is an anecdote • But a million tweets may signal important trends
• One person’s product review is an opinion • But a million reviews might reveal a design flaw
• One person’s diagnosis is an isolated case • But a million medical records could lead to a cure
We Just Need a System that Scales
• Too much data for traditional tools
• Two key problems
• How to reliably store this data at a reasonable cost
• How do we process all the data we’ve stored
What is Apache Hadoop?
• Scalable data storage and processing
• Distributed and fault-tolerant
• Runs on standard hardware
• Two main components
• Storage: Hadoop Distributed File System (HDFS)
• Processing: MapReduce
• Hadoop clusters are composed of computers called nodes • Clusters range in size from one to thousands of nodes
How Did Apache Hadoop Originate?
• Spun off from an open source search engine project • Papers published by Google strongly influenced its design
• Other Web companies quickly saw the benefits
• Early adoption by Yahoo, Facebook and others
2002: Nutch spun off from Lucene
2003: Google publishes GFS paper
2004: Google publishes MapReduce paper
2005: Nutch rewritten for MapReduce
2006: Hadoop becomes Lucene subproject
2007: 1,000-node cluster at Yahoo
How Are Organizations Using Hadoop?
• Let’s look at a few common uses…
Hadoop Use Case: Customer Analytics
Example Inc. Public Web Site (February 9 - 15)

Category     Unique Visitors  Page Views  Bounce Rate  Conversion Rate  Average Time on Page
Television   1,967,345        8,439,206   23%          51%              17 seconds
Smartphone   982,384          3,185,749   47%          41%              23 seconds
MP3 Player   671,820          2,174,913   61%          12%              42 seconds
Stereo       472,418          1,627,843   74%          19%              26 seconds
Monitor      327,018          1,241,837   56%          17%              37 seconds
Tablet       217,328          816,545     48%          28%              53 seconds
Printer      127,124          535,261     27%          64%              34 seconds
Hadoop Use Case: Sentiment Analysis
[Chart: References to Product in Social Media (Hourly); hourly mention counts ranging from 10,000 to 80,000 across hours 00 through 23, broken down into Negative, Neutral, and Positive]
Hadoop Use Case: Product Recommendations
Tom, we suggest the following...
• Grand Theft Auto - V: $54.99 (7,268 ratings)
• Death of a Decade (CD): $11.99 (372 ratings)
• St. Louis Cardinals Hat: $15.99 (15,916 ratings)
Hadoop Use Case: Fraud Detection
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox”

— Grace Hopper, early advocate of distributed computing
Comparing Hadoop to Other Systems
• Monolithic systems don’t scale
• Modern high-performance computing systems are distributed
• They spread computations across many machines in parallel
• Widely used for scientific applications
• Let’s examine how a typical HPC system works
Architecture of a Typical HPC System

An HPC cluster connects dedicated compute nodes to a shared storage system over a fast network. A typical job runs in three steps:

Step 1: Copy input data from the storage system to the compute nodes
Step 2: Process the data on the compute nodes
Step 3: Copy output data back to the storage system
You Don’t Just Need Speed…
• The problem is that we have way more data than code
$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
You Need Speed At Scale
[Diagram: the fast network between the storage system and the compute nodes becomes the bottleneck]
Hadoop Design Fundamental: Data Locality
• This is a hallmark of Hadoop’s design
• Don’t bring the data to the computation
• Bring the computation to the data!

• Hadoop uses the same machines for storage and processing
• Significantly reduces need to transfer data across network
The Hadoop Distributed Filesystem
HDFS
HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Reliable, low-cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
• Hierarchical, with a root directory
• UNIX-style paths (e.g., /sales/west/accounts.txt)
• UNIX-style file ownership and permissions
HDFS: Hadoop Distributed File System
• There are also some major deviations from UNIX filesystems
• Designed for sequential access to large files
• Cannot modify file content once written
• HDFS is implemented as a user-space Java process
• Requires the use of special commands and APIs
Accessing HDFS via the Command Line
• Users typically access HDFS via the hadoop fs command
• Actions specified with subcommands (prefixed with a minus sign)
• Most are immediately familiar to UNIX users
$ hadoop fs -ls /sales
$ hadoop fs -cat /sales/billing_data.csv
$ hadoop fs -rm -r -f /webdata/logfiles
$ hadoop fs -mkdir /reports/marketing
Copying Data To and From HDFS
• HDFS is distinct from your local filesystem
• hadoop fs -put copies local files to HDFS
• hadoop fs -get fetches a local copy of a file from HDFS
$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt

[Diagram: files move between the client machine and the Hadoop cluster]
HDFS Blocks
• Files added to HDFS are split into fixed-size blocks
• Block size is configurable, but defaults to 64 megabytes
• Each block is then replicated to three nodes (by default)...
150 MB input file:
Block #1 (64 MB)
Block #2 (64 MB)
Block #3 (remaining 22 MB)
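The split arithmetic above can be sketched in plain Java. This is only an illustration of how a file size maps onto fixed-size blocks (the class and method names are invented; no Hadoop code is involved):

```java
// Sketch of how HDFS divides a file into fixed-size blocks.
// All sizes are in megabytes; 64 MB is the default mentioned above.
public class BlockSplit {

    // Returns the size of each block for a file of the given size.
    static long[] blockSizes(long fileSizeMb, long blockSizeMb) {
        int fullBlocks = (int) (fileSizeMb / blockSizeMb);
        long remainder = fileSizeMb % blockSizeMb;
        int total = fullBlocks + (remainder > 0 ? 1 : 0);
        long[] blocks = new long[total];
        for (int i = 0; i < fullBlocks; i++) {
            blocks[i] = blockSizeMb;
        }
        if (remainder > 0) {
            blocks[total - 1] = remainder;  // the final, partial block
        }
        return blocks;
    }

    public static void main(String[] args) {
        long[] blocks = blockSizes(150, 64);
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block #" + (i + 1) + ": " + blocks[i] + " MB");
        }
        // Prints blocks of 64 MB, 64 MB, and 22 MB, matching the slide.
    }
}
```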
HDFS Replication (cont’d)

[Diagram, shown in stages: blocks #1, #2, and #3 are each replicated to three of the five nodes A through E]
HDFS Replication (cont’d)
• Copies of the block remain even if a node fails
• Data will be re-replicated to other nodes automatically
• Replication improves both reliability and performance
[Diagram: the failed node held blocks #1 and #3; both blocks remain available on the surviving nodes]
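The availability argument can be sketched in plain Java. The node letters and the block placement below are invented to mirror the diagram, not taken from a real cluster:

```java
import java.util.*;

// Sketch of the replication idea: each block lives on several nodes,
// so losing any single node loses no block.
public class ReplicationDemo {

    // Block name -> set of nodes currently holding a replica
    // (hypothetical placement with three replicas per block).
    static Map<String, Set<String>> placement() {
        Map<String, Set<String>> p = new HashMap<>();
        p.put("block1", new HashSet<>(Arrays.asList("A", "C", "E")));
        p.put("block2", new HashSet<>(Arrays.asList("B", "C", "D")));
        p.put("block3", new HashSet<>(Arrays.asList("A", "D", "E")));
        return p;
    }

    // Simulate a node failure and report whether every block survives.
    static boolean allBlocksAvailableAfterFailure(Map<String, Set<String>> placement,
                                                  String failedNode) {
        for (Set<String> nodes : placement.values()) {
            nodes.remove(failedNode);
            if (nodes.isEmpty()) {
                return false;  // every replica of some block was lost
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Node A held blocks #1 and #3; replicas on C, D, and E remain.
        System.out.println(allBlocksAvailableAfterFailure(placement(), "A"));
        // Prints: true
    }
}
```

In a real cluster the NameNode would additionally re-replicate the under-replicated blocks to restore the replication factor.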
A Scalable Data Processing Framework
Data Processing with MapReduce
What is MapReduce?
• MapReduce is a programming model
• It’s a way of processing data
• You can implement MapReduce in any language
Understanding Map and Reduce
• You supply two functions to process data: Map and Reduce
• Each piece is simple, but can be powerful when combined

• Map: typically used to transform, parse, or filter data
• The map function always runs first

• Reduce: typically used to summarize results from the map function
• The reduce function is optional
MapReduce Benefits
• Scalability
• Hadoop divides the processing job into individual tasks
• Tasks execute in parallel (independently) across the cluster
• Hadoop provides job scheduling and other infrastructure

• Simplicity
• Processes one record at a time
• No explicit I/O
• Far simpler for developers than typical distributed computing
MapReduce in Hadoop
• MapReduce processing in Hadoop is batch-oriented
• High throughput, but also high latency

• A MapReduce job is broken down into smaller tasks
• A task runs either the map function or reduce function
• Each processes a subset of the overall input data

• MapReduce code for Hadoop is usually written in Java
• Can do basic MapReduce in other languages too
Visual Overview (1 - Map Phase)

Input Data for Job:
Soft kitty, warm kitty, little ball of fur. Happy kitty, Sleepy kitty, Purr Purr Purr.

Input Split #1: "Soft kitty, warm kitty, little ball of fur."
Input Split #2: "Happy kitty, Sleepy kitty, Purr Purr Purr."

Map Task #1 output: (soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
Map Task #2 output: (happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)
Visual Overview (2 - Shuffle and Sort Phase)

Map output:
(soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
(happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)

Keys are shuffled, sorted, and grouped:
(ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
(happy, [1]) (kitty, [1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])
Visual Overview (3 - Reduce Phase)

Reduce Task #1 input: (ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
Reduce Task #2 input: (happy, [1]) (kitty, [1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])

Final Output:
ball 1, little 1, of 1, purr 3, sleepy 1
happy 1, kitty 4, fur 1, soft 1, warm 1
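The whole three-phase flow can be simulated with ordinary Java collections, with no Hadoop involved. This is only a sketch of the data flow (class and method names are invented); note that for this exact input the word "kitty" occurs four times and "purr" three times:

```java
import java.util.*;

// A plain-Java simulation of the word-count flow:
// map emits (word, 1), shuffle-and-sort groups by key, reduce sums.
public class WordCountSim {

    // Map phase: split each line into words, emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle and sort: group values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values for each key.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Soft kitty, warm kitty, little ball of fur.",
                "Happy kitty, Sleepy kitty, Purr Purr Purr.");
        System.out.println(reduce(shuffle(map(input))));
        // kitty maps to 4 and purr to 3; every other word maps to 1
    }
}
```

On a real cluster the same three steps run across many machines, with the framework handling the shuffle between map and reduce tasks.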
Can Counting Words Really Be Useful?
• Log file analysis is a common task in Hadoop
• Relational databases aren't good at this
• Not good at parsing semi-structured formats
• Content does not fit into the relational model
• Cannot handle large volumes of data at low cost

• Application logs are a valuable source of data
• Let's see how we could summarize log events in MapReduce
Map Phase
• The input to our job is in a typical Log4J format
• The map function parses each line to extract the level
• It then emits the level and a literal 1

2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
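The parsing step itself is plain string handling, independent of any Hadoop types. A minimal sketch (the class name is invented for illustration):

```java
// Sketch of the parsing step the map function performs: given a
// Log4J-style line, pull out the fourth whitespace-delimited field
// (the log level) so it can be emitted with a count of 1.
public class LogLevelParser {

    // Returns the level field of a line like:
    // 2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
    static String parseLevel(String line) {
        String[] fields = line.split(" ");
        return fields[3];
    }

    public static void main(String[] args) {
        String line = "2014-02-12 22:16:54.276 CDT WARN \"Uh oh. This seems bad\"";
        // Emit the level together with a literal 1, as the mapper does.
        System.out.println(parseLevel(line) + " 1");
        // Prints: WARN 1
    }
}
```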
Java Code for Map Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Type parameters: input key and value types, then
// output key and value types
public class LogEventParseMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
Java Code for Map Function (2)

    private enum Level { TRACE, DEBUG, INFO, WARN, ERROR, FATAL };

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

        // Convert the Text value to a String and split into fields
        String line = value.toString();
        String[] fields = line.split(" ");
        String levelField = fields[3];

        for (Level level : Level.values()) {
            String levelName = level.name();
            if (levelName.equalsIgnoreCase(levelField)) {
                // Emit the key (level) and value (a literal 1)
                context.write(new Text(levelName), new IntWritable(1));
            }
        }
    }
}
The “Shuffle and Sort”
• Remember, this phase is automatic
• The result is passed as input to the reduce function

Map Output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce Input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]
Reduce Phase
• The reduce function can iterate over all values for a key
• It simply sums them up to produce the final results

Reduce input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]

Final output:
ERROR 1
INFO 4
WARN 2
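Since every map output value is a literal 1, grouping then summing is equivalent to counting occurrences per level. That can be sketched locally in plain Java (the class name is invented; no Hadoop involved):

```java
import java.util.*;

// Local sketch of shuffle-and-sort plus reduce for the log example:
// group the map output keys by level, then sum each group.
public class LevelCounter {

    static SortedMap<String, Integer> countLevels(List<String> mapOutputKeys) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (String level : mapOutputKeys) {
            // Each map output pair was (level, 1); summing the 1s per
            // group is the same as counting occurrences of each level.
            totals.merge(level, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList(
                "INFO", "INFO", "WARN", "INFO", "WARN", "INFO", "ERROR");
        System.out.println(countLevels(mapOutput));
        // Prints: {ERROR=1, INFO=4, WARN=2}
    }
}
```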
Java Code for Reduce Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: input key and value types, then
// output key and value types
public class LogEventSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
Java Code for Reduce Function (2)

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Our output is the event type (key) and the sum (value)
        context.write(key, new IntWritable(sum));
    }
}
Complete Data Flow Diagram
Map input:
2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Shuffle & Sort output / Reduce input:
ERROR [1]
INFO [1, 1, 1, 1]
WARN [1, 1]

Reduce output (final):
ERROR 1
INFO 4
WARN 2
Java Code for Driver Function (1)

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class Driver {

    public static void main(String[] args) throws Exception {
Java Code for Driver Function (2)

        // Validate command-line arguments (we require the user
        // to specify the HDFS paths to use for the job; see below)
        if (args.length != 2) {
            System.out.printf("Usage: Driver <input dir> <output dir>\n");
            System.exit(-1);
        }

        // Instantiate a Job object for our job's configuration
        Job job = new Job();

        // Configure input and output paths based on supplied arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
Java Code for Driver Function (3)

        // Tell Hadoop to copy the JAR containing this class
        // to cluster nodes, as required to run this job
        job.setJarByClass(Driver.class);

        // Give the job a descriptive name. This is optional, but
        // helps us identify this job on a busy cluster
        job.setJobName("Log Event Counter Driver");

        // Specify which classes to use for the Mapper and Reducer
        job.setMapperClass(LogEventParseMapper.class);
        job.setReducerClass(LogEventSumReducer.class);
Java Code for Driver Function (4)

        // Specify the job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Start the MapReduce job and wait for it to finish.
        // If it finishes successfully, exit with 0; otherwise 1.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
Packaging, Deployment, and Execution

1. The mapper, reducer, and driver are compiled
2. The resulting classes are packaged into a JAR
3. Execute using the hadoop jar command

$ hadoop jar logsummary.jar \
    com.cloudera.example.Driver \
    /data/logs \
    /data/logsummary
Open Source Tools that Complement Hadoop
The Hadoop Ecosystem
The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
• These are the kernel of a much broader platform
• Hadoop has many related projects • Some help you integrate Hadoop with other systems • Others help you analyze your data
• These are not considered “core Hadoop” • Rather, they’re part of the Hadoop ecosystem • Many are also open source Apache projects
Apache Sqoop
• Sqoop exchanges data between RDBMS and Hadoop
• Can import an entire database, a single table, or a table subset into HDFS
• Does this very efficiently via a map-only MapReduce job
• Can also export data from HDFS back to the database

[Diagram: data flows between the database and the Hadoop cluster]
Apache Flume
• Flume imports data into HDFS as it is being generated by various sources, including:
• Program output
• UNIX syslog
• Log files
• Custom sources
• And many more...
Apache Mahout
• Mahout is a scalable machine learning library
• Support for several categories of problems, including
• Classification
• Clustering
• Collaborative filtering
• Many algorithms implemented in MapReduce
• Can parallelize inexpensively with Hadoop
Apache Pig
• Pig offers high-level data processing on Hadoop
• An alternative to writing low-level MapReduce code

people = LOAD '/data/customers' AS (cust_id, name);
orders = LOAD '/data/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

• Pig turns this into MapReduce jobs that run on Hadoop
Apache Hive
• Hive is another abstraction on top of MapReduce
• Like Pig, it also reduces development time
• Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;
Cloudera Impala
• Massively parallel SQL engine that runs on a Hadoop cluster • Inspired by Google’s Dremel project • Can query data stored in HDFS or HBase tables
• High performance / low latency
• Typically 10x-50x faster than Pig or Hive
• Query syntax is virtually identical to HiveQL / SQL

• Impala is 100% open source (Apache-licensed)
Visual Overview of a Complete Workflow

[Diagram: a Hadoop cluster running Impala, surrounded by these activities:]
• Import transaction data from an RDBMS
• Sessionize Web log data with Pig
• Sentiment analysis on social media with Hive
• Generate nightly reports using Pig, Hive, or Impala
• Build product recommendations for the Web site
• Analyst uses Impala for business intelligence
Conclusion
Key Points
• Hadoop supports large-scale data storage and processing
• Heavily influenced by Google's architectural papers
• Already in production at thousands of companies
• HDFS is Hadoop's storage layer
• MapReduce is Hadoop's processing framework

• Many ecosystem projects complement Hadoop
• Some help you to integrate Hadoop with existing systems
• Others help you analyze the data you’ve stored
Questions?

• Thank you for attending!
• I’ll be happy to answer any additional questions now…

• Want to learn even more?
• Cloudera training: developers, analysts, sysadmins, and more
• Offered in more than 50 cities worldwide, and online too!
• See http://university.cloudera.com/ for more info
Highly Recommended Book
Hadoop: The Definitive Guide
Author: Tom White
ISBN: 1-449-31152-0