Post on 19-Aug-2014
© 2010 – 2015 Cloudera, Inc. All Rights Reserved
Introduction to Apache Hadoop and its Ecosystem
Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
github.com/markgrover/hadoop-intro-fast
About Me
• Committer on Apache Bigtop; committer and PPMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume
• Software developer at Cloudera
• @mark_grover
• www.linkedin.com/in/grovermark
Co-author of an O’Reilly book
• @hadooparchbook
• hadooparchitecturebook.com
• To be released early 2015
About the Presentation…
• What’s ahead
  • Fundamental Concepts
  • HDFS: The Hadoop Distributed File System
  • Data Processing with MapReduce
  • Demo
  • Conclusion + Q&A
Fundamental Concepts Why the World Needs Hadoop
What’s the craze about Hadoop?
• Volume
  • More and more data is being generated
  • Machine-generated data is increasing
• Velocity
  • Data is coming in at higher speed
• Variety
  • Audio, video, images, log files, web pages, social network connections, etc.
We Need a System that Scales
• Too much data for traditional tools
• Two key problems:
  • How do we reliably store this data at a reasonable cost?
  • How do we process all the data we’ve stored?
What is Apache Hadoop?
• Scalable data storage and processing
• Distributed and fault-tolerant
• Runs on standard hardware
• Two main components • Storage: Hadoop Distributed File System (HDFS) • Processing: MapReduce
• Hadoop clusters are composed of computers called nodes • Clusters range from a single node up to several thousand nodes
How Did Apache Hadoop Originate?
• Heavily influenced by Google’s architecture • Notably, the Google Filesystem and MapReduce papers
• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook, and others
Timeline:
• 2002: Nutch spun off from Lucene
• 2003: Google publishes GFS paper
• 2004: Google publishes MapReduce paper
• 2005: Nutch rewritten for MapReduce
• 2006: Hadoop becomes Lucene subproject
Comparing Hadoop to Other Systems
• Monolithic systems don’t scale
• Modern high-performance computing (HPC) systems are distributed
  • They spread computations across many machines in parallel
  • Widely used for scientific applications
• Let’s examine how a typical HPC system works
Architecture of a Typical HPC System

[Diagram: compute nodes connected to a storage system over a fast network]

• Step 1: Copy input data
• Step 2: Process the data
• Step 3: Copy output data
You Don’t Just Need Speed…
• The problem is that we have way more data than code
$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
You Need Speed At Scale
[Diagram: the fast network between the storage system and the compute nodes becomes the bottleneck]
Hadoop Design Fundamental: Data Locality
• This is a hallmark of Hadoop’s design
  • Don’t bring the data to the computation
  • Bring the computation to the data
• Hadoop uses the same machines for storage and processing
  • Significantly reduces the need to transfer data across the network
Other Hadoop Design Fundamentals
• Machine failure is unavoidable – embrace it • Build reliability into the system
• “More” is usually better than “faster”
  • Throughput matters more than latency
The Hadoop Distributed Filesystem
HDFS
HDFS: Hadoop Distributed File System
• Inspired by the Google File System • Reliable, low-‐cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
  • Hierarchical
  • UNIX-style paths (e.g., /sales/alice.txt)
  • UNIX-style file ownership and permissions
HDFS: Hadoop Distributed File System
• There are also some major deviations from UNIX filesystems
  • Highly optimized for processing data with MapReduce
• Designed for sequential access to large files
  • Cannot modify file content once written
• It’s actually a user-space Java process
  • Accessed using special commands or APIs
• No concept of a current working directory
Copying Local Data To and From HDFS
• Remember that HDFS is distinct from your local filesystem
  • hadoop fs -put copies local files to HDFS
  • hadoop fs -get fetches a local copy of a file from HDFS
On the client machine:

$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt

[Diagram: files moving between the client machine and the Hadoop cluster]
HDFS Demo
• I will now demonstrate the following:
  1. How to list the contents of a directory
  2. How to create a directory in HDFS
  3. How to copy a local file to HDFS
  4. How to display the contents of a file in HDFS
  5. How to remove a file from HDFS
A Scalable Data Processing Framework
Data Processing with MapReduce
What is MapReduce?
• MapReduce is a programming model • It’s a way of processing data • You can implement MapReduce in any language
Understanding Map and Reduce
• You supply two functions to process data: Map and Reduce
  • Map: typically used to transform, parse, or filter data
  • Reduce: typically used to summarize results
• The Map function always runs first
• The Reduce function runs afterwards, but is optional
• Each piece is simple, but can be powerful when combined
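The map/shuffle/reduce pipeline above can be sketched in plain, single-machine Python. The function names (`map_fn`, `reduce_fn`) are illustrative only, not part of any Hadoop API:

```python
# A minimal, single-machine sketch of the MapReduce model.
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: transform one input record into zero or more (key, value) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: summarize all values that share a key
    return (key, sum(values))

records = ["hadoop stores data", "hadoop processes data"]

# Map phase
pairs = [kv for record in records for kv in map_fn(record)]
# Shuffle and sort: order the pairs by key so equal keys are adjacent
pairs.sort(key=itemgetter(0))
# Reduce phase: one reduce call per distinct key
results = [reduce_fn(key, (v for _, v in group))
           for key, group in groupby(pairs, key=itemgetter(0))]
print(results)  # [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

Hadoop performs the same three phases, but distributes the map and reduce calls across the cluster.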
MapReduce Benefits
• Scalability
  • Hadoop divides the processing job into individual tasks
  • Tasks execute in parallel (independently) across the cluster
• Simplicity
  • Processes one record at a time
• Ease of use
  • Hadoop provides job scheduling and other infrastructure
  • Far simpler for developers than typical distributed computing
MapReduce in Hadoop
• MapReduce processing in Hadoop is batch-oriented
  • A MapReduce job is broken down into smaller tasks
• Tasks run concurrently
  • Each processes a small amount of the overall input
• MapReduce code for Hadoop is usually written in Java
  • This uses Hadoop’s API directly
• You can do basic MapReduce in other languages
  • Using the Hadoop Streaming wrapper program
  • Some advanced features require Java code
MapReduce Example in Python
• The following example uses Python • Via Hadoop Streaming
• It processes log files and summarizes events by type • I’ll explain both the data flow and the code
Job Input
• Here’s the job input
• Each map task gets a chunk of this data to process • Typically corresponds to a single block in HDFS
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
Python Code for Map Function

#!/usr/bin/env python
import sys

# Define list of known log levels
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Read records from standard input; use whitespace to split into fields
for line in sys.stdin:
    fields = line.split()
    # Extract the "level" field and convert to uppercase for consistency
    level = fields[3].upper()
    # If it matches a known level, print it, a tab separator, and the
    # literal value 1 (since the level can only occur once per line)
    if level in levels:
        print("%s\t1" % level)
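The mapper’s per-line logic can be checked without a cluster by wrapping it in a plain function and calling it directly (a local sketch; this is not how Hadoop invokes a streaming mapper):

```python
# Local check of the mapper logic, re-implemented as a plain function.
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

def map_line(line):
    # Split on whitespace; field 3 is the log level in this record format
    fields = line.split()
    level = fields[3].upper()
    if level in levels:
        return "%s\t1" % level  # key, tab separator, literal count of 1
    return None

sample = '2013-06-29 22:16:54.276 CDT WARN "This seems bad"'
print(map_line(sample))  # prints "WARN", a tab, then "1"
```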
Output of Map Function
• The map function produces key/value pairs as output

INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
The “Shuffle and Sort”
• Hadoop automatically merges, sorts, and groups map output
• The result is passed as input to the reduce function
• More on this later…

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce input (after shuffle and sort):
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
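On a single machine, the effect of shuffle-and-sort on the map output can be mimicked with an ordinary sort (a simplification: a real cluster also partitions pairs across multiple reducers):

```python
# Map output, as (key, value) pairs
map_output = [("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
              ("WARN", 1), ("INFO", 1), ("ERROR", 1)]

# Hadoop merges and sorts by key before the reduce phase;
# locally, a plain sort produces the same key ordering
reduce_input = sorted(map_output)
print(reduce_input)
# [('ERROR', 1), ('INFO', 1), ('INFO', 1), ('INFO', 1), ('INFO', 1), ('WARN', 1), ('WARN', 1)]
```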
Input to Reduce Function
• The reduce function receives a key and all values for that key
• Keys are always passed to reducers in sorted order
• Although not obvious here, values are unordered

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
Python Code for Reduce Function

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = None
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    if key == previous_key:
        # If the key is unchanged, increment the count
        sum = sum + int(value)
    else:
        # If the key changed, print data for the old level
        if previous_key:
            print('%s\t%i' % (previous_key, sum))
        # Start tracking data for the new record
        previous_key = key
        sum = 1

# Print data for the final key
print('%s\t%i' % (previous_key, sum))
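The reducer’s running-count logic can likewise be exercised locally on the sorted pairs. The helper name `reduce_pairs` is illustrative, not Hadoop’s reducer interface:

```python
# Local sketch of the reducer: sum consecutive values that share a key.
# Assumes the input pairs are already sorted by key, as Hadoop guarantees.
def reduce_pairs(pairs):
    results = []
    previous_key, total = None, 0
    for key, value in pairs:
        if key == previous_key:
            # Key unchanged: increment the running count
            total += value
        else:
            # Key changed: emit the finished count, start a new one
            if previous_key is not None:
                results.append((previous_key, total))
            previous_key, total = key, value
    # Emit the count for the final key
    if previous_key is not None:
        results.append((previous_key, total))
    return results

reduce_input = [("ERROR", 1), ("INFO", 1), ("INFO", 1), ("INFO", 1),
                ("INFO", 1), ("WARN", 1), ("WARN", 1)]
print(reduce_pairs(reduce_input))  # [('ERROR', 1), ('INFO', 4), ('WARN', 2)]
```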
Output of Reduce Function
• Its output is a sum for each level

ERROR 1
INFO 4
WARN 2
Recap of Data Flow

Map input:
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output:
INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

Reduce input (after shuffle and sort):
ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1

Reduce output:
ERROR 1, INFO 4, WARN 2
How to Run a Hadoop Streaming Job
• I’ll demonstrate this now…
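A common way to develop a streaming job is to dry-run it locally, letting `sort` stand in for Hadoop’s shuffle-and-sort. The file names below are illustrative; on a cluster, the same scripts would be passed to the Hadoop Streaming jar via its -mapper and -reducer options:

```shell
# Recreate the sample log input from the earlier slides
cat > events.log <<'EOF'
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
EOF

# Mapper from the slides: emit "<LEVEL>\t1" for each known log level
cat > mapper.py <<'EOF'
#!/usr/bin/env python
import sys
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']
for line in sys.stdin:
    fields = line.split()
    level = fields[3].upper()
    if level in levels:
        print("%s\t1" % level)
EOF

# Reducer from the slides: sum the counts for each level
cat > reducer.py <<'EOF'
#!/usr/bin/env python
import sys
previous_key = None
sum = 0
for line in sys.stdin:
    key, value = line.split()
    if key == previous_key:
        sum = sum + int(value)
    else:
        if previous_key:
            print('%s\t%i' % (previous_key, sum))
        previous_key = key
        sum = 1
print('%s\t%i' % (previous_key, sum))
EOF

# Local dry run: sort plays the role of Hadoop's shuffle-and-sort
cat events.log | python3 mapper.py | sort | python3 reducer.py
```

The output is the same ERROR/INFO/WARN summary produced by the cluster job, which makes this a quick way to debug mapper and reducer logic before submitting.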
Open Source Tools that Complement Hadoop
The Hadoop Ecosystem
The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects • Some help you integrate Hadoop with other systems • Others help you analyze your data
• These are not considered “core Hadoop” • Rather, they’re part of the Hadoop ecosystem • Many are also open source Apache projects
Visual Overview of a Complete Workflow

[Diagram: a Hadoop cluster with Impala at the center, surrounded by these tasks]
• Import transaction data from an RDBMS
• Sessionize web log data with Pig
• Perform sentiment analysis on social media with Hive
• Analysts use Impala for business intelligence
• Generate nightly reports using Pig, Hive, or Impala
• Build product recommendations for the Web site
Key Points
• We’re generating massive volumes of data
  • This data can be extremely valuable
  • Companies can now analyze what they previously discarded
• Hadoop supports large-scale data storage and processing
  • Heavily influenced by Google's architecture
  • Already in production by thousands of organizations
  • HDFS is Hadoop's storage layer
  • MapReduce is Hadoop's processing framework
• Many ecosystem projects complement Hadoop
  • Some help you to integrate Hadoop with existing systems
  • Others help you analyze the data you’ve stored
Highly Recommended Books
Hadoop: The Definitive Guide
Author: Tom White
ISBN: 1-449-31152-0

Hadoop Operations
Author: Eric Sammer
ISBN: 1-449-32705-2
Questions?
• Thank you for attending!
• I’ll be happy to answer any additional questions now…
• Demo and slides at github.com/markgrover/hadoop-intro-fast
• Twitter: mark_grover
• Survey page: tiny.cloudera.com/mark