Hadoop
Scott Leberknight
Yahoo! "Search Assist"
Yahoo!
Baidu
eBay
New York Times
Rackspace
eHarmony
Powerset
http://wiki.apache.org/hadoop/PoweredBy
Notable Hadoop users...
Hadoop in the Real World...
Financial analysis
Recommendation systems
Natural LanguageProcessing (NLP)
Image/video processing
Log analysis
Data warehousing
Correlation engines
Market research/forecasting
Finance
Health & Life Sciences
Academic research
Government
Social networking
Telecommunications
History...
Originally built to support distribution for Nutch search engine
Created by Doug Cutting
Inspired by the Google GFS and MapReduce papers circa 2003-2004
Named after a stuffed elephant
OK, so what exactly is Hadoop?
An open source...
general-purpose framework for creating distributed applications that process huge amounts of data.
batch/offline oriented...
data & I/O intensive...
One definition of "huge"
25,000 machines
More than 10 clusters
3 petabytes of data (compressed, unreplicated)
700+ users
10,000+ jobs/week
Major Hadoop Components:
Distributed File System (HDFS)
Map/Reduce System
But first, what isn't Hadoop?
Hadoop is NOT:
...a relational database!
...an online transaction processing (OLTP) system!
...a structured data store of any kind!
Hadoop vs. Relational

Hadoop                      | Relational
Scale-out                   | Scale-up (*)
Key/value pairs             | Tables
Say how to process the data | Say what you want (SQL)
Offline/batch               | Online/real-time

(*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best
HDFS
(Hadoop Distributed File System)
Data is distributed and replicated over multiple machines
Designed for large files (where "large" means GB to TB)
Block oriented
Linux-style commands, e.g. ls, cp, mv, rm, etc.
File Block Mappings (held by the NameNode):
/user/aaron/data1.txt -> 1, 2, 3
/user/aaron/data2.txt -> 4, 5
/user/andrew/data3.txt -> 6, 7

[Diagram: the NameNode stores the file-to-block mappings; the blocks (1-7) live on the DataNodes, each block replicated across several nodes]
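The split between file metadata and block storage can be sketched in plain Java. This is a toy model for illustration, not the Hadoop API; all class and method names here are invented:

```java
import java.util.*;

// Toy model of HDFS metadata: the NameNode knows which blocks make up a
// file and which DataNodes hold each block; DataNodes only store blocks.
public class ToyNameNode {
    private final Map<String, List<Integer>> fileToBlocks = new HashMap<>();
    private final Map<Integer, List<String>> blockToNodes = new HashMap<>();

    public void addFile(String path, List<Integer> blocks) {
        fileToBlocks.put(path, blocks);
    }

    public void placeBlock(int block, List<String> dataNodes) {
        blockToNodes.put(block, dataNodes);
    }

    // To read a file, a client asks the NameNode for the DataNodes
    // holding each of its blocks, then reads the blocks directly.
    public List<List<String>> locate(String path) {
        List<List<String>> locations = new ArrayList<>();
        for (int block : fileToBlocks.get(path)) {
            locations.add(blockToNodes.get(block));
        }
        return locations;
    }
}
```

The point of the model: clients never stream file data through the NameNode; they only ask it where the blocks are.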
Self-healing & fault tolerant: rebalances files across the cluster when nodes fail
Scalable: just by adding new nodes!
Map/Reduce
Operate on key/value pairs
Mappers filter & transform input data
Reducers aggregate mapper output
Split input files (e.g. by HDFS blocks)
move code to data
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
Word Count
(the canonical Map/Reduce example)
the quick brown fox
jumped over
the lazy brown dog
map phase - inputs (K1, V1):
(0, "the quick brown fox")
(20, "jumped over")
(32, "the lazy brown dog")

map phase - outputs list(K2, V2):
("the", 1) ("quick", 1) ("brown", 1) ("fox", 1)
("jumped", 1) ("over", 1)
("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
reduce phase - inputs (K2, list(V2)):
("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1))
("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
reduce phase - outputs list(K3, V3):
("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1)
("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
WordCount in code...
public class SimpleWordCount extends Configured implements Tool {
  public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
  public int run(String[] args) throws Exception { ... }
  public static void main(String[] args) { ... }
}

public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer st = new StringTokenizer(value.toString());
    while (st.hasMoreTokens()) {
      word.set(st.nextToken());
      context.write(word, ONE);
    }
  }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable count = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    count.set(sum);
    context.write(key, count);
  }
}

public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  Job job = new Job(conf, "Counting Words");
  job.setJarByClass(SimpleWordCount.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(Reduce.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
  int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
  System.exit(result);
}
Map/Reduce Data Flow
(Image from Hadoop in Action...great book!)
Partitioning
Deciding which keys go to which reducer
Desire even distribution across reducers
Skewed data can overload a single reducer!
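Hadoop's default strategy assigns a key to a reducer by hashing it. A minimal plain-Java sketch of that modulo-of-hash idea (a toy standalone class, not the actual Partitioner API):

```java
// Sketch of hash partitioning: every occurrence of a given key maps to
// the same reducer, and hashing tends to spread distinct keys evenly.
public class ToyPartitioner {
    // Mask off the sign bit so the modulo result is never negative,
    // even when hashCode() returns a negative value.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Note that hashing spreads distinct keys, not values: if one key dominates the data, all of its pairs still land on a single reducer, which is exactly the skew problem above.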
Map/Reduce Partitioning & Shuffling
(Image from Hadoop in Action...great book!)
Combiner
a.k.a. "Local Reduce"
Effectively a reduce in the mappers
Reduces the amount of data (# of k/v pairs) shuffled

Shuffling WordCount
(looking at one mapper that sees the word "the" 1000 times)

without combiner: ("the", 1) x 1000 pairs shuffled
with combiner: ("the", 1000) x 1 pair shuffled
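The combiner's effect can be simulated in plain Java: pre-aggregating one mapper's output collapses the 1000 ("the", 1) pairs into a single ("the", 1000) pair before anything is shuffled (a toy illustration, not the Hadoop Combiner API):

```java
import java.util.*;

// Simulate a combiner: locally sum each key's counts within one mapper's
// output, so only one pair per distinct key is shuffled to the reducers.
public class ToyCombiner {
    public static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String key : mapperOutputKeys) {
            combined.merge(key, 1, Integer::sum);  // ("the", 1) x N -> ("the", N)
        }
        return combined;
    }
}
```

Because word-count reduction is just addition (commutative and associative), running it early in the mapper changes nothing about the final result, only the volume shuffled.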
Advanced Map/Reduce
Chaining Map/Reduce jobs
Hadoop Streaming
Joining data
Bloom filters
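Of the techniques above, a Bloom filter is the easiest to sketch: a compact bit set that answers "definitely not present" or "probably present", which lets a join pre-filter records without loading the other dataset. A toy version with two ad-hoc hash functions (real implementations tune the bit-array size and hash count for a target false-positive rate):

```java
import java.util.BitSet;

// Toy Bloom filter: no false negatives, occasional false positives.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    public ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String key) { return (key.hashCode() & Integer.MAX_VALUE) % size; }
    private int h2(String key) { return ((key.hashCode() * 31 + 7) & Integer.MAX_VALUE) % size; }

    public void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // false means "definitely never added"; true means "probably added"
    public boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }
}
```

In a map-side join, each mapper can load a small Bloom filter of the join keys and drop non-matching records immediately, instead of shuffling them.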
Architecture

HDFS: NameNode, SecondaryNameNode, DataNode
Map/Reduce: JobTracker, TaskTracker
[Diagram: a master node runs the NameNode, JobTracker, and SecondaryNameNode; each worker node (1..N) runs a DataNode and a TaskTracker, and each TaskTracker runs map and reduce tasks]
NameNode
Bookkeeper for HDFS
Single point of failure!
Should not store data or run jobs
Manages DataNodes
DataNode
Store actual file blocks on disk
Does not store entire files!
Report block info to NameNode
Receive instructions from NameNode
Secondary NameNode
Not a failover server for NameNode!
Snapshot of NameNode
Help minimize downtime/data loss if NameNode fails
JobTracker
Track map/reduce tasks
Partition tasks across HDFS cluster
Re-start failed tasks on different nodes
Speculative execution
TaskTracker
Track individual map & reduce tasks
Report progress to JobTracker
Monitoring/Debugging
distributed processing means distributed debugging
Logs
View task logs on machine where specific task was processed
(or via web UI)
$HADOOP_HOME/logs/userlogs on task tracker
Counters
Define one or more counters
Increment counters during map/reduce tasks
Counter values displayed in job tracker UI
IsolationRunner
Re-run failed tasks with original input data
Must set keep.failed.tasks.files to 'true'
Skipping Bad Records
Data may not always be clean
New data may have new interesting twists
Can you pre-process to filter & validate input?
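A common alternative to record skipping is to validate inside the mapper and simply drop (and count) malformed records. A plain-Java sketch of such a filter for the "citing,cited" records used later in this deck; the exact format check is an assumption for illustration:

```java
// Validate a "citing,cited" record: exactly two non-empty numeric fields.
public class RecordValidator {
    public static boolean isValid(String line) {
        String[] fields = line.split(",");
        if (fields.length != 2) return false;
        for (String f : fields) {
            // Reject header rows, quoted values, and other junk.
            if (f.isEmpty() || !f.chars().allMatch(Character::isDigit)) return false;
        }
        return true;
    }
}
```

Pairing a filter like this with a "bad records" counter makes dirty input visible in the job tracker UI instead of crashing tasks.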
Performance Tuning
Speculative execution(on by default)
Reduce amount of input data
Data compression
Use a Combiner
JVM Re-use(be careful)
Refactor code/algorithms
Managing
Hadoop
Lots of knobs
Needs active management
"Fair" scheduling
Trash can
Add/remove data nodes
Network topology/rack awareness
NameNode/SNN management
Permissions/quotas
Hive
Simulate structure for data stored in Hadoop
Query language analogous to SQL (Hive QL)
Translates queries into Map/Reduce job(s)...
...so not for real-time processing!
Queries:
Projection Joins (inner, outer, semi)
Grouping Aggregation
Sub-queries Multi-table insert
Customizable:
User-defined functions
Input/output formats with SerDe
Patent citation dataset
http://www.nber.org/patents
/user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
...
create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';

create table citation_histogram (num_citations int, count int)
stored as sequencefile;

insert overwrite table citation_histogram
select num_citations, count(num_citations)
from (select cited, count(cited) as num_citations
      from patent_citations
      group by cited) citation_counts
group by num_citations
order by num_citations;
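The nested Hive query does two passes: count citations per patent, then count how many patents share each citation count. The same logic in plain Java makes the two grouping steps explicit (a toy equivalent for illustration, not what Hive generates):

```java
import java.util.*;

// Two-level aggregation mirroring the Hive query: cited -> #citations,
// then #citations -> how many patents received that many citations.
public class CitationHistogram {
    public static SortedMap<Integer, Integer> histogram(List<String[]> citations) {
        // Step 1 (inner sub-query): count citations per cited patent.
        Map<String, Integer> perPatent = new HashMap<>();
        for (String[] pair : citations) {          // pair = {citing, cited}
            perPatent.merge(pair[1], 1, Integer::sum);
        }
        // Step 2 (outer query): group patents by citation count.
        SortedMap<Integer, Integer> hist = new TreeMap<>();
        for (int n : perPatent.values()) {
            hist.merge(n, 1, Integer::sum);
        }
        return hist;
    }
}
```

Each grouping step corresponds to one Map/Reduce job, which is why chained queries like this are batch work, not real-time lookups.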
Hadoop in the clouds
Amazon EC2 + S3
EC2 instances are compute nodes (Map/Reduce)
Storage options:
HDFS on EC2 nodes
HDFS on EC2 nodes loading data from S3
Native S3 (bypasses HDFS)
Amazon Elastic MapReduce
Interact via web-based console
EMR configures & launches Hadoop cluster for job
Submit Map/Reduce job(streaming, Hive, Pig, or JAR)
Uses S3 for data input/output
Recap...
Hadoop = HDFS + Map/Reduce
Horizontal scale-out
Designed for fault tolerance
Distributed, parallel processing
Structure & queries via Hive
References
http://hadoop.apache.org/
http://hadoop.apache.org/hive/
Hadoop in Action: http://www.manning.com/lam/
Hadoop: The Definitive Guide, 2nd ed.: http://oreilly.com/catalog/0636920010388
Yahoo! Hadoop blog: http://developer.yahoo.net/blogs/hadoop/
Cloudera: http://www.cloudera.com/
http://lmgtfy.com/?q=hadoop
http://www.letmebingthatforyou.com/?q=hadoop