Post on 07-Jul-2015
© inovex Academy
Hadoop & Map-Reduce
Speakers
Dr. Kathrin Spreyer, Big Data Engineer
Patrick Thoma, Head of Solution Development
Inevitable Hadoop
2004: Google MapReduce paper
2006: Hadoop team around Doug Cutting at Yahoo!
2010/11: IBM’s Watson
2011/12: Hadoop connectors for Oracle products
Oct 2012: Microsoft (connectors for Azure, HDInsight)
Oct 2012: SAP (cooperation with support companies)
Motivation
1. sample use case: logfile analytics @ 1&1
2. 80 TB/month to be processed
3. too slow on existing hardware
4. further scaling not possible -- or extremely expensive
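To put the 80 TB/month figure in perspective, the following sketch converts it into a sustained data rate. It assumes decimal terabytes and a 30-day month; neither assumption comes from the slides. The result is only the rate needed to read the data once, so any analysis that re-reads or shuffles the data needs a multiple of it.

```python
# Back-of-the-envelope estimate: sustained throughput for a monthly data volume.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, 1 month = 30 days.
TB = 10**12
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def required_throughput_mb_per_s(tb_per_month: float) -> float:
    """Return the average rate in MB/s needed to read the volume once per month."""
    bytes_per_second = tb_per_month * TB / SECONDS_PER_MONTH
    return bytes_per_second / 10**6


print(f"{required_throughput_mb_per_s(80):.1f} MB/s")  # prints "30.9 MB/s"
```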
Amazing performance improvement
Overview
1. Map-Reduce
2. HDFS
3. APIs
4. Cluster sizing
What?
1. framework for distributed data processing
2. highly scalable: TBs and PBs
3. originated at Google
4. open-source implementation: Apache Hadoop
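As an illustration of the programming model, here is a minimal single-process sketch of map-reduce in Python, using word counting as the example. It is not Hadoop code: the names `map_words`, `reduce_counts` and `run_map_reduce` are made up for this sketch, and a real framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Run map, group by key (the "shuffle"), then reduce, all in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_map_reduce(lines, map_words, reduce_counts))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both steps can be spread across machines without changing the user code.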
The big picture
[Figure; only the label "input" survives]
Why?
1. too much data for one machine
2. processing speed
3. scaling out vs. scaling up
Photo by Flo P.
HDFS (Hadoop Distributed File System)
APIs
Basic Map-Reduce APIs
1. Java
2. C++ (Pipes)
3. Python (Dumbo)
4. streaming (any language)
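The streaming option can be sketched as follows. Hadoop Streaming passes input records to the mapper on stdin, expects tab-separated key/value lines on stdout, and delivers the mapper output to the reducer sorted by key. The word-count logic and the function names `map_lines` and `reduce_lines` are illustrative assumptions, not part of the original slides.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit "word<TAB>1" for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Run as "map" or "reduce" step, reading stdin and writing stdout.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

Saved under a hypothetical name such as `wordcount.py`, the job can be simulated locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. On a cluster, the same two commands would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.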
Higher-level APIs
1. Apache Pig (data flow language)
2. Apache Hive (SQL dialect)
alternative: graphical ETL tools, e.g., Pentaho Data Integration
Cluster sizing
Network topology
1. single data center
2. rack topology
3. bandwidth
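Hadoop can be told about the rack topology through a user-supplied script, configured with the `net.topology.script.file.name` property. The script receives host names or IP addresses as arguments and prints one rack path per argument. The sketch below shows what such a script might look like; the addresses and rack names in `RACKS` are hypothetical.

```python
import sys

# Hypothetical host-to-rack mapping; a real cluster would load this from
# its own inventory instead of hard-coding it.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.11": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts that are not listed


def resolve_rack(host: str) -> str:
    """Return the rack path for a host, or the default rack if unknown."""
    return RACKS.get(host, DEFAULT_RACK)


if __name__ == "__main__":
    # Print one rack path per host given on the command line.
    for host in sys.argv[1:]:
        print(resolve_rack(host))
```

With this information, HDFS can place block replicas on different racks, and the scheduler can prefer tasks that read data from the same rack, which saves bandwidth between racks.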
Questions?
Contact: bigdata@inovex.de