Big Data Workshop
By Avinash Ramineni and Shekhar Vemuri
Big Data
• Big Data
– More than a single machine can process
– 100s of GBs, TBs, even PBs!
• Why the buzz now?
– Low storage costs
– Increased computing power
– Potential for tremendous insights
– Data for competitive advantage
– Velocity, Variety, Volume
• Structured
• Unstructured - text, images, video, voice, etc.
Volume, Velocity & Variety
Big Data Use Cases
Traditional Large-Scale Computation
• Computation has been processor- and memory-bound
– Relatively small amounts of data
– Faster processors, more RAM
• Distributed systems
– Multiple machines for a single job
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– Difficult to deal with partial failures
• Data is stored on a SAN
• Data is copied to compute nodes
Data Becomes the Bottleneck
• Moore's Law has held firm for over 40 years
– Processing power doubles every two years
– Processing speed is no longer the problem
• Getting the data to the processors becomes the bottleneck
• Quick math
– Typical disk data transfer rate: 75 MB/s
– Time to transfer 100 GB of data to the processor? About 22 minutes (100,000 MB ÷ 75 MB/s ≈ 1,333 s)!
• Assuming sustained reads
• Actual time will be worse, since most servers have less than 100 GB of RAM
Need for a New Approach
• The system must support partial failure
– Failure of a component should result in graceful degradation of application performance
• If a component of the system fails, its workload should be picked up by other components
– Failure should not result in the loss of any data
• Components should be able to rejoin upon recovery
• Component failures during the execution of a job should not affect the outcome of the job
• Scalability
– Increasing resources should support a proportional increase in load capacity
Traditional vs Hadoop Approach
• Traditional
– Application server: where the application resides
– Database server: where the data resides
– Data is retrieved from the database and sent to the application for processing
– Can become network- and I/O-intensive for GBs of data
• Hadoop
– Distribute the data onto multiple commodity machines (data nodes)
– Send the application to the data nodes instead, and process the data in parallel
What is Hadoop?
• Hadoop is a framework for distributed data processing with a distributed filesystem
• Consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Map and then Reduce
• A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
• Individual machines are known as Nodes
Core Hadoop Concepts
• Distribute data as it is stored into the system
• Nodes talk to each other as little as possible
– Shared-nothing architecture
• Computation happens where the data is stored
• Data is replicated multiple times on the system for increased availability and reliability
• Fault tolerance
HDFS - 1
• HDFS is a filesystem written in Java
– Sits on top of a native filesystem
• Such as ext3, ext4, or xfs
• Designed for storing very large files with streaming data access patterns
• Write-once, read-many-times pattern
• No random writes to files are allowed
– High latency, high throughput
– Provides redundant storage through replication
– Not recommended for lots of small files or for low-latency requirements
HDFS - 2
• Files are split into blocks of 64 MB or 128 MB
• Data is distributed across machines at load time
• Blocks are replicated across multiple machines, known as DataNodes
– Default replication is 3 times (configurable; see the hdfs-site.xml sketch below)
• Two types of nodes in an HDFS cluster (master-worker pattern)
– NameNode (master)
• Keeps track of the blocks that make up a file and where those blocks are located
– DataNodes (workers)
• Hold the actual blocks
• Blocks are stored as standard files in a set of directories specified in the configuration
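A minimal hdfs-site.xml sketch of the replication setting (3 is the stock default; your cluster may set something else):

```xml
<!-- hdfs-site.xml: default number of replicas per block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```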
HDFS
A 150 MB file, Myfile.txt, is split into three blocks on load and replicated across four DataNodes:
• DataNode 1: Block1, Block2, Block3
• DataNode 2: Block1, Block2
• DataNode 3: Block1, Block3
• DataNode 4: Block2, Block3
The NameNode records the mapping (Myfile.txt: block1, block2, block3); a Secondary NameNode runs alongside it.
HDFS - 3
• HDFS daemons
– NameNode
• Keeps track of which blocks make up a file and where those blocks are located
• The cluster is not accessible without a NameNode
– Secondary NameNode
• Takes care of housekeeping activities for the NameNode (it is not a backup)
– DataNode
• Regularly reports its status to the NameNode in a heartbeat
• Sends a block report to the NameNode every hour and at startup (see the command below)
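One way to see those heartbeats and block reports reflected is the dfsadmin report, which prints overall cluster capacity and the status of each live DataNode:

```
hdfs dfsadmin -report
```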
HDFS - 4
• Blocks are replicated across multiple machines, known as DataNodes
• Read and write HDFS files via the Java API (a sketch follows this list)
• Access HDFS from the command line with the hadoop fs command
– File system subcommands: ls, cat, put, get, rm
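A minimal sketch of reading and writing via the Java API, assuming the cluster configuration is on the client's classpath (the path is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/training/hello.txt");  // hypothetical path

    // HDFS is write-once: create the file, write, close
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("Hello, HDFS");
    }

    // Read it back (streaming reads are the intended access pattern)
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}
```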
hadoop fs
hadoop fs -get
hadoop fs -put
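A few representative invocations, assuming a hypothetical /user/training home directory:

```bash
hadoop fs -ls /user/training                    # list a directory
hadoop fs -put purchases.txt /user/training/    # copy a local file into HDFS
hadoop fs -cat /user/training/purchases.txt     # print file contents
hadoop fs -get /user/training/purchases.txt .   # copy from HDFS to local disk
hadoop fs -rm /user/training/purchases.txt      # delete from HDFS
```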
Hands On: Using HDFS
MapReduce
• OK, you have data in HDFS. What next?
• Processing via MapReduce
• Parallel processing for large amounts of data
• Map function, Reduce function
• Data locality
– Code is shipped to the node closest to the data
• Java, Python, Ruby, Scala…
MapReduce Features
• Automatic parallelization and distribution
• Fault tolerance (speculative execution)
• Status and monitoring tools
• Clean abstraction for programmers
– MapReduce programs are usually written in Java (Hadoop Streaming supports other languages)
• Abstracts all the housekeeping activities away from developers
WordCount MapReduce
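A sketch following the stock Hadoop WordCount example (org.apache.hadoop.mapreduce API): the mapper emits (word, 1) for every token, and the reducer sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: one input line in, one (word, 1) pair out per token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: all counts for one word in, the summed total out
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```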
Hadoop Daemons
• Hadoop comprises five separate daemons
• NameNode
• Secondary NameNode
• DataNode
• JobTracker
– Manages MapReduce jobs, distributes individual tasks to machines
• TaskTracker
– Instantiates and monitors individual Map and Reduce tasks
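On a pseudo-distributed (single-node) MR1 cluster you can confirm all five are running with the JDK's jps tool; illustrative output (the PIDs will differ):

```
$ jps
2186 NameNode
2254 DataNode
2310 SecondaryNameNode
2398 JobTracker
2471 TaskTracker
```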
MapReduce
MapReduce Classes
Hadoop Cluster
• Master node: runs the JobTracker (MapReduce) and the NameNode (HDFS)
• Worker nodes 1, 2, 3 … n: each runs a TaskTracker and a DataNode as worker processes
MapReduce
• MR1
– Only MapReduce
– aka batch mode
• MR2
– YARN-based
– MR is one implementation on top of YARN
– Batch, interactive, online, streaming…
Hadoop 2.0
• Hadoop 1.x: MapReduce (JobTracker/TaskTracker) runs directly on HDFS (NameNode/DataNode) and handles both resource management and processing
• Hadoop 2.0: YARN (ResourceManager, NodeManagers, and a per-application AppMaster running in containers) takes over resource management, and MapReduce becomes one application framework on top of YARN
MapReduce
Hands On: Running a MR Job
How do you write MapReduce?
• MapReduce is a basic building block: a programming approach for solving larger problems
• In most real-world applications, multiple MR jobs are chained together
• Processing pipeline: the output of one job becomes the input to another
Anatomy of Map Reduce Job (WordCount)
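Worked by hand on one small input line, the job looks like this (a worked example, not actual console output):

```
Input line:      the cat sat on the mat
Map output:      (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
Shuffle & sort:  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
Reduce output:   (cat,1) (mat,1) (on,1) (sat,1) (the,2)
```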
Hands On: Writing a MR Program
Hadoop Ecosystem
• Rich ecosystem around Hadoop
• We have used many of them:
– Sqoop, Flume, Oozie, Mahout, Hive, Pig, Cascading, Lingual
• More out there!
– Impala, Tez, Drill, Stinger, Shark, Spark, Storm…
Big Data Roles
Data Engineer vs Data Scientist vs Data Analyst
Hive and Pig
• MapReduce code is typically written in Java
• Requires:
– A good Java programmer
– Who understands MapReduce
– Who understands the problem
– Who will be available to maintain and update the code in the future as requirements change
Hive and Pig
• Many organizations have only a few developers who can write good MapReduce code
• Many other people want to analyze data
– Business analysts
– Data scientists
– Statisticians
– Data analysts
• Need a higher-level abstraction on top of MapReduce
– Ability to query the data without needing to know MapReduce intimately
– Hive and Pig address these needs
Hive
Hive
• Abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language
– Very similar to SQL
• The Hive interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster
Hive Data Model
• Hive 'layers' table definitions on top of data in HDFS
• Tables
– Typed columns (see the sketch below)
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• STRING
• BINARY
• BOOLEAN, TIMESTAMP
• ARRAY, MAP, STRUCT
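A sketch of a table definition mixing primitive and complex types (the table and column names are made up):

```sql
CREATE TABLE employees (
  name       STRING,
  salary     FLOAT,
  skills     ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address    STRUCT<street:STRING, city:STRING, zip:INT>
);
```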
Hive Metastore
• Hive's metastore is a database containing table definitions and other metadata
– By default, stored locally on the client machine in a Derby database
– Usually MySQL when the metastore needs to be shared across multiple users
– It holds only the table definitions; the data itself stays in HDFS
Hive Examples
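A representative HiveQL session (the table, schema, and path are made up):

```sql
-- Layer a table definition over tab-delimited data
CREATE TABLE movies (id INT, title STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Move a file already in HDFS under the table
LOAD DATA INPATH '/user/training/movies.tsv' INTO TABLE movies;

-- The Hive interpreter compiles this query into a MapReduce job
SELECT year, COUNT(*) AS cnt
FROM movies
GROUP BY year;
```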
Hive Limitations
• Not all 'standard' SQL is supported
– Subqueries are only supported in the FROM clause (see the example below)
• No correlated subqueries
– No support for UPDATE or DELETE
– No support for INSERTing single rows
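So, for example, filtering on an aggregate has to be phrased with the subquery in the FROM clause (reusing the hypothetical movies table from above):

```sql
SELECT t.year, t.cnt
FROM (SELECT year, COUNT(*) AS cnt FROM movies GROUP BY year) t
WHERE t.cnt > 10;
```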
Hands-On: Hive
Apache Pig
Pig
• Originally created at Yahoo!
• High-level platform for data analysis and processing on Hadoop
• Alternative to writing low-level MapReduce code
• Relatively simple syntax
• Features
– Joining datasets
– Grouping data
– Loading non-delimited data
– Creation of user-defined functions, written in Java
– Relational operations
– Support for custom functions and data formats
– Complex data structures
Pig
• No shared metadata (unlike Hive)
• Installation requires no modification to the cluster
• Components of Pig
– Data flow language: Pig Latin
• Scripting language for writing data flows
– Interactive shell: the Grunt shell
• Type and execute Pig Latin statements
• Use Pig interactively
– Pig interpreter
• Interprets and executes the Pig Latin
• Runs on the client machine
Grunt Shell
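A short interactive sketch (the input path is made up); DUMP is what actually triggers execution:

```
grunt> lines = LOAD '/user/training/shakespeare' AS (line:chararray);
grunt> first5 = LIMIT lines 5;
grunt> DUMP first5;
```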
Pig Interpreter
1. Preprocess and parse Pig Latin
2. Check data types
3. Make optimizations
4. Plan execution
5. Generate MapReduce jobs
6. Submit jobs to Hadoop
7. Monitor progress
Pig Concepts
• A single element of data is an atom
• A collection of atoms, such as a row or a partial row, is a tuple
• Tuples are collected together into bags
• A Pig Latin script starts by loading one or more datasets into bags, then creates new bags by modifying those it already has
Sample Pig Script
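A small word-count data flow in the same spirit (the input and output paths are made up):

```pig
-- load raw lines, split each into words, count occurrences per word
lines   = LOAD '/user/training/shakespeare' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words) AS n;
STORE counts INTO '/user/training/wordcount_out';
```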
Hands-On: Pig Exercise
References
• Hadoop: The Definitive Guide by Tom White - a must read!
• http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
• http://hortonworks.com/hadoop/yarn/