Lecture 3 – Foundations for Big Data Systems and Programming - Computer...
CS 626 Large Scale Data Science
Jun Zhang
Originally prepared by Dr. Licong Cui
Department of Computer Science
University of Kentucky
January 28, 2020
Lecture 3 – Foundations for Big Data Systems and Programming
Review: Five P’s of Data Science
People
Purpose
Process
Platforms
Programmability
Review: Steps in the Data Science Process
Acquire → Prepare → Analyze → Report → Act
Big Data Engineering Computational Big Data Science
Basic Scalable Computing Concepts
Distributed File Systems
Scalable Computing over the Internet
Programming Models for Big Data
Traditional File System
64GB 256GB 512GB 1TB
Copy data to an external hard drive?
Buy a bigger disk?
Work Personal
Distribute data on multiple computers?
Store Data in Server?
Cluster of Machines
Distributed File System (DFS)
File system spreads over multiple, autonomous computers
Distributed File System
Rack
Data Replication
[Figure: data blocks 1 to 5, each replicated on several nodes of a rack]
Fault Tolerance
[Figure: rack of nodes holding replicated copies of data blocks 1 to 5]
High Concurrency
[Figure: Reader 1, Reader 2, and Reader 3 accessing replicated data blocks 1 to 5 on a rack of nodes]
Distributed File System
Rack
Data Partitioning
Data Replication
Data Scalability
Fault Tolerance
High Concurrency
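The properties listed above can be illustrated with a small in-memory sketch. It does not reproduce how any particular distributed file system works; the function names (`partition`, `place_replicas`, `read_block`), the round-robin placement policy, the node names, and the replication factor of 3 are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def partition(data: bytes, block_size: int) -> List[bytes]:
    """Data partitioning: split data into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Data replication: assign each block to `replication` distinct nodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def read_block(block_id: int, placement: Dict[int, List[str]],
               failed: Set[str]) -> str:
    """Fault tolerance: return a live node that holds the block."""
    for node in placement[block_id]:
        if node not in failed:
            return node
    raise IOError(f"all replicas of block {block_id} are unavailable")


data = b"abcdefghijklmnopqrstuvwxyz"
nodes = ["node1", "node2", "node3", "node4", "node5"]  # made-up node names
blocks = partition(data, block_size=6)
placement = place_replicas(len(blocks), nodes)

# Even with node1 down, every block is still readable from another replica.
sources = [read_block(b, placement, failed={"node1"}) for b in range(len(blocks))]
```

Because each block lives on several nodes, different readers can also be served from different replicas at the same time, which is the intuition behind high concurrency.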
Scalable Computing Over the Internet
Single compute node
Parallel Computer
Expensive
Commodity Cluster
Affordable
Less-specialized
Distributed computing over the Internet
Reduced computing cost
Architecture of a Commodity Cluster
[Figure: a rack of nodes connected by a network]
Distributed Computing
[Figure: several racks, each connected through the network]
• Enables data-parallelism
• Move computation to data
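A minimal sketch of "move computation to data", assuming each partition sits on a different node. Threads stand in for nodes here, and the names `local_sum` and `distributed_sum` are hypothetical. The point is that the same small function runs next to each partition, and only the partial results travel back to be combined.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical partitions; in a real cluster each would live on a different machine.
partitions: List[List[int]] = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]


def local_sum(partition: List[int]) -> int:
    """The computation shipped to a node: it touches only that node's data."""
    return sum(partition)


def distributed_sum(parts: List[List[int]]) -> int:
    """Run local_sum on every partition in parallel, then combine the partial results."""
    # Each worker thread is a stand-in for one node of the cluster.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(local_sum, parts))
    return sum(partials)


total = distributed_sum(partitions)
```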
Programming Models for Big Data
Runtime Libraries Programming Languages
Programming Model = abstractions
Distributed File System
Rack
Infrastructure:
Requirements for Big Data Programming Models
1. Support Big Data Operations
o Split large volumes of data
o Access data fast
o Distribute computations to nodes
Requirements for Big Data Programming Models (cont.)
2. Handle Fault Tolerance
o Replicate data partitions
o Recover files when needed
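One way to picture recovery is re-replication after a node failure. The sketch below is an assumption-laden illustration, not the mechanism of any specific system: the function `re_replicate`, the node names, and the "first available live node" policy are all hypothetical.

```python
from typing import Dict, List


def re_replicate(placement: Dict[int, List[str]], live_nodes: List[str],
                 failed: str, replication: int) -> Dict[int, List[str]]:
    """Return a new placement in which copies lost on `failed` are restored elsewhere."""
    repaired: Dict[int, List[str]] = {}
    for block, holders in placement.items():
        survivors = [node for node in holders if node != failed]
        if survivors:  # at least one surviving copy is needed as the source
            for node in live_nodes:
                if len(survivors) >= replication:
                    break
                if node not in survivors:
                    survivors.append(node)  # copy the block onto this node
        repaired[block] = survivors
    return repaired


# Three blocks, two replicas each, on made-up nodes A, B, C; node D is a spare.
placement = {0: ["A", "B"], 1: ["B", "C"], 2: ["C", "A"]}
repaired = re_replicate(placement, live_nodes=["A", "C", "D"],
                        failed="B", replication=2)
```

After the call, every block is back to two replicas and none of them is on the failed node.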
Requirements for Big Data Programming Models (cont.)
3. Enable Adding More Racks
[Figure: data blocks 1 to 5 spread over the nodes of a rack, with additional racks added]
Requirements for Big Data Programming Models (cont.)
4. Optimized for specific data types
Document Table
Key-value Graph
Multimedia Stream
Example – Suits of Cards
Key Programming Model
MapReduce
A programming model for Big Data
Many implementations
Programming Model = abstractions
Runtime Libraries Programming Languages
Support large data volumes
Provide fault tolerance
Enable scale out
MapReduce
What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of two parts:
• a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
• a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
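The student example can be sketched in plain Python on a single machine. This is only an illustration of the model, not a MapReduce implementation: the function names `map_phase`, `shuffle`, and `reduce_phase` and the student names are made up, and a real framework would run the phases in parallel across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(students: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (first_name, 1) pair for every student."""
    return [(full_name.split()[0], 1) for full_name in students]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the pairs by key: one 'queue' per first name."""
    queues: Dict[str, List[int]] = defaultdict(list)
    for name, count in pairs:
        queues[name].append(count)
    return dict(queues)


def reduce_phase(queues: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(counts) for name, counts in queues.items()}


# Made-up student names for illustration.
students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cara Diaz", "Bob Fox", "Ann Wu"]
name_counts = reduce_phase(shuffle(map_phase(students)))
```

Here `name_counts` holds the name frequencies: one entry per first name with the number of students in that queue.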
Google Distributed System
Google File System Architecture
Google MapReduce
Hadoop Ecosystem - History
Yahoo! released Hadoop in 2005
More open-source projects
2004
Hadoop Ecosystem – Layer Diagram
What is Hadoop?
Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
Goals/Requirements
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• High scalability and availability
• Use commodity hardware (cheap!)
• Fault-tolerance
• Move computation to data
Hadoop Architecture
Hadoop Architecture (cont.)
HDFS: Name Node, Data Node
MapReduce: Job Tracker, Task Tracker
Reminder: Downloading and Installing Hadoop
Download and Install VirtualBox
https://www.virtualbox.org/wiki/Downloads
Download and Install Cloudera QuickStart VM
https://www.cloudera.com/downloads/quickstart_vms/5-13.html
Launch the Cloudera VM
Reminder: Import Appliance in VirtualBox
Reminder: Amazon Web Services (AWS) Educate Sign Up
AWS Educate
https://aws.amazon.com/education/awseducate/