CS 626 Large Scale Data Science

Jun Zhang

Originally prepared by Dr. Licong Cui

Department of Computer Science

University of Kentucky

January 28, 2020

Lecture 3 – Foundations for Big Data Systems and Programming

Review: Five P’s of Data Science

People

Purpose

Process

Platforms

Programmability

Review: Steps in the Data Science Process

Acquire → Prepare → Analyze → Report → Act

Big Data Engineering Computational Big Data Science

Basic Scalable Computing Concepts

Distributed File Systems

Scalable Computing over the Internet

Programming Models for Big Data

Traditional File System

64GB 256GB 512GB 1TB

Copy data to an external hard drive?

Buy a bigger disk?

Work Personal

Distribute data on multiple computers?

Store Data in Server?

Cluster of Machines

Distributed File System (DFS)

File system spreads over multiple, autonomous computers

Distributed File System

Data Replication

[Figure: data blocks 1 to 5, each copied onto several nodes of a rack]

Fault Tolerance

[Figure: data blocks 1 to 5, each copied onto several nodes of a rack]

High Concurrency

[Figure: Reader 1, Reader 2, and Reader 3 accessing data blocks 1 to 5, which are copied onto several nodes of a rack]

Distributed File System

[Figure: rack of nodes]

Data Partitioning

Data Replication

Data Scalability

Fault Tolerance

High Concurrency
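The properties listed above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not how any real distributed file system is implemented: the function names (partition, store, read), the node names, and the round-robin placement policy are all hypothetical and chosen for illustration.

```python
def partition(data, block_size):
    """Data partitioning: split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store(data, nodes, block_size, replication):
    """Data replication: place each block on `replication` distinct nodes.

    Returns (storage, num_blocks), where storage maps node -> {block_id: bytes}.
    Round-robin placement is an assumption made for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    storage = {node: {} for node in nodes}
    blocks = partition(data, block_size)
    for block_id, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(block_id + r) % len(nodes)]
            storage[node][block_id] = block
    return storage, len(blocks)


def read(storage, num_blocks, failed=frozenset()):
    """Fault tolerance: rebuild the file from any replica on a live node."""
    parts = []
    for block_id in range(num_blocks):
        holders = [n for n, held in storage.items()
                   if n not in failed and block_id in held]
        if not holders:
            raise IOError(f"block {block_id} has no live replica")
        parts.append(storage[holders[0]][block_id])
    return b"".join(parts)


nodes = ["node1", "node2", "node3", "node4", "node5"]
storage, num_blocks = store(b"abcdefghij", nodes, block_size=2, replication=3)
# Two nodes fail, yet every block still has a live copy.
print(read(storage, num_blocks, failed={"node1", "node2"}))  # b'abcdefghij'
```

Because every block lives on three nodes, several readers can also be served from different copies at once, which is the intuition behind high concurrency.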

Scalable Computing Over the Internet

Single compute node

Parallel Computer

Expensive

Commodity Cluster

Affordable

Less-specialized

Distributed computing over the Internet

Reduced computing cost

Architecture of a Commodity Cluster

[Figure: racks of nodes connected by networks]

Distributed Computing

• Enables data-parallelism

• Move computation to data
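A rough way to picture "move computation to data" is shown below. It is a single-process sketch, assuming each dictionary entry stands in for the lines stored on one node; the names local_count and distributed_count and the node names are hypothetical. The same small function runs against every partition (data-parallelism), and only one integer per node would have to travel over the network instead of the raw lines.

```python
def local_count(lines, word):
    """Runs where the partition lives; returns a single small number."""
    return sum(line.split().count(word) for line in lines)


def distributed_count(partitions, word):
    """Apply the same function to every node's partition, then combine."""
    partials = {node: local_count(lines, word)
                for node, lines in partitions.items()}
    return sum(partials.values())


# Hypothetical partitions: node name -> lines stored on that node.
partitions = {
    "node1": ["big data big"],
    "node2": ["data systems", "big"],
}
print(distributed_count(partitions, "big"))  # 3
```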

Programming Models for Big Data

Runtime Libraries Programming Languages

Programming Model = abstractions

Infrastructure: Distributed File System (racks)

Requirements for Big Data Programming Models

1. Support Big Data Operations

o Split large volumes of data

o Access data fast

o Distribute computations to nodes

Requirements for Big Data Programming Models (cont.)

2. Handle Fault Tolerance

o Replicate data partitions

o Recover files when needed

Requirements for Big Data Programming Models (cont.)

3. Enable Adding More Racks

[Figure: data blocks 1 to 5 stored on one rack, with additional racks added]

Requirements for Big Data Programming Models (cont.)

4. Optimized for specific data types

Document Table

Key-value Graph

Multimedia Stream

Example – Suits of Cards

Key Programming Model

MapReduce

A programming model for Big Data

Many implementations

Programming Model = abstractions

Runtime Libraries Programming Languages

Support large data volumes

Provide fault tolerance

Enable scale out

MapReduce

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of:

o a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)

o a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
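The student example above can be written out as a small single-machine sketch. It is not the Hadoop or Google API: the helper names (map_fn, shuffle, reduce_fn, map_reduce) and the sample student names are assumptions made for illustration, and a real framework would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_fn(student):
    """Map: emit (first_name, 1) for one student record."""
    first_name = student.split()[0]
    yield first_name, 1


def shuffle(pairs):
    """Group values by key: one 'queue' per first name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(name, values):
    """Reduce: count the students in one queue."""
    return name, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    """Run map on every record, group by key, then reduce each group."""
    pairs = [pair for record in records for pair in map_fn(record)]
    groups = shuffle(pairs)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Hypothetical input records.
students = ["Ann Lee", "Bob Ray", "Ann Kim"]
print(map_reduce(students, map_fn, reduce_fn))  # {'Ann': 2, 'Bob': 1}
```

Keeping map_fn and reduce_fn free of shared state is what lets a framework spread them across many machines without changing the program's logic.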

Google Distributed System

Google File System Architecture

Google MapReduce

Hadoop Ecosystem - History

Google (2004)

Yahoo! released Hadoop in 2005

More open-source projects

Hadoop Ecosystem – Layer Diagram

What is Hadoop?

Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals/Requirements

o Abstract and facilitate the storage and processing of large and/or rapidly growing data sets

o High scalability and availability

o Use commodity hardware (cheap!)

o Fault-tolerance

o Move computation to data

Hadoop Architecture

Hadoop Architecture (cont.)

HDFS: Name Node, Data Node

MapReduce: Job Tracker, Task Tracker
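One way to see how these four roles fit together is the toy scheduler below. It is a heavily simplified sketch, not Hadoop code: the function schedule_map_tasks, the block and node names, and the "most free slots" tie-break are all assumptions. The block_locations dictionary plays the part of the metadata a Name Node keeps about which Data Nodes hold each block, and the function plays the part of a Job Tracker handing map tasks to Task Trackers, preferring a node that already stores the block.

```python
def schedule_map_tasks(block_locations, slots):
    """Assign one map task per block, preferring a node that holds the block.

    block_locations: block id -> list of nodes storing a copy (Name Node view).
    slots: node -> number of free task slots on its Task Tracker.
    """
    free = dict(slots)  # copy so the caller's dictionary is not modified
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if free.get(n, 0) > 0]
        # Fall back to any node with a free slot if no holder is available.
        candidates = local or [n for n, s in free.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = max(candidates, key=lambda n: free[n])
        free[node] -= 1
        assignment[block] = node
    return assignment


# Hypothetical cluster state.
block_locations = {"b1": ["dn1", "dn2"], "b2": ["dn2", "dn3"], "b3": ["dn1", "dn3"]}
slots = {"dn1": 1, "dn2": 1, "dn3": 1}
print(schedule_map_tasks(block_locations, slots))
# {'b1': 'dn1', 'b2': 'dn2', 'b3': 'dn3'}
```

Every task in this run lands on a node that already stores its block, so no block needs to be copied before the computation starts.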

Reminder: Downloading and Installing Hadoop

Download and Install VirtualBox
https://www.virtualbox.org/wiki/Downloads

Download and Install Cloudera QuickStart VM
https://www.cloudera.com/downloads/quickstart_vms/5-13.html

Launch the Cloudera VM

Reminder: Import Appliance in VirtualBox

Reminder: Amazon Web Service (AWS) Educate Sign Up

AWS Educate
https://aws.amazon.com/education/awseducate/