Lecture 3 – Foundations for Big Data Systems and Programming - Computer...

CS 626 Large Scale Data Science

Jun Zhang

Originally prepared by Dr. Licong Cui

Department of Computer Science

University of Kentucky

January 28, 2020

Lecture 3 – Foundations for Big Data Systems and Programming

Review: Five P’s of Data Science

People

Purpose

Process

Platforms

Programmability

Review: Steps in the Data Science Process

Acquire → Prepare → Analyze → Report → Act

Big Data Engineering Computational Big Data Science

Basic Scalable Computing Concepts

Distributed File Systems

Scalable Computing over the Internet

Programming Models for Big Data

Traditional File System

64GB 256GB 512GB

Copy data to an external hard drive?

Buy a bigger disk?

Work Personal

Distribute data on multiple computers?

Store Data in Server?

Cluster of Machines

Distributed File System (DFS)

File system spreads over multiple, autonomous computers

Distributed File System

Data Replication

Fault Tolerance

High Concurrency

Reader 1 Reader 2

Reader 3

Distributed File System

Data Partitioning

Data Replication

Data Scalability

Fault Tolerance

High Concurrency
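The partitioning, replication, and fault-tolerance properties listed above can be illustrated with a small sketch. This is a toy model, not how any particular distributed file system places data: the node names, the block size, and the round-robin placement rule are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def partition(data: bytes, block_size: int) -> List[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """Return the block ids that still have at least one replica on a live node."""
    return [
        b for b, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


# Hypothetical five-node cluster and a tiny "file".
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = partition(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)

# Even with two nodes down, every block keeps a live replica.
survivors = readable_blocks(placement, failed={"node2", "node3"})
```

With three replicas per block, the file stays fully readable after two node failures, and several readers can be served from different replicas of the same block at once.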

Scalable Computing Over the Internet

Single compute node

Parallel Computer

Expensive

Commodity Cluster

Affordable

Less-specialized

Distributed computing over the Internet

Reduced computing cost

Architecture of a Commodity Cluster

Network

Distributed Computing

• Enables data-parallelism

• Move computation to data
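A minimal sketch of "move computation to data", assuming a hypothetical three-node cluster simulated with threads: the same small function runs against each node's local partition, and only the small partial results travel back to be combined. The node names and data values are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical cluster: each "node" holds its own partition of the data.
partitions: Dict[str, List[int]] = {
    "node1": [3, 8, 1],
    "node2": [7, 2],
    "node3": [9, 4, 6],
}


def local_sum(node: str) -> int:
    """Runs 'on' the node and reads only that node's partition."""
    return sum(partitions[node])


# Data parallelism: the same computation is applied to every partition concurrently.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(local_sum, partitions))

# Only one small number per node crosses the "network", not the raw data.
total = sum(partial_sums)
```

The design point is the size of what moves: the function and its result are tiny compared with the partitions themselves.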

Programming Models for Big Data

Runtime Libraries Programming Languages

Programming Model = abstractions

Distributed File System

Infrastructure:

Requirements for Big Data Programming Models

1. Support Big Data Operations

o Split large volumes of data

o Access data fast

o Distribute computations to nodes

Requirements for Big Data Programming Models (cont.)

2. Handle Fault Tolerance

o Replicate data partitions

o Recover files when needed

3. Enable Adding More Racks

4. Optimized for specific data types

o Document

o Table

o Key-value

o Graph

o Multimedia

o Stream

Example – Suits of Cards

Key Programming Model

MapReduce

A programming model for Big Data

Many implementations

Programming Model = abstractions

Runtime Libraries Programming Languages

Support large data volumes

Provide fault tolerance

Enable scale out

MapReduce

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
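The student-name example can be sketched in a few lines of plain Python. This is a single-machine illustration of the map, group-by-key, and reduce steps, not the API of Hadoop or any other MapReduce implementation; the function names and the student list are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(students: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (first_name, 1) pair for each student."""
    return [(full_name.split()[0], 1) for full_name in students]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key: one 'queue' per first name."""
    queues: Dict[str, List[int]] = defaultdict(list)
    for name, value in pairs:
        queues[name].append(value)
    return dict(queues)


def reduce_phase(queues: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: count the students in each queue, yielding name frequencies."""
    return {name: sum(values) for name, values in queues.items()}


# Made-up sample data.
students = ["Ana Lopez", "Ben Smith", "Ana Kim", "Cara Jones", "Ben Lee", "Ana Park"]
name_counts = reduce_phase(shuffle(map_phase(students)))
```

On a cluster, the map calls would run in parallel on separate partitions of the student list, and each queue would be handed to a reducer, possibly on a different machine.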

Google Distributed System

Google File System Architecture

Google MapReduce

Hadoop Ecosystem - History

Google

Yahoo! released Hadoop in 2005

More open-source projects

Hadoop Ecosystem – Layer Diagram

What is Hadoop?

Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals/Requirements Abstract and facilitate the storage and processing of

large and/or rapidly growing data sets High scalability and availability Use commodity hardware (cheap!)

Fault-tolerance Move computation to data

Hadoop Architecture

Hadoop Architecture (cont.)

HDFS Name Node

Data Node

MapReduce Job Tracker

Task Tracker
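How these four roles fit together can be sketched as follows, under simplifying assumptions: the Name Node is reduced to a table of block locations, and the Job Tracker is reduced to a function that prefers a Task Tracker running on a Data Node that already holds the block. The block ids, node names, and slot counts are hypothetical, and real Hadoop scheduling is considerably more involved.

```python
from typing import Dict, List

# Hypothetical Name Node metadata: block id -> Data Nodes holding a replica.
block_locations: Dict[str, List[str]] = {
    "blk_0": ["dn1", "dn2"],
    "blk_1": ["dn2", "dn3"],
    "blk_2": ["dn3", "dn1"],
}


def assign_tasks(locations: Dict[str, List[str]], free_slots: Dict[str, int]) -> Dict[str, str]:
    """Job Tracker-style assignment: prefer a Task Tracker co-located with the block."""
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment: Dict[str, str] = {}
    for block, holders in locations.items():
        # Nodes that hold the block and still have a free task slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        # Fall back to any node with a free slot (the block must then be read remotely).
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        chosen = candidates[0]
        slots[chosen] -= 1
        assignment[block] = chosen
    return assignment


# One free task slot per node in this toy cluster.
assignment = assign_tasks(block_locations, {"dn1": 1, "dn2": 1, "dn3": 1})
```

In this run every task lands on a node that stores its block, so no block has to be copied over the network before processing.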

Reminder: Downloading and Installing Hadoop

Download and Install VirtualBox
https://www.virtualbox.org/wiki/Downloads

Download and Install Cloudera QuickStart VM
https://www.cloudera.com/downloads/quickstart_vms/5-13.html

Launch the Cloudera VM

Reminder: Import Appliance in VirtualBox

Reminder: Amazon Web Services (AWS) Educate Sign Up

AWS Educate
https://aws.amazon.com/education/awseducate/
