Data-Intensive Distributed Computingcs451/slides/big...347,955.20 3,348.48 5.99 0.06 0.02 0.001 0.01...

Data-Intensive Distributed Computing

Part 1: MapReduce Algorithm Design (1/4)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 431/631 451/651 (Winter 2020)

Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

Agenda for Today

Who am I?What is big data?

Why big data?What is this course about?

Administrivia

Who am I?

PhD from Waterloo (2017)Systems and Networking Research Group

Source: Wikipedia (Hard disk drive)

Big Data

Storage evolution over time

Today1980s1950s

347,955.20

3,348.48

0.060.02

100000

1000000

10000000

1980 1990 2000 2010 2020

Storage cost over time

Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)

Processes 20 PB a day (2008)Crawls 20B web pages a day (2012)Search index is 100+ PB (5/2014)Bigtable serves 2+ EB, 600M QPS (5/2014)

300 PB data in Hive + 600 TB/day (4/2014)

400B pages, 10+ PB (2/2014)

LHC: ~15 PB a year

LSST: 6-10 PB a year (~2020)640K ought to be

enough for anybody.

150 PB on 50k+ servers running 15k apps (6/2011)

S3: 2T objects, 1.1M request/second (4/2013)

SKA: 0.3 – 1.5 EB per year (~2020)

19 Hadoop clusters: 600 PB, 40k servers (9/2015)

How much data?

Data generation

Storage cost

Big Data

Source: Wikipedia (Everest)

Why big data? ScienceBusinessSociety

Emergence of the 4th Paradigm

Data-intensive e-ScienceMaximilien Brice, © CERN

Science

Maximilien Brice, © CERN

Source: Wikipedia (DNA)

GATGCTTACTATGCGGGCCCC

CGGTCTAATGCTTACTATGC

GCTTACTATGCGGGCCCCTT

AATGCTTACTATGCGGGCCCCTT

TAATGCTTACTATGC

AATGCTTAGCTATGCGGGC

CGGTCTAGATGCTTACTATGC

CGGTCTAATGCTTAGCTATGC

ATGCTTACTATGCGGGCCCCTT

Subject genome

Sequencer

Human genome: 3 gbpA few billion short reads (~100 GB compressed data)

BusinessData-driven decisions

Data-driven products

Source: Wikiedia (Shinjuku, Tokyo)

An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc.

Business Intelligence

In the 1990s, Wal-Mart found that customers tended to buy diapers and beer together. So they put them next to each other and increased sales of both.*

This is not a new idea!

So what’s changed?

More compute and storage

Ability to gather behavioral data

* BTW, this is completely apocryphal. (But it makes a nice story.)

a useful service

analyze user behavior to extract insights

transform insights into action

$(hopefully)

Google. Facebook. Twitter. Amazon. Uber.

data sciencedata products

Virtuous Product Cycle

Source: https://images.lookhuman.com/render/standard/8002245806006052/pillow14in-whi-z1-t-netflixing.png

Source: https://www.reddit.com/r/teslamotors/comments/6gsc6v/i_think_the_neural_net_mining_is_just_starting/ (June 2017)

Source: Wikipedia (Rosetta Stone)

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

No data like more data!

Source: Guardian

Humans as social sensors

Computational social science

Society

Predicting X with Twitter

(Paul and Dredze, ICWSM 2011; Bond et al., Nature 2011)

Political Mobilization on Facebook

2010 US Midterm Elections: 60m users shown “I Voted” Messages

Summary: increased turnout by 60k directly and 280k indirectly

The Political Blogosphere and the 2004 U.S. Election

Source: Popular Internet Meme

What is this course about?

Execution

Infrastructure

Analytics

Infrastructure

Data Science

e“big data stack”

Buzzwords

MapReduce, Spark, Flink, Pig, Dryad, Hive, Dryad, noSQL, Pregel, Giraph, Storm/Heron

Execution

Infrastructure

Analytics

Infrastructure

Data Science

Text: frequency estimation, language models, inverted indexes

Graphs: graph traversals, random walks (PageRank)

Relational data: SQL, joins, column stores

Data mining: hashing, clustering (k-means), classification, recommendations

Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

data science, data analytics, business intelligence, data warehouses and data lakes

This course focuses on algorithm design and “thinking at scale”

“big data stack”

Structure of the Course

“Core” framework features and algorithm design for batch processing

What’s beyond batch processing?

Source: Google

Tackling Big Data

“Work”

w1 w2 w3

r1 r2 r3

“Result”

worker worker worker

Partition

Aggregate

Divide and Conquer

What’s the common theme of all of these challenges?

Parallelization Challenges

How do we assign work units to workers?What if we have more work units than workers?

What if workers need to communicate partial results?What if workers need to access shared resources?

How do we know when a worker has finished? (Or is simply waiting?)What if workers die?

Difficult because:

We don’t know the order in which workers run…We don’t know when workers interrupt each other…

We don’t know when workers need to communicate partial results…We don’t know the order in which workers access shared resources…

Common Theme?

Parallelization challenges arise from:

Need to communicate partial resultsNeed to access shared resources

How do we tackle these challenges?

(In other words, sharing state)

“Current” Tools

Basic primitives

Semaphores (lock, unlock)Conditional variables (wait, notify, broadcast)

Barriers

Awareness of Common Problems

Deadlock, livelock, race conditions...Dining philosophers, sleeping barbers, cigarette smokers...

“Current” Tools

Programming Models

Message Passing

P1 P2 P3 P4 P5

Shared Memory

P1 P2 P3 P4 P5M

Design Patterns

coordinator

workers

producer consumer

work queue

When Theory Meets Practices

Now throw in:

The scale of clusters and (multiple) datacentersThe presence of hardware failures and software bugs

The presence of multiple interacting services

The reality:

Lots of one-off solutions, custom codeWrite you own dedicated library, then program with it

Burden on the programmer to explicitly manage everything

Concurrency is already difficult to reason about…

Bottom line: it’s hard!

Source: Ricardo Guimarães Herrmann

Source: CS 251

Source: Google

The datacenter is the computer!

It’s all about the right level of abstractionMoving beyond the von Neumann architecture

What’s the “instruction set” of the datacenter computer?

Hide system-level details from the developersNo more race conditions, lock contention, etc.

No need to explicitly worry about reliability, fault tolerance, etc.

Separating the what from the howDeveloper specifies the computation that needs to be performed

Execution framework (“runtime”) handles actual execution

MapReduce is the first instantiation of this idea… but not the last!

Source: Wikipedia (Japanese rock garden)

Questions?

Source: http://www.flickr.com/photos/artmind_etcetera/6336693594/

Course Administrivia

Two in One!

CS 451/651CS 451: version for CS ugrads (most students)CS 651: version for CS grads

CS 431/631CS 431: version for non-CS ugradsCS 631: version for non-CS grads

Important Coordinates

Course website:https://www.student.cs.uwaterloo.ca/~cs451/

Bespinhttp://bespin.io/

Communicating with us:Piazza for general/private questions (link on course homepage)

Lots of info there, read it!(“I didn’t see it” will not be accepted as an excuse)

Course Design

Components of the final grade:

6 (CS 431/631) or 8 (CS 451/651) individual assignmentsFinal exam

Additional group final project (CS 631/651)

This course focuses on algorithm design and “thinking at scale”

Not the “mechanics” (API, command-line invocations, et.)You’re expected to pick up MapReduce/Spark with minimal help

Expectations (CS 451)

You are:

Genuinely interested in the topicBe prepared to put in the time

Comfortable with rapidly-evolving software

Your background:

Pre-reqs: CS 341, CS 348, CS 350Comfortable in Java and Scala (or be ready to pick it up quickly)

Know how to use GitReasonable “command-line”-fu skills

Experience in compiling, patching, and installing open source softwareGood debugging skills

MapReduce/Spark Environments (CS 451)

Single-Node Hadoop: Local installationsInstall all software components on your own machine

Requires at least 4GB RAM and plenty of disk spaceWorks fine on Mac and Linux, YMMV on Windows

Important: For your convenience only!We’ll provide basic instructions, but not technical support

Single-Node Hadoop: Linux Student CS EnvironmentEverything is set up for you, just follow instructions

We’ll make sure everything works

See “Software” page in course homepage for instructions

Distributed Hadoop: Datasci Cluster

Assignment Mechanics (CS 451)

Note late policy (details on course homepage)

Late by up to 24 hours: 25% reduction in gradeLate 24-48 hours: 50% reduction in gradeLate by more the 48 hours: not accepted

By assumption, we’ll pull and mark at deadline:If you want us to hold off, you must let us know!

We’ll be using private Git repos for assignments

Complete your assignments, push to GitLabWe’ll pull your repos at the deadline and grade

Assignments will use Python and JupyterEverything you need to know is in the assignment itself

Assignments will generally be submitted using GitDetails are on the course website for the appropriate assignment

Note late policy (details on course homepage)Late by up to 24 hours: 25% reduction in grade

Late 24-48 hours: 50% reduction in gradeLate by more the 48 hours: not accepted

By assumption, we’ll pull and mark at deadline:If you want us to hold off, you must let us know!

Course Materials

One (required) textbook +Three (optional but recommended) books +

Additional readings from other sources as appropriate

Note: 4th Edition

(optional but recommended)

If you’re not (yet) registered:

Register for the wait list at:

Note: late registration is not an excuse for late assignments

By sending Ali an email at ali.abedi@uwaterloo.ca

Priority for unregistered students

CS studentsHave all the pre-reqs

Final opportunity to take the course (e.g., 4B students)Continue to attend class until final decision

Once the course is full, it is full

Data-Intensive Distributed Computingcs451/slides/big...347,955.20 3,348.48 5.99 0.06 0.02 0.001 0.01...

Documents

Transcript of Data-Intensive Distributed Computingcs451/slides/big...347,955.20 3,348.48 5.99 0.06 0.02 0.001 0.01...

Automatically generated PDF from existing images. · 2020-06-09 · (Yoooo \oooooo u 00000 u 00000 1000000 000000 Rqcqgq ? Y 00000 g Yooooo (000000 ? yooooo g Y 00000 1000000 1000000

TMPL Form 604 21 June 2012 - asx.com.au · 1000000 -500000 -20000 -500000 -25000 25000 -60000 1000000 2162500 143023 50000 -150000 150000 -25000 -1888888 1000000 -10000 Person whose

Budget Line Instructions: Create/Edit/Activate/Deactivate ... · 2000 1000000 2002 DUPLICATE 2001 1000000 123 2001 OK 2001 1000000 2002 DUPLICATE 2001 1000000 2001 ABC TEST OK . 4.

5700EE Parts Manual - Tennant Co · PDF fileY 222388 (10000000- ) Blade, Sqge, Rear, 45.65l Pyu [32”] 1 222066 (10000000- ) Wheel, 3.0 Dia 4 82138 (10000000- ) Nut Speed,Retainer,

Ljubisa Bajićand Jasmina Vasiljević · 2020. 8. 18. · Dynamic Execution 1 10 100 1000 10000 100000 1000000 10000000 2019 2020 2021 2022 2023 2024 2025 2026 ML vs. Moore's Law

Varmebehandling - orbit.dtu.dk · DTU Fødevareinstituttet Varmetolerance – D-værdi 1 10 100 1000 10000 100000 1000000 10000000 0 10 20 30 40 50 60 Tid (min.) ved konstant temperatur

T1 Parts list (sn 10000000-20000000)d2z4qs2e3spnc1.cloudfront.net/product_file/file/... · Y 24 1013295 (10000000--20000000 ) Spring, Cmpr [2100 Series] 1 Y 25 9006059 (10000000--20000000

[XLS]reports.mca.gov.inreports.mca.gov.in/Reports/MasterDataExcels/company... · Web view100000 100000 100000 100000 10000000 430000 100000 100000 100000 100000 10000000 425000 100000

RECON… · XLS file · Web view7/11/2012 13:30:49. 1 193140000. 20000000. 60000000. 10000000. 10000000. 15000000. 3000000000 124400648.16. 1345669130.6600001 124021061.84. 111502192.

Jul 2014 - China - , Bill LEE (1000000) · Press Clipping Lac de Come Jul 2014 - China - , Bill LEE (1000000) MEPAX, September 05 2014 - 4/27

CATEGORY: DESIGN & ENGINEERING - 03 · 2020-03-17 · CATEGORY: DESIGN & ENGINEERING - 03. 100000000 10000000 1000000 100000 1000 100 100 1,000 10.000 NO OF CROSS POWER DEVICES 100,000

gajanandelectrocontrols.comgajanandelectrocontrols.com/pdf/gajananad.pdf · · 2015-05-29Perfect workability . S.S. GLASS AUTO DOOR GECI - Dil Features : ... COLLAPSIBLE DOOR 10000000,

Data-Intensive Distributed Computingcs451/slides/big... · 2020-02-13 · OLAP (online analytical processing) Typical applications: business intelligence, data mining Back-end processing:

Adverse Events: Documenting, Recording, and Reportingclinicaltrialtools.vc.ons.org/file_depot/0-10000000/0-10000/1338/... · Adverse Events: Documenting, Recording, and Reporting.

· XLS file · Web view974 6/29/2018 100000 100000. 975 6/14/2018 1000000 10000. 976 6/20/2018 1000000 380000. 977 6/19/2018 100000 100000. 978 6/21/2018 100000 100000. 979 6/25/2018

Smart Tax Saving 1000000

data.ncdaonline.org · 1400000 1200000 1000000 800000 600000 400000 200000 Year I 1000000 940000 1020000 963500 1040400 987588 Compliance Period 10 1195093 1208712 11

Analysis, prevalence and impact of microplastics in ... · Smallest particle size considered (µm) (m 3) 0.01 0.1 1 10 100 1000 10000 100000 1000000 10000000 1 11 .1 10 100 1000 10000

ACR Missing 407 Section C ICAC Redacted - rev 5 FINAL Missing 407... · 237.407 20 13 000000 180.407 1 1000000 43.407 21 13 000000 69.407 2 2000000 382.407 5 5000000 37.407 1 1000000

Heterogeneous Parallel Computingcs451/lectures/grad/targhetta.pdfWhat is Heterogeneous Parallel Computing (HPC)? Parallel computation using a collection of unlike computational machines