CS 525 Advanced Distributed Systems Spring 09

1

Indranil Gupta (Indy)
Lecture 4

The Grid. Clouds.
January 29, 2009


2

Two Questions We’ll Try to Answer

• What is the Grid? Basics, no hype.
• What is its relation to p2p?

3

Example: Regional Atmospheric Modeling System (RAMS), Colorado State U

• Hurricane Georges, 17 days in Sept 1998
– “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
– Used 5 km spacing instead of the usual 10 km
– Ran on 256+ processors

• Can one run such a program without access to a supercomputer?

4

Distributed Computing Resources: Wisconsin, MIT, NCSA

5

An Application Coded by a Physicist

[Figure: workflow of four jobs: Job 0, Job 1, Job 2, Job 3]

• Output files of Job 0 are the input to Job 2
• Jobs 1 and 2 can be concurrent
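The dependency structure of such a workflow can be sketched as a small DAG. This is only an illustration: the slide states that Job 2 consumes the output of Job 0 and that Jobs 1 and 2 can run concurrently; the other edges (Job 0 before Job 1, and Jobs 1 and 2 before Job 3) are assumptions made to complete the example.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output it needs.
# Only "Job 2 reads Job 0's output" is stated on the slide;
# the edges marked "assumption" are hypothetical.
deps = {
    "Job 0": set(),
    "Job 1": {"Job 0"},           # assumption
    "Job 2": {"Job 0"},           # from the slide
    "Job 3": {"Job 1", "Job 2"},  # assumption
}

def waves(deps):
    """Group jobs into waves; jobs in the same wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result

print(waves(deps))  # [['Job 0'], ['Job 1', 'Job 2'], ['Job 3']]
```

Under these assumed edges, Jobs 1 and 2 land in the same wave, which is what lets a scheduler place them at different sites at the same time.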

6

An Application Coded by a Physicist

Job 2
• 4 stages of a job: Init, Stage in, Execute, Stage out, Publish
• May take several hours/days
• Computation Intensive, so Massively Parallel
• Several GBs

7

[Figure: Job 0, Job 1, Job 2, and Job 3 shown with the Wisconsin, MIT, and NCSA sites]

8

[Figure: Job 0, Job 1, Job 2, and Job 3 shown with the Wisconsin, MIT, and NCSA sites; labels: Condor Protocol, Globus Protocol]

9

[Figure: Job 0, Job 1, Job 2, and Job 3 shown with the Wisconsin, MIT, and NCSA sites]

Globus Protocol
• Internal structure of different sites invisible to Globus
• External Allocation & Scheduling
• Stage in & Stage out of Files

10

[Figure: Job 0 and Job 3 at Wisconsin]

Condor Protocol
• Internal Allocation & Scheduling
• Monitoring
• Distribution and Publishing of Files

11

Tiered Architecture (OSI 7 layer-like)

Resource discovery, replication, brokering

High energy Physics apps

Globus, Condor

Workstations, LANs

Opportunity for Crossover ideas from p2p systems

12

The Grid Today

Some are 40Gbps links! (The TeraGrid links)

“A parallel Internet”

13

Globus Alliance

• Alliance involves U. Illinois Chicago, Argonne National Laboratory, USC-ISI, U. Edinburgh, Swedish Center for Parallel Computers

• Activities : research, testbeds, software tools, applications

• Globus Toolkit (latest ver - GT3)
“The Globus Toolkit includes software services and libraries for resource monitoring, discovery, and management, plus security and file management. Its latest version, GT3, is the first full-scale implementation of new Open Grid Services Architecture (OGSA).”

14

More

• Entire community, with multiple conferences, get-togethers (GGF), and projects

• Grid Projects: http://www-fp.mcs.anl.gov/~foster/grid-projects/

• Grid Users:
– Today: Core is the physics community (since the Grid originates from the GriPhyN project)
– Tomorrow: biologists, large-scale computations (nug30 already)?

15

Some Things Grid Researchers Consider Important

• Single sign-on: collective job set should require once-only user authentication

• Mapping to local security mechanisms: some sites use Kerberos, others use Unix

• Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job 1

• Community authorization: e.g., third-party authentication

16

Grid History – 1990’s

• CASA network: linked 4 labs in California and New Mexico
– Paul Messina: Massively parallel and vector supercomputers for computational chemistry, climate modeling, etc.
• Blanca: linked sites in the Midwest
– Charlie Catlett, NCSA: multimedia digital libraries and remote visualization
• More testbeds in Germany & Europe than in the US
• I-way experiment: linked 11 experimental networks
– Tom DeFanti, U. Illinois at Chicago, and Rick Stevens, ANL: for a week in Nov 1995, a national high-speed network infrastructure. 60 application demonstrations, from distributed computing to virtual reality collaboration.
• I-Soft: secure sign-on, etc.

17

Trends: Technology

• Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu speed: 18 mos

• Then and Now

Bandwidth
– 1985: mostly 56Kbps links nationwide

– 2004: 155 Mbps links widespread

Disk capacity

– Today’s PCs have 100GBs, same as a 1990 supercomputer

18

Trends: Users

• Then and Now
Biologists:
– 1990: were running small single-molecule simulations
– 2004: want to calculate structures of complex macromolecules, want to screen thousands of drug candidates
Physicists:
– 2006: CERN’s Large Hadron Collider produced 10^15 B/year

• Trends in Technology and User Requirements: Independent or Symbiotic?

19

Prophecies

In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.

Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application

– [Will this be a reality with the Grid?]

20

“We must address scale & failure”

“We need infrastructure”

P2P Grid

21

Definitions

Grid

P2P

• “Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” (1998)

• “A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS” (2002)

• “Applications that take advantage of resources at the edges of the Internet” (2000)

• “Decentralized, self-organizing distributed systems, in which all or most communication is symmetric” (2002)


525: (good legal applications without intellectual fodder)

525: (clever designs without good, legal applications)

23

Grid versus P2P - Pick your favorite

24

Applications

Grid
• Often complex & involving various combinations of
– Data manipulation
– Computation
– Tele-instrumentation
• Wide range of computational models, e.g.
– Embarrassingly parallel
– Tightly coupled
– Workflow
• Consequence
– Complexity often inherent in the application itself

P2P
• Some
– File sharing
– Number crunching
– Content distribution
– Measurements
• Legal Applications?
• Consequence
– Low Complexity


26

Scale and Failure

P2P
• V. large numbers of entities
• Moderate activity
– E.g., 1-2 TB in Gnutella (’01)
• Diverse approaches to failure
– Centralized (SETI)
– Decentralized and Self-Stabilizing

FastTrack 4,277,745

iMesh 1,398,532

eDonkey 500,289

DirectConnect 111,454

Blubster 100,266

FileNavigator 14,400

Ares 7,731

(www.slyck.com, 2/19/’03)

Grid

• Moderate number of entities

– 10s institutions, 1000s users

• Large amounts of activity

– 4.5 TB/day (D0 experiment)

• Approaches to failure reflect assumptions

– E.g., centralized components


28

Services and Infrastructure

Grid
• Standard protocols (Global Grid Forum, etc.)
• De facto standard software (open source Globus Toolkit)
• Shared infrastructure (authentication, discovery, resource access, etc.)
Consequences
• Reusable services
• Large developer & user communities
• Interoperability & code reuse

P2P
• Each application defines & deploys completely independent “infrastructure”
• JXTA, BOINC, XtremWeb?
• Efforts started to define common APIs, albeit with limited scope to date
Consequences
• New (albeit simple) install per application
• Interoperability & code reuse not achieved


30

Coolness Factor

Grid
P2P


32

Summary: Grid and P2P

1) Both are concerned with the same general problem
– Resource sharing within virtual communities

2) Both take the same general approach
– Creation of overlays that need not correspond in structure to underlying organizational structures

3) Each has made genuine technical advances, but in complementary directions
– “Grid addresses infrastructure but not yet scale and failure”
– “P2P addresses scale and failure but not yet infrastructure”

4) Complementary strengths and weaknesses => room for collaboration (Ian Foster at UChicago)

33

Crossover Ideas

Some P2P ideas useful in the Grid
– Resource discovery (DHTs), e.g., how do you make “filenames” more expressive, so that a name can describe a computer cluster resource?
– Replication models, for fault-tolerance, security, reliability
– Membership, i.e., which workstations are currently available?
– Churn-resistance, i.e., users log in and out; the problem is difficult since a free host gets an entire computation, not just small files

• All above are open research directions, waiting to be explored!

34

Cloud Computing

What’s it all about?

A First Step

35

Life of Ra (a Research Area)

[Figure: popularity of a research area (y-axis) plotted against time (x-axis). Curve labels: Hype (“Wow!”); first peak, end of hype (“This is a hot area!”); first trough (“I told you so!”).]

Phases along the time axis:
• Young (low-hanging fruits)
• Adolescent (interesting problems)
• Middle Age (solid base, hybrid algorithms)
• Old Age (incremental solutions)

Where is Grid? Where is cloud computing?

36

How do I identify what stage a research area is in?

1. If there are no publications in the research area that are more than 1-2 years old, it is in the “Young” phase.

2. Pick a paper in the last 1 year published in the research area. Read it. If you think that you could have come up with the core idea in that paper (given all the background etc.), then the research area is in its “Young” phase.

3. Find the latest published paper that you think you could have come up with the idea for. If this paper has been cited by one round of papers (but these citing papers themselves have not been cited), then the research area is in the “Adolescent” phase.

4. Do Step 3 above, and if you find that the citing papers themselves have been cited, and so on, then the research area is at least in the “Middle Age” phase.

5. Pick a paper in the last 1-2 years. If you find that there are only incremental developments in these latest published papers, and the ideas may be innovative but are not yielding large enough performance benefits, then the area is mature.

6. If no one works in the research area, or everyone you talk to thinks negatively about the area (except perhaps the inventors of the area), then the area is dead.

37

What is a cloud?

• It’s a cluster! It’s a supercomputer! It’s a datastore!

• It’s superman!

• None of the above

• Cloud = Lots of storage + compute cycles nearby

38

Data-intensive Computing

• Computation-Intensive Computing
– Example areas: MPI-based, High-performance computing, Grids
– Typically run on supercomputers (e.g., NCSA Blue Waters)

• Data-Intensive
– Typically store data at datacenters
– Use compute nodes nearby
– Compute nodes run computation services

• In data-intensive computing, the focus shifts from computation to the data: problem areas include
– Storage
– Communication bottleneck
– Moving tasks to data (rather than vice-versa)
– Security
– Availability of Data
– Scalability

39

Distributed Clouds

• A single-site cloud consists of
– Compute nodes (split into racks)
– Switches, connecting the racks
– Storage (backend) nodes connected to the network
– Front-end for submitting jobs
– Services: physical resource set, software services

• A geographically distributed cloud consists of
– Multiple such sites
– Each site perhaps with a different structure and services

40

[Figure: Cirrus Cloud at University of Illinois. Components shown: four Internal Switches, each with 32 DL160 nodes; two ProCurve Switches; four Storage Nodes; one Head Node. Link labels: 8 ports, 2 ports. Only the internal switches used for data transfers are shown (1GbE with 48 ports).]

Note: System management, monitoring, and operator console will use a different set of switches not pictured here.

41

Example: Cirrus Cloud at U. Illinois

• 128 servers. Each has
– 8 cores (total 1024 cores)
– 16 GB RAM
– 2 TB disk

• Backing store of about 250 TB

• Total storage: 0.5 PB

• Gigabit Networking
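The totals on this slide follow from the per-server figures. A quick arithmetic check, as a sketch, assuming the "total storage" figure is simply local disks plus backing store (the slide does not say how it was computed):

```python
servers = 128
cores_per_server = 8
disk_tb_per_server = 2
backing_store_tb = 250  # "about 250 TB" on the slide

total_cores = servers * cores_per_server
local_disk_tb = servers * disk_tb_per_server
# Assumption: total storage = local disks + backing store.
total_storage_tb = local_disk_tb + backing_store_tb

print(total_cores)       # 1024
print(local_disk_tb)     # 256
print(total_storage_tb)  # 506, i.e. roughly 0.5 PB
```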

42

6 Diverse Sites within Cirrus

I. UIUC – Systems Research for Cloud Computing + Cloud Computing Applications

II. Karlsruhe Institute of Tech (KIT, Germany): Grid-style jobs

III. IDA, Singapore

IV. Intel

V. HP

VI. Yahoo!: CMU’s M45 cluster

All will be networked together: see http://www.cloudtestbed.org

43

What “Services”?

Different Clouds Export different services

• Industrial Clouds
– Amazon S3 (Simple Storage Service): store arbitrary datasets
– Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary images
– Google AppEngine: develop applications within their appengine framework, upload data that will be imported into their format, and run

• Academic Clouds
– Google-IBM Cloud (U. Washington): run apps programmed atop Hadoop
– Cirrus cloud: run (i) apps programmed atop Hadoop and Pig, and (ii) systems-level research on this first generation of cloud computing models

44

Software “Services”

• Computational
– MapReduce (Hadoop)
– Pig Latin

• Naming and Management
– Zookeeper
– Tivoli, OpenView

• Storage
– HDFS
– PNUTS

45

Sample Service: MapReduce

• Google uses MapReduce to run 100K jobs per day, processing up to 20 PB of data

• Yahoo! has released open-source software Hadoop that implements MapReduce

• Other companies that have used MapReduce to process their data: A9.com, AOL, Facebook, The New York Times

• Highly-Parallel Data-Processing

46

What is MapReduce?

• Terms are borrowed from Functional Language (e.g., Lisp)

Sum of squares:

• (map square '(1 2 3 4))
– Output: (1 4 9 16)
[processes each record sequentially and independently]

• (reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1) ) )
– Output: 30
[processes set of all records in a batch]
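The same sum-of-squares computation can be written in Python with the built-in `map` and `functools.reduce`. This is a minimal sketch of the functional idea only, not of the distributed system; the helper name `square` is ours.

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

# map: applies square to each record independently.
squares = list(map(square, [1, 2, 3, 4]))

# reduce: folds the whole batch of records into one value.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Python's `reduce` folds from the left, ((1 + 4) + 9) + 16, while the Lisp expansion on the slide nests the other way; because addition is associative and commutative, both give 30.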

47

Map

• Process individual key/value pair to generate intermediate key/value pairs.

Input <filename, file text>:
Welcome Everyone
Hello Everyone

Intermediate output:
Welcome 1
Everyone 1
Hello 1
Everyone 1

48

Reduce

• Processes and merges all intermediate values associated with each given key assigned to it

Input:
Welcome 1
Everyone 1
Hello 1
Everyone 1

Output:
Everyone 2
Hello 1
Welcome 1
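The word-count example from the Map and Reduce slides can be sketched in a single Python process. The function names, the file name `doc.txt`, and the in-memory grouping step are illustrative assumptions; a real MapReduce runtime distributes these steps across machines.

```python
from collections import defaultdict

def map_fn(filename, text):
    """Emit an intermediate (word, 1) pair for every word in the file text."""
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    """Merge all intermediate values for one key into a single count."""
    return (key, sum(values))

def run_word_count(files):
    # Group intermediate values by key (the step between Map and Reduce).
    groups = defaultdict(list)
    for filename, text in files.items():
        for key, value in map_fn(filename, text):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_word_count({"doc.txt": "Welcome Everyone Hello Everyone"})
print(counts)  # {'Everyone': 2, 'Hello': 1, 'Welcome': 1}
```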

49

Some Applications

• Distributed Grep:
– Map - Emits a line if it matches the supplied pattern
– Reduce - Copies the intermediate data to output

• Count of URL access frequency
– Map - Processes web log and outputs <URL, 1>
– Reduce - Emits <URL, total count>

• Reverse Web-Link Graph
– Map - Processes web log and outputs <target, source>
– Reduce - Emits <target, list(source)>
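The three applications above differ only in their Map and Reduce functions. A hedged single-process sketch follows; the runner `run_mapreduce`, the record formats, and the pattern `error` are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

PATTERN = re.compile(r"error")  # hypothetical search pattern

# Distributed grep: map emits matching lines; reduce copies them to output.
def grep_map(record):
    lineno, line = record
    return [(lineno, line)] if PATTERN.search(line) else []

def grep_reduce(lineno, lines):
    return lines[0]

# URL access frequency: map emits <URL, 1>; reduce emits <URL, total count>.
def url_map(log_entry):
    return [(log_entry["url"], 1)]

def url_reduce(url, ones):
    return (url, sum(ones))

# Reverse web-link graph: map emits <target, source>;
# reduce emits <target, list(source)>.
def link_map(link):
    source, target = link
    return [(target, source)]

def link_reduce(target, sources):
    return (target, sorted(sources))

lines = list(enumerate(["ok", "disk error", "ok", "net error"]))
print(run_mapreduce(lines, grep_map, grep_reduce))

log = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
print(run_mapreduce(log, url_map, url_reduce))

links = [("p1", "p2"), ("p3", "p2"), ("p1", "p3")]
print(run_mapreduce(links, link_map, link_reduce))
```

Keying grep output by line number keeps identical matching lines from being merged by the grouping step.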

50

Programming MapReduce

• Externally: For user
1. Write a Map program (short), write a Reduce program (short)
2. Submit job; wait for result
3. Need to know nothing about parallel/distributed programming!

• Internally: For the cloud (and for us distributed systems researchers)
1. Parallelize Map
2. Transfer data from Map to Reduce
3. Parallelize Reduce
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output

51

Inside MapReduce

• For the cloud (and for us distributed systems researchers)

1. Parallelize Map: easy! Each map job is independent of the others.
2. Transfer data from Map to Reduce:
• All Map output records with the same key are assigned to the same Reduce task
• Use a partitioning function (more soon)
3. Parallelize Reduce: easy! Each reduce job is independent of the others.
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
• Map input: from distributed file system
• Map output: to local disk (at Map node); uses local file system
• Reduce input: from (multiple) remote disks; uses local file systems
• Reduce output: to distributed file system

local file system = Linux FS, etc.
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)

52

Internal Workings of MapReduce

53

Flow of Data

• Input slices are typically 16MB to 64MB.
• Map workers use a partitioning function to store intermediate key/value pairs to the local disk.
– e.g., Hash (key) mod R

[Figure labels: Map workers, partitioning, Reduce workers, Output files]
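A partitioning function of the form Hash(key) mod R can be sketched as follows. The value of R and the use of CRC32 as the hash are assumptions for illustration; CRC32 is chosen here only because Python's built-in `hash` for strings varies between processes.

```python
import zlib

R = 4  # number of reduce tasks (assumed value)

def partition(key, num_reducers=R):
    """Hash(key) mod R, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def write_intermediate(pairs, num_reducers=R):
    """Bucket one map worker's output into R local partitions."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("Welcome", 1), ("Everyone", 1), ("Hello", 1), ("Everyone", 1)]
buckets = write_intermediate(pairs)
for r, bucket in enumerate(buckets):
    print(r, bucket)
```

Because the partition depends only on the key, every map worker sends a given key to the same reduce task, which is what lets that task see all values for the key.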

54

Fault Tolerance

• Worker Failure
– Master keeps 3 states for each worker task
• (idle, in-progress, completed)
– Master sends periodic pings to each worker to keep track of it (central failure detector)
• If fail while in-progress, mark the task as idle
• If map workers fail after completed, mark as idle
• Notify the reduce task about the map worker failure

• Master Failure
– Checkpoint
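The master's bookkeeping for worker failures can be sketched as a small state table. The class and method names are hypothetical, and the ping mechanism itself is left out: `worker_failed` stands for whatever the master does once a worker stops answering pings.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

class Master:
    def __init__(self, tasks):
        # tasks: {task_id: "map" or "reduce"}
        self.kind = dict(tasks)
        self.state = {t: State.IDLE for t in tasks}
        self.worker = {t: None for t in tasks}

    def assign(self, task, worker):
        self.state[task] = State.IN_PROGRESS
        self.worker[task] = worker

    def complete(self, task):
        self.state[task] = State.COMPLETED

    def worker_failed(self, worker):
        """Reset tasks lost with the worker; return the tasks to reschedule."""
        reset = []
        for task, owner in self.worker.items():
            if owner != worker:
                continue
            in_progress = self.state[task] is State.IN_PROGRESS
            # Completed map output lives on the failed worker's local disk,
            # so it is lost; completed reduce output is already in the
            # distributed file system and is kept.
            lost_map = (self.state[task] is State.COMPLETED
                        and self.kind[task] == "map")
            if in_progress or lost_map:
                self.state[task] = State.IDLE
                self.worker[task] = None
                reset.append(task)
        return reset

master = Master({"m1": "map", "m2": "map", "r1": "reduce"})
master.assign("m1", "w1"); master.complete("m1")
master.assign("m2", "w1")
master.assign("r1", "w2"); master.complete("r1")
print(sorted(master.worker_failed("w1")))  # ['m1', 'm2']
print(master.worker_failed("w2"))          # []
```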

55

Locality and Backup tasks

• Locality
– Since cloud has hierarchical topology
– GFS stores 3 replicas of each of 64MB chunks
• Maybe on different racks
– Attempt to schedule a map task on a machine that contains a replica of corresponding input data: why?

• Stragglers (slow nodes)
– Due to Bad Disk, Network Bandwidth, CPU, or Memory.
– Perform backup (replicated) execution of straggler task: task done when first replica complete
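A locality-aware placement rule can be sketched as a preference order. The slide only states the first preference (a machine holding a replica); the same-rack fallback and all host and rack names below are assumptions added for illustration.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Prefer an idle worker holding a replica, then one in a replica's rack,
    then any idle worker. Returns None if no worker is idle."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:  # assumed fallback
            return worker
    return idle_workers[0] if idle_workers else None

# Hypothetical cluster: four hosts on two racks.
rack_of = {"h1": "rackA", "h2": "rackA", "h3": "rackB", "h4": "rackB"}
replicas = {"h1"}  # hosts storing the input chunk

print(pick_worker(replicas, ["h3", "h1"], rack_of))  # h1
print(pick_worker(replicas, ["h3", "h2"], rack_of))  # h2
print(pick_worker(replicas, ["h3", "h4"], rack_of))  # h3
```

Reading input from a local replica avoids crossing rack switches, which is the bottleneck the locality bullet is asking about.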

56

Grep

Locality optimization helps:
• 1800 machines read 1 TB at peak ~31 GB/s
• W/out this, rack switches would limit to 10 GB/s

Startup overhead is significant for short jobs

Workload: 10^10 100-byte records, to extract records matching a rare pattern (92K matching records)

Testbed: 1800 servers, each with 4 GB RAM, dual 2 GHz Xeon, two 160 GB IDE disks, and Gigabit Ethernet; about 100 Gbps aggregate bandwidth

57

Sort

[Figure: three runs compared: Normal; No backup tasks; 200 processes killed]

• Backup tasks reduce job completion time a lot!
• System deals well with failures

M = 15000, R = 4000

Workload: 10^10 100-byte records (modeled after TeraSort benchmark)

58

Discussion Points

• Storage: Is the local write-remote read model good for Map output/Reduce input?
– What happens on node failure?

• Entire Reduce phase needs to wait for all Map tasks to finish
– Why? What is the disadvantage?

• What are the other issues related to our challenges:
– Storage
– Communication bottleneck
– Moving tasks to data (rather than vice-versa)
– Security
– Availability of Data
– Scalability
– Locality: within clouds, or across them
– Inter-cloud/multi-cloud computations
– Other Programming Models?
• Based on MapReduce
• Beyond MapReduce-based ones

• Concern: Do clouds run the risk of going the Grid way?

59

P2P and Clouds/Grid

• Opportunity to use p2p design techniques, principles, and algorithms in cloud computing

• Cloud computing vs. Grid computing: what are the differences?

60

Prophecies

In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.

Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Storage & Communicate Application
– [Will this be a reality with the Grid and Clouds?]

Are we there yet?

???

Are we going towards it?

61

Administrative Announcements

Student-led paper presentations (see instructions on website)
• Start from February 12th
• Groups of up to 2 students each class, responsible for a set of 3 “Main Papers” on a topic
– 45 minute presentations (total) followed by discussion
– Set up appointment with me to show slides by 5 pm day prior to presentation

• List of papers is up on the website
• Each of the other students (non-presenters) is expected to read the papers before class and turn in a one to two page review of any two of the main set of papers (summary, comments, criticisms and possible future directions)

62

Announcements (contd.)

• Presentation Deadline: form groups by midnight of January 31 by dropping by my office hours (10.45 am – 12 pm, Tu, Th in 3112 SC)
– Hurry! Some interesting topics are already taken!
– I can help you find partners

• Use course newsgroup for forming groups and discussion: class.cs525

63

Announcements (contd.)

Projects
• Groups of 2 (need not be same as presentation groups)
• We’ll start detailed discussions “soon” (a few classes into the student-led presentations)

• Please turn in filled-out “Student Infosheets” today or next lecture.

64

Next week

• No lecture Tuesday February 3 (no office hours either)

• Thursday (February 5) lecture: read Basic Distributed Computing Concepts papers

65

Backup Slides

66

Example: Regional Atmospheric Modeling System (RAMS), Colorado State U

• Weather Prediction is inaccurate

• Hurricane Georges, 17 days in Sept 1998

68

Next Week Onwards

• Student-led presentations start
– Organization of presentation is up to you
– Suggested: describe background and motivation for the session topic, present an example or two, then get into the paper topics

• Reviews: You have to submit both an email copy (which will appear on the course website) and a hardcopy (on which I will give you feedback). See website for detailed instructions.
– 1-2 pages only, 2 papers only

69

Refinements and Extensions

• Local Execution
– For debugging purposes
– Users have control on specific Map tasks

• Status Information
– Master runs an HTTP server
– Status page shows the status of computation
– Link to output file
– Standard Error list

70

Refinements and Extensions

• Combiner Function
– User defined
– Done within map task.
– Save network bandwidth.

• Skipping Bad records
– Best solution is to debug & fix
• Not always possible ~ third-party source libraries
– On segmentation fault:
• Send UDP packet to master from signal handler
• Include sequence number of record being processed
– If master sees two failures for same record:
• Next worker is told to skip the record
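The bad-record skipping rule can be sketched in one process. Here a Python exception stands in for the segmentation fault and the UDP message, and each loop iteration stands in for a fresh worker re-executing the task; the names `SkipTracker` and `run_map_task` are hypothetical.

```python
from collections import Counter

class SkipTracker:
    """Master-side count of failures reported per record sequence number."""
    def __init__(self, threshold=2):
        self.failures = Counter()
        self.threshold = threshold

    def report_failure(self, seq_no):
        self.failures[seq_no] += 1

    def should_skip(self, seq_no):
        return self.failures[seq_no] >= self.threshold

def run_map_task(records, map_fn, tracker):
    """Re-execute the task until it finishes, skipping records that failed twice."""
    while True:  # each iteration models a new worker attempt
        output, crashed = [], False
        for seq_no, record in enumerate(records):
            if tracker.should_skip(seq_no):
                continue
            try:
                output.append(map_fn(record))
            except Exception:
                tracker.report_failure(seq_no)  # stands in for the UDP packet
                crashed = True
                break
        if not crashed:
            return output

def fragile_map(record):
    if record == "bad":
        raise ValueError("simulated crash")
    return record.upper()

tracker = SkipTracker()
print(run_map_task(["a", "bad", "c"], fragile_map, tracker))  # ['A', 'C']
print(tracker.failures[1])  # 2
```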
