CS 525 Advanced Distributed Systems Spring 09
1
Indranil Gupta (Indy)
Lecture 4: The Grid. Clouds.
January 29, 2009
CS 525 Advanced Distributed Systems
Spring 09
2
Two Questions We’ll Try to Answer
• What is the Grid? Basics, no hype.
• What is its relation to p2p?
3
Example: Regional Atmospheric Modeling System (RAMS), Colorado State U
• Hurricane Georges, 17 days in Sept 1998
– “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
– Used 5 km spacing instead of the usual 10 km
– Ran on 256+ processors
• Can one run such a program without access to a supercomputer?
4
Distributed Computing Resources
Wisconsin
MIT
NCSA
5
An Application Coded by a Physicist
Job 0
Job 1
Job 2
Job 3
Output files of Job 0: input to Job 2
Output files of Job 2: input to Job 3
Jobs 1 and 2 can be concurrent
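
The job dependencies on this slide form a small workflow graph. A minimal Python sketch of how a scheduler might derive an execution order from it follows. The slide only states that Job 2 needs Job 0's output and Job 3 needs Job 2's output; the sketch additionally assumes that Job 1 also reads Job 0's output, which is one way Jobs 1 and 2 could run concurrently. The `dependencies` table and `execution_waves` helper are hypothetical names, not part of Globus or Condor.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs whose output files it needs as input.
dependencies = {
    "Job 0": set(),
    "Job 1": {"Job 0"},  # assumption for illustration; not stated on the slide
    "Job 2": {"Job 0"},  # output files of Job 0 are input to Job 2
    "Job 3": {"Job 2"},  # output files of Job 2 are input to Job 3
}

def execution_waves(deps):
    """Group jobs into waves; jobs in the same wave may run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all jobs whose inputs are available
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave has finished
    return waves

waves = execution_waves(dependencies)
for number, wave in enumerate(waves):
    print(f"wave {number}: {', '.join(wave)}")
```

Under these assumptions, Jobs 1 and 2 land in the same wave, matching the slide's note that they can be concurrent.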
6
An Application Coded by a Physicist
Job 2
Output files of Job 0: input to Job 2
Output files of Job 2: input to Job 3
May take several hours/days
4 stages of a job: Init, Stage in, Execute, Stage out, Publish
Computation Intensive, so Massively Parallel
Several GBs
7
Wisconsin
MIT
NCSA
Job 0
Job 1
Job 2
Job 3
8
Job 0
Job 1
Job 2
Job 3
Wisconsin
MIT
NCSA
Condor Protocol
Globus Protocol
9
Job 0
Job 1
Job 2
Job 3
Wisconsin
MIT
NCSA
Globus Protocol
• Internal structure of different sites invisible to Globus
• External Allocation & Scheduling
• Stage in & Stage out of Files
10
Job 0
Job 3
Wisconsin
Condor Protocol
• Internal Allocation & Scheduling
• Monitoring
• Distribution and Publishing of Files
11
Tiered Architecture (OSI 7 layer-like)
Resource discovery, replication, brokering
High energy Physics apps
Globus, Condor
Workstations, LANs
Opportunity for Crossover ideas from p2p systems
12
The Grid Today
Some are 40Gbps links! (The TeraGrid links)
“A parallel Internet”
13
Globus Alliance
• Alliance involves U. Illinois Chicago, Argonne National Laboratory, USC-ISI, U. Edinburgh, Swedish Center for Parallel Computers
• Activities : research, testbeds, software tools, applications
• Globus Toolkit (latest ver - GT3)
“The Globus Toolkit includes software services and libraries for resource monitoring, discovery, and management, plus security and file management. Its latest version, GT3, is the first full-scale implementation of new Open Grid Services Architecture (OGSA).”
14
More
• Entire community, with multiple conferences, get-togethers (GGF), and projects
• Grid Projects: http://www-fp.mcs.anl.gov/~foster/grid-projects/
• Grid Users:
– Today: Core is the physics community (since the Grid originates from the GriPhyN project)
– Tomorrow: biologists, large-scale computations (nug30 already)?
15
Some Things Grid Researchers Consider Important
• Single sign-on: collective job set should require once-only user authentication
• Mapping to local security mechanisms: some sites use Kerberos, others use Unix
• Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job 1
• Community authorization: e.g., third-party authentication
16
Grid History – 1990’s
• CASA network: linked 4 labs in California and New Mexico
– Paul Messina: Massively parallel and vector supercomputers for computational chemistry, climate modeling, etc.
• Blanca: linked sites in the Midwest
– Charlie Catlett, NCSA: multimedia digital libraries and remote visualization
• More testbeds in Germany & Europe than in the US
• I-way experiment: linked 11 experimental networks
– Tom DeFanti, U. Illinois at Chicago, and Rick Stevens, ANL: for a week in Nov 1995, a national high-speed network infrastructure, with 60 application demonstrations, from distributed computing to virtual reality collaboration.
• I-Soft: secure sign-on, etc.
17
Trends: Technology
• Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu speed: 18 mos
• Then and Now
Bandwidth
– 1985: mostly 56Kbps links nationwide
– 2004: 155 Mbps links widespread
Disk capacity
– Today’s PCs have 100GBs, same as a 1990 supercomputer
18
Trends: Users
• Then and Now
Biologists:
– 1990: were running small single-molecule simulations
– 2004: want to calculate structures of complex macromolecules, want to screen thousands of drug candidates
Physicists:
– 2006: CERN’s Large Hadron Collider produced 10^15 B/year
• Trends in Technology and User Requirements: Independent or Symbiotic?
19
Prophecies
In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.
Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application
– [Will this be a reality with the Grid?]
20
“We must address scale & failure”
“We need infrastructure”
P2P Grid
21
Definitions
Grid
• “Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” (1998)
• “A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS” (2002)
P2P
• “Applications that takes advantage of resources at the edges of the Internet” (2000)
• “Decentralized, self-organizing distributed systems, in which all or most communication is symmetric” (2002)
525: (good legal applications without intellectual fodder)
525: (clever designs without good, legal applications)
23
Grid versus P2P - Pick your favorite
24
Applications
Grid
• Often complex & involving various combinations of
– Data manipulation
– Computation
– Tele-instrumentation
• Wide range of computational models, e.g.
– Embarrassingly ||
– Tightly coupled
– Workflow
• Consequence
– Complexity often inherent in the application itself
P2P
• Some
– File sharing
– Number crunching
– Content distribution
– Measurements
• Legal Applications?
• Consequence
– Low Complexity
26
Scale and Failure
P2P
• V. large numbers of entities
• Moderate activity
– E.g., 1-2 TB in Gnutella (’01)
• Diverse approaches to failure
– Centralized (SETI)
– Decentralized and Self-Stabilizing
FastTrackC 4,277,745
iMesh 1,398,532
eDonkey 500,289
DirectConnect 111,454
Blubster 100,266
FileNavigator 14,400
Ares 7,731
(www.slyck.com, 2/19/’03)
Grid
• Moderate number of entities
– 10s institutions, 1000s users
• Large amounts of activity
– 4.5 TB/day (D0 experiment)
• Approaches to failure reflect assumptions
– E.g., centralized components
28
Services and Infrastructure
Grid
• Standard protocols (Global Grid Forum, etc.)
• De facto standard software (open source Globus Toolkit)
• Shared infrastructure (authentication, discovery, resource access, etc.)
Consequences
• Reusable services
• Large developer & user communities
• Interoperability & code reuse
P2P
• Each application defines & deploys completely independent “infrastructure”
• JXTA, BOINC, XtremWeb?
• Efforts started to define common APIs, albeit with limited scope to date
Consequences
• New (albeit simple) install per application
• Interoperability & code reuse not achieved
30
Coolness Factor
Grid P2P
32
Summary: Grid and P2P
1) Both are concerned with the same general problem
– Resource sharing within virtual communities
2) Both take the same general approach
– Creation of overlays that need not correspond in structure to underlying organizational structures
3) Each has made genuine technical advances, but in complementary directions
– “Grid addresses infrastructure but not yet scale and failure”
– “P2P addresses scale and failure but not yet infrastructure”
4) Complementary strengths and weaknesses => room for collaboration (Ian Foster at UChicago)
33
Crossover Ideas
Some P2P ideas useful in the Grid
– Resource discovery (DHTs), e.g., how do you make “filenames” more expressive, i.e., a computer cluster resource?
– Replication models, for fault-tolerance, security, reliability
– Membership, i.e., which workstations are currently available?
– Churn-Resistance, i.e., users log in and out; problem difficult since a free host gets an entire computation, not just small files
• All above are open research directions, waiting to be explored!
34
Cloud Computing
What’s it all about?
A First Step
35
Life of Ra (a Research Area)
[Figure: popularity of area (vertical axis) vs. time (horizontal axis). Hype (“Wow!”) rises to a first peak, the end of hype (“This is a hot area!”), then drops to a first trough (“I told you so!”). Phases along the time axis: Young (low-hanging fruits), Adolescent (interesting problems), Middle Age (solid base, hybrid algorithms), Old Age (incremental solutions).]
Where is Grid? Where is cloud computing?
36
How do I identify what stage a research area is in?
1. If there are no publications in the research area that are more than 1-2 years old, it is in the “Young” phase.
2. Pick a paper in the last 1 year published in the research area. Read it. If you think that you could have come up with the core idea in that paper (given all the background etc.), then the research area is in its “Young” phase.
3. Find the latest published paper that you think you could have come up with the idea for. If this paper has been cited by one round of papers (but these citing papers themselves have not been cited), then the research area is in the “Adolescent” phase.
4. Do Step 3 above, and if you find that the citing papers themselves have been cited, and so on, then the research area is at least in the “Middle Age” phase.
5. Pick a paper in the last 1-2 years. If you find that there are only incremental developments in these latest published papers, and the ideas may be innovative but are not yielding large enough performance benefits, then the area is mature.
6. If no one works in the research area, or everyone you talk to thinks negatively about the area (except perhaps the inventors of the area), then the area is dead.
37
What is a cloud?
• It’s a cluster! It’s a supercomputer! It’s a datastore!
• It’s superman!
• None of the above
• Cloud = Lots of storage + compute cycles nearby
38
Data-intensive Computing
• Computation-Intensive Computing
– Example areas: MPI-based, High-performance computing, Grids
– Typically run on supercomputers (e.g., NCSA Blue Waters)
• Data-Intensive
– Typically store data at datacenters
– Use compute nodes nearby
– Compute nodes run computation services
• In data-intensive computing, the focus shifts from computation to the data: problem areas include
– Storage
– Communication bottleneck
– Moving tasks to data (rather than vice-versa)
– Security
– Availability of Data
– Scalability
39
Distributed Clouds
• A single-site cloud consists of
– Compute nodes (split into racks)
– Switches, connecting the racks
– Storage (backend) nodes connected to the network
– Front-end for submitting jobs
– Services: physical resource set, software services
• A geographically distributed cloud consists of
– Multiple such sites
– Each site perhaps with a different structure and services
40
[Figure: Cirrus Cloud at University of Illinois. Four groups of 32 DL160 nodes, each group behind an internal switch; two Procurve switches (links labeled “8 ports” and “2 ports”); four storage nodes; one head node. Only internal switches used for data transfers are shown (1GbE with 48 ports).]
Note: System management, monitoring, and operator console will use a different set of switches not pictured here.
41
Example: Cirrus Cloud at U. Illinois
• 128 servers. Each has
– 8 cores (total 1024 cores)
– 16 GB RAM
– 2 TB disk
• Backing store of about 250 TB
• Total storage: 0.5 PB
• Gigabit Networking
42
6 Diverse Sites within Cirrus
I. UIUC – Systems Research for Cloud Computing + Cloud Computing Applications
II. Karlsruhe Institute of Tech (KIT, Germany): Grid-style jobs
III. IDA, Singapore
IV. Intel
V. HP
VI. Yahoo!: CMU’s M45 cluster
All will be networked together: see http://www.cloudtestbed.org
43
What “Services”?
Different Clouds Export different services
• Industrial Clouds
– Amazon S3 (Simple Storage Service): store arbitrary datasets
– Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary images
– Google AppEngine: develop applications within their appengine framework, upload data that will be imported into their format, and run
• Academic Clouds
– Google-IBM Cloud (U. Washington): run apps programmed atop Hadoop
– Cirrus cloud: run (i) apps programmed atop Hadoop and Pig, and (ii) systems-level research on this first generation of cloud computing models
44
Software “Services”
• Computational
– MapReduce (Hadoop)
– Pig Latin
• Naming and Management
– Zookeeper
– Tivoli, OpenView
• Storage
– HDFS
– PNUTS
45
Sample Service: MapReduce
• Google uses MapReduce to run 100K jobs per day, processing up to 20 PB of data
• Yahoo! has released open-source software Hadoop that implements MapReduce
• Other companies that have used MapReduce to process their data: A9.com, AOL, Facebook, The New York Times
• Highly-Parallel Data-Processing
46
What is MapReduce?
• Terms are borrowed from Functional Language (e.g., Lisp)
Sum of squares:
• (map square '(1 2 3 4))
– Output: (1 4 9 16)
[processes each record sequentially and independently]
• (reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1) ) )
– Output: 30
[processes set of all records in a batch]
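
The same sum of squares can be written with Python's built-in `map` and `functools.reduce`, as a rough analogue of the Lisp forms on this slide. The helper name `square` is only for illustration.

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

# map: each record is processed independently of the others
squares = list(map(square, [1, 2, 3, 4]))

# reduce: all records are folded into one result as a batch
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```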
47
Map
• Process individual key/value pair to generate intermediate key/value pairs.
Input <filename, file text>:
Welcome Everyone
Hello Everyone
Intermediate key/value pairs:
Welcome 1
Everyone 1
Hello 1
Everyone 1
48
Reduce
• Processes and merges all intermediate values associated with each given key assigned to it
Intermediate key/value pairs:
Welcome 1
Everyone 1
Hello 1
Everyone 1
Output:
Everyone 2
Hello 1
Welcome 1
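
The word-count example from the Map and Reduce slides can be sketched in a single Python process. This is only an illustration of the data flow, not how Hadoop or Google's MapReduce is programmed; `map_fn`, `reduce_fn`, `run_job`, and the filenames `doc1`/`doc2` are hypothetical.

```python
from collections import defaultdict

def map_fn(filename, text):
    """Map: emit an intermediate <word, 1> pair for every word in the file text."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)

def run_job(inputs):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for filename, text in inputs:
        for key, value in map_fn(filename, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

result = run_job([("doc1", "Welcome Everyone"), ("doc2", "Hello Everyone")])
print(result)  # {'Everyone': 2, 'Hello': 1, 'Welcome': 1}
```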
49
Some Applications
• Distributed Grep:
– Map - Emits a line if it matches the supplied pattern
– Reduce - Copies the intermediate data to output
• Count of URL access frequency
– Map - Processes web log and outputs <URL, 1>
– Reduce - Emits <URL, total count>
• Reverse Web-Link Graph
– Map - Processes web log and outputs <target, source>
– Reduce - Emits <target, list(source)>
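
The three applications above differ only in their Map and Reduce functions. A minimal single-process Python sketch follows; the record formats (a "<client> <URL>" log line, a (source, target) link pair, a (line number, line) pair) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def run(map_fn, reduce_fn, records):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Distributed grep: map emits matching lines, reduce copies them through.
def make_grep_map(pattern):
    def grep_map(record):
        line_no, line = record
        if pattern in line:
            yield line_no, line
    return grep_map

def identity_reduce(key, values):
    return key, values[0]

# URL access frequency: map emits <URL, 1>, reduce emits <URL, total count>.
def url_count_map(log_line):
    _client, url = log_line.split()  # assumed log format: "<client> <URL>"
    yield url, 1

def url_count_reduce(url, ones):
    return url, sum(ones)

# Reverse web-link graph: map emits <target, source>, reduce emits <target, list(source)>.
def reverse_links_map(link):
    source, target = link
    yield target, source

def reverse_links_reduce(target, sources):
    return target, sorted(sources)

matches = run(make_grep_map("error"), identity_reduce,
              [(1, "error: disk"), (2, "ok"), (3, "error: net")])
counts = run(url_count_map, url_count_reduce, ["c1 /a", "c2 /b", "c3 /a"])
incoming = run(reverse_links_map, reverse_links_reduce,
               [("p1", "p3"), ("p2", "p3"), ("p1", "p2")])
```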
50
Programming MapReduce
• Externally: For user
1. Write a Map program (short), write a Reduce program (short)
2. Submit job; wait for result
3. Need to know nothing about parallel/distributed programming!
• Internally: For the cloud (and for us distributed systems researchers)
1. Parallelize Map
2. Transfer data from Map to Reduce
3. Parallelize Reduce
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
51
Inside MapReduce
• For the cloud (and for us distributed systems researchers)
1. Parallelize Map: easy! each map job is independent of the other!
2. Transfer data from Map to Reduce:
• All Map output records with same key assigned to same Reduce task
• use partitioning function (more soon)
3. Parallelize Reduce: easy! each reduce job is independent of the other!
4. Implement Storage for Map input, Map output, Reduce input, and Reduce output
• Map input: from distributed file system
• Map output: to local disk (at Map node); uses local file system
• Reduce input: from (multiple) remote disks; uses local file systems
• Reduce output: to distributed file system
local file system = Linux FS, etc.
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)
52
Internal Workings of MapReduce
53
Flow of Data
• Input slices are typically 16MB to 64MB.
• Map workers use a partitioning function to store intermediate key/value pairs to the local disk.
– e.g., Hash (key) mod R
[Figure: map workers, partitioning, reduce workers, output files]
Fault Tolerance
• Worker Failure
– Master keeps 3 states for each worker task: (idle, in-progress, completed)
– Master sends periodic pings to each worker to keep track of it (central failure detector)
  • If fail while in-progress, mark the task as idle
  • If map workers fail after completed, mark as idle
  • Notify the reduce task about the map worker failure
• Master Failure
– Checkpoint
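
The worker-failure rules above can be sketched as a small state tracker in Python. The `Master`, `Task`, and `State` names are hypothetical, and pinging is left out; the sketch only shows which tasks go back to idle when a worker is declared failed. Completed map tasks are reset because their output sits on the failed worker's local disk, while completed reduce tasks are left alone because their output is already in the distributed file system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker currently or last assigned

class Master:
    def __init__(self, tasks):
        self.tasks = tasks         # task name -> Task

    def assign(self, name, worker):
        self.tasks[name].state = State.IN_PROGRESS
        self.tasks[name].worker = worker

    def complete(self, name):
        self.tasks[name].state = State.COMPLETED

    def worker_failed(self, worker):
        """Return the names of tasks that must be rescheduled after a failure."""
        rescheduled = []
        for name, task in self.tasks.items():
            if task.worker != worker:
                continue
            lost_in_progress = task.state is State.IN_PROGRESS
            lost_map_output = task.kind == "map" and task.state is State.COMPLETED
            if lost_in_progress or lost_map_output:
                task.state = State.IDLE
                task.worker = None
                rescheduled.append(name)
        return rescheduled

master = Master({"m1": Task("map"), "m2": Task("map"),
                 "r1": Task("reduce"), "r2": Task("reduce")})
master.assign("m1", "w1"); master.complete("m1")
master.assign("r1", "w1"); master.complete("r1")
master.assign("r2", "w1")
master.assign("m2", "w2")
rescheduled = master.worker_failed("w1")
```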
55
Locality and Backup tasks
• Locality
– Since cloud has hierarchical topology
– GFS stores 3 replicas of each of 64MB chunks
  • Maybe on different racks
– Attempt to schedule a map task on a machine that contains a replica of corresponding input data: why?
• Stragglers (slow nodes)
– Due to Bad Disk, Network Bandwidth, CPU, or Memory.
– Perform backup (replicated) execution of straggler task: task done when first replica complete
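
The effect of backup tasks can be illustrated with a toy Python calculation. The finish times below are made-up numbers, and the function names are hypothetical; the only rules taken from the slide are that a task is done when its first replica completes, and (implicitly) that a job is done when its slowest task is done.

```python
def task_finish_time(primary_finish, backup_finish=None):
    """A task is done when the first of its replicas completes."""
    if backup_finish is None:
        return primary_finish
    return min(primary_finish, backup_finish)

def job_finish_time(primary, backups=None):
    """A job is done when its slowest task is done."""
    backups = backups or {}
    return max(task_finish_time(finish, backups.get(name))
               for name, finish in primary.items())

# Hypothetical finish times in seconds; map-2 is a straggler.
primary = {"map-0": 10, "map-1": 12, "map-2": 95}
# A backup copy of map-2, launched late on a healthy node, finishes at t = 41.
backups = {"map-2": 41}

without_backup = job_finish_time(primary)
with_backup = job_finish_time(primary, backups)
print(without_backup, with_backup)  # 95 41
```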
56
Grep
Workload: 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
Testbed: 1800 servers each with 4GB RAM, dual 2GHz Xeon, dual 160 GB IDE disks, 100 Gbps, Gigabit Ethernet per machine
Locality optimization helps:
• 1800 machines read 1 TB at peak ~31 GB/s
• W/out this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
57
Sort
Workload: 10^10 100-byte records (modeled after TeraSort benchmark)
M = 15000 R = 4000
[Figure: three panels: Normal; No backup tasks; 200 processes killed]
• Backup tasks reduce job completion time a lot!
• System deals well with failures
58
Discussion Points
• Storage: Is the local write-remote read model good for Map output/Reduce input?
– What happens on node failure?
• Entire Reduce phase needs to wait for all Map tasks to finish
– Why? What is the disadvantage?
• What are the other issues related to our challenges:
– Storage
– Communication bottleneck
– Moving tasks to data (rather than vice-versa)
– Security
– Availability of Data
– Scalability
– Locality: within clouds, or across them
– Inter-cloud/multi-cloud computations
– Other Programming Models?
  • Based on MapReduce
  • Beyond MapReduce-based ones
• Concern: Do clouds run the risk of going the Grid way?
59
P2P and Clouds/Grid
• Opportunity to use p2p design techniques, principles, and algorithms in cloud computing
• Cloud computing vs. Grid computing: what are the differences?
60
Prophecies
In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.
Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Storage & Communicate Application
– [Will this be a reality with the Grid and Clouds?]
Are we there yet?
???
Are we going towards it?
61
Administrative Announcements
Student-led paper presentations (see instructions on website)
• Start from February 12th
• Groups of up to 2 students each class, responsible for a set of 3 “Main Papers” on a topic
– 45 minute presentations (total) followed by discussion
– Set up appointment with me to show slides by 5 pm day prior to presentation
• List of papers is up on the website
• Each of the other students (non-presenters) expected to read the papers before class and turn in a one to two page review of any two of the main set of papers (summary, comments, criticisms and possible future directions)
62
Announcements (contd.)
• Presentation Deadline: form groups by midnight of January 31 by dropping by my office hours (10.45 am – 12 pm, Tu, Th in 3112 SC)
– Hurry! Some interesting topics are already taken!
– I can help you find partners
• Use course newsgroup for forming groups and discussion: class.cs525
63
Announcements (contd.)
Projects
• Groups of 2 (need not be same as presentation groups)
• We’ll start detailed discussions “soon” (a few classes into the student-led presentations)
• Please turn in filled-out “Student Infosheets” today or next lecture.
64
Next week
• No lecture Tuesday February 3 (no office hours either)
• Thursday (February 5) lecture: read Basic Distributed Computing Concepts papers
65
Backup Slides
66
Example: Regional Atmospheric Modeling System (RAMS), Colorado State U
• Weather Prediction is inaccurate
• Hurricane Georges, 17 days in Sept 1998
67
68
Next Week Onwards
• Student led presentations start
– Organization of presentation is up to you
– Suggested: describe background and motivation for the session topic, present an example or two, then get into the paper topics
• Reviews: You have to submit both an email copy (which will appear on the course website) and a hardcopy (on which I will give you feedback). See website for detailed instructions.
– 1-2 pages only, 2 papers only
69
Refinements and Extensions
• Local Execution
– For debugging purpose
– Users have control on specific Map tasks
• Status Information
– Master runs an HTTP server
– Status page shows the status of computation
– Link to output file
– Standard Error list
70
Refinements and Extensions
• Combiner Function
– User defined
– Done within map task.
– Save network bandwidth.
• Skipping Bad records
– Best solution is to debug & fix
  • Not always possible ~ third-party source libraries
– On segmentation fault:
  • Send UDP packet to master from signal handler
  • Include sequence number of record being processed
– If master sees two failures for same record:
  • Next worker is told to skip the record
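
The bad-record skipping rule can be sketched in Python as follows. This is a simulation under stated assumptions: a raised exception stands in for the segmentation fault, a direct method call stands in for the UDP packet sent from the signal handler, and `SkipTracker` and `run_map_attempt` are hypothetical names.

```python
from collections import Counter

class SkipTracker:
    """Master-side bookkeeping: count failures per record sequence number."""
    def __init__(self, threshold=2):
        self.failures = Counter()
        self.threshold = threshold

    def report_failure(self, seq):
        self.failures[seq] += 1

    def should_skip(self, seq):
        return self.failures[seq] >= self.threshold

def run_map_attempt(records, map_fn, tracker):
    """One attempt at a map task; returns its output, or None if the attempt crashed."""
    output = []
    for seq, record in enumerate(records):
        if tracker.should_skip(seq):
            continue                     # master told this worker to skip the record
        try:
            output.append(map_fn(record))
        except Exception:
            tracker.report_failure(seq)  # report the sequence number of the bad record
            return None                  # the whole attempt is lost
    return output

tracker = SkipTracker()
records = ["1", "2", "bad", "4"]
attempts = [run_map_attempt(records, int, tracker) for _ in range(3)]
print(attempts)  # [None, None, [1, 2, 4]]
```

The first two attempts crash on the same record; the third skips it and completes.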