Cloud Computing
CS 188 Distributed Systems
Lecture 18, Winter 2015
March 12, 2015

Introduction

• What is cloud computing?

• Cloud computing and distributed systems

• Important cloud computing tools

– MapReduce

What Is Cloud Computing?

• Essentially, moving your computing into a vague “network cloud”

• Not just putting your network transmissions there

• But putting storage, computation, and other services into “the cloud”

• Offloading complicated management issues to someone else

What Is It Really?

• Somebody buys and runs a vast number of machines

• They offer to rent use of their machines to essentially anyone

• One or more of the machines host each client’s computations

• Possibly also providing stable storage

A Cloud Computing Facility

• A huge farm of machines

• With a high speed interconnect

• And special software to help manage the machines

• Clients’ jobs are placed on some subset of the machines

The Cloud Computing Concept

• “I’ve got a big data job to get done”

• “I need to run a small web server”

• “I need nodes for a scientific computation”

Cloud Storage Systems

• Systems that specialize in storing data for large numbers of users

• Often don’t provide compute services

– Just storage services

• They take care of backup, ensuring accessibility, etc.

• Sometimes consumer oriented

• Sometimes big data oriented

Implications of Cloud Computing – For Clients

• Computing becomes a commodity

– Spend X dollars, get Y amount of computing

– If you want 2Y computing, spend X more dollars

• No worries about complex issues of managing machines

• No need to sink money into machines you only need some of the time

• But a lot of details are out of your hands

Implications of Cloud Computing – For Providers

• Need to make the most efficient use of your resources possible

– Which requires flexibly moving jobs around over time

– Nodes will be heavily reused

• Must isolate each client from all others

• Must handle all the grubby distributed system details

Who Does Cloud Computing?

• Lots of people

• Companies wanting to run web services

• Parties with large quantities of data to analyze or store

• Those who don’t want to pay for system administrators

Who Provides Cloud Computing?

• Mostly large companies

• Big costs involved in setting up and running a cloud environment

• Huge hardware costs, electric bills, repair and maintenance costs, system and network admin salaries, etc.

• Generally pays off best at high scale

Some Sample Cloud Services

• Amazon Elastic Compute Cloud

• Google Cloud Computing

• Microsoft Cloud

• Apple iCloud

– Primarily cloud storage

• Dropbox

– Also primarily cloud storage

Cloud Computing and Distributed Systems

• Cloud computing relates to distributed systems in two ways

1. Cloud users often run distributed systems

– Since it’s most beneficial when your job needs many resources

2. Cloud computing facilities are inherently distributed systems

– Of a specialized type requiring special kinds of control software

Running Distributed Systems In the Cloud

• What if your job is large and won’t fit on one computer?

• Well, design it as a distributed system

• Contract with the cloud provider to rent the right number of nodes

• Configure those nodes as the distributed system you want

Advantages of This Approach

• Beyond basic advantages of the cloud

• A relatively friendly distributed environment

• Short, predictable delays

• Generally homogeneous hardware

• Nice recovery from failures

• Easy expansion

An Example of Advantages

• “I need to run a small web server”

• “I also need to run a backend database server”

• “Business is good, so now I need a second web server.”

• “Business is very good, so now I need more web servers.”

• “And a load balancer and a special firewall machine and . . .”

What Does This Look Like to the Client?

[Figure: the client’s view of the system as it grows. Labels: web server; back end database server; second web server; firewall; load balancer; web servers]

How Does The Client Connect His Cloud Distributed System?

• Cloud offers tools for specifying connectivity

• Client indicates what connects to what

• Cloud uses various virtual networking software to arrange those connections

• Flexible and easy to change

The Cloud As A Distributed System

• The cloud environment is a collection of computers

• Connected by a local area network

• Which must be flexibly set up in many different ways

• Requires treating the environment as a distributed system

Challenges for the Cloud Distributed System

• Sharing resources

– How to make sure a shared network is properly used by all

• Enforcing topologies

• Flexible remapping of services to different nodes

• Security issues

Cloud Computing and Virtual Machines

• Sometimes a client doesn’t need many resources

– Perhaps fewer than on one machine

• Wasteful to give him a whole machine

• Why not give him a virtual machine hosted on a real machine?

• Possibly shared with others

For Example

[Figure: “Instead of this” versus “Do this”. Note the sharing of physical machines in the second arrangement.]
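One way to picture hosting several clients' virtual machines on shared physical machines is as a packing problem. The sketch below uses a simple first-fit rule, which is an assumption made for illustration; the lecture does not say how a provider chooses placements. The function name `place_vms` and the 8-core machine size are hypothetical.

```python
def place_vms(vm_cores, cores_per_machine):
    """First-fit placement: put each VM on the first machine with room."""
    machines = []  # each entry lists the core demands of the VMs on one machine
    for cores in vm_cores:
        if cores > cores_per_machine:
            raise ValueError("VM is larger than a physical machine")
        for hosted in machines:
            if sum(hosted) + cores <= cores_per_machine:
                hosted.append(cores)  # share this physical machine
                break
        else:
            machines.append([cores])  # no machine had room, so start a new one
    return machines


# Five clients with small demands, hosted on hypothetical 8-core machines.
placement = place_vms([2, 4, 1, 6, 3], cores_per_machine=8)
print(placement)
```

With these made-up demands, five clients fit on three physical machines instead of five.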

Cloud Computing and Virtual Machines

• Achieving this effect requires supporting virtual machines

• Generally a good thing for cloud computing

• Also provides security advantages

• Often clouds treat all client machines as virtual

– Even if they have all of a physical machine’s resources

Failures in Cloud Computing

• When you have a lot of nodes, you’ll have a lot of failures

• A failed node will often belong to a client

• Under many circumstances, you can simply give him another node

– Assuming you can recover state

– Sometimes done by saving VM state

– Sometimes handled by a cloud computing tool

Security In Cloud Computing

• Everyone shares the same network links

• Sometimes multiple virtual machines share one physical machine

• Different clients live on the same physical machine as time passes

• Must provide each client a totally clean and safe environment

– One client shouldn’t be able to affect any other client

Cloud Computing Tools

• Clients can do anything they want on cloud machines, usually

• But there are classes of things many clients need

• Cloud environments try to provide libraries or other tools to do them

MapReduce

• Perhaps the most common cloud computing software tool/technique

• A method of dividing large problems into compartmentalized pieces

• Each of which can be performed on a separate node

• With an eventual combined set of results

The Origin of MapReduce

• Built by Google

• In response to their internal needs

– They did lots of parallel-ish processing on lots of data

• Observed common characteristics of many of their tasks

• Built a framework to handle all of them

The Idea Behind MapReduce

• There is a single function you want to perform on a lot of data

– Such as searching it for a string

• Divide the data into disjoint pieces

• Perform the function on each piece on a separate node (map)

• Combine the results to obtain output (reduce)
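The three steps above can be sketched in a few lines. This is a minimal single-process illustration of the string search case, not the MapReduce library itself: the function names and the sample log lines are made up, and the "separate nodes" are simulated by a loop.

```python
def map_count_matches(piece, needle):
    """Map: count the lines in one piece that contain the search string."""
    return sum(1 for line in piece if needle in line)


def reduce_sum(partial_counts):
    """Reduce: combine the per-piece counts into one answer."""
    return sum(partial_counts)


lines = ["error: disk full", "ok", "error: timeout", "ok", "ok", "error: disk full"]
num_pieces = 3

# Divide the data into disjoint pieces (every third line goes to the same piece).
pieces = [lines[i::num_pieces] for i in range(num_pieces)]

# On a real cluster each call would run on a separate node; here it is a loop.
partials = [map_count_matches(piece, "error") for piece in pieces]

total = reduce_sum(partials)
print(partials, total)
```

Because the pieces are disjoint and the counts simply add, the pieces can be processed in any order and on any node without changing the total.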

An Example

• We have 64 megabytes of text data

• Count how many times each word occurs in the text

• Divide it into 4 chunks of 16 megabytes each

• Assign each chunk to one processor

• Perform the map function of “count words” on each
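A small sketch of this map step follows. It assumes the text is split on word boundaries so that no word is cut in half, which the slide does not specify, and it uses a short sample string in place of 64 megabytes of text. The helper names are hypothetical.

```python
from collections import Counter


def split_into_chunks(words, num_chunks):
    """Split a word list into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(words), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks


def map_count_words(chunk):
    """Map function: count how often each word occurs in one chunk."""
    return Counter(chunk)


text = "foo bar baz foo zoo yes too yes bar foo"
chunks = split_into_chunks(text.split(), 4)

# One Counter per processor; each processor sees only its own chunk.
map_outputs = [map_count_words(chunk) for chunk in chunks]
for output in map_outputs:
    print(dict(output))
```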

The Example Continued

Map node   Foo  Bar  Baz  Zoo  Yes  Too
1            1    4    3    6   12    5
2            7    3    9    1   17    8
3            2    6    2    2   10    4
4            4    7    5    9    3    7

That’s the map stage

On To Reduce

• We might have two more nodes assigned to doing the reduce operation

• They will each receive a share of data from a map node

• The reduce node performs a reduce operation to “combine” the shares

• Outputting its own result

Continuing the Example

Foo 1, Bar 4, Baz 3     Zoo 6, Yes 12, Too 5

Foo 7, Bar 3, Baz 9     Zoo 1, Yes 17, Too 8

Foo 2, Bar 6, Baz 2     Zoo 2, Yes 10, Too 4

Foo 4, Bar 7, Baz 5     Zoo 9, Yes 3, Too 7

The Reduce Nodes Do Their Job

Foo 14, Bar 20, Baz 19

Zoo 18, Yes 42, Too 24

And MapReduce is done! Write out the results to files
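The reduce step for this example can be sketched as below. The per-node counts are copied from the map outputs on the earlier slides; the split of words between the two reduce nodes follows the grouping shown there. The function name `reduce_counts` is hypothetical, and a real reduce node would read its shares from files rather than from an in-memory list.

```python
from collections import Counter

# Map outputs copied from the slides: one dict per map node.
map_outputs = [
    {"Foo": 1, "Bar": 4, "Baz": 3, "Zoo": 6, "Yes": 12, "Too": 5},
    {"Foo": 7, "Bar": 3, "Baz": 9, "Zoo": 1, "Yes": 17, "Too": 8},
    {"Foo": 2, "Bar": 6, "Baz": 2, "Zoo": 2, "Yes": 10, "Too": 4},
    {"Foo": 4, "Bar": 7, "Baz": 5, "Zoo": 9, "Yes": 3, "Too": 7},
]

# The words each of the two reduce nodes is responsible for.
reduce_shares = [("Foo", "Bar", "Baz"), ("Zoo", "Yes", "Too")]


def reduce_counts(words, outputs):
    """Reduce: add up the counts for this node's words across all map outputs."""
    totals = Counter()
    for output in outputs:
        for word in words:
            totals[word] += output.get(word, 0)
    return dict(totals)


results = [reduce_counts(words, map_outputs) for words in reduce_shares]
for result in results:
    print(result)
```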

But I Wanted A Combined List

• No problem

• Run another (slightly different) MapReduce on the outputs

• Have one reduce node that combines everything

Synchronization in MapReduce

• Each map node produces an output file for each reduce node

• It is produced atomically

• The reduce node can’t work on this data until the whole file is written

• Forcing a synchronization point between the map and reduce phases

Why can’t the reduce nodes start working on data as it’s produced?
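The slide does not say how a map node makes its output file appear atomically. One common technique on POSIX-style file systems, shown here purely as an assumption, is to write to a temporary file in the same directory and then rename it into place. The function name and file names are hypothetical.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write lines to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # A reader sees either no file at `path` or the complete file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a partial file behind
        raise


# Example: a map node writing its share for one reduce node.
with tempfile.TemporaryDirectory() as out_dir:
    share = os.path.join(out_dir, "map-0-reduce-1.txt")
    write_atomically(share, ["Zoo 6", "Yes 12", "Too 5"])
    with open(share) as handle:
        written = handle.read()
    leftover = os.listdir(out_dir)
print(written)
```

A reduce node that only opens files under their final names never observes a half-written share, which is what creates the synchronization point between the two phases.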

Controlling the Synchronization

• One node in the computation is the master

• It assigns input pieces to the map nodes

• And indicates which outputs go to which reduce nodes

• Also keeps track of the health of participant nodes
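The master's first two duties can be sketched as follows. Round-robin assignment and hash-based routing of keys are assumptions chosen for illustration; the lecture only says that the master makes these decisions, not how. Health tracking is left out, and all names are hypothetical.

```python
import zlib


def assign_pieces(num_pieces, map_nodes):
    """Hand input pieces to map nodes round-robin."""
    assignment = {node: [] for node in map_nodes}
    for piece in range(num_pieces):
        assignment[map_nodes[piece % len(map_nodes)]].append(piece)
    return assignment


def reduce_node_for(key, num_reduce_nodes):
    """Decide which reduce node receives a given key."""
    # crc32 is deterministic, so every map node routes a key the same way.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_nodes


assignment = assign_pieces(8, ["map-a", "map-b", "map-c", "map-d"])
routing = {word: reduce_node_for(word, 2) for word in ["Foo", "Bar", "Zoo"]}
print(assignment)
print(routing)
```

The important property is that every occurrence of a key ends up at the same reduce node, no matter which map node produced it.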

Handling Failures in MapReduce

• Relatively simple

• If a map node fails, redo its work on another node

• Reduce nodes will need to wait for the new node’s results

• But result is correct
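A toy version of "redo its work on another node" is sketched below. The cluster is simulated: `simulated_cluster` and the `NodeFailed` exception are hypothetical stand-ins for real remote execution and failure detection, and the map function is simply the length of the piece.

```python
class NodeFailed(Exception):
    """Raised by the simulated cluster when a node dies during a task."""


def run_map_phase(pieces, nodes, run_on_node):
    """Run every piece, redoing the work elsewhere if its node fails."""
    results = {}
    for index, piece in enumerate(pieces):
        first = index % len(nodes)
        # Try the preferred node first, then the remaining nodes in order.
        for node in nodes[first:] + nodes[:first]:
            try:
                results[index] = (node, run_on_node(node, piece))
                break
            except NodeFailed:
                continue  # discard the partial work and redo it on another node
        else:
            raise RuntimeError(f"no node could process piece {index}")
    return results


def simulated_cluster(node, piece):
    """Pretend node-1 has crashed; other nodes run the map function."""
    if node == "node-1":
        raise NodeFailed(node)
    return len(piece)


pieces = [["a", "b"], ["c"], ["d", "e", "f"]]
results = run_map_phase(pieces, ["node-0", "node-1", "node-2"], simulated_cluster)
print(results)
```

The output for each piece is the same as it would have been without the failure; only the node that produced it differs.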

What If a Reduce Node Fails?

• Pretty much the same answer

• Choose a node to replace it

• Send that node the reduce files from the failed node

• Reduce results are also written atomically

– So not necessary if it failed after finishing everything

MapReduce and Determinism

• Most MapReduce applications are deterministic

• So restarting a computation a second time produces the same results

• Not an absolute requirement

• But without determinism, the possible results don’t follow such clean semantics

MapReduce and Load Balancing

• Depending on map function and data, some data may take longer to process

• Leading to possibility of poor assignments of work to nodes

• In turn, leading to longer run times

• Handled by dividing inputs into lots of pieces (~100 per worker machine)

How Does That Help?

• First, better load balancing

• Second, if map node fails after completing a piece, no need to restart it

• Just assign the incomplete pieces to another node

• Also, if load balance is poor anyway, idle nodes can take on extra pieces
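The load balancing benefit can be seen in a small simulation. The numbers are invented for illustration: 400 units of input, of which the first 100 take four times as long to process, spread over 4 workers. Idle workers are assumed to pull the next unassigned piece, which is one plausible reading of the slide rather than a description of the actual scheduler.

```python
import heapq


def makespan(piece_costs, num_workers):
    """Finish time when each idle worker pulls the next unassigned piece."""
    free_at = [0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in piece_costs:
        start = heapq.heappop(free_at)  # the earliest idle worker takes the piece
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def split_costs(unit_costs, num_pieces):
    """Group per-unit costs into equal-sized pieces (length must divide evenly)."""
    size = len(unit_costs) // num_pieces
    return [sum(unit_costs[i * size:(i + 1) * size]) for i in range(num_pieces)]


# Skewed input: the first quarter of the data is four times slower to process.
unit_costs = [4] * 100 + [1] * 300

coarse = makespan(split_costs(unit_costs, 4), num_workers=4)   # 1 piece per worker
fine = makespan(split_costs(unit_costs, 400), num_workers=4)   # 100 pieces per worker
print(coarse, fine)
```

With one piece per worker, the job takes 400 time units because one worker is stuck with all the slow data. With 100 pieces per worker, it takes 175, which is the total work of 700 divided evenly among the four workers.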

Using MapReduce

• Designed as a library

– In C++

• User defines the map and reduce functions

• Links to the library

• Provides the input and number of nodes

• The library handles the details
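The division of labor between user and library can be illustrated with a toy, single-process stand-in. This is not Google's C++ interface or Hadoop's Java one; `run_mapreduce` and its parameters are hypothetical, and the word count task is reused only because the lecture uses it.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn, num_reduce_nodes):
    """Toy stand-in for the library: runs both phases in one process."""
    # Map phase: each input piece yields (key, value) pairs, which are
    # routed to the share of the reduce node responsible for that key.
    shares = [defaultdict(list) for _ in range(num_reduce_nodes)]
    for piece in inputs:
        for key, value in map_fn(piece):
            node = sum(key.encode("utf-8")) % num_reduce_nodes
            shares[node][key].append(value)
    # Reduce phase: each reduce node combines the values for its keys.
    return [
        {key: reduce_fn(key, values) for key, values in share.items()}
        for share in shares
    ]


# The two functions the user supplies:
def word_pairs(piece):
    return [(word, 1) for word in piece.split()]


def add_up(key, values):
    return sum(values)


outputs = run_mapreduce(["foo bar foo", "bar baz", "foo"], word_pairs, add_up, 2)
print(outputs)
```

The user code is limited to `word_pairs` and `add_up`; splitting, routing, and combining all live in the library function, which is the point of the design.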

MapReduce and Hadoop

• The original MapReduce library was built by Google

• Apache has built an open source version

– In Java

– As part of its Hadoop package

– Which also includes stuff like a distributed file system

• There are other open source versions of MapReduce

Use of MapReduce

• Extremely widespread

• If you can define your task using Map/Reduce, other things become easy

• High quality open source libraries are available

• MapReduce itself handles most tricky issues

• Of course, not everything is Map/Reduce

A Related Issue

• What about those files?

• They’re stored on different machines

• How do we get them from one machine (say a map node) to another (a reduce node)?

• Probably we need a distributed file system . . .

Distributed File Systems for MapReduce

• Google has their own distributed file system

– Not bundled with MapReduce

– But works well with it

• The Hadoop package also has one

– Designed to work well there

• Both particularly intended for cluster or cloud environments

Conclusion

• Cloud computing is an inherently distributed system

• It avoids or hides many messy issues

• Doesn’t solve everyone’s problems, but of very wide utility

• Good cloud service requires handling many tricky distributed systems issues