Cloud Architectures - Jinesh Varia - GrepTheWeb
Transcript of Cloud Architectures - Jinesh Varia - GrepTheWeb
On Cloud Computing…
“We in academia and the government labs have not kept up with the times; universities really need to get on board.”
- Randal E. Bryant, Dean of the Computer Science School at Carnegie Mellon University
Source: http://www.nytimes.com/2007/10/08/technology/08cloud.html
What is Amazon?
Amazon.com and AWS
[Chart: bandwidth consumed by Amazon Web Services vs. bandwidth consumed by Amazon’s global websites, 1996–2008]
AWS Customer Momentum
[Chart: AWS customers from Q1 2006 through Q4 2008, reaching 490,000]
Amazon S3 Momentum
Total objects stored in Amazon S3:
• Q2 2006: 800,000,000
• Q2 2007: 5,000,000,000
• Q3 2007: 10,000,000,000
• Q4 2008: 40,000,000,000
Why Are People So Excited?
Most Companies Worry About This
Turning Your Idea into a Successful Product means undifferentiated “heavy lifting”:
• Power/Cooling
• Hardware Management
• Bandwidth Management
• Contract Negotiations
• Maintenance
• Deployment
• Purchasing Decisions
• Load Balancing/Scaling
• Managing Growth
70/30 Switch: Focus on Innovation
Cloud computing takes on the undifferentiated “heavy lifting”, so the effort goes from Your Idea to a Successful Product.
Amazon Cloud Computing
• Focus On Your Idea
• Spend Cash Wisely
• Get Big Fast
• Pay As You Go
• Simple, Reliable, Fast
• Elastic, Unlimited Capacity
Services: Amazon EC2 (with EBS), Amazon SimpleDB, Amazon S3, Amazon SQS
ANIMOTO.COM
Scale: 50 servers to 5,000 servers in 3 days
[Chart: number of EC2 instances, 4/12/2008–4/20/2008. From a steady state of ~40 instances, the launch of the Facebook modification drove traffic up; Amazon EC2 easily scaled to handle the additional traffic, peaking at 5,000 instances.]
“TimesMachine” from NY Times
Articles from 1851–1922, converted TIFF → PDF
Input: 11 million articles (4 TB of data)
What did he do? 100 EC2 instances for 24 hours, all data on S3
Output: 1.5 TB of data
Tools: Hadoop, iText, JetS3t
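The TimesMachine conversion is an embarrassingly parallel batch job: every article can be converted independently and fanned out across machines. A minimal sketch of that fan-out pattern, with the S3 fetch, iText conversion, and upload stubbed out (`convert_article` and the key layout are illustrative, not the Times’ actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def convert_article(key):
    # Placeholder for the real work: fetch the TIFF scans for one
    # article from S3, stitch them into a PDF (the Times used iText),
    # and upload the result back to S3 (via JetS3t).
    return key.replace(".tif", ".pdf")

def convert_batch(keys, workers=4):
    # Each article is independent, so the whole job parallelizes
    # trivially; on EC2 this fan-out ran across 100 instances.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_article, keys))

keys = [f"articles/{i}.tif" for i in range(10)]
pdfs = convert_batch(keys)
print(pdfs[0])  # articles/0.pdf
```

The same structure scales from a thread pool on one machine to a cluster: only the dispatch mechanism changes, not the per-article work.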
CS290F: Scalable Internet Services
UCSB, Fall 2006
Professor created an app to manage team usage
Ruby on Rails
Complete stack: from load balancer and app server to DB
Learn how to scale: simulated load, generated graphs
All course contents, student assignments, and lessons learned are on the Wiki
CS345a: Data Mining @ Stanford
Tools used:
• Shell/Linux/Java
• Hadoop on EC2
• Data sets on S3: Netflix, Alexa, IR datasets from TREC
Class organization:
• Stanford, Winter 2007
• 30–35 students
• Each team spawns 10–15 Hadoop slave nodes
• TA created Getting-Started AMIs (and scripts)
• TA managed the students’ usage
Bioinformatics @ Northwestern University
• Using Hadoop to perform sequence alignments on large genomic datasets: Northwestern University (Flatow & Lin) presented a talk at the Next-Gen Sequencing Data Analysis meeting
• “An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated.”
Cloud Architectures
[Chart: hardware infrastructure/cost over time vs. job execution time]
Shrink your processing time
[Charts: CPUs vs. time. With elastic capacity, the same job can run on many more CPUs for a much shorter time.]
Main Problems
Technical:
• How to coordinate jobs between machines (distributed processing)?
• What if a machine fails?
• How will I scale out?
Business:
• How do I get management sign-off?
• Resources to manage the infrastructure?
• How do I get rid of the idle infrastructure?
Hadoop
Web Services
Cloud Computing
GrepTheWeb
What’s so cool about GrepTheWeb?
RegEx + WWW
Examples of Patterns
• Source code: int x = 40 + i
• Anything with punctuation: “Hey!” he said, “Are you ok?”
• Case-sensitive function call: OrderController()
• Equations: f(x) = x^2
• Other patterns: (dis)integration of life, email addresses
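Each example above is an ordinary regular expression. A minimal sketch of the kind of matching GrepTheWeb performs over a document (the sample text and the exact patterns are illustrative):

```python
import re

# Illustrative patterns mirroring the slide's examples.
patterns = {
    "source_code": r"int\s+\w+\s*=\s*\d+\s*\+\s*\w+",  # e.g. "int x = 40 + i"
    "equation":    r"\w\(\w\)\s*=\s*\w+\^\d+",         # e.g. "f(x) = x^2"
    "email":       r"[\w.+-]+@[\w-]+\.[\w.]+",         # any email-like address
}

document = "f(x) = x^2 was mailed to ada@example.com; int x = 40 + i"

# For each named pattern, collect every match found in the document.
matches = {name: re.findall(rx, document) for name, rx in patterns.items()}
print(matches["equation"])  # ['f(x) = x^2']
```

GrepTheWeb runs exactly this per-document step, but across billions of crawled documents in parallel.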
Zoom Level 1
[Diagram] The user submits a RegEx to the GrepTheWeb service and polls it via GetStatus. The input dataset is a list of document URLs (from the Alexa crawl); the output is the subset of document URLs that matched the RegEx.
Zoom Level 2
[Diagram] The service is built from Amazon SQS, a Controller, an Amazon EC2 cluster, Amazon S3, and Amazon SimpleDB:
• Amazon SimpleDB: DB holding user info and job status info
• Controller: manages phases; launches, monitors, and shuts down the EC2 cluster
• Amazon S3: holds the input files (Alexa crawl) and the output
• API: StartGrep(RegEx), GetStatus, Get Output
Amazon SQS: distributed transient buffer
• Never lose a message
• Ideal for small, short-lived messages
• Access control
• Message locking
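The message-locking behavior (a consumer hides a message while working on it, and the message reappears if the consumer never acknowledges it) can be sketched with a tiny in-memory queue. This illustrates the semantics only; it is not the SQS API, and the class and timeout names are made up:

```python
import time

class TinyQueue:
    # In-memory sketch of a visibility-timeout queue: receiving a
    # message locks (hides) it; unacknowledged messages reappear.
    def __init__(self, visibility_timeout=1.0):
        self.visible, self.locked = [], {}
        self.timeout = visibility_timeout

    def send(self, body):
        self.visible.append(body)

    def receive(self):
        # Return any expired locked messages to the visible list first.
        now = time.monotonic()
        for body, t in list(self.locked.items()):
            if now - t > self.timeout:
                del self.locked[body]
                self.visible.append(body)
        if not self.visible:
            return None
        body = self.visible.pop(0)
        self.locked[body] = now  # hide the message while it is processed
        return body

    def delete(self, body):
        self.locked.pop(body, None)  # acknowledge: remove for good

q = TinyQueue(visibility_timeout=0.05)
q.send("job-1")
msg = q.receive()            # consumer takes the message; it is now hidden
assert q.receive() is None   # nobody else can see it while it is locked
time.sleep(0.1)              # consumer "crashed": never deleted the message
msg2 = q.receive()
print(msg2)                  # job-1  (the message reappeared)
```

This never-lose-a-message property is why SQS works as the glue between GrepTheWeb’s controllers: a crashed controller’s work item is simply picked up again.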
Amazon S3: infinitely scalable storage in the cloud
• Highly available, durable, and reliable
• Private and public storage
• Pay by the GB
Amazon EC2: resizable computing capacity in the cloud
• Spawn server instances using a Web Service call
• Root-level access
• Pay by the hour
Amazon SimpleDB: database in the cloud
• Lightweight, queryable attribute store
• Distributed and partitioned
• Pay by the GB, pay per query
Zoom Level 3
[Diagram] A Hadoop cluster on Amazon EC2 (master M, N slaves, HDFS), driven by controllers, each fed by its own Amazon SQS queue:
• Launch queue → Launch controller: launches the cluster; inserts the JobID, status, and EC2 info into the status DB
• Monitor queue → Monitor controller: pings the cluster and checks S3 for results; gets EC2 info from the status DB
• Shutdown queue → Shutdown controller: shuts the cluster down
• Billing queue → Billing controller → billing service
The status DB lives in Amazon SimpleDB; input files (Alexa crawl) are read from and output is written to Amazon S3 (Put File / Get File).
API: StartGrep, GetStatus, Get Output
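The controller pattern can be sketched as a small queue-driven loop. Here Python’s queue module and a dict stand in for Amazon SQS and the SimpleDB status DB, and the handler names are hypothetical:

```python
import queue

# In-memory stand-ins for the SQS queues in the diagram.
launch_q, monitor_q, shutdown_q = queue.Queue(), queue.Queue(), queue.Queue()
status_db = {}  # stand-in for the Amazon SimpleDB status DB

def launch_controller():
    # Pop a job from the launch queue, "start" the cluster,
    # record the status, and hand the job to the monitor queue.
    job_id = launch_q.get()
    status_db[job_id] = "RUNNING"
    monitor_q.put(job_id)

def monitor_controller():
    # Poll for results; when the output appears (in the real system,
    # in S3), mark the job done and request shutdown.
    job_id = monitor_q.get()
    status_db[job_id] = "COMPLETED"
    shutdown_q.put(job_id)

launch_q.put("job-1")
launch_controller()
monitor_controller()
print(status_db["job-1"])  # COMPLETED
```

Decoupling each phase behind a queue is what lets any controller crash and restart without losing the job: the work item is still sitting in its queue.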
Zoom Level 4
[Diagram] Each user’s request runs as its own Hadoop job of tasks: many Map tasks feeding a Combine step and a Reduce step. User1 issues StartJob1/StopJob1 and User2 issues StartJob2/StopJob2 against the service, which stores status and results; users then call Get Result.
SideTrack: WordCount Example
MAPPER: For each input record, extract the key/value pairs that we care about in that record.
REDUCER: For each extracted key/value pair, combine it with the other values that share the same key.
“Hi Hadoop, Bye Hadoop”
→ (“Hi”, 1), (“Hadoop”, 1), (“Bye”, 1), (“Hadoop”, 1)
→ (“Hadoop”, [1, 1])
→ (“Hadoop”, 2)
Source: Doug Cutting’s slide deck on Hadoop
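The WordCount flow above fits in a few lines of Python; this is a single-machine sketch of the map → shuffle → reduce steps, not Hadoop’s actual API:

```python
from collections import defaultdict

def mapper(record):
    # Emit (word, 1) for every word in the record.
    for word in record.replace(",", "").split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts that share the same word.
    return (word, sum(counts))

def word_count(records):
    grouped = defaultdict(list)          # shuffle: group values by key
    for record in records:
        for word, one in mapper(record):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in grouped.items())

print(word_count(["Hi Hadoop, Bye Hadoop"]))  # {'Hi': 1, 'Hadoop': 2, 'Bye': 1}
```

Hadoop distributes exactly these three steps: mappers run on the slaves, the framework does the grouping, and reducers produce the final keys.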
[Diagram: MapReduce data flow. Input key/value pairs → Map → intermediate (key, values) pairs → aggregate all values sharing a key → Reduce → final (key, values) pairs]
Zoom Level 5 (Hadoop MapReduce)
MAPPER: For each input record, extract the key/value pairs that we care about in that record. In GrepTheWeb, the mapper takes (LineNumber, s3pointer) records and emits (s3pointer, [matches]).
REDUCER: For each extracted key/value pair, combine it with the other values that share the same key. In GrepTheWeb, the reducer is the identity function.
[Diagram: input key/value pairs → Map → (key, values) pairs → aggregate all values per key → Reduce → final (key, values) pairs]
Source: Doug Cutting’s slide deck on Hadoop
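A minimal sketch of that map/reduce pair in Python; the document fetch is stubbed out with a dict (in the real system the s3pointer names a document in Amazon S3, and Hadoop supplies the plumbing):

```python
import re

# Stand-in for fetching a crawled document from S3 by its pointer.
FAKE_S3 = {
    "s3://crawl/doc1": "f(x) = x^2 and g(y) = y^3",
    "s3://crawl/doc2": "no equations here",
}

def grep_mapper(line_number, s3pointer, regex):
    # Input record: (LineNumber, s3pointer). Fetch the document,
    # run the regex, and emit (s3pointer, [matches]) if anything hit.
    matches = re.findall(regex, FAKE_S3[s3pointer])
    if matches:
        yield (s3pointer, matches)

def grep_reducer(key, values):
    # Identity function: pass the matches through unchanged.
    return (key, values)

records = [(1, "s3://crawl/doc1"), (2, "s3://crawl/doc2")]
out = [kv for n, p in records
       for kv in grep_mapper(n, p, r"\w\(\w\)\s*=\s*\w+\^\d+")]
print(out)  # [('s3://crawl/doc1', ['f(x) = x^2', 'g(y) = y^3'])]
```

Because all the real work happens in the mapper and the reducer only passes results through, GrepTheWeb scales simply by running more mappers over more s3pointers.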