Cloud Architectures - Jinesh Varia - GrepTheWeb
Transcript of Cloud Architectures - Jinesh Varia - GrepTheWeb
On Cloud Computing…
“We in academia and the government labs have not kept up with the times; universities really need to get on board.”
- Randal E. Bryant, Dean of the Computer Science School at Carnegie Mellon University
Source: http://www.nytimes.com/2007/10/08/technology/08cloud.html
What is Amazon?
Amazon.com and AWS
[Chart: bandwidth consumed by Amazon Web Services vs. bandwidth consumed by Amazon’s global websites, 1996–2008]
AWS Customer Momentum
[Chart: AWS customers from Q1 2006 through Q4 2008, reaching 490,000]
Amazon S3 Momentum
Total objects stored in Amazon S3:
• Q2 2006: 800,000,000
• Q2 2007: 5,000,000,000
• Q3 2007: 10,000,000,000
• Q4 2008: 40,000,000,000
Why Are People So Excited?
Most Companies Worry About This
Turning Your Idea into a Successful Product means undifferentiated “heavy lifting”:
• Power/Cooling
• Hardware Management
• Bandwidth Management
• Contract Negotiations
• Maintenance
• Deployment
• Purchasing Decisions
• Load Balancing/Scaling
• Managing Growth
70/30 Switch: Focus on Innovation
Cloud computing takes on the undifferentiated “heavy lifting”, so the effort goes from Your Idea to a Successful Product.
Amazon Cloud Computing
• Focus On Your Idea
• Spend Cash Wisely
• Get Big Fast
• Pay As You Go
• Simple, Reliable, Fast
• Elastic, Unlimited Capacity
Services: Amazon EC2 (with EBS), Amazon SimpleDB, Amazon S3, Amazon SQS
ANIMOTO.COM
Scale: 50 servers to 5,000 servers in 3 days
[Chart: number of EC2 instances, 4/12/2008–4/20/2008. From a steady state of ~40 instances, the launch of the Facebook modification drove traffic up; Amazon EC2 easily scaled to handle the additional traffic, peaking at 5,000 instances.]
“TimesMachine” from NY Times
Articles from 1851–1922, converted TIFF → PDF
Input: 11 million articles (4 TB of data)
What did he do? 100 EC2 instances for 24 hours, all data on S3
Output: 1.5 TB of data
Tools: Hadoop, iText, JetS3t
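The TimesMachine conversion is an embarrassingly parallel batch job: every article can be converted independently and fanned out across machines. A minimal sketch of that fan-out pattern, with the S3 fetch, iText conversion, and upload stubbed out (`convert_article` and the key layout are illustrative, not the Times’ actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def convert_article(key):
    # Placeholder for the real work: fetch the TIFF scans for one
    # article from S3, stitch them into a PDF (the Times used iText),
    # and upload the result back to S3 (via JetS3t).
    return key.replace(".tif", ".pdf")

def convert_batch(keys, workers=4):
    # Each article is independent, so the whole job parallelizes
    # trivially; on EC2 this fan-out ran across 100 instances.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_article, keys))

keys = [f"articles/{i}.tif" for i in range(10)]
pdfs = convert_batch(keys)
print(pdfs[0])  # articles/0.pdf
```

The same structure scales from a thread pool on one machine to a cluster: only the dispatch mechanism changes, not the per-article work.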
CS290F: Scalable Internet Services
UCSB, Fall 2006
Professor created an app to manage team usage
Ruby on Rails
Complete stack: from load balancer and app server to DB
Learn how to scale: simulated load, generated graphs
All course contents, student assignments, and lessons learned are on the Wiki
CS345a: Data Mining @ Stanford
Tools used:
• Shell/Linux/Java
• Hadoop on EC2
• Data sets on S3: Netflix, Alexa, IR datasets from TREC
Class organization:
• Stanford, Winter 2007
• 30–35 students
• Each team spawns 10–15 Hadoop slave nodes
• TA created Getting-Started AMIs (and scripts)
• TA managed the students’ usage
Bioinformatics @ Northwestern University
• Using Hadoop to perform sequence alignments on large genomic datasets: Northwestern University (Flatow & Lin) presented a talk at the Next-Gen Sequencing Data Analysis meeting
• “An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated.”
Cloud Architectures
[Chart: hardware infrastructure/cost over time vs. job execution time]
Shrink your processing time
[Charts: CPUs vs. time. With elastic capacity, the same job can run on many more CPUs for a much shorter time.]
Main Problems
Technical:
• How to coordinate jobs between machines (distributed processing)?
• What if a machine fails?
• How will I scale out?
Business:
• How do I get management sign-off?
• Resources to manage the infrastructure?
• How do I get rid of the idle infrastructure?
Hadoop
Web Services
Cloud Computing
GrepTheWeb
What’s so cool about GrepTheWeb?
RegEx + WWW
Examples of Patterns
• Source code: int x = 40 + i
• Anything with punctuation: “Hey!” he said, “Are you ok?”
• Case-sensitive function call: OrderController()
• Equations: f(x) = x^2
• Other patterns: (dis)integration of life, email addresses
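Each example above is an ordinary regular expression. A minimal sketch of the kind of matching GrepTheWeb performs over a document (the sample text and the exact patterns are illustrative):

```python
import re

# Illustrative patterns mirroring the slide's examples.
patterns = {
    "source_code": r"int\s+\w+\s*=\s*\d+\s*\+\s*\w+",  # e.g. "int x = 40 + i"
    "equation":    r"\w\(\w\)\s*=\s*\w+\^\d+",         # e.g. "f(x) = x^2"
    "email":       r"[\w.+-]+@[\w-]+\.[\w.]+",         # any email-like address
}

document = "f(x) = x^2 was mailed to ada@example.com; int x = 40 + i"

# For each named pattern, collect every match found in the document.
matches = {name: re.findall(rx, document) for name, rx in patterns.items()}
print(matches["equation"])  # ['f(x) = x^2']
```

GrepTheWeb runs exactly this per-document step, but across billions of crawled documents in parallel.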
Zoom Level 1
[Diagram] The user submits a RegEx to the GrepTheWeb service and polls it via GetStatus. The input dataset is a list of document URLs (from the Alexa crawl); the output is the subset of document URLs that matched the RegEx.
Zoom Level 2
[Diagram] The service is built from Amazon SQS, a Controller, an Amazon EC2 cluster, Amazon S3, and Amazon SimpleDB:
• Amazon SimpleDB: DB holding user info and job status info
• Controller: manages phases; launches, monitors, and shuts down the EC2 cluster
• Amazon S3: holds the input files (Alexa crawl) and the output
• API: StartGrep(RegEx), GetStatus, Get Output
Amazon SQS: distributed transient buffer
• Never lose a message
• Ideal for small, short-lived messages
• Access control
• Message locking
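The message-locking behavior (a consumer hides a message while working on it, and the message reappears if the consumer never acknowledges it) can be sketched with a tiny in-memory queue. This illustrates the semantics only; it is not the SQS API, and the class and timeout names are made up:

```python
import time

class TinyQueue:
    # In-memory sketch of a visibility-timeout queue: receiving a
    # message locks (hides) it; unacknowledged messages reappear.
    def __init__(self, visibility_timeout=1.0):
        self.visible, self.locked = [], {}
        self.timeout = visibility_timeout

    def send(self, body):
        self.visible.append(body)

    def receive(self):
        # Return any expired locked messages to the visible list first.
        now = time.monotonic()
        for body, t in list(self.locked.items()):
            if now - t > self.timeout:
                del self.locked[body]
                self.visible.append(body)
        if not self.visible:
            return None
        body = self.visible.pop(0)
        self.locked[body] = now  # hide the message while it is processed
        return body

    def delete(self, body):
        self.locked.pop(body, None)  # acknowledge: remove for good

q = TinyQueue(visibility_timeout=0.05)
q.send("job-1")
msg = q.receive()            # consumer takes the message; it is now hidden
assert q.receive() is None   # nobody else can see it while it is locked
time.sleep(0.1)              # consumer "crashed": never deleted the message
msg2 = q.receive()
print(msg2)                  # job-1  (the message reappeared)
```

This never-lose-a-message property is why SQS works as the glue between GrepTheWeb’s controllers: a crashed controller’s work item is simply picked up again.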
Amazon S3: infinitely scalable storage in the cloud
• Highly available, durable, and reliable
• Private and public storage
• Pay by the GB
Amazon EC2: resizable computing capacity in the cloud
• Spawn server instances using a Web Service call
• Root-level access
• Pay by the hour
Amazon SimpleDB: database in the cloud
• Lightweight, queryable attribute store
• Distributed and partitioned
• Pay by the GB, pay per query
Zoom Level 3
[Diagram] A Hadoop cluster on Amazon EC2 (master M, N slaves, HDFS), driven by controllers, each fed by its own Amazon SQS queue:
• Launch queue → Launch controller: launches the cluster; inserts the JobID, status, and EC2 info into the status DB
• Monitor queue → Monitor controller: pings the cluster and checks S3 for results; gets EC2 info from the status DB
• Shutdown queue → Shutdown controller: shuts the cluster down
• Billing queue → Billing controller → billing service
The status DB lives in Amazon SimpleDB; input files (Alexa crawl) are read from and output is written to Amazon S3 (Put File / Get File).
API: StartGrep, GetStatus, Get Output
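The controller pattern can be sketched as a small queue-driven loop. Here Python’s queue module and a dict stand in for Amazon SQS and the SimpleDB status DB, and the handler names are hypothetical:

```python
import queue

# In-memory stand-ins for the SQS queues in the diagram.
launch_q, monitor_q, shutdown_q = queue.Queue(), queue.Queue(), queue.Queue()
status_db = {}  # stand-in for the Amazon SimpleDB status DB

def launch_controller():
    # Pop a job from the launch queue, "start" the cluster,
    # record the status, and hand the job to the monitor queue.
    job_id = launch_q.get()
    status_db[job_id] = "RUNNING"
    monitor_q.put(job_id)

def monitor_controller():
    # Poll for results; when the output appears (in the real system,
    # in S3), mark the job done and request shutdown.
    job_id = monitor_q.get()
    status_db[job_id] = "COMPLETED"
    shutdown_q.put(job_id)

launch_q.put("job-1")
launch_controller()
monitor_controller()
print(status_db["job-1"])  # COMPLETED
```

Decoupling each phase behind a queue is what lets any controller crash and restart without losing the job: the work item is still sitting in its queue.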
Zoom Level 4
[Diagram] Each user’s request runs as its own Hadoop job of tasks: many Map tasks feeding a Combine step and a Reduce step. User1 issues StartJob1/StopJob1 and User2 issues StartJob2/StopJob2 against the service, which stores status and results; users then call Get Result.
SideTrack: WordCount Example
MAPPER: For each input record, extract the key/value pairs that we care about in that record.
REDUCER: For each extracted key/value pair, combine it with the other values that share the same key.
“Hi Hadoop, Bye Hadoop”
→ (“Hi”, 1), (“Hadoop”, 1), (“Bye”, 1), (“Hadoop”, 1)
→ (“Hadoop”, [1, 1])
→ (“Hadoop”, 2)
Source: Doug Cutting’s slide deck on Hadoop
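The WordCount flow above fits in a few lines of Python; this is a single-machine sketch of the map → shuffle → reduce steps, not Hadoop’s actual API:

```python
from collections import defaultdict

def mapper(record):
    # Emit (word, 1) for every word in the record.
    for word in record.replace(",", "").split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts that share the same word.
    return (word, sum(counts))

def word_count(records):
    grouped = defaultdict(list)          # shuffle: group values by key
    for record in records:
        for word, one in mapper(record):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in grouped.items())

print(word_count(["Hi Hadoop, Bye Hadoop"]))  # {'Hi': 1, 'Hadoop': 2, 'Bye': 1}
```

Hadoop distributes exactly these three steps: mappers run on the slaves, the framework does the grouping, and reducers produce the final keys.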
[Diagram: MapReduce data flow. Input key/value pairs → Map → intermediate (key, values) pairs → aggregate all values sharing a key → Reduce → final (key, values) pairs]
Zoom Level 5 (Hadoop MapReduce)
MAPPER: For each input record, extract the key/value pairs that we care about in that record. In GrepTheWeb, the mapper takes (LineNumber, s3pointer) records and emits (s3pointer, [matches]).
REDUCER: For each extracted key/value pair, combine it with the other values that share the same key. In GrepTheWeb, the reducer is the identity function.
[Diagram: input key/value pairs → Map → (key, values) pairs → aggregate all values per key → Reduce → final (key, values) pairs]
Source: Doug Cutting’s slide deck on Hadoop
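A minimal sketch of that map/reduce pair in Python; the document fetch is stubbed out with a dict (in the real system the s3pointer names a document in Amazon S3, and Hadoop supplies the plumbing):

```python
import re

# Stand-in for fetching a crawled document from S3 by its pointer.
FAKE_S3 = {
    "s3://crawl/doc1": "f(x) = x^2 and g(y) = y^3",
    "s3://crawl/doc2": "no equations here",
}

def grep_mapper(line_number, s3pointer, regex):
    # Input record: (LineNumber, s3pointer). Fetch the document,
    # run the regex, and emit (s3pointer, [matches]) if anything hit.
    matches = re.findall(regex, FAKE_S3[s3pointer])
    if matches:
        yield (s3pointer, matches)

def grep_reducer(key, values):
    # Identity function: pass the matches through unchanged.
    return (key, values)

records = [(1, "s3://crawl/doc1"), (2, "s3://crawl/doc2")]
out = [kv for n, p in records
       for kv in grep_mapper(n, p, r"\w\(\w\)\s*=\s*\w+\^\d+")]
print(out)  # [('s3://crawl/doc1', ['f(x) = x^2', 'g(y) = y^3'])]
```

Because all the real work happens in the mapper and the reducer only passes results through, GrepTheWeb scales simply by running more mappers over more s3pointers.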