3rd meetup - Intro to Amazon EMR
-
Upload
faizan-javed -
Category
Technology
-
view
2.231 -
download
0
Transcript of 3rd meetup - Intro to Amazon EMR
![Page 1: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/1.jpg)
‘Amazon EMR’ coming up…by Sujee Maniyam
![Page 2: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/2.jpg)
Birmingham Big Data Group
Amazon Elastic Map Reduce For Startups
Sujee Maniyam
[email protected] / www.sujee.net
Sept 14, 2011
![Page 3: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/3.jpg)
Hi, I’m Sujee
• 10+ years of software development
– enterprise apps web apps iphone
apps Hadoop
• More : http://sujee.net/tech
![Page 4: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/4.jpg)
I am an ‘expert’
![Page 5: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/5.jpg)
Ah.. Data
•
![Page 6: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/6.jpg)
Nature of Data…
• Primary Data
– Email, blogs, pictures, tweets
– Critical for operation (Gmail can’t loose emails)
• Secondary data
– Wikipedia access logs, Google search logs
– Not ‘critical’, but used to ‘enhance’ user experience
– Search logs help predict ‘trends’
– Yelp can figure out you like Chinese food
![Page 7: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/7.jpg)
Data Explosion• Primary data has grown phenomenally
• But secondary data has exploded in recent years
• “log every thing and ask questions later”
• Used for
– Recommendations (books, restaurants ..etc)
– Predict trends (job skills in demand)
– Show ADS ($$$)
– ..etc
• ‘Big Data’ is no longer just a problem for BigGuys (Google / Facebook)
• Startups are struggling to get on top of ‘big data’
![Page 8: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/8.jpg)
Big Guys
![Page 9: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/9.jpg)
Startups
![Page 10: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/10.jpg)
Startups and bigdata
![Page 11: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/11.jpg)
Hadoop to Rescue
• Hadoop can help with BigData
• Hadoop has been proven in the
field
• Under active development
• Throw hardware at the problem
– Getting cheaper by the year
• Bleeding edge technology
– Hire good people!
![Page 12: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/12.jpg)
Hadoop : It is a CAREER
![Page 13: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/13.jpg)
Data Spectrum
![Page 14: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/14.jpg)
Who is Using Hadoop?
60 PB40,000 machines running hadoop4000 node cluster (largest)
36 PB 100 TB /day2500 node cluster
7 TB / day40 PB50 TB / day1000+ nodes cluster
Many many startupsMultiple Terrabytes100s Gigs / dayCluster size : 10-50
![Page 15: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/15.jpg)
About This Presentation
• Based on my experience with a startup
• 5 people (3 Engineers)
• Ad-Serving Space
• Amazon EC2 is our ‘data center’
• Technologies:
– Web stack : Python, Tornado, PHP, mysql , LAMP
– Amazon EMR to crunch data
• Data size : 1 TB / week
![Page 16: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/16.jpg)
Story of a Startup
• We served targeted ads
• Tons of click data
• Stored them in mysql db
• Outgrew mysql pretty quickly
![Page 17: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/17.jpg)
Data @ 6 months
• 2 TB of data already
• 50-100 G new data / day
• And we were operating
at 20% of our capacity!
![Page 18: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/18.jpg)
Future…
![Page 19: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/19.jpg)
Solution?
• Scalable database (NOSQL)
– Hbase
– Cassandra
• Hadoop log processing / Map Reduce
![Page 20: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/20.jpg)
What We Evaluated
• 1) Hbase cluster
• 2) Hadoop cluster
• 3) Amazon EMR
![Page 21: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/21.jpg)
Hadoop on Amazon EC2
• Two ways to run Hadoop on EC2
– 1) Permanent Cluster
– 2) On demand cluster (elastic map
reduce)
![Page 22: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/22.jpg)
1) Permanent Hadoop Cluster
![Page 23: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/23.jpg)
Architecture 1
![Page 24: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/24.jpg)
Hadoop Cluster
• 7 C1.xlarge machines
• 15 TB EBS volumes
• Sqoop exports mysql log tables into HDFS
• Logs are compressed (gz) to minimize
disk usage (data locality trade-off)
• All is working well…
![Page 25: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/25.jpg)
2 months later
• Couple of EBS volumes DIE
• Couple of EC2 instances DIE
• Maintaining the hadoop cluster is mechanical job
less appealing
• COST!
– Our jobs utilization is about 50%
– But still paying for machines running 24x7
![Page 26: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/26.jpg)
Lessons Learned
• C1.xlarge is pretty stable (8 core / 8G memory)
• EBS volumes
– max size 1TB, so string few for higher density / node
– DON’T RAID them; let hadoop handle them as individual disks
– Might fail
• Backup data on S3
• Skip EBS. Use instance store disks, and store data in S3
• Use Apache WHIRR to setup cluster easily
![Page 27: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/27.jpg)
Amazon Storage OptionsS3 EBS Ephemeral
- highly available storage
- high latency access time
- recommended uses : distributing media / video, backups, store bigdata
- cost : 1TB = $150 / month(compare 2TB disk = $150)
- in between S3 and ephemeral storage
- persistent storage
- attaches to instances like regular disks (/dev/sdx)
- access over network
- recommended uses : database storage..
- max size : 1TB
- these do fail. So snapshot backups are recommended
- disk you get with server instance
- Free
- data is local (not over network)
- recommended use : cache, tmp data
- size upto 1.5 TB
- fast access times
- goes away when your server instance is terminated
![Page 28: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/28.jpg)
Amazon EC2 Cost
![Page 29: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/29.jpg)
Hadoop cluster on EC2 cost
• $3,500 = 7 c1.xlarge @ $500 / month
• $1,500 = 15 TB EBS storage @ $0.10 per
GB
• $ 500 = EBS I/O requests @ $0.10 per 1
million I/O requests
• $5,500 / month
• $60,000 / year !
![Page 30: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/30.jpg)
Buy / Rent ?
• Typical hadoop machine cost : $10-15k
– 10 node cluster = $100k
– Plus data center costs
– Plus IT-ops costs
• Amazon Ec2 10 node cluster:
– $500 * 10 = $5,000 / month = $60k / year
![Page 31: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/31.jpg)
Buy / Rent• Amazon EC2 is great, for
– Quickly getting started
• Startups
– Scaling on demand / rapidly adding more servers
• popular social games
– Netflix story
• Streaming is powered by EC2
• Encoding movies ..etc
• Use 1000s of instances
• Not so economical for running clusters 24x7
• http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-
colocation-comparison-for-hadoop-clusters/
![Page 32: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/32.jpg)
Buy vs Rent
![Page 33: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/33.jpg)
Amazon’s Elastic Map Reduce
• Basically ‘on demand’ hadoop cluster
• Store data on Amazon S3
• Kick off a hadoop cluster to process
data
• Shutdown when done
• Pay for the HOURS used
![Page 34: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/34.jpg)
Architecture2 : Amazon EMR
![Page 35: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/35.jpg)
Moving parts
• Logs go into Scribe
• Scribe master ships logs into S3, gzipped
• Spin EMR cluster, run job, done
• Using same old Java MR jobs for EMR
• Summary data gets directly updated to a
mysql (no output files from reducers)
![Page 36: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/36.jpg)
EMR Wins• Cost only pay for use
• http://aws.amazon.com/elasticmapreduce/pricing/
• Example: EMR ran on 5 C1.xlarge for 3hrs
– EC2 instances for 3 hrs = $0.68 per hr x 5 inst x 3 hrs = $10.20
– http://aws.amazon.com/elasticmapreduce/faqs/#billing-4
– (1 hour of c1.xlarge = 8 hours normalized compute time)
– EMR cost = 5 instances x 3 hrs x 8 normalized hrs x 0.12 emr =
$14.40
– Plus S3 storage cost : 1TB / month = $150
– Data bandwidth from S3 to EC2 is FREE!
– $25 bucks
![Page 37: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/37.jpg)
Design Wins
• Bidders now write logs to Scribe
directly
– No mysql at web server machines
–Writes much faster!
• S3 has been a reliable storage and
cheap
![Page 38: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/38.jpg)
EMR Wins
• No hadoop cluster to maintain
no failed nodes / disks
![Page 39: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/39.jpg)
EMR Wins
• Hadoop clusters can be of any size!
• Spin clusters for each job
– smaller jobs fewer number of
machines
–memory hungry tasks m1.xlarge
– cpu hungry tasks c1.xlarge
![Page 40: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/40.jpg)
EMR trade-offs• Lower performance on MR jobs compared to a cluster
Reduced data throughput (S3 isn’t the same as local
disk)
• Streaming data from S3, for each job
• EMR Hadoop is not the latest version
• Missing tools : Oozie
• Right now, trading performance for convenience and
cost
![Page 41: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/41.jpg)
Lessons Learned
• Debugging a failed MR job is tricky
– Because the hadoop cluster is
terminated no logs files
• Save log files to S3
![Page 42: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/42.jpg)
Lessons : Script every thing
• scripts
– to launch jar EMR jobs
– Custom parameters depending on job needs (instance types,
size of cluster ..etc)
– monitor job progress
– Save logs for later inspection
– Job status (finished / cancelled)
• https://github.com/sujee/amazon-emr-beyond-basics
![Page 43: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/43.jpg)
Saved Logs
![Page 44: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/44.jpg)
Map reduce tips: Logfile format
• CSV JSON
• Started with CSV
• CSV: "2","26","3","07807606-7637-41c0-9bc0-
8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-
09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?
s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.28401
92E9","0","221","0.60000","NULL","NULL
• 20-40 fields… fragile, position dependant, hard to code
– url = csv[18]…counting position numbers gets old after 100th time
around)
– If (csv.length == 29) url = csv[28] else url = csv[26]
![Page 45: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/45.jpg)
Map reduce tips: Logfile format
• JSON:
{ exchange_id: 2, url : “http://housemdvideos.com/seasons/video.php?
s=01&e=07”….}
• Self-describing, easy to add new fields, easy to
process
– url = map.get(‘url’)
• Flatten JSON to fit in ONE LINE
• Compresses pretty well (not much data inflation)
![Page 46: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/46.jpg)
Map reduce tips: Incremental Log Processing
• Recent data (today / yesterday / this
week) is more relevant than older
data (6 months +)
![Page 47: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/47.jpg)
Map reduce tips: Incremental Log Processing
• Adding ‘time window’ to our stats
• only process newer logs
faster
![Page 48: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/48.jpg)
Next steps
• New Software
– Pig, python mrJOB (from Yelp)
• Scribe Cloudera flume?
• Use work flow tools like Oozie
• Hive?
– Adhoc SQL like queries
![Page 49: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/49.jpg)
Next Steps : SPOT instances
• SPOT instances : name your price (ebay style)
• Been available on EC2 for a while
• Just became available for Elastic map reduce!
• New cluster setup:
– 10 normal instances + 10 spot instances
– Spots may go away anytime
• That is fine! Hadoop will handle node failures
• Bigger cluster : cheaper & faster
![Page 50: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/50.jpg)
Example Price Comparison
![Page 51: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/51.jpg)
In summary…
• Amazon EMR could be a great
solution
• We are happy!
![Page 52: 3rd meetup - Intro to Amazon EMR](https://reader035.fdocuments.us/reader035/viewer/2022062512/554a0669b4c905e56c8b56a2/html5/thumbnails/52.jpg)
Take a test drive• Just bring your credit-card
• http://aws.amazon.com/
elasticmapreduce/
• Forum :
https://forums.aws.amazon.com/foru
m.jspa?forumID=52