(BDT210) Building Scalable Big Data Solutions: Intel & AOL
Transcript of (BDT210) Building Scalable Big Data Solutions: Intel & AOL
![Page 1: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/1.jpg)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bob Rogers, PhD Chief Data Scientist for Big Data Solutions, Intel
Durga Nemani, System Architect AOL Inc.
October 2015
Building Scalable Big Data
Solutions
BDT210
![Page 2: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/2.jpg)
Building Scalable Big Data Solutions
October 2015
Bob Rogers, PhD
Chief Data Scientist for Big Data Solutions
Intel
![Page 3: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/3.jpg)
@scientistBob 3
About me
![Page 4: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/4.jpg)
What does Big Data have to do with Intel?
Trusted Analytics Platform
![Page 5: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/5.jpg)
Intel contributions to Apache Hadoop
Encryption: Intel® AES-NI
![Page 6: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/6.jpg)
Use case:
Assemble an accurate patient problem list
Why?
• To improve patient outcomes
KPI
• False negatives in problem list
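The KPI above can be expressed as a false-negative rate: the share of a patient's true conditions that are absent from the problem list. The following is a minimal sketch, assuming a reference set of conditions (for example, from chart review) is available; the function name and the sample conditions are hypothetical and not from the talk.

```python
def false_negative_rate(true_conditions, problem_list):
    """Fraction of true conditions that are missing from the problem list."""
    true_set = set(true_conditions)
    if not true_set:
        return 0.0  # nothing to miss
    missed = true_set - set(problem_list)
    return len(missed) / len(true_set)


# Hypothetical patient: conditions confirmed by review vs. what is on the list.
confirmed = ["diabetes", "hypertension", "ckd stage 3", "atrial fibrillation"]
coded = ["hypertension"]

rate = false_negative_rate(confirmed, coded)
print(f"False-negative rate: {rate:.0%}")
```

Tracking this rate per patient, before and after adding a new data source, is one way to measure whether the problem list is getting more complete.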
![Page 7: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/7.jpg)
What does a patient look like to a data scientist?
![Page 8: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/8.jpg)
My first enterprise data hub
![Page 9: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/9.jpg)
Poll: What percent of the key clinical data do you think is missing from the problem list?
• 0-25%
• 25-50%
• 50-75%
• 75-100%
![Page 10: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/10.jpg)
Poll: What percent of the key clinical data do you think is missing from the problem list?
Answer: >63% missing
![Page 11: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/11.jpg)
Real patient example
• Coded Data
• Free Text
• Scanned Documents
• Other Data Silos
![Page 12: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/12.jpg)
Missing information
![Page 13: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/13.jpg)
What did we learn?
• Start with what you know
• Leverage existing technologies
• Use simple tools
• Measure your results
![Page 14: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/14.jpg)
Powerful Big Data analytics reveal the truth about your…
…customers
…products
…ecosystem
…opportunities
![Page 16: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/16.jpg)
Building Scalable Big Data Solutions
Durga Nemani – AOL Inc.
![Page 17: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/17.jpg)
BACKGROUND & ARCHITECTURE
![Page 18: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/18.jpg)
HYBRID
![Page 19: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/19.jpg)
The Three Vs
• Volume: Multiple terabytes per day
• Variety: Delimited, Avro, JSON
• Velocity: Hourly, batch
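To illustrate the "Variety" point, here is a minimal sketch that normalizes delimited and JSON records into one common shape using only the Python standard library. The field names, the pipe delimiter, and the sample records are assumptions for illustration; Avro is left out because reading it requires a third-party library.

```python
import csv
import io
import json

FIELDS = ["user_id", "event", "ts"]  # hypothetical schema, not from the talk


def parse_delimited(text, delimiter="|"):
    """Turn delimited rows into dicts keyed by FIELDS."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def parse_json_lines(text):
    """Turn one-JSON-object-per-line text into dicts with string values."""
    records = []
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            records.append({name: str(obj[name]) for name in FIELDS})
    return records


# Made-up sample input in two of the formats named on the slide.
delimited = "u1|click|1444300000\nu2|view|1444300060\n"
json_lines = '{"user_id": "u3", "event": "click", "ts": 1444300120}\n'

records = parse_delimited(delimited) + parse_json_lines(json_lines)
for record in records:
    print(record)
```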
![Page 20: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/20.jpg)
Workload Management
• “One size fits all” model does not work.
• Specific infrastructure tuned to needs and requirements
• Variety of EMR clusters as per data need
Figure labels: Workloads with significant diversity of needs; Resources with lowest common denominator; Resources for workloads with significant diversity of needs
![Page 21: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/21.jpg)
Diagram: S3 with four EMR clusters
![Page 22: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/22.jpg)
Open Source Data Formats: JSON, Avro, Parquet
AWS Services: EC2, EMR, S3
Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop
![Page 23: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/23.jpg)
UNIQUE FEATURES & ADVANTAGES
![Page 24: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/24.jpg)
Separation of Compute and Storage
![Page 25: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/25.jpg)
SEE, SPOT, SQUEEZE
• Just enough spot instances to finish the job in 59 minutes.
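The sizing idea on this slide can be sketched as simple arithmetic: pick the smallest instance count that finishes within the 59-minute window, presumably because EC2 usage was billed per instance-hour at the time. This is a rough sketch that assumes throughput scales linearly with instance count; the function name, the per-instance throughput, and the data size are hypothetical.

```python
import math


def instances_needed(data_gb, gb_per_instance_hour, deadline_minutes=59,
                     startup_minutes=0):
    """Smallest instance count that processes data_gb before the deadline.

    Assumes work is spread evenly and throughput scales linearly.
    """
    usable_minutes = deadline_minutes - startup_minutes
    if usable_minutes <= 0:
        raise ValueError("deadline leaves no time for processing")
    gb_per_instance = gb_per_instance_hour * usable_minutes / 60
    return math.ceil(data_gb / gb_per_instance)


# Hypothetical job: 500 GB of input, 20 GB per instance per hour.
print(instances_needed(500, 20))
# Same job when 10 minutes of the window go to cluster startup.
print(instances_needed(500, 20, startup_minutes=10))
```

Reserving part of the window for startup raises the instance count, which is why the sketch exposes `startup_minutes` as a separate parameter.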
![Page 26: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/26.jpg)
Key Features
• Separation of Compute and Storage: Amazon S3 and Amazon EMR
• Transient Clusters: No permanent cluster. Different size clusters for different datasets
• Separation of duties: Independent jobs for processing, extracting, loading and monitoring.
• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies
• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones
• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot prices.
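The "cost optimized" item above amounts to choosing the Availability Zone with the lowest current Spot price before launching a cluster. Below is a minimal sketch of that selection step. The prices are made up for illustration; in practice they might be fetched from the EC2 Spot price history API, which this sketch does not call.

```python
def cheapest_zone(spot_prices):
    """Return (zone, price) for the lowest Spot price in the mapping."""
    if not spot_prices:
        raise ValueError("no Spot prices supplied")
    zone = min(spot_prices, key=spot_prices.get)
    return zone, spot_prices[zone]


# Hypothetical prices in USD per instance-hour for one instance type.
prices = {"us-east-1a": 0.082, "us-east-1b": 0.071, "us-east-1d": 0.095}

zone, price = cheapest_zone(prices)
print(f"Launch in {zone} at ${price:.3f}/hour")
```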
![Page 27: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/27.jpg)
DATA & INSIGHTS
![Page 28: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/28.jpg)
CLOUD Facts
• Total compressed Amazon S3 data size: 150 TB
• Uncompressed raw data per day: 2-3 TB
• Amazon EMR clusters per day: 350
• Amazon S3 data retention period: 13-24 months
![Page 29: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/29.jpg)
Restatement Use Case
• 150 terabytes raw
• 10 Availability Zones
• 550 EMR clusters
• 24,000 EC2 instances
![Page 30: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/30.jpg)
AWS COST BREAKOUT
Chart values: 44%, 40%, 16%
Legend: EC2 Cost, EMR Fee, S3 Cost
** Storage cost is recurring every month at $2.85 per 100 GB
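The storage footnote lends itself to a quick worked calculation. The sketch below applies the quoted rate of $2.85 per 100 GB per month to the 150 TB compressed S3 total from the "CLOUD Facts" slide. It assumes decimal units (1 TB = 1,000 GB) and a flat rate with no pricing tiers; both are simplifications, not statements from the talk.

```python
RATE_PER_100_GB = 2.85  # USD per month, from the slide footnote


def monthly_storage_cost(size_tb, gb_per_tb=1000):
    """Monthly storage cost in USD at a flat per-100-GB rate."""
    return size_tb * gb_per_tb / 100 * RATE_PER_100_GB


cost = monthly_storage_cost(150)  # 150 TB from the "CLOUD Facts" slide
print(f"${cost:,.2f} per month")
```

Under these assumptions the recurring storage bill comes to roughly $4,275 per month.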
![Page 31: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/31.jpg)
Best Practices & Suggestions
![Page 32: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/32.jpg)
Tag all resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
AWS Identity and Access Management (IAM) roles and policies
Use of application ID
Enable CloudTrail
S3 lifecycle management
S3 versioning
Separate code/data/logs buckets
Keyless EMR clusters
Hybrid model
Enable debugging
Create multiple CLI profiles
Multi-factor authentication
CloudWatch billing alarms
EC2 Spot instances
SNS notifications for failures
Loosely coupled Apps
Scale horizontally
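Two of the practices above, "JSON as configuration files" and "Tag all resources", combine naturally: a cluster definition kept as JSON can be checked for required tags before anything is launched. The following is a minimal sketch of such a check. The configuration layout, the tag names, and the tagging policy are hypothetical and do not reflect AOL's actual files.

```python
import json

# Hypothetical tagging policy: every cluster config must carry these tags.
REQUIRED_TAGS = {"application_id", "owner", "environment"}

# Hypothetical cluster configuration, as it might be stored in a JSON file.
CONFIG_JSON = """
{
  "cluster_name": "hourly-clickstream",
  "instance_count": 20,
  "tags": {"application_id": "app-123", "owner": "data-eng"}
}
"""


def missing_tags(config):
    """Return the required tag names that the config does not define."""
    return sorted(REQUIRED_TAGS - set(config.get("tags", {})))


config = json.loads(CONFIG_JSON)
missing = missing_tags(config)
if missing:
    print(f"Refusing to launch {config['cluster_name']}: missing tags {missing}")
```

Running a check like this in the launch script keeps untagged resources from reaching the bill, which also makes per-application cost reporting easier.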
![Page 33: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/33.jpg)
Next Steps
![Page 34: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/34.jpg)
Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2
Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF or AWS Data Pipeline
Data analytics
• Implement massively parallel processing technologies
• Options: Spark, Impala or Presto
DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
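The event-driven design item above ("run downstream processes as soon as upstream completes") can be sketched with a small in-process publish/subscribe bus. This is only an illustration of the pattern: the `EventBus` class, the event names, and the step functions are hypothetical, and a real deployment would use one of the managed services listed on the slide rather than in-memory callbacks.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-memory publish/subscribe dispatcher."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        # Call every handler registered for this event, in subscription order.
        for handler in list(self._handlers.get(event_name, [])):
            handler(payload)


bus = EventBus()
log = []


def load_step(payload):
    log.append(f"load {payload['dataset']}")
    bus.publish("load.done", payload)  # signals the next stage


def report_step(payload):
    log.append(f"report {payload['dataset']}")


# Each stage starts as soon as the stage before it announces completion.
bus.subscribe("process.done", load_step)
bus.subscribe("load.done", report_step)

bus.publish("process.done", {"dataset": "2015-10-08-13"})
print(log)
```

Because each step only knows the event it listens for, stages can be added or replaced without touching the others, which matches the "loosely coupled apps" practice from the earlier slide.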
![Page 35: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/35.jpg)
Q & A
![Page 36: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/36.jpg)
THANK YOU
Recommended session:
BDT208 - A Technical Introduction to Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
![Page 37: (BDT210) Building Scalable Big Data Solutions: Intel & AOL](https://reader031.fdocuments.us/reader031/viewer/2022030317/5a667f007f8b9ac5128b4c81/html5/thumbnails/37.jpg)
Remember to complete your
evaluations!