Big Data Analytics
Peter Sirota
General Manager, Amazon Elastic MapReduce
Overview
1. Introducing Big Data
2. From data to actionable information
3. Analytics and Cloud Computing
4. The Big Data ecosystem
1. Introducing Big Data
The data pipeline: Generation → Collection & storage → Analytics & computation → Collaboration & sharing.
The cost of data generation is falling.
Collection & storage: lower cost, higher throughput.
Analytics & computation, collaboration & sharing: highly constrained.
[Chart: data volume over time; generated data grows far faster than the data available for analysis.]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Elastic and highly scalable + no upfront capital expense + pay only for what you use + available on-demand = remove constraints.
With the cloud, analytics & computation and collaboration & sharing go from highly constrained to accelerated. Close the gap.
Big Data: technologies and techniques for working productively with data, at any scale.
2. From data to actionable information
“Who buys video games?”
Per day: 3.5 billion records, 13 TB of clickstream logs, 71 million unique cookies.
Results: 500% return on ad spend; 17,000% reduction in procurement time.
“Who is using our service?”
Finding signal in the noise of logs: identified early mobile usage, invested heavily in mobile development.
In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app: 4 million+ calls, 5 million+ directions.
Open web index. 3.4 billion records. Available to all.
Full parse for the impact of social networks: 300 lines of Ruby code, 14 hours, $100.
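The full parse described above was 300 lines of Ruby over the open web index; as a rough illustration of the core idea (not the actual analysis), a sketch in Python of counting outbound links to social networks in crawled HTML. The domain list and sample pages below are hypothetical stand-ins for real crawl records.

```python
import re
from collections import Counter

# Illustrative subset of social-network domains to look for.
SOCIAL_DOMAINS = ("facebook.com", "twitter.com", "plus.google.com")

def count_social_links(html_pages):
    """Count outbound links to social networks across crawled pages."""
    counts = Counter()
    href = re.compile(r'href="https?://(?:www\.)?([^/"]+)')
    for page in html_pages:
        for domain in href.findall(page):
            for social in SOCIAL_DOMAINS:
                if domain.endswith(social):
                    counts[social] += 1
    return counts

# Tiny made-up sample standing in for crawl records.
pages = [
    '<a href="https://www.facebook.com/acme">Like us</a>',
    '<a href="http://twitter.com/acme">Follow</a> <a href="https://twitter.com/b">b</a>',
]
print(count_social_links(pages))
```

At web-index scale the same per-page logic would run as the map step of a batch job rather than a local loop.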
Tweeting about flu: “You Are What You Tweet: Analyzing Twitter for Public Health.” M. J. Paul and M. Dredze, 2011.
Tweeting about food: [Chart: tweets about the price of rice track official food price inflation.]
3. Analytics and Cloud Computing
The data pipeline, mapped to AWS services:
Generation
Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase
Analytics & computation: EC2 & Elastic MapReduce
Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift
Orchestrating across all stages: AWS Data Pipeline
Elastic MapReduce
Managed Hadoop analytics
Elastic MapReduce data flow:
Input data comes from S3, DynamoDB, or Redshift.
You supply code; Elastic MapReduce provisions a name node and an elastic cluster.
The cluster processes data using S3/HDFS, serving queries + BI via JDBC, Pig, and Hive.
Output is written back to S3, DynamoDB, or Redshift.
1. Elastic clusters
[Chart: cluster capacity over time; peak capacity, 10 hours vs. 6 hours.]
2. Rapid, tuned provisioning
Tedious. Remove undifferentiated heavy lifting.
3. Hadoop all the way down
Robust ecosystem: databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...
4. Agility for experimentation
Instance choice. Stay flexible on instance type & number.
5. Cost optimizations
Built for Spot. Name-your-price supercomputing.
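"Name-your-price supercomputing" comes down to simple cost arithmetic: the same node-hours at the Spot market price instead of the on-demand rate. A sketch of that arithmetic with assumed, illustrative prices (real on-demand and Spot rates vary by instance type, region, and the current Spot market):

```python
def cluster_cost(nodes, hours, hourly_price):
    """Total cluster cost: nodes * hours * price per node-hour."""
    return nodes * hours * hourly_price

# Illustrative prices only, not real AWS rates.
ON_DEMAND = 0.50  # $/node-hour (assumed)
SPOT = 0.15       # $/node-hour (assumed)

on_demand_total = cluster_cost(nodes=100, hours=6, hourly_price=ON_DEMAND)
spot_total = cluster_cost(nodes=100, hours=6, hourly_price=SPOT)
savings = 1 - spot_total / on_demand_total
print(f"${on_demand_total:.2f} on-demand vs ${spot_total:.2f} Spot "
      f"({savings:.0%} savings)")
```

Because Spot instances can be reclaimed, this pricing suits interruptible batch analytics like Hadoop jobs, where lost task nodes are simply re-run.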
1. Elastic clusters
2. Rapid, tuned provisioning
3. Hadoop all the way down
4. Agility for experimentation
5. Cost optimizations
Vin Sharma [email protected]
Director, Product Strategy & Marketing
Big Data Software, Intel Corporation
Analysis of Data Can Transform Society
Create new business models and improve organizational processes.
Enhance scientific understanding, drive innovation, and accelerate medical cures.
Increase public safety and improve energy efficiency with smart grids.
Intel’s Vision to Democratize Big Data
Unlock value in silicon. Support open platforms. Deliver software value.
Intel at the Intersection of Big Data
HPC: enabling exascale computing on massive data sets.
Cloud: helping enterprises build open, interoperable clouds.
Open source: contributing code and fostering the ecosystem.
Intel® Technology at the Heart of the Cloud
Server
Storage
Network
Scale-Out Big Data
Compute Platform Optimization
Cost-effective performance
• Intel® Advanced Vector Extensions Technology
• Intel® Turbo Boost Technology 2.0
• Intel® Advanced Encryption Standard New Instructions Technology
Intel® Advanced Vector Extensions Technology
• Newest in a long line of processor instruction innovations
• Increases floating-point operations per clock, up to 2X performance1
1 : Performance comparison using Linpack benchmark. See backup for configuration details.
For more legal information on performance forecasts go to http://www.intel.com/performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel® Turbo Boost Technology 2.0
More performance: higher turbo speeds maximize performance for single- and multi-threaded applications.
Intel® Advanced Encryption Standard New Instructions (AES-NI)
• Processor assistance for performing AES encryption: 7 new instructions
• Makes enabled encryption software faster and stronger
The Power of Intel® Platform Solutions: richer user experiences
[Chart: TeraSort for a 1 TB sort, reduced from 4 hours to 10 minutes through successive 50%, 80%, 50%, and 40% reductions: Intel® Xeon® Processor E5-2600, solid-state drives, 10G Ethernet, and Intel® Apache Hadoop, vs. the previous Intel® Xeon® processor platform.]
The Virtuous Cycle of User Experience: clients, cloud, and intelligent systems.
4. The Big Data Ecosystem
Data, data, everywhere... but it is stored in silos: S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, and on-premises.
“How do I get my data to the cloud?”
Data mobility
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
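Of the options above, multipart upload to S3 moves large objects by splitting them into numbered parts that can be uploaded in parallel and retried independently. A sketch of just the chunking logic (the upload calls themselves belong to the S3 API; part sizes here are tiny for illustration, whereas real S3 parts must be at least 5 MB except the final part):

```python
def split_into_parts(data: bytes, part_size: int):
    """Split an object into fixed-size parts, numbered from 1 as the
    S3 multipart API numbers them; the last part may be smaller."""
    return [
        (part_number, data[offset:offset + part_size])
        for part_number, offset in enumerate(
            range(0, len(data), part_size), start=1)
    ]

# Tiny illustrative payload and part size.
payload = b"x" * 10
parts = split_into_parts(payload, part_size=4)
print([(n, len(chunk)) for n, chunk in parts])  # [(1, 4), (2, 4), (3, 2)]
```

Since each part carries its own number, a failed part can be re-sent alone, and S3 reassembles the parts in order when the upload completes.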
“How do I integrate my data for
maximum impact?”
The same silos (S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, and on-premises stores) need to be connected.
AWS Data Pipeline
Announced in November, available now.
Orchestration for data-intensive workloads.
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage temporary compute resources
Anatomy of a pipeline
Additional checks and notifications
Arbitrarily complex pipelines
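The execution-and-retry logic listed above is the heart of any pipeline runner: attempt an activity, and on failure retry up to a limit before giving up. A generic sketch of that idea (this is not the Data Pipeline API; the function and its parameters are illustrative):

```python
import time

def run_with_retries(activity, max_attempts=3, backoff_seconds=0.0):
    """Run an activity, retrying on failure up to max_attempts times,
    as a pipeline runner's execution/retry logic might."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_seconds)  # back off before retrying

# A flaky activity that fails twice before succeeding.
calls = {"n": 0}
def copy_activity():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "copied"

print(run_with_retries(copy_activity))  # "copied" on the third attempt
```

In a real pipeline the failure branch would also fire the checks and notifications mentioned above, rather than retrying silently.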
aws.amazon.com/datapipeline
aws.amazon.com/big-data
Summary
1. Introducing Big Data
2. From data to actionable information
3. Analytics and Cloud Computing
4. The Big Data ecosystem
Get 600 hours of free supercomputing time!
www.powerof60.com