(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jagmeet Chawla, Chief Architect, The Weather Channel
Raul Frias, Solutions Architect, AWS
October 2015
Scaling to 25 Billion Daily Requests Within 3 Months
Building a Global Big Data Distribution Platform
ARC346
What to Expect from the Session
Building a Big Data Distribution Platform:
- Goals
- Architecture
- Logical and Physical Components
- Data Supply Chain, from Ingest to Distribution
- Journey
- Building, Tuning and Scaling the Platform
- AWS Insights
- Evolution of the Architecture
Audience:
- Engineering Leaders
- Architects
Background: The Weather Company
We power weather for Apple, Facebook, Google, Microsoft, Twitter, Yahoo and many more.
Our B2B Division, WSI, has 4,600+ B2B clients in 60 countries.
WHERE THE WORLD GETS ITS WEATHER
- #1 Most Distributed Cable Network
- 170M+ App Downloads
- 47.2M Unduplicated Monthly Uniques
- 124M+ Monthly Unique
- 72% visit 2x or more Daily
Background: A Data Company
Data
- Network of 100K+ weather sensors
- Global Lightning Detection Network
- Global Radar & Location Data
- Largest Collection of Weather Data
- State-of-the-Science Forecasts
Technologies
- Industry Best Forecast Modeling
- Proprietary Radar Algorithms
- Proprietary Weather Analytics
- 220+ Full-time Meteorologists
TWC Content (Video, Images, Articles)
Weather APIs, Content APIs
- 20+ TB Data Daily
- 800+ Sources of Ingest
- 40+ Billion API Requests Daily
Background: About Data
Weather Data
- Observations
- Forecasts
- Radar
- Alerts
- Notices
- Emergency Bulletins
- Health & Lifestyle
Content
- Articles
- Images
- Slide Shows
- Videos
- Maps
Domain Specific
- Aviation
- Energy
- Insurance
Background: Big Data
- Push/Pull, every 5 minutes
- Real Time Alerts & Notification
- World’s most volatile atmospheric data
- 15-20 sec. to prepare and serve
- 800+ Partners
- 50+ GB Raw compressed data
- Several Billion Requests / day
Big Data: Variety, Volume, Velocity
Textual data, structured, unstructured, binary data, pictures, images, videos
Background: About Distribution
Digital
- Weather.com, Wunderground.com
- Mobile Apps on all Major Mobile OS Platforms
Partnerships
- Major Mobile Phone Company
- Major Search Engine
- Many Others …
B2B
- Major Airlines
- Energy Trading Desks
- Many Others …
40+ Billion API Requests / day
Expect 60 Billion / day by EOY 2015
The Dark Ages: Before The Cloud
- Run From TWC Data Centers
- Slow Time To Market
- Product
- Content
- Limited Distributed Scaling
- Limits of our existing Data Centers
- Batch Based Forecast Systems
- Java Based Monolithic Applications
- Big Web, Mobile Web
- Data Services
- Homegrown CMS
Reboot & Reimagine: Goals
Business
- Build a Low Latency Global On Demand Forecasting System
- Build a Highly Scalable Global Data Distribution Platform
- Reboot Digital Properties (weather.com, Mobile Apps, CMS)
- Reduce time to deploy new data sets
- Data Distribution APIs as Product
- Secured/Metered access to APIs
- Consolidate Data Centers
Technical
- 100% cloud based
- Capable of handling billions of requests a day
- Capable of ingesting & processing Terabytes of data a day
- Low latency APIs (25-100 ms)
- Highly Scalable
- Highly Available (99.99%)
- Generic Data Processing Engine (DPE)
- Developer Friendly APIs
- Authentication, metering, and throttling
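To make the availability goal concrete, here is a small worked calculation. It assumes the 99.99 figure on the slide is a percentage measured over a 365-day year; the function name is illustrative only.

```python
# Downtime budget implied by an availability target.
# Assumption: the target is a percentage over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the minutes per year a service may be down at this target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Under these assumptions, 99.99% leaves a bit under an hour per year.
print(f"{allowed_downtime_minutes(99.99):.2f} minutes/year")
```

A budget this small is one reason the later slides stress destructive testing across instances, AZs, and regions.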
Architecture: Component Layers
- Large Undertaking – Divide & Conquer
- Loosely Coupled Layered Architecture
- Focus on your Core Competency
- Best Tool/Technology for the job
- Independent Delivery Timelines
- DATA PLATFORM: Weather Data Distribution As A Service
- Eat your own dog food!
Layers (diagram): Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN
Architecture: Data Processing Engine (DPE)
- Generic DPE
- API Driven
- Data Agnostic
- Extensible
- Always on, Always flowing
- Asynchronous, Non Blocking
- High availability
- Low latency
- Horizontal scalability
Architecture: Data Processing Engine (DPE)
Diagram labels: Push/Pull Data Providers; IAPI; RabbitMQ; DPE (DPE Core; Plugin 1, Plugin 2, Plugin 3); Redis; Riak; S3; RabbitMQ; System Of Record (e.g. Forecast On Demand)
DPE Architecture
- DPE Core
- Custom Plugins for Process, Download, Store, Archive
Technical Stack
- Java 1.7
- Storage (Redis)
- Archive (Riak, S3)
- Distribution – RabbitMQ
- OS: Amazon Linux (CentOS 6 variant)
Ingestion API
- RESTful Web Service
Messaging Queue
- RabbitMQ Cluster
Workers
- DPE
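The slide describes a DPE core with custom plugins for download, process, store, and archive, but does not show code. The following is a minimal sketch of that idea, written in Python rather than the Java 1.7 the slide names. The class name `DPECore`, the plugin functions, and the in-memory `STORE` dict (standing in for Redis) are all hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Plugin = Callable[[Message], Message]

# Hypothetical in-memory stand-in for the real-time store (Redis on the slide).
STORE: Dict[str, object] = {}


class DPECore:
    """Hypothetical core: runs registered plugins in order on each message."""

    def __init__(self) -> None:
        self._plugins: List[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self._plugins.append(plugin)

    def handle(self, message: Message) -> Message:
        # Each plugin receives the previous plugin's output.
        for plugin in self._plugins:
            message = plugin(message)
        return message


def download(msg: Message) -> Message:
    # Pretend to fetch the raw payload from the provider named in the message.
    return {**msg, "raw": f"payload-from-{msg['source']}"}


def process(msg: Message) -> Message:
    # Placeholder transformation; real processing is data-set specific.
    return {**msg, "data": str(msg["raw"]).upper()}


def store(msg: Message) -> Message:
    STORE[str(msg["source"])] = msg["data"]
    return msg


core = DPECore()
for plugin in (download, process, store):
    core.register(plugin)

result = core.handle({"source": "radar"})
print(result["data"])
```

Keeping the core data agnostic means a new data set only needs new plugins, which fits the "Generic DPE" and "Extensible" goals listed earlier.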
Architecture: Data Flow (DPE)
Diagram labels: AZ A, AZ B; Public Subnet (x2); Private Subnet (x4); IAPI Endpoint; RabbitMQ Cluster; Data Processing Engine; Data Publisher
Architecture: Storage
- Polyglot Architecture
- Best Store for the Job
- Most Cost Effective Storage for the Job
- BYOS: Bring Your Own Store
- Cache Rich!
Architecture: Storage Polyglot
- Archive
- Images
- Videos
Bucket
Key/Value
Master
Slaves
- Real-time Data and Caching
Key/Value
Nodes
Key/Value
- Historical Weather Archive
- Data Migration
- Gateway Data
- Analytics
Nodes
Columnar
- Analytics
Parquet Columnar Storage
Repositories: MySQL, SQL Server
- Informatica
- Drupal
Architecture: Cache is your friend!
- Edge Caching, Edge Compute: CDN
- Origin Cache: Varnish on EC2
- App Cache: Key/Value store (with data types for values), Master/Slaves, next to App Instances on EC2
- Make Sure All Data Elements are TTL Driven
- Always Respect Cache-Control Headers
- And Keep It Simple!
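A minimal sketch of what "TTL driven" and "respect Cache-Control" can look like in an application cache. This is an illustration, not the team's implementation: `TTLCache` and `max_age` are hypothetical names, and only the `max-age` directive is parsed.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def max_age(cache_control: str) -> Optional[int]:
    """Extract max-age (seconds) from a Cache-Control header value, if present."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


class TTLCache:
    """Every entry carries an expiry time; expired entries are never served."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict on read
            return None
        return value


cache = TTLCache()
ttl = max_age("public, max-age=300")
if ttl is not None:
    cache.set("/forecast/ATL", {"high_c": 25}, ttl)
print(cache.get("/forecast/ATL"))
```

Because expiry is the only invalidation mechanism here, there is no purge path to coordinate between layers, which is one way to read "Keep It Simple".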
Architecture: Systems Of Record
- Let the system designers focus on the problem they are trying to solve
- Let them pick the best technology
- Just make sure they interface using standard protocols
- Let DPE handle Ingest
- Let Services Layer handle Distribution
- Support both Push/Pull model for publication to distribution engine
Architecture: Systems of Record
Forecast On Demand: GET Model
- Data Services GET from Forecast On Demand on cache miss
Currents On Demand: GET Model
- Data Services GET from Currents On Demand on cache miss
CMS: POST Model
- Content Management System POSTs to a Data Services RESTful endpoint on publish
Architecture: Data Services
- RESTful API Design
- Stateless
- Decoupled
- Atomic / Aggregation Services
- Support both Push/Pull Model
- API Key driven Auth/Metering
- Horizontally Scalable
- Capable of serving billions of requests / day
- Data lends itself well to caching
Architecture: Distribution – Weather Data
Diagram labels: API Users; CDN; API Gateway; OAPI; Aggregate Engine; FOD Dispatcher; FOD Cache; COD Dispatcher; COD Cache; Redis; Riak
Outbound API (OAPI)
- Fine grained RESTful API
- Intelligent Cache Management
- Accesses datastores, systems of record and other services
Aggregate Engine
- Aggregates fine grained APIs
- Aggregates at Edge through CDN ESI
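As a rough illustration of an aggregate call built from fine-grained APIs, the sketch below fans out to several fine-grained fetchers in parallel and merges the results. It only covers server-side aggregation, not the CDN ESI path the slide also mentions. The fetcher functions, their payloads, and the `FINE_GRAINED` registry are hypothetical stand-ins for real OAPI endpoints.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical fine-grained fetchers; real ones would call OAPI endpoints.
def fetch_current(location: str) -> dict:
    return {"location": location, "temp_c": 21}


def fetch_forecast(location: str) -> dict:
    return {"location": location, "high_c": 25, "low_c": 14}


FINE_GRAINED: Dict[str, Callable[[str], dict]] = {
    "current": fetch_current,
    "forecast": fetch_forecast,
}


def aggregate(location: str, parts: List[str]) -> Dict[str, dict]:
    """Call each requested fine-grained API in parallel and combine by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        futures = {name: pool.submit(FINE_GRAINED[name], location) for name in parts}
        return {name: future.result() for name, future in futures.items()}


print(aggregate("ATL", ["current", "forecast"]))
```

Keeping each part independently cacheable lets the aggregate response reuse whatever the fine-grained caches already hold.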
Architecture: Request Flow
Diagram labels: Internet; AZ A, AZ B; Public Subnet (x2); Private Subnet (x2); Distribution Services; OAPI (x2); FOD; COD; FOD Cache; COD Cache
Architecture: Distribution – Content (Articles, Images, Video)
Diagram labels: Drupal CMS; Metadata Store; Asset Metadata; Images; Videos; Image Cut Service; Video Distribution Services; Generic Asset Service; mRSS Feeds; Static Asset Pools; S3
Architecture: Gateway
- Authentication
- Routing
- Metering
- Throttling
- CDN Aware, CDN Driven
- Remember 25ms latency target!
- We rolled our own
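The slides do not say how the home-grown gateway meters or throttles. One common technique is a token bucket per API key; the sketch below shows that technique as an assumption, with hypothetical names (`TokenBucket`, `Gateway`) and an in-memory counter standing in for metering.

```python
import time
from collections import Counter
from typing import Callable, Dict


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class Gateway:
    """Hypothetical per-key metering and throttling."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._capacity, self._clock = rate_per_sec, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}
        self.metered: Counter = Counter()  # requests seen per API key

    def check(self, api_key: str) -> bool:
        self.metered[api_key] += 1
        bucket = self._buckets.setdefault(
            api_key, TokenBucket(self._rate, self._capacity, self._clock))
        return bucket.allow()


gateway = Gateway(rate_per_sec=1.0, capacity=2)
print([gateway.check("demo-key") for _ in range(3)])
```

The check is a few arithmetic operations on local state, which matters when the whole request has a 25 ms latency target.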
Architecture: Gateway
Diagram labels: API Users; CDN; Authentication, Metering, Throttling; Quick Response; Caching routing; Origin routing; Source of Authentication Truth
- User makes API request
- CDN checks authorization - Look Aside
- If authorized, check cache
- If cache-miss, hit origin caching/routing
- If origin cache-miss, pass through to backend servers
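The five steps above can be sketched as a single function. This is a simplified model under stated assumptions: caches are plain dicts, the look-aside authorization check is a set lookup, and `backend` is a hypothetical callable standing in for the backend servers.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def handle_request(api_key: str, path: str, authorized_keys: Set[str],
                   edge_cache: Dict[str, str], origin_cache: Dict[str, str],
                   backend: Callable[[str], str]) -> Tuple[int, Optional[str], str]:
    """Return (status, body, served_by) following the gateway request flow."""
    # 1-2. CDN checks authorization (look-aside) before touching any cache.
    if api_key not in authorized_keys:
        return 401, None, "denied"
    # 3. If authorized, check the edge cache.
    if path in edge_cache:
        return 200, edge_cache[path], "edge"
    # 4. On an edge miss, go to origin caching/routing.
    if path in origin_cache:
        edge_cache[path] = origin_cache[path]
        return 200, origin_cache[path], "origin"
    # 5. On an origin miss, pass through to the backend and fill both caches.
    body = backend(path)
    origin_cache[path] = body
    edge_cache[path] = body
    return 200, body, "backend"


edge: Dict[str, str] = {}
origin: Dict[str, str] = {}
keys = {"demo-key"}
render = lambda path: f"data for {path}"

print(handle_request("demo-key", "/v1/forecast", keys, edge, origin, render))
print(handle_request("demo-key", "/v1/forecast", keys, edge, origin, render))
print(handle_request("bad-key", "/v1/forecast", keys, edge, origin, render))
```

A real deployment would also apply TTLs to both caches; they are left out here to keep the flow itself visible.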
Architecture: The Other Side – Events & Analytics!
Diagram labels:
- Sources: Batch Sources, Streaming Sources, Events, 3rd Party, Other DBs, S3
- Ingestion: Custom Ingestion Pipeline, Amazon SQS, Streaming, ETL
- Data Lake: Long Term Raw Storage, Short Term Storage and Big Data Processing, Stream Processing
- Data Access: SQL
- Consumers: Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration
Architecture: Putting it all together
Layers (diagram): Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN
Architecture: Implementation
Diagram labels: Global Traffic Management and CDN; Global Region 1, Global Region 2, Global Region 3, Global Region 4; FOD (x4); Distribution Engine (x4); Remote Ingestion (x2); Partner Data Sources (Weather, Alerts, Traffic, etc); Monitoring; Configuration Mgmt; Automation
A curve ball!
Challenge:
• New deal struck with a MAJOR mobile phone company
• Ship new API
• Time to Market = 3 months
• Scale to 25+ billion requests per day
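For a sense of scale, the daily target can be converted to an average request rate. The calculation assumes traffic is spread evenly over 24 hours, which real traffic is not, so peak rates would be higher by some factor the slides do not give.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming a perfectly flat traffic profile."""
    return requests_per_day / SECONDS_PER_DAY


# 25 billion requests/day works out to roughly 289,000 requests/second on average.
print(f"{average_rps(25_000_000_000):,.0f} requests/second")
```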
Some findings
Architecture Already Decoupled
- Focus on Scaling Distribution Layer
Findings in Cycle:
- Load Testing / Tuning
- VPC NAT Saturation
- DNS Servers Sizing
- Instance Types and Characteristics
- OS Kernel Limits
- Destructive Testing / Fixing
- Brought Down instances, AZs, Regions
- Corrupted caches, databases
Cycle: Load Test → Tune → Destructive Test → Fix
KEY TAKEAWAY
It takes time to figure all this out, so please budget time and resources for both load and destructive testing.
Leverage AWS Managed Services
• Amazon Route 53 – DNS
• Amazon RDS – Relational DBs
• Amazon DynamoDB – NoSQL DBs
• Amazon ElastiCache – Redis or Memcached
• Amazon SQS - Queuing
• Amazon Redshift – Data Warehouse
• Amazon Kinesis – Stream Storage
• AWS Lambda – “Code as a Service”
Why RDS vs. EC2-based RDBMS
Independent of RDBMS engine:
• Licensing
• Replication
• Backups
• Updates

            MySQL, Oracle, Postgres   MS SQL   Amazon Aurora
Max. IOPS   20,000                    10,000   100,000s
Max. TBs    6                         4        64
Which NoSQL?
+ Write performance more critical than durability
+ Native multi-X replication
+ Ecosystem
– Repartitioning
– Operational burden
– Data transfer cost
+ “Zero downtime”
+ Cross-region replication
– Repartitioning
– Operational burden
– Data transfer cost
DynamoDB
+ Managed solution
+ Easy to scale
+ Constantly Evolving
– Item size
– Cross-region replication
Building a DPE – AWS Style
Stream Storage
- Decouple producers & consumers
- Temporary buffer
- Preserve client ordering
- Streaming MapReduce
Diagram: Producer 1, Producer 2, Producer 3 … Producer N write keyed records (Key = Red, Key = Green, …) numbered 1 to 4 into Shard 1 and Shard 2. Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4.
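The diagram's behavior (records with the same key land in the same shard, in order, and each consumer counts per key) can be simulated locally. Amazon Kinesis maps partition keys to shards with an MD5 hash over a 128-bit key space; the even split of that space across shards below is a simplifying assumption, and the function names are illustrative.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide

Record = Tuple[str, int]  # (partition key, producer sequence number)


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard, assuming shards split the hash space evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * shard_count // HASH_SPACE


def route(records: List[Record], shard_count: int) -> Dict[int, List[Record]]:
    """Append each record to its shard; appending preserves arrival order."""
    shards: Dict[int, List[Record]] = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, shard_count)].append((key, seq))
    return shards


def count_by_key(shard_records: List[Record]) -> Counter:
    """The 'streaming MapReduce' step: one consumer counts the keys it sees."""
    return Counter(key for key, _ in shard_records)


keys = ["Red", "Green", "Blue", "Violet"]
records = [(key, seq) for seq in range(1, 5) for key in keys]
shards = route(records, shard_count=2)

totals: Counter = Counter()
for shard_id, shard_records in sorted(shards.items()):
    per_shard = count_by_key(shard_records)
    totals.update(per_shard)
    print(f"shard {shard_id}: {dict(per_shard)}")
```

Which keys share a shard depends on the hash, so the grouping printed here need not match the pairing drawn on the slide.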
Which Stream Store Should I Use?
Amazon Kinesis and Apache Kafka have many similarities
• Multiple consumers
• Ordering of records
• Streaming MapReduce
• Low latency
• Highly durable, available, and scalable
Differences
• Record lifetime: 24 hours in Amazon Kinesis, configurable in Kafka
• Record size: 1MB/record in Amazon Kinesis, configurable in Kafka
• Amazon Kinesis is a fully managed service
• Easier to provision, manage, and scale
Server-less Approach to DPE
Data Input → Amazon Kinesis (capture the stream) → Action → AWS Lambda (process the stream) → Data Output

Data Input                Action     Data Output
IT application activity   Audit      SNS
Metering records          Condense   Redshift
Change logs               Backup     S3
IoT Device Data           Store      RDS
Transaction orders        Process    SQS
Server health metrics     Monitor    EC2
Architectural Evolution: Micro-services Approach
Diagram labels: User; GTM/CDN; Common Services Layer (Router & Controller, Auth & Metering); Varnish (x4); Forecast; Aggregation; Location; Lifestyle; Micro DPE; Storage Polyglot
Architectural Evolution: Technical Stack
Ingest
- Queue: Amazon SQS
- Stream: Kafka
- Micro DPE: Avro, Thrift, Protocol Buffers
- Micro-Services Type of Model For Ingest
Distribution
- Micro Services
- Language Polyglot
- Service Discovery
Storage
- Amazon Aurora
- BYOS
Analytics
- Parquet + Amazon S3
- Spark
- Amazon EMR
Wrapping Up!
- Have an Architectural Blueprint
- Keep Decoupled or Loosely Coupled Layers
- Communication via Standard Protocols
- Keep Architectural Plan “Technology Agnostic”
- Storage Polyglot
- Language Polyglot
- Be Aware of the Monoliths!
- Keep Caching Architecture Simple – TTL Driven
- Always Budget for
- Load Testing
- Destructive Testing
Related Sessions
ARC309 - From Monolithic to Microservices: Evolving Architecture Patterns in the Cloud - Thursday
ARC301 - Scaling Up to Your First 10 Million Users - Thursday
BDT310 - Big Data Architectural Patterns and Best Practices on AWS - Today 2:45 PM
BDT403 - Best Practices for Building Real-time Streaming Applications with Amazon Kinesis - Thursday