Getting Started with Real-time Analytics
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Getting Started with Real-time Analytics & Real-time Game Analytics at GREE International
Rahul Bhartia - Solution Architect, Amazon Web Services
Kandarp Shah - Engineering Manager, GREE International
Agenda
• Real-time analytics
– Data ingestion
– Data processing
• GREE International
– Analytics architecture
– Lessons learned
• Takeaway
Real-time analytics
Real-time ingestion
• Highly scalable
• Durable
• Elastic
• Re-playable reads
Continuous processing
• Load-balancing incoming streams
• Fault-tolerance, check-pointing and replay
• Elastic
• Enables multiple apps to process in parallel
Continuous data flow
Low end-to-end latency
Continuous, real-time workloads
Amazon Kinesis – managed stream
[Figure: a stream and its shards; data records with a partition key and sequence numbers (14, 17, 18, 21, 23); workers producing "My top 10" and "Global top 10"]
Amazon Kinesis – common data broker
[Figure: data sources (example.com) send to an AWS endpoint; the Amazon Kinesis stream (Shard 1, Shard 2, Shard N) spans three Availability Zones; App. 1 [Data archive], App. 2 [Metric extraction], App. 3 [Sliding-window analysis], and App. 4 [Machine learning] read from it; destinations shown are Amazon S3, Amazon DynamoDB, Amazon Redshift, and Amazon EMR]
Amazon Kinesis – stream and shards
• Stream: A named entity to capture and store data
• Shards: Unit of capacity
• Put – 1 MB/sec or 1000 TPS per shard
• Get – 2 MB/sec or 5 TPS per shard
• Scale by adding or removing shards
• Replay in 24-hr. window
How to size your Amazon Kinesis stream
Consider 2 producers, each producing 2 KB records at 500 TPS:
• Each producer: 2 KB * 500 TPS = 1000 KB/s
• Minimum of 2 shards for ingress of 2 MB/s
• 2 applications can read with egress of 4 MB/s
How to size your Amazon Kinesis stream
Consider 3 consuming applications, each processing the data:
• Producers are unchanged: 2 producers at 2 KB * 500 TPS = 1000 KB/s each
• 3 applications reading 2 MB/s each need 6 MB/s of egress; 2 shards provide only 4 MB/s
• Simple! Add another shard to the stream to spread the load
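The two sizing slides can be checked with a short calculation. This is a minimal sketch, assuming the per-shard limits quoted earlier (1 MB/sec or 1000 TPS in, 2 MB/sec out), 1 MB taken as 1000 KB as in the slide arithmetic, and every application reading the full stream. The function name and parameters are illustrative, and the 5 reads/sec limit per shard is ignored.

```python
import math

# Per-shard limits from the slides (1 MB treated as 1000 KB, as the slide arithmetic does).
PUT_KB_PER_SEC = 1000       # 1 MB/sec ingress per shard
PUT_RECORDS_PER_SEC = 1000  # 1000 TPS ingress per shard
GET_KB_PER_SEC = 2000       # 2 MB/sec egress per shard


def shards_needed(producers, record_kb, tps_per_producer, consumers):
    """Return the minimum shard count that satisfies the ingress and egress limits."""
    ingress_kb = producers * record_kb * tps_per_producer
    ingress_tps = producers * tps_per_producer
    egress_kb = ingress_kb * consumers  # assumption: every application reads everything
    return max(
        math.ceil(ingress_kb / PUT_KB_PER_SEC),
        math.ceil(ingress_tps / PUT_RECORDS_PER_SEC),
        math.ceil(egress_kb / GET_KB_PER_SEC),
        1,
    )


# 2 producers, 2 KB records, 500 TPS each, 2 applications -> 2 shards
print(shards_needed(producers=2, record_kb=2, tps_per_producer=500, consumers=2))
# Same producers with 3 applications -> 3 shards
print(shards_needed(producers=2, record_kb=2, tps_per_producer=500, consumers=3))
```

With two applications the ingress limit decides the answer; with three, egress becomes the bottleneck, which is why the slide adds a shard.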
Amazon Kinesis – distributed streams
• From batch to continuous processing
• Scale UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
– Records stored across multiple Availability Zones
• Run multiple parallel Amazon Kinesis applications
[Figure: batch, micro-batch, real time]
Pattern for real-time analytics…
[Figure: data streams feed streaming analytics (Spark-Streaming, Apache Storm, Amazon KCL), which drives notifications & alerts, APIs, and dashboards/visualizations; a data archive feeds batch analysis (data warehouse, Hadoop), deep learning, and dashboards/visualizations]
Real-time analytics
• Streaming
– Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
– Operational insights within minutes; for example, monitoring transactions from different regions
Amazon Kinesis Client Library (Amazon KCL)
• Distributed to handle multiple shards
• Fault tolerant
• Elastically adjusts to shard count
• Helps with distributed processing
[Figure: an Amazon Kinesis stream read by the Kinesis Client Library on three Amazon EC2 instances]
Amazon KCL design components
• Worker: The processing unit that maps to each application instance
• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream
• Check-pointer: Keeps track of the records that have already been processed in a given shard
Amazon KCL restarts the processing of the shard at the last-known processed record if a worker fails
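The restart behavior described above can be illustrated with a small in-memory simulation. This is a sketch of the idea only, not the Amazon KCL API: the class names, the per-record checkpointing, and the simulated failure are all assumptions made for illustration. The sequence numbers are the ones shown on the managed-stream slide.

```python
class CheckPointer:
    """Remembers the last processed sequence number per shard (hypothetical, in-memory)."""

    def __init__(self):
        self._last = {}

    def checkpoint(self, shard_id, sequence_number):
        self._last[shard_id] = sequence_number

    def last_processed(self, shard_id):
        return self._last.get(shard_id)


class RecordProcessor:
    """Processes the records of one shard and checkpoints after each record."""

    def __init__(self, shard_id, checkpointer, fail_at=None):
        self.shard_id = shard_id
        self.checkpointer = checkpointer
        self.fail_at = fail_at  # sequence number at which to simulate a crash
        self.processed = []

    def run(self, records):
        last = self.checkpointer.last_processed(self.shard_id)
        for seq, data in records:
            if last is not None and seq <= last:
                continue  # already processed before the restart
            if seq == self.fail_at:
                raise RuntimeError("simulated worker failure")
            self.processed.append(data)
            self.checkpointer.checkpoint(self.shard_id, seq)


shard = [(14, "a"), (17, "b"), (18, "c"), (21, "d"), (23, "e")]
cp = CheckPointer()

first = RecordProcessor("shard-1", cp, fail_at=18)
try:
    first.run(shard)
except RuntimeError:
    pass  # the worker died after processing sequence numbers 14 and 17

second = RecordProcessor("shard-1", cp)  # replacement worker for the same shard
second.run(shard)
print(first.processed, second.processed)
```

The replacement processor resumes after the last checkpoint, so no record is processed twice here. A processor that checkpoints less often would reprocess the records since its last checkpoint.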
Amazon Kinesis Connector Library
• Amazon S3
– Archival of data
• Amazon Redshift
– Micro-batching loads
• Amazon DynamoDB
– Real-time counters
• Elasticsearch
– Search and index
[Figure: Amazon Kinesis feeding S3, DynamoDB, and Amazon Redshift]
EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis
• Real-time sources into batch-oriented systems
• Multi-application support & check-pointing
Spark streaming – Basic concepts
• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)
[Figure: a receiver takes in messages and produces a DStream of RDDs (RDD@T1, RDD@T2)]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
Apache Storm: Basic concepts
• Streams: Unbounded sequence of tuples
• Spout: Source of stream
• Bolts: Processes that consume input streams and produce new output streams
• Topologies: Network of spouts and bolts
https://github.com/awslabs/kinesis-storm-spout
Putting it together…
[Figure: a producer writes to Amazon Kinesis; Amazon KCL applications feed an app client, DynamoDB, S3 with EMR, and Amazon Redshift with BI tools, covering batch, micro-batch, and real-time processing]
• Best Practices for Micro-Batch Loading on Amazon Redshift
• Implement a Real-time, Sliding-Window Application Using Amazon Kinesis and Apache Storm
• Visualizing Real-time, Geotagged Data with Amazon Kinesis
A Global Gaming Powerhouse
• GREE Headquarters: Tokyo, Japan
• GREE International, Inc.: San Francisco, CA
• GREE Canada: Vancouver, BC
QUICK FACTS
• 6 continents playing GREE games
• 1,882 employees worldwide
• 13 games made in North America
MILESTONES
• 2004, 2011, 2013
GAME STATS – 4 titles in top 100 grossing*
• Crime City (Studios)
– Reached Top 10 Grossing in 140 countries
– Top 100 Grossing in 19 countries, over 3 years since launch
• Knights & Dragons (Publishing)
– Reached Top 10 Grossing in 41 countries
– Top 100 Grossing in 22 countries
*As of Sep. 2014 – Source: App Annie
Analytics @ GREE
• Ad Clicks
• Downloads
• Perf Data
• Attribution
• Campaign Performance
• SC Balance
• HC Balance
• IAP
• Player Targeting
Data collection
Source of data
• Mobile devices
• Game servers
• Ad networks
Data size & growth
• 500 GB+/day
• 500 M+ events/day
• Size of event ~ 1 KB
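The three figures above are consistent with each other, as a quick calculation shows. This sketch uses decimal units and assumes traffic is spread evenly over the day, which real game traffic is not, so the averages are only a lower bound on peak load.

```python
EVENTS_PER_DAY = 500_000_000   # 500 M+ events/day (slide figure)
EVENT_SIZE_KB = 1              # ~1 KB per event (slide figure)
SECONDS_PER_DAY = 24 * 60 * 60

daily_gb = EVENTS_PER_DAY * EVENT_SIZE_KB / 1_000_000  # KB -> GB, decimal units
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY
avg_kb_per_sec = avg_events_per_sec * EVENT_SIZE_KB

print(f"{daily_gb:.0f} GB/day")                                  # 500 GB/day
print(f"{avg_events_per_sec:.0f} events/sec on average")         # 5787 events/sec
print(f"{avg_kb_per_sec / 1000:.1f} MB/sec before compression")  # 5.8 MB/sec
```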
Analytics Data
{"player_id":"323726381807586881","player_level":169,"device":"iPhone 5","version":"iOS 7.1.2","platfrom":"ios","client_build":"440","db":"mw_dw_ios","table":"player_login","uuid":"1414566719-rsl3hvhu7o","time_created":"2014-10-29 00:11:59"}
Key requirements
• Guaranteed data delivery
• Zero data loss
• Zero data corruption
• Ease of adding consumers
• Near real-time data latency
• Real-time ad-hoc analysis
• Managed service
Analytics architecture
[Figure: the game DB and game servers send JSON events to Amazon Kinesis; an S3 consumer writes DSV files to Amazon S3 for Amazon Redshift and Amazon EMR; a real-time stats consumer writes to Amazon ElastiCache (Redis), which backs a dashboard]
Amazon Kinesis sender
[Figure: the sender reads analytics files into a buffer, compresses (50 KB), and sends with PutRecord to an Amazon Kinesis stream (Shard 1, Shard 2, Shard 3, Shard n); Describe Stream is used to sync shards]
Design choices for sender
• Single-stream vs. stream per game
• Batch vs. single event
• Compressed vs. uncompressed
• PartitionKey vs. ExplicitHashKey
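The last design choice is easier to weigh with the routing rule in view: Kinesis maps a partition key to a 128-bit integer with MD5 and delivers the record to the shard whose hash key range contains it, while an explicit hash key supplies that integer directly. The sketch below imitates this locally. It assumes the shards split the key space evenly, and the helper names are illustrative rather than part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer hash key


def shard_ranges(n_shards):
    """Split the hash key space evenly into n contiguous (start, end) ranges."""
    step = HASH_SPACE // n_shards
    return [
        (i * step, HASH_SPACE - 1 if i == n_shards - 1 else (i + 1) * step - 1)
        for i in range(n_shards)
    ]


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def shard_for(ranges, partition_key=None, explicit_hash_key=None):
    """Pick the shard whose range contains the hash key; an explicit key skips hashing."""
    key = explicit_hash_key if explicit_hash_key is not None else hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("hash key outside the 128-bit key space")


ranges = shard_ranges(4)
# Routed by partition key: the shard depends on the MD5 of the key.
print(shard_for(ranges, partition_key="323726381807586881"))
# Routed by explicit hash key: the sender targets shard index 2 directly.
print(shard_for(ranges, explicit_hash_key=ranges[2][0]))
```

A partition key keeps all events for one key on one shard. An explicit hash key lets a sender that already knows the shard ranges spread batches across shards on purpose.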
Consumer – Amazon S3 store in DSV format
[Figure: consumers in an Auto Scaling group run Amazon KCL record processors against Shard 1, Shard 2, Shard n of the Amazon Kinesis stream; stages shown for each record are decompress, de-dupe, validation, target table, DSV transformation, buffer (flushed on size/timeout), and compress; output goes to S3 and a file metadata DB]
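The decompress and de-dupe stages of the consumer might look like the following. This is a sketch under assumptions the slides do not state: records are zlib-compressed, each holds newline-delimited JSON events, and the event's uuid field is the de-duplication key. The second uuid in the sample data is made up.

```python
import json
import zlib


def decode_record(blob):
    """Decompress one record and return its events (one JSON object per line)."""
    text = zlib.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


def dedupe(events, seen):
    """Yield events whose uuid has not been seen; 'seen' persists across records."""
    for event in events:
        if event["uuid"] in seen:
            continue
        seen.add(event["uuid"])
        yield event


events = [
    {"uuid": "1414566719-rsl3hvhu7o", "table": "player_login"},
    {"uuid": "1414566720-aaaaaaaaaa", "table": "player_login"},  # hypothetical uuid
]
blob = zlib.compress("\n".join(json.dumps(e) for e in events).encode("utf-8"))

seen = set()
first = list(dedupe(decode_record(blob), seen))
retry = list(dedupe(decode_record(blob), seen))  # the same record delivered twice
print(len(first), len(retry))  # 2 0
```

An in-memory set only works within one process; a fleet of consumers would need a shared or bounded store for the seen keys.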
Loading data into Amazon Redshift
[Figure: using file status from the file metadata DB, each load creates a manifest of S3 files, executes COPY into Amazon Redshift, and updates the status in a transaction; the create manifest and execute COPY steps are shown three times]
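The create manifest and execute COPY steps can be sketched as two string builders. The manifest layout (an "entries" list of "url" and "mandatory" pairs) is the format Amazon Redshift COPY reads. The bucket, file names, table, and IAM role below are hypothetical, and the gzip compression and pipe delimiter are assumptions, since the slides do not say which delimiter the DSV files use.

```python
import json


def build_manifest(s3_urls):
    """Build a Redshift COPY manifest listing the exact files to load."""
    return json.dumps(
        {"entries": [{"url": url, "mandatory": True} for url in s3_urls]},
        indent=2,
    )


def build_copy_sql(table, manifest_url, iam_role):
    """Build a COPY statement for gzip-compressed, pipe-delimited files (assumed format)."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "MANIFEST GZIP DELIMITER '|';"
    )


# Hypothetical file list, as it might come from the file metadata DB.
files = [
    "s3://example-bucket/player_login/part-0001.gz",
    "s3://example-bucket/player_login/part-0002.gz",
]
manifest = build_manifest(files)
sql = build_copy_sql(
    "player_login",
    "s3://example-bucket/manifests/load-0001.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftLoad",
)
print(manifest)
print(sql)
```

Listing files in a manifest means a load covers exactly the files recorded in the metadata DB, so a retried load does not pick up extra files that arrived in the meantime.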
Consumer – Real-time stats
[Figure: consumers in an Auto Scaling group run Amazon Kinesis Client Library record processors against Shard 1, Shard 2, Shard n of the Amazon Kinesis stream; stages shown for each record are decompress, de-dupe, target table, and filter events; a configuration defines metric, segment & value, and timeslot; results go to ElastiCache (Redis) and the dashboard]
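One way to read "metric, segment & value, timeslot" is as the parts of a counter key. The sketch below follows that reading; the configuration format, key layout, and one-minute timeslot are assumptions, and a Counter stands in for Redis, where each key would be incremented with INCR.

```python
from collections import Counter
from datetime import datetime

# Hypothetical configuration: which table feeds which metric, and the segment to split by.
CONFIG = [{"metric": "logins", "table": "player_login", "segment": "device"}]
TIMESLOT_FORMAT = "%Y-%m-%d %H:%M"  # one-minute timeslots (an assumption)


def counter_keys(event, config=CONFIG):
    """Yield one counter key per configured metric that the event matches."""
    created = datetime.strptime(event["time_created"], "%Y-%m-%d %H:%M:%S")
    slot = created.strftime(TIMESLOT_FORMAT)
    for rule in config:
        if event["table"] != rule["table"]:
            continue  # filter out events this metric does not use
        value = event.get(rule["segment"], "unknown")
        yield f"{rule['metric']}:{rule['segment']}={value}:{slot}"


counters = Counter()  # stand-in for Redis
events = [
    {"table": "player_login", "device": "iPhone 5", "time_created": "2014-10-29 00:11:59"},
    {"table": "player_login", "device": "iPhone 5", "time_created": "2014-10-29 00:11:05"},
    {"table": "iap", "device": "iPhone 5", "time_created": "2014-10-29 00:11:30"},
]
for event in events:
    counters.update(counter_keys(event))
print(dict(counters))  # {'logins:device=iPhone 5:2014-10-29 00:11': 2}
```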
Lessons learned
Sender
• Decouple data generation from sending
• Batch and compress
• An HTTP 5XX response to PutRecord can result in duplicates
• Monitor ProvisionedThroughputExceededException errors
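"Batch and compress" can be sketched as a greedy packer that groups events into compressed blobs under a size cap. The 50 KB cap comes from the sender slide; treating it as a limit on the compressed record, the zlib codec, and the newline-delimited JSON layout are assumptions. The demo uses a much smaller cap and made-up events so that it produces several records.

```python
import hashlib
import json
import zlib

MAX_RECORD_BYTES = 50 * 1024  # assumed cap on one compressed record


def pack(events, max_bytes=MAX_RECORD_BYTES):
    """Greedily group events into compressed blobs no larger than max_bytes."""
    batch, blob = [], None
    for event in events:
        candidate = batch + [json.dumps(event)]
        compressed = zlib.compress("\n".join(candidate).encode("utf-8"))
        if len(compressed) > max_bytes and batch:
            yield blob  # emit the last batch that still fit
            candidate = candidate[-1:]
            compressed = zlib.compress(candidate[0].encode("utf-8"))
        batch, blob = candidate, compressed
    if batch:
        yield blob


# Hypothetical events with hash-like uuids so they do not compress to almost nothing.
events = [
    {"uuid": hashlib.md5(str(i).encode()).hexdigest(), "table": "player_login"}
    for i in range(500)
]
blobs = list(pack(events, max_bytes=512))  # small cap so the demo yields several records
print(len(blobs), max(len(b) for b in blobs))
```

Recompressing the whole batch for every event keeps the sketch short but is wasteful; a real sender would more likely flush on an uncompressed size or a timeout. A single event larger than the cap is still emitted on its own.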
Lessons learned (Cont.)
Consumer
• Use Amazon KCL
• Auto-scale and monitor load
Overall
• Provision enough shards
• Handle shutdown gracefully
• Follow AWS best practices for error retries and exponential back-off
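A retry loop with exponential back-off can be sketched as follows. This is a generic illustration, not AWS SDK code: ThrottledError stands in for a retryable failure such as a throughput-exceeded or 5XX response, the delays and attempt count are arbitrary, and the jitter is one common variant.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a retryable error such as a throughput-exceeded response."""


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call send(payload), retrying retryable failures with jittered exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out competing retries


calls = []


def flaky_send(payload):
    """Hypothetical sender that is throttled twice before succeeding."""
    calls.append(payload)
    if len(calls) < 3:
        raise ThrottledError()
    return "ok"


result = send_with_backoff(flaky_send, b"batch-1", sleep=lambda seconds: None)
print(result, len(calls))  # ok 3
```

Because a failed call may still have been accepted by the service, retries like these are one source of the duplicates mentioned under the sender lessons, which is why the consumers de-dupe.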
Takeaway
Amazon Kinesis
• Data available for processing within seconds
• Robust API, Amazon KCL, and Amazon Kinesis Connector Libraries
AWS
• Managed
• Scalable
• Cost effective
• Quick to get up and running