Transcript of AWS re:Invent 2016 | GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Page 1: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Mark McBride, Senior Software Engineer, Capital Games, Electronic Arts

Bill Weiner, SVP Operations, 47Lining

11/28/16

How EA Leveraged Amazon Redshift and AWS

Partner 47Lining to Gather Meaningful Player

Insights - GAM301

Page 2: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Speakers

Mark McBride, Senior Software Engineer, Capital Games, Electronic Arts

Bill Weiner, SVP Operations, 47Lining & Redshift Whisperer

Page 3: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

What to Expect from the Session

• Analytics Architecture

• Challenges

• Effective patterns for ingest, de-dup, aggregate, and vacuum into Redshift

• How to balance rapid ingest and query speeds

• Strategies for data partitioning / orchestration

• Best practices for schema optimization, performant data summaries, and incremental updates

• And how we built a Redshift solution to ingest 1 billion rows of data per day

Page 4: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights
Page 5: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights
Page 6: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Life Before Redshift

• External solutions

• “One size fits all” for processing all games

• Serves the needs of central teams, but no focus on the game team, no dedicated resource to us

• Lack of depth in data

• Client driven

Page 7: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Vision

• Discover how players play our game

• Drive better feature development

• Healthier operations through data

• Rapid iteration and evolution of telemetry gathering

• Decoupled from game server

• Frictionless access to data

• Easily query-able data

• Wall displays

Page 8: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture

Page 9: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Persisting to S3

[Diagram: Game clients and game servers put events to an Amazon Kinesis stream; an S3 worker consumes the stream and writes batches to an S3 bucket.]

Page 10: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Game Client

• iOS/Android clients

• Produces client-specific events like screen transitions

• Events are batched up and sent to the game server every minute

• In between flushes to the server, events are persisted to disk

• If the client crashes, events will be sent on the next session

Page 11: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture - Game Server

• EC2 Instances - Tomcat/Java

• Produces the majority of events

• Events are sent asynchronously to Kinesis

• An ActiveMQ broker is responsible for the durability of the message:

  • Persisted to disk until sent

  • Retries with exponential backoff

  • Dead-letter queue

Page 12: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture - Kinesis

• One Kinesis stream with 10 shards partitioned by event UUID

• 24 hour retention

• Provides fault tolerance to the game server: Redshift can be offline and the game server isn't impacted

• Game server batches many events into one Kinesis record on every client request

• Records are compressed

Page 13: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – S3 Kinesis Worker

• Elastic Beanstalk

• Decompress records

• Transform hierarchical JSON structure into flat structure

• Patch missing data (e.g., PlayerId)

• Clean/truncate data (e.g., 0/1 -> true/false)

• Filter out unrepairable data (e.g., bad timestamps)

• Report operational metrics

• Write to S3 when thresholds are met

Page 14: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – S3

• S3 files organized by hour: yyyy/MM/dd/HH/SequenceStart-SequenceEnd.gz

• Compressed JSON

• Long-term “truth” storage

• Cheap

Page 15: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture - S3 to Redshift

[Diagram: An Ingest Data Pipeline copies data from S3 into Amazon Redshift; an application on AWS Elastic Beanstalk performs the DeDupe & Analyze step; a second Data Pipeline handles Vacuum and the SQL ETL.]

Page 16: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Ingest Data Pipeline

• Every hour, a Data Pipeline job bulk-inserts all S3 files into the EventsDups table:

  COPY EventsDups FROM 's3://sw-prod-kinesis-events/#{format(minusHours(@scheduledStartTime,1),'YYYY/MM/dd/HH/')}'

• Monitor for failures!

• Consider manifest-driven ingest next time!
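As a sketch of the hourly bulk insert, a full COPY statement might look like the following; the IAM role, format options, and file layout details are assumptions, since the slide only shows the S3 prefix expression. A manifest-driven variant is also shown, because the slide recommends it in hindsight.

  -- Load the previous hour's gzipped JSON files into the duplicates table.
  -- (Sketch only: the IAM role ARN and options are illustrative.)
  COPY eventsdups
  FROM 's3://sw-prod-kinesis-events/2016/11/28/10/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
  FORMAT AS JSON 'auto'
  GZIP
  TIMEFORMAT 'epochsecs';

  -- Manifest-driven ingest lists the exact files to load, which makes reruns and
  -- failure handling more predictable than loading everything under a prefix.
  COPY eventsdups
  FROM 's3://sw-prod-kinesis-events/manifests/2016-11-28-10.manifest'
  IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
  FORMAT AS JSON 'auto'
  GZIP
  MANIFEST;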

Page 17: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Table Progression

[Diagram: The S3 worker writes event files to S3; an asynchronous copy of the new data lands in the Ingest table; incoming data is deduplicated within itself, then deduplicated against the Events table and inserted; finally the data is rolled up into Aggregate tables.]

Page 18: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Deduper

• Why deduplication?

• Redshift doesn't enforce uniqueness constraints

• Distributed systems are complicated. Allow for retries when in doubt

• Data Pipeline jobs can fail. Allow ingest to be rerun

Page 19: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Deduper Implementation

• It is critical to establish a proper definition of a duplicate

• Not based on all columns being the same

• Using the unique set of event-identifying columns, events can be deduplicated both within the ingest table and against the events table

Page 20: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Deduper Implementation

• Beanstalk webapp that polls EventsDups table for work

• Deduplication is performed using the following columns to establish uniqueness:

  Column               Description
  Raw Event Timestamp  Timestamp for the event
  User Id              Player identifier
  Session Id           Unique to each session
  Step                 Each event gets a unique number generated from a memcached increment operation
  Event Type           Integer unique to each event
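A minimal SQL sketch of the deduplication the Beanstalk app performs, assuming tables named eventsdups and events and the identifying columns above (column names and the temp-table approach are assumptions, not the actual implementation):

  -- 1) Collapse duplicates within the incoming batch itself, keeping one row per event key.
  CREATE TEMP TABLE ingest_unique AS
  SELECT raw_event_timestamp, user_id, session_id, step, event_type
         -- ... plus the remaining event columns
  FROM (
      SELECT *,
             ROW_NUMBER() OVER (
                 PARTITION BY raw_event_timestamp, user_id, session_id, step, event_type
                 ORDER BY ingest_time
             ) AS rn
      FROM eventsdups
  ) t
  WHERE rn = 1;

  -- 2) Anti-join against the main table so only genuinely new events are inserted.
  INSERT INTO events (raw_event_timestamp, user_id, session_id, step, event_type)
         -- ... plus the remaining event columns
  SELECT i.raw_event_timestamp, i.user_id, i.session_id, i.step, i.event_type
  FROM ingest_unique i
  LEFT JOIN events e
    ON  e.raw_event_timestamp = i.raw_event_timestamp
    AND e.user_id    = i.user_id
    AND e.session_id = i.session_id
    AND e.step       = i.step
    AND e.event_type = i.event_type
  WHERE e.user_id IS NULL;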

Page 21: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Schema: Events

  Sort Key             Description
  Ingest Time          Unix time (UTC) when the event is captured on Kinesis
  Stat Date            Raw event timestamp in yyyy-mm-dd format
  Player Id            Randomly generated UUID – distribution key
  Raw Event Timestamp  Unix time (UTC) when the event is triggered on the server
  Event Type           Integer unique to each event. 2924 = BattleSummaryEvent
  Standard Fields      Country, Device, Network, Platform...
  Event Value 1-10     For each event type a set of 10 fields can be set
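The DDL implied by this schema might look roughly like the sketch below; column names, types, and the exact sort key order are assumptions, since the talk does not show the actual CREATE TABLE statement.

  CREATE TABLE events (
      ingest_time          BIGINT   NOT NULL,  -- Unix time (UTC), set when captured on Kinesis
      stat_date            DATE     NOT NULL,  -- raw event timestamp as yyyy-mm-dd
      player_id            CHAR(36) NOT NULL,  -- randomly generated UUID
      raw_event_timestamp  BIGINT   NOT NULL,  -- Unix time (UTC), set on the game server
      event_type           INTEGER  NOT NULL,  -- e.g., 2924 = BattleSummaryEvent
      country              VARCHAR(8),
      device               VARCHAR(64),
      network              VARCHAR(32),
      platform             VARCHAR(16),
      event_value_1        VARCHAR(256),
      -- ... event_value_2 through event_value_9
      event_value_10       VARCHAR(256)
  )
  DISTKEY (player_id)
  COMPOUND SORTKEY (ingest_time, stat_date, player_id, raw_event_timestamp, event_type);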

Page 22: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Vacuum

• Why Vacuum?

• Reclaim and reuse space that is freed when you delete and update rows ... we only insert ...

• Ensure new data is properly sorted with the existing table

  • This is important for providing quality statistics to the query optimizer

• We vacuum once a day, which balances the time to vacuum against the ability to provide performant statistics
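In practice the daily maintenance can be as simple as the statements below; since the pipeline only inserts, a sort-only vacuum is a reasonable assumption (the exact options used in the talk aren't shown).

  -- Re-sort newly inserted rows into sort key order once a day.
  VACUUM SORT ONLY events;

  -- A full vacuum would also reclaim deleted space, which this insert-only table doesn't need:
  -- VACUUM FULL events;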

Page 23: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Analyze

• Why Analyze?

• Any time you add (or modify, or delete) a significant number of rows, you should run the ANALYZE command to maintain the query optimizer's statistics

• This occurs when the table is vacuumed

• We analyze on every 4th successful deduper run

• Analyze is resource intensive. Balance the time to analyze against the optimizer's ability to generate good plans
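A hedged sketch of the analyze step that runs after every fourth deduper pass; the column list is illustrative, and restricting ANALYZE to frequently filtered columns keeps it cheaper than analyzing the whole table.

  -- Refresh planner statistics for the whole table...
  ANALYZE events;

  -- ...or only for the columns most often used in predicates and joins.
  ANALYZE events (ingest_time, stat_date, player_id, event_type);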

Page 24: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – ETL – User Retention Daily

• Data Pipeline scheduled once an hour – along with many other aggregate tables

• Upsert into the table, looking back a week into the events table

• Executed after the users aggregate table is updated

  Sort Key      Description
  PlayerId      Randomly generated UUID – distribution key
  Platform      Apple/Google
  Country       US, GB...
  Stat Date     A row for every day the player has played
  Days In Game  Number of days in game
  Revenue       Summary revenue data
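The hourly upsert can be expressed as the classic Redshift delete-then-insert merge; the sketch below is an assumption about the shape of the job (table names, the staging query, and the revenue derivation are illustrative, not the actual ETL).

  BEGIN;

  -- Recompute the last week of per-player daily rows into a staging table.
  CREATE TEMP TABLE stage AS
  SELECT e.player_id,
         u.platform,
         u.country,
         e.stat_date,
         u.days_in_game,                                   -- from the (hypothetical) users aggregate table
         SUM(e.event_value_1::DECIMAL(18,2)) AS revenue    -- stands in for the real revenue logic
  FROM events e
  JOIN users u ON u.player_id = e.player_id
  WHERE e.stat_date >= DATEADD(day, -7, CURRENT_DATE)
  GROUP BY 1, 2, 3, 4, 5;

  -- Upsert: delete the overlapping rows, then insert the freshly computed ones.
  DELETE FROM user_retention_daily
  USING stage
  WHERE user_retention_daily.player_id = stage.player_id
    AND user_retention_daily.stat_date = stage.stat_date;

  INSERT INTO user_retention_daily
  SELECT * FROM stage;

  COMMIT;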

Page 25: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Scaling Growth

[Chart: event volume growth reaching 1 billion events, annotated with spikes for The Force Awakens and the worldwide launch.]

Page 26: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Technical Challenges & Solutions

Page 27: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Amazon Redshift system architecture

Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH

Two hardware platforms, optimized for data processing
• DS2: HDD; scale from 2 TB to 2 PB
• DC1: SSD; scale from 160 GB to 326 TB

[Diagram: JDBC/ODBC clients connect to the leader node; compute nodes communicate over 10 GigE (HPC) networking and handle ingestion, backup, and restore against S3.]

Page 28: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Scaling Challenges

650 minutes to vacuum. 1,550 minutes to deduplicate.

Page 29: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Goals of Sorting

• Physically sort data within blocks and throughout a table

• Enable range-restricted scans (block rejection) to prune blocks by leveraging zone maps

• The optimal SORTKEY depends on:

  • Query patterns

  • Data profile

  • Business requirements

Page 30: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Choosing a SORTKEY

COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data

INTERLEAVED
• Edge cases
• Large tables (> billion rows)
• No common filter criteria
• Non-time-series data

Organizing time-series data:
• Optimally, newest data lands at the "end" of a time-series table
• Use a column that appears primarily as a query predicate (date, identifier, ...)
• Optionally, choose a column frequently used for aggregates

Page 31: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Best Practices for Time-Series Data

http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html

It is important to have sort keys that ensure that new data is "located", per sort key order, at the end of the time-series table.

Page 32: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Events Out of Time

[Diagram: Incoming events in the ingest table carry timestamps that fall within the time range already stored in the events table, so after a vacuum they end up interleaved throughout the table rather than appended at the end.]

Page 33: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Altered Timestamp

By creating a synthetic timestamp sort key, the incoming rows all vacuum to the end of the main events table.

[Diagram: With the synthetic timestamp, every row in the ingest table sorts after the newest row already in the events table.]
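One plausible way to build such a synthetic timestamp (an assumption; the talk does not show the exact expression) is to derive the leading sort column from the ingest time, which only moves forward, instead of the raw event timestamp, which can arrive late.

  -- ingest_time always increases, so a sort column derived from it places new rows
  -- at the end of the table; truncating to the hour groups each hourly batch together.
  SELECT DATE_TRUNC('hour', TIMESTAMP 'epoch' + ingest_time * INTERVAL '1 second')
             AS synthetic_sort_ts,
         raw_event_timestamp,
         player_id,
         event_type
  FROM eventsdups;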

Page 34: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Best Practice for Sort Key Selection

http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

Compound Sort Key:

A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting (may) decrease when queries depend only on secondary sort columns, without referencing the primary columns.
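Against the compound sort key sketched earlier, the difference looks roughly like this (values and the UUID are illustrative):

  -- Filters on a prefix of the sort key (ingest_time, then stat_date): zone maps can
  -- reject most blocks, so the scan touches little data.
  SELECT event_type, COUNT(*)
  FROM events
  WHERE ingest_time BETWEEN 1480291200 AND 1480377600
    AND stat_date = '2016-11-28'
  GROUP BY event_type;

  -- Filters only on a secondary column with no leading-column predicate: the compound
  -- sort order helps far less, and most blocks must be scanned.
  SELECT COUNT(*)
  FROM events
  WHERE player_id = '0f8fad5b-d9cb-469f-a165-70867728950e';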

Page 35: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Optimizing the Effectiveness of Zone Maps

Page 36: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Time Only Query Performance

[Diagram: Four blocks with their zone-map ranges; predicates such as "> 9" and "< 4" are checked against the zone maps so that only blocks whose ranges overlap the predicate are scanned.]

Page 37: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Truncated Synthetic Timestamp

[Diagram: The same four blocks keyed by the truncated synthetic timestamp; the "> 9" and "< 4" predicates can still be pruned via the zone maps.]

Page 38: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Balancing Vacuum and Query Speed

Pre-Vacuum:
• Four ingest batches come in with the same truncated synthetic timestamp
• Vacuum time grows as the number of overlapping batches increases

Post-Vacuum:
• After the vacuum, the secondary and tertiary sort keys reorder the rows, improving the sorting power of these later sort keys
• Improved grouping of the secondary and tertiary sort key values improves query speed where these are used

Page 39: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Architecture – Scaling Challenges

Page 40: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Goals of Distribution

• Distribute data evenly for parallel processing

• Minimize data movement

• Co-located joins

• Localized aggregations

[Diagram: Three distribution styles shown across two nodes with two slices each – Key sends the same key to the same location; All places the full table data on the first slice of every node; Even distributes rows round-robin.]

Page 41: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Choosing a Distribution Style

Key
• Large FACT tables
• Rapidly changing tables used in joins
• Localize columns used within aggregations

All
• Slowly changing data
• Reasonable size (i.e., a few million rows but not hundreds of millions)
• No common distribution key for frequent joins
• Typical use case: a joined dimension table without a common distribution key

Even
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
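For reference, the three styles are declared in DDL as shown below (table and column names are purely illustrative):

  -- KEY: co-locate large fact rows on the column they are joined/aggregated on.
  CREATE TABLE fact_events (
      player_id  CHAR(36),
      event_type INTEGER,
      stat_date  DATE
  ) DISTSTYLE KEY DISTKEY (player_id);

  -- ALL: replicate a small, slowly changing dimension to every node.
  CREATE TABLE dim_event_type (
      event_type INTEGER,
      name       VARCHAR(64)
  ) DISTSTYLE ALL;

  -- EVEN: round-robin rows of a large table that has no good join key.
  CREATE TABLE raw_import (
      line VARCHAR(MAX)
  ) DISTSTYLE EVEN;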

Page 42: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Best Practice for Distribution

http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

Data redistribution can account for a substantial portion of the cost of a query plan, and the network traffic it generates can affect other database operations and slow overall system performance.

1. Distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance.

2. Minimize data movement during query execution. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer does not need to redistribute as much data during query execution.

Page 43: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Unauthenticated (Anonymous) Events

A small percentage of unauthenticated events located on a single slice of a large cluster leads to significant skew.

[Diagram: The events table spread across slices 0..N, showing both node-level and slice-level skew caused by the unauthenticated rows piling up on one slice.]

Page 44: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Split Events Tables

By splitting events into two tables (Events Table – Authenticated and Events Table – Unauthenticated, each spread across slices 0..N), query speed improved because the unauthenticated events no longer cause skew.

A UNION ALL view can be used to query all event data when needed.
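The UNION ALL view could be as simple as the sketch below (view and table names are assumptions; the two tables need identical column lists):

  CREATE VIEW all_events AS
  SELECT * FROM events_authenticated
  UNION ALL
  SELECT * FROM events_unauthenticated;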

Page 45: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Long Deduplication Time

Incoming events need to be scrubbed to prevent duplicate events, and duplicates are removed from the incoming data. Scanning the full events table for deduplication slows down as the events table grows.

[Diagram: Every ingest batch is compared against the entire time range of the events table.]

Page 46: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Time Restricted Deduplication

Incoming events are evaluated for their ranges on specific columns, and the scan of the main events table is limited to the range of the incoming events.

[Diagram: Only the portion of the events table that overlaps the ingest batch's range is scanned during deduplication.]
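Building on the earlier deduplication sketch, the range restriction might be added as shown below; this is an assumed implementation, and the min/max could equally be computed once by the deduper and passed in as literals so the planner can prune blocks via zone maps.

  INSERT INTO events (raw_event_timestamp, user_id, session_id, step, event_type)
         -- ... plus the remaining event columns
  SELECT i.raw_event_timestamp, i.user_id, i.session_id, i.step, i.event_type
  FROM ingest_unique i
  LEFT JOIN (
      -- Only the slice of the main table that overlaps the incoming batch's range;
      -- any true duplicate of an incoming row must fall inside this range.
      SELECT raw_event_timestamp, user_id, session_id, step, event_type
      FROM events
      WHERE stat_date BETWEEN (SELECT MIN(stat_date) FROM ingest_unique)
                          AND (SELECT MAX(stat_date) FROM ingest_unique)
  ) e
    ON  e.raw_event_timestamp = i.raw_event_timestamp
    AND e.user_id    = i.user_id
    AND e.session_id = i.session_id
    AND e.step       = i.step
    AND e.event_type = i.event_type
  WHERE e.user_id IS NULL;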

Page 47: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Growing Aggregation Time

Per-player statistics are computed by aggregating over the events table, so aggregation time grows as the table grows.

[Diagram: The full events table feeding the aggregate table.]

Page 48: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Incremental Aggregation

[Diagram: Only the newly ingested portion of the events table is aggregated into a temp table, which is then merged into the aggregate table.]
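A hedged sketch of the incremental pattern, assuming a per-player aggregate table (player_stats) that tracks a high-water mark; the table, columns, and watermark mechanism are assumptions used only to illustrate the idea.

  -- Aggregate only the rows ingested since the last run.
  CREATE TEMP TABLE agg_delta AS
  SELECT player_id,
         COUNT(*)         AS new_events,
         MAX(ingest_time) AS max_ingest_time
  FROM events
  WHERE ingest_time > (SELECT COALESCE(MAX(last_ingest_time), 0) FROM player_stats)
  GROUP BY player_id;

  -- Merge the delta into the running per-player aggregate.
  UPDATE player_stats
  SET    total_events     = player_stats.total_events + agg_delta.new_events,
         last_ingest_time = agg_delta.max_ingest_time
  FROM   agg_delta
  WHERE  player_stats.player_id = agg_delta.player_id;

  -- Players seen for the first time get fresh rows.
  INSERT INTO player_stats (player_id, total_events, last_ingest_time)
  SELECT d.player_id, d.new_events, d.max_ingest_time
  FROM agg_delta d
  LEFT JOIN player_stats p ON p.player_id = d.player_id
  WHERE p.player_id IS NULL;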

Page 49: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Benefits

Page 50: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Benefits Detail

- Churn Prediction

- Cheater Detection

- Adaptive AI

- Changing the definition of success

Page 51: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Next steps

Page 52: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Next Steps

• Data retention

• Machine Learning

• Firehose

• Kinesis Analytics

Page 53: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Thank you!

Page 54: AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights

Remember to complete your evaluations!