
DBS302: Driving a Realtime Personalization Engine with Cloud Bigtable

Calvin French-Owen, Co-Founder & CTO, Segment

You’re making a hard choice...

Our roadmap

- A bit of background
- Personas architecture
- BigQuery + Cloud Bigtable
- Making hard choices

A bit of background

Segment by the numbers

- 19,000 users
- 300B monthly events
- 450B outbound API calls
- Terabytes of data per day

Under the hood...

[Diagram: API → Kafka → Consumer → DB, with the consumer fanning out to api.google.com, api.salesforce.com, api.intercom.io, api.mixpanel.com]

The biggest advantage of this system

It’s stateless


In 2018... we started getting a new set of requirements

Personas brought some decidedly stateful use cases

The use cases of Personas

1) Profile API

2) Identity resolution

3) Audience computation

Personas

- Query profiles in real-time
- Match users by identity
- Create audiences of users

Personas architecture

Let’s first talk about lambda architectures...

Lambda architecture

- Data is sent to both the batch and speed layers
- The batch layer runs bigger computations
- The speed layer serves real-time updates (+ diffs)

Personas

- Query profiles in real-time (speed)
- Match users by identity (speed)
- Create audiences of users (batch)

Different pipelines, different datastores

[Diagram: events flow from Kafka into Cloud Pub/Sub; workers write to BigQuery (batch) and Cloud Bigtable (speed)]

Kafka -> Pub/Sub

Segment messages

- Tracking things like pageviews, user events, etc.
- Semi-structured JSON
- Typically ~1 KB

Personas architecture

- Hundreds of thousands of 1 KB messages
- Published from Kafka to Cloud Pub/Sub
- Writes data twice: once for real-time, once for batch (sketched below)
- Audience computation in BigQuery
- Real-time reads in Cloud Bigtable
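Here's a minimal sketch of that dual write in Go. The talk doesn't give the worker's internals, so the project, instance, subscription, dataset, and table names below are all hypothetical; only the shape (one event, two destinations) comes from the slides.

```go
// A minimal dual-write sketch. All resource names are hypothetical.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
	"cloud.google.com/go/bigtable"
	"cloud.google.com/go/pubsub"
)

// eventRow is a stand-in schema for the batch copy of an event.
type eventRow struct {
	Body string `bigquery:"body"`
}

func main() {
	ctx := context.Background()

	ps, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	bq, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	bt, err := bigtable.NewClient(ctx, "my-project", "personas")
	if err != nil {
		log.Fatal(err)
	}

	events := bt.Open("events")                                    // speed layer
	inserter := bq.Dataset("customer_1").Table("track").Inserter() // batch layer

	sub := ps.Subscription("personas-worker")
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Speed layer: append the raw event under the user's row key
		// (the key shape is illustrative).
		mut := bigtable.NewMutation()
		mut.Set("e", "body", bigtable.Now(), m.Data)
		if err := events.Apply(ctx, "user#123", mut); err != nil {
			m.Nack()
			return
		}
		// Batch layer: stream the same event into BigQuery for audience
		// computation (a real pipeline might batch-load instead).
		if err := inserter.Put(ctx, eventRow{Body: string(m.Data)}); err != nil {
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```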

BigQuery + Cloud Bigtable

- Use case
- Architecture
- Data model
- Query patterns

BigQuery: Use case

[Diagram: the batch path — workers write into BigQuery, and a compute service queries it]

- Want to find users who meet arbitrary criteria
- Terabytes of data within a few minutes
- Tables have billions of rows
- We rarely care about all of the columns
- Real-time reads are not a big deal
- Tens of concurrent queries


BigQuery: Architecture

2004: MapReduce

2010: Dremel (built internally in 2006)

- Designed to interactively query datasets (seconds to minutes)
- Nested, structured data
- Uses SQL, no programming
- Private version: Dremel

BigQuery Architecture: four good ideas

BigQuery idea #1: Column-oriented

Suppose we want to build a database...

A row-oriented database

What if my database has billions of rows...

...and I only need location?

Store columns, not rows!

What if we invert the rows?
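As a toy illustration of the inversion (not BigQuery's actual storage format), storing each column as its own array means a scan that only needs location never touches the other columns:

```go
// Row vs column orientation, as a toy: with a columnar layout, a query that
// only needs `location` touches one contiguous slice instead of every row.
package main

import "fmt"

// Row-oriented: each record is stored together.
type userRow struct {
	Name     string
	Email    string
	Location string
}

// Column-oriented: each column is stored together.
type userColumns struct {
	Names     []string
	Emails    []string
	Locations []string
}

func main() {
	cols := userColumns{
		Names:     []string{"ada", "bob"},
		Emails:    []string{"ada@example.com", "bob@example.com"},
		Locations: []string{"SF", "NYC"},
	}
	// Only the Locations column is read; Names and Emails stay untouched.
	for _, loc := range cols.Locations {
		fmt.Println(loc)
	}
}
```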

BigQuery idea #2: Compression

Columns on disk

- We have a lot of repeated data
- Run-length encoding (RLE)
- Let's compress it...
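Why this works so well on columns: values of one column tend to repeat, so long runs collapse into (value, count) pairs. A minimal RLE sketch, illustrative rather than BigQuery's actual codec:

```go
// A minimal run-length encoding sketch: repetitive columnar data collapses
// from one entry per row to one entry per run.
package main

import "fmt"

type run struct {
	Value string
	Count int
}

func rle(values []string) []run {
	var runs []run
	for _, v := range values {
		if n := len(runs); n > 0 && runs[n-1].Value == v {
			runs[n-1].Count++
			continue
		}
		runs = append(runs, run{Value: v, Count: 1})
	}
	return runs
}

func main() {
	// A repetitive column like "location" compresses from 6 entries to 3 runs.
	fmt.Println(rle([]string{"SF", "SF", "SF", "NYC", "NYC", "SF"}))
	// Output: [{SF 3} {NYC 2} {SF 1}]
}
```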

BigQuery idea #3: Efficient nested decoding

What happens when I select *?

[Diagram: a finite-state machine (FSM) reassembles full records from the selected columns]

BigQuery idea #4: More servers, more efficiency

[Diagram: the Dremel serving tree — a root server fans the query out through intermediate "Level 1" servers to many leaf servers, then merges the partial results back up]

More servers == more distributed work

BigQuery’s good ideas

1. Column-oriented
2. Compression
3. Fast nested-data encoding
4. Distribute the work (separate data + compute)

BigQuery: Data model

We want to take user-supplied criteria…

…and turn it into query parameters

[Diagram: UI → JSON → SQL — criteria built in the UI are serialized as a JSON AST, then compiled to SQL]

- Dataset per customer
- Table per {collection, event}
- Additional tables for traits, identity, merges
- Repeated fields for external_ids
- Explode arbitrary nested properties
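A hedged sketch of an audience query against this layout; the dataset, table, and field names are hypothetical, not Segment's actual schema. The point is that the repeated external_ids field gets flattened with UNNEST:

```go
// Running a hypothetical audience query against the per-customer dataset.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

// Hypothetical dataset/table/field names; external_ids is a repeated field.
const audienceSQL = `
SELECT DISTINCT id.value AS user_id
FROM ` + "`customer_1.track_order_completed`" + ` AS t,
     UNNEST(t.external_ids) AS id
WHERE id.type = 'user_id'
  AND t.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
`

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	it, err := client.Query(audienceSQL).Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row[0])
	}
}
```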

BigQuery: Query patterns


Compute service runs queries every minute

Scan gigabytes in seconds

2 GB/s scanned (170 TB/day)

800 slots

Batch computations in BigQuery

- Tens of concurrent queries
- Scans terabytes of data independently
- Partitioned by customer
- Query by arrays of external_ids
- AST stored as JSON and converted to SQL (see the sketch below)
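A minimal sketch of the AST-to-SQL step. The talk only says the AST is stored as JSON and compiled to SQL, so the node shape and field names below are invented for illustration:

```go
// Compiling a stored JSON AST into a SQL WHERE clause (hypothetical AST shape).
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// node is one operator in the (hypothetical) criteria AST.
type node struct {
	Op       string `json:"op"`       // "and", "or", or a comparison like ">="
	Field    string `json:"field"`    // column name for leaf nodes
	Value    any    `json:"value"`    // literal for leaf nodes
	Children []node `json:"children"` // operands for "and"/"or"
}

// toSQL compiles a node into a WHERE-clause fragment.
func toSQL(n node) string {
	switch n.Op {
	case "and", "or":
		parts := make([]string, len(n.Children))
		for i, c := range n.Children {
			parts[i] = toSQL(c)
		}
		return "(" + strings.Join(parts, " "+strings.ToUpper(n.Op)+" ") + ")"
	default:
		// Leaf comparison; real code would escape/parameterize values.
		return fmt.Sprintf("%s %s %v", n.Field, n.Op, n.Value)
	}
}

func main() {
	raw := `{"op":"and","children":[
	  {"op":">=","field":"total","value":100},
	  {"op":"=","field":"event","value":"'Order Completed'"}
	]}`
	var ast node
	if err := json.Unmarshal([]byte(raw), &ast); err != nil {
		log.Fatal(err)
	}
	fmt.Println("SELECT user_id FROM events WHERE " + toSQL(ast))
	// Output: SELECT user_id FROM events WHERE (total >= 100 AND event = 'Order Completed')
}
```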

Cloud Bigtable: Use case

[Diagram: the speed path — the Profile API reads directly from Cloud Bigtable]

- Small amounts of data (KB to MB)
- Able to be indexed for a single user
- A high read and write rate (tens of thousands of QPS)
- Data should be reflected in real-time

(Not a new idea)
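A minimal sketch of what a Profile API lookup could look like against Cloud Bigtable; the instance, table, and row-key shape are hypothetical. The point is that serving a profile is a single-row point read:

```go
// A single-row profile read; resource names and key shape are hypothetical.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()
	client, err := bigtable.NewClient(ctx, "my-project", "personas")
	if err != nil {
		log.Fatal(err)
	}
	profiles := client.Open("profiles") // hypothetical table

	// One key, one row: the kind of read Bigtable serves at tens of
	// thousands of QPS.
	row, err := profiles.ReadRow(ctx, "user#123")
	if err != nil {
		log.Fatal(err)
	}
	for _, items := range row {
		for _, item := range items {
			fmt.Printf("%s = %s\n", item.Column, item.Value)
		}
	}
}
```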

Cloud Bigtable: Architecture

Bigtable (published in 2006)

[Diagram: a client talks to Bigtable (BT) nodes; each node keeps an in-memory memtable and serves tablets stored on GFS]

[Diagram: write path — the client sends write: <k, v>; the BT node appends to its memtable (memtable.append(k, v)) and appends to the log on GFS (append(k, v))]

Writes are fast appends
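To make that concrete, here's a toy log-structured write path (illustrative, not Bigtable's code): the hot path is a sequential append to a commit log plus an in-memory memtable update, never an in-place disk write:

```go
// A toy log-structured write path: append to a commit log, update the memtable.
package main

import (
	"fmt"
	"os"
)

type store struct {
	log      *os.File          // append-only commit log (GFS in Bigtable's case)
	memtable map[string]string // sorted in-memory buffer (a map here for brevity)
}

func (s *store) write(k, v string) error {
	// 1) Durably append to the log.
	if _, err := fmt.Fprintf(s.log, "%s=%s\n", k, v); err != nil {
		return err
	}
	// 2) Update the memtable; it is flushed to an immutable file when full.
	s.memtable[k] = v
	return nil
}

func main() {
	f, err := os.CreateTemp("", "commitlog")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	s := &store{log: f, memtable: map[string]string{}}
	if err := s.write("user#123", `{"plan":"pro"}`); err != nil {
		panic(err)
	}
	fmt.Println(s.memtable["user#123"])
}
```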

[Diagram: read path — the client sends read(k); the BT node checks its memtable first (memtable[k]) and, on a miss, fetches from the GFS tablet (fetch(offset)), returning the merged <value>]

Reads check the cache first, then merge
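The matching toy read path (again illustrative, not Bigtable's code): check the memtable first, and only fall back to the immutable on-disk data on a miss:

```go
// A toy read path matching the diagram: memtable first, then "disk".
package main

import "fmt"

type lsm struct {
	memtable map[string]string // recent writes
	sstable  map[string]string // flushed, immutable data (stand-in for SSTables)
}

func (s *lsm) read(k string) (string, bool) {
	// 1) The newest data lives in the memtable.
	if v, ok := s.memtable[k]; ok {
		return v, true
	}
	// 2) Otherwise fetch from the immutable files (fetch(offset) in the diagram).
	v, ok := s.sstable[k]
	return v, ok
}

func main() {
	s := &lsm{
		memtable: map[string]string{"user#123": `{"plan":"pro"}`},
		sstable:  map[string]string{"user#456": `{"plan":"free"}`},
	}
	fmt.Println(s.read("user#123")) // served from the memtable
	fmt.Println(s.read("user#456")) // served from "disk"
}
```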

What about failures?

[Diagram: failure handling — tablets live on GFS rather than on the node, so a failed BT node's tablets are reassigned and reads (read(k) → fetch(offset)) continue against another node]

Cloud Bigtable: Architecture

- Multi-tenant
- Row-oriented
- Log-structured merge tree
- Immutable files, with in-memory caching
- Bloom filters save on reads (see the toy sketch below)
- A lock service maps nodes to the keyspace
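As a toy illustration of the Bloom filter point (not Bigtable's implementation): a compact bit set can answer "definitely absent" without touching disk, so reads for missing keys skip the SSTable fetch entirely:

```go
// A toy Bloom filter: a negative answer guarantees the key is absent.
package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct {
	bits []bool
	k    int // number of hash functions
}

func (b *bloom) idx(key string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, key) // derive k hashes by salting one hash
	return int(h.Sum64() % uint64(len(b.bits)))
}

func (b *bloom) add(key string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.idx(key, i)] = true
	}
}

func (b *bloom) mayContain(key string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.idx(key, i)] {
			return false // definitely not present: skip the disk read
		}
	}
	return true // possibly present: do the read
}

func main() {
	b := &bloom{bits: make([]bool, 1024), k: 3}
	b.add("user#123")
	fmt.Println(b.mayContain("user#123")) // true
	fmt.Println(b.mayContain("user#999")) // false (almost certainly)
}
```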

Cloud Bigtable: Data model

- Separate tables for different datatypes: records, properties, events
- Keys are ID- and time-ordered
- Values are snappy-encoded
- Records provide metadata to stitch together the full record
- User properties power the Profile API
- Events are sorted to query the last range of events
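A sketch of how an ID- and time-ordered event key might be built. The talk doesn't give the key format, so the "event#<userID>#<reversed timestamp>" shape here is an assumed, common pattern that keeps a user's newest events first in a prefix scan:

```go
// Building a time-ordered row key (hypothetical key shape).
package main

import (
	"fmt"
	"math"
	"time"
)

// eventKey builds "event#<userID>#<reverse-ts>" so a prefix scan on
// "event#<userID>#" returns that user's events newest-first.
func eventKey(userID string, t time.Time) string {
	reverseTS := math.MaxInt64 - t.UnixNano()
	return fmt.Sprintf("event#%s#%020d", userID, reverseTS)
}

func main() {
	now := time.Now()
	older := eventKey("u123", now.Add(-time.Hour))
	newer := eventKey("u123", now)
	// Lexicographic order: the newer event's key sorts before the older one.
	fmt.Println(newer < older) // true
}
```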

Cloud Bigtable + BigQuery: In production

Cloud Bigtable:
- 55,000 rows written per second
- 175,000 rows read per second
- 10 TB of data
- 16 nodes

BigQuery:
- Hundreds of queries per minute
- Scanning hundreds of GB per minute
- 500 TB of data stored

Back to that hard choice...

BigQuery is hard to compare

A few places Cloud Bigtable shines

1. Identification of hot keys

2. Write-heavy workloads

Split compute

- Compute is separated from storage

- Writes can be spread across many nodes

In summary...

Segment Personas

- Powered by Cloud Bigtable and BigQuery
- Cloud Bigtable for small, random reads
- BigQuery for batch aggregations
- Processes billions of events
- Large, multi-tenant architecture
- SQL for flexible feature development
- Favorable read/write costs
- Millions of dollars in revenue
- Scales to Google levels

Fin
