Post on 27-Oct-2019
DBS302: Driving a Real-time Personalization Engine with Cloud Bigtable
Calvin French-Owen, Co-Founder & CTO, Segment
You’re making a hard choice...
Our roadmap
- A bit of background
- Personas architecture
- BigQuery + Cloud Bigtable
- Making hard choices
A bit of background
- 19,000 users
- 300B monthly events
- 450B outbound API calls
- TB of data per day
Segment by the numbers
Under the hood...
API Kafka Consumer
DB
api.google.com
api.salesforce.com
api.intercom.io
api.mixpanel.com
The biggest advantage of this system
It’s stateless
In 2018... we started getting a new set of requirements
Personas brought some decidedly stateful use cases
The use cases of personas
1) Profile API
2) Identity resolution
3) Audience computation
- Query profiles in real-time
- Match users by identity
- Create audiences of users
Personas
Personas architecture
Let’s first talk about lambda architectures...
- Data is sent to both the batch and speed layers
- The batch layer runs bigger computations
- The speed layer serves real-time updates (+ diffs)
Lambda architecture
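The split above can be sketched in a few lines of Python (a toy model, not Segment's actual pipeline): every event is written to both layers, and a real-time view combines the last batch result with the diff the speed layer has accumulated since.

```python
# Toy lambda architecture: every event goes to both layers; the
# batch layer recomputes the full view, the speed layer serves
# the diff since the last batch run (illustrative only).
batch_events, speed_events = [], []

def ingest(event):
    batch_events.append(event)   # durable, replayed in big batches
    speed_events.append(event)   # served immediately

def batch_view():
    return sum(batch_events)     # slow, complete recomputation

def realtime_view(last_batch_total, since):
    # Real-time answer = last batch result + diff from the speed layer.
    return last_batch_total + sum(speed_events[since:])

for e in [1, 2, 3]:
    ingest(e)
total = batch_view()             # batch run covers the first 3 events
ingest(4)                        # arrives after the batch run
assert realtime_view(total, since=3) == 10
```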
- Query profiles in real-time (speed)
- Match users by identity (speed)
- Create audiences of users (batch)
Personas
Different pipelines, different datastores
Kafka Pubsub
BigQuery
Cloud Bigtable
(batch)
(speed)
Worker
Worker
Kafka -> Pub/Sub
Segment messages
- Tracking things like pageviews, user events, etc
- Semi-structured JSON
- Typically ~1 KB
- Hundreds of thousands of 1 KB messages
- Published from Kafka to Cloud Pub/Sub
- Writes data twice: once for real-time, once for batch
- Audience computation in BigQuery
- Real-time reads in Cloud Bigtable
Personas architecture
BigQuery + Cloud Bigtable
- Use case
- Architecture
- Data model
- Query patterns
BigQuery + Cloud Bigtable
BigQuery: Use case
BigQuery
Cloud Bigtable
compute service
Kafka Pubsub
Worker
Worker
- Want to find users who meet arbitrary criteria
- Terabytes of data within a few minutes
- Tables have billions of rows
- We rarely care about all of the columns
- Real-time reads are not a big deal
- Tens of concurrent queries
BigQuery: Use case
BigQuery: Architecture
2004: MapReduce
2010: Dremel (built in 2006)
BigQuery: architecture
- Designed to interactively query datasets (seconds to minutes)
- Nested, structured data
- Uses SQL, no programming
- Private version: Dremel
BigQuery Architecture: four good ideas
BigQuery idea #1: Column-oriented
Suppose we want to build a database...
A row-oriented database
What if my database has billions of rows...
...and I only need location?
Store columns, not rows!
What if we invert the rows?
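Inverting the rows can be sketched with toy data (the field names here are made up for illustration):

```python
# Toy illustration of row- vs column-oriented layout.
rows = [
    {"user": "amy", "location": "SF",  "plan": "pro"},
    {"user": "bo",  "location": "NYC", "plan": "free"},
    {"user": "cy",  "location": "SF",  "plan": "pro"},
]

# Row-oriented: reading one field still walks every whole row.
locations_row_scan = [r["location"] for r in rows]

# Column-oriented: invert the rows once; a query then reads only
# the column it needs and skips the rest entirely.
columns = {key: [r[key] for r in rows] for key in rows[0]}
locations_col_scan = columns["location"]

assert locations_row_scan == locations_col_scan == ["SF", "NYC", "SF"]
```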
BigQuery idea #2: Compression
Columns on disk
- We have a lot of repeated data
- Run-length encoding (RLE)
- Let’s compress it...
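A minimal RLE sketch over a toy column:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

# Sorted/clustered columns compress extremely well this way.
column = ["SF", "SF", "SF", "NYC", "NYC", "SF"]
encoded = rle_encode(column)
assert encoded == [("SF", 3), ("NYC", 2), ("SF", 1)]
assert rle_decode(encoded) == column
```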
BigQuery idea #3: Efficient nested decoding
What happens when I select *?
FSM
BigQuery idea #4: More servers, more efficiency
Root
Level 1 · Level 1 · Level 1
Leaf · Leaf · Leaf · Leaf · Leaf · Leaf · Leaf · Leaf · Leaf
query (scattered down the serving tree)
MERGE! (partial results merged back up at each level)
More servers == more distributed work
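The serving tree can be sketched as a toy scatter-gather (the names `leaf`, `mixer`, and `root` are illustrative, not Dremel's API):

```python
# Toy scatter-gather: a root splits a query across leaves, and
# intermediate "Level 1" mixers merge partial results on the way
# back up -- the shape of Dremel's serving tree.
def leaf(shard, predicate):
    # Each leaf scans only its own shard of the data.
    return sum(1 for row in shard if predicate(row))

def mixer(partials):
    # Intermediate servers merge partial aggregates.
    return sum(partials)

def root(shards, predicate, fanout=3):
    groups = [shards[i:i + fanout] for i in range(0, len(shards), fanout)]
    level1 = [mixer([leaf(s, predicate) for s in group]) for group in groups]
    return mixer(level1)  # final MERGE at the root

shards = [[1, 2, 3], [4, 5], [6], [7, 8], [9]]
assert root(shards, lambda x: x % 2 == 0) == 4  # even rows: 2, 4, 6, 8
```

Adding servers adds leaves, so each leaf scans a smaller shard: more servers, more distributed work.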
BigQuery’s good ideas
1. Column-oriented
2. Compression
3. Fast nested data encoding
4. Distribute the work (separate data + compute)
BigQuery: Data model
We want to take user-supplied criteria…
…and turn it into query parameters
UI → JSON → SQL
BigQuery: Data Model
- Dataset per customer
- Table per {collection, event}
- Additional tables for traits, identity, merges
- Repeated fields for external_ids
- Explode arbitrary nested properties
BigQuery: Query patterns
BigQuery
Cloud Bigtable
compute service
Kafka Pubsub
Worker
Worker
Compute service runs queries every minute
Scan gigabytes in seconds
2 GB/s scanned (170 TB/day)
800 slots
- Tens of concurrent queries
- Scans terabytes of data independently
- Partitioned by customer
- Query by arrays of external_ids
- Stored AST as JSON, converted to SQL
Batch computations in BigQuery
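Storing the criteria AST as JSON and compiling it to SQL can be sketched as follows (the node shape and operator set are invented for illustration, not Segment's actual format):

```python
# Minimal sketch: walk a JSON criteria AST and emit a SQL predicate.
OPS = {"eq": "=", "gt": ">", "lt": "<"}

def ast_to_sql(node):
    if node["type"] in ("and", "or"):
        # Boolean nodes join their children with AND/OR.
        joined = f" {node['type'].upper()} ".join(
            ast_to_sql(child) for child in node["children"]
        )
        return f"({joined})"
    # Leaf nodes are simple comparisons against a field.
    return f"{node['field']} {OPS[node['op']]} {node['value']!r}"

audience = {
    "type": "and",
    "children": [
        {"type": "pred", "field": "event", "op": "eq", "value": "pageview"},
        {"type": "pred", "field": "count", "op": "gt", "value": 3},
    ],
}
sql = ast_to_sql(audience)
assert sql == "(event = 'pageview' AND count > 3)"
```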
Cloud Bigtable: Use case
BigQuery
Cloud Bigtable
profileAPI
Kafka Pubsub
Worker
Worker
Cloud Bigtable: Use case
- Small amounts of data (KB to MB)
- Able to be indexed for a single user
- A high read and write rate (tens of thousands of QPS)
- Data should be reflected in real time
(Not a new idea)
Cloud Bigtable: Architecture
Bigtable (published in 2006)
Cloud Bigtable: Architecture
Client
BT Node
GFS Tablet
GFS Tablet
GFS Tablet
memtable
BT Node
memtable
GFS Tablet
write: <k, v>
memtable.append(k, v)
append(k, v)
Writes are fast appends
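The write path above, as a toy model (simplified: real Bigtable also flushes memtables to immutable SSTables and compacts them):

```python
# Toy LSM write path: a write appends to an append-only log for
# durability and to an in-memory memtable for fast reads, so every
# write is a cheap sequential append.
class ToyNode:
    def __init__(self):
        self.memtable = {}
        self.log = []  # stands in for the GFS tablet log

    def write(self, k, v):
        self.log.append((k, v))   # durable append
        self.memtable[k] = v      # fast in-memory update

node = ToyNode()
node.write("user:123", {"plan": "pro"})
assert node.memtable["user:123"] == {"plan": "pro"}
assert node.log == [("user:123", {"plan": "pro"})]
```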
Cloud Bigtable: Architecture
Client
BT Node
GFS Tablet
GFS Tablet
GFS Tablet
memtable
BT Node
memtable
GFS Tablet
read(k)
memtable[k] → <value>
fetch(offset) → <data>
Reads check the cache first, then merge
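The read path, as a toy model (simplified: the memtable always holds the newest value here, and SSTables are checked newest-first):

```python
# Toy LSM read path: check the in-memory memtable first, then fall
# back to the immutable on-disk tables fetched from GFS.
def read(key, memtable, sstables):
    if key in memtable:          # newest data wins
        return memtable[key]
    for table in sstables:       # newest SSTable first
        if key in table:
            return table[key]
    return None

memtable = {"a": 2}
sstables = [{"a": 1, "b": 9}]    # older, immutable layer
assert read("a", memtable, sstables) == 2   # served from cache
assert read("b", memtable, sstables) == 9   # fetched from disk
```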
What about failures?
Cloud Bigtable: Architecture
Client
BT Node
GFS Tablet
GFS Tablet
GFS Tablet
memtable
BT Node
memtable
GFS Tablet
read(k) fetch(offset)
Cloud Bigtable: Architecture
- Multi-tenant
- Row-oriented
- Log-structured merge tree
- Immutable, with in-memory caching
- Bloom filters save on reads
- Lock service maps nodes to keyspace
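The Bloom-filter point can be illustrated with a toy filter (fixed size, hand-rolled for illustration; not Bigtable's implementation):

```python
import hashlib

# Toy Bloom filter: a few hash bits per key let a node skip disk
# reads for keys that are definitely absent. False negatives are
# impossible; rare false positives just cost an extra disk read.
class Bloom:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

bloom = Bloom()
bloom.add("user:123")
assert bloom.might_contain("user:123")  # never a false negative
```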
Cloud Bigtable: Data model
- Separate tables for different datatypes: records, properties, events
- Keys are ID- and time-ordered
- Values are snappy-encoded
Cloud Bigtable: Data Model
- Records provide metadata to stitch together the full record
- User properties power the profile API
- Events are sorted to query the last range of events
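An ID- and time-ordered key can be sketched like this (the key layout is illustrative, not Segment's actual schema):

```python
# Sketch of an ID- and time-ordered row key: one user's events sort
# together, and a reversed timestamp puts the newest events first,
# so "last N events" is one contiguous range scan.
MAX_TS = 10**10 - 1  # largest 10-digit timestamp

def event_key(user_id, ts):
    return f"{user_id}#{MAX_TS - ts:010d}"

keys = sorted(event_key("user42", ts) for ts in [100, 300, 200])
# Newest event (ts=300) sorts first within the user's key range.
assert keys[0] == event_key("user42", 300)
```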
Cloud Bigtable + BigQuery: In production
In production
- Cloud Bigtable
  - 55,000 rows written per second
  - 175,000 rows read per second
  - 10 TB of data
  - 16 nodes
- BigQuery
  - Hundreds of queries per minute
  - Scanning hundreds of GB per minute
  - 500 TB of data stored
Back to that hard choice...
BigQuery is hard to compare
A few places Cloud Bigtable shines
1. Identification of hot keys
2. Write-heavy workloads
Split compute
- Compute is separated from storage
- Writes can be spread across many nodes
In summary...
Segment Personas
- Powered by Cloud Bigtable and BigQuery
- Cloud Bigtable for small, random reads
- BigQuery for batch aggregations
- Processes billions of events
- Large, multi-tenant architecture
- SQL for flexible feature development
- Favorable read/write costs
- Millions of dollars in revenue
- Scales to Google levels
Fin