API analytics with Bigquery, by Javier Ramirez from teowaki

description

At https://teowaki.com we have a system for API usage analytics, with Redis as a fast intermediate store and Google BigQuery as a big data backend. As a result, we can run aggregated queries on our traffic and usage data in just a few seconds, and we can look for usage patterns that wouldn’t be obvious otherwise. In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.

Transcript of API analytics with Bigquery, by Javier Ramirez from teowaki

Page 1: API analytics with Bigquery, by Javier Ramirez from teowaki

javier ramirez @supercoco9

API Analytics with Redis, BigQuery, and Apps Script

a two-people start-up

a different league...

.. or maybe not

moral of the story

you can do big, if you know how

Set a distance.

Set an expiration time.

Bye bye noise.

javier ramirez @supercoco9 https://teowaki.com

REST API (Ruby on Rails) +

Web on top (AngularJS)

data that’s an order of magnitude greater than data you’re accustomed to

Doug Laney, VP Research, Business Analytics and Performance Management at Gartner

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Edd Dumbill, program chair for the O’Reilly Strata Conference

big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

Javier Ramirez, impressionable teowaki founder

1. non-intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time

twitter
stackoverflow
pinterest
booking.com
World of Warcraft
YouPorn
HipChat
Snapchat

ntopng
LogStash

Non-intrusive metrics

Capture data really fast.

Then process the data in the background.
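
The capture-then-process idea can be sketched roughly as follows. This is a minimal illustration, not the teowaki implementation: a `collections.deque` stands in for a Redis list (append on the request path, pop from a background worker), and the function names and event fields are assumptions made for the example.

```python
import json
import threading
from collections import deque

# Stand-in for a Redis list (assumption): the request path only appends,
# much like LPUSH, and a background worker pops from the other end,
# much like RPOP, so events come out in arrival order.
buffer = deque()
processed = []

def capture(method, uri, status):
    """Request path: serialize one event, append it, and return at once."""
    buffer.append(json.dumps({"method": method, "uri": uri, "status": status}))

def drain():
    """Background side: pop events until the buffer is empty."""
    while True:
        try:
            raw = buffer.popleft()
        except IndexError:
            break  # nothing left to process
        processed.append(json.loads(raw))

capture("GET", "/users/javier/shouts", 200)
capture("POST", "/teams/javier-community/links", 201)

# The worker runs outside the request path.
worker = threading.Thread(target=drain)
worker.start()
worker.join()
```

The request path does nothing but one append, which is what keeps the metrics non-intrusive; parsing and shipping the events happens later.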

Gzip to AWS S3/Glacier

or Google Cloud Storage
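
A minimal sketch of the archiving step, assuming events are kept as one JSON document per line before compression. The upload to S3, Glacier, or Cloud Storage is left out on purpose; `archive_batch` and `restore_batch` are hypothetical helper names.

```python
import gzip
import json

def archive_batch(events):
    """Serialize events as one JSON document per line and gzip the result."""
    lines = "\n".join(json.dumps(e, sort_keys=True) for e in events)
    return gzip.compress(lines.encode("utf-8"))

def restore_batch(blob):
    """Inverse of archive_batch: decompress and parse each line again."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]

events = [
    {"method": "GET", "uri": "/users/javier/shouts", "status": 200},
    {"method": "POST", "uri": "/teams/javier-community/links", "status": 201},
]

# The resulting bytes are what would be uploaded to the archive bucket.
blob = archive_batch(events)
```

Because the archive is plain gzipped text, the history can be replayed into a different backend later, which fits the "avoid vendor lock-in" goal.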

tools we considered:

Hadoop
Cassandra
Hadoop + Voldemort + Kafka
HBase
…
Amazon Redshift

but...

hard to set up and monitor

not interactive enough

expensive cluster

Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

Based on “Dremel”

Specifically designed for interactive queries over petabytes of real-time data

loading data

You just send the data in text (or JSON) format
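
As an illustration of the two formats, the sketch below serializes the same event as CSV and as newline-delimited JSON. It assumes "text" means CSV; the field names, the sample timestamp, and the helper functions are hypothetical, and the actual load call is not shown.

```python
import csv
import io
import json

# Hypothetical schema for one API request event.
FIELDS = ["ts", "method", "uri", "status"]

def to_csv(events):
    """One comma-separated line per event, columns in FIELDS order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for event in events:
        writer.writerow([event[name] for name in FIELDS])
    return out.getvalue()

def to_ndjson(events):
    """One JSON object per line (newline-delimited JSON)."""
    return "".join(json.dumps(event) + "\n" for event in events)

events = [
    {"ts": "2014-05-01 10:00:00", "method": "GET",
     "uri": "/users/javier/shouts", "status": 200},
]

csv_text = to_csv(events)
ndjson_text = to_ndjson(events)
```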

SQL

select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

specific extensions for analytics

within
flatten
nest

stddev

top
first
last
nth

variance

var_pop
var_samp

covar_pop
covar_samp

quantiles

correlations

Things you always wanted to try but were too scared to

select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

columnar storage
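
A toy comparison of row-oriented and column-oriented layouts, to show why a query that touches one column reads far less data than a full scan. The sample rows and the size measure (length of the value as text) are assumptions for illustration only; real storage engines also encode and compress columns.

```python
# Three sample events stored row by row.
rows = [
    {"uri": "/users/javier/shouts", "method": "GET", "status": 200},
    {"uri": "/users/rgo/shouts", "method": "POST", "status": 201},
    {"uri": "/teams/javier-community/links", "method": "GET", "status": 200},
]

# Column-oriented layout: one list of values per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def size(value):
    """Crude size measure: number of characters in the value as text."""
    return len(str(value))

def bytes_read_row_store(rows):
    """A row store reads whole rows, whichever fields the query needs."""
    return sum(size(value) for row in rows for value in row.values())

def bytes_read_column_store(columns, wanted):
    """A column store reads only the columns named in the query."""
    return sum(size(value) for name in wanted for value in columns[name])

full_scan = bytes_read_row_store(rows)
only_status = bytes_read_column_store(columns, ["status"])
```

Here `only_status` is a small fraction of `full_scan`, the same effect as a full scan over a single column processing much less data than a full scan over every column.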

highly distributed execution using a tree
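
The tree idea can be sketched as follows: leaves count matches in their own shard, and each intermediate node only adds up the partial results of its children. This is a sequential toy under stated assumptions (the tree shape and data are invented); in a real system the leaves would run in parallel on different machines.

```python
def leaf_count(shard, predicate):
    """Leaf: scan one shard of rows and count the matches."""
    return sum(1 for row in shard if predicate(row))

def tree_count(node, predicate):
    """A node is either a list of rows (leaf shard) or a tuple of child nodes.

    Intermediate nodes never touch rows; they only sum partial counts.
    """
    if isinstance(node, list):
        return leaf_count(node, predicate)
    return sum(tree_count(child, predicate) for child in node)

# Root with two intermediate nodes and three leaf shards of HTTP methods.
tree = (
    (["POST", "GET"], ["GET"]),
    (["POST", "POST", "GET"],),
)

total_posts = tree_count(tree, lambda method: method == "POST")
```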

web console screenshot

country segmented traffic

window functions

our most active user

new users per month

10 requests we should be caching

5 most created resources

select uri, count(*) total from stats where method = 'POST' group by URI;

...but

/users/javier/shouts
/users/rgo/shouts
/teams/javier-community/links
/teams/nosqlmatters-cgn/links

5 most created resources
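
The query used on this slide is not part of the transcript. One way to group user-specific and team-specific URIs into resource types is to replace the identifier segment with a placeholder before counting; in BigQuery a regular-expression replace function could play the same role. The Python sketch below shows the idea, with a hypothetical rule that the segment after /users/ or /teams/ is an identifier.

```python
import re
from collections import Counter

# Hypothetical rule: the path segment right after /users/ or /teams/
# identifies one user or team and should not split the groups.
ID_SEGMENT = re.compile(r"^/(users|teams)/[^/]+")

def normalize(uri):
    """Replace the identifier segment with a fixed placeholder."""
    return ID_SEGMENT.sub(r"/\1/:id", uri)

# URIs of POST requests, as they would come out of the stats table.
posts = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
    "/users/javier/shouts",
]

# Count per normalized URI and keep the five most frequent.
top = Counter(normalize(uri) for uri in posts).most_common(5)
```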

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url
FROM github.timeline
WHERE type="WatchEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
AND repository_url IN (
  SELECT repository_url
  FROM github.timeline
  WHERE type="CreateEvent"
  AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
  AND repository_fork = "false"
  AND payload_ref_type = "repository"
  GROUP BY repository_url
)
GROUP BY repository_name, repository_language, repository_description, repository_url
HAVING cnt >= 5
ORDER BY cnt DESC
LIMIT 25

NO

Automation with Apps Script

Read from BigQuery

Create a spreadsheet on Drive

E-mail it every day as a PDF

bigquery pricing

$26 per stored TB
1,000,000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB
1 full scan = 160 MB
1 count = 0 MB
1 full scan over 1 column = 5.4 MB
100 GB => $0.05 / month £0.03
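
The per-unit figures can be checked with a little arithmetic. The sketch below uses the prices and sizes quoted on the slide and assumes decimal units (1 TB = 1,000,000 MB); the function names are hypothetical.

```python
STORAGE_USD_PER_TB = 26.0  # stored TB per month, from the slide
QUERY_USD_PER_TB = 5.0     # processed TB, from the slide
MB_PER_TB = 1_000_000      # decimal units (assumption)

def storage_cost(megabytes):
    """Monthly cost in USD of keeping this much data stored."""
    return megabytes / MB_PER_TB * STORAGE_USD_PER_TB

def query_cost(megabytes):
    """Cost in USD of one query that processes this much data."""
    return megabytes / MB_PER_TB * QUERY_USD_PER_TB

# 1,000,000 rows take 160 MB according to the slide.
storage_1mm_rows = storage_cost(160)  # about 0.00416 USD per month
full_scan = query_cost(160)           # about 0.0008 USD per query
one_column_scan = query_cost(5.4)     # about 0.000027 USD per query
```

The storage result matches the $0.00416 / month quoted for 1,000,000 rows.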

£0.054307 / month*

per 1MM rows

*the 1st 1TB every month is free of charge

1. non-intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time

Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks! תודה

Javier Ramírez @supercoco9