API analytics with Bigquery, by Javier Ramirez from teowaki

javier ramirez@supercoco9

API Analytics withRedis, BigQuery, and AppsScripts

a two peoplestart-up

a different league...

.. or maybe not

moral of the story

you can do big, if you know how

Set a distance.

Set an expiration time.

Bye bye noise.

javier ramirez @supercoco9 https://teowaki.com

REST API (Ruby on Rails) +

Web on top (AngularJS)

data that’s an order of magnitude greater than data you’re accustomed to


Doug Laney VP Research, Business Analytics and Performance Management at Gartner

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Ed Dumbill program chair for the O’Reilly Strata Conference


bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds


Javier Ramirezimpresionable teowaki founder

1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time


twitterstackoverflowpinterestbooking.comWorld of WarcraftYouPornHipChatSnapchat


ntopngLogStash


Non intrusive metrics

Capture data really fast.

Then process the data on the background

Gzip to AWS S3/Glacier

orGoogle Cloud Storage


HadoopCassandraHadoop + Voldemort + KafkaHBase…Amazon Redshift


tools we considered:

but...

hard to set up and monitor

not interactive enough

expensive cluster


Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery


Based on “Dremel”

Specifically designed for interactive queries over petabytes of real-time data


loading data

You just send the data intext (or JSON) format


SQL


select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

specific extensions for analytics


withinflattennest

stddev

topfirstlastnth

variance

var_popvar_samp

covar_popcovar_samp

quantiles

correlations

Things you always wanted to try but were too scared to


select count(*) from publicdata:samples.wikipedia

where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

columnar storage


highly distributed execution using a tree


web console screenshot



country segmented traffic


window functions


our most active user

new users per month


10 request we should be caching

javier ramirez @supercoco9 http://teowaki.com

5 most created resources

select uri, count(*) total from stats where method = 'POST' group by URI;


...but

/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links


5 most created resources

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (

SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url

)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25

Automation with Apps Script

Read from bigquery

Create a spreadsheet on Drive

E-mail it everyday as a PDF


http://teowaki.com/

bigquery pricing

$26 per stored TB1000000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB1 full scan = 160 MB1 count = 0 MB1 full scan over 1 column = 5.4 MB100 GB => $0.05 / month £0.03


£0.054307 / month*

per 1MM rows

*the 1st 1TB every month is free of charge


1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time


Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks!תודה

Javier Ramírez@supercoco9

API analytics with Bigquery, by Javier Ramirez from teowaki

Technology

Transcript of API analytics with Bigquery, by Javier Ramirez from teowaki