API analytics with Bigquery, by Javier Ramirez from teowaki
javier ramirez @supercoco9
API Analytics with Redis, BigQuery, and Apps Script
a two-person start-up
a different league...
.. or maybe not
moral of the story
you can do big, if you know how
Set a distance.
Set an expiration time.
Bye bye noise.
javier ramirez @supercoco9 https://teowaki.com
REST API (Ruby on Rails) +
Web on top (AngularJS)
data that’s an order of magnitude greater than data you’re accustomed to
Doug Laney VP Research, Business Analytics and Performance Management at Gartner
data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.
Edd Dumbill, program chair for the O’Reilly Strata Conference
big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
Javier Ramirez, impressionable teowaki founder
1. non intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time
twitter
stackoverflow
pinterest
booking.com
World of Warcraft
YouPorn
HipChat
Snapchat
ntopng
LogStash
Non intrusive metrics
Capture data really fast.
Then process the data in the background
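A minimal sketch of the capture-then-process idea, assuming an in-memory deque as a stand-in for a Redis list (the talk title mentions Redis, but the slides do not show the actual data structure). The function names and event fields are hypothetical.

```python
import json
import time
from collections import deque

# Stand-in for a Redis list (assumption): appending is O(1),
# so the request path does almost no work.
pending = deque()

def capture(method, uri, status, duration_ms):
    """Request path: serialise the event and enqueue it, nothing else."""
    event = {"ts": time.time(), "method": method, "uri": uri,
             "status": status, "duration_ms": duration_ms}
    pending.append(json.dumps(event))

def drain(batch_size=1000):
    """Background job: pop up to batch_size events for later processing."""
    batch = []
    while pending and len(batch) < batch_size:
        batch.append(json.loads(pending.popleft()))
    return batch

capture("GET", "/users/javier/shouts", 200, 12.5)
capture("POST", "/teams/javier-community/links", 201, 30.1)
events = drain()
```

The request handler only ever calls capture; anything slow (archiving, loading into the analytics store) happens on the batches returned by drain.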
Gzip to AWS S3/Glacier
or Google Cloud Storage
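A sketch of the archiving step, assuming events are kept as one JSON object per line before being gzipped. The upload itself is left out on purpose, since it depends on the storage client and credentials in use; the helper names are hypothetical.

```python
import gzip
import json

def compress_events(events):
    """Serialise events as one JSON object per line and gzip the result."""
    lines = "".join(json.dumps(e) + "\n" for e in events)
    return gzip.compress(lines.encode("utf-8"))

def decompress_events(blob):
    """Inverse of compress_events, useful for checking an archive."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]

sample = [{"method": "GET", "uri": "/users/javier/shouts", "status": 200}] * 500
blob = compress_events(sample)
raw_size = sum(len(json.dumps(e)) + 1 for e in sample)
```

Request logs are highly repetitive, so the gzipped blob should be much smaller than raw_size, which is what makes keeping the full history affordable.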
tools we considered:
Hadoop
Cassandra
Hadoop + Voldemort + Kafka
HBase
…
Amazon Redshift
but...
hard to set up and monitor
not interactive enough
expensive cluster
Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
Based on “Dremel”
Specifically designed for interactive queries over petabytes of real-time data
loading data
You just send the data in text (or JSON) format
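A sketch of preparing rows for loading, assuming "text" here means delimited text such as CSV and "JSON" means one JSON object per line. The field list is a hypothetical schema, not the one used in the talk.

```python
import csv
import io
import json

FIELDS = ["ts", "method", "uri", "status"]  # hypothetical schema

def to_csv(rows):
    """Render rows as CSV text, one record per line, columns in FIELDS order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[f] for f in FIELDS])
    return buf.getvalue()

def to_json_lines(rows):
    """Render rows as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)

rows = [{"ts": "2014-05-01 10:00:00", "method": "GET",
         "uri": "/users/javier/shouts", "status": 200}]
csv_text = to_csv(rows)
json_text = to_json_lines(rows)
```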
SQL
select name from USERS order by date;
select count(*) from users;
select max(date) from USERS;
select sum(total) from ORDERS group by user;
specific extensions for analytics
within
flatten
nest
stddev
top
first
last
nth
variance
var_pop
var_samp
covar_pop
covar_samp
quantiles
correlations
Things you always wanted to try but were too scared to
select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
columnar storage
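A toy illustration of why columnar storage helps analytic queries, assuming a tiny in-memory table. It only shows the layout idea: a filter on one column touches that column's values and nothing else. This would be consistent with the pricing slide listing a one-column scan as far smaller than a full scan.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"method": "GET",  "uri": "/users/javier/shouts",          "status": 200},
    {"method": "POST", "uri": "/users/rgo/shouts",             "status": 201},
    {"method": "POST", "uri": "/teams/javier-community/links", "status": 201},
]

def to_columns(rows):
    """Pivot rows into a column-oriented layout: one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Counting POST requests reads only the "method" column;
# the "uri" and "status" values are never touched.
post_count = sum(1 for m in columns["method"] if m == "POST")
```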
highly distributed execution using a tree
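A rough sketch of tree-shaped execution, assuming a simple count-by-method aggregation. Leaves scan their own shard, intermediate nodes merge partial results, and the root returns the final answer. The shard contents and the fanout parameter are made up for illustration; the real service's internals are not described in the slides.

```python
from collections import Counter

def leaf(shard):
    """Leaf node: scan one shard and return a partial aggregate."""
    return Counter(row["method"] for row in shard)

def merge(partials):
    """Intermediate node: combine the partial aggregates of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def run_tree(shards, fanout=2):
    """Aggregate bottom-up, merging `fanout` children per node until one root remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf(shard) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()

shards = [
    [{"method": "GET"}, {"method": "POST"}],
    [{"method": "GET"}],
    [{"method": "POST"}, {"method": "POST"}],
    [{"method": "GET"}, {"method": "DELETE"}],
]
result = run_tree(shards)
```

Because each leaf works only on its own shard, the scans can run in parallel and only small partial results travel up the tree.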
web console screenshot
country segmented traffic
window functions
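The slide names window functions without showing one. As a rough illustration of what they compute, the sketch below emulates a row number taken over a partition (here: rank users by request count within each country). Unlike a group by, every input row is kept and simply annotated. The data and column names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def rank_within_partition(rows, partition_key, order_key):
    """Number rows within each partition, ordered by order_key descending."""
    ranked = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        members = sorted(group, key=itemgetter(order_key), reverse=True)
        for position, row in enumerate(members, start=1):
            ranked.append({**row, "rank": position})
    return ranked

traffic = [
    {"country": "ES", "user": "rgo",    "requests": 80},
    {"country": "UK", "user": "ana",    "requests": 50},
    {"country": "ES", "user": "javier", "requests": 120},
]
ranked = rank_within_partition(traffic, "country", "requests")
```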
our most active user
new users per month
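The queries behind these two slides are not in the transcript. The sketch below shows one way such answers could be derived from a request log, assuming each event carries a user name and an ISO date and that a user counts as "new" in the month of their first request. Names and dates are invented.

```python
from collections import Counter

def most_active_user(events):
    """Return the (user, request_count) pair with the most requests."""
    return Counter(user for user, _ in events).most_common(1)[0]

def new_users_per_month(events):
    """Count users by the month ('YYYY-MM') of their first request."""
    first_seen = {}
    for user, day in events:
        month = day[:7]
        if user not in first_seen or month < first_seen[user]:
            first_seen[user] = month
    return Counter(first_seen.values())

log = [
    ("javier", "2014-01-15"),
    ("rgo",    "2014-01-20"),
    ("javier", "2014-02-03"),
    ("ana",    "2014-02-10"),
    ("javier", "2014-02-11"),
]
top = most_active_user(log)
per_month = new_users_per_month(log)
```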
10 requests we should be caching
5 most created resources
select uri, count(*) total from stats where method = 'POST' group by uri order by total desc limit 5;
...but
/users/javier/shouts
/users/rgo/shouts
/teams/javier-community/links
/teams/nosqlmatters-cgn/links
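The problem with these URIs is that each one embeds a user or team name, so grouping by the raw URI never aggregates requests for the same kind of resource. A sketch of one possible fix, assuming routes of the shape /collection/identifier/subresource: replace the identifier with a placeholder before counting. The pattern and the :id placeholder are assumptions; the same idea could presumably be expressed inside the query with a regular-expression replace.

```python
import re
from collections import Counter

# Assumed route shape: /<users|teams>/<identifier>/<subresource>
IDENTIFIER = re.compile(r"^/(users|teams)/[^/]+/")

def normalise(uri):
    """Replace the user or team name in a URI with a generic placeholder."""
    return IDENTIFIER.sub(r"/\1/:id/", uri)

uris = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]
totals = Counter(normalise(u) for u in uris)
```

After normalisation the four URIs above collapse into two resource types, which is what a "most created resources" ranking needs.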
5 most created resources
SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url
FROM github.timeline
WHERE type="WatchEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
AND repository_url IN (
  SELECT repository_url
  FROM github.timeline
  WHERE type="CreateEvent"
  AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
  AND repository_fork = "false"
  AND payload_ref_type = "repository"
  GROUP BY repository_url
)
GROUP BY repository_name, repository_language, repository_description, repository_url
HAVING cnt >= 5
ORDER BY cnt DESC
LIMIT 25
NO
Automation with Apps Script
Read from bigquery
Create a spreadsheet on Drive
E-mail it every day as a PDF
bigquery pricing
$26 per stored TB
1,000,000 rows => $0.00416 / month
£0.00243 / month
$5 per processed TB
1 full scan = 160 MB
1 count = 0 MB
1 full scan over 1 column = 5.4 MB
100 GB => $0.05 / month
£0.03
£0.054307 / month*
per 1MM rows
*the 1st 1TB every month is free of charge
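The storage figure can be reproduced with a little arithmetic. The sketch below uses the per-TB rates from the slides and assumes decimal units (1 TB = 1,000,000 MB) and that the free first terabyte applies to processed data; both are assumptions, and current pricing may differ.

```python
STORAGE_USD_PER_TB = 26.0   # from the slide
QUERY_USD_PER_TB = 5.0      # from the slide
MB_PER_TB = 1_000_000       # assumption: decimal units

def storage_cost_usd(stored_mb):
    """Monthly storage cost for stored_mb megabytes."""
    return stored_mb / MB_PER_TB * STORAGE_USD_PER_TB

def query_cost_usd(processed_mb, free_tb_remaining=0.0):
    """Cost of processing processed_mb megabytes, after any remaining free quota."""
    billable_tb = max(processed_mb / MB_PER_TB - free_tb_remaining, 0.0)
    return billable_tb * QUERY_USD_PER_TB

# 1MM rows take about 160 MB according to the slide.
monthly_storage = storage_cost_usd(160)
# A full scan of those rows while still inside the assumed free terabyte.
free_scan = query_cost_usd(160, free_tb_remaining=1.0)
```

With these rates, 160 MB of storage works out to roughly $0.00416 per month, matching the slide.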
1. non intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time
Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks! תודה
Javier Ramírez @supercoco9