Big Data Analytics with Google BigQuery. GDG Summit Spain 2014

javier ramirez@supercoco9

Big Data Analyticswith Google BigQuery

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

REST API +

AngularJS web as an API client

javier ramirez @supercoco9 https://teowaki.com

data that’s an order of magnitude greater than data you’re accustomed to

Doug Laney VP Research, Business Analytics and Performance Management at Gartner

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Ed Dumbill program chair for the O’Reilly Strata Conference

bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

Javier Ramirezimpresionable teowaki founder

bigdata is cool but...

expensive cluster

hard to set up and monitor

not interactive enough

Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

Based on Dremel

Specifically designed for interactive queries over petabytes of real-time data

• Analysis of crawled web documents.• Tracking install data for applications on Android Market.• Crash reporting for Google products.• OCR results from Google Books.• Spam analysis.• Debugging of map tiles on Google Maps.• Tablet migrations in managed Bigtable instances.• Results of tests run on Google’s distributed build system.• Disk I/O statistics for hundreds of thousands of disks.• Resource monitoring for jobs run in Google’s data centers.• Symbols and dependencies in Google’s codebase.

What Dremel is used for in Google

Columnarstorage

highly distributed execution using a tree

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

loading data

You can feed flat CSV-like files or nested JSON objects

bq cli

bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz

web console screenshot

analytical SQL functions.correlations.

window functions.views.

JSON fields.timestamped tables.

Things you always wanted to try but were too scared to

select count(*) from publicdata:samples.wikipedia

where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (

SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url

)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25

Global Database of Events, Language and Tone

quarter billion rows30 yearsupdated daily

http://gdeltproject.org/data.html#googlebigquery

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

our most active user

10 request we should be caching

javier ramirez @supercoco9 http://teowaki.com

5 most created resources

select uri, count(*) total from stats where method = 'POST' group by URI;

...but

/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links

5 most created resources

Automation with Apps Script

Read from bigquery

Create a spreadsheet on Drive

E-mail it everyday as a PDF

bigquery pricing

$80 per stored TB300000 rows => $0.007629392 / month

$35 per processed TB1 full scan = 84 MB1 count = 0 MB1 full scan over 1 column = 5.4 MB10 GB => $0.35 / month*the 1st TB every month is free of charge

Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Grazas

Javier Ramírez@supercoco9

Big Data Analytics with Google BigQuery. GDG Summit Spain 2014

Software

Transcript of Big Data Analytics with Google BigQuery. GDG Summit Spain 2014

Swachandabhiravras rs006 gdg

Exploring BigData with Google BigQuery

Redshift VS BigQuery

How BigQuery broke my heart

Pratishyaya kc036 gdg

Amakaswasa#dg05 gdg

Madhumeha kc007 gdg

Medoharatwa#dg14 gdg

EPA’s GeoData Gateway (GDG) - Amazon S3 Agenda • Background – What is the EPA’s GeoData Gateway (GDG)? • Key GDG Features – What does the GDG provide? – How does the

Impala and BigQuery

Madhumeha kc041 gdg

Google BigQuery

Gdg wednesday

Column Stores and Google BigQuery

Grudhrasi kc034 gdg

Google BigQuery & Tableau: Best Practices · Google BigQuery Tableau Bes Pracices 2 Introduction Tableau and Google BigQuery allows people to analyze massive amounts of data and get

Google BigQuery - Meetupfiles.meetup.com/5930972/DataHackers Google BigQuery 11-18-2014.pdf · •Open the Google APIs Console to your BigQuery project •Enable Cloud Storage on

Hypretension kc043 gdg

Manyastambha kc028 gdg

Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012