Exploring Open Data with BigQuery - mimming.com · Jen Tong Developer Advocate Google Cloud...

66
Exploring Open Data with BigQuery

Transcript of Exploring Open Data with BigQuery - mimming.com · Jen Tong Developer Advocate Google Cloud...

Exploring Open Data with BigQuery

Jen TongDeveloper AdvocateGoogle Cloud Platform

@MimmingCodesmimming.com

Agenda

Data - it's everywhere

Count stuff

How it works

Some cool open data

Do something useful

About you

● Web developers?● Data people?● Students?● Not techies at all?

Photo credit: Matt Chan

Data

photo credit - taniwha on flickr

Photo credit: Matt Chan

Data

photo credit - wemake_cc on flickr

Photo credit: Matt Chanphoto credit - isaiah115 on flickr

Photo credit: Matt Chan

Google Research Publications

Google Research Publications

Open Source Implementations

Bigtable

Flume

Dremel

Managed Cloud Versions

Bigtable

Flume

Dremel

Bigtable

Dataflow

BigQuery

Google BigQueryGoogle BigQuery

02 Count some stuff

SELECT count(word)FROM publicdata:samples.shakespeare

Words in Shakespeare

SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_20150212_01]

Wikipedia hits over 1 hour

SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_201505]

Wikipedia hits over 1 month

Several years of Wikipedia data

SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_201105], [fh-bigquery:wikipedia.pagecounts_201106], [fh-bigquery:wikipedia.pagecounts_201107],

...

SELECT SUM(requests) AS totalFROM TABLE_QUERY( [fh-bigquery:wikipedia], 'REGEXP_MATCH( table_id, r"pagecounts_2015[0-9]{2}$")')

Several years of Wikipedia data

How about a RegExp

SELECT SUM(requests) AS totalFROM TABLE_QUERY( [fh-bigquery:wikipedia], 'REGEXP_MATCH( table_id, r"pagecounts_2015[0-9]{2}$")')WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))

03 How did it do that?o_O

Qualities of a good RDBMS

Qualities of a good RDBMS

● Inserts & locking● Indexing● Cache● Query planning

Qualities of a good RDBMS

● Inserts & locking● Indexing● Cache● Query planning

Storing data

-- -- -- ---- -- -- ---- -- -- --

Table

Columns

Disks

Reading data: Life of a BigQuery

SELECT sum(requests) as sumFROM ( SELECT requests, title FROM [fh-bigquery:wikipedia.pagecounts_201501] WHERE (REGEXP_MATCH(title, '[Jj]en.+')) )

Life of a BigQuery

L L

MMixer

Leaf

Storage

L L L L

M M

M

Life of a BigQuery

Root Mixer

Mixer

Leaf

Storage

Life of a BigQueryQuery

L L L L

M M

MRoot Mixer

Mixer

Leaf

Storage

Life of a BigQueryLife of a BigQuery

Root Mixer

Mixer

Leaf

StorageSELECT requests, title

L L L L

M M

M

Life of a BigQueryLife of a BigQuery

Root Mixer

Mixer

Leaf

Storage5.4 Bil

SELECT requests, title

WHERE (REGEXP_MATCH(title, '[Jj]en.+'))L L L L

M M

M

Life of a BigQueryLife of a BigQuery

Root Mixer

Mixer

Leaf

Storage5.4 Bil

SELECT sum(requests)

5.8 MilWHERE (REGEXP_MATCH(title, '[Jj]en.+'))

SELECT requests, title

L L L L

M M

M

Life of a BigQueryLife of a BigQuery

Root Mixer

Mixer

Leaf

Storage5.4 Bil

SELECT sum(requests)

5.8 MilWHERE (REGEXP_MATCH(title, '[Jj]en.+'))

SELECT requests, title

SELECT sum(requests)

L L L L

M M

M

04 Open Data

Finding Open Data

opendata.stackexchange.com

Finding Open Data

reddit.com/r/bigquery/wiki/datasets

Finding Open Data

cloud.google.com/bigquery/public-data

Time to explore

GSOD

Find weather dataselect namefrom [fh-bigquery:weather_gsod.stations]where state == 'MO' and usaf in(select stn from (SELECT count(stn) as cnt, stn FROM [fh-bigquery:weather_gsod.gsod2015] where stn <> '999999' group by stn order by cnt desc))group by nameorder by name ASC;

Weather at STLSELECT DATE(year+mo+da) day, min, maxFROM [fh-bigquery:weather_gsod.gsod2015] WHERE stn IN ( SELECT usaf FROM [fh-bigquery:weather_gsod.stations] WHERE name = 'ST. LOUIS/LAMBERT')AND max < 200ORDER BY day;

Weather at STLSELECT DATE(year+mo+da) day, min, maxFROM [fh-bigquery:weather_gsod.gsod2013] WHERE stn IN ( SELECT usaf FROM [fh-bigquery:weather_gsod.stations] WHERE name = 'ST. LOUIS/LAMBERT')AND max < 200ORDER BY day;

Global high temperatures SELECT year, avg(max) as averageFROM TABLE_QUERY( [fh-bigquery:weather_gsod], 'table_id CONTAINS "gsod"')where max < 200 group by year order by year asc

Global high temperatures SELECT year, avg(max) as averageFROM TABLE_QUERY( [fh-bigquery:weather_gsod], 'table_id CONTAINS "gsod"')where max < 200 group by year order by year asc

GDELT

Finding stories about Missouri

SELECT count(Actor1Geo_FullName) count, Actor1Geo_FullName FROM [gdelt-bq:full.events]WHERE MonthYear > 0 AND Actor1Geo_FullName LIKE '%Missouri%' group by Actor1Geo_FullName order by count desclimit 100;

Finding stories about MissouriSELECT count(Actor1Geo_FullName) count, Actor1Geo_FullName FROM [gdelt-bq:full.events]WHERE MonthYear > 0 AND Actor1Geo_FullName LIKE 'Missouri%' group by Actor1Geo_FullName order by count desclimit 100;

Stories per month

SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri, United States') MissouriFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1

Stories per monthSELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri, United States') MissouriFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1

SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri...') / count(*) newsynessFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1

Stories per month, normalized

SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri...') / count(*) newsynessFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1

Stories per month, normalized

05 Something Useful Use Wikipedia data to pick a movie

1. Wikipedia edits2. ???3. Movie recommendation

Follow the edits

Follow the edits

Same editor

select title, id, count(id) as editsfrom [publicdata:samples.wikipedia]where title contains 'Hackers' and title contains '(film)' and wp_namespace = 0group by title, idorder by editslimit 10

Pick a great movie

select title, id, count(id) as edits from [publicdata:samples.wikipedia]where contributor_id in ( select contributor_id from [publicdata:samples.wikipedia] where

id=264176 and contributor_id is not null and is_bot is null and wp_namespace = 0 and title CONTAINS '(film)' group by contributor_id) and wp_namespace = 0 and id != 264176 and title CONTAINS '(film)'group each by title, idorder by edits desclimit 100

Find edits in common

Discover the most broadly popular filmsselect id from ( select id, count(id) as edits from [publicdata:samples.wikipedia] where wp_namespace = 0 and title CONTAINS '(film)' group each by id order by edits desc limit 20)

Edits in common, minus broadly popularselect title, id, count(id) as edits from [publicdata:samples.wikipedia]where contributor_id in ( select contributor_id from [publicdata:samples.wikipedia] where

id=264176 and contributor_id is not null and is_bot is null and wp_namespace = 0 and title CONTAINS '(film)' group by contributor_id) and wp_namespace = 0 and id != 264176 and title CONTAINS '(film)' and id not in (

select id from ( select id, count(id) as edits from [publicdata:samples.wikipedia] where wp_namespace = 0 and title CONTAINS '(film)' group each by id order by edits desc limit 20 ) )group each by title, idorder by edits desclimit 100

Interesting challenges await

Thank you!

Jen TongDeveloper AdvocateGoogle Cloud Platform@MimmingCodes

Try BigQuery: bigquery.google.com

Slides:mimming.com/presos