Exploring Open Data with BigQuery - mimming.com · Jen Tong Developer Advocate Google Cloud...
Transcript of Exploring Open Data with BigQuery - mimming.com · Jen Tong Developer Advocate Google Cloud...
Jen TongDeveloper AdvocateGoogle Cloud Platform
@MimmingCodesmimming.com
SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_20150212_01]
Wikipedia hits over 1 hour
SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as totalFROM [fh-bigquery:wikipedia.pagecounts_201105], [fh-bigquery:wikipedia.pagecounts_201106], [fh-bigquery:wikipedia.pagecounts_201107],
...
SELECT SUM(requests) AS totalFROM TABLE_QUERY( [fh-bigquery:wikipedia], 'REGEXP_MATCH( table_id, r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
How about a RegExp
SELECT SUM(requests) AS totalFROM TABLE_QUERY( [fh-bigquery:wikipedia], 'REGEXP_MATCH( table_id, r"pagecounts_2015[0-9]{2}$")')WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
Reading data: Life of a BigQuery
SELECT sum(requests) as sumFROM ( SELECT requests, title FROM [fh-bigquery:wikipedia.pagecounts_201501] WHERE (REGEXP_MATCH(title, '[Jj]en.+')) )
Life of a BigQueryLife of a BigQuery
Root Mixer
Mixer
Leaf
StorageSELECT requests, title
L L L L
M M
M
Life of a BigQueryLife of a BigQuery
Root Mixer
Mixer
Leaf
Storage5.4 Bil
SELECT requests, title
WHERE (REGEXP_MATCH(title, '[Jj]en.+'))L L L L
M M
M
Life of a BigQueryLife of a BigQuery
Root Mixer
Mixer
Leaf
Storage5.4 Bil
SELECT sum(requests)
5.8 MilWHERE (REGEXP_MATCH(title, '[Jj]en.+'))
SELECT requests, title
L L L L
M M
M
Life of a BigQueryLife of a BigQuery
Root Mixer
Mixer
Leaf
Storage5.4 Bil
SELECT sum(requests)
5.8 MilWHERE (REGEXP_MATCH(title, '[Jj]en.+'))
SELECT requests, title
SELECT sum(requests)
L L L L
M M
M
Finding Open Data
opendata.stackexchange.com
Finding Open Data
reddit.com/r/bigquery/wiki/datasets
Finding Open Data
cloud.google.com/bigquery/public-data
Find weather dataselect namefrom [fh-bigquery:weather_gsod.stations]where state == 'MO' and usaf in(select stn from (SELECT count(stn) as cnt, stn FROM [fh-bigquery:weather_gsod.gsod2015] where stn <> '999999' group by stn order by cnt desc))group by nameorder by name ASC;
Weather at STLSELECT DATE(year+mo+da) day, min, maxFROM [fh-bigquery:weather_gsod.gsod2015] WHERE stn IN ( SELECT usaf FROM [fh-bigquery:weather_gsod.stations] WHERE name = 'ST. LOUIS/LAMBERT')AND max < 200ORDER BY day;
Weather at STLSELECT DATE(year+mo+da) day, min, maxFROM [fh-bigquery:weather_gsod.gsod2013] WHERE stn IN ( SELECT usaf FROM [fh-bigquery:weather_gsod.stations] WHERE name = 'ST. LOUIS/LAMBERT')AND max < 200ORDER BY day;
Global high temperatures SELECT year, avg(max) as averageFROM TABLE_QUERY( [fh-bigquery:weather_gsod], 'table_id CONTAINS "gsod"')where max < 200 group by year order by year asc
Global high temperatures SELECT year, avg(max) as averageFROM TABLE_QUERY( [fh-bigquery:weather_gsod], 'table_id CONTAINS "gsod"')where max < 200 group by year order by year asc
Finding stories about Missouri
SELECT count(Actor1Geo_FullName) count, Actor1Geo_FullName FROM [gdelt-bq:full.events]WHERE MonthYear > 0 AND Actor1Geo_FullName LIKE '%Missouri%' group by Actor1Geo_FullName order by count desclimit 100;
Finding stories about MissouriSELECT count(Actor1Geo_FullName) count, Actor1Geo_FullName FROM [gdelt-bq:full.events]WHERE MonthYear > 0 AND Actor1Geo_FullName LIKE 'Missouri%' group by Actor1Geo_FullName order by count desclimit 100;
Stories per month
SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri, United States') MissouriFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1
Stories per monthSELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri, United States') MissouriFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1
SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri...') / count(*) newsynessFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1
Stories per month, normalized
SELECT DATE(STRING(MonthYear) + '01') month, SUM(Actor1Geo_FullName='Missouri...') / count(*) newsynessFROM [gdelt-bq:full.events]WHERE MonthYear > 0GROUP BY 1 ORDER BY 1
Stories per month, normalized
select title, id, count(id) as editsfrom [publicdata:samples.wikipedia]where title contains 'Hackers' and title contains '(film)' and wp_namespace = 0group by title, idorder by editslimit 10
Pick a great movie
select title, id, count(id) as edits from [publicdata:samples.wikipedia]where contributor_id in ( select contributor_id from [publicdata:samples.wikipedia] where
id=264176 and contributor_id is not null and is_bot is null and wp_namespace = 0 and title CONTAINS '(film)' group by contributor_id) and wp_namespace = 0 and id != 264176 and title CONTAINS '(film)'group each by title, idorder by edits desclimit 100
Find edits in common
Discover the most broadly popular filmsselect id from ( select id, count(id) as edits from [publicdata:samples.wikipedia] where wp_namespace = 0 and title CONTAINS '(film)' group each by id order by edits desc limit 20)
Edits in common, minus broadly popularselect title, id, count(id) as edits from [publicdata:samples.wikipedia]where contributor_id in ( select contributor_id from [publicdata:samples.wikipedia] where
id=264176 and contributor_id is not null and is_bot is null and wp_namespace = 0 and title CONTAINS '(film)' group by contributor_id) and wp_namespace = 0 and id != 264176 and title CONTAINS '(film)' and id not in (
select id from ( select id, count(id) as edits from [publicdata:samples.wikipedia] where wp_namespace = 0 and title CONTAINS '(film)' group each by id order by edits desc limit 20 ) )group each by title, idorder by edits desclimit 100
Thank you!
Jen TongDeveloper AdvocateGoogle Cloud Platform@MimmingCodes
Try BigQuery: bigquery.google.com
Slides:mimming.com/presos