A case for teaching SQL to scientists
![Page 1: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/1.jpg)
A case for teaching SQL to scientists
Daniel Halperin #w2tbac @SESYNC 2013-07-09
![Page 2: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/2.jpg)
SQL: think like data
• SQL is a Language for expressing Queries over Structured data.
• Compared with Python/R, SQL is
  • strictly less powerful
  • better for concisely, clearly, and efficiently expressing data manipulation
• ... and anecdotally, “many” scripts written by scientists just manipulate data
![Page 3: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/3.jpg)
Claim 1: SQL is Concise & Clear
• English questions often translate directly into SQL
• Scripting languages carry a lot of language overhead: boilerplate for opening files, looping, and parsing
• Let’s see some (admittedly biased) examples
![Page 4: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/4.jpg)
    with open('file.txt') as input_file:
        cnt = 0
        for line in input_file:
            cnt += 1
    print(cnt)
What does this code do?
![Page 5: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/5.jpg)
    SELECT COUNT(*) AS cnt
    FROM file
![Page 6: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/6.jpg)
    with open('file.txt') as input_file:
        for line in input_file:
            if int(line.split()[3]) > 5:
                print(line)
What does this code do?
![Page 7: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/7.jpg)
    SELECT *
    FROM file
    WHERE value > 5
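The SQL on these slides assumes that file.txt has already been loaded as a table named `file`. A minimal sketch of that step, using Python's built-in `sqlite3` module as a stand-in database. The sample rows and the column names `c0`, `c1`, `c2` are assumptions for illustration; only `value` (field 3) and `counts` (field 4) come from the slides.

```python
import sqlite3

# Hypothetical sample rows standing in for the lines of file.txt.
# Field 3 is `value` and field 4 is `counts`, matching split()[3] and split()[4].
SAMPLE_LINES = [
    "a b c 3 10",
    "a b c 7 20",
    "a b c 7 5",
    "a b c 9 1",
]


def load_table(lines):
    """Load whitespace-delimited lines into an in-memory table named `file`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "file" '
        "(c0 TEXT, c1 TEXT, c2 TEXT, value INTEGER, counts INTEGER)"
    )
    # INTEGER columns convert numeric-looking text such as "7" to the number 7.
    conn.executemany(
        'INSERT INTO "file" VALUES (?, ?, ?, ?, ?)',
        [line.split() for line in lines],
    )
    return conn


conn = load_table(SAMPLE_LINES)

# The two queries from the slides, run against the loaded table.
row_count = conn.execute('SELECT COUNT(*) AS cnt FROM "file"').fetchone()[0]
big_rows = conn.execute('SELECT * FROM "file" WHERE value > 5').fetchall()

print(row_count)
print(big_rows)
```

Once the data is in a table, each question on the slides becomes a single query, and the parsing logic lives in one place instead of in every script.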
![Page 8: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/8.jpg)
What does this code do?
    SELECT value, SUM(counts) AS tot_count
    FROM file
    GROUP BY value
![Page 9: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/9.jpg)
    from collections import defaultdict
    with open('file.txt') as input_file:
        tot_counts = defaultdict(int)  # defaultdict takes a factory, not a value
        for line in input_file:
            fields = line.split()
            tot_counts[fields[3]] += int(fields[4])
    for value in tot_counts:
        print(value, tot_counts[value])
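To check that the two versions really compute the same thing, here is a small sketch that runs both on the same data. The sample rows and the filler column names are hypothetical, `sqlite3` is just a convenient stand-in database, and the Python version converts `value` to an integer so the two results can be compared directly.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: field 3 is `value`, field 4 is `counts`.
SAMPLE_LINES = [
    "a b c 3 10",
    "a b c 7 20",
    "a b c 7 5",
    "a b c 9 1",
]


def totals_python(lines):
    """Sum counts per value with a dictionary, as in the script on the slide."""
    tot_counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        tot_counts[int(fields[3])] += int(fields[4])
    return dict(tot_counts)


def totals_sql(lines):
    """Sum counts per value with the GROUP BY query from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "file" (c0, c1, c2, value INTEGER, counts INTEGER)'
    )
    conn.executemany(
        'INSERT INTO "file" VALUES (?, ?, ?, ?, ?)',
        [line.split() for line in lines],
    )
    query = 'SELECT value, SUM(counts) AS tot_count FROM "file" GROUP BY value'
    return dict(conn.execute(query).fetchall())


python_totals = totals_python(SAMPLE_LINES)
sql_totals = totals_sql(SAMPLE_LINES)
print(python_totals == sql_totals)
```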
![Page 10: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/10.jpg)
What does this code do?
    SELECT census.county, electoral.votes / census.population AS voting_rate
    FROM electoral, census
    WHERE electoral.county = census.county
![Page 11: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/11.jpg)
<Complicated stuff with dictionaries>
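For comparison, one way the "complicated stuff with dictionaries" might look in Python. This is a sketch, not the code from the talk: the county rows are made up, and only the names `county`, `votes`, and `population` come from the query on the slide.

```python
# Hypothetical rows: (county, votes) and (county, population).
electoral = [("King", 900000), ("Pierce", 350000), ("Ferry", 3500)]
census = [("King", 2000000), ("Pierce", 800000), ("Yakima", 250000)]


def voting_rates(electoral_rows, census_rows):
    """Join the two datasets on county and compute votes / population."""
    # Build a lookup table on the join key.
    votes_by_county = dict(electoral_rows)
    rates = {}
    for county, population in census_rows:
        # Inner join: keep only counties that appear in both datasets.
        if county in votes_by_county:
            rates[county] = votes_by_county[county] / population
    return rates


rates = voting_rates(electoral, census)
for county, rate in rates.items():
    print(county, rate)
```

One caveat when running the SQL version: in several databases, including SQLite and PostgreSQL, dividing one integer column by another truncates the result, so `votes / population` would come out as 0. Writing `1.0 * electoral.votes / census.population` avoids that.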
![Page 12: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/12.jpg)
Claim 2: SQL is Efficient
Scaling up your data
• What happens when Python/R data doesn’t fit in memory? The script crashes, or you rewrite it as much more complicated code
• Most databases spill to disk automatically and transparently, and are heavily optimized for performance
![Page 13: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/13.jpg)
Claim 2: SQL is Efficient
Say you inherit a really well-engineered Python script
    ./highly_optimized_code.py < TB.dataset > GB.result
![Page 14: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/14.jpg)
Claim 2: SQL is Efficient
But you are only interested in a small fraction of the result
    ./simple_data_filter.py < GB.result > MB.answer
![Page 15: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/15.jpg)
Claim 2: SQL is Efficient
1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program
![Page 16: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/16.jpg)
Claim 2: SQL is Efficient
    CREATE VIEW their_query AS
    SELECT <... their code ...>
    FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
![Page 17: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/17.jpg)
Claim 2: SQL is Efficient
    SELECT *
    FROM their_query
    WHERE <... your filter ...>
Combine both queries and optimize together!
![Page 18: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/18.jpg)
Claim 2: SQL is Efficient
Fast!
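A small sketch of the view idea, using Python's built-in `sqlite3` as a stand-in database. The columns `sample` and `reading` and the aggregate inside the view are assumptions standing in for `<... their code ...>`. Rows inserted after the view is created still show up in the answer, which illustrates that the view stores the query rather than its result. Whether the filter actually gets pushed inside the view's query is up to each database's optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terabyte_dataset (sample TEXT, reading REAL)")

# Stand-in for <... their code ...>: an aggregate over the big table.
# Creating the view only names the query; nothing is computed yet.
conn.execute(
    "CREATE VIEW their_query AS "
    "SELECT sample, AVG(reading) AS mean_reading "
    "FROM terabyte_dataset "
    "GROUP BY sample"
)

# Data added after the view was defined.
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)",
    [("s1", 1.0), ("s1", 3.0), ("s2", 10.0)],
)

# Your filter, written against their view. The database sees both
# queries at once and is free to plan them together.
answer = conn.execute(
    "SELECT * FROM their_query WHERE sample = 's1'"
).fetchall()
print(answer)
```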
![Page 19: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/19.jpg)
SQL for Science
• UW’s SQLShare - open, view-oriented, web database service
• Easy data import, public & private sharing, permalinks (DOI support coming)
• Use a series of views instead of scripts for:
• data cleaning, transformation, integration
• simple stats, analytics, format conversion
• provenance and publishing
• mashups: integrated with R, Sage, etc.
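The "series of views instead of scripts" pattern can be sketched as follows. This uses Python's built-in `sqlite3` rather than SQLShare itself, and the table, view, and column names (`raw_upload`, `cleaned`, `station_means`, `station`, `temp_c`) are hypothetical. Each view plays the role of one script in a pipeline and builds on the previous one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw upload: everything arrives as text, with "NA" for gaps.
conn.execute("CREATE TABLE raw_upload (station TEXT, temp_c TEXT)")
conn.executemany(
    "INSERT INTO raw_upload VALUES (?, ?)",
    [("A", "10.5"), ("A", "NA"), ("B", "20.0"), ("B", "22.0")],
)

# Step 1 (data cleaning): drop missing readings and cast text to numbers.
conn.execute(
    "CREATE VIEW cleaned AS "
    "SELECT station, CAST(temp_c AS REAL) AS temp_c "
    "FROM raw_upload "
    "WHERE temp_c <> 'NA'"
)

# Step 2 (simple stats): per-station mean, defined on top of the cleaned view.
conn.execute(
    "CREATE VIEW station_means AS "
    "SELECT station, AVG(temp_c) AS mean_temp_c "
    "FROM cleaned "
    "GROUP BY station"
)

means = conn.execute(
    "SELECT * FROM station_means ORDER BY station"
).fetchall()
print(means)
```

Because each step is a named query over the previous one, the chain of view definitions doubles as a record of how the final table was derived.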
![Page 20: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/20.jpg)
escience.washington.edu/sqlshare
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
- Andrew D White, grad student in UW Chem Eng
“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
- Robin Kodner, as asst professor at Western Washington U
"That [SQL query that finished in 1 second] took me a week [manually in Excel]!"
- Robin Kodner, as postdoc at UW Oceanography
* yes, we need (and are interested in) more than anecdotes!!
![Page 21: A case for teaching SQL to scientists](https://reader033.fdocuments.us/reader033/viewer/2022051609/547b68afb4795972098b4e4a/html5/thumbnails/21.jpg)
SQL can do more than you think (here vs R)