Data Science: Best Practice & Governance in Analytics. Sayara Beg, 30th April 2013.
Operational Research Consultancy 2
Agenda
• Data Science
– Role of the Data Scientist
– What does a data scientist do?
• Best Practice & Governance in Data Science
– Unintentional mistakes; or Fraud?
– Issue 1: Question of Reproducibility
– Issue 2: Applying the Scientific Method
30/04/2014
The Role of the Data Scientist
• Sexiest Job of the 21st Century – HBR, Oct 2012
• A superman or woman?
• A Mathematician?
• A Computer Programmer?
• A Graphic Designer?
• All rolled into one?
A tedious job?
• HBR 2014 – “A Data Scientist’s job is tedious..”
• Majority of the ‘scientific analysis’ time spent:
– Data Discovery (extraction)
– Data Wrangling (interpretation)
– Data Munging (transformation)
– Data Cleansing
– Data Profiling
• Less time spent modelling & visualising
What does a Data Scientist do?
– Scientific Experts: Statistics, Mathematics, O.R. Modelling, Physics
• Identify algorithms, gather insights, discover patterns, clusters
– Tools of the Trade: Data, Hardware, Software, Programming
• Access, capture, prepare, cleanse large data sets
– Interpersonal Skills: Communication, Presentation
• Visual communication using colour, shape, size, quantity
Analysis – Structured Coding in SQL
• SQL code:
SELECT id, name, reg_date, reg_address
FROM employee
ORDER BY name;
ID        Name    Reg_date     Reg_address
32487493  Bloggs  24-Sep-1992  London
98349435  Doe     07-Aug-1983  Munich
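The query and result on this slide can be tried end to end with Python's built-in `sqlite3` module. This is only a minimal sketch: the in-memory database, the column types and the choice to store dates as text are assumptions, and it sorts with `ORDER BY name` on the assumption that the slide intends one row per employee listed by name.

```python
import sqlite3

# In-memory database; the table layout mirrors the slide's result columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, reg_date TEXT, reg_address TEXT)"
)

# Sample rows copied from the slide (dates kept as plain text, an assumption).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (98349435, "Doe", "07-Aug-1983", "Munich"),
        (32487493, "Bloggs", "24-Sep-1992", "London"),
    ],
)

# One row per employee, sorted by name.
rows = conn.execute(
    "SELECT id, name, reg_date, reg_address FROM employee ORDER BY name"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Inserting the rows out of order makes the effect of the sort visible in the printed result.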
Exploratory Data Analysis (EDA) in ‘R’
# Goal: Toss a coin N times and compute the running proportion of heads.
N = 500  # Specify the total number of flips, denoted N

# Generate a random sample of N flips for a biased coin with P(heads) = 0.8
# (heads=1, tails=0):
set.seed(47405)
flipsequence = sample(x=c(0,1), prob=c(0.2,0.8), size=N, replace=TRUE)

# Compute the running proportion of heads:
r = cumsum(flipsequence)
n = 1:N  # n is a vector
runprop = r/n

# Graph the running proportion:
plot(n, runprop, type="o", log="x",
     xlim=c(1,N), ylim=c(0.0,1.0), cex.axis=1.5,
     xlab="Flip Number", ylab="Proportion Heads", cex.lab=1.5,
     main="Running Proportion of Heads", cex.main=1.5)

# Plot a dotted horizontal line at y=0.8, just as a reference line:
lines(c(1,N), c(0.8,0.8), lty=3)

# Display the beginning of the flip sequence:
flipletters = paste(c("T","H")[flipsequence[1:10]+1], collapse="")
displaystring = paste("Flip Sequence = ", flipletters, "...", sep="")
text(5, 0.9, displaystring, adj=c(0,1), cex=1.3)

# Display the relative frequency at the end of the sequence:
text(N, 0.3, paste("End Proportion = ", runprop[N]), adj=c(1,0), cex=1.3)
Technologies-Structured, Modelled
[Figure: Star and Snowflake schema diagrams, built from Fact Tables and their Dimension Tables]
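The difference between the two layouts can be sketched with Python's built-in `sqlite3` module. Everything named here (`fact_sales`, `dim_store`, `dim_product`, `dim_category` and the sample rows) is hypothetical and chosen only for illustration: the store dimension hangs directly off the fact table, star style, while the product dimension is normalised one step further into a category table, snowflake style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)  -- snowflaked
    );
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);

    -- The fact table holds measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        amount REAL
    );

    INSERT INTO dim_category VALUES (1, 'Books'), (2, 'Music');
    INSERT INTO dim_product VALUES (10, 'Novel', 1), (11, 'Atlas', 1), (12, 'Album', 2);
    INSERT INTO dim_store VALUES (100, 'London'), (101, 'Munich');
    INSERT INTO fact_sales VALUES
        (10, 100, 12.0), (11, 100, 30.0), (12, 101, 9.0), (10, 101, 12.0);
    """
)

# Star-style query: the fact table joins directly to a dimension.
sales_by_city = conn.execute(
    """
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
    """
).fetchall()

# Snowflake-style query: reaching the category needs a second join.
sales_by_category = conn.execute(
    """
    SELECT c.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
    """
).fetchall()
conn.close()

print(sales_by_city)
print(sales_by_category)
```

The trade-off shown is the usual one: the star layout needs fewer joins per query, while the snowflake layout avoids repeating the category name on every product row.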
Best Practice & Governance
Is it a science?
• The Science Council's definition of science:
Science is the pursuit and application of knowledge and understanding of the natural and social world following a systematic methodology based on evidence ...
Worst Practice & Bad Governance
• Reinhart & Rogoff Scandal 2013
– NBER 2010 Paper ‘Exceeding the 90% debt-to-GDP threshold slows economic growth’
• Criticised by Herndon, Ash & Pollin; discovered fundamental coding errors & missing data
– significant change in real average
• Potti Scandal 2010 (Duke University)
– “Abundantly clear” that there was “manipulated data” behind the published research
• Investment Ponzi Scandal 2009 (Madoff Collapse), Subprime Mortgage Scandal 2008 (Lehman Bros Collapse), Accounting Frauds 2001 (Enron & WorldCom Collapse) etc.
• Daily Telegraph – Climategate Blog – Climate Change worst ever scientific scandal
There are lies, damned lies and then, there are statistics!
Reproducibility
• Can the results be reproduced?
• What are the challenges?
– Data is not static; its meaning and value are fluid
– Analytics is often based on a ‘moment-in-time’
– Can that ‘moment’ ever truly be reproduced?
– Based on assumptions considered valid at the moment-in-time
– Assumption Validations Must Be Robust!
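One way to make a ‘moment-in-time’ analysis repeatable is to store, alongside the result, a fingerprint of the data snapshot, the random seed and the assumptions in force. The sketch below is an illustration under stated assumptions, not a prescribed method: the helper names `fingerprint` and `run_analysis`, the sample values and the listed assumptions are all hypothetical, and a bootstrap of the mean stands in for whatever analysis was actually run.

```python
import hashlib
import json
import random


def fingerprint(records):
    """Return a SHA-256 digest of the data snapshot used for the analysis."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_analysis(records, seed):
    """Bootstrap the mean using a private, seeded random generator."""
    rng = random.Random(seed)  # seeded so the resampling can be repeated
    means = []
    for _ in range(200):
        resample = [rng.choice(records) for _ in records]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


# Hypothetical snapshot of the data as it stood at the 'moment-in-time'.
snapshot = [12.0, 15.5, 9.25, 14.0, 11.75]

# Everything needed to repeat and audit the run is stored with the result.
run_record = {
    "data_sha256": fingerprint(snapshot),
    "seed": 47405,
    "assumptions": ["values share one unit", "no missing readings"],
    "result": run_analysis(snapshot, seed=47405),
}

# A later audit repeats the run from the stored snapshot and seed.
same_data = fingerprint(snapshot) == run_record["data_sha256"]
rerun = run_analysis(snapshot, seed=run_record["seed"])
print(same_data, rerun == run_record["result"])
```

If the underlying data has changed since the original run, the fingerprint comparison fails first, which separates "the data moved" from "the method cannot be reproduced".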
The Scientific Method
• State your assumptions, articulate your question, establish your hypothesis.
• Document the steps you will take to validate, analyse and test your assumptions, questions and hypothesis.
• Extrapolate your conclusions: what results are you expecting to discover? What might or might not happen? Why?
• Document the actual results and your observations. Did they differ from your expectations? Why?
• Were your results peer reviewed? Were they reproducible? Auditable?
• Where is the actual data you used?
You may be a Data Analyst, but are you a Data Scientist?
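As a worked illustration of these steps, the sketch below applies them to a coin-flip experiment: the hypothesis, test and expectation are written down first, the data are generated with a fixed seed, and the outcome is recorded against the expectation. The `plan` fields, the helper `binomial_two_sided_p`, the 0.05 threshold and the simulated coin with P(heads) = 0.8 are assumptions made for this example only.

```python
import math
import random


def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided p-value: the total probability of every outcome
    that is no more likely than the observed one when P(heads) = p."""

    def pmf(k):
        return math.comb(flips, k) * p**k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # The small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))


# 1. Written down BEFORE looking at any data.
plan = {
    "hypothesis": "the coin is fair: P(heads) = 0.5",
    "test": "exact two-sided binomial test",
    "alpha": 0.05,
    "expectation": "a fair coin should not be rejected",
}

# 2. Collect the data. Here it is simulated with a seeded generator, and the
#    coin is secretly biased so that the expectation is contradicted.
rng = random.Random(47405)
flips = 500
heads = sum(rng.random() < 0.8 for _ in range(flips))

# 3. Record the actual result and compare it with the expectation.
p_value = binomial_two_sided_p(heads, flips)
rejected = p_value < plan["alpha"]
report = {
    "plan": plan,
    "seed": 47405,
    "heads": heads,
    "flips": flips,
    "p_value": p_value,
    "hypothesis_rejected": rejected,
    "matches_expectation": not rejected,
}
print(report)
```

Keeping the plan, seed and raw counts in the same record as the conclusion is what lets a reviewer rerun the analysis and audit it later.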