Post on 14-Apr-2017
Data Science?! What even...
David Coallier (@davidcoallier)
Data Scientist, Engine Yard
And I cook... a lot.
(n-1) items
Adapting.
Feedback.
Indifference.
Young mathematically inclined minds
We knew everything.
First Bad Assumption.
So we asked “experts”.
Bad Ingredients
Bad Data
Tasted like sh*t
From Our Results: We had questions.
Found Expertise: Not Online.
Data Scientific Method
Find a Question: Your Hypothesis
Current Data: What do you have?
Features & Tests: Try it.
Analyse Results: Won’t be pretty.
Conversation: Framed. By. Data.
But....
Good Discussions: Imply good data scientists
[Figure: Venn diagram of three circles labelled Hacking Skills, Maths & Stats, and Expertise. The pairwise overlaps are labelled Machine Learning, Research, and Danger Zone!!!; the centre, where all three meet, is labelled Data Science.]
Business: Don’t need an MBA
In other words.
1. Hacking
2. Maths & Stats
3. Expertise
Apply Method: Data Scientific
1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse
Find a Question: Let’s imagine GitHub
Upgrade Repos: Affect users as little as possible
import csv
with open('repo1.csv', newline='') as f:
    content = list(csv.reader(f))  # each row becomes a list of strings
$f(k;\lambda) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$ for $k \ge 0$
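To make the Poisson step concrete, here is a minimal sketch of how the formula could be applied to the repository question. The hourly commit counts are made-up numbers for illustration (the contents of repo1.csv are not shown in the slides), and `poisson_pmf` is a hypothetical helper name. The idea is to estimate λ from observed counts, then ask how likely a quiet hour is.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical commit counts per hour for one repository (made-up numbers).
commits_per_hour = [0, 2, 1, 0, 3, 1, 0, 1]

# The maximum-likelihood estimate of lambda is the sample mean.
lam = sum(commits_per_hour) / len(commits_per_hour)

# Probability that an hour sees no commits at all, i.e. a quiet upgrade window.
p_quiet = poisson_pmf(0, lam)
print(f"lambda = {lam:.2f}, P(no commits in an hour) = {p_quiet:.3f}")
```

Whether commits are the right signal at all is a separate question, which is exactly what the next slides challenge.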
Converse: Present Findings
Iterate: Commits aren’t key.
KPIs are key: Indicators from experience
Questions: Super Important.
Just test it...
We are Human: Emotional Connection
What next? Second Hypothesis.
Focus on Data: Relevant to your KPIs.
Data gives you the what
Humans give you the why
Turn Information Into Actionable Insight
Create Discussions: Introspection Engines
Seeing, Feeling it: The brain sees.
Not regressions
Not p-values
Not slopes
Not F-statistics
Not coefficients
Another Example: Fraud Engine
Features: Fraud Engine
Clusters: User Types
Machine Learning: Historical Analysis
Decision: Report as Fraudulent
Fact-Based Decision Failing
Fact-Based Decision Making
[Diagram: Measure, Analysis, Knowledge, Action]
Failed: Noetic Intelligence
[Diagram: Measure, Analysis, Knowledge, Action]
Offering: Missing Feature
Toolbox: What do we use?
R: Modeling, Testing, Prototyping
RStudio: The IDE
lubridate and zoo: Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
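The mix of formats above is why date handling gets its own tools. As a rough illustration in Python rather than R (lubridate and zoo are the tools the slide names), here is a minimal sketch that tries an epoch timestamp first and then a list of candidate layouts. `parse_timestamp` and `FORMATS` are hypothetical names, and the order of `FORMATS` is an assumption that decides how ambiguous inputs such as 01/02/03 are read.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order. The order is an assumption: an input
# like "01/02/03" matches both yy/mm/dd and mm/dd/yy, and the first one wins.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%y/%m/%d", "%m/%d/%y", "%y-%m-%d"]

def parse_timestamp(raw):
    """Return a datetime for the first layout that matches, else None."""
    # Epoch seconds such as "1363784094.513425" are treated as UTC.
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    # Textual layouts produce naive datetimes (no timezone attached).
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None

for raw in ["13/03/20", "03/20/13", "2013-03-20 12:54:54", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw))
```

Note that the epoch branch returns a timezone-aware value while the others are naive, so a real pipeline would still need to decide which timezone the textual inputs belong to.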
reshape2: Reshape your Data
ggplot2: Visualise your Data
RCurl, RJSONIO: Find more Data
Hmisc: Miscellaneous useful functions
forecast: Can you guess?
garch: Generalized Autoregressive Conditional Heteroskedasticity
quantmod: Statistical Financial Trading
library(quantmod)
getSymbols('AAPL')
barChart(AAPL)
addMACD()
xts: Extensible Time Series
igraph: Study Networks
maptools: Read & View Maps
library(maps)  # provides map()
map('state',
    region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = TRUE)
Python: Scientific Computing
scipy.stats
scipy.stats: Descriptive Statistics
from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
print(describe(s))
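For readers without SciPy at hand, the core of that summary can be cross-checked with the standard library. This is a minimal sketch, not a replacement for describe: it leaves out the skewness and kurtosis that describe also reports, and the dictionary keys are simply chosen to mirror describe's field names.

```python
import statistics

s = [1, 2, 1, 3, 4, 5]

summary = {
    "nobs": len(s),
    "minmax": (min(s), max(s)),
    "mean": statistics.mean(s),
    # Sample variance (n - 1 denominator), which is also describe's default.
    "variance": statistics.variance(s),
}
print(summary)
```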
scipy.stats: Probability Distributions
Example: Poisson Distribution
$f(k;\lambda) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$ for $k \ge 0$
from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)  # pmf at each k, with lambda = 2
print(p.mean())
print(p.sum())
...
NumPy: Linear Algebra
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
import numpy as np
x = np.array([[1, 0], [0, 1]])
val, vec = np.linalg.eig(x)  # eigenvalues first, then eigenvectors
np.linalg.eigvals(x)         # eigenvalues only
>>> np.linalg.eig(x)
(array([ 1.,  1.]), array([[ 1.,  0.], [ 0.,  1.]]))
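The identity matrix is a degenerate case, since every vector is an eigenvector. A minimal sketch with a less trivial matrix shows what the two return values mean; the matrix `A` is chosen here for illustration and is not from the slides. Each column v of the second return value should satisfy A v = λ v for the matching eigenvalue λ.

```python
import numpy as np

# A small symmetric matrix chosen for illustration (not from the slides).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns the eigenvalues first, then a matrix whose columns are
# the corresponding eigenvectors.
vals, vecs = np.linalg.eig(A)

# Check the defining property A v = lambda v for every eigenpair.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(vals))
```

The order in which eig returns the eigenvalues is not guaranteed, which is why the sketch sorts them before printing.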
Matplotlib: Python Plotting
statsmodels: Advanced Statistical Modeling
NLTK: Natural Language Toolkit
scikit-learn: Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
PyBrain... Machine Learning
PyMC: Bayesian Inference
Pattern: Web Mining for Python
NetworkX: Study Networks
MILK: Machine Learning
Pandas: easy-to-use data structures
from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
Python vs R? Different Purposes
Storage
Oppose “big” Data
Hadoop
Had - oops
Riak: Key-Value Buckets
CouchDB: Document Database
Redis: In-Memory Database
Cube: Time-series Database
PgSQL: Quite Extensively
Visualisation
Right Now: The rule of 3
Engineer: Report One
Mid-Level Mgr: Report Two
Board Level: Report Three
The Future: Discoverable Insight
d3.js: Data-Driven Documents
Dashing: Elegant Dashboards
Edward Tufte: Go read his books.
Dogfooding: Data Scientific Method
Original Question: What is Data Science?
Back to you: For questioning