Fairy tale from the land of data
Date posted: 18-Oct-2014
Category: Data & Analytics
Fairy tales in the land of data
Or: do I know what I’m doing?
By @przemur from
http://about.me/przemek.maciolek
A story
http://yamao.deviantart.com/art/Cleric-comm-343786321 https://www.flickr.com/photos/jsjgeology/8359854092/
Suspense
<?
“The hammers from the new
provider are no good, sayr.”
What would you do?
New hammers since this month
install.packages('ggplot2')  # one-time install
library(ggplot2)
setwd("/Users/pmm/Desktop/hammer")
all <- read.csv(file="all.csv")

# Number of dwarfs over time
qplot(all$month_sequence, all$dwarfs) + geom_smooth()
# Total production over time
qplot(all$month_sequence, all$production) + geom_smooth()

# Average production per dwarf
all$prod_per_dwarf <- all$production / all$dwarfs
qplot(all$month_sequence, all$prod_per_dwarf) + geom_smooth()
Number of dwarfs working in the mine
The hammers from the new provider started being
distributed to the new miners.
Total production of gold
Per-dwarf average production
Who sees any problem?
Let's look at the production of each dwarf, relative to the time each one started…
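The files loaded later already contain a relative_month column; how it was built is not shown. A minimal sketch of one way to derive it, assuming a hypothetical long table with one row per dwarf per month (the column names dwarf_id, month_sequence and production, and the toy values, are illustrative and not taken from the original data):

```r
# Hypothetical long-format data: one row per dwarf per calendar month.
# Column names and values are assumptions made for this sketch.
monthly <- data.frame(
  dwarf_id       = c("a", "a", "a", "b", "b"),
  month_sequence = c(1, 2, 3, 3, 4),
  production     = c(10, 9, 8, 10, 9)
)

# First month in which each dwarf appears, repeated on every row of that dwarf.
first_month <- ave(monthly$month_sequence, monthly$dwarf_id, FUN = min)

# Months elapsed since the dwarf's first recorded month (0 = first month).
monthly$relative_month <- monthly$month_sequence - first_month
```

Whether the original data counts the first month as 0 or 1 is unknown; the sketch uses 0.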
Dwarfs which are using the OLD hammer design
Dwarfs which are using the NEW hammer design
new <- read.csv(file="new_relative.csv")
old <- read.csv(file="old_relative.csv")

qplot(new$relative_month, new$production)
ggplot(new, aes(x=relative_month, y=production)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)

# This will look much better!
old$type <- 'old'
new$type <- 'new'
old_and_new <- rbind(old, new)
ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)
Scatterplot showing relative production done using old and new hammers
What now?
ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.1) +
  geom_smooth(method=lm)
The new hammers wear much faster!
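The two fitted lines make the difference visible; one way to put a number on it is a single linear model with an interaction term, whose coefficient estimates how much steeper the decline is for the new hammers. The sketch below is not part of the original analysis and runs on synthetic data (the slopes, noise level and sample size are invented), so only the technique carries over:

```r
set.seed(42)  # reproducible synthetic data, NOT the dwarfs' real numbers

# Hypothetical design: 20 dwarfs per hammer type, observed for months 0..9.
sim <- expand.grid(relative_month = 0:9,
                   replicate      = 1:20,
                   type           = c("old", "new"),
                   stringsAsFactors = FALSE)
# Make "old" the reference level so coefficients read as "new versus old".
sim$type <- factor(sim$type, levels = c("old", "new"))

# Assumed true slopes: production falls faster with the new hammer.
slope <- ifelse(sim$type == "new", -0.8, -0.5)
sim$production <- 20 + slope * sim$relative_month + rnorm(nrow(sim), sd = 1)

# relative_month * type expands to both main effects plus their interaction.
fit <- lm(production ~ relative_month * type, data = sim)

# Estimate, standard error, t value and p value for the slope difference.
slope_diff <- summary(fit)$coefficients["relative_month:typenew", ]
print(slope_diff)
```

On the real data the same formula could be applied to old_and_new; a clearly negative relative_month:typenew estimate would support the claim that the new hammers wear faster.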
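How much did the dwarfs lose?
# Fit the old-hammer trend, then predict what the new-hammer dwarfs
# would have produced had they used old hammers
old_m <- lm(production ~ relative_month, old)
new$possible_production <- predict(old_m, new)
# Absolute loss
sum(new$possible_production) - sum(new$production)
# Loss relative to actual production
(sum(new$possible_production) - sum(new$production)) / sum(new$production)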
0.5%
Now, taking into account the price of a hammer, one can select the optimal strategy… but that’s another story…
Lessons learned …?
• Don’t trust the data blindly, ask questions
• Try to understand the underlying rules of the system
• Don’t be shy about trying various models
• If using R, go for ggplot2