Fairy tale from the land of data

Fairy tales in the land of data Or - do I know what I’m doing? By @przemur from http://about.me/przemek.maciolek

description

A fairy tale about falling into a trap of wrong interpretation of results. Shows the importance of building models and understanding them.

Transcript of Fairy tale from the land of data

Page 1: Fairy tale from the land of data

Fairy tales in the land of data
Or - do I know what I’m doing?

By @przemur from http://about.me/przemek.maciolek

Page 2: Fairy tale from the land of data

A story

Page 3: Fairy tale from the land of data

http://yamao.deviantart.com/art/Cleric-comm-343786321
https://www.flickr.com/photos/jsjgeology/8359854092/

Page 4: Fairy tale from the land of data
Page 5: Fairy tale from the land of data
Page 6: Fairy tale from the land of data
Page 7: Fairy tale from the land of data

Suspense

Page 8: Fairy tale from the land of data

“The hammers from the new provider are no good, sire.”

Page 9: Fairy tale from the land of data

What would you do?

Page 10: Fairy tale from the land of data

New hammers since this month

Page 11: Fairy tale from the land of data

install.packages('ggplot2')
library(ggplot2)
setwd("/Users/pmm/Desktop/hammer")
all <- read.csv(file="all.csv")

# Number of dwarfs and total production, month by month
qplot(all$month_sequence, all$dwarfs) + geom_smooth()
qplot(all$month_sequence, all$production) + geom_smooth()

# Average production per dwarf
all$prod_per_dwarf <- all$production / all$dwarfs
qplot(all$month_sequence, all$prod_per_dwarf) + geom_smooth()

Page 12: Fairy tale from the land of data

Number of dwarfs working in the mine

The hammers from the new provider started being distributed to the new miners.

Page 13: Fairy tale from the land of data

Total production of gold

Page 14: Fairy tale from the land of data

Per-dwarf average production

Page 15: Fairy tale from the land of data

Who sees any problem?

Page 16: Fairy tale from the land of data

Let’s look at the production of each dwarf, relative to the time that dwarf started working…

Dwarfs which are using the OLD hammer design

Dwarfs which are using the NEW hammer design
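The slides do not show how relative_month was derived, so here is a minimal sketch of one way to do it, assuming a per-dwarf monthly log. The data frame mine_log, its column names, and the convention that the first month counts as 0 are all assumptions made for illustration.

```r
# Hypothetical per-dwarf log; names and numbers are invented for illustration.
mine_log <- data.frame(
  dwarf      = c("A", "A", "A", "B", "B"),
  month      = c(1, 2, 3, 3, 4),          # calendar month of the measurement
  production = c(10.0, 9.9, 9.8, 10.0, 9.5)
)

# First month in which each dwarf appears, repeated on every row of that dwarf
start_month <- ave(mine_log$month, mine_log$dwarf, FUN = min)

# Months elapsed since the dwarf started (0 = first month on the job)
mine_log$relative_month <- mine_log$month - start_month
print(mine_log)
```

Aligning every dwarf on the same "months since start" axis is what lets old and new hammers be compared at the same age, regardless of when each dwarf joined.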

Page 17: Fairy tale from the land of data

new <- read.csv(file="new_relative.csv")
old <- read.csv(file="old_relative.csv")

qplot(new$relative_month, new$production)
ggplot(new, aes(x=relative_month, y=production)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)

# This will look much better!
old$type <- 'old'
new$type <- 'new'
old_and_new <- rbind(old, new)
ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)

Page 18: Fairy tale from the land of data

Scatterplot showing relative production done using old and new hammers

Page 19: Fairy tale from the land of data

What now?

Page 20: Fairy tale from the land of data

ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.1) +
  geom_smooth(method=lm)

The new hammers wear out much faster!
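The plot makes the point visually. To put a number on "much faster", one option is to read off the slope of each fitted line. The sketch below uses synthetic data in place of old_relative.csv and new_relative.csv; the wear rates (0.05 and 0.15 per month) and the noise level are invented, not taken from the talk.

```r
# Synthetic stand-in for old_and_new; all numbers here are made up.
set.seed(1)
months <- rep(0:11, times = 20)
old <- data.frame(relative_month = months, type = "old",
                  production = 10 - 0.05 * months + rnorm(length(months), sd = 0.2))
new <- data.frame(relative_month = months, type = "new",
                  production = 10 - 0.15 * months + rnorm(length(months), sd = 0.2))
old_and_new <- rbind(old, new)

# One straight-line fit per hammer type, like the lines geom_smooth(method=lm)
# draws when the points are coloured by type
slopes <- sapply(split(old_and_new, old_and_new$type), function(d) {
  coef(lm(production ~ relative_month, data = d))[["relative_month"]]
})
print(slopes)
```

A more negative slope means production drops faster with each month of use. On real data it would also be worth looking at the uncertainty of each slope, for example with summary() on the fitted models.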

Page 21: Fairy tale from the land of data

How much did the dwarfs lose?

Page 22: Fairy tale from the land of data

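# Fit the old-hammer trend, then predict what the dwarfs with new hammers
# would have produced if their hammers wore like the old ones
old_m <- lm(production ~ relative_month, old)
new$possible_production <- predict(old_m, new)

# Lost production: absolute, and relative to what was actually produced
sum(new$possible_production) - sum(new$production)
(sum(new$possible_production) - sum(new$production))/sum(new$production)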

0.5%
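To see how this calculation works, here is a small worked sketch on invented, noise-free numbers. It assumes old hammers lose 1 unit of production per month and new ones lose 2; those rates and the starting value of 100 are illustrative assumptions, not the data behind the slide.

```r
# Invented toy data: old hammers lose 1 unit per month, new ones lose 2.
old <- data.frame(relative_month = 0:11, production = 100 - 1 * (0:11))
new <- data.frame(relative_month = 0:5,  production = 100 - 2 * (0:5))

# Model of how production decays with an old hammer
old_m <- lm(production ~ relative_month, data = old)

# What the new-hammer dwarfs would have produced with old hammers
new$possible_production <- predict(old_m, newdata = new)

lost       <- sum(new$possible_production) - sum(new$production)
lost_share <- lost / sum(new$production)
print(c(lost = lost, lost_share = lost_share))
```

With these made-up figures the shortfall is 15 units, about 2.6% of what was actually mined. The 0.5% above comes from the real data and is not reproduced here. The estimate also rests on the assumption that dwarfs given new hammers would otherwise have performed like the dwarfs with old ones.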

Now, taking into account the price of a hammer, one can select the optimal strategy… but that’s another story…

Page 23: Fairy tale from the land of data

Lessons learned …?

• Don’t trust the data blindly, ask questions

• Try to understand the underlying rules of the system

• Don’t be shy about trying various models

• If using R, go for ggplot2