Data Science: The Main Course @ KCDC 2016

DATA SCIENCE: THE MAIN COURSE. I Can Science Data, and So Can You! Arthur Doler @arthurdoler [email protected]

Transcript of Data Science: The Main Course @ KCDC 2016

Page 1: Data Science: The Main Course @ KCDC 2016

DATA SCIENCE: THE MAIN COURSE

I Can Science Data, and So Can You!

Arthur Doler @arthurdoler [email protected]

Page 2: Data Science: The Main Course @ KCDC 2016

TITANIUM SPONSORS

Platinum Sponsors

Gold Sponsors

Page 3: Data Science: The Main Course @ KCDC 2016

HOW MANY APPETIZERS HAVE YOU EATEN?

Page 4: Data Science: The Main Course @ KCDC 2016

Sources: Mediawiki, Publicdomainpictures.net

Page 5: Data Science: The Main Course @ KCDC 2016

SO WE’RE SKIPPING RIGHT TO THE MAIN COURSE

YOU HAVE THE DATA

YOU HAVE THE POWER

Page 6: Data Science: The Main Course @ KCDC 2016

Sources: Mattel, he-manreviewed.net

Page 7: Data Science: The Main Course @ KCDC 2016

WHAT’S FOR DINNER

Page 8: Data Science: The Main Course @ KCDC 2016

Picking your problem

Using Knitr/R Markdown

Building a linear predictor

Making a predictive, repeatable document

Page 9: Data Science: The Main Course @ KCDC 2016

WHAT’S NOT FOR DINNER

Page 10: Data Science: The Main Course @ KCDC 2016

Learning R

Exhaustive discussion of statistics

Exhaustive discussion of regression modeling

Ways to run R in production

Page 11: Data Science: The Main Course @ KCDC 2016

STEP 0: KNOW YOUR RECIPE FOR REPEATABILITY

Learn to Knit you some R

Page 12: Data Science: The Main Course @ KCDC 2016

knitr ≈ Sweave + cacheSweave + pgfSweave + weaver + animation::saveLatex +

R2HTML::RweaveHTML + highlight::HighlightWeaveLatex + 0.2 * brew + 0.1 *

SweaveListingUtils + more

Page 13: Data Science: The Main Course @ KCDC 2016

Source: Reddit

Page 14: Data Science: The Main Course @ KCDC 2016

R Code

Markup

R Code

Markup

Markup

Page 15: Data Science: The Main Course @ KCDC 2016

WHAT?! WHY IS THIS A GOOD IDEA?

Page 16: Data Science: The Main Course @ KCDC 2016

Do you love me?

Y / N

Page 17: Data Science: The Main Course @ KCDC 2016

LET’S GO FIND THAT RECIPE!

Source: Reddit

Page 18: Data Science: The Main Course @ KCDC 2016

STEP 1: SHOP FOR YOUR INGREDIENTS

Finding the question to ask

Page 19: Data Science: The Main Course @ KCDC 2016

WHAT ARE YOU TRYING TO DO?

Page 20: Data Science: The Main Course @ KCDC 2016

Finding or proving a correlation

Looking for outliers

Building a predictive model

Page 21: Data Science: The Main Course @ KCDC 2016

LET’S BUILD A LINEAR PREDICTIVE MODEL

Page 22: Data Science: The Main Course @ KCDC 2016
Page 23: Data Science: The Main Course @ KCDC 2016

Source: Wikipedia

Page 24: Data Science: The Main Course @ KCDC 2016

WHAT ARE YOUR VARIABLES?

Page 25: Data Science: The Main Course @ KCDC 2016

• Material Category
• Material ID
• Time-to-Incapacitation
• 1000 / Time-to-Incapacitation
• Carbon Monoxide
• Hydrogen Cyanide
• Hydrogen Sulfide
• Hydrochloric Acid
• Hydrobromic Acid
• Nitrogen Dioxide
• Sulfur Dioxide

Page 26: Data Science: The Main Course @ KCDC 2016

WHAT DO YOU CARE ABOUT?

Page 27: Data Science: The Main Course @ KCDC 2016
Page 28: Data Science: The Main Course @ KCDC 2016

FORMULATE YOUR QUESTION

Page 29: Data Science: The Main Course @ KCDC 2016

LET’S HEAD TO THE STORE!

Source: Reddit

Page 30: Data Science: The Main Course @ KCDC 2016

STEP 2: GET YOUR MISE EN PLACE

Dividing your data

Page 31: Data Science: The Main Course @ KCDC 2016

WHERE IS THE VALUE IN A PREDICTIVE MODEL?

Page 32: Data Science: The Main Course @ KCDC 2016

WE BUILD OUR MODEL WITH A TRAINING SET

Page 33: Data Science: The Main Course @ KCDC 2016

PARTITIONING YOUR DATA PREVENTS OVERTRAINING

Page 34: Data Science: The Main Course @ KCDC 2016

²⁄³ Training, ¹⁄³ Test

½ Training, ¼ Test, ¼ Validation
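The two partitioning schemes on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the R used in the talk; the helper name `split_rows`, the fixed seed, and the 12 stand-in rows are all assumptions made for the example, not something from the slides.

```python
import random

def split_rows(rows, fractions, seed=0):
    """Shuffle rows, then cut them into consecutive parts sized by `fractions`."""
    shuffled = list(rows)
    # A fixed seed keeps the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        if i == len(fractions) - 1:
            # The last part takes whatever remains, so rounding never drops a row.
            end = len(shuffled)
        else:
            end = start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

rows = list(range(12))  # hypothetical stand-in for 12 observations

# Two-way split: 2/3 training, 1/3 test.
train, test = split_rows(rows, [2 / 3, 1 / 3])

# Three-way split: 1/2 training, 1/4 test, 1/4 validation.
train3, test3, validation3 = split_rows(rows, [1 / 2, 1 / 4, 1 / 4])

print(len(train), len(test))
print(len(train3), len(test3), len(validation3))
```

Shuffling before cutting matters when the rows arrive in a meaningful order (for example, grouped by material), since an unshuffled cut could put whole groups in only one partition.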

Page 35: Data Science: The Main Course @ KCDC 2016

LET’S MEASURE EVERYTHING OUT!

Source: Reddit

Page 36: Data Science: The Main Course @ KCDC 2016

STEP 3: COOK UP YOUR PREDICTOR

Training your model
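As a rough sketch of what "training" means for a linear predictor: fit an intercept and slope on the training rows by ordinary least squares, then measure the error on rows the model never saw. The code is Python rather than the talk's R, uses a single predictor for brevity, and every number in it is made up for illustration; `fit_line` and `rmse` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(xs, ys, intercept, slope):
    """Root-mean-square error of the fitted line on the given points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5

# Made-up illustrative data, not taken from the talk's data set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.1, 4.9, 7.2, 8.8]
test_x = [5.0, 6.0]
test_y = [11.1, 12.9]

# Fit on the training set only.
intercept, slope = fit_line(train_x, train_y)

# Judge the model on the held-out test set.
test_error = rmse(test_x, test_y, intercept, slope)
print(intercept, slope, test_error)
```

Scoring on the held-out rows is what lets the partition catch overtraining: a model that has memorized the training set will look good there and poor on the test set.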

Page 37: Data Science: The Main Course @ KCDC 2016

ONE WARNING FIRST

DO YOU NEED TO UNDERSTAND YOUR PREDICTOR?

Page 38: Data Science: The Main Course @ KCDC 2016

LET’S GO COOK UP THE MODEL!

Source: Reddit

Page 39: Data Science: The Main Course @ KCDC 2016

WHY DID 1000/TIME_TO_INCAPACITATION WORK BETTER?
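The slide asks the question without recording an answer. One common reason a reciprocal helps, offered here as an assumption and not as the talk's stated explanation: if time falls in roughly inverse proportion to a predictor, then time plotted against that predictor is a curve, while 1000/time is a straight line that a linear model can fit. The sketch below uses Python rather than the talk's R, and the `dose` values and the constant 20 are synthetic.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Synthetic data: time is exactly inversely proportional to a hypothetical dose.
dose = [1.0, 2.0, 4.0, 5.0, 10.0]
time_to_incap = [20.0 / d for d in dose]

# The transformed response, matching the 1000 / Time-to-Incapacitation column.
reciprocal = [1000.0 / t for t in time_to_incap]

r_raw = pearson(dose, time_to_incap)   # curved relationship, weaker linear fit
r_reciprocal = pearson(dose, reciprocal)  # straight line in this synthetic case

print(r_raw, r_reciprocal)
```

Whether the real measurements behave this way is not something the slides state; the example only shows why such a transform can make a linear model fit better.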

Page 40: Data Science: The Main Course @ KCDC 2016

STEP 3A: TRIM THE FAT

Eliminating Outliers
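The slides do not say which outlier rule the talk used. As one possible sketch, the code below applies Tukey's fences: anything more than 1.5 times the interquartile range outside the quartiles is set aside. It is Python rather than the talk's R, and both the helper name `split_outliers` and the sample values are made up.

```python
import statistics

def split_outliers(values, k=1.5):
    """Return (kept, outliers) using Tukey's fences with multiplier k."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Made-up measurements with one obviously extreme value.
measurements = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 12.0]

kept, outliers = split_outliers(measurements)
print(kept)
print(outliers)
```

Dropping points changes the model, so it is worth recording which rows were removed and why; in a repeatable report that record comes for free because the trimming step is part of the document.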

Page 41: Data Science: The Main Course @ KCDC 2016

LET’S GO CUT!

Source: Reddit

Page 42: Data Science: The Main Course @ KCDC 2016

STEP 4: GARNISH WITH GRAPHICS

Adding visualizations to your report

Page 43: Data Science: The Main Course @ KCDC 2016

plot ggplot2

Source: Wikimedia

Page 44: Data Science: The Main Course @ KCDC 2016

LET’S FINISH UP THAT REPORT!

Source: Reddit

Page 45: Data Science: The Main Course @ KCDC 2016

1. Know Your Recipe for Repeatability
2. Shop for Your Column Ingredients
3. Get Your Data Divided
4. Cook Up Your Predictor
   1. Trim the Outlier Fat
5. Garnish with Graphics

Page 46: Data Science: The Main Course @ KCDC 2016

QUESTIONS?

Source: Reddit