Feature Engineering Studio

Feature Engineering Studio September 23, 2013



Transcript of Feature Engineering Studio

Page 1: Feature Engineering Studio

Feature Engineering Studio

September 23, 2013

Page 2: Feature Engineering Studio

Welcome to Mucking Around Day

Page 3: Feature Engineering Studio

Sort into pairs

• Partner with the person next to you

• One group of 3 is allowed

Page 4: Feature Engineering Studio

Sort into pairs

• Do we have a group of 3?

• One of the 3 will work with me

Page 5: Feature Engineering Studio

Sort into pairs

• Go over your reports together

– A maximum of 5 minutes apiece

Page 6: Feature Engineering Studio

5 minutes for first person

Page 7: Feature Engineering Studio

5 minutes for second person

Page 8: Feature Engineering Studio

Re-assemble into one big group

Page 9: Feature Engineering Studio

Who here found something really cool while mucking around?

• Show us, tell us

Page 10: Feature Engineering Studio

Who here found a histogram with a normal distribution?

• Show us, tell us

Page 11: Feature Engineering Studio

Who here found a histogram with a hypermode?

• Show us, tell us

Page 12: Feature Engineering Studio

Who here found a histogram with a flat distribution?

• Show us, tell us

Page 13: Feature Engineering Studio

Who here found a histogram with a skewed distribution?

• Show us, tell us

Page 14: Feature Engineering Studio

Who here found a histogram with a bimodal distribution?

• Show us, tell us

Page 15: Feature Engineering Studio

Who here found a histogram with something else interesting?

• Show us, tell us

Page 16: Feature Engineering Studio

Who here found something surprising with their min, max, average, stdev?

Page 17: Feature Engineering Studio

Categorical variables

• Who here found something curious, weird, or interesting in the distribution of their categorical variables?

Page 18: Feature Engineering Studio

Who here hasn’t spoken yet? (and analyzed data)

• Tell us something interesting you found in your data

Page 19: Feature Engineering Studio

Who here played with pivot tables?

• What did you learn?

Page 20: Feature Engineering Studio

My turn to play with pivot tables

• Who wants to volunteer their data?

• (I might request a 2nd or 3rd data set, depending on how the 1st one goes)

Page 21: Feature Engineering Studio

Who here played with vlookup?

• What did you learn?

Page 22: Feature Engineering Studio

My turn to play with vlookup

• Using the same volunteered data set(s)

Page 23: Feature Engineering Studio

Other cool things you can create with a few simple formulas (plus demos!)

Page 24: Feature Engineering Studio

Identifying specific cases of interest

Page 25: Feature Engineering Studio

Did event of interest ever occur for student?
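Outside Excel, the "did it ever occur" feature can be sketched in a few lines of Python. This is only an illustration: the log layout (a list of rows with hypothetical `student` and `action` fields) is an assumption, not the format of any data set used in class.

```python
def ever_occurred(log, event):
    """Map each student to True if `event` appears in any of their rows."""
    flags = {}
    for row in log:
        hit = row["action"] == event
        flags[row["student"]] = flags.get(row["student"], False) or hit
    return flags

# Hypothetical action log: one row per student action.
log = [
    {"student": "s1", "action": "attempt"},
    {"student": "s1", "action": "help"},
    {"student": "s2", "action": "attempt"},
]
flags = ever_occurred(log, "help")
print(flags)
```

Students who never triggered the event still get an entry (False), which keeps the feature defined for every student.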

Page 26: Feature Engineering Studio

Counts-so-far (and total value for student)
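A rough Python sketch of a counts-so-far column plus the per-student total. It assumes the log is already sorted in time order and uses hypothetical `student` and `action` fields; the count here includes the current row, which is a design choice rather than a rule.

```python
from collections import defaultdict

def counts_so_far(log, event):
    """Return (running count per row, total count per student)."""
    running = defaultdict(int)
    so_far = []
    for row in log:  # assumes rows are in chronological order
        if row["action"] == event:
            running[row["student"]] += 1
        # Count up to and including the current row.
        so_far.append(running[row["student"]])
    return so_far, dict(running)

# Hypothetical action log.
log = [
    {"student": "s1", "action": "help"},
    {"student": "s1", "action": "attempt"},
    {"student": "s2", "action": "attempt"},
    {"student": "s1", "action": "help"},
]
so_far, totals = counts_so_far(log, "help")
print(so_far, totals)
```

To exclude the current action from its own count, append the running value before incrementing it.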

Page 27: Feature Engineering Studio

Counts-last-N-actions
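One possible way to compute counts over the last N actions, sketched in Python with a fixed-size window per student. The field names (`student`, `action`) are hypothetical, and the window is assumed to include the current action.

```python
from collections import defaultdict, deque

def counts_last_n(log, event, n):
    """For each row, count how often `event` occurred in that student's
    last n actions (current action included)."""
    windows = defaultdict(lambda: deque(maxlen=n))
    out = []
    for row in log:  # assumes rows are in chronological order
        window = windows[row["student"]]
        window.append(row["action"] == event)  # oldest entry drops off
        out.append(sum(window))
    return out

# Hypothetical log for a single student.
actions = ["help", "help", "attempt", "attempt", "help"]
log = [{"student": "s1", "action": a} for a in actions]
last3 = counts_last_n(log, "help", 3)
print(last3)
```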

Page 28: Feature Engineering Studio

First attempts
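A minimal sketch of flagging first attempts, assuming "first attempt" means the first row a student produces on a given problem step. The `student` and `step` fields are hypothetical names.

```python
def mark_first_attempts(log):
    """Return one flag per row: True if this is the student's first
    row on that step, False otherwise."""
    seen = set()
    flags = []
    for row in log:  # assumes rows are in chronological order
        key = (row["student"], row["step"])
        flags.append(key not in seen)
        seen.add(key)
    return flags

# Hypothetical log: s1 retries step p1 before moving on.
log = [
    {"student": "s1", "step": "p1"},
    {"student": "s1", "step": "p1"},
    {"student": "s1", "step": "p2"},
    {"student": "s2", "step": "p1"},
]
first = mark_first_attempts(log)
print(first)
```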

Page 29: Feature Engineering Studio

Ratios between events of interest
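A ratio feature might be sketched as follows, here help requests per attempt for each student. Field names are hypothetical, and returning None for a zero denominator is one choice among several (a blank cell in Excel would be the analogue).

```python
def event_ratio(log, numerator, denominator):
    """Per student: count(numerator event) / count(denominator event)."""
    counts = {}
    for row in log:
        per_student = counts.setdefault(
            row["student"], {numerator: 0, denominator: 0}
        )
        if row["action"] in per_student:
            per_student[row["action"]] += 1
    return {
        # None marks an undefined ratio (denominator never occurred).
        student: (c[numerator] / c[denominator] if c[denominator] else None)
        for student, c in counts.items()
    }

# Hypothetical action log.
log = [
    {"student": "s1", "action": "help"},
    {"student": "s1", "action": "attempt"},
    {"student": "s1", "action": "attempt"},
    {"student": "s2", "action": "help"},
]
ratios = event_ratio(log, "help", "attempt")
print(ratios)
```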

Page 30: Feature Engineering Studio

How many students had 3 (or 4, 5, 2,…) of an event
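The question on this slide amounts to a count of counts. A hedged Python sketch, with hypothetical `student` and `action` fields:

```python
from collections import Counter

def students_per_count(log, event):
    """Return a Counter mapping k -> number of students with exactly
    k occurrences of `event`."""
    per_student = Counter()
    for row in log:
        # Adding 0 still registers students who never had the event.
        per_student[row["student"]] += int(row["action"] == event)
    return Counter(per_student.values())

# Hypothetical action log.
log = [
    {"student": "s1", "action": "help"},
    {"student": "s1", "action": "help"},
    {"student": "s2", "action": "attempt"},
    {"student": "s3", "action": "help"},
    {"student": "s3", "action": "help"},
]
distribution = students_per_count(log, "help")
print(distribution)
```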

Page 31: Feature Engineering Studio

Times-so-far

Page 32: Feature Engineering Studio

Cutoff-based features

Page 33: Feature Engineering Studio

Unitized actions (such as unitized time)
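The slide does not define "unitized". One common reading, assumed here, is time expressed in standard deviations above or below the mean time for the same problem step across all rows. The sketch below follows that assumption; the `step` and `seconds` fields are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitized_times(log):
    """Return each row's time as a z-score relative to its step."""
    by_step = defaultdict(list)
    for row in log:
        by_step[row["step"]].append(row["seconds"])
    stats = {step: (mean(t), pstdev(t)) for step, t in by_step.items()}
    out = []
    for row in log:
        mu, sigma = stats[row["step"]]
        # A step with no variation gets 0.0 rather than a division error.
        out.append((row["seconds"] - mu) / sigma if sigma else 0.0)
    return out

# Hypothetical log: three rows on step p1, one on step p2.
log = [
    {"step": "p1", "seconds": 10},
    {"step": "p1", "seconds": 20},
    {"step": "p1", "seconds": 30},
    {"step": "p2", "seconds": 5},
]
z = unitized_times(log)
print(z)
```

The population standard deviation is used here; a sample standard deviation would be an equally defensible choice.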

Page 34: Feature Engineering Studio

Last 3 or 5 unitized

Page 35: Feature Engineering Studio

Comparing earlier behaviors to later behaviors through caching

Page 36: Feature Engineering Studio

Counts-if

Page 37: Feature Engineering Studio

Percentages of action type
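As an illustration only, per-student percentages of action type could be computed like this, with hypothetical `student` and `action` fields:

```python
from collections import Counter, defaultdict

def action_type_shares(log):
    """Per student: fraction of their rows that fall in each action type."""
    counts = defaultdict(Counter)
    for row in log:
        counts[row["student"]][row["action"]] += 1
    return {
        student: {a: n / sum(c.values()) for a, n in c.items()}
        for student, c in counts.items()
    }

# Hypothetical action log.
log = [
    {"student": "s1", "action": "attempt"},
    {"student": "s1", "action": "attempt"},
    {"student": "s1", "action": "help"},
    {"student": "s1", "action": "attempt"},
    {"student": "s2", "action": "help"},
]
shares = action_type_shares(log)
print(shares)
```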

Page 38: Feature Engineering Studio

Percentages of time spent per action/location/KC/etc.

Page 39: Feature Engineering Studio

Questions? Comments?

Page 40: Feature Engineering Studio

Other cool ideas?

Page 41: Feature Engineering Studio

Assignment 3

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set

• Open it in Excel

• Create as many features as you feel inspired to create

– Features should be created with the goal of predicting your ground truth variable

– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)

• For each feature, write a 1-3 sentence “just so story” for why it might work

• Test how good each feature is

Page 42: Feature Engineering Studio

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in data mining or stats package, giving correlation or kappa (special session this Wednesday)

• Compute correlation in Excel (want to see?)

– You can do this with binary variables too, although it’s not really optimal

• Compute t-test in Excel (want to see?)

• Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
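For reference, two of the checks listed above (correlation and kappa) can be written out by hand. The sketch below is not the Excel or RapidMiner procedure; it is a plain-Python version of Pearson correlation and Cohen's kappa, run on made-up numbers. The feature values, the ground truth labels, and the cutoff of 3 are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between a feature and the ground truth."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohens_kappa(predicted, actual):
    """Agreement between predicted and actual labels, corrected for
    the agreement expected by chance."""
    n = len(actual)
    labels = set(predicted) | set(actual)
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    expected = sum(
        (predicted.count(label) / n) * (actual.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)

# Made-up feature values and binary ground truth.
feature = [1, 2, 3, 4, 5, 6]
truth = [0, 0, 0, 1, 1, 1]
r = pearson_r(feature, truth)

# Turn the feature into a prediction with an arbitrary cutoff of 3.
predicted = [1 if x >= 3 else 0 for x in feature]
kappa = cohens_kappa(predicted, truth)
print(round(r, 3), round(kappa, 3))
```

Both functions fail with a division by zero on degenerate input (a constant feature for correlation, or chance agreement of exactly 1 for kappa), so a real check would need to guard against those cases.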

Page 43: Feature Engineering Studio

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Page 44: Feature Engineering Studio

Assignment 3

• Write a brief report for me

• Email me an Excel sheet with your features

• You don’t need to prepare a presentation

• But be ready to discuss your features in class

Page 45: Feature Engineering Studio

Next Classes

• 9/25 Special Session

– Using RapidMiner to Produce Prediction Models

– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)

– Statistical significance tests using linear regression don’t count…

• 9/30 Advanced Feature Distillation in Excel

– Assignment 3 due

– Online Equation Solver Tutorials should be in your INBOX

Page 46: Feature Engineering Studio

Upcoming Classes

• 10/2 Special session on prediction models

– Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is

• 10/7 Advanced Feature Distillation in Google Refine

• 10/9 Special session? TBD.