Feature Engineering Studio September 23, 2013. Welcome to Mucking Around Day.


Feature Engineering Studio

September 23, 2013

Welcome to Mucking Around Day

Sort into pairs

• Partner with the person next to you

• One group of 3 is allowed

Sort into pairs

• Do we have a group of 3?

• One of the 3 will work with me

Sort into pairs

• Go over your reports together
  – A maximum of 5 minutes apiece

5 minutes for first person

5 minutes for second person

Re-assemble into one big group

Who here found something really cool while mucking around?

• Show us, tell us

Who here found a histogram with a normal distribution?

• Show us, tell us

Who here found a histogram with a hypermode?

• Show us, tell us

Who here found a histogram with a flat distribution?

• Show us, tell us

Who here found a histogram with a skewed distribution?

• Show us, tell us

Who here found a histogram with a bimodal distribution?

• Show us, tell us

Who here found a histogram with something else interesting?

• Show us, tell us

Who here found something surprising with their min, max, average, stdev?

Categorical variables

• Who here found something curious, weird, or interesting in the distribution of their categorical variables?

Who here hasn’t spoken yet? (and analyzed data)

• Tell us something interesting you found in your data

Who here played with pivot tables?

• What did you learn?

My turn to play with pivot tables

• Who wants to volunteer their data?

• (I might request a 2nd or 3rd data set, depending on how the 1st one goes)

Who here played with vlookup?

• What did you learn?

My turn to play with vlookup

• Using the same volunteered data set(s)

Other cool things you can create with a few simple formulas (plus demos!)

Identifying specific cases of interest

Did event of interest ever occur for student?
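
As a rough illustration outside Excel, a per-student "did it ever happen" flag could be sketched in Python as below. The log layout (a list of `(student, action)` pairs) and the names `ever_occurred`, `log`, and `"help"` are assumptions made for this example, not part of the class data.

```python
def ever_occurred(log, event):
    """Map each student to True if `event` appears in any of their rows."""
    seen = {}
    for student, action in log:
        # Once a student's flag is True it stays True.
        seen[student] = seen.get(student, False) or (action == event)
    return seen


# Hypothetical action log: (student, action) pairs in time order.
log = [("s1", "attempt"), ("s1", "help"), ("s2", "attempt"), ("s2", "attempt")]
flags = ever_occurred(log, "help")
print(flags)
```

The resulting flag is a student-level feature; it can be joined back onto each row in the same way a lookup formula would.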

Counts-so-far (and total value for student)
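
A minimal Python sketch of the same idea, assuming a hypothetical log of `(student, action)` pairs in time order. The function name `counts_so_far` is illustrative, and counting the current row as part of "so far" is a choice made here, not something the slide specifies.

```python
from collections import defaultdict


def counts_so_far(log, event):
    """Return (running, totals).

    running[i] is how many times `event` has occurred for that row's
    student up to and including row i; totals maps student -> final count.
    """
    totals = defaultdict(int)
    running = []
    for student, action in log:
        if action == event:
            totals[student] += 1
        running.append(totals[student])
    return running, dict(totals)


# Hypothetical log, rows in chronological order.
log = [("s1", "attempt"), ("s1", "help"), ("s2", "attempt"),
       ("s1", "help"), ("s2", "help")]
running, totals = counts_so_far(log, "help")
print(running, totals)
```

To exclude the current row instead, append the running value before incrementing.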

Counts-last-N-actions
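
One way to sketch a last-N count in Python is with a fixed-length window per student. The log layout and the name `counts_last_n` are assumptions for illustration; the window here includes the current action.

```python
from collections import defaultdict, deque


def counts_last_n(log, event, n):
    """For each row, count `event` among that student's last n actions
    (the current action included)."""
    # One sliding window of booleans per student; old entries fall off.
    windows = defaultdict(lambda: deque(maxlen=n))
    out = []
    for student, action in log:
        window = windows[student]
        window.append(action == event)
        out.append(sum(window))
    return out


# Hypothetical log, rows in chronological order.
log = [("s1", "help"), ("s1", "help"), ("s1", "attempt"),
       ("s1", "attempt"), ("s2", "help"), ("s1", "help")]
last3 = counts_last_n(log, "help", 3)
print(last3)
```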

First attempts
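
A small Python sketch of flagging first attempts, assuming a hypothetical log of `(student, step)` pairs in time order and treating "first attempt" as the first row for a given student on a given step. Both the layout and that definition are assumptions for this example.

```python
def first_attempt_flags(log):
    """Return a list of booleans: True where the row is the student's
    first action on that step."""
    seen = set()
    flags = []
    for student, step in log:
        key = (student, step)
        flags.append(key not in seen)
        seen.add(key)
    return flags


# Hypothetical log, rows in chronological order.
log = [("s1", "a"), ("s1", "a"), ("s2", "a"), ("s1", "b")]
first = first_attempt_flags(log)
print(first)
```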

Ratios between events of interest

How many students had 3 (or 4, 5, 2,…) of an event
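
A count-of-counts like this can be sketched in Python with two tallies: events per student, then students per event count. The log layout and the name `students_per_count` are hypothetical.

```python
from collections import Counter


def students_per_count(log, event):
    """Return a Counter mapping k -> number of students with exactly
    k occurrences of `event` (students with zero are included)."""
    per_student = Counter()
    students = set()
    for student, action in log:
        students.add(student)
        if action == event:
            per_student[student] += 1
    # Counter returns 0 for students who never had the event.
    return Counter(per_student[s] for s in students)


# Hypothetical log of (student, action) pairs.
log = [("s1", "help"), ("s1", "help"), ("s2", "help"), ("s2", "help"),
       ("s3", "attempt"), ("s4", "help")]
dist = students_per_count(log, "help")
print(sorted(dist.items()))
```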

Times-so-far

Cutoff-based features

Unitized actions (such as unitized time)
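
The slide does not define "unitized". One common reading, assumed here, is to express each value in standard deviations from the mean of its group (for example, the time on an action relative to all times recorded for the same problem step). The sketch below follows that assumption; the name `unitize` and the `(group, value)` layout are illustrative, and the population standard deviation is an arbitrary choice.

```python
from collections import defaultdict
from statistics import mean, pstdev


def unitize(rows):
    """rows: list of (group, value). Return each value as a z-score
    within its group; groups with zero spread map to 0.0."""
    by_group = defaultdict(list)
    for group, value in rows:
        by_group[group].append(value)
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    out = []
    for group, value in rows:
        mu, sd = stats[group]
        out.append(0.0 if sd == 0 else (value - mu) / sd)
    return out


# Hypothetical (step, seconds) rows.
rows = [("step1", 2.0), ("step1", 4.0), ("step1", 6.0), ("step2", 5.0)]
z = unitize(rows)
print(z)
```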

Last 3 or 5 unitized

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.
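
As a sketch of the percentage idea in Python, the function below computes each student's share of time per category. The category could be an action type, a location, or a KC. The `(student, category, seconds)` layout and the name `time_share` are assumptions for illustration.

```python
from collections import defaultdict


def time_share(rows):
    """rows: (student, category, seconds).
    Return {student: {category: fraction of that student's total time}}."""
    totals = defaultdict(float)
    per_cat = defaultdict(lambda: defaultdict(float))
    for student, category, seconds in rows:
        totals[student] += seconds
        per_cat[student][category] += seconds
    # Students with zero total time are skipped to avoid dividing by zero.
    return {
        s: {c: t / totals[s] for c, t in cats.items()}
        for s, cats in per_cat.items()
        if totals[s] > 0
    }


# Hypothetical rows.
rows = [("s1", "help", 10), ("s1", "attempt", 30), ("s2", "attempt", 5)]
shares = time_share(rows)
print(shares)
```

Passing a constant 1 in place of seconds gives percentages of action type by count rather than by time.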

Questions? Comments?

Other cool ideas?

Assignment 3

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set

• Open it in Excel

• Create as many features as you feel inspired to create
  – Features should be created with the goal of predicting your ground truth variable
  – At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)

• For each feature, write a 1-3 sentence “just so story” for why it might work

• Test how good each feature is

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in a data mining or stats package, giving correlation or kappa (special session this Wednesday)

• Compute correlation in Excel (want to see?)
  – You can do this with binary variables too, although it’s not really optimal

• Compute t-test in Excel (want to see?)

• Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
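
For reference, two of the goodness measures named above can be sketched in plain Python. These are the standard textbook formulas for Pearson correlation and Cohen's kappa, not the Excel or RapidMiner procedures demonstrated in class; the function names and the sample data are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between a feature and the ground truth.
    Undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def cohens_kappa(pred, truth):
    """Cohen's kappa: agreement between predicted and true labels,
    corrected for chance. Undefined if expected agreement is 1."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


# Made-up example data.
r = pearson_r([1, 2, 3], [2, 4, 6])
k = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(r, k)
```

A negative correlation here means the feature runs in the opposite direction from the ground truth, which is worth checking against the “just so story” written for it.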

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Assignment 3

• Write a brief report for me

• Email me an Excel sheet with your features

• You don’t need to prepare a presentation

• But be ready to discuss your features in class

Next Classes

• 9/25 Special Session
  – Using RapidMiner to Produce Prediction Models
  – Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
  – Statistical significance tests using linear regression don’t count…

• 9/30 Advanced Feature Distillation in Excel
  – Assignment 3 due
  – Online Equation Solver Tutorials should be in your INBOX

Upcoming Classes

• 10/2 Special session on prediction models
  – Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is

• 10/7 Advanced Feature Distillation in Google Refine

• 10/9 Special session? TBD.