Feature Engineering Studio September 23, 2013. Welcome to Mucking Around Day.

Post on 12-Jan-2016

Feature Engineering Studio

September 23, 2013

Welcome to Mucking Around Day

Sort into pairs

• Partner with the person next to you

• One group of 3 is allowed

Sort into pairs

• Do we have a group of 3?

• One of the 3 will work with me

Sort into pairs

• Go over your reports together

– A maximum of 5 minutes apiece

5 minutes for first person

5 minutes for second person

Re-assemble into one big group

Who here found something really cool while mucking around?

• Show us, tell us

Who here found a histogram with a normal distribution?

• Show us, tell us

Who here found a histogram with a hypermode?

• Show us, tell us

Who here found a histogram with a flat distribution?

• Show us, tell us

Who here found a histogram with a skewed distribution?

• Show us, tell us

Who here found a histogram with a bimodal distribution?

• Show us, tell us

Who here found a histogram with something else interesting?

• Show us, tell us

Who here found something surprising with their min, max, average, stdev?

Categorical variables

• Who here found something curious, weird, or interesting in the distribution of their categorical variables?

Who here hasn’t spoken yet? (and analyzed data)

• Tell us something interesting you found in your data

Who here played with pivot tables?

• What did you learn?

My turn to play with pivot tables

• Who wants to volunteer their data?

• (I might request a 2nd or 3rd data set, depending on how the 1st one goes)

Who here played with vlookup?

• What did you learn?

My turn to play with vlookup

• Using the same volunteered data set(s)

Other cool things you can create with a few simple formulas (plus demos!)

Identifying specific cases of interest

Did event of interest ever occur for student?

Counts-so-far (and total value for student)
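A minimal sketch of a counts-so-far feature outside Excel, assuming the log is a chronologically ordered list of (student, action) pairs. The function name, the "help" event, and the sample log are hypothetical. Whether the running count includes the current row is a design choice; here it counts only earlier rows.

```python
from collections import defaultdict

def counts_so_far(rows, event="help"):
    """Return one (count_so_far, total_for_student) pair per row.

    rows: list of (student, action) tuples in chronological order.
    count_so_far counts matching events strictly before the current row.
    """
    # First pass: total number of matching events per student.
    totals = defaultdict(int)
    for student, action in rows:
        if action == event:
            totals[student] += 1

    # Second pass: running count per student, read before it is updated.
    running = defaultdict(int)
    out = []
    for student, action in rows:
        out.append((running[student], totals[student]))
        if action == event:
            running[student] += 1
    return out

# Hypothetical sample log.
log = [("s1", "help"), ("s1", "attempt"), ("s2", "help"),
       ("s1", "help"), ("s2", "attempt")]
features = counts_so_far(log)
```

Dividing the first number by the second gives a rough "how far through this student's total" value, which guards against division by zero only if you skip students whose total is 0.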

Counts-last-N-actions
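One possible way to sketch counts-last-N-actions, assuming the same hypothetical (student, action) log in time order. A fixed-length deque per student plays the role of the sliding range you would drag down a column in Excel. The count covers the student's previous N actions, not the current one.

```python
from collections import defaultdict, deque

def counts_last_n(rows, event="help", n=3):
    """For each row, count `event` among that student's previous n actions."""
    # One bounded window of 0/1 flags per student; old flags drop off automatically.
    windows = defaultdict(lambda: deque(maxlen=n))
    out = []
    for student, action in rows:
        window = windows[student]
        out.append(sum(window))
        window.append(1 if action == event else 0)
    return out

# Hypothetical sample log for a single student.
log = [("s1", "help"), ("s1", "help"), ("s1", "attempt"),
       ("s1", "help"), ("s1", "attempt")]
last3 = counts_last_n(log, event="help", n=3)
```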

First attempts
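A small sketch of a first-attempt flag, assuming each row is a (student, problem) pair in time order. The names and sample data are hypothetical; in a real log you might key on step or skill instead of problem.

```python
def first_attempt_flags(rows):
    """Return True for a row that is the student's first action on that problem."""
    seen = set()
    flags = []
    for student, problem in rows:
        key = (student, problem)
        flags.append(key not in seen)
        seen.add(key)
    return flags

# Hypothetical sample log.
log = [("s1", "p1"), ("s1", "p1"), ("s2", "p1"), ("s1", "p2"), ("s2", "p1")]
flags = first_attempt_flags(log)
```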

Ratios between events of interest

How many students had 3 (or 4, 5, 2,…) of an event
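The question on this slide is a count of counts, which a pivot table answers in Excel. A rough equivalent, assuming a hypothetical (student, action) log; students who never produced the event are kept with a count of 0.

```python
from collections import Counter

def students_per_event_count(rows, event="help"):
    """Map 'number of events' -> 'number of students with exactly that many'."""
    # Start every student at 0 so students without the event are still counted.
    per_student = Counter({student: 0 for student, _ in rows})
    for student, action in rows:
        if action == event:
            per_student[student] += 1
    return Counter(per_student.values())

# Hypothetical sample log.
log = [("s1", "help"), ("s1", "help"), ("s1", "help"), ("s2", "help"),
       ("s3", "attempt"), ("s4", "help"), ("s4", "help"), ("s4", "help")]
dist = students_per_event_count(log)
```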

Times-so-far

Cutoff-based features

Unitized actions (such as unitized time)
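A sketch of unitized time, under the assumption that "unitized" here means a z-score: the time for an action minus the mean time for that step, divided by the standard deviation for that step. That reading, the field names, and the sample data are assumptions, not something the slides spell out. The sketch uses the population standard deviation; swap in `statistics.stdev` if you want the sample version.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize_time(rows):
    """rows: list of (step, seconds). Return one z-score per row, relative to its step."""
    by_step = defaultdict(list)
    for step, seconds in rows:
        by_step[step].append(seconds)

    # Mean and population standard deviation of time for each step.
    stats = {step: (mean(times), pstdev(times)) for step, times in by_step.items()}

    out = []
    for step, seconds in rows:
        m, sd = stats[step]
        # A step where every time is identical has sd == 0; report 0.0 rather than divide.
        out.append(0.0 if sd == 0 else (seconds - m) / sd)
    return out

# Hypothetical sample data.
times = [("a", 10), ("a", 20), ("a", 30), ("b", 5), ("b", 5)]
z = unitize_time(times)
```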

Last 3 or 5 unitized

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.
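A minimal sketch of percentage of time per action type, assuming rows of (student, action, seconds). The same shape works for location or KC by changing what goes in the second field. All names and data are hypothetical.

```python
from collections import defaultdict

def time_share_by_action(rows):
    """rows: (student, action, seconds). Return {student: {action: fraction of time}}."""
    totals = defaultdict(float)
    by_action = defaultdict(lambda: defaultdict(float))
    for student, action, seconds in rows:
        totals[student] += seconds
        by_action[student][action] += seconds
    return {
        student: {
            # Guard against a student whose logged time sums to zero.
            action: (t / totals[student] if totals[student] else 0.0)
            for action, t in actions.items()
        }
        for student, actions in by_action.items()
    }

# Hypothetical sample log.
log = [("s1", "help", 10), ("s1", "attempt", 30), ("s2", "attempt", 20)]
shares = time_share_by_action(log)
```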

Questions? Comments?

Other cool ideas?

Assignment 3

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set

• Open it in Excel

• Create as many features as you feel inspired to create

– Features should be created with the goal of predicting your ground truth variable

– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)

• For each feature, write a 1-3 sentence “just so story” for why it might work

• Test how good each feature is

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in data mining or stats package, giving correlation or kappa (special session this Wednesday)

• Compute correlation in Excel (want to see?)

– You can do this with binary variables too, although it’s not really optimal

• Compute t-test in Excel (want to see?)

• Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
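For anyone who wants to check the Excel numbers by hand, here is a rough sketch of two of the measures above: Pearson correlation between a feature and the ground truth, and Cohen's kappa between a thresholded feature and the ground truth. The function names, the sample values, and the idea of thresholding a single feature to get predictions are illustrative assumptions, not the procedure from the special session.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def cohens_kappa(predicted, actual):
    """Agreement beyond chance; undefined when expected agreement is exactly 1."""
    n = len(actual)
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    labels = set(predicted) | set(actual)
    # Chance agreement from the marginal label frequencies of each list.
    expected = sum((predicted.count(l) / n) * (actual.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical feature values and binary ground truth.
feature = [1, 2, 3, 4, 5]
truth = [0, 0, 1, 1, 1]
r = pearson_r(feature, truth)

# Hypothetical predictions from some single-feature cutoff.
predicted = [0, 0, 1, 1, 0]
kappa = cohens_kappa(predicted, truth)
```

With these sample values r is about 0.87 and kappa about 0.62. Correlation against a binary ground truth works, as the slide notes, but is not the ideal measure.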

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Assignment 3

• Write a brief report for me

• Email me an Excel sheet with your features

• You don’t need to prepare a presentation

• But be ready to discuss your features in class

Next Classes

• 9/25 Special Session

– Using RapidMiner to Produce Prediction Models

– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)

– Statistical significance tests using linear regression don’t count…

• 9/30 Advanced Feature Distillation in Excel

– Assignment 3 due

– Online Equation Solver Tutorials should be in your INBOX

Upcoming Classes

• 10/2 Special session on prediction models

– Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is

• 10/7 Advanced Feature Distillation in Google Refine

• 10/9 Special session? TBD.