
ORIE 4741: Learning with Big Messy Data

Feature Engineering

Professor Udell

Operations Research and Information Engineering, Cornell

September 30, 2019


Announcements

▶ Guest lecturer Thursday: Milinda Lakkam from LinkedIn

▶ Project proposals due this Thursday (midnight) on GitHub

▶ New project idea: course evaluations

▶ Homework 2 due Thursday next week

▶ Section this week: GitHub basics (for the project)

▶ Want to be a grader?


Outline

Feature engineering

Polynomial transformations

Boolean, nominal, ordinal, text, …

Time series


Linear models

To fit a linear model (= linear in the parameters w):

▶ pick a transformation φ : X → ℝᵈ

▶ predict y using a linear function of φ(x):

    h(x) = wᵀφ(x) = ∑_{i=1}^{d} wᵢ (φ(x))ᵢ

▶ we want h(xᵢ) ≈ yᵢ for every i = 1, …, n

Q: Why do we want a model linear in the parameters w?
A: Because the optimization problem is easy to solve! E.g., just use least squares.

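To make the recipe concrete, here is a minimal Julia sketch (not from the lecture; the data, seed, and helper names are made up) that builds a feature matrix from φ and solves the least-squares problem:

```julia
# Fit a linear-in-parameters model by least squares (toy data).
using Random
Random.seed!(0)

n = 100
x = randn(n)                           # raw inputs, X = ℝ
y = 1 .+ 2 .* x .+ 0.1 .* randn(n)     # noisy linear ground truth

φ(x) = [1.0, x]                        # transformation φ : ℝ → ℝ² (offset, x)
Φ = [φ(xi)[j] for xi in x, j in 1:2]   # n × 2 design matrix
w = Φ \ y                              # least squares: minimize ‖Φw − y‖²
```

The backslash operator returns the least-squares solution whenever Φ has more rows than columns, which is exactly why linearity in w makes the fit easy.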

Feature engineering

How to pick φ : X → ℝᵈ?

▶ so the response y will depend linearly on φ(x)

▶ so d is not too big


if you think this looks like a hack: you’re right


Feature engineering

examples:

▶ adding an offset
▶ standardizing features
▶ polynomial fits
▶ products of features
▶ autoregressive models
▶ local linear regression
▶ transforming Booleans
▶ transforming ordinals
▶ transforming nominals
▶ transforming images
▶ transforming text
▶ concatenating data
▶ all of the above

https://xkcd.com/2048/

Outline

Feature engineering

Polynomial transformations

Boolean, nominal, ordinal, text, …

Time series


Adding offset

▶ X = ℝ^{d−1}

▶ let φ(x) = (x, 1)

▶ now h(x) = wᵀφ(x) = w_{1:d−1}ᵀ x + w_d


Fitting a polynomial

▶ X = ℝ

▶ let

    φ(x) = (1, x, x², x³, …, x^{d−1})

  be the vector of all monomials in x of degree < d

▶ now h(x) = wᵀφ(x) = w₁ + w₂x + w₃x² + ⋯ + w_d x^{d−1}

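A minimal Julia sketch of this monomial feature map (toy data, not from the lecture): build the Vandermonde matrix and fit by least squares.

```julia
# Degree-(d−1) polynomial regression with φ(x) = (1, x, x², …, x^{d−1}).
n, d = 50, 4
x = collect(range(-2, 2; length = n))
y = sin.(x) .+ 0.05 .* randn(n)            # toy nonlinear target

Φ = [xi^j for xi in x, j in 0:d-1]         # n × d Vandermonde matrix
w = Φ \ y                                  # least-squares coefficients
h(x0) = sum(w[j+1] * x0^j for j in 0:d-1)  # the fitted polynomial
```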

Demo: Linear models

https://github.com/ORIE4741/demos


Fitting a multivariate polynomial

▶ X = ℝ²

▶ pick a maximum degree k

▶ let

    φ(x) = (1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², x₂³, …, x₂ᵏ)

  be the vector of all monomials in x₁ and x₂ of degree ≤ k

▶ now h(x) = wᵀφ(x) can fit any polynomial of degree ≤ k on X

and similarly for X = ℝᵈ …


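A minimal Julia sketch of the multivariate monomial map (the helper name is mine; it lists the monomials in a different order than the slide, which does not matter for least squares):

```julia
# All monomials x₁ᵃ x₂ᵇ with a + b ≤ k.
function monomials(x::AbstractVector, k::Int)
    x1, x2 = x
    [x1^a * x2^b for a in 0:k for b in 0:(k - a)]
end

monomials([2.0, 3.0], 2)   # → [1.0, 3.0, 9.0, 2.0, 6.0, 4.0]
```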

Demo: Linear models

polynomial classification

https://github.com/ORIE4741/demos


Linear classification

[figure: positive and negative examples in one dimension (x from −4 to 4), with the linear classification boundary]

Polynomial classification

[figure: positive and negative examples with a polynomial classification boundary]

Example: fitting a multivariate polynomial

▶ X = ℝ², Y = {−1, 1}

▶ let

    φ(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²)

  be the vector of all monomials of degree ≤ 2

▶ now let h(x) = sign(wᵀφ(x))

Q: If h(x) = sign(5 − 3x₁ + 2x₂ + x₁² − x₁x₂ + 2x₂²), what is {x : h(x) = 1}?

A: An ellipse! (The quadratic part x₁² − x₁x₂ + 2x₂² is positive definite, so the decision boundary {x : wᵀφ(x) = 0}, when nonempty, is an ellipse.)


Example: fitting a multivariate polynomial

▶ X = ℝ², Y = {−1, 1}

▶ let

    φ(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²)

  be the vector of all monomials of degree ≤ 2

▶ now let h(x) = sign(wᵀφ(x))

Q: If h(x) = sign(5 − 3x₁ + 2x₂ + x₁² − x₁x₂ − 2x₂²), what is {x : h(x) = −1}?

A: A hyperbola! (Here the quadratic part x₁² − x₁x₂ − 2x₂² is indefinite, with matrix determinant −9/4 < 0, so the decision boundary is a hyperbola.)


Outline

Feature engineering

Polynomial transformations

Boolean, nominal, ordinal, text, …

Time series


Notation: boolean indicator function

define

1(statement) = 1 if the statement is true, and 0 if it is false

examples:

▶ 1(1 < 0) = 0

▶ 1(17 = 17) = 1


Boolean variables

▶ X = {true, false}

▶ let φ(x) = 1(x)


Boolean expressions

▶ X = {true, false}² = {(true, true), (true, false), (false, true), (false, false)}

▶ let φ(x) = [1(x₁), 1(x₂), 1(x₁ and x₂), 1(x₁ or x₂)]

▶ equivalent: polynomials in [1(x₁), 1(x₂)] span the same space

▶ encodes logical expressions!

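A minimal Julia sketch of this feature map (the function name is mine). The reason polynomials in [1(x₁), 1(x₂)] span the same space: 1(x₁ and x₂) = 1(x₁)·1(x₂), and 1(x₁ or x₂) = 1(x₁) + 1(x₂) − 1(x₁)·1(x₂).

```julia
# Boolean features: [1(x₁), 1(x₂), 1(x₁ and x₂), 1(x₁ or x₂)].
φbool(x1::Bool, x2::Bool) = Float64[x1, x2, x1 && x2, x1 || x2]

φbool(true, false)   # → [1.0, 0.0, 0.0, 1.0]
```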

Nominal values: one-hot encoding

▶ nominal data: e.g., X = {apple, orange, banana}

▶ let

    φ(x) = [1(x = apple), 1(x = orange), 1(x = banana)]

▶ called one-hot encoding: only one element is non-zero

extension: sets


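A minimal Julia sketch of one-hot encoding (the category list and function name are mine):

```julia
# One-hot encoding over a fixed list of categories.
const FRUITS = [:apple, :orange, :banana]
onehot(x::Symbol) = Float64[x == f for f in FRUITS]

onehot(:orange)   # → [0.0, 1.0, 0.0]
```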

Nominal values: look up features!

why not use other information known about each item?

▶ X = {apple, orange, banana}
  ▶ price, calories, weight, …

▶ X = zip code
  ▶ average income, temperature in July, walk score, % residential, …

▶ …

database lingo: join tables on the nominal value


Ordinal values: real encoding

▶ ordinal data: e.g., X = {Stage I, Stage II, Stage III, Stage IV}

▶ let

    φ(x) = 1,  x = Stage I
           2,  x = Stage II
           3,  x = Stage III
           4,  x = Stage IV

▶ default encoding

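A minimal Julia sketch of the real encoding (the helper name is mine):

```julia
# Real encoding of an ordinal variable: Stage I ↦ 1, …, Stage IV ↦ 4.
const STAGES = ["Stage I", "Stage II", "Stage III", "Stage IV"]
φreal(x::String) = Float64(findfirst(==(x), STAGES))

φreal("Stage III")   # → 3.0
```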

Ordinal values: real encoding

▶ X = {Stage I, Stage II, Stage III, Stage IV}

▶ Y = ℝ, number of years lived after diagnosis

▶ use the real encoding φ to transform the ordinal data

▶ fit a linear model with offset to predict y as wφ(x) + b

Suppose the model predicts that a person diagnosed with Stage II cancer will survive 2 more years, and a person diagnosed with Stage I cancer will survive 4 more years.

Q: What is w? What is b?
A: Solving w + b = 4 and 2w + b = 2 gives w = −2 and b = 6.

Q: How long does the model predict a person diagnosed with Stage IV cancer will survive?
A: 4w + b = −8 + 6 = −2 years (the real encoding forces equally spaced effects across stages, even when that extrapolates badly)


Ordinal values: boolean encoding

▶ ordinal data: e.g., X = {Stage I, Stage II, Stage III, Stage IV}

▶ let

    φ(x) = [1(x ≥ Stage II), 1(x ≥ Stage III), 1(x ≥ Stage IV)]

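A minimal Julia sketch of the Boolean (sometimes called "thermometer") encoding (the helper name is mine):

```julia
# Boolean encoding: [1(x ≥ Stage II), 1(x ≥ Stage III), 1(x ≥ Stage IV)].
const STAGES = ["Stage I", "Stage II", "Stage III", "Stage IV"]
φord(x::String) = (i = findfirst(==(x), STAGES); Float64[i ≥ 2, i ≥ 3, i ≥ 4])

φord("Stage III")   # → [1.0, 1.0, 0.0]
```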

Ordinal values: boolean encoding

▶ X = {Stage I, Stage II, Stage III, Stage IV}

▶ Y = ℝ, number of years lived after diagnosis

▶ define the transformation φ : X → ℝ³ as

    φ(x) = [1(x ≥ Stage II), 1(x ≥ Stage III), 1(x ≥ Stage IV)]

▶ fit a linear model with offset to predict y as wᵀφ(x) + b

Suppose the model predicts that a person diagnosed with Stage II cancer will survive 2 more years, and a person diagnosed with Stage I cancer will survive 4 more years.

Q: What is w? What is b?
A: φ(Stage I) = [0, 0, 0] gives b = 4; φ(Stage II) = [1, 0, 0] gives w₁ + b = 2, so w₁ = −2; w₂ and w₃ are not determined.

Q: How long does the model predict a person diagnosed with Stage IV cancer will survive?
A: Can't say without more information.


Text

X = sentences, documents, tweets, …

▶ bag-of-words model (one-hot encoding):
  ▶ pick a set of words {w₁, …, w_d}
  ▶ φ(x) = [1(x contains w₁), …, 1(x contains w_d)]
  ▶ ignores the order of words in the sentence

▶ pre-trained neural networks:
  ▶ sentiment analysis: https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c
  ▶ Universal Sentence Encoder (USE) embedding: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
  ▶ lots of others: https://modelzoo.co/

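A minimal Julia sketch of bag-of-words features (the vocabulary here is a made-up toy; real applications pick thousands of words):

```julia
# Bag of words: φ(x) = [1(x contains w₁), …, 1(x contains w_d)].
const VOCAB = ["good", "bad", "messy", "data"]

# Crude substring match for illustration; real pipelines tokenize first.
bagofwords(x::String) = Float64[occursin(w, lowercase(x)) for w in VOCAB]

bagofwords("Big messy data is good")   # → [1.0, 0.0, 1.0, 1.0]
```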

Neural networks: whirlwind primer

NN(x) = σ(W₁ σ(W₂ ⋯ σ(W_ℓ x)))

▶ σ is a nonlinearity applied elementwise to a vector, e.g.
  ▶ ReLU: σ(x) = max(x, 0)
  ▶ sigmoid: σ(x) = 1/(1 + exp(−x))

▶ each Wᵢ is a matrix

▶ trained on very large datasets, e.g., Wikipedia, YouTube

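A minimal Julia sketch of the forward pass (random, untrained weights chosen here for illustration; real networks learn the Wᵢ from data):

```julia
# Two-layer network NN(x) = σ(W₁ σ(W₂ x)) with the ReLU nonlinearity.
σ(x) = max.(x, 0)          # ReLU, applied elementwise

W2 = randn(5, 3)           # inner layer: ℝ³ → ℝ⁵
W1 = randn(2, 5)           # outer layer: ℝ⁵ → ℝ²
NN(x) = σ(W1 * σ(W2 * x))

NN(randn(3))               # → a vector in ℝ²
```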

Why not use deep learning?

towards a solution: https://arxiv.org/abs/1907.10597


Outline

Feature engineering

Polynomial transformations

Boolean, nominal, ordinal, text, …

Time series


Fitting a time series

▶ given a time series xₜ ∈ ℝ, t = 1, …, T

▶ want to predict the value at the next time T + 1

▶ input: time index t and the time series x₁:ₜ up to time t

▶ let

    φ(t, x) = (x_{t−1}, x_{t−2}, …, x_{t−d})

  (called the "lagged outcomes")

▶ now h(t, x) = wᵀφ(t, x) = w₁x_{t−1} + w₂x_{t−2} + ⋯ + w_d x_{t−d}

also called an auto-regressive (AR) model

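A minimal Julia sketch of fitting an AR(d) model by least squares (the series here is synthetic, made up for illustration):

```julia
# Autoregressive features: regress xₜ on (x_{t−1}, …, x_{t−d}).
T, d = 200, 3
x = zeros(T)
for t in 3:T                                  # synthetic AR(2) series
    x[t] = 0.6 * x[t-1] - 0.2 * x[t-2] + 0.1 * randn()
end

rows = d+1:T
Φ = [x[t-j] for t in rows, j in 1:d]          # lagged design matrix
w = Φ \ x[rows]                               # AR coefficients
xhat = sum(w[j] * x[T+1-j] for j in 1:d)      # forecast for time T + 1
```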

Fitting a local model: local neighbors

▶ X = ℝᵐ

▶ pick a set of points μᵢ ∈ ℝᵐ, i = 1, …, d, and a radius δ ∈ ℝ

▶ let φ(x) = [1(‖x − μ₁‖₂ ≤ δ), …, 1(‖x − μ_d‖₂ ≤ δ)]

often, d = n and μᵢ = xᵢ


Fitting a local model: 1 nearest neighbor

▶ X = ℝᵐ

▶ pick a set of points μᵢ ∈ ℝᵐ, i = 1, …, d

▶ let δ = min(‖x − μ₁‖₂, …, ‖x − μ_d‖₂)

▶ let φ(x) = [1(‖x − μ₁‖₂ ≤ δ), …, 1(‖x − μ_d‖₂ ≤ δ)]

often, d = n and μᵢ = xᵢ


Fitting a local model: smoothing

▶ X = ℝᵐ

▶ pick a set of points μᵢ ∈ ℝᵐ, i = 1, …, d, and a parameter α ∈ ℝ

▶ let

    φ(x) = (exp(−α‖x − μ₁‖₂), …, exp(−α‖x − μ_d‖₂))

often, d = n and μᵢ = xᵢ

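A minimal Julia sketch of these smoothing features (the centers and α are arbitrary choices here; as the slide notes, one often takes the training points as centers):

```julia
# Local smoothing features: φ(x)ᵢ = exp(−α‖x − μᵢ‖₂).
using LinearAlgebra                      # for norm

μ = [randn(2) for _ in 1:10]             # d = 10 centers in ℝ²
α = 1.0
φsmooth(x) = [exp(-α * norm(x - μi)) for μi in μ]

φsmooth(randn(2))                        # → 10 features in (0, 1]
```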

Demo: Crime

preprocessing: https://juliabox.com/notebooks/demos/Crime.ipynb

predicting: https://juliabox.com/notebooks/demos/Predicting%20crime.ipynb


Review

▶ linear models are linear in the parameters w

▶ we can fit many different models by picking the feature mapping φ : X → ℝᵈ


What makes a good project?

▶ Clear outcome to predict

▶ Linear regression should do something interesting

▶ A data science project; not an NLP or Vision project

▶ New, interesting model; not a Kaggle competition
