The How and Why of Feature Engineering


Transcript of The How and Why of Feature Engineering

1

The How and Why of Feature Engineering

Alice Zheng, Dato. March 29, 2016. Strata + Hadoop World, San Jose.

2

My journey so far

Shortage of expertise and good tools in the market.

Applied machine learning/data science

Build ML tools

Write a book

3

Machine learning is great!

Model data. Make predictions. Build intelligent applications. Play chess and Go!

4

The machine learning pipeline

I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …

Raw data → Features → Models → Predictions → Deploy in production

5

If machine learning were hairstyles

Images courtesy of “A visual history of ancient hairdos” and “An animated history of 20th century hairstyles.”

Models: magnificent, ornate, high-maintenance

Feature engineering: street smart, ad hoc, hacky

6

Making sense of feature engineering
• Feature generation
• Feature cleaning and transformation
• How well do they work?
• Why?

Feature Generation

Feature: An individual measurable property of a phenomenon being observed.

⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”

8

Representing natural text

It is a puppy and it is extremely cute.

What’s important? Phrases? Specific words? Ordering?

Subject, object, verb?

Classify: puppy or not?

Raw Text → Bag of Words:

{"it": 2, "is": 2, "a": 1, "puppy": 1, "and": 1, "extremely": 1, "cute": 1}

9

Representing natural text

It is a puppy and it is extremely cute.

Classify: puppy or not?

Raw Text → Bag of Words (a count for every word in the vocabulary):

word       count
it         2
they       0
I          0
am         0
how        0
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …

Sparse vector representation
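As a quick sketch (not from the slides), the counting itself takes only a few lines of Python; the tokenization here is deliberately naive:

    # Rough sketch: naive bag-of-words counting for the example sentence.
    from collections import Counter

    text = "It is a puppy and it is extremely cute."
    tokens = text.lower().replace(".", "").split()   # very naive tokenization
    bag_of_words = Counter(tokens)
    print(bag_of_words)
    # Counter({'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1, 'extremely': 1, 'cute': 1})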

10

Representing images

Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.

Raw image: millions of RGB triplets, one for each pixel

Classify: person or animal?

Raw Image → Bag of Visual Words

11

Representing images

Classify: person or animal?

Raw Image → Deep learning features

(Example: a grid of real-valued deep-learning feature values.)

Dense vector representation

12

Representing audio

Raw Audio → Spectrogram features

Classify: music or voice? Type of instrument?

             t=0      t=1      t=2
          6.1917  -0.3411   1.2418
          0.2205   0.0214   0.4503
          1.0423   0.2214  -1.0017
         -0.2340  -0.0392  -0.2617
          0.2750   0.0226   0.1229
          0.0653   0.0428  -0.4721
          0.3169   0.0541  -0.1033
         -0.2970  -0.0627   0.1960

Time series of dense vectors
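A minimal sketch of going from raw audio to spectrogram-style features, assuming SciPy is available; the signal, sample rate, and window sizes below are illustrative stand-ins:

    # Minimal sketch: raw audio -> spectrogram features (a time series of dense vectors).
    import numpy as np
    from scipy.signal import spectrogram

    fs = 16000                                  # assumed sample rate (Hz)
    t = np.arange(0, 1.0, 1.0 / fs)
    audio = np.sin(2 * np.pi * 440 * t)         # stand-in for a real recording

    # Each column of Sxx is a dense feature vector for one time window.
    freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
    print(Sxx.shape)                            # (frequency bins, time steps)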

13

Feature generation for audio, image, text

I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …

“Human native” ↔ conceptually abstract

Semantic content in data: low ↔ high

Difficulty of feature generation: higher ↔ lower

Feature Cleaning and Transformation

15

Auto-generated features are noisy

Rank  Word  Doc Count     Rank  Word  Doc Count
   1  the   1,416,058       11  was     929,703
   2  and   1,381,324       12  this    844,824
   3  a     1,263,126       13  but     822,313
   4  i     1,230,214       14  my      786,595
   5  to    1,196,238       15  that    777,045
   6  it    1,027,835       16  with    775,044
   7  of    1,025,638       17  on      735,419
   8  for     993,430       18  they    720,994
   9  is      988,547       19  you     701,015
  10  in      961,518       20  have    692,749

Most popular words in Yelp reviews dataset (~ 6M reviews).

16

Auto-generated features are noisy

Rank     Word             Doc Count     Rank     Word             Doc Count
357,480  cmtk8xyqg        1             357,470  attractif        1
357,479  tangified        1             357,469  chappagetti      1
357,478  laaaaaaasts      1             357,468  herdy            1
357,477  bailouts         1             357,467  csmpus           1
357,476  feautred         1             357,466  costoso          1
357,475  résine           1             357,465  freebased        1
357,474  chilyl           1             357,464  tikme            1
357,473  cariottis        1             357,463  traditionresort  1
357,472  enfeebled        1             357,462  jallisco         1
357,471  sparklely        1             357,461  zoawan           1

Least popular words in Yelp reviews dataset (~ 6M reviews).

17

Feature cleaning
• Popular words and rare words are not helpful
• Manually defined blacklist: stopwords

(Excerpt of a stopword list:) a, able, about, above, according, accordingly, across, b, be, became, because, become, becomes, becoming, c, came, can, cannot, cant, cause, causes, d, definitely, described, despite, did, different, do, e, each, edu, eg, eight, either, else, f, far, few, fifth, first, five, followed, g, get, gets, getting, given, gives, go, h, had, happens, hardly, has, have, having, i, ie, if, ignored, immediately, in, inasmuch, …

18

Feature cleaning
• Frequency-based pruning

19

Stopwords vs. frequency filters

Stopwords:
• No training required
• Can be exhaustive
• Inflexible

Frequency filters:
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control

Both require manual attention.
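As a sketch of both approaches, assuming scikit-learn; the corpus and the pruning thresholds below are made up, not the Yelp data:

    # Sketch: stopword blacklist vs. frequency-based pruning with CountVectorizer.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "I fell in love the instant I laid my eyes on that puppy.",
        "It is a puppy and it is extremely cute.",
        "The restaurant was loud but the food was great.",
    ]

    # Manually defined blacklist: drop words on a fixed stopword list.
    stopword_vec = CountVectorizer(stop_words="english")

    # Frequency-based pruning: drop words appearing in more than 90% of documents
    # (max_df) or in fewer than 2 documents (min_df). Thresholds need tuning.
    freq_vec = CountVectorizer(max_df=0.9, min_df=2)

    print(sorted(stopword_vec.fit(docs).vocabulary_))
    print(sorted(freq_vec.fit(docs).vocabulary_))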

20

Tf-idf: Automatic “soft” filter
• Tf-idf = term frequency × inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
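A toy sketch of the formula above, using the four example sentences from the next few slides; it is hand-rolled so the log(# total docs / # docs containing w) step stays explicit:

    # Toy sketch: tf = count of a word in a document, idf = log(total docs / docs containing it).
    import math
    from collections import Counter

    docs = ["i have a puppy",
            "i have a cat",
            "i have a kitten",
            "i have a dog and i have a pen"]
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)

    def idf(word):
        doc_count = sum(1 for doc in tokenized if word in doc)
        return math.log(n_docs / doc_count)

    def tfidf(word, doc):
        return Counter(doc)[word] * idf(word)

    print(idf("puppy"))                  # log 4 -- rare word, emphasized
    print(idf("have"))                   # log 1 = 0 -- popular word, discounted
    print(tfidf("puppy", tokenized[0]))  # 1 * log 4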

21

Visualizing bag-of-words

(Plot: the example documents as points in the space spanned by the “puppy”, “cat”, and “have” axes.)

I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen

22

Visualizing tf-idf

(Same plot, with each word axis about to be rescaled by its idf.)

idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0

I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen

23

Visualizing tf-idf

(Same plot after scaling each axis by idf; the “have” dimension is zeroed out.)

tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0

I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen

24

Algebraically, tf-idf = column scaling

(Document-term matrix: rows d1 … dN are documents, columns w1 … wM are words.)

idf = log(N / L0 norm of word column)

25

Algebraically, tf-idf = column scaling

(Slides 26-28 repeat the document-term matrix, highlighting one word column at a time.)

Multiply each word column by a scalar: the idf of that word.

28

Other types of column scaling
• L2 scaling = divide column by L2 norm
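Both operations are scalings of the word columns of the document-term matrix; a small numpy sketch with a toy matrix (not the Yelp data):

    # Sketch: tf-idf and L2 scaling as column scalings of a document-term matrix X.
    import numpy as np

    # Toy document-term matrix: 4 documents (rows) x 3 words (columns).
    X = np.array([[1., 1., 0.],
                  [1., 0., 1.],
                  [1., 0., 0.],
                  [2., 0., 0.]])

    n_docs = X.shape[0]
    doc_freq = np.count_nonzero(X, axis=0)   # L0 norm of each word column
    idf = np.log(n_docs / doc_freq)
    X_tfidf = X * idf                        # multiply each word column by its idf

    l2_norms = np.linalg.norm(X, axis=0)     # L2 norm of each word column
    X_l2 = X / l2_norms                      # divide each word column by its L2 norm

    print(idf)        # a word appearing in every document gets idf = 0 (column zeroed out)
    print(X_tfidf)
    print(X_l2)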

How well do they work?

30

Classify reviews using logistic regression
• Classify business category of Yelp reviews
• Bag-of-words vs. L2 normalization vs. tf-idf
• Model: logistic regression
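One way such a comparison might be wired up with scikit-learn; the reviews, labels, and cross-validation setup below are stand-ins, not the actual experiment:

    # Sketch: bag-of-words vs. L2-normalized counts vs. tf-idf, each feeding logistic regression.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    texts = ["great sushi and friendly staff", "terrible pizza, slow service",
             "my puppy loved the groomer", "the vet was gentle with our cat"]
    labels = ["restaurant", "restaurant", "pet services", "pet services"]

    variants = {
        "bag-of-words": make_pipeline(CountVectorizer(),
                                      LogisticRegression(max_iter=1000)),
        # Normalizer rescales each document vector to unit L2 norm.
        "l2-normalized": make_pipeline(CountVectorizer(), Normalizer(norm="l2"),
                                       LogisticRegression(max_iter=1000)),
        "tf-idf": make_pipeline(CountVectorizer(), TfidfTransformer(),
                                LogisticRegression(max_iter=1000)),
    }

    for name, model in variants.items():
        scores = cross_val_score(model, texts, labels, cv=2)
        print(name, scores.mean())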

31

Observations
• L2 regularization made no difference (with proper tuning)
• L2 normalization made no difference on accuracy
• Tf-idf did better, but barely
• But they are both column scaling methods! Why the difference?

A Peek Under the Hood

33

Linear classification

(Plot: two classes of points in the Feature 1 vs. Feature 2 plane, separated by a line.)

Find the best line to separate two classes.

Algebraically: solve linear systems

Data matrix × weight vector = labels
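Reading that picture as an equation, the data matrix times the weight vector should approximate the labels. A minimal numpy sketch of that least-squares view, with toy numbers:

    # Sketch: the linear view of classification as a system X @ w ~= y.
    import numpy as np

    X = np.array([[1., 2.],     # data matrix: one row per example, one column per feature
                  [2., 1.],
                  [3., 4.],
                  [4., 3.]])
    y = np.array([1., -1., 1., -1.])   # labels

    # Least-squares solve for the weight vector.
    w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    print(w)
    print(singular_values)   # these singular values are what the next slides are about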

How a matrix works

Any matrix = (left singular vectors) × (singular values) × (right singular vectors)

How a matrix works

Any matrix acts in three steps: project, scale, project.

How a matrix works

Null space (singular value = 0): the part of the input space that is squashed by the matrix.

Column space (singular value ≠ 0): the non-zero part of the output space.


Effect of column scaling

Scaled columns: singular values change (but zeros stay zero); singular vectors may also change.

42

Effect of column scaling
• Changes the singular values and vectors, but not the rank of the null space or column space
• … unless the scaling factor is zero
  - Could only happen with tf-idf
• L2 scaling improves the condition number (therefore the solver converges faster)
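These claims are easy to check numerically; a sketch with numpy's SVD on a toy matrix:

    # Sketch: column scaling changes the singular values but not the rank,
    # unless a scaling factor is zero (which can happen with tf-idf).
    import numpy as np

    X = np.array([[1., 1., 0.],
                  [1., 0., 1.],
                  [1., 2., 0.],
                  [1., 0., 2.]])

    scales = np.array([0.5, 3.0, 2.0])                 # nonzero column scaling
    X_scaled = X * scales

    print(np.linalg.svd(X, compute_uv=False))          # original singular values
    print(np.linalg.svd(X_scaled, compute_uv=False))   # singular values change ...
    print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_scaled))   # ... the rank does not

    zero_scales = np.array([0.0, 3.0, 2.0])            # e.g. idf = 0 for a word in every document
    print(np.linalg.matrix_rank(X * zero_scales))      # now the rank drops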

43

Mystery resolved
• Tf-idf can emphasize some columns while zeroing out others (the uninformative features)
• L2 normalization makes all features equal in “size”
  - Improves the condition number of the matrix
  - Solver converges faster

44

Take-away points
• Many tricks for feature generation and transformation
• Features interact with models, making their effects difficult to predict
• But so much fun to play with!
• New book coming out: Mastering Feature Engineering
  - More tricks, intuition, analysis

@RainyData