
Applied Machine Learning
Lecture 3: Pre-processing steps and hyperparameter tuning

Selpi (selpi@chalmers.se)

These slides are a further development of Richard Johansson's slides.

January 28, 2020

Overview

Converting features to numerical vectors

Dealing with missing values

Feature Selection

Hyperparameter Tuning

Imbalanced Classes

Review and Closing

the first step: mapping features to numerical vectors

- scikit-learn's learning methods work with features as numbers, not strings

- they can't directly use the feature dicts we have stored in X

- converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)

types of vectorizers

- a DictVectorizer converts from attribute-value dicts

- a CountVectorizer converts from texts (after splitting into words) or lists

- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF weighting (downweighting common words)

what goes on in a DictVectorizer?

- each feature corresponds to one or more columns in the output matrix

- easy case: boolean and numerical features (one column per feature)

- for string features, we reserve one column for each possible value: one-hot encoding
  - that is, we convert to booleans


code example (DictVectorizer)

from sklearn.feature_extraction import DictVectorizer

X = [{'f1':'B', 'f2':'F', 'f3':False, 'f4':7},
     {'f1':'B', 'f2':'M', 'f3':True, 'f4':2},
     {'f1':'O', 'f2':'F', 'f3':False, 'f4':9}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

the result:

[[ 1. 0. 1. 0. 0. 7.]

[ 1. 0. 0. 1. 1. 2.]

[ 0. 1. 1. 0. 0. 9.]]

['f1=B', 'f1=O', 'f2=F', 'f2=M', 'f3', 'f4']



Dealing with missing values

- Remove rows/columns that contain missing values (the easiest, but ...)

- Replace the missing values with some values; the process is called imputation (more strategic)

Imputation of missing values

- Derived data from the same feature (univariate feature imputation)
  - e.g., mean, median, most-frequent

- Derived data from several features

- See scikit-learn's user guide on imputation of missing values
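A minimal univariate-imputation sketch using scikit-learn's SimpleImputer (the toy matrix and the strategy='mean' choice are only illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# univariate imputation: each NaN is replaced by its column's mean
imp = SimpleImputer(strategy='mean')
X_filled = imp.fit_transform(X)
print(X_filled)  # the NaN becomes (1 + 7) / 2 = 4
```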


Challenges with large number of features

- Imagine working with two data sets, one with 500 features, the other with 10. Assume the number of samples is moderate and the same in both data sets.

- What are the potential challenges with more features?

- Finding key information/feature(s) becomes more difficult

- Higher complexity ⇒ bad for generalization

- Training takes longer

- To solve these issues: try a feature selection method


Selecting Features: Brute Force

for every possible set of features S:
    train and evaluate the model based on S
return the set S_max that gave the best result
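The pseudocode above can be sketched in plain Python; here evaluate and toy_score are hypothetical stand-ins for "train and evaluate the model on this feature subset":

```python
from itertools import combinations

def brute_force_select(features, evaluate):
    """Try every non-empty subset of features and keep the best.
    `evaluate` stands for training and evaluating the model on a
    subset; it returns a score (higher is better)."""
    best_score, best_subset = float('-inf'), None
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

# toy scoring function: 'f1' and 'f3' are useful, extra features cost a bit
def toy_score(subset):
    return len(set(subset) & {'f1', 'f3'}) - 0.1 * len(subset)

print(brute_force_select(['f1', 'f2', 'f3'], toy_score))
```

Note that with n features there are 2^n - 1 subsets to try, which is why brute force is only feasible for a handful of features.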

Feature Selection: Overall Ideas

- filter methods: rank individual features by their association with the output, independently of the model

- wrapper methods: use the learning algorithm itself to evaluate candidate feature subsets

- embedded methods: the feature selection is built into the training of the model

[Source: Wikipedia]


Association between a feature and the output (idea)

Feature Ranking Example (document categories)

[Source: Manning & Schütze, Introduction to Information Retrieval]
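Filter-style ranking of this kind is available in scikit-learn as SelectKBest, which scores each feature's association with the output and keeps the k best-ranked ones. A minimal sketch (the chi2 score and the iris data are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# filter method: score each feature's association with y (here: chi2),
# then keep the 2 best-ranked features
sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
print(X.shape, '->', X_new.shape)  # (150, 4) -> (150, 2)
```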

Example of a wrapper method: greedy forward selection

S = empty set
repeat:
    find the feature F that gives the largest improvement when added to S
    if there was an improvement, add F to S
until there was no improvement

- the MLxtend library has an implementation
  - see SequentialFeatureSelector
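Besides the MLxtend implementation, the loop can be sketched in plain Python; evaluate and toy_score below are hypothetical scoring functions standing in for training and validating a model on a subset:

```python
def greedy_forward_select(all_features, evaluate):
    """Greedy forward selection: repeatedly add the single feature
    that improves the score the most; stop when nothing improves."""
    selected, best_score = [], float('-inf')
    remaining = list(all_features)
    while remaining:
        # score every candidate extension of the current set
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # no candidate improved the score
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected

# toy scoring function: 'a' and 'c' are useful, every feature costs a bit
def toy_score(subset):
    return len(set(subset) & {'a', 'c'}) - 0.1 * len(subset)

print(greedy_forward_select(['a', 'b', 'c'], toy_score))
```

Unlike brute force, this evaluates only O(n^2) subsets, at the price of possibly missing the globally best set.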

Example of a wrapper method: backward selection

- See notebook

- Recursive Feature Elimination with Cross-Validation (RFECV)
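A minimal RFECV sketch (the synthetic data set and the choice of LogisticRegression as the base estimator are illustrative, not taken from the lecture's notebook):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFECV: repeatedly drop the weakest feature, and use cross-validation
# to decide how many features to keep
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)  # number of features RFECV decided to keep
print(selector.support_)     # boolean mask over the original 10 features
```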


Hyperparameters vs parameters

Hyperparameters:

- often used to help estimate the model parameters
- specified by the user
- tuned for the problem at hand
- In Random forests: number of trees, number of features considered
- In Neural Networks: learning rate, number of hidden layers

Parameters:

- are needed by the model for making predictions
- the values are estimated from data
- usually are not set by the user
- In Random forests: random seed used
- In Neural Networks: weights


How to tune the hyperparameters?

- Trial and error

- In scikit-learn: http://scikit-learn.org/stable/modules/grid_search.html
  - GridSearchCV
  - RandomizedSearchCV
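A minimal GridSearchCV sketch (the grid values and the decision-tree estimator are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# GridSearchCV: try every combination in the grid, evaluate each with
# 5-fold cross-validation, then refit on the whole set with the best one
param_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV has the same interface but samples a fixed number of settings (n_iter) instead of trying every combination, which is usually cheaper when the grid is large.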


"Finding a needle in a haystack" problems

- Finding rare diseases

- Finding anomalies in travel patterns

- Differentiating crash vs non-crash situations

- ...

[source]

Evaluation of imbalanced cases

def has_disease(patient):
    return False

- a DummyClassifier will have a high accuracy

- better to use precision and recall (more later)
  - or sensitivity/specificity, or FPR/FNR, ...
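To make the point concrete, here is a toy version of the situation above (the 95/5 split and the dummy features are made up):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# toy imbalanced data: 95 healthy patients, 5 with the disease
X = np.zeros((100, 1))  # the features don't matter for this baseline
y = np.array([0] * 95 + [1] * 5)

# always predict the majority class, like has_disease above
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X, y)
print(clf.score(X, y))                  # accuracy 0.95 -- looks great...
print(recall_score(y, clf.predict(X)))  # ...but recall for the sick is 0.0
```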


Dealing with class imbalance

- Change parameter(s) of the learning algorithm

- Change the data

- Use appropriate performance metric(s)


Change the parameter(s) of the learning algorithm

- Example in scikit-learn: LinearSVC
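For instance, LinearSVC's class_weight parameter can be set to 'balanced'; the toy two-class data below is made up for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# 95 majority points around (0, 0), 5 minority points around (3, 3)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight='balanced' scales the misclassification penalty
# inversely to class frequency, so rare-class errors cost more
clf = LinearSVC(class_weight='balanced', dual=False)
clf.fit(X, y)
print((clf.predict(X) == 1).sum())  # the rare class actually gets predicted
```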

Changing the data

[source]

- Or generate synthetic data using the existing data
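A naive sketch of the resampling idea in plain NumPy (random_oversample is a hypothetical helper written for this slide, not a library function):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Naive 'change the data' sketch: duplicate randomly chosen
    minority-class rows until both classes are equally frequent
    (binary labels assumed)."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    # sample minority rows with replacement and append them
    idx = rng.choice(np.where(y == minority)[0], size=n_extra, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

X = np.arange(10).reshape(10, 1).astype(float)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [8 8]
```

Methods like SMOTE instead interpolate between minority points to create genuinely new synthetic examples, which is the "generate synthetic data" idea above; imbalanced-learn implements both families.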


The imbalanced-learn library

- https://imbalanced-learn.readthedocs.io/

- See for example: BalancedRandomForestClassifier


Review of pre-processing and hyperparameter tuning

- Explain different ways of selecting features

- Explain pros/cons of GridSearchCV and RandomizedSearchCV

- Explain different ways of dealing with imbalanced classes

Next lecture

- Linear classifiers and regression models

- (Methods for) data collection and bias

- Annotating data