
Applied Machine Learning
Lecture 3: Pre-processing steps and hyperparameter tuning

Selpi (selpi@chalmers.se)

These slides are a further development of Richard Johansson's slides.

January 28, 2020

Overview

Converting features to numerical vectors

Dealing with missing values

Feature Selection

Hyperparameter Tuning

Imbalanced Classes

Review and Closing

the first step: mapping features to numerical vectors

- scikit-learn's learning methods work with features as numbers, not strings

- they can't directly use the feature dicts we have stored in X

- converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)

types of vectorizers

- a DictVectorizer converts from attribute-value dicts

- a CountVectorizer converts from texts (after splitting into words) or lists

- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF weighting (downweighting common words)

what goes on in a DictVectorizer?

- each feature corresponds to one or more columns in the output matrix

- easy case: boolean and numerical features (one column per feature)

- for string features, we reserve one column for each possible value: one-hot encoding
  - that is, we convert to booleans


code example (DictVectorizer)

from sklearn.feature_extraction import DictVectorizer

X = [{'f1':'B', 'f2':'F', 'f3':False, 'f4':7},
     {'f1':'B', 'f2':'M', 'f3':True, 'f4':2},
     {'f1':'O', 'f2':'F', 'f3':False, 'f4':9}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

the result:

[[ 1. 0. 1. 0. 0. 7.]

[ 1. 0. 0. 1. 1. 2.]

[ 0. 1. 1. 0. 0. 9.]]

['f1=B', 'f1=O', 'f2=F', 'f2=M', 'f3', 'f4']



Dealing with missing values

- Remove rows/columns that contain missing values (the easiest, but ...)

- Replace the missing values with some values; the process is called imputation (more strategic)

Imputation of missing values

- Derived data from the same feature (univariate feature imputation)
  - e.g., mean, median, most-frequent

- Derived data from several features

- See scikit-learn's user guide on imputation of missing values
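A minimal univariate-imputation sketch using scikit-learn's SimpleImputer (the toy matrix and the strategy='mean' choice are only illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# univariate imputation: each NaN is replaced by its column's mean
imp = SimpleImputer(strategy='mean')
X_filled = imp.fit_transform(X)
print(X_filled)  # the NaN becomes (1 + 7) / 2 = 4
```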


Challenges with large number of features

- Imagine working with two data sets, one with 500 features, the other with 10. Assume the number of samples is moderate and the same in both data sets.

- What are the potential challenges with more features?

- Finding key information/feature(s) becomes more difficult

- Higher complexity ⇒ bad for generalization

- Training takes longer

- To solve these issues: try a feature selection method


Selecting Features: Brute Force

for every possible set of features S:
    train and evaluate the model based on S
return the set S_max that gave the best result
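The pseudocode above can be sketched in plain Python; here evaluate and toy_score are hypothetical stand-ins for "train and evaluate the model on this feature subset":

```python
from itertools import combinations

def brute_force_select(features, evaluate):
    """Try every non-empty subset of features and keep the best.
    `evaluate` stands for training and evaluating the model on a
    subset; it returns a score (higher is better)."""
    best_score, best_subset = float('-inf'), None
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

# toy scoring function: 'f1' and 'f3' are useful, extra features cost a bit
def toy_score(subset):
    return len(set(subset) & {'f1', 'f3'}) - 0.1 * len(subset)

print(brute_force_select(['f1', 'f2', 'f3'], toy_score))
```

Note that with n features there are 2^n - 1 subsets to try, which is why brute force is only feasible for a handful of features.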

Feature Selection: Overall Ideas

- filter methods: rank individual features by their association with the output, independently of the model

- wrapper methods: use the learning algorithm itself to evaluate candidate feature subsets

- embedded methods: the feature selection is built into the training of the model

[Source: Wikipedia]


Association between a feature and the output (idea)

Feature Ranking Example (document categories)

[Source: Manning & Schütze, Introduction to Information Retrieval]
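Filter-style ranking of this kind is available in scikit-learn as SelectKBest, which scores each feature's association with the output and keeps the k best-ranked ones. A minimal sketch (the chi2 score and the iris data are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# filter method: score each feature's association with y (here: chi2),
# then keep the 2 best-ranked features
sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
print(X.shape, '->', X_new.shape)  # (150, 4) -> (150, 2)
```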

Example of a wrapper method: greedy forward selection

S = empty set
repeat:
    find the feature F that gives the largest improvement when added to S
    if there was an improvement, add F to S
until there was no improvement

- the MLxtend library has an implementation
  - see SequentialFeatureSelector
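Besides the MLxtend implementation, the loop can be sketched in plain Python; evaluate and toy_score below are hypothetical scoring functions standing in for training and validating a model on a subset:

```python
def greedy_forward_select(all_features, evaluate):
    """Greedy forward selection: repeatedly add the single feature
    that improves the score the most; stop when nothing improves."""
    selected, best_score = [], float('-inf')
    remaining = list(all_features)
    while remaining:
        # score every candidate extension of the current set
        score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # no candidate improved the score
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected

# toy scoring function: 'a' and 'c' are useful, every feature costs a bit
def toy_score(subset):
    return len(set(subset) & {'a', 'c'}) - 0.1 * len(subset)

print(greedy_forward_select(['a', 'b', 'c'], toy_score))
```

Unlike brute force, this evaluates only O(n^2) subsets, at the price of possibly missing the globally best set.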

Example of a wrapper method: backward selection

- See notebook

- Recursive Feature Elimination with Cross-Validation (RFECV)
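A minimal RFECV sketch (the synthetic data set and the choice of LogisticRegression as the base estimator are illustrative, not taken from the lecture's notebook):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFECV: repeatedly drop the weakest feature, and use cross-validation
# to decide how many features to keep
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)  # number of features RFECV decided to keep
print(selector.support_)     # boolean mask over the original 10 features
```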


Hyperparameters vs parameters

Hyperparameters:

- often used to help estimate the model parameters
- specified by the user
- tuned for the problem at hand
- In Random forests: number of trees, number of features considered
- In Neural Networks: learning rate, number of hidden layers

Parameters:

- are needed by the model for making predictions
- the values are estimated from data
- usually are not set by the user
- In Random forests: random seed used
- In Neural Networks: weights


How to tune the hyperparameters?

- Trial and error

- In scikit-learn: http://scikit-learn.org/stable/modules/grid_search.html
  - GridSearchCV
  - RandomizedSearchCV
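A minimal GridSearchCV sketch (the grid values and the decision-tree estimator are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# GridSearchCV: try every combination in the grid, evaluate each with
# 5-fold cross-validation, then refit on the whole set with the best one
param_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV has the same interface but samples a fixed number of settings (n_iter) instead of trying every combination, which is usually cheaper when the grid is large.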


"Finding a needle in a haystack" problems

- Finding rare diseases

- Finding anomalies in travel patterns

- Differentiating crash vs non-crash situations

- ...

[source]

Evaluation of imbalanced cases

def has_disease(patient):
    return False

- a DummyClassifier will have a high accuracy

- better to use precision and recall (more later)
  - or sensitivity/specificity, or FPR/FNR, ...
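To make the point concrete, here is a toy version of the situation above (the 95/5 split and the dummy features are made up):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# toy imbalanced data: 95 healthy patients, 5 with the disease
X = np.zeros((100, 1))  # the features don't matter for this baseline
y = np.array([0] * 95 + [1] * 5)

# always predict the majority class, like has_disease above
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X, y)
print(clf.score(X, y))                  # accuracy 0.95 -- looks great...
print(recall_score(y, clf.predict(X)))  # ...but recall for the sick is 0.0
```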


Dealing with class imbalance

- Change parameter(s) of the learning algorithm

- Change the data

- Use appropriate performance metric(s)


Change the parameter(s) of the learning algorithm

- Example in scikit-learn: LinearSVC
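For instance, LinearSVC's class_weight parameter can be set to 'balanced'; the toy two-class data below is made up for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# 95 majority points around (0, 0), 5 minority points around (3, 3)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight='balanced' scales the misclassification penalty
# inversely to class frequency, so rare-class errors cost more
clf = LinearSVC(class_weight='balanced', dual=False)
clf.fit(X, y)
print((clf.predict(X) == 1).sum())  # the rare class actually gets predicted
```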

Changing the data

[source]

- Or generate synthetic data using the existing data
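A naive sketch of the resampling idea in plain NumPy (random_oversample is a hypothetical helper written for this slide, not a library function):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Naive 'change the data' sketch: duplicate randomly chosen
    minority-class rows until both classes are equally frequent
    (binary labels assumed)."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    # sample minority rows with replacement and append them
    idx = rng.choice(np.where(y == minority)[0], size=n_extra, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

X = np.arange(10).reshape(10, 1).astype(float)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [8 8]
```

Methods like SMOTE instead interpolate between minority points to create genuinely new synthetic examples, which is the "generate synthetic data" idea above; imbalanced-learn implements both families.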


The imbalanced-learn library

- https://imbalanced-learn.readthedocs.io/

- See for example: BalancedRandomForestClassifier


Review of pre-processing and hyperparameter tuning

- Explain different ways of selecting features

- Explain pros/cons of GridSearchCV and RandomizedSearchCV

- Explain different ways of dealing with imbalanced classes

Next lecture

- Linear classifiers and regression models

- (Methods for) data collection and bias

- Annotating data