Missing Values


How to handle missing values

Transcript of Missing Values

Page 1: Missing Values

Missing values

Data

• Numeric: Mean, Median

• Categorical: Mode

Less outliers, large dataset
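The slide above can be sketched in base R as follows. This is a minimal illustration, not the author's code: the data frame, its column names (age, city) and the helper mode_value are hypothetical. The median variant is included because the median is less affected by outliers than the mean.

```r
# Hypothetical toy data; column names are illustrative only.
df <- data.frame(
  age  = c(25, NA, 40, 35, NA),
  city = c("Pune", "Delhi", NA, "Delhi", "Delhi"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector (hypothetical helper).
mode_value <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Numeric column: fill gaps with the mean, or with the median instead.
df$age_mean   <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)
df$age_median <- ifelse(is.na(df$age), median(df$age, na.rm = TRUE), df$age)

# Categorical column: fill gaps with the mode.
df$city[is.na(df$city)] <- mode_value(df$city)
```

Keeping the imputed values in separate columns (age_mean, age_median) makes it easy to compare the two choices before overwriting the original column.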

Page 2: Missing Values

Missing value through prediction

Missing variable | Associated variable | Type of technique | Remarks / Assumptions

Categorical | Categorical | Decision tree; Naïve Bayes | Decision tree needs no assumption; Naïve Bayes assumes independent variables

Categorical | Numeric | Logistic regression; K-NN classifier | K-NN classifier needs no assumption; regression assumes normality, homoscedasticity, etc.

Numeric | Numeric | Regression model; Clustering | Clustering needs no assumption; regression assumes normality, homoscedasticity, etc.

Numeric | Categorical | Clustering | No assumption

Categorical | Both | Decision tree; Multinomial regression | No assumption for decision tree; regression assumptions apply

Numeric | Both | Clustering | No assumption
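As one illustration of the numeric/numeric case in the table, the sketch below imputes a numeric variable from another numeric variable with a linear regression in base R. The data and the column names (experience, income) are made up for the example, and the usual regression assumptions still apply.

```r
# Hypothetical data: 'income' has gaps, 'experience' is complete.
df <- data.frame(
  experience = c(1, 2, 3, 4, 5, 6),
  income     = c(30, 35, NA, 45, 50, NA)
)

# lm() drops rows with a missing response by default,
# so the model is fitted on the complete rows only.
fit <- lm(income ~ experience, data = df)

# Predict only for the rows where income is missing.
missing <- is.na(df$income)
df$income[missing] <- predict(fit, newdata = df[missing, , drop = FALSE])
```

The same pattern carries over to the other rows of the table by swapping lm() for a classifier when the missing variable is categorical.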

Page 3: Missing Values

K-NN Classifier

3-NN classifier

Page 4: Missing Values

K-NN Classifier

Page 5: Missing Values

K-NN Classifier

Page 6: Missing Values

K-NN Classifier

If k is too small, the classifier is sensitive to noise points

If k is too large, the neighborhood may include points from other classes

Page 7: Missing Values

K-NN Classifier

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

The K-NN classifier is a lazy learner because it does not build a model explicitly
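The two points above (scaling and lazy learning) can be seen in a small base R sketch. This is an assumption-laden illustration rather than the code behind the slides: the function knn_predict and the toy data are hypothetical. There is no training step; all the work happens at prediction time, and attributes are standardised so the second attribute, which has a much larger range, does not dominate the distance.

```r
# Minimal k-NN classifier in base R (hypothetical helper, for illustration).
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  # Standardise each attribute with the training mean and standard deviation.
  centre <- colMeans(train_x)
  spread <- apply(train_x, 2, sd)
  train_s <- scale(train_x, center = centre, scale = spread)
  test_s  <- scale(test_x,  center = centre, scale = spread)

  apply(test_s, 1, function(point) {
    # Euclidean distance from this test point to every training row.
    d <- sqrt(rowSums(sweep(train_s, 2, point)^2))
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]     # majority vote among the k neighbours
  })
}

# Toy data: the second attribute is on a much larger scale than the first.
train_x <- matrix(c(1,   1000,
                    2,   1100,
                    1.5,  900,
                    8,   1050,
                    9,    950,
                    8.5, 1000), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "B", "B", "B")
test_x  <- matrix(c(1.8, 1000,
                    8.2, 1000), ncol = 2, byrow = TRUE)

pred <- knn_predict(train_x, train_y, test_x, k = 3)
```

Calling knn_predict with several values of k on held-out data is a simple way to compare choices of k.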

Page 8: Missing Values

Testing with different k

Page 9: Missing Values

Naïve Bayesian Classifier

P(A|B) = P(B|A) * P(A) / P(B) (Bayes' theorem)

P(spam|free) = P(free|spam) * P(spam) / P(free)

Since P(spam|free) > P(ham|free), a message containing this word is classified as spam
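A short worked example in R makes the comparison concrete. All probabilities below are invented for illustration and do not come from the slides; only the formula is Bayes' theorem as stated above.

```r
# Hypothetical probabilities, chosen only for illustration.
p_spam <- 0.20
p_ham  <- 0.80
p_free_given_spam <- 0.30   # P(free | spam)
p_free_given_ham  <- 0.02   # P(free | ham)

# Total probability of seeing the word "free" in any message.
p_free <- p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem for each class.
p_spam_given_free <- p_free_given_spam * p_spam / p_free
p_ham_given_free  <- p_free_given_ham  * p_ham  / p_free

# Pick the class with the larger posterior probability.
label <- if (p_spam_given_free > p_ham_given_free) "spam" else "ham"
```

Because both posteriors share the denominator P(free), comparing the numerators alone gives the same decision.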

Page 10: Missing Values

Step 4 : Applying the classifier

If the output of equation 1 is greater than that of equation 2, the message is classified as spam; otherwise it is classified as ham


How it works

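# Load e1071, which provides naiveBayes()
library(e1071)
# Train the classifier on the training features and their spam/ham labels
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)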