Summary

Summary of all analytical models.

Page 1: Summary

Summary

Page 2: Summary

Technique Ordinary linear regression

Application Predicting a numeric quantity. The independent variables can be numeric as well as categorical.

Alternative Support vector regression and neural networks

Assumption Normality, independence (i.i.d.) and homoscedasticity of the residuals. If an assumption is violated, the Box-Cox method can be used to transform the variable.

Limitations The relationship between the variables should be linear; otherwise SVM or a neural network can be used. Multicollinearity among the independent variables needs to be checked.

Accuracy measure R-squared and adjusted R-squared can be used

Key – terms Ordinary linear regression, residuals, transformation of variables, multicollinearity, outliers and influential points, Box-Cox, normal distribution, PCA

Sources to study NPTEL notes by Prof Shalabh from IIT Kanpur (theory); Regression Models course on Coursera by Johns Hopkins (implementation in R)
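As a rough illustration of the accuracy measures on this slide, the sketch below fits a one-predictor ordinary least squares line and computes R-squared and adjusted R-squared. The data and the function names (`fit_simple_ols`, `r_squared_scores`) are made up for illustration and are not from the slides.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared_scores(x, y, intercept, slope, n_predictors=1):
    """Return (R-squared, adjusted R-squared) for the fitted line."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2


# Made-up illustrative data (assumption, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_ols(x, y)
r2, adj_r2 = r_squared_scores(x, y, intercept, slope)
print(intercept, slope, r2, adj_r2)
```

Adjusted R-squared is never larger than R-squared, which is why it is the safer measure when comparing models with different numbers of predictors.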

Page 3: Summary

Technique Logistic Regression

Application Predicting a categorical variable. The independent variables can be numeric as well as categorical.

Alternative Support vector machine, neural network, decision tree, naïve Bayes

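Assumption Independent observations and a linear relationship between the predictors and the log-odds of the outcome. Unlike ordinary linear regression, normality and homoscedasticity of the residuals are not required; a predictor can still be transformed if its relationship with the log-odds is not linear.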

Limitations The decision boundary is linear in the predictors; otherwise SVM or a neural network can be used. Multicollinearity among the independent variables needs to be checked.

Accuracy measure A confusion matrix can be used when the model is used as a classifier

Key – terms Logistic regression as a classifier, residuals, transformation of variables, multicollinearity, outliers and influential points, Box-Cox, normal distribution, PCA

Sources to study NPTEL notes by Prof Shalabh from IIT Kanpur (theory); Regression Models course on Coursera by Johns Hopkins (implementation in R)
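A minimal sketch of how a logistic model is turned into a classifier and scored with a confusion matrix. The coefficients (`intercept = -3.0`, `slope = 1.0`), the 0.5 threshold and the data are hypothetical values chosen for illustration, not fitted values from the slides.

```python
import math


def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical coefficients and data (assumptions for illustration only).
intercept, slope = -3.0, 1.0
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_true = [0, 0, 1, 0, 1, 1]

y_prob = [sigmoid(intercept + slope * xi) for xi in x]
cm = confusion_matrix(y_true, y_prob)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
print(cm, accuracy)
```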

Page 4: Summary

Technique Poisson Regression

Application Predicting the number of events in a given time period (an event can be, for example, the failure of a machine)

Alternative -

Assumption The probability of an event occurring should be very small (rare events)

Limitations Cannot predict when the event will occur

Accuracy measure AIC, BIC

Key – terms Generalized linear models, Poisson distribution, link function

Sources to study NPTEL notes by Prof Shalabh from IIT Kanpur (theory); Regression Models course on Coursera by Johns Hopkins (implementation in R)
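The key terms above can be made concrete with a small sketch: a log link maps the linear predictor to an expected count, the Poisson distribution gives the probability of each count, and AIC and BIC are computed from the log-likelihood. The coefficients and counts are hypothetical; no model is actually fitted here.

```python
import math


def expected_count(intercept, slope, x):
    """Log link: log(mu) = intercept + slope * x, so mu = exp(...)."""
    return math.exp(intercept + slope * x)


def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def poisson_log_likelihood(counts, mus):
    """Log-likelihood of observed counts given their expected values."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1)
               for k, mu in zip(counts, mus))


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical coefficients and failure counts (assumptions, not fitted).
intercept, slope = 0.1, 0.3
machine_age = [1.0, 2.0, 3.0, 4.0]
failures = [1, 2, 3, 4]

mus = [expected_count(intercept, slope, a) for a in machine_age]
loglik = poisson_log_likelihood(failures, mus)
print(aic(loglik, n_params=2), bic(loglik, n_params=2, n_obs=len(failures)))
```

Lower AIC or BIC indicates the better of two candidate models fitted to the same data.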

Page 5: Summary

Technique Analysis of variance ( ANOVA )

Application To determine the significance of an independent categorical variable for a numeric output.

Alternative Neural network

Assumption Normality, independence (i.i.d.) and homoscedasticity within the groups. The Kruskal-Wallis test can be used if the assumptions are violated.

Limitations The independent variables should only be categorical.

Accuracy measure F statistic

Key – terms Design of experiments, levels in an experiment, Tukey HSD

Sources to study Kutner, Applied Linear Statistical Models (book); R code from http://mgmt.iisc.ernet.in/CM/MG221/Handouts.html
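A short sketch of the one-way ANOVA F statistic: the between-group mean square divided by the within-group mean square. The three groups are made-up numbers for illustration.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over the given groups."""
    k = len(groups)                          # number of levels
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up measurements for three levels of one categorical factor.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F value suggests that at least one group mean differs; a follow-up such as Tukey HSD is then used to find which pairs differ.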

Page 6: Summary

Technique Association rules & Association Sequence

Application Used in the retail industry to find associations between products.

Alternative Decision tree, naïve Bayes

Assumption No assumptions

Limitations All variables should be categorical. Numeric variables have to be divided into categories first.

Accuracy measure Support, confidence and lift

Key – terms Association sequence, support, confidence, lift, market basket analysis

Sources to study http://www-users.cs.umn.edu/~kumar/dmbook/index.php (theory); R code is provided during training
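Support, confidence and lift can be computed directly from a list of transactions. The sketch below uses a made-up set of five baskets; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline support of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Made-up market baskets (assumption for illustration).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
rule_lift = lift(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 means the two products are bought together more often than would be expected if they were independent; below 1 means less often.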

Page 7: Summary

Technique Clustering

Application Used to cluster similar data points together.

Alternative -

Assumption No assumptions

Limitations If the dataset is large, hierarchical clustering cannot be used and finding the number of clusters is difficult. Due to redundant variables, clear clusters are not visible in large data.

Accuracy measure Tuning parameter and within-cluster variation (within-cluster sum of squares)

Key – terms K-means, k-medoids, hierarchical, sparse clustering, PCA, feature selection, knee point, sum of squares

Sources to study Machine Learning course by Stanford University, Prof Andrew Ng; http://www-users.cs.umn.edu/~kumar/dmbook/index.php
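The knee point mentioned in the key terms is usually found by running k-means for several values of k and watching the within-cluster sum of squares. The sketch below is a bare-bones k-means with a deterministic, evenly spaced initialization, which is an assumption made here for reproducibility; real implementations typically use random or k-means++ starts. The data is made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, n_iter=20):
    """Return (centers, within-cluster sum of squares) after n_iter rounds."""
    n = len(points)
    # Assumption: evenly spaced starting centers, for reproducibility only.
    centers = [points[i * n // k] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    wss = sum(squared_distance(p, centers[i])
              for i, c in enumerate(clusters) for p in c)
    return centers, wss


# Made-up 2-D points forming three visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 0.9), (8.9, 1.2)]
wss_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
print(wss_by_k)
```

The value of k after which the sum of squares stops dropping sharply is the knee point.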

Page 8: Summary

Technique Decision Tree ( classifier )

Application Classifier

Alternative Naïve Bayes, logistic regression, SVM, neural network

Assumption -

Limitations Interpretation is difficult when the tree is big. Each split tests a single variable, so boundaries that depend on several variables at once need a large tree. Numeric variables are split into ranges (categorized) by the decision tree, with some loss of information.

Accuracy measure Confusion matrix, accuracy, KS statistic, area under the ROC curve

Key – terms Entropy, split info, gain ratio

Sources to study http://www-users.cs.umn.edu/~kumar/dmbook/index.php (theory)
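Entropy, split info and gain ratio, the key terms on this slide, can be sketched as follows for a single categorical feature. The tiny weather-style dataset is made up for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the split divided by its split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)

    # Entropy remaining after splitting on the feature.
    remaining = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remaining

    # Split info is the entropy of the feature itself; it penalizes
    # features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Made-up data: the feature separates the two classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
ratio = gain_ratio(outlook, play)
print(ratio)
```

At each node the tree picks the feature with the highest gain ratio.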

Page 9: Summary

Technique Naïve Bayes ( classifier )

Application Classifier

Alternative Decision tree, logistic regression, SVM, neural network

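Assumption All variables must be categorical and conditionally independent of each other given the class

Limitations The data should be linearly separable. All variables must be independent of each other given the class. Only categorical variables can be used directly; numeric variables have to be divided into categories first.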

Accuracy measure Confusion matrix, accuracy, KS statistic, area under the ROC curve

Key – terms Bayes' theorem

Sources to study http://www-users.cs.umn.edu/~kumar/dmbook/index.php (theory)
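A minimal sketch of a categorical naïve Bayes classifier built directly on Bayes' theorem and the independence assumption. Add-one (Laplace) smoothing is an addition made here so that unseen values do not give zero probability; the data and function names are made up.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> counts
    feature_values = defaultdict(set)     # feature index -> seen values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values


def predict_proba(model, row):
    """Posterior P(class | row) via Bayes' theorem with add-one smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total             # prior P(class)
        for j, value in enumerate(row):
            # Independence assumption: multiply per-feature likelihoods.
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(feature_values[j])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Made-up training data: (outlook, temperature) -> play.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
posterior = predict_proba(model, ("rain", "cool"))
print(posterior)
```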

Page 10: Summary

Technique SVM and Neural network ( classifier )

Application Classifier (non-linear)

Alternative -

Assumption -

Limitations Interpretation is very difficult. Training is computationally expensive and time-consuming.

Accuracy measure Confusion matrix, accuracy, KS statistic, area under the ROC curve

Key – terms Kernel trick, hidden layers, non-linear classifier, pattern recognition

Sources to study Machine Learning course by Stanford University, Prof Andrew Ng; http://www-users.cs.umn.edu/~kumar/dmbook/index.php
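The KS statistic and the area under the ROC curve, listed as accuracy measures for the classifiers above, can both be computed from the classifier's scores alone. The sketch below uses the pairwise-comparison definition of AUC; the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a positive scores higher than a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between the score CDFs of the two classes."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Made-up labels and classifier scores (assumption for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
ks = ks_statistic(y_true, scores)
print(auc, ks)
```

Both measures depend only on how the scores rank the cases, so they apply equally to decision trees, naïve Bayes, SVMs and neural networks.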

Page 11: Summary

Technique Principal component analysis (PCA)

Application Dimensionality reduction for clustering as well as regression

Alternative Feature selection

Assumption -

Limitations Interpretation is difficult.

Accuracy measure

Key – terms PCA, multicollinearity, dimensionality reduction

Sources to study Machine Learning course by Stanford University, Prof Andrew Ng; http://www-users.cs.umn.edu/~kumar/dmbook/index.php
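As a rough sketch of dimensionality reduction, the code below finds the first principal component of a small dataset by power iteration on the covariance matrix and reports the share of variance it explains. Power iteration is chosen here only to keep the example dependency-free; libraries normally use an eigendecomposition or SVD. The data is made up.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1)
             for b in range(dim)] for a in range(dim)]


def first_principal_component(cov, n_iter=200):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(dim))
                     for a in range(dim))
    return v, eigenvalue


# Made-up data: two strongly correlated (multicollinear) variables.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
cov = covariance_matrix(rows)
component, variance = first_principal_component(cov)
total_variance = sum(cov[j][j] for j in range(len(cov)))
explained_ratio = variance / total_variance
print(component, explained_ratio)
```

When one component explains almost all the variance, the two original variables can be replaced by that single component.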

Page 12: Summary

Technique Time series

Application Autoregression. Used only for time series data.

Alternative Markov model if sudden jumps are present

Assumption Stationarity, and the regression assumptions for the residuals

Limitations Cyclic components and sudden jumps have to be taken into account

Accuracy measure AIC and BIC

Key – terms ARIMA, forecasting

Sources to study NPTEL videos by an IISc professor from the Civil Engineering department; website otext.com
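A minimal sketch of autoregression: an AR(1) model regresses each value on the previous one, and forecasts are produced by applying the fitted equation repeatedly. This is only the simplest member of the ARIMA family, and the series is made up (it follows x[t+1] = 1 + 0.5 * x[t] exactly, so the fit recovers those values).

```python
def fit_ar1(series):
    """Least-squares fit of series[t+1] = intercept + phi * series[t]."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    phi = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
           / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - phi * mean_x
    return intercept, phi


def forecast(last_value, intercept, phi, steps):
    """Iterate the fitted equation forward for the given number of steps."""
    predictions = []
    value = last_value
    for _ in range(steps):
        value = intercept + phi * value
        predictions.append(value)
    return predictions


# Made-up series (assumption for illustration).
series = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
intercept, phi = fit_ar1(series)
next_values = forecast(series[-1], intercept, phi, steps=3)
print(intercept, phi, next_values)
```

A fitted coefficient with absolute value below 1 is consistent with the stationarity assumption.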

Page 13: Summary

Markov models: for predicting sudden jumps in stock market data or in time series data from other domains.

Survival analysis: for predicting when a machine is going to fail. Sometimes the data is censored, for example machine-failure data in which some machines had not yet failed when the data was collected.

Data envelopment analysis: to measure the performance difference between various units or teams based on multiple factors.
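To show how censored failure data can still be used, here is a sketch of the Kaplan-Meier estimator, one common survival analysis method (the slides do not name a specific method, so this choice is an assumption). Machines that had not failed when data collection stopped are marked as not observed. The durations are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each observed failure time.

    observed[i] is True if machine i failed at durations[i], and False if
    it was still running then (censored).
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = sum(1 for d, o in data if d == t and o)
        leaving = sum(1 for d, o in data if d == t)
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        # Failed and censored machines both leave the risk set.
        at_risk -= leaving
        i += leaving
    return curve


# Made-up data: months in service and whether a failure was observed.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
print(curve)
```

Censored machines still count as "at risk" up to the time they were last seen, which is how the estimator avoids throwing their information away.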

Page 14: Summary

Further readings suggested
• Transformation of variables in linear modelling (Box-Cox method)
• What to do when the assumptions are violated (Box-Cox)
• Adding non-linear terms to your model
• Association sequence rules
• Sparse clustering for feature selection in clustering (a special clustering method)
• Naïve Bayes classifier (used in text analytics to classify tweets, mails, documents, etc.)
• K-NN classifier, which is usually used when both clustering and classification are required
• How to include interaction terms to improve model performance
• Poisson regression for predicting the failure of machines
• Generalized linear regression modelling
• Linear discriminant analysis
• Boosting, bagging and other methods for improving classifiers
• Random forests for classification
• Topic modelling
• Sentiment analysis
• Support vector regression for non-linear regression

Page 15: Summary

Thank You