The Titanic:
Machine Learning
from Disaster
Data Mining and Machine Learning. Winter 2014. Final Project
Jean Callao | Michelle Darling | Paul Marxhausen
AGENDA
In depth analysis: by Jean Callao
• Logistic Regression: glm
• Tree-based methods: rpart, ctree
In depth analysis: by Paul Marxhausen
• Ensemble Methods: randomForest, cforest
Summary: by Michelle Darling
• Data Visualization
• Machine Learning Kaggle Results
Titanic: Machine Learning from Disaster
Why we picked this project:
● Historical context to understand "What does the data mean?"
● Learn one data set well, and then apply different algorithms and modelling tools.
● Practice the steps of data analysis:
○ Data exploration and visualization.
○ Model selection, building and testing.
● Prize: $0 + "knowledge & confidence" to go on to more challenging data science problems.
kaggle.com provides:
Online data science competitions.
Structured problems, tutorials, help forums and discussion groups.
Easy, consistent way to test models and track results.
>>> Focus <<<
April 1912
The Titanic Disaster
RMS Titanic, April 1912
A priori knowledge from problem domain
What factors contributed to survival?
Gender, Age, Passenger Class, Fare, Family
More likely to survive
• Females
• Children, Adults<50
• 1st Class
• Paid higher fares
• Travelling with family
More likely to perish
• Males
• Adults >50
• 2nd, 3rd class
• Paid lower fares
• Travelling alone
• Immigrants
Titanic Dataset: Predictor & Target Variables
Response Variable    DESCRIPTION
Survived             1 = Yes; 0 = No
Predictor Variables  DESCRIPTION
Pclass               Passenger Class (1=1st; 2=2nd; 3=3rd)
Name                 Passenger Name
Sex                  Sex ("male", "female")
Age                  Age (numeric, may be fractional, e.g., 1.5)
Fare                 Passenger Fare
SibSp                Number of Siblings/Spouses Aboard
Parch                Number of Parents/Children Aboard
Ticket               Ticket Number
Cabin                Cabin
Embarked             Port of Embarkation (C=Cherbourg; Q=Queenstown; S=Southampton)
Age, Fare, SibSp and Parch are QUANTITATIVE variables; the rest are QUALITATIVE.
Feature Engineering
Data relating to one's location on the ship
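library(stringr)  # provides str_sub()
# Last character of the cabin number; missing or non-numeric values become NA
data$cabin.last.digit <- suppressWarnings(as.numeric(str_sub(data$Cabin, -1)))
data$Side <- "Unknown"
data$Side[which(data$cabin.last.digit %% 2 == 0)] <- "port"       # even cabin numbers
data$Side[which(data$cabin.last.digit %% 2 == 1)] <- "starboard"  # odd cabin numbers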
Classifying Fares
combi$Fare2 <- '30+'
combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <-'20-30'
combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'
combi$Fare2[combi$Fare < 10] <- '<10'
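The same fare bins can also be produced in one call with base R's cut(). The sketch below is an alternative under stated assumptions, not the code used in the project: the fare values are hypothetical and chosen to sit on the bin boundaries. Unlike the assignments above, cut() returns NA for a missing fare instead of leaving it in the '30+' bin.

```r
# Hypothetical fares chosen to hit every bin boundary
fare <- c(5, 10, 19.99, 20, 29.99, 30, 512.33)

# right = FALSE makes each interval closed on the left: [10, 20), [20, 30), ...
fare2 <- cut(fare,
             breaks = c(-Inf, 10, 20, 30, Inf),
             labels = c("<10", "10-20", "20-30", "30+"),
             right  = FALSE)

print(data.frame(fare, fare2))
```

The result is a factor, which is the type the tree and glm functions expect for a categorical predictor.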
Title - Extract from name to find wealthy passengers:
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'
combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'
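The grouping above presupposes a Title column, but the extraction step is not shown on the slide. One way it could be done is sketched below with a base R regular expression. The regular expression and the variable names are assumptions for illustration, relying on the "Surname, Title. Given names" layout of the Name field.

```r
# A few names in the "Surname, Title. Given names" layout
name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Keep the text between the first comma and the first following period
title <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", name)
print(title)
```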
FamilySize - Combining spouse, siblings and parents
combi$FamilySize <- combi$SibSp + combi$Parch + 1
Decision Trees and
Logistic Regression
Presented by Jean Callao
Decision Trees
• A decision tree is a simple, but
powerful form of multiple variable
analysis. It displays a tree-like
graph of decisions and their
possible consequences.
• Recursive Partitioning-> at each
step, we identify a question that
we use to partition the data.
Advantages:
• Data-driven: Makes no prior
assumptions; selects significant predictors
based on the greatest information gain.
• Flexible: No data pre-processing needed!
Handles numeric and categorical data.
• Easy to interpret and explain to others.
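To make "greatest information gain" concrete, the sketch below computes the entropy-based gain of a single split on Sex, using the sex-by-survival counts for train that appear on the spine plot slide (233 and 81 for females, 109 and 468 for males). It is an illustration only: the helper name entropy is an assumption, and rpart's default splitting criterion for classification is the Gini index rather than entropy.

```r
# Entropy (in bits) of a vector of class counts
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Sex-by-survival counts for train, taken from the spine plot slide
female <- c(survived = 233, perished = 81)
male   <- c(survived = 109, perished = 468)
root   <- female + male

# Weighted average entropy of the two child nodes after splitting on Sex
n <- sum(root)
h_children <- sum(female) / n * entropy(female) + sum(male) / n * entropy(male)

info_gain <- entropy(root) - h_children
cat("Root entropy:", round(entropy(root), 3), "\n")
cat("Information gain from splitting on Sex:", round(info_gain, 3), "\n")
```

A tree builder repeats this calculation for every candidate predictor and cut point and keeps the split with the largest gain.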
Decision Tree with New Variables
tree <- rpart(Survived~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,
data=train, method="class", control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))
fancyRpartPlot(tree)
Prediction <- predict(tree, test, type = "class")
table(Prediction)
Perished Survived
262 156
Decision Tree with New Variables
Root node -> 62% perished, 38% survived
Mr or Noble -> 84% perished, 16% survived
Not a Mr or Noble -> 28% perished, 72% survived
3rd class -> 52% perished, 48% survived
Not 3rd class -> 5% perished, 95% survived
Paid >= $23 -> 91% perished, 9% survived
Paid < $23 -> 38% perished, 62% survived
Age >= 36 yrs -> 86% perished, 14% survived
Age < 36 yrs -> 36% perished, 64% survived
Overfitted rpart Decision Tree
Disadvantages of rpart:
• Can suffer from:
o High Variance
o High Bias
• Decision tree algorithms can result in
overly complex or overfitted trees.
Function ctree() in package party
addresses these weaknesses by providing:
• Unbiased variable selection
• Statistical stopping rules to
optimize tree growth.
Conditional Tree: ctree
train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare +Title + Side,data=train)
plot(train.ctree)
Prediction2 <- predict(train.ctree , newdata=test, type="response")
table(Prediction2)
Perished Survived
256 162
Mr or Noble -> Side -> Port or Starboard:
40% survived, 60% perished
Mr or Noble -> Side -> Unknown:
16% survived, 84% perished
Not a Mr or Noble -> 1st or 2nd Class:
98% survived, 2% perished
Not a Mr or Noble -> 3rd Class -> Paid <= $23.25:
61% survived, 39% perished
Not a Mr or Noble -> 3rd Class -> Paid > $23.25:
14% survived, 86% perished
Conditional Tree: ctree
Logistic Regression
Least squares linear
regression
Predicted probabilities can
be greater than 1 or less
than 0 if used for
classification!
LOGISTIC REGRESSION
• Used for binary
qualitative response.
• Using logit ensures all
probabilities are between
0 and 1 only.
Why use Logistic
Regression?
Allows us to establish a
relationship between a binary
outcome variable and a group
of predictor variables. Can be
used as:
• CLASSIFICATION METHOD:
Classifies binary response (E.g.
Yes/No, Pass/Fail,
Survived/Perished)
• REGRESSION METHOD:
Calculates probability (0.0 to
1.0) of the response.
The “logit” model solves the problem:
Where:
• “p” is the probability that Y
for cases equals 1, p (Y=1).
• “1- p” is the probability that
Y for cases equals 0.
Transformed, the “log odds” are linear.
Log odds (logit) = linear combination:
$$\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X$$
Solving for the odds:
$$\frac{p}{1-p} = e^{B_0 + B_1 X}$$
Probability (logistic function), which produces an S-shaped curve:
$$p = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}}$$
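The logit and the logistic function are inverses of each other, which can be checked numerically. The sketch below is a minimal illustration: the helper names logistic and logit and the input values are assumptions, while plogis() and qlogis() are the equivalent functions that ship with base R.

```r
# Logistic function: maps any log-odds value to a probability in (0, 1)
logistic <- function(z) exp(z) / (1 + exp(z))

# Logit: maps a probability back to log odds
logit <- function(p) log(p / (1 - p))

z <- seq(-6, 6, by = 2)        # hypothetical values of B0 + B1 * X
p <- logistic(z)
print(round(p, 3))

# Base R ships the same pair as plogis() and qlogis()
stopifnot(all.equal(p, plogis(z)), all.equal(logit(p), z))
```

Plotting p against z with plot(z, p, type = "l") shows the S-shaped curve.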
Confirming “women &
children first” policy
Titanic.glm <- glm(Survived~ I(Sex=="female") + Class + I(Age<=10) + Embarked + Fare2,
data = train, family=binomial("logit"))
table(test$Survived)
Perished Survived
252 166
summary(Titanic.glm)
The logistic regression coefficients give the
change in the log odds of the outcome for a
one unit increase in the predictor variable.
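As a small worked example of reading a coefficient, the sketch below fits a one-predictor logistic regression to the sex-by-survival counts for train shown on the spine plot slide. This is not one of the project's models: the data frame toy and its female indicator are constructed here for illustration only.

```r
# Rebuild one row per passenger from the sex-by-survival counts in this deck
toy <- data.frame(
  female   = rep(c(1, 1, 0, 0), times = c(233, 81, 109, 468)),
  Survived = rep(c(1, 0, 1, 0), times = c(233, 81, 109, 468))
)

fit <- glm(Survived ~ female, data = toy, family = binomial("logit"))

# The coefficient is a change in log odds; exponentiate to get an odds ratio
log_odds_change <- unname(coef(fit)["female"])
odds_ratio      <- exp(log_odds_change)

# Predicted survival probability for a female passenger
p_female <- unname(predict(fit, newdata = data.frame(female = 1),
                           type = "response"))

cat("Log-odds change:", round(log_odds_change, 3), "\n")
cat("Odds ratio:", round(odds_ratio, 2), "\n")
cat("P(survive | female):", round(p_female, 3), "\n")
```

With a single binary predictor the fitted odds ratio matches the ratio of the observed odds, (233/81) / (109/468), so the coefficient can be verified by hand.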
Making Predictions
A 10-year-old female has an estimated
survival probability of:
$$p = \frac{e^{12.3958 + 2.6816(1) + 1.6133(10)}}{1 + e^{12.3958 + 2.6816(1) + 1.6133(10)}} = 0.99$$
A 2nd class man who paid 20 dollars for a ticket has an
estimated survival probability of:
$$p = \frac{e^{12.3958 + 2.6816(0) + (-0.9530)(2) + (-0.6531)(20)}}{1 + e^{12.3958 + 2.6816(0) + (-0.9530)(2) + (-0.6531)(20)}} = 0.70$$
Interpreting Coefficients…
summary(Titanic.glm)
Estimate > 0: higher probability of
surviving
Estimate < 0: lower probability of
surviving
Passengers travelling with relatives
have higher chances of survival.
Titanic.glm2<- glm(Survived ~ Class+I(FamilySize>=2) + Parch+I(SibSp>=2),data = train, family=binomial("logit"))
table(test$Survived)
Perished Survived
276 142
summary(Titanic.glm2)
We see that PClass is a strong
predictor supporting the
hypotheses about:
• location on the ship
• lifeboat access.
First class adult males
have lower chances of survival
Titanic.glm3<- glm(Survived ~ Class + I(Title=="Mr")+ I(Title=="Noble") + I(Age>=30 & Age<=50)+I(Fare>=27),data = train, family=binomial("logit"))
table(test$Survived)
Perished Survived
239 179
summary(Titanic.glm3)
"Any data relating to one's location on the ship could
prove helpful to survival predictions…"
First class adult males had
lower chances of survival
summary(Titanic.glm3)
Those in upper decks (1st class) had more
timely, accurate information and shorter
journey to the lifeboats… Yet why did 1st
Class Males have lower survival rates?
Possible explanation:
• 1st Class Males were expected to be
"gentlemen" and perish with the ship.
"No woman shall be left aboard this ship
because Ben Guggenheim was a coward."
• 1st Class Male Survivors were
condemned by society:
> Bruce Ismay – had to resign as
Chairman of White Star Line.
> William Carter – divorced by wife.
Third class adult males had
lower chance of survival
summary(Titanic.glm4)
Those located in the bow or
lower decks (3rd Class) had less
chance of survival.
Titanic.glm4 <- glm(Survived ~ Class + I(Age>=30 & Age<=65) + I(Title=="Mr" & Class=="Third") + I(Fare<=10), data = train, family=binomial("logit"))
table(test$Survived)
Perished Survived
258 160
Ensemble Methods:
randomForest and cforest
Presented by Paul Marxhausen
Random Forests
Advantages:
• Easy to use: can be used quite efficiently
with default parameters.
• Ideal for people without a deep
background in statistics.
• Produces fairly strong predictions with
only a small amount of coding.
• Ensemble (everyday meaning): a group
of actors who perform together.
• An example of an ENSEMBLE
METHOD -- combines multiple
models to produce one result.
• Unlike single decision trees which
can suffer from high variance or
high bias, Random Forests use
random sampling and
averaging to find a natural
balance between the two
extremes.
Random Forests: Randomness logic
built-in and Data pre-processing
Randomness logic used
• Built-in; random rows (bagging)
and columns (mtry) as part of
fitting with training data.
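The two sources of randomness can be sketched in a few lines of base R. This is only an illustration of the sampling steps, not of how the randomForest package is implemented internally. The row count (891, the size of train) and the predictor names are taken from this deck; the variable names boot_rows, oob_rows and split_candidates are assumptions.

```r
set.seed(415)                        # same seed the deck uses; any value works

n <- 891                             # rows in train
predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch",
                "Fare", "Embarked", "Title", "FamilySize", "FamilyID2")
mtry <- floor(sqrt(length(predictors)))

# Bagging: each tree is grown on n rows drawn with replacement
boot_rows <- sample(n, size = n, replace = TRUE)
oob_rows  <- setdiff(seq_len(n), boot_rows)   # "out-of-bag" rows left out

# Feature sampling: each split considers only mtry randomly chosen predictors
split_candidates <- sample(predictors, size = mtry)

cat("Unique rows in bootstrap sample:", length(unique(boot_rows)), "\n")
cat("Out-of-bag rows:", length(oob_rows), "\n")
cat("Candidate predictors at one split:", split_candidates, "\n")
```

Because every tree sees a different bootstrap sample and different candidate predictors, the trees disagree in their errors, and averaging their votes reduces variance.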
Restriction Disadvantages
• Data has to be pre-processed to
remove NAs, NULLs, blanks
• We have to fix Age, Embarked,
Fare and FamilyID to meet
these requirements.
• Factor levels must be <=32 for
FamilyID (the raw variable starts with roughly double that)
DATA PRE-PROCESSING TASKS
USING COMBINED DATA
• Age(263 NA’s)=>rpart/predict
• Embarked (2 blanks) => assign
• Fare (1 NA) => median
• FamilyID (exceeded levels) => re-group (now 22 levels)
# Replace Fare NAs (see example)
which(is.na(combi$Fare))
combi$Fare[1044] <-median(combi$Fare, na.rm=TRUE)
Model: randomForest(…) using
‘randomForest’ package
# Build Random Forest Ensemble
set.seed(415) # two sources of randomness
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2, data=train, importance=TRUE, ntree=2000)
# generate importance graphs
varImpPlot(fit)
# Now let's make a prediction
Prediction <- predict(fit, test)
Kaggle.com score
0.81818 Surprised
cforest(…): type of random forest; implementation using party package
# Build conditional inference tree forest
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train, controls=cforest_unbiased(ntree=4000, mtry=2))
# Now let's make a prediction and write a submission file
Prediction <- predict(fit, test, OOB=TRUE, type = "response")
kaggle score after parameter adjustments
0.81818 Surprised Again !!!
randomForest package vs. party package
randomForest package
• randomForest(…) function
• mtry is floor(sqrt(p)), which is
the number of features to
randomly select at each split.
• randomForest is
computationally faster.
• Popular in applied research
party package
• cforest(…) function
• mtry set to the number 5 by
default for technical reasons
• Resulting forests are unbiased if
the predictor variables are of
different types.
• Ensembles of conditional inference
trees have not yet been extensively
tested, so this routine is meant for
the expert user only and its current
state is rather experimental.
Ensemble Methods: kaggle results
Model                    Description                                             Result    Leader Board
fit <- randomForest(…)   Traditional Random Forest (randomForest package)        0.81818   03/20/14
fit <- cforest(…)        Conditional inference trees (party package)             0.81340   03/20/14
fit <- cforest(…)        Changed ntree from 2000 to 4000, and mtry from 3 to 2   0.81818   03/22/14
Summary:
Data Visualization
Algorithms & kaggle results
by Michelle Darling
www.datastudentblog.wordpress.com
Data Visualization
Summary
1. Created Conceptual Data Model
• to understand denormalized data file.
2. Tried lots of visualizations:
• Categorical vs. Continuous
• Uni-, Bi- and Multivariate
3. Compared datasets:
• Titanic vs. train vs. test ARE similar
4. Created rule-based models using the
most significant predictors:
• Sex == "female"
• Sex == "female" OR Age < 10
• Sex:Child:Fare:FamilySize
Data Visualization prototyping tools:
• MS Excel
• wordle.net
• Google Fusion
• R {rattle} package
PORT: Embarked (S=Southampton, C=Cherbourg, Q=Queenstown)
TICKET: Ticket, Pclass, Cabin
PASSENGER: PassengerID, Name, Age, SibSp, Parch, Fare, Survived
Text Analysis of
Passenger Name
SURVIVORS PERISHED
Word Clouds created in www.wordle.net for Survivors$Name and Perished$Name
Survivors <- train[train$Survived==1,]; Perished <- train[train$Survived==0,]
Sex ("male" vs. "female") is
an important predictor of survival.
Do family members affect survival?
> table(Survived, Parch)
        Parch
Survived   0   1   2   3   4   5   6
       0 445  53  40   2   4   4   1
       1 233  65  40   3   0   1   0
> table(Survived, SibSp)
        SibSp
Survived   0   1   2   3   4   5   8
       0 398  97  15  12  15   5   7
       1 210 112  13   4   3   0   0
Survival is higher for passengers with Parch==3 (60%), or SibSp==1 (54%)
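The percentages quoted here follow directly from the contingency tables. The sketch below re-enters the Parch counts from the table above and uses prop.table() to turn each column into survival rates; the object names parch, parch_rate and sibsp1_rate are assumptions for illustration.

```r
# Counts copied from table(Survived, Parch) above
parch <- matrix(c(445, 53, 40, 2, 4, 4, 1,
                  233, 65, 40, 3, 0, 1, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Survived = c("0", "1"),
                                Parch = c("0", "1", "2", "3", "4", "5", "6")))

# margin = 2 normalises each column, so row "1" is the survival rate per Parch
parch_rate <- prop.table(parch, margin = 2)["1", ]
print(round(parch_rate, 2))

# Same idea for SibSp == 1: 112 survivors out of 97 + 112 passengers
sibsp1_rate <- 112 / (97 + 112)
print(round(sibsp1_rate, 2))
```

Note that the Parch==3 rate rests on only five passengers, so it is a much less reliable estimate than the SibSp==1 rate.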
What is the relationship between: Embarked, Pclass, Ticket, Fare?
Cherbourg, France
Southampton, England
Queenstown, Ireland
All three Embarked Ports (C,Q,S) boarded passengers from all classes (1st, 2nd, 3rd).
But 50% of Cherbourg Passengers were 1st Class; they paid much higher fares (blue spikes).
Based on this, Fare is likely a stronger predictor of survival than Embarked.
Graph created in MS Excel using data from table(Embarked, Pclass, Fare, Ticket)
Google Fusion Tables
Geospatial Heatmap, Network Diagrams
Google Fusion Heatmap
GEOCODED by Embarkation Port:
• Southampton, UK -- 644 passengers
• Cherbourg, France -- 168 passengers
• Queenstown, Ireland – 77 passengers
No Lifeboat
SURVIVORS PERISHED
Network Diagrams showing
Lifeboats (orange) vs. Embarkation Port (blue)
Based on external data (Encyclopedia Titanica)
imported into Google Fusion Tables.
Rule-Based Models
Everyone Survived vs. Everyone Perished
# Model: Everyone survived
test$Survived <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)
Result: 0.37321 ☹
# Model: Everyone perished
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)
Result: Your Best Entry: 0.62679 ☺
You improved on your best score by 0.25359.
You just moved up 12 positions on the leaderboard
Survival rate for test is similar to RMS Titanic
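The two constant models are useful baselines because their scores reveal the class balance of the scored data. The sketch below shows the idea on a hypothetical eight-passenger truth vector (the real test labels are not public); the function name accuracy is an assumption.

```r
# Accuracy = share of predictions that match the true labels
accuracy <- function(pred, truth) mean(pred == truth)

# Hypothetical truth vector: 5 perished (0) and 3 survived (1)
truth <- rep(c(0, 1), times = c(5, 3))

acc_all_perished <- accuracy(rep(0, length(truth)), truth)
acc_all_survived <- accuracy(rep(1, length(truth)), truth)

cat("Everyone perished:", acc_all_perished, "\n")
cat("Everyone survived:", acc_all_survived, "\n")

# The two constant models are complementary, as 0.62679 + 0.37321 = 1 suggests
stopifnot(isTRUE(all.equal(acc_all_perished + acc_all_survived, 1)))
```

Any model worth submitting should beat the better of the two constant baselines, here the "everyone perished" score.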
Rule-Based Models
Random vs. Informed Guess
# Model: Random Guess
test$Survived <- sample(c(0,1), 418, replace = TRUE)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)
Your submission scored 0.50718, ☹which is not an improvement of your best score.
Model: Informed Guess
● Used problem domain info, data
visualizations and intuition to make an
“informed guess” about each passenger.
● Manually typed in 1,0 into test.csv file
with 418 rows…
Your Best Entry: 0.70335! ☺
You improved on your best score by 0.07656!
Process is similar to
everyday human
decision-making
(no machine learning).
Score is much better
than random chance!
Data Visualization in R
R Visualization Packages:
• Base R: plot, barplot, boxplot, hist, dotchart, heatmap, pairs
• ggplot2: qplot, ggplot
• lattice: xyplot, dotplot, parallelplot
• vcd: "Visualizing Categorical Data" mosaic, assoc
• rcmdr: "Rcommander" scatter3d
• rattle: Explore Tab.
latticist, ggobi
Continuous vs. Discrete (Categorical) Variables
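CORRELOGRAM: {base R} pairs()
# Assumes attach(train); data.frame() combines the column vectors,
# and stringsAsFactors=TRUE lets pairs() plot Sex and Embarked
t <- data.frame(Survived, Pclass, Sex,
                Age, Fare, Embarked, SibSp, Parch, stringsAsFactors=TRUE)
pairs(t, col=t$Pclass+2) # Shift base R color palette by 2
# 1st class – green (1+2=3)
# 2nd class – blue (2+2=4)
# 3rd class – cyan (3+2=5)
# base R Color Wheel is not very subtle!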
• Correlogram is meant to show pair-wise relationships.
• Continuous variables appear as "clouds"
• Discrete variables appear as "bands"
Which variables are correlated?
(Models perform better when variables are independent!)
Correlation plots created using {rattle} R package
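The same question can be answered numerically with cor(). The sketch below uses hypothetical random stand-ins for the family columns, not the real Titanic values, and derives FamilySize the same way as in the feature engineering step, so the correlation between FamilySize and its components is present by construction.

```r
set.seed(1)

# Hypothetical stand-ins for the family columns (not the real Titanic values)
SibSp <- rpois(200, lambda = 0.5)
Parch <- rpois(200, lambda = 0.4)
FamilySize <- SibSp + Parch + 1      # same derivation as in feature engineering

family <- data.frame(SibSp, Parch, FamilySize)

# Pairwise Pearson correlations; values near 1 or -1 flag redundant predictors
corr <- cor(family)
print(round(corr, 2))
```

When a derived feature such as FamilySize is used, it may be reasonable to drop one of its components from models that are sensitive to correlated inputs.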
[Correlation plot variables: FamilySize, SibSp, Parch, Fare3, Fare, Age]
Continuous, Multivariate
Marginal Plots:
{rattle} latticist
• {rattle} is an R package
• latticist is an interactive GUI
for Data Visualization
Continuous, Multivariate
Intensity Map: {base R} heatmap()
• Useful for visualizing and
comparing data sets.
• Requires a data matrix.
• Values must be numeric (recode qualitative variables e.g.,
Pclass, Gender).
• Can use custom color palette
(e.g., RColorBrewer)
test does not have a
Survived attribute.
PassengerID 1:891 (train), 891 obs.; 892:1309 (test), 418 obs.
train is representative of test.
"Soup Analogy": values look like
they are randomly distributed and
"well-stirred" – no big chunks of
dark or light bands.
Models based on train can be used
to predict test fairly accurately.
Continuous, Univariate
Histogram: {base R} hist()
Show range, density
and distribution of a
single, continuous
variable.
# Use 2X2 grid
par(mfrow=c(2,2))
hist(test$Age)
hist(test$Fare)
hist(train$Age)
hist(train$Fare)
"Small Multiples"
concept by Tufte:
Displaying multiple small
plots side-by-side is
effective for analysis.
test and train have
similar distributions for
continuous variables.
"Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))
Categorical, Univariate
Bar Plots: {base R} barplot()
test and train have similar
distributions for
categorical variables.
Continuous, Univariate
Dot Plot: {lattice} dotplot()
library(lattice)
attach(train)
# Each dot is a passenger.
# Survived==1 Red
# Survived==0 Black
# col=Survived+1 maps 0/1 onto palette colors 1 (black) and 2 (red)
dotplot(Age, pch=1, col=Survived+1, main="train$Age")
dotplot(Fare, pch=1, col=Survived+1, main="train$Fare")
cluster of survivors
(young children)
outliers
cluster of perished passengers
(who paid lowest fares).
Continuous, Univariate
Box Plot: {Base R} boxplot()
Shows interquartile range (IQR),
Median, outliers.
# Plot Age grouped by Pclass
par(mfrow=c(1,2))
Survivors <- train[train$Survived==1,]
Perished <- train[train$Survived==0,]
boxplot(Age ~ Pclass, data = Survivors, col = "light blue", main="Survived", xlab="Passenger Class", ylab="Age")
boxplot(Age ~ Pclass, data = Perished, col = "gray", main="Perished", xlab="Passenger Class", ylab="Age")
Survivors had younger age
range compared to perished across
all three passenger classes.
[Box plot annotations: medians of 33.50, 28.00, 27.00, 28.00, 30.00 and 38.50]
Categorical, Multivariate
Spine Plot = 3 Bar Plots
[Spine plot annotations]
All passengers: 314 female (35%), 577 male (65%)
Survived: 233 female (68%), 109 male (32%)
Perished: 81 female (15%), 468 male (85%)
FEMALES: greater than expected survival rate
MALES: greater than expected mortality rate
Class: mutually exclusive, rectilinear partition. E.g., Female Survivors
Probability: frequency count/whole set. E.g., 233/342 = 68% of survivors were female
Spine Plot is a visualization of a
rules-based model; it exhaustively
describes the feature space = Titanic
Passengers (female vs male)
Categorical, Multivariate Spine Plot: {base R} spineplot()
Indicates a higher
than expected survival rate.
Visualization of a contingency table.
vcd = "Visualizing Categorical Data"
Blue – High Probability
Gray – Neutral
Red – Low Probability
Example:
3rd Class Male (Sex==male & Pclass==3)
• High Probability: Survived==0
• Low Probability: Survived==1
# Mosaic Plot
library(vcd)
attach(train)
t <- table(Sex,Survived,Child)
mosaic(t, shade=TRUE,
main="train dataset")
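The blue and red shading in a mosaic plot reflects how far each cell deviates from what independence would predict. As far as we understand it, the default shading in vcd is based on Pearson residuals; the base R sketch below computes those residuals for the two-way sex-by-survival counts from the spine plot slide, without needing vcd. The object names counts, test_result and pearson are assumptions.

```r
# Sex-by-survival counts for train, as shown on the spine plot
counts <- matrix(c(233, 81,
                   109, 468),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Outcome = c("Survived", "Perished")))

test_result <- chisq.test(counts, correct = FALSE)

# Counts expected if Sex and Outcome were independent
print(round(test_result$expected, 1))

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- test_result$residuals
print(round(pearson, 2))
```

Positive residuals correspond to cells with more passengers than expected (female survivors, male victims) and negative residuals to cells with fewer.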
Categorical, Multivariate
Mosaic Plot: {vcd} mosaic()
[Mosaic plot panels: female adults, female children, male adults, male children; 60% Perished, 40% Survived]
females (survived): 36% of all passengers, 77% of all survivors
male adults (perished): 61% of all passengers, 83% of all who perished
Mosaic Plot and Decision Tree are similar:
60% Perished, 40% Survived
male adults (perished), male children (survived)
females (survived), males (perished)
Rule-Based Models
"Females" / "Women or Children"
# Model: Females Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)
Your Best Entry: 0.76555 ☺
You improved on your best score by 0.06220.
# Model: Women OR Children Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
test$Survived[test$Age<10] <- 1
# Tried different age cutoffs until score improved.
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)
Your Best Entry: 0.77033 ☺
You improved on your best score by 0.00478
Rule-based model (70 rules)
Sex : Child : Fare2 : FamilySize
• Inspired by Principal Components Analysis (PCA)
• Performed better than naiveBayes, qda, glm, svm (radial, sigmoid, polynomial)!
aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train, FUN=function(x) {sum(x)/length(x)})
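The aggregate() call produces one survival rate per observed combination of predictor values, which is what makes it a rule table. How the rules were applied to test is not shown on the slide, so the sketch below is one plausible way to do it: the miniature data sets, the merge() lookup and the 0.5 cutoff are all assumptions, and only Sex and Child are used to keep it short.

```r
# Hypothetical miniature train/test sets; only Sex and Child are used here
train <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Child    = c(0, 0, 1, 0, 0, 0, 1, 1),
  Survived = c(1, 1, 1, 0, 0, 1, 1, 0)
)
test <- data.frame(PassengerId = 1:4,
                   Sex   = c("female", "male", "male", "female"),
                   Child = c(0, 0, 1, 1))

# One rule per observed combination: the survival rate within that group
rules <- aggregate(Survived ~ Sex + Child, data = train,
                   FUN = function(x) sum(x) / length(x))
names(rules)[names(rules) == "Survived"] <- "Rate"

# Look up each test passenger's rule; predict survival when the rate exceeds 0.5
scored <- merge(test, rules, by = c("Sex", "Child"), all.x = TRUE)
scored$Survived <- as.integer(!is.na(scored$Rate) & scored$Rate > 0.5)
scored <- scored[order(scored$PassengerId), ]
print(scored)
```

Test passengers whose combination never occurs in train get no matching rule; in this sketch they default to "perished", which is a design choice rather than something the source specifies.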
Summary: kaggle.com results so far…
Model               Description                                                    Result
70-Rule Model       aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train,
                    FUN=function(x) {sum(x)/length(x)})                            0.77512
Female OR Child     [test$Sex =='female'| test$Age < 10]                           0.77033
Female              [test$Sex =='female']                                          0.76555
Informed Guess      Data Visualization + Problem Domain info + manual
                    typing 1,0 into .csv file.                                     0.70335
Random Guess        sample(c(1,0), 418, replace=TRUE)                              0.50718
Everyone Perished   test$Survived <- 0                                             0.62679
Everyone Survived   test$Survived <- 1                                             0.37321
Machine Learning: Titanic Dataset
Overview of Machine Learning Algorithms
START: Is training data available?
  No -> UNSUPERVISED LEARNING
  Yes (train.csv) -> SUPERVISED LEARNING
    Continuous Target -> REGRESSION
    Categorical Target: Survived -> CLASSIFICATION
      Multivariate Classification
      BINARY Classification == 1,0
        SINGLE CLASSIFIERS: glm, knn, qda, naiveBayes, rpart, ctree, svm
        ENSEMBLE METHODS: randomForest, cforest
QDA (0.75598) vs Logistic Regression (.76077)
• Logistic Regression: linear model = straight line boundaries. Better fit for Titanic data set.
• QDA: polynomial (quadratic) model = curved boundaries.
• Both are Eager Learners. 2 step process: 1) Fit model using global info. 2) Predict test using reusable model.
Naïve Bayes (0.76555) vs. KNN (0.77990)
library(klaR)  # provides partimat() and sknn()
ptm <- proc.time()
partimat(Survived~.,data=train_bc,method="sknn")
end <- (proc.time() - ptm)
# 769.72 milliseconds – MORE TIME CONSUMING but
# MORE CUSTOMIZED BOUNDARIES –> greater accuracy.
ptm <- proc.time()
partimat(Survived~.,data=train_bc,method="naiveBayes")
end <- (proc.time() - ptm)
# 39.99 milliseconds – only 5% of the knn time.
AdaBoost (0.77990 – same as KNN)
# rattle Model output
Summary of the Ada Boost model:
Call:
ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)],
    control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10), iter = 50)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
          Final Prediction
True value   0   1
         0 350  23
         1  45 205
Train Error: 0.109
Out-Of-Bag Error: 0.136 iteration= 50
Additional Estimates of number of iterations:
train.err1 train.kap1
        50         50
Variables actually used in tree construction:
[1] "Age" "FamilyID2" "Fare" "Sex" "Title"
Frequency of variables actually used:
FamilyID2      Fare     Title       Age       Sex
       49        49        48        46         8
Time taken: 3.42 secs
Only 50 trees compared
to 4000 trees in
cforest, hence lower
performance.
Examples of AdaBoost "weak learner" trees:1,3,10,20,35,47. Total: 50 trees
Support Vector Machines (2D)
linear, cost=1, 68% correct
radial, cost=100, 73.4% correct
polynomial, cost=10, 68% correct
sigmoid, cost=0.1, 66% correct
SVM Kernels & Decision Boundary Shapes
• Linear: Line
• Radial: Circle
• Polynomial: C Curve
• Sigmoid: S Curve
"Goodness of Fit" – svm:
radial performed best with two
dimensions (.77033).
Scatterplots for visualizing SVM 2D {ggplot2} qplot vs. 3-D {Rcmdr} scatter3d
# Interactive 3D hyperplane with spline
library(Rcmdr); attach(train)
scatter3d(Age,Survived,Fare)
# Point and Line ScatterPlot
library(ggplot2); attach(train)
qplot(Age, Fare, data=train,
geom=c("point","line"),colour=Survived,main = "Titanic Passengers")
SVM
using 11 inputs
Advantages of SVM:
• Minimal pre-processing needed.
• Tuning improves accuracy.
• Helps reveal best fit
(linear/poly/radial/sigmoid).
• Relatively robust to the "Curse of
Dimensionality".
• Instead of worsening, accuracy
improved when dimensions
increased from 2 to 11
attributes.
0.79904: good, but still not better than cforest or randomForest (0.81818)
cforest (.81818) + Lifeboat Data Fusion = .83732
# Added 12 male survivors based on merged
# lifeboat data from Encyclopedia Titanica.
ciforest2 <- read.csv("ciforest2.csv")
testlb <- read.csv("test_lifeboats.csv")
ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
ensembles$Survived[ensembles$Lifeboat==1] <- 1
table(ensembles$Survived)
#   0   1
# 272 146
submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)
"Ensemble of ensembles": randomForest + cforest + random tiebreaker
# Code for 95/05 tiebreaker (score 0.81818)
# Merge randomForest and cforest and average
# the results. Reuse unanimous votes.
ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")
ensembles$Vote <- (as.numeric(ensembles$Survived.x) + as.numeric(ensembles$Survived.y))/2
ensembles$Survived[ensembles$Vote==1.0] <- 1
ensembles$Survived[ensembles$Vote==0.0] <- 0
# Create vector of 418 random 0s and 1s
set.seed(pi)
probs <- c(.95,.05)
ensembles$rvote <- sample(c(0,1), 418, replace = TRUE, prob=probs)
# For each tie, use a random vote
ensembles$Survived[ensembles$Vote==0.5] <- ensembles$rvote[ensembles$Vote==0.5]
table(ensembles$Survived)
  0   1
281 137
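The voting logic can be tried out without the submission files. The sketch below is a self-contained illustration on six hypothetical predictions, not the project's code: the vectors rf_pred and cf_pred and the seed are assumptions, and random votes are drawn only for the tied rows rather than for all 418 passengers.

```r
set.seed(3)  # arbitrary seed for a reproducible tiebreak

# Hypothetical 0/1 predictions from two models for six passengers
rf_pred <- c(1, 0, 1, 0, 1, 0)
cf_pred <- c(1, 0, 0, 1, 1, 0)

vote  <- (rf_pred + cf_pred) / 2      # 0 or 1 when unanimous, 0.5 on a tie
final <- ifelse(vote == 0.5, NA, vote)

# Break ties at random, favouring "perished" 95% of the time
ties <- is.na(final)
final[ties] <- sample(c(0, 1), sum(ties), replace = TRUE, prob = c(0.95, 0.05))

print(data.frame(rf_pred, cf_pred, vote, final))
```

With only two voters, every disagreement is a tie, so the combined prediction can differ from both inputs only through the tiebreaker. This may help explain why the combination did not beat the individual forests.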
What if we combine results from randomForest and
cForest? Use random tiebreaker for non-unanimous votes.
Results: Combinations did not outperform individuals,
even when lifeboat data was added.
Data mining using lifeboat info = competitive edge. 12
additional male survivors is highly significant because they
countered social norms and survived "against the odds".
Ensemble methods (randomForest, cforest) outperform
single classifiers. "Many models work better than one."
Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature
selection. Decision trees are great communication tools.
knn has the same accuracy as AdaBoost (0.77990) and slightly beats glm (0.76077),
but takes a lot of processing time.
Simple rule-based models can outperform naiveBayes if
features are chosen in the spirit of Principal Components Analysis (PCA).
Social norms ("Women and Children First", "Male
survivors are cowards" ) greatly influenced survival.
Human decision-making outperforms random chance,
and can outperform machine learning (depending on the
human's expertise).
Math-based models like glm are sensitive to feature selection.
"Goodness of fit" determines performance. Linear and
radial (glm, svm:linear/radial) outperformed others
(qda,svm:polynomial/sigmoid).
Machine Learning Summary