The Titanic:

Machine Learning

from Disaster

Data Mining and Machine Learning. Winter 2014. Final Project

Jean Callao | Michelle Darling | Paul Marxhausen

In depth analysis: by Jean Callao

• Logistic Regression: glm

• Tree-based methods: rpart, ctree

In depth analysis: by Paul Marxhausen

• Ensemble Methods: randomForest, cForest

Summary: by Michelle Darling

• Data Visualization

• Machine Learning Kaggle Results

Titanic: Machine Learning from Disaster

Why we picked this project:

● Historical context to understand "What does the data mean?"

● Learn one data set well, and then apply different algorithms and modelling tools.

● Practice the steps of data analysis:

○ Data exploration and visualization.

○ Model selection, building and testing.

● Prize: $0 + "knowledge & confidence" to go on to more challenging data science problems.

Kaggle provides:

Online data science competitions.

Structured problems, tutorials, help forums and discussion groups.

Easy, consistent way to test models and track results.

>>> Focus <<<

April 1912

The Titanic Disaster

RMS Titanic, April 1912

A priori knowledge from problem domain

What factors contributed to survival?

Gender, Age, Passenger Class, Fare, Family

More likely to survive

• Females

• Children, Adults<50

• 1st Class

• Paid higher fares

• Travelling with family

More likely to perish

• Males

• Adults >50

• 2nd, 3rd class

• Paid lower fares

• Travelling alone

• Immigrants

Titanic Dataset: Predictor & Target Variables

Target Variable: Survived (1 = Yes; 0 = No)

Predictor Variables   DESCRIPTION
Pclass                Passenger Class (1=1st; 2=2nd; 3=3rd)
Name                  Passenger Name
Sex                   Sex ("male", "female")
Age                   Age (Numeric fraction e.g., 1.5)
Fare                  Passenger Fare
SibSp                 Number of Siblings/Spouses Aboard
Parch                 Number of Parents/Children Aboard
Ticket                Ticket Number
Cabin                 Cabin
Embarked              Port of Embarkation (C=Cherbourg; Q=Queenstown; S=Southampton)

QUANTITATIVE Variables; the rest are QUALITATIVE.

Feature Engineering

Data relating to one's location on the ship

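library(stringr)  # provides str_sub()

# Last character of the cabin number; non-numeric or missing values become NA
data$cabin.last.digit <- suppressWarnings(as.numeric(str_sub(data$Cabin, -1)))

data$Side <- "Unknown"

# Even-numbered cabins were on the port side, odd-numbered on the starboard side
data$Side[which(data$cabin.last.digit %% 2 == 0)] <- "port"

data$Side[which(data$cabin.last.digit %% 2 == 1)] <- "starboard"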

Classifying Fares

combi$Fare2 <- '30+'

combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <-'20-30'

combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'

combi$Fare2[combi$Fare < 10] <- '<10'

Title - Extract from name to find wealthy passengers:

combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'

combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'

combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'
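The slide shows how titles are grouped but not how they are pulled out of the Name column. A minimal sketch of that step, assuming names follow the "Surname, Title. Given names" pattern; the sample names and the `name` and `title` variables are hypothetical and not taken from the project code:

```r
# Hypothetical sample values in the "Surname, Title. Given names" format
name <- c("Braund, Mr. Owen Harris",
          "Heikkinen, Miss. Laina",
          "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Keep only the text between the comma and the first period
title <- sub("^[^,]*,\\s*([^.]*)\\..*$", "\\1", name)

# Group rare titles the same way the slide does
title[title %in% c("Dona", "Lady", "the Countess")] <- "Lady"

print(title)
```

The same `sub()` call could be applied to `combi$Name` to create `combi$Title` before the grouping lines above.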

FamilySize - Combining spouse, siblings and parents

combi$FamilySize <- combi$SibSp + combi$Parch + 1

Decision Trees and

Logistic Regression

Presented by Jean Callao

Decision Trees

• A decision tree is a simple but powerful form of multiple variable analysis. It displays a tree-like graph of decisions and their possible consequences.

• Recursive partitioning: at each step, we identify a question that we use to partition the data.

• Data-driven: makes no prior assumptions; selects significant predictors based on the greatest information gain.

• Flexible: no data pre-processing needed! Handles numeric and categorical data.

• Easy to interpret and explain to others.
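To make "greatest information gain" concrete, here is a small sketch that scores two candidate splits by entropy reduction. The `entropy` and `info_gain` helpers and the eight toy passengers are illustrative assumptions. Note that rpart uses the Gini index by default; passing `parms = list(split = "information")` switches it to information gain.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = entropy before the split minus weighted entropy after it
info_gain <- function(y, x) {
  w <- table(x) / length(x)
  after <- sum(sapply(names(w), function(k) w[[k]] * entropy(y[x == k])))
  entropy(y) - after
}

# Toy data (hypothetical): sex separates the classes perfectly, class does not
survived <- c(1, 1, 1, 0, 0, 0, 0, 1)
sex      <- c("f", "f", "f", "m", "m", "m", "m", "f")
pclass   <- c(1, 3, 1, 3, 1, 3, 1, 3)

gain_sex   <- info_gain(survived, sex)
gain_class <- info_gain(survived, pclass)
print(c(sex = gain_sex, pclass = gain_class))
```

A tree grown on this toy data would split on sex first, because that split removes all uncertainty about the outcome.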

Decision Tree with New Variables

tree <- rpart(Survived~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,

data=train, method="class", control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))


Prediction <- predict(tree, test, type = "class")


Perished Survived

262 156

Decision Tree with New Variables

Root node -> 62% perished, 38% survived

Mr or Noble -> 84% perished, 16% survived

Not a Mr or Noble -> 28% perished, 72% survived

3rd class -> 52% perished, 48% survived

Not 3rd class -> 5% perished, 95% survived

Paid >= $23 -> 91% perished, 9% survived

Paid < $23 -> 38% perished, 62% survived

Age >= 36 yrs -> 86% perished, 14% survived

Age < 36 yrs -> 36% perished, 64% survived

Overfitted rpart Decision Tree

Disadvantages of rpart:

• Can suffer from:

o High Variance

o High Bias

• Decision tree algorithms can result in

overly complex or overfitted trees.

Function ctree() in package party

addresses these weaknesses by providing:

• Unbiased variable selection

• Statistical stopping rules to

optimize tree growth.

Conditional Tree: ctree

train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare +Title + Side,data=train)


Prediction2 <- predict(train.ctree , newdata=test, type="response")


Perished Survived

256 162

Mr or Noble -> Side -> Port or Starboard:

40% chance of surviving, 60% chance of dying

Mr or Noble -> Side -> Unknown:

16% chance of surviving, 84% chance of dying

Not a Mr or Noble -> 1st or 2nd Class:

98% chance of surviving, 2% chance of dying

Not a Mr or Noble -> 3rd Class -> Paid <= $23.25:

61% chance of surviving, 39% chance of dying

Not a Mr or Noble -> 3rd Class -> Paid > $23.25:

14% chance of surviving, 86% chance of dying

Conditional Tree: ctree

Logistic Regression

Least squares linear regression:

• Predicted probabilities can be greater than 1 or less than 0 if used for a binary response.

Logistic regression:

• Used for binary qualitative response.

• Using the logit ensures all probabilities are between 0 and 1.

Why use Logistic Regression?

Allows us to establish a relationship between a binary outcome variable and a group of predictor variables. Can be used to:

• Classify a binary response (e.g., Yes/No, Pass/Fail).

• Calculate the probability (0.0 to 1.0) of the response.

The “logit” model solves the problem:

• “p” is the probability that Y equals 1 for a case, p = P(Y=1).

• “1 - p” is the probability that Y equals 0 for a case.

Transformed, the “log odds” are linear.

Log odds (logit) = linear combination:

$$\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X$$

Odds:

$$\frac{p}{1-p} = e^{B_0 + B_1 X}$$

Probability (logistic function), which produces an S-shaped curve:

$$p = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}}$$
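The relationship between log odds and probability can be checked numerically. This is a small sketch with made-up coefficients (`b0` and `b1` are hypothetical, not estimates from the Titanic models); `plogis()` is base R's logistic function.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -1.5
b1 <- 0.8
x  <- c(-2, 0, 2, 4)

# Linear combination = log odds
log_odds <- b0 + b1 * x

# Logistic function turns log odds into probabilities between 0 and 1
p <- exp(log_odds) / (1 + exp(log_odds))

# The logit of p recovers the linear combination
logit_p <- log(p / (1 - p))

print(round(p, 3))
```

Plotting `p` against a fine grid of `x` values, for example with `curve(plogis(b0 + b1 * x), -10, 10)`, gives the S-shaped curve.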

Confirming “women &

children first” policy

Titanic.glm <- glm(Survived~ I(Sex=="female") + Class + I(Age<=10) + Embarked + Fare2,

data = train, family=binomial("logit"))


Perished Survived

252 166


The logistic regression coefficients give the

change in the log odds of the outcome for a

one unit increase in the predictor variable.
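As a rough illustration of reading coefficients as log odds, the sketch below fits a logistic regression to simulated data and reproduces a predicted probability by hand. The data frame `toy`, its columns and the simulation coefficients are all assumptions made for this example; they are not the project's data or estimates.

```r
set.seed(1)

# Simulated passengers (hypothetical): two 0/1 indicators and a binary outcome
n <- 200
toy <- data.frame(female = rbinom(n, 1, 0.4),
                  child  = rbinom(n, 1, 0.1))
toy$survived <- rbinom(n, 1, plogis(-1.5 + 2.5 * toy$female + 1.0 * toy$child))

fit <- glm(survived ~ female + child, data = toy, family = binomial("logit"))
b <- coef(fit)

# exp(coefficient) is the multiplicative change in the odds
# for a one unit increase in that predictor
odds_ratio <- exp(b)

# Manual prediction for a female child: log odds first, then probability
z <- b[["(Intercept)"]] + b[["female"]] * 1 + b[["child"]] * 1
p_manual <- plogis(z)

# Same prediction through predict()
p_pred <- predict(fit, newdata = data.frame(female = 1, child = 1),
                  type = "response")
```

Both routes give the same probability, which is the calculation behind the "Making Predictions" slide.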

Making Predictions

A 2nd class man who paid 20 dollars for a ticket has an estimated survival probability of:

$$p = \frac{e^{z}}{1 + e^{z}} = 0.70$$

where z combines the terms 12.3958, 2.6816(0), (0.9530)(2) and (0.6531)(20).

A 10-year-old female has an estimated survival probability of:

$$p = \frac{e^{z}}{1 + e^{z}} = 0.99$$

where z combines the terms 12.3958, 2.6816(1) and 1.6133(10).




Page 19: Final pink panthers_03_30

Interpreting Coefficients…

Estimate > 0: higher probability of survival

Estimate < 0: lower probability of survival


Passengers travelling with relatives

have higher chances of survival.

Titanic.glm2<- glm(Survived ~ Class+I(FamilySize>=2) + Parch+I(SibSp>=2),data = train, family=binomial("logit"))


Perished Survived

276 142


We see that PClass is a strong

predictor supporting the

hypotheses about:

• location on the ship

• lifeboat access.

First class adult males

have lower chances of survival

Titanic.glm3<- glm(Survived ~ Class + I(Title=="Mr")+ I(Title=="Noble") + I(Age>=30 & Age<=50)+I(Fare>=27),data = train, family=binomial("logit"))


Perished Survived

239 179


"Any data relating to one's location on the ship could

prove helpful to survival predictions…"

First class adult males had

lower chances of survival

summary(Titanic.glm3)

Those in upper decks (1st class) had more timely, accurate information and a shorter journey to the lifeboats… Yet why did 1st Class Males have lower survival rates?

Possible explanation:

• 1st Class Males were expected to be

"gentlemen" and perish with the ship.

"No woman shall be left aboard this ship

because Ben Guggenheim was a coward."

• 1st Class Male Survivors were

condemned by society:

> Bruce Ismay – had to resign as

Chairman of White Star Line.

> William Carter – divorced by wife.

Third class adult males had

lower chance of survival


Those located in the bow or

lower decks (3rd Class) had less

chance of survival.

Titanic.glm4 <- glm(Survived ~ Class + I(Age>=30 & Age<=65) + I(Title=="Mr" & Class=="Third") + I(Fare<=10), data = train, family=binomial("logit"))


Perished Survived

258 160

Ensemble Methods:

randomForest and cforest

Presented by Paul Marxhausen

Random Forests

Advantages:

• Easy to use: can be used quite efficiently with default parameters.

• Ideal for people without a deep background in statistics.

• Produces fairly strong predictions with only a small amount of coding.

• An example of an ENSEMBLE METHOD: combines multiple models to produce one result.

• Unlike single decision trees, which can suffer from high variance or high bias, Random Forests use random sampling and averaging to find a natural balance between the two.
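The "random sampling and averaging" idea can be sketched with plain bagging: grow many rpart trees on bootstrap samples and take a majority vote. This is only an approximation of a Random Forest, since it does not also sample a random subset of features at each split. The simulated data frame `toy` and all its columns are assumptions for illustration.

```r
library(rpart)  # recommended package shipped with standard R installations
set.seed(42)

# Simulated passengers (hypothetical values, not the Kaggle data)
n <- 300
toy <- data.frame(female = rbinom(n, 1, 0.35),
                  pclass = sample(1:3, n, replace = TRUE),
                  age    = round(runif(n, 1, 70)))
toy$survived <- factor(rbinom(n, 1,
  plogis(-1 + 2.5 * toy$female - 0.6 * (toy$pclass - 1))))

train_toy <- toy[1:200, ]
test_toy  <- toy[201:300, ]

# Grow each tree on a bootstrap sample and collect its votes on the test rows
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- train_toy[sample(nrow(train_toy), replace = TRUE), ]
  tree <- rpart(survived ~ female + pclass + age, data = boot, method = "class")
  as.character(predict(tree, test_toy, type = "class"))
})

# Majority vote across trees for every test passenger
ensemble <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(ensemble == as.character(test_toy$survived))
```

Averaging over many trees smooths out the quirks of any single overfitted tree, which is the variance reduction the slide refers to.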


Random Forests: Data pre-processing


• Data has to be pre-processed to

remove NAs, NULLs, blanks.

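• Factors must not have too many levels (randomForest rejects categorical predictors with more than 53 categories; older releases capped this at 32).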

• We have to fix Age, Fare,

Embarked and FamilyID to

meet these requirements.


• Age

• Fare

• Embarked

• FamilyID

# Fill in Fare NAs



combi$Fare[1044] <-median(combi$Fare, na.rm=TRUE)
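The line above patches one known row. A slightly more general sketch, assuming any missing Fare should receive the median of the observed fares; `combi_toy` is a hypothetical stand-in for the combined data frame:

```r
# Hypothetical stand-in for the combined train + test data
combi_toy <- data.frame(Fare = c(7.25, NA, 71.28, 8.05, NA))

# Locate every missing fare instead of hard-coding a row number
missing_fare <- is.na(combi_toy$Fare)

# Replace missing values with the median of the observed fares
combi_toy$Fare[missing_fare] <- median(combi_toy$Fare, na.rm = TRUE)

print(combi_toy$Fare)
```

The median is less sensitive than the mean to the handful of very expensive tickets.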

Model: RF using ‘randomForest’ package

# Build Random Forest Ensemble


fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2, data=train, importance=TRUE, ntree=2000)

# Now let's make a prediction

Prediction <- predict(fit, test)


Models: RFs using party package

# Build conditional inference tree forest

fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train, controls=cforest_unbiased(ntree=4000, mtry=2))

# Now let's make a prediction and write a submission file

Prediction <- predict(fit, test, OOB=TRUE, type = "response")


randomForest vs. party

randomForest package

• randomForest(…) function

• mtry defaults to floor(sqrt(p)) for classification, where p is the number of predictors; mtry is the number of features to randomly select at each split.

• randomForest is computationally faster.

• Popular in applied research

party package

• cforest(…) function

• mtry set to the number 5 by default for technical reasons

• Resulting forests are unbiased even if the predictor variables are of different types.

• Importance measure: helps evaluate the importance of correlated predictor variables.

Ensemble Methods: kaggle results

Model                    Description                                              Result
fit <- cforest(…)        Changed ntree from 2000 to 4000, and mtry from 3 to 2.   0.81818
fit <- randomForest(…)   Traditional Random Forest (randomForest package)
fit <- cforest(…)        Conditional Inference tree (party package)

Data Visualization

Algorithms & kaggle results

by Michelle

Data Visualization


1. Created Conceptual Data Model

• to understand denormalized data file.

2. Tried lots of visualizations:

• Categorical vs. Continuous

• Uni-, Bi- and Multivariate

3. Compared datasets:

• Titanic vs. train vs. test ARE similar

4. Created rule-based models using the most significant predictors:

• Sex == "female"

• Sex == "female" OR Age < 10

• Sex:Child:Fare:FamilySize

Data Visualization prototyping tools:

• MS Excel


• Google Fusion

• R {rattle} package




Do family members affect survival?

> table(Survived, Parch)
        Parch
Survived   0   1   2   3   4   5   6
       0 445  53  40   2   4   4   1
       1 233  65  40   3   0   1   0

> table(Survived, SibSp)
        SibSp
Survived   0   1   2   3   4   5   8
       0 398  97  15  12  15   5   7
       1 210 112  13   4   3   0   0

Survival is higher for passengers with Parch==3 (60%), or SibSp==1 (54%)
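The percentages quoted above follow directly from the column proportions of the two tables. A short sketch that rebuilds the tables from the printed counts (the matrices are typed in by hand here rather than computed from `train`):

```r
# Counts copied from the table(Survived, Parch) output
parch <- matrix(c(445, 53, 40, 2, 4, 4, 1,
                  233, 65, 40, 3, 0, 1, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Survived = c("0", "1"),
                                Parch = c("0", "1", "2", "3", "4", "5", "6")))

# Counts copied from the table(Survived, SibSp) output
sibsp <- matrix(c(398, 97, 15, 12, 15, 5, 7,
                  210, 112, 13, 4, 3, 0, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Survived = c("0", "1"),
                                SibSp = c("0", "1", "2", "3", "4", "5", "8")))

# Share of survivors within each column (margin = 2 normalizes by column)
parch_rate <- prop.table(parch, margin = 2)["1", ]
sibsp_rate <- prop.table(sibsp, margin = 2)["1", ]

print(round(parch_rate, 2))
print(round(sibsp_rate, 2))
```

With the real data, `prop.table(table(train$Survived, train$Parch), margin = 2)` would give the same rates. The Parch==3 rate rests on only five passengers, so it should be read with caution.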

What is the relationship between:Embarked, Pclass, Ticket, Fare?


[Map labels: Cherbourg, France; Southampton, England; Queenstown, Ireland]


All three Embarked Ports (C,Q,S) boarded passengers from all classes (1st, 2nd, 3rd).

But 50% of Cherbourg Passengers were 1st Class; they paid much higher fares (blue spikes).

Based on this, Fare is likely a stronger predictor of survival than Embarked.

Graph created in MSExcel using data from table(Embarked, Pclass, Fare, Ticket)

Text Analysis of

Passenger Name


Word Clouds created for Survivors$Name and Perished$Name

Survivors <- train[train$Survived==1,]; Perished <- train[train$Survived==0,]

Sex ("male" vs. "female") is

an important predictor of survival.

Google Fusion Tables: Geospatial Heatmap, Network Diagrams

Google Fusion Heatmap

GEOCODED by Embarkation Port:

• Southampton, UK -- 644 passengers

• Cherbourg, France -- 168 passengers

• Queenstown, Ireland – 77 passengers

No Lifeboat


Network Diagrams showing

Lifeboats (orange) vs. Embarkation Port (blue)

Based on external data (Encyclopedia Titanica)

imported into Google Fusion Tables.

Data Visualization in R

R Visualization Packages:

• Base R: plot, barplot, boxplot, hist, dotchart, heatmap, pairs

• ggplot2: qplot, ggplot

• lattice: xyplot, dotplot, parallelplot

• vcd: "Visualizing Categorical Data" mosaic, assoc

• Rcmdr: "R Commander" scatter3d

• rattle: Explore Tab.

latticist, ggobi

Continuous vs. Discrete (Categorical) Variables

CORRELOGRAM: {base R} pairs()

t <-,Pclass,Sex,

pairs(t, col=t$Pclass+2) # Shift base R color palette by 2
# 1st class – green (1+2=3)
# 2nd class – blue (2+2=4)
# 3rd class – cyan (3+2=5)
# base R Color Wheel is not very subtle!

• Correlogram is meant to show pair-wise relationships.

• Continuous variables appear as "clouds"

• Discrete variables appear as "bands"

Continuous, Multivariate

Intensity Map: {base R} heatmap()

• Useful for visualizing and

comparing data sets.

• Requires a data matrix.

• Values must be numeric (recode qualitative variables e.g.,

Pclass, Gender).

• Can use custom color palette

(e.g., RColorBrewer)

test does not have a

Survived attribute.

PassengerID 1:891 (train, 891 obs.); 892:1309 (test, 418 obs.)

train is representative of test.

"Soup Analogy": values look like

they are randomly distributed and

"well-stirred" – no big chunks of

dark or light bands.

Models based on train can be used

to predict test fairly accurately.

Continuous, Univariate

Histogram: {base R} hist()

Shows range, density and distribution of a single, continuous variable.

# Use 2X2 grid
par(mfrow=c(2,2))
hist(test$Age)
hist(test$Fare)
hist(train$Age)
hist(train$Fare)

"Small Multiples"

concept by Tufte:

Displaying multiple small

plots side-by-side is

effective for analysis.

test and train have

similar distributions for

continuous variables.

"Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))

Categorical, Univariate

Bar Plots: {base R} barplot()

test and train have similar

distributions for

categorical variables.

Continuous, Univariate

Dot Plot: {lattice} dotplot()

library(lattice)
attach(train)
# Each dot is a passenger.
# Survived==1 Red
# Survived==0 Black
# Base R palette: 1 = black, 2 = red, so shift Survived (0/1) by 1
dotplot(Age, pch=1, col=Survived+1, main="train$Age")

cluster of survivors (young children)

outliers

cluster of perished passengers (who paid lowest fares).

Continuous, Univariate

Box Plot: {Base R} boxplot()

Shows interquartile range (IQR),

Median, outliers.

# Plot Age grouped by Pclass
par(mfrow=c(1,2))
Survivors <- train[train$Survived==1,]
Perished <- train[train$Survived==0,]

boxplot(Age ~ Pclass, data = Survivors, col = "light blue", main="Survived", xlab="Passenger Class", ylab="Age")

boxplot(Age ~ Pclass, data = Perished, col = "gray", main="Perished", xlab="Passenger Class", ylab="Age")

Survivors had younger age

range compared to perished across

all three passenger classes.


33.50 Median









Categorical, Multivariate

Spine Plot = 3 Bar Plots

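[Figure: three bar plots, female vs. male: all passengers 35% / 65%; survivors 68% / 32%; perished 15% / 85%]

FEMALES: greater than expected survival rate

MALES: greater than expected mortality rate

Class: mutually exclusive, rectilinear partition. E.g., Female Survivors

Probability: frequency count/whole set. E.g., 233/342 = 68% (female survivors / all survivors)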

Spine Plot is a visualization of a

rules-based model; it exhaustively

describes the feature space = Titanic

Passengers (female vs male)
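The three pairs of percentages behind the spine plot can be recomputed from a 2x2 table of counts. The sketch below assumes the commonly reported counts for the Kaggle train set (81 and 233 females, 468 and 109 males, perished and survived respectively); only the 233 appears on the slide, so treat the other counts as an assumption.

```r
# Sex by Survived counts (assumed from the Kaggle train set, 891 passengers)
counts <- matrix(c( 81, 233,
                   468, 109),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Survived = c("0", "1")))

# Share of each sex among all passengers
share_all <- rowSums(counts) / sum(counts)

# Share of each sex among survivors and among those who perished
by_outcome     <- prop.table(counts, margin = 2)
share_survived <- by_outcome[, "1"]
share_perished <- by_outcome[, "0"]

print(round(rbind(all = share_all,
                  survived = share_survived,
                  perished = share_perished), 2))
```

Females make up about 35% of passengers but about 68% of survivors, which is the "greater than expected" gap the plot highlights.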


Categorical, Multivariate Spine Plot: {base R} spineplot()

Indicates a higher

than expected survival rate.

Visualization of a contingency table.

vcd = "Visualizing Categorical Data"

Blue – High Frequency

Gray – Neutral

Red – Low Frequency count.

3rd Class Male: Sex==male & Pclass==3

• High Frequency: Survived==0

• Low Frequency: Survived==1

# Mosaic Plot
library(vcd)
attach(train)
t <- table(Sex,Survived,Child)
mosaic(t, shade=TRUE, main="train dataset")

Categorical, Multivariate

Mosaic Plot: {vcd} mosaic()

[Mosaic plot panels: female adults, female children, male adults, male children]

females (survived): 36% of all passengers, 77% of all survivors

male adults (perished): 61% of all passengers, 83% of all who perished

Mosaic Plot

Decision Tree

[Decision tree labels: 60% Perished, 40% Survived; male adults (perished); male children]

Continuous, Multivariate

Marginal Plots:

{rattle} latticist

• {rattle} is an R package

• latticist is an interactive GUI

for Data Visualization

Which variables are correlated? (Models perform better when variables are independent!)

Correlation plots created using {rattle} R package







Rule-Based Models: Everyone Survived vs. Everyone Perished

# Model: Everyone survived
test$Survived <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)

Result: 0.37321 ☹

# Model: Everyone perished
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)

Result: Your Best Entry: 0.62679 ☺

You improved on your best score by 0.25359.

You just moved up 12 positions on the leaderboard

Survival rate for test is similar to RMS Titanic
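The two baseline scores are mirror images: predicting 0 for everyone is right exactly as often as passengers perished, and predicting 1 is right as often as they survived, so the two accuracies sum to 1 (0.62679 + 0.37321). A tiny sketch with made-up labels; `truth` and `accuracy` are hypothetical names:

```r
# Hypothetical true labels (the real test labels are hidden by Kaggle)
truth <- c(0, 0, 1, 0, 1, 0, 0, 1)

# Accuracy = share of predictions that match the truth
accuracy <- function(pred, truth) mean(pred == truth)

acc_all_perished <- accuracy(rep(0, length(truth)), truth)
acc_all_survived <- accuracy(rep(1, length(truth)), truth)

print(c(perished = acc_all_perished, survived = acc_all_survived))
```

The "everyone perished" score is therefore the floor any real model has to beat.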

Rule-Based Models: Random vs. Informed Guess

# Model: Random Guess

test$Survived <- sample(c(0,1), 418, replace = TRUE)

submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)

Your submission scored 0.50718, ☹ which is not an improvement of your best score.

Model: Informed Guess

● Used problem domain info, data visualizations and intuition to make an “informed guess” about each passenger.

● Manually typed in 1,0 into test.csv file with 418 rows…

Your Best Entry: 0.70335! ☺ You improved on your best score by 0.07656!

Process is similar to everyday human decision-making (no machine learning).

Score is much better than random chance!

Rule-Based Models: "Females" / "Women or Children"

# Model: Females Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)

Your Best Entry: 0.76555 ☺ You improved on your best score by 0.06220.

# Model: Women OR Children Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
test$Survived[test$Age<10] <- 1
# Tried different age cutoffs until score improved.
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)

Your Best Entry: 0.77033 ☺ You improved on your best score by 0.00478

Rule-based model (70 rules): Sex : Child : Fare2 : FamilySize

Principal Components Analysis

• Inspired by


• Performed

better than


qda, glm,




aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train, FUN=function(x) {sum(x)/length(x)})
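The aggregate() call produces one survival rate per combination of predictors, and each rate acts as a rule. Below is a sketch of how such a rule table might be turned into predictions, using two predictors and toy data. The data frames, the 0.5 cutoff and the handling of unseen combinations are assumptions, since the slides do not show this step.

```r
# Toy training data (hypothetical)
train_toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Child    = c(0, 1, 0, 0, 1, 0),
  Survived = c(1, 1, 0, 0, 1, 0))

# One rule per Sex x Child combination: the observed survival rate
rules <- aggregate(Survived ~ Sex + Child, data = train_toy,
                   FUN = function(x) sum(x) / length(x))

# Toy test passengers (hypothetical)
test_toy <- data.frame(PassengerId = 1:3,
                       Sex   = c("male", "female", "male"),
                       Child = c(0, 0, 1))

# Look up the matching rule for each test passenger
pred <- merge(test_toy, rules, by = c("Sex", "Child"), all.x = TRUE)

# Predict survival when the rule's rate exceeds 0.5; unseen combinations get 0
pred$Survived <- as.integer(!is.na(pred$Survived) & pred$Survived > 0.5)

# merge() sorts by the join keys, so restore the submission order
pred <- pred[order(pred$PassengerId), ]
print(pred[, c("PassengerId", "Survived")])
```

With four predictors the same lookup yields the 70 rules mentioned on the slide, one per combination that occurs in train.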

Summary: results so far…

Model               Description                                                               Result
70-Rule Model       aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train,
                    FUN=function(x) {sum(x)/length(x)})                                       0.77512
Female OR Child     [test$Sex =='female' | test$Age < 10]                                     0.77033
Female              [test$Sex =='female']                                                     0.76555
Informed Guess      Data Visualization + Problem Domain info + manual typing 1,0 into .csv    0.70335
Random Guess        sample(c(1,0), 418, replace=TRUE)                                         0.50718
Everyone Perished   test$Survived <- 0                                                        0.62679
Everyone Survived   test$Survived <- 1                                                        0.37321

START: Is training data available?




Continuous Target


Categorical Target: Survived


Multivariate Classification

BINARY Classification == 1,0


glm, knn, qda, naiveBayes,

rpart, ctree, svm


randomForest, cforest



Titanic Dataset


Overview of Machine Learning Algorithms

QDA (0.75598) vs Logistic Regression (.76077)

• Linear model = straight line boundaries.

• Better fit for Titanic data set.

• Eager Learners. 2 step process: 1) Fit model using global info. 2) Predict test using reusable model.

• Polynomial model = curved boundaries.

Naïve Bayes (0.76555) vs. KNN (0.77990)

library(klaR)  # provides partimat()

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="sknn")
end <- (proc.time() - ptm)
# 769.72 milliseconds – MORE TIME CONSUMING but MORE CUSTOMIZED BOUNDARIES –> greater accuracy.

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="naiveBayes")
end <- (proc.time() - ptm)
# 39.99 milliseconds – only 5% of the knn time.

AdaBoost (0.77990 – same as KNN)

# rattle Model output
Summary of the Ada Boost model:

Call:
ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)], control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10), iter = 50)

Loss: exponential Method: discrete Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   0   1
         0 350  23
         1  45 205

Train Error: 0.109

Out-Of-Bag Error: 0.136 iteration= 50

Additional Estimates of number of iterations:
train.err1 train.kap1
        50         50

Variables actually used in tree construction:
[1] "Age" "FamilyID2" "Fare" "Sex" "Title"

Frequency of variables actually used:
FamilyID2 Fare Title Age Sex
       49   49    48  46   8

Time taken: 3.42 secs

Only 50 trees compared to 4000 trees in cforest, hence lower accuracy.
[Figure panels, SVM in 2D:]

linear, cost=1, 68% correct

radial, cost=100, 73.4% correct

polynomial, cost=10, 68% correct

sigmoid, cost=0.1, 66% correct

Support Vector Machines (2D)

SVM Kernels & Decision Boundary Shapes

• Linear: Line

• Radial: Circle

• Polynomial: C Curve

• Sigmoid: S Curve

"Goodness of Fit" – svm:

radial performed best with two

dimensions (.77033).

Scatterplots for visualizing SVM: 2D {ggplot2} qplot vs. 3-D {Rcmdr} scatter3d

# Interactive 3D hyperplane with spline
library(Rcmdr); attach(train)
scatter3d(Age,Survived,Fare)

# Point and Line ScatterPlot
library(ggplot2); attach(train)
qplot(Age, Fare, data=train, geom=c("point","line"), colour=Survived, main = "Titanic Passengers")

using 11 inputs

Advantages of SVM:

• Minimal pre-processing needed.

• Tuning improves accuracy.

• Helps reveal best fit


• Resistant to the "Curse of Dimensionality"

• Instead of worsening, accuracy improved when dimensions increased from 2 to 11.

0.79904: good, but still not better than cforest or randomForest


cforest (.81818) + Lifeboat Data Fusion = .83732

# Added 12 male survivors based on merged
# lifeboat data from Encyclopedia Titanica.
ciforest2 <- read.csv("ciforest2.csv")
testlb <- read.csv("test_lifeboats.csv")
ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
ensembles$Survived[ensembles$Lifeboat==1] <- 1
table(ensembles$Survived)
#   0   1
# 272 146
submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)

"Ensemble of ensembles":randomForest + cForest + random tiebreaker

# Code for 95/05 tiebreaker (score 0.81818)

# Merge randomForest and cforest and average the results. Reuse unanimous votes.
ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")
ensembles$Vote <- (as.numeric(ensembles$Survived.x) + as.numeric(ensembles$Survived.y))/2
ensembles$Survived[ensembles$Vote==1.0] <- 1
ensembles$Survived[ensembles$Vote==0.0] <- 0

# Create vector of 418 random 0s and 1s
set.seed(pi)
probs <- c(.95,.05)
ensembles$rvote <- sample(c(0,1), 418, replace = TRUE, prob=probs)

# For each tie, use a random vote
ensembles$Survived[ensembles$Vote==0.5] <- ensembles$rvote[ensembles$Vote==0.5]
table(ensembles$Survived)

  0   1
281 137

What if we combine results from randomForest and cforest? Use random tiebreaker for non-unanimous votes.

Results: Combinations did not outperform individuals,

even when lifeboat data was added.

Data mining using lifeboat info = competitive edge. 12

additional male survivors is highly significant because they

countered social norms and survived "against the odds".

Ensemble methods (randomForest, cforest) outperform

single classifiers. "Many models work better than one."

Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature

selection. Decision trees are great communication tools.

knn has same accuracy as glm and AdaBoost, but takes a lot

of processing time.

Simple rule-based models can outperform naiveBayes if features are chosen by Principal Components Analysis (PCA).

Social norms ("Women and Children First", "Male

survivors are cowards" ) greatly influenced survival.

Human decision-making outperforms random chance,

and can outperform machine learning (depending on the

human's expertise).

Math-based models like glm are sensitive to feature selection.

"Goodness of fit" determines performance. Linear and

radial (glm, svm:linear/radial) outperformed others


