
The Titanic: Machine Learning from Disaster

Data Mining and Machine Learning. Winter 2014. Final Project

Jean Callao | Michelle Darling | Paul Marxhausen

AGENDA

In depth analysis: by Jean Callao

• Logistic Regression: glm

• Tree-based methods: rpart, ctree

In depth analysis: by Paul Marxhausen

• Ensemble Methods: randomForest, cForest

Summary: by Michelle Darling

• Data Visualization

• Machine Learning Kaggle Results

Titanic: Machine Learning from Disaster

Why we picked this project:

● Historical context to understand "What does the data mean?"

● Learn one data set well, and then apply different algorithms and modelling tools.

● Practice the steps of data analysis:

○ Data exploration and visualization.

○ Model selection, building and testing.

● Prize: $0 + "knowledge & confidence" to go on to more challenging data science problems.

kaggle.com provides:

Online data science competitions.

Structured problems, tutorials, help forums and discussion groups.

Easy, consistent way to test models and track results.

>>> Focus <<<

April 1912

The Titanic Disaster

RMS Titanic, April 1912

A priori knowledge from problem domain

What factors contributed to survival?

Gender, Age, Passenger Class, Fare, Family

More likely to survive

• Females

• Children, Adults<50

• 1st Class

• Paid higher fares

• Travelling with family

More likely to perish

• Males

• Adults >50

• 2nd, 3rd class

• Paid lower fares

• Travelling alone

• Immigrants

Titanic Dataset: Predictor & Target Variables

RESPONSE VARIABLE    DESCRIPTION
Survived             1 = Yes; 0 = No

PREDICTOR VARIABLE   DESCRIPTION
Pclass               Passenger Class (1=1st; 2=2nd; 3=3rd)
Name                 Passenger Name
Sex                  Sex ("male", "female")
Age                  Age (Numeric fraction e.g., 1.5)
Fare                 Passenger Fare
SibSp                Number of Siblings/Spouses Aboard
Parch                Number of Parents/Children Aboard
Ticket               Ticket Number
Cabin                Cabin
Embarked             Port of Embarkation (C=Cherbourg; Q=Queenstown; S=Southampton)

QUANTITATIVE Variables; the rest are QUALITATIVE.

Feature Engineering

Data relating to one's location on the ship

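library(stringr)  # str_sub()

# Last character of the cabin string as a number; non-digits become NA
data$cabin.last.digit <- suppressWarnings(as.integer(str_sub(data$Cabin, -1)))

data$Side <- "Unknown"

# Even cabin numbers -> port, odd cabin numbers -> starboard
data$Side[which(data$cabin.last.digit %% 2 == 0)] <- "port"

data$Side[which(data$cabin.last.digit %% 2 == 1)] <- "starboard"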

Classifying Fares

combi$Fare2 <- '30+'

combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <-'20-30'

combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'

combi$Fare2[combi$Fare < 10] <- '<10'

Title - Extract from name to find wealthy passengers:

combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'

combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'

combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'

FamilySize - Combining spouse, siblings and parents

combi$FamilySize <- combi$SibSp + combi$Parch + 1
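The Title regrouping above presupposes that a Title column has already been extracted from Name, a step the slides do not show. A minimal sketch of one way to do it in base R, assuming names follow the Kaggle "Surname, Title. Given names" layout; the three-row `combi` data frame here is a stand-in for illustration, not the project's data:

```r
# Stand-in rows in the "Surname, Title. Given names" layout (assumed format).
combi <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina"),
  SibSp = c(1, 1, 0),
  Parch = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period after it.
combi$Title <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", combi$Name)

# FamilySize: siblings/spouses + parents/children + the passenger.
combi$FamilySize <- combi$SibSp + combi$Parch + 1

print(combi[, c("Title", "FamilySize")])
```

Multi-word titles such as 'the Countess' survive this pattern because it captures everything up to the first period.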

Decision Trees and

Logistic Regression

Presented by Jean Callao

Decision Trees

• A decision tree is a simple but powerful form of multiple-variable analysis. It displays a tree-like graph of decisions and their possible consequences.

• Recursive Partitioning -> at each step, we identify a question that we use to partition the data.

Advantages:

• Data-driven: Makes no prior assumptions; selects significant predictors based on the greatest information gain.

• Flexible: No data pre-processing needed! Handles numeric and categorical data.

• Easy to interpret and explain to others.
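To make "greatest information gain" concrete, here is a small sketch in base R that scores two candidate splits on made-up data. The helper names `entropy` and `info_gain` are hypothetical, and the eight toy passengers are invented for illustration. Note that rpart's default splitting criterion for classification is the Gini index; information gain is available through `parms = list(split = "information")`.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from partitioning labels y with a logical split.
info_gain <- function(y, split) {
  n <- length(y)
  left <- y[split]
  right <- y[!split]
  entropy(y) - (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# Made-up toy data: 1 = survived, 0 = perished.
survived <- c(1, 1, 1, 0, 0, 0, 0, 1)
sex      <- c("female", "female", "female", "male", "male", "male", "male", "male")
fare     <- c(80, 15, 30, 8, 7, 50, 9, 60)

gain_sex  <- info_gain(survived, sex == "female")
gain_fare <- info_gain(survived, fare >= 20)
print(c(sex = gain_sex, fare = gain_fare))
```

On this toy data the split on sex has the larger gain, so a tree grown with this criterion would ask that question first.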

Decision Tree with New Variables

library(rpart)   # rpart()
library(rattle)  # fancyRpartPlot()

tree <- rpart(Survived ~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,
              data = train, method = "class",
              control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))

fancyRpartPlot(tree)

Prediction <- predict(tree, test, type = "class")

table(Prediction)

Perished Survived

262 156

Decision Tree with New Variables

Root node -> 62% perished, 38% survived

Mr or Noble -> 84% perished, 16% survived

Not a Mr or Noble -> 28% perished, 72% survived

3rd class -> 52% perished, 48% survived

Not 3rd class -> 5% perished, 95% survived

Paid >= $23 -> 91% perished, 9% survived

Paid < $23 -> 38% perished, 62% survived

Age >= 36 yrs -> 86% perished, 14% survived

Age < 36 yrs -> 36% perished, 64% survived

Overfitted rpart Decision Tree

Disadvantages of rpart:

• Can suffer from:
  o High Variance
  o High Bias

• Decision tree algorithms can result in overly complex or overfitted trees.

Function ctree() in package party addresses these weaknesses by providing:

• Unbiased variable selection

• Statistical stopping rules to optimize tree growth.

Conditional Tree: ctree

library(party)  # ctree()

train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare + Title + Side, data = train)

plot(train.ctree)

Prediction2 <- predict(train.ctree , newdata=test, type="response")

table(Prediction2)

Perished Survived

256 162

Mr or Noble-> Side-> Port or Starboard: 40% chance of surviving, 60% chance of dying

Mr or Noble-> Side-> Unknown:


Not a Mr or Noble-> 1st or 2nd Class:


Not a Mr or Noble-> 3rd Class-> Pay <= $23.25


Not a Mr or Noble-> 3rd Class-> Pay > $23.25


Conditional Tree: ctree

Logistic Regression

Least squares linear regression

Predicted probabilities can be greater than 1 or less than 0 if used for classification!

LOGISTIC REGRESSION

• Used for a binary qualitative response.

• Using the logit ensures all probabilities are between 0 and 1.

Why use Logistic Regression?

Allows us to establish a relationship between a binary outcome variable and a group of predictor variables. Can be used as:

• CLASSIFICATION METHOD: Classifies binary response (e.g., Yes/No, Pass/Fail, Survived/Perished)

• REGRESSION METHOD: Calculates probability (0.0 to 1.0) of the response.

The "logit" model solves the problem:

$$\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X$$

Where:

• "p" is the probability that Y for cases equals 1, p(Y=1).

• "1 - p" is the probability that Y for cases equals 0.

Transformed, the "log odds" are linear: the log odds (logit) on the left equal the linear combination on the right.

Solving for the odds:

$$\frac{p}{1-p} = e^{B_0 + B_1 X}$$

Probability (logistic function), which produces an S-shaped curve:

$$p = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}}$$
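The relationship between log odds and probability can be checked numerically. The sketch below uses hypothetical coefficients (`b0`, `b1` are chosen only for illustration, not fitted to the Titanic data) and compares the hand-written logistic function with base R's `plogis()` and `qlogis()`:

```r
# Hypothetical coefficients, chosen only for illustration.
b0 <- -1.5
b1 <- 0.8
x  <- seq(-5, 10, by = 0.5)

log_odds <- b0 + b1 * x                   # linear in x
p <- exp(log_odds) / (1 + exp(log_odds))  # logistic function

# Base R provides the same pair of transforms.
stopifnot(all.equal(p, plogis(log_odds)))  # inverse logit
stopifnot(all.equal(log_odds, qlogis(p)))  # logit

# plot(x, p, type = "l")  # uncomment to see the S-shaped curve
print(range(p))
```

However large or small the linear combination becomes, the resulting probabilities stay strictly between 0 and 1.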

Confirming “women &

children first” policy

Titanic.glm <- glm(Survived ~ I(Sex=="female") + Class + I(Age<=10) + Embarked + Fare2,
                   data = train, family = binomial("logit"))

table(test$Survived)

Perished Survived

252 166

summary(Titanic.glm)

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
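As an illustration of reading a coefficient as a change in log odds, the sketch below fits a logistic regression to simulated data. This is not the Titanic set: the sample size, the "true" coefficients and the variable names are assumptions made up for the example.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated toy data (not the Titanic set): one binary predictor.
n <- 500
female <- rbinom(n, 1, 0.4)
true_log_odds <- -1 + 2 * female  # assumed "true" model
survived <- rbinom(n, 1, plogis(true_log_odds))

fit <- glm(survived ~ female, family = binomial("logit"))
b <- coef(fit)

# A one-unit increase in `female` adds b["female"] to the log odds,
# i.e. it multiplies the odds by exp(b["female"]).
odds_ratio <- exp(b[["female"]])

# Predicted probabilities follow from the inverse logit of the log odds.
p_male   <- plogis(b[["(Intercept)"]])
p_female <- plogis(b[["(Intercept)"]] + b[["female"]])
print(c(odds_ratio = odds_ratio, p_male = p_male, p_female = p_female))
```

Exponentiating a coefficient gives an odds ratio, which is often easier to explain to a non-technical audience than a shift in log odds.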

Making Predictions

Both estimates use the logistic function $p = \frac{e^{z}}{1 + e^{z}}$, where z is the fitted linear combination.

A female passenger who is 10 years old has an estimated survival probability of 0.99, with z formed from the terms 12.3958, 2.6816(1) and 1.6133(10).

A 2nd class man who paid 20 dollars for a ticket has an estimated survival probability of 0.70, with z formed from the terms 12.3958, 2.6816(0), (0.9530)(2) and (0.6531)(20).

Interpreting Coefficients…

summary(Titanic.glm)

Estimate > 0 higher probability of

surviving

Estimate < 0 lower probability of

surviving

Passengers travelling with relatives

have higher chances of survival.

Titanic.glm2<- glm(Survived ~ Class+I(FamilySize>=2) + Parch+I(SibSp>=2),data = train, family=binomial("logit"))


Perished Survived

276 142

summary(Titanic.glm2)

We see that Pclass is a strong predictor supporting the hypotheses about:

• location on the ship

• lifeboat access.

First class adult males

have lower chances of survival

Titanic.glm3<- glm(Survived ~ Class + I(Title=="Mr")+ I(Title=="Noble") + I(Age>=30 & Age<=50)+I(Fare>=27),data = train, family=binomial("logit"))


Perished Survived

239 179


"Any data relating to one's location on the ship could

prove helpful to survival predictions…"

First class adult males had

lower chances of survival

summary(Titanic.glm3)

Those in upper decks (1st class) had more timely, accurate information and a shorter journey to the lifeboats… Yet why did 1st Class Males have lower survival rates?

Possible explanation:

• 1st Class Males were expected to be

"gentlemen" and perish with the ship.

"No woman shall be left aboard this ship

because Ben Guggenheim was a coward."

• 1st Class Male Survivors were

condemned by society:

> Bruce Ismay – had to resign as

Chairman of White Star Line.

> William Carter – divorced by wife.

Third class adult males had

lower chance of survival


Those located in the bow or

lower decks (3rd Class) had less

chance of survival.

Titanic.glm4 <- glm(Survived ~ Class + I(Age>=30 & Age<=65) + I(Title=="Mr" & Class=="Third") + I(Fare<=10), data = train, family = binomial("logit"))


Perished Survived

258 160

Ensemble Methods:

randomForest and cforest

Presented by Paul Marxhausen

Random Forests

Advantages:

• Easy to use: can be used quite efficiently with default parameters.

• Ideal for people without a deep background in statistics.

• Produces fairly strong predictions with only a small amount of coding.

What is an ensemble?

• Everyday meaning: a group of actors who perform together.

• Random Forests are an example of an ENSEMBLE METHOD: they combine multiple models to produce one result.

• Unlike single decision trees, which can suffer from high variance or high bias, Random Forests use random sampling and averaging to find a natural balance between the two extremes.

Random Forests: Randomness logic built-in and Data pre-processing

Randomness logic used

• Built-in; random rows (bagging) and columns (mtry) as part of fitting with training data.
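The two sources of randomness can be sketched in a few lines of base R. This only mimics the sampling step, not the randomForest implementation itself; the 10-row, 9-column `train_toy` data frame is made up, and `floor(sqrt(p))` is used as the number of candidate columns, following the mtry default quoted later in these slides.

```r
set.seed(415)  # same seed value as on the slides; any integer works

# Made-up stand-in for the training set: 10 rows, 9 predictors.
train_toy <- as.data.frame(matrix(rnorm(90), nrow = 10))
p <- ncol(train_toy)

# Row randomness (bagging): a bootstrap sample of rows, drawn with replacement.
boot_rows <- sample(nrow(train_toy), replace = TRUE)
oob_rows  <- setdiff(seq_len(nrow(train_toy)), boot_rows)  # "out-of-bag" rows

# Column randomness: mtry candidate predictors for one split.
mtry <- floor(sqrt(p))
split_candidates <- sample(names(train_toy), mtry)

print(boot_rows)
print(oob_rows)
print(split_candidates)
```

In a forest, a fresh bootstrap sample is drawn for every tree and a fresh set of candidate columns for every split; rows left out of a tree's sample are what out-of-bag error estimates are computed on.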

Restrictions / Disadvantages

• Data has to be pre-processed to remove NAs, NULLs, blanks.

• We have to fix Age, Embarked, Fare and FamilyID to meet these requirements.

• Factor levels must be <=32 for FamilyID (start with ~double)

DATA PRE-PROCESSING TASKS

USING COMBINED DATA

• Age (263 NAs) => rpart/predict

• Embarked (2 blanks) => assign

• Fare (1 NA) => median

• FamilyID (exceeded levels) => re-group (now 22 levels)

# Replace Fare NAs (see example)

which(is.na(combi$Fare))

combi$Fare[1044] <-median(combi$Fare, na.rm=TRUE)

Model: randomForest(…) using

‘randomForest’ package

# Build Random Forest Ensemble

set.seed(415) # two sources of randomness

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2, data=train, importance=TRUE, ntree=2000)

# generate importance graphs

varImpPlot(fit)

# Now let's make a prediction

Prediction <- predict(fit, test)

Kaggle.com score

0.81818 Surprised

cforest(…): type of random forest; implementation using party package

# Build conditional inference forest

fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID, data = train, controls=cforest_unbiased(ntree=4000, mtry=2))

# Now let's make a prediction and write a submission file

Prediction <- predict(fit, test, OOB=TRUE, type = "response")

kaggle score after parameter adjustments

0.81818 Surprised Again !!!

randomForest package vs. party package

randomForest package

• randomForest(…) function

• mtry is floor(sqrt(p)), which is the number of features to randomly select at each split.

• randomForest is computationally faster.

• Popular in applied research.

party package

• cforest(…) function

• mtry set to the number 5 by default for technical reasons.

• Resulting forests are unbiased if the predictor variables are of different types.

• Ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental.

Model                    Description                                             Result    Leader Board
fit <- randomForest(…)   Traditional Random Forest (randomForest package)       0.81818   03/20/14
fit <- cforest(…)        Conditional Inference tree (party package)             0.81340   03/20/14
fit <- cforest(…)        Changed ntree from 2000 to 4000, and mtry from 3 to 2  0.81818   03/22/14

Ensemble Methods: kaggle results

Summary:

Data Visualization

Algorithms & kaggle results

by Michelle Darling

www.datastudentblog.wordpress.com

Data Visualization

Summary

1. Created Conceptual Data Model

• to understand denormalized data file.

2. Tried lots of visualizations:

• Categorical vs. Continuous

• Uni-, Bi- and Multivariate

3. Compared datasets:

• Titanic vs. train vs. test ARE similar

4. Created rule-based models using the most significant predictors:

• Sex == "female"

• Sex == "female" OR Age < 10

• Sex:Child:Fare:FamilySize

Data Visualization prototyping tools:

• MS Excel

• wordle.net

• Google Fusion

• R {rattle} package

PORT: Embarked (S=Southampton, C=Cherbourg, Q=Queenstown)

TICKET: Ticket, Pclass, Cabin

PASSENGER: PassengerID, Name, Age, SibSp, Parch, Fare, Survived

Text Analysis of

Passenger Name

SURVIVORS PERISHED

Word Clouds created in www.wordle.net for Survivors$Name and Perished$Name

Survivors <- train[train$Survived==1,]; Perished <- train[train$Survived==0,]

Sex ("male" vs. "female") is

an important predictor of survival.

http://www.wordle.net/

Do family members affect survival?

> table(Survived, Parch)
        Parch
Survived   0   1   2   3   4   5   6
       0 445  53  40   2   4   4   1
       1 233  65  40   3   0   1   0

> table(Survived, SibSp)
        SibSp
Survived   0   1   2   3   4   5   8
       0 398  97  15  12  15   5   7
       1 210 112  13   4   3   0   0

Survival is higher for passengers with Parch==3 (60%), or SibSp==1 (54%)
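The quoted percentages can be reproduced from the counts in the two tables above. The sketch below re-enters those counts by hand (the matrix names `parch` and `sibsp` are just labels for this example) and uses `prop.table()` to get the share of each group that survived:

```r
# Counts copied from the table(Survived, Parch) and table(Survived, SibSp) output.
parch <- matrix(c(445, 53, 40, 2, 4, 4, 1,
                  233, 65, 40, 3, 0, 1, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Survived = c("0", "1"),
                                Parch = c("0", "1", "2", "3", "4", "5", "6")))
sibsp <- matrix(c(398,  97, 15, 12, 15, 5, 7,
                  210, 112, 13,  4,  3, 0, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Survived = c("0", "1"),
                                SibSp = c("0", "1", "2", "3", "4", "5", "8")))

# Column-wise proportions: the share of each group that survived.
parch_rate <- prop.table(parch, margin = 2)["1", ]
sibsp_rate <- prop.table(sibsp, margin = 2)["1", ]
print(round(parch_rate, 2))
print(round(sibsp_rate, 2))
```

Group sizes matter when reading these rates: the Parch==3 group contains only 5 passengers, so its 60% rests on far less data than the 54% for SibSp==1 (209 passengers).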

What is the relationship between:Embarked, Pclass, Ticket, Fare?

Cherbourg, France

Southampton, England

Queenstown, Ireland

All three Embarked Ports (C,Q,S) boarded passengers from all classes (1st, 2nd, 3rd).

But 50% of Cherbourg Passengers were 1st Class; they paid much higher fares (blue spikes).

Based on this, Fare is likely a stronger predictor of survival than Embarked.

Graph created in MSExcel using data from table(Embarked, Pclass, Fare, Ticket)

Google Fusion TablesGeospatial Heatmap, Network Diagrams

Google Fusion Heatmap

GEOCODED by Embarkation Port:

• Southampton, UK -- 644 passengers

• Cherbourg, France -- 168 passengers

• Queenstown, Ireland – 77 passengers

No Lifeboat

SURVIVORS PERISHED

Network Diagrams showing

Lifeboats (orange) vs. Embarkation Port (blue)

Based on external data (Encyclopedia Titanica)

imported into Google Fusion Tables.

Rule-Based Models: Everyone Survived vs. Everyone Perished

# Model: Everyone survived
test$Survived <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)

Result: 0.37321 ☹

# Model: Everyone perished
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)

Result: Your Best Entry: 0.62679 ☺

You improved on your best score by 0.25359.

You just moved up 12 positions on the leaderboard.

Survival rate for test is similar to RMS Titanic

Rule-Based Models: Random vs. Informed Guess

# Model: Random Guess

test$Survived <- sample(c(0,1), 418, replace = TRUE)

submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)

Your submission scored 0.50718 ☹, which is not an improvement of your best score.

Model: Informed Guess

● Used problem domain info, data visualizations and intuition to make an "informed guess" about each passenger.

● Manually typed in 1,0 into test.csv file with 418 rows…

Your Best Entry: 0.70335! ☺

You improved on your best score by 0.07656!

Process is similar to

everyday human

decision-making

(no machine learning).

Score is much better

than random chance!

Data Visualization in R

R Visualization Packages:

• Base R: plot, barplot, boxplot, hist, dotchart, heatmap, pairs

• ggplot2: qplot, ggplot

• lattice: xyplot, dotplot, parallelplot

• vcd: "Visualizing Categorical Data" mosaic, assoc

• Rcmdr: "Rcommander" scatter3d

• rattle: Explore Tab.

latticist, ggobi

Continuous vs. Discrete (Categorical) Variables

CORRELOGRAM: {base R} pairs()

# Select the columns to compare (data.frame subset instead of as.data.frame on loose vectors)
t <- train[, c("Survived", "Pclass", "Sex", "Age", "Fare", "Embarked", "SibSp", "Parch")]

pairs(t, col=t$Pclass+2) # Shift base R color palette by 2
# 1st class – green (1+2=3)
# 2nd class – blue (2+2=4)
# 3rd class – cyan (3+2=5)
# base R Color Wheel is not very subtle!

• Correlogram is meant to show pair-wise relationships.

• Continuous variables appear as "clouds"

• Discrete variables appear as "bands"

Which variables are correlated?(Models perform better when variables are independent!)

Correlation plots created using {rattle} R package

FamilySize

SibSp

Parch

Fare3

Fare

Age

Continuous, Multivariate

Marginal Plots:

{rattle} latticist

• {rattle} is an R package

• latticist is an interactive GUI

for Data Visualization

Continuous, Multivariate

Intensity Map{base R} heatmap()

• Useful for visualizing and

comparing data sets.

• Requires a data matrix.

• Values must be numeric (recode qualitative variables e.g.,

Pclass, Gender).

• Can use custom color palette

(e.g., RColorBrewer)

test does not have a

Survived attribute.

PassengerID 1:891 (train) 892:1309 (test)891 obs. 418 obs.

train is representative of test.

"Soup Analogy": values look like

they are randomly distributed and

"well-stirred" – no big chunks of

dark or light bands.

Models based on train can be used

to predict test fairly accurately.

Continuous, Univariate

Histogram: {base R} hist()

Show range, density

and distribution of a

single, continuous

variable.

# Use 2X2 grid
par(mfrow=c(2,2))
hist(test$Age)
hist(test$Fare)
hist(train$Age)
hist(train$Fare)

"Small Multiples" concept by Tufte: Displaying multiple small plots side-by-side is effective for analysis.

test and train have

similar distributions for

continuous variables.

"Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))

Categorical, Univariate

Bar Plots: {base R} barplot()

test and train have similar

distributions for

categorical variables.


Dot Plot: {lattice} dotplot()

library(lattice)
attach(train)
# Each dot is a passenger.
# Survived==1 Red
# Survived==0 Black

# col = Survived + 1 maps 0 -> color 1 (black) and 1 -> color 2 (red)
dotplot(Age, pch=1, col=Survived+1, main="train$Age")

dotplot(Fare, pch=1, col=Survived+1, main="train$Fare")

cluster of survivors (young children)

outliers

cluster of perished passengers (who paid lowest fares).


Box Plot: {Base R} boxplot()

Shows interquartile range (IQR),

Median, outliers.

# Plot Age grouped by Pclass
par(mfrow=c(1,2))
Survivors <- train[train$Survived==1,]
Perished <- train[train$Survived==0,]

boxplot(Age ~ Pclass, data = Survivors, col = "light blue", main="Survived", xlab="Passenger Class", ylab="Age")

boxplot(Age ~ Pclass, data = Perished, col = "gray", main="Perished", xlab="Passenger Class", ylab="Age")

Survivors had younger age

range compared to perished across

all three passenger classes.

[Box plot annotations, medians in the order labelled on the figure: 33.50, 28.00, 27.00, 28.00, 30.00, 38.50]

Categorical, Multivariate

Spine Plot = 3 Bar Plots

All passengers (891): 314 female (35%), 577 male (65%)

Survived: 233 female (68%), 109 male (32%)

Perished: 81 female (15%), 468 male (85%)

FEMALES: greater than expected survival rate

MALES: greater than expected mortality rate

Class: mutually exclusive, rectilinear partition. E.g., Female Survivors

Probability: frequency count/whole set. E.g., 233/342 = 68% of survivors were female.

Spine Plot is a visualization of a

rules-based model; it exhaustively

describes the feature space = Titanic

Passengers (female vs male)

Categorical, Multivariate Spine Plot: {base R} spineplot()

Indicates a higher

than expected survival rate.

Visualization of a contingency table.

vcd = "Visualizing Categorical Data"

Blue – High Probability

Gray – Neutral

Red – Low Probability

Example: 3rd Class Male (Sex==male & Pclass==3)

• High Probability: Survived==0

• Low Probability: Survived==1

# Mosaic Plot
library(vcd)
attach(train)
t <- table(Sex, Survived, Child)
mosaic(t, shade=TRUE, main="train dataset")

Categorical, Multivariate

Mosaic Plot: {vcd} mosaic()

[Mosaic plot panels: female adults, female children, male adults, male children; 60% Perished, 40% Survived]

females (survived): 36% of all passengers, 77% of all survivors

male adults (perished): 61% of all passengers, 83% of all who perished

Similar: Mosaic Plot and Decision Tree

[Decision tree leaves: male adults (perished), male children (survived), females (survived), males (perished); 60% Perished, 40% Survived]

Rule-Based Models: "Females" / "Women or Children"

# Model: Females Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)

Your Best Entry: 0.76555 ☺

You improved on your best score by 0.06220.

# Model: Women OR Children Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
test$Survived[test$Age<10] <- 1
# Tried different age cutoffs until score improved.
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)

Your Best Entry: 0.77033 ☺

You improved on your best score by 0.00478

Rule-based model (70 rules)

Sex : Child : Fare2 : FamilySize

• Inspired by Principal Components Analysis (PCA)

• Performed better than naiveBayes, qda, glm, svm (radial, sigmoid, polynomial)!

aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train, FUN=function(x) {sum(x)/length(x)})
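The slides do not show how the aggregated survival rates become predictions. One plausible way, sketched below on made-up toy rows, is to treat each observed combination as a rule and look it up for every test passenger. The 0.5 cutoff, the default of "perished" for combinations not seen in training, and the `train_toy` / `test_toy` data are all assumptions for illustration, and only two of the four predictors are used to keep the example short.

```r
# Made-up toy rows; column names follow the slides, values are illustrative.
train_toy <- data.frame(
  Survived = c(1, 1, 0, 0, 0, 1, 0, 1),
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Child    = c(0, 1, 0, 0, 0, 1, 0, 0)
)
test_toy <- data.frame(
  PassengerId = 1:3,
  Sex   = c("female", "male", "male"),
  Child = c(0, 0, 1)
)

# One rule per observed combination: the survival rate in that cell.
rules <- aggregate(Survived ~ Sex + Child, data = train_toy, FUN = mean)
names(rules)[names(rules) == "Survived"] <- "Rate"

# Look up each test passenger's cell; predict 1 when the rate exceeds 0.5
# (assumed cutoff). Unseen combinations default to 0.
pred <- merge(test_toy, rules, by = c("Sex", "Child"), all.x = TRUE)
pred$Survived <- as.integer(!is.na(pred$Rate) & pred$Rate > 0.5)
pred <- pred[order(pred$PassengerId), c("PassengerId", "Survived")]
print(pred)
```

With four predictors the same lookup simply uses more key columns in `by`; the number of rules equals the number of distinct combinations present in the training data.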

Model               Description                                                         Result
70-Rule Model       aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train,
                    FUN=function(x) {sum(x)/length(x)})                                 0.77512
Female OR Child     [test$Sex=='female' | test$Age < 10]                                0.77033
Female              [test$Sex=='female']                                                0.76555
Informed Guess      Data Visualization + Problem Domain info + manual typing
                    1,0 into .csv file.                                                 0.70335
Random Guess        sample(c(1,0), 418, replace=TRUE)                                   0.50718
Everyone Perished   test$Survived <- 0                                                  0.62679
Everyone Survived   test$Survived <- 1                                                  0.37321

Summary: kaggle.com results so far…

START: Is training data available?

No UNSUPERVISED LEARNING

Yes -- train.csv SUPERVISED LEARNING

Continuous Target

REGRESSION

Categorical Target: Survived

CLASSIFICATION

Multivariate Classification

BINARY Classification == 1,0

SINGLE CLASSIFIERS

glm, knn, qda, naiveBayes,

rpart, ctree, svm

ENSEMBLE METHODS

randomForest, cforest

Machine

Learning:

Titanic Dataset

Overview of Machine Learning Algorithms

QDA (0.75598) vs Logistic Regression (.76077)

Logistic Regression:

• Linear model = straight line boundaries.

• Better fit for Titanic data set.

QDA:

• Polynomial model = curved boundaries.

Both:

• Eager Learners. 2 step process: 1) Fit model using global info. 2) Predict test using reusable model.

Naïve Bayes (0.76555) vs. KNN (0.77990)

library(klaR)  # partimat()

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="sknn")
end <- (proc.time() - ptm)
# 769.72 milliseconds – MORE TIME CONSUMING but MORE CUSTOMIZED BOUNDARIES –> greater accuracy.

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="naiveBayes")
end <- (proc.time() - ptm)
# 39.99 milliseconds – only 5% of the knn time.

AdaBoost (0.77990 – same as KNN)

# rattle Model output
Summary of the Ada Boost model:

Call:
ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)],
    control = rpart.control(maxdepth = 30, cp = 0.01, minsplit = 20, xval = 10), iter = 50)

Loss: exponential Method: discrete Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   0   1
         0 350  23
         1  45 205

Train Error: 0.109

Out-Of-Bag Error: 0.136 iteration= 50

Additional Estimates of number of iterations:
train.err1 train.kap1
        50         50

Variables actually used in tree construction:
[1] "Age" "FamilyID2" "Fare" "Sex" "Title"

Frequency of variables actually used:
FamilyID2 Fare Title Age Sex
       49   49    48  46   8

Time taken: 3.42 secs

Only 50 trees compared

to 4000 trees in

cforest, hence lower

performance.

Examples of AdaBoost "weak learner" trees:1,3,10,20,35,47. Total: 50 trees

linear, cost=1, 68% correct

radial, cost=100, 73.4% correct

polynomial, cost=10, 68% correct

sigmoid, cost=0.1, 66% correct

Support Vector Machines (2D)

SVM Kernels & Decision Boundary Shapes

• Linear: Line

• Radial: Circle

• Polynomial: C Curve

• Sigmoid: S Curve

"Goodness of Fit" – svm:

radial performed best with two

dimensions (.77033).

Scatterplots for visualizing SVM 2D {ggplot2} qplot vs. 3-D {Rcmdr} scatter3d

# Interactive 3D hyperplane with spline
library(Rcmdr); attach(train)
scatter3d(Age, Survived, Fare)

# Point and Line ScatterPlot
library(ggplot2); attach(train)
qplot(Age, Fare, data=train, geom=c("point","line"), colour=Survived, main = "Titanic Passengers")

SVM

using 11 inputs

Advantages of SVM:

• Minimal pre-processing needed.

• Tuning improves accuracy.

• Helps reveal best fit

(linear/poly/radial/sigmoid).

• Resistant to the "Curse of Dimensionality" on this data set.

• Instead of worsening, accuracy improved when dimensions increased from 2 to 11 attributes.

0.79904: good, but still not better than cforest or randomForest (0.81818)

cforest (.81818) + Lifeboat Data Fusion = .83732

# Added 12 male survivors based on merged
# lifeboat data from Encyclopedia Titanica.
ciforest2 <- read.csv("ciforest2.csv")
testlb <- read.csv("test_lifeboats.csv")
ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
ensembles$Survived[ensembles$Lifeboat==1] <- 1
table(ensembles$Survived)
#   0   1
# 272 146
submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)

"Ensemble of ensembles":randomForest + cForest + random tiebreaker

# Code for 95/05 tiebreaker (score 0.81818)

# Merge randomForest and cForest and average
# the results. Reuse unanimous votes.
ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")
ensembles$Vote <- (as.numeric(ensembles$Survived.x) + as.numeric(ensembles$Survived.y))/2
ensembles$Survived[ensembles$Vote==1.0] <- 1
ensembles$Survived[ensembles$Vote==0.0] <- 0

# Create vector of 418 random 0s and 1s
set.seed(pi)
probs <- c(.95,.05)
ensembles$rvote <- sample(c(0,1), 418, replace = TRUE, prob=probs)

# For each tie, use a random vote
ensembles$Survived[ensembles$Vote==0.5] <- ensembles$rvote[ensembles$Vote==0.5]
table(ensembles$Survived)

  0   1
281 137

What if we combine results from randomForest and

cForest? Use random tiebreaker for non-unanimous votes.

Results: Combinations did not outperform individuals,

even when lifeboat data was added.

Data mining using lifeboat info = competitive edge. 12

additional male survivors is highly significant because they

countered social norms and survived "against the odds".

Ensemble methods (randomForest, cforest) outperform

single classifiers. "Many models work better than one."

Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature

selection. Decision trees are great communication tools.

knn has same accuracy as glm and AdaBoost, but takes a lot

of processing time.

Simple rule-based models can outperform naiveBayes if features are chosen by Principal Components Analysis (PCA).

Social norms ("Women and Children First", "Male

survivors are cowards" ) greatly influenced survival.

Human decision-making outperforms random chance,

and can outperform machine learning (depending on the

human's expertise).

Math-based models like glm are sensitive to feature selection.

"Goodness of fit" determines performance. Linear and

radial (glm, svm:linear/radial) outperformed others

(qda,svm:polynomial/sigmoid).

Machine Learning Summary