Machine learning from disaster - GL.Net 2015

MACHINE LEARNING FROM DISASTER Gloucestershire .NET User Group @glnetgroup Phil Trelford 2015 @ptrelford


RMS Titanic

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

…there were not enough lifeboats for the passengers and crew.

…some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Kaggle competition

Kaggle Titanic dataset

train.csv

test.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S
13,0,3,"Saundercock, Mr. William Henry",male,20,0,0,A/5. 2151,8.05,,S
14,0,3,"Andersson, Mr. Anders Johan",male,39,1,5,347082,31.275,,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14,0,0,350406,7.8542,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55,0,0,248706,16,,S
17,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
21,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
22,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,D56,S
23,1,3,"McGowan, Miss. Anna ""Annie""",female,15,0,0,330923,8.0292,,Q
24,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
25,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
26,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38,1,5,347077,31.3875,,S
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
28,0,1,"Fortune, Mr. Charles Alexander",male,19,3,2,19950,263,C23 C25 C27,S

Titanic Data

Variable  Description
survival  Survival (0 = No; 1 = Yes)
pclass    Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name      Name
sex       Sex
age       Age
sibsp     Number of Siblings/Spouses Aboard
parch     Number of Parents/Children Aboard
ticket    Ticket Number
fare      Passenger Fare
cabin     Cabin
embarked  Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Tips:

* Empty floats are read as Double.NaN
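The tip about empty floats can be sketched in Python, the language of the Machine Learning in Action samples the deck ports from. This is a minimal sketch, assuming rows in the train.csv layout; `SAMPLE` and `parse_float` are illustrative names and not part of the talk's code.

```python
import csv
import io
import math

# Three rows in the train.csv layout; passenger 6 has no Age.
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
'''

def parse_float(text):
    """Return the field as a float, or NaN when the field is empty."""
    return float(text) if text.strip() else float("nan")

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
ages = [parse_float(row["Age"]) for row in rows]

# NaN never compares equal to anything, so filter with math.isnan
# before aggregating; otherwise the mean itself becomes NaN.
known_ages = [age for age in ages if not math.isnan(age)]
mean_age = sum(known_ages) / len(known_ages)
```

Any comparison against NaN is false, so a predicate such as "age below some threshold" quietly excludes passengers with a missing age.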

DATA ANALYSIS

Titanic: Machine Learning from Disaster

FSharp.Data: CSV Provider

Counting

let female (passenger:Passenger) = passenger.Sex = "female"
let survived (passenger:Passenger) = passenger.Survived = 1

let females = passengers |> Array.filter female
let femaleSurvivors = females |> tally survived
let femaleSurvivorsPc = females |> percentage survived

Tally Ho!

/// Tally up items that match specified criteria
let tally criteria items =
    items |> Array.filter criteria |> Array.length

/// Percentage of items that match specified criteria
let percentage criteria items =
    let total = items |> Array.length
    let count = items |> tally criteria
    float count * 100.0 / float total

Survival rate

/// Survival rate of each group produced by the criteria
let survivalRate criteria =
    passengers
    |> Array.groupBy criteria
    |> Array.map (fun (key, matching) ->
        key, matching |> percentage survived)

let embarked = survivalRate (fun p -> p.Embarked)
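The grouping step can be sketched in Python as follows. This is an illustration under assumptions rather than the talk's code: the tuples hold only sex, class and survival for the first ten rows of the train.csv sample, and `survival_rate` takes the passenger list as an explicit argument.

```python
from collections import defaultdict

# (sex, pclass, survived) for the first ten rows of the train.csv sample.
PASSENGERS = [
    ("male", 3, 0), ("female", 1, 1), ("female", 3, 1), ("female", 1, 1),
    ("male", 3, 0), ("male", 3, 0), ("male", 1, 0), ("male", 3, 0),
    ("female", 3, 1), ("female", 2, 1),
]

def percentage(criteria, items):
    """Percentage of items that match the criteria."""
    count = sum(1 for item in items if criteria(item))
    return count * 100.0 / len(items)

def survival_rate(key, passengers):
    """Group passengers by key, then compute each group's survival percentage."""
    groups = defaultdict(list)
    for passenger in passengers:
        groups[key(passenger)].append(passenger)
    return {
        group: percentage(lambda p: p[2] == 1, members)
        for group, members in groups.items()
    }

by_sex = survival_rate(lambda p: p[0], PASSENGERS)
by_class = survival_rate(lambda p: p[1], PASSENGERS)
```

On ten rows the percentages are far too extreme to trust; the point is only the group-then-measure shape of the computation.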

Score

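/// Percentage of passengers for whom the prediction f matches the actual outcome
let score f = passengers |> percentage (fun p -> f p = survived p)

// child is a Passenger predicate like female; its definition is not shown on the slide
let rate = score (fun p -> (child p || female p) && not (p.Pclass = 3))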
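A Python sketch of the same scoring idea, run on the first ten sample rows. The deck does not show how `child` is defined, so the age threshold of 16 used here is purely an assumption, as are the tuple layout and helper names.

```python
# (sex, age, pclass, survived) for the first ten sample rows; NaN marks a missing age.
NAN = float("nan")
PASSENGERS = [
    ("male", 22, 3, 0), ("female", 38, 1, 1), ("female", 26, 3, 1),
    ("female", 35, 1, 1), ("male", 35, 3, 0), ("male", NAN, 3, 0),
    ("male", 54, 1, 0), ("male", 2, 3, 0), ("female", 27, 3, 1),
    ("female", 14, 2, 1),
]

CHILD_AGE = 16  # assumption: the deck does not define `child`

def female(p):
    return p[0] == "female"

def child(p):
    return p[1] < CHILD_AGE  # a NaN age compares False, so it is not a child

def survived(p):
    return p[3] == 1

def score(predict, passengers):
    """Percentage of passengers whose predicted outcome equals the actual one."""
    hits = sum(1 for p in passengers if predict(p) == survived(p))
    return hits * 100.0 / len(passengers)

# Same rule as the slide: women and children survive, unless in third class.
rate = score(lambda p: (child(p) or female(p)) and p[2] != 3, PASSENGERS)
```

On these ten rows the rule misses only the two third-class women who survived, which gives a score of 80.0.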

MACHINE LEARNING


20 Questions

The game suggests that the information (as measured by Shannon's entropy statistic) required to identify an arbitrary object is at most 20 bits. The game is often used as an example when teaching people about information theory.

Mathematically, if each question is structured to eliminate half the objects, 20 questions will allow the questioner to distinguish between 2^20 or 1,048,576 objects.
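The arithmetic behind the game can be checked with a short Python sketch. The function names are illustrative; the entropy formula is the standard Shannon definition in bits.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def questions_needed(object_count):
    """Yes/no questions needed when each answer halves the candidates."""
    return math.ceil(math.log2(object_count))

# One perfectly balanced yes/no answer carries one bit.
coin = shannon_entropy([0.5, 0.5])

# A uniform choice among 1024 objects carries log2(1024) bits.
uniform_1024 = shannon_entropy([1 / 1024] * 1024)

# Twenty halving questions distinguish 2**20 objects.
distinguishable = 2 ** 20
```

A lopsided question (one whose answer is nearly always "no") carries much less than one bit, which is why decision tree learners prefer splits that reduce entropy the most.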

Decision Trees

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning.

Split data set (from ML in Action)

Python

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

F#

let splitDataSet(dataSet, axis, value) =
    [|for featVec in dataSet do
        if featVec.[axis] = value then
            yield featVec |> Array.removeAt axis|]

Decision Tree

let labels =
    [|"sex"; "class"|]

let features (p:Passenger) : obj[] =
    [|p.Sex; p.Pclass|]

// Each row holds the features followed by the class label (survived or not)
let dataSet : obj[][] =
    [|for passenger in passengers ->
        [|yield! features passenger
          yield box (passenger.Survived = 1)|] |]

let tree = createTree(dataSet, labels)
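The deck calls createTree without showing it. A hedged Python sketch of recursive partitioning in the spirit of the Machine Learning in Action approach follows; it picks the split with the largest information gain. The function names, the nested-dict tree shape, and the use of the 28 sample rows as training data are all assumptions made for illustration.

```python
import math
from collections import Counter

# (sex, pclass, survived) for the 28 sample rows from train.csv.
M, F = "male", "female"
ROWS = [
    (M, 3, False), (F, 1, True), (F, 3, True), (F, 1, True), (M, 3, False),
    (M, 3, False), (M, 1, False), (M, 3, False), (F, 3, True), (F, 2, True),
    (F, 3, True), (F, 1, True), (M, 3, False), (M, 3, False), (F, 3, False),
    (F, 2, True), (M, 3, False), (M, 2, True), (F, 3, False), (F, 3, True),
    (M, 2, False), (M, 2, True), (F, 3, True), (M, 1, True), (F, 3, False),
    (F, 3, True), (M, 3, False), (M, 1, False),
]

def entropy(rows):
    """Shannon entropy (bits) of the class label, stored last in each row."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, axis, value):
    """Rows whose feature at axis equals value, with that feature removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature whose split gives the largest information gain."""
    def gain(axis):
        subsets = [split(rows, axis, v) for v in set(row[axis] for row in rows)]
        remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets)
        return entropy(rows) - remainder
    return max(range(len(rows[0]) - 1), key=gain)

def create_tree(rows, labels):
    """Recursively partition rows into a nested dict; leaves are class labels."""
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1:      # pure subset: stop
        return classes[0]
    if len(rows[0]) == 1:           # no features left: majority vote
        return Counter(classes).most_common(1)[0][0]
    axis = best_feature(rows)
    rest = labels[:axis] + labels[axis + 1:]
    values = sorted(set(row[axis] for row in rows))
    return {labels[axis]: {v: create_tree(split(rows, axis, v), rest) for v in values}}

tree = create_tree(ROWS, ["sex", "class"])
```

With this sample the root split is on sex, and the male branch predicts survival only for second class. That leaf rests on just three passengers, which is the kind of over-specific rule the next slide's title refers to.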

Overfitting

CLASSIFY


Decision Tree: Create -> Classify

let rec classify(inputTree, featLabels:string[], testVec:obj[]) =
    match inputTree with
    | Leaf(x) -> x
    | Branch(s,xs) ->
        let featIndex = featLabels |> Array.findIndex ((=) s)
        xs |> Array.pick (fun (value,tree) ->
            if testVec.[featIndex] = value
            then classify(tree, featLabels,testVec) |> Some
            else None
        )
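The same walk can be sketched in Python over a nested-dict tree. The tree literal below is hand-written for illustration and is an assumption, not output from the talk; leaves are any non-dict value.

```python
# Illustrative tree: a branch is {label: {value: subtree}}, a leaf is a bool.
TREE = {
    "sex": {
        "female": True,
        "male": {"class": {1: False, 2: True, 3: False}},
    }
}

def classify(tree, labels, vector):
    """Follow the branch matching each tested feature until a leaf is reached."""
    if not isinstance(tree, dict):
        return tree                       # leaf: the predicted class
    (label, branches), = tree.items()     # a branch node tests exactly one feature
    index = labels.index(label)           # position of that feature in the vector
    return classify(branches[vector[index]], labels, vector)

LABELS = ["sex", "class"]
woman_third_class = classify(TREE, LABELS, ["female", 3])
man_second_class = classify(TREE, LABELS, ["male", 2])
man_third_class = classify(TREE, LABELS, ["male", 3])
```

A feature value that never appeared during training raises a KeyError here, just as Array.pick throws in the F# version when no branch matches, so a real classifier needs a fallback for unseen values.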

RESOURCES


Special thanks!

◦ Matthias Brandewinder for the Machine Learning samples

◦ http://www.clear-lines.com/blog/

◦ Tomas Petricek & Gustavo Guerra for the FSharp.Data library

◦ http://fsharp.github.io/FSharp.Data/

◦ F# Team for Type Providers

◦ http://blogs.msdn.com/b/dsyme/archive/2013/01/30/twelve-type-providers-in-pictures.aspx

◦ Peter Harrington for the Machine Learning in Action code samples

◦ http://www.manning.com/pharrington/

◦ Kaggle for the Titanic data set

◦ http://www.kaggle.com/c/titanic-gettingStarted

Machine Learning Job Trends

Source: indeed.co.uk

What next?

F# Machine Learning information

◦ http://fsharp.org/machine-learning/

Random Forests

◦ http://tinyurl.com/randomforests

Progressive F# Tutorials

◦ http://skillsmatter.com/event/scala/progressive-f-tutorials-2014