
Ego autem et domus mea serviemus Domino.

92 APPLIED PREDICTIVE MODELING

Techniques in R

Over ninety of the most important models used by successful Data Scientists. With step by step instructions on how to build them FAST.

Dr ND Lewis

Copyright © 2015 by N.D. Lewis

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, contact the author at www.AusCov.com.

Disclaimer: Although the author and publisher have made every effort to ensure that the information in this book was correct at press time, the author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause.

Ordering Information: Quantity sales. Special discounts are available on quantity purchases by corporations, associations, and others. For details, email info@NigelDLewis.com.

Image photography by Deanna Lewis

ISBN-13: 978-1517516796
ISBN-10: 151751679X

Dedicated to Angela, wife, friend and mother extraordinaire.

Acknowledgments

A special thank you to

My wife Angela for her patience and constant encouragement

My daughter Deanna for taking hundreds of photographs for this book and my website

And the readers of my earlier books who contacted me with questions andsuggestions

About This Book

This jam-packed book takes you under the hood with step by step instructions using the popular and free R predictive analytics package. It provides numerous examples, illustrations and exclusive use of real data to help you leverage the power of predictive analytics. A book for every data analyst, student and applied researcher. Here is what it can do for you:

• BOOST PRODUCTIVITY: Bestselling author and data scientist Dr. N.D. Lewis will show you how to build predictive analytic models in less time than you ever imagined possible, even if you're a busy professional or a student with little time. By spending as little as 10 minutes a day working through the dozens of real world examples, illustrations, practitioner tips and notes, you'll be able to make giant leaps forward in your knowledge, strengthen your business performance, broaden your skill-set and improve your understanding.

• SIMPLIFY ANALYSIS: You will discover over 90 easy to follow applied predictive analytic techniques that can instantly expand your modeling capability. Plus you'll discover simple routines that serve as a checklist you repeat next time you need a specific model. Even better, you'll discover practitioner tips, work with real data and receive suggestions that will speed up your progress. So even if you're completely stressed out by data, you'll still find in this book tips, suggestions and helpful advice that will ease your journey through the data science maze.

• SAVE TIME: Imagine having at your fingertips easy access to the very best of predictive analytics. In this book you'll learn fast, effective ways to build powerful models using R. It contains over 90 of the most successful models used for learning from data, with step by step instructions on how to build them easily and quickly.

• LEARN FASTER: 92 Applied Predictive Modeling Techniques in R offers a practical, results orientated approach that will boost your productivity, expand your knowledge and create new and exciting opportunities for you to get the very best from your data. The book works because you eliminate the anxiety of trying to master every single mathematical detail. Instead, your goal at each step is to simply focus on a single routine using real data that only takes about 5 to 15 minutes to complete. Within this routine is a series of actions by which the predictive analytic model is constructed. All you have to do is follow the steps. They are your checklist for use and reuse.

• IMPROVE RESULTS: Want to improve your predictive analytic results but don't have enough time? Right now there are a dozen ways to instantly improve your predictive models' performance. Odds are these techniques will only take a few minutes apiece to complete. The problem? You might feel like there's not enough time to learn how to do them all. The solution is in your hands. It uses R, which is free, open-source and extremely powerful software.

In this rich, fascinating and surprisingly accessible guide, data scientist Dr. N.D. Lewis reveals how predictive analytics works and how to deploy its power using the free and widely available R predictive analytics package. The book serves practitioners and experts alike by covering real life case studies and the latest state-of-the-art techniques. Everything you need to get started is contained within this book. Here is some of what is included:

• Support Vector Machines

• Relevance Vector Machines

• Neural networks

• Random forests

• Random ferns

• Classical Boosting

• Model based boosting

• Decision trees

• Cluster Analysis

For people interested in statistics, machine learning, data analysis, data mining and future hands-on practitioners seeking a career in the field, it sets a strong foundation, delivers the prerequisite knowledge and whets your appetite for more. Buy the book today. Your next big breakthrough using predictive analytics is only a page away.

OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips

100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you step by step how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY

They laughed as they gave me the data to analyze. But then they saw my charts.

Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY

Preface

In writing this text my intention was to collect together in a single place practical predictive modeling techniques, ideas and strategies that have been proven to work but which are rarely taught in business schools, data science courses or contained in any other single text.

On numerous occasions, researchers in a wide variety of subject areas have asked "how can I quickly understand and build a particular predictive model?" The answer used to involve reading complex mathematical texts and then programming complicated formulas in languages such as C, C++ and Java. With the rise of R, predictive analytics is now easier than ever. 92 Applied Predictive Modeling Techniques in R is designed to give you rapid access to over ninety of the most popular predictive analytic techniques. It shows you step by step how to build each model in the free and popular R statistical package.

The material you are about to read is based on my personal experience, articles I've written, hundreds of scholarly articles I've read over the years, experimentation (some successful, some failed), conversations I've had with data scientists in various fields, and feedback I've received from numerous presentations to people just like you.

This book came out of the desire to put predictive analytic tools in the hands of the practitioner. The material is therefore designed to be used by the applied data scientist whose primary focus is on delivering results rather than mathematical lemmas or statistical theory. Examples of each technique are clearly described and can be typed directly into R as printed on the page.

This book in your hands is an enlarged, revised and updated collection of my previous works on the subject. I've condensed into this volume the best practical ideas available.

Data science is all about extracting meaningful structure from data. It is always a good idea for the data scientist to study how other users and researchers have used a technique in actual practice. This is primarily because practice often differs substantially from the classroom or theoretical text books. To this end, and to accelerate your progress, actual real world applications of the techniques are given at the start of each section.

These illustrative applications cover a vast range of disciplines, incorporating numerous diverse topics such as intelligent shoes, forecasting the stock market, signature authentication, oil sand pump prognostics, detecting deception in speech, electric fish localization, tropical forest carbon mapping, vehicle logo recognition, understanding rat talk and many more. I have also provided detailed references to these applications for further study at the end of each section.

In keeping with the zeitgeist of R, copies of the vast majority of applied articles referenced in this text are available for free.

New users to R can use this book easily and without any prior knowledge. This is best achieved by typing in the examples as they are given and reading the comments which follow. Copies of R and free tutorial guides for beginners can be downloaded at https://www.r-project.org/.

I have found over and over that a data scientist who has exposure to a broad range of modeling tools and applications will run circles around the narrowly focused genius who has only been exposed to the tools of their particular discipline.

Greek philosopher Epicurus once said "I write this not for the many, but for you; each of us is enough of an audience for the other." Although the ideas in this book reach out to thousands of individuals, I've tried to keep Epicurus's principle in mind - to have each page you read give meaning to just one person - YOU.

I invite you to put what you read in these pages into action. To help you do that, I've created "12 Resources to Supercharge Your Productivity in R"; it is yours for free. Simply go to http://www.auscov.com/tools.html and download it now. It's my gift to you. It shares with you 12 of the very best resources you can use to boost your productivity in R.

I've spoken to thousands of people over the past few years. I'd love to hear your experiences using the ideas in this book. Contact me with your stories, questions and suggestions at Info@NigelDLewis.com.

Now it's your turn.

PS Don't forget to sign-up for your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html

How to Get the Most from this Book

There are at least three ways to use this book. First, you can dip into it as an efficient reference tool. Flip to the technique you need and quickly see how to calculate it in R. For best results, type in the example given in the text, examine the results, and then adjust the example to your own data. Second, browse through the real world examples, illustrations, practitioner tips and notes to stimulate your own research ideas. Third, by working through the numerous examples, you will strengthen your knowledge and understanding of both applied predictive modeling and R.

Each section begins with a brief description of the underlying modeling methodology, followed by a diverse array of real world applications. This is followed by a step by step guide, using real data, for each predictive analytic technique.

PRACTITIONER TIP

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Enter the following:

> install.packages("installr")
> installr::updateR()

If a package mentioned in the text is not installed on your machine you can download it by typing install.packages("package_name"). For example, to download the ada package you would type in the R console:

> install.packages("ada")

Once a package is installed you must call it. You do this by typing in the R console:

> require(ada)

The ada package is now ready for use. You only need to type this once, at the start of your R session.

PRACTITIONER TIP

You should only download packages from CRAN using encrypted HTTPS connections. This provides much higher assurance that the code you are downloading is from a legitimate CRAN mirror rather than from another server posing as one. Whilst downloading a package over a HTTPS connection you may run into an error message something like:

unable to access index for repository https://cran.rstudio.com

This is particularly common on Windows. The internet2 dll has to be activated on versions before R-3.2.2. If you are using an older version of R, before downloading a new package enter the following:

> setInternet2(TRUE)

Functions in R often have multiple parameters. In the examples in this text I focus primarily on the key parameters required for rapid model development. For information on additional parameters available in a function, type in the R console ?function_name. For example, to find out about additional parameters in the ada function you would type:

> ?ada

Details of the function and additional parameters will appear in your default web browser. After fitting your model of interest you are strongly encouraged to experiment with additional parameters. I have also included the set.seed method in the R code samples throughout this text to assist you in reproducing the results exactly as they appear on the page. R is available for all the major operating systems. Due to the popularity of Windows, examples in this book use the Windows version of R.
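As a quick illustration of why set.seed matters, the short sketch below draws the same "random" sample twice; the seed value 107 simply mirrors the one used in later chapters and is otherwise arbitrary:

> set.seed(107)
> sample(1:10, 5)
> set.seed(107)
> sample(1:10, 5)   # identical to the first draw because the seed was reset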

PRACTITIONER TIP

Can't remember what you typed two hours ago? Don't worry, neither can I. Provided you are logged into the same R session you simply need to type:

> history(Inf)

It will return your entire history of entered commands for your current session.

You don't have to wait until you have read the entire book to incorporate the ideas into your own analysis. You can experience their marvelous potency for yourself almost immediately. You can go straight to the technique of interest and immediately test, create and exploit it in your own research and analysis.

PRACTITIONER TIP

On 32-bit Windows machines R can only use up to 3Gb of RAM, regardless of how much you have installed. Use the following to check memory availability:

> memory.limit()

To remove all objects from memory:

> rm(list = ls())

Applying the ideas in this book will transform your data science practice. If you utilize even one tip or idea from each chapter, you will be far better prepared not just to survive, but to excel, when faced by the challenges and opportunities of the ever expanding deluge of exploitable data.

Now let's get started!

Part I

Decision Trees


The Basic Idea

We begin with decision trees because they are one of the most popular techniques in data mining1. They can be applied to both regression and classification problems. Part of the reason for their popularity lies in the ability to present results in a simple, easy to understand tree format. True to its name, the decision tree selects an outcome by descending a tree of possible decisions.

NOTE

Decision trees are, in general, a non-parametric inductive learning technique, able to produce classifiers for a given problem which can assess new, unseen situations and/or reveal the mechanisms driving a problem.

An Illustrative Example

It will be helpful to build intuition by first looking at a simple example of what this technique does with data. Imagine you are required to build an automatic rules based system to buy cars. The goal is to make decisions as new vehicles are presented to you. Let us say you have access to data on the attributes (also known as features): Road tested miles driven, Price of vehicle, Likability of the current owner (measured on a continuous scale from 0 to 100, 100 = "love them"), Odometer miles, Age of the vehicle in years.

A total of 100 measurements are obtained on each variable and also on the decision (yes or no to purchase the vehicle). You run this data through a decision tree algorithm and it produces the tree shown in Figure 1. Several things are worth pointing out about this decision tree. First, the number of observations falling in "yes" and "no" is reported. For example, in "Road tested miles < 100" we see there are 71 observations. Second, the tree is immediately able to model new data using the rules developed. Third, it did not use all of the variables to develop a decision rule (Likability of the current owner and Price were excluded).

Let us suppose that this tree classified 80% of the observations correctly. It is still worth investigating whether a more parsimonious and more accurate tree can be obtained. One way to achieve this is to transform the variables. Let us suppose we create the additional variable Odometer/Age and then rebuild the tree. The result is shown in Figure 2.

It turns out that this decision tree, which chooses only the transformed data, ignores all the other attributes. Let us assume this tree has a prediction accuracy of 90%. Two important points become evident. First, a more parsimonious and accurate tree was possible and second, to obtain this tree it was necessary to include the variable transformations in the second run of the decision tree algorithm.

The example illustrates that the decision tree must have the variables supplied in the appropriate form to obtain the most parsimonious tree. In practice, domain experts will often advise on the appropriate transformation of attributes.
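The thrust of this example can be reproduced on simulated data. The sketch below is purely illustrative: the variable names, the simulated values and the purchase rule are invented, and are not the data behind Figures 1 and 2. It grows a tree with and without the transformed Odometer/Age variable using the tree package used in Technique 1:

> library(tree)
> set.seed(107)
> n <- 100
> cars <- data.frame(road_miles = runif(n, 0, 200), price = runif(n, 2000, 20000), likability = runif(n, 0, 100), odometer = runif(n, 1000, 90000), age = sample(1:15, n, replace = TRUE))
> # invented purchase rule: buy vehicles driven relatively few miles per year of age
> cars$buy <- factor(ifelse(cars$odometer / cars$age < 6000, "yes", "no"))
> fit1 <- tree(buy ~ ., data = cars)             # raw attributes only
> cars$odo_per_age <- cars$odometer / cars$age   # the transformed variable
> fit2 <- tree(buy ~ ., data = cars)             # raw attributes plus Odometer/Age
> summary(fit1); summary(fit2)   # the second tree is usually smaller and more accurate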

Figure 1: Car buying decision tree

Figure 2: Decision tree obtained by transforming variables

PRACTITIONER TIP

I once developed what I thought was a great statistical model for an area I knew little about. Guess what happened? It was a total flop. Why? Because I did not include domain experts in my design and analysis phase. If you are building a decision tree for knowledge discovery it is important to include domain experts alongside you and throughout the entire analysis process. Their input will be required to assess the final decision tree and opine on "reasonability".

Another advantage of using domain experts is that the complexity of the final decision tree (in terms of the number of nodes or the number of rules that can be extracted from a tree) may be reduced. Inclusion of domain experts almost always helps the data scientist create a more efficient set of rules.

Decision Trees in Practice

The basic idea behind a decision tree is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. At each internal node in the tree, a single attribute value is compared with a single threshold value. In other words, each node corresponds to a "measurement" of a particular attribute - that is, a question, often of the "yes" or "no" variety, which can be asked about that attribute's value (e.g. "is age less than 48 years?"). One of the two child nodes is then selected based on the result of the comparison, leading either to another measurement or to a leaf.

When a leaf node is reached, the single class associated with that node is the final prediction. In other words, the terminal nodes carry the information required to classify the data.

A real world example of decision trees is shown in Figure 3; they were developed by Koch et al2 for understanding biological cellular signaling networks. Notice that four trees of increasing complexity are developed by the researchers.

Figure 3: Four decision trees developed by Koch et al. Note that decision trees (A), (B), (C) and (D) have a misclassification error of 30.85%, 24.68%, 21.17% and 18.13% respectively. However, tree (A) is the easiest to interpret. Source: Koch et al.

The Six Advantages of Decision Trees

1. Decision trees can be easy-to-understand, with intuitively clear rules understandable to domain experts.

2. Decision trees offer the ability to track and evaluate every step in the decision-making process. This is because each path through a tree consists of a combination of attributes which work together to distinguish between classes. This simplicity gives useful insights into the inner workings of the method.

3. Decision trees can handle both nominal and numeric input attributes and are capable of handling data sets that contain misclassified values.

4. Decision trees can easily be programmed for use in real time systems. A great illustration of this is the research of Hailemariam et al3, who use a decision tree to determine real time building occupancy.

5. They are relatively inexpensive computationally and work well on both large and small data sets. Figure 4 illustrates an example of a very large decision tree used in Bioinformatics4; the smaller tree represents the tuned version for greater readability.

6. Decision trees are considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.

Figure 4: An example of a large Bioinformatics decision tree and the visually tuned version, from the research of Stiglic et al.

NOTE

One weakness of decision trees is the risk of over fitting. This occurs when statistically insignificant patterns end up influencing classification results. An over fitted tree will perform poorly on new data. Bohanec and Bratko5 studied the role of pruning a decision tree for better decision making. They found that pruning can reduce the risk of over fitting because it results in smaller decision trees that exploit fewer attributes.

How Decision Trees Work

Exact implementation details differ somewhat depending on the algorithm used, but the general principles are very similar across methods and contain the following steps:

1. Provide a set (S) of examples with known classified states. This is called the learning set.

2. Select a set of test attributes a1, a2, ..., aN. These can be viewed as the parameters of the system and are selected because they contain essential information about the problem of concern. In the car example on page 6 the attributes were: a1 = Road tested miles driven, a2 = Price of vehicle, a3 = Likability of the current owner, a4 = Odometer miles, a5 = Age of the vehicle in years.

3. Starting at the top node of the tree (often called the root node) with the entire set of examples S, split S using a test on one or more attributes. The goal is to split S into subsets of increasing classification purity.

4. Check the results of the split. If every partition is pure, in the sense that all examples in the partition belong to the same class, then stop. Label each leaf node with the name of the class.

5. Recursively split any partitions that are not "pure".

6. The procedure is stopped when all the newly created nodes are 'terminal' ones containing "pure enough" learning subsets.

Decision tree algorithms vary primarily in how they choose to "split" the data, when to stop splitting, and how they prune the trees they produce.

NOTE

Many of the decision tree algorithms you will encounter are based on a greedy top-down recursive partitioning strategy for tree growth. They use different variants of impurity measures, such as information gain6, gain ratio7, gini-index8 and distance-based measures9.
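To make the idea of an impurity measure concrete, the short sketch below computes the gini-index of a node, and the weighted gini-index of a candidate split, by hand; the class counts are invented for illustration:

> gini <- function(p) 1 - sum(p^2)
> # a node holding 40 "yes" and 60 "no" examples
> gini(c(0.4, 0.6))                                    # 0.48
> # candidate split: left child 30 yes / 20 no, right child 10 yes / 40 no
> 0.5 * gini(c(0.6, 0.4)) + 0.5 * gini(c(0.2, 0.8))    # 0.40, lower (purer) than 0.48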

Practical Applications

Intelligent Shoes for Stroke Victims

Zhang et al10 develop a wearable shoe (SmartShoe) to monitor physical activity in stroke patients. The data set consisted of 12 patients who had experienced a stroke.

Supervised by a physical therapist, SmartShoe collected data from the patients using eight posture and activity groups: sitting, standing, walking, ascending stairs, descending stairs, cycling on a stationary bike, being pushed in a wheelchair and propelling a wheelchair.

Patients performed each activity for between 1-3 minutes. Data was collected from the SmartShoe every 2 seconds, from which feature vectors were computed.

Half the feature vectors were selected at random for the training set. The remainder were used for validation. The C5.0 algorithm was used to build the decision tree.

The researchers constructed both subject specific and group decision trees. The group models were developed using Leave-One-Out cross-validation.

The performance results for five of the patients are shown in Table 1. As might be expected, the individual models fit better than the group models. For example, for patient 5 the accuracy of the patient specific tree was 98.4%; however, using the group tree the accuracy declined to 75.5%. This difference in performance might be in part due to over fitting of the individual specific tree. The group models were trained using data from multiple subjects and therefore can be expected to have lower overall performance scores.

Patient      1     2     3     4     5     Average
Individual   96.5  97.4  99.8  97.2  98.4  97.9
Group        87.5  91.1  64.7  82.2  75.5  80.2

Table 1: Zhang et al's decision tree performance metrics (accuracy, %)

PRACTITIONER TIP

Decision trees are often validated by calculating sensitivity and specificity. Sensitivity is the ability of the classifier to identify positive results, while specificity is the ability to distinguish negative results:

Sensitivity = N_TP / (N_TP + N_FN) × 100     (1)

Specificity = N_TN / (N_TN + N_FP) × 100     (2)

where N_TP is the number of true positives, N_FN the number of false negatives, N_TN the number of true negatives and N_FP the number of false positives.
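As a worked illustration of these two formulas, the sketch below computes sensitivity and specificity from a 2x2 confusion matrix; the counts are made up for the example:

> conf <- matrix(c(85, 15, 10, 90), nrow = 2, byrow = TRUE)
> rownames(conf) <- colnames(conf) <- c("pos", "neg")   # rows = observed, columns = predicted
> sensitivity <- conf["pos", "pos"] / sum(conf["pos", ]) * 100   # 85: true positives / all observed positives
> specificity <- conf["neg", "neg"] / sum(conf["neg", ]) * 100   # 90: true negatives / all observed negatives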

Micro Ribonucleic Acid

MicroRNAs (miRNAs) are non-protein coding Ribonucleic acids (RNAs) that attenuate protein production in P bodies11. Williams et al12 develop a MicroRNA decision tree. For the training set the researchers used known miRNAs from various plant species as positive controls and non-miRNA sequences as negative controls.

The typical size of their training set consisted of 5294 cases using 29 attributes. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation.

After training, the researchers focus on attribute usage information. Table 2 shows the top ten attribute usage for a typical training run. The researchers report that other training runs show similar usage. The values represent the percentage of sequences that required that attribute for classification. Several attributes, such as DuplexEnergy, minMatchPercent and C content, are required for all sequences to be classified. Note that G and C are directly related to the stability of the duplex. Sensitivity and specificity were as high as 84.08% and 98.53% respectively.

Usage (%)   Attribute
100         G
100         C
100         T
100         DuplexEnergy
100         minMatchPercent
100         DeltaGnorm
100         G + T
100         G + C
98          duplexEnergyNorm
86          NormEnergyRatio

Table 2: Top ten attribute usage for one training run of the classifier reported by Williams et al.

An interesting question is: if all miRNAs in each taxonomic category studied by the researchers are systematically excluded from training, while including all others, how well does the predictor do when tested on the excluded category? Table 3 provides the answer. The ability to correctly identify known miRNAs ranged from 78% for the Salicaceae to 100% for seven of the groups shown in Table 3. The researchers conclude by stating "We have shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0".

Taxonomic Group   Correctly classified (%)   % of full set excluded
Embryophyta       94                         9.16
Lycopodiophyta    100                        2.65
Brassicaceae      100                        20.22
Caricaceae        100                        0.05
Euphorbiaceae     100                        0.34
Fabaceae          100                        27.00
Salicaceae        78                         3.52
Solanaceae        93                         0.68
Vitaceae          94                         4.29
Rutaceae          100                        0.43
Panicoideae       95                         8.00
Poaceae           100                        19.48
Pooideae          80                         3.18

Table 3: Results from exclusion of each of the 13 taxonomic groups, by Williams et al.

NOTE

A decision tree produced by an algorithm is usually not optimal in the sense of statistical performance measures such as the log-likelihood, squared errors and so on. It turns out that finding the "optimal tree", if one exists, is computationally intractable (or NP-hard, technically speaking).

Acute Liver Failure

Nakayama et al13 use decision trees for the prognosis of acute liver failure (ALF) patients. The data set consisted of a total of 1022 ALF patients seen between 1998 and 2007 (698 patients seen between 1998 and 2003, and 324 patients seen between 2004 and 2007).

Measurements on 73 medical attributes, at the onset of hepatic encephalopathy14 and 5 days later, were collected from 371 of the 698 patients seen between 1998 and 2003.

Two decision trees were built. The first was used to predict (using 5 attributes) the outcome of the patient at the onset of hepatic encephalopathy. The second decision tree was used to predict (using 7 attributes) the outcome at 5 days after the onset of grade II or more severe hepatic encephalopathy. The decision trees were validated using data from 160 of the 324 patients seen between 2004 and 2007. Decision tree performance is shown in Table 4.

                                 Decision Tree I         Decision Tree II
                                 Outcome at the onset    Outcome at 5 days
Accuracy (patients 1998-2003)    79.0%                   83.6%
Accuracy (patients 2004-2007)    77.6%                   82.6%

Table 4: Nakayama et al's decision tree performance metrics

NOTE

The performance of a decision tree is often measured in terms of three characteristics:

• Accuracy - The percentage of cases correctly classified.

• Sensitivity - The percentage of cases correctly classified as belonging to class A among all observations known to belong to class A.

• Specificity - The percentage of cases correctly classified as belonging to class B among all observations known to belong to class B.

Traffic Accidents

de Oña et al15 investigate the use of decision trees for analyzing road accidents on rural highways in the province of Granada, Spain. Regression-type generalized linear models, Logit models and Probit models have been the techniques most commonly used to conduct such analyses16.

Three decision tree models are developed (CART, ID3 and C4.5) using data collected from 2003 to 2009. Nineteen independent variables, reported in Table 5, are used to build the decision trees.

Accident type        Age                    Atmospheric factors
Safety barriers      Cause                  Day of week
Lane width           Lighting               Month
Number of injuries   Number of Occupants    Paved shoulder
Pavement width       Pavement markings      Gender
Shoulder type        Sight distance         Time
Vehicle type

Table 5: Variables used from police accident reports by de Oña et al.

The accuracy results of their analysis are shown in Table 6. Overall the decision trees showed modest improvement over chance.

CART    C4.5    ID3
55.87   54.16   52.72

Table 6: Accuracy results (percentage) reported by de Oña et al.

PRACTITIONER TIP

Even though de Oña et al tested three different decision tree algorithms (CART, ID3 and C4.5), their results led to a very modest improvement over chance. This will also happen to you on very many occasions. Rather than clinging on to a technique (because it is the latest technique or the one you happen to be most familiar with), the professional data scientist seeks out and tests alternative methods. In this text you have over ninety of the best applied modeling techniques at your fingertips. If decision trees don't "cut it", try something else.

Electrical Power Losses

A non-technical loss (NTL) is defined by electrical power companies as any consumed electricity which is not billed. This could be because of measurement equipment failure or fraud. Traditionally, power utilities have monitored NTL by making in situ inspections of equipment, especially for those customers that have very high or close to zero levels of energy consumption. Monedero et al17 develop a CART decision tree to enhance the detection rate of NTL.

A sample of 38,575 customer accounts was collected over two years in Catalonia, Spain. For each customer various indicators were measured18.

The best decision tree had a depth of 5, with the terminal node identifying customers with the highest likelihood of NTL. A total of 176 customers were identified by the model. This number was greater than the expected total of 85, and too many for the utility company to inspect in situ. The researchers therefore merged their results with a Bayesian network model and thereby reduced the estimated number to 64.

This example illustrates the important point that the predictive model is just one aspect that goes into real world decisions. Oftentimes a model will be developed but deemed impracticable by the end user.

NOTE

Cross-validation refers to a technique used to allow for the training and testing of inductive models. Williams et al used leave-one-out cross-validation. Leave-one-out cross-validation involves taking out one observation from your sample and training the model with the rest. The predictor just trained is applied to the excluded observation. One of two possibilities will occur: the predictor is correct on the previously unseen control, or not. The removed observation is then returned, the next observation is removed, and again training and testing are done. This process is repeated for all observations. When this is completed, the results are used to calculate the accuracy of the model, often measured in terms of sensitivity and specificity.
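The procedure described in this note is easy to spell out in code. The sketch below runs leave-one-out cross-validation for a classification tree on the iris data shipped with R; it is a generic illustration of the idea, not the validation code used by Williams et al:

> library(tree)
> correct <- logical(nrow(iris))
> for (i in 1:nrow(iris)) {
+   fit <- tree(Species ~ ., data = iris[-i, ])            # train on all but observation i
+   pred <- predict(fit, newdata = iris[i, ], type = "class")
+   correct[i] <- (pred == iris$Species[i])                 # test on the held out observation
+ }
> mean(correct)   # leave-one-out estimate of accuracy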

Tracking Tetrahymena Pyriformis Cells

Tracking and matching individual biological cells in real time is a challenging task. Since decision trees provide an excellent tool for real time and rapid decision making, they have potential in this area. Wang et al19 consider the issue of real time tracking and classification of Tetrahymena Pyriformis cells. The issue is whether a cell in the current video frame is the same cell in the previous video frame.

A 23-dimensional feature vector is developed, and whether two regions in different frames represent the same cell is manually determined. The training set consisted of 1000 frames from 2 videos. The videos were captured at 8 frames per second, with each frame an 8-bit gray level image. The researchers develop two decision trees, the first (T1) trained using the feature vector and manual classification, and the second (T2) trained using a truncated set of features. The error rates for each tree by tree depth are reported in Table 7. Notice that in this case T1 substantially outperforms T2, indicating the importance of using the full feature set.

Tree Depth   T1     T2
5            1.48   14.02
8            1.37   12.49
10           1.55   13.61

Table 7: Error rates (%) by tree depth for T1 & T2, reported by Wang et al.

PRACTITIONER TIP

Cross-validation is often used to prevent over-fitting a model to the data. In n-fold cross-validation we first divide the training set into n subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining (n-1) subsets. Thus, each instance of the whole training set is predicted once. The advantage of this method over a random selection of training samples is that all observations are used for either training (n-1 times) or evaluation (once). Cross-validation accuracy is measured as the percentage of data that are correctly classified.
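A minimal sketch of this idea, again using the iris data and the tree package; the choice of 10 folds is my own, not from the text:

> library(tree)
> set.seed(107)
> k <- 10
> folds <- sample(rep(1:k, length.out = nrow(iris)))   # assign each row to a fold
> acc <- numeric(k)
> for (j in 1:k) {
+   fit <- tree(Species ~ ., data = iris[folds != j, ])
+   pred <- predict(fit, newdata = iris[folds == j, ], type = "class")
+   acc[j] <- mean(pred == iris$Species[folds == j])
+ }
> mean(acc)   # n-fold cross-validation accuracy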

Classification Trees


Technique 1

Classification Tree

A classification tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether "deviance" or "gini" is used as the splitting criterion; z, the data frame of classes; and data, the data set of attributes with which you wish to build the tree.

PRACTITIONER TIP

To obtain information on which version of R you are running, loaded packages and other information, use:

> sessionInfo()

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(tree)
> library(mlbench)
> data(Vehicle)

NOTE

The Vehicle data set20 was collected to classify a Corgi vehicle silhouette as one of four types (double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400). The data frame contains 846 observations on 18 numerical features extracted from the silhouettes, and one nominal variable defining the class of the objects (see Table 8).

You can access the features directly using Table 8 as a reference. For example, a summary of the Comp and Circ features is obtained by typing:

> summary(Vehicle[1])
      Comp       
 Min.   : 73.00  
 1st Qu.: 87.00  
 Median : 93.00  
 Mean   : 93.68  
 3rd Qu.:100.00  
 Max.   :119.00  

> summary(Vehicle[2])
      Circ      
 Min.   :33.00  
 1st Qu.:40.00  
 Median :44.00  
 Mean   :44.86  
 3rd Qu.:49.00  
 Max.   :59.00  

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Data-frame index   R name         Description
1                  Comp           Compactness
2                  Circ           Circularity
3                  D.Circ         Distance Circularity
4                  Rad.Ra         Radius ratio
5                  Pr.Axis.Ra     pr.axis aspect ratio
6                  Max.L.Ra       max.length aspect ratio
7                  Scat.Ra        scatter ratio
8                  Elong          elongatedness
9                  Pr.Axis.Rect   pr.axis rectangularity
10                 Max.L.Rect     max.length rectangularity
11                 Sc.Var.Maxis   scaled variance along major axis
12                 Sc.Var.maxis   scaled variance along minor axis
13                 Ra.Gyr         scaled radius of gyration
14                 Skew.Maxis     skewness about major axis
15                 Skew.maxis     skewness about minor axis
16                 Kurt.maxis     kurtosis about minor axis
17                 Kurt.Maxis     kurtosis about major axis
18                 Holl.Ra        hollows ratio
19                 Class          type: bus, opel, saab, van

Table 8: Attributes and class labels for the vehicle silhouettes data set

Step 3: Estimate the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- tree(Class ~ ., data = Vehicle[train, ], split = "deviance")

We use deviance as the splitting criterion; a common alternative is to use split = "gini". You may be surprised by just how quickly R builds the tree.

PRACTITIONER TIP

It is important to remember that the response variable is a factor. This actually trips up quite a few users who have numeric categories and forget to convert their response variable using factor(). To see the levels of the response variable type:

> attributes(Vehicle$Class)
$levels
[1] "bus"  "opel" "saab" "van"

$class
[1] "factor"

For Vehicle, each level is associated with a different vehicle type (bus, opel, saab, van).

To see details of the fitted tree type:

> fit

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 500 1386.000 saab ( 0.248000 0.254000 0.258000 0.240000 )
   2) Elong < 41.5 229 489.100 opel ( 0.222707 0.410480 0.366812 0.000000 )
...
  14) Skew.Maxis < 64.5 7 9.561 van ( 0.000000 0.000000 0.428571 0.571429 ) *
  15) Skew.Maxis > 64.5 64 10.300 van ( 0.015625 0.000000 0.000000 0.984375 ) *

At each branch of the tree (after root) we see, in order:

1. The branch number (e.g. in this case 1, 2, 14 and 15);

2. the split (e.g. Elong < 41.5);

3. the number of samples going along that split (e.g. 229);

4. the deviance associated with that split (e.g. 489.1);

5. the predicted class (e.g. opel);

6. the associated probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));

7. and, for a terminal node (or leaf), the symbol *.

PRACTITIONER TIP

If the minimum deviance occurs with a tree with 1 node, then your model is at best no better than random. It is even possible that it may be worse.

A summary of the tree can also be obtained:

> summary(fit)

Classification tree:
tree(formula = Class ~ ., data = Vehicle[train, ], split = "deviance")
Variables actually used in tree construction:
 [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
 [6] "Max.L.Rect"   "D.Circ"       "Skew.maxis"   "Circ"         "Kurt.Maxis"  
[11] "Skew.Maxis"  
Number of terminal nodes:  15 
Residual mean deviance:  0.9381 = 455 / 485 
Misclassification error rate: 0.232 = 116 / 500 

Notice that summary(fit) shows:

1. The type of tree, in this case a Classification tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 15;

5. the residual mean deviance - 0.9381;

6. the misclassification error rate, 0.232 or 23.2%.

We plot the tree, see Figure 1.1:

> plot(fit)
> text(fit)

Figure 1.1: Fitted Decision Tree

PRACTITIONER TIP

The height of the vertical lines in Figure 1.1 is proportional to the reduction in deviance. The longer the line, the larger the reduction. This allows you to identify the important sections immediately. If you wish to plot the model using uniform lengths use plot(fit, type = "uniform").

Step 4: Assess Model

Unfortunately, classification trees have a tendency to over-fit the data. One approach to reduce this risk is to use cross-validation. For each hold out sample we fit the model and note at what level the tree gives the best results (using deviance or the misclassification rate). Then we hold out a different sample and repeat. This can be carried out using the cv.tree() function. We use a leave-one-out cross-validation using the misclassification rate and deviance (FUN = prune.misclass, followed by FUN = prune.tree).

NOTE

Textbooks and academics used to spend an inordinate amount of time on the subject of when to stop splitting a tree, and also on pruning techniques. This is indeed an important aspect to consider when building a single tree, because if the tree is too large it will tend to over fit the data, while if the tree is too small it might misclassify important characteristics of the relationship between the covariates and the outcome. In actual practice I do not spend a great deal of time on deciding when to stop splitting a tree, or even on pruning. This is partly because:

1. A single tree is generally only of interest to gain insight about the data if it can be easily interpreted. The default settings in R decision tree functions often are sufficient to create such trees.

2. For use in "pure" prediction activities, random forests (see page 272) have largely replaced individual decision trees because they often produce more accurate predictive models.

The results are plotted out side by side in Figure 1.2. The jagged lines show where the minimum deviance / misclassification occurred with the cross-validated tree. Since the cross-validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitMcv <- cv.tree(fit, K = 346, FUN = prune.misclass)
> fitPcv <- cv.tree(fit, K = 346, FUN = prune.tree)

> par(mfrow = c(1, 2))
> plot(fitMcv)
> plot(fitPcv)

Figure 1.2: Cross-validation results on Vehicle using misclassification and deviance
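Had the cross-validation curves bottomed out at a smaller tree, the fitted tree could be cut back to that size with prune.misclass. A minimal sketch follows; the choice of best = 8 is purely illustrative and should be read off your own cross-validation plot:

> fit.pruned <- prune.misclass(fit, best = 8)   # keep the 8 best terminal nodes (hypothetical size)
> plot(fit.pruned); text(fit.pruned)
> summary(fit.pruned)                           # compare deviance and error rate with the full tree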

Step 5: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 32%.

> pred <- predict(fit, newdata = Vehicle[-train, ])

> pred.class <- colnames(pred)[max.col(pred, ties.method = c("random"))]

> table(Vehicle$Class[-train], pred.class, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   86    1    3   4
          opel   1   55   20   9
          saab   4   55   23   6
          van    2    2    5  70

> error_rate = (1 - sum(pred.class == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.324

Technique 2

C5.0 Classification Tree

The C5.0 algorithm21 is based on the concepts of entropy, the measure of disorder in a sample, and the information gain of each attribute. Information gain is a measure of the effectiveness of an attribute in reducing the amount of entropy in the sample.

It begins by calculating the entropy of a data sample. The next step is to calculate the information gain for each attribute. This is the expected reduction in entropy from partitioning the data set on the given attribute. From the set of information gain values the best attributes for partitioning the data set are chosen and the decision tree is built.
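The two quantities the algorithm relies on are easy to compute directly. The sketch below, with made-up class counts, calculates the entropy of a sample and the information gain of an attribute that splits it into two subsets:

> entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
> # sample with 40 positive and 60 negative cases
> H <- entropy(c(0.4, 0.6))                                     # about 0.971 bits
> # an attribute splits it into 50 cases (30 pos / 20 neg) and 50 cases (10 pos / 40 neg)
> H.split <- 0.5 * entropy(c(0.6, 0.4)) + 0.5 * entropy(c(0.2, 0.8))
> H - H.split                                                   # information gain, about 0.125 bits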

A C5.0 classification tree can be built using the package C50 with the C5.0 function:

C5.0(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(C50)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- C5.0(Class ~ ., data = Vehicle[train, ])

Next we assess variable importance using the C5imp function:

> C5imp(fit)
             Overall
Max.L.Ra       100.0
Elong          100.0
Comp            50.2
Circ            46.2
Skew.maxis      45.2
Scat.Ra         40.0
Max.L.Rect      29.2
Ra.Gyr          23.4
D.Circ          23.2
Skew.Maxis      20.2
Pr.Axis.Rect    17.0
Kurt.maxis      12.2
Pr.Axis.Ra       9.0
Rad.Ra           3.0
Holl.Ra          3.0
Sc.Var.maxis     0.0
Kurt.Maxis       0.0

We observe that Max.L.Ra and Elong are the two most influential attributes. The attributes Sc.Var.maxis and Kurt.Maxis are the least influential variables, with an influence score of zero.

PRACTITIONER TIP

To assess the importance of attributes by split use:

> C5imp(fit, metric = "splits")

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 27.5%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   84    1    5   4
          opel   0   48   31   6
          saab   3   38   47   0
          van    2    1    4  72

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.275

PRACTITIONER TIP

To view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

> head(round(pred, 3))
     bus  opel  saab   van
5  0.018 0.004 0.004 0.974
7  0.050 0.051 0.852 0.048
10 0.011 0.228 0.750 0.010
12 0.031 0.032 0.782 0.155
15 0.916 0.028 0.029 0.027
19 0.014 0.070 0.903 0.013

Technique 3

Conditional Inference Classification Tree

A conditional inference classification tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about the distribution of classes in the terminal nodes22. It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(party)
> library(mlbench)
> data(Vehicle)

Step 2 is outlined on page 23.

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 2))

Notice we use controls with the maxdepth parameter to limit the depth of the tree to at most 2. Note that the default, maxdepth = 0, places no restriction on the depth of the tree.

Next we plot the tree:

> plot(fit)

The resultant tree is shown in Figure 3.1. At each internal node a p-value is reported for the split. In this case they are all highly significant (less than 1%). The primary split takes place at Elong ≤ 41 / Elong > 41. The four terminal nodes are labeled Node 3, Node 4, Node 6 and Node 7, with 62, 167, 88 and 183 observations respectively. Each of these leaf nodes also has a bar chart illustrating the proportion of the four vehicle types that fall into each class at that node.

Figure 3.1: Conditional Inference Classification Tree for Vehicle with maxdepth = 2

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; we show the confusion matrix and calculate the error rate of fit. The misclassification rate is approximately 53% for this tree.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   36    0   11  47
          opel   3   50   24   8
          saab   6   58   19   5
          van    0    0   21  58

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.529

PRACTITIONER TIP

Set the parameter type = "node" to see which nodes the observations end up in, and type = "prob" to view the probabilities. For example, to see the distribution for the validation sample type:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
> table(pred)
pred
  3   4   6   7 
 45 108  75 118 

We see that 45 observations ended up in node 3 and 118 in node 7.

To assess the predictive power of fit we compare it against two other conditional inference classification trees - fit3, which limits the maximum tree depth to 3, and fitu, which estimates an unrestricted tree:

> fit3 <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 3))

> fitu <- ctree(Class ~ ., data = Vehicle[train, ])

We use the validation data set and the fitted decision trees to predict vehicle classes:

> pred3 <- predict(fit3, newdata = Vehicle[-train, ], type = "response")

> predu <- predict(fitu, newdata = Vehicle[-train, ], type = "response")

Next we calculate the error rate of each fitted tree:

> error_rate3 = (1 - sum(pred3 == Vehicle$Class[-train]) / 346)

> error_rateu = (1 - sum(predu == Vehicle$Class[-train]) / 346)

> tree_1 <- round(error_rate, 3)
> tree_2 <- round(error_rate3, 3)
> tree_3 <- round(error_rateu, 3)

Finally, we compare the misclassification error rate of each fitted tree:

> err <- cbind(tree_1, tree_2, tree_3) * 100
> rownames(err) <- "error (%)"

> err
          tree_1 tree_2 tree_3
error (%)   52.9   41.6     37

The unrestricted tree has the lowest misclassification error rate of 37%.


Technique 4

Evolutionary Classification Tree

NOTE

The recursive partitioning methods discussed in previous sections build the decision tree using a forward stepwise search, where splits are chosen to maximize homogeneity at the next step only. Although this approach is known to be an efficient heuristic, the results are only locally optimal. Evolutionary algorithm based trees search over the parameter space of trees using a global optimization method.

An evolutionary classification tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the response variable z, which contains the classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(evtree)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. We use evtree to build the tree and plot to display the tree. We restrict the tree depth using maxdepth = 2:

> fit <- evtree(Class ~ ., data = Vehicle[train, ], control = evtree.control(maxdepth = 2))

> plot(fit)

Figure 4.1: Fitted Evolutionary Classification Tree using Vehicle

The tree shown in Figure 4.1 visualizes the decision rules. It also shows, for the terminal nodes, the number of observations and their distribution amongst the classes. Let us take node 3 as an illustration. This node is reached by the rule Max.L.Ra < 8 and Sc.Var.maxis < 290. The node contains 64 observations. Similar details can be obtained by typing fit:

> fit

Model formula:
Class ~ Comp + Circ + D.Circ + Rad.Ra + Pr.Axis.Ra + Max.L.Ra +
    Scat.Ra + Elong + Pr.Axis.Rect + Max.L.Rect + Sc.Var.Maxis +
    Sc.Var.maxis + Ra.Gyr + Skew.Maxis + Skew.maxis + Kurt.maxis +
    Kurt.Maxis + Holl.Ra

Fitted party:
[1] root
|   [2] Max.L.Ra < 8
|   |   [3] Sc.Var.maxis < 290: van (n = 64, err = 45.3%)
|   |   [4] Sc.Var.maxis >= 290: bus (n = 146, err = 27.4%)
|   [5] Max.L.Ra >= 8
|   |   [6] Sc.Var.maxis < 389: van (n = 123, err = 30.9%)
|   |   [7] Sc.Var.maxis >= 389: opel (n = 167, err = 47.3%)

Number of inner nodes:    3
Number of terminal nodes: 4

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 39%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   85    0    0   9
          opel  14   48    0  23
          saab  18   57    0  13
          van    0    1    0  78

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.39

PRACTITIONER TIP

When using predict you can specify any of the following via type = : "response", "prob", "quantile", "density" or "node". For example, to view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

To see the distribution across nodes you would enter:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")

Finally, we estimate an unrestricted model, use the validation data set to predict vehicle classes, display the confusion matrix and calculate the error rate of the unrestricted fitted tree. In this case the misclassification error rate is lower, at 34.7%.

> fit <- evtree(Class ~ ., data = Vehicle[train, ])

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   72    3   15   4
          opel   4   30   41  10
          saab   6   23   55   4
          van    3    1    6  69

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.347


Technique 5

Oblique Classification Tree

NOTE

A common question is: what is different about oblique trees? Here is the answer in a nutshell. For a set of attributes X1, ..., Xk the standard classification tree produces binary partitioned trees by considering axis-parallel splits over continuous attributes (i.e. grown using tests of the form Xi < C versus Xi ≥ C). This is the most widely used approach to tree-growth. Oblique trees are grown using oblique splits (i.e. grown using tests of the form α1X1 + ... + αkXk < C versus α1X1 + ... + αkXk ≥ C). So, for axis-parallel splits a single attribute is used; for oblique splits a weighted combination of attributes is used.
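To make the distinction concrete, the toy sketch below uses invented data in which the true class boundary is oblique; it does not use the oblique.tree package, it simply contrasts the two kinds of test:

> set.seed(107)
> x1 <- runif(200); x2 <- runif(200)
> cls <- factor(ifelse(x1 + x2 < 1, "A", "B"))   # true boundary is oblique: x1 + x2 = 1
> library(tree)
> fit.axis <- tree(cls ~ x1 + x2, data = data.frame(x1, x2, cls))
> summary(fit.axis)   # several axis-parallel splits are needed to approximate the boundary
> # a single oblique test recovers the boundary exactly (by construction of cls)
> table(cls, oblique = (1 * x1 + 1 * x2 < 1))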

An oblique classification tree model can be estimated using the package oblique.tree with the oblique.tree function:

oblique.tree(z ~ ., data, oblique.splits = "only", control = tree.control(), split.impurity = )

Key parameters include the response variable z, which contains the classes; data, the data set of attributes with which you wish to build the tree; control, which takes arguments from tree.control in the tree package; and split.impurity, which controls the splitting criterion used and takes the values "deviance" or "gini".

47

92 Applied Predictive Modeling Techniques in R

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(oblique.tree)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. Three attributes are used to build the tree (Max.L.Ra, Sc.Var.maxis and Elong).

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> f <- Class ~ Max.L.Ra + Sc.Var.maxis + Elong

Step 3: Estimate and Assess the Decision Tree

Before estimating the tree we tweak the following:

1. Only allow oblique splits (oblique.splits = "only").

2. Use tree.control to indicate the number of observations (nobs = 500) and set the minimum number of observations in a node to 60 (mincut = 60).

The tree is estimated as follows:

> fit <- oblique.tree(f, data = Vehicle[train, ], oblique.splits = "only", control = tree.control(nobs = 500, mincut = 60), split.impurity = "deviance")

PRACTITIONER TIP

The type of split is indicated by the oblique.splits argument. For our analysis we use oblique.splits = "only" to grow trees that only use oblique splits. Use oblique.splits = "on" to grow trees that use both oblique and axis-parallel splits, and oblique.splits = "off" to grow traditional classification trees (only axis-parallel splits).

Details of the tree can be visualized using a combination of plot and text; see Figure 5.1.

> plot(fit); text(fit)

Figure 5.1: Fitted Oblique tree using a subset of Vehicle attributes

The tree visualizes the decision rules. The first split occurs where

64 − 1.42·Max.L.Ra + 0.03·Sc.Var.maxis − 0.12·Elong < 0

If the above holds it leads directly to the leaf node indicating class type = van. Full details of the tree can be obtained by typing fit, whilst summary gives an overview of the fitted tree.

> summary(fit)

Classification tree:
oblique.tree(formula = f, data = Vehicle[train, ], control = tree.control(nobs = 500,
    mincut = 60), split.impurity = "deviance", oblique.splits = "only")
Variables actually used in tree construction:
[1]
Number of terminal nodes:  5
Residual mean deviance:  1.766 = 874.2 / 495
Misclassification error rate: 0 = 0 / 500

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 39.9%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    77    0   13   4
          opel   17   37   20  11
          saab   23   34   27   4
          van     8    0    4  67

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.399

Technique 6

Logistic Model Based Recursive Partitioning

A model based recursive partitioning logistic regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + w | a + b + c, data, model = glinearModel, family = binomial())

Key parameters include the binary response variable z, the conditioning covariates x, y and w, and the tree partitioning covariates a, b and c.

Step 1: Load Required Packages

We build the decision tree using the data frame PimaIndiansDiabetes2 contained in the mlbench package:

> library(party)
> data(PimaIndiansDiabetes2, package = "mlbench")

NOTE

The PimaIndiansDiabetes2 data set was collected by the National Institute of Diabetes and Digestive and Kidney Diseases.23 It contains 768 observations on 9 variables measured on females at least 21 years old of Pima Indian heritage. Table 9 contains a description of each of the variables.

Name        Description
pregnant    Number of times pregnant
glucose     Plasma glucose concentration (glucose tolerance test)
pressure    Diastolic blood pressure (mm Hg)
triceps     Triceps skin fold thickness (mm)
insulin     2-Hour serum insulin (mu U/ml)
mass        Body mass index
pedigree    Diabetes pedigree function
age         Age (years)
diabetes    Test for diabetes - class variable (neg / pos)

Table 9: Response and independent variables in the PimaIndiansDiabetes2 data frame

Step 2: Prepare Data & Tweak Parameters

For our analysis we use 600 of the 768 observations to train the model. The response variable is diabetes, and we use mass and pedigree as logistic regression conditioning variables, with the remaining six variables (glucose, pregnant, pressure, triceps, insulin and age) used as the partitioning variables. The model is stored in f.

> set.seed(898)
> n = nrow(PimaIndiansDiabetes2)
> train <- sample(1:n, 600, FALSE)
> f <- diabetes ~ mass + pedigree | glucose + pregnant + pressure + triceps + insulin + age

NOTE

The PimaIndiansDiabetes2 data set has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. In traditional statistical analysis these values would have to be removed or their values interpolated. However, ignoring missing values or treating them as another category is often inefficient. A more efficient use of the available information, adopted by many decision tree algorithms, is to ignore the missing data point in the evaluation of a split but distribute it to the child nodes using a given rule. For example, the rule might:

1. Distribute missing values to the child node which has the largest number of instances.

2. Distribute to all child nodes, with diminished weights proportional to the number of instances in each child node.

3. Randomly distribute to one single child node.

4. Create surrogate attributes which closely resemble the test attributes and use them to send missing values to child nodes (see the sketch following this list).
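Rule 4 is the approach taken by rpart. The following minimal sketch (it uses rpart rather than the mob function of this technique, and the control values are illustrative only) shows how surrogate splits are requested and inspected:

library(rpart)
library(mlbench)
data(PimaIndiansDiabetes2, package = "mlbench")

# usesurrogate = 2 sends observations with a missing split variable
# down the branch chosen by the best surrogate split
rp <- rpart(diabetes ~ ., data = PimaIndiansDiabetes2,
            control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

summary(rp)   # the per-node output lists the surrogate splits used for NAs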

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and then use the plot function to visualize the tree, as shown in Figure 6.1. Since the response variable diabetes is binary and mass and pedigree are numeric, a spinogram is used for visualization. The plots in the leaves give spinograms for diabetes versus mass (upper panel) and pedigree (lower panel).

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel, family = binomial())

> plot(fit)

PRACTITIONER TIP

As an alternative to using spinograms you can also plot the cumulative density function using the argument tp_args = list(cdplot = TRUE) in the plot function. In the example in this section you would type:

> plot(fit, tp_args = list(cdplot = TRUE))

You can also specify the smoothing bandwidth using the bw argument. For example:

> plot(fit, tp_args = list(cdplot = TRUE, bw = 15))

Figure 6.1: Logistic-regression-based tree for the Pima Indians diabetes data

The fitted lines are the mean predicted probabilities in each group. The decision tree distinguishes four different groups of women:

• Node 3: Women with low glucose and 26 years or younger have on average a low risk of diabetes; however, this increases with mass but decreases slightly with pedigree.

• Node 4: Women with low glucose and older than 26 years have on average a moderate risk of diabetes, which increases with mass and pedigree.

• Node 5: Women with glucose in the range 127 to 165 have on average a moderate to high risk of diabetes, which increases with mass and pedigree.

• Node 7: Women with glucose greater than 165 have on average a high risk of diabetes, which increases with mass and decreases with pedigree.

The same interpretation can also be drawn from the coefficient estimates obtained using the coef function:

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -7.263 0.143   -0.680
4      -4.887 0.076    2.495
6      -5.711 0.149    1.365
7      -3.216 0.225   -2.193

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -115.4345 (df=15)

> AIC(fit)
[1] 260.8691

Step 5: Make Predictions

We use the function predict to calculate the fitted values using the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb

        predicted
actual   neg pos
    neg   92  17
    pos   26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2

PRACTITIONER TIP

Re-run the analysis in this section omitting the attributes insulin and triceps and using the na.omit method to remove any remaining missing values. Here is some sample code to get you started:

temp <- PimaIndiansDiabetes2
temp$insulin <- NULL
temp$triceps <- NULL
temp <- na.omit(temp)

What do you notice about the resultant decision tree?

Technique 7

Probit Model Based Recursive Partitioning

A model based recursive partitioning probit regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + w | a + b + c, data, model = glinearModel, family = binomial(link = "probit"))

Key parameters include the binary response variable z, the conditioning covariates x, y and w, and the tree partitioning covariates a, b and c.

Step 1 and step 2 are discussed beginning on page 52

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and display the coefficients at the leaf nodes using coef:

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel, family = binomial(link = "probit"))

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -4.070 0.078   -0.222
4      -2.932 0.046    1.474
6      -3.416 0.089    0.814
7      -1.174 0.100   -1.003
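As a rough consistency check (not part of the original text), logistic coefficients are typically about 1.6 to 1.8 times the corresponding probit coefficients because of the different latent-error scales of the two links. Here, for example, the node 3 coefficient on mass is 0.143 in the logistic tree and 0.078 in the probit tree, a ratio of roughly 1.8, so the two trees are telling essentially the same story.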


The estimated decision tree is similar to that shown in Figure 6.1, and the interpretation of the coefficients is given on page 56.

Step 5: Make Predictions

For comparison with the logistic regression discussed on page 52, we report the value of the log likelihood function and the Akaike information criterion. Since these values are very close to those obtained by the logistic regression, we should expect similar predictive performance.

> logLik(fit)
'log Lik.' -115.4277 (df=15)

> AIC(fit)
[1] 260.8555

We use the function predict with the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%. This is exactly the same error rate observed for the logistic regression model.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
        predicted
actual   neg pos
    neg   92  17
    pos   26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2

Regression Trees for Continuous Response Variables

Technique 8

Regression Tree

A regression tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether "deviance" or "gini" is used as the splitting criterion; z, the continuous response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our regression tree using the bodyfat data frame contained in the TH.data package:

> library(tree)
> data(bodyfat, package = "TH.data")

NOTE

The bodyfat data set was collected by Garcia et al.24 to develop improved predictive regression equations for body fat content derived from common anthropometric measurements. The original study collected data from 117 healthy German subjects, 46 men and 71 women. The bodyfat data frame contains the data collected on 10 variables for the 71 women; see Table 10.

Name           Description
DEXfat         body fat measured by DXA (response variable)
age            age in years
waistcirc      waist circumference
hipcirc        hip circumference
elbowbreadth   breadth of the elbow
kneebreadth    breadth of the knee
anthro3a       sum of logarithm of three anthropometric measurements
anthro3b       sum of logarithm of three anthropometric measurements
anthro3c       sum of logarithm of three anthropometric measurements
anthro4        sum of logarithm of three anthropometric measurements

Table 10: Response and independent variables in the bodyfat data frame

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate the Decision Tree

Now we are ready to fit the decision tree using the training sample. We take the log of DEXfat as the response variable.

> fit <- tree(log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")

To see details of the fitted tree enter:

> fit

node), split, n, deviance, yval
      * denotes terminal node

 1) root 45 6.72400 3.364
   2) anthro4 < 5.33 19 1.33200 3.004


     4) anthro4 < 4.545 5 0.25490 2.667

    14) waistcirc < 104.25 10 0.07201 3.707 *
    15) waistcirc > 104.25 5 0.08854 3.874 *

The yval shown at each node is the estimated mean of log(DEXfat) for the observations in that node. For example, the root node indicates the overall mean of log(DEXfat) is 3.364. This is approximately the same value we would get from:

> mean(log(bodyfat$DEXfat))
[1] 3.359635

Following the splits, the first terminal node is node 4, where the estimated mean of log(DEXfat) for observations with anthro4 < 5.33 and anthro4 < 4.545 is 2.667.
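A quick manual check (a sketch, assuming the training sample drawn in Step 2): the value reported at a node is simply the mean response of the training observations that reach it, so the node 4 figure can be reproduced directly.

# Training observations satisfying the node 4 rules (anthro4 < 4.545)
node4 <- bodyfat$anthro4[train] < 4.545

# Should be approximately the 2.667 reported at node 4
mean(log(bodyfat$DEXfat[train][node4]))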

A summary of the tree can also be obtained:

> summary(fit)

Regression tree:
tree(formula = log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")
Variables actually used in tree construction:
[1] "anthro4"   "hipcirc"   "waistcirc"
Number of terminal nodes:  7
Residual mean deviance:  0.01687 = 0.6411 / 38
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -0.268300  -0.080280   0.009712   0.000000   0.073400   0.332900

Notice that summary(fit) shows:

1. the type of tree, in this case a Regression tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 7;

5. the residual mean deviance of 0.01687;

6. the distribution of the residuals, in this case with a mean of 0.

We plot the tree; see Figure 8.1.

> plot(fit); text(fit)

Figure 8.1: Fitted Regression Tree for bodyfat

Step 4: Assess Model

We use leave-one-out cross-validation, with the results plotted in Figure 8.2. Since the jagged line reaches a minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fit.cv <- cv.tree(fit, K = 45)
> plot(fit.cv)

Figure 8.2: Regression tree cross-validation results using bodyfat
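If the cross-validation curve had instead pointed to a smaller tree, it could be pruned back with prune.tree from the same package; a minimal sketch (the value best = 5 is purely illustrative):

# Prune the fitted tree back to 5 terminal nodes and re-plot it
fit.pruned <- prune.tree(fit, best = 5)
plot(fit.pruned); text(fit.pruned)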


Step 5: Make Predictions

We use the test observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 8.3. The squared correlation coefficient between predicted and observed values is 0.795.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.795
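An additional fit measure can be helpful here (a sketch, not from the original text): because the tree was fitted to log(DEXfat), the predictions can be back-transformed with exp before computing an error on the original DEXfat scale.

# Root mean squared error on the original scale of DEXfat
rmse <- sqrt(mean((exp(pred) - bodyfat$DEXfat[-train])^2))
round(rmse, 2)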

Figure 8.3: Scatterplot of predicted versus observed observations for the regression tree of bodyfat

Technique 9

Conditional Inference Regression Tree

A conditional inference regression tree is a non-parametric regression tree embedding tree-structured regression models. It is similar to the regression tree of page 62, but with extra information about the distribution of subjects in the leaf nodes. It can be estimated using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include the continuous response variable z and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2 is outlined on page 63.

Step 3: Estimate and Assess the Decision Tree

We estimate the model using the training data, followed by a plot of the fitted tree shown in Figure 9.1.

> fit <- ctree(log(DEXfat) ~ ., data = bodyfat[train, ])

> plot(fit)

Figure 9.1: Fitted Conditional Inference Regression Tree using bodyfat

Further details of the fitted tree can be obtained using the print function:

> print(fit)

	 Conditional inference tree with 4 terminal nodes

Response:  log(DEXfat)
Inputs:  age, waistcirc, hipcirc, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c, anthro4
Number of observations:  45

1) anthro3c <= 3.85; criterion = 1, statistic = 35.215
  2) anthro3c <= 3.39; criterion = 0.998, statistic = 14.061
    3)*  weights = 9
  2) anthro3c > 3.39
    4)*  weights = 12
1) anthro3c > 3.85
  5) hipcirc <= 108.5; criterion = 0.999, statistic = 15.862
    6)*  weights = 10
  5) hipcirc > 108.5
    7)*  weights = 14

At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3c <= 3.85); the criterion reflects the reported p-value and is derived from statistic. Terminal nodes (or leaves) are indicated with * and weights are the number of subjects (observations) at that node.

Step 5: Make Predictions

We use the validation observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 9.2. The squared correlation coefficient between predicted and observed values is 0.68.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
            [,1]
log(DEXfat) 0.68

Figure 9.2: Conditional Inference Regression Tree scatter plot for bodyfat

Technique 10

Linear Model Based Recursive Partitioning

A linear model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + w | a + b + c, data, model = linearModel)

Key parameters include the continuous response variable z, the linear regression covariates x, y and w, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use the entire bodyfat sample. We begin by taking the log of the response variable (DEXfat) and the two conditioning variables (waistcirc, hipcirc); the remaining covariates form the partitioning set.

> bodyfat$DEXfat <- log(bodyfat$DEXfat)

> bodyfat$waistcirc <- log(bodyfat$waistcirc)

> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc | age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function mob. Since looking at the printed output can be rather tedious, a visualization is shown in Figure 10.1. By default this produces partial scatter plots of the response variable against each of the regressors (waistcirc, hipcirc) in the terminal nodes. Each scatter plot also shows the fitted values. From this visualization it can be seen that in nodes 3, 4 and 5 body fat increases with waist and hip circumference. The increase appears steepest in node 3 and flattens out somewhat in node 5.

> fit <- mob(f, data = bodyfat, model = linearModel, control = mob_control(objfun = logLik))

> plot(fit)

PRACTITIONER TIP

Model based recursive partitioning searches for the locally optimal split in the response variable by minimizing the objective function of the model. Typically this will be something like the deviance or the negative log likelihood function. It can be specified using the mob_control control function. For example, to use the deviance you would set control = mob_control(objfun = deviance). In our analysis we use the log likelihood with control = mob_control(objfun = logLik).

Figure 10.1: Linear model based recursive partitioning tree using bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit
1) anthro3b <= 4.64; criterion = 0.998, statistic = 24.549
  2) anthro3b <= 4.29; criterion = 0.966, statistic = 16.962
    3)*  weights = 31
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -14.309        1.196        2.660

  2) anthro3b > 4.29
    4)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -5.3867       0.9887       0.9466

1) anthro3b > 4.64
  5)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -3.5815       0.6027       0.9546

The output informs us that the tree consists of five nodes. At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3b <= 4.64). Note that criterion reflects the reported p-value25 and is derived from statistic. Terminal nodes are indicated with * and weights are the number of subjects (observations) at that node. The output also presents the estimated regression coefficients at the terminal nodes. We can also use the coef function to obtain a summary of the estimated coefficients and their associated node:

> round(coef(fit), 3)
  (Intercept) waistcirc hipcirc
3     -14.309     1.196   2.660
4      -5.387     0.989   0.947
5      -3.582     0.603   0.955

The summary function also provides detailed statistical information on the fitted coefficients by node. For example, summary(fit) produces the following (we only show details of node 3):

> summary(fit)

$`3`

Call:
NULL

Weighted Residuals:

     Min       1Q   Median       3Q      Max
 -0.3272   0.0000   0.0000   0.0000   0.4376

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3093     2.4745  -5.783 3.29e-06 ***
waistcirc     1.1958     0.4033   2.965 0.006119 **
hipcirc       2.6597     0.6969   3.817 0.000685 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1694 on 28 degrees of freedom
Multiple R-squared: 0.6941,  Adjusted R-squared: 0.6723
F-statistic: 31.77 on 2 and 28 DF,  p-value: 6.278e-08

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' 54.42474 (df=14)

> AIC(fit)
[1] -80.84949

PRACTITIONER TIP

The test statistics and p-values computed in each node can be extracted using the function sctest(). For example, to see the statistics for node 2 you would type sctest(fit, node = 2).

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 10.2. The squared correlation coefficient between predicted and observed values is 0.89.

> pred <- predict(fit, newdata = bodyfat)

> plot(bodyfat$DEXfat, pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Full Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat)^2, 3)
[1] 0.89

Figure 10.2: Linear model based recursive partitioning tree predicted and observed values using bodyfat

Technique 11

Evolutionary Regression Tree

An evolutionary regression tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the continuous response variable z and the covariates contained in data.

Step 1: Load Required Packages

We build the decision tree using the bodyfat data frame (see page 62) contained in the TH.data package:

> library(evtree)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use 45 observations for the training sample. We take the log of the response variable (DEXfat) and two of the covariates (waistcirc, hipcirc). The remaining covariates are used in their original form.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function evtree. A visualization is obtained using plot and shown in Figure 11.1. This produces box and whisker plots of the response variable in each leaf. From this visualization it can be seen that body fat increases as we move from node 3 to node 5.

> fit <- evtree(f, data = bodyfat[train, ])
> plot(fit)

PRACTITIONER TIP

Notice that evtree.control is used to control important aspects of a tree. You can change the number of evolutionary iterations using niterations; this is useful if your tree does not converge by the default number of iterations. You can also specify the number of trees in the population using ntrees and the tree depth with maxdepth. For example, to limit the maximum tree depth to three and the number of iterations to ten thousand you would enter something along the lines of:

fit <- evtree(f, data = bodyfat[train, ], control = evtree.control(maxdepth = 3, niterations = 10000))

Figure 11.1: Fitted Evolutionary Regression Tree for bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit

Model formula:
DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4

Fitted party:
[1] root
|   [2] hipcirc < 109
|   |   [3] anthro3c < 3.77: 20.271 (n = 18, err = 385.9)
|   |   [4] anthro3c >= 3.77: 31.496 (n = 12, err = 186.8)
|   [5] hipcirc >= 109: 43.432 (n = 15, err = 554.8)

Number of inner nodes:    2
Number of terminal nodes: 3

PRACTITIONER TIP

Decision tree models can outperform other techniques when relationships are irregular (e.g. non-monotonic), but they are also known to be inefficient when relationships can be well approximated by simpler models.

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 11.2. The squared correlation coefficient between predicted and observed values is 0.73.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.727

Figure 11.2: Scatter plot of fitted and observed values for the Evolutionary Regression Tree using bodyfat

Decision Trees for Count & Ordinal Response Data

PRACTITIONER TIP

An oft believed maxim is "the more data the better". Whilst this may sometimes be true, it is a good idea to try to rationally reduce the number of attributes you include in your decision tree to the minimum set of highest value attributes. Oates and Jensen26 studied the influence of database size on decision tree complexity. They found that tree size and complexity depend strongly on the size of the training set. It is always worth thinking about, and removing, uninformative attributes prior to decision tree construction. For practical ideas and additional tips on how to do this see the excellent papers of John27, Brodley and Friedl28, and Cano and Herrera29.

Technique 12

Poisson Decision Tree

A decision tree for a count response variable y_i, following a Poisson distribution with a mean that depends on the covariates x_1, ..., x_k, can be built using the package rpart with the rpart function:

rpart(z ~ ., data, method = "poisson")

Key parameters include method = "poisson", which is used to indicate the type of tree to be built; z, the Poisson distributed response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package:

> library(rpart)
> library(MixAll)
> data(DebTrivedi)

NOTE

Deb and Trivedi30 model counts of medical care utilization by the elderly in the United States using data from the National Medical Expenditure Survey. They analyze data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program. The objective is to model the demand for medical care using as the response variable the number of physician/non-physician office and hospital outpatient visits. The data is contained in the DebTrivedi data frame available in the MixAll package.

Step 2: Prepare Data & Tweak Parameters

The number of physician office visits (ofp) is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).

> f <- ofp ~ hosp + health + numchron + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(f, data = DebTrivedi, method = "poisson")

To see a plot of the tree use the plot and text methods:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8)

Figure 12.1 shows a visualization of the fitted tree. Each of the five terminal nodes reports the event rate, the total number of events and the number of observations for that node. For example, for the rule chain numchron < 1.5 -> hosp < 0.5 -> numchron < 0.5 the event rate is 3.121 with 923 events at that node.

PRACTITIONER TIP

To see the number of events and observations at every node in a decision tree plot, add all = TRUE to the text function:

plot(fit); text(fit, use.n = TRUE, all = TRUE, cex = 0.6)

Figure 12.1: Poisson Decision Tree using the DebTrivedi data frame

To help validate the decision tree we use the printcp function. The 'cp' part of the function name stands for the "complexity parameter" of the tree. The function indicates the optimal tree size based on the cp value.

> printcp(fit, digits = 3)

Rates regression tree:
rpart(formula = f, data = DebTrivedi, method = "poisson")

Variables actually used in tree construction:
[1] hosp     numchron

Root node error: 26943/4406 = 6.12

n= 4406

      CP nsplit rel error xerror   xstd
1 0.0667      0     1.000  1.000 0.0332
2 0.0221      1     0.933  0.942 0.0325
3 0.0220      2     0.911  0.918 0.0326
4 0.0122      3     0.889  0.896 0.0315
5 0.0100      4     0.877  0.887 0.0315

The printcp function returns the formula used to fit the tree; the variables used to build the tree (in this case hosp and numchron); the root node error (6.12); the number of observations (n = 4406); and the relative error, cross-validation error (xerror), xstd and CP at each node split. Each row represents a different height of the tree. In general, more levels in the tree often imply a lower classification error; however, you run the risk of over fitting.

Figure 12.2 plots the relative error against the cp parameter:

> plotcp(fit)

Figure 12.2: Complexity parameter for the Poisson decision tree using the DebTrivedi data frame

PRACTITIONER TIP

A simple rule of thumb is to choose the lowest level where rel error + xstd < xerror. Another rule of thumb is to prune the tree so that it has the minimum xerror. You can do this automatically using the following:

pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])

pfit contains the pruned tree.

Since the tree is relatively parsimonious we retain the original tree. A summary of the tree can also be obtained by typing:

> summary(fit)

The first part of the output displays similar data to that obtained by using the printcp function. This is followed by variable importance details:

Variable importance
numchron     hosp   health
      55       37        8

We see that numchron is the most important variable, followed by hosp and then health.

The second part of the summary function gives details of the tree, with the last few lines giving details of the terminal nodes. For example, for node 5 we observe:

Node number 5: 317 observations
  events=2329,  estimated rate=7.346145 , mean deviance=5.810627
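The estimated rate can be checked by hand; it is essentially the raw events-per-observation ratio at the node (rpart's Poisson method applies only a small amount of shrinkage towards the overall rate):

# Raw event rate at node 5: close to the 7.346 reported by rpart
2329 / 317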


Technique 13

Poisson Model Based Recursive Partitioning

A Poisson model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + w | a + b + c, data, model = linearModel, family = poisson(link = log))

Key parameters include the Poisson distributed response variable of counts z, the regression covariates x, y and w, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package. Details of this data frame are given on page 86.

> library(party)
> library(MixAll)
> data(DebTrivedi)

Step 2: Prepare Data & Tweak Parameters

For our analysis we use all the observations in DebTrivedi to estimate a model with ofp (number of physician office visits) as the response variable and numchron (number of chronic conditions) as the Poisson regression conditioning variable. The remaining variables (hosp, health, gender, school, privins) are used as the partitioning set. The model is stored in f.


> f <- ofp ~ numchron | hosp + health + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit and plot the decision tree; see Figure 13.1.

> fit <- mob(f, data = DebTrivedi, model = linearModel, family = poisson(link = log))

> plot(fit)

Figure 13.1: Poisson Model Based Recursive Partitioning Tree for DebTrivedi

Coefficient estimates at the leaf nodes are given by:

> round(coef(fit), 3)
   (Intercept) numchron
3        7.042    0.231
6        2.555    0.892
7        2.672    1.400
9        4.119    0.942
10       2.788    1.043
12       3.769    1.212
14       6.982    1.123
15      14.579    0.032

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -14109.87 (df=31)

> AIC(fit)
[1] 28281.75

The results of the parameter stability tests for any given node can be retrieved using sctest. For example, to retrieve the statistics for node 2, enter:

> round(sctest(fit, node = 2), 2)
           hosp health gender school privins
statistic 13.94  36.25   7.93  24.22   16.87
p.value    0.11   0.00   0.09   0.00    0.00

Technique 14

Conditional Inference Ordinal Response Tree

The Conditional Inference Ordinal Response Tree is used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". A conditional inference ordinal response tree can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include the response variable z, an ordered factor, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our tree using the wine data frame contained in the ordinal package:

> library(party)
> library(ordinal)
> data(wine)

NOTE

The wine data frame was analyzed by Randall31 in an experiment on factors determining the bitterness of wine. The bitterness (rating) was measured as 1 = "least bitter" and 5 = "most bitter". Temperature and contact between juice and skins can be controlled during wine production. Two treatment factors were collected - temperature (temp) and contact (contact) - with each having two levels. Nine judges assessed wine from two bottles from each of the four treatment conditions, resulting in a total of 72 observations in all.

Step 2: Estimate and Assess the Decision Tree

We estimate the model using all of the data, with rating as the response variable. This is followed by a plot, shown in Figure 14.1, of the fitted tree.

> fit <- ctree(rating ~ temp + contact, data = wine)
> plot(fit)

Figure 14.1: Fitted Conditional Inference Ordinal Response Tree using wine

The decision tree has three leaves - Node 3 with 18 observations, Node 4 with 18 observations and Node 5 with 36 observations. The covariate temp is highly significant (p < 0.001) and contact is significant at the 5% level. Notice the terminal nodes contain the distribution of bitterness scores.

We compare the fitted values with the observed values and print out the confusion matrix and error rate. The overall misclassification rate is 56.9% for the fitted tree.

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))
> tb

      predicted
actual  1  2  3  4  5
     1  0  5  0  0  0
     2  0 16  5  1  0
     3  0 13  8  5  0
     4  0  2  3  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3)
[1] 0.569
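The vector pred used in the confusion matrix above is not shown being created in the listing; a minimal sketch of the implied step (using party's predict method on the fitted ctree) is:

# Predicted rating class for each of the 72 wines in the training data
pred <- predict(fit)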


Decision Trees for Survival Analysis

Studies involving time to event data are numerous and arise in all areas of research. For example, survival times or other time-to-failure related measurements, such as relapse time, are major concerns when modeling medical data. The Cox proportional hazard regression model and its extensions have been the traditional tool of the data scientist for modeling survival variables with censoring. These parametric (and semi-parametric) models remain useful staples, as they allow simple interpretations of the covariate effects and can readily be used for statistical inference. However, such models force a specific link between the covariates and the response. Even though interactions between covariates can be incorporated, they must be specified by the analyst.

Survival decision trees allow the data scientist to carry out their analysis without imposing a specific link function or knowing a priori the nature of variable interaction. Survival trees offer great flexibility because they can automatically detect certain types of interactions without the need to specify them beforehand. Prognostic groupings are a natural output from survival trees; this is because the basic idea of a decision tree is to partition the covariate space recursively to form groups (nodes in the tree) of subjects which are similar according to the outcome of interest.

Technique 15

Exponential Algorithm

A decision tree for survival data, where the time values are assumed to fit an exponential model32, can be built using the package rpart with the rpart function:

rpart(z ~ ., data, method = "exp")

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; and method = "exp", which is used to indicate a survival decision tree.

Step 1: Load Required Packages

We build a survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(rpart)
> library(simexaft)
> library(survival)
> data(rhDNase)

NOTE

Respiratory disease in patients with cystic fibrosis is characterized by airway obstruction caused by the accumulation of thick purulent secretions. The viscoelasticity of these secretions can be reduced in-vitro by recombinant human deoxyribonuclease I (rhDNase), a bioengineered copy of the human enzyme. The rhDNase data set contained in the simexaft package contains a subset of the original data collected by Fuchs et al.33, who performed a randomized, double-blind, placebo-controlled study on 968 adults and children with cystic fibrosis to determine the effects of once-daily and twice-daily administration of rhDNase. The patients were treated for 24 weeks as outpatients. The rhDNase data frame contains data on the occurrence and resolution of all exacerbations for 641 patients.

Step 2: Prepare Data & Tweak Parameters

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is the time from randomization to the first pulmonary exacerbation, stored in the survival object z.

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(z ~ trt + fev.ave, data = rhDNase, method = "exp")

To see a plot of the tree enter:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8, all = TRUE)

Figure 15.1 shows a plot of the fitted tree. Notice that the terminal nodes report the estimated rate as well as the number of events and observations available. For example, for the rule fev.ave < 80.03 the estimated event rate is 0.35 with 27 events out of a total of 177.

PRACTITIONER TIP

To see the rules that lead to a specific node use the function path.rpart(fitted_Model, node = x). For example, to see the rules associated with node 7 enter:

> path.rpart(fit, node = 7)

 node number: 7
   root
   fev.ave< 80.03
   fev.ave< 46.42

Figure 15.1: Survival Decision Tree using the rhDNase data frame

To help assess the decision tree we use the plotcp function, shown in Figure 15.2. Since the tree is relatively parsimonious there is little need to prune.

> plotcp(fit)

Figure 15.2: Complexity parameter and error for the Survival Decision Tree using the rhDNase data frame

Survival times can vary greatly between subjects. Decision tree analysis is a useful tool to homogenize the data by separating it into different subgroups based on treatments and other relevant characteristics. In other words, a single tree will group subjects according to their survival behavior based on their covariates. For a final summary of the model it can be helpful to plot the probability of survival based on the final nodes in which the individual patients landed, as shown in Figure 15.3.

We see that node 2 appears to have the most favorable survival characteristics.

> km <- survfit(z ~ fit$where, data = rhDNase)

> plot(km, lty = 1:3, mark.time = FALSE, xlab = "Time", ylab = "Status")

> legend(150, 0.2, paste('node', c(2, 4, 5)), lty = 1:3)

Figure 15.3: Survival plot by terminal node for rhDNase

Technique 16

Conditional Inference Survival Tree

A conditional inference survival tree is a non-parametric regression tree embedding tree-structured regression models. This is essentially a decision tree, but with extra information about survival in the terminal nodes34. It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the sample of explanatory variables.

Step 1: Load Required Packages

We build a conditional inference survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Details of Step 2 are given on page 100.

Step 3: Estimate the Decision Tree & Assess Fit

Next we fit and plot the decision tree. Figure 16.1 shows the resultant tree.

> fit <- ctree(z ~ trt + fev.ave, data = rhDNase)
> plot(fit)

Figure 16.1: Conditional Inference Survival Tree for rhDNase

Notice that the internal nodes report the p-value for the split, whilst the leaf nodes give the number of subjects and a plot of the estimated survival curve. We can obtain a summary of the tree by typing:

> print(fit)

	 Conditional inference tree with 4 terminal nodes

Response:  z
Inputs:  trt, fev.ave
Number of observations:  641

1) fev.ave <= 79.95; criterion = 1, statistic = 58.152
  2) trt <= 0; criterion = 0.996, statistic = 9.34
    3)*  weights = 232
  2) trt > 0
    4) fev.ave <= 45.3; criterion = 0.957, statistic = 5.254
      5)*  weights = 105
    4) fev.ave > 45.3
      6)*  weights = 127
1) fev.ave > 79.95
  7)*  weights = 177

Node 7 is a terminal or leaf node (the symbol "*" signifies this) with the decision rule fev.ave > 79.95. At this node there are 177 observations.

We grab the fitted responses using the function treeresponse and store them in stree. Notice that every stree component is a survival object of class survfit.

> stree <- treeresponse(fit)

> class(stree[[2]])
[1] "survfit"

> class(stree[[7]])
[1] "survfit"

For this particular tree we have four terminal nodes, so there are only four unique survival objects. You can use the where method to see which nodes the observations are in:

> subjects <- where(fit)

> table(subjects)
subjects
  3   5   6   7
232 105 127 177

So we have 232 subjects in node 3 and 127 subjects in node 6, which agree with the numbers reported in Figure 16.1.

We end our initial analysis by plotting, in Figure 16.2, the survival curve for node 3 with a 95% confidence interval, and also mark on the plot the time of individual subject events.

> plot(stree[[3]], conf.int = TRUE, mark.time = TRUE, ylab = "Cumulative Survival (%)", xlab = "Days Elapsed")

Figure 16.2: Survival curve for node 3

Notes

1. See for example the top ten list of Wu, Xindong et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14.1 (2008): 1-37.
2. Koch Y, Wolf T, Sorger PK, Eils R, Brors B (2013). "Decision-Tree Based Model Analysis for Efficient Identification of Parameter Relations Leading to Different Signaling States." PLoS ONE 8(12): e82593. doi:10.1371/journal.pone.0082593.
3. Ebenezer Hailemariam, Rhys Goldstein, Ramtin Attar & Azam Khan (2011). "Real-Time Occupancy Detection using Decision Trees with Multiple Sensor Types." SimAUD 2011 Conference Proceedings, Symposium on Simulation for Architecture and Urban Design.
4. Taken from Stiglic G, Kocbek S, Pernek I, Kokol P (2012). "Comprehensive Decision Tree Models in Bioinformatics." PLoS ONE 7(3): e33812. doi:10.1371/journal.pone.0033812.
5. Bohanec M, Bratko I (1994). "Trading accuracy for simplicity in decision trees." Machine Learning 15: 223-250.
6. See for example J. R. Quinlan, "Induction of decision trees." Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
7. See J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
8. See Breiman, Leo et al. Classification and Regression Trees. CRC Press, 1984.
9. See R. L. De Mántaras, "A distance-based attribute selection measure for decision tree induction." Machine Learning, vol. 6, no. 1, pp. 81-92, 1991.
10. Zhang, Ting et al. "Using decision trees to measure activities in people with stroke." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.
11. See for example J. Liu, M. A. Valencia-Sanchez, G. J. Hannon and R. Parker, "MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies." Nature Cell Biology, vol. 7, no. 7, pp. 719-723, 2005.
12. Williams, Philip H., Rod Eyles and George Weiller. "Plant MicroRNA prediction by supervised machine learning using C5.0 decision trees." Journal of Nucleic Acids 2012 (2012).
13. Nakayama, Nobuaki et al. "Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees." Journal of Gastroenterology 47.6 (2012): 664-677.
14. Hepatic encephalopathy, also known as portosystemic encephalopathy, is the loss of brain function (evident in confusion, altered level of consciousness and coma) as a result of liver failure.
15. de Oña, Juan, Griselda López and Joaquín Abellán. "Extracting decision rules from police accident reports through decision trees." Accident Analysis & Prevention 50 (2013): 1151-1160.
16. See for example Kashani A, Mohaymany A (2011). "Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models." Safety Science 49: 1314-1320.
17. Monedero, Iñigo et al. "Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees." International Journal of Electrical Power & Energy Systems 34.1 (2012): 90-98.
18. Maximum and minimum value, monthly consumption, number of meter readings, number of hours of maximum power consumption, and three variables to measure abnormal consumption.
19. Wang, Quan et al. "Tracking tetrahymena pyriformis cells using decision trees." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
20. This data set comes from the Turing Institute, Glasgow, Scotland.
21. For further details see http://www.rulequest.com/see5-comparison.html
22. For further details see: (1) Hothorn, Torsten, Kurt Hornik and Achim Zeileis. "ctree: Conditional Inference Trees."; (2) Hothorn, Torsten, Kurt Hornik and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674; (3) Hothorn, Torsten et al. "party: A laboratory for recursive partytioning." (2010).
23. http://www.niddk.nih.gov
24. Garcia, Ada L. et al. "Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths." Obesity Research 13.3 (2005): 626-634.
25. The reported p-value at a node is equal to 1 - criterion.
26. Oates T, Jensen D (1997). "The effects of training set size on decision tree complexity." In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254-262.
27. John GH (1995). "Robust decision trees: removing outliers from databases." In Proceedings of the First Conference on Knowledge Discovery and Data Mining, pp. 174-179.
28. Brodley CE, Friedl MA (1999). "Identifying mislabeled training data." Journal of Artificial Intelligence Research 11: 131-167.
29. Cano JR, Herrera F, Lozano M (2007). "Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability." Data & Knowledge Engineering 60(1): 90-108.
30. Deb, Partha and Pravin K. Trivedi. "Demand for medical care by the elderly: a finite mixture approach." Journal of Applied Econometrics 12.3 (1997): 313-336.
31. See Randall J (1989). "The analysis of sensory data by generalised linear model." Biometrical Journal 7, pp. 781-793.
32. See Atkinson, Elizabeth J. and Terry M. Therneau. "An introduction to recursive partitioning using the RPART routines." Rochester: Mayo Foundation (2000).
33. See Henry J. Fuchs, Drucy S. Borowitz, David H. Christiansen, Edward M. Morris, Martha L. Nash, Bonnie W. Ramsey, Beryl J. Rosenstein, Arnold L. Smith and Mary Ellen Wohl, for the Pulmozyme Study Group. N Engl J Med 1994; 331:637-642, September 8, 1994.
34. For further details see: (1) Hothorn, Torsten, Kurt Hornik and Achim Zeileis. "ctree: Conditional Inference Trees."; (2) Hothorn, Torsten, Kurt Hornik and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674; (3) Hothorn, Torsten et al. "party: A laboratory for recursive partytioning." (2010).


Part II

Support Vector Machines


The Basic Idea

The support vector machine (SVM) is a supervised machine learning algorithm35 that can be used for both regression and classification. Let's take a quick look at how the SVM performs classification. The core idea of SVMs is that they construct hyperplanes in a multidimensional space that separate objects which belong to different classes. A decision plane is then used to define the boundaries between the different classes. Figure 16.3 visualizes this idea.

The decision plane separates a set of observations into their respective classes using a straight line. In this example the observations belong either to class "solid circle" or class "edged circle". The separating line defines a boundary, on the right side of which all objects are "solid circle" and to the left of which all objects are "edged circle".

Figure 16.3: Schematic illustration of a decision plane determined by a linear classifier

In practice, as illustrated in Figure 16.4, the majority of classification problems require nonlinear boundaries in order to determine the optimal separation36. Figure 16.5 illustrates how SVMs solve this problem. The left side of the figure represents the original sample (known as the input space), which is mapped using a set of mathematical functions, known as kernels, to the feature space. The process of rearranging the objects for optimal separation is known as transformation. Notice that in mapping from the input space to the feature space the mapped objects are linearly separable. Thus, although the SVM uses linear learning methods, due to its nonlinear kernel function it is in effect a nonlinear classifier.
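The key point is that the kernel lets the SVM compare two observations as if they had been mapped to the feature space, without ever constructing that space explicitly. A minimal sketch (the function name, gamma value and test points are illustrative, not from the text) of the widely used radial basis function kernel:

# RBF kernel value between two observations x and y
rbf_kernel <- function(x, y, gamma = 0.5) exp(-gamma * sum((x - y)^2))

# Similar points give values near 1; distant points give values near 0
rbf_kernel(c(1, 2), c(1.5, 1.0))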

NOTE

Since intuition is better built from examples that are easy to imagine, lines and points are drawn in the Cartesian plane instead of hyperplanes and vectors in a high dimensional space. Remember that the same concepts apply where the examples to be classified lie in a space whose dimension is higher than two.

Figure 16.4: Nonlinear boundary required for correct classification

Figure 16.5: Mapping from input to feature space in an SVM

Overview of SVM Implementation

The SVM finds the decision hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This minimizes the risk of misclassifying not only the examples in the training data set but also the yet-to-be-seen examples of the test set.

To construct an optimal hyperplane the SVM employs an iterative training algorithm which is used to minimize an error function.

Let's take a look at how this is achieved. Given a set of feature vectors $x_i$ $(i = 1, 2, \dots, N)$, a target $y_i \in \{-1, +1\}$ with corresponding binary labels is associated with each feature vector $x_i$. The decision function for classification of unseen examples is given as

$$y = f(x, \alpha) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b\right) \qquad (16.1)$$

where the $s_i$ are the $N_s$ support vectors and $K(s_i, x)$ is the kernel function. The parameters are determined by maximizing the margin of the hyperplane (see Figure 16.6):

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i \cdot x_j) \qquad (16.2)$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C. \qquad (16.3)$$

To build an SVM classifier the user needs to tune the cost parameter C, choose a kernel function, and optimize its parameters.
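In R this tuning step is typically done by grid search with cross-validation. The following minimal sketch (the e1071 package, the iris data and the grid values are purely illustrative choices, not part of the original text) tunes the cost parameter C and the RBF kernel parameter gamma:

library(e1071)
data(iris)

# Grid search over cost and gamma with 10-fold cross-validation
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "radial",
              ranges = list(cost = 10^(-1:3), gamma = 10^(-3:1)))

summary(tuned)          # cross-validated error for each (cost, gamma) pair
fit <- tuned$best.model # SVM refitted with the best parameter combination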

PRACTITIONER TIP

I once worked for an Economist who was trained (essentially) in one statistical technique (linear regression and its variants). Whenever there was an empirical issue, this individual always tried to frame it in terms of his understanding of linear models and economics. Needless to say, this archaic approach to modeling led to all sorts of difficulties. The data scientist is pragmatic in their modeling approach: linear, non-linear, Bayesian, boosting; they are guided by statistical theory and machine learning insights, and unshackled from the vagueness of economic theory.

Note on the Slack Parameter

The variable C, known as the slack parameter, serves as the cost parameter that controls the trade-off between the margin and classification error. If no slack is allowed (often known as a hard margin) and the data are linearly separable, the support vectors are the points which lie along the supporting hyperplanes, as shown in Figure 16.6. In this case all of the support vectors lie exactly on the margin.

Figure 16.6: Support vectors for linearly separable data and a hard margin

In many situations this will not yield useful results, and a soft margin will be required. In this circumstance some proportion of data points are allowed to remain inside the margin. The slack parameter C is used to control this proportion37. A soft margin results in a wider margin and greater error on the training data set; however, it improves the generalization to new data and reduces the likelihood of over fitting.

NOTE

The total number of support vectors depends on the amount of allowed slack and the distribution of the data. If a large amount of slack is permitted, there will be a larger number of support vectors than in the case where very little slack is permitted. Fewer support vectors means faster classification of test points. This is because the computational complexity of the SVM is linear in the number of support vectors.
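This relationship between slack and the number of support vectors is easy to see directly; a small illustration (assuming the e1071 package and the iris data, used here only as an example):

library(e1071)
data(iris)

# Small cost = large slack = soft margin; large cost = nearly hard margin
soft <- svm(Species ~ ., data = iris, kernel = "linear", cost = 0.01)
hard <- svm(Species ~ ., data = iris, kernel = "linear", cost = 100)

soft$tot.nSV  # many support vectors with a wide, soft margin
hard$tot.nSV  # far fewer with a nearly hard margin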


Practical Applications

NOTE

The kappa coefficient38 is a measure of agreement between categorical variables. It is similar to the correlation coefficient in that higher values indicate greater agreement. It is calculated as

$$\kappa = \frac{P_o - P_e}{1 - P_e} \qquad (16.4)$$

where $P_o$ is the observed proportion correctly classified and $P_e$ is the proportion correctly classified by chance.
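Kappa is straightforward to compute from a confusion matrix; a minimal sketch (the counts below are hypothetical, not from any study discussed here):

# A 2 x 2 confusion matrix of observed versus predicted classes
tb <- matrix(c(85, 10, 15, 90), nrow = 2,
             dimnames = list(actual = c("neg", "pos"),
                             predicted = c("neg", "pos")))

Po <- sum(diag(tb)) / sum(tb)                      # observed agreement
Pe <- sum(rowSums(tb) * colSums(tb)) / sum(tb)^2   # agreement expected by chance
kappa <- (Po - Pe) / (1 - Pe)
round(kappa, 3)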

Classification of Dengue Fever Patients

The Dengue virus is a mosquito-borne pathogen that infects millions of people every year. Gomes et al39 use the support vector machine algorithm to classify 28 dengue patients from the Recife metropolitan area, Brazil (13 with dengue fever (DF) and 15 with dengue haemorrhagic fever (DHF)) based on mRNA expression data of 11 genes involved in the innate immune response pathway (MYD88, MDA5, TLR3, TLR7, TLR9, IRF3, IRF7, IFN-alpha, IFN-beta, IFN-gamma and RIGI).

A radial basis function is used and the model built using leave-one-out cross-validation, repeated fifteen times under different conditions to analyze the individual and collective contributions of each gene expression data to DF/DHF classification.

A different gene was removed during the first twelve cross-validations. During the last three cross-validations multiple genes were removed. Figure 16.7 shows the overall accuracy for the support vector machine for differing values of its parameter C.

Figure 16.7: SVM optimization. Optimization of the parameters C and c of the SVM kernel RBF. Source: Gomes et al, doi:10.1371/journal.pone.0011267.g003

PRACTITIONER TIP

To transform the gene expression data to a suitable format for support vector machine training and testing, Gomes et al designate each gene as either "10" (for observation of up-regulation) or "01" (for observation of down-regulation). Therefore the collective gene expressions observed in each patient were represented by a 24-dimension vector (12 genes × 2 gene states, up- or down-regulated). Each of the 24-dimension vectors was labeled as either "1" for DF patients or "-1" for DHF patients. Notice this is a different classification structure than that used in traditional statistical modeling of binary variables. Typically binary observations are measured by Statisticians using 0 and 1. Be sure you have the correct classification structure when moving between traditional statistical models and those developed out of machine learning.
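The sketch below is not the study's actual code, but it illustrates the encoding idea with two hypothetical gene columns: each gene becomes a pair of binary columns, so a patient's profile is the "10"/"01" pattern described above.

# illustrative up/down-regulation calls for two hypothetical genes
genes <- data.frame(MYD88 = c("up", "down", "up"),
                    TLR3  = c("down", "down", "up"))

encode_gene <- function(g) {
  # one gene -> two binary columns ("10" for up, "01" for down)
  cbind(up = as.integer(g == "up"), down = as.integer(g == "down"))
}

# each patient row becomes a 2 * (number of genes) binary vector
X <- do.call(cbind, lapply(genes, encode_gene))
X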

Forecasting Stock Market Direction

Huang et al40 use a support vector machine to predict the direction of weekly changes in the NIKKEI 225 Japanese stock market index. The index is composed of 225 stocks of the largest Japanese publicly traded companies.

Two independent variables are selected as inputs to the model: weekly changes in the S&P 500 index and weekly changes in the US dollar - Japanese Yen exchange rate.

Data was collected from January 1990 to December 2002, yielding a total sample size of 676 observations. The researchers use 640 observations to train their support vector machine and perform an out of sample evaluation on the remaining 36 observations.

As a benchmark the researchers compare the performance of their model to four other models: a random walk, linear discriminant analysis, quadratic discriminant analysis and a neural network.

The random walk correctly predicts the direction of the stock market 50% of the time, linear discriminant analysis 55%, quadratic discriminant analysis and the neural network 69%, and the support vector machine 73%.

The researchers also observe that an information weighted combination of the models correctly predicts the direction of the stock market 75% of the time.

Bankruptcy Prediction

Min et al41 develop a support vector machine to predict bankruptcy and compare its performance to a neural network, logistic regression and multiple discriminant analysis.

Data on 1,888 firms is collected from Korea's largest credit guarantee organization. The data set contains 944 bankruptcy and 944 surviving firms. The attribute set consisted of 38 popular financial ratios and was reduced by principal component analysis to two "fundamental" factors.

The training set consists of 80% of the observations, with the remaining 20% of observations used for the hold out test sample. A radial basis function is used for the kernel. Its parameters are optimized using a grid search procedure and 5-fold cross-validation.

In rank order (for the hold out data), the support vector machine had a prediction accuracy of 83%, the neural network 82.5%, multiple discriminant analysis 79.1% and the logistic regression 78.3%.

Early Onset Breast Cancer

Breast cancer is often classified according to the number of estrogen receptors present on the tumor. Tumors with a large number of receptors are termed estrogen receptor positive (ER+), and estrogen receptor negative (ER-) for few or no receptors. ER status is important because ER+ cancers grow under the influence of estrogen and may respond well to hormone suppression treatments. This is not the case for ER- cancers, as they do not respond to hormone suppression treatments.

Upstill-Goddard et al42 investigate whether patients who develop ER+ and ER- tumors show distinct constitutional genetic profiles using genetic single nucleotide polymorphism data. At the core of their analysis was a support vector machine with linear, normalized quadratic polynomial, quadratic polynomial, cubic and radial basis kernels. The researchers opt for a 10-fold cross-validation.

All five kernel models had an accuracy rate in excess of 93%, see Table 11.

Kernel Type                        Correctly Classified (%)
Linear                             93.28 ± 3.07
Normalized quadratic polynomial    93.69 ± 2.69
Quadratic polynomial               93.89 ± 3.06
Cubic polynomial                   94.64 ± 2.94
Radial basis function              95.95 ± 2.61

Table 11: Upstill-Goddard et al's kernels and classification results

Flood Susceptibility

Tehrany et al43 evaluate support vector machines with different kernel functions for spatial prediction of flood occurrence in the Kuala Terengganu basin, Malaysia. Model attributes were constructed using ten geographic factors: altitude, slope, curvature, stream power index (SPI), topographic wetness index (TWI), distance from the river, geology, land use/cover (LULC), soil and surface runoff.

Four kernels, linear (LN), polynomial (PL), radial basis function (RBF) and sigmoid (SIG), were used to assess factor importance. This was achieved by eliminating the factor and then measuring the Cohen's kappa index of the model. The overall rank of each factor44 is shown in Table 12. Overall, slope was the most important factor, followed by distance from river and then altitude.

Factor     Average   Rank
Altitude   0.265     3
Slope      0.288     1
Curvature  0.225     6
SPI        0.215     8
TWI        0.235     4
Distance   0.268     2
Geology    0.223     7
LULC       0.215     8
Soil       0.228     5
Runoff     0.140     10

Table 12: Variable importance calculated from data reported in Tehrany et al

PRACTITIONER TIP

Notice how both Upstill-Goddard et al and Tehrany et al use multiple kernels in developing their models. This is always a good strategy because it is not always obvious which kernel is optimal at the onset of a research project.

Signature Authentication

Radhika et al45 consider the problem of automatic signature authentication using a variety of algorithms, including a support vector machine (SVM). The other algorithms considered included a Bayes classifier (BC), fast Fourier transform (FT), linear discriminant analysis (LD) and principal component analysis (PCA).

Their experiment used a signature database containing 75 subjects, with 15 genuine samples and 15 forged samples for each subject. Features were extracted from images drawn from the database and used as inputs to train and test the various methods.

The researchers report a false rejection rate of 8% for SVM, 13% for FT, 10% for BC, 11% for PCA and 12% for LD.

Prediction of Vitamin D Status

The accepted biomarker of vitamin D status is serum 25-hydroxyvitamin D (25(OH)D) concentration. Unfortunately, in large epidemiological studies direct measurement is often not feasible. However, useful proxies for 25(OH)D are available by using questionnaire data.

Guo et al46 develop a support vector machine to predict serum 25(OH)D concentration in large epidemiological studies using questionnaire data. A total of 494 participants were recruited onto the study and asked to complete a questionnaire which included sun exposure and sun protection behaviors, physical activity, smoking history, diet and the use of supplements. Skin types were defined by spectrophotometric measurements of skin reflectance to calculate melanin density for exposed skin sites (dorsum of hand, shoulder) and non-exposed skin sites (upper inner arm, buttock).

A multiple regression model (MLR) estimated using 12 explanatory variables47 was used to benchmark the support vector machine. The researchers selected a radial basis function for the kernel, with identical explanatory factors used in the MLR. The data were randomly assigned to a training sample (n = 294) and validation sample (n = 174).

The researchers report a correlation of 0.74 between predicted scores and measured 25(OH)D concentration for the support vector machine. They also note that it performed better than MLR in correctly identifying individuals with vitamin D deficiency. Overall they conclude "RBF SVR [radial basis function support vector machine] method has considerable promise for the prediction of vitamin D status for use in chronic disease epidemiology and potentially other situations".

PRACTITIONER TIP

The performance of the SVM is very closely tied to the choice of the kernel function. There exist many popular kernel functions that have been widely used for classification, e.g. linear, Gaussian radial basis function, polynomial and so on. Data Scientists can spend a considerable amount of time tweaking the parameters of a specified kernel function via trial-and-error. Here are four general approaches that can speed up the process:

1. Cross-validation48 (see the sketch following this list).

2. Multiple kernel learning49 attempts to construct a generalized kernel function so as to solve all classification problems through combining different types of standard kernel functions.

3. Evolution & Particle swarm optimization: Thadani et al50 use gene expression programming algorithms to evolve the kernel function of SVM. An analogous approach has been proposed using particle swarm optimization51.

4. Automatic kernel selection using the C5.0 algorithm, which attempts to select the optimal kernel function based on the statistical data characteristics and distribution information.
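As a concrete illustration of the first approach, the sketch below (not from the book) runs a 10-fold cross-validated grid search over the cost and gamma parameters of a radial basis kernel using tune.svm from the e1071 package; the data set and parameter grids are arbitrary choices for illustration.

library(e1071)
data(iris)

set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 2^(-4:2), cost = 2^(0:6),
                  tunecontrol = tune.control(sampling = "cross", cross = 10))

tuned$best.parameters    # gamma / cost pair with the lowest CV error
tuned$best.performance   # the corresponding 10-fold cross validation error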


Support Vector Classification


Technique 17

Binary Response Classification with C-SVM

A C-SVM for binary response classification can be estimated using the package svmpath with the svmpath function:

svmpath(x, y, kernel.function, ...)

Key parameters include the response variable y coded as (-1, +1), the covariates x, and the specified kernel using kernel.function.

PRACTITIONER TIP

Although there are an ever growing number of kernels, four workhorses of applied research are:

• Linear: $K(x_i, x_j) = x_i^{T} x_j$

• Polynomial: $K(x_i, x_j) = (\gamma x_i^{T} x_j + r)^d,\ \gamma > 0$

• Radial basis function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),\ \gamma > 0$

• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^{T} x_j + r)$

Here $\gamma$, $r$ and $d$ are kernel parameters.
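As a quick numerical check of these formulas, the sketch below (illustrative values only, not from the book) evaluates each kernel for a single pair of attribute vectors.

xi <- c(1, 2, 3)
xj <- c(2, 0, 1)
gamma <- 0.5; r <- 1; d <- 2

k_linear  <- sum(xi * xj)                      # x_i' x_j
k_poly    <- (gamma * sum(xi * xj) + r)^d      # (gamma x_i' x_j + r)^d
k_rbf     <- exp(-gamma * sum((xi - xj)^2))    # exp(-gamma ||x_i - x_j||^2)
k_sigmoid <- tanh(gamma * sum(xi * xj) + r)    # tanh(gamma x_i' x_j + r)

c(linear = k_linear, polynomial = k_poly, rbf = k_rbf, sigmoid = k_sigmoid)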


Step 1: Load Required Packages

We build the C-SVM for binary response classification using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> data(PimaIndiansDiabetes2, package = "mlbench")
> require(svmpath)

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of misclassified values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining misclassified values. The cleaned data is stored in temp.

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

The response diabetes is a factor containing the labels "pos" and "neg". However, the svmpath method requires the response to be numeric, taking the values -1 or +1. The following converts diabetes into a format usable by svmpath.

> y <- (temp$diabetes)
> levels(y) <- c(-1, 1)
> y <- as.numeric(as.character(y))
> y <- as.matrix(y)

Support vector machine kernels generally depend on the inner product of attribute vectors. Therefore very large values might cause numerical problems. We standardize the attributes using the scale method; the results are stored in the matrix x.

> x <- temp
> x$diabetes <- NULL   # remove the response variable
> x <- scale(x)

We use nrow to measure the remaining observations (as a check, it should equal 724). We then set the training sample to select at random, without replacement, 600 observations. The remaining 124 observations form the test sample.


> set.seed(103)
> n = nrow(x)
> train <- sample(1:n, 600, FALSE)

Step 3: Estimate & Evaluate Model

The svmpath function can use two popular kernels, the polynomial and radial basis function. We will assess both, beginning with the polynomial kernel. It is selected by setting kernel.function = poly.kernel. We also set trace = FALSE to prevent svmpath from printing results to the screen at each iteration.

> fit <- svmpath(x[train, ], y[train, ],
                 kernel.function = poly.kernel, trace = FALSE)

A nice feature of svmpath is that it computes the entire regularization path for the SVM cost parameter, along with the associated classification error. We use the with method to identify the minimum error.

> with(fit, Error[Error == min(Error)])
[1] 140 140 140 140 140 140 140 140 140 140 140 140 140 140

Two things are worth noting here. First, the number of misclassified observations is 140 out of the 600 observations, which is approximately 23%. Second, each observation of 140 is associated with a unique regularization value. Since the regularization value is the penalty parameter of the error term (and svmpath reports the inverse of this parameter), we will select for our model the minimum value and store it in lambda. This is achieved via the following steps:

1. Store the minimum error values in error.

2. Grab the row numbers of these minimum errors using the which method.

3. Obtain the regularization parameter values associated with the minimum errors and store them in temp_lamdba.

4. Identify which value in temp_lamdba has the minimum value, store it in lambda and print it to the screen.


> error <- with(fit, Error[Error == min(Error)])
> min_err_row <- which(fit$Error == min(fit$Error))
> temp_lamdba <- fit$lambda[min_err_row]

> loc <- which(fit$lambda[min_err_row] == min(fit$lambda[min_err_row]))

> lambda <- temp_lamdba[loc]
> lambda
[1] 73.83352

The method svmpath actually reports the inverse of the kernel regularization parameter (often, and somewhat confusingly, called gamma in the literature). We obtain a value of 73.83352, which corresponds to a gamma of 1/73.83352 = 0.0135.

Next we follow the same procedure for the radial basis function kernel. In this case the estimated regularization parameter is stored in lambdaR.

> fitR <- svmpath(x[train, ], y[train, ],
                  kernel.function = radial.kernel, trace = FALSE)
> error <- with(fitR, Error[Error == min(Error)])

> min_err_row <- which(fitR$Error == min(fitR$Error))
> temp_lamdba <- fitR$lambda[min_err_row]

> loc <- which(fitR$lambda[min_err_row] == min(fitR$lambda[min_err_row]))

> lambdaR <- temp_lamdba[loc]

> lambdaR
[1] 0.09738556
> error[1]/600
[1] 0.015

Two things are noteworthy about this result. First, the regularization parameter is estimated as 1/0.09738556 = 10.268. Second, the error is estimated at only 1.5%. This is very likely an indication that the model has been over fit.


PRACTITIONER TIP

It often helps intuition to visualize data. Let's use a few lines of code to estimate a simple model and visualize it. We will fit a sub-set of the model we are already working on, using the first twelve patient observations on the attributes glucose and age, storing the result in xx. We standardize xx using the scale method. Then we grab the first twelve observations of the response variable diabetes.

> xx <- cbind(temp$glucose[1:12], temp$age[1:12])
> xx <- scale(xx)
> yy <- y[1:12]

Next we use the method svmpath to estimate the model and use plot to show the resultant data points. We use step = 1 to show the results at the first value of the regularization parameter and step = 8 to show the results at the last step. The dotted line in Figure 17.1 represents the margin; notice it gets narrower from step 1 to 8 as the cost parameter increases. The support vectors are represented by the open dots. Notice that the number of support vectors increases as we move from step 1 to step 8. Also notice at step 8 only one point is misclassified (point 7).

> example <- svmpath(xx, yy, trace = TRUE, plot = FALSE)
> par(mfrow = c(1, 2))
> plot(example, xlab = "glucose", ylab = "age", step = 1)
> plot(example, xlab = "glucose", ylab = "age", step = 8)


Figure 17.1: Plot of support vectors using svmpath with the response variable diabetes and attributes glucose & age


Step 4: Make Predictions

NOTE

Automated choice of kernel regularization parameters is challenging. This is because it is extremely easy to over fit a SVM model on the validation sample if you only consider the misclassification rate. The consequence is that you end up with a model that is not generalizable to the test data and/or a model that performs considerably worse than the discarded models with higher test sample error rates.52

Although we suspect the radial basis kernel results in an over fit, we will compare its predictions to those of the optimal polynomial kernel. First we use the test data and the radial basis kernel via the predict method. The confusion matrix is printed using the table method. The error rate is then calculated; it is 35% and considerably higher than the 1.5% indicated during validation.

> pred <- predict(fitR, newx = x[-train, ], lambda = lambdaR, type = "class")

> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 65 27
        1  16 16

> error_rate = (1 - sum(pred == y[-train, ]) / 124)
> round(error_rate, 2)
[1] 0.35


PRACTITIONER TIP

Degradation in performance due to over-fitting can be surprisingly large. The key is to remember the primary goal of the validation sample is to provide a reliable indication of the expected error on the test sample and future, as yet unseen, samples. Throughout this book we have used the set.seed() method to help ensure replicability of the results. However, given the stochastic nature of the validation-test sample split, we should always expect variation in performance for different realisations. This suggests that evaluation should always involve multiple partitions of the data to form training, validation and test sets, as the sampling of data for a single partition might arbitrarily favour one classifier over another.
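A minimal sketch of this idea follows (not part of the original example). It assumes x and y have already been prepared as in Step 2, and simply repeats the split/fit/test cycle for several seeds to show how much the polynomial-kernel test error can move around.

# repeat the train/test split a few times and record the test error each time
errs <- sapply(1:5, function(s) {
  set.seed(s)
  tr  <- sample(1:nrow(x), 600, FALSE)
  f   <- svmpath(x[tr, ], y[tr, ], kernel.function = poly.kernel, trace = FALSE)
  lam <- min(f$lambda[which(f$Error == min(f$Error))])   # same selection rule as above
  p   <- predict(f, newx = x[-tr, ], lambda = lam, type = "class")
  mean(p != y[-tr, ])
})
round(errs, 2)   # spread of test error rates across the five partitions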

Let's now look at the polynomial kernel model.

> pred <- predict(fit, newx = x[-train, ], lambda = lambda, type = "class")

> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 76 13
        1   5 30

> error_rate = (1 - sum(pred == y[-train, ]) / 124)

> round(error_rate, 2)
[1] 0.15

It seems the error rate for this choice of kernel is around 15%. This is less than half the error of the radial basis kernel SVM.

Technique 18

Multicategory Classification with C-SVM

A C-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, ...)

Key parameters include kernel - the kernel function, cost - the cost parameter, the multicategory response variable y, and the covariates data.

Step 1: Load Required Packages

We build the C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> require(e1071)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remaining saved for testing. The variable train selects the random sample of 500 observations without replacement.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1.

> fit <- svm(Class ~ ., data = Vehicle[train, ])

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05555556

Number of Support Vectors:  376

 ( 122 64 118 72 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (C-classification), kernel, cost parameter and gamma model parameter. Notice the model estimates 376 support vectors, with 122 in the first class (bus) and 72 in the fourth class (van).

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4 and cost between 4 and 16. We store the results in obj.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = 2^(-2:2), cost = 2^(2:4))

Once again we can use the summary function to see the result.

> summary(obj)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.25   16

- best performance: 0.254

The method automatically performs a 10-fold cross validation. It appears the best model has a gamma of 0.25 and a cost parameter equal to 16, with a 25.4% misclassification rate.

PRACTITIONER TIP

Greater control over model tuning can be achieved using the tunecontrol argument inside of tune.svm. For example, to perform a 20-fold cross validation you would use:

tunecontrol = tune.control(sampling = "cross", cross = 20)

Your call using tune.svm would look something like this:

obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                gamma = 2^(-2:2), cost = 2^(2:4),
                tunecontrol = tune.control(sampling = "cross", cross = 20))

Visualization of the output of tune.svm using plot is often useful for fine tuning.

> plot(obj)

Figure 18.1 illustrates the resultant plot. To interpret the image, note that the darker the shading the better the fit of the model. Two things are noteworthy about this image. First, a larger cost parameter seems to indicate a better fit. Second, a gamma of 0.5 or less also indicates a better fitting model.

Using this information we re-tune the model, with gamma ranging from 0.01 to 0.5 and cost ranging from 16 to 256. Now the best performance occurs with gamma set to 0.03 and cost equal to 32.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = seq(0.01, 0.5, by = 0.01), cost = 2^(4:8))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.03   32

- best performance: 0.17


Figure 18.1: Tuning Multicategory Classification with C-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestC and bestGamma and refit the model.

> bestC <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
             cost = bestC, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method.

> print(fit)


Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

The fitted model now has 262 support vectors, down from 376 for the original model.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 81%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

 ( 100 32 94 36 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 81
Single Accuracies:
 84 86 88 76 82 78 76 84 82 74

It can be fun to visualize a two dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra for the x and y axis, whilst holding all the other variables in Vehicle at their median value. Figure 18.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
       svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
                    Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr),
                    Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 18.2: Multicategory Classification with C-SVM, two dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))


              Predicted Class
Observed Class bus opel saab van
          bus   87    2    0   5
          opel   0   55   28   2
          saab   0   20   68   0
          van    1    1    1  76

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of around 17% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.173

Technique 19

Multicategory Classification with nu-SVM

A nu-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-classification", ...)

Key parameters include kernel - the kernel function, type set to "nu-classification", the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 134.

We estimate the support vector machine using the default settings. This will use a radial basis function as the kernel with a nu parameter value set equal to 0.5.

> fit <- svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.05555556
         nu:  0.5

Number of Support Vectors:  403

 ( 110 88 107 98 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (nu-classification), kernel, nu parameter and gamma parameter. Notice the model estimates 403 support vectors, with 110 in the first class (bus) and 98 in the fourth class (van). The total number of support vectors is slightly higher than estimated in the C-SVM discussed on page 134.

We use the tune.svm method to identify the best model for gamma and nu ranging between 0.05 and 0.45. We store the results in obj.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.05, 0.45, by = 0.1),
                  nu = seq(0.05, 0.45, by = 0.1))

We can use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.05 0.05

- best performance: 0.178


The method automatically performs a 10-fold cross validation. We see that the best model has a gamma and nu equal to 0.05, with a 17.8% misclassification rate. Visualization of the output of tune.svm can be achieved using the plot function.

> plot(obj)

Figure 19.1 illustrates the resultant plot. To interpret the image, note that the darker the shading the better the fit of the model. It seems that a smaller nu and gamma lead to a better fit of the training data.

Using this information we re-tune the model with gamma and nu ranging from 0.01 to 0.05. The best performance occurs with both parameters set to 0.02 and an overall misclassification error rate of 16.4%.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.01, 0.05, by = 0.01),
                  nu = seq(0.01, 0.05, by = 0.01))

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.02 0.02

- best performance: 0.164


Figure 19.1: Tuning Multicategory Classification with nu-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestNU and bestGamma and then refit the model.

> bestNU <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
             type = "nu-classification",
             nu = bestNU, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method.


> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

The fitted model now has 204 support vectors, less than half the number required for the model we initially built.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 75.6%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

 ( 71 29 72 32 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 75.6
Single Accuracies:
 82 80 84 80 72 64 74 76 76 68

Next we visualize a two dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra for the x and y axis, whilst holding all the other variables in Vehicle at their median value. Figure 19.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
       svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
                    Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr),
                    Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 19.2: Multicategory Classification with nu-SVM, two dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))


              Predicted Class
Observed Class bus opel saab van
          bus   87    1    1   5
          opel   0   41   42   2
          saab   0   30   58   0
          van    1    1    2  75

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 24.6% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.246

Technique 20

Bound-constraint C-SVM Classification

A bound-constraint C-SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "C-bsvc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "C-bsvc", kpar which contains kernel parameter values, the multicategory response variable y, and the covariates data.

Step 1: Load Required Packages

We build the bound-constraint C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23. The scatterplot3d package will be used to create a three dimensional scatter plot.

> library(kernlab)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remaining saved for testing. The variable train selects the random sample of 500 observations without replacement.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = rbfdot) and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: C-bsvc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 375

Objective Function Value : -49.1905 -53.3552 -48.0924 -193.2145 -48.9507 -59.7368
Training error : 0.164
Cross validation error : 0.248

The function provides details of the type of support vector machine (C-bsvc), model cost parameter and kernel parameter (C = 1, sigma = 0.05). Notice the model estimates 375 support vectors, with a training error of 16.4% and cross validation error of 24.8%. In this case we observe a relatively large difference between the training error and cross validation error. In practice the cross validation error is often a better indicator of the expected performance on the test sample.

Since the output of ksvm (in our example, values stored in fit) is an S4 object, both errors can be accessed directly using "@", although the "preferred" approach is to use an accessor function such as cross(fit) and error(fit). If you don't know the accessor function names you can always use attributes(fit) and access the required parameter using @ (for S4 objects) or $ (for S3 objects).

> fit@error
[1] 0.164

> fit@cross
[1] 0.248
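For completeness, here is a two-line sketch of the accessor route mentioned above; it assumes the kernlab accessor generics are available in the current session and should return the same values as the slots.

error(fit)   # training error, should match fit@error
cross(fit)   # cross validation error, should match fit@cross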

PRACTITIONER TIP

The package kernlab supports a wide range of kernels. Here are nine popular choices:

• rbfdot - Radial Basis kernel function

• polydot - Polynomial kernel function

• vanilladot - Linear kernel function

• tanhdot - Hyperbolic tangent kernel function

• laplacedot - Laplacian kernel function

• besseldot - Bessel kernel function

• anovadot - ANOVA RBF kernel function

• splinedot - Spline kernel

• stringdot - String kernel

We need to tune the model to obtain the optimum parameters. Let's create a few lines of R code to do this for us. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Check to see if it contains the value 35 (it should).

> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0
> runs
[1] 35

The results in terms of cross validation error, cost and sigma are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
                  type = "C-bsvc", C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1

      i = i + 1
    } # end sigma loop

    i = 1
    j = j + 1
  } # end cost loop


Notice we set cross = 45 to perform leave one out validation.

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2358586
2    4   0.2 0.2641414
3    4   0.3 0.2744108
4    4   0.4 0.2973064

Now let's find the best cross validation performance and its associated row number.

> with(results, error[error == min(error)])
[1] 0.2074074

> which(results$error == min(results$error))
[1] 11

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]

> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "sigma", "error")
> best_result

     cost sigma     error
[1,]   16   0.1 0.2074074

So we see the optimal results occur for cost = 16, sigma = 0.1, with a cross validation error of 20.7%.

Figure 20.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 20.1: Bound-constraint C-SVM tuning 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.5%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
              type = "C-bsvc", C = fit_cost,
              kernel = "rbfdot", kpar = list(sigma = fit_sigma),
              cross = 45)

> fit@cross
[1] 0.2457912


Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    2   3
          opel   1   31   50   3
          saab   1   24   61   2
          van    1    0    3  75

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 26.3% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.263

Technique 21

Weston - Watkins Multi-Class SVM

A bound-constraint Weston - Watkins SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "kbb-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "kbb-svc", kpar which contains kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Weston - Watkins support vector machine using a radial basis kernel (kernel = rbfdot) and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: kbb-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 356

Objective Function Value : 0
Training error : 0.148
Cross validation error : 0.278

The function provides details of the type of support vector machine (kbb-svc), model cost parameter, kernel type and parameter (sigma = 0.05), number of support vectors (356), training error of 14.8% and cross validation error of 27.8%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.148

> fit@cross
[1] 0.278

Let's see if we can tune the model using our own grid search. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.

> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results in terms of cross validation error, cost and sigma are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
                  type = "kbb-svc", C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross

      count.sigma = count.sigma + 1
      count = count + 1

      i = i + 1
    } # end sigma loop

    i = 1
    j = j + 1
  } # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek; you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2356902
2    4   0.2 0.2639731
3    4   0.3 0.3228956
4    4   0.4 0.2973064
5    4   0.5 0.2885522

Now let's find the best cross validation performance and its associated row number.

> with(results, error[error == min(error)])
[1] 0.2122896

> which(results$error == min(results$error))
[1] 21

So the optimal cross validation error is 21.2% and located in row 21 of results.

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   64   0.1 0.2122896

Figure 21.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")

Figure 21.1: Weston - Watkins Multi-Class SVM tuning 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 27.6%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
              type = "kbb-svc", C = fit_cost,
              kernel = "rbfdot", kpar = list(sigma = fit_sigma),
              cross = 45)

> fit@cross
[1] 0.2762626

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    0   5
          opel   1   37   43   4
          saab   1   33   53   1
          van    1    1    2  75

The error (misclassification) rate can be calculated:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.269

Overall the model achieves an error rate of 26.9% on the test sample.

Technique 22

Crammer - Singer Multi-Class SVM

A Crammer - Singer SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "spoc-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "spoc-svc", kpar which contains kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Crammer - Singer Multi-Class support vector machine using a radial basis kernel (kernel = rbfdot) and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: spoc-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 340

Objective Function Value : 0
Training error : 0.152
Cross validation error : 0.244

The function provides details of the type of support vector machine (spoc-svc), model cost parameter and kernel parameter (sigma = 0.05), number of support vectors (340), training error of 15.2% and cross validation error of 24.4%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.152

> fit@cross
[1] 0.244

We need to tune the model. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.

> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results in terms of cross validation error, cost and sigma are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ],
                  type = "spoc-svc", C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross

      count.sigma = count.sigma + 1
      count = count + 1

      i = i + 1
    } # end sigma loop

    i = 1
    j = j + 1
  } # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2299663
2    4   0.2 0.2378788
3    4   0.3 0.2582492
4    4   0.4 0.2730640

Now let's find the best cross validation performance and its associated row number. The optimal cross validation error is 20.8% and located in row 11 of results.

> with(results, error[error == min(error)])
[1] 0.2075758

> which(results$error == min(results$error))
[1] 11

We save the optimal values in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   16   0.1 0.2075758

Figure 22.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")

Figure 22.1: Crammer - Singer Multi-Class SVM tuning 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.8%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ],
              type = "spoc-svc", C = fit_cost,
              kernel = "rbfdot", kpar = list(sigma = fit_sigma),
              cross = 45)

> fit@cross
[1] 0.2478114

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    0    0   6
          opel   6   43   30   6
          saab   3   29   52   4
          van    1    0    1  77

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 24.9% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.249

Support Vector Regression

171

Technique 23

SVM eps-Regression

An eps-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, type = "eps-regression", epsilon, gamma, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and epsilon, type set to "eps-regression", the continuous response variable y, and the covariates data.
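Before fitting the model it helps to recall what epsilon controls. The sketch below is not from the book; it simply evaluates the epsilon-insensitive loss that underlies eps-regression, with made-up numbers: residuals inside the epsilon tube cost nothing, while larger residuals are penalized linearly.

# epsilon-insensitive loss: max(0, |y - yhat| - epsilon)
eps_loss <- function(y, yhat, epsilon = 0.1) pmax(0, abs(y - yhat) - epsilon)

eps_loss(y = c(10.15, 10.5, 12.0), yhat = rep(10.2, 3), epsilon = 0.1)
# -> 0.0 0.2 1.7  (the first point lies inside the epsilon tube, so no loss)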

Step 1: Load Required Packages

We build the support vector machine eps-regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

> library(e1071)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remaining saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
    epsilon:  0.1

Number of Support Vectors:  33

The function provides details of the type of support vector machine (eps-regression), cost parameter, kernel type and associated gamma and epsilon parameters, and the number of support vectors, in this case 33.

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and epsilon between 0.01 and 0.9. The results are stored in obj.

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-regression",
                  gamma = 2^(-2:2), cost = 2^(2:4),
                  epsilon = seq(0.01, 0.9, by = 0.01))

Once again we can use the summary function to see the result.


> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost epsilon
  0.25    8    0.01

- best performance: 26.58699

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, cost parameter equal to 8 and epsilon of 0.01, with a cross validation mean squared error of 26.6. These metrics can also be accessed by calling $best.performance and $best.parameters.

> obj$best.performance
[1] 26.58699

> obj$best.parameters
  gamma cost epsilon
6  0.25    8    0.01

We use these optimum parameters to refit the model, using leave one out cross validation as follows:

> bestEpi <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
             type = "eps-regression",
             epsilon = bestEpi, gamma = bestGamma, cost = bestCost,
             cross = 45)

The fitted model now has 44 support vectors.

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

The summary method provides additional details. It also reports the mean square error for each cross validation (not shown here).

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

45-fold cross-validation on training data:

Total Mean Squared Error: 28.0095
Squared Correlation Coefficient: 0.7877644


Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 23.1, is obtained using the plot function.

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712

Figure 23.1: Predicted and observed values using SVM eps-regression for bodyfat

Technique 24

SVM nu-Regression

A nu-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, type = "nu-regression", nu, gamma, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and nu, type set to "nu-regression", the continuous response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 172.

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression")

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
         nu:  0.5

Number of Support Vectors:  33

The function provides details of the type of support vector machine (nu-regression), cost parameter, kernel type and associated gamma and nu parameters, and the number of support vectors, in this case 33, which happens to be the same as estimated using the SVM eps-regression (see page 172).

We use the tune.svm method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and nu between 0.1 and 0.9. The results are stored in obj.

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "nu-regression",
                  gamma = 2^(-2:2), cost = 2^(2:4),
                  nu = seq(0.1, 0.9, by = 0.1))

We use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost  nu
  0.25    8 0.9

- best performance: 26.42275

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, cost parameter equal to 8 and nu equal to 0.9, with a cross validation mean squared error of 26.4. These metrics can also be accessed by calling $best.performance and $best.parameters.

> obj$best.performance
[1] 26.42275

> obj$best.parameters
    gamma cost  nu
126  0.25    8 0.9

We use these optimum parameters to refit the model, using leave one out cross validation as follows:

> bestNU <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
             type = "nu-regression",
             nu = bestNU, gamma = bestGamma, cost = bestCost,
             cross = 45)

The fitted model now has 45 support vectors.

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

The summary method provides additional details. It also reports the mean square error for each cross validation (not shown here).

> summary(fit)


Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

45-fold cross-validation on training data:

Total Mean Squared Error: 27.98642
Squared Correlation Coefficient: 0.7871421

Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 24.1, is obtained using the plot function:
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712


Figure 24.1: Predicted and observed values using SVM nu-regression for bodyfat

182

Technique 25

Bound-constraint SVM eps-Regression

A bound-constraint SVM eps-regression can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "eps-bsvr", kpar, ...)

Key parameters include kernel - the kernel function; type set to "eps-bsvr"; kpar, which contains the kernel parameter values; the continuous response variable y; and the covariates data.
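For orientation, a minimal call might look like the following sketch; the simulated data set and the sigma, C and epsilon values are purely illustrative:

library(kernlab)

# Small simulated regression data set (purely illustrative)
set.seed(42)
mydata <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
mydata$y <- 2 * mydata$x1 - mydata$x2 + rnorm(50, sd = 0.3)

# Bound-constraint eps-regression with a Gaussian kernel;
# sigma, C and epsilon values are illustrative
fit <- ksvm(y ~ ., data = mydata,
            type = "eps-bsvr",
            kernel = "rbfdot", kpar = list(sigma = 0.1),
            C = 1, epsilon = 0.1, cross = 5)
fit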

Step 1: Load Required Packages
We build the model using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.
> library(kernlab)
> data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.
> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model
We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
              type = "eps-bsvr",
              kernel = "rbfdot", kpar = list(sigma = 0.05),
              cross = 10)

The print function provides details of the estimated model:
> print(fit)
Support Vector Machine object of class "ksvm"

SV type: eps-bsvr  (regression)
Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 35

Objective Function Value : -77.105
Training error : 0.096254
Cross validation error : 18.77613

The function provides details of the type of support vector machine (eps-bsvr) and kernel parameter (sigma = 0.05). Notice the model estimates 35 support vectors, with a training error of 9.6% and cross validation error of 18.7. Since the output of ksvm (in this case values are stored in fit) is an S4 object, both errors and the model parameters can be accessed directly using "@":
> fit@param$C
[1] 1

> fit@param$epsilon
[1] 0.1

> fit@error
[1] 0.09625418

> fit@cross
[1] 18.77613


We need to tune the model to obtain the optimum parameters. Let's create a small bit of code to do this for us. First we set up the ranges for the sigma and epsilon parameters (the cost parameter is left at its default):
> epsilon <- seq(0.01, 0.1, by = 0.01)
> sigma   <- seq(0.1, 1, by = 0.01)

> n_epsilon <- length(epsilon)
> n_sigma   <- length(sigma)

The total number of models to be estimated is stored in runs. Since it contains 910 models it may take a little while to run:
> runs <- n_sigma * n_epsilon
> runs
[1] 910

The results, in terms of cross validation error, cost, sigma and epsilon, are stored in results:
> results <- 1:(4 * runs)
> dim(results) <- c(runs, 4)
> colnames(results) <- c("cost", "sigma", "epsilon",
                         "error")

The loop variables are as follows:
> count.epsilon <- 0
> count.sigma   <- 0

> i = 1
> j = 1
> k = 1
> count = 1

The main loop for tuning is as follows:
> for (val in epsilon) {
    for (val in sigma) {
      fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-bsvr",
                  epsilon = epsilon[k],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@param$epsilon
      results[count, 4] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    } # end sigma loop
    i = 1
    k = k + 1
    cat("iteration (%) = ", round((count / runs) * 100, 0), "\n")
  } # end epsilon loop
> i = 1
> k = 1
> j = j + 1

Notice we set cross = 45 to perform leave one out cross validation.

When you execute the above code, as it is running you should see output along the lines of:
iteration (%) =  10
iteration (%) =  20
iteration (%) =  30

We turn results into a data frame using the as.data.frame method:
> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:
> results

  cost sigma epsilon    error
1    1  0.10    0.01 22.15108
2    1  0.11    0.01 22.76689
3    1  0.12    0.01 23.24192
4    1  0.13    0.01 23.78274
5    1  0.14    0.01 24.35676

Now let's find the best cross validation performance (21.8) and its associated row number (274):
> with(results, error[error == min(error)])
[1] 21.86181

> which(results$error == min(results$error))
[1] 274

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:
> best_per_row <- which(results$error == min(results$error))

> fit_cost    <- results[best_per_row, 1]
> fit_sigma   <- results[best_per_row, 2]
> fit_epsilon <- results[best_per_row, 3]
> fit_xerror  <- results[best_per_row, 4]

> best_result <- cbind(fit_cost, fit_epsilon, fit_sigma,
                       fit_xerror)

> colnames(best_result) <- c("cost", "epsilon", "sigma",
                             "error")

> best_result
     cost epsilon sigma    error
[1,]    1    0.04   0.1 21.86181

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 21.8:
> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
              type = "eps-bsvr",
              C = fit_cost, epsilon = fit_epsilon,
              kernel = "rbfdot", kpar = list(sigma = fit_sigma),
              cross = 45)

> fit@cross
[1] 21.86834

Step 4: Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 25.1, is visualized using the plot method combined with the abline method to show the linear regression line:
> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

The squared correlation between the test sample predicted and observed values is 0.834:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.834


Figure 25.1: Bound-constraint SVM eps-regression observed and fitted values using bodyfat

189

Support Vector Novelty Detection

190

Technique 26

One-Classification SVM

A One-Classification SVM can be estimated using the package e1071 with the svm function:

svm(x, kernel, type = "one-classification", ...)

Key parameters: x, the attributes you use to train the model; type = "one-classification"; and kernel - the kernel function.
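As a rough sketch of the idea, one-classification learns the support of a single class and flags anything that falls outside it as a novelty. The simulated data and the nu and gamma values below are purely illustrative:

library(e1071)

# Simulated "normal" observations only: the model learns their support
set.seed(42)
x_train <- matrix(rnorm(100 * 2), ncol = 2)

# nu and gamma values are illustrative
fit <- svm(x_train, y = NULL,
           type = "one-classification",
           kernel = "radial",
           nu = 0.05, gamma = 0.1)

# Predict: TRUE = looks like the training class, FALSE = novelty
x_new <- rbind(c(0, 0), c(5, 5))
predict(fit, x_new)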

Step 1: Load Required Packages
We build the one-classification SVM using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.
> library(e1071)
> library(caret)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters
To begin we choose the bus category as our one-classification class, and create the variable isbus to store TRUE/FALSE values of vehicle type:
> set.seed(103)
> Vehicle$isbus[Vehicle$Class == "bus"] <- TRUE
> Vehicle$isbus[Vehicle$Class != "bus"] <- FALSE

As a reminder of the distribution of Vehicle we use the table method


> table(Vehicle$Class)

 bus opel saab  van
 218  212  217  199

As a check, note that you should observe 218 observations corresponding to bus in isbus = TRUE:
> table(Vehicle$isbus)

FALSE  TRUE
  628   218

Our next step is to create the variables isbusTrue and isbusFalse to hold positive (bus) and negative (non-bus) observations respectively:
> isbusTrue  <- subset(Vehicle, Vehicle$isbus == TRUE)
> isbusFalse <- subset(Vehicle, Vehicle$isbus == FALSE)

Next, from the subset of 218 positive values, we sample 100 at random for training. The remaining 118 positive observations will form part of the test sample:
> train <- sample(1:218, 100, FALSE)

> trainAttributes <- isbusTrue[train, 1:18]
> trainLabels     <- isbusTrue[train, 19]

Now as a quick check type trainLabels and check that they are all bus

Step 3: Estimate & Evaluate Model
We need to tune the model. We begin by setting up the tuning parameters. The variable runs contains the total number of models we will estimate during the tuning process:
> gamma <- seq(0.01, 0.05, by = 0.01)
> nu    <- seq(0.01, 0.05, by = 0.01)

> n_gam <- length(gamma)
> n_nu  <- length(nu)

> runs <- n_gam * n_nu

> count.gamma <- 0
> count.nu    <- 0


Next set up the variable to store the results and the loop count variables - i, k and count:
> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("gamma", "nu", "Performance")

> i = 1
> k = 1
> count = 1

The main tuning loop is created as follows (notice we estimate the model using a 10-fold cross validation (cross = 10) and store the results in fit):

> for (val in nu) {
    for (val in gamma) {
      print(gamma[i])
      print(nu[k])

      fit <- svm(trainAttributes, y = NULL,
                 type = "one-classification",
                 nu = nu[k],
                 gamma = gamma[i],
                 cross = 10)

      results[count, 2] = fit$gamma
      results[count, 1] = fit$nu
      results[count, 3] = fit$tot.accuracy
      count.gamma = count.gamma + 1
      count = count + 1
      i = i + 1
    } # end gamma loop
    i = 1
    k = k + 1
  } # end nu loop

We turn results into a data frame using the as.data.frame method:
> results <- as.data.frame(results)

Take a peek and you should see something like this:
> results

  gamma   nu Performance
1  0.01 0.01          94
2  0.01 0.02          88
3  0.01 0.03          87
4  0.01 0.04          81

Now let's find the best cross validation performance:
> with(results, Performance[Performance == max(Performance)])
[1] 94 94

Notice the best performance contains two values, both at 94. Since this is a very high value, it is possibly the result of over fitting - a consequence of over tuning a model. The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:
> best_per_row <- which(results$Performance == max(results$Performance))

> fit_gamma <- results[best_per_row, 1]
> fit_nu    <- results[best_per_row, 2]
> fit_per   <- results[best_per_row, 3]

> best_result <- cbind(fit_gamma, fit_nu, fit_per)
> colnames(best_result) <- c("gamma", "nu", "Performance")

> best_result
     gamma   nu Performance
[1,]  0.01 0.01          94
[2,]  0.02 0.01          94

The relationship between the parameters and performance is visualized using the scatterplot3d function and shown in Figure 26.1:
> scatterplot3d(results$nu, results$gamma, results$Performance,
                xlab = "nu", ylab = "gamma",
                zlab = "Performance")


Figure 26.1: Relationship between tuning parameters

Let's fit the optimum model using a leave one out cross validation:
> fit <- svm(trainAttributes, y = NULL,
             type = "one-classification",
             nu = fit_nu, gamma = fit_gamma,
             cross = 100)

Step 4: Make Predictions
To illustrate the performance of the model on the test sample we will use the confusionMatrix method from the caret package. The first step is to gather together the required information for the training and test sample, using the predict method to make the forecasts:
> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels     <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg   <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels     <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest  <- predict(fit, testpredictors)

Next we create two tables: confTrain containing the training predictions and confTest for the test sample predictions:
> confTrain <- table(Predicted = svm.predtrain,
                     Reference = trainLabels)
> confTest  <- table(Predicted = svm.predtest,
                     Reference = testLabels)

Now we call the confusionMatrix method for the test sample:
> confusionMatrix(confTest, positive = "TRUE")

Confusion Matrix and Statistics

          Reference
Predicted  FALSE TRUE
     FALSE   175    5
     TRUE    453  113

               Accuracy : 0.3861
                 95% CI : (0.351, 0.4221)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 1

                  Kappa : 0.093
 Mcnemar's Test P-Value : < 2e-16

            Sensitivity : 0.9576
            Specificity : 0.2787
         Pos Pred Value : 0.1996
         Neg Pred Value : 0.9722
             Prevalence : 0.1582
         Detection Rate : 0.1515
   Detection Prevalence : 0.7587
      Balanced Accuracy : 0.6181

       'Positive' Class : TRUE

The method produces a range of test statistics, including an accuracy rate of only 38% and a kappa of 0.093. Maybe we over-tuned the model a little. In this example we can see clearly that the consequence of over fitting in training is poor generalization in the test sample.

For illustrative purposes we re-estimate the model, this time with nu = 0.05 and gamma = 0.05:
> fit <- svm(trainAttributes, y = NULL,
             type = "one-classification",
             nu = 0.05, gamma = 0.05,
             cross = 100)

A little bit of housekeeping, as discussed previously:
> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels     <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg   <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels     <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest  <- predict(fit, testpredictors)

> confTrain <- table(Predicted = svm.predtrain,
                     Reference = trainLabels)

> confTest <- table(Predicted = svm.predtest,
                    Reference = testLabels)

Now we are ready to pass the necessary information to the confusionMatrix method:
> confusionMatrix(confTest, positive = "TRUE")

Confusion Matrix and Statistics

          Reference
Predicted  FALSE TRUE
     FALSE   568   29
     TRUE     60   89

               Accuracy : 0.8807
                 95% CI : (0.8552, 0.9031)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 0.001575

                  Kappa : 0.5952
 Mcnemar's Test P-Value : 0.001473

            Sensitivity : 0.7542
            Specificity : 0.9045
         Pos Pred Value : 0.5973
         Neg Pred Value : 0.9514
             Prevalence : 0.1582
         Detection Rate : 0.1193
   Detection Prevalence : 0.1997
      Balanced Accuracy : 0.8293

       'Positive' Class : TRUE

Now the accuracy rate is 0.88 with a kappa of around 0.6. An important takeaway is that support vector machines, as with many predictive analytic methods, can be very sensitive to over fitting.

198

NOTES

Notes35See the paper by Cortes C Vapnik V (1995) Support-Vector Networks Machine

Learning 20 273ndash29736Note that classification tasks based on drawing separating lines to distinguish between

objects of different class memberships are known as hyperplane classifiers The SVM iswell suited for such tasks

37The nu parameter in nu-SVM38For further details see Hoehler FK 2000 Bias and prevalence effects on kappa

viewed in terms of sensitivity and specificity J Clin Epidemiol 53 499ndash50339Gomes A L et al Classification of dengue fever patients based on gene expression

data using support vector machines PloS one 56 (2010) e1126740Huang Wei Yoshiteru Nakamori and Shou-Yang Wang Forecasting stock market

movement direction with support vector machine Computers amp Operations Research3210 (2005) 2513-2522

41Min Jae H and Young-Chan Lee Bankruptcy prediction using support vector ma-chine with optimal choice of kernel function parameters Expert systems with applications284 (2005) 603-614

42Upstill-Goddard Rosanna et al Support vector machine classifier for estrogen re-ceptor positive and negative early-onset breast cancer (2013) e68606

43Tehrany Mahyat Shafapour et al Flood susceptibility assessment using GIS-basedsupport vector machine model with different kernel types Catena 125 (2015) 91-101

44Calculated by the author using the average of all four support vector machines45Radhika K R M K Venkatesha and G N Sekhar Off-line signature authenti-

cation based on moment invariants using support vector machine Journal of ComputerScience 63 (2010) 305

46Guo Shuyu Robyn M Lucas and Anne-Louise Ponsonby A novel approach forprediction of vitamin d status using support vector regression PloS one 811 (2013)

47Latitude ambient ultraviolet radiation levels ambient temperature hours in the sun6 weeks before the blood draw (log transformed to improve the linear fit) frequency ofwearing shorts in the last summer physical activity (three levels mild moderate vigor-ous) sex hip circumference height left back shoulder melanin density buttock melanindensity and inner upper arm melanin density

48For further details see either of the following1 Cawley GC Leave-one-out cross-validation based model selection criteria for

weighted LS-SVMs In International Joint Conference on Neural Networks IEEE2006 p 1661ndash1668

2 Vapnik V Chapelle O Bounds on error expectation for support vector machinesNeural computation 200012(9)2013ndash2036

3 Muller KR Mika S Ratsch G Tsuda K Scholkopf B An introduction to kernel-based learning algorithms IEEE Transactions on Neural Networks

49See1 Bach FR Lanckriet GR Jordan MI Multiple kernel learning conic duality and

the SMO algorithm In Proceedings of the twenty-first international conference onMachine learning ACM 2004 p 6ndash13

2 Zien A Ong CS Multiclass multiple kernel learning In Proceedings of the 24thinternational conference on Machine learning ACM 2007 p 1191ndash1198

199

92 Applied Predictive Modeling Techniques in R

50Thadani K Jayaraman V Sundararajan V Evolutionary selection of kernels in supportvector machines In International Conference on Advanced Computing and Communica-tions IEEE 2006 p 19ndash24

51See for example

1 Lin Shih-Wei et al Particle swarm optimization for parameter determination andfeature selection of support vector machines Expert systems with applications 354(2008) 1817-1824

2 Melgani Farid and Yakoub Bazi Classification of electrocardiogram signals withsupport vector machines and particle swarm optimization Information Technologyin Biomedicine IEEE Transactions on 125 (2008) 667-677

52For further discussion on the issues surrounding over fitting see G C Cawley and NL C Talbot Preventing over-fitting in model selection via Bayesian regularization of thehyper-parameters Journal of Machine Learning Research volume 8 pages 841-861 April2007

200

Part III

Relevance Vector Machine

201

The Basic Idea

The relevance vector machine (RVM) shares its functional form with the support vector machine (SVM) discussed in Part II. RVMs exploit a probabilistic Bayesian learning framework.

We begin with a data set of N training pairs {x_i, y_i}, where x_i is the input feature vector and y_i is the target output. The RVM makes predictions using

y_i = \mathbf{w}^{T}\mathbf{K} + \varepsilon   (26.1)

where \mathbf{w} = [w_1, \ldots, w_N] is the vector of weights, \mathbf{K} = [k(x_i, x_1), \ldots, k(x_i, x_N)]^{T} is the vector of kernel functions, and \varepsilon is the error, which for algorithmic simplicity is assumed to be zero-mean, independently identically distributed Gaussian with variance \sigma^2. Therefore the prediction y_i consists of the target output polluted by Gaussian noise.

The Gaussian likelihood of the data is given by

p(y \mid \mathbf{w}, \sigma^2) = (2\pi)^{-N/2} \sigma^{-N} \exp\left(-\frac{1}{2\sigma^2}\,\|y - \Phi w\|^2\right)   (26.2)

where y = [y_1, \ldots, y_N], w = [w_0, w_1, \ldots, w_N], and \Phi is an N \times (N + 1) matrix with \Phi_{ij} = k(x_i, x_{j-1}) and \Phi_{i1} = 1.

A standard approach to estimate the parameters in equation (26.2) is to use a zero-mean Gaussian to introduce an individual hyperparameter \alpha_i on each of the weights w_i, so that

w_i \sim N(0, \alpha_i^{-1})   (26.3)

where \alpha = [\alpha_1, \ldots, \alpha_N]. The prior and posterior probability distributions are then easily derived.53

The RVM uses far fewer kernel functions than the SVM. It also yields probabilistic predictions, automatic estimation of parameters, and the ability to use arbitrary kernel functions. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning basis functions which are relevant for making good predictions.

NOTE

RVM requires the inversion of N × N matrices. Since it takes O(N³) operations to invert an N × N matrix, it can quickly become computationally expensive, and therefore slow, as the sample size increases.
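To make the functional form in equation (26.1) concrete, here is a small illustrative R sketch of an RVM-style prediction: a weighted sum of Gaussian kernel evaluations against the training inputs. The data, kernel width and weights below are made up for illustration; in a genuinely fitted RVM the weights would be estimated and most would be exactly zero:

# Gaussian (RBF) kernel: k(x, z) = exp(-sigma * ||x - z||^2)
rbf_kernel <- function(x, z, sigma = 0.5) {
  exp(-sigma * sum((x - z)^2))
}

# RVM-style prediction: y_hat = sum_j w_j * k(x_new, x_j)
rvm_predict <- function(x_new, X_train, w, sigma = 0.5) {
  K <- apply(X_train, 1, function(xj) rbf_kernel(x_new, xj, sigma))
  sum(w * K)
}

# Toy example: 5 training points in 2 dimensions, illustrative weights.
set.seed(1)
X_train <- matrix(rnorm(10), nrow = 5, ncol = 2)
w <- c(0.7, 0, 0, -1.2, 0)        # sparse weight vector
x_new <- c(0.1, -0.3)
rvm_predict(x_new, X_train, w)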


Practical Applications

13 PRACTITIONER TIP

The relevance vector machine is often assessed using the root mean squared error (RMSE) or the Nash-Sutcliffe efficiency (NS). The larger the value of NS, the smaller the value of RMSE.

RMSE = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \hat{x}_i)^2}{N}}   (26.4)

NS = 1 - \frac{\sum_{i=1}^{N}(x_i - \hat{x}_i)^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}   (26.5)

\hat{x}_i is the predicted value, \bar{x} is the average of the sample, and N is the number of observations.
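Both metrics are straightforward to compute directly. A minimal sketch, assuming observed holds the observed values and predicted the corresponding model predictions:

# Root mean squared error: equation (26.4)
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Nash-Sutcliffe efficiency: equation (26.5)
nash_sutcliffe <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

# Toy usage with made-up numbers
obs  <- c(10, 12, 9, 14, 11)
pred <- c(11, 12, 10, 13, 10)
rmse(obs, pred)
nash_sutcliffe(obs, pred)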

Oil Sand Pump Prognostics
In wet mineral processing operations, slurry pumps deliver a mixture of bitumen, sand and small pieces of rock from one site to another. Due to the harsh environment in which they operate, these pumps can fail suddenly. The consequent downtime can lead to a large economic cost for the mining company due to the interruption of the mineral processing operations. Hu and Tse54 combine a RVM with an exponential model to predict the remaining useful life (RUL) of slurry pumps.

Data were collected from the inlet and outlet of slurry pumps operating in an oil sand mine, using four accelerometers placed at key locations on the pump. In total the pump was subjected to 904 measurement hours.

Two data-sets were collected from different positions (Site 1 and Site 2) in the same pump, and an alternative model of the pump degradation was developed using the sum of two exponential functions. The overall performance results are shown in Table 13.

Site 1 RVM   Site 1 Exponential Model   Site 2 RVM   Site 2 Exponential Model
70.51        25.31                      28.85        7.05

Table 13: Hu and Tse's weighted average accuracy of prediction

Precision AgriculturePrecision agriculture involves using aircraft or spacecraft to gather high-resolution information on crops to better manage the growing and harvestingcycle Chlorophyll concentration measured in mass per unit leaf area (μgcmminus2) is an important biophysical measurement retrievable from air or spacereflectance data Elarab et al55 use a RVM to estimate spatially distributedchlorophyll concentrations

Data was gathered from three aircraft flights during the growing season(early growth mid growth and early flowering) The data set consisted of thedependent variable (Chlorophyll concentration) and 8 attribute or explana-tory variables All attributes were used to train and test the model

The researchers used six different kernels (Gauss Laplace spline Cauchythin plate spline (tps) and bubble) alongside a 5-fold cross validation Avariety of kernel widths ranging from 10-5 to 105 were also used The rootmean square error and Nash-Sutcliffe efficiency were used to assess model fitThe best fitting trained model used a tps kernel and had a root mean squareerror of 531 μg cmminus2 and Nash-Sutcliffe efficiency of 076

The test data consisted of a sample gathered from an unseen (by theRVM) flight The model using this data had a root mean square error of852 μg cmminus2 and Nash-Sutcliffe efficiency of 071 The researchers concludeldquoThis result showed that the [RVM] model successfully performed when givenunseen datardquo

Deception Detecting in SpeechZhou56 develop a technique to detect deception in speech by extracting speechdynamics (prosodic and non-linear features) and applying a RVM

206

The data set consisted of recorded interviews of 16 male and 16 femaleparticipants A total of 640 deceptive samples and 640 non-deceptive sampleswere used in the analysis

Speech dynamics such as pitch frequency short-term vocal energy andmei-frequency cepstrum coefficients57 were used as input attribute features

Classification accuracy of the RVM was assessed relative to a supportvector machine and a neural network for various training sample sizes Forexample with a training sample of 400 the RVM correctly classifies 7037of male voices whilst the support vector machine and neural network correctlyclassify 6814 and 4213 respectively

The researchers observe that a combination of prosodic and non-linearfeatures modeled using a RVM is effective for detecting deceptive speech

Diesel Engine PerformanceThe mathematical form of diesel engines is highly nonlinear Because of thisthey are often modeled using an artificial neural network (ANN) Wong etal58 perform an experiment to assess the ability of an ANN and a RVM topredict diesel engine performance

Three inputs are used in the models (engine speed load and cooling watertemperature) and the output variables are brake specific fuel consumptionand exhaust emissions such as nitrogen oxide and carbon dioxide

A water-cooled 4-cylinder direct-injection diesel engine was used to gen-erate data for the experiment Data was recorded at five engine speeds (12001400 1600 1800 and 2000 rpm) with engine torques (28 70 140 210 and252 Nm) Each test was carried out three times and the average values wereused in the analysis In all 22 sets of data were collected with 18 used as thetraining data and 4 to assess model performance

The ANN had a single layer with twenty hidden neurons and a Hyperbolictangent sigmoid transfer function

For the training data the researchers report an average root mean squareerror (RMSE) of 327 for the RVM and 4159 for the ANN The average RMSEfor the test set was 1773 and 3856 for the RVM and ANN respectively Theresearchers also note that the average R2 for the test set for the RVM was0929 and 0707 for the ANN

Gas Layer Classification
Zhao et al.59 consider the issue of optimal parameter selection for a RVM and identification of gas at various drill depths. To optimize the parameters of a RVM they use the particle swarm algorithm.60

The sample consists of 201 wells drilled at various depths, with 63 gas producing wells and 138 non-gas producing wells. A total of 12 logging attributes are used as features in the RVM model. A prediction accuracy of 93.53% is obtained during training using all the attributes; the prediction accuracy was somewhat lower, at 91.75%, for the test set.

Credit Risk AssessmentTong and Li61 assess the use of RVM and a support vector machine (SVM)in the assessment of company credit risk The data consist of financial char-acteristics of 464 Chinese firms listed on Chinarsquos securities markets A totalof 116 of the 464 firms had experienced a serious credit event and were codedas ldquo0rdquo by the researchers The remaining firms were coded ldquo1rdquo

The attribute vector consisted of 25 financial ratios drawn from seven basiccategories (cash flow return on equity earning capacity operating capacitygrowth capacity short term solvency and long term solvency) Since many ofthese financial ratios have a high correlation with each other the researchersused principal components (PCA) and isomaps to reduce the dimensional-ity The models they then compared were a PCA-RVM Isomap-SVM andIsomap-RVM A summary of the results is reported in Table 14

Model         Accuracy
Isomap-RVM    90.28
Isomap-SVM    86.11
PCA-RVM       89.59

Table 14: Summary of Tong and Li's overall prediction accuracy

208

13 PRACTITIONER TIP

You may have noticed that several researchers compared theirRVM to a support vector machine or other model A datascientist I worked with faced an issue where one model (logisticregression) performed marginally better than an alternativemodel (decision tree) However the decision was made to gowith the decision tree because due to a quirk in the softwareit was better labeledZhou et al demonstrated the superiority of a RVM over a sup-port vector machine for their data-set The evidence was lesscompelling in the case of Tong and Li where Table 14 appearsto indicate a fairly close race In the case where alternativemodels perform in a similar range the rationale for which tochoose may be based on considerations not related to whichmodel performed best in absolute terms on the test set Thiswas certainly the case in the choice between logistic regressionand decision trees faced by my data scientist co-worker

209

Technique 27

RVM Regression

A RVM regression can be estimated using the package kernlab with the rvm function:

rvm(y ~ ., data, kernel, kpar, ...)

Key parameters include kernel - the kernel function; kpar, which contains the kernel parameter values; the continuous response variable y; and the covariates data.

Step 1: Load Required Packages
We build the RVM regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.
> library(kernlab)
> data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.
> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model
We estimate the RVM using a radial basis kernel (kernel = "rbfdot") and parameter sigma equal to 0.5. We also set the cross validation parameter cross = 10.

> fit <- rvm(DEXfat ~ ., data = bodyfat[train, ],
             kernel = "rbfdot", kpar = list(sigma = 0.5),
             cross = 10)

The print function provides details of the estimated model, including the kernel type (radial basis), kernel parameter (sigma = 0.5), the number of relevance vectors (43) and the cross validation error:
> print(fit)
Relevance Vector Machine object of class "rvm"
Problem type: regression

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.5

Number of Relevance Vectors : 43
Variance : 12.11944
Training error : 24.802080247
Cross validation error : 107.9221

Since the output of rvm is an S4 object, it can be accessed directly using "@":
> fit@error
[1] 24.80208

> fit@cross
[1] 107.9221

OK, let's fit two other models and choose the one with the lowest cross validation error. The first model, fit1, is estimated using the default settings of rvm. The second model, fit2, uses a Laplacian kernel. Ten-fold cross validation is used for both models:
> fit1 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
              cross = 10)

> fit2 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
              kernel = "laplacedot", kpar = list(sigma = 0.001),
              cross = 10)

Now we are ready to assess the fit of each of the three models - fit, fit1 and fit2:
> fit@cross
[1] 108.2592

> fit1@cross
[1] 65.50046

> fit2@cross
[1] 21.9755

The model fit2 using a Laplacian kernel has by far the best overall crossvalidation error

Step 4: Make Predictions
The predict method with the test data and the fitted model fit2 is used as follows:
> pred <- predict(fit2, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 27.1, is visualized using the plot method combined with the abline method to show the linear regression line:
> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")


The squared correlation between the test sample predicted and observed values is 0.813:
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.813

Figure 27.1: RVM regression of observed and fitted values using bodyfat

213

92 Applied Predictive Modeling Techniques in R

Notes53For further details see

1 M E Tipping ldquoBayesian inference an introduction to principles and practice inmachine learningrdquo in Advanced Lectures on Machine Learning O Bousquet Uvon Luxburg and G Raumltsch Eds pp 41ndash62 Springer Berlin Germany 2004

2 M E Tipping ldquoSparseBays An Efficient Matlab Implementation ofthe Sparse Bayesian Modelling Algorithm (Version 20)rdquo March 2009httpwwwrelevancevectorcom

54Hu Jinfei and Peter W Tse A relevance vector machine-based approach with ap-plication to oil sand pump prognostics Sensors 139 (2013) 12663-12686

55Elarab Manal et al Estimating chlorophyll with thermal and broadband multi-spectral high resolution imagery from an unmanned aerial system using relevance vectormachines for precision agriculture International Journal of Applied Earth Observationand Geoinformation (2015)

56Zhou Yan et al Deception detecting from speech signal using relevance vectormachine and non-linear dynamics features Neurocomputing 151 (2015) 1042-1052

57Mei-frequency cepstrum is a representation of the short-term power spectrum of asound commonly used as features in speech recognition systems See for example LoganBeth Mel Frequency Cepstral Coefficients for Music Modeling ISMIR 2000 and HasanMd Rashidul Mustafa Jamil and Md Golam Rabbani Md Saifur Rahman Speakeridentification using Mel frequency cepstral coefficients variations 1 (2004) 4

58Wong Ka In Pak Kin Wong and Chun Shun Cheung Modelling and prediction ofdiesel engine performance using relevance vector machine International journal of greenenergy 123 (2015) 265-271

59Zhao Qianqian et al Relevance Vector Machine and Its Application in Gas LayerClassification Journal of Computational Information Systems 920 (2013) 8343-8350

60See Haiyan Lu Pichet Sriyanyong Yong Hua Song Tharam Dillon Experimentalstudy of a new hybrid PSO with mutation for economic dispatch with non-smooth costfunction [J] International Journal of Electrical Power and Energy Systems 32 (9) 2012921ndash935

61Tong Guangrong and Siwei Li Construction and Application Research of Isomap-RVM Credit Assessment Model Mathematical Problems in Engineering (2015)

214

Part IV

Neural Networks

215

The Basic Idea

An artificial neural network (ANN) is constructed from a number of interconnected nodes known as neurons. These are arranged into an input layer, a hidden layer and an output layer. The input nodes correspond to the number of features you wish to feed into the ANN, and the number of output nodes corresponds to the number of items you wish to predict. Figure 27.2 presents an overview of an artificial neural network topology (ANN). It has 2 input nodes, 1 hidden layer with 3 nodes, and 1 output node.

Figure 27.2: A basic neural network

The Neuron
At the heart of a neural network is the neuron. Figure 27.3 outlines the workings of an individual neuron. A weight is associated with each arc into

a neuron, and the neuron then sums all inputs according to

S_j = \sum_{i=1}^{N} w_{ij} a_i + b_j   (27.1)

The parameter b_j represents the bias associated with the neuron. It allows the network to shift the activation function "upwards" or "downwards". This type of flexibility is important for successful machine learning.

Figure 27.3: An artificial neuron

Activation and Learning
A neural network is generally initialized with random weights. Once the network has been initialized it is then trained. Training consists of two elements - activation and learning.

• Step 1: An activation function f(S_j) is applied and the output passed to the next neuron(s) in the network. The sigmoid function is a popular activation function (a short sketch of a single neuron's forward pass is given after this list):

  f(S_j) = \frac{1}{1 + \exp(-S_j)}   (27.2)

• Step 2: A learning "law" that describes how the adjustments of the weights are to be made during training. The most popular learning law is backpropagation.
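The two steps above can be sketched in a few lines of R for a single neuron; the inputs, weights and bias are arbitrary illustrative values:

# Sigmoid activation: f(S) = 1 / (1 + exp(-S))
sigmoid <- function(S) 1 / (1 + exp(-S))

# A single neuron: weighted sum of inputs plus bias, then activation
neuron_output <- function(a, w, b) {
  S <- sum(w * a) + b      # equation (27.1)
  sigmoid(S)               # equation (27.2)
}

# Illustrative values: three inputs feeding one neuron
a <- c(0.5, -1.2, 0.3)     # inputs
w <- c(0.8, 0.1, -0.4)     # weights on each arc
b <- 0.2                   # bias
neuron_output(a, w, b)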


The Backpropagation Algorithm

It consists of the following steps:

1. The network is presented with input attributes and the target outcome.

2. The output of the network is compared to the actual known target outcome.

3. The weights and biases of each neuron are adjusted by a factor based on the derivative of the activation function, the differences between the network output and the actual target outcome, and the neuron outputs. Through this process the network "learns".

Two parameters are often used to speed up learning and prevent the system from being trapped in a local minimum. They are known as the learning rate and the momentum; a simplified sketch of a single weight update using both is given below.
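As a rough illustration only (not the exact update rule used by any particular package), a gradient-descent weight update for one sigmoid output neuron, including a learning rate and a momentum term, might be sketched as:

sigmoid <- function(S) 1 / (1 + exp(-S))

# One gradient-descent update for the weights of a single sigmoid neuron.
# a: inputs, w: current weights, b: bias, target: known outcome,
# lr: learning rate, momentum: weighting on the previous weight change.
update_weights <- function(a, w, b, target, lr = 0.1,
                           momentum = 0.5, prev_change = 0) {
  S      <- sum(w * a) + b
  output <- sigmoid(S)
  # error term: derivative of the sigmoid times (target - output)
  delta  <- (target - output) * output * (1 - output)
  # weight change: learning-rate step plus momentum on the last change
  dw     <- lr * delta * a + momentum * prev_change
  list(w = w + dw, b = b + lr * delta, change = dw)
}

# Illustrative single update
a <- c(0.5, -1.2, 0.3)
w <- c(0.8, 0.1, -0.4)
upd <- update_weights(a, w, b = 0.2, target = 1)
upd$w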

13 PRACTITIONER TIP

ANNs are initialized by setting random values to the weights and biases. One rule of thumb is to set the random values to lie in the range (−2/k to 2/k), where k is the number of inputs.
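Assuming that reading of the rule of thumb, initializing the weights and bias for a neuron with k inputs is a one-liner in R (the value of k here is illustrative):

k <- 6                                        # number of inputs to the neuron
w_init <- runif(k, min = -2 / k, max = 2 / k) # random starting weights
b_init <- runif(1, min = -2 / k, max = 2 / k) # random starting bias
w_init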

How Many Nodes and Hidden LayersOne of the very first questions asked about neural networks is how manynodes and layers should be included in the model There are no fixed rulesas to how many nodes to include in the hidden layer However as the numberof hidden nodes increase so does the time taken for the network to learn fromthe input data


Practical Applications

Sheet Sediment Transport
Tayfur62 models sediment transport using artificial neural networks (ANNs). The sample consisted of the original experimental hydrological data of Kilinic & Richardson.63

A three-layer feed-forward artificial neural network with two neurons inthe input layer eight neurons in the hidden layer and one neuron in the out-put layer was built The sigmoid function was used as an activation functionin the training of the network and the learning of the ANN was accomplishedby the back-propagation algorithm A random value of 02 and mdash10 wereassigned for the network weights and biases before starting the training pro-cess

The ANNs performance was assessed relative to popular physical hydro-logical models (flow velocity shear stress stream power and unit streampower) against a combination of slope types (mild steep and very steep) andrain intensity (low high very high)

The ANN outperformed the popular hydrological models for very highintensity rainfall on both steep and very steep slope see Table 15

                     Mild slope   Steep slope   Very steep slope
Low Intensity        physical     physical      ANN
High Intensity       physical     physical      physical
Very high intensity  physical     ANN           ANN

Table 15: Which model is best? Model transition framework derived from Tayfur's analysis

220

13 PRACTITIONER TIP

Tayfurrsquos results highlight an important issue facing the datascientist Not only is it often a challenge to find the bestmodel among competing candidates it is even more difficultto identify a single model that works in all situations Agreat solution offered by Tayfur is to have a model transitionmatrix that is determine which model(s) perform well underspecific conditions and then use the appropriate model for agiven condition

Stock Market VolatilityMantri et al64 investigate the performance of a multilayer perceptron relativeto standard econometric models of stock market volatility (GARCH Ex-ponential GARCH Integrated GARCH and the Glosten Jagannathan andRunkle GARCH model)

Since the multilayer perceptron does not make assumptions about thedistribution of stock market innovations it is of interest to Financial Analystsand Statisticians The sample consisted of daily data collected on two Indianstock indices (BSE SENSEX and the NSE NIFTY) over the period January1995 to December 2008

The researchers found no statistical difference between the volatility ofthe stock indices estimated by the multilayer perceptron and standard econo-metric models of stock market volatility

Trauma SurvivalThe performance of trauma departments in the United Kingdom is widelyaudited by applying predictive models that assess the probability of survivaland examining the rate of unexpected survivals and deaths The standardapproach is the TRISS methodology which consists of two logistic regressionsone applied if the patient has a penetrating injury and the other applied forblunt injuries65

Hunter Henry and Ferguson66 assess the performance of the TRISSmethodology against alternative logistic regression models and a multilayerperceptron


The sample consisted of 15055 cases gathered from Scottish Trauma de-partments over the period 1992-1996 The data was divided into two subsetsthe training set containing 7224 cases from 1992-1994 and the test setcontaining 7831 cases gathered from 1995-1996 The researcherrsquos logistic re-gression models and the neural network were optimized using the trainingset

The neural network was optimized ten times with the best resulting modelselected The researchers conclude that neural networks can yield betterresults than logistic regression

13 PRACTITIONER TIP

The sigmoid function is a popular choice as an activation function. It is good practice to standardize (i.e. convert to the range (0,1)) all external input values before passing them into a neural network. This is because, without standardization, large input values require very small weighting factors. This can cause two basic problems:

1. Inaccuracies introduced by very small floating point calculations on your computer.

2. Changes made by the back-propagation algorithm will be extremely small, causing training to be slow (the gradient of the sigmoid function at extreme values would be approximately zero).

Brown Trout RedsLek et al67 compare the ability of multiple regression and neural networks topredict the density of brown trout redds in southwest France Twenty nineobservation stations distributed on six rivers and divided into 205 morpho-dynamic units collected information on 10 ecological metrics

The models were fitted using all the ecological variables and also witha sub-set of four variables Testing consisted of random selection for thetraining set (75 of observations and the test set 25 of observations) Theprocess was repeated a total of five times

The average correlation between the observed and estimated values over the five samples is reported in Table 16. The researchers conclude that both multiple regression and neural networks can be used to predict the density of brown trout redds; however, the neural network model had better prediction accuracy.

Neural Network        Multiple Regression
Train     Test        Train     Test
0.900     0.886       0.684     0.609

Table 16: Lek et al's reported correlation coefficients between estimated and observed values in training and test samples

Electric Fish LocalizationWeakly electric fish emit an electric discharge used to navigate the surround-ing water and to communicate with other members of their shoal Trackingof individual fish is often carried out using infrared cameras However thisapproach becomes unreliable when there is a visual obstruction68

Kiar et al69 develop an non-invasive means of tracking weakly electric fishin real-time using a cascade forward neural network The data set contained299 data points which were interpolated to be 29900 data points The accu-racy of the neural network was 943 within 1cm of actual fish location witha mean square error of 002 mm and a image frame rate of 213 Hz

Chlorophyll DynamicsWu et al70 developed two modeling approaches - artificial neural networks(ANN) and multiple linear regression (MLR) to simulate the daily Chloro-phyll a dynamics in a northern German lowland river Chlorophyll absorbssunlight to synthesize carbohydrates from CO2 and water It is often used asa proxy for the amount of phytoplankton present in water bodies

Daily Chlorophyll a samples were taken over an 18 month period Intotal 426 daily samples were obtained Each 10th daily sample was assignedto the validation set resulting in 42 daily observations The calibration setcontained 384 daily observations

For ANN modelling a three layer back propagation neural network wasused The input layer consisted of 12 neurons corresponding to the inde-pendent variables shown in Table 17 The same independent variables werealso used in the multiple regression model The dependent variable in bothmodels was the daily concentration of Chlorophyll a


Air temperature                 Ammonium nitrogen
Average daily discharge         Chloride
Chlorophyll a concentration     Daily precipitation
Dissolved inorganic nitrogen    Nitrate nitrogen
Nitrite nitrogen                Orthophosphate phosphorus
Sulfate                         Total phosphorus

Table 17: Wu et al's input variables

The results of the ANN and MLR illustrated a good agreement betweenthe observed and predicted daily concentration of Chlorophyll a see Table 18

Model   R-Square   NS     RMSE
MLR     0.53       0.53   2.75
NN      0.63       0.62   1.94

Table 18: Wu et al's performance metrics (NS = Nash-Sutcliffe efficiency, RMSE = root mean square error)

13 PRACTITIONER TIP

Whilst there are no fixed rules about how to standardize inputs, here are four popular choices for your original input variable x_i:

z_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}   (27.3)

z_i = \frac{x_i - \bar{x}}{\sigma_x}   (27.4)

z_i = \frac{x_i}{\sqrt{SS_i}}   (27.5)

z_i = \frac{x_i}{x_{max} + 1}   (27.6)

SS_i is the sum of squares of x_i, and \bar{x} and \sigma_x are the mean and standard deviation of x_i.
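A minimal sketch of these four scalings in R, where x is an arbitrary illustrative input variable:

x <- c(2, 5, 9, 14, 20)                  # illustrative input variable

z1 <- (x - min(x)) / (max(x) - min(x))   # equation (27.3): range (0,1)
z2 <- (x - mean(x)) / sd(x)              # equation (27.4): z-score
z3 <- x / sqrt(sum(x^2))                 # equation (27.5): unit sum of squares
z4 <- x / (max(x) + 1)                   # equation (27.6)

round(cbind(z1, z2, z3, z4), 3)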


Examples of Neural Network Classification


Technique 28

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation & backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+")

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages
The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.
> library(neuralnet)
> data("PimaIndiansDiabetes2", package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
The PimaIndiansDiabetes2 data has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining incomplete cases. The cleaned data is stored in temp:


> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:
> y <- (temp$diabetes)
> levels(y) <- c("0", "1")

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations. The variable f is used to store the formula of the model, where diabetes is the dependent or response variable. Be sure to check that n is equal to 724, the number of observations in the full sample.
> set.seed(103)
> n = nrow(temp)

> train <- sample(1:n, 600, FALSE)

> f <- diabetes ~ pregnant + glucose + pressure +
       mass + pedigree + age

Step 3: Estimate & Evaluate Model
The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ],
                   hidden = 4,
                   algorithm = "rprop+")

The print method gives a nice overview of the model:
> print(fit)


Call: neuralnet(formula = f, data = temp[train, ],
                hidden = 4, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1 18.1242328   0.009057962229  8448

A nice feature of the neuralnet package is the ability to visualize a fitted network using the plot method, see Figure 28.1:
> plot(fit, intercept = FALSE, show.weights = FALSE)

13 PRACTITIONER TIP

It often helps intuition to visualize data. To see the fitted intercepts set intercept = TRUE, and to see the estimated neuron weights set show.weights = TRUE. For example, try entering:
plot(fit, intercept = TRUE, show.weights = TRUE)



Figure 28.1: Resilient backpropagation with backtracking neural network using PimaIndiansDiabetes2

Step 4: Make Predictions
We transfer the data into a variable called z and use this with the compute method and the test sample:
> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:
> sign(pred$net.result)

     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20      1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:
> table(sign(pred$net.result),
        sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 61  6
        1 20 37

We also need to calculate the error rate:
> error_rate = (1 - sum(sign(pred$net.result) ==
                        sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.21

The misclassification error rate is around 21%.

230

Technique 29

Resilient Backpropagation

A neural network with resilient backpropagation without backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-")

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation.

Step 3: Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ],
                   hidden = 4,
                   algorithm = "rprop-")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226:
> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
                hidden = 4, algorithm = "rprop-")

1 repetition was calculated.


        Error Reached Threshold Steps
1 18.40243459   0.009715675095  6814

Step 4: Make Predictions
We transfer the data into a variable called z and use this with the compute method and the test sample:
> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:
> sign(pred$net.result)

     [,1]
4      -1
5       1
7      -1
12      1
17      1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:
> table(sign(pred$net.result),
        sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 62  4
        1 19 39

We also need to calculate the error rate:
> error_rate = (1 - sum(sign(pred$net.result) ==
                        sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.19

The misclassification error rate is around 19%.

232

Technique 30

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr")

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "slr" to use a globally convergent algorithm that uses resilient backpropagation without weight backtracking and, additionally, the smallest learning rate.

Step 3: Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ],
                   hidden = 4,
                   algorithm = "slr")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226; however the algorithm (due to the small learning rate) takes very many more steps to converge:
> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
                hidden = 4, algorithm = "slr")


1 repetition was calculated

        Error Reached Threshold Steps
1 17.97865898   0.009813138137 96960

Step 4: Make Predictions
We transfer the data into a variable called z and use this with the compute method and the test sample:
> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:
> sign(pred$net.result)

     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20     -1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:
> table(sign(pred$net.result),
        sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 58 10
        1 23 33

We also need to calculate the error rate:
> error_rate = (1 - sum(sign(pred$net.result) ==
                        sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.27

The misclassification error rate is around 27%.

234

Technique 31

Probabilistic Neural Network

A probabilistic neural network can be estimated using the package pnn with the learn function:

learn(y, data, ...)

Key parameters include the response variable y and the covariates contained in data.
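Before working through the step-by-step example below, the overall pnn workflow (learn, then smooth, then perf, then guess) can be sketched on a small simulated data set; the data and the sigma value here are purely illustrative:

library(pnn)

# Toy data: the first column of the training set holds the category
set.seed(1)
train_set <- data.frame(class = factor(rep(c("a", "b"), each = 20)),
                        x1 = c(rnorm(20, 0), rnorm(20, 3)),
                        x2 = c(rnorm(20, 0), rnorm(20, 3)))

nn <- learn(train_set)          # build the network
nn <- smooth(nn, sigma = 0.8)   # set the smoothing parameter
nn <- perf(nn)                  # evaluate performance on the training set
nn$success_rate

# Predict the class of a new observation (values are illustrative)
guess(nn, c(2.5, 2.9))$category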

Step 1: Load Required Packages
The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.
> library(pnn)
> data("PimaIndiansDiabetes2", package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
The PimaIndiansDiabetes2 data has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining incomplete cases. The cleaned data is stored in temp:
> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)


Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:
> y <- (temp$diabetes)

> temp$diabetes <- NULL
> temp <- scale(temp)
> temp <- cbind(as.factor(y), temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations.
> set.seed(103)
> n = nrow(temp)

> n_train <- 600
> n_test  <- n - n_train

> train <- sample(1:n, n_train, FALSE)

Step 3: Estimate & Evaluate Model
The model is fitted using the learn method, with the fitted model stored in fit_basic:

> fit_basic <- learn(data.frame(y[train], temp[train, ]))

You can use the attributes method to identify the slot characteristics of fit_basic:
> attributes(fit_basic)
$names
[1] "model"           "set"             "category.column" "categories"
[5] "k"               "n"


13 PRACTITIONER TIP

Remember you can access the contents of a fitted probabilistic neural network by using the $ notation. For example, to see what is in the "model" slot you would type:
> fit_basic$model
[1] "Probabilistic neural network"

The summary method provides details on the model:
> summary(fit_basic)

                Length Class      Mode
model           1      -none-     character
set             8      data.frame list
category.column 1      -none-     numeric
categories      2      -none-     character
k               1      -none-     numeric
n               1      -none-     numeric

Next we use the smooth method to set the smoothing parameter sigma. We use a value of 0.5:
> fit <- smooth(fit_basic, sigma = 0.5)

13 PRACTITIONER TIP

Much of the time you will not have a pre-specified value in mind for the smoothing parameter sigma. However, you can let the smooth function find the best value using its inbuilt genetic algorithm. To do that you would type something along the lines of:
> smooth(fit_basic)

The performance statistics of the fitted model are assessed using the perf method. For example, enter the following to see various aspects of the fitted model:
> fit <- perf(fit)
> fit$observed


> fit$guessed
> fit$success
> fit$fails
> fit$success_rate
> fit$bic

Step 4: Make Predictions
Let's take a look at the testing sample. To see the first row of covariates in the testing set enter:
> round(temp[-train, ][1, ], 2)

    pregnant glucose pressure
100    -0.85   -1.07    -0.52
        mass pedigree     age
       -0.63    -0.93   -1.05

You can see the first observation of the response variable in the test sample in a similar way:
> y[-train][1]
[1] neg
Levels: neg pos

Now let's predict the first response value in the test set using the covariates:
> guess(fit, as.matrix(temp[-train, ][1, ]))$category
[1] neg

Take a look at the associated probabilities. In this case there is a 99% probability associated with the neg class:
> guess(fit, as.matrix(temp[-train, ][1, ]))$probabilities
        neg         pos
0.996915706 0.003084294

Here is how to see both the prediction and the associated probabilities:
> guess(fit, as.matrix(temp[-train, ][1, ]))
$category
[1] neg

$probabilities
        neg         pos
0.996915706 0.003084294

OK, now we are ready to predict all the response values in the test sample. We can do this with a few lines of R code:
> pred <- 1:n_test
> for (i in 1:n_test) {
    pred[i] <- guess(fit, as.matrix(temp[-train, ][i, ]))$category
  }

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:
> table(pred, y[-train],
        dnn = c("Predicted", "Observed"))

         Observed
Predicted neg pos
      neg  79   2
      pos   2  41

We also need to calculate the error rate:
> error_rate = (1 - sum(pred == y[-train]) / n_test)
> round(error_rate, 3)
[1] 0.032

The misclassification error rate is around 3%.

239

Technique 32

Multilayer Feedforward Neural Network

A multilayer feedforward neural network can be estimated using the AMORE package with the train function:

train(net, P, T, error.criterium, ...)

Key parameters include net, the neural network you wish to train; P, the training set attributes; T, the training set response variable (output values); and the error criterion (Least Mean Squares, Least Mean Logarithm Squared or TAO Error) contained in error.criterium.

Step 1 Load Required PackagesThe neural network is built using the data frame PimaIndiansDiabetes2contained in the mlbench package For additional details on this data set seepage 52gt library(AMORE)gt data(PimaIndiansDiabetes2package=mlbench)

Step 2 Prepare Data amp Tweak ParametersThe PimaIndiansDiabetes2 has a large number of misclassified values(recorded as NA) particularly for the attributes of insulin and tricepsWe remove these two attributes from the sample and use the naomit methodto remove any remaining misclassified values The cleaned data is stored intemp


> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)
> levels(y) <- c(0, 1)
> y <- as.numeric(as.character(y))
> names(y) <- c("diabetes")
> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations:

> set.seed(103)
> n = nrow(temp)
> train <- sample(1:n, 600, FALSE)

Now we need to create the neural network we wish to train:

> net <- newff(n.neurons = c(1, 3, 2, 1),
+              learning.rate.global = 0.01,
+              momentum.global = 0.5,
+              error.criterium = "LMLS", Stao = NA,
+              hidden.layer = "sigmoid",
+              output.layer = "purelin",
+              method = "ADAPTgdwm")

I'll explain the code above line by line. In the first line we're creating an object called net that will contain the structure of our new neural network. The first argument to newff is n.neurons, which allows us to specify the number of inputs, the number of nodes in each hidden layer, and the number of outputs. So in the example above we have 1 input, 2 hidden layers (the first containing 3 nodes and the second containing 2 nodes) and 1 output.

The learning.rate.global argument constrains how much the algorithm is allowed to change the weights from iteration to iteration as the network is trained. In this case learning.rate.global = 0.01, which means that the algorithm can't increase or decrease any one weight in the network by more than 0.01 from trial to trial.

The error.criterium argument specifies the error mechanism used at each iteration. There are three options here: "LMS" (for least mean squares), "LMLS" (for least mean logarithm squared) and "TAO" (for the Tao error method). In general I tend to choose "LMLS" as my starting point. However, I will often train my networks using all three methods.

The hidden.layer and output.layer arguments are used to choose the type of activation function used to interpret the summations of the inputs and weights for any layer in your network. I have set output.layer = "purelin", which results in linear output. Other options include "tansig", "sigmoid", "hardlim" and "custom".

Finally, method specifies the solution strategy for converging on the weights within the network. For building prototype models I tend to use either "ADAPTgd" (adaptive gradient descent) or "ADAPTgdwm" (adaptive gradient descent with momentum).
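As a rough sketch of how these arguments combine, here is an alternative (purely illustrative, untuned) configuration using a single hidden layer of four tansig neurons and the TAO error criterion, which additionally requires a value for Stao:

> # illustrative values only, not a recommended setting
> net_tao <- newff(n.neurons = c(1, 4, 1),
+                  learning.rate.global = 0.01,
+                  momentum.global = 0.5,
+                  error.criterium = "TAO", Stao = 1,
+                  hidden.layer = "tansig",
+                  output.layer = "purelin",
+                  method = "ADAPTgd")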

Step 3: Estimate & Evaluate Model

The model is fitted using the train method. In this example I have set report = TRUE to provide output during the algorithm run and n.shows = 5 to show a total of 5 training reports:

> fit <- train(net, P = temp[train, ], T = temp[train, 7],
+              error.criterium = "LMLS", report = TRUE,
+              show.step = 100, n.shows = 5)

index.show: 1 LMLS 0.337115972269707
index.show: 2 LMLS 0.335651051328758
index.show: 3 LMLS 0.335113569553075
index.show: 4 LMLS 0.334753676125557
index.show: 5 LMLS 0.334462044665089

Step 4: Make Predictions

The sign function is used to convert predictions into negative and positive values. Here I use the fitted network held in fit to predict using the test sample.


> pred <- sign(sim(fit$net, temp[-train, ]))

Let's create a confusion matrix so we can see how well the neural network performed on the test sample. Notice that sign(temp[-train, 7]) contains the observed output response values for the test sample:

> table(pred, sign(temp[-train, 7]), dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 68 27
        1 13 16

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == sign(temp[-train, 7])) / 124)
> round(error_rate, 3)
[1] 0.323

The misclassification error rate is around 32%.

Examples of Neural Network Regression

Technique 33

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation and backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+", ...)

Key parameters include the response variable y, the covariates data, hidden, the number of hidden neurons, and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(neuralnet)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations. The model formula (response variable and attributes) is stored in the variable f. In addition the standardized variables are stored in the data frame scale_bodyfat.

> set.seed(465)


> train <- sample(1:71, 45, FALSE)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth +
+      kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

> scale_bodyfat <- as.data.frame(scale(bodyfat))

Step 3: Estimate & Evaluate Model

The number of hidden neurons should be determined in relation to the needed complexity. For this example we use one hidden neuron.

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
+                  hidden = 1, algorithm = "rprop+")

The plot method (plot.nn) can be used to visualize the network, as shown in Figure 33.1:

> plot(fit)



Figure 33.1: Neural network used for the bodyfat regression.

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "rprop+")

1 repetition was calculated.

          Error Reached Threshold Steps
1   17.11437513    0.00955254228  2727

A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 3)
                               1
error                     17.114
reached.threshold          0.010
steps                   2727.000
Intercept.to.1layhid1     -0.403
waistcirc.to.1layhid1      0.127
hipcirc.to.1layhid1        0.246
age.to.1layhid1            0.028
elbowbreadth.to.1layhid1  -0.016
kneebreadth.to.1layhid1    0.050
anthro3a.to.1layhid1       0.001
anthro3b.to.1layhid1       0.085
anthro3c.to.1layhid1      -0.005
anthro4.to.1layhid1        0.131
Intercept.to.DEXfat       -2.971
1layhid.1.to.DEXfat        7.390

Step 4: Make Predictions

We predict using the scaled values. We first remove the response variable DEXfat, then store the predictions in pred via the compute method:

> without_fat <- scale_bodyfat
> without_fat$DEXfat <- NULL

> pred <- compute(fit, without_fat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 33.2. The abline method plots the linear regression line fitted by linReg. Finally we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.864.

> linReg <- lm(pred$net.result ~ scale_bodyfat$DEXfat[-train])

> plot(scale_bodyfat$DEXfat[-train], pred$net.result,
+      xlab = "DEXfat", ylab = "Predicted Values",
+      main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, scale_bodyfat$DEXfat[-train])^2, 6)
[1] 0.864411


Figure 33.2: Resilient Backpropagation with Backtracking: observed and predicted values using bodyfat.


Technique 34

Resilient Backpropagation

A neural network with resilient backpropagation (without backtracking) can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-", ...)

Key parameters include the response variable y, the covariates data, hidden, the number of hidden neurons, and algorithm = "rprop-" to specify resilient backpropagation without backtracking.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 245.

We fit a network with one hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
+                  hidden = 1, algorithm = "rprop-")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "rprop-")

1 repetition was calculated.

          Error Reached Threshold Steps
1   17.16273738   0.009965580594  3651


A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 2)
                               1
error                      17.16
reached.threshold           0.01
steps                    3651.00
Intercept.to.1layhid1      -0.37
waistcirc.to.1layhid1       0.13
hipcirc.to.1layhid1         0.26
age.to.1layhid1             0.03
elbowbreadth.to.1layhid1   -0.02
kneebreadth.to.1layhid1     0.05
anthro3a.to.1layhid1        0.00
anthro3b.to.1layhid1        0.09
anthro3c.to.1layhid1        0.00
anthro4.to.1layhid1         0.14
Intercept.to.DEXfat        -2.86
1layhid.1.to.DEXfat         6.97

Step 4: Make Predictions

We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg.

The plot method is used to visualize the relationship between the predicted and observed values, see Figure 34.1. The abline method plots the linear regression line fitted by linReg. Finally we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
+      xlab = "DEXfat", ylab = "Predicted Values",
+      main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
[1] 0.865

Figure 34.1: Resilient Backpropagation without Backtracking: observed and predicted values using bodyfat.


Technique 35

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr", ...)

Key parameters include the response variable y, the covariates data, hidden, the number of hidden neurons, and algorithm = "slr", which specifies the use of a globally convergent algorithm with resilient backpropagation and the smallest learning rate.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 245.

The network is built with 1 hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
+                  hidden = 1, algorithm = "slr")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "slr")

1 repetition was calculated.

          Error Reached Threshold Steps
1   17.14864115   0.009544470987 10272

Here is a summary of the fitted model weights:


> round(fit$result.matrix, 2)
                               1
error                      17.15
reached.threshold           0.01
steps                   10272.00
Intercept.to.1layhid1      -0.38
waistcirc.to.1layhid1       0.13
hipcirc.to.1layhid1         0.26
age.to.1layhid1             0.03
elbowbreadth.to.1layhid1   -0.02
kneebreadth.to.1layhid1     0.05
anthro3a.to.1layhid1        0.00
anthro3b.to.1layhid1        0.09
anthro3c.to.1layhid1        0.00
anthro4.to.1layhid1         0.13
Intercept.to.DEXfat        -2.88
1layhid.1.to.DEXfat         7.08

Step 4: Make Predictions

We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 35.1. The abline method plots the linear regression line fitted by linReg. Finally we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
+      xlab = "DEXfat", ylab = "Predicted Values",
+      main = "Test Sample Model Fit")


> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
[1] 0.865

Figure 35.1: Resilient Backpropagation with smallest learning rate: observed and predicted values using bodyfat.


Technique 36

General Regression Neural Network

A general regression neural network can be estimated using the package grnn with the learn function:

learn(y, x)

Key parameters include the response variable y and the explanatory variables contained in x.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> require(grnn)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> n <- nrow(bodyfat)

> n_train <- 45
> n_test <- n - n_train


> train <- sample(1:n, n_train, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL

Step 3: Estimate & Evaluate Model

We use the learn method to fit the model and the smooth method to smooth the general regression neural network. We set sigma to 2.5 in the smooth function and store the resultant model in fit:

> fit_basic <- learn(data.frame(y[train], x[train, ]))
> fit <- smooth(fit_basic, sigma = 2.5)

Step 4: Make Predictions

Let's take a look at the first row of the covariates in the testing set:

> round(x[-train, ][1, ], 2)
    age waistcirc hipcirc elbowbreadth
47 4.04      4.61    4.72         1.96
   kneebreadth anthro3a anthro3b anthro3c
47        2.24     1.49      1.6      1.5
   anthro4
47    1.81

To see the first observation of the response variable in the test sample you can use a similar technique:

> y[-train][1]
[1] 3.730021

Now we are ready to predict the first response value in the test set using the covariates. Here is how to do that:

> guess(fit, as.matrix(x[-train, ][1, ]))
[1] 3.374901


Of course, we will want to predict using all the values in the test sample. We can achieve this using the guess method and the following lines of R code:

> pred <- 1:n_test

> for (i in 1:n_test) {
+   pred[i] <- guess(fit, as.matrix(x[-train, ][i, ]))
+ }

Finally, we plot in Figure 36.1 the observed and fitted values using the plot method, and on the same chart plot the regression line of the predicted versus observed values. The overall squared correlation is 0.915.

> plot(y[-train], pred, xlab = "log(DEXfat)",
+      ylab = "Predicted Values",
+      main = "Test Sample Model Fit")

> abline(linReg <- lm(pred ~ y[-train]), col = "darkred")

> round(cor(pred, y[-train])^2, 3)
[1] 0.915


Figure 36.1: General Regression Neural Network: observed and predicted values using bodyfat.


Technique 37

Monotone Multi-Layer Perceptron

On occasion you will find yourself modeling variables for which you require monotonically increasing behavior of the response variable with respect to specified attributes. It is possible to train and make predictions from a multi-layer perceptron neural network with partial monotonicity constraints. The monmlp package implements the monotone multi-layer perceptron neural network using the approach of Zhang and Zhang71. The model can be estimated using the monmlp.fit function:

monmlp.fit(x, y, hidden1, hidden2, monotone, ...)

Key parameters include the response variable y, the explanatory variables contained in x, the number of neurons in the first and second hidden layers (hidden1 and hidden2), and the monotone constraints on the covariates.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(monmlp)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.


> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)

> x$DEXfat <- NULL
> x <- as.matrix(x)

Step 3: Estimate & Evaluate Model

We build a 1 hidden layer network with 1 node, with monotone constraints on all 9 attributes contained in x (we set monotone = 1:9):

> fit <- monmlp.fit(x = x[train, ], y = as.matrix(y[train]),
+                   hidden1 = 1, hidden2 = 0, monotone = 1:9)

NOTE

The method monmlp.fit takes a number of additional arguments. Three that you will frequently want to set manually include:

• n.trials - the number of repeated trials used to avoid local minima

• n.ensemble - the number of ensemble members to fit

• bag - a logical variable indicating whether or not bootstrap aggregation (bagging) should be used

For example, in the model we are building we could have specified:

fit <- monmlp.fit(x = x[train, ], y = as.matrix(y[train]),
                  hidden1 = 1, hidden2 = 0, monotone = 1:9,
                  n.trials = 100, n.ensemble = 50, bag = TRUE)


We take a look at the fitted values using the plot method, see Figure 37.1:

> plot(attributes(fit)$y, attributes(fit)$y.pred,
+      ylab = "Fitted Values", xlab = "Observed Values")

Since the fit looks particularly good, we had better calculate the correlation coefficient. At 0.97 it is pretty high:

> cor(attributes(fit)$y, attributes(fit)$y.pred)
         [,1]
[1,] 0.973589

Figure 37.1: Monotone Multi-Layer Perceptron: training sample observed and predicted values using bodyfat.


Step 4: Make Predictions

We predict on the test sample with the monmlp.predict method:

> pred <- monmlp.predict(x = x[-train, ], weights = fit)

Next a linear regression model is used to fit the predicted and observed values for the test sample:

> linReg <- lm(pred ~ y[-train])

Now plot the result (see Figure 37.2) and calculate the squared correlation coefficient:

> plot(y[-train], pred, xlab = "DEXfat",
+      ylab = "Predicted Values", main = "Test Sample Model Fit")
> abline(linReg, col = "darkred")

> round(cor(pred, y[-train])^2, 3)
     [,1]
[1,] 0.958


Figure 37.2: Monotone Multi-Layer Perceptron: observed and predicted values using bodyfat.


Technique 38

Quantile Regression Neural Network

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when the lower and upper, or indeed all, quantiles are of interest. The qrnn package combined with the qrnn.fit function can be used to estimate a quantile regression neural network:

qrnn.fit(x, y, n.hidden, tau, ...)

Key parameters include the response variable y, the explanatory variables contained in x, the number of hidden neurons n.hidden, and the quantiles to be fitted, held in tau.
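To make the idea of fitting a specific quantile concrete, here is a short illustration (not part of the qrnn package) of the tilted absolute loss that quantile regression minimizes; errors are weighted by tau when positive and by (1 - tau) when negative:

> # error = observed - predicted; tau is the quantile being fitted
> tilted_loss <- function(error, tau) ifelse(error > 0, tau * error, (tau - 1) * error)
> tilted_loss(c(-2, 2), tau = 0.95)
[1] 0.1 1.9

For tau = 0.95 an under-prediction is penalized 19 times more heavily than an over-prediction of the same size, which pushes the fitted curve towards the 95th conditional quantile.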

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> data(bodyfat, package = "TH.data")
> library(qrnn)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.


> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)

> x$DEXfat <- NULL
> x <- as.matrix(x)

We will estimate the model using the 5th, 50th and 95th quantiles:

> probs <- c(0.05, 0.50, 0.95)

Step 3: Estimate & Evaluate Model

We estimate the model using the following:

> fit <- pred <- list()

> for(i in seq_along(probs)) {
+   fit[[i]] <- qrnn.fit(x = x[train, ],
+                        y = as.matrix(y[train]),
+                        n.hidden = 4,
+                        tau = probs[i],
+                        iter.max = 1000,
+                        n.trials = 1,
+                        n.ensemble = 20)
+ }

Let's go through it line by line. The first line sets up fit, which will contain the fitted models, and pred, which will contain the predictions, as lists. Because we have three quantiles held in probs to estimate, we use a for loop as the core work engine. The function qrnn.fit is used to fit the model for each quantile.

A few other things are worth pointing out here. First, we build the model with four hidden neurons (n.hidden = 4). Second, we set the maximum number of iterations of the optimization algorithm to 1,000. Third, we set n.trials = 1; this parameter controls the number of repeated trials used to avoid local minima. We set it to 1 for illustration. In practice you should set this to a much higher value. The same is true for the parameter n.ensemble, which we set to 20. It controls the number of ensemble members to fit. You will generally want to set this to a high value also.
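As a sketch only (values illustrative, not tuned, and expect a much longer run time), the 50th percentile model could be refitted with more repeated trials and a larger ensemble along the following lines:

> fit_50 <- qrnn.fit(x = x[train, ], y = as.matrix(y[train]),
+                    n.hidden = 4, tau = 0.50,
+                    iter.max = 1000,
+                    n.trials = 5,      # more restarts to escape local minima
+                    n.ensemble = 50)   # larger ensemble of fitted networks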

Step 4: Make Predictions

For each quantile the qrnn.predict method is used alongside the fitted model. Again we choose to use a for loop to capture each quantile's predicted values:

> for(i in seq_along(probs)) {
+   pred[[i]] <- qrnn.predict(x[-train, ], fit[[i]])
+ }

Next we create three variables to hold the predicted values. Note there are 26 sample values in total:

> pred_high <- pred_low <- pred_mean <- 1:26

Now we are ready to store the predictions. Since we use 20 ensembles for each quantile, we store the average prediction using the mean method:

> for (i in 1:26) {
+   pred_low[i]  <- mean(pred[[1]][i, ])
+   pred_mean[i] <- mean(pred[[2]][i, ])
+   pred_high[i] <- mean(pred[[3]][i, ])
+ }

Now we fit a linear regression model using the 50th percentile values held in pred_mean and the actual observed values. The result is visualized using the plot method and shown in Figure 38.1. The squared correlation coefficient is then calculated. It is fairly high at 0.885.

> linReg2 <- lm(pred_mean ~ y[-train])

> plot(y[-train], pred_mean, xlab = "log(DEXfat)",
+      ylab = "Predicted Values",
+      main = "Test Sample Model Fit")

> abline(linReg2, col = "darkred")

> round(cor(pred_mean, y[-train])^2, 3)
[1] 0.885


Figure 38.1: Quantile Regression Neural Network: observed and predicted values (pred_mean) using bodyfat.

It is often useful to visualize the results of a model. In Figure 38.2 we use scatterplot3d to plot the relationship between pred_low, pred_high and the observed values in the test sample:

> scatterplot3d(pred_low, y[-train], pred_high,
+               xlab = "quantile = 0.05",
+               ylab = "Test Sample",
+               zlab = "quantile = 0.95")


Figure 38.2: Quantile Regression Neural Network: scatterplot3d of 5th and 95th predicted quantiles.


Notes

62. Tayfur, Gokmen. "Artificial neural networks for sheet sediment transport." Hydrological Sciences Journal 47.6 (2002): 879-892.

63. Kilinc, Mustafa. "Mechanics of soil erosion from overland flow generated by simulated rainfall." Colorado State University Hydrology Papers (1973).

64. Mantri, Jibendu Kumar, P. Gahan and Braja B. Nayak. "Artificial neural networks - An application to stock market volatility." Soft-Computing in Capital Market: Research and Methods of Computational Finance for Measuring Risk of Financial Instruments (2014): 179.

65. For further details see H.R. Champion et al. "Improved Predictions from a Severity Characterization of Trauma (ASCOT) over Trauma and Injury Severity Score (TRISS): Results of an Independent Evaluation." Journal of Trauma: Injury, Infection, and Critical Care 40 (1), 1996.

66. Hunter, Andrew, et al. "Application of neural networks and sensitivity analysis to improved prediction of trauma survival." Computer Methods and Programs in Biomedicine 62.1 (2000): 11-19.

67. Lek, Sovan, et al. "Application of neural networks to modelling nonlinear relationships in ecology." Ecological Modelling 90.1 (1996): 39-52.

68. See for example Jun, J. J., Longtin, A. and Maler, L. (2011). "Precision measurement of electric organ discharge timing from freely moving weakly electric fish." J. Neurophysiol. 107, pp. 1996-2007.

69. Kiar, Greg, et al. "Electrical localization of weakly electric fish using neural networks." Journal of Physics: Conference Series. Vol. 434. No. 1. IOP Publishing, 2013.

70. Wu, Naicheng, et al. "Modeling daily chlorophyll a dynamics in a German lowland river using artificial neural networks and multiple linear regression approaches." Limnology 15.1 (2014): 47-56.

71. See Zhang, H. and Zhang, Z. (1999). "Feedforward networks with monotone constraints." In: International Joint Conference on Neural Networks, vol. 3, pp. 1820-1823. doi:10.1109/IJCNN.1999.832655.


Part V

Random Forests


The Basic Idea

Suppose you have a sample of size N with M features. Random forests (RF) build multiple decision trees, each grown from a different set of training data. For each tree, K training samples are randomly selected with replacement from the original training data set.

In addition to constructing each tree using a different random sample (bootstrap) of the data, the RF algorithm differs from a traditional decision tree because at each decision node the best splitting feature is determined from a randomly selected subspace of m features, where m is much smaller than the total number of features M. Traditional decision trees split each node using the best split among all features.

Each tree in the forest is grown to the largest extent possible, without pruning. To classify a new object, each tree in the forest gives a classification, which is interpreted as the tree 'voting' for that class. The final prediction is determined by majority vote among the classes decided by the forest of trees; a conceptual sketch of these ideas is given below.
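A minimal conceptual sketch of the two sources of randomness and of majority voting (an illustration only, not the internals of any random forest package):

> # bootstrap sample of the N observations used to grow one tree
> N <- 150; M <- 10; m <- 3
> boot_rows <- sample(1:N, N, replace = TRUE)
> # random subset of m of the M features considered at a single split
> split_vars <- sample(1:M, m, replace = FALSE)
> # majority voting: each element of 'votes' is one tree's predicted class
> votes <- c("bus", "van", "bus", "bus", "opel")
> names(which.max(table(votes)))
[1] "bus"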

NOTE

Although each tree in a RF tends to be less accurate than a classical decision tree, by combining multiple predictions into one aggregate prediction a more accurate forecast is often obtained. Part of the reason is that the prediction of a single decision tree tends to be highly sensitive to noise in its training set. This is not the case for the average of many trees, provided they are uncorrelated. For this reason the RF often decreases overall variance without increasing bias relative to a single decision tree.
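A one-line illustration of this variance reduction, under the simplifying (and not always realistic) assumption that the B tree predictions f_b are uncorrelated with common variance sigma^2:

\operatorname{Var}\left( \frac{1}{B}\sum_{b=1}^{B} f_b \right) = \frac{\sigma^2}{B}

so growing more trees shrinks the variance of the aggregate forecast while leaving its bias unchanged.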


Random Ferns

Random ferns are an ensemble of constrained trees originally used in image processing for identifying textured patches surrounding key-points of interest across images taken from different viewpoints. This machine learning method gained widespread attention in the image processing community because it is extremely fast, easy to implement, and appeared to outperform random forests in image processing classification tasks, provided the training set was sufficiently large72.

While a tree applies a different decision function at each node, a fern systematically applies the same decision function at every node of the same level. Each fern consists of a small set of binary tests and returns the probability that an object belongs to any one of the classes that have been learned during training. A naive Bayesian approach is used to combine the collection of probabilities. Ferns differ from decision trees in three ways:

• In decision trees the binary tests are organized hierarchically (hence the term tree); ferns are flat. Each fern consists of a small set of binary tests and returns the probability that an observation belongs to any one of the classes that have been learned during training.

• Ferns are more compact than trees. In fact, 2^(N-1) operations are needed to grow a tree of 2^N leaves; only N operations are needed to grow a fern of 2^N leaves.

• For decision trees the posterior distributions are computed additively; in ferns they are multiplicative.


Practical Applications

Tropical Forest Carbon Mapping

Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus73). Parametric statistics have traditionally dominated the discipline74. Remote sensing technologies such as Light Detection and Ranging (LiDAR) have been used successfully to estimate spatial variation in carbon stocks75. However, due to cost and technological limitations, it is not yet feasible to use it to map the entire world's tropical forests.

Mascaro et al.76 evaluated the performance of the Random Forest algorithm in up-scaling airborne LiDAR-based carbon estimates, compared to the traditionally accepted stratification-based sampling approach, over a 16-million hectare focal area of the Western Amazon.

Two separate Random Forest models were investigated, each using an identical set of 80,000 randomly selected input pixels. The first Random Forest model used the same set of input variables as the stratification approach. The second Random Forest used additional "position" parameters: x and y coordinates combined with two diagonal coordinates, see Table 19.

The resultant maps of ecosystem carbon stock are shown in Figure 38.3. Notice that several regions exhibit pronounced differences when using the Random Forest with the additional four 'position' parameters (image B).

The stratification method had a root mean square error (RMSE) of 33.2 Mg C ha-1 and an adjusted R² of 0.37 (predicted versus observed). The first Random Forest model had an RMSE of 31.6 Mg C ha-1 and an adjusted R² of 0.43. The Random Forest model with the additional four 'position' parameters had an RMSE of 26.7 Mg C ha-1 and an adjusted R² of 0.59. The researchers conclude there is an improvement when using Random Forest with position information, which is consistent at all distances (see Figure 38.4).


Variable    Explanation                                            S   RF1  RF2
easting     UTM X coordinate                                       -   -    x
northing    UTM Y coordinate                                       -   -    x
diag.x      X coordinate after 45 degree clockwise image rotation  -   -    x
diag.y      Y coordinate after 45 degree clockwise image rotation  -   -    x
frac_soil   Percent cover of soil as determined by Landsat         -   -    x
            image processing with CLASlite
frac_pv     Percent cover of photosynthetic vegetation as          x   x    x
            determined by Landsat image processing with CLASlite
frac_npv    Percent cover of non-photosynthetic vegetation as      x   x    x
            determined by Landsat image processing with CLASlite
elevation   SRTM elevation above sea level                         x   x    x
slope       SRTM slope                                             x   x    x
aspect      SRTM aspect                                            x   x    x
geoeco      Habitat class as determined by synthetic integration   x   x    x
            of national geological map, NatureServe and other
            sources

Table 19: Input variables used by Mascaro et al. S = Stratification-based sampling, RF = Random Forest.


Figure 38.3: Mascaro et al.'s predicted carbon stocks using three different methodologies: (a) Stratification and mapping of median carbon stocks in each class; (b) Random Forest without the inclusion of position information; (c) Random Forest using additional model inputs for position. Source of figure: Mascaro et al.


Figure 38.4: Mascaro et al.'s performance of three modeling techniques as assessed in 36 validation cells. Left panels highlight model performance against LiDAR-observed above-ground carbon density from CAO aircraft data (Mg C ha-1), while right panels highlight model performance by increasing distance from CAO aircraft data. The color scale reflects the two-dimensional density of observations, adjusted to one dimension using a square root transformation. Source of figure: Mascaro et al.


Opioid Dependency & Sleep Disordered Breathing

Patients using chronic opioids are at elevated risk for potentially lethal disorders of breathing during sleep. Farney et al.77 investigate sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone.

A total of 70 patients admitted for therapy with buprenorphine/naloxone were recruited into the study. Indices were created for apnoea/hypopnoea (AHI), obstructive apnoea (OAI), central apnoea (CAI) and hypopnoea (HI).

Each index was computed as the total of defined respiratory events divided by the total sleep time in hours, scored simultaneously by two onlooking researchers. Oximetry data were analyzed to calculate mean SpO2, lowest SpO2 and time spent below 90% SpO2 during sleep.

For each of these response metrics a Random Forest model was built using the following attributes: buprenorphine dose, snoring, tiredness, witnessed apnoeas, hypertension, body mass index, age, neck circumference, gender, use of benzodiazepines, antidepressants, antipsychotics, and smoking history. The researchers found that the random forests showed little relationship between the explanatory variables and the response variables.

Waste-water Deterioration Modeling

Vitorino et al.78 build a Random Forest model to predict the condition of individual sewers and to determine the expected length of sewers in poor condition. The data consisted of two sources: a sewer data table and an inspection data table.

The sewer table contained information on the sewer identification code, zone, construction material, diameter, installation date and a selection of user-defined covariates. The inspection data table contained information on the sewer identification code, date of last inspection and condition of the sewer at the date of last inspection.

The Random Forest was trained using all available data and limited to 50 trees. It was then used to predict the condition of individual sewer pipes. The researchers predict the distribution of sewer pipe in poor condition by type of material used for construction. The top three highest ranked materials were CIP (31.83%), followed by unknown material (27.94%) and RPM (23.89%).


Optical Coherence Tomography & Glaucoma

Glaucoma is the second most common cause of blindness. As glaucomatous visual field (VF) damage is irreversible, the early diagnosis of glaucoma is essential. Sugimoto et al.79 develop a random forest classifier to predict the presence of VF deterioration in glaucoma suspects using optical coherence tomography (OCT) data.

The study investigated 293 eyes of 179 live patients referred to the University of Tokyo Hospital for glaucoma between August 2010 and July 2012. The Random Forest algorithm with 10,000 trees was used to classify the presence or absence of glaucomatous VF damage using 237 different OCT measurements. Age, gender, axial length and eight other right/left eye metrics were also included as training attributes.

The researchers report a receiver operating characteristic curve value of 0.9 for the random forest. This compared well to the value of 0.75 for an individual tree and between 0.77 and 0.86 for individual right/left eye metrics.

Obesity Risk Factors

Kanerva et al.80 investigate 4,720 Finnish subjects who completed health questionnaires about leisure-time physical activity, smoking status and educational attainment. Weight and height were measured by experienced nurses. A random forest algorithm, using the randomForest package in R and the collected indicator variables, was used to predict overweight or obesity. The results were compared with a logistic regression model.

The researchers observe that the random forest and logistic regression had very similar classification power; for example, the estimated error rates for the models were almost equal: 42% for men using the random forest versus 43% for men using logistic regression. The researchers tentatively conclude: "Machine learning techniques may provide a solution to investigate the network between exposures and eventually develop decision rules for obesity risk estimation in clinical work."

Protein Interactions

Chen et al.81 develop a novel random forest model to predict protein-protein interactions. They choose a traditional classification framework with two classes: either a protein pair interacts with each other or it does not. Each protein pair is characterized by a very large attribute space, in total 4,293 unique attributes.

The training and test sets each contained 8,917 samples: 4,917 positive and 4,000 negative samples. Five-fold cross validation is used and the maximum size of a tree is 450 levels. The researchers report an overall sensitivity of 79.78% and specificity of 64.38%; this compares well to an alternative maximum likelihood technique, which had a sensitivity of 74.03% and specificity of 37.53%.


Technique 39

Classification Random Forest

A classification random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance = TRUE, proximity = TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which, if set to TRUE, assesses the importance of predictors; proximity, another logical variable which, if set to TRUE, calculates a proximity measure among the rows; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package:

> library(randomForest)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1,000. A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)


> N = nrow(Vehicle)
> numtrees = 1000
> train <- sample(1:N, 500, FALSE)

The data frame attributes is used to hold the complete set of attributes contained in Vehicle:

> attributes <- Vehicle[train, ]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000, and we set both importance and proximity equal to TRUE:

> fit <- randomForest(Class ~ ., data = Vehicle[train, ],
+                     ntree = numtrees, importance = TRUE,
+                     proximity = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 27.2%
Confusion matrix:
     bus opel saab van class.error
bus  121    1    1   1  0.02419355
opel   1   60   60   6  0.52755906
saab   5   49   66   9  0.48837209
van    1    0    2 117  0.02500000

Important information includes the formula used to fit the model, the number of trees grown at each iteration, the number of variables tried at each split, the confusion matrix with class errors, and the out of the bag error estimate.


The random forest appears to identify both van and bus with a low error (less than 3%), but saab (49%) appears to be essentially a random guess, whilst opel (53%) is marginally better than random. The short sketch below shows how these class errors follow directly from the confusion matrix.
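As a quick check, the per-class errors reported above can be reproduced from the stored confusion matrix (this assumes fit is the forest just printed):

> cm <- fit$confusion[, 1:4]            # drop the class.error column
> round(1 - diag(cm) / rowSums(cm), 3)  # per-class misclassification error
  bus  opel  saab   van
0.024 0.528 0.488 0.025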

Step 4: Assess Model

A visualization of the error rate by iteration for bus, opel, saab, van and the overall out of the bag error is obtained using plot, see Figure 39.1:

> plot(fit)

Figure 39.1: Error estimate across iterations for each class and overall using Vehicle.


The misclassification errors on van and bus, as well as the out of bag error, settle down to a stable range within 200 trees. For opel stability occurs around 400 trees, whilst for saab it takes around 800 trees.

The rfcv function is used to perform a 10-fold cross validation (cv.fold = 10) with variable importance re-assessed at each step (recursive = TRUE). The result is stored in cv:

> cv <- rfcv(trainx = attributes, trainy = Vehicle$Class[train],
+            ntree = numtrees, cv.fold = 10, recursive = TRUE)

PRACTITIONER TIP

The function rfcv calculates cross-validated prediction performance with a sequentially reduced number of attributes, ranked by variable importance, using a nested cross-validation procedure. The minimum value of the cross-validation risk can be used to select the optimum number of attributes (known also as predictors) in a random forest model.
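For example, using the cross-validation object cv created above, the number of predictors with the smallest cross-validated error can be extracted with (a minimal sketch):

> cv$n.var[which.min(cv$error.cv)]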

A plot of cross-validated error versus number of predictors can be created as follows:

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
+               xlab = "number of predictors",
+               ylab = "cross-validated error (%)"))

Figure 39.2 shows the resultant plot. It indicates a steady decline in the cross-validated error for the first nine attributes, after which it levels off.


Figure 39.2: Cross-validated prediction performance by number of attributes for Vehicle.

Variable importance, shown in Figure 39.3, is obtained using the function varImpPlot:

> varImpPlot(fit)


Figure 39.3: Random Forest variable importance plot for Vehicle.

Max.L.Ra and D.Circ seem to be important, whether measured by average decrease in accuracy or by reduction in node impurity (Gini).

Partial dependence plots provide a way to visualize the marginal relationship between the response variable and the covariates (attributes). Let's use this idea combined with the importance function to investigate the top four attributes:

> imp <- importance(fit)

> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) {
+   partialPlot(fit, Vehicle[train, ], impvar[i],
+               xlab = impvar[i],
+               main = paste("Partial Dependence on", impvar[i]))
+ }

> par(op)

Figure 39.4 shows the partial dependence plots for the four most important attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra.

PRACTITIONER TIP

The importance function takes the type argument. Set type = 1 if you only want to see importance by average decrease in accuracy, and set type = 2 to see only importance by average decrease in node impurity, measured by the Gini coefficient.
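For example, using the forest fitted above (a usage sketch only):

> head(importance(fit, type = 1))   # average decrease in accuracy only
> head(importance(fit, type = 2))   # average decrease in node impurity (Gini) only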


Figure 39.4: Random forest partial dependence plots for the four most important attributes (including Max.L.Ra, D.Circ and Pr.Axis.Ra).

It is often helpful to look at the margin. This is measured as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. A positive margin implies correct classification, and a negative margin incorrect classification:

> plot(margin(fit))

The margin for fit is shown in Figure 39.5.


Figure 39.5: Random forest margin plot for Vehicle.

Step 5: Make Predictions

Next we use the fitted model on the test sample:

> fit <- randomForest(Class ~ ., data = Vehicle[-train, ],
+                     ntree = numtrees, importance = TRUE,
+                     proximity = TRUE)

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 26.59%
Confusion matrix:
     bus opel saab van class.error
bus   92    0    1   1  0.02127660
opel   4   35   37   9  0.58823529
saab   5   25   50   8  0.43181818
van    1    1    0  77  0.02531646

We notice a sharp deterioration in the classification error for saab over the validation set, although opel has improved dramatically. The misclassification rates for van and bus are in line with those observed in the training sample. The overall out of the bag estimate is around 27%.

An important aspect of building random forest models is tuning the model. We can attempt to do this by finding the optimal value of mtry and then fitting that model on the test data. The parameter mtry controls the number of variables randomly sampled as candidates at each split. The tuneRF function returns the optimal value of mtry. The smallest out of the bag error occurs at mtry = 3; this is confirmed from Figure 39.6:

> best_mytry <- tuneRF(attributes, Vehicle$Class[train],
+                      ntreeTry = numtrees, stepFactor = 1.5,
+                      improve = 0.01, trace = TRUE, plot = TRUE,
+                      do.best = FALSE)

mtry = 4  OOB error = 27%
Searching left ...
mtry = 3  OOB error = 26.6%
0.01481481 0.01
mtry = 2  OOB error = 26.8%
-0.007518797 0.01
Searching right ...
mtry = 6  OOB error = 26.6%


Figure 39.6: Random forest OOB error by mtry for Vehicle.

Finally, we fit a random forest model on the test data using mtry = 3:

> fitbest <- randomForest(Class ~ ., data = Vehicle[-train, ],
+                         ntree = numtrees, importance = TRUE,
+                         proximity = TRUE, mtry = 3)

> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE,
     mtry = 3)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of error rate: 25.14%
Confusion matrix:
     bus opel saab van class.error
bus   93    0    0   1  0.01063830
opel   3   39   36   7  0.54117647
saab   4   26   50   8  0.43181818
van    1    1    0  77  0.02531646

The model fitbest has a slightly lower out of the bag error at 25.14%. However, relative to fit, the accuracy on opel has declined somewhat.


Technique 40

Conditional Inference Classification Random Forest

A conditional inference classification random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package. We also load the caret package, as it provides handy performance metrics for random forests:

> library(party)
> library(caret)
> library(mlbench)
> data(Vehicle)


PRACTITIONER TIP

If you have had your R session open for a lengthy period of time, you may have objects hogging memory. To see what objects are being held in memory type:

ls()

To remove all objects from memory use:

rm(list = ls())

To remove a specific object from memory use:

rm(object_name)

Step 2: Prepare Data & Tweak Parameters

A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Random Forest

We estimate the random forest using the training sample and the function cforest. The number of trees is set to 1,000 (ntree = 1000), with mtry set to 5:

> fit <- cforest(Class ~ ., data = Vehicle[train, ],
+                controls = cforest_unbiased(ntree = 1000, mtry = 5))

The accuracy of the model and the kappa statistic can be retrieved using the caret package:

> caret:::cforestStats(fit)
 Accuracy     Kappa
  0.86600   0.82146

Let's do a little memory management and remove fit:

> rm(fit)

Let's re-estimate the random forest using the default settings in cforest:

> fit <- cforest(Class ~ ., data = Vehicle[train, ])
> caret:::cforestStats(fit)
 Accuracy     Kappa
0.8700000 0.8267831

Since both the accuracy and kappa are slightly higher, we will use this as the fitted model.

PRACTITIONER TIP

Remember that all simulation-based models, including the random forest, are subject to random variation. This variation can be important when investigating variable importance. It is advisable, before interpreting a specific importance ranking, to check whether the same ranking is achieved with a different model run (random seed) or a different number of trees.
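As a quick robustness sketch (the seed value here is arbitrary and purely illustrative), you could re-fit the forest under a different seed and compare the top-ranked attributes:

> set.seed(42)
> fit_check <- cforest(Class ~ ., data = Vehicle[train, ])
> imp_check <- varimp(fit_check)
> round(imp_check[order(-imp_check)][1:3], 3)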

Variable importance is calculated using the function varimp. First we estimate variable importance using the default approach, which calculates the mean decrease in accuracy using the method outlined in Hapfelmeier et al.82:

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:18])]

> round(imp1[1:3], 3)
  Elong Max.L.Ra  Scat.Ra
  0.094    0.069    0.057

> rm(ord1, imp1)

Next we try the unconditional approach using the 'mean decrease in accuracy' outlined by Breiman83. Interestingly, the attributes Elong, Max.L.Ra and Scat.Ra are rank ordered exactly the same by both methods, and they have very similar scores:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:18])]

> round(imp2[1:3], 3)
  Elong Max.L.Ra  Scat.Ra
  0.100    0.083    0.057

> rm(ord2, imp2)

PRACTITIONER TIP

In addition to the two unconditional measures of variable importance already discussed, a conditional version is also available. It is conditional in the sense that it adjusts for correlations between predictor variables. For the random forest estimated by fit, conditional variable importance can be obtained by:

> varimp(fit, conditional = TRUE)

Step 5: Make Predictions

First we estimate accuracy and kappa using the fitted model and the test sample:

> fit_test <- cforest(Class ~ ., data = Vehicle[-train, ])
> caret:::cforestStats(fit_test)
 Accuracy     Kappa
0.8179191 0.7567866

Both accuracy and kappa values are close to the values estimated on the validation sample. This is encouraging in terms of model fit.

The predict function calculates predicted values over the testing sample. These values, along with the original observations, are used to construct the test sample confusion matrix and overall misclassification error. We observe an error rate of 28.6%:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred,
+       dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   91    0    1   2
          opel   8   30   33  14
          saab   6   23   51   8
          van    2    0    2  75

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.286


Technique 41

Classification Random Ferns

A classification random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, ...)

Key parameters include ferns, which controls the number of ferns grown at each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a classification random ferns model using the Vehicle data frame (see page 23) contained in the mlbench package:

> library(rFerns)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Step 3: Estimate and Assess the Model

PRACTITIONER TIP

To print the out of bag error as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns the error every 50th iteration. Your estimated model would look something like this:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
+               ferns = 300, reportErrorEvery = 50)

 Done fern 50/300; current OOB error 0.392000
 Done fern 100/300; current OOB error 0.378000
 Done fern 150/300; current OOB error 0.350000
 Done fern 200/300; current OOB error 0.360000
 Done fern 250/300; current OOB error 0.354000
 Done fern 300/300; current OOB error 0.348000

The number of ferns is set to 5,000, with the parameter saveErrorPropagation = TRUE to save the out of bag errors at each iteration:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
+               ferns = 5000, saveErrorPropagation = TRUE)

The details of the fitted model can be viewed using the print method. The method returns the number of ferns in the forest, the fern depth, the out of bag error estimate and the confusion matrix:

> print(fit)

 Forest of 5000 ferns of a depth 5.

 OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   109    2    7   0
     opel    7   65   44   0
     saab    0   32   40   0
     van     8   28   38 120

It is often useful to view the out of bag error by fern size; this can be achieved using the plot method and modelname$oobErr:

> plot(fit$oobErr, xlab = "Number of Ferns",
+      ylab = "OOB Error", type = "l")

Figure 41.1 shows the resultant chart. Looking closely at the figure, it appears the error is minimized at approximately 32%, somewhere between 2,000 and 3,000 ferns. To get the exact number use the which.min method:

> which.min(fit$oobErr)
[1] 2487
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.324

So the error reaches a minimum of 32.4% at 2,487 ferns.


Figure 41.1: Classification Random Ferns out of bag error estimate by iteration for Vehicle.

PRACTITIONER TIP

To determine the time taken to run the model you would use $timeTaken as follows:

> fit$timeTaken
Time difference of 0.314466 secs

Other useful components of an rFerns object are given in Table 20.


Parameter      Description
$model         The estimated model
$oobErr        Out of bag error
$importance    Importance scores
$oobScores     Class scores for each object
$oobPreds      Class predictions for each object
$timeTaken     Time used to train the model

Table 20: Some key components of an rFerns object.
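For example, assuming the components listed in Table 20, the out of bag class predictions can be compared with the observed training classes as a quick sanity check:

> head(fit$oobPreds)
> table(fit$oobPreds, Vehicle$Class[train])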

Let's re-estimate the model with the number of ferns equal to 2,487:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
+               ferns = 2487, importance = TRUE, saveForest = TRUE)

> print(fit)

 Forest of 2487 ferns of a depth 5.

 OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   107    1    6   0
     opel    8   65   43   0
     saab    0   33   42   0
     van     9   28   38 120

Alas, we ended up with the same out of bag error as our original model. Nevertheless, we will keep this version of the model.

The influence of variables can be assessed using $importance. The six most important variables and their standard deviations are as follows:

> imp <- fit$importance[order(-fit$importance[, 1]), ]

> round(head(imp), 3)
             MeanScoreLoss SdScoreLoss
Max.L.Ra             0.353       0.008
Elong                0.219       0.006
Sc.Var.Maxis         0.205       0.005
Sc.Var.maxis         0.201       0.005
Pr.Axis.Rect         0.196       0.005
D.Circ               0.196       0.004

Step 5: Make Predictions

Predictions for the test sample can be obtained using the predict method. The results are stored in predClass and combined using the table method to create the confusion matrix:

> predClass <- predict(fit, Vehicle[-train, ])

> table(predClass, Vehicle$Class[-train],
+       dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class bus opel saab van
           bus   85    5    4   0
           opel   4   38   36   0
           saab   0   13   28   0
           van    5   29   20  79

Finally, we calculate the misclassification error rate. At 33.5% it is fairly close to the error estimate for the validation sample:

> error_rate = (1 - sum(Vehicle$Class[-train] == predClass) / 346)
> round(error_rate, 3)
[1] 0.335


Technique 42

Binary Response Random Forest

On many occasions the response variable is binary. In this chapter we show how to build a random forest model in this situation. A binary response random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance = TRUE, proximity = TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which, if set to TRUE, assesses the importance of predictors; proximity, another logical variable which, if set to TRUE, calculates a proximity measure among the rows; z, the binary response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the binary response random forest using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded; we will use it during the final stage of our analysis.

> library(randomForest)
> library(mlbench)
> data(Sonar)
> require(ROCR)


Step 2: Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1,000. A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Sonar)
> numtrees = 1000
> train <- sample(1:N, 157, FALSE)

The data frame attributes holds the complete set of attributes contained in Sonar:

> attributes <- Sonar[train, ]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set importance equal to TRUE:

> fit <- randomForest(Class ~ ., data = Sonar[train, ],
+                     ntree = numtrees, importance = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Sonar[train, ],
     ntree = numtrees, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 7

        OOB estimate of error rate: 18.47%
Confusion matrix:
   M  R class.error
M 70 12   0.1463415
R 17 58   0.2266667

307

92 Applied Predictive Modeling Techniques in R

The classification error on M is less than 20 whilst for R is around 23The out of the bag estimate of error is also less than 20

Step 4: Assess Model

A visualization of the error rate by iteration for each of M and R, together with the overall out of bag error, is obtained using plot; see Figure 42.1:

> plot(fit)

Figure 42.1: Random Forest error estimate across iterations for M, R and overall using Sonar

The misclassification errors settle down to a stable range by 800 trees.

Next the variable importance scores are calculated and the partial dependence plot (see Figure 42.2) is plotted:

> imp <- importance(fit)
> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
> op <- par(mfrow = c(2, 2))
> for (i in seq_along(1:4)) {
      partialPlot(fit, Sonar[train, ], impvar[i], xlab = impvar[i],
                  main = paste("Partial Dependence on", impvar[i]))
  }
> par(op)

Figure 42.2: Binary random forest partial dependence plot using Sonar

Figure 42.2 seems to indicate that V11, V12, V9 and V52 have similar shapes in terms of partial dependence. Looking closely at the diagram you will notice that all except V52 appear to be responding on a similar scale. Given the similar pattern, we calculate a number of combinations of these three variables and add them to our group of attributes:

> temp <- cbind(Sonar$V11, Sonar$V12, Sonar$V9)
> Vmax  <- apply(temp, 1, function(x) max(x))
> Vmin  <- apply(temp, 1, function(x) min(x))
> Vmean <- apply(temp, 1, function(x) mean(x))
> Vmed  <- apply(temp, 1, function(x) median(x))

> attributes <- cbind(attributes, Vmax[train], Vmin[train],
                      Vmean[train], Vmed[train])
> Sonar <- cbind(Sonar, Vmax, Vmin, Vmean, Vmed)

Now our list of attributes in Sonar also includes the maximum, minimum, mean and median of V11, V12 and V9. Let's refit the model and calculate variable importance of all attributes:

> fit <- randomForest(Class ~ ., data = Sonar[train, ],
                      ntree = numtrees, importance = TRUE)
> varImpPlot(fit)

Figure 42.3 shows variable importance using two measures - the mean decrease in accuracy and the mean decrease in the Gini score. Notice the consistency in the identification of the top eight variables between these two methods. It seems the most influential variables are Vmax, Vmin, Vmean, Vmed, V11, V12, V9 and V52.

Figure 42.3: Binary Random Forest variable importance plots for Sonar

A ten-fold cross validation is performed using the function rfcv, followed by a plot, shown in Figure 42.4, of the cross-validated error by number of predictors (attributes):

> cv <- rfcv(trainx = attributes, trainy = Sonar$Class[train],
             ntree = numtrees, cv.fold = 10, recursive = TRUE)

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
                xlab = "number of predictors",
                ylab = "cross-validated error (%)"))

The error falls sharply from 2 to around 9 predictors and continues to decline, leveling out by 50 predictors.

Figure 42.4: Binary random forest cross validation error by number of predictors using Sonar

Step 5: Make Predictions

It is always worth experimenting with the tuning parameter mtry. It controls the number of variables randomly sampled as candidates at each split. The function tuneRF can be used to visualize the out of bag error associated with different values of mtry:

> best_mtry <- tuneRF(attributes, Sonar$Class[train],
                      ntreeTry = numtrees, stepFactor = 1.5,
                      improve = 0.01, trace = TRUE, plot = TRUE,
                      doBest = FALSE)

Figure 42.5 illustrates the output of tuneRF. The optimal value occurs at mtry = 8, with an out of bag error around 19.75%.

We use this value of mtry and refit the model using the test sample.

Figure 42.5: Using tuneRF to obtain the optimal mtry value

> fitbest <- randomForest(Class ~ ., data = Sonar[-train, ],
                          ntree = numtrees, importance = TRUE,
                          proximity = TRUE, mtry = 8)

> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Sonar[-train, ],
              ntree = numtrees, importance = TRUE,
              proximity = TRUE, mtry = 8)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 8

        OOB estimate of error rate: 23.53%
Confusion matrix:
   M  R class.error
M 26  3   0.1034483
R  9 13   0.4090909

The misclassification error for M and R is 10% and 41% respectively, with an out of bag error estimate of around 24%. Finally, we calculate and plot the Receiver Operating Characteristic for the test set; see Figure 42.6:

> fitpreds <- predict(fitbest, data = Sonar[-train, -61], type = 'prob')
> preds <- fitpreds[, 2]
> plot(performance(prediction(preds, Sonar$Class[-train]), 'tpr', 'fpr'))
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")

Figure 42.6: Binary Random Forest ROC for test set using Sonar

Technique 43

Binary Response Random Ferns

A binary response random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, saveErrorPropagation = TRUE, ...)

Key parameters include z, the binary response variable; data, the data set of attributes with which you wish to build the ferns; ferns, which controls the number of ferns to grow in each iteration; and saveErrorPropagation, a logical variable which, if set to TRUE, calculates and saves the out of bag error approximation.

Step 1: Load Required Packages

We build the binary response random ferns using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded; we will use this package during the final stage of our analysis to build a ROC curve.

> library(rFerns)
> library(mlbench)
> data(Sonar)
> require(ROCR)

Step 2: Prepare Data & Tweak Parameters

A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Sonar)
> train <- sample(1:N, 157, FALSE)

Step 3: Estimate and Assess the Random Ferns

We use the function rFerns with 1000 ferns. The parameter saveErrorPropagation = TRUE so that we can save the out of bag error estimates:

> fit <- rFerns(Class ~ ., data = Sonar[train, ],
                ferns = 1000, saveErrorPropagation = TRUE)

The print function returns details of the fitted model:

> print(fit)

 Forest of 1000 ferns of a depth 5

 OOB error 20.38%; OOB confusion matrix:
         True
Predicted  M  R
        M 67 17
        R 15 58

The forest consists of 1000 ferns of a depth 5, with an out of bag error rate of around 20%. Let's plot the error rate:

> plot(fit$oobErr, xlab = "Number of Ferns",
       ylab = "OOB Error", type = "l")

From Figure 43.1 it looks as if the error is minimized around 335 ferns, with an approximate out of bag error rate of around 20%. To get the exact value use:

> which.min(fit$oobErr)
[1] 335

> fit$oobErr[which.min(fit$oobErr)]
[1] 0.1910828

It seems the error is minimized at 335 ferns with an OOB error of 19%.

Figure 43.1: Binary Random Ferns out of bag error by number of ferns for Sonar

Now, setting ferns = 335, we re-estimate the model:

> fit <- rFerns(Class ~ ., data = Sonar[train, ],
                ferns = 335, importance = TRUE, saveForest = TRUE)

> print(fit)

 Forest of 335 ferns of a depth 5

 OOB error 18.47%; OOB confusion matrix:
         True
Predicted  M  R
        M 70 17
        R 12 58

PRACTITIONER TIP

To see a print out of the error estimate as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns an out of bag error estimate at every 50th iteration. Try this:

test <- rFerns(Class ~ ., data = Sonar[train, ],
               ferns = 100, reportErrorEvery = 20)

Now we turn our attention to investigating variable importance. The values can be extracted from the fitted model using fit$importance. Since larger values indicate greater importance, we use the order function to sort from largest to smallest, but only show the top six:

> imp <- fit$importance[order(-fit$importance[, 1]), ]
> round(head(imp), 3)

    MeanScoreLoss SdScoreLoss
V11         0.125       0.028
V12         0.109       0.021
V10         0.093       0.016
V49         0.068       0.015
V47         0.060       0.017
V9          0.060       0.017

The output reports the top six attributes by order of importance. It also provides a measure of their variability (SdScoreLoss). It turns out that V11 has the highest importance score, followed by V12 and V10. Notice that V49, V47 and V9 are all clustered around 0.06.

Step 4: Make Predictions

We use the predict function to predict the classes using the test sample, create and print the confusion matrix, and calculate the misclassification error. The misclassification error is close to 18%:

> predClass <- predict(fit, Sonar[-train, ])
> table(predClass, Sonar$Class[-train],
        dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class  M  R
              M 24  4
              R  5 18

> error_rate = (1 - sum(Sonar$Class[-train] == predClass) / 51)
> round(error_rate, 3)
[1] 0.176

It can be of value to calculate the Receiver Operating Characteristic (ROC). This can be constructed using the ROCR package. First we convert the raw scores estimated from the fitted model using the test sample into probabilities, or at least values that lie in the zero to one range:

> predScores <- predict(fit, Sonar[-train, ], scores = TRUE)
> predScores <- predScores + abs(min(predScores))

Next we need to generate the prediction object. As inputs we use the predicted class probabilities in predScores and the actual test sample classes (Sonar$Class[-train]). The predictions are stored in pred and passed to the performance method, which calculates the true positive rate and false positive rate. The result is visualized, as shown in Figure 43.3, using plot:

> pred <- prediction(predScores[, 2], Sonar$Class[-train])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")

To interpret the ROC curve, let us remember that perfect classification implies a 100% true positive rate (and therefore a 0% false positive rate). This classification only happens at the upper left-hand corner of the graph and results in a plot as shown in Figure 43.2.

Now, looking at Figure 43.3, we see that the closer the curve gets to the upper left corner, the higher the classification rate. The diagonal line shown in Figure 43.3 represents random chance, so the distance of our graph above the diagonal line represents how much better it is than a random guess. The area under the curve, which takes the value 1 for a 100% true positive rate, can be calculated using a few lines of code:

> auc <- performance(pred, "auc")
> auc_area <- slot(auc, "y.values")[[1]]
> round(auc_area, 3)
[1] 0.929

Figure 43.2: 100% true positive rate ROC curve

Figure 43.3: Binary Random Ferns Receiver Operating Characteristic for Sonar

Technique 44

Survival Random Forest

A survival random forest can be built using the package randomForestSRC with the rfsrc function:

rfsrc(z ~ ., data, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(randomForestSRC)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

A total of 600 of the 641 observations in rhDNase are used to create a randomly selected training sample. The rows in rhDNase have "funky" numbering, so we use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient):

> set.seed(107)
> N = nrow(rhDNase)
> rownames(rhDNase) <- 1:nrow(rhDNase)
> train <- sample(1:N, 600, FALSE)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2

Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function rfsrc. The number of trees is set to 1000, the number of variables randomly sampled as candidates at each split is set equal to 3 (mtry = 3), and a maximum of 3 (nsplit = 3) split points are chosen randomly:

> fit <- rfsrc(Surv(time2, status) ~ trt + fev + fev2 + fev.ave,
               data = rhDNase[train, ], nsplit = 3,
               ntree = 1000, mtry = 3)

The error rate by number of trees and variable importance are given using the plot method - see Figure 44.1. The error rate levels out around 400 trees, remaining relatively constant until around 600 trees. Our constructed variable fev.ave is indicated as the most important variable, followed by fev2.

> plot(fit)

          Importance Relative Imp
fev.ave       0.0206       1.0000
fev2          0.0197       0.9586
fev           0.0073       0.3529
trt           0.0005       0.0266

Figure 44.1: Survival random forest error rate and variable importance for rhDNase

Further details of variable importance can be explored using a range of methods via the vimp function and the parameter importance. This parameter can take four distinct values: "permute", "random", "permute.ensemble" and "random.ensemble". Let's calculate all four methods:

> permute <- vimp(fit, importance = "permute")$importance
> random <- vimp(fit, importance = "random")$importance
> permute.ensemble <- vimp(fit,
      importance = "permute.ensemble")$importance
> random.ensemble <- vimp(fit,
      importance = "random.ensemble")$importance

We combine the results using rbind into the matrix tab. Overall we see that fev2 and fev.ave appear to be the most important attributes:

> tab <- rbind(permute, random, permute.ensemble, random.ensemble)
> round(tab, 3)

                     trt    fev   fev2 fev.ave
permute            0.001  0.007  0.021   0.021
random             0.001  0.010  0.015   0.018
permute.ensemble  -0.002 -0.027 -0.015  -0.016
random.ensemble   -0.004 -0.025 -0.027  -0.026

PRACTITIONER TIP

Another way to assess variable importance is using the var.select method. It takes the form var.select(object = fitted model, method = "md"). Method can take the values "md" for minimal depth (the default), "vh" for variable hunting and "vh.vimp" for variable hunting with variable importance. As an illustration, using the fitted model fit, type:

> md <- var.select(object = fit, method = "md")
> vh <- var.select(object = fit, method = "vh")
> vh.vimp <- var.select(object = fit, method = "vh.vimp")

Searching for interaction effects is an important task in many areas of statistical modeling. A useful tool is the method find.interaction. Key parameters for this method include the number of variables to be used (nvar) and the method (method). We set method = "vimp" to use the approach of Ishwaran84. In this method the importance of each variable is calculated. Their paired variable importance is also calculated. The sum of these two values is known as "Additive" importance. A large positive or negative difference between "Paired" and "Additive" indicates a potential interaction, provided the individual variable importance scores are large:

> find.interaction(fit, nvar = 8, method = "vimp")

                               Method: vimp
                     No. of variables: 4
            Variables sorted by VIMP: TRUE
   No. of variables used for pairing: 4
    Total no. of paired interactions: 6
            Monte Carlo replications: 1
    Type of noising up used for VIMP: permute

              Paired Additive Difference
fev.ave:fev2  0.0346   0.0405    -0.0058
fev.ave:fev   0.0296   0.0284     0.0012
fev.ave:trt   0.0217   0.0222    -0.0005
fev2:fev      0.0283   0.0270     0.0013
fev2:trt      0.0196   0.0208    -0.0012
fev:trt       0.0134   0.0090     0.0045

Since the differences across all variable pairs appear rather small, we conclude there is little evidence of an interaction effect.

Now we investigate interaction effects using method = "maxsubtree". This invokes a maximal subtree analysis85:

> find.interaction(fit, nvar = 8, method = "maxsubtree")

                            Method: maxsubtree
                  No. of variables: 4
Variables sorted by minimal depth: TRUE

        fev.ave fev2  fev  trt
fev.ave    0.07 0.12 0.13 0.21
fev2       0.12 0.08 0.13 0.20
fev        0.12 0.12 0.09 0.23
trt        0.13 0.13 0.16 0.15

In this case, reading across the rows in the resultant table, small [i][j] values combined with small [i][i] values are an indication of a potential interaction between attribute i and attribute j.

Reading across the rows we do not see any such values, and again we do not find evidence supporting an interaction effect.

Step 4: Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in pred:

> pred <- predict(fit, data = rhDNase[train, ])

Finally, we use the plot and plot.survival methods to visualize the predicted outcomes; see Figure 44.2 and Figure 44.3:

> plot(pred)
> plot.survival(pred)

Figure 44.2: Survival random forest visualization of plot(pred) using rhDNase

Figure 44.3: Survival random forest visualization of prediction

Technique 45

Conditional Inference Survival Random Forest

A conditional inference survival random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a conditional inference survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

We will use all 641 observations in rhDNase for our analysis. The rows in rhDNase have "funky" numbering, therefore we use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient):

> set.seed(107)
> rownames(rhDNase) <- 1:nrow(rhDNase)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate and Assess the Model

We estimate the random forest using the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 10) and the number of variables randomly sampled as candidates at each split (mtry = 2):

> fit <- cforest(z ~ trt + fev.ave, data = rhDNase,
                 controls = cforest_unbiased(ntree = 10, mtry = 2))

Step 4: Make Predictions

We will estimate conditional Kaplan-Meier survival curves for individual patients. First we use the predict method:

> pred <- predict(fit, newdata = rhDNase, type = "prob")

Next, we use the plot method to graph the conditional Kaplan-Meier curves for individual patients (patients 1, 137, 205 and 461); see Figure 45.1:

> op <- par(mfrow = c(2, 2))
> plot(pred[[1]], main = "Patient 1", xlab = "Status", ylab = "Time")
> plot(pred[[137]], main = "Patient 137", xlab = "Status", ylab = "Time")
> plot(pred[[205]], main = "Patient 205", xlab = "Status", ylab = "Time")
> plot(pred[[461]], main = "Patient 461", xlab = "Status", ylab = "Time")

> par(op)

Figure 45.1: Survival random forest Kaplan-Meier curves for the individual patients using rhDNase

Technique 46

Conditional Inference Regression Random Forest

A conditional inference regression random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the continuous response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the conditional inference regression random forest using the bodyfat (see page 62) data frame contained in the TH.data package. We also load the caret package as it provides handy performance metrics for random forests.

> library(party)
> data(bodyfat, package = "TH.data")
> library(caret)

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 100) and the number of variables randomly sampled as candidates at each split (mtry = 5):

> fit <- cforest(DEXfat ~ ., data = bodyfat[train, ],
                 controls = cforest_unbiased(ntree = 100, mtry = 5))

The fitted model root mean square error and R-squared are obtained using the caret package:

> round(caret:::cforestStats(fit), 4)

    RMSE Rsquared
  5.1729   0.8112

We calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al.86:

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:3])]
> round(imp1, 3)

  hipcirc waistcirc       age
   69.957    47.366     0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman87:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:3])]
> round(imp2, 3)

  hipcirc waistcirc       age
   72.527    40.501     0.000

The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman:

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:3])]
> round(imp3, 3)

  hipcirc waistcirc       age
   66.600    47.218     0.000

It is informative to note that all three approaches identify hipcirc as the most important variable, followed by waistcirc.

Step 5: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. The scatter plot between predicted and observed values is shown in Figure 46.1. The squared correlation coefficient between predicted and observed values is 0.756:

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
        [,1]
DEXfat 0.756

Figure 46.1: Conditional inference regression forest scatter plot for DEXfat

Technique 47

Quantile Regression Forests

Quantile regression forests are a generalization of random forests and offer a non-parametric way of estimating conditional quantiles of the response variable88. Quantile regression forests can be built using the package quantregForest with the quantregForest function:

quantregForest(y, x, data, ntree, quantiles, ...)

Key parameters include y, the continuous response variable; x, the data set of attributes with which you wish to build the forest; ntree, the number of trees; and quantiles, the quantiles you wish to include.

Step 1: Load Required Packages

We build the quantile regression forests using the bodyfat (see page 62) data frame contained in the TH.data package.

> require(quantregForest)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement. We use train to partition the data into a validation and test sample:

> train <- sample(1:71, 45, FALSE)
> numtrees = 2000
> xtrain <- bodyfat[train, ]
> xtrain$DEXfat <- NULL
> xtest <- bodyfat[-train, ]
> xtest$DEXfat <- NULL
> DEXtrain <- bodyfat$DEXfat[train]
> DEXtest <- bodyfat$DEXfat[-train]

Step 3: Estimate and Assess the Model

We use the quantregForest method, setting the number of variables randomly sampled as candidates at each split and the node size equal to 5 (mtry = 5, nodesize = 5). We use fit to contain the results of the fitted model:

> fit <- quantregForest(y = DEXtrain, x = xtrain, mtry = 5,
                        ntree = numtrees, nodesize = 5,
                        importance = TRUE,
                        quantiles = c(0.25, 0.5, 0.75))

Use the print method to see a summary of the estimated model:

> print(fit)

Call:
 quantregForest(x = xtrain, y = DEXtrain, mtry = 5,
                nodesize = 5, ntree = numtrees, importance = TRUE,
                quantiles = c(0.25, 0.5, 0.75))

                     Number of trees: 2000
No. of variables tried at each split: 5

To see variable importance across all fitted quantiles use the method varImpPlot.qrf. A visualization using this method for fit is shown in Figure 47.1. It appears hipcirc and waistcirc are the most influential variables for the median and 75th quantile:

> varImpPlot.qrf(fit)

Figure 47.1: Quantile regression forests variable importance across all fitted quantiles for bodyfat

Let's look in a little more detail at the median. This is achieved using the varImpPlot.qrf method and setting quantile = 0.5. The result is illustrated in Figure 47.2:

> varImpPlot.qrf(fit, quantile = 0.5)

Figure 47.2: Quantile regression forests using the varImpPlot.qrf method and setting quantile = 0.5 for bodyfat

The plot method returns the 90% prediction interval on the estimated data; see Figure 47.3. It appears the model is well calibrated to the data:

> plot(fit)

Figure 47.3: Quantile regression forests 90% prediction interval using bodyfat

Step 4: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. Notice we set all = TRUE to use all observations for prediction:

> pred <- predict(fit, newdata = xtest, all = TRUE)

To take a closer look at the value of pred you can enter:

> head(pred)

     quantile= 0.1 quantile= 0.5 quantile= 0.9
[1,]      34.22470      41.53230      47.14400
[2,]      35.40183      41.53541      52.41859
[3,]      22.50604      27.00303      35.40398
[4,]      27.64735      35.56909      42.73546
[5,]      21.30879      26.20480      35.33114
[6,]      32.49331      38.86571      44.52887

The plot of the fitted and observed values is shown in Figure 47.4. The squared correlation coefficient between the predicted and observed values is 0.898:

> plot(DEXtest, pred[, 2], xlab = "DEXfat",
       ylab = "Predicted Values (median)",
       main = "Training Sample Model Fit")

> round(cor(pred[, 2], DEXtest)^2, 3)
[1] 0.898

Figure 47.4: Quantile regression forests scatter plot between predicted and observed values using bodyfat

Technique 48

Conditional Inference Ordinal Random Forest

Conditional inference ordinal random forests are used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". Conditional inference ordinal random forests can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the response variable, an ordered factor; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build our random forest using the wine data frame contained in the ordinal package (see page 95). We also load the caret package as it provides handy performance metrics for random forests.

> library(caret)
> library(ordinal)
> data(wine)

Step 2: Prepare Data & Tweak Parameters

We use train to partition the validation set, using fifty observations to build the model and the remainder as the test set:

> set.seed(107)
> N = nrow(wine)
> train <- sample(1:N, 50, FALSE)

Step 3: Estimate and Assess the Model

We use the cforest method with 100 trees and set the number of variables randomly sampled as candidates at each split equal to 5 (mtry = 5). We use fit to contain the results of the fitted model:

> fit <- cforest(rating ~ ., data = wine[train, ],
                 controls = cforest_unbiased(ntree = 100, mtry = 5))

Accuracy and kappa are obtained using caret:

> round(caret:::cforestStats(fit), 4)

Accuracy    Kappa
  0.8600   0.7997

The fitted model has reasonable accuracy and kappa statistics. We next calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al. (See page 335 for further details of these methods.)

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:5])]
> round(imp1, 3)

response     temp  contact   bottle    judge
   0.499    0.000    0.000    0.000    0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:5])]
> round(imp2, 3)

response     temp  contact   bottle    judge
   0.448    0.000    0.000    0.000    0.000

The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman:

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:5])]
> round(imp3, 3)

response     temp  contact   bottle    judge
   0.481    0.000    0.000    0.000    0.000

For all three methods, it seems the response (wine bitterness score) is the only informative attribute in terms of importance.

Step 4: Make Predictions

We next use the test sample observations and cforest, and display the accuracy and kappa using caret:

> fit_test <- cforest(rating ~ ., data = wine[-train, ])
> round(caret:::cforestStats(fit_test), 4)

Accuracy    Kappa
  0.3182   0.0000

Accuracy is now around 32% and kappa has fallen to zero. Although these numbers are not very encouraging, we nevertheless investigate the predictive performance using the predict method, storing the results in pred:

> pred <- predict(fit, newdata = wine[-train, ], type = "response")

Now we compare the predicted values to those actually observed by using the table method to create a confusion matrix:

> tb <- table(wine$rating[-train], pred,
              dnn = c("actual", "predicted"))
> tb

      predicted
actual 1 2 3 4 5
     1 0 3 0 0 0
     2 0 6 0 0 0
     3 0 0 7 0 0
     4 0 0 0 4 0
     5 0 0 0 2 0

Finally, the misclassification rate can be calculated. We see it is around 23% for the test sample:

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3)
[1] 0.227

Notes

72. See Ozuysal M, Fua P, Lepetit V (2007). Fast Keypoint Recognition in Ten Lines of Code. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1-8.

73. Please visit http://www.un-redd.org

74. See for example:
    1. Evans JS, Murphy MA, Holden ZA, Cushman SA (2011). Modeling species distribution and change using random forest. In: Drew CA, Wiersma YF, Huettmann F, editors. Predictive species and habitat modeling in landscape ecology: concepts and applications. New York City, NY, USA: Springer Science+Business Media. pp. 139-159.
    2. Breiman L (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16: 199-231.

75. See for example:
    1. Baccini A, Goetz SJ, Walker WS, Laporte NT, Sun M, et al. (2012). Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature Climate Change 2: 182-185.
    2. Drake JB, Knox RG, Dubayah RO, Clark DB, Condit R, et al. (2003). Above ground biomass estimation in closed canopy neotropical forests using lidar remote sensing: Factors affecting the generality of relationships. Global Ecology and Biogeography 12: 147-159.
    3. Hudak AT, Strand EK, Vierling LA, Byrne JC, Eitel JUH, et al. (2012). Quantifying above ground forest carbon pools and fluxes from repeat LiDAR surveys. Remote Sensing of Environment 123: 25-40.

76. Mascaro, Joseph, et al. A tale of two "forests": Random Forest machine learning aids tropical forest carbon mapping. (2014): e85993.

77. Farney, Robert J, et al. Sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone. European Respiratory Journal 42.2 (2013): 394-403.

78. Vitorino D, et al. A Random Forest Algorithm Applied to Condition-based Wastewater Deterioration Modeling and Forecasting. Procedia Engineering 89 (2014): 401-410.

79. Sugimoto, Koichiro, et al. Cross-sectional study: Does combining optical coherence tomography measurements using the 'Random Forest' decision tree classifier improve the prediction of the presence of perimetric deterioration in glaucoma suspects? BMJ Open 3.10 (2013): e003114.

80. Kanerva N, et al. Random forest analysis in identifying the importance of obesity risk factors. European Journal of Public Health 23.suppl 1 (2013): ckt124-042.

81. Chen, Xue-Wen, and Mei Liu. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21.24 (2005): 4394-4400.

82. Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

83. Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

84. See Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic J. Statist. 1: 519-537.

85. For more details of this method see:
    1. Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ and Lauer MS (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc. 105: 205-217.
    2. Ishwaran H, Kogalur UB, Chen X and Minn AJ (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining 4: 115-132.

86. See Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

87. See Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

88. See Meinshausen, Nicolai. Quantile regression forests. The Journal of Machine Learning Research 7 (2006): 983-999.

Part VI

Cluster Analysis


The Basic Idea

Cluster analysis is an exploratory data analysis tool which aims to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The groups of similar objects are called clusters.

Cluster analysis itself is not one specific algorithm but consists of a wide variety of approaches. Two of the most popular are hierarchical clustering and partition based clustering. We discuss each briefly below.

Hierarchical Clustering

Hierarchical clustering methods attempt to maximize the distance between clusters. They begin by assigning each object to its own cluster. At each step the algorithm merges together the least distant pair of clusters, until only one cluster remains. At every step the distance between clusters, say cluster A and cluster B, is updated.

There are two basic approaches used in hierarchical clustering algorithms. The first is known as agglomerative and the second divisive.

Agglomerative Approach

The agglomerative approach begins with each observation considered a separate cluster. These observations are then merged successively into larger clusters until a hierarchy of clusters, known as a dendrogram, emerges.

Figure 54.4 shows a typical dendrogram using the thyroid data set. Further details of this data are discussed on page 385. Because agglomerative clustering algorithms begin with the individual observations and then grow to larger clusters, this is known as a bottom up approach to clustering.
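A minimal sketch of an agglomerative clustering in R (using the thyroid data mentioned above, and assuming the mclust package is available): the usual workflow is to compute a distance matrix and pass it to hclust.

> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL                                 # drop the class labels
> hc <- hclust(dist(scale(x)), method = "complete")   # bottom-up merging of clusters
> plot(hc)                                            # display the dendrogram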

Divisive Approach

The divisive approach begins with the entire sample of observations and then proceeds to divide it into smaller clusters. Two approaches are generally taken. The first uses a single attribute to determine the split at each step; this approach is called monothetic. The second approach uses all attributes at each step and is known as polythetic. Since for a sample of n observations there are 2^(n-1) - 1 possible divisions of clusters, the divisive approach is computationally intense. Figure 55.2 shows an example of a dendrogram produced using the divisive approach.
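A minimal sketch of a divisive clustering, using the diana function from the cluster package as one common polythetic implementation (illustrative only; the thyroid data is reused from above):

> library(cluster)
> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL              # drop the class labels
> dv <- diana(scale(x))            # DIvisive ANAlysis clustering
> plot(dv, which.plots = 2)        # which.plots = 2 draws the dendrogram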

Partition Based Clustering

Partitioning clustering methods attempt to minimize the "average" distance within clusters. They typically proceed as follows:

1. Select at random group centers and assign objects to the nearest group.

2. Form new centers at the center of these groups.

3. Move individual objects to a new group if it is closer to that group than the center of its present group.

4. Repeat steps 2 and 3 until no (or minimal) change in groupings occurs.

K-Means

The most popular partition based clustering algorithm is the k-means method. It requires a distance metric and the number of centroids (clusters) to be specified. Here is how it works:

1. Determine how many clusters you expect in your sample, say for example k.

2. Pick a group of k random centroids. A centroid is the center of a cluster.

3. Assign sample observations to the closest centroid. The closeness is determined by a distance metric. They become part of that cluster.

4. Calculate the center (mean) of each cluster.

5. Check assignments for all the sample observations. If another center is closer to an observation, reassign it to that cluster.

6. Repeat steps 3-5 until no reassignments occur.

Much of the popularity of k-means lies in the fact that the algorithm is extremely fast and can therefore be easily applied to large data sets. However, it is important to note that the solution is only a local optimum, so in practice several starting points should be used.
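A minimal sketch of this advice (using R's built-in kmeans function and the built-in iris measurements purely for illustration): the nstart argument re-runs the algorithm from several random starting points and keeps the best solution found.

> x <- iris[, 1:4]                              # numeric attributes only
> fit <- kmeans(x, centers = 3, nstart = 25)    # 25 random starts, keep the best
> fit$size                                      # observations per cluster
> table(fit$cluster, iris$Species)              # compare clusters to known species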


Practical Applications

NOTE

Different clustering algorithms may produce very different clusters. Unfortunately, there is no generally accepted "best" method. One way to assess an algorithm is to take a data set with a known structure and see whether the algorithm is able to reproduce this structure. This is the approach taken by Musmeci et al.

Financial Market Structure

Musmeci et al89 study the structure of the New York Stock Exchange using various clustering techniques. A total of 4026 daily closing prices from 342 stocks are collected for their analysis. Both trended and detrended log price returns are considered for analysis. Four of the clustering methods considered are Single Linkage (SL), Average Linkage (AL), Complete Linkage (CL) and k-medoids, using the Industrial Classification Benchmark90 (ICB) as a known classification outcome reference.

Figure 48.1 illustrates the findings of their analysis for a predefined cluster size equal to 23. SL appears to generate one very large cluster which contains stocks in all the ICB sectors; this is clearly inaccurate. AL, CL and k-medoids show more structured clustering, although for AL there remain a number of clusters with less than 5 stocks91.

The Adjusted Rand index and number of sectors identified by each method are presented in Table 21. ICB has 19 sectors, and one would hope that the clustering methods would report a similar number of clusters. For the trended stock returns k-medoids comes closest, with 17 sectors and an Adjusted Rand index of 0.387. For the detrended case k-medoids has a lower Adjusted Rand index than CL and AL, but it again comes closest to recovering the actual number of sectors in ICB.

Figure 48.1: Musmeci et al's composition of clustering in terms of ICB super-sectors. The x-axis represents the cluster labels, the y-axis the number of stocks in each cluster. Color and shading represent the ICB super-sectors. Source: Musmeci et al.

                      k-medoids     CL     AL     SL
with trend
Adjusted Rand index       0.387  0.387  0.352  0.184
Clusters                     17     39    111    229

detrended
Adjusted Rand index       0.467  0.510  0.480  0.315
Clusters                     25     50     60    101

Table 21: Adjusted Rand index and number of estimated sectors by clustering model reported by Musmeci et al.

PRACTITIONER TIP

Musmeci et al use the Adjusted Rand index as part of their analysis. The index is a measure of the similarity between two data partitions, or clusterings, of the same sample. The index has zero expected value in the case of a random partition and takes the value 1 in the case of perfect agreement between two clusterings. Negative values of the index indicate anti-correlation between the two clusterings. You can calculate it in R with the function adjustedRandIndex contained in the mclust package.
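A minimal sketch of this calculation (the two labelings below are made up purely for illustration):

> library(mclust)
> a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)   # one partition of nine objects
> b <- c(1, 1, 2, 2, 2, 3, 3, 3, 3)   # a second partition of the same objects
> adjustedRandIndex(a, b)             # 1 = perfect agreement; about 0 = random labeling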

Understanding Rat Talk

So called "Rat Talk" consists of ultrasonic vocalizations (USVs) emitted by individual rats to members of their colony. It is possible that USVs convey both emotional and/or environmental information. Takahashi et al92 use cluster analysis to categorize the USVs of pairs of rats as they go about their daily activities.

The data set consisted of audio and video recordings. The audio recording data was analyzed 50 ms at a time. Where a USV was emitted, the corresponding video sequence was also analyzed. USV calls were categorized according to Wright et al's rat communication classification system93. This consists of seven call types - upward, downward, flat, short, trill, inverted U and 22-kHz.

A two-step cluster analysis was performed. In the first step, frequency and duration data were formed into preclusters. The number of clusters was automatically determined on the basis of the Schwarz Bayesian Criterion, with the log-likelihood criterion used as a distance measure. In the second step, a hierarchical clustering algorithm was applied to the preclusters.

Table 22 presents some of the findings. Notice that three clusters were obtained, which the researchers labeled as "feeding", "moving" and "fighting". Feeding is associated with a low frequency USV, moving with a medium frequency and fighting with the highest frequency. Overall, the analysis tends to indicate that there is an association between USVs and rat behavior.

Cluster      Frequency           Duration             Dominant call type
1 Feeding    24.56 ± 2.18 kHz    628.70 ± 415.45 ms   22-kHz
2 Moving     41.78 ± 5.88 kHz    31.18 ± 32.40 ms     Flat
3 Fighting   59.18 ± 4.91 kHz    9.16 ± 10.08 ms      Short

Table 22: Summary of findings of Takahashi et al.

Identification of Asthma Clusters

Kim et al94 use cluster analysis to help define asthma sub-types as a prelude to searching for better asthma management. The researchers conduct their analysis using two cohorts of asthma patients. The first, known as the COREA cohort, consists of 724 patients. The second, the SCH cohort, consists of 1843 patients. We focus our discussion on the COREA cohort, as similar results were found for the SCH cohort.

Six primary variables - FEV195, body mass index, age at onset, atopic status, smoking history and history of hospital use - were used to help characterize the asthma clusters. All measurements were standardized, using z-scores for continuous variables and 0 or 1 for categorical variables. Hierarchical cluster analysis using Ward's method was used to generate a dendrogram for estimation of the number of potential clusters. This estimate was then used in a k-means cluster analysis.

The researchers observed four clusters. The first cluster contained 81 patients and was dominated by male patients, with the greatest number of smokers and a mean onset age of 46 years. The second cluster contained 151 patients, around half of whom had atopy. The third cluster had 253 patients, with the youngest mean age at onset (21 years), and about two-thirds of patients had atopy. The final group had 239 patients and the highest FEV1 at 97.9%, with a mean age at onset of 48 years.

Automated Determination of the Arterial Input Function

Cerebral perfusion, also referred to as cerebral blood flow (CBF), is one of the most important parameters related to brain physiology and function. The technique of dynamic-susceptibility contrast (DSC) MRI96 is a popular method to measure perfusion. It relies on the intravenous injection of a contrast agent and the rapid measurement of the transient signal changes as the contrast agent passes through the brain.

Central to quantification of CBF is the arterial input function (AIF), which describes the contrast agent input to the tissue of interest.

Quantitative maps of cerebral blood flow, cerebral blood volume (CBV) and mean transit time (MTT) are created using a deconvolution method97 98.

Yin et al99 consider the use of two clustering techniques (k-means and fuzzy c-means (FCM)) for AIF detection. Forty-two volunteers were recruited onto the study; they underwent DSC MRI imaging. After suitable transformation of the image data, both clustering techniques were applied with the number of clusters pre-set to 5.

For the mean curve of each cluster, the peak value (PV), the time to peak (TTP) and the full-width half maximum (FWHM) were computed, from which a measure M = PV / (TTP × FWHM) was calculated. Following the approach of Murase et al100, the researchers select the cluster with the highest M to determine the AIF.
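As a purely hypothetical worked example of this measure: a cluster mean curve with PV = 100, TTP = 20 s and FWHM = 10 s gives M = 100 / (20 × 10) = 0.5, so curves with higher, earlier and narrower peaks score larger values of M.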

Figure 48.2 shows the AIFs for each clustering method for a 37 year old male patient. Relative to FCM, the researchers observe the k-means-based AIF shows similar TTP, higher PV and narrower FWHM. The researchers conclude by stating "the K-means method yields more accurate and reproducible AIF results compared to FCM cluster analysis. The execution time is longer for the K-means method than for FCM, but acceptable because it leads to more robust and accurate follow-up hemodynamic maps".

Figure 48.2: Comparison of AIFs derived from the FCM and K-means clustering methods. Source: Yin et al. doi:10.1371/journal.pone.0085884.g007

PRACTITIONER TIP

The question of how many clusters to pre-specify in methods such as k-means is a common issue. Kim et al solve this problem by using the dendrogram from hierarchical cluster analysis to generate the appropriate number of clusters. Given a sample of size N, an alternative and very crude rule of thumb is101:

k ≈ √(N/2)    (48.1)
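For example, with a sample of N = 800 observations this rule of thumb suggests roughly k ≈ √(800/2) = √400 = 20 clusters.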

Choice Tests for a Wood Sample

Choice tests are frequently used by researchers to determine the food preference of insects. For insects which are sensitive to taste, natural variations in the same species of wood might be sufficient to confound experimental results102. Therefore researchers have spent considerable effort attempting to reduce this natural variation103.

Oberst et al104 test the similarity of wood samples cut sequentially from different Monterey pine (Pinus radiata) sources by applying fuzzy c-means clustering.

Veneer discs from different trees and geographical locations were collected for two samples. The small data set consisted of 505 discs cut from 10 sheets of wood; the large data set consisted of 1417 discs cut from 22 sheets. Fuzzy c-means clustering using three physical properties (dry weight, moisture absorption and reflected light intensity) was used to evaluate both data sets.

Six clusters were identified for each data set. For the small data set all six cluster centers were in regions of negative mode-skewness, which simply means that most of the wood veneer had more bright (early wood) than dark (late wood) regions. For the large data set only four of the six cluster centers were in regions of negative mode-skewness.

The researchers found that the difference between the small and large data set for the mode skewness of the reflected light intensity was statistically significant. This was not the case for the other two properties (dry weight and moisture absorption).

Oberst et al conclude by observing that "the clustering algorithm was able to place the veneer discs into clusters that match their original source

veneer sheets by using just the three measurements of physical properties".

Evaluation of Molecular Descriptors

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy and health research. They are believed to map molecular structures into numbers, allowing some mathematical treatment of the chemical information contained in the molecule. As Todeschini and Consonni state105:

"The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment."

Dehmer et al106 evaluate 919 descriptors of 6 different categories (connectivity, edge adjacency, topological, walk path counts, information and 2D matrix based) by means of clustering. The samples for their analysis came from three data sets107: MS2263, C15 and N8. For each of the 6 categories 24, 301, 57, 28, 40 and 469 descriptors were acquired.

In order to evaluate the descriptors, seven hierarchical clustering algorithms (Ward, Single, Complete, Average, McQuitty, Median and Centroid) were applied to each of the three data sets.

Dehmer et al report the cophenetic correlation coefficients for the average clustering solutions for the three data sets as 0.84, 0.89 and 0.93. The researchers also plot the hierarchical clusters, see Figure 48.3, and they observe that "The figure indicates that the descriptors of each categories have not been clustered correctly regarding their respective groups".

Figure 48.3: Dehmer et al's hierarchical clustering using the average algorithm: MS2265 (left), C15 (middle), N8 (right). The total number of descriptors equals 919. They belong to 6 different categories, which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D Matrix-based (469). Source: Dehmer et al. doi:10.1371/journal.pone.0083956.g001

PRACTITIONER TIP

Although often viewed as a graphical summary of a sample, it is important to remember that dendrograms actually impose structure on the data. For example, the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function (e.g. complete or average linkage) that is used.

One way to measure how faithfully a dendrogram preserves the pairwise distances between the original data points is to use the cophenetic correlation coefficient. It is defined as the correlation between the n(n − 1)/2 pairwise dissimilarities between observations and the between-cluster dissimilarities at which two observations are first joined together in the same cluster (often known as cophenetic dissimilarities). It takes a maximum value of 1; higher values correspond to greater preservation of the pairwise distances between the original data points. It can be calculated using the cophenetic function in the stats package.
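A minimal sketch of this check (using the built-in USArrests data purely for illustration):

> d <- dist(USArrests)                  # original pairwise dissimilarities
> hc <- hclust(d, method = "average")   # average linkage dendrogram
> cor(d, cophenetic(hc))                # values close to 1 = distances well preserved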


Partition Based Methods


Technique 49

K-Means

For our initial analysis we will use the kmeansruns function in the fpc package. This uses the kmeans method in the stats package:

kmeansruns(x, krange, criterion, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; krange, the suspected minimum and maximum number of clusters; and criterion, which determines the metric used to assess the clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(fpc)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters

We store the Vehicle data set in x:

> set.seed(98765)
> x <- Vehicle[, -19]

Step 3: Estimate and Assess the Model

K-means requires you to specify the exact number of clusters. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the Calinski-Harabasz index to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 3 and 8. In this case we could call the kmeansruns method, setting krange to range between 4 and 8 and the criterion parameter = "ch" for the Calinski-Harabasz index:

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE,
                     runs = 10, criterion = "ch")

4  clusters  2151.267
5  clusters  2528.895
6  clusters  2571.504
7  clusters  2375.042
8  clusters  2305.053

The optimal number of clusters is the solution with the highest Calinski-Harabasz index value. In this case the Calinski-Harabasz index selects 6 clusters.

We can also use the average silhouette width as our decision criterion:

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE,
                     runs = 10, criterion = "asw")

4  clusters  0.4423919
5  clusters  0.4716047
6  clusters  0.4420139
7  clusters  0.4483106
8  clusters  0.3405826

The widest width occurs at 5 clusters. This is a smaller number than obtained by the Calinski-Harabasz index; however, using both methods we have narrowed down the range of possible clusters to lie between 5 and 6. Visualization often helps in making the final selection. To do this, let's build a sum of squared error (SSE) scree plot; I explain it line by line below:

> wgs <- (nrow(x) - 1) * sum(apply(x, 2, var))

> for (i in 2:8) {
      fit_temp <- kmeans(as.matrix(x), centers = i)
      wgs[i] <- sum(fit_temp$withinss)
  }

> plot(1:8, wgs, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")

Our first step is to assign the first element of wgs to a large number. This is followed by a loop which calls kmeans in the stats package for k = 2 to 8 clusters. We then use the plot method to create the scree plot shown in Figure 49.1. Typically in a scree plot we look for a sharp elbow to indicate the appropriate number of clusters. In this case the elbow is gentle, but 4 clusters appears to be a good number to try.

Figure 49.1: k-means clustering scree plot using Vehicle

Let's fit the model with 4 clusters using the kmeans function in the stats package, plotting the results in Figure 49.2:

> fit <- kmeans(x, 4, algorithm = "Hartigan-Wong")
> plot(x, col = fit$cluster)

Figure 49.2: k-means pairwise plot of clusters using Vehicle

We can also narrow down our visualization to a single pair. To illustrate this, let's look at the relationship between Comp (column 1 in x) and Scat.Ra (column 7 in x). The plot is shown in Figure 49.3:

> plot(x[, c(1, 7)], col = fit$cluster)
> points(fit$centers[, c(1, 7)], col = 1:4, pch = 8, cex = 2)

Figure 49.3: k-means pairwise cluster plot of Comp and Scat.Ra

Technique 50

Clara Algorithm

For our analysis we will use the clara function in the cluster package

clara(x, k, ...)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(cluster)
> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data. The fpc package will be used to help select the optimum number of clusters.

Step 2: Prepare Data & Tweak Parameters

The thyroid data is stored in x. We also drop the class labels stored in the variable Diagnosis. Finally, we use the daisy method to create a dissimilarity matrix:

> set.seed(1432)
> x <- thyroid
> x$Diagnosis <- NULL
> dissim <- daisy(x)

373

92 Applied Predictive Modeling Techniques in R

Step 3 Estimate and Assess the ModelTo use of the Clara algorithm you have to specify the exact number of clustersin your sample However in practice you will rarely know this number pre-cisely One solution is to specify a range of possible clusters and use a metricsuch as the average silhouette width to choose the appropriate number

Let us suppose we expect the number of clusters to be between 1 and 6In this case we could call the pamk method setting krange to lie between1 and 6 and the criterion parameter = asw for the average silhouettewidth Here is how to do that

gt pk1 lt- pamk(dissim krange =16criterion=aswcritout=TRUE usepam=FALSE)

1 clusters 02 clusters 051728893 clusters 040311424 clusters 043251065 clusters 048455676 clusters 04251244

The optimal number of clusters is the solution with the largest average silhouette width. So in this example 2 clusters has the largest width.
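As a cross-check we can also ask clara itself for the average silhouette width at each candidate number of clusters. This is a minimal sketch; it assumes the silinfo component returned by clara whenever k > 1, and the values will differ slightly from pamk because clara works on subsamples of the data:

> for (k in 2:6) {
    fit_tmp <- clara(x, k)
    cat(k, "clusters, average silhouette width =",
        round(fit_tmp$silinfo$avg.width, 3), "\n")
  }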

However, we fit the model with 4 clusters and use the plot method to visualize the result:

> fit <- clara(x, k = 4)

> par(mfrow = c(1, 2))
> plot(fit)

Figure 501 shows the resultant plots


Figure 501 Plot using the clara algorithm with k = 4 and data set thyroid


Technique 51

PAM Algorithm

For our analysis we will use the pam function in the cluster package

pam(x, k)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(cluster)
> require(fpc)
> data(wine, package = "ordinal")

We use the wine data frame contained in the ordinal package for our analysis (see page 95 for additional details on this data).

Step 2: Prepare Data & Tweak Parameters

Since the wine data set contains ordinal variables, we use the daisy method with metric set to "gower" to create a dissimilarity matrix. The gower metric can handle nominal, ordinal and binary data.108 The wine sample is stored in data. We also drop the wine rating column and then pass data to the daisy method, storing the result in x:

> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")


Step 3: Estimate and Assess the Model

The PAM algorithm requires you to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number. Let us suppose we expect the number of clusters to be between 1 and 5. In this case we could call the pamk method, setting krange to lie between 1 and 5 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(x, krange = 1:5, criterion = "asw", critout = TRUE, usepam = TRUE)

1 clusters 0
2 clusters 0.2061618
3 clusters 0.2546821
4 clusters 0.4053708
5 clusters 0.3318729

The optimal number of clusters is the solution with the largest average silhouette width. So in this example the largest width occurs at 4 clusters.

Now we fit the model with 4 clusters and use the clusplot method to visualize the result. Figure 511 shows the resultant plot:

> fit <- pam(x, 4)
> clusplot(fit)


Figure 511 Partitioning around medoids using pam with k = 4 for wine
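Beyond the cluster plot, it is worth inspecting which wines act as the cluster prototypes (medoids) and how well separated the final solution is. A minimal sketch using standard components of a pam object fitted on a dissimilarity (id.med, silinfo and clustering):

> fit$id.med                        # row numbers of the four medoid wines
> round(fit$silinfo$avg.width, 2)   # average silhouette width for k = 4
> table(fit$clustering)             # cluster sizes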


Technique 52

Kernel Weighted K-Means

The kernel weighted version of the k-means algorithm projects the sample data into a non-linear feature space by use of a kernel (see Part II). It has the benefit that it can identify clusters which are not linearly separable in the input space. For our analysis we will use the kkmeans function in the kernlab package:

kkmeans(x, centers, ...)

Key parameters include x, the matrix of data to be clustered, and centers, the number of clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(kernlab)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis (see page 95 for additional details on this data).

Step 2: Prepare Data & Tweak Parameters

We drop the vehicle class column and store the result in x:

> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model

We estimate the model using four clusters (centers = 4), storing the result in fit. The plot method is then used to visualize the result, see Figure 521.

> fit <- kkmeans(as.matrix(x), centers = 4)
> plot(x, col = fit)
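By default kkmeans uses a Gaussian radial basis kernel with an automatically estimated width. If you want more control, the kernel and its parameters can be supplied explicitly. The sketch below is illustrative only - the sigma value is chosen arbitrarily - and it assumes the size and withinss accessor functions provided for kkmeans fits:

> fit_rbf <- kkmeans(as.matrix(x), centers = 4,
                     kernel = "rbfdot", kpar = list(sigma = 0.05))
> size(fit_rbf)       # observations per cluster
> withinss(fit_rbf)   # within-cluster sum of squares in feature space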

Let's focus in on the pairwise clusters associated with Comp (column 1 in x) and ScatRa (column 7 in x). We can visualize this relationship using the plot method. Figure 522 shows the resultant plots:

> plot(x[c(1,7)], col = fit)
> points(centers(fit)[, c(1,7)], col = 1:4, pch = 8, cex = 2)


Figure 521 Kernel Weighted K-Means with centers = 4 using Vehicle


Figure 522 Kernel Weighted K-Means pairwise clusters of Comp and ScatRa using Vehicle


Hierarchy Based Methods


Technique 53

Hierarchical Agglomerative Cluster Analysis

Hierarchical Cluster Analysis is available with the basic installation of R. It is obtained using the stats package with the hclust function:

hclust(d, method, ...)

Key parameters include d, the dissimilarity matrix, and method, the agglomeration method to be used.

Step 1: Load Required Packages

First we load the required packages. I'll explain each below:

> library(colorspace)
> library(dendextend)
> require(qgraph)
> require(cluster)
> data(thyroid, package = "mclust")

We will use the thyroid data frame contained in the mclust package in our analysis. For color output we use the colorspace package. The dendextend and qgraph packages will help us better visualize our results. We will also use the bannerplot method from the cluster package.


NOTE

The thyroid data frame109 was constructed from five laboratory tests administered to a sample of 215 patients. The tests were used to predict whether a patient's thyroid function could be classified as euthyroidism (normal), hypothyroidism (underactive thyroid) or hyperthyroidism. The thyroid data frame contains the following variables:

• Class variable:

  – Diagnosis: diagnosis of thyroid operation: Hypo, Normal and Hyper.

    Distribution of Diagnosis (number of instances per class):
    Class 1 (normal): 150
    Class 2 (hyper): 35
    Class 3 (hypo): 30

• Continuous attributes:

  – RT3U: T3-resin uptake test (percentage).
  – T4: total serum thyroxin as measured by the isotopic displacement method.
  – T3: total serum triiodothyronine as measured by radioimmuno assay.
  – TSH: basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
  – DTSH: maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value.

Step 2: Prepare Data & Tweak Parameters

We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:


> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]

> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model

Our first step is to create a dissimilarity matrix. We use the manhattan method, storing the results in dthyroid:

> dthyroid <- dist(thyroid2, method = "manhattan")

Now we are ready to fit our basic model. I have had some success with Ward's method, so let's try it first:

> fit <- hclust(dthyroid, method = "ward.D2")

NOTE

I have had good practical success using Ward's method. However, it is interesting to note that in the academic literature there are two different algorithms associated with the method. The method we selected (ward.D2) squares the dissimilarities before clustering. An alternative ward method, ward.D, does not.110

The print method provides a useful overview of the fitted model:

> print(fit)

Call:
hclust(d = dthyroid, method = "ward.D2")

Cluster method   : ward.D2
Distance         : manhattan
Number of objects: 215

Now we create a bannerplot of the fitted model:

> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 531


Figure 531 Hierarchical Agglomerative Cluster Analysis bannerplot using thyroid and Ward's method

OK, we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:

> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

We set k=3 in color_branches, one color for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

Next we match the label colors to the actual classes:

> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:

> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
                        "(", labels(dend), ")", sep = "")

A little administration is required in our next step: reduce the size of the labels to 75% of their original size. Then we create the actual plot shown in Figure 532:

> dend <- set(dend, "labels_cex", 0.75)
> par(mar = c(3, 3, 3, 7))
> plot(dend,
       main = "Clustered Thyroid data, where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = 0.007))
> legend("topleft", legend = diagnosis, fill = rainbow_hcl(3))

Overall the data seems to have been well separated. However, there is some mixing between normal and hypo.
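To put a number on that impression we can cut the tree into three groups with cutree (from the stats package) and cross-tabulate the result against the true diagnosis. This is a minimal sketch; the exact counts depend on the distance and linkage chosen above:

> groups <- cutree(fit, k = 3)      # hard cluster assignment for each patient
> table(diagnosis_labels, groups)   # rows: true diagnosis, columns: cluster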


Figure 532 Hierarchical Agglomerative Cluster Analysis dendrogram using thyroid

Since our initial analysis used Ward's method, we may as well take a look at the other methods. We choose eight of the most popular methods and investigate their correlation. We begin by capturing the methods in hclust_methods and creating a list in hclust_list:

> hclust_methods <- c("ward.D", "ward.D2", "single", "complete",
                      "average", "mcquitty", "median", "centroid")

> hclust_list <- dendlist()


Here is the main loop. Notice we fit the models using temp_fit with the dissimilarity matrix dthyroid:

> for(i in seq_along(hclust_methods)) {
    print(hclust_methods[i])
    temp_fit <- hclust(dthyroid, method = hclust_methods[i])
    hclust_list <- dendlist(hclust_list, as.dendrogram(temp_fit))
  }

Let's add names to the list:

> names(hclust_list) <- hclust_methods

Next we take a look at the output. Notice the wide variation in height produced by the different methods:

> hclust_list
$ward.D
'dendrogram' with 2 branches and 215 members total, at height 1233947

$ward.D2
'dendrogram' with 2 branches and 215 members total, at height 3378454

$single
'dendrogram' with 2 branches and 215 members total, at height 335

$complete
'dendrogram' with 2 branches and 215 members total, at height 1601

$average
'dendrogram' with 2 branches and 215 members total, at height 8466573

$mcquitty
'dendrogram' with 2 branches and 215 members total, at height 1038406

$median
'dendrogram' with 2 branches and 215 members total, at height 4114692

$centroid
'dendrogram' with 2 branches and 215 members total, at height 6373702

Now we investigate the correlation between methods. Note that method = "common" measures the commonality between members of nodes:

> cor <- cor.dendlist(hclust_list, method = "common")
> par(mfrow = c(2, 2))
> qgraph(cor, minimum = 0.70, title = "Correlation = 0.70")
> qgraph(cor, minimum = 0.75, title = "Correlation = 0.75")
> qgraph(cor, minimum = 0.8, title = "Correlation = 0.80")
> qgraph(cor, minimum = 0.85, title = "Correlation = 0.85")

Figure 533 shows the resultant plot for varying levels of correlation. It gives us another perspective on the clustering algorithms. We can see that most methods have around 75% commonality within nodes with one another.


Figure 533 Hierarchical Agglomerative Cluster Analysis correlation between methods using thyroid

Since ward.D and ward.D2 have high commonality, let's look in detail using a tanglegram visualization. The "which" parameter allows us to pick the elements in the list to compare. Figure 534 shows the result:

> hclust_list %>% dendlist(which = c(1,2)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

We can use a similar approach for average and mcquitty, see Figure 535:

> hclust_list %>% dendlist(which = c(5,6)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

Figure 534 Tanglegram visualization between ward.D and ward.D2 using thyroid


Figure 535 Tanglegram visualization between average and mcquitty using thyroid

Finally, we plot the dendrograms for all of the methods. This is shown in Figure 536:

> par(mfrow = c(4, 2))

> for(i in seq_along(hclust_methods)) {
    fit <- hclust(dthyroid, method = hclust_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
        as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
        "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3, 3, 3, 7))
    plot(dend, main = hclust_methods[i],
         horiz = TRUE, nodePar = list(cex = 0.007))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }


Figure 536 Hierarchical Agglomerative Cluster Analysis for all methods using thyroid


Technique 54

Agglomerative Nesting

Agglomerative nesting is available in the cluster package using the function agnes:

agnes(x, metric, method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities, whilst method defines the clustering method to be used.

Step 1: Load Required Packages

First we load the required packages. I'll explain each below:

> require(cluster)
> library(colorspace)
> library(dendextend)
> require(corrplot)
> data(thyroid, package = "mclust")

The package cluster contains the agnes function. For color output we use the colorspace package, and dendextend to create fancy dendrograms. The corrplot package is used to visualize correlations. Finally, we use the thyroid data frame contained in the mclust package for our analysis.

We also make use of two user defined functions. The first we call panel.hist, which we will use to plot histograms. It is defined as follows:

> panel.hist <- function(x, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, density = 50, border = "blue")
  }

The second function will be used for calculating the Pearson correlation coefficient. Here is the R code:

> panel.cor <- function(x, y, digits = 3, prefix = "", cex.cor, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y, method = "pearson")
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor)
  }

Step 2: Prepare Data & Tweak Parameters

We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]
> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Let's spend a few moments investigating the data set. We will use visualization to assist us. We begin by looking at the pairwise relationships between the attributes as shown in Figure 541. This plot was built using the pairs method as follows:

> par(oma = c(4, 1, 1, 1))

> pairs(thyroid2, col = diagnosis_col,
        upper.panel = panel.cor, cex.labels = 2, pch = 19, cex = 1.2,
        panel = panel.smooth, diag.panel = panel.hist)

> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1, horiz = TRUE, inset = c(1, 0), bty = "n", xpd = TRUE,
         legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))

> par(xpd = NA)

The top panels (above the diagonal) give the correlation coefficient between the attributes. We see that RT3U is negatively correlated with T4 and T3. It is moderately correlated with TSH and DTSH. We also see that T3 and T4 are negatively correlated with TSH and DTSH, whilst TSH and DTSH are positively correlated.

The diagonal in Figure 541 shows the distribution of each attribute, and the bottom panels are colored by diagnosis using diagnosis_col. Notice that Hypo, Normal and Hyper appear to be distinctly different from each other (as measured by the majority of attributes). However, the three diagnosis types cannot be easily separated if measured by RT3U and DTSH alone.


Figure 541 Pairwise relationships, distributions and correlations for attributes in thyroid

The same conclusion, that the diagnosis types are distinct, can be made by looking at the parallel coordinates plot of the data shown in Figure 542. It can be calculated as follows:

> par(oma = c(4, 1, 1, 1))

> MASS::parcoord(thyroid2, col = diagnosis_col, var.label = TRUE, lwd = 2)


> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1.25, horiz = TRUE, inset = c(1, 0), bty = "n", xpd = TRUE,
         legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))

par(xpd = NA)


Figure 542 Parallel coordinates plot using thyroid

Step 3: Estimate and Assess the Model

The agnes function offers a number of methods for agglomerative nesting. These include "average", "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method) and "weighted". The default is "average"; however, I have had some success with Ward's method in the past, so I usually give it a try first. I also set the parameter stand = TRUE to standardize the attributes.

> fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean", method = "ward")

The print method provides an overview of fit (we only show the first few lines of output below):

> print(fit)
Call: agnes(x = thyroid2, metric = "euclidean", stand = TRUE, method = "ward")

Agglomerative coefficient: 0.9819448

The agglomerative coefficient measures the amount of clustering structure. Higher values indicate more structure. It can also be viewed as the average width of the banner plot shown in Figure 543. A banner plot is calculated using the bannerplot method:

> bannerplot(fit)


Figure 543 Agglomerative nesting bannerplot using thyroid
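Since the agglomerative coefficient is returned in the ac component of every agnes fit, it takes one line to compare how much structure each linkage method recovers before committing to one. A minimal sketch, assuming the thyroid2 data frame prepared above:

> sapply(c("ward", "average", "single", "complete", "weighted"),
         function(m) round(agnes(thyroid2, stand = TRUE,
                                 metric = "euclidean", method = m)$ac, 3))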

Now we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram in dend, followed by ordering of the observations somewhat using the rotate method:

> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k=3 in color_branches, one color for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

Then we match the label colors to the actual classes:

> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:

> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
                        "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 75% of their original size. This assists in making the eventual plot look less cluttered:

> dend <- set(dend, "labels_cex", 0.75)

And now to the visualization using the plot method:

> par(mar = c(3, 3, 3, 7))
> plot(dend,
       main = "Clustered Thyroid data, where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = 0.007))
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))

The result is shown in Figure 544. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.


Figure 544 Agglomerative Nesting dendrogram using thyroid. The label color corresponds to the true diagnosis

Since the agnes function provides multiple methods, in the spirit of empirical predictive analytics let's look at the correlation between the five most popular methods. We assign these methods to agnes_methods:

> agnes_methods <- c("ward", "average", "single", "complete", "weighted")

> agnes_list <- dendlist()

Our main loop for the calculations is as follows:


> for(i in seq_along(agnes_methods)) {
    temp_fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean",
                      method = agnes_methods[i])
    agnes_list <- dendlist(agnes_list, as.dendrogram(temp_fit))
  }

We need to add the names of the methods as a final touch:

> names(agnes_list) <- agnes_methods

Now let's take a look at the output:

> agnes_list
$ward
'dendrogram' with 2 branches and 215 members total, at height 47.01778

$average
'dendrogram' with 2 branches and 215 members total, at height 11.97675

$single
'dendrogram' with 2 branches and 215 members total, at height 5.450811

$complete
'dendrogram' with 2 branches and 215 members total, at height 24.03689

$weighted
'dendrogram' with 2 branches and 215 members total, at height 16.12223


Notice that all methods produce two branches; however, there is considerable variation in the height of the dendrogram, ranging from 5.4 for "single" to 47 for "ward".

We use corrplot with method = "common" to visualize the correlation between the methods:

> corrplot(cor.dendlist(agnes_list, method = "common"),
           "pie", type = "lower")

The resultant plot, shown in Figure 545, gives us another perspective on our clustering algorithms. We can see that most of the methods have around 75% of nodes in common with one another.

Figure 545 Agglomerative Nesting correlation plot using thyroid


Finally, let's plot all five dendrograms. Here is how:

> par(mfrow = c(3, 2))

> for(i in seq_along(agnes_methods)) {
    fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean",
                 method = agnes_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
        as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
        "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3, 3, 3, 7))
    plot(dend, main = agnes_methods[i],
         horiz = TRUE, nodePar = list(cex = 0.007))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }


Wow! That is a nice bit of typing, but the result is well worth it. Take a look at Figure 546. (It still appears that Ward's method works best.)

Figure 546 Agglomerative nesting dendrograms for various methods using thyroid


Technique 55

Divisive Hierarchical Clustering

Divisive hierarchical clustering is available in the cluster package using the function diana:

diana(x, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities.

Step 1: Load Required Packages

First we load the required packages and data. I'll explain each below:

> require(cluster)
> library(colorspace)
> library(dendextend)
> require(circlize)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis (see page 385 for additional details on this data). For color output we use the colorspace package, and dendextend to create fancy dendrograms. The circlize package is used to visualize a circular dendrogram.

Step 2: Prepare Data & Tweak Parameters

We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]


> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model

The model can be fitted as follows:

> fit <- diana(thyroid2, stand = TRUE, metric = "euclidean")

The parameter stand = TRUE is used to standardize the attributes. The parameter metric = "euclidean" is used for calculating dissimilarities; a popular alternative is "manhattan".

Now we can use the bannerplot method:

> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 551


Figure 551 Divisive hierarchical clustering bannerplot using thyroid
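A diana fit stores the analogue of the agglomerative coefficient in its dc component (the divisive coefficient), and the hierarchy can be cut into a fixed number of groups for comparison with the known classes. A minimal sketch, assuming the fit and diagnosis_labels objects created above:

> round(fit$dc, 3)                         # divisive coefficient; closer to 1 means more structure
> groups <- cutree(as.hclust(fit), k = 3)  # convert to hclust form and cut into three groups
> table(diagnosis_labels, groups)          # compare the groups with the true diagnosis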

We should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:

> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k=3 in color_branches, one color for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

We match the label colors to the actual classes:

> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:

> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
                        "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 70% of their original size. This assists in making the eventual plot look less cluttered. Then we circlize the dendrogram using the method circlize_dendrogram:

> dend <- set(dend, "labels_cex", 0.7)
> circlize_dendrogram(dend)
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3), bty = "n")

The result is shown in Figure 552. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.


Figure 552 Divisive hierarchical clustering dendrogram using thyroid. The label color corresponds to the true diagnosis


Technique 56

Exemplar Based Agglomerative Clustering

Exemplar Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages

First we load the required packages. I'll explain each below:

> require(apcluster)
> data(bodyfat, package = "TH.data")

We use the bodyfat data frame contained in the TH.data package for our analysis (see page 62 for additional details on this data).

Step 2: Prepare Data & Tweak Parameters

We store the bodyfat data set in x:

> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model

The model can be fitted as follows:


> fit <- aggExCluster(negDistMat(r = 10), x)

Next we visualize the dendrogram using the plot method:

> plot(fit, showSamples = TRUE)

The result is shown in Figure 561

Figure 561 Agglomerative Clustering dendrogram using bodyfat

Let's look at the relationship between DEXfat and waistcirc for different cluster sizes. To do this we use a while loop, see Figure 562. There are two main branches with possibly two sub-branches each, suggesting 4 or 5 clusters overall.


> i <- 2
> par(mfrow = c(2, 2))
> while (i <= 5) {
    plot(fit, x[c(2,3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
         main = c("Number Clusters", i))
    i <- i + 1
  }

Figure 562 Exemplar Based Agglomerative Clustering - DEXfat and waistcirc for various cluster sizes


Fuzzy Methods


Technique 57

Rousseeuw-Kaufman's Fuzzy Clustering Method

The method outlined by Rousseeuw and Kaufman111 for fuzzy clustering is available in the cluster package using the function fanny:

fanny(x, k, memb.exp, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, memb.exp, the membership exponent used in the fit criterion, and metric, the measure to be used for calculating dissimilarities between observations.

Step 1: Load Required Packages

We use the wine data frame contained in the ordinal package (see page 95). We also load the smacof package, which offers a number of approaches for multidimensional scaling:

> require(cluster)
> require(smacof)
> data(wine, package = "ordinal")

Step 2: Prepare Data & Tweak Parameters

We store the wine data set in data and then remove the wine ratings by specifying data <- data[-2]. One of the advantages of Rousseeuw-Kaufman's fuzzy clustering method over other fuzzy clustering techniques is that it can handle dissimilarity data. Since wine contains ordinal variables, we use the daisy method to create a dissimilarity matrix with metric set to the general dissimilarity coefficient of Gower:112

> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

Step 3: Estimate and Assess the Model

We need to specify the number of clusters. Let's begin with 3 clusters (k=3), setting metric = "gower". We set the parameter memb.exp = 1.1; the higher the value of memb.exp, the more fuzzy the clustering in general.

> fit <- fanny(x, k = 3, memb.exp = 1.1, metric = "gower")

The cluster package provides silhouette plots via the plot method. These can help determine the viability of the clusters. Figure 571 presents the silhouette plot for fit. We see one large cluster containing 36 observations with a width of 0.25, and two smaller clusters, each with 18 observations and an average width of 0.46:

> plot(fit)


Figure 571 Rousseeuw-Kaufmanrsquos fuzzy clustering method silhouette plotusing wine
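Because fanny is a fuzzy method, each wine receives a degree of membership in every cluster rather than a single hard label. A quick way to see this is to print the first few rows of the membership matrix next to the nearest crisp assignment; this is a minimal sketch using the standard membership and clustering components of a fanny object:

> head(round(fit$membership, 2))   # membership degrees for the first six wines
> head(fit$clustering)             # nearest hard cluster for the same wines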

It can be instructive to investigate the relationship between the fuzziness parameter memb.exp and the average silhouette width. To do this we use a small while loop and call fanny with values of memb.exp ranging from 1.1 to 1.8:

> fuzzy <- 1.1
> while (fuzzy < 1.9) {
    temp_fit <- fanny(x, k = 3, memb.exp = fuzzy, metric = "euclidean")
    cat("Fuzz =", fuzzy, "Average Silhouette Width =",
        round(temp_fit$silinfo[[3]], 1), "\n")
    fuzzy <- fuzzy + 0.1
  }

The resultant output is shown below. Notice how the average silhouette width declines as the fuzziness increases:

Fuzz = 1.1 Average Silhouette Width = 0.4
Fuzz = 1.2 Average Silhouette Width = 0.3
Fuzz = 1.3 Average Silhouette Width = 0.3
Fuzz = 1.4 Average Silhouette Width = 0.3
Fuzz = 1.5 Average Silhouette Width = 0.3
Fuzz = 1.6 Average Silhouette Width = 0.3
Fuzz = 1.7 Average Silhouette Width = 0.3
Fuzz = 1.8 Average Silhouette Width = 0.2

We still need to determine the optimal number of clusters. Let's try multidimensional scaling to investigate further. We use the smacofSym method from the package smacof and store the result in scaleT:

> scaleT <- smacofSym(x)$conf

> plot(scaleT, type = "n", main = "Fuzzy = 1.1 with k = 3")

> text(scaleT, label = rownames(scaleT), col = rgb(fit$membership))


Figure 572 Use the smacofSym method to help identify clusters using wine

The result is visualized using the plot method and shown in Figure 572. The image appears to identify four clusters. Even though five official ratings were used to assess the quality of the wine, we will refit using k=4:

> fit <- fanny(x, k = 4, memb.exp = 1.1, metric = "gower")

Finally, we create a silhouette plot:

> plot(fit)

Figure 573 shows the resultant plot. A rule of thumb113 is to choose as the number of clusters the solution whose silhouette plot has the largest average width. Here we see the average silhouette width is now 0.46, compared to 0.36 for k=3.


Figure 573 Rousseeuw-Kaufmanrsquos method silhouette plot with four clustersusing wine


Technique 58

Fuzzy K-Means

Fuzzy K-means is available in the fclust package using the function FKM

FKM(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, m, the fuzziness parameter, and RS, the number of random starts.

Step 1: Load Required Packages

We use the Vehicle data frame contained in the mlbench package (see page 23):

> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

We store the Vehicle data set in x and remove the vehicle types stored in column 19:

> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model

Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:


> fit3 <- FKM(x, k = 3, m = 1.5, RS = 1, stand = 1)

We set the fuzziness parameter m to 1.5 and, for illustrative purposes only, set RS = 1. In practice you will want to set a higher number. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:

> fit4 <- FKM(x, k = 4, m = 1.5, RS = 1, stand = 1)
> fit5 <- FKM(x, k = 5, m = 1.5, RS = 1, stand = 1)
> fit6 <- FKM(x, k = 6, m = 1.5, RS = 1, stand = 1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:

> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index call:

> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.76 0.43 0.64 0.44 0.52 0.65

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.66 0.64 0.54 0.37 0.47 0.99

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.59 0.78 0.49 0.32 0.41 1.02

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.52 0.93 0.42 0.26 0.35 1.97
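Rather than scanning four separate printouts, the index vectors can be stacked into a single small table with one row per candidate number of clusters. A minimal sketch using only base R on the vectors computed above:

> idx <- rbind("k=3" = f3, "k=4" = f4, "k=5" = f5, "k=6" = f6)
> round(idx, 2)    # compare each index column down the rows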


Overall, K = 3 is the optimum choice for the majority of indices.

It is fun to visualize clusters. Let's focus on fit3 and on the two important variables Comp and ScatRa. It can also be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1,7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axis:

> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1,7))
> plot(fit3, v1v2 = c(1,7), pca = TRUE)

Figure 581 shows both plots. There appears to be reasonable separation between the clusters.


Figure 581 Fuzzy K-Means clusters with k=3 using Vehicle


Technique 59

Fuzzy K-Medoids

Fuzzy K-medoids is available in the fclust package using the function FKM.med:

FKM.med(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, m, the fuzziness parameter, and RS, the number of random starts. Note that the difference between fuzzy K-means and fuzzy K-medoids is that in fuzzy K-means the cluster prototypes (centroids) are artificial points computed from the data, whereas in fuzzy K-medoids the cluster prototypes (medoids) are a subset of the actual observed objects.

Step 1: Load Required Packages

We use the Vehicle data frame contained in the mlbench package (see page 23):

> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

We store the Vehicle data set in x and remove the vehicle types stored in column 19:

> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model

Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM.med(x, k = 3, m = 1.5, RS = 10, stand = 1)

We set the fuzziness parameter m to 1.5 and RS = 10. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:

> fit4 <- FKM.med(x, k = 4, m = 1.5, RS = 10, stand = 1)
> fit5 <- FKM.med(x, k = 5, m = 1.5, RS = 10, stand = 1)
> fit6 <- FKM.med(x, k = 6, m = 1.5, RS = 10, stand = 1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:

> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index call:

> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.37 1.04 0.06 0.13 0.15 6.96

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.29 1.30 0.06 0.09 0.11 6.05

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.24 1.51 0.05 0.07 0.11 8.63

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.21 1.68 0.05 0.07 0.08 7.90

Overall, K=3 is the optimum choice for the majority of indices.

Now we use fit3 and focus on the two important variables Comp and ScatRa. It can also be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1,7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axis:

> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1,7))
> plot(fit3, v1v2 = c(1,7), pca = TRUE)

Figure 591 shows both plots


Figure 591 Fuzzy K-Medoids clusters with k=3 using Vehicle


Other Methods


Technique 60

Density-Based Cluster Analysis

In density based cluster analysis two parameters are important: eps defines the radius of the neighborhood of each point, and MinPts is the minimum number of points within a radius (or "reach") associated with the clusters of interest. For our analysis we will use the dbscan function in the fpc package:

dbscan(data, eps, MinPts)

Key parameters include data, the data matrix, data frame or dissimilarity matrix.

Step 1: Load Required Packages

First we load the required packages:

> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis (see page 385 for additional details on this data).

Step 2: Estimate and Assess the Model

We estimate the model using the dbscan method. Note that in practice MinPts is chosen by trial and error. We set MinPts = 4 and eps = 85. We also set showplot = 1, which will allow you to see visually how dbscan works:

> fit <- dbscan(thyroid[-1], eps = 85, MinPts = 4, showplot = 1)
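A common way to sanity-check the eps value is to look at the sorted distances from each observation to its MinPts-th nearest neighbour: a pronounced knee in this curve suggests a sensible radius. The sketch below uses only base R (dist, apply, sort), so it makes no assumptions beyond the thyroid attributes already loaded:

> d <- as.matrix(dist(thyroid[-1]))
> knn4 <- apply(d, 1, function(row) sort(row)[5])  # 5th smallest, because element 1 is the point itself
> plot(sort(knn4), type = "l",
       ylab = "Distance to 4th nearest neighbour")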


Next we take a peek at the fitted model:

> fit
dbscan Pts=215 MinPts=4 eps=85
        0   1 2 3
border 18   3 3 2
seed    0 185 1 3
total  18 188 4 5

Unlike other approaches to cluster analysis, the density based clustering algorithm can identify outliers (data points that do not belong to any cluster). Points in cluster 0 are unassigned outliers - 18 in total.

The pairwise plot of clusters shown in Figure 601 is calculated as follows:

> plot(fit, thyroid)


Figure 601 Density-Based Cluster Analysis pairwise cluster plot using thyroid

Let's take a closer look at the pairwise cluster plot between the attributes T4 (column 3 in thyroid) and RT3U (column 2 in thyroid). To do this we use the plot method, see Figure 602:

> plot(fit, thyroid[c(3,2)])


Figure 602 Density-Based Cluster Analysis pairwise plot of T4 and RT3U


Technique 61

K-Modes Clustering

K-modes clustering is useful for finding clusters in categorical data. It is available in the klaR package using the function kmodes:

kmodes(x, k)

Key parameters include x, the data set of categorical attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages

We use the housing data frame contained in the MASS package. We also load the plyr package, which we will use to map attributes in housing to numerical categories:

> require(klaR)
> data(housing, package = "MASS")
> library(plyr)


NOTE

The housing data frame contains data from an investigation of satisfaction with housing conditions in Copenhagen, carried out by the Danish Building Research Institute and the Danish Institute of Mental Health Research.114 It contains 72 rows with measurements on the following 5 variables:

• Sat: satisfaction of householders with their present housing circumstances (High, Medium or Low; ordered factor).

• Infl: perceived degree of influence householders have on the management of the property (High, Medium, Low).

• Type: type of rental accommodation (Tower, Atrium, Apartment, Terrace).

• Cont: contact residents are afforded with other residents (Low, High).

• Freq: the number of residents in each class.

Step 2: Prepare Data & Tweak Parameters

We store the housing data set in x. We then convert the attributes into numerical values using the mapvalues and as.numeric methods:

> set.seed(98765)
> x <- housing[-5]

> x$Sat <- mapvalues(x$Sat, from = c("Low", "Medium", "High"), to = c(1, 2, 3))
> x$Sat <- as.numeric(x$Sat)

> x$Infl <- mapvalues(x$Infl, from = c("Low", "Medium", "High"), to = c(1, 2, 3))
> x$Infl <- as.numeric(x$Infl)

> x$Type <- mapvalues(x$Type, from = c("Tower", "Apartment", "Atrium", "Terrace"),
                      to = c(1, 2, 3, 4))
> x$Type <- as.numeric(x$Type)

> x$Cont <- mapvalues(x$Cont, from = c("Low", "High"), to = c(1, 3))
> x$Cont <- as.numeric(x$Cont)

Step 3: Estimate and Assess the Model

Suppose we believe the actual number of clusters is 3. We estimate the model and plot the result (see Figure 611) as follows:

> fit <- kmodes(x, 3)
> plot(x, col = fit$cluster)


Figure 611 K-Modes Clustering with k = 3 using housing
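For categorical data the fitted modes are often easier to interpret than a scatter plot. The sketch below assumes the standard modes, size and cluster components returned by kmodes:

> fit$modes                        # modal value of each attribute, one row per cluster
> fit$size                         # number of rows assigned to each cluster
> table(housing$Sat, fit$cluster)  # how the clusters line up with reported satisfaction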

It is also possible to narrow down our focus. For example, let's investigate the pairwise relationship between Type and Sat. To do this we use the plot method with the jitter method, which adds a small amount of noise to the attributes. The result is shown in Figure 612:

> plot(jitter(as.matrix(x[c(1,3)])), col = fit$cluster)

> points(fit$modes, col = 1:5, pch = 8, cex = 4)


Figure 612 K-Modes Clustering pairwise relationship between Type and Sat


Technique 62

Model-Based Clustering

Model based clustering identifies clusters using the Bayesian information criterion (BIC) for Expectation-Maximization initialized by hierarchical clustering for Gaussian mixture models. It is available in the mclust package using the function Mclust:

Mclust(x, G)

Key parameters include x, the data set of attributes for which you wish to find clusters, and G, the minimum and maximum expected number of clusters.

Step 1: Load Required Packages

We use the thyroid data frame (see page 23):

> require(mclust)
> data(thyroid)

Step 2: Prepare Data & Tweak Parameters

We store the thyroid data set in x and remove the first column, which contains the Diagnosis values:

> set.seed(98765)
> x <- thyroid[-1]

Step 3: Estimate and Assess the Model

Suppose we believe the number of clusters lies between 1 and 20. We can use the Mclust method to find the optimal number as follows:


> fit <- Mclust(as.matrix(x), G = 1:20)

NOTE

Notice that for multivariate data the Mclust method computes the optimum Bayesian information criterion for each of the following Gaussian mixture models:

1. EII = spherical, equal volume

2. VII = spherical, unequal volume

3. EEI = diagonal, equal volume and shape

4. VEI = diagonal, varying volume, equal shape

5. EVI = diagonal, equal volume, varying shape

6. VVI = diagonal, varying volume and shape

7. EEE = ellipsoidal, equal volume, shape and orientation

8. EVE = ellipsoidal, equal volume and orientation

9. VEE = ellipsoidal, equal shape and orientation

10. VVE = ellipsoidal, equal orientation

11. EEV = ellipsoidal, equal volume and equal shape

12. VEV = ellipsoidal, equal shape

13. EVV = ellipsoidal, equal volume

14. VVV = ellipsoidal, varying volume, shape and orientation

Let's take a look at the result:

> fit
'Mclust' model object:
 best model: diagonal, varying volume and shape (VVI) with 3 components

The optimal number of clusters using all 14 Gaussian mixture models appears to be 3. Here is another way to access the optimum number of clusters:

> k <- dim(fit$z)[2]
> k
[1] 3
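Because the thyroid data comes with a clinical diagnosis, it is natural to check how well the three mixture components recover the three known classes. A one-line sketch using the classification component of the fitted Mclust object (the counts you see will depend on the fit above):

> table(thyroid$Diagnosis, fit$classification)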

Since BIC values from all 14 models are used for choosing the number of clusters, it can be instructive to plot the result by individual model. The plot method with what = "BIC" achieves this. The result is shown in Figure 621. Notice most models reach a peak around 3 clusters:

> plot(fit, what = "BIC")

Figure 621 Model-Based Clustering BIC by model for thyroid


We can also visualize the fitted model by setting what = "uncertainty" (see Figure 622), what = "density" (see Figure 623) and what = "classification" (see Figure 624).

Figure 622 Model-Based Clustering plot with what = "uncertainty" using thyroid


Figure 623 Model-Based Clustering plot with what = "density" using thyroid


Figure 624 Model-Based Clustering plot with what = "classification" using thyroid

Finally, let's look at the uncertainty and classification pairwise plots using T3 and T4, see Figure 625:

> par(mfrow = c(1, 2))

> plot(fit, x[, c(2,3)], what = "uncertainty")

> plot(fit, x[, c(2,3)], what = "classification")


Figure 625 Model based uncertainty and classification pairwise plots using T3 and T4


Technique 63

Clustering of Binary Variables

Clustering of Binary Variables is available in the cluster package using the function mona:

mona(x)

Key parameters include x, the data set of binary attributes for which you wish to find clusters.

Step 1: Load Required Packages

We use the locust data frame from the package bild:

> require(cluster)
> data(locust, package = "bild")


NOTE

The locust data frame contains data on the effect of hunger on the locomotory behaviour of 24 locusts observed at 161 time points. It contains 3864 observations on the following attributes:

1. id: a numeric vector that identifies the number of the individual profile.

2. move: a numeric vector representing the response variable.

3. sex: a factor with levels 1 for male and 0 for female.

4. time: a numeric vector that identifies the number of the time points observed. The time vector considered was obtained by dividing (1:161) by 120 (the number of observed periods in 1 hour).

5. feed: a factor with levels 0 for no and 1 for yes.

Step 2: Prepare Data & Tweak Parameters

Because the locust data has a time dimension, our first task is to transform the data into a format that can be used by mona. First, let's create a variable x that will hold the binary attributes:

> n <- nrow(locust)
> x <- seq(1, 120)
> dim(x) <- c(24, 5)
> colnames(x) <- colnames(locust)

We create a binary data set where we assign 1 if a move took place by the locust within the first 20 time points, and 0 otherwise. We do likewise with feeding. We begin by assigning zero to the move and feed attributes (i.e. x[i,2] <- 0 and x[i,5] <- 0). Note that columns 1 to 5 of locust contain id, move, sex, time and feed respectively:

count <- 0
k <- 1
for(i in seq_along(seq(1, 24))) {
  x[i,2] <- 0
  x[i,5] <- 0
  for (k in seq(1, 161)) {
    count <- count + 1
    if (k == 1) {
      x[i,1] <- locust[count, 1]
      x[i,3] <- locust[count, 3]
      x[i,4] <- locust[count, 4]
    }
    if (locust[count, 2] == 1 && k < 20) x[i,2] <- locust[count, 2]
    if (locust[count, 5] == 1 && k < 20) x[i,5] <- locust[count, 5]
    k <- k + 1
  }
  k <- 1
}

Finally, we remove the id and time attributes:

> x <- x[, -1]
> x <- x[, -3]

As a check, the contents of x should look similar to this:

> head(x, 6)
     move sex feed
[1,]    0   2    2
[2,]    1   1    2
[3,]    0   2    2
[4,]    1   1    2
[5,]    0   2    2
[6,]    0   1    2


Step 3: Estimate and Assess the Model

Here is how to estimate the model and print out the associated bannerplot, see Figure 631:

> fit <- mona(x)
> plot(fit)

Figure 631 Clustering of Binary Variables bannerplot using locust


Technique 64

Affinity Propagation Clustering

Affinity propagation takes a given similarity matrix and simultaneously considers all data points as potential exemplars. Real-valued messages are then exchanged between data points until a high-quality set of exemplars and corresponding clusters emerges. It is available in the apcluster package using the function apcluster:

apcluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages

We use the Vehicle data frame from the package mlbench (see page 23 for additional details on this data set):

> require(apcluster)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

We store the sample in the variable x and remove the vehicle type (Class) attribute:

> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model

We fit the model as follows, where r refers to the number of columns in x:

> fit <- apcluster(negDistMat(r = 18), x)

The model determines the number of clusters automatically. To see the actual number, use:

> length(fit@clusters)
[1] 5
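Each cluster is represented by an exemplar, an actual vehicle chosen to stand for its group. A minimal sketch for poking around the fitted APResult object; it assumes the standard exemplars and clusters slots of apcluster results:

> fit@exemplars                   # row indices of the exemplar vehicles
> sapply(fit@clusters, length)    # number of observations in each cluster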

So for this data set the algorithm identifies five clusters. We can visualize the results using the plot method, as shown in Figure 641. Note we only plot 15 attributes:

> plot(fit, x[, 1:15])

Finally, here is how to zoom in on a specific pairwise relationship. We choose Circ and Sc.Var.Maxis, as shown in Figure 642:

> plot(fit, x[c(2, 11)])


Figure 641 Affinity Propagation pairwise cluster plot using Vehicle


Figure 642 Affinity Propagation pairwise cluster plot between Circ and Sc.Var.Maxis


Technique 65

Exemplar-Based Agglomerative Clustering

Exemplar-Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages

We use the bodyfat data frame from the package TH.data (see page 62 for additional details on this data set):

> require(apcluster)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We store the sample data in the variable x:

> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model

We fit the model using a two-step approach. First we use affinity propagation to determine the number of clusters via the apcluster method. For additional details on affinity propagation and apcluster see page 455.


> fit1 <- apcluster(negDistMat(r = 10), x)

Here are some of the details of the fitted model. It appears to have four clusters:

> fit1

APResult object

Number of samples    =  71
Number of iterations =  152
Input preference     =  -6.482591e+14
Sum of similarities  =  -4.886508e+14
Sum of preferences   =  -2.593037e+15
Net similarity       =  -3.081687e+15
Number of clusters   =  4

Next we create a hierarchy of the four clusters using exemplar-based agglomerative clustering via the aggExCluster method:

> fit <- aggExCluster(x = fit1)

We can plot the resultant dendrogram as shown in Figure 651:

> plot(fit, showSamples = FALSE)


Figure 651 Exemplar-Based Agglomerative Clustering dendrogram using bodyfat

Finally, we zoom in on the pairwise cluster plots between DEXfat and waistcirc, see page 462. To do this we use a small while loop as follows:

> i <- 2
> par(mfrow = c(2, 2))
> while (i <= 5) {
    plot(fit, x[c(2,3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
         main = c("Number Clusters", i))
    i <- i + 1
  }


Figure 652 Exemplar-Based Agglomerative Clustering relationship between DEXfat and waistcirc


Technique 66

Bagged Clustering

During bagged clustering a partitioning cluster algorithm such as k-means is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using hierarchical agglomerative cluster analysis (see page 384). The approach is available in the e1071 package using the function bclust:

bclust(x, centers, base.centers, dist.method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, centers, the number of clusters, base.centers, the number of centers used in each repetition, and dist.method, the distance method used for hierarchical clustering.

Step 1: Load Required Packages

We use the thyroid data frame from the package mclust (see page 385 for additional details on this data set):

> require(e1071)
> data(thyroid, package = "mclust")

Step 2: Prepare Data & Tweak Parameters

We store the standardized sample data in the variable x:

> x <- thyroid[-1]
> x <- scale(x)


Step 3: Estimate and Assess the Model

We fit the model using the bclust method with 3 centers:

> fit <- bclust(x, centers = 3, base.centers = 5, dist.method = "manhattan")

We can use the plot method to visualize the dendrogram. Figure 661 shows the result:

> plot(fit)
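The hierarchical combination step also produces a hard cluster assignment for every observation, which we can compare with the known diagnosis. A minimal sketch, assuming the cluster component of the fitted bclust object holds these assignments:

> table(thyroid$Diagnosis, fit$cluster)   # diagnosis versus bagged cluster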


Figure 661 Bagged Clustering dendrogram (top) and scree style plot (bottom). Gray line is slope of black line

We can also view a box plot of fit as follows, see Figure 662:

> boxplot(fit)


Figure 662 Bagged Clustering boxplots

We can also view boxplots by attribute, as shown in Figure 663:

> boxplot(fit, bycluster = FALSE)


Figure 663 Bagged Clustering boxplots by attribute


Notes

89. Musmeci, Nicolò, Tomaso Aste and Tiziana Di Matteo. Relation between financial market structure and the real economy: comparison between clustering methods (2015): e0116201.

90. The Industry Classification Benchmark (ICB) is a definitive system categorizing over 70,000 companies and 75,000 securities worldwide, enabling the comparison of companies across four levels of classification and national boundaries. For more information see http://www.icbenchmark.com

91. Since correlations between stocks and stock market sectors change over time, the time period over which the analysis is carried out may also impact the number of clusters recovered by a specific technique.

92. Takahashi, Nobuaki, Makio Kashino and Naoyuki Hironaka. Structure of rat ultrasonic vocalizations and its relevance to behavior. PloS one 5.11 (2010): e14115.

93. Wright JM, Gourdon JC, Clarke PB (2010). Identification of multiple call categories within the rich repertoire of adult rat 50-kHz ultrasonic vocalizations: effects of amphetamine and social context. Psychopharmacology 211: 1–13.

94. Kim, Tae-Bum, et al. Identification of asthma clusters in two independent Korean adult asthma cohorts. The European Respiratory Journal 41.6 (2013): 1308-1314.

95. FEV1 is the volume exhaled during the first second of a forced expiratory activity.

96. In recent years there has been growing interest in using perfusion imaging in the initial diagnosis and management of many conditions. Magnetic resonance imaging (MRI) uses a powerful magnetic field, radio frequency pulses and a computer to produce high resolution images of organs, soft tissues and bone.

97. See for example:

1. Karonen JO, Liu Y, Vanninen RL, et al. Combined perfusion- and diffusion-weighted MR imaging in acute ischemic stroke during the 1st week: a longitudinal study. Radiology 2000;217:886–894.

2. Rother J, Jonetz-Mentzel L, Fiala A, et al. Hemodynamic assessment of acute stroke using dynamic single-slice computed tomographic perfusion imaging. Arch Neurol 2000;57:1161–1166.

3. Nabavi DG, Cenic A, Henderson S, Gelb AW, Lee TY. Perfusion mapping using computed tomography allows accurate prediction of cerebral infarction in experimental brain ischemia. Stroke 2001;32:175–183.

4. Eastwood JD, Lev MH, Azhari T, et al. CT perfusion scanning with deconvolution analysis: pilot study in patients with acute MCA stroke. Radiology.

5. Kealey SM, Loving VA, Delong DM, Eastwood JD. User-defined vascular input function curves: influence on mean perfusion parameter values and signal-to-noise ratio. Radiology 2004;231:587–593.

98The computation of these parametric maps is often performed as a semi automated pro-cess The observer selects an appropriate artery to represent the arterial input function anda vein to represent the venous function From these arterial and venous timendashattenuationcurves the observer then determines the pre and post enhancement cutoff values for thecalculation of the perfusion parameters During data acquisition it is essential to selectslice locations in the brain that contain a major intracranial artery to represent the AIF


99. Yin, Jiandong, et al. "Comparison of K-Means and fuzzy c-Means algorithm performance for automated determination of the arterial input function." PLoS ONE 9.2 (2014): e85884.

100. Murase K, Kikuchi K, Miki H, Shimizu T, Ikezoe J (2001). Determination of arterial input function using fuzzy clustering for quantification of cerebral blood flow with dynamic susceptibility contrast-enhanced MR imaging. J Magn Reson Imaging 13: 797–806.

101. See Mardia et al. (1979). Multivariate Analysis. Academic Press.

102. For example see:

1. Rudman P (1967). The causes of natural durability in timber. Pt. XXI: The antitermitic activity of some fatty acids, esters and alcohols. Holzforschung 21(1): 24.

2. Rudman P, Gay F (1963). Causes of natural durability in timber. X: Deterrent properties of some three-ringed carboxylic and heterocyclic substances to the subterranean termite Nasutitermes exitiosus. CSIRO Div. Forest Prod., Melbourne. Holzforschung 17: 21–25.

3. Lukmandaru G, Takahashi K (2008). Variation in the natural termite resistance of teak (Tectona grandis Linn. fil.) wood as a function of tree age. Annals of Forest Science 65: 708–708.

103. See for example:

1. Lenz M (1994). Nourishment and evolution of insect societies. Westview Press, Boulder and Oxford and IBH Publ., New Delhi; chapter: Food resources, colony growth and caste development in wood feeding termites, 159–209.

2. Evans TA, Lai JCS, Toledano E, McDowall L, Rakotonarivo S, et al. (2005). Termites assess wood size by using vibration signals. Proceedings of the National Academy of Science 102(10): 3732–3737.

3. Evans TA, Inta R, Lai JCS, Prueger S, Foo NW, et al. (2009). Termites eavesdrop to avoid competitors. Proceedings of the Royal Society B: Biological Sciences 276(1675): 4035–4041.

104. Oberst, Sebastian, Theodore A. Evans, and Joseph C.S. Lai. "Novel Method for Pairing Wood Samples in Choice Tests." (2014): e88835.

105. Todeschini, Roberto, and Viviana Consonni. Handbook of Molecular Descriptors. Vol. 11. John Wiley & Sons, 2008.

106. Dehmer, Matthias, Frank Emmert-Streib, and Shailesh Tripathi. "Large-scale evaluation of molecular descriptors by means of clustering." (2013): e83956.

107. For more details see:

1. Dehmer M, Varmuza K, Borgert S, Emmert-Streib F (2009). On entropy-based molecular descriptors: Statistical analysis of real and synthetic chemical structures. Journal of Chemical Information and Modeling 49: 1655–1663.

2. Dehmer M, Grabner M, Varmuza K (2012). Information indices with high discriminative power for graphs. PLoS ONE 7: e31214.

108. For further details see Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27: 857–874.

109. Further details at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease/new-thyroid.names


110. For further details see Murtagh, Fionn, and Pierre Legendre. "Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion?" Journal of Classification 31.3 (2014): 274-295.

111. Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Vol. 344. John Wiley & Sons, 2009.

112. Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27: 857–874.

113. For further details see Struyf, Anja, Mia Hubert, and Peter J. Rousseeuw. "Integrating robust clustering techniques in S-PLUS." Computational Statistics & Data Analysis 26.1 (1997): 17-37.

114. For further details see Madsen, M. (1976). Statistical analysis of multiple contingency tables: Two examples. Scand J Statist 3: 97–106.

Part VII

Boosting


The Basic Idea

Boosting is a powerful supervised classification learning concept. It combines the performance of many "weak" classifiers to produce a powerful committee. A weak classifier is only required to be better than chance. It can therefore be very simple and computationally inexpensive. The basic idea is to iteratively apply simple classifiers and to combine their solutions to obtain a better prediction result. If a classifier misclassifies some data, train another copy of it mainly on this misclassified part, with the hope that it will develop a better solution. Thus the algorithm increasingly focuses on new strategies for classifying difficult observations.

Here is how it works in a nutshell:

1. A boosting algorithm manipulates the underlying training data by iteratively re-weighting the observations, so that at every iteration the classifier finds a new solution from the data.

2. Higher accuracy is achieved by increasing the importance of "difficult" observations, so that observations that were misclassified receive higher weights. This forces the algorithm to focus on those observations that are increasingly difficult to classify.

3. In the final step all previous results of the classifier are combined into a prediction committee, where the weights of better performing solutions are increased via an iteration-specific coefficient.

4. The resulting weighted majority vote selects the class most often chosen by the classifier, while taking the error rate in each iteration into account.

The Power of the Boost
Boosting is one of the best techniques in the data scientist's toolkit. Why? Because it often yields the best predictive models and can often be relied upon to produce satisfactory results. Academic and applied studies have reported similar findings. Here we cite just two:

• Bauer and Kohavi115 performed an extensive comparison of boosting with several other competitors on 14 data-sets. They found boosting outperformed all other algorithms. They concluded: "For learning tasks where comprehensibility is not crucial, voting methods are extremely useful, and we expect to see them used significantly more than they are today."

• Friedman, Hastie and Tibshirani116, using eight data-sets, compare a range of boosting techniques with the very popular classification and regression tree. They find all the boosting methods outperform it.

NOTE

During boosting the target function is optimized with some implicit penalization, whose impact varies with the number of boosting iterations. The lack of an explicit penalty term in the target function is the main difference between boosting and other popular penalization methods such as the Lasso117.


Practical Applications

NOTE

The first boosting technique was developed by Robert Schapire118, working out of the MIT Laboratory for Computer Science. His research article "The Strength of Weak Learnability" showed that a weak base classifier can always improve its performance by training two additional classifiers on filtered versions of the classification data stream. The author observes: "A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well." He was right: boosting is now used in a wide range of practical applications.

Reverberation Suppression
Reverberation suppression is a critical problem in sonar communications. This is because, as an acoustic signal is radiated, degradation occurs due to reflection from the surface, the bottom and via the volume of water. Cheepurupalli et al.119 study the propagation of sound in water.

One popular solution is to use the empirical mode decomposition (EMD) algorithm120 as a filtering technique. A noise corrupted signal is applied to EMD and intrinsic mode functions are generated. The key is to separate the signal from the noise. It turns out that noisy intrinsic mode functions are high frequency component signals, whilst signal-led intrinsic mode functions are low frequency component signals. The selection of the appropriate intrinsic mode functions which are used for signal reconstruction is often done manually121.

The researchers use Ada Boost to automatically classify "noise" and "signal" intrinsic mode functions over signal to noise ratios ranging from −10 dB to 10 dB.

The results were very encouraging, as they found that combining Ada Boost with EMD increases the likelihood of correct detection. The researchers conclude "that the reconstruction of the chirp signal even at low input SNR [signal to noise] conditions is achieved with the use of Ada Boost based EMD as a de-noising technique".

Cardiotocography
Fetal state (normal, suspect, pathologic) is assessed by Karabulut and Ibrikci122 using Ada Boost combined with six individual machine learning algorithms (Naive Bayes, Radial Basis Function Network, Bayesian Network, Support Vector Machine, Neural Network and C4.5 Decision Tree).

The researchers use the publicly available cardiotocography data set123, which includes a total of 2126 samples, of which the fetal state is normal in 1655 cases, suspicious in 295 and pathologic in 176. The estimated Ada Boost confusion matrix is shown in Table 23. Note that the error rate from this table is around 5%.

                             Predicted values
                    Normal  Suspicious  Pathologic
Actual  Normal        1622          26           7
values  Suspicious      55         236           4
        Pathologic       7           7         162

Table 23: Confusion matrix of Karabulut and Ibrikci

Stock Market Trading
Creamer and Freund124 develop a multi-stock automated trading system that relies in part on model based boosting. Their trading system consists of a logitboost algorithm, an expert weighting algorithm and a risk management overlay.

Their sample consisted of 100 stocks chosen at random from the S&P 500 index, using daily data over a four year period. The researchers used 540 trading days for the training set, 100 trading days for the validation set and 411 trading days for the test set.

Four variants of their trading system are compared to a buy and hold strategy. This is a common investment strategy and involves buying a portfolio of stocks and holding them without selling for the duration of the investment period. Transaction costs are assumed to vary from $0 to $0.003. Creamer and Freund report all four variants of their trading system outperformed the buy and hold strategy.

Vehicle Logo Recognition
Sam et al.125 consider the problem of automatic vehicle logo recognition. The researchers select the modest adaboost algorithm. A total of 500 vehicle images were used in the training procedure. The detection of the vehicle logo was carried out by sliding a sub-window across the image at multiple scales and locations. A total of 200 images were used in the test set, with 184 images recognized successfully, see Table 24. This implies a misclassification error rate of around 8%.

Manufacturer   Correct  Mistaken
Audi                20         0
BMW                 16         4
Honda               19         1
KIA                 17         3
Mazda               18         2
Mitsubishi          20         0
Nissan              17         3
Suzuki              18         2
Toyota              19         1
Volkswagen          20         0

Total              184        16

Table 24: Logo recognition rate reported by Sam et al.

Basketball Player Detection
Markoski et al.126 investigate player detection during basketball games using the gentle adaboost algorithm. A total of 6000 positive examples that contain the player's entire body and 6000 positive examples of players' upper body only were used to train the algorithm.

The researchers observe that for images containing the player's whole body the algorithm was unable to reduce the level of false positives below 50%. In other words, flipping a coin would have been more accurate. However, a set of testing images using players' upper body obtained an accuracy of 70.5%. The researchers observe that the application of the gentle adaboost algorithm for detecting players' upper body results in a relatively large number of false positive observations.

Magnetoencephalography
Magnetoencephalography (MEG) is used to study how stimulus features are processed in the human brain. Takiguchi et al.127 use an Ada Boost algorithm to find the sensor area contributing to the accurate discrimination of vowels.

Four volunteers with normal hearing were recruited for the experiment. Two distinct Japanese sounds were delivered to the subject's right ear. On hearing the sound the volunteer was asked to press a reaction key. MEG amplitudes were measured from 61 pairs of sensors128.

The Ada Boost algorithm was applied to every latency range, with the classification decision made at each time instance. The researchers observe that the Ada Boost classification accuracy first increased as a function of time, reaching a maximum value of 91.0% in the latency range between 50 and 150 ms. These results outperformed their pre-stimulus baseline.


Binary Boosting

NOTE

A weak classifier h(x) is slightly better than random chance (50%) at guessing which class an object belongs in. A strong classifier has a high probability (>95%) of choosing correctly. Decision trees (see Part I) are often used as the basis for weak classifiers.

Classical binary boosting is founded on the discrete Ada Boost algorithm, in which a sequentially generated, weighted set of weak base classifiers are combined to form an overall strong classifier. Ada Boost was the first adaptive boosting algorithm. It automatically adjusts its parameters to the data based on actual performance at each iteration.

How Ada Boost Works
Given a set of training feature vectors xi (i = 1, 2, ..., N) and a target which represents the binary class yi ∈ {−1, +1}, the algorithm attempts to find the optimal classification by making the individual error εm at iteration m as small as possible, given an iteration-specific distribution of weights on the features.

Incorrectly classified observations receive more weight in the next iteration. Correctly classified observations receive less weight in the next iteration. The weights at iteration m are calculated using the iteration-specific learning coefficient αm multiplied by a classification function η(hm(x)).

As the algorithm iterates it focuses more weight on misclassified objects and attempts to correctly classify them in the next iteration. In this way classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight. The procedure is repeated until a predefined performance requirement is satisfied.
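To make the weight-update mechanics concrete, here is a minimal hand-rolled sketch of discrete Ada Boost built from rpart stumps on the Sonar data used later in this chapter. It is illustrative only (the ada package used in the following techniques handles all of this internally), and the object names are our own.

library(rpart)
library(mlbench)
data(Sonar)

y <- ifelse(Sonar$Class == "M", 1, -1)   # recode the two classes as +1 / -1
w <- rep(1 / nrow(Sonar), nrow(Sonar))   # start with equal observation weights
M <- 10
alpha <- numeric(M)
votes <- matrix(0, nrow(Sonar), M)

for (m in 1:M) {
  # weak learner: a single-split tree fitted under the current weights
  stump <- rpart(Class ~ ., data = Sonar, weights = w,
                 control = rpart.control(maxdepth = 1, cp = -1, minsplit = 0))
  h <- ifelse(predict(stump, Sonar, type = "class") == "M", 1, -1)
  eps <- sum(w * (h != y))               # weighted error of this classifier
  alpha[m] <- 0.5 * log((1 - eps) / eps) # iteration-specific coefficient
  w <- w * exp(-alpha[m] * y * h)        # up-weight misclassified observations
  w <- w / sum(w)
  votes[, m] <- alpha[m] * h
}

pred <- sign(rowSums(votes))  # weighted majority vote of the committee
mean(pred == y)               # training accuracy of the committee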


NOTE

Differences in the nature of classical boosting algorithms are often driven by the functional form of αm and η(hm(x)). For example:

1. Ada Boost.M1: αm = 0.5 log((1 − εm)/εm) with η(x) = sign(x).

2. Real Ada Boost and Real L2 set η(p) = log(p/(1 − p)), p ∈ [0, 1].

3. Gentle Ada Boost and Gentle L2 Boost: η(x) = x.

4. Discrete L2, Real L2 and Gentle L2 use a logistic loss function, as opposed to the exponential loss function used by Real Ada Boost, Ada Boost.M1 and Gentle Ada Boost.

Technique 67

Ada Boost.M1

The Ada Boost.M1 algorithm (also known as Discrete Ada Boost) uses discrete boosting with an exponential loss function129. In this algorithm η(x) = sign(x) and αm = 0.5 log((1 − εm)/εm). It can be run using the package ada with the ada function:

ada(z ~ ., data = , iter = , loss = "e", type = "discrete", ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

NOTE

Ada Boost is often used with simple decision trees as weak base classifiers. This can result in significantly improved performance over a single decision tree130. The package ada creates a classification model as an ensemble of rpart trees. It therefore uses the rpart library as its engine (see page 4 and page 272).

Step 1: Load Required Packages
We build our Ada Boost.M1 model using Sonar, a data frame in the mlbench package.

> library(ada)
> library(mlbench)
> data(Sonar)


NOTE

Sonar contains 208 observations on sonar returns collected from a metal cylinder and a cylindrically shaped rock positioned on a sandy ocean floor131. Returns were collected at a range of 10 meters and obtained from the cylinder at aspect angles spanning 90° and from the rock at aspect angles spanning 180°. In total 61 variables were collected, all numerical with the exception of the sonar return label (111 cylinder returns and 97 rock returns), which is nominal. The label associated with each record is the letter R if the object is a rock and M if it is metal.

Step 2: Prepare Data & Tweak Parameters
To view the classification for the 95th to 105th record type:

> Sonar$Class[95:105]
 [1] R R R M M M M M M M M
Levels: M R

The 95th to 97th observations are labeled R for rock, whilst the 98th to 105th observations are labeled M for metal. The ada package uses the rpart function to generate decision trees. So we need to supply an rpart.control object to the model:

> default <- rpart.control(cp = -1, maxdepth = 2,
                           minsplit = 0)

PRACTITIONER TIP

The maxdepth parameter is raised to the power of 2. In the code that follows we set maxdepth = 2, which is equivalent to 2^2 = 4.

We use 157 of the 208 observations to train the model:

> set.seed(107)
> n = nrow(Sonar)
> indtrain <- sample(1:n, 157, FALSE)
> train <- data[indtrain, ]
> test <- data[-indtrain, ]

Since Sonar$Class is non-numeric we change it to a numeric scale. This will be useful later when we plot the training and test data.

> z <- Sonar$Class
> z <- as.numeric(z)

Next we create a new data frame combining the numeric variable z with the original data in Sonar. This is followed by a little tidying up by removing Class (which contained the "M" & "R" labels) from the data frame.

> data <- (cbind(z, Sonar))
> data$Class <- NULL

Step 3: Estimate Model
Now we are ready to run the Ada Boost.M1 model on the training sample. We choose to perform 50 boosting iterations by setting the parameter iter = 50.

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "discrete",
                      control = default)

Take a look at the output:

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1
        44         44

Notice the output gives the training error, confusion matrix and three estimates of the number of iterations. In this example you could use the Out-Of-Bag estimate of 45 iterations, or the training error or kappa error estimates of 44 iterations. As you can see, the model achieved low error rates. The training error is 3.2%. The Out-Of-Bag error rate is around 4.5%.

Step 4: Assess Model Performance
We use the addtest function to evaluate the testing set without having to refit the model.

> set.seed(107)
> output_train <- addtest(x = output_train, test.x = test[,-1], test.y = test[,1])

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2
        44         44         32         32

Notice the estimate of iterations for the training error and kappa declined from 44 (for training) to 32 for the test sample.

It can be instructive to plot both training and test error results by iteration on one chart:

> plot(output_train, test = TRUE)

The resulting plot, shown in Figure 67.1, shows that the training error steadily decreases across iterations. This shows that boosting can learn the features in the data set. The testing error also declines somewhat across iterations, although not as dramatically as for the training error.

It is sometimes helpful to examine the performance at a specific iteration. Here is what happened at iteration 25 for train:

> summary(output_train, n.iter = 25)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 25

Training Results

Accuracy: 0.955 Kappa: 0.911

Testing Results

Accuracy: 0.824 Kappa: 0.654

The accuracy of the training and testing samples is above 0.80. To assess the overall performance enter:

> summary(output_train)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Training Results

Accuracy: 0.968 Kappa: 0.936

Testing Results

Accuracy: 0.804 Kappa: 0.613

Notice that accuracy is in excess of 80% for both the test and training datasets. Kappa declines from 94% in training to 61% for the test data set.

PRACTITIONER TIP

In very many circumstances you may find your training set is unbalanced, in the sense that one class has very many more observations than the other. Ada Boost will tend to focus on learning the larger set, with resultant low errors. Of course, the low error is related to the focus on the larger class. The Kappa coefficient132 provides an alternative measure of absolute classification error which adjusts for class imbalances. As with the correlation coefficient, higher values indicate a stronger fit.
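To see how the adjustment works, Cohen's kappa can be computed directly from any confusion matrix. A minimal sketch (the helper function and object names are our own), using the final training confusion matrix reported above:

kappa_stat <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

tab <- matrix(c(80, 3, 2, 72), nrow = 2)  # rows = true class, columns = predicted class
round(kappa_stat(tab), 3)                 # approximately 0.936, the training Kappa reported above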

In Figure 67.2 the relative importance of each variable is plotted. This can be obtained by typing:

> varplot(output_train)

Figure 67.2: Variable importance plot for Ada Boost using Sonar data set

The largest five individual scores can be printed out by entering:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  V10   V17   V48    V9   V13
0.078 0.078 0.076 0.072 0.071

Step 5: Make Predictions
Next we predict using the fitted model and the test sample. This can be easily achieved by:

> pred <- predict(output_train, test[,-1], type = "both")

A summary of the class predictions and the associated plot provides performance information:

> summary(pred$class)
 1  2
23 28

> plot(pred$class)

The model classifies 23 observations into class 1 (recall "M" if the object is metal) and 28 observations into class 2 ("R" if the object is a rock). This is reflected in Figure 67.3.

A nice feature of this model is the ability to see the probabilities of an observation and the class assignment. For example, the first observation has associated probabilities and class assignment:

> pred$probs[[1, 1]]
[1] 0.5709708
> pred$probs[[1, 2]]
[1] 0.4290292
> pred$class[1]
[1] 1
Levels: 1 2

The probability of the observation belonging to class 1 is 57% and to class 2 around 43%, and therefore the predicted class is class 1. The second and third observations can be assessed in a similar fashion:

> pred$probs[[2, 1]]
[1] 0.07414183

> pred$probs[[2, 2]]
[1] 0.9258582

> pred$class[2]
[1] 2
Levels: 1 2

> pred$probs[[3, 1]]
[1] 0.7416495

> pred$probs[[3, 2]]
[1] 0.2583505

> pred$class[3]
[1] 1
Levels: 1 2

Notice that the second observation is assigned to class 2 with an associated probability of approximately 93%, and the third observation is assigned to class 1 with an associated probability of approximately 74%.

Figure 67.1: Error by iteration for train and test

Figure 67.3: Model predictions for class 1 ("M") and class 2 ("R")

Technique 68

Real Ada Boost

The Real Ada Boost algorithm uses discrete boosting with an exponential loss function and η(p) = log(p/(1 − p)), p ∈ [0, 1]. It can be run using the package ada with the ada function:

ada(z ~ ., data = , iter = , loss = "e", type = "real", bag.frac = 1, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. To perform Stochastic Gradient boosting you set the bag.frac argument to less than 1 (the default is bag.frac = 0.5). Pure ε-boosting only happens if bag.frac = 1.

Steps 1-2 are outlined beginning on page 481.

Step 3 & 4: Estimate Model & Assess Performance
We choose to perform 50 boosting iterations by setting the parameter iter = 50.

NOTE

The main control to avoid over-fitting in boosting algorithms is the stopping iteration. A very high number of iterations may favor over-fitting. However, stopping the algorithm too early might lead to poorer prediction on new data. In practice over-fitting appears to be less of a risk than under-fitting, and there is considerable evidence that Ada Boost is quite resistant to over-fitting133.
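One practical safeguard, given the iteration estimates that ada reports, is to inspect the model at the suggested earlier stopping point rather than the full run. A minimal sketch, assuming the output_train object and test set from Technique 67:

> summary(output_train, n.iter = 45)   # performance at the Out-Of-Bag suggested iteration
> pred45 <- predict(output_train, test[,-1], n.iter = 45)  # predict using only the first 45 iterations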

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "real", bag.frac = 1,
                      control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 1, control = default)

Loss: exponential Method: real   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  0 75

Train Error: 0

Out-Of-Bag Error:  0 iteration= 6

Additional Estimates of number of iterations:

train.err1 train.kap1
        27         27

In this case the model perfectly fits the training data and the training error is zero.

PRACTITIONER TIP

Although the out-of-bag error rate is also zero, it is of little value here because there are no subsamples in pure ε-boosting (we set bag.frac = 1).

We next refit the model using Stochastic Gradient boosting with bag.frac at its default value. The overall training error rate remains a modest 1.3%, with an Out-Of-Bag Error of 4.5%.

> set.seed(107)
> output_SB <- ada(z ~ ., data = train, iter = 50,
                   loss = "e", type = "real", bag.frac = 0.5,
                   control = default)

> output_SB
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 0.5, control = default)

Loss: exponential Method: real   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  2 73

Train Error: 0.013

Out-Of-Bag Error:  0.045 iteration= 46

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

To assess variable importance we enter:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  V46   V47   V48   V35   V20
0.087 0.081 0.078 0.076 0.072

The top five variables differ somewhat from those obtained with Ada Boost.M1 (see page 487).

Step 5: Make Predictions
Now we fit the model to the test data set, make predictions and compare the results to the actual values observed in the sample using the commands table(pred) and table(test$z) (recall test$z contains the class values).

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "real",
                     control = default)

> pred <- predict(output_test, newdata = test)

> table(pred)
pred
 1  2
29 22

> table(test$z)

 1  2
29 22

The model fits the observations perfectly.

Technique 69

Gentle Ada Boost

The Gentle Ada Boost algorithm uses discrete boosting with an exponential loss function and η(x) = x. It can be run using the package ada with the ada function:

ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle", nu = 0.01, control = default, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; data, the data set of attributes with which you wish to train the model; and nu, the shrinkage parameter for boosting. Steps 1-2 are outlined beginning on page 481.

NOTE

The idea behind the shrinkage parameter nu is to slow down learning and reduce the likelihood of over-fitting. Smaller learning rates (such as nu < 0.1) often produce dramatic improvements in a model's generalization ability over gradient boosting without shrinking (nu = 1)134. However, small values of nu increase computational time by increasing the number of iterations required to learn.
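A quick way to see the effect of the learning rate is to fit the same model with two different values of nu and compare the reported training errors. A minimal sketch, assuming the train data frame and default control object from the earlier steps:

> fit_slow <- ada(z ~ ., data = train, iter = 50, type = "gentle", nu = 0.01, control = default)
> fit_fast <- ada(z ~ ., data = train, iter = 50, type = "gentle", nu = 1, control = default)
> fit_slow   # compare the reported Train Error
> fit_fast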

Step 3: Estimate Model
We choose to perform 50 boosting iterations by setting the parameter iter = 50.

> set.seed(107)

> output_train <- ada(z ~ ., data = train, iter = 50, loss = "e",
                      type = "gentle", nu = 0.01, control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 72  8
         2 13 64

Train Error: 0.134

Out-Of-Bag Error:  0.121 iteration= 28

Additional Estimates of number of iterations:

train.err1 train.kap1
         5         23

The training error is around 13%, with an Out-Of-Bag Error of 12%. Let's see if we can do better by using the parameter bag.shift = TRUE to estimate an ensemble shifted towards bagging.

> output_bag <- ada(z ~ ., data = train, iter = 50,
                    loss = "e", type = "gentle", nu = 0.01,
                    bag.shift = TRUE, control = default)

> output_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 70 10
         2 20 57

Train Error: 0.191

Out-Of-Bag Error:  0.178 iteration= 17

Additional Estimates of number of iterations:

train.err1 train.kap1
         9          9

The training error has risen to 19%. We continue our analysis using both models.

Step 4: Assess Model Performance
We fit the test data to both models. First, our boosted model:

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "gentle", nu = 0.01,
                     control = default)

> output_test
Call:
ada(z ~ ., data = test, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 31  0
         2  3 17

Train Error: 0.059

Out-Of-Bag Error:  0.059 iteration= 10

Additional Estimates of number of iterations:

train.err1 train.kap1
         8          8

The error rate for the test set, at 5.9%, is certainly lower than for the training set, and the number of iterations for kappa is only 8. The results for the ensemble shifted towards bagging are as follows:

> output_test_bag <- ada(z ~ ., data = train, iter = 50,
                         loss = "e", type = "gentle", nu = 0.01,
                         bag.shift = TRUE, control = default)

> output_test_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 38 42
         2 11 66

Train Error: 0.338

Out-Of-Bag Error:  0.166 iteration= 44

Additional Estimates of number of iterations:

train.err1 train.kap1
        35         46

The training error is higher at 34%. The Out-Of-Bag Error remains stubbornly in the 17% range.

Step 5: Make Predictions
We use the test sample to make predictions for both models and compare the results to the known values.

> pred <- predict(output_test, newdata = test)
> table(pred)
pred
 1  2
34 17

> pred <- predict(output_test_bag, newdata = test)
> table(pred)
pred
 1  2
31 20

> table(test$z)

 1  2
31 20

Note that test$z contains the actual values. It turns out that the results for both models are within an acceptable range. The bagging-shifted model predicts the class totals perfectly.

Technique 70

Discrete L2 Boost

Discrete L2 Boost uses discrete boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data = , iter = , loss = "l", type = "discrete", control = , ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

Step 1: Load Required Packages
We will build the model using the soldat data frame contained in the ada package. The objective is to use the Discrete L2 Boost model to predict the relationship between the structural descriptors (aka features or attributes) and solubility / insolubility.

> library(ada)
> data(soldat)

NOTE

The soldat data frame consists of 5631 compounds tested to assess their ability to dissolve in a water/solvent mixture. Compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Notice that one of the descriptors contains 787 missing values.
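Before modeling it is worth a quick sanity check of the class balance and the missing descriptor values. A minimal sketch, assuming soldat has been loaded as above and that the response column is y (as used in the code below):

> table(soldat$y)     # -1 = insoluble, 1 = soluble
> sum(is.na(soldat))  # count of missing descriptor values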

Step 2: Prepare Data & Tweak Parameters
We partition the data into a training set containing 60% of the observations, a test set containing 30% of the observations and a validation set containing 10% of the observations.

> n <- nrow(soldat)
> set.seed(103)
> random_sample <- sample(1:n)
> train_sample <- ceiling(n * 0.6)
> test_sample <- ceiling(n * 0.3)
> valid_sample <- ceiling(n * 0.1)
> train <- soldat[random_sample[1:train_sample], ]
> test <- soldat[random_sample[(train_sample + 1):(train_sample + test_sample)], ]
> valid <- soldat[random_sample[(test_sample + train_sample + 1):n], ]

Wow, that is a lot of typing! Better check we have the right number of observations (5631):

> nrow(train) + nrow(test) + nrow(valid)
[1] 5631

Now we set the rpart.control. The maxdepth parameter controls the maximum depth of any node in the final tree, with the root node counted as depth 0. It is raised to the power of 2. We set a max depth of 2^4 = 16.

> default <- rpart.control(cp = -1, maxdepth = 4, maxcompete = 5,
                           minsplit = 0)

Step 3: Estimate Model
Now we are ready to run the model on the training sample. We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "discrete", control = default)
> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)

Loss: logistic Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1918  189
         1  461  811

Train Error: 0.192

Out-Of-Bag Error:  0.207 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

Notice the output gives the training error, confusion matrix and three estimates of the number of iterations. The training error estimate is 19.2% and the Out-Of-Bag Error is 20.7%. Of course, you could increase the number of iterations to attempt to push down both error rates. However, for illustrative purposes we will stick with this version of the model.

Step 4: Assess Model Performance

PRACTITIONER TIP

Notice that the actual binary classes (coded -1 and +1) are in column 73 of the test, train and validation datasets. Therefore we add the arguments test.x = test[,-73], test.y = test[,73] to the addtest function.

We use the addtest function to evaluate the testing set without needing to refit the model.

> output_test <- addtest(output_test, test.x = test[,-73], test.y = test[,73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)

Loss: logistic Method: discrete   Iteration: 50

Training Results

Accuracy: 0.808 Kappa: 0.572

Testing Results

Accuracy: 0.759 Kappa: 0.467

We see a decline in both the model accuracy and kappa from the training to the test set. Although an accuracy of 75.9% may be sufficient, we would like to see a slightly higher kappa.

Next we plot the training error and kappa by iteration for training and test, see Figure 70.1.

> plot(output_test, TRUE, TRUE)

Figure 70.1: Error and kappa by iteration for training and test sets

The error rate of both the test and training set appears to decline as the number of iterations increases. Kappa appears to rise somewhat as the iterations increase for the training set, but it quickly flattens out for the testing data set.

We next look at the attribute influence scores and plot the 10 most influential variables, see Figure 70.2.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")
> barplot(scores[1:10])

Figure 70.2: Ten most influential variables

For this example there is not a lot of variation between the top ten influential variables.

Step 5: Make Predictions
Next we re-estimate using the validation sample to predict the classes and compare to the actual observed classes.

> set.seed(127)
> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "discrete", control = default)

> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
378 184

> table(valid$y)

 -1   1
356 206

The model predicts 378 in "class -1" and 184 in "class 1", whilst we actually observe 356 and 206 in "class -1" and "class 1" respectively. A summary of the class probability based predictions is obtained using:

> pred <- predict(output_valid, newdata = valid, type = "probs")
> round(pred, 3)
        [,1]  [,2]
  [1,] 0.010 0.990
  [2,] 0.062 0.938
  [3,] 0.794 0.206
  [4,] 0.882 0.118
...
[559,] 0.085 0.915
[560,] 0.799 0.201
[561,] 0.720 0.280
[562,] 0.996 0.004

Technique 71

Real L2 Boost

The Real L2 Boost uses real boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data = , iter = , loss = "l", type = "real", control = , ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3: Estimate Model
We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "real", control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic Method: real   Iteration: 50

Final Confusion Matrix for Data:

          Final Prediction
True value   -1    1
        -1 1970  137
         1  487  785

Train Error: 0.185

Out-Of-Bag Error:  0.211 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        47         50

The confusion matrix for the data yields a training error of 18.5%. The Out-Of-Bag Error is a little over 21%.

Step 4: Assess Model Performance
The addtest function is used to evaluate the testing set without needing to refit the model.

> output_test <- addtest(output_test, test.x = test[,-73], test.y = test[,73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic Method: real   Iteration: 50

Training Results

Accuracy: 0.815 Kappa: 0.584

Testing Results

Accuracy: 0.753 Kappa: 0.451

It is noticeable that the accuracy of the model falls from 81.5% for the training set to 75.3% for the test set. A rather sharp decline is also observed in the kappa statistic.

We next calculate the influential variables. As we observed with Discrete L2 Boost, there is not a lot of variation in the top five most influential, with all hovering in the approximate range of 10-11%.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
   x1    x2   x23    x5   x51
0.114 0.111 0.108 0.108 0.106

Step 5: Make Predictions
We re-estimate and use the validation sample to make predictions. Notice that 385 are predicted to be in "class -1" and 177 in "class 1". These predicted values are very close to those obtained by the Discrete L2 Boost.

> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
385 177

> table(valid$y)

 -1   1
356 206

Technique 72

Gentle L2 Boost

The Gentle L2 Boost uses gentle boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data = , iter = , loss = "l", type = "gentle", control = , ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3: Estimate Model
We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "gentle", control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic Method: gentle   Iteration: 50

Final Confusion Matrix for Data:

          Final Prediction
True value   -1    1
        -1 1924  183
         1  493  779

Train Error: 0.2

Out-Of-Bag Error:  0.216 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        50         50

The model obtained a training error of 20% and a slightly higher Out-Of-Bag Error estimate of close to 22%.

Step 4: Assess Model Performance
Next we begin the assessment of the model's performance using the test data set. Notice that the accuracy declines slightly, from 80% for the training sample to 74% for the test data set. We see modest declines in kappa also as we move from the training to the test data set.

> output_test <- addtest(output_test, test.x = test[,-73], test.y = test[,73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic Method: gentle   Iteration: 50

Training Results

Accuracy: 0.8 Kappa: 0.552

Testing Results

Accuracy: 0.742 Kappa: 0.427

The top five most influential variables are given below. It is interesting to observe that they all lie roughly in a range of 9 to 10%.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  x39   x12   x56   x70   x41
0.099 0.097 0.097 0.097 0.095

Step 5: Make Predictions
We use the validation sample to make class predictions. Notice that 412 are predicted to be in "class -1" and 150 in "class 1". These predicted values are somewhat different from those obtained by the Discrete L2 Boost (see page 506) and the Real L2 Boost (see page 510).

> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "gentle", control = default)

> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
412 150

> table(valid$y)

 -1   1
356 206


Multi-Class Boosting


Technique 73

SAMME

SAMME135 is an extension of the binary Ada Boost algorithm to two or more classes. It uses an exponential loss function with αm = ln((1 − εm)/εm) + ln(k − 1), where k is the number of classes.

PRACTITIONER TIP

When the number of classes k = 2, then αm = ln((1 − εm)/εm), which is similar to the learning coefficient of Ada Boost.M1. Notice that SAMME can be used for both binary boosting and multiclass boosting. This is also the case for the Breiman and Freund extensions discussed on page 519 and page 522 respectively. As an experiment, re-estimate the Sonar data set (see page 482) using SAMME. What do you notice about the results relative to Ada Boost.M1?
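A minimal sketch of this experiment, assuming the Sonar objects (Sonar, indtrain) from Technique 67 are still in memory; note that boosting expects a factor response, so we use the original Class column rather than the numeric recoding:

> library(adabag)
> sonar_train <- Sonar[indtrain, ]
> fit_samme <- boosting(Class ~ ., data = sonar_train, mfinal = 25, coeflearn = "Zhu")
> 1 - sum(fit_samme$class == sonar_train$Class) / nrow(sonar_train)   # training error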

The SAMME algorithm can be run using the package adabag with the boosting function:

boosting(Class ~ ., data = , mfinal = , control = , coeflearn = "Zhu", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Zhu" to specify using the SAMME algorithm.

Step 1: Load Required Packages

NOTE

The Vehicle data set (see page 23) is stored in the mlbench package. It is automatically loaded with the adabag package. If you need to load Vehicle directly, type library(mlbench) at the R prompt.

We begin by loading the library package and the Vehicle data set.

> library(adabag)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters
The adabag package uses the rpart function to generate decision trees. So we need to supply an rpart.control object to the model. We use 500 observations as the training set and the remainder for testing.

> default <- rpart.control(cp = -1, maxdepth = 4,
                           minsplit = 0)
> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> observed <- Vehicle[train, ]

Step 3 & 4: Estimate Model & Assess Model Performance
We estimate the model, create a table of predicted and observed values and then calculate the error rate.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ],
                  mfinal = 25, control = default, coeflearn = "Zhu")

> table(observed$Class, fit$class, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus  124    0    0   0
          opel   0  114   13   0
          saab   0   12  117   0
          van    0    0    0 120

> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.05

PRACTITIONER TIP

The object returned from using the boosting function contains some very useful information. Particularly relevant are $trees, $weights, $votes, $prob, $class and $importance. For example, if you enter fit$trees, R will return details of the trees used to estimate the model.
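For example, the following minimal sketch (assuming the fit object estimated above) inspects a few of these components:

> length(fit$trees)          # one rpart tree per boosting iteration
> round(fit$weights, 2)      # the iteration weights used in the committee vote
> head(round(fit$prob, 2))   # class membership probabilities for each observation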

Notice the error rate is 5%, which seems quite low. Perhaps we have over fit the model. As a check we perform a 10-fold cross validation. The confusion matrix is obtained using $confusion and the training error by $error.

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25,
                    v = 10, control = default, coeflearn = "Zhu")

> cv$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  208    0    3   1
           opel   2  113   91   4
           saab   7   94  116  11
           van    1    5    7 183

> cv$error
[1] 0.2671395

The cross validation error rate is much higher at 27%. This is probably a better reflection of what we can expect on the testing sample.

Before we use the test set to make predictions we take a look at the three most influential features:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra   Pr.Axis.Ra Sc.Var.Maxis
        14.0          9.0          8.7

NOTE

Cross validation is a powerful concept because it can be used to estimate the error of an ensemble without having to divide the data into a training and testing set. It is often used in small samples, but is equally advantageous in larger datasets also.

Step 5: Make Predictions
We use the test data set to make predictions.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> pred$error
[1] 0.2745665

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   88    1    1   1
           opel   1   42   37   1
           saab   2   40   48   4
           van    3    2    2  73

The prediction error, at 27%, is close to the cross validated error rate.

Technique 74

Breiman's Extension

Breiman's extension assumes αm = 0.5 log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data = , mfinal = , control = , coeflearn = "Breiman", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Breiman".

PRACTITIONER TIP

The function boosting takes the optional parameter boos. By default it is set to TRUE and a bootstrap sample is drawn using the weight for each observation. If boos = FALSE then every observation is used.
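For example, a minimal sketch of a fit in which every observation is used at each iteration rather than a weighted bootstrap sample (assuming the Vehicle training objects defined earlier):

> fit_noboos <- boosting(Class ~ ., data = Vehicle[train, ], boos = FALSE,
                         mfinal = 25, control = default, coeflearn = "Breiman")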

Steps 1 and 2 are outlined beginning on page 516

Step 3 & 4: Estimate Model & Assess Model Performance
We fit the model using the training sample, calculate the error rate and also calculate the error rate using a 10-fold cross validation.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Breiman")

> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.1

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Breiman")
> round(cv$error, 2)
[1] 0.26

Whilst the error rate of 10% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation. Once again we expect the cross-validation error to better reflect what we expect to observe in the test sample.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Max.L.Rect
        23.7         14.0          8.8

Max.L.Ra is the most influential variable here and also using the SAMME algorithm (see page 517).

Step 5: Make Predictions
Finally we make predictions using the test data set and print out the observed versus predicted values.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.28

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   87    8    4   1
           opel   1   43   33   1
           saab   2   25   45   4
           van    4    9    6  73

The predicted error rate is 28% (close to the cross validated value).

Technique 75

Freund's Adjustment

Freund's adjustment is another direct extension of the binary Ada Boost algorithm to two or more classes, where αm = log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data = , mfinal = , control = , coeflearn = "Freund", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Freund".

Steps 1 and 2 are outlined beginning on page 516

Step 3 & 4: Estimate Model & Assess Model Performance
We fit the model using the training sample, calculate the error rate and also calculate the error rate using a 10-fold cross validation.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Freund")

> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.06

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Freund")

> round(cv$error, 2)
[1] 0.25

Whilst the error rate of 6% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Pr.Axis.Ra
        20.1          9.7          9.3

Max.L.Ra is the most influential variable here (also using the SAMME and Breiman algorithms - see pages 517 and 520).

Step 5: Make Predictions
Finally we make predictions using the test data set and print out the observed versus predicted table.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.27

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   86    1    2   0
           opel   1   47   37   0
           saab   3   31   45   4
           van    4    6    4  75

The predicted error rate is 27% (close to the cross validated value of 25%).

Continuous Response Boosted Regression

Technique 76

L2 Regression

NOTE

L2 boosting minimizes the least squares error (the sum of the squares of the differences between the observed values (yi) and the estimated values (ŷi)):

L2 = Σ_{i=1}^{N} (yi − ŷi)²    (76.1)

ŷi is estimated as a function of the independent variables / features.
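The L2 criterion of equation (76.1) is straightforward to compute directly in R. A minimal sketch with purely illustrative numbers:

> y    <- c(41.7, 43.3, 35.4)   # observed responses (illustrative values only)
> yhat <- c(40.9, 44.1, 36.0)   # fitted values (illustrative values only)
> sum((y - yhat)^2)             # the L2 loss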

L2 Boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data = , family = Gaussian(), control = , ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Gaussian(), which implements L2 boosting; and control, which limits the number of boosting iterations and also controls the shrinkage parameter.

Step 1: Load Required Packages
We begin by loading the mboost package and the bodyfat data set described on page 62.

> library(mboost)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
In the original study, Garcia et al. used 45 observations for model validation. We follow the same approach, using the remaining 26 observations as the testing sample.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate Model & Assess Fit

PRACTITIONER TIP

Set trace = TRUE if you want to see status information during the fitting process.

First we fit the model using the glmboost function. We set the number of boosting iterations to 75 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients.

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Gaussian(),
                  control = boost_control(mstop = 75, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-66.27793041   0.02547924   0.17377131
     hipcirc elbowbreadth  kneebreadth
  0.46430335  -0.63272508   0.86864111
    anthro3a     anthro3b
  3.36145109   3.52597323

PRACTITIONER TIP

We use coef(fit, off2int = TRUE) to add back the offset to the intercept. To see the intercept without the offset use coef(fit).

A key tuning parameter of boosting is the number of iterations. We set mstop = 75. However, to prevent over fitting it is important to choose the optimal stopping iteration with care. We use 10-fold cross validated estimates of the empirical risk to help us choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 40

> fit[mstop(cvm)]

Figure 76.1: Cross-validated predictive risk for the bodyfat data set and L2 regression

Figure 76.1 displays the predictive risk. The optimal stopping iteration minimizes the average risk over all samples. Notice that mstop(cvm) returns the optimal number, in this case 40. We use fit[mstop(cvm)] to set the model parameters automatically to the optimal mstop.

Given our optimal model, we calculate a bootstrapped confidence interval at the 90% level for each of the parameters. For illustration we only use 200 bootstraps; in practice you should use at least 1000.

> CI <- confint(fit, B = 200, level = 0.9)

> CI

Bootstrap Confidence Intervals
                       5%          95%
(Intercept)  -74.8992833 -49.16894522
age            0.0000000   0.04193643
waistcirc      0.0250575   0.27216736
hipcirc        0.2656622   0.60493289
elbowbreadth  -0.5086485   0.45619227
kneebreadth    0.0000000   1.93441658
anthro3a       0.0000000   8.05412043
anthro3b       0.0000000   5.32118418
anthro3c       0.0000000   3.05345810
anthro4        0.0000000   4.66542476

PRACTITIONER TIP

To compute a confidence interval for another level simply enter the desired level using the print function. For example, print(CI, level = 0.8) returns:

> print(CI, level = 0.8)

Bootstrap Confidence Intervals
                      10%          90%
(Intercept)  -72.20299829 -51.87095140
age            0.00000000   0.03249837
waistcirc      0.04749293   0.24761814
hipcirc        0.32502314   0.56940817
elbowbreadth  -0.30893554   0.00000000
kneebreadth    0.00000000   1.45814706
anthro3a       0.00000000   7.14609005
anthro3b       0.00000000   4.67906130
anthro3c       0.00000000   2.29523786
anthro4        0.00000000   3.59735084

The confidence intervals indicate that waistcirc and hipcirc are statistically significant. Since our goal is to build a parsimonious model, we re-estimate the model using only these two variables.

> fit2 <- gamboost(DEXfat ~ waistcirc + hipcirc,
                   data = bodyfat[train, ], family = Gaussian(),
                   control = boost_control(mstop = 150, nu = 0.1,
                                           trace = FALSE))

> CI2 <- confint(fit2, B = 50, level = 0.9)

> par(mfrow = c(1, 2))
> plot(CI2, which = 1)
> plot(CI2, which = 2)
> par(new = TRUE)

Notice we use the gamboost function rather than glmboost. This is because we want to compute and display point-wise confidence intervals using the plot function. The confidence intervals are shown in Figure 76.2. For both variables the effect shows an almost linear increase with circumference size.

Figure 76.2: Point-wise confidence intervals for waistcirc and hipcirc

Step 4: Make Predictions
We now make predictions using the testing data set, plot the result (see Figure 76.3) and calculate the squared correlation coefficient. The final model shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.858 (linear correlation = 0.926).

> pred <- predict(fit2, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)[1]
[1] 0.858

Figure 76.3: Plot of observed versus predicted values for the L2 bodyfat regression

Technique 77

L1 Regression

NOTE

L1 minimizes the least absolute deviations (also known as the least absolute errors) by minimizing the sum of the absolute differences between the observed values (yi) and the estimated values (ŷi):

L1 = Σ_{i=1}^{N} |yi − ŷi|

ŷi is estimated as a function of the independent variables / features. Unlike the L2 loss, which is sensitive to extreme values, L1 is robust to outliers.
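A minimal sketch, with purely illustrative numbers, of why the L1 criterion is more robust than L2 to a single outlying observation:

> err <- c(0.5, -0.3, 0.2, 12)   # residuals; the last one is an outlier
> sum(err^2)                     # the L2 loss is dominated by the outlier (144.38)
> sum(abs(err))                  # the L1 loss is affected far less (13)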

L1 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data = , family = Laplace(), control = , ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Laplace(), which implements L1-loss boosting; and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1-2 are outlined beginning on page 525.

Step 3: Estimate Model & Assess Fit

First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients.

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Laplace(),
                  control = boost_control(mstop = 300, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-59.12102158   0.01124291   0.06454519
     hipcirc elbowbreadth  kneebreadth
  0.49652307  -0.38873368   0.83397012
    anthro3a     anthro3b     anthro3c
  0.58847326   3.64957585   2.00388290

All of the variables receive a weight with the exception of anthro4. A 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 260

> fit[mstop(cvm)]

The optimal stopping iteration is 260, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions

Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 77.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.906 (linear correlation = 0.95).

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.906

Figure 77.1: Plot of observed versus predicted values for L1 bodyfat regression

Technique 78

Robust Regression

Robust boosting regression uses the Huber loss function, which is less sensitive to outliers than the L2 loss function discussed on page 525. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Huber(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Huber(); and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1-3 are outlined beginning on page 525.

Step 3: Estimate Model & Assess Fit

First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients.

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Huber(),
                  control = boost_control(mstop = 300, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-60.83574185   0.02340993   0.17680696
     hipcirc elbowbreadth  kneebreadth
  0.45778183  -0.83925018   0.64471220
    anthro3a     anthro3b     anthro3c
  0.26460670   5.14548154   0.78826723

All of the variables receive a weight with the exception of anthro4. A 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 100

> fit[mstop(cvm)]

The optimal stopping iteration is 100, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions

Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 78.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.913.

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.913

Figure 78.1: Plot of observed versus predicted values for Robust bodyfat regression

Technique 79

Generalized Additive Model

The generalized additive model (GAM) is a generalization of the linear regression model in which the coefficients can be estimated as smooth functions of covariates. The GAM model can account for non-linear relationships between response variables and covariates:

E(Y | X1, X2, ..., XK) = α + f1(X1) + f2(X2) + ... + fK(XK)     (79.1)

A boosted version of the model can be run using the package mboost with the gamboost function:

gamboost(z ~ ., data, control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; and control, which limits the number of boosting iterations and controls the shrinkage parameter. Steps 1-3 are outlined beginning on page 525.

NOTE

For more than 100 years, what data scientists could do was limited by asymptotic theory. Consumers of statistical theory, such as economists, focused primarily on linear models and residual analysis. With boosting there is no excuse anymore for using overly restrictive statistical models.

Step 3: Estimate Model & Assess Fit

We previously saw that waistcirc and hipcirc are the primary independent variables for predicting body fat (see page 525). We use these two variables to estimate linear terms (bols()), smooth terms (bbs()), and an interaction between waistcirc and hipcirc modeled using decision trees (btree()).

> m <- DEXfat ~ waistcirc + hipcirc + bols(waistcirc) + bols(hipcirc) +
       bbs(waistcirc) + bbs(hipcirc) +
       btree(hipcirc, waistcirc,
             tree_controls = ctree_control(maxdepth = 4, mincriterion = 0))

We set the number of boosting iterations to 150 and the shrinkage parameter (nu) to 0.1. This is followed by a 10-fold cross validated estimate of the empirical risk to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function; fit[mstop(cvm)] assigns the model coefficients at the optimal iteration.

> fit <- gamboost(m, data = bodyfat[train, ],
                  control = boost_control(mstop = 150, nu = 0.1,
                                          trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 41

> fit[mstop(cvm)]

The optimal stopping iteration is 41, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Plots of the linear effects and the interaction map can be obtained by typing:

> plot(fit, which = 1)
> plot(fit, which = 2)
> plot(fit, which = 5)

The resulting plots are given in Figure 79.1 and Figure 79.2. Figure 79.1 indicates a strong linear effect for both waistcirc and hipcirc. Figure 79.2 indicates that a hip circumference larger than about 110 cm leads to increased body fat, provided waist circumference is larger than approximately 90 cm.

Figure 79.1: GAM regression estimates of linear effects for waistcirc and hipcirc

Figure 79.2: Interaction model component between waistcirc and hipcirc

Step 4: Make Predictions

Predictions using the test data set are made with the optimal model fit. A plot of the predicted and observed values (see Figure 79.3) shows a relatively linear relationship with DEXfat. The squared correlation coefficient is 0.867 (linear correlation = 0.93).

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.867

Figure 79.3: Plot of observed versus predicted values for GAM regression

Technique 80

Quantile Regression

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when the lower and upper, or indeed all, quantiles are of interest.

For example, in the analysis of body mass both lower (underweight) and upper (overweight) quantiles are of interest to health professionals and health-conscious individuals. We apply the technique to the bodyfat data set to illustrate this point.

Quantile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = QuantReg(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = QuantReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages

We illustrate the use of this model using the bodyfat data frame. Details of this data set are discussed on page 62.

> library(mboost)
> data("bodyfat", package = "TH.data")
> set.seed(465)


Step 2: Estimate Model & Assess Fit

The question we address is: does a common set of factors explain the 25th and 75th percentiles of the data set? To answer it, we estimate two quantile regression models, one at the 25th percentile and the other at the 75th percentile.

> fit25 <- glmboost(DEXfat ~ ., data = bodyfat,
                    family = QuantReg(tau = 0.25),
                    control = boost_control(mstop = 1200, nu = 0.1,
                                            trace = FALSE))

> fit75 <- glmboost(DEXfat ~ ., data = bodyfat,
                    family = QuantReg(tau = 0.75),
                    control = boost_control(mstop = 1200, nu = 0.1,
                                            trace = FALSE))

The coefficients of the 25th percentile regression are:

> coef(fit25, off2int = TRUE)
 (Intercept)          age    waistcirc
-58.24524124   0.03352251   0.16259137
     hipcirc elbowbreadth  kneebreadth
  0.32444309  -0.19165652   0.81968750
    anthro3a     anthro3b     anthro3c
  1.27773808   1.69581972   3.90369290
     anthro4
  0.60159287

The coefficients of the 75th percentile regression are:

> coef(fit75, off2int = TRUE)
(Intercept)         age   waistcirc
-64.4564045   0.0130879   0.2779561
    hipcirc elbowbreadth kneebreadth
  0.3954600  -0.1283315   0.8464489
   anthro3a    anthro3b    anthro3c
  2.5044739   2.2238765   0.9028559


There appears to be considerable overlap in the set of explanatory variables which explain both percentiles. However, it is noticeable that anthro4 only appears in the 25th percentile regression. To investigate further, we use a 10-fold cross validated estimate of the empirical risk to choose the optimal number of boosting iterations for each quantile regression model. The empirical risk is calculated using the cvrisk function.

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)
> cvm25 <- cvrisk(fit25, folds = cv10f25)

> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)
> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 994

> mstop(cvm75)
[1] 1200

> fit25[mstop(cvm25)]
> fit75[mstop(cvm75)]

The optimal stopping iteration is 994 for the 25th percentile model and 1200 for the 75th percentile model.

Given our optimal models, we calculate a bootstrapped confidence interval (90% level) around each of the parameters. For illustration we only use 50 bootstraps; in practice you should use at least 1,000.

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

The results for the 25th percentile model are:

> CI25

Bootstrap Confidence Intervals
                       5%            95%
(Intercept)   -62.91485919   -43.84055225
age            -0.02032722     0.06116936
waistcirc       0.02054691     0.36351636
hipcirc         0.14107346     0.48121885
elbowbreadth   -1.02553929     0.23022726
kneebreadth    -0.03557848     1.37440528
anthro3a        0.00000000     4.11999837
anthro3b        0.00000000     3.18697032
anthro3c        2.09103968     6.49277249
anthro4         0.00000000     1.36671922

The results for the 75th percentile model are:

> CI75

Bootstrap Confidence Intervals
                       5%            95%
(Intercept)   -72.77094640   -49.1977452
age            -0.00093535     0.0549349
waistcirc       0.10970068     0.3596424
hipcirc         0.16935863     0.5051908
elbowbreadth   -1.36476008     0.5104004
kneebreadth     0.00000000     1.9214064
anthro3a       -0.76595108     6.1881146
anthro3b        0.00000000     6.6016240
anthro3c       -0.31714787     5.6599846
anthro4        -0.27462792     2.8194127

Both models have waistcirc and hipcirc as significant explanatory variables. anthro3c is the only other significant variable, and then only for the lower weight percentile.
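As an informal visual check (a suggestion, not part of the book's listing), the fitted conditional quantiles from the two optimal models can be plotted against the observed response:

> q25 <- predict(fit25)
> q75 <- predict(fit75)
> plot(bodyfat$DEXfat, q25, col = "blue", ylim = range(q25, q75),
       xlab = "Observed DEXfat", ylab = "Fitted conditional quantile")
> points(bodyfat$DEXfat, q75, col = "red")
> legend("topleft", c("25th percentile", "75th percentile"),
         col = c("blue", "red"), pch = 1)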


Technique 81

Expectile Regression

Expectile regression136 is used for estimating the conditional expectiles of a response variable given a set of attributes. Having multiple expectiles at different levels provides a more complete picture of the conditional distribution of the response variable.
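For intuition, an expectile generalizes the mean in the same way a quantile generalizes the median: it solves an asymmetrically weighted least-squares problem. The short sketch below is illustrative only (it is not part of the book's analysis) and computes a sample expectile by simple fixed-point iteration:

> expectile <- function(y, tau = 0.25, iter = 100) {
    m <- mean(y)
    for (i in seq_len(iter)) {
      w <- ifelse(y <= m, 1 - tau, tau)   # asymmetric weights
      m <- sum(w * y) / sum(w)
    }
    m
  }
> set.seed(1)
> x <- rnorm(1000)
> expectile(x, tau = 0.25)   # lies below the sample mean for tau < 0.5
> mean(x)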

A boosted version of Expectile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = ExpectReg(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = ExpectReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.

We illustrate the use of Expectile regression using the bodyfat data frame. Details of this data set are discussed on page 62. We continue our analysis from page 544, taking the final models developed for the quantile regression and fitting them using ExpectReg (note: Step 1 is outlined on page 544).

Step 2: Estimate Model & Assess Fit

We set fit25 as the 25th expectile regression using the three significant variables identified by the quantile regression (waistcirc, hipcirc, anthro3c; see page 544). We also set fit75 as the 75th expectile regression using the two significant variables (waistcirc and hipcirc) identified via quantile regression.

Ten-fold cross validation is used to determine the optimal number of iterations for each of the models. The optimal number is captured using mstop(cvm25) and mstop(cvm75) for the 25th expectile regression and 75th expectile regression respectively.


> fit25 <- glmboost(DEXfat ~ waistcirc + hipcirc + anthro3c,
                    data = bodyfat, family = ExpectReg(tau = 0.25),
                    control = boost_control(mstop = 1200, nu = 0.1,
                                            trace = FALSE))

> fit75 <- glmboost(DEXfat ~ waistcirc + hipcirc,
                    data = bodyfat, family = ExpectReg(tau = 0.75),
                    control = boost_control(mstop = 1200, nu = 0.1,
                                            trace = FALSE))

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)
> cvm25 <- cvrisk(fit25, folds = cv10f25)

> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)
> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 219

> mstop(cvm75)
[1] 826

> fit25[mstop(cvm25)]

Notice that the optimal number of iterations is 219 and 826 for the 25th expectile regression and 75th expectile regression respectively. The 90% confidence intervals are estimated for each model (we use a small bootstrap of B = 50; in practice you will want to use at least 1,000).

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

> CI25

Bootstrap Confidence Intervals
                      5%           95%
(Intercept)   -61.0691823   -48.9044440
waistcirc       0.1035435     0.2891155
hipcirc         0.3204619     0.5588739
anthro3c        4.4213144     7.2163618

> CI75

Bootstrap Confidence Intervals
                      5%           95%
(Intercept)   -60.1313272   -41.8993208
waistcirc       0.2478238     0.4967517
hipcirc         0.2903714     0.6397651

As we might have expected, all the variables included in each model are statistically significant. Both models have waistcirc and hipcirc as significant explanatory variables. It seems anthro3c is important in the lower weight percentile. These findings are similar to those observed for the boosted quantile regression models discussed on page 544.

Discrete Response Boosted Regression

Technique 82

Logistic Regression

Boosted logistic regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("logit")), control, ...)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages

We will build the model using the Sonar data set discussed on page 482.

> library(mboost)
> data(Sonar, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

We use 157 randomly selected observations to train the model and the remaining 51 observations as the test set.

> set.seed(107)
> n = nrow(Sonar)
> train <- sample(1:n, 157, FALSE)


Step 3: Estimate Model

We fit the model setting the maximum number of iterations to 200. Then a 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function.

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
                  family = Binomial(link = c("logit")),
                  control = boost_control(mstop = 200, nu = 0.1,
                                          trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 123

> fit[mstop(cvm)]

The optimal number is 123 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data

The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally, the confusion table is printed out.

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
      predicted
actual  M  R
     M 22  7
     R  4 18


The overall error rate of the model is (7 + 4)/51 ≈ 21.6%.
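The same figure can be computed directly from the confusion table, using the idiom the book adopts later in Technique 87 (this snippet is a suggested addition rather than part of the original listing):

> tb <- table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
> round(1 - sum(diag(tb)) / sum(tb), 3)   # misclassification rate
[1] 0.216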


Technique 83

Probit Regression

Boosted probit regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("probit")), control, ...)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial(); and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1 and 2 are outlined on page 552.

Step 3: Estimate Model

The model is fit by setting the maximum number of iterations to 200. Then a 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function.

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
                  family = Binomial(link = c("probit")),
                  control = boost_control(mstop = 200, nu = 0.1,
                                          trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 199

> fit[mstop(cvm)]

The optimal number is 199 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data

The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally, the confusion table is printed out.

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
      predicted
actual  M  R
     M 21  8
     R  4 18

The overall error rate of the model is (8 + 4)/51 = 23.5%.

Boosted Regression for Count & Ordinal Response Data

Technique 84

Poisson Regression

Modeling count variables is a common task for the data scientist. When the response variable (yi) is a count variable following a Poisson distribution with a mean that depends on the covariates x1, ..., xk, the Poisson regression model is used. A boosted Poisson regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Poisson(), control, ...)

Key parameters include z, the response variable; data, the data set of independent variables; family = Poisson(); and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

If yi ~ Poisson(λi), then the mean is equal to λi and the variance is also equal to λi. Therefore, in a Poisson regression both the mean and the variance depend on the covariates, i.e.

ln(λi) = β0 + β1x1i + ... + βkxki     (84.1)

Step 1: Load Required Packages

The required packages and data are loaded as follows:

> library(mboost)
> library(MixAll)
> data(DebTrivedi)
> set.seed(465)

> f <- ofp ~ hosp + health + numchron + gender + school + privins

The number of physician office visits, ofp, is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).

Step 2: Estimate Model & Assess Fit

The model is fit using glmboost with the maximum number of iterations equal to 1200. The parameter estimates are shown in Table 25.

> fit <- glmboost(f, data = DebTrivedi, family = Poisson(),
                  control = boost_control(mstop = 1200, nu = 0.1,
                                          trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)
    (Intercept)            hosp      healthpoor
          1.029           0.165           0.248
healthexcellent        numchron      gendermale
         -0.362           0.147          -0.112
         school      privinsyes
          0.026           0.202

Table 25: Initial coefficient estimates

A 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 26.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 36

> fit[mstop(cvm)]


> round(coef(fit, off2int = TRUE), 3)
    (Intercept)            hosp      healthpoor
          1.039           0.166           0.243
healthexcellent        numchron      gendermale
         -0.347           0.147          -0.107
         school      privinsyes
          0.026           0.197

Table 26: Cross validated coefficient estimates

The optimal number of iterations is 36 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are close to those observed prior to the 10-fold cross validation.

Finally, we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000).

> CI <- confint(fit, B = 50, level = 0.9)

> CI

Bootstrap Confidence Intervals
                          5%            95%
(Intercept)        0.9589353     1.18509474
hosp               0.1355033     0.21464596
healthpoor         0.1663162     0.32531872
healthexcellent   -0.4494058    -0.21435073
numchron           0.1207087     0.16322440
gendermale        -0.1591000    -0.04314812
school             0.0143659     0.03374010
privinsyes         0.1196714     0.23438877

All of the covariates are statistically significant and contribute to explaining the number of physician office visits.
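If fitted mean counts are needed, predictions can be generated from the boosted model in the usual way; the short snippet below is a suggested addition rather than part of the original worked example:

> pred.ofp <- predict(fit, newdata = DebTrivedi, type = "response")
> head(round(pred.ofp, 2))   # predicted mean number of office visits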


Technique 85

Negative Binomial Regression

Negative binomial regression is often used for over-dispersed count outcome variables. A boosted negative binomial regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = NBinomial(), control, ...)

Key parameters include z, the count response variable; data, the data set of independent variables; family = NBinomial(); and control, which limits the number of boosting iterations and the shrinkage parameter.

PRACTITIONER TIP

A feature of the data frame DebTrivedi reported by the original researchers (see page 86) is that the data has a high degree of unconditional over-dispersion relative to the standard Poisson model. Over-dispersion simply means that the data has greater variability than would be expected based on the given statistical model (in this case Poisson). One way to handle over-dispersion is to use the Negative Binomial regression model.
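A quick, informal way to see this over-dispersion (a suggested check, not part of the original text) is to compare the sample mean and variance of the response; under a Poisson model they should be roughly equal:

> mean(DebTrivedi$ofp)
> var(DebTrivedi$ofp)    # markedly larger than the mean for these data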

Step 1 is outlined on page 558

Step 2: Estimate Model & Assess Fit

We continue with the DebTrivedi data frame and estimate a Negative Binomial regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 1200. The parameter estimates are shown in Table 27.

> fit <- glmboost(f, data = DebTrivedi, family = NBinomial(c(0, 100)),
                  control = boost_control(mstop = 1200, nu = 0.1,
                                          trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)
    (Intercept)            hosp      healthpoor
          0.929           0.218           0.305
healthexcellent        numchron      gendermale
         -0.342           0.175          -0.126
         school      privinsyes
          0.027           0.224

Table 27: Initial coefficient estimates

Although the values of the estimates are somewhat different from those on page 559, the signs of the coefficients remain consistent.

A 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 28.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 312

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)

The optimal number of iterations is 312 (from mstop(cvm)). Therefore we use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross validation.

Finally, we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000).


    (Intercept)            hosp      healthpoor
          0.940           0.216           0.300
healthexcellent        numchron      gendermale
         -0.336           0.174          -0.122
         school      privinsyes
          0.026           0.220

Table 28: Cross validated coefficient estimates

> CI <- confint(fit, B = 50, level = 0.9)

> CI

Bootstrap Confidence Intervals
                          5%            95%
(Intercept)       0.88730254     1.06806832
hosp              0.17222572     0.24115790
healthpoor        0.21205691     0.35635293
healthexcellent  -0.47308517    -0.18802910
numchron          0.14796503     0.19085736
gendermale       -0.18601165    -0.05170944
school            0.01638135     0.03447440
privinsyes        0.13273522     0.25977131

All of the covariates are statistically significant.


Technique 86

Hurdle Regression

Hurdle regression is used for modeling count data where there is over-dispersion and an excess of zero counts in the outcome variable. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Hurdle(), control, ...)

Key parameters include z, the response variable; data, the data set of independent variables; family = Hurdle(); and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

A feature of the DebTrivedi data frame reported by the researchers Deb and Trivedi (see page 86) is that the data include a high proportion of zero counts, corresponding to zero recorded demand over the sample interval. One way to handle an excess of zero counts is to fit a negative binomial regression model to the non-zero counts. This can be achieved using the Hurdle function. In the hurdle approach, the process that determines the zero/nonzero count threshold is different from the process that determines the count once the hurdle (zero in this case) is crossed. Once the hurdle is crossed, the data are assumed to follow the density for a truncated negative binomial distribution.
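A simple suggested check (not part of the original text) of whether excess zeros are an issue is to look at the share of zero counts in the response:

> mean(DebTrivedi$ofp == 0)     # proportion of zero counts
> table(DebTrivedi$ofp == 0)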

Step 1 is outlined on page 558


Step 2: Estimate Model & Assess Fit

We continue with the DebTrivedi data frame and estimate a Hurdle regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 3000. The parameter estimates are shown in Table 29.

> fit <- glmboost(f, data = DebTrivedi, family = Hurdle(c(0, 100)),
                  control = boost_control(mstop = 3000, nu = 0.1,
                                          trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)
    (Intercept)            hosp      healthpoor
         -3.231           0.352           0.533
healthexcellent        numchron      gendermale
         -0.586           0.299          -0.206
         school      privinsyes
          0.044           0.395

Table 29: Initial coefficient estimates

Although the sign of the intercept and the values of the estimated coefficients are somewhat different from those on page 559, the signs of the remaining coefficients are preserved. A 10-fold cross validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 30.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 1489

    (Intercept)            hosp      healthpoor
         -3.189           0.341           0.509
healthexcellent        numchron      gendermale
         -0.567           0.294          -0.192
         school      privinsyes
          0.042           0.378

Table 30: Cross validated coefficient estimates

The optimal number is 1489 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross validation.


Technique 87

Proportional Odds Model

The Proportional Odds Model is a class of generalized linear models used for modelling the dependence of an ordinal response on discrete or continuous covariates.

It is used when it is not possible to measure the response variable on an interval scale. In biomedical research, for instance, constructs such as self-perceived health can be measured on an ordinal scale ("very unhealthy", "unhealthy", "healthy", "very healthy"). A boosted version of the model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = PropOdds(), control, ...)

Key parameters include z, the response, which is an ordered factor; data, the data set of explanatory variables; family = PropOdds(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data

The required packages and data are loaded as follows:

> library(mboost)
> library(ordinal)
> data(wine)
> set.seed(125)

The package ordinal contains the data frame wine, which is used in the analysis. It also contains the function clm, which we use later to estimate a non-boosted version of the Proportional Odds Model. Further details of wine are given on page 95.


Step 2: Estimate Model & Assess Fit

We estimate the model using rating as the response variable and contact and temp as the covariates. The model is estimated using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(rating ~ temp + contact, data = wine,
                  family = PropOdds(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)
> mstop(cvm)
[1] 167

> round(cvm[mstop(cvm)], 3)
[1] 1.42

> fit[mstop(cvm)]

The optimal number of iterations is 167 (from mstop(cvm)), with an empirical risk of 1.42. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int = TRUE), 3).

The estimate of a Proportional Odds Model using the function clm is also reported.

> round(coef(fit, off2int = TRUE), 3)
(Intercept)    tempwarm  contactyes
     -1.508       2.148       1.230

> fit2 <- clm(rating ~ temp + contact, data = wine, link = "logit")

> round(fit2$coefficients[5:6], 3)
  tempwarm contactyes
     2.503      1.528

Although the coefficients differ between the boosted and non-boosted versions of the model, both models yield the same class predictions.

> pred <- predict(fit, type = "class")


> pred2 <- predict(fit2, type = "class")

> table(pred)
pred
 1  2  3  4  5
 0 18 36 18  0

> table(pred2)
pred2
 1  2  3  4  5
 0 18 36 18  0

Finally, we compare the fitted model with the actual observations and calculate the overall error rate.

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))

> tb
      predicted
actual  1  2  3  4  5
     1  0  4  1  0  0
     2  0  9 12  1  0
     3  0  5 16  5  0
     4  0  0  5  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb))/sum(tb))

> round(error, 3)
[1] 0.556

The error rate for the model is approximately 56%.

Boosted Models for Survival Analysis

NOTE

Survival analysis is concerned with studying the time between entry to a study and a subsequent event (such as death). The objective is to use a statistical model to simultaneously explore the effects of several explanatory variables on survival. Two popular approaches are the Cox Proportional Hazard Model and Accelerated Failure Time Models.

Cox Proportional Hazard Model

The Cox Proportional Hazard Model is a statistical technique for exploring the relationship between survival (typically of a patient) and several explanatory variables. It takes the form

hi(t) = exp(β1X1i + ... + βkXki) h0(t)     (87.1)

where hi(t) is the hazard function for the ith individual at time t, h0(t) is the baseline hazard function, and X1, ..., Xk are the explanatory covariates.

The model provides an estimate of the treatment effect on survival after adjustment for other explanatory variables. In addition, it is widely used in medical statistics because it provides an estimate of the hazard (or risk) of death for an individual given their prognostic variables.

NOTE

The hazard function is the probability that an individual will experience an event (for example death) within a small time interval, given that the individual has survived up to the beginning of the interval. In the medical context it can therefore be interpreted as the risk of dying at time t.


Accelerated Failure Time Models

Parametric accelerated failure time (AFT) models provide an alternative to the (semi-parametric) Cox proportional hazards model for statistical modeling of survival data137. Unlike the Cox proportional hazards model, the AFT approach models survival times directly and assumes that the effect of a covariate is to accelerate or decelerate the life course of a response by some constant.

The AFT model treats the logarithm of survival time as the response variable and includes an error term that is assumed to follow a particular distribution.

Equation 87.2 shows the log-linear form of the AFT model for the ith individual, where log Ti is the log-transformed survival time, X1, ..., Xk are the explanatory covariates, εi represents the residual or unexplained variation in the log-transformed survival times, and μ and σ are the intercept and scale parameter respectively138.

log Ti = μ + β1X1i + ... + βkXki + σεi     (87.2)

Under the AFT model parameterization, the distribution chosen for Ti dictates the distribution of the error term εi. Popular survival time distributions include the Weibull distribution, the log-logistic distribution and the log-normal distribution.

PRACTITIONER TIP

If the baseline hazard function is known to follow a Weibull distribution, the accelerated failure and proportional hazards assumptions are equivalent.

Model               Empirical Risk
Weibull AFT         0.937
Lognormal AFT       1.046
Log-logistic AFT    0.934
Cox PH              2.616
Gehan               0.267

Table 31: Estimation of empirical risk using the rhDNase data frame


Assessing Fit

One of the first steps the data scientist faces in fitting survival models is to determine which distribution should be specified for the survival times Ti. One approach is to fit a model for each distribution and choose the model which minimizes Akaike's Information Criterion (AIC)139 or similar criteria. An alternative is to choose the model which minimizes the cross-validated estimate of empirical risk.

As an example, Table 31 shows the cross-validated estimate of empirical risk for various models using the rhDNase data frame.
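A sketch of how a comparison like Table 31 might be assembled is shown below. It assumes the data preparation of Technique 88 (page 573) has already been run, uses five-fold cross validation, and records the empirical risk at the optimal mstop for each candidate family. It is illustrative rather than a reproduction of the book's code, and because each family defines its own loss the figures will match Table 31 only if the same folds and seed are used.

> families <- list(Weibull = Weibull(), Lognormal = Lognormal(),
                   Loglog = Loglog(), CoxPH = CoxPH(), Gehan = Gehan())
> risks <- sapply(families, function(fam) {
    fit <- glmboost(Surv(time2, status) ~ trt + fev.ave, data = rhDNase,
                    family = fam,
                    control = boost_control(mstop = 1200, nu = 0.1))
    cvm <- cvrisk(fit, folds = cv(model.weights(fit), type = "kfold", B = 5))
    cvm[mstop(cvm)]
  })
> round(risks, 3)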


Technique 88

Weibull Accelerated Failure Time Model

The Weibull Accelerated Failure Time Model is one of the most popular distributional choices for modeling survival data. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Weibull(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Weibull(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data

The required packages and data are loaded as follows:

> library(mboost)
> library(simexaft)
> library(survival)
> data(rhDNase)
> set.seed(465)

Forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined via the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 2: Estimate Model & Assess Fit

We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(Surv(time2, status) ~ trt + fev.ave,
                  data = rhDNase, family = Weibull(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 500

> round(cvm[mstop(cvm)], 3)
[1] 0.937

> fit[mstop(cvm)]

The optimal number of iterations is 500 (from mstop(cvm)), with an empirical risk of 0.937. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Weibull AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance().

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
                  data = rhDNase, dist = "weibull")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.531       0.348       0.019

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.518       0.357       0.019

> round(nuisance(fit), 3)
[1] 0.92

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a small difference in the estimated values of the intercept and trt.


Technique 89

Lognormal Accelerated Failure Time Model

The Lognormal Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Lognormal(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Lognormal(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit

We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
                  family = Lognormal(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)

> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 456

> round(cvm[mstop(cvm)], 3)
[1] 1.046

> fit[mstop(cvm)]

The optimal number of iterations is 456 (from mstop(cvm)), with an empirical risk of 1.046. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Lognormal AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance().

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
                  data = rhDNase, dist = "lognormal")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.103       0.408       0.021

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.083       0.424       0.022

> round(nuisance(fit), 3)
[1] 1.441

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a somewhat larger difference in the estimated value of trt (0.408 boosted AFT versus 0.424 non-boosted Lognormal AFT).


Technique 90

Log-logistic Accelerated Failure Time Model

The Log-logistic Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Loglog(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Loglog(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit

We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
                  family = Loglog(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 445

> round(cvm[mstop(cvm)], 3)
[1] 0.934

> fit[mstop(cvm)]

The optimal number of iterations is 445 (from mstop(cvm)), with an empirical risk of 0.934. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Log-logistic AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance().

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
                  data = rhDNase, dist = "loglogistic")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.110       0.384       0.020

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.093       0.396       0.021

> round(nuisance(fit), 3)
[1] 0.794

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a larger difference in the estimated value of trt (0.384 boosted AFT versus 0.396 non-boosted Log-logistic AFT).


Technique 91

Cox Proportional Hazard Model

The Cox Proportional Hazard Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = CoxPH(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = CoxPH(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit

We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
                  family = CoxPH(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 263

> round(cvm[mstop(cvm)], 3)
[1] 2.616

> fit[mstop(cvm)]

The optimal number of iterations is 263 (from mstop(cvm)), with an empirical risk of 2.616. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

PRACTITIONER TIP

With boosting, the whole business of model diagnostics is greatly simplified. For example, you can quickly boost an additive non-proportional hazards model in order to check if it fits your data better than a linear Cox model. If the two models perform roughly equivalently, you know that assuming proportional hazards and linearity is reasonable.
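A sketch of this diagnostic idea, assuming the Step 1 objects from page 573 are in memory, is shown below: fit an additive Cox model with smooth base-learners and compare its cross-validated risk with that of the linear fit (2.616 above). The particular base-learner choices are illustrative, not taken from the book.

> fit.add <- gamboost(Surv(time2, status) ~ bbs(fev.ave) + bols(trt),
                      data = rhDNase, family = CoxPH(),
                      control = boost_control(mstop = 1200, nu = 0.1))
> cvm.add <- cvrisk(fit.add,
                    folds = cv(model.weights(fit.add), type = "kfold", B = 5))
> round(cvm.add[mstop(cvm.add)], 3)   # compare with the linear model's risk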

We then fit a non-boosted Cox Proportional Hazard Model using the coxph function from the survival package and compare the parameter estimates to the boosted model.

> fit1 <- coxph(Surv(time2, status) ~ trt + fev.ave, data = rhDNase)

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      1.431      -0.370      -0.020

> round(coef(fit1, off2int = TRUE), 3)
      trt   fev.ave
   -0.378    -0.021

Both models have similar treatment effects.


NOTE

For the Cox Proportional Hazard model, a positive regression coefficient for an explanatory variable means that the hazard is higher and thus the prognosis worse. Conversely, a negative regression coefficient implies a better prognosis for patients with higher values of that variable. In Figure 91.1 we plot the predicted survivor function.

> S1 <- survFit(fit)
> plot(S1)

Figure 91.1: Cox Proportional Hazard Model predicted survivor function for rhDNase


Technique 92

Gehan Loss Accelerated Failure Time Model

The Gehan Loss Accelerated Failure Time Model provides a rank-based estimator for survival data, where the loss function is defined as the sum of the pairwise absolute differences of residuals. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Gehan(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status)); data, the data set of explanatory variables; family = Gehan(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit

We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations.

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
                  family = Gehan(),
                  control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)

> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 578

> round(cvm[mstop(cvm)], 3)
[1] 0.267

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      3.718       0.403       0.022

The optimal number of iterations is 578 (from mstop(cvm)), with an empirical risk of 0.267. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int = TRUE), 3).


Notes

115. Bauer, Eric, and Ron Kohavi. "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants." Machine Learning 36.1 (1998): 2.
116. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)." The Annals of Statistics 28.2 (2000): 337-407.
117. A type of proximal gradient method for learning from data. See, for example:
     1. Parikh, Neal, and Stephen Boyd. "Proximal algorithms." Foundations and Trends in Optimization 1.3 (2013): 123-231.
     2. Polson, Nicholas G., James G. Scott, and Brandon T. Willard. "Proximal Algorithms in Statistics and Machine Learning." arXiv preprint arXiv:1502.03175 (2015).
118. Schapire, Robert E. "The strength of weak learnability." Machine Learning 5.2 (1990): 197-227.
119. Cheepurupalli, Kusma Kumari, and Raja Rajeswari Konduri. "Noisy reverberation suppression using adaboost based EMD in underwater scenario." International Journal of Oceanography 2014 (2014).
120. For details on EMD and its use see the classic paper by Rilling, Gabriel, Patrick Flandrin, and Paulo Goncalves. "On empirical mode decomposition and its algorithms." IEEE-EURASIP workshop on nonlinear signal and image processing, Vol. 3. NSIP-03, Grado (I), 2003.
121. See, for example:
     1. Kusma Kumari and K. Raja Rajeshwari. "Application of EMD as a robust adaptive signal processing technique in radar/sonar communications." International Journal of Engineering Science and Technology (IJEST), vol. 3, no. 12, pp. 8262-8266, 2011.
     2. Kusma Kumari, Ch., and K. Raja Rajeswari. "Enhancement of performance measures using EMD in noise reduction application." International Journal of Computer Applications, vol. 70, no. 5, pp. 10-14, 2013.
122. Karabulut, Esra Mahsereci, and Turgay Ibrikci. "Analysis of Cardiotocogram Data for Fetal Distress Determination by Decision Tree Based Adaptive Boosting Approach." Journal of Computer and Communications 2.09 (2014): 32.
123. See Newman, D.J., Heittech, S., Blake, C.L., and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. University of California Irvine, Department of Information and Computer Science.
124. Creamer, Germán, and Yoav Freund. "Automated trading with boosting and expert weighting." Quantitative Finance 10.4 (2010): 401-420.
125. Sam, Kam-Tong, and Xiao-Lin Tian. "Vehicle logo recognition using modest adaboost and radial tchebichef moments." International Conference on Machine Learning and Computing (ICMLC 2012), 2012.
126. Markoski, Branko, et al. "Application of Ada Boost Algorithm in Basketball Player Detection." Acta Polytechnica Hungarica 12.1 (2015).
127. Takiguchi, Tetsuya, et al. "An adaboost-based weighting method for localizing human brain magnetic activity." Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 2012.
128. Note that each sensor weight value calculated by Ada Boost provides an indication of how useful the MEG-sensor pair is for vowel recognition.
129. See the original paper by Freund, Y., Schapire, R. (1996). "Experiments with a New Boosting Algorithm." In "International Conference on Machine Learning", pp. 148-156.
130. See the seminal work of Bauer, E., Kohavi, R. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants." Journal of Machine Learning, 1999;36:105-139. They report a 27% relative improvement using Ada Boost compared to a single decision tree. Also take a look at the following:
     1. Ridgeway, G. "The State of Boosting." Computing Science and Statistics, 1999;31:172-181.
     2. Meir, R., Ratsch, G. "An Introduction to Boosting and Leveraging." Advanced Lectures on Machine Learning, 2003, pp. 118-183.
131. Gorman, R. Paul, and Terrence J. Sejnowski. "Analysis of hidden units in a layered network trained to classify sonar targets." Neural Networks 1.1 (1988): 75-89.
132. See Cohen, J. (1960). "A coefficient of agreement for nominal data." Educational and Psychological Measurement 20: 37-46.
133. See, for example:
     1. Buhlmann, P., Hothorn, T. "Boosting Algorithms: Regularization, Prediction and Model Fitting (with Discussion)." Statistical Science, 2007;22:477-522.
     2. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms. CRC Machine Learning & Pattern Recognition. Chapman & Hall, 2012.
     3. Schapire, R.E., Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.
134. See Hastie, T., Tibshirani, R., Friedman, J. H. (2009). "10. Boosting and Additive Trees." The Elements of Statistical Learning (2nd ed.). New York: Springer, pp. 337-384.
135. SAMME stands for stagewise additive modeling using a multi-class exponential loss function.
136. For additional details see Newey, W. & Powell, J. (1987). "Asymmetric least squares estimation and testing." Econometrica 55: 819-847.
137. See, for example, Wei, L.J. "The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis." Statist. Med. 1992;11:1871-1879.
138. See Collett, D. Modelling Survival Data in Medical Research, 2nd ed. CRC Press, Boca Raton, 2003.
139. See Akaike, H. "A new look at the statistical model identification." IEEE Trans. Autom. Control 1974;19:716-723.


Congratulations! You made it to the end. Here are three things you can do next:

1. Pick up your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html

2. Gift a copy of this book to your friends, co-workers, teammates or your entire organization.

3. If you found this book useful and have a moment to spare, I would really appreciate a short review. Your help in spreading the word is gratefully received.

Good luck!

P.S. Thanks for allowing me to partner with you on your data analysis journey.


OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips

100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you step by step how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY


They Laughed As They Gave Me The Data To Analyze... But Then They Saw My Charts

Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY


  • Preface
  • How to Get the Most from this Book
  • I Decision Trees
  • Practical Applications
  • Classification Trees
  • Technique 1 Classification Tree
  • Technique 2 C5.0 Classification Tree
  • Technique 3 Conditional Inference Classification Tree
  • Technique 4 Evolutionary Classification Tree
  • Technique 5 Oblique Classification Tree
  • Technique 6 Logistic Model Based Recursive Partitioning
  • Technique 7 Probit Model Based Recursive Partitioning
  • Regression Trees for Continuous Response Variables
  • Technique 8 Regression Tree
  • Technique 9 Conditional Inference Regression Tree
  • Technique 10 Linear Model Based Recursive Partitioning
  • Technique 11 Evolutionary Regression Tree
  • Decision Trees for Count & Ordinal Response Data
  • Technique 12 Poisson Decision Tree
  • Technique 13 Poisson Model Based Recursive Partitioning
  • Technique 14 Conditional Inference Ordinal Response Tree
  • Decision Trees for Survival Analysis
  • Technique 15 Exponential Algorithm
  • Technique 16 Conditional Inference Survival Tree
  • II Support Vector Machines
  • Practical Applications
  • Support Vector Classification
  • Technique 17 Binary Response Classification with C-SVM
  • Technique 18 Multicategory Classification with C-SVM
  • Technique 19 Multicategory Classification with nu-SVM
  • Technique 20 Bound-constraint C-SVM classification
  • Technique 21 Weston - Watkins Multi-Class SVM
  • Technique 22 Crammer - Singer Multi-Class SVM
  • Support Vector Regression
  • Technique 23 SVM eps-Regression
  • Technique 24 SVM nu-Regression
  • Technique 25 Bound-constraint SVM eps-Regression
  • Support Vector Novelty Detection
  • Technique 26 One-Classification SVM
  • III Relevance Vector Machine
  • Practical Applications
  • Technique 27 RVM Regression
  • IV Neural Networks
  • Practical Applications
  • Examples of Neural Network Classification
  • Technique 28 Resilient Backpropagation with Backtracking
  • Technique 29 Resilient Backpropagation
  • Technique 30 Smallest Learning Rate
  • Technique 31 Probabilistic Neural Network
  • Technique 32 Multilayer Feedforward Neural Network
  • Examples of Neural Network Regression
  • Technique 33 Resilient Backpropagation with Backtracking
  • Technique 34 Resilient Backpropagation
  • Technique 35 Smallest Learning Rate
  • Technique 36 General Regression Neural Network
  • Technique 37 Monotone Multi-Layer Perceptron
  • Technique 38 Quantile Regression Neural Network
  • V Random Forests
  • Practical Applications
  • Technique 39 Classification Random Forest
  • Technique 40 Conditional Inference Classification Random Forest
  • Technique 41 Classification Random Ferns
  • Technique 42 Binary Response Random Forest
  • Technique 43 Binary Response Random Ferns
  • Technique 44 Survival Random Forest
  • Technique 45 Conditional Inference Survival Random Forest
  • Technique 46 Conditional Inference Regression Random Forest
  • Technique 47 Quantile Regression Forests
  • Technique 48 Conditional Inference Ordinal Random Forest
  • VI Cluster Analysis
  • Practical Applications
  • Partition Based Methods
  • Technique 49 K-Means
  • Technique 50 Clara Algorithm
  • Technique 51 PAM Algorithm
  • Technique 52 Kernel Weighted K-Means
  • Hierarchy Based Methods
  • Technique 53 Hierarchical Agglomerative Cluster Analysis
  • Technique 54 Agglomerative Nesting
  • Technique 55 Divisive Hierarchical Clustering
  • Technique 56 Exemplar Based Agglomerative Clustering
  • Fuzzy Methods
  • Technique 57 Rousseeuw-Kaufman's Fuzzy Clustering Method
  • Technique 58 Fuzzy K-Means
  • Technique 59 Fuzzy K-Medoids
  • Other Methods
  • Technique 60 Density-Based Cluster Analysis
  • Technique 61 K-Modes Clustering
  • Technique 62 Model-Based Clustering
  • Technique 63 Clustering of Binary Variables
  • Technique 64 Affinity Propagation Clustering
  • Technique 65 Exemplar-Based Agglomerative Clustering
  • Technique 66 Bagged Clustering
  • VII Boosting
  • Practical Applications
  • Binary Boosting
  • Technique 67 Ada Boost.M1
  • Technique 68 Real Ada Boost
  • Technique 69 Gentle Ada Boost
  • Technique 70 Discrete L2 Boost
  • Technique 71 Real L2 Boost
  • Technique 72 Gentle L2 Boost
  • Multi-Class Boosting
  • Technique 73 SAMME
  • Technique 74 Breiman's Extension
  • Technique 75 Freund's Adjustment
  • Continuous Response Boosted Regression
  • Technique 76 L2 Regression
  • Technique 77 L1 Regression
  • Technique 78 Robust Regression
  • Technique 79 Generalized Additive Model
  • Technique 80 Quantile Regression
  • Technique 81 Expectile Regression
  • Discrete Response Boosted Regression
  • Technique 82 Logistic Regression
  • Technique 83 Probit Regression
  • Boosted Regression for Count & Ordinal Response Data
  • Technique 84 Poisson Regression
  • Technique 85 Negative Binomial Regression
  • Technique 86 Hurdle Regression
  • Technique 87 Proportional Odds Model
  • Boosted Models for Survival Analysis
  • Technique 88 Weibull Accelerated Failure Time Model
  • Technique 89 Lognormal Accelerated Failure Time Model
  • Technique 90 Log-logistic Accelerated Failure Time Model
  • Technique 91 Cox Proportional Hazard Model
  • Technique 92 Gehan Loss Accelerated Failure Time Model
About This Book

This jam-packed book takes you under the hood with step by step instructions using the popular and free R predictive analytics package. It provides numerous examples, illustrations and exclusive use of real data to help you leverage the power of predictive analytics. A book for every data analyst, student and applied researcher. Here is what it can do for you:

• BOOST PRODUCTIVITY: Bestselling author and data scientist Dr. N.D. Lewis will show you how to build predictive analytic models in less time than you ever imagined possible. Even if you're a busy professional or a student with little time. By spending as little as 10 minutes a day working through the dozens of real world examples, illustrations, practitioner tips and notes, you'll be able to make giant leaps forward in your knowledge, strengthen your business performance, broaden your skill-set and improve your understanding.

• SIMPLIFY ANALYSIS: You will discover over 90 easy to follow applied predictive analytic techniques that can instantly expand your modeling capability. Plus you'll discover simple routines that serve as a checklist you repeat next time you need a specific model. Even better, you'll discover practitioner tips, work with real data and receive suggestions that will speed up your progress. So even if you're completely stressed out by data, you'll still find in this book tips, suggestions and helpful advice that will ease your journey through the data science maze.

• SAVE TIME: Imagine having at your fingertips easy access to the very best of predictive analytics. In this book you'll learn fast, effective ways to build powerful models using R. It contains over 90 of the most successful models used for learning from data, with step by step instructions on how to build them easily and quickly.

• LEARN FASTER: 92 Applied Predictive Modeling Techniques in R offers a practical, results orientated approach that will boost your productivity, expand your knowledge and create new and exciting opportunities for you to get the very best from your data. The book works because you eliminate the anxiety of trying to master every single mathematical detail. Instead your goal at each step is to simply focus on a single routine using real data that only takes about 5 to 15 minutes to complete. Within this routine is a series of actions by which the predictive analytic model is constructed. All you have to do is follow the steps. They are your checklist for use and reuse.

• IMPROVE RESULTS: Want to improve your predictive analytic results, but don't have enough time? Right now there are a dozen ways to instantly improve your predictive models' performance. Odds are these techniques will only take a few minutes apiece to complete. The problem? You might feel like there's not enough time to learn how to do them all. The solution is in your hands. It uses R, which is free, open-source and extremely powerful software.

In this rich, fascinating, surprisingly accessible guide, data scientist Dr. N.D. Lewis reveals how predictive analytics works and how to deploy its power using the free and widely available R predictive analytics package. The book serves practitioners and experts alike by covering real life case studies and the latest state-of-the-art techniques. Everything you need to get started is contained within this book. Here is some of what is included:

• Support Vector Machines

• Relevance Vector Machines

• Neural networks

• Random forests

• Random ferns

• Classical Boosting

• Model based boosting

• Decision trees

• Cluster Analysis

For people interested in statistics, machine learning, data analysis, data mining and future hands-on practitioners seeking a career in the field, it sets a strong foundation, delivers the prerequisite knowledge and whets your appetite for more. Buy the book today. Your next big breakthrough using predictive analytics is only a page away.


OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips

100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you, step by step, how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY


They laughed as they gave me the data to analyze. But then they saw my charts.

Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY


Preface

In writing this text my intention was to collect together in a single place practical predictive modeling techniques, ideas and strategies that have been proven to work but which are rarely taught in business schools, data science courses or contained in any other single text.

On numerous occasions researchers in a wide variety of subject areas have asked "how can I quickly understand and build a particular predictive model?" The answer used to involve reading complex mathematical texts and then programming complicated formulas in languages such as C, C++ and Java. With the rise of R, predictive analytics is now easier than ever. 92 Applied Predictive Modeling Techniques in R is designed to give you rapid access to over ninety of the most popular predictive analytic techniques. It shows you, step by step, how to build each model in the free and popular R statistical package.

The material you are about to read is based on my personal experience, articles I've written, hundreds of scholarly articles I've read over the years, experimentation (some successful, some failed), conversations I've had with data scientists in various fields, and feedback I've received from numerous presentations to people just like you.

This book came out of the desire to put predictive analytic tools in the hands of the practitioner. The material is therefore designed to be used by the applied data scientist whose primary focus is on delivering results rather than mathematical lemmas or statistical theory. Examples of each technique are clearly described and can be typed directly into R as printed on the page.

This book in your hands is an enlarged, revised and updated collection of my previous works on the subject. I've condensed into this volume the best practical ideas available.

Data science is all about extracting meaningful structure from data. It is always a good idea for the data scientist to study how other users and researchers have used a technique in actual practice. This is primarily because practice often differs substantially from the classroom or theoretical text books. To this end, and to accelerate your progress, actual real world applications of the techniques are given at the start of each section.

These illustrative applications cover a vast range of disciplines, incorporating numerous diverse topics such as intelligent shoes, forecasting the stock market, signature authentication, oil sand pump prognostics, detecting deception in speech, electric fish localization, tropical forest carbon mapping, vehicle logo recognition, understanding rat talk and many more. I have also provided detailed references to these applications for further study at the end of each section.

In keeping with the zeitgeist of R, copies of the vast majority of applied articles referenced in this text are available for free.

New users to R can use this book easily and without any prior knowledge. This is best achieved by typing in the examples as they are given and reading the comments which follow. Copies of R and free tutorial guides for beginners can be downloaded at https://www.r-project.org/

I have found over and over that a data scientist who has exposure to a broad range of modeling tools and applications will run circles around the narrowly focused genius who has only been exposed to the tools of their particular discipline.

Greek philosopher Epicurus once said "I write this not for the many, but for you; each of us is enough of an audience for the other." Although the ideas in this book reach out to thousands of individuals, I've tried to keep Epicurus's principle in mind: to have each page you read give meaning to just one person - YOU.

I invite you to put what you read in these pages into action. To help you do that, I've created "12 Resources to Supercharge Your Productivity in R"; it is yours for free. Simply go to http://www.auscov.com/tools.html and download it now. It's my gift to you. It shares with you 12 of the very best resources you can use to boost your productivity in R.

I've spoken to thousands of people over the past few years. I'd love to hear your experiences using the ideas in this book. Contact me with your stories, questions and suggestions at Info@NigelDLewis.com

Now it's your turn.

P.S. Don't forget to sign-up for your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html


How to Get the Most from this Book

There are at least three ways to use this book. First, you can dip into it as an efficient reference tool. Flip to the technique you need and quickly see how to calculate it in R. For best results type in the example given in the text, examine the results, and then adjust the example to your own data. Second, browse through the real world examples, illustrations, practitioner tips and notes to stimulate your own research ideas. Third, by working through the numerous examples you will strengthen your knowledge and understanding of both applied predictive modeling and R.

Each section begins with a brief description of the underlying modeling methodology, followed by a diverse array of real world applications. This is followed by a step by step guide, using real data, for each predictive analytic technique.

PRACTITIONER TIP

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Enter the following:

> install.packages("installr")
> installr::updateR()

If a package mentioned in the text is not installed on your machine you can download it by typing install.packages("package_name"). For example, to download the ada package you would type in the R console:

> install.packages("ada")

Once a package is installed you must call it. You do this by typing in the R console:

> require(ada)


The ada package is now ready for use. You only need to type this once, at the start of your R session.

PRACTITIONER TIP

You should only download packages from CRAN using encrypted HTTPS connections. This provides much higher assurance that the code you are downloading is from a legitimate CRAN mirror rather than from another server posing as one. Whilst downloading a package from a HTTPS connection you may run into an error message something like:

unable to access index for repository https://cran.rstudio.com

This is particularly common on Windows. The internet2.dll has to be activated on versions before R-3.2.2. If you are using an older version of R, before downloading a new package enter the following:

> setInternet2(TRUE)

Functions in R often have multiple parameters. In the examples in this text I focus primarily on the key parameters required for rapid model development. For information on additional parameters available in a function, type in the R console ?function_name. For example, to find out about additional parameters in the ada function you would type:

> ?ada

Details of the function and additional parameters will appear in your default web browser. After fitting your model of interest you are strongly encouraged to experiment with additional parameters. I have also included the set.seed method in the R code samples throughout this text to assist you in reproducing the results exactly as they appear on the page. R is available for all the major operating systems. Due to the popularity of Windows, examples in this book use the Windows version of R.
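As a small illustration (not from the original text) of how set.seed makes results reproducible:

# calling set.seed before any random operation fixes the random number stream
set.seed(107)
sample(1:10, 3)
# re-running the two lines above always returns the same three numbers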


PRACTITIONER TIP

Can't remember what you typed two hours ago? Don't worry, neither can I. Provided you are logged into the same R session you simply need to type:

> history(Inf)

It will return your entire history of entered commands for your current session.

You don't have to wait until you have read the entire book to incorporate the ideas into your own analysis. You can experience their marvelous potency for yourself almost immediately. You can go straight to the technique of interest and immediately test, create and exploit it in your own research and analysis.

PRACTITIONER TIP

On 32-bit Windows machines R can only use up to 3Gb of RAM, regardless of how much you have installed. Use the following to check memory availability:

> memory.limit()

To remove all objects from memory:

> rm(list = ls())

Applying the ideas in this book will transform your data science practice. If you utilize even one tip or idea from each chapter, you will be far better prepared not just to survive but to excel when faced by the challenges and opportunities of the ever expanding deluge of exploitable data.

Now let's get started.


Part I

Decision Trees


The Basic Idea

We begin with decision trees because they are one of the most popular techniques in data mining1. They can be applied to both regression and classification problems. Part of the reason for their popularity lies in the ability to present results in a simple, easy to understand tree format. True to its name, the decision tree selects an outcome by descending a tree of possible decisions.

NOTE

Decision trees are, in general, a non-parametric inductive learning technique able to produce classifiers for a given problem which can assess new, unseen situations and/or reveal the mechanisms driving a problem.

An Illustrative Example

It will be helpful to build intuition by first looking at a simple example of what this technique does with data. Imagine you are required to build an automatic rules based system to buy cars. The goal is to make decisions as new vehicles are presented to you. Let us say you have access to data on the attributes (also known as features): Road tested miles driven, Price of vehicle, Likability of the current owner (measured on a continuous scale from 0 to 100, 100 = "love them"), Odometer miles, and Age of the vehicle in years.

A total of 100 measurements are obtained on each variable and also on the decision (yes or no to purchase the vehicle). You run this data through a decision tree algorithm and it produces the tree shown in Figure 1. Several things are worth pointing out about this decision tree. First, the number of observations falling in "yes" and "no" is reported. For example, in "Road tested miles < 100" we see there are 71 observations. Second, the tree is immediately able to model new data using the rules developed. Third, it did not use all of the variables to develop a decision rule (Likability of the current owner and Price were excluded).

Let us suppose that this tree classified 80% of the observations correctly. It is still worth investigating whether a more parsimonious and more accurate tree can be obtained. One way to achieve this is to transform the variables. Let us suppose we create the additional variable Odometer / Age and then rebuild the tree. The result is shown in Figure 2. It turns out that this decision tree, which chooses only the transformed data, ignores all the other attributes. Let us assume this tree has a prediction accuracy of 90%. Two important points become evident. First, a more parsimonious and accurate tree was possible; and second, to obtain this tree it was necessary to include the variable transformations in the second run of the decision tree algorithm.

The example illustrates that the decision tree must have the variables supplied in the appropriate form to obtain the most parsimonious tree. In practice, domain experts will often advise on the appropriate transformation of attributes.

Figure 1 Car buying decision tree



Figure 2 Decision tree obtained by transforming variables

PRACTITIONER TIP

I once developed what I thought was a great statistical model for an area I knew little about. Guess what happened? It was a total flop. Why? Because I did not include domain experts in my design and analysis phase. If you are building a decision tree for knowledge discovery it is important to include domain experts alongside you and throughout the entire analysis process. Their input will be required to assess the final decision tree and opine on "reasonability".
Another advantage of using domain experts is that the complexity of the final decision tree (in terms of the number of nodes or the number of rules that can be extracted from a tree) may be reduced. Inclusion of domain experts almost always helps the data scientist create a more efficient set of rules.

Decision Trees in Practice

The basic idea behind a decision tree is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. At each internal node in the tree a single attribute value is compared with a single threshold value. In other words, each node corresponds to a "measurement" of a particular attribute, that is, a question, often of the "yes" or "no" variety, which can be asked about that attribute's value (e.g. "is age less than 48 years?"). One of the two child nodes is then selected based on the result of the comparison, leading either to another measurement or to a leaf.

When a leaf node is reached, the single class associated with that node is the final prediction. In other words, the terminal nodes carry the information required to classify the data.

A real world example of decision trees is shown in Figure 3; they were developed by Koch et al.2 for understanding biological cellular signaling networks. Notice that four trees of increasing complexity are developed by the researchers.

Figure 3 Four decision trees developed by Koch et al. Note that decision trees (A), (B), (C) and (D) have a misclassification error of 30.85%, 24.68%, 21.17% and 18.13% respectively. However, tree (A) is the easiest to interpret. Source: Koch et al.

The Six Advantages of Decision Trees

1. Decision trees can be easy to understand, with intuitively clear rules understandable to domain experts.

2. Decision trees offer the ability to track and evaluate every step in the decision-making process. This is because each path through a tree consists of a combination of attributes which work together to distinguish between classes. This simplicity gives useful insights into the inner workings of the method.

3. Decision trees can handle both nominal and numeric input attributes and are capable of handling data sets that contain misclassified values.

4. Decision trees can easily be programmed for use in real time systems. A great illustration of this is the research of Hailemariam et al.3 who use a decision tree to determine real time building occupancy.

5. They are relatively inexpensive computationally and work well on both large and small data sets. Figure 4 illustrates an example of a very large decision tree used in Bioinformatics4; the smaller tree represents the tuned version for greater readability.

6. Decision trees are considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and on the classifier structure.


Figure 4 An example of a large Bioinformatics decision tree and the visually tuned version from the research of Stiglic et al.

NOTE

One weakness of decision trees is the risk of over fitting. This occurs when statistically insignificant patterns end up influencing classification results. An over fitted tree will perform poorly on new data. Bohanec and Bratko5 studied the role of pruning a decision tree for better decision making. They found that pruning can reduce the risk of over fitting because it results in smaller decision trees that exploit fewer attributes.

How Decision Trees Work

Exact implementation details differ somewhat depending on the algorithm used, but the general principles are very similar across methods and contain the following steps:

1. Provide a set (S) of examples with known classified states. This is called the learning set.

2. Select a set of test attributes a1, a2, ..., aN. These can be viewed as the parameters of the system and are selected because they contain essential information about the problem of concern. In the car example on page 6 the attributes were: a1 = Road tested miles driven, a2 = Price of vehicle, a3 = Likability of the current owner, a4 = Odometer miles, a5 = Age of the vehicle in years.

3. Starting at the top node of the tree (often called the root node) with the entire set of examples S, split S using a test on one or more attributes. The goal is to split S into subsets of increasing classification purity.

4. Check the results of the split. If every partition is pure, in the sense that all examples in the partition belong to the same class, then stop. Label each leaf node with the name of the class.

5. Recursively split any partitions that are not "pure".

6. The procedure is stopped when all the newly created nodes are 'terminal' ones containing "pure enough" learning subsets.

Decision tree algorithms vary primarily in how they choose to "split" the data, when to stop splitting, and how they prune the trees they produce.

NOTE

Many of the decision tree algorithms you will encounter are based on a greedy top-down recursive partitioning strategy for tree growth. They use different variants of impurity measures, such as information gain6, gain ratio7, the gini-index8 and distance-based measures9.


Practical Applications

Intelligent Shoes for Stroke Victims

Zhang et al.10 develop a wearable shoe (SmartShoe) to monitor physical activity in stroke patients. The data set consisted of 12 patients who had experienced a stroke.

Supervised by a physical therapist, SmartShoe collected data from the patients using eight posture and activity groups: sitting, standing, walking, ascending stairs, descending stairs, cycling on a stationary bike, being pushed in a wheelchair and propelling a wheelchair.

Patients performed each activity for between 1 and 3 minutes. Data was collected from the SmartShoe every 2 seconds, from which feature vectors were computed.

Half the feature vectors were selected at random for the training set. The remainder were used for validation. The C5.0 algorithm was used to build the decision tree.

The researchers constructed both subject specific and group decision trees. The group models were developed using Leave-One-Out cross-validation.

The performance results for five of the patients are shown in Table 1. As might be expected, the individual models fit better than the group models. For example, for patient 5 the accuracy of the patient specific tree was 98.4%; however, using the group tree the accuracy declined to 75.5%. This difference in performance might be in part due to over fitting of the individual specific tree. The group models were trained using data from multiple subjects and therefore can be expected to have lower overall performance scores.

Patient       1      2      3      4      5     Average
Individual   96.5   97.4   99.8   97.2   98.4    97.9
Group        87.5   91.1   64.7   82.2   75.5    80.2

Table 1 Zhang et al.'s decision tree performance metrics (accuracy, %)


PRACTITIONER TIP

Decision trees are often validated by calculating sensitivity and specificity. Sensitivity is the ability of the classifier to identify positive results, while specificity is the ability to distinguish negative results:

Sensitivity = NTP / (NTP + NFN) × 100    (1)

Specificity = NTN / (NTN + NFP) × 100    (2)

where NTP is the number of true positives, NTN the number of true negatives, NFN the number of false negatives and NFP the number of false positives.
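As a minimal sketch (not taken from the original text), these quantities can be computed in R from a confusion matrix of observed versus predicted classes:

# illustrative data: observed and predicted class labels
observed  <- factor(c("pos","pos","neg","neg","pos","neg","pos","neg"),
                    levels = c("pos","neg"))
predicted <- factor(c("pos","neg","neg","neg","pos","pos","pos","neg"),
                    levels = c("pos","neg"))

cm  <- table(observed, predicted)
NTP <- cm["pos", "pos"]   # true positives
NTN <- cm["neg", "neg"]   # true negatives
NFN <- cm["pos", "neg"]   # false negatives
NFP <- cm["neg", "pos"]   # false positives

sensitivity <- NTP / (NTP + NFN) * 100
specificity <- NTN / (NTN + NFP) * 100
round(c(sensitivity = sensitivity, specificity = specificity), 1)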

Micro Ribonucleic Acid

MicroRNAs (miRNAs) are non-protein coding ribonucleic acids (RNAs) that attenuate protein production in P bodies11. Williams et al.12 develop a MicroRNA decision tree. For the training set the researchers used known miRNAs from various plant species as positive controls and non-miRNA sequences as negative controls.

The typical size of their training set consisted of 5,294 cases using 29 attributes. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation.

After training, the researchers focus on attribute usage information. Table 2 shows the top ten attribute usage for a typical training run. The researchers report that other training runs show similar usage. The values represent the percentage of sequences that required that attribute for classification. Several attributes, such as DuplexEnergy, minMatchPercent and C content, are required for all sequences to be classified. Note that G and C are directly related to the stability of the duplex. Sensitivity and specificity were as high as 84.08% and 98.53% respectively.

An interesting question is: if all miRNAs in each taxonomic category studied by the researchers are systematically excluded from training, while including all others, how well does the predictor do when tested on the excluded category? Table 3 provides the answer. The ability to correctly identify known miRNAs ranged from 78% for the Salicaceae to 100% for seven of the groups shown in Table 3.


Usage   Attribute
100%    G
100%    C
100%    T
100%    DuplexEnergy
100%    minMatchPercent
100%    DeltaGnorm
100%    G + T
100%    G + C
 98%    duplexEnergyNorm
 86%    NormEnergyRatio

Table 2 Top ten attribute usage for one training run of the classifier reported by Williams et al.

The researchers conclude by stating "We have shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0".

Taxonomic Group   Correctly classified (%)   % of full set excluded
Embryophyta                94                        9.16
Lycopodiophyta            100                        2.65
Brassicaceae              100                       20.22
Caricaceae                100                        0.05
Euphorbiaceae             100                        0.34
Fabaceae                  100                       27.00
Salicaceae                 78                        3.52
Solanaceae                 93                        0.68
Vitaceae                   94                        4.29
Rutaceae                  100                        0.43
Panicoideae                95                        8.00
Poaceae                   100                       19.48
Pooideae                   80                        3.18

Table 3 Results from exclusion of each of the 13 taxonomic groups by Williams et al.


NOTE

A decision tree produced by an algorithm is usually not optimal in the sense of statistical performance measures such as the log-likelihood, squared errors and so on. It turns out that finding the "optimal tree", if one exists, is computationally intractable (or NP-hard, technically speaking).

Acute Liver Failure

Nakayama et al.13 use decision trees for the prognosis of acute liver failure (ALF) patients. The data set consisted of 1,022 ALF patients seen between 1998 and 2007 (698 patients seen between 1998 and 2003 and 324 patients seen between 2004 and 2007).

Measurements on 73 medical attributes at the onset of hepatic encephalopathy14, and 5 days later, were collected from 371 of the 698 patients seen between 1998 and 2003.

Two decision trees were built. The first was used to predict (using 5 attributes) the outcome of the patient at the onset of hepatic encephalopathy. The second decision tree was used to predict (using 7 attributes) the outcome at 5 days after the onset of grade II or more severe hepatic encephalopathy. The decision trees were validated using data from 160 of the 324 patients seen between 2004 and 2007. Decision tree performance is shown in Table 4.

                                 Decision Tree I        Decision Tree II
                                 Outcome at the onset   Outcome at 5 days
Accuracy (patients 1998-2003)        79.0%                  83.6%
Accuracy (patients 2004-2007)        77.6%                  82.6%

Table 4 Nakayama et al.'s decision tree performance metrics


NOTE

The performance of a decision tree is often measured in terms of three characteristics:

• Accuracy - The percentage of cases correctly classified.

• Sensitivity - The percentage of cases correctly classified as belonging to class A among all observations known to belong to class A.

• Specificity - The percentage of cases correctly classified as belonging to class B among all observations known to belong to class B.

Traffic Accidents

de Oña et al.15 investigate the use of decision trees for analyzing road accidents on rural highways in the province of Granada, Spain. Regression-type generalized linear models, Logit models and Probit models have been the techniques most commonly used to conduct such analyses16.

Three decision tree models are developed (CART, ID3 and C4.5) using data collected from 2003 to 2009. Nineteen independent variables, reported in Table 5, are used to build the decision trees.

Accident type         Age                    Atmospheric factors
Safety barriers       Cause                  Day of week
Lane width            Lighting               Month
Number of injuries    Number of occupants    Paved shoulder
Pavement width        Pavement markings      Gender
Shoulder type         Sight distance         Time
Vehicle type

Table 5 Variables used from police accident reports by de Oña et al.

The accuracy results of their analysis are shown in Table 6. Overall the decision trees showed modest improvement over chance.


CART    C4.5    ID3
55.87   54.16   52.72

Table 6 Accuracy results (percentage) reported by de Oña et al.

PRACTITIONER TIP

Even though de Oña et al. tested three different decision tree algorithms (CART, ID3 and C4.5), their results led to a very modest improvement over chance. This will also happen to you on very many occasions. Rather than clinging on to a technique (because it is the latest technique or the one you happen to be most familiar with), the professional data scientist seeks out and tests alternative methods. In this text you have over ninety of the best applied modeling techniques at your fingertips. If decision trees don't "cut it", try something else.


Electrical Power Losses

A non-technical loss (NTL) is defined by electrical power companies as any consumed electricity which is not billed. This could be because of measurement equipment failure or fraud. Traditionally, power utilities have monitored NTL by making in situ inspections of equipment, especially for those customers that have very high or close to zero levels of energy consumption. Monedero et al.17 develop a CART decision tree to enhance the detection rate of NTL.

A sample of 38,575 customer accounts was collected over two years in Catalonia, Spain. For each customer various indicators were measured18.

The best decision tree had a depth of 5, with the terminal node identifying customers with the highest likelihood of NTL. A total of 176 customers were identified by the model. This number was greater than the expected total of 85 and too many for the utility company to inspect in situ. The researchers therefore merged their results with a Bayesian network model and thereby reduced the estimated number to 64.

This example illustrates the important point that the predictive model is just one aspect that goes into real world decisions. Oftentimes a model will be developed but deemed impracticable by the end user.

NOTE

Cross-validation refers to a technique used to allow for the training and testing of inductive models. Williams et al. used a leave-one-out cross-validation. Leave-one-out cross-validation involves taking out one observation from your sample and training the model with the rest. The predictor just trained is applied to the excluded observation. One of two possibilities will occur: the predictor is correct on the previously unseen control, or it is not. The removed observation is then returned, the next observation is removed, and again training and testing are done. This process is repeated for all observations. When this is completed, the results are used to calculate the accuracy of the model, often measured in terms of sensitivity and specificity.


Tracking Tetrahymena Pyriformis Cells

Tracking and matching individual biological cells in real time is a challenging task. Since decision trees provide an excellent tool for real time and rapid decision making, they have potential in this area. Wang et al.19 consider the issue of real time tracking and classification of Tetrahymena Pyriformis cells. The issue is whether a cell in the current video frame is the same cell as in the previous video frame.

A 23-dimensional feature vector is developed, and whether two regions in different frames represent the same cell is manually determined. The training set consisted of 1,000 frames from 2 videos. The videos were captured at 8 frames per second, with each frame an 8-bit gray level image. The researchers develop two decision trees: the first (T1) trained using the feature vector and manual classification, and the second (T2) trained using a truncated set of features. The error rates for each tree by tree depth are reported in Table 7. Notice that in this case T1 substantially outperforms T2, indicating the importance of using the full feature set.

Tree Depth    T1      T2
5            1.48   14.02
8            1.37   12.49
10           1.55   13.61

Table 7 Error rates by tree depth for T1 & T2 reported by Wang et al.

PRACTITIONER TIP

Cross-validation is often used to prevent over-fitting a model to the data. In n-fold cross-validation we first divide the training set into n subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining (n-1) subsets. Thus each instance of the whole training set is predicted once. The advantage of this method over a random selection of training samples is that all observations are used for either training (n-1 times) or evaluation (once). Cross-validation accuracy is measured as the percentage of data that are correctly classified. A minimal coding sketch of this procedure is given below.
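The sketch below is illustrative only (it is not taken from the original text); it shows n-fold cross-validation of a classification tree on the Vehicle data in base R:

# 10-fold cross-validation of a classification tree on the Vehicle data
library(tree)
library(mlbench)
data(Vehicle)

set.seed(107)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(Vehicle)))   # fold label for each row

acc <- numeric(k)
for (i in 1:k) {
  test   <- which(folds == i)
  fit.cv <- tree(Class ~ ., data = Vehicle[-test, ])
  pred   <- predict(fit.cv, newdata = Vehicle[test, ], type = "class")
  acc[i] <- mean(pred == Vehicle$Class[test])            # accuracy on the held out fold
}
round(mean(acc), 3)   # cross-validation accuracy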


Classification Trees


Technique 1

Classification Tree

A classification tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the data frame of classes; and data, the data set of attributes with which you wish to build the tree.

PRACTITIONER TIP

To obtain information on which version of R you are running, loaded packages and other information, use:

> sessionInfo()

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(tree)
> library(mlbench)
> data(Vehicle)


NOTE

The Vehicle data set20 was collected to classify a Corgi vehicle silhouette as one of four types (double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400). The data frame contains 846 observations on 18 numerical features extracted from the silhouettes and one nominal variable defining the class of the objects (see Table 8).

You can access the features directly using Table 8 as a reference. For example, a summary of the Comp and Circ features is obtained by typing:

> summary(Vehicle[1])
      Comp
 Min.   : 73.00
 1st Qu.: 87.00
 Median : 93.00
 Mean   : 93.68
 3rd Qu.:100.00
 Max.   :119.00

> summary(Vehicle[2])
      Circ
 Min.   :33.00
 1st Qu.:40.00
 Median :44.00
 Mean   :44.86
 3rd Qu.:49.00
 Max.   :59.00

Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Index   R name         Description
1       Comp           Compactness
2       Circ           Circularity
3       D.Circ         Distance circularity
4       Rad.Ra         Radius ratio
5       Pr.Axis.Ra     Pr. axis aspect ratio
6       Max.L.Ra       Max. length aspect ratio
7       Scat.Ra        Scatter ratio
8       Elong          Elongatedness
9       Pr.Axis.Rect   Pr. axis rectangularity
10      Max.L.Rect     Max. length rectangularity
11      Sc.Var.Maxis   Scaled variance along major axis
12      Sc.Var.maxis   Scaled variance along minor axis
13      Ra.Gyr         Scaled radius of gyration
14      Skew.Maxis     Skewness about major axis
15      Skew.maxis     Skewness about minor axis
16      Kurt.maxis     Kurtosis about minor axis
17      Kurt.Maxis     Kurtosis about major axis
18      Holl.Ra        Hollows ratio
19      Class          Type: bus, opel, saab, van

Table 8 Attributes and class labels for the vehicle silhouettes data set


Step 3 Estimate the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- tree(Class ~ ., data = Vehicle[train, ], split = "deviance")

We use deviance as the splitting criterion; a common alternative is to use split = "gini". You may be surprised by just how quickly R builds the tree.

PRACTITIONER TIP

It is important to remember that the response variable is a factor. This actually trips up quite a few users who have numeric categories and forget to convert their response variable using factor(). To see the levels of the response variable type:

> attributes(Vehicle$Class)
$levels
[1] "bus"  "opel" "saab" "van"

$class
[1] "factor"

For Vehicle, each level is associated with a different vehicle type (bus, opel, saab, van).
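For example (a trivial, hypothetical illustration of the conversion):

# convert a numeric 0/1 response into a labelled factor before fitting a tree
y <- c(0, 1, 1, 0, 1)
y <- factor(y, levels = c(0, 1), labels = c("no", "yes"))
levels(y)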

To see details of the fitted tree, type:

> fit

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 500 1386.000 saab ( 0.248000 0.254000 0.258000 0.240000 )
   2) Elong < 41.5 229 489.100 opel ( 0.222707 0.410480 0.366812 0.000000 )
...
  14) Skew.Maxis < 64.5 7 9.561 van ( 0.000000 0.000000 0.428571 0.571429 )
  15) Skew.Maxis > 64.5 64 10.300 van ( 0.015625 0.000000 0.000000 0.984375 )

At each branch of the tree (after root) we see, in order:

1. The branch number (e.g. in this case 1, 2, 14 and 15);

2. the split (e.g. Elong < 41.5);

3. the number of samples going along that split (e.g. 229);

4. the deviance associated with that split (e.g. 489.1);

5. the predicted class (e.g. opel);

6. the associated probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));

7. and, for a terminal node (or leaf), the symbol *.

PRACTITIONER TIP

If the minimum deviance occurs with a tree with 1 node, then your model is at best no better than random. It is even possible that it may be worse.

A summary of the tree can also be obtained:

> summary(fit)

Classification tree:
tree(formula = Class ~ ., data = Vehicle[train, ], split = "deviance")
Variables actually used in tree construction:
 [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
 [6] "Max.L.Rect"   "D.Circ"       "Skew.maxis"   "Circ"         "Kurt.Maxis"
[11] "Skew.Maxis"
Number of terminal nodes:  15
Residual mean deviance:  0.9381 = 455 / 485
Misclassification error rate: 0.232 = 116 / 500

Notice that summary(fit) shows:

1. The type of tree, in this case a classification tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 15;

5. the residual mean deviance, 0.9381;

6. the misclassification error rate, 0.232 or 23.2%.

We plot the tree, see Figure 1.1:

> plot(fit); text(fit)


Figure 1.1 Fitted decision tree

PRACTITIONER TIP

The height of the vertical lines in Figure 1.1 is proportional to the reduction in deviance. The longer the line, the larger the reduction. This allows you to identify the important sections immediately. If you wish to plot the model using uniform lengths use plot(fit, type = "uniform").


Step 4 Assess Model

Unfortunately, classification trees have a tendency to over-fit the data. One approach to reduce this risk is to use cross-validation. For each hold out sample we fit the model and note at what level the tree gives the best results (using deviance or the misclassification rate). Then we hold out a different sample and repeat. This can be carried out using the cv.tree() function. We use a leave-one-out cross-validation using the misclassification rate and deviance (FUN = prune.misclass followed by FUN = prune.tree).

NOTE

Textbooks and academics used to spend an inordinate amount of time on the subject of when to stop splitting a tree and also on pruning techniques. This is indeed an important aspect to consider when building a single tree, because if the tree is too large it will tend to over fit the data. If the tree is too small it might miss important characteristics of the relationship between the covariates and the outcome. In actual practice I do not spend a great deal of time deciding when to stop splitting a tree, or even pruning. This is partly because:

1. A single tree is generally only of interest to gain insight about the data if it can be easily interpreted. The default settings in R decision tree functions are often sufficient to create such trees.

2. For use in "pure" prediction activities, random forests (see page 272) have largely replaced individual decision trees because they often produce more accurate predictive models.

The results are plotted out side by side in Figure 1.2. The jagged lines show where the minimum deviance / misclassification occurred with the cross-validated tree. Since the cross-validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitM.cv <- cv.tree(fit, K = 346, FUN = prune.misclass)

> fitP.cv <- cv.tree(fit, K = 346, FUN = prune.tree)


> par(mfrow = c(1, 2))
> plot(fitM.cv)
> plot(fitP.cv)

Figure 1.2 Cross-validation results on Vehicle using misclassification and deviance
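Had the cross-validation curves pointed to a smaller tree, the fitted tree could be cut back with prune.misclass from the tree package. A minimal sketch (the choice of 8 terminal nodes is purely illustrative):

# prune the fitted tree back to (at most) 8 terminal nodes
fit.pruned <- prune.misclass(fit, best = 8)
summary(fit.pruned)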

Step 5 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 32.4%.


> pred <- predict(fit, newdata = Vehicle[-train, ])

> pred.class <- colnames(pred)[max.col(pred, ties.method = c("random"))]

> table(Vehicle$Class[-train], pred.class, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   86    1    3   4
          opel   1   55   20   9
          saab   4   55   23   6
          van    2    2    5  70

> error_rate = (1 - sum(pred.class == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.324


Technique 2

C5.0 Classification Tree

The C5.0 algorithm21 is based on the concepts of entropy, the measure of disorder in a sample, and the information gain of each attribute. Information gain is a measure of the effectiveness of an attribute in reducing the amount of entropy in the sample.

It begins by calculating the entropy of a data sample. The next step is to calculate the information gain for each attribute. This is the expected reduction in entropy by partitioning the data set on the given attribute. From the set of information gain values the best attributes for partitioning the data set are chosen, and the decision tree is built.
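As a minimal sketch (not from the original text), the entropy and information gain calculations that underpin this idea can be written in a few lines of R:

# entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# information gain: expected reduction in entropy from partitioning y
# by the logical vector split (TRUE / FALSE gives the two partitions)
info_gain <- function(y, split) {
  w <- mean(split)
  entropy(y) - (w * entropy(y[split]) + (1 - w) * entropy(y[!split]))
}

y     <- factor(c("bus", "bus", "van", "van", "van", "bus", "van", "bus"))
split <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
info_gain(y, split)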

A C5.0 classification tree can be built using the package C50 with the C5.0 function:

C5.0(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(C50)
> library(mlbench)
> data(Vehicle)


Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- C5.0(Class ~ ., data = Vehicle[train, ])

Next we assess variable importance using the C5imp function:

> C5imp(fit)
             Overall
Max.L.Ra       100.0
Elong          100.0
Comp            50.2
Circ            46.2
Skew.maxis      45.2
Scat.Ra         40.0
Max.L.Rect      29.2
Ra.Gyr          23.4
D.Circ          23.2
Skew.Maxis      20.2
Pr.Axis.Rect    17.0
Kurt.maxis      12.2
Pr.Axis.Ra       9.0
Rad.Ra           3.0
Holl.Ra          3.0
Sc.Var.maxis     0.0
Kurt.Maxis       0.0

We observe that Max.L.Ra and Elong are the two most influential attributes. The attributes Sc.Var.maxis and Kurt.Maxis are the least influential variables, with an influence score of zero.


PRACTITIONER TIP

To assess the importance of attributes by split, use:

> C5imp(fit, metric = "splits")

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 27.5%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   84    1    5   4
          opel   0   48   31   6
          saab   3   38   47   0
          van    2    1    4  72

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.275


PRACTITIONER TIP

To view the estimated probabilities of each class, use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

> head(round(pred, 3))
     bus  opel  saab   van
5  0.018 0.004 0.004 0.974
7  0.050 0.051 0.852 0.048
10 0.011 0.228 0.750 0.010
12 0.031 0.032 0.782 0.155
15 0.916 0.028 0.029 0.027
19 0.014 0.070 0.903 0.013


Technique 3

Conditional Inference Classification Tree

A conditional inference classification tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about the distribution of classes in the terminal nodes22. It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(party)
> library(mlbench)
> data(Vehicle)

Step 2 is outlined on page 23

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 2))


Notice we use controls with the maxdepth parameter to limit the depth of the tree to at most 2. Note that maxdepth = 0 (the default) places no restriction on the depth of the tree.
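Other stopping rules can be passed in the same way. A hedged sketch, assuming the party package's ctree_control arguments minsplit and mincriterion, might look like:

# tighten the stopping rules: minimum node size to split and test criterion
ctl  <- ctree_control(maxdepth = 2, minsplit = 20, mincriterion = 0.95)
fit2 <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctl)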

Next we plot the tree:

> plot(fit)

The resultant tree is shown in Figure 3.1. At each internal node a p-value is reported for the split. In this case they are all highly significant (less than 1%). The primary split takes place at Elong ≤ 41, Elong > 41. The four terminal nodes are labeled Node 3, Node 4, Node 6 and Node 7, with 62, 167, 88 and 183 observations respectively. Each of these leaf nodes also has a bar chart illustrating the proportion of the four vehicle types that fall into each class at that node.


Figure 3.1 Conditional inference classification tree for Vehicle with maxdepth = 2

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; we show the confusion matrix and calculate the error rate of fit. The misclassification rate is approximately 53% for this tree.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")


> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   36    0   11  47
          opel   3   50   24   8
          saab   6   58   19   5
          van    0    0   21  58

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.529

PRACTITIONER TIP

Set the parameter type = "node" to see which nodes the observations end up in, and type = "prob" to view the probabilities. For example, to see the distribution for the validation sample type:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
> table(pred)
pred
  3   4   6   7
 45 108  75 118

We see that 45 observations ended up in node 3 and 118 in node 7.

To assess the predictive power of fit we compare it against two other conditional inference classification trees: fit3, which limits the maximum tree depth to 3, and fitu, which estimates an unrestricted tree.

> fit3 <- ctree(Class ~ ., data = Vehicle[train, ], controls = ctree_control(maxdepth = 3))

> fitu <- ctree(Class ~ ., data = Vehicle[train, ])

We use the validation data set and the fitted decision trees to predict vehicle classes:


> pred3 <- predict(fit3, newdata = Vehicle[-train, ], type = "response")

> predu <- predict(fitu, newdata = Vehicle[-train, ], type = "response")

Next we calculate the error rate of each fitted tree:

> error_rate3 = (1 - sum(pred3 == Vehicle$Class[-train]) / 346)

> error_rateu = (1 - sum(predu == Vehicle$Class[-train]) / 346)

> tree_1 <- round(error_rate, 3)
> tree_2 <- round(error_rate3, 3)
> tree_3 <- round(error_rateu, 3)

Finally we calculate the misclassification error rate for each fitted tree:

> err <- cbind(tree_1, tree_2, tree_3) * 100
> rownames(err) <- "error (%)"

> err
          tree_1 tree_2 tree_3
error (%)   52.9   41.6     37

The unrestricted tree has the lowest misclassification error rate of 37%.


Technique 4

Evolutionary Classification Tree

NOTE

The recursive partitioning methods discussed in previous sections build the decision tree using a forward stepwise search, where splits are chosen to maximize homogeneity at the next step only. Although this approach is known to be an efficient heuristic, the results are only locally optimal. Evolutionary algorithm based trees search over the parameter space of trees using a global optimization method.

An evolutionary classification tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the response variable z, which contains the classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(evtree)
> library(mlbench)
> data(Vehicle)


Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. We use evtree to build the tree and plot to display the tree. We restrict the tree depth using maxdepth = 2.

> fit <- evtree(Class ~ ., data = Vehicle[train, ], control = evtree.control(maxdepth = 2))

> plot(fit)


Figure 4.1 Fitted evolutionary classification tree using Vehicle

The tree shown in Figure 4.1 visualizes the decision rules. It also shows, for the terminal nodes, the number of observations and their distribution amongst the classes. Let us take node 3 as an illustration. This node is reached by the rule Max.L.Ra < 8 and Sc.Var.maxis < 290. The node contains 64 observations. Similar details can be obtained by typing fit:

> fit

Model formula:
Class ~ Comp + Circ + D.Circ + Rad.Ra + Pr.Axis.Ra + Max.L.Ra +
    Scat.Ra + Elong + Pr.Axis.Rect + Max.L.Rect + Sc.Var.Maxis +
    Sc.Var.maxis + Ra.Gyr + Skew.Maxis + Skew.maxis + Kurt.maxis +
    Kurt.Maxis + Holl.Ra

Fitted party:
[1] root
|   [2] Max.L.Ra < 8
|   |   [3] Sc.Var.maxis < 290: van (n = 64, err = 45.3%)
|   |   [4] Sc.Var.maxis >= 290: bus (n = 146, err = 27.4%)
|   [5] Max.L.Ra >= 8
|   |   [6] Sc.Var.maxis < 389: van (n = 123, err = 30.9%)
|   |   [7] Sc.Var.maxis >= 389: opel (n = 167, err = 47.3%)

Number of inner nodes:    3
Number of terminal nodes: 4
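The depth restriction used above was passed through evtree.control. Other arguments of the same control function can be adjusted similarly; a hedged sketch (assuming, for illustration, its minbucket and niterations arguments):

# bound the leaf size and the length of the evolutionary search
ctl  <- evtree.control(maxdepth = 2, minbucket = 10, niterations = 2000)
fit2 <- evtree(Class ~ ., data = Vehicle[train, ], control = ctl)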

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 39%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   85    0    0   9
          opel  14   48    0  23
          saab  18   57    0  13
          van    0    1    0  78


> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.39

PRACTITIONER TIP

When using predict you can specify any of the following via type =: "response", "prob", "quantile", "density" or "node". For example, to view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

To see the distribution across nodes you would enter:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")

Finally, we estimate an unrestricted model, use the validation data set to predict vehicle classes, display the confusion matrix and calculate the error rate of the unrestricted fitted tree. In this case the misclassification error rate is lower, at 34.7%.

> fit <- evtree(Class ~ ., data = Vehicle[train, ])

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   72    3   15   4
          opel   4   30   41  10
          saab   6   23   55   4
          van    3    1    6  69

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

45

92 Applied Predictive Modeling Techniques in R

gt round(error_rate 3)[1] 0347

46

Technique 5

Oblique Classification Tree

NOTE

A common question is: what is different about oblique trees? Here is the answer in a nutshell. For a set of attributes X1, ..., Xk, the standard classification tree produces binary partitioned trees by considering axis-parallel splits over continuous attributes (i.e. grown using tests of the form X_i < C versus X_i \ge C). This is the most widely used approach to tree growth. Oblique trees are grown using oblique splits (i.e. grown using tests of the form \sum_{i=1}^{k} \alpha_i X_i < C versus \sum_{i=1}^{k} \alpha_i X_i \ge C). So for axis-parallel splits a single attribute is used; for oblique splits a weighted combination of attributes is used.

An oblique classification tree model can be estimated using the package oblique.tree with the oblique.tree function:

oblique.tree(z ~ ., data, oblique.splits = "only", control = tree.control(), split.impurity, ...)

Key parameters include the response variable z, which contains the classes; data, the data set of attributes with which you wish to build the tree; control, which takes arguments from tree.control in the tree package; and split.impurity, which controls the splitting criterion used and takes the values "deviance" or "gini".

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(oblique.tree)
> library(mlbench)
> data(Vehicle)

Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. Three attributes are used to build the tree (Max.L.Ra, Sc.Var.maxis and Elong).

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> f <- Class ~ Max.L.Ra + Sc.Var.maxis + Elong

Step 3 Estimate and Assess the Decision Tree

Before estimating the tree we tweak the following:

1. Only allow oblique splits (oblique.splits = "only").

2. Use tree.control to indicate the number of observations (nobs = 500) and set the minimum number of observations in a node to 60 (mincut = 60).

The tree is estimated as follows:

> fit <- oblique.tree(f, data = Vehicle[train, ], oblique.splits = "only",
                      control = tree.control(nobs = 500, mincut = 60),
                      split.impurity = "deviance")

PRACTITIONER TIP

The type of split is indicated by the oblique.splits argument. For our analysis we use oblique.splits = "only" to grow trees that use only oblique splits. Use oblique.splits = "on" to grow trees that use both oblique and axis-parallel splits, and oblique.splits = "off" to grow traditional classification trees (only axis-parallel splits).

Details of the tree can be visualized using a combination of plot and text; see Figure 5.1:

> plot(fit); text(fit)

Figure 5.1: Fitted Oblique tree using a subset of Vehicle attributes

The tree visualizes the decision rules. The first split occurs where

6.4 - 1.42 Max.L.Ra + 0.03 Sc.Var.maxis - 0.12 Elong < 0

If the above holds, it leads directly to the leaf node indicating class type = van. Full details of the tree can be obtained by typing fit, whilst summary gives an overview of the fitted tree:

> summary(fit)

Classification tree:
oblique.tree(formula = f, data = Vehicle[train, ],
    control = tree.control(nobs = 500, mincut = 60),
    split.impurity = "deviance", oblique.splits = "only")
Variables actually used in tree construction:
[1]
Number of terminal nodes:  5
Residual mean deviance:  1.766 = 874.2 / 495
Misclassification error rate: 0 = 0 / 500

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 39.9%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   77    0   13   4
          opel  17   37   20  11
          saab  23   34   27   4
          van    8    0    4  67

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.399

Technique 6

Logistic Model Based Recursive Partitioning

A model based recursive partitioning logistic regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data, model = glinearModel, family = binomial())

Key parameters include the binary response variable z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Step 1 Load Required Packages

We build the decision tree using the data frame PimaIndiansDiabetes2 contained in the mlbench package:

> library(party)
> data(PimaIndiansDiabetes2, package = "mlbench")

NOTE

The PimaIndiansDiabetes2 data set was collected by the National Institute of Diabetes and Digestive and Kidney Diseases23. It contains 768 observations on 9 variables measured on females at least 21 years old of Pima Indian heritage. Table 9 contains a description of each of the variables.

Name      Description
pregnant  Number of times pregnant
glucose   Plasma glucose concentration (glucose tolerance test)
pressure  Diastolic blood pressure (mm Hg)
triceps   Triceps skin fold thickness (mm)
insulin   2-Hour serum insulin (mu U/ml)
mass      Body mass index
pedigree  Diabetes pedigree function
age       Age (years)
diabetes  Test for diabetes - class variable (neg / pos)

Table 9: Response and independent variables in the PimaIndiansDiabetes2 data frame

Step 2 Prepare Data & Tweak Parameters

For our analysis we use 600 of the 768 observations to train the model. The response variable is diabetes, and we use mass and pedigree as the logistic regression conditioning variables, with the remaining six variables (glucose, pregnant, pressure, triceps, insulin and age) being used as the partitioning variables. The model is stored in f:

> set.seed(898)
> n = nrow(PimaIndiansDiabetes2)
> train <- sample(1:n, 600, FALSE)
> f <- diabetes ~ mass + pedigree | glucose + pregnant + pressure +
      triceps + insulin + age


NOTE

The PimaIndiansDiabetes2 data set has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. In traditional statistical analysis these values would have to be removed or interpolated. However, ignoring missing values, or treating them as another category, is often inefficient. A more efficient use of the available information, adopted by many decision tree algorithms, is to ignore the missing data point in the evaluation of a split but distribute it to the child nodes using a given rule; a sketch of one such rule follows this note. For example, the rule might:

1. Distribute missing values to the child node which has the largest number of instances.

2. Distribute to all child nodes, with diminished weights proportional to the number of instances in each child node.

3. Randomly distribute to one single child node.

4. Create surrogate attributes, which closely resemble the test attributes, and use them to send missing values to child nodes.
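Rule 4 is, for example, the approach taken by rpart. A minimal sketch (using rpart rather than mob, purely to illustrate the idea; it is not part of this technique's worked example):

library(rpart)
# usesurrogate = 2 sends an observation in the majority direction when all
# surrogates are also missing; maxsurrogate controls how many are tried per split
fit.surr <- rpart(diabetes ~ ., data = PimaIndiansDiabetes2,
                  control = rpart.control(maxsurrogate = 5, usesurrogate = 2))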

Step 3 Estimate & Interpret Decision Tree

We estimate the model using the function mob and then use the plot function to visualize the tree, as shown in Figure 6.1. Since the response variable diabetes is binary and mass and pedigree are numeric, a spinogram is used for visualization. The plots in the leaves give spinograms for diabetes versus mass (upper panel) and pedigree (lower panel).

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel,
             family = binomial())

> plot(fit)

PRACTITIONER TIP

As an alternative to using spinograms you can also plot the cumulative density function using the argument tp_args = list(cdplot = TRUE) in the plot function. In the example in this section you would type:

> plot(fit, tp_args = list(cdplot = TRUE))

You can also specify the smoothing bandwidth using the bw argument. For example:

> plot(fit, tp_args = list(cdplot = TRUE, bw = 15))

Figure 6.1: Logistic-regression-based tree for the Pima Indians diabetes data

The fitted lines are the mean predicted probabilities in each group. The decision tree distinguishes four different groups of women:

• Node 3: Women with low glucose and 26 years or younger have, on average, a low risk of diabetes; however, this risk increases with mass but decreases slightly with pedigree.

• Node 4: Women with low glucose and older than 26 years have, on average, a moderate risk of diabetes, which increases with mass and pedigree.

• Node 5: Women with glucose in the range 127 to 165 have, on average, a moderate to high risk of diabetes, which increases with mass and pedigree.

• Node 7: Women with glucose greater than 165 have, on average, a high risk of diabetes, which increases with mass and decreases with pedigree.

The same interpretation can also be drawn from the coefficient estimates obtained using the coef function:

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -7.263 0.143   -0.680
4      -4.887 0.076    2.495
6      -5.711 0.149    1.365
7      -3.216 0.225   -2.193

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -115.4345 (df=15)

> AIC(fit)
[1] 260.8691

Step 5 Make Predictions

We use the function predict to calculate the fitted values using the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac,
              dnn = c("actual", "predicted"))

> tb
        predicted
actual    neg pos
    neg    92  17
    pos    26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2
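Beyond the overall error rate, other accuracy measures can be read straight off the confusion table tb. A minimal sketch (the measure names are ours, not the book's), treating "pos" as the event of interest:

sensitivity <- tb["pos", "pos"] / sum(tb["pos", ])   # true positive rate
specificity <- tb["neg", "neg"] / sum(tb["neg", ])   # true negative rate
round(c(sensitivity = sensitivity, specificity = specificity), 3)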

PRACTITIONER TIP

Re-run the analysis in this section omitting the attributes insulin and triceps, and using the na.omit method to remove any remaining missing values. Here is some sample code to get you started:

temp <- PimaIndiansDiabetes2
temp$insulin <- NULL
temp$triceps <- NULL
temp <- na.omit(temp)

What do you notice about the resultant decision tree?

Technique 7

Probit Model Based Recursive Partitioning

A model based recursive partitioning probit regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data, model = glinearModel, family = binomial(link = "probit"))

Key parameters include the binary response variable z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Steps 1 and 2 are discussed beginning on page 52.

Step 3 Estimate & Interpret Decision Tree

We estimate the model using the function mob and display the coefficients at the leaf nodes using coef:

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel,
             family = binomial(link = "probit"))

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -4.070 0.078   -0.222
4      -2.932 0.046    1.474
6      -3.416 0.089    0.814
7      -1.174 0.100   -1.003


The estimated decision tree is similar to that shown in Figure 6.1, and the interpretation of the coefficients is given on page 56.

Step 5 Make Predictions

For comparison with the logistic regression discussed on page 52, we report the value of the log likelihood function and the Akaike information criterion. Since these values are very close to those obtained by the logistic regression, we should expect similar predictive performance.

> logLik(fit)
'log Lik.' -115.4277 (df=15)

> AIC(fit)
[1] 260.8555

We use the function predict with the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%. This is exactly the same error rate observed for the logistic regression model.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac,
              dnn = c("actual", "predicted"))

> tb
        predicted
actual    neg pos
    neg    92  17
    pos    26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2


Regression Trees for Continuous Response Variables

Technique 8

Regression Tree

A regression tree can be built using the package tree with the tree function

tree(z ~ ., data, split, ...)

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the continuous response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our regression tree using the bodyfat data frame contained in the TH.data package:

> library(tree)
> data(bodyfat, package = "TH.data")

NOTE

The bodyfat data set was collected by Garcia et al.24 to develop improved predictive regression equations for body fat content derived from common anthropometric measurements. The original study collected data from 117 healthy German subjects, 46 men and 71 women. The bodyfat data frame contains the data collected on 10 variables for the 71 women; see Table 10.

Name          Description
DEXfat        body fat measured by DXA (response variable)
age           age in years
waistcirc     waist circumference
hipcirc       hip circumference
elbowbreadth  breadth of the elbow
kneebreadth   breadth of the knee
anthro3a      sum of logarithm of three anthropometric measurements
anthro3b      sum of logarithm of three anthropometric measurements
anthro3c      sum of logarithm of three anthropometric measurements
anthro4       sum of logarithm of three anthropometric measurements

Table 10: Response and independent variables in the bodyfat data frame

Step 2 Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3 Estimate the Decision Tree

Now we are ready to fit the decision tree using the training sample. We take the log of DEXfat as the response variable:

> fit <- tree(log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")

To see details of the fitted tree enter:

> fit

node), split, n, deviance, yval
      * denotes terminal node

 1) root 45 67.2400 3.364
   2) anthro4 < 5.33 19 13.3200 3.004
     4) anthro4 < 4.545 5 0.25490 2.667 *
...
    14) waistcirc < 104.25 10 0.07201 3.707 *
    15) waistcirc > 104.25 5 0.08854 3.874 *

The terminal value at each node is the estimated mean of log(DEXfat). For example, the root node indicates the overall mean of log(DEXfat) is 3.364. This is approximately the same value we would get from:

> mean(log(bodyfat$DEXfat))
[1] 3.359635

Following the splits, the first terminal node is 4, where the estimated mean of log(DEXfat) for observations with anthro4 < 5.33 and anthro4 < 4.545 is 2.667.

A summary of the tree can also be obtained:

> summary(fit)

Regression tree:
tree(formula = log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")
Variables actually used in tree construction:
[1] "anthro4"   "hipcirc"   "waistcirc"
Number of terminal nodes:  7
Residual mean deviance:  0.01687 = 0.6411 / 38
Distribution of residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.268300 -0.080280  0.009712  0.000000  0.073400  0.332900

Notice that summary(fit) shows:

1. the type of tree, in this case a regression tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 7;

5. the residual mean deviance of 0.01687;

6. the distribution of the residuals, in this case with a mean of 0.

We plot the tree; see Figure 8.1:

> plot(fit); text(fit)

Figure 8.1: Fitted Regression Tree for bodyfat

Step 4 Assess Model

We use leave-one-out cross-validation, with the results plotted in Figure 8.2. Since the jagged line reaches a minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fit.cv <- cv.tree(fit, K = 45)
> plot(fit.cv)

Figure 8.2: Regression tree cross-validation results using bodyfat
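Had the cross-validation curve pointed to a smaller tree, it could be pruned to that size with prune.tree from the tree package. A minimal sketch (the choice of 5 leaves is illustrative only, not taken from this example's output):

fit.pruned <- prune.tree(fit, best = 5)   # keep the best 5-leaf sub-tree
plot(fit.pruned); text(fit.pruned)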

Step 5 Make Predictions

We use the test observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 8.3. The squared correlation coefficient between predicted and observed values is 0.795.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.795

Figure 8.3: Scatterplot of predicted versus observed values for the regression tree of bodyfat


Technique 9

Conditional Inference Regression Tree

A conditional inference regression tree is a non-parametric regression tree embedding tree-structured regression models. It is similar to the regression tree of page 62, but with extra information about the distribution of subjects in the leaf nodes. It can be estimated using the package party with the ctree function:

ctree(z ~ ., data, ...)

Key parameters include the continuous response variable z and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2 is outlined on page 63

Step 3 Estimate and Assess the Decision Tree

We estimate the model using the training data, followed by a plot of the fitted tree, shown in Figure 9.1:

> fit <- ctree(log(DEXfat) ~ ., data = bodyfat[train, ])

> plot(fit)

Figure 9.1: Fitted Conditional Inference Regression Tree using bodyfat

Further details of the fitted tree can be obtained using the print function:

> print(fit)

     Conditional inference tree with 4 terminal nodes

Response:  log(DEXfat)
Inputs:  age, waistcirc, hipcirc, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c, anthro4
Number of observations:  45

1) anthro3c <= 3.85; criterion = 1, statistic = 35.215
  2) anthro3c <= 3.39; criterion = 0.998, statistic = 14.061
    3)*  weights = 9
  2) anthro3c > 3.39
    4)*  weights = 12
1) anthro3c > 3.85
  5) hipcirc <= 108.5; criterion = 0.999, statistic = 15.862
    6)*  weights = 10
  5) hipcirc > 108.5
    7)*  weights = 14

At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3c <= 3.85); the criterion reflects the reported p-value and is derived from statistic. Terminal nodes (leaves) are indicated with * and weights gives the number of subjects (observations) at that node.

Step 5 Make Predictions

We use the validation observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 9.2. The squared correlation coefficient between predicted and observed values is 0.68.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
log(DEXfat) 0.68

Figure 9.2: Conditional Inference Regression Tree scatter plot for bodyfat


Technique 10

Linear Model Based Recursive Partitioning

A linear model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data, model = linearModel, ...)

Key parameters include the continuous response variable z, the linear regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1 Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2 Prepare Data & Tweak Parameters

For our analysis we will use the entire bodyfat sample. We begin by taking the log of the response variable (DEXfat) and the two conditioning variables (waistcirc, hipcirc); the remaining covariates form the partitioning set:

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc | age + elbowbreadth + kneebreadth +
      anthro3a + anthro3b + anthro3c + anthro4

Step 3 Estimate & Evaluate Decision Tree

We estimate the model using the function mob. Since looking at the printed output can be rather tedious, a visualization is shown in Figure 10.1. By default this produces partial scatter plots of the response variable against each of the regressors (waistcirc, hipcirc) in the terminal nodes. Each scatter plot also shows the fitted values. From this visualization it can be seen that in nodes 3, 4 and 5 body fat increases with waist and hip circumference. The increase appears steepest in node 3 and flattens out somewhat in node 5.

> fit <- mob(f, data = bodyfat, model = linearModel,
             control = mob_control(objfun = logLik))

> plot(fit)

PRACTITIONER TIP

Model based recursive partitioning searches for the locally optimal split in the response variable by minimizing the objective function of the model. Typically this will be something like the deviance or the negative log likelihood. It can be specified using the mob_control control function. For example, to use the deviance you would set control = mob_control(objfun = deviance). In our analysis we use the log likelihood, with control = mob_control(objfun = logLik).

Figure 10.1: Linear model based recursive partitioning tree using bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit
1) anthro3b <= 4.64; criterion = 0.998, statistic = 24.549
  2) anthro3b <= 4.29; criterion = 0.966, statistic = 16.962
    3)*  weights = 31
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -14.309        1.196        2.660

  2) anthro3b > 4.29
    4)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -5.3867       0.9887       0.9466

1) anthro3b > 4.64
  5)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -3.5815       0.6027       0.9546

The output informs us that the tree consists of five nodes. At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3b <= 4.64). Note that criterion reflects the reported p-value25 and is derived from statistic. Terminal nodes are indicated with * and weights gives the number of subjects (observations) at that node. The output also presents the estimated regression coefficients at the terminal nodes. We can also use the coef function to obtain a summary of the estimated coefficients and their associated nodes:

> round(coef(fit), 3)
  (Intercept) waistcirc hipcirc
3     -14.309     1.196   2.660
4      -5.387     0.989   0.947
5      -3.582     0.603   0.955

The summary function also provides detailed statistical information on the fitted coefficients by node. For example, summary(fit) produces the following (we only show details of node 3):

> summary(fit)

$`3`

Call:
NULL

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3272  0.0000  0.0000  0.0000  0.4376

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3093     2.4745  -5.783 3.29e-06 ***
waistcirc     1.1958     0.4033   2.965 0.006119 **
hipcirc       2.6597     0.6969   3.817 0.000685 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1694 on 28 degrees of freedom
Multiple R-squared: 0.6941,  Adjusted R-squared: 0.6723
F-statistic: 31.77 on 2 and 28 DF,  p-value: 6.278e-08

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' 54.42474 (df=14)

> AIC(fit)
[1] -80.84949

PRACTITIONER TIP

The test statistics and p-values computed in each node can be extracted using the function sctest(). For example, to see the statistics for node 2 you would type sctest(fit, node = 2).

Step 5 Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 10.2. The squared correlation coefficient between predicted and observed values is 0.89.

> pred <- predict(fit, newdata = bodyfat)

> plot(bodyfat$DEXfat, pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Full Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat)^2, 3)
[1] 0.89

Figure 10.2: Linear model based recursive partitioning tree predicted and observed values using bodyfat


Technique 11

Evolutionary Regression Tree

An evolutionary regression tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data, ...)

Key parameters include the continuous response variable z and the covariates contained in data.

Step 1 Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(evtree)
> data(bodyfat, package = "TH.data")

Step 2 Prepare Data & Tweak Parameters

For our analysis we will use 45 observations for the training sample. We take the log of the response variable (DEXfat) and two of the covariates (waistcirc, hipcirc). The remaining covariates are used in their original form:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
      anthro3a + anthro3b + anthro3c + anthro4

Step 3 Estimate & Evaluate Decision Tree

We estimate the model using the function evtree. A visualization is obtained using plot and shown in Figure 11.1. This produces box and whisker plots of the response variable in each leaf. From this visualization it can be seen that body fat increases as we move from node 3 to node 5.

> fit <- evtree(f, data = bodyfat[train, ])
> plot(fit)

PRACTITIONER TIP

Notice that evtree.control is used to control important aspects of a tree. You can change the number of evolutionary iterations using niterations; this is useful if your tree does not converge within the default number of iterations. You can also specify the number of trees in the population using ntrees, and the tree depth with maxdepth. For example, to limit the maximum tree depth to two and the number of iterations to ten thousand you would enter something along the lines of:

fit <- evtree(f, data = bodyfat[train, ],
              control = evtree.control(maxdepth = 2, niterations = 10000))

Figure 11.1: Fitted Evolutionary Regression Tree for bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit

Model formula:
DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4

Fitted party:
[1] root
|   [2] hipcirc < 109
|   |   [3] anthro3c < 3.77: 20.271 (n = 18, err = 385.9)
|   |   [4] anthro3c >= 3.77: 31.496 (n = 12, err = 186.8)
|   [5] hipcirc >= 109: 43.432 (n = 15, err = 554.8)

Number of inner nodes:    2
Number of terminal nodes: 3

PRACTITIONER TIP

Decision tree models can outperform other techniques when relationships are irregular (e.g. non-monotonic), but they are also known to be inefficient when relationships can be well approximated by simpler models.

Step 5 Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 11.2. The squared correlation coefficient between predicted and observed values is 0.73.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.727

Figure 11.2: Scatter plot of fitted and observed values for the Evolutionary Regression Tree using bodyfat


Decision Trees for Count & Ordinal Response Data

PRACTITIONER TIP

An oft-believed maxim is "the more data the better". Whilst this may sometimes be true, it is a good idea to rationally reduce the number of attributes you include in your decision tree to the minimum set of highest value attributes. Oates and Jensen26 studied the influence of database size on decision tree complexity. They found tree size and complexity strongly depend on the size of the training set.
It is always worth thinking about and removing uninformative attributes prior to decision tree construction; a minimal screening sketch is given below. For practical ideas and additional tips on how to do this, see the excellent papers of John27, Brodley and Friedl28, and Cano and Herrera29.
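One simple way to start such screening is with the caret package; the sketch below assumes caret is installed and that predictors is a generic numeric data frame of candidate attributes (both are illustrative assumptions, not part of the book's worked examples):

library(caret)
nzv  <- nearZeroVar(predictors)                          # near zero-variance attributes
high <- findCorrelation(cor(predictors), cutoff = 0.9)   # heavily correlated attributes
keep <- setdiff(seq_len(ncol(predictors)), union(nzv, high))
reduced <- predictors[, keep]                            # reduced attribute set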


Technique 12

Poisson Decision Tree

A decision tree for a count response variable y_i, following a Poisson distribution with a mean that depends on the covariates x_1, ..., x_k, can be built using the package rpart with the rpart function:

rpart(z ~ ., data, method = "poisson")

Key parameters include method = "poisson", which is used to indicate the type of tree to be built; z, the Poisson distributed response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package:

> library(rpart)
> library(MixAll)
> data(DebTrivedi)


NOTE

Deb and Trivedi30 model counts of medical care utilization by the elderly in the United States using data from the National Medical Expenditure Survey. They analyze data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program. The objective is to model the demand for medical care using, as the response variable, the number of physician/non-physician office and hospital outpatient visits. The data is contained in the DebTrivedi data frame available in the MixAll package.

Step 2 Prepare Data & Tweak Parameters

The number of physician office visits (ofp) is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator):

> f <- ofp ~ hosp + health + numchron + gender + school + privins

Step 3 Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(f, data = DebTrivedi, method = "poisson")

To see a plot of the tree use the plot and text methods:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8)

Figure 12.1 shows a visualization of the fitted tree. Each of the five terminal nodes reports the event rate, the total number of events and the number of observations for that node. For example, for the rule chain numchron < 1.5 -> hosp < 0.5 -> numchron < 0.5, the event rate is 3.121 with 923 events at that node.

PRACTITIONER TIP

To see the number of events and observations at every node in a decision tree plot, add all = TRUE to the text function:

plot(fit); text(fit, use.n = TRUE, all = TRUE, cex = 0.6)

Figure 12.1: Poisson Decision Tree using the DebTrivedi data frame

To help validate the decision tree we use the printcp function. The 'cp' part of the function name stands for the "complexity parameter" of the tree. The function indicates the optimal tree size based on the cp value:

> printcp(fit, digits = 3)

Rates regression tree:
rpart(formula = f, data = DebTrivedi, method = "poisson")

Variables actually used in tree construction:
[1] hosp     numchron

Root node error: 26943/4406 = 6.12

n= 4406

      CP nsplit rel error xerror   xstd
1 0.0667      0     1.000  1.000 0.0332
2 0.0221      1     0.933  0.942 0.0325
3 0.0220      2     0.911  0.918 0.0326
4 0.0122      3     0.889  0.896 0.0315
5 0.0100      4     0.877  0.887 0.0315

The printcp function returns the formula used to fit the tree; the variables used to build the tree (in this case hosp and numchron); the root node error (6.12); the number of observations at the root node (4406); and the relative error, cross-validation error (xerror), xstd and CP at each node split. Each row represents a different height of the tree. In general, more levels in the tree often imply a lower classification error. However, you run the risk of over-fitting.

Figure 12.2 plots rel error against the cp parameter:

> plotcp(fit)

Figure 12.2: Complexity parameter for the Poisson decision tree using the DebTrivedi data frame

PRACTITIONER TIP

A simple rule of thumb is to choose the lowest level where rel error + xstd < xerror. Another rule of thumb is to prune the tree so that it has the minimum xerror. You can do this automatically using the following:

pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])

pfit contains the pruned tree.

Since the tree is relatively parsimonious we retain the original tree. A summary of the tree can also be obtained by typing:

> summary(fit)

The first part of the output displays similar data to that obtained by using the printcp function. This is followed by variable importance details:

Variable importance
numchron     hosp   health
      55       37        8

We see that numchron is the most important variable, followed by hosp and then health.
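The raw importance scores behind these percentages are stored on the fitted rpart object and can be inspected directly. A minimal sketch (the bar plot is our own addition, not shown in the book's output):

round(fit$variable.importance, 1)
barplot(fit$variable.importance, las = 2, main = "Variable importance")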

The second part of the summary function gives details of the tree, with the last few lines giving details of the terminal nodes. For example, for node 5 we observe:

Node number 5: 317 observations
  events = 2329, estimated rate = 7.346145, mean deviance = 5.810627

Technique 13

Poisson Model Based Recursive Partitioning

A Poisson model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data, model = linearModel, family = poisson(link = "log"))

Key parameters include the Poisson distributed response variable of counts z, the regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1 Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package. Details of this data frame are given on page 86.

> library(party)
> library(MixAll)
> data(DebTrivedi)

Step 2 Prepare Data & Tweak Parameters

For our analysis we use all the observations in DebTrivedi to estimate a model with the response variable ofp (number of physician office visits) and numchron (number of chronic conditions) as the regression conditioning variable. The remaining variables (hosp, health, gender, school, privins) are used as the partitioning set. The model is stored in f:

> f <- ofp ~ numchron | hosp + health + gender + school + privins

Step 3 Estimate the Decision Tree & Assess Fit

Now we are ready to fit and plot the decision tree; see Figure 13.1:

> fit <- mob(f, data = DebTrivedi, model = linearModel,
             family = poisson(link = "log"))

> plot(fit)

Figure 13.1: Poisson Model Based Recursive Partitioning Tree for DebTrivedi

Coefficient estimates at the leaf nodes are given by:

> round(coef(fit), 3)

   (Intercept) numchron
3        7.042    0.231
6        2.555    0.892
7        2.672    1.400
9        4.119    0.942
10       2.788    1.043
12       3.769    1.212
14       6.982    1.123
15      14.579    0.032

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -14109.87 (df=31)

> AIC(fit)
[1] 28281.75

The results of the parameter stability tests for any given node can be retrieved using sctest. For example, to retrieve the statistics for node 2, enter:

> round(sctest(fit, node = 2), 2)

           hosp health gender school privins
statistic 13.94  36.25   7.93  24.22   16.87
p.value    0.11   0.00   0.09   0.00    0.00


Technique 14

Conditional Inference Ordinal Response Tree

The Conditional Inference Ordinal Response Tree is used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". A conditional inference ordinal response tree can be built using the package party with the ctree function:

ctree(z ~ ., data, ...)

Key parameters include the response variable z, an ordered factor, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our tree using the wine data frame contained in the ordinal package:

> library(party)
> library(ordinal)
> data(wine)


NOTE

The wine data frame was analyzed by Randall31 in an experiment on factors determining the bitterness of wine. The bitterness (rating) was measured as 1 = "least bitter" and 5 = "most bitter". Temperature and contact between juice and skins can be controlled during wine production. Two treatment factors were collected - temperature (temp) and contact (contact) - each having two levels. Nine judges assessed wine from two bottles from each of the four treatment conditions, resulting in a total of 72 observations in all.

Step 2 Estimate and Assess the Decision Tree

We estimate the model using all of the data, with rating as the response variable. This is followed by a plot, shown in Figure 14.1, of the fitted tree:

> fit <- ctree(rating ~ temp + contact, data = wine)
> plot(fit)

Figure 14.1: Fitted Conditional Inference Ordinal Response Tree using wine

The decision tree has three leaves - Node 3 with 18 observations, Node 4 with 18 observations and Node 5 with 36 observations. The covariate temp is highly significant (p < 0.001) and contact is significant at the 5% level. Notice the terminal nodes contain the distribution of bitterness scores.

We compare the fitted values with the observed values and print out the confusion matrix and error rate. The overall misclassification rate is 56.9% for the fitted tree.

> pred <- predict(fit)
> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))
> tb

       predicted
actual   1  2  3  4  5
     1   0  5  0  0  0
     2   0 16  5  1  0
     3   0 13  8  5  0
     4   0  2  3  7  0
     5   0  0  2  5  0

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3)
[1] 0.569


Decision Trees for Survival Analysis

Studies involving time to event data are numerous and arise in all areas of research. For example, survival times or other time-to-failure measurements, such as relapse time, are major concerns when modeling medical data. The Cox proportional hazard regression model and its extensions have been the traditional tools of the data scientist for modeling survival variables with censoring. These parametric (and semi-parametric) models remain useful staples, as they allow simple interpretations of the covariate effects and can readily be used for statistical inference. However, such models force a specific link between the covariates and the response. Even though interactions between covariates can be incorporated, they must be specified by the analyst.

Survival decision trees allow the data scientist to carry out their analysis without imposing a specific link function or knowing a priori the nature of variable interactions. Survival trees offer great flexibility because they can automatically detect certain types of interactions without the need to specify them beforehand. Prognostic groupings are a natural output from survival trees; this is because the basic idea of a decision tree is to partition the covariate space recursively to form groups (nodes in the tree) of subjects which are similar according to the outcome of interest.


Technique 15

Exponential Algorithm

A decision tree for survival data, where the time values are assumed to fit an exponential model32, can be built using the package rpart with the rpart function:

rpart(z ~ ., data, method = "exp")

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; and method = "exp", which is used to indicate a survival decision tree.

Step 1 Load Required Packages

We build a survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(rpart)
> library(simexaft)
> library(survival)
> data(rhDNase)


NOTE

Respiratory disease in patients with cystic fibrosis is characterized by airway obstruction caused by the accumulation of thick purulent secretions. The viscoelasticity of these secretions can be reduced in vitro by recombinant human deoxyribonuclease I (rhDNase), a bioengineered copy of the human enzyme. The rhDNase data set contained in the simexaft package contains a subset of the original data collected by Fuchs et al.33, who performed a randomized, double-blind, placebo-controlled study on 968 adults and children with cystic fibrosis to determine the effects of once-daily and twice-daily administration of rhDNase. The patients were treated for 24 weeks as outpatients. The rhDNase data frame contains data on the occurrence and resolution of all exacerbations for 641 patients.

Step 2 Prepare Data & Tweak Parameters

The forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3 Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(z ~ trt + fev.ave, data = rhDNase, method = "exp")

To see a plot of the tree:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8, all = TRUE)

Figure 15.1 shows a plot of the fitted tree. Notice that the terminal nodes report the estimated rate as well as the number of events and observations available. For example, for the rule fev.ave < 80.03, the estimated event rate is 0.35 with 27 events out of a total of 177.

PRACTITIONER TIP

To see the rules that lead to a specific node use the function path.rpart(fitted_model, node = x). For example, to see the rules associated with node 7 enter:

> path.rpart(fit, node = 7)

 node number: 7
   root
   fev.ave < 80.03
   fev.ave < 46.42

Figure 15.1: Survival Decision Tree using the rhDNase data frame

To help assess the decision tree we use the plotcp function, shown in Figure 15.2. Since the tree is relatively parsimonious, there is little need to prune:

> plotcp(fit)

Figure 15.2: Complexity parameter and error for the Survival Decision Tree using the rhDNase data frame

Survival times can vary greatly between subjects. Decision tree analysis is a useful tool to homogenize the data by separating it into different subgroups based on treatments and other relevant characteristics. In other words, a single tree will group subjects according to their survival behavior based on their covariates. For a final summary of the model it can be helpful to plot the probability of survival based on the final nodes in which the individual patients landed, as shown in Figure 15.3.

We see that node 2 appears to have the most favorable survival characteristics.

> km <- survfit(z ~ fit$where, data = rhDNase)

> plot(km, lty = 1:3, mark.time = FALSE, xlab = "Time", ylab = "Status")

> legend(150, 0.2, paste("node", c(2, 4, 5)), lty = 1:3)

Figure 15.3: Survival plot by terminal node for rhDNase


Technique 16

Conditional Inference Survival Tree

A conditional inference survival tree is a non-parametric regression tree embedding tree-structured regression models. This is essentially a decision tree, but with extra information about survival in the terminal nodes34. It can be built using the package party with the ctree function:

ctree(z ~ ., data, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the sample of explanatory variables.

Step 1 Load Required Packages

We build a conditional inference survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Details of step 2 are given on page 100

Step 3 Estimate the Decision Tree & Assess Fit

Next we fit and plot the decision tree. Figure 16.1 shows the resultant tree:

> fit <- ctree(z ~ trt + fev.ave, data = rhDNase)
> plot(fit)

Figure 16.1: Conditional Inference Survival Tree for rhDNase

Notice that the internal nodes report the p-value for the split, whilst the leaf nodes give the number of subjects and a plot of the estimated survival curve. We can obtain a summary of the tree by typing:

> print(fit)

     Conditional inference tree with 4 terminal nodes

Response:  z
Inputs:  trt, fev.ave
Number of observations:  641

1) fev.ave <= 79.95; criterion = 1, statistic = 58.152
  2) trt <= 0; criterion = 0.996, statistic = 9.34
    3)*  weights = 232
  2) trt > 0
    4) fev.ave <= 45.3; criterion = 0.957, statistic = 5.254
      5)*  weights = 105
    4) fev.ave > 45.3
      6)*  weights = 127
1) fev.ave > 79.95
  7)*  weights = 177

Node 7 is a terminal or leaf node (the symbol "*" signifies this) with the decision rule fev.ave > 79.95. At this node there are 177 observations.

We grab the fitted responses using the function treeresponse and store them in stree. Notice that every stree component is a survival object of class survfit:

> stree <- treeresponse(fit)

> class(stree[[2]])
[1] "survfit"

> class(stree[[7]])
[1] "survfit"

For this particular tree we have four terminal nodes, so there are only four unique survival objects. You can use the where method to see which nodes the observations are in:

> subjects <- where(fit)

> table(subjects)
subjects
  3   5   6   7
232 105 127 177

So we have 232 subjects in node 3 and 127 subjects in node 6, which agrees with the numbers reported in Figure 16.1.

We end our initial analysis by plotting, in Figure 16.2, the survival curve for node 3 with a 95% confidence interval, and we also mark on the plot the time of individual subject events:

> plot(stree[[3]], conf.int = TRUE, mark.time = TRUE,
       ylab = "Cumulative Survival (%)", xlab = "Days Elapsed")

Figure 16.2: Survival curve for node 3


Notes

1. See for example the top ten list of Wu, Xindong, et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14.1 (2008): 1-37.
2. Koch Y, Wolf T, Sorger PK, Eils R, Brors B (2013). "Decision-Tree Based Model Analysis for Efficient Identification of Parameter Relations Leading to Different Signaling States." PLoS ONE 8(12): e82593. doi:10.1371/journal.pone.0082593
3. Ebenezer Hailemariam, Rhys Goldstein, Ramtin Attar & Azam Khan (2011). "Real-Time Occupancy Detection using Decision Trees with Multiple Sensor Types." SimAUD 2011 Conference Proceedings, Symposium on Simulation for Architecture and Urban Design.
4. Taken from Stiglic G, Kocbek S, Pernek I, Kokol P (2012). "Comprehensive Decision Tree Models in Bioinformatics." PLoS ONE 7(3): e33812. doi:10.1371/journal.pone.0033812
5. Bohanec M, Bratko I (1994). "Trading accuracy for simplicity in decision trees." Machine Learning 15: 223-250.
6. See for example J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
7. See J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
8. See Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984.
9. See R. L. De Mántaras, "A distance-based attribute selection measure for decision tree induction," Machine Learning, vol. 6, no. 1, pp. 81-92, 1991.
10. Zhang, Ting, et al. "Using decision trees to measure activities in people with stroke." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.
11. See for example J. Liu, M. A. Valencia-Sanchez, G. J. Hannon, and R. Parker, "MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies," Nature Cell Biology, vol. 7, no. 7, pp. 719-723, 2005.
12. Williams, Philip H., Rod Eyles, and George Weiller. "Plant MicroRNA prediction by supervised machine learning using C5.0 decision trees." Journal of Nucleic Acids 2012 (2012).
13. Nakayama, Nobuaki, et al. "Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees." Journal of Gastroenterology 47.6 (2012): 664-677.
14. Hepatic encephalopathy, also known as portosystemic encephalopathy, is the loss of brain function (evident in confusion, altered level of consciousness and coma) as a result of liver failure.
15. de Oña, Juan, Griselda López, and Joaquín Abellán. "Extracting decision rules from police accident reports through decision trees." Accident Analysis & Prevention 50 (2013): 1151-1160.
16. See for example Kashani, A., Mohaymany, A. (2011). "Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models." Safety Science 49: 1314-1320.
17. Monedero, Iñigo, et al. "Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees." International Journal of Electrical Power & Energy Systems 34.1 (2012): 90-98.
18. Maximum and minimum value, monthly consumption, number of meter readings, number of hours of maximum power consumption, and three variables to measure abnormal consumption.
19. Wang, Quan, et al. "Tracking tetrahymena pyriformis cells using decision trees." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
20. This data set comes from the Turing Institute, Glasgow, Scotland.
21. For further details see http://www.rulequest.com/see5-comparison.html
22. For further details see: Hothorn, Torsten, Kurt Hornik, and Achim Zeileis, "ctree: Conditional Inference Trees"; Hothorn, Torsten, Kurt Hornik, and Achim Zeileis, "Unbiased recursive partitioning: A conditional inference framework," Journal of Computational and Graphical Statistics 15.3 (2006): 651-674; and Hothorn, Torsten, et al., "Party: A laboratory for recursive partytioning" (2010).
23. http://www.niddk.nih.gov/
24. Garcia, Ada L., et al. "Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths." Obesity Research 13.3 (2005): 626-634.
25. The reported p-value at a node is equal to 1 - criterion.
26. Oates T, Jensen D (1997). "The effects of training set size on decision tree complexity." In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254-262.
27. John GH (1995). "Robust decision trees: removing outliers from databases." In: Proceedings of the First Conference on Knowledge Discovery and Data Mining, pp. 174-179.
28. Brodley CE, Friedl MA (1999). "Identifying mislabeled training data." Journal of Artificial Intelligence Research 11: 131-167.
29. Cano JR, Herrera F, Lozano M (2007). "Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability." Data & Knowledge Engineering 60(1): 90-108.
30. Deb, Partha, and Pravin K. Trivedi. "Demand for medical care by the elderly: a finite mixture approach." Journal of Applied Econometrics 12.3 (1997): 313-336.
31. See Randall, J. (1989). "The analysis of sensory data by generalised linear model." Biometrical Journal 7, pp. 781-793.
32. See Atkinson, Elizabeth J., and Terry M. Therneau. "An introduction to recursive partitioning using the RPART routines." Rochester: Mayo Foundation (2000).
33. See Henry J. Fuchs, Drucy S. Borowitz, David H. Christiansen, Edward M. Morris, Martha L. Nash, Bonnie W. Ramsey, Beryl J. Rosenstein, Arnold L. Smith, and Mary Ellen Wohl, for the Pulmozyme Study Group. N Engl J Med 1994; 331: 637-642, September 8, 1994.
34. For further details see the references in note 22: Hothorn, Hornik and Zeileis, "ctree: Conditional Inference Trees"; Hothorn, Hornik and Zeileis, "Unbiased recursive partitioning: A conditional inference framework," Journal of Computational and Graphical Statistics 15.3 (2006): 651-674; and Hothorn et al., "Party: A laboratory for recursive partytioning" (2010).

Part II

Support Vector Machines


The Basic Idea

The support vector machine (SVM) is a supervised machine learning algorithm35 that can be used for both regression and classification. Let's take a quick look at how the SVM performs classification. The core idea of SVMs is that they construct hyperplanes in a multidimensional space that separate objects belonging to different classes. A decision plane is then used to define the boundaries between the different classes. Figure 16.3 visualizes this idea.

The decision plane separates a set of observations into their respective classes using a straight line. In this example the observations belong either to class "solid circle" or class "edged circle". The separating line defines a boundary to the right of which all objects are "solid circle" and to the left of which all objects are "edged circle".

Figure 16.3: Schematic illustration of a decision plane determined by a linear classifier


In practice, as illustrated in Figure 16.4, the majority of classification problems require nonlinear boundaries in order to determine the optimal separation36. Figure 16.5 illustrates how SVMs solve this problem. The left side of the figure represents the original sample (known as the input space), which is mapped, using a set of mathematical functions known as kernels, to the feature space. The process of rearranging the objects for optimal separation is known as transformation. Notice that in mapping from the input space to the feature space the mapped objects become linearly separable. Thus, although the SVM uses linear learning methods, due to its nonlinear kernel function it is in effect a nonlinear classifier.

NOTE

Since intuition is better built from examples that are easy to imagine, lines and points are drawn in the Cartesian plane instead of hyperplanes and vectors in a high dimensional space. Remember that the same concepts apply when the examples to be classified lie in a space whose dimension is higher than two.

Figure 16.4: Nonlinear boundary required for correct classification

Figure 16.5: Mapping from input to feature space in an SVM

Overview of SVM Implementation

The SVM finds the decision hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This minimizes the risk of misclassifying not only the examples in the training data set but also the yet-to-be-seen examples of the test set.

To construct an optimal hyperplane, the SVM employs an iterative training algorithm which is used to minimize an error function.

Let's take a look at how this is achieved. Given a set of feature vectors x_i (i = 1, 2, ..., N), a target y_i \in \{-1, +1\} with corresponding binary labels is associated with each feature vector x_i. The decision function for classification of unseen examples is given as

y = f(x, \alpha) = \mathrm{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right)    (16.1)

where s_i are the N_s support vectors and K(s_i, x) is the kernel function. The parameters are determined by maximizing, over the margin hyperplane (see Figure 16.6),

\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (16.2)

subject to the constraints

\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C.    (16.3)

To build an SVM classifier the user needs to tune the cost parameter C, choose a kernel function, and optimize the kernel's parameters.
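As an illustration of this tuning step, the sketch below uses the e1071 package (an assumption on our part; the SVM techniques later in this Part may rely on other packages) to search a grid of cost and RBF kernel values by cross-validation, reusing the Vehicle data from Part I:

library(e1071)
library(mlbench)
data(Vehicle)
set.seed(107)
train <- sample(1:nrow(Vehicle), 500, FALSE)

tuned <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = 2^(-6:0), cost = 2^(-2:6))   # grid for RBF gamma and cost C
tuned$best.parameters                                  # selected gamma and cost
fit.svm <- tuned$best.model                            # SVM refitted at the best values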

13 PRACTITIONER TIP

I once worked for an Economist who was trained (essentially)in one statistical technique (linear regression and its variants)Whenever there was an empirical issue this individual alwaystried to frame it in terms of his understanding of linear mod-els and economics Needless to say this archaic approach tomodeling lead to all sorts of difficulties The data scientistis pragmatic in their modeling approach linear non-linearBayesian boosting they are guided by statistical theory ma-chine learning insights and unshackled from the vagueness ofeconomic theory

Note on the Slack Parameter
The variable C, known as the slack parameter, serves as the cost parameter that controls the trade-off between the margin and classification error. If no slack is allowed (often known as a hard margin) and the data are linearly separable, the support vectors are the points which lie along the supporting hyperplanes, as shown in Figure 16.6. In this case all of the support vectors lie exactly on the margin.


Figure 16.6: Support vectors for linearly separable data and a hard margin.

In many situations this will not yield useful results and a soft margin will be required. In this circumstance some proportion of data points are allowed to remain inside the margin; the slack parameter C is used to control this proportion.37 A soft margin results in a wider margin and greater error on the training data set; however, it improves generalization to unseen data and reduces the likelihood of over-fitting.

NOTE

The total number of support vectors depends on the amount of allowed slack and the distribution of the data. If a large amount of slack is permitted, there will be a larger number of support vectors than in the case where very little slack is permitted. Fewer support vectors means faster classification of test points, because the computational complexity of the SVM is linear in the number of support vectors.


Practical Applications

NOTE

The kappa coefficient38 is a measure of agreement between categorical variables. It is similar to the correlation coefficient in that higher values indicate greater agreement. It is calculated as

$$ \kappa = \frac{P_o - P_e}{1 - P_e} \qquad (16.4) $$

where $P_o$ is the observed proportion correctly classified and $P_e$ is the proportion expected to be correctly classified by chance.
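As a quick illustration of equation (16.4), the sketch below (not from the book) computes kappa by hand from a small made-up confusion matrix.

# Hypothetical 2x2 confusion matrix: rows = predicted, columns = observed
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE)
n <- sum(cm)

Po <- sum(diag(cm)) / n                     # observed proportion correctly classified
Pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # proportion expected correct by chance

kappa <- (Po - Pe) / (1 - Pe)
round(kappa, 3)   # 0.7 for this made-up matrix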

Classification of Dengue Fever Patients
The dengue virus is a mosquito-borne pathogen that infects millions of people every year. Gomes et al.39 use the support vector machine algorithm to classify 28 dengue patients from the Recife metropolitan area, Brazil (13 with dengue fever (DF) and 15 with dengue haemorrhagic fever (DHF)), based on mRNA expression data of 11 genes involved in the innate immune response pathway (MYD88, MDA5, TLR3, TLR7, TLR9, IRF3, IRF7, IFN-alpha, IFN-beta, IFN-gamma and RIGI).

A radial basis function is used, and the model is built using leave-one-out cross-validation repeated fifteen times under different conditions, to analyze the individual and collective contributions of each gene's expression data to DF/DHF classification.

A different gene was removed during each of the first twelve cross-validations. During the last three cross-validations multiple genes were removed. Figure 16.7 shows the overall accuracy of the support vector machine for differing values of its parameter C.


Figure 16.7: SVM optimization. Optimization of the parameters C and c of the SVM RBF kernel. Source: Gomes et al., doi:10.1371/journal.pone.0011267.g003.


PRACTITIONER TIP

To transform the gene expression data to a suitable format for support vector machine training and testing, Gomes et al. designate each gene as either "10" (for observed up-regulation) or "01" (for observed down-regulation). Therefore the collective gene expressions observed in each patient were represented by a 24-dimension vector (12 genes × 2 gene states, up- or down-regulated). Each of the 24-dimension vectors was labeled as either "1" for DF patients or "-1" for DHF patients. Notice this is a different classification structure than that used in traditional statistical modeling of binary variables. Typically, binary observations are coded by statisticians using 0 and 1. Be sure you have the correct classification structure when moving between traditional statistical models and those developed out of machine learning.
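A minimal sketch (not from Gomes et al.) of how such a two-state "10"/"01" coding and a -1/+1 response label might be built in R; the gene names and values below are invented.

# Hypothetical up/down regulation calls for two genes in four patients
g1 <- factor(c("up", "down", "up",   "down"))
g2 <- factor(c("up", "up",   "down", "down"))

# "10"/"01" coding: each gene contributes two 0/1 indicator columns
X <- cbind(g1_up = as.numeric(g1 == "up"), g1_down = as.numeric(g1 == "down"),
           g2_up = as.numeric(g2 == "up"), g2_down = as.numeric(g2 == "down"))

# Hypothetical diagnosis labels recoded as +1 (DF) and -1 (DHF)
diagnosis <- factor(c("DF", "DHF", "DF", "DHF"))
y <- ifelse(diagnosis == "DF", 1, -1)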

Forecasting Stock Market Direction
Huang et al.40 use a support vector machine to predict the direction of weekly changes in the NIKKEI 225 Japanese stock market index. The index is composed of 225 stocks of the largest Japanese publicly traded companies.

Two independent variables are selected as inputs to the model: weekly changes in the S&P 500 index and weekly changes in the US dollar - Japanese yen exchange rate.

Data was collected from January 1990 to December 2002, yielding a total sample size of 676 observations. The researchers use 640 observations to train their support vector machine and perform an out-of-sample evaluation on the remaining 36 observations.

As a benchmark, the researchers compare the performance of their model to four other models: a random walk, linear discriminant analysis, quadratic discriminant analysis and a neural network.

The random walk correctly predicts the direction of the stock market 50% of the time, linear discriminant analysis 55%, quadratic discriminant analysis and the neural network 69%, and the support vector machine 73%.

The researchers also observe that an information-weighted combination of the models correctly predicts the direction of the stock market 75% of the time.


Bankruptcy Prediction
Min et al.41 develop a support vector machine to predict bankruptcy and compare its performance to a neural network, logistic regression and multiple discriminant analysis.

Data on 1,888 firms is collected from Korea's largest credit guarantee organization. The data set contains 944 bankrupt and 944 surviving firms. The attribute set consisted of 38 popular financial ratios and was reduced by principal component analysis to two "fundamental" factors.

The training set consists of 80% of the observations, with the remaining 20% of observations used for the hold-out test sample. A radial basis function is used for the kernel; its parameters are optimized using a grid search procedure and 5-fold cross-validation.

In rank order (for the hold-out data), the support vector machine had a prediction accuracy of 83%, the neural network 82.5%, multiple discriminant analysis 79.1% and the logistic regression 78.3%.

Early Onset Breast Cancer
Breast cancer is often classified according to the number of estrogen receptors present on the tumor. Tumors with a large number of receptors are termed estrogen receptor positive (ER+), and estrogen receptor negative (ER-) for few or no receptors. ER status is important because ER+ cancers grow under the influence of estrogen and may respond well to hormone suppression treatments. This is not the case for ER- cancers, as they do not respond to hormone suppression treatments.

Upstill-Goddard et al.42 investigate whether patients who develop ER+ and ER- tumors show distinct constitutional genetic profiles using genetic single nucleotide polymorphism data. At the core of their analysis were support vector machines with linear, normalized quadratic polynomial, quadratic polynomial, cubic and radial basis kernels. The researchers opt for 10-fold cross-validation.

All five kernel models had an accuracy rate in excess of 93%; see Table 11.


Kernel Type                        Correctly Classified (%)
Linear                             93.28 ± 3.07
Normalized quadratic polynomial    93.69 ± 2.69
Quadratic polynomial               93.89 ± 3.06
Cubic polynomial                   94.64 ± 2.94
Radial basis function              95.95 ± 2.61

Table 11: Upstill-Goddard et al.'s kernels and classification results.

Flood Susceptibility
Tehrany et al.43 evaluate support vector machines with different kernel functions for spatial prediction of flood occurrence in the Kuala Terengganu basin, Malaysia. Model attributes were constructed using ten geographic factors: altitude, slope, curvature, stream power index (SPI), topographic wetness index (TWI), distance from the river, geology, land use/cover (LULC), soil and surface runoff.

Four kernels, linear (LN), polynomial (PL), radial basis function (RBF) and sigmoid (SIG), were used to assess factor importance. This was achieved by eliminating each factor and then measuring the Cohen's kappa index of the model. The overall rank of each factor44 is shown in Table 12. Overall, slope was the most important factor, followed by distance from river and then altitude.

Factor      Average   Rank
Altitude    0.265     3
Slope       0.288     1
Curvature   0.225     6
SPI         0.215     8
TWI         0.235     4
Distance    0.268     2
Geology     0.223     7
LULC        0.215     8
Soil        0.228     5
Runoff      0.140     10

Table 12: Variable importance calculated from data reported in Tehrany et al.


PRACTITIONER TIP

Notice how both Upstill-Goddard et al. and Tehrany et al. use multiple kernels in developing their models. This is always a good strategy, because it is rarely obvious which kernel is optimal at the onset of a research project.

Signature Authentication
Radhika et al.45 consider the problem of automatic signature authentication using a variety of algorithms, including a support vector machine (SVM). The other algorithms considered included a Bayes classifier (BC), fast Fourier transform (FT), linear discriminant analysis (LD) and principal component analysis (PCA).

Their experiment used a signature database containing 75 subjects, with 15 genuine samples and 15 forged samples for each subject. Features were extracted from images drawn from the database and used as inputs to train and test the various methods.

The researchers report a false rejection rate of 8% for SVM, 13% for FT, 10% for BC, 11% for PCA and 12% for LD.

Prediction of Vitamin D Status
The accepted biomarker of vitamin D status is serum 25-hydroxyvitamin D (25(OH)D) concentration. Unfortunately, in large epidemiological studies direct measurement is often not feasible. However, useful proxies for 25(OH)D are available by using questionnaire data.

Guo et al.46 develop a support vector machine to predict serum 25(OH)D concentration in large epidemiological studies using questionnaire data. A total of 494 participants were recruited onto the study and asked to complete a questionnaire which included sun exposure and sun protection behaviors, physical activity, smoking history, diet and the use of supplements. Skin types were defined by spectrophotometric measurements of skin reflectance to calculate melanin density for exposed skin sites (dorsum of hand, shoulder) and non-exposed skin sites (upper inner arm, buttock).

A multiple regression model (MLR) estimated using 12 explanatory variables47 was used to benchmark the support vector machine. The researchers selected a radial basis function for the kernel, with identical explanatory factors to those used in the MLR. The data were randomly assigned to a training sample (n = 294) and a validation sample (n = 174).

The researchers report a correlation of 0.74 between predicted scores and measured 25(OH)D concentration for the support vector machine. They also note that it performed better than MLR in correctly identifying individuals with vitamin D deficiency. Overall they conclude that the "RBF SVR [radial basis function support vector machine] method has considerable promise for the prediction of vitamin D status for use in chronic disease epidemiology and potentially other situations".

PRACTITIONER TIP

The performance of the SVM is very closely tied to the choice of the kernel function. There exist many popular kernel functions that have been widely used for classification, e.g. linear, Gaussian radial basis function, polynomial and so on. Data scientists can spend a considerable amount of time tweaking the parameters of a specified kernel function via trial-and-error. Here are four general approaches that can speed up the process:

1. Cross-validation48 (see the sketch following this list).

2. Multiple kernel learning,49 which attempts to construct a generalized kernel function so as to solve all classification problems through combining different types of standard kernel functions.

3. Evolution & particle swarm optimization: Thadani et al.50 use gene expression programming algorithms to evolve the kernel function of the SVM. An analogous approach has been proposed using particle swarm optimization.51

4. Automatic kernel selection using the C5.0 algorithm, which attempts to select the optimal kernel function based on the statistical characteristics and distribution of the data.
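As an illustration of the first approach, the sketch below (our own, not taken from the sources cited above) uses e1071's tune.svm to cross-validate several standard kernels on the iris data and keep the one with the lowest cross-validation error.

library(e1071)
data(iris)

kernels <- c("linear", "polynomial", "radial", "sigmoid")

# 10-fold cross-validated error for each candidate kernel
cv_error <- sapply(kernels, function(k) {
  tune.svm(Species ~ ., data = iris, kernel = k,
           cost = 2^(0:4))$best.performance
})

cv_error
names(which.min(cv_error))   # kernel with the lowest cross-validation error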


Support Vector Classification


Technique 17

Binary Response Classification with C-SVM

A C-SVM for binary response classification can be estimated using the package svmpath with the svmpath function.

svmpath(x, y, kernel.function, ...)

Key parameters include the response variable y, coded as (-1, +1), the covariates x, and the kernel specified via kernel.function.

PRACTITIONER TIP

Although there is an ever growing number of kernels, four workhorses of applied research are:

• Linear: $K(x_i, x_j) = x_i^T x_j$

• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

• Radial basis function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$

• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

Here $\gamma$, $r$ and $d$ are kernel parameters.
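To make these formulas concrete, here is a minimal sketch (not from the book) that evaluates each of the four kernels for one pair of vectors; the parameter values are arbitrary.

x_i <- c(1, 2, 3)
x_j <- c(2, 0, 1)
gamma <- 0.5; r <- 1; d <- 3    # arbitrary kernel parameters

linear  <- sum(x_i * x_j)                      # x_i' x_j
poly    <- (gamma * sum(x_i * x_j) + r)^d      # (gamma x_i' x_j + r)^d
rbf     <- exp(-gamma * sum((x_i - x_j)^2))    # exp(-gamma ||x_i - x_j||^2)
sigmoid <- tanh(gamma * sum(x_i * x_j) + r)    # tanh(gamma x_i' x_j + r)

c(linear = linear, polynomial = poly, rbf = rbf, sigmoid = sigmoid)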


Step 1 Load Required Packages
We build the C-SVM for binary response classification using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.
> data(PimaIndiansDiabetes2, package = "mlbench")
> require(svmpath)

Step 2 Prepare Data & Tweak Parameters
The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp.
> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

The response diabetes is a factor containing the labels "pos" and "neg". However, the svmpath method requires the response to be numeric, taking the values -1 or +1. The following converts diabetes into a format usable by svmpath:
> y <- temp$diabetes
> levels(y) <- c(-1, 1)
> y <- as.numeric(as.character(y))
> y <- as.matrix(y)

Support vector machine kernels generally depend on the inner product of attribute vectors, so very large values might cause numerical problems. We standardize the attributes using the scale method; the results are stored in the matrix x.
> x <- temp
> x$diabetes <- NULL   # remove the response variable
> x <- scale(x)

We use nrow to count the remaining observations (as a check, it should equal 724). We then set the training sample to select, at random and without replacement, 600 observations. The remaining 124 observations form the test sample.


> set.seed(103)
> n = nrow(x)
> train <- sample(1:n, 600, FALSE)

Step 3 Estimate & Evaluate Model
The svmpath function can use two popular kernels: the polynomial and the radial basis function. We will assess both, beginning with the polynomial kernel. It is selected by setting kernel.function = poly.kernel. We also set trace = FALSE to prevent svmpath from printing results to the screen at each iteration.

> fit <- svmpath(x[train, ], y[train, ], kernel.function = poly.kernel,
    trace = FALSE)

A nice feature of svmpath is that it computes the entire regularization path for the SVM cost parameter, along with the associated classification error. We use the with method to identify the minimum error.
> with(fit, Error[Error == min(Error)])
 [1] 140 140 140 140 140 140 140 140 140 140 140 140
     140 140

Two things are worth noting here. First, the number of misclassified observations is 140 out of the 600 training observations, which is approximately 23%. Second, each occurrence of 140 is associated with a unique regularization value. Since the regularization value is the penalty parameter of the error term (and svmpath reports the inverse of this parameter), we will select for our model the minimum value and store it in lambda. This is achieved via the following steps:

1. Store the minimum error values in error.

2. Grab the row numbers of these minimum errors using the which method.

3. Obtain the regularization parameter values associated with the minimum errors and store them in temp_lamdba.

4. Identify which value in temp_lamdba is the minimum, store it in lambda, and print it to the screen.


> error <- with(fit, Error[Error == min(Error)])
> min_err_row <- which(fit$Error == min(fit$Error))
> temp_lamdba <- fit$lambda[min_err_row]

> loc <- which(fit$lambda[min_err_row] == min(fit$lambda[min_err_row]))

> lambda <- temp_lamdba[loc]
> lambda
[1] 73.83352

The method svmpath actually reports the inverse of the regularization parameter (often, and somewhat confusingly, called gamma in the literature). We obtain a value of 73.83352, which corresponds to a gamma of 1/73.83352 = 0.0135.

Next we follow the same procedure for the radial basis function kernel. In this case the estimated regularization parameter is stored in lambdaR.
> fitR <- svmpath(x[train, ], y[train, ], kernel.function = radial.kernel,
    trace = FALSE)
> error <- with(fitR, Error[Error == min(Error)])

> min_err_row <- which(fitR$Error == min(fitR$Error))
> temp_lamdba <- fitR$lambda[min_err_row]

> loc <- which(fitR$lambda[min_err_row] == min(fitR$lambda[min_err_row]))

> lambdaR <- temp_lamdba[loc]

> lambdaR
[1] 0.09738556
> error[1]/600
[1] 0.015

Two things are noteworthy about this result. First, the regularization parameter is estimated as 1/0.09738556 = 10.268. Second, the error is estimated at only 1.5%. This is very likely an indication that the model has been over-fit.


PRACTITIONER TIP

It often helps intuition to visualize data. Let's use a few lines of code to estimate a simple model and visualize it. We will fit a subset of the model we are already working on, using the first twelve patient observations on the attributes glucose and age, storing the result in xx. We standardize xx using the scale method. Then we grab the first twelve observations of the response variable diabetes.

> xx <- cbind(temp$glucose[1:12], temp$age[1:12])
> xx <- scale(xx)
> yy <- y[1:12]

Next we use the method svmpath to estimate the model and use plot to show the resultant data points. We use step = 1 to show the results at the first value of the regularization parameter and step = 8 to show the results at the last step. The dotted lines in Figure 17.1 represent the margin; notice it gets narrower from step 1 to step 8 as the cost parameter increases. The support vectors are represented by the open dots. Notice that the number of support vectors increases as we move from step 1 to step 8. Also notice that at step 8 only one point is misclassified (point 7).

> example <- svmpath(xx, yy, trace = TRUE, plot = FALSE)
> par(mfrow = c(1, 2))
> plot(example, xlab = "glucose", ylab = "age", step = 1)
> plot(example, xlab = "glucose", ylab = "age", step = 8)


Figure 17.1: Plot of support vectors using svmpath with the response variable diabetes and attributes glucose & age.


Step 4 Make Predictions

NOTE

Automated choice of kernel and regularization parameters is challenging. This is because it is extremely easy to over-fit an SVM model on the validation sample if you only consider the misclassification rate. The consequence is that you end up with a model that is not generalizable to the test data and/or a model that performs considerably worse than the discarded models with higher test sample error rates.52

Although we suspect the radial basis kernel results in an over-fit, we will compare its predictions to those of the optimal polynomial kernel. First we use the test data and the radial basis kernel via the predict method. The confusion matrix is printed using the table method. The error rate is then calculated. It is 35%, considerably higher than the 1.5% indicated during validation.

> pred <- predict(fitR, newx = x[-train, ], lambda = lambdaR, type = "class")

> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 65 27
        1 16 16

> error_rate = (1 - sum(pred == y[-train, ]) / 124)
> round(error_rate, 2)
[1] 0.35


PRACTITIONER TIP

Degradation in performance due to over-fitting can be surprisingly large. The key is to remember that the primary goal of the validation sample is to provide a reliable indication of the expected error on the test sample and future, as yet unseen, samples. Throughout this book we have used the set.seed() method to help ensure replicability of the results. However, given the stochastic nature of the validation-test sample split, we should always expect variation in performance across different realizations. This suggests that evaluation should always involve multiple partitions of the data to form training, validation and test sets, as the sampling of data for a single partition might arbitrarily favor one classifier over another.
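A minimal sketch of this idea, reusing the x and y objects and the svmpath workflow defined earlier in this technique (this is our own illustration, not the book's procedure): the data are re-split several times and the test error recorded for each split, so you can see the variation a single partition would hide.

set.seed(2015)
n_splits <- 5
err <- numeric(n_splits)

for (s in 1:n_splits) {
  tr  <- sample(1:nrow(x), 600, FALSE)                  # a fresh random partition
  f   <- svmpath(x[tr, ], y[tr, ],
                 kernel.function = poly.kernel, trace = FALSE)
  lam <- min(f$lambda[which(f$Error == min(f$Error))])  # smallest lambda at the minimum training error
  p   <- predict(f, newx = x[-tr, ], lambda = lam, type = "class")
  err[s] <- 1 - sum(p == y[-tr, ]) / length(y[-tr, ])
}

round(err, 3)         # test error varies from split to split
round(mean(err), 3)   # average over splits is a steadier estimate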

Let's now look at the polynomial kernel model.

> pred <- predict(fit, newx = x[-train, ], lambda = lambda, type = "class")

> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 76 13
        1  5 30

> error_rate = (1 - sum(pred == y[-train, ]) / 124)

> round(error_rate, 2)
[1] 0.15

It seems the error rate for this choice of kernel is around 15%. This is less than half the error of the radial basis kernel SVM.


Technique 18

Multicategory Classification with C-SVM

A C-SVM for multicategory response classification can be estimated using the package e1071 with the svm function.

svm(y ~ ., data, kernel, cost, ...)

Key parameters include kernel - the kernel function, cost - the cost parameter, the multicategory response variable y, and the covariates data.

Step 1 Load Required Packages
We build the C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.
> require(e1071)
> library(mlbench)
> data(Vehicle)

Step 2 Prepare Data & Tweak Parameters
We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.
> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Step 3 Estimate & Evaluate Model
We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(Class ~ ., data = Vehicle[train, ])

The summary function provides details of the estimated model.
> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05555556

Number of Support Vectors:  376

 ( 122 64 118 72 )

Number of Classes:  4

Levels:
 bus opel saab van

The function reports the type of support vector machine (C-classification), kernel, cost parameter and gamma model parameter. Notice the model estimates 376 support vectors, with 122 in the first class (bus) and 72 in the fourth class (van).
A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, and cost between 4 and 16. We store the results in obj.
> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = 2^(-2:2), cost = 2^(2:4))

Once again we can use the summary function to see the result.
> summary(obj)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.25   16

- best performance: 0.254

The method automatically performs a 10-fold cross validation. It appears the best model has a gamma of 0.25 and a cost parameter equal to 16, with a 25.4% misclassification rate.

PRACTITIONER TIP

Greater control over model tuning can be achieved using the tune.control method inside of tune.svm. For example, to perform a 20-fold cross validation you would use:

tunecontrol = tune.control(sampling = "cross", cross = 20)

Your call using tune.svm would look something like this:

obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = 2^(-2:2), cost = 2^(2:4),
    tunecontrol = tune.control(sampling = "cross", cross = 20))

Visualization of the output of tune.svm using plot is often useful for fine tuning.
> plot(obj)

Figure 18.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. Two things are noteworthy about this image. First, a larger cost parameter seems to indicate a better fit. Second, a gamma of 0.5 or less also indicates a better fitting model.

Using this information we re-tune the model, with gamma ranging from 0.01 to 0.5 and cost ranging from 16 to 256. Now the best performance occurs


with gamma set to 0.03 and cost equal to 32.
> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
    gamma = seq(0.01, 0.5, by = 0.01), cost = 2^(4:8))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.03   32

- best performance: 0.17


Figure 18.1: Tuning Multicategory Classification with C-SVM using the Vehicle data set.

We store the results for the optimal model in the objects bestC and bestGamma and refit the model.
> bestC <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ], cost = bestC,
    gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method.
> print(fit)


Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

The fitted model now has 262 support vectors, down from 376 for the original model.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 81%.
> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

 ( 100 32 94 36 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 81
Single Accuracies:
 84 86 88 76 82 78 76 84 82 74

It can be fun to visualize a two dimensional projection of the fitted data. To do this we use the plot function with variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median value. Figure 18.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra, svSymbol = "v",
    slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
      D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
      Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra), Scat.Ra = median(Vehicle$Scat.Ra),
      Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect), Max.L.Rect = median(Vehicle$Max.L.Rect),
      Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis), Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
      Ra.Gyr = median(Vehicle$Ra.Gyr), Skew.Maxis = median(Vehicle$Skew.Maxis),
      Skew.maxis = median(Vehicle$Skew.maxis), Kurt.maxis = median(Vehicle$Kurt.maxis),
      Kurt.Maxis = median(Vehicle$Kurt.Maxis), Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 18.2: Multicategory Classification with C-SVM: two dimensional projection of Vehicle.

Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))


               Predicted Class
Observed Class  bus opel saab van
          bus    87    2    0   5
          opel    0   55   28   2
          saab    0   20   68   0
          van     1    1    1  76

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 17.3% on the test sample.
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.173


Technique 19

Multicategory Classification with nu-SVM

A nu-SVM for multicategory response classification can be estimated using the package e1071 with the svm function.

svm(y ~ ., data, kernel, type = "nu-classification", ...)

Key parameters include kernel - the kernel function, type set to "nu-classification", the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Step 1 and Step 2 are outlined beginning on page 134.

We estimate the support vector machine using the default settings. This will use a radial basis function as the kernel, with a nu parameter value set equal to 0.5.

> fit <- svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification")

The summary function provides details of the estimated model.
> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")


Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.05555556
         nu:  0.5

Number of Support Vectors:  403

 ( 110 88 107 98 )

Number of Classes:  4

Levels:
 bus opel saab van

The function reports the type of support vector machine (nu-classification), kernel, nu parameter and gamma parameter. Notice the model estimates 403 support vectors, with 110 in the first class (bus) and 98 in the fourth class (van). The total number of support vectors is slightly higher than estimated for the C-SVM discussed on page 134. We use the tune.svm method to identify the best model for gamma and nu ranging between 0.05 and 0.45. We store the results in obj.
> obj <- tune.svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification",
    gamma = seq(0.05, 0.45, by = 0.1), nu = seq(0.05, 0.45, by = 0.1))

We can use the summary function to see the result.
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.05 0.05

- best performance: 0.178


The method automatically performs a 10-fold cross validation. We see that the best model has gamma and nu equal to 0.05, with a 17.8% misclassification rate. Visualization of the output of tune.svm can be achieved using the plot function.
> plot(obj)

Figure 19.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. It seems that a smaller nu and gamma lead to a better fit of the training data.

Using this information we re-tune the model, with gamma and nu ranging from 0.01 to 0.05. The best performance occurs with both parameters set to 0.02 and an overall misclassification error rate of 16.4%.
> obj <- tune.svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification",
    gamma = seq(0.01, 0.05, by = 0.01), nu = seq(0.01, 0.05, by = 0.01))

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.02 0.02

- best performance: 0.164


Figure 19.1: Tuning Multicategory Classification with nu-SVM using the Vehicle data set.

We store the results for the optimal model in the objects bestNU and bestGamma and then refit the model.
> bestNU <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method


> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

The fitted model now has 204 support vectors, less than half the number required by the model we initially built.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross validation. Overall the model has a total accuracy of 75.6%.
> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

 ( 71 29 72 32 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 75.6
Single Accuracies:
 82 80 84 80 72 64 74 76 76 68

Next we visualize a two dimensional projection of the fitted data. To do this we use the plot function with variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median value. Figure 19.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra, svSymbol = "v",
    slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
      D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
      Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra), Scat.Ra = median(Vehicle$Scat.Ra),
      Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect), Max.L.Rect = median(Vehicle$Max.L.Rect),
      Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis), Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
      Ra.Gyr = median(Vehicle$Ra.Gyr), Skew.Maxis = median(Vehicle$Skew.Maxis),
      Skew.maxis = median(Vehicle$Skew.maxis), Kurt.maxis = median(Vehicle$Kurt.maxis),
      Kurt.Maxis = median(Vehicle$Kurt.Maxis), Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 19.2: Multicategory Classification with nu-SVM: two dimensional projection of Vehicle.

Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))


               Predicted Class
Observed Class  bus opel saab van
          bus    87    1    1   5
          opel    0   41   42   2
          saab    0   30   58   0
          van     1    1    2  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.6% on the test sample.
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.246


Technique 20

Bound-constraint C-SVM Classification

A bound-constraint C-SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function.

ksvm(y ~ ., data, kernel, type = "C-bsvc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "C-bsvc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 1 Load Required Packages
We build the bound-constraint C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23. The scatterplot3d package will be used to create a three dimensional scatter plot.
> library(kernlab)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2 Prepare Data & Tweak Parameters
We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.


> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate & Evaluate Model
We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
    kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.
> print(fit)
Support Vector Machine object of class "ksvm"

SV type: C-bsvc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 375

Objective Function Value: -49.1905 -53.3552 -48.0924 -193.2145 -48.9507 -59.7368
Training error: 0.164
Cross validation error: 0.248

The function reports the type of support vector machine (C-bsvc), the model cost parameter and kernel parameter (C = 1, sigma = 0.05). Notice the model estimates 375 support vectors, with a training error of 16.4% and a cross validation error of 24.8%. In this case we observe a relatively large difference between the training error and the cross validation error. In practice the cross validation error is often a better indicator of the expected performance on the test sample.

Since the output of ksvm (in our example, the values stored in fit) is an S4 object, both errors can be accessed directly using "@", although the "preferred" approach is to use an accessor function such as cross(fit) and error(fit). If you don't know the accessor function names, you can always use attributes(fit) and access the required parameter using @ (for S4 objects) or $ (for S3 objects).

> fit@error
[1] 0.164

> fit@cross
[1] 0.248

PRACTITIONER TIP

The package kernlab supports a wide range of kernels. Here are nine popular choices:

• rbfdot - Radial basis kernel function

• polydot - Polynomial kernel function

• vanilladot - Linear kernel function

• tanhdot - Hyperbolic tangent kernel function

• laplacedot - Laplacian kernel function

• besseldot - Bessel kernel function

• anovadot - ANOVA RBF kernel function

• splinedot - Spline kernel

• stringdot - String kernel
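A minimal sketch (not from the book) that loops over a few of these kernlab kernels and compares their 10-fold cross validation error on the Vehicle training sample defined above; kernlab's default kernel parameters are used for each.

candidate_kernels <- c("rbfdot", "polydot", "vanilladot", "laplacedot")

# 10-fold cross validation error for each kernel, using kernlab's default
# kernel parameters ("automatic" sigma estimation for rbfdot / laplacedot)
cv_err <- sapply(candidate_kernels, function(k)
  ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
       kernel = k, cross = 10)@cross)

round(cv_err, 3)   # cross validation error by kernel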

We need to tune the model to obtain the optimum parameters. Let's create a few lines of R code to do this for us. First we set up the ranges for the cost and sigma parameters.
> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Check to see if it contains the value 35 (it should).
> runs <- n_sigma * n_cost


> count.cost <- 0
> count.sigma <- 0
> runs
[1] 35

The results in terms of cross validation error, cost and sigma are stored in results.
> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.
> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:
> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
        C = cost[j], kernel = "rbfdot", kpar = list(sigma = sigma[i]),
        cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }    # end cost loop


Notice we set cross = 45 to perform leave one out validation.
When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.
> results <- as.data.frame(results)

Take a peek and you should see something like this:
> results
  cost sigma     error
1    4   0.1 0.2358586
2    4   0.2 0.2641414
3    4   0.3 0.2744108
4    4   0.4 0.2973064

Now let's find the best cross validation performance and its associated row number.
> with(results, error[error == min(error)])
[1] 0.2074074

> which(results$error == min(results$error))
[1] 11

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.
> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]

> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "sigma", "error")
> best_result


     cost sigma     error
[1,]   16   0.1 0.2074074

So we see the optimal results occur for cost = 16 and sigma = 0.1, with a cross validation error of 20.7%.

Figure 20.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:
> scatterplot3d(results$cost, results$sigma, results$error,
    xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 20.1: Bound-constraint C-SVM tuning, 3D scatterplot using Vehicle.

After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.6%.
> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
    C = fit_cost, kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)

> fit@cross
[1] 0.2457912


Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    1    2   3
          opel    1   31   50   3
          saab    1   24   61   2
          van     1    0    3  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 26.3% on the test sample.
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.263


Technique 21

Weston-Watkins Multi-Class SVM

A bound-constraint Weston-Watkins SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function.

ksvm(y ~ ., data, kernel, type = "kbb-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "kbb-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Step 1 and Step 2 are outlined beginning on page 151.

We estimate the Weston-Watkins support vector machine using a radial basis kernel (kernel = "rbfdot") and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
    kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.
> print(fit)
Support Vector Machine object of class "ksvm"


SV type: kbb-svc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 356

Objective Function Value: 0
Training error: 0.148
Cross validation error: 0.278

The function reports the type of support vector machine (kbb-svc), the model cost parameter, kernel type and parameter (sigma = 0.05), the number of support vectors (356), a training error of 14.8% and a cross validation error of 27.8%. Both types of error can be accessed individually as follows:
> fit@error
[1] 0.148

> fit@cross
[1] 0.278

Let's see if we can tune the model using our own grid search. First we set up the ranges for the cost and sigma parameters.
> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.
> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results in terms of cross validation error, cost and sigma are stored in results.
> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.
> i = 1


> j = 1
> count = 1

The main loop for tuning is as follows:
> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
        C = cost[j], kernel = "rbfdot", kpar = list(sigma = sigma[i]),
        cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }    # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.
> results <- as.data.frame(results)

Take a peek; you should see something like this:


> results
  cost sigma     error
1    4   0.1 0.2356902
2    4   0.2 0.2639731
3    4   0.3 0.3228956
4    4   0.4 0.2973064
5    4   0.5 0.2885522

Now let's find the best cross validation performance and its associated row number.
> with(results, error[error == min(error)])
[1] 0.2122896

> which(results$error == min(results$error))
[1] 21

So the optimal cross validation error is 21.2% and is located in row 21 of results.

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.
> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   64   0.1 0.2122896

Figure 21.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:
> scatterplot3d(results$cost, results$sigma, results$error,
    xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 21.1: Weston-Watkins Multi-Class SVM tuning, 3D scatterplot using Vehicle.

After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 27.6%.
> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
    C = fit_cost, kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)

> fit@cross


[1] 0.2762626

Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    1    0   5
          opel    1   37   43   4
          saab    1   33   53   1
          van     1    1    2  75

The misclassification error rate can be calculated:
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.269

Overall the model achieves an error rate of 26.9% on the test sample.


Technique 22

Crammer-Singer Multi-Class SVM

A Crammer-Singer SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function.

ksvm(y ~ ., data, kernel, type = "spoc-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "spoc-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Step 1 and Step 2 are outlined beginning on page 151.

We estimate the Crammer-Singer multi-class support vector machine using a radial basis kernel (kernel = "rbfdot") and parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
    kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.
> print(fit)
Support Vector Machine object of class "ksvm"


SV type: spoc-svc (classification)
 parameter: cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.05

Number of Support Vectors: 340

Objective Function Value: 0
Training error: 0.152
Cross validation error: 0.244

The function reports the type of support vector machine (spoc-svc), the model cost parameter and kernel parameter (sigma = 0.05), the number of support vectors (340), a training error of 15.2% and a cross validation error of 24.4%. Both types of error can be accessed individually as follows:
> fit@error
[1] 0.152

> fit@cross
[1] 0.244

We need to tune the model. First we set up the ranges for the cost and sigma parameters.
> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.
> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results in terms of cross validation error, cost and sigma are stored in results.
> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.
> i = 1


> j = 1
> count = 1

The main loop for tuning is as follows (note that for this technique the type must be "spoc-svc"):
> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
        C = cost[j], kernel = "rbfdot", kpar = list(sigma = sigma[i]),
        cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }    # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.
> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:


> results
  cost sigma     error
1    4   0.1 0.2299663
2    4   0.2 0.2378788
3    4   0.3 0.2582492
4    4   0.4 0.2730640

Now let's find the best cross validation performance and its associated row number. The optimal cross validation error is 20.8% and is located in row 11 of results.
> with(results, error[error == min(error)])
[1] 0.2075758

> which(results$error == min(results$error))
[1] 11

We save the optimal values in best_result using a few lines of code.
> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   16   0.1 0.2075758

Figure 22.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:
> scatterplot3d(results$cost, results$sigma, results$error,
    xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 22.1: Crammer-Singer Multi-Class SVM tuning, 3D scatterplot using Vehicle.

After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 25%.
> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
    C = fit_cost, kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)

> fit@cross


[1] 0.2478114

Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.
> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    0    0   6
          opel    6   43   30   6
          saab    3   29   52   4
          van     1    0    1  77

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.9% on the test sample.
> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.249


Support Vector Regression


Technique 23

SVM eps-Regression

An eps-regression can be estimated using the package e1071 with the svm function.

svm(y ~ ., data, kernel, cost, type = "eps-regression", epsilon, gamma, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and epsilon, type set to "eps-regression", the continuous response variable y, and the covariates data.

Step 1 Load Required Packages
We build the support vector machine eps-regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.
> library(e1071)
> data("bodyfat", package = "TH.data")

Step 2 Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample, without replacement, from the 71 observations.
> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3 Estimate & Evaluate Model
We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression")

The summary function provides details of the estimated model.
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
    epsilon:  0.1

Number of Support Vectors:  33

The function reports the type of support vector machine (eps-regression), the cost parameter, kernel type and associated gamma and epsilon parameters, and the number of support vectors, in this case 33.
A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and epsilon between 0.01 and 0.9. The results are stored in obj.
> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression",
    gamma = 2^(-2:2), cost = 2^(2:4), epsilon = seq(0.01, 0.9, by = 0.01))

Once again we can use the summary function to see the result.


> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost epsilon
  0.25    8    0.01

- best performance: 26.58699

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and an epsilon of 0.01, with a cross-validated mean squared error of 26.6. These metrics can also be accessed by calling $best.performance and $best.parameters.
> obj$best.performance
[1] 26.58699

> obj$best.parameters
  gamma cost epsilon
6  0.25    8    0.01

We use these optimum parameters to refit the model, using leave one out cross validation, as follows:
> bestEpi <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

The fitted model now has 44 support vectors.
> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

The summary method provides additional details. It also reports the mean square error for each cross validation fold (not shown here).
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

45-fold cross-validation on training data:

Total Mean Squared Error: 28.0095
Squared Correlation Coefficient: 0.7877644


Step 4 Make Predictions
The predict method with the test data and the fitted model fit are used as follows:
> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 23.1, is obtained using the plot function.
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712
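Beyond the squared correlation, it can help to look at the size of the prediction errors directly. A minimal sketch (our addition, not the book's) computing the RMSE and MAE of the test-sample predictions obtained above:

obs <- bodyfat$DEXfat[-train]

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error

round(c(RMSE = rmse, MAE = mae), 3)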


Figure 23.1: Predicted and observed values using SVM eps-regression for bodyfat.


Technique 24

SVM nu-Regression

A nu-regression can be estimated using the package e1071 with the svm function.

svm(y ~ ., data, kernel, cost, type = "nu-regression", nu, gamma, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and nu, type set to "nu-regression", the continuous response variable y, and the covariates data.

Step 3 Estimate & Evaluate Model
Step 1 and Step 2 are outlined beginning on page 172.

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression")

The summary function provides details of the estimated model.
> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression")

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
         nu:  0.5

Number of Support Vectors:  33

The function provides details of the type of support vector machine (nu-regression) cost parameter kernel type and associated gamma and nu pa-rameters and the number of support vectors in this case 33 which happensto be the same as estimated using the SVM eps-regression (see page 172)We use the tunesvm method to identify the best model for gamma rangingbetween 025 and 4 and cost between 4 and 16 and nu between 001 to 009The results are stored in objgt obj lt- tunesvm(DEXfat ~ data = bodyfat[train ]type=nu-regressiongamma = 2^( -22)cost = 2^(24) nu=seq(0109by=01))

We use the summary function to see the result:

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost  nu
  0.25    8 0.9

- best performance: 26.42275

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and nu equal to 0.9, with a cross-validation error (mean squared error) of 26.4. These metrics can also be accessed by calling $best.performance and $best.parameters:

> obj$best.performance
[1] 26.42275


> obj$best.parameters
    gamma cost  nu
126  0.25    8 0.9

We use these optimum parameters to refit the model using leave one out cross validation as follows:

> bestNU <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

The fitted model now has 45 support vectors:

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

The summary method provides additional details. It also reports the mean square error for each cross validation fold (not shown here):

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

45-fold cross-validation on training data:

Total Mean Squared Error: 27.98642
Squared Correlation Coefficient: 0.7871421

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 24.1, is obtained using the plot function:

> plot(bodyfat$DEXfat[-train], pred,
    xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712


Figure 24.1: Predicted and observed values using SVM nu-regression for bodyfat


Technique 25

Bound-constraint SVM eps-Regression

A bound-constraint SVM eps-regression can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "eps-bsvr", kpar, ...)

Key parameters include kernel - the kernel function; type set to "eps-bsvr"; kpar, which contains the kernel parameter values; the continuous response variable y; and the covariates data.

Step 1: Load Required Packages

We build the model using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:

set.seed(465)
train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10:

fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-bsvr",
    kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: eps-bsvr  (regression)
Gaussian Radial Basis kernel function.
 Hyperparameter : sigma =  0.05

Number of Support Vectors : 35

Objective Function Value : -77.105
Training error : 0.096254
Cross validation error : 18.77613

The function provides details of the type of support vector machine (eps-bsvr) and the kernel parameter (sigma = 0.05). Notice the model estimates 35 support vectors, with a training error of 0.096 and a cross validation error of 18.78. Since the output of ksvm (in this case the values are stored in fit) is an S4 object, both errors can be accessed directly using "@":

> fit@param$C
[1] 1

> fit@param$epsilon
[1] 0.1

> fit@error
[1] 0.09625418

> fit@cross
[1] 18.77613


We need to tune the model to obtain the optimum parameters. Let's create a small bit of code to do this for us. First we set up the ranges for the epsilon and sigma parameters:

> epsilon <- seq(0.01, 0.1, by = 0.01)
> sigma <- seq(0.1, 1, by = 0.01)

> n_epsilon <- length(epsilon)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Since it contains 910 models, it may take a little while to run:

> runs <- n_sigma * n_epsilon
> runs
[1] 910

The results, in terms of cost, sigma, epsilon and the cross validation error, are stored in results (note the column names are listed in the same order in which the tuning loop below fills them):

> results <- 1:(4 * runs)
> dim(results) <- c(runs, 4)
> colnames(results) <- c("cost", "sigma", "epsilon", "error")

The loop counter variables are as follows:

> count.epsilon <- 0
> count.sigma <- 0

> i = 1
> j = 1
> k = 1
> count = 1

The main loop for tuning is as follows:

> for (val in epsilon) {
    for (val in sigma) {
      fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-bsvr",
                  epsilon = epsilon[k],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@param$epsilon
      results[count, 4] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    k = k + 1
    cat("iteration (%) = ", round((count / runs) * 100, 0), "\n")
  }  # end epsilon loop
> i = 1
> k = 1
> j = j + 1

Notice we set cross = 45 to perform leave one out validation. When you execute the above code, as it is running you should see output along the lines of:

iteration (%) =  10
iteration (%) =  20
iteration (%) =  30

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma epsilon    error
1    1  0.10    0.01 22.15108
2    1  0.11    0.01 22.76689
3    1  0.12    0.01 23.24192
4    1  0.13    0.01 23.78274
5    1  0.14    0.01 24.35676

Now let's find the best cross validation performance (21.86) and its associated row number (274):

> with(results, error[error == min(error)])


[1] 21.86181

> which(results$error == min(results$error))
[1] 274

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_epsilon <- results[best_per_row, 3]
> fit_xerror <- results[best_per_row, 4]

> best_result <- cbind(fit_cost, fit_epsilon, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "epsilon", "sigma", "error")

> best_result
     cost epsilon sigma    error
[1,]    1    0.04   0.1 21.86181

After all that effort, we may as well estimate the optimal model using the training data and show the cross validation error. It is around 21.87:

> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-bsvr",
    C = fit_cost, epsilon = fit_epsilon,
    kernel = "rbfdot", kpar = list(sigma = fit_sigma),
    cross = 45)

> fit@cross
[1] 21.86834

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:


pred <- predict(fit, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 25.1, is visualized using the plot method combined with the abline method to show the linear regression line:

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred,
    xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

The squared correlation between the test sample predicted and observed values is 0.834:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
     [,1]
[1,] 0.834


Figure 25.1: Bound-constraint SVM eps-regression observed and fitted values using bodyfat


Support Vector Novelty Detection


Technique 26

One-Classification SVM

A one-classification SVM can be estimated using the package e1071 with the svm function:

svm(x, kernel, type = "one-classification", ...)

Key parameters: x - the attributes you use to train the model; type = "one-classification"; and kernel - the kernel function.

Step 1: Load Required Packages

We build the one-classification SVM using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> library(e1071)
> library(caret)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

To begin, we choose the bus category as our one-classification class and create the variable isbus to store TRUE/FALSE values of vehicle type:

> set.seed(103)
> Vehicle$isbus[Vehicle$Class == "bus"] <- TRUE
> Vehicle$isbus[Vehicle$Class != "bus"] <- FALSE

As a reminder of the distribution of Vehicle we use the table method


> table(Vehicle$Class)

 bus opel saab  van
 218  212  217  199

As a check, note that you should observe 218 observations corresponding to bus in isbus == TRUE:

> table(Vehicle$isbus)

FALSE  TRUE
  628   218

Our next step is to create the variables isbusTrue and isbusFalse to hold positive (bus) and negative (non-bus) observations respectively:

> isbusTrue <- subset(Vehicle, Vehicle$isbus == TRUE)
> isbusFalse <- subset(Vehicle, Vehicle$isbus == FALSE)

Next, from the subset of 218 positive values, sample 100 at random for training. The remaining 118 positive observations will form part of the test sample:

> train <- sample(1:218, 100, FALSE)

> trainAttributes <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 19]

Now as a quick check type trainLabels and check that they are all bus
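One way to carry out this check (an illustrative one-liner, not from the original text) is to tabulate the labels rather than print them all:

# All 100 sampled training labels should fall in the "bus" category
table(trainLabels)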

Step 3: Estimate & Evaluate Model

We need to tune the model. We begin by setting up the tuning parameters. The variable runs contains the total number of models we will estimate during the tuning process:

gamma <- seq(0.01, 0.05, by = 0.01)
nu <- seq(0.01, 0.05, by = 0.01)

n_gam <- length(gamma)
n_nu <- length(nu)

runs <- n_gam * n_nu

count.gamma <- 0
count.nu <- 0


Next, set up the variable to store the results and the loop count variables i, k and count:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("gamma", "nu", "Performance")

> i = 1
> k = 1
> count = 1

The main tuning loop is created as follows (notice we estimate the model using a 10-fold cross validation (cross = 10) and store the results in fit):

> for (val in nu) {
    for (val in gamma) {
      print(gamma[i])
      print(nu[k])

      fit <- svm(trainAttributes, y = NULL,
                 type = 'one-classification',
                 nu = nu[k],
                 gamma = gamma[i],
                 cross = 10)

      results[count, 2] = fit$gamma
      results[count, 1] = fit$nu
      results[count, 3] = fit$tot.accuracy
      count.gamma = count.gamma + 1
      count = count + 1
      i = i + 1
    }  # end gamma loop
    i = 1
    k = k + 1
  }  # end nu loop

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)


Take a peek and you should see something like this:

> results
  gamma   nu Performance
1  0.01 0.01          94
2  0.01 0.02          88
3  0.01 0.03          87
4  0.01 0.04          81

Now let's find the best cross validation performance:

> with(results, Performance[Performance == max(Performance)])
[1] 94 94

Notice the best performance contains two values, both at 94. Since this is a very high value it is possibly the result of over fitting - a consequence of over tuning a model. The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$Performance == max(results$Performance))

> fit_gamma <- results[best_per_row, 1]
> fit_nu <- results[best_per_row, 2]
> fit_per <- results[best_per_row, 3]

> best_result <- cbind(fit_gamma, fit_nu, fit_per)
> colnames(best_result) <- c("gamma", "nu", "Performance")

> best_result
     gamma   nu Performance
[1,]  0.01 0.01          94
[2,]  0.02 0.01          94

The relationship between the parameters and performance is visualized using the scatterplot3d function and shown in Figure 26.1:

> scatterplot3d(results$nu, results$gamma, results$Performance,
    xlab = "nu",
    ylab = "gamma",
    zlab = "Performance")


Figure 26.1: Relationship between tuning parameters

Let's fit the optimum model using a leave one out cross validation:

> fit <- svm(trainAttributes, y = NULL,
    type = "one-classification",
    nu = fit_nu, gamma = fit_gamma,
    cross = 100)

Step 4: Make Predictions

To illustrate the performance of the model on the test sample we will use the confusionMatrix method from the caret package. The first step is to gather together the required information for the training and test sample, using the predict method to make the forecasts:

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.pred.train <- predict(fit, trainpredictors)
> svm.pred.test <- predict(fit, testpredictors)

Next we create two tables, confTrain containing the training predictions and confTest for the test sample predictions:

> confTrain <- table(Predicted = svm.pred.train, Reference = trainLabels)
> confTest <- table(Predicted = svm.pred.test, Reference = testLabels)

Now we call the confusionMatrix method for the test sample:

> confusionMatrix(confTest, positive = "TRUE")

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   175    5
    TRUE    453  113

               Accuracy : 0.3861
                 95% CI : (0.351, 0.4221)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 1

                  Kappa : 0.093
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.9576
            Specificity : 0.2787
         Pos Pred Value : 0.1996
         Neg Pred Value : 0.9722
             Prevalence : 0.1582
         Detection Rate : 0.1515
   Detection Prevalence : 0.7587
      Balanced Accuracy : 0.6181

       'Positive' Class : TRUE

The method produces a range of test statistics, including an accuracy rate of only 38% and a kappa of 0.093. Maybe we over-tuned the model a little. In this example we can see clearly that the consequence of over fitting in training is poor generalization in the test sample.

For illustrative purposes we re-estimate the model, this time with nu = 0.05 and gamma = 0.05:

> fit <- svm(trainAttributes, y = NULL,
    type = "one-classification",
    nu = 0.05, gamma = 0.05,
    cross = 100)

A little bit of housekeeping, as discussed previously:

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.pred.train <- predict(fit, trainpredictors)
> svm.pred.test <- predict(fit, testpredictors)

> confTrain <- table(Predicted = svm.pred.train, Reference = trainLabels)

> confTest <- table(Predicted = svm.pred.test, Reference = testLabels)

Now we are ready to pass the necessary information to the confusionMatrix method:

> confusionMatrix(confTest, positive = "TRUE")


Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   568   29
    TRUE     60   89

               Accuracy : 0.8807
                 95% CI : (0.8552, 0.9031)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 0.001575

                  Kappa : 0.5952
 Mcnemar's Test P-Value : 0.001473

            Sensitivity : 0.7542
            Specificity : 0.9045
         Pos Pred Value : 0.5973
         Neg Pred Value : 0.9514
             Prevalence : 0.1582
         Detection Rate : 0.1193
   Detection Prevalence : 0.1997
      Balanced Accuracy : 0.8293

       'Positive' Class : TRUE

Now the accuracy rate is 0.88 with a kappa of around 0.6. An important takeaway is that support vector machines, as with many predictive analytic methods, can be very sensitive to over fitting.


Notes

35. See the paper by Cortes, C., Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20, 273-297.
36. Note that classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. The SVM is well suited for such tasks.
37. The nu parameter in nu-SVM.
38. For further details see Hoehler F.K., 2000. Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol 53, 499-503.
39. Gomes, A.L., et al. Classification of dengue fever patients based on gene expression data using support vector machines. PloS one 5.6 (2010): e11267.
40. Huang, Wei, Yoshiteru Nakamori, and Shou-Yang Wang. Forecasting stock market movement direction with support vector machine. Computers & Operations Research 32.10 (2005): 2513-2522.
41. Min, Jae H., and Young-Chan Lee. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications 28.4 (2005): 603-614.
42. Upstill-Goddard, Rosanna, et al. Support vector machine classifier for estrogen receptor positive and negative early-onset breast cancer. (2013): e68606.
43. Tehrany, Mahyat Shafapour, et al. Flood susceptibility assessment using GIS-based support vector machine model with different kernel types. Catena 125 (2015): 91-101.
44. Calculated by the author using the average of all four support vector machines.
45. Radhika, K.R., M.K. Venkatesha, and G.N. Sekhar. Off-line signature authentication based on moment invariants using support vector machine. Journal of Computer Science 6.3 (2010): 305.
46. Guo, Shuyu, Robyn M. Lucas, and Anne-Louise Ponsonby. A novel approach for prediction of vitamin D status using support vector regression. PloS one 8.11 (2013).
47. Latitude, ambient ultraviolet radiation levels, ambient temperature, hours in the sun 6 weeks before the blood draw (log transformed to improve the linear fit), frequency of wearing shorts in the last summer, physical activity (three levels: mild, moderate, vigorous), sex, hip circumference, height, left back shoulder melanin density, buttock melanin density and inner upper arm melanin density.
48. For further details see either of the following:
    1. Cawley G.C. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In: International Joint Conference on Neural Networks. IEEE; 2006. p. 1661-1668.
    2. Vapnik V., Chapelle O. Bounds on error expectation for support vector machines. Neural Computation. 2000; 12(9): 2013-2036.
    3. Muller K.R., Mika S., Ratsch G., Tsuda K., Scholkopf B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks.
49. See:
    1. Bach F.R., Lanckriet G.R., Jordan M.I. Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the twenty-first international conference on Machine learning. ACM; 2004. p. 6-13.
    2. Zien A., Ong C.S. Multiclass multiple kernel learning. In: Proceedings of the 24th international conference on Machine learning. ACM; 2007. p. 1191-1198.
50. Thadani K., Jayaraman V., Sundararajan V. Evolutionary selection of kernels in support vector machines. In: International Conference on Advanced Computing and Communications. IEEE; 2006. p. 19-24.
51. See for example:
    1. Lin, Shih-Wei, et al. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications 35.4 (2008): 1817-1824.
    2. Melgani, Farid, and Yakoub Bazi. Classification of electrocardiogram signals with support vector machines and particle swarm optimization. Information Technology in Biomedicine, IEEE Transactions on 12.5 (2008): 667-677.
52. For further discussion on the issues surrounding over fitting see G.C. Cawley and N.L.C. Talbot. Preventing over-fitting in model selection via Bayesian regularization of the hyper-parameters. Journal of Machine Learning Research, volume 8, pages 841-861, April 2007.


Part III

Relevance Vector Machine


The Basic Idea

The relevance vector machine (RVM) shares its functional form with the support vector machine (SVM) discussed in Part II. RVMs exploit a probabilistic Bayesian learning framework.

We begin with a data set of N training pairs {xi, yi}, where xi is the input feature vector and yi is the target output. The RVM makes predictions using

$$y_i = \mathbf{w}^{T}\mathbf{K} + \varepsilon \qquad (26.1)$$

where w = [w1, ..., wN] is the vector of weights, K = [k(xi, x1), ..., k(xi, xN)]ᵀ is the vector of kernel functions, and ε is the error, which for algorithmic simplicity is assumed to be zero-mean, independently identically distributed Gaussian with variance σ². Therefore the prediction yi consists of the target output polluted by Gaussian noise.

The Gaussian likelihood of the data is given by

$$p(\mathbf{y}\mid\mathbf{w},\sigma^{2}) = (2\pi)^{-N/2}\,\sigma^{-N}\exp\left\{-\frac{1}{2\sigma^{2}}\,\lVert\mathbf{y}-\Phi\mathbf{w}\rVert^{2}\right\} \qquad (26.2)$$

where y = [y1, ..., yN], w = [w0, w1, ..., wN] and Φ is an N × (N + 1) matrix with Φij = k(xi, xj−1) and Φi1 = 1.

A standard approach to estimate the parameters in equation 26.2 is to use a zero-mean Gaussian to introduce an individual hyperparameter αi on each of the weights wi, so that

$$w_i \sim N\left(0,\ \frac{1}{\alpha_i}\right) \qquad (26.3)$$

where α = [α1, ..., αN]. The prior and posterior probability distributions are then easily derived53.
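For reference, and not part of the original text, the resulting posterior over the weights takes the standard form found in Tipping's derivation (see note 53); the notation A = diag(α) is introduced only for this sketch:

$$p(\mathbf{w}\mid\mathbf{y},\boldsymbol{\alpha},\sigma^{2}) = N(\boldsymbol{\mu},\ \Sigma), \qquad \Sigma = \left(\sigma^{-2}\Phi^{T}\Phi + A\right)^{-1}, \qquad \boldsymbol{\mu} = \sigma^{-2}\,\Sigma\,\Phi^{T}\mathbf{y}$$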

The RVM uses far fewer kernel functions than the SVM. It also yields probabilistic predictions, automatic estimation of parameters and the ability to use arbitrary kernel functions. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning those basis functions which are relevant for making good predictions.

NOTE

The RVM requires the inversion of N × N matrices. Since it takes O(N³) operations to invert an N × N matrix, it can quickly become computationally expensive, and therefore slow, as the sample size increases.


Practical Applications

PRACTITIONER TIP

The relevance vector machine is often assessed using the root mean squared error (RMSE) or the Nash-Sutcliffe efficiency (NS). The larger the value of NS, the smaller the value of RMSE.

$$RMSE = \sqrt{\frac{\sum_{i=1}^{N}(\hat{x}_i - x_i)^2}{N}} \qquad (26.4)$$

$$NS = 1 - \frac{\sum_{i=1}^{N}(\hat{x}_i - x_i)^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \qquad (26.5)$$

Here $\hat{x}_i$ is the predicted value, $x_i$ the observed value, $\bar{x}$ the average of the observed sample and N the number of observations.
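As a minimal illustration (not from the original text), both metrics can be computed directly in R; the observed and predicted vectors below are hypothetical:

# Hypothetical observed and predicted values
observed  <- c(3.2, 4.1, 5.0, 6.3, 7.8)
predicted <- c(3.0, 4.3, 5.4, 6.0, 7.5)

# Root mean squared error (equation 26.4)
rmse <- sqrt(mean((predicted - observed)^2))

# Nash-Sutcliffe efficiency (equation 26.5)
ns <- 1 - sum((predicted - observed)^2) / sum((observed - mean(observed))^2)

round(c(RMSE = rmse, NS = ns), 3)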

Oil Sand Pump Prognostics

In wet mineral processing operations slurry pumps deliver a mixture of bitumen, sand and small pieces of rock from one site to another. Due to the harsh environment in which they operate these pumps can fail suddenly. The consequent downtime can lead to a large economic cost for the mining company due to the interruption of the mineral processing operations. Hu and Tse54 combine a RVM with an exponential model to predict the remaining useful life (RUL) of slurry pumps.

Data were collected from the inlet and outlet of slurry pumps operating in an oil sand mine using four accelerometers placed at key locations on the pump. In total the pump was subjected to 904 measurement hours.

Two data-sets were collected from different positions (Site 1 and Site 2) in the same pump, and an alternative model of the pump degradation was developed using the sum of two exponential functions. The overall performance results are shown in Table 13.

          RVM      Exponential Model
Site 1    70.51    25.31
Site 2    28.85    70.5

Table 13: Hu and Tse's weighted average accuracy of prediction

Precision Agriculture

Precision agriculture involves using aircraft or spacecraft to gather high-resolution information on crops to better manage the growing and harvesting cycle. Chlorophyll concentration, measured in mass per unit leaf area (μg cm−2), is an important biophysical measurement retrievable from air or space reflectance data. Elarab et al55 use a RVM to estimate spatially distributed chlorophyll concentrations.

Data was gathered from three aircraft flights during the growing season (early growth, mid growth and early flowering). The data set consisted of the dependent variable (chlorophyll concentration) and 8 attribute or explanatory variables. All attributes were used to train and test the model.

The researchers used six different kernels (Gauss, Laplace, spline, Cauchy, thin plate spline (tps) and bubble) alongside a 5-fold cross validation. A variety of kernel widths ranging from 10^-5 to 10^5 were also used. The root mean square error and Nash-Sutcliffe efficiency were used to assess model fit. The best fitting trained model used a tps kernel and had a root mean square error of 5.31 μg cm−2 and Nash-Sutcliffe efficiency of 0.76.

The test data consisted of a sample gathered from an unseen (by the RVM) flight. The model using this data had a root mean square error of 8.52 μg cm−2 and Nash-Sutcliffe efficiency of 0.71. The researchers conclude "This result showed that the [RVM] model successfully performed when given unseen data".

Deception Detection in Speech

Zhou56 develop a technique to detect deception in speech by extracting speech dynamics (prosodic and non-linear features) and applying a RVM.

The data set consisted of recorded interviews of 16 male and 16 female participants. A total of 640 deceptive samples and 640 non-deceptive samples were used in the analysis.

Speech dynamics such as pitch frequency, short-term vocal energy and mel-frequency cepstrum coefficients57 were used as input attribute features.

Classification accuracy of the RVM was assessed relative to a support vector machine and a neural network for various training sample sizes. For example, with a training sample of 400, the RVM correctly classifies 70.37% of male voices, whilst the support vector machine and neural network correctly classify 68.14% and 42.13% respectively.

The researchers observe that a combination of prosodic and non-linear features modeled using a RVM is effective for detecting deceptive speech.

Diesel Engine Performance

The mathematical form of diesel engines is highly nonlinear. Because of this they are often modeled using an artificial neural network (ANN). Wong et al58 perform an experiment to assess the ability of an ANN and a RVM to predict diesel engine performance.

Three inputs are used in the models (engine speed, load and cooling water temperature) and the output variables are brake specific fuel consumption and exhaust emissions such as nitrogen oxide and carbon dioxide.

A water-cooled 4-cylinder direct-injection diesel engine was used to generate data for the experiment. Data was recorded at five engine speeds (1200, 1400, 1600, 1800 and 2000 rpm) with engine torques (28, 70, 140, 210 and 252 Nm). Each test was carried out three times and the average values were used in the analysis. In all, 22 sets of data were collected, with 18 used as the training data and 4 to assess model performance.

The ANN had a single hidden layer with twenty hidden neurons and a hyperbolic tangent sigmoid transfer function.

For the training data the researchers report an average root mean square error (RMSE) of 3.27 for the RVM and 41.59 for the ANN. The average RMSE for the test set was 17.73 and 38.56 for the RVM and ANN respectively. The researchers also note that the average R² for the test set was 0.929 for the RVM and 0.707 for the ANN.

Gas Layer Classification

Zhao et al59 consider the issue of optimal parameter selection for a RVM and identification of gas at various drill depths. To optimize the parameters of a RVM they use the particle swarm algorithm60.

The sample consists of 201 wells drilled at various depths, with 63 gas producing wells and 138 non-gas producing wells. A total of 12 logging attributes are used as features in the RVM model. A prediction accuracy of 93.53% is obtained during training using all the attributes; the prediction accuracy was somewhat lower at 91.75% for the test set.

Credit Risk Assessment

Tong and Li61 assess the use of a RVM and a support vector machine (SVM) in the assessment of company credit risk. The data consist of financial characteristics of 464 Chinese firms listed on China's securities markets. A total of 116 of the 464 firms had experienced a serious credit event and were coded as "0" by the researchers. The remaining firms were coded "1".

The attribute vector consisted of 25 financial ratios drawn from seven basic categories (cash flow, return on equity, earning capacity, operating capacity, growth capacity, short term solvency and long term solvency). Since many of these financial ratios have a high correlation with each other, the researchers used principal components (PCA) and isomaps to reduce the dimensionality. The models they then compared were a PCA-RVM, Isomap-SVM and Isomap-RVM. A summary of the results is reported in Table 14.

Model        Accuracy (%)
Isomap-RVM   90.28
Isomap-SVM   86.11
PCA-RVM      89.59

Table 14: Summary of Tong and Li's overall prediction accuracy


PRACTITIONER TIP

You may have noticed that several researchers compared their RVM to a support vector machine or other model. A data scientist I worked with faced an issue where one model (logistic regression) performed marginally better than an alternative model (decision tree). However, the decision was made to go with the decision tree because, due to a quirk in the software, it was better labeled.

Zhou et al demonstrated the superiority of a RVM over a support vector machine for their data-set. The evidence was less compelling in the case of Tong and Li, where Table 14 appears to indicate a fairly close race. In the case where alternative models perform in a similar range, the rationale for which to choose may be based on considerations not related to which model performed best in absolute terms on the test set. This was certainly the case in the choice between logistic regression and decision trees faced by my data scientist co-worker.


Technique 27

RVM Regression

A RVM regression can be estimated using the package kernlab with the rvm function:

rvm(y ~ ., data, kernel, kpar, ...)

Key parameters include kernel - the kernel function; kpar, which contains the kernel parameter values; the continuous response variable y; and the covariates data.

Step 1: Load Required Packages

We build the RVM regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:

set.seed(465)
train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model

We estimate the RVM using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.5. We also set the cross validation parameter cross = 10:

> fit <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel = "rbfdot", kpar = list(sigma = 0.5),
    cross = 10)

The print function provides details of the estimated model, including the kernel type (radial basis), kernel parameter (sigma = 0.5), number of relevance vectors (43) and the cross validation error:

> print(fit)
Relevance Vector Machine object of class "rvm"
Problem type: regression

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma =  0.5

Number of Relevance Vectors : 43
Variance :  12.11944
Training error : 2.4802080247
Cross validation error : 10.79221

Since the output of rvm is an S4 object, it can be accessed directly using "@":

> fit@error
[1] 2.480208

> fit@cross
[1] 10.79221

OK, let's fit two other models and choose the one with the lowest cross validation error. The first model, fit1, is estimated using the default settings of rvm. The second model, fit2, uses a Laplacian kernel. Ten-fold cross validation is used for both models:

> fit1 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    cross = 10)

> fit2 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel = "laplacedot", kpar = list(sigma = 0.001), cross = 10)

Now we are ready to assess the fit of each of the three models - fit, fit1 and fit2:

> fit@cross
[1] 10.82592

> fit1@cross
[1] 65.50046

> fit2@cross
[1] 2.19755

The model fit2, using a Laplacian kernel, has by far the best overall cross validation error.

Step 4: Make Predictions

The predict method with the test data and the fitted model fit2 is used as follows:

pred <- predict(fit2, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 27.1, is visualized using the plot method combined with the abline method to show the linear regression line:

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred,
    xlab = "DEXfat",
    ylab = "Predicted Values",
    main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")


The squared correlation between the test sample predicted and observed values is 0.813:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
     [,1]
[1,] 0.813

Figure 27.1: RVM regression of observed and fitted values using bodyfat


Notes

53. For further details see:
    1. M.E. Tipping, "Bayesian inference: an introduction to principles and practice in machine learning", in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds., pp. 41-62, Springer, Berlin, Germany, 2004.
    2. M.E. Tipping, "SparseBayes: An Efficient Matlab Implementation of the Sparse Bayesian Modelling Algorithm (Version 2.0)", March 2009, http://www.relevancevector.com
54. Hu, Jinfei, and Peter W. Tse. A relevance vector machine-based approach with application to oil sand pump prognostics. Sensors 13.9 (2013): 12663-12686.
55. Elarab, Manal, et al. Estimating chlorophyll with thermal and broadband multispectral high resolution imagery from an unmanned aerial system using relevance vector machines for precision agriculture. International Journal of Applied Earth Observation and Geoinformation (2015).
56. Zhou, Yan, et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features. Neurocomputing 151 (2015): 1042-1052.
57. Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, commonly used as features in speech recognition systems. See for example Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling. ISMIR, 2000; and Hasan, Md Rashidul, Mustafa Jamil, Md Golam Rabbani and Md Saifur Rahman. Speaker identification using Mel frequency cepstral coefficients. Variations 1 (2004): 4.
58. Wong, Ka In, Pak Kin Wong, and Chun Shun Cheung. Modelling and prediction of diesel engine performance using relevance vector machine. International Journal of Green Energy 12.3 (2015): 265-271.
59. Zhao, Qianqian, et al. Relevance Vector Machine and Its Application in Gas Layer Classification. Journal of Computational Information Systems 9.20 (2013): 8343-8350.
60. See Haiyan Lu, Pichet Sriyanyong, Yong Hua Song, Tharam Dillon. Experimental study of a new hybrid PSO with mutation for economic dispatch with non-smooth cost function [J]. International Journal of Electrical Power and Energy Systems, 32(9), 2012: 921-935.
61. Tong, Guangrong, and Siwei Li. Construction and Application Research of Isomap-RVM Credit Assessment Model. Mathematical Problems in Engineering (2015).


Part IV

Neural Networks


The Basic Idea

An artificial neural network (ANN) is constructed from a number of interconnected nodes known as neurons. These are arranged into an input layer, a hidden layer and an output layer. The input nodes correspond to the number of features you wish to feed into the ANN, and the number of output nodes corresponds to the number of items you wish to predict. Figure 27.2 presents an overview of an artificial neural network topology. It has 2 input nodes, 1 hidden layer with 3 nodes and 1 output node.

Figure 27.2: A basic neural network
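As a rough sketch (not from the original text), the 2-3-1 topology of Figure 27.2 could be created with the neuralnet package used later in this part; the data frame and variable names here are hypothetical:

library(neuralnet)

set.seed(1)
df <- data.frame(x1 = runif(50), x2 = runif(50))
df$y <- as.numeric(df$x1 + df$x2 > 1)

# 2 input nodes (x1, x2), one hidden layer with 3 nodes, 1 output node
fit <- neuralnet(y ~ x1 + x2, data = df, hidden = 3)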

The Neuron

At the heart of a neural network is the neuron. Figure 27.3 outlines the workings of an individual neuron. A weight is associated with each arc into a neuron, and the neuron then sums all inputs according to

$$S_j = \sum_{i=1}^{N} w_{ij}\,a_i + b_j \qquad (27.1)$$

The parameter bj represents the bias associated with the neuron. It allows the network to shift the activation function "upwards" or "downwards". This type of flexibility is important for successful machine learning.

Figure 27.3: An artificial neuron

Activation and Learning

A neural network is generally initialized with random weights. Once the network has been initialized it is then trained. Training consists of two elements - activation and learning.

• Step 1: An activation function f(Sj) is applied and the output passed to the next neuron(s) in the network. The sigmoid function is a popular activation function (a minimal sketch of this computation appears after this list):

$$f(S_j) = \frac{1}{1 + \exp(-S_j)} \qquad (27.2)$$

• Step 2: A learning "law" describes how the adjustments of the weights are to be made during training. The most popular learning law is backpropagation.
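To make equations 27.1 and 27.2 concrete, here is a minimal R sketch (not from the original text) of a single neuron's forward pass; the inputs, weights and bias are hypothetical values:

# Hypothetical activations feeding into neuron j, with weights and bias
a <- c(0.5, -1.2, 0.3)       # incoming activations
w <- c(0.8, 0.1, -0.4)       # weights on each incoming arc
b <- 0.2                     # bias b_j

# Weighted sum (equation 27.1)
S <- sum(w * a) + b

# Sigmoid activation (equation 27.2)
sigmoid <- function(s) 1 / (1 + exp(-s))
sigmoid(S)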


The Backpropagation Algorithm

It consists of the following steps:

1. The network is presented with input attributes and the target outcome.

2. The output of the network is compared to the actual known target outcome.

3. The weights and biases of each neuron are adjusted by a factor based on the derivative of the activation function, the differences between the network output and the actual target outcome, and the neuron outputs. Through this process the network "learns".

Two parameters are often used to speed up learning and prevent the system from being trapped in a local minimum. They are known as the learning rate and the momentum.
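As an illustration only (not part of the original text), a single gradient-descent weight update using a learning rate and a momentum term might be sketched in R as follows; every value is hypothetical:

w          <- 0.35   # current weight
grad       <- 0.12   # gradient of the error with respect to w
prev_delta <- 0.05   # weight change applied in the previous iteration

eta   <- 0.01        # learning rate
alpha <- 0.5         # momentum

# New change: step down the gradient plus a fraction of the previous step
delta <- -eta * grad + alpha * prev_delta
w <- w + delta
w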

PRACTITIONER TIP

ANNs are initialized by setting random values for the weights and biases. One rule of thumb is to set the random values to lie in the range (−2/k to 2/k), where k is the number of inputs.
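A minimal sketch of this initialization rule (my reading of the tip, assuming the range is -2/k to 2/k; none of this code is from the original text):

k <- 6          # hypothetical number of inputs
n_neurons <- 3  # hypothetical number of neurons in the layer

# Draw initial weights uniformly from (-2/k, 2/k)
init_weights <- matrix(runif(k * n_neurons, min = -2 / k, max = 2 / k),
                       nrow = k, ncol = n_neurons)
init_weights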

How Many Nodes and Hidden Layers?

One of the very first questions asked about neural networks is how many nodes and layers should be included in the model. There are no fixed rules as to how many nodes to include in the hidden layer. However, as the number of hidden nodes increases, so does the time taken for the network to learn from the input data.


Practical Applications

Sheet Sediment Transport

Tayfur62 models sediment transport using artificial neural networks (ANNs). The sample consisted of the original experimental hydrological data of Kilinic & Richardson63.

A three-layer feed-forward artificial neural network with two neurons in the input layer, eight neurons in the hidden layer and one neuron in the output layer was built. The sigmoid function was used as the activation function in the training of the network, and the learning of the ANN was accomplished by the back-propagation algorithm. Values of 0.2 and −1.0 were assigned to the network weights and biases before starting the training process.

The ANN's performance was assessed relative to popular physical hydrological models (flow velocity, shear stress, stream power and unit stream power) against a combination of slope types (mild, steep and very steep) and rain intensity (low, high, very high).

The ANN outperformed the popular hydrological models for very high intensity rainfall on both steep and very steep slopes, see Table 15.

                     Mild slope   Steep slope   Very steep slope
Low intensity        physical     physical      ANN
High intensity       physical     physical      physical
Very high intensity  physical     ANN           ANN

Table 15: Which model is best? Model transition framework derived from Tayfur's analysis


PRACTITIONER TIP

Tayfur's results highlight an important issue facing the data scientist. Not only is it often a challenge to find the best model among competing candidates, it is even more difficult to identify a single model that works in all situations. A great solution offered by Tayfur is to have a model transition matrix, that is, determine which model(s) perform well under specific conditions and then use the appropriate model for a given condition.

Stock Market Volatility

Mantri et al64 investigate the performance of a multilayer perceptron relative to standard econometric models of stock market volatility (GARCH, Exponential GARCH, Integrated GARCH and the Glosten, Jagannathan and Runkle GARCH model).

Since the multilayer perceptron does not make assumptions about the distribution of stock market innovations, it is of interest to financial analysts and statisticians. The sample consisted of daily data collected on two Indian stock indices (BSE SENSEX and NSE NIFTY) over the period January 1995 to December 2008.

The researchers found no statistical difference between the volatility of the stock indices estimated by the multilayer perceptron and standard econometric models of stock market volatility.

Trauma Survival

The performance of trauma departments in the United Kingdom is widely audited by applying predictive models that assess the probability of survival and examining the rate of unexpected survivals and deaths. The standard approach is the TRISS methodology, which consists of two logistic regressions, one applied if the patient has a penetrating injury and the other applied for blunt injuries65.

Hunter, Henry and Ferguson66 assess the performance of the TRISS methodology against alternative logistic regression models and a multilayer perceptron.


The sample consisted of 15,055 cases gathered from Scottish trauma departments over the period 1992-1996. The data was divided into two subsets: the training set, containing 7,224 cases from 1992-1994, and the test set, containing 7,831 cases gathered from 1995-1996. The researchers' logistic regression models and the neural network were optimized using the training set.

The neural network was optimized ten times, with the best resulting model selected. The researchers conclude that neural networks can yield better results than logistic regression.

PRACTITIONER TIP

The sigmoid function is a popular choice as an activation function. It is good practice to standardize (i.e. convert to the range (0,1)) all external input values before passing them into a neural network. This is because, without standardization, large input values require very small weighting factors. This can cause two basic problems:

1. Inaccuracies introduced by very small floating point calculations on your computer.

2. Changes made by the back-propagation algorithm will be extremely small, causing training to be slow (the gradient of the sigmoid function at extreme values would be approximately zero).

Brown Trout Redds

Lek et al67 compare the ability of multiple regression and neural networks to predict the density of brown trout redds in southwest France. Twenty nine observation stations, distributed on six rivers and divided into 205 morphodynamic units, collected information on 10 ecological metrics.

The models were fitted using all the ecological variables and also with a sub-set of four variables. Testing consisted of random selection for the training set (75% of observations) and the test set (25% of observations). The process was repeated a total of five times.

The average correlation between the observed and estimated values over the five samples is reported in Table 16. The researchers conclude that both multiple regression and neural networks can be used to predict the density of brown trout redds; however, the neural network model had better prediction accuracy.

       Neural Network        Multiple Regression
       Train      Test       Train      Test
       0.900      0.886      0.684      0.609

Table 16: Lek et al's reported correlation coefficients between estimated and observed values in training and test samples

Electric Fish Localization

Weakly electric fish emit an electric discharge used to navigate the surrounding water and to communicate with other members of their shoal. Tracking of individual fish is often carried out using infrared cameras. However, this approach becomes unreliable when there is a visual obstruction68.

Kiar et al69 develop a non-invasive means of tracking weakly electric fish in real-time using a cascade forward neural network. The data set contained 299 data points, which were interpolated to give 29,900 data points. The accuracy of the neural network was 94.3% within 1 cm of actual fish location, with a mean square error of 0.02 mm and an image frame rate of 213 Hz.

Chlorophyll Dynamics

Wu et al70 developed two modeling approaches - artificial neural networks (ANN) and multiple linear regression (MLR) - to simulate the daily Chlorophyll a dynamics in a northern German lowland river. Chlorophyll absorbs sunlight to synthesize carbohydrates from CO2 and water. It is often used as a proxy for the amount of phytoplankton present in water bodies.

Daily Chlorophyll a samples were taken over an 18 month period. In total 426 daily samples were obtained. Each 10th daily sample was assigned to the validation set, resulting in 42 daily observations. The calibration set contained 384 daily observations.

For ANN modelling a three layer back propagation neural network was used. The input layer consisted of 12 neurons corresponding to the independent variables shown in Table 17. The same independent variables were also used in the multiple regression model. The dependent variable in both models was the daily concentration of Chlorophyll a.


Air temperature                  Ammonium nitrogen
Average daily discharge          Chloride
Chlorophyll a concentration      Daily precipitation
Dissolved inorganic nitrogen     Nitrate nitrogen
Nitrite nitrogen                 Orthophosphate phosphorus
Sulfate                          Total phosphorus

Table 17: Wu et al's input variables

The results of the ANN and MLR illustrated a good agreement between the observed and predicted daily concentration of Chlorophyll a, see Table 18.

Model   R-Square   NS     RMSE
MLR     0.53       0.53   2.75
NN      0.63       0.62   1.94

Table 18: Wu et al's performance metrics (NS = Nash-Sutcliffe efficiency, RMSE = root mean square error)

PRACTITIONER TIP

Whilst there are no fixed rules about how to standardize inputs, here are four popular choices for your original input variable xi:

$$z_i = \frac{x_i - x_{min}}{x_{max} - x_{min}} \qquad (27.3)$$

$$z_i = \frac{x_i - \bar{x}}{\sigma_x} \qquad (27.4)$$

$$z_i = \frac{x_i}{\sqrt{SS_i}} \qquad (27.5)$$

$$z_i = \frac{x_i}{x_{max} + 1} \qquad (27.6)$$

SSi is the sum of squares of xi, and x̄ and σx are the mean and standard deviation of xi.
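A minimal R sketch (not from the original text) applying the four standardizations to a hypothetical input vector:

x <- c(2, 5, 9, 14, 20)   # hypothetical input variable

z1 <- (x - min(x)) / (max(x) - min(x))   # equation 27.3: min-max scaling
z2 <- (x - mean(x)) / sd(x)              # equation 27.4: z-score
z3 <- x / sqrt(sum(x^2))                 # equation 27.5: divide by root sum of squares
z4 <- x / (max(x) + 1)                   # equation 27.6: divide by max + 1

round(cbind(z1, z2, z3, z4), 3)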


Examples of Neural Network Classification


Technique 28

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation & backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+")

Key parameters include the response variable y, the covariates data, hidden - the number of hidden neurons, and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(neuralnet)
> data(PimaIndiansDiabetes2, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp:


> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)
> levels(y) <- c(0, 1)

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations. The variable f is used to store the form of the model, where diabetes is the dependent or response variable. Be sure to check that n is equal to 724, the number of observations in the full sample:

> set.seed(103)
> n = nrow(temp)

> train <- sample(1:n, 600, FALSE)

> f <- diabetes ~ pregnant + glucose + pressure + mass + pedigree + age

Step 3: Estimate & Evaluate Model

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "rprop+")

The print method gives a nice overview of the model:

> print(fit)


Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1 181.242328    0.009057962229  8448

A nice feature of the neuralnet package is the ability to visualize a fitted network using the plot method, see Figure 28.1:

> plot(fit, intercept = FALSE, show.weights = FALSE)

PRACTITIONER TIP

It often helps intuition to visualize data. To see the fitted intercepts set intercept = TRUE, and to see the estimated neuron weights set show.weights = TRUE. For example, try entering:

plot(fit, intercept = TRUE, show.weights = TRUE)



Figure 28.1: Resilient Backpropagation with Backtracking neural network using PimaIndiansDiabetes2

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5    -1
7    -1
12    1
17    1
20    1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 61  6
        1 20 37

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.21

The misclassification error rate is around 21%.


Technique 29

Resilient Backpropagation

A neural network with resilient backpropagation without backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-")

Key parameters include the response variable y, the covariates data, hidden - the number of hidden neurons, and algorithm = "rprop-" to specify resilient backpropagation without backtracking.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "rprop-")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226:

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "rprop-")

1 repetition was calculated.

        Error Reached Threshold Steps
1 184.0243459    0.009715675095  6814

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5     1
7    -1
12    1
17    1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 62  4
        1 19 39

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.19

The misclassification error rate is around 19%.


Technique 30

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr")

Key parameters include the response variable y, the covariates data, hidden - the number of hidden neurons, and algorithm = "slr" to use a globally convergent algorithm that uses resilient backpropagation without weight backtracking and, additionally, the smallest learning rate.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons

> fit <- neuralnet(f, data = temp[train, ],
    hidden = 4,
    algorithm = "slr")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226; however, the algorithm (due to the small learning rate) takes very many more steps to converge:

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "slr")

1 repetition was calculated.

        Error Reached Threshold Steps
1 179.7865898    0.009813138137 96960

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5    -1
7    -1
12    1
17    1
20   -1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 58 10
        1 23 33

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.27

The misclassification error rate is around 27%.


Technique 31

Probabilistic Neural Network

A probabilistic neural network can be estimated using the package pnn with the learn function:

learn(y, data, ...)

Key parameters include the response variable y and the covariates contained in data.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(pnn)
> data(PimaIndiansDiabetes2, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp:

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)


Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)

> temp$diabetes <- NULL
> temp <- scale(temp)
> temp <- cbind(as.factor(y), temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations:

> set.seed(103)
> n = nrow(temp)

> n_train <- 600
> n_test <- n - n_train

> train <- sample(1:n, n_train, FALSE)

Step 3: Estimate & Evaluate Model

The model is fitted using the learn method, with the fitted model stored in fit_basic:

> fit_basic <- learn(data.frame(y[train], temp[train, ]))

You can use the attributes method to identify the slot characteristics of fit_basic:

> attributes(fit_basic)
$names
[1] "model"           "set"             "category.column" "categories"
[5] "k"               "n"


PRACTITIONER TIP

Remember, you can access the contents of a fitted probabilistic neural network by using the $ notation. For example, to see what is in the "model" slot you would type:

> fit_basic$model
[1] "Probabilistic neural network"

The summary method provides details on the model:

> summary(fit_basic)

                Length Class      Mode
model           1      -none-     character
set             8      data.frame list
category.column 1      -none-     numeric
categories      2      -none-     character
k               1      -none-     numeric
n               1      -none-     numeric

Next we use the smooth method to set the smoothing parameter sigma. We use a value of 0.5:

> fit <- smooth(fit_basic, sigma = 0.5)

PRACTITIONER TIP

Much of the time you will not have a pre-specified value in mind for the smoothing parameter sigma. However, you can let the smooth function find the best value using its inbuilt genetic algorithm. To do that you would type something along the lines of:

> smooth(fit_basic)

The performance statistics of the fitted model are assessed using the perf method. For example, enter the following to see various aspects of the fitted model:

> fit <- perf(fit)
> fit$observed
> fit$guessed
> fit$success
> fit$fails
> fit$success_rate
> fit$bic

Step 4: Make Predictions

Let's take a look at the testing sample. To see the first row of covariates in the testing set enter:

> round(temp[-train, ][1, ], 2)

    pregnant glucose pressure  mass pedigree   age
100    -0.85   -1.07    -0.52 -0.63    -0.93 -1.05

You can see the first observation of the response variable in the test sample in a similar way:

> y[-train][1]
[1] neg
Levels: neg pos

Now let's predict the first response value in the test set using the covariates:

> guess(fit, as.matrix(temp[-train, ][1, ]))$category
[1] neg

Take a look at the associated probabilities. In this case there is a 99% probability associated with the neg class:

> guess(fit, as.matrix(temp[-train, ][1, ]))$probabilities
        neg         pos
0.996915706 0.003084294

Here is how to see both the prediction and the associated probabilities:

> guess(fit, as.matrix(temp[-train, ][1, ]))
$category
[1] "neg"

$probabilities
        neg         pos
0.996915706 0.003084294

OK, now we are ready to predict all the response values in the test sample. We can do this with a few lines of R code:

> pred <- 1:n_test
> for (i in 1:n_test) {
    pred[i] <- guess(fit, as.matrix(temp[-train, ][i, ]))$category
  }

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted neg pos
      neg  79   2
      pos   2  41

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == y[-train]) / n_test)
> round(error_rate, 3)
[1] 0.032

The misclassification error rate is around 3%.
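The same confusion matrix can also be turned into a few additional summary statistics. The short sketch below is not part of the original example; it simply reuses the pred and y objects created above, and the names accuracy, sensitivity and specificity are illustrative.

> cm <- table(pred, y[-train], dnn = c("Predicted", "Observed"))
> accuracy <- sum(diag(cm)) / sum(cm)
> sensitivity <- cm["pos", "pos"] / sum(cm[, "pos"])  # proportion of observed pos correctly predicted
> specificity <- cm["neg", "neg"] / sum(cm[, "neg"])  # proportion of observed neg correctly predicted
> round(c(accuracy, sensitivity, specificity), 3)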


Technique 32

Multilayer Feedforward Neural Network

A multilayer feedforward neural network can be estimated using the AMORE package with the train function:

train(net, P, T, error.criterium, ...)

Key parameters include net, the neural network you wish to train; P, the training set attributes; T, the training set response variable (output values); and the error criterion (Least Mean Squares, Least Mean Logarithm Squared or TAO Error) contained in error.criterium.

Step 1: Load Required Packages
The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(AMORE)
> data(PimaIndiansDiabetes2, package="mlbench")

Step 2: Prepare Data & Tweak Parameters
The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing observations. The cleaned data is stored in temp.

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)
> levels(y) <- c(0, 1)

> y <- as.numeric(as.character(y))

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations.

> set.seed(103)
> n = nrow(temp)
> train <- sample(1:n, 600, FALSE)

Now we need to create the neural network we wish to train:

> net <- newff(n.neurons = c(1, 3, 2, 1),
               learning.rate.global = 0.01,
               momentum.global = 0.5,
               error.criterium = "LMLS", Stao = NA,
               hidden.layer = "sigmoid",
               output.layer = "purelin",
               method = "ADAPTgdwm")

I'll explain the code above line by line. In the first line we're creating an object called net that will contain the structure of our new neural network. The first argument to newff is n.neurons, which allows us to specify the number of inputs, the number of nodes in each hidden layer, and the number of outputs. So in the example above we have 1 input, 2 hidden layers (the first containing 3 nodes and the second containing 2 nodes), and 1 output.

The learning.rate.global argument constrains how much the algorithm is allowed to change the weights from iteration to iteration as the network is trained. In this case learning.rate.global = 0.01, which means that the algorithm can't increase or decrease any one weight in the network by more than 0.01 from trial to trial.

The error.criterium argument specifies the error measure used at each iteration. There are three options: "LMS" (least mean squares), "LMLS" (least mean logarithm squared) and "TAO" (the TAO error method). In general I tend to choose "LMLS" as my starting point. However, I will often train my networks using all three methods.

The hidden.layer and output.layer arguments are used to choose the type of activation function used to interpret the summations of the inputs and weights for any layer in your network. I have set output.layer = "purelin", which results in linear output. Other options include tansig, sigmoid, hardlim and custom.

Finally, method specifies the solution strategy for converging on the weights within the network. For building prototype models I tend to use either ADAPTgd (adaptive gradient descent) or ADAPTgdwm (adaptive gradient descent with momentum).
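To make these options concrete, here is a hypothetical alternative specification (net_alt is an illustrative name, not part of the worked example) that swaps in the tansig activation for the hidden layers, the LMS error criterion, and plain adaptive gradient descent:

> net_alt <- newff(n.neurons = c(1, 3, 2, 1),
                   learning.rate.global = 0.01,
                   momentum.global = 0.5,
                   error.criterium = "LMS", Stao = NA,
                   hidden.layer = "tansig",
                   output.layer = "purelin",
                   method = "ADAPTgd")

The same train call used below could then be applied to net_alt and the resulting errors compared.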

Step 3: Estimate & Evaluate Model
The model is fitted using the train method. In this example I have set report=TRUE to provide output during the algorithm run and n.shows=5 to show a total of 5 training reports:

> fit <- train(net, P = temp[train, ], T = temp[train, 7],
               error.criterium = "LMLS", report = TRUE,
               show.step = 100, n.shows = 5)

index.show: 1 LMLS 0.337115972269707
index.show: 2 LMLS 0.335651051328758
index.show: 3 LMLS 0.335113569553075
index.show: 4 LMLS 0.334753676125557
index.show: 5 LMLS 0.334462044665089

Step 4: Make Predictions
The sign function is used to convert predictions into negative and positive. Here I use the fitted network held in fit to predict using the test sample:

> pred <- sign(sim(fit$net, temp[-train, ]))

Let's create a confusion matrix so we can see how well the neural network performed on the test sample. Notice that sign(temp[-train, 7]) contains the observed output response values for the test sample:

> table(pred, sign(temp[-train, 7]), dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 68 27
        1 13 16

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == sign(temp[-train, 7])) / 124)
> round(error_rate, 3)
[1] 0.323

The misclassification error rate is around 32%.
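Since the discussion above suggests trying more than one error criterion, a rough sketch along the following lines can be used to refit and score the same network under the LMS and LMLS criteria (illustrative only; fit_crit and pred_crit are hypothetical names, and the TAO criterion would additionally require a finite Stao value when the network is created):

> for (crit in c("LMS", "LMLS")) {
    fit_crit <- train(net, P = temp[train, ], T = temp[train, 7],
                      error.criterium = crit, report = FALSE,
                      show.step = 100, n.shows = 5)
    pred_crit <- sign(sim(fit_crit$net, temp[-train, ]))
    cat(crit, "test error:",
        round(1 - sum(pred_crit == sign(temp[-train, 7])) / 124, 3), "\n")
  }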

Examples of Neural Network Regression

Technique 33

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation and backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(neuralnet)
> data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations. The response variable and attributes are structured and stored in the variable f. In addition, the standardized variables are stored in the data frame scale_bodyfat.

> set.seed(465)

> train <- sample(1:71, 45, FALSE)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth +
     kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

> scale_bodyfat <- as.data.frame(scale(bodyfat))

Step 3: Estimate & Evaluate Model
The number of hidden neurons should be determined in relation to the needed complexity. For this example we use one hidden neuron.

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
                   hidden = 1, algorithm = "rprop+")

The plot.nn method can be used to visualize the network, as shown in Figure 33.1:

> plot.nn(fit)

Figure 33.1: Neural network used for the bodyfat regression (fitted network diagram showing the estimated weights for each covariate; Error 17.11438, Steps 2727).

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ], hidden = 1, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1 17.11437513    0.00955254228  2727

A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 3)
                                1
error                      17.114
reached.threshold           0.010
steps                    2727.000
Intercept.to.1layhid1      -0.403
waistcirc.to.1layhid1       0.127
hipcirc.to.1layhid1         0.246
age.to.1layhid1             0.028
elbowbreadth.to.1layhid1   -0.016
kneebreadth.to.1layhid1     0.050
anthro3a.to.1layhid1        0.001
anthro3b.to.1layhid1        0.085
anthro3c.to.1layhid1       -0.005
anthro4.to.1layhid1         0.131
Intercept.to.DEXfat        -2.971
1layhid.1.to.DEXfat         7.390

Step 4: Make Predictions
We predict using the scaled values. First remove the response variable DEXfat, then store the predictions in pred via the compute method:

> without_fat <- scale_bodyfat
> without_fat$DEXfat <- NULL

> pred <- compute(fit, without_fat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 33.2. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.864.

> linReg <- lm(pred$net.result ~ scale_bodyfat$DEXfat[-train])

> plot(scale_bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, scale_bodyfat$DEXfat[-train])^2, 6)
[1] 0.864411
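Since the appropriate number of hidden neurons is rarely obvious in advance, one simple check (a sketch, not part of the original example; it reuses f, scale_bodyfat, without_fat and train from this technique) is to refit the network for a few candidate sizes and compare the out-of-sample squared correlation:

> for (h in 1:3) {
    # note: larger networks may occasionally fail to converge on this small sample
    fit_h <- neuralnet(f, data = scale_bodyfat[train, ],
                       hidden = h, algorithm = "rprop+")
    pred_h <- compute(fit_h, without_fat[-train, ])
    cat("hidden =", h, " squared correlation =",
        round(cor(pred_h$net.result, scale_bodyfat$DEXfat[-train])^2, 3), "\n")
  }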

Figure 33.2: Resilient Backpropagation with Backtracking, observed and predicted values using bodyfat.


Technique 34

Resilient Backpropagation

A neural network with resilient backpropagation without backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation without backtracking.

Step 3: Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 245.

We fit a network with one hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
                   hidden = 1, algorithm = "rprop-")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ], hidden = 1, algorithm = "rprop-")

1 repetition was calculated.

        Error Reached Threshold Steps
1 17.16273738   0.009965580594  3651

A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 2)
                               1
error                      17.16
reached.threshold           0.01
steps                    3651.00
Intercept.to.1layhid1      -0.37
waistcirc.to.1layhid1       0.13
hipcirc.to.1layhid1         0.26
age.to.1layhid1             0.03
elbowbreadth.to.1layhid1   -0.02
kneebreadth.to.1layhid1     0.05
anthro3a.to.1layhid1        0.00
anthro3b.to.1layhid1        0.09
anthro3c.to.1layhid1        0.00
anthro4.to.1layhid1         0.14
Intercept.to.DEXfat        -2.86
1layhid.1.to.DEXfat         6.97

Step 4: Make Predictions
We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg.

The plot method is used to visualize the relationship between the predicted and observed values, see Figure 34.1. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
[1] 0.865

Figure 34.1: Resilient Backpropagation without Backtracking, observed and predicted values using bodyfat.

Technique 35

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "slr", which specifies the use of a globally convergent algorithm with resilient backpropagation and the smallest learning rate.

Step 3: Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 245.

The network is built with 1 hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train, ],
                   hidden = 1, algorithm = "slr")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ], hidden = 1, algorithm = "slr")

1 repetition was calculated.

        Error Reached Threshold Steps
1 17.14864115   0.009544470987 10272

Here is a summary of the fitted model weights:

> round(fit$result.matrix, 2)
                               1
error                      17.15
reached.threshold           0.01
steps                   10272.00
Intercept.to.1layhid1      -0.38
waistcirc.to.1layhid1       0.13
hipcirc.to.1layhid1         0.26
age.to.1layhid1             0.03
elbowbreadth.to.1layhid1   -0.02
kneebreadth.to.1layhid1     0.05
anthro3a.to.1layhid1        0.00
anthro3b.to.1layhid1        0.09
anthro3c.to.1layhid1        0.00
anthro4.to.1layhid1         0.13
Intercept.to.DEXfat        -2.88
1layhid.1.to.DEXfat         7.08

Step 4: Make Predictions
We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train, ])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 35.1. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
[1] 0.865

Figure 35.1: Resilient Backpropagation with smallest learning rate, observed and predicted values using bodyfat.

Technique 36

General Regression Neural Network

A general regression neural network can be estimated using the package grnn with the learn function:

learn(y, x)

Key parameters include the response variable y and the explanatory variables contained in x.

Step 1: Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> require(grnn)
> data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> n <- nrow(bodyfat)

> n_train <- 45
> n_test <- n - n_train

> train <- sample(1:n, n_train, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL

Step 3: Estimate & Evaluate Model
We use the learn method to fit the model and the smooth method to smooth the general regression neural network. We set sigma to 2.5 in the smooth function and store the resultant model in fit.

> fit_basic <- learn(data.frame(y[train], x[train, ]))
> fit <- smooth(fit_basic, sigma = 2.5)

Step 4: Make Predictions
Let's take a look at the first row of the covariates in the testing set:

> round(x[-train, ][1, ], 2)
   age waistcirc hipcirc elbowbreadth
47 4.04      4.61    4.72         1.96
   kneebreadth anthro3a anthro3b anthro3c
47        2.24     1.49      1.6      1.5
   anthro4
47    1.81

To see the first observation of the response variable in the test sample you can use a similar technique:

> y[-train][1]
[1] 3.730021

Now we are ready to predict the first response value in the test set using the covariates. Here is how to do that:

> guess(fit, as.matrix(x[-train, ][1, ]))
[1] 3.374901

Of course, we will want to predict using all the values in the test sample. We can achieve this using the guess method and the following lines of R code:

> pred <- 1:n_test

> for (i in 1:n_test) {
    pred[i] <- guess(fit, as.matrix(x[-train, ][i, ]))
  }

Finally, we plot in Figure 36.1 the observed and fitted values using the plot method, and on the same chart plot the regression line of the predicted versus observed values. The overall squared correlation is 0.915.

> plot(y[-train], pred, xlab = "log(DEXfat)",
       ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg <- lm(pred ~ y[-train]), col = "darkred")

> round(cor(pred, y[-train])^2, 3)
[1] 0.915
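The value sigma = 2.5 was chosen somewhat arbitrarily. One way to check that choice is a simple grid search over candidate values, scored on the test sample. The sketch below is illustrative rather than part of the original example; it reuses fit_basic, x, y, train and n_test from this technique:

> for (s in c(1, 2, 2.5, 3, 4)) {
    fit_s <- smooth(fit_basic, sigma = s)
    pred_s <- sapply(1:n_test,
                     function(i) guess(fit_s, as.matrix(x[-train, ][i, ])))
    cat("sigma =", s, " squared correlation =",
        round(cor(pred_s, y[-train])^2, 3), "\n")
  }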

Figure 36.1: General Regression Neural Network, observed and predicted values using bodyfat.

Technique 37

Monotone Multi-Layer Perceptron

On occasion you will find yourself modeling variables for which you require monotonically increasing behavior of the response variable with respect to specified attributes. It is possible to train and make predictions from a multi-layer perceptron neural network with partial monotonicity constraints. The monmlp package implements the monotone multi-layer perceptron neural network using the approach of Zhang and Zhang.71 The model can be estimated using the monmlp.fit function:

monmlp.fit(x, y, hidden1, hidden2, monotone, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of neurons in the first and second hidden layers (hidden1 and hidden2); and the monotone constraints on the covariates.
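The monotone argument is simply a vector of column indices of x for which the response must be non-decreasing. As a purely hypothetical illustration (synthetic data, not the bodyfat example used below), constraining only the first two covariates might look like this:

> library(monmlp)
> set.seed(1)
> x <- cbind(runif(50), runif(50), runif(50))   # three synthetic covariates
> y <- as.matrix(2 * x[, 1] + x[, 2]^2 + rnorm(50, sd = 0.1))
> # monotone = c(1, 2): constrain the response to be non-decreasing in columns 1 and 2 only
> fit_partial <- monmlp.fit(x = x, y = y, hidden1 = 2, hidden2 = 0,
                            monotone = c(1, 2))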

Step 1: Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(monmlp)
> data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)

> x$DEXfat <- NULL
> x <- as.matrix(x)

Step 3: Estimate & Evaluate Model
We build a network with one hidden layer containing 1 node, with monotone constraints on all 9 attributes contained in x (we set monotone = 1:9).

> fit <- monmlp.fit(x = x[train, ], y = as.matrix(y[train]),
                    hidden1 = 1, hidden2 = 0, monotone = 1:9)

NOTE

The method monmlp.fit takes a number of additional arguments. Three that you will frequently want to set manually include:

• n.trials - the number of repeated trials used to avoid local minima.

• n.ensemble - the number of ensemble members to fit.

• bag - a logical variable indicating whether or not bootstrap aggregation (bagging) should be used.

For example, in the model we are building we could have specified:

fit <- monmlp.fit(x = x[train, ], y = as.matrix(y[train]),
                  hidden1 = 1, hidden2 = 0, monotone = 1:9,
                  n.trials = 100, n.ensemble = 50, bag = TRUE)

We take a look at the fitted values using the plot method, see Figure 37.1:

> plot(attributes(fit)$y, attributes(fit)$y.pred,
       ylab = "Fitted Values", xlab = "Observed Values")

Since the fit looks particularly good, we had better calculate the correlation coefficient. At 0.97 it is pretty high:

> cor(attributes(fit)$y, attributes(fit)$y.pred)
          [,1]
[1,] 0.973589

Figure 37.1: Monotone Multi-Layer Perceptron training, observed and predicted values using bodyfat.

Step 4: Make Predictions
We predict using the scaled variables and the test sample with the monmlp.predict method:

> pred <- monmlp.predict(x = x[-train, ], weights = fit)

Next, a linear regression model is used to fit the predicted and observed values for the test sample:

> linReg <- lm(pred ~ y[-train])

Now plot the result (see Figure 37.2) and calculate the squared correlation coefficient:

> plot(y[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Test Sample Model Fit")
> abline(linReg, col = "darkred")

> round(cor(pred, y[-train])^2, 3)
     [,1]
[1,] 0.958

Figure 37.2: Monotone Multi-Layer Perceptron, observed and predicted values using bodyfat.

Technique 38

Quantile Regression Neural Network

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when lower, upper, or all quantiles are of interest. The qrnn package, combined with the qrnn.fit function, can be used to estimate a quantile regression neural network:

qrnn.fit(x, y, n.hidden, tau, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of hidden neurons n.hidden; and the quantiles to be fitted, held in tau.

Step 1: Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> data(bodyfat, package="TH.data")
> library(qrnn)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)

> x$DEXfat <- NULL
> x <- as.matrix(x)

We will estimate the model using the 5th, 50th and 95th quantiles:

> probs <- c(0.05, 0.50, 0.95)

Step 3: Estimate & Evaluate Model
We estimate the model using the following:

> fit <- pred <- list()

> for(i in seq_along(probs)) {
    fit[[i]] <- qrnn.fit(x = x[train, ],
                         y = as.matrix(y[train]),
                         n.hidden = 4,
                         tau = probs[i],
                         iter.max = 1000,
                         n.trials = 1,
                         n.ensemble = 20)
  }

Let's go through it line by line. The first line sets up, as lists, fit, which will contain the fitted models, and pred, which will contain the predictions. Because we have three quantiles held in probs to estimate, we use a for loop as the core work engine. The function qrnn.fit is used to fit the model for each quantile.

A few other things are worth pointing out here. First, we build the model with four hidden neurons (n.hidden = 4). Second, we set the maximum number of iterations of the optimization algorithm at 1000. Third, we set n.trials = 1; this parameter controls the number of repeated trials used to avoid local minima. We set it to 1 for illustration; in practice you should set this to a much higher value. The same is true for the parameter n.ensemble, which we set to 20. It controls the number of ensemble members to fit. You will generally want to set this to a high value also.

Step 4: Make Predictions
For each quantile the qrnn.predict method is used alongside the fitted model. Again we choose to use a for loop to capture each quantile's predicted values:

> for(i in seq_along(probs)) {
    pred[[i]] <- qrnn.predict(x[-train, ], fit[[i]])
  }

Next we create three variables to hold the predicted values. Note there are 26 test-sample values in total:

> pred_high <- pred_low <- pred_mean <- 1:26

Now we are ready to store the predictions. Since we use 20 ensemble members for each quantile, we store the average prediction using the mean method:

> for (i in 1:26) {
    pred_low[i]  <- mean(pred[[1]][i, ])
    pred_mean[i] <- mean(pred[[2]][i, ])
    pred_high[i] <- mean(pred[[3]][i, ])
  }

Now we fit a linear regression model using the 50th percentile values held in pred_mean and the actual observed values. The result is visualized using the plot method and shown in Figure 38.1. The squared correlation coefficient is then calculated. It is fairly high at 0.885.

> linReg2 <- lm(pred_mean ~ y[-train])

> plot(y[-train], pred_mean, xlab = "log(DEXfat)",
       ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg2, col = "darkred")

> round(cor(pred_mean, y[-train])^2, 3)
[1] 0.885

Figure 38.1: Quantile Regression Neural Network, observed and predicted values (pred_mean) using bodyfat.

It is often useful to visualize the results of a model. In Figure 38.2 we use scatterplot3d to plot the relationship between pred_low, pred_high and the observed values in the test sample:

> scatterplot3d(pred_low, y[-train], pred_high,
                xlab = "quantile = 0.05",
                ylab = "Test Sample",
                zlab = "quantile = 0.95")
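Because the 5th and 95th percentile predictions bracket a nominal 90% prediction interval, a quick sanity check (a sketch using the pred_low, pred_high and y objects from this technique) is to measure how often the observed test-sample values actually fall inside that interval:

> inside <- (y[-train] >= pred_low) & (y[-train] <= pred_high)
> round(mean(inside), 3)   # empirical coverage of the nominal 90% interval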

Figure 38.2: Quantile Regression Neural Network, scatterplot3d of the 5th and 95th predicted quantiles.

NOTES

62. Tayfur, Gokmen. "Artificial neural networks for sheet sediment transport." Hydrological Sciences Journal 47.6 (2002): 879-892.

63. Kilinc, Mustafa. "Mechanics of soil erosion from overland flow generated by simulated rainfall." Colorado State University Hydrology Papers (1973).

64. Mantri, Jibendu Kumar, P. Gahan and Braja B. Nayak. "Artificial neural networks - An application to stock market volatility." Soft-Computing in Capital Market: Research and Methods of Computational Finance for Measuring Risk of Financial Instruments (2014): 179.

65. For further details see H.R. Champion et al. "Improved Predictions from a Severity Characterization of Trauma (ASCOT) over Trauma and Injury Severity Score (TRISS): Results of an Independent Evaluation." Journal of Trauma: Injury, Infection and Critical Care 40 (1), 1996.

66. Hunter, Andrew, et al. "Application of neural networks and sensitivity analysis to improved prediction of trauma survival." Computer Methods and Programs in Biomedicine 62.1 (2000): 11-19.

67. Lek, Sovan, et al. "Application of neural networks to modelling nonlinear relationships in ecology." Ecological Modelling 90.1 (1996): 39-52.

68. See, for example, Jun, J.J., Longtin, A. and Maler, L. (2011). "Precision measurement of electric organ discharge timing from freely moving weakly electric fish." J Neurophysiol 107, pp. 1996-2007.

69. Kiar, Greg, et al. "Electrical localization of weakly electric fish using neural networks." Journal of Physics: Conference Series. Vol. 434, No. 1. IOP Publishing, 2013.

70. Wu, Naicheng, et al. "Modeling daily chlorophyll a dynamics in a German lowland river using artificial neural networks and multiple linear regression approaches." Limnology 15.1 (2014): 47-56.

71. See Zhang, H. and Zhang, Z. (1999). "Feedforward networks with monotone constraints." In: International Joint Conference on Neural Networks, vol. 3, pp. 1820-1823. doi: 10.1109/IJCNN.1999.832655

Part V

Random Forests


The Basic Idea

Suppose you have a sample of size N with M features. Random forests (RF) build multiple decision trees, with each grown from a different set of training data. For each tree, K training samples are randomly selected with replacement from the original training data set.

In addition to constructing each tree using a different random sample (bootstrap) of the data, the RF algorithm differs from a traditional decision tree because at each decision node the best splitting feature is determined from a randomly selected subspace of m features, where m is much smaller than the total number of features M. Traditional decision trees split each node using the best split among all features.

Each tree in the forest is grown to the largest extent possible without pruning. To classify a new object, each tree in the forest gives a classification, which is interpreted as the tree 'voting' for that class. The final prediction is determined by majority vote among the classes decided by the forest of trees.
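The following minimal, hand-rolled sketch illustrates the idea (it is not the randomForest implementation used later in this part). It uses rpart and the iris data purely as an example, and it simplifies one detail: a proper random forest draws a fresh feature subset at every node, whereas this sketch draws one subset per tree.

library(rpart)
data(iris)
set.seed(1)

K <- 25                          # number of trees
M <- ncol(iris) - 1              # total number of features
m <- floor(sqrt(M))              # size of the random feature subspace
trees <- vector("list", K)

for (k in 1:K) {
  boot  <- sample(1:nrow(iris), replace = TRUE)   # bootstrap sample
  feats <- sample(names(iris)[1:M], m)            # random feature subset (per tree here)
  f     <- as.formula(paste("Species ~", paste(feats, collapse = "+")))
  trees[[k]] <- rpart(f, data = iris[boot, ], method = "class")
}

# each tree 'votes'; the majority vote is the forest's prediction
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)       # proportion of correct majority-vote predictions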

NOTE

Although each tree in a RF tends to be less accurate than a classical decision tree, by combining multiple predictions into one aggregate prediction a more accurate forecast is often obtained. Part of the reason is that the prediction of a single decision tree tends to be highly sensitive to noise in its training set. This is not the case for the average of many trees, provided they are uncorrelated. For this reason the RF often decreases overall variance without increasing bias relative to a single decision tree.

Random Ferns

Random ferns are an ensemble of constrained trees originally used in image processing for identifying textured patches surrounding key-points of interest across images taken from different viewpoints. This machine learning method gained widespread attention in the image processing community because it is extremely fast, easy to implement, and appeared to outperform random forests in image processing classification tasks provided the training set was sufficiently large.72

While a tree applies a different decision function at each node, a fern systematically applies the same decision function at every node of the same level. Each fern consists of a small set of binary tests and returns the probability that an object belongs to any one of the classes that have been learned during training. A naive Bayesian approach is used to combine the collection of probabilities. Ferns differ from decision trees in three ways:

• In decision trees the binary tests are organized hierarchically (hence the term tree); ferns are flat. Each fern consists of a small set of binary tests and returns the probability that an observation belongs to any one of the classes that have been learned during training.

• Ferns are more compact than trees. In fact, 2^N - 1 binary tests are needed to grow a tree with 2^N leaves; only N binary tests are needed to grow a fern with 2^N leaves.

• For decision trees the posterior distributions are computed additively; in ferns they are multiplicative.

Practical Applications

Tropical Forest Carbon Mapping

Accurate and spatially explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus73). Parametric statistics have traditionally dominated the discipline.74 Remote sensing technologies such as Light Detection and Ranging (LiDAR) have been used successfully to estimate spatial variation in carbon stocks.75 However, due to cost and technological limitations, it is not yet feasible to use LiDAR to map the entire world's tropical forests.

Mascaro et al.76 evaluated the performance of the Random Forest algorithm in up-scaling airborne LiDAR-based carbon estimates, compared to the traditionally accepted stratification-based sampling approach, over a 1.6-million hectare focal area of the Western Amazon.

Two separate Random Forest models were investigated, each using an identical set of 80,000 randomly selected input pixels. The first Random Forest model used the same set of input variables as the stratification approach. The second Random Forest used additional "position" parameters: x and y coordinates combined with two diagonal coordinates, see Table 19.

The resultant maps of ecosystem carbon stock are shown in Figure 38.3. Notice that several regions exhibit pronounced differences when using the Random Forest with the additional four 'position' parameters (image B).

The stratification method had a root mean square error (RMSE) of 33.2 Mg C ha-1 and an adjusted R2 of 0.37 (predicted versus observed). The first Random Forest model had an RMSE of 31.6 Mg C ha-1 and an adjusted R2 of 0.43. The Random Forest model with the additional four 'position' parameters had an RMSE of 26.7 Mg C ha-1 and an adjusted R2 of 0.59. The researchers conclude there is an improvement when using Random Forest with position information, which is consistent at all distances (see Figure 38.4).

Variable    Explanation                                          S   RF1  RF2
easting     UTM X coordinate                                     -   -    x
northing    UTM Y coordinate                                     -   -    x
diag.x      X coordinate after 45 degree clockwise               -   -    x
            image rotation
diag.y      Y coordinate after 45 degree clockwise               -   -    x
            image rotation
frac_soil   Percent cover of soil, as determined by Landsat      -   -    x
            image processing with CLASlite
frac_pv     Percent cover of photosynthetic vegetation, as       x   x    x
            determined by Landsat image processing with CLASlite
frac_npv    Percent cover of non-photosynthetic vegetation,      x   x    x
            as determined by Landsat image processing with
            CLASlite
elevation   SRTM elevation above sea level                       x   x    x
slope       SRTM slope                                           x   x    x
aspect      SRTM aspect                                          x   x    x
geoeco      Habitat class, as determined by synthetic            x   x    x
            integration of national geological map,
            NatureServe and other sources

Table 19: Input variables used by Mascaro et al. S = Stratification-based sampling, RF = Random Forest.

Figure 38.3: Mascaro et al.'s predicted carbon stocks using three different methodologies: (a) stratification and mapping of median carbon stocks in each class; (b) Random Forest without the inclusion of position information; (c) Random Forest using additional model inputs for position. Source of figure: Mascaro et al.

Figure 38.4: Mascaro et al.'s performance of three modeling techniques as assessed in 36 validation cells. Left panels highlight model performance against LiDAR-observed aboveground carbon density from CAO aircraft data (Mg C ha-1), while right panels highlight model performance by increasing distance from CAO aircraft data. The color scale reflects the two-dimensional density of observations, adjusted to one dimension using a square root transformation. Source of figure: Mascaro et al.

Opioid Dependency & Sleep Disordered Breathing

Patients using chronic opioids are at elevated risk for potentially lethal disorders of breathing during sleep. Farney et al.77 investigate sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone.

A total of 70 patients admitted for therapy with buprenorphine/naloxone were recruited into the study. Indices were created for apnoea/hypopnoea (AHI), obstructive apnoea (OAI), central apnoea (CAI) and hypopnoea (HI).

Each index was computed as the total of defined respiratory events divided by the total sleep time in hours, scored simultaneously by two onlooking researchers. Oximetry data were analyzed to calculate mean SpO2, lowest SpO2, and time spent below 90% SpO2 during sleep.

For each of these response metrics a Random Forest model was built using the following attributes: buprenorphine dose, snoring, tiredness, witnessed apnoeas, hypertension, body mass index, age, neck circumference, gender, use of benzodiazepines, antidepressants, antipsychotics, and smoking history. The researchers find that the random forests showed little relationship between the explanatory variables and the response variables.

Waste-water Deterioration Modeling

Vitorino et al.78 build a Random Forest model to predict the condition of individual sewers and to determine the expected length of sewers in poor condition. The data consisted of two sources: a sewer data table and an inspection data table.

The sewer table contained information on the sewer identification code, zone, construction material, diameter, installation date, and a selection of user-defined covariates. The inspection data table contained information on the sewer identification code, date of last inspection, and condition of the sewer at the date of last inspection.

The Random Forest was trained using all available data and limited to 50 trees. It was then used to predict the condition of individual sewer pipes. The researchers predict the distribution of sewer pipe in poor condition by type of material used for construction. The top three highest ranked materials were CIP (31.83), followed by unknown material (27.94) and RPM (23.89).

Optical Coherence Tomography & Glaucoma

Glaucoma is the second most common cause of blindness. As glaucomatous visual field (VF) damage is irreversible, the early diagnosis of glaucoma is essential. Sugimoto et al.79 develop a random forest classifier to predict the presence of VF deterioration in glaucoma suspects using optical coherence tomography (OCT) data.

The study investigated 293 eyes of 179 live patients referred to the University of Tokyo Hospital for glaucoma between August 2010 and July 2012. The Random Forest algorithm with 10,000 trees was used to classify the presence or absence of glaucomatous VF damage using 237 different OCT measurements. Age, gender, axial length and eight other right/left eye metrics were also included as training attributes.

The researchers report a receiver operating characteristic curve value of 0.9 for the random forest. This compared well to the value of 0.75 for an individual tree and between 0.77 and 0.86 for individual right/left eye metrics.

Obesity Risk Factors

Kanerva et al.80 investigate 4,720 Finnish subjects who completed health questionnaires about leisure time physical activity, smoking status and educational attainment. Weight and height were measured by experienced nurses. A random forest algorithm, using the randomForest package in R and the collected indicator variables, was used to predict overweight or obesity. The results were compared with a logistic regression model.

The researchers observe that the random forest and logistic regression had very similar classification power; for example, the estimated error rates for the models were essentially equal - 42% for men (RF) versus 43% for men (logistic regression). The researchers tentatively conclude: "Machine learning techniques may provide a solution to investigate the network between exposures and eventually develop decision rules for obesity risk estimation in clinical work."

Protein Interactions

Chen et al.81 develop a novel random forest model to predict protein-protein interactions. They choose a traditional classification framework with two classes: either a protein pair interacts with each other or they do not. Each protein pair is characterized by a very large attribute space, in total 4,293 unique attributes.

The training and test sets each contained 8,917 samples: 4,917 positive and 4,000 negative. Five-fold cross validation is used and the maximum size of a tree is 450 levels. The researchers report an overall sensitivity of 79.78% and specificity of 64.38%; this compares well to an alternative maximum likelihood technique, which had a sensitivity of 74.03% and specificity of 37.53%.

Technique 39

Classification Random Forest

A classification random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance=TRUE, proximity=TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which, if set to TRUE, assesses the importance of predictors; proximity, another logical variable which, if set to TRUE, calculates a proximity measure among the rows; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages
We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package.

> library(randomForest)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters
The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1000. A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)

> N = nrow(Vehicle)
> numtrees = 1000
> train <- sample(1:N, 500, FALSE)

The data frame attributes is used to hold the complete set of attributes contained in Vehicle:

> attributes <- Vehicle[train, ]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest
We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set both importance and proximity equal to TRUE.

> fit <- randomForest(Class ~ ., data = Vehicle[train, ],
                      ntree = numtrees, importance = TRUE,
                      proximity = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[train, ],
              ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 27.2%
Confusion matrix:
     bus opel saab van class.error
bus  121    1    1   1  0.02419355
opel   1   60   60   6  0.52755906
saab   5   49   66   9  0.48837209
van    1    0    2 117  0.02500000

Important information includes the formula used to fit the model, the number of trees grown at each iteration, the number of variables tried at each split, the confusion matrix, class errors, and the out of bag error estimate.

The random forest appears to identify both van and bus with a low error (less than 3%), but saab (49%) appears to be essentially a random guess, whilst opel (53%) is only marginally better than random.

Step 4: Assess Model
A visualization of the error rate by iteration for each of bus, opel, saab and van, together with the overall out of bag error, is obtained using plot, see Figure 39.1:

> plot(fit)

Figure 39.1: Error estimate across iterations for each class and overall, using Vehicle.

The misclassification errors on van and bus, as well as the out of bag error, settle down to a stable range within 200 trees. For opel stability occurs around 400 trees, whilst for saab around 800 trees.

The rfcv function is used to perform a 10-fold cross validation (cv.fold=10) with variable importance re-assessed at each step (recursive=TRUE). The result is stored in cv:

> cv <- rfcv(trainx = attributes, trainy = Vehicle$Class[train],
             ntree = numtrees, cv.fold = 10, recursive = TRUE)

PRACTITIONER TIP

The function rfcv calculates cross-validated prediction performance with a sequentially reduced number of attributes, ranked by variable importance, using a nested cross-validation procedure. The minimum value of the cross validation error can be used to select the optimum number of attributes (also known as predictors) in a random forest model.
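Following the tip above, the optimal number of predictors can be read directly off the cross-validation output (a one-line sketch using the cv object just created):

> with(cv, n.var[which.min(error.cv)])   # number of predictors with the lowest cross-validated error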

A plot of cross validated error versus number of predictors can be created as follows:

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
                xlab = "number of predictors",
                ylab = "cross-validated error (%)"))

Figure 39.2 shows the resultant plot. It indicates a steady decline in the cross validated error for the first nine attributes, after which it levels off.

Figure 39.2: Cross-validated prediction performance by number of attributes for Vehicle.

Variable importance, shown in Figure 39.3, is obtained using the function varImpPlot:

> varImpPlot(fit)

Figure 39.3: Random forest variable importance plot for Vehicle.

Max.L.Ra and D.Circ seem to be important whether measured by average decrease in accuracy or by reduction in node impurity (Gini).

Partial dependence plots provide a way to visualize the marginal relationship between the response variable and the covariates (attributes). Let's use this idea, combined with the importance function, to investigate the top four attributes:

> imp <- importance(fit)

> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) {
    partialPlot(fit, Vehicle[train, ], impvar[i], xlab = impvar[i],
                main = paste("Partial Dependence on", impvar[i]))
  }

> par(op)

Figure 39.4 shows the partial dependence plots for the four most important attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra.

PRACTITIONER TIP

The importance function takes the type argument. Set type = 1 if you only want to see importance by average decrease in accuracy, and set type = 2 to see only importance by average decrease in node impurity, as measured by the Gini coefficient.

Figure 39.4: Random forest partial dependence plots for the four most important attributes in Vehicle.

It is often helpful to look at the margin. This is measured as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. A positive margin implies correct classification, and a negative margin incorrect classification.

> plot(margin(fit))

The margin for fit is shown in Figure 39.5.

Figure 39.5: Random forest margin plot for Vehicle.

Step 5: Make Predictions
Next we re-fit the model using the test sample:

> fit <- randomForest(Class ~ ., data = Vehicle[-train, ],
                      ntree = numtrees, importance = TRUE,
                      proximity = TRUE)

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
              ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 26.59%
Confusion matrix:
     bus opel saab van class.error
bus   92    0    1   1  0.02127660
opel   4   35   37   9  0.58823529
saab   5   25   50   8  0.43181818
van    1    1    0  77  0.02531646

We notice a deterioration in the classification error for opel over the test sample, although saab has improved somewhat. The misclassification rates for van and bus are in line with those observed in the training sample. The overall out of bag error estimate is around 27%.

An important aspect of building random forest models is tuning the model. We can attempt to do this by finding the optimal value of mtry and then fitting that model on the test data. The parameter mtry controls the number of variables randomly sampled as candidates at each split. The tuneRF function returns the optimal value of mtry. The smallest out of bag error occurs at mtry = 3; this is confirmed by Figure 39.6.

> best_mytry <- tuneRF(attributes, Vehicle$Class[train],
                       ntreeTry = numtrees, stepFactor = 1.5,
                       improve = 0.01, trace = TRUE, plot = TRUE,
                       doBest = FALSE)

mtry = 4  OOB error = 27%
Searching left ...
mtry = 3  OOB error = 26.6%
0.01481481 0.01
mtry = 2  OOB error = 26.8%
-0.007518797 0.01
Searching right ...
mtry = 6  OOB error = 26.6%

Figure 39.6: Random forest OOB error by mtry for Vehicle.

Finally, we fit a random forest model on the test data using mtry = 3:

> fitbest <- randomForest(Class ~ ., data = Vehicle[-train, ],
                          ntree = numtrees, importance = TRUE,
                          proximity = TRUE, mtry = 3)

> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
              ntree = numtrees, importance = TRUE,
              proximity = TRUE, mtry = 3)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of error rate: 25.14%
Confusion matrix:
     bus opel saab van class.error
bus   93    0    0   1  0.01063830
opel   3   39   36   7  0.54117647
saab   4   26   50   8  0.43181818
van    1    1    0  77  0.02531646

The model fitbest has a slightly lower out of bag error at 25.14%. However, the error rate for opel remains stubbornly high.


Technique 40

Conditional Inference Classification Random Forest

A conditional inference classification random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which sets parameters such as the number of trees grown in each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages
We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package. We also load the caret package, as it provides handy performance metrics for random forests.

> library(party)
> library(caret)
> library(mlbench)
> data(Vehicle)

PRACTITIONER TIP

If you have had your R session open for a lengthy period of time, you may have objects hogging memory. To see what objects are being held in memory type:

ls()

To remove all objects from memory use:

rm(list=ls())

To remove a specific object from memory use:

rm(object_name)

Step 2: Prepare Data & Tweak Parameters
A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Random Forest
We estimate the random forest using the training sample and the function cforest. The number of trees is set to 1000 (ntree = 1000) with mtry set to 5.

> fit <- cforest(Class ~ ., data = Vehicle[train, ],
                 controls = cforest_unbiased(ntree = 1000, mtry = 5))

The accuracy of the model and the kappa statistic can be retrieved using the caret package:

> caret:::cforestStats(fit)
Accuracy    Kappa
0.86600  0.82146

Let's do a little memory management and remove fit:

> rm(fit)

Let's re-estimate the random forest using the default settings in cforest:

> fit <- cforest(Class ~ ., data = Vehicle[train, ])
> caret:::cforestStats(fit)
 Accuracy     Kappa
0.8700000 0.8267831

Since both the accuracy and kappa are slightly higher, we will use this as the fitted model.

PRACTITIONER TIP

Remember that all simulation based models, including the random forest, are subject to random variation. This variation can be important when investigating variable importance. It is advisable, before interpreting a specific importance ranking, to check whether the same ranking is achieved with a different model run (random seed) or a different number of trees.

Variable importance is calculated using the function varimp. First we estimate variable importance using the default approach, which calculates the mean decrease in accuracy using the method outlined in Hapfelmeier et al.82

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:18])]

> round(imp1[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.094    0.069    0.057

> rm(ord1, imp1)

Next we try the unconditional approach using the 'mean decrease in accuracy' outlined by Breiman.83 Interestingly, the attributes Elong, Max.L.Ra and Scat.Ra are rank ordered exactly the same by both methods, and they have very similar scores:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:18])]

> round(imp2[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.100    0.083    0.057

> rm(ord2, imp2)

PRACTITIONER TIP

In addition to the two unconditional measures of variable importance already discussed, a conditional version is also available. It is conditional in the sense that it adjusts for correlations between predictor variables. For the random forest estimated by fit, conditional variable importance can be obtained by:

> varimp(fit, conditional = TRUE)

Step 5: Make Predictions
First we estimate accuracy and kappa using the test sample:

> fit_test <- cforest(Class ~ ., data = Vehicle[-train, ])
> caret:::cforestStats(fit_test)
 Accuracy     Kappa
0.8179191 0.7567866

Both accuracy and kappa values are close to the values estimated on the training sample. This is encouraging in terms of model fit.

The predict function calculates predicted values over the testing sample. These values, along with the original observations, are used to construct the test sample confusion matrix and the overall misclassification error. We observe an error rate of 28.6%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   91    0    1   2
          opel   8   30   33  14
          saab   6   23   51   8
          van    2    0    2  75

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.286


Technique 41

Classification Random Ferns

A classification random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, ...)

Key parameters include ferns, which controls the number of ferns grown at each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the ensemble.

Step 1: Load Required Packages
We construct a classification random ferns model using the Vehicle data frame (see page 23) contained in the mlbench package.

> library(rFerns)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters
A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Model

PRACTITIONER TIP

To print the out of bag error as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns the error every 50th iteration. Your estimated model would look something like this:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
                ferns = 300, reportErrorEvery = 50)

 Done fern 50/300; current OOB error 0.39200
 Done fern 100/300; current OOB error 0.37800
 Done fern 150/300; current OOB error 0.35000
 Done fern 200/300; current OOB error 0.36000
 Done fern 250/300; current OOB error 0.35400
 Done fern 300/300; current OOB error 0.34800

The number of ferns is set to 5000, with the parameter saveErrorPropagation=TRUE to save the out of bag errors at each iteration:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
                ferns = 5000, saveErrorPropagation = TRUE)

The details of the fitted model can be viewed using the print method. The method returns the number of ferns in the ensemble, the fern depth, the out of bag error estimate, and the confusion matrix.

> print(fit)

 Forest of 5000 ferns of a depth 5.

 OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   109    2    7   0
     opel    7   65   44   0
     saab    0   32   40   0
     van     8   28   38 120

It is often useful to view the out of bag error by number of ferns; this can be achieved using the plot method and modelname$oobErr:

> plot(fit$oobErr, xlab = "Number of Ferns",
       ylab = "OOB Error", type = "l")

Figure 41.1 shows the resultant chart. Looking closely at the figure, it appears the error is minimized at approximately 30%, somewhere between 2000 and 3000 ferns. To get the exact number use the which.min method:

> which.min(fit$oobErr)
[1] 2487
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.324

So the error reaches a minimum of 32.4% at 2487 ferns.

Figure 41.1: Classification Random Ferns, out of bag error estimate by iteration for Vehicle.

PRACTITIONER TIP

To determine the time taken to run the model you would use $timeTaken as follows:

> fit$timeTaken
Time difference of 0.314466 secs

Other useful components of the rFerns object are given in Table 20.

Parameter     Description
$model        The estimated model
$oobErr       Out of bag error
$importance   Importance scores
$oobScores    Class scores for each object
$oobPreds     Class predictions for each object
$timeTaken    Time used to train the model

Table 20: Some key components of a rFerns object.

Let's re-estimate the model with the number of ferns equal to 2487:

> fit <- rFerns(Class ~ ., data = Vehicle[train, ],
                ferns = 2487, importance = TRUE, saveForest = TRUE)

> print(fit)

 Forest of 2487 ferns of a depth 5.

 OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   107    1    6   0
     opel    8   65   43   0
     saab    0   33   42   0
     van     9   28   38 120

Alas, we ended up with the same out of bag error as our original model. Nevertheless, we will keep this version of the model.

The influence of variables can be assessed using $importance. The six most important variables and their standard deviations are as follows:

> imp <- fit$importance[order(-fit$importance[, 1]), ]

> round(head(imp), 3)
             MeanScoreLoss SdScoreLoss
Max.L.Ra             0.353       0.008
Elong                0.219       0.006
Sc.Var.Maxis         0.205       0.005
Sc.Var.maxis         0.201       0.005
Pr.Axis.Rect         0.196       0.005
D.Circ               0.196       0.004

Step 5: Make Predictions
Predictions using the test sample can be obtained using the predict method. The results are stored in predClass and combined using the table method to create the confusion matrix:

> predClass <- predict(fit, Vehicle[-train, ])

> table(predClass, Vehicle$Class[-train],
        dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class bus opel saab van
           bus   85    5    4   0
           opel   4   38   36   0
           saab   0   13   28   0
           van    5   29   20  79

Finally, we calculate the misclassification error rate. At 33.5% it is fairly close to the out of bag error estimate obtained on the training sample.

> error_rate = (1 - sum(Vehicle$Class[-train] == predClass) / 346)
> round(error_rate, 3)
[1] 0.335

Technique 42

Binary Response RandomForest

On many occasions the response variable is binary In this chapter weshow how to build a random forest model in this situation A binary re-sponse random forest can be built using the package randomForest with therandomForest function

randomForest(z ~ data ntree importance=TRUE proximity=TRUE )

Key parameters include ntree which controls the number of trees to growin each iteration importance a logical variable if set to TRUE assesses theimportance of predictors proximity another logical variable which if set toTRUE calculates a proximity measure among the rows z the binary responsevariable data the data set of attributes with which you wish to build theforest

Step 1: Load Required Packages

We build the binary response random forest using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded; we will use it during the final stage of our analysis:

> library(randomForest)
> library(mlbench)
> data(Sonar)
> require(ROCR)


Step 2: Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1,000. A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Sonar)
> numtrees = 1000
> train <- sample(1:N, 157, FALSE)

The data frame attributes holds the complete set of attributes contained in Sonar:

> attributes <- Sonar[train, ]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set importance equal to TRUE:

> fit <- randomForest(Class ~ ., data = Sonar[train, ],
    ntree = numtrees, importance = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Sonar[train, ],
     ntree = numtrees, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 7

        OOB estimate of error rate: 18.47%
Confusion matrix:
   M  R class.error
M 70 12   0.1463415
R 17 58   0.2266667


The classification error on M is less than 20%, whilst for R it is around 23%. The out of bag estimate of error is also less than 20%.

Step 4: Assess Model

A visualization of the error rate by iteration for each of M and R, together with the overall out of bag error, is obtained using plot, see Figure 42.1:

> plot(fit)

Figure 42.1: Random Forest error estimate across iterations for M, R and overall using Sonar

The misclassification errors settle down to a stable range by 800 trees.
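If you prefer numbers to the chart, the error series plotted above is also stored on the fitted object. This is a small aside, not in the book, assuming the fit object estimated in Step 3; randomForest keeps the per-tree OOB and class error rates in fit$err.rate:

> # last few rows of the OOB, M and R error rate series (one row per tree)
> tail(round(fit$err.rate, 3))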


Next the variable importance scores are calculated and the partial dependence plot, see Figure 42.2, is drawn:

> imp <- importance(fit)
> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
> op <- par(mfrow = c(2, 2))
> for (i in seq_along(1:4)) {
    partialPlot(fit, Sonar[train, ], impvar[i], xlab = impvar[i],
        main = paste("Partial Dependence on", impvar[i]))
  }
> par(op)


Figure 42.2: Binary random forest partial dependence plot using Sonar

Figure 42.2 seems to indicate that V11, V12, V9 and V52 have similar shapes in terms of partial dependence. Looking closely at the diagram you will notice that all except V52 appear to be responding on a similar scale. Given the similar pattern we calculate a number of combinations of these three variables and add them to our group of attributes:

> temp <- cbind(Sonar$V11, Sonar$V12, Sonar$V9)
> Vmax  <- apply(temp, 1, function(x) max(x))
> Vmin  <- apply(temp, 1, function(x) min(x))
> Vmean <- apply(temp, 1, function(x) mean(x))
> Vmed  <- apply(temp, 1, function(x) median(x))


> attributes <- cbind(attributes, Vmax[train], Vmin[train],
    Vmean[train], Vmed[train])
> Sonar <- cbind(Sonar, Vmax, Vmin, Vmean, Vmed)

Now our list of attributes in Sonar also includes the maximum, minimum, mean and median of V11, V12 and V9. Let's refit the model and calculate variable importance of all attributes:

> fit <- randomForest(Class ~ ., data = Sonar[train, ],
    ntree = numtrees, importance = TRUE)
> varImpPlot(fit)

Figure 42.3 shows variable importance using two measures - the mean decrease in accuracy and the mean decrease in the Gini score. Notice the consistency in the identification of the top eight variables between these two methods. It seems the most influential variables are Vmax, Vmin, Vmean, Vmed, V11, V12, V9 and V52.


Figure 42.3: Binary Random Forest variable importance plots for Sonar

A ten-fold cross validation is performed using the function rfcv, followed with a plot, shown in Figure 42.4, of the cross validated error by number of predictors (attributes):

> cv <- rfcv(trainx = attributes, trainy = Sonar$Class[train],
    ntree = numtrees, cv.fold = 10, recursive = TRUE)
> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
    xlab = "number of predictors", ylab = "cross-validated error (%)"))


The error falls sharply from 2 to around 9 predictors and continues to decline, leveling out by 50 predictors.
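The cross validated error values plotted in Figure 42.4 can also be inspected directly. This is a small aside, not in the book, assuming the cv object returned by rfcv above, which stores the error indexed by the number of predictors in cv$error.cv:

> round(cv$error.cv, 3)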

Figure 42.4: Binary random forest cross validation error by number of predictors using Sonar

Step 5: Make Predictions

It is always worth experimenting with the tuning parameter mtry. It controls the number of variables randomly sampled as candidates at each split. The function tuneRF can be used to visualize the out of bag error associated with different values of mtry.


> best_mtry <- tuneRF(attributes, Sonar$Class[train], ntreeTry = numtrees,
    stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE, doBest = FALSE)

Figure 42.5 illustrates the output of tuneRF. The optimal value occurs at mtry = 8 with an out of bag error of around 19.75%.

We use this value of mtry and refit the model using the test sample.

Figure 42.5: Using tuneRF to obtain the optimal mtry value

> fitbest <- randomForest(Class ~ ., data = Sonar[-train, ],
    ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)


> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Sonar[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 8

        OOB estimate of error rate: 23.53%
Confusion matrix:
   M  R class.error
M 26  3   0.1034483
R  9 13   0.4090909

The misclassification error for M and R is 10% and 41% respectively, with an out of bag error estimate of around 24%. Finally, we calculate and plot the Receiver Operating Characteristic for the test set, see Figure 42.6:

> fitpreds <- predict(fitbest, data = Sonar$Class[-train, -61], type = 'prob')
> preds <- fitpreds[, 2]
> plot(performance(prediction(preds, Sonar$Class[-train]), 'tpr', 'fpr'))
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")


Figure 42.6: Binary Random Forest ROC for test set using Sonar


Technique 43

Binary Response Random Ferns

A binary response random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, saveErrorPropagation = TRUE)

Key parameters include z, the binary response variable; data, the data set of attributes with which you wish to build the ferns; ferns, which controls the number of ferns to grow in each iteration; and saveErrorPropagation, a logical variable which, if set to TRUE, calculates and saves the out of bag error approximation.

Step 1: Load Required Packages

We build the binary response random ferns using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded. We will use this package during the final stage of our analysis to build a ROC curve:

> library(rFerns)
> library(mlbench)
> data(Sonar)
> require(ROCR)

Step 2: Prepare Data & Tweak Parameters

A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample.


> set.seed(107)
> N = nrow(Sonar)
> train <- sample(1:N, 157, FALSE)

Step 3: Estimate and Assess the Random Ferns

We use the function rFerns with 1,000 ferns. The parameter saveErrorPropagation = TRUE so that we can save the out of bag error estimates:

> fit <- rFerns(Class ~ ., data = Sonar[train, ],
    ferns = 1000, saveErrorPropagation = TRUE)

The print function returns details of the fitted model:

> print(fit)

Forest of 1000 ferns of a depth 5

OOB error 20.38%; OOB confusion matrix:
          True
Predicted  M  R
        M 67 17
        R 15 58

The forest consists of 1,000 ferns of depth 5, with an out of bag error rate of around 20%. Let's plot the error rate:

> plot(fit$oobErr, xlab = "Number of Ferns", ylab = "OOB Error", type = "l")

From Figure 43.1 it looks as if the error is minimized around 335 ferns, with an approximate out of bag error rate of around 20%. To get the exact value use:

> which.min(fit$oobErr)
[1] 335
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.1910828

It seems the error is minimized at 335 ferns with an out of bag error of 19.1%.


Figure 43.1: Binary Random Ferns out of bag error by number of ferns for Sonar

Now, setting ferns = 335, we re-estimate the model:

> fit <- rFerns(Class ~ ., data = Sonar[train, ],
    ferns = 335, importance = TRUE, saveForest = TRUE)

> print(fit)

Forest of 335 ferns of a depth 5

OOB error 18.47%; OOB confusion matrix:
          True
Predicted  M  R
        M 70 17
        R 12 58

PRACTITIONER TIP

To see a print out of the error estimate as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns an out of bag error estimate at every 50th iteration. Try this:

test <- rFerns(Class ~ ., data = Sonar[train, ],
    ferns = 100, reportErrorEvery = 20)

Now we turn our attention to investigating variable importance. The values can be extracted from the fitted model using fit$importance. Since larger values indicate greater importance, we use the order function to sort from largest to smallest, but only show the top six:

> imp <- fit$importance[order(-fit$importance[, 1]), ]
> round(head(imp), 3)
    MeanScoreLoss SdScoreLoss
V11         0.125       0.028
V12         0.109       0.021
V10         0.093       0.016
V49         0.068       0.015
V47         0.060       0.017
V9          0.060       0.017

The output reports the top six attributes by order of importance. It also provides a measure of their variability (SdScoreLoss). It turns out that V11 has the highest importance score, followed by V12 and V10. Notice that V49, V47 and V9 are all clustered around 0.06.

Step 4: Make Predictions

We use the predict function to predict the classes using the test sample, create and print the confusion matrix, and calculate the misclassification error. The misclassification error is close to 18%:

> predClass <- predict(fit, Sonar[-train, ])
> table(predClass, Sonar$Class[-train],
    dnn = c("Predicted Class", "Observed Class"))

                Observed Class
Predicted Class  M  R
              M 24  4
              R  5 18

> error_rate = (1 - sum(Sonar$Class[-train] == predClass) / 51)
> round(error_rate, 3)
[1] 0.176

It can be of value to calculate the Receiver Operating Characteristic (ROC). This can be constructed using the ROCR package. First we convert the raw scores estimated using the test sample from the model fit into probabilities, or at least values that lie in the zero to one range:

> predScores <- predict(fit, Sonar[-train, ], scores = TRUE)
> predScores <- predScores + abs(min(predScores))

Next we need to generate the prediction object. As inputs we use the predicted class probabilities in predScores and the actual test sample classes (Sonar$Class[-train]). The predictions are stored in pred and passed to the performance method, which calculates the true positive rate and false positive rate. The result is visualized, as shown in Figure 43.3, using plot:

> pred <- prediction(predScores[, 2], Sonar$Class[-train])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")

To interpret the ROC curve, let us remember that perfect classification implies a 100% true positive rate (and therefore a 0% false positive rate). This classification only happens at the upper left-hand corner of the graph and results in a plot as shown in Figure 43.2.

Now, looking at Figure 43.3, we see that the closer the curve gets to the upper left corner, the higher the classification rate. The diagonal line shown in Figure 43.3 represents random chance, so the distance of our graph above the diagonal line represents how much better it is than a random guess. The area under the curve, which takes the value 1 for a 100% true positive rate, can be calculated using a few lines of code:

> auc <- performance(pred, "auc")
> auc_area <- slot(auc, "y.values")[[1]]
> round(auc_area, 3)
[1] 0.929

Figure 43.2: 100% true positive rate ROC curve


Figure 43.3: Binary Random Ferns Receiver Operating Characteristic for Sonar


Technique 44

Survival Random Forest

A survival random forest can be built using the package randomForestSRC with the rfsrc function:

rfsrc(z ~ ., data, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package:

> library(randomForestSRC)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

A total of 600 of the 641 observations in rhDNase are used to create a randomly selected training sample. The rows in rhDNase have "funky" numbering. We use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient):

> set.seed(107)
> N = nrow(rhDNase)


> rownames(rhDNase) <- 1:nrow(rhDNase)
> train <- sample(1:N, 600, FALSE)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2

Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function rfsrc. The number of trees is set to 1,000, the number of variables randomly sampled as candidates at each split is set equal to 3 (mtry = 3), and a maximum of 3 (nsplit = 3) split points are chosen randomly:

> fit <- rfsrc(Surv(time2, status) ~ trt + fev + fev2 + fev.ave,
    data = rhDNase[train, ], nsplit = 3, ntree = 1000, mtry = 3)

The error rate by number of trees and variable importance are given using the plot method, see Figure 44.1. The error rate levels out around 400 trees, remaining relatively constant until around 600 trees. Our constructed variable fev.ave is indicated as the most important variable, followed by fev2:

> plot(fit)

         Importance Relative Imp
fev.ave      0.0206       1.0000
fev2         0.0197       0.9586
fev          0.0073       0.3529
trt          0.0005       0.0266


Figure 44.1: Survival random forest error rate and variable importance for rhDNase

Further details of variable importance can be explored using a range of methods via the vimp function and the parameter importance. This parameter can take four distinct values: "permute", "random", "permute.ensemble", "random.ensemble". Let's calculate all four methods:

> permute <- vimp(fit, importance = "permute")$importance
> random <- vimp(fit, importance = "random")$importance
> permute.ensemble <- vimp(fit, importance = "permute.ensemble")$importance
> random.ensemble <- vimp(fit, importance = "random.ensemble")$importance

We combine the results using rbind into the matrix tab. Overall we see that fev2 and fev.ave appear to be the most important attributes:

> tab <- rbind(permute, random, permute.ensemble, random.ensemble)
> round(tab, 3)
                    trt    fev   fev2 fev.ave
permute           0.001  0.007  0.021   0.021
random            0.001  0.010  0.015   0.018
permute.ensemble -0.002 -0.027 -0.015  -0.016
random.ensemble  -0.004 -0.025 -0.027  -0.026

PRACTITIONER TIP

Another way to assess variable importance is using the var.select method. It takes the form var.select(object = fitted model, method = "md"). Method can take the values "md" for minimal depth (default), "vh" for variable hunting, and "vh.vimp" for variable hunting with variable importance. As an illustration, using the fitted model fit, type:

> md <- var.select(object = fit, method = "md")
> vh <- var.select(object = fit, method = "vh")
> vh.vimp <- var.select(object = fit, method = "vh.vimp")

Searching for interaction effects is an important task in many areas of statistical modeling. A useful tool is the method find.interaction. Key parameters for this method include the number of variables to be used (nvar) and the method (method). We set method = "vimp" to use the approach of Ishwaran.84 In this method the importance of each variable is calculated, as is their paired variable importance. The sum of the two individual values is known as "Additive" importance. A large positive or negative difference between "Paired" and "Additive" indicates a potential interaction, provided the individual variable importance scores are large:

> find.interaction(fit, nvar = 8, method = "vimp")

                              Method: vimp
                    No. of variables: 4
           Variables sorted by VIMP?: TRUE
  No. of variables used for pairing: 4
   Total no. of paired interactions: 6
           Monte Carlo replications: 1
  Type of noising up used for VIMP: permute

             Paired Additive Difference
fev.ave:fev2 0.0346   0.0405    -0.0058
fev.ave:fev  0.0296   0.0284     0.0012
fev.ave:trt  0.0217   0.0222    -0.0005
fev2:fev     0.0283   0.0270     0.0013
fev2:trt     0.0196   0.0208    -0.0012
fev:trt      0.0134   0.0090     0.0045

Since the differences across all variable pairs appear rather small, we conclude there is little evidence of an interaction effect.

Now we investigate interaction effects using method = "maxsubtree". This invokes a maximal subtree analysis85:

> find.interaction(fit, nvar = 8, method = "maxsubtree")

                             Method: maxsubtree
                   No. of variables: 4
 Variables sorted by minimal depth?: TRUE

        fev.ave fev2  fev  trt
fev.ave    0.07 0.12 0.13 0.21
fev2       0.12 0.08 0.13 0.20
fev        0.12 0.12 0.09 0.23
trt        0.13 0.13 0.16 0.15

In this case, reading across the rows in the resultant table, small [i][j] values paired with small [i][i] values are an indication of a potential interaction between attribute i and attribute j.

Reading across the rows we do not see any such values, and again we do not find evidence supporting an interaction effect.

Step 4: Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in pred.


> pred <- predict(fit, data = rhDNase[train, ])

Finally, we use the plot and plot.survival methods to visualize the predicted outcomes, see Figure 44.2 and Figure 44.3:

> plot(pred)
> plot.survival(pred)

Figure 44.2: Survival random forest visualization of plot(pred) using rhDNase


Figure 44.3: Survival random forest visualization of prediction


Technique 45

Conditional Inference Survival Random Forest

A conditional inference survival random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a conditional inference survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

We will use all 641 observations in rhDNase for our analysis. The rows in rhDNase have "funky" numbering. Therefore we use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient).


> set.seed(107)
> rownames(rhDNase) <- 1:nrow(rhDNase)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate and Assess the Model

We estimate the random forest using the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 10) and the number of variables randomly sampled as candidates at each split (mtry = 2):

fit <- cforest(z ~ trt + fev.ave, data = rhDNase,
    controls = cforest_unbiased(ntree = 10, mtry = 2))

Step 4: Make Predictions

We will estimate conditional Kaplan-Meier survival curves for individual patients. First we use the predict method:

> pred <- predict(fit, newdata = rhDNase, type = "prob")

Next we use the plot method to graph the conditional Kaplan-Meier curves for the individual patients (patients 1, 137, 205 and 461), see Figure 45.1:

> op <- par(mfrow = c(2, 2))
> plot(pred[[1]], main = "Patient 1", xlab = "Status", ylab = "Time")
> plot(pred[[137]], main = "Patient 137", xlab = "Status", ylab = "Time")
> plot(pred[[205]], main = "Patient 205", xlab = "Status", ylab = "Time")
> plot(pred[[461]], main = "Patient 461", xlab = "Status", ylab = "Time")


> par(op)

Figure 45.1: Survival random forest Kaplan-Meier curves for the individual patients using rhDNase


Technique 46

Conditional Inference Regression Random Forest

A conditional inference regression random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the continuous response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the conditional inference regression random forest using the bodyfat data frame (see page 62) contained in the TH.data package. We also load the caret package as it provides handy performance metrics for random forests:

> library(party)
> data(bodyfat, package = "TH.data")
> library(caret)

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al, we use 45 of the 71 observations to build the regression forest. The remainder will be used for prediction. The 45 training observations were selected at random without replacement:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 100) and the number of variables randomly sampled as candidates at each split (mtry = 5):

> fit <- cforest(DEXfat ~ ., data = bodyfat[train, ],
    controls = cforest_unbiased(ntree = 100, mtry = 5))

The fitted model root mean square error and R-squared are obtained using the caret package:

> round(caret:::cforestStats(fit), 4)
    RMSE Rsquared
  5.1729   0.8112

We calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al:86

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:3])]
> round(imp1, 3)
  hipcirc waistcirc       age
   69.957    47.366     0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman:87

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:3])]
> round(imp2, 3)
  hipcirc waistcirc       age
   72.527    40.501     0.000

The third approach calculates the conditional mean decrease in accuracy:

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:3])]


> round(imp3, 3)
  hipcirc waistcirc       age
   66.600    47.218     0.000

It is informative to note that all three approaches identify hipcirc as the most important variable, followed by waistcirc.

Step 5: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. The scatter plot between predicted and observed values is shown in Figure 46.1. The squared correlation coefficient between predicted and observed values is 0.756:

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")
> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")
> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
       DEXfat
[1,]    0.756


Figure 46.1: Conditional inference regression forest scatter plot for DEXfat


Technique 47

Quantile Regression Forests

Quantile regression forests are a generalization of random forests and offer a non-parametric way of estimating conditional quantiles of the response variable given the predictor variables.88 Quantile regression forests can be built using the package quantregForest with the quantregForest function:

quantregForest(y, x, data, ntree, quantiles, ...)

Key parameters include y, the continuous response variable; x, the data set of attributes with which you wish to build the forest; ntree, the number of trees; and quantiles, the quantiles you wish to include.

Step 1: Load Required Packages

We build the quantile regression forests using the bodyfat data frame (see page 62) contained in the TH.data package:

> require(quantregForest)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al, we use 45 of the 71 observations to build the model. The remainder will be used for prediction. The 45 training observations were selected at random without replacement. We use train to partition the data into a validation and test sample:

> train <- sample(1:71, 45, FALSE)
> numtrees = 2000
> xtrain <- bodyfat[train, ]
> xtrain$DEXfat <- NULL
> xtest <- bodyfat[-train, ]
> xtest$DEXfat <- NULL
> DEXtrain <- bodyfat$DEXfat[train]
> DEXtest <- bodyfat$DEXfat[-train]

Step 3: Estimate and Assess the Model

We use the quantregForest method, setting the number of variables randomly sampled as candidates at each split and the node size equal to 5 (mtry = 5, nodesize = 5). We use fit to contain the results of the fitted model:

> fit <- quantregForest(y = DEXtrain, x = xtrain, mtry = 5,
    ntree = numtrees, nodesize = 5, importance = TRUE,
    quantiles = c(0.25, 0.5, 0.75))

Use the print method to see a summary of the estimated model:

> print(fit)

Call:
 quantregForest(x = xtrain, y = DEXtrain, mtry = 5, nodesize = 5,
     ntree = numtrees, importance = TRUE, quantiles = c(0.25, 0.5, 0.75))

                     Number of trees: 2000
No. of variables tried at each split: 5

To see variable importance across all fitted quantiles, use the method varImpPlot.qrf. A visualization using this method for fit is shown in Figure 47.1. It appears hipcirc and waistcirc are the most influential variables for the median and 75th quantile:

> varImpPlot.qrf(fit)


Figure 47.1: Quantile regression forests variable importance across all fitted quantiles for bodyfat

Let's look in a little more detail at the median. This is achieved using the varImpPlot.qrf method and setting quantile = 0.5. The result is illustrated in Figure 47.2:

> varImpPlot.qrf(fit, quantile = 0.5)


Figure 47.2: Quantile regression forests using the varImpPlot.qrf method with quantile = 0.5 for bodyfat

The plot method returns the 90% prediction interval on the estimated data, see Figure 47.3. It appears the model is well calibrated to the data:

> plot(fit)


Figure 47.3: Quantile Regression Forests 90% prediction interval using bodyfat

Step 4: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. Notice we set all = TRUE to use all observations for prediction:

> pred <- predict(fit, newdata = xtest, all = TRUE)

To take a closer look at the value of pred you can enter:

> head(pred)

     quantile= 0.1 quantile= 0.5 quantile= 0.9
[1,]      34.22470      41.53230      47.14400
[2,]      35.40183      41.53541      52.41859
[3,]      22.50604      27.00303      35.40398
[4,]      27.64735      35.56909      42.73546
[5,]      21.30879      26.20480      35.33114
[6,]      32.49331      38.86571      44.52887

The plot of the fitted and observed values is shown in Figure 47.4. The squared correlation coefficient between the predicted and observed values is 0.898:

> plot(DEXtest, pred[, 2], xlab = "DEXfat",
    ylab = "Predicted Values (median)", main = "Training Sample Model Fit")
> round(cor(pred[, 2], DEXtest)^2, 3)
[1] 0.898


Figure 47.4: Quantile regression forests scatter plot between predicted and observed values using bodyfat


Technique 48

Conditional Inference Ordinal Random Forest

Conditional inference ordinal random forests are used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". Conditional inference ordinal random forests can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the response variable, an ordered factor; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build our random forest using the wine data frame contained in the ordinal package (see page 95). We also load the caret package as it provides handy performance metrics for random forests:

> library(caret)
> library(ordinal)
> data(wine)


Step 2: Prepare Data & Tweak Parameters

We use train to partition the data, using fifty observations to build the model and the remainder as the test set:

> set.seed(107)
> N = nrow(wine)
> train <- sample(1:N, 50, FALSE)

Step 3: Estimate and Assess the Model

We use the cforest method with 100 trees and set the number of variables randomly sampled as candidates at each split equal to 5 (mtry = 5). We use fit to contain the results of the fitted model:

> fit <- cforest(rating ~ ., data = wine[train, ],
    controls = cforest_unbiased(ntree = 100, mtry = 5))

Accuracy and kappa are obtained using caret:

> round(caret:::cforestStats(fit), 4)
Accuracy    Kappa
  0.8600   0.7997

The fitted model has reasonable accuracy and kappa statistics. We next calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al (see page 335 for further details of these methods):

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:5])]
> round(imp1, 3)
response     temp  contact   bottle    judge
   0.499    0.000    0.000    0.000    0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:5])]
> round(imp2, 3)
response     temp  contact   bottle    judge
   0.448    0.000    0.000    0.000    0.000


The third approach calculates the conditional mean decrease in accuracy:

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:5])]
> round(imp3, 3)
response     temp  contact   bottle    judge
   0.481    0.000    0.000    0.000    0.000

For all three methods it seems the response (wine bitterness score) is the only informative attribute in terms of importance.

Step 4: Make Predictions

We next use the test sample observations and cforest, and display the accuracy and kappa using caret:

> fit_test <- cforest(rating ~ ., data = wine[-train, ])
> round(caret:::cforestStats(fit_test), 4)
Accuracy    Kappa
  0.3182   0.0000

Accuracy is now around 32% and kappa has fallen to zero. Although these numbers are not very encouraging, we nevertheless investigate the predictive performance using the predict method, storing the results in pred:

> pred <- predict(fit, newdata = wine[-train, ], type = "response")

Now we compare the predicted values to those actually observed by using the table method to create a confusion matrix:

> tb <- table(wine$rating[-train], pred, dnn = c("actual", "predicted"))
> tb
      predicted
actual 1 2 3 4 5
     1 0 3 0 0 0
     2 0 6 0 0 0
     3 0 0 7 0 0
     4 0 0 0 4 0
     5 0 0 0 2 0

Finally, the misclassification rate can be calculated. We see it is around 23% for the test sample.


> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3)
[1] 0.227


Notes

72 See Ozuysal M, Fua P, Lepetit V (2007). Fast Keypoint Recognition in Ten Lines of Code. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007).

73 Please visit http://www.un-redd.org

74 See for example:

1. Evans JS, Murphy MA, Holden ZA, Cushman SA (2011). Modeling species distribution and change using random forest. In: Drew CA, Wiersma YF, Huettmann F, editors. Predictive species and habitat modeling in landscape ecology: concepts and applications. New York City, NY, USA: Springer Science+Business Media. pp 139-159.

2. Breiman L (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16: 199-231.

75 See for example:

1. Baccini A, Goetz SJ, Walker WS, Laporte NT, Sun M, et al (2012). Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature Climate Change 2: 182-185.

2. Drake JB, Knox RG, Dubayah RO, Clark DB, Condit R, et al (2003). Above ground biomass estimation in closed canopy neotropical forests using lidar remote sensing: Factors affecting the generality of relationships. Global Ecology and Biogeography 12: 147-159.

3. Hudak AT, Strand EK, Vierling LA, Byrne JC, Eitel JUH, et al (2012). Quantifying above ground forest carbon pools and fluxes from repeat LiDAR surveys. Remote Sensing of Environment 123: 25-40.

76 Mascaro Joseph, et al. A tale of two "forests": Random Forest machine learning aids tropical forest carbon mapping. (2014): e85993.

77 Farney Robert J, et al. Sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone. European Respiratory Journal 42.2 (2013): 394-403.

78 Vitorino D, et al. A Random Forest Algorithm Applied to Condition-based Wastewater Deterioration Modeling and Forecasting. Procedia Engineering 89 (2014): 401-410.

79 Sugimoto Koichiro, et al. Cross-sectional study: Does combining optical coherence tomography measurements using the 'Random Forest' decision tree classifier improve the prediction of the presence of perimetric deterioration in glaucoma suspects? BMJ Open 3.10 (2013): e003114.

80 Kanerva N, et al. Random forest analysis in identifying the importance of obesity risk factors. European Journal of Public Health 23.suppl 1 (2013): ckt124-042.

81 Chen Xue-Wen and Mei Liu. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21.24 (2005): 4394-4400.

82 Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

83 Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

84 See Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic J. Statist. 1: 519-537.

85 For more details of this method see:


1. Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ and Lauer MS (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc. 105: 205-217.

2. Ishwaran H, Kogalur UB, Chen X and Minn AJ (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining 4: 115-132.

86 See Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

87 See Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

88 See Meinshausen Nicolai. Quantile regression forests. The Journal of Machine Learning Research 7 (2006): 983-999.


Part VI

Cluster Analysis


The Basic Idea

Cluster analysis is an exploratory data analysis tool which aims to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The groups of similar objects are called clusters.

Cluster analysis itself is not one specific algorithm but consists of a wide variety of approaches. Two of the most popular are hierarchical clustering and partition based clustering. We discuss each briefly below.

Hierarchical Clustering

Hierarchical clustering methods attempt to maximize the distance between clusters. They begin by assigning each object to its own cluster. At each step the algorithm merges together the least distant pair of clusters, until only one cluster remains. At every step the distance between clusters, say cluster A and cluster B, is updated.

There are two basic approaches used in hierarchical clustering algorithms. The first is known as agglomerative and the second divisive.

Agglomerative Approach

The agglomerative approach begins with each observation considered a separate cluster. These observations are then merged successively into larger clusters until a hierarchy of clusters, known as a dendrogram, emerges.

Figure 54.4 shows a typical dendrogram using the thyroid data set. Further details of this data are discussed on page 385. Because agglomerative clustering algorithms begin with the individual observations and then grow to larger clusters, the approach is known as a bottom up approach to clustering.
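The following is a minimal sketch, not the book's worked example, of how such an agglomerative dendrogram can be produced in R. It assumes the thyroid data frame from the mclust package (used later in this part) and average linkage:

> library(cluster)
> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL                    # drop the class labels
> d <- daisy(x)                          # pairwise dissimilarities
> fit <- hclust(d, method = "average")   # bottom up merging of clusters
> plot(fit, labels = FALSE, main = "Agglomerative dendrogram")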


Divisive Approach

The divisive approach begins with the entire sample of observations and then proceeds to divide it into smaller clusters. Two approaches are generally taken. The first uses a single attribute to determine the split at each step; this approach is called monothetic. The second approach uses all attributes at each step and is known as polythetic. Since for a sample of n observations there are 2^(n-1) - 1 possible divisions into clusters, the divisive approach is computationally intense. Figure 55.2 shows an example of a dendrogram produced using the divisive approach.
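As a hedged illustration (again, not the book's example), a polythetic divisive clustering can be obtained with the diana function in the cluster package, using the same thyroid attributes:

> library(cluster)
> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL
> fit <- diana(daisy(x))                 # start from one cluster and split repeatedly
> plot(fit, which.plots = 2, main = "Divisive dendrogram")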

Partition Based Clustering

Partitioning clustering methods attempt to minimize the "average" distance within clusters. They typically proceed as follows:

1. Select at random group centers and assign objects to the nearest group.

2. Form new centers at the center of these groups.

3. Move individual objects to a new group if they are closer to that group than to the center of their present group.

4. Repeat steps 2 and 3 until no (or minimal) change in groupings occurs.

K-Means

The most popular partition based clustering algorithm is the k-means method. It requires a distance metric and the number of centroids (clusters) to be specified. Here is how it works:

1. Determine how many clusters you expect in your sample, say for example k.

2. Pick a group of k random centroids. A centroid is the center of a cluster.

3. Assign sample observations to the closest centroid. The closeness is determined by a distance metric. They become part of that cluster.

4. Calculate the center (mean) of each cluster.

5. Check assignments for all the sample observations. If another center is closer to an observation, reassign it to that cluster.

6. Repeat steps 3-5 until no reassignments occur.

Much of the popularity of k-means lies in the fact that the algorithm is extremely fast and can therefore be easily applied to large data sets. However, it is important to note that the solution is only a local optimum, so in practice several starting points should be used.
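A minimal sketch of these steps in R, not taken from the book, assuming the Vehicle data from mlbench (analyzed in Technique 49); the nstart argument supplies the several starting points mentioned above:

> data(Vehicle, package = "mlbench")
> x <- Vehicle[, -19]                    # drop the class column
> set.seed(2015)
> fit <- kmeans(x, centers = 4, nstart = 25)
> table(fit$cluster)                     # cluster sizes once steps 3-5 converge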


Practical Applications

NOTE

Different clustering algorithms may produce very different clusters. Unfortunately there is no generally accepted "best" method. One way to assess an algorithm is to take a data set with a known structure and see whether the algorithm is able to reproduce this structure. This is the approach taken by Musmeci et al.

Financial Market Structure

Musmeci et al89 study the structure of the New York Stock Exchange using various clustering techniques. A total of 4,026 daily closing prices from 342 stocks are collected for their analysis. Both trended and detrended log price returns are considered. Four of the clustering methods considered are Single Linkage (SL), Average Linkage (AL), Complete Linkage (CL) and k-medoids, using the Industrial Classification Benchmark90 (ICB) as a known classification outcome reference.

Figure 48.1 illustrates the findings of their analysis for a predefined cluster size equal to 23. SL appears to generate one very large cluster which contains stocks in all the ICB sectors; this is clearly inaccurate. AL, CL and k-medoids show more structured clustering, although for AL there remain a number of clusters with less than 5 stocks.91

The Adjusted Rand index and number of sectors identified by each method are presented in Table 21. ICB has 19 sectors, and one would hope that the clustering methods would report a similar number of clusters. For the trended stock returns, k-medoids comes closest with 17 sectors and an Adjusted Rand index of 0.387. For the detrended case, k-medoids has a lower Adjusted Rand index than CL and AL, but it again comes closest to recovering the actual number of sectors in ICB.


Figure 48.1: Musmeci et al's composition of clustering in terms of ICB super-sectors. The x-axis represents the cluster labels, the y-axis the number of stocks in each cluster. Color and shading represent the ICB super-sectors. Source: Musmeci et al.


                        k-medoids    CL    AL    SL
with trend
  Adjusted Rand index       0.387 0.387 0.352 0.184
  Clusters                     17    39   111   229
detrended
  Adjusted Rand index       0.467 0.510 0.480 0.315
  Clusters                     25    50    60   101

Table 21: Adjusted Rand index and number of estimated sectors by clustering model, reported by Musmeci et al.

PRACTITIONER TIP

Musmeci et al use the Adjusted Rand index as part of their analysis. The index is a measure of the similarity between two data partitions, or clusterings, of the same sample. This index has zero expected value in the case of a random partition and takes the value 1 in the case of perfect agreement between two clusterings. Negative values of the index indicate anti-correlation between the two clusterings. You can calculate it in R with the function adjustedRandIndex contained in the mclust package.
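A small sketch of the tip, not from the book: compare a k-means partition of the thyroid attributes with the known Diagnosis labels; a value near 1 would indicate the partition recovers the known structure:

> library(mclust)
> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL
> set.seed(2015)
> km <- kmeans(scale(x), centers = 3)
> adjustedRandIndex(km$cluster, thyroid$Diagnosis)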


Understanding Rat Talk

So called "Rat Talk" consists of ultrasonic vocalizations (USVs) emitted by individual rats to members of their colony. It is possible that USVs convey both emotional and/or environmental information. Takahashi et al92 use cluster analysis to categorize the USVs of pairs of rats as they go about their daily activities.

The data set consisted of audio and video recording. The audio recording data was analyzed 50 ms at a time. Where a USV was emitted, the corresponding video sequence was also analyzed. USV calls were categorized according to Wright et al's rat communication classification system,93 which consists of seven call types - upward, downward, flat, short, trill, inverted U and 22-kHz.

A two-step cluster analysis was performed. In the first step frequency and duration data were formed into preclusters. The number of clusters was automatically determined on the basis of the Schwarz Bayesian Criterion, with the log-likelihood criterion used as a distance measure. In the second step a hierarchical clustering algorithm was applied to the preclusters.

Table 22 presents some of the findings. Notice that three clusters were obtained, which the researchers labeled as "feeding", "moving" and "fighting". Feeding is associated with a low frequency USV, moving with a medium frequency, and fighting with the highest frequency. Overall the analysis tends to indicate that there is an association between USVs and rat behavior.

Cluster       Frequency           Duration              Dominant call type
1 Feeding     24.56 ± 2.18 kHz    628.70 ± 415.45 ms    22-kHz
2 Moving      41.78 ± 5.88 kHz    31.18 ± 32.40 ms      Flat
3 Fighting    59.18 ± 4.91 kHz    9.16 ± 10.08 ms       Short

Table 22: Summary of findings of Takahashi et al.

Identification of Asthma Clusters

Kim et al94 use cluster analysis to help define asthma sub-types as a prelude to searching for better asthma management. The researchers conduct their analysis using two cohorts of asthma patients. The first, known as the COREA cohort, consists of 724 patients. The second, the SCH cohort, consists of 1,843 patients. We focus our discussion on the COREA cohort, as similar results were found for the SCH cohort.

Six primary variables - FEV1,95 body mass index, age at onset, atopic status, smoking history and history of hospital use - were used to help characterize the asthma clusters. All measurements were standardized, using z-scores for continuous variables and 0 or 1 for categorical variables. Hierarchical cluster analysis using Ward's method was used to generate a dendrogram for estimation of the number of potential clusters. This estimate was then used in a k-means cluster analysis.

The researchers observed four clusters. The first cluster contained 81 patients and was dominated by male patients, with the greatest number of smokers and a mean onset age of 46 years. The second cluster contained 151 patients, and around half of the patients had atopy. The third cluster had 253 patients, with the youngest mean age at onset (21 years), and about two-thirds of patients had atopy. The final group had 239 patients and the highest FEV1 at 97.9%, with a mean age at onset of 48 years.

Automated Determination of the Arterial Input Function

Cerebral perfusion, also referred to as cerebral blood flow (CBF), is one of the most important parameters related to brain physiology and function. The technique of dynamic-susceptibility contrast (DSC) MRI96 is a popular method to measure perfusion. It relies on the intravenous injection of a contrast agent and the rapid measurement of the transient signal changes during the passage of the contrast agent through the brain.

Central to quantification of CBF is the arterial input function (AIF), which describes the contrast agent input to the tissue of interest.

Quantitative maps of cerebral blood flow, cerebral blood volume (CBV) and mean transit time (MTT) are created using a deconvolution method.97 98

Yin et al99 consider the use of two clustering techniques (k-means and fuzzy c-means (FCM)) for AIF detection. Forty-two volunteers were recruited onto the study and underwent DSC MRI imaging. After suitable transformation of the image data, both clustering techniques were applied with the number of clusters pre-set to 5.

For the mean curve of each cluster the peak value (PV), the time to peak (TTP) and the full-width half maximum (FWHM) were computed, from which the measure

M = PV / (TTP × FWHM)

was calculated. Following the approach of Murase et al,100 the researchers select the cluster with the highest M to determine the AIF.

Figure 48.2 shows the AIFs for each clustering method for a 37 year old male patient. Relative to FCM, the researchers observe the k-means-based AIF shows similar TTP, higher PV and narrower FWHM. The researchers conclude by stating "the K-means method yields more accurate and reproducible AIF results compared to FCM cluster analysis. The execution time is longer for the K-means method than for FCM but acceptable because it leads to more robust and accurate follow-up hemodynamic maps."


Figure 48.2: Comparison of AIFs derived from the FCM and k-means clustering methods. Source: Yin et al, doi:10.1371/journal.pone.0085884.g007


PRACTITIONER TIP

The question of how many clusters to pre-specify in methods such as k-means is a common issue. Kim et al solve this problem by using the dendrogram from hierarchical cluster analysis to generate the appropriate number of clusters. Given a sample of size N, an alternative and very crude rule of thumb is101

k ≈ √(N/2)    (48.1)
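A quick illustration of equation (48.1), not from the book, using the Vehicle data frame from mlbench:

> data(Vehicle, package = "mlbench")
> N <- nrow(Vehicle)          # sample size
> round(sqrt(N / 2))          # crude guess at the number of clusters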

Choice Tests for a Wood Sample

Choice tests are frequently used by researchers to determine the food preference of insects. For insects which are sensitive to taste, natural variations in the same species of wood might be sufficient to confound experimental results.102 Therefore researchers have spent considerable effort attempting to reduce this natural variation.103

Oberst et al104 test the similarity of wood samples cut sequentially from different Monterey pine (Pinus radiata) sources by applying fuzzy c-means clustering.

Veneer discs from trees in different geographical locations were collected for two samples. The small data set consisted of 505 discs cut from 10 sheets of wood; the large data set consisted of 1,417 discs cut from 22 sheets. Fuzzy c-means clustering using three physical properties (dry weight, moisture absorption and reflected light intensity) was used to evaluate both data sets.

Six clusters were identified for each data set. For the small data set all six cluster centers were in regions of negative mode-skewness, which simply means that most of the wood veneer had more bright (early wood) than dark (late wood) regions. For the large data set only four of the six cluster centers were in regions of negative mode-skewness.

The researchers found that the difference between the small and large data set for the mode skewness of the reflected light intensity was statistically significant. This was not the case for the other two properties (dry weight and moisture absorption).

Oberst et al conclude by observing that "the clustering algorithm was able to place the veneer discs into clusters that match their original source veneer sheets by using just the three measurements of physical properties".

Evaluation of Molecular Descriptors

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy and health research. They are believed to map molecular structures into numbers, allowing some mathematical treatment of the chemical information contained in the molecule. As Todeschini and Consonni state:105

"The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment."

Dehmer et al106 evaluate 919 descriptors of 6 different categories (connectivity, edge adjacency, topological, walk path counts, information and 2D matrix based) by means of clustering. The samples for their analysis came from three data sets:107 MS2263, C15 and N8. Across the 6 categories, 24, 301, 57, 28, 40 and 469 descriptors were acquired.

In order to evaluate the descriptors, seven hierarchical clustering algorithms (Ward, Single, Complete, Average, McQuitty, Median and Centroid) were applied to each of the three data sets.

Dehmer et al report the cophenetic correlation coefficients for the average-linkage clustering solutions for the three data sets as 0.84, 0.89 and 0.93. The researchers also plot the hierarchical clusters, see Figure 48.3, and observe that "The figure indicates that the descriptors of each categories have not been clustered correctly regarding their respective groups."


Figure 48.3: Dehmer et al's hierarchical clustering using the average algorithm: MS2265 (left), C15 (middle), N8 (right). The total number of descriptors equals 919. They belong to 6 different categories, which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D Matrix-based (469). Source: Dehmer et al, doi:10.1371/journal.pone.0083956.g001


PRACTITIONER TIP

Although often viewed as a graphical summary of a sample, it is important to remember that dendrograms actually impose structure on the data. For example, the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function (e.g. complete or average linkage) that is used.

One way to measure how faithfully a dendrogram preserves the pairwise distances between the original data points is to use the cophenetic correlation coefficient. It is defined as the correlation between the n(n − 1)/2 pairwise dissimilarities between observations and the between-cluster dissimilarities at which two observations are first joined together in the same cluster (often known as cophenetic dissimilarities). It takes a maximum value of 1; higher values correspond to greater preservation of the pairwise distances between the original data points. It can be calculated using the cophenetic function in the stats package.
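As a short sketch (not from the book), the coefficient can be compared across linkage methods, for example using the thyroid attributes:

> data(thyroid, package = "mclust")
> x <- thyroid
> x$Diagnosis <- NULL
> d <- dist(scale(x))
> cor(d, cophenetic(hclust(d, method = "average")))   # average linkage
> cor(d, cophenetic(hclust(d, method = "complete")))  # complete linkage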


Partition Based Methods


Technique 49

K-Means

For our initial analysis we will use the kmeansruns function in the fpc package. This uses the kmeans method in the stats package:

kmeansruns(x, krange, criterion, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; krange, the suspected minimum and maximum number of clusters; and criterion, which determines the metric used to assess the clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(fpc)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis, see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters

We store the Vehicle attributes in x:

> set.seed(98765)
> x <- Vehicle[, -19]

Step 3: Estimate and Assess the Model

K-means requires you to specify the exact number of clusters. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the Calinski-Harabasz index to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 4 and 8. In this case we could call the kmeansruns method, setting krange to range between 4 and 8 and the criterion parameter = "ch" for the Calinski-Harabasz index:

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "ch")
4  clusters  2151.267
5  clusters  2528.895
6  clusters  2571.504
7  clusters  2375.042
8  clusters  2305.053

The optimal number of clusters is the solution with the highest Calinski-Harabasz index value. In this case the Calinski-Harabasz index selects 6 clusters.
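As an aside not shown in the book, kmeansruns also records its chosen number of clusters and the criterion values directly on the returned object, assuming the no_k object created above:

> no_k$bestk    # number of clusters selected by the criterion
> no_k$crit     # criterion value for each k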

We can also use the average silhouette width as our decision criterion:

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "asw")
4  clusters  0.4423919
5  clusters  0.4716047
6  clusters  0.4420139
7  clusters  0.4483106
8  clusters  0.3405826

The widest width occurs at 5 clusters. This is a smaller number than obtained by the Calinski-Harabasz index; however, using both methods we have narrowed down the range of possible clusters to lie between 5 and 6. Visualization often helps in making the final selection. To do this let's build a sum of squared error (SSE) scree plot. I explain it line by line below:

> wgs <- (nrow(x) - 1) * sum(apply(x, 2, var))
> for (i in 2:8) {
    fit_temp <- kmeans(as.matrix(x), centers = i)
    wgs[i] <- sum(fit_temp$withinss)
  }
> plot(1:8, wgs, type = "b", xlab = "Number of Clusters",
    ylab = "Within groups sum of squares")

Our first step is to assign the first observation in wgs to a large number. This is followed by a loop which calls kmeans in the stats package for k = 2 to 8 clusters. We then use the plot method to create the scree plot shown in Figure 49.1. Typically in a scree plot we look for a sharp elbow to indicate the appropriate number of clusters. In this case the elbow is gentle, but 4 clusters appears to be a good number to try.

Figure 49.1: k-means clustering scree plot using Vehicle

Let's fit the model with 4 clusters using the kmeans function in the stats package, plotting the results in Figure 49.2:


> fit <- kmeans(x, 4, algorithm = "Hartigan-Wong")
> plot(x, col = fit$cluster)

Figure 49.2: k-means pairwise plot of clusters using Vehicle

We can also narrow down our visualization to a single pair. To illustrate this, let's look at the relationship between Comp (column 1 in x) and Scat.Ra (column 7 in x). The plot is shown in Figure 49.3:

> plot(x[, c(1, 7)], col = fit$cluster)
> points(fit$centers[, c(1, 7)], col = 1:4, pch = 8, cex = 2)


Figure 49.3: k-means pairwise cluster plot of Comp and Scat.Ra


Technique 50

Clara Algorithm

For our analysis we will use the clara function in the cluster package

clara(x, k)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages

First we load the required packages:

> require(cluster)
> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis, see page 385 for additional details on this data. The fpc package will be used to help select the optimum number of clusters.

Step 2: Prepare Data & Tweak Parameters

The thyroid data is stored in x. We also drop the class labels stored in the variable Diagnosis. Finally, we use the daisy method to create a dissimilarity matrix:

> set.seed(1432)
> x <- thyroid
> x$Diagnosis <- NULL
> dissim <- daisy(x)

373

92 Applied Predictive Modeling Techniques in R

Step 3: Estimate and Assess the Model
To use the Clara algorithm you have to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 1 and 6. In this case we could call the pamk method, setting krange to lie between 1 and 6 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(dissim, krange = 1:6, criterion = "asw", critout = TRUE, usepam = FALSE)
1 clusters 0
2 clusters 0.5172889
3 clusters 0.4031142
4 clusters 0.4325106
5 clusters 0.4845567
6 clusters 0.4251244

The optimal number of clusters is the solution with the largest average silhouette width. So in this example, 2 clusters has the largest width.
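As a sketch, assuming the pamk return object exposes its usual nc and crit components, the chosen number of clusters can also be read directly from the fitted object:

> pk1$nc     # number of clusters selected by the criterion
> pk1$crit   # average silhouette width for each k tried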

However, we fit the model with 4 clusters and use the plot method to visualize the result:
> fit <- clara(x, k = 4)

> par(mfrow = c(1, 2))
> plot(fit)

Figure 50.1 shows the resultant plots.

374

TECHNIQUE 50 CLARA ALGORITHM

Figure 501 Plot using the clara algorithm with k = 4 and data set thyroid

375

Technique 51

PAM Algorithm

For our analysis we will use the pam function in the cluster package

pam(x, k, ...)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
require(cluster)
require(fpc)
data(wine, package = "ordinal")

We use the wine data frame contained in the ordinal package for our analysis; see page 95 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
Since the wine data set contains ordinal variables, we use the daisy method with metric set to "gower" to create a dissimilarity matrix. The gower metric can handle nominal, ordinal and binary data108. The wine sample is stored in data. We also drop the wine rating column and then pass data to the daisy method, storing the result in x:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

376

TECHNIQUE 51 PAM ALGORITHM

Step 3: Estimate and Assess the Model
The PAM algorithm requires you to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number. Let us suppose we expect the number of clusters to be between 1 and 5. In this case we could call the pamk method, setting krange to lie between 1 and 5 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(x, krange = 1:5, criterion = "asw", critout = TRUE, usepam = TRUE)
1 clusters 0
2 clusters 0.2061618
3 clusters 0.2546821
4 clusters 0.4053708
5 clusters 0.3318729

The optimal number of clusters is the solution with the largest average silhouette width. So in this example the largest width occurs at 4 clusters.

Now we fit the model with 4 clusters and use the clusplot method to visualize the result. Figure 51.1 shows the resultant plot:
> fit <- pam(x, 4)
> clusplot(fit)

377

92 Applied Predictive Modeling Techniques in R

Figure 511 Partitioning around medoids using pam with k = 4 for wine

378

Technique 52

Kernel Weighted K-Means

The kernel weighted version of the k-means algorithm projects the sampledata into a non-linear feature space by use of a kernel see Part II It hasthe benefit that it can identify clusters which are not linearly separable inthe input space For our analysis we will use the kkmeans function in thekernlab package

kkmeans(x, centers, ...)

Key parameters include x, the matrix of data to be clustered, and centers, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(kernlab)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We drop the vehicle class column and store the result in x:
> set.seed(98765)
> x <- Vehicle[-19]

379

92 Applied Predictive Modeling Techniques in R

Step 3: Estimate and Assess the Model
We estimate the model using four clusters (centers = 4), storing the result in fit. The plot method is then used to visualize the result; see Figure 52.1:

> fit <- kkmeans(as.matrix(x), centers = 4)
> plot(x, col = fit)

Let's focus in on the pairwise clusters associated with Comp (column 1 in x) and ScatRa (column 7 in x). We can visualize this relationship using the plot method. Figure 52.2 shows the resultant plot:
> plot(x[c(1, 7)], col = fit)
> points(centers(fit)[, c(1, 7)], col = 1:4, pch = 8, cex = 2)
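As a quick sketch, assuming kernlab's usual accessor functions for the returned object, we can also inspect the fitted clustering numerically:

> centers(fit)    # cluster centres in the original feature space
> size(fit)       # number of observations assigned to each cluster
> withinss(fit)   # within-cluster sum of squares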

380

TECHNIQUE 52 KERNEL WEIGHTED K-MEANS

Figure 521 Kernel Weighted K-Means with centers = 4 using Vehicle

381

92 Applied Predictive Modeling Techniques in R

Figure 522 Kernel Weighted K-Means pairwise clusters of Comp and ScatRausing Vehicle

382

Hierarchy Based Methods

383

Technique 53

Hierarchical AgglomerativeCluster Analysis

Hierarchical Cluster Analysis is available with the basic installation of R. It is obtained using the stats package with the hclust function:

hclust(d, method, ...)

Key parameters include d, the dissimilarity matrix, and method, the agglomeration method to be used.

Step 1: Load Required Packages
First we load the required packages; I'll explain each below:
> library(colorspace)
> library(dendextend)
> require(qgraph)
> require(cluster)
> data(thyroid, package = "mclust")

We will use the thyroid data frame contained in the mclust package in our analysis. For color output we use the colorspace package. The dendextend and qgraph packages will help us better visualize our results. We will also use the bannerplot method from the cluster package.

384

TECHNIQUE 53 HIERARCHICAL AGGLOMERATIVE

NOTE

The thyroid data frame109 was constructed from five laboratory tests administered to a sample of 215 patients. The tests were used to predict whether a patient's thyroid function could be classified as euthyroidism (normal), hypothyroidism (under active thyroid) or hyperthyroidism. The thyroid data frame contains the following variables:

• Class variable:
  - Diagnosis: diagnosis of thyroid operation: Hypo, Normal and Hyper.
    Distribution of Diagnosis (number of instances per class):
    Class 1 (normal): 150
    Class 2 (hyper): 35
    Class 3 (hypo): 30

• Continuous attributes:
  - RT3U: T3-resin uptake test (percentage).
  - T4: total serum thyroxin as measured by the isotopic displacement method.
  - T3: total serum triiodothyronine as measured by radioimmuno assay.
  - TSH: basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
  - DTSH: maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone, as compared to the basal value.

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]

> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
Our first step is to create a dissimilarity matrix. We use the manhattan method, storing the results in dthyroid:

> dthyroid <- dist(thyroid2, method = "manhattan")

Now we are ready to fit our basic model. I have had some success with Ward's method, so let's try it first:
> fit <- hclust(dthyroid, method = "ward.D2")

NOTE

I have had good practical success using Ward's method. However, it is interesting to note that in the academic literature there are two different algorithms associated with the method. The method we selected (ward.D2) squares the dissimilarities before clustering; an alternative ward method, ward.D, does not.110

The print method provides a useful overview of the fitted model:

> print(fit)

Call:
hclust(d = dthyroid, method = "ward.D2")

Cluster method   : ward.D2
Distance         : manhattan
Number of objects: 215

Now we create a bannerplot of the fitted model:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 53.1.

386

TECHNIQUE 53 HIERARCHICAL AGGLOMERATIVE

Figure 531 Hierarchical Agglomerative Cluster Analysis bannerplot usingthyroid and Wards method

OK, we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

We set k = 3 in color_branches, for each of the three diagnosis types (hyper, normal, hypo):

387

92 Applied Predictive Modeling Techniques in R

> dend <- color_branches(dend, k = 3)

Next we match the labels to the actual classes:
> labels_colors(dend) <-
    rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is required in our next step: reduce the size of the labels to 75% of their original size. Then we create the actual plot shown in Figure 53.2:
> dend <- set(dend, "labels_cex", 0.75)
> par(mar = c(3, 3, 3, 7))
> plot(dend,
    main = "Clustered Thyroid data where the labels give the true diagnosis",
    horiz = TRUE, nodePar = list(cex = 0.007))
> legend("topleft", legend = diagnosis, fill = rainbow_hcl(3))

Overall, the data seems to have been well separated. However, there is some mixing between normal and hypo.
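To quantify that mixing, here is a minimal sketch assuming the stats package's cutree function and the diagnosis_labels factor created earlier; we cut the tree into three groups and cross-tabulate them against the true diagnosis:

> groups <- cutree(fit, k = 3)
> table(groups, diagnosis_labels)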

388

TECHNIQUE 53 HIERARCHICAL AGGLOMERATIVE

Figure 532 Hierarchical Agglomerative Cluster Analysis dendrogram usingthyroid

Since our initial analysis used Ward's method, we may as well take a look at the other methods. We choose eight of the most popular methods and investigate their correlation. We begin by capturing the methods in hclust_methods and creating a list in hclust_list:
> hclust_methods <- c("ward.D", "ward.D2", "single", "complete",
    "average", "mcquitty", "median", "centroid")

> hclust_list <- dendlist()

Here is the main loop. Notice we fit the models using temp_fit with the dissimilarity matrix dthyroid:
> for (i in seq_along(hclust_methods)) {
+   print(hclust_methods[i])
+   temp_fit <- hclust(dthyroid, method = hclust_methods[i])
+   hclust_list <- dendlist(hclust_list, as.dendrogram(temp_fit))
+ }

Let's add names to the list:
> names(hclust_list) <- hclust_methods

Next we take a look at the output. Notice the wide variation in height produced by the different methods:
> hclust_list
$ward.D
'dendrogram' with 2 branches and 215 members total, at height 1233947

$ward.D2
'dendrogram' with 2 branches and 215 members total, at height 3378454

$single
'dendrogram' with 2 branches and 215 members total, at height 335

$complete
'dendrogram' with 2 branches and 215 members total, at height 1601

$average
'dendrogram' with 2 branches and 215 members total, at height 8466573

$mcquitty
'dendrogram' with 2 branches and 215 members total, at height 1038406

$median
'dendrogram' with 2 branches and 215 members total, at height 4114692

$centroid
'dendrogram' with 2 branches and 215 members total, at height 6373702

Now we investigate the correlation between methods. Note that method = "common" measures the commonality between members of nodes:
> cor <- cor.dendlist(hclust_list, method = "common")
> par(mfrow = c(2, 2))
> qgraph(cor, minimum = 0.70, title = "Correlation = 0.70")
> qgraph(cor, minimum = 0.75, title = "Correlation = 0.75")
> qgraph(cor, minimum = 0.80, title = "Correlation = 0.80")
> qgraph(cor, minimum = 0.85, title = "Correlation = 0.85")

Figure 53.3 shows the resultant plot for varying levels of correlation. It gives us another perspective on the clustering algorithms. We can see that most methods have around 75% commonality within nodes with one another.
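As an alternative sketch, assuming cor.dendlist also accepts its default cophenetic measure, we could compare the trees by cophenetic correlation rather than by common nodes:

> cor_coph <- cor.dendlist(hclust_list, method = "cophenetic")
> round(cor_coph, 2)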

391

92 Applied Predictive Modeling Techniques in R

Figure 533 Hierarchical Agglomerative Cluster Analysis correlation betweenmethods using thyroid

Since ward.D and ward.D2 have high commonality, let's look in detail using a tanglegram visualization. The "which" parameter allows us to pick the elements in the list to compare. Figure 53.4 shows the result:
> hclust_list %>% dendlist(which = c(1, 2)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

We can use a similar approach for average and mcquitty; see Figure 53.5:
> hclust_list %>% dendlist(which = c(5, 6)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

Figure 534 Tanglegram visualization between wardD and wardD2 usingthyroid

393

92 Applied Predictive Modeling Techniques in R

Figure 535 Tanglegram visualization between average and mcquitty usingthyroid

Finally, we plot the dendrograms for all of the methods. This is shown in Figure 53.6:
> par(mfrow = c(4, 2))

> for (i in seq_along(hclust_methods)) {
+   fit <- hclust(dthyroid, method = hclust_methods[i])
+   dend <- as.dendrogram(fit)
+   dend <- rotate(dend, 1:215)
+   dend <- color_branches(dend, k = 3)
+   labels_colors(dend) <-
+     rainbow_hcl(3)[sort_levels_values(
+       as.numeric(diagnosis_labels)[order.dendrogram(dend)])]
+   labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
+     "(", labels(dend), ")", sep = "")
+   dend <- set(dend, "labels_cex", 0.75)
+   par(mar = c(3, 3, 3, 7))
+   plot(dend,
+     main = hclust_methods[i],
+     horiz = TRUE, nodePar = list(cex = 0.007))
+   legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
+ }

395

92 Applied Predictive Modeling Techniques in R

Figure 536 Hierarchical Agglomerative Cluster Analysis for all methodsusing thyroid

396

Technique 54

Agglomerative Nesting

Agglomerative nesting is available in the cluster package using the function agnes:

agnes(x, metric, method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities, whilst method defines the clustering method to be used.

Step 1: Load Required Packages
First we load the required packages; I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(corrplot)
> data(thyroid, package = "mclust")

The package cluster contains the agnes function. For color output we use the colorspace package and dendextend to create fancy dendrograms. The corrplot package is used to visualize correlations. Finally, we use the thyroid data frame contained in the mclust package for our analysis.

We also make use of two user defined functions. The first, which we call panel.hist, will be used to plot histograms. It is defined as follows:
> panel.hist <- function(x, ...) {
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(usr[1:2], 0, 1.5))
+   h <- hist(x, plot = FALSE)
+   breaks <- h$breaks; nB <- length(breaks)
+   y <- h$counts; y <- y/max(y)
+   rect(breaks[-nB], 0, breaks[-1], y, density = 50, border = "blue")
+ }

The second function will be used for calculating the Pearson correlation coefficient. Here is the R code:
> panel.cor <- function(x, y, digits = 3, prefix = "", cex.cor, ...) {
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(0, 1, 0, 1))
+   r <- cor(x, y, method = "pearson")
+   txt <- format(c(r, 0.123456789), digits = digits)[1]
+   txt <- paste0(prefix, txt)
+   if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
+   text(0.5, 0.5, txt, cex = cex.cor)
+ }

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]
> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Let's spend a few moments investigating the data set. We will use visualization to assist us. We begin by looking at the pairwise relationships between the attributes, as shown in Figure 54.1. This plot was built using the pairs method as follows:
> par(oma = c(4, 1, 1, 1))

> pairs(thyroid2, col = diagnosis_col,
    upper.panel = panel.cor,
    cex.labels = 2, pch = 19, cex = 1.2,
    panel = panel.smooth, diag.panel = panel.hist)

> par(fig = c(0, 1, 0, 1),
    oma = c(0, 0, 0, 0),
    mar = c(0, 0, 0, 0),
    new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1, horiz = TRUE,
    inset = c(1, 0), bty = "n", xpd = TRUE,
    legend = as.character(levels(diagnosis_labels)),
    fill = unique(diagnosis_col))

> par(xpd = NA)

The top panels (above the diagonal) give the correlation coefficient between the attributes. We see that RT3U is negatively correlated with T4 and T3. It is moderately correlated with TSH and DTSH. We also see that T3 and T4 are negatively correlated with TSH and DTSH, whilst TSH and DTSH are positively correlated.

The diagonal in Figure 54.1 shows the distribution of each attribute, and the bottom panels are colored by diagnosis using diagnosis_col. Notice that Hypo, Normal and Hyper appear to be distinctly different from each other (as measured by the majority of attributes). However, all three diagnosis types cannot be easily separated if measured by RT3U and DTSH alone.

399

92 Applied Predictive Modeling Techniques in R

Figure 541 Pairwise relationships distributions and correlations for at-tributes in thyroid

The same conclusion, that the diagnosis types are distinct, can be made by looking at the parallel coordinates plot of the data shown in Figure 54.2. It can be calculated as follows:
> par(oma = c(4, 1, 1, 1))

> MASS::parcoord(thyroid2, col = diagnosis_col, var.label = TRUE, lwd = 2)

> par(fig = c(0, 1, 0, 1),
    oma = c(0, 0, 0, 0),
    mar = c(0, 0, 0, 0),
    new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1.25, horiz = TRUE,
    inset = c(1, 0), bty = "n", xpd = TRUE,
    legend = as.character(levels(diagnosis_labels)),
    fill = unique(diagnosis_col))

> par(xpd = NA)

401

92 Applied Predictive Modeling Techniques in R

Figure 542 Parallel coordinates plot using thyroid

Step 3: Estimate and Assess the Model
The agnes function offers a number of methods for agglomerative nesting. These include "average", "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method) and "weighted". The default is "average"; however, I have had some success with Ward's method in the past, so I usually give it a try first. I also set the parameter stand = TRUE to standardize the attributes:

> fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean", method = "ward")

The print method provides an overview of fit (we only show the first few lines of output below):
> print(fit)
Call: agnes(x = thyroid2, metric = "euclidean", stand = TRUE, method = "ward")
Agglomerative coefficient: 0.9819448

The agglomerative coefficient measures the amount of clustering structure. Higher values indicate more structure. It can also be viewed as the average width of the banner plot shown in Figure 54.3. A banner plot is calculated using the bannerplot method:
> bannerplot(fit)
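To compare how much structure each linkage finds, here is a minimal sketch assuming the agnes object stores its agglomerative coefficient in the ac component:

> for (m in c("average", "single", "complete", "ward", "weighted")) {
+   ac <- agnes(thyroid2, stand = TRUE, metric = "euclidean", method = m)$ac
+   cat(m, ":", round(ac, 3), "\n")
+ }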

403

92 Applied Predictive Modeling Techniques in R

Figure 543 Agglomerative nesting bannerplot using thyroid

Now we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram in dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k = 3 in color_branches, for each of the three diagnosis types (hyper, normal, hypo):

> dend <- color_branches(dend, k = 3)

Then we match the labels to the actual classes:
> labels_colors(dend) <-
    rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 75% of their original size. This assists in making the eventual plot look less cluttered:
> dend <- set(dend, "labels_cex", 0.75)

And now to the visualization using the plot method:
> par(mar = c(3, 3, 3, 7))

> plot(dend,
    main = "Clustered Thyroid data where the labels give the true diagnosis",
    horiz = TRUE, nodePar = list(cex = 0.007))

> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))

The result is shown in Figure 54.4. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.

405

92 Applied Predictive Modeling Techniques in R

Figure 544 Agglomerative Nesting dendogram using thyroid The labelcolor corresponds to the true diagnosis

Since the agnes function provides multiple methods, in the spirit of predictive analytics empiricism let's look at the correlation between the five most popular methods. We assign these methods to agnes_methods:
> agnes_methods <- c("ward", "average", "single", "complete", "weighted")

> agnes_list <- dendlist()

Our main loop for the calculations is as follows:

> for (i in seq_along(agnes_methods)) {
+   temp_fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean",
+                     method = agnes_methods[i])
+   agnes_list <- dendlist(agnes_list, as.dendrogram(temp_fit))
+ }

We need to add the names of the methods as a final touch:
> names(agnes_list) <- agnes_methods

Now let's take a look at the output:
> agnes_list
$ward
'dendrogram' with 2 branches and 215 members total, at height 47.01778

$average
'dendrogram' with 2 branches and 215 members total, at height 11.97675

$single
'dendrogram' with 2 branches and 215 members total, at height 5.450811

$complete
'dendrogram' with 2 branches and 215 members total, at height 24.03689

$weighted
'dendrogram' with 2 branches and 215 members total, at height 16.12223

407

92 Applied Predictive Modeling Techniques in R

Notice that all methods produce two branches; however, there is considerable variation in the height of the dendrogram, ranging from 5.4 for "single" to 47 for "ward".

We use corrplot with method = "common" to visualize the correlation between the methods:
> corrplot(cor.dendlist(agnes_list, method = "common"),
    "pie", type = "lower")

The resultant plot, shown in Figure 54.5, gives us another perspective on our clustering algorithms. We can see that most of the methods have around 75% of nodes in common with one another.

Figure 545 Agglomerative Nesting correlation plot using thyroid

408

TECHNIQUE 54 AGGLOMERATIVE NESTING

Finally, let's plot all five dendrograms. Here is how:
> par(mfrow = c(3, 2))

> for (i in seq_along(agnes_methods)) {
+   fit <- agnes(thyroid2, stand = TRUE,
+                metric = "euclidean", method = agnes_methods[i])
+   dend <- as.dendrogram(fit)
+   dend <- rotate(dend, 1:215)
+   dend <- color_branches(dend, k = 3)
+   labels_colors(dend) <-
+     rainbow_hcl(3)[sort_levels_values(
+       as.numeric(diagnosis_labels)[order.dendrogram(dend)])]
+   labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
+     "(", labels(dend), ")", sep = "")
+   dend <- set(dend, "labels_cex", 0.75)
+   par(mar = c(3, 3, 3, 7))
+   plot(dend, main = agnes_methods[i],
+     horiz = TRUE, nodePar = list(cex = 0.007))
+   legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
+ }

409

92 Applied Predictive Modeling Techniques in R

Wow! That is a nice bit of typing, but the result is well worth it. Take a look at Figure 54.6. (It still appears that Ward's method works the best.)

Figure 546 Agglomerative nesting dendogram for various methods usingthyroid

410

Technique 55

Divisive Hierarchical Clustering

Divisive hierarchical clustering is available in the cluster package using the function diana:

diana(x, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities.

Step 1: Load Required Packages
First we load the required packages; I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(circlize)

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data. For color output we use the colorspace package and dendextend to create fancy dendrograms. The circlize package is used to visualize a circular dendrogram.

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:
> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]

411

92 Applied Predictive Modeling Techniques in R

> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
The model can be fitted as follows:

> fit <- diana(thyroid2, stand = TRUE, metric = "euclidean")

The parameter stand = TRUE is used to standardize the attributes. The parameter metric = "euclidean" is used for calculating dissimilarities; a popular alternative is "manhattan".
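Analogous to the agglomerative coefficient reported by agnes, diana reports a divisive coefficient. A minimal sketch, assuming the fitted object stores it in its dc component:

> fit$dc   # divisive coefficient; values near 1 suggest strong clustering structure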

Now we can use the bannerplot method:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 55.1.

412

TECHNIQUE 55 DIVISIVE HIERARCHICAL CLUSTERING

Figure 551 Divisive hierarchical clustering bannerplot using thyroid

We should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k = 3 in color_branches, for each of the three diagnosis types (hyper, normal, hypo):

> dend <- color_branches(dend, k = 3)

We match the labels to the actual classes:
> labels_colors(dend) <-
    rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 70% of their original size. This assists in making the eventual plot look less cluttered. Then circlize the dendrogram using the circlize_dendrogram method:
> dend <- set(dend, "labels_cex", 0.7)
> circlize_dendrogram(dend)
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3), bty = "n")

The result is shown in Figure 55.2. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.

414

TECHNIQUE 55 DIVISIVE HIERARCHICAL CLUSTERING

Figure 552 Divisive hierarchical clustering dendogram using thyroid Thelabel color corresponds to the true diagnosis

415

Technique 56

Exemplar Based AgglomerativeClustering

Exemplar Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
First we load the required packages; I'll explain each below:
> require(apcluster)
> data(bodyfat, package = "TH.data")

We use the bodyfat data frame contained in the TH.data package for our analysis; see page 62 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We store the bodyfat data set in x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
The model can be fitted as follows:

> fit <- aggExCluster(negDistMat(r = 10), x)

Next we visualize the dendrogram using the plot method:
> plot(fit, showSamples = TRUE)

The result is shown in Figure 56.1.

Figure 56.1: Agglomerative Clustering dendrogram using bodyfat

Let's look at the relationship between DEXfat and waistcirc for different cluster sizes. To do this we use a while loop; see Figure 56.2. There are two main branches, with possibly two sub-branches each, suggesting 4 or 5 clusters overall.

417

92 Applied Predictive Modeling Techniques in R

> i = 2
> par(mfrow = c(2, 2))

> while (i <= 5) {
+   plot(fit, x[c(2, 3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
+        main = c("Number Clusters", i))
+   i = i + 1
+ }
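If we settle on four clusters, a hard assignment can be cut out of the hierarchy. A sketch, assuming apcluster's cutree method for aggExCluster results:

> cl4 <- cutree(fit, k = 4)   # an exemplar-based clustering with 4 clusters
> cl4                         # prints the exemplars and the members of each cluster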

Figure 562 Exemplar Based Agglomerative Clustering - dexfat andwaistcirc for various cluster sizes

418

Fuzzy Methods

419

Technique 57

Rousseeuw-Kaufmanrsquos FuzzyClustering Method

The method outlined by Rousseeuw and Kaufman111 for fuzzy clustering is available in the cluster package using the function fanny:

fanny(x, k, memb.exp, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, memb.exp, the membership exponent used in the fit criterion, and metric, the measure to be used for calculating dissimilarities between observations.

Step 1: Load Required Packages
We use the wine data frame contained in the ordinal package (see page 95). We also load the smacof package, which offers a number of approaches for multidimensional scaling:
> require(cluster)
> require(smacof)
> data(wine, package = "ordinal")

Step 2: Prepare Data & Tweak Parameters
We store the wine data set in data and then remove the wine ratings by specifying data <- data[-2]. One of the advantages of Rousseeuw-Kaufman's fuzzy clustering method over other fuzzy clustering techniques is that it can handle dissimilarity data. Since wine contains ordinal variables, we use the daisy method to create a dissimilarity matrix with metric set to the general dissimilarity coefficient of Gower112:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

Step 3: Estimate and Assess the Model
We need to specify the number of clusters. Let's begin with 3 clusters (k = 3) and set metric = "gower". We also set the parameter memb.exp = 1.1; the higher the value of memb.exp, the more fuzzy the clustering in general.

> fit <- fanny(x, k = 3, memb.exp = 1.1, metric = "gower")

The cluster package provides silhouette plots via the plot method. These can help determine the viability of the clusters. Figure 57.1 presents the silhouette plot for fit. We see one large cluster containing 36 observations with a width of 0.25, and two smaller clusters, each with 18 observations and an average width of 0.46:
> plot(fit)
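Because fanny is a fuzzy method, each observation receives a degree of membership in every cluster. A minimal sketch, assuming the fitted object exposes the usual membership and coeff components:

> head(round(fit$membership, 2))   # membership degrees for the first six wines
> fit$coeff                        # Dunn's partition coefficient and its normalized version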

421

92 Applied Predictive Modeling Techniques in R

Figure 571 Rousseeuw-Kaufmanrsquos fuzzy clustering method silhouette plotusing wine

It can be instructive to investigate the relationship between the fuzziness parameter memb.exp and the average silhouette width. To do this we use a small while loop and call fanny with values of memb.exp ranging from 1.1 to 1.8:
> fuzzy = 1.1
> while (fuzzy < 1.9) {
+   temp_fit <- fanny(x, k = 3, memb.exp = fuzzy, metric = "euclidean")
+   cat("Fuzz =", fuzzy, "Average Silhouette Width =",
+       round(temp_fit$silinfo[[3]], 1), "\n")
+   fuzzy = fuzzy + 0.1
+ }

The resultant output is shown below. Notice how the average silhouette width declines as the fuzziness increases:
Fuzz = 1.1 Average Silhouette Width = 0.4
Fuzz = 1.2 Average Silhouette Width = 0.3
Fuzz = 1.3 Average Silhouette Width = 0.3
Fuzz = 1.4 Average Silhouette Width = 0.3
Fuzz = 1.5 Average Silhouette Width = 0.3
Fuzz = 1.6 Average Silhouette Width = 0.3
Fuzz = 1.7 Average Silhouette Width = 0.3
Fuzz = 1.8 Average Silhouette Width = 0.2

We still need to determine the optimal number of clusters. Let's try multidimensional scaling to investigate further. We use the smacofSym method from the package smacof and store the result in scaleT:
> scaleT <- smacofSym(x)$conf

> plot(scaleT, type = "n", main = "Fuzzy = 1.1 with k = 3")

> text(scaleT, label = rownames(scaleT), col = rgb(fit$membership))

423

92 Applied Predictive Modeling Techniques in R

Figure 572 Use the smacofSym method to help identify clusters using wine

The result is visualized using the plot method and shown in Figure 57.2. The image appears to identify four clusters. Even though five official ratings were used to assess the quality of the wine, we will refit using k = 4:
> fit <- fanny(x, k = 4, memb.exp = 1.1, metric = "gower")

Finally, we create a silhouette plot:
> plot(fit)

Figure 57.3 shows the resultant plot. A rule of thumb113 is to choose as the number of clusters the silhouette plot with the largest average width. Here we see the average silhouette width is now 0.46, compared to 0.36 for k = 3.

424

TECHNIQUE 57 ROUSSEEUW-KAUFMANrsquoS FUZZY

Figure 573 Rousseeuw-Kaufmanrsquos method silhouette plot with four clustersusing wine

425

Technique 58

Fuzzy K-Means

Fuzzy K-means is available in the fclust package using the function FKM:

FKM(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, m, the fuzziness parameter, and RS, the number of random starts.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM(x, k = 3, m = 1.5, RS = 1, stand = 1)

We set the fuzzy parameter m to 1.5 and, for illustrative purposes only, set RS = 1. In practice you will want to set a higher number. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.
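Before comparing models it can be worth peeking at the fuzzy memberships themselves. A minimal sketch, assuming the FKM object exposes the membership matrix U and the closest hard clustering in clus:

> head(round(fit3$U, 2))   # membership degrees of the first six vehicles
> head(fit3$clus)          # closest hard cluster and its membership degree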

The other three models can be estimated in a similar manner:
> fit4 <- FKM(x, k = 4, m = 1.5, RS = 1, stand = 1)
> fit5 <- FKM(x, k = 5, m = 1.5, RS = 1, stand = 1)
> fit6 <- FKM(x, k = 6, m = 1.5, RS = 1, stand = 1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SIL.F (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.76 0.43 0.64 0.44  0.52 0.65

> round(f4, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.66 0.64 0.54 0.37  0.47 0.99

> round(f5, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.59 0.78 0.49 0.32  0.41 1.02

> round(f6, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.52 0.93 0.42 0.26  0.35 1.97

427

92 Applied Predictive Modeling Techniques in R

Overall, k = 3 is the optimum choice for the majority of indices.

It is fun to visualize clusters. Let's focus on fit3 and on the two important variables Comp and ScatRa. It can be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1, 7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axes:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1, 7))
> plot(fit3, v1v2 = c(1, 7), pca = TRUE)

Figure 58.1 shows both plots. There appears to be reasonable separation between the clusters.

428

TECHNIQUE 58 FUZZY K-MEANS

Figure 581 Fuzzy K-Means clusters with k=3 using Vehicle

429

Technique 59

Fuzzy K-Medoids

Fuzzy K-medoids is available in the fclust package using the function FKM.med:

FKM.med(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, k, the number of clusters, m, the fuzziness parameter, and RS, the number of random starts. Note that the difference between fuzzy K-means and fuzzy K-medoids is that in fuzzy K-means the cluster prototypes (centroids) are artificial, randomly computed points, whereas in fuzzy K-medoids the cluster prototypes (medoids) are a subset of the actual observed objects.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]

430

TECHNIQUE 59 FUZZY K-MEDOIDS

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM.med(x, k = 3, m = 1.5, RS = 10, stand = 1)

We set the fuzzy parameter m to 1.5 and RS = 10. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:
> fit4 <- FKM.med(x, k = 4, m = 1.5, RS = 10, stand = 1)
> fit5 <- FKM.med(x, k = 5, m = 1.5, RS = 10, stand = 1)
> fit6 <- FKM.med(x, k = 6, m = 1.5, RS = 10, stand = 1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SIL.F (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)
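Because the prototypes are actual observations, it can be useful to see which vehicles act as medoids. A minimal sketch, assuming the fitted object records the medoid row indices in a medoid component:

> fit3$medoid        # row indices of the three medoid vehicles
> x[fit3$medoid, ]   # the medoid observations themselves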

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.37 1.04 0.06 0.13  0.15 6.96

> round(f4, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.29 1.30 0.06 0.09  0.11 6.05

> round(f5, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.24 1.51 0.05 0.07  0.11 8.63

> round(f6, 2)
  PC   PE  MPC  SIL SIL.F   XB
0.21 1.68 0.05 0.07  0.08 7.90

Overall, k = 3 is the optimum choice for the majority of indices.

Now we use fit3 and focus on the two important variables Comp and ScatRa. It can also be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1, 7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axes:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1, 7))
> plot(fit3, v1v2 = c(1, 7), pca = TRUE)

Figure 59.1 shows both plots.

432

TECHNIQUE 59 FUZZY K-MEDOIDS

Figure 591 Fuzzy K-Medoids clusters with k=3 using Vehicle

433

Other Methods

434

Technique 60

Density-Based Cluster Analysis

In density based cluster analysis two parameters are important: eps defines the radius of the neighborhood of each point, and MinPts is the minimum number of points within that radius (or "reach") associated with the clusters of interest. For our analysis we will use the dbscan function in the fpc package:

dbscan(data, eps, MinPts)

Key parameters include data, the data matrix, data frame or dissimilarity matrix.

Step 1: Load Required Packages
First we load the required packages:
> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data.

Step 2: Estimate and Assess the Model
We estimate the model using the dbscan method. Note that in practice MinPts is chosen by trial and error. We set MinPts = 4 and eps = 85. We also set showplot = 1, which will allow you to see visually how dbscan works:

> fit <- dbscan(thyroid[-1], eps = 85, MinPts = 4, showplot = 1)

435

92 Applied Predictive Modeling Techniques in R

Next we take a peek at the fitted model:
> fit
dbscan Pts=215 MinPts=4 eps=85
        0   1 2 3
border 18   3 3 2
seed    0 185 1 3
total  18 188 4 5

Unlike other approaches to cluster analysis, the density based cluster algorithm can identify outliers (data points that do not belong to any cluster). Points in cluster 0 are unassigned outliers, 18 in total.
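To see which observations were flagged as noise, here is a minimal sketch assuming the fitted object stores cluster memberships in a cluster component, with 0 denoting noise points:

> outliers <- which(fit$cluster == 0)
> outliers              # row numbers of the 18 unassigned observations
> thyroid[outliers, ]   # inspect the corresponding records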

The pairwise plot of clusters shown in Figure 60.1 is calculated as follows:
> plot(fit, thyroid)

436

TECHNIQUE 60 DENSITY-BASED CLUSTER ANALYSIS

Figure 601 Density-Based Cluster Analysis pairwise cluster plot usingthyroid

Let's take a closer look at the pairwise cluster plot between the attributes T4 (column 3 in thyroid) and RT3U (column 2 in thyroid). To do this we use the plot method; see Figure 60.2:
> plot(fit, thyroid[c(3, 2)])

437

92 Applied Predictive Modeling Techniques in R

Figure 602 Density-Based Cluster Analysis pairwise plot of T4 and RT3U

438

Technique 61

K-Modes Clustering

K-modes clustering is useful for finding clusters in categorical data. It is available in the klaR package using the function kmodes:

kmodes(x, k)

Key parameters include x, the data set of categorical attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages
We use the housing data frame contained in the MASS package. We also load the plyr package, which we will use to map attributes in housing to numerical categories:
> require(klaR)
> data(housing, package = "MASS")
> library(plyr)

439

92 Applied Predictive Modeling Techniques in R

NOTE

The housing data frame contains data from an investigation of satisfaction with housing conditions in Copenhagen carried out by the Danish Building Research Institute and the Danish Institute of Mental Health Research114. It contains 72 rows with measurements on the following 5 variables:

• Sat: satisfaction of householders with their present housing circumstances (High, Medium or Low, ordered factor).

• Infl: perceived degree of influence householders have on the management of the property (High, Medium, Low).

• Type: type of rental accommodation (Tower, Atrium, Apartment, Terrace).

• Cont: contact residents are afforded with other residents (Low, High).

• Freq: the number of residents in each class.

Step 2: Prepare Data & Tweak Parameters
We store the housing data set in x. We then convert the attributes into numerical values using the mapvalues and as.numeric methods:
> set.seed(98765)
> x <- housing[-5]

> x$Sat <- mapvalues(x$Sat,
    from = c("Low", "Medium", "High"),
    to = c(1, 2, 3))

> x$Sat <- as.numeric(x$Sat)

> x$Infl <- mapvalues(x$Infl,
    from = c("Low", "Medium", "High"),
    to = c(1, 2, 3))

> x$Infl <- as.numeric(x$Infl)

> x$Type <- mapvalues(x$Type,
    from = c("Tower", "Apartment", "Atrium", "Terrace"),
    to = c(1, 2, 3, 4))

> x$Type <- as.numeric(x$Type)

> x$Cont <- mapvalues(x$Cont,
    from = c("Low", "High"),
    to = c(1, 3))

> x$Cont <- as.numeric(x$Cont)

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters is 3. We estimate the model and plot the result (see Figure 61.1) as follows:

> fit <- kmodes(x, 3)
> plot(x, col = fit$cluster)
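A quick numerical summary of the solution, sketched under the assumption that the kmodes object exposes its usual size, modes and withindiff components:

> fit$size        # number of observations in each of the 3 clusters
> fit$modes       # modal (most frequent) value of each attribute per cluster
> fit$withindiff  # within-cluster simple-matching distance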

441

92 Applied Predictive Modeling Techniques in R

Figure 611 K-Modes Clustering with k = 3 using housing

It is also possible to narrow down our focus. For example, let's investigate the pairwise relationship between Type and Sat. To do this we use the plot method with the jitter method, which adds a small amount of noise to the attributes. The result is shown in Figure 61.2:
> plot(jitter(as.matrix(x[c(1, 3)])), col = fit$cluster)

> points(fit$modes, col = 1:5, pch = 8, cex = 4)

442

TECHNIQUE 61 K-MODES CLUSTERING

Figure 612 K-Modes Clustering pairwise relationship between type andsat

443

Technique 62

Model-Based Clustering

Model based clustering identifies clusters using the Bayesian information criterion (BIC) for Expectation-Maximization initialized by hierarchical clustering for Gaussian mixture models. It is available in the mclust package using the function Mclust:

Mclust(x, G)

Key parameters include x, the data set of attributes for which you wish to find clusters, and G, the minimum and maximum expected number of clusters.

Step 1: Load Required Packages
We use the thyroid data frame (see page 385):
> require(mclust)
> data(thyroid)

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in x and remove the first column, which contains the Diagnosis values:
> set.seed(98765)
> x = thyroid[-1]

Step 3: Estimate and Assess the Model
Suppose we believe the number of clusters lies between 1 and 20. We can use the Mclust method to find the optimal number as follows:

> fit <- Mclust(as.matrix(x), G = 1:20)

NOTE

Notice that for multivariate data the Mclust method computes the optimum Bayesian information criterion for each of the following Gaussian mixture models:

1. EII = spherical, equal volume
2. VII = spherical, unequal volume
3. EEI = diagonal, equal volume and shape
4. VEI = diagonal, varying volume, equal shape
5. EVI = diagonal, equal volume, varying shape
6. VVI = diagonal, varying volume and shape
7. EEE = ellipsoidal, equal volume, shape and orientation
8. EVE = ellipsoidal, equal volume and orientation
9. VEE = ellipsoidal, equal shape and orientation
10. VVE = ellipsoidal, equal orientation
11. EEV = ellipsoidal, equal volume and equal shape
12. VEV = ellipsoidal, equal shape
13. EVV = ellipsoidal, equal volume
14. VVV = ellipsoidal, varying volume, shape and orientation

Let's take a look at the result:
> fit
'Mclust' model object:
 best model: diagonal varying volume and shape (VVI) with 3 components

The optimal number of clusters using all 14 Gaussian mixture models appears to be 3. Here is another way to access the optimum number of clusters:
> k <- dim(fit$z)[2]
> k
[1] 3
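For a fuller picture, here is a minimal sketch assuming mclust's standard summary method and classification component:

> summary(fit)                                  # model name, BIC and clustering table
> table(fit$classification, thyroid$Diagnosis)  # compare the clusters with the true diagnosis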

Since BIC values from 14 models are used for choosing the number of clusters, it can be instructive to plot the result by individual model. The plot method with what = "BIC" achieves this. The result is shown in Figure 62.1. Notice most models reach a peak around 3 clusters:
> plot(fit, what = "BIC")

Figure 621 Model-Based Clustering BIC by model for thyroid

446

TECHNIQUE 62 MODEL-BASED CLUSTERING

We can also visualize the fitted model by setting what = "uncertainty" (see Figure 62.2), what = "density" (see Figure 62.3) and what = "classification" (see Figure 62.4).

Figure 622 Model-Based Clustering plot with what = uncertaintyrdquo usingthyroid

447

92 Applied Predictive Modeling Techniques in R

Figure 623 Model-Based Clustering plot with what = densityrdquo usingthyroid

448

TECHNIQUE 62 MODEL-BASED CLUSTERING

Figure 624 Model-Based Clustering plot with what = classificationrdquousing thyroid

Finally, let's look at the uncertainty and classification pairwise plots using T3 and T4; see Figure 62.5:
> par(mfrow = c(1, 2))

> plot(fit, x[c(1, 2), c(1, 3)], what = "uncertainty")

> plot(fit, x[c(1, 2), c(1, 3)], what = "classification")

449

92 Applied Predictive Modeling Techniques in R

Figure 625 Model based uncertainty and classification pairwise plots usingT3 and T4

450

Technique 63

Clustering of Binary Variables

Clustering of binary variables is available in the cluster package using the function mona:

mona(x)

Key parameters include x, the data set of binary attributes for which you wish to find clusters.

Step 1: Load Required Packages
We use the locust data frame from the package bild:
> require(cluster)
> data(locust, package = "bild")

451

92 Applied Predictive Modeling Techniques in R

NOTE

The locust data frame contains data on the effect of hunger on the locomotory behavior of 24 locusts observed at 161 time points. It contains 3864 observations on the following attributes:

1. id: a numeric vector that identifies the number of the individual profile.

2. move: a numeric vector representing the response variable.

3. sex: a factor with levels 1 for male and 0 for female.

4. time: a numeric vector that identifies the number of the time points observed. The time vector considered was obtained dividing (1:161) by 120 (the number of observed periods in 1 hour).

5. feed: a factor with levels 0 (no) and 1 (yes).

Step 2: Prepare Data & Tweak Parameters
Because the locust data has a time dimension, our first task is to transform the data into a format that can be used by mona. First, let's create a variable x that will hold the binary attributes:
> n <- nrow(locust)
> x <- seq(1, 120)
> dim(x) <- c(24, 5)
> colnames(x) <- colnames(locust)

We create a binary data set where we assign 1 if a move took place by the locust within the first 20 time points, and 0 otherwise. We do likewise with feeding. We begin by assigning zero to the move and feed attributes (i.e. x[i, 2] <- 0 and x[i, 5] <- 0). Note that locust[count, 1], locust[count, 3] and locust[count, 4] contain id, sex and time, while locust[count, 2] and locust[count, 5] contain move and feed respectively:
count = 0
k = 1
for (i in seq_along(seq(1, 24))) {
  x[i, 2] <- 0
  x[i, 5] <- 0
  for (k in seq(1, 161)) {
    count = count + 1
    if (k == 1) {
      x[i, 1] <- locust[count, 1]
      x[i, 3] <- locust[count, 3]
      x[i, 4] <- locust[count, 4]
    }
    if (locust[count, 2] == 1 && k < 20) x[i, 2] <- locust[count, 2]
    if (locust[count, 5] == 1 && k < 20) x[i, 5] <- locust[count, 5]
    k <- k + 1
  }
  k = 1
}

Finally, we remove the id and time attributes:
> x <- x[,-1]
> x <- x[,-3]

As a check, the contents of x should look similar to this:
> head(x, 6)
     move sex feed
[1,]    0   2    2
[2,]    1   1    2
[3,]    0   2    2
[4,]    1   1    2
[5,]    0   2    2
[6,]    0   1    2

453

92 Applied Predictive Modeling Techniques in R

Step 3: Estimate and Assess the Model
Here is how to estimate the model and print out the associated bannerplot; see Figure 63.1:

> fit <- mona(x)
> plot(fit)

Figure 631 Clustering of Binary Variables bannerplot using locust

454

Technique 64

Affinity Propagation Clustering

Affinity propagation takes a given similarity matrix and simultaneously considers all data points as potential exemplars. Real-valued messages are then exchanged between data points until a high-quality set of exemplars and corresponding clusters emerges. It is available in the apcluster package using the function apcluster:

apcluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the Vehicle data frame from the package mlbench. See page 23 for additional details on this data set:
> require(apcluster)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the sample in the variable x and remove the vehicle type (Class) attribute:
> set.seed(98765)
> x <- Vehicle[-19]

455

92 Applied Predictive Modeling Techniques in R

Step 3: Estimate and Assess the Model
We fit the model as follows, where r refers to the number of columns in x:

> fit <- apcluster(negDistMat(r = 18), x)

The model determines the optimal number of clusters. To see the actual number, use:
> length(fit@clusters)
[1] 5

So for this data set the algorithm identifies five clusters. We can visualize the results using the plot method as shown in Figure 64.1. Note we only plot 15 attributes:
> plot(fit, x[, 1:15])

Finally, here is how to zoom in on a specific pairwise relationship. We choose Circ and ScVarMaxis, as shown in Figure 64.2:
> plot(fit, x[c(2, 11)])
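To see which vehicles were chosen as exemplars, here is a minimal sketch assuming the APResult object exposes its usual exemplars and clusters slots:

> fit@exemplars                  # indices of the five exemplar observations
> sapply(fit@clusters, length)   # size of each cluster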

456

TECHNIQUE 64 AFFINITY PROPAGATION CLUSTERING

Figure 641 Affinity Propagation pairwise cluster plot using Vehicle

457

92 Applied Predictive Modeling Techniques in R

Figure 642 Affinity Propagation pairwise cluster plot between Circ andScVarMaxis

458

Technique 65

Exemplar-Based AgglomerativeClustering

Exemplar-Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the bodyfat data frame from the package TH.data. See page 62 for additional details on this data set:
> require(apcluster)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
We store the sample data in the variable x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
We fit the model using a two-step approach. First, we use affinity propagation to determine the number of clusters via the apcluster method. For additional details on affinity propagation and apcluster see page 455.

459

92 Applied Predictive Modeling Techniques in R

> fit1 <- apcluster(negDistMat(r = 10), x)

Here are some of the details of the fitted model. It appears to have four clusters:
> fit1

APResult object

Number of samples    =  71
Number of iterations =  152
Input preference     =  -6.482591e+14
Sum of similarities  =  -4.886508e+14
Sum of preferences   =  -2.593037e+15
Net similarity       =  -3.081687e+15
Number of clusters   =  4

Next we create a hierarchy of the four clusters using exemplar-based agglomerative clustering via the aggExCluster method:
> fit <- aggExCluster(x = fit1)

We can plot the resultant dendrogram as shown in Figure 65.1:
> plot(fit, showSamples = FALSE)

460

TECHNIQUE 65 EXEMPLAR-BASED AGGLOMERATIVE

Figure 651 Exemplar-Based Agglomerative Clustering dendogram usingbodyfat

Finally, we zoom in on the pairwise cluster plots between DEXfat and waistcirc (see page 462). To do this we use a small while loop as follows:
> i = 2
> par(mfrow = c(2, 2))
> while (i <= 5) {
+   plot(fit, x[c(2, 3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
+        main = c("Number Clusters", i))
+   i = i + 1
+ }

461

92 Applied Predictive Modeling Techniques in R

Figure 652 Exemplar-Based Agglomerative Clustering relationship betweenDEXfat and waistcirc

462

Technique 66

Bagged Clustering

During bagged clustering, a partitioning cluster algorithm such as k-means is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using hierarchical agglomerative cluster analysis (see page 384). The approach is available in the e1071 package using the function bclust:

bclust(x, centers, base.centers, dist.method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, centers, the number of clusters, base.centers, the number of centers used in each repetition, and dist.method, the distance method used for hierarchical clustering.

Step 1: Load Required Packages
We use the thyroid data frame from the package mclust. See page 385 for additional details on this data set:
> require(e1071)
> data(thyroid, package = "mclust")

Step 2: Prepare Data & Tweak Parameters
We store the standardized sample data in the variable x:
> x <- thyroid[-1]
> x <- scale(x)

463

92 Applied Predictive Modeling Techniques in R

Step 3: Estimate and Assess the Model
We fit the model using the bclust method with 3 centers:

> fit <- bclust(x, centers = 3, base.centers = 5, dist.method = "manhattan")
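A quick way to see how the bagged solution lines up with the known diagnosis, sketched under the assumption that the bclust object stores final memberships in a cluster component:

> table(fit$cluster, thyroid$Diagnosis)   # bagged clusters versus the true classes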

We can use the plot method to visualize the dendrogram. Figure 66.1 shows the result:
> plot(fit)

464

TECHNIQUE 66 BAGGED CLUSTERING

Figure 661 Bagged Clustering dendogram (top) and scree style plot (bot-tom) Gray line is slope of black line

We can also view a box plot of fit as follows; see Figure 66.2:
> boxplot(fit)

465

92 Applied Predictive Modeling Techniques in R

Figure 662 Bagged Clustering boxplots

We can also view boxplots by attribute, as shown in Figure 66.3:
> boxplot(fit, bycluster = FALSE)

466

TECHNIQUE 66 BAGGED CLUSTERING

Figure 663 Bagged Clustering boxplots by attribute

467

92 Applied Predictive Modeling Techniques in R

Notes89Musmeci Nicoloacute Tomaso Aste and Tiziana Di Matteo Relation between financial

market structure and the real economy comparison between clustering methods (2015)e0116201

90The Industry Classification Benchmark (ICB) is a definitive system categorizing over70000 companies and 75000 securities worldwide enabling the comparison of compa-nies across four levels of classification and national boundaries For more information seehttpwwwicbenchmarkcom

91Since correlations between stocks and stock market sectors change over time the timeperiod over which the analysis is carried out may also impact the number of clustersrecovered by a specific technique

92Takahashi Nobuaki Makio Kashino and Naoyuki Hironaka Structure of rat ultra-sonic vocalizations and its relevance to behavior PloS one 511 (2010) e14115-e14115

93Wright JM Gourdon JC Clarke PB (2010) Identification of multiple call categorieswithin the rich repertoire of adult rat 50-kHz ultrasonic vocalizations effects of am-phetamine and social context Psychopharmacology 211 1ndash13

94Kim Tae-Bum et al Identification of asthma clusters in two independent Koreanadult asthma cohorts The European respiratory journal 416 (2013) 1308-1314

95FEV1 is the volume exhaled during the first second of a forced expiratory activity96In recent years there has been growing interest in using perfusion imaging in the initial

diagnosis and management of many conditions Magnetic resonance imaging (MRI) uses apowerful magnetic field radio frequency pulses and a computer to produce high resolutionimages of organs soft tissues and bone

97See for example

1 Karonen JO Liu Y Vanninen RL et al Combined perfusion- and diffusion-weighted MR imaging in acute ischemic stroke during the 1st week a longitudinalstudy Radiology 2000217886ndash894

2 Rother J Jonetz-Mentzel L Fiala A et al Hemodynamic assessment of acute strokeusing dynamic single-slice computed tomographic perfusion imaging Arch Neurol2000571161ndash1166

3 Nabavi DG Cenic A Henderson S Gelb AW Lee TY Perfusion mapping usingcomputed tomography allows accurate prediction of cerebral infarction in experi-mental brain ischemia Stroke 200132175ndash183

4 Eastwood JD Lev MH Azhari T et al CT perfusion scanning with deconvolutionanalysis pilot study in patients with acute MCA stroke Radiology

5 Kealey SM Loving VA Delong DM Eastwood JD User-defined vascular inputfunction curves influence on mean perfusion parameter values and signal- to-noiseratio Radiology 2004231587ndash593

98The computation of these parametric maps is often performed as a semi automated pro-cess The observer selects an appropriate artery to represent the arterial input function anda vein to represent the venous function From these arterial and venous timendashattenuationcurves the observer then determines the pre and post enhancement cutoff values for thecalculation of the perfusion parameters During data acquisition it is essential to selectslice locations in the brain that contain a major intracranial artery to represent the AIF


99Yin Jiandong et al Comparison of K-Means and fuzzy c-Means algorithm perfor-mance for automated determination of the arterial input function PloS one 92 (2014)e85884

100Murase K Kikuchi K Miki H Shimizu T Ikezoe J (2001) Determination of arterialinput function using fuzzy clustering for quantification of cerebral blood flow with dynamicsusceptibility contrast-enhanced MR imaging J Magn Reson Imaging 13 797ndash806

101See Mardia et al (1979) Multivariate Analysis Academic Press102For example see

1 Rudman P (1967) The causes of natural durability in timber Pt XXI The an-titermitic activity of some fatty acids esters and alcohols Holzforschung 21(1) 2412

2 Rudman P Gay F (1963) Causes of natural durability in timber X Deterrentproperties of some three-ringed carboxylic and heterocyclic substances to the subter-ranean termite Nasutitermes exitiosus CSIRO Div Forest Prod MelbourneHolzforschung 17 2125 13

3 Lukmandaru G Takahashi K (2008) Variation in the natural termite resistance ofteak (Tectona grandis Linn fil) wood as a function of tree age Annals of ForestScience 65 708ndash708

103See for example

1 Lenz M (1994) Nourishment and evolution of insect societies Westview Press Boul-der and Oxford and IBH Publ New Delhi chapter Food resources colony growthand caste development in wood feeding termites 159ndash209 23

2 Evans TA Lai JCS Toledano E McDowall L Rakotonarivo S et al (2005) Termitesassess wood size by using vibration signals Proceedings of the National Academyof Science 102(10) 3732ndash3737 24

3 Evans TA Inta R Lai JCS Prueger S Foo NW et al (2009) Termites eaves-drop to avoid competitors Proceedings of the Royal Society B Biological Sciences276(1675) 4035ndash4041

104Oberst Sebastian Theodore A Evans and Joseph CS Lai Novel Method for PairingWood Samples in Choice Tests (2014) e88835

105Todeschini Roberto and Viviana Consonni Handbook of molecular descriptors Vol11 John Wiley amp Sons 2008

106Dehmer Matthias Frank Emmert-Streib and Shailesh Tripathi Large-scale evalua-tion of molecular descriptors by means of clustering (2013) e83956

107For more details see

1 Dehmer M Varmuza K Borgert S Emmert-Streib F (2009) On entropy-basedmolecular descriptors Statistical analysis of real and synthetic chemical structuresJournal of Chemical Information and Modeling 49 1655ndash1663 21

2 Dehmer M Grabner M Varmuza K (2012) Information indices with high discrimi-native power for graphs PLoS ONE 7 e31214

108For further details see Gower J C (1971) A general coefficient of similarity and someof its properties Biometrics 27 857ndash874

109Further details at ftpftpicsuciedupubmachine-learning-databasesthyroid-diseasenew-thyroidnames


110For further details see - Murtagh Fionn and Pierre Legendre Wardrsquos hierarchicalagglomerative clustering method Which algorithms implement wardrsquos criterion Journalof Classification 313 (2014) 274-295

111Kaufman Leonard and Peter J Rousseeuw Finding groups in data an introductionto cluster analysis Vol 344 John Wiley amp Sons 2009

112Gower J C (1971) A general coefficient of similarity and some of its propertiesBiometrics 27 857ndash874

113For further details see Struyf Anja Mia Hubert and Peter J Rousseeuw Integratingrobust clustering techniques in S-PLUS Computational Statistics amp Data Analysis 261(1997) 17-37

114For further details see Madsen M (1976) Statistical analysis of multiple contingencytables Two examples Scand J Statist 3 97ndash106


Part VII

Boosting


The Basic Idea

Boosting is a powerful supervised classification learning concept. It combines the performance of many "weak" classifiers to produce a powerful committee. A weak classifier is only required to be better than chance. It can therefore be very simple and computationally inexpensive. The basic idea is to iteratively apply simple classifiers and to combine their solutions to obtain a better prediction result. If classifiers misclassify some data, train another copy of the classifier mainly on this misclassified part, with the hope that it will develop a better solution. Thus the algorithm increasingly focuses on new strategies for classifying difficult observations.

Here is how it works in a nutshell

1. A boosting algorithm manipulates the underlying training data by iteratively re-weighting the observations, so that at every iteration the classifier finds a new solution from the data.

2. Higher accuracy is achieved by increasing the importance of "difficult" observations, so that observations that were misclassified receive higher weights. This forces the algorithm to focus on those observations that are increasingly difficult to classify (a small numeric sketch of this re-weighting step follows below).

3. In the final step all previous results of the classifier are combined into a prediction committee, where the weights of better performing solutions are increased via an iteration-specific coefficient.

4. The resulting weighted majority vote selects the class most often chosen by the classifier while taking the error rate in each iteration into account.
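To make the re-weighting step concrete, here is a minimal numeric sketch in R of a single boosting iteration. It is an illustration only, not the book's code; the vectors y and h are made-up values standing in for the true classes (coded -1/+1) and the current weak classifier's predictions.

> y <- c(-1, 1, 1, -1, 1, -1)          # true classes (made-up example)
> h <- c(-1, 1, -1, -1, 1, 1)          # weak classifier output; two mistakes
> w <- rep(1/length(y), length(y))     # equal weights at the first iteration
> err <- sum(w * (h != y))             # weighted error = 0.333
> alpha <- 0.5 * log((1 - err)/err)    # iteration-specific coefficient
> w <- w * exp(-alpha * y * h)         # misclassified observations gain weight
> w <- w/sum(w)                        # re-normalise the weights
> round(w, 3)
[1] 0.125 0.125 0.250 0.125 0.125 0.250

The two misclassified observations (positions 3 and 6) now carry twice the weight of the correctly classified ones, so the next weak classifier concentrates on them.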

The Power of the Boost
Boosting is one of the best techniques in the data scientist's toolkit. Why? Because it often yields the best predictive models and can often be relied upon to produce satisfactory results. Academic studies and applied studies have found similar findings. Here we cite just two.


• Bauer and Kohavi115 performed an extensive comparison of boosting with several other competitors on 14 data-sets. They found boosting outperformed all other algorithms. They concluded "For learning tasks where comprehensibility is not crucial, voting methods are extremely useful, and we expect to see them used significantly more than they are today."

• Friedman, Hastie and Tibshirani116, using eight data-sets, compare a range of boosting techniques with the very popular classification and regression tree. They find all the boosting methods outperform it.

NOTE

During boosting the target function is optimized with some implicit penalization whose impact varies with the number of boosting iterations. The lack of an explicit penalty term in the target function is the main difference between boosting and other popular penalization methods such as the Lasso.117


Practical Applications

NOTE

The first boosting technique was developed by Robert Schapire118 working out of the MIT Laboratory for Computer Science. His research article "The Strength of Weak Learnability" showed that a weak base classifier can always improve its performance by training two additional classifiers on filtered versions of the classification data stream. The author observes "A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well." He was right; boosting is now used in a wide range of practical applications.

Reverberation Suppression
Reverberation suppression is a critical problem in sonar communications. This is because as an acoustic signal is radiated, degradation occurs due to reflection from the surface, bottom and via the volume of water. Cheepurupalli et al.119 study the propagation of sound in water.

One popular solution is to use the empirical mode decomposition (EMD) algorithm120 as a filtering technique. A noise corrupted signal is applied to EMD and intrinsic mode functions are generated. The key is to separate the signal from the noise. It turns out that noisy intrinsic mode functions are high frequency component signals whilst signal-led intrinsic mode functions are low frequency component signals. The selection of the appropriate intrinsic mode functions which are used for signal reconstruction is often done manually.121


The researchers use Ada Boost to automatically classify "noise" & "signal" intrinsic mode functions over the signal to noise ratio range from −10 dB to 10 dB.

The results were very encouraging as they found that combining Ada Boost with EMD increases the likelihood of correct detection. The researchers conclude "that the reconstruction of the chirp signal even at low input SNR [signal to noise] conditions is achieved with the use of Ada Boost based EMD as a de-noising technique."

Cardiotocography
Fetal state (normal, suspect, pathologic) is assessed by Karabulut and Ibrikci122 using Ada Boost combined with six individual machine learning algorithms (Naive Bayes, Radial Basis Function Network, Bayesian Network, Support Vector Machine, Neural Network and C4.5 Decision Tree).

The researchers use the publicly available cardiotocography data set123 which includes a total of 2126 samples, of which the fetal state is normal in 1655 cases, suspicious in 295 and pathologic in 176. The estimated Ada Boost confusion matrix is shown in Table 23. Note that the error rate from this table is around 5%.

                          Predicted values
                    Normal   Suspicious   Pathologic
Actual   Normal       1622           26            7
values   Suspicious     55          236            4
         Pathologic      7            7          162

Table 23: Confusion matrix of Karabulut and Ibrikci

Stock Market Trading
Creamer and Freund124 develop a multi-stock automated trading system that relies in part on model based boosting. Their trading system consists of a logitboost algorithm, an expert weighting algorithm and a risk management overlay.

Their sample consisted of 100 stocks chosen at random from the S&P 500 index, using daily data over a four year period. The researchers used 540 trading days for the training set, 100 trading days for the validation set and 411 trading days for the test set.

Four variants of their trading system are compared to a buy and hold strategy. This is a common investment strategy and involves buying a portfolio of stocks and holding them without selling for the duration of the investment period. Transaction costs are assumed to vary from $0 to $0.003. Creamer and Freund report all four variants of their trading system outperformed the buy and hold strategy.

Vehicle Logo Recognition
Sam et al.125 consider the problem of automatic vehicle logo recognition. The researchers select the modest adaboost algorithm. A total of 500 vehicle images were used in the training procedure. The detection of the vehicle logo was carried out by sliding a sub-window across the image at multiple scales and locations. A total of 200 images were used in the test set, with 184 images recognized successfully, see Table 24. This implies a misclassification error rate of around 9%.

Manufacturer    Correct   Mistaken
Audi                 20          0
BMW                  16          4
Honda                19          1
KIA                  17          3
Mazda                18          2
Mitsubishi           20          0
Nissan               17          3
Suzuki               18          2
Toyota               19          1
Volkswagen           20          0

Total               184         16

Table 24: Logo recognition rate reported by Sam et al.

Basketball Player Detection
Markoski et al.126 investigate player detection during basketball games using the gentle adaboost algorithm. A total of 6000 positive examples that contain the player's entire body and 6000 positive examples of the player's upper body only were used to train the algorithm.

The researchers observe that for images that contain the player's whole body the algorithm was unable to reduce the level of false positives below 50%. In other words, flipping a coin would have been more accurate. However, a set of testing images using the player's upper body obtained an accuracy of 70.5%. The researchers observe the application of the gentle adaboost algorithm for detecting the player's upper body results in a relatively large number of false positive observations.

Magnetoencephalography
Magnetoencephalography (MEG) is used to study how stimulus features are processed in the human brain. Takiguchi et al.127 use an Ada Boost algorithm to find the sensor area contributing to the accurate discrimination of vowels.

Four volunteers with normal hearing were recruited for the experiment. Two distinct Japanese sounds were delivered to the subject's right ear. On hearing the sound, the volunteer was asked to press a reaction key. MEG amplitudes were measured from 61 pairs of sensors.128

The Ada Boost algorithm was applied to every latency range, with the classification decision made at each time instance. The researchers observe that the Ada Boost classification accuracy first increased as a function of time, reaching a maximum value of 91.0% in the latency range between 50 and 150 ms. These results outperformed their pre-stimulus baseline.


Binary Boosting

NOTE

A weak classifier h(x) is slightly better than random chance (50%) at guessing which class an object belongs in. A strong classifier has a high probability (>95%) of choosing correctly. Decision trees (see Part I) are often used as the basis for weak classifiers.

Classical binary boosting is founded on the discrete Ada Boost algorithm, in which a sequentially generated weighted set of weak base classifiers are combined to form an overall strong classifier. Ada Boost was the first adaptive boosting algorithm. It automatically adjusts its parameters to the data based on actual performance at each iteration.

How Ada Boost Works
Given a set of training feature vectors x_i (i = 1, 2, ..., N) and a target which represents the binary class y_i ∈ {−1, +1}, the algorithm attempts to find the optimal classification by making the individual error ε_m at iteration m as small as possible, given an iteration-specific distribution of weights on the features.

Incorrectly classified observations receive more weight in the next iteration. Correctly classified observations receive less weight in the next iteration. The weights at iteration m are calculated using the iteration-specific learning coefficient α_m multiplied by a classification function η(h_m(x)).

As the algorithm iterates it focuses more weight on misclassified objects and attempts to correctly classify them in the next iteration. In this way classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight. The procedure is repeated until a predefined performance requirement is satisfied.
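The following is a minimal sketch of this procedure written as a plain R loop (our own illustration, not the ada package's implementation). The function name ada_sketch and its arguments are hypothetical; rpart stumps stand in for the weak classifiers, y is assumed to be coded -1/+1, and no guard is included for a zero error rate.

library(rpart)

ada_sketch <- function(formula, data, y, M = 25) {
  environment(formula) <- environment()  # let rpart find the local weights w
  N <- nrow(data)
  w <- rep(1/N, N)                       # equal observation weights to start
  alpha <- numeric(M)
  learners <- vector("list", M)
  for (m in 1:M) {
    fit <- rpart(formula, data = data, weights = w,
                 control = rpart.control(maxdepth = 1))  # a weak stump
    h <- ifelse(predict(fit, data) > 0, 1, -1)           # stump's class vote
    err <- sum(w * (h != y)) / sum(w)                    # weighted error
    alpha[m] <- 0.5 * log((1 - err)/err)                 # classifier weight
    w <- w * exp(-alpha[m] * y * h)                      # up-weight mistakes
    w <- w / sum(w)
    learners[[m]] <- fit
  }
  list(learners = learners, alpha = alpha)
}

The final committee classifies a new observation by the sign of the alpha-weighted sum of the individual stump votes, which is exactly the weighted majority vote described above.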


NOTE

Differences in the nature of classical boosting algorithms are often driven by the functional form of α_m and η(h_m(x)). For example:

1. Ada Boost.M1: α_m = 0.5 log((1 − ε_m)/ε_m) with η(x) = sign(x).

2. Real Ada Boost and Real L2 set η(p) = log(p/(1 − p)), p ∈ [0, 1].

3. Gentle Ada Boost and Gentle L2 Boost: η(x) = x.

4. Discrete L2, Real L2 and Gentle L2 use a logistic loss function as opposed to the exponential loss function used by Real Ada Boost, Ada Boost.M1 and Gentle Ada Boost.
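The differences are easier to see written out in code. The short sketch below (our own illustration, not code from the ada package) simply encodes the coefficient and link functions listed above as R functions.

alpha_discrete <- function(eps) 0.5 * log((1 - eps)/eps)   # Ada Boost.M1 coefficient

eta_discrete <- function(x) sign(x)            # Ada Boost.M1
eta_real     <- function(p) log(p/(1 - p))     # Real Ada Boost / Real L2, p in [0, 1]
eta_gentle   <- function(x) x                  # Gentle Ada Boost / Gentle L2

alpha_discrete(0.10)   # a weak learner with 10% weighted error gets weight ~1.10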


Technique 67

Ada BoostM1

The Ada Boost.M1 algorithm (also known as Discrete Ada Boost) uses discrete boosting with an exponential loss function.129 In this algorithm η(x) = sign(x) and α_m = 0.5 log((1 − ε_m)/ε_m). It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "discrete", ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

NOTE

Ada Boost is often used with simple decision trees as weak base classifiers. This can result in significantly improved performance over a single decision tree.130 The package ada creates a classification model as an ensemble of rpart trees. It therefore uses the rpart library as its engine (see page 4 and page 272).

Step 1: Load Required Packages
We build our Ada Boost.M1 model using Sonar, a data frame in the mlbench package.

> library(ada)
> library(mlbench)
> data(Sonar)


NOTE

Sonar contains 208 observations on sonar returns collected from a metal cylinder and a cylindrical shaped rock positioned on a sandy ocean floor.131 Returns were collected at a range of 10 meters and obtained from the cylinder at aspect angles spanning 90° and from the rock at aspect angles spanning 180°. In total 61 variables were collected, all numerical with the exception of the sonar return (111 cylinder returns and 97 rock returns) which are nominal. The label associated with each record is contained in the letter R if the object is a rock and M if it is metal.

Step 2: Prepare Data & Tweak Parameters
To view the classification for the 95th to 105th record type:

> Sonar$Class[95:105]
 [1] R R R M M M M M M M M
Levels: M R

The 95th to 97th observations are labeled R for rock, whilst the 98th to 105th observations are labeled M for metal. The ada package uses the rpart function to generate decision trees, so we need to supply an rpart.control object to the model.

> default <- rpart.control(cp = -1, maxdepth = 2,
                           minsplit = 0)

PRACTITIONER TIP

The maxdepth parameter is raised to the power of 2. In the code that follows we set maxdepth = 2, which is equivalent to 2^2 = 4.

We use 157 of the 208 observations to train the model.

> set.seed(107)
> n = nrow(Sonar)
> ind.train <- sample(1:n, 157, FALSE)

(The training and testing data frames train and test are created below, once the combined data frame data has been built.)

Since Sonar$Class is non-numeric we change it to a numeric scale. This will be useful later when we plot the training and test data.

> z <- Sonar$Class
> z <- as.numeric(z)

Next we create a new data frame combining the numeric variable z with the original data in Sonar. This is followed by a little tidying up by removing Class (which contained the "M" & "R" labels) from the data frame, and by building the training and testing samples:

> data <- (cbind(z, Sonar))
> data$Class <- NULL
> train <- data[ind.train, ]
> test <- data[-ind.train, ]

Step 3: Estimate Model
Now we are ready to run the Ada Boost.M1 model on the training sample. We choose to perform 50 boosting iterations by setting the parameter iter = 50:

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "discrete",
                      control = default)

Take a look at the output:

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1
        44         44

Notice the output gives the training error, confusion matrix and three estimates of the number of iterations. In this example you could use the Out-Of-Bag estimate of 45 iterations, the training error estimate or the kappa error estimate of 44 iterations. As you can see, the model achieved low error rates. The training error is 3.2%. The Out-Of-Bag error rate is around 4.5%.

Step 4: Assess Model Performance
We use the addtest function to evaluate the testing set without having to refit the model.

> set.seed(107)
> output_train <- addtest(x = output_train, test.x = test[,-1], test.y = test[,1])

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2
        44         44         32         32

Notice the estimate of iterations for the training error and kappa declined from 44 (for training) to 32 for the test sample.

It can be instructive to plot both training and test error results by iteration on one chart:

> plot(output_train, test = TRUE)

The resulting plot, shown in Figure 67.1, shows that the training error steadily decreases across iterations. This shows that boosting can learn the features in the data set. The testing error also declines somewhat across iterations, although not as dramatically as for the training error.

It is sometimes helpful to examine the performance at a specific iteration. Here is what happened at iteration 25 for train:

> summary(output_train, n.iter = 25)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 25

Training Results

Accuracy: 0.955 Kappa: 0.911

Testing Results

Accuracy: 0.824 Kappa: 0.654

The accuracy of the training and testing samples is above 0.80. To assess the overall performance enter:

> summary(output_train)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Training Results

Accuracy: 0.968 Kappa: 0.936

Testing Results

Accuracy: 0.804 Kappa: 0.613

Notice that accuracy is in excess of 80% for both test and training datasets. Kappa declines from 94% in training to 61% for the test data set.

PRACTITIONER TIP

In very many circumstances you may find your training set unbalanced, in the sense that one class has very many more observations than the other. Ada Boost will tend to focus on learning the larger set, with resultant low errors. Of course, the low error is related to the focus on the larger class. The Kappa coefficient132 provides an alternative measure of absolute classification error which adjusts for class imbalances. As with the correlation coefficient, higher values indicate a stronger fit.
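As a minimal illustration (not from the book), Cohen's kappa can be computed directly from a confusion matrix; here we use the training confusion matrix reported above for the Sonar model.

> cm <- matrix(c(80, 3, 2, 72), 2, 2)                 # rows = true, cols = predicted
> po <- sum(diag(cm)) / sum(cm)                       # observed agreement
> pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2    # agreement expected by chance
> round((po - pe) / (1 - pe), 3)                      # Cohen's kappa
[1] 0.936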

In Figure 67.2 the relative importance of each variable is plotted. This can be obtained by typing:

> varplot(output_train)

Figure 67.2: Variable importance plot for Ada Boost using Sonar data set

The largest five individual scores can be printed out by entering:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")
> round(scores[1:5], 3)
  V10   V17   V48    V9   V13
0.078 0.078 0.076 0.072 0.071

Step 5: Make Predictions
Next we predict using the fitted model and the test sample. This can be easily achieved by:

> pred <- predict(output_train, test[,-1], type = "both")

A summary of the class predictions and the associated plot provides performance information:

> summary(pred$class)
 1  2
23 28

> plot(pred$class)

The model classifies 23 observations into class 1 (recall "M" if the object is metal) and 28 observations into class 2 ("R" if the object is a rock). This is reflected in Figure 67.3.

A nice feature of this model is the ability to see the probabilities of an observation and the class assignment. For example, the first observation has associated probabilities and class assignment:

> pred$probs[[1, 1]]
[1] 0.5709708
> pred$probs[[1, 2]]
[1] 0.4290292
> pred$class[1]
[1] 1
Levels: 1 2

The probability of the observation belonging to class 1 is 57% and to class 2 around 43%, and therefore the predicted class is class 1. The second and third observations can be assessed in a similar fashion:

> pred$probs[[2, 1]]
[1] 0.07414183
> pred$probs[[2, 2]]
[1] 0.9258582
> pred$class[2]
[1] 2
Levels: 1 2

> pred$probs[[3, 1]]
[1] 0.7416495
> pred$probs[[3, 2]]
[1] 0.2583505
> pred$class[3]
[1] 1
Levels: 1 2

Notice that the second observation is assigned to class 2 with an associated probability of approximately 93%, and the third observation is assigned to class 1 with an associated probability of approximately 74%.

Figure 67.1: Error by iteration for train and test

Figure 67.3: Model predictions for class 1 ("M") and class 2 ("R")

Technique 68

Real Ada Boost

The Real Ada Boost algorithm uses real boosting with an exponential loss function and η(p) = log(p/(1 − p)), p ∈ [0, 1]. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "real", bag.frac = 1, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. To perform Stochastic Gradient boosting you set the bag.frac argument to less than 1 (the default is bag.frac = 0.5). Pure ε-boosting only happens if bag.frac = 1.

Steps 1-2 are outlined beginning on page 481

Step 3 & 4: Estimate Model & Assess Performance
We choose to perform 50 boosting iterations by setting the parameter iter = 50.

NOTE

The main control to avoid over fitting in boosting algorithms is the stopping iteration. A very high number of iterations may favor over-fitting. However, stopping the algorithm too early might lead to poorer prediction on new data. In practice over fitting appears to be less of a risk than under-fitting, and there is considerable evidence that Ada Boost is quite resistant to overfitting.133


gt setseed (107)gt output_train lt- ada(z ~data = train iter = 50loss = e type = realbagfrac=1control = default)

gt output_trainCallada(z ~ data = train iter = 50 loss = e type

= realbagfrac = 1 control = default)

Loss exponential Method real Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 82 02 0 75

Train Error 0

Out -Of-Bag Error 0 iteration= 6

Additional Estimates of number of iterations

trainerr1 trainkap127 27

In this case the model perfectly fits the training data and the training error is zero.

PRACTITIONER TIP

Although the out-of-bag error rate is also zero, it is of little value here because there are no subsamples in pure ε-boosting (we set bag.frac = 1).

We next re-fit the model using Stochastic Gradient boosting with bag.frac at its default value. The overall error rate remains a modest 1.3%, with an Out-Of-Bag Error of 4.5%.


gt setseed (107)gt output_SB lt- ada(z ~ data = train iter = 50 loss = e type = realbagfrac =05control = default)

gt output_SBCallada(z ~ data = train iter = 50 loss = e type

= realbagfrac = 05 control = default)

Loss exponential Method real Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 82 02 2 73

Train Error 0013

Out -Of-Bag Error 0045 iteration= 46

Additional Estimates of number of iterations

trainerr1 trainkap149 49

To assess variable importance we enter:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")
> round(scores[1:5], 3)
  V46   V47   V48   V35   V20
0.087 0.081 0.078 0.076 0.072

The top five variables differ somewhat from those obtained with Ada Boost.M1 (see page 487).


Step 5: Make Predictions
Now we fit the data to the test data set, make predictions and compare the results to the actual values observed in the sample using the commands table(pred) and table(test$z) (recall test$z contains the class values):

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "real",
                     control = default)
> pred <- predict(output_test, newdata = test)
> table(pred)
pred
 1  2
29 22

> table(test$z)

 1  2
29 22

The model fits the observations perfectly


Technique 69

Gentle Ada Boost

The Gentle Ada Boost algorithm uses gentle boosting with an exponential loss function and η(x) = x. It can be run using the package ada with the ada function:

ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle", nu = 0.01, control = default, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; data, the data set of attributes with which you wish to train the model; and nu, the shrinkage parameter for boosting. Steps 1-2 are outlined beginning on page 481.

NOTE

The idea behind the shrinkage parameter nu is to slow down learning and reduce the likelihood of over-fitting. Smaller learning rates (such as nu < 0.1) often produce dramatic improvements in a model's generalization ability over gradient boosting without shrinking (nu = 1).134 However, small values of nu increase computational time by increasing the number of iterations required to learn.

Step 3: Estimate Model
We choose to perform 50 boosting iterations by setting the parameter iter = 50:

> set.seed(107)


gt output_train lt- ada(z ~ data = train iter = 50 loss = etype = gentlenu=001 control = default)

gt output_trainCallada(z ~ data = train iter = 50 loss = e type

= gentlenu = 001 control = default)

Loss exponential Method gentle Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 72 82 13 64

Train Error 0134

Out -Of-Bag Error 0121 iteration= 28

Additional Estimates of number of iterations

trainerr1 trainkap15 23

The training error is around 13 with an Out-Of-Bag Error of 12 Letrsquossee if we can do better by using the parameter bagshift=TRUE to estimatea ensemble shifted towards bagginggt output_baglt- ada(z ~ data = train iter = 50

loss = e type = gentle nu=001 bagshift=TRUE control = default)

gt output_bagCallada(z ~ data = train iter = 50 loss = e type

= gentlenu = 001 bagshift = TRUE control = default)


Loss exponential Method gentle Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 70 102 20 57

Train Error 0191

Out -Of-Bag Error 0178 iteration= 17

Additional Estimates of number of iterations

trainerr1 trainkap19 9

The training error has risen to 19%. We continue our analysis using both models.

Step 4 Assess Model PerformanceWe fit the test data to both models First out boosted modelgt setseed (107)gt output_test lt- ada(z ~ data = test iter = 50

loss = e type = gentle nu=001 control =default)

gt output_testCallada(z ~ data = test iter = 50 loss = e type

= gentlenu = 001 control = default)

Loss exponential Method gentle Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 31 02 3 17


Train Error 0059

Out -Of-Bag Error 0059 iteration= 10

Additional Estimates of number of iterations

trainerr1 trainkap18 8

The training error rate is certainly lower than for the test set at 59 andthe number of iterations for kappa is only 8 The ensemble shifted towardsbagging results are as followsgt output_test_bag lt- ada(z ~ data = train iter =

50 loss = e type = gentle nu=001 bagshift=TRUE control = default)

gt output_test_bagCallada(z ~ data = train iter = 50 loss = e type

= gentlenu = 001 bagshift = TRUE control = default)

Loss exponential Method gentle Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value 1 21 38 422 11 66

Train Error 0338

Out -Of-Bag Error 0166 iteration= 44

Additional Estimates of number of iterations

trainerr1 trainkap135 46

The training error is higher at 34%. The Out-Of-Bag Error remains stubbornly in the 17% range.


Step 5 Make PredictionsWe use the test sample to make predictions for both models and compare theresults to the known valuesgt predlt-predict(output_test newdata=test)gt table(pred)pred1 2

34 17

gt predlt-predict(output_test_bag newdata=test)gt table(pred)pred1 2

31 20

gt table (test$z)

1 231 20

Note that test$z contains the actual values. It turns out that the results for both models are within an acceptable range. The out of bag model predicts the classes perfectly.


Technique 70

Discrete L2 Boost

Discrete L2 Boost uses discrete boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "discrete", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

Step 1: Load Required Packages
We will build the model using the soldat data frame contained in the ada package. The objective is to use the Discrete L2 Boost model to predict the relationship between the structural descriptors (aka features or attributes) and solubility / insolubility.

> library(ada)
> data(soldat)

NOTE

The soldat data frame consists of 5631 compounds tested to assess their ability to dissolve in a water/solvent mixture. Compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous noisy structural descriptors were computed. Notice that one of the descriptors contains 787 missing values.


Step 2 Prepare Data amp Tweak ParametersWe partition the data into a training set containing 60 of the observationsa test set containing 30 of the observations and a validation set containing10 of the observationsgt nlt-nrow(soldat)gt setseed (103)gt random_sample lt-(1n)gt train_sample lt-ceiling(n06)gt test_sample lt-ceiling(n03)gt valid_sample lt-ceiling(n01)gt train lt-soldat[random_sample [1 train_sample]]gt testlt-soldat[random_sample[(train_sample +1)(train

_sample+test_sample)]]gt valid lt-soldat[random_sample[(test_sample+train_

sample +1)n]]

Wow that is a lot of typing Better check we have the right number ofobservations (5631)gt nrow(train)+nrow(test)+nrow(valid)[1] 5631

Now we set the rpartcontrol The maxdepth controls the maximumdepth of any node in the final tree with the root node counted as depth 0It is raised to power of 2 We set the max depth of 24= 16gt default lt-rpartcontrol(cp=-1maxdepth=4 maxcompete

=5minsplit =0)

Step 3 Estimate ModelNow we are ready to run the model on the training sample We choose 50boosting iterations by setting the parameter iter = 50

gt setseed (127)gt output_test lt- ada(y ~ data = train iter = 50

loss = l type = discretecontrol = default)gt output_testCallada(y ~ data = train iter = 50 loss = l type

= discretecontrol = default)


Loss logistic Method discrete Iteration 50

Final Confusion Matrix for DataFinal Prediction

True value -1 1-1 1918 1891 461 811

Train Error 0192

Out -Of-Bag Error 0207 iteration= 50

Additional Estimates of number of iterations

trainerr1 trainkap149 49

Notice the output gives the training error, confusion matrix and three estimates of the number of iterations. The training error estimate is 19% and the Out-Of-Bag Error is 20.7%. Of course, you could increase the number of iterations to attempt to push down both error rates. However, for illustrative purposes we will stick with this version of the model.

Step 4 Assess Model Performance

PRACTITIONER TIP

Notice that the actual binary classes (coded -1 and +1) are in column 73 in the test, train and validation datasets. Therefore we add the arguments test.x = valid[,-73], test.y = valid[,73] to the addtest function.

We use the addtest function to evaluate the testing set without needing torefit the model

gt output_testlt-addtest(output_test testx=test[-73]testy=test [73])

gt summary(output_test)


Callada(y ~ data = train iter = 50 loss = l type

= discretecontrol = default)

Loss logistic Method discrete Iteration 50

Training Results

Accuracy 0808 Kappa 0572

Testing Results

Accuracy 0759 Kappa 0467

We see a decline in both the model accuracy and kappa from the training to the test set. Although an accuracy of 75.9% may be sufficient, we would like to see a slightly higher kappa.

Next we plot the training error and kappa by iteration for training andtest see Figure 701gt plot(output_test TRUE TRUE)

Figure 70.1: Error and kappa by iteration for training and test sets

The error rates of both the test and training set appear to decline as the number of iterations increases. Kappa appears to rise somewhat as the iterations increase for the training set, but it quickly flattens out for the testing data set.

We next look at the attribute influence scores and plot the 10 most influ-ential variables see Figure 702gt scores lt-varplot(output_test plotit=FALSE type=

scores)gt barplot(scores [110])

Figure 70.2: Ten most influential variables

For this example there is not a lot of variation between the top ten influ-ential variables

Step 5 Make PredictionsNext we re-estimate using the validation sample to predict the classes andcompare to the actual observed classesgt setseed (127)gt output_valid lt- ada(y ~ data = valid iter = 50

loss = l type = discretecontrol = default)


gt predlt-predict(output_valid newdata=valid)

gt table(pred)pred-1 1

378 184gt table(valid$y)

-1 1356 206

The model predicts 378 in "class -1" and 184 in "class 1", whilst we actually observe 356 and 206 in "class -1" and "class 1" respectively. A summary of the class probability based predictions is obtained using:

> round(pred, 3)

[1] [2][1] 0010 0990[2] 0062 0938[3] 0794 0206[4] 0882 0118

[559] 0085 0915[560] 0799 0201[561] 0720 0280[562] 0996 0004


Technique 71

Real L2 Boost

The Real L2 Boost uses real boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "real", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3 Estimate ModelWe choose 50 boosting iterations by setting the parameter iter = 50gtsetseed (127)gt output_test lt- ada(y ~ data = train iter = 50

loss = l type = realcontrol = default)

gt output_testCallada(y ~ data = train iter = 50 loss = l type

= realcontrol = default)

Loss logistic Method real Iteration 50

Final Confusion Matrix for DataFinal Prediction


True value -1 1-1 1970 1371 487 785

Train Error 0185

Out -Of-Bag Error 0211 iteration= 50

Additional Estimates of number of iterations

trainerr1 trainkap147 50

The confusion matrix for the data yields a training error of 18.5%. The Out-Of-Bag Error is a little over 21%.

Step 4 Assess Model PerformanceThe addtest function is used to evaluate the testing set without needing torefit the modelgt output_testlt-addtest(output_test testx=test

[-73]testy=test [73])

gt summary(output_test)Callada(y ~ data = train iter = 50 loss = l type

= realcontrol = default)

Loss logistic Method real Iteration 50

Training Results

Accuracy 0815 Kappa 0584

Testing Results

Accuracy 0753 Kappa 0451

It is noticeable that the accuracy of the model falls from 81.5% for the training set to 75.3% for the test set. A rather sharp decline is also observed in the kappa statistic.


We next calculate the influential variables. As we observed with Discrete L2 Boost, there is not a lot of variation in the top five most influential, with all hovering in the approximate range of 10%-11%:

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")
> round(scores[1:5], 3)
   x1    x2   x23    x5   x51
0.114 0.111 0.108 0.108 0.106

Step 5 Make PredictionsWe re-estimate and use the validation sample to make predictions Noticethat 385 are predicted to be in ldquoclass -1rdquo and 177 in ldquoclass 1rdquo These predictedvalues are very close to those obtained by the Discrete L2 Boostgt predlt-predict(output_valid newdata=valid)

gt table(pred)pred-1 1

385 177gt table(valid$y)

-1 1356 206


Technique 72

Gentle L2 Boost

The Gentle L2 Boost uses gentle boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "gentle", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3 Estimate ModelWe choose 50 boosting iterations by setting the parameter iter = 50gt setseed (127)gt output_test lt- ada(y ~ data = train iter = 50

loss = l type = gentlecontrol = default)

gt output_testCallada(y ~ data = train iter = 50 loss = l type

= gentlecontrol = default)

Loss logistic Method gentle Iteration 50

Final Confusion Matrix for DataFinal Prediction


True value -1 1-1 1924 1831 493 779

Train Error 02

Out -Of-Bag Error 0216 iteration= 50

Additional Estimates of number of iterations

trainerr1 trainkap150 50

The test data set obtained a training error of 20% and a slightly higher Out-Of-Bag Error estimate of close to 22%.

Step 4 Assess Model PerformanceNext we begin the assessment of the modelrsquos performance using the test dataset Notice that the accuracy declines slightly from 80 for the trainingsample to 74 for the test data set We see modest declines in kappa also aswe move from the training to the test data setgt output_testlt-addtest(output_test testx=test

[-73]testy=test [73])

gt summary(output_test)Callada(y ~ data = train iter = 50 loss = l type

= gentlecontrol = default)

Loss logistic Method gentle Iteration 50

Training Results

Accuracy 08 Kappa 0552

Testing Results

Accuracy 0742 Kappa 0427


The top five most influential variables are given below. It is interesting to observe that they all lie roughly in a range of 9% to 10%:

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")
> round(scores[1:5], 3)
  x39   x12   x56   x70   x41
0.099 0.097 0.097 0.097 0.095

Step 5 Make PredictionsWe use the validation sample to make class predictions Notice that 412 arepredicted to be in ldquoclass -1rdquo and 150 in ldquoclass 1rdquo These predicted values aresomewhat different from those obtained by the Discrete L2 Boost (see page506) and Real Discrete L2 Boost (see page 510) gt output_valid lt- ada(y ~ data = valid iter = 50

loss = l type = gentlecontrol = default)

gt predlt-predict(output_valid newdata=valid)

gt table(pred)pred-1 1

412 150gt table(valid$y)

-1 1356 206


Multi-Class Boosting


Technique 73

SAMME

SAMME135 is an extension of the binary Ada Boost algorithm to two or more classes. It uses an exponential loss function with α_m = ln((1 − ε_m)/ε_m) + ln(k − 1), where k is the number of classes.

PRACTITIONER TIP

When the number of classes k = 2, then α_m = ln((1 − ε_m)/ε_m), which is similar to the learning coefficient of Ada Boost.M1. Notice that SAMME can be used for both binary boosting and multiclass boosting. This is also the case for the Breiman and the Freund extensions discussed on page 519 and page 522 respectively. As an experiment, re-estimate the Sonar data set (see page 482) using SAMME; a short sketch of how you might do this follows below. What do you notice about the results relative to Ada Boost.M1?
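A sketch of that experiment might look as follows (our own code, not the book's). It reuses the Sonar split and the rpart.control settings from Technique 67 and assumes both adabag and mlbench are installed; the object names are illustrative.

> library(adabag); library(mlbench); data(Sonar)
> set.seed(107)
> ind <- sample(1:nrow(Sonar), 157, FALSE)
> ctrl <- rpart.control(cp = -1, maxdepth = 2, minsplit = 0)
> fit.samme <- boosting(Class ~ ., data = Sonar[ind, ], mfinal = 25,
                        coeflearn = "Zhu", control = ctrl)
> pred <- predict.boosting(fit.samme, newdata = Sonar[-ind, ])
> pred$error    # compare with the Ada Boost.M1 test results of Technique 67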

The SAMME algorithm can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Zhu", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Zhu" to specify using the SAMME algorithm.


Step 1 Load Required Packages

NOTE

The Vehicle data set (see page 23) is stored in the mlbench package. It is automatically loaded with the adabag package. If you need to load Vehicle directly, type library(mlbench) at the R prompt.

We begin by loading the library package and the Vehicle data set:

> library(adabag)
> data(Vehicle)

Step 2 Prepare Data amp Tweak ParametersThe adabag package uses the rpart function to generate decision trees Sowe need to supply an rpartcontrol object to the model We use 500observations as the training set and the remainder for testinggt default lt- rpartcontrol(cp = -1 maxdepth = 4

minsplit = 0)gt setseed (107)gt N=nrow(Vehicle)gt train lt- sample (1N 500 FALSE)gt observed lt-Vehicle[train ]

Step 3 amp 4 Estimate Model amp Assess Model Perfor-manceWe estimate the model create a table of predicted and observed values andthen calculate the error rategtsetseed (107)gtfitlt-boosting(Class ~ data = Vehicle[train ]

mfinal =25control = default coeflearn=Zhu)

gt table(observed$Class fit$class dnn= c(ObservedClassPredictedClass))

Predicted ClassObserved Class bus opel saab van


bus 124 0 0 0opel 0 114 13 0saab 0 12 117 0van 0 0 0 120

gt error_rate = (1-sum(fit$class== observed$Class)500)

gt round(error_rate 2)[1] 005

PRACTITIONER TIP

The object returned from using the boosting function contains some very useful information; particularly relevant are $trees, $weights, $votes, $prob, $class and $importance. For example, if you enter fit$trees R will return details of the trees used to estimate the model.

Notice the error rate is 5 which seems quite low Perhaps we have overfit the model As a check we perform a 10-fold cross validation The confusionmatrix is obtained using $confusion and the training error by $errorgtsetseed (107)gt cvlt-boostingcv(Class ~ data = Vehicle mfinal

=25v=10control = default coeflearn=Zhu)

gt cv$confusionObserved Class

Predicted Class bus opel saab vanbus 208 0 3 1opel 2 113 91 4saab 7 94 116 11van 1 5 7 183

gt cv$error[1] 02671395

The cross validation error rate is much higher at 27%. This is probably a better reflection of what we can expect on the testing sample.

Before we use the test set to make predictions we take a look at the threemost influential featuresgt influence lt-sort(fit$importance decreasing = TRUE)


> round(influence[1:3], 1)
    Max.L.Ra   Pr.Axis.Ra Sc.Var.Maxis
        14.0          9.0          8.7

NOTE

Cross validation is a powerful concept because it can be used to estimate the error of an ensemble without having to divide the data into a training and testing set. It is often used in small samples, but is equally advantageous in larger datasets also.

Step 5 Make PredictionsWe use the test data set to make predictionsgt predlt-predictboosting(fit newdata=Vehicle[-train

])

gt pred$error[1] 02745665

gt pred$confusionObserved Class

Predicted Class bus opel saab vanbus 88 1 1 1opel 1 42 37 1saab 2 40 48 4van 3 2 2 73

The prediction error at 27% is close to the cross validated error rate.


Technique 74

Breimanrsquos Extension

Breiman's extension assumes α_m = 0.5 log((1 − ε_m)/ε_m). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Breiman", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Breiman".

PRACTITIONER TIP

The function boosting takes the optional parameter boos. By default it is set to TRUE and a bootstrap sample is drawn using the weight for each observation. If boos = FALSE then every observation is used.

Steps 1 and 2 are outlined beginning on page 516

Step 3 amp 4 Estimate Model amp Assess Model Perfor-manceWe fit the model using the training sample calculate the error rate and alsocalculate the error rate using a 10-fold cross validationgtsetseed (107)gtfitlt-boosting(Class~data= Vehicle[train ]mfinal

=25control = default coeflearn=Breiman)


gt error_rate = (1-sum(fit$class== observed$Class)500)

gt round(error_rate 2)[1] 01

gtsetseed (107)gtcvlt-boostingcv(Class~data = Vehicle mfinal =25v

=10control = default coeflearn=Breiman)gt round(cv$error 2)[1] 026

Whilst the error rate of 10% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation. Once again we expect the cross-validation error to better reflect what we expect to observe in the test sample.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Max.L.Rect
        23.7         14.0          8.8

MaxLRa is the most influential variable here and also using the SAMMEalgorithm (see page 517)

Step 5 Make PredictionsFinally we make predictions using the test data set and print out the observedversus predicted valuegt predlt-predictboosting(fit newdata=Vehicle[-train

])

gt round(pred$error 2)[1] 028

gt pred$confusionObserved Class

Predicted Class bus opel saab vanbus 87 8 4 1opel 1 43 33 1


saab 2 25 45 4van 4 9 6 73

The predicted error rate is 28% (close to the cross validated value).


Technique 75

Freundrsquos Adjustment

Freund's adjustment is another direct extension of the binary Ada Boost algorithm to two or more classes, where α_m = log((1 − ε_m)/ε_m). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Freund", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Freund".

Steps 1 and 2 are outlined beginning on page 516

Step 3 amp 4 Estimate Model amp Assess Model Perfor-manceWe fit the model using the training sample calculate the error rate and alsocalculate the error rate using a 10-fold cross validationgt setseed (107)gtfitlt-boosting(Class~data = Vehicle[train ]mfinal

=25control = default coeflearn=Freund)

gt error_rate = (1-sum(fit$class== observed$Class)500)

gt round(error_rate 2)[1] 006

gt setseed (107)


gtcvlt-boostingcv(Class~data = Vehicle mfinal =25v=10control = default coeflearn=Freund)

gt round(cv$error 2)[1] 025

Whilst the error rate of 6% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Pr.Axis.Ra
        20.1          9.7          9.3

MaxLRa is the most influential variable here (also using the SAMMEand Breiman algorithm - see pages 517 and 520)

Step 5 Make PredictionsFinally we make predictions using the test data set and print out the observedversus predicted tablegt predlt-predictboosting(fit newdata=Vehicle[-train

])

gt round(pred$error 2)[1] 027

gt pred$confusionObserved Class

Predicted Class bus opel saab vanbus 86 1 2 0opel 1 47 37 0saab 3 31 45 4van 4 6 4 75

The predicted error rate is 27% (close to the cross validated value of 25%).


Continuous Response Boosted Regression


Technique 76

L2 Regression

NOTE

L2 boosting minimizes the least squares error (the sum of the squares of the differences between the observed values (y_i) and the estimated values (ŷ_i)):

L2 = Σ_{i=1}^{N} (y_i − ŷ_i)²    (76.1)

ŷ_i is estimated as a function of the independent variables / features.
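As a quick worked illustration of equation (76.1) (ours, using made-up numbers, not values from the bodyfat study):

> y    <- c(41.7, 43.3, 35.4)     # observed responses (made up)
> yhat <- c(40.0, 45.0, 36.0)     # fitted values (made up)
> sum((y - yhat)^2)               # the L2 loss of equation (76.1)
[1] 6.14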

L2 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Gaussian(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Gaussian(), which implements L2 boosting; and control, which limits the number of boosting iterations and also controls the shrinkage parameter.

Step 1: Load Required Packages
We begin by loading the mboost package and the bodyfat data set described on page 62.

> library(mboost)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
In the original study Garcia et al. used 45 observations for model validation. We follow the same approach, using the remaining 26 observations as the testing sample.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3 Estimate Model amp Assess Fit

PRACTITIONER TIP

Set trace = TRUE if you want to see status information during the fitting process.

First we fit the model using the glmboost function. We set the number of boosting iterations to 75 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Gaussian(),
                  control = boost_control(mstop = 75, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-66.27793041   0.02547924   0.17377131
     hipcirc elbowbreadth  kneebreadth
  0.46430335  -0.63272508   0.86864111
    anthro3a     anthro3b
  3.36145109   3.52597323

PRACTITIONER TIP

We use coef(fit, off2int = TRUE) to add back the offset to the intercept. To see the intercept without the offset use coef(fit).

A key tuning parameter of boosting is the number of iterations. We set mstop = 75. However, to prevent over fitting it is important to choose the optimal stopping iteration with care. We use 10-fold cross validated estimates of the empirical risk to help us choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 40

> fit[mstop(cvm)]

Figure 76.1: Cross-validated predictive risk for bodyfat data set and L2 Regression

Figure 76.1 displays the predictive risk. The optimal stopping iteration minimizes the average risk over all samples. Notice that mstop(cvm) returns the optimal number, in this case 40. We use fit[mstop(cvm)] to set the model parameters automatically to the optimal mstop.

Given our optimal model we calculate a bootstrapped confidence interval at the 90% level for each of the parameters. For illustration we only use 200 bootstraps; in practice you should use at least 1000.

> CI <- confint(fit, B = 200, level = 0.9)

> CI
Bootstrap Confidence Intervals
                       5%          95%
(Intercept)  -74.8992833 -49.16894522
age            0.0000000   0.04193643
waistcirc      0.0250575   0.27216736
hipcirc        0.2656622   0.60493289
elbowbreadth  -0.5086485   0.45619227
kneebreadth    0.0000000   1.93441658
anthro3a       0.0000000   8.05412043
anthro3b       0.0000000   5.32118418
anthro3c       0.0000000   3.05345810
anthro4        0.0000000   4.66542476

13 PRACTITIONER TIP

To compute a confidence interval for another level simply en-ter the desired level using the print function For exampleprint(CI level = 08) returnsgt print(CI level = 08)

Bootstrap Confidence Intervals10 90

(Intercept) -7220299829 -5187095140age 000000000 003249837waistcirc 004749293 024761814hipcirc 032502314 056940817elbowbreadth -030893554 000000000kneebreadth 000000000 145814706anthro3a 000000000 714609005anthro3b 000000000 467906130anthro3c 000000000 229523786anthro4 000000000 359735084

The confidence intervals indicate that waistcirc and hipcirc are statistically significant. Since our goal is to build a parsimonious model we re-estimate the model using only these two variables:

> fit2 <- gamboost(DEXfat ~ waistcirc + hipcirc, data = bodyfat[train, ],
                   family = Gaussian(),
                   control = boost_control(mstop = 150, nu = 0.1,
                                           trace = FALSE))

> CI2 <- confint(fit2, B = 50, level = 0.9)

> par(mfrow = c(1, 2))
> plot(CI2, which = 1)
> plot(CI2, which = 2)
> par(new = TRUE)

Notice we use the gamboost function rather than glmboost. This is because we want to compute and display point-wise confidence intervals using the plot function. The confidence intervals are shown in Figure 76.2. For both variables the effect shows an almost linear increase with circumference size.

Figure 76.2: Point-wise confidence intervals for waistcirc and hipcirc


Step 4: Make Predictions
We now make predictions using the testing data set, plot the result (see Figure 76.3) and calculate the squared correlation coefficient. The final model shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.858 (linear correlation = 0.926).

> pred <- predict(fit2, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)[1]
[1] 0.858

531

92 Applied Predictive Modeling Techniques in R

Figure 763 Plot of observed versus predicted values for L2 bodyfat regres-sion

532

Technique 77

L1 Regression

NOTE

L1 minimizes the least absolute deviations (also known as theleast absolute errors) by minimizing the sum of the absolutedifferences between the observed value (yi) and the estimatedvalues (yi)

L1 =|Nsum

i=1yi minus yi |

yi is estimated as a function of the independent variables features Unlike L2 loss with is sensitive to extreme valuesL1 is robust to outliers

L1 boosting for continuous response variables can be run using the packagemboost with the glmboost function

glmboost(z ~ data family=Laplace ()control )

Key parameters include z the continuous response variable data the dataset of independent variablersquos family=Laplace() implements L1loss boostingcontrol which limits the number of boosting iterations and the shrinkageparameter Steps 1-3 are outlined beginning on page 525

Step 3 Estimate Model amp Assess FitFirst we fit the model using the glmboost function We set the number ofboosting iterations to 300 and the shrinkage parameter (nu) to 01 This is

533

92 Applied Predictive Modeling Techniques in R

followed by using coef to show the estimated model coefficientsgtfitlt-glmboost(DEXfat~data = bodyfat[train ]

family=Laplace ()control = boost_control(mstop=300nu = 01trace = FALSE))

gt coef(fit off2int=TRUE)(Intercept) age waistcirc-5912102158 001124291 006454519

hipcirc elbowbreadth kneebreadth049652307 -038873368 083397012

anthro3a anthro3b anthro3c058847326 364957585 200388290

All of the variables receive a weight with the exception of anthro4 A10-fold cross validated estimate of the empirical risk is used to choose theoptimal number of boosting iterations The empirical risk is calculated usingthe cvrisk functiongt cv10f lt- cv(modelweights(fit) type = kfoldB

=10)gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 260

gtfit[mstop(cvm)]

The optimal stopping iteration is 260 and fit[mstop(cvm)]sets the modelparameters automatically to the optimal mstop

Step 4 Make PredictionsPredictions using the testing data set are made using the optimal model fitA plot of the predicted and observed values (see Figure 771) shows a rela-tively linear relationship with DEXfat with a squared correlation coefficientof 0906 (linear correlation = 095)gt predlt-predict(fit newdata=bodyfat[-train ] type

= response)

534

TECHNIQUE 77 L1 REGRESSION

gt plot(bodyfat$DEXfat[-train]pred xlab=DEXfatylab=PredictedValues main=TrainingSampleModelFit)

gt round(cor(pred bodyfat$DEXfat[-train])^23)[1]

[1] 0906

Figure 771 Plot of observed versus predicted values for L1 bodyfat regres-sion

535

Technique 78

Robust Regression

Robust boosting regression uses the Huber loss function which is less sensitiveto outliers than the L2 loss function discussed on 525 It can be run usingthe package mboost with the glmboost function

glmboost(z ~ data family=Huber ()control )

Key parameters include z the continuous response variable data the dataset of independent variables family = Huber() control which limits thenumber of boosting iterations and the shrinkage parameter Steps 1-3 areoutlined beginning on page 525

Step 3 Estimate Model amp Assess FitFirst we fit the model using the glmboost function We set the number ofboosting iterations to 300 and the shrinkage parameter (nu) to 01 This isfollowed by using coef to show the estimated model coefficientsgt fit lt- glmboost(DEXfat ~ data = bodyfat[train

]family=Huber ()control = boost_control(mstop=300nu = 01trace = FALSE))

gt coef(fit off2int=TRUE)(Intercept) age waistcirc-6083574185 002340993 017680696

hipcirc elbowbreadth kneebreadth045778183 -083925018 064471220

anthro3a anthro3b anthro3c

536

TECHNIQUE 78 ROBUST REGRESSION

026460670 514548154 078826723

All of the variables receive a weight with the exception of anthro4 A10-fold cross validated estimate of the empirical risk is used to choose theoptimal number of boosting iterations The empirical risk is calculated usingthe cvrisk functiongt cv10f lt- cv(modelweights(fit) type = kfoldB

=10)gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 100

gtfit[mstop(cvm)]

The optimal stopping iteration is 100 and fit[mstop(cvm)]sets the modelparameters automatically to the optimal mstop

Step 4 Make PredictionsPredictions using the testing data set are made using the optimal model fitA plot of the predicted and observed values (see Figure 781) shows a rela-tively linear relationship with DEXfat with a squared correlation coefficientof 0913gt predlt-predict(fit newdata=bodyfat[-train ] type

= response)

gtplot(bodyfat$DEXfat[-train]pred xlab=DEXfatylab=PredictedValues main=TrainingSampleModelFit)

gt round(cor(pred bodyfat$DEXfat[-train])^23)[1]

[1] 0913

537

92 Applied Predictive Modeling Techniques in R

Figure 781 Plot of observed versus predicted values for Robust bodyfatregression

538

Technique 79

Generalized Additive Model

The generalized additive model (GAM) is a generalization of the linear regres-sion model in which the coefficients can be estimated as smooth functions ofcovariates The GAM model can account for non-linear relationships betweenresponse variables and covariates

E(Y | X1 X2 XK) = α + f1 (X1) + f2 (X2) + fK (XK) (791)

A boosted version of the model can be run using the package mboost withthe gamboost function

gamboost(z ~ data control )

Key parameters include z the continuous response variable data the dataset of independent variables control which limits the number of boostingiterations and controls the shrinkage parameter Steps 1-3 are outlined be-ginning on page 525

NOTE

For more than 100 years what data scientists could do waslimited by asymptotic theory Consumers of statistical theorysuch as Economists focused primarily on linear models andresidual analysis With boosting there is no excuse anymorefor using overly restrictive statistical models

Step 3 Estimate Model amp Assess FitWe previously saw that waistcirc and hipcirc are the primary independentvariables for predicting body fat (see page 525) We use these two variables

539

92 Applied Predictive Modeling Techniques in R

to estimate linear terms (bols()) smooth terms (bbs()) and interactionsbetween waistcirc and hipcirc modeled using decision trees (btree())gt mlt-DEXfat~waistcirc+hipcirc+bols(waistcirc)+bols(

hipcirc)+bbs(waistcirc)+bbs(hipcirc)+ btree(hipcirc

+ waistcirc tree_controls = ctree_control(maxdepth= 4 mincriterion = 0))

We set the number of boosting iterations to 150 and the shrinkage pa-rameter (nu) to 01 This is followed by a 10-fold cross validated estimate ofthe empirical risk to choose the optimal number of boosting iterations Theempirical risk is calculated using the cvrisk function fit[mstop(cvm)]assigns the model coefficients at the optimal iterationgt fitlt- gamboost(m data = bodyfat[train ]control =

boost_control(mstop =150 nu = 01trace = FALSE))

gt cv10f lt- cv(modelweights(fit) type = kfoldB=10)

gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 41

gt fit[mstop(cvm)]

The optimal stopping iteration is 41 and fit[mstop(cvm)]sets the modelparameters automatically to the optimal mstop

Plots of the linear effects and the interaction map can be obtained bytypinggt plot(fit which =1)

gt plot(fit which =2)

gt plot(fit which =5)

The resulting plots are given in Figure 791 and Figure 792 Figure 791indicates a strong linear effect for both waistcirc and hipcircFigure 792indicates that a hip circumference larger that about 110 cm leads to increasedbody fat provided waist circumference is larger than approximately 90cm

540

TECHNIQUE 79 GENERALIZED ADDITIVE MODEL

Figure 791 GAM regression estimates of linear effects for waistcirc andhipcirc

541

92 Applied Predictive Modeling Techniques in R

Figure 792 Interaction model component between waistcirc and hipcirc

Step 4 Make PredictionsPredictions using the test data set are made with the optimal model fit Aplot of the predicted and observed values (see Figure 793) shows a relativelylinear relationship with DEXfat The squared correlation coefficient is 0867(linear correlation = 093)gt predlt-predict(fit newdata=bodyfat[-train ] type

= response)

542

TECHNIQUE 79 GENERALIZED ADDITIVE MODEL

gt plot(bodyfat$DEXfat[-train]pred xlab=DEXfatylab=PredictedValuesmain=TrainingSampleModelFit)

gt round(cor(pred bodyfat$DEXfat[-train])^23)[1]

[1] 0867

Figure 793 Plot of observed versus predicted values for GAM regression

543

Technique 80

Quantile Regression

Quantile regression models the relationship between the independent vari-ables and the conditional quantiles of the response variable It provides amore complete picture of the conditional distribution of the response variablewhen both lower and upper or all quantiles are of interest

For example in the analysis of body mass both lower (underweight) andupper (overweight) quantiles are of interest to health professionals and healthconscious individuals We apply to the bodyfat data set to illustrate thispoint

Quantile regression can be run using the package mboost with theglmboost function

glmboost(z ~ data family=QuantReg ()control )

Key parameters include z the continuous response variable data the dataset of independent variables family=QuantReg control which limits thenumber of boosting iterations and the shrinkage parameter

Step 1 Load Required PackagesWe illustrate the use of this model using the bodyfat data frame Details ofthis data set are discussed on page 62library(mboost)data(bodyfatpackage=THdata)setseed (465)

544

TECHNIQUE 80 QUANTILE REGRESSION

Step 2 Estimate Model amp Assess FitThe question we address is do a common set of factors explain the 25thand 75th percentiles of the data set To solve this issue we estimate twoquantile regression models one at the 25th percentile and the other at the75th percentilegt fit25 lt- glmboost(DEXfat ~ data = bodyfat

family=QuantReg(tau =025) control = boost_control(mstop =1200 nu = 01trace = FALSE))

gt fit75 lt- glmboost(DEXfat ~ data = bodyfat family=QuantReg(tau =075) control = boost_control(mstop =1200 nu = 01trace = FALSE))

The coefficients of the 25th percentile regression aregt coef(fit25 off2int=TRUE)

(Intercept) age waistcirc-5824524124 003352251 016259137

hipcirc elbowbreadth kneebreadth032444309 -019165652 081968750

anthro3a anthro3b anthro3c127773808 169581972 390369290

anthro4060159287

The coefficients of the 75th percentile regression aregt coef(fit75 off2int=TRUE)

(Intercept) age waistcirc-644564045 00130879 02779561

hipcirc elbowbreadth kneebreadth03954600 -01283315 08464489

anthro3a anthro3b anthro3c25044739 22238765 09028559

545

92 Applied Predictive Modeling Techniques in R

There appears to be considerable overlap in the set of explanatory vari-ables which explain both percentiles However it is noticeable that anthro4only appears in the 25th percentile regression To investigate further we usea 10-fold cross validated estimate of the empirical risk to choose the opti-mal number of boosting iterations for each quantile regression model Theempirical risk is calculated using the cvrisk functiongt cv10f25 lt- cv(modelweights(fit25) type = kfold

B=10)gt cvm25 lt- cvrisk(fit25 folds = cv10f25)

gt cv10f75 lt- cv(modelweights(fit75) type = kfoldB=10)

gt cvm75 lt- cvrisk(fit75 folds = cv10f75)

gt mstop(cvm25)[1] 994

gt mstop(cvm75)[1] 1200

gt fit25[mstop(cvm25)]gt fit75[mstop(cvm75)]

The optimal stopping iteration is 994 for the 25th percentile model and1200 for the 75th percentile model

Given our optimal model we calculate a bootstrapped confidence interval(90 level) around each of the parameters For illustration we only use 50bootstraps in practice you should use at least 1000gt CI25 lt- confint(fit25 B =50 level = 09)gt CI75 lt- confint(fit75 B =50 level = 09)

The results for the 25th percentile model aregt CI25

Bootstrap Confidence Intervals5 95

(Intercept) -6291485919 -4384055225age -002032722 006116936waistcirc 002054691 036351636hipcirc 014107346 048121885elbowbreadth -102553929 023022726

546

TECHNIQUE 80 QUANTILE REGRESSION

kneebreadth -003557848 137440528anthro3a 000000000 411999837anthro3b 000000000 318697032anthro3c 209103968 649277249anthro4 000000000 136671922

The results for the 75th percentile model aregt CI75

Bootstrap Confidence Intervals5 95

(Intercept) -7277094640 -491977452age -000093535 00549349waistcirc 010970068 03596424hipcirc 016935863 05051908elbowbreadth -136476008 05104004kneebreadth 000000000 19214064anthro3a -076595108 61881146anthro3b 000000000 66016240anthro3c -031714787 56599846anthro4 -027462792 28194127

Both models have waistcirc and hipcirc as significant explanatory vari-ables anthro3c is the only other significant variable and then only for thelower weight percentiles

547

Technique 81

Expectile Regression

Expectile regression136 is used for estimating the conditional expectiles of aresponse variable given a set of attributes Having multiple expectiles at dif-ferent levels provides a more complete picture of the conditional distributionof the response variable

A boosted version of Expectile regression can be run using the packagemboost with the glmboost function

glmboost(z ~ data family=ExpectReg ()control)

Key parameters include z the continuous response variable data the dataset of independent variables family=ExpectReg() control which limits thenumber of boosting iterations and the shrinkage parameter

We illustrate the use of Expectile regression using the bodyfat data frameDetails of this data set are discussed on page 62 We continue our analysis ofpage 544 Taking the final models developed for the quantile regression andfitting them using ExpectReg (note Step 1 is outlined on page 544)

Step 2 Estimate Model amp Assess FitWe set fit25 as the 25th expectile regression using the three significant vari-ables identified using a quantile regression (waistcirchipcirc anthro3csee page 544) We also use fit75 as the 75th expectile regression using thetwo significant variables (waistcirc and hipcirc) identified via quantileregression

Ten-fold cross validation is used to determine the optimal number ofiterations for each of the models The optimal number is captured usingmstop(cvm25) and mstop(cvm75) for the 25th expectile regression and 75thexpectile regression respectively

548

TECHNIQUE 81 EXPECTILE REGRESSION

gtfit25 lt-glmboost(DEXfat~ waistcirc+hipcirc+anthro3c data = bodyfat family=ExpectReg(tau =025) control= boost_control(mstop =1200 nu = 01trace =

FALSE))

gt fit75 lt- glmboost(DEXfat ~ waistcirc+hipcirc data= bodyfat family=ExpectReg(tau =025) control =

boost_control(mstop =1200 nu = 01trace = FALSE))

gt cv10f25 lt- cv(modelweights(fit25) type = kfoldB=10)

gt cvm25 lt- cvrisk(fit25 folds = cv10f25)gt cv10f75 lt- cv(modelweights(fit75) type = kfold

B=10)gt cvm75 lt- cvrisk(fit75 folds = cv10f75)

gt mstop(cvm25)[1] 219

gt mstop(cvm75)[1] 826gt fit25[mstop(cvm25)]

Notice that the optimal number of iterations is 219 and 826 for the 25thexpectile regression and 75th expectile regression respectively The 90 con-fidence intervals are estimated for each model (we use a small bootstrap of B= 50 in practice you will want to use at least 1000)gt CI25 lt- confint(fit25 B =50 level = 09)gt CI75 lt- confint(fit75 B =50 level = 09)

gt CI25Bootstrap Confidence Intervals

5 95(Intercept) -610691823 -489044440waistcirc 01035435 02891155hipcirc 03204619 05588739anthro3c 44213144 72163618

gt CI75Bootstrap Confidence Intervals

5 95

549

92 Applied Predictive Modeling Techniques in R

(Intercept) -601313272 -418993208waistcirc 02478238 04967517hipcirc 02903714 06397651

As we might have expected all the variables included in each model arestatistically significant Both models have waistcirc and hipcirc as sig-nificant explanatory variables It seems anthro3c is important in the lowerweight percentile These findings are similar to those observed by the boostedquantile regression models discussed on page 544

550

Discrete Response BoostedRegression

551

Technique 82

Logistic Regression

Boosted logistic regression can be run using the package mboost with theglmboost function

glmboost(z ~ data family=Binomial(link = c(logit))control )

Key parameters include z the discrete response variable data the dataset of independent variables family=Binomial control which limits thenumber of boosting iterations and the shrinkage parameter

Step 1 Load Required PackagesWe will build the model using the Sonar data set discussed on page 482gt library(mboost)gt data(Sonarpackage=mlbench)

Step 2 Prepare Data amp Tweak ParametersWe use the first 157 observations to train the model and the remaining 51observations as the test setgt setseed (107)gt n=nrow(Sonar)gt train lt- sample (1n 157 FALSE)

552

TECHNIQUE 82 LOGISTIC REGRESSION

Step 3 Estimate ModelWe fit the model setting the maximum number of iterations to 200 Thena 10-fold cross validated estimate of the empirical risk is used to choose theoptimal number of boosting iterations

Empirical risk is calculated using the cvrisk functiongt fit lt- glmboost(Class~ data = Sonar[train ]

family=Binomial(link = c(logit))control = boost_control(mstop =200 nu = 01trace = FALSE))

gt cv10f lt- cv(modelweights(fit) type = kfoldB=10)

gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 123

gt fit[mstop(cvm)]

The optimal number is 123 (from mstop(cvm)) We use fit[mstop(cvm)]to capture the model estimates at this iteration

Step 4 Classify DataThe optimal model is used to classify the test data set A threshold parameter(thresh) is used to translate the model scores into classes ldquoMrdquo and ldquoRrdquoFinally the confusion table is printed out

gt predlt-predict(fit newdata=Sonar[-train ] type =response)

gt thresh lt- 05gt predFac lt- cut(pred breaks=c(-Inf thresh Inf)

labels=c(M R))

gt table(Sonar$Class[-train]predFac dnn=c(actualpredicted))

predictedactual M R

M 22 7R 4 18

553

92 Applied Predictive Modeling Techniques in R

The overall error rate of the model is 7+451 = 215

554

Technique 83

Probit Regression

Boosted probit regression can be run using the package mboost with theglmboost function

glmboost(z ~ data family=Binomial(link = c(probit))control )

Key parameters include z the discrete response variable data the dataset of independent variables family=Binomial control which limits thenumber of boosting iterations and the shrinkage parameter Step 1 and 2 areoutlined on page 552

Step 3 Estimate ModelThe model is fit by setting the maximum number of iterations to 200 Thena 10-fold cross validated estimate of the empirical risk is used to choose theoptimal number of boosting iterations The empirical risk is calculated usingthe cvrisk functiongt fit lt- glmboost(Class~ data = Sonar[train ]

family=Binomial(link = c(probit))control =boost_control(mstop =200 nu = 01trace = FALSE))

gt cv10f lt- cv(modelweights(fit) type = kfoldB=10)

gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 199

555

92 Applied Predictive Modeling Techniques in R

gt fit[mstop(cvm)]

The optimal number is 199 (from mstop(cvm)) We use fit[mstop(cvm)]to capture the model estimates at this iteration

Step 4 Classify DataThe optimal model is used to classify the test data set A threshold parameter(thresh) is used to translate the model scores into classes ldquoMrdquo and ldquoRrdquoFinally the confusion table is printed outgt predlt-predict(fit newdata=Sonar[-train ] type =

response)

gt thresh lt- 05gt predFac lt- cut(pred breaks=c(-Inf thresh Inf)

labels=c(M R))

gt table(Sonar$Class[-train]predFac dnn=c(actualpredicted))

predictedactual M R

M 21 8R 4 18

The overall error rate of the model is 8+451 = 235

556

Boosted Regression for Countamp Ordinal Response Data

557

Technique 84

Poisson Regression

Modeling count variables is a common task for the data scientist Whenthe response variable (yi) is a count variable following a Poisson distributionwith a mean that depends on the covariates x1xk the Poisson regressionmodel is used A boosted Poisson regression can be run using the packagemboost with the glmboost function

glmboost(z ~ data family=Poisson ()control )

Key parameters include z the response variable data the data set ofindependent variables family=Poisson control which limits the numberof boosting iterations and the shrinkage parameter

NOTE

If yi sim Poisson (λi) then the mean is equal to λi and thevariance is also equal to λi Therefore in a Poisson regressionboth the mean and the variance depend on the covariates ie

ln (λi) = β0 + β1x1i + + βkxki (841)

Step 1 Load Required PackagesThe required packages and data are loaded as followsgt library(mboost)gt library(MixAll)gt data(DebTrivedi)gt setseed (465)

558

TECHNIQUE 84 POISSON REGRESSION

gt flt-ofp~hosp + health + numchron + gender + school+ privins

The number of physician office visits ofp is the response variable The co-variates are - hosp (number of hospital stays) health (self-perceived healthstatus) numchron (number of chronic conditions) as well as the socioeco-nomic variables gender school (number of years of education) and privins(private insurance indicator)

Step 2 Estimate Model amp Assess FitThe model is fit using glmboost with the maximum number of iterationsequal to 1200 The parameter estimates are shown in Table 25

gt fitlt-glmboost(fdata=DebTrivedi family=Poisson ()control = boost_control(mstop =1200 nu = 01trace= FALSE))

gt round(coef(fit off2int=TRUE) 3)

(Intercept) hosp healthpoor1029 0165 0248healthexcellent numchron gendermale-0362 0147 -0112school privinsyes0026 0202

Table 25 Initial coefficient estimates

A 10-fold cross validated estimate of the empirical risk is used to choosethe optimal number of boosting iterations The empirical risk is calculatedusing the cvrisk function The parameter estimates are shown in Table 26gt cv10f lt- cv(modelweights(fit) type = kfoldB

=10)gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 36

gt fit[mstop(cvm)]

559

92 Applied Predictive Modeling Techniques in R

gt round(coef(fit off2int=TRUE) 3)

(Intercept) hosp healthpoor1039 0166 0243healthexcellent numchron gendermale-0347 0147 -0107school privinsyes0026 0197

Table 26 Cross validated coefficient estimates

The optimal number of iterations is 36 (from mstop(cvm)) We usefit[mstop(cvm)] to capture the model estimates at this iteration and printout the estimates using round(coef(fitoff2int=TRUE)3) We note thatall the parameter estimates for the optimal fit model are close to those ob-served prior to the 10-fold cross validation

Finally we estimate a 90 confidence interval for each of the parametersusing a small bootstrap sample of 50 (in practice you should use at least1000)gt CI lt- confint(fit B =50 level = 09)

gt CIBootstrap Confidence Intervals

5 95(Intercept) 09589353 118509474hosp 01355033 021464596healthpoor 01663162 032531872healthexcellent -04494058 -021435073numchron 01207087 016322440gendermale -01591000 -004314812school 00143659 003374010privinsyes 01196714 023438877

All of the covariates are statistically significant and contribute to explain-ing the number of physician office visits

560

Technique 85

Negative Binomial Regression

Negative binomial regression is often used for over-dispersed count outcomevariables A boosted negative binomial regression can be run using the pack-age mboost with the glmboost function

glmboost(z ~ data family=NBinomial ()control)

Key parameters include z the count response variable data the dataset of independent variables family=NBinomial control which limits thenumber of boosting iterations and the shrinkage parameter

13 PRACTITIONER TIP

A feature of the data frame DebTrivedi reported by the origi-nal researchers (see page 86) is that the data has a high degreeof unconditional over dispersion relative to the standard Pois-son model Overdispersion simply means that the data hasgreater variability than would be expected based on the givenstatistical model (in this case Poission) One way to han-dle over dispersion is to use the Negative Binomial regressionmodel

Step 1 is outlined on page 558

Step 2 Estimate Model amp Assess FitWe continue with the DebTrivedi data frame and estimate a Negative bino-mial regression using the same set of covariates as discuss on page 558 The

561

92 Applied Predictive Modeling Techniques in R

model is estimated using glmboost with the maximum number of iterationsequal to 1200 The parameter estimates are shown in Table 27gt fitlt-glmboost(fdata=DebTrivedi family=NBinomial(c

(0 100))control = boost_control(mstop =1200 nu =01trace = FALSE))

gt round(coef(fit off2int=TRUE) 3)

(Intercept) hosp healthpoor0929 0218 0305healthexcellent numchron gendermale-0342 0175 -0126school privinsyes0027 0224

Table 27 Initial coefficient estimates

Although the values of the estimates are somewhat different from thoseon page 559 the signs of the coefficient remain consistent

A 10-fold cross validated estimate of the empirical risk is used to choosethe optimal number of boosting iterations The empirical risk is calculatedusing the cvrisk function The parameter estimates are shown in Table 28gt cv10f lt- cv(modelweights(fit) type = kfoldB

=10)gt cvm lt- cvrisk(fit folds = cv10f)

gt mstop(cvm)[1] 312

gt fit[mstop(cvm)]

gt round(coef(fit off2int=TRUE) 3)

The optimal number of iterations is 312 (from mstop(cvm)) Thereforewe use fit[mstop(cvm)] to capture the model estimates at this iteration andprint out the estimates using round(coef(fitoff2int=TRUE)3) We notethat all the parameter estimates for the optimal fit model are very close tothose observed prior to the 10-fold cross validation

Finally we estimate a 90 confidence interval for each of the parametersusing a small bootstrap sample of 50 (in practice you should use at least1000)

562

TECHNIQUE 85 NEGATIVE BINOMIAL REGRESSION

(Intercept) hosp healthpoor0940 0216 0300healthexcellent numchron gendermale-0336 0174 -0122school privinsyes0026 0220

Table 28 Cross validated coefficient estimates

gt CI lt- confint(fit B =50 level = 09)

gt CIBootstrap Confidence Intervals

5 95(Intercept) 088730254 106806832hosp 017222572 024115790healthpoor 021205691 035635293healthexcellent -047308517 -018802910numchron 014796503 019085736gendermale -018601165 -005170944school 001638135 003447440privinsyes 013273522 025977131

All of the covariates are statistically significant

563

Technique 86

Hurdle Regression

Hurdle regression is used for modeling count data where there is over disper-sion and excessive zero counts in the outcome variable It can be run usingthe package mboost with the glmboost function

glmboost(z ~ data family=Hurdle ()control )

Key parameters include z the response variable data the data set ofindependent variables family=Hurdle control which limits the number ofboosting iterations and the shrinkage parameter

NOTE

A feature of the DebTrivedi reported by the researchers Deband Trivedi (see page 86) is that the data include a high pro-portion of zero counts corresponding to zero recorded demandover the sample interval One way to handle an excess ofzero counts is to fit a negative binomial regression model tothe non-zero counts This can be achieved using the Hurdlefunction In the hurdle approach the process that determinesthe zerononzero count threshold is different from the processthat determines the count once the hurdle (zero in this case)is crossed Once the hurdle is crossed the data are assumedto follow the density for a truncated negative binomial distri-bution

Step 1 is outlined on page 558

564

TECHNIQUE 86 HURDLE REGRESSION

Step 2 Estimate Model amp Assess FitWe continue with the DebTrivedi data frame and estimate a Hurdle regres-sion using the same set of covariates as discuss on page 558 The model isestimated using glmboost with the maximum number of iterations equal to1200 The parameter estimates are shown in Table 29gt fitlt-glmboost(fdata=DebTrivedi family=Hurdle(c

(0 100))control = boost_control(mstop =3000 nu =01trace = FALSE))

gt round(coef(fit off2int=TRUE) 3)

(Intercept) hosp healthpoor-3231 0352 0533healthexcellent numchron gendermale-0586 0299 -0206school privinsyes0044 0395

Table 29 Initial coefficient estimates

(Intercept) hosp healthpoor-3189 0341 0509healthexcellent numchron gendermale-0567 0294 -0192school privinsyes0042 0378

Table 30 Cross validated coefficient estimates

Although the sign of the intercept and values of the estimated coefficientsare somewhat different from those on 559 the signs of the coefficient arepreserved A 10-fold cross validated estimate of the empirical risk is usedto choose the optimal number of boosting iterations The empirical risk iscalculated using the cvrisk function The parameter estimates are shown inTable 30gt cv10f lt- cv(modelweights(fit) type = kfoldB

=10)gt cvm lt- cvrisk(fit folds = cv10f)

565

92 Applied Predictive Modeling Techniques in R

gt mstop(cvm)[1] 1489

The optimal number is 1489 (from mstop(cvm)) We usefit[mstop(cvm)] to capture the model estimates at this iteration and printout the estimates using round(coef(fitoff2int=TRUE)3) We note thatall the parameter estimates for the optimal fit model are very close to thoseobserved prior to the 10-fold cross validation

566

Technique 87

Proportional Odds Model

The Proportional Odds Model is a class of generalized linear models used formodelling the dependence of an ordinal response on discrete or continuouscovariates

It is used when it is not possible to measure the response variable onan interval scale In bio medical research for instance constructs such asself-perceived health can be measured on an ordinal scale (ldquovery unhealthyrdquoldquounhealthyrdquo ldquohealthyrdquo ldquovery healthyrdquo) A boosted version of the model canbe run using the package mboost with the glmboost function

glmboost(z ~ data family=PropOdds ()control )

Key parameters include the response Z which is an ordered factor datathe data set of explanatory variable family = PropOdds() control whichlimits the number of boosting iterations and the shrinkage parameter

Step 1 Load Required Packages amp Tweak DataThe required packages and data are loaded as followsgt library(mboost)gt library(ordinal)gt data(wine)gt setseed (125)

The package ordinal contains the data frame wine which is used in theanalysis It also contains the function clm which we use later to estimate annon-boosted version of the Proportional Odds Model Further details of wineare given on page 95

567

92 Applied Predictive Modeling Techniques in R

Step 2 Estimate Model amp Assess FitWe estimate the model using rating as the response variable and contactand temp as the covariates The model is estimated using glmboost with themaximum number of iterations equal to 1200 This followed by a five-foldcross validation using the minimum empirical risk to determine the optimalnumber of iterationsgt fitlt-glmboost(rating ~ temp + contact data= wine

family=PropOdds ()control = boost_control(mstop=1200nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)gt cvm lt- cvrisk(fit folds = cv5f)gt mstop(cvm)[1] 167

gt round(cvm[mstop(cvm)]3)[1] 142

gt fit[mstop(cvm)]

The optimal number of iterations is 167 (from mstop(cvm)) with aempirical risk of 142 We use fit[mstop(cvm)] to capture the modelestimates at this iteration and print out the parameter estimates usinground(coef(fitoff2int=TRUE)3)

The estimate of a Proportional Odds Model using the function clm is alsoreportedgt round(coef(fit off2int=TRUE) 3)(Intercept) tempwarm contactyes

-1508 2148 1230

gt fit2lt- clm(rating ~ temp + contact data = wine link=logit)

gt round(fit2$coefficients [56] 3)tempwarm contactyes

2503 1528

Although the coefficients differ between the boosted and non-boosted ver-sion of the model both models yield the same class predictionsgt predlt-predict(fit type=class)

568

TECHNIQUE 87 PROPORTIONAL ODDS MODEL

gt pred2 lt-predict(fit2 type=class)

gt table(pred)pred1 2 3 4 50 18 36 18 0

gt table (pred2)pred21 2 3 4 50 18 36 18 0

Finally we compare the fitted model with the actual observations andcalculate the overall error rategt tblt-table(wine$rating pred dnn=c(actual

predicted))

gt tbpredicted

actual 1 2 3 4 51 0 4 1 0 02 0 9 12 1 03 0 5 16 5 04 0 0 5 7 05 0 0 2 5 0

gt error lt- 1-(sum(diag(tb))sum(tb))

gt round (error 3)[1] 0556

The error rate for the model is 56

569

Boosted Models for SurvivalAnalysis

NOTE

Survival analysis is concerned with studying the time betweenentry to a study and a subsequent event (such as death)The objective is to use a statistical model to simultaneouslyexplore the effects of several explanatory variables on sur-vivalTwo popular approaches are the Cox Proportional Haz-ard Model and Accelerated Failure Time Models

Cox Proportional Hazard ModelThe Cox Proportional Hazard Model is a statistical technique for exploringthe relationship between survival (typically of a patient) and several explana-tory variables It takes the form

hi(t) = exp (β1X1i + + βkXki)h0(t) (871)where hi(t) is the hazard function for the ith individual at time th0(t) is

the baseline hazard function and X1 Xk are the explanatory covariatesThe model provides an estimate of the treatment effect on survival after

adjustment for other explanatory variables In addition it is widely used inmedical statistics because it provides an estimate of the hazard (or risk) ofdeath for an individual given their prognostic variables

NOTE The hazard function is the probability that an individual will experience

an event (for example death) within a small time interval given that theindividual has survived up to the beginning of the interval In the medicalcontext it can therefore be interpreted as the risk of dying at time t

570

TECHNIQUE 87 PROPORTIONAL ODDS MODEL

Accelerated Failure Time ModelsParametric accelerated failure time (AFT) models provide an alternative tothe (semi-parametric) Cox proportional hazards model for statistical mod-eling of survival data137 Unlike the Cox proportional hazards model theAFT approach models survival times directly and assumes that the effect ofa covariate is to accelerate or decelerate the life course of a response by someconstant

The AFT model treats the logarithm of survival time as the responsevariable and includes an error term that is assumed to follow a particulardistribution

Equation 872 shows the log-linear form of the AFT model for the ithindividual where logTi is the log-transformed survival time X1 Xk are theexplanatory covariates and εi represents the residual or unexplained variationin the log-transformed survival times while μ and σv are intercept and scaleparameter respectively138

log Ti = micro+ β1X1i + + βkXki + σεi (872)

Under the AFT model parameterization the distribution chosen for Ti

dictates the distribution of the error term εi Popular survival time distri-butions include the Weibull distribution the log-logistic distribution and thelog-normal distribution

13 PRACTITIONER TIP

If the baseline hazard function is known to follow a Weibuldistribution accelerated failure and proportional hazards as-sumptions are equivalent

Model Empirical RiskWeibull AFT 0937Lopgnormal AFT 1046Log-logistic AFT 0934Cox PH 2616Gehan 0267

Table 31 Estimation of empirical risk using the rhDNase data frame

571

92 Applied Predictive Modeling Techniques in R

Assessing FitOne of the first steps the data scientist faces in fitting survival models isto determine which distribution should be specified for the survival timesTi One approach is to fit a model for each distribution and choose themodel which minimizes the Akaikersquos Information Criterion (AIC)139 or similarcriteria An alternative is to choose the model which minimizes the cross-validated estimation of empirical risk

As an example Table 31 shows the cross-validated estimation of empiricalrisk for various models using the rhDNase data frame

572

Technique 88

Weibull Accelerated FailureTime Model

The Weibull Accelerated Failure Time Model is one of the most populardistributional choices for modeling survival data It can be run using thepackage mboost with the glmboost function

glmboost(z ~ data family=Weibull ()control )

Key parameters include z the survival response variable (note we setz=Surv(time status)where Surv is an a survival object constructed usingthe survival package)data the data set of explanatory variable family =Weibull() control which limits the number of boosting iterations and theshrinkage parameter

Step 1 Load Required Packages amp Tweak DataThe required packages and data are loaded as followsgt library(mboost)gt library(simexaft)gt library(survival)gt data(rhDNase)gt setseed (465)

Forced expiratory volume (FEV) was considered a risk factor and wasmeasured twice at randomization (rhDNase$fev and rhDNase$fev2) Wetake the average of the two measurements as an explanatory variable Theresponse is defined as the logarithm of the time from randomization to thefirst pulmonary exacerbation measured in the object survreg(Surv(time2status))

573

92 Applied Predictive Modeling Techniques in R

gt rhDNase$fevave lt- (rhDNase$fev + rhDNase$fev2)2gt zlt-Surv(rhDNase$time2 rhDNase$status)

Step 2 Estimate Model amp Assess FitWe estimate a model using z as the response variable rhDNase$fevaveand trt (a treatment assignment indicator taking values 1 if patient receiverhDNase and 0 if patent receive placebo) The model is fit using glmboostwith the maximum number of iterations equal to 1200 This followed by afive-fold cross validation using the minimum empirical risk to determine theoptimal number of iterationsgt fitlt-glmboost(Surv(time2 status) ~ trt + fevave

data= rhDNase family=Weibull ()control = boost_control(mstop =1200 nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)gt cvm lt- cvrisk(fit folds = cv5f)

gt mstop(cvm)[1] 500

gt round(cvm[mstop(cvm)]3)[1] 0937

gt fit[mstop(cvm)]

The optimal number of iterations is 500 (from mstop(cvm)) witha empirical risk of 0937 We use fit[mstop(cvm)] to capture themodel estimates at this iteration and print out the estimates usinground(coef(fitoff2int=TRUE)3)

We then fit a non-boosted Weibull AFT model using the survreg func-tion from the survival package and compare the parameter estimates to theboosted model This is followed by obtaining an estimate of the boostedmodels scale parameter using the function nuisance()gt fit1 lt- survreg(Surv(time2 status) ~ trt + fev

ave data = rhDNase dist = weibull)

gt round(coef(fit off2int=TRUE) 3)(Intercept) trt fevave

574

TECHNIQUE 88 WEIBULL ACCELERATED FAILURE TIME

4531 0348 0019

gt round(coef(fit1 off2int=TRUE) 3)(Intercept) trt fevave

4518 0357 0019

gt round(nuisance(fit) 3)[1] 092

The boosted model has a similar fevave to the non-boosted AFT modelThere is a small difference in the estimated values of the intercept and trt

575

Technique 89

Lognormal Accelerated FailureTime Model

The Lognormal Accelerated Failure Time Model can be run using the packagemboost with the glmboost function

glmboost(z ~ data family=Lognormal ()control)

Key parameters include z the survival response variable (note we setz=Surv(time2 status)where Surv is an a survival object constructed us-ing the survival package)data the data set of explanatory variable family= Lognormal() control which limits the number of boosting iterations andthe shrinkage parameter

Details on step 1 are given on 573

Step 2 Estimate Model amp Assess FitWe estimate a model using z as the response variable rhDNase$fevaveand trt (a treatment assignment indicator taking values 1 if patient receiverhDNase and 0 if patent receive placebo) The model is fit using glmboostwith the maximum number of iterations equal to 1200 This followed by afive-fold cross validation using the minimum empirical risk to determine theoptimal number of iterationsgt fitlt-glmboost(z ~ trt + fevave data= rhDNase

family=Lognormal ()control = boost_control(mstop=1200nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)

576

TECHNIQUE 89 LOGNORMAL ACCELERATED FAILURE

gt cvm lt- cvrisk(fit folds = cv5f)

gt mstop(cvm)[1] 456

gt round(cvm[mstop(cvm)]3)[1] 1046

gt fit[mstop(cvm)]

The optimal number of iterations is 456 (from mstop(cvm)) witha empirical risk of 1046 We use fit[mstop(cvm)] to capture themodel estimates at this iteration and print out the estimates usinground(coef(fitoff2int=TRUE)3)

We then fit a non-boosted Lognormal AFT model using the survregfunction from the survival package and compare the parameter estimates tothe boosted model This is followed by obtaining an estimate of the boostedmodels scale parameter using the function nuisance()gt fit1 lt- survreg(Surv(time2 status) ~ trt + fev

ave data = rhDNase dist = lognormal)

gt round(coef(fit off2int=TRUE) 3)(Intercept) trt fevave

4103 0408 0021

gt round(coef(fit1 off2int=TRUE) 3)(Intercept) trt fevave

4083 0424 0022

gt round(nuisance(fit) 3)[1] 1441

The boosted model has a similar fevave to the non-boosted AFT modelThere is a much larger difference in the estimated value trt (0408 boostedAFT versus 0424 non-boosted Lognormal AFT)

577

Technique 90

Log-logistic Accelerated FailureTime Model

The Log-logistic Accelerated Failure Time Model can be run using the pack-age mboost with the glmboost function

glmboost(z ~ data family=Loglog ()control )

Key parameters include z the survival response variable (note we setz=Surv(time2 status)where Surv is an a survival object constructed us-ing the survival package)data the data set of explanatory variable family= Loglog() control which limits the number of boosting iterations and theshrinkage parameter

Details on step 1 are given on 573

Step 2 Estimate Model amp Assess FitWe estimate a model using z as the response variable rhDNase$fevaveand trt (a treatment assignment indicator taking values 1 if patient receiverhDNase and 0 if patent receive placebo) The model is fit using glmboostwith the maximum number of iterations equal to 1200 This followed by afive-fold cross validation using the minimum empirical risk to determine theoptimal number of iterationsgt fitlt-glmboost(z ~ trt + fevave data= rhDNase

family=Loglog ()control = boost_control(mstop=1200nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)gt cvm lt- cvrisk(fit folds = cv5f)

578

TECHNIQUE 90 LOG-LOGISTIC ACCELERATED FAILURE

gt mstop(cvm)[1] 445

gt round(cvm[mstop(cvm)]3)[1] 0934

gt fit[mstop(cvm)]

The optimal number of iterations is 445 (from mstop(cvm)) witha empirical risk of 0934 We use fit[mstop(cvm)] to capture themodel estimates at this iteration and print out the estimates usinground(coef(fitoff2int=TRUE)3)

We then fit a non-boosted Log-logistic AFT model using the survregfunction from the survival package and compare the parameter estimates tothe boosted model This is followed by obtaining an estimate of the boostedmodels scale parameter using the function nuisance()gt fit1 lt- survreg(Surv(time2 status) ~ trt + fev

ave data = rhDNase dist = loglogistic)

gt round(coef(fit off2int=TRUE) 3)(Intercept) trt fevave

4110 0384 0020

gt round(coef(fit1 off2int=TRUE) 3)(Intercept) trt fevave

4093 0396 0021

gt round(nuisance(fit) 3)[1] 0794

The boosted model has a similar fevave to the non-boosted AFT modelThere is a larger difference in the estimated value trt (0384 boosted AFTversus 0396 non-boosted Log-logistic AFT)

579

Technique 91

Cox Proportional HazardModel

The Cox Proportional Hazard Model can be run using the package mboostwith the glmboost function

glmboost(z ~ data family=CoxPH ()control )

Key parameters include z the survival response variable (note we setz=Surv(time2 status)where Surv is an a survival object constructed us-ing the survival package)data the data set of explanatory variable family= CoxPH() control which limits the number of boosting iterations and theshrinkage parameter

Details on step 1 are given on 573

Step 2 Estimate Model amp Assess FitWe estimate a model using z as the response variable rhDNase$fevaveand trt (a treatment assignment indicator taking values 1 if patient receiverhDNase and 0 if patent receive placebo) The model is fit using glmboostwith the maximum number of iterations equal to 1200 This followed by afive-fold cross validation using the minimum empirical risk to determine theoptimal number of iterationsgt fitlt-glmboost(z ~ trt + fevave data= rhDNase

family=CoxPH ()control = boost_control(mstop=1200nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)gt cvm lt- cvrisk(fit folds = cv5f)

580

TECHNIQUE 91 COX PROPORTIONAL HAZARD MODEL

gt mstop(cvm)[1] 263

gt round(cvm[mstop(cvm)]3)[1] 2616

gt fit[mstop(cvm)]

The optimal number of iterations is 263 (from mstop(cvm)) witha empirical risk of 2616 We use fit[mstop(cvm)] to capture themodel estimates at this iteration and print out the estimates usinground(coef(fitoff2int=TRUE)3)

13 PRACTITIONER TIP

With boosting the whole business of model diagnostics isgreatly simplified For example you can quickly boost anadditive non-proportional hazards model in order to check ifit fits your data better than a linear Cox model If the twomodels perform roughly equivalent you know that assumingproportional hazards and linearity is reasonable

We then fit a non-boosted Cox Proportional Hazard Model model usingthe coxph function from the survival package and compare the parameterestimates to the boosted modelgt fit1 lt- coxph(Surv(time2 status) ~ trt + fevave

data = rhDNase)

gt round(coef(fit off2int=TRUE) 3)(Intercept) trt fevave

1431 -0370 -0020

gt round(coef(fit1 off2int=TRUE) 3)trt fevave

-0378 -0021

Both models have similar treatment effects

581

92 Applied Predictive Modeling Techniques in R

NOTE

For the Cox Proportional Hazard model a positive regressioncoefficient for an explanatory variable means that the hazardis higher and thus the prognosis worse Conversely a negativeregression coefficient implies a better prognosis for patientswith higher values of that variableIn Figure 911 we plot the predicted survivor function

S1 lt- survFit(fit)plot(S1)

582

TECHNIQUE 91 COX PROPORTIONAL HAZARD MODEL

Figure 911 Cox Proportional Hazard Model predicted survivor function forrhDNase

583

Technique 92

Gehan Loss Accelerated FailureTime Model

The Gehan Loss Accelerated Failure Time Model calculates a rank-basedestimation of survival data where the loss function is defined as the sum ofthe pairwise absolute differences of residuals It can be run using the packagemboost with the glmboost function

glmboost(z ~ data family=Gehan ()control )

Key parameters include z the survival response variable (note we setz=Surv(time2 status)where z is the response variable data the data setof explanatory variable family = Gehan() control which limits the num-ber of boosting iterations and the shrinkage parameter

Details on step 1 are given on 573

Step 2 Estimate Model amp Assess FitWe estimate a model using z as the response variable rhDNase$fevaveand trt (a treatment assignment indicator taking values 1 if patient receiverhDNase and 0 if patent receive placebo) The model is fit using glmboostwith the maximum number of iterations equal to 1200 This followed by afive-fold cross validation using the minimum empirical risk to determine theoptimal number of iterationsgt fitlt-glmboost(z ~ trt + fevave data= rhDNase

family=Gehan ()control = boost_control(mstop=1200nu = 01))

gt cv5f lt- cv(modelweights(fit) type = kfoldB=5)

584

TECHNIQUE 92 GEHAN LOSS ACCELERATED FAILURE

gt cvm lt- cvrisk(fit folds = cv5f)

gt mstop(cvm)[1] 578

gt round(cvm[mstop(cvm)]3)[1] 0267

gt fit[mstop(cvm)]

gt round(coef(fit off2int=TRUE) 3)(Intercept) trt fevave

3718 0403 0022

The optimal number of iterations is 578 (from mstop(cvm)) with aempirical risk of 0267 We use fit[mstop(cvm)] to capture the modelestimates at this iteration and print out the parameter estimates usinground(coef(fitoff2int=TRUE)3)

585

92 Applied Predictive Modeling Techniques in R

Notes115Bauer Eric and Ron Kohavi An empirical comparison of voting classification algo-

rithms Bagging boosting and variants Machine learning 361 (1998) 2116Friedman Jerome Trevor Hastie and Robert Tibshirani Additive logistic regression

a statistical view of boosting (with discussion and a rejoinder by the authors) The annalsof statistics 282 (2000) 337-407

117A type of proximal gradient method for learning from data See for example1 Parikh Neal and Stephen Boyd Proximal algorithms Foundations and Trends

in optimization 13 (2013) 123-2312 Polson Nicholas G James G Scott and Brandon T Willard Proximal Algo-

rithms in Statistics and Machine Learning arXiv preprint arXiv150203175 (2015)118Schapire Robert E The strength of weak learnability Machine learning 52 (1990)

197-227119Cheepurupalli Kusma Kumari and Raja Rajeswari Konduri Noisy reverberation

suppression using adaboost based EMD in underwater scenario International Journal ofOceanography 2014 (2014)

120For details on EMD and its use see the classic paper by Rilling Gabriel PatrickFlandrin and Paulo Goncalves On empirical mode decomposition and its algorithmsIEEE-EURASIP workshop on nonlinear signal and image processing Vol 3 NSIP-03Grado (I) 2003

121See for example1 Kusma Kumari and K Raja Rajeshwari ldquoApplication of EMD as a robust adaptive

signal processing technique in radarsonar communicationsrdquo International Journalof Engineering Science and Technology (IJEST) vol 3 no 12 pp 8262ndash 82662011

2 Kusma Kumari Ch and K Raja Rajeswari ldquoEnhancement of performance measuresusing EMD in noise reduction applicationrdquo International Journal of Computer Ap-plications vol 70no 5 pp 10ndash14 2013

122Karabulut Esra Mahsereci and Turgay Ibrikci Analysis of Cardiotocogram Datafor Fetal Distress Determination by Decision Tree Based Adaptive Boosting ApproachJournal of Computer and Communications 209 (2014) 32

123See Newman DJ Heittech S Blake CL and Merz CJ (1998) UCI Repositoryof Machine Learning Databases University California Irvine Department of Informationand Computer Science

124Creamer Germaacuten and Yoav Freund Automated trading with boosting and expertweighting Quantitative Finance 104 (2010) 401-420

125Sam Kam-Tong and Xiao-Lin Tian Vehicle logo recognition using modest adaboostand radial tchebichef moments International Conference on Machine Learning and Com-puting (ICMLC 2012) 2012

126Markoski Branko et al Application of Ada Boost Algorithm in Basketball PlayerDetection Acta Polytechnica Hungarica 121 (2015)

127Takiguchi Tetsuya et al An adaboost-based weighting method for localizing humanbrain magnetic activity Signal amp Information Processing Association Annual Summit andConference (APSIPA ASC) 2012 Asia-Pacific IEEE 2012

128Note that each sensor weight value calculated by Ada Boost provides an indication ofhow useful the MEG-sensor pair is for vowel recognition

586

NOTES

129See the original paper by Freund Y Schapire R (1996) ldquoExperiments with a NewBoosting Algorithmrdquo In ldquoInternational Conference on Machine Learningrdquo pp 148ndash156

130See the seminal work of Bauer E Kohavi R An Empirical Comparison of VotingClassification Algorithms Bagging Boosting and Variants Journal of Machine Learning199936105139 They report a 27 relative improvement using Ada Boost compared toa single decision tree

Also take a look at the following

1 Ridgeway G The State of Boosting Computing Science and Statistics199931172181

2 Meir R Ratsch G An Introduction to Boosting and Leveraging Advanced Lectureson Machine Learning 2003p 118183

131Gorman R Paul and Terrence J Sejnowski Analysis of hidden units in a layerednetwork trained to classify sonar targets Neural networks 11 (1988) 75-89

132See Cohen J (1960) A coefficient of agreement for nominal data Educational andPsychological Measurement 20 37ndash46

133See for example

1 Buhlmann P Hothorn T Boosting Algorithms Regularization Prediction andModel Fitting (with Discussion) Statistical Science 200722477522

2 Zhou ZH Ensemble Methods Foundations and Algorithms CRCMachine Learningamp Pattern Recognition Chapman amp Hall 2012

3 Schapire RE Freund Y Boosting Foundations and Algorithms MIT Press 2012134See Hastie T Tibshirani R Friedman J H (2009) 10 Boosting and Additive

Trees The Elements of Statistical Learning (2nd ed) New York Springer pp 337ndash384135SAMME stands for stagewise additive modeling using a multi-class exponential loss

function136For additional details see Newey W amp Powell J (1987) lsquoAsymmetric least squares

estimation and testingrsquo Econometrica 55 819ndash847137See for example Wei LJ The accelerated failure time model a useful alternative to

the Cox regression model in survival analysis Statist Med 1992111871ndash1879138See Collett D Modelling Survival Data in Medical Research 2 CRC Press Boca

Raton 2003139See Akaike A A new look at the statistical model identification IEEE Trans Autom

Control 197419716ndash723

587

INDEX

CongratulationsYou made it to the end Here are three things you can do next

1 Pick up your free copy of 12 Resources to Supercharge Your Pro-ductivity in R at http www auscov com tools html

2 Gift a copy of this book to your friends co-workers teammates or yourentire organization

3 If you found this book useful and have a moment to spare I wouldreally appreciate a short review Your help in spreading the word isgratefully received

Good luck

PS Thanks for allowing me to partner with you on your data analysisjourney

591

OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips100 Statistical Tests in R is designedto give you rapid access to one hun-dred of the most popular statisticaltests

It shows you step by step howto carry out these tests in the freeand popular R statistical package

The book was created for the ap-plied researcher whose primary fo-cus is on their subject matter ratherthan mathematical lemmas or statis-tical theory

Step by step examples of eachtest are clearly described and canbe typed directly into R as printedon the page

To accelerate your research ideas over three hundred applications of sta-tistical tests across engineering science and the social sciences are discussed

100 Statistical Tests in R - ORDER YOUR COPY TODAY

INDEX

They Laughed As They Gave Me The DataTo AnalyzeBut Then They Saw My ChartsWish you had fresh ways to presentdata explore relationships visualizeyour data and break free from mun-dane charts and diagrams

Visualizing complex relation-ships with ease using R beginshere

In this book you will find inno-vative ideas to unlock the relation-ships in your own data and createkiller visuals to help you transformyour next presentation from good togreat

Visualizing ComplexData Using R - ORDERYOUR COPY TODAY

593

  • Preface
  • How to Get the Most from this Book
  • I Decision Trees
  • Practical Applications
  • Classification Trees
  • Technique 1 Classification Tree
  • Technique 2 C50 Classification Tree
  • Technique 3 Conditional Inference Classification Tree
  • Technique 4 Evolutionary Classification Tree
  • Technique 5 Oblique Classification Tree
  • Technique 6 Logistic Model Based Recursive Partitioning
  • Technique 7 Probit Model Based Recursive Partitioning
  • Regression Trees for Continuous Response Variables
  • Technique 8 Regression Tree
  • Technique 9 Conditional Inference Regression Tree
  • Technique 10 Linear Model Based Recursive Partitioning
  • Technique 11 Evolutionary Regression Tree
  • Decision Trees for Count amp Ordinal Response Data
  • Technique 12 Poisson Decision Tree
  • Technique 13 Poisson Model Based Recursive Partitioning
  • Technique 14 Conditional Inference Ordinal Response Tree
  • Decision Trees for Survivial Analysis
  • Technique 15 Exponential Algorithm
  • Technique 16 Conditional Inference Survival Tree
  • II Support Vector Machines
  • Practical Applications
  • Support Vector Classification
  • Technique 17 Binary Response Classification with C-SVM
  • Technique 18 Multicategory Classification with C-SVM
  • Technique 19 Multicategory Classification with nu-SVM
  • Technique 20 Bound-constraint C-SVM classification
  • Technique 21 Weston - Watkins Multi-Class SVM
  • Technique 22 Crammer - Singer Multi-Class SVM
  • Support Vector Regression
  • Technique 23 SVM eps-Regression
  • Technique 24 SVM nu-Regression
  • Technique 25 Bound-constraint SVM eps-Regression
  • Support Vector Novelty Detection
  • Technique 26 One-Classification SVM
  • III Relevance Vector Machine
  • Practical Applications
  • Technique 27 RVM Regression
  • IV Neural Networks
  • Practical Applications
  • Examples of Neural Network Classification
  • Technique 28 Resilient Backpropagation with Backtracking
  • Technique 29 Resilient Backpropagation
  • Technique 30 Smallest Learning Rate
  • Technique 31 Probabilistic Neural Network
  • Technique 32 Multilayer Feedforward Neural Network
  • Examples of Neural Network Regression
  • Technique 33 Resilient Backpropagation with Backtracking
  • Technique 34 Resilient Backpropagation
  • Technique 35 Smallest Learning Rate
  • Technique 36 General Regression Neural Network
  • Technique 37 Monotone Multi-Layer Perceptron
  • Technique 38 Quantile Regression Neural Network
  • V Random Forests
  • Practical Applications
  • Technique 39 Classification Random Forest
  • Technique 40 Conditional Inference Classification Random Forest
  • Technique 41 Classification Random Ferns
  • Technique 42 Binary Response Random Forest
  • Technique 43 Binary Response Random Ferns
  • Technique 44 Survival Random Forest
  • Technique 45 Conditional Inference Survival Random Forest
  • Technique 46 Conditional Inference Regression Random Forest
  • Technique 47 Quantile Regression Forests
  • Technique 48 Conditional Inference Ordinal Random Forest
  • VI Cluster Analysis
  • Practical Applications
  • Partition Based Methods
  • Technique 49 K-Means
  • Technique 50 Clara Algorithm
  • Technique 51 PAM Algorithm
  • Technique 52 Kernel Weighted K-Means
  • Hierarchy Based Methods
  • Technique 53 Hierarchical Agglomerative Cluster Analysis
  • Technique 54 Agglomerative Nesting
  • Technique 55 Divisive Hierarchical Clustering
  • Technique 56 Exemplar Based Agglomerative Clustering
  • Fuzzy Methods
  • Technique 57 Rousseeuw-Kaufman's Fuzzy Clustering Method
  • Technique 58 Fuzzy K-Means
  • Technique 59 Fuzzy K-Medoids
  • Other Methods
  • Technique 60 Density-Based Cluster Analysis
  • Technique 61 K-Modes Clustering
  • Technique 62 Model-Based Clustering
  • Technique 63 Clustering of Binary Variables
  • Technique 64 Affinity Propagation Clustering
  • Technique 65 Exemplar-Based Agglomerative Clustering
  • Technique 66 Bagged Clustering
  • VII Boosting
  • Practical Applications
  • Binary Boosting
  • Technique 67 AdaBoost.M1
  • Technique 68 Real Ada Boost
  • Technique 69 Gentle Ada Boost
  • Technique 70 Discrete L2 Boost
  • Technique 71 Real L2 Boost
  • Technique 72 Gentle L2 Boost
  • Multi-Class Boosting
  • Technique 73 SAMME
  • Technique 74 Breiman's Extension
  • Technique 75 Freund's Adjustment
  • Continuous Response Boosted Regression
  • Technique 76 L2 Regression
  • Technique 77 L1 Regression
  • Technique 78 Robust Regression
  • Technique 79 Generalized Additive Model
  • Technique 80 Quantile Regression
  • Technique 81 Expectile Regression
  • Discrete Response Boosted Regression
  • Technique 82 Logistic Regression
  • Technique 83 Probit Regression
  • Boosted Regression for Count & Ordinal Response Data
  • Technique 84 Poisson Regression
  • Technique 85 Negative Binomial Regression
  • Technique 86 Hurdle Regression
  • Technique 87 Proportional Odds Model
  • Boosted Models for Survival Analysis
  • Technique 88 Weibull Accelerated Failure Time Model
  • Technique 89 Lognormal Accelerated Failure Time Model
  • Technique 90 Log-logistic Accelerated Failure Time Model
  • Technique 91 Cox Proportional Hazard Model
  • Technique 92 Gehan Loss Accelerated Failure Time Model

Preface

In writing this text my intention was to collect together in a single place practical predictive modeling techniques, ideas and strategies that have been proven to work but which are rarely taught in business schools, data science courses or contained in any other single text.

On numerous occasions researchers in a wide variety of subject areas have asked "how can I quickly understand and build a particular predictive model?" The answer used to involve reading complex mathematical texts and then programming complicated formulas in languages such as C, C++ and Java. With the rise of R, predictive analytics is now easier than ever. 92 Applied Predictive Modeling Techniques in R is designed to give you rapid access to over ninety of the most popular predictive analytic techniques. It shows you, step by step, how to build each model in the free and popular R statistical package.

The material you are about to read is based on my personal experience, articles I've written, hundreds of scholarly articles I've read over the years, experimentation (some successful, some failed), conversations I've had with data scientists in various fields, and feedback I've received from numerous presentations to people just like you.

This book came out of the desire to put predictive analytic tools in the hands of the practitioner. The material is therefore designed to be used by the applied data scientist whose primary focus is on delivering results rather than mathematical lemmas or statistical theory. Examples of each technique are clearly described and can be typed directly into R as printed on the page.

This book in your hands is an enlarged, revised and updated collection of my previous works on the subject. I've condensed into this volume the best practical ideas available.

Data science is all about extracting meaningful structure from data. It is always a good idea for the data scientist to study how other users and researchers have used a technique in actual practice. This is primarily because practice often differs substantially from the classroom or theoretical text books. To this end, and to accelerate your progress, actual real world applications of the techniques are given at the start of each section.

These illustrative applications cover a vast range of disciplines, incorporating numerous diverse topics such as intelligent shoes, forecasting the stock market, signature authentication, oil sand pump prognostics, detecting deception in speech, electric fish localization, tropical forest carbon mapping, vehicle logo recognition, understanding rat talk and many more. I have also provided detailed references to these applications for further study at the end of each section.

In keeping with the zeitgeist of R, copies of the vast majority of applied articles referenced in this text are available for free.

New users to R can use this book easily and without any prior knowledge. This is best achieved by typing in the examples as they are given and reading the comments which follow. Copies of R and free tutorial guides for beginners can be downloaded at https://www.r-project.org.

I have found over and over that a data scientist who has exposure to a broad range of modeling tools and applications will run circles around the narrowly focused genius who has only been exposed to the tools of their particular discipline.

Greek philosopher Epicurus once said "I write this not for the many, but for you; each of us is enough of an audience for the other." Although the ideas in this book reach out to thousands of individuals, I've tried to keep Epicurus's principle in mind: to have each page you read give meaning to just one person - YOU.

I invite you to put what you read in these pages into action. To help you do that, I've created "12 Resources to Supercharge Your Productivity in R"; it is yours for free. Simply go to http://www.auscov.com/tools.html and download it now. It's my gift to you. It shares with you 12 of the very best resources you can use to boost your productivity in R.

I've spoken to thousands of people over the past few years. I'd love to hear your experiences using the ideas in this book. Contact me with your stories, questions and suggestions at info@NigelDLewis.com.

Now it's your turn!

PS Don't forget to sign-up for your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html


How to Get the Most from this Book

There are at least three ways to use this book. First, you can dip into it as an efficient reference tool. Flip to the technique you need and quickly see how to calculate it in R. For best results, type in the example given in the text, examine the results, and then adjust the example to your own data. Second, browse through the real world examples, illustrations, practitioner tips and notes to stimulate your own research ideas. Third, by working through the numerous examples, you will strengthen your knowledge and understanding of both applied predictive modeling and R.

Each section begins with a brief description of the underlying modeling methodology, followed by a diverse array of real world applications. This is followed by a step by step guide using real data for each predictive analytic technique.

PRACTITIONER TIP

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Enter the following:

> install.packages("installr")
> installr::updateR()

If a package mentioned in the text is not installed on your machine you can download it by typing install.packages("package_name"). For example, to download the ada package you would type in the R console:

> install.packages("ada")

Once a package is installed you must call it. You do this by typing in the R console:

> require(ada)


The ada package is now ready for use. You only need to type this once, at the start of your R session.

PRACTITIONER TIP

You should only download packages from CRAN using encrypted HTTPS connections. This provides much higher assurance that the code you are downloading is from a legitimate CRAN mirror rather than from another server posing as one. Whilst downloading a package from a HTTPS connection you may run into an error message something like:

unable to access index for repository https://cran.rstudio.com

This is particularly common on Windows. The internet2.dll has to be activated on versions before R-3.2.2. If you are using an older version of R, before downloading a new package enter the following:

> setInternet2(TRUE)

Functions in R often have multiple parameters. In the examples in this text I focus primarily on the key parameters required for rapid model development. For information on additional parameters available in a function, type in the R console ?function_name. For example, to find out about additional parameters in the ada function you would type:

> ?ada

Details of the function and additional parameters will appear in your default web browser. After fitting your model of interest you are strongly encouraged to experiment with additional parameters. I have also included the set.seed method in the R code samples throughout this text to assist you in reproducing the results exactly as they appear on the page. R is available for all the major operating systems. Due to the popularity of Windows, examples in this book use the Windows version of R.


PRACTITIONER TIP

Can't remember what you typed two hours ago? Don't worry, neither can I. Provided you are logged into the same R session you simply need to type:

> history(Inf)

It will return your entire history of entered commands for your current session.

You don't have to wait until you have read the entire book to incorporate the ideas into your own analysis. You can experience their marvelous potency for yourself almost immediately. You can go straight to the technique of interest and immediately test, create and exploit it in your own research and analysis.

PRACTITIONER TIP

On 32-bit Windows machines R can only use up to 3Gb of RAM, regardless of how much you have installed. Use the following to check memory availability:

> memory.limit()

To remove all objects from memory:

rm(list=ls())

Applying the ideas in this book will transform your data science practice. If you utilize even one tip or idea from each chapter, you will be far better prepared not just to survive but to excel when faced by the challenges and opportunities of the ever expanding deluge of exploitable data.

Now let's get started!


Part I

Decision Trees


The Basic Idea

We begin with decision trees because they are one of the most popular techniques in data mining.1 They can be applied to both regression and classification problems. Part of the reason for their popularity lies in the ability to present results in a simple, easy to understand tree format. True to its name, the decision tree selects an outcome by descending a tree of possible decisions.

NOTE

Decision trees are, in general, a non-parametric inductive learning technique able to produce classifiers for a given problem which can assess new unseen situations and/or reveal the mechanisms driving a problem.

An Illustrative Example

It will be helpful to build intuition by first looking at a simple example of what this technique does with data. Imagine you are required to build an automatic rules based system to buy cars. The goal is to make decisions as new vehicles are presented to you. Let us say you have access to data on the attributes (also known as features): Road tested miles driven, Price of vehicle, Likability of the current owner (measured on a continuous scale from 0 to 100; 100 = "love them"), Odometer miles, and Age of the vehicle in years.

A total of 100 measurements are obtained on each variable and also on the decision (yes or no to purchase the vehicle). You run this data through a decision tree algorithm and it produces the tree shown in Figure 1. Several things are worth pointing out about this decision tree. First, the number of observations falling in "yes" and "no" is reported. For example, in "Road tested miles < 100" we see there are 71 observations. Second, the tree is immediately able to model new data using the rules developed. Third, it did not use all of the variables to develop a decision rule (Likability of the current owner and Price were excluded).

Let us suppose that this tree classified 80% of the observations correctly. It is still worth investigating whether a more parsimonious and more accurate tree can be obtained. One way to achieve this is to transform the variables. Let us suppose we create the additional variable Odometer/Age and then rebuild the tree. The result is shown in Figure 2. It turns out that this decision tree, which chooses only the transformed data, ignores all the other attributes. Let us assume this tree has a prediction accuracy of 90%. Two important points become evident. First, a more parsimonious and accurate tree was possible; and second, to obtain this tree it was necessary to include the variable transformation in the second run of the decision tree algorithm.

The example illustrates that the decision tree must have the variables supplied in the appropriate form to obtain the most parsimonious tree. In practice, domain experts will often advise on the appropriate transformation of attributes.

Figure 1 Car buying decision tree


Figure 2 Decision tree obtained by transforming variables
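To see how such a tree could be grown in R, here is a minimal sketch on simulated data (the data, the purchase rule used to generate the labels, and all variable names are my own illustration, not the book's):

> set.seed(1)
> cars <- data.frame(
+   road.tested = runif(100, 0, 300),        # road tested miles driven
+   price       = runif(100, 2000, 20000),   # price of vehicle
+   likability  = runif(100, 0, 100),        # likability of the current owner
+   odometer    = runif(100, 10000, 150000), # odometer miles
+   age         = sample(1:15, 100, TRUE))   # age of the vehicle in years
> # hypothetical rule used only to generate the yes/no purchase labels
> cars$buy <- factor(ifelse(cars$road.tested > 100 &
+                           cars$odometer / cars$age < 12000, "yes", "no"))
> library(tree)
> fit <- tree(buy ~ ., data = cars)   # grow the classification tree
> plot(fit); text(fit)                # draw it, in the spirit of Figure 1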

PRACTITIONER TIP

I once developed what I thought was a great statistical model for an area I knew little about. Guess what happened? It was a total flop. Why? Because I did not include domain experts in my design and analysis phase. If you are building a decision tree for knowledge discovery, it is important to include domain experts alongside you and throughout the entire analysis process. Their input will be required to assess the final decision tree and opine on "reasonability".
Another advantage of using domain experts is that the complexity of the final decision tree (in terms of the number of nodes or the number of rules that can be extracted from a tree) may be reduced. Inclusion of domain experts almost always helps the data scientist create a more efficient set of rules.

Decision Trees in Practice

The basic idea behind a decision tree is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. At each internal node in the tree, a single attribute value is compared with a single threshold value. In other words, each node corresponds to a "measurement" of a particular attribute, that is, a question, often of the "yes" or "no" variety, which can be asked about that attribute's value (e.g. "is age less than 48 years?"). One of the two child nodes is then selected based on the result of the comparison, leading either to another measurement or to a leaf.

When a leaf node is reached, the single class associated with that node is the final prediction. In other words, the terminal nodes carry the information required to classify the data.

A real world example of decision trees is shown in Figure 3; they were developed by Koch et al.2 for understanding biological cellular signaling networks. Notice that four trees of increasing complexity are developed by the researchers.

Figure 3: Four decision trees developed by Koch et al. Note that decision trees (A), (B), (C) and (D) have a misclassification error of 30.85%, 24.68%, 21.17% and 18.13% respectively. However, tree (A) is the easiest to interpret. Source: Koch et al.

The Six Advantages of Decision Trees

1. Decision trees can be easy to understand, with intuitively clear rules understandable to domain experts.

2. Decision trees offer the ability to track and evaluate every step in the decision-making process. This is because each path through a tree consists of a combination of attributes which work together to distinguish between classes. This simplicity gives useful insights into the inner workings of the method.

3. Decision trees can handle both nominal and numeric input attributes and are capable of handling data sets that contain misclassified values.

4. Decision trees can easily be programmed for use in real time systems. A great illustration of this is the research of Hailemariam et al.3 who use a decision tree to determine real time building occupancy.

5. They are relatively inexpensive computationally and work well on both large and small data sets. Figure 4 illustrates an example of a very large decision tree used in Bioinformatics;4 the smaller tree represents the tuned version for greater readability.

6. Decision trees are considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution or the classifier structure.

Figure 4: An example of a large Bioinformatics decision tree and the visually tuned version, from the research of Stiglic et al.

NOTE

One weakness of decision trees is the risk of over fitting. This occurs when statistically insignificant patterns end up influencing classification results. An over fitted tree will perform poorly on new data. Bohanec and Bratko5 studied the role of pruning a decision tree for better decision making. They found that pruning can reduce the risk of over fitting because it results in smaller decision trees that exploit fewer attributes.

How Decision Trees Work

Exact implementation details differ somewhat depending on the algorithm used, but the general principles are very similar across methods and contain the following steps:

1. Provide a set (S) of examples with known classified states. This is called the learning set.

2. Select a set of test attributes a1, a2, ..., aN. These can be viewed as the parameters of the system and are selected because they contain essential information about the problem of concern. In the car example on page 6 the attributes were: a1 = Road tested miles driven, a2 = Price of vehicle, a3 = Likability of the current owner, a4 = Odometer miles, a5 = Age of the vehicle in years.

3. Starting at the top node of the tree (often called the root node) with the entire set of examples S, split S using a test on one or more attributes. The goal is to split S into subsets of increasing classification purity.

4. Check the results of the split. If every partition is pure, in the sense that all examples in the partition belong to the same class, then stop. Label each leaf node with the name of the class.

5. Recursively split any partitions that are not "pure".

6. The procedure is stopped when all the newly created nodes are 'terminal' ones containing "pure enough" learning subsets.

Decision tree algorithms vary primarily in how they choose to "split" the data, when to stop splitting, and how they prune the trees they produce.

NOTE

Many of the decision tree algorithms you will encounter are based on a greedy top-down recursive partitioning strategy for tree growth. They use different variants of impurity measures such as information gain,6 gain ratio,7 gini-index8 and distance-based measures.9
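To make the greedy split search concrete, the following sketch (my own illustration, not the book's code) scores every candidate axis-parallel split of one attribute by the Gini index and keeps the best threshold, using the Vehicle data analysed in the techniques that follow:

> gini <- function(y) {                     # Gini impurity of a set of class labels
+   p <- table(y) / length(y)
+   1 - sum(p^2)
+ }
> split.score <- function(cutoff, x, y) {   # impurity after splitting x at cutoff
+   left <- y[x < cutoff]; right <- y[x >= cutoff]
+   (length(left) * gini(left) + length(right) * gini(right)) / length(y)
+ }
> library(mlbench); data(Vehicle)
> thresholds <- sort(unique(Vehicle$Elong))
> scores <- sapply(thresholds, split.score, x = Vehicle$Elong, y = Vehicle$Class)
> thresholds[which.min(scores)]             # threshold a greedy learner would pick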


Practical Applications

Intelligent Shoes for Stroke Victims

Zhang et al.10 develop a wearable shoe (SmartShoe) to monitor physical activity in stroke patients. The data set consisted of 12 patients who had experienced a stroke.

Supervised by a physical therapist, SmartShoe collected data from the patients using eight posture and activity groups: sitting, standing, walking, ascending stairs, descending stairs, cycling on a stationary bike, being pushed in a wheelchair and propelling a wheelchair.

Patients performed each activity for between 1 and 3 minutes. Data was collected from the SmartShoe every 2 seconds, from which feature vectors were computed.

Half the feature vectors were selected at random for the training set. The remainder were used for validation. The C5.0 algorithm was used to build the decision tree.

The researchers constructed both subject specific and group decision trees. The group models were developed using leave-one-out cross-validation.

The performance results for five of the patients are shown in Table 1. As might be expected, the individual models fit better than the group models. For example, for patient 5 the accuracy of the patient specific tree was 98.4%; however, using the group tree the accuracy declined to 75.5%. This difference in performance might be in part due to over fitting of the individual specific tree. The group models were trained using data from multiple subjects and therefore can be expected to have lower overall performance scores.

Patient        1     2     3     4     5   Average
Individual  96.5  97.4  99.8  97.2  98.4      97.9
Group       87.5  91.1  64.7  82.2  75.5      80.2

Table 1: Zhang et al.'s decision tree performance metrics


PRACTITIONER TIP

Decision trees are often validated by calculating sensitivity and specificity. Sensitivity is the ability of the classifier to identify positive results, while specificity is the ability to distinguish negative results.

Sensitivity = NTP / (NTP + NFN) × 100   (1)

Specificity = NTN / (NTN + NFP) × 100   (2)

NTP is the number of true positives, NTN the number of true negatives, NFN the number of false negatives and NFP the number of false positives.
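For example, both measures can be computed in R from a vector of observed and predicted class labels (a small sketch of my own; the labels are made up):

> observed  <- c("pos", "pos", "neg", "neg", "pos", "neg")
> predicted <- c("pos", "neg", "neg", "neg", "pos", "pos")
> NTP <- sum(predicted == "pos" & observed == "pos")   # true positives
> NTN <- sum(predicted == "neg" & observed == "neg")   # true negatives
> NFN <- sum(predicted == "neg" & observed == "pos")   # false negatives
> NFP <- sum(predicted == "pos" & observed == "neg")   # false positives
> sensitivity <- NTP / (NTP + NFN) * 100
> specificity <- NTN / (NTN + NFP) * 100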

Micro Ribonucleic Acid

MicroRNAs (miRNAs) are non-protein coding Ribonucleic acids (RNAs) that attenuate protein production in P bodies.11 Williams et al.12 develop a MicroRNA decision tree. For the training set the researchers used known miRNAs from various plant species as positive controls and non-miRNA sequences as negative controls.

The typical size of their training set consisted of 5294 cases using 29 attributes. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation.

After training, the researchers focus on attribute usage information. Table 2 shows the top ten attribute usage for a typical training run. The researchers report that other training runs show similar usage. The values represent the percentage of sequences that required that attribute for classification. Several attributes, such as DuplexEnergy, minMatchPercent and C content, are required for all sequences to be classified. Note that G and C are directly related to the stability of the duplex. Sensitivity and specificity were as high as 84.08% and 98.53% respectively.

An interesting question is: if all miRNAs in each taxonomic category studied by the researchers are systematically excluded from training while including all others, how well does the predictor do when tested on the excluded category? Table 3 provides the answer. The ability to correctly identify known miRNAs ranged from 78% for the Salicaceae to 100% for seven of the groups shown in Table 3. The researchers conclude by stating "We have


Usage   Attribute
100     G
100     C
100     T
100     DuplexEnergy
100     minMatchPercent
100     DeltaGnorm
100     G + T
100     G + C
98      duplexEnergyNorm
86      NormEnergyRatio

Table 2: Top ten attribute usage for one training run of the classifier reported by Williams et al.

shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0."

Taxonomic Group    Correctly classified    % of full set excluded
Embryophyta              94                        9.16
Lycopodiophyta          100                        2.65
Brassicaceae            100                       20.22
Caricaceae              100                        0.05
Euphorbiaceae           100                        0.34
Fabaceae                100                       27.00
Salicaceae               78                        3.52
Solanaceae               93                        0.68
Vitaceae                 94                        4.29
Rutaceae                100                        0.43
Panicoideae              95                        8.00
Poaceae                 100                       19.48
Pooideae                 80                        3.18

Table 3: Results from exclusion of each of the 13 taxonomic groups by Williams et al.


NOTE

A decision tree produced by an algorithm is usually not optimal in the sense of statistical performance measures such as the log-likelihood, squared errors and so on. It turns out that finding the "optimal tree", if one exists, is computationally intractable (or NP-hard, technically speaking).

Acute Liver Failure

Nakayama et al.13 use decision trees for the prognosis of acute liver failure (ALF) patients. The data set consisted of 1022 ALF patients seen between 1998 and 2007 (698 patients seen between 1998 and 2003 and 324 patients seen between 2004 and 2007).

Measurements on 73 medical attributes at the onset of hepatic encephalopathy14 and 5 days later were collected from 371 of the 698 patients seen between 1998 and 2003.

Two decision trees were built. The first was used to predict (using 5 attributes) the outcome of the patient at the onset of hepatic encephalopathy. The second decision tree was used to predict (using 7 attributes) the outcome at 5 days after the onset of grade II or more severe hepatic encephalopathy. The decision trees were validated using data from 160 of the 324 patients seen between 2004 and 2007. Decision tree performance is shown in Table 4.

                                Decision Tree I       Decision Tree II
                                Outcome at onset      Outcome at 5 days
Accuracy (patients 1998-2003)        79.0                  83.6
Accuracy (patients 2004-2007)        77.6                  82.6

Table 4: Nakayama et al.'s decision tree performance metrics


NOTE

The performance of a decision tree is often measured in terms of three characteristics:

• Accuracy - The percentage of cases correctly classified.

• Sensitivity - The percentage of cases correctly classified as belonging to class A among all observations known to belong to class A.

• Specificity - The percentage of cases correctly classified as belonging to class B among all observations known to belong to class B.

Traffic Accidents

de Oña et al.15 investigate the use of decision trees for analyzing road accidents on rural highways in the province of Granada, Spain. Regression-type generalized linear models, Logit models and Probit models have been the techniques most commonly used to conduct such analyses.16

Three decision tree models are developed (CART, ID3 and C4.5) using data collected from 2003 to 2009. Nineteen independent variables, reported in Table 5, are used to build the decision trees.

Accident type        Age                   Atmospheric factors
Safety barriers      Cause                 Day of week
Lane width           Lighting              Month
Number of injuries   Number of occupants   Paved shoulder
Pavement width       Pavement markings     Gender
Shoulder type        Sight distance        Time
Vehicle type

Table 5: Variables used from police accident reports by de Oña et al.

The accuracy results of their analysis are shown in Table 6. Overall, the decision trees showed modest improvement over chance.


CART     C4.5     ID3
55.87    54.16    52.72

Table 6: Accuracy results (percentage) reported by de Oña et al.

PRACTITIONER TIP

Even though de Oña et al. tested three different decision tree algorithms (CART, ID3 and C4.5), their results led to a very modest improvement over chance. This will also happen to you on very many occasions. Rather than clinging on to a technique (because it is the latest technique or the one you happen to be most familiar with), the professional data scientist seeks out and tests alternative methods. In this text you have over ninety of the best applied modeling techniques at your fingertips. If decision trees don't "cut it", try something else.


Electrical Power Losses

A non-technical loss (NTL) is defined by electrical power companies as any consumed electricity which is not billed. This could be because of measurement equipment failure or fraud. Traditionally, power utilities have monitored NTL by making in situ inspections of equipment, especially for those customers that have very high or close to zero levels of energy consumption. Monedero et al.17 develop a CART decision tree to enhance the detection rate of NTL.

A sample of 38,575 customer accounts was collected over two years in Catalonia, Spain. For each customer various indicators were measured.18

The best decision tree had a depth of 5, with the terminal node identifying customers with the highest likelihood of NTL. A total of 176 customers were identified by the model. This number was greater than the expected total of 85 and too many for the utility company to inspect in situ. The researchers therefore merge their results with a Bayesian network model and thereby reduce the estimated number to 64.

This example illustrates the important point that the predictive model is just one aspect that goes into real world decisions. Oftentimes a model will be developed but deemed impracticable by the end user.

NOTE

Cross-validation refers to a technique used to allow for the training and testing of inductive models. Williams et al. used leave-one-out cross-validation. Leave-one-out cross-validation involves taking out one observation from your sample and training the model with the rest. The predictor just trained is applied to the excluded observation. One of two possibilities will occur: the predictor is correct on the previously unseen control, or not. The removed observation is then returned, the next observation is removed, and again training and testing are done. This process is repeated for all observations. When this is completed, the results are used to calculate the accuracy of the model, often measured in terms of sensitivity and specificity.
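A minimal sketch of leave-one-out cross-validation for a classification tree, using the tree package and the Vehicle data employed later in this book (my own illustration, and slow to run because the tree is refitted once per observation):

> library(tree); library(mlbench); data(Vehicle)
> n <- nrow(Vehicle)
> correct <- logical(n)
> for (i in 1:n) {
+   fit  <- tree(Class ~ ., data = Vehicle[-i, ])         # train without observation i
+   pred <- predict(fit, newdata = Vehicle[i, ], type = "class")
+   correct[i] <- (pred == Vehicle$Class[i])              # test on the held out case
+ }
> mean(correct)    # leave-one-out estimate of accuracy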


Tracking Tetrahymena Pyriformis Cells

Tracking and matching individual biological cells in real time is a challenging task. Since decision trees provide an excellent tool for real time and rapid decision making, they have potential in this area. Wang et al.19 consider the issue of real time tracking and classification of Tetrahymena Pyriformis cells. The issue is whether a cell in the current video frame is the same cell in the previous video frame.

A 23-dimensional feature vector is developed and whether two regions in different frames represent the same cell is manually determined. The training set consisted of 1000 frames from 2 videos. The videos were captured at 8 frames per second, with each frame an 8-bit gray level image. The researchers develop two decision trees, the first (T1) trained using the feature vector and manual classification, and the second (T2) trained using a truncated set of features. The error rates for each tree by tree depth are reported in Table 7. Notice that in this case T1 substantially outperforms T2, indicating the importance of using the full feature set.

Tree Depth     T1      T2
    5         1.48   14.02
    8         1.37   12.49
   10         1.55   13.61

Table 7: Error rates by tree depth for T1 & T2 reported by Wang et al.

PRACTITIONER TIP

Cross-validation is often used to prevent over-fitting a model to the data. In n-fold cross-validation we first divide the training set into n subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining (n-1) subsets. Thus each instance of the whole training set is predicted once. The advantage of this method over a random selection of training samples is that all observations are used for both training (n-1 times) and evaluation (once). Cross-validation accuracy is measured as the percentage of data that are correctly classified.
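A sketch of 10-fold cross-validation along these lines (my own illustration, again using the Vehicle data and the tree package):

> library(tree); library(mlbench); data(Vehicle)
> set.seed(107)
> folds <- sample(rep(1:10, length.out = nrow(Vehicle)))  # assign each row to a fold
> acc <- numeric(10)
> for (k in 1:10) {
+   test   <- which(folds == k)
+   fit    <- tree(Class ~ ., data = Vehicle[-test, ])    # train on the other 9 folds
+   pred   <- predict(fit, newdata = Vehicle[test, ], type = "class")
+   acc[k] <- mean(pred == Vehicle$Class[test])           # accuracy on the held out fold
+ }
> mean(acc)    # cross-validation accuracy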


Classification Trees


Technique 1

Classification Tree

A classification tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the data frame of classes; and data, the data set of attributes with which you wish to build the tree.

PRACTITIONER TIP

To obtain information on which version of R you are running, loaded packages and other information, use:

> sessionInfo()

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(tree)
> library(mlbench)
> data(Vehicle)


NOTE

The Vehicle data set20 was collected to classify a Corgi model vehicle silhouette as one of four types (double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400). The data frame contains 846 observations on 18 numerical features extracted from the silhouettes and one nominal variable defining the class of the objects (see Table 8).

You can access the features directly using Table 8 as a reference. For example, a summary of the Comp and Circ features is obtained by typing:

> summary(Vehicle[1])
      Comp       
 Min.   : 73.00  
 1st Qu.: 87.00  
 Median : 93.00  
 Mean   : 93.68  
 3rd Qu.:100.00  
 Max.   :119.00  

> summary(Vehicle[2])
      Circ      
 Min.   :33.00  
 1st Qu.:40.00  
 Median :44.00  
 Mean   :44.86  
 3rd Qu.:49.00  
 Max.   :59.00  

Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Index   R name          Description
1       Comp            Compactness
2       Circ            Circularity
3       D.Circ          Distance Circularity
4       Rad.Ra          Radius ratio
5       Pr.Axis.Ra      pr.axis aspect ratio
6       Max.L.Ra        max.length aspect ratio
7       Scat.Ra         scatter ratio
8       Elong           elongatedness
9       Pr.Axis.Rect    pr.axis rectangularity
10      Max.L.Rect      max.length rectangularity
11      Sc.Var.Maxis    scaled variance along major axis
12      Sc.Var.maxis    scaled variance along minor axis
13      Ra.Gyr          scaled radius of gyration
14      Skew.Maxis      skewness about major axis
15      Skew.maxis      skewness about minor axis
16      Kurt.maxis      kurtosis about minor axis
17      Kurt.Maxis      kurtosis about major axis
18      Holl.Ra         hollows ratio
19      Class           type: bus, opel, saab, van

Table 8: Attributes and Class labels for the vehicle silhouettes data set


Step 3 Estimate the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- tree(Class ~ ., data = Vehicle[train, ], split = "deviance")

We use deviance as the splitting criterion; a common alternative is to use split = "gini". You may be surprised by just how quickly R builds the tree.

PRACTITIONER TIP

It is important to remember that the response variable is a factor. This actually trips up quite a few users who have numeric categories and forget to convert their response variable using factor(). To see the levels of the response variable type:

> attributes(Vehicle$Class)
$levels
[1] "bus"  "opel" "saab" "van" 

$class
[1] "factor"

For Vehicle each level is associated with a different vehicle type (bus, opel, saab, van).

To see details of the fitted tree type:

> fit

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 500 1386.000 saab ( 0.248000 0.254000 0.258000 0.240000 )
   2) Elong < 41.5 229 489.100 opel ( 0.222707 0.410480 0.366812 0.000000 )
...
  14) Skew.Maxis < 64.5 7 9.561 van ( 0.000000 0.000000 0.428571 0.571429 ) *
  15) Skew.Maxis > 64.5 64 10.300 van ( 0.015625 0.000000 0.000000 0.984375 ) *

At each branch of the tree (after root) we see, in order:

1. The branch number (e.g. in this case 1, 2, 14 and 15);

2. the split (e.g. Elong < 41.5);

3. the number of samples going along that split (e.g. 229);

4. the deviance associated with that split (e.g. 489.1);

5. the predicted class (e.g. opel);

6. the associated probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));

7. and, for a terminal node (or leaf), the symbol *.

PRACTITIONER TIP

If the minimum deviance occurs with a tree with 1 node, then your model is at best no better than random. It is even possible that it may be worse.

A summary of the tree can also be obtained:

> summary(fit)

Classification tree:
tree(formula = Class ~ ., data = Vehicle[train, ],
    split = "deviance")
Variables actually used in tree construction:
 [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
 [6] "Max.L.Rect"   "D.Circ"       "Skew.maxis"   "Circ"         "Kurt.Maxis"  
[11] "Skew.Maxis"  
Number of terminal nodes:  15 
Residual mean deviance:  0.9381 = 455 / 485 
Misclassification error rate: 0.232 = 116 / 500 

Notice that summary(fit) shows:

1. The type of tree, in this case a Classification tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 15;

5. the residual mean deviance - 0.9381;

6. the misclassification error rate, 0.232 or 23.2%.

We plot the tree, see Figure 1.1:

> plot(fit); text(fit)


Figure 1.1: Fitted Decision Tree

PRACTITIONER TIP

The height of the vertical lines in Figure 1.1 is proportional to the reduction in deviance. The longer the line, the larger the reduction. This allows you to identify the important sections immediately. If you wish to plot the model using uniform lengths, use plot(fit, type = "uniform").


Step 4 Assess Model

Unfortunately, classification trees have a tendency to over-fit the data. One approach to reduce this risk is to use cross-validation. For each hold out sample we fit the model and note at what level the tree gives the best results (using deviance or the misclassification rate). Then we hold out a different sample and repeat. This can be carried out using the cv.tree() function. We use a leave-one-out cross-validation using the misclassification rate and deviance (FUN = prune.misclass, followed by FUN = prune.tree).

NOTE

Textbooks and academics used to spend an inordinate amount of time on the subject of when to stop splitting a tree and also pruning techniques. This is indeed an important aspect to consider when building a single tree, because if the tree is too large it will tend to over fit the data. If the tree is too small it might misclassify important characteristics of the relationship between the covariates and the outcome. In actual practice I do not spend a great deal of time on deciding when to stop splitting a tree, or even pruning. This is partly because:

1. A single tree is generally only of interest to gain insight about the data if it can be easily interpreted. The default settings in R decision tree functions are often sufficient to create such trees.

2. For use in "pure" prediction activities, random forests (see page 272) have largely replaced individual decision trees because they often produce more accurate predictive models.

The results are plotted out side by side in Figure 1.2. The jagged lines show where the minimum deviance / misclassification occurred with the cross-validated tree. Since the cross validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitMcv <- cv.tree(fit, K = 346, FUN = prune.misclass)

> fitPcv <- cv.tree(fit, K = 346, FUN = prune.tree)


> par(mfrow = c(1, 2))
> plot(fitMcv)
> plot(fitPcv)

Figure 1.2: Cross-validation results on Vehicle using misclassification and deviance
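Had the cross-validation pointed to a smaller tree, it could be pruned back before making predictions. A short sketch (the choice of 8 terminal nodes is purely illustrative):

> fit.pruned <- prune.misclass(fit, best = 8)   # keep the best sub-tree with 8 leaves
> plot(fit.pruned); text(fit.pruned)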

Step 5 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 32%.


> pred <- predict(fit, newdata = Vehicle[-train, ])

> pred.class <- colnames(pred)[max.col(pred, ties.method = c("random"))]

> table(Vehicle$Class[-train], pred.class, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   86    1    3   4
          opel   1   55   20   9
          saab   4   55   23   6
          van    2    2    5  70

> error_rate = (1 - sum(pred.class == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.324


Technique 2

C5.0 Classification Tree

The C5.0 algorithm21 is based on the concepts of entropy, the measure of disorder in a sample, and the information gain of each attribute. Information gain is a measure of the effectiveness of an attribute in reducing the amount of entropy in the sample.

It begins by calculating the entropy of a data sample. The next step is to calculate the information gain for each attribute. This is the expected reduction in entropy from partitioning the data set on the given attribute. From the set of information gain values the best attributes for partitioning the data set are chosen and the decision tree is built.
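To make these two quantities concrete, here is a small sketch (my own illustration, not part of the C50 package) computing the entropy of the class labels and the information gain of a candidate binary split:

> entropy <- function(y) {
+   p <- table(y) / length(y)
+   p <- p[p > 0]                      # drop empty classes to avoid log(0)
+   -sum(p * log2(p))                  # entropy in bits
+ }
> info.gain <- function(y, split) {    # split is a logical vector defining two groups
+   w <- mean(split)
+   entropy(y) - (w * entropy(y[split]) + (1 - w) * entropy(y[!split]))
+ }
> library(mlbench); data(Vehicle)
> info.gain(Vehicle$Class, Vehicle$Elong < 41.5)   # gain from splitting Elong at 41.5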

A C5.0 classification tree can be built using the package C50 with the C5.0 function:

C5.0(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(C50)
> library(mlbench)
> data(Vehicle)


Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- C5.0(Class ~ ., data = Vehicle[train, ])

Next we assess variable importance using the C5imp function:

> C5imp(fit)
             Overall
Max.L.Ra       100.0
Elong          100.0
Comp            50.2
Circ            46.2
Skew.maxis      45.2
Scat.Ra         40.0
Max.L.Rect      29.2
Ra.Gyr          23.4
D.Circ          23.2
Skew.Maxis      20.2
Pr.Axis.Rect    17.0
Kurt.maxis      12.2
Pr.Axis.Ra       9.0
Rad.Ra           3.0
Holl.Ra          3.0
Sc.Var.maxis     0.0
Kurt.Maxis       0.0

We observe that Max.L.Ra and Elong are the two most influential attributes. The attributes Sc.Var.maxis and Kurt.Maxis are the least influential variables, with an influence score of zero.


PRACTITIONER TIP

To assess the importance of attributes by split use:

> C5imp(fit, metric = "splits")

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 27.5%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   84    1    5   4
          opel   0   48   31   6
          saab   3   38   47   0
          van    2    1    4  72

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.275


PRACTITIONER TIP

To view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

> head(round(pred, 3))
     bus  opel  saab   van
5  0.018 0.004 0.004 0.974
7  0.050 0.051 0.852 0.048
10 0.011 0.228 0.750 0.010
12 0.031 0.032 0.782 0.155
15 0.916 0.028 0.029 0.027
19 0.014 0.070 0.903 0.013


Technique 3

Conditional Inference Classification Tree

A conditional inference classification tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about the distribution of classes in the terminal nodes.22 It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(party)
> library(mlbench)
> data(Vehicle)

Step 2 is outlined on page 23

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- ctree(Class ~ ., data = Vehicle[train, ],
               controls = ctree_control(maxdepth = 2))


Notice we use controls with the maxdepth parameter to limit the depth of the tree to at most 2. Note that the default, maxdepth = 0, places no restriction on tree depth.

Next we plot the tree:

> plot(fit)

The resultant tree is shown in Figure 3.1. At each internal node a p-value is reported for the split. In this case they are all highly significant (less than 1%). The primary split takes place at Elong <= 41 / Elong > 41. The four terminal nodes are labeled Node 3, Node 4, Node 6 and Node 7, with 62, 167, 88 and 183 observations respectively. Each of these leaf nodes also has a bar chart illustrating the proportion of the four vehicle types that fall into each class at that node.


Figure 3.1: Conditional Inference Classification Tree for Vehicle with maxdepth = 2

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, we show the confusion matrix and calculate the error rate of fit. The misclassification rate is approximately 53% for this tree.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   36    0   11  47
          opel   3   50   24   8
          saab   6   58   19   5
          van    0    0   21  58

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.529

PRACTITIONER TIP

Set the parameter type = "node" to see which nodes the observations end up in, and type = "prob" to view the probabilities. For example, to see the distribution for the validation sample type:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
> table(pred)
pred
  3   4   6   7 
 45 108  75 118 

We see that 45 observations ended up in node 3 and 118 in node 7.

To assess the predictive power of fit we compare it against two other conditional inference classification trees - fit3, which limits the maximum tree depth to 3, and fitu, which estimates an unrestricted tree.

> fit3 <- ctree(Class ~ ., data = Vehicle[train, ],
                controls = ctree_control(maxdepth = 3))

> fitu <- ctree(Class ~ ., data = Vehicle[train, ])

We use the validation data set and the fitted decision trees to predict vehicle classes:


> pred3 <- predict(fit3, newdata = Vehicle[-train, ], type = "response")

> predu <- predict(fitu, newdata = Vehicle[-train, ], type = "response")

Next we calculate the error rate of each fitted tree:

> error_rate3 = (1 - sum(pred3 == Vehicle$Class[-train]) / 346)

> error_rateu = (1 - sum(predu == Vehicle$Class[-train]) / 346)

> tree_1 <- round(error_rate, 3)
> tree_2 <- round(error_rate3, 3)
> tree_3 <- round(error_rateu, 3)

Finally we calculate the misclassification error rate for each fitted tree:

> err <- cbind(tree_1, tree_2, tree_3) * 100
> rownames(err) <- "error (%)"

> err
          tree_1 tree_2 tree_3
error (%)   52.9   41.6     37

The unrestricted tree has the lowest misclassification error rate of 37%.


Technique 4

Evolutionary Classification Tree

NOTE

The recursive partitioning methods discussed in previous sections build the decision tree using a forward stepwise search, where splits are chosen to maximize homogeneity at the next step only. Although this approach is known to be an efficient heuristic, the results are only locally optimal. Evolutionary algorithm based trees search over the parameter space of trees using a global optimization method.
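Because the search is stochastic and global, the evtree package exposes parameters that control the evolutionary run via evtree.control. A hedged sketch (the argument names are as documented in the package; the specific values are illustrative only):

> library(evtree)
> ctrl <- evtree.control(maxdepth = 4,       # cap the depth of candidate trees
+                        niterations = 5000, # iterations of the evolutionary search
+                        ntrees = 100,       # size of the tree population
+                        alpha = 1)          # complexity penalty in the evaluation function
> # fit <- evtree(Class ~ ., data = Vehicle[train, ], control = ctrl)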

An evolutionary classification tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the response variable z, which contains the classes, and data, the data set of attributes with which you wish to build the tree.

Step 1 Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(evtree)
> library(mlbench)
> data(Vehicle)


Step 2 Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. We use evtree to build the tree and plot to display the tree. We restrict the tree depth using maxdepth = 2:

> fit <- evtree(Class ~ ., data = Vehicle[train, ],
                control = evtree.control(maxdepth = 2))

> plot(fit)


Figure 4.1: Fitted Evolutionary Classification Tree using Vehicle

The tree shown in Figure 4.1 visualizes the decision rules. It also shows, for the terminal nodes, the number of observations and their distribution amongst the classes. Let us take node 3 as an illustration. This node is reached by the rule Max.L.Ra < 8 and Sc.Var.maxis < 290. The node contains 64 observations. Similar details can be obtained by typing fit:

> fit

Model formula:
Class ~ Comp + Circ + D.Circ + Rad.Ra + Pr.Axis.Ra + Max.L.Ra +
    Scat.Ra + Elong + Pr.Axis.Rect + Max.L.Rect + Sc.Var.Maxis +
    Sc.Var.maxis + Ra.Gyr + Skew.Maxis + Skew.maxis + Kurt.maxis +
    Kurt.Maxis + Holl.Ra

Fitted party:
[1] root
|   [2] Max.L.Ra < 8
|   |   [3] Sc.Var.maxis < 290: van (n = 64, err = 45.3%)
|   |   [4] Sc.Var.maxis >= 290: bus (n = 146, err = 27.4%)
|   [5] Max.L.Ra >= 8
|   |   [6] Sc.Var.maxis < 389: van (n = 123, err = 30.9%)
|   |   [7] Sc.Var.maxis >= 389: opel (n = 167, err = 47.3%)

Number of inner nodes:    3
Number of terminal nodes: 4

Step 4 Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 39%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   85    0    0   9
          opel  14   48    0  23
          saab  18   57    0  13
          van    0    1    0  78

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.39

13 PRACTITIONER TIP

When using predict you can specify any of thefollowing via type = ldquoresponserdquo ldquoprobrdquo ldquoquantilerdquoldquodensityrdquo or ldquonoderdquo For example to view the estimatedprobabilities of each class usegt predlt-predict(fit newdata=Vehicle[-train

]type = prob)

To see the distribution across nodes you would entergt predlt-predict(fit newdata=Vehicle[-train

]type = node)

Finally we estimate a unrestricted model use the validation data set topredict vehicle classes display the confusion matrix and calculate the errorrate of the unrestricted fitted tree In this case the misclassification error rateis lower at 347gt fitlt- evtree(Class ~ data = Vehicle[train ] )

gt predlt-predict(fit newdata=Vehicle[-train ]type =response)

gt table(Vehicle$Class[-train]pred dnn=c( ObservedClassPredicted Class ))

Predicted ClassObserved Class bus opel saab van

bus 72 3 15 4opel 4 30 41 10saab 6 23 55 4van 3 1 6 69

gt error_rate = (1-sum(pred== Vehicle$Class[-train ])346)

45

92 Applied Predictive Modeling Techniques in R

gt round(error_rate 3)[1] 0347

46

Technique 5

Oblique Classification Tree

NOTE

A common question is: what is different about oblique trees? Here is the answer in a nutshell. For a set of attributes $X_1, \ldots, X_k$ the standard classification tree produces binary partitioned trees by considering axis-parallel splits over continuous attributes (i.e. grown using tests of the form $X_i < C$ versus $X_i \ge C$). This is the most widely used approach to tree growth. Oblique trees are grown using oblique splits (i.e. grown using tests of the form $\sum_{i=1}^{k} \alpha_i X_i < C$ versus $\sum_{i=1}^{k} \alpha_i X_i \ge C$). So for axis-parallel splits a single attribute is used; for oblique splits a weighted combination of attributes is used.

An oblique classification tree model can be estimated using the package oblique.tree with the oblique.tree function:

oblique.tree(z ~ ., data = ..., oblique.splits = "only", control = tree.control(...), split.impurity = ...)

Key parameters include the response variable Z, which contains the classes; data, the data set of attributes with which you wish to build the tree; control, which takes arguments from tree.control in the tree package; and split.impurity, which controls the splitting criterion used and takes the values "deviance" or "gini".


Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(oblique.tree)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. Three attributes are used to build the tree (Max.L.Ra, Sc.Var.maxis and Elong).

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> f <- Class ~ Max.L.Ra + Sc.Var.maxis + Elong

Step 3: Estimate and Assess the Decision Tree

Before estimating the tree we tweak the following:

1. Only allow oblique splits (oblique.splits = "only").

2. Use tree.control to indicate the number of observations (nobs = 500) and set the minimum number of observations in a node to 60 (mincut = 60).

The tree is estimated as follows:

> fit <- oblique.tree(f, data = Vehicle[train,], oblique.splits = "only", control = tree.control(nobs = 500, mincut = 60), split.impurity = "deviance")


PRACTITIONER TIP

The type of split is indicated by the oblique.splits argument. For our analysis we use oblique.splits = "only" to grow trees that only use oblique splits. Use oblique.splits = "on" to grow trees that use both oblique and axis-parallel splits, and oblique.splits = "off" to grow traditional classification trees (only axis-parallel splits).
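If you are unsure which setting suits your data, one option is to fit the tree under each setting and compare validation error rates. The loop below is our sketch (not part of the original walk-through) and reuses only the arguments already described above:

# Sketch: compare the three oblique.splits settings on the validation sample
for (s in c("only", "on", "off")) {
  m <- oblique.tree(f, data = Vehicle[train,], oblique.splits = s,
                    control = tree.control(nobs = 500, mincut = 60),
                    split.impurity = "deviance")
  p <- predict(m, newdata = Vehicle[-train,], type = "class")
  cat(s, ": error rate =", round(mean(p != Vehicle$Class[-train]), 3), "\n")
}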

Details of the tree can be visualized using a combination of plot and text; see Figure 5.1:

> plot(fit); text(fit)


Figure 5.1: Fitted Oblique tree using a subset of Vehicle attributes

The tree visualizes the decision rules. The first split occurs where

6.4 - 1.42 Max.L.Ra + 0.03 Sc.Var.maxis - 0.12 Elong < 0

If the above holds it leads directly to the leaf node indicating class type = van. Full details of the tree can be obtained by typing fit, whilst summary gives an overview of the fitted tree:

> summary(fit)

Classification tree:
oblique.tree(formula = f, data = Vehicle[train,], control = tree.control(nobs = 500,
    mincut = 60), split.impurity = "deviance", oblique.splits = "only")
Variables actually used in tree construction:
[1]
Number of terminal nodes:  5
Residual mean deviance:  1.766 = 874.2 / 495
Misclassification error rate: 0 = 0 / 500

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; then we display the confusion matrix and calculate the error rate of the fitted tree. Overall the model has an error rate of 39.9%.

> pred <- predict(fit, newdata = Vehicle[-train,], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))
              Predicted Class
Observed Class bus opel saab van
          bus   77    0   13   4
          opel  17   37   20  11
          saab  23   34   27   4
          van    8    0    4  67

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.399


Technique 6

Logistic Model Based Recursive Partitioning

A model based recursive partitioning logistic regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = glinearModel, family = binomial())

Key parameters include the binary response variable Z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Step 1: Load Required Packages

We build the decision tree using the data frame PimaIndiansDiabetes2 contained in the mlbench package:

> library(party)
> data(PimaIndiansDiabetes2, package = "mlbench")

NOTE

The PimaIndiansDiabetes2 data set was collected by the National Institute of Diabetes and Digestive and Kidney Diseases.23 It contains 768 observations on 9 variables measured on females at least 21 years old of Pima Indian heritage. Table 9 contains a description of each of the variables.


Name       Description
pregnant   Number of times pregnant
glucose    Plasma glucose concentration (glucose tolerance test)
pressure   Diastolic blood pressure (mm Hg)
triceps    Triceps skin fold thickness (mm)
insulin    2-Hour serum insulin (mu U/ml)
mass       Body mass index
pedigree   Diabetes pedigree function
age        Age (years)
diabetes   Test for diabetes - class variable (neg / pos)

Table 9: Response and independent variables in the PimaIndiansDiabetes2 data frame

Step 2: Prepare Data & Tweak Parameters

For our analysis we use 600 of the 768 observations to train the model. The response variable is diabetes, and we use mass and pedigree as logistic regression conditioning variables, with the remaining six variables (glucose, pregnant, pressure, triceps, insulin and age) being used as the partitioning variables. The model is stored in f:

> set.seed(898)
> n = nrow(PimaIndiansDiabetes2)
> train <- sample(1:n, 600, FALSE)
> f <- diabetes ~ mass + pedigree | glucose + pregnant + pressure + triceps + insulin + age


NOTE

The PimaIndiansDiabetes2 data set has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. In traditional statistical analysis these values would have to be removed or interpolated. However, ignoring missing values, or treating them as another category, is often inefficient. A more efficient use of the available information, adopted by many decision tree algorithms, is to ignore the missing data point in the evaluation of a split but distribute it to the child nodes using a given rule. For example, the rule might:

1. Distribute missing values to the node which has the largest number of instances.

2. Distribute to all child nodes, with diminished weights proportional to the number of instances in each child node.

3. Randomly distribute to one single child node.

4. Create surrogate attributes which closely resemble the test attributes and use them to send missing values to child nodes.

A quick count of the missing values by attribute is sketched below.
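This check is not part of the original walk-through; it is a short base-R sketch applied to the data frame loaded above:

> colSums(is.na(PimaIndiansDiabetes2))           # missing values per attribute
> round(colMeans(is.na(PimaIndiansDiabetes2)), 2) # proportion missing per attribute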

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and then use the plot function to visualize the tree, as shown in Figure 6.1. Since the response variable diabetes is binary, and mass and pedigree are numeric, a spinogram is used for visualization. The plots in the leaves give spinograms for diabetes versus mass (upper panel) and pedigree (lower panel).

> fit <- mob(f, data = PimaIndiansDiabetes2[train,], model = glinearModel, family = binomial())

> plot(fit)


PRACTITIONER TIP

As an alternative to using spinograms you can also plot the cumulative density function using the argument tp_args = list(cdplot = TRUE) in the plot function. In the example in this section you would type:

> plot(fit, tp_args = list(cdplot = TRUE))

You can also specify the smoothing bandwidth using the bw argument. For example:

> plot(fmPID, tp_args = list(cdplot = TRUE, bw = 15))


Figure 6.1: Logistic-regression-based tree for the Pima Indians diabetes data

The fitted lines are the mean predicted probabilities in each group. The decision tree distinguishes four different groups of women:

• Node 3: Women with low glucose and 26 years or younger have on average a low risk of diabetes; however this increases with mass but decreases slightly with pedigree.

• Node 4: Women with low glucose and older than 26 years have on average a moderate risk of diabetes, which increases with mass and pedigree.

• Node 5: Women with glucose in the range 127 to 165 have on average a moderate to high risk of diabetes, which increases with mass and pedigree.

• Node 7: Women with glucose greater than 165 have on average a high risk of diabetes, which increases with mass and decreases with pedigree.

The same interpretation can also be drawn from the coefficient estimates obtained using the coef function:

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -7.263 0.143   -0.680
4      -4.887 0.076    2.495
6      -5.711 0.149    1.365
7      -3.216 0.225   -2.193

When comparing models it can be useful to have the value of the log-likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -115.4345 (df=15)

> AIC(fit)
[1] 260.8691

Step 5: Make Predictions

We use the function predict to calculate the fitted values on the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train,])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
      predicted
actual neg pos
   neg  92  17
   pos  26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2
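The confusion table also supports summary measures other than overall error. As a sketch (not in the original text), sensitivity and specificity for the "pos" class can be read straight off tb; the row/column layout assumed here is the one printed above (rows = actual, columns = predicted):

sens <- tb["pos", "pos"] / sum(tb["pos", ])   # true positive rate
spec <- tb["neg", "neg"] / sum(tb["neg", ])   # true negative rate
round(c(sensitivity = sens, specificity = spec), 3)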

PRACTITIONER TIP

Re-run the analysis in this section omitting the attributes insulin and triceps and using the na.omit method to remove any remaining missing values. Here is some sample code to get you started:

temp <- (PimaIndiansDiabetes2)
temp$insulin <- NULL
temp$triceps <- NULL
temp <- na.omit(temp)

What do you notice about the resultant decision tree?


Technique 7

Probit Model Based Recursive Partitioning

A model based recursive partitioning probit regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = glinearModel, family = binomial(link = probit))

Key parameters include the binary response variable Z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Step 1 and step 2 are discussed beginning on page 52

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and display the coefficients at the leaf nodes using coef:

> fit <- mob(f, data = PimaIndiansDiabetes2[train,], model = glinearModel, family = binomial(link = probit))

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -4.070 0.078   -0.222
4      -2.932 0.046    1.474
6      -3.416 0.089    0.814
7      -1.174 0.100   -1.003


The estimated decision tree is similar to that shown in Figure 6.1, and the interpretation of the coefficients is given on page 56.

Step 5: Make Predictions

For comparison with the logistic regression discussed on page 52, we report the value of the log-likelihood function and the Akaike information criterion. Since these values are very close to those obtained by the logistic regression, we should expect similar predictive performance.

> logLik(fit)
'log Lik.' -115.4277 (df=15)

> AIC(fit)
[1] 260.8555

We use the function predict with the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%. This is exactly the same error rate observed for the logistic regression model.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train,])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
      predicted
actual neg pos
   neg  92  17
   pos  26  29

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3) * 100
[1] 26.2


Regression Trees for Continuous Response Variables


Technique 8

Regression Tree

A regression tree can be built using the package tree with the tree function:

tree(z ~ ., data = ..., split = ...)

Key parameters include split, which controls whether "deviance" or "gini" is used as the splitting criterion; z, the continuous response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our regression tree using the bodyfat data frame contained in the TH.data package:

> library(tree)
> data(bodyfat, package = "TH.data")

NOTE

The bodyfat data set was collected by Garcia et al.24 to develop improved predictive regression equations for body fat content derived from common anthropometric measurements. The original study collected data from 117 healthy German subjects, 46 men and 71 women. The bodyfat data frame contains the data collected on 10 variables for the 71 women; see Table 10.


Name           Description
DEXfat         body fat measured by DXA (response variable)
age            age in years
waistcirc      waist circumference
hipcirc        hip circumference
elbowbreadth   breadth of the elbow
kneebreadth    breadth of the knee
anthro3a       sum of logarithm of three anthropometric measurements
anthro3b       sum of logarithm of three anthropometric measurements
anthro3c       sum of logarithm of three anthropometric measurements
anthro4        sum of logarithm of three anthropometric measurements

Table 10: Response and independent variables in the bodyfat data frame

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate the Decision Tree

Now we are ready to fit the decision tree using the training sample. We take the log of DEXfat as the response variable:

> fit <- tree(log(DEXfat) ~ ., data = bodyfat[train,], split = "deviance")

To see details of the fitted tree enter:

> fit

node), split, n, deviance, yval
      * denotes terminal node

 1) root 45 6.72400 3.364
   2) anthro4 < 5.33 19 1.33200 3.004
     4) anthro4 < 4.545 5 0.25490 2.667 *
   ...
    14) waistcirc < 104.25 10 0.07201 3.707 *
    15) waistcirc > 104.25 5 0.08854 3.874 *

The terminal value at each node is the estimated value of log(DEXfat). For example, the root node indicates the overall mean of log(DEXfat) is 3.364. This is approximately the same value we would get from:

> mean(log(bodyfat$DEXfat))
[1] 3.359635

Following the splits, the first terminal node is node 4, where the estimated value of log(DEXfat) for observations with anthro4 < 5.33 and anthro4 < 4.545 is 2.667.

A summary of the tree can also be obtained:

> summary(fit)

Regression tree:
tree(formula = log(DEXfat) ~ ., data = bodyfat[train,], split = "deviance")
Variables actually used in tree construction:
[1] "anthro4"   "hipcirc"   "waistcirc"
Number of terminal nodes:  7
Residual mean deviance:  0.01687 = 0.6411 / 38
Distribution of residuals:
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 -0.268300 -0.080280  0.009712  0.000000  0.073400  0.332900

Notice that summary(fit) shows:

1. The type of tree, in this case a regression tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 7;

5. the residual mean deviance of 0.01687;

6. the distribution of the residuals, in this case with a mean of 0.

We plot the tree; see Figure 8.1:

> plot(fit); text(fit)

Figure 8.1: Fitted Regression Tree for bodyfat


Step 4: Assess Model

We use leave-one-out cross-validation, with the results plotted in Figure 8.2. Since the jagged line reaches a minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fit.cv <- cv.tree(fit, K = 45)
> plot(fit.cv)
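Had the cross-validation curve bottomed out at a smaller tree, the fit could be cut back with prune.tree from the same package. The snippet below is a sketch of that workflow rather than part of the original analysis; the choice of best = 4 is purely illustrative.

# Illustrative only: prune the fitted tree back to 4 terminal nodes
pruned <- prune.tree(fit, best = 4)
plot(pruned); text(pruned)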

Figure 8.2: Regression tree cross-validation results using bodyfat


Step 5: Make Predictions

We use the test observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 8.3. The squared correlation coefficient between predicted and observed values is 0.795.

> pred <- predict(fit, newdata = bodyfat[-train,])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.795
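Squared correlation alone can hide systematic bias, so you may also want an error measure on the scale the model was fitted on. The lines below are a sketch (not from the original text) computing the root mean squared error of the predictions against log(DEXfat) on the hold-out sample:

obs_log <- log(bodyfat$DEXfat[-train])
rmse <- sqrt(mean((pred - obs_log)^2))   # error on the log scale
round(rmse, 3)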


Figure 8.3: Scatterplot of predicted versus observed values for the regression tree of bodyfat


Technique 9

Conditional Inference Regression Tree

A conditional inference regression tree is a non-parametric regression tree embedding tree-structured regression models. It is similar to the regression tree of page 62, but with extra information about the distribution of subjects in the leaf nodes. It can be estimated using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include the continuous response variable Z and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2 is outlined on page 63.

Step 3: Estimate and Assess the Decision Tree

We estimate the model using the training data, followed by a plot of the fitted tree, shown in Figure 9.1.


> fit <- ctree(log(DEXfat) ~ ., data = bodyfat[train,])

> plot(fit)

Figure 9.1: Fitted Conditional Inference Regression Tree using bodyfat

Further details of the fitted tree can be obtained using the print function:

> print(fit)

	 Conditional inference tree with 4 terminal nodes

Response:  log(DEXfat)
Inputs:  age, waistcirc, hipcirc, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c, anthro4
Number of observations:  45

1) anthro3c <= 3.85; criterion = 1, statistic = 35.215
  2) anthro3c <= 3.39; criterion = 0.998, statistic = 14.061
    3)*  weights = 9
  2) anthro3c > 3.39
    4)*  weights = 12
1) anthro3c > 3.85
  5) hipcirc <= 108.5; criterion = 0.999, statistic = 15.862
    6)*  weights = 10
  5) hipcirc > 108.5
    7)*  weights = 14

At each branch of the tree (after the root) we see, in order, the branch number and the split rule (e.g. anthro3c <= 3.85). The criterion reflects the reported p-value and is derived from the statistic. Terminal nodes (or leaves) are indicated with *, and weights gives the number of subjects (observations) at that node.
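To see which terminal node each training observation falls into (and to recover the weights counts above), the party package's where function can be used. This check is our addition, not part of the original walk-through:

> table(where(fit))   # observations per terminal node; should match the weights shown above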

Step 5: Make Predictions

We use the validation observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 9.2. The squared correlation coefficient between predicted and observed values is 0.68.

> pred <- predict(fit, newdata = bodyfat[-train,])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
     log(DEXfat)
[1,]        0.68


Figure 9.2: Conditional Inference Regression Tree scatter plot for bodyfat


Technique 10

Linear Model Based Recursive Partitioning

A linear model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = linearModel)

Key parameters include the continuous response variable Z, the linear regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use the entire bodyfat sample. We begin by taking the log of the response variable (DEXfat) and the two conditioning variables (waistcirc, hipcirc); the remaining covariates form the partitioning set:

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc | age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function mob. Since looking at the printed output can be rather tedious, a visualization is shown in Figure 10.1. By default this produces partial scatter plots of the response variable against each of the regressors (waistcirc, hipcirc) in the terminal nodes. Each scatter plot also shows the fitted values. From this visualization it can be seen that in nodes 3, 4 and 5 body fat increases with waist and hip circumference. The increase appears steepest in node 3 and flattens out somewhat in node 5.

> fit <- mob(f, data = bodyfat, model = linearModel, control = mob_control(objfun = logLik))

> plot(fit)

PRACTITIONER TIP

Model based recursive partitioning searches for the locally optimal split in the response variable by minimizing the objective function of the model. Typically this will be something like the deviance or the negative log-likelihood function. It can be specified using the mob_control control function. For example, to use deviance you would set control = mob_control(objfun = deviance). In our analysis we use the log-likelihood, with control = mob_control(objfun = logLik).


Figure 10.1: Linear model based recursive partitioning tree using bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit
1) anthro3b <= 4.64; criterion = 0.998, statistic = 24.549
  2) anthro3b <= 4.29; criterion = 0.966, statistic = 16.962
    3)*  weights = 31
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -14.309        1.196        2.660

  2) anthro3b > 4.29
    4)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -5.3867       0.9887       0.9466

1) anthro3b > 4.64
  5)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -3.5815       0.6027       0.9546

The output informs us that the tree consists of five nodes. At each branch of the tree (after the root) we see, in order, the branch number and the split rule (e.g. anthro3b <= 4.64). Note that criterion reflects the reported p-value25 and is derived from the statistic. Terminal nodes are indicated with * and weights gives the number of subjects (observations) at that node. The output also presents the estimated regression coefficients at the terminal nodes. We can also use the coef function to obtain a summary of the estimated coefficients and their associated node:

> round(coef(fit), 3)
  (Intercept) waistcirc hipcirc
3     -14.309     1.196   2.660
4      -5.387     0.989   0.947
5      -3.582     0.603   0.955

The summary function also provides detailed statistical information on the fitted coefficients by node. For example, summary(fit) produces the following (we only show details of node 3):

> summary(fit)

$`3`

Call:
NULL

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3272  0.0000  0.0000  0.0000  0.4376

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3093     2.4745  -5.783 3.29e-06
waistcirc     1.1958     0.4033   2.965 0.006119
hipcirc       2.6597     0.6969   3.817 0.000685
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1694 on 28 degrees of freedom
Multiple R-squared: 0.6941,  Adjusted R-squared: 0.6723
F-statistic: 31.77 on 2 and 28 DF,  p-value: 6.278e-08

When comparing models it can be useful to have the value of the log-likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' 54.42474 (df=14)

> AIC(fit)
[1] -80.84949

PRACTITIONER TIP

The test statistics and p-values computed in each node can be extracted using the function sctest(). For example, to see the statistics for node 2 you would type sctest(fit, node = 2).

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 10.2. The squared correlation coefficient between predicted and observed values is 0.89.

> pred <- predict(fit, newdata = bodyfat)


> plot(bodyfat$DEXfat, pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Full Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat)^2, 3)
[1] 0.89

Figure 10.2: Linear model based recursive partitioning tree predicted and observed values using bodyfat


Technique 11

Evolutionary Regression Tree

An evolutionary regression tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data = ...)

Key parameters include the continuous response variable Z and the covariates contained in data.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(evtree)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use 45 observations for the training sample. We take the log of the response variable (DEXfat) and two of the covariates (waistcirc, hipcirc). The remaining covariates are used in their original form:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function evtree. A visualization is obtained using plot and shown in Figure 11.1. This produces box and whisker plots of the response variable in each leaf. From this visualization it can be seen that body fat increases as we move from node 3 to node 5.

> fit <- evtree(f, data = bodyfat[train,])
> plot(fit)

PRACTITIONER TIP

Notice that evtree.control is used to control important aspects of the tree. You can change the number of evolutionary iterations using niterations; this is useful if your tree does not converge within the default number of iterations. You can also specify the number of trees in the population using ntrees, and the tree depth with maxdepth. For example, to limit the maximum tree depth to two and the number of iterations to ten thousand you would enter something along the lines of:

fit <- evtree(f, data = bodyfat[train,], control = evtree.control(maxdepth = 2, niterations = 10000))


Figure 11.1: Fitted Evolutionary Regression Tree for bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit

Model formula:
DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4

Fitted party:
[1] root
|   [2] hipcirc < 109
|   |   [3] anthro3c < 3.77: 20.271 (n = 18, err = 385.9)
|   |   [4] anthro3c >= 3.77: 31.496 (n = 12, err = 186.8)
|   [5] hipcirc >= 109: 43.432 (n = 15, err = 554.8)

Number of inner nodes:    2
Number of terminal nodes: 3

PRACTITIONER TIP

Decision tree models can outperform other techniques when relationships are irregular (e.g. non-monotonic), but they are also known to be inefficient when relationships can be well approximated by simpler models.

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 11.2. The squared correlation coefficient between predicted and observed values is 0.73.

> pred <- predict(fit, newdata = bodyfat[-train,])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.727


Figure 11.2: Scatter plot of fitted and observed values for the Evolutionary Regression Tree using bodyfat


Decision Trees for Count & Ordinal Response Data

PRACTITIONER TIP

An oft-believed maxim is "the more data the better". Whilst this may sometimes be true, it is a good idea to rationally reduce the number of attributes you include in your decision tree to the minimum set of highest-value attributes. Oates and Jensen26 studied the influence of database size on decision tree complexity. They found tree size and complexity depend strongly on the size of the training set. It is always worth thinking about, and removing, uninformative attributes prior to decision tree construction. For practical ideas and additional tips on how to do this, see the excellent papers of John27, Brodley and Friedl28, and Cano and Herrera29.


Technique 12

Poisson Decision Tree

A decision tree for a count response variable (yi), following a Poisson distribution with a mean that depends on the covariates x1, ..., xk, can be built using the package rpart with the rpart function:

rpart(z ~ ., data = ..., method = "poisson")

Key parameters include method = "poisson", which is used to indicate the type of tree to be built; z, the Poisson distributed response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package:

> library(rpart)
> library(MixAll)
> data(DebTrivedi)


NOTE

Deb and Trivedi30 model counts of medical care utilization by the elderly in the United States using data from the National Medical Expenditure Survey. They analyze data on 4406 individuals aged 66 and over who are covered by Medicare, a public insurance program. The objective is to model the demand for medical care, using as the response variable the number of physician/non-physician office and hospital outpatient visits. The data is contained in the DebTrivedi data frame available in the MixAll package.

Step 2: Prepare Data & Tweak Parameters

The number of physician office visits (ofp) is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status), numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator):

> f <- ofp ~ hosp + health + numchron + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(f, data = DebTrivedi, method = "poisson")

To see a plot of the tree use the plot and text methods:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8)

Figure 12.1 shows a visualization of the fitted tree. Each of the five terminal nodes reports the event rate, the total number of events, and the number of observations for that node. For example, for the rule chain numchron < 1.5 -> hosp < 0.5 -> numchron < 0.5, the event rate is 3.121 with 923 events at that node.


PRACTITIONER TIP

To see the number of events and observations at every node in a decision tree plot, add all = TRUE to the text function:

plot(fit); text(fit, use.n = TRUE, all = TRUE, cex = 0.6)

Figure 12.1: Poisson Decision Tree using the DebTrivedi data frame

To help validate the decision tree we use the printcp function. The "cp" part of the function name stands for the "complexity parameter" of the tree. The function indicates the optimal tree size based on the cp value:

> printcp(fit, digits = 3)

Rates regression tree:
rpart(formula = f, data = DebTrivedi, method = "poisson")

Variables actually used in tree construction:
[1] hosp     numchron

Root node error: 26943/4406 = 6.12

n= 4406

      CP nsplit rel error xerror   xstd
1 0.0667      0     1.000  1.000 0.0332
2 0.0221      1     0.933  0.942 0.0325
3 0.0220      2     0.911  0.918 0.0326
4 0.0122      3     0.889  0.896 0.0315
5 0.0100      4     0.877  0.887 0.0315

The printcp function returns the formula used to fit the tree, the variables used to build the tree (in this case hosp and numchron), the root node error (6.12), the number of observations at the root node (4406), and the relative error, the cross-validation error (xerror), xstd and CP at each node split. Each row represents a different height of the tree. In general, more levels in the tree often imply a lower classification error; however, you run the risk of over fitting.

Figure 12.2 plots the relative error against the cp parameter:

> plotcp(fit)


Figure 12.2: Complexity parameter for the Poisson decision tree using the DebTrivedi data frame


PRACTITIONER TIP

A simple rule of thumb is to choose the lowest level where rel error + xstd < xerror. Another rule of thumb is to prune the tree so that it has the minimum xerror. You can do this automatically using the following:

pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])

pfit contains the pruned tree.

Since the tree is relatively parsimonious we retain the original tree. A summary of the tree can also be obtained by typing:

> summary(fit)

The first part of the output displays similar data to that obtained using the printcp function. This is followed by variable importance details:

Variable importance:
numchron     hosp   health
      55       37        8

We see that numchron is the most important variable, followed by hosp and then health.
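If you prefer to work with these numbers directly rather than reading them out of summary(fit), rpart stores them in the fitted object. The line below is our sketch, using the variable.importance component of an rpart fit:

> round(fit$variable.importance, 1)   # named vector of importance scores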

The second part of the summary function gives details of the tree, with the last few lines giving details of the terminal nodes. For example, for node 5 we observe:

Node number 5: 317 observations
  events = 2329, estimated rate = 7.346145, mean deviance = 5.810627


Technique 13

Poisson Model Based Recursive Partitioning

A Poisson model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = glinearModel, family = poisson(link = log))

Key parameters include the Poisson distributed response variable of counts Z, the regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package. Details of this data frame are given on page 86.

> library(party)
> library(MixAll)
> data(DebTrivedi)

Step 2: Prepare Data & Tweak Parameters

For our analysis we use all the observations in DebTrivedi to estimate a model with the response variable ofp (number of physician office visits) and numchron (number of chronic conditions) as the Poisson regression conditioning variable. The remaining variables (hosp, health, gender, school, privins) are used as the partitioning set. The model is stored in f:

> f <- ofp ~ numchron | hosp + health + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit and plot the decision tree; see Figure 13.1:

> fit <- mob(f, data = DebTrivedi, model = glinearModel, family = poisson(link = log))

> plot(fit)

Figure 13.1: Poisson Model Based Recursive Partitioning Tree for DebTrivedi


Coefficient estimates at the leaf nodes are given by:

> round(coef(fit), 3)
   (Intercept) numchron
3        7.042    0.231
6        2.555    0.892
7        2.672    1.400
9        4.119    0.942
10       2.788    1.043
12       3.769    1.212
14       6.982    1.123
15      14.579    0.032

When comparing models it can be useful to have the value of the log-likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -14109.87 (df=31)

> AIC(fit)
[1] 28281.75

The results of the parameter stability tests for any given node can be retrieved using sctest. For example, to retrieve the statistics for node 2, enter:

> round(sctest(fit, node = 2), 2)
           hosp health gender school privins
statistic 13.94  36.25   7.93  24.22   16.87
p.value    0.11   0.00   0.09   0.00    0.00


Technique 14

Conditional Inference Ordinal Response Tree

The conditional inference ordinal response tree is used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale: "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale: "very unhealthy", "unhealthy", "healthy", "very healthy". A conditional inference ordinal response tree can be built using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include the response variable Z, an ordered factor, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our tree using the wine data frame contained in the ordinal package:

> library(party)
> library(ordinal)
> data(wine)


NOTE

The wine data frame was analyzed by Randall31 in an experiment on factors determining the bitterness of wine. The bitterness (rating) was measured as 1 = "least bitter" and 5 = "most bitter". Temperature and contact between juice and skins can be controlled during wine production. Two treatment factors were collected, temperature (temp) and contact (contact), each having two levels. Nine judges assessed wine from two bottles from each of the four treatment conditions, resulting in a total of 72 observations.

Step 2: Estimate and Assess the Decision Tree

We estimate the model using all of the data, with rating as the response variable. This is followed by a plot, shown in Figure 14.1, of the fitted tree:

> fit <- ctree(rating ~ temp + contact, data = wine)
> plot(fit)


Figure 14.1: Fitted Conditional Inference Ordinal Response Tree using wine

The decision tree has three leaves: node 3 with 18 observations, node 4 with 18 observations and node 5 with 36 observations. The covariate temp is highly significant (p < 0.001) and contact is significant at the 5% level. Notice that the terminal nodes contain the distribution of bitterness scores.
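The confusion matrix below uses a vector of fitted ratings, pred, whose computation is not shown in the transcript. Presumably it was obtained from the fitted tree along the lines of the sketch below (our reconstruction, following the predict conventions used elsewhere in this book):

> pred <- predict(fit)   # fitted (most probable) rating for each of the 72 observations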

We compare the fitted values with the observed values and print out the confusion matrix and error rate. The overall misclassification rate is 56.9% for the fitted tree.

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))
> tb
      predicted
actual  1  2  3  4  5
     1  0  5  0  0  0
     2  0 16  5  1  0
     3  0 13  8  5  0
     4  0  2  3  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb)) / sum(tb))
> round(error, 3)
[1] 0.569


Decision Trees for Survival Analysis

Studies involving time-to-event data are numerous and arise in all areas of research. For example, survival times or other time-to-failure related measurements, such as relapse time, are major concerns when modeling medical data. The Cox proportional hazard regression model and its extensions have been the traditional tools of the data scientist for modeling survival variables with censoring. These parametric (and semi-parametric) models remain useful staples, as they allow simple interpretations of the covariate effects and can readily be used for statistical inference. However, such models force a specific link between the covariates and the response. Even though interactions between covariates can be incorporated, they must be specified by the analyst.

Survival decision trees allow the data scientist to carry out their analysis without imposing a specific link function or knowing a priori the nature of variable interaction. Survival trees offer great flexibility because they can automatically detect certain types of interactions without the need to specify them beforehand. Prognostic groupings are a natural output from survival trees; this is because the basic idea of a decision tree is to partition the covariate space recursively to form groups (nodes in the tree) of subjects which are similar according to the outcome of interest.


Technique 15

Exponential Algorithm

A decision tree for survival data, where the time values are assumed to fit an exponential model32, can be built using the package rpart with the rpart function:

rpart(z ~ ., data = ..., method = "exp")

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; and method = "exp", which is used to indicate a survival decision tree.

Step 1: Load Required Packages

We build a survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(rpart)
> library(simexaft)
> library(survival)
> data(rhDNase)


NOTE

Respiratory disease in patients with cystic fibrosis is characterized by airway obstruction caused by the accumulation of thick purulent secretions. The viscoelasticity of these secretions can be reduced in vitro by recombinant human deoxyribonuclease I (rhDNase), a bioengineered copy of the human enzyme. The rhDNase data set contained in the simexaft package contains a subset of the original data collected by Fuchs et al.33, who performed a randomized, double-blind, placebo-controlled study on 968 adults and children with cystic fibrosis to determine the effects of once-daily and twice-daily administration of rhDNase. The patients were treated for 24 weeks as outpatients. The rhDNase data frame contains data on the occurrence and resolution of all exacerbations for 641 patients.

Step 2: Prepare Data & Tweak Parameters

The forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(z ~ trt + fev.ave, data = rhDNase, method = "exp")

To see a plot of the tree:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8, all = TRUE)


Figure 15.1 shows a plot of the fitted tree. Notice that the terminal nodes report the estimated rate as well as the number of events and observations available. For example, for the rule fev.ave < 80.03 the estimated event rate is 0.35, with 27 events out of a total of 177 observations.

PRACTITIONER TIP

To see the rules that lead to a specific node use the function path.rpart(fitted_Model, node = x). For example, to see the rules associated with node 7 enter:

> path.rpart(fit, node = 7)

 node number: 7
   root
   fev.ave < 80.03
   fev.ave < 46.42


Figure 15.1: Survival Decision Tree using the rhDNase data frame

To help assess the decision tree we use the plotcp function, shown in Figure 15.2. Since the tree is relatively parsimonious there is little need to prune:

> plotcp(fit)


Figure 15.2: Complexity parameter and error for the Survival Decision Tree using the rhDNase data frame

Survival times can vary greatly between subjects. Decision tree analysis is a useful tool to homogenize the data by separating it into different subgroups based on treatments and other relevant characteristics. In other words, a single tree will group subjects according to their survival behavior based on their covariates. For a final summary of the model it can be helpful to plot the probability of survival based on the final nodes in which the individual patients landed, as shown in Figure 15.3. We see that node 2 appears to have the most favorable survival characteristics.

> km <- survfit(z ~ fit$where, data = rhDNase)


> plot(km, lty = 1:3, mark.time = FALSE, xlab = "Time", ylab = "Status")

> legend(150, 0.2, paste('node', c(2, 4, 5)), lty = 1:3)

Figure 15.3: Survival plot by terminal node for rhDNase


Technique 16

Conditional Inference Survival Tree

A conditional inference survival tree is a non-parametric regression tree embedding tree-structured regression models. This is essentially a decision tree, but with extra information about survival in the terminal nodes34. It can be built using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the sample of explanatory variables.

Step 1: Load Required Packages

We build a conditional inference survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Details of step 2 are given on page 100.

Step 3: Estimate the Decision Tree & Assess Fit

Next we fit and plot the decision tree. Figure 16.1 shows the resultant tree.


> fit <- ctree(z ~ trt + fev.ave, data = rhDNase)
> plot(fit)

Figure 16.1: Conditional Inference Survival Tree for rhDNase

Notice that the internal nodes report the p-value for the split, whilst the leaf nodes give the number of subjects and a plot of the estimated survival curve. We can obtain a summary of the tree by typing:

> print(fit)

	 Conditional inference tree with 4 terminal nodes

Response:  z
Inputs:  trt, fev.ave
Number of observations:  641

1) fev.ave <= 79.95; criterion = 1, statistic = 58.152
  2) trt <= 0; criterion = 0.996, statistic = 9.34
    3)*  weights = 232
  2) trt > 0
    4) fev.ave <= 45.3; criterion = 0.957, statistic = 5.254
      5)*  weights = 105
    4) fev.ave > 45.3
      6)*  weights = 127
1) fev.ave > 79.95
  7)*  weights = 177

Node 7 is a terminal or leaf node (the symbol "*" signifies this) with the decision rule fev.ave > 79.95. At this node there are 177 observations.

We grab the fitted responses using the function treeresponse and store them in stree. Notice that every stree component is a survival object of class survfit:

> stree <- treeresponse(fit)

> class(stree[[2]])
[1] "survfit"

> class(stree[[7]])
[1] "survfit"

For this particular tree we have four terminal nodes, so there are only four unique survival objects. You can use the where method to see which nodes the observations are in:

> subjects <- where(fit)

> table(subjects)
subjects
  3   5   6   7
232 105 127 177


So we have 232 subjects in node 3 and 127 subjects in node 6, which agree with the numbers reported in Figure 16.1.

We end our initial analysis by plotting, in Figure 16.2, the survival curve for node 3 with a 95% confidence interval, and also mark on the plot the time of individual subject events:

> plot(stree[[3]], conf.int = TRUE, mark.time = TRUE, ylab = "Cumulative Survival (%)", xlab = "Days Elapsed")

Figure 16.2: Survival curve for node 3


Notes

1. See for example the top ten list of Wu, Xindong, et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14.1 (2008): 1-37.

2. Koch Y, Wolf T, Sorger PK, Eils R, Brors B (2013). "Decision-Tree Based Model Analysis for Efficient Identification of Parameter Relations Leading to Different Signaling States." PLoS ONE 8(12): e82593. doi:10.1371/journal.pone.0082593

3. Ebenezer Hailemariam, Rhys Goldstein, Ramtin Attar & Azam Khan (2011). "Real-Time Occupancy Detection using Decision Trees with Multiple Sensor Types." SimAUD 2011 Conference Proceedings, Symposium on Simulation for Architecture and Urban Design.

4. Taken from Stiglic G, Kocbek S, Pernek I, Kokol P (2012). "Comprehensive Decision Tree Models in Bioinformatics." PLoS ONE 7(3): e33812. doi:10.1371/journal.pone.0033812

5. Bohanec M, Bratko I (1994). "Trading accuracy for simplicity in decision trees." Machine Learning 15: 223-250.

6. See for example J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

7. See J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.

8. See Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984.

9. See R. L. De Mantaras, "A distance-based attribute selection measure for decision tree induction," Machine Learning, vol. 6, no. 1, pp. 81-92, 1991.

10. Zhang, Ting, et al. "Using decision trees to measure activities in people with stroke." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.

11. See for example J. Liu, M. A. Valencia-Sanchez, G. J. Hannon, and R. Parker, "MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies," Nature Cell Biology, vol. 7, no. 7, pp. 719-723, 2005.

12. Williams, Philip H., Rod Eyles, and George Weiller. "Plant MicroRNA prediction by supervised machine learning using C5.0 decision trees." Journal of Nucleic Acids 2012 (2012).

13. Nakayama, Nobuaki, et al. "Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees." Journal of Gastroenterology 47.6 (2012): 664-677.

14. Hepatic encephalopathy, also known as portosystemic encephalopathy, is the loss of brain function (evident in confusion, altered level of consciousness and coma) as a result of liver failure.

15. de Oña, Juan, Griselda López, and Joaquín Abellán. "Extracting decision rules from police accident reports through decision trees." Accident Analysis & Prevention 50 (2013): 1151-1160.

16. See for example Kashani A., Mohaymany A., 2011. "Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models." Safety Science 49, 1314-1320.

17. Monedero, Iñigo, et al. "Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees." International Journal of Electrical Power & Energy Systems 34.1 (2012): 90-98.

18. Maximum and minimum value, monthly consumption, number of meter readings, number of hours of maximum power consumption, and three variables to measure abnormal consumption.

19. Wang, Quan, et al. "Tracking tetrahymena pyriformis cells using decision trees." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.

20. This data set comes from the Turing Institute, Glasgow, Scotland.

21. For further details see http://www.rulequest.com/see5-comparison.html

22. For further details see:
    1. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "ctree: Conditional Inference Trees."
    2. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
    3. Hothorn, Torsten, et al. "party: A laboratory for recursive partytioning." (2010).

23. http://www.niddk.nih.gov

24. Garcia, Ada L., et al. "Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths." Obesity Research 13.3 (2005): 626-634.

25. The reported p-value at a node is equal to 1 - criterion.

26. Oates T, Jensen D (1997). "The effects of training set size on decision tree complexity." In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254-262.

27. John GH (1995). "Robust decision trees: removing outliers from databases." In Proceedings of the First Conference on Knowledge Discovery and Data Mining, pp. 174-179.

28. Brodley CE, Friedl MA (1999). "Identifying mislabeled training data." Journal of Artificial Intelligence Research 11: 131-167.

29. Cano JR, Herrera F, Lozano M (2007). "Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability." Data & Knowledge Engineering 60(1): 90-108.

30. Deb, Partha, and Pravin K. Trivedi. "Demand for medical care by the elderly: a finite mixture approach." Journal of Applied Econometrics 12.3 (1997): 313-336.

31. See Randall J. (1989). "The analysis of sensory data by generalised linear model." Biometrical Journal 7, pp. 781-793.

32. See Atkinson, Elizabeth J., and Terry M. Therneau. "An Introduction to Recursive Partitioning Using the RPART Routines." Rochester: Mayo Foundation (2000).

33. See Henry J. Fuchs, Drucy S. Borowitz, David H. Christiansen, Edward M. Morris, Martha L. Nash, Bonnie W. Ramsey, Beryl J. Rosenstein, Arnold L. Smith, and Mary Ellen Wohl, for the Pulmozyme Study Group. N Engl J Med 1994; 331:637-642, September 8, 1994.

34. For further details see:
    1. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "ctree: Conditional Inference Trees."
    2. Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
    3. Hothorn, Torsten, et al. "party: A laboratory for recursive partytioning." (2010).


Part II

Support Vector Machines


The Basic Idea

The support vector machine (SVM) is a supervised machine learning algorithm35 that can be used for both regression and classification. Let's take a quick look at how the SVM performs classification. The core idea of SVMs is that they construct hyperplanes in a multidimensional space that separate objects which belong to different classes. A decision plane is then used to define the boundaries between different classes. Figure 16.3 visualizes this idea.

The decision plane separates a set of observations into their respective classes using a straight line. In this example the observations belong either to class "solid circle" or class "edged circle". The separating line defines a boundary, on the right side of which all objects are "solid circle" and to the left of which all objects are "edged circle".

Figure 16.3: Schematic illustration of a decision plane determined by a linear classifier


In practice, as illustrated in Figure 16.4, the majority of classification problems require nonlinear boundaries in order to determine the optimal separation36. Figure 16.5 illustrates how SVMs solve this problem. The left side of the figure represents the original sample (known as the input space), which is mapped, using a set of mathematical functions known as kernels, to the feature space. The process of rearranging the objects for optimal separation is known as transformation. Notice that in mapping from the input space to the feature space the mapped objects are linearly separable. Thus, although the SVM uses linear learning methods, due to its nonlinear kernel function it is in effect a nonlinear classifier.

NOTE

Since intuition is better built from examples that are easy to imagine, lines and points are drawn in the Cartesian plane instead of hyperplanes and vectors in a high dimensional space. Remember that the same concepts apply where the examples to be classified lie in a space whose dimension is higher than two.

Figure 16.4: Nonlinear boundary required for correct classification

Figure 16.5: Mapping from input to feature space in an SVM

Overview of SVM Implementation

The SVM finds the decision hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This minimizes the risk of misclassifying not only the examples in the training data set but also the yet-to-be-seen examples of the test set.

To construct an optimal hyperplane, SVM employs an iterative training algorithm which is used to minimize an error function.

Let's take a look at how this is achieved. Given a set of feature vectors x_i (i = 1, 2, ..., N), a binary target y_i \in \{-1, +1\} is associated with each feature vector x_i. The decision function for classification of unseen examples is given as

y = f(x, \alpha) = \mathrm{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right)   (16.1)

where the s_i are the N_s support vectors and K(s_i, x) is the kernel function. The parameters are determined by maximizing the margin of the hyperplane (see Figure 16.6), that is, by maximizing

\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i \cdot x_j)   (16.2)

subject to the constraints

\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C.   (16.3)

To build an SVM classifier, the user needs to tune the cost parameter C, choose a kernel function, and optimize the kernel's parameters.
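Equation (16.1) is exactly what a fitted SVM stores: the support vectors s_i, the products \alpha_i y_i and the offset b. As a hedged illustration (my own sketch, not taken from the text), the code below fits a two-class radial-kernel SVM with the e1071 package and then rebuilds the decision value for one observation by hand; if my reading of the stored components is correct, the hand-computed value should match the one returned by predict.

> library(e1071)
> data(iris)
> dat <- droplevels(subset(iris, Species != "virginica"))
> fit <- svm(Species ~ ., data = dat, kernel = "radial", scale = FALSE)
> # radial kernel K(s, x) = exp(-gamma * ||s - x||^2)
> rbf <- function(s, x, gamma) exp(-gamma * sum((s - x)^2))
> x_new <- as.numeric(dat[1, 1:4])
> # sum over support vectors of (alpha_i * y_i) * K(s_i, x), minus rho (= -b)
> by_hand <- sum(fit$coefs * apply(fit$SV, 1, rbf, x = x_new,
+                                  gamma = fit$gamma)) - fit$rho
> by_svm <- attr(predict(fit, dat[1, ], decision.values = TRUE),
+                "decision.values")
> c(by_hand, by_svm)   # the two values should agree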

PRACTITIONER TIP

I once worked for an economist who was trained (essentially) in one statistical technique: linear regression and its variants. Whenever there was an empirical issue, this individual always tried to frame it in terms of his understanding of linear models and economics. Needless to say, this archaic approach to modeling led to all sorts of difficulties. The data scientist is pragmatic in their modeling approach: linear, non-linear, Bayesian, boosting; they are guided by statistical theory and machine learning insights, and unshackled from the vagueness of economic theory.

Note on the Slack Parameter

The variable C, known as the slack parameter, serves as the cost parameter that controls the trade-off between the margin and the classification error. If no slack is allowed (often known as a hard margin) and the data are linearly separable, the support vectors are the points which lie along the supporting hyperplanes, as shown in Figure 16.6. In this case all of the support vectors lie exactly on the margin.

Figure 16.6: Support vectors for linearly separable data and a hard margin

In many situations this will not yield useful results, and a soft margin will be required. In this circumstance some proportion of the data points are allowed to remain inside the margin; the slack parameter C is used to control this proportion37. A soft margin results in a wider margin and greater error on the training data set; however, it improves generalization and reduces the likelihood of over fitting.
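The effect of C is easy to demonstrate. The short sketch below (illustrative only; the data set and the two cost values are my own choices) fits the same radial-kernel SVM with a small and a large cost and compares the number of support vectors. A small C (lots of slack, a soft margin) typically retains many more support vectors than a large C, which is the point made in the NOTE that follows.

> library(e1071)
> data(iris)
> soft <- svm(Species ~ ., data = iris, kernel = "radial", cost = 0.01)
> hard <- svm(Species ~ ., data = iris, kernel = "radial", cost = 100)
> soft$tot.nSV   # many support vectors when plenty of slack is allowed
> hard$tot.nSV   # far fewer when slack is heavily penalized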

NOTE

The total number of support vectors depends on the amount of allowed slack and the distribution of the data. If a large amount of slack is permitted, there will be a larger number of support vectors than in the case where very little slack is permitted. Fewer support vectors means faster classification of test points, because the computational complexity of SVM prediction is linear in the number of support vectors.

Practical Applications

NOTE

The kappa coefficient38 is a measure of agreement between categorical variables. It is similar to the correlation coefficient in that higher values indicate greater agreement. It is calculated as

\kappa = \frac{P_o - P_e}{1 - P_e}   (16.4)

where P_o is the observed proportion correctly classified and P_e is the proportion expected to be correctly classified by chance.
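As a quick illustration (mine, not the book's), kappa can be computed directly from a confusion matrix: P_o is the sum of the diagonal divided by the total, and P_e comes from the product of the row and column marginals. The matrix below is hypothetical.

> # hypothetical 2x2 confusion matrix: rows = predicted, columns = observed
> tab <- matrix(c(65, 16, 27, 16), nrow = 2,
+               dimnames = list(Predicted = c("-1", "1"),
+                               Observed  = c("-1", "1")))
> n   <- sum(tab)
> p_o <- sum(diag(tab)) / n                        # observed agreement
> p_e <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
> kappa <- (p_o - p_e) / (1 - p_e)
> round(kappa, 3)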

Classification of Dengue Fever Patients

The Dengue virus is a mosquito-borne pathogen that infects millions of people every year. Gomes et al.39 use the support vector machine algorithm to classify 28 dengue patients from the Recife metropolitan area, Brazil (13 with dengue fever (DF) and 15 with dengue haemorrhagic fever (DHF)), based on mRNA expression data for 11 genes involved in the innate immune response pathway (MYD88, MDA5, TLR3, TLR7, TLR9, IRF3, IRF7, IFN-alpha, IFN-beta, IFN-gamma and RIGI).

A radial basis function kernel is used and the model is built using leave-one-out cross-validation, repeated fifteen times under different conditions to analyze the individual and collective contributions of each gene's expression data to DF/DHF classification.

A different gene was removed during each of the first twelve cross-validations. During the last three cross-validations multiple genes were removed. Figure 16.7 shows the overall accuracy of the support vector machine for differing values of its parameter C.

Figure 16.7: SVM optimization. Optimization of the parameters C and c of the SVM kernel RBF. Source: Gomes et al., doi:10.1371/journal.pone.0011267.g003

PRACTITIONER TIP

To transform the gene expression data into a suitable format for support vector machine training and testing, Gomes et al. designate each gene as either "10" (for observed up-regulation) or "01" (for observed down-regulation). Therefore the collective gene expressions observed in each patient were represented by a 24-dimension vector (12 genes × 2 gene states, up- or down-regulated). Each of the 24-dimension vectors was labeled as either "1" for DF patients or "-1" for DHF patients. Notice this is a different classification structure than that used in traditional statistical modeling of binary variables; binary observations are typically coded by statisticians using 0 and 1. Be sure you have the correct classification structure when moving between traditional statistical models and those developed out of machine learning.
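A minimal sketch of this style of encoding (hypothetical patients and gene names; it mimics the scheme described above rather than reproducing the study's code): each gene's state is expanded into two indicator columns, one for up-regulation and one for down-regulation, and the class label is coded as +1/-1.

> set.seed(7)
> genes <- paste0("gene", 1:12)
> # hypothetical up/down calls for 28 patients
> state <- matrix(sample(c("up", "down"), 28 * 12, replace = TRUE),
+                 nrow = 28, dimnames = list(NULL, genes))
> up   <- (state == "up") * 1        # the "10" indicator
> down <- (state == "down") * 1      # the "01" indicator
> colnames(up)   <- paste0(genes, "_up")
> colnames(down) <- paste0(genes, "_down")
> x <- cbind(up, down)               # 28 x 24 design matrix
> y <- ifelse(sample(c("DF", "DHF"), 28, replace = TRUE) == "DF", 1, -1)
> dim(x); table(y)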

Forecasting Stock Market Direction

Huang et al.40 use a support vector machine to predict the direction of weekly changes in the NIKKEI 225 Japanese stock market index. The index is composed of 225 stocks of the largest Japanese publicly traded companies.

Two independent variables are selected as inputs to the model: weekly changes in the S&P 500 index and weekly changes in the US dollar - Japanese yen exchange rate.

Data were collected from January 1990 to December 2002, yielding a total sample size of 676 observations. The researchers use 640 observations to train their support vector machine and perform an out-of-sample evaluation on the remaining 36 observations.

As a benchmark, the researchers compare the performance of their model to four other models: a random walk, linear discriminant analysis, quadratic discriminant analysis and a neural network.

The random walk correctly predicts the direction of the stock market 50% of the time, linear discriminant analysis 55%, quadratic discriminant analysis and the neural network 69%, and the support vector machine 73%.

The researchers also observe that an information-weighted combination of the models correctly predicts the direction of the stock market 75% of the time.

Bankruptcy Prediction

Min et al.41 develop a support vector machine to predict bankruptcy and compare its performance to a neural network, logistic regression and multiple discriminant analysis.

Data on 1,888 firms are collected from Korea's largest credit guarantee organization. The data set contains 944 bankrupt and 944 surviving firms. The attribute set consisted of 38 popular financial ratios and was reduced by principal component analysis to two "fundamental" factors.

The training set consists of 80% of the observations, with the remaining 20% used for the hold-out test sample. A radial basis function is used for the kernel. Its parameters are optimized using a grid search procedure and 5-fold cross-validation.

In rank order (for the hold-out data), the support vector machine had a prediction accuracy of 83%, the neural network 82.5%, multiple discriminant analysis 79.1% and the logistic regression 78.3%.
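The PCA-then-SVM pipeline just described is straightforward to reproduce in outline. The sketch below is entirely illustrative: ratios and bankrupt are hypothetical stand-ins for the matrix of financial ratios and the outcome flag, and the tuning grid is my own choice. It reduces the attributes to two principal components and tunes a radial-kernel SVM on them.

> library(e1071)
> set.seed(41)
> # hypothetical stand-ins: 38 financial ratios and a bankruptcy flag
> ratios   <- matrix(rnorm(200 * 38), ncol = 38)
> bankrupt <- factor(sample(c("yes", "no"), 200, replace = TRUE))
> pc  <- prcomp(ratios, scale. = TRUE)        # principal component analysis
> dat <- data.frame(pc$x[, 1:2], bankrupt)    # keep two "fundamental" factors
> tuned <- tune.svm(bankrupt ~ ., data = dat, kernel = "radial",
+                   gamma = 2^(-2:2), cost = 2^(0:4))  # grid search, 10-fold CV
> tuned$best.parameters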

Early Onset Breast Cancer

Breast cancer is often classified according to the number of estrogen receptors present on the tumor. Tumors with a large number of receptors are termed estrogen receptor positive (ER+), and those with few or no receptors estrogen receptor negative (ER-). ER status is important because ER+ cancers grow under the influence of estrogen and may respond well to hormone suppression treatments. This is not the case for ER- cancers, as they do not respond to hormone suppression treatments.

Upstill-Goddard et al.42 investigate whether patients who develop ER+ and ER- tumors show distinct constitutional genetic profiles, using single nucleotide polymorphism data. At the core of their analysis were support vector machines with linear, normalized quadratic polynomial, quadratic polynomial, cubic and radial basis kernels. The researchers opt for 10-fold cross-validation.

All five kernel models had an accuracy rate in excess of 93%; see Table 11.

Kernel Type                       Correctly Classified (%)
Linear                            93.28 ± 3.07
Normalized quadratic polynomial   93.69 ± 2.69
Quadratic polynomial              93.89 ± 3.06
Cubic polynomial                  94.64 ± 2.94
Radial basis function             95.95 ± 2.61

Table 11: Upstill-Goddard et al.'s kernels and classification results

Flood Susceptibility

Tehrany et al.43 evaluate support vector machines with different kernel functions for spatial prediction of flood occurrence in the Kuala Terengganu basin, Malaysia. Model attributes were constructed using ten geographic factors: altitude, slope, curvature, stream power index (SPI), topographic wetness index (TWI), distance from the river, geology, land use/cover (LULC), soil and surface runoff.

Four kernels, linear (LN), polynomial (PL), radial basis function (RBF) and sigmoid (SIG), were used to assess factor importance. This was achieved by eliminating each factor in turn and then measuring the Cohen's kappa index of the model. The overall rank of each factor44 is shown in Table 12. Overall, slope was the most important factor, followed by distance from the river and then altitude.

Factor     Average   Rank
Altitude   0.265     3
Slope      0.288     1
Curvature  0.225     6
SPI        0.215     8
TWI        0.235     4
Distance   0.268     2
Geology    0.223     7
LULC       0.215     8
Soil       0.228     5
Runoff     0.140     10

Table 12: Variable importance calculated from data reported in Tehrany et al.


PRACTITIONER TIP

Notice how both Upstill-Goddard et al. and Tehrany et al. use multiple kernels in developing their models. This is always a good strategy because it is not always obvious which kernel is optimal at the onset of a research project.

Signature Authentication

Radhika et al.45 consider the problem of automatic signature authentication using a variety of algorithms, including a support vector machine (SVM). The other algorithms considered included a Bayes classifier (BC), fast Fourier transform (FT), linear discriminant analysis (LD) and principal component analysis (PCA).

Their experiment used a signature database containing 75 subjects, with 15 genuine samples and 15 forged samples for each subject. Features were extracted from images drawn from the database and used as inputs to train and test the various methods.

The researchers report a false rejection rate of 8% for SVM, 13% for FT, 10% for BC, 11% for PCA and 12% for LD.

Prediction of Vitamin D Status

The accepted biomarker of vitamin D status is serum 25-hydroxyvitamin D (25(OH)D) concentration. Unfortunately, in large epidemiological studies direct measurement is often not feasible. However, useful proxies for 25(OH)D are available by using questionnaire data.

Guo et al.46 develop a support vector machine to predict serum 25(OH)D concentration in large epidemiological studies using questionnaire data. A total of 494 participants were recruited onto the study and asked to complete a questionnaire which covered sun exposure and sun protection behaviors, physical activity, smoking history, diet and the use of supplements. Skin types were defined by spectrophotometric measurements of skin reflectance to calculate melanin density for exposed skin sites (dorsum of hand, shoulder) and non-exposed skin sites (upper inner arm, buttock).

A multiple regression model (MLR) estimated using 12 explanatory variables47 was used to benchmark the support vector machine. The researchers selected a radial basis function for the kernel, with identical explanatory factors to those used in the MLR. The data were randomly assigned to a training sample (n = 294) and a validation sample (n = 174).

The researchers report a correlation of 0.74 between predicted scores and measured 25(OH)D concentration for the support vector machine. They also note that it performed better than the MLR in correctly identifying individuals with vitamin D deficiency. Overall they conclude that the "RBF SVR [radial basis function support vector machine] method has considerable promise for the prediction of vitamin D status for use in chronic disease epidemiology and potentially other situations".

PRACTITIONER TIP

The performance of the SVM is very closely tied to the choice of the kernel function. There exist many popular kernel functions that have been widely used for classification, e.g. linear, Gaussian radial basis function, polynomial and so on. Data scientists can spend a considerable amount of time tweaking the parameters of a specified kernel function via trial and error. Here are four general approaches that can speed up the process:

1. Cross-validation48.

2. Multiple kernel learning49, which attempts to construct a generalized kernel function able to solve all classification problems by combining different types of standard kernel functions.

3. Evolution and particle swarm optimization: Thadani et al.50 use gene expression programming algorithms to evolve the kernel function of the SVM. An analogous approach has been proposed using particle swarm optimization51.

4. Automatic kernel selection using the C5.0 algorithm, which attempts to select the optimal kernel function based on the statistical characteristics and distribution information of the data.


Support Vector Classification


Technique 17

Binary Response Classification with C-SVM

A C-SVM for binary response classification can be estimated using the package svmpath with the svmpath function:

svmpath(x, y, kernel.function, ...)

Key parameters include the response variable y coded as (-1, +1), the covariates x, and the kernel specified via kernel.function.

PRACTITIONER TIP

Although there is an ever growing number of kernels, four workhorses of applied research are:

• Linear: K(x_i, x_j) = x_i^T x_j

• Polynomial: K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \gamma > 0

• Radial basis function: K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \gamma > 0

• Sigmoid: K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)

Here γ, r and d are kernel parameters.
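These four kernels are easy to write down as plain R functions; the sketch below (illustrative only) implements them directly, with gamma, r and d playing the roles of the parameters named above.

> linear_k  <- function(xi, xj) sum(xi * xj)
> poly_k    <- function(xi, xj, gamma, r, d) (gamma * sum(xi * xj) + r)^d
> rbf_k     <- function(xi, xj, gamma) exp(-gamma * sum((xi - xj)^2))
> sigmoid_k <- function(xi, xj, gamma, r) tanh(gamma * sum(xi * xj) + r)
> # quick check on two random vectors
> u <- rnorm(5); v <- rnorm(5)
> c(linear  = linear_k(u, v),
+   poly    = poly_k(u, v, gamma = 1, r = 1, d = 2),
+   rbf     = rbf_k(u, v, gamma = 0.5),
+   sigmoid = sigmoid_k(u, v, gamma = 0.1, r = 0))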


Step 1: Load Required Packages

We build the C-SVM for binary response classification using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> data(PimaIndiansDiabetes2, package = "mlbench")
> require(svmpath)

Step 2: Prepare Data & Tweak Parameters

PimaIndiansDiabetes2 has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data are stored in temp.

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

The response diabetes is a factor containing the labels "pos" and "neg". However, the svmpath method requires the response to be numeric, taking the values -1 or +1. The following converts diabetes into a format usable by svmpath.

> y <- temp$diabetes
> levels(y) <- c(-1, 1)
> y <- as.numeric(as.character(y))
> y <- as.matrix(y)

Support vector machine kernels generally depend on the inner product of the attribute vectors, so very large values might cause numerical problems. We standardize the attributes using the scale method; the results are stored in the matrix x.

> x <- temp
> x$diabetes <- NULL   # remove the response variable
> x <- scale(x)

We use nrow to count the remaining observations (as a check, it should equal 724). We then set the training sample to select 600 observations at random without replacement. The remaining 124 observations form the test sample.


> set.seed(103)
> n = nrow(x)
> train <- sample(1:n, 600, FALSE)

Step 3: Estimate & Evaluate Model

The svmpath function can use two popular kernels, the polynomial and the radial basis function. We will assess both, beginning with the polynomial kernel. It is selected by setting kernel.function = poly.kernel. We also set trace = FALSE to prevent svmpath from printing results to the screen at each iteration.

> fit <- svmpath(x[train, ], y[train, ],
                 kernel.function = poly.kernel, trace = FALSE)

A nice feature of svmpath is that it computes the entire regularization path for the SVM cost parameter, along with the associated classification error. We use the with method to identify the minimum error.

> with(fit, Error[Error == min(Error)])
 [1] 140 140 140 140 140 140 140 140 140 140 140 140 140 140

Two things are worth noting here. First, the number of misclassified observations is 140 out of the 600 observations, which is approximately 23%. Second, each occurrence of 140 is associated with a unique regularization value. Since the regularization value is the penalty parameter of the error term (and svmpath reports the inverse of this parameter), we will select for our model the minimum value and store it in lambda. This is achieved via the following steps:

1. Store the minimum error values in error.

2. Grab the row numbers of these minimum errors using the which method.

3. Obtain the regularization parameter values associated with the minimum errors and store them in temp_lamdba.

4. Identify which value in temp_lamdba has the minimum value, store it in lambda and print it to the screen.


> error <- with(fit, Error[Error == min(Error)])
> min_err_row <- which(fit$Error == min(fit$Error))
> temp_lamdba <- fit$lambda[min_err_row]
> loc <- which(fit$lambda[min_err_row] == min(fit$lambda[min_err_row]))
> lambda <- temp_lamdba[loc]
> lambda
[1] 73.83352

The method svmpath actually reports the inverse of the kernel regularization parameter (often, and somewhat confusingly, called gamma in the literature). We obtain a value of 73.83352, which corresponds to a gamma of 1/73.83352 = 0.0135.

Next we follow the same procedure for the radial basis function kernel. In this case the estimated regularization parameter is stored in lambdaR.

> fitR <- svmpath(x[train, ], y[train, ],
                  kernel.function = radial.kernel, trace = FALSE)
> error <- with(fitR, Error[Error == min(Error)])
> min_err_row <- which(fitR$Error == min(fitR$Error))
> temp_lamdba <- fitR$lambda[min_err_row]
> loc <- which(fitR$lambda[min_err_row] == min(fitR$lambda[min_err_row]))
> lambdaR <- temp_lamdba[loc]
> lambdaR
[1] 0.09738556
> error[1]/600
[1] 0.015

Two things are noteworthy about this result. First, the regularization parameter is estimated as 1/0.09738556 = 10.268. Second, the error is estimated at only 1.5%. This is very likely an indication that the model has been over fit.


PRACTITIONER TIP

It often helps intuition to visualize data. Let's use a few lines of code to estimate a simple model and visualize it. We will fit a subset of the model we are already working on, using the first twelve patient observations on the attributes glucose and age, storing the result in xx. We standardize xx using the scale method. Then we grab the first twelve observations of the response variable diabetes.

> xx <- cbind(temp$glucose[1:12], temp$age[1:12])
> xx <- scale(xx)
> yy <- y[1:12]

Next we use the method svmpath to estimate the model and use plot to show the resultant data points. We use step = 1 to show the results at the first value of the regularization parameter and step = 8 to show the results at the last step. The dotted line in Figure 17.1 represents the margin; notice it gets narrower from step 1 to step 8 as the cost parameter increases. The support vectors are represented by the open dots. Notice that the number of support vectors increases as we move from step 1 to step 8. Also notice that at step 8 only one point is misclassified (point 7).

> example <- svmpath(xx, yy, trace = TRUE, plot = FALSE)
> par(mfrow = c(1, 2))
> plot(example, xlab = "glucose", ylab = "age", step = 1)
> plot(example, xlab = "glucose", ylab = "age", step = 8)

Figure 17.1: Plot of support vectors using svmpath with the response variable diabetes and attributes glucose & age

Step 4: Make Predictions

NOTE

Automated choice of kernel regularization parameters is challenging. This is because it is extremely easy to over fit an SVM model on the validation sample if you only consider the misclassification rate. The consequence is that you end up with a model that is not generalizable to the test data, and/or a model that performs considerably worse than the discarded models with higher test sample error rates52.

Although we suspect the radial basis kernel results in an over fit, we will compare its predictions to those of the optimal polynomial kernel. First we use the test data and the radial basis kernel via the predict method. The confusion matrix is printed using the table method. The error rate is then calculated; it is 35%, considerably higher than the 1.5% indicated during validation.

> pred <- predict(fitR, newx = x[-train, ], lambda = lambdaR, type = "class")
> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 65 27
       1  16 16

> error_rate = (1 - sum(pred == y[-train, ]) / 124)
> round(error_rate, 2)
[1] 0.35


PRACTITIONER TIP

Degradation in performance due to over-fitting can be surprisingly large. The key is to remember that the primary goal of the validation sample is to provide a reliable indication of the expected error on the test sample and on future, as yet unseen, samples. Throughout this book we have used the set.seed() method to help ensure replicability of the results. However, given the stochastic nature of the validation-test sample split, we should always expect variation in performance for different realisations. This suggests that evaluation should always involve multiple partitions of the data to form training, validation and test sets, as the sampling of data for a single partition might arbitrarily favour one classifier over another.
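One simple way to act on this advice is to repeat the random split several times and look at the spread of the test error. The sketch below is illustrative only; it assumes the x, y and poly.kernel objects from this technique are already in the workspace, and it draws ten fresh partitions, refits the polynomial-kernel path and records the test error for each.

> errs <- numeric(10)
> for (k in 1:10) {
+   tr  <- sample(1:nrow(x), 600, FALSE)            # a new random partition
+   f   <- svmpath(x[tr, ], y[tr, ],
+                  kernel.function = poly.kernel, trace = FALSE)
+   lam <- min(f$lambda[f$Error == min(f$Error)])   # smallest lambda at minimum error
+   p   <- predict(f, newx = x[-tr, ], lambda = lam, type = "class")
+   errs[k] <- mean(p != y[-tr, ])
+ }
> summary(errs)   # spread of the test error across partitions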

Let's now look at the polynomial kernel model.

> pred <- predict(fit, newx = x[-train, ], lambda = lambda, type = "class")
> table(pred, y[-train, ], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 76 13
       1   5 30

> error_rate = (1 - sum(pred == y[-train, ]) / 124)
> round(error_rate, 2)
[1] 0.15

It seems the error rate for this choice of kernel is around 15%. This is less than half the error of the radial basis kernel SVM.


Technique 18

Multicategory Classification with C-SVM

A C-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, ...)

Key parameters include kernel - the kernel function, cost - the cost parameter, the multicategory response variable y, and the covariates data.

Step 1: Load Required Packages

We build the C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> require(e1071)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(Class ~ ., data = Vehicle[train, ])

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05555556

Number of Support Vectors:  376

 ( 122 64 118 72 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (C-classification), the kernel, the cost parameter and the gamma model parameter. Notice the model estimates 376 support vectors, with 122 in the first class (bus) and 72 in the fourth class (van).

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4 and cost between 4 and 16. We store the results in obj.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = 2^(-2:2), cost = 2^(2:4))

Once again we can use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.25   16

- best performance: 0.254

The method automatically performs a 10-fold cross validation. It appears the best model has a gamma of 0.25 and a cost parameter equal to 16, with a 25.4% misclassification rate.

PRACTITIONER TIP

Greater control over model tuning can be achieved using the tunecontrol argument of tune.svm. For example, to perform a 20-fold cross validation you would use:

tunecontrol = tune.control(sampling = "cross", cross = 20)

Your call to tune.svm would then look something like this:

obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                gamma = 2^(-2:2), cost = 2^(2:4),
                tunecontrol = tune.control(sampling = "cross", cross = 20))

Visualization of the output of tune.svm using plot is often useful for fine tuning.

> plot(obj)

Figure 18.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. Two things are noteworthy about this image. First, a larger cost parameter seems to indicate a better fit. Second, a gamma of 0.5 or less also indicates a better fitting model.

Using this information we re-tune the model, with gamma ranging from 0.01 to 0.5 and cost ranging from 16 to 256. Now the best performance occurs with gamma set to 0.03 and cost equal to 32.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = seq(0.01, 0.5, by = 0.01), cost = 2^(4:8))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.03   32

- best performance: 0.17


Figure 18.1: Tuning Multicategory Classification with C-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestC and bestGamma and refit the model.

> bestC <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ], cost = bestC,
             gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method.

> print(fit)


Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

The fitted model now has 262 support vectors, down from 376 for the original model.

The summary method provides additional details. Reported are the support vectors by class and the results from the 10-fold cross validation. Overall the model has a total accuracy of 81%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

 ( 100 32 94 36 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 81
Single Accuracies:
 84 86 88 76 82 78 76 84 82 74

It can be fun to visualize a two dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra on the x and y axes, whilst holding all the other variables in Vehicle at their median values. Figure 18.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
       svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp),
                    Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ),
                    Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
                    Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr),
                    Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 18.2: Multicategory Classification with C-SVM, two dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))


               Predicted Class
Observed Class  bus opel saab van
          bus    87    2    0   5
          opel    0   55   28   2
          saab    0   20   68   0
          van     1    1    1  76

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 17.3% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.173


Technique 19

Multicategory Classification with nu-SVM

A nu-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-classification", ...)

Key parameters include kernel - the kernel function, type set to "nu-classification", the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 134.

We estimate the support vector machine using the default settings. This will use a radial basis function as the kernel, with a nu parameter value set equal to 0.5.

> fit <- svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")


Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.05555556
         nu:  0.5

Number of Support Vectors:  403

 ( 110 88 107 98 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (nu-classification), the kernel, the nu parameter and the gamma parameter. Notice the model estimates 403 support vectors, with 110 in the first class (bus) and 98 in the fourth class (van). The total number of support vectors is slightly higher than estimated for the C-SVM discussed on page 134.

We use the tune.svm method to identify the best model for gamma and nu ranging between 0.05 and 0.45. We store the results in obj.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.05, 0.45, by = 0.1),
                  nu = seq(0.05, 0.45, by = 0.1))

We can use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.05 0.05

- best performance: 0.178


The method automatically performs a 10-fold cross validation. We see that the best model has gamma and nu equal to 0.05, with a 17.8% misclassification rate. Visualization of the output of tune.svm can be achieved using the plot function.

> plot(obj)

Figure 19.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. It seems that a smaller nu and gamma lead to a better fit of the training data.

Using this information we re-tune the model, with gamma and nu ranging from 0.01 to 0.05. The best performance occurs with both parameters set to 0.02 and an overall misclassification error rate of 16.4%.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.01, 0.05, by = 0.01),
                  nu = seq(0.01, 0.05, by = 0.01))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.02 0.02

- best performance: 0.164


Figure 19.1: Tuning Multicategory Classification with nu-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestNU and bestGamma and then refit the model.

> bestNU <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
             type = "nu-classification",
             nu = bestNU, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method.

> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

The fitted model now has 204 support vectors, roughly half the number required for the model we initially built.

The summary method provides additional details. Reported are the support vectors by class and the results from the 10-fold cross validation. Overall the model has a total accuracy of 75.6%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

 ( 71 29 72 32 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 75.6
Single Accuracies:
 82 80 84 80 72 64 74 76 76 68

Next we visualize a two dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra on the x and y axes, whilst holding all the other variables in Vehicle at their median values. Figure 19.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra,
       svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp),
                    Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ),
                    Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra),
                    Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr),
                    Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 19.2: Multicategory Classification with nu-SVM, two dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))


               Predicted Class
Observed Class  bus opel saab van
          bus    87    1    1   5
          opel    0   41   42   2
          saab    0   30   58   0
          van     1    1    2  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.6% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.246


Technique 20

Bound-constraint C-SVM Classification

A bound-constraint C-SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "C-bsvc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "C-bsvc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 1: Load Required Packages

We build the bound-constraint C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23. The scatterplot3d package will be used to create a three dimensional scatter plot.

> library(kernlab)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.


> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: C-bsvc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 375

Objective Function Value : -491905 -533552 -480924 -1932145 -489507 -597368
Training error : 0.164
Cross validation error : 0.248

The function provides details of the type of support vector machine (C-bsvc), the model cost parameter and the kernel parameter (C = 1, sigma = 0.05). Notice the model estimates 375 support vectors, with a training error of 16.4% and a cross validation error of 24.8%. In this case we observe a relatively large difference between the training error and the cross validation error. In practice the cross validation error is often a better indicator of the expected performance on the test sample.

Since the output of ksvm (in our example the values stored in fit) is an S4 object, both errors can be accessed directly using "@". The "preferred" approach, however, is to use an accessor function such as cross(fit) and error(fit). If you don't know the accessor function names you can always use attributes(fit) and access the required slot using @ (for S4 objects) or $ (for S3 objects).

> fit@error
[1] 0.164

> fit@cross
[1] 0.248

PRACTITIONER TIP

The package kernlab supports a wide range of kernels. Here are nine popular choices:

• rbfdot - Radial Basis kernel function

• polydot - Polynomial kernel function

• vanilladot - Linear kernel function

• tanhdot - Hyperbolic tangent kernel function

• laplacedot - Laplacian kernel function

• besseldot - Bessel kernel function

• anovadot - ANOVA RBF kernel function

• splinedot - Spline kernel

• stringdot - String kernel
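Swapping kernels in ksvm is a one-argument change, so it is cheap to try several. The sketch below is illustrative only (it assumes Vehicle and train from the steps above are still in the workspace); it fits the same bound-constraint C-SVM with a linear, a Laplacian and a radial basis kernel and compares their 10-fold cross validation errors.

> kerns <- c("vanilladot", "laplacedot", "rbfdot")
> cv_err <- sapply(kerns, function(k) {
+   m <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
+             kernel = k, cross = 10)
+   cross(m)          # 10-fold cross validation error for this kernel
+ })
> round(cv_err, 3)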

We need to tune the model to obtain the optimum parameters. Let's create a few lines of R code to do this for us. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Check that it contains the value 35 (it should).

> runs <- n_sigma * n_cost


> countcost <- 0
> countsigma <- 0
> runs
[1] 35

The results, in terms of cross validation error, cost and sigma, are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows.

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
                  C = cost[j], kernel = "rbfdot",
                  kpar = list(sigma = sigma[i]), cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      countsigma = countsigma + 1
      count = count + 1

      i = i + 1
    }  # end sigma loop

    i = 1
    j = j + 1
  }  # end cost loop


Notice we set cross = 45, so each candidate model is evaluated with 45-fold cross validation. When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2358586
2    4   0.2 0.2641414
3    4   0.3 0.2744108
4    4   0.4 0.2973064

Now let's find the best cross validation performance and its associated row number.

> with(results, error[error == min(error)])
[1] 0.2074074

> which(results$error == min(results$error))
[1] 11

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result


     cost sigma     error
[1,]   16   0.1 0.2074074

So we see the optimal results occur for cost = 16 and sigma = 0.1, with a cross validation error of 20.7%.

Figure 20.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows.

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 20.1: Bound-constraint C-SVM tuning, 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.5%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)
> fit@cross
[1] 0.2457912


Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    1    2   3
          opel    1   31   50   3
          saab    1   24   61   2
          van     1    0    3  75

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 26.3% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.263


Technique 21

Weston-Watkins Multi-Class SVM

A bound-constraint Weston-Watkins SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "kbb-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "kbb-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Weston-Watkins support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: kbb-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 356

Objective Function Value : 0
Training error : 0.148
Cross validation error : 0.278

The function provides details of the type of support vector machine (kbb-svc), the model cost parameter, the kernel type and parameter (sigma = 0.05), the number of support vectors (356), the training error of 14.8% and the cross validation error of 27.8%. Both types of error can be individually accessed as follows.

> fit@error
[1] 0.148

> fit@cross
[1] 0.278

Let's see if we can tune the model using our own grid search. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.

> runs <- n_sigma * n_cost
> countcost <- 0
> countsigma <- 0

The results, in terms of cross validation error, cost and sigma, are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows.

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
                  C = cost[j], kernel = "rbfdot",
                  kpar = list(sigma = sigma[i]), cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross

      countsigma = countsigma + 1
      count = count + 1

      i = i + 1
    }  # end sigma loop

    i = 1
    j = j + 1
  }  # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek; you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2356902
2    4   0.2 0.2639731
3    4   0.3 0.3228956
4    4   0.4 0.2973064
5    4   0.5 0.2885522

Now let's find the best cross validation performance and its associated row number.

> with(results, error[error == min(error)])
[1] 0.2122896

> which(results$error == min(results$error))
[1] 21

So the optimal cross validation error is 21.2% and it is located in row 21 of results.

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   64   0.1 0.2122896

Figure 21.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows.

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 21.1: Weston-Watkins Multi-Class SVM tuning, 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 27.6%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)
> fit@cross
[1] 0.2762626

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    1    0   5
          opel    1   37   43   4
          saab    1   33   53   1
          van     1    1    2  75

The misclassification error rate can then be calculated.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.269

Overall the model achieves an error rate of 26.9% on the test sample.


Technique 22

Crammer-Singer Multi-Class SVM

A Crammer-Singer SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "spoc-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "spoc-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Crammer-Singer multi-class support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: spoc-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 340

Objective Function Value : 0
Training error : 0.152
Cross validation error : 0.244

The function provides details of the type of support vector machine (spoc-svc), the model cost parameter and kernel parameter (sigma = 0.05), the number of support vectors (340), the training error of 15.2% and the cross validation error of 24.4%. Both types of error can be individually accessed as follows.

> fit@error
[1] 0.152

> fit@cross
[1] 0.244

We need to tune the model. First we set up the ranges for the cost and sigma parameters.

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs.

> runs <- n_sigma * n_cost
> countcost <- 0
> countsigma <- 0

The results, in terms of cross validation error, cost and sigma, are stored in results.

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables.

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows (note the type is "spoc-svc" for this technique).

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")

      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
                  C = cost[j], kernel = "rbfdot",
                  kpar = list(sigma = sigma[i]), cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross

      countsigma = countsigma + 1
      count = count + 1

      i = i + 1
    }  # end sigma loop

    i = 1
    j = j + 1
  }  # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2299663
2    4   0.2 0.2378788
3    4   0.3 0.2582492
4    4   0.4 0.2730640

Now let's find the best cross validation performance and its associated row number. The optimal cross validation error is 20.8% and it is located in row 11 of results.

> with(results, error[error == min(error)])
[1] 0.2075758

> which(results$error == min(results$error))
[1] 11

We save the optimal values in best_result using a few lines of code.

> best_per_row <- which(results$error == min(results$error))
> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   16   0.1 0.2075758

Figure 22.1 presents a three dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows.

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")


Figure 22.1: Crammer-Singer Multi-Class SVM tuning, 3d scatterplot using Vehicle

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 24.8%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)
> fit@cross
[1] 0.2478114

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method.

> table(Vehicle$Class[-train], pred,
        dnn = c("Observed Class", "Predicted Class"))

               Predicted Class
Observed Class  bus opel saab van
          bus    88    0    0   6
          opel    6   43   30   6
          saab    3   29   52   4
          van     1    0    1  77

The misclassification error rate can then be calculated. Overall the model achieves an error rate of 24.9% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.249


Support Vector Regression


Technique 23

SVM eps-Regression

An eps-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "eps-regression", epsilon, gamma, cost, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and epsilon, type set to "eps-regression", the continuous response variable y, and the covariates data.

Step 1: Load Required Packages

We build the support vector machine eps-regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

> library(e1071)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 45 observations without replacement.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
    epsilon:  0.1

Number of Support Vectors:  33

The function provides details of the type of support vector machine (eps-regression), the cost parameter, the kernel type and associated gamma and epsilon parameters, and the number of support vectors, in this case 33.

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and epsilon between 0.01 and 0.9. The results are stored in obj.

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-regression", gamma = 2^(-2:2),
                  cost = 2^(2:4), epsilon = seq(0.01, 0.9, by = 0.01))

Once again we can use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost epsilon
  0.25    8    0.01

- best performance: 26.58699

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and an epsilon of 0.01, with a cross-validated mean squared error of about 26.6. These metrics can also be accessed by calling $best.performance and $best.parameters.

> obj$best.performance
[1] 26.58699

> obj$best.parameters
  gamma cost epsilon
6  0.25    8    0.01

We use these optimum parameters to refit the model, using leave one out cross validation (cross = 45, the size of the training sample), as follows.

> bestEpi <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
             type = "eps-regression", epsilon = bestEpi,
             gamma = bestGamma, cost = bestCost, cross = 45)

The fitted model now has 44 support vectors.

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

The summary method provides additional details. It also reports the mean squared error for each cross validation fold (not shown here).

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

45-fold cross-validation on training data:

Total Mean Squared Error: 28.0095
Squared Correlation Coefficient: 0.7877644


Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows.

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 23.1, is obtained using the plot function.

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712

Figure 23.1: Predicted and observed values using SVM eps-regression for bodyfat

Technique 24

SVM nu-Regression

A nu-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-regression", nu, gamma, cost, ...)

Key parameters include kernel, the kernel function; the parameters cost, gamma and nu; type set to "nu-regression"; the continuous response variable y; and the covariates data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 172.

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression")

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression")

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
         nu:  0.5

Number of Support Vectors:  33

The function reports the type of support vector machine (nu-regression), the cost parameter, the kernel type with its associated gamma and nu parameters, and the number of support vectors, in this case 33, which happens to be the same as estimated using the SVM eps-regression (see page 172).

We use the tune.svm method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and nu between 0.1 and 0.9. The results are stored in obj:

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "nu-regression",
                  gamma = 2^(-2:2),
                  cost = 2^(2:4),
                  nu = seq(0.1, 0.9, by = 0.1))

We use the summary function to see the result:

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
  gamma cost  nu
   0.25    8 0.9

- best performance: 2.642275

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8, and nu equal to 0.9, with a best performance (cross-validated mean squared error) of 2.64. These metrics can also be accessed by calling $best.performance and $best.parameters:

> obj$best.performance
[1] 2.642275

> obj$best.parameters
    gamma cost  nu
126  0.25    8 0.9

We use these optimum parameters to refit the model, this time using leave one out cross validation (cross = 45):

> bestNU    <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost  <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
             type = "nu-regression",
             nu = bestNU,
             gamma = bestGamma,
             cost = bestCost,
             cross = 45)

The fitted model now has 45 support vectors:

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

The summary method provides additional details. It also reports the mean square error for each cross validation fold (not shown here):

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ], type = "nu-regression",
    nu = bestNU, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

45-fold cross-validation on training data:

Total Mean Squared Error: 2.798642
Squared Correlation Coefficient: 0.7871421

Step 4: Make Predictions

The predict method with the test data and the fitted model fit is used as follows:

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 24.1, is obtained using the plot function:

> plot(bodyfat$DEXfat[-train], pred,
       xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712
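To compare this model directly with the eps-regression of Technique 23, the same error metric can be computed for both sets of test predictions. A hedged sketch, assuming the eps-regression predictions were saved earlier in a (hypothetical) object called pred_eps before being overwritten:

> rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
> rmse(bodyfat$DEXfat[-train], pred)       # nu-regression
> rmse(bodyfat$DEXfat[-train], pred_eps)   # eps-regression, saved earlier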

Figure 24.1: Predicted and observed values using SVM nu-regression for bodyfat

Technique 25

Bound-constraint SVM eps-Regression

A bound-constraint SVM eps-regression can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "eps-bsvr", kpar, ...)

Key parameters include kernel, the kernel function; type set to "eps-bsvr"; kpar, which contains kernel parameter values; the continuous response variable y; and the covariates data.

Step 1: Load Required Packages

We build the model using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

set.seed(465)
train <- sample(1:71, 45, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with the kernel parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
            type = "eps-bsvr",
            kernel = "rbfdot",
            kpar = list(sigma = 0.05),
            cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: eps-bsvr  (regression)

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 35

Objective Function Value : -77.105
Training error : 0.096254
Cross validation error : 1.877613

The function reports the type of support vector machine (eps-bsvr) and the kernel parameter (sigma = 0.05). Notice the model estimates 35 support vectors, with a training error of 0.096 and a cross validation error of 1.88. Since the output of ksvm (in this case the values are stored in fit) is an S4 object, the parameters and both errors can be accessed directly using "@":

> fit@param$C
[1] 1

> fit@param$epsilon
[1] 0.1

> fit@error
[1] 0.09625418

> fit@cross
[1] 1.877613
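kernlab also ships accessor functions for the most commonly used slots, which avoids having to remember slot names. A brief sketch, assuming the standard kernlab accessors for ksvm objects:

> error(fit)   # training error, same as fit@error
> cross(fit)   # cross validation error, same as fit@cross
> param(fit)   # list of model parameters (C, epsilon, ...)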

We need to tune the model to obtain the optimum parameters. Let's create a small bit of code to do this for us. First we set up the ranges for the sigma and epsilon parameters:

> epsilon <- seq(0.01, 0.1, by = 0.01)
> sigma <- seq(0.1, 1, by = 0.01)

> n_epsilon <- length(epsilon)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Since it contains 910 models it may take a little while to run:

> runs <- n_sigma * n_epsilon
> runs
[1] 910

The results, in terms of cost, sigma, epsilon and cross validation error, are stored in results. Note that the columns are named in the order in which the tuning loop below fills them:

> results <- 1:(4 * runs)
> dim(results) <- c(runs, 4)
> colnames(results) <- c("cost", "sigma", "epsilon", "error")

The loop variables are as follows:

> count.epsilon <- 0
> count.sigma <- 0

> i = 1
> j = 1
> k = 1
> count = 1

The main loop for tuning is as follows:

> for (val in epsilon) {
    for (val in sigma) {
      fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-bsvr",
                  epsilon = epsilon[k],
                  kernel = "rbfdot",
                  kpar = list(sigma = sigma[i]),
                  cross = 45)

      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@param$epsilon
      results[count, 4] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    k = k + 1
    cat("iteration (%) = ", round((count / runs) * 100, 0), "\n")
  }  # end epsilon loop
> i = 1
> k = 1
> j = j + 1
Notice we set cross = 45 to perform leave one out cross validation. When you execute the above code, as it runs you should see output along the lines of:

iteration (%) =  10
iteration (%) =  20
iteration (%) =  30

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma epsilon    error
1    1  0.10    0.01 2.215108
2    1  0.11    0.01 2.276689
3    1  0.12    0.01 2.324192
4    1  0.13    0.01 2.378274
5    1  0.14    0.01 2.435676

Now let's find the best cross validation performance (2.19) and its associated row number (274):

> with(results, error[error == min(error)])
[1] 2.186181

> which(results$error == min(results$error))
[1] 274

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost    <- results[best_per_row, 1]
> fit_sigma   <- results[best_per_row, 2]
> fit_epsilon <- results[best_per_row, 3]
> fit_xerror  <- results[best_per_row, 4]

> best_result <- cbind(fit_cost, fit_epsilon, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "epsilon", "sigma", "error")

> best_result
     cost epsilon sigma    error
[1,]    1    0.04   0.1 2.186181

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error (note that ksvm takes its cost parameter through the argument C). It is around 2.19:

> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
              type = "eps-bsvr",
              C = fit_cost,
              epsilon = fit_epsilon,
              kernel = "rbfdot",
              kpar = list(sigma = fit_sigma),
              cross = 45)

> fit@cross
[1] 2.186834

Step 4: Make Predictions

The predict method with the test data and the fitted model fit is used as follows:

pred <- predict(fit, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 25.1, is visualized using the plot method combined with the abline method to show the linear regression line:

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred,
       xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

The squared correlation between the test sample predicted and observed values is 0.834:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.834

Figure 25.1: Bound-constraint SVM eps-Regression observed and fitted values using bodyfat

Support Vector Novelty Detection

Technique 26

One-Classification SVM

A One-Classification SVM can be estimated using the package e1071 with the svm function:

svm(x, kernel, type = "one-classification", ...)

Key parameters include x, the attributes you use to train the model; type = "one-classification"; and kernel, the kernel function.

Step 1: Load Required Packages

We build the one-classification SVM using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> library(e1071)
> library(caret)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

To begin we choose the bus category as our one-classification class and create the variable isbus to store TRUE/FALSE values of vehicle type:

> set.seed(103)
> Vehicle$isbus[Vehicle$Class == "bus"] <- TRUE
> Vehicle$isbus[Vehicle$Class != "bus"] <- FALSE

As a reminder of the distribution of Vehicle we use the table method:

> table(Vehicle$Class)

 bus opel saab  van
 218  212  217  199

As a check, note that you should observe 218 observations corresponding to bus in isbus = TRUE:

> table(Vehicle$isbus)

FALSE  TRUE
  628   218

Our next step is to create the variables isbusTrue and isbusFalse to hold positive (bus) and negative (non-bus) observations respectively:

> isbusTrue <- subset(Vehicle, Vehicle$isbus == TRUE)
> isbusFalse <- subset(Vehicle, Vehicle$isbus == FALSE)

Next, from the subset of 218 positive values, we sample 100 at random for training; the positive observations not selected will later form part of the test sample:

> train <- sample(1:218, 100, FALSE)

> trainAttributes <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 19]

Now, as a quick check, type trainLabels and confirm that they are all bus.

Step 3: Estimate & Evaluate Model

We need to tune the model. We begin by setting up the tuning parameters. The variable runs contains the total number of models we will estimate during the tuning process:

gamma <- seq(0.01, 0.05, by = 0.01)
nu <- seq(0.01, 0.05, by = 0.01)

n_gam <- length(gamma)
n_nu <- length(nu)

runs <- n_gam * n_nu

count.gamma <- 0
count.nu <- 0

Next, set up the variable to store the results and the loop count variables i, k and count:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("gamma", "nu", "Performance")

> i = 1
> k = 1
> count = 1

The main tuning loop is created as follows (notice we estimate the model using a 10-fold cross validation (cross = 10) and store the results in fit):

> for (val in nu) {
    for (val in gamma) {
      print(gamma[i])
      print(nu[k])

      fit <- svm(trainAttributes, y = NULL,
                 type = 'one-classification',
                 nu = nu[k],
                 gamma = gamma[i],
                 cross = 10)

      results[count, 2] = fit$gamma
      results[count, 1] = fit$nu
      results[count, 3] = fit$tot.accuracy
      count.gamma = count.gamma + 1
      count = count + 1
      i = i + 1
    }  # end gamma loop
    i = 1
    k = k + 1
  }  # end nu loop

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  gamma   nu Performance
1  0.01 0.01          94
2  0.01 0.02          88
3  0.01 0.03          87
4  0.01 0.04          81

Now let's find the best cross validation performance:

> with(results, Performance[Performance == max(Performance)])
[1] 94 94

Notice the best performance contains two values, both at 94. Since this is a very high value it is possibly the result of over fitting - a consequence of over tuning a model. The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$Performance == max(results$Performance))

> fit_gamma <- results[best_per_row, 1]
> fit_nu <- results[best_per_row, 2]
> fit_per <- results[best_per_row, 3]

> best_result <- cbind(fit_gamma, fit_nu, fit_per)
> colnames(best_result) <- c("gamma", "nu", "Performance")

> best_result
     gamma   nu Performance
[1,]  0.01 0.01          94
[2,]  0.02 0.01          94

The relationship between the parameters and performance is visualized using the scatterplot3d function and shown in Figure 26.1:

> scatterplot3d(results$nu, results$gamma, results$Performance,
                xlab = "nu",
                ylab = "gamma",
                zlab = "Performance")

Figure 26.1: Relationship between tuning parameters

Let's fit the optimum model using a leave one out cross validation (cross = 100, the size of the training sample):

> fit <- svm(trainAttributes, y = NULL,
             type = "one-classification",
             nu = fit_nu,
             gamma = fit_gamma,
             cross = 100)

Step 4: Make Predictions

To illustrate the performance of the model on the test sample we will use the confusionMatrix method from the caret package. The first step is to gather together the required information for the training and test samples, using the predict method to make the forecasts:

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest <- predict(fit, testpredictors)

Next we create two tables, confTrain containing the training predictions and confTest the test sample predictions:

> confTrain <- table(Predicted = svm.predtrain, Reference = trainLabels)
> confTest <- table(Predicted = svm.predtest, Reference = testLabels)

Now we call the confusionMatrix method for the test sample:

> confusionMatrix(confTest, positive = "TRUE")

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   175    5
    TRUE    453  113

               Accuracy : 0.3861
                 95% CI : (0.351, 0.4221)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 1

                  Kappa : 0.093
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.9576
            Specificity : 0.2787
         Pos Pred Value : 0.1996
         Neg Pred Value : 0.9722
             Prevalence : 0.1582
         Detection Rate : 0.1515
   Detection Prevalence : 0.7587
      Balanced Accuracy : 0.6181

       'Positive' Class : TRUE

The method produces a range of test statistics, including an accuracy rate of only 38% and a kappa of 0.093. Maybe we over-tuned the model a little. In this example we can see clearly that the consequence of over fitting in training is poor generalization in the test sample.

For illustrative purposes we re-estimate the model, this time with nu = 0.05 and gamma = 0.05:

> fit <- svm(trainAttributes, y = NULL,
             type = "one-classification",
             nu = 0.05, gamma = 0.05,
             cross = 100)

A little bit of housekeeping, as discussed previously:

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svm.predtrain <- predict(fit, trainpredictors)
> svm.predtest <- predict(fit, testpredictors)

> confTrain <- table(Predicted = svm.predtrain, Reference = trainLabels)
> confTest <- table(Predicted = svm.predtest, Reference = testLabels)

Now we are ready to pass the necessary information to the confusionMatrix method:

> confusionMatrix(confTest, positive = "TRUE")

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   568   29
    TRUE     60   89

               Accuracy : 0.8807
                 95% CI : (0.8552, 0.9031)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 0.001575

                  Kappa : 0.5952
 Mcnemar's Test P-Value : 0.001473

            Sensitivity : 0.7542
            Specificity : 0.9045
         Pos Pred Value : 0.5973
         Neg Pred Value : 0.9514
             Prevalence : 0.1582
         Detection Rate : 0.1193
   Detection Prevalence : 0.1997
      Balanced Accuracy : 0.8293

       'Positive' Class : TRUE

Now the accuracy rate is 0.88 with a kappa of around 0.6. An important takeaway is that support vector machines, as with many predictive analytic methods, can be very sensitive to over fitting.
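If you would rather pull these two statistics out programmatically than read them off the printed summary, the object returned by confusionMatrix stores them in its overall component. A minimal sketch:

> cm <- confusionMatrix(confTest, positive = "TRUE")
> round(cm$overall[c("Accuracy", "Kappa")], 3)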

Notes

35 See the paper by Cortes, C., Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20, 273-297.
36 Note that classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. The SVM is well suited for such tasks.
37 The nu parameter in nu-SVM.
38 For further details see Hoehler F.K., 2000. Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol 53, 499-503.
39 Gomes, A. L., et al. Classification of dengue fever patients based on gene expression data using support vector machines. PloS one 5.6 (2010): e11267.
40 Huang, Wei, Yoshiteru Nakamori, and Shou-Yang Wang. Forecasting stock market movement direction with support vector machine. Computers & Operations Research 32.10 (2005): 2513-2522.
41 Min, Jae H., and Young-Chan Lee. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications 28.4 (2005): 603-614.
42 Upstill-Goddard, Rosanna, et al. Support vector machine classifier for estrogen receptor positive and negative early-onset breast cancer. (2013): e68606.
43 Tehrany, Mahyat Shafapour, et al. Flood susceptibility assessment using GIS-based support vector machine model with different kernel types. Catena 125 (2015): 91-101.
44 Calculated by the author using the average of all four support vector machines.
45 Radhika, K. R., M. K. Venkatesha, and G. N. Sekhar. Off-line signature authentication based on moment invariants using support vector machine. Journal of Computer Science 6.3 (2010): 305.
46 Guo, Shuyu, Robyn M. Lucas, and Anne-Louise Ponsonby. A novel approach for prediction of vitamin D status using support vector regression. PloS one 8.11 (2013).
47 Latitude, ambient ultraviolet radiation levels, ambient temperature, hours in the sun 6 weeks before the blood draw (log transformed to improve the linear fit), frequency of wearing shorts in the last summer, physical activity (three levels: mild, moderate, vigorous), sex, hip circumference, height, left back shoulder melanin density, buttock melanin density and inner upper arm melanin density.
48 For further details see either of the following:
   1. Cawley G.C. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In: International Joint Conference on Neural Networks. IEEE; 2006. p. 1661-1668.
   2. Vapnik V., Chapelle O. Bounds on error expectation for support vector machines. Neural Computation 2000;12(9):2013-2036.
   3. Muller K.R., Mika S., Ratsch G., Tsuda K., Scholkopf B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks.
49 See:
   1. Bach F.R., Lanckriet G.R., Jordan M.I. Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the twenty-first international conference on Machine learning. ACM; 2004. p. 6-13.
   2. Zien A., Ong C.S. Multiclass multiple kernel learning. In: Proceedings of the 24th international conference on Machine learning. ACM; 2007. p. 1191-1198.
50 Thadani K., Jayaraman V., Sundararajan V. Evolutionary selection of kernels in support vector machines. In: International Conference on Advanced Computing and Communications. IEEE; 2006. p. 19-24.
51 See for example:
   1. Lin, Shih-Wei, et al. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications 35.4 (2008): 1817-1824.
   2. Melgani, Farid, and Yakoub Bazi. Classification of electrocardiogram signals with support vector machines and particle swarm optimization. Information Technology in Biomedicine, IEEE Transactions on 12.5 (2008): 667-677.
52 For further discussion on the issues surrounding over fitting see G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularization of the hyper-parameters, Journal of Machine Learning Research, volume 8, pages 841-861, April 2007.


Part III

Relevance Vector Machine


The Basic Idea

The relevance vector machine (RVM) shares its functional form with the support vector machine (SVM) discussed in Part II. RVMs exploit a probabilistic Bayesian learning framework.

We begin with a data set of N training pairs {x_i, y_i}, where x_i is the input feature vector and y_i is the target output. The RVM makes predictions using

y_i = w^T K + ε     (26.1)

where w = [w_1, ..., w_N] is the vector of weights, K = [k(x_i, x_1), ..., k(x_i, x_N)]^T is the vector of kernel functions, and ε is the error, which for algorithmic simplicity is assumed to be zero-mean, independently identically distributed Gaussian with variance σ². Therefore the prediction y_i consists of the target output polluted by Gaussian noise.

The Gaussian likelihood of the data is given by

p(y | w, σ²) = (2π)^(-N/2) σ^(-N) exp( -(1 / 2σ²) ||y - Φw||² )     (26.2)

where y = [y_1, ..., y_N], w = [ε, w_1, ..., w_N], and Φ is an N × (N + 1) matrix with Φ_ij = k(x_i, x_{j-1}) and Φ_i1 = 1.

A standard approach to estimate the parameters in equation 26.2 is to use a zero-mean Gaussian to introduce an individual hyperparameter α_i on each of the weights w_i, so that

w ~ N(0, 1/α)     (26.3)

where α = [α_1, ..., α_N]. The prior and posterior probability distributions are then easily derived.53

The RVM uses many fewer kernel functions than the SVM. It also yields probabilistic predictions, automatic estimation of parameters and the ability to use arbitrary kernel functions. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning basis functions which are relevant for making good predictions.

NOTE

RVM requires the inversion of N × N matrices. Since it takes O(N³) operations to invert an N × N matrix, it can quickly become computationally expensive, and therefore slow, as the sample size increases.

Practical Applications

PRACTITIONER TIP

The relevance vector machine is often assessed using the root mean squared error (RMSE) or the Nash-Sutcliffe efficiency (NS). The larger the value of NS, the smaller the value of RMSE.

RMSE = sqrt( Σ_{i=1}^{N} (p_i - x_i)² / N )     (26.4)

NS = 1 - Σ_{i=1}^{N} (p_i - x_i)² / Σ_{i=1}^{N} (x_i - x_mean)²     (26.5)

p_i is the predicted value, x_i the observed value, x_mean is the average of the observed sample, and N is the number of observations.
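Both metrics are easy to compute directly in R. A minimal sketch of helper functions matching equations 26.4 and 26.5 (obs = observed values, pred = predicted values):

rmse <- function(obs, pred) sqrt(sum((pred - obs)^2) / length(obs))
ns   <- function(obs, pred) 1 - sum((pred - obs)^2) / sum((obs - mean(obs))^2)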

Oil Sand Pump Prognostics

In wet mineral processing operations slurry pumps deliver a mixture of bitumen, sand and small pieces of rock from one site to another. Due to the harsh environment in which they operate these pumps can fail suddenly. The consequent downtime can lead to a large economic cost for the mining company due to the interruption of the mineral processing operations. Hu and Tse54 combine a RVM with an exponential model to predict the remaining useful life (RUL) of slurry pumps.

Data were collected from the inlet and outlet of slurry pumps operating in an oil sand mine using four accelerometers placed at key locations on the pump. In total the pump was subjected to 904 measurement hours.

Two data-sets were collected from different positions (Site 1 and Site 2) in the same pump, and an alternative model of the pump degradation was developed using the sum of two exponential functions. The overall performance results are shown in Table 13.

          Site 1   Site 1              Site 2   Site 2
          RVM      Exponential Model   RVM      Exponential Model
          70.51    25.31               28.85    70.5

Table 13: Hu and Tse's weighted average accuracy of prediction

Precision Agriculture

Precision agriculture involves using aircraft or spacecraft to gather high-resolution information on crops to better manage the growing and harvesting cycle. Chlorophyll concentration, measured in mass per unit leaf area (μg cm⁻²), is an important biophysical measurement retrievable from air or space reflectance data. Elarab et al55 use a RVM to estimate spatially distributed chlorophyll concentrations.

Data was gathered from three aircraft flights during the growing season (early growth, mid growth and early flowering). The data set consisted of the dependent variable (chlorophyll concentration) and 8 attribute or explanatory variables. All attributes were used to train and test the model.

The researchers used six different kernels (Gauss, Laplace, spline, Cauchy, thin plate spline (tps) and bubble) alongside a 5-fold cross validation. A variety of kernel widths ranging from 10^-5 to 10^5 were also used. The root mean square error and Nash-Sutcliffe efficiency were used to assess model fit. The best fitting trained model used a tps kernel and had a root mean square error of 5.31 μg cm⁻² and a Nash-Sutcliffe efficiency of 0.76.

The test data consisted of a sample gathered from an unseen (by the RVM) flight. The model using this data had a root mean square error of 8.52 μg cm⁻² and a Nash-Sutcliffe efficiency of 0.71. The researchers conclude "This result showed that the [RVM] model successfully performed when given unseen data".

Detecting Deception in Speech

Zhou56 develops a technique to detect deception in speech by extracting speech dynamics (prosodic and non-linear features) and applying a RVM.

The data set consisted of recorded interviews of 16 male and 16 female participants. A total of 640 deceptive samples and 640 non-deceptive samples were used in the analysis.

Speech dynamics such as pitch frequency, short-term vocal energy and mel-frequency cepstrum coefficients57 were used as input attribute features.

Classification accuracy of the RVM was assessed relative to a support vector machine and a neural network for various training sample sizes. For example, with a training sample of 400, the RVM correctly classifies 70.37% of male voices, whilst the support vector machine and neural network correctly classify 68.14% and 42.13% respectively.

The researchers observe that a combination of prosodic and non-linear features modeled using a RVM is effective for detecting deceptive speech.

Diesel Engine Performance

The mathematical form of diesel engines is highly nonlinear. Because of this they are often modeled using an artificial neural network (ANN). Wong et al58 perform an experiment to assess the ability of an ANN and a RVM to predict diesel engine performance.

Three inputs are used in the models (engine speed, load and cooling water temperature) and the output variables are brake specific fuel consumption and exhaust emissions such as nitrogen oxide and carbon dioxide.

A water-cooled 4-cylinder direct-injection diesel engine was used to generate data for the experiment. Data was recorded at five engine speeds (1200, 1400, 1600, 1800 and 2000 rpm) with engine torques (28, 70, 140, 210 and 252 Nm). Each test was carried out three times and the average values were used in the analysis. In all, 22 sets of data were collected, with 18 used as the training data and 4 to assess model performance.

The ANN had a single hidden layer with twenty neurons and a hyperbolic tangent sigmoid transfer function.

For the training data the researchers report an average root mean square error (RMSE) of 3.27 for the RVM and 41.59 for the ANN. The average RMSE for the test set was 17.73 and 38.56 for the RVM and ANN respectively. The researchers also note that the average R² for the test set was 0.929 for the RVM and 0.707 for the ANN.

Gas Layer Classification

Zhao et al59 consider the issue of optimal parameter selection for a RVM and identification of gas at various drill depths. To optimize the parameters of a RVM they use the particle swarm algorithm.60 The sample consists of 201 wells drilled at various depths, with 63 gas producing wells and 138 non-gas producing wells. A total of 12 logging attributes are used as features in the RVM model. A prediction accuracy of 93.53% is obtained during training using all the attributes; the prediction accuracy was somewhat lower, at 91.75%, for the test set.

Credit Risk Assessment

Tong and Li61 assess the use of a RVM and a support vector machine (SVM) in the assessment of company credit risk. The data consist of financial characteristics of 464 Chinese firms listed on China's securities markets. A total of 116 of the 464 firms had experienced a serious credit event and were coded as "0" by the researchers. The remaining firms were coded "1".

The attribute vector consisted of 25 financial ratios drawn from seven basic categories (cash flow, return on equity, earning capacity, operating capacity, growth capacity, short term solvency and long term solvency). Since many of these financial ratios have a high correlation with each other, the researchers used principal components (PCA) and isomaps to reduce the dimensionality. The models they then compared were a PCA-RVM, Isomap-SVM and Isomap-RVM. A summary of the results is reported in Table 14.

Model        Accuracy (%)
Isomap-RVM   90.28
Isomap-SVM   86.11
PCA-RVM      89.59

Table 14: Summary of Tong and Li's overall prediction accuracy

PRACTITIONER TIP

You may have noticed that several researchers compared their RVM to a support vector machine or other model. A data scientist I worked with faced an issue where one model (logistic regression) performed marginally better than an alternative model (decision tree). However, the decision was made to go with the decision tree because, due to a quirk in the software, it was better labeled.

Zhou et al demonstrated the superiority of a RVM over a support vector machine for their data-set. The evidence was less compelling in the case of Tong and Li, where Table 14 appears to indicate a fairly close race. In the case where alternative models perform in a similar range, the rationale for which to choose may be based on considerations not related to which model performed best in absolute terms on the test set. This was certainly the case in the choice between logistic regression and decision trees faced by my data scientist co-worker.

Technique 27

RVM Regression

A RVM regression can be estimated using the package kernlab with the rvm function:

rvm(y ~ ., data, kernel, kpar, ...)

Key parameters include kernel, the kernel function; kpar, which contains kernel parameter values; the continuous response variable y; and the covariates data.

Step 1: Load Required Packages

We build the RVM regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

set.seed(465)
train <- sample(1:71, 45, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the RVM using a radial basis kernel (kernel = "rbfdot") with the kernel parameter sigma equal to 0.5. We also set the cross validation parameter cross = 10.

> fit <- rvm(DEXfat ~ ., data = bodyfat[train, ],
             kernel = "rbfdot",
             kpar = list(sigma = 0.5),
             cross = 10)

The print function provides details of the estimated model, including the kernel type (radial basis), the kernel parameter (sigma = 0.5), the number of relevance vectors (43) and the cross validation error:

> print(fit)
Relevance Vector Machine object of class "rvm"
Problem type: regression

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.5

Number of Relevance Vectors : 43
Variance : 12.11944
Training error : 2.4802080247
Cross validation error : 10.79221

Since the output of rvm is an S4 object, it can be accessed directly using "@":

> fit@error
[1] 2.480208

> fit@cross
[1] 10.79221

OK, let's fit two other models and choose the one with the lowest cross validation error. The first model, fit1, is estimated using the default settings of rvm. The second model, fit2, uses a Laplacian kernel. Ten-fold cross validation is used for both models:

> fit1 <- rvm(DEXfat ~ ., data = bodyfat[train, ], cross = 10)

> fit2 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
              kernel = "laplacedot",
              kpar = list(sigma = 0.001),
              cross = 10)

Now we are ready to assess the fit of each of the three models - fit, fit1 and fit2:

> fit@cross
[1] 10.82592

> fit1@cross
[1] 6.550046

> fit2@cross
[1] 2.19755

The model fit2, using a Laplacian kernel, has by far the best overall cross validation error.

Step 4: Make Predictions

The predict method with the test data and the fitted model fit2 is used as follows:

pred <- predict(fit2, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 27.1, is visualized using the plot method combined with the abline method to show the linear regression line:

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred,
       xlab = "DEXfat",
       ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

The squared correlation between the test sample predicted and observed values is 0.813:

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.813

Figure 27.1: RVM regression of observed and fitted values using bodyfat

Notes

53 For further details see:
   1. M. E. Tipping, "Bayesian inference: an introduction to principles and practice in machine learning," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Ratsch, Eds., pp. 41-62, Springer, Berlin, Germany, 2004.
   2. M. E. Tipping, "SparseBayes: An Efficient Matlab Implementation of the Sparse Bayesian Modelling Algorithm (Version 2.0)," March 2009, http://www.relevancevector.com
54 Hu, Jinfei, and Peter W. Tse. A relevance vector machine-based approach with application to oil sand pump prognostics. Sensors 13.9 (2013): 12663-12686.
55 Elarab, Manal, et al. Estimating chlorophyll with thermal and broadband multispectral high resolution imagery from an unmanned aerial system using relevance vector machines for precision agriculture. International Journal of Applied Earth Observation and Geoinformation (2015).
56 Zhou, Yan, et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features. Neurocomputing 151 (2015): 1042-1052.
57 Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, commonly used as features in speech recognition systems. See for example Logan, Beth. Mel Frequency Cepstral Coefficients for Music Modeling. ISMIR. 2000; and Hasan, Md Rashidul, Mustafa Jamil, Md Golam Rabbani and Md Saifur Rahman. Speaker identification using Mel frequency cepstral coefficients. Variations 1 (2004): 4.
58 Wong, Ka In, Pak Kin Wong, and Chun Shun Cheung. Modelling and prediction of diesel engine performance using relevance vector machine. International Journal of Green Energy 12.3 (2015): 265-271.
59 Zhao, Qianqian, et al. Relevance Vector Machine and Its Application in Gas Layer Classification. Journal of Computational Information Systems 9.20 (2013): 8343-8350.
60 See Haiyan Lu, Pichet Sriyanyong, Yong Hua Song, Tharam Dillon. Experimental study of a new hybrid PSO with mutation for economic dispatch with non-smooth cost function. International Journal of Electrical Power and Energy Systems 32 (9), 2012: 921-935.
61 Tong, Guangrong, and Siwei Li. Construction and Application Research of Isomap-RVM Credit Assessment Model. Mathematical Problems in Engineering (2015).


Part IV

Neural Networks


The Basic Idea

An artificial neural network (ANN) is constructed from a number of interconnected nodes known as neurons. These are arranged into an input layer, a hidden layer and an output layer. The input nodes correspond to the number of features you wish to feed into the ANN, and the number of output nodes corresponds to the number of items you wish to predict. Figure 27.2 presents an overview of an artificial neural network (ANN) topology. It has 2 input nodes, 1 hidden layer with 3 nodes, and 1 output node.

Figure 27.2: A basic neural network

The Neuron

At the heart of a neural network is the neuron. Figure 27.3 outlines the workings of an individual neuron. A weight is associated with each arc into a neuron, and the neuron then sums all inputs according to

S_j = Σ_{i=1}^{N} w_ij a_i + b_j     (27.1)

The parameter b_j represents the bias associated with the neuron. It allows the network to shift the activation function "upwards" or "downwards". This type of flexibility is important for successful machine learning.

Figure 27.3: An artificial neuron

Activation and Learning

A neural network is generally initialized with random weights. Once the network has been initialized it is then trained. Training consists of two elements - activation and learning.

• Step 1: An activation function f(S_j) is applied and the output passed to the next neuron(s) in the network. The sigmoid function is a popular activation function (a short R sketch of a single neuron's forward pass is given after this list):

  f(S_j) = 1 / (1 + exp(-S_j))     (27.2)

• Step 2: A learning "law" that describes how the adjustments of the weights are to be made during training. The most popular learning law is backpropagation.
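The following minimal sketch (illustrative values only, not drawn from the text) shows a single neuron computing equations 27.1 and 27.2 in R:

# One neuron: weighted sum of inputs plus bias, then sigmoid activation
sigmoid <- function(s) 1 / (1 + exp(-s))   # equation 27.2
a <- c(0.5, -1.2, 0.3)                     # inputs arriving from the previous layer
w <- c(0.8, 0.1, -0.4)                     # weights on the incoming arcs
b <- 0.2                                   # bias b_j
S <- sum(w * a) + b                        # weighted sum, equation 27.1
f_out <- sigmoid(S)                        # output passed to the next neuron(s)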

The Backpropagation Algorithm

It consists of the following steps:

1. The network is presented with input attributes and the target outcome.

2. The output of the network is compared to the actual known target outcome.

3. The weights and biases of each neuron are adjusted by a factor based on the derivative of the activation function, the differences between the network output and the actual target outcome, and the neuron outputs. Through this process the network "learns".

Two parameters are often used to speed up learning and prevent the system from being trapped in a local minimum. They are known as the learning rate and the momentum; a minimal sketch of how they enter a single weight update is given below.
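The sketch below is illustrative only (the values are invented and this is standard gradient-descent-with-momentum notation, not the internals of any particular R package):

# One weight update using a learning rate and a momentum term
lr       <- 0.1    # learning rate
momentum <- 0.9    # momentum
grad     <- 0.05   # hypothetical gradient of the error with respect to this weight
w        <- 0.3    # current weight
dw_prev  <- 0.02   # weight change applied at the previous iteration

dw <- -lr * grad + momentum * dw_prev   # new weight change
w  <- w + dw                            # updated weight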

PRACTITIONER TIP

ANNs are initialized by setting random values to the weights and biases. One rule of thumb is to set the random values to lie in the range (-2/k to 2/k), where k is the number of inputs; a one-line illustration follows.
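A quick hedged illustration of this rule of thumb in R (k chosen arbitrarily):

k <- 6                                              # number of inputs feeding the neuron
init_weights <- runif(k, min = -2 / k, max = 2 / k) # random initial weights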

How Many Nodes and Hidden Layers?

One of the very first questions asked about neural networks is how many nodes and layers should be included in the model. There are no fixed rules as to how many nodes to include in the hidden layer. However, as the number of hidden nodes increases, so does the time taken for the network to learn from the input data.

Practical Applications

Sheet Sediment Transport

Tayfur62 models sediment transport using artificial neural networks (ANNs). The sample consisted of the original experimental hydrological data of Kilinic & Richardson.63

A three-layer feed-forward artificial neural network with two neurons in the input layer, eight neurons in the hidden layer and one neuron in the output layer was built. The sigmoid function was used as an activation function in the training of the network, and the learning of the ANN was accomplished by the back-propagation algorithm. Random values of 0.2 and -1.0 were assigned to the network weights and biases before starting the training process.

The ANN's performance was assessed relative to popular physical hydrological models (flow velocity, shear stress, stream power and unit stream power) against a combination of slope types (mild, steep and very steep) and rain intensity (low, high, very high).

The ANN outperformed the popular hydrological models for very high intensity rainfall on both steep and very steep slopes, see Table 15.

                     Mild slope   Steep slope   Very steep slope
Low intensity        physical     physical      ANN
High intensity       physical     physical      physical
Very high intensity  physical     ANN           ANN

Table 15: Which model is best? Model transition framework derived from Tayfur's analysis

PRACTITIONER TIP

Tayfur's results highlight an important issue facing the data scientist. Not only is it often a challenge to find the best model among competing candidates, it is even more difficult to identify a single model that works in all situations. A great solution offered by Tayfur is to have a model transition matrix: that is, determine which model(s) perform well under specific conditions and then use the appropriate model for a given condition.

Stock Market Volatility

Mantri et al64 investigate the performance of a multilayer perceptron relative to standard econometric models of stock market volatility (GARCH, Exponential GARCH, Integrated GARCH and the Glosten, Jagannathan and Runkle GARCH model).

Since the multilayer perceptron does not make assumptions about the distribution of stock market innovations, it is of interest to financial analysts and statisticians. The sample consisted of daily data collected on two Indian stock indices (the BSE SENSEX and the NSE NIFTY) over the period January 1995 to December 2008.

The researchers found no statistical difference between the volatility of the stock indices estimated by the multilayer perceptron and the standard econometric models of stock market volatility.

Trauma Survival

The performance of trauma departments in the United Kingdom is widely audited by applying predictive models that assess the probability of survival and examining the rate of unexpected survivals and deaths. The standard approach is the TRISS methodology, which consists of two logistic regressions, one applied if the patient has a penetrating injury and the other applied for blunt injuries.65

Hunter, Henry and Ferguson66 assess the performance of the TRISS methodology against alternative logistic regression models and a multilayer perceptron.

The sample consisted of 15,055 cases gathered from Scottish trauma departments over the period 1992-1996. The data was divided into two subsets: the training set, containing 7,224 cases from 1992-1994, and the test set, containing 7,831 cases gathered from 1995-1996. The researchers' logistic regression models and the neural network were optimized using the training set.

The neural network was optimized ten times, with the best resulting model selected. The researchers conclude that neural networks can yield better results than logistic regression.

PRACTITIONER TIP

The sigmoid function is a popular choice as an activation function. It is good practice to standardize (i.e. convert to the range (0,1)) all external input values before passing them into a neural network. This is because, without standardization, large input values require very small weighting factors. This can cause two basic problems:

1. Inaccuracies introduced by very small floating point calculations on your computer.

2. Changes made by the back-propagation algorithm will be extremely small, causing training to be slow (the gradient of the sigmoid function at extreme values would be approximately zero).

Brown Trout Redds

Lek et al67 compare the ability of multiple regression and neural networks to predict the density of brown trout redds in southwest France. Twenty nine observation stations, distributed on six rivers and divided into 205 morphodynamic units, collected information on 10 ecological metrics.

The models were fitted using all the ecological variables and also with a sub-set of four variables. Testing consisted of random selection for the training set (75% of observations) and the test set (25% of observations). The process was repeated a total of five times.

The average correlation between the observed and estimated values over the five samples is reported in Table 16. The researchers conclude that both multiple regression and neural networks can be used to predict the density of brown trout redds; however, the neural network model had better prediction accuracy.

        Neural Network        Multiple Regression
        Train     Test        Train     Test
        0.900     0.886       0.684     0.609

Table 16: Lek et al's reported correlation coefficients between estimated and observed values in training and test samples

Electric Fish Localization

Weakly electric fish emit an electric discharge used to navigate the surrounding water and to communicate with other members of their shoal. Tracking of individual fish is often carried out using infrared cameras. However, this approach becomes unreliable when there is a visual obstruction.68

Kiar et al69 develop a non-invasive means of tracking weakly electric fish in real-time using a cascade forward neural network. The data set contained 299 data points, which were interpolated to 29,900 data points. The accuracy of the neural network was 94.3% within 1 cm of the actual fish location, with a mean square error of 0.02 mm and an image frame rate of 213 Hz.

Chlorophyll Dynamics

Wu et al70 developed two modeling approaches - artificial neural networks (ANN) and multiple linear regression (MLR) - to simulate the daily chlorophyll a dynamics in a northern German lowland river. Chlorophyll absorbs sunlight to synthesize carbohydrates from CO2 and water. It is often used as a proxy for the amount of phytoplankton present in water bodies.

Daily chlorophyll a samples were taken over an 18 month period. In total 426 daily samples were obtained. Each 10th daily sample was assigned to the validation set, resulting in 42 daily observations. The calibration set contained 384 daily observations.

For ANN modelling a three layer back propagation neural network was used. The input layer consisted of 12 neurons corresponding to the independent variables shown in Table 17. The same independent variables were also used in the multiple regression model. The dependent variable in both models was the daily concentration of chlorophyll a.

Air temperature                  Ammonium nitrogen
Average daily discharge          Chloride
Chlorophyll a concentration      Daily precipitation
Dissolved inorganic nitrogen     Nitrate nitrogen
Nitrite nitrogen                 Orthophosphate phosphorus
Sulfate                          Total phosphorus

Table 17: Wu et al's input variables

The results of the ANN and MLR illustrated a good agreement between the observed and predicted daily concentration of chlorophyll a, see Table 18.

Model   R-Square   NS     RMSE
MLR     0.53       0.53   2.75
NN      0.63       0.62   1.94

Table 18: Wu et al's performance metrics (NS = Nash-Sutcliffe efficiency, RMSE = root mean square error)

PRACTITIONER TIP

Whilst there are no fixed rules about how to standardize inputs, here are four popular choices for your original input variable x_i (a short R sketch of each follows):

z_i = (x_i - x_min) / (x_max - x_min)     (27.3)

z_i = (x_i - x_mean) / σ_x     (27.4)

z_i = x_i / sqrt(SS_i)     (27.5)

z_i = x_i / (x_max + 1)     (27.6)

SS_i is the sum of squares of x_i, and x_mean and σ_x are the mean and standard deviation of x_i.
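A minimal sketch of the four standardizations as R helper functions (each takes a numeric vector x and returns the transformed values):

z_range <- function(x) (x - min(x)) / (max(x) - min(x))   # equation 27.3
z_score <- function(x) (x - mean(x)) / sd(x)              # equation 27.4
z_ss    <- function(x) x / sqrt(sum(x^2))                 # equation 27.5
z_max1  <- function(x) x / (max(x) + 1)                   # equation 27.6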

Examples of Neural Network Classification

Technique 28

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation & backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(neuralnet)
> data("PimaIndiansDiabetes2", package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp.

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- temp$diabetes
> levels(y) <- c(0, 1)

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations. The variable f is used to store the formula of the model, where diabetes is the dependent or response variable. Be sure to check that n is equal to 724, the number of observations in the full sample:

> set.seed(103)
> n = nrow(temp)

> train <- sample(1:n, 600, FALSE)

> f <- diabetes ~ pregnant + glucose + pressure + mass + pedigree + age

Step 3: Estimate & Evaluate Model

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ], hidden = 4, algorithm = "rprop+")

The print method gives a nice overview of the model:

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1 18.1242328   0.009057962229  8448

A nice feature of the neuralnet package is the ability to visualize a fitted network using the plot method, see Figure 28.1:

> plot(fit, intercept = FALSE, show.weights = FALSE)

PRACTITIONER TIP

It often helps intuition to visualize data. To see the fitted intercepts set intercept = TRUE, and to see the estimated neuron weights set show.weights = TRUE. For example, try entering:

plot(fit, intercept = TRUE, show.weights = TRUE)

[Figure: the fitted network diagram, with input nodes pregnant, glucose, pressure, mass, pedigree and age feeding the output node diabetes. Error: 18.1242328, Steps: 8448]

Figure 28.1: Resilient Backpropagation and Backtracking neural network using PimaIndiansDiabetes2

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5    -1
7    -1
12    1
17    1
20    1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 61  6
        1 20 37

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.21

The misclassification error rate is around 21%.


Technique 29

Resilient Backpropagation

A neural network with resilient backpropagation without backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation without weight backtracking.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ], hidden = 4, algorithm = "rprop-")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226:

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "rprop-")

1 repetition was calculated.

        Error Reached Threshold Steps
1 18.40243459   0.009715675095  6814

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5     1
7    -1
12    1
17    1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 62  4
        1 19 39

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.19

The misclassification error rate is around 19%.


Technique 30

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr")

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "slr" to use a globally convergent algorithm that applies resilient backpropagation without weight backtracking and, additionally, the smallest learning rate.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons:

> fit <- neuralnet(f, data = temp[train, ], hidden = 4, algorithm = "slr")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226; however, the algorithm (due to the small learning rate) takes very many more steps to converge:

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ], hidden = 4, algorithm = "slr")

1 repetition was calculated.

        Error Reached Threshold Steps
1 17.97865898   0.009813138137 96960

Step 4: Make Predictions

We transfer the data into a variable called z and use this with the compute method and the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)
   [,1]
4    -1
5    -1
7    -1
12    1
17    1
20   -1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
        dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 58 10
        1 23 33

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) == sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.27

The misclassification error rate is around 27%.


Technique 31

Probabilistic Neural Network

A probabilistic neural network can be estimated using the package pnn with the learn function:

learn(y, data, ...)

Key parameters include the response variable y and the covariates contained in data.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(pnn)
> data("PimaIndiansDiabetes2", package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp:

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- temp$diabetes

> temp$diabetes <- NULL
> temp <- scale(temp)
> temp <- cbind(as.factor(y), temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations:

> set.seed(103)
> n = nrow(temp)

> n_train <- 600
> n_test <- n - n_train

> train <- sample(1:n, n_train, FALSE)

Step 3: Estimate & Evaluate Model

The model is fitted using the learn method, with the fitted model stored in fit_basic:

> fit_basic <- learn(data.frame(y[train], temp[train, ]))

You can use the attributes method to identify the named components of fit_basic:

> attributes(fit_basic)
$names
[1] "model"           "set"             "category.column" "categories"
[5] "k"               "n"

236

TECHNIQUE 31 PROBABILISTIC NEURAL NETWORK

PRACTITIONER TIP

Remember you can access the contents of a fitted probabilistic neural network by using the $ notation. For example, to see what is in the "model" slot you would type:

> fit_basic$model
[1] "Probabilistic neural network"

The summary method provides details on the model:

> summary(fit_basic)
                Length Class      Mode
model           1      -none-     character
set             8      data.frame list
category.column 1      -none-     numeric
categories      2      -none-     character
k               1      -none-     numeric
n               1      -none-     numeric

Next we use the smooth method to set the smoothing parameter sigma. We use a value of 0.5.

> fit <- smooth(fit_basic, sigma = 0.5)

PRACTITIONER TIP

Much of the time you will not have a pre-specified value in mind for the smoothing parameter sigma. However, you can let the smooth function find the best value using its inbuilt genetic algorithm. To do that you would type something along the lines of:

> smooth(fit_basic)
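Building on this tip, here is a minimal sketch (an illustration, not from the text) of storing the automatically smoothed model and inspecting the sigma value the search selects; note the genetic-algorithm search can take noticeably longer than supplying sigma directly.

# Hedged sketch: let smooth() search for sigma, then inspect the selected value.
# Assumes fit_basic has already been created with learn() as shown above.
fit_auto <- smooth(fit_basic)
fit_auto$sigma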

The performance statistics of the fitted model are assessed using the perf method. For example, enter the following to see various aspects of the fitted model:

> perf(fit)
> fit$observed
> fit$guessed
> fit$success
> fit$fails
> fit$success_rate
> fit$bic

Step 4: Make Predictions

Let's take a look at the testing sample. To see the first row of covariates in the testing set enter:

> round(temp[-train,][1,], 2)
    pregnant glucose pressure  mass pedigree   age
100    -0.85   -1.07    -0.52 -0.63    -0.93 -1.05

You can see the first observation of the response variable in the test sample in a similar way:

> y[-train][1]
[1] neg
Levels: neg pos

Now let's predict the first response value in the test set using the covariates:

> guess(fit, as.matrix(temp[-train,][1,]))$category
[1] "neg"

Take a look at the associated probabilities. In this case there is a 99% probability associated with the neg class:

> guess(fit, as.matrix(temp[-train,][1,]))$probabilities
        neg         pos
0.996915706 0.003084294

Here is how to see both prediction and associated probabilities:

> guess(fit, as.matrix(temp[-train,][1,]))
$category
[1] "neg"

$probabilities
        neg         pos
0.996915706 0.003084294

OK, now we are ready to predict all the response values in the test sample. We can do this with a few lines of R code:

> pred <- 1:n_test
> for (i in 1:n_test) {
+   pred[i] <- guess(fit, as.matrix(temp[-train,][i,]))$category
+ }

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted neg pos
      neg  79   2
      pos   2  41

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == y[-train]) / n_test)
> round(error_rate, 3)
[1] 0.032

The misclassification error rate is around 3%.


Technique 32

Multilayer Feedforward Neural Network

A multilayer feedforward neural network can be estimated using the AMORE package with the train function:

train(net, P, T, error.criterium, ...)

Key parameters include net, the neural network you wish to train; P, the training set attributes; T, the training set response variable (output values); and the error criterion (Least Mean Squares, Least Mean Logarithm Squared or TAO Error) contained in error.criterium.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(AMORE)
> data(PimaIndiansDiabetes2, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp.


> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix and then use the scale method to standardize the matrix temp:

> y <- temp$diabetes
> levels(y) <- c(0, 1)
> y <- as.numeric(as.character(y))
> names(y) <- c("diabetes")
> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations.

> set.seed(103)
> n = nrow(temp)
> train <- sample(1:n, 600, FALSE)

Now we need to create the neural network we wish to train:

> net <- newff(n.neurons = c(1, 3, 2, 1),
        learning.rate.global = 0.01,
        momentum.global = 0.5,
        error.criterium = "LMLS", Stao = NA,
        hidden.layer = "sigmoid",
        output.layer = "purelin",
        method = "ADAPTgdwm")

I'll explain the code above line by line. In the first line we're creating an object called net that will contain the structure of our new neural network. The first argument to newff is n.neurons, which allows us to specify the number of inputs, the number of nodes in each hidden layer, and the number of outputs. So in the example above we have 1 input, 2 hidden layers (the first containing 3 nodes and the second containing 2 nodes) and 1 output.

The learning.rate.global argument constrains how much the algorithm is allowed to change the weights from iteration to iteration as the network is trained. In this case learning.rate.global = 0.01, which means that the algorithm can't increase or decrease any one weight in the network by more than 0.01 from trial to trial.

The error.criterium argument specifies the error mechanism used at each iteration. There are three options here: "LMS" (for least mean squares), "LMLS" (for least mean logarithm squared) and "TAO" (for the Tao error method). In general I tend to choose "LMLS" as my starting point. However, I will often train my networks using all three methods.

The hidden.layer and output.layer arguments are used to choose the type of activation function used to transform the weighted sums of the inputs for each layer in your network. I have set output.layer = "purelin", which results in linear output. Other options include "tansig", "sigmoid", "hardlim" and "custom".

Finally, method specifies the solution strategy for converging on the weights within the network. For building prototype models I tend to use either "ADAPTgd" (adaptive gradient descent) or "ADAPTgdwm" (adaptive gradient descent with momentum).
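As a concrete illustration of switching the error criterion, the following sketch (an assumption for illustration, not code from the text) specifies the same network with the "TAO" criterion; unlike "LMS" and "LMLS", this criterion needs a numeric value for Stao, and the value of 1 used here is purely illustrative.

# Hedged sketch: the same network, but set up to train with the TAO error criterion.
# Stao = 1 is an illustrative assumption; tune it for your own data.
net_tao <- newff(n.neurons = c(1, 3, 2, 1),
        learning.rate.global = 0.01,
        momentum.global = 0.5,
        error.criterium = "TAO", Stao = 1,
        hidden.layer = "sigmoid",
        output.layer = "purelin",
        method = "ADAPTgdwm")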

Step 3: Estimate & Evaluate Model

The model is fitted using the train method. In this example I have set report = TRUE to provide output during the algorithm run and n.shows = 5 to show a total of 5 training reports.

> fit <- train(net, P = temp[train,], T = temp[train, 7],
        error.criterium = "LMLS", report = TRUE,
        show.step = 100, n.shows = 5)

index.show: 1 LMLS 0.337115972269707
index.show: 2 LMLS 0.335651051328758
index.show: 3 LMLS 0.335113569553075
index.show: 4 LMLS 0.334753676125557
index.show: 5 LMLS 0.334462044665089

Step 4: Make Predictions

The sign function is used to convert predictions into negative and positive values. Here I use the fitted network held in fit to predict using the test sample.


> pred <- sign(sim(fit$net, temp[-train,]))

Let's create a confusion matrix so we can see how well the neural network performed on the test sample. Notice that sign(temp[-train, 7]) contains the observed response values for the test sample.

> table(pred, sign(temp[-train, 7]), dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 68 27
        1 13 16

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == sign(temp[-train, 7])) / 124)
> round(error_rate, 3)
[1] 0.323

The misclassification error rate is around 32%.


Examples of Neural Network Regression


Technique 33

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation and backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+", ...)

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(neuralnet)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations. The model formula (response variable and attributes) is stored in the variable f. In addition, the standardized variables are stored in the data frame scale_bodyfat.

> set.seed(465)


> train <- sample(1:71, 45, FALSE)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth +
      kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

> scale_bodyfat <- as.data.frame(scale(bodyfat))

Step 3: Estimate & Evaluate Model

The number of hidden neurons should be determined in relation to the needed complexity. For this example we use one hidden neuron.

> fit <- neuralnet(f, data = scale_bodyfat[train,],
       hidden = 1, algorithm = "rprop+")

The plot method (which dispatches to plot.nn for neuralnet objects) can be used to visualize the network, as shown in Figure 33.1.

> plot(fit)


Figure 33.1: Neural network used for the bodyfat regression (network diagram omitted; Error 1.711438, Steps 2727)

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train,],
    hidden = 1, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1 1.711437513   0.00955254228  2727


A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 3)
                                1
error                       1.711
reached.threshold           0.010
steps                    2727.000
Intercept.to.1layhid1      -0.403
waistcirc.to.1layhid1       0.127
hipcirc.to.1layhid1         0.246
age.to.1layhid1             0.028
elbowbreadth.to.1layhid1   -0.016
kneebreadth.to.1layhid1     0.050
anthro3a.to.1layhid1        0.001
anthro3b.to.1layhid1        0.085
anthro3c.to.1layhid1       -0.005
anthro4.to.1layhid1         0.131
Intercept.to.DEXfat        -2.971
1layhid1.to.DEXfat          7.390

Step 4: Make Predictions

We predict using the scaled values. First we remove the response variable DEXfat, then store the predictions in pred via the compute method.

> without_fat <- scale_bodyfat
> without_fat$DEXfat <- NULL
> pred <- compute(fit, without_fat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is then used to visualize the relationship between the predicted and observed values, see Figure 33.2. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.864.

> linReg <- lm(pred$net.result ~ scale_bodyfat$DEXfat[-train])

> plot(scale_bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, scale_bodyfat$DEXfat[-train])^2, 6)
         [,1]
[1,] 0.864411


Figure 33.2: Resilient Backpropagation with Backtracking observed and predicted values using bodyfat


Technique 34

Resilient Backpropagation

A neural network with resilient backpropagation (without backtracking) can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-", ...)

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation without backtracking.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 245.

We fit a network with one hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train,],
       hidden = 1, algorithm = "rprop-")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train,],
    hidden = 1, algorithm = "rprop-")

1 repetition was calculated.

        Error Reached Threshold Steps
1 1.716273738  0.009965580594  3651


A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 2)
                              1
error                      1.72
reached.threshold          0.01
steps                   3651.00
Intercept.to.1layhid1     -0.37
waistcirc.to.1layhid1      0.13
hipcirc.to.1layhid1        0.26
age.to.1layhid1            0.03
elbowbreadth.to.1layhid1  -0.02
kneebreadth.to.1layhid1    0.05
anthro3a.to.1layhid1       0.00
anthro3b.to.1layhid1       0.09
anthro3c.to.1layhid1       0.00
anthro4.to.1layhid1        0.14
Intercept.to.DEXfat       -2.86
1layhid1.to.DEXfat         6.97

Step 4: Make Predictions

We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg.

The plot method is used to visualize the relationship between the predicted and observed values, see Figure 34.1. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.865

Figure 34.1: Resilient Backpropagation without Backtracking observed and predicted values using bodyfat


Technique 35

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr", ...)

Key parameters include the response variable y; covariates data; hidden, the number of hidden neurons; and algorithm = "slr", which specifies the use of a globally convergent algorithm with resilient backpropagation and the smallest learning rate.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 245.

The network is built with 1 hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train,],
       hidden = 1, algorithm = "slr")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train,],
    hidden = 1, algorithm = "slr")

1 repetition was calculated.

        Error Reached Threshold Steps
1 1.714864115  0.009544470987 10272

Here is a summary of the fitted model weights:


> round(fit$result.matrix, 2)
                              1
error                      1.71
reached.threshold          0.01
steps                  10272.00
Intercept.to.1layhid1     -0.38
waistcirc.to.1layhid1      0.13
hipcirc.to.1layhid1        0.26
age.to.1layhid1            0.03
elbowbreadth.to.1layhid1  -0.02
kneebreadth.to.1layhid1    0.05
anthro3a.to.1layhid1       0.00
anthro3b.to.1layhid1       0.09
anthro3c.to.1layhid1       0.00
anthro4.to.1layhid1        0.13
Intercept.to.DEXfat       -2.88
1layhid1.to.DEXfat         7.08

Step 4: Make Predictions

We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 35.1. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865.

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
       xlab = "DEXfat", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.865

Figure 35.1: Resilient Backpropagation with smallest learning rate, observed and predicted values using bodyfat


Technique 36

General Regression Neural Network

A General Regression Neural Network can be estimated using the package grnn with the learn function:

learn(y, x)

Key parameters include the response variable y and the explanatory variables contained in x.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> require(grnn)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> n <- nrow(bodyfat)
> n_train <- 45
> n_test <- n - n_train


> train <- sample(1:n, n_train, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL

Step 3: Estimate & Evaluate Model

We use the learn method to fit the model and the smooth method to smooth the general regression neural network. We set sigma to 2.5 in the smooth function and store the resultant model in fit.

> fit_basic <- learn(data.frame(y[train], x[train,]))
> fit <- smooth(fit_basic, sigma = 2.5)

Step 4: Make Predictions

Let's take a look at the first row of the covariates in the testing set:

> round(x[-train,][1,], 2)
    age waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b
47 4.04      4.61    4.72         1.96        2.24     1.49      1.6
   anthro3c anthro4
47      1.5    1.81

To see the first observation of the response variable in the test sample you can use a similar technique:

> y[-train][1]
[1] 3.730021

Now we are ready to predict the first response value in the test set using the covariates. Here is how to do that:

> guess(fit, as.matrix(x[-train,][1,]))
[1] 3.374901


Of course, we will want to predict using all the values in the test sample. We can achieve this using the guess method and the following lines of R code:

> pred <- 1:n_test
> for (i in 1:n_test) {
+   pred[i] <- guess(fit, as.matrix(x[-train,][i,]))
+ }

Finally, we plot in Figure 36.1 the observed and fitted values using the plot method, and on the same chart plot the regression line of the predicted versus observed values. The overall squared correlation is 0.915.

> plot(y[-train], pred,
       xlab = "log(DEXfat)", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg <- lm(pred ~ y[-train]), col = "darkred")

> round(cor(pred, y[-train])^2, 3)
[1] 0.915


Figure 36.1: General Regression Neural Network observed and predicted values using bodyfat


Technique 37

Monotone Multi-Layer Perceptron

On occasion you will find yourself modeling variables for which you require monotonically increasing behavior of the response variable with respect to specified attributes. It is possible to train and make predictions from a multi-layer perceptron neural network with partial monotonicity constraints. The monmlp package implements the monotone multi-layer perceptron neural network using the approach of Zhang and Zhang71. The model can be estimated using the monmlp.fit function:

monmlp.fit(x, y, hidden1, hidden2, monotone, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of neurons in the first and second hidden layers (hidden1 and hidden2); and the monotone constraints on the covariates.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(monmlp)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.


> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL
> x <- as.matrix(x)

Step 3: Estimate & Evaluate Model

We build a 1 hidden layer network with 1 node, with monotone constraints on all 9 attributes contained in x (we set monotone = 1:9).

> fit <- monmlp.fit(x = x[train,], y = as.matrix(y[train]),
       hidden1 = 1, hidden2 = 0, monotone = 1:9)

NOTE

The method monmlp.fit takes a number of additional arguments. Three that you will frequently want to set manually include:

• n.trials - the number of repeated trials used to avoid local minima

• n.ensemble - the number of ensemble members to fit

• bag - a logical variable indicating whether or not bootstrap aggregation (bagging) should be used

For example, in the model we are building we could have specified:

fit <- monmlp.fit(x = x[train,], y = as.matrix(y[train]),
       hidden1 = 1, hidden2 = 0, monotone = 1:9,
       n.trials = 100, n.ensemble = 50, bag = TRUE)


We take a look at the fitted values using the plot method, see Figure 37.1.

> plot(attributes(fit)$y, attributes(fit)$y.pred,
       ylab = "Fitted Values", xlab = "Observed Values")

Since the fit looks particularly good, we had better calculate the correlation coefficient. At 0.97 it is pretty high.

> cor(attributes(fit)$y, attributes(fit)$y.pred)
         [,1]
[1,] 0.973589

Figure 37.1: Monotone Multi-Layer Perceptron training observed and predicted values using bodyfat


Step 4: Make Predictions

We predict using the scaled variables and the test sample with the monmlp.predict method:

> pred <- monmlp.predict(x = x[-train,], weights = fit)

Next, a linear regression model is used to fit the predicted and observed values for the test sample:

> linReg <- lm(pred ~ y[-train])

Now plot the result (see Figure 37.2) and calculate the squared correlation coefficient:

> plot(y[-train], pred, xlab = "DEXfat",
       ylab = "Predicted Values", main = "Test Sample Model Fit")
> abline(linReg, col = "darkred")

> round(cor(pred, y[-train])^2, 3)
      [,1]
[1,] 0.958


Figure 37.2: Monotone Multi-Layer Perceptron observed and predicted values using bodyfat


Technique 38

Quantile Regression Neural Network

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when lower and upper, or all, quantiles are of interest. The qrnn package, combined with the qrnn.fit function, can be used to estimate a quantile regression neural network:

qrnn.fit(x, y, n.hidden, tau, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of hidden neurons n.hidden; and the quantiles to be fitted, held in tau.

Step 1: Load Required Packages

The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> data(bodyfat, package = "TH.data")
> library(qrnn)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.


> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x.

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL
> x <- as.matrix(x)

We will estimate the model using the 5th, 50th and 95th quantiles:

> probs <- c(0.05, 0.50, 0.95)

Step 3: Estimate & Evaluate Model

We estimate the model using the following:

> fit <- pred <- list()

> for(i in seq_along(probs)) {
+   fit[[i]] <- qrnn.fit(x = x[train,],
+                        y = as.matrix(y[train]),
+                        n.hidden = 4,
+                        tau = probs[i],
+                        iter.max = 1000,
+                        n.trials = 1,
+                        n.ensemble = 20)
+ }

Let's go through it line by line. The first line sets up as lists fit, which will contain the fitted models, and pred, which will contain the predictions. Because we have three quantiles held in probs to estimate, we use a for loop as the core work engine. The function qrnn.fit is used to fit the model for each quantile.

A few other things are worth pointing out here. First, we build the model with four hidden neurons (n.hidden = 4). Second, we set the maximum number of iterations of the optimization algorithm at 1000. Third, we set n.trials = 1; this parameter controls the number of repeated trials used to avoid local minima. We set it to 1 for illustration. In practice you should set


this to a much higher value. The same is true for the parameter n.ensemble, which we set to 20. It controls the number of ensemble members to fit. You will generally want to set this to a high value also.
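For a more realistic run, the hedged sketch below (not from the text) refits just the median (tau = 0.50) model with a larger number of repeated trials; n.trials = 5 is an arbitrary illustrative choice and the call will take correspondingly longer.

# Hedged sketch: a single-quantile fit with more repeated trials to reduce the
# chance of landing in a poor local minimum. n.trials = 5 is illustrative only.
fit_median <- qrnn.fit(x = x[train,], y = as.matrix(y[train]),
                       n.hidden = 4, tau = 0.50,
                       iter.max = 1000, n.trials = 5, n.ensemble = 20)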

Step 4: Make Predictions

For each quantile the qrnn.predict method is used alongside the fitted model. Again we choose to use a for loop to capture each quantile's predicted values:

> for(i in seq_along(probs)) {
+   pred[[i]] <- qrnn.predict(x[-train,], fit[[i]])
+ }

Next we create three variables to hold the predicted values. Note there are 26 test sample values in total.

> pred_high <- pred_low <- pred_mean <- 1:26

Now we are ready to store the predictions. Since we use 20 ensemble members for each quantile, we store the average prediction using the mean method:

> for (i in 1:26) {
+   pred_low[i]  <- mean(pred[[1]][i,])
+   pred_mean[i] <- mean(pred[[2]][i,])
+   pred_high[i] <- mean(pred[[3]][i,])
+ }

Now we fit a linear regression model using the 50th percentile values held in pred_mean and the actual observed values. The result is visualized using the plot method and shown in Figure 38.1. The squared correlation coefficient is then calculated. It is fairly high at 0.885.

> linReg2 <- lm(pred_mean ~ y[-train])

> plot(y[-train], pred_mean,
       xlab = "log(DEXfat)", ylab = "Predicted Values",
       main = "Test Sample Model Fit")

> abline(linReg2, col = "darkred")

> round(cor(pred_mean, y[-train])^2, 3)
[1] 0.885


Figure 38.1: Quantile Regression Neural Network observed and predicted values (pred_mean) using bodyfat

It is often useful to visualize the results of a model. In Figure 38.2 we use scatterplot3d to plot the relationship between pred_low, pred_high and the observed values in the test sample.

> scatterplot3d(pred_low, y[-train], pred_high,
       xlab = "quantile = 0.05",
       ylab = "Test Sample",
       zlab = "quantile = 0.95")


Figure 38.2: Quantile Regression Neural Network with scatterplot3d of 5th and 95th predicted quantiles



Notes

62. Tayfur, Gokmen. "Artificial neural networks for sheet sediment transport." Hydrological Sciences Journal 47.6 (2002): 879-892.

63. Kilinc, Mustafa. "Mechanics of soil erosion from overland flow generated by simulated rainfall." Colorado State University Hydrology Papers (1973).

64. Mantri, Jibendu Kumar, P. Gahan, and Braja B. Nayak. "Artificial neural networks - An application to stock market volatility." Soft-Computing in Capital Market: Research and Methods of Computational Finance for Measuring Risk of Financial Instruments (2014): 179.

65. For further details see H.R. Champion et al. "Improved Predictions from a Severity Characterization of Trauma (ASCOT) over Trauma and Injury Severity Score (TRISS): Results of an Independent Evaluation." Journal of Trauma: Injury, Infection and Critical Care 40 (1), 1996.

66. Hunter, Andrew, et al. "Application of neural networks and sensitivity analysis to improved prediction of trauma survival." Computer Methods and Programs in Biomedicine 62.1 (2000): 11-19.

67. Lek, Sovan, et al. "Application of neural networks to modelling nonlinear relationships in ecology." Ecological Modelling 90.1 (1996): 39-52.

68. See for example Jun, J.J., Longtin, A. and Maler, L. (2011). "Precision measurement of electric organ discharge timing from freely moving weakly electric fish." J. Neurophysiol. 107, pp. 1996-2007.

69. Kiar, Greg, et al. "Electrical localization of weakly electric fish using neural networks." Journal of Physics: Conference Series. Vol. 434. No. 1. IOP Publishing, 2013.

70. Wu, Naicheng, et al. "Modeling daily chlorophyll a dynamics in a German lowland river using artificial neural networks and multiple linear regression approaches." Limnology 15.1 (2014): 47-56.

71. See Zhang, H. and Zhang, Z. 1999. "Feedforward networks with monotone constraints." In: International Joint Conference on Neural Networks, vol. 3, pp. 1820-1823. doi:10.1109/IJCNN.1999.832655


Part V

Random Forests


The Basic Idea

Suppose you have a sample of size N with M features. Random forests (RF) build multiple decision trees, with each grown from a different set of training data. For each tree, K training samples are randomly selected with replacement from the original training data set.

In addition to constructing each tree using a different random sample (bootstrap) of the data, the RF algorithm differs from a traditional decision tree because at each decision node the best splitting feature is determined from a randomly selected subspace of m features, where m is much smaller than the total number of features M. Traditional decision trees split each node using the best split among all features.

Each tree in the forest is grown to the largest extent possible without pruning. To classify a new object, each tree in the forest gives a classification, which is interpreted as the tree 'voting' for that class. The final prediction is determined by majority vote among the classes decided by the forest of trees.
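To make the notation concrete, the short sketch below (an illustration, not taken from the text) maps N, M, K and m onto the arguments of the randomForest function used later in this part: sampsize plays the role of K and mtry the role of m.

# Hedged sketch: the notation above expressed via randomForest arguments.
# N = number of rows, M = number of features, K = sampsize, m = mtry.
library(randomForest)
library(mlbench)
data(Vehicle)

N <- nrow(Vehicle)            # sample size
M <- ncol(Vehicle) - 1        # number of features (excluding the Class column)

fit <- randomForest(Class ~ ., data = Vehicle,
                    ntree = 500,             # trees in the forest
                    sampsize = N,            # K: bootstrap sample size per tree
                    replace = TRUE,          # sample with replacement
                    mtry = floor(sqrt(M)))   # m: features tried at each split
print(fit)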

NOTE

Although each tree in a RF tends to be less accurate than a classical decision tree, by combining multiple predictions into one aggregate prediction a more accurate forecast is often obtained. Part of the reason is that the prediction of a single decision tree tends to be highly sensitive to noise in its training set. This is not the case for the average of many trees, provided they are uncorrelated. For this reason the RF often decreases overall variance without increasing bias relative to a single decision tree.


Random Ferns

Random ferns are an ensemble of constrained trees originally used in image processing for identifying textured patches surrounding key-points of interest across images taken from different viewpoints. This machine learning method gained widespread attention in the image processing community because it is extremely fast, easy to implement, and appeared to outperform random forests in image processing classification tasks provided the training set was sufficiently large72.

While a tree applies a different decision function at each node, a fern systematically applies the same decision function at each node of the same level. Each fern consists of a small set of binary tests and returns the probability that an object belongs to any one of the classes that have been learned during training. A naive Bayesian approach is used to combine the collection of probabilities. Ferns differ from decision trees in several ways (a small illustrative sketch follows the list below):

• In decision trees the binary tests are organized hierarchically (hence the term tree); ferns are flat. Each fern consists of a small set of binary tests and returns the probability that an observation belongs to any one of the classes that have been learned during training.

• Ferns are more compact than trees. In fact, 2^N - 1 binary tests are needed to grow a tree with 2^N leaves; only N tests are needed to grow a fern with 2^N leaves.

• For decision trees the posterior distributions are computed additively; in ferns they are multiplicative.
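The following sketch (an illustration on simulated data, not from the text) shows a single fern of depth 3: three binary tests place each observation into one of 2^3 = 8 leaves, class posteriors come from smoothed leaf frequencies, and with several ferns the per-fern likelihoods would be multiplied together in naive-Bayes fashion.

# Hedged sketch: one fern of depth 3 on simulated data.
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)                     # three features
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "a", "b"))        # two classes

tests <- x > 0                                            # the fern's three binary tests
leaf  <- factor(1 + tests %*% c(1, 2, 4), levels = 1:8)   # leaf index, 1..8

tab <- table(leaf, y) + 1                                 # leaf/class counts, add-one smoothed
p_leaf_given_class <- sweep(tab, 2, colSums(tab), "/")    # P(leaf | class)

# Posterior for a new observation: class prior times this fern's leaf likelihood.
new_x     <- c(0.5, -0.2, 0.1)
new_leaf  <- as.character(1 + sum((new_x > 0) * c(1, 2, 4)))
posterior <- prop.table(table(y)) * p_leaf_given_class[new_leaf, ]
posterior / sum(posterior)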


Practical Applications

Tropical Forest Carbon Mapping

Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus73). Parametric statistics have traditionally dominated the discipline74. Remote sensing technologies such as Light Detection and Ranging (LiDAR) have been used successfully to estimate spatial variation in carbon stocks75. However, due to cost and technological limitations, it is not yet feasible to use it to map the entire world's tropical forests.

Mascaro et al76 evaluated the performance of the Random Forest algorithm in up-scaling airborne LiDAR-based carbon estimates, compared to the traditionally accepted stratification-based sampling approach, over a 16-million hectare focal area of the Western Amazon.

Two separate Random Forest models were investigated, each using an identical set of 80,000 randomly selected input pixels. The first Random Forest model used the same set of input variables as the stratification approach. The second Random Forest used additional "position" parameters: x and y coordinates combined with two diagonal coordinates, see Table 19.

The resultant maps of ecosystem carbon stock are shown in Figure 38.3. Notice that several regions exhibit pronounced differences when using the Random Forest with the additional four 'position' parameters (image b).

The stratification method had a root mean square error (RMSE) of 33.2 Mg C ha-1 and an adjusted R2 of 0.37 (predicted versus observed). The first Random Forest model had an RMSE of 31.6 Mg C ha-1 and an adjusted R2 of 0.43. The Random Forest model with the additional four 'position' parameters had an RMSE of 26.7 Mg C ha-1 and an adjusted R2 of 0.59. The researchers conclude there is an improvement when using Random Forest with position information, which is consistent at all distances (see Figure 38.4).


Variable    Explanation                                 S   RF1  RF2
easting     UTM X coordinate                            -   -    x
northing    UTM Y coordinate                            -   -    x
diagx       X coordinate after 45 degree                -   -    x
            clockwise image rotation
diagy       Y coordinate after 45 degree                -   -    x
            clockwise image rotation
frac_soil   Percent cover of soil as determined by      -   -    x
            Landsat image processing with CLASlite
frac_pv     Percent cover of photosynthetic vegetation  x   x    x
            as determined by Landsat image processing
            with CLASlite
frac_npv    Percent cover of non-photosynthetic         x   x    x
            vegetation as determined by Landsat image
            processing with CLASlite
elevation   SRTM elevation above sea level              x   x    x
slope       SRTM slope                                  x   x    x
aspect      SRTM aspect                                 x   x    x
geoeco      Habitat class as determined by synthetic    x   x    x
            integration of national geological map,
            NatureServe and other sources

Table 19: Input variables used by Mascaro et al. S = Stratification-based sampling, RF = Random Forest.


Figure 38.3: Mascaro et al's predicted carbon stocks using three different methodologies: (a) Stratification and mapping of median carbon stocks in each class; (b) Random Forest without the inclusion of position information; (c) Random Forest using additional model inputs for position. Source of figure: Mascaro et al.


Figure 38.4: Mascaro et al's performance of three modeling techniques as assessed in 36 validation cells. Left panels highlight model performance against LiDAR-observed above-ground carbon density from CAO aircraft data (Mg C ha-1), while right panels highlight the model performance by increasing distance from CAO aircraft data. The color scale reflects the two-dimensional density of observations, adjusted to one dimension using a square root transformation. Source of figure: Mascaro et al.


Opioid Dependency & Sleep Disordered Breathing

Patients using chronic opioids are at elevated risk for potentially lethal disorders of breathing during sleep. Farney et al77 investigate sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone.

A total of 70 patients admitted for therapy with buprenorphine/naloxone were recruited into the study. Indices were created for apnoea/hypopnoea (AHI), obstructive apnoea (OAI), central apnoea (CAI) and hypopnoea (HI).

Each index was computed as the total of defined respiratory events divided by the total sleep time in hours, scored simultaneously by two onlooking researchers. Oximetry data were analyzed to calculate mean SpO2, lowest SpO2, and time spent below 90% SpO2 during sleep.

For each of these response metrics a Random Forest model was built using the following attributes: buprenorphine dose, snoring, tiredness, witnessed apnoeas, hypertension, body mass index, age, neck circumference, gender, use of benzodiazepines, antidepressants, antipsychotics, and smoking history. The researchers find that the random forests showed little relationship between the explanatory variables and the response variables.

Waste-water Deterioration Modeling

Vitorino et al78 build a Random Forest model to predict the condition of individual sewers and to determine the expected length of sewers in poor condition. The data consisted of two sources: a sewer data table and an inspection data table.

The sewer table contained information on the sewer identification code, zone, construction material, diameter, installation date and a selection of user-defined covariates. The inspection data table contained information on the sewer identification code, date of last inspection and condition of the sewer at the date of last inspection.

The Random Forest was trained using all available data and limited to 50 trees. It was then used to predict the condition of individual sewer pipes. The researchers predict the distribution of sewer pipe in poor condition by type of material used for construction. The top three highest-ranked materials were CIP (31.83%), followed by unknown material (27.94%) and RPM (23.89%).


Optical Coherence Tomography & Glaucoma

Glaucoma is the second most common cause of blindness. As glaucomatous visual field (VF) damage is irreversible, the early diagnosis of glaucoma is essential. Sugimoto et al79 develop a random forest classifier to predict the presence of VF deterioration in glaucoma suspects using optical coherence tomography (OCT) data.

The study investigated 293 eyes of 179 live patients referred to the University of Tokyo Hospital for glaucoma between August 2010 and July 2012. The Random Forest algorithm with 10,000 trees was used to classify the presence or absence of glaucomatous VF damage using 237 different OCT measurements. Age, gender, axial length and eight other right/left eye metrics were also included as training attributes.

The researchers report a receiver operating characteristic curve value of 0.9 for the random forest. This compared well to the value of 0.75 for an individual tree and between 0.77 and 0.86 for individual right/left eye metrics.

Obesity Risk Factors

Kanerva et al80 investigate 4,720 Finnish subjects who completed health questionnaires about leisure time physical activity, smoking status and educational attainment. Weight and height were measured by experienced nurses. A random forest algorithm, using the randomForest package in R and the collected indicator variables, was used to predict overweight or obesity. The results were compared with a logistic regression model.

The researchers observe that the random forest and logistic regression had very similar classification power; for example, the estimated error rates for the models were almost equal: 42% for men under the random forest versus 43% for men under logistic regression. The researchers tentatively conclude: "Machine learning techniques may provide a solution to investigate the network between exposures and eventually develop decision rules for obesity risk estimation in clinical work."

Protein Interactions

Chen et al81 develop a novel random forest model to predict protein-protein interactions. They choose a traditional classification framework with two classes: either a protein pair interacts with each other or it does not. Each protein pair is characterized by a very large attribute space, in total 4,293 unique attributes.


The training and test set each contained 8,917 samples: 4,917 positive and 4,000 negative samples. Five-fold cross validation is used and the maximum size of a tree is 450 levels. The researchers report an overall sensitivity of 79.78% and specificity of 64.38%; this compares well to an alternative maximum likelihood technique, which had a sensitivity of 74.03% and specificity of 37.53%.


Technique 39

Classification Random Forest

A classification random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance = TRUE, proximity = TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow; importance, a logical variable which if set to TRUE assesses the importance of predictors; proximity, another logical variable which if set to TRUE calculates a proximity measure among the rows; z, the response variable of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package.

> library(randomForest)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown. We set it to 1000. A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)


> N = nrow(Vehicle)
> numtrees = 1000
> train <- sample(1:N, 500, FALSE)

The data frame attributes is used to hold the complete set of attributes contained in Vehicle:

> attributes <- Vehicle[train,]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set both importance and proximity equal to TRUE.

> fit <- randomForest(Class ~ ., data = Vehicle[train,],
       ntree = numtrees, importance = TRUE, proximity = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[train,],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 27.2%
Confusion matrix:
     bus opel saab van class.error
bus  121    1    1   1  0.02419355
opel   1   60   60   6  0.52755906
saab   5   49   66   9  0.48837209
van    1    0    2 117  0.02500000

Important information includes the formula used to fit the model, the number of trees grown, the number of variables tried at each split, the confusion matrix, class errors and the out of bag error estimate.


The random forest appears to identify both van and bus with a low error (less than 3%), but saab (49%) appears to be essentially a random guess, whilst opel (53%) is marginally better than random.

Step 4: Assess Model

A visualization of the error rate by iteration for each of bus, opel, saab and van, together with the overall out of bag error, is obtained using plot, see Figure 39.1.

> plot(fit)

Figure 39.1: Error estimate across iterations for each class and overall using Vehicle


The misclassification errors on van and bus, as well as the out of bag error, settle down to a stable range within 200 trees. For opel stability occurs around 400 trees, whilst for saab it takes around 800 trees.

The rfcv function is used to perform a 10-fold cross validation (cv.fold = 10) with variable importance re-assessed at each step reduction (recursive = TRUE). The result is stored in cv.

> cv <- rfcv(trainx = attributes, trainy = Vehicle$Class[train],
       ntree = numtrees, cv.fold = 10, recursive = TRUE)

PRACTITIONER TIP

The function rfcv calculates cross-validated prediction performance with a sequentially reduced number of attributes, ranked by variable importance, using a nested cross-validation procedure. The minimum value of the cross-validation error can be used to select the optimum number of attributes (also known as predictors) in a random forest model.
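Following the tip above, here is a one-line sketch (not from the text) of pulling out that optimum from the n.var and error.cv components returned by rfcv:

# Hedged sketch: number of predictors at which the cross-validated error is smallest.
with(cv, n.var[which.min(error.cv)])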

A plot of cross-validated error versus number of predictors can be created as follows:

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
       xlab = "number of predictors",
       ylab = "cross-validated error (%)"))

Figure 39.2 shows the resultant plot. It indicates a steady decline in the cross-validated error for the first nine attributes, after which it levels off.


Figure 39.2: Cross-validated prediction performance by number of attributes for Vehicle

Variable importance, shown in Figure 39.3, is obtained using the function varImpPlot:

> varImpPlot(fit)


Figure 39.3: Random Forest variable importance plot for Vehicle

Max.L.Ra and D.Circ seem to be important whether measured by average decrease in accuracy or reduction in node impurity (Gini).

Partial dependence plots provide a way to visualize the marginal relationship between the response variable and the covariates (attributes). Let's use this idea, combined with the importance function, to investigate the top four attributes.

> imp <- importance(fit)

> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]


> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) {
+   partialPlot(fit, Vehicle[train,], impvar[i], xlab = impvar[i],
+               main = paste("Partial Dependence on", impvar[i]))
+ }

> par(op)

Figure 39.4 shows the partial dependence plots for the four most important attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra.

PRACTITIONER TIP

The importance function takes the type argument. Set type = 1 if you only want to see importance by average decrease in accuracy, and set type = 2 to see only importance by average decrease in node impurity, measured by the Gini coefficient.
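As a quick illustration of the tip (not from the text), the call below restricts the report to the accuracy-based measure only:

# Hedged sketch: importance by mean decrease in accuracy only (type = 1).
head(importance(fit, type = 1))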


Figure 39.4: Random forest partial dependence plots for the four most important attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra

It is often helpful to look at the margin. This is measured as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. A positive margin implies correct classification, and a negative margin incorrect classification.

> plot(margin(fit))

The margin for fit is shown in Figure 39.5.


Figure 39.5: Random forest margin plot for Vehicle

Step 5: Make Predictions

Next we use the fitted model on the test sample:

> fit <- randomForest(Class ~ ., data = Vehicle[-train,],
       ntree = numtrees, importance = TRUE, proximity = TRUE)

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train,],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 26.59%
Confusion matrix:
     bus opel saab van class.error
bus   92    0    1   1  0.02127660
opel   4   35   37   9  0.58823529
saab   5   25   50   8  0.43181818
van    1    1    0  77  0.02531646

We notice a deterioration in the classification error for opel over the test sample, although saab has improved somewhat. The misclassification rates for van and bus are in line with those observed in the training sample. The overall out of bag estimate is around 27%.

An important aspect of building random forest models is tuning the model. We can attempt to do this by finding the optimal value of mtry and then fitting that model on the test data. The parameter mtry controls the number of variables randomly sampled as candidates at each split. The tuneRF function returns the optimal value of mtry. The smallest out of bag error occurs at mtry = 3; this is confirmed in Figure 39.6.

> best_mytry <- tuneRF(attributes, Vehicle$Class[train],
       ntreeTry = numtrees, stepFactor = 1.5, improve = 0.01,
       trace = TRUE, plot = TRUE, doBest = FALSE)

mtry = 4  OOB error = 27%
Searching left ...
mtry = 3  OOB error = 26.6%
0.01481481 0.01
mtry = 2  OOB error = 26.8%
-0.007518797 0.01
Searching right ...
mtry = 6  OOB error = 26.6%


Figure 39.6: Random forest OOB error by mtry for Vehicle

Finally we fit a random forest model on the test data using mtry = 3:

> fitbest <- randomForest(Class ~ ., data = Vehicle[-train,],
       ntree = numtrees, importance = TRUE, proximity = TRUE,
       mtry = 3)

> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train,],
     ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 3)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of error rate: 25.14%
Confusion matrix:
     bus opel saab van class.error
bus   93    0    0   1  0.01063830
opel   3   39   36   7  0.54117647
saab   4   26   50   8  0.43181818
van    1    1    0  77  0.02531646

The model fitbest has a slightly lower out of bag error at 25.14%. However, relative to the model fitted on the training sample, the accuracy for opel has declined somewhat.


Technique 40

Conditional Inference Classification Random Forest

A conditional inference classification random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which sets parameters such as the number of trees grown; z, the response variable of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package. We also load the caret package, as it provides handy performance metrics for random forests.

> library(party)
> library(caret)
> library(mlbench)
> data(Vehicle)


PRACTITIONER TIP

If you have had your R session open for a lengthy period of time you may have objects hogging memory. To see what objects are being held in memory type:

ls()

To remove all objects from memory use:

rm(list = ls())

To remove a specific object from memory use:

rm(object_name)

Step 2: Prepare Data & Tweak Parameters

A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Random Forest

We estimate the random forest using the training sample and the function cforest. The number of trees is set to 1000 (ntree = 1000) with mtry set to 5.

> fit <- cforest(Class ~ ., data = Vehicle[train,],
       controls = cforest_unbiased(ntree = 1000, mtry = 5))

The accuracy of the model and the kappa statistic can be retrieved using the caret package:

> caret:::cforestStats(fit)
Accuracy    Kappa
 0.86600  0.82146

Let's do a little memory management and remove fit:


> rm(fit)

Let's re-estimate the random forest using the default settings in cforest:

> fit <- cforest(Class ~ ., data = Vehicle[train,])
> caret:::cforestStats(fit)
 Accuracy     Kappa
0.8700000 0.8267831

Since both the accuracy and kappa are slightly higher, we will use this as the fitted model.

PRACTITIONER TIP

Remember that all simulation based models, including the random forest, are subject to random variation. This variation can be important when investigating variable importance. It is advisable, before interpreting a specific importance ranking, to check whether the same ranking is achieved with a different model run (random seed) or a different number of trees.

Variable importance is calculated using the function varimp. First we estimate variable importance using the default approach, which calculates the mean decrease in accuracy using the method outlined in Hapfelmeier et al82.

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:18])]

> round(imp1[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.094    0.069    0.057

> rm(ord1, imp1)

Next we try the unconditional approach using the 'mean decrease in accuracy' outlined by Breiman83. Interestingly, the attributes Elong, Max.L.Ra and Scat.Ra are rank ordered exactly the same by both methods, and they have very similar scores.

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:18])]

> round(imp2[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.100    0.083    0.057

> rm(ord2, imp2)

PRACTITIONER TIP

In addition to the two unconditional measures of variable importance already discussed, a conditional version is also available. It is conditional in the sense that it adjusts for correlations between predictor variables. For the random forest estimated by fit, conditional variable importance can be obtained by:

> varimp(fit, conditional = TRUE)

Step 5: Make Predictions

First we estimate accuracy and kappa using the fitted model and the test sample:

> fit_test <- cforest(Class ~ ., data = Vehicle[-train,])
> caret:::cforestStats(fit_test)
 Accuracy     Kappa
0.8179191 0.7567866

Both accuracy and kappa values are close to the values estimated on the training sample. This is encouraging in terms of model fit.

The predict function calculates predicted values over the testing sample. These values, along with the original observations, are used to construct the test sample confusion matrix and the overall misclassification error. We observe an error rate of 28.6%.

> pred <- predict(fit, newdata = Vehicle[-train,], type = "response")

> table(Vehicle$Class[-train], pred,
       dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   91    0    1   2
          opel   8   30   33  14
          saab   6   23   51   8
          van    2    0    2  75

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.286


Technique 41

Classification Random Ferns

A classification random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, ...)

Key parameters include ferns, which controls the number of ferns grown; z, the response variable of classes; and data, the data set of attributes with which you wish to build the model.

Step 1: Load Required Packages

We construct a classification random ferns model using the Vehicle data frame (see page 23) contained in the mlbench package.

> library(rFerns)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Step 3: Estimate and Assess the Model

PRACTITIONER TIP

To print the out of bag error as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns the error every 50th iteration. Your estimated model would look something like this:

> fit <- rFerns(Class ~ ., data = Vehicle[train,],
       ferns = 300, reportErrorEvery = 50)

Done fern 50/300; current OOB error 0.392
Done fern 100/300; current OOB error 0.378
Done fern 150/300; current OOB error 0.350
Done fern 200/300; current OOB error 0.360
Done fern 250/300; current OOB error 0.354
Done fern 300/300; current OOB error 0.348

The number of ferns is set to 5000, with the parameter saveErrorPropagation = TRUE to save the out of bag errors at each iteration.

> fit <- rFerns(Class ~ ., data = Vehicle[train,],
       ferns = 5000, saveErrorPropagation = TRUE)

The details of the fitted model can be viewed using the print method. The method returns the number of ferns in the forest, the fern depth, the out of bag error estimate and the confusion matrix.

> print(fit)


Forest of 5000 ferns of a depth 5

OOB error 33.20%; OOB confusion matrix:
         True
Predicted bus opel saab van
     bus  109    2    7   0
     opel   7   65   44   0
     saab   0   32   40   0
     van    8   28   38 120

It is often useful to view the out of bag error by the number of ferns; this can be achieved using the plot method and modelname$oobErr:

> plot(fit$oobErr, xlab = "Number of Ferns",
       ylab = "OOB Error", type = "l")

Figure 41.1 shows the resultant chart. Looking closely at the figure, it appears the error is minimized at approximately 30%, somewhere between 2000 and 3000 ferns. To get the exact number use the which.min method:

> which.min(fit$oobErr)
[1] 2487
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.324

So the error reaches a minimum of 32.4% at 2487 ferns.


Figure 41.1: Classification Random Ferns out of bag error estimate by iteration for Vehicle

PRACTITIONER TIP

To determine the time taken to run the model you would use $timeTaken as follows:

> fit$timeTaken
Time difference of 0.314466 secs

Other useful components of the rFerns object are given in Table 20.

Parameter     Description
$model        The estimated model
$oobErr       Out of bag error
$importance   Importance scores
$oobScores    Class scores for each object
$oobPreds     Class predictions for each object
$timeTaken    Time used to train the model

Table 20: Some key components of an rFerns object
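As a brief illustration (not from the text), a few of the components in Table 20 can be inspected directly on the fitted object:

# Hedged sketch: inspecting some rFerns components listed in Table 20.
fit$timeTaken            # time used to train the model
head(fit$oobPreds)       # out-of-bag class predictions
tail(fit$oobErr, 1)      # final out-of-bag error estimate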

Let's re-estimate the model with the number of ferns equal to 2487:

> fit <- rFerns(Class ~ ., data = Vehicle[train,],
       ferns = 2487, importance = TRUE, saveForest = TRUE)

> print(fit)

Forest of 2487 ferns of a depth 5

OOB error 33.20%; OOB confusion matrix:
         True
Predicted bus opel saab van
     bus  107    1    6   0
     opel   8   65   43   0
     saab   0   33   42   0
     van    9   28   38 120

Alas, we ended up with the same out of bag error as our original model. Nevertheless, we will keep this version of the model.

The influence of variables can be assessed using $importance. The six most important variables and their standard deviations are as follows:

> imp <- fit$importance[order(-fit$importance[,1]),]

> round(head(imp), 3)
             MeanScoreLoss SdScoreLoss
Max.L.Ra             0.353       0.008
Elong                0.219       0.006
Sc.Var.Maxis         0.205       0.005
Sc.Var.maxis         0.201       0.005
Pr.Axis.Rect         0.196       0.005
D.Circ               0.196       0.004

Step 5 Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in predClass and combined using the table method to create the confusion matrix.

> predClass <- predict(fit, Vehicle[-train, ])

> table(predClass, Vehicle$Class[-train], dnn = c("Predicted Class", "Observed Class"))

                Observed Class
Predicted Class bus opel saab van
           bus   85    5    4   0
           opel   4   38   36   0
           saab   0   13   28   0
           van    5   29   20  79

Finally we calculate the misclassification error rate. At 33.5% it is fairly close to the error estimate for the validation sample.

> error_rate = (1 - sum(Vehicle$Class[-train] == predClass) / 346)
> round(error_rate, 3)
[1] 0.335
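The divisor 346 above is simply the size of the test sample (846 observations less the 500 used for training). As a minimal sketch, and assuming train still holds the indices drawn in Step 2, the same error rate can be computed without the hard-coded count:

> n_test <- length(Vehicle$Class[-train])    # 846 - 500 = 346
> round(1 - sum(Vehicle$Class[-train] == predClass) / n_test, 3)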


Technique 42

Binary Response Random Forest

On many occasions the response variable is binary. In this chapter we show how to build a random forest model in this situation. A binary response random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data = , ntree = , importance = TRUE, proximity = TRUE)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which, if set to TRUE, assesses the importance of predictors; proximity, another logical variable which, if set to TRUE, calculates a proximity measure among the rows; z, the binary response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages

We build the binary response random forest using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded. We will use this package during the final stage of our analysis.

> library(randomForest)
> library(mlbench)
> data(Sonar)
> require(ROCR)

Step 2 Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1000. A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Sonar)
> numtrees = 1000
> train <- sample(1:N, 157, FALSE)

The data frame attributes holds the complete set of attributes contained in Sonar.

> attributes <- Sonar[train, ]
> attributes$Class <- NULL

Step 3 Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set importance equal to TRUE.

> fit <- randomForest(Class ~ ., data = Sonar[train, ], ntree = numtrees, importance = TRUE)

The print function returns details of the random forest.

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Sonar[train, ],
     ntree = numtrees, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 7

        OOB estimate of error rate: 18.47%
Confusion matrix:
   M  R class.error
M 70 12   0.1463415
R 17 58   0.2266667


The classification error on M is less than 20%, whilst for R it is around 23%. The out of the bag estimate of error is also less than 20%.
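If you prefer to pull these figures from the fitted object rather than read them off the printout, the randomForest object stores them; a short sketch assuming fit is the model estimated above:

> fit$confusion          # confusion matrix with per-class error
> tail(fit$err.rate, 1)  # OOB, M and R error rates after the final tree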

Step 4 Assess Model

A visualization of the error rate by iteration for each of M and R, together with the overall out of the bag error, is obtained using plot; see Figure 42.1.

> plot(fit)

Figure 42.1: Random Forest error estimate across iterations for M, R and overall using Sonar

The misclassification errors settle down to a stable range by 800 trees.
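The curves drawn by plot(fit) are not labelled; a common convenience (a sketch, not part of the original example) is to add a legend keyed to the columns of fit$err.rate, which plot(fit) draws in matching colours and line types:

> plot(fit)
> legend("topright", legend = colnames(fit$err.rate), col = 1:3, lty = 1:3)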


Next the variable importance scores are calculated and the partial dependence plot (see Figure 42.2) is plotted.

> imp <- importance(fit)
> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) partialPlot(fit, Sonar[train, ], impvar[i],
     xlab = impvar[i], main = paste("Partial Dependence on", impvar[i]))

> par(op)


Figure 42.2: Binary random forest partial dependence plot using Sonar

Figure 42.2 seems to indicate that V11, V12, V9 and V52 have similar shapes in terms of partial dependence. Looking closely at the diagram you will notice that all except V52 appear to be responding on a similar scale. Given the similar pattern we calculate a number of combinations of these three variables and add them to our group of attributes.

> temp <- cbind(Sonar$V11, Sonar$V12, Sonar$V9)
> Vmax <- apply(temp, 1, function(x) max(x))
> Vmin <- apply(temp, 1, function(x) min(x))
> Vmean <- apply(temp, 1, function(x) mean(x))
> Vmed <- apply(temp, 1, function(x) median(x))


> attributes <- cbind(attributes, Vmax[train], Vmin[train], Vmean[train], Vmed[train])

> Sonar <- cbind(Sonar, Vmax, Vmin, Vmean, Vmed)

Now our list of attributes in Sonar also includes the maximum, minimum, mean and median of V11, V12 and V9. Let's refit the model and calculate variable importance of all attributes.

> fit <- randomForest(Class ~ ., data = Sonar[train, ],
     ntree = numtrees, importance = TRUE)
> varImpPlot(fit)

Figure 42.3 shows variable importance using two measures - the mean decrease in accuracy and the mean decrease in the Gini score. Notice the consistency in the identification of the top eight variables between these two methods. It seems the most influential variables are Vmax, Vmin, Vmean, Vmed, V11, V12, V9 and V52.


Figure 42.3: Binary Random Forest variable importance plots for Sonar

A ten-fold cross validation is performed using the function rfcv, followed by a plot, shown in Figure 42.4, of the cross validated error by number of predictors (attributes).

> cv <- rfcv(trainx = attributes, ntree = numtrees,
     trainy = Sonar$Class[train], cv.fold = 10, recursive = TRUE)

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
     xlab = "number of predictors", ylab = "cross-validated error (%)"))


The error falls sharply from 2 to around 9 predictors and continues to decline, leveling out by 50 predictors.

Figure 42.4: Binary random forest cross validation error by number of predictors using Sonar
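If you want the numbers behind Figure 42.4 rather than the picture, the object returned by rfcv stores them; a quick sketch assuming cv is the object created above:

> round(cbind(predictors = cv$n.var, cv.error = cv$error.cv), 3)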

Step 5 Make Predictions

It is always worth experimenting with the tuning parameter mtry. It controls the number of variables randomly sampled as candidates at each split. The function tuneRF can be used to visualize the out of bag error associated with different values of mtry.


> best_mytry <- tuneRF(attributes, Sonar$Class[train], ntreeTry = numtrees, stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE, doBest = FALSE)

Figure 42.5 illustrates the output of tuneRF. The optimal value occurs at mtry = 8 with an out of the bag error of around 19.75%.

We use this value of mtry and refit the model using the test sample.
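Rather than reading the optimal value off the chart, it can also be pulled from the matrix tuneRF returns when doBest = FALSE; a small sketch assuming best_mytry holds the result of the call above:

> best_mytry[which.min(best_mytry[, "OOBError"]), "mtry"]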

Figure 42.5: Using tuneRF to obtain the optimal mtry value

> fit.best <- randomForest(Class ~ ., data = Sonar[-train, ], ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)


> print(fit.best)

Call:
 randomForest(formula = Class ~ ., data = Sonar[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 8

        OOB estimate of error rate: 23.53%
Confusion matrix:
   M  R class.error
M 26  3   0.1034483
R  9 13   0.4090909

The misclassification error for M and R is 10% and 41% respectively, with an out of the bag error estimate of around 24%. Finally we calculate and plot the Receiver Operating Characteristic for the test set; see Figure 42.6.

> fit.preds <- predict(fit.best, data = Sonar$Class[-train, -61], type = 'prob')
> preds <- fit.preds[, 2]
> plot(performance(prediction(preds, Sonar$Class[-train]), 'tpr', 'fpr'))
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")


Figure 42.6: Binary Random Forest ROC for test set using Sonar


Technique 43

Binary Response Random Ferns

A binary response random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data = , ferns = , saveErrorPropagation = TRUE)

Key parameters include z, the binary response variable; data, the data set of attributes with which you wish to build the ferns; ferns, which controls the number of ferns to grow in each iteration; and saveErrorPropagation, a logical variable which, if set to TRUE, calculates and saves the out of bag error approximation.

Step 1 Load Required Packages

We build the binary response random ferns using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded. We will use this package during the final stage of our analysis to build a ROC curve.

> library(rFerns)
> library(mlbench)
> data(Sonar)
> require(ROCR)

Step 2 Prepare Data & Tweak Parameters

A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample.


> set.seed(107)
> N = nrow(Sonar)
> train <- sample(1:N, 157, FALSE)

Step 3 Estimate and Assess the Random Ferns

We use the function rFerns with 1000 ferns. The parameter saveErrorPropagation = TRUE so that we can save the out of bag error estimates.

> fit <- rFerns(Class ~ ., data = Sonar[train, ], ferns = 1000, saveErrorPropagation = TRUE)

The print function returns details of the fitted model.

> print(fit)

Forest of 1000 ferns of a depth 5

OOB error 20.38%; OOB confusion matrix:
          True
Predicted  M  R
        M 67 17
        R 15 58

The forest consists of 1000 ferns of a depth 5 with an out of the bag error rate of around 20%. Let's plot the error rate.

> plot(fit$oobErr, xlab = "Number of Ferns", ylab = "OOB Error", type = "l")

From Figure 43.1 it looks as if the error is minimized around 335 ferns with an approximate out of the bag error rate of around 20%. To get the exact value use:

> which.min(fit$oobErr)
[1] 335
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.1910828

It seems the error is minimized at 335 ferns with an oob error of 19%.


Figure 43.1: Binary Random Ferns out of bag error by number of ferns for Sonar

Now, setting ferns = 335, we re-estimate the model.

> fit <- rFerns(Class ~ ., data = Sonar[train, ], ferns = 335, importance = TRUE, saveForest = TRUE)

> print(fit)

Forest of 335 ferns of a depth 5

OOB error 18.47%; OOB confusion matrix:
          True
Predicted  M  R
        M 70 17
        R 12 58

PRACTITIONER TIP

To see a print out of the error estimate as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns an out of the bag error estimate at every 50th iteration. Try this:

test <- rFerns(Class ~ ., data = Sonar[train, ], ferns = 100, reportErrorEvery = 20)

Now we turn our attention to investigating variable importance. The values can be extracted from the fitted model using fit$importance. Since larger values indicate greater importance, we use the order function to sort from largest to smallest but only show the top six.

> imp <- fit$importance[order(-fit$importance[, 1]), ]
> round(head(imp), 3)

    MeanScoreLoss SdScoreLoss
V11         0.125       0.028
V12         0.109       0.021
V10         0.093       0.016
V49         0.068       0.015
V47         0.060       0.017
V9          0.060       0.017

The output reports the top six attributes by order of importance. It also provides a measure of their variability (SdScoreLoss). It turns out that V11 has the highest importance score followed by V12 and V10. Notice that V49, V47 and V9 are all clustered around 0.06.

Step 4 Make Predictions

We use the predict function to predict the classes using the test sample, create and print the confusion matrix and calculate the misclassification error. The misclassification error is close to 18%.

> predClass <- predict(fit, Sonar[-train, ])
> table(predClass, Sonar$Class[-train], dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class  M  R
              M 24  4
              R  5 18

> error_rate = (1 - sum(Sonar$Class[-train] == predClass) / 51)
> round(error_rate, 3)
[1] 0.176

It can be of value to calculate the Receiver Operating Characteristic (ROC). This can be constructed using the ROCR package. First we convert the raw scores estimated using the test sample from the model fit into probabilities, or at least values that lie in the zero to one range.

> predScores <- predict(fit, Sonar[-train, ], scores = TRUE)
> predScores <- predScores + abs(min(predScores))

Next we need to generate the prediction object. As inputs we use the predicted class probabilities in predScores and the actual test sample classes (Sonar$Class[-train]). The predictions are stored in pred and passed to the performance method, which calculates the true positive rate and false positive rate. The result is visualized, as shown in Figure 43.3, using plot.

> pred <- prediction(predScores[, 2], Sonar$Class[-train])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")

To interpret the ROC curve, let us remember that perfect classification implies a 100% true positive rate (and therefore a 0% false positive rate). This classification only happens at the upper left-hand corner of the graph and results in a plot as shown in Figure 43.2.

Now, looking at Figure 43.3, we see that the closer the curve gets to the upper left corner the higher the classification rate. The diagonal line shown in Figure 43.3 represents random chance, so the distance of our graph above the diagonal line represents how much better it is than a random guess. The area under the curve, which takes the value 1 for a 100% true positive rate, can be calculated using a few lines of code.

> auc <- performance(pred, "auc")
> auc_area <- slot(auc, "y.values")[[1]]
> round(auc_area, 3)
[1] 0.929

Figure 43.2: 100% true positive rate ROC curve


Figure 43.3: Binary Random Ferns Receiver Operating Characteristic for Sonar

Technique 44

Survival Random Forest

A survival random forest can be built using the package randomForestSRC with the rfsrc function:

rfsrc(z ~ ., data = )

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages

We construct a survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(randomForestSRC)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2 Prepare Data & Tweak Parameters

A total of 600 of the 641 observations in rhDNase are used to create a randomly selected training sample. The rows in rhDNase have "funky" numbering. We use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient).

> set.seed(107)
> N = nrow(rhDNase)


> rownames(rhDNase) <- 1:nrow(rhDNase)
> train <- sample(1:N, 600, FALSE)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2

Step 3 Estimate and Assess the Model

We estimate the random forest using the training sample and the function rfsrc. The number of trees is set to 1000 and the number of variables randomly sampled as candidates at each split is set equal to 3 (mtry = 3), with a maximum of 3 (nsplit = 3) split points chosen randomly.

> fit <- rfsrc(Surv(time2, status) ~ trt + fev + fev2 + fev.ave, data = rhDNase[train, ], nsplit = 3, ntree = 1000, mtry = 3)

The error rate by number of trees and the variable importance are given using the plot method - see Figure 44.1. The error rate levels out around 400 trees, remaining relatively constant until around 600 trees. Our constructed variable fev.ave is indicated as the most important variable, followed by fev2.

> plot(fit)

        Importance Relative Imp
fev.ave     0.0206       1.0000
fev2        0.0197       0.9586
fev         0.0073       0.3529
trt         0.0005       0.0266


Figure 44.1: Survival random forest error rate and variable importance for rhDNase

Further details of variable importance can be explored using a range of methods via the vimp function and the parameter importance. This parameter can take four distinct values: "permute", "random", "permute.ensemble" and "random.ensemble". Let's calculate all four methods.

> permute <- vimp(fit, importance = "permute")$importance
> random <- vimp(fit, importance = "random")$importance
> permute.ensemble <- vimp(fit, importance = "permute.ensemble")$importance
> random.ensemble <- vimp(fit, importance = "random.ensemble")$importance

We combine the results using rbind into the matrix tab. Overall we see that fev2 and fev.ave appear to be the most important attributes.

> tab <- rbind(permute, random, permute.ensemble, random.ensemble)
> round(tab, 3)

                    trt    fev   fev2 fev.ave
permute           0.001  0.007  0.021   0.021
random            0.001  0.010  0.015   0.018
permute.ensemble -0.002 -0.027 -0.015  -0.016
random.ensemble  -0.004 -0.025 -0.027  -0.026

PRACTITIONER TIP

Another way to assess variable importance is using the var.select method. It takes the form var.select(object = fitted model, method = "md"). Method can take the values "md" for minimal depth (default), "vh" for variable hunting and "vh.vimp" for variable hunting with VIMP. As an illustration, using the fitted model fit, type:

> md <- var.select(object = fit, method = "md")
> vh <- var.select(object = fit, method = "vh")
> vh.vimp <- var.select(object = fit, method = "vh.vimp")

Searching for interaction effects is an important task in many areas of statistical modeling. A useful tool is the method find.interaction. Key parameters for this method include the number of variables to be used (nvar) and the method (method). We set method = "vimp" to use the approach of Ishwaran84. In this method the importance of each variable is calculated. Their paired variable importance is also calculated. The sum of these two values is known as "Additive" importance. A large positive or negative difference between "Paired" and "Additive" indicates a potential interaction, provided the individual variable importance scores are large.

> find.interaction(fit, nvar = 8, method = "vimp")

Method: vimp
No. of variables: 4
Variables sorted by VIMP: TRUE
No. of variables used for pairing: 4
Total no. of paired interactions: 6
Monte Carlo replications: 1
Type of noising up used for VIMP: permute

             Paired Additive Difference
fev.ave:fev2 0.0346   0.0405    -0.0058
fev.ave:fev  0.0296   0.0284     0.0012
fev.ave:trt  0.0217   0.0222    -0.0005
fev2:fev     0.0283   0.0270     0.0013
fev2:trt     0.0196   0.0208    -0.0012
fev:trt      0.0134   0.0090     0.0045

Since the differences across all variable pairs appear rather small, we conclude there is little evidence of an interaction effect.

Now we investigate interaction effects using method = "maxsubtree". This invokes a maximal subtree analysis85.

> find.interaction(fit, nvar = 8, method = "maxsubtree")

Method: maxsubtree
No. of variables: 4
Variables sorted by minimal depth: TRUE

        fev.ave fev2  fev  trt
fev.ave    0.07 0.12 0.13 0.21
fev2       0.12 0.08 0.13 0.20
fev        0.12 0.12 0.09 0.23
trt        0.13 0.13 0.16 0.15

In this case, reading across the rows in the resultant table, small [i][j] values with small [i][i] values are an indication of a potential interaction between attribute i and attribute j.

Reading across the rows we do not see any such values, and again we do not find evidence supporting an interaction effect.

Step 4 Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in pred.

> pred <- predict(fit, data = rhDNase[train, ])

Finally we use the plot and plot.survival methods to visualize the predicted outcomes; see Figure 44.2 and Figure 44.3.

> plot(pred)
> plot.survival(pred)

Figure 44.2: Survival random forest visualization of plot(pred) using rhDNase


Figure 44.3: Survival random forest visualization of prediction

Technique 45

Conditional Inference Survival Random Forest

A conditional inference survival random forest can be built using the package party with the cforest function:

cforest(z ~ ., data = , controls = )

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages

We construct a conditional inference survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2 Prepare Data & Tweak Parameters

We will use all 641 observations in rhDNase for our analysis. The rows in rhDNase have "funky" numbering. Therefore we use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient).


> set.seed(107)
> rownames(rhDNase) <- 1:nrow(rhDNase)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3 Estimate and Assess the Model

We estimate the random forest using the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 10) and to set the number of variables randomly sampled as candidates at each split (mtry = 2).

fit <- cforest(z ~ trt + fev.ave, data = rhDNase, controls = cforest_unbiased(ntree = 10, mtry = 2))

Step 4 Make Predictions

We will estimate conditional Kaplan-Meier survival curves for individual patients. First we use the predict method.

> pred <- predict(fit, newdata = rhDNase, type = "prob")

Next we use the plot method to graph the conditional Kaplan-Meier curves for individual patients (patients 1, 137, 205 and 461); see Figure 45.1.

> op <- par(mfrow = c(2, 2))
> plot(pred[[1]], main = "Patient 1", xlab = "Status", ylab = "Time")
> plot(pred[[137]], main = "Patient 137", xlab = "Status", ylab = "Time")
> plot(pred[[205]], main = "Patient 205", xlab = "Status", ylab = "Time")
> plot(pred[[461]], main = "Patient 461", xlab = "Status", ylab = "Time")


> par(op)

Figure 45.1: Survival random forest Kaplan-Meier curves for individual patients using rhDNase

Technique 46

Conditional Inference Regression Random Forest

A conditional inference regression random forest can be built using the package party with the cforest function:

cforest(z ~ ., data = , controls = )

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the continuous response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages

We build the conditional inference regression random forest using the bodyfat (see page 62) data frame contained in the TH.data package. We also load the caret package as it provides handy performance metrics for random forests.

> library(party)
> data("bodyfat", package = "TH.data")
> library(caret)

Step 2 Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3 Estimate and Assess the Decision Tree

We estimate the random forest using the training sample and the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 100) and to set the number of variables randomly sampled as candidates at each split (mtry = 5).

> fit <- cforest(DEXfat ~ ., data = bodyfat[train, ], controls = cforest_unbiased(ntree = 100, mtry = 5))

The fitted model root mean square error and R-squared are obtained using the caret package.

> round(caret:::cforestStats(fit), 4)

   RMSE Rsquared
 5.1729   0.8112

We calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al86.

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:3])]
> round(imp1, 3)
  hipcirc waistcirc       age
   69.957    47.366     0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman87.

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:3])]
> round(imp2, 3)
  hipcirc waistcirc       age
   72.527    40.501     0.000

The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman.

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:3])]
> round(imp3, 3)
  hipcirc waistcirc       age
   66.600    47.218     0.000

It is informative to note that all three approaches identify hipcirc as the most important variable, followed by waistcirc.

Step 5 Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. The scatter plot between predicted and observed values is shown in Figure 46.1. The squared correlation coefficient between predicted and observed values is 0.756.

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
         [,1]
DEXfat  0.756


Figure 46.1: Conditional Inference regression forest scatter plot for DEXfat

Technique 47

Quantile Regression Forests

Quantile regression forests are a generalization of random forests and offer a non-parametric way of estimating conditional quantiles of predictor variables88. Quantile regression forests can be built using the package quantregForest with the quantregForest function:

quantregForest(y = , x = , ntree = , quantiles = )

Key parameters include y, the continuous response variable; x, the data set of attributes with which you wish to build the forest; ntree, the number of trees; and quantiles, the quantiles you wish to include.

Step 1 Load Required Packages

We build the Quantile Regression Forests using the bodyfat (see page 62) data frame contained in the TH.data package.

> require(quantregForest)
> data("bodyfat", package = "TH.data")

Step 2 Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement. We use train to partition the data into a validation and test sample.

> train <- sample(1:71, 45, FALSE)
> numtrees = 2000
> xtrain <- bodyfat[train, ]


> xtrain$DEXfat <- NULL
> xtest <- bodyfat[-train, ]
> xtest$DEXfat <- NULL
> DEXtrain <- bodyfat$DEXfat[train]
> DEXtest <- bodyfat$DEXfat[-train]

Step 3 Estimate and Assess the Model

We use the quantregForest method, setting the number of variables randomly sampled as candidates at each split and the node size equal to 5 (mtry = 5, nodesize = 5). We use fit to contain the results of the fitted model.

> fit <- quantregForest(y = DEXtrain, x = xtrain, mtry = 5, ntree = numtrees, nodesize = 5, importance = TRUE,
     quantiles = c(0.25, 0.5, 0.75))

Use the print method to see a summary of the estimated model.

> print(fit)

Call:
 quantregForest(x = xtrain, y = DEXtrain, mtry = 5, nodesize = 5,
     ntree = numtrees, importance = TRUE, quantiles = c(0.25, 0.5, 0.75))

                     Number of trees: 2000
No. of variables tried at each split: 5

To see variable importance across all fitted quantiles use the method varImpPlot.qrf. A visualization using this method for fit is shown in Figure 47.1. It appears hipcirc and waistcirc are the most influential variables for the median and 75th quantile.

> varImpPlot.qrf(fit)


Figure 47.1: Quantile regression forests variable importance across all fitted quantiles for bodyfat

Let's look in a little more detail at the median. This is achieved using the varImpPlot.qrf method and setting quantile = 0.5. The result is illustrated in Figure 47.2.

> varImpPlot.qrf(fit, quantile = 0.5)


Figure 47.2: Quantile regression forests using the varImpPlot.qrf method and setting quantile = 0.5 for bodyfat

The plot method returns the 90% prediction interval on the estimated data; see Figure 47.3. It appears the model is well calibrated to the data.

> plot(fit)


Figure 47.3: Quantile Regression Forests 90% prediction interval using bodyfat

Step 4 Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. Notice we set all = TRUE to use all observations for prediction.

> pred <- predict(fit, newdata = xtest, all = TRUE)

To take a closer look at the value of pred you can enter:

> head(pred)

     quantile= 0.1 quantile= 0.5 quantile= 0.9
[1,]      34.22470      41.53230      47.14400
[2,]      35.40183      41.53541      52.41859
[3,]      22.50604      27.00303      35.40398
[4,]      27.64735      35.56909      42.73546
[5,]      21.30879      26.20480      35.33114
[6,]      32.49331      38.86571      44.52887

The plot of the fitted and observed values is shown in Figure 47.4. The squared correlation coefficient between the predicted and observed values is 0.898.

> plot(DEXtest, pred[, 2], xlab = "DEXfat", ylab = "Predicted Values (median)", main = "Training Sample Model Fit")

> round(cor(pred[, 2], DEXtest)^2, 3)
[1] 0.898


Figure 47.4: Quantile regression forests scatter plot between predicted and observed values using bodyfat

Technique 48

Conditional Inference Ordinal Random Forest

Conditional inference ordinal random forests are used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". Conditional inference ordinal random forests can be built using the package party with the cforest function:

cforest(z ~ ., data = , controls = )

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; the response variable z, an ordered factor; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages

We build our random forest using the wine data frame contained in the ordinal package (see page 95). We also load the caret package as it provides handy performance metrics for random forests.

> library(caret)
> library(ordinal)
> data(wine)


Step 2 Prepare Data & Tweak Parameters

We use train to partition the data, using fifty observations to build the model and the remainder as the test set.

> set.seed(107)
> N = nrow(wine)
> train <- sample(1:N, 50, FALSE)

Step 3 Estimate and Assess the Model

We use the cforest method with 100 trees and set the number of variables randomly sampled as candidates at each split equal to 5 (mtry = 5). We use fit to contain the results of the fitted model.

> fit <- cforest(rating ~ ., data = wine[train, ], controls = cforest_unbiased(ntree = 100, mtry = 5))

Accuracy and kappa are obtained using caret.

> round(caret:::cforestStats(fit), 4)
Accuracy    Kappa
  0.8600   0.7997

The fitted model has reasonable accuracy and kappa statistics. We next calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al. (See page 335 for further details of these methods.)

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:5])]
> round(imp1, 3)
response     temp  contact   bottle    judge
   0.499    0.000    0.000    0.000    0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman.

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:5])]
> round(imp2, 3)
response     temp  contact   bottle    judge
   0.448    0.000    0.000    0.000    0.000


The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman.

> ord3 <- varimp(fit, conditional = TRUE)
> imp3 <- ord3[order(-ord2[1:5])]
> round(imp3, 3)
response     temp  contact   bottle    judge
   0.481    0.000    0.000    0.000    0.000

For all three methods it seems the response (wine bitterness score) is the only informative attribute in terms of importance.

Step 4 Make Predictions

We next use the test sample observations and cforest, and display the accuracy and kappa using caret.

> fit_test <- cforest(rating ~ ., data = wine[-train, ])
> round(caret:::cforestStats(fit_test), 4)
Accuracy    Kappa
  0.3182   0.0000

Accuracy is now around 32% and kappa has fallen to zero. Although these numbers are not very encouraging, we nevertheless investigate the predictive performance using the predict method, storing the results in pred.

> pred <- predict(fit, newdata = wine[-train, ], type = "response")

Now we compare the predicted values to those actually observed by using the table method to create a confusion matrix.

> tb <- table(wine$rating[-train], pred, dnn = c("actual", "predicted"))
> tb

      predicted
actual 1 2 3 4 5
     1 0 3 0 0 0
     2 0 6 0 0 0
     3 0 0 7 0 0
     4 0 0 0 4 0
     5 0 0 0 2 0

Finally, the misclassification rate can be calculated. We see it is around 23% for the test sample.

> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)
[1] 0.227



Notes

72See Ozuysal M, Fua P, Lepetit V (2007) Fast Keypoint Recognition in Ten Lines of Code. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp 1-8.

73Please visit http://www.un-redd.org

74See for example:
1. Evans JS, Murphy MA, Holden ZA, Cushman SA (2011) Modeling species distribution and change using random forest. In: Drew CA, Wiersma YF, Huettmann F, editors. Predictive species and habitat modeling in landscape ecology: concepts and applications. New York City, NY, USA: Springer Science+Business Media. pp 139-159.
2. Breiman L (2001) Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16: 199-231.

75See for example:
1. Baccini A, Goetz SJ, Walker WS, Laporte NT, Sun M, et al. (2012) Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature Climate Change 2: 182-185.
2. Drake JB, Knox RG, Dubayah RO, Clark DB, Condit R, et al. (2003) Above ground biomass estimation in closed canopy neotropical forests using lidar remote sensing: Factors affecting the generality of relationships. Global Ecology and Biogeography 12: 147-159.
3. Hudak AT, Strand EK, Vierling LA, Byrne JC, Eitel JUH, et al. (2012) Quantifying above ground forest carbon pools and fluxes from repeat LiDAR surveys. Remote Sensing of Environment 123: 25-40.

76Mascaro, Joseph, et al. A tale of two "forests": Random Forest machine learning aids tropical forest carbon mapping. (2014): e85993.

77Farney, Robert J., et al. Sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone. European Respiratory Journal 42.2 (2013): 394-403.

78Vitorino, D., et al. A Random Forest Algorithm Applied to Condition-based Wastewater Deterioration Modeling and Forecasting. Procedia Engineering 89 (2014): 401-410.

79Sugimoto, Koichiro, et al. Cross-sectional study: Does combining optical coherence tomography measurements using the 'Random Forest' decision tree classifier improve the prediction of the presence of perimetric deterioration in glaucoma suspects? BMJ Open 3.10 (2013): e003114.

80Kanerva, N., et al. Random forest analysis in identifying the importance of obesity risk factors. European Journal of Public Health 23.suppl 1 (2013): ckt124-042.

81Chen, Xue-Wen, and Mei Liu. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21.24 (2005): 4394-4400.

82Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, http://dx.doi.org/10.1007/s11222-012-9349-1

83Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

84See Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic J. Statist. 1: 519-537.

85For more details of this method see:
1. Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ and Lauer MS (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc. 105: 205-217.
2. Ishwaran H, Kogalur UB, Chen X and Minn AJ (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining 4: 115-132.

86See Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, http://dx.doi.org/10.1007/s11222-012-9349-1

87See Leo Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.

88See Meinshausen, Nicolai. Quantile regression forests. The Journal of Machine Learning Research 7 (2006): 983-999.


Part VI

Cluster Analysis


The Basic Idea

Cluster analysis is an exploratory data analysis tool which aims to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The groups of similar objects are called clusters.

Cluster analysis itself is not one specific algorithm but consists of a wide variety of approaches. Two of the most popular are hierarchical clustering and partition based clustering. We discuss each briefly below.

Hierarchical Clustering

Hierarchical clustering methods attempt to maximize the distance between clusters. They begin by assigning each object to its own cluster. At each step the algorithm merges together the least distant pair of clusters until only one cluster remains. At every step the distance between clusters, say cluster A and cluster B, is updated.

There are two basic approaches used in hierarchical clustering algorithms. The first is known as agglomerative and the second divisive.

Agglomerative Approach

The agglomerative approach begins with each observation considered a separate cluster. These observations are then merged successively into larger clusters until a hierarchy of clusters, known as a dendrogram, emerges.

Figure 54.4 shows a typical dendrogram using the thyroid data set. Further details of this data are discussed on page 385. Because agglomerative clustering algorithms begin with the individual observations and then grow to larger clusters, this is known as a bottom up approach to clustering.
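As a minimal sketch of the agglomerative approach (not the code behind Figure 54.4), a dendrogram for the thyroid attributes can be grown with hclust from the stats package; Euclidean distance and complete linkage are simply illustrative assumptions:

> library(mclust)                  # provides the thyroid data
> data(thyroid)
> x <- thyroid
> x$Diagnosis <- NULL              # drop the class labels
> hc <- hclust(dist(x), method = "complete")
> plot(hc, labels = FALSE, main = "Agglomerative dendrogram")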


Divisive Approach

The divisive approach begins with the entire sample of observations and then proceeds to divide it into smaller clusters. Two approaches are generally taken. The first uses a single attribute to determine the split at each step. This approach is called monothetic. The second approach uses all attributes at each step and is known as polythetic. Since for a sample of n observations there are 2^(n-1) - 1 possible divisions into clusters, the divisive approach is computationally intense. Figure 55.2 shows an example of a dendrogram produced using the divisive approach, and a small sketch of how such a clustering can be run is given below.
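For reference, a (polythetic) divisive clustering of the same attributes can be run with the diana function from the cluster package; again this is only an illustrative sketch, not the code behind Figure 55.2:

> library(cluster)
> dv <- diana(x)                   # x as prepared in the sketch above
> plot(dv, which.plots = 2, main = "Divisive dendrogram")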

Partition Based Clustering

Partitioning clustering methods attempt to minimize the "average" distance within clusters. They typically proceed as follows:

1. Select at random group centers and assign objects to the nearest group.

2. Form new centers at the center of these groups.

3. Move individual objects to a new group if it is closer to that group than the center of its present group.

4. Repeat steps 2 and 3 until no (or minimal) change in groupings occurs.

K-Means

The most popular partition based clustering algorithm is the k-means method. It requires a distance metric and the number of centroids (clusters) to be specified. Here is how it works:

1. Determine how many clusters you expect in your sample, say for example k.

2. Pick a group of k random centroids. A centroid is the center of a cluster.

3. Assign sample observations to the closest centroid. The closeness is determined by a distance metric. They become part of that cluster.

4. Calculate the center (mean) of each cluster.

5. Check assignments for all the sample observations. If another center is closer to an observation, reassign it to that cluster.

6. Repeat steps 3-5 until no reassignments occur.

Much of the popularity of k-means lies in the fact that the algorithm is extremely fast and can therefore be easily applied to large data-sets. However it is important to note that the solution is only a local optimum, so in practice several starting points should be used.
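A minimal sketch of the several-starting-points advice using the kmeans function in the stats package; the data object x and the choice of 3 clusters are assumptions made purely for illustration:

> set.seed(2015)
> fit <- kmeans(x, centers = 3, nstart = 25)   # keep the best of 25 random starts
> fit$tot.withinss                             # total within-cluster sum of squares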


Practical Applications

NOTE

Different clustering algorithms may produce very different clusters. Unfortunately there is no generally accepted "best" method. One way to assess an algorithm is to take a data set with a known structure and see whether the algorithm is able to reproduce this structure. This is the approach taken by Musmeci et al.

Financial Market Structure

Musmeci et al89 study the structure of the New York Stock Exchange using various clustering techniques. A total of 4026 daily closing prices from 342 stocks are collected for their analysis. Both trended and detrended log price returns are considered for analysis. Four of the clustering methods considered are Single Linkage (SL), Average Linkage (AL), Complete Linkage (CL) and k-medoids, using the Industrial Classification Benchmark90 (ICB) as a known classification outcome reference.

Figure 48.1 illustrates the findings of their analysis for a predefined cluster size equal to 23. SL appears to generate one very large cluster which contains stocks in all the ICB sectors; this is clearly inaccurate. AL, CL and k-medoids show more structured clustering, although for AL there remain a number of clusters with less than 5 stocks91.

The Adjusted Rand index and number of sectors identified by each method are presented in Table 21. ICB has 19 sectors and one would hope that the clustering methods would report a similar number of clusters. For the trended stock returns k-medoids comes closest with 17 sectors and an Adjusted Rand index of 0.387. For the detrended case k-medoids has a lower Adjusted Rand index than CL and AL, but it again comes closest to recovering the actual number of sectors in ICB.


Figure 48.1: Musmeci et al's composition of clustering in terms of ICB super-sectors. The x-axis represents the cluster labels, the y-axis the number of stocks in each cluster. Color and shading represent the ICB super-sectors. Source: Musmeci et al.


                     k-medoids    CL    AL    SL
with trend
Adjusted Rand index      0.387 0.387 0.352 0.184
Clusters                    17    39   111   229
detrended
Adjusted Rand index      0.467 0.510 0.480 0.315
Clusters                    25    50    60   101

Table 21: Adjusted Rand index and number of estimated sectors by clustering model reported by Musmeci et al.

PRACTITIONER TIP

Musmeci et al use the Adjusted Rand index as part of their analysis. The index is a measure of the similarity between two data partitions or clusterings using the same sample. This index has zero expected value in the case of a random partition and takes the value 1 in the case of perfect agreement between two clusterings. Negative values of the index indicate anti-correlation between the two clusterings. You can calculate it in R with the function adjustedRandIndex contained in the mclust package.
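A small sketch of the call; comparing a k-means solution against the known thyroid Diagnosis labels is simply an illustrative choice:

> library(mclust)
> data(thyroid)
> x <- thyroid; x$Diagnosis <- NULL       # attributes without the class labels
> km <- kmeans(x, centers = 3)
> adjustedRandIndex(km$cluster, thyroid$Diagnosis)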


Understanding Rat Talk

So called "Rat Talk" consists of ultrasonic vocalizations (USVs) emitted by individual rats to members of their colony. It is possible that USVs convey both emotional and/or environmental information. Takahashi et al92 use cluster analysis to categorize the USVs of pairs of rats as they go about their daily activities.

The data set consisted of audio and video recordings. The audio recording data was analyzed 50 ms at a time. Where a USV was emitted, the corresponding video sequence was also analyzed. USV calls were categorized according to Wright et al's rat communication classification system93. This consists of seven call types - upward, downward, flat, short, trill, inverted U and 22-kHz.

A two-step cluster analysis was performed. In the first step frequency and duration data were formed into preclusters. The number of clusters was automatically determined on the basis of the Schwarz Bayesian Criterion, with the log-likelihood criterion used as a distance measure. In the second step a hierarchical clustering algorithm was applied to the preclusters.

Table 22 presents some of the findings. Notice that three clusters were obtained, which the researchers labeled as "feeding", "moving" and "fighting". Feeding is associated with a low frequency USV, moving with a medium frequency and fighting with the highest frequency. Overall the analysis tends to indicate that there is an association between USVs and rat behavior.

Cluster      Frequency          Duration             Dominant call type
1 Feeding    24.56 ± 2.18 kHz   628.70 ± 415.45 ms   22-kHz
2 Moving     41.78 ± 5.88 kHz   31.18 ± 32.40 ms     Flat
3 Fighting   59.18 ± 4.91 kHz   9.16 ± 10.08 ms      Short

Table 22: Summary of findings of Takahashi et al.

Identification of Asthma Clusters

Kim et al94 use cluster analysis to help define asthma sub-types as a prelude to searching for better asthma management. The researchers conduct their analysis using two cohorts of asthma patients. The first, known as the COREA cohort, consists of 724 patients. The second, the SCH cohort, consists of 1843 patients. We focus our discussion on the COREA cohort as similar results were found for the SCH cohort.

Six primary variables, FEV195, body mass index, age at onset, atopic status, smoking history and history of hospital use, were used to help characterize the asthma clusters. All measurements were standardized using z-scores for continuous variables and as 0 or 1 for categorical variables. Hierarchical cluster analysis using Ward's method was used to generate a dendrogram for estimation of the number of potential clusters. This estimate was then used in a k-means cluster analysis.

The researchers observed four clusters. The first cluster contained 81 patients and was dominated by male patients, with the greatest number of smokers and a mean onset age of 46 years. The second cluster contained 151 patients and around half of the patients had atopy. The third cluster had 253 patients with the youngest mean age at onset (21 years), and about two-thirds of patients had atopy. The final group had 239 patients and the highest FEV1 at 97.9%, with mean age at onset of 48 years.

Automated Determination of the Arterial Input Function

Cerebral perfusion, also referred to as cerebral blood flow (CBF), is one of the most important parameters related to brain physiology and function. The technique of dynamic-susceptibility contrast (DSC) MRI96 is a popular method to measure perfusion. It relies on the intravenous injection of a contrast agent and the rapid measurement of the transient signal changes as the contrast agent passes through the brain.

Central to quantification of CBF is the arterial input function (AIF), which describes the contrast agent input to the tissue of interest.

Quantitative maps of cerebral blood flow, cerebral blood volume (CBV) and mean transit time (MTT) are created using a deconvolution method97 98.

Yin et al99 consider the use of two clustering techniques (k-means and fuzzy c-means (FCM)) for AIF detection. Forty-two volunteers were recruited onto the study. They underwent DSC MRI imaging. After suitable transformation of the image data both clustering techniques were applied, with the number of clusters pre-set to 5.

For the mean curve of each cluster the peak value (PV), the time to peak (TTP) and the full-width half maximum (FWHM) were computed, from which a measure M = PV / (TTP × FWHM) was calculated. Following the approach of Murase et al100, the researchers select the cluster with the highest M to determine the AIF.

Figure 48.2 shows the AIFs for each clustering method for a 37 year old male patient. Relative to FCM, the researchers observe the K-means-based AIF shows similar TTP, higher PV and narrower FWHM. The researchers conclude by stating "the K-means method yields more accurate and reproducible AIF results compared to FCM cluster analysis. The execution time is longer for the K-means method than for FCM but acceptable because it leads to more robust and accurate follow-up hemodynamic maps".


Figure 48.2: Comparison of AIFs derived from the FCM and K-means clustering methods. Source: Yin et al. doi:10.1371/journal.pone.0085884.g007

PRACTITIONER TIP

The question of how many clusters to pre-specify in methods such as k-means is a common issue. Kim et al solve this problem by using the dendrogram from hierarchical cluster analysis to generate the appropriate number of clusters. Given a sample of size N, an alternative and very crude rule of thumb is101

k ≈ √(N/2)    (48.1)
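In R the rule of thumb in equation 48.1 is a one-liner; N here is simply whatever sample size you happen to be working with:

> k <- round(sqrt(N/2))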

Choice Tests for a Wood Sample

Choice tests are frequently used by researchers to determine the food preference of insects. For insects which are sensitive to taste, natural variations in the same species of wood might be sufficient to confound experimental results102. Therefore researchers have spent considerable effort attempting to reduce this natural variation103.

Oberst et al104 test the similarity of wood samples cut sequentially from different Monterey pine (Pinus radiata) sources by applying fuzzy c-means clustering.

Veneer discs from different trees and geographical locations were collected for two samples. The small data set consisted of 505 discs cut from 10 sheets of wood; the large data set consisted of 1417 discs cut from 22 sheets. Fuzzy c-means clustering using three physical properties (dry weight, moisture absorption and reflected light intensity) was used to evaluate both data-sets.

Six clusters were identified for each data set. For the small data set all six cluster centers were in regions of negative mode-skewness, which simply means that most of the wood veneer had more bright (early wood) than dark (late wood) regions. For the large data set only four of the six cluster centers were in regions of negative mode-skewness.

The researchers found that the difference between the small and large data set for the mode skewness of the reflected light intensity was statistically significant. This was not the case for the other two properties (dry weight and moisture absorption).

Oberst et al conclude by observing that "the clustering algorithm was able to place the veneer discs into clusters that match their original source veneer sheets by using just the three measurements of physical properties".

Evaluation of Molecular Descriptors

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy and health research. They are believed to map molecular structures into numbers, allowing some mathematical treatment of the chemical information contained in the molecule. As Todeschini and Consonni state105:

The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.

Dehmer et al106 evaluate 919 descriptors of 6 different categories (connectivity, edge adjacency, topological, walk path counts, information and 2D matrix based) by means of clustering. The samples for their analysis came from three data sets107: MS2265, C15 and N8. For each of the 6 categories 24, 301, 57, 28, 40 and 469 descriptors were acquired.

In order to evaluate the descriptors, seven hierarchical clustering algorithms (Ward, Single, Complete, Average, McQuitty, Median and the Centroid) were applied to each of the three data-sets.

Dehmer et al report the cophenetic correlation coefficients for the average clustering solutions for the three data-sets as 0.84, 0.89 and 0.93. The researchers also plot the hierarchical clusters, see Figure 48.3, and they observe that "The figure indicates that the descriptors of each categories have not been clustered correctly regarding their respective groups".


Figure 48.3: Dehmer et al's hierarchical clustering using the average algorithm. MS2265 (left), C15 (middle), N8 (right). The total number of descriptors equals 919. They belong to 6 different categories, which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D Matrix-based (469). Source: Dehmer et al. doi:10.1371/journal.pone.0083956.g001


PRACTITIONER TIP

Although often viewed as a graphical summary of a sample, it is important to remember that dendrograms actually impose structure on the data. For example, the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function (e.g. complete or average linkage) that is used.
One way to measure how faithfully a dendrogram preserves the pairwise distances between the original data points is to use the cophenetic correlation coefficient.
It is defined as the correlation between the n(n - 1)/2 pairwise dissimilarities between observations and the between-cluster dissimilarities at which two observations are first joined together in the same cluster (often known as cophenetic dissimilarities). It takes a maximum value of 1. Higher values correspond to greater preservation of the pairwise distances between the original data points.
It can be calculated using the cophenetic function in the stats package.
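A minimal sketch of the calculation, reusing a distance matrix and an hclust fit of the kind shown earlier (the object names are assumptions):

> d <- dist(x)                        # x: any numeric data set, e.g. the thyroid attributes
> hc <- hclust(d, method = "average")
> cor(cophenetic(hc), d)              # cophenetic correlation coefficient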


Partition Based Methods


Technique 49

K-Means

For our initial analysis we will use the kmeansruns function in the fpc package. This uses the kmeans method in the stats package.

kmeansruns(x, krange = , criterion = )

Key parameters include x, the data set of attributes for which you wish to find clusters; krange, the suspected minimum and maximum number of clusters; and criterion, which determines the metric used to assess the clusters.

Step 1 Load Required Packages

First we load the required packages.

> require(fpc)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2 Prepare Data & Tweak Parameters

We store the Vehicle data set in x.

> set.seed(98765)
> x <- Vehicle[-19]

Step 3 Estimate and Assess the Model

K-means requires you to specify the exact number of clusters. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the Calinski-Harabasz index to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 4 and 8. In this case we could call the kmeansruns method, setting krange to range between 4 and 8 and the criterion parameter = "ch" for the Calinski-Harabasz index.

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "ch")

4 clusters 2151.267
5 clusters 2528.895
6 clusters 2571.504
7 clusters 2375.042
8 clusters 2305.053

The optimal number of clusters is the solution with the highest Calinski-Harabasz index value. In this case the Calinski-Harabasz index selects 6 clusters.

We can also use the silhouette average width as our decision criterion.

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "asw")
4 clusters 0.4423919
5 clusters 0.4716047
6 clusters 0.4420139
7 clusters 0.4483106
8 clusters 0.3405826

The widest width occurs are 5 clusters This is a smaller number thanobtained by the Calinski-Harabasz index however using both methods wehave narrowed down the range of possible clusters to lie between 5 to 6Visualization often helps in making the final selection To do this letrsquos builda sum of squared error (SSE) scree plot I explain line by line belowgt wgs lt- (nrow(x) -1)sum(apply(x2var))

gt for (i in 28)fit_templt-kmeans(asmatrix(x) centers=i)wgs[i]lt-sum(fit_temp$withinss)


> plot(1:8, wgs, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")

Our first step is to assign the first element of wgs a large number (the total sum of squares). This is followed by a loop which calls kmeans in the stats package for k = 2 to 8 clusters. We then use the plot method to create the scree plot shown in Figure 49.1. Typically in a scree plot we look for a sharp elbow to indicate the appropriate number of clusters. In this case the elbow is gentle, but 4 clusters appears to be a good number to try.

Figure 49.1: k-means clustering scree plot using Vehicle.

Let's fit the model with 4 clusters using the kmeans function in the stats package, plotting the results in Figure 49.2:


> fit <- kmeans(x, 4, algorithm = "Hartigan-Wong")
> plot(x, col = fit$cluster)

Figure 49.2: k-means pairwise plot of clusters using Vehicle.

We can also narrow down our visualization to a single pair. To illustrate this let's look at the relationship between Comp (column 1 in x) and ScatRa (column 7 in x). The plot is shown in Figure 49.3:

> plot(x[c(1,7)], col = fit$cluster)
> points(fit$centers[, c(1,7)], col = 1:4, pch = 8, cex = 2)


Figure 49.3: k-means pairwise cluster plot of Comp and ScatRa.


Technique 50

Clara Algorithm

For our analysis we will use the clara function in the cluster package:

clara(x, k, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(cluster)
> require(fpc)
> data(thyroid, package="mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data. The fpc package will be used to help select the optimum number of clusters.

Step 2: Prepare Data & Tweak Parameters
The thyroid data is stored in x. We also drop the class labels stored in the variable Diagnosis. Finally, we use the daisy method to create a dissimilarity matrix:
> set.seed(1432)
> x <- thyroid
> x$Diagnosis <- NULL
> dissim <- daisy(x)


Step 3: Estimate and Assess the Model
To use the Clara algorithm you have to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 1 and 6. In this case we could call the pamk method, setting krange to lie between 1 and 6 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(dissim, krange = 1:6, criterion = "asw", critout = TRUE, usepam = FALSE)

1 clusters 0
2 clusters 0.5172889
3 clusters 0.4031142
4 clusters 0.4325106
5 clusters 0.4845567
6 clusters 0.4251244

The optimal number of clusters is the solution with the largest average silhouette width. So in this example 2 clusters has the largest width.
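As a quick check (a sketch, not part of the original output) we can refit with two clusters and read the average silhouette width directly from the clara object:

> fit2 <- clara(x, k = 2)
> fit2$silinfo$avg.width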

However, for illustration we fit the model with 4 clusters and use the plot method to visualize the result:

> fit <- clara(x, k = 4)
> par(mfrow = c(1, 2))
> plot(fit)

Figure 50.1 shows the resultant plots.


Figure 50.1: Plot using the clara algorithm with k = 4 and data set thyroid.


Technique 51

PAM Algorithm

For our analysis we will use the pam function in the cluster package:

pam(x, k, ...)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(cluster)
> require(fpc)
> data(wine, package="ordinal")

We use the wine data frame contained in the ordinal package for our analysis; see page 95 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
Since the wine data set contains ordinal variables, we use the daisy method with metric set to "gower" to create a dissimilarity matrix. The gower metric can handle nominal, ordinal and binary data108. The wine sample is stored in data. We also drop the wine rating column and then pass data to the daisy method, storing the result in x:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")


Step 3: Estimate and Assess the Model
The PAM algorithm requires you to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number. Let us suppose we expect the number of clusters to be between 1 and 5. In this case we could call the pamk method, setting krange to lie between 1 and 5 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(x, krange = 1:5, criterion = "asw", critout = TRUE, usepam = TRUE)

1 clusters 0
2 clusters 0.2061618
3 clusters 0.2546821
4 clusters 0.4053708
5 clusters 0.3318729

The optimal number of clusters is the solution with the largest average silhouette width. So in this example the largest width occurs at 4 clusters.

Now we fit the model with 4 clusters and use the clusplot method to visualize the result. Figure 51.1 shows the resultant plot:
> fit <- pam(x, 4)
> clusplot(fit)


Figure 51.1: Partitioning around medoids using pam with k = 4 for wine.


Technique 52

Kernel Weighted K-Means

The kernel weighted version of the k-means algorithm projects the sample data into a non-linear feature space by use of a kernel (see Part II). It has the benefit that it can identify clusters which are not linearly separable in the input space. For our analysis we will use the kkmeans function in the kernlab package:

kkmeans(x, centers, ...)

Key parameters include x, the matrix of data to be clustered, and centers, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(kernlab)
> data(Vehicle, package="mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We drop the vehicle class column and store the result in x:
> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model
We estimate the model using four clusters (centers = 4), storing the result in fit. The plot method is then used to visualize the result; see Figure 52.1:

> fit <- kkmeans(as.matrix(x), centers = 4)
> plot(x, col = fit)
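By default kkmeans uses a Gaussian (radial basis) kernel and estimates its width automatically. If you want to control the kernel yourself you can pass kernel and kpar explicitly; a sketch with an illustrative (not tuned) value of sigma:

> fit_rbf <- kkmeans(as.matrix(x), centers = 4,
                     kernel = "rbfdot", kpar = list(sigma = 0.05))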

Let's focus in on the pairwise clusters associated with Comp (column 1 in x) and ScatRa (column 7 in x). We can visualize this relationship using the plot method. Figure 52.2 shows the resultant plots:

> plot(x[c(1,7)], col = fit)
> points(centers(fit)[, c(1,7)], col = 1:4, pch = 8, cex = 2)


Figure 52.1: Kernel Weighted K-Means with centers = 4 using Vehicle.


Figure 52.2: Kernel Weighted K-Means pairwise clusters of Comp and ScatRa using Vehicle.


Hierarchy Based Methods


Technique 53

Hierarchical AgglomerativeCluster Analysis

Hierarchical Cluster Analysis is available with the basic installation of R. It is obtained using the stats package with the hclust function:

hclust(d, method, ...)

Key parameters include d, the dissimilarity matrix, and method, the agglomeration method to be used.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> library(colorspace)
> library(dendextend)
> require(qgraph)
> require(cluster)
> data(thyroid, package="mclust")

We will use the thyroid data frame contained in the mclust package in our analysis. For color output we use the colorspace package. The dendextend and qgraph packages will help us better visualize our results. We will also use the bannerplot method from the cluster package.


NOTE

The thyroid data frame109 was constructed from five laboratory tests administered to a sample of 215 patients. The tests were used to predict whether a patient's thyroid function could be classified as euthyroidism (normal), hypothyroidism (underactive thyroid) or hyperthyroidism. The thyroid data frame contains the following variables:

• Class variable:

  – Diagnosis: diagnosis of thyroid operation: Hypo, Normal and Hyper. Distribution of Diagnosis (number of instances per class):

    * Class 1 (normal): 150
    * Class 2 (hyper): 35
    * Class 3 (hypo): 30

• Continuous attributes:

  – RT3U: T3-resin uptake test (percentage).
  – T4: Total Serum thyroxin as measured by the isotopic displacement method.
  – T3: Total serum triiodothyronine as measured by radioimmuno assay.
  – TSH: Basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
  – DTSH: Maximal absolute difference of TSH value after injection of 200 micro grams of thyrotropin-releasing hormone as compared to the basal value.

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:


> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]
> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
Our first step is to create a dissimilarity matrix. We use the manhattan method, storing the results in dthyroid:

> dthyroid <- dist(thyroid2, method = "manhattan")

Now we are ready to fit our basic model. I have had some success with Ward's method, so let's try it first:
> fit <- hclust(dthyroid, method = "ward.D2")

NOTE

I have had good practical success using Ward's method. However, it is interesting to note that in the academic literature there are two different algorithms associated with the method. The method we selected (ward.D2) squares the dissimilarities before clustering. An alternative ward method, ward.D, does not110.

The print method provides a useful overview of the fitted model:

> print(fit)

Call:
hclust(d = dthyroid, method = "ward.D2")

Cluster method   : ward.D2
Distance         : manhattan
Number of objects: 215

Now we create a bannerplot of the fitted model:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 53.1.


Figure 53.1: Hierarchical Agglomerative Cluster Analysis bannerplot using thyroid and Ward's method.

OK, we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

We set k=3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

Next we match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is required in our next step: reduce the size of the labels to 75% of their original size. Then we create the actual plot shown in Figure 53.2:
> dend <- set(dend, "labels_cex", 0.75)
> par(mar = c(3,3,3,7))
> plot(dend,
       main = "Clustered Thyroid data where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = .007))
> legend("topleft", legend = diagnosis, fill = rainbow_hcl(3))

Overall the data seems to have been well separated. However, there is some mixing between normal and hypo.


Figure 53.2: Hierarchical Agglomerative Cluster Analysis dendrogram using thyroid.

Since our initial analysis used Ward's method, we may as well take a look at the other methods. We choose eight of the most popular methods and investigate their correlation. We begin by capturing the methods in hclust_methods and creating a list in hclust_list:
> hclust_methods <- c("ward.D", "ward.D2", "single", "complete", "average",
                      "mcquitty", "median", "centroid")
> hclust_list <- dendlist()


Here is the main loop. Notice we fit the models using temp_fit with the dissimilarity matrix dthyroid:
> for(i in seq_along(hclust_methods)) {
    print(hclust_methods[i])
    temp_fit <- hclust(dthyroid, method = hclust_methods[i])
    hclust_list <- dendlist(hclust_list, as.dendrogram(temp_fit))
  }

Let's add names to the list:
> names(hclust_list) <- hclust_methods

Next we take a look at the output. Notice the wide variation in height produced by the different methods:
> hclust_list
$ward.D
'dendrogram' with 2 branches and 215 members total, at height 1233.947

$ward.D2
'dendrogram' with 2 branches and 215 members total, at height 337.8454

$single
'dendrogram' with 2 branches and 215 members total, at height 33.5

$complete
'dendrogram' with 2 branches and 215 members total, at height 160.1

$average
'dendrogram' with 2 branches and 215 members total, at height 84.66573

$mcquitty
'dendrogram' with 2 branches and 215 members total, at height 103.8406

$median
'dendrogram' with 2 branches and 215 members total, at height 41.14692

$centroid
'dendrogram' with 2 branches and 215 members total, at height 63.73702

Now we investigate the correlation between methods. Note that method="common" measures the commonality between members of nodes:
> cor <- cor.dendlist(hclust_list, method = "common")
> par(mfrow = c(2, 2))
> qgraph(cor, minimum = 0.70, title = "Correlation = 0.70")
> qgraph(cor, minimum = 0.75, title = "Correlation = 0.75")
> qgraph(cor, minimum = 0.8, title = "Correlation = 0.80")
> qgraph(cor, minimum = 0.85, title = "Correlation = 0.85")

Figure 53.3 shows the resultant plot for varying levels of correlation. It gives us another perspective on the clustering algorithms. We can see that most methods have around 75% commonality within nodes with one another.


Figure 53.3: Hierarchical Agglomerative Cluster Analysis correlation between methods using thyroid.

Since ward.D and ward.D2 have high commonality, let's look in detail using a tanglegram visualization. The "which" parameter allows us to pick the elements in the list to compare. Figure 53.4 shows the result:
> hclust_list %>% dendlist(which = c(1,2)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

We can use a similar approach for average and mcquitty; see Figure 53.5:
> hclust_list %>% dendlist(which = c(5,6)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

Figure 53.4: Tanglegram visualization between ward.D and ward.D2 using thyroid.


Figure 53.5: Tanglegram visualization between average and mcquitty using thyroid.

Finally we plot the dendrograms for all of the methods. This is shown in Figure 53.6:
> par(mfrow = c(4, 2))
> for(i in seq_along(hclust_methods)) {
    fit  <- hclust(dthyroid, method = hclust_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
      "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3,3,3,7))
    plot(dend, main = hclust_methods[i],
         horiz = TRUE, nodePar = list(cex = .007))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }


Figure 53.6: Hierarchical Agglomerative Cluster Analysis for all methods using thyroid.


Technique 54

Agglomerative Nesting

Agglomerative nesting is available in the cluster package using the function agnes:

agnes(x, metric, method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities, whilst method defines the clustering method to be used.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(corrplot)
> data(thyroid, package="mclust")

The package cluster contains the agnes function. For color output we use the colorspace package, and dendextend to create fancy dendrograms. The corrplot package is used to visualize correlations. Finally, we use the thyroid data frame contained in the mclust package for our analysis.

We also make use of two user defined functions. The first we call panel.hist, which we will use to plot histograms. It is defined as follows:
> panel.hist <- function(x, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, density = 50, border = "blue")
  }

The second function will be used for calculating the Pearson correlation coefficient. Here is the R code:
> panel.cor <- function(x, y, digits = 3, prefix = "", cex.cor, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y, method = "pearson")
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor)
  }

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]
> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Let's spend a few moments investigating the data set. We will use visualization to assist us. We begin by looking at the pairwise relationships


between the attributes, as shown in Figure 54.1. This plot was built using the pairs method as follows:
> par(oma = c(4, 1, 1, 1))
> pairs(thyroid2, col = diagnosis_col,
        upper.panel = panel.cor, cex.labels = 2, pch = 19, cex = 1.2,
        panel = panel.smooth, diag.panel = panel.hist)
> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)
> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")
> legend("bottom", cex = 1, horiz = TRUE, inset = c(10), bty = "n",
         xpd = TRUE, legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))
> par(xpd = NA)

The top panels (above the diagonal) give the correlation coefficient between the attributes. We see that RT3U is negatively correlated with T4 and T3. It is moderately correlated with TSH and DTSH. We also see that T3 and T4 are negatively correlated with TSH and DTSH, whilst TSH and DTSH are positively correlated.

The diagonal in Figure 54.1 shows the distribution of each attribute, and the bottom panels are colored by diagnosis using diagnosis_col. Notice that Hypo, Normal and Hyper appear to be distinctly different from each other (as measured by the majority of attributes). However, the three diagnosis types cannot be easily separated if measured by RT3U and DTSH alone.


Figure 54.1: Pairwise relationships, distributions and correlations for attributes in thyroid.

The same conclusion, that the diagnosis types are distinct, can be made by looking at the parallel coordinates plot of the data shown in Figure 54.2. It can be calculated as follows:
> par(oma = c(4, 1, 1, 1))
> MASS::parcoord(thyroid2, col = diagnosis_col, var.label = TRUE, lwd = 2)


> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)
> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")
> legend("bottom", cex = 1.25, horiz = TRUE, inset = c(10), bty = "n",
         xpd = TRUE, legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))
> par(xpd = NA)


Figure 54.2: Parallel coordinates plot using thyroid.

Step 3: Estimate and Assess the Model
The agnes function offers a number of methods for agglomerative nesting. These include "average", "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method) and "weighted". The default is "average"; however, I have had some success with Ward's method in the past, so I usually give it a try first. I also set the parameter stand = TRUE to standardize the attributes:

> fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean", method = "ward")

The print method provides an overview of fit (we only show the first few lines of output below):
> print(fit)
Call: agnes(x = thyroid2, metric = "euclidean", stand = TRUE, method = "ward")
Agglomerative coefficient: 0.9819448

The agglomerative coefficient measures the amount of clustering structure; higher values indicate more structure. It can also be viewed as the average width of the banner plot shown in Figure 54.3. A banner plot is calculated using the bannerplot method:
> bannerplot(fit)
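Because the agglomerative coefficient is returned for any agglomeration method, a short sketch (not part of the original analysis) that compares it across several methods at once:

> m <- c("average", "single", "complete", "ward", "weighted")
> sapply(m, function(meth)
    agnes(thyroid2, stand = TRUE, metric = "euclidean", method = meth)$ac)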


Figure 54.3: Agglomerative nesting bannerplot using thyroid.

Now we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram in dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k=3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

Then we match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 75% of their original size. This assists in making the eventual plot look less cluttered:
> dend <- set(dend, "labels_cex", 0.75)

And now to the visualization using the plot method:
> par(mar = c(3,3,3,7))
> plot(dend,
       main = "Clustered Thyroid data where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = .007))
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))

The result is shown in Figure 54.4. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.


Figure 54.4: Agglomerative Nesting dendrogram using thyroid. The label color corresponds to the true diagnosis.

Since the agnes function provides multiple methods, in the spirit of empirical predictive analytics let's look at the correlation between the five most popular methods. We assign these methods to agnes_methods:
> agnes_methods <- c("ward", "average", "single", "complete", "weighted")
> agnes_list <- dendlist()

Our main loop for the calculations is as follows:


> for(i in seq_along(agnes_methods)) {
    temp_fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean",
                      method = agnes_methods[i])
    agnes_list <- dendlist(agnes_list, as.dendrogram(temp_fit))
  }

We need to add the names of the methods as a final touch:
> names(agnes_list) <- agnes_methods

Now let's take a look at the output:
> agnes_list
$ward
'dendrogram' with 2 branches and 215 members total, at height 47.01778

$average
'dendrogram' with 2 branches and 215 members total, at height 11.97675

$single
'dendrogram' with 2 branches and 215 members total, at height 5.450811

$complete
'dendrogram' with 2 branches and 215 members total, at height 24.03689

$weighted
'dendrogram' with 2 branches and 215 members total, at height 16.12223


Notice that all methods produce two branches; however, there is considerable variation in the height of the dendrogram, ranging from 5.4 for "single" to 47 for "ward".

We use corrplot with method="common" to visualize the correlation between the methods:
> corrplot(cor.dendlist(agnes_list, method = "common"),
           "pie", type = "lower")

The resultant plot, shown in Figure 54.5, gives us another perspective on our clustering algorithms. We can see that most of the methods have around 75% of nodes in common with one another.

Figure 54.5: Agglomerative Nesting correlation plot using thyroid.


Finally, let's plot all five dendrograms. Here is how:
> par(mfrow = c(3, 2))
> for(i in seq_along(agnes_methods)) {
    fit  <- agnes(thyroid2, stand = TRUE,
                  metric = "euclidean", method = agnes_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
      "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3,3,3,7))
    plot(dend, main = agnes_methods[i],
         horiz = TRUE, nodePar = list(cex = .007))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }


Wow! That is a nice bit of typing, but the result is well worth it. Take a look at Figure 54.6. (It still appears as if Ward's method works the best.)

Figure 54.6: Agglomerative nesting dendrogram for various methods using thyroid.


Technique 55

Divisive Hierarchical Clustering

Divisive hierarchical clustering is available in the cluster package using the function diana:

diana(x, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(circlize)
> data(thyroid, package="mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data. For color output we use the colorspace package, and dendextend to create fancy dendrograms. The circlize package is used to visualize a circular dendrogram.

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:
> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[,1]


> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
The model can be fitted as follows:

> fit <- diana(thyroid2, stand = TRUE, metric = "euclidean")

The parameter stand = TRUE is used to standardize the attributes. The parameter metric = "euclidean" is used for calculating dissimilarities; a popular alternative is "manhattan".

Now we can use the bannerplot method:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 55.1.


Figure 55.1: Divisive hierarchical clustering bannerplot using thyroid.

We should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k=3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):


> dend <- color_branches(dend, k = 3)

We match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 70% of their original size. This assists in making the eventual plot look less cluttered. Then we circlize the dendrogram using the method circlize_dendrogram:
> dend <- set(dend, "labels_cex", 0.7)
> circlize_dendrogram(dend)
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3), bty = "n")

The result is shown in Figure 55.2. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.


Figure 55.2: Divisive hierarchical clustering dendrogram using thyroid. The label color corresponds to the true diagnosis.


Technique 56

Exemplar Based Agglomerative Clustering

Exemplar Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(apcluster)
> data(bodyfat, package="TH.data")

We use the bodyfat data frame contained in the TH.data package for our analysis; see page 62 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We store the bodyfat data set in x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
The model can be fitted as follows:


> fit <- aggExCluster(negDistMat(r=10), x)

Next we visualize the dendrogram using the plot method:
> plot(fit, showSamples=TRUE)

The result is shown in Figure 56.1.

Figure 56.1: Agglomerative Clustering dendrogram using bodyfat.

Let's look at the relationship between DEXfat and waistcirc for different cluster sizes. To do this we use a while loop; see Figure 56.2. There are two main branches with possibly two sub-branches each, suggesting 4 or 5 clusters overall.


> i=2
> par(mfrow = c(2, 2))
> while (i<=5) {
    plot(fit, x[c(2,3)], k=i, xlab="DEXfat", ylab="waistcirc",
         main=c("Number Clusters", i))
    i=i+1
  }

Figure 56.2: Exemplar Based Agglomerative Clustering - DEXfat and waistcirc for various cluster sizes.


Fuzzy Methods


Technique 57

Rousseeuw-Kaufmanrsquos FuzzyClustering Method

The method outlined by Rousseeuw and Kaufman111 for fuzzy clustering is available in the cluster package using the function fanny:

fanny(x, k, memb.exp, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; memb.exp, the membership exponent used in the fit criterion; and metric, the measure to be used for calculating dissimilarities between observations.

Step 1: Load Required Packages
We use the wine data frame contained in the ordinal package (see page 95). We also load the smacof package, which offers a number of approaches for multidimensional scaling:
> require(cluster)
> require(smacof)
> data(wine, package="ordinal")

Step 2: Prepare Data & Tweak Parameters
We store the wine data set in data and then remove the wine ratings by specifying data<-data[-2]. One of the advantages of Rousseeuw-Kaufman's fuzzy clustering method over other fuzzy clustering techniques is that it can handle dissimilarity data. Since wine contains ordinal variables we use the daisy method to create a dissimilarity matrix with metric set to the general dissimilarity coefficient of Gower112:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

Step 3: Estimate and Assess the Model
We need to specify the number of clusters. Let's begin with 3 clusters (k=3), setting metric = "gower". We also set the parameter memb.exp = 1.1; the higher the value of memb.exp, the more fuzzy the clustering in general:

> fit <- fanny(x, k=3, memb.exp = 1.1, metric = "gower")

The cluster package provides silhouette plots via the plot method. These can help determine the viability of the clusters. Figure 57.1 presents the silhouette plot for fit. We see one large cluster containing 36 observations with a width of 0.25, and two smaller clusters, each with 18 observations and an average width of 0.46:
> plot(fit)
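The same numbers can be read directly from the fitted object; a minimal sketch:

> fit$silinfo$clus.avg.widths   # average silhouette width per cluster
> fit$silinfo$avg.width         # overall average silhouette width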


Figure 57.1: Rousseeuw-Kaufman's fuzzy clustering method silhouette plot using wine.

It can be instructive to investigate the relationship between the fuzziness parameter memb.exp and the average silhouette width. To do this we use a small while loop and call fanny with values of memb.exp ranging from 1.1 to 1.8:
> fuzzy = 1.1
> while (fuzzy < 1.9) {
    temp_fit <- fanny(x, k=3, memb.exp = fuzzy, metric = "euclidean")
    cat("Fuzz =", fuzzy, "Average Silhouette Width =",
        round(temp_fit$silinfo[[3]], 1), "\n")
    fuzzy = fuzzy + 0.1
  }

The resultant output is shown below. Notice how the average silhouette width declines as the fuzziness increases:
Fuzz = 1.1 Average Silhouette Width = 0.4
Fuzz = 1.2 Average Silhouette Width = 0.3
Fuzz = 1.3 Average Silhouette Width = 0.3
Fuzz = 1.4 Average Silhouette Width = 0.3
Fuzz = 1.5 Average Silhouette Width = 0.3
Fuzz = 1.6 Average Silhouette Width = 0.3
Fuzz = 1.7 Average Silhouette Width = 0.3
Fuzz = 1.8 Average Silhouette Width = 0.2

We still need to determine the optimal number of clusters. Let's try multidimensional scaling to investigate further. We use the smacofSym method from the package smacof and store the result in scaleT:
> scaleT <- smacofSym(x)$conf
> plot(scaleT, type = "n", main = "Fuzzy = 1.1 with k = 3")
> text(scaleT, label = rownames(scaleT), col = rgb(fit$membership))


Figure 57.2: Use the smacofSym method to help identify clusters using wine.

The result is visualized using the plot method and shown in Figure 57.2. The image appears to identify four clusters. Even though five official ratings were used to assess the quality of the wine, we will refit using k=4:
> fit <- fanny(x, k=4, memb.exp = 1.1, metric = "gower")

Finally we create a silhouette plot:
> plot(fit)

Figure 57.3 shows the resultant plot. A rule of thumb113 is to choose as the number of clusters the solution whose silhouette plot has the largest average width. Here we see the average silhouette width is now 0.46, compared to 0.36 for k=3.


Figure 57.3: Rousseeuw-Kaufman's method silhouette plot with four clusters using wine.


Technique 58

Fuzzy K-Means

Fuzzy K-means is available in the fclust package using the function FKM:

FKM(x, k=3, m=1.5, RS=1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; m, the fuzziness parameter; and RS, the number of random starts.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package="mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:


> fit3 <- FKM(x, k=3, m=1.5, RS=1, stand=1)

We set the fuzzy parameter m to 1.5 and, for illustrative purposes only, set RS = 1. In practice you will want to set a higher number. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:
> fit4 <- FKM(x, k=4, m=1.5, RS=1, stand=1)
> fit5 <- FKM(x, k=5, m=1.5, RS=1, stand=1)
> fit6 <- FKM(x, k=6, m=1.5, RS=1, stand=1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE and XB, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.76 0.43 0.64 0.44 0.52 0.65

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.66 0.64 0.54 0.37 0.47 0.99

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.59 0.78 0.49 0.32 0.41 1.02

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.52 0.93 0.42 0.26 0.35 1.97
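To make the comparison easier to scan, a small sketch (not in the original text) that binds the four index vectors into one table; reading down each column, the preferred k sits at the maximum (minimum for PE and XB):

> idx <- round(rbind("k=3" = f3, "k=4" = f4, "k=5" = f5, "k=6" = f6), 2)
> idx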


Overall, k = 3 is the optimum choice for the majority of methods.

It is fun to visualize clusters. Let's focus on fit3 and on the two important variables Comp and ScatRa. It can be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2=c(1,7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca=TRUE in the second plot to use the principal components as the x and y axes:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2=c(1,7))
> plot(fit3, v1v2=c(1,7), pca=TRUE)

Figure 58.1 shows both plots. There appears to be reasonable separation between the clusters.


Figure 58.1: Fuzzy K-Means clusters with k=3 using Vehicle.


Technique 59

Fuzzy K-Medoids

Fuzzy K-medoids is available in the fclust package using the function FKM.med:

FKM.med(x, k=3, m=1.5, RS=1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; m, the fuzziness parameter; and RS, the number of random starts. Note that the difference between fuzzy K-means and fuzzy K-medoids is that in fuzzy K-means the cluster prototypes (centroids) are artificial points computed from the data, whereas in fuzzy K-medoids the cluster prototypes (medoids) are a subset of the actual observed objects.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package="mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM.med(x, k=3, m=1.5, RS=10, stand=1)

We set the fuzzy parameter m to 1.5 and RS = 10. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:
> fit4 <- FKM.med(x, k=4, m=1.5, RS=10, stand=1)
> fit5 <- FKM.med(x, k=5, m=1.5, RS=10, stand=1)
> fit6 <- FKM.med(x, k=6, m=1.5, RS=10, stand=1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE and XB, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the indices for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.37 1.04 0.06 0.13 0.15 6.96

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.29 1.30 0.06 0.09 0.11 6.05

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.24 1.51 0.05 0.07 0.11 8.63

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.21 1.68 0.05 0.07 0.08 7.90

Overall, k = 3 is the optimum choice for the majority of methods.

Now we use fit3 and focus on the two important variables Comp and ScatRa. It can also be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2=c(1,7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca=TRUE in the second plot to use the principal components as the x and y axes:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2=c(1,7))
> plot(fit3, v1v2=c(1,7), pca=TRUE)

Figure 59.1 shows both plots.


Figure 59.1: Fuzzy K-Medoids clusters with k=3 using Vehicle.


Other Methods


Technique 60

Density-Based Cluster Analysis

In density based cluster analysis two parameters are important: eps defines the radius of the neighborhood of each point, and MinPts is the minimum number of points within a radius (or "reach") associated with the clusters of interest. For our analysis we will use the dbscan function in the fpc package:

dbscan(data, eps, MinPts)

Key parameters include data, the data matrix, data frame or dissimilarity matrix, together with eps and MinPts as described above.
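One common heuristic for choosing eps is to plot each point's distance to its MinPts-th nearest neighbour in increasing order and look for a knee. A base-R sketch (dat is a placeholder for your numeric attribute matrix; MinPts = 4 is illustrative):

> k <- 4                                             # candidate MinPts
> d <- as.matrix(dist(dat))                          # 'dat' is a placeholder data matrix
> kdist <- apply(d, 1, function(r) sort(r)[k + 1])   # distance to the k-th nearest neighbour
> plot(sort(kdist), type = "l",
       xlab = "points (sorted)", ylab = "distance to 4th nearest neighbour")

The eps value near the knee of this curve is a reasonable starting point for dbscan.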

Step 1: Load Required Packages
First we load the required packages:
> require(fpc)
> data(thyroid, package="mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data.

Step 2: Estimate and Assess the Model
We estimate the model using the dbscan method. Note that in practice MinPts is chosen by trial and error. We set MinPts = 4 and eps = 8.5. We also set showplot = 1, which will allow you to see visually how dbscan works:

> fit <- dbscan(thyroid[-1], eps=8.5, MinPts=4, showplot=1)


Next we take a peek at the fitted model:
> fit
dbscan Pts=215 MinPts=4 eps=8.5
        0   1 2 3
border 18   3 3 2
seed    0 185 1 3
total  18 188 4 5

Unlike other approaches to cluster analysis, the density based cluster algorithm can identify outliers (data points that do not belong to any cluster). Points in cluster 0 are unassigned outliers - 18 in total.
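If you want to inspect the unassigned points, the cluster labels are stored in the fitted object; a minimal sketch:

> outliers <- thyroid[-1][fit$cluster == 0, ]
> nrow(outliers)    # 18 outliers in this example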

The pairwise plot of clusters shown in Figure 60.1 is calculated as follows:
> plot(fit, thyroid)


Figure 60.1: Density-Based Cluster Analysis pairwise cluster plot using thyroid.

Let's take a closer look at the pairwise cluster plot between the attributes T4 (column 3 in thyroid) and RT3U (column 2 in thyroid). To do this we use the plot method; see Figure 60.2:
> plot(fit, thyroid[c(3,2)])


Figure 60.2: Density-Based Cluster Analysis pairwise plot of T4 and RT3U.


Technique 61

K-Modes Clustering

K-modes clustering is useful for finding clusters in categorical data. It is available in the klaR package using the function kmodes:

kmodes(x, k)

Key parameters include x, the data set of categorical attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages
We use the housing data frame contained in the MASS package. We also load the plyr package, which we will use to map attributes in housing to numerical categories:
> require(klaR)
> data(housing, package="MASS")
> library(plyr)


NOTE

The housing data frame contains data from an investigation of satisfaction with housing conditions in Copenhagen carried out by the Danish Building Research Institute and the Danish Institute of Mental Health Research114. It contains 72 rows with measurements on the following 5 variables:

• Sat: Satisfaction of householders with their present housing circumstances (High, Medium or Low, ordered factor).

• Infl: Perceived degree of influence householders have on the management of the property (High, Medium, Low).

• Type: Type of rental accommodation (Tower, Atrium, Apartment, Terrace).

• Cont: Contact residents are afforded with other residents (Low, High).

• Freq: The numbers of residents in each class.

Step 2: Prepare Data & Tweak Parameters
We store the housing data set in x. We then convert the attributes into numerical values using the mapvalues and as.numeric methods:
> set.seed(98765)
> x <- housing[-5]

> x$Sat <- mapvalues(x$Sat,
    from = c("Low", "Medium", "High"), to = c(1, 2, 3))
> x$Sat <- as.numeric(x$Sat)

> x$Infl <- mapvalues(x$Infl,
    from = c("Low", "Medium", "High"), to = c(1, 2, 3))
> x$Infl <- as.numeric(x$Infl)

> x$Type <- mapvalues(x$Type,
    from = c("Tower", "Apartment", "Atrium", "Terrace"), to = c(1, 2, 3, 4))
> x$Type <- as.numeric(x$Type)

> x$Cont <- mapvalues(x$Cont,
    from = c("Low", "High"), to = c(1, 3))
> x$Cont <- as.numeric(x$Cont)

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters is 3. We estimate the model and plot the result (see Figure 61.1) as follows:

> fit <- kmodes(x, 3)
> plot(x, col = fit$cluster)


Figure 61.1: K-Modes Clustering with k = 3 using housing.

It is also possible to narrow down our focus. For example, let's investigate the pairwise relationship between Type and Sat. To do this we use the plot method with the jitter method, which adds a small amount of noise to the attributes. The result is shown in Figure 61.2:
> plot(jitter(as.matrix(x[c(1,3)])), col = fit$cluster)
> points(fit$modes, col = 1:5, pch = 8, cex = 4)


Figure 61.2: K-Modes Clustering pairwise relationship between Type and Sat.


Technique 62

Model-Based Clustering

Model based clustering identifies clusters using the Bayesian information criterion (BIC) for Expectation-Maximization initialized by hierarchical clustering for Gaussian mixture models. It is available in the mclust package using the function Mclust:

Mclust(x, G)

Key parameters include x, the data set of attributes for which you wish to find clusters, and G, the minimum and maximum expected number of clusters.

Step 1: Load Required Packages
We use the thyroid data frame (see page 385):
> require(mclust)
> data(thyroid)

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in x and remove the first column, which contains the Diagnosis values:
> set.seed(98765)
> x = thyroid[-1]

Step 3: Estimate and Assess the Model
Suppose we believe the number of clusters lies between 1 and 20. We can use the Mclust method to find the optimal number as follows:


> fit <- Mclust(as.matrix(x), G=1:20)

NOTE

Notice that for multivariate data the Mclust method computes the optimum Bayesian information criterion for each of the following Gaussian mixture models:

1. EII = spherical, equal volume
2. VII = spherical, unequal volume
3. EEI = diagonal, equal volume and shape
4. VEI = diagonal, varying volume, equal shape
5. EVI = diagonal, equal volume, varying shape
6. VVI = diagonal, varying volume and shape
7. EEE = ellipsoidal, equal volume, shape and orientation
8. EVE = ellipsoidal, equal volume and orientation
9. VEE = ellipsoidal, equal shape and orientation
10. VVE = ellipsoidal, equal orientation
11. EEV = ellipsoidal, equal volume and equal shape
12. VEV = ellipsoidal, equal shape
13. EVV = ellipsoidal, equal volume
14. VVV = ellipsoidal, varying volume, shape and orientation

Let's take a look at the result:
> fit
'Mclust' model object:
 best model: diagonal varying volume and shape (VVI) with 3 components

The optimal number of clusters using all 14 Gaussian mixture models appears to be 3. Here is another way to access the optimum number of clusters:
> k <- dim(fit$z)[2]
> k
[1] 3
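Because the true Diagnosis labels are available, we can also cross-tabulate them against the model-based classification (a sketch; note the cluster numbering is arbitrary):

> table(thyroid$Diagnosis, fit$classification)
> adjustedRandIndex(thyroid$Diagnosis, fit$classification)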

Since BIC values using 14 models are used for choosing the number of clusters, it can be instructive to plot the result by individual model. The plot method with what = "BIC" achieves this. The result is shown in Figure 62.1. Notice most models reach a peak around 3 clusters:
> plot(fit, what = "BIC")

Figure 62.1: Model-Based Clustering BIC by model for thyroid.


We can also visualize the fitted model by setting what = "uncertainty" (see Figure 62.2), what = "density" (see Figure 62.3) and what = "classification" (see Figure 62.4).

Figure 62.2: Model-Based Clustering plot with what = "uncertainty" using thyroid.


Figure 62.3: Model-Based Clustering plot with what = "density" using thyroid.


Figure 62.4: Model-Based Clustering plot with what = "classification" using thyroid.

Finally, let's look at the uncertainty and classification pairwise plots using T3 and T4; see Figure 62.5:
> par(mfrow = c(1, 2))
> plot(fit, x[c(1,2), c(1,3)], what = "uncertainty")
> plot(fit, x[c(1,2), c(1,3)], what = "classification")


Figure 62.5: Model based uncertainty and classification pairwise plots using T3 and T4.


Technique 63

Clustering of Binary Variables

Clustering of Binary Variables is available in the cluster package using the function mona:

mona(x)

Key parameters include x, the data set of binary attributes for which you wish to find clusters.

Step 1: Load Required Packages
We use the locust data frame from the package bild:
> require(cluster)
> data(locust, package="bild")


NOTE

The locust data frame contains data on the effect of hunger on the locomotory behavior of 24 locusts observed at 161 time points. It contains 3864 observations on the following attributes:

1. id: a numeric vector that identifies the number of the individual profile.

2. move: a numeric vector representing the response variable.

3. sex: a factor with levels 1 for male and 0 for female.

4. time: a numeric vector that identifies the number of the time points observed. The time vector considered was obtained dividing (1:161) by 120 (the number of observed periods in 1 hour).

5. feed: a factor with levels 0 (no) and 1 (yes).

Step 2: Prepare Data & Tweak Parameters
Because the locust data has a time dimension, our first task is to transform the data into a format that can be used by mona. First, let's create a variable x that will hold the binary attributes:
> n <- nrow(locust)
> x <- seq(1, 120)
> dim(x) <- c(24, 5)
> colnames(x) <- colnames(locust)

We create a binary data set where we assign 1 if a move took place by the locust within the first 20 time points, 0 otherwise. We do likewise with feeding. We begin by assigning zero to the move and feed attributes (i.e. x[i,2]<-0 and x[i,5]<-0). Note that columns 1 to 5 of locust contain id, move, sex, time and feed respectively:
count=0
k=1
for(i in seq_along(seq(1, 24))) {
  x[i,2] <- 0
  x[i,5] <- 0
  for (k in seq(1, 161)) {
    count = count + 1
    if (k==1) {
      x[i,1] <- locust[count, 1]
      x[i,3] <- locust[count, 3]
      x[i,4] <- locust[count, 4]
    }
    if (locust[count, 2]==1 && k<20) x[i,2] <- locust[count, 2]
    if (locust[count, 5]==1 && k<20) x[i,5] <- locust[count, 5]
    k <- k+1
  }
  k = 1
}

Finally, we remove the id and time attributes:
> x <- x[,-1]
> x <- x[,-3]

As a check, the contents of x should look similar to this:
> head(x, 6)
     move sex feed
[1,]    0   2    2
[2,]    1   1    2
[3,]    0   2    2
[4,]    1   1    2
[5,]    0   2    2
[6,]    0   1    2


Step 3: Estimate and Assess the Model
Here is how to estimate the model and print out the associated bannerplot; see Figure 63.1:
> fit <- mona(x)
> plot(fit)

Figure 63.1: Clustering of Binary Variables bannerplot using locust.


Technique 64

Affinity Propagation Clustering

Affinity propagation takes a given similarity matrix and simultaneously considers all data points as potential exemplars. Real-valued messages are then exchanged between data points until a high-quality set of exemplars and corresponding clusters emerges. It is available in the apcluster package using the function apcluster:

apcluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the Vehicle data frame from the package mlbench; see page 23 for additional details on this data set:
> require(apcluster)
> data(Vehicle, package="mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the sample in the variable x and remove the vehicle type (Class) attribute:
> set.seed(98765)
> x <- Vehicle[-19]


Step 3: Estimate and Assess the Model
We fit the model as follows. The argument r of negDistMat is the exponent applied when converting negative distances into similarities; here it is set equal to the number of columns in x:

> fit <- apcluster(negDistMat(r=18), x)

The model determines the optimal number of clusters. To see the actual number use:
> length(fit@clusters)
[1] 5

So for this data set the algorithm identifies five clusters. We can visualize the results using the plot method, as shown in Figure 64.1. Note we only plot 15 attributes:
> plot(fit, x[, 1:15])

Finally, here is how to zoom in on a specific pairwise relationship. We choose Circ and Sc.Var.Maxis, as shown in Figure 64.2:
> plot(fit, x[c(2, 11)])


Figure 64.1: Affinity Propagation pairwise cluster plot using Vehicle.


Figure 64.2: Affinity Propagation pairwise cluster plot between Circ and Sc.Var.Maxis.


Technique 65

Exemplar-Based Agglomerative Clustering

Exemplar-Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster:

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the bodyfat data frame from the package TH.data; see page 62 for additional details on this data set:
> require(apcluster)
> data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters
We store the sample data in the variable x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
We fit the model using a two-step approach. First we use affinity propagation to determine the number of clusters via the apcluster method. For additional details on affinity propagation and apcluster see page 455.


> fit1 <- apcluster(negDistMat(r=10), x)

Here are some of the details of the fitted model. It appears to have four clusters:
> fit1

APResult object

Number of samples     =  71
Number of iterations  =  152
Input preference      =  -6.482591e+14
Sum of similarities   =  -4.886508e+14
Sum of preferences    =  -2.593037e+15
Net similarity        =  -3.081687e+15
Number of clusters    =  4

Next we create a hierarchy of the four clusters using exemplar-based agglomerative clustering via the aggExCluster method:
> fit <- aggExCluster(x=fit1)

We can plot the resultant dendrogram as shown in Figure 65.1:
> plot(fit, showSamples=FALSE)


Figure 65.1: Exemplar-Based Agglomerative Clustering dendrogram using bodyfat.

Finally, we zoom in on the pairwise cluster plots between DEXfat and waistcirc; see page 462. To do this we use a small while loop as follows:
> i=2
> par(mfrow = c(2, 2))
> while (i<=5) {
    plot(fit, x[c(2,3)], k=i, xlab="DEXfat", ylab="waistcirc",
         main=c("Number Clusters", i))
    i=i+1
  }


Figure 65.2: Exemplar-Based Agglomerative Clustering relationship between DEXfat and waistcirc.


Technique 66

Bagged Clustering

During bagged clustering a partitioning cluster algorithm, such as k-means, is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using hierarchical agglomerative cluster analysis (see page 384). The approach is available in the e1071 package using the function bclust:

bclust(x, centers, base.centers, dist.method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; centers, the number of clusters; base.centers, the number of centers used in each repetition; and dist.method, the distance method used for hierarchical clustering.

Step 1: Load Required Packages
We use the thyroid data frame from the package mclust; see page 385 for additional details on this data set:
> require(e1071)
> data(thyroid, package="mclust")

Step 2: Prepare Data & Tweak Parameters
We store the standardized sample data in the variable x:
> x <- thyroid[-1]
> x <- scale(x)
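Conceptually, bclust with these settings does something like the following simplified sketch (an illustration of the idea only, not the package's actual implementation):

> B <- 50                                            # number of bootstrap repetitions
> centers <- NULL
> for (b in 1:B) {
    boot <- x[sample(nrow(x), replace = TRUE), ]     # bootstrap sample
    km   <- kmeans(boot, centers = 5)                # base k-means run
    centers <- rbind(centers, km$centers)            # pool the base centers
  }
> hc <- hclust(dist(centers, method = "manhattan"))  # combine the pooled centers hierarchically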


Step 3: Estimate and Assess the Model
We fit the model using the bclust method with 3 centers:

> fit <- bclust(x, centers=3, base.centers=5, dist.method="manhattan")

We can use the plot method to visualize the dendrogram. Figure 66.1 shows the result:
> plot(fit)


Figure 66.1: Bagged Clustering dendrogram (top) and scree-style plot (bottom). Gray line is slope of black line.

We can also view a box plot of fit as follows (see Figure 66.2).

> boxplot(fit)


Figure 66.2: Bagged Clustering boxplots

We can also view boxplots by attribute, as shown in Figure 66.3.

> boxplot(fit, bycluster = FALSE)


Figure 66.3: Bagged Clustering boxplots by attribute



Part VII

Boosting


The Basic Idea

Boosting is a powerful supervised classification learning concept. It combines the performance of many "weak" classifiers to produce a powerful committee. A weak classifier is only required to be better than chance. It can therefore be very simple and computationally inexpensive. The basic idea is to iteratively apply simple classifiers and to combine their solutions to obtain a better prediction result. If a classifier misclassifies some data, train another copy of it mainly on this misclassified part, in the hope that it will develop a better solution. Thus the algorithm increasingly focuses on new strategies for classifying difficult observations.

Here is how it works in a nutshell:

1. A boosting algorithm manipulates the underlying training data by iteratively re-weighting the observations so that at every iteration the classifier finds a new solution from the data.

2. Higher accuracy is achieved by increasing the importance of "difficult" observations, so that observations that were misclassified receive higher weights. This forces the algorithm to focus on those observations that are increasingly difficult to classify.

3. In the final step all previous results of the classifier are combined into a prediction committee, where the weights of better performing solutions are increased via an iteration-specific coefficient.

4. The resulting weighted majority vote selects the class most often chosen by the classifier while taking the error rate in each iteration into account.

The Power of the Boost

Boosting is one of the best techniques in the data scientist's toolkit. Why? Because it often yields the best predictive models and can often be relied upon to produce satisfactory results. Academic studies and applied studies have reported similar findings. Here we cite just two:


• Bauer and Kohavi115 performed an extensive comparison of boosting with several other competitors on 14 data sets. They found boosting outperformed all other algorithms. They concluded: "For learning tasks where comprehensibility is not crucial, voting methods are extremely useful, and we expect to see them used significantly more than they are today."

• Friedman, Hastie and Tibshirani116, using eight data sets, compare a range of boosting techniques with the very popular classification and regression tree. They find all the boosting methods outperform it.

NOTE

During boosting the target function is optimized with some implicit penalization whose impact varies with the number of boosting iterations. The lack of an explicit penalty term in the target function is the main difference between boosting and other popular penalization methods such as the Lasso117.


Practical Applications

NOTE

The first boosting technique was developed by Robert Schapire118, working out of the MIT Laboratory for Computer Science. His research article "The Strength of Weak Learnability" showed that a weak base classifier can always improve its performance by training two additional classifiers on filtered versions of the classification data stream. The author observes: "A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well." He was right; boosting is now used in a wide range of practical applications.

Reverberation Suppression

Reverberation suppression is a critical problem in sonar communications. This is because, as an acoustic signal is radiated, degradation occurs due to reflection from the surface, the bottom, and the volume of water. Cheepurupalli et al.119 study the propagation of sound in water.

One popular solution is to use the empirical mode decomposition (EMD) algorithm120 as a filtering technique. A noise-corrupted signal is applied to EMD and intrinsic mode functions are generated. The key is to separate the signal from the noise. It turns out that noisy intrinsic mode functions are high-frequency component signals, whilst signal-led intrinsic mode functions are low-frequency component signals. The selection of the appropriate intrinsic mode functions which are used for signal reconstruction is often done manually121.


The researchers use Ada Boost to automatically classify "noise" and "signal" intrinsic mode functions over signal-to-noise ratios ranging from −10 dB to 10 dB.

The results were very encouraging, as they found that combining Ada Boost with EMD increases the likelihood of correct detection. The researchers conclude "that the reconstruction of the chirp signal even at low input SNR [signal to noise] conditions is achieved with the use of Ada Boost based EMD as a de-noising technique."

Cardiotocography

Fetal state (normal, suspect, pathologic) is assessed by Karabulut and Ibrikci122 using Ada Boost combined with six individual machine learning algorithms (Naive Bayes, Radial Basis Function Network, Bayesian Network, Support Vector Machine, Neural Network and C4.5 Decision Tree).

The researchers use the publicly available cardiotocography data set123

which includes a total of 2126 samples, of which the fetal state is normal in 1655 cases, suspicious in 295 and pathologic in 176. The estimated Ada Boost confusion matrix is shown in Table 23. Note that the error rate from this table is around 5%.

                             Predicted values
                     Normal   Suspicious   Pathologic
Actual   Normal        1622           26            7
values   Suspicious      55          236            4
         Pathologic       7            7          162

Table 23: Confusion matrix of Karabulut and Ibrikci

Stock Market Trading

Creamer and Freund124 develop a multi-stock automated trading system that relies in part on model-based boosting. Their trading system consists of a logitboost algorithm, an expert weighting algorithm and a risk management overlay.

Their sample consisted of 100 stocks chosen at random from the S&P 500 index, using daily data over a four-year period. The researchers used 540 trading days for the training set, 100 trading days for the validation set and 411 trading days for the test set.

Four variants of their trading system are compared to a buy-and-hold strategy. This is a common investment strategy which involves buying a portfolio of stocks and holding them without selling for the duration of the investment period. Transaction costs are assumed to vary from $0 to $0.003. Creamer and Freund report all four variants of their trading system outperformed the buy-and-hold strategy.

Vehicle Logo Recognition

Sam et al.125 consider the problem of automatic vehicle logo recognition. The researchers select the modest adaboost algorithm. A total of 500 vehicle images were used in the training procedure. The detection of the vehicle logo was carried out by sliding a sub-window across the image at multiple scales and locations. A total of 200 images were used in the test set, with 184 images recognized successfully (see Table 24). This implies a misclassification error rate of around 8%.

Manufacturer   Correct   Mistaken
Audi                20          0
BMW                 16          4
Honda               19          1
KIA                 17          3
Mazda               18          2
Mitsubishi          20          0
Nissan              17          3
Suzuki              18          2
Toyota              19          1
Volkswagen          20          0

Total              184         16

Table 24: Logo recognition rate reported by Sam et al.


Basketball Player Detection

Markoski et al.126 investigate player detection during basketball games using the gentle adaboost algorithm. A total of 6000 positive examples that contain the player's entire body and 6000 positive examples of the player's upper body only were used to train the algorithm.

The researchers observe that for images that contain the player's whole body, the algorithm was unable to reduce the level of false positives below 50%. In other words, flipping a coin would have been more accurate. However, a set of testing images using the player's upper body obtained an accuracy of 70.5%. The researchers observe that the application of the gentle adaboost algorithm for detecting a player's upper body results in a relatively large number of false positive observations.

Magnetoencephalography

Magnetoencephalography (MEG) is used to study how stimulus features are processed in the human brain. Takiguchi et al.127 use an Ada Boost algorithm to find the sensor area contributing to the accurate discrimination of vowels.

Four volunteers with normal hearing were recruited for the experiment. Two distinct Japanese sounds were delivered to the subject's right ear. On hearing the sound, the volunteer was asked to press a reaction key. MEG amplitudes were measured from 61 pairs of sensors128.

The Ada Boost algorithm was applied to every latency range, with the classification decision made at each time instance. The researchers observe that the Ada Boost classification accuracy first increased as a function of time, reaching a maximum value of 91.0% in the latency range between 50 and 150 ms. These results outperformed their pre-stimulus baseline.


Binary Boosting

NOTE

A weak classifier h(x) is slightly better than random chance (50%) at guessing which class an object belongs to. A strong classifier has a high probability (>95%) of choosing correctly. Decision trees (see Part I) are often used as the basis for weak classifiers.

Classical binary boosting is founded on the discrete Ada Boost algorithm, in which a sequentially generated weighted set of weak base classifiers are combined to form an overall strong classifier. Ada Boost was the first adaptive boosting algorithm; it automatically adjusts its parameters to the data based on actual performance at each iteration.

How Ada Boost Works

Given a set of training feature vectors xi (i = 1, 2, ..., N) and a target which represents the binary class yi ∈ {−1, +1}, the algorithm attempts to find the optimal classification by making the individual error εm at iteration m as small as possible, given an iteration-specific distribution of weights on the features.

Incorrectly classified observations receive more weight in the next iteration. Correctly classified observations receive less weight in the next iteration. The weights at iteration m are calculated using the iteration-specific learning coefficient αm multiplied by a classification function η(hm(x)).

As the algorithm iterates it places more weight on misclassified objects and attempts to correctly classify them in the next iteration. In this way classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight. The procedure is repeated until a predefined performance requirement is satisfied.
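To make the reweighting mechanics concrete, here is a minimal, self-contained sketch of discrete Ada Boost using rpart stumps on a toy two-class problem. The data, the number of iterations M and the stump depth are illustrative assumptions; this is not the ada package's implementation, which is used in the techniques that follow.

# Minimal sketch of discrete Ada Boost with rpart stumps (toy data)
library(rpart)
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- ifelse(df$x1 + df$x2 > 0, 1, -1)              # toy two-class target in {-1, +1}

w <- rep(1 / nrow(df), nrow(df))                      # initial observation weights
M <- 20
alpha <- numeric(M); stumps <- vector("list", M)
for (m in 1:M) {
  stumps[[m]] <- rpart(factor(y) ~ x1 + x2, data = df, weights = w,
                       control = rpart.control(maxdepth = 1))
  h <- ifelse(predict(stumps[[m]], df, type = "class") == "1", 1, -1)
  eps <- sum(w * (h != df$y)) / sum(w)                # weighted error of the weak learner
  eps <- min(max(eps, 1e-10), 1 - 1e-10)              # guard against degenerate stumps
  alpha[m] <- 0.5 * log((1 - eps) / eps)              # iteration-specific coefficient
  w <- w * exp(-alpha[m] * df$y * h)                  # up-weight misclassified observations
  w <- w / sum(w)
}
# Final strong classifier: sign of the weighted vote of the weak learners
score <- rowSums(sapply(1:M, function(m) {
  alpha[m] * ifelse(predict(stumps[[m]], df, type = "class") == "1", 1, -1)
}))
mean(sign(score) == df$y)                             # training accuracy of the committee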


NOTE

Differences in the nature of classical boosting algorithms are often driven by the functional form of αm and η(hm(x)). For example:

1. Ada Boost.M1: αm = 0.5 log((1 − εm)/εm) with η(x) = sign(x).

2. Real Ada Boost and Real L2 set η(p) = log(p/(1 − p)), p ∈ [0, 1].

3. Gentle Ada Boost and Gentle L2 Boost: η(x) = x.

4. Discrete L2, Real L2 and Gentle L2 use a logistic loss function, as opposed to the exponential loss function used by Real Ada Boost, Ada Boost.M1 and Gentle Ada Boost.


Technique 67

Ada Boost.M1

The Ada Boost.M1 algorithm (also known as Discrete Ada Boost) uses discrete boosting with an exponential loss function129. In this algorithm η(x) = sign(x) and αm = 0.5 log((1 − εm)/εm). It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "discrete", ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; and data, the data set of attributes with which you wish to train the model.

NOTE

Ada Boost is often used with simple decision trees as weak base classifiers. This can result in significantly improved performance over a single decision tree130. The package ada creates a classification model as an ensemble of rpart trees. It therefore uses the rpart library as its engine (see page 4 and page 272).

Step 1: Load Required Packages

We build our Ada Boost.M1 model using Sonar, a data frame in the mlbench package.

> library(ada)
> library(mlbench)
> data(Sonar)


NOTE

Sonar contains 208 observations on sonar returns collected from a metal cylinder and a cylindrically shaped rock positioned on a sandy ocean floor131. Returns were collected at a range of 10 meters and obtained from the cylinder at aspect angles spanning 90° and from the rock at aspect angles spanning 180°. In total 61 variables were collected, all numerical with the exception of the sonar return class (111 cylinder returns and 97 rock returns), which is nominal. The label associated with each record is the letter R if the object is a rock and M if it is metal.

Step 2: Prepare Data & Tweak Parameters

To view the classification for the 95th to 105th record type:

> Sonar$Class[95:105]
 [1] R R R M M M M M M M M
Levels: M R

The 95th to 97th observations are labeled R for rock, whilst the 98th to 105th observations are labeled M for metal. The ada package uses the rpart function to generate decision trees, so we need to supply an rpart.control object to the model.

> default <- rpart.control(cp = -1, maxdepth = 2,
                           minsplit = 0)

PRACTITIONER TIP

The maxdepth parameter is raised to the power of 2. In the code that follows we set maxdepth = 2, which is equivalent to 2² = 4.

We use 157 of the 208 observations to train the model.

> set.seed(107)
> n = nrow(Sonar)
> indtrain <- sample(1:n, 157, FALSE)
> train <- data[indtrain, ]
> test <- data[-indtrain, ]


Since Sonar$Class is non-numeric, we change it to a numeric scale. This will be useful later when we plot the training and test data.

> z <- Sonar$Class
> z <- as.numeric(z)

Next we create a new data frame combining the numeric variable z with the original data in Sonar. This is followed by a little tidying up, removing Class (which contained the "M" and "R" labels) from the data frame.

> data <- cbind(z, Sonar)
> data$Class <- NULL

Step 3: Estimate Model

Now we are ready to run the Ada Boost.M1 model on the training sample. We choose to perform 50 boosting iterations by setting the parameter iter = 50.

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "discrete", control = default)

Take a look at the output

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential  Method: discrete  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032


Out-Of-Bag Error:  0.045  iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1
        44         44

Notice the output gives the training error, the confusion matrix and three estimates of the number of iterations. In this example you could use the Out-Of-Bag estimate of 45 iterations, or the training error or kappa error estimates of 44 iterations. As you can see, the model achieved low error rates. The training error is 3.2%. The Out-Of-Bag error rate is around 4.5%.

Step 4: Assess Model Performance

We use the addtest function to evaluate the testing set without having to refit the model.

> set.seed(107)

> output_train <- addtest(x = output_train, test.x = test[, -1], test.y = test[, 1])

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential  Method: discrete  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045  iteration= 45


Additional Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2
        44         44         32         32

Notice the estimated number of iterations for the error and kappa declined from 44 (for training) to 32 for the test sample.

It can be instructive to plot both training and test error results by iteration on one chart.

> plot(output_train, test = TRUE)

The resulting plot, shown in Figure 67.1, shows that the training error steadily decreases across iterations. This shows that boosting can learn the features in the data set. The testing error also declines somewhat across iterations, although not as dramatically as the training error.

It is sometimes helpful to examine the performance at a specific iteration. Here is what happened at iteration 25:

> summary(output_train, n.iter = 25)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential  Method: discrete  Iteration: 25

Training Results

Accuracy: 0.955  Kappa: 0.911

Testing Results

Accuracy: 0.824  Kappa: 0.654

The accuracy of the training and testing samples is above 0.80. To assess the overall performance enter:

> summary(output_train)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential  Method: discrete  Iteration: 50


Training Results

Accuracy: 0.968  Kappa: 0.936

Testing Results

Accuracy: 0.804  Kappa: 0.613

Notice that accuracy is in excess of 80% for both test and training data sets. Kappa declines from 0.94 in training to 0.61 for the test data set.

PRACTITIONER TIP

In very many circumstances you may find your training set unbalanced, in the sense that one class has very many more observations than the other. Ada Boost will tend to focus on learning the larger class, with resultant low errors. Of course, the low error is related to the focus on the larger class. The Kappa coefficient132 provides an alternative measure of classification performance which adjusts for class imbalances. As with the correlation coefficient, higher values indicate a stronger fit.
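The kappa statistic is easy to compute directly from a confusion matrix. The following sketch (the two-class table is made up and deliberately unbalanced) shows why a high raw accuracy can coexist with a weak kappa.

# Cohen's kappa from a confusion matrix (rows = observed, columns = predicted)
kappa_stat <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

tab <- matrix(c(90, 5,
                 8, 2), nrow = 2, byrow = TRUE)  # unbalanced two-class example
kappa_stat(tab)   # about 0.17, far less flattering than the 88% raw accuracy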

In Figure 67.2 the relative importance of each variable is plotted. This can be obtained by typing:

> varplot(output_train)


Figure 67.2: Variable importance plot for Ada Boost using the Sonar data set

The largest five individual scores can be printed out by entering:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  V10   V17   V48    V9   V13
0.078 0.078 0.076 0.072 0.071


Step 5: Make Predictions

Next we predict using the fitted model and the test sample. This can be easily achieved by:

> pred <- predict(output_train, test[, -1], type = "both")

A summary of the class predictions and the associated plot provides performance information.

> summary(pred$class)
 1  2
23 28

gt plot(pred$class)

The model classifies 23 observations into class 1 (recall "M" if the object is metal) and 28 observations into class 2 ("R" if the object is a rock). This is reflected in Figure 67.3.

A nice feature of this model is the ability to see the probabilities of an observation and the class assignment. For example, the first observation has associated probabilities and class assignment:

> pred$probs[[1, 1]]
[1] 0.5709708
> pred$probs[[1, 2]]
[1] 0.4290292
> pred$class[1]
[1] 1
Levels: 1 2

The probability of the observation belonging to class 1 is 57% and to class 2 around 43%, and therefore the predicted class is class 1. The second and third observations can be assessed in a similar fashion:

> pred$probs[[2, 1]]
[1] 0.07414183

> pred$probs[[2, 2]]
[1] 0.9258582

> pred$class[2]
[1] 2
Levels: 1 2


> pred$probs[[3, 1]]
[1] 0.7416495

> pred$probs[[3, 2]]
[1] 0.2583505

> pred$class[3]
[1] 1
Levels: 1 2

Notice that the second observation is assigned to class 2 with an associated probability of approximately 93%, and the third observation is assigned to class 1 with an associated probability of approximately 74%.


Figure 67.1: Error by iteration for train and test


Figure 67.3: Model predictions for class 1 ("M") and class 2 ("R")


Technique 68

Real Ada Boost

The Real Ada Boost algorithm uses real boosting with an exponential loss function and η(p) = log(p/(1 − p)), p ∈ [0, 1]. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "real", bag.frac = 1, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; and data, the data set of attributes with which you wish to train the model. To perform stochastic gradient boosting you set the bag.frac argument to less than 1 (the default is bag.frac = 0.5). Pure ε-boosting only happens if bag.frac = 1.

Steps 1-2 are outlined beginning on page 481

Steps 3 & 4: Estimate Model & Assess Performance

We choose to perform 50 boosting iterations by setting the parameter iter = 50.

NOTE

The main control to avoid over-fitting in boosting algorithms is the stopping iteration. A very high number of iterations may favor over-fitting. However, stopping the algorithm too early might lead to poorer prediction on new data. In practice over-fitting appears to be less of a risk than under-fitting, and there is considerable evidence that Ada Boost is quite resistant to over-fitting133.


> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "real", bag.frac = 1,
                      control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 1, control = default)

Loss: exponential  Method: real  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  0 75

Train Error: 0

Out-Of-Bag Error:  0  iteration= 6

Additional Estimates of number of iterations:

train.err1 train.kap1
        27         27

In this case the model perfectly fits the training data and the training error is zero.

PRACTITIONER TIP

Although the out-of-bag error rate is also zero, it is of little value here because there are no subsamples in pure ε-boosting (we set bag.frac = 1).

We next refit the training data using stochastic gradient boosting with bag.frac at its default value. The overall training error rate remains a modest 1.3%, with an Out-Of-Bag Error of 4.5%.


> set.seed(107)
> output_SB <- ada(z ~ ., data = train, iter = 50,
                   loss = "e", type = "real", bag.frac = 0.5,
                   control = default)

> output_SB
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 0.5, control = default)

Loss: exponential  Method: real  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  2 73

Train Error: 0.013

Out-Of-Bag Error:  0.045  iteration= 46

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

To assess variable importance we enter:

> scores <- varplot(output_train, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  V46   V47   V48   V35   V20
0.087 0.081 0.078 0.076 0.072

The top five variables differ somewhat from those obtained with Ada Boost.M1 (see page 487).


Step 5: Make Predictions

Now we fit the model to the test data set, make predictions and compare the results to the actual values observed in the sample using the commands table(pred) and table(test$z) (recall test$z contains the class values).

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "real", control = default)

> pred <- predict(output_test, newdata = test)

> table(pred)
pred
 1  2
29 22

> table(test$z)

 1  2
29 22

The model fits the observations perfectly


Technique 69

Gentle Ada Boost

The Gentle Ada Boost algorithm uses gentle boosting with an exponential loss function and η(x) = x. It can be run using the package ada with the ada function:

ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle", nu = 0.01, control = default, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; data, the data set of attributes with which you wish to train the model; and nu, the shrinkage parameter for boosting. Steps 1 and 2 are outlined beginning on page 481.

NOTE

The idea behind the shrinkage parameter nu is to slow down learning and reduce the likelihood of over-fitting. Smaller learning rates (such as nu < 0.1) often produce dramatic improvements in a model's generalization ability over gradient boosting without shrinkage (nu = 1)134. However, small values of nu increase computational time by increasing the number of iterations required to learn.
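If you want to see the effect of the shrinkage parameter for yourself, a quick experiment is to refit the model over a few values of nu and compare the test error. This sketch assumes the Sonar train/test split and the rpart.control object default created in Steps 1 and 2; the loop and variable names are illustrative only.

# Compare test error across a few shrinkage values (illustrative sketch)
for (nu in c(1, 0.1, 0.01)) {
  fit_nu <- ada(z ~ ., data = train, iter = 50, loss = "e",
                type = "gentle", nu = nu, control = default)
  pred_nu <- predict(fit_nu, newdata = test[, -1])        # predicted classes
  cat("nu =", nu, " test error =",
      round(mean(pred_nu != test$z), 3), "\n")
}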

Step 3: Estimate Model

We choose to perform 50 boosting iterations by setting the parameter iter = 50.

> set.seed(107)


> output_train <- ada(z ~ ., data = train, iter = 50, loss = "e",
                      type = "gentle", nu = 0.01, control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential  Method: gentle  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 72  8
         2 13 64

Train Error: 0.134

Out-Of-Bag Error:  0.121  iteration= 28

Additional Estimates of number of iterations:

train.err1 train.kap1
         5         23

The training error is around 13%, with an Out-Of-Bag Error of 12%. Let's see if we can do better by using the parameter bag.shift = TRUE to estimate an ensemble shifted towards bagging.

> output_bag <- ada(z ~ ., data = train, iter = 50,
                    loss = "e", type = "gentle", nu = 0.01,
                    bag.shift = TRUE, control = default)

> output_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)


Loss: exponential  Method: gentle  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 70 10
         2 20 57

Train Error: 0.191

Out-Of-Bag Error:  0.178  iteration= 17

Additional Estimates of number of iterations:

train.err1 train.kap1
         9          9

The training error has risen to 19%. We continue our analysis using both models.

Step 4: Assess Model Performance

We fit the test data to both models. First our boosted model:

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "gentle", nu = 0.01,
                     control = default)

> output_test
Call:
ada(z ~ ., data = test, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential  Method: gentle  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 31  0
         2  3 17


Train Error: 0.059

Out-Of-Bag Error:  0.059  iteration= 10

Additional Estimates of number of iterations:

train.err1 train.kap1
         8          8

The training error rate is certainly lower here, at 5.9%, and the number of iterations for kappa is only 8. The ensemble shifted towards bagging gives the following results:

> output_test_bag <- ada(z ~ ., data = train, iter = 50,
                         loss = "e", type = "gentle", nu = 0.01,
                         bag.shift = TRUE, control = default)

> output_test_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)

Loss: exponential  Method: gentle  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 38 42
         2 11 66

Train Error: 0.338

Out-Of-Bag Error:  0.166  iteration= 44

Additional Estimates of number of iterations:

train.err1 train.kap1
        35         46

The training error is higher, at 34%. The Out-Of-Bag Error remains stubbornly in the 17% range.


Step 5: Make Predictions

We use the test sample to make predictions for both models and compare the results to the known values.

> pred <- predict(output_test, newdata = test)
> table(pred)
pred
 1  2
34 17

> pred <- predict(output_test_bag, newdata = test)
> table(pred)
pred
 1  2
31 20

> table(test$z)

 1  2
31 20

Note that test$z contains the actual values. It turns out that the results for both models are within an acceptable range. The bag-shifted model matches the observed class frequencies exactly.


Technique 70

Discrete L2 Boost

Discrete L2 Boost uses discrete boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "discrete", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; and data, the data set of attributes with which you wish to train the model.

Step 1: Load Required Packages

We will build the model using the soldat data frame contained in the ada package. The objective is to use the Discrete L2 Boost model to predict the relationship between the structural descriptors (aka features or attributes) and solubility / insolubility.

> library(ada)
> data(soldat)

NOTE

The soldat data frame consists of 5631 compounds tested to assess their ability to dissolve in a water/solvent mixture. Compounds were categorized as either insoluble (n = 3493) or soluble (n = 2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Notice that one of the descriptors contains 787 missing values.


Step 2: Prepare Data & Tweak Parameters

We partition the data into a training set containing 60% of the observations, a test set containing 30% of the observations and a validation set containing 10% of the observations.

> n <- nrow(soldat)
> set.seed(103)
> random_sample <- sample(1:n)
> train_sample <- ceiling(n * 0.6)
> test_sample <- ceiling(n * 0.3)
> valid_sample <- ceiling(n * 0.1)
> train <- soldat[random_sample[1:train_sample], ]
> test <- soldat[random_sample[(train_sample + 1):(train_sample + test_sample)], ]
> valid <- soldat[random_sample[(test_sample + train_sample + 1):n], ]

Wow, that is a lot of typing! Better check we have the right number of observations (5631).

> nrow(train) + nrow(test) + nrow(valid)
[1] 5631

Now we set the rpart.control. The maxdepth parameter controls the maximum depth of any node in the final tree, with the root node counted as depth 0. It is raised to the power of 2; we set a max depth of 2⁴ = 16.

> default <- rpart.control(cp = -1, maxdepth = 4, maxcompete = 5,
                           minsplit = 0)

Step 3: Estimate Model

Now we are ready to run the model on the training sample. We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "discrete", control = default)
> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)


Loss: logistic  Method: discrete  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1918  189
         1  461  811

Train Error: 0.192

Out-Of-Bag Error:  0.207  iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

Notice the output gives the training error, the confusion matrix and three estimates of the number of iterations. The training error estimate is 19.2% and the Out-Of-Bag Error is 20.7%. Of course, you could increase the number of iterations to attempt to push down both error rates. However, for illustrative purposes we will stick with this version of the model.

Step 4: Assess Model Performance

PRACTITIONER TIP

Notice that the actual binary classes (coded -1 and +1) are in column 73 of the test, train and validation data sets. Therefore we add the arguments test.x = test[, -73], test.y = test[, 73] to the addtest function.

We use the addtest function to evaluate the testing set without needing to refit the model.

> output_test <- addtest(output_test, test.x = test[, -73], test.y = test[, 73])

gt summary(output_test)


Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)

Loss: logistic  Method: discrete  Iteration: 50

Training Results

Accuracy: 0.808  Kappa: 0.572

Testing Results

Accuracy: 0.759  Kappa: 0.467

We see a decline in both the model accuracy and kappa from the training to the test set. Although an accuracy of 75.9% may be sufficient, we would like to see a slightly higher kappa.

Next we plot the error and kappa by iteration for the training and test sets; see Figure 70.1.

> plot(output_test, TRUE, TRUE)


Figure 70.1: Error and kappa by iteration for training and test sets

The error rates of both the test and training sets appear to decline as the number of iterations increases. Kappa appears to rise somewhat as the iterations increase for the training set, but it quickly flattens out for the testing data set.

We next look at the attribute influence scores and plot the 10 most influential variables; see Figure 70.2.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")
> barplot(scores[1:10])


Figure 70.2: Ten most influential variables

For this example there is not a lot of variation between the top ten influential variables.

Step 5: Make Predictions

Next we re-estimate using the validation sample, predict the classes and compare to the actual observed classes.

> set.seed(127)
> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "discrete", control = default)


> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
378 184

> table(valid$y)

 -1   1
356 206

The model predicts 378 in "class -1" and 184 in "class 1", whilst we actually observe 356 and 206 in "class -1" and "class 1" respectively. A summary of the class-probability-based predictions is obtained using:

> pred <- predict(output_valid, newdata = valid, type = "probs")
> round(pred, 3)

        [,1]  [,2]
[1,]   0.010 0.990
[2,]   0.062 0.938
[3,]   0.794 0.206
[4,]   0.882 0.118
...
[559,] 0.085 0.915
[560,] 0.799 0.201
[561,] 0.720 0.280
[562,] 0.996 0.004


Technique 71

Real L2 Boost

Real L2 Boost uses real boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "real", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1 and 2 are outlined beginning on page 501.

Step 3: Estimate Model

We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "real", control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic  Method: real  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1970  137
         1  487  785

Train Error: 0.185

Out-Of-Bag Error:  0.211  iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        47         50

The confusion matrix for the data yields a training error of 18.5%. The Out-Of-Bag Error is a little over 21%.

Step 4: Assess Model Performance

The addtest function is used to evaluate the testing set without needing to refit the model.

> output_test <- addtest(output_test, test.x = test[, -73],
                         test.y = test[, 73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic  Method: real  Iteration: 50

Training Results

Accuracy: 0.815  Kappa: 0.584

Testing Results

Accuracy: 0.753  Kappa: 0.451

It is noticeable that the accuracy of the model falls from 81.5% for the training set to 75.3% for the test set. A rather sharp decline is also observed in the kappa statistic.


We next calculate the influential variables. As we observed with Discrete L2 Boost, there is not a lot of variation in the top five most influential, with all hovering in the approximate range of 0.10–0.11.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
   x1    x2   x23    x5   x51
0.114 0.111 0.108 0.108 0.106

Step 5: Make Predictions

We re-estimate using the validation sample and make predictions. Notice that 385 are predicted to be in "class -1" and 177 in "class 1". These predicted values are very close to those obtained by the Discrete L2 Boost.

> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "real", control = default)
> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
385 177

> table(valid$y)

 -1   1
356 206


Technique 72

Gentle L2 Boost

Gentle L2 Boost uses gentle boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "gentle", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1 and 2 are outlined beginning on page 501.

Step 3: Estimate Model

We choose 50 boosting iterations by setting the parameter iter = 50.

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "gentle", control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic  Method: gentle  Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1924  183
         1  493  779

Train Error: 0.2

Out-Of-Bag Error:  0.216  iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        50         50

The model obtained a training error of 20% and a slightly higher Out-Of-Bag Error estimate of close to 22%.

Step 4: Assess Model Performance

Next we begin the assessment of the model's performance using the test data set. Notice that the accuracy declines slightly from 80% for the training sample to 74% for the test data set. We see modest declines in kappa also as we move from the training to the test data set.

> output_test <- addtest(output_test, test.x = test[, -73],
                         test.y = test[, 73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic  Method: gentle  Iteration: 50

Training Results

Accuracy: 0.8  Kappa: 0.552

Testing Results

Accuracy: 0.742  Kappa: 0.427


The top five most influential variables are given below. It is interesting to observe that they all lie roughly in a range of 0.09 to 0.10.

> scores <- varplot(output_test, plot.it = FALSE, type = "scores")

> round(scores[1:5], 3)
  x39   x12   x56   x70   x41
0.099 0.097 0.097 0.097 0.095

Step 5: Make Predictions

We use the validation sample to make class predictions. Notice that 412 are predicted to be in "class -1" and 150 in "class 1". These predicted values are somewhat different from those obtained by the Discrete L2 Boost (see page 506) and Real L2 Boost (see page 510).

> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "gentle", control = default)

> pred <- predict(output_valid, newdata = valid)

> table(pred)
pred
 -1   1
412 150

> table(valid$y)

 -1   1
356 206


Multi-Class Boosting


Technique 73

SAMME

SAMME135 is an extension of the binary Ada Boost algorithm to two or more classes. It uses an exponential loss function with αm = ln((1 − εm)/εm) + ln(k − 1), where k is the number of classes.
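The extra ln(k − 1) term is what allows SAMME to work with weak learners that only need to beat random guessing over k classes (error below 1 − 1/k) rather than 50%. A quick illustration, using a made-up weighted error of 0.55 with k = 4 classes:

eps <- 0.55; k <- 4
alpha_binary <- log((1 - eps) / eps)               # binary-style coefficient: negative
alpha_samme  <- log((1 - eps) / eps) + log(k - 1)  # SAMME coefficient: still positive
c(alpha_binary, alpha_samme)
# approximately -0.20 and 0.90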

PRACTITIONER TIP

When the number of classes k = 2, then αm = ln((1 − εm)/εm), which is similar to the learning coefficient of Ada Boost.M1. Notice that SAMME can be used for both binary boosting and multi-class boosting. This is also the case for the Breiman and Freund extensions discussed on page 519 and page 522 respectively. As an experiment, re-estimate the Sonar data set (see page 482) using SAMME. What do you notice about the results relative to Ada Boost.M1?

The SAMME algorithm can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Zhu", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Zhu" to specify using the SAMME algorithm.


Step 1: Load Required Packages

NOTE

The Vehicle data set (see page 23) is stored in the mlbench package. It is automatically loaded with the adabag package. If you need to load Vehicle directly, type library(mlbench) at the R prompt.

We begin by loading the library package and the Vehicle data set.

> library(adabag)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

The adabag package uses the rpart function to generate decision trees, so we need to supply an rpart.control object to the model. We use 500 observations as the training set and the remainder for testing.

> default <- rpart.control(cp = -1, maxdepth = 4,
                           minsplit = 0)
> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> observed <- Vehicle[train, ]

Steps 3 & 4: Estimate Model & Assess Model Performance

We estimate the model, create a table of predicted and observed values and then calculate the error rate.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ],
                  mfinal = 25, control = default, coeflearn = "Zhu")

> table(observed$Class, fit$class,
        dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus  124    0    0   0
          opel   0  114   13   0
          saab   0   12  117   0
          van    0    0    0 120

> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.05

PRACTITIONER TIP

The object returned from using the boosting function contains some very useful information; particularly relevant are $trees, $weights, $votes, $prob, $class and $importance. For example, if you enter fit$trees, R will return details of the trees used to estimate the model.

Notice the error rate is 5%, which seems quite low. Perhaps we have over-fit the model. As a check we perform a 10-fold cross validation. The confusion matrix is obtained using $confusion and the training error by $error.

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25,
                    v = 10, control = default, coeflearn = "Zhu")

> cv$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  208    0    3   1
           opel   2  113   91   4
           saab   7   94  116  11
           van    1    5    7 183

> cv$error
[1] 0.2671395

The cross-validation error rate is much higher, at 27%. This is probably a better reflection of what we can expect on the testing sample.

Before we use the test set to make predictions we take a look at the three most influential features.

> influence <- sort(fit$importance, decreasing = TRUE)


> round(influence[1:3], 1)
    Max.L.Ra   Pr.Axis.Ra Sc.Var.Maxis
        14.0          9.0          8.7

NOTE

Cross validation is a powerful concept because it can be used to estimate the error of an ensemble without having to divide the data into a training and testing set. It is often used with small samples, but it is equally advantageous in larger data sets.

Step 5: Make Predictions

We use the test data set to make predictions.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> pred$error
[1] 0.2745665

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   88    1    1   1
           opel   1   42   37   1
           saab   2   40   48   4
           van    3    2    2  73

The prediction error, at 27%, is close to the cross-validated error rate.


Technique 74

Breiman's Extension

Breiman's extension assumes αm = 0.5 log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Breiman", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Breiman".

PRACTITIONER TIP

The function boosting takes the optional parameter boos. By default it is set to TRUE, and a bootstrap sample is drawn using the weight for each observation. If boos = FALSE then every observation is used with its weight.

Steps 1 and 2 are outlined beginning on page 516

Steps 3 & 4: Estimate Model & Assess Model Performance

We fit the model using the training sample, calculate the error rate and also calculate the error rate using a 10-fold cross validation.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Breiman")


> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.1

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Breiman")
> round(cv$error, 2)
[1] 0.26

Whilst the error rate of 10% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation. Once again we expect the cross-validation error to better reflect what we expect to observe in the test sample.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Max.L.Rect
        23.7         14.0          8.8

Max.L.Ra is the most influential variable here, as it was using the SAMME algorithm (see page 517).

Step 5: Make Predictions

Finally we make predictions using the test data set and print out the observed versus predicted values.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.28

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   87    8    4   1
           opel   1   43   33   1
           saab   2   25   45   4
           van    4    9    6  73

The predicted error rate is 28% (close to the cross-validated value).


Technique 75

Freund's Adjustment

Freund's adjustment is another direct extension of the binary Ada Boost algorithm to two or more classes, where αm = log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn = "Freund", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn = "Freund".

Steps 1 and 2 are outlined beginning on page 516

Steps 3 & 4: Estimate Model & Assess Model Performance

We fit the model using the training sample, calculate the error rate and also calculate the error rate using a 10-fold cross validation.

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Freund")

> error_rate = (1 - sum(fit$class == observed$Class) / 500)

> round(error_rate, 2)
[1] 0.06

> set.seed(107)


> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Freund")

> round(cv$error, 2)
[1] 0.25

Whilst the error rate of 6% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Pr.Axis.Ra
        20.1          9.7          9.3

Max.L.Ra is the most influential variable here (as it was for the SAMME and Breiman algorithms; see pages 517 and 520).

Step 5: Make Predictions

Finally we make predictions using the test data set and print out the observed versus predicted table.

> pred <- predict.boosting(fit, newdata = Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.27

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   86    1    2   0
           opel   1   47   37   0
           saab   3   31   45   4
           van    4    6    4  75

The predicted error rate is 27% (close to the cross-validated value of 25%).


Continuous Response Boosted Regression


Technique 76

L2 Regression

NOTE

L2 boosting minimizes the least squares error (the sum of the squares of the differences between the observed values (yi) and the estimated values (ŷi)):

L2 = Σᵢ (yi − ŷi)²   (76.1)

where the sum runs over i = 1, ..., N, and ŷi is estimated as a function of the independent variables / features.

L2 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Gaussian(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Gaussian(), which implements L2 boosting; and control, which limits the number of boosting iterations and also sets the shrinkage parameter.
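Before turning to glmboost, it may help to see the idea in a few lines: L2 boosting repeatedly fits a simple base learner to the current residuals and takes small steps. The following sketch uses the built-in cars data and a single linear base learner purely for illustration; glmboost does this componentwise over all candidate features.

# Minimal sketch of the L2 boosting idea (illustrative data and base learner)
y <- cars$dist; x <- cars$speed
f <- rep(mean(y), length(y))   # start from the mean
nu <- 0.1                      # shrinkage / step length
for (m in 1:100) {
  r  <- y - f                  # residuals = negative gradient of the L2 loss
  ls <- lm(r ~ x)              # weak base learner fitted to the residuals
  f  <- f + nu * fitted(ls)    # take a small step towards the residual fit
}
sum((y - f)^2)                 # L2 loss after 100 boosting iterations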

Step 1: Load Required Packages

We begin by loading the mboost package and the bodyfat data set described on page 62.

> library(mboost)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

In the original study Garcia et al. used 45 observations for model validation. We follow the same approach, using the remaining 26 observations as the testing sample.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate Model & Assess Fit

PRACTITIONER TIP

Set trace = TRUE if you want to see status information during the fitting process.

First we fit the model using the glmboost function. We set the number of boosting iterations to 75 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients.

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Gaussian(),
                  control = boost_control(mstop = 75, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-66.27793041   0.02547924   0.17377131
     hipcirc elbowbreadth  kneebreadth
  0.46430335  -0.63272508   0.86864111
    anthro3a     anthro3b
  3.36145109   3.52597323


PRACTITIONER TIP

We use coef(fit, off2int = TRUE) to add back the offset to the intercept. To see the intercept without the offset use coef(fit).

A key tuning parameter of boosting is the number of iterations. We set mstop = 75. However, to prevent over-fitting it is important to choose the optimal stopping iteration with care. We use 10-fold cross-validated estimates of the empirical risk to help us choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function.

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 40

> fit[mstop(cvm)]


Figure 76.1: Cross-validated predictive risk for the bodyfat data set and L2 regression

Figure 76.1 displays the predictive risk. The optimal stopping iteration minimizes the average risk over all samples. Notice that mstop(cvm) returns the optimal number, in this case 40. We use fit[mstop(cvm)] to set the model parameters automatically to the optimal mstop.

Given our optimal model, we calculate a bootstrapped confidence interval at the 90% level for each of the parameters. For illustration we only use 200 bootstraps; in practice you should use at least 1000.

> CI <- confint(fit, B = 200, level = 0.9)

> CI


	Bootstrap Confidence Intervals
                       5%          95%
(Intercept)  -74.8992833 -49.16894522
age            0.0000000   0.04193643
waistcirc      0.0250575   0.27216736
hipcirc        0.2656622   0.60493289
elbowbreadth  -0.5086485   0.45619227
kneebreadth    0.0000000   1.93441658
anthro3a       0.0000000   8.05412043
anthro3b       0.0000000   5.32118418
anthro3c       0.0000000   3.05345810
anthro4        0.0000000   4.66542476

PRACTITIONER TIP

To compute a confidence interval for another level, simply enter the desired level using the print function. For example, print(CI, level = 0.8) returns:

> print(CI, level = 0.8)

        Bootstrap Confidence Intervals
                      10%           90%
(Intercept)  -72.20299829  -51.87095140
age            0.00000000    0.03249837
waistcirc      0.04749293    0.24761814
hipcirc        0.32502314    0.56940817
elbowbreadth  -0.30893554    0.00000000
kneebreadth    0.00000000    1.45814706
anthro3a       0.00000000    7.14609005
anthro3b       0.00000000    4.67906130
anthro3c       0.00000000    2.29523786
anthro4        0.00000000    3.59735084

The confidence intervals indicate that waistcirc and hipcirc are statistically significant. Since our goal is to build a parsimonious model we re-estimate the model using only these two variables:

> fit2 <- gamboost(DEXfat ~ waistcirc + hipcirc, data = bodyfat[train, ],
    family = Gaussian(), control = boost_control(mstop = 150, nu = 0.1, trace = FALSE))

> CI2 <- confint(fit2, B = 50, level = 0.9)

> par(mfrow = c(1, 2))
> plot(CI2, which = 1)
> plot(CI2, which = 2)
> par(new = TRUE)

Notice we use the gamboost function rather than glmboost. This is because we want to compute and display point-wise confidence intervals using the plot function. The confidence intervals are shown in Figure 76.2. For both variables the effect shows an almost linear increase with circumference size.

Figure 76.2: Point-wise confidence intervals for waistcirc and hipcirc

Step 4: Make Predictions
We now make predictions using the testing data set, plot the result (see Figure 76.3) and calculate the squared correlation coefficient. The final model shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.858 (linear correlation = 0.926):

> pred <- predict(fit2, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)

[1] 0.858

Figure 76.3: Plot of observed versus predicted values for the L2 bodyfat regression

Technique 77

L1 Regression

NOTE

L1 boosting minimizes the least absolute deviations (also known as the least absolute errors) by minimizing the sum of the absolute differences between the observed values (y_i) and the estimated values (ŷ_i):

L_1 = Σ_{i=1}^{N} |y_i − ŷ_i|

ŷ_i is estimated as a function of the independent variables (features). Unlike the L2 loss, which is sensitive to extreme values, the L1 loss is robust to outliers.
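A toy sketch (not from the book) illustrating this robustness: with one extreme observation the L2 loss is dominated by the outlier, while the L1 loss grows only linearly:

> y    <- c(1, 2, 3, 4, 100)   # the value 100 is an outlier
> yhat <- rep(2.5, 5)          # a constant prediction
> sum((y - yhat)^2)            # L2 loss, dominated by the outlier
[1] 9511.25
> sum(abs(y - yhat))           # L1 loss, far less affected
[1] 101.5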

L1 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Laplace(), control)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Laplace(), which implements L1 loss boosting; and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1-3 are outlined beginning on page 525.

Step 3: Estimate Model & Assess Fit
First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
    family = Laplace(), control = boost_control(mstop = 300, nu = 0.1, trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-59.12102158   0.01124291   0.06454519
     hipcirc elbowbreadth  kneebreadth
  0.49652307  -0.38873368   0.83397012
    anthro3a     anthro3b     anthro3c
  0.58847326   3.64957585   2.00388290

All of the variables receive a weight with the exception of anthro4. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 260

> fit[mstop(cvm)]

The optimal stopping iteration is 260, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions
Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 77.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.906 (linear correlation = 0.95):

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)

[1] 0.906

Figure 77.1: Plot of observed versus predicted values for the L1 bodyfat regression

Technique 78

Robust Regression

Robust boosting regression uses the Huber loss function, which is less sensitive to outliers than the L2 loss function discussed on page 525. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Huber(), control)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Huber(); and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1-3 are outlined beginning on page 525.

Step 3: Estimate Model & Assess Fit
First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
    family = Huber(), control = boost_control(mstop = 300, nu = 0.1, trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc
-60.83574185   0.02340993   0.17680696
     hipcirc elbowbreadth  kneebreadth
  0.45778183  -0.83925018   0.64471220
    anthro3a     anthro3b     anthro3c
  0.26460670   5.14548154   0.78826723

All of the variables receive a weight with the exception of anthro4. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 100

> fit[mstop(cvm)]

The optimal stopping iteration is 100, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions
Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 78.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.913:

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)

[1] 0.913

Figure 78.1: Plot of observed versus predicted values for the robust bodyfat regression

Technique 79

Generalized Additive Model

The generalized additive model (GAM) is a generalization of the linear regression model in which the effects of the covariates can be estimated as smooth functions. The GAM can account for non-linear relationships between the response variable and covariates:

E(Y | X_1, X_2, ..., X_K) = α + f_1(X_1) + f_2(X_2) + ... + f_K(X_K)    (79.1)

A boosted version of the model can be run using the package mboost with the gamboost function:

gamboost(z ~ ., data, control)

Key parameters include z, the continuous response variable; data, the data set of independent variables; and control, which limits the number of boosting iterations and controls the shrinkage parameter. Steps 1-3 are outlined beginning on page 525.

NOTE

For more than 100 years what data scientists could do was limited by asymptotic theory. Consumers of statistical theory, such as economists, focused primarily on linear models and residual analysis. With boosting there is no excuse anymore for using overly restrictive statistical models.

Step 3: Estimate Model & Assess Fit
We previously saw that waistcirc and hipcirc are the primary independent variables for predicting body fat (see page 525). We use these two variables to estimate linear terms (bols()), smooth terms (bbs()) and interactions between waistcirc and hipcirc modeled using decision trees (btree()):

> m <- DEXfat ~ waistcirc + hipcirc + bols(waistcirc) + bols(hipcirc) +
    bbs(waistcirc) + bbs(hipcirc) +
    btree(hipcirc, waistcirc,
          tree_controls = ctree_control(maxdepth = 4, mincriterion = 0))

We set the number of boosting iterations to 150 and the shrinkage parameter (nu) to 0.1. This is followed by a 10-fold cross-validated estimate of the empirical risk to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function; fit[mstop(cvm)] assigns the model coefficients at the optimal iteration:

> fit <- gamboost(m, data = bodyfat[train, ],
    control = boost_control(mstop = 150, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 41

> fit[mstop(cvm)]

The optimal stopping iteration is 41, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Plots of the linear effects and the interaction map can be obtained by typing:

> plot(fit, which = 1)

> plot(fit, which = 2)

> plot(fit, which = 5)

The resulting plots are given in Figure 79.1 and Figure 79.2. Figure 79.1 indicates a strong linear effect for both waistcirc and hipcirc. Figure 79.2 indicates that a hip circumference larger than about 110 cm leads to increased body fat, provided waist circumference is larger than approximately 90 cm.

Figure 79.1: GAM regression estimates of linear effects for waistcirc and hipcirc

Figure 79.2: Interaction model component between waistcirc and hipcirc

Step 4: Make Predictions
Predictions using the test data set are made with the optimal model fit. A plot of the predicted and observed values (see Figure 79.3) shows a relatively linear relationship with DEXfat. The squared correlation coefficient is 0.867 (linear correlation = 0.93):

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)

[1] 0.867

Figure 79.3: Plot of observed versus predicted values for the GAM regression

Technique 80

Quantile Regression

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when both lower and upper, or all, quantiles are of interest.

For example, in the analysis of body mass both lower (underweight) and upper (overweight) quantiles are of interest to health professionals and health-conscious individuals. We apply it to the bodyfat data set to illustrate this point.

Quantile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = QuantReg(), control)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = QuantReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages
We illustrate the use of this model using the bodyfat data frame. Details of this data set are discussed on page 62:

library(mboost)
data(bodyfat, package = "TH.data")
set.seed(465)

Step 2: Estimate Model & Assess Fit
The question we address is: does a common set of factors explain the 25th and 75th percentiles of the data set? To answer this question we estimate two quantile regression models, one at the 25th percentile and the other at the 75th percentile:

> fit25 <- glmboost(DEXfat ~ ., data = bodyfat,
    family = QuantReg(tau = 0.25),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> fit75 <- glmboost(DEXfat ~ ., data = bodyfat,
    family = QuantReg(tau = 0.75),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

The coefficients of the 25th percentile regression are:

> coef(fit25, off2int = TRUE)
 (Intercept)          age    waistcirc
-58.24524124   0.03352251   0.16259137
     hipcirc elbowbreadth  kneebreadth
  0.32444309  -0.19165652   0.81968750
    anthro3a     anthro3b     anthro3c
  1.27773808   1.69581972   3.90369290
     anthro4
  0.60159287

The coefficients of the 75th percentile regression are:

> coef(fit75, off2int = TRUE)
 (Intercept)          age    waistcirc
 -64.4564045    0.0130879    0.2779561
     hipcirc elbowbreadth  kneebreadth
   0.3954600   -0.1283315    0.8464489
    anthro3a     anthro3b     anthro3c
   2.5044739    2.2238765    0.9028559

There appears to be considerable overlap in the set of explanatory variables which explain both percentiles. However, it is noticeable that anthro4 only appears in the 25th percentile regression. To investigate further we use a 10-fold cross-validated estimate of the empirical risk to choose the optimal number of boosting iterations for each quantile regression model. The empirical risk is calculated using the cvrisk function:

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)
> cvm25 <- cvrisk(fit25, folds = cv10f25)

> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)

> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 994

> mstop(cvm75)
[1] 1200

> fit25[mstop(cvm25)]
> fit75[mstop(cvm75)]

The optimal stopping iteration is 994 for the 25th percentile model and 1200 for the 75th percentile model.

Given our optimal models we calculate a bootstrapped confidence interval (90% level) around each of the parameters. For illustration we only use 50 bootstraps; in practice you should use at least 1,000:

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

The results for the 25th percentile model are:

> CI25

        Bootstrap Confidence Intervals
                       5%           95%
(Intercept)  -62.91485919  -43.84055225
age           -0.02032722    0.06116936
waistcirc      0.02054691    0.36351636
hipcirc        0.14107346    0.48121885
elbowbreadth  -1.02553929    0.23022726
kneebreadth   -0.03557848    1.37440528
anthro3a       0.00000000    4.11999837
anthro3b       0.00000000    3.18697032
anthro3c       2.09103968    6.49277249
anthro4        0.00000000    1.36671922

The results for the 75th percentile model are:

> CI75

        Bootstrap Confidence Intervals
                       5%           95%
(Intercept)  -72.77094640   -49.1977452
age           -0.00093535     0.0549349
waistcirc      0.10970068     0.3596424
hipcirc        0.16935863     0.5051908
elbowbreadth  -1.36476008     0.5104004
kneebreadth    0.00000000     1.9214064
anthro3a      -0.76595108     6.1881146
anthro3b       0.00000000     6.6016240
anthro3c      -0.31714787     5.6599846
anthro4       -0.27462792     2.8194127

Both models have waistcirc and hipcirc as significant explanatory variables. anthro3c is the only other significant variable, and then only for the lower weight percentiles.


Technique 81

Expectile Regression

Expectile regression136 is used for estimating the conditional expectiles of a response variable given a set of attributes. Having multiple expectiles at different levels provides a more complete picture of the conditional distribution of the response variable.

A boosted version of expectile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = ExpectReg(), control)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = ExpectReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.

We illustrate the use of expectile regression using the bodyfat data frame. Details of this data set are discussed on page 62. We continue our analysis of page 544, taking the final models developed for the quantile regression and fitting them using ExpectReg (note: Step 1 is outlined on page 544).

Step 2: Estimate Model & Assess Fit
We set fit25 as the 25th expectile regression using the three significant variables identified using quantile regression (waistcirc, hipcirc, anthro3c; see page 544). We also use fit75 as the 75th expectile regression using the two significant variables (waistcirc and hipcirc) identified via quantile regression.

Ten-fold cross-validation is used to determine the optimal number of iterations for each of the models. The optimal number is captured using mstop(cvm25) and mstop(cvm75) for the 25th expectile regression and the 75th expectile regression respectively.

> fit25 <- glmboost(DEXfat ~ waistcirc + hipcirc + anthro3c,
    data = bodyfat, family = ExpectReg(tau = 0.25),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> fit75 <- glmboost(DEXfat ~ waistcirc + hipcirc,
    data = bodyfat, family = ExpectReg(tau = 0.75),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)

> cvm25 <- cvrisk(fit25, folds = cv10f25)
> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)
> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 219

> mstop(cvm75)
[1] 826
> fit25[mstop(cvm25)]

Notice that the optimal number of iterations is 219 and 826 for the 25th expectile regression and the 75th expectile regression respectively. The 90% confidence intervals are estimated for each model (we use a small bootstrap of B = 50; in practice you will want to use at least 1,000):

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

> CI25
        Bootstrap Confidence Intervals
                      5%          95%
(Intercept)  -61.0691823  -48.9044440
waistcirc      0.1035435    0.2891155
hipcirc        0.3204619    0.5588739
anthro3c       4.4213144    7.2163618

> CI75
        Bootstrap Confidence Intervals
                      5%          95%
(Intercept)  -60.1313272  -41.8993208
waistcirc      0.2478238    0.4967517
hipcirc        0.2903714    0.6397651

As we might have expected, all the variables included in each model are statistically significant. Both models have waistcirc and hipcirc as significant explanatory variables. It seems anthro3c is important in the lower weight percentile. These findings are similar to those observed for the boosted quantile regression models discussed on page 544.

Discrete Response Boosted Regression

Technique 82

Logistic Regression

Boosted logistic regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("logit")), control)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial; and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages
We will build the model using the Sonar data set discussed on page 482:

> library(mboost)
> data(Sonar, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We use 157 observations to train the model and the remaining 51 observations as the test set:

> set.seed(107)
> n = nrow(Sonar)
> train <- sample(1:n, 157, FALSE)

Step 3: Estimate Model
We fit the model setting the maximum number of iterations to 200. Then a 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations.

Empirical risk is calculated using the cvrisk function:

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
    family = Binomial(link = c("logit")),
    control = boost_control(mstop = 200, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 123

> fit[mstop(cvm)]

The optimal number is 123 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data
The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally the confusion table is printed out:

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))

      predicted
actual  M  R
     M 22  7
     R  4 18

The overall error rate of the model is (7 + 4)/51 = 21.5%.
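A minimal sketch (assuming the confusion table above is stored in tb) that computes the same error rate programmatically:

> tb <- table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
> 1 - sum(diag(tb))/sum(tb)   # off-diagonal (misclassified) share of the test set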


Technique 83

Probit Regression

Boosted probit regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("probit")), control)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial; and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1 and 2 are outlined on page 552.

Step 3: Estimate Model
The model is fit by setting the maximum number of iterations to 200. Then a 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
    family = Binomial(link = c("probit")),
    control = boost_control(mstop = 200, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 199

> fit[mstop(cvm)]

The optimal number is 199 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data
The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally the confusion table is printed out:

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))

      predicted
actual  M  R
     M 21  8
     R  4 18

The overall error rate of the model is (8 + 4)/51 = 23.5%.

Boosted Regression for Count & Ordinal Response Data

Technique 84

Poisson Regression

Modeling count variables is a common task for the data scientist. When the response variable (y_i) is a count variable following a Poisson distribution with a mean that depends on the covariates x_1, ..., x_k, the Poisson regression model is used. A boosted Poisson regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Poisson(), control)

Key parameters include z, the response variable; data, the data set of independent variables; family = Poisson; and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

If y_i ~ Poisson(λ_i), then the mean is equal to λ_i and the variance is also equal to λ_i. Therefore in a Poisson regression both the mean and the variance depend on the covariates, i.e.

ln(λ_i) = β_0 + β_1 x_1i + ... + β_k x_ki    (84.1)

Step 1: Load Required Packages
The required packages and data are loaded as follows:

> library(mboost)
> library(MixAll)
> data(DebTrivedi)
> set.seed(465)

> f <- ofp ~ hosp + health + numchron + gender + school + privins

The number of physician office visits, ofp, is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status), numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).

Step 2: Estimate Model & Assess Fit
The model is fit using glmboost with the maximum number of iterations equal to 1,200. The parameter estimates are shown in Table 25:

> fit <- glmboost(f, data = DebTrivedi, family = Poisson(),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor
          1.029           0.165           0.248
healthexcellent        numchron      gendermale
         -0.362           0.147          -0.112
         school      privinsyes
          0.026           0.202

Table 25: Initial coefficient estimates

A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 26:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 36

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor
          1.039           0.166           0.243
healthexcellent        numchron      gendermale
         -0.347           0.147          -0.107
         school      privinsyes
          0.026           0.197

Table 26: Cross-validated coefficient estimates

The optimal number of iterations is 36 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3). We note that all the parameter estimates for the optimal fit model are close to those observed prior to the 10-fold cross-validation.

Finally we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000):

> CI <- confint(fit, B = 50, level = 0.9)

> CI
        Bootstrap Confidence Intervals
                         5%          95%
(Intercept)       0.9589353   1.18509474
hosp              0.1355033   0.21464596
healthpoor        0.1663162   0.32531872
healthexcellent  -0.4494058  -0.21435073
numchron          0.1207087   0.16322440
gendermale       -0.1591000  -0.04314812
school            0.0143659   0.03374010
privinsyes        0.1196714   0.23438877

All of the covariates are statistically significant and contribute to explaining the number of physician office visits.

Technique 85

Negative Binomial Regression

Negative binomial regression is often used for over-dispersed count outcome variables. A boosted negative binomial regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = NBinomial(), control)

Key parameters include z, the count response variable; data, the data set of independent variables; family = NBinomial; and control, which limits the number of boosting iterations and the shrinkage parameter.

PRACTITIONER TIP

A feature of the data frame DebTrivedi reported by the original researchers (see page 86) is that the data has a high degree of unconditional overdispersion relative to the standard Poisson model. Overdispersion simply means that the data has greater variability than would be expected based on the given statistical model (in this case Poisson). One way to handle overdispersion is to use the negative binomial regression model.
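A quick sketch (assuming the DebTrivedi data frame loaded in Step 1 on page 558) to see this overdispersion directly; for a Poisson variable the mean and variance would be roughly equal:

> mean(DebTrivedi$ofp)   # mean number of office visits
> var(DebTrivedi$ofp)    # variance is far larger, signalling overdispersion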

Step 1 is outlined on page 558

Step 2: Estimate Model & Assess Fit
We continue with the DebTrivedi data frame and estimate a negative binomial regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 1,200. The parameter estimates are shown in Table 27:

> fit <- glmboost(f, data = DebTrivedi, family = NBinomial(c(0, 100)),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor
          0.929           0.218           0.305
healthexcellent        numchron      gendermale
         -0.342           0.175          -0.126
         school      privinsyes
          0.027           0.224

Table 27: Initial coefficient estimates

Although the values of the estimates are somewhat different from those on page 559, the signs of the coefficients remain consistent.

A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 28:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 312

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)

The optimal number of iterations is 312 (from mstop(cvm)). Therefore we use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross-validation.

Finally we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000).

    (Intercept)            hosp      healthpoor
          0.940           0.216           0.300
healthexcellent        numchron      gendermale
         -0.336           0.174          -0.122
         school      privinsyes
          0.026           0.220

Table 28: Cross-validated coefficient estimates

> CI <- confint(fit, B = 50, level = 0.9)

> CI
        Bootstrap Confidence Intervals
                         5%          95%
(Intercept)      0.88730254  1.06806832
hosp             0.17222572  0.24115790
healthpoor       0.21205691  0.35635293
healthexcellent -0.47308517 -0.18802910
numchron         0.14796503  0.19085736
gendermale      -0.18601165 -0.05170944
school           0.01638135  0.03447440
privinsyes       0.13273522  0.25977131

All of the covariates are statistically significant.

Technique 86

Hurdle Regression

Hurdle regression is used for modeling count data where there is overdispersion and excessive zero counts in the outcome variable. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Hurdle(), control)

Key parameters include z, the response variable; data, the data set of independent variables; family = Hurdle; and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

A feature of the DebTrivedi data reported by the researchers Deb and Trivedi (see page 86) is that the data include a high proportion of zero counts, corresponding to zero recorded demand over the sample interval. One way to handle an excess of zero counts is to fit a negative binomial regression model to the non-zero counts. This can be achieved using the Hurdle function. In the hurdle approach the process that determines the zero/nonzero count threshold is different from the process that determines the count once the hurdle (zero in this case) is crossed. Once the hurdle is crossed the data are assumed to follow the density of a truncated negative binomial distribution.
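A quick sketch (assuming the DebTrivedi data frame is loaded) to check the proportion of zero counts in the response:

> mean(DebTrivedi$ofp == 0)   # share of individuals with zero physician office visits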

Step 1 is outlined on page 558

Step 2: Estimate Model & Assess Fit
We continue with the DebTrivedi data frame and estimate a hurdle regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 3,000. The parameter estimates are shown in Table 29:

> fit <- glmboost(f, data = DebTrivedi, family = Hurdle(c(0, 100)),
    control = boost_control(mstop = 3000, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor
         -3.231           0.352           0.533
healthexcellent        numchron      gendermale
         -0.586           0.299          -0.206
         school      privinsyes
          0.044           0.395

Table 29: Initial coefficient estimates

    (Intercept)            hosp      healthpoor
         -3.189           0.341           0.509
healthexcellent        numchron      gendermale
         -0.567           0.294          -0.192
         school      privinsyes
          0.042           0.378

Table 30: Cross-validated coefficient estimates

Although the sign of the intercept and the values of the estimated coefficients are somewhat different from those on page 559, the signs of the remaining coefficients are preserved. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 30:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 1489

The optimal number is 1489 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross-validation.

Technique 87

Proportional Odds Model

The Proportional Odds Model is a class of generalized linear models used for modelling the dependence of an ordinal response on discrete or continuous covariates.

It is used when it is not possible to measure the response variable on an interval scale. In biomedical research, for instance, constructs such as self-perceived health can be measured on an ordinal scale ("very unhealthy", "unhealthy", "healthy", "very healthy"). A boosted version of the model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = PropOdds(), control)

Key parameters include z, the response, which is an ordered factor; data, the data set of explanatory variables; family = PropOdds(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data
The required packages and data are loaded as follows:

> library(mboost)
> library(ordinal)
> data(wine)
> set.seed(125)

The package ordinal contains the data frame wine, which is used in the analysis. It also contains the function clm, which we use later to estimate a non-boosted version of the Proportional Odds Model. Further details of wine are given on page 95.
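A quick check (assuming the wine data frame from the ordinal package) that the response really is an ordered factor, as PropOdds() expects:

> class(wine$rating)    # should include "ordered" and "factor"
> levels(wine$rating)   # the ordered rating categories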

Step 2: Estimate Model & Assess Fit
We estimate the model using rating as the response variable and contact and temp as the covariates. The model is estimated using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(rating ~ temp + contact, data = wine,
    family = PropOdds(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)
> mstop(cvm)
[1] 167

> round(cvm[mstop(cvm)], 3)
[1] 1.42

> fit[mstop(cvm)]

The optimal number of iterations is 167 (from mstop(cvm)), with an empirical risk of 1.42. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int=TRUE), 3).

The estimate of a Proportional Odds Model using the function clm is also reported:

> round(coef(fit, off2int = TRUE), 3)
(Intercept)    tempwarm  contactyes
     -1.508       2.148       1.230

> fit2 <- clm(rating ~ temp + contact, data = wine, link = "logit")

> round(fit2$coefficients[5:6], 3)
  tempwarm contactyes
     2.503      1.528

Although the coefficients differ between the boosted and non-boosted versions of the model, both models yield the same class predictions:

> pred <- predict(fit, type = "class")

> pred2 <- predict(fit2, type = "class")

> table(pred)
pred
 1  2  3  4  5
 0 18 36 18  0

> table(pred2)
pred2
 1  2  3  4  5
 0 18 36 18  0

Finally we compare the fitted model with the actual observations and calculate the overall error rate:

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))

> tb
      predicted
actual  1  2  3  4  5
     1  0  4  1  0  0
     2  0  9 12  1  0
     3  0  5 16  5  0
     4  0  0  5  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb))/sum(tb))

> round(error, 3)
[1] 0.556

The error rate for the model is 56%.

Boosted Models for Survival Analysis

NOTE

Survival analysis is concerned with studying the time between entry to a study and a subsequent event (such as death). The objective is to use a statistical model to simultaneously explore the effects of several explanatory variables on survival. Two popular approaches are the Cox Proportional Hazard Model and Accelerated Failure Time Models.

Cox Proportional Hazard Model
The Cox Proportional Hazard Model is a statistical technique for exploring the relationship between survival (typically of a patient) and several explanatory variables. It takes the form

h_i(t) = exp(β_1 X_1i + ... + β_k X_ki) h_0(t)    (87.1)

where h_i(t) is the hazard function for the ith individual at time t, h_0(t) is the baseline hazard function and X_1, ..., X_k are the explanatory covariates.

The model provides an estimate of the treatment effect on survival after adjustment for other explanatory variables. In addition, it is widely used in medical statistics because it provides an estimate of the hazard (or risk) of death for an individual given their prognostic variables.

NOTE

The hazard function is the probability that an individual will experience an event (for example death) within a small time interval, given that the individual has survived up to the beginning of the interval. In the medical context it can therefore be interpreted as the risk of dying at time t.
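In standard notation (an addition for clarity, not from the book) the hazard is the limit of this conditional probability per unit of time:

h_i(t) = lim_{Δt→0} P(t ≤ T_i < t + Δt | T_i ≥ t) / Δt

where T_i is the survival time of the ith individual.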

Accelerated Failure Time Models
Parametric accelerated failure time (AFT) models provide an alternative to the (semi-parametric) Cox proportional hazards model for statistical modeling of survival data137. Unlike the Cox proportional hazards model, the AFT approach models survival times directly and assumes that the effect of a covariate is to accelerate or decelerate the life course of a response by some constant.

The AFT model treats the logarithm of survival time as the response variable and includes an error term that is assumed to follow a particular distribution.

Equation 87.2 shows the log-linear form of the AFT model for the ith individual, where log T_i is the log-transformed survival time, X_1, ..., X_k are the explanatory covariates, ε_i represents the residual or unexplained variation in the log-transformed survival times, and μ and σ are the intercept and scale parameter respectively138:

log T_i = μ + β_1 X_1i + ... + β_k X_ki + σ ε_i    (87.2)

Under the AFT model parameterization the distribution chosen for T_i dictates the distribution of the error term ε_i. Popular survival time distributions include the Weibull distribution, the log-logistic distribution and the log-normal distribution.

PRACTITIONER TIP

If the baseline hazard function is known to follow a Weibull distribution, the accelerated failure and proportional hazards assumptions are equivalent.

Model               Empirical Risk
Weibull AFT         0.937
Lognormal AFT       1.046
Log-logistic AFT    0.934
Cox PH              2.616
Gehan               0.267

Table 31: Estimation of empirical risk using the rhDNase data frame

Assessing Fit
One of the first steps the data scientist faces in fitting survival models is to determine which distribution should be specified for the survival times T_i. One approach is to fit a model for each distribution and choose the model which minimizes Akaike's Information Criterion (AIC)139 or similar criteria. An alternative is to choose the model which minimizes the cross-validated estimate of empirical risk.

As an example, Table 31 shows the cross-validated estimate of empirical risk for various models using the rhDNase data frame.
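One way such a comparison could be produced is sketched below (an illustration, not the book's code), assuming the rhDNase data preparation of Technique 88 (the Surv object z and the averaged covariate fev.ave); because the folds are random, the risk values will vary from run to run:

families <- list("Weibull AFT"      = Weibull(),
                 "Lognormal AFT"    = Lognormal(),
                 "Log-logistic AFT" = Loglog(),
                 "Cox PH"           = CoxPH(),
                 "Gehan"            = Gehan())

sapply(families, function(fam) {
  fit <- glmboost(z ~ trt + fev.ave, data = rhDNase, family = fam,
                  control = boost_control(mstop = 1200, nu = 0.1))
  cvm <- cvrisk(fit, folds = cv(model.weights(fit), type = "kfold", B = 5))
  round(cvm[mstop(cvm)], 3)   # minimum cross-validated empirical risk
})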


Technique 88

Weibull Accelerated FailureTime Model

The Weibull Accelerated Failure Time Model is one of the most popular distributional choices for modeling survival data. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Weibull(), control)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv constructs a survival object using the survival package); data, the data set of explanatory variables; family = Weibull(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data
The required packages and data are loaded as follows:

> library(mboost)
> library(simexaft)
> library(survival)
> data(rhDNase)
> set.seed(465)

Forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received a placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(Surv(time2, status) ~ trt + fev.ave, data = rhDNase,
    family = Weibull(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 500

> round(cvm[mstop(cvm)], 3)
[1] 0.937

> fit[mstop(cvm)]

The optimal number of iterations is 500 (from mstop(cvm)), with an empirical risk of 0.937. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3).

We then fit a non-boosted Weibull AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave, data = rhDNase,
    dist = "weibull")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.531       0.348       0.019

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.518       0.357       0.019

> round(nuisance(fit), 3)
[1] 0.92

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a small difference in the estimated values of the intercept and trt.

Technique 89

Lognormal Accelerated FailureTime Model

The Lognormal Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Lognormal(), control)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv constructs a survival object using the survival package); data, the data set of explanatory variables; family = Lognormal(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received a placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Lognormal(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)

> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 456

> round(cvm[mstop(cvm)], 3)
[1] 1.046

> fit[mstop(cvm)]

The optimal number of iterations is 456 (from mstop(cvm)), with an empirical risk of 1.046. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3).

We then fit a non-boosted Lognormal AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave, data = rhDNase,
    dist = "lognormal")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.103       0.408       0.021

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.083       0.424       0.022

> round(nuisance(fit), 3)
[1] 1.441

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a much larger difference in the estimated value of trt (0.408 boosted AFT versus 0.424 non-boosted Lognormal AFT).

Technique 90

Log-logistic Accelerated FailureTime Model

The Log-logistic Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Loglog(), control)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv constructs a survival object using the survival package); data, the data set of explanatory variables; family = Loglog(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received a placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Loglog(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 445

> round(cvm[mstop(cvm)], 3)
[1] 0.934

> fit[mstop(cvm)]

The optimal number of iterations is 445 (from mstop(cvm)), with an empirical risk of 0.934. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3).

We then fit a non-boosted Log-logistic AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave, data = rhDNase,
    dist = "loglogistic")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.110       0.384       0.020

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      4.093       0.396       0.021

> round(nuisance(fit), 3)
[1] 0.794

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a larger difference in the estimated value of trt (0.384 boosted AFT versus 0.396 non-boosted Log-logistic AFT).

Technique 91

Cox Proportional HazardModel

The Cox Proportional Hazard Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = CoxPH(), control)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv constructs a survival object using the survival package); data, the data set of explanatory variables; family = CoxPH(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received a placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = CoxPH(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 263

> round(cvm[mstop(cvm)], 3)
[1] 2.616

> fit[mstop(cvm)]

The optimal number of iterations is 263 (from mstop(cvm)), with an empirical risk of 2.616. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int=TRUE), 3).

PRACTITIONER TIP

With boosting the whole business of model diagnostics is greatly simplified. For example, you can quickly boost an additive non-proportional hazards model in order to check if it fits your data better than a linear Cox model. If the two models perform roughly equivalently, you know that assuming proportional hazards and linearity is reasonable.
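One possible sketch of such a check (an illustration, not the book's code), assuming z, trt and fev.ave as defined above; the additive fit replaces the linear fev.ave term with a smooth bbs() base-learner and the two models are compared on their minimum cross-validated empirical risk:

fit_lin <- glmboost(z ~ trt + fev.ave, data = rhDNase, family = CoxPH(),
                    control = boost_control(mstop = 1200, nu = 0.1))
fit_add <- gamboost(z ~ bols(trt) + bbs(fev.ave), data = rhDNase,
                    family = CoxPH(), control = boost_control(mstop = 1200, nu = 0.1))

cvm_lin <- cvrisk(fit_lin, folds = cv(model.weights(fit_lin), type = "kfold", B = 5))
cvm_add <- cvrisk(fit_add, folds = cv(model.weights(fit_add), type = "kfold", B = 5))

# roughly equal minimum risks suggest the linear proportional hazards fit is adequate
round(c(linear = cvm_lin[mstop(cvm_lin)], additive = cvm_add[mstop(cvm_add)]), 3)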

We then fit a non-boosted Cox Proportional Hazard Model using the coxph function from the survival package and compare the parameter estimates to the boosted model:

> fit1 <- coxph(Surv(time2, status) ~ trt + fev.ave, data = rhDNase)

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      1.431      -0.370      -0.020

> round(coef(fit1, off2int = TRUE), 3)
       trt    fev.ave
    -0.378     -0.021

Both models have similar treatment effects.

NOTE

For the Cox Proportional Hazard model a positive regression coefficient for an explanatory variable means that the hazard is higher, and thus the prognosis worse. Conversely, a negative regression coefficient implies a better prognosis for patients with higher values of that variable. In Figure 91.1 we plot the predicted survivor function:

S1 <- survFit(fit)
plot(S1)

Figure 91.1: Cox Proportional Hazard Model predicted survivor function for rhDNase

Technique 92

Gehan Loss Accelerated FailureTime Model

The Gehan Loss Accelerated Failure Time Model calculates a rank-based estimation of survival data where the loss function is defined as the sum of the pairwise absolute differences of residuals. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Gehan(), control)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status)); data, the data set of explanatory variables; family = Gehan(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received a placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross-validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Gehan(), control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)

> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 578

> round(cvm[mstop(cvm)], 3)
[1] 0.267

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave
      3.718       0.403       0.022

The optimal number of iterations is 578 (from mstop(cvm)), with an empirical risk of 0.267. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int=TRUE), 3).

Notes

115 Bauer, Eric and Ron Kohavi. "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants." Machine Learning 36.1 (1998): 2.

116 Friedman, Jerome, Trevor Hastie and Robert Tibshirani. "Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)." The Annals of Statistics 28.2 (2000): 337-407.

117 A type of proximal gradient method for learning from data. See for example: 1. Parikh, Neal and Stephen Boyd. "Proximal algorithms." Foundations and Trends in Optimization 1.3 (2013): 123-231. 2. Polson, Nicholas G., James G. Scott and Brandon T. Willard. "Proximal Algorithms in Statistics and Machine Learning." arXiv preprint arXiv:1502.03175 (2015).

118 Schapire, Robert E. "The strength of weak learnability." Machine Learning 5.2 (1990): 197-227.

119 Cheepurupalli, Kusma Kumari and Raja Rajeswari Konduri. "Noisy reverberation suppression using adaboost based EMD in underwater scenario." International Journal of Oceanography 2014 (2014).

120 For details on EMD and its use see the classic paper by Rilling, Gabriel, Patrick Flandrin and Paulo Goncalves. "On empirical mode decomposition and its algorithms." IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Vol. 3, NSIP-03, Grado (I), 2003.

121 See for example: 1. Kusma Kumari and K. Raja Rajeshwari, "Application of EMD as a robust adaptive signal processing technique in radar/sonar communications," International Journal of Engineering Science and Technology (IJEST), vol. 3, no. 12, pp. 8262-8266, 2011. 2. Kusma Kumari, Ch. and K. Raja Rajeswari, "Enhancement of performance measures using EMD in noise reduction application," International Journal of Computer Applications, vol. 70, no. 5, pp. 10-14, 2013.

122 Karabulut, Esra Mahsereci and Turgay Ibrikci. "Analysis of Cardiotocogram Data for Fetal Distress Determination by Decision Tree Based Adaptive Boosting Approach." Journal of Computer and Communications 2.09 (2014): 32.

123 See Newman, D.J., Heittech, S., Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Science.

124 Creamer, Germán and Yoav Freund. "Automated trading with boosting and expert weighting." Quantitative Finance 10.4 (2010): 401-420.

125 Sam, Kam-Tong and Xiao-Lin Tian. "Vehicle logo recognition using modest adaboost and radial tchebichef moments." International Conference on Machine Learning and Computing (ICMLC 2012), 2012.

126 Markoski, Branko et al. "Application of Ada Boost Algorithm in Basketball Player Detection." Acta Polytechnica Hungarica 12.1 (2015).

127 Takiguchi, Tetsuya et al. "An adaboost-based weighting method for localizing human brain magnetic activity." Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, IEEE, 2012.

128 Note that each sensor weight value calculated by Ada Boost provides an indication of how useful the MEG-sensor pair is for vowel recognition.

129 See the original paper by Freund, Y., Schapire, R. (1996). "Experiments with a New Boosting Algorithm." In "International Conference on Machine Learning," pp. 148-156.

130 See the seminal work of Bauer, E., Kohavi, R. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants." Journal of Machine Learning 1999;36:105-139. They report a 27% relative improvement using Ada Boost compared to a single decision tree. Also take a look at the following: 1. Ridgeway, G. "The State of Boosting." Computing Science and Statistics 1999;31:172-181. 2. Meir, R., Ratsch, G. "An Introduction to Boosting and Leveraging." Advanced Lectures on Machine Learning, 2003, pp. 118-183.

131 Gorman, R. Paul and Terrence J. Sejnowski. "Analysis of hidden units in a layered network trained to classify sonar targets." Neural Networks 1.1 (1988): 75-89.

132 See Cohen, J. (1960). "A coefficient of agreement for nominal data." Educational and Psychological Measurement 20: 37-46.

133 See for example: 1. Buhlmann, P., Hothorn, T. "Boosting Algorithms: Regularization, Prediction and Model Fitting (with Discussion)." Statistical Science 2007;22:477-522. 2. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms. CRC Machine Learning & Pattern Recognition, Chapman & Hall, 2012. 3. Schapire, R.E., Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

134 See Hastie, T., Tibshirani, R., Friedman, J.H. (2009). "10. Boosting and Additive Trees." The Elements of Statistical Learning (2nd ed.). New York: Springer, pp. 337-384.

135 SAMME stands for stagewise additive modeling using a multi-class exponential loss function.

136 For additional details see Newey, W. & Powell, J. (1987). "Asymmetric least squares estimation and testing." Econometrica 55, 819-847.

137 See for example Wei, L.J. "The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis." Statistics in Medicine 1992;11:1871-1879.

138 See Collett, D. Modelling Survival Data in Medical Research, 2nd ed. CRC Press, Boca Raton, 2003.

139 See Akaike, H. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 1974;19:716-723.


Congratulations! You made it to the end. Here are three things you can do next:

1. Pick up your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html

2. Gift a copy of this book to your friends, co-workers, teammates or your entire organization.

3. If you found this book useful and have a moment to spare, I would really appreciate a short review. Your help in spreading the word is gratefully received.

Good luck!

PS: Thanks for allowing me to partner with you on your data analysis journey.

OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips
100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you step by step how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY


They Laughed As They Gave Me The Data To Analyze. But Then They Saw My Charts!
Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY

  • Preface
  • How to Get the Most from this Book
  • I Decision Trees
  • Practical Applications
  • Classification Trees
  • Technique 1 Classification Tree
  • Technique 2 C5.0 Classification Tree
  • Technique 3 Conditional Inference Classification Tree
  • Technique 4 Evolutionary Classification Tree
  • Technique 5 Oblique Classification Tree
  • Technique 6 Logistic Model Based Recursive Partitioning
  • Technique 7 Probit Model Based Recursive Partitioning
  • Regression Trees for Continuous Response Variables
  • Technique 8 Regression Tree
  • Technique 9 Conditional Inference Regression Tree
  • Technique 10 Linear Model Based Recursive Partitioning
  • Technique 11 Evolutionary Regression Tree
  • Decision Trees for Count amp Ordinal Response Data
  • Technique 12 Poisson Decision Tree
  • Technique 13 Poisson Model Based Recursive Partitioning
  • Technique 14 Conditional Inference Ordinal Response Tree
  • Decision Trees for Survival Analysis
  • Technique 15 Exponential Algorithm
  • Technique 16 Conditional Inference Survival Tree
  • II Support Vector Machines
  • Practical Applications
  • Support Vector Classification
  • Technique 17 Binary Response Classification with C-SVM
  • Technique 18 Multicategory Classification with C-SVM
  • Technique 19 Multicategory Classification with nu-SVM
  • Technique 20 Bound-constraint C-SVM classification
  • Technique 21 Weston - Watkins Multi-Class SVM
  • Technique 22 Crammer - Singer Multi-Class SVM
  • Support Vector Regression
  • Technique 23 SVM eps-Regression
  • Technique 24 SVM nu-Regression
  • Technique 25 Bound-constraint SVM eps-Regression
  • Support Vector Novelty Detection
  • Technique 26 One-Classification SVM
  • III Relevance Vector Machine
  • Practical Applications
  • Technique 27 RVM Regression
  • IV Neural Networks
  • Practical Applications
  • Examples of Neural Network Classification
  • Technique 28 Resilient Backpropagation with Backtracking
  • Technique 29 Resilient Backpropagation
  • Technique 30 Smallest Learning Rate
  • Technique 31 Probabilistic Neural Network
  • Technique 32 Multilayer Feedforward Neural Network
  • Examples of Neural Network Regression
  • Technique 33 Resilient Backpropagation with Backtracking
  • Technique 34 Resilient Backpropagation
  • Technique 35 Smallest Learning Rate
  • Technique 36 General Regression Neural Network
  • Technique 37 Monotone Multi-Layer Perceptron
  • Technique 38 Quantile Regression Neural Network
  • V Random Forests
  • Practical Applications
  • Technique 39 Classification Random Forest
  • Technique 40 Conditional Inference Classification Random Forest
  • Technique 41 Classification Random Ferns
  • Technique 42 Binary Response Random Forest
  • Technique 43 Binary Response Random Ferns
  • Technique 44 Survival Random Forest
  • Technique 45 Conditional Inference Survival Random Forest
  • Technique 46 Conditional Inference Regression Random Forest
  • Technique 47 Quantile Regression Forests
  • Technique 48 Conditional Inference Ordinal Random Forest
  • VI Cluster Analysis
  • Practical Applications
  • Partition Based Methods
  • Technique 49 K-Means
  • Technique 50 Clara Algorithm
  • Technique 51 PAM Algorithm
  • Technique 52 Kernel Weighted K-Means
  • Hierarchy Based Methods
  • Technique 53 Hierarchical Agglomerative Cluster Analysis
  • Technique 54 Agglomerative Nesting
  • Technique 55 Divisive Hierarchical Clustering
  • Technique 56 Exemplar Based Agglomerative Clustering
  • Fuzzy Methods
  • Technique 57 Rousseeuw-Kaufman's Fuzzy Clustering Method
  • Technique 58 Fuzzy K-Means
  • Technique 59 Fuzzy K-Medoids
  • Other Methods
  • Technique 60 Density-Based Cluster Analysis
  • Technique 61 K-Modes Clustering
  • Technique 62 Model-Based Clustering
  • Technique 63 Clustering of Binary Variables
  • Technique 64 Affinity Propagation Clustering
  • Technique 65 Exemplar-Based Agglomerative Clustering
  • Technique 66 Bagged Clustering
  • VII Boosting
  • Practical Applications
  • Binary Boosting
  • Technique 67 Ada Boost.M1
  • Technique 68 Real Ada Boost
  • Technique 69 Gentle Ada Boost
  • Technique 70 Discrete L2 Boost
  • Technique 71 Real L2 Boost
  • Technique 72 Gentle L2 Boost
  • Multi-Class Boosting
  • Technique 73 SAMME
  • Technique 74 Breiman's Extension
  • Technique 75 Freund's Adjustment
  • Continuous Response Boosted Regression
  • Technique 76 L2 Regression
  • Technique 77 L1 Regression
  • Technique 78 Robust Regression
  • Technique 79 Generalized Additive Model
  • Technique 80 Quantile Regression
  • Technique 81 Expectile Regression
  • Discrete Response Boosted Regression
  • Technique 82 Logistic Regression
  • Technique 83 Probit Regression
  • Boosted Regression for Count amp Ordinal Response Data
  • Technique 84 Poisson Regression
  • Technique 85 Negative Binomial Regression
  • Technique 86 Hurdle Regression
  • Technique 87 Proportional Odds Model
  • Boosted Models for Survival Analysis
  • Technique 88 Weibull Accelerated Failure Time Model
  • Technique 89 Lognormal Accelerated Failure Time Model
  • Technique 90 Log-logistic Accelerated Failure Time Model
  • Technique 91 Cox Proportional Hazard Model
  • Technique 92 Gehan Loss Accelerated Failure Time Model


About This BookThis jam-packed book takes you under the hood with step by step instruc-tions using the popular and free R predictive analytics package It providesnumerous examples illustrations and exclusive use of real data to help youleverage the power of predictive analytics A book for every data analyststudent and applied researcher Here is what it can do for you

bull BOOST PRODUCTIVITY Bestselling author and data scientist DrND Lewis will show you how to build predictive analytic models in lesstime than you ever imagined possible Even if yoursquore a busy professionalor a student with little time By spending as little as 10 minutes aday working through the dozens of real world examples illustrationspractitioner tips and notes yoursquoll be able to make giant leaps forwardin your knowledge strengthen your business performance broaden yourskill-set and improve your understanding

bull SIMPLIFY ANALYSIS You will discover over 90 easy to follow appliedpredictive analytic techniques that can instantly expand your modelingcapability Plus yoursquoll discover simple routines that serve as a checklist you repeat next time you need a specific model Even better yoursquolldiscover practitioner tips work with real data and receive suggestionsthat will speed up your progress So even if yoursquore completely stressedout by data yoursquoll still find in this book tips suggestions and helpfuladvice that will ease your journey through the data science maze

bull SAVE TIME Imagine having at your fingertips easy access to the verybest of predictive analytics In this book yoursquoll learn fast effectiveways to build powerful models using R It contains over 90 of the mostsuccessful models used for learning from data With step by step in-structions on how to build them easily and quickly

• LEARN FASTER: 92 Applied Predictive Modeling Techniques in R offers a practical, results orientated approach that will boost your productivity, expand your knowledge and create new and exciting opportunities for you to get the very best from your data. The book works because you eliminate the anxiety of trying to master every single mathematical detail. Instead your goal at each step is to simply focus on a single routine, using real data, that only takes about 5 to 15 minutes to complete. Within this routine is a series of actions by which the predictive analytic model is constructed. All you have to do is follow the steps. They are your checklist for use and reuse.

• IMPROVE RESULTS: Want to improve your predictive analytic results, but don't have enough time? Right now there are a dozen ways to instantly improve your predictive models' performance. Odds are, these techniques will only take a few minutes apiece to complete. The problem? You might feel like there's not enough time to learn how to do them all. The solution is in your hands. It uses R, which is free, open-source, and extremely powerful software.

In this rich, fascinating and surprisingly accessible guide, data scientist Dr. ND Lewis reveals how predictive analytics works, and how to deploy its power, using the free and widely available R predictive analytics package. The book serves practitioners and experts alike by covering real life case studies and the latest state-of-the-art techniques. Everything you need to get started is contained within this book. Here is some of what is included:

• Support Vector Machines

• Relevance Vector Machines

• Neural networks

• Random forests

• Random ferns

• Classical Boosting

• Model based boosting

• Decision trees

• Cluster Analysis

For people interested in statistics, machine learning, data analysis, data mining, and future hands-on practitioners seeking a career in the field, it sets a strong foundation, delivers the prerequisite knowledge, and whets your appetite for more. Buy the book today. Your next big breakthrough using predictive analytics is only a page away!


OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips!

100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you, step by step, how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science, and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY!

They laughed as they gave me the data to analyze... But then they saw my charts!

Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY!

Preface

In writing this text my intention was to collect together in a single place practical predictive modeling techniques, ideas and strategies that have been proven to work but which are rarely taught in business schools, data science courses or contained in any other single text.

On numerous occasions, researchers in a wide variety of subject areas have asked "how can I quickly understand and build a particular predictive model?" The answer used to involve reading complex mathematical texts and then programming complicated formulas in languages such as C, C++ and Java. With the rise of R, predictive analytics is now easier than ever. 92 Applied Predictive Modeling Techniques in R is designed to give you rapid access to over ninety of the most popular predictive analytic techniques. It shows you, step by step, how to build each model in the free and popular R statistical package.

The material you are about to read is based on my personal experience, articles I've written, hundreds of scholarly articles I've read over the years, experimentation (some successful, some failed), conversations I've had with data scientists in various fields, and feedback I've received from numerous presentations to people just like you.

This book came out of the desire to put predictive analytic tools in the hands of the practitioner. The material is therefore designed to be used by the applied data scientist whose primary focus is on delivering results rather than mathematical lemmas or statistical theory. Examples of each technique are clearly described and can be typed directly into R as printed on the page.

This book in your hands is an enlarged, revised and updated collection of my previous works on the subject. I've condensed into this volume the best practical ideas available.

Data science is all about extracting meaningful structure from data. It is always a good idea for the data scientist to study how other users and researchers have used a technique in actual practice. This is primarily because practice often differs substantially from the classroom or theoretical text books. To this end, and to accelerate your progress, actual real world applications of the techniques are given at the start of each section.

These illustrative applications cover a vast range of disciplines, incorporating numerous diverse topics such as intelligent shoes, forecasting the stock market, signature authentication, oil sand pump prognostics, detecting deception in speech, electric fish localization, tropical forest carbon mapping, vehicle logo recognition, understanding rat talk and many more. I have also provided detailed references to these applications for further study at the end of each section.

In keeping with the zeitgeist of R, copies of the vast majority of applied articles referenced in this text are available for free.

New users to R can use this book easily and without any prior knowledge. This is best achieved by typing in the examples as they are given and reading the comments which follow. Copies of R and free tutorial guides for beginners can be downloaded at https://www.r-project.org/

I have found over and over that a data scientist who has exposure to a broad range of modeling tools and applications will run circles around the narrowly focused genius who has only been exposed to the tools of their particular discipline.

Greek philosopher Epicurus once said "I write this not for the many, but for you; each of us is enough of an audience for the other." Although the ideas in this book reach out to thousands of individuals, I've tried to keep Epicurus's principle in mind, to have each page you read give meaning to just one person - YOU.

I invite you to put what you read in these pages into action. To help you do that, I've created "12 Resources to Supercharge Your Productivity in R", it is yours for free. Simply go to http://www.auscov.com/tools.html and download it now. It's my gift to you. It shares with you 12 of the very best resources you can use to boost your productivity in R.

I've spoken to thousands of people over the past few years. I'd love to hear your experiences using the ideas in this book. Contact me with your stories, questions and suggestions at Info@NigelDLewis.com

Now it's your turn!

P.S. Don't forget to sign-up for your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html


How to Get the Most from this Book

There are at least three ways to use this book. First, you can dip into it as an efficient reference tool. Flip to the technique you need and quickly see how to calculate it in R. For best results type in the example given in the text, examine the results, and then adjust the example to your own data. Second, browse through the real world examples, illustrations, practitioner tips and notes to stimulate your own research ideas. Third, by working through the numerous examples, you will strengthen your knowledge and understanding of both applied predictive modeling and R.

Each section begins with a brief description of the underlying modeling methodology, followed by a diverse array of real world applications. This is followed by a step by step guide, using real data, for each predictive analytic technique.

PRACTITIONER TIP

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Enter the following:

> install.packages("installr")
> installr::updateR()

If a package mentioned in the text is not installed on your machine you can download it by typing install.packages("package_name"). For example, to download the ada package you would type in the R console:

> install.packages("ada")

Once a package is installed, you must call it. You do this by typing in the R console:

> require(ada)

The ada package is now ready for use. You only need to type this once, at the start of your R session.

PRACTITIONER TIP

You should only download packages from CRAN using encrypted HTTPS connections. This provides much higher assurance that the code you are downloading is from a legitimate CRAN mirror rather than from another server posing as one. Whilst downloading a package from a HTTPS connection you may run into an error message something like:

unable to access index for repository https://cran.rstudio.com

This is particularly common on Windows. The internet2 dll has to be activated on versions before R-3.2.2. If you are using an older version of R, before downloading a new package enter the following:

> setInternet2(TRUE)

Functions in R often have multiple parameters. In the examples in this text I focus primarily on the key parameters required for rapid model development. For information on additional parameters available in a function, type in the R console ?function_name. For example, to find out about additional parameters in the ada function you would type:

?ada

Details of the function and additional parameters will appear in your default web browser. After fitting your model of interest you are strongly encouraged to experiment with additional parameters. I have also included the set.seed method in the R code samples throughout this text to assist you in reproducing the results exactly as they appear on the page. R is available for all the major operating systems. Due to the popularity of Windows, examples in this book use the Windows version of R.


PRACTITIONER TIP

Can't remember what you typed two hours ago? Don't worry, neither can I. Provided you are logged into the same R session you simply need to type:

> history(Inf)

It will return your entire history of entered commands for your current session.

You don't have to wait until you have read the entire book to incorporate the ideas into your own analysis. You can experience their marvelous potency for yourself almost immediately. You can go straight to the technique of interest and immediately test, create and exploit it in your own research and analysis.

PRACTITIONER TIP

On 32-bit Windows machines, R can only use up to 3Gb of RAM, regardless of how much you have installed. Use the following to check memory availability:

> memory.limit()

To remove all objects from memory:

rm(list=ls())

Applying the ideas in this book will transform your data science practice. If you utilize even one tip or idea from each chapter, you will be far better prepared not just to survive but to excel when faced by the challenges and opportunities of the ever expanding deluge of exploitable data.

Now let's get started!


Part I

Decision Trees


The Basic Idea

We begin with decision trees because they are one of the most popular techniques in data mining1. They can be applied to both regression and classification problems. Part of the reason for their popularity lies in the ability to present results in a simple, easy to understand tree format. True to its name, the decision tree selects an outcome by descending a tree of possible decisions.

NOTE

Decision trees are, in general, a non-parametric inductive learning technique, able to produce classifiers for a given problem which can assess new, unseen situations and/or reveal the mechanisms driving a problem.

An Illustrative Example

It will be helpful to build intuition by first looking at a simple example of what this technique does with data. Imagine you are required to build an automatic rules based system to buy cars. The goal is to make decisions as new vehicles are presented to you. Let us say you have access to data on the attributes (also known as features): Road tested miles driven, Price of vehicle, Likability of the current owner (measured on a continuous scale from 0 to 100, 100 = "love them"), Odometer miles, Age of the vehicle in years.

A total of 100 measurements are obtained on each variable, and also on the decision (yes or no to purchase the vehicle). You run this data through a decision tree algorithm and it produces the tree shown in Figure 1. Several things are worth pointing out about this decision tree. First, the number of observations falling in "yes" and "no" is reported. For example, in "Road tested miles < 100" we see there are 71 observations. Second, the tree is immediately able to model new data using the rules developed. Third, it did not use all of the variables to develop a decision rule (Likability of the current owner and Price were excluded).

Let us suppose that this tree classified 80% of the observations correctly. It is still worth investigating whether a more parsimonious and more accurate tree can be obtained. One way to achieve this is to transform the variables. Let us suppose we create the additional variable Odometer/Age and then rebuild the tree. The result is shown in Figure 2. It turns out that this decision tree, which chooses only the transformed data, ignores all the other attributes. Let us assume this tree has a prediction accuracy of 90%. Two important points become evident. First, a more parsimonious and accurate tree was possible and second, to obtain this tree it was necessary to include the variable transformations in the second run of the decision tree algorithm.

The example illustrates that the decision tree must have the variables supplied in the appropriate form to obtain the most parsimonious tree. In practice, domain experts will often advise on the appropriate transformation of attributes.

Figure 1 Car buying decision tree


Figure 2 Decision tree obtained by transforming variables

PRACTITIONER TIP

I once developed what I thought was a great statistical model for an area I knew little about. Guess what happened? It was a total flop. Why? Because I did not include domain experts in my design and analysis phase. If you are building a decision tree for knowledge discovery, it is important to include domain experts alongside you and throughout the entire analysis process. Their input will be required to assess the final decision tree and opine on "reasonability".

Another advantage of using domain experts is that the complexity of the final decision tree (in terms of the number of nodes or the number of rules that can be extracted from a tree) may be reduced. Inclusion of domain experts almost always helps the data scientist create a more efficient set of rules.

Decision Trees in Practice

The basic idea behind a decision tree is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. At each internal node in the tree, a single attribute value is compared with a single threshold value. In other words, each node corresponds to a "measurement" of a particular attribute, that is, a question, often of the "yes" or "no" variety, which can be asked about that attribute's value (e.g. "is age less than 48 years?"). One of the two child nodes is then selected based on the result of the comparison, and hence another measurement, or a leaf, is reached.

When a leaf node is reached, the single class associated with that node is the final prediction. In other words, the terminal nodes carry the information required to classify the data.

A real world example of decision trees is shown in Figure 3; they were developed by Koch et al2 for understanding biological cellular signaling networks. Notice that four trees of increasing complexity are developed by the researchers.

Figure 3: Four decision trees developed by Koch et al. Note that decision trees (A), (B), (C) and (D) have a misclassification error of 30.85%, 24.68%, 21.17% and 18.13% respectively. However, tree (A) is the easiest to interpret. Source: Koch et al.

The Six Advantages of Decision Trees

1. Decision trees can be easy-to-understand, with intuitively clear rules understandable to domain experts.

2. Decision trees offer the ability to track and evaluate every step in the decision-making process. This is because each path through a tree consists of a combination of attributes which work together to distinguish between classes. This simplicity gives useful insights into the inner workings of the method.

3. Decision trees can handle both nominal and numeric input attributes and are capable of handling data sets that contain misclassified values.

4. Decision trees can easily be programmed for use in real time systems. A great illustration of this is the research of Hailemariam et al3 who use a decision tree to determine real time building occupancy.

5. They are relatively inexpensive computationally and work well on both large and small data sets. Figure 4 illustrates an example of a very large decision tree used in Bioinformatics4; the smaller tree represents the tuned version for greater readability.

6. Decision trees are considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and on the classifier structure.

Figure 4: An example of a large Bioinformatics decision tree and the visually tuned version, from the research of Stiglic et al.

NOTE

One weakness of decision trees is the risk of over fitting. This occurs when statistically insignificant patterns end up influencing classification results. An over fitted tree will perform poorly on new data. Bohanec and Bratko5 studied the role of pruning a decision tree for better decision making. They found that pruning can reduce the risk of over fitting because it results in smaller decision trees that exploit fewer attributes.

How Decision Trees Work

Exact implementation details differ somewhat depending on the algorithm used, but the general principles are very similar across methods and contain the following steps:

1. Provide a set (S) of examples with known classified states. This is called the learning set.

2. Select a set of test attributes a1, a2, ..., aN. These can be viewed as the parameters of the system and are selected because they contain essential information about the problem of concern. In the car example on page 6 the attributes were: a1 = Road tested miles driven, a2 = Price of vehicle, a3 = Likability of the current owner, a4 = Odometer miles, a5 = Age of the vehicle in years.

3. Starting at the top node of the tree (often called the root node) with the entire set of examples S, split S using a test on one or more attributes. The goal is to split S into subsets of increasing classification purity.

4. Check the results of the split. If every partition is pure, in the sense that all examples in the partition belong to the same class, then stop. Label each leaf node with the name of the class.

5. Recursively split any partitions that are not "pure".

6. The procedure is stopped when all the newly created nodes are 'terminal' ones, containing "pure enough" learning subsets.

Decision tree algorithms vary primarily in how they choose to "split" the data, when to stop splitting, and how they prune the trees they produce.

NOTE

Many of the decision tree algorithms you will encounter are based on a greedy top-down recursive partitioning strategy for tree growth. They use different variants of impurity measures, such as information gain6, gain ratio7, gini-index8 and distance-based measures9.
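To make the idea of an impurity measure concrete, here is a small R sketch, added for illustration and not part of the original text (the helper names node_entropy and node_gini are illustrative only), which computes the entropy and gini-index of the class labels reaching a node:

node_entropy <- function(classes) {
  p <- prop.table(table(classes))    # class proportions at the node
  -sum(p * log2(p))                  # entropy in bits
}

node_gini <- function(classes) {
  p <- prop.table(table(classes))
  1 - sum(p^2)                       # gini impurity
}

# Example: a node containing 7 "yes" and 3 "no" cases
labels <- c(rep("yes", 7), rep("no", 3))
node_entropy(labels)    # approximately 0.881
node_gini(labels)       # 0.42

A pure node scores zero on both measures; candidate splits are compared by how much they reduce the impurity of the resulting partitions.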


Practical Applications

Intelligent Shoes for Stroke Victims

Zhang et al10 develop a wearable shoe (SmartShoe) to monitor physical activity in stroke patients. The data set consisted of 12 patients who had experienced a stroke.

Supervised by a physical therapist, SmartShoe collected data from the patients using eight posture and activity groups: sitting, standing, walking, ascending stairs, descending stairs, cycling on a stationary bike, being pushed in a wheelchair, and propelling a wheelchair.

Patients performed each activity for between 1 and 3 minutes. Data was collected from the SmartShoe every 2 seconds, from which feature vectors were computed.

Half the feature vectors were selected at random for the training set. The remainder were used for validation. The C5.0 algorithm was used to build the decision tree.

The researchers constructed both subject specific and group decision trees. The group models were developed using Leave-One-Out cross-validation.

The performance results for five of the patients are shown in Table 1. As might be expected, the individual models fit better than the group models. For example, for patient 5 the accuracy of the patient specific tree was 98.4%; however, using the group tree the accuracy declined to 75.5%. This difference in performance might be in part due to over fitting of the individual specific tree. The group models were trained using data from multiple subjects and therefore can be expected to have lower overall performance scores.

Patient       1      2      3      4      5      Average
Individual    96.5   97.4   99.8   97.2   98.4   97.9
Group         87.5   91.1   64.7   82.2   75.5   80.2

Table 1: Zhang et al's decision tree performance metrics (accuracy, %)


PRACTITIONER TIP

Decision trees are often validated by calculating sensitivity and specificity. Sensitivity is the ability of the classifier to identify positive results, while specificity is the ability to distinguish negative results.

Sensitivity = N_TP / (N_TP + N_FN) × 100    (1)

Specificity = N_TN / (N_TN + N_FP) × 100    (2)

N_TP is the number of true positives, N_TN is the number of true negatives, N_FN is the number of false negatives and N_FP is the number of false positives.
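As a quick illustration, added here and not part of the original tip, these two measures can be computed from a two-class confusion matrix in R (the object names observed, predicted and cm are illustrative):

# Confusion matrix with rows = observed class, columns = predicted class
observed  <- factor(c("yes","yes","no","no","yes","no","yes","no"), levels = c("yes","no"))
predicted <- factor(c("yes","no","no","no","yes","yes","yes","no"), levels = c("yes","no"))
cm <- table(Observed = observed, Predicted = predicted)

TP <- cm["yes","yes"]; FN <- cm["yes","no"]
TN <- cm["no","no"];   FP <- cm["no","yes"]

sensitivity <- 100 * TP / (TP + FN)    # true positive rate
specificity <- 100 * TN / (TN + FP)    # true negative rate
sensitivity; specificity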

Micro Ribonucleic Acid

MicroRNAs (miRNAs) are non-protein coding Ribonucleic acids (RNAs) that attenuate protein production in P bodies11. Williams et al12 develop a MicroRNA decision tree. For the training set the researchers used known miRNAs from various plant species for positive controls and non-miRNA sequences as negative controls.

The typical size of their training set consisted of 5294 cases using 29 attributes. The model was validated by calculating sensitivity and specificity based on leave-one-out cross-validation.

After training, the researchers focus on attribute usage information. Table 2 shows the top ten attribute usage for a typical training run. The researchers report that other training runs show similar usage. The values represent the percentage of sequences that required that attribute for classification. Several attributes, such as DuplexEnergy, minMatchPercent and C content, are required for all sequences to be classified. Note that G and C are directly related to the stability of the duplex. Sensitivity and specificity was as high as 84.08% and 98.53% respectively.

Usage   Attribute
100%    G
100%    C
100%    T
100%    DuplexEnergy
100%    minMatchPercent
100%    DeltaGnorm
100%    G + T
100%    G + C
98%     duplexEnergyNorm
86%     NormEnergyRatio

Table 2: Top ten attribute usage for one training run of the classifier reported by Williams et al.

An interesting question is: if all miRNAs in each taxonomic category studied by the researchers are systematically excluded from training, while including all others, how well does the predictor do when tested on the excluded category? Table 3 provides the answer. The ability to correctly identify known miRNAs ranged from 78% for the Salicaceae to 100% for seven of the groups shown in Table 3. The researchers conclude by stating "We have shown that a highly accurate universal plant miRNA predictor can be produced by machine learning using C5.0".

Taxonomic Group    % Correctly classified    % of full set excluded
Embryophyta          94                        9.16
Lycopodiophyta      100                        2.65
Brassicaceae        100                       20.22
Caricaceae          100                        0.05
Euphorbiaceae       100                        0.34
Fabaceae            100                       27.00
Salicaceae           78                        3.52
Solanaceae           93                        0.68
Vitaceae             94                        4.29
Rutaceae            100                        0.43
Panicoideae          95                        8.00
Poaceae             100                       19.48
Pooideae             80                        3.18

Table 3: Results from exclusion of each of the 13 taxonomic groups by Williams et al.


NOTE

A decision tree produced by an algorithm is usually not optimal in the sense of statistical performance measures such as the log-likelihood, squared errors and so on. It turns out that finding the "optimal tree", if one exists, is computationally intractable (or NP-hard, technically speaking).

Acute Liver Failure

Nakayama et al13 use decision trees for the prognosis of acute liver failure (ALF) patients. The data set consisted of 1022 ALF patients seen between 1998 and 2007 (698 patients seen between 1998 and 2003, and 324 patients seen between 2004 and 2007).

Measurements on 73 medical attributes, at the onset of hepatic encephalopathy14 and 5 days later, were collected from 371 of the 698 patients seen between 1998 and 2003.

Two decision trees were built. The first was used to predict (using 5 attributes) the outcome of the patient at the onset of hepatic encephalopathy. The second decision tree was used to predict (using 7 attributes) the outcome at 5 days after the onset of grade II or more severe hepatic encephalopathy. The decision trees were validated using data from 160 of the 324 patients seen between 2004 and 2007. Decision tree performance is shown in Table 4.

                                   Decision Tree I          Decision Tree II
                                   Outcome at the onset     Outcome at 5 days
Accuracy (patients 1998-2003)          79.0%                    83.6%
Accuracy (patients 2004-2007)          77.6%                    82.6%

Table 4: Nakayama et al's decision tree performance metrics


NOTE

The performance of a decision tree is often measured in terms of three characteristics:

• Accuracy - The percentage of cases correctly classified.

• Sensitivity - The percentage of cases correctly classified as belonging to class A among all observations known to belong to class A.

• Specificity - The percentage of cases correctly classified as belonging to class B among all observations known to belong to class B.

Traffic Accidents

de Oña et al15 investigate the use of decision trees for analyzing road accidents on rural highways in the province of Granada, Spain. Regression-type generalized linear models, Logit models and Probit models have been the techniques most commonly used to conduct such analyses16.

Three decision tree models are developed (CART, ID3 and C4.5) using data collected from 2003 to 2009. Nineteen independent variables, reported in Table 5, are used to build the decision trees.

Accident type, Age, Atmospheric factors, Safety barriers, Cause, Day of week, Lane width, Lighting, Month, Number of injuries, Number of occupants, Paved shoulder, Pavement width, Pavement markings, Gender, Shoulder type, Sight distance, Time, Vehicle type

Table 5: Variables used from police accident reports by de Oña et al.

The accuracy results of their analysis are shown in Table 6. Overall, the decision trees showed modest improvement over chance.

CART    C4.5    ID3
55.87   54.16   52.72

Table 6: Accuracy results (percentage) reported by de Oña et al.

13 PRACTITIONER TIP

Even though de Oña et al tested three different decision tree algorithms (CART, ID3 and C4.5), their results led to a very modest improvement over chance. This will also happen to you on very many occasions. Rather than clinging on to a technique (because it is the latest technique, or the one you happen to be most familiar with), the professional data scientist seeks out and tests alternative methods. In this text you have over ninety of the best applied modeling techniques at your fingertips. If decision trees don't "cut it", try something else.


Electrical Power Losses

A non-technical loss (NTL) is defined by electrical power companies as any consumed electricity which is not billed. This could be because of measurement equipment failure or fraud. Traditionally, power utilities have monitored NTL by making in situ inspections of equipment, especially for those customers that have very high or close to zero levels of energy consumption. Monedero et al17 develop a CART decision tree to enhance the detection rate of NTL.

A sample of 38,575 customer accounts was collected over two years in Catalonia, Spain. For each customer various indicators were measured18.

The best decision tree had a depth of 5, with the terminal node identifying customers with the highest likelihood of NTL. A total of 176 customers were identified by the model. This number was greater than the expected total of 85, and too many for the utility company to inspect in situ. The researchers therefore merged their results with a Bayesian network model and thereby reduced the estimated number to 64.

This example illustrates the important point that the predictive model is just one aspect that goes into real world decisions. Oftentimes a model will be developed but deemed impracticable by the end user.

NOTE

Cross-validation refers to a technique used to allow for the training and testing of inductive models. Williams et al used leave-one-out cross-validation. Leave-one-out cross-validation involves taking out one observation from your sample and training the model with the rest. The predictor just trained is applied to the excluded observation. One of two possibilities will occur: the predictor is correct on the previously unseen control, or not. The removed observation is then returned, the next observation is removed, and again training and testing are done. This process is repeated for all observations. When this is completed, the results are used to calculate the accuracy of the model, often measured in terms of sensitivity and specificity.
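The sketch below, added for illustration only, shows the leave-one-out idea using a classification tree from the tree package on R's built-in iris data; the object name correct is illustrative:

library(tree)
data(iris)

n <- nrow(iris)
correct <- logical(n)

for (i in 1:n) {
  fit_i  <- tree(Species ~ ., data = iris[-i, ])                       # train without case i
  pred_i <- predict(fit_i, newdata = iris[i, , drop = FALSE], type = "class")
  correct[i] <- (pred_i == iris$Species[i])                            # test on the held-out case
}

mean(correct)    # leave-one-out estimate of classification accuracy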


Tracking Tetrahymena Pyriformis Cells

Tracking and matching individual biological cells in real time is a challenging task. Since decision trees provide an excellent tool for real time and rapid decision making, they have potential in this area. Wang et al19 consider the issue of real time tracking and classification of Tetrahymena Pyriformis cells. The issue is whether a cell in the current video frame is the same cell in the previous video frame.

A 23-dimensional feature vector is developed, and whether two regions in different frames represent the same cell is manually determined. The training set consisted of 1000 frames from 2 videos. The videos were captured at 8 frames per second, with each frame an 8-bit gray level image. The researchers develop two decision trees, the first (T1) trained using the feature vector and manual classification, and the second (T2) trained using a truncated set of features. The error rates (%) for each tree by tree depth are reported in Table 7. Notice that in this case T1 substantially outperforms T2, indicating the importance of using the full feature set.

Tree Depth    T1      T2
5             1.48    14.02
8             1.37    12.49
10            1.55    13.61

Table 7: Error rates (%) by tree depth for T1 and T2 reported by Wang et al.

13 PRACTITIONER TIP

Cross-validation is often used to prevent over-fitting a model to the data. In n-fold cross-validation we first divide the training set into n subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining (n-1) subsets. Thus, each instance of the whole training set is predicted once. The advantage of this method over a random selection of training samples is that all observations are used for either training (n-1 times) or evaluation (once). Cross-validation accuracy is measured as the percentage of data that are correctly classified.
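A minimal sketch of the n-fold idea, added here for illustration (random fold assignment via sample is one common approach; the names k, folds and acc are illustrative):

library(tree)
data(iris)

set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))    # random fold labels

acc <- numeric(k)
for (j in 1:k) {
  test   <- which(folds == j)
  fit_j  <- tree(Species ~ ., data = iris[-test, ])                    # train on the other folds
  pred_j <- predict(fit_j, newdata = iris[test, ], type = "class")
  acc[j] <- mean(pred_j == iris$Species[test])                         # accuracy on the held-out fold
}
mean(acc)    # cross-validation estimate of accuracy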


Classification Trees


Technique 1

Classification Tree

A classification tree can be built using the package tree with the tree function:

tree(z ~ ., data, split)

Key parameters include split, which controls whether deviance or gini is used as the splitting criterion; z, the data frame of classes; and data, the data set of attributes with which you wish to build the tree.

13 PRACTITIONER TIP

To obtain information on which version of R you are running, loaded packages and other information, use:

> sessionInfo()

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(tree)
> library(mlbench)
> data(Vehicle)

NOTE

The Vehicle data set20 was collected to classify a Corgi vehicle silhouette as one of four types (double decker bus, Cheverolet van, Saab 9000 and Opel Manta 400). The data frame contains 846 observations on 18 numerical features extracted from the silhouettes, and one nominal variable defining the class of the objects (see Table 8).

You can access the features directly, using Table 8 as a reference. For example, a summary of the Comp and Circ features is obtained by typing:

> summary(Vehicle[1])
      Comp       
 Min.   : 73.00  
 1st Qu.: 87.00  
 Median : 93.00  
 Mean   : 93.68  
 3rd Qu.:100.00  
 Max.   :119.00  

> summary(Vehicle[2])
      Circ      
 Min.   :33.00  
 1st Qu.:40.00  
 Median :44.00  
 Mean   :44.86  
 3rd Qu.:49.00  
 Max.   :59.00  

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Index   R name          Description
1       Comp            Compactness
2       Circ            Circularity
3       D.Circ          Distance Circularity
4       Rad.Ra          Radius ratio
5       Pr.Axis.Ra      pr.axis aspect ratio
6       Max.L.Ra        max.length aspect ratio
7       Scat.Ra         scatter ratio
8       Elong           elongatedness
9       Pr.Axis.Rect    pr.axis rectangularity
10      Max.L.Rect      max.length rectangularity
11      Sc.Var.Maxis    scaled variance along major axis
12      Sc.Var.maxis    scaled variance along minor axis
13      Ra.Gyr          scaled radius of gyration
14      Skew.Maxis      skewness about major axis
15      Skew.maxis      skewness about minor axis
16      Kurt.maxis      kurtosis about minor axis
17      Kurt.Maxis      kurtosis about major axis
18      Holl.Ra         hollows ratio
19      Class           type: bus, opel, saab, van

Table 8: Attributes and class labels for the Vehicle silhouettes data set


Step 3: Estimate the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- tree(Class ~ ., data = Vehicle[train,], split = "deviance")

We use deviance as the splitting criterion; a common alternative is to use split = "gini". You may be surprised by just how quickly R builds the tree.

PRACTITIONER TIP

It is important to remember that the response variable is a factor. This actually trips up quite a few users who have numeric categories and forget to convert their response variable using factor(). To see the levels of the response variable type:

> attributes(Vehicle$Class)
$levels
[1] "bus"  "opel" "saab" "van"

$class
[1] "factor"

For Vehicle, each level is associated with a different vehicle type (bus, opel, saab, van).

To see details of the fitted tree type:

> fit

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 500 1386.000 saab ( 0.248000 0.254000 0.258000 0.240000 )
    2) Elong < 41.5 229 489.100 opel ( 0.222707 0.410480 0.366812 0.000000 )
...
   14) Skew.Maxis < 64.5 7 9.561 van ( 0.000000 0.000000 0.428571 0.571429 ) *
   15) Skew.Maxis > 64.5 64 10.300 van ( 0.015625 0.000000 0.000000 0.984375 ) *

At each branch of the tree (after root) we see, in order:

1. The branch number (e.g. in this case 1, 2, 14 and 15);

2. the split (e.g. Elong < 41.5);

3. the number of samples going along that split (e.g. 229);

4. the deviance associated with that split (e.g. 489.1);

5. the predicted class (e.g. opel);

6. the associated probabilities (e.g. ( 0.222707 0.410480 0.366812 0.000000 ));

7. and, for a terminal node (or leaf), the symbol *.

13 PRACTITIONER TIP

If the minimum deviance occurs with a tree with 1 node, then your model is at best no better than random. It is even possible that it may be worse.

A summary of the tree can also be obtained:

> summary(fit)

Classification tree:
tree(formula = Class ~ ., data = Vehicle[train, ], split = "deviance")
Variables actually used in tree construction:
 [1] "Elong"        "Max.L.Ra"     "Comp"         "Pr.Axis.Ra"   "Sc.Var.maxis"
 [6] "Max.L.Rect"   "D.Circ"       "Skew.maxis"   "Circ"         "Kurt.Maxis"  
[11] "Skew.Maxis"  
Number of terminal nodes:  15 
Residual mean deviance:  0.9381 = 455 / 485 
Misclassification error rate: 0.232 = 116 / 500 

Notice that summary(fit) shows:

1. The type of tree, in this case a Classification tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 15;

5. the residual mean deviance - 0.9381;

6. the misclassification error rate 0.232, or 23.2%.

We plot the tree, see Figure 1.1:

> plot(fit)
> text(fit)

Figure 1.1: Fitted Decision Tree

13 PRACTITIONER TIP

The height of the vertical lines in Figure 1.1 is proportional to the reduction in deviance. The longer the line, the larger the reduction. This allows you to identify the important sections immediately. If you wish to plot the model using uniform lengths, use plot(fit, type = "uniform").


Step 4: Assess Model

Unfortunately, classification trees have a tendency to over-fit the data. One approach to reduce this risk is to use cross-validation. For each hold out sample we fit the model and note at what level the tree gives the best results (using deviance or the misclassification rate). Then we hold out a different sample and repeat. This can be carried out using the cv.tree() function. We use a leave-one-out cross-validation using the misclassification rate and deviance (FUN = prune.misclass, followed by FUN = prune.tree).

NOTE

Textbooks and academics used to spend an inordinate amount of time on the subject of when to stop splitting a tree, and also on pruning techniques. This is indeed an important aspect to consider when building a single tree, because if the tree is too large it will tend to over fit the data. If the tree is too small it might miss important characteristics of the relationship between the covariates and the outcome. In actual practice I do not spend a great deal of time on deciding when to stop splitting a tree, or even pruning. This is partly because:

1. A single tree is generally only of interest to gain insight about the data if it can be easily interpreted. The default settings in R decision tree functions often are sufficient to create such trees.

2. For use in "pure" prediction activities, random forests (see page 272) have largely replaced individual decision trees because they often produce more accurate predictive models.

The results are plotted out side by side in Figure 1.2. The jagged lines show where the minimum deviance / misclassification occurred with the cross-validated tree. Since the cross validated misclassification and deviance both reach their minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree.

> fitMcv <- cv.tree(fit, K = 346, FUN = prune.misclass)
> fitPcv <- cv.tree(fit, K = 346, FUN = prune.tree)

> par(mfrow = c(1, 2))
> plot(fitMcv)
> plot(fitPcv)

Figure 1.2: Cross validation results on Vehicle using misclassification and deviance
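Had the cross-validation curves pointed to a smaller tree, it could be pruned to that size. The lines below are an added illustration (the object name best_size is illustrative), using the fitMcv object created above:

> best_size <- fitMcv$size[which.min(fitMcv$dev)]
> fit_pruned <- prune.misclass(fit, best = best_size)
> plot(fit_pruned); text(fit_pruned)
> summary(fit_pruned)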

Step 5: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 32.4%.


> pred <- predict(fit, newdata = Vehicle[-train, ])

> pred.class <- colnames(pred)[max.col(pred, ties.method = c("random"))]

> table(Vehicle$Class[-train], pred.class, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   86    1    3   4
          opel   1   55   20   9
          saab   4   55   23   6
          van    2    2    5  70

> error_rate = (1 - sum(pred.class == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.324
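As an added illustration (not part of the original example), per-class accuracy can be read off the confusion matrix; the object name cm is illustrative:

> cm <- table(Vehicle$Class[-train], pred.class)
> round(diag(cm) / rowSums(cm), 3)    # proportion of each observed class predicted correctly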


Technique 2

C5.0 Classification Tree

The C5.0 algorithm21 is based on the concepts of entropy, the measure of disorder in a sample, and the information gain of each attribute. Information gain is a measure of the effectiveness of an attribute in reducing the amount of entropy in the sample.

It begins by calculating the entropy of a data sample. The next step is to calculate the information gain for each attribute. This is the expected reduction in entropy achieved by partitioning the data set on the given attribute. From the set of information gain values the best attributes for partitioning the data set are chosen, and the decision tree is built.
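To make the entropy and information gain calculations concrete, here is a short R sketch added for illustration (the helper names entropy and info_gain are illustrative, and the toy play/weather data is invented):

# Entropy of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}

# Information gain from splitting the labels y by a categorical attribute x
info_gain <- function(y, x) {
  w    <- prop.table(table(x))              # share of cases in each branch
  cond <- sapply(split(y, x), entropy)      # entropy within each branch
  entropy(y) - sum(w * cond)                # expected reduction in entropy
}

# Toy example: how much does "weather" reduce uncertainty about "play"?
play    <- c("yes","yes","no","no","yes","no","yes","yes")
weather <- c("sun","sun","rain","rain","sun","rain","overcast","overcast")
info_gain(play, weather)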

A C5.0 classification tree can be built using the package C50 with the C5.0 function:

C5.0(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(C50)
> library(mlbench)
> data(Vehicle)


Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- C5.0(Class ~ ., data = Vehicle[train,])

Next we assess variable importance using the C5imp function:

> C5imp(fit)
             Overall
Max.L.Ra       100.0
Elong          100.0
Comp            50.2
Circ            46.2
Skew.maxis      45.2
Scat.Ra         40.0
Max.L.Rect      29.2
Ra.Gyr          23.4
D.Circ          23.2
Skew.Maxis      20.2
Pr.Axis.Rect    17.0
Kurt.maxis      12.2
Pr.Axis.Ra       9.0
Rad.Ra           3.0
Holl.Ra          3.0
Sc.Var.maxis     0.0
Kurt.Maxis       0.0

We observe that Max.L.Ra and Elong are the two most influential attributes. The attributes Sc.Var.maxis and Kurt.Maxis are the least influential variables, with an influence score of zero.

PRACTITIONER TIP

To assess the importance of attributes by split use:

> C5imp(fit, metric = "splits")

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 27.5%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   84    1    5   4
          opel   0   48   31   6
          saab   3   38   47   0
          van    2    1    4  72

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.275


PRACTITIONER TIP

To view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

> head(round(pred, 3))
     bus  opel  saab   van
5  0.018 0.004 0.004 0.974
7  0.050 0.051 0.852 0.048
10 0.011 0.228 0.750 0.010
12 0.031 0.032 0.782 0.155
15 0.916 0.028 0.029 0.027
19 0.014 0.070 0.903 0.013
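The C5.0 function also supports adaptive boosting through its trials argument. The lines below are an added illustration (not part of the original example) that fit a boosted ensemble of 10 trees and recompute the error rate on the validation sample:

> fit_boost <- C5.0(Class ~ ., data = Vehicle[train,], trials = 10)
> pred_boost <- predict(fit_boost, newdata = Vehicle[-train, ], type = "class")
> round(1 - sum(pred_boost == Vehicle$Class[-train]) / 346, 3)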


Technique 3

Conditional Inference Classification Tree

A conditional inference classification tree is a non-parametric regression tree embedding tree-structured regression models. It is essentially a decision tree, but with extra information about the distribution of classes in the terminal nodes22. It can be built using the package party with the ctree function:

ctree(z ~ ., data)

Key parameters include z, the data frame of classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(party)
> library(mlbench)
> data(Vehicle)

Step 2 is outlined on page 23

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. This is achieved by entering:

> fit <- ctree(Class ~ ., data = Vehicle[train,], controls = ctree_control(maxdepth = 2))

Notice we use controls with the maxdepth parameter to limit the depth of the tree to at most 2. Note that maxdepth = 0, the default, places no restriction on the depth of the tree.
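Several other stopping rules can be adjusted through ctree_control. The lines below are an added illustration (the parameter values are arbitrary choices, and fit_strict is an illustrative name):

> fit_strict <- ctree(Class ~ ., data = Vehicle[train,],
       controls = ctree_control(mincriterion = 0.99, minsplit = 40, minbucket = 20))
> plot(fit_strict)

Here mincriterion = 0.99 requires 1 - p-value to exceed 0.99 before a split is made, while minsplit and minbucket control the minimum number of observations in a node before and after splitting.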

Next we plot the tree:

> plot(fit)

The resultant tree is shown in Figure 3.1. At each internal node a p-value is reported for the split. In this case they are all highly significant (less than 1%). The primary split takes place at Elong <= 41 / Elong > 41. The four terminal nodes are labeled Node 3, Node 4, Node 6 and Node 7, with 62, 167, 88 and 183 observations respectively. Each of these leaf nodes also has a bar chart illustrating the proportion of the four vehicle types that fall into each class at that node.

Figure 3.1: Conditional Inference Classification Tree for Vehicle with maxdepth = 2

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes; we show the confusion matrix and calculate the error rate of fit. The misclassification rate is approximately 53% for this tree.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")


> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   36    0   11  47
          opel   3   50   24   8
          saab   6   58   19   5
          van    0    0   21  58

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.529

PRACTITIONER TIP

Set the parameter type = "node" to see which nodes the observations end up in, and type = "prob" to view the probabilities. For example, to see the distribution for the validation sample type:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")
> table(pred)
pred
  3   4   6   7 
 45 108  75 118 

We see that 45 observations ended up in node 3 and 118 in node 7.

To assess the predictive power of fit we compare it against two other conditional inference classification trees - fit3, which limits the maximum tree depth to 3, and fitu, which estimates an unrestricted tree.

> fit3 <- ctree(Class ~ ., data = Vehicle[train,], controls = ctree_control(maxdepth = 3))

> fitu <- ctree(Class ~ ., data = Vehicle[train,])

We use the validation data set and the fitted decision trees to predict vehicle classes:


> pred3 <- predict(fit3, newdata = Vehicle[-train, ], type = "response")

> predu <- predict(fitu, newdata = Vehicle[-train, ], type = "response")

Next we calculate the error rate of each fitted tree:

> error_rate3 = (1 - sum(pred3 == Vehicle$Class[-train]) / 346)

> error_rateu = (1 - sum(predu == Vehicle$Class[-train]) / 346)

> tree_1 <- round(error_rate, 3)
> tree_2 <- round(error_rate3, 3)
> tree_3 <- round(error_rateu, 3)

Finally, we calculate the misclassification error rate for each fitted tree:

> err <- cbind(tree_1, tree_2, tree_3) * 100
> rownames(err) <- "error (%)"

> err
          tree_1 tree_2 tree_3
error (%)   52.9   41.6     37

The unrestricted tree has the lowest misclassification error rate of 37%.


Technique 4

Evolutionary Classification Tree

NOTE

The recursive partitioning methods discussed in previous sections build the decision tree using a forward stepwise search, where splits are chosen to maximize homogeneity at the next step only. Although this approach is known to be an efficient heuristic, the results are only locally optimal. Evolutionary algorithm based trees search over the parameter space of trees using a global optimization method.
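The evolutionary search itself is tuned through evtree.control. The following lines are an added sketch, assuming the package's ntrees (population size) and niterations (number of search iterations) arguments; the values shown are arbitrary:

library(evtree)

# A larger population and more iterations give a more thorough, but slower, search
ctrl <- evtree.control(maxdepth = 4, ntrees = 200L, niterations = 5000L)

# Once the Vehicle training sample has been created (Steps 1 and 2 below):
# fit_ev <- evtree(Class ~ ., data = Vehicle[train,], control = ctrl)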

An evolutionary classification tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data)

Key parameters include the response variable z, which contains the classes, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(evtree)
> library(mlbench)
> data(Vehicle)


Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. We will use the observations associated with train to build the classification tree:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate and Assess the Decision Tree

Now we are ready to build the decision tree using the training sample. We use evtree to build the tree and plot to display the tree. We restrict the tree depth using maxdepth = 2.

> fit <- evtree(Class ~ ., data = Vehicle[train,], control = evtree.control(maxdepth = 2))

> plot(fit)

Figure 4.1: Fitted Evolutionary Classification Tree using Vehicle

The tree shown in Figure 4.1 visualizes the decision rules. It also shows, for the terminal nodes, the number of observations and their distribution amongst the classes. Let us take node 3 as an illustration. This node is reached by the rule Max.L.Ra < 8 and Sc.Var.maxis < 290. The node contains 64 observations. Similar details can be obtained by typing fit.

> fit

Model formula:
Class ~ Comp + Circ + D.Circ + Rad.Ra + Pr.Axis.Ra + Max.L.Ra +
    Scat.Ra + Elong + Pr.Axis.Rect + Max.L.Rect + Sc.Var.Maxis +
    Sc.Var.maxis + Ra.Gyr + Skew.Maxis + Skew.maxis + Kurt.maxis +
    Kurt.Maxis + Holl.Ra

Fitted party:
[1] root
|   [2] Max.L.Ra < 8
|   |   [3] Sc.Var.maxis < 290: van (n = 64, err = 45.3%)
|   |   [4] Sc.Var.maxis >= 290: bus (n = 146, err = 27.4%)
|   [5] Max.L.Ra >= 8
|   |   [6] Sc.Var.maxis < 389: van (n = 123, err = 30.9%)
|   |   [7] Sc.Var.maxis >= 389: opel (n = 167, err = 47.3%)

Number of inner nodes:    3
Number of terminal nodes: 4

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 39%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   85    0    0   9
          opel  14   48    0  23
          saab  18   57    0  13
          van    0    1    0  78

> error_rate = (1 - sum(pred == Vehicle$Class[-train])/346)

> round(error_rate, 3)
[1] 0.39

PRACTITIONER TIP

When using predict you can specify any of the following via type =: "response", "prob", "quantile", "density" or "node". For example, to view the estimated probabilities of each class use:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "prob")

To see the distribution across nodes you would enter:

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "node")

Finally, we estimate an unrestricted model, use the validation data set to predict vehicle classes, display the confusion matrix and calculate the error rate of the unrestricted fitted tree. In this case the misclassification error rate is lower, at 34.7%.

> fit <- evtree(Class ~ ., data = Vehicle[train, ])

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "response")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   72    3   15   4
          opel   4   30   41  10
          saab   6   23   55   4
          van    3    1    6  69

> error_rate = (1 - sum(pred == Vehicle$Class[-train])/346)

> round(error_rate, 3)
[1] 0.347

46

Technique 5

Oblique Classification Tree

NOTE

A common question is: what is different about oblique trees? Here is the answer in a nutshell. For a set of attributes X1,...,Xk the standard classification tree produces binary partitioned trees by considering axis-parallel splits over continuous attributes (i.e. grown using tests of the form Xi < C versus Xi >= C). This is the most widely used approach to tree-growth. Oblique trees are grown using oblique splits (i.e. grown using tests of the form a1X1 + ... + akXk < C versus a1X1 + ... + akXk >= C, where the ai are weights). So for axis-parallel splits a single attribute is used; for oblique splits a weighted combination of attributes is used.

An oblique classification tree model can be estimated using the package oblique.tree with the oblique.tree function:

oblique.tree(z ~ ., data = ..., oblique.splits = "only", control = tree.control(...), split.impurity = ...)

Key parameters include the response variable z, which contains the classes; data, the data set of attributes with which you wish to build the tree; control, which takes its arguments from tree.control in the tree package; and split.impurity, which controls the splitting criterion used and takes the values "deviance" or "gini".


Step 1: Load Required Packages

We build our classification tree using the Vehicle data frame contained in the mlbench package:

> library(oblique.tree)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 of the 846 observations to create a randomly selected training sample. Three attributes are used to build the tree (Max.L.Ra, Sc.Var.maxis and Elong).

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> f <- Class ~ Max.L.Ra + Sc.Var.maxis + Elong

Step 3: Estimate and Assess the Decision Tree

Before estimating the tree we tweak the following:

1. Only allow oblique splits (oblique.splits = "only").

2. Use tree.control to indicate the number of observations (nobs = 500) and set the minimum number of observations in a node to 60 (mincut = 60).

The tree is estimated as follows:

> fit <- oblique.tree(f, data = Vehicle[train, ], oblique.splits = "only", control = tree.control(nobs = 500, mincut = 60), split.impurity = "deviance")

48

TECHNIQUE 5 OBLIQUE CLASSIFICATION TREE

PRACTITIONER TIP

The type of split is indicated by the oblique.splits argument. For our analysis we use oblique.splits = "only" to grow trees that only use oblique splits. Use oblique.splits = "on" to grow trees that use both oblique and axis-parallel splits, and oblique.splits = "off" to grow traditional classification trees (which only use axis-parallel splits).
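As a short illustrative sketch (an assumption, not from the original text), a tree that mixes oblique and axis-parallel splits could be grown by switching the argument to "on", keeping the other settings used above:

# Allow both oblique and axis-parallel splits, using the same formula f and
# training index train defined in Step 2
fit.on <- oblique.tree(f, data = Vehicle[train, ],
                       oblique.splits = "on",
                       control = tree.control(nobs = 500, mincut = 60),
                       split.impurity = "deviance")
# summary(fit.on) reveals whether any axis-parallel splits were selected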

Details of the tree can be visualized using a combination of plot and text, see Figure 5.1:

> plot(fit); text(fit)

49

92 Applied Predictive Modeling Techniques in R

Figure 5.1: Fitted Oblique tree using a subset of Vehicle attributes

The tree visualizes the decision rules. The first split occurs where

    64 - 1.42 Max.L.Ra + 0.03 Sc.Var.maxis - 0.12 Elong < 0

If the above holds, it leads directly to the leaf node indicating class type = van. Full details of the tree can be obtained by typing fit, whilst summary gives an overview of the fitted tree.

> summary(fit)

Classification tree:
oblique.tree(formula = f, data = Vehicle[train, ], control = tree.control(nobs = 500,
    mincut = 60), split.impurity = "deviance", oblique.splits = "only")
Variables actually used in tree construction:
[1] ""
Number of terminal nodes:  5
Residual mean deviance:  1.766 = 874.2 / 495
Misclassification error rate: 0 = 0 / 500

Step 4: Make Predictions

We use the validation data set and the fitted decision tree to predict vehicle classes, then we display the confusion matrix and calculate the error rate of the fitted tree. Overall, the model has an error rate of 39.9%.

> pred <- predict(fit, newdata = Vehicle[-train, ], type = "class")

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   77    0   13   4
          opel  17   37   20  11
          saab  23   34   27   4
          van    8    0    4  67

> error_rate = (1 - sum(pred == Vehicle$Class[-train])/346)

> round(error_rate, 3)
[1] 0.399

51

Technique 6

Logistic Model Based RecursivePartitioning

A model based recursive partitioning logistic regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = glinearModel, family = binomial())

Key parameters include the binary response variable z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Step 1: Load Required Packages

We build the decision tree using the data frame PimaIndiansDiabetes2 contained in the mlbench package:

> library(party)
> data(PimaIndiansDiabetes2, package = "mlbench")

NOTE

The PimaIndiansDiabetes2 data set was collected by the National Institute of Diabetes and Digestive and Kidney Diseases[23]. It contains 768 observations on 9 variables measured on females at least 21 years old of Pima Indian heritage. Table 9 contains a description of each of the variables.

Name       Description
pregnant   Number of times pregnant
glucose    Plasma glucose concentration (glucose tolerance test)
pressure   Diastolic blood pressure (mm Hg)
triceps    Triceps skin fold thickness (mm)
insulin    2-Hour serum insulin (mu U/ml)
mass       Body mass index
pedigree   Diabetes pedigree function
age        Age (years)
diabetes   Test for diabetes - class variable (neg / pos)

Table 9: Response and independent variables in the PimaIndiansDiabetes2 data frame

Step 2: Prepare Data & Tweak Parameters

For our analysis we use 600 of the 768 observations to train the model. The response variable is diabetes and we use mass and pedigree as logistic regression conditioning variables, with the remaining six variables (glucose, pregnant, pressure, triceps, insulin and age) being used as the partitioning variables. The model is stored in f:

> set.seed(898)
> n = nrow(PimaIndiansDiabetes2)
> train <- sample(1:n, 600, FALSE)
> f <- diabetes ~ mass + pedigree | glucose + pregnant + pressure + triceps + insulin + age

53

92 Applied Predictive Modeling Techniques in R

NOTE

The PimaIndiansDiabetes2 data set has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. In traditional statistical analysis these values would have to be removed or interpolated. However, ignoring missing values or treating them as another category is often inefficient. A more efficient use of the available information, used by many decision tree algorithms, is to ignore the missing data point in the evaluation of a split but distribute it to the child nodes using a given rule. For example, the rule might:

1. Distribute missing values to the node which has the largest number of instances.

2. Distribute to all child nodes with diminished weights, proportional to the number of instances in each child node.

3. Randomly distribute to one single child node.

4. Create surrogate attributes which closely resemble the test attributes and use them to send missing values to child nodes (a short sketch of this approach follows the list).
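As a minimal sketch (an assumption, not from the original text), rule 4 is what rpart does through surrogate splits: observations with a missing split variable are routed to a child node using surrogate variables.

library(rpart)
library(mlbench)
data(PimaIndiansDiabetes2, package = "mlbench")

# maxsurrogate: number of surrogate splits retained at each node
# usesurrogate = 2: if all surrogates are also missing, follow the majority direction
fit.surr <- rpart(diabetes ~ ., data = PimaIndiansDiabetes2, method = "class",
                  control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# summary(fit.surr) lists the surrogate splits chosen at each node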

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and then use the plot function to visualize the tree, as shown in Figure 6.1. Since the response variable diabetes is binary and mass and pedigree are numeric, a spinogram is used for visualization. The plots in the leaves give spinograms for diabetes versus mass (upper panel) and pedigree (lower panel).

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel, family = binomial())

> plot(fit)

54

TECHNIQUE 6 LOGISTIC MODEL BASED RECURSIVE

PRACTITIONER TIP

As an alternative to using spinograms you can also plot the cumulative density function using the argument tp_args = list(cdplot = TRUE) in the plot function. In the example in this section you would type:

> plot(fit, tp_args = list(cdplot = TRUE))

You can also specify the smoothing bandwidth using the bw argument. For example:

> plot(fit, tp_args = list(cdplot = TRUE, bw = 15))

55

92 Applied Predictive Modeling Techniques in R

Figure 61 Logistic-regression-based tree for the Pima Indians diabetes data

The fitted lines are the mean predicted probabilities in each group Thedecision tree distinguishes four different groups of women

- Node 3: Women with low glucose and 26 years or younger have on average a low risk of diabetes; however, this increases with mass but decreases slightly with pedigree.

- Node 4: Women with low glucose and older than 26 years have on average a moderate risk of diabetes, which increases with mass and pedigree.

- Node 5: Women with glucose in the range 127 to 165 have on average a moderate to high risk of diabetes, which increases with mass and pedigree.

- Node 7: Women with glucose greater than 165 have on average a high risk of diabetes, which increases with mass and decreases with pedigree.

The same interpretation can also be drawn from the coefficient estimates obtained using the coef function:

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -7.263 0.143   -0.680
4      -4.887 0.076    2.495
6      -5.711 0.149    1.365
7      -3.216 0.225   -2.193

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -115.4345 (df=15)

> AIC(fit)
[1] 260.8691

Step 5: Make Predictions

We use the function predict to calculate the fitted values on the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
      predicted
actual neg pos
   neg  92  17
   pos  26  29

> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3) * 100
[1] 26.2

PRACTITIONER TIP

Re-run the analysis in this section omitting the attributes insulin and triceps, and using the na.omit method to remove any remaining missing values. Here is some sample code to get you started:

temp <- PimaIndiansDiabetes2
temp$insulin <- NULL
temp$triceps <- NULL
temp <- na.omit(temp)

What do you notice about the resultant decision tree?
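A minimal sketch of how the re-estimation might proceed (the reduced formula f2 and training index train2 are assumptions, not from the original text):

# Refit the logistic model based tree on the reduced data set temp
f2 <- diabetes ~ mass + pedigree | glucose + pregnant + pressure + age

set.seed(898)
train2 <- sample(1:nrow(temp), round(0.8 * nrow(temp)), FALSE)

fit2 <- mob(f2, data = temp[train2, ], model = glinearModel, family = binomial())
plot(fit2)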

58

Technique 7

Probit Model Based RecursivePartitioning

A model based recursive partitioning probit regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = glinearModel, family = binomial(link = "probit"))

Key parameters include the binary response variable z, the conditioning covariates x, y and z, and the tree partitioning covariates a, b and c.

Step 1 and Step 2 are discussed beginning on page 52.

Step 3: Estimate & Interpret Decision Tree

We estimate the model using the function mob and display the coefficients at the leaf nodes using coef:

> fit <- mob(f, data = PimaIndiansDiabetes2[train, ], model = glinearModel, family = binomial(link = "probit"))

> round(coef(fit), 3)
  (Intercept)  mass pedigree
3      -4.070 0.078   -0.222
4      -2.932 0.046    1.474
6      -3.416 0.089    0.814
7      -1.174 0.100   -1.003

59

92 Applied Predictive Modeling Techniques in R

The estimated decision tree is similar to that shown in Figure 6.1 and the interpretation of the coefficients is given on page 56.

Step 5: Make Predictions

For comparison with the logistic regression discussed on page 52, we report the value of the log likelihood function and the Akaike information criterion. Since these values are very close to those obtained by the logistic regression, we should expect similar predictive performance.

> logLik(fit)
'log Lik.' -115.4277 (df=15)

> AIC(fit)
[1] 260.8555

We use the function predict with the validation sample and show the confusion table. Then we calculate the misclassification error, which returns a value of 26.2%. This is exactly the same error rate observed for the logistic regression model.

> pred <- predict(fit, newdata = PimaIndiansDiabetes2[-train, ])

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("neg", "pos"))

> tb <- table(PimaIndiansDiabetes2$diabetes[-train], predFac, dnn = c("actual", "predicted"))

> tb
      predicted
actual neg pos
   neg  92  17
   pos  26  29

> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3) * 100
[1] 26.2

60

Regression Trees for Continuous Response Variables

61

Technique 8

Regression Tree

A regression tree can be built using the package tree with the tree function:

tree(z ~ ., data = ..., split = ...)

Key parameters include split, which controls whether "deviance" or "gini" is used as the splitting criterion; z, the continuous response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our regression tree using the bodyfat data frame contained in the TH.data package:

> library(tree)
> data("bodyfat", package = "TH.data")

NOTE

The bodyfat data set was collected by Garcia et al.[24] to develop improved predictive regression equations for body fat content derived from common anthropometric measurements. The original study collected data from 117 healthy German subjects, 46 men and 71 women. The bodyfat data frame contains the data collected on 10 variables for the 71 women, see Table 10.


Name           Description
DEXfat         body fat measured by DXA (response variable)
age            age in years
waistcirc      waist circumference
hipcirc        hip circumference
elbowbreadth   breadth of the elbow
kneebreadth    breadth of the knee
anthro3a       sum of logarithm of three anthropometric measurements
anthro3b       sum of logarithm of three anthropometric measurements
anthro3c       sum of logarithm of three anthropometric measurements
anthro4        sum of logarithm of three anthropometric measurements

Table 10: Response and independent variables in the bodyfat data frame

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al. we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate the Decision Tree

Now we are ready to fit the decision tree using the training sample. We take the log of DEXfat as the response variable.

> fit <- tree(log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")

To see details of the fitted tree enter:

> fit

node), split, n, deviance, yval
      * denotes terminal node

 1) root 45 6.72400 3.364
   2) anthro4 < 5.33 19 1.33200 3.004
     4) anthro4 < 4.545 5 0.25490 2.667 *
  ...
  14) waistcirc < 104.25 10 0.07201 3.707 *
  15) waistcirc > 104.25 5 0.08854 3.874 *

The yval reported at each node is the mean of log(DEXfat) for the observations in that node. For example, the root node indicates the overall mean of log(DEXfat) is 3.364. This is approximately the same value we would get from:

> mean(log(bodyfat$DEXfat))
[1] 3.359635

Following the splits, the first terminal node is node 4, where the estimated mean of log(DEXfat) for observations with anthro4 < 5.33 and anthro4 < 4.545 is 2.667.

A summary of the tree can also be obtained:

> summary(fit)

Regression tree:
tree(formula = log(DEXfat) ~ ., data = bodyfat[train, ], split = "deviance")
Variables actually used in tree construction:
[1] "anthro4"   "hipcirc"   "waistcirc"
Number of terminal nodes:  7
Residual mean deviance:  0.01687 = 0.6411 / 38
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -0.268300  -0.080280   0.009712   0.000000   0.073400   0.332900

Notice that summary(fit) shows:

1. the type of tree, in this case a Regression tree;

2. the formula used to fit the tree;

3. the variables used to fit the tree;

4. the number of terminal nodes, in this case 7;

5. the residual mean deviance of 0.01687;

6. the distribution of the residuals, in this case with a mean of 0.

We plot the tree, see Figure 8.1:

> plot(fit); text(fit)

Figure 81 Fitted Regression Tree for bodyfat

65

92 Applied Predictive Modeling Techniques in R

Step 4: Assess Model

We use leave-one-out cross-validation, with the results plotted in Figure 8.2. Since the jagged line reaches a minimum close to the number of branches in the original fitted tree, there is little to be gained from pruning this tree (a pruning sketch is given after Figure 8.2).

> fit.cv <- cv.tree(fit, K = 45)
> plot(fit.cv)

Figure 82 Regression tree cross validation results using bodyfat
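Had the cross-validation curve favoured a smaller tree, it could be cut back with prune.tree from the tree package. A minimal sketch (an assumption, not from the original text; the choice of 5 terminal nodes is purely illustrative):

# Prune the fitted regression tree back to the "best" 5 terminal nodes
fit.pruned <- prune.tree(fit, best = 5)
summary(fit.pruned)   # compare residual mean deviance with the full tree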

66

TECHNIQUE 8 REGRESSION TREE

Step 5: Make Predictions

We use the test observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 8.3. The squared correlation coefficient between predicted and observed values is 0.795.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.795

67

92 Applied Predictive Modeling Techniques in R

Figure 83 Scatterplot of predicted versus observed observations for the re-gression tree of bodyfat

68

Technique 9

Conditional InferenceRegression Tree

A conditional inference regression tree is a non-parametric regression tree embedding tree-structured regression models. It is similar to the regression tree of page 62, but with extra information about the distribution of subjects in the leaf nodes. It can be estimated using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include the continuous response variable z and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data("bodyfat", package = "TH.data")

Step 2 is outlined on page 63.

Step 3: Estimate and Assess the Decision Tree

We estimate the model using the training data, followed by a plot of the fitted tree shown in Figure 9.1.


> fit <- ctree(log(DEXfat) ~ ., data = bodyfat[train, ])

> plot(fit)

Figure 9.1: Fitted Conditional Inference Regression Tree using bodyfat

Further details of the fitted tree can be obtained using the print function:

> print(fit)

         Conditional inference tree with 4 terminal nodes

Response:  log(DEXfat)
Inputs:  age, waistcirc, hipcirc, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c, anthro4
Number of observations:  45

1) anthro3c <= 3.85; criterion = 1, statistic = 35.215
  2) anthro3c <= 3.39; criterion = 0.998, statistic = 14.061
    3)*  weights = 9
  2) anthro3c > 3.39
    4)*  weights = 12
1) anthro3c > 3.85
  5) hipcirc <= 108.5; criterion = 0.999, statistic = 15.862
    6)*  weights = 10
  5) hipcirc > 108.5
    7)*  weights = 14

At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3c <= 3.85); the criterion reflects the reported p-value and is derived from statistic. Terminal nodes (or leaves) are indicated with * and weights are the number of subjects / observations at that node.
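As a small illustrative sketch (an assumption, not shown in the original text), the terminal node into which each training observation falls can be tabulated with the where function, and the counts should match the weights reported above:

# Count training observations per terminal node of the conditional inference tree
table(where(fit))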

Step 5: Make Predictions

We use the validation observations and the fitted decision tree to predict log(DEXfat). The scatter plot between predicted and observed values is shown in Figure 9.2. The squared correlation coefficient between predicted and observed values is 0.68.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
            [,1]
log(DEXfat) 0.68

71

92 Applied Predictive Modeling Techniques in R

Figure 92 Conditional Inference Regression Tree scatter plot for bodyfat

72

Technique 10

Linear Model Based RecursivePartitioning

A linear model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = linearModel)

Key parameters include the continuous response variable z, the linear regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(party)
> data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use the entire bodyfat sample. We begin by taking the log of the response variable (DEXfat) and the two conditioning variables (waistcirc, hipcirc); the remaining covariates form the partitioning set:

> bodyfat$DEXfat <- log(bodyfat$DEXfat)

> bodyfat$waistcirc <- log(bodyfat$waistcirc)

> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc | age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function mob. Since looking at the printed output can be rather tedious, a visualization is shown in Figure 10.1. By default this produces partial scatter plots of the response variable against each of the regressors (waistcirc, hipcirc) in the terminal nodes. Each scatter plot also shows the fitted values. From this visualization it can be seen that in nodes 3, 4 and 5 body fat increases with waist and hip circumference. The increase appears steepest in node 3 and flattens out somewhat in node 5.

> fit <- mob(f, data = bodyfat, model = linearModel, control = mob_control(objfun = logLik))

> plot(fit)

PRACTITIONER TIP

Model based recursive partitioning searches for the locally optimal split in the response variable by minimizing the objective function of the model. Typically this will be something like the deviance or the negative log likelihood function. It can be specified using the mob_control control function. For example, to use deviance you would set control = mob_control(objfun = deviance). In our analysis we use the log likelihood with control = mob_control(objfun = logLik).
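A minimal sketch (an assumption, not from the original text) of switching the objective function to the deviance and comparing the result with the log-likelihood based fit estimated above:

# Refit the same model based tree, but split on deviance rather than logLik
fit.dev <- mob(f, data = bodyfat, model = linearModel,
               control = mob_control(objfun = deviance))

AIC(fit)      # tree grown with objfun = logLik
AIC(fit.dev)  # tree grown with objfun = deviance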


Figure 10.1: Linear model based recursive partitioning tree using bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit
1) anthro3b <= 4.64; criterion = 0.998, statistic = 24.549
  2) anthro3b <= 4.29; criterion = 0.966, statistic = 16.962
    3)*  weights = 31
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -14.309        1.196        2.660

  2) anthro3b > 4.29
    4)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -5.3867       0.9887       0.9466

1) anthro3b > 4.64
  5)*  weights = 20
Terminal node model
Linear model with coefficients:
(Intercept)    waistcirc      hipcirc
    -3.5815       0.6027       0.9546

The output informs us that the tree consists of five nodes. At each branch of the tree (after root) we see, in order, the branch number and the split rule (e.g. anthro3b <= 4.64). Note criterion reflects the reported p-value[25] and is derived from statistic. Terminal nodes are indicated with * and weights are the number of subjects / observations at that node. The output also presents the estimated regression coefficients at the terminal nodes. We can also use the coef function to obtain a summary of the estimated coefficients and their associated node:

> round(coef(fit), 3)
  (Intercept) waistcirc hipcirc
3     -14.309     1.196   2.660
4      -5.387     0.989   0.947
5      -3.582     0.603   0.955

The summary function also provides detailed statistical information on the fitted coefficients by node. For example, summary(fit) produces the following (we only show details of node 3):

> summary(fit)

$`3`

Call:
NULL

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3272  0.0000  0.0000  0.0000  0.4376

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.3093     2.4745  -5.783 3.29e-06 ***
waistcirc     1.1958     0.4033   2.965 0.006119 **
hipcirc       2.6597     0.6969   3.817 0.000685 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1694 on 28 degrees of freedom
Multiple R-squared: 0.6941,  Adjusted R-squared: 0.6723
F-statistic: 31.77 on 2 and 28 DF,  p-value: 6.278e-08

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' 54.42474 (df=14)

> AIC(fit)
[1] -80.84949

PRACTITIONER TIP

The test statistics and p-values computed in each node can be extracted using the function sctest(). For example, to see the statistics for node 2 you would type sctest(fit, node = 2).

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 10.2. The squared correlation coefficient between predicted and observed values is 0.89.

> pred <- predict(fit, newdata = bodyfat)

> plot(bodyfat$DEXfat, pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Full Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat)^2, 3)
[1] 0.89

Figure 102 Linear model based recursive partitioning tree predicted andobserved values using bodyfat

78

Technique 11

Evolutionary Regression Tree

An evolutionary regression tree model can be estimated using the package evtree with the evtree function:

evtree(z ~ ., data = ...)

Key parameters include the continuous response variable z and the covariates contained in data.

Step 1: Load Required Packages

We build the decision tree using the bodyfat (see page 62) data frame contained in the TH.data package:

> library(evtree)
> data("bodyfat", package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

For our analysis we will use 45 observations as the training sample. We take the log of the response variable (DEXfat) and two of the covariates (waistcirc, hipcirc). The remaining covariates are used in their original form:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

> bodyfat$DEXfat <- log(bodyfat$DEXfat)
> bodyfat$waistcirc <- log(bodyfat$waistcirc)
> bodyfat$hipcirc <- log(bodyfat$hipcirc)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

Step 3: Estimate & Evaluate Decision Tree

We estimate the model using the function evtree. A visualization is obtained using plot and shown in Figure 11.1. This produces box and whisker plots of the response variable in each leaf. From this visualization it can be seen that body fat increases as we move from node 3 to node 5.

> fit <- evtree(f, data = bodyfat[train, ])
> plot(fit)

PRACTITIONER TIP

Notice that evtree.control is used to control important aspects of a tree. You can change the number of evolutionary iterations using niterations; this is useful if your tree does not converge within the default number of iterations. You can also specify the number of trees in the population using ntrees and the tree depth with maxdepth. For example, to limit the maximum tree depth to two and the number of iterations to ten thousand you would enter something along the lines of:

fit <- evtree(f, data = bodyfat[train, ], control = evtree.control(maxdepth = 2, niterations = 10000))

80

TECHNIQUE 11 EVOLUTIONARY REGRESSION TREE

Figure 11.1: Fitted Evolutionary Regression Tree for bodyfat

Further details of the fitted tree can be obtained by typing the fitted model's name:

> fit

Model formula:
DEXfat ~ waistcirc + hipcirc + age + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4

Fitted party:
[1] root
|   [2] hipcirc < 109
|   |   [3] anthro3c < 3.77: 20.271 (n = 18, err = 385.9)
|   |   [4] anthro3c >= 3.77: 31.496 (n = 12, err = 186.8)
|   [5] hipcirc >= 109: 43.432 (n = 15, err = 554.8)

Number of inner nodes:    2
Number of terminal nodes: 3

13 PRACTITIONER TIP

Decision tree models can outperform other techniques inthe situation when relationships are irregular (eg non-monotonic) but they are also known to be inefficient whenrelationships can be well approximated by simpler models

Step 5: Make Predictions

We use the function predict and then display the scatter plot between predicted and observed values in Figure 11.2. The squared correlation coefficient between predicted and observed values is 0.73.

> pred <- predict(fit, newdata = bodyfat[-train, ])

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.727

82

TECHNIQUE 11 EVOLUTIONARY REGRESSION TREE

Figure 112 Scatter plot of fitted and observed values for the EvolutionaryRegression Tree using bodyfat

83

Decision Trees for Count & Ordinal Response Data

13 PRACTITIONER TIP

An oft believed maxim is the more data the better whilst thismay sometimes be true it is a good idea to try to rationallyreduce the number of attributes you include in your decisiontree to the minimum set of highest value attributes Oatesand Jensen26 studied the influence of database size on decisiontree complexity They found tree size and complexity stronglydepends on the size of the training setIt is always worth thinking about and removing uninformativeattributes prior to decision tree construction For practicalideas and additional tips on how to do this see the excellentpapers of John27 Brodley and Friedl28 and Cano and Herrera29

84

Technique 12

Poisson Decision Tree

A decision tree for a count response variable (yi) following a Poisson distribution, with a mean that depends on the covariates x1,...,xk, can be built using the package rpart with the rpart function:

rpart(z ~ ., data = ..., method = "poisson")

Key parameters include method = "poisson", which indicates the type of tree to be built; z, the Poisson distributed response variable; and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package:

> library(rpart)
> library(MixAll)
> data(DebTrivedi)


NOTE

Deb and Trivedi[30] model counts of medical care utilization by the elderly in the United States using data from the National Medical Expenditure Survey. They analyze data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program. The objective is to model the demand for medical care using, as the response variable, the number of physician/non-physician office and hospital outpatient visits. The data is contained in the DebTrivedi data frame available in the MixAll package.

Step 2: Prepare Data & Tweak Parameters

The number of physician office visits (ofp) is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).

> f <- ofp ~ hosp + health + numchron + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(f, data = DebTrivedi, method = "poisson")

To see a plot of the tree use the plot and text methods:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8)

Figure 12.1 shows a visualization of the fitted tree. Each of the five terminal nodes reports the event rate, the total number of events and the number of observations for that node. For example, for the rule chain numchron < 1.5 -> hosp < 0.5 -> numchron < 0.5 the event rate is 3.121 with 923 events at that node.
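A small illustrative sketch (an assumption, not from the original text): the estimated Poisson event rate for each observation can be extracted with predict, and the distinct values correspond to the terminal nodes shown in Figure 12.1:

# Estimated visit rate per observation; one distinct rate per terminal node
rates <- predict(fit)
sort(unique(round(rates, 3)))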

86

TECHNIQUE 12 POISSON DECISION TREE

PRACTITIONER TIP

To see the number of events and observations at every node in a decision tree plot, add all = TRUE to the text function:

plot(fit); text(fit, use.n = TRUE, all = TRUE, cex = 0.6)

Figure 12.1: Poisson Decision Tree using the DebTrivedi data frame

To help validate the decision tree we use the printcp function. The 'cp' part of the function name stands for the "complexity parameter" of the tree. The function indicates the optimal tree size based on the cp value:

> printcp(fit, digits = 3)

Rates regression tree:
rpart(formula = f, data = DebTrivedi, method = "poisson")

Variables actually used in tree construction:
[1] hosp     numchron

Root node error: 26943/4406 = 6.12

n= 4406

      CP nsplit rel error xerror   xstd
1 0.0667      0     1.000  1.000 0.0332
2 0.0221      1     0.933  0.942 0.0325
3 0.0220      2     0.911  0.918 0.0326
4 0.0122      3     0.889  0.896 0.0315
5 0.0100      4     0.877  0.887 0.0315

The printcp function returns the formula used to fit the tree, the variables used to build the tree (in this case hosp and numchron), the root node error (6.12), the number of observations at the root node (4406), and the relative error, cross-validation error (xerror), xstd and CP at each node split. Each row represents a different height of the tree. In general, more levels in the tree often imply a lower classification error. However, you run the risk of over fitting.

Figure 12.2 plots the rel error and cp parameter:

> plotcp(fit)

88

TECHNIQUE 12 POISSON DECISION TREE

Figure 122 Complexity parameter for the Poisson decision tree using theDebTrivedi data frame

89

92 Applied Predictive Modeling Techniques in R

PRACTITIONER TIP

A simple rule of thumb is to choose the lowest level where rel error + xstd < xerror. Another rule of thumb is to prune the tree so that it has the minimum xerror. You can do this automatically using the following:

pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])

pfit contains the pruned tree.

Since the tree is relatively parsimonious we retain the original tree. A summary of the tree can also be obtained by typing:

> summary(fit)

The first part of the output displays similar data to that obtained by using the printcp function. This is followed by variable importance details:

Variable importance
numchron     hosp   health
      55       37        8

We see that numchron is the most important variable, followed by hosp and then health.

The second part of the summary function gives details of the tree, with the last few lines giving details of the terminal nodes. For example, for node 5 we observe:

Node number 5: 317 observations
  events=2329,  estimated rate=7.346145 , mean deviance=5.810627

90

Technique 13

Poisson Model Based RecursivePartitioning

A Poisson model based recursive partitioning regression tree can be estimated using the package party with the mob function:

mob(z ~ x + y + z | a + b + c, data = ..., model = linearModel, family = poisson(link = log))

Key parameters include the Poisson distributed response variable of counts z, the regression covariates x, y and z, and the covariates a, b and c with which you wish to partition the tree.

Step 1: Load Required Packages

We build a Poisson decision tree using the DebTrivedi data frame contained in the MixAll package. Details of this data frame are given on page 86.

> library(party)
> library(MixAll)
> data(DebTrivedi)

Step 2: Prepare Data & Tweak Parameters

For our analysis we use all the observations in DebTrivedi to estimate a model with the response variable ofp (number of physician office visits) and numchron (number of chronic conditions) as the regression conditioning variable. The remaining variables (hosp, health, gender, school, privins) are used as the partitioning set. The model is stored in f.

91

92 Applied Predictive Modeling Techniques in R

> f <- ofp ~ numchron | hosp + health + gender + school + privins

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit and plot the decision tree, see Figure 13.1:

> fit <- mob(f, data = DebTrivedi, model = linearModel, family = poisson(link = log))

> plot(fit)

Figure 13.1: Poisson Model Based Recursive Partitioning Tree for DebTrivedi

92

TECHNIQUE 13 POISSON MODEL BASED RECURSIVE

Coefficient estimates at the leaf nodes are given by:

> round(coef(fit), 3)
   (Intercept) numchron
3        7.042    0.231
6        2.555    0.892
7        2.672    1.400
9        4.119    0.942
10       2.788    1.043
12       3.769    1.212
14       6.982    1.123
15      14.579    0.032

When comparing models it can be useful to have the value of the log likelihood function or the Akaike information criterion. These can be obtained by:

> logLik(fit)
'log Lik.' -14109.87 (df=31)

> AIC(fit)
[1] 28281.75

The results of the parameter stability tests for any given node can be retrieved using sctest. For example, to retrieve the statistics for node 2 enter:

> round(sctest(fit, node = 2), 2)
           hosp health gender school privins
statistic 13.94  36.25   7.93  24.22   16.87
p.value    0.11   0.00   0.09   0.00    0.00

93

Technique 14

Conditional Inference OrdinalResponse Tree

The conditional inference ordinal response tree is used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". A conditional inference ordinal response tree can be built using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include the response variable z, an ordered factor, and data, the data set of attributes with which you wish to build the tree.

Step 1: Load Required Packages

We build our tree using the wine data frame contained in the ordinal package:

> library(party)
> library(ordinal)
> data(wine)


NOTE

The wine data frame was analyzed by Randall[31] in an experiment on factors determining the bitterness of wine. The bitterness (rating) was measured as 1 = "least bitter" and 5 = "most bitter". Temperature and contact between juice and skins can be controlled during wine production. Two treatment factors were collected - temperature (temp) and contact (contact), each with two levels. Nine judges assessed wine from two bottles from each of the four treatment conditions, resulting in a total of 72 observations in all.

Step 2: Estimate and Assess the Decision Tree

We estimate the model using all of the data, with rating as the response variable. This is followed by a plot, shown in Figure 14.1, of the fitted tree.

> fit <- ctree(rating ~ temp + contact, data = wine)
> plot(fit)

95

92 Applied Predictive Modeling Techniques in R

Figure 14.1: Fitted Conditional Inference Ordinal Response Tree using wine

The decision tree has three leaves - Node 3 with 18 observations, Node 4 with 18 observations and Node 5 with 36 observations. The covariate temp is highly significant (p < 0.001) and contact is significant at the 5% level. Notice that the terminal nodes contain the distribution of bitterness scores.
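The confusion matrix below compares fitted and observed ratings using an object pred. The original text does not show how pred was created; a minimal sketch (an assumption) is:

# Fitted (in-sample) ratings from the conditional inference tree
pred <- predict(fit)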

We compare the fitted values with the observed values and print out the confusion matrix and error rate. The overall misclassification rate is 56.9% for the fitted tree.

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))
> tb
      predicted
actual  1  2  3  4  5
     1  0  5  0  0  0
     2  0 16  5  1  0
     3  0 13  8  5  0
     4  0  2  3  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)
[1] 0.569

97

Decision Trees for Survival Analysis

Studies involving time to event data are numerous and arise in all areas of research. For example, survival times or other time to failure related measurements, such as relapse time, are major concerns when modeling medical data. The Cox proportional hazard regression model and its extensions have been the traditional tools of the data scientist for modeling survival variables with censoring. These parametric (and semi-parametric) models remain useful staples as they allow simple interpretations of the covariate effects and can readily be used for statistical inference. However, such models force a specific link between the covariates and the response. Even though interactions between covariates can be incorporated, they must be specified by the analyst.

Survival decision trees allow the data scientist to carry out their analysis without imposing a specific link function or knowing a priori the nature of variable interactions. Survival trees offer great flexibility because they can automatically detect certain types of interactions without the need to specify them beforehand. Prognostic groupings are a natural output from survival trees; this is because the basic idea of a decision tree is to partition the covariate space recursively to form groups (nodes in the tree) of subjects which are similar according to the outcome of interest.

98

Technique 15

Exponential Algorithm

A decision tree for survival data, where the time values are assumed to fit an exponential model[32], can be built using the package rpart with the rpart function:

rpart(z ~ ., data = ..., method = "exp")

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; and method = "exp", which is used to indicate a survival decision tree.

Step 1: Load Required Packages

We build a survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(rpart)
> library(simexaft)
> library(survival)
> data(rhDNase)

99

92 Applied Predictive Modeling Techniques in R

NOTE

Respiratory disease in patients with cystic fibrosis is characterized by airway obstruction caused by the accumulation of thick purulent secretions. The viscoelasticity of these secretions can be reduced in vitro by recombinant human deoxyribonuclease I (rhDNase), a bioengineered copy of the human enzyme. The rhDNase data set contained in the simexaft package contains a subset of the original data collected by Fuchs et al.[33], who performed a randomized, double-blind, placebo-controlled study on 968 adults and children with cystic fibrosis to determine the effects of once-daily and twice-daily administration of rhDNase. The patients were treated for 24 weeks as outpatients. The rhDNase data frame contains data on the occurrence and resolution of all exacerbations for 641 patients.

Step 2: Prepare Data & Tweak Parameters

The forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate the Decision Tree & Assess Fit

Now we are ready to fit the decision tree:

> fit <- rpart(z ~ trt + fev.ave, data = rhDNase, method = "exp")

To see a plot of the tree enter:

> plot(fit); text(fit, use.n = TRUE, cex = 0.8, all = TRUE)

100

TECHNIQUE 15 EXPONENTIAL ALGORITHM

Figure 15.1 shows a plot of the fitted tree. Notice that the terminal nodes report the estimated rate as well as the number of events and observations available. For example, for the rule fev.ave < 80.03 the estimated event rate is 0.35, with 27 events out of a total of 177.

PRACTITIONER TIP

To see the rules that lead to a specific node use the function path.rpart(fitted_Model, node = x). For example, to see the rules associated with node 7 enter:

> path.rpart(fit, node = 7)

 node number: 7
   root
   fev.ave< 80.03
   fev.ave< 46.42

101

92 Applied Predictive Modeling Techniques in R

Figure 15.1: Survival Decision Tree using the rhDNase data frame

To help assess the decision tree we use the plotcp function, shown in Figure 15.2. Since the tree is relatively parsimonious there is little need to prune:

> plotcp(fit)

102

TECHNIQUE 15 EXPONENTIAL ALGORITHM

Figure 15.2: Complexity parameter and error for the Survival Decision Tree using the rhDNase data frame

Survival times can vary greatly between subjects. Decision tree analysis is a useful tool to homogenize the data by separating it into different subgroups based on treatments and other relevant characteristics. In other words, a single tree will group subjects according to their survival behavior based on their covariates. For a final summary of the model it can be helpful to plot the probability of survival based on the final nodes in which the individual patients landed, as shown in Figure 15.3.

We see that node 2 appears to have the most favorable survival characteristics.

> km <- survfit(z ~ fit$where, data = rhDNase)

> plot(km, lty = 1:3, mark.time = FALSE, xlab = "Time", ylab = "Status")

> legend(150, 0.2, paste('node', c(2, 4, 5)), lty = 1:3)

Figure 15.3: Survival plot by terminal node for rhDNase

104

Technique 16

Conditional Inference SurvivalTree

A conditional inference survival tree is a non-parametric regression tree embedding tree-structured regression models. This is essentially a decision tree, but with extra information about survival in the terminal nodes[34]. It can be built using the package party with the ctree function:

ctree(z ~ ., data = ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the sample of explanatory variables.

Step 1: Load Required Packages

We build a conditional inference survival decision tree using the rhDNase data frame contained in the simexaft package. The required packages and data are loaded as follows:

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Details of step 2 are given on page 100.

Step 3: Estimate the Decision Tree & Assess Fit

Next we fit and plot the decision tree. Figure 16.1 shows the resultant tree.

105

92 Applied Predictive Modeling Techniques in R

> fit <- ctree(z ~ trt + fev.ave, data = rhDNase)
> plot(fit)

Figure 16.1: Conditional Inference Survival Tree for rhDNase

Notice that the internal nodes report the p-value for the split, whilst the leaf nodes give the number of subjects and a plot of the estimated survival curve. We can obtain a summary of the tree by typing:

> print(fit)

         Conditional inference tree with 4 terminal nodes

Response:  z
Inputs:  trt, fev.ave
Number of observations:  641

1) fev.ave <= 79.95; criterion = 1, statistic = 58.152
  2) trt <= 0; criterion = 0.996, statistic = 9.34
    3)*  weights = 232
  2) trt > 0
    4) fev.ave <= 45.3; criterion = 0.957, statistic = 5.254
      5)*  weights = 105
    4) fev.ave > 45.3
      6)*  weights = 127
1) fev.ave > 79.95
  7)*  weights = 177

Node 7 is a terminal or leaf node (the symbol "*" signifies this) with the decision rule fev.ave > 79.95. At this node there are 177 observations.

We grab the fitted responses using the function treeresponse and store them in stree. Notice that every stree component is a survival object of class survfit:

> stree <- treeresponse(fit)

> class(stree[[2]])
[1] "survfit"

> class(stree[[7]])
[1] "survfit"

For this particular tree we have four terminal nodes, so there are only four unique survival objects. You can use the where method to see which nodes the observations are in:

> subjects <- where(fit)

> table(subjects)
subjects
  3   5   6   7
232 105 127 177

107

92 Applied Predictive Modeling Techniques in R

So we have 232 subjects in node 3 and 127 subjects in node 6, which agree with the numbers reported in Figure 16.1.

We end our initial analysis by plotting, in Figure 16.2, the survival curve for node 3 with a 95% confidence interval, and also mark on the plot the time of individual subject events:

> plot(stree[[3]], conf.int = TRUE, mark.time = TRUE, ylab = "Cumulative Survival (%)", xlab = "Days Elapsed")

Figure 16.2: Survival curve for node 3

108

NOTES

Notes1See for example the top ten list of Wu Xindong et al Top 10 algorithms in data

mining Knowledge and Information Systems 141 (2008) 1-372Koch Y Wolf T Sorger PK Eils R Brors B (2013) Decision-Tree Based Model Analysis

for Efficient Identification of Parameter Relations Leading to Different Signaling StatesPLoS ONE 8(12) e82593 doi101371journalpone0082593

3Ebenezer Hailemariam Rhys Goldstein Ramtin Attar amp Azam Khan (2011) Real-Time Occupancy Detection using Decision Trees with Multiple Sensor Types SimAUD 2011Conference Proceedings Symposium on Simulation for Architecture and Urban Design

4Taken from Stiglic G Kocbek S Pernek I Kokol P (2012) Comprehensive Decision TreeModels in Bioinformatics PLoS ONE 7(3) e33812 doi101371journalpone0033812

5Bohanec M Bratko I (1994) Trading accuracy for simplicity in decision trees MachineLearning 15 223ndash250

6See for example J R Quinlan ldquoInduction of decision treesrdquo Mach Learn vol 1 no1 pp 81ndash106 1986

7See J R Quinlan C45 programs for machine learning San Francisco CA USAMorgan Kaufmann Publishers Inc 1993

8See Breiman Leo et al Classification and regression trees CRC press 19849See R L De Macuteantaras ldquoA distance-based attribute selection measure for decision

tree inductionrdquo Mach Learn vol 6 no 1 pp 81ndash92 199110Zhang Ting et al Using decision trees to measure activities in people with stroke

Engineering in Medicine and Biology Society (EMBC) 2013 35th Annual InternationalConference of the IEEE IEEE 2013

11See for example J Liu M A Valencia-Sanchez G J Hannon and R ParkerldquoMicroRNA-dependent localization of targeted mRNAs to mammalian P-bodiesrdquo NatureCell Biology vol 7 no 7 pp 719ndash723 2005

12Williams Philip H Rod Eyles and George Weiller Plant MicroRNA predictionby supervised machine learning using C50 decision trees Journal of nucleic acids 2012(2012)

13Nakayama Nobuaki et al Algorithm to determine the outcome of patients with acuteliver failure a data-mining analysis using decision trees Journal of gastroenterology 476(2012) 664-677

14Hepatic encephalopathy also known as portosystemic encephalopathy is the loss ofbrain function (evident in confusion altered level of consciousness and coma) as a resultof liver failure

15de Ontildea Juan Griselda Loacutepez and Joaquiacuten Abellaacuten Extracting decision rules frompolice accident reports through decision trees Accident Analysis amp Prevention 50 (2013)1151-1160

16See for example Kashani A Mohaymany A 2011 Analysis of the traffic injuryseverity on two-lane two-way rural roads based on classification tree models Safety Science49 1314-1320

17Monedero Intildeigo et al Detection of frauds and other non-technical losses in a powerutility using Pearson coefficient Bayesian networks and decision trees International Jour-nal of Electrical Power amp Energy Systems 341 (2012) 90-98

18Maximum and minimum value monthly consumption Number of meter readings Num-ber of hours of maximum power consumption and three variables to measure abnormalconsumption

109

92 Applied Predictive Modeling Techniques in R

19Wang Quan et al Tracking tetrahymena pyriformis cells using decision trees Pat-tern Recognition (ICPR) 2012 21st International Conference on IEEE 2012

20This data set comes from the Turing Institute Glasgow Scotland21For further details see httpwwwrulequestcomsee5-comparisonhtml22For further details see

1 Hothorn Torsten Kurt Hornik and Achim Zeileis ctree Conditional InferenceTrees

2 Hothorn Torsten Kurt Hornik and Achim Zeileis Unbiased recursive partition-ing A conditional inference framework Journal of Computational and Graphicalstatistics 153 (2006) 651-674

3 Hothorn Torsten et al Party A laboratory for recursive partytioning (2010)23httpwwwniddknihgov24Garcia Ada L et al Improved prediction of body fat by measuring skinfold thickness

circumferences and bone breadths Obesity Research 133 (2005) 626-63425The reported p-value at a node is equal to 1-criterion26Oates T Jensen D (1997) The effects of training set size on decision tree complexity

In Proceedings of the Fourteenth International Conference on Machine Learning pp254ndash262

27John GH (1995) Robust decision trees removing outliers from databases In Pro-ceedings of the First Conference on Knowledge Discovery and Data Mining pp 174ndash179

28 Brodley CE Friedl MA (1999) Identifying mislabeled training data J Artif Intell Res11 131ndash167 19

29Cano JR Herrera F Lozano M (2007) Evolutionary stratified training set selection forextracting classification rules with trade off precision-interpretability Data amp KnowledgeEngineering 60(1) 90ndash108

30Deb Partha and Pravin K Trivedi Demand for medical care by the elderly a finitemixture approach Journal of applied Econometrics 123 (1997) 313-336

31See Randall J (1989) The analysis of sensory data by generalised linear modelBiometrical journal 7 pp 781ndash793

32See Atkinson Elizabeth J and Terry M Therneau An introduction to recursivepartitioning using the RPART routines Rochester Mayo Foundation (2000)

33See Henry J Fuchs Drucy S Borowitz David H Christiansen Edward M MorrisMartha L Nash Bonnie W Ramsey Beryl J Rosenstein Arnold L Smith and MaryEllen Wohl for the Pulmozyme Study Group N Engl J Med 1994 331637-642September8 1994

34For further details see

1 Hothorn Torsten Kurt Hornik and Achim Zeileis ctree Conditional InferenceTrees

2 Hothorn Torsten Kurt Hornik and Achim Zeileis Unbiased recursive partition-ing A conditional inference framework Journal of Computational and Graphicalstatistics 153 (2006) 651-674

3 Hothorn Torsten et al Party A laboratory for recursive partytioning (2010)

110

Part II

Support Vector Machines

111

The Basic Idea

The support vector machine (SVM) is a supervised machine learning algorithm[35] that can be used for both regression and classification. Let's take a quick look at how SVM performs classification. The core idea of SVMs is that they construct hyperplanes in a multidimensional space that separate objects which belong to different classes. A decision plane is then used to define the boundaries between different classes. Figure 16.3 visualizes this idea.

The decision plane separates a set of observations into their respective classes using a straight line. In this example the observations belong either to class "solid circle" or class "edged circle". The separating line defines a boundary on the right side of which all objects are "solid circle" and to the left of which all objects are "edged circle".

Figure 16.3: Schematic illustration of a decision plane determined by a linear classifier


In practice as illustrated in Figure 164 the majority of classificationproblems require nonlinear boundaries in order to determine the optimal sep-aration 36 Figure 165 illustrates how SVMs solve this problem The left sideof the figure represents the original sample (known as the input space) whichis mapped using a set of mathematical functions known as kernels to thefeature space The process of rearranging the objects for optimal separationis known as transformation Notice that in mapping from the input space tothe feature space the mapped objects are linearly separable Thus althoughSVM uses linear learning methods due to its nonlinear kernel function it isin effect a nonlinear classifier

NOTE

Since intuition is better built from examples that are easy toimagine lines and points are drawn in the Cartesian plane in-stead of hyperplanes and vectors in a high dimensional spaceRemember that the same concepts apply where the examplesto be classified lie in a space whose dimension is higher thantwo

Figure 16.4: Nonlinear boundary required for correct classification

Figure 16.5: Mapping from input to feature space in an SVM

Overview of SVM Implementation

The SVM finds the decision hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. This minimizes the risk of misclassifying not only the examples in the training data set but also the yet-to-be-seen examples of the test set.

To construct an optimal hyperplane, the SVM employs an iterative training algorithm which is used to minimize an error function.

Let's take a look at how this is achieved. Given a set of feature vectors $x_i$ $(i = 1, 2, \ldots, N)$, a binary target label $y_i \in \{-1, +1\}$ is associated with each feature vector $x_i$. The decision function for classification of unseen examples is given as

$$y = f(x; \alpha) = \mathrm{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right) \qquad (16.1)$$

where the $s_i$ are the $N_s$ support vectors and $K(s_i, x)$ is the kernel function. The parameters $\alpha_i$ are determined by maximizing the margin of the hyperplane (see Figure 16.6):

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (16.2)$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C. \qquad (16.3)$$

To build an SVM classifier, the user needs to tune the cost parameter C, choose a kernel function, and optimize the kernel's parameters.

PRACTITIONER TIP

I once worked for an Economist who was trained (essentially) in one statistical technique: linear regression and its variants. Whenever there was an empirical issue, this individual always tried to frame it in terms of his understanding of linear models and economics. Needless to say, this archaic approach to modeling led to all sorts of difficulties. The data scientist is pragmatic in their modeling approach - linear, non-linear, Bayesian, boosting - guided by statistical theory and machine learning insights, and unshackled from the vagueness of economic theory.

Note on the Slack Parameter

The variable C, known as the slack parameter, serves as the cost parameter that controls the trade-off between the margin and classification error. If no slack is allowed (often known as a hard margin) and the data are linearly separable, the support vectors are the points which lie along the supporting hyperplanes, as shown in Figure 16.6. In this case all of the support vectors lie exactly on the margin.

Figure 16.6: Support vectors for linearly separable data and a hard margin

In many situations this will not yield useful results, and a soft margin will be required. In this circumstance some proportion of data points are allowed to remain inside the margin. The slack parameter C is used to control this proportion37. A soft margin results in a wider margin and greater error on the training data set; however, it improves generalization and reduces the likelihood of overfitting.

NOTE

The total number of support vectors depends on the amount of allowed slack and the distribution of the data. If a large amount of slack is permitted, there will be a larger number of support vectors than in the case where very little slack is permitted. Fewer support vectors means faster classification of test points, because the computational complexity of the SVM is linear in the number of support vectors.
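The effect of slack on the support-vector count is easy to see empirically. The short sketch below is not from the text: it uses e1071::svm() on two classes of the iris data, purely for illustration, fitting the same linear SVM with three different cost values; smaller cost (more slack) typically yields more support vectors.

# Illustrative sketch: number of support vectors as the cost parameter varies
library(e1071)

data(iris)
two_class <- droplevels(subset(iris, Species != "setosa"))  # binary problem

for (C in c(0.01, 1, 100)) {
  fit_C <- svm(Species ~ ., data = two_class, kernel = "linear", cost = C)
  cat("cost =", C, "-> support vectors:", fit_C$tot.nSV, "\n")
}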


Practical Applications

NOTE

The kappa coefficient38 is a measure of agreement between categorical variables. It is similar to the correlation coefficient in that higher values indicate greater agreement. It is calculated as

$$\kappa = \frac{P_o - P_e}{1 - P_e} \qquad (16.4)$$

where $P_o$ is the observed proportion correctly classified and $P_e$ is the proportion correctly classified by chance.
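As a small illustration of equation (16.4), the following sketch (using a hypothetical confusion matrix, not data from any study discussed below) computes the kappa coefficient directly in R.

# Minimal sketch: Cohen's kappa from a confusion matrix of predicted vs observed
kappa_coefficient <- function(conf) {
  n  <- sum(conf)
  Po <- sum(diag(conf)) / n                       # observed agreement
  Pe <- sum(rowSums(conf) * colSums(conf)) / n^2  # agreement expected by chance
  (Po - Pe) / (1 - Pe)
}

# Hypothetical 2 x 2 confusion matrix (rows = predicted, columns = observed)
conf <- matrix(c(40, 5, 10, 45), nrow = 2, byrow = TRUE)
kappa_coefficient(conf)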

Classification of Dengue Fever Patients

The Dengue virus is a mosquito-borne pathogen that infects millions of people every year. Gomes et al.39 use the support vector machine algorithm to classify 28 dengue patients from the Recife metropolitan area, Brazil (13 with dengue fever (DF) and 15 with dengue haemorrhagic fever (DHF)), based on mRNA expression data of 11 genes involved in the innate immune response pathway (MYD88, MDA5, TLR3, TLR7, TLR9, IRF3, IRF7, IFN-alpha, IFN-beta, IFN-gamma and RIGI).

A radial basis function is used, and the model is built using leave-one-out cross-validation repeated fifteen times under different conditions to analyze the individual and collective contributions of each gene's expression data to DF/DHF classification.

A different gene was removed during each of the first twelve cross-validations. During the last three cross-validations, multiple genes were removed. Figure 16.7 shows the overall accuracy of the support vector machine for differing values of its parameter C.

Figure 16.7: SVM optimization. Optimization of the parameters C and c of the SVM kernel RBF. Source: Gomes et al., doi:10.1371/journal.pone.0011267.g003


PRACTITIONER TIP

To transform the gene expression data into a suitable format for support vector machine training and testing, Gomes et al. designate each gene as either "10" (for observed up-regulation) or "01" (for observed down-regulation). The collective gene expressions observed in each patient were therefore represented by a 24-dimension vector (12 genes × 2 gene states, up- or down-regulated). Each of the 24-dimension vectors was labeled as either "1" for DF patients or "-1" for DHF patients. Notice this is a different classification structure than that used in traditional statistical modeling of binary variables: typically binary observations are coded by statisticians using 0 and 1. Be sure you have the correct classification structure when moving between traditional statistical models and those developed out of machine learning.
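To make this encoding concrete, here is a minimal sketch using hypothetical gene names and a single hypothetical patient profile. It is only meant to illustrate the "10"/"01" up/down-regulation coding and the +1/-1 labels, not to reproduce Gomes et al.'s data.

# Illustrative sketch: each gene contributes an "up" and a "down" indicator,
# so g genes give a 2*g-dimensional vector; labels are +1 (DF) or -1 (DHF)
genes  <- c("MYD88", "TLR3", "IRF7")                     # hypothetical subset
status <- c(MYD88 = "up", TLR3 = "down", IRF7 = "up")    # one patient's profile

encode_patient <- function(status, genes) {
  v <- as.vector(rbind(as.numeric(status[genes] == "up"),
                       as.numeric(status[genes] == "down")))
  names(v) <- as.vector(rbind(paste0(genes, "_up"), paste0(genes, "_down")))
  v
}

encode_patient(status, genes)
# label for this patient: y <- ifelse(patient_group == "DF", 1, -1)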

Forecasting Stock Market Direction

Huang et al.40 use a support vector machine to predict the direction of weekly changes in the NIKKEI 225 Japanese stock market index. The index is composed of 225 stocks of the largest Japanese publicly traded companies.

Two independent variables are selected as inputs to the model: weekly changes in the S&P 500 index and weekly changes in the US dollar - Japanese Yen exchange rate.

Data was collected from January 1990 to December 2002, yielding a total sample size of 676 observations. The researchers use 640 observations to train their support vector machine and perform an out-of-sample evaluation on the remaining 36 observations.

As a benchmark, the researchers compare the performance of their model to four other models: a random walk, linear discriminant analysis, quadratic discriminant analysis and a neural network.

The random walk correctly predicts the direction of the stock market 50% of the time, linear discriminant analysis 55%, quadratic discriminant analysis and the neural network 69%, and the support vector machine 73%.

The researchers also observe that an information-weighted combination of the models correctly predicts the direction of the stock market 75% of the time.

Bankruptcy Prediction

Min et al.41 develop a support vector machine to predict bankruptcy and compare its performance to a neural network, logistic regression and multiple discriminant analysis.

Data on 1,888 firms is collected from Korea's largest credit guarantee organization. The data set contains 944 bankrupt and 944 surviving firms. The attribute set consisted of 38 popular financial ratios and was reduced by principal component analysis to two "fundamental" factors.

The training set consists of 80% of the observations, with the remaining 20% of observations used for the hold-out test sample. A radial basis function is used for the kernel. Its parameters are optimized using a grid search procedure and 5-fold cross-validation.

In rank order (for the hold-out data), the support vector machine had a prediction accuracy of 83%, the neural network 82.5%, multiple discriminant analysis 79.1% and the logistic regression 78.3%.

Early Onset Breast Cancer

Breast cancer is often classified according to the number of estrogen receptors present on the tumor. Tumors with a large number of receptors are termed estrogen receptor positive (ER+), and estrogen receptor negative (ER-) for few or no receptors. ER status is important because ER+ cancers grow under the influence of estrogen and may respond well to hormone suppression treatments. This is not the case for ER- cancers, as they do not respond to hormone suppression treatments.

Upstill-Goddard et al.42 investigate whether patients who develop ER+ and ER- tumors show distinct constitutional genetic profiles using genetic single nucleotide polymorphism data. At the core of their analysis were support vector machines with linear, normalized quadratic polynomial, quadratic polynomial, cubic and radial basis kernels. The researchers opt for 10-fold cross-validation.

All five kernel models had an accuracy rate in excess of 93%; see Table 11.


Kernel Type                        Correctly Classified (%)
Linear                             93.28 ± 3.07
Normalized quadratic polynomial    93.69 ± 2.69
Quadratic polynomial               93.89 ± 3.06
Cubic polynomial                   94.64 ± 2.94
Radial basis function              95.95 ± 2.61

Table 11: Upstill-Goddard et al.'s kernels and classification results

Flood Susceptibility

Tehrany et al.43 evaluate support vector machines with different kernel functions for spatial prediction of flood occurrence in the Kuala Terengganu basin, Malaysia. Model attributes were constructed using ten geographic factors: altitude, slope, curvature, stream power index (SPI), topographic wetness index (TWI), distance from the river, geology, land use/cover (LULC), soil and surface runoff.

Four kernels - linear (LN), polynomial (PL), radial basis function (RBF) and sigmoid (SIG) - were used to assess factor importance. This was achieved by eliminating each factor in turn and then measuring the Cohen's kappa index of the resulting model. The overall rank of each factor44 is shown in Table 12. Overall, slope was the most important factor, followed by distance from river and then altitude.

Factor      Average   Rank
Altitude    0.265     3
Slope       0.288     1
Curvature   0.225     6
SPI         0.215     8
TWI         0.235     4
Distance    0.268     2
Geology     0.223     7
LULC        0.215     8
Soil        0.228     5
Runoff      0.140     10

Table 12: Variable importance calculated from data reported in Tehrany et al.
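A minimal sketch of the leave-one-factor-out idea is given below. It is not Tehrany et al.'s code or data: the iris data stands in for the flood data set, and cross-validated accuracy from e1071::svm() stands in for the kappa index, but the mechanics (drop a factor, refit, record the change in performance) are the same.

# Illustrative sketch: leave-one-factor-out importance with an SVM
library(e1071)

data(iris)                                  # stand-in data set
response <- "Species"
factors  <- setdiff(names(iris), response)

importance <- sapply(factors, function(f) {
  reduced <- iris[, setdiff(names(iris), f)]          # drop one factor
  fit_f   <- svm(Species ~ ., data = reduced, cross = 10)
  fit_f$tot.accuracy                                   # lower accuracy => more important
})

sort(importance)                            # rank factors by remaining accuracy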


PRACTITIONER TIP

Notice how both Upstill-Goddard et al. and Tehrany et al. use multiple kernels in developing their models. This is always a good strategy because it is rarely obvious at the outset of a research project which kernel is optimal.

Signature Authentication

Radhika et al.45 consider the problem of automatic signature authentication using a variety of algorithms, including a support vector machine (SVM). The other algorithms considered included a Bayes classifier (BC), fast Fourier transform (FT), linear discriminant analysis (LD) and principal component analysis (PCA).

Their experiment used a signature database containing 75 subjects, with 15 genuine samples and 15 forged samples for each subject. Features were extracted from images drawn from the database and used as inputs to train and test the various methods.

The researchers report a false rejection rate of 8% for SVM, 13% for FT, 10% for BC, 11% for PCA and 12% for LD.

Prediction of Vitamin D Status

The accepted biomarker of vitamin D status is serum 25-hydroxyvitamin D (25(OH)D) concentration. Unfortunately, in large epidemiological studies direct measurement is often not feasible. However, useful proxies for 25(OH)D are available from questionnaire data.

Guo et al.46 develop a support vector machine to predict serum 25(OH)D concentration in large epidemiological studies using questionnaire data. A total of 494 participants were recruited onto the study and asked to complete a questionnaire which included sun exposure and sun protection behaviors, physical activity, smoking history, diet and the use of supplements. Skin types were defined by spectrophotometric measurements of skin reflectance to calculate melanin density for exposed skin sites (dorsum of hand, shoulder) and non-exposed skin sites (upper inner arm, buttock).

A multiple regression model (MLR) estimated using 12 explanatory variables47 was used to benchmark the support vector machine. The researchers selected a radial basis function for the kernel, with identical explanatory factors used in the MLR. The data were randomly assigned to a training sample (n = 294) and a validation sample (n = 174).

The researchers report a correlation of 0.74 between predicted scores and measured 25(OH)D concentration for the support vector machine. They also note that it performed better than MLR in correctly identifying individuals with vitamin D deficiency. Overall they conclude: "RBF SVR [radial basis function support vector machine] method has considerable promise for the prediction of vitamin D status for use in chronic disease epidemiology and potentially other situations".

PRACTITIONER TIP

The performance of the SVM is very closely tied to the choice of the kernel function. There exist many popular kernel functions that have been widely used for classification, e.g. linear, Gaussian radial basis function, polynomial and so on. Data scientists can spend a considerable amount of time tweaking the parameters of a specified kernel function via trial-and-error. Here are four general approaches that can speed up the process (a short cross-validation sketch follows this list):

1. Cross-validation48.

2. Multiple kernel learning49, which attempts to construct a generalized kernel function so as to solve all classification problems through combining different types of standard kernel functions.

3. Evolution & particle swarm optimization. Thadani et al.50 use gene expression programming algorithms to evolve the kernel function of the SVM. An analogous approach has been proposed using particle swarm optimization51.

4. Automatic kernel selection using the C5.0 algorithm, which attempts to select the optimal kernel function based on the statistical characteristics and distribution of the data.
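Approach 1 is the simplest to try. Below is a minimal sketch (illustrative data and settings only, not a recommendation of any particular kernel) that compares candidate kernels using the built-in cross-validation of e1071::svm().

# Illustrative sketch: compare kernels by 10-fold cross-validated accuracy
library(e1071)

data(iris)
kernels <- c("linear", "polynomial", "radial", "sigmoid")

cv_acc <- sapply(kernels, function(k) {
  fit_k <- svm(Species ~ ., data = iris, kernel = k, cross = 10)
  fit_k$tot.accuracy
})

round(cv_acc, 1)   # pick the kernel with the highest cross-validated accuracy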


Support Vector Classification


Technique 17

Binary Response Classification with C-SVM

A C-SVM for binary response classification can be estimated using the package svmpath with the svmpath function:

svmpath(x, y, kernel.function)

Key parameters include the response variable y coded as (-1, +1), the covariates x, and the kernel specified via kernel.function.

PRACTITIONER TIP

Although there is an ever-growing number of kernels, four workhorses of applied research are:

• Linear: $K(x_i, x_j) = x_i^T x_j$

• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

• Radial basis function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$

• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

Here $\gamma$, $r$ and $d$ are kernel parameters.
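To build intuition for these formulas, the short sketch below evaluates each kernel for a single pair of feature vectors, with purely illustrative values for the parameters gamma, r and d.

# Illustrative sketch: evaluating the four kernels for one pair of vectors
xi <- c(1, 2, 3)
xj <- c(2, 0, 1)
gamma <- 0.5; r <- 1; d <- 2

linear_k <- sum(xi * xj)                     # x_i' x_j
poly_k   <- (gamma * sum(xi * xj) + r)^d     # (gamma x_i' x_j + r)^d
rbf_k    <- exp(-gamma * sum((xi - xj)^2))   # exp(-gamma ||x_i - x_j||^2)
sigm_k   <- tanh(gamma * sum(xi * xj) + r)   # tanh(gamma x_i' x_j + r)

c(linear = linear_k, polynomial = poly_k, radial = rbf_k, sigmoid = sigm_k)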


Step 1: Load Required Packages

We build the C-SVM for binary response classification using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> data(PimaIndiansDiabetes2, package = "mlbench")
> require(svmpath)

Step 2: Prepare Data & Tweak Parameters

The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining observations with missing values. The cleaned data is stored in temp.

> temp <- PimaIndiansDiabetes2
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

The response diabetes is a factor containing the labels "pos" and "neg". However, the svmpath method requires the response to be numeric, taking the values -1 or +1. The following converts diabetes into a format usable by svmpath:

> y <- temp$diabetes
> levels(y) <- c(-1, 1)
> y <- as.numeric(as.character(y))
> y <- as.matrix(y)

Support vector machine kernels generally depend on the inner product of attribute vectors. Therefore, very large values might cause numerical problems. We standardize the attributes using the scale method; the results are stored in the matrix x:

> x <- temp
> x$diabetes <- NULL    # remove the response variable
> x <- scale(x)

We use nrow to count the remaining observations (as a check, it should equal 724). We then set the training sample to select at random, without replacement, 600 observations. The remaining 124 observations form the test sample.


> set.seed(103)
> n = nrow(x)
> train <- sample(1:n, 600, FALSE)

Step 3: Estimate & Evaluate Model

The svmpath function can use two popular kernels: the polynomial and the radial basis function. We will assess both, beginning with the polynomial kernel. It is selected by setting kernel.function = poly.kernel. We also set trace = FALSE to prevent svmpath from printing results to the screen at each iteration.

> fit <- svmpath(x[train, ], y[train],
                 kernel.function = poly.kernel, trace = FALSE)

A nice feature of svmpath is that it computes the entire regularization path for the SVM cost parameter, along with the associated classification error. We use the with method to identify the minimum error:

> with(fit, Error[Error == min(Error)])
 [1] 140 140 140 140 140 140 140 140 140 140 140 140 140 140

Two things are worth noting here. First, the number of misclassified observations is 140 out of the 600 training observations, which is approximately 23%. Second, each occurrence of 140 is associated with a unique regularization value. Since the regularization value is the penalty parameter of the error term (and svmpath reports the inverse of this parameter), we will select for our model the minimum value and store it in lambda. This is achieved via the following steps:

1. Store the minimum error values in error.

2. Grab the row numbers of these minimum errors using the which method.

3. Obtain the regularization parameter values associated with the minimum errors and store them in temp_lamdba.

4. Identify which value in temp_lamdba is the minimum, store it in lambda and print it to the screen.


> error <- with(fit, Error[Error == min(Error)])
> min_err_row <- which(fit$Error == min(fit$Error))
> temp_lamdba <- fit$lambda[min_err_row]

> loc <- which(fit$lambda[min_err_row] == min(fit$lambda[min_err_row]))

> lambda <- temp_lamdba[loc]
> lambda
[1] 73.83352

The method svmpath actually reports the inverse of the kernel regularization parameter (often, and somewhat confusingly, called gamma in the literature). We obtain a value of 73.83352, which corresponds to a gamma of 1/73.83352 = 0.0135.

Next we follow the same procedure for the radial basis function kernel. In this case the estimated regularization parameter is stored in lambdaR:

> fitR <- svmpath(x[train, ], y[train],
                  kernel.function = radial.kernel, trace = FALSE)
> error <- with(fitR, Error[Error == min(Error)])

> min_err_row <- which(fitR$Error == min(fitR$Error))
> temp_lamdba <- fitR$lambda[min_err_row]

> loc <- which(fitR$lambda[min_err_row] == min(fitR$lambda[min_err_row]))

> lambdaR <- temp_lamdba[loc]

> lambdaR
[1] 0.09738556
> error[1]/600
[1] 0.015

Two things are noteworthy about this result. First, the regularization parameter is estimated as 1/0.09738556 = 10.268. Second, the error is estimated at only 1.5%. This is very likely an indication that the model has been overfit.


PRACTITIONER TIP

It often helps intuition to visualize data. Let's use a few lines of code to estimate a simple model and visualize it. We will fit a sub-set of the model we are already working on, using the first twelve patient observations on the attributes glucose and age, storing the result in xx. We standardize xx using the scale method. Then we grab the first twelve observations of the response variable diabetes:

> xx <- cbind(temp$glucose[1:12], temp$age[1:12])
> xx <- scale(xx)
> yy <- y[1:12]

Next we use the method svmpath to estimate the model and use plot to show the resultant data points. We use step = 1 to show the results at the first value of the regularization parameter and step = 8 to show the results at the last step. The dotted line in Figure 17.1 represents the margin; notice it gets narrower from step 1 to step 8 as the cost parameter increases. The support vectors are represented by the open dots. Notice that the number of support vectors increases as we move from step 1 to step 8. Also notice that at step 8 only one point is misclassified (point 7).

> example <- svmpath(xx, yy, trace = TRUE, plot = FALSE)
> par(mfrow = c(1, 2))
> plot(example, xlab = "glucose", ylab = "age", step = 1)
> plot(example, xlab = "glucose", ylab = "age", step = 8)


Figure 17.1: Plot of support vectors using svmpath with the response variable diabetes and attributes glucose & age


Step 4: Make Predictions

NOTE

Automated choice of kernel regularization parameters is challenging. This is because it is extremely easy to over-fit an SVM model on the validation sample if you only consider the misclassification rate. The consequence is that you end up with a model that is not generalizable to the test data, and/or a model that performs considerably worse than the discarded models with higher test sample error rates52.

Although we suspect the radial basis kernel results in an overfit, we will compare its predictions to those of the optimal polynomial kernel. First we use the test data and the radial basis kernel via the predict method. The confusion matrix is printed using the table method. The error rate is then calculated; it is 35%, considerably higher than the 1.5% indicated during validation.

> pred <- predict(fitR, newx = x[-train, ], lambda = lambdaR, type = "class")

> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 65 27
        1  16 16

> error_rate = (1 - sum(pred == y[-train]) / 124)
> round(error_rate, 2)
[1] 0.35


PRACTITIONER TIP

Degradation in performance due to over-fitting can be surprisingly large. The key is to remember that the primary goal of the validation sample is to provide a reliable indication of the expected error on the test sample and future, as yet unseen, samples. Throughout this book we have used the set.seed() method to help ensure replicability of the results. However, given the stochastic nature of the validation-test sample split, we should always expect variation in performance for different realisations. This suggests that evaluation should always involve multiple partitions of the data to form training, validation and test sets, as the sampling of data for a single partition might arbitrarily favour one classifier over another; a sketch of this idea follows.
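A minimal sketch of repeated partitioning for the current example is shown below. It assumes the objects x and y created earlier in this technique are still in memory, and simply repeats the split / fit / test cycle ten times with different seeds (the number of repetitions and the seeds are arbitrary).

# Illustrative sketch: spread of test error over ten random train/test splits
library(svmpath)

errors <- sapply(1:10, function(i) {
  set.seed(100 + i)
  train_i <- sample(1:nrow(x), 600, FALSE)
  fit_i   <- svmpath(x[train_i, ], y[train_i],
                     kernel.function = poly.kernel, trace = FALSE)
  lam_i   <- min(fit_i$lambda[fit_i$Error == min(fit_i$Error)])
  pred_i  <- predict(fit_i, newx = x[-train_i, ], lambda = lam_i, type = "class")
  mean(pred_i != y[-train_i])           # test error for this partition
})

summary(errors)   # distribution of test error across the ten partitions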

Let's now look at the polynomial kernel model:

> pred <- predict(fit, newx = x[-train, ], lambda = lambda, type = "class")

> table(pred, y[-train], dnn = c("Predicted", "Observed"))
         Observed
Predicted -1  1
       -1 76 13
        1   5 30

> error_rate = (1 - sum(pred == y[-train]) / 124)

> round(error_rate, 2)
[1] 0.15

It seems the error rate for this choice of kernel is around 15%. This is less than half the error of the radial basis kernel SVM.


Technique 18

Multicategory Classification with C-SVM

A C-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, ...)

Key parameters include kernel - the kernel function, cost - the cost parameter, the multicategory response variable y, and the covariates contained in data.

Step 1: Load Required Packages

We build the C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> require(e1071)
> library(mlbench)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)


Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(Class ~ ., data = Vehicle[train, ])

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05555556

Number of Support Vectors:  376

 ( 122 64 118 72 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (C-classification), kernel, cost parameter and gamma model parameter. Notice the model estimates 376 support vectors, with 122 in the first class (bus) and 72 in the fourth class (van).

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4 and cost between 4 and 16. We store the results in obj:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = 2^(-2:2), cost = 2^(2:4))

Once again we can use the summary function to see the result:

> summary(obj)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.25   16

- best performance: 0.254

The method automatically performs a 10-fold cross-validation. It appears the best model has a gamma of 0.25 and a cost parameter equal to 16, with a 25.4% misclassification rate.

PRACTITIONER TIP

Greater control over model tuning can be achieved using the tunecontrol argument inside of tune.svm. For example, to perform a 20-fold cross-validation you would use:

tunecontrol = tune.control(sampling = "cross", cross = 20)

Your call using tune.svm would look something like this:

obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                gamma = 2^(-2:2), cost = 2^(2:4),
                tunecontrol = tune.control(sampling = "cross", cross = 20))

Visualization of the output of tune.svm using plot is often useful for fine tuning:

> plot(obj)

Figure 18.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. Two things are noteworthy about this image. First, a larger cost parameter seems to indicate a better fit. Second, a gamma of 0.5 or less also indicates a better fitting model.

Using this information we re-tune the model, with gamma ranging from 0.01 to 0.5 and cost ranging from 16 to 256. Now the best performance occurs with gamma set to 0.03 and cost equal to 32.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  gamma = seq(0.01, 0.5, by = 0.01), cost = 2^(4:8))
> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.03   32

- best performance: 0.17


Figure 18.1: Tuning Multicategory Classification with C-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestC and bestGamma and refit the model:

> bestC <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
             cost = bestC, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method:

> print(fit)


Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

The fitted model now has 262 support vectors, down from 376 for the original model.

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross-validation. Overall the model has a total accuracy of 81%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    cost = bestC, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  32
      gamma:  0.03

Number of Support Vectors:  262

 ( 100 32 94 36 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 81
Single Accuracies:
 84 86 88 76 82 78 76 84 82 74

It can be fun to visualize a two-dimensional projection of the fitted data. To do this we use the plot function with the variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median value. Figure 18.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra, svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra), Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr), Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 18.2: Multicategory Classification with C-SVM: two-dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   87    2    0   5
          opel   0   55   28   2
          saab   0   20   68   0
          van    1    1    1  76

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 17.3% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.173


Technique 19

Multicategory Classification with nu-SVM

A nu-SVM for multicategory response classification can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-classification")

Key parameters include kernel - the kernel function, type set to "nu-classification", the multicategory response variable y, and the covariates contained in data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 134.

We estimate the support vector machine using the default settings. This will use a radial basis function as the kernel, with a nu parameter value set equal to 0.5.

> fit <- svm(Class ~ ., data = Vehicle[train, ], type = "nu-classification")

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification")


Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.05555556
         nu:  0.5

Number of Support Vectors:  403

 ( 110 88 107 98 )

Number of Classes:  4

Levels:
 bus opel saab van

The function provides details of the type of support vector machine (nu-classification), kernel, nu parameter and gamma parameter. Notice the model estimates 403 support vectors, with 110 in the first class (bus) and 98 in the fourth class (van). The total number of support vectors is slightly higher than estimated for the C-SVM discussed on page 134.

We use the tune.svm method to identify the best model for gamma and nu ranging between 0.05 and 0.45. We store the results in obj:

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.05, 0.45, by = 0.1),
                  nu = seq(0.05, 0.45, by = 0.1))

We can use the summary function to see the result:

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.05 0.05

- best performance: 0.178


The method automatically performs a 10-fold cross-validation. We see that the best model has a gamma and nu equal to 0.05, with a 17.8% misclassification rate. Visualization of the output of tune.svm can be achieved using the plot function:

> plot(obj)

Figure 19.1 illustrates the resultant plot. To interpret the image, note that the darker the shading, the better the fit of the model. It seems that a smaller nu and gamma lead to a better fit of the training data.

Using this information we re-tune the model, with gamma and nu ranging from 0.01 to 0.05. The best performance occurs with both parameters set to 0.02 and an overall misclassification error rate of 16.4%.

> obj <- tune.svm(Class ~ ., data = Vehicle[train, ],
                  type = "nu-classification",
                  gamma = seq(0.01, 0.05, by = 0.01),
                  nu = seq(0.01, 0.05, by = 0.01))

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma   nu
  0.02 0.02

- best performance: 0.164


Figure 19.1: Tuning Multicategory Classification with nu-SVM using the Vehicle data set

We store the results for the optimal model in the objects bestNU and bestGamma and then refit the model:

> bestNU <- obj$best.parameters[[2]]
> bestGamma <- obj$best.parameters[[1]]
> fit <- svm(Class ~ ., data = Vehicle[train, ],
             type = "nu-classification",
             nu = bestNU, gamma = bestGamma, cross = 10)

Details of the fit can be viewed using the print method


> print(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

The fitted model now has 204 support vectors, less than half the number required for the model we initially built.
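The drop in the number of support vectors is expected: nu acts (roughly) as a lower bound on the fraction of training points that become support vectors, so smaller nu values produce sparser models. A quick illustrative loop (assuming Vehicle and train from Steps 1 and 2 are still in memory; the nu values are arbitrary) makes this visible:

# Illustrative sketch: number of support vectors grows with nu
for (nu in c(0.05, 0.25, 0.5)) {
  fit_nu <- svm(Class ~ ., data = Vehicle[train, ],
                type = "nu-classification", nu = nu)
  cat("nu =", nu, "-> support vectors:", fit_nu$tot.nSV, "\n")
}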

The summary method provides additional details. Reported are the support vectors by classification and the results from the 10-fold cross-validation. Overall the model has a total accuracy of 75.6%.

> summary(fit)

Call:
svm(formula = Class ~ ., data = Vehicle[train, ],
    type = "nu-classification",
    nu = bestNU, gamma = bestGamma, cross = 10)

Parameters:
   SVM-Type:  nu-classification
 SVM-Kernel:  radial
      gamma:  0.02
         nu:  0.02

Number of Support Vectors:  204

 ( 71 29 72 32 )

Number of Classes:  4

Levels:
 bus opel saab van

10-fold cross-validation on training data:

Total Accuracy: 75.6
Single Accuracies:
 82 80 84 80 72 64 74 76 76 68

Next we visualize a two-dimensional projection of the fitted data. To do so we use the plot function with the variables Elong and Max.L.Ra for the x and y axes, whilst holding all the other variables in Vehicle at their median value. Figure 19.2 shows the resultant plot.

> plot(fit, Vehicle[train, ], Elong ~ Max.L.Ra, svSymbol = "v",
       slice = list(Comp = median(Vehicle$Comp), Circ = median(Vehicle$Circ),
                    D.Circ = median(Vehicle$D.Circ), Rad.Ra = median(Vehicle$Rad.Ra),
                    Pr.Axis.Ra = median(Vehicle$Pr.Axis.Ra), Scat.Ra = median(Vehicle$Scat.Ra),
                    Pr.Axis.Rect = median(Vehicle$Pr.Axis.Rect),
                    Max.L.Rect = median(Vehicle$Max.L.Rect),
                    Sc.Var.Maxis = median(Vehicle$Sc.Var.Maxis),
                    Sc.Var.maxis = median(Vehicle$Sc.Var.maxis),
                    Ra.Gyr = median(Vehicle$Ra.Gyr), Skew.Maxis = median(Vehicle$Skew.Maxis),
                    Skew.maxis = median(Vehicle$Skew.maxis),
                    Kurt.maxis = median(Vehicle$Kurt.maxis),
                    Kurt.Maxis = median(Vehicle$Kurt.Maxis),
                    Holl.Ra = median(Vehicle$Holl.Ra)))


Figure 19.2: Multicategory Classification with nu-SVM: two-dimensional projection of Vehicle

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   87    1    1   5
          opel   0   41   42   2
          saab   0   30   58   0
          van    1    1    2  75

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 24.6% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.246


Technique 20

Bound-constraint C-SVM classification

A bound-constraint C-SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "C-bsvc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "C-bsvc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates contained in data.

Step 1: Load Required Packages

We build the bound-constraint C-SVM for multicategory response classification using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23. The scatterplot3d package will be used to create a three-dimensional scatter plot.

> library(kernlab)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

We use 500 out of the 846 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample of 500 observations without replacement.


> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross-validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: C-bsvc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 375

Objective Function Value : -49.1905 -53.3552 -48.0924 -193.2145 -48.9507 -59.7368
Training error : 0.164
Cross validation error : 0.248

The function provides details of the type of support vector machine (C-bsvc), model cost parameter and kernel parameter (C = 1, sigma = 0.05). Notice the model estimates 375 support vectors, with a training error of 16.4% and a cross-validation error of 24.8%. In this case we observe a relatively large difference between the training error and the cross-validation error. In practice the cross-validation error is often a better indicator of the expected performance on the test sample.

Since the output of ksvm (in our example, the values stored in fit) is an S4 object, both errors can be accessed directly using "@". The "preferred" approach, though, is to use an accessor function such as cross(fit) and error(fit). If you don't know the accessor function names you can always use attributes(fit) and access the required parameter using "@" (for S4 objects) or "$" (for S3 objects).

> fit@error
[1] 0.164

> fit@cross
[1] 0.248
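If you are unsure which slots and accessors are available, a short sketch like the following (using the ksvm fit from this step, with kernlab already loaded) lists the S4 slots and shows both access styles:

# Illustrative sketch: inspecting an S4 ksvm object
slotNames(fit)   # all S4 slots, e.g. "error", "cross", "param", "alpha"
error(fit)       # accessor function for the training error
cross(fit)       # accessor function for the cross-validation error
fit@param        # direct slot access: list holding the cost parameter C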

PRACTITIONER TIP

The package kernlab supports a wide range of kernels. Here are nine popular choices:

• rbfdot - Radial basis kernel function

• polydot - Polynomial kernel function

• vanilladot - Linear kernel function

• tanhdot - Hyperbolic tangent kernel function

• laplacedot - Laplacian kernel function

• besseldot - Bessel kernel function

• anovadot - ANOVA RBF kernel function

• splinedot - Spline kernel

• stringdot - String kernel

We need to tune the model to obtain the optimum parameters. Let's create a few lines of R code to do this for us. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Check to see if it contains the value 35 (it should):

> runs <- n_sigma * n_cost


> count.cost <- 0
> count.sigma <- 0
> runs
[1] 35

The results, in terms of cross-validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1
> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
                  C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }  # end cost loop

154

TECHNIQUE 20 BOUND-CONSTRAINT C-SVM CLASSIFICATION

Notice we set cross = 45, which performs 45-fold cross-validation on the training data.

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2358586
2    4   0.2 0.2641414
3    4   0.3 0.2744108
4    4   0.4 0.2973064

Now let's find the best cross-validation performance and its associated row number:

> with(results, error[error == min(error)])
[1] 0.2074074

> which(results$error == min(results$error))
[1] 11

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]

> fit_xerror <- results[best_per_row, 3]
> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "sigma", "error")
> best_result
     cost sigma     error
[1,]   16   0.1 0.2074074

So we see the optimal results occur for cost = 16 and sigma = 0.1, with a cross-validation error of 20.7%.

Figure 20.1 presents a three-dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")

Figure 20.1: Bound-constraint C-SVM tuning: 3d scatterplot using Vehicle

After all that effort, we may as well estimate the optimal model using the training data and show the cross-validation error. It is around 24.5%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "C-bsvc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)

> fit@cross
[1] 0.2457912


Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    2   3
          opel   1   31   50   3
          saab   1   24   61   2
          van    1    0    3  75

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 26.3% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.263


Technique 21

Weston-Watkins Multi-Class SVM

A bound-constraint Weston-Watkins SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "kbb-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "kbb-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates contained in data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Weston-Watkins support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross-validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: kbb-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 356

Objective Function Value : 0
Training error : 0.148
Cross validation error : 0.278

The function provides details of the type of support vector machine (kbb-svc), model cost parameter, kernel type and parameter (sigma = 0.05), the number of support vectors (356), a training error of 14.8% and a cross-validation error of 27.8%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.148

> fit@cross
[1] 0.278

Let's see if we can tune the model using our own grid search. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs:

> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results, in terms of cross-validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1


> j = 1
> count = 1

The main loop for tuning is as follows:

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
                  C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }  # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2356902
2    4   0.2 0.2639731
3    4   0.3 0.3228956
4    4   0.4 0.2973064
5    4   0.5 0.2885522

Now let's find the best cross-validation performance and its associated row number:

> with(results, error[error == min(error)])
[1] 0.2122896

> which(results$error == min(results$error))
[1] 21

So the optimal cross-validation error is 21.2% and located in row 21 of results.

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   64   0.1 0.2122896

Figure 21.1 presents a three-dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")

Figure 21.1: Weston-Watkins Multi-Class SVM tuning: 3d scatterplot using Vehicle

After all that effort, we may as well estimate the optimal model using the training data and show the cross-validation error. It is around 27%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "kbb-svc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)

> fit@cross


[1] 0.2762626

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    1    0   5
          opel   1   37   43   4
          saab   1   33   53   1
          van    1    1    2  75

The error (misclassification) rate can be calculated:

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.269

Overall the model achieves an error rate of 26.9% on the test sample.


Technique 22

Crammer-Singer Multi-Class SVM

A Crammer-Singer SVM for multicategory response classification can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "spoc-svc", kpar, ...)

Key parameters include kernel - the kernel function, type set to "spoc-svc", kpar which contains the kernel parameter values, the multicategory response variable y, and the covariates contained in data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 151.

We estimate the Crammer-Singer multi-class support vector machine using a radial basis kernel (kernel = "rbfdot") with parameter sigma equal to 0.05. We also set the cross-validation parameter cross = 10.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.05), cross = 10)

The print function provides details of the estimated model:

> print(fit)
Support Vector Machine object of class "ksvm"


SV type: spoc-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 340

Objective Function Value : 0
Training error : 0.152
Cross validation error : 0.244

The function provides details of the type of support vector machine (spoc-svc), model cost parameter and kernel parameter (sigma = 0.05), the number of support vectors (340), a training error of 15.2% and a cross-validation error of 24.4%. Both types of error can be individually accessed as follows:

> fit@error
[1] 0.152

> fit@cross
[1] 0.244

We need to tune the model. First we set up the ranges for the cost and sigma parameters:

> cost <- 2^(2:8)
> sigma <- seq(0.1, 0.5, by = 0.1)
> n_cost <- length(cost)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs:

> runs <- n_sigma * n_cost
> count.cost <- 0
> count.sigma <- 0

The results, in terms of cross-validation error, cost and sigma, are stored in results:

> results <- 1:(3 * runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("cost", "sigma", "error")

The objects i, j and count are loop variables:

> i = 1


> j = 1
> count = 1

The main loop for tuning is as follows (note the model type is "spoc-svc", matching the Crammer-Singer SVM we are fitting):

> for (val in cost) {
    for (val in sigma) {
      cat("iteration = ", count, " out of ", runs, "\n")
      fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
                  C = cost[j],
                  kernel = "rbfdot", kpar = list(sigma = sigma[i]),
                  cross = 45)
      results[count, 1] = fit@param$C
      results[count, 2] = sigma[i]
      results[count, 3] = fit@cross
      count.sigma = count.sigma + 1
      count = count + 1
      i = i + 1
    }  # end sigma loop
    i = 1
    j = j + 1
  }  # end cost loop

When you execute the above code, as it is running you should see output along the lines of:

iteration = 1 out of 35
iteration = 2 out of 35
iteration = 3 out of 35
iteration = 4 out of 35
iteration = 5 out of 35

We turn results into a data frame using the as.data.frame method:

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost sigma     error
1    4   0.1 0.2299663
2    4   0.2 0.2378788
3    4   0.3 0.2582492
4    4   0.4 0.2730640

Now let's find the best cross-validation performance and its associated row number. The optimal cross-validation error is 20.8% and is located in row 11 of results:

> with(results, error[error == min(error)])
[1] 0.2075758

> which(results$error == min(results$error))
[1] 11

We save the optimal values in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_xerror <- results[best_per_row, 3]

> best_result <- cbind(fit_cost, fit_sigma, fit_xerror)
> colnames(best_result) <- c("cost", "sigma", "error")

> best_result
     cost sigma     error
[1,]   16   0.1 0.2075758

Figure 22.1 presents a three-dimensional visualization of the tuning numbers. It was created using scatterplot3d as follows:

> scatterplot3d(results$cost, results$sigma, results$error,
                xlab = "cost", ylab = "sigma", zlab = "Error")

Figure 22.1: Crammer-Singer Multi-Class SVM tuning: 3d scatterplot using Vehicle

After all that effort, we may as well estimate the optimal model using the training data and show the cross-validation error. It is around 24.8%.

> fit <- ksvm(Class ~ ., data = Vehicle[train, ], type = "spoc-svc",
              C = fit_cost, kernel = "rbfdot",
              kpar = list(sigma = fit_sigma), cross = 45)

> fit@cross


[1] 0.2478114

Step 4: Make Predictions

The predict method with the test data and the fitted model fit are used as follows:

> pred <- predict(fit, Vehicle[-train, ])

Classification results alongside the actual observed values can be obtained using the table method:

> table(Vehicle$Class[-train], pred, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   88    0    0   6
          opel   6   43   30   6
          saab   3   29   52   4
          van    1    0    1  77

The error (misclassification) rate can then be calculated. Overall the model achieves an error rate of 24.9% on the test sample.

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)
> round(error_rate, 3)
[1] 0.249


Support Vector Regression


Technique 23

SVM eps-Regression

An eps-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, cost, type = "eps-regression", epsilon, gamma, ...)

Key parameters include kernel - the kernel function, the parameters cost, gamma and epsilon, type set to "eps-regression", the continuous response variable y, and the covariates contained in data.

Step 1: Load Required Packages

We build the support vector machine eps-regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

> library(e1071)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate & Evaluate Model

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with a cost parameter value set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-regression")

The summary function provides details of the estimated model:

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression")

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
    epsilon:  0.1

Number of Support Vectors:  33

The function provides details of the type of support vector machine (eps-regression), the cost parameter, kernel type and associated gamma and epsilon parameters, and the number of support vectors, in this case 33.

A nice feature of the e1071 package is that it contains tune.svm, a support vector machine tuning function. We use the method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and epsilon between 0.01 and 0.9. The results are stored in obj:

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
                  type = "eps-regression",
                  gamma = 2^(-2:2),
                  cost = 2^(2:4), epsilon = seq(0.01, 0.9, by = 0.01))

Once again we can use the summary function to see the result:


> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost epsilon
  0.25    8    0.01

- best performance: 26.58699

The method automatically performs a 10-fold cross-validation. We see that the best model has a gamma of 0.25, cost parameter equal to 8 and epsilon of 0.01, with a cross-validated mean squared error of about 26.6 (for regression the reported performance is an error measure, not a misclassification rate). These metrics can also be accessed by calling $best.performance and $best.parameters:

> obj$best.performance
[1] 26.58699

> obj$best.parameters
  gamma cost epsilon
6  0.25    8    0.01

We use these optimum parameters to refit the model using leave-one-out cross-validation as follows:

> bestEpi <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
             type = "eps-regression",
             epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

The fitted model now has 44 support vectors:

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

The summary method provides additional details. It also reports the mean squared error for each cross-validation fold (not shown here).

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "eps-regression",
    epsilon = bestEpi, gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
    epsilon:  0.01

Number of Support Vectors:  44

45-fold cross-validation on training data:

Total Mean Squared Error: 28.0095
Squared Correlation Coefficient: 0.7877644

Step 4: Make Predictions

The predict method, with the test data and the fitted model fit, is used as follows:

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 23.1, is obtained using the plot function.

> plot(bodyfat$DEXfat[-train], pred, xlab="DEXfat",
    ylab="Predicted Values",
    main="Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712
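As an additional check of test-set fit (not part of the original text), the root mean squared error can be computed directly from the same pred object:

obs  <- bodyfat$DEXfat[-train]        # observed test-set values
rmse <- sqrt(mean((pred - obs)^2))    # test-set root mean squared error
round(rmse, 3)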

Figure 23.1: Predicted and observed values using SVM eps-regression for bodyfat

Technique 24

SVM nu-Regression

A nu-regression can be estimated using the package e1071 with the svm function:

svm(y ~ ., data, kernel, type = "nu-regression", nu, gamma, cost, ...)

Key parameters include kernel (the kernel function), the parameters cost, gamma and nu, type set to "nu-regression", the continuous response variable y and the covariates contained in data.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 172.

We begin by estimating the support vector machine using the default settings. This will use a radial basis function as the kernel, with the cost parameter set equal to 1.

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ], type="nu-regression")

The summary function provides details of the estimated model.

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression")

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111111
         nu:  0.5

Number of Support Vectors:  33

The function reports the type of support vector machine (nu-regression), the cost parameter, the kernel type with its associated gamma and nu parameters, and the number of support vectors, in this case 33, which happens to be the same number as estimated using the SVM eps-regression (see page 172). We use the tune.svm method to identify the best model for gamma ranging between 0.25 and 4, cost between 4 and 16, and nu between 0.1 and 0.9. The results are stored in obj.

> obj <- tune.svm(DEXfat ~ ., data = bodyfat[train, ],
    type="nu-regression", gamma = 2^(-2:2),
    cost = 2^(2:4), nu = seq(0.1, 0.9, by=0.1))

We use the summary function to see the result.

> summary(obj)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost  nu
  0.25    8 0.9

- best performance: 26.42275

The method automatically performs a 10-fold cross validation. We see that the best model has a gamma of 0.25, a cost parameter equal to 8 and nu equal to 0.9, with a best cross validation error (mean squared error) of around 26.4. These metrics can also be accessed by calling $best.performance and $best.parameters.

> obj$best.performance
[1] 26.42275

> obj$best.parameters
    gamma cost  nu
126  0.25    8 0.9

We use these optimum parameters to refit the model, using leave one out cross validation, as follows:

> bestNU <- obj$best.parameters[[3]]
> bestGamma <- obj$best.parameters[[1]]
> bestCost <- obj$best.parameters[[2]]

> fit <- svm(DEXfat ~ ., data = bodyfat[train, ],
    type="nu-regression", nu=bestNU,
    gamma=bestGamma, cost=bestCost, cross=45)

The fitted model now has 45 support vectors.

> print(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", nu = bestNU,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

The summary method provides additional details. It also reports the mean square error for each cross validation run (not shown here).

> summary(fit)

Call:
svm(formula = DEXfat ~ ., data = bodyfat[train, ],
    type = "nu-regression", nu = bestNU,
    gamma = bestGamma, cost = bestCost, cross = 45)

Parameters:
   SVM-Type:  nu-regression
 SVM-Kernel:  radial
       cost:  8
      gamma:  0.25
         nu:  0.9

Number of Support Vectors:  45

45-fold cross-validation on training data:

Total Mean Squared Error: 27.98642
Squared Correlation Coefficient: 0.7871421

Step 4: Make Predictions

The predict method, with the test data and the fitted model fit, is used as follows:

> pred <- predict(fit, bodyfat[-train, ])

A plot of the predicted and observed values, shown in Figure 24.1, is obtained using the plot function.

> plot(bodyfat$DEXfat[-train], pred, xlab="DEXfat",
    ylab="Predicted Values",
    main="Training Sample Model Fit")

We calculate the squared correlation coefficient using the cor function. It reports a value of 0.712.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.712

Figure 24.1: Predicted and observed values using SVM nu-regression for bodyfat

Technique 25

Bound-constraint SVM eps-Regression

A bound-constraint SVM eps-regression can be estimated using the package kernlab with the ksvm function:

ksvm(y ~ ., data, kernel, type = "eps-bsvr", kpar, ...)

Key parameters include kernel (the kernel function), type set to "eps-bsvr", kpar which contains the kernel parameter values, the continuous response variable y and the covariates contained in data.

Step 1: Load Required Packages

We build the model using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample, without replacement, from the 71 observations.

set.seed(465)
train <- sample(1:71, 45, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the bound-constraint support vector machine using a radial basis kernel (kernel = "rbfdot") with the kernel parameter sigma equal to 0.05. We also set the cross validation parameter cross = 10.

fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ], type="eps-bsvr",
    kernel="rbfdot", kpar=list(sigma=0.05), cross=10)

The print function provides details of the estimated model.

> print(fit)
Support Vector Machine object of class "ksvm"

SV type: eps-bsvr
Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.05

Number of Support Vectors : 35

Objective Function Value : -7.7105
Training error : 0.096254
Cross validation error : 18.77613

The function reports the type of support vector machine (eps-bsvr) and the kernel parameter (sigma = 0.05). Notice the model estimates 35 support vectors, with a training error of approximately 0.096 and a cross validation error of approximately 18.8. Since the output of ksvm (in this case the values are stored in fit) is an S4 object, the parameters and both errors can be accessed directly using "@":

> fit@param$C
[1] 1

> fit@param$epsilon
[1] 0.1

> fit@error
[1] 0.09625418

> fit@cross
[1] 18.77613
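If you are unsure which slots a ksvm object exposes, slotNames() (from the methods package, loaded by default) lists them. A short illustrative sketch, noting that the exact slot list depends on your kernlab version:

slotNames(fit)   # e.g. "param", "error", "cross", "alpha", ...
fit@kernelf      # the fitted kernel function object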

We need to tune the model to obtain the optimum parameters. Let's create a small bit of code to do this for us. First we set up the ranges for the epsilon and sigma parameters:

> epsilon <- seq(0.01, 0.1, by=0.01)
> sigma <- seq(0.1, 1, by=0.01)

> n_epsilon <- length(epsilon)
> n_sigma <- length(sigma)

The total number of models to be estimated is stored in runs. Since it contains 910 models, it may take a little while to run.

> runs <- n_sigma * n_epsilon
> runs
[1] 910

The results, in terms of cross validation error, cost, epsilon and sigma, are stored in results:

> results <- 1:(4*runs)
> dim(results) <- c(runs, 4)
> colnames(results) <- c("cost", "epsilon", "sigma", "error")

The loop variables are as follows:

> countepsilon <- 0
> countsigma <- 0

> i=1
> j=1
> k=1
> count=1

The main loop for tuning is as follows:

> for (val in epsilon) {
   for (val in sigma) {
     fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
         type="eps-bsvr", epsilon=epsilon[k],
         kernel="rbfdot", kpar=list(sigma=sigma[i]),
         cross=45)

     results[count, 1] = fit@param$C
     results[count, 2] = sigma[i]
     results[count, 3] = fit@param$epsilon
     results[count, 4] = fit@cross
     countsigma = countsigma + 1
     count = count + 1
     i = i + 1
   } # end sigma loop
   i = 1
   k = k + 1
   cat("iteration (%) = ", round((count/runs)*100, 0), "\n")
 } # end epsilon loop
> i = 1
> k = 1
> j = j + 1

Notice we set cross = 45 to perform leave one out validation.

When you execute the above code, as it is running you should see output along the lines of:

iteration (%) =  10
iteration (%) =  20
iteration (%) =  30

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek at the results and you should see something like this:

> results
  cost epsilon sigma    error
1    1    0.10  0.01 22.15108
2    1    0.11  0.01 22.76689
3    1    0.12  0.01 23.24192
4    1    0.13  0.01 23.78274
5    1    0.14  0.01 24.35676

Now let's find the best cross validation performance (21.86) and its associated row number (274):

> with(results, error[error == min(error)])
[1] 21.86181

> which(results$error == min(results$error))
[1] 274

The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$error == min(results$error))

> fit_cost <- results[best_per_row, 1]
> fit_sigma <- results[best_per_row, 2]
> fit_epsilon <- results[best_per_row, 3]
> fit_xerror <- results[best_per_row, 4]

> best_result <- cbind(fit_cost, fit_epsilon, fit_sigma, fit_xerror)

> colnames(best_result) <- c("cost", "epsilon", "sigma", "error")

> best_result
     cost epsilon sigma    error
[1,]    1    0.04   0.1 21.86181
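A more compact way to run the same grid search is sketched below. This is an alternative to the loop above, built around expand.grid; the helper object grid and its column names are our own:

grid <- expand.grid(sigma = sigma, epsilon = epsilon)    # 910 parameter pairs
grid$error <- apply(grid, 1, function(p) {
  m <- ksvm(DEXfat ~ ., data = bodyfat[train, ], type = "eps-bsvr",
            epsilon = p["epsilon"], kernel = "rbfdot",
            kpar = list(sigma = p["sigma"]), cross = 45)
  m@cross                                                # leave one out error
})
grid[which.min(grid$error), ]                            # best sigma/epsilon pair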

After all that effort we may as well estimate the optimal model using the training data and show the cross validation error. It is around 21.9.

> fit <- ksvm(DEXfat ~ ., data = bodyfat[train, ],
    type="eps-bsvr", cost=fit_cost, epsilon=fit_epsilon,
    kernel="rbfdot", kpar=list(sigma=fit_sigma), cross=45)

> fit@cross
[1] 21.86834

Step 4: Make Predictions

The predict method, with the test data and the fitted model fit, is used as follows:

pred <- predict(fit, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 25.1, is visualized using the plot method combined with the abline method to show the linear regression line.

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred, xlab="DEXfat",
    ylab="Predicted Values",
    main="Training Sample Model Fit")

> abline(linReg, col="darkred")

The squared correlation between the test sample predicted and observed values is 0.834.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.834

Figure 25.1: Bound-constraint SVM eps-regression observed and fitted values using bodyfat

Support Vector Novelty Detection

Technique 26

One-Classification SVM

A one-classification SVM can be estimated using the package e1071 with the svm function:

svm(x, kernel, type = "one-classification", ...)

Key parameters include x - the attributes you use to train the model, type = "one-classification", and kernel - the kernel function.
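Before working through the Vehicle example, the mechanics can be seen in isolation on simulated data; everything below (the data and the parameter values) is an illustrative sketch only:

library(e1071)

set.seed(42)
normal <- matrix(rnorm(200 * 2), ncol = 2)            # training data: a single "normal" class
novel  <- matrix(rnorm(20 * 2, mean = 4), ncol = 2)   # unseen points far from the training cloud

oc_fit <- svm(normal, y = NULL, type = "one-classification",
              nu = 0.05, gamma = 0.5)

# TRUE = classified as belonging to the training class, FALSE = flagged as novel
table(predict(oc_fit, rbind(normal, novel)))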

Step 1: Load Required Packages

We build the one-classification SVM using the data frame Vehicle contained in the mlbench package. For additional details on this sample see page 23.

> library(e1071)
> library(caret)
> library(mlbench)
> data(Vehicle)
> library(scatterplot3d)

Step 2: Prepare Data & Tweak Parameters

To begin we choose the bus category as our one-classification class, and create the variable isbus to store TRUE/FALSE values of vehicle type.

> set.seed(103)
> Vehicle$isbus[Vehicle$Class == "bus"] <- TRUE
> Vehicle$isbus[Vehicle$Class != "bus"] <- FALSE

As a reminder of the distribution of Vehicle we use the table method

> table(Vehicle$Class)

 bus opel saab  van
 218  212  217  199

As a check, note that you should observe 218 observations corresponding to bus in isbus = TRUE.

> table(Vehicle$isbus)

FALSE  TRUE
  628   218

Our next step is to create the variables isbusTrue and isbusFalse to hold the positive (bus) and negative (non-bus) observations respectively.

> isbusTrue <- subset(Vehicle, Vehicle$isbus == TRUE)
> isbusFalse <- subset(Vehicle, Vehicle$isbus == FALSE)

Next, from the subset of 218 positive values, we sample 100 observations at random to form the training data. The remaining 118 positive observations will form part of the test sample.

> train <- sample(1:218, 100, FALSE)

> trainAttributes <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 19]

Now, as a quick check, type trainLabels and check that the values are all bus.

Step 3: Estimate & Evaluate Model

We need to tune the model. We begin by setting up the tuning parameters. The variable runs contains the total number of models we will estimate during the tuning process.

gamma <- seq(0.01, 0.05, by=0.01)
nu <- seq(0.01, 0.05, by=0.01)

n_gam <- length(gamma)
n_nu <- length(nu)

runs <- n_gam * n_nu

countgamma <- 0
countnu <- 0

Next set up the variable to store the results, and the loop count variables i, k and count:

> results <- 1:(3*runs)
> dim(results) <- c(runs, 3)
> colnames(results) <- c("gamma", "nu", "Performance")

> i=1
> k=1
> count=1

The main tuning loop is created as follows (notice we estimate the model using 10-fold cross validation (cross = 10) and store the results in fit):

> for (val in nu) {
   for (val in gamma) {
     print(gamma[i])
     print(nu[k])

     fit <- svm(trainAttributes, y=NULL,
         type='one-classification',
         nu=nu[k], gamma=gamma[i], cross=10)

     results[count, 2] = fit$gamma
     results[count, 1] = fit$nu
     results[count, 3] = fit$tot.accuracy
     countgamma = countgamma + 1
     count = count + 1
     i = i + 1
   } # end gamma loop
   i = 1
   k = k + 1
 } # end nu loop

We turn results into a data frame using the as.data.frame method.

> results <- as.data.frame(results)

Take a peek and you should see something like this:

> results
  gamma   nu Performance
1  0.01 0.01          94
2  0.01 0.02          88
3  0.01 0.03          87
4  0.01 0.04          81

Now let's find the best cross validation performance:

> with(results, Performance[Performance == max(Performance)])
[1] 94 94

Notice the best performance contains two values, both at 94. Since this is a very high value, it is possibly the result of over fitting - a consequence of over tuning a model. The optimal values may need to be stored for later use. We save them in best_result using a few lines of code:

> best_per_row <- which(results$Performance == max(results$Performance))

> fit_gamma <- results[best_per_row, 1]
> fit_nu <- results[best_per_row, 2]
> fit_per <- results[best_per_row, 3]

> best_result <- cbind(fit_gamma, fit_nu, fit_per)
> colnames(best_result) <- c("gamma", "nu", "Performance")

> best_result
     gamma   nu Performance
[1,]  0.01 0.01          94
[2,]  0.02 0.01          94

The relationship between the parameters and performance is visualized using the scatterplot3d function, and shown in Figure 26.1.

> scatterplot3d(results$nu, results$gamma, results$Performance,
    xlab="nu", ylab="gamma", zlab="Performance")

Figure 26.1: Relationship between tuning parameters

Let's fit the optimum model using leave one out cross validation.

> fit <- svm(trainAttributes, y=NULL,
    type="one-classification",
    nu=fit_nu, gamma=fit_gamma, cross=100)

Step 4: Make Predictions

To illustrate the performance of the model on the test sample we will use the confusionMatrix method from the caret package. The first step is to gather together the required information for the training and test samples, using the predict method to make the forecasts.

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svmpredtrain <- predict(fit, trainpredictors)
> svmpredtest <- predict(fit, testpredictors)

Next we create two tables, confTrain containing the training predictions and confTest containing the test sample predictions.

> confTrain <- table(Predicted=svmpredtrain, Reference=trainLabels)
> confTest <- table(Predicted=svmpredtest, Reference=testLabels)

Now we call the confusionMatrix method for the test sample.

> confusionMatrix(confTest, positive="TRUE")

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   175    5
    TRUE    453  113

               Accuracy : 0.3861
                 95% CI : (0.351, 0.4221)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 1

                  Kappa : 0.093
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.9576
            Specificity : 0.2787
         Pos Pred Value : 0.1996
         Neg Pred Value : 0.9722
             Prevalence : 0.1582
         Detection Rate : 0.1515
   Detection Prevalence : 0.7587
      Balanced Accuracy : 0.6181

       'Positive' Class : TRUE

The method produces a range of test statistics, including an accuracy rate of only 38.6% and a kappa of 0.093. Maybe we over-tuned the model a little. In this example we can see clearly that the consequence of over fitting in training is poor generalization in the test sample.

For illustrative purposes we re-estimate the model, this time with nu = 0.05 and gamma = 0.05.

> fit <- svm(trainAttributes, y=NULL,
    type="one-classification",
    nu=0.05, gamma=0.05, cross=100)

A little bit of house keeping, as discussed previously:

> trainpredictors <- isbusTrue[train, 1:18]
> trainLabels <- isbusTrue[train, 20]

> testPositive <- isbusTrue[-train, ]
> testPosNeg <- rbind(testPositive, isbusFalse)

> testpredictors <- testPosNeg[, 1:18]
> testLabels <- testPosNeg[, 20]

> svmpredtrain <- predict(fit, trainpredictors)
> svmpredtest <- predict(fit, testpredictors)

> confTrain <- table(Predicted=svmpredtrain, Reference=trainLabels)
> confTest <- table(Predicted=svmpredtest, Reference=testLabels)

Now we are ready to pass the necessary information to the confusionMatrix method.

> confusionMatrix(confTest, positive="TRUE")

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   568   29
    TRUE     60   89

               Accuracy : 0.8807
                 95% CI : (0.8552, 0.9031)
    No Information Rate : 0.8418
    P-Value [Acc > NIR] : 0.001575

                  Kappa : 0.5952
 Mcnemar's Test P-Value : 0.001473

            Sensitivity : 0.7542
            Specificity : 0.9045
         Pos Pred Value : 0.5973
         Neg Pred Value : 0.9514
             Prevalence : 0.1582
         Detection Rate : 0.1193
   Detection Prevalence : 0.1997
      Balanced Accuracy : 0.8293

       'Positive' Class : TRUE

Now the accuracy rate is 0.88, with a kappa of around 0.6. An important take away is that support vector machines, as with many predictive analytic methods, can be very sensitive to over fitting.


Notes

35. See the paper by Cortes C., Vapnik V. (1995). Support-Vector Networks. Machine Learning 20, 273–297.

36. Note that classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. The SVM is well suited for such tasks.

37. The nu parameter in nu-SVM.

38. For further details see Hoehler F.K., 2000. Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol 53, 499–503.

39. Gomes A.L. et al. Classification of dengue fever patients based on gene expression data using support vector machines. PloS ONE 5.6 (2010): e11267.

40. Huang Wei, Yoshiteru Nakamori and Shou-Yang Wang. Forecasting stock market movement direction with support vector machine. Computers & Operations Research 32.10 (2005): 2513-2522.

41. Min Jae H. and Young-Chan Lee. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications 28.4 (2005): 603-614.

42. Upstill-Goddard Rosanna et al. Support vector machine classifier for estrogen receptor positive and negative early-onset breast cancer. (2013): e68606.

43. Tehrany Mahyat Shafapour et al. Flood susceptibility assessment using GIS-based support vector machine model with different kernel types. Catena 125 (2015): 91-101.

44. Calculated by the author using the average of all four support vector machines.

45. Radhika K.R., M.K. Venkatesha and G.N. Sekhar. Off-line signature authentication based on moment invariants using support vector machine. Journal of Computer Science 6.3 (2010): 305.

46. Guo Shuyu, Robyn M. Lucas and Anne-Louise Ponsonby. A novel approach for prediction of vitamin D status using support vector regression. PloS ONE 8.11 (2013).

47. Latitude, ambient ultraviolet radiation levels, ambient temperature, hours in the sun 6 weeks before the blood draw (log transformed to improve the linear fit), frequency of wearing shorts in the last summer, physical activity (three levels: mild, moderate, vigorous), sex, hip circumference, height, left back shoulder melanin density, buttock melanin density and inner upper arm melanin density.

48. For further details see either of the following:
    1. Cawley G.C. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In: International Joint Conference on Neural Networks. IEEE; 2006. p. 1661–1668.
    2. Vapnik V., Chapelle O. Bounds on error expectation for support vector machines. Neural Computation 2000; 12(9): 2013–2036.
    3. Muller K.R., Mika S., Ratsch G., Tsuda K., Scholkopf B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks.

49. See:
    1. Bach F.R., Lanckriet G.R., Jordan M.I. Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the twenty-first international conference on Machine learning. ACM; 2004. p. 6–13.
    2. Zien A., Ong C.S. Multiclass multiple kernel learning. In: Proceedings of the 24th international conference on Machine learning. ACM; 2007. p. 1191–1198.

50. Thadani K., Jayaraman V., Sundararajan V. Evolutionary selection of kernels in support vector machines. In: International Conference on Advanced Computing and Communications. IEEE; 2006. p. 19–24.

51. See for example:
    1. Lin Shih-Wei et al. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications 35.4 (2008): 1817-1824.
    2. Melgani Farid and Yakoub Bazi. Classification of electrocardiogram signals with support vector machines and particle swarm optimization. Information Technology in Biomedicine, IEEE Transactions on 12.5 (2008): 667-677.

52. For further discussion of the issues surrounding over fitting see G.C. Cawley and N.L.C. Talbot. Preventing over-fitting in model selection via Bayesian regularization of the hyper-parameters. Journal of Machine Learning Research, volume 8, pages 841-861, April 2007.

Part III

Relevance Vector Machine


The Basic Idea

The relevance vector machine (RVM) shares its functional form with the support vector machine (SVM) discussed in Part II. RVMs exploit a probabilistic Bayesian learning framework.

We begin with a data set of N training pairs {xi, yi}, where xi is the input feature vector and yi is the target output. The RVM makes predictions using

    yi = wT K + ε    (26.1)

where w = [w1, ..., wN] is the vector of weights, K = [k(xi, x1), ..., k(xi, xN)]T is the vector of kernel functions, and ε is the error, which for algorithmic simplicity is assumed to be zero-mean, independently identically distributed Gaussian with variance σ². The prediction yi therefore consists of the target output polluted by Gaussian noise.

The Gaussian likelihood of the data is given by

    p(y | w, σ²) = (2π)^(−N/2) σ^(−N) exp( −(1/2σ²) ||y − Φw||² )    (26.2)

where y = [y1, ..., yN], w = [ε, w1, ..., wN], and Φ is an N × (N + 1) matrix with Φij = k(xi, xj−1) and Φi1 = 1.

A standard approach to estimating the parameters in equation 26.2 is to use a zero-mean Gaussian to introduce an individual hyperparameter αi on each of the weights wi, so that

    w ~ N(0, 1/α)    (26.3)

where α = [α1, ..., αN]. The prior and posterior probability distributions are then easily derived53.

The RVM uses far fewer kernel functions than the SVM. It also yields probabilistic predictions, automatic estimation of parameters and the ability to use arbitrary kernel functions. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning those basis functions which are relevant for making good predictions.

NOTE

The RVM requires the inversion of N × N matrices. Since it takes O(N³) operations to invert an N × N matrix, it can quickly become computationally expensive, and therefore slow, as the sample size increases.

Practical Applications

PRACTITIONER TIP

The relevance vector machine is often assessed using the root mean squared error (RMSE) or the Nash-Sutcliffe efficiency (NS). The larger the value of NS, the smaller the value of RMSE:

    RMSE = sqrt( Σ (xi − x̂i)² / N )    (26.4)

    NS = 1 − Σ (xi − x̂i)² / Σ (xi − x̄)²    (26.5)

where the sums run over i = 1, ..., N, x̂i is the predicted value, x̄ is the average of the sample, and N is the number of observations.
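Both criteria are easy to compute directly; here is a sketch of the two metrics as plain R helper functions (the function names are our own):

rmse <- function(obs, pred) sqrt(mean((pred - obs)^2))

nash_sutcliffe <- function(obs, pred) {
  1 - sum((pred - obs)^2) / sum((obs - mean(obs))^2)
}

# Example: a near-perfect prediction gives a small RMSE and NS close to 1
obs  <- c(3, 5, 7, 9)
pred <- c(3.1, 4.8, 7.2, 8.9)
c(RMSE = rmse(obs, pred), NS = nash_sutcliffe(obs, pred))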

Oil Sand Pump Prognostics

In wet mineral processing operations, slurry pumps deliver a mixture of bitumen, sand and small pieces of rock from one site to another. Due to the harsh environment in which they operate, these pumps can fail suddenly. The consequent downtime can lead to a large economic cost for the mining company due to the interruption of the mineral processing operations. Hu and Tse54 combine a RVM with an exponential model to predict the remaining useful life (RUL) of slurry pumps.

Data were collected from the inlet and outlet of slurry pumps operating in an oil sand mine, using four accelerometers placed at key locations on the pump. In total the pump was subjected to 904 measurement hours.

Two data-sets were collected from different positions (Site 1 and Site 2) in the same pump, and an alternative model of the pump degradation was developed using the sum of two exponential functions. The overall performance results are shown in Table 13.

                      Site 1   Site 2
RVM                    70.51    28.85
Exponential Model      25.31    70.5

Table 13: Hu and Tse's weighted average accuracy of prediction

Precision Agriculture

Precision agriculture involves using aircraft or spacecraft to gather high-resolution information on crops to better manage the growing and harvesting cycle. Chlorophyll concentration, measured in mass per unit leaf area (μg cm−2), is an important biophysical measurement retrievable from air or space reflectance data. Elarab et al55 use a RVM to estimate spatially distributed chlorophyll concentrations.

Data was gathered from three aircraft flights during the growing season (early growth, mid growth and early flowering). The data set consisted of the dependent variable (chlorophyll concentration) and 8 attribute or explanatory variables. All attributes were used to train and test the model.

The researchers used six different kernels (Gauss, Laplace, spline, Cauchy, thin plate spline (tps) and bubble) alongside a 5-fold cross validation. A variety of kernel widths, ranging from 10^-5 to 10^5, were also used. The root mean square error and Nash-Sutcliffe efficiency were used to assess model fit. The best fitting trained model used a tps kernel and had a root mean square error of 5.31 μg cm−2 and a Nash-Sutcliffe efficiency of 0.76.

The test data consisted of a sample gathered from an unseen (by the RVM) flight. The model applied to this data had a root mean square error of 8.52 μg cm−2 and a Nash-Sutcliffe efficiency of 0.71. The researchers conclude "This result showed that the [RVM] model successfully performed when given unseen data".

Deception Detecting in Speech

Zhou56 develop a technique to detect deception in speech by extracting speech dynamics (prosodic and non-linear features) and applying a RVM.

The data set consisted of recorded interviews of 16 male and 16 female participants. A total of 640 deceptive samples and 640 non-deceptive samples were used in the analysis.

Speech dynamics such as pitch frequency, short-term vocal energy and mel-frequency cepstrum coefficients57 were used as input attribute features.

Classification accuracy of the RVM was assessed relative to a support vector machine and a neural network for various training sample sizes. For example, with a training sample of 400, the RVM correctly classifies 70.37% of male voices, whilst the support vector machine and neural network correctly classify 68.14% and 42.13% respectively.

The researchers observe that a combination of prosodic and non-linear features, modeled using a RVM, is effective for detecting deceptive speech.

Diesel Engine Performance

The mathematical form of diesel engines is highly nonlinear. Because of this they are often modeled using an artificial neural network (ANN). Wong et al58 perform an experiment to assess the ability of an ANN and a RVM to predict diesel engine performance.

Three inputs are used in the models (engine speed, load and cooling water temperature), and the output variables are brake specific fuel consumption and exhaust emissions such as nitrogen oxide and carbon dioxide.

A water-cooled 4-cylinder direct-injection diesel engine was used to generate data for the experiment. Data was recorded at five engine speeds (1200, 1400, 1600, 1800 and 2000 rpm) with engine torques of 28, 70, 140, 210 and 252 Nm. Each test was carried out three times and the average values were used in the analysis. In all, 22 sets of data were collected, with 18 used as the training data and 4 used to assess model performance.

The ANN had a single hidden layer with twenty hidden neurons and a hyperbolic tangent sigmoid transfer function.

For the training data the researchers report an average root mean square error (RMSE) of 3.27 for the RVM and 41.59 for the ANN. The average RMSE for the test set was 17.73 and 38.56 for the RVM and ANN respectively. The researchers also note that the average R² for the test set was 0.929 for the RVM and 0.707 for the ANN.

Gas Layer Classification

Zhao et al59 consider the issue of optimal parameter selection for a RVM and the identification of gas at various drill depths. To optimize the parameters of the RVM they use the particle swarm algorithm60.

The sample consists of 201 wells drilled at various depths, with 63 gas producing wells and 138 non-gas producing wells. A total of 12 logging attributes are used as features in the RVM model. A prediction accuracy of 93.53% is obtained during training using all the attributes; the prediction accuracy was somewhat lower, at 91.75%, for the test set.

Credit Risk Assessment

Tong and Li61 assess the use of a RVM and a support vector machine (SVM) in the assessment of company credit risk. The data consist of financial characteristics of 464 Chinese firms listed on China's securities markets. A total of 116 of the 464 firms had experienced a serious credit event and were coded as "0" by the researchers. The remaining firms were coded "1".

The attribute vector consisted of 25 financial ratios drawn from seven basic categories (cash flow, return on equity, earning capacity, operating capacity, growth capacity, short term solvency and long term solvency). Since many of these financial ratios have a high correlation with each other, the researchers used principal components analysis (PCA) and isomaps to reduce the dimensionality. The models they then compared were a PCA-RVM, Isomap-SVM and Isomap-RVM. A summary of the results is reported in Table 14.

Model         Accuracy
Isomap-RVM    90.28%
Isomap-SVM    86.11%
PCA-RVM       89.59%

Table 14: Summary of Tong and Li's overall prediction accuracy

PRACTITIONER TIP

You may have noticed that several researchers compared their RVM to a support vector machine or other model. A data scientist I worked with faced an issue where one model (logistic regression) performed marginally better than an alternative model (decision tree). However, the decision was made to go with the decision tree because, due to a quirk in the software, it was better labeled.

Zhou et al demonstrated the superiority of a RVM over a support vector machine for their data-set. The evidence was less compelling in the case of Tong and Li, where Table 14 appears to indicate a fairly close race. Where alternative models perform in a similar range, the rationale for which to choose may be based on considerations not related to which model performed best in absolute terms on the test set. This was certainly the case in the choice between logistic regression and decision trees faced by my data scientist co-worker.

Technique 27

RVM Regression

A RVM regression can be estimated using the package kernlab with the rvm function:

rvm(y ~ ., data, kernel, kpar, ...)

Key parameters include kernel (the kernel function), kpar which contains the kernel parameter values, the continuous response variable y and the covariates contained in data.

Step 1: Load Required Packages

We build the RVM regression using the data frame bodyfat contained in the TH.data package. For additional details on this sample see page 62.

library(kernlab)
data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters

We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample, without replacement, from the 71 observations.

set.seed(465)
train <- sample(1:71, 45, FALSE)

Step 3: Estimate & Evaluate Model

We estimate the RVM using a radial basis kernel (kernel = "rbfdot") with the parameter sigma equal to 0.5. We also set the cross validation parameter cross = 10.

> fit <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel="rbfdot", kpar=list(sigma=0.5), cross=10)

The print function provides details of the estimated model, including the kernel type (radial basis), the kernel parameter (sigma = 0.5), the number of relevance vectors (43) and the cross validation error.

> print(fit)
Relevance Vector Machine object of class "rvm"
Problem type: regression

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.5

Number of Relevance Vectors : 43
Variance : 12.11944
Training error : 2.4802080247
Cross validation error : 10.79221

Since the output of rvm is an S4 object, it can be accessed directly using "@":

> fit@error
[1] 2.480208

> fit@cross
[1] 10.79221

OK, let's fit two other models and choose the one with the lowest cross validation error. The first model, fit1, is estimated using the default settings of rvm. The second model, fit2, uses a Laplacian kernel. Ten-fold cross validation is used for both models.

> fit1 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    cross=10)

> fit2 <- rvm(DEXfat ~ ., data = bodyfat[train, ],
    kernel="laplacedot", kpar=list(sigma=0.001), cross=10)

Now we are ready to assess the fit of each of the three models - fit, fit1 and fit2.

> fit@cross
[1] 10.82592

> fit1@cross
[1] 65.50046

> fit2@cross
[1] 2.19755

The model fit2, estimated using a Laplacian kernel, has by far the lowest overall cross validation error.

Step 4: Make Predictions

The predict method, with the test data and the fitted model fit2, is used as follows:

pred <- predict(fit2, bodyfat[-train, ])

We fit a linear regression using pred as the response variable and the observed values as the covariate. The regression line, alongside the predicted and observed values shown in Figure 27.1, is visualized using the plot method combined with the abline method to show the linear regression line.

> linReg <- lm(pred ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred, xlab="DEXfat",
    ylab="Predicted Values",
    main="Training Sample Model Fit")

> abline(linReg, col="darkred")

The squared correlation between the test sample predicted and observed values is 0.813.

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
[1] 0.813

Figure 27.1: RVM regression of observed and fitted values using bodyfat

Notes

53. For further details see:
    1. M.E. Tipping. "Bayesian inference: an introduction to principles and practice in machine learning." In Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds., pp. 41–62, Springer, Berlin, Germany, 2004.
    2. M.E. Tipping. "SparseBayes: An Efficient Matlab Implementation of the Sparse Bayesian Modelling Algorithm (Version 2.0)", March 2009. http://www.relevancevector.com

54. Hu Jinfei and Peter W. Tse. A relevance vector machine-based approach with application to oil sand pump prognostics. Sensors 13.9 (2013): 12663-12686.

55. Elarab Manal et al. Estimating chlorophyll with thermal and broadband multispectral high resolution imagery from an unmanned aerial system using relevance vector machines for precision agriculture. International Journal of Applied Earth Observation and Geoinformation (2015).

56. Zhou Yan et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features. Neurocomputing 151 (2015): 1042-1052.

57. Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, commonly used as features in speech recognition systems. See, for example, Logan Beth. Mel Frequency Cepstral Coefficients for Music Modeling. ISMIR 2000; and Hasan Md Rashidul, Mustafa Jamil, Md Golam Rabbani, Md Saifur Rahman. Speaker identification using Mel frequency cepstral coefficients. Variations 1 (2004): 4.

58. Wong Ka In, Pak Kin Wong and Chun Shun Cheung. Modelling and prediction of diesel engine performance using relevance vector machine. International Journal of Green Energy 12.3 (2015): 265-271.

59. Zhao Qianqian et al. Relevance Vector Machine and Its Application in Gas Layer Classification. Journal of Computational Information Systems 9.20 (2013): 8343-8350.

60. See Haiyan Lu, Pichet Sriyanyong, Yong Hua Song, Tharam Dillon. Experimental study of a new hybrid PSO with mutation for economic dispatch with non-smooth cost function. International Journal of Electrical Power and Energy Systems 32(9), 2012: 921–935.

61. Tong Guangrong and Siwei Li. Construction and Application Research of Isomap-RVM Credit Assessment Model. Mathematical Problems in Engineering (2015).

Part IV

Neural Networks


The Basic Idea

An artificial neural network (ANN) is constructed from a number of interconnected nodes known as neurons. These are arranged into an input layer, a hidden layer and an output layer. The input nodes correspond to the number of features you wish to feed into the ANN, and the number of output nodes corresponds to the number of items you wish to predict. Figure 27.2 presents an overview of an artificial neural network topology. It has 2 input nodes, 1 hidden layer with 3 nodes and 1 output node.

Figure 27.2: A basic neural network

The Neuron

At the heart of a neural network is the neuron. Figure 27.3 outlines the workings of an individual neuron. A weight is associated with each arc into a neuron, and the neuron then sums all inputs according to

    Sj = Σ (i=1..N) wij ai + bj    (27.1)

The parameter bj represents the bias associated with the neuron. It allows the network to shift the activation function "upwards" or "downwards". This type of flexibility is important for successful machine learning.

Figure 27.3: An artificial neuron

Activation and Learning

A neural network is generally initialized with random weights. Once the network has been initialized it is then trained. Training consists of two elements - activation and learning.

• Step 1: An activation function f(Sj) is applied, and the output is passed to the next neuron(s) in the network. The sigmoid function is a popular activation function (a short R sketch follows below):

      f(Sj) = 1 / (1 + exp(−Sj))    (27.2)

• Step 2: A learning "law" describes how the adjustments to the weights are made during training. The most popular learning law is backpropagation.
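A one-line R version of the sigmoid in equation 27.2, purely for illustration:

sigmoid <- function(s) 1 / (1 + exp(-s))

sigmoid(c(-5, 0, 5))   # approximately 0.007, exactly 0.5, approximately 0.993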


The Backpropagation Algorithm

It consists of the following steps:

1. The network is presented with the input attributes and the target outcome.

2. The output of the network is compared to the actual known target outcome.

3. The weights and biases of each neuron are adjusted by a factor based on the derivative of the activation function, the differences between the network output and the actual target outcome, and the neuron outputs. Through this process the network "learns".

Two parameters are often used to speed up learning and prevent the system from being trapped in a local minimum. They are known as the learning rate and the momentum.
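For concreteness, the textbook form of the backpropagation weight update (a standard result, stated here as background rather than taken from this book's derivation) shows where these two parameters enter. With learning rate η and momentum α, the change applied to weight wij at step t is

    Δwij(t) = −η ∂E/∂wij + α Δwij(t−1)

so η scales the step taken down the error gradient, while α carries forward a fraction of the previous step, smoothing the trajectory and helping the search escape shallow local minima.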

PRACTITIONER TIP

ANNs are initialized by setting random values for the weights and biases. One rule of thumb is to set the random values to lie in the range (−2/k to 2/k), where k is the number of inputs.

How Many Nodes and Hidden Layers?

One of the very first questions asked about neural networks is how many nodes and layers should be included in the model. There are no fixed rules as to how many nodes to include in the hidden layer. However, as the number of hidden nodes increases, so does the time taken for the network to learn from the input data.


Practical Applications

Sheet Sediment Transport

Tayfur62 models sediment transport using artificial neural networks (ANNs). The sample consisted of the original experimental hydrological data of Kilinic & Richardson63.

A three-layer feed-forward artificial neural network, with two neurons in the input layer, eight neurons in the hidden layer and one neuron in the output layer, was built. The sigmoid function was used as the activation function in the training of the network, and the learning of the ANN was accomplished by the back-propagation algorithm. Values of 0.2 and −1.0 were assigned to the network weights and biases before starting the training process.

The ANN's performance was assessed relative to popular physical hydrological models (flow velocity, shear stress, stream power and unit stream power) against a combination of slope types (mild, steep and very steep) and rain intensities (low, high, very high).

The ANN outperformed the popular hydrological models for very high intensity rainfall on both steep and very steep slopes, see Table 15.

                      Mild slope   Steep slope   Very steep slope
Low intensity         physical     physical      ANN
High intensity        physical     physical      physical
Very high intensity   physical     ANN           ANN

Table 15: Which model is best? Model transition framework derived from Tayfur's analysis

PRACTITIONER TIP

Tayfur's results highlight an important issue facing the data scientist. Not only is it often a challenge to find the best model among competing candidates, it is even more difficult to identify a single model that works in all situations. A great solution, offered by Tayfur, is to have a model transition matrix; that is, determine which model(s) perform well under specific conditions and then use the appropriate model for a given condition.

Stock Market Volatility

Mantri et al64 investigate the performance of a multilayer perceptron relative to standard econometric models of stock market volatility (GARCH, Exponential GARCH, Integrated GARCH, and the Glosten, Jagannathan and Runkle GARCH model).

Since the multilayer perceptron does not make assumptions about the distribution of stock market innovations, it is of interest to financial analysts and statisticians. The sample consisted of daily data collected on two Indian stock indices (the BSE SENSEX and the NSE NIFTY) over the period January 1995 to December 2008.

The researchers found no statistical difference between the volatility of the stock indices estimated by the multilayer perceptron and the standard econometric models of stock market volatility.

Trauma Survival

The performance of trauma departments in the United Kingdom is widely audited by applying predictive models that assess the probability of survival, and examining the rate of unexpected survivals and deaths. The standard approach is the TRISS methodology, which consists of two logistic regressions, one applied if the patient has a penetrating injury and the other applied for blunt injuries65.

Hunter, Henry and Ferguson66 assess the performance of the TRISS methodology against alternative logistic regression models and a multilayer perceptron.

The sample consisted of 15,055 cases gathered from Scottish trauma departments over the period 1992-1996. The data was divided into two subsets: the training set containing 7,224 cases from 1992-1994, and the test set containing 7,831 cases gathered from 1995-1996. The researchers' logistic regression models and the neural network were optimized using the training set.

The neural network was optimized ten times, with the best resulting model selected. The researchers conclude that neural networks can yield better results than logistic regression.

PRACTITIONER TIP

The sigmoid function is a popular choice as an activation function. It is good practice to standardize (i.e. convert to the range (0,1)) all external input values before passing them into a neural network. This is because, without standardization, large input values require very small weighting factors. This can cause two basic problems:

1. Inaccuracies introduced by very small floating point calculations on your computer.

2. Changes made by the back-propagation algorithm will be extremely small, causing training to be slow (the gradient of the sigmoid function at extreme values would be approximately zero).

Brown Trout Redds

Lek et al67 compare the ability of multiple regression and neural networks to predict the density of brown trout redds in southwest France. Twenty nine observation stations, distributed over six rivers and divided into 205 morphodynamic units, collected information on 10 ecological metrics.

The models were fitted using all the ecological variables, and also with a sub-set of four variables. Testing consisted of random selection of the training set (75% of observations) and the test set (25% of observations). The process was repeated a total of five times.

The average correlation between the observed and estimated values over the five samples is reported in Table 16. The researchers conclude that both multiple regression and neural networks can be used to predict the density of brown trout redds; however, the neural network model had better prediction accuracy.

   Neural Network        Multiple Regression
   Train      Test       Train      Test
   0.900      0.886      0.684      0.609

Table 16: Lek et al's reported correlation coefficients between estimated and observed values in the training and test samples

Electric Fish Localization

Weakly electric fish emit an electric discharge used to navigate the surrounding water and to communicate with other members of their shoal. Tracking of individual fish is often carried out using infrared cameras. However, this approach becomes unreliable when there is a visual obstruction68.

Kiar et al69 develop a non-invasive means of tracking weakly electric fish in real-time using a cascade forward neural network. The data set contained 299 data points, which were interpolated to 29,900 data points. The accuracy of the neural network was 94.3% within 1 cm of the actual fish location, with a mean square error of 0.02 mm and an image frame rate of 213 Hz.

Chlorophyll Dynamics

Wu et al70 developed two modeling approaches - artificial neural networks (ANN) and multiple linear regression (MLR) - to simulate the daily chlorophyll a dynamics in a northern German lowland river. Chlorophyll absorbs sunlight to synthesize carbohydrates from CO2 and water. It is often used as a proxy for the amount of phytoplankton present in water bodies.

Daily chlorophyll a samples were taken over an 18 month period. In total 426 daily samples were obtained. Each 10th daily sample was assigned to the validation set, resulting in 42 daily observations; the calibration set contained 384 daily observations.

For ANN modelling a three layer back propagation neural network was used. The input layer consisted of 12 neurons corresponding to the independent variables shown in Table 17. The same independent variables were also used in the multiple regression model. The dependent variable in both models was the daily concentration of chlorophyll a.

Air temperature                Ammonium nitrogen
Average daily discharge        Chloride
Chlorophyll a concentration    Daily precipitation
Dissolved inorganic nitrogen   Nitrate nitrogen
Nitrite nitrogen               Orthophosphate phosphorus
Sulfate                        Total phosphorus

Table 17: Wu et al's input variables

The results of the ANN and MLR illustrated a good agreement between the observed and predicted daily concentrations of chlorophyll a, see Table 18.

Model   R-Square   NS     RMSE
MLR     0.53       0.53   2.75
NN      0.63       0.62   1.94

Table 18: Wu et al's performance metrics (NS = Nash-Sutcliffe efficiency, RMSE = root mean square error)

PRACTITIONER TIP

Whilst there are no fixed rules about how to standardize inputs, here are four popular choices for your original input variable xi:

    zi = (xi − xmin) / (xmax − xmin)    (27.3)

    zi = (xi − x̄) / σx    (27.4)

    zi = xi / sqrt(SSi)    (27.5)

    zi = xi / (xmax + 1)    (27.6)

SSi is the sum of squares of xi, and x̄ and σx are the mean and standard deviation of xi.
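The four choices are straightforward to apply in R; a short sketch using an illustrative input vector (the equation numbers refer to the formulas above):

x <- c(2, 5, 9, 14, 20)                          # illustrative input variable

z_range  <- (x - min(x)) / (max(x) - min(x))     # equation 27.3: scaled to [0, 1]
z_zscore <- (x - mean(x)) / sd(x)                # equation 27.4: zero mean, unit sd
z_ss     <- x / sqrt(sum(x^2))                   # equation 27.5: divide by root sum of squares
z_max    <- x / (max(x) + 1)                     # equation 27.6: divide by max + 1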

Examples of Neural Network Classification

Technique 28

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation and weight backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+", ...)

Key parameters include the response variable y, the covariates contained in data, hidden - the number of hidden neurons, and algorithm = "rprop+", which specifies resilient backpropagation with backtracking.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(neuralnet)
> data(PimaIndiansDiabetes2, package="mlbench")

Step 2: Prepare Data & Tweak Parameters

PimaIndiansDiabetes2 has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp.

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix, and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)
> levels(y) <- c(0, 1)

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)

> names(y) <- c("diabetes")

> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations. The variable f is used to store the form of the model, where diabetes is the dependent or response variable. Be sure to check that n is equal to 724, the number of observations in the full sample.

> set.seed(103)
> n = nrow(temp)

> train <- sample(1:n, 600, FALSE)

> f <- diabetes ~ pregnant + glucose + pressure + mass + pedigree + age

Step 3: Estimate & Evaluate Model

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ], hidden=4,
    algorithm = "rprop+")

The print method gives a nice overview of the model.

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "rprop+")

1 repetition was calculated.

        Error Reached Threshold Steps
1  181.242328   0.009057962229  8448

A nice feature of the neuralnet package is the ability to visualize a fitted network using the plot method, see Figure 28.1.

> plot(fit, intercept = FALSE, show.weights = FALSE)

PRACTITIONER TIP

It often helps intuition to visualize data. To see the fitted intercepts set intercept = TRUE, and to see the estimated neuron weights set show.weights = TRUE. For example, try entering:

plot(fit, intercept = TRUE, show.weights = TRUE)

Figure 28.1: Resilient Backpropagation with Backtracking neural network using PimaIndiansDiabetes2

Step 4: Make Predictions

We transfer the data into a variable called z, and use this, with the compute method, on the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)

     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20      1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 61  6
        1 20 37

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.21

The misclassification error rate is around 21%.


Technique 29

Resilient Backpropagation

A neural network with resilient backpropagation, without weight backtracking, can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-", ...)

Key parameters include the response variable y, the covariates contained in data, hidden - the number of hidden neurons, and algorithm = "rprop-", which specifies resilient backpropagation without backtracking.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ], hidden=4,
    algorithm = "rprop-")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226.

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "rprop-")

1 repetition was calculated.

         Error Reached Threshold Steps
1  184.0243459   0.009715675095  6814

Step 4: Make Predictions

We transfer the data into a variable called z, and use this, with the compute method, on the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)

     [,1]
4      -1
5       1
7      -1
12      1
17      1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 62  4
        1 19 39

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.19

The misclassification error rate is around 19%.


Technique 30

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr", ...)

Key parameters include the response variable y, the covariates contained in data, hidden - the number of hidden neurons, and algorithm = "slr", which specifies a globally convergent algorithm that uses resilient backpropagation without weight backtracking and, additionally, the smallest learning rate.

Step 3: Estimate & Evaluate Model

Steps 1 and 2 are outlined beginning on page 226.

The model is fitted using neuralnet with four hidden neurons.

> fit <- neuralnet(f, data = temp[train, ], hidden=4,
    algorithm = "slr")

The print method gives a nice overview of the model. The error is close to that observed for the neural network model estimated with backtracking, outlined on page 226; however, due to the small learning rate, the algorithm takes very many more steps to converge.

> print(fit)

Call: neuralnet(formula = f, data = temp[train, ],
    hidden = 4, algorithm = "slr")

1 repetition was calculated.

         Error Reached Threshold Steps
1  179.7865898   0.009813138137 96960

Step 4: Make Predictions

We transfer the data into a variable called z, and use this, with the compute method, on the test sample:

> z <- temp
> z <- z[, -7]
> pred <- compute(fit, z[-train, ])

The actual predictions should look something like this:

> sign(pred$net.result)

     [,1]
4      -1
5      -1
7      -1
12      1
17      1
20     -1

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(sign(pred$net.result), sign(temp[-train, 7]),
    dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 58 10
        1 23 33

We also need to calculate the error rate:

> error_rate = (1 - sum(sign(pred$net.result) ==
    sign(temp[-train, 7])) / 124)
> round(error_rate, 2)
[1] 0.27

The misclassification error rate is around 27%.


Technique 31

Probabilistic Neural Network

A probabilistic neural network can be estimated using the package pnn with the learn function:

learn(y, data, ...)

Key parameters include the response variable y and the covariates contained in data.

Step 1: Load Required Packages

The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(pnn)
> data(PimaIndiansDiabetes2, package="mlbench")

Step 2: Prepare Data & Tweak Parameters

PimaIndiansDiabetes2 has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp.

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we separate out the response variable, convert the attributes into a matrix using the scale method to standardize them, and then bind the two back together:

> y <- (temp$diabetes)

> temp$diabetes <- NULL
> temp <- scale(temp)
> temp <- cbind(as.factor(y), temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations.

> set.seed(103)
> n = nrow(temp)

> n_train <- 600
> n_test <- n - n_train

> train <- sample(1:n, n_train, FALSE)

Step 3 Estimate amp Evaluate ModelThe model is fitted using the learn method with the fitted model stored infit_basic

gt fit_basic lt- learn(dataframe(y[train]temp[train ]))

You can use the attributes method to identify the slots characteristicsof fit_basicgt attributes(fit_basic)$names[1] model set category

column categories[5] k n

236

TECHNIQUE 31 PROBABILISTIC NEURAL NETWORK

PRACTITIONER TIP

Remember, you can access the contents of a fitted probabilistic neural network by using the $ notation. For example, to see what is in the "model" slot you would type:

> fit_basic$model
[1] "Probabilistic neural network"

The summary method provides details on the model:

> summary(fit_basic)
                Length Class      Mode
model           1      -none-     character
set             8      data.frame list
category.column 1      -none-     numeric
categories      2      -none-     character
k               1      -none-     numeric
n               1      -none-     numeric

Next we use the smooth method to set the smoothing parameter sigma. We use a value of 0.5:

> fit <- smooth(fit_basic, sigma = 0.5)

PRACTITIONER TIP

Much of the time you will not have a pre-specified value in mind for the smoothing parameter sigma. However, you can let the smooth function find the best value using its inbuilt genetic algorithm. To do that you would type something along the lines of:

> smooth(fit_basic)

The performance statistics of the fitted model are assessed using the perf method. For example, enter the following to see various aspects of the fitted model:

> fit <- perf(fit)
> fit$observed

237

92 Applied Predictive Modeling Techniques in R

> fit$guessed
> fit$success
> fit$fails
> fit$success_rate
> fit$bic

Step 4 Make Predictions
Let's take a look at the testing sample. To see the first row of covariates in the testing set enter:

> round(temp[-train,][1,], 2)

    pregnant glucose pressure  mass pedigree   age
100    -0.85   -1.07    -0.52 -0.63    -0.93 -1.05

You can see the first observation of the response variable in the test sample in a similar way:

> y[-train][1]
[1] neg
Levels: neg pos

Now let's predict the first response value in the test set using the covariates:

> guess(fit, as.matrix(temp[-train,][1,]))$category
[1] "neg"

Take a look at the associated probabilities. In this case there is a 99% probability associated with the neg class:

> guess(fit, as.matrix(temp[-train,][1,]))$probabilities
        neg         pos
0.996915706 0.003084294

Here is how to see both the prediction and the associated probabilities:

> guess(fit, as.matrix(temp[-train,][1,]))
$category
[1] "neg"

$probabilities
        neg         pos
0.996915706 0.003084294

OK, now we are ready to predict all the response values in the test sample. We can do this with a few lines of R code:

> pred <- 1:n_test
> for (i in 1:n_test)
    pred[i] <- guess(fit, as.matrix(temp[-train,][i,]))$category

Let's create a confusion matrix so we can see how well the neural network performed on the test sample:

> table(pred, y[-train], dnn = c("Predicted", "Observed"))

         Observed
Predicted neg pos
      neg  79   2
      pos   2  41

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == y[-train]) / n_test)
> round(error_rate, 3)
[1] 0.032

The misclassification error rate is around 3%.

239

Technique 32

Multilayer Feedforward Neural Network

A multilayer feedforward neural network can be estimated using the AMORE package with the train function:

train(net, P, T, error.criterium, ...)

Key parameters include net, the neural network you wish to train; P, the training set attributes; T, the training set response variable (output values); and the error criterion (Least Mean Squares, Least Mean Logarithm Squared, or TAO error) contained in error.criterium.

Step 1 Load Required Packages
The neural network is built using the data frame PimaIndiansDiabetes2 contained in the mlbench package. For additional details on this data set see page 52.

> library(AMORE)
> data(PimaIndiansDiabetes2, package = "mlbench")

Step 2 Prepare Data & Tweak Parameters
The PimaIndiansDiabetes2 data frame has a large number of missing values (recorded as NA), particularly for the attributes insulin and triceps. We remove these two attributes from the sample and use the na.omit method to remove any remaining missing values. The cleaned data is stored in temp:

240

TECHNIQUE 32 MULTILAYER FEEDFORWARD NEURAL

> temp <- (PimaIndiansDiabetes2)
> temp$insulin <- NULL
> temp$triceps <- NULL
> temp <- na.omit(temp)

Next we need to convert the response variable and attributes into a matrix and then use the scale method to standardize the matrix temp:

> y <- (temp$diabetes)
> levels(y) <- c(0, 1)
> y <- as.numeric(as.character(y))
> names(y) <- c("diabetes")
> temp$diabetes <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)

Now we can select the training sample. We choose to use 600 out of the 724 observations:

> set.seed(103)
> n = nrow(temp)
> train <- sample(1:n, 600, FALSE)

Now we need to create the neural network we wish to train:

> net <- newff(n.neurons = c(1, 3, 2, 1),
      learning.rate.global = 0.01,
      momentum.global = 0.5,
      error.criterium = "LMLS", Stao = NA,
      hidden.layer = "sigmoid",
      output.layer = "purelin",
      method = "ADAPTgdwm")

I'll explain the code above line by line. In the first line we're creating an object called net that will contain the structure of our new neural network. The first argument to newff is n.neurons, which allows us to specify the number of inputs, the number of nodes in each hidden layer and the number of outputs. So in the example above we have 1 input, 2 hidden layers (the first containing 3 nodes and the second containing 2 nodes) and 1 output.

The learning.rate.global argument constrains how much the algorithm is allowed to change the weights from iteration to iteration as the network is trained. In this case learning.rate.global = 0.01, which means that the algorithm can't increase or decrease any one weight in the network by more than 0.01 from trial to trial.

The error.criterium argument specifies the error mechanism used at each iteration. There are three options here: "LMS" (least mean squares), "LMLS" (least mean logarithm squared) and "TAO" (the Tao error method). In general I tend to choose "LMLS" as my starting point. However, I will often train my networks using all three methods.

The hidden.layer and output.layer arguments are used to choose the type of activation function used to interpret the summations of the inputs and weights for any of the layers in your network. I have set output.layer = "purelin", which results in linear output. Other options include "tansig", "sigmoid", "hardlim" and "custom".

Finally, method specifies the solution strategy for converging on the weights within the network. For building prototype models I tend to use either "ADAPTgd" (adaptive gradient descent) or "ADAPTgdwm" (adaptive gradient descent with momentum).
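As a hedged illustration of the options just described (the values and the object name net2 are assumptions chosen for illustration, not taken from the text), an alternative configuration using a tansig hidden layer, the TAO error criterion and plain adaptive gradient descent might look like this:

> # Illustrative sketch: 6 inputs, one hidden layer of 4 nodes, 1 output.
> # The TAO criterion requires a value for Stao; ADAPTgd omits momentum.
> net2 <- newff(n.neurons = c(6, 4, 1),
      learning.rate.global = 0.01,
      momentum.global = 0.5,
      error.criterium = "TAO", Stao = 1,
      hidden.layer = "tansig",
      output.layer = "purelin",
      method = "ADAPTgd")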

Step 3 Estimate & Evaluate Model
The model is fitted using the train method. In this example I have set report = TRUE to provide output during the algorithm run and n.shows = 5 to show a total of 5 training reports:

> fit <- train(net, P = temp[train,], T = temp[train, 7],
      error.criterium = "LMLS", report = TRUE,
      show.step = 100, n.shows = 5)

index.show: 1 LMLS 0.337115972269707
index.show: 2 LMLS 0.335651051328758
index.show: 3 LMLS 0.335113569553075
index.show: 4 LMLS 0.334753676125557
index.show: 5 LMLS 0.334462044665089
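As noted earlier, it is often worth training the same network under all three error criteria. Here is a minimal hedged sketch of how that might be done (the list fit_all and the value Stao = 1 are illustrative assumptions; Stao is only used by the TAO criterion):

> fit_all <- list()
> for (crit in c("LMS", "LMLS", "TAO")) {
    # re-train the network under each error criterion
    fit_all[[crit]] <- train(net, P = temp[train,], T = temp[train, 7],
        error.criterium = crit, Stao = 1,
        report = FALSE, show.step = 100, n.shows = 5)
  }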

Step 4 Make Predictions
The sign function is used to convert predictions into negative and positive values. Here I use the fitted network held in fit to predict using the test sample:

> pred <- sign(sim(fit$net, temp[-train,]))

Let's create a confusion matrix so we can see how well the neural network performed on the test sample. Notice that sign(temp[-train, 7]) contains the observed output (response) values for the test sample:

> table(pred, sign(temp[-train, 7]),
      dnn = c("Predicted", "Observed"))

         Observed
Predicted -1  1
       -1 68 27
        1 13 16

We also need to calculate the error rate:

> error_rate = (1 - sum(pred == sign(temp[-train, 7])) / 124)
> round(error_rate, 3)
[1] 0.323

The misclassification error rate is around 32%.

243

Examples of Neural Network Regression

244

Technique 33

Resilient Backpropagation with Backtracking

A neural network with resilient backpropagation and backtracking can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop+", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop+" to specify resilient backpropagation with backtracking.

Step 1 Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(neuralnet)
> data(bodyfat, package = "TH.data")

Step 2 Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations. The response variable and attributes are structured and stored in the variable f. In addition, the standardized variables are stored in the data frame scale_bodyfat:

> set.seed(465)

245

92 Applied Predictive Modeling Techniques in R

> train <- sample(1:71, 45, FALSE)

> f <- DEXfat ~ waistcirc + hipcirc + age + elbowbreadth +
      kneebreadth + anthro3a + anthro3b + anthro3c + anthro4

> scale_bodyfat <- as.data.frame(scale(bodyfat))

Step 3 Estimate & Evaluate Model
The number of hidden neurons should be determined in relation to the needed complexity. For this example we use one hidden neuron:

> fit <- neuralnet(f, data = scale_bodyfat[train,],
      hidden = 1, algorithm = "rprop+")

The plot method can be used to visualize the network, as shown in Figure 33.1:

> plot(fit)

246

TECHNIQUE 33 RESILIENT BACKPROPAGATION WITH

[Figure omitted: plot of the fitted network showing the nine input nodes (waistcirc, hipcirc, age, elbowbreadth, kneebreadth, anthro3a, anthro3b, anthro3c, anthro4), the single hidden neuron and the DEXfat output with their estimated weights; Error: 17.11438, Steps: 2727.]

Figure 33.1: Neural network used for the bodyfat regression

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "rprop+")

1 repetition was calculated

          Error Reached Threshold Steps
1   17.11437513    0.00955254228  2727

247

92 Applied Predictive Modeling Techniques in R

A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 3)
                                 1
error                       17.114
reached.threshold            0.010
steps                     2727.000
Intercept.to.1layhid1       -0.403
waistcirc.to.1layhid1        0.127
hipcirc.to.1layhid1          0.246
age.to.1layhid1              0.028
elbowbreadth.to.1layhid1    -0.016
kneebreadth.to.1layhid1      0.050
anthro3a.to.1layhid1         0.001
anthro3b.to.1layhid1         0.085
anthro3c.to.1layhid1        -0.005
anthro4.to.1layhid1          0.131
Intercept.to.DEXfat        -29.711
1layhid.1.to.DEXfat          7.390

Step 4 Make Predictions
We predict using the scaled values: first remove the response variable DEXfat, then store the predictions in pred via the compute method:

> without_fat <- scale_bodyfat
> without_fat$DEXfat <- NULL

> pred <- compute(fit, without_fat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. Next the plot method is used to visualize the relationship between the predicted and observed values, see Figure 33.2. The abline method plots the linear regression line fitted by linReg. Finally, we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.864:

> linReg <- lm(pred$net.result ~ scale_bodyfat$DEXfat[-train])

> plot(scale_bodyfat$DEXfat[-train], pred$net.result,
      xlab = "DEXfat", ylab = "Predicted Values",
      main = "Training Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, scale_bodyfat$DEXfat[-train])^2, 6)
         [,1]
[1,] 0.864411

249

92 Applied Predictive Modeling Techniques in R

Figure 33.2: Resilient Backpropagation with Backtracking, observed and predicted values using bodyfat

250

Technique 34

Resilient Backpropagation

A neural network with resilient backpropagation (without weight backtracking) can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "rprop-", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "rprop-" to specify resilient backpropagation without backtracking.

Step 3 Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 245.

We fit a network with one hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train,],
      hidden = 1, algorithm = "rprop-")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "rprop-")

1 repetition was calculated

          Error Reached Threshold Steps
1   17.16273738   0.009965580594  3651

251

92 Applied Predictive Modeling Techniques in R

A nice feature of the neuralnet package is the ability to print a summary of the results matrix:

> round(fit$result.matrix, 2)
                                1
error                       17.16
reached.threshold            0.01
steps                     3651.00
Intercept.to.1layhid1       -0.37
waistcirc.to.1layhid1        0.13
hipcirc.to.1layhid1          0.26
age.to.1layhid1              0.03
elbowbreadth.to.1layhid1    -0.02
kneebreadth.to.1layhid1      0.05
anthro3a.to.1layhid1         0.00
anthro3b.to.1layhid1         0.09
anthro3c.to.1layhid1         0.00
anthro4.to.1layhid1          0.14
Intercept.to.DEXfat        -28.61
1layhid.1.to.DEXfat          6.97

Step 4 Make Predictions
We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg.

The plot method is used to visualize the relationship between the predicted and observed values, see Figure 34.1. The abline method plots the linear regression line fitted by linReg. Finally we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865:

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
      xlab = "DEXfat", ylab = "Predicted Values",
      main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
     [,1]
[1,] 0.865

Figure 34.1: Resilient Backpropagation without Backtracking, observed and predicted values using bodyfat

253

Technique 35

Smallest Learning Rate

A neural network using the smallest learning rate can be estimated using the package neuralnet with the neuralnet function:

neuralnet(y ~ ., data, hidden, algorithm = "slr", ...)

Key parameters include the response variable y; the covariates data; hidden, the number of hidden neurons; and algorithm = "slr", which specifies the use of a globally convergent algorithm with resilient backpropagation and the smallest learning rate.

Step 3 Estimate & Evaluate Model
Steps 1 and 2 are outlined beginning on page 245.

The network is built with 1 hidden neuron as follows:

> fit <- neuralnet(f, data = scale_bodyfat[train,],
      hidden = 1, algorithm = "slr")

The print method gives a nice overview of the model:

> print(fit)
Call: neuralnet(formula = f, data = scale_bodyfat[train, ],
    hidden = 1, algorithm = "slr")

1 repetition was calculated

          Error Reached Threshold Steps
1   17.14864115   0.009544470987 10272

Here is a summary of the fitted model weights:

> round(fit$result.matrix, 2)
                                 1
error                        17.15
reached.threshold             0.01
steps                     10272.00
Intercept.to.1layhid1        -0.38
waistcirc.to.1layhid1         0.13
hipcirc.to.1layhid1           0.26
age.to.1layhid1               0.03
elbowbreadth.to.1layhid1     -0.02
kneebreadth.to.1layhid1       0.05
anthro3a.to.1layhid1          0.00
anthro3b.to.1layhid1          0.09
anthro3c.to.1layhid1          0.00
anthro4.to.1layhid1           0.13
Intercept.to.DEXfat         -28.81
1layhid.1.to.DEXfat           7.08

Step 4 Make Predictions
We predict using the scaled values and the compute method:

> scale_bodyfat$DEXfat <- NULL
> pred <- compute(fit, scale_bodyfat[-train,])

Next we build a linear regression between the predicted and observed values in the test sample, storing the model in linReg. The plot method is used to visualize the relationship between the predicted and observed values, see Figure 35.1. The abline method plots the linear regression line fitted by linReg. Finally we call the cor method to calculate the squared correlation coefficient between the predicted and observed values; it returns a value of 0.865:

> linReg <- lm(pred$net.result ~ bodyfat$DEXfat[-train])

> plot(bodyfat$DEXfat[-train], pred$net.result,
      xlab = "DEXfat", ylab = "Predicted Values",
      main = "Test Sample Model Fit")

> abline(linReg, col = "darkred")

> round(cor(pred$net.result, bodyfat$DEXfat[-train])^2, 3)
     [,1]
[1,] 0.865

Figure 35.1: Resilient Backpropagation with smallest learning rate, observed and predicted values using bodyfat

256

Technique 36

General Regression Neural Network

A general regression neural network can be estimated using the package grnn with the learn function:

learn(y, x)

Key parameters include the response variable y and the explanatory variables contained in x.

Step 1 Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> require(grnn)
> data(bodyfat, package = "TH.data")

Step 2 Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:

> set.seed(465)
> n <- nrow(bodyfat)
> n_train <- 45
> n_test <- n - n_train
> train <- sample(1:n, n_train, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL

Step 3 Estimate & Evaluate Model
We use the learn method to fit the model and the smooth method to smooth the general regression neural network. We set sigma to 2.5 in the smooth function and store the resultant model in fit:

> fit_basic <- learn(data.frame(y[train], x[train,]))
> fit <- smooth(fit_basic, sigma = 2.5)

Step 4 Make Predictions
Let's take a look at the first row of the covariates in the testing set:

> round(x[-train,][1,], 2)

     age waistcirc hipcirc elbowbreadth
47  4.04      4.61    4.72         1.96
    kneebreadth anthro3a anthro3b anthro3c
47         2.24     1.49     1.60     1.50
    anthro4
47     1.81

To see the first observation of the response variable in the test sample you can use a similar technique:

> y[-train][1]
[1] 3.730021

Now we are ready to predict the first response value in the test set using the covariates. Here is how to do that:

> guess(fit, as.matrix(x[-train,][1,]))
[1] 3.374901

Of course we will want to predict using all the values in the test sample. We can achieve this using the guess method and the following lines of R code:

> pred <- 1:n_test

> for (i in 1:n_test)
    pred[i] <- guess(fit, as.matrix(x[-train,][i,]))

Finally, we plot in Figure 36.1 the observed and fitted values using the plot method, and on the same chart plot the regression line of the predicted versus observed values. The overall squared correlation is 0.915:

> plot(y[-train], pred, xlab = "log(DEXfat)",
      ylab = "Predicted Values", main = "Test Sample Model Fit")

> abline(linReg <- lm(pred ~ y[-train]), col = "darkred")

> round(cor(pred, y[-train])^2, 3)
[1] 0.915

259

92 Applied Predictive Modeling Techniques in R

Figure 36.1: General Regression Neural Network, observed and predicted values using bodyfat

260

Technique 37

Monotone Multi-Layer Perceptron

On occasion you will find yourself modeling variables for which you require monotonically increasing behavior of the response variable with respect to specified attributes. It is possible to train and make predictions from a multi-layer perceptron neural network with partial monotonicity constraints. The monmlp package implements the monotone multi-layer perceptron neural network using the approach of Zhang and Zhang.71 The model can be estimated using the monmlp.fit function:

monmlp.fit(x, y, hidden1, hidden2, monotone, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of neurons in the first and second hidden layers (hidden1 and hidden2); and the monotone constraints on the covariates.

Step 1 Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> library(monmlp)
> data(bodyfat, package = "TH.data")

Step 2 Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:

261

92 Applied Predictive Modeling Techniques in R

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL
> x <- as.matrix(x)

Step 3 Estimate & Evaluate Model
We build a 1 hidden layer network with 1 node, with monotone constraints on all 9 attributes contained in x (we set monotone = 1:9):

> fit <- monmlp.fit(x = x[train,], y = as.matrix(y[train]),
      hidden1 = 1, hidden2 = 0, monotone = 1:9)

NOTE

The method monmlp.fit takes a number of additional arguments. Three that you will frequently want to set manually include:

• n.trials - the number of repeated trials used to avoid local minima.

• n.ensemble - the number of ensemble members to fit.

• bag - a logical variable indicating whether or not bootstrap aggregation (bagging) should be used.

For example, in the model we are building we could have specified:

fit <- monmlp.fit(x = x[train,], y = as.matrix(y[train]),
      hidden1 = 1, hidden2 = 0, monotone = 1:9,
      n.trials = 100, n.ensemble = 50, bag = TRUE)

262

TECHNIQUE 37 MONOTONE MULTI-LAYER PERCEPTRON

We take a look at the fitted values using the plot method, see Figure 37.1:

> plot(attributes(fit)$y, attributes(fit)$y.pred,
      ylab = "Fitted Values", xlab = "Observed Values")

Since the fit looks particularly good, we had better calculate the correlation coefficient. At 0.97 it is pretty high:

> cor(attributes(fit)$y, attributes(fit)$y.pred)
         [,1]
[1,] 0.973589

Figure 37.1: Monotone Multi-Layer Perceptron, training observed and predicted values using bodyfat

263

92 Applied Predictive Modeling Techniques in R

Step 4 Make Predictions
We predict using the scaled variables and the test sample with the monmlp.predict method:

> pred <- monmlp.predict(x = x[-train,], weights = fit)

Next a linear regression model is used to fit the predicted and observed values for the test sample:

> linReg <- lm(pred ~ y[-train])

Now plot the result (see Figure 37.2) and calculate the squared correlation coefficient:

> plot(y[-train], pred, xlab = "DEXfat",
      ylab = "Predicted Values", main = "Test Sample Model Fit")
> abline(linReg, col = "darkred")

> round(cor(pred, y[-train])^2, 3)
     [,1]
[1,] 0.958

264

TECHNIQUE 37 MONOTONE MULTI-LAYER PERCEPTRON

Figure 37.2: Monotone Multi-Layer Perceptron, observed and predicted values using bodyfat

265

Technique 38

Quantile Regression Neural Network

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when lower, upper or all quantiles are of interest. The qrnn package combined with the qrnn.fit function can be used to estimate a quantile regression neural network:

qrnn.fit(x, y, n.hidden, tau, ...)

Key parameters include the response variable y; the explanatory variables contained in x; the number of hidden neurons n.hidden; and the quantiles to be fitted, held in tau.

Step 1 Load Required Packages
The neural network is built using the data frame bodyfat contained in the TH.data package. For additional details on this data set see page 62.

> data(bodyfat, package = "TH.data")
> library(qrnn)
> library(scatterplot3d)

Step 2 Prepare Data & Tweak Parameters
We use 45 out of the 71 observations in the training sample, with the remainder saved for testing. The variable train selects the random sample without replacement from the 71 observations:

266

TECHNIQUE 38 QUANTILE REGRESSION NEURAL NETWORK

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Next we take the log of the response variable and store it in y. We do the same for the attribute variables, storing them as a matrix in x:

> y <- log(bodyfat$DEXfat)
> x <- log(bodyfat)
> x$DEXfat <- NULL
> x <- as.matrix(x)

We will estimate the model using the 5th, 50th and 95th quantiles:

> probs <- c(0.05, 0.50, 0.95)

Step 3 Estimate & Evaluate Model
We estimate the model using the following:

> fit <- pred <- list()

> for (i in seq_along(probs)) {
    fit[[i]] <- qrnn.fit(x = x[train,],
                         y = as.matrix(y[train]),
                         n.hidden = 4,
                         tau = probs[i],
                         iter.max = 1000,
                         n.trials = 1,
                         n.ensemble = 20)
  }

Let's go through it line by line. The first line sets up, as lists, fit (which will hold the fitted models) and pred (which will hold the predictions). Because we have three quantiles held in probs to estimate, we use a for loop as the core work engine. The function qrnn.fit is used to fit the model for each quantile.

A few other things are worth pointing out here. First, we build the model with four hidden neurons (n.hidden = 4). Second, we set the maximum number of iterations of the optimization algorithm at 1000. Third, we set n.trials = 1; this parameter controls the number of repeated trials used to avoid local minima. We set it to 1 for illustration; in practice you should set this to a much higher value. The same is true for the parameter n.ensemble, which we set to 20. It controls the number of ensemble members to fit. You will generally want to set this to a high value also.
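As a hedged illustration (the values below and the object name fit_prod are assumptions, not taken from the text), a more production-style call might raise both settings, for example:

> # Illustrative only: more repeated trials and a larger ensemble
> fit_prod <- qrnn.fit(x = x[train,], y = as.matrix(y[train]),
      n.hidden = 4, tau = 0.5, iter.max = 1000,
      n.trials = 5, n.ensemble = 50)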

Step 4 Make Predictions
For each quantile the qrnn.predict method is used alongside the fitted model. Again we choose to use a for loop to capture each quantile's predicted values:

> for (i in seq_along(probs))
    pred[[i]] <- qrnn.predict(x[-train,], fit[[i]])

Next we create three variables to hold the predicted values. Note there are 26 sample values in total:

> pred_high <- pred_low <- pred_mean <- 1:26

Now we are ready to store the predictions. Since we use 20 ensembles for each quantile, we store the average prediction using the mean method:

> for (i in 1:26) {
    pred_low[i]  <- mean(pred[[1]][i,])
    pred_mean[i] <- mean(pred[[2]][i,])
    pred_high[i] <- mean(pred[[3]][i,])
  }
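A simple hedged sanity check (not part of the original text) is to ask how many test observations fall between the 5th and 95th quantile predictions; for a well calibrated model this proportion should be roughly 0.90:

> # share of test observations inside the predicted 5%-95% band
> mean(y[-train] >= pred_low & y[-train] <= pred_high)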

Now we fit a linear regression model using the 50th percentile values held in pred_mean and the actual observed values. The result is visualized using the plot method and shown in Figure 38.1. The squared correlation coefficient is then calculated; it is fairly high at 0.885:

> linReg2 <- lm(pred_mean ~ y[-train])

> plot(y[-train], pred_mean, xlab = "log(DEXfat)",
      ylab = "Predicted Values", main = "Test Sample Model Fit")

> abline(linReg2, col = "darkred")

> round(cor(pred_mean, y[-train])^2, 3)
[1] 0.885

268

TECHNIQUE 38 QUANTILE REGRESSION NEURAL NETWORK

Figure 38.1: Quantile Regression Neural Network, observed and predicted values (pred_mean) using bodyfat

It is often useful to visualize the results of a model. In Figure 38.2 we use scatterplot3d to plot the relationship between pred_low, pred_high and the observed values in the test sample:

> scatterplot3d(pred_low, y[-train], pred_high,
      xlab = "quantile = 0.05",
      ylab = "Test Sample",
      zlab = "quantile = 0.95")

269

92 Applied Predictive Modeling Techniques in R

Figure 38.2: Quantile Regression Neural Network, scatterplot3d of the 5th and 95th predicted quantiles

270

NOTES

Notes

62. Tayfur, Gokmen. "Artificial neural networks for sheet sediment transport." Hydrological Sciences Journal 47.6 (2002): 879-892.

63. Kilinc, Mustafa. "Mechanics of soil erosion from overland flow generated by simulated rainfall." Colorado State University Hydrology Papers (1973).

64. Mantri, Jibendu Kumar, P. Gahan and Braja B. Nayak. "Artificial neural networks - An application to stock market volatility." Soft-Computing in Capital Market: Research and Methods of Computational Finance for Measuring Risk of Financial Instruments (2014): 179.

65. For further details see H.R. Champion et al. "Improved Predictions from a Severity Characterization of Trauma (ASCOT) over Trauma and Injury Severity Score (TRISS): Results of an Independent Evaluation." Journal of Trauma: Injury, Infection and Critical Care 40(1), 1996.

66. Hunter, Andrew, et al. "Application of neural networks and sensitivity analysis to improved prediction of trauma survival." Computer Methods and Programs in Biomedicine 62.1 (2000): 11-19.

67. Lek, Sovan, et al. "Application of neural networks to modelling nonlinear relationships in ecology." Ecological Modelling 90.1 (1996): 39-52.

68. See for example Jun, J.J., Longtin, A. and Maler, L. (2011). "Precision measurement of electric organ discharge timing from freely moving weakly electric fish." J. Neurophysiol. 107, pp. 1996-2007.

69. Kiar, Greg, et al. "Electrical localization of weakly electric fish using neural networks." Journal of Physics: Conference Series, Vol. 434, No. 1. IOP Publishing, 2013.

70. Wu, Naicheng, et al. "Modeling daily chlorophyll a dynamics in a German lowland river using artificial neural networks and multiple linear regression approaches." Limnology 15.1 (2014): 47-56.

71. See Zhang, H. and Zhang, Z. (1999). "Feedforward networks with monotone constraints." In: International Joint Conference on Neural Networks, vol. 3, pp. 1820-1823. doi: 10.1109/IJCNN.1999.832655

271

Part V

Random Forests

272

The Basic Idea

Suppose you have a sample of size N with M features. Random forests (RF) build multiple decision trees, each grown from a different set of training data. For each tree, K training samples are randomly selected with replacement from the original training data set.

In addition to constructing each tree using a different random sample (bootstrap) of the data, the RF algorithm differs from a traditional decision tree because at each decision node the best splitting feature is determined from a randomly selected subspace of m features, where m is much smaller than the total number of features M. Traditional decision trees split each node using the best split among all features.

Each tree in the forest is grown to the largest extent possible without pruning. To classify a new object, each tree in the forest gives a classification, which is interpreted as the tree 'voting' for that class. The final prediction is determined by majority vote among the classes decided by the forest of trees.
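In the randomForest package used later in this Part, the subspace size m corresponds to the mtry argument. As a hedged illustration, its default is floor(sqrt(M)) for classification and floor(M/3) for regression:

> M <- 18          # e.g. the 18 attributes in the Vehicle data
> floor(sqrt(M))   # default mtry for a classification forest
[1] 4
> floor(M/3)       # default mtry for a regression forest
[1] 6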

NOTE

Although each tree in a RF tends to be less accurate than a classical decision tree, by combining multiple predictions into one aggregate prediction a more accurate forecast is often obtained. Part of the reason is that the prediction of a single decision tree tends to be highly sensitive to noise in its training set. This is not the case for the average of many trees, provided they are uncorrelated. For this reason the RF often decreases overall variance without increasing bias relative to a single decision tree.
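A minimal hedged simulation (not from the text) illustrates the point: averaging B uncorrelated predictions, each with variance sigma^2, gives an aggregate whose variance is close to sigma^2 / B:

> set.seed(1)
> sigma <- 1; B <- 100
> # each replicate averages B uncorrelated 'tree' predictions
> agg <- replicate(5000, mean(rnorm(B, mean = 0, sd = sigma)))
> var(agg)   # close to sigma^2 / B = 0.01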

274

Random Ferns
Random ferns are an ensemble of constrained trees originally used in image processing for identifying textured patches surrounding key-points of interest across images taken from different viewpoints. This machine learning method gained widespread attention in the image processing community because it is extremely fast, easy to implement and appeared to outperform random forests in image processing classification tasks, provided the training set was sufficiently large.72

While a tree applies a different decision function at each node, a fern systematically applies the same decision function to each node of the same level. Each fern consists of a small set of binary tests and returns the probability that an object belongs to any one of the classes that have been learned during training. A naive Bayesian approach is used to combine the collection of probabilities. Ferns differ from decision trees in three ways:

• In decision trees the binary tests are organized hierarchically (hence the term tree); ferns are flat. Each fern consists of a small set of binary tests and returns the probability that an observation belongs to any one of the classes that have been learned during training.

• Ferns are more compact than trees. In fact 2^N - 1 operations are needed to grow a tree of 2^N leaves; only N operations are needed to grow a fern of 2^N leaves.

• For decision trees the posterior distributions are computed additively; in ferns they are multiplicative.
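To make the size comparison concrete, the short sketch below (illustrative, not from the text) tabulates the number of decision functions needed to reach 2^N leaves under each structure:

> N <- 1:10
> # a binary tree with 2^N leaves has 2^N - 1 internal tests; a fern needs N
> data.frame(depth = N, tree.tests = 2^N - 1, fern.tests = N)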

275

Practical Applications

Tropical Forest Carbon Mapping
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus73). Parametric statistics have traditionally dominated the discipline74. Remote sensing technologies such as Light Detection and Ranging (LiDAR) have been used successfully to estimate spatial variation in carbon stocks75. However, due to cost and technological limitations, it is not yet feasible to use it to map the entire world's tropical forests.

Mascaro et al76 evaluated the performance of the Random Forest algorithm in upscaling airborne LiDAR based carbon estimates, compared to the traditionally accepted stratification-based sampling approach, over a 16-million hectare focal area of the Western Amazon.

Two separate Random Forest models were investigated, each using an identical set of 80,000 randomly selected input pixels. The first Random Forest model used the same set of input variables as the stratification approach. The second Random Forest used additional 'position' parameters: x and y coordinates combined with two diagonal coordinates, see Table 19.

The resultant maps of ecosystem carbon stock are shown in Figure 38.3. Notice that several regions exhibit pronounced differences when using the Random Forest with the additional four 'position' parameters (image B).

The stratification method had a root mean square error (RMSE) of 33.2 Mg C ha-1 and an adjusted R2 of 0.37 (predicted versus observed). The first Random Forest model had an RMSE of 31.6 Mg C ha-1 and an adjusted R2 of 0.43. The Random Forest model with the additional four 'position' parameters had an RMSE of 26.7 Mg C ha-1 and an adjusted R2 of 0.59. The researchers conclude there is an improvement when using Random Forest with position information, which is consistent at all distances (see Figure 38.4).

276

Variable    Explanation                                        S   RF1  RF2
easting     UTM X coordinate                                   -   -    x
northing    UTM Y coordinate                                   -   -    x
diagx       X coordinate after 45 degree clockwise             -   -    x
            image rotation
diagy       Y coordinate after 45 degree clockwise             -   -    x
            image rotation
frac_soil   Percent cover of soil, as determined by Landsat    -   -    x
            image processing with CLASlite
frac_pv     Percent cover of photosynthetic vegetation, as     x   x    x
            determined by Landsat image processing with
            CLASlite
frac_npv    Percent cover of non-photosynthetic vegetation,    x   x    x
            as determined by Landsat image processing with
            CLASlite
elevation   SRTM elevation above sea level                     x   x    x
slope       SRTM slope                                         x   x    x
aspect      SRTM aspect                                        x   x    x
geoeco      Habitat class, as determined by synthetic          x   x    x
            integration of national geological map,
            NatureServe and other sources

Table 19: Input variables used by Mascaro et al. S = Stratification-based sampling; RF = Random Forest.

277

92 Applied Predictive Modeling Techniques in R

Figure 38.3: Mascaro et al.'s predicted carbon stocks using three different methodologies: (a) stratification and mapping of median carbon stocks in each class; (b) Random Forest without the inclusion of position information; (c) Random Forest using additional model inputs for position. Source of figure: Mascaro et al.

278

Figure 38.4: Mascaro et al.'s performance of three modeling techniques as assessed in 36 validation cells. Left panels highlight model performance against LiDAR-observed above ground carbon density from CAO aircraft data (Mg C ha-1), while right panels highlight model performance by increasing distance from CAO aircraft data. The color scale reflects the two-dimensional density of observations, adjusted to one dimension using a square root transformation. Source of figure: Mascaro et al.

279

92 Applied Predictive Modeling Techniques in R

Opioid Dependency & Sleep Disordered Breathing
Patients using chronic opioids are at elevated risk for potentially lethal disorders of breathing during sleep. Farney et al77 investigate sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone.

A total of 70 patients admitted for therapy with buprenorphine/naloxone were recruited into the study. Indices were created for apnoea/hypopnoea (AHI), obstructive apnoea (OAI), central apnoea (CAI) and hypopnoea (HI).

Each index was computed as the total of defined respiratory events divided by the total sleep time in hours, scored simultaneously by two onlooking researchers. Oximetry data were analyzed to calculate mean SpO2, lowest SpO2 and time spent below 90% SpO2 during sleep.

For each of these response metrics a Random Forest model was built using the following attributes: buprenorphine dose, snoring, tiredness, witnessed apnoeas, hypertension, body mass index, age, neck circumference, gender, use of benzodiazepines, antidepressants, antipsychotics, and smoking history. The researchers find that the random forests showed little relationship between the explanatory variables and the response variables.

Waste-water Deterioration Modeling
Vitorino et al78 build a Random Forest model to predict the condition of individual sewers and to determine the expected length of sewers in poor condition. The data consisted of two sources: a sewer data table and an inspection data table.

The sewer table contained information on the sewer identification code, zone, construction material, diameter, installation date and a selection of user defined covariates. The inspection data table contained information on the sewer identification code, date of last inspection and condition of the sewer at the date of last inspection.

The Random Forest was trained using all available data and limited to 50 trees. It was then used to predict the condition of individual sewer pipes. The researchers predict the distribution of sewer pipe in poor condition by type of material used for construction. The top three highest ranked materials were CIP (31.83%), followed by unknown material (27.94%) and RPM (23.89%).

280

Optical Coherence Tomography & Glaucoma
Glaucoma is the second most common cause of blindness. As glaucomatous visual field (VF) damage is irreversible, the early diagnosis of glaucoma is essential. Sugimoto et al79 develop a random forest classifier to predict the presence of VF deterioration in glaucoma suspects using optical coherence tomography (OCT) data.

The study investigated 293 eyes of 179 live patients referred to the University of Tokyo Hospital for glaucoma between August 2010 and July 2012. The Random Forest algorithm with 10,000 trees was used to classify the presence or absence of glaucomatous VF damage using 237 different OCT measurements. Age, gender, axial length and eight other right/left eye metrics were also included as training attributes.

The researchers report a receiver operating characteristic curve value of 0.90 for the random forest. This compared well to the value of 0.75 for an individual tree, and to between 0.77 and 0.86 for individual right/left eye metrics.

Obesity Risk Factors
Kanerva et al80 investigate 4,720 Finnish subjects who completed health questionnaires about leisure time physical activity, smoking status and educational attainment. Weight and height were measured by experienced nurses. A random forest algorithm, using the randomForest package in R and the collected indicator variables, was used to predict overweight or obesity. The results were compared with a logistic regression model.

The researchers observe that the random forest and logistic regression had very similar classification power; for example, the estimated error rates for the models were nearly equal - 42% for men (RF) versus 43% for men (logistic regression). The researchers tentatively conclude: "Machine learning techniques may provide a solution to investigate the network between exposures and eventually develop decision rules for obesity risk estimation in clinical work."

Protein Interactions
Chen et al81 develop a novel random forest model to predict protein-protein interactions. They choose a traditional classification framework with two classes: either a protein pair interacts with each other or it does not. Each protein pair is characterized by a very large attribute space, in total 4,293 unique attributes.

The training and test set each contained 8,917 samples: 4,917 positive and 4,000 negative samples. Five-fold cross validation is used and the maximum size of a tree is 450 levels. The researchers report an overall sensitivity of 79.78% and specificity of 64.38%; this compares well to an alternative maximum likelihood technique, which had a sensitivity of 74.03% and specificity of 37.53%.

282

Technique 39

Classification Random Forest

A classification random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance = TRUE, proximity = TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which if set to TRUE assesses the importance of predictors; proximity, another logical variable which if set to TRUE calculates a proximity measure among the rows; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages
We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package:

> library(randomForest)
> library(mlbench)
> data(Vehicle)

Step 2 Prepare Data & Tweak Parameters
The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1000. A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)

283

92 Applied Predictive Modeling Techniques in R

> N = nrow(Vehicle)
> numtrees = 1000
> train <- sample(1:N, 500, FALSE)

The data frame attributes is used to hold the complete set of attributes contained in Vehicle:

> attributes <- Vehicle[train,]
> attributes$Class <- NULL

Step 3 Estimate the Random Forest
We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000, and we set both importance and proximity equal to TRUE:

> fit <- randomForest(Class ~ ., data = Vehicle[train,],
      ntree = numtrees, importance = TRUE, proximity = TRUE)

The print function returns details of the random forest:

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 27.2%
Confusion matrix:
     bus opel saab van class.error
bus  121    1    1   1  0.02419355
opel   1   60   60   6  0.52755906
saab   5   49   66   9  0.48837209
van    1    0    2 117  0.02500000

Important information includes the formula used to fit the model, the number of trees grown at each iteration, the number of variables tried at each split, the confusion matrix, class errors and the out of the bag error estimate.

284

TECHNIQUE 39 CLASSIFICATION RANDOM FOREST

The random forest appears to identify both van and bus with a low error (less than 3%), but saab (49%) appears to be essentially a random guess, whilst opel (53%) is only marginally better than random.

Step 4 Assess Model
A visualization of the error rate by iteration for each of bus, opel, saab, van and the overall out of the bag error is obtained using plot, see Figure 39.1:

> plot(fit)

Figure 39.1: Error estimate across iterations for each class and overall, using Vehicle

285

92 Applied Predictive Modeling Techniques in R

The misclassification errors on van and bus, as well as the out of the bag error, settle down to a stable range within 200 trees. For opel stability occurs around 400 trees, whilst for saab around 800 trees.

The rfcv function is used to perform a 10-fold cross validation (cv.fold = 10) with variable importance re-assessed at each step of the reduction (recursive = TRUE). The result is stored in cv:

> cv <- rfcv(trainx = attributes, trainy = Vehicle$Class[train],
      ntree = numtrees, cv.fold = 10, recursive = TRUE)

PRACTITIONER TIP

The function rfcv calculates cross-validated prediction performance with a sequentially reduced number of attributes, ranked by variable importance, using a nested cross-validation procedure. The minimum value of the cross-validated error can be used to select the optimum number of attributes (known also as predictors) in a random forest model.
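For example, a hedged one-liner to pull that optimum out of the cv object created above (rfcv returns the vectors n.var and error.cv):

> # number of predictors with the smallest cross-validated error
> cv$n.var[which.min(cv$error.cv)]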

A plot of cross validated error versus number of predictors can be produced as follows:

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
      xlab = "number of predictors",
      ylab = "cross-validated error (%)"))

Figure 39.2 shows the resultant plot. It indicates a steady decline in the cross validated error for the first nine attributes, after which it levels off.

286

TECHNIQUE 39 CLASSIFICATION RANDOM FOREST

Figure 39.2: Cross-validated prediction performance by number of attributes for Vehicle

Variable importance, shown in Figure 39.3, is obtained using the function varImpPlot:

> varImpPlot(fit)

287

92 Applied Predictive Modeling Techniques in R

Figure 39.3: Random forest variable importance plot for Vehicle

Max.L.Ra and D.Circ seem to be important whether measured by average decrease in accuracy or by reduction in node impurity (Gini).

Partial dependence plots provide a way to visualize the marginal relationship between the response variable and the covariates (attributes). Let's use this idea, combined with the importance function, to investigate the top four attributes:

> imp <- importance(fit)

> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

288

TECHNIQUE 39 CLASSIFICATION RANDOM FOREST

> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) {
    partialPlot(fit, Vehicle[train,], impvar[i],
        xlab = impvar[i],
        main = paste("Partial Dependence on", impvar[i]))
  }

> par(op)

Figure 39.4 shows the partial dependence plots for the top four attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra.

PRACTITIONER TIP

The importance function takes a type argument. Set type = 1 if you only want to see importance by average decrease in accuracy, and set type = 2 to see only importance by average decrease in node impurity, measured by the Gini coefficient.
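For example (a hedged illustration using the forest fit estimated above):

> imp_acc  <- importance(fit, type = 1)  # mean decrease in accuracy
> imp_gini <- importance(fit, type = 2)  # mean decrease in node impurity (Gini)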

289

92 Applied Predictive Modeling Techniques in R

Figure 39.4: Random forest partial dependence plots for the top four attributes, including Max.L.Ra, D.Circ and Pr.Axis.Ra

It is often helpful to look at the margin. This is measured as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. A positive margin implies correct classification, and a negative margin incorrect classification:

> plot(margin(fit))

The margin for fit is shown in Figure 39.5.
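A hedged follow-up (not in the original text): because a negative margin means the forest's majority vote was wrong, the share of negative margins approximates the misclassification rate on the training data:

> mean(margin(fit) < 0)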

290

TECHNIQUE 39 CLASSIFICATION RANDOM FOREST

Figure 39.5: Random forest margin plot for Vehicle

Step 5 Make Predictions
Next we use the fitted model on the test sample:

> fit <- randomForest(Class ~ ., data = Vehicle[-train,],
      ntree = numtrees, importance = TRUE, proximity = TRUE)

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 4

        OOB estimate of error rate: 26.59%
Confusion matrix:
     bus opel saab van class.error
bus   92    0    1   1  0.02127660
opel   4   35   37   9  0.58823529
saab   5   25   50   8  0.43181818
van    1    1    0  77  0.02531646

We notice a deterioration in the classification error for opel over the test set, although saab has improved somewhat. The misclassification rates for van and bus are in line with those observed in the training sample. The overall out of the bag estimate is around 27%.

An important aspect of building random forest models is tuning the model. We can attempt to do this by finding the optimal value of mtry and then fitting that model on the test data. The parameter mtry controls the number of variables randomly sampled as candidates at each split. The tuneRF function returns the optimal value of mtry. The smallest out of the bag error occurs at mtry = 3; this is confirmed by Figure 39.6:

> best_mytry <- tuneRF(attributes, Vehicle$Class[train],
      ntreeTry = numtrees, stepFactor = 1.5, improve = 0.01,
      trace = TRUE, plot = TRUE, doBest = FALSE)

mtry = 4  OOB error = 27%
Searching left ...
mtry = 3  OOB error = 26.6%
0.01481481 0.01
mtry = 2  OOB error = 26.8%
-0.007518797 0.01
Searching right ...
mtry = 6  OOB error = 26.6%

292

TECHNIQUE 39 CLASSIFICATION RANDOM FOREST

Figure 39.6: Random forest OOB error by mtry for Vehicle

Finally we fit a random forest model on the test data using mtry = 3:

> fitbest <- randomForest(Class ~ ., data = Vehicle[-train,],
      ntree = numtrees, importance = TRUE, proximity = TRUE,
      mtry = 3)

> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Vehicle[-train, ],
     ntree = numtrees, importance = TRUE, proximity = TRUE,
     mtry = 3)

293

92 Applied Predictive Modeling Techniques in R

               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of error rate: 25.14%
Confusion matrix:
     bus opel saab van class.error
bus   93    0    0   1  0.01063830
opel   3   39   36   7  0.54117647
saab   4   26   50   8  0.43181818
van    1    1    0  77  0.02531646

The model fitbest has a slightly lower out of the bag error at 25.14%, although, relative to fit, the opel class remains difficult to classify.

294

Technique 40

Conditional Inference Classification Random Forest

A conditional inference classification random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which sets parameters such as the number of trees grown in each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages
We build the classification random forest using the Vehicle data frame (see page 23) contained in the mlbench package. We also load the caret package, as it provides handy performance metrics for random forests:

> library(party)
> library(caret)
> library(mlbench)
> data(Vehicle)

295

92 Applied Predictive Modeling Techniques in R

PRACTITIONER TIP

If you have had your R session open for a lengthy period of time you may have objects hogging memory. To see what objects are being held in memory type:

> ls()

To remove all objects from memory use:

> rm(list = ls())

To remove a specific object from memory use:

> rm(object_name)

Step 2 Prepare Data & Tweak Parameters
A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

Step 3 Estimate and Assess the Random Forest
We estimate the random forest using the training sample and the function cforest. The number of trees is set to 1000 (ntree = 1000), with mtry set to 5:

> fit <- cforest(Class ~ ., data = Vehicle[train,],
      controls = cforest_unbiased(ntree = 1000, mtry = 5))

The accuracy of the model and the kappa statistic can be retrieved using the caret package:

> caret:::cforestStats(fit)
Accuracy    Kappa
 0.86600  0.82146

Let's do a little memory management and remove fit:

296

TECHNIQUE 40 CONDITIONAL INFERENCE

> rm(fit)

Let's re-estimate the random forest using the default settings in cforest:

> fit <- cforest(Class ~ ., data = Vehicle[train,])
> caret:::cforestStats(fit)
 Accuracy     Kappa
0.8700000 0.8267831

Since both the accuracy and kappa are slightly higher, we will use this as the fitted model.

PRACTITIONER TIP

Remember that all simulation based models, including the random forest, are subject to random variation. This variation can be important when investigating variable importance. It is advisable, before interpreting a specific importance ranking, to check whether the same ranking is achieved with a different model run (random seed) or a different number of trees.

Variable importance is calculated using the function varimp. First we estimate variable importance using the default approach, which calculates the mean decrease in accuracy using the method outlined in Hapfelmeier et al82:

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:18])]

> round(imp1[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.094    0.069    0.057

> rm(ord1, imp1)

Next we try the unconditional approach using the 'mean decrease in accuracy' outlined by Breiman83. Interestingly, the attributes Elong, Max.L.Ra and Scat.Ra are rank ordered exactly the same by both methods, and they have very similar scores:

> ord2 <- varimp(fit, pre1.0_0 = TRUE)
> imp2 <- ord2[order(-ord2[1:18])]

> round(imp2[1:3], 3)
   Elong Max.L.Ra  Scat.Ra
   0.100    0.083    0.057

> rm(ord2, imp2)

PRACTITIONER TIP

In addition to the two unconditional measures of variable importance already discussed, a conditional version is also available. It is conditional in the sense that it adjusts for correlations between predictor variables. For the random forest estimated by fit, conditional variable importance can be obtained by:

> varimp(fit, conditional = TRUE)

Step 5 Make Predictions
First we estimate accuracy and kappa using the fitted model and the test sample:

> fit_test <- cforest(Class ~ ., data = Vehicle[-train,])
> caret:::cforestStats(fit_test)
 Accuracy     Kappa
0.8179191 0.7567866

Both accuracy and kappa values are close to the values estimated on the training sample. This is encouraging in terms of model fit.

The predict function calculates predicted values over the testing sample. These values, along with the original observations, are used to construct the test sample confusion matrix and overall misclassification error. We observe an error rate of 28.6%:

> pred <- predict(fit, newdata = Vehicle[-train,],
      type = "response")

> table(Vehicle$Class[-train], pred,
      dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus   91    0    1   2
          opel   8   30   33  14
          saab   6   23   51   8
          van    2    0    2  75

> error_rate = (1 - sum(pred == Vehicle$Class[-train]) / 346)

> round(error_rate, 3)
[1] 0.286

299

Technique 41

Classification Random Ferns

A classification random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, ...)

Key parameters include ferns, which controls the number of ferns grown at each iteration; z, the data frame of classes; and data, the data set of attributes with which you wish to build the forest.

Step 1 Load Required Packages
We construct a classification random ferns model using the Vehicle data frame (see page 23) contained in the mlbench package:

> library(rFerns)
> library(mlbench)
> data(Vehicle)

Step 2 Prepare Data & Tweak Parameters
A total of 500 of the 846 observations in Vehicle are used to create a randomly selected training sample:

> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)

300

TECHNIQUE 41 CLASSIFICATION RANDOM FERNS

Step 3 Estimate and Assess the Model

PRACTITIONER TIP

To print the out of bag error as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns the error every 50th iteration. Your estimated model would look something like this:

> fit <- rFerns(Class ~ ., data = Vehicle[train,],
      ferns = 300, reportErrorEvery = 50)

 Done fern 50/300; current OOB error 0.39200
 Done fern 100/300; current OOB error 0.37800
 Done fern 150/300; current OOB error 0.35000
 Done fern 200/300; current OOB error 0.36000
 Done fern 250/300; current OOB error 0.35400
 Done fern 300/300; current OOB error 0.34800

The number of ferns is set to 5000, with the parameter saveErrorPropagation = TRUE to save the out of bag errors at each iteration:

> fit <- rFerns(Class ~ ., data = Vehicle[train,],
      ferns = 5000, saveErrorPropagation = TRUE)

The details of the fitted model can be viewed using the print method. The method returns the number of ferns in the forest, the fern depth, the out of bag error estimate and the confusion matrix:

> print(fit)

 Forest of 5000 ferns of a depth 5.

 OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   109    2    7   0
     opel    7   65   44   0
     saab    0   32   40   0
     van     8   28   38 120

It is often useful to view the out of bag error by fern size; this can be achieved using the plot method and modelname$oobErr:

> plot(fit$oobErr, xlab = "Number of Ferns",
      ylab = "OOB Error", type = "l")

Figure 41.1 shows the resultant chart. Looking closely at the figure, it appears the error is minimized at approximately 30%, somewhere between 2000 and 3000 ferns. To get the exact number use the which.min method:

> which.min(fit$oobErr)
[1] 2487
> fit$oobErr[which.min(fit$oobErr)]
[1] 0.324

So the error reaches a minimum of 32.4% at 2487 ferns.

302

TECHNIQUE 41 CLASSIFICATION RANDOM FERNS

Figure 41.1: Classification Random Ferns out of bag error estimate by iteration for Vehicle

PRACTITIONER TIP

To determine the time taken to run the model you would use $timeTaken as follows:

> fit$timeTaken
Time difference of 0.314466 secs

Other useful components of the rFerns object are given in Table 20.

303

92 Applied Predictive Modeling Techniques in R

Parameter     Description
$model        The estimated model
$oobErr       Out of bag error
$importance   Importance scores
$oobScores    Class scores for each object
$oobPreds     Class predictions for each object
$timeTaken    Time used to train the model

Table 20: Some key components of a rFerns object
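The components in Table 20 can be inspected directly. As a minimal sketch (not from the book, and assuming $oobPreds is aligned with the training rows), the out of bag class predictions can be compared with the observed classes to reproduce the OOB error reported by print(fit):

> oob_error <- mean(fit$oobPreds != Vehicle$Class[train])  # fraction of OOB misclassifications
> round(oob_error, 3)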

Let's re-estimate the model with the number of ferns equal to 2,487.

> fit <- rFerns(Class ~ ., data = Vehicle[train,], ferns = 2487, importance = TRUE, saveForest = TRUE)

> print(fit)

Forest of 2487 ferns of a depth 5

OOB error 33.20%; OOB confusion matrix:
          True
Predicted  bus opel saab van
     bus   107    1    6   0
     opel    8   65   43   0
     saab    0   33   42   0
     van    9   28   38 120

Alas, we ended up with the same out of bag error as our original model. Nevertheless, we will keep this version of the model.

The influence of variables can be assessed using $importance. The six most important variables and their standard deviations are as follows:

> imp <- fit$importance[order(-fit$importance[,1]),]

> round(head(imp), 3)
             MeanScoreLoss SdScoreLoss
Max.L.Ra             0.353       0.008
Elong                0.219       0.006
Sc.Var.Maxis         0.205       0.005
Sc.Var.maxis         0.201       0.005
Pr.Axis.Rect         0.196       0.005
D.Circ               0.196       0.004

Step 4: Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in predClass and combined using the table method to create the confusion matrix.

> predClass <- predict(fit, Vehicle[-train,])

> table(predClass, Vehicle$Class[-train], dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class bus opel saab van
           bus   85    5    4   0
           opel   4   38   36   0
           saab   0   13   28   0
           van    5   29   20  79

Finally, we calculate the misclassification error rate. At 33.5% it is fairly close to the error estimate for the validation sample.

> error_rate = (1 - sum(Vehicle$Class[-train] == predClass)/346)
> round(error_rate, 3)
[1] 0.335


Technique 42

Binary Response Random Forest

On many occasions the response variable is binary. In this chapter we show how to build a random forest model in this situation. A binary response random forest can be built using the package randomForest with the randomForest function:

randomForest(z ~ ., data, ntree, importance = TRUE, proximity = TRUE, ...)

Key parameters include ntree, which controls the number of trees to grow in each iteration; importance, a logical variable which, if set to TRUE, assesses the importance of predictors; proximity, another logical variable which, if set to TRUE, calculates a proximity measure among the rows; z, the binary response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the binary response random forest using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded; we will use this package during the final stage of our analysis.

> library(randomForest)
> library(mlbench)
> data(Sonar)
> require(ROCR)


Step 2: Prepare Data & Tweak Parameters

The variable numtrees is used to contain the number of trees grown at each iteration. We set it to 1,000. A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample.

> set.seed(107)
> N = nrow(Sonar)
> numtrees = 1000
> train <- sample(1:N, 157, FALSE)

The data frame attributes holds the complete set of attributes contained in Sonar.

> attributes <- Sonar[train,]
> attributes$Class <- NULL

Step 3: Estimate the Random Forest

We estimate the random forest using the training sample and the function randomForest. The parameter ntree = 1000 and we set importance equal to TRUE.

> fit <- randomForest(Class ~ ., data = Sonar[train,], ntree = numtrees, importance = TRUE)

The print function returns details of the random forest.

> print(fit)

Call:
 randomForest(formula = Class ~ ., data = Sonar[train,], ntree = numtrees, importance = TRUE)

               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 7

        OOB estimate of error rate: 18.47%
Confusion matrix:
   M  R class.error
M 70 12   0.1463415
R 17 58   0.2266667


The classification error on M is less than 20%, whilst for R it is around 23%. The out of the bag estimate of error is also less than 20%.

Step 4: Assess Model

A visualization of the error rate by iteration for each of M and R, with the overall out of the bag error, is obtained using plot; see Figure 42.1.

> plot(fit)

Figure 42.1: Random Forest error estimate across iterations for M, R and overall using Sonar

The misclassification errors settle down to a stable range by 800 trees.
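As a quick numerical check (a sketch, not part of the original text), the error traces drawn by plot(fit) are stored in the err.rate component of the randomForest object, so the stabilization can also be confirmed by inspecting the last few rows directly:

> round(tail(fit$err.rate, 3), 3)   # OOB, M and R error rates for the final trees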


Next the variable importance scores are calculated and the partial dependence plot (see Figure 42.2) is drawn.

> imp <- importance(fit)
> impvar <- rownames(imp)[order(imp[,1], decreasing = TRUE)]
> op <- par(mfrow = c(2, 2))

> for (i in seq_along(1:4)) {
    partialPlot(fit, Sonar[train,], impvar[i], xlab = impvar[i],
                main = paste("Partial Dependence on", impvar[i]))
  }

> par(op)


Figure 42.2: Binary random forest partial dependence plot using Sonar

Figure 42.2 seems to indicate that V11, V12, V9 and V52 have similar shapes in terms of partial dependence. Looking closely at the diagram you will notice that all except V52 appear to be responding on a similar scale. Given the similar pattern, we calculate a number of combinations of these three variables and add them to our group of attributes.

> temp <- cbind(Sonar$V11, Sonar$V12, Sonar$V9)

> Vmax  <- apply(temp, 1, function(x) max(x))
> Vmin  <- apply(temp, 1, function(x) min(x))
> Vmean <- apply(temp, 1, function(x) mean(x))
> Vmed  <- apply(temp, 1, function(x) median(x))


> attributes <- cbind(attributes, Vmax[train], Vmin[train], Vmean[train], Vmed[train])

> Sonar <- cbind(Sonar, Vmax, Vmin, Vmean, Vmed)

Now our list of attributes in Sonar also includes the maximum, minimum, mean and median of V11, V12 and V9. Let's refit the model and calculate variable importance for all attributes.

> fit <- randomForest(Class ~ ., data = Sonar[train,], ntree = numtrees, importance = TRUE)
> varImpPlot(fit)

Figure 42.3 shows variable importance using two measures - the mean decrease in accuracy and the mean decrease in the Gini score. Notice the consistency in the identification of the top eight variables between these two methods. It seems the most influential variables are Vmax, Vmin, Vmean, Vmed, V11, V12, V9 and V52.


Figure 42.3: Binary Random Forest variable importance plots for Sonar

A ten-fold cross validation is performed using the function rfcv, followed by a plot (shown in Figure 42.4) of the cross validated error by number of predictors (attributes).

> cv <- rfcv(trainx = attributes, trainy = Sonar$Class[train], cv.fold = 10, ntree = numtrees, recursive = TRUE)

> with(cv, plot(n.var, error.cv, log = "x", type = "o", lwd = 2,
                xlab = "number of predictors",
                ylab = "cross-validated error (%)"))


The error falls sharply from 2 to around 9 predictors and continues to decline, leveling out by 50 predictors.

Figure 42.4: Binary random forest cross validation error by number of predictors using Sonar
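As a small sketch (not from the book), the number of predictors with the lowest cross-validated error can be read off the rfcv output directly, since it stores the error for each predictor count in error.cv, indexed by n.var:

> cv$n.var[which.min(cv$error.cv)]   # predictor count with the lowest CV error
> round(min(cv$error.cv), 3)         # the corresponding error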

Step 5: Make Predictions

It is always worth experimenting with the tuning parameter mtry. It controls the number of variables randomly sampled as candidates at each split. The function tuneRF can be used to visualize the out of bag error associated with different values of mtry.


> best_mtry <- tuneRF(attributes, Sonar$Class[train], ntreeTry = numtrees, stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE, doBest = FALSE)

Figure 42.5 illustrates the output of tuneRF. The optimal value occurs at mtry = 8, with an out of the bag error of around 19.75%.

We use this value of mtry and refit the model using the test sample.

Figure 42.5: Using tuneRF to obtain the optimal mtry value

> fitbest <- randomForest(Class ~ ., data = Sonar[-train,], ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)


> print(fitbest)

Call:
 randomForest(formula = Class ~ ., data = Sonar[-train,], ntree = numtrees, importance = TRUE, proximity = TRUE, mtry = 8)

               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 8

        OOB estimate of error rate: 23.53%
Confusion matrix:
   M  R class.error
M 26  3   0.1034483
R  9 13   0.4090909

The misclassification error for M and R is 10% and 41% respectively, with an out of the bag error estimate of around 24%. Finally, we calculate and plot the Receiver Operating Characteristic for the test sample; see Figure 42.6.

> fitpreds <- predict(fitbest, Sonar[-train, -61], type = 'prob')
> preds <- fitpreds[,2]
> plot(performance(prediction(preds, Sonar$Class[-train]), 'tpr', 'fpr'))
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")


Figure 42.6: Binary Random Forest ROC for the test set using Sonar


Technique 43

Binary Response Random Ferns

A binary response random ferns model can be built using the package rFerns with the rFerns function:

rFerns(z ~ ., data, ferns, saveErrorPropagation = TRUE, ...)

Key parameters include z, the binary response variable; data, the data set of attributes with which you wish to build the ferns; ferns, which controls the number of ferns to grow in each iteration; and saveErrorPropagation, a logical variable which, if set to TRUE, calculates and saves the out of bag error approximation.

Step 1: Load Required Packages

We build the binary response random ferns using the Sonar data frame (see page 482) contained in the mlbench package. The package ROCR is also loaded; we will use this package during the final stage of our analysis to build a ROC curve.

> library(rFerns)
> library(mlbench)
> data(Sonar)
> require(ROCR)

Step 2: Prepare Data & Tweak Parameters

A total of 157 out of the 208 observations in Sonar are used to create a randomly selected training sample.


> set.seed(107)
> N = nrow(Sonar)
> train <- sample(1:N, 157, FALSE)

Step 3: Estimate and Assess the Random Ferns

We use the function rFerns with 1,000 ferns. The parameter saveErrorPropagation = TRUE so that we can save the out of bag error estimates.

> fit <- rFerns(Class ~ ., data = Sonar[train,], ferns = 1000, saveErrorPropagation = TRUE)

The print function returns details of the fitted model.

> print(fit)

Forest of 1000 ferns of a depth 5

OOB error 20.38%; OOB confusion matrix:
         True
Predicted  M  R
        M 67 17
        R 15 58

The forest consists of 1,000 ferns of a depth 5 with an out of the bag error rate of around 20%. Let's plot the error rate.

> plot(fit$oobErr, xlab = "Number of Ferns", ylab = "OOB Error", type = "l")

From Figure 43.1 it looks as if the error is minimized around 335 ferns, with an approximate out of the bag error rate of around 20%. To get the exact value use:

> which.min(fit$oobErr)
[1] 335

> fit$oobErr[which.min(fit$oobErr)]
[1] 0.1910828

It seems the error is minimized at 335 ferns, with an OOB error of 19%.


Figure 43.1: Binary Random Ferns out of bag error by number of ferns for Sonar

Now, setting ferns = 335, we re-estimate the model.

> fit <- rFerns(Class ~ ., data = Sonar[train,], ferns = 335, importance = TRUE, saveForest = TRUE)

> print(fit)

Forest of 335 ferns of a depth 5

OOB error 18.47%; OOB confusion matrix:
         True
Predicted  M  R
        M 70 17
        R 12 58

PRACTITIONER TIP

To see a print out of the error estimate as the model iterates, add the parameter reportErrorEvery to rFerns. For example, reportErrorEvery = 50 returns an out of the bag error estimate at every 50th iteration. Try this:

> test <- rFerns(Class ~ ., data = Sonar[train,], ferns = 100, reportErrorEvery = 20)

Now we turn our attention to investigating variable importance. The values can be extracted from the fitted model using fit$importance. Since larger values indicate greater importance, we use the order function to sort from largest to smallest, but only show the top six.

> imp <- fit$importance[order(-fit$importance[,1]),]

> round(head(imp), 3)
    MeanScoreLoss SdScoreLoss
V11         0.125       0.028
V12         0.109       0.021
V10         0.093       0.016
V49         0.068       0.015
V47         0.060       0.017
V9          0.060       0.017

The output reports the top six attributes by order of importance. It also provides a measure of their variability (SdScoreLoss). It turns out that V11 has the highest importance score, followed by V12 and V10. Notice that V49, V47 and V9 are all clustered around 0.06.

Step 4: Make Predictions

We use the predict function to predict the classes using the test sample, create and print the confusion matrix, and calculate the misclassification error. The misclassification error is close to 18%.

> predClass <- predict(fit, Sonar[-train,])
> table(predClass, Sonar$Class[-train], dnn = c("Predicted Class", "Observed Class"))

               Observed Class
Predicted Class  M  R
              M 24  4
              R  5 18

> error_rate = (1 - sum(Sonar$Class[-train] == predClass)/51)

> round(error_rate, 3)
[1] 0.176

It can be of value to calculate the Receiver Operating Characteristic (ROC). This can be constructed using the ROCR package. First we convert the raw scores estimated using the test sample from the model fit into probabilities, or at least values that lie in the zero to one range.

> predScores <- predict(fit, Sonar[-train,], scores = TRUE)
> predScores <- predScores + abs(min(predScores))

Next we need to generate the prediction object. As inputs we use the predicted class probabilities in predScores and the actual test sample classes (Sonar$Class[-train]). The predictions are stored in pred and passed to the performance method, which calculates the true positive rate and false positive rate. The result is visualized, as shown in Figure 43.3, using plot.

> pred <- prediction(predScores[,2], Sonar$Class[-train])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf)
> abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")

To interpret the ROC curve, let us remember that perfect classification implies a 100% true positive rate (and therefore a 0% false positive rate). This classification only happens at the upper left-hand corner of the graph and results in a plot as shown in Figure 43.2.

Now, looking at Figure 43.3, we see that the closer the curve gets to the upper left corner, the higher the classification rate. The diagonal line shown in Figure 43.3 represents random chance, so the distance of our curve above the diagonal line represents how much better it is than a random guess. The area under the curve, which takes the value 1 for a 100% true positive rate, can be calculated using a few lines of code.

> auc <- performance(pred, "auc")
> auc_area <- slot(auc, "y.values")[[1]]
> round(auc_area, 3)
[1] 0.929

Figure 43.2: 100% true positive rate ROC curve


Figure 43.3: Binary Random Ferns Receiver Operating Characteristic for Sonar


Technique 44

Survival Random Forest

A survival random forest can be built using the package randomForestSRC with the rfsrc function:

rfsrc(z ~ ., data, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package), and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(randomForestSRC)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

A total of 600 of the 641 observations in rhDNase are used to create a randomly selected training sample. The rows in rhDNase have "funky" numbering. We use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient).

> set.seed(107)
> N = nrow(rhDNase)


> rownames(rhDNase) <- 1:nrow(rhDNase)
> train <- sample(1:N, 600, FALSE)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The variable response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2

Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function rfsrc. The number of trees is set to 1,000 and the number of variables randomly sampled as candidates at each split is set equal to 3 (mtry = 3), with a maximum of 3 (nsplit = 3) split points chosen randomly.

> fit <- rfsrc(Surv(time2, status) ~ trt + fev + fev2 + fev.ave, data = rhDNase[train,], nsplit = 3, ntree = 1000, mtry = 3)

The error rate by number of trees and variable importance are given using the plot method - see Figure 44.1. The error rate levels out around 400 trees, remaining relatively constant until around 600 trees. Our constructed variable fev.ave is indicated as the most important variable, followed by fev2.

> plot(fit)

            Importance Relative Imp
fev.ave         0.0206       1.0000
fev2            0.0197       0.9586
fev             0.0073       0.3529
trt             0.0005       0.0266
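As a rough sketch (not from the book), the figures in the table above can also be recovered programmatically via the vimp function, which is introduced more fully below; the relative column is simply each score scaled by the largest one.

> vi <- vimp(fit)$importance        # permutation importance for each attribute
> round(vi/max(vi), 4)              # relative importance, scaled to the top variable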


Figure 44.1: Survival random forest error rate and variable importance for rhDNase

Further details of variable importance can be explored using a range of methods via the vimp function and the parameter importance. This parameter can take four distinct values: "permute", "random", "permute.ensemble" and "random.ensemble". Let's calculate all four methods.

> permute <- vimp(fit, importance = "permute")$importance
> random <- vimp(fit, importance = "random")$importance
> permute.ensemble <- vimp(fit, importance = "permute.ensemble")$importance
> random.ensemble <- vimp(fit, importance = "random.ensemble")$importance

We combine the results using rbind into the matrix tab. Overall, we see that fev2 and fev.ave appear to be the most important attributes.

> tab <- rbind(permute, random, permute.ensemble, random.ensemble)
> round(tab, 3)

                    trt    fev   fev2 fev.ave
permute           0.001  0.007  0.021   0.021
random            0.001  0.010  0.015   0.018
permute.ensemble -0.002 -0.027 -0.015  -0.016
random.ensemble  -0.004 -0.025 -0.027  -0.026

PRACTITIONER TIP

Another way to assess variable importance is using the var.select method. It takes the form var.select(object = fitted model, method = "md"). Method can take the values "md" for minimal depth (the default), "vh" for variable hunting and "vh.vimp" for variable hunting with VIMP. As an illustration, using the fitted model fit, type:

> md <- var.select(object = fit, method = "md")
> vh <- var.select(object = fit, method = "vh")
> vh.vimp <- var.select(object = fit, method = "vh.vimp")

Searching for interaction effects is an important task in many areas of statistical modeling. A useful tool is the method find.interaction. Key parameters for this method include the number of variables to be used (nvar) and the method (method). We set method = "vimp" to use the approach of Ishwaran84. In this method the importance of each variable is calculated. Their paired variable importance is also calculated. The sum of these two values is known as "Additive" importance. A large positive or negative difference between "Paired" and "Additive" indicates a potential interaction, provided the individual variable importance scores are large.

> find.interaction(fit, nvar = 8, method = "vimp")

Method: vimp

No. of variables: 4
Variables sorted by VIMP?: TRUE
No. of variables used for pairing: 4
Total no. of paired interactions: 6
Monte Carlo replications: 1
Type of noising up used for VIMP: permute

              Paired Additive Difference
fev.ave:fev2  0.0346   0.0405    -0.0058
fev.ave:fev   0.0296   0.0284     0.0012
fev.ave:trt   0.0217   0.0222    -0.0005
fev2:fev      0.0283   0.0270     0.0013
fev2:trt      0.0196   0.0208    -0.0012
fev:trt       0.0134   0.0090     0.0045

Since the differences across all variable pairs appear rather small, we conclude there is little evidence of an interaction effect.

Now we investigate interaction effects using method = "maxsubtree". This invokes a maximal subtree analysis85.

> find.interaction(fit, nvar = 8, method = "maxsubtree")

Method: maxsubtree
No. of variables: 4
Variables sorted by minimal depth?: TRUE

        fev.ave fev2  fev  trt
fev.ave    0.07 0.12 0.13 0.21
fev2       0.12 0.08 0.13 0.20
fev        0.12 0.12 0.09 0.23
trt        0.13 0.13 0.16 0.15

In this case, reading across the rows in the resultant table, small [i][j] values combined with small [i][i] values are an indication of a potential interaction between attribute i and attribute j.

Reading across the rows we do not see any such values, and again we do not find evidence supporting an interaction effect.

Step 4: Make Predictions

Predictions using the test sample can be obtained using the predict method. The results are stored in pred.

> pred <- predict(fit, newdata = rhDNase[-train,])

Finally, we use the plot and plot.survival methods to visualize the predicted outcomes; see Figure 44.2 and Figure 44.3.

> plot(pred)
> plot.survival(pred)

Figure 44.2: Survival random forest visualization of plot(pred) using rhDNase


Figure 44.3: Survival random forest visualization of prediction


Technique 45

Conditional Inference Survival Random Forest

A conditional inference survival random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the survival response variable (note we set z = Surv(time, status), where Surv is a survival object constructed using the survival package); and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We construct a conditional inference survival random forest model using the rhDNase data frame (see page 100) contained in the simexaft package.

> library(party)
> library(simexaft)
> library(survival)
> data(rhDNase)

Step 2: Prepare Data & Tweak Parameters

We will use all 641 observations in rhDNase for our analysis. The rows in rhDNase have "funky" numbering. Therefore we use the rownames method to create a sequential number for each patient (1 for the first patient and 641 for the last patient).


> set.seed(107)
> rownames(rhDNase) <- 1:nrow(rhDNase)

The forced expiratory volume (FEV) is considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation, measured in the object survreg(Surv(time2, status)).

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 3: Estimate and Assess the Model

We estimate the random forest using the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 10) and to set the number of variables randomly sampled as candidates at each split (mtry = 2).

> fit <- cforest(z ~ trt + fev.ave, data = rhDNase, controls = cforest_unbiased(ntree = 10, mtry = 2))

Step 4: Make Predictions

We will estimate conditional Kaplan-Meier survival curves for individual patients. First we use the predict method.

> pred <- predict(fit, newdata = rhDNase, type = "prob")

Next we use the plot method to graph the conditional Kaplan-Meier curves for individual patients (patients 1, 137, 205 and 461); see Figure 45.1.

> op <- par(mfrow = c(2, 2))
> plot(pred[[1]], main = "Patient 1", xlab = "Status", ylab = "Time")
> plot(pred[[137]], main = "Patient 137", xlab = "Status", ylab = "Time")
> plot(pred[[205]], main = "Patient 205", xlab = "Status", ylab = "Time")
> plot(pred[[461]], main = "Patient 461", xlab = "Status", ylab = "Time")
> par(op)

Figure 45.1: Survival random forest Kaplan-Meier curves for the individual patients using rhDNase


Technique 46

Conditional Inference Regression Random Forest

A conditional inference regression random forest can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; z, the continuous response variable; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build the conditional inference regression random forest using the bodyfat (see page 62) data frame contained in the TH.data package. We also load the caret package as it provides handy performance metrics for random forests.

> library(party)
> data(bodyfat, package = "TH.data")
> library(caret)

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement.

> set.seed(465)
> train <- sample(1:71, 45, FALSE)


Step 3: Estimate and Assess the Model

We estimate the random forest using the training sample and the function cforest. The control cforest_unbiased is used to set the number of trees (ntree = 100) and to set the number of variables randomly sampled as candidates at each split (mtry = 5).

> fit <- cforest(DEXfat ~ ., data = bodyfat[train,], controls = cforest_unbiased(ntree = 100, mtry = 5))

The fitted model root mean square error and R-squared are obtained using the caret package.

> round(caret:::cforestStats(fit), 4)
    RMSE Rsquared
  5.1729   0.8112

We calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al.86

> ord1 <- varimp(fit)
> imp1 <- ord1[order(-ord1[1:3])]

> round(imp1, 3)
  hipcirc waistcirc       age
   69.957    47.366     0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman.87

> ord2 <- varimp(fit, pre1.0_0 = TRUE)

> imp2 <- ord2[order(-ord2[1:3])]

> round(imp2, 3)
  hipcirc waistcirc       age
   72.527    40.501     0.000

The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman.

> ord3 <- varimp(fit, conditional = TRUE)

> imp3 <- ord3[order(-ord2[1:3])]


> round(imp3, 3)
  hipcirc waistcirc       age
   66.600    47.218     0.000

It is informative to note that all three approaches identify hipcirc as the most important variable, followed by waistcirc.

Step 4: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. The scatter plot between predicted and observed values is shown in Figure 46.1. The squared correlation coefficient between predicted and observed values is 0.756.

> pred <- predict(fit, newdata = bodyfat[-train,], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat", ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
         [,1]
DEXfat  0.756


Figure 46.1: Conditional inference regression forest scatter plot for DEXfat


Technique 47

Quantile Regression Forests

Quantile regression forests are a generalization of random forests and offer a non-parametric way of estimating conditional quantiles of the response variable88. Quantile regression forests can be built using the package quantregForest with the quantregForest function:

quantregForest(y, x, data, ntree, quantiles, ...)

Key parameters include y, the continuous response variable; x, the data set of attributes with which you wish to build the forest; ntree, the number of trees; and quantiles, the quantiles you wish to include.

Step 1: Load Required Packages

We build the quantile regression forests using the bodyfat (see page 62) data frame contained in the TH.data package.

> require(quantregForest)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters

Following the approach taken by Garcia et al., we use 45 of the 71 observations to build the regression tree. The remainder will be used for prediction. The 45 training observations were selected at random without replacement. We use train to partition the data into a validation and test sample.

> train <- sample(1:71, 45, FALSE)
> numtrees = 2000
> xtrain <- bodyfat[train,]


> xtrain$DEXfat <- NULL
> xtest <- bodyfat[-train,]
> xtest$DEXfat <- NULL
> DEXtrain <- bodyfat$DEXfat[train]
> DEXtest <- bodyfat$DEXfat[-train]

Step 3: Estimate and Assess the Model

We use the quantregForest method, setting the number of variables randomly sampled as candidates at each split and the node size equal to 5 (mtry = 5, nodesize = 5). We use fit to contain the results of the fitted model.

> fit <- quantregForest(y = DEXtrain, x = xtrain, mtry = 5, ntree = numtrees, nodesize = 5, importance = TRUE, quantiles = c(0.25, 0.5, 0.75))

Use the print method to see a summary of the estimated model.

> print(fit)

Call:
 quantregForest(x = xtrain, y = DEXtrain, mtry = 5, nodesize = 5, ntree = numtrees, importance = TRUE, quantiles = c(0.25, 0.5, 0.75))

                     Number of trees: 2000
No. of variables tried at each split: 5

To see variable importance across all fitted quantiles use the method varImpPlot.qrf. A visualization using this method for fit is shown in Figure 47.1. It appears hipcirc and waistcirc are the most influential variables for the median and 75th quantile.

> varImpPlot.qrf(fit)


Figure 47.1: Quantile regression forests variable importance across all fitted quantiles for bodyfat

Let's look in a little more detail at the median. This is achieved using the varImpPlot.qrf method and setting quantile = 0.5. The result is illustrated in Figure 47.2.

> varImpPlot.qrf(fit, quantile = 0.5)


Figure 47.2: Quantile regression forests using the varImpPlot.qrf method with quantile = 0.5 for bodyfat

The plot method returns the 90% prediction interval on the estimated data; see Figure 47.3. It appears the model is well calibrated to the data.

> plot(fit)


Figure 47.3: Quantile regression forests 90% prediction interval using bodyfat

Step 4: Make Predictions

We use the test sample observations and the fitted regression forest to predict DEXfat. Notice we set all = TRUE to use all observations for prediction.

> pred <- predict(fit, newdata = xtest, all = TRUE)

To take a closer look at the value of pred you can enter:

> head(pred)
     quantile= 0.1 quantile= 0.5 quantile= 0.9
[1,]      34.22470      41.53230      47.14400
[2,]      35.40183      41.53541      52.41859
[3,]      22.50604      27.00303      35.40398
[4,]      27.64735      35.56909      42.73546
[5,]      21.30879      26.20480      35.33114
[6,]      32.49331      38.86571      44.52887

The plot of the fitted and observed values is shown in Figure 47.4. The squared correlation coefficient between the predicted and observed values is 0.898.

> plot(DEXtest, pred[,2], xlab = "DEXfat", ylab = "Predicted Values (median)", main = "Training Sample Model Fit")

> round(cor(pred[,2], DEXtest)^2, 3)
[1] 0.898


Figure 47.4: Quantile regression forests scatter plot between predicted and observed values using bodyfat


Technique 48

Conditional Inference Ordinal Random Forest

Conditional inference ordinal random forests are used when the response variable is measured on an ordinal scale. In marketing, for instance, we often see consumer satisfaction measured on an ordinal scale - "very satisfied", "satisfied", "dissatisfied" and "very dissatisfied". In medical research, constructs such as self-perceived health are often measured on an ordinal scale - "very unhealthy", "unhealthy", "healthy", "very healthy". Conditional inference ordinal random forests can be built using the package party with the cforest function:

cforest(z ~ ., data, controls, ...)

Key parameters include controls, which controls parameters such as the number of trees grown in each iteration; the response variable z, an ordered factor; and data, the data set of attributes with which you wish to build the forest.

Step 1: Load Required Packages

We build our random forest using the wine data frame contained in the ordinal package (see page 95). We also load the caret package as it provides handy performance metrics for random forests.

> library(caret)
> library(ordinal)
> data(wine)


Step 2: Prepare Data & Tweak Parameters

We use train to partition the validation set, using fifty observations to build the model and the remainder as the test set.

> set.seed(107)
> N = nrow(wine)
> train <- sample(1:N, 50, FALSE)

Step 3: Estimate and Assess the Model

We use the cforest method with 100 trees and set the number of variables randomly sampled as candidates at each split equal to 5 (mtry = 5). We use fit to contain the results of the fitted model.

> fit <- cforest(rating ~ ., data = wine[train,], controls = cforest_unbiased(ntree = 100, mtry = 5))

Accuracy and kappa are obtained using caret.

> round(caret:::cforestStats(fit), 4)
Accuracy    Kappa
  0.8600   0.7997

The fitted model has reasonable accuracy and kappa statistics. We next calculate variable importance using three alternate methods. The first method calculates the unconditional mean decrease in accuracy using the approach of Hapfelmeier et al. (See page 335 for further details of these methods.)

> ord1 <- varimp(fit)

> imp1 <- ord1[order(-ord1[1:5])]
> round(imp1, 3)
response     temp  contact   bottle    judge
   0.499    0.000    0.000    0.000    0.000

The second approach calculates the unconditional mean decrease in accuracy using the approach of Breiman.

> ord2 <- varimp(fit, pre1.0_0 = TRUE)

> imp2 <- ord2[order(-ord2[1:5])]
> round(imp2, 3)
response     temp  contact   bottle    judge
   0.448    0.000    0.000    0.000    0.000


The third approach calculates the conditional mean decrease in accuracy using the approach of Breiman.

> ord3 <- varimp(fit, conditional = TRUE)

> imp3 <- ord3[order(-ord2[1:5])]
> round(imp3, 3)
response     temp  contact   bottle    judge
   0.481    0.000    0.000    0.000    0.000

For all three methods it seems the response (wine bitterness score) is the only informative attribute in terms of importance.

Step 4: Make Predictions

We next use the test sample observations with cforest, and display the accuracy and kappa using caret.

> fit_test <- cforest(rating ~ ., data = wine[-train,])
> round(caret:::cforestStats(fit_test), 4)
Accuracy    Kappa
  0.3182   0.0000

Accuracy is now around 32% and kappa has fallen to zero. Although these numbers are not very encouraging, we nevertheless investigate the predictive performance using the predict method, storing the results in pred.

> pred <- predict(fit, newdata = wine[-train,], type = "response")

Now we compare the predicted values to those actually observed by using the table method to create a confusion matrix.

> tb <- table(wine$rating[-train], pred, dnn = c("actual", "predicted"))
> tb

      predicted
actual 1 2 3 4 5
     1 0 3 0 0 0
     2 0 6 0 0 0
     3 0 0 7 0 0
     4 0 0 0 4 0
     5 0 0 0 2 0

Finally, the misclassification rate can be calculated. We see it is around 23% for the test sample.

> error <- 1 - (sum(diag(tb))/sum(tb))
> round(error, 3)
[1] 0.227


Notes

72. See Ozuysal M., Fua P., Lepetit V. (2007). Fast Keypoint Recognition in Ten Lines of Code. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1-8.

73. Please visit http://www.un-redd.org

74. See for example:

1. Evans JS, Murphy MA, Holden ZA, Cushman SA (2011) Modeling species distribution and change using random forest. In: Drew CA, Wiersma YF, Huettmann F, editors. Predictive species and habitat modeling in landscape ecology: concepts and applications. New York City, NY, USA: Springer Science+Business Media. pp. 139-159.

2. Breiman L (2001) Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16: 199-231.

75. See for example:

1. Baccini A, Goetz SJ, Walker WS, Laporte NT, Sun M, et al. (2012) Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature Climate Change 2: 182-185.

2. Drake JB, Knox RG, Dubayah RO, Clark DB, Condit R, et al. (2003) Above ground biomass estimation in closed canopy neotropical forests using lidar remote sensing: Factors affecting the generality of relationships. Global Ecology and Biogeography 12: 147-159.

3. Hudak AT, Strand EK, Vierling LA, Byrne JC, Eitel JUH, et al. (2012) Quantifying above ground forest carbon pools and fluxes from repeat LiDAR surveys. Remote Sensing of Environment 123: 25-40.

76. Mascaro, Joseph, et al. "A tale of two 'forests': Random Forest machine learning aids tropical forest carbon mapping." (2014): e85993.

77. Farney, Robert J., et al. "Sleep disordered breathing in patients receiving therapy with buprenorphine/naloxone." European Respiratory Journal 42.2 (2013): 394-403.

78. Vitorino, D., et al. "A Random Forest Algorithm Applied to Condition-based Wastewater Deterioration Modeling and Forecasting." Procedia Engineering 89 (2014): 401-410.

79. Sugimoto, Koichiro, et al. "Cross-sectional study: Does combining optical coherence tomography measurements using the 'Random Forest' decision tree classifier improve the prediction of the presence of perimetric deterioration in glaucoma suspects?" BMJ Open 3.10 (2013): e003114.

80. Kanerva, N., et al. "Random forest analysis in identifying the importance of obesity risk factors." European Journal of Public Health 23.suppl 1 (2013): ckt124-042.

81. Chen, Xue-Wen, and Mei Liu. "Prediction of protein-protein interactions using random decision forest framework." Bioinformatics 21.24 (2005): 4394-4400.

82. Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

83. Leo Breiman (2001). Random Forests. Machine Learning 45(1), 5-32.

84. See Ishwaran H. (2007). Variable importance in binary regression trees and forests. Electronic J. Statist. 1:519-537.

85. For more details of this method see:

1. Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc. 105:205-217.

2. Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining 4:115-132.

86. See Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing. http://dx.doi.org/10.1007/s11222-012-9349-1

87. See Leo Breiman (2001). Random Forests. Machine Learning 45(1), 5-32.

88. See Meinshausen, Nicolai. "Quantile regression forests." The Journal of Machine Learning Research 7 (2006): 983-999.

Part VI

Cluster Analysis

The Basic Idea

Cluster analysis is an exploratory data analysis tool which aims to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The groups of similar objects are called clusters.

Cluster analysis itself is not one specific algorithm but consists of a wide variety of approaches. Two of the most popular are hierarchical clustering and partition based clustering. We discuss each briefly below.

Hierarchical Clustering

Hierarchical clustering methods attempt to maximize the distance between clusters. They begin by assigning each object to its own cluster. At each step the algorithm merges together the least distant pair of clusters, until only one cluster remains. At every step the distance between clusters, say cluster A and cluster B, is updated.

There are two basic approaches used in hierarchical clustering algorithms. The first is known as agglomerative and the second divisive.

Agglomerative Approach

The agglomerative approach begins with each observation considered a separate cluster. These observations are then merged successively into larger clusters until a hierarchy of clusters, known as a dendrogram, emerges.

Figure 54.4 shows a typical dendrogram using the thyroid data set. Further details of this data are discussed on page 385. Because agglomerative clustering algorithms begin with the individual observations and then grow to larger clusters, the approach is known as bottom up clustering; a short R sketch is given below.
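A minimal sketch (not from the book) of agglomerative clustering with the hclust function from the stats package, using average linkage on the built-in USArrests data:

> d <- dist(scale(USArrests))           # pairwise Euclidean distances on scaled data
> hc <- hclust(d, method = "average")   # agglomerative clustering, average linkage
> plot(hc)                              # dendrogram of the merge sequence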


Divisive Approach

The divisive approach begins with the entire sample of observations and then proceeds to divide it into smaller clusters. Two approaches are generally taken. The first uses a single attribute to determine the split at each step; this approach is called monothetic. The second approach uses all attributes at each step and is known as polythetic. Since for a sample of n observations there are 2^(n-1) - 1 possible divisions into two clusters, the divisive approach is computationally intense. Figure 55.2 shows an example of a dendrogram produced using the divisive approach; a brief R sketch follows.
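As a rough sketch (not part of the original text), a polythetic divisive clustering can be obtained with the diana function in the cluster package, and its result displayed as a dendrogram:

> library(cluster)
> dv <- diana(scale(USArrests))    # DIvisive ANAlysis clustering
> plot(dv, which.plots = 2)        # which.plots = 2 draws the dendrogram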

Partition Based Clustering

Partitioning clustering methods attempt to minimize the "average" distance within clusters. They typically proceed as follows:

1. Select group centers at random and assign objects to the nearest group.

2. Form new centers at the center of these groups.

3. Move individual objects to a new group if they are closer to that group than to the center of their present group.

4. Repeat steps 2 and 3 until no (or minimal) change in groupings occurs.

K-Means

The most popular partition based clustering algorithm is the k-means method. It requires a distance metric and the number of centroids (clusters) to be specified. Here is how it works:

1. Determine how many clusters you expect in your sample, say k.

2. Pick a group of k random centroids. A centroid is the center of a cluster.

3. Assign sample observations to the closest centroid. The closeness is determined by a distance metric. They become part of that cluster.

4. Calculate the center (mean) of each cluster.

5. Check assignments for all the sample observations. If another center is closer to an observation, reassign it to that cluster.

6. Repeat steps 3-5 until no reassignments occur.

Much of the popularity of k-means lies in the fact that the algorithm is extremely fast and can therefore be easily applied to large data-sets. However, it is important to note that the solution is a local optimum, so in practice several starting points should be used, as in the sketch below.
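A hedged illustration (not from the book): the kmeans function in the stats package will run the algorithm from several random starting configurations via its nstart argument and keep the best solution.

> km <- kmeans(scale(USArrests), centers = 4, nstart = 25)  # 25 random starts
> km$tot.withinss   # total within-cluster sum of squares of the retained run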


Practical Applications

NOTE

Different clustering algorithms may produce very different clusters. Unfortunately, there is no generally accepted "best" method. One way to assess an algorithm is to take a data set with a known structure and see whether the algorithm is able to reproduce this structure. This is the approach taken by Musmeci et al.

Financial Market Structure

Musmeci et al.89 study the structure of the New York Stock Exchange using various clustering techniques. A total of 4026 daily closing prices from 342 stocks are collected for their analysis. Both trended and detrended log price returns are considered for analysis. Four of the clustering methods considered are Single Linkage (SL), Average Linkage (AL), Complete Linkage (CL) and k-medoids, using the Industrial Classification Benchmark90 (ICB) as a known classification outcome reference.

Figure 48.1 illustrates the findings of their analysis for a predefined cluster size equal to 23. SL appears to generate one very large cluster which contains stocks in all the ICB sectors; this is clearly inaccurate. AL, CL and k-medoids show more structured clustering, although for AL there remain a number of clusters with fewer than 5 stocks91.

The Adjusted Rand index and the number of sectors identified by each method are presented in Table 21. ICB has 19 sectors and one would hope that the clustering methods would report a similar number of clusters. For the trended stock returns k-medoids comes closest, with 17 sectors and an Adjusted Rand index of 0.387. For the detrended case k-medoids has a lower Adjusted Rand index than CL and AL, but it again comes closest to recovering the actual number of sectors in ICB.

Figure 48.1: Musmeci et al.'s composition of clustering in terms of ICB super-sectors. The x-axis represents the cluster labels, the y-axis the number of stocks in each cluster. Color and shading represent the ICB super-sectors. Source: Musmeci et al.


                          with trend
                      k-medoids    CL    AL    SL
Adjusted Rand index       0.387 0.387 0.352 0.184
Clusters                     17    39   111   229

                          detrended
Adjusted Rand index       0.467 0.510 0.480 0.315
Clusters                     25    50    60   101

Table 21: Adjusted Rand index and number of estimated sectors by clustering model, reported by Musmeci et al.

PRACTITIONER TIP

Musmeci et al. use the Adjusted Rand index as part of their analysis. The index is a measure of the similarity between two data partitions, or clusterings, of the same sample. The index has zero expected value in the case of a random partition and takes the value 1 in the case of perfect agreement between two clusterings. Negative values of the index indicate anti-correlation between the two clusterings. You can calculate it in R with the function adjustedRandIndex contained in the mclust package, as sketched below.
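A minimal sketch (not from the book), comparing a k-means partition of the iris measurements with the known species labels:

> library(mclust)
> km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)   # cluster the four measurements
> adjustedRandIndex(km$cluster, iris$Species)           # agreement with the true grouping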


Understanding Rat Talk

So called "Rat Talk" consists of ultrasonic vocalizations (USVs) emitted by individual rats to members of their colony. It is possible that USVs convey both emotional and/or environmental information. Takahashi et al.92 use cluster analysis to categorize the USVs of pairs of rats as they go about their daily activities.

The data set consisted of audio and video recordings. The audio recording data was analyzed 50 ms at a time. Where a USV was emitted, the corresponding video sequence was also analyzed. USV calls were categorized according to Wright et al.'s rat communication classification system93. This consists of seven call types - upward, downward, flat, short, trill, inverted U and 22-kHz.

A two-step cluster analysis was performed. In the first step, frequency and duration data were formed into preclusters. The number of clusters was automatically determined on the basis of the Schwarz Bayesian Criterion, with the log-likelihood criterion used as a distance measure. In the second step a hierarchical clustering algorithm was applied to the preclusters.

Table 22 presents some of the findings. Notice that three clusters were obtained, which the researchers labeled as "feeding", "moving" and "fighting". Feeding is associated with a low frequency USV, moving with a medium frequency, and fighting with the highest frequency. Overall the analysis tends to indicate that there is an association between USVs and rat behavior.

Cluster      Frequency           Duration               Dominant call type
1 Feeding    24.56 ± 2.18 kHz    628.70 ± 415.45 ms     22-kHz
2 Moving     41.78 ± 5.88 kHz    31.18 ± 32.40 ms       Flat
3 Fighting   59.18 ± 4.91 kHz    9.16 ± 10.08 ms        Short

Table 22: Summary of findings of Takahashi et al.

Identification of Asthma Clusters

Kim et al.94 use cluster analysis to help define asthma sub-types as a prelude to searching for better asthma management. The researchers conduct their analysis using two cohorts of asthma patients. The first, known as the COREA cohort, consists of 724 patients. The second, the SCH cohort, consists of 1843 patients. We focus our discussion on the COREA cohort as similar results were found for the SCH cohort.

Six primary variables - FEV195, body mass index, age at onset, atopic status, smoking history and history of hospital use - were used to help characterize the asthma clusters. All measurements were standardized, using z-scores for continuous variables and 0 or 1 for categorical variables. Hierarchical cluster analysis using Ward's method was used to generate a dendrogram for estimation of the number of potential clusters. This estimate was then used in a k-means cluster analysis.

The researchers observed four clusters. The first cluster contained 81 patients and was dominated by male patients, with the greatest number of smokers and a mean onset age of 46 years. The second cluster contained 151 patients, around half of whom had atopy. The third cluster had 253 patients, with the youngest mean age at onset (21 years), and about two-thirds of these patients had atopy. The final group had 239 patients and the highest FEV1 at 97.9%, with a mean age at onset of 48 years.

Automated Determination of the Arterial Input Function

Cerebral perfusion, also referred to as cerebral blood flow (CBF), is one of the most important parameters related to brain physiology and function. The technique of dynamic-susceptibility contrast (DSC) MRI96 is a popular method to measure perfusion. It relies on the intravenous injection of a contrast agent and the rapid measurement of the transient signal changes as the contrast agent passes through the brain.

Central to quantification of CBF is the arterial input function (AIF), which describes the contrast agent input to the tissue of interest.

Quantitative maps of cerebral blood flow, cerebral blood volume (CBV) and mean transit time (MTT) are created using a deconvolution method97 98.

Yin et al.99 consider the use of two clustering techniques (k-means and fuzzy c-means (FCM)) for AIF detection. Forty-two volunteers were recruited onto the study. They underwent DSC MRI imaging. After suitable transformation of the image data, both clustering techniques were applied with the number of clusters pre-set to 5.

For the mean curve of each cluster the peak value (PV), the time to peak (TTP) and the full-width half maximum (FWHM) were computed, from which a measure M = PV / (TTP × FWHM) was calculated. Following the approach of Murase et al.100, the researchers select the cluster with the highest M to determine the AIF.

Figure 48.2 shows the AIFs for each clustering method for a 37 year old male patient. Relative to FCM, the researchers observe the k-means-based AIF shows similar TTP, higher PV and narrower FWHM. The researchers conclude by stating "the K-means method yields more accurate and reproducible AIF results compared to FCM cluster analysis. The execution time is longer for the K-means method than for FCM, but acceptable because it leads to more robust and accurate follow-up hemodynamic maps".


Figure 48.2: Comparison of AIFs derived from the FCM and k-means clustering methods. Source: Yin et al., doi:10.1371/journal.pone.0085884.g007


PRACTITIONER TIP

The question of how many clusters to pre-specify in methods such as k-means is a common issue. Kim et al. solve this problem by using the dendrogram from hierarchical cluster analysis to generate the appropriate number of clusters. Given a sample of size N, an alternative and very crude rule of thumb is101

k ≈ sqrt(N/2).     (48.1)
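As a tiny illustration (not from the book, and assuming the Vehicle data frame used in the earlier techniques is loaded), this rule suggests roughly 21 clusters for its 846 rows:

> N <- nrow(Vehicle)
> round(sqrt(N/2))   # crude rule-of-thumb cluster count
[1] 21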

Choice Tests for a Wood Sample

Choice tests are frequently used by researchers to determine the food preference of insects. For insects which are sensitive to taste, natural variations in the same species of wood might be sufficient to confound experimental results102. Therefore researchers have spent considerable effort attempting to reduce this natural variation103.

Oberst et al.104 test the similarity of wood samples cut sequentially from different Monterey pine (Pinus radiata) sources by applying fuzzy c-means clustering.

Veneer discs from different trees and geographical locations were collected for two samples. The small data set consisted of 505 discs cut from 10 sheets of wood; the large data set consisted of 1417 discs cut from 22 sheets. Fuzzy c-means clustering using three physical properties (dry weight, moisture absorption and reflected light intensity) was used to evaluate both data sets.

Six clusters were identified for each data set. For the small data set all six cluster centers were in regions of negative mode-skewness, which simply means that most of the wood veneer had more bright (early wood) than dark (late wood) regions. For the large data set only four of the six cluster centers were in regions of negative mode-skewness.

The researchers found that the difference between the small and large data sets for the mode skewness of the reflected light intensity was statistically significant. This was not the case for the other two properties (dry weight and moisture absorption).

Oberst et al. conclude by observing that "the clustering algorithm was able to place the veneer discs into clusters that match their original source veneer sheets by using just the three measurements of physical properties".

Evaluation of Molecular Descriptors

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy and health research. They are believed to map molecular structures into numbers, allowing some mathematical treatment of the chemical information contained in the molecule. As Todeschini and Consonni state105:

The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.

Dehmer et al.106 evaluate 919 descriptors of 6 different categories (connectivity, edge adjacency, topological, walk path counts, information and 2D matrix based) by means of clustering. The samples for their analysis came from three data sets107: MS2265, C15 and N8. For each of the 6 categories 24, 301, 57, 28, 40 and 469 descriptors were acquired.

In order to evaluate the descriptors, seven hierarchical clustering algorithms (Ward, Single, Complete, Average, McQuitty, Median and Centroid) were applied to each of the three data sets.

Dehmer et al. report the cophenetic correlation coefficients for the average clustering solutions for the three data sets as 0.84, 0.89 and 0.93. The researchers also plot the hierarchical clusters (see Figure 48.3), and they observe that "The figure indicates that the descriptors of each categories have not been clustered correctly regarding their respective groups".


Figure 48.3: Dehmer et al.'s hierarchical clustering using the average algorithm: MS2265 (left), C15 (middle), N8 (right). The total number of descriptors equals 919. They belong to 6 different categories, which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D matrix-based (469). Source: Dehmer et al., doi:10.1371/journal.pone.0083956.g001


PRACTITIONER TIP

Although often viewed as a graphical summary of a sample, it is important to remember that dendrograms actually impose structure on the data. For example, the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function (e.g. complete or average linkage) that is used.

One way to measure how faithfully a dendrogram preserves the pairwise distances between the original data points is to use the cophenetic correlation coefficient. It is defined as the correlation between the n(n - 1)/2 pairwise dissimilarities between observations and the between-cluster dissimilarities at which two observations are first joined together in the same cluster (often known as cophenetic dissimilarities). It takes a maximum value of 1; higher values correspond to greater preservation of the pairwise distances between the original data points. It can be calculated using the cophenetic function in the stats package, as sketched below.
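A minimal sketch (not from the book) using the USArrests data:

> d <- dist(scale(USArrests))
> hc <- hclust(d, method = "average")
> cor(d, cophenetic(hc))   # cophenetic correlation coefficient for this dendrogram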


Partition Based Methods


Technique 49

K-Means

For our initial analysis we will use the kmeansruns function in the fpc package. This uses the kmeans method in the stats package:

kmeansruns(x, krange, criterion, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; krange, the suspected minimum and maximum number of clusters; and criterion, which determines the metric used to assess the clusters.

Step 1: Load Required Packages

First we load the required packages.

> require(fpc)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters

We store the Vehicle attribute data in x, dropping the class labels held in column 19.

> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model

K-means requires you to specify the exact number of clusters. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the Calinski-Harabasz index to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 4 and 8. In this case we could call the kmeansruns method, setting krange to range between 4 and 8, and the criterion parameter = "ch" for the Calinski-Harabasz index.

> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "ch")

4  clusters  2151.267
5  clusters  2528.895
6  clusters  2571.504
7  clusters  2375.042
8  clusters  2305.053

The optimal number of clusters is the solution with the highest Calinski-Harabasz index value In this case the Calinski-Harabasz index selects 6clusters

We can also use the silhouette average width as our decision criterion:
> no_k <- kmeansruns(x, krange = 4:8, critout = TRUE, runs = 10, criterion = "asw")
4 clusters 0.4423919
5 clusters 0.4716047
6 clusters 0.4420139
7 clusters 0.4483106
8 clusters 0.3405826

The widest width occurs at 5 clusters. This is a smaller number than obtained by the Calinski-Harabasz index; however, using both methods we have narrowed down the range of possible clusters to lie between 5 and 6. Visualization often helps in making the final selection. To do this let's build a sum of squared error (SSE) scree plot, which I explain line by line below:
> wgs <- (nrow(x) - 1) * sum(apply(x, 2, var))

> for (i in 2:8) {
    fit_temp <- kmeans(as.matrix(x), centers = i)
    wgs[i] <- sum(fit_temp$withinss)
  }

> plot(1:8, wgs, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")

Our first step is to assign the first observation in wgs to a large number. This is followed by a loop which calls kmeans in the stats package for k = 2 to 8 clusters. We then use the plot method to create a scree plot, shown in Figure 49.1. Typically in a scree plot we look for a sharp elbow to indicate the appropriate number of clusters. In this case the elbow is gentle, but 4 clusters appears to be a good number to try.

Figure 49.1: k-means clustering scree plot using Vehicle

Let's fit the model with 4 clusters using the kmeans function in the stats package, plotting the results in Figure 49.2:

> fit <- kmeans(x, 4, algorithm = "Hartigan-Wong")
> plot(x, col = fit$cluster)

Figure 49.2: k-means pairwise plot of clusters using Vehicle

We can also narrow down our visualization to a single pair. To illustrate this, let's look at the relationship between Comp (column 1 in x) and ScatRa (column 7 in x). The plot is shown in Figure 49.3:
> plot(x[c(1, 7)], col = fit$cluster)
> points(fit$centers[, c(1, 7)], col = 1:4, pch = 8, cex = 2)
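Since the Vehicle data also contains the true vehicle class (which we removed before clustering), a simple hedged check is to cross-tabulate the cluster assignments against it:

> # assumed illustration: compare clusters with the known classes
> table(fit$cluster, Vehicle$Class)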

Figure 49.3: k-means pairwise cluster plot of Comp and ScatRa

Technique 50

Clara Algorithm

For our analysis we will use the clara function in the cluster package

clara(x, k)

Key parameters include x, the data matrix of attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(cluster)
> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data. The fpc package will be used to help select the optimum number of clusters.

Step 2: Prepare Data & Tweak Parameters
The thyroid data is stored in x. We also drop the class labels stored in the variable Diagnosis. Finally we use the daisy method to create a dissimilarity matrix:
> set.seed(1432)
> x <- thyroid
> x$Diagnosis <- NULL
> dissim <- daisy(x)

Step 3: Estimate and Assess the Model
To use the Clara algorithm you have to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number.

Let us suppose we expect the number of clusters to be between 1 and 6. In this case we could call the pamk method, setting krange to lie between 1 and 6 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(dissim, krange = 1:6, criterion = "asw", critout = TRUE, usepam = FALSE)

1 clusters 0
2 clusters 0.5172889
3 clusters 0.4031142
4 clusters 0.4325106
5 clusters 0.4845567
6 clusters 0.4251244

The optimal number of clusters is the solution with the largest average silhouette width. So in this example 2 clusters has the largest width.

However, we fit the model with 4 clusters and use the plot method to visualize the result:
> fit <- clara(x, k = 4)

> par(mfrow = c(1, 2))
> plot(fit)

Figure 50.1 shows the resultant plots.
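It can also be useful to inspect the fitted object directly. The sketch below (assuming the fit object above and the original thyroid class labels) prints the medoid observations and cross-tabulates the cluster assignments against the true diagnosis:

> fit$medoids
> table(fit$clustering, thyroid$Diagnosis)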

Figure 50.1: Plot using the clara algorithm with k = 4 and data set thyroid

Technique 51

PAM Algorithm

For our analysis we will use the pam function in the cluster package

pam(x, k)

Key parameters include x, the dissimilarity matrix, and k, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(cluster)
> require(fpc)
> data(wine, package = "ordinal")

We use the wine data frame contained in the ordinal package for our analysis; see page 95 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
Since the wine data set contains ordinal variables, we use the daisy method with metric set to "gower" to create a dissimilarity matrix. The gower metric can handle nominal, ordinal and binary data108. The wine sample is stored in data. We also drop the wine rating column and then pass data to the daisy method, storing the result in x:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

Step 3: Estimate and Assess the Model
The PAM algorithm requires you to specify the exact number of clusters in your sample. However, in practice you will rarely know this number precisely. One solution is to specify a range of possible clusters and use a metric such as the average silhouette width to choose the appropriate number. Let us suppose we expect the number of clusters to be between 1 and 5. In this case we could call the pamk method, setting krange to lie between 1 and 5 and the criterion parameter to "asw" for the average silhouette width. Here is how to do that:

> pk1 <- pamk(x, krange = 1:5, criterion = "asw", critout = TRUE, usepam = TRUE)

1 clusters 0
2 clusters 0.2061618
3 clusters 0.2546821
4 clusters 0.4053708
5 clusters 0.3318729

The optimal number of clusters is the solution with the largest average silhouette width. So in this example the largest width occurs at 4 clusters.

Now we fit the model with 4 clusters and use the clusplot method to visualize the result. Figure 51.1 shows the resultant plot:
> fit <- pam(x, 4)
> clusplot(fit)
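Before turning to the plot, a brief hedged check (assuming the fit object above and the original wine ratings we dropped) is to inspect the average silhouette width and how the four clusters line up with the ratings:

> fit$silinfo$avg.width
> table(fit$clustering, wine$rating)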

Figure 51.1: Partitioning around medoids using pam with k = 4 for wine

Technique 52

Kernel Weighted K-Means

The kernel weighted version of the k-means algorithm projects the sample data into a non-linear feature space by use of a kernel, see Part II. It has the benefit that it can identify clusters which are not linearly separable in the input space. For our analysis we will use the kkmeans function in the kernlab package.

kkmeans(x, centers, ...)

Key parameters include x, the matrix of data to be clustered, and centers, the number of clusters.

Step 1: Load Required Packages
First we load the required packages:
> require(kernlab)
> data(Vehicle, package = "mlbench")

We use the Vehicle data frame contained in the mlbench package for our analysis; see page 23 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We drop the vehicle class column and store the result in x:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
We estimate the model using four clusters (centers = 4), storing the result in fit. The plot method is then used to visualize the result, see Figure 52.1:

> fit <- kkmeans(as.matrix(x), centers = 4)
> plot(x, col = fit)
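The kernlab accessor functions give a quick numerical summary of the fitted object; a minimal sketch assuming the fit object above:

> size(fit)       # number of points in each cluster
> withinss(fit)   # within-cluster sum of squares
> centers(fit)    # cluster centers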

Let's focus in on the pairwise clusters associated with Comp (column 1 in x) and ScatRa (column 7 in x). We can visualize this relationship using the plot method. Figure 52.2 shows the resultant plots:
> plot(x[c(1, 7)], col = fit)
> points(centers(fit)[, c(1, 7)], col = 1:4, pch = 8, cex = 2)

Figure 52.1: Kernel Weighted K-Means with centers = 4 using Vehicle

Figure 52.2: Kernel Weighted K-Means pairwise clusters of Comp and ScatRa using Vehicle


Hierarchy Based Methods


Technique 53

Hierarchical Agglomerative Cluster Analysis

Hierarchical Cluster Analysis is available with the basic installation of R. It is obtained using the stats package with the hclust function.

hclust(d, method, ...)

Key parameters include d, the dissimilarity matrix, and method, the agglomeration method to be used.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> library(colorspace)
> library(dendextend)
> require(qgraph)
> require(cluster)
> data(thyroid, package = "mclust")

We will use the thyroid data frame contained in the mclust package in our analysis. For color output we use the colorspace package. The dendextend and qgraph packages will help us better visualize our results. We will also use the bannerplot method from the cluster package.


NOTE

The thyroid data frame109 was constructed from five laboratory tests administered to a sample of 215 patients. The tests were used to predict whether a patient's thyroid function could be classified as euthyroidism (normal), hypothyroidism (under active thyroid) or hyperthyroidism. The thyroid data frame contains the following variables:

• Class variable:

  – Diagnosis: Diagnosis of thyroid operation: Hypo, Normal and Hyper.
    * Distribution of Diagnosis (number of instances per class):
      Class 1 (normal): 150; Class 2 (hyper): 35; Class 3 (hypo): 30

• Continuous attributes:

  – RT3U: T3-resin uptake test (percentage)
  – T4: Total serum thyroxin as measured by the isotopic displacement method
  – T3: Total serum triiodothyronine as measured by radioimmuno assay
  – TSH: Basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay
  – DTSH: Maximal absolute difference of TSH value after injection of 200 micro grams of thyrotropin-releasing hormone as compared to the basal value

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]

> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
Our first step is to create a dissimilarity matrix. We use the manhattan method, storing the results in dthyroid:

> dthyroid <- dist(thyroid2, method = "manhattan")

Now we are ready to fit our basic model. I have had some success with Ward's method, so let's try it first:
> fit <- hclust(dthyroid, method = "ward.D2")

NOTE

I have had good practical success using Ward's method. However, it is interesting to notice that in the academic literature there are two different algorithms associated with the method. The method we selected (ward.D2) squares the dissimilarities before clustering. An alternative ward method, ward.D, does not110.

The print method provides a useful overview of the fitted model:

> print(fit)

Call:
hclust(d = dthyroid, method = "ward.D2")

Cluster method   : ward.D2
Distance         : manhattan
Number of objects: 215
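Since the true diagnosis labels are available, a simple hedged check (assuming the fit object above) is to cut the tree into three groups with cutree and cross-tabulate them against Diagnosis:

> groups <- cutree(fit, k = 3)
> table(groups, thyroid$Diagnosis)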

Now we create a bannerplot of the fitted model:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 53.1.

Figure 53.1: Hierarchical Agglomerative Cluster Analysis bannerplot using thyroid and Ward's method

OK, we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

We set k = 3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):

> dend <- color_branches(dend, k = 3)

Next we match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is required in our next step: reduce the size of the labels to 75% of their original size. Then we create the actual plot shown in Figure 53.2:
> dend <- set(dend, "labels_cex", 0.75)
> par(mar = c(3, 3, 3, 7))
> plot(dend,
       main = "Clustered Thyroid data where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = 0.07))
> legend("topleft", legend = diagnosis, fill = rainbow_hcl(3))

Overall the data seems to have been well separated. However, there is some mixing between normal and hypo.

Figure 53.2: Hierarchical Agglomerative Cluster Analysis dendrogram using thyroid

Since our initial analysis used Ward's method, we may as well take a look at the other methods. We choose eight of the most popular methods and investigate their correlation. We begin by capturing the methods in hclust_methods and creating a list in hclust_list:
> hclust_methods <- c("ward.D", "ward.D2", "single", "complete",
    "average", "mcquitty", "median", "centroid")

> hclust_list <- dendlist()

Here is the main loop. Notice we fit the models using temp_fit with the dissimilarity matrix dthyroid:
> for (i in seq_along(hclust_methods)) {
    print(hclust_methods[i])
    temp_fit <- hclust(dthyroid, method = hclust_methods[i])
    hclust_list <- dendlist(hclust_list, as.dendrogram(temp_fit))
  }

Let's add names to the list:
> names(hclust_list) <- hclust_methods

Next we take a look at the output. Notice the wide variation in height produced by the different methods:
> hclust_list
$ward.D
'dendrogram' with 2 branches and 215 members total, at height 1233.947

$ward.D2
'dendrogram' with 2 branches and 215 members total, at height 337.8454

$single
'dendrogram' with 2 branches and 215 members total, at height 33.5

$complete
'dendrogram' with 2 branches and 215 members total, at height 160.1

$average
'dendrogram' with 2 branches and 215 members total, at height 84.66573

$mcquitty
'dendrogram' with 2 branches and 215 members total, at height 103.8406

$median
'dendrogram' with 2 branches and 215 members total, at height 41.14692

$centroid
'dendrogram' with 2 branches and 215 members total, at height 63.73702

Now we investigate the correlation between methods. Note that method = "common" measures the commonality between members of nodes:
> cor <- cor.dendlist(hclust_list, method = "common")
> par(mfrow = c(2, 2))
> qgraph(cor, minimum = 0.70, title = "Correlation = 0.70")
> qgraph(cor, minimum = 0.75, title = "Correlation = 0.75")
> qgraph(cor, minimum = 0.8, title = "Correlation = 0.80")
> qgraph(cor, minimum = 0.85, title = "Correlation = 0.85")

Figure 53.3 shows the resultant plot for varying levels of correlation. It gives us another perspective on the clustering algorithms. We can see that most methods have around 75% commonality within nodes with one another.

Figure 53.3: Hierarchical Agglomerative Cluster Analysis correlation between methods using thyroid

Since ward.D and ward.D2 have high commonality, let's look in detail using a tanglegram visualization. The "which" parameter allows us to pick the elements in the list to compare. Figure 53.4 shows the result:
> hclust_list %>% dendlist(which = c(1, 2)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

We can use a similar approach for average and mcquitty, see Figure 53.5:
> hclust_list %>% dendlist(which = c(5, 6)) %>%
    ladderize %>% set("branches_k_color", k = 3) %>%
    tanglegram(faster = TRUE)

Figure 53.4: Tanglegram visualization between ward.D and ward.D2 using thyroid

Figure 53.5: Tanglegram visualization between average and mcquitty using thyroid

Finally we plot the dendrograms for all of the methods. This is shown in Figure 53.6:
> par(mfrow = c(4, 2))
> for (i in seq_along(hclust_methods)) {
    fit <- hclust(dthyroid, method = hclust_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
      "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3, 3, 3, 7))
    plot(dend,
      main = hclust_methods[i],
      horiz = TRUE, nodePar = list(cex = 0.07))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }

Figure 53.6: Hierarchical Agglomerative Cluster Analysis for all methods using thyroid


Technique 54

Agglomerative Nesting

Agglomerative nesting is available in the cluster package using the function agnes.

agnes(x, metric, method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; metric, the metric used for calculating dissimilarities; whilst method defines the clustering method to be used.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(corrplot)
> data(thyroid, package = "mclust")

The package cluster contains the agnes function. For color output we use the colorspace package and dendextend to create fancy dendrograms. The corrplot package is used to visualize correlations. Finally, we use the thyroid data frame contained in the mclust package for our analysis.

We also make use of two user defined functions. The first we call panel.hist, which we will use to plot histograms. It is defined as follows:
> panel.hist <- function(x, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, density = 50, border = "blue")
  }

The second function will be used for calculating the Pearson correlation coefficient. Here is the R code:
> panel.cor <- function(x, y, digits = 3, prefix = "", cex.cor, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- (cor(x, y, method = "pearson"))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor)
  }

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:

> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]
> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Let's spend a few moments investigating the data set. We will use visualization to assist us. We begin by looking at the pairwise relationships between the attributes, as shown in Figure 54.1. This plot was built using the pairs method as follows:
> par(oma = c(4, 1, 1, 1))

> pairs(thyroid2, col = diagnosis_col, upper.panel = panel.cor,
        cex.labels = 2, pch = 19, cex = 1.2,
        panel = panel.smooth, diag.panel = panel.hist)

> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1, horiz = TRUE, inset = c(1, 0), bty = "n",
         xpd = TRUE, legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))

> par(xpd = NA)

The top panels (above the diagonal) give the correlation coefficient between the attributes. We see that RT3U is negatively correlated with T4 and T3. It is moderately correlated with TSH and DTSH. We also see that T3 and T4 are negatively correlated with TSH and DTSH, whilst TSH and DTSH are positively correlated.

The diagonal in Figure 54.1 shows the distribution of each attribute, and the bottom panels are colored by diagnosis using diagnosis_col. Notice that Hypo, Normal and Hyper appear to be distinctly different from each other (as measured by the majority of attributes). However, all three diagnosis types cannot be easily separated if measured by RT3U and DTSH alone.

Figure 54.1: Pairwise relationships, distributions and correlations for attributes in thyroid

The same conclusion, that the diagnosis types are distinct, can be made by looking at the parallel coordinates plot of the data shown in Figure 54.2. It can be calculated as follows:
> par(oma = c(4, 1, 1, 1))

> MASS::parcoord(thyroid2, col = diagnosis_col, var.label = TRUE, lwd = 2)

> par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE)

> plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n")

> legend("bottom", cex = 1.25, horiz = TRUE, inset = c(1, 0), bty = "n",
         xpd = TRUE, legend = as.character(levels(diagnosis_labels)),
         fill = unique(diagnosis_col))

> par(xpd = NA)

Figure 54.2: Parallel coordinates plot using thyroid

Step 3: Estimate and Assess the Model
The agnes function offers a number of methods for agglomerative nesting. These include "average", "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method) and "weighted". The default is "average"; however, I have had some success with Ward's method in the past, so I usually give it a try first. I also set the parameter stand = TRUE to standardize the attributes:

> fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean", method = "ward")

The print method provides an overview of fit (we only show the first few lines of output below):
> print(fit)
Call: agnes(x = thyroid2, metric = "euclidean", stand = TRUE, method = "ward")

Agglomerative coefficient: 0.9819448

The agglomerative coefficient measures the amount of clustering structure. Higher values indicate more structure. It can also be viewed as the average width of the banner plot shown in Figure 54.3. A banner plot is calculated using the bannerplot method:
> bannerplot(fit)
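As an additional hedged check (assuming the fit object above), the agglomerative coefficient can be read from the fitted object directly, and the agnes tree can be converted with as.hclust, cut into three groups and compared with the true diagnosis:

> fit$ac   # agglomerative coefficient, also shown by print(fit)
> table(cutree(as.hclust(fit), k = 3), thyroid$Diagnosis)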

Figure 54.3: Agglomerative nesting bannerplot using thyroid

Now we should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram in dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k = 3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):

> dend <- color_branches(dend, k = 3)

Then we match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 75% of their original size. This assists in making the eventual plot look less cluttered:
> dend <- set(dend, "labels_cex", 0.75)

And now to the visualization using the plot method:
> par(mar = c(3, 3, 3, 7))
> plot(dend,
       main = "Clustered Thyroid data where the labels give the true diagnosis",
       horiz = TRUE, nodePar = list(cex = 0.07))
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))

The result is shown in Figure 54.4. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.

Figure 54.4: Agglomerative Nesting dendrogram using thyroid. The label color corresponds to the true diagnosis

Since the agnes function provides multiple methods, in the spirit of predictive analytics empiricist analysis let's look at the correlation between the five most popular methods. We assign these methods to agnes_methods:
> agnes_methods <- c("ward", "average", "single", "complete", "weighted")

> agnes_list <- dendlist()

Our main loop for the calculations is as follows:

> for (i in seq_along(agnes_methods)) {
    temp_fit <- agnes(thyroid2, stand = TRUE, metric = "euclidean",
                      method = agnes_methods[i])
    agnes_list <- dendlist(agnes_list, as.dendrogram(temp_fit))
  }

We need to add the names of the methods as a final touch:
> names(agnes_list) <- agnes_methods

Now let's take a look at the output:
> agnes_list
$ward
'dendrogram' with 2 branches and 215 members total, at height 47.01778

$average
'dendrogram' with 2 branches and 215 members total, at height 11.97675

$single
'dendrogram' with 2 branches and 215 members total, at height 5.450811

$complete
'dendrogram' with 2 branches and 215 members total, at height 24.03689

$weighted
'dendrogram' with 2 branches and 215 members total, at height 16.12223

Notice that all methods produce two branches; however, there is considerable variation in the height of the dendrogram, ranging from 5.4 for "single" to 47 for "ward".

We use corrplot with method = "common" to visualize the correlation between the methods:
> corrplot(cor.dendlist(agnes_list, method = "common"),
           "pie", type = "lower")

The resultant plot, shown in Figure 54.5, gives us another perspective on our clustering algorithms. We can see that most of the methods have around 75% of nodes in common with one another.

Figure 54.5: Agglomerative Nesting correlation plot using thyroid


Finally, let's plot all five dendrograms. Here is how:
> par(mfrow = c(3, 2))
> for (i in seq_along(agnes_methods)) {
    fit <- agnes(thyroid2, stand = TRUE,
                 metric = "euclidean", method = agnes_methods[i])
    dend <- as.dendrogram(fit)
    dend <- rotate(dend, 1:215)
    dend <- color_branches(dend, k = 3)
    labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
      as.numeric(diagnosis_labels)[order.dendrogram(dend)]
    )]
    labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
      "(", labels(dend), ")", sep = "")
    dend <- set(dend, "labels_cex", 0.75)
    par(mar = c(3, 3, 3, 7))
    plot(dend, main = agnes_methods[i],
         horiz = TRUE, nodePar = list(cex = 0.07))
    legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3))
  }

Wow! That is a nice bit of typing, but the result is well worth it. Take a look at Figure 54.6. (It still appears as if Ward's method works the best.)

Figure 54.6: Agglomerative nesting dendrogram for various methods using thyroid


Technique 55

Divisive Hierarchical Clustering

Divisive hierarchical clustering is available in the cluster package using the function diana.

diana(x, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters, and metric, the metric used for calculating dissimilarities.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(cluster)
> library(colorspace)
> library(dendextend)
> require(circlize)

We use the thyroid data frame contained in the mclust package for our analysis, see page 385 for additional details on this data. For color output we use the colorspace package and dendextend to create fancy dendrograms. The circlize package is used to visualize a circular dendrogram.

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in thyroid2 and remove the class variable (Diagnosis). The variable diagnosis_col will be used to color the dendrograms using rainbow_hcl:
> thyroid2 <- thyroid[-1]
> diagnosis_labels <- thyroid[, 1]

> diagnosis_col <- rev(rainbow_hcl(3))[as.numeric(diagnosis_labels)]

Step 3: Estimate and Assess the Model
The model can be fitted as follows:

> fit <- diana(thyroid2, stand = TRUE, metric = "euclidean")

The parameter stand = TRUE is used to standardize the attributes. The parameter metric = "euclidean" is used for calculating dissimilarities; a popular alternative is "manhattan".
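Analogous to the agglomerative coefficient reported by agnes, the divisive coefficient of the fitted diana object can be retrieved directly (a brief sketch assuming the fit object above):

> fit$dc   # divisive coefficient; values near 1 indicate strong clustering structure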

Now we can use the bannerplot method:
> bannerplot(fit, main = "Euclidean")

The result is shown in Figure 55.1.

Figure 55.1: Divisive hierarchical clustering bannerplot using thyroid

We should take a look at the fitted model. It will be in the form of a dendrogram. First we set diagnosis to capture the classes. Then we save fit as a dendrogram into dend, followed by ordering of the observations somewhat using the rotate method:
> diagnosis <- rev(levels(diagnosis_labels))
> dend <- as.dendrogram(fit)
> dend <- rotate(dend, 1:215)

Our next step is to color the branches based on the three clusters. We set k = 3 in color_branches for each of the three diagnosis types (hyper, normal, hypo):

> dend <- color_branches(dend, k = 3)

We match the labels to the actual classes:
> labels_colors(dend) <- rainbow_hcl(3)[sort_levels_values(
    as.numeric(diagnosis_labels)[order.dendrogram(dend)])]

We still need to add the diagnosis classes to the labels. This is achieved as follows:
> labels(dend) <- paste(as.character(diagnosis_labels)[order.dendrogram(dend)],
    "(", labels(dend), ")", sep = "")

A little administration is our next step: reduce the size of the labels to 70% of their original size. This assists in making the eventual plot look less cluttered. Then circlize the dendrogram using the method circlize_dendrogram:
> dend <- set(dend, "labels_cex", 0.7)
> circlize_dendrogram(dend)
> legend("bottomleft", legend = diagnosis, fill = rainbow_hcl(3), bty = "n")

The result is shown in Figure 55.2. Overall, Hyper seems well separated, with some mixing between Normal and Hypo.

Figure 55.2: Divisive hierarchical clustering dendrogram using thyroid. The label color corresponds to the true diagnosis


Technique 56

Exemplar Based Agglomerative Clustering

Exemplar Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster.

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
First we load the required packages. I'll explain each below:
> require(apcluster)
> data(bodyfat, package = "TH.data")

We use the bodyfat data frame contained in the TH.data package for our analysis; see page 62 for additional details on this data.

Step 2: Prepare Data & Tweak Parameters
We store the bodyfat data set in x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
The model can be fitted as follows:

> fit <- aggExCluster(negDistMat(r = 10), x)

Next we visualize the dendrogram using the plot method:
> plot(fit, showSamples = TRUE)

The result is shown in Figure 56.1.

Figure 56.1: Agglomerative Clustering dendrogram using bodyfat

Let's look at the relationship between DEXfat and waistcirc for different cluster sizes. To do this we use a while loop, see Figure 56.2. There are two main branches, with possibly two sub-branches each, suggesting 4 or 5 clusters overall.

> i = 2
> par(mfrow = c(2, 2))

> while (i <= 5) {
    plot(fit, x[c(2, 3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
         main = c("Number Clusters", i))
    i = i + 1
  }

Figure 56.2: Exemplar Based Agglomerative Clustering - DEXfat and waistcirc for various cluster sizes


Fuzzy Methods


Technique 57

Rousseeuw-Kaufman's Fuzzy Clustering Method

The method outlined by Rousseeuw and Kaufman111 for fuzzy clustering is available in the cluster package using the function fanny.

fanny(x, k, memb.exp, metric, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; memb.exp, the membership exponent used in the fit criterion; and metric, the measure to be used for calculating dissimilarities between observations.

Step 1: Load Required Packages
We use the wine data frame contained in the ordinal package (see page 95). We also load the smacof package, which offers a number of approaches for multidimensional scaling:
> require(cluster)
> require(smacof)
> data(wine, package = "ordinal")

Step 2: Prepare Data & Tweak Parameters
We store the wine data set in data and then remove the wine ratings by specifying data <- data[-2]. One of the advantages of Rousseeuw-Kaufman's fuzzy clustering method over other fuzzy clustering techniques is that it can handle dissimilarity data. Since wine contains ordinal variables, we use the daisy method to create a dissimilarity matrix with metric set to the general dissimilarity coefficient of Gower112:
> set.seed(1432)
> data <- wine
> data <- data[-2]
> x <- daisy(data, metric = "gower")

Step 3: Estimate and Assess the Model
We need to specify the number of clusters. Let's begin with 3 clusters (k = 3), setting metric = "gower". We set the parameter memb.exp = 1.1; the higher the value of memb.exp, the more fuzzy the clustering in general:

> fit <- fanny(x, k = 3, memb.exp = 1.1, metric = "gower")

The cluster package provides silhouette plots via the plot method. These can help determine the viability of the clusters. Figure 57.1 presents the silhouette plot for fit. We see one large cluster containing 36 observations with a width of 0.25, and two smaller clusters, each with 18 observations and an average width of 0.46:
> plot(fit)
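The fuzzy memberships themselves are stored in the fitted object; a brief sketch (assuming the fit object above) shows the degree to which the first few wines belong to each of the three clusters:

> head(round(fit$membership, 2))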

Figure 57.1: Rousseeuw-Kaufman's fuzzy clustering method silhouette plot using wine

It can be instructive to investigate the relationship between the fuzziness parameter memb.exp and the average silhouette width. To do this we use a small while loop and call fanny with values of memb.exp ranging from 1.1 to 1.8:
> fuzzy = 1.1
> while (fuzzy < 1.9) {
    temp_fit <- fanny(x, k = 3, memb.exp = fuzzy, metric = "euclidean")
    cat("Fuzz =", fuzzy, "Average Silhouette Width =",
        round(temp_fit$silinfo[[3]], 1), "\n")
    fuzzy = fuzzy + 0.1
  }

The resultant output is shown below. Notice how the average silhouette width declines as the fuzziness increases:
Fuzz = 1.1 Average Silhouette Width = 0.4
Fuzz = 1.2 Average Silhouette Width = 0.3
Fuzz = 1.3 Average Silhouette Width = 0.3
Fuzz = 1.4 Average Silhouette Width = 0.3
Fuzz = 1.5 Average Silhouette Width = 0.3
Fuzz = 1.6 Average Silhouette Width = 0.3
Fuzz = 1.7 Average Silhouette Width = 0.3
Fuzz = 1.8 Average Silhouette Width = 0.2

We still need to determine the optimal number of clusters. Let's try multidimensional scaling to investigate further. We use the smacofSym method from the package smacof and store the result in scaleT:
> scaleT <- smacofSym(x)$conf

> plot(scaleT, type = "n", main = "Fuzzy = 1.1 with k = 3")

> text(scaleT, label = rownames(scaleT), col = rgb(fit$membership))

Figure 57.2: Use the smacofSym method to help identify clusters using wine

The result is visualized using the plot method and shown in Figure 57.2. The image appears to identify four clusters. Even though five official ratings were used to assess the quality of the wine, we will refit using k = 4:
> fit <- fanny(x, k = 4, memb.exp = 1.1, metric = "gower")

Finally we create a silhouette plot:
> plot(fit)

Figure 57.3 shows the resultant plot. A rule of thumb113 is to choose as the number of clusters the silhouette plot with the largest average. Here we see the average silhouette width is now 0.46, compared to 0.36 for k = 3.

Figure 57.3: Rousseeuw-Kaufman's method silhouette plot with four clusters using wine


Technique 58

Fuzzy K-Means

Fuzzy K-means is available in the fclust package using the function FKM

FKM(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; m, the fuzziness parameter; and RS, the number of random starts.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM(x, k = 3, m = 1.5, RS = 1, stand = 1)

We set the fuzzy parameter m to 1.5 and, for illustrative purposes only, set RS = 1. In practice you will want to set a higher number. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:
> fit4 <- FKM(x, k = 4, m = 1.5, RS = 1, stand = 1)
> fit5 <- FKM(x, k = 5, m = 1.5, RS = 1, stand = 1)
> fit6 <- FKM(x, k = 6, m = 1.5, RS = 1, stand = 1)
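The fitted fclust objects store the fuzzy membership degrees and the closest hard assignment; a short hedged sketch assuming fit3 from above:

> head(round(fit3$U, 2))   # membership degree of each observation in each cluster
> head(fit3$clus)          # closest cluster and its membership degree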

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the index for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.76 0.43 0.64 0.44 0.52 0.65

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.66 0.64 0.54 0.37 0.47 0.99

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.59 0.78 0.49 0.32 0.41 1.02

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.52 0.93 0.42 0.26 0.35 1.97

Overall, K = 3 is the optimum choice for the majority of methods.

It is fun to visualize clusters. Let's focus on fit3 and on the two important variables Comp and ScatRa. It can be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1, 7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axis:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1, 7))
> plot(fit3, v1v2 = c(1, 7), pca = TRUE)

Figure 58.1 shows both plots. There appears to be reasonable separation between the clusters.

Figure 58.1: Fuzzy K-Means clusters with k = 3 using Vehicle


Technique 59

Fuzzy K-Medoids

Fuzzy K-medoids is available in the fclust package using the function FKM.med.

FKM.med(x, k = 3, m = 1.5, RS = 1, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; k, the number of clusters; m, the fuzziness parameter; and RS, the number of random starts. Note that the difference between fuzzy K-means and fuzzy K-medoids is that the cluster prototypes (centroids) are artificial, randomly computed in fuzzy K-means. In fuzzy K-medoids the cluster prototypes (medoids) are a subset of the actual observed objects.

Step 1: Load Required Packages
We use the Vehicle data frame contained in the mlbench package (see page 23):
> require(fclust)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the Vehicle data set in x and remove the vehicle types stored in column 19:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters will be between 3 and 6. We can estimate a model with 3 clusters using the following:

> fit3 <- FKM.med(x, k = 3, m = 1.5, RS = 10, stand = 1)

We set the fuzzy parameter m to 1.5 and RS = 10. The parameter stand = 1 instructs the algorithm to standardize the observations in x before any analysis.

The other three models can be estimated in a similar manner:
> fit4 <- FKM.med(x, k = 4, m = 1.5, RS = 10, stand = 1)
> fit5 <- FKM.med(x, k = 5, m = 1.5, RS = 10, stand = 1)
> fit6 <- FKM.med(x, k = 6, m = 1.5, RS = 10, stand = 1)

The question we now face is how to determine which of our models is optimal. Fortunately, we can use the Fclust.index function to help us out. This function returns the value of six clustering indices often used for choosing the optimal number of clusters. The indices include PC (partition coefficient), PE (partition entropy), MPC (modified partition coefficient), SIL (silhouette), SILF (fuzzy silhouette) and XB (Xie and Beni index). The key thing to remember is that the optimal number of clusters occurs at the maximum value of each of these indices, except for PE, where the optimal number of clusters occurs at the minimum.

Here is how to calculate the index for each model:
> f3 <- Fclust.index(fit3)
> f4 <- Fclust.index(fit4)
> f5 <- Fclust.index(fit5)
> f6 <- Fclust.index(fit6)

Now let's take a look at the output of each Fclust.index:
> round(f3, 2)
  PC   PE  MPC  SIL SILF   XB
0.37 1.04 0.06 0.13 0.15 6.96

> round(f4, 2)
  PC   PE  MPC  SIL SILF   XB
0.29 1.30 0.06 0.09 0.11 6.05

> round(f5, 2)
  PC   PE  MPC  SIL SILF   XB
0.24 1.51 0.05 0.07 0.11 8.63

> round(f6, 2)
  PC   PE  MPC  SIL SILF   XB
0.21 1.68 0.05 0.07 0.08 7.90

Overall, K = 3 is the optimum choice for the majority of methods.

Now we use fit3 and focus on the two important variables Comp and ScatRa. It can also be instructive to plot clusters using the first two principal components. We use the plot method with the parameter v1v2 = c(1, 7) to select Comp (first variable in x) and ScatRa (seventh variable in x), and pca = TRUE in the second plot to use the principal components as the x and y axis:
> par(mfrow = c(1, 2))
> plot(fit3, v1v2 = c(1, 7))
> plot(fit3, v1v2 = c(1, 7), pca = TRUE)

Figure 59.1 shows both plots.

Figure 59.1: Fuzzy K-Medoids clusters with k = 3 using Vehicle


Other Methods


Technique 60

Density-Based Cluster Analysis

In density based cluster analysis two parameters are important: eps defines the radius of the neighborhood of each point, and MinPts is the minimum number of points within a radius (or "reach") associated with the clusters of interest. For our analysis we will use the dbscan function in the fpc package.

dbscan(data, eps, MinPts)

Key parameters include data, the data matrix, data frame or dissimilarity matrix.

Step 1: Load Required Packages
First we load the required packages:
> require(fpc)
> data(thyroid, package = "mclust")

We use the thyroid data frame contained in the mclust package for our analysis; see page 385 for additional details on this data.

Step 2: Estimate and Assess the Model
We estimate the model using the dbscan method. Note in practice MinPts is chosen by trial and error. We set MinPts = 4 and eps = 8.5. We also set showplot = 1, which will allow you to see visually how dbscan works:

> fit <- dbscan(thyroid[-1], eps = 8.5, MinPts = 4, showplot = 1)

Next we take a peek at the fitted model:
> fit
dbscan Pts=215 MinPts=4 eps=8.5
        0   1 2 3
border 18   3 3 2
seed    0 185 1 3
total  18 188 4 5

Unlike other approaches to cluster analysis, the density based cluster algorithm can identify outliers (data points that do not belong to any clusters). Points in cluster 0 are unassigned outliers - 18 in total.
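To see exactly which patients were flagged as outliers, a minimal sketch (assuming the fit object above) lists the observations assigned to cluster 0:

> which(fit$cluster == 0)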

The pairwise plot of clusters shown in Figure 60.1 is calculated as follows:
> plot(fit, thyroid)

Figure 60.1: Density-Based Cluster Analysis pairwise cluster plot using thyroid

Let's take a closer look at the pairwise cluster plot between the attributes T4 (column 3 in thyroid) and RT3U (column 2 in thyroid). To do this we use the plot method, see Figure 60.2:
> plot(fit, thyroid[c(3, 2)])

Figure 60.2: Density-Based Cluster Analysis pairwise plot of T4 and RT3U


Technique 61

K-Modes Clustering

K-modes clustering is useful for finding clusters in categorical data. It is available in the klaR package using the function kmodes.

kmodes(x, k)

Key parameters include x, the data set of categorical attributes for which you wish to find clusters, and k, the number of clusters.

Step 1: Load Required Packages
We use the housing data frame contained in the MASS package. We also load the plyr package, which we will use to map attributes in housing to numerical categories:
> require(klaR)
> data(housing, package = "MASS")
> library(plyr)


NOTE

The housing data frame contains data from an investigation of satisfaction with housing conditions in Copenhagen carried out by the Danish Building Research Institute and the Danish Institute of Mental Health Research114. It contains 72 rows with measurements on the following 5 variables:

• Sat: Satisfaction of householders with their present housing circumstances (High, Medium or Low, ordered factor)

• Infl: Perceived degree of influence householders have on the management of the property (High, Medium, Low)

• Type: Type of rental accommodation (Tower, Atrium, Apartment, Terrace)

• Cont: Contact residents are afforded with other residents (Low, High)

• Freq: The numbers of residents in each class

Step 2: Prepare Data & Tweak Parameters
We store the housing data set in x. We then convert the attributes into numerical values using the mapvalues and as.numeric methods:
> set.seed(98765)
> x <- housing[-5]

> x$Sat <- mapvalues(x$Sat,
    from = c("Low", "Medium", "High"),
    to = c(1, 2, 3))
> x$Sat <- as.numeric(x$Sat)

> x$Infl <- mapvalues(x$Infl,
    from = c("Low", "Medium", "High"),
    to = c(1, 2, 3))
> x$Infl <- as.numeric(x$Infl)

> x$Type <- mapvalues(x$Type,
    from = c("Tower", "Apartment", "Atrium", "Terrace"),
    to = c(1, 2, 3, 4))
> x$Type <- as.numeric(x$Type)

> x$Cont <- mapvalues(x$Cont,
    from = c("Low", "High"),
    to = c(1, 3))
> x$Cont <- as.numeric(x$Cont)

Step 3: Estimate and Assess the Model
Suppose we believe the actual number of clusters is 3. We estimate the model and plot the result (see Figure 61.1) as follows:

> fit <- kmodes(x, 3)
> plot(x, col = fit$cluster)
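The fitted kmodes object stores the cluster modes and sizes; a quick hedged sketch assuming the fit object above:

> fit$size    # number of observations in each cluster
> fit$modes   # the modal value of each attribute per cluster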

Figure 61.1: K-Modes Clustering with k = 3 using housing

It is also possible to narrow down our focus. For example, let's investigate the pairwise relationship between Type and Sat. To do this we use the plot method with the jitter method, which adds a small amount of noise to the attributes. The result is shown in Figure 61.2:
> plot(jitter(as.matrix(x[c(1, 3)])), col = fit$cluster)

> points(fit$modes, col = 1:5, pch = 8, cex = 4)

Figure 61.2: K-Modes Clustering pairwise relationship between Type and Sat


Technique 62

Model-Based Clustering

Model based clustering identifies clusters using the Bayesian information criterion (BIC) for Expectation-Maximization initialized by hierarchical clustering for Gaussian mixture models. It is available in the mclust package using the function Mclust.

Mclust(x, G)

Key parameters include x, the data set of attributes for which you wish to find clusters, and G, the minimum and maximum expected number of clusters.

Step 1: Load Required Packages
We use the thyroid data frame (see page 385):
> require(mclust)
> data(thyroid)

Step 2: Prepare Data & Tweak Parameters
We store the thyroid data set in x and remove the first column, which contains the Diagnosis values:
> set.seed(98765)
> x = thyroid[-1]

Step 3: Estimate and Assess the Model
Suppose we believe the number of clusters lies between 1 and 20. We can use the Mclust method to find the optimal number as follows:

> fit <- Mclust(as.matrix(x), G = 1:20)

NOTE

Notice that for multivariate data the Mclust method computes the optimum Bayesian information criterion for each of the following Gaussian mixture models:

1. EII = spherical, equal volume
2. VII = spherical, unequal volume
3. EEI = diagonal, equal volume and shape
4. VEI = diagonal, varying volume, equal shape
5. EVI = diagonal, equal volume, varying shape
6. VVI = diagonal, varying volume and shape
7. EEE = ellipsoidal, equal volume, shape and orientation
8. EVE = ellipsoidal, equal volume and orientation
9. VEE = ellipsoidal, equal shape and orientation
10. VVE = ellipsoidal, equal orientation
11. EEV = ellipsoidal, equal volume and equal shape
12. VEV = ellipsoidal, equal shape
13. EVV = ellipsoidal, equal volume
14. VVV = ellipsoidal, varying volume, shape and orientation

Let's take a look at the result:
> fit
'Mclust' model object:
 best model: diagonal, varying volume and shape (VVI) with 3 components

The optimal number of clusters using all 14 Gaussian mixture models appears to be 3. Here is another way to access the optimum number of clusters:
> k <- dim(fit$z)[2]
> k
[1] 3
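Because the true diagnosis was set aside in Step 2, a simple hedged check (assuming the fit object above) is to cross-tabulate the model-based classification against it:

> table(fit$classification, thyroid$Diagnosis)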

Since BIC values using 14 models are used for choosing the number of clusters, it can be instructive to plot the result by individual model. The plot method with what = "BIC" achieves this. The result is shown in Figure 62.1. Notice most models reach a peak around 3 clusters:
> plot(fit, what = "BIC")

Figure 62.1: Model-Based Clustering BIC by model for thyroid

We can also visualize the fitted model by setting what = "uncertainty" (see Figure 62.2), what = "density" (see Figure 62.3) and what = "classification" (see Figure 62.4).

Figure 62.2: Model-Based Clustering plot with what = "uncertainty" using thyroid

Figure 62.3: Model-Based Clustering plot with what = "density" using thyroid

Figure 62.4: Model-Based Clustering plot with what = "classification" using thyroid

Finally, let's look at the uncertainty and classification pairwise plots using T3 and T4, see Figure 62.5:
> par(mfrow = c(1, 2))

> plot(fit, x[c(1, 2), c(1, 3)], what = "uncertainty")

> plot(fit, x[c(1, 2), c(1, 3)], what = "classification")

Figure 62.5: Model based uncertainty and classification pairwise plots using T3 and T4


Technique 63

Clustering of Binary Variables

Clustering of Binary Variables is available in the cluster package using the function mona.

mona(x)

Key parameters include x, the data set of binary attributes for which you wish to find clusters.

Step 1: Load Required Packages
We use the locust data frame from the package bild:
> require(cluster)
> data(locust, package = "bild")


NOTE

The locust data frame contains data on the effect of hunger on the locomotory behavior of 24 locusts observed at 161 time points. It contains 3864 observations on the following attributes:

1. id: a numeric vector that identifies the number of the individual profile
2. move: a numeric vector representing the response variable
3. sex: a factor with levels 1 for male and 0 for female
4. time: a numeric vector that identifies the number of the time points observed. The time vector considered was obtained dividing (1:161) by 120 (number of observed periods in 1 hour)
5. feed: a factor with levels 0 no and 1 yes

Step 2: Prepare Data & Tweak Parameters
Because the locust data has a time dimension, our first task is to transform the data into a format that can be used by mona. First let's create a variable x that will hold the binary attributes:
> n <- nrow(locust)
> x <- seq(1, 120)
> dim(x) <- c(24, 5)
> colnames(x) <- colnames(locust)

We create a binary data set where we assign 1 if a move took place by the locust within the first 20 time points, 0 otherwise. We do likewise with feeding. We begin by assigning zero to the move and feed attributes (i.e. x[i, 2] <- 0 and x[i, 5] <- 0). Note that columns 1 to 5 of locust contain id, move, sex, time and feed respectively:
> count = 0
> k = 1
> for (i in seq_along(seq(1, 24))) {
    x[i, 2] <- 0
    x[i, 5] <- 0
    for (k in seq(1, 161)) {
      count = count + 1
      if (k == 1) {
        x[i, 1] <- locust[count, 1]
        x[i, 3] <- locust[count, 3]
        x[i, 4] <- locust[count, 4]
      }
      if (locust[count, 2] == 1 && k < 20) x[i, 2] <- locust[count, 2]
      if (locust[count, 5] == 1 && k < 20) x[i, 5] <- locust[count, 5]
      k <- k + 1
    }
    k = 1
  }

Finally we remove the id and time attributes:
> x <- x[, -1]
> x <- x[, -3]

As a check, the contents of x should look similar to this:
> head(x, 6)
     move sex feed
[1,]    0   2    2
[2,]    1   1    2
[3,]    0   2    2
[4,]    1   1    2
[5,]    0   2    2
[6,]    0   1    2

Step 3: Estimate and Assess the Model
Here is how to estimate the model and print out the associated bannerplot, see Figure 63.1:

> fit <- mona(x)
> plot(fit)

Figure 63.1: Clustering of Binary Variables bannerplot using locust


Technique 64

Affinity Propagation Clustering

Affinity propagation takes a given similarity matrix and simultaneously considers all data points as potential exemplars. Real-valued messages are then exchanged between data points until a high-quality set of exemplars and corresponding clusters emerges. It is available in the apcluster package using the function apcluster.

apcluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the Vehicle data frame from the package mlbench. See page 23 for additional details on this data set:
> require(apcluster)
> data(Vehicle, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We store the sample in the variable x and remove the Vehicle types (Class) attribute:
> set.seed(98765)
> x <- Vehicle[-19]

Step 3: Estimate and Assess the Model
We fit the model as follows, where r refers to the number of columns in x:

> fit <- apcluster(negDistMat(r = 18), x)

The model determines the optimal number of clusters. To see the actual number use:
> length(fit@clusters)
[1] 5
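The exemplars themselves, and the members of each cluster, are stored as slots of the fitted object; a brief hedged sketch assuming the fit object above:

> fit@exemplars              # indices of the exemplar observations
> length(fit@clusters[[1]])  # size of the first cluster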

So for this data set the algorithm identifies five clusters. We can visualize the results using the plot method, as shown in Figure 64.1. Note we only plot 15 attributes:
> plot(fit, x[, 1:15])

Finally, here is how to zoom in on a specific pairwise relationship. We choose Circ and ScVarMaxis, as shown in Figure 64.2:
> plot(fit, x[c(2, 11)])

Figure 64.1: Affinity Propagation pairwise cluster plot using Vehicle

Figure 64.2: Affinity Propagation pairwise cluster plot between Circ and ScVarMaxis


Technique 65

Exemplar-Based Agglomerative Clustering

Exemplar-Based Agglomerative Clustering is available in the apcluster package using the function aggExCluster.

aggExCluster(s, x)

Key parameters include x, the data set of attributes for which you wish to find clusters, and s, the similarity matrix.

Step 1: Load Required Packages
We use the bodyfat data frame from the package TH.data. See page 62 for additional details on this data set:
> require(apcluster)
> data(bodyfat, package = "TH.data")

Step 2: Prepare Data & Tweak Parameters
We store the sample data in the variable x:
> set.seed(98765)
> x <- bodyfat

Step 3: Estimate and Assess the Model
We fit the model using a two-step approach. First we use affinity propagation to determine the number of clusters via the apcluster method. For additional details on affinity propagation and apcluster see page 455.

> fit1 <- apcluster(negDistMat(r = 10), x)

Here are some of the details of the fitted model. It appears to have four clusters:
> fit1

APResult object

Number of samples     = 71
Number of iterations  = 152
Input preference      = -6.482591e+14
Sum of similarities   = -4.886508e+14
Sum of preferences    = -2.593037e+15
Net similarity        = -3.081687e+15
Number of clusters    = 4

Next we create a hierarchy of the four clusters using exemplar-based agglomerative clustering via the aggExCluster method:
> fit <- aggExCluster(x = fit1)

We can plot the resultant dendrogram as shown in Figure 65.1:
> plot(fit, showSamples = FALSE)

Figure 65.1: Exemplar-Based Agglomerative Clustering dendrogram using bodyfat

Finally, we zoom in on the pairwise cluster plots between DEXfat and waistcirc, see Figure 65.2. To do this we use a small while loop as follows:
> i = 2
> par(mfrow = c(2, 2))
> while (i <= 5) {
    plot(fit, x[c(2, 3)], k = i, xlab = "DEXfat", ylab = "waistcirc",
         main = c("Number Clusters", i))
    i = i + 1
  }

Figure 65.2: Exemplar-Based Agglomerative Clustering relationship between DEXfat and waistcirc


Technique 66

Bagged Clustering

During bagged clustering a partitioning cluster algorithm such as k-means is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using hierarchical agglomerative cluster analysis (see page 384). The approach is available in the e1071 package using the function bclust.

bclust(x, centers, base.centers, dist.method, ...)

Key parameters include x, the data set of attributes for which you wish to find clusters; centers, the number of clusters; base.centers, the number of centers used in each repetition; and dist.method, the distance method used for hierarchical clustering.

Step 1: Load Required Packages
We use the thyroid data frame from the package mclust. See page 385 for additional details on this data set:
> require(e1071)
> data(thyroid, package = "mclust")

Step 2: Prepare Data & Tweak Parameters
We store the standardized sample data in the variable x:
> x <- thyroid[-1]
> x <- scale(x)

463

92 Applied Predictive Modeling Techniques in R

Step 3 Estimate and Assess the ModelWe fit the model using the bclust method with 3 centers

gt fit lt- bclust(xcenters=3basecenters=5distmethod=manhattan)

We can use the plot method to visualize the dendogram Figure 661shows the resultgt plot(fit)

464

TECHNIQUE 66 BAGGED CLUSTERING

Figure 661 Bagged Clustering dendogram (top) and scree style plot (bot-tom) Gray line is slope of black line

We can also view a box plot of fit as follows see Figure 662gt boxplot(fit)

465

92 Applied Predictive Modeling Techniques in R

Figure 662 Bagged Clustering boxplots

We can also view boxplots by attribute as shown in Figure 663gt boxplot(fit bycluster=FALSE)

466

TECHNIQUE 66 BAGGED CLUSTERING

Figure 663 Bagged Clustering boxplots by attribute

467


Part VII

Boosting


The Basic Idea

Boosting is a powerful supervised classification learning concept. It combines the performance of many "weak" classifiers to produce a powerful committee. A weak classifier is only required to be better than chance. It can therefore be very simple and computationally inexpensive. The basic idea is to iteratively apply simple classifiers and to combine their solutions to obtain a better prediction result. If the classifiers misclassify some data, train another copy mainly on this misclassified part, with the hope that it will develop a better solution. Thus the algorithm increasingly focuses on new strategies for classifying difficult observations.

Here is how it works in a nutshell:

1. A boosting algorithm manipulates the underlying training data by iteratively re-weighting the observations, so that at every iteration the classifier finds a new solution from the data.

2. Higher accuracy is achieved by increasing the importance of "difficult" observations, so that observations that were misclassified receive higher weights. This forces the algorithm to focus on those observations that are increasingly difficult to classify.

3. In the final step all previous results of the classifier are combined into a prediction committee, where the weights of better performing solutions are increased via an iteration-specific coefficient.

4. The resulting weighted majority vote selects the class most often chosen by the classifier, while taking the error rate in each iteration into account.

The Power of the Boost

Boosting is one of the best techniques in the data scientist's toolkit. Why? Because it often yields the best predictive models and can often be relied upon to produce satisfactory results. Academic and applied studies have reached similar findings. Here we cite just two:

• Bauer and Kohavi115 performed an extensive comparison of boosting with several other competitors on 14 data-sets. They found boosting outperformed all other algorithms. They concluded: "For learning tasks where comprehensibility is not crucial, voting methods are extremely useful, and we expect to see them used significantly more than they are today."

• Friedman, Hastie and Tibshirani116, using eight data-sets, compare a range of boosting techniques with the very popular classification and regression tree. They find all the boosting methods outperform it.

NOTE

During boosting the target function is optimized with some implicit penalization whose impact varies with the number of boosting iterations. The lack of an explicit penalty term in the target function is the main difference between boosting and other popular penalization methods such as the Lasso117.


Practical Applications

NOTE

The first boosting technique was developed by Robert Schapire118, working out of the MIT Laboratory for Computer Science. His research article "The Strength of Weak Learnability" showed that a weak base classifier can always improve its performance by training two additional classifiers on filtered versions of the classification data stream. The author observes: "A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well." He was right; boosting is now used in a wide range of practical applications.

Reverberation Suppression

Reverberation suppression is a critical problem in sonar communications. This is because, as an acoustic signal is radiated, degradation occurs due to reflection from the surface, the bottom and the volume of water. Cheepurupalli et al.119 study the propagation of sound in water.

One popular solution is to use the empirical mode decomposition (EMD) algorithm120 as a filtering technique. A noise corrupted signal is applied to EMD and intrinsic mode functions are generated. The key is to separate the signal from the noise. It turns out that noisy intrinsic mode functions are high frequency component signals, whilst signal-led intrinsic mode functions are low frequency component signals. The selection of the appropriate intrinsic mode functions which are used for signal reconstruction is often done manually121.

The researchers use Ada Boost to automatically classify "noise" and "signal" intrinsic mode functions over a signal to noise ratio range from −10 dB to 10 dB.

The results were very encouraging, as they found that combining Ada Boost with EMD increases the likelihood of correct detection. The researchers conclude "that the reconstruction of the chirp signal even at low input SNR [signal to noise] conditions is achieved with the use of Ada Boost based EMD as a de-noising technique".

Cardiotocography

Fetal state (normal, suspect, pathologic) is assessed by Karabulut and Ibrikci122 using Ada Boost combined with six individual machine learning algorithms (Naive Bayes, Radial Basis Function Network, Bayesian Network, Support Vector Machine, Neural Network and C4.5 Decision Tree).

The researchers use the publicly available cardiotocography data set123, which includes a total of 2126 samples, of which the fetal state is normal in 1655 cases, suspicious in 295 and pathologic in 176. The estimated Ada Boost confusion matrix is shown in Table 23. Note that the error rate from this table is around 5%.

                           Predicted values
                 Normal   Suspicious   Pathologic
Actual values
  Normal           1622           26            7
  Suspicious         55          236            4
  Pathologic          7            7          162

Table 23: Confusion matrix of Karabulut and Ibrikci

Stock Market Trading

Creamer and Freund124 develop a multi-stock automated trading system that relies in part on model based boosting. Their trading system consists of a logitboost algorithm, an expert weighting algorithm and a risk management overlay.

Their sample consisted of 100 stocks chosen at random from the S&P 500 index, using daily data over a four year period. The researchers used 540 trading days for the training set, 100 trading days for the validation set and 411 trading days for the test set.

Four variants of their trading system are compared to a buy and hold strategy. This is a common investment strategy and involves buying a portfolio of stocks and holding them without selling for the duration of the investment period. Transaction costs are assumed to vary from $0 to $0.003. Creamer and Freund report all four variants of their trading system outperformed the buy and hold strategy.

Vehicle Logo Recognition

Sam et al.125 consider the problem of automatic vehicle logo recognition. The researchers select the modest adaboost algorithm. A total of 500 vehicle images were used in the training procedure. The detection of the vehicle logo was carried out by sliding a sub-window across the image at multiple scales and locations. A total of 200 images were used in the test set, with 184 images recognized successfully, see Table 24. This implies a misclassification error rate of around 8%.

Manufacturer   Correct   Mistaken
Audi                20          0
BMW                 16          4
Honda               19          1
KIA                 17          3
Mazda               18          2
Mitsubishi          20          0
Nissan              17          3
Suzuki              18          2
Toyota              19          1
Volkswagen          20          0
Total              184         16

Table 24: Logo recognition rate reported by Sam et al.

Basketball Player Detection

Markoski et al.126 investigate player detection during basketball games using the gentle adaboost algorithm. A total of 6000 positive examples that contain the player's entire body and 6000 positive examples of the player's upper body only were used to train the algorithm.

The researchers observe that for images containing the player's whole body the algorithm was unable to reduce the level of false positives below 50%. In other words, flipping a coin would have been as accurate. However, a set of testing images using the player's upper body obtained an accuracy of 70.5%. The researchers observe that applying the gentle adaboost algorithm for detecting the player's upper body results in a relatively large number of false positive observations.

Magnetoencephalography

Magnetoencephalography (MEG) is used to study how stimulus features are processed in the human brain. Takiguchi et al.127 use an Ada Boost algorithm to find the sensor area contributing to the accurate discrimination of vowels.

Four volunteers with normal hearing were recruited for the experiment. Two distinct Japanese sounds were delivered to the subject's right ear. On hearing the sound the volunteer was asked to press a reaction key. MEG amplitudes were measured from 61 pairs of sensors128.

The Ada Boost algorithm was applied to every latency range, with the classification decision made at each time instance. The researchers observe that the Ada Boost classification accuracy first increased as a function of time, reaching a maximum value of 91.0% in the latency range between 50 and 150 ms. These results outperformed their pre-stimulus baseline.


Binary Boosting

NOTE

A weak classifier h(x) is slightly better than random chance (50%) at guessing which class an object belongs in. A strong classifier has a high probability (>95%) of choosing correctly. Decision trees (see Part I) are often used as the basis for weak classifiers.
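For example, a single-split decision tree (a "stump") is a typical weak classifier. The sketch below is purely illustrative (the choice of the kyphosis data, which ships with rpart, is an assumption, not an example taken from this book): it fits a stump and reports its training accuracy, which only needs to beat chance.

library(rpart)
data(kyphosis)

# A decision "stump": a tree allowed only one split - a classic weak classifier
stump <- rpart(Kyphosis ~ ., data = kyphosis,
               control = rpart.control(maxdepth = 1))

pred <- predict(stump, type = "class")
mean(pred == kyphosis$Kyphosis)   # training accuracy of the weak classifier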

Classical binary boosting is founded on the discrete Ada Boost algorithm, in which a sequentially generated, weighted set of weak base classifiers is combined to form an overall strong classifier. Ada Boost was the first adaptive boosting algorithm. It automatically adjusts its parameters to the data based on actual performance at each iteration.

How Ada Boost Works

Given a set of training feature vectors xi (i = 1, 2, ..., N) and a target which represents the binary class yi ∈ {−1, +1}, the algorithm attempts to find the optimal classification by making the individual error εm at iteration m as small as possible, given an iteration-specific distribution of weights on the features.

Incorrectly classified observations receive more weight in the next iteration. Correctly classified observations receive less weight in the next iteration. The weights at iteration m are calculated using the iteration-specific learning coefficient αm multiplied by a classification function η(hm(x)).

As the algorithm iterates it focuses more weight on misclassified objects and attempts to correctly classify them in the next iteration. In this way classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight. The procedure is repeated until a predefined performance requirement is satisfied. A minimal sketch of these mechanics is given below.
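The following sketch is not the ada package implementation; it is a minimal illustration of discrete Ada Boost using rpart stumps as the weak learner. It assumes a data frame dat whose column y is coded −1/+1; the names ada_sketch and ada_sketch_predict are invented for illustration.

library(rpart)

# Minimal discrete Ada Boost sketch. `dat` is assumed to be a data frame whose
# class column y is coded -1 / +1; all other columns are treated as features.
ada_sketch <- function(dat, M = 25) {
  N     <- nrow(dat)
  w     <- rep(1 / N, N)                 # initial observation weights
  alpha <- numeric(M)
  trees <- vector("list", M)
  df    <- data.frame(yf = factor(dat$y),
                      dat[, setdiff(names(dat), "y"), drop = FALSE])

  for (m in 1:M) {
    # weak learner: a one-split stump fit to the weighted sample
    trees[[m]] <- rpart(yf ~ ., data = df, weights = w,
                        control = rpart.control(maxdepth = 1, cp = -1,
                                                minsplit = 0))
    h    <- as.numeric(as.character(predict(trees[[m]], df, type = "class")))
    miss <- as.numeric(h != dat$y)

    eps      <- sum(w * miss) / sum(w)       # weighted error (assume 0 < eps < 1)
    alpha[m] <- 0.5 * log((1 - eps) / eps)   # iteration-specific coefficient
    w        <- w * exp(alpha[m] * miss)     # up-weight misclassified observations
    w        <- w / sum(w)                   # renormalize
  }
  list(trees = trees, alpha = alpha)
}

# Committee prediction: sign of the alpha-weighted vote over all weak learners
ada_sketch_predict <- function(fit, newdata) {
  votes <- sapply(seq_along(fit$trees), function(m) {
    h <- as.numeric(as.character(predict(fit$trees[[m]], newdata, type = "class")))
    fit$alpha[m] * h
  })
  sign(rowSums(votes))
}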

NOTE

Differences in the nature of classical boosting algorithms are often driven by the functional form of αm and η(hm(x)). For example:

1. Ada Boost.M1: αm = 0.5 log((1 − εm)/εm) with η(x) = sign(x).

2. Real Ada Boost and Real L2 set η(p) = log(p/(1 − p)), p ∈ [0, 1].

3. Gentle Ada Boost and Gentle L2 Boost: η(x) = x.

4. Discrete L2, Real L2 and Gentle L2 use a logistic loss function, as opposed to the exponential loss function used by Real Ada Boost, Ada Boost.M1 and Gentle Ada Boost.


Technique 67

Ada Boost.M1

The Ada Boost.M1 algorithm (also known as Discrete Ada Boost) uses discrete boosting with an exponential loss function129. In this algorithm η(x) = sign(x) and αm = 0.5 log((1 − εm)/εm). It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "discrete", ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

NOTE

Ada Boost is often used with simple decision trees as weak base classifiers. This can result in significantly improved performance over a single decision tree130. The package ada creates a classification model as an ensemble of rpart trees. It therefore uses the rpart library as its engine (see page 4 and page 272).

Step 1: Load Required Packages

We build our Ada Boost.M1 model using Sonar, a data frame in the mlbench package:

> library(ada)
> library(mlbench)
> data(Sonar)

NOTE

Sonar contains 208 observations on sonar returns collected from a metal cylinder and a cylindrically shaped rock positioned on a sandy ocean floor131. Returns were collected at a range of 10 meters and obtained from the cylinder at aspect angles spanning 90° and from the rock at aspect angles spanning 180°. In total 61 variables were collected, all numerical with the exception of the sonar return label (111 cylinder returns and 97 rock returns), which is nominal. The label associated with each record is the letter R if the object is a rock and M if it is metal.

Step 2: Prepare Data & Tweak Parameters

To view the classification for the 95th to 105th record type:

> Sonar$Class[95:105]
 [1] R R R M M M M M M M M
Levels: M R

The 95th to 97th observations are labeled R for rock, whilst the 98th to 105th observations are labeled M for metal. The ada package uses the rpart function to generate decision trees. So we need to supply an rpart.control object to the model:

> default <- rpart.control(cp = -1, maxdepth = 2,
                            minsplit = 0)

PRACTITIONER TIP

The maxdepth parameter is raised to the power of 2. In the code that follows we set maxdepth = 2, which is equivalent to 2^2 = 4.

We use 157 of the 208 observations to train the model:

> set.seed(107)
> n = nrow(Sonar)
> indtrain <- sample(1:n, 157, FALSE)

Since Sonar$Class is non-numeric we change it to a numeric scale. This will be useful later when we plot the training and test data:

> z <- Sonar$Class
> z <- as.numeric(z)

Next we create a new data frame combining the numeric variable z with the original data in Sonar. This is followed by a little tidying up, removing Class (which contained the "M" and "R" labels) from the data frame, and splitting the data into the training and testing samples:

> data <- cbind(z, Sonar)
> data$Class <- NULL
> train <- data[indtrain, ]
> test <- data[-indtrain, ]

Step 3: Estimate Model

Now we are ready to run the Ada Boost.M1 model on the training sample. We choose to perform 50 boosting iterations by setting the parameter iter = 50:

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "discrete",
                      control = default)

Take a look at the output:

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1
        44         44

Notice the output gives the training error, the confusion matrix and estimates of the number of iterations. In this example you could use the Out-Of-Bag estimate of 45 iterations, or the training error or kappa error estimates of 44 iterations. As you can see, the model achieved low error rates. The training error is 3.2%. The Out-Of-Bag error rate is around 4.5%.

Step 4: Assess Model Performance

We use the addtest function to evaluate the testing set without having to refit the model:

> set.seed(107)

> output_train <- addtest(x=output_train, test.x=test[,-1], test.y=test[,1])

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 80  2
         2  3 72

Train Error: 0.032

Out-Of-Bag Error:  0.045 iteration= 45

Additional Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2
        44         44         32         32

Notice the estimate of iterations for the training error and kappa declined from 44 (for training) to 32 for the test sample.

It can be instructive to plot both training and test error results by iteration on one chart:

> plot(output_train, test=TRUE)

The resulting plot, shown in Figure 67.1, indicates that the training error steadily decreases across iterations. This shows that boosting can learn the features in the data set. The testing error also declines somewhat across iterations, although not as dramatically as the training error.

It is sometimes helpful to examine the performance at a specific iteration. Here is what happened at iteration 25 for train:

> summary(output_train, n.iter=25)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 25

Training Results

Accuracy: 0.955 Kappa: 0.911

Testing Results

Accuracy: 0.824 Kappa: 0.654

The accuracy of the training and testing samples is above 0.80. To assess the overall performance enter:

> summary(output_train)
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "discrete",
    control = default)

Loss: exponential Method: discrete   Iteration: 50

Training Results

Accuracy: 0.968 Kappa: 0.936

Testing Results

Accuracy: 0.804 Kappa: 0.613

Notice that accuracy is in excess of 80% for both test and training datasets. Kappa declines from 0.94 in training to 0.61 for the test data set.

PRACTITIONER TIP

In very many circumstances you may find your training set unbalanced, in the sense that one class has very many more observations than the other. Ada Boost will tend to focus on learning the larger set, with resultant low errors. Of course, the low error is related to the focus on the larger class. The Kappa coefficient132 provides an alternative measure of absolute classification error which adjusts for class imbalances. As with the correlation coefficient, higher values indicate a stronger fit.
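As a quick illustration of the statistic, Cohen's kappa can be computed directly from a confusion matrix; the sketch below applies the standard formula to the training confusion matrix reported above and reproduces the training kappa of roughly 0.936.

# Cohen's kappa from a confusion matrix (rows = truth, columns = prediction)
cm <- matrix(c(80,  2,
                3, 72), nrow = 2, byrow = TRUE)

po <- sum(diag(cm)) / sum(cm)                      # observed agreement
pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2   # agreement expected by chance
(po - pe) / (1 - pe)                               # kappa, approximately 0.936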

In Figure 67.2 the relative importance of each variable is plotted. This can be obtained by typing:

> varplot(output_train)

Figure 67.2: Variable importance plot for Ada Boost using the Sonar data set

The largest five individual scores can be printed out by entering:

> scores <- varplot(output_train, plot.it=FALSE, type="scores")

> round(scores[1:5], 3)
  V10   V17   V48    V9   V13
0.078 0.078 0.076 0.072 0.071

Step 5: Make Predictions

Next we predict using the fitted model and the test sample. This can be easily achieved by:

> pred <- predict(output_train, test[,-1], type="both")

A summary of the class predictions and the associated plot provides performance information:

> summary(pred$class)
 1  2
23 28

> plot(pred$class)

The model classifies 23 observations into class 1 (recall "M" if the object is metal) and 28 observations into class 2 ("R" if the object is a rock). This is reflected in Figure 67.3.

A nice feature of this model is the ability to see the probabilities of an observation and the class assignment. For example, the first observation has associated probabilities and class assignment:

> pred$probs[[1,1]]
[1] 0.5709708
> pred$probs[[1,2]]
[1] 0.4290292
> pred$class[1]
[1] 1
Levels: 1 2

The probability of the observation belonging to class 1 is 57% and to class 2 around 43%, and therefore the predicted class is class 1. The second and third observations can be assessed in a similar fashion:

> pred$probs[[2,1]]
[1] 0.07414183

> pred$probs[[2,2]]
[1] 0.9258582

> pred$class[2]
[1] 2
Levels: 1 2

> pred$probs[[3,1]]
[1] 0.7416495

> pred$probs[[3,2]]
[1] 0.2583505

> pred$class[3]
[1] 1
Levels: 1 2

Notice that the second observation is assigned to class 2 with an associated probability of approximately 93%, and the third observation is assigned to class 1 with an associated probability of approximately 74%.

Figure 67.1: Error by iteration for train and test

Figure 67.3: Model predictions for class 1 ("M") and class 2 ("R")

Technique 68

Real Ada Boost

The Real Ada Boost algorithm uses real boosting with an exponential loss function and η(p) = log(p/(1 − p)), p ∈ [0, 1]. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "e", type = "real", bag.frac = 1, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. To perform stochastic gradient boosting you set the bag.frac argument to less than 1 (the default is bag.frac = 0.5). Pure ε-boosting only happens if bag.frac = 1.

Steps 1-2 are outlined beginning on page 481.

Step 3 & 4: Estimate Model & Assess Performance

We choose to perform 50 boosting iterations by setting the parameter iter = 50.

NOTE

The main control to avoid over-fitting in boosting algorithms is the stopping iteration. A very high number of iterations may favor over-fitting. However, stopping the algorithm too early might lead to poorer prediction on new data. In practice over-fitting appears to be less of a risk than under-fitting, and there is considerable evidence that Ada Boost is quite resistant to over-fitting133.
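One practical way to act on this, consistent with the functions already used in this chapter, is to fit with a generous number of iterations, attach the test set, and look at where the test error stops improving. A brief sketch reusing the train/test split from the previous technique; the iteration count of 200 is an illustrative assumption.

# Fit with a generous number of iterations, attach the test set, then
# inspect where the test error flattens out
set.seed(107)
big <- ada(z ~ ., data = train, iter = 200, loss = "e", type = "real",
           control = default)
big <- addtest(big, test.x = test[,-1], test.y = test[,1])
plot(big, kappa = FALSE, test = TRUE)   # error by iteration, train and test
summary(big)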

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50,
                      loss = "e", type = "real", bag.frac = 1,
                      control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 1, control = default)

Loss: exponential Method: real   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  0 75

Train Error: 0

Out-Of-Bag Error:  0 iteration= 6

Additional Estimates of number of iterations:

train.err1 train.kap1
        27         27

In this case the model perfectly fits the training data and the training error is zero.

PRACTITIONER TIP

Although the out-of-bag error rate is also zero, it is of little value here because there are no subsamples in pure ε-boosting (we set bag.frac = 1).

We next refit the training data using stochastic gradient boosting with bag.frac at its default value. The overall error rate remains a modest 1.3%, with an Out-Of-Bag Error of 4.5%.

> set.seed(107)
> output_SB <- ada(z ~ ., data = train, iter = 50,
                   loss = "e", type = "real", bag.frac = 0.5,
                   control = default)

> output_SB
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "real",
    bag.frac = 0.5, control = default)

Loss: exponential Method: real   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 82  0
         2  2 73

Train Error: 0.013

Out-Of-Bag Error:  0.045 iteration= 46

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

To assess variable importance we enter:

> scores <- varplot(output_train, plot.it=FALSE, type="scores")

> round(scores[1:5], 3)
  V46   V47   V48   V35   V20
0.087 0.081 0.078 0.076 0.072

The top five variables differ somewhat from those obtained with Ada Boost.M1 (see page 487).

Step 5: Make Predictions

Now we fit the model to the test data set, make predictions and compare the results to the actual values observed in the sample using the commands table(pred) and table(test$z) (recall test$z contains the class values):

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "real",
                     control = default)

> pred <- predict(output_test, newdata=test)

> table(pred)
pred
 1  2
29 22

> table(test$z)

 1  2
29 22

The model fits the observations perfectly.


Technique 69

Gentle Ada Boost

The Gentle Ada Boost algorithm uses gentle boosting with an exponential loss function and η(x) = x. It can be run using the package ada with the ada function:

ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle", nu = 0.01, control = default, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; data, the data set of attributes with which you wish to train the model; and nu, the shrinkage parameter for boosting. Steps 1-2 are outlined beginning on page 481.

NOTE

The idea behind the shrinkage parameter nu is to slow down learning and reduce the likelihood of over-fitting. Smaller learning rates (such as nu < 0.1) often produce dramatic improvements in a model's generalization ability over gradient boosting without shrinkage (nu = 1)134. However, small values of nu increase computational time by increasing the number of iterations required to learn.
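To see the trade-off in practice, one can fit the same model with two shrinkage values and compare the results; the sketch below reuses the train data frame and default control object from the earlier steps, and the particular nu values are illustrative assumptions.

# Illustrative comparison of a large and a small shrinkage value
set.seed(107)
fit_fast <- ada(z ~ ., data = train, iter = 50, loss = "e",
                type = "gentle", nu = 1, control = default)     # no shrinkage
set.seed(107)
fit_slow <- ada(z ~ ., data = train, iter = 50, loss = "e",
                type = "gentle", nu = 0.01, control = default)  # heavy shrinkage

# With heavy shrinkage, more iterations are typically needed to reach
# a comparable training error
summary(fit_fast)
summary(fit_slow)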

Step 3: Estimate Model

We choose to perform 50 boosting iterations by setting the parameter iter = 50:

> set.seed(107)
> output_train <- ada(z ~ ., data = train, iter = 50, loss = "e",
                      type = "gentle", nu = 0.01, control = default)

> output_train
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 72  8
         2 13 64

Train Error: 0.134

Out-Of-Bag Error:  0.121 iteration= 28

Additional Estimates of number of iterations:

train.err1 train.kap1
         5         23

The training error is around 13% with an Out-Of-Bag Error of 12%. Let's see if we can do better by using the parameter bag.shift=TRUE to estimate an ensemble shifted towards bagging:

> output_bag <- ada(z ~ ., data = train, iter = 50,
                    loss = "e", type = "gentle", nu = 0.01,
                    bag.shift = TRUE, control = default)

> output_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 70 10
         2 20 57

Train Error: 0.191

Out-Of-Bag Error:  0.178 iteration= 17

Additional Estimates of number of iterations:

train.err1 train.kap1
         9          9

The training error has risen to 19%. We continue our analysis using both models.

Step 4: Assess Model Performance

We fit the test data to both models. First our boosted model:

> set.seed(107)
> output_test <- ada(z ~ ., data = test, iter = 50,
                     loss = "e", type = "gentle", nu = 0.01,
                     control = default)

> output_test
Call:
ada(z ~ ., data = test, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 31  0
         2  3 17

Train Error: 0.059

Out-Of-Bag Error:  0.059 iteration= 10

Additional Estimates of number of iterations:

train.err1 train.kap1
         8          8

The error rate here, at 5.9%, is certainly lower than that observed for the training set, and the number of iterations for kappa is only 8. The ensemble shifted towards bagging gives the following results:

> output_test_bag <- ada(z ~ ., data = train, iter = 50,
                         loss = "e", type = "gentle", nu = 0.01,
                         bag.shift = TRUE, control = default)

> output_test_bag
Call:
ada(z ~ ., data = train, iter = 50, loss = "e", type = "gentle",
    nu = 0.01, bag.shift = TRUE, control = default)

Loss: exponential Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  1  2
         1 38 42
         2 11 66

Train Error: 0.338

Out-Of-Bag Error:  0.166 iteration= 44

Additional Estimates of number of iterations:

train.err1 train.kap1
        35         46

The training error is higher at 34%. The Out-Of-Bag Error remains stubbornly in the 17% range.

Step 5: Make Predictions

We use the test sample to make predictions for both models and compare the results to the known values:

> pred <- predict(output_test, newdata=test)
> table(pred)
pred
 1  2
34 17

> pred <- predict(output_test_bag, newdata=test)
> table(pred)
pred
 1  2
31 20

> table(test$z)

 1  2
31 20

Note that test$z contains the actual values. It turns out that the results for both models are within an acceptable range. The model shifted towards bagging reproduces the observed class counts exactly.


Technique 70

Discrete L2 Boost

Discrete L2 Boost uses discrete boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "discrete", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model.

Step 1: Load Required Packages

We will build the model using the soldat data frame contained in the ada package. The objective is to use the Discrete L2 Boost model to predict the relationship between the structural descriptors (aka features or attributes) and solubility / insolubility.

> library(ada)
> data(soldat)

NOTE

The soldat data frame consists of 5631 compounds tested to assess their ability to dissolve in a water/solvent mixture. Compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Notice that one of the descriptors contains 787 missing values.
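Before modelling, it can be worth confirming where those missing values sit. A quick check (rpart-based boosting can handle missing predictor values through surrogate splits, so no imputation is strictly required):

# Count missing values per descriptor and locate the affected column(s)
na_by_col <- colSums(is.na(soldat))
sum(na_by_col)               # total number of missing values
na_by_col[na_by_col > 0]     # descriptor(s) containing them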

Step 2: Prepare Data & Tweak Parameters

We partition the data into a training set containing 60% of the observations, a test set containing 30% of the observations and a validation set containing 10% of the observations:

> n <- nrow(soldat)
> set.seed(103)
> random_sample <- sample(1:n, n, FALSE)
> train_sample <- ceiling(n*0.6)
> test_sample <- ceiling(n*0.3)
> valid_sample <- ceiling(n*0.1)
> train <- soldat[random_sample[1:train_sample], ]
> test <- soldat[random_sample[(train_sample+1):(train_sample+test_sample)], ]
> valid <- soldat[random_sample[(test_sample+train_sample+1):n], ]

Wow, that is a lot of typing! Better check we have the right number of observations (5631):

> nrow(train)+nrow(test)+nrow(valid)
[1] 5631

Now we set the rpart.control. The maxdepth parameter controls the maximum depth of any node in the final tree, with the root node counted as depth 0. It is raised to the power of 2. We set a max depth of 2^4 = 16:

> default <- rpart.control(cp=-1, maxdepth=4, maxcompete=5,
                            minsplit=0)

Step 3: Estimate Model

Now we are ready to run the model on the training sample. We choose 50 boosting iterations by setting the parameter iter = 50:

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "discrete",
                     control = default)
> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)

Loss: logistic Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1918  189
         1  461  811

Train Error: 0.192

Out-Of-Bag Error:  0.207 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        49         49

Notice the output gives the training error, the confusion matrix and estimates of the number of iterations. The training error estimate is 19.2% and the Out-Of-Bag Error is 20.7%. Of course, you could increase the number of iterations to attempt to push down both error rates. However, for illustrative purposes we will stick with this version of the model.

Step 4: Assess Model Performance

PRACTITIONER TIP

Notice that the actual binary classes (coded -1 and +1) are in column 73 of the test, train and validation datasets. Therefore we pass the 73rd column as test.y and the remaining columns as test.x to the addtest function.

We use the addtest function to evaluate the testing set without needing to refit the model:

> output_test <- addtest(output_test, test.x=test[,-73], test.y=test[,73])

> summary(output_test)

Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "discrete",
    control = default)

Loss: logistic Method: discrete   Iteration: 50

Training Results

Accuracy: 0.808 Kappa: 0.572

Testing Results

Accuracy: 0.759 Kappa: 0.467

We see a decline in both the model accuracy and kappa from the training to the test set. Although an accuracy of 75.9% may be sufficient, we would like to see a slightly higher kappa.

Next we plot the error and kappa by iteration for training and test, see Figure 70.1:

> plot(output_test, TRUE, TRUE)

Figure 70.1: Error and kappa by iteration for training and test sets

The error rates of both the test and training sets appear to decline as the number of iterations increases. Kappa appears to rise somewhat as the iterations increase for the training set, but it quickly flattens out for the testing data set.

We next look at the attribute influence scores and plot the 10 most influential variables, see Figure 70.2:

> scores <- varplot(output_test, plot.it=FALSE, type="scores")
> barplot(scores[1:10])

Figure 70.2: Ten most influential variables

For this example there is not a lot of variation between the top ten influential variables.

Step 5: Make Predictions

Next we re-estimate using the validation sample, predict the classes and compare to the actual observed classes:

> set.seed(127)
> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "discrete",
                      control = default)

> pred <- predict(output_valid, newdata=valid)

> table(pred)
pred
 -1   1
378 184

> table(valid$y)

 -1   1
356 206

The model predicts 378 in "class -1" and 184 in "class 1", whilst we actually observe 356 and 206 in "class -1" and "class 1" respectively. A summary of the class probability based predictions is obtained using:

> pred <- predict(output_valid, newdata=valid, type="probs")
> round(pred, 3)

        [,1]  [,2]
[1,]   0.010 0.990
[2,]   0.062 0.938
[3,]   0.794 0.206
[4,]   0.882 0.118
...
[559,] 0.085 0.915
[560,] 0.799 0.201
[561,] 0.720 0.280
[562,] 0.996 0.004


Technique 71

Real L2 Boost

Real L2 Boost uses real boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "real", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3: Estimate Model

We choose 50 boosting iterations by setting the parameter iter = 50:

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "real",
                     control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic Method: real   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1970  137
         1  487  785

Train Error: 0.185

Out-Of-Bag Error:  0.211 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        47         50

The confusion matrix for the data yields a training error of 18.5%. The Out-Of-Bag Error is a little over 21%.

Step 4: Assess Model Performance

The addtest function is used to evaluate the testing set without needing to refit the model:

> output_test <- addtest(output_test, test.x=test[,-73], test.y=test[,73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "real",
    control = default)

Loss: logistic Method: real   Iteration: 50

Training Results

Accuracy: 0.815 Kappa: 0.584

Testing Results

Accuracy: 0.753 Kappa: 0.451

It is noticeable that the accuracy of the model falls from 81.5% for the training set to 75.3% for the test set. A rather sharp decline is also observed in the kappa statistic.

We next calculate the influential variables. As we observed with Discrete L2 Boost, there is not a lot of variation in the top five most influential, with all hovering in the approximate range of 0.10-0.11:

> scores <- varplot(output_test, plot.it=FALSE, type="scores")

> round(scores[1:5], 3)
   x1    x2   x23    x5   x51
0.114 0.111 0.108 0.108 0.106

Step 5: Make Predictions

We re-estimate using the validation sample and make predictions. Notice that 385 are predicted to be in "class -1" and 177 in "class 1". These predicted values are very close to those obtained by the Discrete L2 Boost:

> set.seed(127)
> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "real",
                      control = default)
> pred <- predict(output_valid, newdata=valid)

> table(pred)
pred
 -1   1
385 177

> table(valid$y)

 -1   1
356 206


Technique 72

Gentle L2 Boost

Gentle L2 Boost uses gentle boosting with a logistic loss function. It can be run using the package ada with the ada function:

ada(z ~ ., data, iter, loss = "l", type = "gentle", control, ...)

Key parameters include iter, the number of iterations used to estimate the model; z, the data-frame of binary classes; and data, the data set of attributes with which you wish to train the model. Steps 1-2 are outlined beginning on page 501.

Step 3: Estimate Model

We choose 50 boosting iterations by setting the parameter iter = 50:

> set.seed(127)
> output_test <- ada(y ~ ., data = train, iter = 50,
                     loss = "l", type = "gentle",
                     control = default)

> output_test
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic Method: gentle   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value   -1    1
        -1 1924  183
         1  493  779

Train Error: 0.2

Out-Of-Bag Error:  0.216 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1
        50         50

The model obtained a training error of 20% and a slightly higher Out-Of-Bag Error estimate of close to 22%.

Step 4: Assess Model Performance

Next we begin the assessment of the model's performance using the test data set. Notice that the accuracy declines slightly from 80% for the training sample to 74% for the test data set. We see modest declines in kappa also as we move from the training to the test data set:

> output_test <- addtest(output_test, test.x=test[,-73], test.y=test[,73])

> summary(output_test)
Call:
ada(y ~ ., data = train, iter = 50, loss = "l", type = "gentle",
    control = default)

Loss: logistic Method: gentle   Iteration: 50

Training Results

Accuracy: 0.8 Kappa: 0.552

Testing Results

Accuracy: 0.742 Kappa: 0.427

The top five most influential variables are given below. It is interesting to observe that they all lie roughly in a range of 0.09 to 0.10:

> scores <- varplot(output_test, plot.it=FALSE, type="scores")

> round(scores[1:5], 3)
  x39   x12   x56   x70   x41
0.099 0.097 0.097 0.097 0.095

Step 5: Make Predictions

We use the validation sample to make class predictions. Notice that 412 are predicted to be in "class -1" and 150 in "class 1". These predicted values are somewhat different from those obtained by the Discrete L2 Boost (see page 506) and Real L2 Boost (see page 510):

> output_valid <- ada(y ~ ., data = valid, iter = 50,
                      loss = "l", type = "gentle",
                      control = default)

> pred <- predict(output_valid, newdata=valid)

> table(pred)
pred
 -1   1
412 150

> table(valid$y)

 -1   1
356 206


Multi-Class Boosting


Technique 73

SAMME

SAMME135 is an extension of the binary Ada Boost algorithm to two or more classes. It uses an exponential loss function with αm = ln((1 − εm)/εm) + ln(k − 1), where k is the number of classes.

PRACTITIONER TIP

When the number of classes k = 2, then αm = ln((1 − εm)/εm), which is similar to the learning coefficient of Ada Boost.M1. Notice that SAMME can be used for both binary boosting and multiclass boosting. This is also the case for the Breiman and Freund extensions discussed on page 519 and page 522 respectively. As an experiment, re-estimate the Sonar data set (see page 482) using SAMME; a sketch of how to set this up follows this tip. What do you notice about the results relative to Ada Boost.M1?
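A minimal sketch of that experiment, assuming the same 157-observation training split used earlier for the Sonar data (the mfinal and rpart.control settings below are illustrative choices, not values taken from this book):

library(adabag)
library(mlbench)
data(Sonar)

set.seed(107)
ind <- sample(1:nrow(Sonar), 157, FALSE)
ctrl <- rpart.control(cp = -1, maxdepth = 2, minsplit = 0)

# SAMME (coeflearn = "Zhu") applied to the two-class Sonar data
fit_samme <- boosting(Class ~ ., data = Sonar[ind, ], mfinal = 25,
                      coeflearn = "Zhu", control = ctrl)

pred <- predict.boosting(fit_samme, newdata = Sonar[-ind, ])
pred$confusion            # compare with the Ada Boost.M1 results
round(pred$error, 2)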

The SAMME algorithm can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn="Zhu", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn="Zhu" to specify using the SAMME algorithm.

Step 1: Load Required Packages

NOTE

The Vehicle data set (see page 23) is stored in the mlbench package. It is automatically loaded with the adabag package. If you need to load Vehicle directly, type library(mlbench) at the R prompt.

We begin by loading the library package and the Vehicle data set:

> library(adabag)
> data(Vehicle)

Step 2: Prepare Data & Tweak Parameters

The adabag package uses the rpart function to generate decision trees. So we need to supply an rpart.control object to the model. We use 500 observations as the training set and the remainder for testing:

> default <- rpart.control(cp = -1, maxdepth = 4,
                            minsplit = 0)
> set.seed(107)
> N = nrow(Vehicle)
> train <- sample(1:N, 500, FALSE)
> observed <- Vehicle[train, ]

Step 3 & 4: Estimate Model & Assess Model Performance

We estimate the model, create a table of predicted and observed values, and then calculate the error rate:

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ],
                  mfinal = 25, control = default, coeflearn = "Zhu")

> table(observed$Class, fit$class, dnn = c("Observed Class", "Predicted Class"))

              Predicted Class
Observed Class bus opel saab van
          bus  124    0    0   0
          opel   0  114   13   0
          saab   0   12  117   0
          van    0    0    0 120

> error_rate = (1 - sum(fit$class == observed$Class)/500)

> round(error_rate, 2)
[1] 0.05

PRACTITIONER TIP

The object returned from using the boosting function contains some very useful information; particularly relevant are $trees, $weights, $votes, $prob, $class and $importance. For example, if you enter fit$trees, R will return details of the trees used to estimate the model.
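A few of these components can be inspected directly; a brief sketch using the fitted object from above:

length(fit$trees)      # one rpart tree per boosting iteration
round(fit$weights, 3)  # weight given to each tree in the committee
head(fit$votes)        # weighted votes for each class, one row per observation
head(fit$prob)         # the same information rescaled as class probabilities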

Notice the error rate is 5%, which seems quite low. Perhaps we have over-fit the model. As a check we perform a 10-fold cross validation. The confusion matrix is obtained using $confusion and the error by $error:

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25,
                    v = 10, control = default, coeflearn = "Zhu")

> cv$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  208    0    3   1
           opel   2  113   91   4
           saab   7   94  116  11
           van    1    5    7 183

> cv$error
[1] 0.2671395

The cross validation error rate is much higher at 27%. This is probably a better reflection of what we can expect on the testing sample.

Before we use the test set to make predictions we take a look at the three most influential features:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra   Pr.Axis.Ra Sc.Var.Maxis
        14.0          9.0          8.7

NOTE

Cross validation is a powerful concept because it can be used to estimate the error of an ensemble without having to divide the data into a training and testing set. It is often used in small samples, but is equally advantageous in larger datasets also.

Step 5: Make Predictions

We use the test data set to make predictions:

> pred <- predict.boosting(fit, newdata=Vehicle[-train, ])

> pred$error
[1] 0.2745665

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   88    1    1   1
           opel   1   42   37   1
           saab   2   40   48   4
           van    3    2    2  73

The prediction error at 27% is close to the cross validated error rate.


Technique 74

Breiman's Extension

Breiman's extension assumes αm = 0.5 log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn="Breiman", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn="Breiman".

PRACTITIONER TIP

The function boosting takes the optional parameter boos. By default it is set to TRUE and a bootstrap sample is drawn using the weight of each observation. If boos = FALSE then every observation is used.

Steps 1 and 2 are outlined beginning on page 516

Step 3 & 4: Estimate Model & Assess Model Performance

We fit the model using the training sample, calculate the error rate, and also calculate the error rate using a 10-fold cross validation:

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Breiman")

> error_rate = (1 - sum(fit$class == observed$Class)/500)

> round(error_rate, 2)
[1] 0.1

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Breiman")
> round(cv$error, 2)
[1] 0.26

Whilst the error rate of 10% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation. Once again we expect the cross-validation error to better reflect what we expect to observe in the test sample.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Max.L.Rect
        23.7         14.0          8.8

Max.L.Ra is the most influential variable here, as it also was using the SAMME algorithm (see page 517).

Step 5: Make Predictions

Finally we make predictions using the test data set and print out the observed versus predicted values:

> pred <- predict.boosting(fit, newdata=Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.28

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   87    8    4   1
           opel   1   43   33   1
           saab   2   25   45   4
           van    4    9    6  73

The predicted error rate is 28% (close to the cross validated value).


Technique 75

Freund's Adjustment

Freund's adjustment is another direct extension of the binary Ada Boost algorithm to two or more classes, where αm = log((1 − εm)/εm). It can be run using the package adabag with the boosting function:

boosting(Class ~ ., data, mfinal, control, coeflearn="Freund", ...)

Key parameters include mfinal, the number of iterations used to estimate the model; Class, the data-frame of classes; data, the data set containing the features over which you wish to train the model; and coeflearn="Freund".

Steps 1 and 2 are outlined beginning on page 516

Step 3 & 4: Estimate Model & Assess Model Performance

We fit the model using the training sample, calculate the error rate, and also calculate the error rate using a 10-fold cross validation:

> set.seed(107)
> fit <- boosting(Class ~ ., data = Vehicle[train, ], mfinal = 25,
                  control = default, coeflearn = "Freund")

> error_rate = (1 - sum(fit$class == observed$Class)/500)

> round(error_rate, 2)
[1] 0.06

> set.seed(107)
> cv <- boosting.cv(Class ~ ., data = Vehicle, mfinal = 25, v = 10,
                    control = default, coeflearn = "Freund")

> round(cv$error, 2)
[1] 0.25

Whilst the error rate of 6% for the fitted model on the training sample is higher than that observed for the SAMME algorithm (see text beginning on page 516), it remains considerably lower than that obtained by cross validation.

The order of the three most influential features is as follows:

> influence <- sort(fit$importance, decreasing = TRUE)

> round(influence[1:3], 1)
    Max.L.Ra Sc.Var.maxis   Pr.Axis.Ra
        20.1          9.7          9.3

Max.L.Ra is the most influential variable here (as it also was for the SAMME and Breiman algorithms - see pages 517 and 520).

Step 5: Make Predictions

Finally we make predictions using the test data set and print out the observed versus predicted table:

> pred <- predict.boosting(fit, newdata=Vehicle[-train, ])

> round(pred$error, 2)
[1] 0.27

> pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   86    1    2   0
           opel   1   47   37   0
           saab   3   31   45   4
           van    4    6    4  75

The predicted error rate is 27% (close to the cross validated value of 25%).


Continuous Response Boosted Regression


Technique 76

L2 Regression

NOTE

L2 boosting minimizes the least squares error (the sum of the squares of the differences between the observed values yi and the estimated values ŷi):

L2 = Σ (yi − ŷi)², with the sum taken over i = 1, ..., N.    (76.1)

ŷi is estimated as a function of the independent variables / features.

L2 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family=Gaussian(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family=Gaussian(), which implements L2 boosting; and control, which limits the number of boosting iterations and also controls the shrinkage parameter.

Step 1: Load Required Packages

We begin by loading the mboost package and the bodyfat data set described on page 62:

> library(mboost)
> data(bodyfat, package="TH.data")

Step 2: Prepare Data & Tweak Parameters

In the original study Garcia et al. used 45 observations for model validation. We follow the same approach, using the remaining 26 observations as the testing sample:

> set.seed(465)
> train <- sample(1:71, 45, FALSE)

Step 3: Estimate Model & Assess Fit

PRACTITIONER TIP

Set trace = TRUE if you want to see status information during the fitting process.

First we fit the model using the glmboost function. We set the number of boosting iterations to 75 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
                  family = Gaussian(),
                  control = boost_control(mstop = 75, nu = 0.1,
                                          trace = FALSE))

> coef(fit, off2int=TRUE)
 (Intercept)          age    waistcirc
-66.27793041   0.02547924   0.17377131
     hipcirc elbowbreadth  kneebreadth
  0.46430335  -0.63272508   0.86864111
    anthro3a     anthro3b
  3.36145109   3.52597323

PRACTITIONER TIP

We use coef(fit, off2int=TRUE) to add back the offset to the intercept. To see the intercept without the offset use coef(fit).

A key tuning parameter of boosting is the number of iterations. We set mstop = 75. However, to prevent over-fitting it is important to choose the optimal stopping iteration with care. We use 10-fold cross validated estimates of the empirical risk to help us choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 40

> fit[mstop(cvm)]

527

92 Applied Predictive Modeling Techniques in R

Figure 761 Cross-validated predictive risk for bodyfat data set and L2 Re-gression

Figure 761 displays the predictive risk The optimal stopping iterationminimizes the average risk over all samples Notice that mstop(cvm) returnsthe optimal number in this case 40 We use fit[mstop(cvm)]to set themodel parameters automatically to the optimal mstop

Given our optimal model, we calculate a bootstrapped confidence interval at the 90% level for each of the parameters. For illustration we only use 200 bootstraps; in practice you should use at least 1,000:

> CI <- confint(fit, B = 200, level = 0.9)

> CI

        Bootstrap Confidence Intervals
                      5%           95%
(Intercept)  -74.8992833  -49.16894522
age            0.0000000    0.04193643
waistcirc      0.0250575    0.27216736
hipcirc        0.2656622    0.60493289
elbowbreadth  -0.5086485    0.45619227
kneebreadth    0.0000000    1.93441658
anthro3a       0.0000000    8.05412043
anthro3b       0.0000000    5.32118418
anthro3c       0.0000000    3.05345810
anthro4        0.0000000    4.66542476

PRACTITIONER TIP

To compute a confidence interval at another level, simply enter the desired level using the print function. For example, print(CI, level = 0.8) returns:

> print(CI, level = 0.8)

        Bootstrap Confidence Intervals
                     10%           90%
(Intercept)  -72.20299829  -51.87095140
age            0.00000000    0.03249837
waistcirc      0.04749293    0.24761814
hipcirc        0.32502314    0.56940817
elbowbreadth  -0.30893554    0.00000000
kneebreadth    0.00000000    1.45814706
anthro3a       0.00000000    7.14609005
anthro3b       0.00000000    4.67906130
anthro3c       0.00000000    2.29523786
anthro4        0.00000000    3.59735084

The confidence intervals indicate that waistcirc and hipcirc are statistically significant. Since our goal is to build a parsimonious model, we re-estimate the model using only these two variables:

> fit2 <- gamboost(DEXfat ~ waistcirc + hipcirc,
    data = bodyfat[train, ], family = Gaussian(),
    control = boost_control(mstop = 150, nu = 0.1, trace = FALSE))

> CI2 <- confint(fit2, B = 50, level = 0.9)

> par(mfrow = c(1, 2))
> plot(CI2, which = 1)
> plot(CI2, which = 2)
> par(new = TRUE)

Notice we use the gamboost function rather than glmboost. This is because we want to compute and display point-wise confidence intervals using the plot function. The confidence intervals are shown in Figure 76.2. For both variables the effect shows an almost linear increase with circumference size.

Figure 76.2: Point-wise confidence intervals for waistcirc and hipcirc

Step 4: Make Predictions
We now make predictions using the testing data set, plot the result (see Figure 76.3) and calculate the squared correlation coefficient. The final model shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.858 (linear correlation = 0.926):

> pred <- predict(fit2, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.858

Figure 76.3: Plot of observed versus predicted values for the L2 bodyfat regression

Technique 77

L1 Regression

NOTE

L1 boosting minimizes the least absolute deviations (also known as the least absolute errors) by minimizing the sum of the absolute differences between the observed values (y_i) and the estimated values (\hat{y}_i):

L_1 = \sum_{i=1}^{N} |y_i - \hat{y}_i|

\hat{y}_i is estimated as a function of the independent variables (features). Unlike the L2 loss, which is sensitive to extreme values, the L1 loss is robust to outliers.

L1 boosting for continuous response variables can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Laplace(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Laplace(), which implements L1 loss boosting; and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1 and 2 are outlined beginning on page 525.
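A quick way to see why the L1 loss is robust is to compare the minimizers of the two losses in the presence of an outlier: the mean minimizes the L2 loss while the median minimizes the L1 loss. A minimal sketch (our own illustration, unrelated to the bodyfat data):

y  <- c(2.1, 1.9, 2.0, 2.2, 1.8, 25)   # the last value is an outlier
l2 <- function(c) sum((y - c)^2)       # least squares loss
l1 <- function(c) sum(abs(y - c))      # least absolute deviation loss

optimize(l2, range(y))$minimum         # close to mean(y), dragged towards 25
optimize(l1, range(y))$minimum         # close to median(y), barely affected
mean(y); median(y)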

Step 3: Estimate Model & Assess Fit
First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
    family = Laplace(),
    control = boost_control(mstop = 300, nu = 0.1, trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc 
-59.12102158   0.01124291   0.06454519 
     hipcirc elbowbreadth  kneebreadth 
  0.49652307  -0.38873368   0.83397012 
    anthro3a     anthro3b     anthro3c 
  0.58847326   3.64957585   2.00388290 

All of the variables receive a weight, with the exception of anthro4. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 260

> fit[mstop(cvm)]

The optimal stopping iteration is 260, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions
Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 77.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.906 (linear correlation = 0.95):

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.906

Figure 77.1: Plot of observed versus predicted values for the L1 bodyfat regression

Technique 78

Robust Regression

Robust boosting regression uses the Huber loss function, which is less sensitive to outliers than the L2 loss function discussed on page 525. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Huber(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = Huber(); and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1 and 2 are outlined beginning on page 525.
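The Huber loss is quadratic for small residuals and linear beyond a changepoint d, which is what makes it less sensitive to outliers; mboost's Huber() family chooses d adaptively when it is not supplied. A minimal sketch of the loss itself (our own illustration):

huber <- function(u, d = 1) {
  # quadratic for |u| <= d, linear beyond d
  ifelse(abs(u) <= d, u^2 / 2, d * (abs(u) - d / 2))
}

u <- seq(-4, 4, by = 0.1)
plot(u, huber(u), type = "l", ylab = "loss", main = "Huber loss (d = 1)")
lines(u, u^2 / 2, lty = 2)   # L2 loss for comparison; note how it explodes in the tails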

Step 3: Estimate Model & Assess Fit
First we fit the model using the glmboost function. We set the number of boosting iterations to 300 and the shrinkage parameter (nu) to 0.1. This is followed by using coef to show the estimated model coefficients:

> fit <- glmboost(DEXfat ~ ., data = bodyfat[train, ],
    family = Huber(),
    control = boost_control(mstop = 300, nu = 0.1, trace = FALSE))

> coef(fit, off2int = TRUE)
 (Intercept)          age    waistcirc 
-60.83574185   0.02340993   0.17680696 
     hipcirc elbowbreadth  kneebreadth 
  0.45778183  -0.83925018   0.64471220 
    anthro3a     anthro3b     anthro3c 
  0.26460670   5.14548154   0.78826723 

All of the variables receive a weight, with the exception of anthro4. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 100

> fit[mstop(cvm)]

The optimal stopping iteration is 100, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.

Step 4: Make Predictions
Predictions using the testing data set are made using the optimal model fit. A plot of the predicted and observed values (see Figure 78.1) shows a relatively linear relationship with DEXfat, with a squared correlation coefficient of 0.913:

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.913

Figure 78.1: Plot of observed versus predicted values for the robust bodyfat regression

Technique 79

Generalized Additive Model

The generalized additive model (GAM) is a generalization of the linear regression model in which the linear predictor is built from smooth functions of the covariates. The GAM can therefore account for non-linear relationships between the response variable and the covariates:

E(Y \mid X_1, X_2, \ldots, X_K) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_K(X_K)    (79.1)

A boosted version of the model can be run using the package mboost with the gamboost function:

gamboost(z ~ ., data, control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; and control, which limits the number of boosting iterations and controls the shrinkage parameter. Steps 1 and 2 are outlined beginning on page 525.

NOTE

For more than 100 years what data scientists could do was limited by asymptotic theory. Consumers of statistical theory, such as economists, focused primarily on linear models and residual analysis. With boosting there is no excuse anymore for using overly restrictive statistical models.

Step 3: Estimate Model & Assess Fit
We previously saw that waistcirc and hipcirc are the primary independent variables for predicting body fat (see page 525). We use these two variables to estimate linear terms (bols()), smooth terms (bbs()) and an interaction between waistcirc and hipcirc modeled using decision trees (btree()):

> m <- DEXfat ~ waistcirc + hipcirc + bols(waistcirc) + bols(hipcirc) +
    bbs(waistcirc) + bbs(hipcirc) +
    btree(hipcirc, waistcirc,
          tree_controls = ctree_control(maxdepth = 4, mincriterion = 0))

We set the number of boosting iterations to 150 and the shrinkage parameter (nu) to 0.1. This is followed by a 10-fold cross-validated estimate of the empirical risk to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function; fit[mstop(cvm)] assigns the model coefficients at the optimal iteration:

> fit <- gamboost(m, data = bodyfat[train, ],
    control = boost_control(mstop = 150, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 41

> fit[mstop(cvm)]

The optimal stopping iteration is 41, and fit[mstop(cvm)] sets the model parameters automatically to the optimal mstop.
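It can also be informative to see how often each base learner was chosen over the boosting iterations. Assuming a reasonably recent version of mboost (selected() is part of its interface, and varimp() is available in newer releases), something along these lines should work:

table(selected(fit))   # how often each base learner (bols, bbs, btree term) was selected
plot(varimp(fit))      # variable importance, if varimp() is available in your version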

Plots of the linear effects and the interaction map can be obtained by typing:

> plot(fit, which = 1)
> plot(fit, which = 2)
> plot(fit, which = 5)

The resulting plots are given in Figure 79.1 and Figure 79.2. Figure 79.1 indicates a strong linear effect for both waistcirc and hipcirc. Figure 79.2 indicates that a hip circumference larger than about 110 cm leads to increased body fat, provided waist circumference is larger than approximately 90 cm.

Figure 79.1: GAM regression estimates of linear effects for waistcirc and hipcirc

Figure 79.2: Interaction model component between waistcirc and hipcirc

Step 4: Make Predictions
Predictions using the test data set are made with the optimal model fit. A plot of the predicted and observed values (see Figure 79.3) shows a relatively linear relationship with DEXfat. The squared correlation coefficient is 0.867 (linear correlation = 0.93):

> pred <- predict(fit, newdata = bodyfat[-train, ], type = "response")

> plot(bodyfat$DEXfat[-train], pred, xlab = "DEXfat",
    ylab = "Predicted Values", main = "Training Sample Model Fit")

> round(cor(pred, bodyfat$DEXfat[-train])^2, 3)
      [,1]
[1,] 0.867

Figure 79.3: Plot of observed versus predicted values for the GAM regression

Technique 80

Quantile Regression

Quantile regression models the relationship between the independent variables and the conditional quantiles of the response variable. It provides a more complete picture of the conditional distribution of the response variable when both lower and upper, or all, quantiles are of interest.

For example, in the analysis of body mass both lower (underweight) and upper (overweight) quantiles are of interest to health professionals and health-conscious individuals. We apply the technique to the bodyfat data set to illustrate this point.

Quantile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = QuantReg(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = QuantReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.
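Boosted quantile regression minimizes the check (or "pinball") loss rather than the squared error. A minimal sketch of the loss for the 25th percentile (our own illustration):

check_loss <- function(u, tau) u * (tau - (u < 0))   # u = y - yhat
u <- seq(-2, 2, by = 0.1)
plot(u, check_loss(u, tau = 0.25), type = "l", ylab = "loss",
     main = "Check loss, tau = 0.25")
# with tau = 0.25 negative residuals (over-prediction) are penalized
# three times as heavily as positive residuals of the same size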

Step 1: Load Required Packages
We illustrate the use of this model using the bodyfat data frame. Details of this data set are discussed on page 62:

> library(mboost)
> data(bodyfat, package = "TH.data")
> set.seed(465)

Step 2: Estimate Model & Assess Fit
The question we address is: does a common set of factors explain the 25th and 75th percentiles of the data set? To answer it we estimate two quantile regression models, one at the 25th percentile and the other at the 75th percentile:

> fit25 <- glmboost(DEXfat ~ ., data = bodyfat,
    family = QuantReg(tau = 0.25),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> fit75 <- glmboost(DEXfat ~ ., data = bodyfat,
    family = QuantReg(tau = 0.75),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

The coefficients of the 25th percentile regression are:

> coef(fit25, off2int = TRUE)
 (Intercept)          age    waistcirc 
-58.24524124   0.03352251   0.16259137 
     hipcirc elbowbreadth  kneebreadth 
  0.32444309  -0.19165652   0.81968750 
    anthro3a     anthro3b     anthro3c 
  1.27773808   1.69581972   3.90369290 
     anthro4 
  0.60159287 

The coefficients of the 75th percentile regression are:

> coef(fit75, off2int = TRUE)
(Intercept)         age   waistcirc 
-64.4564045   0.0130879   0.2779561 
    hipcirc elbowbreadth kneebreadth 
  0.3954600  -0.1283315   0.8464489 
   anthro3a    anthro3b    anthro3c 
  2.5044739   2.2238765   0.9028559 

There appears to be considerable overlap in the set of explanatory variables which explain both percentiles. However, it is noticeable that anthro4 only appears in the 25th percentile regression. To investigate further, we use a 10-fold cross-validated estimate of the empirical risk to choose the optimal number of boosting iterations for each quantile regression model. The empirical risk is calculated using the cvrisk function:

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)
> cvm25 <- cvrisk(fit25, folds = cv10f25)

> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)
> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 994

> mstop(cvm75)
[1] 1200

> fit25[mstop(cvm25)]
> fit75[mstop(cvm75)]

The optimal stopping iteration is 994 for the 25th percentile model and 1,200 for the 75th percentile model.

Given our optimal models, we calculate a bootstrapped confidence interval (90% level) around each of the parameters. For illustration we only use 50 bootstraps; in practice you should use at least 1,000:

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

The results for the 25th percentile model are:

> CI25

        Bootstrap Confidence Intervals
                       5%           95%
(Intercept)  -62.91485919  -43.84055225
age           -0.02032722    0.06116936
waistcirc      0.02054691    0.36351636
hipcirc        0.14107346    0.48121885
elbowbreadth  -1.02553929    0.23022726
kneebreadth   -0.03557848    1.37440528
anthro3a       0.00000000    4.11999837
anthro3b       0.00000000    3.18697032
anthro3c       2.09103968    6.49277249
anthro4        0.00000000    1.36671922

The results for the 75th percentile model are:

> CI75

        Bootstrap Confidence Intervals
                       5%           95%
(Intercept)  -72.77094640  -49.1977452
age           -0.00093535    0.0549349
waistcirc      0.10970068    0.3596424
hipcirc        0.16935863    0.5051908
elbowbreadth  -1.36476008    0.5104004
kneebreadth    0.00000000    1.9214064
anthro3a      -0.76595108    6.1881146
anthro3b       0.00000000    6.6016240
anthro3c      -0.31714787    5.6599846
anthro4       -0.27462792    2.8194127

Both models have waistcirc and hipcirc as significant explanatory variables. anthro3c is the only other significant variable, and then only for the lower weight percentile.

Technique 81

Expectile Regression

Expectile regression136 is used for estimating the conditional expectiles of a response variable given a set of attributes. Having multiple expectiles at different levels provides a more complete picture of the conditional distribution of the response variable.

A boosted version of expectile regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = ExpectReg(), control, ...)

Key parameters include z, the continuous response variable; data, the data set of independent variables; family = ExpectReg(); and control, which limits the number of boosting iterations and the shrinkage parameter.
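Expectile regression replaces the piecewise-linear check loss of quantile regression with an asymmetrically weighted squared loss (Newey & Powell, 1987). A minimal sketch (our own illustration):

expectile_loss <- function(u, tau) abs(tau - (u < 0)) * u^2   # u = y - yhat
u <- seq(-2, 2, by = 0.1)
plot(u, expectile_loss(u, tau = 0.25), type = "l", ylab = "loss",
     main = "Asymmetric squared loss, tau = 0.25")
# the weight abs(tau - I(u < 0)) down-weights positive residuals when tau < 0.5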

We illustrate the use of expectile regression using the bodyfat data frame. Details of this data set are discussed on page 62. We continue our analysis from page 544, taking the final models developed for the quantile regression and fitting them using ExpectReg() (note: Step 1 is outlined on page 544).

Step 2: Estimate Model & Assess Fit
We set fit25 as the 25th expectile regression using the three significant variables identified by the quantile regression (waistcirc, hipcirc, anthro3c; see page 544). We also use fit75 as the 75th expectile regression, using the two significant variables (waistcirc and hipcirc) identified via quantile regression.

Ten-fold cross validation is used to determine the optimal number of iterations for each of the models. The optimal number is captured using mstop(cvm25) and mstop(cvm75) for the 25th expectile regression and the 75th expectile regression respectively.

> fit25 <- glmboost(DEXfat ~ waistcirc + hipcirc + anthro3c,
    data = bodyfat, family = ExpectReg(tau = 0.25),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> fit75 <- glmboost(DEXfat ~ waistcirc + hipcirc,
    data = bodyfat, family = ExpectReg(tau = 0.75),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> cv10f25 <- cv(model.weights(fit25), type = "kfold", B = 10)
> cvm25 <- cvrisk(fit25, folds = cv10f25)

> cv10f75 <- cv(model.weights(fit75), type = "kfold", B = 10)
> cvm75 <- cvrisk(fit75, folds = cv10f75)

> mstop(cvm25)
[1] 219

> mstop(cvm75)
[1] 826

> fit25[mstop(cvm25)]

Notice that the optimal number of iterations is 219 and 826 for the 25th expectile regression and the 75th expectile regression respectively. The 90% confidence intervals are estimated for each model (we use a small bootstrap of B = 50; in practice you will want to use at least 1,000):

> CI25 <- confint(fit25, B = 50, level = 0.9)
> CI75 <- confint(fit75, B = 50, level = 0.9)

> CI25

        Bootstrap Confidence Intervals
                     5%          95%
(Intercept)  -61.0691823  -48.9044440
waistcirc      0.1035435    0.2891155
hipcirc        0.3204619    0.5588739
anthro3c       4.4213144    7.2163618

> CI75

        Bootstrap Confidence Intervals
                     5%          95%
(Intercept)  -60.1313272  -41.8993208
waistcirc      0.2478238    0.4967517
hipcirc        0.2903714    0.6397651

As we might have expected, all of the variables included in each model are statistically significant. Both models have waistcirc and hipcirc as significant explanatory variables. It seems anthro3c is important in the lower weight percentile. These findings are similar to those observed for the boosted quantile regression models discussed on page 544.

Discrete Response Boosted Regression

Technique 82

Logistic Regression

Boosted logistic regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("logit")), control, ...)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages
We will build the model using the Sonar data set discussed on page 482:

> library(mboost)
> data(Sonar, package = "mlbench")

Step 2: Prepare Data & Tweak Parameters
We use 157 randomly sampled observations to train the model and the remaining 51 observations as the test set:

> set.seed(107)
> n = nrow(Sonar)
> train <- sample(1:n, 157, FALSE)

Step 3: Estimate Model
We fit the model, setting the maximum number of iterations to 200. Then a 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. Empirical risk is calculated using the cvrisk function:

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
    family = Binomial(link = c("logit")),
    control = boost_control(mstop = 200, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 123

> fit[mstop(cvm)]

The optimal number is 123 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data
The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally, the confusion table is printed out:

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
      predicted
actual  M  R
     M 22  7
     R  4 18

The overall error rate of the model is (7 + 4)/51 = 21.6%.
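The same figure can be computed directly from the confusion table rather than by hand; the approach below mirrors the error-rate calculation used for the Proportional Odds Model later in this chapter (the object name tb is our own):

tb <- table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
error <- 1 - sum(diag(tb)) / sum(tb)   # off-diagonal share of the test set
round(error, 3)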


Technique 83

Probit Regression

Boosted probit regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Binomial(link = c("probit")), control, ...)

Key parameters include z, the discrete response variable; data, the data set of independent variables; family = Binomial(); and control, which limits the number of boosting iterations and the shrinkage parameter. Steps 1 and 2 are outlined on page 552.

Step 3: Estimate Model
The model is fit by setting the maximum number of iterations to 200. Then a 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function:

> fit <- glmboost(Class ~ ., data = Sonar[train, ],
    family = Binomial(link = c("probit")),
    control = boost_control(mstop = 200, nu = 0.1, trace = FALSE))

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)

> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 199

> fit[mstop(cvm)]

The optimal number is 199 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration.

Step 4: Classify Data
The optimal model is used to classify the test data set. A threshold parameter (thresh) is used to translate the model scores into the classes "M" and "R". Finally, the confusion table is printed out:

> pred <- predict(fit, newdata = Sonar[-train, ], type = "response")

> thresh <- 0.5
> predFac <- cut(pred, breaks = c(-Inf, thresh, Inf), labels = c("M", "R"))

> table(Sonar$Class[-train], predFac, dnn = c("actual", "predicted"))
      predicted
actual  M  R
     M 21  8
     R  4 18

The overall error rate of the model is (8 + 4)/51 = 23.5%.

Boosted Regression for Count & Ordinal Response Data

Technique 84

Poisson Regression

Modeling count variables is a common task for the data scientist. When the response variable (y_i) is a count variable following a Poisson distribution with a mean that depends on the covariates x_1, ..., x_k, the Poisson regression model is used. A boosted Poisson regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Poisson(), control, ...)

Key parameters include z, the response variable; data, the data set of independent variables; family = Poisson(); and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

If y_i ~ Poisson(λ_i), then the mean is equal to λ_i and the variance is also equal to λ_i. Therefore, in a Poisson regression both the mean and the variance depend on the covariates, i.e.

\ln(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}    (84.1)

Step 1: Load Required Packages
The required packages and data are loaded as follows:

> library(mboost)
> library(MixAll)
> data(DebTrivedi)
> set.seed(465)

> f <- ofp ~ hosp + health + numchron + gender + school + privins

The number of physician office visits, ofp, is the response variable. The covariates are hosp (number of hospital stays), health (self-perceived health status) and numchron (number of chronic conditions), as well as the socioeconomic variables gender, school (number of years of education) and privins (private insurance indicator).
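Before fitting a Poisson model it is worth checking whether the implied equality of mean and variance is plausible for ofp. A minimal check (our own addition to the example):

mean(DebTrivedi$ofp)   # mean number of office visits
var(DebTrivedi$ofp)    # variance; if far larger than the mean the counts are
                       # over-dispersed and a Negative Binomial model (Technique 85)
                       # may be more appropriate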

Step 2: Estimate Model & Assess Fit
The model is fit using glmboost with the maximum number of iterations equal to 1,200. The parameter estimates are shown in Table 25:

> fit <- glmboost(f, data = DebTrivedi, family = Poisson(),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor 
          1.029           0.165           0.248 
healthexcellent        numchron      gendermale 
         -0.362           0.147          -0.112 
         school      privinsyes 
          0.026           0.202 

Table 25: Initial coefficient estimates

A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 26:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 36

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor 
          1.039           0.166           0.243 
healthexcellent        numchron      gendermale 
         -0.347           0.147          -0.107 
         school      privinsyes 
          0.026           0.197 

Table 26: Cross validated coefficient estimates

The optimal number of iterations is 36 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are close to those observed prior to the 10-fold cross validation.

Finally, we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000):

> CI <- confint(fit, B = 50, level = 0.9)

> CI

        Bootstrap Confidence Intervals
                         5%           95%
(Intercept)       0.9589353    1.18509474
hosp              0.1355033    0.21464596
healthpoor        0.1663162    0.32531872
healthexcellent  -0.4494058   -0.21435073
numchron          0.1207087    0.16322440
gendermale       -0.1591000   -0.04314812
school            0.0143659    0.03374010
privinsyes        0.1196714    0.23438877

All of the covariates are statistically significant and contribute to explaining the number of physician office visits.

Technique 85

Negative Binomial Regression

Negative binomial regression is often used for over-dispersed count outcome variables. A boosted negative binomial regression can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = NBinomial(), control, ...)

Key parameters include z, the count response variable; data, the data set of independent variables; family = NBinomial(); and control, which limits the number of boosting iterations and the shrinkage parameter.

PRACTITIONER TIP

A feature of the data frame DebTrivedi reported by the original researchers (see page 86) is that the data have a high degree of unconditional over-dispersion relative to the standard Poisson model. Over-dispersion simply means that the data have greater variability than would be expected under the given statistical model (in this case, Poisson). One way to handle over-dispersion is to use the negative binomial regression model.

Step 1 is outlined on page 558.

Step 2: Estimate Model & Assess Fit
We continue with the DebTrivedi data frame and estimate a negative binomial regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 1,200. The parameter estimates are shown in Table 27:

> fit <- glmboost(f, data = DebTrivedi, family = NBinomial(c(0, 100)),
    control = boost_control(mstop = 1200, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor 
          0.929           0.218           0.305 
healthexcellent        numchron      gendermale 
         -0.342           0.175          -0.126 
         school      privinsyes 
          0.027           0.224 

Table 27: Initial coefficient estimates

Although the values of the estimates are somewhat different from those on page 559, the signs of the coefficients remain consistent.

A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 28:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 312

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor 
          0.940           0.216           0.300 
healthexcellent        numchron      gendermale 
         -0.336           0.174          -0.122 
         school      privinsyes 
          0.026           0.220 

Table 28: Cross validated coefficient estimates

The optimal number of iterations is 312 (from mstop(cvm)). Therefore we use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross validation.

Finally, we estimate a 90% confidence interval for each of the parameters using a small bootstrap sample of 50 (in practice you should use at least 1,000):

> CI <- confint(fit, B = 50, level = 0.9)

> CI

        Bootstrap Confidence Intervals
                          5%           95%
(Intercept)       0.88730254    1.06806832
hosp              0.17222572    0.24115790
healthpoor        0.21205691    0.35635293
healthexcellent  -0.47308517   -0.18802910
numchron          0.14796503    0.19085736
gendermale       -0.18601165   -0.05170944
school            0.01638135    0.03447440
privinsyes        0.13273522    0.25977131

All of the covariates are statistically significant.


Technique 86

Hurdle Regression

Hurdle regression is used for modeling count data where there is over-dispersion and an excess of zero counts in the outcome variable. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Hurdle(), control, ...)

Key parameters include z, the response variable; data, the data set of independent variables; family = Hurdle(); and control, which limits the number of boosting iterations and the shrinkage parameter.

NOTE

A feature of the DebTrivedi data frame reported by the researchers Deb and Trivedi (see page 86) is that the data include a high proportion of zero counts, corresponding to zero recorded demand over the sample interval. One way to handle an excess of zero counts is to fit a negative binomial regression model to the non-zero counts. This can be achieved using the Hurdle family. In the hurdle approach the process that determines the zero/non-zero threshold is different from the process that determines the count once the hurdle (zero in this case) is crossed. Once the hurdle is crossed, the data are assumed to follow the density of a truncated negative binomial distribution.
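A quick way to gauge whether excess zeros are an issue is to compare the observed share of zero counts with the share a Poisson distribution with the same mean would imply. A minimal sketch (our own check; the object names are arbitrary):

obs.zero  <- mean(DebTrivedi$ofp == 0)               # observed proportion of zeros
pois.zero <- dpois(0, lambda = mean(DebTrivedi$ofp)) # Poisson-implied proportion
c(observed = obs.zero, poisson = pois.zero)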

Step 1 is outlined on page 558.

Step 2: Estimate Model & Assess Fit
We continue with the DebTrivedi data frame and estimate a hurdle regression using the same set of covariates as discussed on page 558. The model is estimated using glmboost with the maximum number of iterations equal to 3,000. The parameter estimates are shown in Table 29:

> fit <- glmboost(f, data = DebTrivedi, family = Hurdle(c(0, 100)),
    control = boost_control(mstop = 3000, nu = 0.1, trace = FALSE))

> round(coef(fit, off2int = TRUE), 3)

    (Intercept)            hosp      healthpoor 
         -3.231           0.352           0.533 
healthexcellent        numchron      gendermale 
         -0.586           0.299          -0.206 
         school      privinsyes 
          0.044           0.395 

Table 29: Initial coefficient estimates

    (Intercept)            hosp      healthpoor 
         -3.189           0.341           0.509 
healthexcellent        numchron      gendermale 
         -0.567           0.294          -0.192 
         school      privinsyes 
          0.042           0.378 

Table 30: Cross validated coefficient estimates

Although the sign of the intercept and the values of the estimated coefficients are somewhat different from those on page 559, the signs of the slope coefficients are preserved. A 10-fold cross-validated estimate of the empirical risk is used to choose the optimal number of boosting iterations. The empirical risk is calculated using the cvrisk function. The parameter estimates are shown in Table 30:

> cv10f <- cv(model.weights(fit), type = "kfold", B = 10)
> cvm <- cvrisk(fit, folds = cv10f)

> mstop(cvm)
[1] 1489

The optimal number is 1,489 (from mstop(cvm)). We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3). We note that all the parameter estimates for the optimal fit model are very close to those observed prior to the 10-fold cross validation.


Technique 87

Proportional Odds Model

The Proportional Odds Model is a class of generalized linear models used for modelling the dependence of an ordinal response on discrete or continuous covariates.
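In its usual parameterization (stated here for completeness; the notation is ours and follows the standard cumulative-logit formulation) the model links the cumulative probabilities of the ordered categories to the covariates through a common set of slopes:

\mathrm{logit}\, P(Y \le j \mid x_1, \ldots, x_k) = \theta_j - (\beta_1 x_1 + \cdots + \beta_k x_k), \qquad j = 1, \ldots, J - 1

so the covariate effects β are assumed to be the same ("proportional") across every threshold θ_j.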

It is used when it is not possible to measure the response variable on an interval scale. In biomedical research, for instance, constructs such as self-perceived health can be measured on an ordinal scale ("very unhealthy", "unhealthy", "healthy", "very healthy"). A boosted version of the model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = PropOdds(), control, ...)

Key parameters include z, the response, which is an ordered factor; data, the data set of explanatory variables; family = PropOdds(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data
The required packages and data are loaded as follows:

> library(mboost)
> library(ordinal)
> data(wine)
> set.seed(125)

The package ordinal contains the data frame wine, which is used in the analysis. It also contains the function clm, which we use later to estimate a non-boosted version of the Proportional Odds Model. Further details of wine are given on page 95.

Step 2: Estimate Model & Assess Fit
We estimate the model using rating as the response variable and contact and temp as the covariates. The model is estimated using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(rating ~ temp + contact, data = wine,
    family = PropOdds(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)
> mstop(cvm)
[1] 167

> round(cvm[mstop(cvm)], 3)
[1] 1.42

> fit[mstop(cvm)]

The optimal number of iterations is 167 (from mstop(cvm)), with an empirical risk of 1.42. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int = TRUE), 3).

The estimate of a Proportional Odds Model using the function clm is also reported:

> round(coef(fit, off2int = TRUE), 3)
(Intercept)    tempwarm  contactyes 
     -1.508       2.148       1.230 

> fit2 <- clm(rating ~ temp + contact, data = wine, link = "logit")

> round(fit2$coefficients[5:6], 3)
  tempwarm contactyes 
     2.503      1.528 

Although the coefficients differ between the boosted and non-boosted versions of the model, both models yield the same class predictions:

> pred <- predict(fit, type = "class")

> pred2 <- predict(fit2, type = "class")

> table(pred)
pred
 1  2  3  4  5 
 0 18 36 18  0 

> table(pred2)
pred2
 1  2  3  4  5 
 0 18 36 18  0 

Finally, we compare the fitted model with the actual observations and calculate the overall error rate:

> tb <- table(wine$rating, pred, dnn = c("actual", "predicted"))

> tb
      predicted
actual  1  2  3  4  5
     1  0  4  1  0  0
     2  0  9 12  1  0
     3  0  5 16  5  0
     4  0  0  5  7  0
     5  0  0  2  5  0

> error <- 1 - (sum(diag(tb)) / sum(tb))

> round(error, 3)
[1] 0.556

The error rate for the model is approximately 56%.

Boosted Models for Survival Analysis

NOTE

Survival analysis is concerned with studying the time between entry to a study and a subsequent event (such as death). The objective is to use a statistical model to simultaneously explore the effects of several explanatory variables on survival. Two popular approaches are the Cox Proportional Hazard Model and Accelerated Failure Time Models.

Cox Proportional Hazard Model
The Cox Proportional Hazard Model is a statistical technique for exploring the relationship between survival (typically of a patient) and several explanatory variables. It takes the form

h_i(t) = \exp(\beta_1 X_{1i} + \cdots + \beta_k X_{ki}) \, h_0(t)    (87.1)

where h_i(t) is the hazard function for the ith individual at time t, h_0(t) is the baseline hazard function, and X_1, ..., X_k are the explanatory covariates.

The model provides an estimate of the treatment effect on survival after adjustment for other explanatory variables. In addition, it is widely used in medical statistics because it provides an estimate of the hazard (or risk) of death for an individual, given their prognostic variables.
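As a concrete illustration of equation 87.1 (our own worked example rather than one from the text), consider a single binary treatment indicator X_1. The ratio of the hazards for two otherwise identical individuals, one treated and one not, is

\frac{h_i(t \mid X_1 = 1)}{h_i(t \mid X_1 = 0)} = \frac{\exp(\beta_1 \cdot 1)\, h_0(t)}{\exp(\beta_1 \cdot 0)\, h_0(t)} = \exp(\beta_1)

so exp(β_1) can be read directly as the hazard ratio associated with treatment, constant over time.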

NOTE

The hazard function is the probability that an individual will experience an event (for example death) within a small time interval, given that the individual has survived up to the beginning of the interval. In the medical context it can therefore be interpreted as the risk of dying at time t.

Accelerated Failure Time Models
Parametric accelerated failure time (AFT) models provide an alternative to the (semi-parametric) Cox proportional hazards model for statistical modeling of survival data.137 Unlike the Cox proportional hazards model, the AFT approach models survival times directly and assumes that the effect of a covariate is to accelerate or decelerate the life course of a response by some constant.

The AFT model treats the logarithm of survival time as the response variable and includes an error term that is assumed to follow a particular distribution.

Equation 87.2 shows the log-linear form of the AFT model for the ith individual, where log T_i is the log-transformed survival time, X_1, ..., X_k are the explanatory covariates, ε_i represents the residual or unexplained variation in the log-transformed survival times, and μ and σ are the intercept and scale parameter respectively:138

\log T_i = \mu + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \sigma \varepsilon_i    (87.2)

Under the AFT model parameterization, the distribution chosen for T_i dictates the distribution of the error term ε_i. Popular survival time distributions include the Weibull distribution, the log-logistic distribution and the log-normal distribution.

PRACTITIONER TIP

If the baseline hazard function is known to follow a Weibull distribution, the accelerated failure time and proportional hazards assumptions are equivalent.

Model               Empirical Risk
Weibull AFT              0.937
Lognormal AFT            1.046
Log-logistic AFT         0.934
Cox PH                   2.616
Gehan                    0.267

Table 31: Estimation of empirical risk using the rhDNase data frame

Assessing Fit
One of the first steps the data scientist faces in fitting survival models is to determine which distribution should be specified for the survival times T_i. One approach is to fit a model for each distribution and choose the model which minimizes Akaike's Information Criterion (AIC)139 or similar criteria. An alternative is to choose the model which minimizes the cross-validated estimate of the empirical risk.

As an example, Table 31 shows the cross-validated estimate of the empirical risk for various models using the rhDNase data frame. A sketch of the AIC-based alternative is given below.
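If you prefer the AIC route, the following minimal sketch (our own illustration, using the survreg function from the survival package and the rhDNase data introduced in Technique 88 below) fits a standard, non-boosted AFT model for each candidate distribution and compares the AIC values:

library(survival)
library(simexaft)
data(rhDNase)
rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2) / 2

# fit one parametric AFT model per candidate distribution and compare AIC
dists <- c("weibull", "lognormal", "loglogistic")
sapply(dists, function(d)
  AIC(survreg(Surv(time2, status) ~ trt + fev.ave,
              data = rhDNase, dist = d)))
# the distribution with the smallest AIC is preferred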

Technique 88

Weibull Accelerated Failure Time Model

The Weibull Accelerated Failure Time Model is one of the most popular distributional choices for modeling survival data. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Weibull(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Weibull(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Step 1: Load Required Packages & Tweak Data
The required packages and data are loaded as follows:

> library(mboost)
> library(simexaft)
> library(survival)
> data(rhDNase)
> set.seed(465)

Forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization (rhDNase$fev and rhDNase$fev2). We take the average of the two measurements as an explanatory variable. The response is defined via the time from randomization to the first pulmonary exacerbation, captured in the survival object Surv(time2, status) (the AFT model works with the logarithm of this time):

> rhDNase$fev.ave <- (rhDNase$fev + rhDNase$fev2)/2
> z <- Surv(rhDNase$time2, rhDNase$status)

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(Surv(time2, status) ~ trt + fev.ave,
    data = rhDNase, family = Weibull(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 500

> round(cvm[mstop(cvm)], 3)
[1] 0.937

> fit[mstop(cvm)]

The optimal number of iterations is 500 (from mstop(cvm)), with an empirical risk of 0.937. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Weibull AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
    data = rhDNase, dist = "weibull")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.531       0.348       0.019 

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.518       0.357       0.019 

> round(nuisance(fit), 3)
[1] 0.92

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a small difference in the estimated values of the intercept and trt.

Technique 89

Lognormal Accelerated Failure Time Model

The Lognormal Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Lognormal(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Lognormal(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Lognormal(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 456

> round(cvm[mstop(cvm)], 3)
[1] 1.046

> fit[mstop(cvm)]

The optimal number of iterations is 456 (from mstop(cvm)), with an empirical risk of 1.046. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Lognormal AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
    data = rhDNase, dist = "lognormal")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.103       0.408       0.021 

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.083       0.424       0.022 

> round(nuisance(fit), 3)
[1] 1.441

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a much larger difference in the estimated value of trt (0.408 boosted AFT versus 0.424 non-boosted Lognormal AFT).

Technique 90

Log-logistic Accelerated Failure Time Model

The Log-logistic Accelerated Failure Time Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Loglog(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = Loglog(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Loglog(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 445

> round(cvm[mstop(cvm)], 3)
[1] 0.934

> fit[mstop(cvm)]

The optimal number of iterations is 445 (from mstop(cvm)), with an empirical risk of 0.934. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

We then fit a non-boosted Log-logistic AFT model using the survreg function from the survival package and compare the parameter estimates to the boosted model. This is followed by obtaining an estimate of the boosted model's scale parameter using the function nuisance():

> fit1 <- survreg(Surv(time2, status) ~ trt + fev.ave,
    data = rhDNase, dist = "loglogistic")

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.110       0.384       0.020 

> round(coef(fit1, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      4.093       0.396       0.021 

> round(nuisance(fit), 3)
[1] 0.794

The boosted model has a similar fev.ave coefficient to the non-boosted AFT model. There is a larger difference in the estimated value of trt (0.384 boosted AFT versus 0.396 non-boosted Log-logistic AFT).

Technique 91

Cox Proportional Hazard Model

The Cox Proportional Hazard Model can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = CoxPH(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status), where Surv is a survival object constructed using the survival package); data, the data set of explanatory variables; family = CoxPH(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = CoxPH(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 263

> round(cvm[mstop(cvm)], 3)
[1] 2.616

> fit[mstop(cvm)]

The optimal number of iterations is 263 (from mstop(cvm)), with an empirical risk of 2.616. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the estimates using round(coef(fit, off2int = TRUE), 3).

PRACTITIONER TIP

With boosting, the whole business of model diagnostics is greatly simplified. For example, you can quickly boost an additive non-proportional hazards model in order to check whether it fits your data better than a linear Cox model. If the two models perform roughly equivalently, you know that assuming proportional hazards and linearity is reasonable; a sketch of this check follows.
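A minimal sketch of that check (our own illustration, reusing the objects z, rhDNase and fev.ave created in Step 1 and the bols/bbs base learners of mboost): fit an additive Cox model with a smooth term for fev.ave and compare its cross-validated risk profile with that of the linear fit.

# additive (potentially non-linear) boosted Cox model
fit.add <- gamboost(z ~ bols(trt) + bbs(fev.ave), data = rhDNase,
                    family = CoxPH(),
                    control = boost_control(mstop = 1200, nu = 0.1))
cv5f.add <- cv(model.weights(fit.add), type = "kfold", B = 5)
cvm.add  <- cvrisk(fit.add, folds = cv5f.add)
mstop(cvm.add)   # optimal number of boosting iterations
plot(cvm.add)    # if the risk profile is no better than the linear fit,
                 # the proportional hazards and linearity assumptions look reasonable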

We then fit a non-boosted Cox Proportional Hazard Model using the coxph function from the survival package and compare the parameter estimates to the boosted model:

> fit1 <- coxph(Surv(time2, status) ~ trt + fev.ave, data = rhDNase)

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      1.431      -0.370      -0.020 

> round(coef(fit1, off2int = TRUE), 3)
    trt fev.ave 
 -0.378  -0.021 

Both models have similar treatment effects.

NOTE

For the Cox Proportional Hazard model a positive regression coefficient for an explanatory variable means that the hazard is higher, and thus the prognosis worse. Conversely, a negative regression coefficient implies a better prognosis for patients with higher values of that variable.

In Figure 91.1 we plot the predicted survivor function:

> S1 <- survFit(fit)
> plot(S1)

Figure 91.1: Cox Proportional Hazard Model predicted survivor function for rhDNase

Technique 92

Gehan Loss Accelerated Failure Time Model

The Gehan Loss Accelerated Failure Time Model calculates a rank-based estimate of survival data, where the loss function is defined as the sum of the pairwise absolute differences of the residuals. It can be run using the package mboost with the glmboost function:

glmboost(z ~ ., data, family = Gehan(), control, ...)

Key parameters include z, the survival response variable (note we set z = Surv(time2, status)); data, the data set of explanatory variables; family = Gehan(); and control, which limits the number of boosting iterations and the shrinkage parameter.

Details on Step 1 are given on page 573.

Step 2: Estimate Model & Assess Fit
We estimate a model using z as the response variable, with rhDNase$fev.ave and trt (a treatment assignment indicator taking the value 1 if the patient received rhDNase and 0 if the patient received placebo) as covariates. The model is fit using glmboost with the maximum number of iterations equal to 1,200. This is followed by a five-fold cross validation using the minimum empirical risk to determine the optimal number of iterations:

> fit <- glmboost(z ~ trt + fev.ave, data = rhDNase,
    family = Gehan(),
    control = boost_control(mstop = 1200, nu = 0.1))

> cv5f <- cv(model.weights(fit), type = "kfold", B = 5)
> cvm <- cvrisk(fit, folds = cv5f)

> mstop(cvm)
[1] 578

> round(cvm[mstop(cvm)], 3)
[1] 0.267

> fit[mstop(cvm)]

> round(coef(fit, off2int = TRUE), 3)
(Intercept)         trt     fev.ave 
      3.718       0.403       0.022 

The optimal number of iterations is 578 (from mstop(cvm)), with an empirical risk of 0.267. We use fit[mstop(cvm)] to capture the model estimates at this iteration and print out the parameter estimates using round(coef(fit, off2int = TRUE), 3).


Notes

115. Bauer, Eric and Ron Kohavi. "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants." Machine Learning 36.1 (1998): 2.

116. Friedman, Jerome, Trevor Hastie and Robert Tibshirani. "Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)." The Annals of Statistics 28.2 (2000): 337-407.

117. A type of proximal gradient method for learning from data. See for example:
  1. Parikh, Neal, and Stephen Boyd. "Proximal algorithms." Foundations and Trends in Optimization 1.3 (2013): 123-231.
  2. Polson, Nicholas G., James G. Scott, and Brandon T. Willard. "Proximal Algorithms in Statistics and Machine Learning." arXiv preprint arXiv:1502.03175 (2015).

118. Schapire, Robert E. "The strength of weak learnability." Machine Learning 5.2 (1990): 197-227.

119. Cheepurupalli, Kusma Kumari, and Raja Rajeswari Konduri. "Noisy reverberation suppression using adaboost based EMD in underwater scenario." International Journal of Oceanography 2014 (2014).

120. For details on EMD and its use see the classic paper by Rilling, Gabriel, Patrick Flandrin, and Paulo Goncalves. "On empirical mode decomposition and its algorithms." IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Vol. 3, NSIP-03, Grado (I), 2003.

121. See for example:
  1. Kusma Kumari and K. Raja Rajeshwari. "Application of EMD as a robust adaptive signal processing technique in radar/sonar communications." International Journal of Engineering Science and Technology (IJEST), vol. 3, no. 12, pp. 8262-8266, 2011.
  2. Kusma Kumari, Ch. and K. Raja Rajeswari. "Enhancement of performance measures using EMD in noise reduction application." International Journal of Computer Applications, vol. 70, no. 5, pp. 10-14, 2013.

122. Karabulut, Esra Mahsereci, and Turgay Ibrikci. "Analysis of Cardiotocogram Data for Fetal Distress Determination by Decision Tree Based Adaptive Boosting Approach." Journal of Computer and Communications 2.09 (2014): 32.

123. See Newman, D.J., Heittech, S., Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. University of California Irvine, Department of Information and Computer Science.

124. Creamer, Germán and Yoav Freund. "Automated trading with boosting and expert weighting." Quantitative Finance 10.4 (2010): 401-420.

125. Sam, Kam-Tong and Xiao-Lin Tian. "Vehicle logo recognition using modest adaboost and radial tchebichef moments." International Conference on Machine Learning and Computing (ICMLC 2012), 2012.

126. Markoski, Branko, et al. "Application of AdaBoost Algorithm in Basketball Player Detection." Acta Polytechnica Hungarica 12.1 (2015).

127. Takiguchi, Tetsuya, et al. "An adaboost-based weighting method for localizing human brain magnetic activity." Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 2012.

128. Note that each sensor weight value calculated by AdaBoost provides an indication of how useful the MEG-sensor pair is for vowel recognition.

129. See the original paper by Freund, Y. and Schapire, R. (1996). "Experiments with a New Boosting Algorithm." In International Conference on Machine Learning, pp. 148-156.

130. See the seminal work of Bauer, E. and Kohavi, R. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants." Journal of Machine Learning 1999;36:105-139. They report a 27% relative improvement using AdaBoost compared to a single decision tree. Also take a look at the following:
  1. Ridgeway, G. "The State of Boosting." Computing Science and Statistics 1999;31:172-181.
  2. Meir, R. and Ratsch, G. "An Introduction to Boosting and Leveraging." Advanced Lectures on Machine Learning, 2003, pp. 118-183.

131. Gorman, R. Paul, and Terrence J. Sejnowski. "Analysis of hidden units in a layered network trained to classify sonar targets." Neural Networks 1.1 (1988): 75-89.

132. See Cohen, J. (1960). "A coefficient of agreement for nominal data." Educational and Psychological Measurement 20: 37-46.

133. See for example:
  1. Buhlmann, P. and Hothorn, T. "Boosting Algorithms: Regularization, Prediction and Model Fitting (with Discussion)." Statistical Science 2007;22:477-522.
  2. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms. CRC Machine Learning & Pattern Recognition. Chapman & Hall, 2012.
  3. Schapire, R.E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

134. See Hastie, T., Tibshirani, R. and Friedman, J. H. (2009). "10. Boosting and Additive Trees." The Elements of Statistical Learning (2nd ed.). New York: Springer, pp. 337-384.

135. SAMME stands for stagewise additive modeling using a multi-class exponential loss function.

136. For additional details see Newey, W. & Powell, J. (1987). "Asymmetric least squares estimation and testing." Econometrica 55, 819-847.

137. See for example Wei, L.J. "The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis." Statistics in Medicine 1992;11:1871-1879.

138. See Collett, D. Modelling Survival Data in Medical Research, 2nd ed. CRC Press, Boca Raton, 2003.

139. See Akaike, H. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 1974;19:716-723.


Congratulations!
You made it to the end. Here are three things you can do next:

1. Pick up your free copy of 12 Resources to Supercharge Your Productivity in R at http://www.auscov.com/tools.html

2. Gift a copy of this book to your friends, co-workers, teammates or your entire organization.

3. If you found this book useful and have a moment to spare, I would really appreciate a short review. Your help in spreading the word is gratefully received.

Good luck!

P.S. Thanks for allowing me to partner with you on your data analysis journey.

OTHER BOOKS YOU WILL ALSO ENJOY

Over 100 Statistical Tests at Your Fingertips
100 Statistical Tests in R is designed to give you rapid access to one hundred of the most popular statistical tests.

It shows you step by step how to carry out these tests in the free and popular R statistical package.

The book was created for the applied researcher whose primary focus is on their subject matter rather than mathematical lemmas or statistical theory.

Step by step examples of each test are clearly described and can be typed directly into R as printed on the page.

To accelerate your research ideas, over three hundred applications of statistical tests across engineering, science and the social sciences are discussed.

100 Statistical Tests in R - ORDER YOUR COPY TODAY


They Laughed As They Gave Me The Data To Analyze... But Then They Saw My Charts
Wish you had fresh ways to present data, explore relationships, visualize your data and break free from mundane charts and diagrams?

Visualizing complex relationships with ease using R begins here.

In this book you will find innovative ideas to unlock the relationships in your own data and create killer visuals to help you transform your next presentation from good to great.

Visualizing Complex Data Using R - ORDER YOUR COPY TODAY


  • Preface
  • How to Get the Most from this Book
  • I Decision Trees
  • Practical Applications
  • Classification Trees
  • Technique 1 Classification Tree
  • Technique 2 C50 Classification Tree
  • Technique 3 Conditional Inference Classification Tree
  • Technique 4 Evolutionary Classification Tree
  • Technique 5 Oblique Classification Tree
  • Technique 6 Logistic Model Based Recursive Partitioning
  • Technique 7 Probit Model Based Recursive Partitioning
  • Regression Trees for Continuous Response Variables
  • Technique 8 Regression Tree
  • Technique 9 Conditional Inference Regression Tree
  • Technique 10 Linear Model Based Recursive Partitioning
  • Technique 11 Evolutionary Regression Tree
  • Decision Trees for Count & Ordinal Response Data
  • Technique 12 Poisson Decision Tree
  • Technique 13 Poisson Model Based Recursive Partitioning
  • Technique 14 Conditional Inference Ordinal Response Tree
  • Decision Trees for Survival Analysis
  • Technique 15 Exponential Algorithm
  • Technique 16 Conditional Inference Survival Tree
  • II Support Vector Machines
  • Practical Applications
  • Support Vector Classification
  • Technique 17 Binary Response Classification with C-SVM
  • Technique 18 Multicategory Classification with C-SVM
  • Technique 19 Multicategory Classification with nu-SVM
  • Technique 20 Bound-constraint C-SVM classification
  • Technique 21 Weston - Watkins Multi-Class SVM
  • Technique 22 Crammer - Singer Multi-Class SVM
  • Support Vector Regression
  • Technique 23 SVM eps-Regression
  • Technique 24 SVM nu-Regression
  • Technique 25 Bound-constraint SVM eps-Regression
  • Support Vector Novelty Detection
  • Technique 26 One-Classification SVM
  • III Relevance Vector Machine
  • Practical Applications
  • Technique 27 RVM Regression
  • IV Neural Networks
  • Practical Applications
  • Examples of Neural Network Classification
  • Technique 28 Resilient Backpropagation with Backtracking
  • Technique 29 Resilient Backpropagation
  • Technique 30 Smallest Learning Rate
  • Technique 31 Probabilistic Neural Network
  • Technique 32 Multilayer Feedforward Neural Network
  • Examples of Neural Network Regression
  • Technique 33 Resilient Backpropagation with Backtracking
  • Technique 34 Resilient Backpropagation
  • Technique 35 Smallest Learning Rate
  • Technique 36 General Regression Neural Network
  • Technique 37 Monotone Multi-Layer Perceptron
  • Technique 38 Quantile Regression Neural Network
  • V Random Forests
  • Practical Applications
  • Technique 39 Classification Random Forest
  • Technique 40 Conditional Inference Classification Random Forest
  • Technique 41 Classification Random Ferns
  • Technique 42 Binary Response Random Forest
  • Technique 43 Binary Response Random Ferns
  • Technique 44 Survival Random Forest
  • Technique 45 Conditional Inference Survival Random Forest
  • Technique 46 Conditional Inference Regression Random Forest
  • Technique 47 Quantile Regression Forests
  • Technique 48 Conditional Inference Ordinal Random Forest
  • VI Cluster Analysis
  • Practical Applications
  • Partition Based Methods
  • Technique 49 K-Means
  • Technique 50 Clara Algorithm
  • Technique 51 PAM Algorithm
  • Technique 52 Kernel Weighted K-Means
  • Hierarchy Based Methods
  • Technique 53 Hierarchical Agglomerative Cluster Analysis
  • Technique 54 Agglomerative Nesting
  • Technique 55 Divisive Hierarchical Clustering
  • Technique 56 Exemplar Based Agglomerative Clustering
  • Fuzzy Methods
  • Technique 57 Rousseeuw-Kaufman's Fuzzy Clustering Method
  • Technique 58 Fuzzy K-Means
  • Technique 59 Fuzzy K-Medoids
  • Other Methods
  • Technique 60 Density-Based Cluster Analysis
  • Technique 61 K-Modes Clustering
  • Technique 62 Model-Based Clustering
  • Technique 63 Clustering of Binary Variables
  • Technique 64 Affinity Propagation Clustering
  • Technique 65 Exemplar-Based Agglomerative Clustering
  • Technique 66 Bagged Clustering
  • VII Boosting
  • Practical Applications
  • Binary Boosting
  • Technique 67 Ada Boost.M1
  • Technique 68 Real Ada Boost
  • Technique 69 Gentle Ada Boost
  • Technique 70 Discrete L2 Boost
  • Technique 71 Real L2 Boost
  • Technique 72 Gentle L2 Boost
  • Multi-Class Boosting
  • Technique 73 SAMME
  • Technique 74 Breiman's Extension
  • Technique 75 Freund's Adjustment
  • Continuous Response Boosted Regression
  • Technique 76 L2 Regression
  • Technique 77 L1 Regression
  • Technique 78 Robust Regression
  • Technique 79 Generalized Additive Model
  • Technique 80 Quantile Regression
  • Technique 81 Expectile Regression
  • Discrete Response Boosted Regression
  • Technique 82 Logistic Regression
  • Technique 83 Probit Regression
  • Boosted Regression for Count & Ordinal Response Data
  • Technique 84 Poisson Regression
  • Technique 85 Negative Binomial Regression
  • Technique 86 Hurdle Regression
  • Technique 87 Proportional Odds Model
  • Boosted Models for Survival Analysis
  • Technique 88 Weibull Accelerated Failure Time Model
  • Technique 89 Lognormal Accelerated Failure Time Model
  • Technique 90 Log-logistic Accelerated Failure Time Model
  • Technique 91 Cox Proportional Hazard Model
  • Technique 92 Gehan Loss Accelerated Failure Time Model
Page 479: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 480: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 481: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 482: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 483: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 484: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 485: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 486: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 487: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 488: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 489: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 490: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 491: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 492: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 493: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 494: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 495: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 496: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 497: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 498: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 499: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 500: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 501: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 502: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 503: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 504: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 505: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 506: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 507: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 508: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 509: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 510: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 511: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 512: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 513: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 514: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 515: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 516: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 517: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 518: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 519: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 520: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 521: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 522: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 523: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 524: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 525: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 526: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 527: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 528: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 529: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 530: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 531: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 532: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 533: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 534: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 535: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 536: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 537: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 538: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 539: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 540: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 541: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 542: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 543: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 544: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 545: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 546: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 547: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 548: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 549: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 550: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 551: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 552: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 553: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 554: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 555: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 556: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 557: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 558: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 559: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 560: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 561: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 562: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 563: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 564: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 565: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 566: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 567: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 568: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 569: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 570: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 571: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 572: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 573: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 574: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 575: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 576: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 577: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 578: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 579: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 580: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 581: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 582: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 583: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 584: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 585: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 586: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 587: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 588: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 589: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 590: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 591: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 592: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 593: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!
Page 594: 92 Applied Predictive Modeling Techniques in R: With step by step instructions on how to build them FAST!