Learning Apache Mahout Classification
Table of Contents
Learning Apache Mahout Classification
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Classification in Data Analysis
Introducing the classification
Application of the classification system
Working of the classification system
Classification algorithms
Model evaluation techniques
The confusion matrix
The Receiver Operating Characteristics (ROC) graph
Area under the ROC curve
The entropy matrix
Summary
2. Apache Mahout
Introducing Apache Mahout
Algorithms supported in Mahout
Reasons for Mahout being a good choice for classification
Installing Mahout
Building Mahout from source using Maven
Installing Maven
Building Mahout code
Setting up a development environment using Eclipse
Setting up Mahout for a Windows user
Summary
3. Learning Logistic Regression/SGD Using Mahout
Introducing regression
Understanding linear regression
Cost function
Gradient descent
Logistic regression
Stochastic Gradient Descent
Using Mahout for logistic regression
Summary
4. Learning the Naïve Bayes Classification Using Mahout
Introducing conditional probability and the Bayes rule
Understanding the Naïve Bayes algorithm
Understanding the terms used in text classification
Using the Naïve Bayes algorithm in Apache Mahout
Summary
5. Learning the Hidden Markov Model Using Mahout
Deterministic and nondeterministic patterns
The Markov process
Introducing the Hidden Markov Model
Using Mahout for the Hidden Markov Model
Summary
6. Learning Random Forest Using Mahout
Decision tree
Random forest
Using Mahout for Random forest
Steps to use the Random forest algorithm in Mahout
Summary
7. Learning Multilayer Perceptron Using Mahout
Neural network and neurons
Multilayer Perceptron
MLP implementation in Mahout
Using Mahout for MLP
Steps to use the MLP algorithm in Mahout
Summary
8. Mahout Changes in the Upcoming Release
Mahout new changes
Mahout Scala and Spark bindings
Apache Spark
Using Mahout's Spark shell
H2O platform integration
Summary
9. Building an E-mail Classification System Using Apache Mahout
Spam e-mail dataset
Creating the model using the Assassin dataset
Program to use a classifier model
Testing the program
Second use case as an exercise
The ASF e-mail dataset
Classifiers tuning
Summary
Index
Learning Apache Mahout Classification
Learning Apache Mahout Classification
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1210215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-495-9
www.packtpub.com
Credits
Author
Ashish Gupta
Reviewers
Siva Prakash
Tharindu Rusira
Vishnu Viswanath
Commissioning Editor
Akram Hussain
Acquisition Editor
Reshma Raman
Content Development Editor
Merwyn D'souza
Technical Editors
Monica John
Novina Kewalramani
Shruti Rawool
Copy Editors
Sarang Chari
Gladson Monteiro
Aarti Saldanha
Rashmi Sawant
Project Coordinator
Neha Bhatnagar
Proofreaders
Simran Bhogal
Steve Maguire
Indexer
Monica Ajmera Mehta
Graphics
Sheetal Aute
Abhinash Sahu
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
About the Author
Ashish Gupta has been working in the field of software development for the last 8 years. He has worked in different companies, such as SAP Labs and Caterpillar, as a software developer. While working for a start-up where he was responsible for predicting potential customers for new fashion apparel using social media, he developed an interest in the field of machine learning. Since then, he has worked on using big data technologies and machine learning for different industries, including retail, finance, insurance, and so on. He has a passion for learning new technologies and sharing the knowledge thus gained with others. He has organized many boot camps for the Apache Mahout and Hadoop ecosystem.
First of all, I would like to thank open source communities for their continuous efforts in developing great software for all. I would like to thank Merwyn D'Souza and Reshma Raman, my editors for this project. Special thanks to the reviewers of this book.
Nothing can be accomplished without the support of family, friends, and loved ones. I would like to thank my friends, family, and especially my wife and my son for their continuous support throughout the writing of this book.
About the Reviewers
Siva Prakash is working as a tech lead in Bangalore. He has extensive development experience in the analysis, design, development, implementation, and maintenance of various desktop, mobile, and web-based applications. He loves trekking, traveling, music, reading books, and blogging.
You can find him on LinkedIn at https://www.linkedin.com/in/techsivam.
Tharindu Rusira is currently a computer science and engineering undergraduate at the University of Moratuwa, Sri Lanka. As a student researcher, he has strong interests in machine learning, compilers, and high-performance computing.
Tharindu has also worked as a research and development software engineering intern at Zaizi Asia (Pvt) Ltd., where he first started using Apache Mahout during the implementation of an enterprise-level content management and information retrieval system.
He sees the potential of Apache Mahout as a scalable machine learning library for industry-level implementations and has even contributed to the Mahout 0.9 release, the latest stable release of Mahout.
He is available on LinkedIn at https://www.linkedin.com/in/trusira.
Vishnu Viswanath is a senior big data developer who has many years of industrial expertise in the arena of machine learning. He is a tech enthusiast who is passionate about big data and has expertise in most big-data-related technologies.
You can find him on LinkedIn at http://in.linkedin.com/in/vishnuviswanath25.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Thanks to the progress made in the hardware industry, our storage capacity has increased, and because of this, many organizations want to store all types of events for analytics purposes. This has given birth to a new era of machine learning. The field of machine learning is very complex, and writing these algorithms is not a piece of cake. Apache Mahout provides us with ready-made algorithms in the area of machine learning and saves us from the complex task of algorithm implementation.
The intention of this book is to cover the classification algorithms available in Apache Mahout. Whether you have already worked on classification algorithms using some other tool or are completely new to the field, this book will help you. So, start reading this book to explore the classification algorithms in one of the most popular open source projects which enjoys strong community support: Apache Mahout.
What this book covers
Chapter 1, Classification in Data Analysis, provides an introduction to the classification concept in data analysis. This chapter will cover the basics of classification, the similarity matrix, and algorithms available in this area.
Chapter 2, Apache Mahout, provides an introduction to Apache Mahout and its installation process. Further, this chapter will talk about why it is a good choice for classification.
Chapter 3, Learning Logistic Regression/SGD Using Mahout, discusses logistic regression and Stochastic Gradient Descent, and how developers can use Mahout to use SGD.
Chapter 4, Learning the Naïve Bayes Classification Using Mahout, discusses the Bayes Theorem, Naïve Bayes classification, and how we can use Mahout to build a Naïve Bayes classifier.
Chapter 5, Learning the Hidden Markov Model Using Mahout, covers the HMM and how to use Mahout's HMM algorithms.
Chapter 6, Learning Random Forest Using Mahout, discusses the Random forest algorithm in detail, and how to use Mahout's Random forest implementation.
Chapter 7, Learning Multilayer Perceptron Using Mahout, discusses Mahout's early-stage implementation of neural networks. We will discuss the Multilayer Perceptron in this chapter. Further, we will use Mahout's implementation of MLP.
Chapter 8, Mahout Changes in the Upcoming Release, discusses Mahout as a work in progress. We will discuss the major new changes in the upcoming release of Mahout.
Chapter 9, Building an E-mail Classification System Using Apache Mahout, provides two use cases of e-mail classification: spam mail classification and e-mail classification based on the project the mail belongs to. We will create the model, and use this model in a program that will simulate the real working environment.
What you need for this book
To use the examples in this book, you should have the following software installed on your system:
Java 1.6 or higher
Eclipse
Hadoop
Mahout; we will discuss the installation in Chapter 2, Apache Mahout, of this book
Maven, depending on how you install Mahout
Who this book is for
If you are a data scientist who has some experience with the Hadoop ecosystem and machine learning methods and want to try out classification on large datasets using Mahout, this book is ideal for you. Knowledge of Java is essential.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Extract the source code and ensure that the folder contains the pom.xml file."
A block of code is set as follows:

public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath) {
  Map<String, Integer> dictionary = new HashMap<String, Integer>();
  for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf)) {
    dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
  }
  return dictionary;
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath) {
  Map<String, Integer> dictionary = new HashMap<String, Integer>();
  for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf)) {
    dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
  }
  return dictionary;
}
Any command-line input or output is written as follows:

hadoop fs -mkdir /user/hue/KDDTrain
hadoop fs -mkdir /user/hue/KDDTest
hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, navigate to the location for mahout-distribution-0.9 and click on Finish."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/4959OS_ColoredImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Chapter 1. Classification in Data Analysis
In the last decade, we saw a huge growth in social networking and e-commerce sites. I am sure that you must have got information about this book on Facebook, Twitter, or some other site. Chances are also high that you are reading an e-copy of this book after ordering it on your phone or tablet.
This must give you an idea of how much data we are generating over the Internet every single day. Now, in order to obtain all necessary information from the data, we not only create data but also store this data. This data is extremely useful to get some important insights into the business. The analysis of this data can increase the customer base and create profits for the organization. Take the example of an e-commerce site. You visit the site to buy a book. You get information about books on related topics or the same topic, publisher, or writer, and this helps you to make better decisions, which also helps the site to know more about its customers. This will eventually lead to an increase in sales.
Finding related items or suggesting a new item to the user is all part of the data science in which we analyze the data and try to get useful patterns.
Data analysis is the process of inspecting historical data and creating models to get useful information that is required to help in decision making. It is helpful in many industries, such as e-commerce, banking, finance, healthcare, telecommunications, retail, oceanography, and many more.
Let's take the example of a weather forecasting system. It is a system that can predict the state of the atmosphere at a particular location. In this process, scientists collect historical data of the atmosphere of that location and try to create a model based on it to predict how the atmosphere will evolve over a period of time.
In machine learning, classification is the automation of the decision-making process that learns from examples of the past and emulates those decisions automatically. Emulating the decisions automatically is a core concept in predictive analytics. In this chapter, we will look at the following points:
Understanding classification
Working of classification systems
Classification algorithms
Model evaluation methods
Introducing the classification
The word classification always reminds us of our biology class, where we learned about the classification of animals. We learned about different categories of animals, such as mammals, reptiles, birds, amphibians, and so on.
If you remember how these categories are defined, you will realize that there were certain properties that scientists found in existing animals, and based on these properties, they categorized a new animal.
Other real-life examples of classification could be, for instance, when you visit the doctor. He/she asks you certain questions, and based on your answers, he/she is able to identify whether you have a certain disease or not.
Classification is the categorization of potential answers, and in machine learning, we want to automate this process. Biological classification is an example of multiclass classification, and finding the disease is an example of binary classification.
In data analysis, we want to use machine learning concepts. To analyze the data, we want to build a system that can help us to find out which class an individual item belongs to. Usually, these classes are mutually exclusive. A related problem in this area is finding out the probability that an individual belongs to a certain class.
Classification is a supervised learning technique. In this technique, machines learn from historical data and gain the capability to predict the unknown. In machine learning, another popular technique is unsupervised learning. In supervised learning, we already know the output categories, but in unsupervised learning, we know nothing about the output. Let's understand this with a quick example: suppose we have a fruit basket, and we want to classify fruits. When we say classify, it means that in the training data, we already have the output labels along with features such as size and color; so if we know that the color is red and the size is from 2.3" to 3.7", we will classify that fruit as an apple. Opposite to this, in unsupervised learning, we want to separate different fruits, and we do not have any output information in the training dataset, so the learning algorithm will separate different fruits based on the different features present in the dataset, but it will not be able to label them. In other words, it will not be able to tell which one is an apple and which one is a banana, although it will be able to separate them.
Application of the classification system
Classification is used for prediction. In the case of e-mail categorization, it is used to classify e-mail as spam or not spam. Nowadays, Gmail is classifying e-mails as primary, social, and promotional as well. Classification is useful in predicting credit card frauds, to categorize customers for eligibility of loans, and so on. It is also used to predict customer churn in the insurance and telecom industries. It is useful in the healthcare industry as well. Based on historical data, it is useful in classifying particular symptoms of a disease to predict the disease in advance. Classification can be used to classify tropical cyclones. So, it is useful across all industries.
Working of the classification system
Let's understand the classification process in more detail. In the process of classification, with the dataset given to us, we try to find out informative variables using which we can reduce the uncertainty and categorize something. These informative variables are called explanatory variables or features.
The final categories that we are interested in are called target variables or labels. Explanatory variables can be in any of the following forms:
Continuous (numeric types)
Categorical
Word-like
Text-like
Note
If numeric types are not useful for any mathematical functions, they will be counted as categorical (zip codes, street numbers, and so on).
So, for example, we have a dataset of customers' loan applications, and we want to build a classifier to find out whether a new customer is eligible for a loan or not. In this dataset, we can have the following fields:
Customer Age
Customer Income (PA)
Customer Account Balance
Loan Granted
From these fields, Customer Age, Customer Income (PA), and Customer Account Balance will work as explanatory variables, and Loan Granted will be the target variable, as shown in the following screenshot:
To understand the creation of the classifier, we need to understand a few terms, as shown in the following diagram:
Training dataset: From the given dataset, a portion of the data is used to create the training dataset (it could be 70 percent of the given data). This dataset is used to build the classifier. All the feature sets are used in this dataset.
Test dataset: The dataset that is left after the training dataset is used to test the created model. With this data, only the feature set is used, and the model is used to predict the target variables or labels.
Model: This is the output of the learning algorithm; it is used to generate the target variables.
While building a classifier, we follow these steps:
Collecting historical data
Cleaning data (a lot of activities are involved here, such as space removal, and so on)
Defining target variables
Defining explanatory variables
Selecting an algorithm
Training the model (using the training dataset)
Running test data
Evaluating the model
Adjusting explanatory variables
Rerunning the test
While preparing the model, one should take care of outlier detection. Outlier detection is a method to find out items that do not conform to an expected pattern in a dataset. Outliers in an input dataset can mislead the training process of an algorithm. This can affect the model accuracy. There are algorithms to find out these outliers in the datasets. Distance-based techniques and fuzzy-logic-based methods are mostly used to find out outliers in the dataset. Let's talk about one example to understand the outliers.
We have a set of numbers, and we want to find out the mean of these numbers:
10, 75, 10, 15, 20, 85, 25, 30, 25
Just plot these numbers, and the result will be as shown in the following screenshot:
Clearly, the numbers 75 and 85 are outliers (far away in the plot from the other numbers).
Mean = sum of values / number of values = 32.78
Mean without the outliers = 19.29
So, now you can understand how outliers can affect the results.
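To make this concrete, the following minimal Java sketch (our own illustration, not from the book's code bundle) computes both means; the threshold of 50 used to filter the outliers is an assumption chosen just to separate 75 and 85 from the rest of this sample:

import java.util.Arrays;

public class OutlierMeanDemo {
  public static void main(String[] args) {
    double[] values = {10, 75, 10, 15, 20, 85, 25, 30, 25};
    // Mean over all values: 295 / 9 = 32.78
    double mean = Arrays.stream(values).average().getAsDouble();
    // Mean after dropping values above the assumed outlier threshold of 50:
    // 135 / 7 = 19.29
    double trimmedMean = Arrays.stream(values)
        .filter(v -> v <= 50)
        .average().getAsDouble();
    System.out.printf("Mean: %.2f, mean without outliers: %.2f%n", mean, trimmedMean);
  }
}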
While creating the model, we can encounter two commonly occurring problems: overfitting and underfitting.
Overfitting occurs when the algorithm captures the noise of the data, and the algorithm fits the data too well. Generally, it occurs if we use all the given data to build the model using pure memorization. Instead of finding out the generalizing pattern, the model just memorizes the patterns. Usually, in the case of overfitting, the model gets more complex, and it is allowed to pick up spurious correlations. These correlations are specific to training datasets and do not represent characteristics of the whole dataset in general.
The following diagram is an example of overfitting. An outlier is present, and the algorithm considers that and creates a model that perfectly classifies the training set, but because of this, the test data is wrongly classified (both the rectangles are classified as stars in the test data):
There is no single method to avoid overfitting; however, we have some approaches, such as a reduction in the number of features and the regularization of a few of the features. Another way is to train the model with some dataset and test with the remaining dataset. A common method called cross-validation is used to generate multiple performance measures. In this way, a single dataset is split and used for the creation of performance measures.
Underfitting occurs when the algorithm cannot capture the patterns in the data, and the data does not fit well. Underfitting is also known as high bias. It means your algorithm has such a strong bias towards its hypothesis that it does not fit the data well. For an underfitting error, more data will not help; it can even increase the training error. More explanatory variables can help to deal with the underfitting problem. More explanatory fields will expand the hypothesis space and will be useful in overcoming this problem.
Both overfitting and underfitting provide poor results with new datasets.
Classification algorithms
We will now discuss the following algorithms, which are supported by Apache Mahout and covered in this book:
Logistic regression/Stochastic Gradient Descent (SGD): We usually read about regression along with classification, but actually, there is a difference between the two. Classification involves a categorical target variable, while regression involves a numeric target variable. Classification predicts whether something will happen, and regression predicts how much of something will happen. We will cover this algorithm in Chapter 3, Learning Logistic Regression/SGD Using Mahout. Mahout supports logistic regression trained via Stochastic Gradient Descent.
Naïve Bayes classification: This is a very popular algorithm for text classification. Naïve Bayes uses the concept of probability to classify new items. It is based on the Bayes theorem. We will discuss this algorithm in Chapter 4, Learning the Naïve Bayes Classification Using Mahout. In this chapter, we will see how Mahout is useful in classifying text, which is required in the data analysis field. We will discuss vectorization, bag of words, n-grams, and other terms used in text classification.
Hidden Markov Model (HMM): This is used in various fields, such as speech recognition, parts-of-speech tagging, gene prediction, time-series analysis, and so on. In HMM, we observe a sequence of emissions but do not have a sequence of states which a model uses to generate the emissions. In Chapter 5, Learning the Hidden Markov Model Using Mahout, we will take up one more algorithm supported by Mahout, the Hidden Markov Model. We will discuss HMM in detail and see how Mahout supports this algorithm.
Random Forest: This is the most widely used algorithm in classification. Random Forest consists of a collection of simple tree predictors, each capable of producing a response when presented with a set of explanatory variables. In Chapter 6, Learning Random Forest Using Mahout, we will discuss this algorithm in detail and also talk about how to use Mahout to implement this algorithm.
Multilayer Perceptron (MLP): In Chapter 7, Learning Multilayer Perceptron Using Mahout, we will discuss this newly implemented algorithm in Mahout. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. It is a base for the implementation of neural networks. We will discuss neural networks a little, but only after a detailed discussion of MLP in Mahout.
We will discuss all the classification algorithms supported by Apache Mahout in this book, and we will also check the model evaluation techniques provided by Apache Mahout.
Model evaluation techniques
We cannot have a single evaluation metric that can fit all the classifier models, but we can find out some common issues in evaluation, and we have techniques to deal with them. We will discuss the following techniques that are used in Mahout:
Confusion matrix
ROC graph
AUC
Entropy matrix
The confusion matrix
The confusion matrix provides us with the number of correct and incorrect predictions made by the model compared with the actual outcomes (target values) in the data. A confusion matrix is an N*N matrix, where N is the number of labels (classes). Each column is an instance in the predicted class, and each row is an instance in the actual class. Using this matrix, we can find out how one class is confused with another. Let's assume that we have a classifier that classifies three fruits: strawberries, cherries, and grapes. Assuming that we have a sample of 24 fruits: 7 strawberries, 8 cherries, and 9 grapes, the resulting confusion matrix will be as shown in the following table:

                 Predicted classes by model
Actual class     Strawberries   Cherries   Grapes
Strawberries     4              3          0
Cherries         2              5          1
Grapes           0              1          8
So, in this model, from the 7 strawberries, 3 were classified as cherries. From the 8 cherries, 2 were classified as strawberries, and 1 was classified as a grape. From the 9 grapes, 1 was classified as a cherry. From this matrix, we will create the table of confusion. The table of confusion has two rows and two columns that report the true positives, true negatives, false positives, and false negatives.
So, if we build this table for a particular class, let's say for strawberries, it would be as follows:
True Positive: 4 (actual strawberries classified correctly) (a)
False Positive: 2 (cherries that were classified as strawberries) (b)
False Negative: 3 (strawberries wrongly classified as cherries) (c)
True Negative: 15 (all other fruits correctly not classified as strawberries) (d)
Using this table of confusion, we can find out the following terms:
Accuracy: This is the proportion of the total number of predictions that were correctly classified. It is calculated as (True Positive + True Negative) / (Positive + Negative). Therefore, accuracy = (a + d) / (a + b + c + d).
Precision or positive predictive value: This is the proportion of positive cases that were correctly classified. It is calculated as True Positive / (True Positive + False Positive). Therefore, precision = a / (a + b).
Negative predictive value: This is the proportion of negative cases that were classified correctly. It is calculated as True Negative / (True Negative + False Negative). Therefore, negative predictive value = d / (c + d).
Sensitivity / true positive rate / recall: This is the proportion of the actual positive cases that were correctly identified. It is calculated as True Positive / (True Positive + False Negative). Therefore, sensitivity = a / (a + c).
Specificity: This is the proportion of the actual negative cases that were correctly identified. It is calculated as True Negative / (False Positive + True Negative). Therefore, specificity = d / (b + d).
F1 score: This is the measure of a test's accuracy, and it is calculated as follows: F1 = 2 * ((precision * sensitivity (recall)) / (precision + sensitivity (recall))).
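As a quick check, this small Java sketch (our own illustration) plugs the strawberry counts a = 4, b = 2, c = 3, and d = 15 from the table of confusion into these formulas:

public class ConfusionMetrics {
  public static void main(String[] args) {
    double a = 4;   // true positives
    double b = 2;   // false positives
    double c = 3;   // false negatives
    double d = 15;  // true negatives
    double accuracy = (a + d) / (a + b + c + d);                   // 19/24 = 0.79
    double precision = a / (a + b);                                // 4/6 = 0.67
    double negativePredictiveValue = d / (c + d);                  // 15/18 = 0.83
    double recall = a / (a + c);                                   // 4/7 = 0.57
    double specificity = d / (b + d);                              // 15/17 = 0.88
    double f1 = 2 * (precision * recall) / (precision + recall);   // 0.62
    System.out.printf("accuracy=%.2f precision=%.2f npv=%.2f recall=%.2f specificity=%.2f f1=%.2f%n",
        accuracy, precision, negativePredictiveValue, recall, specificity, f1);
  }
}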
The Receiver Operating Characteristics (ROC) graph
ROC is a two-dimensional plot of a classifier, with the false positive rate on the x axis and the true positive rate on the y axis. The lower point (0,0) in the figure represents never issuing a positive classification. Point (0,1) represents perfect classification. The diagonal from (0,0) to (1,1) divides the ROC space. Points above the diagonal represent good classification results, and points below the line represent poor results, as shown in the following diagram:
Area under the ROC curve
This is the area under the ROC curve and is also known as AUC. It is used to measure the quality of the classification model. In practice, most classification models have an AUC between 0.5 and 1. The closer the value is to 1, the better your classifier is.
The entropy matrix
Before going into the details of the entropy matrix, first we need to understand entropy. The concept of entropy in information theory was developed by Shannon.
Entropy is a measure of disorder that can be applied to a set. It is defined as:
Entropy = -p1 * log(p1) - p2 * log(p2) - ...
Each p is the probability of a particular property within the set. Let's revisit our customer loan application dataset. For example, assume we have a set of 10 customers, of which 6 are eligible for a loan and 4 are not. Here, we have two properties (classes): eligible or not eligible.
P(eligible) = 6/10 = 0.6
P(not eligible) = 4/10 = 0.4
So, the entropy of the dataset will be:
Entropy = -[0.6 * log2(0.6) + 0.4 * log2(0.4)]
= -[0.6 * -0.74 + 0.4 * -1.32]
= 0.972
Entropy is useful in acquiring knowledge of information gain. Information gain measures the change in entropy due to any new information being added in model creation. So, if entropy decreases from new information, it indicates that the model is performing well now. Information gain is calculated as:
IG(class, subclasses) = entropy(class) - (p(subclass1) * entropy(subclass1) + p(subclass2) * entropy(subclass2) + ...)
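The following small Java sketch (our own illustration) reproduces the entropy calculation for the loan dataset and computes the information gain for a hypothetical split; the subset counts (4 customers, all eligible, and 6 customers, 2 of them eligible) are made up for demonstration:

public class EntropyDemo {
  // Entropy = -p1 * log2(p1) - p2 * log2(p2) - ...
  static double entropy(double... probabilities) {
    double e = 0.0;
    for (double p : probabilities) {
      if (p > 0) {
        e -= p * (Math.log(p) / Math.log(2)); // log base 2
      }
    }
    return e;
  }

  public static void main(String[] args) {
    // Loan dataset: 6 of 10 customers eligible, 4 of 10 not eligible
    double parent = entropy(0.6, 0.4); // = 0.971
    // Hypothetical split: subset 1 has 4 customers (all eligible);
    // subset 2 has 6 customers (2 eligible, 4 not eligible)
    double child1 = entropy(1.0);
    double child2 = entropy(2.0 / 6, 4.0 / 6);
    double infoGain = parent - (0.4 * child1 + 0.6 * child2);
    System.out.printf("entropy = %.3f, information gain = %.3f%n", parent, infoGain);
  }
}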
The entropy matrix is basically the same as the confusion matrix defined earlier; the only difference is that the elements in the matrix are the averages of the log of the probability score for each true or estimated category combination. A good model will have small negative numbers along the diagonal and will have large negative numbers in the off-diagonal positions.
Summary
We have discussed classification and its applications, and also the algorithms and classifier evaluation techniques that are supported by Mahout. We discussed techniques like the confusion matrix, ROC graph, AUC, and entropy matrix.
Now, we will move to the next chapter and set up Apache Mahout and the developer environment. We will also discuss the architecture of Apache Mahout and find out why Mahout is a good choice for classification.
Chapter 2. Apache Mahout
In the previous chapter, we discussed classification and looked into the algorithms provided by Mahout in this area. Before going to those algorithms, we need to understand Mahout and its installation. In this chapter, we will explore the following topics:
What is Apache Mahout?
Algorithms supported in Mahout
Why is it a good choice for classification problems?
Setting up the system for Mahout development
Introducing Apache Mahout
A mahout is a person who rides and controls an elephant. Most of the algorithms in Apache Mahout are implemented on top of Hadoop, which is another Apache-licensed project and has the symbol of an elephant (http://hadoop.apache.org/). As Apache Mahout rides over Hadoop, this name is justified.
Apache Mahout is a project of the Apache Software Foundation that has implementations of machine learning algorithms. Mahout was started as a subproject of the Apache Lucene project in 2008. After some time, an open source project named Taste, which was developed for collaborative filtering, was absorbed into Mahout. Mahout is written in Java and provides scalable machine learning algorithms. Mahout is the default choice for machine learning problems in which the data is too large to fit into a single machine. Mahout provides Java libraries and does not provide any user interface or server. It is a framework of tools to be used and adapted by developers.
To sum it up, Mahout provides you with implementations of the most frequently used machine learning algorithms in the areas of classification, clustering, and recommendation. Instead of us spending time writing algorithms, it provides us with ready-to-consume solutions.
Mahout uses Hadoop for its algorithms, but some of the algorithms can also run without Hadoop. Currently, Mahout supports the following use cases:
Recommendation: This takes the user data and tries to predict items that the user might like. With this use case, you can see all the sites that are selling goods to the user. Based on your previous actions, they will try to find out unknown items that could be of use. One example can be this: as soon as you select some book from Amazon, the website will show you a list of other books under the title Customers Who Bought This Item Also Bought. It also shows the title What Other Items Do Customers Buy After Viewing This Item? Another example of recommendation is that while playing videos on YouTube, it recommends other videos to you based on your selection. Mahout provides full API support to develop your own user-based or item-based recommendation engine.
Classification: As defined in the earlier chapter, classification decides how much an item belongs to one particular category. E-mail classification for filtering out spam is a classic example of classification. Mahout provides a rich set of APIs to build your own classification model. For example, Mahout can be used to build a document classifier or an e-mail classifier.
Clustering: This is a technique that tries to group items together based on some sort of similarity. Here, we find the different clusters of items based on certain properties, and we do not know the names of the clusters in advance. The main difference between clustering and classification is that in classification, we know the end class name. Clustering is useful in finding out different customer segments. Google News uses the clustering technique in order to group news. For clustering, Mahout has already implemented some of the most popular algorithms in this area, such as k-means, fuzzy k-means, canopy, and so on.
Dimensional reduction: As we discussed in the previous chapter, features are called dimensions. Dimensional reduction is the process of reducing the number of random variables under consideration. This makes data easy to use. Mahout provides algorithms for dimensional reduction. Singular value decomposition and Lanczos are examples of the algorithms that Mahout provides.
Topic modeling: Topic modeling is used to capture the abstract idea of a document. A topic model is a model that associates probability distributions with each document over topics. Given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. "Football" and "goal" will appear more in a document about sports. Latent Dirichlet Allocation (LDA) is a powerful learning algorithm for topic modeling. In Mahout, collapsed variational Bayes is implemented for LDA.
Algorithms supported in Mahout
The implementation of algorithms in Mahout can be categorized into two groups:
Sequential algorithms: These algorithms are executed sequentially and do not use Hadoop's scalable processing. They are usually the ones derived from Taste. For example: user-based collaborative filtering, logistic regression, Hidden Markov Model, multilayer perceptron, and singular value decomposition.
Parallel algorithms: These algorithms can support petabytes of data using Hadoop's MapReduce parallel processing. For example: Random Forest, Naïve Bayes, canopy clustering, k-means clustering, spectral clustering, and so on.
Reasons for Mahout being a good choice for classification
In machine learning systems, the more data you use, the more accurate the system built will be. Mahout, which uses Hadoop for scalability, is way ahead of others in terms of handling huge datasets. As the number of training sets increases, Mahout's performance also increases. If the input size for the training examples is from 1 million to 10 million, then Mahout is an excellent choice.
For classification problems, increased data for training is desirable as it can improve the accuracy of the model. Generally, as the number of datasets increases, the memory requirement also increases and algorithms become slow, but Mahout's scalable and parallel algorithms work better with regard to the time taken. Each new machine added decreases the training time and provides higher performance.
Installing Mahout
Now let's try the slightly challenging part of this book: Mahout installation. Based on common experiences, I have come up with the following questions or concerns that users face before installation:
I do not know anything about Maven. How will I compile the Mahout build?
How can I set up Eclipse to write my own programs in Mahout?
How can I install Mahout on a Windows system?
So, we will install Mahout with the help of the following steps. Each step is independent of the others. You can choose any one of these:
Building Mahout code using Maven
Setting up a development environment using Eclipse
Setting up Mahout for a Windows user
Before any of the steps, some of the prerequisites are:
You should have Java installed on your system. wikiHow is a good source for this at http://www.wikihow.com/Install-Java-on-Linux
You should have Hadoop installed on your system from the http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleNodeSetup.html URL
Building Mahout from source using Maven
Mahout's build and release system is based on Maven.
Installing Maven
1. Create the folder /usr/local/maven, as follows:

mkdir /usr/local/maven

2. Download the distribution apache-maven-x.y.z-bin.tar.gz from the Maven site (http://maven.apache.org/download.cgi) and move this to /usr/local/maven, as follows:

mv apache-maven-x.y.z-bin.tar.gz /usr/local/maven

3. Unpack it to the location /usr/local/maven, as follows:

tar -xvf apache-maven-x.y.z-bin.tar.gz

4. Edit the .bashrc file, as follows:

export M2_HOME=/usr/local/maven/apache-maven-x.y.z
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Note
For the Eclipse IDE, go to Help and select Install New Software. Click on the Add button, and in the popup, type the name M2Eclipse, provide the link http://download.eclipse.org/technology/m2e/releases, and click on OK.
Building Mahout code
By default, Mahout assumes that Hadoop is already installed on the system. Mahout uses the HADOOP_HOME and HADOOP_CONF_DIR environment variables to access Hadoop cluster configurations. For setting up Mahout, execute the following steps:
1. Download the Mahout distribution file mahout-distribution-0.9-src.tar.gz from the location http://archive.apache.org/dist/mahout/0.9/.
2. Choose an installation directory for Mahout (/usr/local/mahout), and place the downloaded source in the folder. Extract the source code and ensure that the folder contains the pom.xml file. The following is the exact command to extract the source:

tar -xvf mahout-distribution-0.9-src.tar.gz

3. Install the Mahout Maven project, and skip the test cases while installing, as follows:

mvn install -Dmaven.test.skip=true

4. Set the MAHOUT_HOME environment variable in the ~/.bashrc file, and update the PATH variable with the Mahout bin directory:

export MAHOUT_HOME=/usr/local/mahout/mahout-distribution-0.9
export PATH=$PATH:$MAHOUT_HOME/bin

5. To test the Mahout installation, execute the command: mahout. This will list the available programs within the distribution bundle, as shown in the following screenshot:
Setting up a development environment using Eclipse
For this setup, you should have Maven installed on the system and the Maven plugin for Eclipse. Refer to the Installing Maven step explained in the previous section. This setup can be done with the following steps:
1. Download the Mahout distribution file mahout-distribution-0.9-src.tar.gz from the location http://archive.apache.org/dist/mahout/0.9/ and unzip this:

tar xzf mahout-distribution-0.9-src.tar.gz

2. Let's create a folder named workspace under /usr/local/workspace, as follows:

mkdir /usr/local/workspace

3. Move the downloaded distribution to this folder (from the downloads folder), as follows:

mv mahout-distribution-0.9 /usr/local/workspace/

4. Move to the folder /usr/local/workspace/mahout-distribution-0.9 and make an Eclipse project (this command can take up to an hour):

mvn eclipse:eclipse

5. Set the Mahout home in the .bashrc file, as explained earlier in the Building Mahout code section.
6. Now open Eclipse. Select File, then Import, Maven, and Existing Maven Projects. Now, navigate to the location for mahout-distribution-0.9 and click on Finish.
Setting up Mahout for a Windows user
A Windows user can use Cygwin (a large collection of GNU and open source tools that provides functionality similar to a Linux distribution on Windows) to set up their environment. There is also another way that is easy to use, as shown in the following steps:
1. Download Hortonworks Sandbox for VirtualBox on your system from the location http://hortonworks.com/products/hortonworks-sandbox/#install. Hortonworks Sandbox on your system will be a pseudo-distributed mode of Hadoop.
2. Log in to the console. Use Alt + F5 or, alternatively, download PuTTY and provide 127.0.0.1 as the hostname and 2222 as the port, as shown in the following figure. Log in with the username root and the password hadoop.
3. Enter the following command:

yum install mahout

Now, you will see a screen like this:
4. Enter y, and Mahout will start installing. Once this is done, you can test by typing the command mahout, and this will show you the same screen as shown in the Setting up a development environment using Eclipse section seen earlier.
Summary
We discussed Apache Mahout in detail in this chapter. We covered the process of installing Mahout on our system, along with setting up a development environment that is ready to execute Mahout algorithms. We also took a look at the reasons behind Mahout being considered a good choice for classification. Now, we will move to the next chapter, where we will understand logistic regression and learn about the process that needs to be followed to execute our first algorithm in Mahout.
Chapter 3. Learning Logistic Regression/SGD Using Mahout
Instead of jumping directly into logistic regression, let's try to understand a few of its concepts. In this chapter, we will explore the following topics:
Introducing regression
Understanding linear regression
Cost function
Gradient descent
Logistic regression
Understanding SGD
Using Mahout for logistic regression
Introducing regression
Regression analysis is used for prediction and forecasting. It is used to find out the relationship between explanatory variables and target variables. Essentially, it is a statistical model that is used to find out the relationship among variables present in the datasets. An example that you can refer to for a better understanding of this term is this: determining the earnings of workers in a particular industry. Here, we will try to find out the factors that affect a worker's salary. These factors can be age, education, years of experience, particular skill set, location, and so on. We will try to make a model that will take all these variables into consideration and try to predict the salary. In regression analysis, we characterize the variation of the target variable around the regression function, which can be described by a probability distribution that is also of interest. There are a number of regression analysis techniques available. For example: linear regression, ordinary least squares regression, logistic regression, and so on.
Understanding linear regression
In linear regression, we create a model to predict the value of a target variable with the help of an explanatory variable. To understand this better, let's look at an example.
A company X that deals in selling coffee has noticed that in the month of monsoon, their sales increased to quite an extent. So they have come up with a formula to find the relation between rain and their per-cup coffee sales, which is shown as follows:
C = 1.5R + 800
So, for 2 mm of rain, there is a demand of 803 cups of coffee. Now if you go into minute details, you will realize that we have the data for rainfall and per-cup coffee sales, and we are trying to build a model that can predict the demand for coffee based on the rainfall. We have data in the form of (R1, C1), (R2, C2), ..., (Ri, Ci). Here, we will build the model in a manner that keeps the error in the actual and predicted values at a minimum.
Cost function
In the equation C = 1.5R + 800, the two values 1.5 and 800 are parameters, and these values affect the end result. We can write this equation as C = p0 + p1R. As we discussed earlier, our goal is to reduce the difference between the actual value and the predicted value, and this is dependent on the values of p0 and p1. Let's assume that the predicted value is Cp and the actual value is C, so that the difference will be (Cp - C). This can be written as (p0 + p1R - C). To minimize this error, we define the error function, which is also called the cost function.
The cost function can be defined with the following formula:
Cost(p0, p1) = (1/2N) * Σ (p0 + p1*Ri - Ci)^2, summed over i = 1 to N
Here, i is the ith sample and N is the number of training examples. We calculate costs for different sets of p0 and p1 and finally select the p0 and p1 that give the least cost (C). This is the model that will be used to make predictions for new input.
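For illustration, here is a minimal Java sketch (our own, not from the book's code bundle) that evaluates this cost for a candidate pair (p0, p1) over rainfall/coffee pairs (Ri, Ci); the four data points are hypothetical and chosen to lie exactly on C = 1.5R + 800:

public class CostFunctionDemo {
  // Squared-error cost for parameters p0 and p1 over N training examples
  static double cost(double p0, double p1, double[] rain, double[] cups) {
    int n = rain.length;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
      double predicted = p0 + p1 * rain[i]; // Cp = p0 + p1 * R
      double error = predicted - cups[i];   // (Cp - C)
      sum += error * error;
    }
    return sum / (2 * n);
  }

  public static void main(String[] args) {
    double[] rain = {2, 4, 6, 8};         // rainfall in mm
    double[] cups = {803, 806, 809, 812}; // observed coffee sales
    System.out.println(cost(800, 1.5, rain, cups)); // 0.0 for a perfect fit
    System.out.println(cost(790, 1.0, rain, cups)); // larger cost for worse parameters
  }
}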
Gradient descent
Gradient descent starts with an initial set of parameter values, p0 and p1, and iteratively moves towards a set of parameter values that minimizes the cost function. We can visualize this error function graphically, where width and length can be considered as the parameters p0 and p1 and height as the cost function. Our goal is to find the values for p0 and p1 in a way that our cost function will be minimal. We start the algorithm with some values of p0 and p1 and iteratively work towards the minimum value. A good way to ensure that gradient descent is working correctly is to make sure that the cost function decreases with each iteration. In this case, the cost function surface is convex, and we will try to find out the minimum value. This can be seen in the following figure:
Logistic regression
Logistic regression is used to ascertain the probability of an event. Generally, logistic regression refers to problems where the outcome is binary, for example, in building a model that is based on a customer's income, travel uses, gender, and other features to predict whether he or she will buy a particular car or not. So, the answer will be a simple yes or no. When the outcome is composed of more than one category, this is called multinomial logistic regression.
Logistic regression is based on the sigmoid function. Predictor variables are combined with linear weights and then passed to this function, which generates the output in the range of 0-1. An output close to 1 indicates that an item belongs to a certain class. Let's first understand the sigmoid or logistic function. It can be defined by the following formula:
F(z) = 1 / (1 + e^(-z))
With a single explanatory variable, z will be defined as z = β0 + β1*x. This equation is explained as follows:
z: This is called the dependent variable. This is the variable that we would like to predict. During the creation of the model, we have this variable with us in the training set, and we build the model to predict this variable. The known values of z are called observed values.
x: This is the explanatory or independent variable. These variables are used to predict the dependent variable z. For example, to predict the sales of a newly launched product at a particular location, we might include explanatory variables such as the price of the product, the average income of the people of that location, and so on.
β0: This is called the regression intercept. If all explanatory variables are zero, then this parameter is equal to the dependent variable z.
β1: These are the values for each explanatory variable.
The graph of the logistic function is as follows:
With a little bit of mathematics, we can change this equation as follows:
ln(F(x) / (1 - F(x))) = β0 + β1*x
In the case of linear regression, the cost function graph was convex, but here, it is not going to be convex. Finding the minimum values for the parameters in a way that our predicted output is close to the actual one will be difficult. In the cost function calculation for logistic regression, we will replace our Cp value of linear regression with the function F(z). To make the logistic regression cost function convex, we will replace (p0 + p1Ri - Ci)^2 with one of the following:
-log(1 / (1 + e^(-(β0 + β1*x)))) if the actual occurrence of the event is 1; this function will represent the cost.
-log(1 - (1 / (1 + e^(-(β0 + β1*x))))) if the actual occurrence of the event is 0; this function will represent the cost.
We will have to remember that in logistic regression, we calculate the class probability. So, if the probability of an event occurring (customer buying a car, being defrauded, and so on) is p, the probability of non-occurrence is 1 - p.
Stochastic Gradient Descent
Gradient descent minimizes the cost function. For very large datasets, gradient descent is a very expensive procedure. Stochastic Gradient Descent (SGD) is a modification of the gradient descent algorithm to handle large datasets. Gradient descent computes the gradient using the whole dataset, while SGD computes the gradient using a single sample. So, gradient descent loads the full dataset and tries to find out the local minimum on the graph and then repeats the full process again, while SGD adjusts the cost function for every sample, one by one. A major advantage that SGD has over gradient descent is that its speed of computation is a whole lot faster. Large datasets generally cannot be held in RAM, as the storage is limited. In SGD, the burden on the RAM is reduced; each sample or batch of samples is loaded and worked with, the results for it are stored, and so on.
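To see the per-sample update concretely, here is a minimal Java sketch (our own illustration, using the notation from the previous section) of SGD for logistic regression with one explanatory variable; the toy data and learning rate are made up:

public class SgdLogisticDemo {
  public static void main(String[] args) {
    // Toy data: single explanatory variable x and binary outcome y
    double[] x = {0.5, 1.5, 2.5, 3.5};
    double[] y = {0, 0, 1, 1};
    double beta0 = 0.0, beta1 = 0.0; // parameters to learn
    double rate = 0.1;               // learning rate

    for (int pass = 0; pass < 100; pass++) {
      for (int i = 0; i < x.length; i++) {
        // F(z) = 1 / (1 + e^(-z)) with z = beta0 + beta1 * x
        double z = beta0 + beta1 * x[i];
        double predicted = 1.0 / (1.0 + Math.exp(-z));
        double error = y[i] - predicted;
        // Update the parameters using this single sample only -
        // this is the essence of SGD
        beta0 += rate * error;
        beta1 += rate * error * x[i];
      }
    }
    System.out.printf("beta0 = %.3f, beta1 = %.3f%n", beta0, beta1);
  }
}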
Using Mahout for logistic regression
Mahout has implementations for logistic regression using SGD. It is very easy to understand and use. So let's get started.
Dataset
We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. This is a dataset for breast cancer tumors, and data is available from 1995 onwards. It has 569 instances of breast tumor cases and has 30 features to predict the diagnosis, which is categorized as either benign or malignant.
Note
More details on the preceding dataset are available at http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names.
Preparing the training and test data
You can download the wdbc.data dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.
Now, save it as a CSV file and include the following header line:
ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,worstfractaldimensions
Now, we will have to perform the following steps to prepare this data to be used by the Mahout logistic regression algorithm:
1. We will make the target class numeric. In this case, the second field, diagnosis, is the target variable. We will change malignant to 0 and benign to 1. Use the following code snippet to introduce the changes. We can use this strategy for small datasets, but for huge datasets, we have different strategies, which we will cover in Chapter 4, Learning the Naïve Bayes Classification Using Mahout:
// Requires: java.io.BufferedReader, java.io.File, java.io.FileReader,
// java.io.FileWriter, and java.io.IOException
public void convertTargetToInteger() throws IOException {
  // Read the data
  BufferedReader br = new BufferedReader(new FileReader("wdbc.csv"));
  String line = null;
  // Create the file to save the resulting data
  File wdbcData = new File("<Your destination location for the file>");
  FileWriter fw = new FileWriter(wdbcData);
  // We are adding a header to the new file
  fw.write("ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,worstfractaldimensions\n");
  /* In the while loop, we read the file line by line, check the diagnosis
     field - parts[1] - and change it to a numeric value accordingly */
  while ((line = br.readLine()) != null) {
    String[] parts = line.split(",");
    if (parts[1].equals("M")) {
      parts[1] = "0";
    } else if (parts[1].equals("B")) {
      parts[1] = "1";
    } else {
      continue; // skips the original header row and any malformed lines
    }
    // Write the row back with the numeric diagnosis value
    StringBuilder row = new StringBuilder(parts[0]);
    for (int i = 1; i < parts.length; i++) {
      row.append(",").append(parts[i]);
    }
    fw.write(row.toString() + "\n");
  }
  fw.close();
  br.close();
}
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
2. We will have to split the dataset into training and test datasets and then shuffle the datasets so that we can mix them up, which can be done using the following code snippet:
public void dataPrepration() throws Exception {
  /* Reading the dataset created by the earlier method convertTargetToInteger;
     here we are using the Google Guava APIs. */
  List<String> result = Resources.readLines(Resources.getResource("wdbc.csv"), Charsets.UTF_8);
  /* This is to remove the header before the randomization process.
     Otherwise, it can appear in the middle of the dataset. */
  List<String> raw = result.subList(1, 570);
  Random random = new Random();
  // Shuffling the dataset.
  Collections.shuffle(raw, random);
  // Splitting the dataset into training and test examples.
  List<String> train = raw.subList(0, 470);
  List<String> test = raw.subList(470, 569);
  File trainingData = new File("<your location>/wdbcTrain.csv");
  File testData = new File("<your location>/wdbcTest.csv");
  writeCSV(train, trainingData);
  writeCSV(test, testData);
}

// This method writes the list to the desired file location.
public void writeCSV(List<String> list, File file) throws IOException {
  FileWriter fw = new FileWriter(file);
  // Write the header line first
  fw.write("ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,worstfractaldimensions\n");
  for (int i = 0; i < list.size(); i++) {
    fw.write(list.get(i) + "\n");
  }
  fw.close();
}
Training the model
We will use the training dataset and train the logistic regression algorithm to prepare the model. Use the following command to create the model:

mahout trainlogistic --input /tmp/wdbcTrain.csv --output /tmp/model --target Diagnosis --categories 2 --predictors Radius Texture Perimeter Area Smoothness Compactness Concavity ConcavePoints Symmetry Fractal_Dimension RadiusStdError TextureStdError PerimeterStdError AreaStdError SmoothnessStdError CompactnessStdError ConcavityStdError ConcavePointStdError Symmetrystderror FractalDimensionStderror WorstRadius worsttexture worstperimeter worstarea worstsmoothness worstcompactness worstconcavity worstconcavepoints worstsymmentry worstfractaldimensions --types numeric --features 30 --passes 90 --rate 300

This command will give you the following output:
Let’sunderstandtheparametersusedinthiscommand:
trainlogistic:ThisisthealgorithmthatMahoutprovidestobuildthemodelusingyourinputparameters.input:Thisisthelocationoftheinputfile.output:Thisisthelocationofthemodelfile.target:Thisisthenameofthetargetvariablethatwewanttopredictfromthedataset.categories:Thisreferstothenumberofpredictedclasses.predictors:Thisfeaturesinthedatasetusedtopredictthetargetvariable.types:Thisisalistofthetypesofpredictorvariables.(Hereallarenumericbutitcouldbewordortextaswell.)features:Thisisthesizeofthefeaturevectorusedtobuildthemodel.passes:Thisspecifiesthenumberoftimestheinputdatashouldbere-examinedduringtraining.Smallinputfilesmayneedtobeexamineddozensoftimes.Verylargeinputfilesprobablydon’tevenneedtobecompletelyexamined.rate:Thissetstheinitiallearningrate.Thiscanbelargeifyouhavelotsofdataoruselotsofpassesbecauseitdecreasesprogressivelyasdataisexamined.
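The same kind of training can also be done programmatically. The following is a rough sketch (our own, simplified; please verify the exact signatures against the Mahout 0.9 API docs) built on Mahout's org.apache.mahout.classifier.sgd.OnlineLogisticRegression class, which underlies trainlogistic; the CSV parsing and feature encoding are omitted here, and the vector values are placeholders:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class WdbcSgdSketch {
  public static void main(String[] args) {
    // 2 target categories, 30 features, L1 prior - mirroring the command line
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 30, new L1()).learningRate(300);

    // In a real program, each row of wdbcTrain.csv would be parsed into
    // a 30-element feature vector plus its 0/1 diagnosis label
    Vector features = new DenseVector(30); // fill with the 30 predictor values
    int diagnosis = 0;                     // 0 = malignant, 1 = benign
    learner.train(diagnosis, features);    // one SGD update per sample

    // After training, classifyScalar returns the probability of category 1
    double probability = learner.classifyScalar(features);
    System.out.println("P(benign) = " + probability);
  }
}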
Now our model is ready to move on to the next step of evaluation. To evaluate the model further, we can use the same dataset and check the confusion matrix and AUC. The command for this will be as follows:

mahout runlogistic --input /tmp/wdbcTrain.csv --model /tmp/model --auc --confusion

runlogistic: This is the algorithm to run the logistic regression model over an input dataset
model: This is the location of the model file
auc: This prints the AUC score for the model versus the input data after the data is read
confusion: This prints the confusion matrix for a particular threshold

The output of the previous command is shown in the following screenshot:
Now, these matrices show that the model is not bad. Having 0.88 as the value for AUC is good, but we will check this on the test data as well. The confusion matrix informs us that out of 172 malignant tumors, it correctly classified 151 instances, and that 34 benign tumors were also classified as malignant. In the case of benign tumors, out of 298, it correctly classified 264.
If the model does not provide good results, we have a number of options.
Change the parameters in the feature vector, increasing them if we selected only a few features. This should be done one at a time, and we should test the result again with each generated model. We should get a model where the AUC is close to 1.
Let’srunthesamealgorithmontestdataaswell:
mahoutrunlogistic--input/tmp/wdbcTest.csv--model/tmp//model--auc–
confusion
Sothismodelworksalmostthesameontestdataaswell.Ithasclassified34outofthe40malignanttumorscorrectly.
Summary
In this chapter, we discussed logistic regression and how we can use this algorithm available in Apache Mahout. We used the Wisconsin Diagnostic Breast Cancer dataset and randomly broke it into two datasets: one for training and the other for testing. We created the logistic regression model using Mahout and also ran test data over this model. Now, we will move on to the next chapter, where you will learn about the Naïve Bayes classification and also the most frequently used classification technique: text classification.
Chapter 4. Learning the Naïve Bayes Classification Using Mahout
In this chapter, we will use the Naïve Bayes classification algorithm to classify a set of documents. Classifying text documents is a little tricky because of the data preparation steps involved. In this chapter, we will explore the following topics:
Conditional probability and the Bayes rule
Understanding the Naïve Bayes algorithm
Understanding the terms used in text classification
Using the Naïve Bayes algorithm in Apache Mahout
IntroducingconditionalprobabilityandtheBayesruleBeforelearningtheNaïveBayesalgorithm,youshouldhaveanunderstandingofconditionalprobabilityandtheBayesrule.
Inverysimpleterms,conditionalprobabilityistheprobabilitythatsomethingwillhappen,giventhatsomethingelsehasalreadyhappened.ItisexpressedasP(A/B),whichcanbereadasprobabilityofAgivenB,anditfindstheprobabilityoftheoccurrenceofeventAonceeventBhasalreadyhappened.
Mathematically, it is defined as follows:
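$P(A/B) = \frac{P(A \cap B)}{P(B)}$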
For example, if you choose a card from a standard card deck and you are asked the probability of the card being a diamond, you would quickly say 13/52 or 0.25, as there are 13 diamond cards in the deck. However, if you then look at the card and declare that it is red, we will have narrowed the possibilities down to 26 possible cards, and the probability that the card is a diamond is now 13/26 = 0.5. So, if we define A as a diamond card and B as a red card, then P(A/B) is the probability of the card being a diamond, given that it is red.
Sometimes, for a given pair of events, conditional probability is hard to calculate, and Bayes' theorem helps us here by giving the relationship between the two conditional probabilities.
Bayes' theorem is defined as follows:
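$P(A/B) = \frac{P(B/A)\,P(A)}{P(B)}$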
The terms in the formula are defined as follows:
P(A): This is called the prior probability or prior
P(B/A): This is called the conditional probability or likelihood
P(B): This is called the marginal probability
P(A/B): This is called the posterior probability or posterior
This formula can be derived directly from the conditional probability formula. We can define P(B/A) as follows:
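$P(B/A) = \frac{P(A \cap B)}{P(A)}$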
When rearranged, the formula becomes this:
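$P(A \cap B) = P(B/A)\,P(A)$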
Now, substituting this into the preceding conditional probability formula, we get the following:
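$P(A/B) = \frac{P(B/A)\,P(A)}{P(B)}$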
Let's take an example that will help us to understand how Bayes' theorem is applied.
A cancer test gives a positive result with a probability of 97 percent when the patient is indeed affected by cancer, while it gives a negative result with 99 percent probability when the patient is not affected by cancer. If a patient is drawn at random from a population where 0.2 percent of the individuals are affected by cancer and he or she is found to be positive, what is the probability that he or she is indeed affected by cancer? In probabilistic terms, what we know about this problem can be defined as follows:
P(positive | cancer) = 0.97
P(positive | no cancer) = 1 - 0.99 = 0.01
P(cancer) = 0.002
P(no cancer) = 1 - 0.002 = 0.998
P(positive) = P(positive | cancer) P(cancer) + P(positive | no cancer) P(no cancer)
= 0.97 * 0.002 + 0.01 * 0.998
= 0.01192
Now P(cancer | positive) = (0.97 * 0.002) / 0.01192 = 0.1628
So even when found positive, the probability of the patient being affected by cancer in this example is only around 16 percent.
Understanding the Naïve Bayes algorithm
In Bayes' theorem, we have seen that the outcome is based on only one piece of evidence, but in classification problems, we have multiple pieces of evidence and we have to predict the outcome. In Naïve Bayes, we uncouple the multiple pieces of evidence and treat each one of them as independent. The posterior for an outcome is defined as follows:
P(outcome | multiple evidence) = P(Evidence1 | outcome) * P(Evidence2 | outcome) * P(Evidence3 | outcome) * ... * P(outcome) / P(Evidence)
We run this formula for each possible outcome. Since we are trying to classify, each outcome is called a class. Our task is to look at the evidence (the features), consider how likely it is to belong to each class, and assign the item accordingly: the class with the highest probability is assigned to that combination of evidence. Let's understand this with an example.
Let's say that we have data on 1,000 pieces of fruit. They happen to be bananas, apples, or some other fruit. We know three characteristics of each fruit:
Size: They are either long or not long
Taste: They are either sweet or not sweet
Color: They are either yellow or not yellow
Assume that we have a dataset like the following:
Fruit type   Taste – sweet   Taste – not sweet   Color – yellow   Color – not yellow   Size – long   Size – not long   Total
Banana       350             150                 450              50                   400           100               500
Apple        150             150                 100              200                  0             300               300
Other        150             50                  50               150                  100           100               200
Total        650             350                 600              400                  500           500               1000
Now let's look at the things we have:
P(Banana) = 500/1000 = 0.5
P(Apple) = 300/1000 = 0.3
P(Other) = 200/1000 = 0.2
Let's look at the probability of the features:
P(Sweet) = 650/1000 = 0.65
P(Yellow) = 600/1000 = 0.6
P(Long) = 500/1000 = 0.5
P(Not sweet) = 350/1000 = 0.35
P(Not yellow) = 400/1000 = 0.4
P(Not long) = 500/1000 = 0.5
Now we want to know what fruit we have if it is not yellow, not long, and sweet. The probability of it being an apple is as follows:
P(Apple | sweet, not long, not yellow) = P(sweet | Apple) * P(not long | Apple) * P(not yellow | Apple) * P(Apple) / (P(sweet) * P(not long) * P(not yellow))
= 0.5 * 1 * 0.67 * 0.3 / P(Evidence)
= 0.1005 / P(Evidence)
The probability of it being a banana is this:
P(Banana | sweet, not long, not yellow) = P(sweet | Banana) * P(not long | Banana) * P(not yellow | Banana) * P(Banana) / (P(sweet) * P(not long) * P(not yellow))
= 0.7 * 0.2 * 0.1 * 0.5 / P(Evidence)
= 0.007 / P(Evidence)
The probability of it being any other fruit is as follows:
P(other fruit | sweet, not long, not yellow) = P(sweet | other fruit) * P(not long | other fruit) * P(not yellow | other fruit) * P(other fruit) / (P(sweet) * P(not long) * P(not yellow))
= 0.75 * 0.5 * 0.75 * 0.2 / P(Evidence)
= 0.05625 / P(Evidence)
So, from the results, you can see that if the fruit is sweet, not long, and not yellow, then the highest probability is that it is an apple. We simply find the class with the highest probability and assign the unknown item to that class.
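To make the arithmetic concrete, the following is a minimal, self-contained Java sketch (an illustration for this example, not Mahout code) that scores the three classes for a sweet, not long, and not yellow fruit using the counts from the preceding table. Since P(Evidence) is the same for every class, it is left out of the comparison:

public class FruitNaiveBayes {
    public static void main(String[] args) {
        // Class priors from the table: P(Banana), P(Apple), P(Other)
        String[] classes = {"Banana", "Apple", "Other"};
        double[] prior = {500.0 / 1000, 300.0 / 1000, 200.0 / 1000};
        // Per-class likelihoods of the observed evidence: sweet, not long, not yellow
        double[][] likelihood = {
            {350.0 / 500, 100.0 / 500, 50.0 / 500},   // Banana
            {150.0 / 300, 300.0 / 300, 200.0 / 300},  // Apple
            {150.0 / 200, 100.0 / 200, 150.0 / 200}   // Other
        };
        String best = null;
        double bestScore = -1.0;
        for (int c = 0; c < classes.length; c++) {
            // Numerator of the Naive Bayes formula; P(Evidence) is constant
            // across classes, so it can be ignored when comparing scores.
            double score = prior[c];
            for (double p : likelihood[c]) {
                score *= p;
            }
            System.out.printf("%s: %.5f%n", classes[c], score);
            if (score > bestScore) {
                bestScore = score;
                best = classes[c];
            }
        }
        System.out.println("Predicted class: " + best); // prints Apple
    }
}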
Naïve Bayes is a very good choice for text classification. Before we move on to text classification using Naïve Bayes in Mahout, let's understand a few terms that are really useful for text classification.
Understanding the terms used in text classification
Preparing data so that it can be used by a classifier is a complex process. From raw data, we collect explanatory and target variables and encode them as vectors, which are the input of the classifier.
Vectors are ordered lists of values. You can take a clue from coordinate geometry: a point (3, 4) is a point in the x-y plane. In Mahout, it is different; here, a vector can have 3, 4, or 10,000 dimensions.
Mahout provides support for creating vectors. There are two types of vector implementations in Mahout: sparse and dense vectors. There are a few terms that we need to understand for text classification:
Bag of words: This considers each document as a collection of words. It ignores word order, grammar, and punctuation. Every word is treated as a feature, and to calculate the feature values of a document, each word is represented as a token that is given the value 1 if it is present in the document or 0 if it is not.
Term frequency: This considers the word count in the document instead of 0 and 1, so the importance of a word increases with the number of times it appears in the document. Consider the following example sentence:
Apple has launched iPhone and it will continue to launch such products. Other competitors are also planning to launch products similar to that of iPhone.
The following is the table that represents the term frequency:
Term      Count
Apple     1
Launch    3
iPhone    2
Product   2
Plan      1
The following techniques are usually applied to come up with this type of table:
Stemming of words: With this, the suffix is removed from the word, so "launched", "launches", and "launch" are all considered as "launch".
Case normalization: With this, every term is converted to lowercase.
Stop word removal: There are some words that are present in almost every document. We call these words stop words. They carry no useful information when extracting important features from a document, so we ignore them in the overall calculation. Examples of these words are "is", "are", "the", "that", and so on.
Inverse document frequency: This is considered the boost a term gets for being rare. A term should not be too common: if a term occurs in every document, it is not good for classification. The fewer the documents in which a term occurs, the more significant it is likely to be for the documents it does occur in. For a term t, the inverse document frequency is calculated as follows:
IDF(t) = 1 + log(total number of documents / number of documents containing t)
Term frequency–inverse document frequency (TF-IDF): This is one of the popular representations of text. It is the product of term frequency and inverse document frequency, as follows:
TFIDF(t, d) = TF(t, d) * IDF(t)
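The same weighting is available programmatically. The following short Java sketch uses Mahout's org.apache.mahout.vectorizer.TFIDF class (the same class used by the classifier program in Chapter 9); the counts here are illustrative assumptions:

import org.apache.mahout.vectorizer.TFIDF;

public class TfIdfExample {
    public static void main(String[] args) {
        TFIDF tfidf = new TFIDF();
        // Illustrative numbers: the term appears 3 times in a 25-word
        // document and occurs in 10 of the 1,000 documents in the corpus.
        int termFreq = 3;    // tf: occurrences of the term in this document
        int docFreq = 10;    // df: number of documents containing the term
        int docLength = 25;  // length of this document in words
        int numDocs = 1000;  // total number of documents in the corpus
        double weight = tfidf.calculate(termFreq, docFreq, docLength, numDocs);
        System.out.println("TF-IDF weight: " + weight);
    }
}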
Each document becomes a feature vector, and a collection of documents is a set of these feature vectors; this set works as the input for classification. Now that we understand the basic concepts behind creating vectors from text documents, let's move on to the next section, where we will classify text documents using the Naïve Bayes algorithm.
Using the Naïve Bayes algorithm in Apache Mahout
We will use the 20 Newsgroups dataset for this exercise. The 20 Newsgroups dataset is a standard dataset commonly used for machine learning research. The data is obtained from transcripts of several months of postings made in 20 Usenet newsgroups in the early 1990s. This dataset consists of messages, one per file. Each file begins with header lines that specify things such as who sent the message, how long it is, what kind of software was used, and the subject. A blank line follows, and then the message body follows as unformatted text.
Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. The following steps are used to build the Naïve Bayes classifier using Mahout:
1. Create a 20newsdata directory and unzip the data there:
mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz
2. You will see two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now create another directory called 20newsdataall and merge both the training and test data of the 20 newsgroups.
3. Come out of the directory, move to the home directory, and execute the following:
mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
4. Create a directory in Hadoop and save this data in HDFS:
hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata
5. Convert the raw data into sequence files. The seqdirectory command generates sequence files from a directory. A sequence file is a flat file that consists of binary key/value pairs; we convert the files into sequence files so that they can be processed by Hadoop. This can be done using the following command:
bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out
The output of the preceding command can be seen in the following screenshot:
6. Convert the sequence file into sparse vectors using the following command:
bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tfidf
The terms used in the preceding command are as follows:
lnorm: This makes the output vectors log-normalized
nv: This produces named vectors
wt: This specifies the kind of weight to use; here, we use tfidf
The output of the preceding command on the console is shown in the following screenshot:
7. Split the set of vectors to train and test the model:
bin/mahout split -i /user/hue/20newsdatavec/tfidf-vectors --trainingOutput /user/hue/20newsdatatrain --testOutput /user/hue/20newsdatatest --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The terms used in the preceding command are as follows:
randomSelectionPct: This is the percentage of the data randomly selected for the test set. Here, 40 percent is used for testing and the remaining 60 percent for training.
xm: This refers to the execution method to use: sequential or mapreduce. The default is mapreduce.
8. Now train the model:
bin/mahout trainnb -i /user/hue/20newsdatatrain -el -o /user/hue/model -li /user/hue/labelindex -ow -c
9. Test the model using the following command:
bin/mahout testnb -i /user/hue/20newsdatatest -m /user/hue/model/ -l /user/hue/labelindex -ow -o /user/hue/results
The output of the preceding command on the console is shown in the following screenshot:
We get the result of our Naïve Bayes classifier for the 20 newsgroups.
Summary
In this chapter, we discussed the Naïve Bayes algorithm. This algorithm is a simple yet highly regarded statistical model that is widely used in both industry and academia, and it produces good results on many occasions. We initially discussed conditional probability and the Bayes rule. We then saw an example of the Naïve Bayes algorithm. You learned about the approaches to convert text into a vector format, which is the input for classifiers. Finally, we used the 20 Newsgroups dataset to build a classifier using the Naïve Bayes algorithm in Mahout. In the next chapter, we will continue our journey of exploring classification algorithms in Mahout with the Hidden Markov Model implementation.
Chapter 5. Learning the Hidden Markov Model Using Mahout
In this chapter, we will cover one of the most interesting topics in classification techniques: the Hidden Markov Model (HMM). To understand the HMM, we will cover the following topics in this chapter:
Deterministic and nondeterministic patterns
The Markov process
Introducing the HMM
Using Mahout for the HMM
Deterministic and nondeterministic patterns
In a deterministic system, each state is solely dependent on the state it was previously in. For example, let's take the case of a set of traffic lights. The sequence of lights is red → green → amber → red. So, here we know what state will follow the current state. Once the transitions are known, deterministic systems are easy to understand.
For nondeterministic patterns, consider the example of a person named Bob who has his snacks at 4:00 P.M. every day. Let's say he has any one of the three items on the menu: ice cream, juice, or cake. We cannot say for sure what item he will have the next day, even if we know what he had today. This is an example of a nondeterministic pattern.
The Markov process
In the Markov process, the next state depends on the previous states. If the next state depends on the previous n states, the process is called an order n Markov process. In the Markov process, the choice of the next state is made probabilistically. So, considering our previous example, if Bob had juice today, he can have juice, ice cream, or cake the next day. In the same way, we can reach any state in the system from the previous state. The Markov process is shown in the following diagram:
If we have n states in a process, then we can reach any state with n² transitions. We have a probability of moving to any state, and hence we will have n² transition probabilities. For a Markov process, we will have the following three items:
States: This refers to the states in the system. In our example, let's say there are three states: state 1, state 2, and state 3.
Transition matrix: This holds the probabilities of moving from one state to any other state. An example of a transition matrix is shown in the following screenshot:
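As an illustration (the values here are assumed for this example), a transition matrix for a three-state system could look like the following, where rows are yesterday's state, columns are today's state, and each row sums to 1:

          state 1   state 2   state 3
state 1   0.1       0.4       0.5
state 2   0.3       0.3       0.4
state 3   0.6       0.2       0.2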
This matrix shows that if the system was in state 1 yesterday, then the probability of it remaining in the same state today is 0.1.
Initial state vector: This is the vector of the initial state of the system. (One of the states will have a probability of 1, and the rest will have a probability of 0 in this vector.)
Introducing the Hidden Markov Model
The Hidden Markov Model (HMM) is a classification technique used to predict the states of a system by observing the outcomes, without having access to the actual states themselves. It is a Markov model in which the states are hidden.
Let's continue with Bob's snack example that we saw earlier. Now assume we have one more set of events in place that is directly observable. We know what Bob has eaten for lunch, and his choice of snacks is related to his lunch. So, we have an observable state, which is Bob's lunch, and hidden states, which are his choices of snacks. We want to build an algorithm that can forecast Bob's choice of snack based on his lunch.
In addition to the transition probability matrix, in the Hidden Markov Model we have one more matrix, called the emission matrix. This matrix contains the probability of an observable state given a hidden state. The emission matrix entries are as follows:
P(observable state | hidden state)
So, a Hidden Markov Model has the following properties:
State vector: This contains the probability of the hidden model being in a particular state at the start
Transition matrix: This holds the probabilities of a hidden state given the previous hidden state
Emission matrix: Given that the hidden model is in a particular hidden state, this holds the probabilities of observing a particular observable state
Hidden states: This refers to the states of the system that can be modeled by the Hidden Markov Model
Observable states: These are the states that are visible in the process
Using the Hidden Markov Model, three types of problems can be solved. The first two are related to the pattern recognition problem, and the third type generates a Hidden Markov Model given a sequence of observations. Let's look at these three types of problems:
Evaluation: This is finding the probability of an observed sequence given an HMM. From a number of different HMMs that describe different systems, and a sequence of observations, our goal is to find out which HMM most probably generated the given sequence. We use the forward algorithm to calculate the probability of an observation sequence for a particular HMM and so find the most probable HMM.
Decoding: This is finding the most probable sequence of hidden states from some observations. We use the Viterbi algorithm to determine the most probable sequence of hidden states when we have a sequence of observations and an HMM.
Learning: Learning is generating the HMM from a sequence of observations. So, if we have such a sequence, we may wonder which model most likely generated it. The forward-backward algorithms are useful for solving this problem.
The Hidden Markov Model is used in different applications such as speech recognition, handwritten letter recognition, genome analysis, part-of-speech tagging, customer behavior modeling, and so on.
Using Mahout for the Hidden Markov Model
Apache Mahout has an implementation of the Hidden Markov Model. It is available in the org.apache.mahout.classifier.sequencelearning.hmm package.
The overall implementation is provided by eight different classes:
HmmModel: This is the main class that defines the Hidden Markov Model.
HmmTrainer: This class has the algorithms that are used to train the Hidden Markov Model. The main algorithms are supervised learning, unsupervised learning, and unsupervised Baum-Welch.
HmmEvaluator: This class provides different methods to evaluate an HMM model. The following use cases are covered in this class:
Generating a sequence of output states from a model (prediction)
Computing the likelihood that a given model will generate the given sequence of output states (model likelihood)
Computing the most likely hidden sequence for a given model and a given observed sequence (decoding)
HmmAlgorithms: This class contains implementations of the three major HMM algorithms: forward, backward, and Viterbi.
HmmUtils: This is a utility class that provides methods to handle HMM model objects.
RandomSequenceGenerator: This is a command-line tool to generate a sequence from a given HMM.
BaumWelchTrainer: This is the class to train an HMM from the console.
ViterbiEvaluator: This is also a command-line tool, for Viterbi evaluation.
Now, let's work with Bob's example.
The following is the initial probability vector over the hidden states:
Ice cream   Cake   Juice
0.36        0.51   0.13
The following will be the state transition matrix:
            Ice cream   Cake    Juice
Ice cream   0.365       0.500   0.135
Cake        0.250       0.125   0.625
Juice       0.365       0.265   0.370
The following will be the emission matrix:
            Spicy food   Normal food   No food
Ice cream   0.1          0.2           0.7
Cake        0.5          0.25          0.25
Juice       0.80         0.10          0.10
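Before turning to the command line, here is a minimal Java sketch of the same setup using the classes listed previously. It assumes the constructor and method signatures of the org.apache.mahout.classifier.sequencelearning.hmm package as of Mahout 0.9 (HmmModel(transitions, emissions, initialProbabilities) and HmmEvaluator.predict(model, steps)); verify them against your installed version:

import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;

public class BobHmmExample {
    public static void main(String[] args) {
        // Hidden states: 0 = ice cream, 1 = cake, 2 = juice
        // Observable states: 0 = spicy food, 1 = normal food, 2 = no food
        DenseMatrix transitions = new DenseMatrix(new double[][] {
            {0.365, 0.500, 0.135},
            {0.250, 0.125, 0.625},
            {0.365, 0.265, 0.370}});
        DenseMatrix emissions = new DenseMatrix(new double[][] {
            {0.1, 0.2, 0.7},
            {0.5, 0.25, 0.25},
            {0.80, 0.10, 0.10}});
        DenseVector initial = new DenseVector(new double[] {0.36, 0.51, 0.13});
        HmmModel model = new HmmModel(transitions, emissions, initial);
        // Generate the next 10 observable states from the model
        int[] predicted = HmmEvaluator.predict(model, 10);
        for (int state : predicted) {
            System.out.print(state + " ");
        }
    }
}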
Now we will execute a command-line-based example of this problem. We have three hidden states for what Bob has eaten as snacks: ice cream, cake, or juice. Then, we have three observable states for what he is having at lunch: spicy food, normal food, or no food at all. Now, the following are the steps to execute from the command line:
1. Create a directory with the name hmm: mkdir /tmp/hmm. Go to this directory and create the sample input file of the observed states. This will include a sequence of Bob's lunch habits: spicy food (state 0), normal food (state 1), and no food (state 2). Execute the following command:
echo "0 1 2 2 2 1 1 0 0 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 0 2 1 2 0 2 1 2 1 1 0 0 0 1 0 1 0 2 1 2 1 2 1 2 1 1 0 0 2 2 0 2 1 1 0" > hmm-input
2. Run the Baum-Welch algorithm to train the model using the following command:
mahout baumwelch -i /tmp/hmm/hmm-input -o /tmp/hmm/hmm-model -nh 3 -no 3 -e .0001 -m 1000
The parameters used in the preceding command are as follows:
i: This is the input file location
o: This is the output location for the model
nh: This is the number of hidden states. In our example, it is three (ice cream, juice, or cake)
no: This is the number of observable states. In our example, it is three (spicy, normal, or no food)
e: This is the epsilon number, the convergence threshold value
m: This is the maximum iteration number
The following screenshot shows the output on executing the previous command:
3. Now we have an HMM model that can be used to generate a predicted sequence. We will run the model to predict the next 10 states of the observable sequence using the following command:
mahout hmmpredict -m /tmp/hmm/hmm-model -o /tmp/hmm/hmm-predictions -l 10
The parameters used in the preceding command are as follows:
m: This is the path of the HMM model
o: This is the output directory path
l: This is the length of the generated sequence
4. To view the prediction for the next 10 observable states, print the output file:
cat /tmp/hmm/hmm-predictions
The output of the previous command is shown in the following screenshot:
From the output, we can say that the next observable states for Bob's lunch will be spicy, spicy, spicy, normal, normal, no food, no food, no food, no food, and no food.
5. Now, we will use one more algorithm to predict the hidden states. We will use the Viterbi algorithm to predict the hidden states for a given sequence of observed states. We will first create the sequence of observed states using the following command:
echo "0 1 2 0 2 1 1 0 0 1 1 2" > /tmp/hmm/hmm-viterbi-input
6. We will use the viterbi command-line option to generate the output along with the likelihood of generating this sequence:
mahout viterbi --input /tmp/hmm/hmm-viterbi-input --output /tmp/hmm/hmm-viterbi-output --model /tmp/hmm/hmm-model --likelihood
The parameters used in the preceding command are as follows:
input: This is the input location of the file
output: This is the output location of the Viterbi algorithm's output
model: This is the location of the HMM model that we created earlier
likelihood: This computes the likelihood of the observed sequence
The following screenshot shows the output on executing the previous command:
7. Predictions from the Viterbi run are saved in the output file and can be seen using the cat command:
cat /tmp/hmm/hmm-viterbi-output
The following output shows the predictions for the hidden states:
Summary
In this chapter, we discussed another classification technique: the Hidden Markov Model. You learned about deterministic and nondeterministic patterns. We also touched upon the Markov process and the Hidden Markov process in general. We looked at the classes implemented inside Mahout to support the Hidden Markov Model. We took an example to create an HMM model and then used this model to predict the sequence of observable states. We used the Viterbi algorithm implemented in Mahout to predict the hidden states of the system.
Now, in the next chapter, we will cover one more interesting algorithm used in the classification area: Random forest.
Chapter 6. Learning Random Forest Using Mahout
Random forest is one of the most popular techniques in classification. It starts with a machine learning technique called the decision tree. In this chapter, we will explore the following topics:
Decision tree
Random forest
Using Mahout for Random forest
Decision tree
A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on branches. As in other classification model creation processes, we start with the training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process continues until no more partitioning can be done. The core of the algorithm is to find the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation of ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.
From the explanatory variables, to choose the best splitter at a node, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in the diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process continues at the next node until we reach a node where we can make the decision.
We create a decision tree from a training dataset, so it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node, since all the information is available there. Later on, with each split, there is less data, and towards the end of the tree, a particular node can show patterns that are specific to the subset of data used to split it. These patterns create problems when we use them to predict the real dataset. Pruning methods let the tree grow and then remove the smaller branches that fail to generalize. Now let's take an example to understand the decision tree.
Consider the iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs.
As you can see in the preceding diagram, all the groups were initially considered as the Setosa species, and then the explanatory variable petal length was used to divide the groups further. At each step, the count of misclassified items was also calculated, which shows how many items were wrongly classified. Further down, the petal width variable was taken into account. Usually, items at leaf nodes are correctly classified.
Random forest
The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.
Single decision trees exhibit the bias-variance tradeoff, so they usually have high variance or high bias. These terms are defined as follows:
Bias: This is an error caused by an erroneous assumption in the learning algorithm
Variance: This is an error caused by sensitivity to small fluctuations in the training set
Random forests attempt to mitigate this problem by averaging to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of a Random forest algorithm, let's work with an example.
Consider that we have a training dataset that has lots of features (explanatory variables) and target variables or classes:
We create sample sets from the given dataset:
A different set of random features is taken into account to create each random sub-dataset. Now, from these sub-datasets, different decision trees are created. So we have actually created a forest of different decision trees. Using these different trees, we create a ranking system for all the classifiers. To predict the class of a new unknown item, we use all the decision trees and separately find out which class each tree predicts. See the following diagram for a better understanding of this concept:
Different decision trees to predict the class of an unknown item
In this particular case, we have four different decision trees. We predict the class of an unknown dataset with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, a Random forest takes a vote for each class. So we have one vote each for class 2 and class 3 and two votes for class 5. Therefore, the forest decides that for the new unknown dataset, the predicted class is class 5. The class that gets the highest vote is chosen for the new dataset (a short code sketch of this voting step follows the list below). A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:
Combining learning models increases the accuracy of the classification
It runs effectively on large datasets as well
The generated forest can be saved and used for other datasets as well
It can handle a large number of explanatory variables
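The voting step itself is easy to express in code. The following minimal Java sketch (an illustration, not Mahout code) tallies the votes of the four trees from the preceding example:

import java.util.HashMap;
import java.util.Map;

public class ForestVoting {
    public static void main(String[] args) {
        // Predictions of the four decision trees from the example
        int[] treePredictions = {2, 5, 5, 3};
        Map<Integer, Integer> votes = new HashMap<>();
        for (int predictedClass : treePredictions) {
            votes.merge(predictedClass, 1, Integer::sum);
        }
        // The class with the most votes wins
        int winner = -1;
        int best = 0;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                winner = e.getKey();
            }
        }
        System.out.println("Predicted class: " + winner); // prints 5
    }
}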
Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout.
Using Mahout for Random forest
Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use, so let's get started.
Dataset
We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program (R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation," DISCEX, vol. 02, p. 1012, 2000).
DARPA'98 is about 4 GB of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type.
NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/.
We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets.
Note
In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.
Steps to use the Random forest algorithm in Mahout
The steps to implement the Random forest algorithm in Apache Mahout are as follows:
1. Transfer the test and training datasets to HDFS using the following commands:
hadoop fs -mkdir /user/hue/KDDTrain
hadoop fs -mkdir /user/hue/KDDTest
hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest
2. Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+_20Percent.arff, a descriptor file is required. This is because all the information in the training dataset needs to be labeled, and from the labeled dataset the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:
hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz-job.jar org.apache.mahout.classifier.df.tools.Describe -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff -f /user/hue/KDDTrain/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Jar: the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here, and it takes three parameters:
p is the path for the data to be described.
f is the location for the generated descriptor file.
d is the information about the attributes of the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes (3 C), and so on. The L at the end defines the label.
The output of the previous command is shown in the following screenshot:
3. Build the Random forest using the following command:
hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff -ds /user/hue/KDDTrain/KDDTrain+.info -sl 5 -p -t 100 -o /user/hue/nsl-forest
Jar: the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest, with the following arguments:
Dmapred.max.split.size indicates to Hadoop the maximum size of each partition.
d stands for the data path.
ds stands for the location of the descriptor file.
sl is the number of variables randomly selected at each tree node. Here, each tree is built using five randomly selected attributes per node.
p uses the partial data implementation.
t stands for the number of trees to grow. Here, the command builds 100 trees using the partial implementation.
o stands for the output path that will contain the decision forest.
In the end, the process will show the following result:
4. Use this model to classify the new dataset:
hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i /user/hue/KDDTest/KDDTest+.arff -ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a -mr -o /user/hue/predictions
Jar: the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest has the following parameters:
i indicates the path for the test data
ds stands for the location of the descriptor file
m stands for the location of the forest generated by the previous command
a tells the tool to run the analyzer to compute the confusion matrix
mr tells Hadoop to distribute the classification
o stands for the location to store the predictions in
The job provides the following confusion matrix:
So, from the confusion matrix, it is clear that 9,396 instances were correctly classified, and 315 normal instances were incorrectly classified as anomalies. The accuracy percentage is 77.7635 (correctly classified instances divided by all classified instances). The output file in the predictions folder contains the list of predictions, where 0 denotes the normal class and 1 denotes an anomaly.
Summary
In this chapter, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest, used it with the test dataset, and generated the confusion matrix and other statistics for the output.
In the next chapter, we will look at the final classification algorithm available in Apache Mahout. So stay tuned!
Chapter 7. Learning Multilayer Perceptron Using Mahout
To understand the Multilayer Perceptron (MLP), we will first explore one more popular machine learning technique: the neural network. In this chapter, we will explore the following topics:
Neural network and neurons
MLP
Using Mahout for MLP implementation
Neural network and neurons
The neural network is an old algorithm, and it was developed with a goal in mind: to provide the computer with a brain. The neural network is inspired by the biological structure of the human brain, where multiple neurons are connected and form columns and layers. A neuron is an electrically excitable cell that processes and transmits information through electrical and chemical signals. Perceptual input enters the neural network through our sensory organs and is then further processed into higher levels. Let's understand how neurons work in our brain.
Neurons are computational units in the brain that collect input from input nerves, which are called dendrites. They perform computations on these input messages and send the output along output nerves, which are called axons. See the following figure (http://vv.carleton.ca/~neil/neural/neuron-a.html):
Along the same lines, we develop a neural network in computers. We can represent a neuron in our algorithm as shown in the following figure:
Here, x1, x2, and x3 are the input features, and they are passed to a function f, which performs the computation and provides the output. This activation function is usually chosen from the family of sigmoidal functions (as defined in Chapter 3, Learning Logistic Regression/SGD Using Mahout). For classification problems, softmax activation functions are used at the output, because we want the outputs to be the probabilities of the target classes. So, it is desirable for each output to lie between 0 and 1 and for the outputs to sum to 1. The softmax function enforces these constraints; it is a generalization of the logistic function. More details on the softmax function can be found at http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html.
Multilayer Perceptron
A neural network or artificial neural network generally refers to an MLP network. We defined the neuron as implemented in computers in the previous section. An MLP network consists of multiple layers of these neuron units. Let's understand a perceptron network of three layers, as shown in the next figure. The first layer of the MLP represents the input and has no other purpose than routing the input to every connected unit in a feed-forward fashion. The second layer is called the hidden layer, and the last layer serves the special purpose of determining the output. The activation of a neuron in the hidden layer is a squashed, weighted sum of all its inputs. Neuron 1 in layer 2 is defined as follows:
$y_1^{(2)} = g(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$
Here, the superscript (1) denotes the weights of layer 1, and the subscript jk denotes the weight from input k to neuron j. The first term, where $x_0 = 1$, is called the bias and can be used as an offset, independent of the input. Neuron 2 in layer 2 is defined as follows:
$y_2^{(2)} = g(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3)$
Neuron 3 in layer 2 is defined as follows:
$y_3^{(2)} = g(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3)$
Here, g is a sigmoid function, as defined in Chapter 3, Learning Logistic Regression/SGD Using Mahout. The function is as follows:
$g(z) = \frac{1}{1 + e^{-z}}$
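To illustrate these equations, the following minimal Java sketch (not Mahout code) computes the activations of a three-neuron hidden layer for a single input vector, with the bias handled as x0 = 1; the weight values are assumptions chosen for the example:

public class MlpLayerForward {
    // Sigmoid squashing function g(z) = 1 / (1 + e^(-z))
    static double g(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Input vector with the bias unit x0 = 1 prepended
        double[] x = {1.0, 0.5, -0.2, 0.8};
        // w[j][k] = weight from input k to hidden neuron j (illustrative values)
        double[][] w = {
            {0.1, 0.4, -0.3, 0.2},
            {-0.2, 0.1, 0.5, -0.1},
            {0.3, -0.4, 0.2, 0.6}
        };
        // Activation of each hidden neuron: y_j = g(sum over k of w[j][k] * x[k])
        for (int j = 0; j < w.length; j++) {
            double z = 0.0;
            for (int k = 0; k < x.length; k++) {
                z += w[j][k] * x[k];
            }
            System.out.printf("y%d = %.4f%n", j + 1, g(z));
        }
    }
}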
In this MLP network, the output from each input and hidden layer neuron unit is distributed to every node of the next layer, and this is why this type of network is called a fully connected, feed-forward network. In this network, no values are fed back to a previous layer. (Feeding error values back through the network is the strategy used to train it, known as backpropagation. Details on this can be found at http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html.)
An MLP network can have more than one hidden layer. Finding the values of the weights so that the predicted value is as close as possible to the actual one is the training process of the MLP. To build an effective network, we consider a lot of items, such as the number of hidden layers and neuron units in each layer, the cost function to minimize the error between predicted and actual values, and so on.
Now let's discuss two more important and problematic questions that arise when creating an MLP network:
How many hidden layers should one use for the network?
How many hidden units (neuron units) should one use in a hidden layer?
Zero hidden layers are sufficient to resolve linearly separable data. Assuming your data does require separation by a non-linear technique, always start with one hidden layer; almost certainly, that is all you will need. If your data is separable using an MLP, then that MLP probably only needs a single hidden layer. To select the number of units in the different layers, these are the guidelines:
Input layer: This refers to the number of explanatory variables in the model plus one for the bias node.
Output layer: In the case of classification, this refers to the number of target classes; in the case of regression, this is obviously one.
Hidden layers: Start your network with one hidden layer, and use a number of neuron units equivalent to the number of units in the input layer. The best way is to train several neural networks with different numbers of hidden layers and hidden neurons and measure the performance of these networks using cross-validation; you can then stick with the number that yields the best-performing network. Problems that require two hidden layers are rarely encountered; however, neural networks with more than one hidden layer can represent functions with any kind of shape. There is currently no theory to justify the use of neural networks with more than two hidden layers, and for many practical problems, there is no reason to use more than one hidden layer. A network with no hidden layer is only capable of representing linearly separable functions. Networks with one hidden layer can approximate any function that contains a continuous mapping from one finite space to another, and networks with two hidden layers can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy (Chapter 5 of the book Introduction to Neural Networks for Java).
Number of neurons or hidden units: Use a number of neuron units equivalent to the number of units in the input layer. The number of hidden units should be less than twice the number of units in the input layer. Another rule to calculate this is (number of input units + number of output units) * 2/3.
Test for generalization error, training error, bias, and variance. The number of nodes is usually found to be ideal at the point where the generalization error dips, just before it begins to increase again.
Now let's move on to the next section and explore how we can use Mahout for an MLP.
MLP implementation in Mahout
The MLP implementation is based on a more general neural network class. It is implemented to run on a single machine using Stochastic Gradient Descent, where the weights are updated using one data point at a time.
The number of layers and units per layer can be specified manually and determines the whole topology, with each unit being fully connected to the previous layer. A bias unit is automatically added to the input of every layer. A bias unit is helpful for shifting the activation function to the left or right; it is like the constant term in a linear function.
Currently, the logistic sigmoid is used as the squashing function in every hidden and output layer.
The command-line version does not perform multiple iterations, which leads to bad results on small datasets. Another restriction is that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing the implementation from the command line.
A learned model can be stored and updated with new training instances using the `--update` flag. The output of the classification result is saved as a .txt file and only consists of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package. (The core package was renamed mrlegacy.)
On the command line, we use the TrainMultilayerPerceptron and RunMultilayerPerceptron classes that are available in the mrlegacy package, together with three other classes: NeuralNetwork.java, NeuralNetworkFunctions.java, and MultilayerPerceptron.java. For this particular implementation, users can freely control the topology of the MLP, including the following:
The size of the input layer
The number of hidden layers
The size of each hidden layer
The size of the output layer
The cost function
The squashing function
The model is trained in an online learning approach, where the weights of the neurons in the MLP are updated incrementally using the backpropagation algorithm proposed by Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986), Learning representations by back-propagating errors. Nature, 323, 533-536.
Using Mahout for MLP
Mahout has an implementation of an MLP network. The MLP implementation is currently located in the Map-Reduce-Legacy package. As with other classification algorithms, two separate classes are implemented to train and use this classifier. For training the classifier, the org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron class is used, and for running the classifier, the org.apache.mahout.classifier.mlp.RunMultilayerPerceptron class is used. There are a number of parameters defined for these classes, but we will discuss them once we run our example on a dataset.
Dataset
In this chapter, we will train an MLP to classify the iris dataset. The iris flower dataset contains data of three flower species, where each data point consists of four features. This dataset was introduced by Sir Ronald Fisher. It consists of 50 samples from each of three species of iris. These species are Iris setosa, Iris virginica, and Iris versicolor. Four features were measured from each sample:
Sepal length
Sepal width
Petal length
Petal width
All measurements are in centimeters. You can download this dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ and save it as a .csv file, as shown in the following screenshot:
This dataset will look like the following screenshot:
Steps to use the MLP algorithm in Mahout
The steps to use the MLP algorithm in Mahout are as follows:
1. Create the MLP model.
To create the MLP model, we will use the TrainMultilayerPerceptron class. Use the following command to generate the model:
bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron -i /tmp/irisdata.csv -labels Iris-setosa Iris-versicolor Iris-virginica -mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001
You can also run it using the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. Execute the following command:
java -cp /usr/lib/mahout/mahout-core-xyz-job.jar org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron -i /tmp/irisdata.csv -labels Iris-setosa Iris-versicolor Iris-virginica -mo /user/hue/mlp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001
The TrainMultilayerPerceptron class is used here, and it takes different parameters. i is the path for the input dataset; here, we have put the dataset under the /tmp folder (local filesystem). Additionally, labels are the target labels defined in the dataset; here we have the labels Iris-setosa, Iris-versicolor, and Iris-virginica. The remaining parameters are as follows:
mo is the output location for the created model.
ls is the number of units per layer, including the input, hidden, and output layers. This parameter specifies the topology of the network. Here, we have 4 for the input features, 8 for the hidden layer, and 3 for the number of output classes.
l is the learning rate that is used for weight updates; the default is 0.5. To approximate gradient descent, neural networks are trained with algorithms that learn either by batch or online methods. In batch training, weight changes are accumulated over an entire presentation of the training data (an epoch) before being applied, while online training updates the weights after the presentation of each training example (instance). More details can be found at http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf.
m is the momentum weight that is used for gradient descent. This must be in the range 0–1.0.
r is the regularization value for the weight vector. This must be in the range 0–0.1. It is used to prevent overfitting.
2. To test/run the MLP classification on the trained model, we can use the following command:
bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron -i /tmp/irisdata.csv -cr 0 3 -mo /tmp/model.model -o /tmp/labelResult.txt
You can also run it using the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. Execute the following command:
java -cp /usr/lib/mahout/mahout-core-xyz-job.jar org.apache.mahout.classifier.mlp.RunMultilayerPerceptron -i /tmp/irisdata.csv -cr 0 3 -mo /tmp/model.model -o /tmp/labelResult.txt
The RunMultilayerPerceptron class is employed here to use the model. This class also takes different parameters, which are as follows:
i indicates the input dataset location
cr is the range of columns to use from the input file, starting with 0 (that is, `-cr 0 5` for including the first six columns only)
mo is the location of the model built earlier
o is the path to store the labeled results from running the model
Summary
In this chapter, we discussed one of the newly implemented algorithms in Mahout: the MLP. We started our discussion by understanding neural networks and neuron units and continued further to understand the MLP network algorithm. We discussed how to choose the units of the different layers. We then moved to Mahout and used the iris dataset to train and run an MLP algorithm implemented in Mahout. With this, we have finished our discussion of the classification algorithms available in Apache Mahout.
Now we move on to the next chapter of this book, where we will discuss the new changes coming up in the new Mahout release.
Chapter 8. Mahout Changes in the Upcoming Release
Mahout is a community-driven project, and its community is very strong. This community decided on some of the major changes in the upcoming 1.0 release. In this chapter, we will explore the upcoming changes and developments in Apache Mahout. We will look at the following topics in brief:
New changes due in Mahout 1.0
Apache Spark
H2O-platform-related work in Apache Mahout
Mahout new changes
Mahout was using the MapReduce programming model to handle large datasets. At the end of April 2014, the community decided to stop accepting new MapReduce algorithm implementations. This decision has a valid reason: Mahout's code base will be moving to modern data processing systems that offer a richer programming model and more efficient execution than Hadoop's MapReduce.
Mahout has started its implementation on top of a Domain Specific Language (DSL) for linear algebraic operations. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. The Scala DSL and the algebraic optimizer form the Scala and Spark bindings for Mahout.
Mahout Scala and Spark bindings
With the Mahout Scala bindings and Mahout Spark bindings for linear algebra subroutines, the Mahout developers are trying to bring semantic explicitness to Mahout's in-core and out-of-core linear algebra subroutines. They are doing this while adding the benefits of the strong programming environment of Scala and capitalizing on the scalability benefits of Spark and GraphX. The Scala binding provides support for a Scala DSL, which will make writing machine learning programs easier.
The Mahout Scala and Spark bindings are packages that aim to provide an R-like look and feel to Mahout's in-core and out-of-core Spark-backed linear algebra. An important part of the Spark bindings is the expression optimizer. This optimizer looks at the entire expression and decides how it can be simplified and which physical operators should be picked. A high-level diagram of the bindings stack is shown in the following figure (https://issues.apache.org/jira/secure/attachment/12638098/BindingsStack.jpg):
The Spark binding shell has also been implemented in Mahout 1.0. Let's understand the Apache Spark project first, and then we will revisit the Spark binding shell in Mahout.
Apache Spark
Apache Spark is an open source, in-memory, general-purpose computing system. Instead of Hadoop-style disk-based computation, Spark uses cluster memory to load the data into memory, where it can be queried repeatedly; for such workloads, its in-memory technique can provide performance up to 100 times faster.
Apache Spark provides high-level APIs in Java, Python, and Scala, and an optimized engine that supports general execution graphs. It provides the following high-level tools:
Spark SQL: This is for SQL and structured data processing.
MLlib: This is Spark's scalable machine learning library that consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
GraphX: This is the Spark API for graphs and graph-parallel computation.
Spark Streaming: This can collect data from many sources, and after processing this data with complex algorithms, it can push the data to filesystems, databases, and live dashboards.
As Spark is gaining popularity among data scientists, the Mahout community is also quickly working on making Mahout algorithms function on Spark's execution engine to speed up their computation 10 to 100 times. Mahout provides several important building blocks to create recommendations using Spark. spark-itemsimilarity can be used to create "other people also liked these things" kinds of recommendations and, when paired with a search engine, can personalize recommendations for individual users. spark-rowsimilarity can provide non-personalized content-based recommendations and, when paired with a search engine, can be used to personalize content-based recommendations (http://comments.gmane.org/gmane.comp.apache.mahout.scm/6513).
Using Mahout's Spark shell
You can use Mahout's Spark shell by referring to the following steps:
1. Download Spark from http://spark.apache.org/downloads.html.
2. Create a new folder with the name spark using the following command and move the downloaded file there:
mkdir /tmp/spark
mv ~/Downloads/spark-1.1.1.tgz /tmp/spark
3. Unpack the archived file in the folder using the following commands:
cd /tmp/spark
tar xzf spark-1.1.1.tgz
4. This will unpack the file under /tmp/spark/spark-1.1.1. Now, move to the newly created folder and run the following commands:
cd spark-1.1.1
sbt/sbt assembly
This will build Spark on your system, as shown in the following screenshot:
5. Now create a Mahout directory and move into it using the following commands:
mkdir /tmp/Mahout
cd /tmp/Mahout
6. Check out the master branch of Mahout from GitHub using the following command:
git clone https://github.com/apache/mahout mahout
The output of the preceding command is shown in the following screenshot:
7. Change your directory to the newly created mahout directory and build Mahout:
cd mahout
mvn -DskipTests clean install
The output of the preceding command is shown in the following screenshot:
8. Move to the directory where you unpacked Spark and type the following commands to start Spark locally:
cd /tmp/spark/spark-1.1.1
sbin/start-all.sh
The output of the preceding command is shown in the following screenshot:
9. Open a browser and point it to http://localhost:8080/ to check whether Spark has started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://).
10. Define the following environment variables:
export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[URL of the Spark master]
11. Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell; you should see the shell starting and get the mahout> prompt.
Now your Mahout Spark shell is ready, and you can start playing with data. For more information on this topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.
H2O platform integration
As discussed earlier, experimental work to integrate Mahout and the H2O platform is also in progress. The integration provides an H2O backend to the Mahout algebra DSL.
H2O makes Hadoop do math! H2O scales statistics, machine learning, and math over big data. It is extensible, and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces such as R, Excel, and JSON so that big data enthusiasts and experts can explore, munge, model, and score datasets using a range of simple-to-advanced algorithms. Data collection is easy, while decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. It also has a vision of online scoring and modeling in a single platform (http://0xdata.com/download/).
H2O is fundamentally a peer-to-peer system. H2O nodes join together to form a cloud on which high-performance distributed math can be executed. Each node joins a cloud of a given name. Multiple clouds can exist on the same network at the same time as long as their names are different. Multiple nodes can exist on the same server as well (and they can even belong to the same cloud).
The Mahout H2O integration fits into this model by having N-1 worker nodes and one driver node, all belonging to the same cloud name. The default cloud name used for the integration is mah2out. Clouds have to be spun up per task/job.
More details can be found at https://issues.apache.org/jira/browse/MAHOUT-1500.
Summary
In this chapter, we discussed the upcoming Mahout 1.0 release and the changes that are currently in progress. We glanced through Apache Spark and the Scala and Spark bindings. We also discussed a high-level overview of the H2O and Mahout integration.
Now let's move on to the final chapter of this book, where we will develop a production-ready classifier.
Chapter 9. Building an E-mail Classification System Using Apache Mahout
In this chapter, we will create a classifier system using Mahout. In order to build this system, we will cover the following topics:
Getting the dataset
Preparation of the dataset
Preparing the model
Training the model
In this chapter, we will target the creation of two different classifiers. The first one will be an easy one, because you can both create and test it on a pseudo-distributed Hadoop installation. For the second classifier, I will provide you with all the details so that you can run it using your fully distributed Hadoop installation. I will count the second one as a hands-on exercise for the readers of this book.
First of all, let's understand the problem statement for the first use case. Nowadays, in most e-mail systems, we see that e-mails are classified as spam or not spam. E-mails that are not spam are delivered directly to our inbox, but spam e-mails are stored in a folder called Spam. Usually, based on certain patterns, such as the message subject, the sender's e-mail address, or certain keywords in the message body, we categorize an incoming e-mail as spam. We will create a classifier using Mahout that will classify an e-mail as spam or not spam. We will use SpamAssassin, an Apache open source project dataset, for this task.
For the second use case, we will create a classifier that can predict the group of incoming e-mails. There are lots of open source projects under the Apache Software Foundation, such as Apache Mahout, Apache Hadoop, Apache Solr, and so on. We will take the Apache Software Foundation (ASF) e-mail dataset, and using this, we will create and train our model so that it can classify a new incoming e-mail. So, based on certain features, we will be able to predict which group a new incoming e-mail belongs to.
In Mahout's classification problem, we will have to identify a pattern in the dataset to help us predict the group of a new e-mail. We already have a dataset that is separated by project names. We will use the ASF public e-mail archives dataset for this use case.
Now, let's consider our first use case: the spam e-mail detection classifier.
Spam e-mail dataset
As mentioned, we will be using the Apache SpamAssassin project's dataset. Apache SpamAssassin is an open source spam filter. Download 20021010_easy_ham.tar.bz2 and 20021010_spam.tar.bz2 from http://spamassassin.apache.org/publiccorpus/, as shown in the following screenshot:
Creating the model using the Assassin dataset
We can create the model with the help of the following steps:
1. Create a folder under /tmp with the name assassin/dataset, then move into the folder and unzip the datasets using the following commands:
mkdir -p /tmp/assassin/dataset
tar -xvf /tmp/assassin/20021010_easy_ham.tar.bz2
tar -xvf /tmp/assassin/20021010_spam.tar.bz2
This will create two folders under the dataset folder, easy_ham and spam, as shown in the following screenshot:
2. Create a folder in HDFS and move this dataset into Hadoop:
hadoop fs -mkdir /user/hue/assassin/
hadoop fs -put /tmp/assassin/dataset /user/hue/assassin
Now our data preparation is done. We have downloaded the data and moved it into HDFS. Let's move on to the next step.
3. Convert this data into sequence files so that we can process it using Hadoop:
bin/mahout seqdirectory -i /user/hue/assassin/dataset -o /user/hue/assassinseq-out
4. Convert the sequence file into sparse vectors (Mahout algorithms accept input in vector format, which is why we convert the sequence file into sparse vectors) using the following command:
bin/mahout seq2sparse -i /user/hue/assassinseq-out/part-m-00000 -o /user/hue/assassinvec -lnorm -nv -wt tfidf
The parameters in the preceding command are explained as follows:
lnorm: This option makes the output vectors log-normalized.
nv: This option produces named vectors.
wt: This option identifies the kind of weight to use. Here we use tf-idf.
5. Split the set of vectors for training and testing the model, as follows:
bin/mahout split -i /user/hue/assassinvec/tfidf-vectors --trainingOutput /user/hue/assassindatatrain --testOutput /user/hue/assassindatatest --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
The preceding command can be explained as follows:
The randomSelectionPct parameter sets the percentage of the data that is randomly selected as the test set. In this case, 20 percent is used for testing and 80 percent for training.
The xm parameter specifies the execution method to use: sequential or mapreduce.
6. Now, train the model using the following command:
bin/mahout trainnb -i /user/hue/assassindatatrain -el -o /user/hue/prodmodel -li /user/hue/prodlabelindex -ow -c
7. Now, test the model using the following command:
bin/mahout testnb -i /user/hue/assassindatatest -m /user/hue/prodmodel/ -l /user/hue/prodlabelindex -ow -o /user/hue/prodresults
You can see from the results that the output is displayed on the console. As per the confusion matrix, the system has correctly classified 99.53 percent of the given instances.
We can use this created model to classify new documents. To do this, we can either use a Java program or create a servlet that can be deployed on our server.
Let's take an example of a Java program in continuation of this exercise.
Program to use a classifier model
We will create a Java program that will use our model to classify new e-mails. This program will take the model, the label index, the dictionary file, the document frequency file, and a text file as input and will generate a score for each category. The category will be decided based on the highest score.
Let's have a look at this program step by step:
The .jar files required to compile this program are as follows:
hadoop-core-x.y.z.jar
mahout-core-xyz.jar
mahout-integration-xyz.jar
mahout-math-xyz.jar
The import statements are listed as follows. We are discussing this because there are lots of changes across Mahout releases, and people usually find it difficult to get the correct classes:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.TFIDF;
import org.apache.hadoop.io.*;
import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;
The supporting method to read the dictionary is as follows:

// Reads the term -> term id mapping produced by seq2sparse
public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath) {
  Map<String, Integer> dictionary = new HashMap<String, Integer>();
  for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf)) {
    dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
  }
  return dictionary;
}
The supporting method to read the document frequency is as follows (in this SequenceFile, the special key -1 holds the total number of documents, which the main method uses later):
public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
  Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
  for (Pair<IntWritable, LongWritable> pair :
      new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
    documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
  }
  return documentFrequency;
}
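As a quick usage sketch of these two helpers (placed inside a main method of the same class, and assuming the files have already been fetched from HDFS to the local paths we use later in this chapter):

// Sketch: load the vectorizer artifacts from local disk.
Configuration conf = new Configuration();
Map<String, Integer> dictionary =
    readDictionary(conf, new Path("/tmp/assassinmodeltest/dictionary.file-0"));
Map<Integer, Long> df =
    readDocumentFrequency(conf, new Path("/tmp/assassinmodeltest/df-count"));
// By seq2sparse convention, the key -1 holds the total document count.
System.out.println("Terms: " + dictionary.size() + ", documents: " + df.get(-1));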
The first part of the main method is used to perform the following actions:
Getting the input
Loading the model
Initializing StandardNaiveBayesClassifier using our created model
Reading the label index, document frequency, and dictionary created while creating the vectors from the dataset
The following code can be used for the preceding actions:
public static void main(String[] args) throws Exception {
  if (args.length < 5) {
    System.out.println("Arguments: [model] [label index] [dictionary] [document frequency] [new file]");
    return;
  }
  String modelPath = args[0];
  String labelIndexPath = args[1];
  String dictionaryPath = args[2];
  String documentFrequencyPath = args[3];
  String newDataPath = args[4];
  Configuration configuration = new Configuration();
  // the model is a matrix (wordId, labelId) => probability score
  NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
  StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
  // labels is a map: classId => label
  Map<Integer, String> labels =
      BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
  Map<String, Integer> dictionary =
      readDictionary(configuration, new Path(dictionaryPath));
  Map<Integer, Long> documentFrequency =
      readDocumentFrequency(configuration, new Path(documentFrequencyPath));
The second part of the main method is used to extract words from the e-mail:
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
  int labelCount = labels.size();
  // the key -1 in the document frequency map holds the total document count
  int documentCount = documentFrequency.get(-1).intValue();
  System.out.println("Number of labels: " + labelCount);
  System.out.println("Number of documents in training set: " + documentCount);
  BufferedReader reader = new BufferedReader(new FileReader(newDataPath));
  while (true) {
    String line = reader.readLine();
    if (line == null) {
      break;
    }
    ConcurrentHashMultiset<Object> words = ConcurrentHashMultiset.create();
    // extract words from the mail
    TokenStream ts = analyzer.tokenStream("text", new StringReader(line));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    int wordCount = 0;
    while (ts.incrementToken()) {
      if (termAtt.length() > 0) {
        String word = ts.getAttribute(CharTermAttribute.class).toString();
        Integer wordId = dictionary.get(word);
        // if the word is not in the dictionary, skip it
        if (wordId != null) {
          words.add(word);
          wordCount++;
        }
      }
    }
    ts.close();
The third part of the main method creates a vector of the word IDs and their tf-idf weights:
    Vector vector = new RandomAccessSparseVector(10000);
    TFIDF tfidf = new TFIDF();
    for (Multiset.Entry<Object> entry : words.entrySet()) {
      String word = (String) entry.getElement();
      int count = entry.getCount();
      Integer wordId = dictionary.get(word);
      Long freq = documentFrequency.get(wordId);
      double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
      vector.setQuick(wordId, tfIdfValue);
    }
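For reference, Mahout's TFIDF weight delegates to Lucene's classic similarity, so the value returned by tfidf.calculate(count, freq, wordCount, documentCount) is approximately the following (an assumption worth verifying against your Mahout version; this implementation ignores the wordCount argument):

$$w_{t,d} = \sqrt{tf_{t,d}} \cdot \left(1 + \ln\frac{N}{df_t + 1}\right)$$

Here, $tf_{t,d}$ is the term's count in the e-mail, $df_t$ is its document frequency, and $N$ is documentCount.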
In the fourth part of the main method, using the classifier, we get the score for each label and assign the e-mail to the highest-scored label:
    Vector resultVector = classifier.classifyFull(vector);
    double bestScore = -Double.MAX_VALUE;
    int bestCategoryId = -1;
    for (int i = 0; i < resultVector.size(); i++) {
      Element e1 = resultVector.getElement(i);
      int categoryId = e1.index();
      double score = e1.get();
      if (score > bestScore) {
        bestScore = score;
        bestCategoryId = categoryId;
      }
      System.out.print(" " + labels.get(categoryId) + ": " + score);
    }
    System.out.println(" => " + labels.get(bestCategoryId));
  } // end of the while loop over input lines
} // end of main
Now, put all of this code into one class and create a .jar file from it. We will use this .jar file to test our new e-mails.
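A hedged sketch of compiling and packaging follows; the Hadoop jar location is an assumption, so adjust the classpath to wherever the .jar files listed earlier live, and note that the class is named com.packt.spamfilter.TestClassifier in the run command used in the next section:

javac -cp "/usr/lib/hadoop/*:/usr/lib/mahout/*" com/packt/spamfilter/TestClassifier.java
jar cf spamclassifier.jar com/packt/spamfilter/TestClassifier*.class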
Testing the program

To test the program, perform the following steps:
1. Create a folder named assassinmodeltest in the local directory, as follows:
mkdir /tmp/assassinmodeltest
2. To use this model, get the following files from HDFS to /tmp/assassinmodeltest:
For the earlier created model, use the following command:
hadoop fs -get /user/hue/prodmodel /tmp/assassinmodeltest
For the label index, use the following command:
hadoop fs -get /user/hue/prodlabelindex /tmp/assassinmodeltest
For df-count from the assassinvec folder (change the name of the part-00000 file to df-count), use the following command:
hadoop fs -get /user/hue/assassinvec/df-count /tmp/assassinmodeltest
For dictionary.file-0 from the same assassinvec folder, use the following command:
hadoop fs -get /user/hue/assassinvec/dictionary.file-0 /tmp/assassinmodeltest
3. Under /tmp/assassinmodeltest, create a test file (testemail) containing a sample e-mail message.
4. Now, run the program using the following command:
java -cp /tmp/assassinmodeltest/spamclassifier.jar:/usr/lib/mahout/* com.packt.spamfilter.TestClassifier /tmp/assassinmodeltest /tmp/assassinmodeltest/prodlabelindex /tmp/assassinmodeltest/dictionary.file-0 /tmp/assassinmodeltest/df-count /tmp/assassinmodeltest/testemail
5. Now, update the test e-mail file with a different message.
6. Run the program again using the same command given in step 4 and view the result.
Now we have a program ready that can use our classifier model and predict unknown items. Let's move on to our second use case.
Second use case as an exercise

As discussed at the start of this chapter, we will now work on a second use case, where we will predict the category of a new e-mail.
The ASF e-mail dataset

The Apache Software Foundation e-mail dataset is partitioned by project. This e-mail dataset can be found at http://aws.amazon.com/datasets/7791434387204566.
A smaller dataset can be found at http://files.grantingersoll.com/ibm.tar.gz (refer to http://lucidworks.com/blog/scaling-mahout/). Use this data to perform the following steps:
1. Move this data to a folder of your choice (/tmp/asfmail) and unpack the archive:
mkdir /tmp/asfmail
tar -xvf ibm.tar
2. Move the dataset to HDFS:
hadoop fs -put /tmp/asfmail/ibm/content /user/hue/asfmail
3. Convert the mbox files into Hadoop's SequenceFile format using Mahout's SequenceFilesFromMailArchives, as follows:
mahout org.apache.mahout.text.SequenceFilesFromMailArchives --charset "UTF-8" --body --subject --input /user/hue/asfmail/content --output /user/hue/asfmailout
4. Convert the sequence files into sparse vectors:
mahout seq2sparse --input /user/hue/asfmailout --output /user/hue/asfmailseqsp --norm 2 --weight TFIDF --namedVector --maxDFPercent 90 --minSupport 2 --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer
5. Modify the labels:
mahout org.apache.mahout.classifier.email.PrepEmailDriver --input /user/hue/asfmailseqsp --output /user/hue/asfmailseqsplabel --maxItemsPerLabel 1000
Now, the next three steps are similar to the ones we performed earlier:
1. Split the dataset into training and test datasets using the following command:
mahout split --input /user/hue/asfmailseqsplabel --trainingOutput /user/hue/asfmailtrain --testOutput /user/hue/asfmailtest --randomSelectionPct 20 --overwrite --sequenceFiles
2. Train the model using the training dataset, as follows:
mahout trainnb -i /user/hue/asfmailtrain -o /user/hue/asfmailmodel --extractLabels --labelIndex /user/hue/asfmaillabels
3. Test the model using the test dataset:
mahout testnb -i /user/hue/asfmailtest -m /user/hue/asfmailmodel --labelIndex /user/hue/asfmaillabels
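If you want to take this further on your own, a hedged starting point is to pull the model artifacts down locally, mirroring the spam-filter steps; the /tmp/asfmodeltest folder is hypothetical, and the df-count and dictionary.file-0 paths assume they live under the seq2sparse output directory /user/hue/asfmailseqsp:

hadoop fs -get /user/hue/asfmailmodel /tmp/asfmodeltest
hadoop fs -get /user/hue/asfmaillabels /tmp/asfmodeltest
hadoop fs -get /user/hue/asfmailseqsp/df-count /tmp/asfmodeltest
hadoop fs -get /user/hue/asfmailseqsp/dictionary.file-0 /tmp/asfmodeltest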
As you may have noticed, all these steps are almost identical to the ones we performed earlier. I leave it as an exercise for you to create your own classifier system using this model; you can use the hints provided for the spam filter classifier and the sketch above. We now move our discussion to tuning our classifier. Let's take a brief overview of the best practices in this area.
Classifiers tuning

We already discussed classifier evaluation techniques in Chapter 1, Classification in Data Analysis. As a reminder, we evaluate our model using techniques such as the confusion matrix, the entropy matrix, the area under the ROC curve, and so on.
From the explanatory variables, we create the feature vector. To check how a particular model is working, these feature vectors need to be investigated. In Mahout, there is a class available for this: ModelDissector. It takes the following three inputs:
Features: This input takes a feature vector to use (destructively)
TraceDictionary: This input takes a trace dictionary containing variables and the locations in the feature vector that are affected by them
Learner: This input takes the model that we are probing to find weights on features
ModelDissector tweaks the feature vector and observes how the model output changes. By averaging over a number of examples, we can determine the effect of the different explanatory variables (see the sketch below).
ModelDissector has a summary method, which returns the most important features with their weights, the most important category, and the top few categories that they affect.
The output of ModelDissector is helpful in troubleshooting problems in a wrongly created model.
More details of the code can be found at https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/sgd/ModelDissector.java.
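To make this concrete, here is a minimal sketch of probing a model, assuming the update and summary signatures from the source file linked above; the wrapper class and method names are hypothetical:

import java.util.Map;
import java.util.Set;

import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.sgd.ModelDissector;
import org.apache.mahout.math.Vector;

public class DissectModel {
  // Prints the n most influential features of a trained model.
  // The feature vector is used destructively, so pass a fresh copy;
  // the trace dictionary maps each variable name to the vector
  // positions it touched during feature encoding.
  public static void dissect(AbstractVectorClassifier learner, Vector features,
      Map<String, Set<Integer>> traceDictionary, int n) {
    ModelDissector md = new ModelDissector();
    md.update(features, traceDictionary, learner);
    for (ModelDissector.Weight w : md.summary(n)) {
      // getFeature/getWeight are assumed from the linked source
      System.out.println(w.getFeature() + "\t" + w.getWeight());
    }
  }
}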
While improving the output of the classifier, one should take care with two commonly occurring problems: target leaks and broken feature extraction.
If the model shows results that are too good to be true, or output beyond expectations, we could have a problem with a target leak. This error creeps in when information from the target variable is included in the explanatory variables used to train the classifier; for example, if a folder name that already encodes the "spam" label is tokenized into the features. In this instance, the classifier will work suspiciously well on the test dataset.
On the other hand, broken feature extraction occurs when the feature extraction code fails to produce the features it was designed to produce. This type of classifier shows the opposite symptom to target leaks: the model provides results that are poorer than expected.
To tune the classifier, we can introduce new explanatory variables, transform existing explanatory variables, or eliminate some of the variables. We should also try different learning algorithms to create the model, and choose an algorithm that is good in terms of performance, training time, and speed.
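As a toy illustration of transforming an explanatory variable (the feature name and index here are hypothetical, not part of the book's example), skewed counts are often log-transformed before being encoded into the feature vector:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TransformExample {
  // Hypothetical vector position reserved for a "message length" feature.
  private static final int LENGTH_INDEX = 42;

  public static void main(String[] args) {
    Vector vector = new RandomAccessSparseVector(10000);
    double messageLength = 2048.0;                   // raw explanatory variable
    double transformed = Math.log1p(messageLength);  // log transform tames the skew
    vector.setQuick(LENGTH_INDEX, transformed);
    System.out.println("Encoded value: " + vector.get(LENGTH_INDEX));
  }
}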
More details on tuning can be found in Chapter 16, Deploying a classifier, of the book Mahout in Action.
Summary

In this chapter, we discussed creating our own production-ready classifier model. We took up two use cases: an e-mail spam filter, and classifying e-mails according to their projects. We used the Apache SpamAssassin dataset for the e-mail filter and the ASF dataset for the e-mail classifier.
We also saw how to improve the performance of our model.
So you are now ready to implement classifiers using Apache Mahout for your own real-world use cases. Happy learning!