DATA - iedu.us€¦ · Introduction to Data Mining 1.1 Data Science 1.2 Knowledge Discovery in...
Transcript of DATA - iedu.us€¦ · Introduction to Data Mining 1.1 Data Science 1.2 Knowledge Discovery in...
DATASCIENCEWITHR:BYANDREWOLEKSY
Copyright©2018byAndrewOleksy.AllRightsReserved.
Nopartofthispublicationmaybereproduced,distributedortransmittedinanyformorbyanymeans,includingphotocopying,recordingorotherelectronicormechanicalmethodsorbyany
informationstorageorretrievalsystemwithoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofverybriefquotationsembodiedincriticalreviewsandcertainothernon-commercial
usespermittedbycopyrightlaw
TABLEOFCONTENTS
TableofContentsChapter1:IntroductiontoDataMiningSummaryPrerequisiteKnowledge
IntroductiontoDataMining1.1DataScience1.2KnowledgeDiscoveryinDatabases(KDD)1.2.1DataCollection1.2.2Preprocessing1.2.3Transformation1.2.4DataMining1.2.5InterpretationandEvaluation
1.3ModelTypes1.4ExamplesandCounterexamples1.5ClassificationofDataMiningmethods1.5.1Classification1.5.2Regression1.5.3Clustering1.5.4ExtractionandAssociationAnalysis1.5.5Visualization1.5.6AnomalyDetection
1.6Applications1.6.1Medicine1.6.2Finance1.6.3Telecommunications
1.7Challenges1.8TheRProgrammingLanguage
1.9BasicConcepts,DefinitionsandNotations1.10ToolInstallation
Chapter2:IntroductiontoRSummaryPrerequisiteKnowledge
IntroductiontoR2.1DataTypes2.1.1DefinitionandObjectClasses2.1.2VectorsandLists2.1.3Matrix2.1.4.FactorsandNominalData2.1.5MissingValues2.1.6DataFrames
2.2BasicTasks2.2.1ReadingDatafromFile2.2.2Sequencecreation2.2.3ReferencetoSubsets2.2.4Vectorization
2.3ControlStructures2.3.1ConditionalStatement:if-else2.3.1Loops:for,repeatandwhile2.3.3Nextandbreakstatements
2.4Functions2.5ScopingRules2.6IteratedFunctions2.6.1lapply2.6.2sapply2.6.3Split2.6.4tapply
2.7HelpfromtheconsoleandPackageInstallationChapter3:Types,QualityandDataPreprocessing
SummaryPrerequisiteKnowledge
Types,QualityandDataPreprocessing3.1CategoriesandTypesofVariables3.2Preprocessingprocesses3.2.1Datacleansing3.2.1.1MissingValues3.2.1.2DatawithNoiseExample–Datasmoothingusingbinningmethods
3.2.1.3Inconsistentdata3.2.2DataUnification3.2.3DataTransformationandDiscretization3.2.3.1DataTransformationExample–DataRegularization3.2.3.2DataDiscretizationExample–Entropy-baseddiscretization
3.2.4DataReduction3.2.4.1DimensionReduction3.2.4.2DataCompression
3.3dplyrandtidyrpackages3.3.1dplyr3.3.2tidyr
Chapter4:SummaryStatisticsandVisualizationSummaryPrerequisiteKnowledgeSummaryStatisticsandVisualization
4.1MeasuresofPosition
4.1.1MeanValue4.1.2Median
4.2Measuresofdispersion4.2.1Minimumvalue,Maximumvalue,Range4.2.2Percentilevalues4.2.3InterquartileRange4.2.4Variance4.2.5StandardDeviation4.2.6CoefficientofVariation
4.3VisualizationofQualitativeData4.3.1FrequencyTable4.3.2BarCharts4.3.3PieChart4.3.4ContingencyMatrix4.3.4StackedBarChartsandGroupedBarCharts
4.4VisualizationofQuantitativeData4.4.1FrequencyTable4.4.2.Histograms4.4.3FrequencyPolygon4.4.4Boxplot
Chapter5:ClassificationandPredictionSummaryPrerequisiteKnowledge
5.1Classification5.1.2DecisionTrees5.1.2.1Description5.1.2.2DecisionTreecreation–ID3Algorithm5.1.2.3DecisionTreecreation–GiniIndex
5.2Prediction
5.2.1DifferencebetweenClassificationandPrediction5.2.2LinearRegression5.2.2.1Description,DefinitionsandNotations5.2.2.2CostFunction5.2.2.3GradientDescentAlgorithm5.2.2.4GradientDescentinLinearRegression
5.2.2.5LearningParameter5.3Overfittingandregularization5.3.1Overfitting5.3.2ModelRegularization5.3.3LinearRegressionwithNormalization
Chapter6:ClusteringSummaryPrerequisiteKnowledge
CLUSTERING6.1UnsupervisedLearning6.2ConceptofCluster6.3k-meansalgorithm6.3.1AlgorithmDescription6.3.2RandomCentroidsInitialization6.3.3ChoosingthenumberofClusters6.3.4Applyingk-meansinR
6.4HierarchicalClusteringAlgorithms6.4.1DistanceMeasurementsBetweenClusters6.4.2AgglomerativeAlgorithms6.4.3DivisiveAlgorithms6.4.4ApplyingHierarchicalClusteringinR
6.5DBSCANAlgorithm6.5.1BasicConcepts
6.5.2AlgorithmDescription6.5.3AlgorithmComplexity6.5.4Advantages6.5.5Disadvantages
Chapter7:MiningofFrequentItemsetsandAssociationRulesSummaryPrerequisiteKnowledgeMiningofFrequentItemsetsandAssociationRules
7.1Introduction7.2TheoreticalBackground7.3AprioriAlgorithm7.4FrequentItemsetsTypes7.5PositiveandNegativeBorderofFrequentItemsets7.6AssociationRulesMining7.7AlternativeMethodsforLargeItemsetsgeneration7.7.1SamplingAlgorithm7.7.2PartitioningAlgorithm
7.8FP-GrowthAlgorithm7.9Arulespackage
Chapter 8: Computational Methods for Big Data Analysis (Hadoop andMapReduce)
SummaryPrerequisiteKnowledge
8.1Introduction8.2AdvantagesofHadoop’sDistributedFileSystem8.3HadoopUsers8.4HadoopArchitecture8.4.1HadoopDistributedFileSystem(HDFS)8.4.2HDFSArchitecture
8.4.3HDFS–LowPerformanceAreas8.4.3.1LowDataAccessTime8.4.3.2MultipleSmallFiles8.5.3.3MultipleDataRecordingNodes,ArbitraryFileModifications
8.4.4BasicHDFSConcepts8.4.4.1Blocks8.4.4.2NamenodesandDatanodes8.4.4.3HDFSFederation8.4.4.4HDFSHighAvailability
8.4.5DataFlow–DataReading8.4.6NetworkTopologyinHadoop8.4.7FileWriting8.4.8CopiesPlacement8.4.9ConsistencyModel
8.5TheHadoopClusterArchitecture8.6HadoopJavaAPI8.7ListsLoops&GenericClassesandMethods8.7.1GenericClassesandMethods8.7.2TheClassObject
OneLastThing...
CHAPTER1:INTRODUCTIONTODATAMINING
SUMMARY
TheaimofthischapteristointroduceDataMiningandKnowledgeDiscoveryinDatabases.First,somebasicconceptsaredefinedalongwiththereasonswhythis scientific fieldwas created and expanded rapidly.Second,wewill reviewsomepracticalexamplesandcounterexamplesofDataMining.Additionally,themost important fields onwhichDataMining is based are presented. Last,wewill findoutwhatRprogramming language is and its general philosophy.Wewillalsoshowhowtoinstallallnecessarytools,wewillpresenthowtoaddressto the R language manual for help, and define its basic types and functions,alongwiththeirfunctionality.
PREREQUISITEKNOWLEDGE
Nopreviousknowledgeisneededforthischapter.
INTRODUCTIONTODATAMINING
The evolution of technology helped internet expand lightning fast.Over time,internet access became accessible to more and more people. This led to thedevelopmentofmillionwebsitesandtheuseofdatabasesforstoringthesedata.The creationof the first commercial and socialwebpages created theneed forstoringandmanaginglargeamountofdata.
Today,theamountofavailabledataishugeandisgrowingexponentiallyeveryday.Theneedforminimizingthecostsforcollectingandstoringthesedatawasoneofthebiggestreasonsforthegrowthofthisscientificfield.
Thehugeamountofdatastoredindatabasesanddatawarehousescouldnotbeutilized as is. In order to get useful conclusions, some necessary actions arerequired inorder tostructure thedata.On thischapterwewillviewwhicharethefundamentalstagesinordertoextractvaluableandusableinformationfromdata.
.1DATASCIENCEDataScienceisanewterm,whichcametoreplaceformertermslikeKnowledgeDiscoveryinDatabaseorDataMining.
Boththreetermscanbeusedtodescribeasemiautomatedprocesswhosemainpurpose is toanalyzeahugevolumeofdataaboutaspecificproblemwith thepurposeofcreatingpatternsinscientificfieldslikeStatistics,MachineLearningandPatternRecognition.
Those patters, found in multiple forms like associations, anomalies, clusters,classes etc. constitute structures or instances, which appear in data and arestatisticallysignificant.
One of themost important aspects ofData Science has to dowith finding orrecognizing(recognizingmeansthatthepatterswherenotexpectedinadvance)andevaluatingthesepatterns.Apatternshouldshowsignsoforganizationacrossthis structure.Thesepatterns, alsoknownasmodels,mostof the timescanbetrackedwiththeuseofmeasurablefeaturesorattributeswhichareextractedbydata.
DataScience is anewscience,whichappearedat the endof1980and started
growinggradually.Duringthisera,RelationalDatabaseswereattheirzenithandserveddatastorageneedsforcompaniesandorganizationswith thepurposeofbetterorganizingandmanagingthem,sothatmassqueriesneededfortheirdaytodayoperationscouldbeaccomplishedfaster.
These Database Managing Systems (DBMS) followed the so called OLTP(OnLine Transaction Processing) model, with the purpose of processingtransactions.Thesetoolsallowedtheusertofindanswersinquestionshealreadykneworcreatesomereferences.
The need for better utilization of data created by these systems – the systemswhichhelped thedailyneedsof a company- led to thedevelopmentofOLAP(OnLineAnalyticalProcessing)typetools.WiththeseOLAPtoolsitwaseasierto answer more advanced queries, allowing bigger and MultidimensionalDatabases(MDB)toworkfasterandprovidedatavisualization.
OLAP tools could also be named as data exploration tools due to thevisualizationofthesedata.Thesetoolsallowedusers(salesmanagers,marketingmanagersetc.)torecognizenewpattersbutthisdiscoveryshouldbemadebytheuser.
Forexample,ausercouldperformqueriesaboutthetotalrevenuegeneratedbymultiple stores of a particular companywithin a country, in order to find thestoreswiththelowestrevenue.
AutomationofpatternrecognitionwascreatedthroughmethodologiesandtoolscreatedbythefieldofDataScience.Throughthesesolutions,patternrecognitionwas aided by the final goal. For example, if a userwanted a report about thestoreswiththelessrevenuegeneratedlastmonth,hecouldaskfromthesystemtofindvarioususefulinsightsaboutstoresrevenue.
Data Science growth came gradually and was directly associated to thecapability of collecting and listing huge amount of data, of different types,through the rapid expansion of fast web infrastructures on which commercialapplicationscouldrelyon.
OneofthefirstcompanieswhoembracedthisadvancementwasAmazon,whichstarted by selling books and other products and then created a user-friendlyrelated products recommendation system. This system was built and adjustedaccordingly based on user interactions, using a method called Collaborating
Filtering.ThissystembecamethefoundationofRecommenderSystems.
Theunconditionaldatagenerationina24-hourbasissupportsahugeamountofhuman activities like shopping cart data, medical records, social mediaannouncements,bankingandstockmarketoperationsandsoon.Thesedatahaveawidevarietyoftypes(images,videos,realtimedata,DNAsequencesetc.)anddifferentacquisitiontimes.Ifsomeofthesedataarenotanalyzedimmediately,itmight be difficult later to be stored and processed, creating this way a newscientificfieldknownasBigData.DataScience’sgoal is toaddress theneedscreated fromthisnewenvironmentandprovidesolutions for theescalatedandsufficientprocessofout-ofcoredata.
Methods and tools used for this purpose have already being developed likeHadoop,Map-Reduce,Hive,MongoDB,GraphPD.
ThetwomaingoalsofpracticalDataSciencearetocreatemodels,whichcanbeused both in predicting and describing data. Prediction is about using somevariablesorpartsof adatabase fromwhichwecouldestimateanunknownorfuturevalueofanotherattribute.Descriptionfocusesonfindingcomprehensivepatterswhich can describe data like finding clusters or groups of objectswithsimilarattributes.
1.2KNOWLEDGEDISCOVERYINDATABASES(KDD)KnowledgeDiscovery inDatabases consists of 5 steps. It’s about revealingorcreatingusefulknowledgethroughdataanalysis.Itreferstothewholeprocess,fromdatacollectiontoutilizingtheoutcomeinamorepracticallevel.ThebasicstagesofKnowledgeDiscoveryinDatabasesare:
1DataCollection2Preprocessing3Transformation4DataMining5Interpretation/Evaluation
The“KnowledgeDiscoveryinDatabases”termisoftenassociatedwiththeterm“DataMining”althoughitisjustoneofitssteps.ThebasicgoalofDataMining(DM)istheextractionofunknownandpossiblyvaluableinformationorpatternsfromdata. Someone could say that the term is used excessively since no dataextraction isperformed.On thecontrary,preprocessedand(possibly)modifieddataareusedforextractingusefulinformation,whichisthenusedforsolvingaproblem.BelowyouwillfindabriefdescriptionabouteachstageofKDD.
1.2.1DATACOLLECTION
The first step ofKDD is the collection and storage of data.Data collection isusuallyperformedautomaticallye.g.byusingsensorsornotautomaticallye.g.via a questionnaire. Malfunction in sensors or non-submitted answers on a
questionnaire could lead to incomplete data. These particular problemswhichmightoccurduringdatacollectionarefacedbythenextstep.
1.2.2PREPROCESSINGThesecondandmostimportantstepofKDDisthepreprocessing,withagoalofcleansing data: handling defective, false ormissing data. Preprocessingmightrequire up to 60%of the total effort since there is no reason to discuss aboutresultsifdataarenotcleanandintherightform.OnChapter3wewillexaminethroughout theprocesses fromwhichpreprocessingconsistsof andwheneachprocessshouldbeused.
1.2.3TRANSFORMATION
Data Transformation is the third step of KDD. Basically, this step is aboutconverting data under a common frame allowing us to edit them later. It ismostly used for smoothing data and removing noise, for data aggregation, fornormalization or for creating new features based on the existing ones. Specialformsoftransformationarediscretizationandcompression.
1.2.4DATAMINING
On this step of KDD an algorithm is used for model generation. Clean andtransformeddataarenowreadytobeusedbyanalgorithminordertocreateamodel, usually for categorization or prediction. We want to use this model,created by some already known data, to get an answer about the value of anattribute-variabletargetgoalfornew,unknowndata.
1.2.5INTERPRETATIONANDEVALUATION
ThelaststepofKDDisusedtointerpretandevaluatetheresults(notthemodel)producedbythewholeprocess.
1.3MODELTYPESThemodelsproducedbytheDataMiningstepfallunder twotypes:predictivemodelsanddescriptivemodels.
The goal of a predictive model is to predict values for a specific interestingfeaturewhichcouldprobablybebasedon thebehaviorofotherattributes.Forexample,predictioncouldbebasedondataacrossdifferentdaysofeachweek.
Adescriptivemodel findspattersor relationshidden insidedataandexaminestheirattributesinordertoprovideanexplanationabouttheirbehavior.
1.4EXAMPLESANDCOUNTEREXAMPLES
It’shardforsomepeopletounderstandanddistinguishwhatKDDandDMare.That’swhywewillhavealookatsomepracticalexamplesandcounterexamplesin order to make clear what DM is or isn’t. Some examples of DM are thefollowing:
After 9/11, Bill Clinton announced that after examining lots ofdatabases, FBI agents discovered that 5 of the perpetrators wereregisteredtothesedatabases.Oneofthemowned30creditcardswitha negative balance of $250.000 and lived in US for less than twoyears.Telecommunication companies not only reward clients who spendlotsofmoneybutalsoclientsnamedas“guides”.Theseguidesoftenconvince friends, relatives, coworkers and others to follow themwhentheychangeprovider.So,telecomcompaniesneedtofindtheseclients andmake them stick to them, proving higher discounts andmoreservicestothem.Byusingdata fromolder recorded temperaturesduring the summerseasonoftheprevious15years,wetrytopredictthetemperaturesforthesummerseasonofthenext15years.
Datamining isnot just the simpleprocessingofqueriesneither small scalestatisticalprograms.Somecounterpartsarethefollowing:
FindingaphonenumberfromaphonebookFindinginformationaboutParisontheinternetFindingtheaverageofexamsgradesSearching for themedical recordsof a patiencewith a particulardisease,inordertofurtheranalyzehismedicalrecord.
1.5CLASSIFICATIONOFDATAMININGMETHODS
Thereisawiderangeofdataminingmethods.Dependingonthedatatypesandthetypeofknowledgeextracted,theyareclassifiedindifferentcategories.Somebasic methods of Data Mining are presented below. At this point we shouldmentionthatinDataScienceeducationisaccomplishedbyusingdata,whereasinother formsof education a teacher or a specialist transfers knowledge fromonepersontoanother.
1.5.1CLASSIFICATION
This is apredictivemethod. Itsgoal is tocreateamodel–classifierbasedoncurrent data. Basically, it’s the knowledge of a function which represents anobject(usuallyrepresentedasavaluesvectorforitscharacteristicfeatures)inavalueof a categoricalvariable alsoknownas class.Learning, is abehaviorofintelligentsystemswhichisstudiedbyscientificfieldslikeMachineLearningorArtificial Intelligence. Due to this, all these fields study the same problems,withoutthismeaningthattherearenootherscopesstudiedindividuallybyeachscientificfield.
Classificationisoftenassociatedwithprediction.Inclassification, theoutcomewewanttopredictistheclassofthesamples.Aclasscanhavediscretevaluesfromafiniteset.Onthecontrary,duringpredictionwithmethodslikeregression,thevariable-goalcouldbeanyrealnumber.
1.5.2REGRESSION
Regression isasimilar toclassificationprocess,whosegoal is learningorelsetraining a function which represents an object in a real variable. It is also apredictivemethod.Byusingsomeindependentvariablesitsgoalistopredictthevaluesofadependentvariable.
Ontheaboveimagewecanseeanexampleoflinerregression.Thevariablesinthisexamplearethesquaremetersofahouseandthesellingpriceinthousandsofdollars.Linearregressionadaptsalineinthesamplesofthedataset,showninredX’s.Itiscreatedbasedonadistancefunctionorpricefunction,whosepricewewanttominimize.Byhavingtheoptimalline(thelinewhichminimizesthevalueofthepricefunction)wecanthenestimateprettyaccuratelyquestionslike:“Whichisthesellingpricefor150squaremetershouses?”.
1.5.3CLUSTERING
Clustering isadescriptivemethod.Givenadataset, thegoalofclustering is tocreateclusters(groupswiththesameorsimilarfeatures).Inclusteringthegoalistofindafinitenumberofclustersinordertodescribedata.
Onthebelowexamplewecanseetheresultfromclusteringmedicaldata.Threeclustersarecreatedbasedon“dosage”and“influenceduration”.
1.5.4EXTRACTIONANDASSOCIATIONANALYSIS
AssociationRulesMiningisconsideredoneofthemostimportantdataminingprocesses.Ithasattractedalotofattentionsinceassociationrulesprovideabriefwaytoexpressthepotentiallyuseful informationinaneasytounderstandwayforthefinalusers.Theseassociationrulesdiscoverhiddenrelationshipsbetweenfeaturesofadataset.TheseassociationsarepresentedinanABform,whereAandB are sets referring to the features of thewhole datawe analyze.AnABassociationrulepredictstheappearanceoffeaturesofsetBgiventhatfeaturesofsetAarepresent.
Aclassicexampleofassociationrulesinpracticehastodowiththeanalysisofashoppingcartinasupermarket,wheredatahavetodowithclientstransactions.Inthisscenario,sometransactionscouldbe{bread,milk},{bread,diapers,beer,eggs},{milk,diapers,beer,soda},{bread,milk,diapers,beer}and{bread,milk,diapers,soda}.Someassociationrulesonthesetransactionscouldbe{Diapers}{Beer},{beer,bread}{milk},{milk,bread}{eggs,soda}.Forexample,thelastrulerevealsthatit’squitepossiblethatwhoeverbuysmilkandbreadmightalsobuyeggsandsoda.
Extracting valuable conclusions through association rules, the marketingdepartment of the super market could place its products in shelves moreprofitably, create better marketing campaigns and efficiently manage itsresources.
1.5.5VISUALIZATION
Datavisualizationhelpsinbetterunderstandingnotonlythedatathemselvesbutalso correlations that might occur between them. In a next chapter we willdescribethewaysofvisualizationinR.Visualizationthoughcanonlyapplyinaspecificnumberofdimensions.Thismeansthatfordatasetswithlotsoffeaturesvisualizationis impossible.Alternatively,wecouldsettlewiththevisualizationof a smaller part of our dataset. In any case, visualizations should beaccompaniedby the respective statistical inspections in order to be sure abouttheaccuracyofthedisplayedcorrelations.
1.5.6ANOMALYDETECTION
Anomaly detection focuses in finding deviations in data according to similardata collected in thepast or by typical values of thesedata.Thebelow imageshowsanexamplewhereinredwecanseeanormalsamplelocatednearothersampleswithnormalvaluesandalsoananomaly,whosevaluediffersalotfromtheothervalues.
Someotherexamplesofanomalydetectionarethefollowing:
FrauddetectionbasedonauserprofileFindingdysfunctionalobjectsinindustrialproductionComputermonitoringinadatacenter
1.6APPLICATIONS
Datamining is used in a broad range of fields like e.g.Medicine, Finance orevenTelecommunications.Below,wewillexaminesomeapplicationsofDMinsomeofthesefields.
1.6.1MEDICINE
Over recent years, alongwith the growth ofMedicine fields like genetics andbiomedical,theuseofDMwashighlighted.Ingenetics,thegoalistounderstandandmap the relationshipbetween thealterationofhumanDNAsequencesandthe predisposition of a disease. DM is a tool which can help in diagnosisimprovement,inpreventionandconsequently,curediseases.
One of the main goals associated with DNA analysis is the comparison ofmultiple sequences and the search of similarities between DNA data. Thiscomparison is performed between gene sequences of healthy and harmfultissues,inordertofinddifferencesbetweenthem.
Visualization tools arequite important inbiomedicine aswell.These tools areabletopresentcomplexgeneformationsingraphsortreediagrams.
AnexampleofDMuseinMedicineisbyusingclassificationtodetectdiabeticretinopathy (DR). On this particular problem, data are high quality images ofpatientsretina.Theclassrangesfrom0to4.
OtherclassicexamplesofDMareepilepticseizurespredictionthroughmagnetic
MRIdataanalysisandtheclassificationofcancerinbenignormalignant.
Last,DMmethodsapplied in socialmedia likeTwitter, isknown that allowedthedetectionofrapidlyspreadingvirusesanddiseaseslikefluandHIV,raisingawareness about the incident much faster than the Public Health Agenciesallowingamorevalidpreventionofthedisease.
1.6.2FINANCE
AnotherfieldwheredataminingmethodsareusedisFinance.Financialdataaremostlycollectedbybanksorotherinstitutes.Theyusuallyarereliableandhigh-quality data. The contribution of data mining in Finance can be found in thecollection and understanding of data, in clearing and improving data and inmodel creation as well. Financial data analysis aims to facilitate decisionmaking,bytakingactionsaccordingtotheanalysismade.
Initiallydataarecollectedbyvariousfinancialinstitutionsandarestoredindatawarehouses. Multidimensional analysis methods are used for their analysis.Additionally,anotherimplementationofDataMininginfinancehastodowithprediction. For example, predicting if a client would be able to repay a loan,based on his previous transactions with the bank and his current financialsituation. By processing these data, a bank is able to create its originationpoliciesandminimizetheriskofprofitloss.
AnotherimplementationofDMinfinanceistheattempttopredictstockprices.Stockpricesareusuallymodeledastimeseriesandthegoalistopredictifthestockpricewillmoveupordown.Forexample, in theabove image,basedonthe stock prices of the previousmonths, we can predict its price for the nextfifteen days.With green colorwe can see the price of the predictionwhile inbluewecanseetheactualpriceofthisparticularstock.Thepredictionaccuracywasnotthathighbutthegeneraltrend(upward)waspredictedsuccessfully.
Anothermethodwhichcanbeusedisclustering.Clientsaregroupedinclustersbasedonsimilarfeatures.Aneffectiveclusteringhelpsbankstoidentifyagroupof clients or associate a new clientwith an existing group and offer solutionswhichsatisfiedtheclientsofthisparticulargroupinthepast.
Last, with DM techniques possible fraud or false data can be identified. Byanalyzingfinancialtransactionsandextractingsomepatterns,thedetectionofanunusualincidentcouldtriggeranidentitycheckoftherealclient.
1.6.3TELECOMMUNICATIONS
DMisusefulintelecommunicationsaswell.Telecomdatalikecalltype,calleranddialedlocation,timeandcalldurationcanbeusedinbetterservingnotjust
eachclientindividuallybutallclientsofthenetwork.DMmethodsareusedtobalancesystemloadanddatatraffic.
Datacanbeusedtocreateaclientprofile.Basedontheseprofilesclientscanbegroupedinclustersanddependingonthegroupofferspecial telecompackagesaccordingly.Additionally, interestingpatterscanbeidentifiedandthenanalyzethereasonsleadingtothesepatters.Forexample,identifythefactorsleadingtoahighercallfrequencyfromusersinspecifictimeintervals.
1.7CHALLENGES
Knowledgeextraction isaverypromisingscientific field.Like ineverysinglescientificfield,weneedtofacelotsofchallenges.Thesechallengesaremainlytechnicalandsocial-moral.
Thebiggertechnicalchallengeistheincreasingamountofdataalongwiththeirmultidimensional and complex character. Solutions should be scalable andflexiblemeaningthatDMmethodsandalgorithmsshouldbeabletohandlenotonlyhugevolumesofdatabutsmallerdatasetsaswell.DistributedsystemsareasolutionforhugevolumesofdatalikeHadoopandMapReducewhichwewillstudyinChapter8.
Social-moral challenges have to dowith finding the right areas to applyDM,sinceprivacyissuesmayrisefromprocessingdataduringDM.
1.8THERPROGRAMMINGLANGUAGEThisbookisamanualonhowtousevariousmethodsandalgorithmsinordertosolve real problems, so both theoretical and practical problems are provided.Many of these problems come with an indicative solution. The practicalproblems require the knowledge of a proper programming language throughwhich the reader should be able to respond accordingly. The programminglanguagewewilluseisR.
Ihopeyoufindthesolutionsprovidedinthisbooksuitablefortheproblemsyouencounter and apply them accordingly. Next, we will discuss shortly what iscoveredineachchapterofthisbook.
Chapter2isanintroductiontoR.Thisisthechapteryoudon’twanttoskipinthis book. We will show the different data types which R supports, controlstructures, how we create and call functions, how to use basic and usefulfunctionsandalsohowtofindhelpaboutfunctionsofanypackage.Thischapterisprerequisiteforallotherchapters.
OnChapter3wepresentdatatypes,alongwiththenecessaryactionsneededtopreprocessdatainordertoensuretheirdataqualityandthus,thequalityofourresults.Having familiaritywithRwewill thenpresenthow function from thedplyr and tidyr packages are used, for preprocessing data faster and moreefficiently.
OnChapter 4 we present summary statistics and ways of visualization. Wedescribetermslikemeasuresofposition(average,midline),variance(variance,standard deviation) and association and the functions which create thesemeasuresinR.Next,wewillpresentsomewaysofvisualizingqualitativedatalikehistograms,barchartsandpiechartswithR.
OnChapter5westudytheconceptsofclassificationandprediction.Inrespectto classification we will present decision trees throughout. In respect topredictionwewillexaminethemethodoflinearregression.
OnChapter6wewillpresentmethodsofdataclustering.Oncewedefinewhatsupervisedlearningandclusteris,wewillexaminethreecategoriesofclusteringmethods: partitional clustering, hierarchical clustering, and density-basedclustering.Wewill next view some specific clustering algorithms, like the k-means algorithm, the agglomerative hierarchical algorithm and the DBSCANalgorithm.
On Chapter 7 we describe mining association rules from transactionaldatabases. After defining some basic terms, we will then present the Apriorialgorithm for finding frequent itemsets. We will then give an example ofassociationrulesminingfromfrequentitemsets.Last,wewillexaminethearulespackage.With the help of this package everything described in this chapter isalreadycreatedinR.
On Chapter 8 we will examine computational methods for analyzing hugevolumes of data. More specifically, we will focus on the Hadoop andMapReducetools,andhowtheycanworkalongtosolveproblems.
1.9BASICCONCEPTS,DEFINITIONSANDNOTATIONS
Wewillnowdefinethemostbasicconcepts,definitionsandsymbolswewillusein the next chapters of this book. We initially have our data, also known asdatasets. These datasets are structured in rows and columns. Each columncorresponds to a specific feature or variable of thedataset.Wewill use themsymboltorepresentthenumberoffeaturesinaset.Same,eachrowcorrespondstoadatasetsample.Wewillusethensymboltorepresentthenumberofrowsinaset.Eachsamplecontainsmvalues,oneforeachfeature.
Wewillusethewholedatasettocreateamodel.Thismodelwillbeusedlateronfor the correspondence of new samples in a predefined groupof categories orclasses (classification) or for predicting the values of an important feature,knownas targetvariable (prediction).Anyhow, themodel shouldbevalidated.For thedatasetswecurrentlyhave, the classorvalues for the target value areknown.
Inordertomakeanaccuratevalidationofthemodelwesplitourdatasetintwosubsets:thetrainingsetandthetestset.Usuallythetrainingsetisabout2/3oftheinitialdatasetwhereasthetestsetistheremaining1/3.Wewillseethatthereareotherwaysofsplittingdatasets.
During the trainingphase, themodel is createdbasedonanalgorithmand thetrainingset.Modelvalidation isnecessarybeforeusing it and isaccomplishedbyusingthetestset.Basically,weusethemodelondata,forwhichwealreadyknowtheclassorthevalueofthetargetgoal,sothatwecancompareitwiththeonegivenbythemodel.
1.10TOOLINSTALLATIONThebasictoolwewilluseistheRprogramminglanguage.YoucanfindtheRconsolealongwiththebasicreadymadepackagesforfreeintheofficialwebsite(https://www.r-project.org).
After visiting the official webpage clink on “downloadR” and then choose alocationclosetoyou.Thenyouwillneedtoselecttheoperationsystemyouuse.
BychoosingWindowswewillseethebelowscreen:
Herewewillclickon“installRforthefirsttime”.
Weclickthedownloadlink.Wethenjustruntheinstallationfileandfollowtheinstructions.BelowwecanseetheRconsole.
RStudio is a free development environment of R. The installation file can bedownloadedfromthislink:https://www.rstudio.com/products/rstudio/download.Onceagainchoosetherightversionaccordingtoyouroperatingsystem.
RStudiohomescreenisshowninthebelowpicture.Onthefirstframewecanseethecodeoftheopenfiles.Everytabisadifferentsourcecode.Onthesecondframevariables and functions are shown.On the third frame, in thePlots tab,graphsareprintandwecanviewthepackageswehavedownloadedorneedtobe updated through the Packages tab. Additionally, through the “Help” tab(frame 3) we can find information and help about a function or a package.Finally,onframe4wecanfindtheclassicRconsole.