DATA - iedu.us€¦ · Introduction to Data Mining 1.1 Data Science 1.2 Knowledge Discovery in...

DATASCIENCEWITHR:BYANDREWOLEKSY

Copyright©2018byAndrewOleksy.AllRightsReserved.

Nopartofthispublicationmaybereproduced,distributedortransmittedinanyformorbyanymeans,includingphotocopying,recordingorotherelectronicormechanicalmethodsorbyany

informationstorageorretrievalsystemwithoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofverybriefquotationsembodiedincriticalreviewsandcertainothernon-commercial

usespermittedbycopyrightlaw

TABLEOFCONTENTS

TableofContentsChapter1:IntroductiontoDataMiningSummaryPrerequisiteKnowledge

IntroductiontoDataMining1.1DataScience1.2KnowledgeDiscoveryinDatabases(KDD)1.2.1DataCollection1.2.2Preprocessing1.2.3Transformation1.2.4DataMining1.2.5InterpretationandEvaluation

1.3ModelTypes1.4ExamplesandCounterexamples1.5ClassificationofDataMiningmethods1.5.1Classification1.5.2Regression1.5.3Clustering1.5.4ExtractionandAssociationAnalysis1.5.5Visualization1.5.6AnomalyDetection

1.6Applications1.6.1Medicine1.6.2Finance1.6.3Telecommunications

1.7Challenges1.8TheRProgrammingLanguage

1.9BasicConcepts,DefinitionsandNotations1.10ToolInstallation

Chapter2:IntroductiontoRSummaryPrerequisiteKnowledge

IntroductiontoR2.1DataTypes2.1.1DefinitionandObjectClasses2.1.2VectorsandLists2.1.3Matrix2.1.4.FactorsandNominalData2.1.5MissingValues2.1.6DataFrames

2.2BasicTasks2.2.1ReadingDatafromFile2.2.2Sequencecreation2.2.3ReferencetoSubsets2.2.4Vectorization

2.3ControlStructures2.3.1ConditionalStatement:if-else2.3.1Loops:for,repeatandwhile2.3.3Nextandbreakstatements

2.4Functions2.5ScopingRules2.6IteratedFunctions2.6.1lapply2.6.2sapply2.6.3Split2.6.4tapply

2.7HelpfromtheconsoleandPackageInstallationChapter3:Types,QualityandDataPreprocessing

SummaryPrerequisiteKnowledge

Types,QualityandDataPreprocessing3.1CategoriesandTypesofVariables3.2Preprocessingprocesses3.2.1Datacleansing3.2.1.1MissingValues3.2.1.2DatawithNoiseExample–Datasmoothingusingbinningmethods

3.2.1.3Inconsistentdata3.2.2DataUnification3.2.3DataTransformationandDiscretization3.2.3.1DataTransformationExample–DataRegularization3.2.3.2DataDiscretizationExample–Entropy-baseddiscretization

3.2.4DataReduction3.2.4.1DimensionReduction3.2.4.2DataCompression

3.3dplyrandtidyrpackages3.3.1dplyr3.3.2tidyr

Chapter4:SummaryStatisticsandVisualizationSummaryPrerequisiteKnowledgeSummaryStatisticsandVisualization

4.1MeasuresofPosition

4.1.1MeanValue4.1.2Median

4.2Measuresofdispersion4.2.1Minimumvalue,Maximumvalue,Range4.2.2Percentilevalues4.2.3InterquartileRange4.2.4Variance4.2.5StandardDeviation4.2.6CoefficientofVariation

4.3VisualizationofQualitativeData4.3.1FrequencyTable4.3.2BarCharts4.3.3PieChart4.3.4ContingencyMatrix4.3.4StackedBarChartsandGroupedBarCharts

4.4VisualizationofQuantitativeData4.4.1FrequencyTable4.4.2.Histograms4.4.3FrequencyPolygon4.4.4Boxplot

Chapter5:ClassificationandPredictionSummaryPrerequisiteKnowledge

5.1Classification5.1.2DecisionTrees5.1.2.1Description5.1.2.2DecisionTreecreation–ID3Algorithm5.1.2.3DecisionTreecreation–GiniIndex

5.2Prediction

5.2.1DifferencebetweenClassificationandPrediction5.2.2LinearRegression5.2.2.1Description,DefinitionsandNotations5.2.2.2CostFunction5.2.2.3GradientDescentAlgorithm5.2.2.4GradientDescentinLinearRegression

5.2.2.5LearningParameter5.3Overfittingandregularization5.3.1Overfitting5.3.2ModelRegularization5.3.3LinearRegressionwithNormalization

Chapter6:ClusteringSummaryPrerequisiteKnowledge

CLUSTERING6.1UnsupervisedLearning6.2ConceptofCluster6.3k-meansalgorithm6.3.1AlgorithmDescription6.3.2RandomCentroidsInitialization6.3.3ChoosingthenumberofClusters6.3.4Applyingk-meansinR

6.4HierarchicalClusteringAlgorithms6.4.1DistanceMeasurementsBetweenClusters6.4.2AgglomerativeAlgorithms6.4.3DivisiveAlgorithms6.4.4ApplyingHierarchicalClusteringinR

6.5DBSCANAlgorithm6.5.1BasicConcepts

6.5.2AlgorithmDescription6.5.3AlgorithmComplexity6.5.4Advantages6.5.5Disadvantages

Chapter7:MiningofFrequentItemsetsandAssociationRulesSummaryPrerequisiteKnowledgeMiningofFrequentItemsetsandAssociationRules

7.1Introduction7.2TheoreticalBackground7.3AprioriAlgorithm7.4FrequentItemsetsTypes7.5PositiveandNegativeBorderofFrequentItemsets7.6AssociationRulesMining7.7AlternativeMethodsforLargeItemsetsgeneration7.7.1SamplingAlgorithm7.7.2PartitioningAlgorithm

7.8FP-GrowthAlgorithm7.9Arulespackage

Chapter 8: Computational Methods for Big Data Analysis (Hadoop andMapReduce)

SummaryPrerequisiteKnowledge

8.1Introduction8.2AdvantagesofHadoop’sDistributedFileSystem8.3HadoopUsers8.4HadoopArchitecture8.4.1HadoopDistributedFileSystem(HDFS)8.4.2HDFSArchitecture

8.4.3HDFS–LowPerformanceAreas8.4.3.1LowDataAccessTime8.4.3.2MultipleSmallFiles8.5.3.3MultipleDataRecordingNodes,ArbitraryFileModifications

8.4.4BasicHDFSConcepts8.4.4.1Blocks8.4.4.2NamenodesandDatanodes8.4.4.3HDFSFederation8.4.4.4HDFSHighAvailability

8.4.5DataFlow–DataReading8.4.6NetworkTopologyinHadoop8.4.7FileWriting8.4.8CopiesPlacement8.4.9ConsistencyModel

8.5TheHadoopClusterArchitecture8.6HadoopJavaAPI8.7ListsLoops&GenericClassesandMethods8.7.1GenericClassesandMethods8.7.2TheClassObject

OneLastThing...

CHAPTER1:INTRODUCTIONTODATAMINING

SUMMARY

TheaimofthischapteristointroduceDataMiningandKnowledgeDiscoveryinDatabases.First,somebasicconceptsaredefinedalongwiththereasonswhythis scientific fieldwas created and expanded rapidly.Second,wewill reviewsomepracticalexamplesandcounterexamplesofDataMining.Additionally,themost important fields onwhichDataMining is based are presented. Last,wewill findoutwhatRprogramming language is and its general philosophy.Wewillalsoshowhowtoinstallallnecessarytools,wewillpresenthowtoaddressto the R language manual for help, and define its basic types and functions,alongwiththeirfunctionality.

PREREQUISITEKNOWLEDGE

Nopreviousknowledgeisneededforthischapter.

INTRODUCTIONTODATAMINING

The evolution of technology helped internet expand lightning fast.Over time,internet access became accessible to more and more people. This led to thedevelopmentofmillionwebsitesandtheuseofdatabasesforstoringthesedata.The creationof the first commercial and socialwebpages created theneed forstoringandmanaginglargeamountofdata.

Today,theamountofavailabledataishugeandisgrowingexponentiallyeveryday.Theneedforminimizingthecostsforcollectingandstoringthesedatawasoneofthebiggestreasonsforthegrowthofthisscientificfield.

Thehugeamountofdatastoredindatabasesanddatawarehousescouldnotbeutilized as is. In order to get useful conclusions, some necessary actions arerequired inorder tostructure thedata.On thischapterwewillviewwhicharethefundamentalstagesinordertoextractvaluableandusableinformationfromdata.

.1DATASCIENCEDataScienceisanewterm,whichcametoreplaceformertermslikeKnowledgeDiscoveryinDatabaseorDataMining.

Boththreetermscanbeusedtodescribeasemiautomatedprocesswhosemainpurpose is toanalyzeahugevolumeofdataaboutaspecificproblemwith thepurposeofcreatingpatternsinscientificfieldslikeStatistics,MachineLearningandPatternRecognition.

Those patters, found in multiple forms like associations, anomalies, clusters,classes etc. constitute structures or instances, which appear in data and arestatisticallysignificant.

One of themost important aspects ofData Science has to dowith finding orrecognizing(recognizingmeansthatthepatterswherenotexpectedinadvance)andevaluatingthesepatterns.Apatternshouldshowsignsoforganizationacrossthis structure.Thesepatterns, alsoknownasmodels,mostof the timescanbetrackedwiththeuseofmeasurablefeaturesorattributeswhichareextractedbydata.

DataScience is anewscience,whichappearedat the endof1980and started

growinggradually.Duringthisera,RelationalDatabaseswereattheirzenithandserveddatastorageneedsforcompaniesandorganizationswith thepurposeofbetterorganizingandmanagingthem,sothatmassqueriesneededfortheirdaytodayoperationscouldbeaccomplishedfaster.

These Database Managing Systems (DBMS) followed the so called OLTP(OnLine Transaction Processing) model, with the purpose of processingtransactions.Thesetoolsallowedtheusertofindanswersinquestionshealreadykneworcreatesomereferences.

The need for better utilization of data created by these systems – the systemswhichhelped thedailyneedsof a company- led to thedevelopmentofOLAP(OnLineAnalyticalProcessing)typetools.WiththeseOLAPtoolsitwaseasierto answer more advanced queries, allowing bigger and MultidimensionalDatabases(MDB)toworkfasterandprovidedatavisualization.

OLAP tools could also be named as data exploration tools due to thevisualizationofthesedata.Thesetoolsallowedusers(salesmanagers,marketingmanagersetc.)torecognizenewpattersbutthisdiscoveryshouldbemadebytheuser.

Forexample,ausercouldperformqueriesaboutthetotalrevenuegeneratedbymultiple stores of a particular companywithin a country, in order to find thestoreswiththelowestrevenue.

AutomationofpatternrecognitionwascreatedthroughmethodologiesandtoolscreatedbythefieldofDataScience.Throughthesesolutions,patternrecognitionwas aided by the final goal. For example, if a userwanted a report about thestoreswiththelessrevenuegeneratedlastmonth,hecouldaskfromthesystemtofindvarioususefulinsightsaboutstoresrevenue.

Data Science growth came gradually and was directly associated to thecapability of collecting and listing huge amount of data, of different types,through the rapid expansion of fast web infrastructures on which commercialapplicationscouldrelyon.

OneofthefirstcompanieswhoembracedthisadvancementwasAmazon,whichstarted by selling books and other products and then created a user-friendlyrelated products recommendation system. This system was built and adjustedaccordingly based on user interactions, using a method called Collaborating

Filtering.ThissystembecamethefoundationofRecommenderSystems.

Theunconditionaldatagenerationina24-hourbasissupportsahugeamountofhuman activities like shopping cart data, medical records, social mediaannouncements,bankingandstockmarketoperationsandsoon.Thesedatahaveawidevarietyoftypes(images,videos,realtimedata,DNAsequencesetc.)anddifferentacquisitiontimes.Ifsomeofthesedataarenotanalyzedimmediately,itmight be difficult later to be stored and processed, creating this way a newscientificfieldknownasBigData.DataScience’sgoal is toaddress theneedscreated fromthisnewenvironmentandprovidesolutions for theescalatedandsufficientprocessofout-ofcoredata.

Methods and tools used for this purpose have already being developed likeHadoop,Map-Reduce,Hive,MongoDB,GraphPD.

ThetwomaingoalsofpracticalDataSciencearetocreatemodels,whichcanbeused both in predicting and describing data. Prediction is about using somevariablesorpartsof adatabase fromwhichwecouldestimateanunknownorfuturevalueofanotherattribute.Descriptionfocusesonfindingcomprehensivepatterswhich can describe data like finding clusters or groups of objectswithsimilarattributes.

1.2KNOWLEDGEDISCOVERYINDATABASES(KDD)KnowledgeDiscovery inDatabases consists of 5 steps. It’s about revealingorcreatingusefulknowledgethroughdataanalysis.Itreferstothewholeprocess,fromdatacollectiontoutilizingtheoutcomeinamorepracticallevel.ThebasicstagesofKnowledgeDiscoveryinDatabasesare:

1DataCollection2Preprocessing3Transformation4DataMining5Interpretation/Evaluation

The“KnowledgeDiscoveryinDatabases”termisoftenassociatedwiththeterm“DataMining”althoughitisjustoneofitssteps.ThebasicgoalofDataMining(DM)istheextractionofunknownandpossiblyvaluableinformationorpatternsfromdata. Someone could say that the term is used excessively since no dataextraction isperformed.On thecontrary,preprocessedand(possibly)modifieddataareusedforextractingusefulinformation,whichisthenusedforsolvingaproblem.BelowyouwillfindabriefdescriptionabouteachstageofKDD.

1.2.1DATACOLLECTION

The first step ofKDD is the collection and storage of data.Data collection isusuallyperformedautomaticallye.g.byusingsensorsornotautomaticallye.g.via a questionnaire. Malfunction in sensors or non-submitted answers on a

questionnaire could lead to incomplete data. These particular problemswhichmightoccurduringdatacollectionarefacedbythenextstep.

1.2.2PREPROCESSINGThesecondandmostimportantstepofKDDisthepreprocessing,withagoalofcleansing data: handling defective, false ormissing data. Preprocessingmightrequire up to 60%of the total effort since there is no reason to discuss aboutresultsifdataarenotcleanandintherightform.OnChapter3wewillexaminethroughout theprocesses fromwhichpreprocessingconsistsof andwheneachprocessshouldbeused.

1.2.3TRANSFORMATION

Data Transformation is the third step of KDD. Basically, this step is aboutconverting data under a common frame allowing us to edit them later. It ismostly used for smoothing data and removing noise, for data aggregation, fornormalization or for creating new features based on the existing ones. Specialformsoftransformationarediscretizationandcompression.

1.2.4DATAMINING

On this step of KDD an algorithm is used for model generation. Clean andtransformeddataarenowreadytobeusedbyanalgorithminordertocreateamodel, usually for categorization or prediction. We want to use this model,created by some already known data, to get an answer about the value of anattribute-variabletargetgoalfornew,unknowndata.

1.2.5INTERPRETATIONANDEVALUATION

ThelaststepofKDDisusedtointerpretandevaluatetheresults(notthemodel)producedbythewholeprocess.

1.3MODELTYPESThemodelsproducedbytheDataMiningstepfallunder twotypes:predictivemodelsanddescriptivemodels.

The goal of a predictive model is to predict values for a specific interestingfeaturewhichcouldprobablybebasedon thebehaviorofotherattributes.Forexample,predictioncouldbebasedondataacrossdifferentdaysofeachweek.

Adescriptivemodel findspattersor relationshidden insidedataandexaminestheirattributesinordertoprovideanexplanationabouttheirbehavior.

1.4EXAMPLESANDCOUNTEREXAMPLES

It’shardforsomepeopletounderstandanddistinguishwhatKDDandDMare.That’swhywewillhavealookatsomepracticalexamplesandcounterexamplesin order to make clear what DM is or isn’t. Some examples of DM are thefollowing:

After 9/11, Bill Clinton announced that after examining lots ofdatabases, FBI agents discovered that 5 of the perpetrators wereregisteredtothesedatabases.Oneofthemowned30creditcardswitha negative balance of $250.000 and lived in US for less than twoyears.Telecommunication companies not only reward clients who spendlotsofmoneybutalsoclientsnamedas“guides”.Theseguidesoftenconvince friends, relatives, coworkers and others to follow themwhentheychangeprovider.So,telecomcompaniesneedtofindtheseclients andmake them stick to them, proving higher discounts andmoreservicestothem.Byusingdata fromolder recorded temperaturesduring the summerseasonoftheprevious15years,wetrytopredictthetemperaturesforthesummerseasonofthenext15years.

Datamining isnot just the simpleprocessingofqueriesneither small scalestatisticalprograms.Somecounterpartsarethefollowing:

FindingaphonenumberfromaphonebookFindinginformationaboutParisontheinternetFindingtheaverageofexamsgradesSearching for themedical recordsof a patiencewith a particulardisease,inordertofurtheranalyzehismedicalrecord.

1.5CLASSIFICATIONOFDATAMININGMETHODS

Thereisawiderangeofdataminingmethods.Dependingonthedatatypesandthetypeofknowledgeextracted,theyareclassifiedindifferentcategories.Somebasic methods of Data Mining are presented below. At this point we shouldmentionthatinDataScienceeducationisaccomplishedbyusingdata,whereasinother formsof education a teacher or a specialist transfers knowledge fromonepersontoanother.

1.5.1CLASSIFICATION

This is apredictivemethod. Itsgoal is tocreateamodel–classifierbasedoncurrent data. Basically, it’s the knowledge of a function which represents anobject(usuallyrepresentedasavaluesvectorforitscharacteristicfeatures)inavalueof a categoricalvariable alsoknownas class.Learning, is abehaviorofintelligentsystemswhichisstudiedbyscientificfieldslikeMachineLearningorArtificial Intelligence. Due to this, all these fields study the same problems,withoutthismeaningthattherearenootherscopesstudiedindividuallybyeachscientificfield.

Classificationisoftenassociatedwithprediction.Inclassification, theoutcomewewanttopredictistheclassofthesamples.Aclasscanhavediscretevaluesfromafiniteset.Onthecontrary,duringpredictionwithmethodslikeregression,thevariable-goalcouldbeanyrealnumber.

1.5.2REGRESSION

Regression isasimilar toclassificationprocess,whosegoal is learningorelsetraining a function which represents an object in a real variable. It is also apredictivemethod.Byusingsomeindependentvariablesitsgoalistopredictthevaluesofadependentvariable.

Ontheaboveimagewecanseeanexampleoflinerregression.Thevariablesinthisexamplearethesquaremetersofahouseandthesellingpriceinthousandsofdollars.Linearregressionadaptsalineinthesamplesofthedataset,showninredX’s.Itiscreatedbasedonadistancefunctionorpricefunction,whosepricewewanttominimize.Byhavingtheoptimalline(thelinewhichminimizesthevalueofthepricefunction)wecanthenestimateprettyaccuratelyquestionslike:“Whichisthesellingpricefor150squaremetershouses?”.

1.5.3CLUSTERING

Clustering isadescriptivemethod.Givenadataset, thegoalofclustering is tocreateclusters(groupswiththesameorsimilarfeatures).Inclusteringthegoalistofindafinitenumberofclustersinordertodescribedata.

Onthebelowexamplewecanseetheresultfromclusteringmedicaldata.Threeclustersarecreatedbasedon“dosage”and“influenceduration”.

1.5.4EXTRACTIONANDASSOCIATIONANALYSIS

AssociationRulesMiningisconsideredoneofthemostimportantdataminingprocesses.Ithasattractedalotofattentionsinceassociationrulesprovideabriefwaytoexpressthepotentiallyuseful informationinaneasytounderstandwayforthefinalusers.Theseassociationrulesdiscoverhiddenrelationshipsbetweenfeaturesofadataset.TheseassociationsarepresentedinanABform,whereAandB are sets referring to the features of thewhole datawe analyze.AnABassociationrulepredictstheappearanceoffeaturesofsetBgiventhatfeaturesofsetAarepresent.

Aclassicexampleofassociationrulesinpracticehastodowiththeanalysisofashoppingcartinasupermarket,wheredatahavetodowithclientstransactions.Inthisscenario,sometransactionscouldbe{bread,milk},{bread,diapers,beer,eggs},{milk,diapers,beer,soda},{bread,milk,diapers,beer}and{bread,milk,diapers,soda}.Someassociationrulesonthesetransactionscouldbe{Diapers}{Beer},{beer,bread}{milk},{milk,bread}{eggs,soda}.Forexample,thelastrulerevealsthatit’squitepossiblethatwhoeverbuysmilkandbreadmightalsobuyeggsandsoda.

Extracting valuable conclusions through association rules, the marketingdepartment of the super market could place its products in shelves moreprofitably, create better marketing campaigns and efficiently manage itsresources.

1.5.5VISUALIZATION

Datavisualizationhelpsinbetterunderstandingnotonlythedatathemselvesbutalso correlations that might occur between them. In a next chapter we willdescribethewaysofvisualizationinR.Visualizationthoughcanonlyapplyinaspecificnumberofdimensions.Thismeansthatfordatasetswithlotsoffeaturesvisualizationis impossible.Alternatively,wecouldsettlewiththevisualizationof a smaller part of our dataset. In any case, visualizations should beaccompaniedby the respective statistical inspections in order to be sure abouttheaccuracyofthedisplayedcorrelations.

1.5.6ANOMALYDETECTION

Anomaly detection focuses in finding deviations in data according to similardata collected in thepast or by typical values of thesedata.Thebelow imageshowsanexamplewhereinredwecanseeanormalsamplelocatednearothersampleswithnormalvaluesandalsoananomaly,whosevaluediffersalotfromtheothervalues.

Someotherexamplesofanomalydetectionarethefollowing:

FrauddetectionbasedonauserprofileFindingdysfunctionalobjectsinindustrialproductionComputermonitoringinadatacenter

1.6APPLICATIONS

Datamining is used in a broad range of fields like e.g.Medicine, Finance orevenTelecommunications.Below,wewillexaminesomeapplicationsofDMinsomeofthesefields.

1.6.1MEDICINE

Over recent years, alongwith the growth ofMedicine fields like genetics andbiomedical,theuseofDMwashighlighted.Ingenetics,thegoalistounderstandandmap the relationshipbetween thealterationofhumanDNAsequencesandthe predisposition of a disease. DM is a tool which can help in diagnosisimprovement,inpreventionandconsequently,curediseases.

One of the main goals associated with DNA analysis is the comparison ofmultiple sequences and the search of similarities between DNA data. Thiscomparison is performed between gene sequences of healthy and harmfultissues,inordertofinddifferencesbetweenthem.

Visualization tools arequite important inbiomedicine aswell.These tools areabletopresentcomplexgeneformationsingraphsortreediagrams.

AnexampleofDMuseinMedicineisbyusingclassificationtodetectdiabeticretinopathy (DR). On this particular problem, data are high quality images ofpatientsretina.Theclassrangesfrom0to4.

OtherclassicexamplesofDMareepilepticseizurespredictionthroughmagnetic

MRIdataanalysisandtheclassificationofcancerinbenignormalignant.

Last,DMmethodsapplied in socialmedia likeTwitter, isknown that allowedthedetectionofrapidlyspreadingvirusesanddiseaseslikefluandHIV,raisingawareness about the incident much faster than the Public Health Agenciesallowingamorevalidpreventionofthedisease.

1.6.2FINANCE

AnotherfieldwheredataminingmethodsareusedisFinance.Financialdataaremostlycollectedbybanksorotherinstitutes.Theyusuallyarereliableandhigh-quality data. The contribution of data mining in Finance can be found in thecollection and understanding of data, in clearing and improving data and inmodel creation as well. Financial data analysis aims to facilitate decisionmaking,bytakingactionsaccordingtotheanalysismade.

Initiallydataarecollectedbyvariousfinancialinstitutionsandarestoredindatawarehouses. Multidimensional analysis methods are used for their analysis.Additionally,anotherimplementationofDataMininginfinancehastodowithprediction. For example, predicting if a client would be able to repay a loan,based on his previous transactions with the bank and his current financialsituation. By processing these data, a bank is able to create its originationpoliciesandminimizetheriskofprofitloss.

AnotherimplementationofDMinfinanceistheattempttopredictstockprices.Stockpricesareusuallymodeledastimeseriesandthegoalistopredictifthestockpricewillmoveupordown.Forexample, in theabove image,basedonthe stock prices of the previousmonths, we can predict its price for the nextfifteen days.With green colorwe can see the price of the predictionwhile inbluewecanseetheactualpriceofthisparticularstock.Thepredictionaccuracywasnotthathighbutthegeneraltrend(upward)waspredictedsuccessfully.

Anothermethodwhichcanbeusedisclustering.Clientsaregroupedinclustersbasedonsimilarfeatures.Aneffectiveclusteringhelpsbankstoidentifyagroupof clients or associate a new clientwith an existing group and offer solutionswhichsatisfiedtheclientsofthisparticulargroupinthepast.

Last, with DM techniques possible fraud or false data can be identified. Byanalyzingfinancialtransactionsandextractingsomepatterns,thedetectionofanunusualincidentcouldtriggeranidentitycheckoftherealclient.

1.6.3TELECOMMUNICATIONS

DMisusefulintelecommunicationsaswell.Telecomdatalikecalltype,calleranddialedlocation,timeandcalldurationcanbeusedinbetterservingnotjust

eachclientindividuallybutallclientsofthenetwork.DMmethodsareusedtobalancesystemloadanddatatraffic.

Datacanbeusedtocreateaclientprofile.Basedontheseprofilesclientscanbegroupedinclustersanddependingonthegroupofferspecial telecompackagesaccordingly.Additionally, interestingpatterscanbeidentifiedandthenanalyzethereasonsleadingtothesepatters.Forexample,identifythefactorsleadingtoahighercallfrequencyfromusersinspecifictimeintervals.

1.7CHALLENGES

Knowledgeextraction isaverypromisingscientific field.Like ineverysinglescientificfield,weneedtofacelotsofchallenges.Thesechallengesaremainlytechnicalandsocial-moral.

Thebiggertechnicalchallengeistheincreasingamountofdataalongwiththeirmultidimensional and complex character. Solutions should be scalable andflexiblemeaningthatDMmethodsandalgorithmsshouldbeabletohandlenotonlyhugevolumesofdatabutsmallerdatasetsaswell.DistributedsystemsareasolutionforhugevolumesofdatalikeHadoopandMapReducewhichwewillstudyinChapter8.

Social-moral challenges have to dowith finding the right areas to applyDM,sinceprivacyissuesmayrisefromprocessingdataduringDM.

1.8THERPROGRAMMINGLANGUAGEThisbookisamanualonhowtousevariousmethodsandalgorithmsinordertosolve real problems, so both theoretical and practical problems are provided.Many of these problems come with an indicative solution. The practicalproblems require the knowledge of a proper programming language throughwhich the reader should be able to respond accordingly. The programminglanguagewewilluseisR.

Ihopeyoufindthesolutionsprovidedinthisbooksuitablefortheproblemsyouencounter and apply them accordingly. Next, we will discuss shortly what iscoveredineachchapterofthisbook.

Chapter2isanintroductiontoR.Thisisthechapteryoudon’twanttoskipinthis book. We will show the different data types which R supports, controlstructures, how we create and call functions, how to use basic and usefulfunctionsandalsohowtofindhelpaboutfunctionsofanypackage.Thischapterisprerequisiteforallotherchapters.

OnChapter3wepresentdatatypes,alongwiththenecessaryactionsneededtopreprocessdatainordertoensuretheirdataqualityandthus,thequalityofourresults.Having familiaritywithRwewill thenpresenthow function from thedplyr and tidyr packages are used, for preprocessing data faster and moreefficiently.

OnChapter 4 we present summary statistics and ways of visualization. Wedescribetermslikemeasuresofposition(average,midline),variance(variance,standard deviation) and association and the functions which create thesemeasuresinR.Next,wewillpresentsomewaysofvisualizingqualitativedatalikehistograms,barchartsandpiechartswithR.

OnChapter5westudytheconceptsofclassificationandprediction.Inrespectto classification we will present decision trees throughout. In respect topredictionwewillexaminethemethodoflinearregression.

OnChapter6wewillpresentmethodsofdataclustering.Oncewedefinewhatsupervisedlearningandclusteris,wewillexaminethreecategoriesofclusteringmethods: partitional clustering, hierarchical clustering, and density-basedclustering.Wewill next view some specific clustering algorithms, like the k-means algorithm, the agglomerative hierarchical algorithm and the DBSCANalgorithm.

On Chapter 7 we describe mining association rules from transactionaldatabases. After defining some basic terms, we will then present the Apriorialgorithm for finding frequent itemsets. We will then give an example ofassociationrulesminingfromfrequentitemsets.Last,wewillexaminethearulespackage.With the help of this package everything described in this chapter isalreadycreatedinR.

On Chapter 8 we will examine computational methods for analyzing hugevolumes of data. More specifically, we will focus on the Hadoop andMapReducetools,andhowtheycanworkalongtosolveproblems.

1.9BASICCONCEPTS,DEFINITIONSANDNOTATIONS

Wewillnowdefinethemostbasicconcepts,definitionsandsymbolswewillusein the next chapters of this book. We initially have our data, also known asdatasets. These datasets are structured in rows and columns. Each columncorresponds to a specific feature or variable of thedataset.Wewill use themsymboltorepresentthenumberoffeaturesinaset.Same,eachrowcorrespondstoadatasetsample.Wewillusethensymboltorepresentthenumberofrowsinaset.Eachsamplecontainsmvalues,oneforeachfeature.

Wewillusethewholedatasettocreateamodel.Thismodelwillbeusedlateronfor the correspondence of new samples in a predefined groupof categories orclasses (classification) or for predicting the values of an important feature,knownas targetvariable (prediction).Anyhow, themodel shouldbevalidated.For thedatasetswecurrentlyhave, the classorvalues for the target value areknown.

Inordertomakeanaccuratevalidationofthemodelwesplitourdatasetintwosubsets:thetrainingsetandthetestset.Usuallythetrainingsetisabout2/3oftheinitialdatasetwhereasthetestsetistheremaining1/3.Wewillseethatthereareotherwaysofsplittingdatasets.

During the trainingphase, themodel is createdbasedonanalgorithmand thetrainingset.Modelvalidation isnecessarybeforeusing it and isaccomplishedbyusingthetestset.Basically,weusethemodelondata,forwhichwealreadyknowtheclassorthevalueofthetargetgoal,sothatwecancompareitwiththeonegivenbythemodel.

1.10TOOLINSTALLATIONThebasictoolwewilluseistheRprogramminglanguage.YoucanfindtheRconsolealongwiththebasicreadymadepackagesforfreeintheofficialwebsite(https://www.r-project.org).

After visiting the official webpage clink on “downloadR” and then choose alocationclosetoyou.Thenyouwillneedtoselecttheoperationsystemyouuse.

BychoosingWindowswewillseethebelowscreen:

Herewewillclickon“installRforthefirsttime”.

Weclickthedownloadlink.Wethenjustruntheinstallationfileandfollowtheinstructions.BelowwecanseetheRconsole.

https://www.r-project.org

RStudio is a free development environment of R. The installation file can bedownloadedfromthislink:https://www.rstudio.com/products/rstudio/download.Onceagainchoosetherightversionaccordingtoyouroperatingsystem.

RStudiohomescreenisshowninthebelowpicture.Onthefirstframewecanseethecodeoftheopenfiles.Everytabisadifferentsourcecode.Onthesecondframevariables and functions are shown.On the third frame, in thePlots tab,graphsareprintandwecanviewthepackageswehavedownloadedorneedtobe updated through the Packages tab. Additionally, through the “Help” tab(frame 3) we can find information and help about a function or a package.Finally,onframe4wecanfindtheclassicRconsole.

https://www.rstudio.com/products/rstudio/download

DATA - iedu.us€¦ · Introduction to Data Mining 1.1 Data Science 1.2 Knowledge Discovery in...

Documents

Transcript of DATA - iedu.us€¦ · Introduction to Data Mining 1.1 Data Science 1.2 Knowledge Discovery in...