Pre-processing big data
Abstract
• Data in the real world is almost always dirty, incomplete, scattered or inconsistent. For data scientists, ‘janitor work’ is a key hurdle to data insights.
• Whether you use big data for analytics or data science, with increasing variety and velocity of big data, the data pre-processing step can be the most time-consuming step in your data pipeline.
• With feature engineering concepts and practical examples in Python and R, this webinar will focus on technical considerations and data engineering techniques to optimize data preparation to get the most value from your big data pipeline.
Speaker profile
Maloy Manna
Engineering, Data Innovation Lab
• Building data-driven products and services for over 15 years
• Worked at: insurance leader AXA, information leader Thomson Reuters, data science startup Saama, consulting firms Infosys & TCS
linkedin.com/in/maloy @itsmaloy biguru.wordpress.com
Agenda
• Data engineering patterns
• Data preparation | pre-processing
• Exploratory data analysis
• Data cleaning techniques
• Data reduction techniques
• Data transformation
• Data integration
Data strategy
• Start with the business question(s)
• Define goals… and metrics
• Initial hypothesis… data needs
• Experiment… gain insight
• Take actions… refine
• Prioritize… build roadmap
Data preparation
Why pre-process data?
• Errors in data collection
• Measurement error
• Human errors
• Naming conventions
• Duplicate records
• Incomplete data
• Inconsistent data
• “Noise” in data
Data preparation
• Data acquisition
• Data preparation
• Data integration
• Data transformation
• Data cleaning
• Data reduction
• Key factor in model quality
• Insights based on “trusted” data
Data pre-processing
Data preparation
• Exploratory data analysis
• Data cleaning
• Data reduction
• Data transformation
• Data integration
• Key factor in model quality
• Insights based on “trusted” data
Exploratory data analysis
Goodness of fit
• R-squared [explained variation / total variation]
  • Not sufficient until residual plots are examined for bias
• Adjusted R-squared
  • Adjusts for number of explanatory variables in a model relative to number of data points
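The two goodness-of-fit measures above can be sketched in plain Python. This is a minimal illustration, not part of the original slides: the observations and predictions are made-up values, and the function names are hypothetical. It uses the common form R² = 1 − SS_res / SS_tot, which matches "explained variation / total variation" for a least-squares fit with an intercept.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """R-squared computed as 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_points, n_predictors):
    """Penalize R-squared for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


def residuals(y_true, y_pred):
    """Residuals to plot against fitted values when checking for bias."""
    return [y - p for y, p in zip(y_true, y_pred)]


# Hypothetical observations and model predictions (one explanatory variable).
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_points=len(y_true), n_predictors=1)
res = residuals(y_true, y_pred)
```

Adjusted R-squared is never larger than R-squared, and the gap widens as predictors are added without a matching gain in fit. A high value of either still says nothing about bias, which is why the residuals are returned for plotting.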
Data cleaning
• Reformat data values or layout
• Standardize data [common units]
• Correct erroneous values
• Fill in / exclude missing values
• Validating data (e.g. dates/addresses)
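As a rough illustration of two of the cleaning steps above (standardizing to a common unit and validating dates), here is a small standard-library sketch. The record layout, field names and the `to_kg` / `parse_date` helpers are assumptions made for the example, not something from the talk.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # one pound in kilograms


def to_kg(value, unit):
    """Standardize a weight to kilograms; unit text is trimmed and lower-cased first."""
    unit = unit.strip().lower()
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return float(value) * LB_TO_KG
    raise ValueError(f"unsupported unit: {unit!r}")


def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date if the text is a real calendar date in the given format, else None."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


# Hypothetical raw records with mixed units and one impossible date.
raw = [
    {"weight": "70", "unit": "kg", "signup": "2016-02-29"},
    {"weight": "154", "unit": " LB", "signup": "2016-02-30"},
]

cleaned = [
    {
        "weight_kg": round(to_kg(r["weight"], r["unit"]), 2),
        "signup": parse_date(r["signup"]),
    }
    for r in raw
]
```

Returning `None` for an invalid date, rather than raising, keeps the record in the pipeline so that a later missing-data step can decide whether to drop or fill it.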
Data cleaning
Handling missing data [tactics]
• Ignore records with missing data
• Fill in values (if known/available)
• Use global constant e.g. NULL, unknown
• Use attribute value mean
• Infer most probable value
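The tactics listed above might look like this on a small list of records. The data is invented for illustration, and using the most frequent value as the "most probable value" is a deliberate simplification; a real pipeline could infer it from a model instead.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "city": "Paris"},
    {"age": None, "city": "Lyon"},
    {"age": 40, "city": None},
    {"age": 28, "city": "Paris"},
]

# Tactic: ignore records with any missing value.
complete = [r for r in records if all(v is not None for v in r.values())]

# Tactic: fill a categorical attribute with a global constant.
with_constant = [
    {**r, "city": r["city"] if r["city"] is not None else "unknown"}
    for r in records
]

# Tactic: fill a numeric attribute with the attribute mean.
age_mean = mean(r["age"] for r in records if r["age"] is not None)
with_mean = [
    {**r, "age": r["age"] if r["age"] is not None else age_mean}
    for r in records
]

# Tactic: infer the most probable value (simplified here to the most frequent one).
city_mode = mode(r["city"] for r in records if r["city"] is not None)
with_mode = [
    {**r, "city": r["city"] if r["city"] is not None else city_mode}
    for r in records
]
```

Each tactic builds a new list and leaves `records` untouched, which makes it easy to compare how the choice of tactic changes downstream results.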
Data cleaning
Smoothing noise
• Regression
• Using class intervals or “binning”
• Clustering and removing outliers
• K-means clustering [n observations, k clusters]
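One common form of the binning tactic above is smoothing by bin means: sort the values, split them into equal-frequency bins, and replace each value with the mean of its bin. The sketch below assumes that variant; the slides do not say which binning scheme was intended, and the sample values are made up.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into bins of bin_size, and replace each value by its bin mean.

    The result is returned in sorted order, not in the original input order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
# smoothed == [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

A smaller `bin_size` preserves more detail and removes less noise; a larger one does the opposite.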
Data reduction
Dimensionality reduction | Why reduce?
• Too many variables
• Multi-collinearity [highly correlated multiple predictor variables]
• Less computation
• Reduces noise, improves model performance
• Compress data, reduce storage
Data reduction
• Dimensionality reduction
• Numerization [non-numeric attributes to numeric]
  • Useful for SVM [support vector machine] and neural networks
• Categorization [non-categorical attributes to categorical]
  • e.g. dummy variable (binary states)
  • Useful for Naive Bayes and Bayesian networks
• Feature extraction e.g. PCA [Principal Component Analysis]
• Feature reduction
  • Uses removal of low/almost-zero variance and highly correlated variables
  • Reduces computation costs
  • Improves model interpretability
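To make the ideas above concrete, here is a small plain-Python sketch of numerization (indicator columns for a text attribute), categorization (a numeric attribute turned into a two-state label) and a near-zero-variance filter. All names, thresholds and sample values are assumptions for illustration; the correlation-based part of feature reduction is left out to keep the example short.

```python
from statistics import pvariance


def numerize(values):
    """Numerization: one 0/1 indicator column per distinct non-numeric value."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}


def categorize(values, threshold, low_label, high_label):
    """Categorization: map a numeric attribute to one of two states."""
    return [high_label if v >= threshold else low_label for v in values]


def drop_low_variance(columns, threshold=1e-8):
    """Feature reduction: keep only columns whose population variance exceeds threshold."""
    return {name: col for name, col in columns.items() if pvariance(col) > threshold}


# Hypothetical attributes.
colors = ["red", "green", "red", "blue"]
ages = [15, 42, 17, 30]
features = {
    "height": [1.62, 1.75, 1.80, 1.68],
    "constant_flag": [1, 1, 1, 1],  # carries no information
}

indicators = numerize(colors)
age_groups = categorize(ages, threshold=18, low_label="minor", high_label="adult")
kept = drop_low_variance(features)
```

Here `constant_flag` is dropped because its variance is zero, while `height` is kept.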
Data reduction
• PCA: Principal Component Analysis
  • Goal is to reduce a d-dimensional dataset into a k-dimensional subspace (where k <= d) to increase computational efficiency
  • In essence, the original variables are reduced to a new set of variables, each a linear combination of the originals, called principal components
  • Data needs to be standardized beforehand
  • scikit-learn provides an implementation
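The steps above (standardize, then project onto the directions of greatest variance) can be sketched without any third-party library. The example below is a teaching sketch only: it finds just the first principal component (k = 1) using power iteration, which is a swapped-in technique chosen for brevity, and the 2-D sample data is made up. In practice a library implementation such as the one in scikit-learn would normally be used.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Center each column to mean 0 and scale it to unit (population) standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def covariance_matrix(rows):
    """Covariance matrix of already-centered rows (population form, divide by n)."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    The asymmetric start vector avoids the unlucky case of starting exactly
    orthogonal to the leading eigenvector for typical inputs.
    """
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D dataset (d = 2) with strongly correlated columns.
data = [
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
]

z = standardize(data)
cov = covariance_matrix(z)
pc1 = first_principal_component(cov)

# Project each standardized row onto the component: d = 2 reduced to k = 1.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in z]

# Share of total variance captured by the first component.
eigenvalue = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(2) for j in range(2))
explained_ratio = eigenvalue / sum(cov[i][i] for i in range(2))
```

Because both columns are standardized and positively correlated, the first component weights them equally, and a single score per row retains most of the variance in the original two columns.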
Data transformation
• Modify data to a form suitable for analysis and modeling
• Standard transformation functions/needs:
  • Reshape data (sort, append | feature generation, pivot)
  • Join data (union, intersection, join, match)
  • Subset data (filter, drop, distinct)
  • Aggregate (group, windowing)
  • Mathematical operations
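A few of the transformation families above (subset, join, aggregate, sort) can be chained on plain Python records as shown below. The tables, field names and the 30-unit threshold are invented for the example; dataframe libraries or Spark offer the same operations at scale.

```python
from collections import defaultdict

# Hypothetical input tables.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.5},
    {"order_id": 3, "customer_id": "c1", "amount": 60.0},
    {"order_id": 4, "customer_id": "c3", "amount": 15.0},
]
customer_country = {"c1": "FR", "c2": "US"}  # customer_id -> country

# Subset: keep orders of at least 30.
large = [o for o in orders if o["amount"] >= 30]

# Join (inner): attach the country, dropping orders with no matching customer.
joined = [
    {**o, "country": customer_country[o["customer_id"]]}
    for o in large
    if o["customer_id"] in customer_country
]

# Aggregate: total amount per country.
totals = defaultdict(float)
for row in joined:
    totals[row["country"]] += row["amount"]

# Reshape: sort countries by total, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Filtering before joining keeps the intermediate data small, which matters more as volumes grow.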
Data integration
• Data pipeline level | Individual dataset level
• Standardizing schema
• Metadata management crucial
• Automation is key
• Tools automate several machine learning tasks
• Deduplication, online entity resolution, data enrichment/geocoding
• Reference metadata catalog, tagging and search
• REST API microservices for integration with analytics
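Deduplication, mentioned above, can be illustrated with a very simple rule-based sketch: normalize the identifying fields and keep the first record per normalized key. This is far cruder than the machine-learning entity resolution that integration tools provide; the normalization rules and sample records are assumptions made for the example.

```python
def normalize(text):
    """Lower-case, strip basic punctuation and collapse repeated whitespace."""
    return " ".join(text.lower().replace(".", "").replace(",", "").split())


def deduplicate(records, key_fields):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for record in records:
        key = tuple(normalize(record[f]) for f in key_fields)
        seen.setdefault(key, record)
    return list(seen.values())


# Hypothetical records from two sources describing the same company.
records = [
    {"name": "Acme Corp.", "city": "Paris"},
    {"name": "acme  corp", "city": "PARIS"},
    {"name": "Globex", "city": "Lyon"},
]

unique = deduplicate(records, key_fields=["name", "city"])
```

The first two records collapse into one because they share the same normalized name and city.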
References & further reading
• PCA (Principal Component Analysis): https://en.wikipedia.org/wiki/Principal_component_analysis
• Data science lifecycle: http://www.datasciencecentral.com/profiles/blogs/the-data-science-project-lifecycle
• R-squared concepts: http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
• Dimensionality reduction: https://en.wikipedia.org/wiki/Dimensionality_reduction
• R reshape2 package reference: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf
• Spark transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations
• The totally managed analytics pipeline: https://segment.com/blog/the-totally-managed-analytics-pipeline/