Data Mining: An Overview

  • Data Mining: An Overview
    Ayhan Demiriz
    Adapted from Chris Clifton's course page

  • What do data mean?
    Some examples?
    Who collects data?
    Need? Required?
    For how long?
    Privacy?
    Storage?

  • Then What Is Data Mining?
    Data mining (knowledge discovery from data)
        Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
        Data mining: a misnomer?
    Alternative names
        Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
    Watch out: Is everything data mining?
        (Deductive) query processing
        Expert systems or small ML/statistical programs

  • Why Data Mining? Potential Applications
    Data analysis and decision support
        Market analysis and management
            Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
        Risk analysis and management
            Forecasting, customer retention, improved underwriting, quality control, competitive analysis
        Fraud detection and detection of unusual patterns (outliers)
    Other applications
        Text mining (newsgroups, email, documents) and Web mining
        Stream data mining
        DNA and bio-data analysis

  • Data Mining: What's in a Name?
    Data Mining, Knowledge Mining, Knowledge Discovery in Databases, Data Archaeology, Data Dredging, Database Mining, Knowledge Extraction, Data Pattern Processing, Information Harvesting, Siftware
    The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques

  • Integration of Multiple Technologies
    [Diagram labels: Data Mining, Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, Algorithms]

  • Data Mining: Confluence of Multiple Disciplines
    [Diagram labels: Data Mining, Database Systems, Statistics, Machine Learning, Algorithm, Visualization, Other Disciplines]

  • Data Mining: Classification Schemes
    General functionality
        Descriptive data mining
        Predictive data mining
    Different views, different classifications
        Kinds of data to be mined
        Kinds of knowledge to be discovered
        Kinds of techniques utilized
        Kinds of applications adapted

  • Knowledge Discovery in Databases: Process
    Adapted from: U. Fayyad et al. (1995), From Knowledge Discovery to Data Mining: An Overview, Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
    [Figure: KDD process diagram; surviving label: Knowledge]

  • Multi-Dimensional View of Data Mining
    Data to be mined
        Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
    Knowledge to be mined
        Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
        Multiple/integrated functions and mining at multiple levels
    Techniques utilized
        Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
    Applications adapted
        Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

  • Ingredients of an Effective KDD Process
    [Diagram labels: Background Knowledge, Goals for Learning, Knowledge Base, Database(s), Plan for Learning, Discover Knowledge, Determine Knowledge Relevancy, Evolve Knowledge/Data, Generate and Test Hypotheses, Visualization and Human Computer Interaction, Discovery Algorithms]

  • Data Mining: History of the Field
    Knowledge Discovery in Databases workshops started in 1989
        Now a conference under the auspices of ACM SIGKDD
        IEEE conference series started in 2001
    Key founders / technology contributors:
        Usama Fayyad, JPL (then Microsoft, now has his own company, Digimine)
        Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
        Rakesh Agrawal (IBM Research)
    The term "data mining" has been around since at least 1983 as a pejorative term in the statistics community

  • Market Analysis and Management
    Where does the data come from?
        Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
    Target marketing
        Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
        Determine customer purchasing patterns over time
    Cross-market analysis
        Associations/co-relations between product sales, and prediction based on such associations
    Customer profiling
        What types of customers buy what products (clustering or classification)
    Customer requirement analysis
        Identifying the best products for different customers
        Predicting what factors will attract new customers
    Provision of summary information
        Multidimensional summary reports
        Statistical summary information (data central tendency and variation)

  • Corporate Analysis & Risk Management
    Finance planning and asset evaluation
        Cash flow analysis and prediction
        Contingent claim analysis to evaluate assets
        Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
    Resource planning
        Summarize and compare the resources and spending
    Competition
        Monitor competitors and market directions
        Group customers into classes and a class-based pricing procedure
        Set pricing strategy in a highly competitive market

  • Fraud Detection & Mining Unusual Patterns
    Approaches: Clustering & model construction for frauds, outlier analysis
    Applications: Health care, retail, credit card service, telecomm.
        Auto insurance: ring of collisions
        Money laundering: suspicious monetary transactions
        Medical insurance
            Professional patients, ring of doctors, and ring of references
            Unnecessary or correlated screening tests
        Telecommunications: phone-call fraud
            Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
        Retail industry
            Analysts estimate that 38% of retail shrink is due to dishonest employees
        Anti-terrorism

  • Other Applications
    Sports
        IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
    Astronomy
        JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
    Internet Web Surf-Aid
        IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

  • Example: Correlating communication needs and events
    Goal: Avoid overload of communication facilities
    Information source: Historical event data and communication traffic reports
    Sample question: What do we expect our peak communication demands to be in Bosnia?

    Date       Peak comm. (bps)
    1/24/89    1.2M
    1/25/89    1.8M

    Date       Temp    Visibility
    1/24/89    105     25mi
    1/25/89    97      0.5mi
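A first step toward answering the sample question is to line up the two sources by date. The sketch below is only an illustration: it copies the four sample rows from the tables above, and the function name `join_by_date` and the field names are hypothetical. Two days of data are far too few to support any conclusion about peak demand.

```python
# Rows copied from the two sample tables on the slide.
traffic = {"1/24/89": 1.2e6, "1/25/89": 1.8e6}  # peak traffic in bits per second
conditions = {
    "1/24/89": {"temp": 105, "visibility_mi": 25.0},
    "1/25/89": {"temp": 97, "visibility_mi": 0.5},
}


def join_by_date(traffic, conditions):
    """Pair each day's conditions with that day's peak traffic."""
    shared_dates = sorted(traffic.keys() & conditions.keys())
    return [
        {"date": d, "peak_bps": traffic[d], **conditions[d]}
        for d in shared_dates
    ]


rows = join_by_date(traffic, conditions)
for row in rows:
    print(row)
```

Once the rows are joined, any of the techniques on the later slides (classification, sequential association, deviation detection) could in principle be applied to the combined records.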

  • Data Mining Ideas: Logistics
    Delivery delays
        Debatable what data mining will do here; best match would be related to quality analysis: given lots of data about deliveries, try to find common threads in problem deliveries
    Predicting item needs
        Seasonal
            Looking for cycles, related to similarity search in time series data
            Look for similar cycles between products, even if not repeated
        Event-related
            Sequential association between event and product order (probably weak)

  • One Vision for Data Mining
    [Diagram labels: KDD Process; Intel Analyst; Geospatial Imagery; FBIS databases; OIT databases; OIA databases; Intelink data sources; Internet data sources; Middleware; Mediator/Broker; Data Mining; source environments; Visualization; Discovered Knowledge]
    Analyst's question: "Are there any other interesting relationships I should know about?"

  • What Can Data Mining Do?
    Cluster
    Classify
        Categorical, Regression
    Summarize
        Summary statistics, Summary rules
    Link Analysis / Model Dependencies
        Association rules
    Sequence analysis
        Time-series analysis, Sequential associations
    Detect Deviations

  • Data Mining Functionalities
    Concept description: Characterization and discrimination
        Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
    Association (correlation and causality)
        Diaper → Beer [0.5%, 75%] (support, confidence)
    Classification and prediction
        Construct models (functions) that describe and distinguish classes or concepts for future prediction
            E.g., classify countries based on climate, or classify cars based on gas mileage
        Presentation: decision-tree, classification rule, neural network
        Predict some unknown or missing numerical values
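The bracketed pair on the Diaper/Beer rule can be read as support and confidence. Support is the fraction of all transactions that contain both items, and confidence is the fraction of diaper transactions that also contain beer. A minimal sketch of the two measures follows. The basket data is hypothetical, invented for illustration, and so does not reproduce the slide's 0.5% support figure.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing every antecedent item.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing every antecedent and every consequent item.
    n_both = sum(
        1 for t in transactions if (antecedent | consequent) <= set(t)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical baskets, not taken from the slides.
baskets = [
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"milk"},
    {"bread"},
    {"bread", "milk"},
    {"beer"},
]

s, c = support_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={s:.3f} confidence={c:.2f}")
```

In this toy data three of eight baskets hold both items and three of the four diaper baskets also hold beer, so the rule has support 0.375 and confidence 0.75.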

  • Data Mining Functionalities (2)
    Cluster analysis
        Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
        Maximizing intra-class similarity & minimizing interclass similarity
    Outlier analysis
        Outlier: a data object that does not comply with the general behavior of the data
        Noise or exception? No! Useful in fraud detection, rare events analysis
    Trend and evolution analysis
        Trend and deviation: regression analysis
        Sequential pattern mining, periodicity analysis
        Similarity-based analysis
    Other pattern-directed or statistical analyses

  • Types of Data Mining Output
    Data dependency analysis - identifying potentially interesting dependencies or relationships among data items
    Classification - grouping records into meaningful subclasses or clusters
    Deviation detection - discovery of significant differences between an observation and some reference; potentially correct the data
        Anomalous instances, outliers
        Classes with average values significantly different from parent or sibling class
        Changes in value from one time period to another
        Discrepancies between observed and expected values
    Concept description - developing an abstract description of members of a population
        Characteristic descriptions - patterns in the data that best describe or summarize a class
        Discriminating descriptions - describe how classes differ

  • Clustering
    Find groups of similar data items
    Statistical techniques require some definition of distance (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions
    Uses:
        Demographic analysis
    Technologies:
        Self-Organizing Maps
        Probability Densities
        Conceptual Clustering
    Group people with similar travel profiles
        Clusters: {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}
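To illustrate what "some definition of distance" means for a statistical technique, the sketch below groups people whose profiles lie within a chosen Euclidean distance of each other. This is single-linkage grouping with a threshold, a deliberately simple stand-in and not one of the technologies listed on the slide. The names come from the slide, but the profile values (trips per year, average trip length in days) and the threshold are invented.

```python
import math


def cluster_by_threshold(points, max_dist):
    """Single-linkage grouping: items within max_dist are linked into one cluster."""
    names = list(points)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(points[a], points[b]) <= max_dist:
                parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 3), "Patricia": (3, 3),
    "Jeff": (12, 2), "Evelyn": (11, 1), "Chris": (13, 2),
    "Rob": (30, 14),
}

clusters = cluster_by_threshold(profiles, max_dist=3.0)
print(clusters)
```

With these made-up values the grouping matches the three clusters on the slide. A different distance definition or threshold would give different groups, which is the design decision the slide points at.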

  • Classification
    Find ways to separate data items into pre-defined groups
        We know X and Y belong together, find other things in same group
    Requires training data: Data items where group is known
    Uses:
        Profiling
    Technologies:
        Generate decision trees (results are human understandable)
        Neural Nets
    Route documents to most likely interested parties
        English or non-English?
        Domestic or foreign?
    [Diagram labels: Training Data, tool produces, classifier, Groups]
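The idea of learning a separator from training data can be sketched with a decision stump, the smallest possible decision tree: one threshold on one numeric feature. Everything specific here is an assumption made for illustration. That includes the feature (the fraction of a document's words found in a short list of common English words), the training values, and the function names.

```python
def train_stump(examples):
    """Learn a threshold on one numeric feature from labeled examples.

    examples: list of (feature_value, label) pairs, label True or False.
    Returns a threshold t; the stump predicts True when value >= t.
    """
    values = sorted({v for v, _ in examples})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    # Candidate thresholds: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # Number of training examples the threshold t misclassifies.
        return sum((v >= t) != label for v, label in examples)

    return min(candidates, key=errors)


def classify(value, threshold):
    """Apply the learned stump to a new item."""
    return value >= threshold


# Hypothetical training data: (fraction of common English words, is_english).
training = [
    (0.42, True), (0.38, True), (0.35, True),
    (0.05, False), (0.02, False), (0.10, False),
]

threshold = train_stump(training)
print(threshold, classify(0.40, threshold), classify(0.03, threshold))
```

A real decision-tree learner would consider many features and split recursively. The stump keeps only the core step of choosing the split that best separates the known groups.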

  • Association Rules
    Identify dependencies in the data:
        X makes Y likely
    Indicate significance of each dependency
    Bayesian methods
    Uses:
        Targeted marketing
    Technologies:
        AIS, SETM, Hugin, TETRAD II
    Find groups of items commonly purchased together
        People who purchase fish are extraordinarily likely to purchase wine
        People who purchase Turkey are extraordinarily likely to purchase cranberries

    Date/Time/Register    Fish    Turkey    Cranberries    Coke
    12/6 13:15 2          N       Y         Y              Y
    12/6 13:16 3          Y       N         N              Y

  • Sequential Associations
    Find event sequences that are unusually likely
    Requires training event list, known interesting events
    Must be robust in the face of additional noise events
    Uses:
        Failure analysis and prediction
    Technologies:
        Dynamic programming (Dynamic time warping)
        Custom algorithms
    Find common sequences of warnings/faults within 10 minute periods
        Warn 2 on Switch C preceded by Fault 21 on Switch B
        Fault 17 on any switch preceded by Warn 2 on any switch

    Time     Switch    Event
    21:10    B         Fault 21
    21:11    A         Warn 2
    21:13    C         Warn 2
    21:20    A         Fault 17
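The "within 10 minute periods" search can be sketched as counting ordered pairs of events whose timestamps are at most ten minutes apart. The log below is copied from the table on the slide. The function names are hypothetical, and the code assumes all events fall on the same day (no wrap past midnight). It only counts co-occurring pairs. Deciding which pairs are "unusually likely" would need a much longer log and a baseline, and this is not the dynamic-time-warping approach the slide names.

```python
from collections import Counter


def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def pairs_within_window(events, window_min=10):
    """Count ordered (earlier, later) event pairs at most window_min apart."""
    events = sorted(events, key=lambda e: minutes(e[0]))
    counts = Counter()
    for i, (t1, switch1, event1) in enumerate(events):
        for t2, switch2, event2 in events[i + 1:]:
            if minutes(t2) - minutes(t1) > window_min:
                break  # events are time-sorted, so later ones are further away
            counts[((switch1, event1), (switch2, event2))] += 1
    return counts


# Event log copied from the slide's table: (time, switch, event).
log = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]

counts = pairs_within_window(log)
for (earlier, later), n in counts.items():
    print(earlier, "->", later, n)
```

Among the pairs found are Fault 21 on Switch B followed by Warn 2 on Switch C, and Warn 2 followed by Fault 17, the two patterns listed on the slide.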

  • Deviation Detection
    Find unexpected values, outliers
    Uses:
        Failure analysis
        Anomaly discovery for analysis
    Technologies:
        Clustering/classification methods
        Statistical techniques
        Visualization
    Find unusual occurrences in IBM stock prices

    Date        Close     Volume    Spread
    58/07/02    369.50    314.08    .022561
    58/07/03    369.25    313.87    .022561
    58/07/04    Market Closed
    58/07/07    370.00    314.50    .022561

    Sample date    Event              Occurrences
    58/07/04       Market closed      317 times
    59/01/06       2.5% dividend      2 times
    59/04/04       50% stock split    7 times
    73/10/09       not traded         1 time
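One of the "statistical techniques" for finding unexpected values can be sketched as a robust outlier test based on the median and the median absolute deviation (MAD). The rule used here, a modified z-score with the constant 0.6745 and a cutoff of 3.5, is a common convention chosen for this sketch and is not something the slides specify. The first three closing prices are from the table above, and the remaining three are invented to give the test something to flag.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# First three closes are from the slide; the last three are hypothetical.
closes = [369.50, 369.25, 370.00, 369.75, 355.00, 370.25]

outliers = flag_outliers(closes)
print(outliers)
```

The median-based score is used instead of the mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself.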

  • Necessity for Data Mining
    Large amounts of current and historical data being stored
        Only small portion (~5-10%) of collected data is analyzed
        Data that may never be analyzed is collected in the fear that something that may prove important will be missed
    As databases grow larger, decision-making from the data is not possible; need knowledge derived from the stored data
    Data sources
        Health-related services, e.g., benefits, medical analyses
        Commercial, e.g., marketing and sales
        Financial
        Scientific, e.g., NASA, Genome
        DOD and Intelligence
    Desired analyses
        Support for planning (historical supply and demand trends)
        Yield management (scanning airline seat reservation data to maximize yield per seat)
        System performance (detect abnormal behavior in a system)
        Mature database analysis (clean up the data sources)

  • Necessity Is the Mother of Invention
    Data explosion problem
        Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
    We are drowning in data, but starving for knowledge!
    Solution: Data warehousing and data mining
        Data warehousing and on-line analytical processing
        Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

  • Are All the Discovered Patterns Interesting?
    Data mining may generate thousands of patterns: Not all of them are interesting
        Suggested approach: Human-centered, query-based, focused mining
    Interestingness measures
        A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
    Objective vs. subjective interestingness measures
        Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
        Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

  • Can We Find All and Only Interesting Patterns?
    Find all the interesting patterns: Completeness
        Can a data mining system find all the interesting patterns?
        Heuristic vs. exhaustive search
        Association vs. classification vs. clustering
    Search for only interesting patterns: An optimization problem
        Can a data mining system find only the interesting patterns?
        Approaches
            First generate all the patterns and then filter out the uninteresting ones
            Generate only the interesting patterns: mining query optimization

  • Knowledge Discovery in Databases: Process
    Adapted from: U. Fayyad et al. (1995), From Knowledge Discovery to Data Mining: An Overview, Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
    [Figure: KDD process diagram; surviving label: Knowledge]

  • Steps of a KDD Process
    Learning the application domain
        Relevant prior knowledge and goals of application
    Creating a target data set: data selection
    Data cleaning and preprocessing (may take 60% of effort!)
    Data reduction and transformation
        Find useful features, dimensionality/variable reduction, invariant representation
    Choosing functions of data mining
        Summarization, classification, regression, association, clustering
    Choosing the mining algorithm(s)
    Data mining: search for patterns of interest
    Pattern evaluation and knowledge presentation
        Visualization, transformation, removing redundant patterns, etc.
    Use of discovered knowledge

  • Data Mining and Business Intelligence
    [Diagram, layers listed from top to bottom]
        Making Decisions
        Data Presentation: Visualization Techniques
        Data Mining: Information Discovery
        Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
        Data Warehouses / Data Marts
        Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
    [Roles shown beside the layers: End User, Business Analyst, Data Analyst, DBA]

  • Architecture: Typical Data Mining System
    [Diagram labels: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration; Filtering; Databases; Data Warehouse]

  • Related Techniques: OLAP (On-Line Analytical Processing)
    On-Line Analytical Processing tools provide the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries)
    Advantages relative to data mining
        Can obtain a wider variety of results
        Generally faster to obtain results
    Disadvantages relative to data mining
        User must ask the right question
        Generally used to determine high-level statistical summaries, rather than specific relationships among instances
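The kind of summary query an OLAP tool answers can be sketched as a roll-up: pick some dimensions, group the records by them, and total a measure. The sketch below does this in plain Python on a handful of hypothetical sales records. The function name `roll_up`, the field names, and the figures are all invented for illustration, and a real OLAP system would precompute and index such aggregates.

```python
from collections import defaultdict


def roll_up(rows, dims, measure):
    """Sum `measure` over all rows sharing the same values for `dims`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical sales records, not taken from the slides.
sales = [
    {"region": "East", "quarter": "Q1", "amount": 100.0},
    {"region": "East", "quarter": "Q2", "amount": 150.0},
    {"region": "West", "quarter": "Q1", "amount": 80.0},
    {"region": "West", "quarter": "Q1", "amount": 20.0},
]

by_region = roll_up(sales, ["region"], "amount")
by_region_quarter = roll_up(sales, ["region", "quarter"], "amount")
print(by_region)
print(by_region_quarter)
```

The user has to choose the dimensions and the measure, which is the "must ask the right question" limitation noted on the slide. The roll-up reports totals, not relationships among individual records.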

  • Integration of Data Mining and Data Warehousing
    Data mining systems, DBMS, data warehouse systems coupling
        No coupling, loose-coupling, semi-tight-coupling, tight-coupling
    On-line analytical mining of data
        Integration of mining and OLAP technologies
    Interactive mining of multi-level knowledge
        Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
    Integration of multiple mining functions
        Characterized classification, first clustering and then association

  • Data Mining and Visualization
    Approaches
        Visualization to display results of data mining
            Help analyst to better understand the results of the data mining tool
        Visualization to aid the data mining process
            Interactive control over the data exploration process
            Interactive steering of analytic approaches (grand tour)
    Interactive data mining issues
        Relationships between the analyst, the data mining tool and the visualization tool

  • Customer Centric Data Mining and CRM Life-Cycle

  • Data Mining Solves Four Problems
    Discovering Relationships: MBA, Link Analysis
    Making Choices: Resource Allocation, Service Agreements
    Making Predictions: Good-Bad Customer, Stock Prices
    Improving the Process: Utility Forecast

  • The Data Mining Process
    Problem Definition
    Data Evaluation
    Feature Extraction and Enhancement
    Prototyping Plan
    Prototyping/Model Development
    Model Evaluation
    Implementation
    Return-on-Investment Evaluation