02 - Introduction to Data Mining


  • 8/8/2019 02 - Introduction to Data Mining

    1/15

    Introduction to Data Mining

    Business Intelligence Ingegneria Gestionale UNIGE 1

    Davide Anguita, SmartLab

    DIBE, Facoltà di Ingegneria

    Università degli Studi di Genova

    Slides from: J. Han & M. Kamber, Data Mining: Concepts and Techniques

    Introduction

      Motivation: Why data mining?
      What is data mining?
      Data mining: On what kind of data?
      Data mining functionality
      Are all the patterns interesting?
      Major issues in data mining

    Motivation: Necessity is the Mother of Invention

      Data explosion problem
        Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
      We are drowning in data, but starving for knowledge!
      Solution: Data warehousing and data mining
        Data warehousing and on-line analytical processing
        Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

    Evolution of Database Technology

      1960s:
        Data collection, database creation, IMS and network DBMS
      1970s:
        Relational data model, relational DBMS implementation
      1980s:
        RDBMS, advanced data models (extended relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
      1990s-2000s:
        Data mining and data warehousing, multimedia databases, and Web databases


    What Is Data Mining?

      Data mining (knowledge discovery in databases):
        Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
      Alternative names and their "inside stories":
        Data mining: a misnomer?
        Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
      What is not data mining?
        (Deductive) query processing
        Expert systems or small ML/statistical programs

    Why Data Mining? Potential Applications

      Database analysis and decision support
        Market analysis and management
          Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
        Risk analysis and management
          Forecasting, customer retention, improved underwriting, quality control, competitive analysis
        Fraud detection and management
      Other applications
        Text mining (news group, email, documents) and Web analysis
        Intelligent query answering

    Market Analysis and Management

      Where are the data sources for analysis?
        Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
      Target marketing
        Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
      Determine customer purchasing patterns over time
        Conversion of a single to a joint bank account: marriage, etc.
      Cross-market analysis
        Associations/correlations between product sales
        Prediction based on the association information
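
    One simple way to quantify a correlation between the sales of two products is the Pearson coefficient. The sketch below is only an illustration: the product names and weekly sales figures are made up, and the slides do not prescribe a particular correlation measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly unit sales for two products (made-up numbers).
printers = [10, 12, 15, 11, 18, 20]
ink = [21, 25, 29, 22, 37, 41]

# A value close to +1 suggests the two products tend to sell together.
r = pearson(printers, ink)
```

    A high coefficient only signals that the two series move together; it does not by itself say which product drives the other.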


    Market Analysis and Management

      Customer profiling
        Data mining can tell you what types of customers buy what products (clustering or classification)
      Identifying customer requirements
        Identifying the best products for different customers
        Use prediction to find what factors will attract new customers
      Provides summary information
        Various multidimensional summary reports
        Statistical summary information (data central tendency and variation)
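
    As an illustration of "central tendency and variation", the sketch below summarizes a small list of purchase amounts with Python's standard `statistics` module. The amounts are hypothetical, and the choice of measures (mean, median, mode, standard deviation, range) is an assumption; the slides do not list specific statistics.

```python
import statistics

# Hypothetical purchase amounts for one customer segment (made-up numbers).
amounts = [12.5, 40.0, 8.0, 55.0, 23.5, 40.0, 19.0]

summary = {
    # Central tendency
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "mode": statistics.mode(amounts),
    # Variation
    "stdev": statistics.stdev(amounts),      # sample standard deviation
    "range": max(amounts) - min(amounts),
}
```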


    Corporate Analysis and Risk Management

      Finance planning
        Cash flow analysis and prediction
        Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
      Resource planning:
        Summarize and compare the resources and spending
      Competition:
        Monitor competitors and market directions
        Group customers into classes and a class-based pricing procedure
        Set pricing strategy in a highly competitive market

    Fraud Detection and Management

      Applications
        Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
      Approach
        Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
      Examples
        Auto insurance: detect a group of people who stage accidents to collect on insurance
        Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
        Medical insurance: detect professional patients and rings of doctors

    Fraud Detection and Management

      Detecting inappropriate medical treatment
        The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr)
      Detecting telephone fraud
        Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
        British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
      Retail
        Analysts estimate that 38% of retail shrink is due to dishonest employees.
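
    "Patterns that deviate from an expected norm" can be sketched with a very simple model: treat a customer's past call durations as the norm and flag new calls that lie several standard deviations away. The function name, the threshold of 3, and the data are all assumptions made for illustration; a real call model would also use destination and time of day.

```python
import statistics

def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Return the new calls whose duration deviates from the customer's norm.

    history   -- past call durations in minutes (the "expected norm")
    new_calls -- durations to check
    threshold -- number of standard deviations treated as a deviation (assumed)
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [d for d in new_calls if abs(d - mean) > threshold * stdev]

# Hypothetical call durations for one customer (made-up numbers).
history = [3, 5, 4, 6, 5, 4, 3, 5, 6, 4]
suspicious = flag_unusual_calls(history, [5, 4, 95, 6])
```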


    Other Applications

      Sports
        IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
      Astronomy
        JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
      IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.


    Data Mining: A KDD Process

      Data mining: the core of the knowledge discovery process.

      [Figure: the KDD process, from Databases through Data Cleaning and Data Integration to the Data Warehouse, then Selection of Task-relevant Data, Data Mining, and Pattern Evaluation]

    Steps of a KDD Process

      Learning the application domain:
        Relevant prior knowledge and goals of application
      Creating a target data set: data selection
      Data cleaning and preprocessing (may take 60% of effort!)
      Data reduction and transformation:
        Find useful features, dimensionality/variable reduction, invariant representation
      Choosing functions of data mining:
        Summarization, classification, regression, association, clustering
      Choosing the mining algorithm(s)
      Data mining: search for patterns of interest
      Pattern evaluation and knowledge presentation:
        Visualization, transformation, removing redundant patterns, etc.
      Use of discovered knowledge
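
    The steps above can be sketched as a chain of small functions, one per stage. This is only a toy illustration under stated assumptions: the record layout, the function names, and the 50% support threshold are hypothetical, and the "mining" step is reduced to counting item frequencies.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, region, basket). Illustrative only.
raw = [
    (1, "north", ["bread", "milk"]),
    (2, "north", ["bread", "milk", "eggs"]),
    (3, "south", ["milk"]),
    (4, "north", None),                 # missing basket: dropped during cleaning
    (5, "north", ["Bread ", "eggs"]),   # inconsistent formatting: normalized
]

def select(records, region):
    """Data selection: keep only task-relevant records."""
    return [r for r in records if r[1] == region]

def clean(records):
    """Data cleaning: drop records with missing baskets, normalize item names."""
    return [(cid, reg, [i.strip().lower() for i in basket])
            for cid, reg, basket in records if basket]

def transform(records):
    """Data reduction/transformation: keep only the feature being mined."""
    return [frozenset(basket) for _, _, basket in records]

def mine(baskets):
    """Data mining: count in how many baskets each item occurs."""
    return Counter(item for basket in baskets for item in basket)

def evaluate(counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

baskets = transform(clean(select(raw, "north")))
patterns = evaluate(mine(baskets), len(baskets))
```

    Keeping each stage as a separate function mirrors the slide's point that cleaning and preprocessing are distinct from, and often more laborious than, the mining step itself.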


    Data Mining and Business Intelligence

      [Figure: pyramid of layers, listed here from top to bottom, with increasing potential to support business decisions toward the top; typical roles in parentheses]
      Making Decisions (End User)
      Data Presentation: Visualization Techniques (Business Analyst)
      Data Mining: Information Discovery (Data Analyst)
      Data Exploration: Statistical Analysis, Querying and Reporting
      Data Warehouses / Data Marts: OLAP, MDA (DBA)
      Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

    Architecture of a Typical Data Mining System

      [Figure: layered architecture with the following components]
      Graphical user interface
      Pattern evaluation
      Data mining engine
      Knowledge base
      Database or data warehouse server
      Data cleaning & data integration, filtering
      Databases, Data Warehouse


    Data Mining: On What Kind of Data?

      Relational databases
      Data warehouses
      Transactional databases
      Advanced DB and information repositories
        Object-oriented and object-relational databases
        Spatial databases
        Time-series data and temporal data
        Text databases and multimedia databases
        Heterogeneous and legacy databases
        WWW

    Data Mining Functionalities (1)

      Concept description: Characterization and discrimination
        Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
      Association (correlation and causality)
        Multi-dimensional vs. single-dimensional association
        age(X, "20..29") ∧ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
        contains(T, "computer") → contains(T, "software") [1%, 75%]
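
    The bracketed figures on the rules above are support and confidence. As a minimal sketch, assuming transactions are represented as Python sets (the data and function names below are made up), support is the fraction of transactions containing every item of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

# Rule: contains(T, "computer") -> contains(T, "software")
s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
```

    In this toy data, three of five transactions contain both items and four contain "computer", so the rule holds in three out of those four cases.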


    Data Mining Functionalities (2)

      Classification and prediction
        Finding models (functions) that describe and distinguish classes or concepts for future prediction
        E.g., classify countries based on climate, or classify cars based on gas mileage
        Presentation: decision tree, classification rule, neural network
        Prediction: predict some unknown or missing numerical values
      Cluster analysis
        Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
        Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
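
    One common way to act on the clustering principle is k-means, which repeatedly assigns each point to its nearest center and then moves each center to the mean of its points. The slides do not name an algorithm, so k-means is an assumption here, as are the one-dimensional house prices and the initial centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for one-dimensional data (Lloyd's algorithm)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); the initial centers are guesses.
prices = [95, 100, 110, 105, 300, 310, 295, 305]
centers, clusters = kmeans_1d(prices, centers=[100, 200])
```

    No class labels are supplied; the two groups emerge only from how close the prices are to each other.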


    Data Mining Functionalities (3)

      Outlier analysis
        Outlier: a data object that does not comply with the general behavior of the data
        It can be considered as noise or exception, but is quite useful in fraud detection and rare events analysis
      Trend and evolution analysis
        Trend and deviation: regression analysis
        Sequential pattern mining, periodicity analysis
        Similarity-based analysis
      Other pattern-directed or statistical analyses
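
    Trend and deviation can be illustrated together: fit a straight line to a series by ordinary least squares, then look at which point lies farthest from it. The monthly sales figures are made up, and a straight-line trend is an assumption chosen for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical monthly sales with one unusual month.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 14, 30, 18, 20]

slope, intercept = fit_line(months, sales)
residuals = [y - (slope * x + intercept) for x, y in zip(months, sales)]

# The month with the largest absolute residual deviates most from the trend.
most_deviant = max(range(len(months)), key=lambda i: abs(residuals[i]))
```

    The same residuals that describe deviation from the trend are a natural starting point for outlier analysis.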


    Are All the Discovered Patterns Interesting?

      A data mining system/query may generate thousands of patterns; not all of them are interesting.
        Suggested approach: human-centered, query-based, focused mining
      Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
      Objective vs. subjective interestingness measures:
        Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
        Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

    Can We Find All and Only Interesting Patterns?

      Find all the interesting patterns: completeness
        Can a data mining system find all the interesting patterns?
        Association vs. classification vs. clustering
      Search for only interesting patterns: optimization
        Can a data mining system find only the interesting patterns?
        First generate all the patterns and then filter out the uninteresting ones
        Generate only the interesting patterns: mining query optimization
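
    The difference between "generate everything, then filter" and "generate only what can be interesting" can be sketched for frequent itemsets. The level-wise search below uses Apriori-style pruning: a larger itemset is only considered if all of its smaller subsets are already frequent. The slides do not name this algorithm, so treat it as one possible illustration; the transactions and the 50% threshold are made up.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for frequent itemsets.

    Candidates of size k+1 are built only from frequent itemsets of size k,
    so itemsets that cannot be frequent are never counted.
    """
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = sup(s)
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        frequent = set(level)
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, len(c) - 1))
                 and sup(c) >= min_support]
    return result

# Hypothetical transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
]
found = frequent_itemsets(transactions, min_support=0.5)
```

    Here the three-item candidate is discarded without counting it, because one of its pairs (milk and eggs) is already known to be infrequent.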


    Data Mining: Confluence of Multiple Disciplines

      [Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Visualization, Information Science, and Other Disciplines]

    Data Mining: Classification Schemes

      General functionality
        Descriptive data mining
        Predictive data mining
      Different views, different classifications
        Kinds of databases to be mined
        Kinds of knowledge to be discovered
        Kinds of techniques utilized
        Kinds of applications adapted


    Visual Data Mining & Data Visualization

      Integration of visualization and data mining
        Data visualization
        Data mining result visualization
        Data mining process visualization
        Interactive visual data mining
      Data visualization
        Data in a database or data warehouse can be viewed
          at different levels of granularity or abstraction
          as different combinations of attributes or dimensions
        Data can be presented in various visual forms

    Boxplots from Statsoft: multiple variable combinations

    Data Mining Result Visualization

      Presentation of the results or knowledge obtained from data mining in visual forms
      Examples
        Scatter plots and boxplots (obtained from descriptive data mining)
        Decision trees
        Association rules
        Clusters
        Outliers
        Generalized rules

    Visualization of data mining results in SAS Enterprise Miner: scatter plots


    Visualization of association rules in MineSet 3.0

    Visualization of a decision tree in MineSet 3.0

    Visualization of cluster groupings in IBM Intelligent Miner


    Data Mining Process Visualization

      Presentation of the various processes of data mining in visual forms so that users can see
        How the data are extracted
        From which database or data warehouse they are extracted
        How the selected data are cleaned, integrated, preprocessed, and mined
        Which method is selected at data mining
        Where the results are stored
        How they may be viewed


    Visualization of Data Mining Processes by Clementine

    Interactive Visual Data Mining

      Using visualization tools in the data mining process to help users make smart data mining decisions
      Example
        Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by either a circle or a set of columns)
        Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
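
    In interactive visual mining the user picks the split point by eye. For comparison, the sketch below shows how a program might score candidate split points on one attribute using Gini impurity. The Gini criterion, the function names, and the mileage data are assumptions for illustration and are not taken from the slides.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try midpoints between sorted distinct values.

    Returns (threshold, weighted impurity), or None if there is only one
    distinct value and therefore nothing to split.
    """
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical attribute values (gas mileage) with class labels.
mileage = [18, 20, 22, 30, 32, 35]
labels = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(mileage, labels)
```

    A colored-sector display conveys the same information visually: a good split point is where the colors on either side are as uniform as possible.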


    Interactive Visual Mining by Perception Based Classification (PBC)


    Audio Data Mining

      Uses audio signals to indicate the patterns of data or the features of data mining results
      An interesting alternative to visual mining
      An inverse task of mining audio (such as music) databases, which is to find patterns from audio data
      Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns
      Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
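
    A minimal sketch of "transforming patterns into sound": map each data value linearly onto a range of MIDI note numbers, so that an unusual value becomes an unusually high pitch. The note range, the function names, and the data are assumptions; the code only computes notes and frequencies and does not play any audio.

```python
def to_midi_notes(values, low_note=48, high_note=84):
    """Linearly map numeric values onto a range of MIDI note numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low_note for _ in values]
    span = high_note - low_note
    return [round(low_note + (v - lo) / (hi - lo) * span) for v in values]

def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

# Hypothetical series; the value 20 is the odd one out.
series = [3, 4, 3, 5, 4, 20, 4, 3]
notes = to_midi_notes(series)
frequencies = [midi_to_hz(n) for n in notes]
```

    Played in sequence, the sixth note would stand out from an otherwise narrow melody, which is the kind of cue the slide describes listening for.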


    Is Data Mining a Hype or Will It Be Persistent?

      Data mining is a technology
      Technological life cycle
        Innovators
        Early adopters
        Chasm
        Early majority
        Late majority
        Laggards

    Life Cycle of Technology Adoption

      Data mining is at the chasm!?
        Existing data mining systems are too generic
        Need business-specific data mining solutions and smooth integration of business logic with data mining functions

    Data Mining: Merely Managers' Business or Everyone's?

      Data mining will surely be an important tool for managers' decision making
        Bill Gates: "Business @ the Speed of Thought"
      The amount of available data is increasing, and data mining systems will be more affordable
      Multiple personal uses
        Mine your family's medical history to identify genetically related medical conditions
        Mine the records of the companies you deal with
        Mine data on stocks and company performance, etc.
      Invisible data mining
        Build data mining functions into many intelligent tools

    Social Impacts: Threat to Privacy and Data Security?

      Is data mining a threat to privacy and data security?
        "Big Brother", "Big Banker", and "Big Business" are carefully watching you
      Profiling information is collected every time
        You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
        You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form
        You pay for prescription drugs, or present your medical care number when visiting the doctor
      Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse


    Protect Privacy and Data Security

      Fair information practices
        International guidelines for data privacy protection
        Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
        Purpose specification and use limitation
        Openness: individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used
      Develop and use data security-enhancing techniques
        Blind signatures
        Biometric encryption
        Anonymous databases

    Trends in Data Mining (1)

      Application exploration
        Development of application-specific data mining systems
        Invisible data mining (mining as a built-in function)
      Scalable data mining methods
        Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
      Integration of data mining with database systems, data warehouse systems, and Web database systems

    Trends in Data Mining (2)

      Standardization of data mining language
        A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society
      Visual data mining
      New methods for mining complex types of data
        More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data
      Web mining
      Privacy protection and information security in data mining