Data Mining And KDD

  • 7/29/2019 Data Mining And KDD

    From Data Mining to Knowledge Discovery: An Introduction

    Gregory Piatetsky-Shapiro

    KDnuggets

    Outline

    Introduction

    Data Mining Tasks

    Application Examples

    Trends leading to Data Flood

    More data is generated:

    Bank, telecom, other business transactions ...

    Scientific data: astronomy, biology, etc.

    Web, text, and e-commerce

    More data is captured:

    Storage technology faster and cheaper

    DBMS capable of handling bigger DB

    http://www.cultindustries.com/new/html/frame.html

    Examples

    Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session

    storage and analysis are a big problem

    Walmart reported to have 24 Tera-byte DB

    AT&T handles billions of calls per day

    data cannot be stored -- analysis is done on the fly
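
To put the VLBI figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes that all 16 telescopes stream continuously for the full 25 days and that a gigabit means 10^9 bits; neither assumption is stated on the slide, and the function name is only illustrative.

```python
def session_volume_bytes(telescopes, gigabits_per_second, days):
    """Total bytes produced if every telescope streams for the whole session."""
    seconds = days * 24 * 60 * 60
    bits = telescopes * gigabits_per_second * 10**9 * seconds
    return bits // 8

total = session_volume_bytes(telescopes=16, gigabits_per_second=1, days=25)
petabytes = total / 10**15
print(f"{petabytes:.2f} PB")  # prints "4.32 PB"
```

Under these assumptions one session produces a few petabytes, which is why storage and analysis are described as a big problem.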

    Growth Trends

    Moore's law

    Computer speed doubles every 18 months

    Storage law: total storage doubles every 9 months

    Consequence

    very little data will ever be looked at by a human

    Knowledge Discovery is NEEDED to make sense and use of data.
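
The gap between the two doubling rates above compounds quickly. The sketch below takes the two periods from the slide and compares growth over a three-year window; the window and the function name are chosen only for illustration.

```python
def growth_factor(months, doubling_period_months):
    """How many times larger a quantity becomes after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)

speed_growth = growth_factor(36, 18)    # computer speed: doubles every 18 months
storage_growth = growth_factor(36, 9)   # total storage: doubles every 9 months
print(speed_growth, storage_growth)     # prints "4.0 16.0"
```

Over three years storage grows four times faster than the computing power available to examine it, which is the arithmetic behind the consequence stated above.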

    Knowledge Discovery Definition

    Knowledge Discovery in Data is the non-trivial process of identifying

    valid

    novel

    potentially useful

    and ultimately understandable patterns in data.

    from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

    Related Fields

    Statistics

    Machine Learning

    Databases

    Visualization

    Data Mining and Knowledge Discovery

    Knowledge Discovery Process

    [Figure: process diagram with stages labeled Data Warehouse, Raw Data, Target Data, Transformed Data, Patterns and Rules, and Knowledge, and steps labeled Integration, Interpretation & Evaluation, and Understanding]

    Outline

    Introduction

    Data Mining Tasks

    Application Examples

    Data Mining Tasks: Classification

    Learn a method for predicting the instance class from pre-labeled (classified) instances

    Many approaches:

    Statistics, Decision Trees,

    Neural Networks,

    ...

    Classification: Linear Regression

    Linear Regression

    w0 + w1 x + w2 y >= 0

    Regression computes the weights wi from data to minimize squared error to fit the data

    Not flexible enough
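
A minimal sketch of the idea on this slide, assuming the two classes are encoded as targets +1 and -1 and the weights are found by solving the least-squares normal equations. The data points, the target encoding, and the helper names are all made up for illustration.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_weights(points, labels):
    """Least-squares weights (w0, w1, w2) for targets +1 (class 1) and -1 (class 0)."""
    rows = [[1.0, x, y] for x, y in points]
    targets = [1.0 if label == 1 else -1.0 for label in labels]
    # Normal equations: (R^T R) w = R^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)

def classify(w, x, y):
    """Apply the decision rule w0 + w1 x + w2 y >= 0."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else 0

points = [(1, 1), (2, 1), (1, 2), (6, 5), (7, 6), (6, 7)]
labels = [0, 0, 0, 1, 1, 1]
w = fit_weights(points, labels)
predictions = [classify(w, x, y) for x, y in points]
```

This works here because the two groups can be separated by a straight line. When the classes interleave, no single line fits, which is the limitation the slide calls "not flexible enough".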

    Classification: Decision Trees

    [Figure: decision regions plotted on X and Y axes]

    if X > 5 then blue

    else if Y > 3 then blue

    else if X > 2 then green

    else blue
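
The rules above translate directly into nested conditions. This sketch only restates the slide's thresholds as a function; the function name is illustrative.

```python
def classify(x, y):
    """Follow the slide's decision tree from the root down to a leaf color."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"

print(classify(3, 1))  # prints "green"
```

Each test splits the plane with an axis-parallel line, so the green region is the rectangle where 2 < X <= 5 and Y <= 3.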

    Classification: Neural Nets

    Can select more complex regions

    Can be more accurate

    Also can overfit the data: find patterns in random noise

    Data Mining Central Quest

    Find true patterns and avoid overfitting (false patterns due to randomness)
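
One way to see overfitting is to train a model on labels that are pure noise. The sketch below uses a 1-nearest-neighbour memorizer, chosen here only because it is short; the slides do not name a specific method, and the sample sizes and seed are arbitrary.

```python
import random

def nearest_label(train, x):
    """Label of the training point closest to x (a 1-nearest-neighbour memorizer)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def accuracy(train, data):
    """Fraction of points in `data` whose label the memorizer reproduces."""
    hits = sum(1 for x, label in data if nearest_label(train, x) == label)
    return hits / len(data)

rng = random.Random(0)
# Labels are coin flips, so there is no true pattern linking x to the label.
train = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
holdout = [(rng.random(), rng.randint(0, 1)) for _ in range(2000)]

train_accuracy = accuracy(train, train)      # every point is its own nearest neighbour
holdout_accuracy = accuracy(train, holdout)  # unseen data exposes the false pattern
print(train_accuracy, holdout_accuracy)
```

The model scores perfectly on the data it memorized and close to chance on held-out data, which is why evaluating on unseen data is the usual guard against false patterns.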

    Data Mining Tasks: Clustering

    Find natural grouping of instances given unlabeled data
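
As one illustration of clustering, here is a small k-means sketch. The slide does not name an algorithm, so k-means, the sample points, and the choice to start from the first k points are all assumptions made for the example.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; starts from the first k points for repeatability."""
    centers = [points[i] for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for c, group in enumerate(groups):
            if group:
                new_centers.append((sum(p[0] for p in group) / len(group),
                                    sum(p[1] for p in group) / len(group)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, groups

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centers, groups = kmeans(points, k=2)
```

No labels are supplied: the two groups emerge only from the distances between points.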

    Major Data Mining Tasks

    Classification: predicting an item class

    Clustering: finding clusters in data

    Associations: e.g. A & B & C occur frequently

    Visualization: to facilitate human discovery

    Estimation: predicting a continuous value

    Deviation Detection: finding changes

    Link Analysis: finding relationships
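
To make the Associations task in the list above concrete, the sketch below counts how often item combinations occur together in a handful of made-up transactions. It is a brute-force count, not an optimized algorithm such as Apriori, and the function name and support threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset up to max_size; keep those seen in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

transactions = [
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "C"],
    ["B", "D"],
    ["A", "B", "C"],
]
frequent = frequent_itemsets(transactions, min_support=3)
print(frequent[("A", "B", "C")])  # prints "3"
```

Here A, B, and C occur together in three of the five transactions, so the combination passes the threshold, while B with D does not.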

    www.KDnuggets.com

    Data Mining Software Guide

    Outline

    Introduction

    Data Mining Tasks

    Application Examples

    Major Application Areas for Data Mining Solutions

    Advertising

    Bioinformatics

    Customer Relationship Management (CRM)

    Database Marketing

    Fraud Detection

    eCommerce

    Health Care

    Investment/Securities

    Manufacturing, Process Control

    Sports and Entertainment

    Telecommunications

    Web

    Case Study: Direct Marketing and CRM

    Most major direct marketing companies are using modeling and data mining

    Most financial companies are using customer modeling

    Modeling is easier than changing customer behaviour

    Some successes

    Verizon Wireless reduced churn rate from 2% to 1.5%

    Biology: Molecular Diagnostics

    Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)

    72 samples, about 7,000 genes

    [Figure with panels labeled ALL and AML]

    Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)

    Outcome predictions?

    AF1q: New Marker for Medulloblastoma?

    AF1Q: ALL1-fused gene from chromosome 1q

    transmembrane protein

    Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

    Case Study: Security and Fraud Detection

    Credit Card Fraud Detection

    Money laundering

    FAIS (US Treasury)

    Securities Fraud

    NASDAQ Sonar system

    Phone fraud

    AT&T, Bell Atlantic, British Telecom/MCI

    Bio-terrorism detection at Salt Lake Olympics 2002

    Data Mining and Terrorism: Controversy in the News

    TIA: Terrorism (formerly Total) Information Awareness Program

    DARPA program closed by Congress

    some functions transferred to intelligence agencies

    CAPPS II: screen all airline passengers

    controversial

    Invasion of Privacy or Defensive Shield?

    Criticism of analytic approach to Threat Detection:

    Data Mining will

    invade privacy

    generate millions of false positives

    But can it be effective?

    Can Data Mining and Statistics be Effective for Threat Detection?

    Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives

    Reality: Analytical models correlate many items of information to reduce false positives.

    Example: Identify one biased coin from 1,000.

    After one throw of each coin, we cannot tell which coin is biased.

    After 30 throws, one biased coin will stand out with high probability.

    Can identify 19 biased coins out of 100 million with sufficient number of throws
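
The coin example above can be checked with exact binomial probabilities. The slide does not say how biased the coin is or what cut-off is used, so the heads probability of 0.9 and the threshold of 26 heads out of 30 are assumptions made for this sketch.

```python
from math import comb

def tail_probability(n, k, p):
    """P(at least k heads in n throws) for a coin with heads probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

THROWS = 30
THRESHOLD = 26   # hypothetical cut-off: flag any coin showing 26 or more heads
BIASED_P = 0.9   # assumed bias; the slide does not give one

fair_flag = tail_probability(THROWS, THRESHOLD, 0.5)       # chance a fair coin is flagged
biased_flag = tail_probability(THROWS, THRESHOLD, BIASED_P)  # chance the biased coin is flagged
expected_false_flags = 999 * fair_flag                     # across the 999 fair coins
print(fair_flag, biased_flag, expected_false_flags)
```

After a single throw, about half of the 999 fair coins show heads and look the same as the biased one. After 30 throws, under these assumptions, the expected number of fair coins crossing the cut-off is far below one while the biased coin crosses it most of the time, so combining many observations sharply reduces false positives.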

    Another Approach: Link Analysis

    Can Find Unusual Patterns in the Network Structure

    Analytic technology can be effective

    Combining multiple models and link analysis can reduce false positives

    Today there are millions of false positives with manual analysis

    Data Mining is just one additional tool to help analysts

    Analytic Technology has the potential to reduce the current high rate of false positives

    Data Mining with Privacy

    Data Mining looks for patterns, not people!

    Technical solutions can limit privacy invasion

    Replacing sensitive personal data with an anonymized ID

    Give randomized outputs

    Multi-party computation over distributed data

    Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
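
Two of the technical solutions listed above can be sketched briefly. The example below is not taken from the cited article: it assumes a keyed hash (HMAC-SHA256) as one way to produce an anonymized ID, and the classic randomized-response scheme as one way to give randomized outputs. The key, names, and population figures are hypothetical.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key held by the data owner

def pseudonym(personal_id):
    """Keyed hash: stable for the same person, not reversible without the key."""
    digest = hmac.new(SECRET_KEY, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def randomized_answer(truth, rng):
    """Randomized response: tell the truth on heads, answer at random on tails."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(answers):
    """Invert observed = 0.5 * true + 0.25 to recover the population rate."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

rng = random.Random(1)
truths = [i < 6000 for i in range(20000)]  # made-up population: 30% answer "yes"
answers = [randomized_answer(t, rng) for t in truths]
estimate = estimate_true_rate(answers)
print(pseudonym("alice"), round(estimate, 2))
```

Records can still be joined on the pseudonym, and the population rate can still be estimated, yet no individual answer can be taken at face value. In that sense the analysis looks for patterns rather than people.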

    The Hype Curve for Data Mining and Knowledge Discovery

    [Figure: expectations and performance plotted over time (axis ticks 1990, 1998, 2000, 2002), with phases labeled rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]

    Summary

    www.KDnuggets.com

    the website for Data Mining and Knowledge Discovery

    Contact: Gregory Piatetsky-Shapiro

    [email protected]