Data Mining And KDD
7/29/2019 Data Mining And KDD
From Data Mining to Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro
KDnuggets
-
Outline
Introduction
Data Mining Tasks
Application Examples
-
Trends leading to Data Flood
More data is generated:
Bank, telecom, other business transactions ...
Scientific Data: astronomy, biology, etc
Web, text, and e-commerce
More data is captured:
Storage technology faster and cheaper
DBMS capable of handling bigger DB
http://www.cultindustries.com/new/html/frame.html
-
Examples
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
storage and analysis are a big problem
Walmart reported to have 24 Tera-byte DB
AT&T handles billions of calls per day
data cannot be stored -- analysis is done on the fly
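To give a sense of scale for the VLBI figures above, the following sketch works out the total volume for one session. It assumes all 16 telescopes record continuously for the full 25 days and that "Gigabit" and "petabyte" are decimal units; neither assumption is stated on the slide, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400

def session_volume_petabytes(telescopes, gigabits_per_second, days):
    """Total data volume of one observation session, in decimal petabytes.

    Assumes every telescope records continuously for the whole session.
    """
    total_bits = telescopes * gigabits_per_second * 10**9 * days * SECONDS_PER_DAY
    total_bytes = total_bits / 8
    return total_bytes / 10**15

volume = session_volume_petabytes(telescopes=16, gigabits_per_second=1, days=25)
print(f"{volume:.2f} PB per session")
```

Under these assumptions a single session comes to a few petabytes, far more than the 24-terabyte database cited for Walmart.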
-
Growth Trends
Moore's law
Computer speed doubles every 18 months
Storage law
total storage doubles every 9 months
Consequence
very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
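The gap between the two doubling rates can be made concrete with a short calculation. This is a sketch that simply takes the 18-month and 9-month doubling periods from the slide at face value; the three-year horizon is an arbitrary choice for illustration.

```python
def growth_factor(months, doubling_period_months):
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

months = 36  # arbitrary illustration horizon: three years
speed = growth_factor(months, 18)    # computer speed (Moore's law)
storage = growth_factor(months, 9)   # total storage (storage law)
print(f"speed x{speed:.0f}, storage x{storage:.0f}, gap x{storage / speed:.0f}")
```

Because storage doubles twice as often, the amount of stored data per unit of computing speed itself doubles every 18 months under these figures.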
-
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
-
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
-
Knowledge Discovery Process
[Figure: process diagram with the stages Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, and Knowledge; labeled steps include Integration, Understanding, and Interpretation & Evaluation]
-
Outline
Introduction
Data Mining Tasks
Application Examples
-
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
-
Classification: Linear Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to fit the data
Not flexible enough
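As a rough illustration of the idea on this slide, the sketch below fits w0, w1, w2 by ordinary least squares to class targets coded as +1 and -1, then classifies with the rule w0 + w1 x + w2 y >= 0. The target coding, the toy data, and the helper names are assumptions made for the example, not details from the slides.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w

def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing the squared error of w0 + w1*x + w2*y against targets."""
    rows = [(1.0, x, y) for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(ata, atb)

def classify(w, x, y):
    """Apply the linear decision rule from the slide."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1

# Toy data (assumed): one class in the upper right, the other in the lower left.
points = [(3, 3), (4, 3), (3, 4), (4, 4), (0, 0), (1, 0), (0, 1), (1, 1)]
targets = [1, 1, 1, 1, -1, -1, -1, -1]
w = fit_least_squares(points, targets)
predictions = [classify(w, x, y) for x, y in points]
print(w, predictions)
```

The resulting boundary is always a single straight line, which is the sense in which the method is "not flexible enough" for classes that are not linearly separable.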
-
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plot of the instances, with splits marked at X = 2, X = 5, and Y = 3]
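The decision tree rules on this slide translate directly into nested conditionals. The function name below is illustrative; the thresholds and class labels are the ones given on the slide.

```python
def classify_point(x, y):
    """Decision tree from the slide: each test splits on one coordinate."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"

for point in [(6, 1), (1, 4), (3, 1), (1, 1)]:
    print(point, classify_point(*point))
```

Only the region with 2 < X <= 5 and Y <= 3 is labeled green, so the tree carves the plane into axis-aligned rectangles.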
-
Classification: Neural Nets
Can select more complex regions
Can be more accurate
Also can overfit the data: find patterns in random noise
-
Data Mining Central Quest
Find true patterns and avoid overfitting (false patterns due to randomness)
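Overfitting can be demonstrated on data that contains no pattern at all. The sketch below assigns class labels at random, then uses a 1-nearest-neighbor classifier (a technique chosen here for illustration, not one named in the slides) that memorizes the training set. The data sizes and the random seed are arbitrary assumptions.

```python
import random

def nearest_label(point, train):
    """Label of the training instance closest to `point` (1-nearest-neighbor)."""
    closest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return closest[1]

def accuracy(data, train):
    """Fraction of instances in `data` whose label the classifier predicts correctly."""
    hits = sum(nearest_label(point, train) == label for point, label in data)
    return hits / len(data)

def random_dataset(rng, n):
    """Random points with labels that are independent of position: pure noise."""
    return [((rng.random(), rng.random()), rng.choice(["blue", "green"])) for _ in range(n)]

rng = random.Random(0)
train = random_dataset(rng, 200)
test = random_dataset(rng, 1000)

train_accuracy = accuracy(train, train)  # every point is its own nearest neighbor
test_accuracy = accuracy(test, train)    # no real pattern exists, so near chance
print(f"train: {train_accuracy:.2f}, test: {test_accuracy:.2f}")
```

Perfect accuracy on the training data alongside chance-level accuracy on fresh data is the signature of a false pattern, which is why models are evaluated on data they were not fitted to.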
-
Data Mining Tasks: Clustering
Find natural grouping of instances given un-labeled data
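One common way to find such groupings is k-means, sketched below. The slides do not name a clustering algorithm, so k-means, the naive initialization from the first k points, and the toy data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iterations=20):
    """Plain k-means on 2-D points; returns (centers, labels)."""
    centers = [tuple(map(float, p)) for p in points[:k]]  # naive init: first k points
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

No class labels are supplied; the two groups emerge only from the distances between instances.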
-
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
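For the Associations task in the list above, "A & B & C occur frequently" can be checked by counting how often each item combination appears together. The sketch below is a brute-force count, not an optimized algorithm such as Apriori; the baskets, the support threshold, and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Itemsets of the given size whose support (fraction of transactions) meets the threshold."""
    counts = Counter()
    for transaction in transactions:
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }

baskets = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]
print(frequent_itemsets(baskets, size=3, min_support=0.5))
```

Here A, B, and C appear together in three of the five baskets, while every triple involving D appears only once and is filtered out.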
-
www.KDnuggets.com
Data Mining Software Guide
-
Outline
Introduction
Data Mining Tasks
Application Examples
-
Major Application Areas for Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
-
Case Study: Direct Marketing and CRM
Most major direct marketing companies are using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Some successes
Verizon Wireless reduced churn rate from 2% to 1.5%
-
Biology: Molecular Diagnostics
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
72 samples, about 7,000 genes
[Figure: ALL vs AML samples]
Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)
Outcome predictions?
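The accuracy quoted in the results follows from the counts of correct and incorrect predictions. A minimal check, using only the two counts given on the slide:

```python
def accuracy(correct, errors):
    """Fraction of predictions that were correct."""
    return correct / (correct + errors)

leukemia_accuracy = accuracy(correct=33, errors=1)
print(f"{leukemia_accuracy:.1%}")  # prints 97.1%
```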
-
AF1q: New Marker for Medulloblastoma?
AF1Q: ALL1-fused gene from chromosome 1q
transmembrane protein
Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
-
Case Study: Security and Fraud Detection
Credit Card Fraud Detection
Money laundering
FAIS (US Treasury)
Securities Fraud
NASDAQ Sonar system
Phone fraud
AT&T, Bell Atlantic, British Telecom/MCI
Bio-terrorism detection at Salt Lake Olympics 2002
-
Data Mining and Terrorism: Controversy in the News
TIA: Terrorism (formerly Total) Information Awareness Program
DARPA program closed by Congress
some functions transferred to intelligence agencies
CAPPS II: screen all airline passengers
controversial
Invasion of Privacy or Defensive Shield?
-
Criticism of analytic approach to Threat Detection:
Data Mining will
invade privacy
generate millions of false positives
But can it be effective?
-
Can Data Mining and Statistics be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot
After 30 throws, one biased coin will stand out with high probability.
Can identify 19 biased coins out of 100 million with sufficient number of throws
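The coin example can be checked with an exact calculation. The slide does not say how biased the coin is, so the sketch below assumes it lands heads 90% of the time, and it defines "stands out" as showing strictly more heads than every one of the 999 fair coins. Both choices, and the function names, are assumptions for illustration.

```python
from math import comb

def binomial_pmf(n, k, p):
    """Probability of exactly k heads in n throws of a coin with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_biased_stands_out(throws, fair_coins=999, bias=0.9):
    """Probability that the biased coin shows strictly more heads than every fair coin."""
    total = 0.0
    fair_below = 0.0  # P(a single fair coin shows fewer than k heads)
    for k in range(throws + 1):
        # Biased coin shows k heads and all fair coins show fewer than k.
        total += binomial_pmf(throws, k, bias) * fair_below**fair_coins
        fair_below += binomial_pmf(throws, k, 0.5)
    return total

for throws in (1, 10, 30):
    print(throws, prob_biased_stands_out(throws))
```

With a single throw the biased coin is indistinguishable from the hundreds of fair coins that also land heads, while with 30 throws the probability under these assumptions is high, in line with the slide's claim that accumulating many weak signals separates the true outlier from chance.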
-
Another Approach: Link Analysis
Can Find Unusual Patterns in the Network Structure
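As one very simple notion of an "unusual pattern" in network structure, the sketch below flags nodes whose number of distinct contacts is far above the network average. The slides do not specify a link analysis method; the degree-based z-score, the threshold, and the toy network are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unusual_nodes(edges, z_threshold=2.0):
    """Nodes whose degree is more than `z_threshold` standard deviations above the mean."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    degrees = {node: len(linked) for node, linked in neighbors.items()}
    average = mean(degrees.values())
    spread = pstdev(degrees.values())
    if spread == 0:
        return []  # every node has the same degree, so nothing stands out
    return sorted(
        node for node, degree in degrees.items()
        if (degree - average) / spread > z_threshold
    )

# Toy network (assumed): node "H" is linked to everyone else.
edges = [("H", other) for other in "abcdefgh"] + [("a", "b"), ("c", "d")]
print(unusual_nodes(edges))
```

Real link analysis considers richer structure than degree alone, such as paths and communities, but the principle of scoring nodes against the rest of the network is the same.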
-
Analytic technology can be effective
Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Data Mining is just one additional tool to help analysts
Analytic Technology has the potential to reduce the current high rate of false positives
-
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anon. ID
Give randomized outputs
Multi-party computation on distributed data
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
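The first technical solution listed, replacing sensitive personal data with an anonymous ID, might look like the following sketch. It uses a keyed hash (HMAC-SHA256) so that the same person always maps to the same ID without the ID revealing the name. The key handling, the ID length, and the sample records are assumptions for illustration, and this is not presented as the approach from the cited article.

```python
import hashlib
import hmac

def anonymous_id(personal_value, secret_key):
    """Stable pseudonym for a personal identifier, derived with a keyed hash."""
    digest = hmac.new(secret_key, personal_value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumed length)

# Hypothetical key; in practice it would be stored separately from the data.
secret_key = b"example-key-kept-by-data-owner"

records = [("Alice Smith", 120.50), ("Bob Jones", 75.00), ("Alice Smith", 19.99)]
anonymized = [(anonymous_id(name, secret_key), amount) for name, amount in records]
for row in anonymized:
    print(row)
```

Patterns such as repeat purchases by the same customer remain discoverable, because identical names produce identical IDs, while the analyst never sees the names themselves.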
-
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: Expectations and Performance plotted over time (1990, 1998, 2000, 2002); phases labeled rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]
-
Summary
www.KDnuggets.com
the website for Data Mining and Knowledge Discovery
Contact: Gregory Piatetsky-Shapiro