Post on 03-Jul-2015
DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
1
Semester 2/2011
Lecture 10
Data Mining: Case Studies and Applications
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
2
Data Mining Goals
Data Floods
Machine Learning / Data Mining Application areas
Case Studies
Data Warehousing and Data Mining by Kritsada Sriphaew
THE FOUR GOALS OF DATA MINING
3
• Prediction: Using current data to make prediction on future activities
• Identification: "Data patterns can be used to identify the existence of an item, an event, or an activity“
• Classification: Breaking the data down into categories based on certain attributes.
• Optimization: Using the mined data to make optimizations on resources, such as time, money, etc.
Data Warehousing and Data Mining by Kritsada Sriphaew
Trends leading to Data Flood
More data is generated:
Bank, telecom, other
business transactions ...
Scientific data: astronomy,
biology, etc
Web, text, and e-commerce
4 Data Warehousing and Data Mining by Kritsada Sriphaew
Big Data Examples
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
storage and analysis a big problem
AT&T handles billions of calls per day
so much data, it cannot be all stored -- analysis has to be done “on the fly”, on streaming data
5 Data Warehousing and Data Mining by Kritsada Sriphaew
Largest databases in 2003
Commercial databases:
Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30TB; AT&T ~ 26 TB
Web
Alexa internet archive: 7 years of data, 500 TB
Google searches 4+ Billion pages, several hundreds TB
IBM WebFountain, 160 TB (2003)
Internet Archive (www.archive.org),~ 300 TB
6 Data Warehousing and Data Mining by Kritsada Sriphaew
5 million terabytes created in 2002
UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
www.sims.berkeley.edu/research/projects/how-much-info-2003/
US produces ~40% of new stored data worldwide
7
1018
Data Warehousing and Data Mining by Kritsada Sriphaew
Data Growth Rate
Twice as much information was created in 2002 as in 1999 (~30% growth rate)
Other growth rate estimates even higher
Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
8 Data Warehousing and Data Mining by Kritsada Sriphaew
Machine Learning / Data Mining Application areas
Science astronomy, bioinformatics, drug discovery, …
Business advertising, CRM (Customer Relationship management),
investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, …
Web: search engines, advertising, web and text mining, …
Government surveillance & anti-terror (?|), crime detection, profiling tax
cheaters, …
9 Data Warehousing and Data Mining by Kritsada Sriphaew
DM applications in 2004 (in %) 13%: Banking
9%: Direct Marketing, Fraud Detection, Scientific data analysis
8%: Bioinformatics
7%: Insurance, Medical/Pharmaceutic Applications
6%: eCommerce/Web, Telecommunications
4%: Investments/Stocks, Manufacturing, Retail, Security
Bellow: Travel, Entertainment/News, …
10 Data Warehousing and Data Mining by Kritsada Sriphaew
Data Mining for Customer Modeling
Customer Tasks:
attrition prediction (odchod zákazníků)
targeted marketing:
cross-sell, customer acquisition
credit-risk
fraud detection
Industries
banking, telecom, retail sales (maloobchodní prodej), …
11 Data Warehousing and Data Mining by Kritsada Sriphaew
Customer Attrition: Case Study
Situation: Attrition rate for mobile phone customers is around 25-30% a year!
Task:
Given customer information for the past N months,
predict who is likely to attrite next month.
Also, estimate customer value and what is the cost-
effective offer to be made to this customer.
12 Data Warehousing and Data Mining by Kritsada Sriphaew
Customer Attrition Results
Verizon Wireless built a customer data warehouse
Identified potential attriters
Developed multiple, regional models
Targeted customers with high propensity to accept the offer
Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
(Reported in 2003)
13 Data Warehousing and Data Mining by Kritsada Sriphaew
Assessing Credit Risk: Case Study
Situation: Person applies for a loan
Task: Should a bank approve the loan?
Note: People who have the best credit don’t need the loans, and people with worst credit are not likely to repay. Bank’s best customers are in the middle
14 Data Warehousing and Data Mining by Kritsada Sriphaew
Credit Risk - Results
Banks develop credit models using variety of machine learning methods.
Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan
Widely deployed in many countries
15 Data Warehousing and Data Mining by Kritsada Sriphaew
Successful e-commerce – Case Study
A person buys a book (product) at Amazon.com.
Task: Recommend other books (products) this person is likely to buy
Amazon does clustering based on books bought:
customers who bought “Advances in Knowledge Discovery and Data Mining”, also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”
Recommendation program is quite successful
16 Data Warehousing and Data Mining by Kritsada Sriphaew
Unsuccessful e-commerce case study (KDD-Cup 2000)
Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
Q: Characterize visitors who spend more than $12 on an average order at the site
Dataset of 3,465 purchases, 1,831 customers
Very interesting analysis by Cup participants
thousands of hours - $X,000,000 (Millions) of consulting
Total sales -- $Y,000
Obituary: Gazelle.com out of business, Aug 2000
17 Data Warehousing and Data Mining by Kritsada Sriphaew
Genomic Microarrays – Case Study
Given microarray data for a number of samples (patients), can we
Accurately diagnose the disease?
Predict outcome for given treatment?
Recommend best treatment?
18 Data Warehousing and Data Mining by Kritsada Sriphaew
Example: ALL/AML data 38 training cases, 34 test, ~ 7,000 genes
2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
Use train data to build diagnostic model
ALL AML
Results on test data: 33/34 correct, 1 error may be mislabeled
19 Data Warehousing and Data Mining by Kritsada Sriphaew
Security and Fraud Detection - Case Study
Credit Card Fraud Detection Detection of Money laundering FAIS (US Treasury)
Securities Fraud NASDAQ KDD system
Phone fraud AT&T, Bell Atlantic, British Telecom/MCI
Bio-terrorism detection at Salt Lake Olympics 2002
20 Data Warehousing and Data Mining by Kritsada Sriphaew
Data Mining and Privacy
21
in 2006, NSA (National Security Agency) was reported to be mining years of call info, to identify terrorism networks
Social network analysis has a potential to find networks
Invasion of privacy – do you mind if your call information is in a gov database?
What if NSA program finds one real suspect for 1,000 false leads ? 1,000,000 false leads?
Data Warehousing and Data Mining by Kritsada Sriphaew
Case Study: Intrusion Detection Systems
23
Detection Approach Misuse Detection
▪ Based on known malicious patterns (signatures)
Anomaly Detection ▪ Based on deviations from established
normal patterns (profiles)
Data Source Network-based (NIDS)
▪ Network traffic Host-based (HIDS)
▪ Audit trails
Data Warehousing and Data Mining by Kritsada Sriphaew
Case Study: Intrusion Detection Systems
24
Data Mining Usage
Signature extraction
Rule matching
Alarm data analysis Reduce false alarms
Eliminate redundant alarms
Feature selection
Training Data cleaning
Data Warehousing and Data Mining by Kritsada Sriphaew
Case Study: Intrusion Detection Systems
25
Behavioral Feature for Network Anomaly Detection
Training set = normal network traffic
Feature provides semantics of the values of data
Feature selection is important
Proposed method:
▪ Feature extraction based on protocol behavior
▪ Many Attacks uses protocol improperly ▪ Ping of Death
▪ SYN Flood
▪ Teardrop
Data Warehousing and Data Mining by Kritsada Sriphaew
Case Study: Intrusion Detection Systems
26
tcpPerFIN
tcpPerPSH
WWW
tcpPerPSH
SMTP
FTP
meanIAT
tcpPerSYN
telnet
meanipTLen
tcpPerPSH
SMTP
… … …
>0.01
<=0.01
<=0.4
>0.4
<=0.79
>0.79
>546773
>546773
<=0.03
>0.03
>73
<=73
>0.79
Decision Tree (only small part)
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
27
Company Profile MetLife, Inc. is a leading provider of insurance and other financial services to millions of individual and institutional customers throughout the United States. Established in 1863, Metlife now has offices all over the US and the world, and offers ten different types
of insurances and financial services.
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
28
Industry: Insurance and Financial Services How they use Data Mining: Fraud Detection
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
29
• Project first started in 2001
• MetLife set out to build $50 Million relational database
• This project would consolidate data from 30 business world wide.
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
30
• Around same time, it was reported that $30 Million of insurance money went to fraudulent claims.
• MetLife teamed up with Computer Sciences Corporation (CSC) to o License their data mining tool (called Fraud
Investigator),
o Develop @First, "an early fraud detection system"
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
31
• By 2003, MetLife's data mining operation was in full swing.
• They were able to detect fraud in a fraction of the time it would take in man hours
• One example is detecting rate evasion
Data Warehousing and Data Mining by Kritsada Sriphaew
CASE STUDY: METLIFE
32
• Rate evasion is lying about where you live to pay lower premiums.
• Metlife used data mining to detect rate evasion by matching ZIP codes with phone numbers to see if the cities matched.
• In 2.5 hours, Metlife found 107 fraudulent claims, all linked to a rate-evasion ring in NY and Massachusetts.
Data Warehousing and Data Mining by Kritsada Sriphaew