Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
1
Data Mining: Some Real-World Experiences
Alan Walker, VP, Sabre Labs
April 12th, 2004
2
Overview
• What are the challenges?
– Missing and/or noisy data
– Joining data from multiple data sources
– Very large data sets
– Designing and testing new models
– Explaining the results of your data mining exercise to decision makers
• Case studies
– Employee fraud detection
– Web page analysis
– Customer choice models
• Conclusions
• Questions to think about
3
Employee Fraud Detection
• Liquor sales
– Many airlines give away drinks in first
class, but charge for them in economy
– Dishonest staff could sell in economy
and report drinks given away in first
class, then pocket the revenue
• Requirements
– Formal and objective method to flag an
individual as a candidate for further
investigation
4
Employee Fraud Detection
• Choosing a measure
– Total Revenue Per Passenger (TRPP)
– Total revenue is not a good measure, as it depends on the number of passengers on the aircraft
• Data quality
– Revenue amounts come from handwritten reports that are later entered into a computer system
– Noisy data
– Missing values
5
Employee Fraud Detection
• Additional variables
– Data varies by time of day (see below)
– May also vary by day of week or on holidays
– Need to ensure that we’ve gathered other variables that may be correlated with
variance in sales
[Figure: bar chart of Number of Flights (0 to 800,000) by time period (Morning, Mid Day, Evening, Late Night, All), with legend bins 0.0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8 and 0.8+]
6
Employee Fraud Detection
Rank the TRPP values for each Day/Time Period into deciles.
[Figure: TRPP values ($) ordered and cut into ten deciles, labeled 0 to 9, each holding 10% of the observations]
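The ranking step can be sketched in Python as below. This is a minimal illustration, not the code used in the study: the record layout `(employee, period, trpp)`, the function names, and the choice to number deciles 1 to 10 with 10 as the lowest TRPP are all assumptions made for the example.

```python
from collections import defaultdict

def decile_ranks(values):
    """Return one decile label per value: 1 = highest TRPP, 10 = lowest.

    Assumed numbering convention; ties are broken by input order.
    Needs at least ten values for every decile to be populated.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, idx in enumerate(order):
        ranks[idx] = position * 10 // len(values) + 1
    return ranks

def rank_by_period(records):
    """records: list of (employee, period, trpp) tuples (hypothetical layout).

    TRPP is ranked only against other observations from the same period,
    so a quiet morning flight is not compared with a busy evening one.
    """
    groups = defaultdict(list)
    for i, (_, period, _) in enumerate(records):
        groups[period].append(i)
    out = [None] * len(records)
    for period, idxs in groups.items():
        deciles = decile_ranks([records[i][2] for i in idxs])
        for i, d in zip(idxs, deciles):
            out[i] = (records[i][0], period, d)
    return out
```

Ranking within each period is what removes the time-of-day effect: the decile only says where a report falls relative to comparable flights.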
7
Employee Fraud Detection
• Binomial Approach
– Probability for a single day’s sales
– P(TRPP in decile 10 for one day) = 0.1
• What about two days in a row?
– Like tossing two heads in a row
– P(TRPP in decile 10 for two consecutive days)
– (0.10)^2 = 0.01
• Why use ranks?
– Not affected by outliers
8
Employee Fraud Detection
• Variables
– n = number of observations for an employee
– x = number of 10th decile rankings
• Use the binomial distribution to compute probabilities
$P(x \text{ or more lowest decile rankings}) = \sum_{i=x}^{n} \binom{n}{i} (0.1)^{i} (0.9)^{n-i}$
Where:
$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$
9
Employee Fraud Detection
• Example
– An employee reports 100 TRPP values
– There are 30 observations in lowest decile
– P(30 or more in lowest) = 2.45 x 10^-8
• How probable is this?
– Texas Lotto probability is 3.87 x 10^-8
– Lotto’s advantages
  • You get more money
  • You don’t go to jail
• Results
– This work was successful in identifying people for investigation
– But, as we stressed earlier, the results don’t prove or disprove guilt
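The tail probability in the example can be checked with a few lines of Python. This is a sketch using only the standard library; the function name `prob_at_least` is made up for the illustration, and it assumes each observation independently falls in the lowest decile with probability 0.1.

```python
from math import comb

def prob_at_least(x, n, p=0.1):
    """P(x or more of n observations fall in the lowest decile).

    Sums the binomial probabilities for i = x .. n, assuming each
    observation lands in the lowest decile independently with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))

# Two consecutive days in the lowest decile: 0.1 * 0.1
p_two_days = prob_at_least(2, 2)

# The slide's example: 30 or more lowest-decile rankings out of 100 reports
p_example = prob_at_least(30, 100)
print(f"two days in a row: {p_two_days:.2f}")
print(f"30 or more of 100: {p_example:.3g}")
```

The second value comes out between 2.4 x 10^-8 and 2.5 x 10^-8, in line with the figure quoted on the slide.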
10
Web page analysis
• How do users interact with a large website?
– What paths lead to sales?
– What paths lead to abandonment?
– What users are actually robots pounding your system?
• What we did
– Gathered page hit information from data warehouse
– Built a version of the Apriori algorithm to find sequential patterns
– Iterative process to discover useful, actionable results
11
Web page analysis
• Data collection
– We were fortunate
• Travelocity’s web site went live in March 1996
• The data warehouse started at the same time
• Initially on Oracle, migrated to Teradata 1Q00
• All the page hit data we needed was stored in
Teradata, along with a lot of other data about
user sessions
– Teradata is a shared-nothing database system,
optimized for warehouse and VLDB
applications
• Tables are partitioned by hash values
• Extensive parallel join facilities
12
Web page analysis
• Consider a set of three sample sessions
– S1: A, B, C, D, E
– S2: A, B, X
– S3: A, B, C, Q
• Some sequential patterns
– A → B, confidence=100%
– A,B → C, confidence=67%
– A,B,C → D, confidence=50%
13
Web page analysis
• Confidence
– A,B → C, confidence=67%
– If A,B occurs, then C follows, with 67% chance
– More formally, confidence = P(C | A,B)
• Support
– Number of cases in which this sequence occurs
– Used to eliminate high probability sequences that only occurred once or twice
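The two measures can be computed directly on the three sample sessions. The sketch below makes two assumptions that the slides do not spell out: a pattern is treated as a contiguous run of pages, and each session is counted at most once. The function names are invented for the example.

```python
def contains_run(session, pattern):
    """True if `pattern` appears in `session` as a contiguous run of pages."""
    n = len(pattern)
    return any(session[i:i + n] == pattern for i in range(len(session) - n + 1))

def support_and_confidence(sessions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = number of sessions containing antecedent followed by consequent
    confidence = support / number of sessions containing the antecedent,
                 i.e. an estimate of P(consequent | antecedent)
    """
    full = antecedent + [consequent]
    n_antecedent = sum(contains_run(s, antecedent) for s in sessions)
    n_full = sum(contains_run(s, full) for s in sessions)
    confidence = n_full / n_antecedent if n_antecedent else 0.0
    return n_full, confidence

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
print(support_and_confidence(sessions, ["A"], "B"))       # (3, 1.0)
print(support_and_confidence(sessions, ["A", "B"], "C"))  # support 2, confidence 2/3
```

A rule with high confidence but a support of 1 or 2 rests on very little evidence, which is why both thresholds are applied together.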
14
Web page analysis
• SPuD (Sequential Pattern Discoverer)
– About 1,000 lines of C++, using STL
– Ports to any platform
– Command line, reads stdin, writes stdout
– Variant of the Apriori Algorithm
• Command line options
– Minimum confidence & support (-c, -s)
– Min / Max pattern length (-l, -m)
– Include / Exclude pages (-i, -x)
– Help with options (-h, -?)
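To give a feel for what such a tool does, here is a small Apriori-style sketch in Python. It is not SPuD: the function names, the parameters, the use of session counts for support, and the restriction to contiguous page runs are all assumptions made for illustration. The level-wise idea is that a run of length k is only counted if its length k-1 prefix already met the minimum support.

```python
from collections import Counter

def count_runs(sessions, length, allowed_prefixes):
    """Count, per session, the distinct contiguous runs of the given length.

    Runs whose prefix is not in `allowed_prefixes` are skipped (the pruning step).
    """
    counts = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session) - length + 1):
            run = tuple(session[i:i + length])
            if allowed_prefixes is None or run[:-1] in allowed_prefixes:
                seen.add(run)
        counts.update(seen)  # each session contributes at most once per run
    return counts

def mine_patterns(sessions, min_conf=0.5, min_supp=2, min_len=2, max_len=4):
    """Return sorted rules (antecedent, next_page, confidence, support)."""
    rules = []
    prev = {r: c for r, c in count_runs(sessions, 1, None).items() if c >= min_supp}
    for length in range(2, max_len + 1):
        counts = count_runs(sessions, length, prev.keys())
        current = {r: c for r, c in counts.items() if c >= min_supp}
        if length >= min_len:
            for run, supp in current.items():
                conf = supp / prev[run[:-1]]
                if conf >= min_conf:
                    rules.append((run[:-1], run[-1], conf, supp))
        if not current:
            break  # no frequent runs left to extend
        prev = current
    return sorted(rules)

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
for antecedent, page, conf, supp in mine_patterns(sessions):
    print(",".join(antecedent), "->", page, f"conf={conf:.2f}", f"supp={supp}")
```

The keyword arguments loosely mirror the kinds of options listed above (minimum confidence and support, minimum and maximum pattern length); include/exclude page filters are left out to keep the sketch short.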
15
Web page analysis
• Performance goals
– ONE MILLION RECORDS!!!
• Test results
– 62 seconds elapsed
– 500 MHz Pentium
– 256 MB RAM
• Observation
– The textbook examples are all small
datasets
– One million records is not a large
dataset in practice
16
Web page analysis
These rules show repetition. For example, if a user looks at page 2841 three times in a row, we’re 99% sure they’ll hit it again
2827,2827,2827 → 2827; conf=0.68; supp=0.10
3157,3158,3163 → 3163; conf=0.71; supp=0.11
3157,3157,3157 → 3157; conf=0.73; supp=0.23
2841,2841,2841 → 2841; conf=0.99; supp=0.29
Some more example rules
6016 → 3162; conf=0.90; supp=0.12
3162 → 3157; conf=0.62; supp=0.35
2432 → 2827; conf=0.61; supp=0.34
3157,3158 → 3163; conf=0.55; supp=0.16
There is still the challenge of deciding what this information means. Does spinning on the same page mean the user can’t find what they want? Is it a web crawler gathering data? Or something else?
17
Web page analysis
• Challenges
– The Apriori algorithm generates a lot of patterns
  • Most are obvious, such as the path people follow as they fill in personal information and pay for a reservation
  • We added filters to generate only patterns that include or exclude a certain page, and to bound the min/max pattern length
• Additional variables
– Things we know about the session
  • Look vs. book
  • What did they book (air / car / hotel / other)?
– Things about the user
  • Registered user
  • Frequent buyer
18
Web page analysis
• Concept hierarchy
– Too many distinct values of page ID for any categorical data analysis
– Need to build a hierarchy
– This is harder than it looks: every business person will come up with a different classification
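[Figure: concept hierarchy with Travelocity at the root and children Air and Cruise; Air splits into Air_book and Air_shop; the leaves are page IDs (2123, 2124, 3123, 2234, 2235, 5770, 5771)]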
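One simple way to represent such a hierarchy is a child-to-parent map that lets each page ID be rolled up to a coarser category before analysis. In the sketch below the category names come from the slide, but which page ID sits under which category is invented purely for illustration, as are the function names.

```python
# Child -> parent links. Category names are from the slide; the assignment of
# individual page IDs to categories is HYPOTHETICAL, for illustration only.
PARENT = {
    "Air": "Travelocity",
    "Cruise": "Travelocity",
    "Air_book": "Air",
    "Air_shop": "Air",
    2123: "Air_book",
    2124: "Air_book",
    3123: "Air_shop",
    2234: "Air_shop",
    2235: "Air_shop",
    5770: "Cruise",
    5771: "Cruise",
}

def ancestors(node):
    """Return the chain of parents from `node` up to the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def generalize(page_id, level):
    """Map a page ID to its ancestor at `level` (0 = root, 1 = product, ...).

    Pages with a shallower branch stop at their deepest category; unknown
    pages are returned unchanged.
    """
    chain = list(reversed(ancestors(page_id)))  # root first
    if not chain:
        return page_id
    return chain[min(level, len(chain) - 1)]

session = [2123, 3123, 5770]
print([generalize(p, 1) for p in session])  # ['Air', 'Air', 'Cruise']
print([generalize(p, 2) for p in session])  # ['Air_book', 'Air_shop', 'Cruise']
```

Rolling sessions up to a level with a handful of categories makes categorical analysis feasible, at the cost of committing to one person's classification.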
19
Customer choice modeling
• Predicting probabilities
– Linear regression finds $y \in (-\infty, \infty)$
$y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
– This won’t work for probability, since $P(\text{event}) \in [0, 1]$
– A non-linear transform maps $y \to p$
$p = e^{y} / (1 + e^{y})$
$y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
– This transform is called a logistic function
– Alternatively…
$\log_e[p / (1 - p)] = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
• Based on logit-choice [Ben-Akiva & Lerman, 1985]
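The transform and its inverse are easy to check numerically. The Python sketch below uses only the standard library; the function names and the sample coefficients are made up for the illustration and are not taken from any fitted model.

```python
import math

def logistic(y):
    """Map any real y to a probability p = e^y / (1 + e^y)."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))  # same value, avoids overflow for large y
    e = math.exp(y)
    return e / (1.0 + e)

def logit(p):
    """Inverse transform: log_e[p / (1 - p)], defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_probability(coeffs, xs):
    """coeffs = [c0, c1, ..., cn], xs = [x1, ..., xn]; returns p for the linear score."""
    y = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return logistic(y)

# Hypothetical coefficients: intercept 0.5, one predictor with weight -0.01
print(predict_probability([0.5, -0.01], [120.0]))
print(logit(logistic(1.7)))  # recovers 1.7 up to rounding
```

Whatever value the linear part takes, the output stays strictly between 0 and 1 for moderate scores, which is the property plain linear regression lacks.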
20
Customer choice modeling
• Derived from the logistic regression
– Equivalent to logistic regression when there are only two choices
– Forecast the probability a customer will choose an item from the choice set
– The utility of each choice i is denoted $u_i$
– Each $u_i$ is a linear combination of indicator variables and/or continuous variables, such as price
$P(\text{Buy}_k) = \frac{e^{u_k}}{\sum_{i=1}^{n} e^{u_i}}$
$u_k = \beta_{k,1} x_{k,1} + \dots + \beta_{k,m} x_{k,m}$
$x_{k,1} = 1$ if non-stop flight, $0$ otherwise
$x_{k,2} = 1$ if connecting flight, $0$ otherwise
$x_{k,m} = \text{Price}$
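A choice-probability calculation of this kind can be sketched in Python as follows. The coefficient values and the two itineraries are hypothetical, chosen only to show the mechanics, and the function names are invented for the example; a real model would estimate the coefficients from booking data.

```python
import math

def utility(betas, xs):
    """Linear utility u = sum of beta_j * x_j for one choice."""
    return sum(b * x for b, x in zip(betas, xs))

def choice_probabilities(utilities):
    """P(choice k) = exp(u_k) / sum_i exp(u_i) over the choice set."""
    m = max(utilities)  # subtracting the max leaves the ratios unchanged
    weights = [math.exp(u - m) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients for (non-stop indicator, connecting indicator, price)
betas = [1.0, 0.2, -0.01]

# Two itineraries on the same route: a $300 non-stop and a $250 connection
itineraries = [
    [1, 0, 300.0],
    [0, 1, 250.0],
]
utilities = [utility(betas, xs) for xs in itineraries]
for xs, p in zip(itineraries, choice_probabilities(utilities)):
    print(xs, round(p, 3))
```

With only two choices, the first probability reduces to the logistic function of the utility difference, which is the sense in which the model is equivalent to logistic regression in the two-choice case.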
21
Customer choice modeling
• Choice model is used to determine
– What will someone pay for a non-stop vs connecting flight?
– Does this vary by airline?
– Does this vary by time-of-day or day-of-week?
• What is it good for?
– Price determination
– Dynamic discounts and packages
• Other methods for categorical data
– Decision-tree induction (e.g. C4.5)
– Neural networks can forecast y ∈ [0,1], but don’t extend easily to create a market share model
22
Customer choice modeling
One use is to model the probability that a user will choose one of the many itineraries displayed on the web site.
We can look at the price, the type of itinerary (Nonstop, 1 Stop, etc.), and the time of day to estimate the probability of selling each option.
23
Customer choice modeling
• Implementation
– We use SAS for data preprocessing and model calibration.
  • PROC MDC (multinomial discrete choice) in the Econometrics and Time Series (ETS) package
  • SAS is also very good with large datasets
– Although not a problem here, data collection is often a challenge for customer choice modeling
• Results
– We’ve been using logistic regression and similar models for many years
– Can sometimes be hard to explain as few people understand the statistics
– The upside is that the model predicts probabilities and share
– Also combines continuous variables (price) with discrete (service type)
24
Conclusions
• Data mining is a process, not a product
– Data collection and preparation is an involved process
– Customized techniques are still needed
– Large datasets are typical
• How to be a data miner?
– Learn tools for large scale data manipulation, such as SQL, SAS, etc.
– The math is important: even if the tool has a GUI and is simple to use, you have to understand the results and limitations
– Be prepared to spend significant time presenting and explaining what you’ve discovered. Data mining is an iterative process
25
Questions to think about…
• Employee fraud detection
– How could an employee be consistently in the bottom 10% and not be committing fraud?
– Suppose you were a crooked employee: how could you beat the system?
• Web page analysis
– What other data mining techniques could you use to analyze this data?
– How could I detect a web crawler? How does one differ from a real person?
• Customer choice modeling
– What other data mining techniques could you use to analyze this data?
– What other variables might you add to the model to explain choice?
– What other factors might explain abandonment at a web site? Which of these can you measure?