Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004

1

Data Mining: Some Real-World Experiences

Alan Walker, VP, Sabre Labs

April 12th, 2004

2

Overview

• What are the challenges?
  – Missing and/or noisy data
  – Joining data from multiple data sources
  – Very large data sets
  – Designing and testing new models
  – Explaining the results of your data mining exercise to decision makers
• Case studies
  – Employee fraud detection
  – Web page analysis
  – Customer choice models
• Conclusions
• Questions to think about

3

Employee Fraud Detection

• Liquor sales
  – Many airlines give away drinks in first class, but charge for them in economy
  – Dishonest staff could sell in economy and report drinks given away in first class, then pocket the revenue
• Requirements
  – Formal and objective method to flag an individual as a candidate for further investigation

http://homepages.luc.edu/~jstopek/coldone.jpg

http://www.antioch-college.edu/community/livermore/Images/Dollar.GIF




4


• Choosing a measure
  – Total Revenue Per Passenger (TRPP)
  – Total revenue is not a good measure, as it depends on the number of passengers on the aircraft
• Data quality
  – Revenue amounts come from handwritten reports that are later entered into a computer system
  – Noisy data
  – Missing values
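A minimal sketch of the measure, assuming each report carries a revenue figure and a passenger count, either of which may be missing or implausible. The function name `trpp`, the validity checks, and the sample reports are hypothetical and only illustrate the idea of normalizing revenue by passenger count.

```python
from typing import Optional

def trpp(total_revenue: Optional[float], passengers: Optional[int]) -> Optional[float]:
    """Total Revenue Per Passenger, or None when the report is unusable."""
    if total_revenue is None or passengers is None:
        return None  # missing value in the handwritten report
    if passengers <= 0 or total_revenue < 0:
        return None  # noisy value: cannot be a real flight report
    return total_revenue / passengers

# Hypothetical reports: (revenue in dollars, passengers on the flight)
reports = [(120.0, 100), (None, 90), (45.0, 0), (60.0, 150)]
values = [trpp(revenue, pax) for revenue, pax in reports]
usable = [v for v in values if v is not None]
print(values)
print(f"{len(usable)} of {len(values)} reports are usable")
```

Dropping unusable reports is only one option; whether to discard or impute them is a judgment call that depends on how much data is lost.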

5


• Additional variables
  – Data varies by time of day (see below)
  – May also vary by day of week or on holidays
  – Need to ensure that we’ve gathered other variables that may be correlated with variance in sales

[Chart: Number of Flights (y-axis, 0 to 800,000) by time period (Morning, Mid Day, Evening, Late Night, All); legend series: 0.0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8+]

6


Rank the TRPP values for each Day/Time Period into deciles.

[Figure: distribution of TRPP ($) split into ten deciles, labeled 0 to 9, each containing 10% of the observations]
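The ranking step can be sketched as follows. This is an illustration only: the record layout `(period, employee, trpp)`, the function name `decile_ranks`, and numbering the deciles 0 (lowest TRPP) to 9 (highest) are all assumptions, and ties are broken arbitrarily by input order.

```python
from collections import defaultdict

def decile_ranks(observations):
    """Return one decile (0 = lowest TRPP, 9 = highest) per observation.

    observations: list of (period, employee, trpp) tuples. Ranking is done
    separately inside each day/time period, so a slow late-night flight is
    only compared with other late-night flights.
    """
    by_period = defaultdict(list)
    for idx, (period, _employee, value) in enumerate(observations):
        by_period[period].append((value, idx))

    deciles = [None] * len(observations)
    for items in by_period.values():
        items.sort()  # ascending TRPP; ties fall back to input order
        n = len(items)
        for rank, (_value, idx) in enumerate(items):
            deciles[idx] = rank * 10 // n
    return deciles

# Hypothetical TRPP values for one day/time period
sample = [("Mon/Morning", f"emp{i}", v)
          for i, v in enumerate([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 1.0, 0.6])]
print(decile_ranks(sample))
```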

7


• Binomial approach
  – Probability for a single day’s sales
  – P(TRPP in decile 10 for one day) = 0.1
• What about two days in a row?
  – Like tossing two heads in a row
  – P(TRPP in decile 10 for two consecutive days) = $(0.10)^2 = 0.01$
• Why use ranks?
  – Not affected by outliers

8


• Variables
  – n = number of observations for an employee
  – x = number of 10th decile rankings
• Use the binomial distribution to compute probabilities

P(x or more lowest decile rankings) = $\sum_{i=x}^{n} \binom{n}{i} (0.1)^{i} (0.9)^{n-i}$

Where: $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$

9


• Example
  – An employee reports 100 TRPP values
  – There are 30 observations in the lowest decile
  – P(30 or more in lowest) = $2.45 \times 10^{-8}$
• How probable is this?
  – Texas Lotto probability is $3.87 \times 10^{-8}$
  – Lotto’s advantages
    • You get more money
    • You don’t go to jail
• Results
  – This work was successful in identifying people for investigation
  – But, as we stressed earlier, the results don’t prove or disprove guilt
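The example figures can be checked with a few lines of Python. This is a sketch rather than the original implementation: `tail_probability` is a hypothetical name for the tail sum P(x or more lowest decile rankings), and the lottery comparison assumes a pick-6-of-54 game, which is an assumption that happens to reproduce the quoted odds.

```python
from math import comb

def tail_probability(n: int, x: int, p: float = 0.1) -> float:
    """P(at least x of n independent observations land in one decile)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

# 30 or more lowest-decile rankings out of 100 reported TRPP values
p_employee = tail_probability(100, 30)

# Assumption: jackpot odds for a pick-6-of-54 lottery
p_lotto = 1 / comb(54, 6)

print(f"employee: {p_employee:.2e}")
print(f"lotto:    {p_lotto:.2e}")
```

The calculation treats each observation as an independent draw with a 10% chance of landing in the lowest decile. That is the same assumption as the coin-toss analogy, and it is one reason a flag is a prompt for investigation, not a verdict.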

10

Web page analysis

• How do users interact with a large website?
  – What paths lead to sales?
  – What paths lead to abandonment?
  – What users are actually robots pounding your system?
• What we did
  – Gathered page hit information from data warehouse
  – Built a version of the Apriori algorithm to find sequential patterns
  – Iterative process to discover useful, actionable results

11

Web page analysis

• Data collection
  – We were fortunate
    • Travelocity’s web site went live in March 1996
    • The data warehouse started at the same time
    • Initially on Oracle, migrated to Teradata 1Q00
    • All the page hit data we needed was stored in Teradata, along with a lot of other data about user sessions
  – Teradata is a shared-nothing database system, optimized for warehouse and VLDB applications
    • Tables are partitioned by hash values
    • Extensive parallel join facilities

http://www.teradata.com/

12

Web page analysis

• Consider a set of three sample sessions
  – S1: A, B, C, D, E
  – S2: A, B, X
  – S3: A, B, C, Q
• Some sequential patterns
  – A → B, confidence=100%
  – A,B → C, confidence=67%
  – A,B,C → D, confidence=50% (A,B,C occurs in S1 and S3; D follows only in S1)

13

Web page analysis

• Confidence
  – A,B → C, confidence=67%
  – If A,B occurs, then C follows, with 67% chance
  – More formally, confidence = P(C | A,B)
• Support
  – Number of cases in which this sequence occurs
  – Used to eliminate high probability sequences that only occurred once or twice
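Both measures can be computed directly from the three sample sessions S1 to S3. The sketch below assumes a pattern must appear as a contiguous run of pages and counts each session at most once; the function names are hypothetical. Under the definition confidence = P(C | A,B), the rule A,B,C → D comes out at 1/2, because only S1 and S3 contain A,B,C and only S1 continues with D.

```python
def count_sessions(sessions, pattern):
    """Number of sessions that contain `pattern` as a contiguous run of pages."""
    k = len(pattern)
    target = tuple(pattern)
    return sum(
        any(tuple(s[i:i + k]) == target for i in range(len(s) - k + 1))
        for s in sessions
    )

def rule_stats(sessions, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent -> consequent."""
    support = count_sessions(sessions, list(antecedent) + [consequent])
    base = count_sessions(sessions, antecedent)
    confidence = support / base if base else 0.0
    return confidence, support

sessions = [
    ["A", "B", "C", "D", "E"],  # S1
    ["A", "B", "X"],            # S2
    ["A", "B", "C", "Q"],       # S3
]

for antecedent, consequent in [(["A"], "B"), (["A", "B"], "C"), (["A", "B", "C"], "D")]:
    conf, supp = rule_stats(sessions, antecedent, consequent)
    print(f"{','.join(antecedent)} -> {consequent}: conf={conf:.2f}; supp={supp}")
```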

14

Web page analysis

• SPuD (Sequential Pattern Discoverer)
  – About 1,000 lines of C++, using STL
  – Ports to any platform
  – Command line, reads stdin, writes stdout
  – Variant of the Apriori Algorithm
• Command line options
  – Minimum confidence & support (-c, -s)
  – Min / Max pattern length (-l, -m)
  – Include / Exclude pages (-i, -x)
  – Help with options (-h, -?)
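To make the idea concrete, here is a small Python sketch of an Apriori-style, level-wise search for sequential patterns. It is not SPuD's code: the function `discover`, its parameters (which loosely mirror the confidence, support and length options), and the choice to treat patterns as contiguous page runs counted once per session are all assumptions made for illustration.

```python
from collections import Counter

def discover(sessions, min_conf=0.5, min_supp=2, min_len=2, max_len=4):
    """Find rules (antecedent, next_page, confidence, support).

    Patterns are grown one page at a time. A run of length k is only counted
    if its length k-1 prefix met min_supp (the Apriori pruning step).
    """
    rules = []
    frequent = {}  # runs of the previous length that met min_supp
    for k in range(1, max_len + 1):
        counts = Counter()
        for s in sessions:
            seen = set()
            for i in range(len(s) - k + 1):
                run = tuple(s[i:i + k])
                if k == 1 or run[:-1] in frequent:
                    seen.add(run)
            counts.update(seen)  # each run counts once per session
        current = {run: c for run, c in counts.items() if c >= min_supp}
        if k >= max(min_len, 2):  # a rule needs at least one antecedent page
            for run, supp in sorted(current.items()):
                conf = supp / frequent[run[:-1]]
                if conf >= min_conf:
                    rules.append((run[:-1], run[-1], conf, supp))
        if not current:
            break
        frequent = current
    return rules

sessions = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "X"],
    ["A", "B", "C", "Q"],
]
rules = discover(sessions, min_conf=0.5, min_supp=2)
for antecedent, page, conf, supp in rules:
    print(f"{','.join(antecedent)} -> {page}; conf={conf:.2f}; supp={supp}")
```

Raising the minimum support shrinks the search quickly, because any run whose prefix is infrequent is never counted.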

15

Web page analysis

• Performance goals
  – ONE MILLION RECORDS!!!
• Test results
  – 62 seconds elapsed
  – 500 MHz Pentium
  – 256 MB RAM
• Observation
  – The textbook examples are all small datasets
  – One million records is not a large dataset in practice

16

Web page analysis

2827,2827,2827 → 2827; conf=0.68; supp=0.10
3157,3158,3163 → 3163; conf=0.71; supp=0.11
3157,3157,3157 → 3157; conf=0.73; supp=0.23
2841,2841,2841 → 2841; conf=0.99; supp=0.29

These rules show repetition. For example, if a user looks at page 2841 three times in a row, we’re 99% sure they’ll hit it again.

Some more example rules:

6016 → 3162; conf=0.90; supp=0.12
3162 → 3157; conf=0.62; supp=0.35
2432 → 2827; conf=0.61; supp=0.34
3157,3158 → 3163; conf=0.55; supp=0.16

There is still the challenge of deciding what this information means. Does spinning on the same page mean the user can’t find what they want? Is it a web crawler gathering data? Or something else?

17

Web page analysis

• Challenges
  – The Apriori algorithm generates a lot of patterns
    • Most are obvious, such as the path people follow as they fill in personal information and pay for a reservation
    • We added filters to generate only patterns that include or exclude a certain page, plus limits on min/max pattern length
• Additional variables
  – Things we know about the session
    • Look vs. book
    • What did they book (air / car / hotel / other)?
  – Things about the user
    • Registered user
    • Frequent buyer

18

Web page analysis

• Concept hierarchy
  – Too many distinct values of page ID for any categorical data analysis
  – Need to build a hierarchy
  – This is harder than it looks: every business person will come up with a different classification

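[Figure: example concept hierarchy with Travelocity at the root, branches including Air and Cruise, Air split into Air_book and Air_shop, and page IDs (2123, 2124, 3123, 2234, 2235, 5770, 5771) at the leaves]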
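One simple way to apply such a hierarchy is a child-to-parent lookup table that rolls each page ID up to a coarser category before mining. The sketch below is illustrative only: the category names come from the example hierarchy, but which page ID belongs under which category is a hypothetical assignment, as is the helper name `generalize`.

```python
# Child -> parent links. The page-to-category assignments are hypothetical.
PARENT = {
    "2123": "Air_book", "2124": "Air_book",
    "3123": "Air_shop", "2234": "Air_shop", "2235": "Air_shop",
    "5770": "Cruise", "5771": "Cruise",
    "Air_book": "Air", "Air_shop": "Air",
    "Air": "Travelocity", "Cruise": "Travelocity",
}

def generalize(node: str, levels_up: int = 1) -> str:
    """Climb `levels_up` steps toward the root; stop early at the root."""
    for _ in range(levels_up):
        node = PARENT.get(node, node)
    return node

session = ["2123", "2124", "5770"]
print([generalize(page) for page in session])
print([generalize(page, levels_up=2) for page in session])
```

Running the pattern search on generalized sessions reduces the number of distinct symbols, at the cost of hiding page-level detail.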

19

Customer choice modeling

• Predicting probabilities
  – Linear regression finds $y \in (-\infty, \infty)$
    $y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
  – This won’t work for probability, since $P(\text{event}) \in [0, 1]$
  – A non-linear transform maps $y \to p$
    $p = e^{y} / (1 + e^{y})$
    $y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
  – This transform is called a logistic function
  – Alternatively…
    $\log_e[p / (1 - p)] = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
• Based on logit-choice [Ben-Akiva & Lerman, 1985]
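The two forms of the transform are inverses of each other, which a short Python sketch can demonstrate. The function names and the coefficient values are hypothetical; the error term ε is omitted because the sketch only evaluates a fitted model.

```python
import math

def logistic(y: float) -> float:
    """Map any real y to a probability p = e^y / (1 + e^y)."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))  # same value, avoids overflow for large y
    e = math.exp(y)
    return e / (1.0 + e)

def logit(p: float) -> float:
    """Inverse transform: log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def linear_predictor(coeffs, xs):
    """y = c0 + c1*x1 + ... + cn*xn, with coeffs = [c0, c1, ..., cn]."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))

# Hypothetical coefficients and inputs
y = linear_predictor([-1.0, 0.8, -0.02], [1.0, 50.0])
p = logistic(y)
print(f"y={y:.2f}, p={p:.3f}, logit(p)={logit(p):.2f}")
```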

20


• Derived from logistic regression
  – Equivalent to logistic regression when there are only two choices
  – Forecast the probability a customer will choose an item from the choice set
  – The utility of each choice i is denoted $u_i$
  – Each $u_i$ is a linear combination of indicator variables and/or continuous variables, such as price

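$P(\text{Buy}_k) = \frac{e^{u_k}}{\sum_{i=1}^{n} e^{u_i}}$

$u_k = \beta_{k,1} x_{k,1} + \dots + \beta_{k,m} x_{k,m}$

$x_{k,1} = 1$ for a non-stop flight, 0 otherwise

$x_{k,2} = 1$ for a connecting flight, 0 otherwise

$x_{k,m}$ = Price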
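A minimal sketch of how choice probabilities follow from utilities, assuming the standard multinomial logit form in which each option's share is its exponentiated utility divided by the sum over the choice set. The coefficient values and itineraries are made up, and using one shared coefficient vector for all options is a simplifying assumption.

```python
import math

def utility(betas, x):
    """Linear utility: sum of coefficient * variable."""
    return sum(b * v for b, v in zip(betas, x))

def choice_probabilities(utilities):
    """P(choose k) = exp(u_k) / sum_i exp(u_i), computed stably."""
    m = max(utilities)  # subtracting the max leaves the ratios unchanged
    weights = [math.exp(u - m) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients: [non-stop indicator, connecting indicator, price]
betas = [1.0, 0.0, -0.01]
itineraries = [
    [1, 0, 300.0],  # non-stop, $300
    [0, 1, 220.0],  # connecting, $220
    [0, 1, 250.0],  # connecting, $250
]
utilities = [utility(betas, x) for x in itineraries]
probs = choice_probabilities(utilities)
for x, p in zip(itineraries, probs):
    print(x, f"{p:.3f}")
```

With only two options the formula reduces to the logistic function of the utility difference, which is the sense in which the model matches logistic regression for two choices.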

21


• Choice model is used to determine
  – What will someone pay for a non-stop vs. connecting flight?
  – Does this vary by airline?
  – Does this vary by time-of-day or day-of-week?
• What is it good for?
  – Price determination
  – Dynamic discounts and packages
• Other methods for categorical data
  – Decision-tree induction (e.g., C4.5)
  – Neural networks can forecast $y \in [0, 1]$, but don’t extend easily to create a market share model

22


One use is to model the probability that a user will choose one of the many itineraries displayed on the web site.

We can look at the price, the type of itinerary (Nonstop, 1 Stop, etc.), and the time of day to estimate the probability of selling each option.

23


• Implementation
  – We use SAS for data preprocessing and model calibration.
    • PROC MDC (multinomial discrete choice) in the Econometrics and Time Series (ETS) package
    • SAS is also very good with large datasets
  – Although not a problem here, data collection is often a challenge for customer choice modeling
• Results
  – We’ve been using logistic regression and similar models for many years
  – Can sometimes be hard to explain as few people understand the statistics
  – The upside is that the model predicts probabilities and share
  – Also combines continuous variables (price) with discrete (service type)

24

Conclusions

• Data mining is a process, not a product
  – Data collection and preparation is an involved process
  – Customized techniques are still needed
  – Large datasets are typical
• How to be a data miner?
  – Learn tools for large scale data manipulation, such as SQL, SAS, etc.
  – The math is important: even if the tool has a GUI and is simple to use, you have to understand the results and limitations
  – Be prepared to spend significant time presenting and explaining what you’ve discovered. Data mining is an iterative process

25

Questions to think about…

• Employee fraud detection
  – How could an employee be consistently in the bottom 10% and not be committing fraud?
  – Suppose you were a crooked employee. How could you beat the system?
• Web page analysis
  – What other data mining techniques could you use to analyze this data?
  – How could I detect a web crawler? How does a crawler differ from a real person?
• Customer choice modeling
  – What other data mining techniques could you use to analyze this data?
  – What other variables might you add to the model to explain choice?
  – What other factors might explain abandonment at a web site? Which of these can you measure?
