Data Mining - Massey University Introduction to Data Mining Instructor: Dr. Chris Volinsky.
Data Mining - Massey University
Introduction to Data Mining
Instructor: Dr. Chris Volinsky
Class Structure
• Class structure
  – 9AM - 11AM Lecture
  – 11AM - 12PM Computer Lab
  – 12PM Lunch
  – 1PM - 3PM Lecture
  – 3PM - 4PM Computer Lab
  – 4PM - 5PM Recap and discussion
• Exams / Grades based on
  – 40% data analysis project
  – 30% technical paper presentation
  – 30% pop quizzes and exams
    • given at the beginning of each class
    • questions about broad concepts
Class Schedule
• 24-26 July: Block 1
• 4-6 September: Block 2
• 6 September: Presentation of technical paper and data analysis proposal to class
• 19 September: Data Analysis project due (via email)
• I will be available intermittently from 1-24 August
Course Objectives
• Direct Objectives:
  – To learn data mining techniques
  – To see their use in real-world/research applications
  – To understand limitations of standard statistical techniques in data mining applications
  – To get an understanding of the methodological principles behind data mining
  – To be able to read and understand methodological work in scholarly journal papers
  – To implement & use data mining models using statistical software (specifically R)
Project #1: Data Analysis Project
• The goal of data mining is to find interesting patterns in data. You will be required to:
  – Define a scientific question of interest
  – Collect a data set (probably online)
  – Prepare the data set properly
  – Analyze the data using appropriate models
  – Write a 5-10 page report on your analysis (graphics included)
• Project proposals (1/2 - 1 page) will be due at the beginning of the second block.
• Present proposal (5 minutes) to class on 6 September.
• Finished reports will be due 19 September.
Project #2: Scientific Paper Presentation
• Select a technical paper about a data mining technique (list provided on web site).
  – Read and understand the paper
  – Write a one-page summary of the paper
  – Present the basic ideas of the paper to the class (10-15 minutes)
• Emphasis to be placed on the motivation for a particular statistical methodology within the application context:
  – What is the general objective of the paper?
  – What data are they using?
  – What statistical approach/method is proposed? Why?
  – What has been done in the past?
  – How does the paper accomplish new domain insight using that method?
• Paper presented to class on last day of lecture: 6 September
Class Web Site
• http://www.research.att.com/~volinsky/DataMining
• Lists of papers for presentation
• Announcements
• Links to other data mining course notes, R tutorials, resources
• Email:
  – [email protected]
Resources
• Data mining is a new field and, as such, does not have authoritative texts (yet).
• This class draws from many sources; the best are
  – “Principles of Data Mining” Hand, Mannila and Smyth
  – “Elements of Statistical Learning” Hastie, Tibshirani, and Friedman
  – “Interactive and Dynamic Graphics for Data Analysis” Cook and Swayne
  – Also good class notes available from other classes:
    • David Madigan, Rutgers
    • Di Cook, Iowa State
    • Padhraic Smyth, UC Irvine
    • Jiawei Han, Simon Fraser
  – (see class web site for pointers to these notes, or just Google them!)
• Also many good tutorials and books on R (or S/Splus), both online and in the library.
Course Outline
• 6 days = 12 “units”; each unit is a lecture and a lab
• Units:
  – Intro to Data Mining
  – Data exploration and visualization
  – Data Mining Concepts
  – Regression Topics
  – Classification and Supervised Learning
  – Clustering and Unsupervised Learning
  – Text Mining and Information Retrieval
  – Web Mining and Social Networks 1
  – Web Mining and Social Networks 2
  – Assorted Topics
    • Advanced Classification - Neural networks, ensemble methods
    • Association Rules
    • Telecommunications Fraud
    • Proximity models for social networks
    • Support Vector machines
What is Data Mining?
• Not well defined….
• Hand, Mannila, Smyth:
  – “data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”
• Isn’t that the same as statistics?
Data Mining Enablers
• Explosion of data
• Fast and cheap computation and storage
  – Moore’s Law: processing doubles every 18 months
  – Disk storage doubles every 9 months
  – Database technology
• Competitive pressure in business
• New, successful models
  – SVM, boosting
• Commercial products
  – SAS, SPSS, Insightful, IBM, Oracle
• Open Source products
  – Weka
  – R
[Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 with the ExaByte level marked. Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
Weka and R: both from NZ!
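The annual growth rates quoted on the chart can be converted into doubling times with one line of arithmetic. A minimal sketch in R; the helper name `doubling_months` is made up for illustration and assumes a constant growth rate.

```r
# Months needed to double at a constant annual growth rate
# (e.g. 0.587 for 58.7% per year)
doubling_months <- function(annual_growth) {
  12 * log(2) / log(1 + annual_growth)
}

moore <- doubling_months(0.587)  # the chart's Moore's Law rate
disk  <- doubling_months(1.12)   # the chart's disk TB growth rate
round(c(moore = moore, disk = disk), 1)
```

At these rates processing doubles roughly every 18 months and shipped disk capacity roughly every 11 months, in the same ballpark as the doubling times quoted in the bullets.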
Data Mining vs. Statistics
• Statistics is known for:
  – well defined hypotheses used to learn about a
  – specifically chosen population studied using
  – carefully collected data providing inferences with
  – well known properties.
• Data mining isn’t that careful. It is:
  – data driven discovery of
  – models and patterns from
  – massive and
  – observational data sets
Two Types of Data
• Experimental Data
  – Hypothesis H
  – design an experiment to test H
  – collect data, infer how likely it is that H is true
  – e.g., clinical trials in medicine
• Observational or Retrospective or Secondary Data
  – massive non-experimental data sets
    • e.g., Web logs, human genome, atmospheric simulations, etc
  – assumptions of experimental design no longer valid
  – how can we use such data to do science?
    • use the data to support model exploration, hypothesis testing
Data-Driven Discovery
• Observational data
  – cheap relative to experimental data
• Examples:
  – Transaction data archives for retail stores, airlines, etc
  – Web logs for Amazon, Google, etc
  – The human/mouse/rat genome
  – Etc., etc
• Makes sense to leverage available data
• Useful (?) information may be hidden in vast archives of data
Data Mining v. Statistics
• Traditional statistics
  – first hypothesize, then collect data, then analyze
  – often model-oriented (strong parametric models)
• Data mining:
  – few if any a priori hypotheses
  – data is usually already collected a priori
  – analysis is typically data-driven not hypothesis-driven
  – often algorithm-oriented rather than model-oriented
• Different?
  – Yes, in terms of culture, motivation: however…..
  – statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
  – Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960’s)
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
• Different fields have different views of what data mining is
Data Data Data
• It’s all about the data - where does it come from?
  – www
  – NASA
  – Business processes/transactions
  – Telecommunications and networking
  – Medical imagery
  – Government, census, demographics
  – Sensor networks, RFID tags
  – sports
Flat File or Vector Data
• Rows = objects
• Columns = measurements on objects
  – Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in data mining
• Matrix can be quite sparse
[Figure: data matrix with n rows and p columns]
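The flat-file view is easy to see in R. A minimal sketch with made-up toy data (the sizes, the Poisson counts, and the `obj`/`var` names are assumptions for illustration): rows are objects, columns are measurements, and sparsity is the fraction of zero entries.

```r
set.seed(1)                      # reproducible toy data
n <- 6; p <- 4                   # tiny here; both can be very large in practice
X <- matrix(rpois(n * p, lambda = 0.5), nrow = n, ncol = p,
            dimnames = list(paste0("obj", 1:n), paste0("var", 1:p)))

dim(X)                           # n rows (objects) by p columns (measurements)
X["obj2", ]                      # one object as a p-dimensional vector
sparsity <- mean(X == 0)         # fraction of entries that are zero
sparsity
```

For genuinely large sparse matrices a dense `matrix` wastes memory; a dedicated sparse representation would be the usual choice.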
Sparse Matrix (Text) Data
[Figure: sparse document-by-word matrix; x-axis: Word IDs (20-200), y-axis: Text Documents (50-500)]
Sequence (Web) Data
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
…
[Figure: page-category sequences for Users 1-5, e.g. 5115; 11111151511151; 77777777; 111333; 3333131113332232; …]
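Raw log records like these become sequence data once the requests are grouped by visitor. A rough sketch in R, using three records copied from the excerpt above. The column meanings are an assumption read off the records (field 1 looks like the client address, field 14 the requested page); the slide does not document the log format.

```r
# Three records copied from the log excerpt above
log_lines <- c(
  "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
  "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
  "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,"
)

# Parse the comma-separated fields; columns are named V1, V2, ... by default
weblog <- read.csv(text = log_lines, header = FALSE, strip.white = TRUE,
                   stringsAsFactors = FALSE)

# Assumed layout: V1 = client address, V14 = requested page
pages_by_client <- split(weblog$V14, weblog$V1)
pages_by_client                  # one page sequence per client
```

Mapping each page to a category number would then give coded sequences of the kind shown for Users 1-5.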
Time Series Data
[Figure: Trajectories of centroids of moving hand in video streams; x-axis: Time (0-30), y-axis: X-position (40-160)]
Image Data
Spatio Temporal Data
• http://senseable.mit.edu/nyte/nyte-globe-encounters.mov
Relational Data
• “Algorithms for estimating relative importance in networks”, S. White and P. Smyth, ACM SIGKDD, 2003.
Examples of Data Mining Successes
• Market Basket (WalMart)
• Recommender Systems (Amazon.com)
• Fraud Detection in Telecommunications (AT&T)
• Target Marketing / CRM
• Financial Markets
• DNA Microarray analysis
• Biometrics (fingerprinting, handwriting)
• Web Traffic / Blog analysis
Examples of Data Mining Successes
• Google is a company built on data mining
• PageRank mined the web to build better search
• Google as spell checker
• Google as ad placer
• Google as news aggregator
• Google as face recognizer
The Data Mining Process
• Often called KDD - Knowledge Discovery in Databases
• Analysis is just one part of the process
  – Data collection and storage
  – Data cleaning
  – Data sampling
  – Analysis
  – Decision making
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Exploratory Data Analysis
• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness, ...
• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods
• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
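These exploratory steps take only a few lines in R. A minimal sketch, assuming the built-in `iris` data as a stand-in for whatever data set is being explored:

```r
data(iris)                       # built-in data set, used only as a stand-in

summary(iris)                    # min, quartiles, mean, max per variable
sapply(iris[, 1:4], var)         # variances of the numeric columns
n_distinct <- sapply(iris, function(v) length(unique(v)))  # distinct values
n_distinct

hist(iris$Petal.Length, main = "Petal length")    # 1d histogram
plot(iris$Sepal.Length, iris$Petal.Length)        # 2d scatter plot
pairs(iris[, 1:4], col = iris$Species)            # scatter plot matrix

all_positive <- all(iris[, 1:4] > 0)   # data check: every measurement positive?
all_positive
```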
Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)
Descriptive Modeling
• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process
• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1, …, xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data
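Two of these descriptive tasks can be tried directly in base R. The sketch below is illustrative only: it uses the built-in `faithful` data, a one-dimensional kernel density estimate (a marginal, not the full joint distribution), and k-means as a simple stand-in for cluster analysis. The choice of two clusters is an assumption made by eye.

```r
data(faithful)                   # built-in: eruption length and waiting time

# Density estimation: kernel estimate of one marginal density
dens <- density(faithful$eruptions)
plot(dens, main = "Estimated density of eruption length")

# Cluster analysis: k-means on standardized variables, k = 2 assumed
set.seed(1)
km <- kmeans(scale(faithful), centers = 2, nstart = 10)
table(km$cluster)                # cluster sizes
plot(faithful, col = km$cluster) # points colored by cluster
```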
Example of Descriptive Modeling
[Figure: Anemia patients and controls; scatter plot of Red Blood Cell Hemoglobin Concentration (3.7-4.4) against Red Blood Cell Volume (3.3-4), with Anemia Group and Control Group marked]
Example of Descriptive Modeling
[Figure: the anemia patients and controls scatter plot again, alongside a second panel on the same axes labeled “EM ITERATION 25”]
WebCanvas algorithm and software - currently in new SQLServer
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Predictive Modeling
• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued
• In effect this is function approximation, learning the relationship between Y and X
• Many, many algorithms for predictive modeling in statistics and machine learning
• Often the emphasis is on predictive accuracy, less emphasis on understanding the model
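The regression/classification split can be illustrated with two standard R model functions. This is a sketch on the built-in `iris` data; the choice of response and predictor variables is arbitrary and only meant to show the two cases.

```r
data(iris)

# Regression: Y is real-valued (petal width), X is the other measurements
reg <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)
head(predict(reg))

# Classification: Y is categorical (here: is the flower virginica or not?)
iris$is_virginica <- iris$Species == "virginica"
clf <- glm(is_virginica ~ Petal.Length + Petal.Width, data = iris,
           family = binomial)
pred <- predict(clf, type = "response") > 0.5
accuracy <- mean(pred == iris$is_virginica)   # in-sample, so optimistic
accuracy
```

Accuracy measured on the training data overstates how well a model will do on new cases; a held-out test set gives a fairer estimate.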
Predictive Modeling: Fraud Detection
• Telecommunications fraud detection
  – Fraud costs telecommunication companies US$ billions per year
  – very few transactions are fraudulent, but they are costly
• Approach
  – For each transaction estimate “fraudiness”.
  – Based on known fraud AND known user behavior
  – High probability cases investigated by fraud police
• Example models:
  – anomaly detection
  – guilt by association
• Issues
  – Significant feature engineering/preprocessing
  – false alarm rate vs missed detection – what is the tradeoff?
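The false alarm versus missed detection tradeoff shows up as soon as a cutoff is placed on a score. A small simulation in R; the scores, class sizes, and cutoffs are all made up for illustration and do not come from any real fraud system.

```r
set.seed(42)
n_legit <- 10000; n_fraud <- 50  # fraud is rare (made-up proportions)

# Simulated "fraudiness" scores: fraud cases tend to score higher
score    <- c(rnorm(n_legit, mean = 0), rnorm(n_fraud, mean = 2.5))
is_fraud <- rep(c(FALSE, TRUE), c(n_legit, n_fraud))

tradeoff <- t(sapply(c(1, 2, 3), function(cutoff) {
  flagged <- score > cutoff
  c(cutoff           = cutoff,
    false_alarm_rate = mean(flagged[!is_fraud]),   # legitimate cases flagged
    missed_rate      = mean(!flagged[is_fraud]))   # fraud cases not flagged
}))
tradeoff
```

Raising the cutoff lowers the false alarm rate but lets more fraud through; where to set it depends on the cost of an investigation relative to the cost of a missed case.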
Predictive Modeling: Other Examples
• Risk Management:
  – Example: Credit card company wants to do risk management.
  – How would you do this?
• Netflix Prize
  – US$1M prize to make better movie recommendations.
  – How would you do this?
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Pattern Discovery
• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally
• Given market basket data we might discover that
  – If customers buy wine and bread then they buy cheese with probability 0.9
  – These are known as “association rules”
• Given multivariate data on astronomical objects
  – We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
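The probability attached to an association rule is just a conditional frequency, usually called the rule's confidence. A minimal sketch in base R with five made-up baskets (the helper `has` is hypothetical, written for this example):

```r
baskets <- list(
  c("wine", "bread", "cheese"),
  c("wine", "bread", "cheese", "milk"),
  c("wine", "bread"),
  c("bread", "milk"),
  c("wine", "bread", "cheese")
)

# TRUE for each basket that contains every item in `items`
has <- function(items) sapply(baskets, function(b) all(items %in% b))

lhs  <- has(c("wine", "bread"))             # baskets with wine and bread
both <- has(c("wine", "bread", "cheese"))   # ... that also contain cheese

support    <- mean(both)             # share of all baskets matching the rule
confidence <- sum(both) / sum(lhs)   # estimate of P(cheese | wine, bread)
c(support = support, confidence = confidence)
```

Here four baskets contain wine and bread and three of those also contain cheese, so the rule has confidence 0.75 and support 0.6.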
Example of Pattern Discovery
• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”
Data Mining Pitfalls
• Is data mining always necessary?
  – Just because you have a terabyte doesn’t mean you need to use it.
• Privacy concerns
  – Differ by country, industry, application, generation
• Meaningfulness of patterns unclear
  – Rhine paradox
  – Terrorism
  – DM has a lot to learn from statistics!
Rhine Paradox
• Joseph Rhine: parapsychologist who studied ESP (he was a believer!)
• He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
• He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What is the conclusion?
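A quick calculation and simulation suggest the answer: with enough subjects, perfect scores turn up by chance alone. A sketch in R, assuming every subject guesses each card at random; the number of subjects is made up.

```r
p_all_right <- 0.5^10            # chance of guessing 10 red/blue cards: 1/1024
p_all_right

set.seed(7)
n_subjects <- 100000             # made-up number of subjects tested

# Each subject's number of correct guesses is Binomial(10, 0.5)
first  <- rbinom(n_subjects, size = 10, prob = 0.5) == 10   # perfect on test 1
second <- rbinom(sum(first), size = 10, prob = 0.5) == 10   # retest of those

sum(first)       # apparent "psychics" produced by pure chance
mean(second)     # share of them who repeat the feat on the retest
```

About one subject in a thousand passes the first test by luck, and almost none of them pass again, which is what selecting on a chance pattern looks like.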
Data Mining Pitfalls
• PR Problems: data mining as a four letter word?
– ...increasingly people’s data is at risk. The old ways ...are still at use like dumpster diving, stealing from mailboxes, physical theft, and credit card receipt copying. New tactics include disparate techniques of phishing, email fraud, data mining, spam, key-logging and an array of other technological processes. - Steven D. Domenikos, IdentityTruth, 2008
– One place oversight is sorely lacking is in the whole matter of data mining. ...What have they contributed? Not a single case comes to mind in which security services apprehended a terrorist following identification by data mining. ...that huge database will be out there, win or lose, for some government agency to divert to its purposes or some hacker to turn to private gain or crime. - John Prados, TomPaine.com
Fighting Terrorism in the US
• US Government is widely known to be collecting lots of data on Americans and using data mining to look for patterns consistent with terrorist activity.
• Bruce Schneier, Wired Magazine, “Why Data Mining Won’t Stop Terror”:
• Assume:
  – 1 in 100 false positive rate (99% accurate)
  – 1 in 1000 false negative rate
  – 1 trillion events (phone calls, credit card transactions, emails) per year
  – 10 are really terrorist plots
• Then:
  – 1 billion false alarms for every true plot uncovered
  – 27 million leads daily
  – Even at 99.9999% accuracy, about 2,750 false alarms per day
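The figures follow from simple base-rate arithmetic, which can be checked in R. The sketch assumes the 1 trillion events are counted per year, since that is the reading under which the daily figures come out as quoted.

```r
events_per_year <- 1e12          # assumed to be a yearly total (see note above)
true_plots      <- 10
fp_rate         <- 1 / 100       # false positive rate
fn_rate         <- 1 / 1000      # false negative rate

false_alarms_year <- (events_per_year - true_plots) * fp_rate
plots_found       <- true_plots * (1 - fn_rate)

false_alarms_year / plots_found      # about 1 billion false alarms per plot found
false_alarms_year / 365              # about 27 million leads per day
(events_per_year * 1e-6) / 365       # about 2,740 per day at a 1-in-a-million rate
```

Because real plots are so rare, even a tiny false positive rate swamps the handful of true positives.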
Data Mining Software: Introduction to R
Data Mining Software
• What is R?
  – Open source statistical software
  – Grew out of S, S+
  – www.r-project.org
  – http://cran.stat.auckland.ac.nz/
• R Tutorials available online (see website and CRAN)
• Great graphics
R examples
x = 5
y = rnorm(1000, -1, 3.5)
hist(y)
?hist
hist(y, nclass=20, col="orange")
> summary(y)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-12.7400  -3.3080  -0.8247  -0.8101   1.5820  10.5500
> t.test(y, mu=0)
data: y
t = -7.1286, df = 999, p-value = 1.942e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -1.0330474 -0.5870667
sample estimates:
mean of x
-0.810057

mydata = read.table("iris.dat", sep=" ")
> summary(mydata)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
R Examples
• Data stored in “data frames”
  – read.table or read.csv reads data into a data frame
• Check the contents of ‘mydata’
  – summary(mydata)
  – names(mydata) – lists all the variable names
  – mydata[1,] – shows only the first row of data
  – mydata[,1] – shows only the first column
  – mydata[,2:5] – selects columns 2 through 5
  – mydata$Sepal.Length – shows all values of the variable “Sepal.Length”
  – attach(mydata) – allows you to access Sepal.Length and other columns directly
  – mydata[Sepal.Length > 3,] – selects rows satisfying a condition
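The indexing commands can be tried without any data file. A short sketch, assuming the built-in `iris` data frame as a stand-in for `mydata`; the condition on `Sepal.Width` is arbitrary.

```r
mydata <- iris                   # stand-in for the file read with read.table

names(mydata)                    # variable names
mydata[1, ]                      # first row
head(mydata[, 2:5])              # columns 2 through 5 (first few rows)

# Rows meeting a condition, written two equivalent ways
wide  <- mydata[mydata$Sepal.Width > 3, ]
wide2 <- subset(mydata, Sepal.Width > 3)   # no attach() needed
nrow(wide)
```

Referring to columns as `mydata$...` or through `subset()` avoids `attach()`, which can silently clash with other objects of the same name in the workspace.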
R Examples
• Modelling requires formula notation
  – lm.out = lm(y~x)
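In formula notation the response goes on the left of `~` and the predictors on the right. A minimal sketch with simulated data; the true intercept and slope are made up so the fit can be compared against them.

```r
set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # simulated: intercept 2, slope 0.5

lm.out <- lm(y ~ x)              # read as "y modelled as a function of x"
coef(lm.out)                     # estimated intercept and slope
summary(lm.out)$r.squared

# Formulas can also name columns of a data frame, with several predictors
lm.iris <- lm(Petal.Width ~ Petal.Length + Species, data = iris)
coef(lm.iris)
```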
Lab #1
• R Tutorial
  – Courtesy of Di Cook
  – Work your way through the R tutorial (intro-R.pdf).
  – Code is available in the file introductory-code.txt
  – Input spam data (collected at Iowa State University)
  – Find summaries and simple manipulations
  – Write R functions using function() and apply()
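As a warm-up for the last item, here is a small sketch of `function()` and `apply()` together. It uses the built-in `iris` measurements as a stand-in for the lab's spam data, and the helper `cv` is just an illustrative example.

```r
# A user-defined function: coefficient of variation of a numeric vector
cv <- function(v) sd(v) / mean(v)

m <- as.matrix(iris[, 1:4])      # numeric matrix, standing in for the lab data

col_cv  <- apply(m, 2, cv)       # MARGIN = 2: apply cv to each column
row_max <- apply(m, 1, max)      # MARGIN = 1: apply max to each row
col_cv
head(row_max)
```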