Introduction
Data Mining
University of Szeged
Requirements and grading
- 2 midterms for 25 pts each (min. 8 pts each) (9 Oct, 27 Nov)
- Retake: one of the midterms can be retaken at the end of the semester (in the last week)
- Extra credit (does not count towards the minimum requirements)

Grade:
–22 pts: fail (1)
23–30 pts: pass (2)
31–36 pts: fair (3)
37–42 pts: good (4)
43–50 pts: excellent (5)
Requirements and grading
- Prerequisite: passing grade from practice
- 10 mini-essays for 5 pts each
- Extra credits (do not count towards the minimum requirements)

Grade:
–25 pts: insufficient (1)
26–31 pts: pass (2)
32–37 pts: fair (3)
38–43 pts: good (4)
44–50 pts: excellent (5)
Summary of the semester
- Introduction, basic concepts
- Data description and preprocessing (data cleaning and dimensionality reduction)
- Unsupervised learning
- Supervised learning (classification)
- Outlier detection
- Frequent pattern mining and association rules
- Graph-based data mining techniques
- Web-scale data mining and text mining
Recommended literature
- I. H. Witten, E. Frank, M. A. Hall: Data Mining: Practical Machine Learning Tools and Techniques
- P. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining
- B. Liu: Web Data Mining
- J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
- C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval
- R. Duda, P. Hart, D. Stork: Pattern Classification
- C. Bishop: Pattern Recognition and Machine Learning
- ...
Motivation
- $1,000,000 Netflix Prize
- $20,000 Kaggle StackOverflow Challenge
- Warren Buffett's $10^9 offer related to March Madness
- Big Data to fight against terrorism
Main fields of data mining
Commercial applications
- Classification of debt inquiries
- Segmentation of customer groups
- Churn analysis

Scientific applications
- Astronomy
- Medical research
- Medical diagnostics
What is data mining then?
- The recognition of useful, (sometimes) unexpected patterns in a vast amount of data (e.g. from the Web)
- Technological development made it prevalent: both hardware (HDD/RAM/CPU) and software (e.g. MapReduce)
- The knowledge obtained should be easily understandable, valid, useful and novel ≈ Knowledge Discovery
What is not data mining?
- Database queries
- "Simple" statistics (but these can be used as tools)
- Bonferroni Principle: in a massive dataset, uninteresting patterns can emerge just by chance
Total Information Awareness programme

- Introducing Big Brother in the USA
- Thought experiment: we want to find evil-doers based on hotel reservation logs
- 10^9 people each go to one of the 100-bed hotels with probability 0.01 on any given day → there are 100,000 hotels (10^9 · 0.01 / 100)
- How many suspicious pairs of people (staying at the same hotel on two different days) will be detected over 1,000 days if there are no evil-doers (i.e. all the people are just behaving randomly)?
- P(x and y stay at the very same hotel on some given day) = 0.01 · 0.01 · 0.00001 = 10^-9
- P(x and y stay at the very same hotel on two given days) = (0.01 · 0.01 · 0.00001)^2 = 10^-18
- possible (pair of people, pair of days) combinations = (10^9 choose 2) · (10^3 choose 2) ≈ 2.5 · 10^23
- suspicious pairs altogether ≈ 2.5 · 10^23 · 10^-18 = 250,000
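The slide's arithmetic can be reproduced directly; a minimal sketch (the variable names are illustrative):

```python
from math import comb

people = 10**9   # population under surveillance
days = 1000      # observation period
p_hotel = 0.01   # chance a given person is at a hotel on a given day
hotels = 10**5   # number of 100-bed hotels (10^9 * 0.01 / 100)

# P(two specific people are at the same hotel on one specific day)
p_same_day = p_hotel * p_hotel * (1 / hotels)   # = 10^-9
# P(they coincide on two specific days)
p_two_days = p_same_day ** 2                    # = 10^-18

# candidate (pair of people, pair of days) combinations
candidates = comb(people, 2) * comb(days, 2)    # ~ 2.5 * 10^23

expected_suspicious = candidates * p_two_days
print(f"{expected_suspicious:,.0f}")            # ~ 250,000 innocent "suspects"
```

Even with no evil-doers at all, the sheer number of candidate pairs makes roughly a quarter of a million look suspicious — exactly the Bonferroni Principle at work.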
The Rhine paradox

- David Rhine's research on parapsychology: students had to predict the color of 10 cards (blue/red)
- Rhine's results: almost 0.1% of the subjects were extra-sensory geniuses (2^-10)
- When the 'paraphenomena' were called back, they produced average results

Rhine's conclusion?
Paraphenomena lose their special skills once they are told about them.
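A quick check of the 2^-10 figure (the number of tested subjects below is an assumed illustration, not from the slides):

```python
p_genius = 0.5 ** 10        # probability of guessing all 10 binary cards by chance
print(p_genius)             # 0.0009765625 -> almost 0.1% of subjects

students = 10_000           # assumed number of tested subjects (illustrative)
print(students * p_genius)  # ~10 "extra-sensory geniuses" expected by luck alone
```

On a retest, each of those "geniuses" again succeeds with probability 2^-10 — so average results on the second round are exactly what chance predicts.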
Simpson’s paradox
Think twice before coming to a conclusion!
Admitted/Applied   Female    Male
Major A            7/100     3/50
Major B            91/100    172/200
Total              98/200    175/250

Should we conclude that females are positively discriminated upon admittance (they are admitted at a higher rate in both majors)?
Not really, based on the aggregated data (cf. 49% vs. 70% overall admission rates).
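The table's reversal can be verified in a few lines:

```python
# Admission counts from the table: (admitted, applied)
female = {"A": (7, 100), "B": (91, 100)}
male   = {"A": (3, 50),  "B": (172, 200)}

# Per-major rates: females are admitted at a HIGHER rate in both majors
for major in ("A", "B"):
    print(major, female[major][0] / female[major][1],
                 male[major][0] / male[major][1])
# A: 0.07 vs 0.06,  B: 0.91 vs 0.86

# Aggregated rates reverse the picture (Simpson's paradox)
f_rate = sum(a for a, _ in female.values()) / sum(n for _, n in female.values())
m_rate = sum(a for a, _ in male.values()) / sum(n for _, n in male.values())
print(f_rate, m_rate)   # 0.49 vs 0.7
```

The reversal happens because females mostly applied to the hard-to-enter Major A, while males mostly applied to the easy-to-enter Major B.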
Correlation vs. causality
The number of marriages in Kentucky correlates positively (r > 0.95) with the number of people who drowned after falling out of a fishing boat
[Figure: "People who drowned after falling out of a fishing boat correlates with marriage rate in Kentucky", 1999–2010; Kentucky marriages (7–11 per 1,000) vs. fishing boat deaths (0–20); source: tylervigen.com]
After all, are they data mining?
- Determining the mean age of the customers of a shop from its database
- Determining the average shoe size from previous data of the army, ruling out outlier data at the same time
- Searching for mutations of organs in tomographic records
- Determining the distribution of electives by sex
- Predicting who will take part in an election
Related areas
- Mathematics: probability theory, statistics, graph theory, algebra, analysis
- Algorithms and computational theory
- Databases
- Machine learning, pattern recognition, artificial intelligence
Tools, software
- Commercial products (e.g. SAS)
- Machine learning APIs, numerical mathematics libraries
- Weka, MALLET
- Clementine (SPSS Inc.), Intelligent Miner (IBM), DBMiner (Simon Fraser Univ.)
- Octave, Matlab, Maple, R
- Python (numpy, scipy, scikit-learn, pandas)
- ...
Object of data mining
(Massive) data sets made up of data objects that are described by (usually) high-dimensional feature sets

Data object            Data attributes
record                 field
data point             dimension
sample/measurement     variable
instance/sample        attribute, feature

- Curse of dimensionality: as the number of dimensions grows, we need an exponentially large number of data points (in order for performance not to drop dramatically)
- Distances often lose their meaningfulness in high-dimensional spaces → dimensionality reduction procedures (to be covered later)
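The loss of meaningfulness of distances can be illustrated empirically; a sketch assuming uniform random points (the sample size and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of random points

# As dimensionality grows, the gap between the nearest and farthest
# point shrinks relative to the distances themselves.
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                         # uniform points in [0,1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(d, round(contrast, 3))
# the printed relative contrast drops steadily as d grows
```

With all pairwise distances nearly equal, concepts like "nearest neighbour" carry little information — one motivation for dimensionality reduction.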
Forms of data sets
- Records
- Lists of transactions (shopping carts)
- Data matrix
- Occurrence (e.g. document-term) matrix
Types of variables based on their scale of measurement
Type of attribute   Description                       Examples                Statistics

Categorical
  Nominal           variables can be checked          names of cities,        mode, entropy, correlation,
                    for equality only                 hair color              χ²-test
  Ordinal           the > relation can be             grades, {fail,          median, percentiles
                    interpreted among values          pass, excellent}

Numerical
  Interval          the difference of two values      shoe sizes,             mean, deviation, significance
                    can be formed and interpreted     dates, °C               (e.g. F-, t-) tests
  Ratio             ratios can be formed from         age, length,            percent, harmonic mean
                    values of this kind               temperature in Kelvin
Discrete and continuous variables
- Discrete variable: finite or countably infinite number of possible values
- Continuous variable: can take any real value
- Measurement scales vs. range of variables:
  - Variables measured on a nominal or ordinal scale are discrete most of the time
  - Variables measured on an interval or ratio scale are continuous most of the time
- Could there be such a thing as a continuous binary attribute?
- What about the measurement scale of discrete count variables?
Further characterization of variables – symmetry vs. anti-symmetry

- The absence of an attribute does not necessarily indicate the same amount of similarity between two points as its presence does
- e.g. sparse document vectors: two documents both lacking some rare word are not thereby similar
Manipulating features
Feature discretization
- Categorical features instead of numerical ones can be more beneficial for certain algorithms, both in terms of speed and accuracy

Feature selection
- Intuitively, more features should help obtain better performance; however, this is not always the case
- We try to select the best subset of features (beware that there are exponentially many subsets! → use heuristics)
Unsupervised discretization
- Disregard information about the class labels of the data points
- Form bins of fixed-width intervals (equal-width binning)
  - We might end up with (nearly) empty bins
  - Dense regions of feature values might be split
- Form bins which hold the same number of observations (equal-frequency binning; we get a "flat" histogram across the bins)
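The two strategies can be sketched with numpy (the skewed exponential sample is an assumed illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
# a skewed feature: most values are small, a few are large
x = rng.exponential(scale=1.0, size=1000)

# Equal-width bins: fixed-width intervals -> sparse bins in the tail
width_edges = np.linspace(x.min(), x.max(), 5)
width_counts = np.histogram(x, bins=width_edges)[0]

# Equal-frequency bins: quantile edges -> a "flat" histogram
freq_edges = np.quantile(x, [0, 0.25, 0.5, 0.75, 1.0])
freq_counts = np.histogram(x, bins=freq_edges)[0]

print(width_counts)  # very uneven counts
print(freq_counts)   # [250 250 250 250]
```

On skewed data the equal-width counts are dominated by the first bin, exactly the empty-bin / split-dense-region problem the slide warns about.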
Supervised discretization
- If observations fall into different classes, we can take that information into consideration upon discretization as well
- Based on mutual information, information gain, χ², ... criteria

Mutual information
The mutual information between two random variables X and Y is
MI(X;Y) = H(X) − H(X|Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) · log( p(x,y) / (p(x)·p(y)) )
It tells us how much our uncertainty about the possible value of X decreases once we become aware of the true value of the variable Y.
Example for discretization based on mutual information
Suppose we have the below 10 observations belonging to either the positive (P) or negative (N) class.
Does performing discretization at X = 1 or at X = 3 seem to be the better choice?

Split at X = 1:      P   N
         X ≤ 1       2   3
         X > 1       1   4

Split at X = 3:      P   N
         X ≤ 3       2   4
         X > 3       1   3

MI for the split at X = 1:
2/10 · log2(4/3) + 3/10 · log2(6/7) + 1/10 · log2(2/3) + 4/10 · log2(8/7) ≈ 0.035
MI for the split at X = 3:
2/10 · log2(10/9) + ... ≈ 0.006

Since 0.035 > 0.006, performing discretization at X = 1 should be preferred over discretization at X = 3 based on the mutual information.
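The two candidate splits can be evaluated with a small helper (the `mutual_information` function name is a hypothetical choice, not from the slides):

```python
from math import log2

def mutual_information(table):
    """table[(row, col)] = count; MI between the split variable and the class."""
    total = sum(table.values())
    p_row, p_col = {}, {}
    for (r, c), n in table.items():
        p_row[r] = p_row.get(r, 0) + n / total
        p_col[c] = p_col.get(c, 0) + n / total
    mi = 0.0
    for (r, c), n in table.items():
        p = n / total
        if p > 0:
            mi += p * log2(p / (p_row[r] * p_col[c]))
    return mi

# contingency tables from the example
split_at_1 = {("<=1", "P"): 2, ("<=1", "N"): 3, (">1", "P"): 1, (">1", "N"): 4}
split_at_3 = {("<=3", "P"): 2, ("<=3", "N"): 4, (">3", "P"): 1, (">3", "N"): 3}

print(round(mutual_information(split_at_1), 3))  # 0.035
print(round(mutual_information(split_at_3), 3))  # 0.006
```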
Preprocessing continuous variables – mean-centering
- Subtract the mean from every single observation
- The transformed variable shows the extent to which the original observations differ from average behavior (its mean will be 0)
[Figure: (a) original data; (b) centralized data]
Preprocessing continuous variables – standardizing
- Express the variables in terms of z-scores (from statistics)
- To what extent does an observation differ from its expected value, expressed in terms of its standard deviation
[Figure: (a) original data; (c) standardized data]
Preprocessing continuous variables – whitening
- Remove the correlation between the variables
- Convert the (mean-centered) data X with covariance matrix Σ by applying a linear transformation L to it, i.e. XL
- L should be such that Σ^{-1} = L·Lᵀ holds
[Figure: (a) original data; (d) whitened data]
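A minimal sketch of the whitening recipe above, assuming a Cholesky factorization Σ = C·Cᵀ so that L = (C^{-1})ᵀ satisfies Σ^{-1} = L·Lᵀ (the correlated synthetic data set is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# correlated 2-D data (illustrative)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

Xc = X - X.mean(axis=0)              # mean-center first
Sigma = np.cov(Xc, rowvar=False)     # sample covariance matrix

# pick L with L L^T = Sigma^{-1}: if Sigma = C C^T (Cholesky), take L = (C^{-1})^T
C = np.linalg.cholesky(Sigma)
L = np.linalg.inv(C).T
Xw = Xc @ L                          # whitened data

print(np.cov(Xw, rowvar=False))      # ~ identity matrix: decorrelated, unit variance
```

The check works because cov(XL) = Lᵀ·Σ·L = C^{-1}·(C·Cᵀ)·C^{-T} = I.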
Preprocessing continuous observations – unit normalizing
- Entire observations (rows) can be normalized, not just the random variables (columns)
- For any x ≠ 0, the transformed vector x_u = x / √(xᵀx) points in the direction of x with ‖x_u‖₂ = 1
[Figure: (a) original data; (e) unit normalized (and mean-centered) data]
The process of knowledge discovery
Data quality
There are errors and inconsistencies in basically every database:
- data collection
- data digitalization
- measurement error
Cleaning data
Missing values, e.g. x_i = (0, 0, 4, 2, ?, ?, 1, 6, 'True')
- approximation (e.g. assign the mode/mean/median attribute value of the k most similar data entries)
- throwing away the data point
- throwing away the attribute

Noisy data (e.g. age=280)
- handled similarly to missing values

Filtering out duplicate data entries
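The numeric part of the slide's example vector can be handled as follows (a sketch; the categorical 'True' attribute is left aside, and the missing entries are encoded as NaN):

```python
import numpy as np

# the numeric attributes of the slide's example, with ? encoded as NaN
x = np.array([0, 0, 4, 2, np.nan, np.nan, 1, 6], dtype=float)

# option 1: impute with the mean of the observed values
mean_filled = np.where(np.isnan(x), np.nanmean(x), x)
print(mean_filled)    # missing entries become ~2.17

# option 2: impute with the median (more robust to outliers like age=280)
median_filled = np.where(np.isnan(x), np.nanmedian(x), x)
print(median_filled)  # missing entries become 1.5

# option 3: throw away the incomplete entries altogether
dropped = x[~np.isnan(x)]
print(dropped)
```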
Random variables
- Characterize the outcomes of experiments
- Results of n experiments/measurements: x1, x2, ..., xn
- Expected value: μ_X = E[X] = Σ_{x∈X} P(X = x) · x
- Variance: the expected value of the squared difference from μ_X
  Var(X) = Σ_{x∈X} P(X = x) · (x − μ_X)² = E[X²] − E²[X]

Example
X = [1, 4, 7]
E[X] = (1 + 4 + 7)/3 = 4    E[X²] = (1 + 16 + 49)/3 = 22
Var(X) = 22 − 4² = 6
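The worked example can be reproduced directly, assuming (as the slide does) that each outcome is equally likely:

```python
xs = [1, 4, 7]   # equally likely outcomes
n = len(xs)

mean = sum(xs) / n                 # E[X]   = (1 + 4 + 7)/3
ex2 = sum(x * x for x in xs) / n   # E[X^2] = (1 + 16 + 49)/3
var = ex2 - mean ** 2              # Var(X) = E[X^2] - E[X]^2

print(mean, ex2, var)              # 4.0 22.0 6.0
```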
Some algebra refresher (which might come in handy later on)

- Euclidean norm (distance from the origin): ‖a‖₂ = √( Σ_{i=1}^d a_i² )
- Inner/scalar product: aᵀa = Σ_{i=1}^d a_i²
- Eigenvalues, eigenvectors
  - Right-side eigenvalues: Ax = λx ⇔ det(A − λI) = 0
  - e.g. A = [3 √20; √20 4] → λ² − 7λ − 8 = 0
  - Left-side eigenvalues: yA = λy
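The example's characteristic polynomial λ² − 7λ − 8 = (λ − 8)(λ + 1) gives eigenvalues −1 and 8, which can be checked numerically (`eigvalsh` assumes a symmetric matrix, which this A is):

```python
import numpy as np

# the example matrix from the slide
A = np.array([[3.0, np.sqrt(20)],
              [np.sqrt(20), 4.0]])

# det(A - lambda*I) = 0  ->  lambda^2 - 7*lambda - 8 = 0
eigvals = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
print(eigvals)                   # -> eigenvalues -1 and 8
```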