Introduction
Data Mining
University of Szeged
Requirements and grading
- 2 midterms for 25 pts each (min. 8 pts each) (9 Oct, 27 Nov)
- Retake: one of the midterms can be retaken at the end of the semester (in the last week)
- Extra credit (does not count towards the minimum requirements)

Grade:
–22 pts: fail (1)
23–30 pts: pass (2)
31–36 pts: fair (3)
37–42 pts: good (4)
43–50 pts: excellent (5)
Requirements and grading
- Prerequisite: passing grade from practice
- 10 mini-essays for 5 pts each
- Extra credits (do not count towards the minimum requirements)

Grade:
–25 pts: insufficient (1)
26–31 pts: pass (2)
32–37 pts: fair (3)
38–43 pts: good (4)
44–50 pts: excellent (5)
Summary of the semester
- Introduction, basic concepts
- Data description and preprocessing (data cleaning and dimensionality reduction)
- Unsupervised learning
- Supervised learning (classification)
- Outlier detection
- Frequent pattern mining and association rules
- Graph-based data mining techniques
- Web-scale data mining and text mining
Recommended literature
- I. H. Witten, E. Frank, M. A. Hall: Data Mining: Practical Machine Learning Tools and Techniques
- P. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining
- B. Liu: Web Data Mining
- J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
- C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval
- R. Duda, P. Hart, D. Stork: Pattern Classification
- C. Bishop: Pattern Recognition and Machine Learning
- ...
Motivation
- $1,000,000 Netflix Prize
- $20,000 Kaggle StackOverflow Challenge
- Warren Buffett's $10^9 offer related to March Madness
- Big Data to fight against terrorism
Main fields of data mining
Commercial applications
- Classification of debt inquiries
- Segmentation of customer groups
- Churn analysis

Scientific applications
- Astronomy
- Medical research
- Medical diagnostics
What is data mining then?
- The recognition of useful, (sometimes) unexpected patterns in a vast amount of data (e.g. from the Web)
- Technological development made it prevalent: both hardware (HDD/RAM/CPU) and software (e.g. MapReduce)
- The knowledge obtained should be easily understandable, valid, useful and novel ≈ Knowledge Discovery
What is not data mining?
- Database queries
- "Simple" statistics (but these can be used as tools)
- Bonferroni Principle: in a massive dataset, uninteresting patterns can emerge just by chance
Total Information Awareness programme

- Introducing Big Brother in the USA
- Thought experiment: we want to find evil-doers based on hotel reservation logs
- 10^9 people each go to one of the 100-bed hotels with probability 0.01 on any given day → there are 100,000 hotels (10^9 · 0.01 / 100)
- How many suspicious pairs of people (staying at the same hotel on two different days) will be detected over 1,000 days if there are no evil-doers (i.e. all the people are just behaving randomly)?
- P(x and y stay at the very same hotel on some given day) = 0.01 · 0.01 · 0.00001 = 10^-9
- P(x and y stay at the very same hotel on two given days) = (0.01 · 0.01 · 0.00001)^2 = 10^-18
- possible (pair of people, pair of days) combinations = (10^9 choose 2) · (10^3 choose 2) ≈ 2.5 · 10^23
- suspicious pairs altogether ≈ 2.5 · 10^23 · 10^-18 = 250,000
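The slide's arithmetic can be reproduced directly; a minimal sketch (the variable names are illustrative):

```python
from math import comb

people = 10**9   # population under surveillance
days = 1000      # observation period
p_hotel = 0.01   # chance a given person is at a hotel on a given day
hotels = 10**5   # number of 100-bed hotels (10^9 * 0.01 / 100)

# P(two specific people are at the same hotel on one specific day)
p_same_day = p_hotel * p_hotel * (1 / hotels)   # = 10^-9
# P(they coincide on two specific days)
p_two_days = p_same_day ** 2                    # = 10^-18

# candidate (pair of people, pair of days) combinations
candidates = comb(people, 2) * comb(days, 2)    # ~ 2.5 * 10^23

expected_suspicious = candidates * p_two_days
print(f"{expected_suspicious:,.0f}")            # ~ 250,000 innocent "suspects"
```

Even with no evil-doers at all, the sheer number of candidate pairs makes roughly a quarter of a million look suspicious — exactly the Bonferroni Principle at work.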
The Rhine paradox

- David Rhine's research on parapsychology: students had to predict the color of 10 cards (blue/red)
- Rhine's results: almost 0.1% of the subjects were extra-sensory geniuses (2^-10)
- When the 'paraphenomena' were called back, they produced average results

Rhine's conclusion?
Paraphenomena lose their special skills once they are told about them.
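A quick check of the 2^-10 figure (the number of tested subjects below is an assumed illustration, not from the slides):

```python
p_genius = 0.5 ** 10        # probability of guessing all 10 binary cards by chance
print(p_genius)             # 0.0009765625 -> almost 0.1% of subjects

students = 10_000           # assumed number of tested subjects (illustrative)
print(students * p_genius)  # ~10 "extra-sensory geniuses" expected by luck alone
```

On a retest, each of those "geniuses" again succeeds with probability 2^-10 — so average results on the second round are exactly what chance predicts.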
Simpson’s paradox
Think twice before coming to a conclusion!
Admitted/Applied   Female    Male
Major A            7/100     3/50
Major B            91/100    172/200
Total              98/200    175/250

Should we conclude that females are positively discriminated upon admittance (they are admitted at a higher rate in both majors)?
Not really, based on the aggregated data (cf. 49% vs. 70% overall admission rates).
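The table's reversal can be verified in a few lines:

```python
# Admission counts from the table: (admitted, applied)
female = {"A": (7, 100), "B": (91, 100)}
male   = {"A": (3, 50),  "B": (172, 200)}

# Per-major rates: females are admitted at a HIGHER rate in both majors
for major in ("A", "B"):
    print(major, female[major][0] / female[major][1],
                 male[major][0] / male[major][1])
# A: 0.07 vs 0.06,  B: 0.91 vs 0.86

# Aggregated rates reverse the picture (Simpson's paradox)
f_rate = sum(a for a, _ in female.values()) / sum(n for _, n in female.values())
m_rate = sum(a for a, _ in male.values()) / sum(n for _, n in male.values())
print(f_rate, m_rate)   # 0.49 vs 0.7
```

The reversal happens because females mostly applied to the hard-to-enter Major A, while males mostly applied to the easy-to-enter Major B.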
Correlation vs. causality
The number of marriages in Kentucky correlates positively (r > 0.95) with the number of people who drowned after falling out of a fishing boat
[Figure: "People who drowned after falling out of a fishing boat correlates with marriage rate in Kentucky", 1999–2010; Kentucky marriages (7–11 per 1,000) vs. fishing boat deaths (0–20); source: tylervigen.com]
After all, are they data mining?
- Determining the mean age of the customers of a shop from its database
- Determining the average shoe size from previous data of the army, ruling out outlier data at the same time
- Searching for mutations of organs in tomographic records
- Determining the distribution of electives by sex
- Predicting who will take part in an election
Related areas
- Mathematics: probability theory, statistics, graph theory, algebra, analysis
- Algorithms and computational theory
- Databases
- Machine learning, pattern recognition, artificial intelligence
Tools, software
- Commercial products (e.g. SAS)
- Machine learning APIs, numerical mathematics libraries
- Weka, MALLET
- Clementine (SPSS Inc.), Intelligent Miner (IBM), DBMiner (Simon Fraser Univ.)
- Octave, Matlab, Maple, R
- Python (numpy, scipy, scikit-learn, pandas)
- ...
Object of data mining
(Massive) data sets made up of data objects that are described by (usually) high-dimensional feature sets

Data object            Data attributes
record                 field
data point             dimension
sample/measurement     variable
instance/sample        attribute, feature

- Curse of dimensionality: as the number of dimensions grows, we need an exponentially large number of data points (in order for performance not to drop dramatically)
- Distances often lose their meaningfulness in high-dimensional spaces → dimensionality reduction procedures (to be covered later)
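The loss of meaningfulness of distances can be illustrated empirically; a sketch assuming uniform random points (the sample size and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of random points

# As dimensionality grows, the gap between the nearest and farthest
# point shrinks relative to the distances themselves.
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                         # uniform points in [0,1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(d, round(contrast, 3))
# the printed relative contrast drops steadily as d grows
```

With all pairwise distances nearly equal, concepts like "nearest neighbour" carry little information — one motivation for dimensionality reduction.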
Forms of data sets
- Records
- Lists of transactions (shopping carts)
- Data matrix
- Occurrence (e.g. document-term) matrix
Types of variables based on their scale of measurement
Type of attribute   Description                       Examples                Statistics

Categorical
  Nominal           variables can be checked          names of cities,        mode, entropy, correlation,
                    for equality only                 hair color              χ²-test
  Ordinal           the > relation can be             grades, {fail,          median, percentiles
                    interpreted among values          pass, excellent}

Numerical
  Interval          the difference of two values      shoe sizes,             mean, deviation, significance
                    can be formed and interpreted     dates, °C               (e.g. F-, t-) tests
  Ratio             ratios can be formed from         age, length,            percent, harmonic mean
                    values of this kind               temperature in Kelvin
Discrete and continuous variables
- Discrete variable: finite or countably infinite number of possible values
- Continuous variable: can take any real value
- Measurement scales vs. range of variables:
  - Variables measured on a nominal or ordinal scale are discrete most of the time
  - Variables measured on an interval or ratio scale are continuous most of the time
- Could there be such a thing as a continuous binary attribute?
- What about the measurement scale of discrete count variables?
Further characterization of variables – symmetry vs. anti-symmetry

- The absence of an attribute does not necessarily indicate the same amount of similarity between two points as its presence does
- e.g. sparse document vectors: two documents both lacking some rare word are not thereby similar
Manipulating features
Feature discretization
- Categorical features instead of numerical ones can be more beneficial for certain algorithms, both in terms of speed and accuracy

Feature selection
- Intuitively, more features should help obtain better performance; however, this is not always the case
- We try to select the best subset of features (beware that there are exponentially many subsets! → use heuristics)
Unsupervised discretization
- Disregard information about the class labels of the data points
- Form bins of fixed-width intervals (equal-width binning)
  - We might end up with (nearly) empty bins
  - Dense regions of feature values might be split
- Form bins which hold the same number of observations (equal-frequency binning; we get a "flat" histogram across the bins)
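The two strategies can be sketched with numpy (the skewed exponential sample is an assumed illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
# a skewed feature: most values are small, a few are large
x = rng.exponential(scale=1.0, size=1000)

# Equal-width bins: fixed-width intervals -> sparse bins in the tail
width_edges = np.linspace(x.min(), x.max(), 5)
width_counts = np.histogram(x, bins=width_edges)[0]

# Equal-frequency bins: quantile edges -> a "flat" histogram
freq_edges = np.quantile(x, [0, 0.25, 0.5, 0.75, 1.0])
freq_counts = np.histogram(x, bins=freq_edges)[0]

print(width_counts)  # very uneven counts
print(freq_counts)   # [250 250 250 250]
```

On skewed data the equal-width counts are dominated by the first bin, exactly the empty-bin / split-dense-region problem the slide warns about.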
Supervised discretization
- If observations fall into different classes, we can take that information into consideration upon discretization as well
- Based on mutual information, information gain, χ², ... criteria

Mutual information
The mutual information between two random variables X and Y is
MI(X;Y) = H(X) − H(X|Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) · log( p(x,y) / (p(x)·p(y)) )
It tells us how much our uncertainty about the possible value of X decreases once we become aware of the true value of the variable Y.
Example for discretization based on mutual information
Suppose we have the below 10 observations belonging to either the positive (P) or negative (N) class.
Does performing discretization at X = 1 or at X = 3 seem to be the better choice?

Split at X = 1:      P   N
         X ≤ 1       2   3
         X > 1       1   4

Split at X = 3:      P   N
         X ≤ 3       2   4
         X > 3       1   3

MI for the split at X = 1:
2/10 · log2(4/3) + 3/10 · log2(6/7) + 1/10 · log2(2/3) + 4/10 · log2(8/7) ≈ 0.035
MI for the split at X = 3:
2/10 · log2(10/9) + ... ≈ 0.006

Since 0.035 > 0.006, performing discretization at X = 1 should be preferred over discretization at X = 3 based on the mutual information.
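The two candidate splits can be evaluated with a small helper (the `mutual_information` function name is a hypothetical choice, not from the slides):

```python
from math import log2

def mutual_information(table):
    """table[(row, col)] = count; MI between the split variable and the class."""
    total = sum(table.values())
    p_row, p_col = {}, {}
    for (r, c), n in table.items():
        p_row[r] = p_row.get(r, 0) + n / total
        p_col[c] = p_col.get(c, 0) + n / total
    mi = 0.0
    for (r, c), n in table.items():
        p = n / total
        if p > 0:
            mi += p * log2(p / (p_row[r] * p_col[c]))
    return mi

# contingency tables from the example
split_at_1 = {("<=1", "P"): 2, ("<=1", "N"): 3, (">1", "P"): 1, (">1", "N"): 4}
split_at_3 = {("<=3", "P"): 2, ("<=3", "N"): 4, (">3", "P"): 1, (">3", "N"): 3}

print(round(mutual_information(split_at_1), 3))  # 0.035
print(round(mutual_information(split_at_3), 3))  # 0.006
```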
Preprocessing continuous variables – mean-centering
- Subtract the mean from every single observation
- The transformed variable shows the extent to which the original observations differ from average behavior (its mean will be 0)
[Figure: (a) original data; (b) centralized data]
Preprocessing continuous variables – standardizing
- Express the variables in terms of z-scores (from statistics)
- To what extent does an observation differ from its expected value, expressed in terms of its standard deviation
[Figure: (a) original data; (c) standardized data]
Preprocessing continuous variables – whitening
- Remove the correlation between the variables
- Convert the (mean-centered) data X with covariance matrix Σ by applying a linear transformation L to it, i.e. XL
- L should be such that Σ^{-1} = L·Lᵀ holds
[Figure: (a) original data; (d) whitened data]
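A minimal sketch of the whitening recipe above, assuming a Cholesky factorization Σ = C·Cᵀ so that L = (C^{-1})ᵀ satisfies Σ^{-1} = L·Lᵀ (the correlated synthetic data set is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# correlated 2-D data (illustrative)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

Xc = X - X.mean(axis=0)              # mean-center first
Sigma = np.cov(Xc, rowvar=False)     # sample covariance matrix

# pick L with L L^T = Sigma^{-1}: if Sigma = C C^T (Cholesky), take L = (C^{-1})^T
C = np.linalg.cholesky(Sigma)
L = np.linalg.inv(C).T
Xw = Xc @ L                          # whitened data

print(np.cov(Xw, rowvar=False))      # ~ identity matrix: decorrelated, unit variance
```

The check works because cov(XL) = Lᵀ·Σ·L = C^{-1}·(C·Cᵀ)·C^{-T} = I.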
Preprocessing continuous observations – unit normalizing
- Entire observations (rows) can be normalized, not just the random variables (columns)
- For any x ≠ 0, the transformed vector x_u = x / √(xᵀx) points in the direction of x with ‖x_u‖₂ = 1
[Figure: (a) original data; (e) unit normalized (and mean-centered) data]
The process of knowledge discovery
Data quality
There are errors and inconsistencies in basically every database:
- data collection
- data digitalization
- measurement error
Cleaning data
Missing values, e.g. x_i = (0, 0, 4, 2, ?, ?, 1, 6, 'True')
- approximation (e.g. assign the mode/mean/median attribute value of the k most similar data entries)
- throwing away the data point
- throwing away the attribute

Noisy data (e.g. age=280)
- handled similarly to missing values

Filtering out duplicate data entries
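The numeric part of the slide's example vector can be handled as follows (a sketch; the categorical 'True' attribute is left aside, and the missing entries are encoded as NaN):

```python
import numpy as np

# the numeric attributes of the slide's example, with ? encoded as NaN
x = np.array([0, 0, 4, 2, np.nan, np.nan, 1, 6], dtype=float)

# option 1: impute with the mean of the observed values
mean_filled = np.where(np.isnan(x), np.nanmean(x), x)
print(mean_filled)    # missing entries become ~2.17

# option 2: impute with the median (more robust to outliers like age=280)
median_filled = np.where(np.isnan(x), np.nanmedian(x), x)
print(median_filled)  # missing entries become 1.5

# option 3: throw away the incomplete entries altogether
dropped = x[~np.isnan(x)]
print(dropped)
```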
Random variables
- Characterize the outcomes of experiments
- Results of n experiments/measurements: x1, x2, ..., xn
- Expected value: μ_X = E[X] = Σ_{x∈X} P(X = x) · x
- Variance: the expected value of the squared difference from μ_X
  Var(X) = Σ_{x∈X} P(X = x) · (x − μ_X)² = E[X²] − E²[X]

Example
X = [1, 4, 7]
E[X] = (1 + 4 + 7)/3 = 4    E[X²] = (1 + 16 + 49)/3 = 22
Var(X) = 22 − 4² = 6
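The worked example can be reproduced directly, assuming (as the slide does) that each outcome is equally likely:

```python
xs = [1, 4, 7]   # equally likely outcomes
n = len(xs)

mean = sum(xs) / n                 # E[X]   = (1 + 4 + 7)/3
ex2 = sum(x * x for x in xs) / n   # E[X^2] = (1 + 16 + 49)/3
var = ex2 - mean ** 2              # Var(X) = E[X^2] - E[X]^2

print(mean, ex2, var)              # 4.0 22.0 6.0
```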
Some algebra refresher (which might come in handy later on)

- Euclidean norm (distance from the origin): ‖a‖₂ = √( Σ_{i=1}^d a_i² )
- Inner/scalar product: aᵀa = Σ_{i=1}^d a_i²
- Eigenvalues, eigenvectors
  - Right-side eigenvalues: Ax = λx ⇔ det(A − λI) = 0
  - e.g. A = [3 √20; √20 4] → λ² − 7λ − 8 = 0
  - Left-side eigenvalues: yA = λy
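The example's characteristic polynomial λ² − 7λ − 8 = (λ − 8)(λ + 1) gives eigenvalues −1 and 8, which can be checked numerically (`eigvalsh` assumes a symmetric matrix, which this A is):

```python
import numpy as np

# the example matrix from the slide
A = np.array([[3.0, np.sqrt(20)],
              [np.sqrt(20), 4.0]])

# det(A - lambda*I) = 0  ->  lambda^2 - 7*lambda - 8 = 0
eigvals = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
print(eigvals)                   # -> eigenvalues -1 and 8
```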