Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 1: Classification
Karsten Borgwardt
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen
Our course
Schedule
- February 18 to March 1
- Lecture from 10:00 to 12:30; tutorials are project-based
- Oral exam on March 5 to get the certificate
Structure
- 1 week of algorithmics, 1 week of bioinformatics applications
- Key topics: classification, clustering, feature selection, text & graph mining
- Lecture will provide an introduction to each topic + discussion of important papers
What is data mining?
Data Mining
- Extracting knowledge from large amounts of data (Han and Kamber, 2006)
- Often used as a synonym for Knowledge Discovery, yet other definitions disagree
Knowledge Discovery
- Data cleaning
- Data integration
- Data selection
- Data transformation
- Data mining
- Pattern evaluation
- Knowledge presentation
What is classification?
Problem
- Given an object, which class of objects does it belong to?
- Given object x, predict its class label y.
Examples
- Computer vision: Is this object a chair?
- Credit cards: Is this customer to be trusted?
- Marketing: Will this customer buy/like our product?
- Function prediction: Is this protein an enzyme?
- Gene finding: Does this sequence contain a splice site?
What is classification?
Setting
- Classification is usually performed in a supervised setting: we are given a training dataset.
- A training dataset is a dataset of pairs (x_i, y_i), that is, objects and their known class labels.
- The test set is a dataset of test points x′ with unknown class labels.
- The task is to predict the class label y′ of x′.
Role of y
- If y ∈ {0, 1}: then we are dealing with a binary classification problem
- If y ∈ {1, . . . , n} (n ∈ N): a multiclass classification problem
- If y ∈ R: a regression problem
Classifiers in a nutshell
Nearest Neighbour
- Key idea: if x_1 is most similar to x_2, then y_1 = y_2
- Classification by looking at the ‘nearest neighbour’
Naive Bayes
- A simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions
Decision trees
- A series of decisions has to be taken to classify an object, based on its attributes
- The hierarchy of these decisions is ordered as a tree, a ‘decision tree’
Classifiers in a nutshell
Support Vector Machine
- Key idea: draw a line (plane, hyperplane) that separates two classes of data
- Maximise the distance between the hyperplane and the points closest to it (the margin)
- A test point is predicted to belong to the class whose half-space it is located in
Criteria for a good classifier
- Accuracy
- Runtime and scalability
- Interpretability
- Flexibility
Nearest Neighbour
The actual classification
Given x_i, we predict its label y_i by

$$x_j = \operatorname*{argmin}_{x \in D} \|x - x_i\|^2 \;\Rightarrow\; y_i = y_j \qquad (1)$$

x_i's predicted label is that of the point closest to it, that is, its ‘nearest neighbour’.
Runtime
Naively, one has to compute the distance to all N neighbours in the dataset for each point:
- O(N) for one point
- O(N^2) for the entire dataset
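A minimal sketch of this 1-NN rule in Python (NumPy assumed; the toy data is made up for illustration):

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Predict the label of x as the label of its nearest neighbour (Eq. 1).

    X_train: (N, d) array of training points, y_train: (N,) class labels.
    Computes all N distances, hence O(N) per query point.
    """
    dists = np.sum((X_train - x) ** 2, axis=1)  # squared Euclidean distances
    return y_train[int(np.argmin(dists))]

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 1])
print(nn_classify(X, y, np.array([0.1, -0.2])))  # nearest point is (0, 0) -> 0
```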
Nearest Neighbour
How to speed up NN
Exploit the triangle inequality:

$$d(x_1, x_2) + d(x_2, x_3) \geq d(x_1, x_3) \qquad (2)$$
This holds for any metric d.
Metric
A distance function d is a metric iff
1. d(x_1, x_2) ≥ 0
2. d(x_1, x_2) = 0 if and only if x_1 = x_2
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_3) ≤ d(x_1, x_2) + d(x_2, x_3)
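These four axioms can be spot-checked numerically. A small sketch using the Euclidean distance as the example metric (a check on sample points, not a proof):

```python
import numpy as np

def check_metric_axioms(d, points):
    """Spot-check the four metric axioms on a list of sample points."""
    for x1 in points:
        for x2 in points:
            assert d(x1, x2) >= 0                           # 1. non-negativity
            assert (d(x1, x2) == 0) == np.allclose(x1, x2)  # 2. identity
            assert np.isclose(d(x1, x2), d(x2, x1))         # 3. symmetry
            for x3 in points:
                # 4. triangle inequality (small tolerance for float error)
                assert d(x1, x3) <= d(x1, x2) + d(x2, x3) + 1e-12
    return True

euclid = lambda a, b: float(np.linalg.norm(a - b))
pts = [np.array(p, dtype=float) for p in [(0, 0), (1, 0), (0.5, 2.0)]]
print(check_metric_axioms(euclid, pts))  # True
```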
Nearest Neighbour
How to speed up NN
Rewrite the triangle inequality:

$$d(x_1, x_2) \geq d(x_1, x_3) - d(x_2, x_3) \qquad (3)$$
That means: if you know d(x_1, x_3) and d(x_2, x_3), you can provide a lower bound on d(x_1, x_2).

If you already know a point that is closer to x_1 than d(x_1, x_3) − d(x_2, x_3), then you can avoid computing d(x_1, x_2).
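A sketch of how this lower bound saves distance computations in a 1-NN search, using one fixed pivot point p in the role of x_3 (the pivot choice and the data are illustrative assumptions):

```python
import numpy as np

def nn_with_pruning(X_train, x, pivot):
    """1-NN search that skips d(x, x_i) whenever the triangle-inequality
    lower bound |d(x, p) - d(x_i, p)| already exceeds the best distance."""
    d_xp = np.linalg.norm(x - pivot)
    d_ip = np.linalg.norm(X_train - pivot, axis=1)  # pivot distances, once
    best_dist, best_idx, skipped = np.inf, -1, 0
    for i in range(len(X_train)):
        if abs(d_xp - d_ip[i]) >= best_dist:
            skipped += 1          # x_i cannot beat the current best
            continue
        d = np.linalg.norm(x - X_train[i])
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist, skipped
```

The result matches the brute-force nearest neighbour; only the number of exact distance evaluations shrinks.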
Naive Bayes
Bayes’ Rule

$$P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)} \qquad (4)$$
Naive Bayes Classification
Classify x into one of m classes C_1, . . . , C_m:

$$\operatorname*{argmax}_{C_i} P(C_i \mid x) = \operatorname*{argmax}_{C_i} \frac{P(x \mid C_i)\, P(C_i)}{P(x)} \qquad (5)$$
Naive Bayes
Three simplifications
- P(x) is the same for all classes; ignore this term.
- We further assume that P(C_i) is constant for all classes 1 ≤ i ≤ m; ignore this term as well.
- That means

$$P(C_i \mid x) \propto P(x \mid C_i) \qquad (6)$$

- If x is multidimensional, that is, if x contains n features x = (x_1, . . . , x_n), we further assume that

$$P(x \mid C_i) = \prod_{j=1}^{n} P(x_j \mid C_i) \qquad (7)$$
Naive Bayes
The actual classification
The actual classification is performed by computing

$$P(C_i \mid x) \propto \prod_{j=1}^{n} P(x_j \mid C_i) \qquad (8)$$
The three simplifications are that
- all classes have the same prior probability P(C_i),
- all data points have the same marginal probability P(x),
- all features of an object are independent of each other given the class.
Alternative name: ‘Simple Bayes Classifier’

Runtime
O(Nmn), where N is the number of data points, m the number of classes, and n the number of features.
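A minimal sketch of this classification rule for discrete features, working in log space (the +1/+2 Laplace smoothing and the toy data are added assumptions, not part of the slides):

```python
import numpy as np

def naive_bayes_predict(X_train, y_train, x):
    """Predict argmax_i prod_j P(x_j | C_i) as in Eq. 8, estimating
    P(x_j | C_i) by relative frequencies within class C_i.
    Laplace smoothing (+1 / +2) avoids zero probabilities."""
    best_class, best_log_p = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        log_p = 0.0  # sum of logs instead of a product, to avoid underflow
        for j, v in enumerate(x):
            count = int(np.sum(Xc[:, j] == v))
            log_p += np.log((count + 1) / (len(Xc) + 2))
        if log_p > best_log_p:
            best_class, best_log_p = c, log_p
    return best_class

# Toy binary-feature data: feature 0 determines the class perfectly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(naive_bayes_predict(X, y, np.array([1, 0])))  # 1
```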
Decision Tree
Key idea
Recursively split the data space into regions that contain a single class only.
[Figure: 2-D toy dataset (axes x and Y, ranging from −2 to 2) illustrating the recursive splits]
Decision Tree
Concept
A decision tree is a flowchart-like tree structure with
- a root: the uppermost node
- internal nodes: these represent tests on an attribute
- branches: these represent outcomes of a test
- leaf nodes: these hold a class label
Decision Tree
Classification
- Given a test point x
- Perform the test on the attributes of x at the root
- Follow the branch that corresponds to the outcome of this test
- Repeat this procedure until you reach a leaf node
- Predict the label of x to be the label of that leaf node
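The steps above can be sketched as a simple traversal. The nested-dict tree representation and the toy ‘enzyme?’ tree are hypothetical choices for illustration:

```python
def classify(tree, x):
    """Walk a decision tree represented as nested dicts:
    an internal node is {"attr": name, "branches": {outcome: subtree}},
    a leaf node is just the class label."""
    while isinstance(tree, dict):
        outcome = x[tree["attr"]]          # perform the test at this node
        tree = tree["branches"][outcome]   # follow the matching branch
    return tree                            # reached a leaf: predict its label

# Hypothetical toy tree for a function-prediction task.
tree = {"attr": "has_active_site",
        "branches": {True: {"attr": "length_gt_100",
                            "branches": {True: "enzyme", False: "non-enzyme"}},
                     False: "non-enzyme"}}
print(classify(tree, {"has_active_site": True, "length_gt_100": True}))  # enzyme
```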
Decision Tree
Popularity
- Requires no domain knowledge
- Easy to interpret
- Construction and prediction are fast
But how to construct a decision tree?
Decision Tree
Construction
- Requires determining a splitting criterion at each internal node
- This splitting criterion tells us which attribute to test at node v
- We would like to use the attribute that best separates the classes on the training dataset
Decision Tree
Information gain
- ID3 uses information gain as its attribute selection measure.
- The information content is defined as

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p(C_i) \log_2(p(C_i)), \qquad (9)$$

where p(C_i) is the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
- This is also known as the Shannon entropy of D.
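A direct transcription of Eq. 9, estimating p(C_i) by relative class frequencies:

```python
from collections import Counter
from math import log2

def info(labels):
    """Shannon entropy Info(D) of a list of class labels (Eq. 9),
    with p(C_i) estimated as |C_{i,D}| / |D|."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(info(["+", "+", "-", "-"]))  # two balanced classes: 1.0 bit
```

A pure set (all labels identical) has entropy 0; maximal uncertainty between two classes gives 1 bit.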
Decision Tree
Information gain
- Assume that attribute A was used to split D into v partitions or subsets, {D_1, D_2, . . . , D_v}, where D_j contains those tuples in D that have outcome a_j of A.
- Ideally, the D_j would provide a perfect classification, but they seldom do.
- How much more information do we need to arrive at an exact classification? This is quantified by

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, \mathrm{Info}(D_j). \qquad (10)$$
Decision Tree
Information gain
The information gain is the loss of entropy (increase in information) that is caused by splitting with respect to attribute A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (11)$$
We pick A such that this gain is maximised.
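Eqs. 9–11 combine into a short attribute-selection sketch. The toy rows (‘shape’ separates the classes perfectly, ‘colour’ does not) are made up for illustration:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): Shannon entropy of the class labels (Eq. 9)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for attribute attr (Eqs. 10-11)."""
    n = len(labels)
    info_a = 0.0
    for value in set(row[attr] for row in rows):
        sub = [labels[i] for i, row in enumerate(rows) if row[attr] == value]
        info_a += len(sub) / n * info(sub)  # weighted entropy of partition D_j
    return info(labels) - info_a

rows = [{"shape": "round", "colour": "red"},
        {"shape": "round", "colour": "blue"},
        {"shape": "square", "colour": "red"},
        {"shape": "square", "colour": "blue"}]
labels = ["+", "+", "-", "-"]
print(gain(rows, labels, "shape"))   # 1.0: the split removes all uncertainty
print(gain(rows, labels, "colour"))  # 0.0: the split is uninformative
```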
Decision Tree
Gain ratio
- The information gain is biased towards attributes with a large number of values.
- For example, an ID attribute maximises the information gain!
- Hence C4.5 uses an extension of information gain: the gain ratio.
- The gain ratio is based on the split information

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right) \qquad (12)$$

and is defined as

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \qquad (13)$$
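A sketch of Eq. 12 that makes the ID-attribute penalty concrete: splitting n rows into n singleton partitions yields SplitInfo = log2(n), a large denominator in the gain ratio. The toy rows are an illustrative assumption:

```python
from math import log2

def split_info(rows, attr):
    """SplitInfo_A(D) (Eq. 12): entropy of the partition sizes induced by attr."""
    n = len(rows)
    counts = {}
    for row in rows:
        counts[row[attr]] = counts.get(row[attr], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical rows: 'id' is unique per row, 'shape' takes two values.
rows = [{"id": i, "shape": "round" if i < 4 else "square"} for i in range(8)]
print(split_info(rows, "id"))     # log2(8) = 3.0: heavy penalty for IDs
print(split_info(rows, "shape"))  # 1.0: balanced two-way split
```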
Decision Tree
Gain ratio
- The attribute with maximum gain ratio is selected as the splitting attribute.
- The ratio becomes unstable as the split information approaches zero.
- A constraint is added to ensure that the information gain of the test selected is at least as great as the average gain over all tests examined.
Decision Tree
Gini index
- Attribute selection measure in the CART system
- The Gini index measures class impurity as

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 \qquad (14)$$

- If we split via attribute A into partitions {D_1, D_2, . . . , D_v}, the Gini index of this partitioning is defined as

$$\mathrm{Gini}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, \mathrm{Gini}(D_j) \qquad (15)$$

- and the reduction in impurity by a split on A is

$$\Delta \mathrm{Gini}(D) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \qquad (16)$$
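Eqs. 14–16 in code. The multiway split mirrors Eq. 15 as written; note that the CART system itself uses binary splits, and the toy data is illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 (Eq. 14)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(rows, labels, attr):
    """Delta Gini(D) = Gini(D) - Gini_A(D) (Eqs. 15-16)."""
    n = len(labels)
    gini_a = 0.0
    for value in set(row[attr] for row in rows):
        sub = [labels[i] for i, row in enumerate(rows) if row[attr] == value]
        gini_a += len(sub) / n * gini(sub)  # weighted impurity of D_j
    return gini(labels) - gini_a

rows = [{"shape": "round"}, {"shape": "round"},
        {"shape": "square"}, {"shape": "square"}]
labels = ["+", "+", "-", "-"]
print(gini(labels))                           # 0.5
print(gini_reduction(rows, labels, "shape"))  # 0.5: the split is pure
```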
Support Vector Machines
Hyperplane classifiers
- Vapnik et al. define a family of classifiers for binary classification problems.
- This family is the class of hyperplanes in some dot product space H,

$$\langle w, x \rangle + b = 0, \qquad (17)$$

where w ∈ H, b ∈ R.
- These correspond to decision functions (‘classifiers’):

$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) \qquad (18)$$

- Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.
Support Vector Machines
The optimal hyperplane
maximises the margin of separation between any training point and the hyperplane:

$$\max_{w \in H,\, b \in \mathbb{R}} \; \min\{\|x - x_i\| \mid x \in H,\ \langle w, x \rangle + b = 0,\ i = 1, \ldots, m\} \qquad (19)$$
Support Vector Machines
Optimisation problem

$$\operatorname*{minimise}_{w \in H,\, b \in \mathbb{R}} \; \tau(w) = \frac{1}{2}\|w\|^2 \qquad (20)$$

subject to y_i(⟨w, x_i⟩ + b) ≥ 1 for all i = 1, . . . , m.

Why minimise (1/2)‖w‖²? The size of the margin is 2/‖w‖. The smaller ‖w‖, the larger the margin.
Why do we have to obey the constraints y_i(⟨w, x_i⟩ + b) ≥ 1? They ensure that all training data points of the same class are on the same side of the hyperplane and outside the margin.
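The margin formula and the constraints can be checked numerically for a candidate (w, b). The hyperplane and toy data below are illustrative assumptions:

```python
import numpy as np

def margin_and_feasible(w, b, X, y):
    """Return the margin 2/||w|| and whether every constraint
    y_i(<w, x_i> + b) >= 1 of the optimisation problem holds."""
    values = y * (X @ w + b)
    return 2.0 / np.linalg.norm(w), bool(np.all(values >= 1.0))

# Toy data separable by the hyperplane x_1 = 0, with w = (0.5, 0), b = 0.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
margin, ok = margin_and_feasible(np.array([0.5, 0.0]), 0.0, X, y)
print(margin, ok)  # 4.0 True
```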
Support Vector Machines
The Lagrangian
We form the Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\big(y_i(\langle x_i, w \rangle + b) - 1\big) \qquad (21)$$

The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables α_i.
Support Vector Machines
Support Vectors
At optimality,

$$\frac{\partial}{\partial b} L(w, b, \alpha) = 0 \quad \text{and} \quad \frac{\partial}{\partial w} L(w, b, \alpha) = 0, \qquad (22)$$

such that

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (23)$$

Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels. Those training points with α_i > 0 are the Support Vectors.
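The expansion in Eq. 23 is a one-liner. The dual variables below are made-up values for illustration, chosen to satisfy the constraint sum_i α_i y_i = 0:

```python
import numpy as np

def primal_from_dual(alphas, y, X):
    """Recover w = sum_i alpha_i y_i x_i (Eq. 23); the indices with
    alpha_i > 0 are the support vectors."""
    w = (alphas * y) @ X
    support = np.flatnonzero(alphas > 0)
    return w, support

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alphas = np.array([0.5, 0.0, 0.5])   # hypothetical dual solution
w, sv = primal_from_dual(alphas, y, X)
print(w, sv)  # [1. 1.] [0 2]
```

Point 1 has α = 0 and drops out of the expansion: only the support vectors determine w.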
Support Vector Machines
The dual problem
Plugging (23) into the Lagrangian (21), we obtain the dual optimisation problem that is solved in practice:

$$\operatorname*{maximise}_{\alpha \in \mathbb{R}^m} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \qquad (24)$$

The kernel trick
- The key insight is that (24) accesses the training data only in terms of inner products ⟨x_i, x_j⟩.
- We can plug in an inner product of our choice here! This is referred to as a kernel k:

$$k(x_i, x_j) = \langle x_i, x_j \rangle \qquad (25)$$
Support Vector Machines
Some prominent kernels
- linear kernel

$$k(x_i, x_j) = \sum_{l=1}^{n} x_{il} x_{jl} = x_i^\top x_j, \qquad (26)$$

- polynomial kernel

$$k(x_i, x_j) = (x_i^\top x_j + c)^d, \qquad (27)$$

where c, d ∈ R,
- Gaussian RBF kernel

$$k(x_i, x_j) = \exp\!\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right), \qquad (28)$$

where σ ∈ R.
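The three kernels transcribed directly from Eqs. 26–28 (the default parameter values c = 1, d = 2, σ = 1 are illustrative choices):

```python
import numpy as np

def linear_kernel(xi, xj):
    """Eq. 26: the standard inner product."""
    return float(xi @ xj)

def polynomial_kernel(xi, xj, c=1.0, d=2):
    """Eq. 27: (x_i^T x_j + c)^d."""
    return float((xi @ xj + c) ** d)

def rbf_kernel(xi, xj, sigma=1.0):
    """Eq. 28: exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2)))

xi, xj = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(linear_kernel(xi, xj))      # 1.0
print(polynomial_kernel(xi, xj))  # (1 + 1)^2 = 4.0
print(rbf_kernel(xi, xi))         # 1.0: a point compared with itself
```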
References and further reading
References
[1] B. Schölkopf and A. Smola. Learning with Kernels. MITpress, 2002.
[2] J. Han and M. Kamber. Data Mining: Concepts andTechniques. Elsevier, Morgan-Kaufmann Publishers,2006.
The end
See you tomorrow! Next topic: Clustering