MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Elena Marchiori
IBIVU
Vrije Universiteit Amsterdam
Summary
• Machine Learning
• Supervised Learning: classification
• Unsupervised Learning: clustering
Machine Learning (ML)
• Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: observations of an unknown system and its properties are fed to ML, which constructs a computational model of the system.]
Supervised Learning
• The dataset describes examples of the input-output behaviour of an unknown (but existent) system.
• The algorithm tries to find a function ‘equivalent’ to the system.
• ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.
Supervised Learning
[Diagram: a supervisor observes the unknown system and records training data (observations plus the property of interest); the ML algorithm builds a model that predicts this property for new observations.]
Example: A Classification Problem
• Categorize images of fish—say, “Atlantic salmon” vs. “Pacific salmon”
• Use features such as length, width, lightness, fin shape & number, mouth position, etc.
• Steps:
1. Preprocessing (e.g., background subtraction)
2. Feature extraction
3. Classification
example from Duda & Hart
Classification in Bioinformatics
• Computational diagnostic: early cancer detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding sites prediction
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Classification Techniques
• Naïve Bayes
• K Nearest Neighbour
• Support Vector Machines (next lesson)
• …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Bayesian Approach
• Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright
• Prior knowledge can be combined with the observed data to determine a hypothesis
• Bayesian methods can accommodate hypotheses that make probabilistic predictions
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
Kathleen McKeown’s slides
Bayesian Approach
• Assign the most probable target value, given the attribute values <a1, a2, …, an>:
  VMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
• Using Bayes' theorem:
  VMAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
       = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
• Bayesian learning is optimal
• P(vj) is easy to estimate by counting in the training data
• Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes)
Kathleen McKeown's slides
Bayes’ Rules
• Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
• Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
• In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Kathleen McKeown’s slides
Naïve Bayes
• Assume (conditional) independence of the attributes:
  P(a1, a2, …, an | vj) = ∏i P(ai | vj)
• Substitute into the VMAP formula:
  VNB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
Kathleen McKeown’s slides
VNB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
S-length S-width P-length Class
1 high high high Versicolour
2 low high low Setosa
3 low high low Virginica
4 low high med Virginica
5 high high high Versicolour
6 high high med Setosa
7 high high low Setosa
8 high high high Versicolour
9 high high high Versicolour
Kathleen McKeown’s slides
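A minimal Naïve Bayes sketch in Python, not taken from the slides, that computes VNB on the toy table above by plain counting (the query flower at the end is made up):

```python
from collections import Counter, defaultdict

# Toy training set from the table above: (S-length, S-width, P-length) -> class.
data = [
    (("high", "high", "high"), "Versicolour"),
    (("low",  "high", "low"),  "Setosa"),
    (("low",  "high", "low"),  "Virginica"),
    (("low",  "high", "med"),  "Virginica"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "med"),  "Setosa"),
    (("high", "high", "low"),  "Setosa"),
    (("high", "high", "high"), "Versicolour"),
    (("high", "high", "high"), "Versicolour"),
]

class_counts = Counter(c for _, c in data)          # counts for P(vj)
attr_counts = defaultdict(Counter)                  # counts for P(ai | vj)
for attrs, c in data:
    for i, a in enumerate(attrs):
        attr_counts[(c, i)][a] += 1

def predict(attrs):
    """VNB = argmax_vj P(vj) * prod_i P(ai | vj), with probabilities estimated by counting."""
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                     # P(vj)
        for i, a in enumerate(attrs):
            score *= attr_counts[(c, i)][a] / n_c   # P(ai | vj)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(("high", "high", "med")))             # hypothetical new flower -> "Setosa"
```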
Estimating Probabilities
• What happens when the number of data elements is small?
• Suppose the true P(S-length=low | Virginica) = 0.05
  – There are only 2 instances with Class = Virginica
  – We estimate the probability by nc/n using the training set
  – Then nc = #(S-length=low and Class=Virginica) must be 0
  – So, instead of 0.05, we use an estimated probability of 0
• Two problems:
  – Biased underestimate of the probability
  – This probability term will dominate (it zeroes the whole product) if a future query contains S-length=low
Kathleen McKeown’s slides
Instead: use the m-estimate
• Use priors as well:
  (nc + m·p) / (n + m)
  – where p = the prior estimate of P(S-length=low | Virginica)
  – m is a constant called the equivalent sample size
    » It determines how heavily to weight p relative to the observed data
    » Typical method: assume a uniform prior over the attribute values (e.g. if the values are low, med, high, then p = 1/3)
Kathleen McKeown’s slides
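A small sketch of the m-estimate; the numeric values only illustrate the zero-count case from the previous slide:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m).

    n_c: examples of the class that have the attribute value
    n:   examples of the class
    p:   prior estimate of the probability (e.g. 1/3 for a uniform prior over low, med, high)
    m:   equivalent sample size, weighting the prior against the observed data
    """
    return (n_c + m * p) / (n + m)

# 2 Virginica examples, none with S-length = low, uniform prior p = 1/3, m = 3:
print(m_estimate(n_c=0, n=2, p=1/3, m=3))   # 0.2 instead of the raw estimate 0.0
```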
K-Nearest Neighbour
• Memorize the training data
• Given a new example, find its k nearest neighbours, and output the majority vote class.
• Choices:
  – How many neighbours?
  – What distance measure?
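A compact sketch of the k-nearest-neighbour rule with Euclidean distance; the 2-D toy data are invented for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by the majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> "A"
```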
Application in Bioinformatics
• A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7
1. For each dataset k and each pair of genes p, compute the similarity f(p, k) of p with respect to the k-th dataset.
2. Construct a predictor H of gene-pair similarity, e.g. by logistic regression, mapping (f(p,1), …, f(p,m)) to H(f(p,1), …, f(p,m)), such that H has a high value if the genes of p have similar functions.
Given a new gene g, find its k nearest neighbours using H as the distance. Predict the functional classes C1, …, Cn of g with confidence
Confidence(Ci) = 1 − ∏ (1 − Pij), the product running over the neighbours gj of g that have Ci among their classes (the probability that at least one prediction is correct, that is 1 minus the probability that all predictions are wrong).
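A one-line sketch of the confidence combination above (the neighbour probabilities Pij are made-up numbers):

```python
import numpy as np

def confidence(p_ij):
    """Confidence(Ci) = 1 - prod_j (1 - Pij): probability that at least one
    neighbour-based prediction of class Ci is correct."""
    p = np.asarray(p_ij, dtype=float)
    return 1.0 - np.prod(1.0 - p)

print(confidence([0.6, 0.3, 0.5]))   # 0.86
```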
Classification: CV error
• Training error
  – Empirical error
• Error on an independent test set
  – Test error
• Cross-validation (CV) error
  – Leave-one-out (LOO)
  – N-fold CV
[Diagram: the N samples are split into folds; in each round 1/n of the samples is held out for testing and the rest is used for training, the errors are counted, and the CV error rate is summarized over all rounds.]
Two schemes of cross validation
[Diagram: two schemes over the N samples. CV1: within each LOO round, both the gene selector and the classifier are trained and tested, then the errors are counted. CV2: gene selection is done once on all samples, then the classifier is trained and tested within LOO and the errors are counted.]
Difference between CV1 and CV2
• CV1: gene selection within the LOO CV loop
• CV2: gene selection before the LOO CV loop
• CV2 can yield an optimistic estimate of the true classification error (see the sketch below)
• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – The CV error differs from the test error!
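A minimal scikit-learn sketch contrasting the two schemes, assuming a generic univariate gene selector and a kNN classifier; the synthetic data and the number of selected genes are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Stand-in for expression data: 40 samples, 500 "genes".
X, y = make_classification(n_samples=40, n_features=500, n_informative=10, random_state=0)

# CV1: the gene selector is refitted inside every LOO training fold.
cv1_model = make_pipeline(SelectKBest(f_classif, k=50), KNeighborsClassifier(3))
cv1_error = 1 - cross_val_score(cv1_model, X, y, cv=LeaveOneOut()).mean()

# CV2 (biased): genes are selected once on ALL samples, then LOO is run.
X_selected = SelectKBest(f_classif, k=50).fit_transform(X, y)
cv2_error = 1 - cross_val_score(KNeighborsClassifier(3), X_selected, y, cv=LeaveOneOut()).mean()

print(cv1_error, cv2_error)   # CV2 is typically the (optimistically) lower of the two
```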
Significance of classification results
• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat the process a large number of times
  – Compare with the LOOCV error on the original data:
• p-value = (# of permutations whose LOOCV error <= the LOOCV error on the original data) / (total # of permutations considered)
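A sketch of the permutation test under the same assumptions as above (kNN classifier, LOOCV); the number of permutations is an illustrative choice:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

def loocv_error(X, y):
    return 1 - cross_val_score(KNeighborsClassifier(3), X, y, cv=LeaveOneOut()).mean()

def permutation_p_value(X, y, n_permutations=100):
    """p-value = fraction of label permutations whose LOOCV error is
    at most the LOOCV error obtained with the original labels."""
    original = loocv_error(X, y)
    hits = sum(loocv_error(X, rng.permutation(y)) <= original for _ in range(n_permutations))
    return hits / n_permutations

# Usage: p = permutation_p_value(X, y) for a samples-by-genes matrix X and labels y.
```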
Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data
Unsupervised Learning
• The dataset describes the structure of an unknown (but existent) system.
• The computer program tries to identify structure of the system (clustering, data compression).
• ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).
Clustering
Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way.
Clustering finds structure in a collection of unlabeled data.
A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Clustering Algorithms
• Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.
• The goal is to assign the n objects to k clusters so that objects "within" a cluster are more "similar" to each other than to objects in other clusters. k is usually unknown.
• Popular methods: hierarchical, k-means, SOM, …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Hierarchical Clustering
[Figure: dendrogram and Venn diagram of clustered data.]
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Hierarchical Clustering (Cont.)
• Multilevel clustering: level 1 has n clusters; level n has one cluster.
• Agglomerative HC: starts with singletons and merges clusters.
• Divisive HC: starts with one all-inclusive cluster and splits clusters.
Nearest Neighbor Algorithm
• Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
• Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Nearest Neighbor, Level 2, k = 7 clusters.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
Nearest Neighbor, Level 3, k = 6 clusters.
Nearest Neighbor, Level 4, k = 5 clusters.
Nearest Neighbor, Level 5, k = 4 clusters.
Nearest Neighbor, Level 6, k = 3 clusters.
Nearest Neighbor, Level 7, k = 2 clusters.
Nearest Neighbor, Level 8, k = 1 cluster.
1. Calculate the similarity between all possible combinations of two profiles.
2. Group the two most similar clusters together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters, and repeat from step 2.
Hierarchical Clustering
Keys:
• Similarity
• Clustering
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering in Bioinformatics
• Microarray data quality checking
  – Do replicates cluster together?
  – Do similar conditions, time points and tissue types cluster together?
• Cluster genes: predict the functions of unknown genes from those of known genes in the same cluster.
• Cluster samples: discover clinical characteristics (e.g. survival, marker status) shared by samples.
• Promoter analysis of commonly regulated genes
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Functionally significant gene clusters
[Figure: two-way clustering of the expression matrix, giving gene clusters and sample clusters.]
Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses.
Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.
Similarity Measurements
• Pearson Correlation
  For two profiles (vectors) x = (x1, …, xN) and y = (y1, …, yN):
  Cpearson(x, y) = Σ_{i=1..N} (xi − mx)(yi − my) / sqrt( [Σ_{i=1..N} (xi − mx)²] · [Σ_{i=1..N} (yi − my)²] )
  where mx = (1/N) Σ_{n=1..N} xn and my = (1/N) Σ_{n=1..N} yn.
  The Pearson correlation ranges from −1 to +1.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Pearson Correlation: trend similarity
  Example: three profiles a, b and c that follow the same trend (identical up to shift and scale):
  Cpearson(a, c) = 1
  Cpearson(a, b) = 1
  Cpearson(c, b) = 1
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Euclidean Distance
  For two profiles x = (x1, …, xN) and y = (y1, …, yN):
  d(x, y) = sqrt( Σ_{n=1..N} (xn − yn)² )
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Similarity Measurements
• Euclidean Distance: absolute difference
  For the same three profiles a, b and c:
  d(a, c) = 1.5875
  d(a, b) = 2.8025
  d(c, b) = 3.2211
  (profiles with identical trend can still be far apart in Euclidean distance)
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
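A small sketch computing both measures on invented profiles (not the ones behind the numbers above): b and c follow the same trend as a, so the correlations are all 1 while the Euclidean distances differ:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym))

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return np.sqrt(((x - y) ** 2).sum())

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a + 1.5        # same trend, shifted
c = 0.5 * a        # same trend, rescaled
print(pearson(a, b), pearson(a, c), pearson(b, c))        # all 1.0
print(euclidean(a, b), euclidean(a, c), euclidean(b, c))  # all different
```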
Clustering
[Figure: three clusters C1, C2, C3. Which pair of clusters should be merged?]
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering
Single Linkage
[Figure: clusters C1 and C2.]
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate "long chains".
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering
Complete Linkage
[Figure: clusters C1 and C2.]
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate compact "clumps".
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering
Average Linkage
[Figure: clusters C1 and C2.]
Dissimilarity between two clusters = average of the distances between all pairs of objects (one from each cluster).
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
Clustering
Average Group Linkage
[Figure: clusters C1 and C2.]
Dissimilarity between two clusters = distance between the two cluster means.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
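A minimal SciPy sketch of agglomerative clustering; the random profiles, the linkage method and the cut into 3 clusters are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))    # 30 made-up profiles measured under 100 conditions

# method: 'single', 'complete' or 'average' (the linkages above);
# metric: 'euclidean' or 'correlation' (1 - Pearson correlation).
Z = linkage(X, method='average', metric='correlation')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
print(labels)
```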
Considerations
• What genes are used to cluster samples?
  – Expression variation
  – Inherent variation
  – Prior knowledge (irrelevant genes)
  – Etc.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
K-means Clustering
– Initialize the K cluster representatives w, e.g. to randomly chosen examples.
– Assign each input example x to the cluster c(x) with the nearest corresponding weight vector:
  c(x) = argmin_j || x − wj(n) ||
– Update the weights:
  wj(n+1) = ( Σ_{x : c(x) = j} x ) / nj,  where nj is the number of examples assigned to cluster j
– Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
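A plain NumPy sketch of the loop above; the data (two synthetic blobs) and the initialization are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate assignment to the nearest representative and recomputation
    of each representative as the mean of its assigned examples."""
    rng = np.random.default_rng(seed)
    w = X[rng.choice(len(X), size=k, replace=False)]           # initial representatives
    c = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=2)
        c = d.argmin(axis=1)                                   # c(x) = argmin_j ||x - w_j||
        new_w = np.array([X[c == j].mean(axis=0) if np.any(c == j) else w[j] for j in range(k)])
        if np.allclose(new_w, w):                              # no noticeable change: stop
            break
        w = new_w
    return w, c

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
centers, assignment = kmeans(X, k=2)
print(centers)
```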
Example I
[Figure: initial data and seeds; final clustering.]
Example II
[Figure: initial data and seeds; final clustering.]
SOM: Brain's self-organization
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation.
That is, the brain processes the external signals in a topology-preserving way.
Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Self-Organizing Map: idea
• Data: vectors XT = (X1, …, Xd) from a d-dimensional space.
• Grid of nodes, with a local processor (called a neuron) in each node.
• Local processor # j has d adaptive parameters W(j).
• Goal: change the W(j) parameters so as to recover the data clusters in X space.
Training process
[Figure: an N-dimensional data space and a 2-D grid of neurons; x = data points, o = positions of the neuron weights. The weights point to points in N-D while the neurons are arranged in 2-D.]
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Concept of the SOM
[Figure: input space and reduced feature space; cluster centers (code vectors) in the input space, and the place of these code vectors in the reduced space.]
Clustering and ordering of the cluster centers in a two-dimensional grid.
Concept of the SOM
[Figure: example SOM.] We can use it for visualization, for classification and for clustering.
SOM: learning algorithm
• Initialization: n = 0. Choose random small values for the weight vector components.
• Sampling: select an x from the input examples.
• Similarity matching: find the winning neuron i(x) at iteration n:
  i(x) = argmin_j || x(n) − wj(n) ||
• Updating: adjust the weight vectors of all neurons using the rule
  wj(n+1) = wj(n) + η(n) h_{j,i(x)}(dji) ( x(n) − wj(n) )
• Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.
Neighborhood Function
– Gaussian neighborhood function:
  h_{ji}(dji) = exp( − dji² / (2σ²) )
– dji: lateral distance of neurons i and j
  • in a 1-dimensional lattice: | j − i |
  • in a 2-dimensional lattice: || rj − ri ||, where rj is the position of neuron j in the lattice.
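A compact sketch of the SOM training loop with the Gaussian neighborhood above; the grid size and the decay schedules for the learning rate and neighborhood width are illustrative choices, not those of the slides:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, eta0=0.5, sigma0=3.0, seed=0):
    """Sample an x, find the winning neuron, update all weights with a
    Gaussian neighborhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = rng.normal(scale=0.1, size=(rows * cols, X.shape[1]))    # small random weights

    for n in range(n_iter):
        x = X[rng.integers(len(X))]                              # sampling
        winner = np.argmin(np.linalg.norm(W - x, axis=1))        # i(x) = argmin_j ||x - w_j||
        eta = eta0 * np.exp(-n / n_iter)                         # decaying learning rate
        sigma = sigma0 * np.exp(-n / n_iter)                     # decaying neighborhood width
        d2 = ((pos - pos[winner]) ** 2).sum(axis=1)              # squared lattice distances dji^2
        h = np.exp(-d2 / (2 * sigma ** 2))                       # Gaussian neighborhood h_ji
        W += eta * h[:, None] * (x - W)                          # w_j <- w_j + eta * h * (x - w_j)
    return W

# Usage: W = train_som(X) for a samples-by-features matrix X; reshape W to the grid for plotting.
```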
Initial h function (example)
Some examples of real-life applications
Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ contains > 5000 papers on SOM and its applications:
• Brain research: modeling of the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
• Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economical data, business and financial data, ...
• Data compression (images and audio), information filtering.
• Medical and technical diagnostics.
Issues in Clustering
• How many clusters?
  – User parameter
  – Use a model selection criterion (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/~dpelleg/kmeans.html
• What similarity measure?
  – Euclidean distance
  – Correlation coefficient
  – Ad-hoc similarity measures
Validation of clustering results
• External measures
  – According to some external knowledge
  – Consideration of bias and subjectivity
• Internal measures
  – Quality of clusters according to the data
– Compactness and separation
– Stability
– …
See e.g. J. Handl, J. Knowles, D. B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.
Bioinformatics Application
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
T.R. Golub et al., Science 286, 531 (1999)
Identification of cancer types
• Why is the identification of the cancer class (tumor sub-type) important?
  – Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML)).
• Traditional methods:
  – Morphological appearance.
– Enzyme-based histochemical analyses.
– Immunophenotyping.
– Cytogenetic analysis.
Golub et al. 1999
Class Prediction
• How could one use an initial collection of samples belonging to known classes to create a class predictor?
  – Identification of informative genes
– Weighted Vote
Golub et al. slides
Data
• Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
• Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Golub et al. slides
Validation of Gene Voting
• Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain. All 36 predictions agree with the clinical diagnosis.
• Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Golub et al. slides
Class Discovery
• Can cancer classes be discovered automatically based on gene expression?
  – Cluster tumors by gene expression
  – Determine whether the putative classes produced are meaningful.
Golub et al. slides
Cluster tumors
Self-Organizing Map (SOM)
– Mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach)
– Chooses a geometry of "nodes"
– Nodes are mapped into K-dimensional space, initially at random
– The nodes are iteratively adjusted.
Golub et al. slides
Validation of SOM
• Prediction based on clusters A1 and A2:
  – 24/25 of the ALL samples from the initial dataset were clustered in group A1
  – 10/13 of the AML samples from the initial dataset were clustered in group A2
Golub et al. slides
Validation of SOM
• How could one evaluate the putative clusters if the "right" answer were not known?
  – Assumption: class discovery can be tested by class prediction.
• Testing the assumption:
  – Construct predictors based on clusters A1 and A2.
  – Construct predictors based on random clusters.
Golub et al. slides
Validation of SOM
• Predictions using predictors based on clusters A1 and A2 yield 34 accurate predictions, one error and three uncertain calls.
Golub et al. slides
Validation of SOM
Golub et al. slides
CONCLUSION
• In Machine Learning, every technique has its assumptions and constraints, advantages and limitations
• My view:
  – First perform simple data analysis before applying fancy high-tech ML methods
  – Possibly use different ML techniques and then ensemble the results
  – Apply the correct cross-validation method!
  – Check the significance of the results (permutation test, stability of the selected genes)
  – Work in collaboration with the data producer (biologist, pathologist) when possible!
ML in bioinformatics