One-Class Learning - Bioinformatics & Machine Learning


Source: ksuweb.kennesaw.edu/~dlo2/teaching/cs7123/Fall2005/...

Outline

- Defining Areas
- Why Machine Learning Algorithms?
- Characteristics of data & Problems
- How does One-Class Learning fit here?
- One-Class Learning details
- Conclusion

One-Class Learning
Bioinformatics & Machine Learning

Kihoon Yoon
Department of Computer Science
University of Texas at San Antonio

November 22, 2005

Kihoon Yoon One-Class Learning


Defining Areas: Bioinformatics & Related Areas

Why Machine Learning Algorithms?: Characteristics of Life Science; Why is ML so popular?

Characteristics of data & Problems: Data and Problems

How does One-Class Learning fit here?: My Current Project; Introduction of One-Class Learning

One-Class Learning details: Koby Crammer & Gal Chechik, 2004; Gunjan Gupta & Joydeep Ghosh, 2005; Moshe Koppel & Jonathan Schler, 2004

Conclusion


Bioinformatics & Related Areas

Computational Biology & Systems Biology

- Computational Biology
  - Not a new area.
  - Uses mathematical and computational approaches to address theoretical and experimental questions in biology.
- Systems Biology
  - Not a new area.
  - Aims at understanding a biological system as a whole: for instance, a cell or an organ as one unit.
  - Examples: artificial organ construction, neural regeneration.


Bioinformatics

- Bioinformatics = Biology + Informatics
- Bioinformatics makes life sciences data more understandable and useful.
- Can it be a part of the science domains?
  - With the above broad definition → YES or NO. It depends on what you do.
  - A typical dictionary definition of Science: "The observation, identification, description, experimental investigation (scientific method), and theoretical explanation of phenomena. Such activities restricted to a class of natural phenomena..." (Excerpted from The American Heritage Dictionary of the English Language, Third Edition, 1996.)


Characteristics of Life Science / Why is ML so popular?

Biology vs. Bioinformatics

- Biology
  - Hypothesis driven: forming hypotheses and finding evidence.
  - Inferring science: learn from a model system, try to explain a target system.
  - Biologists may believe 'data' from the wet lab if the data make sense biologically.


Biology vs. Bioinformatics

- Bioinformatics
  - With the broad definition:
    - Any technical supporting activity toward biology.
    - Current trend: only accuracy is important; some do not care even if the results tell a wrong story, yet still claim to be doing bioinformatics!
  - With bioinformatics considered a part of science:
    - A much bigger setting than traditional biology, but still hypothesis driven from existing data.


Biology & Machine Learning

- Machine Learning
  - Learning from a known data set, trying to predict unknown instances.
  - "Inferring" - the most attractive word to biologists.
- So, machine learning is a perfect match for biology... but:
  - Nothing about the data can be taken as certain.
  - Machine learning relies on some assumptions.


Data and Problems

You can see what you want to see!


Characteristics of Data

- Hypothesis-driven data: what about the DNA sequences? Are they absolutely 'TRUE' data?
  - NO!
    - Possible sequencing errors: systematic errors
    - Variations in each individual sample
  - Need to clean up the data as best as I can, and pray.
- A small number of positive examples vs. a huge number of undefined negative (?) examples
  - Fortunately, there are some definitely positive examples.


Problems of ML Algorithms

- What's wrong with ML algorithms? The underlying assumptions of machine learning are not suitable for biology data:
  - "Training data is not much different from the data to test." ← Not always true in biology (heterogeneous data).
  - "The numbers of positive and negative examples are roughly the same." ← Not always true in biology (imbalanced data).
- Possible general approaches to overcome these problems:
  - Transductive learning for heterogeneous data (?)
  - Undersampling or oversampling for imbalanced data
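The undersampling remedy mentioned above can be sketched in a few lines. This toy example is my own construction with invented data (class sizes only; in practice the feature vectors come along with the labels):

```python
# Sketch of random undersampling for imbalanced data: downsample the
# majority class so the two classes have equal size.
import random

def undersample(majority, minority, seed=0):
    """Sample len(minority) majority examples (without replacement) and
    combine them with all minority examples."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

negatives = list(range(1000))                # e.g. abundant negative windows
positives = ["p%d" % i for i in range(25)]   # scarce positive examples
balanced = undersample(negatives, positives)
print(len(balanced))  # 50: 25 sampled negatives + 25 positives
```

Oversampling would go the other way (replicating or perturbing minority examples); both are crude compared to the one-class approach discussed next, which sidesteps the negatives entirely.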


Some examples of common mistakes

- A general mistake from the biology side
  - J. Doe et al., 2005, Genome Research: they claimed that brain and testis are similar organs based on their principal component analysis (PCA). The explanation was just "PCA spit out the results".
- A common mistake from the non-biology side
  - 'A' et al., 2004, Bioinformatics and 'B' et al., 2005, Bioinformatics: both groups work on protein subcellular localization prediction and claimed prediction accuracies of more than 95% over 18 classes (locations). Group B also claimed that they beat group A. It turns out that their data contain many similar protein sequences.


My Current Project / Introduction of One-Class Learning

Construct a model of a gene network

- Subtask 1: find all possible Transcription Factor (TF) binding sites on human chromosomes.
  - Only positive examples are available.
- Subtask 2: promoter and promoter-complexity prediction.
  - Negative examples are available, but the imbalance ratio is too high.
- One-Class Learning may handle both the imbalanced-data problem and the absence of counter examples by considering only positive examples.


Inspirations

- Close to humans' learning process
  - Counter examples are not required to learn.
  - One-class classification tries to describe one class of objects, and distinguish it from all other possible objects.
- Various applications of One-Class Learning
  - machine diagnostics
  - signature verification
  - fMRI image analysis, and so on...


One-Class Learning

- Proposed by Schölkopf et al., 1995; Tax and Duin, 1999; Ben-Hur et al., 2001.
- A.k.a. the data domain description problem (Tax and Duin, 1999).
- Main idea
  - The boundary between the two classes has to be estimated from data of the genuine class only.
  - Task: define a boundary around the target class that accepts as many of the target objects as possible while minimizing the chance of accepting outlier objects.


One-Class Learning

- General setting of One-Class Learning: the decision process
  - A measure for the distance d(x) or resemblance r(x) of an object x to the target class.
  - A threshold Θ on this distance or resemblance.
  - New objects are accepted when

    f(x) = I(d(x) < Θ_d)   or   f(x) = I(r(x) > Θ_r),

    where I(·) is the indicator function.
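The decision rule f(x) = I(d(x) < Θ) can be sketched concretely. This toy example is my own (made-up data), using the Euclidean distance to the target-class centroid as d(x):

```python
# Minimal one-class decision rule: accept x when its distance to the
# target-class centroid is below a threshold theta.
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def accept(x, center, theta):
    """f(x) = I(d(x) < theta): 1 if x is accepted as a target object."""
    return 1 if dist(x, center) < theta else 0

# Only the target (positive) class is needed -- no negative examples.
target = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0]]
w = centroid(target)
theta = 0.5

print(accept([1.0, 1.05], w, theta))  # near the target class -> 1
print(accept([3.0, 3.0], w, theta))   # far outlier -> 0
```

The resemblance form f(x) = I(r(x) > Θ_r) is the same rule with the inequality flipped, e.g. with r(x) = 1 / (1 + d(x)).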


One-Class Learning

- General setting of One-Class Learning: the error definition
  - A method which obtains the lowest outlier acceptance rate, f_O+, is to be preferred.
  - For a target acceptance rate f_T+, the corresponding threshold Θ_{f_T+} is defined by

    f_T+ = (1/N) Σ_i I(r(x_i) ≥ Θ_{f_T+}),   where x_i ∈ χ_T, ∀i.
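Setting the threshold for a desired target acceptance rate follows directly from this definition: sort the training resemblance scores and pick Θ at the right rank. A small sketch with invented scores:

```python
# Choose Theta so that a desired fraction f_T+ of target training objects
# is accepted, i.e. f_T+ = (1/N) * sum_i I(r(x_i) >= Theta).
def threshold_for_acceptance(scores, target_rate):
    """Return Theta such that about target_rate of `scores` are >= Theta."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))  # last accepted rank
    return ranked[k - 1]

def acceptance_rate(scores, theta):
    return sum(1 for r in scores if r >= theta) / len(scores)

r_train = [0.95, 0.90, 0.88, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30, 0.10]
theta = threshold_for_acceptance(r_train, 0.8)  # accept ~80% of targets
print(theta, acceptance_rate(r_train, theta))   # 0.4 0.8
```

The outlier acceptance rate at that Θ must then be estimated separately, e.g. from artificially generated outliers, since no real negatives are assumed.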


One-Class Learning

- Characteristics of one-class approaches
  - Robustness to outliers.
  - The class description can possibly be made tighter with known outlier information.
  - Magic parameters and ease of configuration (advantage or disadvantage?):
    - How to decide an initial center?
    - How to decide the best threshold?


ICML Papers

- "A Needle in a Haystack: Local One-Class Optimization", Koby Crammer & Gal Chechik, 2004, ICML
- "Robust One-Class Clustering using Hybrid Global and Local Search", Gunjan Gupta & Joydeep Ghosh, 2005, ICML
- "Authorship Verification as a One-Class Classification Problem", Moshe Koppel & Jonathan Schler, 2004, ICML


A Needle in a Haystack: Local One-Class Optimization

- The problem: finding a small and coherent subset of points in given data.
- The goal is to identify a subset of meaningful samples from the large set of data points. (Assumption: meaningful samples are clustered together!)
- They formalize the learning task as an optimization problem using the Information-Bottleneck (IB) principle.


A Needle in a Haystack: Local One-Class Optimization

- Information-Bottleneck method
  - The original formulation of the IB was proposed by Tishby et al., 1999.
  - Define the relevant information in a signal x ∈ X as the information that this signal provides about another signal y ∈ Y.
  - Find a short code for X that preserves the maximum information about Y.
  - That is, squeeze the information that X provides about Y through a 'bottleneck' formed by a limited set of codewords X̃.


A Needle in a Haystack: Local One-Class Optimization

- Bregman divergences as a distance measure
  - Defined via a strictly convex function F: Λ → ℝ on a closed, convex set Λ ⊆ ℝ^n.
  - Assume that F is continuously differentiable at all points of Λ_int (the interior of Λ).
  - The Bregman divergence associated with F is defined for v ∈ Λ and w ∈ Λ_int as

    B_F(v‖w) = F(v) − [F(w) + ∇F(w) · (v − w)].

  - B_F measures the difference between F and its first-order expansion about w, evaluated at v.
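A quick sketch of this definition (my own illustration): with F(v) = ‖v‖², the Bregman divergence reduces to the familiar squared Euclidean distance ‖v − w‖².

```python
# B_F(v||w) = F(v) - [F(w) + grad F(w) . (v - w)], instantiated with
# F(v) = ||v||^2, whose Bregman divergence is ||v - w||^2.
def bregman(F, gradF, v, w):
    correction = sum(g * (vi - wi) for g, vi, wi in zip(gradF(w), v, w))
    return F(v) - (F(w) + correction)

F = lambda v: sum(x * x for x in v)      # strictly convex F
gradF = lambda v: [2 * x for x in v]     # its gradient

print(bregman(F, gradF, [1.0, 2.0], [0.0, 0.0]))  # 5.0 == ||v - w||^2
```

Other choices of F give other divergences (e.g. the negative entropy yields the KL divergence), which is what lets the paper state its results for a whole family of distortion measures.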


A Needle in a Haystack: Local One-Class Optimization

- Formulation of the optimization problem
  - C ∈ {TRUE, FALSE}: the event of being assigned to the ball.
  - p(x) = p(X = x): the prior distribution over the samples.
  - The learning task is written as a minimization of the tradeoff between two terms:

    min_{q(C|x), w}  β D(C, w; X) + I(C; X),

    where I(C; X) is the mutual information between X and C, and β is a free parameter.


A Needle in a Haystack: Local One-Class Optimization

- Formulation of the optimization problem
  - The learning task:  min_{q(C|x), w}  β D(C, w; X) + I(C; X)
  - The first term, D, is an average distortion:

    D(C, w; X) = Σ_x p(x) D(C|x, w; v_x),

    D(C|x, w; v_x) = q(C|x) B_F(v_x‖w) + (1 − q(C|x)) R.

  - Substituting one into the other:

    D(C, w; X) = Σ_x p(x) q(C|x) (B_F(v_x‖w) − R) + R Σ_x p(x).


A Needle in a Haystack: Local One-Class Optimization

- Formulation of the optimization problem
  - The learning task:  min_{q(C|x), w}  β D(C, w; X) + I(C; X)
  - The second term, I(C; X), measures how strongly the model C compresses the data:

    I(C; X) = h(X) − h(X|C),

    where h(·) is the differential entropy.
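For a discrete toy case, the compression term can be computed directly from the prior p(x) and the soft assignments q(C|x). This sketch and its numbers are my own, with C binary (in/out of the ball) as in the paper:

```python
# I(C;X) = sum_x p(x) sum_c q(c|x) log( q(c|x) / q(c) ),
# where q(c) = sum_x p(x) q(c|x) is the marginal over C.
import math

def mutual_information(p, q_in):
    """p[x] is the prior; q_in[x] = q(C=in | x); C is binary."""
    qC = sum(px * qx for px, qx in zip(p, q_in))  # marginal q(C=in)
    total = 0.0
    for px, qx in zip(p, q_in):
        for q_cond, q_marg in ((qx, qC), (1 - qx, 1 - qC)):
            if q_cond > 0:
                total += px * q_cond * math.log(q_cond / q_marg)
    return total

p = [0.25, 0.25, 0.25, 0.25]
print(mutual_information(p, [0.5, 0.5, 0.5, 0.5]))      # 0.0: ignores x
print(mutual_information(p, [1.0, 1.0, 0.0, 0.0]) > 0)  # True: C depends on x
```

Intuitively, a constant assignment compresses maximally (zero bits about X), while a hard, x-dependent assignment pays the full information cost; the β tradeoff above balances this cost against the distortion D.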


A Needle in a Haystack: Local One-Class Optimization

- Formulation of the optimization problem
  - The learning task:  min_{q(C|x), w}  β D(C, w; X) + I(C; X)
  - The optimization problem with the combined terms:

    min_{q(C|x), w}  F(q(C|x), w) = I(X; C) + β Σ_x p(x) q(C|x) (B_F(v_x‖w) − R),

    subject to q(C|x) ∈ [0, 1] ∀x.


A Needle in a Haystack: Local One-Class Optimization

- Properties of the solution
  - The marginal over C is a soft (marginal) assignment:

    q(C) = Σ_x p(x) q(C|x).

  - The location of the centroid w:

    w = (1/q(C)) Σ_x p(x) q(C|x) v_x.

  - The probabilities q(C|x) express the distance between the centroid w and each of the points v_x:

    q(C|x) = 1 / { 1 + [(1 − q(C)) / q(C)] · e^{β [B_F(v_x‖w) − R]} }.


A Needle in a Haystack: Local One-Class Optimization

- Properties of the solution

    q(C|x) = 1 / { 1 + [(1 − q(C)) / q(C)] · e^{β [B_F(v_x‖w) − R]} }.

  - When β = 0, q(C|x) = q(C) regardless of the specific sample value v_x.
  - When β → ∞, q(C|x) attains one of three values:

    lim_{β→∞} q(C|x) = 1 if B_F(v_x‖w) < R;  0 if B_F(v_x‖w) > R;  q(C) if B_F(v_x‖w) = R.

  - For a given w, the best assignment for x minimizes the loss function L = min{B_F(v_x‖w), R}.
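These self-consistent equations suggest an alternating scheme: recompute the centroid w from the soft assignments, then the assignments q(C|x) from w. The sketch below is my own illustration (not the paper's algorithm as published), assuming a squared-Euclidean Bregman divergence and a uniform prior p(x) = 1/n:

```python
# Alternating fixed-point updates for local one-class optimization:
#   w      = (1/q(C)) * sum_x p(x) q(C|x) v_x
#   q(C|x) = 1 / (1 + ((1-q(C))/q(C)) * exp(beta * (B_F(v_x||w) - R)))
import math

def sqdist(v, w):
    return sum((a - b) ** 2 for a, b in zip(v, w))

def local_one_class(points, beta=5.0, R=1.0, iters=50):
    n, dim = len(points), len(points[0])
    q = [0.5] * n                      # initial soft assignments
    for _ in range(iters):
        qC = sum(q) / n                # marginal q(C), with p(x) = 1/n
        w = [sum(q[i] * points[i][k] for i in range(n)) / (n * qC)
             for k in range(dim)]      # centroid update
        q = [1.0 / (1.0 + (1 - qC) / qC
                    * math.exp(beta * (sqdist(v, w) - R)))
             for v in points]          # assignment update
    return w, q

# A tight cluster near the origin plus two far outliers (the "haystack").
pts = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.1],
       [5.0, 5.0], [-6.0, 4.0]]
w, q = local_one_class(pts)
print([round(x, 2) for x in w])  # centroid lands near [0, 0]
print([round(x, 2) for x in q])  # ~1 for cluster points, ~0 for outliers
```

On this toy data the iteration locks onto the small coherent cluster and pushes the outliers' assignments to zero, which is exactly the "needle in a haystack" behavior described above; like any local scheme, it can still land in a bad local minimum depending on the start.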



A Needle in a Haystack: Local One-Class Optimization

- Advantages and disadvantages
  - Achieves a more compact class description.
    - But: not sure how much it can help for real-world problems.
    - Trade-off between recall and precision.
  - Chance of ending up in a bad local minimum.
  - The problems of choosing the initial R and w (center) are not addressed.


Robust One-Class Clustering ...

- Improvement over Koby Crammer & Gal Chechik, 2004
  - Adds a global search method using an approximation algorithm, and uses its output to initialize a local search.
- Their approximation
  - Proposition 3.1: if c is restricted to c ∈ Z, then the number of distinct clusters of size 2 through n is n(n − 1), and they can be enumerated.
  - Uses the observation that the cluster representative and the farthest point determine the members of G, and such a tuple can only be picked from Z.


Robust One-Class Clustering ...

- Cost function
  - Definition 1: find the cluster G = {p_1, p_2, ..., p_s} ⊂ Z of size s that has the smallest cost.
  - Definition 2: find the largest cluster G of cost less than or equal to q_max.
  - Cost as a function of distance: given a distance measure D(x, y) → [0, ∞) and a cluster representative c ∈ ℝ^d:
    - Average-distance cost:

      Q_AD(G, c) = (1/s) Σ_{i=1}^{s} D(p_i, c).

    - Maximum-distance cost (serves as a cost threshold for local search):

      Q_MD(G, c) = max_{i=1..s} D(p_i, c).
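The two cost functions are straightforward to write down; this small sketch (my own, with made-up points and Euclidean distance assumed for D) shows both:

```python
# Average-distance cost Q_AD and maximum-distance cost Q_MD for a
# candidate cluster G with representative c.
import math

def d(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def q_ad(G, c):
    """Average distance of cluster members to the representative c."""
    return sum(d(p, c) for p in G) / len(G)

def q_md(G, c):
    """Maximum distance -- usable as a cost threshold for local search."""
    return max(d(p, c) for p in G)

G = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
c = [0.0, 0.0]
print(q_ad(G, c))  # (0 + 1 + 2) / 3 = 1.0
print(q_md(G, c))  # 2.0
```

With a fixed representative, Definition 2 then amounts to sorting the points by distance to c and taking the longest prefix whose cost stays within q_max, which is what makes the n(n − 1) enumeration of Proposition 3.1 workable.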


Robust One-Class Clustering ...

- Algorithm 1 (presented as a figure in the paper; not captured in this transcript)


Robust One-Class Clustering ...

- Results & conclusion
  - Results against labeled data: the 173 experiments in the Gasch data, using the 6,151 genes as features → precision of 1 and recall of 0.41.
  - A fast global approximation.
  - Again, problems with the initial size s and q_max.


Authorship Verification ...

- In the authorship verification problem, we need to determine whether given texts were or were not written by a single author.
- The problem is how to make a classifier learn in-class variation.
- Verification is thus essentially a one-class problem.
- Two important points:
  - They did not actually lack negative examples.
  - Only long texts are considered as examples.
- Test the rate at which the accuracy of learned models degrades as the best features are iteratively dropped from the learning process.


Authorship Verification ...

- A new approach: Unmasking
  - The intuitive idea of unmasking is to iteratively remove those features that are most useful for distinguishing between A and X.
  - Measure the speed with which cross-validation accuracy degrades as more features are removed.
  - The main hypothesis is that if A and X are by the same author, then whatever differences exist between them are caused by only a relatively small number of features.


Authorship Verification ...

- Baseline: one-class SVM
  - Used a one-class SVM on the 250 most frequent words in Ax to build a model.
  - Used an SVM with a linear kernel.
- Applying unmasking
  - Iteratively remove selected features in each fold of a 10-fold cross-validation step.
  - Record a performance-degradation curve for each X.
  - Obtains an overall accuracy of 95.7%.


Authorship Verification ...

1. Determine the accuracy results of a ten-fold cross-validation experiment for AX against X.
2. For the model obtained in each fold, eliminate the 3 most strongly weighted positive features and the 3 most strongly weighted negative features.
3. Go to step 1.
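The loop above can be sketched in miniature. This is my own simplification, not the authors' code: a toy centroid-difference linear "classifier" stands in for the linear SVM, tiny invented word-frequency vectors stand in for real corpora, one feature is dropped per direction (k=1) instead of three, and the full cross-validation is omitted:

```python
# Unmasking sketch: train a linear model, record accuracy, strip the
# strongest positive and negative features, repeat, and watch the curve.
def train_weights(A, X, active):
    """Linear weights from centroid differences over the active features."""
    mean = lambda docs, j: sum(doc[j] for doc in docs) / len(docs)
    return {j: mean(A, j) - mean(X, j) for j in active}

def accuracy(A, X, wts):
    score = lambda doc: sum(wts[j] * doc[j] for j in wts)
    correct = (sum(1 for doc in A if score(doc) > 0)
               + sum(1 for doc in X if score(doc) <= 0))
    return correct / (len(A) + len(X))

def unmask(A, X, rounds=3, k=1):
    """Drop the k strongest positive and k strongest negative features per
    round; return the accuracy-degradation curve."""
    active = set(range(len(A[0])))
    curve = []
    for _ in range(rounds):
        wts = train_weights(A, X, active)
        curve.append(accuracy(A, X, wts))
        ranked = sorted(active, key=lambda j: wts[j])
        active -= set(ranked[:k] + ranked[-k:])  # strongest each way
    return curve

# Toy corpora: A and X differ mainly in features 0 and 3.
A = [[5, 1, 1, 0], [6, 1, 1, 0]]
X = [[0, 1, 1, 4], [1, 1, 1, 5]]
print(unmask(A, X))  # [1.0, 0.5, 0.5]: accuracy collapses immediately
```

The fast collapse is the same-author signature: only a few features carried the separation. A genuinely different author would keep the curve high for many more rounds.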


Conclusion

- The high number of false-positive predictions on imbalanced data might be reduced by a One-Class Learning approach.
- Better ways to decide the initial parameters are needed, to avoid "magic numbers".
- Koppel et al., 2004 showed that a completely different scheme could work well on learning problems.
