Toy Example
• Computer receives telephone call
• Measures Pitch of voice
• Decides gender of caller
[Figure: distribution of human voice pitch, male vs. female]
Traditional Statistics vs. Machine Learning
[Diagram: Traditional approach — Data → Estimated world state (Statistics) → Predictions/Actions (Decision Theory).
Machine Learning — maps Data directly to Predictions/Actions.]
A weighted training set
• Feature vectors x_i
• Binary labels y_i ∈ {−1,+1}
• Positive weights w_i

(x_1,y_1,w_1), (x_2,y_2,w_2), …, (x_m,y_m,w_m)
A weak learner
[Diagram: a weighted training set (x_1,y_1,w_1), (x_2,y_2,w_2), …, (x_n,y_n,w_n)
is fed to the weak learner, which outputs a weak rule h mapping instances
x_1, x_2, …, x_n to labels y_1, y_2, …, y_n]

The weak requirement:
• Feature vector x_i
• Binary label y_i
• Non-negative weights w_i that sum to 1
• The weak rule must beat random guessing on the weighted set:
  Σ_{i: h(x_i) ≠ y_i} w_i ≤ 1/2 − γ for some advantage γ > 0
The boosting process
[Diagram: round 1 — uniform weights (x_1,y_1,1/n), …, (x_n,y_n,1/n) → weak learner → h_1;
round t — reweighted set (x_1,y_1,w_1), …, (x_n,y_n,w_n) → weak learner → h_t;
repeated for t = 1, …, T]

Final rule: sign[ α_1 h_1 + α_2 h_2 + … + α_T h_T ]
Adaboost
• Binary labels y ∈ {−1,+1}
• margin(x,y) = y · Σ_t α_t h_t(x)
• P(x,y) = (1/Z) exp(−margin(x,y))
• Given h_t, we choose α_t to minimize Σ_{(x,y)} exp(−margin(x,y)), which gives

  α_t = (1/2) ln( Σ_{i: h_t(x_i) = y_i} w_i^t / Σ_{i: h_t(x_i) ≠ y_i} w_i^t )

  where w_i^t = exp(−y_i F_{t−1}(x_i))
Adaboost
F_0(x) ≡ 0
for t = 1..T:
    get h_t from weak learner
    F_t = F_{t−1} + α_t h_t
Freund, Schapire 1997
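The full boosting loop can be sketched in plain Python. The threshold-stump weak learner, the numerical guard on the weighted error, and the 1-D data in the usage example are illustrative assumptions, not part of the slides.

```python
import math

def adaboost(examples, weak_learner, T):
    """F_0 = 0; for t = 1..T: get h_t from the weak learner on the
    current weights, pick alpha_t, and set F_t = F_{t-1} + alpha_t h_t."""
    F = [0.0] * len(examples)          # F_{t-1}(x_i) tracked per example
    rules = []
    for _ in range(T):
        # w_i ∝ exp(-y_i F_{t-1}(x_i)), normalized to sum to 1
        raw = [math.exp(-y * f) for (_, y), f in zip(examples, F)]
        z = sum(raw)
        w = [r / z for r in raw]
        h = weak_learner(examples, w)
        # weighted error of h_t; alpha_t = (1/2) ln((1 - eps)/eps)
        eps = sum(wi for (x, y), wi in zip(examples, w) if h(x) != y)
        eps = min(max(eps, 1e-12), 1 - 1e-12)   # guard against eps = 0 or 1
        alpha = 0.5 * math.log((1 - eps) / eps)
        rules.append((alpha, h))
        F = [f + alpha * h(x) for (x, _), f in zip(examples, F)]
    # final rule: sign of the weighted vote
    return lambda x: 1 if sum(a * h(x) for a, h in rules) >= 0 else -1

def stump_learner(examples, w):
    """Hypothetical weak learner: exhaustive search over threshold
    stumps h(x) = ±sign(x > th) on a single real-valued feature."""
    xs = sorted({x for x, _ in examples})
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best, best_err = None, 2.0
    for th in thresholds:
        for s in (1, -1):
            err = sum(wi for (x, y), wi in zip(examples, w)
                      if (s if x > th else -s) != y)
            if err < best_err:
                best_err, best = err, (th, s)
    th, s = best
    return lambda x, th=th, s=s: s if x > th else -s
```

For example, four 1-D points labeled by a threshold at 1.5 are classified perfectly after a few rounds: `adaboost([(0.0,-1),(1.0,-1),(2.0,1),(3.0,1)], stump_learner, 5)`.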
Main property of adaboost
• If the advantages of the weak rules over random guessing are γ_1, …, γ_T, then the in-sample error of the final rule is at most Π_t √(1 − 4γ_t²) ≤ exp(−2 Σ_t γ_t²)
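The training-error bound (in its commonly stated form, err ≤ Π_t √(1 − 4γ_t²) ≤ exp(−2 Σ_t γ_t²)) is easy to evaluate numerically; the choice of 100 rules each with a 10% advantage is an illustrative assumption.

```python
import math

def adaboost_error_bound(gammas):
    # In-sample error bound: prod_t sqrt(1 - 4*gamma_t^2),
    # which is itself at most exp(-2 * sum_t gamma_t^2)
    prod = 1.0
    for g in gammas:
        prod *= math.sqrt(1.0 - 4.0 * g * g)
    return prod

gammas = [0.1] * 100   # 100 weak rules, each 10% better than random
bound = adaboost_error_bound(gammas)   # roughly 0.13, below exp(-2)
```

So even a tiny edge per round drives the training error down exponentially fast in T.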
Adaboost as gradient descent
• Discriminator class: a linear discriminator in the space of “weak hypotheses”
• Original goal: find the hyperplane with the smallest number of mistakes – known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space)
• Computational method: Use exponential loss as a surrogate, perform gradient descent.
Margins view
x, w ∈ R^n;  y ∈ {−1,+1}
Prediction = sign(w · x)
Margin = y (w · x) / (‖w‖ ‖x‖)

[Figure: + and − examples separated by direction w; projecting onto w gives
a plot of cumulative # examples vs. margin — examples with negative margin
are mistakes, examples with positive margin are correct]
One coordinate at a time
• Adaboost performs gradient descent on the exponential loss
• Adds one coordinate (“weak learner”) at each iteration
• Weak learning in binary classification = slightly better than random guessing
• Weak learning in regression – unclear
• Uses example weights to communicate the gradient direction to the weak learner
• Solves a computational problem
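The example weights are exactly how the gradient is communicated: w_i ∝ exp(−y_i F(x_i)) is the magnitude of the per-example derivative of the exponential loss. A minimal sketch (function name is illustrative):

```python
import math

def example_weights(F_vals, ys):
    """Weights handed to the weak learner: w_i ∝ exp(-y_i F(x_i)).
    This is |d/dF(x_i)| of sum_i exp(-y_i F(x_i)), so minimizing the
    weighted error points the weak learner along the negative gradient."""
    raw = [math.exp(-y * f) for f, y in zip(F_vals, ys)]
    z = sum(raw)
    return [r / z for r in raw]
```

Low-margin (badly handled) examples automatically receive the largest weight.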
What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate the predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
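The “exhaustive search for the minimal weighted training error” is concrete for threshold stumps over one real feature: every midpoint between sorted values is a candidate. A minimal sketch (function name and candidate grid are illustrative):

```python
def best_stump(xs, ys, w):
    """Exhaustively search threshold rules h(x) = s if x > theta else -s
    over one real feature, minimizing weighted training error."""
    candidates = sorted(set(xs))
    thresholds = [candidates[0] - 1.0] + [
        (a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    best = (2.0, None, None)   # (error, theta, sign)
    for theta in thresholds:
        for s in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (s if x > theta else -s) != y)
            if err < best[0]:
                best = (err, theta, s)
    return best
```

With n examples there are only O(n) thresholds to try, which is what makes the search over the whole rule class feasible.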
Decision tree as a sum

[Figure: a decision tree over features X and Y (splits X > 3 and Y > 5),
redrawn as a sum of node contributions — root −0.2, branch contributions
−0.1/+0.1 and +0.2/−0.3 — with the final label sign(sum) ∈ {+1, −1}]
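The figure's idea — a tree's prediction written as a sum of contributions along the root-to-leaf path rather than a value stored at the leaf — can be sketched directly. The node values below mirror the slide's figure but the exact assignment of values to branches is an assumption:

```python
def tree_as_sum(x, y):
    """Hypothetical tree with the slide's splits (X > 3, Y > 5), written
    as a sum of node contributions along the path taken by the input."""
    score = -0.2                        # root contribution
    if x > 3:
        score += 0.1                    # contribution of the X > 3 branch
        score += 0.2 if y > 5 else -0.3 # contribution of the Y > 5 split
    else:
        score += -0.1                   # contribution of the X <= 3 branch
    return 1 if score >= 0 else -1      # final label = sign of the sum
```

This rewriting changes nothing about an ordinary tree, but it is the step that makes alternating decision trees possible.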
An alternating decision tree
[Figure: the same example extended to an alternating decision tree over X and Y —
prediction nodes (root −0.2; +0.2/−0.3 under Y > 5; −0.1/+0.1 under X > 3;
0.0/+0.7 under Y < 1) are summed along all paths consistent with the input,
and the final label sign(sum) ∈ {+1, −1}]
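An ADT differs from an ordinary tree in that all root-level splitter nodes contribute for every input, while nested nodes contribute only when their parent branch is consistent with the input. A sketch modeled loosely on the figure's node values (the exact structure is an assumption):

```python
def adt_score(x, y):
    """Score of a small alternating decision tree: F(x) is the sum of
    prediction nodes along ALL paths consistent with the input."""
    score = -0.2                        # root prediction node
    # splitter 1 under the root: always contributes
    score += 0.2 if y > 5 else -0.3
    # splitter 2 under the root: siblings all contribute as well
    score += 0.1 if x > 3 else -0.1
    # splitter 3 nested under the "y > 5 is false" branch:
    # contributes only when that branch is consistent with the input
    if not (y > 5):
        score += 0.7 if y < 1 else 0.0
    return score

def adt_predict(x, y):
    return 1 if adt_score(x, y) >= 0 else -1
```

Each boosting round adds one new splitter node somewhere in the tree, so the ADT is the sum of all the weak rules found so far.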
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, −1 = sick)
• 13 features from tests (real-valued and discrete).
• 303 instances.
Cross-validated accuracy
Learning algorithm   Number of splits   Average test error   Test error variance
ADtree                      6                 17.0%                 0.6%
C5.0                       27                 27.2%                 0.5%
C5.0 + boosting           446                 20.2%                 0.5%
Boost Stumps               16                 16.5%                 0.8%
Curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
Theorem
For any convex combination of weak rules and any threshold θ > 0:

  P[mistake] ≤ (fraction of training examples with margin ≤ θ) + Õ(√(d/m) / θ)

where m is the size of the training sample and d is the VC-dimension of the weak rules.

No dependence on the number of weak rules that are combined!!!
Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998
Academic research
Database     Other            Boosting    Error reduction
Cleveland    27.2 (DT)          16.5            39%
Promoters    22.0 (DT)          11.8            46%
Letter       13.8 (DT)           3.5            74%
Reuters 4    5.8, 6.0, 9.8      2.95           ~60%
Reuters 8    11.3, 12.1, 13.4   7.4            ~40%

% test error rates
Applied research
• “AT&T, How may I help you?”
• Classify voice requests
• Voice → text → category
• Fourteen categories:
  Area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time
Schapire, Singer, Gorin 98
Examples
• “Yes I’d like to place a collect call long distance please” → collect
• “Operator I need to make a call but I need to bill it to my office” → third party
• “Yes I’d like to place a call on my master card please” → calling card
• “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” → billing credit

Weak rules generated by “boostexter”

[Figure: each weak rule tests whether a word or phrase (e.g. “collect”,
“calling card”, “third party”) occurs in the utterance, and predicts a
category depending on “word occurs” vs. “word does not occur”]
Results
• 7844 training examples – hand transcribed
• 1000 test examples – hand / machine transcribed
• Accuracy with 20% rejected:
  – Machine transcribed: 75%
  – Hand transcribed: 90%
Commercial deployment
• Distinguish business/residence customers
• Using statistics from call-detail records
• Alternating decision trees
  – Similar to boosting decision trees, but more flexible
  – Combines very simple rules
  – Can over-fit; cross validation used to stop
Freund, Mason, Rogers, Pregibon, Cortes 2000
Summary
• Boosting is a computational method for learning accurate classifiers
• Resistance to over-fit explained by margins
• Underlying explanation – large “neighborhoods” of good classifiers
• Boosting has been applied successfully to a variety of classification problems
Gene Regulation
• Regulatory proteins bind to the non-coding regulatory sequence of a gene to control its rate of transcription

[Diagram: regulators bind to binding sites on the DNA upstream of a gene,
controlling production of the mRNA transcript — the measurable quantity]

• Microarrays measure mRNA transcript expression levels for all of the ~6000 yeast genes at once.
• Very noisy data
• Rough time slice over all compartments of many cells.
• Protein expression not observed
Partial “Parts List” for Yeast
Many known and putative:
– Transcription factors
– Signaling molecules that activate transcription factors
– Known and putative binding site “motifs”
– In yeast, regulatory sequence = 500 bp upstream region
GeneClass: Problem Formulation
Predict target gene regulatory response from regulator activity and binding site data.

[Diagram: “parent” (regulator) gene expression levels R_1, R_2, R_3, R_4, …, R_p
measured on the microarray, together with binding sites (motifs) in the
upstream region, are used to predict the expression of target genes G_1, G_2, G_3, G_4, …, G_t]
M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie.Predicting Genetic Regulatory Response Using Classification. ISMB 2004.
Role of quantization
• Expression is quantized into three classes: −1, 0, +1
• By quantizing expression into three classes we reduce noise but maintain most of the signal
• Weighting +1/−1 examples linearly with expression level performs slightly better
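The three-class quantization can be sketched as a simple thresholding of the log expression ratio; the cutoff value below is a hypothetical choice, since the slides do not state one.

```python
def quantize(log_ratio, threshold=0.5):
    """Quantize a log expression ratio into {-1, 0, +1}.
    The threshold value is an illustrative assumption."""
    if log_ratio > threshold:
        return 1       # up-regulated
    if log_ratio < -threshold:
        return -1      # down-regulated
    return 0           # no significant change
```

Measurements near zero, which are dominated by noise, all collapse to the 0 class instead of contributing spurious labels.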
Problem setup
• Data point = Target gene × Microarray
• Input features:
  – Parent state {−1,0,+1}
  – Motif presence {0,1}
• Predict output:
  – Target gene {−1,+1}
Boosting with Alternating Decision Trees (ADTs)
• Use boosting to build a single ADT, a margin-based generalization of a decision tree

[Diagram: a splitter node — “Is motif MIG1 present AND parent XBP1 up?” —
with its prediction nodes; F(x) is given by the sum of prediction nodes
along all paths consistent with x]
Statistical Validation
• 10-fold cross-validation experiments, ~50,000 (gene/microarray) training examples
• Significant correlation between prediction score and true log expression ratio on held-out data.
• Prediction accuracy on +1/−1 labels: 88.5%
Biological Interpretation
From correlation to causation
• Good prediction only implies correlation.
• To infer causation we need to integrate additional knowledge.
• Comparative case studies: train on similar conditions (stresses), test on related experiments
• Extract significant features from the learned model:
  – Iteration score (IS): boosting iteration at which a feature first appears. Identifies significant motifs and motif–parent pairs.
  – Abundance score (AS): number of nodes in the ADT containing a feature. Identifies important regulators.
• In silico knock-outs: remove a significant regulator and retrain.
Case Study: Heat Shock and Osmolarity
Training set: heat shock, osmolarity, amino acid starvation
Test set: stationary phase, simultaneous heat shock + osmolarity
Results:
• Test error = 9.3%
• Supports the Gasch hypothesis: the heat shock and osmolarity pathways are independent and additive
• High scoring parents (AS): USV1 (stationary phase and heat shock), PPT1 (osmolarity response), GAC1 (response to heat)
Case Study: Heat Shock and Osmolarity
Results:
• High scoring binding sites (IS):
  – MSN2/MSN4 STRE element
  – Heat shock related: HSF1 and RAP1 binding sites
  – Osmolarity/glycerol pathways: CAT8, MIG1, GCN4
  – Amino acid starvation: GCN4, CHA4, MET31
• High scoring motif–parent pair (IS): TPK1~STRE pair (kinase that regulates MSN2 via cellular localization) – indirect effect

[Diagram: three ways a motif–parent pair can arise — direct binding of the
transcription factor, an indirect effect through another protein, and
co-occurrence]
Case Study: In silico knockout
• Training and test sets: same as the heat shock and osmolarity case study
• Knockout: remove USV1 from the regulator list and retrain
• Results:
  – Test error: 12% (increase from 9%)
  – Identify putative downstream targets of USV1: target genes that change from a correct to an incorrect label
  – GO annotation analysis reveals putative functions: nucleoside transport, cell-wall organization and biogenesis, heat-shock protein activity
  – Putative functions match those identified in the wet lab USV1 knockout (Segal et al., 2003)
Conclusions: Gene Regulation
• New predictive model for the study of gene regulation:
  – First gene regulation model to make quantitative predictions
  – Uses actual expression levels – no clustering
  – Strong prediction accuracy on held-out experiments
  – Interpretable hypotheses: significant regulators, binding motifs, regulator–motif pairs
• New methodology for biological analysis: comparative training/test studies, in silico knockouts
Summary
• Boosting is an efficient and flexible method for constructing complex and accurate classifiers.
• Correlation -> Causation : still a hard problem, requires domain specific expertise and integration of data sources.
Improvement suggestions...
• Using binary labels simplifies the algorithm but doesn’t reflect reality.
• “Confusion table”.
Large margins
margin_{F_T}(x,y) ≜ y · Σ_{t=1}^T α_t h_t(x) / Σ_{t=1}^T |α_t| = y · F_T(x) / Σ_{t=1}^T |α_t|

The normalized margin lies in [−1,+1], and

margin_{F_T}(x,y) > 0  ⟺  f_T(x) = y

Thesis: large margins ⇒ reliable predictions
Very similar to SVM.
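The normalized margin y · Σ_t α_t h_t(x) / Σ_t |α_t| is straightforward to compute from the votes; a minimal sketch (function name is illustrative):

```python
def normalized_margin(alphas, h_vals, y):
    """margin(x, y) = y * sum_t alpha_t h_t(x) / sum_t |alpha_t|.
    Always in [-1, +1]; positive exactly when the vote is correct."""
    num = sum(a * h for a, h in zip(alphas, h_vals))
    den = sum(abs(a) for a in alphas)
    return y * num / den
```

For example, two rules with weights 1.0 and 0.5 voting +1 and −1 on a positive example give margin (1.0 − 0.5)/1.5 = 1/3.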
Theorem (Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998)

H: set of binary functions with VC-dimension d

C = { Σ_i α_i h_i | h_i ∈ H, α_i ≥ 0, Σ_i α_i = 1 }

T = (x_1,y_1), (x_2,y_2), …, (x_m,y_m);  T ~ D^m

∀ c ∈ C, ∀ θ > 0, with probability 1 − δ w.r.t. T ~ D^m:

P_{(x,y)~D}[sign(c(x)) ≠ y] ≤ P_{(x,y)~T}[margin_c(x,y) ≤ θ] + Õ(√(d/m) / θ) + O(√(log(1/δ)/m))

No dependence on the number of combined functions!!!