Visualization and Machine Learning for exploratory data analysis
Xiaochun Li
Division of Biostatistics, Indiana University School of Medicine
Regenstrief Institute
May 2, 2008 / CCBB Journal Club
Xiaochun Li Visualization and ML
Outline
1 Introduction
2 Visualization: As Is, Simple Summarization, More Advanced Methods
3 Machine Learning: Supervised Learning, Unsupervised Learning, Random Forests, SVM
Introduction
Mining large-scale data sets calls for methods that
- search for patterns, e.g., biologically important gene sets or samples
- present data structure succinctly
Both are essential in the analysis.
Visualization: Objective
An essential part of exploratory data analysis, and of reporting the results.
- plot data as is
- plot data after simple summarization
- plot data based on more advanced methods: clustering, PCA (principal component analysis), MDS (multidimensional scaling), silhouette, randomForest, ...
Plot data as is: Quality Inspection
An Affymetrix chip image. Some images may have obvious local contamination.
Plot data as is: Quality Inspection
[Figure: four 16 x 24 plate images, panels labeled Ins+, white; Ins−, white; Ins+, black; Ins−, black]
An RNAi experiment with white and black plates, insulin stimulated +/-.
Plot data as is: R tools
- image or heatmap for any chip arrays
- for cell-based assays, one could also use plotPlate in the R package prada
Simple Summarization: Along Genomic Coordinates
[Figure: cumulative expression levels by genes in chromosome 21, scaling method: none; x-axis: representative genes (NRIP1 through MCM3AP), each sample marked + or −; y-axis: cumulative expression levels]
Cumulative expression profiles along chromosome 21 for samples from 10 children with trisomy 21 and a transient myeloid disorder, colored in red, and children with different subtypes of acute myeloid leukemia (M7), colored in blue.
Simple Summarization: Along Genomic Coordinates
- The previous wiggle plot was produced using alongChrom of the R package geneplotter.
- One could plot just a segment of the chromosome of interest.
Mass Spec Example: “Latin Square” Design for B-F
Group   Cytochrome c   Ubiquitin   Lysozyme   Myoglobin   Trypsinogen
A       0              0           0          0           0
B       0              1           2          5           10
C       1              2           5          10          0
D       2              5           10         0           1
E       5              10          0          1           2
F       10             0           1          2           5
G       10             10          10         10          10
Design and the protein concentration units: Ubiquitin (1 fmol/uL), Cytochrome c/Lysozyme/Myoglobin (10 fmol/uL), Trypsinogen (100 fmol/uL).
Mass Spec: Example
[Figure: one spectrum from group A; x-axis: mz, 0 to 1e+05; y-axis: intensity, 0 to 40]
One spectrum from group A.
Mass Spec: MDS
[Figure: 3-D scatter plot; axes: first, second and third coordinates]
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Mass Spec: pairs plot
[Figure: scatter-plot matrix of spec 1 through spec 4; lower-left correlations: 0.66, 0.60, 0.98, 0.59, 0.97, 0.99]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: pairs plot
[Figure: scatter-plot matrix of spec 1 through spec 4; lower-left correlations: 0.99, 0.96, 0.98, 0.96, 0.98, 0.99]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: MDS in 3-D
[Figure: 3-D scatter plot; axes: first, second and third coordinates]
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram, hclust(*, "complete") on d.s.nocut; leaves labeled A, D and G; height 0 to 50]
Dendrogram of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram, hclust(*, "complete") on d.s.cut; leaves labeled A, D and G; height 0 to 400]
Dendrogram of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
Whole spec: n = 39, 3 clusters C_j; average silhouette width: 0.57
j : n_j | ave_{i in C_j} s_i
1 : 17 | 0.67
2 : 16 | 0.48
3 : 6 | 0.56
Silhouette plot of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
mz < 1000 cut: n = 39, 3 clusters C_j; average silhouette width: 0.65
j : n_j | ave_{i in C_j} s_i
1 : 13 | 0.82
2 : 13 | 0.60
3 : 13 | 0.53
Silhouette plot of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: silhouette width
For each observation i, the silhouette width s_i is defined as follows:
- a_i = average dissimilarity between i and all other points of the cluster to which i belongs
- for each other cluster C, put d_{i,C} = average dissimilarity of i to all observations of C
- b_i = min_C d_{i,C}, which can be seen as the dissimilarity between i and its “neighbor” cluster, i.e., the nearest one to which it does not belong
- s_i = (b_i − a_i) / max(a_i, b_i)
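The definition above translates directly into code. A minimal pure-Python sketch, with Euclidean distance chosen as the dissimilarity for illustration (in R one would simply call silhouette from the cluster package):

```python
# Silhouette width s_i computed directly from the definition above,
# with Euclidean distance as the dissimilarity.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_widths(points, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for each observation i."""
    widths = []
    clusters = set(labels)
    for i, p in enumerate(points):
        # a_i: average dissimilarity to the other members of i's own cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        a_i = sum(own) / len(own)
        # b_i: smallest average dissimilarity to any other cluster
        b_i = min(
            sum(dist(p, points[j]) for j in range(len(points)) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths

# two tight, well-separated clusters: all widths close to 1
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
labs = [0, 0, 1, 1]
ws = silhouette_widths(pts, labs)
print([round(w, 2) for w in ws])
```

Widths near 1 indicate well-clustered points; values near 0 or below indicate points sitting between clusters.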
Visualization: R tools
- classical MDS: cmdscale
- 2-D and 3-D scatter plots: plot and the R package scatterplot3d
- 2-D scatter plot matrix: pairs
- silhouette plot: silhouette
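The computation behind classical MDS (what cmdscale performs) is short enough to sketch; here in Python/NumPy rather than R, as a hedged reimplementation: double-center the squared distance matrix, then scale the top eigenvectors.

```python
# Classical MDS, the computation behind R's cmdscale: double-center the
# squared distance matrix and scale the top-k eigenvectors.
import numpy as np

def cmdscale(D, k=2):
    """Embed a symmetric distance matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]           # indices of the k largest
    scale = np.sqrt(np.maximum(vals[idx], 0))  # clip tiny negatives to 0
    return vecs[:, idx] * scale                # n x k coordinates

# four corners of a unit square: a 2-D embedding reproduces all distances
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = cmdscale(D, k=2)
D2 = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D2))
```

For Euclidean distance matrices this recovers the data configuration up to rotation and reflection, which is why classical MDS on Euclidean distances coincides with PCA.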
Machine Learning
Machine Learning: computational and statistical approaches to extract important patterns and trends hidden in large data sets.
- Supervised: predict an outcome y based on X, a number of inputs (variables). E.g., predict the class labels “tumor” or “normal” based on gene expression.
- Unsupervised: no y; describe the associations and patterns among X. E.g., which subsets of genes have similar expression? Which subgroups of patients have similar gene expression profiles?
Supervised Learning
- linear model
- nearest neighbor (k-NN)
- LDA (linear discriminant analysis): same covariance Σ across classes
- LDA variants: QDA (class-specific Σ_k), DLDA (Σ is diagonal), RDA (regularized, uses αΣ + (1 − α)I)
- SVM
- randomForest
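Of these, DLDA is compact enough to sketch in full: classify to the class mean that is nearest under pooled per-coordinate variances, i.e., a diagonal Σ. A pure-Python illustration on made-up data (in R one would reach for an existing implementation):

```python
# DLDA (diagonal LDA): LDA with Sigma restricted to a diagonal matrix,
# so classification reduces to variance-scaled distance to class means.
def dlda_fit(X, y):
    classes = sorted(set(y))
    d = len(X[0])
    # per-class mean vectors
    means = {}
    for c in classes:
        pts = [x for x, yi in zip(X, y) if yi == c]
        means[c] = [sum(col) / len(pts) for col in zip(*pts)]
    # pooled per-coordinate variances: the diagonal of Sigma
    var = [0.0] * d
    for x, yi in zip(X, y):
        for j in range(d):
            var[j] += (x[j] - means[yi][j]) ** 2
    var = [v / (len(X) - len(classes)) for v in var]
    def predict(x):
        def score(c):
            return sum((x[j] - means[c][j]) ** 2 / var[j] for j in range(d))
        return min(classes, key=score)
    return predict

# two made-up 2-D classes
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
y = [0, 0, 0, 1, 1, 1]
predict = dlda_fit(X, y)
preds = [predict(x) for x in X]
print(preds)
```

The diagonal restriction is what makes DLDA usable when p >> n: only p variances are estimated instead of a full p x p covariance matrix.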
Unsupervised Learning
- Clustering
- PCA (principal component analysis)
- MDS (multidimensional scaling); classical MDS using Euclidean distance = PCA
- K-means
- SOM (self-organizing maps)
- Unsupervised as supervised learning
Unsupervised as Supervised Learning: through data augmentation
Let g(x) be the unknown density to be estimated, and g0(x) be a specified reference density.
- x_1, x_2, ..., x_n ~ iid g(x); assign class label Y = 1
- x_{n+1}, x_{n+2}, ..., x_{2n} ~ iid g0(x); assign class label Y = 0
- pooled, x_1, x_2, ..., x_{2n} ~ iid (g(x) + g0(x))/2
- µ(x) ≡ E(Y | x) = [g(x)/g0(x)] / [1 + g(x)/g0(x)] can be estimated by supervised learning using the combined sample (y_1, x_1), (y_2, x_2), ..., (y_{2n}, x_{2n})
- then g(x) = g0(x) µ(x)/(1 − µ(x))
E.g., use this technique with randomForest.
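The recipe can be run end to end on a one-dimensional toy problem. In this hedged sketch a simple histogram classifier plays the role of the slide's randomForest, g0 is the uniform density on [0, 1], and the target is g(x) = 2x:

```python
# Unsupervised -> supervised: estimate g(x) = 2x on [0, 1] by contrasting a
# sample from g with a uniform reference sample from g0, then inverting
#   mu(x) = (g/g0) / (1 + g/g0)   =>   g(x) = g0(x) * mu(x) / (1 - mu(x)).
# A histogram classifier stands in for the random forest.
import random

random.seed(0)
n, bins = 20000, 10
g_sample = [random.random() ** 0.5 for _ in range(n)]   # density 2x, label Y=1
g0_sample = [random.random() for _ in range(n)]         # uniform,    label Y=0

def bin_of(x):
    return min(int(x * bins), bins - 1)

# mu per bin = fraction of Y=1 points among the pooled points in that bin
ones = [0] * bins
zeros = [0] * bins
for x in g_sample:
    ones[bin_of(x)] += 1
for x in g0_sample:
    zeros[bin_of(x)] += 1

g_hat = []
for b in range(bins):
    mu = ones[b] / (ones[b] + zeros[b])
    g_hat.append(1.0 * mu / (1 - mu))    # g0(x) = 1 on [0, 1]

centers = [(b + 0.5) / bins for b in range(bins)]
print([round(g, 2) for g in g_hat])      # close to 2x at each bin center
```

Any classifier that estimates E(Y | x) well can be substituted for the histogram; random forests are attractive here because they handle high-dimensional x.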
What are Random Forests
Random forests are a combination of tree predictors, each grown from an iid realization of a random vector θ_k.
Example: bagging (bootstrap aggregation):
- bootstrap samples are drawn from the training set, where θ_k is the vector of counts in n boxes resulting from sampling with replacement
- a tree is grown from each bootstrap sample
- a class is assigned by majority vote.
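The three bagging steps can be sketched with one-dimensional threshold "stumps" standing in for full trees; a toy illustration of the voting scheme, not Breiman's algorithm as implemented in the randomForest package:

```python
# Bagging: bootstrap the training set, fit one weak learner (here a 1-D
# threshold stump) per bootstrap sample, then classify by majority vote.
import random

random.seed(1)

def fit_stump(xs, ys):
    """Best threshold t, trying both orientations of x vs t."""
    best = None
    for t in xs:
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else 0 for x in xs]
            acc = sum(p == yv for p, yv in zip(preds, ys)) / len(ys)
            if best is None or acc > best[0]:
                best = (acc, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) > 0 else 0

def bagged_classifier(xs, ys, n_trees=25):
    stumps = []
    for _ in range(n_trees):
        idx = [random.randrange(len(xs)) for _ in xs]   # bootstrap: theta_k
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # majority vote over the ensemble
    return lambda x: round(sum(s(x) for s in stumps) / len(stumps))

# 1-D two-class data with a boundary at 0.5
xs = [i / 40 for i in range(40)]
ys = [1 if x > 0.5 else 0 for x in xs]
clf = bagged_classifier(xs, ys)
acc = sum(clf(x) == yv for x, yv in zip(xs, ys)) / len(xs)
print(acc)
```

Each individual stump is weak; averaging over bootstrap replicates stabilizes the decision near the boundary, which is the point of bagging.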
Motivation
Improve prediction:
- a single tree has poor accuracy for problems with many variables each carrying very little information, e.g., genomics data sets
- combining trees grown using random features can improve accuracy
Assess performance:
- training error (the error rate on the training set) does not indicate performance on new data
- overfitting → small training error but poor generalization error
- we need data that were not used to grow a particular tree to assess the performance of that tree.
Strength and Correlation
For a given case (X, Y) and a given ensemble of classifiers:
- margin = proportion of votes for the right class − max over the other classes of the proportion of votes for that class
- generalization error PE* = P_{X,Y}(margin < 0)
- s ≡ strength = E_{X,Y}(margin)
- ρ ≡ correlation, a correlation between any two trees
- Theorem 1.2: the generalization error converges.
- Theorem 2.3: the generalization error is bounded, PE* ≤ ρ(1 − s²)/s².
Random Forests Converge
Theorem 1.2. As the number of trees increases, the generalization error converges a.s. for all {θ_k}.
This is why random forests do not overfit as more trees are added; they tend to a limiting value of the generalization error.
Strategy: Minimize Correlation While Keeping Strength
Grow each tree using randomly selected inputs, or combinations of inputs, at each node:
- Random input selection (Forest-RI): at each node, select at random F variables to split on; grow the tree to maximum size and do not prune.
- Random feature selection (Forest-RC): the same idea, but with F features, i.e., linear combinations of L randomly selected variables with random coefficients runif(L, -1, 1) ⇒ further reduces correlation.
Gauging Performance
Bagging makes it possible to estimate the generalization error without a test set.
Why: in any bootstrap sample, about 1/3 of the cases from the original training set are left out due to sampling with replacement: (1 − 1/n)^n ≈ e^{−1} ≈ 1/3.
Out-of-bag estimates of error, strength and correlation:
- For each (x, y), aggregate the votes over the trees grown without (x, y): the out-of-bag classifier.
- The out-of-bag estimate of generalization error is the error rate of the out-of-bag classifier.
- The same idea yields out-of-bag strength and correlation.
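The 1/3 figure is easy to check empirically; a quick simulation (the values of n and the number of repetitions are arbitrary):

```python
# Fraction of training cases left out of a bootstrap sample:
# (1 - 1/n)^n -> e^{-1} ~ 0.368, i.e., about 1/3.
import math
import random

random.seed(0)
n, reps = 500, 400
left_out = 0
for _ in range(reps):
    drawn = {random.randrange(n) for _ in range(n)}   # indices that got sampled
    left_out += n - len(drawn)                        # cases never drawn
frac = left_out / (n * reps)
print(round(frac, 3), round(math.exp(-1), 3))
```

These never-drawn cases are exactly the out-of-bag set for one tree, so roughly a third of the data acts as a built-in test set for every tree in the forest.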
Conclusions: RandomForest
- Random forests do not overfit: an effective tool for prediction.
- Fast in computation.
- Out-of-bag estimates gauge the performance of the forest.
- Forests give results competitive with boosting and adaptive bagging, without progressively changing the training set. Their accuracy indicates that they reduce bias.
- Random inputs and random features produce good results in classification, but less so in regression.
RandomForest in Unsupervised Learning
RandomForest can be used in unsupervised mode for
- variable selection
- a proximity matrix (for clustering)
What are SVMs
Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression; an extension of LDA.
- many hyperplanes could classify the data
- we are interested in the one achieving maximum separation (margin) between the two classes
- mathematically, for (y_i, x_i), y_i = ±1, i = 1, ..., n:
  separable case: min (1/2)||w||² s.t. y_i(x_i'w − b) ≥ 1
  non-separable case: min (1/2)||w||² + λ Σ_{i=1}^n ξ_i s.t. ξ_i ≥ 0, y_i(x_i'w − b) ≥ 1 − ξ_i
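The non-separable objective is equivalent to minimizing a hinge loss plus the quadratic penalty, which admits a very short stochastic sub-gradient solver (the Pegasos scheme). A pure-Python sketch on made-up data, not how production SVM libraries solve the problem:

```python
# Soft-margin linear SVM by stochastic sub-gradient descent (Pegasos) on
#   (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i),
# equivalent to the slack formulation with xi_i = max(0, 1 - y_i w.x_i).
import random

random.seed(0)

def train_svm(X, y, lam=0.01, T=20000):
    d = len(X[0])
    w = [0.0] * d
    for t in range(1, T + 1):
        i = random.randrange(len(X))
        eta = 1.0 / (lam * t)               # decreasing step size
        m = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        decay = 1 - eta * lam               # shrinkage from the penalty term
        if m < 1:                           # hinge active: step toward y_i x_i
            w = [decay * wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        else:
            w = [decay * wj for wj in w]
    return w

# bias folded in as a constant feature 1.0; classes well separated
X = [(0.0, 0.5, 1.0), (1.0, 1.0, 1.0), (0.5, 1.5, 1.0),
     (3.0, 2.0, 1.0), (2.5, 3.0, 1.0), (4.0, 4.0, 1.0)]
y = [-1, -1, -1, 1, 1, 1]
w = train_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1 for xi in X]
print(preds == y)
```

With small lam this behaves like the hard-margin problem on separable data; folding the bias into a constant feature mildly regularizes b but keeps the update one line.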
SVM: separable case
[Figure: candidate separating hyperplanes, http://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png]
Separable case.
SVM: separable case
[Figure: maximum-margin separating hyperplane, http://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png]
Separable case.
Predictive Models
Are we only interested in a predictive black box, or are we also interested in which features predict?
- With p >> n it is easy to find classifiers that separate the data; are they meaningful?
- If features are suspected to be sparse, most features are irrelevant; we need automatic feature selection, e.g., LASSO, or SVM with an L1 penalty.
Summary
- Visualization is an important aspect of EDA: “a picture is worth a thousand words”.
- Supervised learning allows one to select features and to classify (predict).
- Unsupervised learning allows the study of associations among features, feature selection, and clustering.