Visualization and Machine Learning for exploratory data analysis
Xiaochun Li
Division of Biostatistics, Indiana University School of Medicine
Regenstrief Institute
May 2, 2008 / CCBB Journal Club
Xiaochun Li Visualization and ML
Outline
1 Introduction
2 Visualization: As Is, Simple Summarization, More Advanced Methods
3 Machine Learning: Supervised Learning, Unsupervised Learning, Random Forests, SVM
Introduction
Mining large-scale data sets calls for methods that
- search for patterns, e.g., biologically important gene sets or samples
- present data structure succinctly
Both are essential in the analysis.
Visualization: Objective
An essential part of exploratory data analysis, and of reporting the results.
- plot data as is
- plot data after simple summarization
- plot data based on more advanced methods: clustering, PCA (principal component analysis), MDS (multidimensional scaling), silhouette, randomForest, ...
Plot data as is: Quality Inspection
An Affymetrix chip image. Some images may have obvious local contamination.
Plot data as is: Quality Inspection
[Figure: four 16 x 24 plate images, panels labeled Ins+, white; Ins−, white; Ins+, black; Ins−, black]
An RNAi experiment with white and black plates, insulin stimulated +/-.
Plot data as is: R tools
- image or heatmap for any chip arrays
- for cell-based assays, one could also use plotPlate in the R package prada
Simple Summarization: Along Genomic Coordinates
[Figure: cumulative expression levels by genes in chromosome 21, scaling method: none; x-axis: representative genes (NRIP1 through MCM3AP), each sample marked + or −; y-axis: cumulative expression levels]
Cumulative expression profiles along chromosome 21 for samples from 10 children with trisomy 21 and a transient myeloid disorder, colored in red, and children with different subtypes of acute myeloid leukemia (M7), colored in blue.
Simple Summarization: Along Genomic Coordinates
- The previous wiggle plot was produced using alongChrom of the R package geneplotter.
- One could plot just a segment of the chromosome of interest.
Mass Spec Example: “Latin Square” Design for B-F
Group   Cytochrome c   Ubiquitin   Lysozyme   Myoglobin   Trypsinogen
A       0              0           0          0           0
B       0              1           2          5           10
C       1              2           5          10          0
D       2              5           10         0           1
E       5              10          0          1           2
F       10             0           1          2           5
G       10             10          10         10          10
Design and the protein concentration units: Ubiquitin (1 fmol/uL), Cytochrome c/Lysozyme/Myoglobin (10 fmol/uL), Trypsinogen (100 fmol/uL).
Mass Spec: Example
[Figure: one spectrum from group A; x-axis: mz, 0 to 1e+05; y-axis: intensity, 0 to 40]
One spectrum from group A.
Mass Spec: MDS
[Figure: 3-D scatter plot; axes: first, second and third coordinates]
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Mass Spec: pairs plot
[Figure: scatter-plot matrix of spec 1 through spec 4; lower-left correlations: 0.66, 0.60, 0.98, 0.59, 0.97, 0.99]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: pairs plot
[Figure: scatter-plot matrix of spec 1 through spec 4; lower-left correlations: 0.99, 0.96, 0.98, 0.96, 0.98, 0.99]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: MDS in 3-D
[Figure: 3-D scatter plot; axes: first, second and third coordinates]
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram, hclust(*, "complete") on d.s.nocut; leaves labeled A, D and G; height 0 to 50]
Dendrogram of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram, hclust(*, "complete") on d.s.cut; leaves labeled A, D and G; height 0 to 400]
Dendrogram of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
Whole spec: n = 39, 3 clusters C_j; average silhouette width: 0.57
j : n_j | ave_{i in C_j} s_i
1 : 17 | 0.67
2 : 16 | 0.48
3 : 6 | 0.56
Silhouette plot of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
mz < 1000 cut: n = 39, 3 clusters C_j; average silhouette width: 0.65
j : n_j | ave_{i in C_j} s_i
1 : 13 | 0.82
2 : 13 | 0.60
3 : 13 | 0.53
Silhouette plot of clustering results of 39 spectra from groups A, D and G, before and after the low molecular range is removed.
Silhouette plot: silhouette width
For each observation i, the silhouette width s_i is defined as follows:
- a_i = average dissimilarity between i and all other points of the cluster to which i belongs
- for each other cluster C, put d_{i,C} = average dissimilarity of i to all observations of C
- b_i = min_C d_{i,C}, which can be seen as the dissimilarity between i and its “neighbor” cluster, i.e., the nearest one to which it does not belong
- s_i = (b_i − a_i) / max(a_i, b_i)
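The definition above translates directly into code. A minimal pure-Python sketch, with Euclidean distance chosen as the dissimilarity for illustration (in R one would simply call silhouette from the cluster package):

```python
# Silhouette width s_i computed directly from the definition above,
# with Euclidean distance as the dissimilarity.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_widths(points, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for each observation i."""
    widths = []
    clusters = set(labels)
    for i, p in enumerate(points):
        # a_i: average dissimilarity to the other members of i's own cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        a_i = sum(own) / len(own)
        # b_i: smallest average dissimilarity to any other cluster
        b_i = min(
            sum(dist(p, points[j]) for j in range(len(points)) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths

# two tight, well-separated clusters: all widths close to 1
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
labs = [0, 0, 1, 1]
ws = silhouette_widths(pts, labs)
print([round(w, 2) for w in ws])
```

Widths near 1 indicate well-clustered points; values near 0 or below indicate points sitting between clusters.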
Visualization: R tools
- classical MDS: cmdscale
- 2-D and 3-D scatter plots: plot and the R package scatterplot3d
- 2-D scatter plot matrix: pairs
- silhouette plot: silhouette
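The computation behind classical MDS (what cmdscale performs) is short enough to sketch; here in Python/NumPy rather than R, as a hedged reimplementation: double-center the squared distance matrix, then scale the top eigenvectors.

```python
# Classical MDS, the computation behind R's cmdscale: double-center the
# squared distance matrix and scale the top-k eigenvectors.
import numpy as np

def cmdscale(D, k=2):
    """Embed a symmetric distance matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]           # indices of the k largest
    scale = np.sqrt(np.maximum(vals[idx], 0))  # clip tiny negatives to 0
    return vecs[:, idx] * scale                # n x k coordinates

# four corners of a unit square: a 2-D embedding reproduces all distances
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = cmdscale(D, k=2)
D2 = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D2))
```

For Euclidean distance matrices this recovers the data configuration up to rotation and reflection, which is why classical MDS on Euclidean distances coincides with PCA.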
Machine Learning
Machine Learning: computational and statistical approaches to extract important patterns and trends hidden in large data sets.
- Supervised: predict an outcome y based on X, a number of inputs (variables). E.g., predict the class labels “tumor” or “normal” based on gene expression.
- Unsupervised: no y; describe the associations and patterns among X. E.g., which subsets of genes have similar expression? Which subgroups of patients have similar gene expression profiles?
Supervised Learning
- linear model
- nearest neighbor (k-NN)
- LDA (linear discriminant analysis): same covariance Σ across classes
- LDA variants: QDA (class-specific Σ_k), DLDA (Σ is diagonal), RDA (regularized, uses αΣ + (1 − α)I)
- SVM
- randomForest
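Of these, DLDA is compact enough to sketch in full: classify to the class mean that is nearest under pooled per-coordinate variances, i.e., a diagonal Σ. A pure-Python illustration on made-up data (in R one would reach for an existing implementation):

```python
# DLDA (diagonal LDA): LDA with Sigma restricted to a diagonal matrix,
# so classification reduces to variance-scaled distance to class means.
def dlda_fit(X, y):
    classes = sorted(set(y))
    d = len(X[0])
    # per-class mean vectors
    means = {}
    for c in classes:
        pts = [x for x, yi in zip(X, y) if yi == c]
        means[c] = [sum(col) / len(pts) for col in zip(*pts)]
    # pooled per-coordinate variances: the diagonal of Sigma
    var = [0.0] * d
    for x, yi in zip(X, y):
        for j in range(d):
            var[j] += (x[j] - means[yi][j]) ** 2
    var = [v / (len(X) - len(classes)) for v in var]
    def predict(x):
        def score(c):
            return sum((x[j] - means[c][j]) ** 2 / var[j] for j in range(d))
        return min(classes, key=score)
    return predict

# two made-up 2-D classes
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
y = [0, 0, 0, 1, 1, 1]
predict = dlda_fit(X, y)
preds = [predict(x) for x in X]
print(preds)
```

The diagonal restriction is what makes DLDA usable when p >> n: only p variances are estimated instead of a full p x p covariance matrix.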
Unsupervised Learning
- Clustering
- PCA (principal component analysis)
- MDS (multidimensional scaling); classical MDS using Euclidean distance = PCA
- K-means
- SOM (self-organizing maps)
- Unsupervised as supervised learning
Unsupervised as Supervised Learning: through data augmentation
Let g(x) be the unknown density to be estimated, and g0(x) be a specified reference density.
- x_1, x_2, ..., x_n ~ iid g(x); assign class label Y = 1
- x_{n+1}, x_{n+2}, ..., x_{2n} ~ iid g0(x); assign class label Y = 0
- pooled, x_1, x_2, ..., x_{2n} ~ iid (g(x) + g0(x))/2
- µ(x) ≡ E(Y | x) = [g(x)/g0(x)] / [1 + g(x)/g0(x)] can be estimated by supervised learning using the combined sample (y_1, x_1), (y_2, x_2), ..., (y_{2n}, x_{2n})
- then g(x) = g0(x) µ(x)/(1 − µ(x))
E.g., use this technique with randomForest.
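The recipe can be run end to end on a one-dimensional toy problem. In this hedged sketch a simple histogram classifier plays the role of the slide's randomForest, g0 is the uniform density on [0, 1], and the target is g(x) = 2x:

```python
# Unsupervised -> supervised: estimate g(x) = 2x on [0, 1] by contrasting a
# sample from g with a uniform reference sample from g0, then inverting
#   mu(x) = (g/g0) / (1 + g/g0)   =>   g(x) = g0(x) * mu(x) / (1 - mu(x)).
# A histogram classifier stands in for the random forest.
import random

random.seed(0)
n, bins = 20000, 10
g_sample = [random.random() ** 0.5 for _ in range(n)]   # density 2x, label Y=1
g0_sample = [random.random() for _ in range(n)]         # uniform,    label Y=0

def bin_of(x):
    return min(int(x * bins), bins - 1)

# mu per bin = fraction of Y=1 points among the pooled points in that bin
ones = [0] * bins
zeros = [0] * bins
for x in g_sample:
    ones[bin_of(x)] += 1
for x in g0_sample:
    zeros[bin_of(x)] += 1

g_hat = []
for b in range(bins):
    mu = ones[b] / (ones[b] + zeros[b])
    g_hat.append(1.0 * mu / (1 - mu))    # g0(x) = 1 on [0, 1]

centers = [(b + 0.5) / bins for b in range(bins)]
print([round(g, 2) for g in g_hat])      # close to 2x at each bin center
```

Any classifier that estimates E(Y | x) well can be substituted for the histogram; random forests are attractive here because they handle high-dimensional x.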
What are Random Forests
Random forests are a combination of tree predictors, each grown from an iid realization of a random vector θ_k.
Example: bagging (bootstrap aggregation):
- bootstrap samples are drawn from the training set, where θ_k is the vector of counts in n boxes resulting from sampling with replacement
- a tree is grown from each bootstrap sample
- a class is assigned by majority vote.
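The three bagging steps can be sketched with one-dimensional threshold "stumps" standing in for full trees; a toy illustration of the voting scheme, not Breiman's algorithm as implemented in the randomForest package:

```python
# Bagging: bootstrap the training set, fit one weak learner (here a 1-D
# threshold stump) per bootstrap sample, then classify by majority vote.
import random

random.seed(1)

def fit_stump(xs, ys):
    """Best threshold t, trying both orientations of x vs t."""
    best = None
    for t in xs:
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else 0 for x in xs]
            acc = sum(p == yv for p, yv in zip(preds, ys)) / len(ys)
            if best is None or acc > best[0]:
                best = (acc, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) > 0 else 0

def bagged_classifier(xs, ys, n_trees=25):
    stumps = []
    for _ in range(n_trees):
        idx = [random.randrange(len(xs)) for _ in xs]   # bootstrap: theta_k
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # majority vote over the ensemble
    return lambda x: round(sum(s(x) for s in stumps) / len(stumps))

# 1-D two-class data with a boundary at 0.5
xs = [i / 40 for i in range(40)]
ys = [1 if x > 0.5 else 0 for x in xs]
clf = bagged_classifier(xs, ys)
acc = sum(clf(x) == yv for x, yv in zip(xs, ys)) / len(xs)
print(acc)
```

Each individual stump is weak; averaging over bootstrap replicates stabilizes the decision near the boundary, which is the point of bagging.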
Motivation
Improve prediction:
- a single tree has poor accuracy for problems with many variables each carrying very little information, e.g., genomics data sets
- combining trees grown using random features can improve accuracy
Assess performance:
- training error (the error rate on the training set) does not indicate performance on new data
- overfitting → small training error but poor generalization error
- we need data that were not used to grow a particular tree to assess the performance of that tree.
Strength and Correlation
For a given case (X, Y) and a given ensemble of classifiers:
- margin = proportion of votes for the right class − max over the other classes of the proportion of votes for that class
- generalization error PE* = P_{X,Y}(margin < 0)
- s ≡ strength = E_{X,Y}(margin)
- ρ ≡ correlation, a correlation between any two trees
- Theorem 1.2: the generalization error converges.
- Theorem 2.3: the generalization error is bounded, PE* ≤ ρ(1 − s²)/s².
Random Forests Converge
Theorem 1.2. As the number of trees increases, the generalization error converges a.s. for all {θ_k}.
This is why random forests do not overfit as more trees are added; they tend to a limiting value of the generalization error.
Strategy: Minimize Correlation While Keeping Strength
Grow each tree using randomly selected inputs, or combinations of inputs, at each node:
- Random input selection (Forest-RI): at each node, select at random F variables to split on; grow the tree to maximum size and do not prune.
- Random feature selection (Forest-RC): the same idea, but with F features, i.e., linear combinations of L randomly selected variables with random coefficients runif(L, -1, 1) ⇒ further reduces correlation.
Gauging Performance
Bagging makes it possible to estimate the generalization error without a test set.
Why: in any bootstrap sample, about 1/3 of the cases from the original training set are left out due to sampling with replacement: (1 − 1/n)^n ≈ e^{−1} ≈ 1/3.
Out-of-bag estimates of error, strength and correlation:
- For each (x, y), aggregate the votes over the trees grown without (x, y): the out-of-bag classifier.
- The out-of-bag estimate of generalization error is the error rate of the out-of-bag classifier.
- The same idea yields out-of-bag strength and correlation.
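The 1/3 figure is easy to check empirically; a quick simulation (the values of n and the number of repetitions are arbitrary):

```python
# Fraction of training cases left out of a bootstrap sample:
# (1 - 1/n)^n -> e^{-1} ~ 0.368, i.e., about 1/3.
import math
import random

random.seed(0)
n, reps = 500, 400
left_out = 0
for _ in range(reps):
    drawn = {random.randrange(n) for _ in range(n)}   # indices that got sampled
    left_out += n - len(drawn)                        # cases never drawn
frac = left_out / (n * reps)
print(round(frac, 3), round(math.exp(-1), 3))
```

These never-drawn cases are exactly the out-of-bag set for one tree, so roughly a third of the data acts as a built-in test set for every tree in the forest.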
Conclusions: RandomForest
- Random forests do not overfit: an effective tool for prediction.
- Fast in computation.
- Out-of-bag estimates gauge the performance of the forest.
- Forests give results competitive with boosting and adaptive bagging, without progressively changing the training set. Their accuracy indicates that they reduce bias.
- Random inputs and random features produce good results in classification, but less so in regression.
RandomForest in Unsupervised Learning
RandomForest can be used in unsupervised mode for
- variable selection
- a proximity matrix (for clustering)
What are SVMs
Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression; an extension of LDA.
- many hyperplanes could classify the data
- we are interested in the one achieving maximum separation (margin) between the two classes
- mathematically, for (y_i, x_i), y_i = ±1, i = 1, ..., n:
  separable case: min (1/2)||w||² s.t. y_i(x_i'w − b) ≥ 1
  non-separable case: min (1/2)||w||² + λ Σ_{i=1}^n ξ_i s.t. ξ_i ≥ 0, y_i(x_i'w − b) ≥ 1 − ξ_i
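The non-separable objective is equivalent to minimizing a hinge loss plus the quadratic penalty, which admits a very short stochastic sub-gradient solver (the Pegasos scheme). A pure-Python sketch on made-up data, not how production SVM libraries solve the problem:

```python
# Soft-margin linear SVM by stochastic sub-gradient descent (Pegasos) on
#   (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i),
# equivalent to the slack formulation with xi_i = max(0, 1 - y_i w.x_i).
import random

random.seed(0)

def train_svm(X, y, lam=0.01, T=20000):
    d = len(X[0])
    w = [0.0] * d
    for t in range(1, T + 1):
        i = random.randrange(len(X))
        eta = 1.0 / (lam * t)               # decreasing step size
        m = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        decay = 1 - eta * lam               # shrinkage from the penalty term
        if m < 1:                           # hinge active: step toward y_i x_i
            w = [decay * wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        else:
            w = [decay * wj for wj in w]
    return w

# bias folded in as a constant feature 1.0; classes well separated
X = [(0.0, 0.5, 1.0), (1.0, 1.0, 1.0), (0.5, 1.5, 1.0),
     (3.0, 2.0, 1.0), (2.5, 3.0, 1.0), (4.0, 4.0, 1.0)]
y = [-1, -1, -1, 1, 1, 1]
w = train_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1 for xi in X]
print(preds == y)
```

With small lam this behaves like the hard-margin problem on separable data; folding the bias into a constant feature mildly regularizes b but keeps the update one line.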
SVM: separable case
[Figure: candidate separating hyperplanes, http://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png]
Separable case.
SVM: separable case
[Figure: maximum-margin separating hyperplane, http://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png]
Separable case.
Predictive Models
Are we only interested in a predictive black box, or are we also interested in which features predict?
- With p >> n it is easy to find classifiers that separate the data; are they meaningful?
- If features are suspected to be sparse, most features are irrelevant; we need automatic feature selection, e.g., LASSO, or SVM with an L1 penalty.
Summary
- Visualization is an important aspect of EDA: “a picture is worth a thousand words”.
- Supervised learning allows one to select features and to classify (predict).
- Unsupervised learning allows the study of associations among features, feature selection, and clustering.