Multivariate data analysis and visualization tools for biological data
![Page 1: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/1.jpg)
Multivariate Data Analysis and Visualization
Tools for Understanding Biological Data Dmitry Grapov
![Page 2: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/2.jpg)
Introduction: Systems
Oltvai, et al. Science 25 October 2002: 763-764.
Emergent
Reductionist
Systems
Complex systems
Deterministic
Chemical analysis
Physiology Biochemistry
Graph theory
Modeling
Informatics
![Page 3: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/3.jpg)
Introduction: Inference
![Page 4: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/4.jpg)
Overview
Types: Univariate (1-D), Bivariate (2-D), Multivariate (n-D)
Properties: vector, matrix, matrix
Representations: histograms and densities; scatter plots; dendrograms, heatmaps, biplots, and networks
Central Idea: mean, correlation, many
http://www.thefullwiki.org/Hypercube
![Page 5: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/5.jpg)
Univariate: Properties
•vector of length m
–mean
–variance
![Page 6: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/6.jpg)
Univariate: Representations
![Page 7: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/7.jpg)
Univariate: Assumptions
•Normality
![Page 8: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/8.jpg)
Univariate: Utility
Hypothesis testing
•α - type I error (false positive)
•β - type II error (false negative)
•power - (1–β)
•effect size - standardized difference in means
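The quantities on this slide can be related numerically. The sketch below is an illustration only: it uses Cohen's d as the standardized effect size (one common choice; the slide does not name an estimator) and a normal approximation for the power of a two-sided, two-sample comparison. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Standardized difference in means, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample z-test.

    Assumes equal group sizes and a normal approximation to the test statistic.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # rejection threshold set by alpha
    shift = abs(d) * math.sqrt(n_per_group / 2)  # expected z under the alternative
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

if __name__ == "__main__":
    # Power grows with sample size for a fixed effect size.
    for n in (10, 50, 100):
        print(n, round(approx_power(0.5, n), 3))
```

With an effect size of zero the function returns alpha, which is the type I error rate by construction.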
![Page 9: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/9.jpg)
Univariate: Limitations
•Biological definition of the mean?
•Relationship between sample size and test power
•Multiple hypothesis testing
•False discovery rate
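One widely used way to control the false discovery rate across many tests is the Benjamini-Hochberg procedure. The slide does not say which method was used, so the sketch below is an assumption offered for illustration; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Tests whose adjusted value falls below the chosen FDR level (for example 0.05) would be reported as discoveries.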
![Page 10: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/10.jpg)
Old Faithful Data
272 observations
•time between eruptions: 70 ± 14 min
•duration of eruption: 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
![Page 11: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/11.jpg)
Bivariate: Properties
•Matrix of 2 vectors of length m
![Page 12: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/12.jpg)
(X,Y)
Bivariate: Representations
![Page 13: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/13.jpg)
(X,Y)
Bivariate: Utility
Variable 2 = m*Variable 1 + b
•bivariate distribution
•correlation
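The linear relationship and the correlation coefficient can both be estimated from two vectors. This is a minimal sketch with made-up values (not the Old Faithful measurements), using NumPy's least-squares polynomial fit; `fit_line` is a hypothetical helper name.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = m*x + b, plus the Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, deg=1)   # slope and intercept
    r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient
    return float(m), float(b), float(r)

# Synthetic example: an exact line with slope 2 and intercept 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 * x_demo + 1.0
m_demo, b_demo, r_demo = fit_line(x_demo, y_demo)
```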
![Page 14: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/14.jpg)
http://en.wikipedia.org/wiki/Correlation
Bivariate: Limitations
correlation coefficient
•Measure of linear or monotonic relationship
![Page 15: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/15.jpg)
http://en.wikipedia.org/wiki/Correlation
Bivariate: Limitations
•Sensitive to outliers
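A small synthetic example can show how a single extreme value affects a linear correlation coefficient while a rank-based (monotonic) one is unchanged. The data below are invented for illustration, and Spearman's rank correlation is used as an assumed example of a monotonic measure.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(v):
    """Ranks starting at 1; this simple version assumes no tied values."""
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x_demo = np.arange(1.0, 11.0)
y_clean = x_demo.copy()      # perfectly linear
y_outlier = x_demo.copy()
y_outlier[-1] = 1000.0       # one made-up extreme value, order preserved

pearson_clean = pearson(x_demo, y_clean)
pearson_outlier = pearson(x_demo, y_outlier)
spearman_outlier = spearman(x_demo, y_outlier)
```

Here the outlier keeps the ordering of the points, so the rank correlation stays at its maximum while the Pearson coefficient drops substantially.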
![Page 16: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/16.jpg)
Old Faithful
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
![Page 17: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/17.jpg)
Old Unfaithful?
![Page 18: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/18.jpg)
Old Unfaithful?
Additional variables
•Nearby hydrofracking
•Improve inference based on more information
![Page 19: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/19.jpg)
![Page 20: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/20.jpg)
Multivariate: Properties
•A matrix of n vectors of length m
Correlation matrix
Challenges
•data are often wide (more variables than samples)
•integration
•noise
Rewards
•robust inference
•signal amplification
•holistic/systems approach
![Page 21: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/21.jpg)
Principal Components Analysis (PCA)
Linear n-dimensional encoding of the original data, where the dimensions are:
1. orthogonal (uncorrelated)
2. ordered by variance explained (top k dimensions)
PC 2
PC 1
Multivariate: Dimensional Reduction
![Page 22: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/22.jpg)
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis." In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
Original Data (m x n) = Scores (m x PC) · Explained variance (PC x PC) · Loadings (n x PC), transposed
Calculating PCs: singular value decomposition (SVD)
Eigenvalue
•explained variance
Scores
•sample representation based on all variables
Loadings
•variable contribution to scores
Multivariate: Dimensional Reduction
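The SVD route to principal components can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code behind the slides: scores are returned already multiplied by the singular values (a common convention), and the demo matrix is random synthetic data.

```python
import numpy as np

def pca_svd(X, scale=True):
    """PCA of an m x n matrix (m samples, n variables) via SVD.

    Returns scores (m x PC), loadings (n x PC) and the fraction of
    variance explained by each component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each variable
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)      # scale to unit variance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                            # sample representation
    loadings = Vt.T                           # variable contributions
    eigenvalues = s ** 2 / (X.shape[0] - 1)   # variance along each PC
    explained = eigenvalues / eigenvalues.sum()
    return scores, loadings, explained

# Synthetic demo data: 50 samples, 4 variables.
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(50, 4))
scores_demo, loadings_demo, explained_demo = pca_svd(X_demo)
```

The loadings are orthonormal and the scores are mutually uncorrelated, which matches the two properties listed for PCA dimensions.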
![Page 23: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/23.jpg)
Old Faithful 2.0
•272 measurements
•8 variables
•2 real, 6 random noise
A matrix of n vectors of length m
Multivariate: Representations
![Page 24: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/24.jpg)
Multivariate: Representation
•Identify outliers using all measurements
•Use known values to impute missing values
•Identify interesting groups
•Evaluate uni- and bivariate observations
•Number of PCs can be used to estimate the true data complexity
![Page 25: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/25.jpg)
PCA: Considerations
•data pre-treatment
•outliers
•noise
•unsupervised projection
no pre-treatment
centered and scaled to unit variance
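The effect of pre-treatment can be seen with two synthetic variables that carry the same signal in very different units. The sketch below is an assumption-laden illustration (made-up data, hypothetical helper names), comparing first-component loadings with and without scaling to unit variance.

```python
import numpy as np

def first_pc_loadings(X):
    """Loadings of the first principal component after column centering."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def autoscale(X):
    """Center each column and scale it to unit variance."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)

rng_demo = np.random.default_rng(0)
latent = rng_demo.normal(size=200)
# Same underlying signal, measured on very different scales (synthetic).
X_demo = np.column_stack([
    latent + 0.1 * rng_demo.normal(size=200),
    1000.0 * (latent + 0.1 * rng_demo.normal(size=200)),
])

raw_loadings = np.abs(first_pc_loadings(X_demo))
scaled_loadings = np.abs(first_pc_loadings(autoscale(X_demo)))
```

Without scaling, the large-unit variable dominates the first component; after autoscaling, both variables contribute equally.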
![Page 26: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/26.jpg)
PCA: Considerations
•data pre-treatment
•outliers
•linear reconstruction
•noise
•Independent components analysis (ICA)
•unsupervised projection
Use ICA to calculate statistically independent components
![Page 27: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/27.jpg)
PCA: Considerations
•data pre-treatment
•outliers
•linear reconstruction
•noise
•supervised projection
•Non-negative matrix factorization (NMF)
NMF uses additive parts based encoding
Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
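The additive, parts-based encoding can be illustrated with the multiplicative update rules commonly associated with Lee and Seung's NMF. This is a minimal sketch for the squared-error objective, with hypothetical names, random initialization, and synthetic non-negative data; it is not taken from the slides.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (m x n) as W (m x k) @ H (k x n).

    Uses multiplicative updates, which keep W and H non-negative.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update encodings
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update parts
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors

# Synthetic demo: a non-negative matrix built from two additive parts.
rng_demo = np.random.default_rng(1)
V_demo = rng_demo.random((6, 2)) @ rng_demo.random((2, 5))
W_demo, H_demo, errors_demo = nmf(V_demo, k=2)
```

Because every entry of W and H stays non-negative, each sample is reconstructed only by adding parts, never by cancelling them.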
![Page 28: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/28.jpg)
PCA: Considerations
•data pre-treatment
•outliers
•linear reconstruction
•noise
•supervised projection
•Identify projection correlated with class assignment (classification) or continuous variables (regression)
•Partial Least Squares Projection to Latent Structures (PLS/-DA)
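A supervised projection of this kind can be sketched with the NIPALS form of PLS for a single response (PLS1). For PLS-DA the class is coded as 0/1 and predictions are thresholded at 0.5. The implementation, names, threshold, and synthetic data below are all assumptions for illustration, not the software used for the slides.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 (single response) with the NIPALS algorithm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # latent variable scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression coefficients
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Synthetic two-class demo: only the first variable separates the classes.
rng_demo = np.random.default_rng(0)
y_demo = np.repeat([0.0, 1.0], 30)
X_demo = rng_demo.normal(size=(60, 5))
X_demo[:, 0] += 4.0 * y_demo
model_demo = pls1_fit(X_demo, y_demo, n_components=2)
predicted_class = (pls1_predict(model_demo, X_demo) > 0.5).astype(float)
train_accuracy = float((predicted_class == y_demo).mean())
```

Training accuracy alone overstates performance, which is why cross-validation and permutation testing are needed to evaluate such a model.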
![Page 29: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/29.jpg)
PLS/-DA: Utility
Strengths
•Predict multiple dependent variables
•avoids issues of multicollinearity
•Independent measure of variable importance
Weaknesses
•Need to derive an empirical reference for model performance
•Poorly established model optimization methods
![Page 30: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/30.jpg)
PLS-DA: Example
•Data: Old Faithful 2.0
•272 observations on 8 variables
•Latent Variables are analogous to PCs
•Important Statistics (CV)
•Q2 = fit
•RMSEP = error of prediction
•AU(RO)C = specificity vs. sensitivity
Select the appropriate number of Latent Variables (LVs) to maximize Q2
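Cross-validated statistics such as Q2 and RMSEP can be computed for any model that exposes fit and predict steps. The sketch below uses Q2 = 1 - PRESS/TSS (definitions vary slightly between packages) and substitutes ordinary least squares as a stand-in model to stay short; it is not a PLS implementation, and all names and data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def cross_validate(X, y, fit, predict, k=7, seed=0):
    """Return (Q2, RMSEP) from k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    press = 0.0                                # predicted residual sum of squares
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        model = fit(X[train], y[train])
        residual = y[test_idx] - predict(model, X[test_idx])
        press += float(residual @ residual)
    tss = float(((y - y.mean()) ** 2).sum())   # total sum of squares
    return 1.0 - press / tss, float(np.sqrt(press / len(y)))

# Synthetic demo: a real signal versus pure noise.
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(80, 3))
y_signal = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng_demo.normal(size=80)
y_noise = rng_demo.normal(size=80)
q2_signal, rmsep_signal = cross_validate(X_demo, y_signal, ols_fit, ols_predict)
q2_noise, rmsep_noise = cross_validate(X_demo, y_noise, ols_fit, ols_predict)
```

To choose the number of latent variables, the same loop would be repeated for each candidate number and the value giving the highest Q2 kept.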
![Page 31: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/31.jpg)
PLS-DA: Performance
•Use permutation tests to empirically determine model performance
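A permutation test builds an empirical reference by recomputing a model statistic after shuffling the class labels. The sketch below uses the distance between class centroids as a simple stand-in for a model performance statistic (the slides would use a PLS-DA statistic such as Q2); names and data are hypothetical.

```python
import numpy as np

def centroid_separation(X, labels):
    """Stand-in statistic: distance between the two class centroids."""
    return float(np.linalg.norm(X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)))

def permutation_pvalue(X, labels, statistic, n_perm=500, seed=0):
    """Compare the observed statistic with its label-permuted distribution."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, labels)
    null = np.array([statistic(X, rng.permutation(labels)) for _ in range(n_perm)])
    # Add one to numerator and denominator so the p-value is never exactly zero.
    p_value = (1 + int(np.sum(null >= observed))) / (1 + n_perm)
    return observed, null, p_value

# Synthetic demo: two classes shifted along the first variable.
rng_demo = np.random.default_rng(0)
labels_demo = np.repeat([0, 1], 20)
X_demo = rng_demo.normal(size=(40, 5))
X_demo[:, 0] += 2.0 * labels_demo
observed_demo, null_demo, p_demo = permutation_pvalue(X_demo, labels_demo, centroid_separation)
```

The observed value is judged against the spread of the permuted values rather than against a theoretical distribution.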
![Page 32: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/32.jpg)
![Page 33: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/33.jpg)
PLS: Predictive Performance
•Split data into training (2/3) and test sets (1/3)
•Generate model using training set and then predict class assignment for test set
•Use permutation tests to generate confidence bounds for future predictions
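The first two steps can be sketched as follows. To keep the example short, a nearest-centroid classifier stands in for the PLS-DA model, and the data are synthetic; the helper names are hypothetical and the permutation step is not shown.

```python
import numpy as np

def split_train_test(n, test_fraction=1 / 3, seed=0):
    """Randomly split n sample indices into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return order[n_test:], order[:n_test]

def nearest_centroid_fit(X, labels):
    """Stand-in classifier: store one centroid per class."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

# Synthetic demo: 90 samples, two classes separated along the first variable.
rng_demo = np.random.default_rng(0)
labels_demo = np.repeat([0, 1], 45)
X_demo = rng_demo.normal(size=(90, 4))
X_demo[:, 0] += 4.0 * labels_demo

train_idx, test_idx = split_train_test(len(labels_demo))
model_demo = nearest_centroid_fit(X_demo[train_idx], labels_demo[train_idx])
test_predictions = nearest_centroid_predict(model_demo, X_demo[test_idx])
test_accuracy = float((test_predictions == labels_demo[test_idx]).mean())
```

The test set is never seen during fitting, so its accuracy is an estimate of performance on future samples.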
![Page 34: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/34.jpg)
PLS: Predictive Performance
![Page 35: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/35.jpg)
PLS: Feature Selection
Use the PLS-DA as an objective function to identify the most informative variables
![Page 36: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/36.jpg)
Networks
Network: representation of relationships among objects
Utility
•Project statistical results into a biological context
•Explore informative data aspects in the context of all that was observed.
•Identify emergent patterns
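One simple way to represent relationships among variables as a network is to connect pairs whose correlation exceeds a threshold. The slides do not state how their networks were built, so the sketch below is an assumed example with a hypothetical threshold, variable names, and synthetic data.

```python
import numpy as np

def correlation_edges(X, names, threshold=0.7):
    """Return (name_a, name_b, r) for variable pairs with |r| >= threshold."""
    R = np.corrcoef(X, rowvar=False)   # variables are columns
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= threshold:
                edges.append((names[i], names[j], float(R[i, j])))
    return edges

# Synthetic demo: b follows a, c mirrors a, d is unrelated.
rng_demo = np.random.default_rng(0)
a = rng_demo.normal(size=200)
X_demo = np.column_stack([
    a,
    a + 0.1 * rng_demo.normal(size=200),
    -a + 0.1 * rng_demo.normal(size=200),
    rng_demo.normal(size=200),
])
edges_demo = correlation_edges(X_demo, ["a", "b", "c", "d"])
```

The resulting edge list can be loaded into a graph tool, with the sign of r used to color positive and negative relationships.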
![Page 37: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/37.jpg)
Networks
•Interpret statistical results within a biological context
![Page 38: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/38.jpg)
Networks
•Highlight changes in patterns of relationships.
non-diabetics type 2 diabetics
![Page 39: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/39.jpg)
Networks
•Display complex interactions
non-diabetics type 2 diabetics
Figure: T2D and UCP3 OPLS-DA Loadings. x-axis: increasing in T2D; y-axis: increasing in g/a UCP3. Labeled variables: 1-LG, 12(13)-EpODE, 12,13-DiHOME, 15(S)-HEPE, 17,18-DiHETE, 18:1n9, 22:5n6, 5-HETE, 9-HETE, 9(10)-EpOME, 9,10-DiHOME, 9,12,13-TriHOME, AEA, c16:0, c18:0, DHEA, LEA, NO-Gly, SEA. Legend: g/a, g/g; T2D, non-T2D.
![Page 40: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/40.jpg)
non-diabetics type 2 diabetics
imDEV: interactive modules for Data Exploration and Visualization
An integrated environment for systems level analysis of multivariate data.
http://sourceforge.net/apps/mediawiki/imdev
![Page 41: Multivariate data analysis and visualization tools for biological data](https://reader036.fdocuments.us/reader036/viewer/2022081512/554e84f7b4c90526358b45c3/html5/thumbnails/41.jpg)
Acknowledgements
Newman Lab
Designated Emphasis in Biotechnology (DEB)
NIH
This project is funded in part by NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.