New Challenges in Bioinformatics: Integrative Analysis of Omics Data

Alex Sánchez

1 Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona

2 Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca

Outline

• Introduction: omics, data integration, integrative analysis
• Integrative analysis: challenges and methods
• Some (prototypical) examples
  – Multivariate statistical approach to integrative analysis
  – Building better predictors from diverse data sources
  – Gene sets and their application to integrative analysis
  – Network methods for visualization and data integration
• Where to now?


Who, where, what?


Omics data

Experimental platforms generate diverse omics and non-omics data

[Figure: example outputs of each platform. Panels: H NMR metabolites (ppm axis); Affy transcriptome; LC-MS proteomics; NGS sequences; “non-omic” markers, e.g. adiponectin (ug/ml, change from baseline at day 7 and day 14) in db/+ and db/db groups under Veh, Met30, Gly1, Gly3 and Met75 treatments; normal vs. disease groups.]

Genomics

• Uses sequencing technologies to study genomes and intragenomic phenomena.

• Data: DNA sequences

Transcriptomics

• The transcriptome is the set of all RNA molecules in one cell or a population of cells.
• Transcriptomics examines the expression levels of mRNAs in a given cell population.
• Technologies
  – Microarrays
  – Next Generation Sequencing

Proteomics

• The large-scale study of proteins (the proteome)
  – (3D) structures and
  – functions.
• Spectrum of techniques
  – 2D gel based
  – Mass Spectrometry (MS)
  – SELDI-TOF (MS)
  – Protein arrays
  – …

Metabolomics

• Comprehensive and simultaneous systematic determination of
  – metabolite levels in the metabolome and
  – their changes over time as a consequence of stimuli.
• Relies on
  – Separation techniques: GC, CE, HPLC, UPLC
  – Detection techniques: NMR, MS


Altogether: The central dogma and the omics cascade


Why would we want to integrate data?


Why should we integrate data?

What we learn from an experiment may depend on where we look, how we look, and the scope of our view!

The Blind Men and the Elephant

http://www.noogenesis.com/pineapple/blind_men_elephant.html


Focusing on one platform risks missing an obvious signal!!!


From componentwise to global approaches

It is expected that the integrated collection and analysis of diverse types of data, jointly modelled and analyzed in a systems biology approach, can shed light on the global functioning of biological systems.


Ultimate Goal: understanding of complex processes


Integrative Analysis & Data integration: methods, types, challenges


Data Integration is cool

• Everywhere nowadays in Biology, Medicine, Bioinformatics, …
• Meetings
  – Barcelona (Feb. 2013), Leiden (Apr. 2013), Ascona (May 2013)
• Financing (FP7): projects with > 10^6 € each
  – Stategra
  – MimOmics
• Try googling with the terms 'omics data integration'


But what is Data Integration?

◦ “Data integration” may mean different things...
◦ Computational combination of data
◦ Combination of studies performed independently
◦ Simultaneous analysis of multiple variables on multiple datasets.
◦ Not to mention any possible approach for homogeneously querying heterogeneous data sources

Integrative analysis may be preferable


There are many types of integrative analysis

Hamid et al. 2009


There are many methods ….

• Decision trees, Bayesian networks, Support vector machines, Graph algorithms, Multivariate analysis, …


There are many issues to be addressed

• Data preprocessing
  – Data of the same or different types
• High (but "cursed") dimensionality
  – N << p
  – Datasets of different sizes (10^4 genes, 10^3 proteins)
  – Multiple testing issues
• Missing values
  – Some values missing for some individuals
  – Non-rectangularity of the data
• Biological interpretation
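To make the "multiple testing issues" item concrete: when thousands of features are tested at once, raw p-values need adjusting. The slides do not name a correction method, so the Benjamini-Hochberg procedure below is only one common choice, shown as a minimal sketch with made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five features (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
```

In this toy case two raw p-values below 0.05 stop being significant after adjustment, which is the effect that grows with the number of features tested.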


So what?

• We will restrict ourselves to arbitrarily chosen examples that provide an overview of the field without pretending to cover it all.
• Case studies.
  – Combining biological knowledge with omics data using multivariate statistics.
  – How to obtain improved cancer predictors by aggregating datasets.
  – Using network biology methods for translational cancer research.


Some examples


Integrative Analysis of the Relationship Between Insulin Resistance and Gut Microbiota


Insulin Resistance

Insulin resistance means cells become less sensitive to insulin.

This provokes the pancreas to over-compensate by working harder and releasing even more insulin.

Insulin resistance + insulin over-production leads to two common outcomes: diabetes or obesity


IS/IR and Gut Microbiota

The human gut microbiome is related to health & weight
◦ varies in healthy people
◦ varies in lean and obese

It is reasonable to postulate that insulin sensitivity is associated with changes in bacterial microflora.


Data for relating IR/IS with Microbiome

• Clinical variables (BMI, HOMA, Ins, HDL, …)
• Microarrays
  – Expression matrix and related annotations (GO)
• Microbial flora diversity based on
  – Denaturing Gradient Gel Electrophoresis (DGGE)
  – Metagenomic shotgun NGS sequencing

Data layout: samples in rows, blocks of variables in columns
• Column blocks: Clin1 … ClinK1; DGGE1 … DGGEK2; Expr1 … ExprK3; GeneSet1 … GeneSetK4; Spec1 … SpecK5
• Rows (samples): IS_NoD_10, IS_NoD_11, IS_NoD_12, IR_NoD_13, IR_NoD_14, IR_NoD_15, Diab_16, Diab_17


Principal Components Analysis

• Given a K×N data matrix containing K (correlated) measurements on N samples (objects/individuals…)
• PCA decomposes the data matrix into K new components that
  – account for different sources of variability in the data,
  – are uncorrelated, that is, each component accounts for a different source of variability,
  – have decreasing explanatory ability: each component explains more than the following one,
  – allow for a lower dimensional representation of the data in terms of scores on principal components.
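The properties listed above can be checked on a toy example. The sketch below is not the presenter's code: it stores samples in rows (the transpose of the K×N layout on the slide), uses made-up numbers, and extracts components by power iteration with deflation purely to stay within the Python standard library. In practice a library SVD would be used.

```python
import math


def center(rows):
    """Subtract the column means from a samples-by-variables table."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[r[j] - means[j] for j in range(k)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of an already centered table."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(k)]
            for a in range(k)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    k = len(matrix)
    v = [1.0 / (j + 1) for j in range(k)]  # arbitrary asymmetric start vector
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    av = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
    return sum(x * y for x, y in zip(v, av)), v


def pca(rows, n_components):
    """Return (components, variances, scores) for the first n_components PCs."""
    x = center(rows)
    cov = covariance(x)
    k = len(cov)
    components, variances = [], []
    for _ in range(n_components):
        lam, v = leading_eigenpair(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the variability already explained by this PC.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(k)] for a in range(k)]
    scores = [[sum(r[j] * v[j] for j in range(k)) for v in components] for r in x]
    return components, variances, scores


# Five samples, two strongly correlated variables (made-up numbers).
data = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.0], [1.0, 1.1], [-1.0, -1.1]]
components, variances, scores = pca(data, 2)
explained = [v / sum(variances) for v in variances]
```

Here `scores` holds the new coordinates of each sample, `components` are orthogonal directions, and `explained` shows the decreasing share of variability carried by each PC.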


How does PCA work

• PCA provides a new set of coordinates for the observations
• Original coordinates
  – Values of the variables
• New coordinates
  – Values of the PCs: scores
• Scores are the new coordinates in the orthogonal system defined by the PCs.

[Figure: axes X1 and X2]


Representing data in the PCA space

• PCs have been derived so that
  – They are orthogonal
  – Each PC explains the maximum amount of remaining variation in the data
• This means that it is not necessary to use all PCs to visualize the data in this new coordinate system
  – Taking the first PCs will often explain a high percentage of variability.
  – Usually only the first 2 or 3
  – This should always be checked!!!
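The "this should always be checked" advice amounts to looking at the cumulative share of variance before settling on 2 or 3 PCs. A minimal sketch, assuming the PC variances are already available and sorted from largest to smallest; the numbers and the 80% threshold are hypothetical.

```python
def components_needed(variances, threshold=0.8):
    """Smallest number of leading PCs whose cumulative variance share reaches threshold."""
    total = sum(variances)
    cumulative = 0.0
    for count, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count
    return len(variances)


# Hypothetical PC variances, largest first.
variances = [4.2, 2.1, 0.9, 0.5, 0.3]
n_for_80_percent = components_needed(variances, 0.8)
n_for_50_percent = components_needed(variances, 0.5)
```

With these made-up values the first two PCs fall just short of 80%, so a two-dimensional plot would hide part of the structure.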


Multiple Factor Analysis (MFA)

MFA is a multivariate statistical technique useful to analyze several groups of variables (numerical and/or categorical) defined on the same samples.

Multiple Factor Analysis (2)

• The core of MFA is a PCA applied to the whole set of variables.
• Each group of variables is weighted, which makes it possible to analyze different points of view while taking them equally into account.
• MFA allows looking for common factors by providing a representation of each matrix of variables.
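The weighting step can be sketched as follows. The slides do not spell out the weights; the usual MFA convention, assumed here, is to divide each centered group by its first singular value so that no group dominates the global PCA. The two blocks (`clinical`, `expression`) and their values are made up, and power iteration is used only to avoid external libraries.

```python
import math


def center_columns(rows):
    """Subtract the column means from a samples-by-variables table."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[r[j] - means[j] for j in range(k)] for r in rows]


def first_eigenvalue(rows, iterations=500):
    """Largest eigenvalue of X^T X for a centered table X (power iteration)."""
    n, k = len(rows), len(rows[0])
    v = [1.0 / (j + 1) for j in range(k)]  # arbitrary start vector
    lam = 0.0
    for _ in range(iterations):
        xv = [sum(r[j] * v[j] for j in range(k)) for r in rows]
        w = [sum(rows[i][j] * xv[i] for i in range(n)) for j in range(k)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam


def mfa_weighted_blocks(groups):
    """Center each group and divide it by its first singular value."""
    weighted = []
    for rows in groups:
        x = center_columns(rows)
        # Assumes the group is not constant, so the eigenvalue is positive.
        weight = 1.0 / math.sqrt(first_eigenvalue(x))
        weighted.append([[value * weight for value in r] for r in x])
    return weighted


# Two made-up groups measured on the same four samples, on very different scales.
clinical = [[30.1, 5.2], [24.3, 1.1], [35.8, 7.9], [22.0, 0.8]]
expression = [[0.12, 0.40, 0.33], [0.10, 0.35, 0.31],
              [0.20, 0.52, 0.29], [0.08, 0.30, 0.35]]
blocks = mfa_weighted_blocks([clinical, expression])
# Global table for the final PCA: weighted blocks side by side.
global_table = [blocks[0][i] + blocks[1][i] for i in range(len(clinical))]
```

After weighting, the leading eigenvalue of every block equals 1, so the large-scale clinical block no longer swamps the expression block when a single PCA is run on `global_table`.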


MFA (3): Multiple displays


MFA (4): Supplementary info

The assets of MFA appear when integrating both numerical and categorical groups of variables, and when supplementary groups of data need to be added to the analysis.

Conclusions

The good
◦ MFA allows the integrated analysis of multiple groups of possibly heterogeneous data types.
◦ It can help to highlight associations previously undetected (“adds value”).
◦ It can deal with any number of groups and any type of supplementary variables (Gene Sets, Species, …)

Limitations
◦ It assumes individual-based information: no groups (e.g. pools) as input
◦ Missing values are difficult to deal with


Complementary idea 1: Improve use of biological knowledge

• The ultimate goal is a better understanding of (changes in) biological processes.
• It seems reasonable to make (increased) use of biological information.
• This can be done in different ways
  – Convert data into networks and align them
  – Project biological units into a common space and rely on commonalities and differences for variable selection


Previous results


goProfiles


Variable selection based on Biological Knowledge

• Preliminary work on functional profiling can be used to project biological units such as genes or proteins into annotation databases such as the Gene Ontology.
• An iterative algorithm can be used to select subsets that are either
  – most biologically diverse
  – most biologically homogeneous
• This can be used as a basis for variable selection prior to MFA


Integrative Omics Data Mining and Knowledge Discovery in Colorectal Cancer

Based on a work by Jake Y. Chen, Ph.D., Indiana Center for Systems Biology & Personalized Medicine


Polyp and Colorectal Cancer

Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.

Photo courtesy of National Cancer Institute

Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.

Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data

Data set               H    PP   CR   N
Diet                   70   54   29   153
Oxidative Stress       50   32   12   94
LC-MS Proteomics       80   72   40   192
Vitamin D              83   81   31   195
GCxGC MS Metabolomics  83   84   30   197
Lipidomics             47   35   15   97
NMR Metabolomics       53   35   15   103

H: healthy; PP: polyp; CR: colorectal cancer; N: total number of subjects


Scientific Questions to Answer

Data Analysis
• Which Omics data has the best prediction power?
• Which features in Omics data are important?

Data Mining
• Does integration of Omics data improve the prediction?
• Which combination of Omics data has the best prediction power?

Knowledge Discovery
• Why do those features in Omics data have the best prediction power?

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining


Proteomics Data Description

Group: Bindley Biosciences Center at Purdue University

Instruments: Agilent's chip cube coupled to the XCT Plus ESI ion trap

Data format at CCE webportal: mzXML

Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40


LC-MS Proteomics Data Processing

[Figure panels: LC/MS data “heat map”; image-enhanced LC/MS data “heat map”; Total Ion Chromatogram (TIC) summarized from the enhanced heat map]

Methods adapted from
• N. Jeffries (2005) Bioinformatics, vol. 21, no. 14, p. 3066.
• S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, no. 2, pp. 75-83.

LC-MS Major Protein Identification
~25-28 characteristic proteins/sample identified

• Identify the most informative TIC R.T. “grid”
• Apply the R.T. grid to the original spectra
• Use Mascot to search for protein IDs at R.T. grid regions

No  Scan  RT      Uniprot_ID   Score  Expect  Evidence
1   119   139.48  ADAD2_HUMAN  38     3.3     0
2   229   265.87  NNMT_HUMAN   43     1.1     2
3   372   429.15  ZSA5D_HUMAN  42     1.2     0
4   656   749.8   BRAF_HUMAN   40     2.2     479
5   1162  1276.6  RGS7_HUMAN   47     0.39    1
6   1310  1407.2  TTC9C_HUMAN  35     6.3     0
7   1669  1713.9  CP042_HUMAN  38     3.1     0
8   1866  1879.1  HXD11_HUMAN  34     8.4     0
9   1987  1980.3  ING4_HUMAN   38     3.1     2
10  2114  2086    ZN423_HUMAN  33     10      0
11  2353  2285.7  CL065_HUMAN  37     3.9     0
12  2539  2441.3  CA5BL_HUMAN  47     0.4     1
13  2722  2594.7  NPDC1_HUMAN  38     3.6     0
14  2874  2722.2  DJC27_HUMAN  37     3.8     0
15  3001  2828.5  BORG4_HUMAN  40     2.2     1
16  3165  2965.1  KC1G1_HUMAN  27     43      0
17  3440  3196.1  TPPC5_HUMAN  40     2       0
18  3656  3377.6  UB2D3_HUMAN  43     0.99    1
19  3997  3665.5  TM208_HUMAN  34     8.1     0
20  4257  3885.4  ZBED3_HUMAN  29     23      0

Proteomics Result Interpretation

Proteins Identified from Colon Cancer and Healthy Groups

Uniprot_ID   Frequency in Colon (10)  Frequency in Health (10)  Evidence in PubMed
BRAF_HUMAN   3                        0                         508
DMP46_HUMAN  3                        0                         0
NNMT_HUMAN   3                        1                         4
MRP_HUMAN    1                        3                         0
STK33_HUMAN  0                        3                         0

Proteins Interacting with High-Frequency Proteins from Colon Cancer Group

Uniprot_ID    Gene   Evidence in PubMed  Protein Name
BRAF1_HUMAN   BRAF   508                 Serine/threonine-protein kinase B-raf
P53_HUMAN     TP53   443                 Cellular tumor antigen p53
CD44_HUMAN    CD44   411                 CD44 antigen
MDM2_HUMAN    MDM2   131                 E3 ubiquitin-protein ligase Mdm2
BCR_HUMAN     BCR    59                  Breakpoint cluster region protein
LCK_HUMAN     LCK    29                  Tyrosine-protein kinase Lck
Q7RTZ3_HUMAN  LCK    29                  Tyrosine-protein kinase Lck
CAV1_HUMAN    CAV1   21                  Caveolin-1
PNPH_HUMAN    PNP    13                  Purine nucleoside phosphorylase
CBL_HUMAN     CBL    11                  E3 ubiquitin-protein ligase CBL
RAF1_HUMAN    RAF1   10                  RAF proto-oncogene serine/threonine-protein kinase
CD38_HUMAN    CD38   8                   ADP-ribosyl cyclase 1
NNMT_HUMAN    NNMT   4                   Nicotinamide N-methyltransferase
IRAK1_HUMAN   IRAK1  3                   Interleukin-1 receptor-associated kinase 1
DMPK_HUMAN    DMPK   2                   Myotonin-protein kinase
ITA5_HUMAN    ITGA5  1                   Integrin alpha-5
ITB1_HUMAN    ITGB1  1                   Integrin beta-1
ZAP70_HUMAN   ZAP70  1                   Tyrosine-protein kinase ZAP-70
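As a small worked example of the frequency table, the sketch below turns the detection counts into per-group frequencies and flags proteins detected in at least 30% of colon samples. It assumes that the "(10)" in the column headers is the number of samples per group and that 0.3 is the cut-off for "frequently detected"; neither is stated explicitly next to the table.

```python
def detection_frequency(counts, group_size):
    """Fraction of samples in a group in which each protein was detected."""
    return {protein: count / group_size for protein, count in counts.items()}


# Counts copied from the table; the group size of 10 is an assumption.
colon_counts = {"BRAF_HUMAN": 3, "DMP46_HUMAN": 3, "NNMT_HUMAN": 3,
                "MRP_HUMAN": 1, "STK33_HUMAN": 0}
healthy_counts = {"BRAF_HUMAN": 0, "DMP46_HUMAN": 0, "NNMT_HUMAN": 1,
                  "MRP_HUMAN": 3, "STK33_HUMAN": 3}

colon_freq = detection_frequency(colon_counts, 10)
healthy_freq = detection_frequency(healthy_counts, 10)
frequent_in_colon = sorted(p for p, f in colon_freq.items() if f >= 0.3)
```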

Proteomics Result Interpretation: A Network Biology Context

Protein Network Constructed from the Top 3 Differential Proteins

• Green-circled proteins are frequently (>= 0.3) detected in the colon patient blood samples by LC/MS.
• Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
• Edge: protein interaction with confidence score from HAPPI 1.31 (4- and 5-star).
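The PubMed evidence search uses one query template per gene. A minimal sketch that only builds the query strings; the helper name `pubmed_query` is hypothetical, and no request is sent to PubMed.

```python
def pubmed_query(gene_symbol):
    """Fill the slide's query template with a gene symbol."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')


# Gene symbols taken from the interaction table.
queries = {gene: pubmed_query(gene) for gene in ["BRAF", "NNMT", "TP53"]}
```

The number of hits returned for each query would then serve as the "Evidence in PubMed" count.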

Proteomics Result Interpretation: A Biological Pathway Context

BRAF (Serine/threonine-protein kinase B-raf) plays a major role in the Colorectal Cancer Pathway (KEGG data)

Proteomics Result Interpretation: A Biological Pathway Context for NNMT

NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations / Phase II Conjugation / Methylation (from Reactome)

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  – NMR Data
  – GCxGC MS Data
• Integrative Data Mining


Metabolomics Data Description

Group: Daniel Raftery Laboratory at Purdue University

NMR Data
• Instruments: Bruker Avance 500 MHz NMR
• Data format at CCE webportal: Excel spreadsheet
• Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15

GCxGC MS Data
• Instruments: LECO Pegasus 4D GCxGC-TOF
• Data format at CCE webportal: Excel spreadsheet
• Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30

NMR Data Analysis Workflow

• Signal processing
• Extract peaks’ ppm
• Search against the Human Metabolome Database (2.5) to identify metabolites
• Report only significant metabolites

Sample_ID  1                           2
Top1       Delta-Hexanolactone         Delta-Hexanolactone
Top2       Hypotaurine                 Hypotaurine
Top3       2,3-Diphosphoglyceric acid  Diethanolamine
Top4       Diethanolamine              3,7-Dimethyluric acid
Top5       3-Phosphoglyceric acid      Methyl isobutyl ketone
Top6       3,7-Dimethyluric acid       1,3,7-Trimethyluric acid
Top7       1,3,7-Trimethyluric acid    Cysteine-S-sulfate
Top8       L-Allothreonine             L-Allothreonine
Top9
Top10

NMR Peak Metabolite Identification using the Human Metabolome Database

1) Input the peak lists

2) Get the metabolites; leave out those with fewer than 2 matches
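The two steps above can be mimicked offline. The sketch below is not the database's own search: it matches a peak list against a small reference table within a ppm tolerance and drops metabolites with fewer than 2 matching peaks. The reference entries, peak positions and the 0.02 ppm tolerance are all hypothetical.

```python
def match_metabolites(peaks_ppm, reference, tolerance=0.02, min_matches=2):
    """Return {metabolite: matched reference peaks}, keeping those with enough matches."""
    hits = {}
    for name, ref_peaks in reference.items():
        matched = sum(
            1 for ref in ref_peaks
            if any(abs(ref - peak) <= tolerance for peak in peaks_ppm)
        )
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical peak list (ppm) and reference table; not real chemical shifts.
peaks = [1.33, 3.05, 4.11, 7.20]
reference = {
    "metabolite_A": [1.32, 4.12],
    "metabolite_B": [3.04, 5.50],
    "metabolite_C": [2.00],
}
hits = match_metabolites(peaks, reference)
```

Only `metabolite_A` survives here, since the other two entries match at most one observed peak.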

Significant Metabolites Identified from NMR Metabolomics Data

Group                  Metabolites
Polyp vs Healthy       D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp    None
Colorectal vs Healthy  D-Arabitol (2/15 vs 0/53)

Population Frequency = [formula not preserved in the transcript]

Marker metabolites? Shared metabolites

D-Arabitol Identified from NMR Results
Involved in Pentose and Glucuronate Interconversions Pathways

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  – NMR Data
  – GCxGC MS Data
• Integrative Data Mining

Results from GCxGC MS Data I
Metabolite identification is more straightforward

Metabolites: Polyp vs Healthy
• Methanesulfinic acid, trimethylsilyl ester
• Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
• Hexanedioic acid, bis(2-ethylhexyl) ester
• Mefloquine
• Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
• Tetradecanoic acid, trimethylsilyl ester
• psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
• Silanol, trimethyl-, pyrophosphate (4:1)
• Trimethylsilyl ether of glycerol
• Ethylbis(trimethylsilyl)amine
• Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
• Benzene, (1-hexadecylheptadecyl)-
• Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester

Metabolites: Colorectal vs Polyp
• Acetic acid, (methoxyimino)-, trimethylsilyl ester
• Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• Methanesulfinic acid, trimethylsilyl ester
• Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
• L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
• Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
• Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
• Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester

Metabolites: Colorectal vs Healthy
• Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
• L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
• Cholesterol trimethylsilyl ether
• Hexanoic acid, trimethylsilyl ester
• Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
• 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-


Results from GCxGC MS Data II

[Figure panels: A. Polyp vs Healthy; B. Polyp vs Colorectal; C. Colorectal vs Healthy]

Comparative Results (Intensity vs. Population)
Marker Metabolite Panel Clustering of three groups

[Figure panels: intensity-based heat map; population-frequency-based heat map]

Metabolites Identified from GCxGC MS Results
Involved in Fatty Acid Biosynthesis Pathways

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Data Set Description: Diet, Lipidomics, Oxidative and VD

• The number of features and the total number of subjects vary across data sets
• The three classes are balanced to the least common denominator
  – Healthy vs. Polyp
  – Healthy vs. Colorectal
  – Polyp vs. Colorectal

                Diet  Lipid  Oxidative  VD
Total Subjects  150   97     94         195
Total Features  38    49     3          2
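One plausible reading of "balanced to the least common denominator" is that, for each pairwise comparison, the larger class is randomly downsampled to the size of the smaller one. The sketch below illustrates that reading only; the sample identifiers are placeholders and the class sizes are the lipidomics counts for healthy (47) and polyp (35) subjects.

```python
import random


def balance_classes(samples_by_class, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    smallest = min(len(ids) for ids in samples_by_class.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {label: sorted(rng.sample(ids, smallest))
            for label, ids in samples_by_class.items()}


# Placeholder sample IDs for one pairwise comparison.
healthy_vs_polyp = {
    "Healthy": list(range(47)),
    "Polyp": list(range(100, 135)),
}
balanced = balance_classes(healthy_vs_polyp)
```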

Predictive Modeling Methods

Data Preprocessing
• Filtering outliers (three standard deviations away from the mean)
• Data normalization (transforming to the 0-1 range)
• Binned categorical data using the quantile binning method

Missing Value Treatment
• Replaced with the mean value of the attribute in the group

Support Vector Machine (SVM) Classifier Kernel
• A Radial Basis Function (RBF) kernel is used

Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level.
• Approach #2: SVM Attribute Evaluator with Ranker algorithm. Features from t-tests are filtered using p-values
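The preprocessing steps can be sketched for a single attribute as follows. This is an illustration under assumptions, not the study's pipeline: the order of the steps, the use of the sample standard deviation, and the example values are all made up, and missing values are encoded as `None`.

```python
import statistics


def impute_group_mean(values):
    """Replace missing entries (None) with the mean of the observed values in the group."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def remove_outliers(values, n_sd=3.0):
    """Keep values within n_sd sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= n_sd * sd]


def min_max_scale(values):
    """Linearly rescale values to the 0-1 range (assumes they are not all equal)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Made-up attribute values for one group; the last value is an obvious outlier.
raw = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9,
       5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.1, 4.9, 5.0, 50.0]
kept = remove_outliers(raw)
scaled = min_max_scale(kept)
imputed = impute_group_mean([1.0, None, 3.0])
```

Note that with very small groups the three-standard-deviation rule can never fire, because a single extreme value inflates the standard deviation it is compared against.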

[Workflow diagram labels: Raw Dataset; Clean Dataset; Classification Model; K-fold Cross-validation; Hypothesis]
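K-fold cross-validation, the evaluation step in this workflow, can be sketched without external libraries. The study used an SVM with an RBF kernel; since that is not available in the Python standard library, a nearest-centroid classifier stands in here purely to show the fold mechanics. The data and labels are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def fit_centroids(xs, ys):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))


def cross_validated_accuracy(xs, ys, k):
    """Fraction of samples classified correctly when held out."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k):
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test)
    return correct / len(xs)


# Made-up, well-separated data: one feature, two classes.
features = [[0.1], [0.9], [0.2], [0.8], [0.0], [1.0]]
labels = ["Healthy", "Polyp", "Healthy", "Polyp", "Healthy", "Polyp"]
accuracy = cross_validated_accuracy(features, labels, k=3)
```

Any preprocessing or feature selection that looks at the labels should be repeated inside each training fold; otherwise the reported accuracy is optimistic.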

Dietary Attributes as Predictors

                        Polyp vs. Healthy  Colorectal vs. Healthy
SVM Predictor Accuracy  64%                65%

[Figure: p-values of dietary attributes for each comparison. Attributes: Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk. P-values: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01, 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02 (pairing of attributes and p-values not preserved).]

Page 67: New Challenges in Bioinformatics: Integrative Analysis of Omics … · New Challenges in Bioinformatics: Integrative Analysis of Omics Data Alex Sánchez 1Statistics and Bioinformatics

Lipidomics T-Tests ResultsSignificant Features Selected from T Test with their corresponding p value

Features Polyp vs. Healthy Polyp vs. Colorectal Colorectal vs. Healthy

16:0/18:1 PE 1.76E-02

24:1 Cer 6.90E-03

LPE 18:1 <1.00E-04

LPE 20:0 1.50E-03 2.00E-04

An-16:0 LPA 3.23E-02

An-18:1 LPA 3.38E-02 1.33E-02

AA 1.13E-02

18:2 LPA 1.13E-02 4.50E-03

20:4 LPA 2.40E-02

22:6 FA 4.28E-02 3.24E-02

LPE 16:0 3.08E-02 3.40E-03

LPE 18:0 3.90E-03 1.00E-04

LPE 18:1 2.18E-02
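A p-value filter of the kind used to produce this table can be sketched as follows. The slides only say "two-sample unpaired t-tests at the 5% significance level"; the Welch (unequal-variance) variant used here, the function names and the dictionary layout are assumptions made for illustration. The two-sided p-value is obtained from the regularized incomplete beta function, evaluated with a standard continued-fraction expansion so that no external library is needed.

```python
from math import exp, lgamma, log, sqrt
from statistics import mean, variance


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step, then odd step of the recurrence.
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def t_two_sided_p(t, df):
    """Two-sided p-value of a t statistic with df degrees of freedom."""
    return _reg_inc_beta(df / 2.0, 0.5, df / (df + t * t))


def welch_t_test(x, y):
    """Return (t, p) for two unpaired samples; assumes at least one sample has non-zero variance."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, t_two_sided_p(t, df)


def select_features(class_a, class_b, alpha=0.05):
    """class_a / class_b map feature name -> values in that class; keep features with p < alpha."""
    selected = {}
    for name in class_a:
        _, p = welch_t_test(class_a[name], class_b[name])
        if p < alpha:
            selected[name] = p
    return selected
```

With 49 lipid features tested per comparison, some features would pass an uncorrected 5% threshold by chance alone, which is worth keeping in mind when reading the table.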


Integrating lipidomics with clinical features
Performance comparisons

                          Accuracy          Accuracy          Accuracy
                          (without          (with t-test      (automatic
                          pre-selection)    pre-selection)    selection)
Polyp vs. Healthy         0.54              0.71              0.78
Colorectal vs. Healthy*   0.57              0.63              0.73
Polyp vs. Colorectal*     0.70              0.90              0.87

                          Accuracy
Polyp vs. Healthy         0.55
Colorectal vs. Healthy*   0.60
Polyp vs. Colorectal*     0.60

* Since the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.

Without Clinical Features    With Clinical Features
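The accuracies above are cross-validated. A minimal sketch of k-fold cross-validation in plain Python is given below. The helper names are hypothetical, and the nearest-centroid classifier is only a stand-in so that the example runs without external libraries; it is not the RBF-kernel SVM used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is used for testing exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_val_accuracy(X, y, fit, predict, k=3, seed=0):
    """Fraction of samples classified correctly when each is predicted from the other folds."""
    correct = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


# Stand-in classifier (nearest class centroid), NOT the SVM used on the slides.
def fit_centroids(X, y):
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )
```

Any preprocessing or feature selection that looks at the class labels (such as the t-test filter) should be repeated inside each training fold; selecting features on the full data set first tends to inflate the reported accuracy.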


Messages

Individual omics data sets have variable predictive performance.

Thorough statistical filtering plus integration of biological knowledge is needed to combat the inherently high level of noise in the data.

Integration of different omics data with clinical data can improve predictive performance.


Network methods and data integration


Network methods

• [Obvious comment]: Networks are everywhere, from social networks such as Facebook to terrorist threats and biochemical processes.

• Network science is a (re-)emerging field that models systems of interacting elements in order to describe and predict the behavior of diverse systems.


Biological systems modelling


Building and using networks

– Networks can be created by collecting interactions published in papers, or they can be reconstructed directly from data.

– Different types of biological intracellular molecular networks can be represented by different types of graphs.

– Protein interaction networks and cell signaling networks can be connected to drugs and diseases.

– A network representation can be used to integrate different datasets using genes as anchors.
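The idea of using genes as anchors can be sketched as follows: interactions define the nodes and edges, and each dataset attaches its measurements to the gene nodes it shares with the network. The function name, the data layout and the dataset names in the usage note are hypothetical, chosen only for illustration.

```python
def build_gene_network(interactions, **datasets):
    """Build a gene-anchored network.

    interactions: iterable of (gene_a, gene_b) pairs, e.g. curated from the literature.
    datasets: dataset name -> {gene: value}, e.g. expression or methylation summaries.
    Returns (nodes, edges): nodes maps gene -> {dataset name: value}; edges is a set
    of undirected gene pairs.
    """
    nodes = {}
    edges = set()
    for a, b in interactions:
        edges.add(frozenset((a, b)))
        nodes.setdefault(a, {})
        nodes.setdefault(b, {})
    for name, values in datasets.items():
        for gene, value in values.items():
            if gene in nodes:  # only genes present in the network act as anchors
                nodes[gene][name] = value
    return nodes, edges


def neighbors(edges, gene):
    """Genes directly connected to the given gene."""
    return {g for edge in edges if gene in edge for g in edge if g != gene}
```

For example, `build_gene_network([("G1", "G2")], expression={"G1": 2.0}, methylation={"G1": 0.3})` yields a node G1 carrying both measurements, which can then be inspected together with its neighbor G2. Genes measured in a dataset but absent from the network are dropped in this sketch; whether that is desirable depends on the question being asked.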


Network biology methods integrating biological data for translational science

Bebek G et al. Brief Bioinform 2012;13:446-459


An integrative -omics signaling network identification process workflow

• Start with processing tissue-specific data (instrument outputs).

• Microarray data is normalized to make comparisons of expression levels and transformed to select genes for further analysis.

• Genome-wide genotyping signals are analyzed to identify regions (and hence regional genes) for both tumor and normal tissue (or non-cancerous cells).

• Next, genomic regions with significant aberrations are merged with their corresponding microarray probes to create expression profiles.

• In this analysis step, expression profiles are used to calculate Pearson's coexpression correlations among gene pairs. These results are fed into the Pathway Analysis Framework.

• Integrating gene–gene coexpression values, annotations from GO, known signaling pathways, protein sequence information, PPI networks and protein subcellular co-localization data, pathways are predicted and filtered.

• Significant pathway subnetworks are merged to form signaling networks connecting genes of interest.

• The networks and genomic alterations identified are put together to create a descriptive functional network, creating a molecular basis for the cancer studied.
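The coexpression step of this workflow, Pearson correlations among gene pairs, can be sketched in plain Python. The function names and the correlation threshold of 0.8 are assumptions for illustration; the cited workflow does not state a cutoff here, and the later pathway prediction and filtering steps are not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.8):
    """profiles: gene -> expression values measured across the same samples.

    Returns {(gene_1, gene_2): r} for every gene pair with |r| >= threshold.
    The threshold is a hypothetical choice, not taken from the source.
    """
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges
```

The resulting gene pairs and correlation values are the kind of input that would then be combined with GO annotations, known pathways and PPI networks in the pathway analysis step.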


Network-based prioritization of candidate disease genes.

Bebek G et al. Brief Bioinform 2012;bib.bbr075

© The Author 2012. Published by Oxford University Press.


Conclusions

• Data integration, or better, integrative analysis of omics data, is a challenging topic with many open problems.

• Current state: go study by study and consider the nature of the data and the type of question.

• Current approaches are diverse:
  – Machine learning, dimension reduction, pathway visualization, ...

• Diverse open research lines, with a lot of room for improvement.

• Yet to come:
  – The "integrator": an automatic combination that clearly improves biological interpretation.
  – A mathematical framework common to all problems.

• Last but not least: integrative analysis requires integrative work, well within the philosophy of Biostatnet or other collaborative networks.


Acknowledgments

• Statistics and Bioinformatics Research Group at the Statistics department of the University of Barcelona.

• The Biostatnet group, and particularly Carmen Cadarso and Lupe Gomez.

• My colleagues at the Statistics and Bioinformatics Unit at the Vall d'Hebrón Research Institute.

• Unitat de Serveis Científico Tècnics (UCTS) at the Vall d'Hebrón Research Institute.


Thank you for your attention!
