Post on 22-Jun-2020
Big Data Training for Translational Omics Research
Principles of Biomarker Discovery and Development in Translational Medicine
Liu, 6/7/2017
Class 1
10:15am
Unit 3; Session 1
Breakdown
Learning objectives
Biomarker and Precision Medicine
Biomarker in preclinical and clinical studies
Principles of Biomarker Discovery: Overview
Principles of Biomarker Discovery: data collection
Principles of Biomarker Discovery: data analysis
Principles of Biomarker Discovery: validation
Philosophy of Translational Research
• As a biomedical researcher, how can I make something to benefit patients?
• I am working on cell lines and mice; how can the omics approach help me understand the mechanism, especially causality?
• Can the key molecule(s) I identified in cells and animals be used in humans?
Lab researchers, grant writers, physicians…
Key Words
• Biomarker: A characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
NIH Biomarkers Definitions Working Group; Atkinson, et al., Clin Pharm Ther, 2001.
• Translational: Translational research aims to aid in the transformation of biological knowledge into solutions that can be applied in a clinical setting.
Azuaje F. Bioinformatics and Biomarker Discovery, 2010
Why Biomarker?
A Core Question in Modern Medicine
How to Address Patient Heterogeneity?
Patient Heterogeneity
Biomarker → Personalized Medicine
• CML patients: Gleevec, 90% RR (response rate)
• All breast cancer patients: Herceptin, 10–15% RR
• HER2+ breast cancer patients: Herceptin, 35–45% RR
• All NSCLC patients: Iressa, 10–15% RR
• EGFR MT+ NSCLC patients: Iressa, 60–70% RR
Slamon et al. NEJM 2001; Kantarjian et al. NEJM 2002; Vogel et al. JCO 2002. 20:3; Douillard et al. JCO 2010.
Biomarkers are especially important in diseases with low response rates in the overall population
Biomarker → Personalized Medicine
[Figure: Discovery, Drug development, Implementation]
• Cancer: EGFR, KRAS, ALK, HER2, BRAF
• Precision molecules: Gefitinib, ARS-853?, Crizotinib, Herceptin, Vemurafenib
• Other common diseases: Gene A, Gene B, Gene C, Gene D, Gene E
Precision Medicine
To deliver the right treatment to the right patient at the right dose and at the right time
Clinical Application of Biomarker
• Deal with patient heterogeneity
– Early risk assessment
– Disease prevention
– Assist diagnosis
– Optimize treatment: high effectiveness, low risk
– Match the patient to therapeutic strategy
– Monitor therapy success/disease recurrence
– Long-term management
[Figure: Risk, Diagnosis, Treatment, Monitoring]
Biomarker in Preclinical Studies
• To characterize the phenotype
• To monitor the response
• To identify potential translational biomarkers for humans
Omics Approach in Basic Research
• Explore molecular mechanism
• Hypothesis generating
• Identify therapeutic targets and strategies
• Establish intermediate phenotypes
Type of Biomarkers
• Prognostic marker (a): before treatment
• Predictive marker (b): before treatment
• Pharmacodynamic marker (c): after treatment
• Surrogate marker (d): during treatment
Gosho, et al. Sensors 2012, 12, 8966-8986
Prognostic Marker
• Signature separates a population with respect to the outcome (risk)
• Regardless of the type of therapy or treatment
– Markers associated with overall survival regardless of treatment
• Distinguishes outcome (poor or good) following the test and standard treatments
• Cannot guide the choice of a particular treatment
• Can determine the aggressiveness of treatment
Ballman KL, JCO. 2015.63.3651
Predictive Biomarker
Ballman KL, JCO. 2015.63.3651
• Predicts the differential outcome of a particular therapy or treatment
• Prospectively identifies patients who are likely to have a favorable clinical outcome from a specific treatment
• Therefore, a predictive biomarker can guide the choice of treatment
Prognostic and Predictive Markers
Ballman KL, JCO. 2015.63.3651
• Some biomarkers are both prognostic (of disease susceptibility or progression) and predictive (of certain treatment outcomes)
• ER status and breast cancer: prognostic
• ER status and antiestrogen therapy: predictive
Pharmacodynamic Markers
• PD biomarkers provide information about the pharmacologic effects of a drug on its target
• Measured after treatment
• A clinical endpoint to be measured
• Application:
– Proof of mechanism: i.e., does the drug hit its intended target?
– Proof of concept: i.e., does hitting the drug target alter the biology of the tumor?
– Selection of optimal biologic dosing
– Understanding response/resistance mechanisms
• Examples:
– Protein phosphorylation markers, e.g. p-EGFR, p-ERK, to evaluate changes in target protein phosphorylation or the activation status of downstream signaling/adapter molecules
– Apoptosis (TUNEL assay) to assess pharmacologic effect on cell death
Surrogate Biomarker
• Substitute for a clinical endpoint
• Expected to predict clinical benefit (or lack of benefit, or harm) based on epidemiologic, therapeutic, pathophysiologic, or other scientific evidence
• During or after treatment
• Examples:
– Glucose level for monitoring the treatment of diabetes
– Imaging-based measurement for anti-cancer therapy
Questions
What kind of biomarker is HOXB13:IL17BR in the first case paper?
What kind of biomarker is the blood concentration of R-/S-methadone in the second case paper?
Examples of FDA Approved Biomarkers
Gosho, et al. Sensors 2012, 12, 8966-8986
Biomarker Discovery and Development in the Omics Era
[Figure: timeline spanning the 1970s, 1980s, 1990s, and >2005]
Genomics
Transcriptomics
miRNomics
lncRNomics
Epigenomics
Proteomics
Metabolomics
Lipidomics
Exposomics
Prognostic-diagnostic Markers
• Genes for ~50% of rare diseases identified
Nature Reviews Genetics 14, 681–691 (2013)
• 11,907 SNPs strongly associated with common diseases
Pharmacogenomic Markers
• 166 FDA approved PGx markers for drug treatment
Transcriptomic Biomarkers
• MammaPrint test
– Agendia
– 70-gene signature for breast cancer prognosis
• Oncotype Dx test
– Genomic Health
– 21 gene-expression biomarkers for predicting recurrence in breast cancer patients and response to both chemotherapy and radiation therapy
• H/I test
– AviaraDx
– 2-gene signature used to estimate the risk of recurrence and response to therapy in breast cancer patients
Biomarker Development Pipeline
[Figure: biomarker development pipeline. Source: https://is.muni.cz]
• Stages: Discovery, Confirmation, Assay development, Validation/Refinement, Clinical Validation, Clinical Adoption
• Discovery platforms: Genomics, Transcriptomics, Proteomics, Metabolomics, Lipidomics, Epigenomics, Exposomics, Imaging
• Technical development: integrated technologies and platforms; multi-analyte assays; robust validated assays; clinical grade assays (accurate, specific, reproducible, reliable); instruments
• Axes: number of analytes, number of samples
• Other labels: Target selection, Lead identification, Preclinical, Retrospective, Clinical trials, Marketing/clinical use
Institute of Medicine Roadmap for omics-based tumor biomarker test development
Hayes BMC Medicine 2013, 11:221
Data Acquisition Strategies
• Retrospective:
– Clinical samples collected before the design of the biomarker study, and before comparison with control samples
– Looks back at past, recorded data to find evidence of marker-disease relationships
– Inexpensive, rapid
– Potentially biased, noisy
– Weak evidence
• Prospective:
– The biomarker-based prediction or classification model is applied to patients at the time of patient enrolment
– Clinical outcomes or disease occurrence are unknown at the time of enrolment
– Less biased
– Strong evidence
– Expensive, time-consuming
• Pro-retrospective
FDA approval!!
Study Design Consideration
• Biomarker discovery studies require careful planning and design
• Study style: retrospective, prospective, pro-retrospective
• Sample collection
• Phenotype
• Sample size and power estimation
• Other covariates
• Data collection
• Platform
• Replication, validation and application
• Data analysis plan
Sample Collection, Assay Design, Data Analysis Plan
• Establish methods: specimen collection, processing, storage
• Establish criteria: quantity and quality, minimum amount
• Feasibility: obtaining specimens
• Assay design: communication with core/service provider
• Data analysis: communication with biostatistician and bioinformatician
Sample and Materials
• Biospecimen
– Tissue
– Blood
– Oral swab
– Hair
– Tear
– Urine
– Feces
– Saliva
– …
• Test materials
– DNA
– RNA
– Protein
– Small molecules
– Lipids
• Principles:
– Non-invasive
– Reproducible
– Reliable
– Specific
– Accurate
– Inexpensive
– Point-of-care
[Axis label: invasiveness]
Ethical, Legal, and Regulatory Issues
• Establish communication with regulatory agencies, e.g. IRB, FDA
• Regulatory approvals
• Documents:
– Informed consent
– Study protocol
• Intellectual property issues
• CLIA-lab based test for clinical trials involving patient selection
Sample Size and Power Estimation
• Power setting: 0.8
• Statistical significance:
– Discovery: multiple hypotheses (p corrected according to # of tests)
– Validation: usually one hypothesis (p<0.05)
• Input parameters: previous publication or pilot study
• Online tools:
– piface.jar by Lenth (2006): http://homepage.stat.uiowa.edu/~rlenth/Power/
– Microarray power/sample size estimation: http://sph.umd.edu/department/epib/sample-size-and-power-calculations-microarray-studies
– RNA-seq data:
• Scotty: http://bioinformatics.bc.edu/marthlab/scotty/scotty.php
• RnaSeqSampleSize: https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
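As a rough illustration of how power, significance level, and the number of tests interact, here is a minimal sketch using the normal approximation for a two-group comparison of means. It is not the method used by the tools listed above; the function name `samples_per_group`, the standardized effect size input, and the Bonferroni-style division of alpha by the number of tests are assumptions made for this example.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Approximate samples per group for a two-sided, two-group comparison.

    effect_size: standardized mean difference (difference / standard deviation),
                 taken from a previous publication or pilot study.
    n_tests:     number of hypotheses; alpha is divided by it (Bonferroni-style),
                 which is an assumption of this sketch.
    """
    z = NormalDist()  # standard normal distribution
    alpha_adjusted = alpha / n_tests
    z_alpha = z.inv_cdf(1 - alpha_adjusted / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)                    # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Validation setting: one hypothesis, p < 0.05, power 0.8
print(samples_per_group(effect_size=1.0))
# Discovery setting: 1000 markers tested, corrected significance level
print(samples_per_group(effect_size=1.0, n_tests=1000))
```

The discovery setting needs noticeably more samples than the validation setting for the same effect size, because the corrected significance threshold is much stricter.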
Key Principles: Big Data in Biomarker
[Figure: Phenotype (“Digits”) × Molecular Profiles (“Digits”), linked by Statistics, Bioinformatics, Network, …]
Always Start Your Design and Analysis From Data Evaluation!
• What kind of phenotypic and marker data do I have/should I use/collect?
• Are my data normally distributed?
• What kind of models should I choose?
• What factors may confound my analyses?
• How may covariate data be correlated with my phenotype?
Phenotype to Digits
• Nominal data: no order
– Yes or no (binary): disease vs normal, response vs no response
– Cancer type: breast, lung, colon…
• Ordinal data: some order
– Pathologic: tumor stage I, II, III
– Disease progression: no, mild, severe, death
• Continuous data:
– Glucose level, LDL, drug concentration, gene expression
• Survival data: time to event
– Death, occurrence of disease, onset of toxicity, in hr, day, wk, month, yr, etc.
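The data types above can be turned into numbers in several ways. The sketch below shows one possible encoding; the field names, category labels, and numeric codes are hypothetical and chosen only for illustration.

```python
# Hypothetical numeric codes for an ordinal phenotype (tumor stage).
STAGE_CODES = {"I": 1, "II": 2, "III": 3}


def encode_binary(label, positive="disease"):
    """Nominal (binary) data: 1 for the positive class, 0 otherwise."""
    return 1 if label == positive else 0


def encode_stage(stage):
    """Ordinal data: map the stage to an integer that preserves its order."""
    return STAGE_CODES[stage]


def encode_survival(time, event_observed):
    """Survival data: (time to event, 1 if the event was observed, 0 if censored)."""
    return (float(time), 1 if event_observed else 0)


# Two made-up patients, used only to show the encoding.
patients = [
    {"status": "disease", "stage": "II", "glucose": 7.8, "months": 14, "died": True},
    {"status": "normal", "stage": "I", "glucose": 5.1, "months": 36, "died": False},
]

encoded = [
    {
        "status": encode_binary(p["status"]),
        "stage": encode_stage(p["stage"]),
        "glucose": p["glucose"],  # continuous data is used as measured
        "survival": encode_survival(p["months"], p["died"]),
    }
    for p in patients
]
print(encoded)
```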
Molecular Data Collection
[Figure: Platform → Raw data → “Digits”]
• Platforms: Genomics, Transcriptomics, miRNomics, lncRNomics, Epigenomics, Proteomics, Metabolomics, Lipidomics
• Ordinal data: 0, 1, 2
• Continuous variables: -1.2, -1.1, 0.58, 1.09, 2.34, …
Basic Statistical Methods
[Figure: Phenotype (numerical data) × Molecular Profiles (numerical data)]
• Phenotype data types: nominal, ordinal, continuous, survival
• Molecular profile data types: nominal, ordinal, continuous
• Tests: Chi-square test, t-test, ANOVA, correlation, log rank
• Statistic models
• Descriptive and exploratory association
Basic Statistical Methods
• Continuous data
– Normally distributed: parametric method
– Non-normal distribution/ordinal data: non-parametric method
• Winsorization
• Log transformation: log2
Parametric             Non-parametric
t-test                 Mann-Whitney rank-sum test
Paired t-test          Wilcoxon signed-rank test
ANOVA                  Kruskal-Wallis test
Pearson correlation    Spearman correlation
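Winsorization and log2 transformation are two common ways to make skewed data more suitable for parametric methods. The following is a minimal sketch; the percentile rule used to pick the clipping limits and the pseudocount of 1 are assumptions of this example, not settings prescribed by the slides.

```python
from math import log2


def winsorize(values, lower=0.05, upper=0.95):
    """Clip extreme values to the values found at the given lower/upper fractions."""
    ordered = sorted(values)
    n = len(ordered)
    low_limit = ordered[int(lower * (n - 1))]
    high_limit = ordered[int(upper * (n - 1))]
    return [min(max(v, low_limit), high_limit) for v in values]


def log2_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount avoids log2(0)."""
    return [log2(v + pseudocount) for v in values]


raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # one extreme outlier
print(winsorize(raw, lower=0.1, upper=0.9))
print(log2_transform([0, 1, 3]))
```

After either step, it is still worth checking the distribution before choosing between the parametric and non-parametric tests in the table above.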
Statistic Models
• Univariate models
– Logistic regression: binary/categorical phenotype
– Linear regression: continuous phenotype
– Kaplan-Meier (KM) method: survival phenotype
• Multivariate models
– Multivariate regressions: linear or logistic
– Cox regression: survival phenotype
• Other sophisticated models
Multiple Testing Issue
• Example
– P value cutoff = 0.05
– 1000 genes: 50 genes expected by chance (error) at this significance level
– If 60 genes have p<0.05, many might be due to noise (false positives)
• Common correction methods
– Bonferroni correction
• Adjusted p value: p × n, e.g. p = 0.0005, n = 1000 genes, adjusted p = 0.0005 × 1000 = 0.5
• Equivalently, corrected p value cutoff = 0.05/n
• Explanation: among all genes selected, the probability of at least one false positive is ≤ 0.05
– False discovery rate (FDR)
• FDR = 0.1 means that among all genes selected (e.g. 100), we would expect 10 to be false positives
• FDR as high as 0.5 may be acceptable to biologists
• Several different approaches to estimate (Benjamini & Hochberg, B&H, most popular)
• Data filtering in the pre-processing step can also reduce the number of genes tested
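The two corrections named on this slide can be sketched in a few lines. This is a plain-Python illustration of Bonferroni adjustment and the Benjamini & Hochberg step-up procedure; the function names are chosen for this example and the p values are made up.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p values: p * n, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p values (FDR), returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]   # made-up p values for four genes
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

With these four p values, all genes pass an FDR cutoff of 0.05, while only two pass the Bonferroni-adjusted cutoff of 0.05, which illustrates why FDR is usually preferred in discovery studies.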
Azuaje F. Bioinformatics and Biomarker Discovery, 2010
Basic Biomarker Discovery Pipeline
Data Processing
• Data pre-processing
– Data filtering and QC
• Remove samples with failed experiments
• Exclude markers with very low variance
• Exclude markers with very low expression levels, e.g. RNA-seq
– Data normalization
• To transform the data into a format that is compatible or comparable between different samples or assays
• To level potential differences caused by experimental factors, such as labelling and hybridization
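The filtering and normalization steps above might look like the following in code. This is a minimal sketch: the thresholds, the marker names, and the choice of per-sample median centering as the normalization are assumptions for illustration, and real pipelines typically use platform-specific methods.

```python
from statistics import mean, median, variance


def filter_markers(expr, min_mean=1.0, min_variance=0.1):
    """Keep markers whose mean expression and variance reach the thresholds.

    expr maps a marker name to its list of values, one value per sample.
    """
    kept = {}
    for marker, values in expr.items():
        if mean(values) >= min_mean and variance(values) >= min_variance:
            kept[marker] = values
    return kept


def median_center_samples(expr):
    """Subtract each sample's median across markers so samples are comparable."""
    markers = list(expr)
    n_samples = len(expr[markers[0]])
    medians = [median(expr[m][j] for m in markers) for j in range(n_samples)]
    return {m: [expr[m][j] - medians[j] for j in range(n_samples)] for m in markers}


# Made-up expression values for three markers across three samples.
expression = {
    "geneA": [5.0, 6.0, 9.0],   # expressed and variable: kept
    "geneB": [2.0, 2.0, 2.0],   # zero variance: removed
    "geneC": [0.1, 0.2, 0.9],   # very low expression: removed
}
print(filter_markers(expression))
print(median_center_samples({"geneA": [1, 2], "geneB": [3, 6], "geneC": [5, 7]}))
```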
Why Remove Genes with Low Variance?
[Figure: two panels of gene expression (y-axis, 0 to 4) in case vs control; p=0.004 and p=0.008]
Data Reduction
• Focus on smaller sets of potentially novel and interesting data patterns (e.g. groups of samples or gene sets)
• Confirm initial hypotheses about the relevance of the available features and guide future experimental and computational analysis
• Exploratory univariate analyses
– T-test
– Chi-square test
– Correlation
– Univariate regression
Data Matrix
• Data matrix
• Color-coded representations of absolute or relative expression levels
[Figure: data matrix; axes: Expression, Samples]
Data Visualization
[Figure: dendrogram]
• Statistical plotting: GraphPad
• Dendrogram and heatmap: R, GENE-E, Gitools
Exploratory Analysis
• Univariate analysis
• Single marker vs phenotype
• Multiple-hypothesis testing corrections
– DEG (differentially expressed genes)
– Fold change
– Statistical model: t-test, correlation, univariate regression
– P values and other cut-offs
• Unsupervised classification (clustering) and visualization
• Filtering: to remove uninformative, highly noisy or redundant markers for subsequent analyses
• Supervised classification
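For a single marker compared between two groups, fold change and a t statistic are the usual first numbers to compute. The sketch below uses Welch's unequal-variance t statistic as one reasonable choice (the slides do not say which t-test variant is intended) and made-up expression values; converting the statistic to a p value requires the t distribution and is left to a statistics package.

```python
from math import log2, sqrt
from statistics import mean, variance


def log2_fold_change(case, control):
    """log2 of the ratio of group means (values on the original, positive scale)."""
    return log2(mean(case) / mean(control))


def welch_t(case, control):
    """Welch's t statistic for two independent groups with unequal variances."""
    standard_error = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / standard_error


# Made-up expression values for one gene.
case = [7.0, 8.0, 9.0]
control = [1.0, 2.0, 3.0]
print(log2_fold_change(case, control))
print(welch_t(case, control))
```

Repeating this for every marker produces the p values that then go through multiple-hypothesis testing correction.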
Data Integration
• Further reduction
• Which markers to choose for predictive model construction
• To estimate the potential relevance of the identified markers and relationships
• To discover other significant genes and relationships (e.g. gene-gene or gene-disease) not found in previous data-driven analysis steps
• Tools:
– Human gene annotation databases (e.g. GO)
– Metabolic pathway databases (e.g. KEGG)
– Gene-disease association extractors from public databases (e.g. Endeavour)
– Other functional catalogues
– IPA
• Resulting data- and knowledge-driven findings, patterns or predictions provide a selected catalogue of genes, pathways and (gene-gene and gene-disease) relationships relevant to the phenotype classes investigated
Don’t Forget Covariates!
• Don’t forget these:
– Demographic: age, gender, race (often a PCA component), smoking, drinking, lifestyle, etc.
– Physiological: BMI, weight, height, etc.
– Clinical: blood tests, urine tests, other analytes
• Integrate information
– Molecular data
– Knowledge-driven data
– Covariates
• Multivariate regression
– Model training
– Model validation
– Model assessment: ROC
Data Integration is Critical
• Provide more reliable information
• Increase the prediction value
• Insight into the mechanism
• Reliable hypothesis generation
• But can be biased as well
[Figure: DNA → (Transcription) → RNA → (Translation) → Protein → (Catalysis) → Metabolites; Genome, Transcriptome, Proteome, Metabolome/Lipidome → Clinical endpoint; labels: dysregulation, genetic effect, environmental effect]
Examples of Cardiovascular Biomarkers with Integrated Data
Vasan, 2006; Gerszten and Wang, 2008
Building Predictive Models
If … Then …
Build up a model based on selected markers
Discovery set
Validation set
Pro-retrospective set
Prospective set
$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_i X_i$
Predictive Models
• Multivariable models
– Linear regression
• Continuous data
– logistic regression
• Presence/absence of disease
– Cox regression
• Survival data
• Algorithmic models—Machine learning
– Support vector machines (SVM)
– Artificial neural networks (ANN)
Validation Strategies
• Internal validation
– Cross-validation
– Random/non-random split of samples into training and test sets
• External validation
– Independent sample and dataset
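Internal validation by cross-validation comes down to splitting sample indices into training and test sets several times. Here is a minimal sketch of a random k-fold split; the function name, the default of five folds, and the fixed seed are choices made for this example.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random split, reproducible via seed
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(10, k=5):
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample appears in exactly one test fold, so every sample is predicted once by a model that never saw it during training. External validation still requires an independent sample and dataset.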
Assessment of Performance
• Basic parameters
– Sensitivity: the proportion of the true positive outcomes (e.g. truly diseased subjects) that are predicted to be positive
– Specificity: the proportion of the true negative outcomes (e.g. truly disease-free subjects) that are predicted to be negative
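Both parameters follow directly from counting true and predicted labels. The sketch below assumes labels coded as 1 (diseased/positive) and 0 (disease-free/negative); the example labels are made up.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


truth = [1, 1, 1, 0, 0, 0, 0, 1]       # made-up true disease status
predicted = [1, 1, 0, 0, 0, 1, 0, 1]   # made-up model predictions
print(sensitivity_specificity(truth, predicted))
```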
Assessment of Performance
• Receiver Operating Characteristic (ROC) curve
• Area under the curve (AUC), also called “AUROC”
– AUC=0.5: no association
– AUC=1: perfect association
– AUC<0.6: no medical value
– AUC>0.75: reasonable
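The AUC can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive subject receives the higher score, with ties counted as one half. This is a minimal sketch of that pairwise definition, with labels coded 1/0 and made-up scores.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.4, 0.5, 0.1]))   # partly overlapping scores
print(roc_auc(labels, [0.9, 0.8, 0.2, 0.1]))   # perfectly separated scores
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; rank-based implementations are preferable for large datasets.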