Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning...

159
Bioinformatics I (KF) VU 041035 WS2019 http://icbi.at/mo Hubert Hackl Biocenter, Division of Bioinformatics, Medical University of Innsbruck, Innrain 80, 6020 Innsbruck, Austria Tel: +43-512-9003-71403, Email: [email protected], URL: http://icbi.at

Transcript of Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning...

Page 1: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Bioinformatics I (KF) VU 041035 WS2019 http://icbi.at/mo

Hubert Hackl

Biocenter, Division of Bioinformatics, Medical University of Innsbruck,Innrain 80, 6020 Innsbruck, AustriaTel: +43-512-9003-71403, Email: [email protected], URL: http://icbi.at

Page 2: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

I. Predictive methods on protein sequences (Protein domains, Signal P, NetMHCPan)

II. Sequence alignment and databases (BLAST, NCBI, TCGA, Firebrowse, TCIA, GEO, ArrayExpress, IntOgen, cBioPortal)

III. Differentially expressed genes (microarrays, RNAseq) (R/Bioconductorsoftware packages rma, limma, DEseq2)

IV. Expression profiling and clustering (Genesis)

V. Gene ontology, Pathway analysis (DAVID, KEGG, Reactome, ConsensusPathDB, ClueGO,Enrichr)

VI. Network analysis (Cytoscape)

VII. Gene set enrichment analysis (GSEA), Deconvolution

VIII. Predictive and prognostic marker (signatures) (logistic regression, survival analysis)

Contents

Page 3: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Symbol Meaning Description

R A or G puRineY C or T pYrimidineW A or T Weak hydrogen bondsS G or C Strong hydrogen bondsM A or C aMino groupsK G or T Keto groupsH A, C, or T (U) not G, (H follows G)B G, C, or T (U) not A, (B follows A)V G, A, or C not T (U), (V follows U)D G, A, or T (U) not C, (D follows C)N G, A, C or T (U) aNy nucleotide

Nomenclature of nuclein acids

Base Symbol Occurrence

Adenin A DNA, RNAGuanin G DNA, RNACytosin C DNA, RNAThymin T DNAUracil U RNA

Page 4: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

+ strand 5´-ACGGTCGCTGTCGGTAGC-3´- strand 3´-TGCCAGCGACAGCCATCG-5´

e.g. in fasta format : >gene sequence|gi12345|chr17|-

GCTACCGACAGCGACCGT

DNA sequences are always from 5‘ to 3‘

Positions in the genome (genome assembly) are chromosome wise

e.g. human GRCh37/hg19

chr11:1-100 chr11:49,686,777-49,689,777

Positions in the chromosome start for both!! strands from position 1

+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´- strand 3´-TGCCAGCGAC…………AGCCATCG-5´

chr11:1 2523 2529

chr11:1 2523 2529

Nomenclature

Page 5: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Translation, genetic code and reading frames

Page 6: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Peptid chain, amino acid sequence, proteins

Protein sequences are always form N-terminal end to C-terminal end

backbone

sidechains

E.g.. SCD sequence in fasta format

Page 7: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Protein Sequence Analysis

Shared ancestry?Similar function?Domain or complete sequence?Are functional sequences conserved?

Page 8: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Homology

Searches

Homology searches

Sequence alignment

BLAST, FASTA

Page 9: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Profile

Analysis

Profile Analysis

Uses collective characteristics of a family of proteins

Position specific score matrix (PSSM)

Profile HMM

ProfileScan, Pfam, CDD, Prosite, BLOCKS

PSI-Blast

Page 10: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Protein Sequence Analysis

Amino Acid Composition

Hydrophobicity

Charge

Physical

Properties

Page 11: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Protein Sequence Analysis

Secondary structure

Specialized structures

Tertiary structure

Structural

Properties

Page 12: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Substitutions matrices

• Unrelated or random model assumes that letter a occursindependently with some frequency qa.

P(x,y|R) = qxiqxj

• The alternative match model of aligned pairs of residuesoccurs with a joint probability pab.

P(x,y|M) = pxi yi

• Odds ratio

P(x,y|M) pxi yi pxi yi

P(x,y|R) qxi qyj qxi qyj= =

Page 13: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Summary of substitutions matrices

• Triple-PAM strategy (Altschul, 1991)

– PAM 40 short alignments, highly similar– PAM 120 – PAM 250 longer, weaker local alignments

• BLOSUM (Henikoff, 1993)

– BLOSUM 90 short alignments, highly similar– BLOSUM 62 most effective in detecting known

members of a protein family– BLOSUM 30 longer, weaker local alignments

• No single matrix is the complete answer for all sequencecomparisons

• Programs like BLAST usually have default matrices!

Page 14: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Database search

Page 15: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

FASTA

• Use hash table of short words of the database (DB) sequence and query sequence (2-6 chars)

• For words in query sequence, find similar words in DB using (fast) hash table lookup, and compute

R = position(query) – position (DB).

Areas of long match will show same R for many words.

• Score matching segments based on content of these matches. Extend the good matches empirically.

Page 16: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

High-scoring segment pairs

Page 17: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Significance of scores

The number of unrelated matches with score greater than S is approximately Poisson distributed with mean

E(S)=Kmne-λS

where λ is a scaling factor m and n are the length of the sequences

The probability that there is a match of score greater than S follows a extreme value distribution:

Karlin S, Altschul S. Proc Natl Acad Sci (1990)

P(x>S)=1-e-E(S)

Page 18: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

NCBI Blast

Program Query sequence Subject sequence

BLASTN Nucleotide Nucleotide

BLASTP Protein Protein

BLASTX Nucleotidesix-frame translation

Protein

TBLASTN Protein Nucleotidesix-frame translation

TBLASTX Nucleotidesix-frame translation

Nucleotidesix-frame translation

Page 19: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

NCBI Blast Example

Page 20: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Blast Results

alignment

Be

st h

it

description

graphicalvisualization

E-value

Score (S)

conserved domaindatabase (CDD)

Page 21: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Multiple sequence alignment (Clustal W)

Hbb_Human -

Hbb_Horse .17 -

Hba_Human .59 .60 -

Hba_Horse .59 .59 .13 -

Myg_Whale .77 .77 .75 .75 -

Hbb_Human

Hbb_Horse

Hba_Human

Hba_Horse

Myg_Whale

1 PEEKSAVTALWGKVN--VDEVGG

2 GEEKAAVLALWDKVN--EEEVGG

3 PADKTNVKAAWGKVGAHAGEYGA

4 AADKTNVKAAWSKVGGHAGEYGA

5 EHEWQLVLHVWAKVEADVAGHGQ

Rooted neighbor-joining tree(guide tree) and sequence weights

Progressive alignmentfollowing guide tree

Pairwise alignmentcalculate distance matrix

1

23

4

1

23

4

Page 22: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

BLOCKS

Steve Henikoff, FHCRC ,Seattle

Multiple alignments of conserved regions in protein families

– 1 block = 1 short, ungapped multiple alignment

– Families can be defined by one or more blocks

– Searches allow detection of one or more blocks representing a family

– http://blocks.fhcrc.org/

Page 23: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Profile Construction

20

1b

f(p,b) = frequency of amino acid b in position ps(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)

PSSM(p,a) = f(p,b)*s(a,b)

Page 24: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

PSI-BLAST

• Position-Specific Iterated BLAST search

• Used to identify distantly related sequences that are possibly missed during a standard BLAST search

• Easy-to-use version of a profile-based search Perform BLAST search against protein database Use results to calculate a position-specific scoring

matrix PSSM replaces query for next round of searches May be iterated until no new significant alignments

are found

Altschul et al., Nucleic Acids Res. 25: 3389-3402, 1997

Page 25: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

DELTA-BLAST

• Method different from that used by PSI-BLAST

Step 1: Align the query against conserved domains derived from CDD

Step 2: Compute PSSMStep 3: Search sequence databases using PSSM as

the query

• Produces high-quality alignments, even at low levels of sequence similarity

• Dependent on homologous relationships captured within CDD

Page 26: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Markov chains

Markov chains: a sequence of events that occur one after another. The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events.

Scoring sequences (e.g. start codon ATG)3 states (S1, S2, S3), p(A)=p(C)=p(G)=p(T)=0.25

A T G

p(A)=0.91p(C)=0.03p(G)=0.03p(T)=0.03

p(A)=0.03p(C)=0.03p(G)=0.03p(T)=0.91

p(A)=0.03p(C)=0.03p(G)=0.91p(T)=0.03

Markov chain 0th orderp(ATG)=0.913=0.752

Markov chain 1th orderp(ATG)=p(A)*p(T|A)*p(G|T)

S1 S2 S3

Page 27: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Hidden Markov Model (HMM)

- Example exon-intron border- 3 states: exon(E), 5‘SS (5), intron (I)

Eddy SR, Nat Biotech 2004

Emission probabilities

Hidden, want to infer(π)(hidden Markov chain)

Given (S)

log P(S,π|HMM,Θ)=log(1*0.2518*0.917*0.1*0.95*1.0*0.4*0.9*0.4*0.9*0.4*0.9*0.1*0.9*0.4*0.9*0.1*0.9*0.4*0.1)

Find best state path(highest score)

If man possible paths thanuse efficient Viterbi algorithm(based on dynamic programing)

G T A A G T C A

HMM parameters (Θ)

Page 28: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Profile Hidden Markov Model

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

P(A)=0.8

P(C)=0.0

P(G)=0.0

P(T)=0.2

P(A)=0.8

P(C)=0.2

P(G)=0.0

P(T)=0.0

P(A)=1.0

P(C)=0.0

P(G)=0.0

P(T)=0.0

P(A)=0.0

P(C)=0.0

P(G)=0.2

P(T)=0.8

P(A)=0.0

P(C)=0.8

P(G)=0.2

P(T)=0.0

P(A)=0.0

P(C)=0.8

P(G)=0.2

P(T)=0.0

P(A)=0.2

P(C)=0.4

P(G)=0.2

P(T)=0.2

1.0

0.4

1.0 0.4

0.6

0.6

1.0 1.01.0

[AT][CG][AC][ACGT]*A[TG][GC]

Regular Expressions

p(ACACATC)=0.8*1*0.8*1*0.8*0.6 *0.4*0.6*1*1*0.8*1*0.8=0.047log-odds=log(p(S)/0.25L)=log(0.047/0.257)

insertion state

- For multiple alignments (e.g. DNA sequences)

Page 29: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Profile Hidden Markov Model

Main states (gray)

Allows position dependent gap penalties Can be obtained from a multiple alignment (DNA or Protein) Can be used for searching a database for other members

of the family

Insert states to model highlyvariable regions in the alignemnt

Delete (silent, null) states

Avoid overfitting by using pseudocounts (e.g. add 1 to all counts)

Insert states

Page 30: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Neuronal network for secondary structure prediction

Page 31: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

PredictProtein

Multi-step predictive algorithm (Rost et al., 1994)

Protein sequence queried against SWISS-PROTMaxHom used to generate iterative, profile-basedMultiple sequence alignment (Sander and Schneider,1991)

Multiple alignment fed into neural network (PHDsec)

Accuracy: Average > 70%, Best-case > 90%

http://www.predictprotein.org/

Page 32: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Prediction of protein function

Frank Eisenhaber et al.

Annotator

Page 33: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

SignalP

• Neural network trained based on phylogeny– Gram-negative prokaryotic– Gram-positive prokaryotic– Eukaryotic

• Predicts secretory signal peptides• http://www.cbs.dtu.dk/services/SignalP/

Signal peptide score (S)

Cleavage site score (C)

Combined Score (Y)

Page 34: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Insulin

Page 35: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Proinsulin

Page 36: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

DNA sequence(AY242110)

1. extract cds of insulin

2. extract mRNA seq of insulin

fasta file

blastn

4. extract protein sequence of insulin

NCBI Geneusing Refseq ID

fasta file

5. extract mRNA seq in human

Notepad++

Notepad++

NCBI Nucleotide (Genbank)

blastn

NCBI Geneusing Refseq ID

6. extract protein sequence of insulin

fasta file Notepad++

7. genomic location and adjacent genes

NCBI Geneusing Refseq ID

8. compare protein sequences 9. find signal peptide

blastp (align two sequences)Signal P server

Sus scrofa (pig) Homo sapiens (human)

Find difference between insulin sequence in pig and human

Page 37: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Signal peptide B-chain (30 AA)

C-peptide A-chain (21 AA)

Difference between pig and human insulin = 1AA

Page 38: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

A-kinase anchoring proteins (AKAPs) binding to the regulatory subunit of protein kinase A (PKA)

Page 39: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Protter

GPR161

Page 40: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Protein secondary structure prediction (Jpred)

Page 41: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Amphipatic helix

Heliquest

Page 42: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

xx

Protein structure

Predictive biomarker

Clonal evolution

Tumor heterogenity

Antigen PredictionsFunctional impact

Combinations Pathways

variants

Recurrent mutations

Population

Analyses and interpretation of DNA variants

x

Page 43: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Somatic mutation detection in tumor samples

Raphael et al. Genome Med. 2014

Page 44: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

MuTect 2 (GATK)

Page 45: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

MutSig (MutSigCV)

Lawrence, M. et al. Nature 2013

MutSig builds a model of the background mutation processes (BMR) that were at work during formation of the tumors, and it analyzes the mutations of each gene to identify genes that were mutated more often than expected by chance, given the background model.MutSigCV (CV for 'covariate') improves the BMR estimation by pooling data from 'neighbor' genes with similar genomic properties such as DNA replication time, chromatin state (open/closed), and general level of transcription activity.

Page 46: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Ensembl Variant Effect Predictor (VEP)

Page 47: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Oncotator

http://www.broadinstitute.org/oncotator/

chr1 43815005 43815005 A AT

chr1 43815008 43815008 TG T

chr1 43815013 43815013 G GT

chr1 43815036 43815036 G GCC

chr1 43815041 43815041 C G

chr1 43815047 43815047 G GC

chr1 43815049 43815049 A AG

chr1 115252237 115252237 T TG

jurkat_mutations.vcf

Page 48: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Oncotator

Page 49: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

MutationAssessor

2,212288939,C,G

Chr Pos RefAll AltAll

Page 50: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

https://www.broadinstitute.org/cancer/cga/mutsig_run (see key.txt)

MutSigCV

exome_full192_coverage.txt

gene_covariates.txt

Mutation_type_dictionary_file.txt

hg19

dlcbl_prim.maf

Page 51: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Intratumor heterogenity and clonal evolution

Page 52: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Cancer-Immunity cycle

Hackl et al. Nat Rev Genet 17: 441-458 (2016)

Page 53: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Cancer-Immunity cycle

Hackl et al. Nat Rev Genet 17: 441-458 (2016)

Page 54: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

NetMHCpan

http://www.cbs.dtu.dk/services/NetMHCpan/

Page 55: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

IntOGen

Page 56: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 57: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Driver genes

Functional Impact

Recurrence

Cancers arise due to alterations in genes that confer growth advantage to the cell . More than 400 such ‘cancer genes’, identified to date are currently annotated in the Cancer Gene Census.

Cancer driver mutations (cancer drivers) versus passenger mutations can beidentified based on:

Page 58: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Driver signals

Clustered mutations (OncodriveCLUST)

Functional Mutations (OncodriveFM)

Recurrent Mutations (MutSigCV)

Mutation frequency per cancertype

No. of mutated samples in cancer type

No. of protein affecting mutated samples in cancer type (PAM)

Mutation distribution along protein sequence

Protein domains

No. and position of mutationof protein affecting mutatedsamples in cancer type (PAM)

Different transcripts

IntOgen (Integrative Onco Genomics)

Page 59: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

What is the most common BRAF mutation

In which cancer types IDH1 is a cancer driver and in whichcancer type mutation of IDH1 is most frequent

Most common drivers in breast carcinoma

Mutation frequency of VHL

Examples

Page 60: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

Page 61: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

International Cancer Genome Consortium (ICGC)

Page 62: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

International Cancer Genome Consortium (ICGC)

The International Cancer Genome Consortium (ICGC)was launched to coordinate large-scale cancer genomestudies in tumours from 50 different cancer typesand/or subtypes that are of clinical and societalimportance across the globe.

Systematic studies of more than 25,000 cancer genomesat the genomic, epigenomic and transcriptomic levelswill reveal the repertoire of oncogenic mutations,uncover traces of the mutagenic influences, defineclinically relevant subtypes for prognosis and therapeuticmanagement, and enable the development of newcancer therapies.

Page 63: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

Page 64: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

TCGA is supervised by the National Cancer Institute's Center for Cancer Genomics and the National Human Genome Research Institute funded by the US government.

3 year pilot project on 3 cancer types (starting 2006)

Phase II 33 different cancer types by 2014 including 10 rare cancers by 2014

Characterization centers (GCCs) for sequencing.

Genome data analysis centers (GDACs) for bioinformatic analyses.

It is also part of the The International Cancer Genome Consortium (ICGC)

Currently 34 cancer types >11000 samples

Page 65: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

Biospecimen Core Resource (BCR)Collect and process tissue samples

Genome Sequencing Centers (GSCs)Use high-throughput Genome Sequencing to identify the changes in DNA sequences in cancer

Genome Characterization Centers (GCCs)Analyze genomic and epigenomic changes involved in cancer

Data Coordinating Center (DCC)The TCGA data are centrally managed at the DCC

Genome Data Analysis Centers (GDACs)These centers provide informatics tools to facilitate broader use of TCGA data.

Page 66: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA member map

Page 67: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

Page 68: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

https://wiki.nci.nih.gov/display/TCGA/TCGA+Wiki+Home

All information, tutorials, FAQs etc. can be found here:

TCGA Data Portal has changed to Genomic Data Commons (GDC)

https://gdc-docs.nci.nih.gov/Data_Portal/Users_Guide/Getting_Started/

Page 69: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA barcodes

Tissue source siteBiospecimen core resource

Universally Unique Identifiers (UUIDs) have replaced TCGA barcodes

Sample

Page 70: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA data levels

Sequencing data (fastq, BAM) (data level 1) is control accessed at Cancer Genomics Hub (CGHub)

Page 71: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Genomic Data Commons

Page 72: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Genomic Data Commons

Page 73: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Genomic Data Commons

Page 74: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Genomic Data Commons

Page 75: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Genomic Data Commons

Page 76: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

GDC data transfer tool

https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

unzip

Page 77: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Download using manifest

>gdc-client download –m gdc_manifest_20161212_140030.txt

Page 78: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Firebrowse

Page 79: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Download RNAseqV2 with Firebrowse

Page 80: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

~_Level_3__RSEM_genes__data.data.txt

Page 81: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Download clinical data with Firebrowse

Selected parameter

All clinical andsample

information

Page 82: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Download clinical data with Firebrowse

Patients

Sele

cted

clin

ical

par

amet

er

Page 83: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Somatic mutations (open access)

https://www.biostars.org/p/69222/

Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs.

manually by experts automated scripts filters based on the extensive domain knowledge of said experts.

So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time

In some tumor-types, and for some subsets of samples, mutations called from the first-pass of sequencing (usually exome-seq), are targeted by custom capture arrays, and re-sequenced. This results in much higher read-depths for the purpose of validating first-pass calls, for more accurate variant allele fractions (VAFs), or for finding calls that were missed.

Page 84: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Mutation Annotation Format (MAF) Specification

1 Hugo_Symbol IDH12 Entrez_Gene_Id 03 Center genome.wustl.edu4 NCBI_Build 375 Chromosome 26 Start_Position 2091131127 End_Position 2091131128 Strand +9 Variant_Classification Missense_Mutation

10 Variant_Type SNP11 Reference_Allele C12 Tumor_Seq_Allele1 C13 Tumor_Seq_Allele2 T14 dbSNP_RS rs12191350015 dbSNP_Val_Status unknown16 Tumor_Sample_Barcode TCGA-AB-2802-03B-01W-0728-0817 Matched_Norm_Sample_Barcode TCGA-AB-2802-11B-01W-0728-0818 Match_Norm_Seq_Allele1 C19 Match_Norm_Seq_Allele2 C20 Tumor_Validation_Allele121 Tumor_Validation_Allele222 Match_Norm_Validation_Allele1 C23 Match_Norm_Validation_Allele2 C24 Verification_Status Verified25 Validation_Status Valid26 Mutation_Status Somatic27 Sequencing_Phase Phase_IV28 Sequence_Source Capture29 Validation_Method30 Score 131 BAM_File dbGAP32 Sequencer Illumina HiSeq

Should matchwith otheranalyses(=> lift-over)

Remove forOncotatoranalysis

Page 85: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

MAF files from Firehose

curated

somatic

Broad (center)

COAD (cancer type)

https://confluence.broadinstitute.org/display/GDAC/MAF+Dashboard

24.0 MB (size)

Page 86: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

https://www.biostars.org/p/69222/

Best available MAFs The Ding lab at TGI, WashU

Page 87: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Gene Expression Omnibus (GEO)

Raw data (cel files)

Expression matrix(normalized data)

Platform (microarray)(normalized data)

Sample data

Page 88: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

cBioPortal

Page 89: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

data from 105 cancer genomics studies

TCGA and other studies

Query and download

Many different analyses options

Page 90: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 91: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 92: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 93: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 94: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 95: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 96: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 97: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCIA

Page 98: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or
Page 99: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

The Cancer Immunome Atlas

Charoentong et al. Cell Rep. 2017. 18:248-262

Page 100: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Charoentong et al. Cell Rep. 2017. 18:248-262

Page 101: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCIA (Immunophenogram)

Page 102: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCIA (Tumor infiltrated lymphocytes)

Page 103: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCIA (Gene expression)

Page 104: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCIA (Survival analysis)

Page 105: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

GPViz

http://icbi.at/software/gpviz/gpviz.shtml

Do not update JAVA!

Page 106: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Refseq_hg19.gtf

oncotator.maf

LAML.maf

Refseq_CDD.txt

Page 107: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

GPViz

http://icbi.at/software/gpviz/gpviz.shtml Snajder et al, Bioinformatics 2013

Page 108: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

RNAseq

e.g. 20 mio

Page 109: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

0. Image analysis and base calling (Phred quality score)

=> FastQ files (sequence and corresponding quality levels)

1. Trimming adaptors and low quality reads2. Read mapping (Spliced alignment) (Tophat, STAR)

=> SAM/BAM files

3. Transcriptome reconstruction (reference transcriptome, GTF file)

4. Expression quantification (transcript isoforms) (HTseq, Cufflinks)-count reads-normalization

4. Differential expression analysis (negative-binomial test)(DESeq, edgeR, CuffDiff)

Analysis steps

Page 110: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Normalization

Reads per kilobase per million reads (RPKM)

Fragments per kilobase per million (FPKM) for paired-end seq.

Quantile normalization (upper quantile normalization)

TMM (trimmed mean of M values)

Page 111: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

TCGA

RNAseqV2 analysis MapSplice is used to do the alignment and RSEM to perform the quantitation.

raw_count … for EBseq introduce directlyfor DESeq2, EdgeR use integers

scaled_estimate … transcript per million TPM=scaled estimate*106

normalized_count …. upper quartile normalized RSEM count estimates

Page 112: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Isoform quantification

Uncertainy in assigning reads to isoforms Paired-end sequencing Spliced alignment Alternative splicing (statistical significant?)

Page 113: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Representation of gene expression

• n x m matrix with n genes and m experiments(conditions, patient samples)

• Representation as heatmap (e.g. red upregulatedgenes, green down regulated genes, black nochange)

• For experiments in reference design:

log2-fold change (log2FC, log2(A/B), log2 ratio)

• For patient samples and no reference:

mean centered log2-levels for each gene

log2-intensities for one-color arrayslog2-RPKM for RNAseq

z-score of log2-levels

Z= (X-m)/s X…log2-levels, m…mean,s…standard deviation

gen

es

experiments (patients)

heatmap

Page 114: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Clustering

• Unsupervised or supervised (classification)

• AgglomerativeBottom up approach, whereby single expression

profiles are successively joined to form nodes.

• DivisiveTop down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.

Page 115: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

• Pearson correlation (hierarchical clustering)

• Euclidian distance (k-means clustering)

• Manhattan distance• Spearman rank correlation• Mutual information….

Similarity (distance) measures

-1 r 1

Page 116: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Hierarchical clustering

• Agglomerative (bottom up), unsupervized• Cluster genes or samples (or both= biclustering)• Distances are encoded in dendogram (tree)• Cut tree to get clusters• Pearson correlation (usually used)• Computational intensive (correlation matrix)

1. Identify clusters (items) with closest distance2. Join to new clusters3. Compute distance between clusters (items) (see linkage)4. Return to step 1

Dendrogram

6 cluster

15 cluster

Indicates similarity

Page 117: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Linkage

Single-linkage clusteringMinimal distance

Complete-linkage clusteringMaximal distance

Average-linkage clusteringCalculated using average distance (UPGMA)Average from distances not! expression values

Weighted pair-group averageLike UPGMA but weighted according cluster size

Within-groups clusteringAverage of merged cluster is used instead of cluster elements

Ward’s methodSmallest possible increase in the sum of squared errors

Page 118: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

• partition n genes into k clusters, where k has to be predetermined

• k-means clustering minimizes the variability within and maximize between clusters

• Moderate memory and time consumption

K-means clustering

1. Generate random points (“cluster centers”) in n dimensions (results are depending on these seeds).

2.Compute distance of each data point to each of the cluster centers.

3.Assign each data point to the closest cluster center.

4.Compute new cluster center position as average of points assigned.

5.Loop to (2), stop when cluster centers do not move very much.

Page 119: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Principal component analysis (PCA)

PCA is a data reduction technique that allows to simplify multidimensional data sets into smaller number of dimensions (r<n).

Variables are summarized by a linear combination to the principal components. The origin of coordinate system is centered to the center of the data (mean centering) . The coordinate system is then rotated to a maximum of the variance in the first axis.

Subsequent principal components are orthogonal to the 1st PC. With the first 2 PCs usually 80-90% of the variance can already be explained.

This analysis can be done by a special matrix decomposition (singular value decomposition SVD).

Page 120: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Singular value decomposition (SVD)

X = USVT with UUT = VTV = VVT = I

For mean centered data the Covariance matrix C can be calculated by XXT. U are eigenvectors of XXT and the eigenvalues are in the diagonal of S defined by the characteristic equation |C – λI | = 0.

Transformation of the input vectors into the principal component space can be described by Y = XU where the projection of sample i along the axis is defined by the j-th PC:

Page 121: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Biological meaning of the gene sets

?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the sametranscription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis

Page 122: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Gene set enrichment analysis

Subramanian A et al. Proc Natl Acad Sci (2005)

Page 123: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Gene set enrichment analysis

1. Given an a priori defined set of genes.

2. Rank genes (e.g. by t-value between 2 groups of microarray samples) ranked gene list L.

3. Calculation of an enrichment score (ES) that reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

4. Estimation the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure.

5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).

http://www.broadinstitute.org/gseahttp://www.broadinstitute.org/cancer/software/gsea/wiki/

Page 124: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Gene Ontology (GO)

The three organizing principles (categories) of GO are

cellular component

biological process

molecular function

The Gene Ontology project (http://geneontology.org) provides a controlled vocabulary to describe gene and gene product attributes in any organism.

mitochondrium

cell cycle

isomerase activity

Page 125: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Termtranscription initiation

IDGO:0006352

DefinitionProcesses involved in starting transcription, where transcription is the synthesis of RNA by RNA polymerases using a DNA template.

What’s in a GO term?

Page 126: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Parent /child relation in directed acyclic graph (DAG)

2 relations:part_of

is_a

different levels

more specific

less specific

biological_process

Page 127: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Gene Ontology Browser (Amigo2)

http://amigo2.geneontology.org (http://geneontology.org/)

Inferred tree view

Term information Annotation

...

Page 128: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non-traceable Author Statement

IC Inferred by Curator

ND No biological Data available

Evidence code for GO annotations

Page 129: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Over representation analysis

m

g

gene universe (whole microarray)

GO term

ci

genes in cluster(gene list)

all genes with GO term

genes in clusterwith GO term

m-g c-i

g i

contingency table

Page 130: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Over representation analysis

Fisher exact test for contingency table

Hypergeometric distribution

Multiple hypothesis testing => adjust p-value

Not only for GO Terms also for TFBS, pathways,..

p =

5010

1000-5030-10

100030

m-g c-i

g i

50 redballs of1000balls

20x

draw 30x

10x

c=30 genes

m=1000genes

g=50 genes (GO) i=20 genes (GO)

Page 131: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

DAVID

Database for Annotation, Visualization and Integrated Discovery https://david.ncifcrf.gov Functional annotation tool (over representation analysis)

Dnajb1

Wnt11

Sorbs3

D230025D16Rik

Sfxn3

Hspa5

Golga3

Hgs

Npc1

Mta2

Cnn2

Spg20

Zpr1…

1019 mousegene symbols

Page 132: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Pathways

Canonical Pathways are idealized or generalized pathways that represent common properties of a particular signaling module or pathway

A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell.Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move (genome.gov/27530687).

metabolic pathways signaling pathways gene regulation pathways

Page 133: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Biochemical and Metabolic Pathways

Böhringer Mannheim

Page 134: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

KEGG

Kyoto Encylopedia of Genes and Genomes http://www.genome.jp/kegg/

Mapping of gene expression Updated information ($$)

1. Metabolism

2. Genetic Information Processing

3. Environmental Information Processing

4. Cellular Processes

Page 135: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Reactome (http://www.reactome.org)

BioCyc (HumanCyC, EcoCyc,..)

Biocarta (http://www.biocarta.com/genes)

PANTHER pathways (http://www.pantherdb.org)

Wiki Pathways (http://www.wikipathways.org)

Pathway databases

Page 136: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Hannah, Weinberg, Cell. 2000

Signaling pathways

Pathway interaction database (http://pid.nci.nih.gov)137 NCI curated pathway maps

Signal Transduction Knowledge Environmenthttp://stke.sciencemag.org/cm/

cancer cells

Page 137: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Pathway (and network) analysis

De novo network construction co-expression network reversed engineering

Pathway analysis over representation mapping gene expression to pathway

Interaction network (direct or indirect interactions) from protein-protein interaction databases (HPRD, IntAct) known and predicted functional association (STRING)

Page 138: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Pathway Commons

Aim: convenient access to pathway information Facilitate creation and communication of pathway data Aggregate pathway data in the public domain Provide easy access for pathway analysis

Page 139: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Cytoscape

Access pathway commons from cytoscape

http://www.cytoscape.org

Open source software for networkvisualization

Active community

>40 plugins (apps) extend functionality

e.g. Bingo, ClueGO

Easy to use and good documentation

Cline MS et al Nat Protoc. 2007

Page 140: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Other pathway tools

Metacore (GeneGO) (Thomson Reuters) ($$)

Ingenuity pathway analysis (IPA) (QIAGEN) ($$)

Pathway Studio (Ariadne) ($$)

GenMAPP (Mapping gene expression onto pathways)

PathVisio (Drawing pathways)

Page 141: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Box-and-whiskers plot

Page 142: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Measures of variability

Page 143: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Measures of variability

Page 144: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Survival analysis

Page 145: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Censored data

Page 146: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Survival function

Page 147: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Kaplan-Meier survival curves

Page 148: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Kaplan-Meier estimator

Page 149: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Comparison of Kaplan-Meier curves

Page 150: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Log-rank test

Page 151: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Log-rank test

Page 152: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Hazard ratio (HR)

Page 153: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Hazard function

Page 154: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Relative risk and proportional hazard model

Page 155: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Cox regression

Page 156: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

Cox regression

Page 157: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

LUAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt

get_matrix_tn_rawcounts.R

get_matrix_tn_TPM.R

get_matrix_prim_TPM.R

deseq2_tn.R results_tn.txtLUAD_MATRIX_58TN_COUNTS.txt

LUAD_MATRIX_58TN_TPM.txt

LUAD_MATRIX_PRIM_TPM.txt

Firebrowse

LUAD_MATRIX_PRIM_LOG2TPM1.txtget_matrix_prim_LOG2TPM1.R

diff_expressed_genes.xlsx

Analyses

Gene ontology

Pathways

Page 158: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

boxplot_surv.R

LUAD_CLIN.txt

LUAD_MATRIX_PRIM_TPM.txt

results_tn.txt

LUAD_MATRIX_58TN_TPM.txt

Firebrowseluad_boxplot_os_genes.pdf

LUAD_IMMUNE.txt Heatmap,

Clustered data

RELAPSE.rnkEnrichment plots

NES, q-value (FDR)gensets.gmx

Analyses

Page 159: Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning Description R A or G puRine Y C or T pYrimidine W A or T Weak hydrogen bonds S G or

get_deg.REXPR_GSE67501.txtGEO

GSE67501

Analyses

normalized_data.txt

GEO

GSE51373

GSE51373_RAW.tarGSM1243877_U133Plus2_S1351.CEL.gz

GSM1243904_U133Plus2_RD00443.CEL.gz

get_eset.Rtarget.txt

get_deg_S_R.R

deg_S_R_GSE51373.txtdeg_S_R_GSE51373_UNIQUE.txt

diff_expressed_genes.xlsx

ROC.R

deg_GSE67501.txt