Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning...

Post on 23-Aug-2020

0 views 0 download

Transcript of Bioinformatics I (KF) VU 041035 WS2019 //icbi.i-med.ac.at/mo/Notes_WS2019.pdf · Symbol Meaning...

Bioinformatics I (KF) VU 041035 WS2019 http://icbi.at/mo

Hubert Hackl

Biocenter, Division of Bioinformatics, Medical University of Innsbruck,Innrain 80, 6020 Innsbruck, AustriaTel: +43-512-9003-71403, Email: hubert.hackl@i-med.ac.at, URL: http://icbi.at

I. Predictive methods on protein sequences (Protein domains, Signal P, NetMHCPan)

II. Sequence alignment and databases (BLAST, NCBI, TCGA, Firebrowse, TCIA, GEO, ArrayExpress, IntOgen, cBioPortal)

III. Differentially expressed genes (microarrays, RNAseq) (R/Bioconductorsoftware packages rma, limma, DEseq2)

IV. Expression profiling and clustering (Genesis)

V. Gene ontology, Pathway analysis (DAVID, KEGG, Reactome, ConsensusPathDB, ClueGO,Enrichr)

VI. Network analysis (Cytoscape)

VII. Gene set enrichment analysis (GSEA), Deconvolution

VIII. Predictive and prognostic marker (signatures) (logistic regression, survival analysis)

Contents

Symbol Meaning Description

R A or G puRineY C or T pYrimidineW A or T Weak hydrogen bondsS G or C Strong hydrogen bondsM A or C aMino groupsK G or T Keto groupsH A, C, or T (U) not G, (H follows G)B G, C, or T (U) not A, (B follows A)V G, A, or C not T (U), (V follows U)D G, A, or T (U) not C, (D follows C)N G, A, C or T (U) aNy nucleotide

Nomenclature of nuclein acids

Base Symbol Occurrence

Adenin A DNA, RNAGuanin G DNA, RNACytosin C DNA, RNAThymin T DNAUracil U RNA

+ strand 5´-ACGGTCGCTGTCGGTAGC-3´- strand 3´-TGCCAGCGACAGCCATCG-5´

e.g. in fasta format : >gene sequence|gi12345|chr17|-

GCTACCGACAGCGACCGT

DNA sequences are always from 5‘ to 3‘

Positions in the genome (genome assembly) are chromosome wise

e.g. human GRCh37/hg19

chr11:1-100 chr11:49,686,777-49,689,777

Positions in the chromosome start for both!! strands from position 1

+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´- strand 3´-TGCCAGCGAC…………AGCCATCG-5´

chr11:1 2523 2529

chr11:1 2523 2529

Nomenclature

Translation, genetic code and reading frames

Peptid chain, amino acid sequence, proteins

Protein sequences are always form N-terminal end to C-terminal end

backbone

sidechains

E.g.. SCD sequence in fasta format

Protein Sequence Analysis

Shared ancestry?Similar function?Domain or complete sequence?Are functional sequences conserved?

Homology

Searches

Homology searches

Sequence alignment

BLAST, FASTA

Profile

Analysis

Profile Analysis

Uses collective characteristics of a family of proteins

Position specific score matrix (PSSM)

Profile HMM

ProfileScan, Pfam, CDD, Prosite, BLOCKS

PSI-Blast

Protein Sequence Analysis

Amino Acid Composition

Hydrophobicity

Charge

Physical

Properties

Protein Sequence Analysis

Secondary structure

Specialized structures

Tertiary structure

Structural

Properties

Substitutions matrices

• Unrelated or random model assumes that letter a occursindependently with some frequency qa.

P(x,y|R) = qxiqxj

• The alternative match model of aligned pairs of residuesoccurs with a joint probability pab.

P(x,y|M) = pxi yi

• Odds ratio

P(x,y|M) pxi yi pxi yi

P(x,y|R) qxi qyj qxi qyj= =

Summary of substitutions matrices

• Triple-PAM strategy (Altschul, 1991)

– PAM 40 short alignments, highly similar– PAM 120 – PAM 250 longer, weaker local alignments

• BLOSUM (Henikoff, 1993)

– BLOSUM 90 short alignments, highly similar– BLOSUM 62 most effective in detecting known

members of a protein family– BLOSUM 30 longer, weaker local alignments

• No single matrix is the complete answer for all sequencecomparisons

• Programs like BLAST usually have default matrices!

Database search

FASTA

• Use hash table of short words of the database (DB) sequence and query sequence (2-6 chars)

• For words in query sequence, find similar words in DB using (fast) hash table lookup, and compute

R = position(query) – position (DB).

Areas of long match will show same R for many words.

• Score matching segments based on content of these matches. Extend the good matches empirically.

High-scoring segment pairs

Significance of scores

The number of unrelated matches with score greater than S is approximately Poisson distributed with mean

E(S)=Kmne-λS

where λ is a scaling factor m and n are the length of the sequences

The probability that there is a match of score greater than S follows a extreme value distribution:

Karlin S, Altschul S. Proc Natl Acad Sci (1990)

P(x>S)=1-e-E(S)

NCBI Blast

Program Query sequence Subject sequence

BLASTN Nucleotide Nucleotide

BLASTP Protein Protein

BLASTX Nucleotidesix-frame translation

Protein

TBLASTN Protein Nucleotidesix-frame translation

TBLASTX Nucleotidesix-frame translation

Nucleotidesix-frame translation

NCBI Blast Example

Blast Results

alignment

Be

st h

it

description

graphicalvisualization

E-value

Score (S)

conserved domaindatabase (CDD)

Multiple sequence alignment (Clustal W)

Hbb_Human -

Hbb_Horse .17 -

Hba_Human .59 .60 -

Hba_Horse .59 .59 .13 -

Myg_Whale .77 .77 .75 .75 -

Hbb_Human

Hbb_Horse

Hba_Human

Hba_Horse

Myg_Whale

1 PEEKSAVTALWGKVN--VDEVGG

2 GEEKAAVLALWDKVN--EEEVGG

3 PADKTNVKAAWGKVGAHAGEYGA

4 AADKTNVKAAWSKVGGHAGEYGA

5 EHEWQLVLHVWAKVEADVAGHGQ

Rooted neighbor-joining tree(guide tree) and sequence weights

Progressive alignmentfollowing guide tree

Pairwise alignmentcalculate distance matrix

1

23

4

1

23

4

BLOCKS

Steve Henikoff, FHCRC ,Seattle

Multiple alignments of conserved regions in protein families

– 1 block = 1 short, ungapped multiple alignment

– Families can be defined by one or more blocks

– Searches allow detection of one or more blocks representing a family

– http://blocks.fhcrc.org/

Profile Construction

20

1b

f(p,b) = frequency of amino acid b in position ps(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)

PSSM(p,a) = f(p,b)*s(a,b)

PSI-BLAST

• Position-Specific Iterated BLAST search

• Used to identify distantly related sequences that are possibly missed during a standard BLAST search

• Easy-to-use version of a profile-based search Perform BLAST search against protein database Use results to calculate a position-specific scoring

matrix PSSM replaces query for next round of searches May be iterated until no new significant alignments

are found

Altschul et al., Nucleic Acids Res. 25: 3389-3402, 1997

DELTA-BLAST

• Method different from that used by PSI-BLAST

Step 1: Align the query against conserved domains derived from CDD

Step 2: Compute PSSMStep 3: Search sequence databases using PSSM as

the query

• Produces high-quality alignments, even at low levels of sequence similarity

• Dependent on homologous relationships captured within CDD

Markov chains

Markov chains: a sequence of events that occur one after another. The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events.

Scoring sequences (e.g. start codon ATG)3 states (S1, S2, S3), p(A)=p(C)=p(G)=p(T)=0.25

A T G

p(A)=0.91p(C)=0.03p(G)=0.03p(T)=0.03

p(A)=0.03p(C)=0.03p(G)=0.03p(T)=0.91

p(A)=0.03p(C)=0.03p(G)=0.91p(T)=0.03

Markov chain 0th orderp(ATG)=0.913=0.752

Markov chain 1th orderp(ATG)=p(A)*p(T|A)*p(G|T)

S1 S2 S3

Hidden Markov Model (HMM)

- Example exon-intron border- 3 states: exon(E), 5‘SS (5), intron (I)

Eddy SR, Nat Biotech 2004

Emission probabilities

Hidden, want to infer(π)(hidden Markov chain)

Given (S)

log P(S,π|HMM,Θ)=log(1*0.2518*0.917*0.1*0.95*1.0*0.4*0.9*0.4*0.9*0.4*0.9*0.1*0.9*0.4*0.9*0.1*0.9*0.4*0.1)

Find best state path(highest score)

If man possible paths thanuse efficient Viterbi algorithm(based on dynamic programing)

G T A A G T C A

HMM parameters (Θ)

Profile Hidden Markov Model

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

P(A)=0.8

P(C)=0.0

P(G)=0.0

P(T)=0.2

P(A)=0.8

P(C)=0.2

P(G)=0.0

P(T)=0.0

P(A)=1.0

P(C)=0.0

P(G)=0.0

P(T)=0.0

P(A)=0.0

P(C)=0.0

P(G)=0.2

P(T)=0.8

P(A)=0.0

P(C)=0.8

P(G)=0.2

P(T)=0.0

P(A)=0.0

P(C)=0.8

P(G)=0.2

P(T)=0.0

P(A)=0.2

P(C)=0.4

P(G)=0.2

P(T)=0.2

1.0

0.4

1.0 0.4

0.6

0.6

1.0 1.01.0

[AT][CG][AC][ACGT]*A[TG][GC]

Regular Expressions

p(ACACATC)=0.8*1*0.8*1*0.8*0.6 *0.4*0.6*1*1*0.8*1*0.8=0.047log-odds=log(p(S)/0.25L)=log(0.047/0.257)

insertion state

- For multiple alignments (e.g. DNA sequences)

Profile Hidden Markov Model

Main states (gray)

Allows position dependent gap penalties Can be obtained from a multiple alignment (DNA or Protein) Can be used for searching a database for other members

of the family

Insert states to model highlyvariable regions in the alignemnt

Delete (silent, null) states

Avoid overfitting by using pseudocounts (e.g. add 1 to all counts)

Insert states

Neuronal network for secondary structure prediction

PredictProtein

Multi-step predictive algorithm (Rost et al., 1994)

Protein sequence queried against SWISS-PROTMaxHom used to generate iterative, profile-basedMultiple sequence alignment (Sander and Schneider,1991)

Multiple alignment fed into neural network (PHDsec)

Accuracy: Average > 70%, Best-case > 90%

http://www.predictprotein.org/

Prediction of protein function

Frank Eisenhaber et al.

Annotator

SignalP

• Neural network trained based on phylogeny– Gram-negative prokaryotic– Gram-positive prokaryotic– Eukaryotic

• Predicts secretory signal peptides• http://www.cbs.dtu.dk/services/SignalP/

Signal peptide score (S)

Cleavage site score (C)

Combined Score (Y)

Insulin

Proinsulin

DNA sequence(AY242110)

1. extract cds of insulin

2. extract mRNA seq of insulin

fasta file

blastn

4. extract protein sequence of insulin

NCBI Geneusing Refseq ID

fasta file

5. extract mRNA seq in human

Notepad++

Notepad++

NCBI Nucleotide (Genbank)

blastn

NCBI Geneusing Refseq ID

6. extract protein sequence of insulin

fasta file Notepad++

7. genomic location and adjacent genes

NCBI Geneusing Refseq ID

8. compare protein sequences 9. find signal peptide

blastp (align two sequences)Signal P server

Sus scrofa (pig) Homo sapiens (human)

Find difference between insulin sequence in pig and human

Signal peptide B-chain (30 AA)

C-peptide A-chain (21 AA)

Difference between pig and human insulin = 1AA

A-kinase anchoring proteins (AKAPs) binding to the regulatory subunit of protein kinase A (PKA)

Protter

GPR161

Protein secondary structure prediction (Jpred)

Amphipatic helix

Heliquest

xx

Protein structure

Predictive biomarker

Clonal evolution

Tumor heterogenity

Antigen PredictionsFunctional impact

Combinations Pathways

variants

Recurrent mutations

Population

Analyses and interpretation of DNA variants

x

Somatic mutation detection in tumor samples

Raphael et al. Genome Med. 2014

MuTect 2 (GATK)

MutSig (MutSigCV)

Lawrence, M. et al. Nature 2013

MutSig builds a model of the background mutation processes (BMR) that were at work during formation of the tumors, and it analyzes the mutations of each gene to identify genes that were mutated more often than expected by chance, given the background model.MutSigCV (CV for 'covariate') improves the BMR estimation by pooling data from 'neighbor' genes with similar genomic properties such as DNA replication time, chromatin state (open/closed), and general level of transcription activity.

Ensembl Variant Effect Predictor (VEP)

Oncotator

http://www.broadinstitute.org/oncotator/

chr1 43815005 43815005 A AT

chr1 43815008 43815008 TG T

chr1 43815013 43815013 G GT

chr1 43815036 43815036 G GCC

chr1 43815041 43815041 C G

chr1 43815047 43815047 G GC

chr1 43815049 43815049 A AG

chr1 115252237 115252237 T TG

jurkat_mutations.vcf

Oncotator

MutationAssessor

2,212288939,C,G

Chr Pos RefAll AltAll

https://www.broadinstitute.org/cancer/cga/mutsig_run (see key.txt)

MutSigCV

exome_full192_coverage.txt

gene_covariates.txt

Mutation_type_dictionary_file.txt

hg19

dlcbl_prim.maf

Intratumor heterogenity and clonal evolution

Cancer-Immunity cycle

Hackl et al. Nat Rev Genet 17: 441-458 (2016)

Cancer-Immunity cycle

Hackl et al. Nat Rev Genet 17: 441-458 (2016)

NetMHCpan

http://www.cbs.dtu.dk/services/NetMHCpan/

IntOGen

Driver genes

Functional Impact

Recurrence

Cancers arise due to alterations in genes that confer growth advantage to the cell . More than 400 such ‘cancer genes’, identified to date are currently annotated in the Cancer Gene Census.

Cancer driver mutations (cancer drivers) versus passenger mutations can beidentified based on:

Driver signals

Clustered mutations (OncodriveCLUST)

Functional Mutations (OncodriveFM)

Recurrent Mutations (MutSigCV)

Mutation frequency per cancertype

No. of mutated samples in cancer type

No. of protein affecting mutated samples in cancer type (PAM)

Mutation distribution along protein sequence

Protein domains

No. and position of mutationof protein affecting mutatedsamples in cancer type (PAM)

Different transcripts

IntOgen (Integrative Onco Genomics)

What is the most common BRAF mutation

In which cancer types IDH1 is a cancer driver and in whichcancer type mutation of IDH1 is most frequent

Most common drivers in breast carcinoma

Mutation frequency of VHL

Examples

TCGA

International Cancer Genome Consortium (ICGC)

International Cancer Genome Consortium (ICGC)

The International Cancer Genome Consortium (ICGC)was launched to coordinate large-scale cancer genomestudies in tumours from 50 different cancer typesand/or subtypes that are of clinical and societalimportance across the globe.

Systematic studies of more than 25,000 cancer genomesat the genomic, epigenomic and transcriptomic levelswill reveal the repertoire of oncogenic mutations,uncover traces of the mutagenic influences, defineclinically relevant subtypes for prognosis and therapeuticmanagement, and enable the development of newcancer therapies.

TCGA

TCGA

TCGA is supervised by the National Cancer Institute's Center for Cancer Genomics and the National Human Genome Research Institute funded by the US government.

3 year pilot project on 3 cancer types (starting 2006)

Phase II 33 different cancer types by 2014 including 10 rare cancers by 2014

Characterization centers (GCCs) for sequencing.

Genome data analysis centers (GDACs) for bioinformatic analyses.

It is also part of the The International Cancer Genome Consortium (ICGC)

Currently 34 cancer types >11000 samples

TCGA

Biospecimen Core Resource (BCR)Collect and process tissue samples

Genome Sequencing Centers (GSCs)Use high-throughput Genome Sequencing to identify the changes in DNA sequences in cancer

Genome Characterization Centers (GCCs)Analyze genomic and epigenomic changes involved in cancer

Data Coordinating Center (DCC)The TCGA data are centrally managed at the DCC

Genome Data Analysis Centers (GDACs)These centers provide informatics tools to facilitate broader use of TCGA data.

TCGA member map

TCGA

TCGA

https://wiki.nci.nih.gov/display/TCGA/TCGA+Wiki+Home

All information, tutorials, FAQs etc. can be found here:

TCGA Data Portal has changed to Genomic Data Commons (GDC)

https://gdc-docs.nci.nih.gov/Data_Portal/Users_Guide/Getting_Started/

TCGA barcodes

Tissue source siteBiospecimen core resource

Universally Unique Identifiers (UUIDs) have replaced TCGA barcodes

Sample

TCGA data levels

Sequencing data (fastq, BAM) (data level 1) is control accessed at Cancer Genomics Hub (CGHub)

Genomic Data Commons

Genomic Data Commons

Genomic Data Commons

Genomic Data Commons

Genomic Data Commons

GDC data transfer tool

https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

unzip

Download using manifest

>gdc-client download –m gdc_manifest_20161212_140030.txt

Firebrowse

Download RNAseqV2 with Firebrowse

~_Level_3__RSEM_genes__data.data.txt

Download clinical data with Firebrowse

Selected parameter

All clinical andsample

information

Download clinical data with Firebrowse

Patients

Sele

cted

clin

ical

par

amet

er

Somatic mutations (open access)

https://www.biostars.org/p/69222/

Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs.

manually by experts automated scripts filters based on the extensive domain knowledge of said experts.

So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time

In some tumor-types, and for some subsets of samples, mutations called from the first-pass of sequencing (usually exome-seq), are targeted by custom capture arrays, and re-sequenced. This results in much higher read-depths for the purpose of validating first-pass calls, for more accurate variant allele fractions (VAFs), or for finding calls that were missed.

Mutation Annotation Format (MAF) Specification

1 Hugo_Symbol IDH12 Entrez_Gene_Id 03 Center genome.wustl.edu4 NCBI_Build 375 Chromosome 26 Start_Position 2091131127 End_Position 2091131128 Strand +9 Variant_Classification Missense_Mutation

10 Variant_Type SNP11 Reference_Allele C12 Tumor_Seq_Allele1 C13 Tumor_Seq_Allele2 T14 dbSNP_RS rs12191350015 dbSNP_Val_Status unknown16 Tumor_Sample_Barcode TCGA-AB-2802-03B-01W-0728-0817 Matched_Norm_Sample_Barcode TCGA-AB-2802-11B-01W-0728-0818 Match_Norm_Seq_Allele1 C19 Match_Norm_Seq_Allele2 C20 Tumor_Validation_Allele121 Tumor_Validation_Allele222 Match_Norm_Validation_Allele1 C23 Match_Norm_Validation_Allele2 C24 Verification_Status Verified25 Validation_Status Valid26 Mutation_Status Somatic27 Sequencing_Phase Phase_IV28 Sequence_Source Capture29 Validation_Method30 Score 131 BAM_File dbGAP32 Sequencer Illumina HiSeq

Should matchwith otheranalyses(=> lift-over)

Remove forOncotatoranalysis

MAF files from Firehose

curated

somatic

Broad (center)

COAD (cancer type)

https://confluence.broadinstitute.org/display/GDAC/MAF+Dashboard

24.0 MB (size)

https://www.biostars.org/p/69222/

Best available MAFs The Ding lab at TGI, WashU

Gene Expression Omnibus (GEO)

Raw data (cel files)

Expression matrix(normalized data)

Platform (microarray)(normalized data)

Sample data

cBioPortal

data from 105 cancer genomics studies

TCGA and other studies

Query and download

Many different analyses options

TCIA

The Cancer Immunome Atlas

Charoentong et al. Cell Rep. 2017. 18:248-262

Charoentong et al. Cell Rep. 2017. 18:248-262

TCIA (Immunophenogram)

TCIA (Tumor infiltrated lymphocytes)

TCIA (Gene expression)

TCIA (Survival analysis)

GPViz

http://icbi.at/software/gpviz/gpviz.shtml

Do not update JAVA!

Refseq_hg19.gtf

oncotator.maf

LAML.maf

Refseq_CDD.txt

GPViz

http://icbi.at/software/gpviz/gpviz.shtml Snajder et al, Bioinformatics 2013

RNAseq

e.g. 20 mio

0. Image analysis and base calling (Phred quality score)

=> FastQ files (sequence and corresponding quality levels)

1. Trimming adaptors and low quality reads2. Read mapping (Spliced alignment) (Tophat, STAR)

=> SAM/BAM files

3. Transcriptome reconstruction (reference transcriptome, GTF file)

4. Expression quantification (transcript isoforms) (HTseq, Cufflinks)-count reads-normalization

4. Differential expression analysis (negative-binomial test)(DESeq, edgeR, CuffDiff)

Analysis steps

Normalization

Reads per kilobase per million reads (RPKM)

Fragments per kilobase per million (FPKM) for paired-end seq.

Quantile normalization (upper quantile normalization)

TMM (trimmed mean of M values)

TCGA

RNAseqV2 analysis MapSplice is used to do the alignment and RSEM to perform the quantitation.

raw_count … for EBseq introduce directlyfor DESeq2, EdgeR use integers

scaled_estimate … transcript per million TPM=scaled estimate*106

normalized_count …. upper quartile normalized RSEM count estimates

Isoform quantification

Uncertainy in assigning reads to isoforms Paired-end sequencing Spliced alignment Alternative splicing (statistical significant?)

Representation of gene expression

• n x m matrix with n genes and m experiments(conditions, patient samples)

• Representation as heatmap (e.g. red upregulatedgenes, green down regulated genes, black nochange)

• For experiments in reference design:

log2-fold change (log2FC, log2(A/B), log2 ratio)

• For patient samples and no reference:

mean centered log2-levels for each gene

log2-intensities for one-color arrayslog2-RPKM for RNAseq

z-score of log2-levels

Z= (X-m)/s X…log2-levels, m…mean,s…standard deviation

gen

es

experiments (patients)

heatmap

Clustering

• Unsupervised or supervised (classification)

• AgglomerativeBottom up approach, whereby single expression

profiles are successively joined to form nodes.

• DivisiveTop down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.

• Pearson correlation (hierarchical clustering)

• Euclidian distance (k-means clustering)

• Manhattan distance• Spearman rank correlation• Mutual information….

Similarity (distance) measures

-1 r 1

Hierarchical clustering

• Agglomerative (bottom up), unsupervized• Cluster genes or samples (or both= biclustering)• Distances are encoded in dendogram (tree)• Cut tree to get clusters• Pearson correlation (usually used)• Computational intensive (correlation matrix)

1. Identify clusters (items) with closest distance2. Join to new clusters3. Compute distance between clusters (items) (see linkage)4. Return to step 1

Dendrogram

6 cluster

15 cluster

Indicates similarity

Linkage

Single-linkage clusteringMinimal distance

Complete-linkage clusteringMaximal distance

Average-linkage clusteringCalculated using average distance (UPGMA)Average from distances not! expression values

Weighted pair-group averageLike UPGMA but weighted according cluster size

Within-groups clusteringAverage of merged cluster is used instead of cluster elements

Ward’s methodSmallest possible increase in the sum of squared errors

• partition n genes into k clusters, where k has to be predetermined

• k-means clustering minimizes the variability within and maximize between clusters

• Moderate memory and time consumption

K-means clustering

1. Generate random points (“cluster centers”) in n dimensions (results are depending on these seeds).

2.Compute distance of each data point to each of the cluster centers.

3.Assign each data point to the closest cluster center.

4.Compute new cluster center position as average of points assigned.

5.Loop to (2), stop when cluster centers do not move very much.

Principal component analysis (PCA)

PCA is a data reduction technique that allows to simplify multidimensional data sets into smaller number of dimensions (r<n).

Variables are summarized by a linear combination to the principal components. The origin of coordinate system is centered to the center of the data (mean centering) . The coordinate system is then rotated to a maximum of the variance in the first axis.

Subsequent principal components are orthogonal to the 1st PC. With the first 2 PCs usually 80-90% of the variance can already be explained.

This analysis can be done by a special matrix decomposition (singular value decomposition SVD).

Singular value decomposition (SVD)

X = USVT with UUT = VTV = VVT = I

For mean centered data the Covariance matrix C can be calculated by XXT. U are eigenvectors of XXT and the eigenvalues are in the diagonal of S defined by the characteristic equation |C – λI | = 0.

Transformation of the input vectors into the principal component space can be described by Y = XU where the projection of sample i along the axis is defined by the j-th PC:

Biological meaning of the gene sets

?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the sametranscription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis

Gene set enrichment analysis

Subramanian A et al. Proc Natl Acad Sci (2005)

Gene set enrichment analysis

1. Given an a priori defined set of genes.

2. Rank genes (e.g. by t-value between 2 groups of microarray samples) ranked gene list L.

3. Calculation of an enrichment score (ES) that reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

4. Estimation the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure.

5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).

http://www.broadinstitute.org/gseahttp://www.broadinstitute.org/cancer/software/gsea/wiki/

Gene Ontology (GO)

The three organizing principles (categories) of GO are

cellular component

biological process

molecular function

The Gene Ontology project (http://geneontology.org) provides a controlled vocabulary to describe gene and gene product attributes in any organism.

mitochondrium

cell cycle

isomerase activity

Termtranscription initiation

IDGO:0006352

DefinitionProcesses involved in starting transcription, where transcription is the synthesis of RNA by RNA polymerases using a DNA template.

What’s in a GO term?

Parent /child relation in directed acyclic graph (DAG)

2 relations:part_of

is_a

different levels

more specific

less specific

biological_process

Gene Ontology Browser (Amigo2)

http://amigo2.geneontology.org (http://geneontology.org/)

Inferred tree view

Term information Annotation

...

ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non-traceable Author Statement

IC Inferred by Curator

ND No biological Data available

Evidence code for GO annotations

Over representation analysis

m

g

gene universe (whole microarray)

GO term

ci

genes in cluster(gene list)

all genes with GO term

genes in clusterwith GO term

m-g c-i

g i

contingency table

Over representation analysis

Fisher exact test for contingency table

Hypergeometric distribution

Multiple hypothesis testing => adjust p-value

Not only for GO Terms also for TFBS, pathways,..

p =

5010

1000-5030-10

100030

m-g c-i

g i

50 redballs of1000balls

20x

draw 30x

10x

c=30 genes

m=1000genes

g=50 genes (GO) i=20 genes (GO)

DAVID

Database for Annotation, Visualization and Integrated Discovery https://david.ncifcrf.gov Functional annotation tool (over representation analysis)

Dnajb1

Wnt11

Sorbs3

D230025D16Rik

Sfxn3

Hspa5

Golga3

Hgs

Npc1

Mta2

Cnn2

Spg20

Zpr1…

1019 mousegene symbols

Pathways

Canonical Pathways are idealized or generalized pathways that represent common properties of a particular signaling module or pathway

A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell.Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move (genome.gov/27530687).

metabolic pathways signaling pathways gene regulation pathways

Biochemical and Metabolic Pathways

Böhringer Mannheim

KEGG

Kyoto Encylopedia of Genes and Genomes http://www.genome.jp/kegg/

Mapping of gene expression Updated information ($$)

1. Metabolism

2. Genetic Information Processing

3. Environmental Information Processing

4. Cellular Processes

Reactome (http://www.reactome.org)

BioCyc (HumanCyC, EcoCyc,..)

Biocarta (http://www.biocarta.com/genes)

PANTHER pathways (http://www.pantherdb.org)

Wiki Pathways (http://www.wikipathways.org)

Pathway databases

Hannah, Weinberg, Cell. 2000

Signaling pathways

Pathway interaction database (http://pid.nci.nih.gov)137 NCI curated pathway maps

Signal Transduction Knowledge Environmenthttp://stke.sciencemag.org/cm/

cancer cells

Pathway (and network) analysis

De novo network construction co-expression network reversed engineering

Pathway analysis over representation mapping gene expression to pathway

Interaction network (direct or indirect interactions) from protein-protein interaction databases (HPRD, IntAct) known and predicted functional association (STRING)

Pathway Commons

Aim: convenient access to pathway information Facilitate creation and communication of pathway data Aggregate pathway data in the public domain Provide easy access for pathway analysis

Cytoscape

Access pathway commons from cytoscape

http://www.cytoscape.org

Open source software for networkvisualization

Active community

>40 plugins (apps) extend functionality

e.g. Bingo, ClueGO

Easy to use and good documentation

Cline MS et al Nat Protoc. 2007

Other pathway tools

Metacore (GeneGO) (Thomson Reuters) ($$)

Ingenuity pathway analysis (IPA) (QIAGEN) ($$)

Pathway Studio (Ariadne) ($$)

GenMAPP (Mapping gene expression onto pathways)

PathVisio (Drawing pathways)

Box-and-whiskers plot

Measures of variability

Measures of variability

Survival analysis

Censored data

Survival function

Kaplan-Meier survival curves

Kaplan-Meier estimator

Comparison of Kaplan-Meier curves

Log-rank test

Log-rank test

Hazard ratio (HR)

Hazard function

Relative risk and proportional hazard model

Cox regression

Cox regression

LUAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt

get_matrix_tn_rawcounts.R

get_matrix_tn_TPM.R

get_matrix_prim_TPM.R

deseq2_tn.R results_tn.txtLUAD_MATRIX_58TN_COUNTS.txt

LUAD_MATRIX_58TN_TPM.txt

LUAD_MATRIX_PRIM_TPM.txt

Firebrowse

LUAD_MATRIX_PRIM_LOG2TPM1.txtget_matrix_prim_LOG2TPM1.R

diff_expressed_genes.xlsx

Analyses

Gene ontology

Pathways

boxplot_surv.R

LUAD_CLIN.txt

LUAD_MATRIX_PRIM_TPM.txt

results_tn.txt

LUAD_MATRIX_58TN_TPM.txt

Firebrowseluad_boxplot_os_genes.pdf

LUAD_IMMUNE.txt Heatmap,

Clustered data

RELAPSE.rnkEnrichment plots

NES, q-value (FDR)gensets.gmx

Analyses

get_deg.REXPR_GSE67501.txtGEO

GSE67501

Analyses

normalized_data.txt

GEO

GSE51373

GSE51373_RAW.tarGSM1243877_U133Plus2_S1351.CEL.gz

GSM1243904_U133Plus2_RD00443.CEL.gz

get_eset.Rtarget.txt

get_deg_S_R.R

deg_S_R_GSE51373.txtdeg_S_R_GSE51373_UNIQUE.txt

diff_expressed_genes.xlsx

ROC.R

deg_GSE67501.txt