Bioinformatics I (KF) VU 041035 WS2019 http://icbi.at/mo
Hubert Hackl
Biocenter, Division of Bioinformatics, Medical University of Innsbruck, Innrain 80, 6020 Innsbruck, Austria
Tel: +43-512-9003-71403, Email: [email protected], URL: http://icbi.at
I. Predictive methods on protein sequences (Protein domains, Signal P, NetMHCPan)
II. Sequence alignment and databases (BLAST, NCBI, TCGA, Firebrowse, TCIA, GEO, ArrayExpress, IntOgen, cBioPortal)
III. Differentially expressed genes (microarrays, RNAseq) (R/Bioconductor software packages rma, limma, DESeq2)
IV. Expression profiling and clustering (Genesis)
V. Gene ontology, Pathway analysis (DAVID, KEGG, Reactome, ConsensusPathDB, ClueGO, Enrichr)
VI. Network analysis (Cytoscape)
VII. Gene set enrichment analysis (GSEA), Deconvolution
VIII. Predictive and prognostic marker (signatures) (logistic regression, survival analysis)
Contents
Symbol  Meaning            Description
R       A or G             puRine
Y       C or T             pYrimidine
W       A or T             Weak hydrogen bonds
S       G or C             Strong hydrogen bonds
M       A or C             aMino groups
K       G or T             Keto groups
H       A, C, or T (U)     not G (H follows G)
B       G, C, or T (U)     not A (B follows A)
V       G, A, or C         not T (U) (V follows U)
D       G, A, or T (U)     not C (D follows C)
N       G, A, C or T (U)   aNy nucleotide
Nomenclature of nucleic acids
Base      Symbol  Occurrence
Adenine   A       DNA, RNA
Guanine   G       DNA, RNA
Cytosine  C       DNA, RNA
Thymine   T       DNA
Uracil    U       RNA
+ strand 5´-ACGGTCGCTGTCGGTAGC-3´
- strand 3´-TGCCAGCGACAGCCATCG-5´
e.g. in FASTA format: >gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT
DNA sequences are always written from 5' to 3'
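The FASTA record above holds the minus strand, written 5' to 3', which is the reverse complement of the plus strand. A minimal Python sketch of that conversion follows; the function name `reverse_complement` is introduced here for illustration only, and ambiguity codes such as R or N are not handled.

```python
# Translation table for the four unambiguous bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of `seq`, read 5' to 3'."""
    # Complement every base, then reverse so the result runs 5' -> 3'.
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGGTCGCTGTCGGTAGC"  # + strand from the slide
minus_strand = reverse_complement(plus_strand)
print(minus_strand)  # GCTACCGACAGCGACCGT, the sequence in the FASTA record
```

Applying the function twice returns the original sequence, which is a quick sanity check for any strand conversion.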
Positions in the genome (genome assembly) are chromosome wise
e.g. human GRCh37/hg19
chr11:1-100 chr11:49,686,777-49,689,777
Positions on the chromosome are counted from position 1 for both strands, i.e. both strands share the same coordinates
+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´
- strand 3´-TGCCAGCGAC…………AGCCATCG-5´
chr11:1 2523 2529
chr11:1 2523 2529
Nomenclature
Translation, genetic code and reading frames
Peptide chain, amino acid sequence, proteins
Protein sequences are always written from the N-terminal end to the C-terminal end
backbone
sidechains
E.g. SCD sequence in FASTA format
Protein Sequence Analysis
Shared ancestry?
Similar function?
Domain or complete sequence?
Are functional sequences conserved?
Homology searches
Sequence alignment
BLAST, FASTA
Profile Analysis
Uses collective characteristics of a family of proteins
Position specific score matrix (PSSM)
Profile HMM
ProfileScan, Pfam, CDD, Prosite, BLOCKS
PSI-Blast
Protein Sequence Analysis
Amino Acid Composition
Hydrophobicity
Charge
Physical properties
Protein Sequence Analysis
Secondary structure
Specialized structures
Tertiary structure
Structural properties
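Substitution matrices
• The unrelated (random) model R assumes that each letter a occurs independently with some frequency $q_a$:
$$P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
• The alternative match model M assumes that aligned pairs of residues occur with a joint probability $p_{ab}$:
$$P(x,y \mid M) = \prod_i p_{x_i y_i}$$
• Odds ratio:
$$\frac{P(x,y \mid M)}{P(x,y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$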
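Taking the logarithm of the odds ratio turns the product over aligned positions into a sum of per-pair scores, which is what a substitution matrix stores. The Python sketch below illustrates this for an ungapped alignment. The two-letter alphabet and all frequencies are made-up toy values, not taken from PAM or BLOSUM, and the function names are introduced here for illustration.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float, base: float = 2.0) -> float:
    """Log-odds score of one aligned pair: log(p_ab / (q_a * q_b))."""
    return math.log(p_ab / (q_a * q_b), base)


def alignment_score(x: str, y: str, p: dict, q: dict) -> float:
    """Sum of per-position log-odds scores for an ungapped alignment of x and y."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(log_odds_score(p[(a, b)], q[a], q[b]) for a, b in zip(x, y))


# Toy model (assumed values): background frequencies q and joint pair frequencies p.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

print(alignment_score("AAB", "ABB", p, q))
```

Pairs that occur more often in true alignments than expected by chance get a positive score, and pairs that occur less often get a negative one.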
Summary of substitution matrices
• Triple-PAM strategy (Altschul, 1991)
– PAM 40 short alignments, highly similar
– PAM 120
– PAM 250 longer, weaker local alignments
• BLOSUM (Henikoff, 1993)
– BLOSUM 90 short alignments, highly similar
– BLOSUM 62 most effective in detecting known members of a protein family
– BLOSUM 30 longer, weaker local alignments
• No single matrix is the complete answer for all sequence comparisons
• Programs like BLAST usually have default matrices!
Database search
FASTA
• Use hash table of short words of the database (DB) sequence and query sequence (2-6 chars)
• For words in query sequence, find similar words in DB using (fast) hash table lookup, and compute
R = position(query) – position (DB).
Areas of long match will show same R for many words.
• Score matching segments based on content of these matches. Extend the good matches empirically.
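The word lookup and the offset R described above can be sketched in a few lines of Python. This is a simplified illustration and not the FASTA program itself: it uses exact word matches only (FASTA also considers similar words), and the function names and example sequences are made up for this sketch.

```python
from collections import Counter, defaultdict


def word_index(seq: str, k: int) -> dict:
    """Hash table mapping each word of length k to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_counts(query: str, db_seq: str, k: int = 2) -> Counter:
    """Count exact word matches per offset R = position(query) - position(DB)."""
    index = word_index(db_seq, k)
    counts = Counter()
    for q_pos in range(len(query) - k + 1):
        for db_pos in index.get(query[q_pos:q_pos + k], ()):
            counts[q_pos - db_pos] += 1
    return counts


# The query occurs in the database sequence shifted by two positions,
# so all seven matching words share the same offset R = -2.
counts = diagonal_counts("GCTACCGA", "TTGCTACCGATT", k=2)
print(counts.most_common(1))  # [(-2, 7)]
```

An offset with many word hits marks a diagonal worth scoring and extending, while scattered single hits can be ignored.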
High-scoring segment pairs
Significance of scores
The number of unrelated matches with score greater than S is approximately Poisson distributed with mean
$$E(S) = K m n e^{-\lambda S}$$
where λ is a scaling factor, K is a constant, and m and n are the lengths of the sequences.
The probability that there is a match of score greater than S follows an extreme value distribution:
$$P(x > S) = 1 - e^{-E(S)}$$
Karlin S, Altschul S. Proc Natl Acad Sci (1990)
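The two formulas above translate directly into code. The Python sketch below computes E(S) and P(x>S); the values used for K and λ are illustrative assumptions only, since both depend on the scoring system, and the function names are introduced here.

```python
import math


def expected_hits(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E(S) = K * m * n * exp(-lambda * S): expected number of chance matches scoring above S."""
    return K * m * n * math.exp(-lam * score)


def p_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """P(x > S) = 1 - exp(-E(S)); expm1 keeps precision when E(S) is tiny."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))


# Assumed example: query of length 250 against a database of 1,000,000 residues.
K, lam = 0.13, 0.32  # illustrative values, not taken from the lecture
for s in (30, 60, 90):
    print(s, expected_hits(s, 250, 1_000_000, K, lam), p_value(s, 250, 1_000_000, K, lam))
```

For small E(S) the P-value is approximately equal to E(S), which is why BLAST reports the E-value directly.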
NCBI Blast
Program   Query sequence                        Subject sequence
BLASTN    Nucleotide                            Nucleotide
BLASTP    Protein                               Protein
BLASTX    Nucleotide (six-frame translation)    Protein
TBLASTN   Protein                               Nucleotide (six-frame translation)
TBLASTX   Nucleotide (six-frame translation)    Nucleotide (six-frame translation)
NCBI Blast Example
Blast Results
(screenshot showing: graphical visualization, description, best hit, Score (S), E-value, alignment, conserved domain database (CDD))
Multiple sequence alignment (Clustal W)
Pairwise alignment, calculate distance matrix
Hbb_Human  -
Hbb_Horse  .17  -
Hba_Human  .59  .60  -
Hba_Horse  .59  .59  .13  -
Myg_Whale  .77  .77  .75  .75  -
Rooted neighbor-joining tree (guide tree) and sequence weights
(tree leaves: Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale)
Progressive alignment following guide tree
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
BLOCKS
Steve Henikoff, FHCRC, Seattle
Multiple alignments of conserved regions in protein families
– 1 block = 1 short, ungapped multiple alignment
– Families can be defined by one or more blocks
– Searches allow detection of one or more blocks representing a family
– http://blocks.fhcrc.org/
Profile Construction
f(p,b) = frequency of amino acid b in position p
s(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)
$$\mathrm{PSSM}(p,a) = \sum_{b=1}^{20} f(p,b)\, s(a,b)$$
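The profile formula can be sketched as follows: for each alignment column, the observed frequencies f(p,b) are combined with the substitution scores s(a,b). To keep the example short it uses a four-letter alphabet and a made-up +1/-1 scoring function instead of the 20 amino acids with BLOSUM or PAM; the alignment and all names are assumptions for illustration, and gaps are not handled.

```python
def column_frequencies(alignment: list, alphabet: str) -> list:
    """f(p, b): relative frequency of letter b in column p of an ungapped alignment."""
    n_seqs = len(alignment)
    freqs = []
    for p in range(len(alignment[0])):
        column = [seq[p] for seq in alignment]
        freqs.append({b: column.count(b) / n_seqs for b in alphabet})
    return freqs


def build_pssm(alignment: list, alphabet: str, score: dict) -> list:
    """PSSM(p, a) = sum over b of f(p, b) * s(a, b)."""
    return [
        {a: sum(f[b] * score[(a, b)] for b in alphabet) for a in alphabet}
        for f in column_frequencies(alignment, alphabet)
    ]


alphabet = "ACGT"
# Toy substitution scores (assumed): +1 for identity, -1 otherwise.
score = {(a, b): (1.0 if a == b else -1.0) for a in alphabet for b in alphabet}
pssm = build_pssm(["ACG", "ACT", "AGT"], alphabet, score)
for p, row in enumerate(pssm):
    print(p, row)
```

A fully conserved column (column 0 here) reproduces the plain substitution scores, while variable columns give intermediate values.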
PSI-BLAST
• Position-Specific Iterated BLAST search
• Used to identify distantly related sequences that are possibly missed during a standard BLAST search
• Easy-to-use version of a profile-based search
– Perform BLAST search against protein database
– Use results to calculate a position-specific scoring matrix
– PSSM replaces query for next round of searches
– May be iterated until no new significant alignments are found
Altschul et al., Nucleic Acids Res. 25: 3389-3402, 1997
DELTA-BLAST
• Method different from that used by PSI-BLAST
Step 1: Align the query against conserved domains derived from CDD
Step 2: Compute PSSM
Step 3: Search sequence databases using PSSM as the query
• Produces high-quality alignments, even at low levels of sequence similarity
• Dependent on homologous relationships captured within CDD
Markov chains
Markov chains: a sequence of events that occur one after another. The main restriction on a Markov chain is that the probability assigned to an event at any location in the chain can depend on only a fixed number of previous events.
Scoring sequences (e.g. start codon ATG)
3 states (S1, S2, S3), p(A)=p(C)=p(G)=p(T)=0.25
State    p(A)   p(C)   p(G)   p(T)
S1 (A)   0.91   0.03   0.03   0.03
S2 (T)   0.03   0.03   0.03   0.91
S3 (G)   0.03   0.03   0.91   0.03
Markov chain 0th order: p(ATG) = 0.91^3 = 0.754
Markov chain 1st order: p(ATG) = p(A)*p(T|A)*p(G|T)
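The 0th-order score of a candidate start codon is the product of the position-specific probabilities. The Python sketch below computes it and, as an added assumption, compares it with the uniform model p=0.25 per base as a log-odds score; treating the 0.25 values as a background model is an interpretation made here, and the names are illustrative.

```python
import math

# Position-specific probabilities for the three states S1, S2, S3 (from the slide).
STATES = [
    {"A": 0.91, "C": 0.03, "G": 0.03, "T": 0.03},
    {"A": 0.03, "C": 0.03, "G": 0.03, "T": 0.91},
    {"A": 0.03, "C": 0.03, "G": 0.91, "T": 0.03},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background model


def site_probability(codon: str) -> float:
    """0th-order probability: product of per-position probabilities."""
    return math.prod(state[base] for state, base in zip(STATES, codon))


def log_odds(codon: str) -> float:
    """log2 of site model probability over background probability."""
    background = math.prod(BACKGROUND[base] for base in codon)
    return math.log2(site_probability(codon) / background)


print(round(site_probability("ATG"), 3))  # 0.754
print(log_odds("ATG") > 0, log_odds("CCC") < 0)
```

A 1st-order chain would replace the independent per-position terms by conditional probabilities such as p(T|A), which requires a table of transition probabilities that the slide does not give.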
Hidden Markov Model (HMM)
- Example: exon-intron border
- 3 states: exon (E), 5'SS (5), intron (I)
Eddy SR, Nat Biotech 2004
Emission probabilities
Hidden, want to infer: state path π (hidden Markov chain)
Given: sequence S
log P(S,π|HMM,Θ) = log(1 * 0.25^18 * 0.9^17 * 0.1 * 0.95 * 1.0 * 0.4*0.9*0.4*0.9*0.4*0.9*0.1*0.9*0.4*0.9*0.1*0.9*0.4*0.1)
Find best state path (highest score)
If there are many possible paths, use the efficient Viterbi algorithm (based on dynamic programming)
G T A A G T C A
HMM parameters (Θ)
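The joint log probability of the sequence and one state path is the sum of the logs of all emission and transition factors along that path. The sketch below evaluates the product given on the slide, reading 0.2518 and 0.917 as 0.25^18 and 0.9^17; the grouping of the factors into exon, splice site and intron parts follows the 3-state example, and the variable names are introduced here. It uses the natural logarithm.

```python
import math

# Factors along the state path, in the order they appear in the slide's product.
start = [1.0]                                   # Start -> E
exon_emissions = [0.25] * 18                    # 18 exon bases, each emitted with 0.25
exon_transitions = [0.9] * 17 + [0.1]           # E -> E (17x), then E -> 5
splice_site = [0.95, 1.0]                       # state 5 emits G, then 5 -> I
intron_emissions = [0.4, 0.4, 0.4, 0.1, 0.4, 0.1, 0.4]  # T A A G T C A
intron_transitions = [0.9] * 6 + [0.1]          # I -> I (6x), then I -> End

factors = (start + exon_emissions + exon_transitions + splice_site
           + intron_emissions + intron_transitions)

# Summing logs avoids numerical underflow that a long product would cause.
log_p = sum(math.log(f) for f in factors)
print(round(log_p, 2))  # -41.22
```

The Viterbi algorithm finds the path with the highest such score without enumerating all paths, by keeping only the best score per state at each sequence position.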
Profile Hidden Markov Model
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Emission probabilities of the main states (M1 to M6, one per conserved alignment column) and the insert state (I):
State   P(A)   P(C)   P(G)   P(T)
M1      0.8    0.0    0.0    0.2
M2      0.0    0.8    0.2    0.0
M3      0.8    0.2    0.0    0.0
I       0.2    0.4    0.2    0.2
M4      1.0    0.0    0.0    0.0
M5      0.0    0.0    0.2    0.8
M6      0.0    0.8    0.2    0.0
Transition probabilities: M1→M2 1.0, M2→M3 1.0, M3→I 0.6, M3→M4 0.4, I→I 0.4, I→M4 0.6, M4→M5 1.0, M5→M6 1.0, M6→end 1.0
[AT][CG][AC][ACGT]*A[TG][GC]
Regular Expressions
p(ACACATC) = 0.8*1*0.8*1*0.8*0.6*0.4*0.6*1*1*0.8*1*0.8 = 0.047
log-odds = log(p(S)/0.25^L) = log(0.047/0.25^7)
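The probability of ACACATC can be recomputed by walking along one state path through the profile HMM. In the Python sketch below the state labels M1 to M6 and I are names introduced here; the emission and transition probabilities are the ones derived from the five aligned sequences, and the path uses the insert state once. The log-odds uses the natural logarithm and a uniform 0.25 background.

```python
import math

EMISSIONS = {
    "M1": {"A": 0.8, "T": 0.2},
    "M2": {"C": 0.8, "G": 0.2},
    "M3": {"A": 0.8, "C": 0.2},
    "I":  {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "M4": {"A": 1.0},
    "M5": {"T": 0.8, "G": 0.2},
    "M6": {"C": 0.8, "G": 0.2},
}
TRANSITIONS = {
    ("M1", "M2"): 1.0, ("M2", "M3"): 1.0, ("M3", "I"): 0.6, ("M3", "M4"): 0.4,
    ("I", "I"): 0.4, ("I", "M4"): 0.6, ("M4", "M5"): 1.0, ("M5", "M6"): 1.0,
}


def path_probability(seq: str, path: list) -> float:
    """Product of emission and transition probabilities along a given state path."""
    prob = 1.0
    for k, (symbol, state) in enumerate(zip(seq, path)):
        prob *= EMISSIONS[state].get(symbol, 0.0)
        if k + 1 < len(path):
            prob *= TRANSITIONS.get((state, path[k + 1]), 0.0)
    return prob


seq = "ACACATC"
p = path_probability(seq, ["M1", "M2", "M3", "I", "M4", "M5", "M6"])
log_odds = math.log(p / 0.25 ** len(seq))
print(round(p, 3))  # 0.047
print(log_odds > 0)
```

Dividing by 0.25^L corrects for sequence length, so that sequences with and without insertions become comparable.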
insertion state
- For multiple alignments (e.g. DNA sequences)
Profile Hidden Markov Model
Main states (gray)
• Allows position dependent gap penalties
• Can be obtained from a multiple alignment (DNA or Protein)
• Can be used for searching a database for other members of the family
Insert states to model highly variable regions in the alignment
Delete (silent, null) states
Avoid overfitting by using pseudocounts (e.g. add 1 to all counts)
Insert states
Neural network for secondary structure prediction
PredictProtein
Multi-step predictive algorithm (Rost et al., 1994)
Protein sequence queried against SWISS-PROT
MaxHom used to generate iterative, profile-based multiple sequence alignment (Sander and Schneider, 1991)
Multiple alignment fed into neural network (PHDsec)
Accuracy: Average > 70%, Best-case > 90%
http://www.predictprotein.org/
Prediction of protein function
Frank Eisenhaber et al.
Annotator
SignalP
• Neural network trained based on phylogeny
– Gram-negative prokaryotic
– Gram-positive prokaryotic
– Eukaryotic
• Predicts secretory signal peptides
• http://www.cbs.dtu.dk/services/SignalP/
Signal peptide score (S)
Cleavage site score (C)
Combined Score (Y)
Insulin
Proinsulin
DNA sequence(AY242110)
1. extract cds of insulin
2. extract mRNA seq of insulin
fasta file
blastn
4. extract protein sequence of insulin
NCBI Gene using Refseq ID
fasta file
5. extract mRNA seq in human
Notepad++
Notepad++
NCBI Nucleotide (Genbank)
blastn
NCBI Gene using Refseq ID
6. extract protein sequence of insulin
fasta file Notepad++
7. genomic location and adjacent genes
NCBI Gene using Refseq ID
8. compare protein sequences: blastp (align two sequences)
9. find signal peptide: SignalP server
Sus scrofa (pig) Homo sapiens (human)
Find difference between insulin sequence in pig and human
Signal peptide B-chain (30 AA)
C-peptide A-chain (21 AA)
Difference between pig and human insulin = 1AA
A-kinase anchoring proteins (AKAPs) binding to the regulatory subunit of protein kinase A (PKA)
Protter
GPR161
Protein secondary structure prediction (Jpred)
Amphipathic helix
Heliquest
Analyses and interpretation of DNA variants
(figure labels: variants, Protein structure, Predictive biomarker, Clonal evolution, Tumor heterogeneity, Antigen predictions, Functional impact, Combinations, Pathways, Recurrent mutations, Population)
Somatic mutation detection in tumor samples
Raphael et al. Genome Med. 2014
MuTect 2 (GATK)
MutSig (MutSigCV)
Lawrence, M. et al. Nature 2013
MutSig builds a model of the background mutation processes (BMR) that were at work during formation of the tumors, and it analyzes the mutations of each gene to identify genes that were mutated more often than expected by chance, given the background model.
MutSigCV (CV for 'covariate') improves the BMR estimation by pooling data from 'neighbor' genes with similar genomic properties such as DNA replication time, chromatin state (open/closed), and general level of transcription activity.
Ensembl Variant Effect Predictor (VEP)
Oncotator
http://www.broadinstitute.org/oncotator/
chr1 43815005 43815005 A AT
chr1 43815008 43815008 TG T
chr1 43815013 43815013 G GT
chr1 43815036 43815036 G GCC
chr1 43815041 43815041 C G
chr1 43815047 43815047 G GC
chr1 43815049 43815049 A AG
chr1 115252237 115252237 T TG
jurkat_mutations.vcf
Oncotator
MutationAssessor
2,212288939,C,G
Chr Pos RefAll AltAll
https://www.broadinstitute.org/cancer/cga/mutsig_run (see key.txt)
MutSigCV
exome_full192_coverage.txt
gene_covariates.txt
Mutation_type_dictionary_file.txt
hg19
dlcbl_prim.maf
Intratumor heterogeneity and clonal evolution
Cancer-Immunity cycle
Hackl et al. Nat Rev Genet 17: 441-458 (2016)
NetMHCpan
http://www.cbs.dtu.dk/services/NetMHCpan/
IntOGen
Driver genes
Functional Impact
Recurrence
Cancers arise due to alterations in genes that confer a growth advantage to the cell. More than 400 such 'cancer genes' identified to date are currently annotated in the Cancer Gene Census.
Cancer driver mutations (cancer drivers) can be distinguished from passenger mutations based on:
Driver signals
Clustered mutations (OncodriveCLUST)
Functional Mutations (OncodriveFM)
Recurrent Mutations (MutSigCV)
Mutation frequency per cancer type
No. of mutated samples in cancer type
No. of protein affecting mutated samples in cancer type (PAM)
Mutation distribution along protein sequence
Protein domains
No. and position of mutations of protein affecting mutated samples in cancer type (PAM)
Different transcripts
IntOGen (Integrative Onco Genomics)
What is the most common BRAF mutation?
In which cancer types is IDH1 a cancer driver, and in which cancer type is mutation of IDH1 most frequent?
Most common drivers in breast carcinoma
Mutation frequency of VHL
Examples
TCGA
International Cancer Genome Consortium (ICGC)
The International Cancer Genome Consortium (ICGC) was launched to coordinate large-scale cancer genome studies in tumours from 50 different cancer types and/or subtypes that are of clinical and societal importance across the globe.
Systematic studies of more than 25,000 cancer genomes at the genomic, epigenomic and transcriptomic levels will reveal the repertoire of oncogenic mutations, uncover traces of the mutagenic influences, define clinically relevant subtypes for prognosis and therapeutic management, and enable the development of new cancer therapies.
TCGA
TCGA is supervised by the National Cancer Institute's Center for Cancer Genomics and the National Human Genome Research Institute funded by the US government.
3 year pilot project on 3 cancer types (starting 2006)
Phase II: 33 different cancer types, including 10 rare cancers, by 2014
Genome characterization centers (GCCs) for sequencing.
Genome data analysis centers (GDACs) for bioinformatic analyses.
It is also part of the International Cancer Genome Consortium (ICGC)
Currently 34 cancer types, >11000 samples
TCGA
Biospecimen Core Resource (BCR): Collect and process tissue samples
Genome Sequencing Centers (GSCs): Use high-throughput Genome Sequencing to identify the changes in DNA sequences in cancer
Genome Characterization Centers (GCCs): Analyze genomic and epigenomic changes involved in cancer
Data Coordinating Center (DCC): The TCGA data are centrally managed at the DCC
Genome Data Analysis Centers (GDACs): These centers provide informatics tools to facilitate broader use of TCGA data.
TCGA member map
TCGA
All information, tutorials, FAQs etc. can be found here:
https://wiki.nci.nih.gov/display/TCGA/TCGA+Wiki+Home
TCGA Data Portal has changed to Genomic Data Commons (GDC)
https://gdc-docs.nci.nih.gov/Data_Portal/Users_Guide/Getting_Started/
TCGA barcodes
Tissue source site
Biospecimen core resource
Universally Unique Identifiers (UUIDs) have replaced TCGA barcodes
Sample
TCGA data levels
Sequencing data (fastq, BAM) (data level 1) is available under controlled access at the Cancer Genomics Hub (CGHub)
Genomic Data Commons
GDC data transfer tool
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
unzip
Download using manifest
>gdc-client download -m gdc_manifest_20161212_140030.txt
Firebrowse
Download RNAseqV2 with Firebrowse
~_Level_3__RSEM_genes__data.data.txt
Download clinical data with Firebrowse
(screenshots: all clinical and sample information; table of patients by selected clinical parameter)
Somatic mutations (open access)
https://www.biostars.org/p/69222/
Tumor-specific Analysis Working Groups (AWGs) take the auto-generated variant calls from the Genome Sequencing Centers (GSCs), and remove false-positive variants, or recover those missed by the GSCs.
This is done either manually by experts or by automated scripts and filters based on the extensive domain knowledge of said experts.
So the quality of variant calls may differ wildly between TCGA GSCs and AWGs, and across samples as the pipelines get better over time.
In some tumor-types, and for some subsets of samples, mutations called from the first-pass of sequencing (usually exome-seq), are targeted by custom capture arrays, and re-sequenced. This results in much higher read-depths for the purpose of validating first-pass calls, for more accurate variant allele fractions (VAFs), or for finding calls that were missed.
Mutation Annotation Format (MAF) Specification
No.  Field                          Example value
1    Hugo_Symbol                    IDH1
2    Entrez_Gene_Id                 0
3    Center                         genome.wustl.edu
4    NCBI_Build                     37
5    Chromosome                     2
6    Start_Position                 209113112
7    End_Position                   209113112
8    Strand                         +
9    Variant_Classification         Missense_Mutation
10   Variant_Type                   SNP
11   Reference_Allele               C
12   Tumor_Seq_Allele1              C
13   Tumor_Seq_Allele2              T
14   dbSNP_RS                       rs121913500
15   dbSNP_Val_Status               unknown
16   Tumor_Sample_Barcode           TCGA-AB-2802-03B-01W-0728-08
17   Matched_Norm_Sample_Barcode    TCGA-AB-2802-11B-01W-0728-08
18   Match_Norm_Seq_Allele1         C
19   Match_Norm_Seq_Allele2         C
20   Tumor_Validation_Allele1
21   Tumor_Validation_Allele2
22   Match_Norm_Validation_Allele1  C
23   Match_Norm_Validation_Allele2  C
24   Verification_Status            Verified
25   Validation_Status              Valid
26   Mutation_Status                Somatic
27   Sequencing_Phase               Phase_IV
28   Sequence_Source                Capture
29   Validation_Method
30   Score                          1
31   BAM_File                       dbGAP
32   Sequencer                      Illumina HiSeq
…
Annotations on the slide: "Should match with other analyses (=> lift-over)"; "Remove for Oncotator analysis"
MAF files from Firehose
curated
somatic
Broad (center)
COAD (cancer type)
https://confluence.broadinstitute.org/display/GDAC/MAF+Dashboard
24.0 MB (size)
https://www.biostars.org/p/69222/
Best available MAFs The Ding lab at TGI, WashU
Gene Expression Omnibus (GEO)
Raw data (cel files)
Expression matrix(normalized data)
Platform (microarray)(normalized data)
Sample data
cBioPortal
data from 105 cancer genomics studies
TCGA and other studies
Query and download
Many different analyses options
TCIA
The Cancer Immunome Atlas
Charoentong et al. Cell Rep. 2017. 18:248-262
TCIA (Immunophenogram)
TCIA (Tumor infiltrated lymphocytes)
TCIA (Gene expression)
TCIA (Survival analysis)
GPViz
http://icbi.at/software/gpviz/gpviz.shtml
Do not update JAVA!
Refseq_hg19.gtf
oncotator.maf
LAML.maf
Refseq_CDD.txt
GPViz
http://icbi.at/software/gpviz/gpviz.shtml Snajder et al, Bioinformatics 2013
RNAseq
e.g. 20 mio
0. Image analysis and base calling (Phred quality score)
=> FastQ files (sequence and corresponding quality levels)
1. Trimming adaptors and low quality reads
2. Read mapping (spliced alignment) (Tophat, STAR)
=> SAM/BAM files
3. Transcriptome reconstruction (reference transcriptome, GTF file)
4. Expression quantification (transcript isoforms) (HTseq, Cufflinks)
- count reads
- normalization
5. Differential expression analysis (negative-binomial test) (DESeq, edgeR, CuffDiff)
Analysis steps
Normalization
Reads per kilobase per million reads (RPKM)
Fragments per kilobase per million (FPKM) for paired-end seq.
Quantile normalization (upper quantile normalization)
TMM (trimmed mean of M values)
TCGA
RNAseqV2 analysis: MapSplice is used to do the alignment and RSEM to perform the quantitation.
raw_count … can be used directly for EBSeq; for DESeq2 and edgeR use integers
scaled_estimate … transcripts per million: TPM = scaled_estimate * 10^6
normalized_count … upper quartile normalized RSEM count estimates
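The conversions listed above are simple enough to sketch directly. In the Python example below, the function names and the sample values are illustrative; the log2(TPM+1) transform is an assumption suggested by the file name LUAD_MATRIX_PRIM_LOG2TPM1.txt used later in the course material, and rounding is one common way to obtain the integer counts that DESeq2 and edgeR expect.

```python
import math


def tpm_from_scaled_estimate(scaled_estimate: float) -> float:
    """TPM = scaled_estimate * 10^6 (RSEM scaled estimates sum to 1 per sample)."""
    return scaled_estimate * 1e6


def log2_tpm_plus_one(tpm: float) -> float:
    """log2(TPM + 1); the pseudocount of 1 keeps genes with TPM = 0 defined."""
    return math.log2(tpm + 1.0)


def integer_count(raw_count: float) -> int:
    """Round an RSEM expected count to an integer for count-based tools."""
    return int(round(raw_count))


# Hypothetical values for one gene in one sample.
tpm = tpm_from_scaled_estimate(2.5e-5)
print(tpm, log2_tpm_plus_one(3.0), integer_count(1523.67))
```

Note that TPM values are suited for visualization and comparison between genes within a sample, while differential expression tools should be given the (integer) counts.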
Isoform quantification
• Uncertainty in assigning reads to isoforms
• Paired-end sequencing
• Spliced alignment
• Alternative splicing (statistically significant?)
Representation of gene expression
• n x m matrix with n genes and m experiments (conditions, patient samples)
• Representation as heatmap (e.g. red upregulated genes, green downregulated genes, black no change)
• For experiments in reference design:
log2-fold change (log2FC, log2(A/B), log2 ratio)
• For patient samples and no reference:
mean centered log2-levels for each gene
log2-intensities for one-color arrays
log2-RPKM for RNAseq
z-score of log2-levels
Z = (X-m)/s    X … log2-levels, m … mean, s … standard deviation
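The z-score transformation is applied per gene, across all samples. The short Python sketch below shows it for one gene; the function name and the example values are made up, and the sample standard deviation (n-1 in the denominator) is an assumption, since the slide does not say which estimator is meant.

```python
import statistics


def zscores(log2_levels: list) -> list:
    """Z = (X - m) / s for one gene across samples."""
    m = statistics.mean(log2_levels)
    s = statistics.stdev(log2_levels)  # sample standard deviation (assumption)
    if s == 0:
        raise ValueError("gene has no variation; z-score is undefined")
    return [(x - m) / s for x in log2_levels]


# Hypothetical log2 expression levels of one gene in three samples.
print(zscores([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```

After this transformation every gene has mean 0 and standard deviation 1, so genes with very different absolute expression levels share one color scale in the heatmap.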
(heatmap: genes in rows, experiments (patients) in columns)
Clustering
• Unsupervised or supervised (classification)
• Agglomerative: bottom up approach, whereby single expression profiles are successively joined to form nodes.
• Divisive: top down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.
Similarity (distance) measures
• Pearson correlation (hierarchical clustering), -1 ≤ r ≤ 1
• Euclidean distance (k-means clustering)
• Manhattan distance
• Spearman rank correlation
• Mutual information
…
Hierarchical clustering
• Agglomerative (bottom up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in dendrogram (tree)
• Cut tree to get clusters
• Pearson correlation (usually used)
• Computationally intensive (correlation matrix)
1. Identify clusters (items) with closest distance
2. Join to new clusters
3. Compute distance between clusters (items) (see linkage)
4. Return to step 1
Dendrogram
6 cluster
15 cluster
Indicates similarity
Linkage
Single-linkage clustering: Minimal distance
Complete-linkage clustering: Maximal distance
Average-linkage clustering: Calculated using average distance (UPGMA). Average from distances, not expression values!
Weighted pair-group average: Like UPGMA but weighted according to cluster size
Within-groups clustering: Average of merged cluster is used instead of cluster elements
Ward's method: Smallest possible increase in the sum of squared errors
• partition n genes into k clusters, where k has to be predetermined
• k-means clustering minimizes the variability within clusters and maximizes the variability between clusters
• Moderate memory and time consumption
K-means clustering
1. Generate random points ("cluster centers") in n dimensions (results depend on these seeds).
2. Compute distance of each data point to each of the cluster centers.
3. Assign each data point to the closest cluster center.
4. Compute new cluster center position as average of points assigned.
5. Loop to (2), stop when cluster centers do not move very much.
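The five steps above can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used in Genesis: the initial centers are drawn from the data points themselves rather than generated as arbitrary random points, Euclidean distance is used, and the function name, parameters and toy data are assumptions made here.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (labels, centers) for k-means clustering with Euclidean distance."""
    rng = random.Random(seed)
    # Step 1: choose k seeds (here: k distinct data points, an assumed variant).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Steps 2 and 3: assign each point to its closest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Step 4: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # Step 5: stop when the centers hardly move any more.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Toy data: two well separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
```

Because the result depends on the seeds, it is common to run the algorithm several times with different `seed` values and keep the solution with the smallest within-cluster variability.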
Principal component analysis (PCA)
PCA is a data reduction technique that simplifies multidimensional data sets to a smaller number of dimensions (r<n).
Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum of the variance.
Each subsequent principal component is orthogonal to the preceding ones. The first 2 PCs often already explain a large part of the variance (e.g. 80-90%).
This analysis can be done by a special matrix decomposition (singular value decomposition SVD).
Singular value decomposition (SVD)
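$$X = U S V^T \quad \text{with} \quad U U^T = V^T V = V V^T = I$$
For mean-centered data the covariance matrix C is proportional to $X X^T$. The columns of U are eigenvectors of $X X^T$, and the corresponding eigenvalues λ, defined by the characteristic equation $|C - \lambda I| = 0$, are given (up to the same scaling) by the squared diagonal entries of S.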
Transformation of the input vectors into the principal component space can be described by Y = XU where the projection of sample i along the axis is defined by the j-th PC:
Biological meaning of the gene sets
?
• Gene ontology terms
• Pathway mapping
• Linking to Pubmed abstracts or associated MESH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over representation analysis
Gene set enrichment analysis
Subramanian A et al. Proc Natl Acad Sci (2005)
Gene set enrichment analysis
1. Given an a priori defined set of genes S.
2. Rank genes (e.g. by t-value between 2 groups of microarray samples) to obtain a ranked gene list L.
3. Calculation of an enrichment score (ES) that reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.
4. Estimation of the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure.
5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).
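Step 3 can be sketched as a running sum over the ranked list: the sum increases at genes that belong to the set (weighted by their ranking score) and decreases at genes that do not, and the ES is the maximum deviation from zero. The Python sketch below follows the weighted running-sum statistic described by Subramanian et al.; the function name, the `weight` parameter name and the toy data are assumptions made here, and the permutation test and FDR correction of steps 4 and 5 are not included.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Maximum deviation from zero of the weighted running sum along the ranked list."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Normalization so that all hit increments together sum to 1.
    norm = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    running, best = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        if hit:
            running += abs(s) ** weight / norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (hypothetical t-values, sorted from high to low).
genes = ["a", "b", "c", "d", "e", "f"]
t_values = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_score(genes, t_values, {"a", "b"}))  # set at the top: ES close to +1
print(enrichment_score(genes, t_values, {"e", "f"}))  # set at the bottom: ES close to -1
```

A gene set scattered randomly over the list gives an ES near zero, because increments and decrements roughly cancel along the way.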
http://www.broadinstitute.org/gsea
http://www.broadinstitute.org/cancer/software/gsea/wiki/
Gene Ontology (GO)
The three organizing principles (categories) of GO are
cellular component (e.g. mitochondrion)
biological process (e.g. cell cycle)
molecular function (e.g. isomerase activity)
The Gene Ontology project (http://geneontology.org) provides a controlled vocabulary to describe gene and gene product attributes in any organism.
Term: transcription initiation
ID: GO:0006352
Definition: Processes involved in starting transcription, where transcription is the synthesis of RNA by RNA polymerases using a DNA template.
What’s in a GO term?
Parent /child relation in directed acyclic graph (DAG)
2 relations: part_of, is_a
different levels
more specific
less specific
biological_process
Gene Ontology Browser (Amigo2)
http://amigo2.geneontology.org (http://geneontology.org/)
Inferred tree view
Term information Annotation
...
ISS Inferred from Sequence Similarity
IEP Inferred from Expression Pattern
IMP Inferred from Mutant Phenotype
IGI Inferred from Genetic Interaction
IPI Inferred from Physical Interaction
IDA Inferred from Direct Assay
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred by Curator
ND No biological Data available
Evidence code for GO annotations
Over representation analysis
m … gene universe (whole microarray)
g … all genes with GO term
c … genes in cluster (gene list)
i … genes in cluster with GO term
contingency table:
m-g   c-i
g     i
Over representation analysis
Fisher exact test for contingency table
Hypergeometric distribution
Multiple hypothesis testing => adjust p-value
Not only for GO Terms also for TFBS, pathways,..
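$$p = \frac{\binom{g}{i}\binom{m-g}{c-i}}{\binom{m}{c}} = \frac{\binom{50}{10}\binom{1000-50}{30-10}}{\binom{1000}{30}}$$
Urn model: 50 red balls of 1000 balls, draw 30x: 10x red, 20x not red
m=1000 genes, c=30 genes, g=50 genes (GO), i=10 genes (GO)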
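The hypergeometric probability can be evaluated exactly with integer binomial coefficients. The Python sketch below computes the probability of drawing exactly i annotated genes and the one-sided enrichment p-value (i or more annotated genes), using the urn example with 50 red balls among 1000 and 30 draws of which 10 are red; the function names are introduced here, and summing the upper tail is the usual choice for over representation analysis rather than something stated on the slide.

```python
from math import comb


def hypergeom_pmf(i: int, m: int, g: int, c: int) -> float:
    """P(exactly i annotated genes among c drawn) with g annotated genes in a universe of m."""
    return comb(g, i) * comb(m - g, c - i) / comb(m, c)


def enrichment_p_value(i: int, m: int, g: int, c: int) -> float:
    """One-sided p-value: probability of observing i or more annotated genes."""
    return sum(hypergeom_pmf(k, m, g, c) for k in range(i, min(g, c) + 1))


m, g, c, i = 1000, 50, 30, 10
print(hypergeom_pmf(i, m, g, c))
print(enrichment_p_value(i, m, g, c))
```

With 1.5 annotated genes expected by chance (30 * 50 / 1000), observing 10 gives a very small p-value; when many GO terms are tested, these p-values still need adjustment for multiple testing.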
DAVID
Database for Annotation, Visualization and Integrated Discovery
https://david.ncifcrf.gov
Functional annotation tool (over representation analysis)
Dnajb1
Wnt11
Sorbs3
D230025D16Rik
Sfxn3
Hspa5
Golga3
Hgs
Npc1
Mta2
Cnn2
Spg20
Zpr1…
…
1019 mouse gene symbols
Pathways
Canonical Pathways are idealized or generalized pathways that represent common properties of a particular signaling module or pathway
A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell. Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move (genome.gov/27530687).
• metabolic pathways
• signaling pathways
• gene regulation pathways
Biochemical and Metabolic Pathways
Böhringer Mannheim
KEGG
Kyoto Encyclopedia of Genes and Genomes http://www.genome.jp/kegg/
Mapping of gene expression Updated information ($$)
1. Metabolism
2. Genetic Information Processing
3. Environmental Information Processing
4. Cellular Processes
Reactome (http://www.reactome.org)
BioCyc (HumanCyC, EcoCyc,..)
Biocarta (http://www.biocarta.com/genes)
PANTHER pathways (http://www.pantherdb.org)
Wiki Pathways (http://www.wikipathways.org)
Pathway databases
Hanahan, Weinberg, Cell. 2000
Signaling pathways
Pathway interaction database (http://pid.nci.nih.gov)
137 NCI curated pathway maps
Signal Transduction Knowledge Environment
http://stke.sciencemag.org/cm/
cancer cells
Pathway (and network) analysis
De novo network construction
– co-expression network
– reverse engineering
Pathway analysis
– over representation
– mapping gene expression to pathway
Interaction network (direct or indirect interactions)
– from protein-protein interaction databases (HPRD, IntAct)
– known and predicted functional association (STRING)
Pathway Commons
Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis
Cytoscape
Access pathway commons from cytoscape
http://www.cytoscape.org
Open source software for network visualization
Active community
>40 plugins (apps) extend functionality
e.g. Bingo, ClueGO
Easy to use and good documentation
Cline MS et al Nat Protoc. 2007
Other pathway tools
Metacore (GeneGO) (Thomson Reuters) ($$)
Ingenuity pathway analysis (IPA) (QIAGEN) ($$)
Pathway Studio (Ariadne) ($$)
GenMAPP (Mapping gene expression onto pathways)
PathVisio (Drawing pathways)
Box-and-whiskers plot
Measures of variability
Survival analysis
Censored data
Survival function
Kaplan-Meier survival curves
Kaplan-Meier estimator
Comparison of Kaplan-Meier curves
Log-rank test
Hazard ratio (HR)
Hazard function
Relative risk and proportional hazard model
Cox regression
LUAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data.txt
get_matrix_tn_rawcounts.R
get_matrix_tn_TPM.R
get_matrix_prim_TPM.R
deseq2_tn.R
results_tn.txt
LUAD_MATRIX_58TN_COUNTS.txt
LUAD_MATRIX_58TN_TPM.txt
LUAD_MATRIX_PRIM_TPM.txt
Firebrowse
LUAD_MATRIX_PRIM_LOG2TPM1.txt
get_matrix_prim_LOG2TPM1.R
diff_expressed_genes.xlsx
Analyses
Gene ontology
Pathways
boxplot_surv.R
LUAD_CLIN.txt
LUAD_MATRIX_PRIM_TPM.txt
results_tn.txt
LUAD_MATRIX_58TN_TPM.txt
Firebrowse
luad_boxplot_os_genes.pdf
LUAD_IMMUNE.txt
Heatmap, clustered data
RELAPSE.rnk
Enrichment plots
NES, q-value (FDR)
gensets.gmx
Analyses
get_deg.R
EXPR_GSE67501.txt
GEO
GSE67501
Analyses
normalized_data.txt
GEO
GSE51373
GSE51373_RAW.tar
GSM1243877_U133Plus2_S1351.CEL.gz
GSM1243904_U133Plus2_RD00443.CEL.gz
…
get_eset.R
target.txt
get_deg_S_R.R
deg_S_R_GSE51373.txt
deg_S_R_GSE51373_UNIQUE.txt
diff_expressed_genes.xlsx
ROC.R
deg_GSE67501.txt