2012 august 16 systems biology rna seq v2
Transcript of 2012 august 16 systems biology rna seq v2
04/12/2023 Wellstein/Riegel Laboratory 1
Cancer Systems Biology: RNA-Seq
August 16, 2012
Anne Deslattes Mays
Wellstein/Riegel Laboratory
Mentor: Anton Wellstein, MD, PhD
04/12/2023 Wellstein/Riegel Laboratory 2
Talk Outline
• What is Systems Biology?
• What is RNA-Seq?
• RNA-Seq Differential Expression Analysis
04/12/2023 Wellstein/Riegel Laboratory 3
Systems Biology is a systems approach to building testable models of biology using observation and
measurement
04/12/2023 Wellstein/Riegel Laboratory 4
Systems Biology brings together interdisciplinary fields, tools, analyses and platforms
• Genomics
• Epigenomics/epigenetics
• Transcriptomics
• Proteomics
• Metabolomics
• Glycomics
• Lipidomics
• Interactomics
• NeuroElectroDynamics
• Fluxomics
• Biomics
04/12/2023 Wellstein/Riegel Laboratory 5
What is the discipline of Systems Biology?
A Reverse Engineering Discipline
Input
Process
Output
Perhaps more equivalent to a deciphering project: Alan Turing and the group of codebreakers during World War II deciphered the codes created by the Enigma machine. A biological system is communicating, and we are trying to crack the code.
04/12/2023 Wellstein/Riegel Laboratory 6
Genome
Transcriptome
Proteome
Metabolome
What is Systems Biology?
Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that system's ability to dynamically remodel in response to stimuli
04/12/2023 Wellstein/Riegel Laboratory 7
Sequencing technologies
Mass Spec technologies
What is Systems Biology?
Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that system's ability to dynamically remodel in response to stimuli
Genome
Transcriptome
Proteome
Metabolome
04/12/2023 Wellstein/Riegel Laboratory 8
What is Systems Biology?
Technology Advances Spur Research Advances
Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that system's ability to dynamically remodel in response to stimuli
Sequencing technologies
Mass Spec technologies
Genome
Transcriptome
Proteome
Metabolome
04/12/2023 Wellstein/Riegel Laboratory 9
RNA-seq
04/12/2023 Wellstein/Riegel Laboratory 10
Here is an example RNA-Seq Workflow
Experimental Design
Sample Collection
Sequencing
Quality Control / Read Trimming
Differential Analysis
Transcript Identification
Pathway Analysis
Marker Discovery
04/12/2023 Wellstein/Riegel Laboratory 11
Three steps to get to a fresh sequence with the Illumina Genome Analyzer
• Library generation
• Cluster generation
• Sequencing
04/12/2023 Wellstein/Riegel Laboratory 12
Library Construction: Messenger RNAs are poly-A selected from total RNA and fragmented, and cDNA is synthesized
Before Library Construction
1. Poly-A selection (total RNA -> mRNA)
2. mRNA fragmentation
3. First strand synthesis (here we stop if we want to maintain strand specificity)
4. Second strand synthesis
Other techniques (rRNA depletion)
5. Ribo-Zero
6. RiboMinus
04/12/2023 Wellstein/Riegel Laboratory 13
Library Construction: End repair and adenylation result in adapter-ligation-ready constructs
cDNA (single or double stranded)
1. cDNA is blunt end-repaired and phosphorylated (B.)
2. A-base added to prepare for indexed adapter ligation (C.)
04/12/2023 Wellstein/Riegel Laboratory 14
Library Construction: Adapter ligation results in cluster-generation-ready constructs
Index adapter ligation and product ready for amplification on the cBot or the Cluster Station
1. Strand-specific tags are added to the A base: ligate index adapter (D)
2. Denature and amplify for final product (E)
04/12/2023 Wellstein/Riegel Laboratory 15
Cluster Generation: In the Illumina cBot system, single molecules are isothermally amplified in a flow cell to prepare them for sequencing
Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell
1. Oligo lawn
2. Oligos hybridize to the adapters that were ligated to the library fragments, which flow through the cell
04/12/2023 Wellstein/Riegel Laboratory 16
Cluster Generation: Bound fragments are extended to make copies, and reverse strands are cleaved and washed away
Bridge amplifications result in 100s of millions of unique clusters
1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications
2. Reverse strands are cleaved and washed away
3. Ends are blocked
4. Sequencing primer is hybridized to the DNA template
5. Libraries are ready for sequencing
04/12/2023 Wellstein/Riegel Laboratory 17
Sequencing: 100s of millions of clusters sequenced simultaneously
4 fluorescently labeled, reversibly terminated nucleotides
1. Each base competes for addition
2. Natural competition ensures highest accuracy
3. After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base
4. Fluorescent label and blocking group are removed, allowing for addition of the next nucleotide
5. Proprietary (Illumina) chemistry reads one base in each cycle
6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
04/12/2023 Wellstein/Riegel Laboratory 18
What was good for DNA is now good for RNA
• Technology advances => higher-throughput sequencing at lower cost
• Whole Genome Sequencing has enabled Whole Transcriptome Sequencing
• Workflow for DNA sequencing and RNA sequencing is similar
04/12/2023 Wellstein/Riegel Laboratory 19
There are other ways to inquire about the transcriptome
• Array-based technologies
– Affymetrix
– Agilent
– Known genes and hybridization protocols
• Microarray
– 20,000+ array experiments on a single platform
– Edge effects
– False positives / false negatives
• Bead-based arrays
• Tiling arrays
• SAGE
04/12/2023 Wellstein/Riegel Laboratory 20
What is unique about RNA-Seq?
• Allows you to discover and profile the entire transcriptome of any organism
• No probes or primers to design
• Novel transcripts
• Novel isoforms
• Alternative splice sites
• Rare transcripts
• cSNPs
All of this in one experiment
04/12/2023 Wellstein/Riegel Laboratory 21
After sequencing, we must then perform RNA-Seq Data Analysis
After sequencing…
1. Quality control: trim your reads
2. Count reads
• Align to genome
• Align to transcriptome
3. Interpret data
• Statistical tests (differential expression analysis)
• Visualization (mapped reads)
• Pathway analysis
Not so simple: big data, big compute requirements
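As an illustration of the quality-control step, here is a minimal sketch of 3' quality trimming for a single read. It assumes Phred+33 encoded quality strings and a cutoff of Q20; both are assumptions for the example, not settings taken from the talk, and the function name `trim_3prime` is hypothetical. Dedicated trimming tools do considerably more (adapter removal, sliding windows, minimum-length filters).

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until the last base has quality >= min_q.

    Assumes `qual` is a Phred-encoded string with the given ASCII offset
    (33 is assumed here; older Illumina pipelines used 64).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read for illustration: 'I' encodes Q40, '#' Q2 and '!' Q0 in Phred+33.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

Only the low-quality tail is removed; a low-quality base in the middle of the read is left in place, which is one reason real trimmers often use a windowed average instead.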
04/12/2023 Wellstein/Riegel Laboratory 22
How much RNA-sequencing data, how much computation power and where do you go to compute?
How much RNA-sequencing data?
1. 20 million paired-end reads ~ 2 GB of data
2. 100 million paired-end reads ~ 10 GB of data
How much computation power?
3. The more memory and processors, the less time it takes to compute
4. If you outsource the analysis, you will still need to store the results somewhere
Where to compute?
• Amazon Web Services: S3 storage, EC2 (Elastic Compute Cloud) on-demand computational facility
• Georgetown University High Performance Computer Core: matrix.georgetown.edu
• UPENN Galaxy services
04/12/2023 Wellstein/Riegel Laboratory 23
A growing number of tools enable RNA-Seq analysis
04/12/2023 Wellstein/Riegel Laboratory 24
These RNA-Seq tools are used for mapping reads, aligning reads and providing input for differential expression analysis
• Tuxedo suite
– Bowtie, TopHat, Cufflinks
• Trinity suite
– Inchworm, Chrysalis, Butterfly
• RUM
– RNA-Seq Unified Mapper
04/12/2023 Wellstein/Riegel Laboratory 25
What percentage of reads are covered? What percentage of reads are mapped?
3’ bias on transcript reads
1. 60-80% of reads are mapped
2. The highest percentage of mapped reads is at the 3’ end
3. Reads need to be quality trimmed
Mapping tools bias exons toward known genes
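The "what percentage of reads are mapped" question can be answered directly from an aligner's SAM output. The sketch below is one way to do it, assuming uncompressed SAM text and counting only primary records; the function name `mapped_fraction` and the toy records are made up for illustration. It relies only on the SAM FLAG bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary).

```python
def mapped_fraction(sam_lines):
    """Return the fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # column 2 of a SAM record is the bitwise FLAG
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) alignments
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy records (made up for illustration); only the first two columns matter here.
toy_sam = [
    "@HD\tVN:1.0",
    "read1\t0\tchr12\t100",
    "read2\t16\tchr12\t250",
    "read3\t4\t*\t0",
    "read2\t256\tchr7\t900",
]
print(f"{100 * mapped_fraction(toy_sam):.1f}% of reads mapped")
```

Skipping secondary and supplementary records keeps a multi-mapping read from being counted more than once.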
04/12/2023 Wellstein/Riegel Laboratory 26
Galaxy is a web-based platform built to enable researchers (and not just for RNA-Seq)
04/12/2023 Wellstein/Riegel Laboratory 28
How to visualize mapped results?
• UCSC Genome Browser (Gbrowse)
• Integrated Genome Browser (IGB)
• Integrative Genomics Viewer (IGV)
Many shared formats: these viewers read many of the outputs generated by the programs and let you generate your own tracks
04/12/2023 Wellstein/Riegel Laboratory 30
What do RNA-Seq reads look like for GAPDH?
Repeat-masked, BLAT-aligned reads (allowing 1/2 mismatched bases) viewed in IGB 6.7.2
04/12/2023 Wellstein/Riegel Laboratory 32
RNA-Seq Differential Expression analysis
What does GAPDH look like in terms of quantitation?
Columns: TOTAL BM (RPKM, 3SEQ Counts, BLAT Reads), then HPP (RPKM, 3SEQ Counts, BLAT Reads)
CD34 0.7 340 230 8 8 14
BST1 19.7 5374 31 31
CD133 0.2 173 176 16 16 33
THY1 0 7 4 4
A12 1 0
A5 0 0
ALK 0 9 24 0 0 3
B9 0 0
C1 0 0
C2 0 0
C7 0 0
E7 0 0
E9 2 0
F6 0 0
G12 0 0
GAPDH 3013.2 727831 356289 120.8 5559 2670
H3 0 0
BLAT read raw-count ratio == 3SEQ count ratio ~= 130 to 1
RPKM ratio ~= 24.3
04/12/2023 Wellstein/Riegel Laboratory 34
RNA-Seq Quantification Challenge: A problem that exists with RNA-Seq data but not with array data: longer transcripts produce more reads than shorter transcripts
One solution to account for this is RPKM (Cufflinks uses the analogous FPKM)
RPKM = 10^9 x C / (N x L), which is really just C/N scaled per kilobase of exon and per million mapped reads
C (gene) = the number of mappable reads that fall onto a gene's exons
N = total number of mappable reads in the experiment
L (gene) = the sum of the exon lengths in base pairs
Wold (2008)
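The RPKM definition above can be written as a few lines of Python. This is a minimal sketch using the slide's symbols C, N and L; the function name `rpkm` and the example numbers are made up for illustration and are not taken from the GAPDH table.

```python
def rpkm(c: float, n: float, l: float) -> float:
    """Reads Per Kilobase of exon per Million mapped reads.

    c: mappable reads falling on the gene's exons (C)
    n: total mappable reads in the experiment (N)
    l: summed exon length of the gene in base pairs (L)
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Illustrative numbers only: 500 reads on a gene with 2,000 bp of exon,
# in an experiment with 20 million mappable reads.
short_gene = rpkm(500, 20_000_000, 2_000)
# The same read count on a gene twice as long gives half the RPKM,
# which is the length correction the slide describes.
long_gene = rpkm(500, 20_000_000, 4_000)
print(short_gene, long_gene)
```

The factor 10^9 is 10^3 (bases per kilobase) times 10^6 (reads per million), so the value is comparable across genes of different length and libraries of different depth.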
04/12/2023 Wellstein/Riegel Laboratory 35
Cufflinks: Transcript assembly, differential expression, and differential regulation for RNA-seq
04/12/2023 Wellstein/Riegel Laboratory 36
Cuffdiff produces many output files:
1. Transcript FPKM expression tracking.
2. Gene FPKM expression tracking; tracks the summed FPKM of transcripts sharing each gene_id
3. Primary transcript FPKM tracking; tracks the summed FPKM of transcripts sharing each tss_id
4. Coding sequence FPKM tracking; tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
5. Transcript differential FPKM.
6. Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
7. Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
8. Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
9. Differential splicing tests: this tab delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.
10. Differential promoter tests: this tab delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.
11. Differential CDS tests: this tab delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.
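Because these outputs are tab-delimited text, they can be filtered with a few lines of Python. The sketch below pulls significant genes out of a gene-level differential FPKM table. The column names (`gene`, `status`, `log2(fold_change)`, `q_value`) are an assumption based on the layout commonly seen in Cuffdiff's gene differential file and should be checked against your own output; the rows are toy data invented for the example, and `significant_genes` is a hypothetical helper.

```python
import csv
import io

# Assumed column layout; verify against the header of your own Cuffdiff output.
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]

# Toy rows, invented for illustration only.
ROWS = [
    ["XLOC_000001", "XLOC_000001", "GENE_A", "chr1:100-900", "q1", "q2", "OK",
     "10.0", "80.0", "3.0", "4.2", "0.0001", "0.002", "yes"],
    ["XLOC_000002", "XLOC_000002", "GENE_B", "chr2:500-1500", "q1", "q2", "OK",
     "50.0", "55.0", "0.1375", "0.3", "0.6", "0.8", "no"],
    ["XLOC_000003", "XLOC_000003", "GENE_C", "chr3:10-400", "q1", "q2", "NOTEST",
     "0.0", "0.0", "0.0", "0.0", "1.0", "1.0", "no"],
]
SAMPLE_GENE_DIFF = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def significant_genes(handle, max_q=0.05):
    """Return (gene, log2 fold change) pairs for tested genes with q_value <= max_q."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Keep only genes that were actually tested and pass the q-value cutoff.
        if row["status"] == "OK" and float(row["q_value"]) <= max_q:
            hits.append((row["gene"], float(row["log2(fold_change)"])))
    return hits


print(significant_genes(io.StringIO(SAMPLE_GENE_DIFF)))
```

For a real run, replace the `io.StringIO` handle with an opened file; the 0.05 cutoff is an illustrative default, not a recommendation from the talk.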
04/12/2023 Wellstein/Riegel Laboratory 37
RNA-Seq Quantification Challenge: DESeq Method uses the geometric mean of counts in all samples
DESeq Method:
Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.
To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.
Now you have, for each gene, an estimate of the depth ratio. Simply take the median of all the quotients to get the relative depth of the library.
The 'estimateSizeFactors' function of the DESeq package does this calculation.
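The three steps above translate almost line for line into code. The following is a plain-Python sketch of that median-of-ratios calculation, not the DESeq implementation itself; the function name `estimate_size_factors` is hypothetical, and skipping genes with a zero count in any sample (whose geometric mean would be zero) is an assumption made for this example. The count matrix is toy data.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows, one per gene, each a list of raw counts per sample.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # geometric mean would be zero, so this gene is skipped
        # Log of the geometric mean across samples: the "reference sample" value.
        log_ref = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_ref)
    # Median quotient per sample, converted back from log space.
    return [math.exp(median(r)) for r in log_ratios]


# Toy counts: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # skipped: zero count in sample 1
]
size_factors = estimate_size_factors(toy_counts)
print(size_factors)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale; using the median rather than the total read count keeps a few very highly expressed genes from dominating the estimate.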
04/12/2023 Wellstein/Riegel Laboratory 38
DESeq: an R package that works with Raw Counts to determine genes differentially expressed across samples
• Simon Anders
04/12/2023 Wellstein/Riegel Laboratory 42
Resources
• http://dx.doi.org/10.1038/npre.2010.4282.1 (DESeq)
• http://galaxy.psu.edu/
• http://seqanswers.com/
• http://www.broadinstitute.org/igv/
• http://bioviz.org/igb/index.html
• http://www.illumina.com
• http://www.otogenetics.com
• http://www.dnanexus.com
• http://cufflinks.cbcb.umd.edu/
• http://brb.nci.nih.gov/BRB-ArrayTools.html
04/12/2023 Wellstein/Riegel Laboratory 43
Acknowledgements
Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Marcel Schmidt
Jean-Baptiste Masarati
Dr. Elena Tassi
The entire lab: Tibari, Ghada, Ivana, Eveline, the entire Wellstein/Riegel laboratory
My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)
High Performance Core Group, Steve Moore, especially Woonki Chung
Amazon Cloud Services
Dr. Ann Loraine, UNC, IGB Developer
Brian Haas, Author, Trinity Suite
04/12/2023 Wellstein/Riegel Laboratory 44
Given a list of differentially expressed genes, enrichment analysis should now be performed
• Enrichment analysis lets the researcher leverage documented experiments that provide evidence for genes' roles in pathways and functions, helping the researcher interpret the results and significance of their own experiments
• DAVID
– Gene ontology
– Functional ontology
• REVIGO
– Output of DAVID may be placed in REVIGO for further interpretation and statistical exploration of the significance of discovered sets of genes
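To make the idea of enrichment concrete, here is a minimal sketch of an over-representation test based on the hypergeometric distribution. It is a generic textbook calculation and is not presented as the statistic DAVID or REVIGO computes; those tools apply their own scoring and multiple-testing corrections. The function name `enrichment_p_value` and the example numbers are made up for illustration.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k annotated genes in a list of n.

    N: genes in the background (e.g. all genes measured)
    K: background genes annotated to the pathway or term
    n: genes in the differentially expressed list
    k: genes in that list annotated to the pathway or term
    """
    upper = min(n, K)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper hits.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Illustrative numbers only: 10 background genes, 4 annotated to a term,
# and 2 of the 3 differentially expressed genes carry that annotation.
p = enrichment_p_value(k=2, n=3, K=4, N=10)
print(p)
```

A small p-value suggests the term appears in the gene list more often than chance would predict; when many terms are tested, the p-values need correction for multiple testing.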
04/12/2023 Wellstein/Riegel Laboratory 45
Using differentially expressed genes, biological pathways should be explored
• Differentially expressed genes are put into programs such as Pathway Studio or Ingenuity
• Shortest path programs and canonical pathway analysis
• These enable a researcher to reverse engineer the pathways expressed over the course from a healthy response to a diseased response
• Ideally a pathway explains the observed phenotype, connecting genotype to gene expression program to phenotype
04/12/2023 Wellstein/Riegel Laboratory 47
Scientific knowledge is limited (and advanced) by the limits (and advancements) of measurement
• Ilya Shmulevich, Genomic Signal Processing: “Validity of the model involves observation and measurement, scientific knowledge is limited by the limits of measurement”
• Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”
04/12/2023 Wellstein/Riegel Laboratory 48
Before library construction, RNA quality must be assessed
Before Library Construction
1. Most vendors and cores will assess the quality of the RNA before sequencing
2. Important to determine before sequencing begins
Garbage in == garbage out
04/12/2023 Wellstein/Riegel Laboratory 49
Cluster Generation
• In the cBot cluster system, single molecules are isothermally amplified in a flow cell to prepare them for high-throughput sequencing
• 8-channel Genome Analyzer flow cell has a dense lawn of oligos
• Single DNA molecules hybridize to the lawn of oligos
• Bound fragments are extended to make copies
• Copies are covalently bound to the flow cell's surface
• Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications, resulting in 100s of millions of unique clusters
• Reverse strands are cleaved and washed away
• Ends are blocked
• Sequencing primer is hybridized to the DNA template
• After cluster generation, libraries are ready for sequencing
04/12/2023 Wellstein/Riegel Laboratory 50
Sequencing
• 100s of millions of clusters sequenced simultaneously
• Using 4 fluorescently labeled, reversibly terminated nucleotides
• Natural competition ensures highest accuracy
• After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base
• Fluorescent label and blocking group are then removed, allowing for the addition of the next nucleotide
• Proprietary chemistry (Illumina) reads one base in each cycle
• Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
04/12/2023 Wellstein/Riegel Laboratory 52
Systems Biology History (wikipedia)
• Systems biology roots found in
– Quantitative modeling of enzyme kinetics
– Mathematical modeling of population growth
– Simulations to study neurophysiology
– Control theory and cybernetics
• Theorists
– Ludwig von Bertalanffy – General Systems Theory
– Alan Lloyd Hodgkin and Andrew Fielding Huxley – constructed a mathematical model that explained the action potential propagating along the axon of a neuron
– Denis Noble – first computer model of the heart pacemaker
04/12/2023 Wellstein/Riegel Laboratory 53
Institutes of Systems Biology
• 2000 – Institutes of Systems Biology established in Seattle and Tokyo
• After completion of Human Genome projects
• NSF grand challenge for systems biology – build a mathematical model of the whole cell