2012 august 16 systems biology rna seq v2

Cancer Systems Biology: RNA-Seq August 16, 2012 Anne Deslattes Mays Wellstein/Riegel Laboratory Mentor: Anton Wellstein, MD, PhD 06/26/2022 Wellstein/Riegel Laboratory 1

Transcript of 2012 august 16 systems biology rna seq v2

Page 1: 2012 august 16 systems biology rna seq v2

04/12/2023 Wellstein/Riegel Laboratory 1

Cancer Systems Biology: RNA-Seq

August 16, 2012
Anne Deslattes Mays

Wellstein/Riegel Laboratory
Mentor: Anton Wellstein, MD, PhD

Page 2: 2012 august 16 systems biology rna seq v2

Talk Outline

• What is Systems Biology?
• What is RNA-Seq?
• RNA-Seq Differential Expression Analysis

Page 3: 2012 august 16 systems biology rna seq v2

Systems Biology is a systems approach to building testable models of biology using observation and measurement

Page 4: 2012 august 16 systems biology rna seq v2

Systems Biology brings together interdisciplinary fields, tools, analysis and platforms

• Genomics
• Epigenomics/epigenetics
• Transcriptomics
• Proteomics
• Metabolomics
• Glycomics
• Lipidomics
• Interactomics
• NeuroElectroDynamics
• Fluxomics
• Biomics

Page 5: 2012 august 16 systems biology rna seq v2

What is the discipline of Systems Biology? A Reverse Engineering Discipline

Input -> Process -> Output

Perhaps it is closer to a deciphering project: Alan Turing and the group of codebreakers during World War II deciphered the codes created by the Enigma machine. A biological system is communicating, and we are trying to crack the code.

Page 6: 2012 august 16 systems biology rna seq v2

What is Systems Biology?

Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that biological system's ability to dynamically remodel in response to stimuli.

Genome, Transcriptome, Proteome, Metabolome

Page 7: 2012 august 16 systems biology rna seq v2

What is Systems Biology?

Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that biological system's ability to dynamically remodel in response to stimuli.

Sequencing technologies, Mass Spec technologies

Genome, Transcriptome, Proteome, Metabolome

Page 8: 2012 august 16 systems biology rna seq v2

What is Systems Biology?

Technology Advances Spur Research Advances

Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that biological system's ability to dynamically remodel in response to stimuli.

Sequencing technologies, Mass Spec technologies

Genome, Transcriptome, Proteome, Metabolome

Page 9: 2012 august 16 systems biology rna seq v2

RNA-seq

Page 10: 2012 august 16 systems biology rna seq v2

Here is an example RNA-Seq Workflow

Experimental Design

Sample Collection

Quality Control Read Trimming

Differential Analysis

Transcript Identification

Pathway Analysis

Marker Discovery

Sequencing

Page 11: 2012 august 16 systems biology rna seq v2

Three steps to get to a fresh sequence with the Illumina Genome Sequence Analyzer

• Library generation
• Cluster generation
• Sequencing

Page 12: 2012 august 16 systems biology rna seq v2

Library Construction: Messenger RNAs are poly-A selected from total RNA, fragmented, and used to synthesize cDNA

Before Library Construction
1. Poly-A selection (total RNA -> mRNA)
2. mRNA fragmentation
3. First strand synthesis (we stop here if we want to maintain strand specificity)
4. Second strand synthesis

Other techniques
5. Ribo-Zero
6. RiboMinus

Page 13: 2012 august 16 systems biology rna seq v2

Library Construction: End repair and adenylation result in adapter-ligation-ready constructs

cDNA (single or double stranded)
1. cDNA is blunt end-repaired and phosphorylated (B.)
2. An A-base is added to prepare for indexed adapter ligation (C.)

Page 14: 2012 august 16 systems biology rna seq v2

Library Construction: Adapter ligation results in cluster-generation-ready constructs

Index adapter ligation and product ready for amplification on the cBot or the cluster station
1. Strand-specific tags are added to the A base: ligate the index adapter (D)
2. Denature and amplify for the final product (E)

Page 15: 2012 august 16 systems biology rna seq v2

Cluster Generation: In the Illumina cBot system, single molecules are isothermally amplified in a flow cell to prepare them for sequencing

Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell
1. Oligo lawn
2. Oligos hybridize to the adapters that had been ligated to the library fragments, which flow through the cell

Page 16: 2012 august 16 systems biology rna seq v2

Cluster generation: Bound fragments are extended to make copies, and reverse strands are cleaved and washed away

Bridge amplifications result in 100s of millions of unique clusters
1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications
2. Reverse strands are cleaved and washed away
3. Ends are blocked
4. Sequencing primer is hybridized to the DNA template
5. Libraries are ready for sequencing

Page 17: 2012 august 16 systems biology rna seq v2

Sequencing: 100s of millions of clusters sequenced simultaneously

4 fluorescently labeled, reversibly terminated nucleotides
1. Each base competes for addition
2. Natural competition ensures highest accuracy
3. After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base
4. The fluorescent label and blocking group are removed, allowing for addition of the next nucleotide
5. Proprietary (Illumina) chemistry reads a base in each cycle
6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
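The per-cycle readout described on this slide can be sketched as a toy model: in each cycle a cluster emits light in four channels, and the brightest channel names the base. The function name and the intensity values are invented for illustration; real base callers also correct for channel cross-talk and phasing, which this sketch ignores.

```python
# Toy model of one sequencing-by-synthesis read.
# Assumption: one intensity per base channel per cycle; values are made up.

def call_bases(cycles):
    """Return the called read, one base per cycle (brightest channel wins)."""
    return "".join(max(intensities, key=intensities.get) for intensities in cycles)


cycles = [
    {"A": 0.91, "C": 0.03, "G": 0.04, "T": 0.02},  # cycle 1
    {"A": 0.05, "C": 0.02, "G": 0.88, "T": 0.05},  # cycle 2
    {"A": 0.02, "C": 0.04, "G": 0.03, "T": 0.91},  # cycle 3
]
read = call_bases(cycles)
print(read)
```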

Page 18: 2012 august 16 systems biology rna seq v2

What was good for DNA is now good for RNA

• Technology advances => higher-throughput sequencing at lower cost
• Whole Genome Sequencing has enabled Whole Transcriptome Sequencing
• The workflow for DNA sequencing and RNA sequencing is similar

Page 19: 2012 august 16 systems biology rna seq v2

There are other ways to Inquire about the Transcriptome

• Array-Based Technologies
  – Affymetrix
  – Agilent
  – Known genes and hybridization protocols
• Microarray
  – 20,000+ array experiments on a single platform
  – Edge effects
  – False positives / false negatives
• Bead-based arrays
• Tiling arrays
• SAGE

Page 20: 2012 august 16 systems biology rna seq v2

What is unique about RNA-Seq?

• Allows you to discover and profile the entire transcriptome of any organism

• No probes or primers to design
• Novel transcripts
• Novel isoforms
• Alternative splice sites
• Rare transcripts
• cSNPs

All of this in one experiment

Page 21: 2012 august 16 systems biology rna seq v2

After sequencing, we must then perform RNA-Seq Data Analysis

After sequencing…
1. Quality control: trim your reads
2. Count reads
   • Align to genome
   • Align to transcriptome
3. Interpret data
   • Statistical tests (differential expression analysis)
   • Visualization (mapped reads)
   • Pathway analysis

Not so simple: big data, big compute requirements
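As a rough illustration of the first step (quality trimming), the sketch below removes low-quality bases from the 3' end of a single read. It assumes Phred+33 quality encoding and a cutoff of 20; both are assumptions, not values from the talk, and the function name is hypothetical. Dedicated trimming tools use more elaborate rules.

```python
def trim_3prime(sequence, quality, min_phred=20, offset=33):
    """Drop trailing bases whose Phred score is below min_phred.

    Assumes `quality` is a FASTQ quality string in Phred+`offset` encoding.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    # Walk back from the 3' end until a base passes the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40; '#' and '!' encode Phred 2 and 0.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)
```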

Page 22: 2012 august 16 systems biology rna seq v2

How much RNA-sequencing data, how much computation power and where do you go to compute?

How much RNA-sequencing data?
1. 20 million paired-end reads ~ 2 GB of data
2. 100 million paired-end reads ~ 10 GB of data

How much computation power?
3. The more memory and processors, the less time it takes to compute
4. If you outsource the analysis, you will still need to store the results somewhere

• Amazon Web Services: S3 storage and EC2 (Elastic Compute Cloud) on-demand computational facility
• Georgetown University High Performance Computer Core: matrix.georgetown.edu
• UPENN Galaxy services
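The two data points above imply roughly 0.1 GB per million paired-end reads. A back-of-the-envelope estimator under that assumption might look like the following; the linear scaling is an extrapolation from the slide's two figures, and real file sizes depend on read length and compression.

```python
# Rule of thumb taken from the slide: 20 million paired-end reads ~ 2 GB.
GB_PER_MILLION_READS = 2 / 20


def estimate_storage_gb(million_reads):
    """Estimate raw data size in GB, assuming size scales linearly with reads."""
    return million_reads * GB_PER_MILLION_READS


for n in (20, 100):
    print(f"{n} million paired-end reads ~ {estimate_storage_gb(n):.0f} GB")
```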

Page 23: 2012 august 16 systems biology rna seq v2

A growing number of tools enable RNA-Seq analysis

Page 24: 2012 august 16 systems biology rna seq v2

These RNA-Seq tools are used for mapping reads, aligning reads and providing input for differential expression analysis

• Tuxedo suite
  – Bowtie, TopHat, Cufflinks
• Trinity suite
  – Inchworm, Chrysalis, Butterfly
• RUM
  – RNA-Seq Unified Mapper

Page 25: 2012 august 16 systems biology rna seq v2

What percentage of reads are covered? What percentage of reads are mapped?

3’ bias on transcript reads
1. 60-80% of reads are mapped
2. The highest percentage of mapped reads falls at the 3’ end
3. Reads need to be quality trimmed

Mapping tools bias exons toward known genes

Page 26: 2012 august 16 systems biology rna seq v2

Galaxy is a web-based tool committed to enabling researchers (and not just for RNA-Seq)

Page 27: 2012 august 16 systems biology rna seq v2

Page 28: 2012 august 16 systems biology rna seq v2

How to visualize mapped results?

• UCSC Genome Browser (Gbrowse)
• Integrated Genome Browser (IGB)
• Integrative Genomics Viewer (IGV)

These browsers share many file formats, read many of the outputs generated by the analysis programs, and let you generate your own tracks

Page 29: 2012 august 16 systems biology rna seq v2

Page 30: 2012 august 16 systems biology rna seq v2

Page 31: 2012 august 16 systems biology rna seq v2

What do RNA-Seq reads look like for GAPDH?

Repeat-masked reads aligned with BLAT (allowing 1 to 2 mismatched bases), viewed in IGB 6.7.2

Page 32: 2012 august 16 systems biology rna seq v2

RNA-Seq Differential Expression analysis

Page 33: 2012 august 16 systems biology rna seq v2

What does GAPDH look like in terms of quantitation?

         TOTAL BM                              HPP
Gene     RPKM     3SEQ Counts  BLAT Reads      RPKM    3SEQ Counts  BLAT Reads
CD34     0.7      340          230             8       8            14
BST1     19.7     5374                         31      31
CD133    0.2      173          176             16      16           33
THY1     0        7                            4       4
A12                            1                                    0
A5                             0                                    0
ALK      0        9            24              0       0            3
B9                             0                                    0
C1                             0                                    0
C2                             0                                    0
C7                             0                                    0
E7                             0                                    0
E9                             2                                    0
F6                             0                                    0
G12                            0                                    0
GAPDH    3013.2   727831       356289          120.8   5559         2670
H3                             0                                    0

BLAT read raw counts ratio == 3SEQ counts ratio ~= 130 to 1
RPKM ratio ~= 24.3

Page 34: 2012 august 16 systems biology rna seq v2

RNA-Seq Quantification Challenge: A problem that exists with RNA-Seq data that doesn’t exist with array data: Longer transcripts produce more reads than shorter transcripts

One solution to account for this is RPKM (FPKM used by Cufflinks)

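RPKM = 10^9 x C / (N x L), i.e. the fraction of mapped reads that fall on a gene (C/N), normalized by exon length L and scaled to reads per kilobase per million mapped reads

C(gene) = the number of mappable reads that fall onto a gene's exons
N = the total number of mappable reads in the experiment
L(gene) = the sum of the exon lengths in base pairs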

Wold (2008)
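A minimal sketch of the RPKM calculation, using the definitions of C, N and L given on this slide. The function name and the example numbers are illustrative only; they are not taken from the talk's data.

```python
def rpkm(gene_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads: 1e9 * C / (N * L)."""
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total_mapped_reads and exon_length_bp must be positive")
    return 1e9 * gene_reads / (total_mapped_reads * exon_length_bp)


# Same read count, different transcript lengths: the longer gene gets a lower RPKM.
short_gene = rpkm(gene_reads=500, total_mapped_reads=20_000_000, exon_length_bp=1_000)
long_gene = rpkm(gene_reads=500, total_mapped_reads=20_000_000, exon_length_bp=5_000)
print(short_gene, long_gene)
```

The example shows why length normalization matters: without dividing by L, both genes would appear equally expressed.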

Page 35: 2012 august 16 systems biology rna seq v2

Cufflinks: Transcript assembly, differential expression, and differential regulation for RNA-seq

Page 36: 2012 august 16 systems biology rna seq v2

Cuffdiff produces many output files:

1. Transcript FPKM expression tracking.
2. Gene FPKM expression tracking; tracks the summed FPKM of transcripts sharing each gene_id.
3. Primary transcript FPKM tracking; tracks the summed FPKM of transcripts sharing each tss_id.
4. Coding sequence FPKM tracking; tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id.
5. Transcript differential FPKM.
6. Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id.
7. Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id.
8. Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id.
9. Differential splicing tests: this tab-delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.
10. Differential promoter tests: this tab-delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.
11. Differential CDS tests: this tab-delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.
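Because these outputs are tab-delimited, they are easy to filter with a few lines of code. The sketch below pulls significant genes out of a gene-level differential file. The column names (`gene`, `status`, `q_value`, `significant`) are assumed here and should be checked against the header of the file your Cufflinks version writes; the rows are made-up placeholders, not results from this talk.

```python
import csv
import io

# Invented example rows with an assumed subset of columns.
EXAMPLE = (
    "gene\tsample_1\tsample_2\tstatus\tq_value\tsignificant\n"
    "GENE_A\tcontrol\ttreated\tOK\t0.001\tyes\n"
    "GENE_B\tcontrol\ttreated\tOK\t0.640\tno\n"
    "GENE_C\tcontrol\ttreated\tNOTEST\t1.000\tno\n"
)


def significant_genes(handle):
    """Return gene names whose test ran (status OK) and was flagged significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["gene"]
        for row in reader
        if row["status"] == "OK" and row["significant"] == "yes"
    ]


hits = significant_genes(io.StringIO(EXAMPLE))
print(hits)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.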

Page 37: 2012 august 16 systems biology rna seq v2

RNA-Seq Quantification Challenge: DESeq Method uses the geometric mean of counts in all samples

DESeq Method:
Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the count in your sample divided by the count in the reference sample.

Now you have, for each gene, an estimate of the depth ratio. Simply take the median of all the quotients to get the relative depth of the library.

The 'estimateSizeFactors' function of the DESeq package does this calculation.
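The three steps above can be sketched in plain Python. This is an illustration of the median-of-ratios idea as described on the slide, not the DESeq source code; in particular, skipping genes with a zero count in any sample (whose geometric mean would be zero) is an assumption made here so the quotients are defined.

```python
import math
from statistics import median


def size_factors(counts):
    """counts maps sample name -> list of per-gene raw counts (same gene order)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Step 1: reference sample = per-gene geometric mean across samples.
    reference = {}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):  # assumption: skip genes with any zero
            reference[g] = math.exp(sum(math.log(v) for v in values) / len(values))
    if not reference:
        raise ValueError("no gene has non-zero counts in every sample")
    # Steps 2 and 3: per-gene quotients, then the median quotient per sample.
    return {
        s: median(counts[s][g] / ref for g, ref in reference.items())
        for s in samples
    }


# Made-up counts: sample B was sequenced twice as deeply as sample A.
factors = size_factors({"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]})
print(factors)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale before testing for differential expression.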

Page 38: 2012 august 16 systems biology rna seq v2

DESeq: an R package that works with Raw Counts to determine genes differentially expressed across samples

• Simon Anders

Page 39: 2012 august 16 systems biology rna seq v2

Page 40: 2012 august 16 systems biology rna seq v2

Page 41: 2012 august 16 systems biology rna seq v2

What is Systems Biology?

Technology Advances Spur Research Advances

Systems Biology is a discipline that uses a multitude of measurement technologies to capture the entirety of a biological system's parts and then attempts to reverse engineer that biological system's ability to dynamically remodel in response to stimuli.

Sequencing technologies, Mass Spec technologies

Genome, Transcriptome, Proteome, Metabolome

Page 42: 2012 august 16 systems biology rna seq v2

Resources

• http://dx.doi.org/10.1038/npre.2010.4282.1 (DESeq)
• http://galaxy.psu.edu/
• http://seqanswers.com/
• http://www.broadinstitute.org/igv/
• http://bioviz.org/igb/index.html
• http://www.illumina.com
• http://www.otogenetics.com
• http://www.dnanexus.com
• http://cufflinks.cbcb.umd.edu/
• http://brb.nci.nih.gov/BRB-ArrayTools.html

Page 43: 2012 august 16 systems biology rna seq v2

Acknowledgements

Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Marcel Schmidt
Jean-Baptiste Masarati
Dr. Elena Tassi
The entire lab: Tibari, Ghada, Ivana, Eveline, and the rest of the Wellstein/Riegel laboratory

My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)

High Performance Core Group, Steve Moore, especially Woonki Chung
Amazon Cloud Services
Dr. Ann Loraine, UNC, IGB Developer
Brian Haas, Author, Trinity Suite

Page 44: 2012 august 16 systems biology rna seq v2

Given a list of differentially expressed genes, enrichment analysis should now be performed

• Enrichment analysis lets the researcher leverage documented experiments that provide evidence for genes' roles in pathways and functions, which helps the researcher interpret the results and significance of their own experiments
• DAVID
  – Gene ontology
  – Functional ontology
• REVIGO
  – Output of DAVID may be placed in REVIGO for further interpretation and statistical exploration of the significance of discovered sets of genes
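The core question in enrichment analysis is whether a pathway or ontology term contains more of your differentially expressed genes than chance would predict. A minimal sketch using the plain hypergeometric test follows; this is the generic over-representation statistic, not necessarily the exact statistic DAVID reports, and the function and argument names are illustrative.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `population` genes,
    of which `annotated` belong to the pathway or term of interest."""
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Tiny worked case: 10 genes, 5 in the pathway, 5 selected, all 5 overlap.
p = enrichment_p_value(population=10, annotated=5, selected=5, overlap=5)
print(p)
```

In practice the test is repeated for many terms, so the resulting p-values need multiple-testing correction.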

Page 45: 2012 august 16 systems biology rna seq v2

Using differentially expressed genes, biological pathways should be explored

• Differentially expressed genes are put into programs such as Pathway Studio or Ingenuity
• Shortest path programs and canonical pathway analysis enable a researcher to reverse engineer the pathways expressed in going from a healthy response to a diseased response
• Ideally a pathway explains the observed phenotype, connecting genotype, the gene expression program, and phenotype
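The "shortest path" idea can be illustrated with a breadth-first search over an interaction network. The network and gene names below are hypothetical placeholders; commercial pathway tools use curated interaction databases and their own algorithms, which this sketch does not reproduce.

```python
from collections import deque


def shortest_path(network, source, target):
    """Return the shortest list of nodes from source to target, or None."""
    if source == target:
        return [source]
    previous = {source: None}  # node -> node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == target:
                # Walk back through `previous` to rebuild the path.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical directed interactions (regulator -> target).
NETWORK = {
    "GENE_A": ["GENE_B", "GENE_C"],
    "GENE_B": ["GENE_D"],
    "GENE_C": ["GENE_D"],
    "GENE_D": ["GENE_E"],
}
path = shortest_path(NETWORK, "GENE_A", "GENE_E")
print(path)
```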

Page 46: 2012 august 16 systems biology rna seq v2

Page 47: 2012 august 16 systems biology rna seq v2

Scientific knowledge is limited (and advanced) by the limits (and advancements) of measurement

• Ilya Shmulevich, Genomic Signal Processing: “Validity of the model involves observation and measurement, scientific knowledge is limited by the limits of measurement”

• Erwin Schrödinger, Science Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”

Page 48: 2012 august 16 systems biology rna seq v2

Before library construction, RNA quality must be assessed

Before Library Construction
1. Most vendors and cores will assess the quality of the RNA before sequencing
2. Important to determine before sequencing begins

Garbage in == garbage out

Page 49: 2012 august 16 systems biology rna seq v2

Cluster Generation

• In the cBot cluster system, single molecules are isothermally amplified in a flow cell to prepare them for high-throughput sequencing
• The 8-channel Genome Analyzer flow cell has a dense lawn of oligos
• Single DNA molecules hybridize to the lawn of oligos
• Bound fragments are extended to make copies
• Copies are covalently bound to the flow cell's surface
• Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications, resulting in 100s of millions of unique clusters
• Reverse strands are cleaved and washed away
• Ends are blocked
• Sequencing primer is hybridized to the DNA template
• After cluster generation, libraries are ready for sequencing

Page 50: 2012 august 16 systems biology rna seq v2

Sequencing

• 100s of millions of clusters sequenced simultaneously
• Using 4 fluorescently labeled, reversibly terminated nucleotides
• Natural competition ensures highest accuracy
• After each round of synthesis, clusters are excited by a laser and emit a color that identifies the newly added base
• The fluorescent label and blocking group are then removed, allowing for the addition of the next nucleotide
• Proprietary chemistry (Illumina) reads a base in each cycle
• Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence

Page 51: 2012 august 16 systems biology rna seq v2

Page 52: 2012 august 16 systems biology rna seq v2

Systems Biology History (wikipedia)

• Systems biology roots are found in
  – Quantitative modeling of enzyme kinetics
  – Mathematical modeling of population growth
  – Simulations to study neurophysiology
  – Control theory and cybernetics
• Theorists
  – Ludwig von Bertalanffy – General Systems Theory
  – Alan Lloyd Hodgkin and Andrew Fielding Huxley – constructed a mathematical model that explained the action potential propagating along the axon of a neuron
  – Denis Noble – first computer model of the heart pacemaker

Page 53: 2012 august 16 systems biology rna seq v2

Institutes of Systems Biology

• 2000 – Institutes of Systems Biology established in Seattle and Tokyo
• After completion of Human Genome projects
• NSF grand challenge for systems biology: build a mathematical model of the whole cell