Intro: High-throughput Sequencing - i-med.ac.at · Third generation sequencing The Technology Next...
Intro: High-throughput Sequencing
„Next generation sequencing“
NGS
Anne Krogsdam
Div Bioinformatics & I-med GenomeSeq Core
MolMed WM2, 2016-05-10
This lecture is part of the „Bioinformatics“ submodule,
during which you will
Learn the basic concepts of NGS (today)
Generate your own cancer-normal cell libraries for sequencing (tomorrow)
Analyse your data, May 23-25
Learn more about how very large scale NGS research projects can lead to
increased medical precision. June 13-14
Use what you have learned, to present specific themes/techniques (Seminar)
- And some hot papers relating to this field (Journal club). June 15-16
Today: The basics
What is NGS? what is sequencing? A brief history and some technical facts
What can we do (investigate) with NGS? (a motivational detour)
Methodology,
Bench and bedside
What will you be doing with NGS in the lab tomorrow? Thematic intro and method walk-through
Disclaimer
• The talk today is intended for educational purposes
• This is the fastest moving technology within current Biotech
• The contents of this presentation may already be out of date (!)
What is NGS? What is sequencing?
Sequence: follow/following…… arrange in a particular order
ascertain the sequence of amino-acid or nucleotide residues in (a protein,
DNA, etc.)
NGS:
High-throughput: only nucleotide-sequences (RNA, DNA)
A couple of happy Nobel winners
A brief history of DNA sequencing technologies
2008 +
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available
at: www.genome.gov/sequencingcosts. Accessed 2016-05-09.
A brief history of DNA sequencing technologies
Moore's law: ”the number of transistors in a dense integrated circuit doubles approximately every two years”, describing the speed of computing, considered the most epic development of our time…
Moore's law applied to sequencing: the cost per megabase/genome halves every two years….
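As a rough illustration of the Moore's-law baseline described above, the sketch below projects a cost that halves every two years. The starting cost of 100.0 is a made-up placeholder, not a value from the NHGRI data, and the function name is purely illustrative.

```python
def moores_law_cost(start_cost, years_elapsed, halving_period_years=2.0):
    """Cost after `years_elapsed` if it halves every `halving_period_years`."""
    return start_cost * 0.5 ** (years_elapsed / halving_period_years)


# Hypothetical starting cost (arbitrary units), for illustration only.
START_COST = 100.0

for years in (0, 2, 4, 8):
    print(f"after {years} years: {moores_law_cost(START_COST, years):.2f}")
```

Comparing real sequencing costs against this curve shows when sequencing started to drop faster than the Moore's-law baseline.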
Remember what happened in 2006-2008 ?
Sequencing Technology
Current DNA sequencing technologies
So how does it work?
Let's start with one of the first NGS machines,
It is still the most popular for massively high throughput sequencing!
updated Jan 2015
Output range: 0.3 – 1800 Gb
Run time: 5 h – 3 days
Single-end reads per flow cell: 20 million – 3 billion
Max read length: 150 – 300 bp
Solexa/Illumina
A rather boring box, but highly productive!
Illumina – NGS
amplified single-molecule sequencing Template amplification
Complementary strand elongation: DNA Polymerase
Illumina – NGS
amplified single-molecule sequencing Sequencing
Paired-end technology was generated to compensate for the rather short read lengths (originally max 75 bp).
Today, single reads are generally preferred.
Instrument throughput is estimated using typical runs at a density of 700,000/panel.
Each Flowchip has 6 lanes = 4.2 billion reads per Flowchip, >8 billion reads per run (2 FCs), of which usually >65% pass final high-quality filtering (5.2 billion).
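The throughput arithmetic above can be checked with a few lines. This is only a sketch using the rounded figures quoted on the slide (4.2 billion reads per Flowchip, about 8 billion per run, 65% passing); the function and constant names are illustrative.

```python
READS_PER_FLOWCHIP = 4_200_000_000   # figure quoted on the slide
FLOWCHIPS_PER_RUN = 2
READS_PER_RUN_ROUNDED = 8_000_000_000  # the slide's ">8 billion" lower bound
PASS_PERCENT = 65


def reads_passing_filter(total_reads, pass_percent):
    """Reads left after quality filtering (integer arithmetic, rounded down)."""
    return total_reads * pass_percent // 100


raw_per_run = READS_PER_FLOWCHIP * FLOWCHIPS_PER_RUN
passing = reads_passing_filter(READS_PER_RUN_ROUNDED, PASS_PERCENT)

print(f"raw reads per run: {raw_per_run / 1e9:.1f} billion")
print(f"reads passing filter: {passing / 1e9:.1f} billion")
```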
SOLiD Technology (ABI, LifeTech)
Prettier, smarter…. But slightly less productive….
ABI SOLiD: Sequencing by ligation:
Fluorescently Labeled octamers Template amplification
Templating: generating monoclonal clusters with „wildfire“
Isothermal Template walking
Complementary strand elongation: DNA Ligase
ABI SOLiD: Sequencing by ligation:
Fluorescently Labeled octamers Sequencing
Satay plots: each dot represents fluorescent imaging of a cluster.
This cluster is still within good/best range (close to axis), but part of the cluster is green, meaning that only part of the cluster reads as blue; this translates into a quality measure for that ligation / basecall.
Fluorescent imaging of the clusters
5 reading frames, each position is read twice. Optionally: an additional 6th frame can be read, increasing the basecall fidelity to 99.99%
But the SOLiD is being phased out…
So why bother to learn about the technology….
It comes later… wait for it!
Ion PGM™
Chip                Run time (200 bp read)   Run time (400 bp read)   Output (200 bp read)   Output (400 bp read)
Ion 314™ Chip v2    2.3 hr                   3.7 hr                   30–50 Mb               60–100 Mb
Ion 316™ Chip v2    3.0 hr                   4.9 hr                   300–500 Mb             600 Mb–1 Gb
Ion 318™ Chip v2    4.4 hr                   7.3 hr                   600 Mb–1 Gb            1.2–2 Gb

Ion Proton™
Chip                Run time      Output (200 bp read)       Output (400 bp read, S5)
Ion PI™ Chip        2–4 hr        10–15 Gb (>60M reads)      16 Gb (40M reads)
Ion P2 Chip*        2–8 hr (?)    >30 Gb (200M reads), coming 2016
Ion semiconductor sequencing: First „Post-light“ technology
„The Chip is the machine TM“ Ion PGM (2010) Ion Proton (2013) Ion S5 (2015)
Sub-micron-sized spheres coated with P1 oligos
Library / template (with P1 and P2 adapters)
Release spheres (monoclonally coated with library) from emulsion
Magnetic beads coated with P2 oligos are used to pull out the templated spheres (enrichment)
Ion semiconductor sequencing: Templating by emulsion PCR
Templated spheres & DNA polymerase; semiconductor chip
Polymerase is bound to the templated, enriched spheres
Templated spheres are loaded onto the chip, aiming at one sphere in each well inside the chip
Chip is loaded in the Ion sequencer
Chip is flushed with dNTPs, one at a time (dATP, dCTP, dGTP, dTTP)
Incorporation of a matching nucleotide (e.g. dCTP) releases H+, which the chip detects
Ion semiconductor sequencing: Sequencing by synthesis
The application of natural (non-labeled) dNTPs gives a better, unbiased global sequence coverage
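To make the "one dNTP at a time" idea concrete, here is a minimal, idealized sketch of flow-based sequencing by synthesis. It assumes a noise-free signal in which the H+ release is exactly proportional to the number of bases incorporated in a flow, and it models the growing strand directly rather than the template. The flow order A, C, G, T follows the slide; the function names are hypothetical, and real instruments use their own flow orders and signal processing.

```python
FLOW_ORDER = "ACGT"  # dATP, dCTP, dGTP, dTTP, flushed one at a time


def flow_signals(read, flow_order=FLOW_ORDER, max_flows=1000):
    """Return (nucleotide, incorporations) for each flow until the read is done.

    A homopolymer such as "AA" is incorporated within a single flow and
    therefore gives a signal of 2 in this idealized model.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        count = 0
        while position < len(read) and read[position] == base:
            count += 1
            position += 1
        signals.append((base, count))
        flow += 1
    return signals


def call_bases(signals):
    """Turn flow signals back into a base sequence."""
    return "".join(base * count for base, count in signals)


signals = flow_signals("AACGTT")
print(signals)
print(call_bases(signals))
```

Flows with a count of 0 correspond to flushes in which no nucleotide matched, so no H+ is released.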
PacBio RS: Single molecule sequencing with labeled dNTPs
Single molecule resolution in real time
• Short waiting time for result and simple workflow
  – Generate basecalls in <1 day
  – Polymerase speed ≥1 base per second
• No amplification required
  – Bias not introduced
  – More uniform coverage
• Direct observation
  – Distinguish heterogeneous samples
  – Simultaneous kinetic measurements
• Long reads (8,000–30,000 nt)
  – Identify repeats and structural variants
  – Less coverage required
• Information content
  – One assay, multiple applications
    • Genetic variation (SVs to SNPs)
    • Methylation
    • Enzymology
- Long reads 6-10 kb
- Median size of molecules 3 kb
- Still 15% error rate
- No strobe sequencing
Software focus on:
De novo assembly
High-quality CCS consensus reads
In preparation
- Load long molecules by magnetic beads
- Modified nucleotide detection
SMRT® Technology
The PacBio RS II is based on novel Single-Molecule, Real-Time (SMRT) technology which enables the observation
of DNA synthesis by a DNA polymerase in real time. Sequencing occurs on SMRT Cells, each containing thousands
of Zero-Mode Waveguides (ZMWs) in which polymerases are immobilized. The ZMWs provide a window for
watching the DNA polymerase as it performs sequencing by synthesis.
Zero-Mode Waveguide with embedded polymerase and added library
Discover the Epigenome
The PacBio® RS II detects DNA
base modifications using the kinetics
of the polymerization reaction
during sequencing.
Sample preparation
Standard: LS – long sequencing reads
• Large insert sizes (2 kb-10 kb)
• Generates one pass on each molecule sequenced
Circular Consensus: CCS – high quality sequencing reads
• Small insert sizes 500 bp
• Generates multiple passes on each molecule sequenced
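Why multiple passes over the same molecule raise quality can be sketched with a simple majority vote. This is a toy model, not the vendor's consensus algorithm: it assumes the passes are already aligned and of equal length, and it only handles substitution errors. The example passes are invented.

```python
from collections import Counter


def consensus(passes):
    """Per-position majority vote over equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five hypothetical passes over the same insert, each with at most one error.
passes = ["ACGTAC", "ACCTAC", "ACGTAC", "AGGTAC", "ACGTAA"]
print(consensus(passes))
```

Because the errors in single passes fall at different positions, the vote at each position is dominated by the correct base.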
The Technology
First generation sequencing
The first sequencing methods, developed by Gilbert and Sanger, read multiple copies of one DNA molecule at a time. Mainly used for confirming single gene mutations or for validating cloning constructs (one sample prep -> one sequence read).
Second generation sequencing
The first next generation sequencing (NGS), also known as massively parallel sequencing, deep sequencing or genome sequencing: relies on amplified template libraries. Investigating RNA-omes and DNA-omes or many selected RNA/DNA targets in parallel. Each base call relies on incorporation of fluorescent probes.
Third generation sequencing
Next generation sequencing, capable of sequencing amplified DNA/RNA, through „natural“ detection of bases (no bulky fluorescent probes).
Fourth generation sequencing
Next generation sequencing, capable of sequencing DNA/RNA molecules directly, without previous amplification. PCR bias avoided.
Fifth generation sequencing
In situ sequencing for RNA analysis in preserved tissue and cells.
High-Throughput Seq (one sample prep -> billions of reads)
And, wait for it,…..
Highly Multiplexed Subcellular RNA
Sequencing in Situ
NGS in detail
Most frequent technologies - how does it work
Applications
What can we assay with these NGS-platforms, capable of
reading billions of sequences (reads) from each sample?
Basically any source of RNA or DNA we can imagine….
„The Glory of Science is to imagine more than we can prove.
The Fringe is the unexplored Territory where Truth and Fantasy
are not yet disentangled.“
Physicist Freeman Dyson
NGS in detail
But let's stick with some standard, validated applications for now
What can we sequence? Anything that can be converted into DNA. RNA -> cDNA before sequencing.
Sources of RNA: cells, tissues, body fluids (blood, sweat, urine, tears,….), microbiomes, FFPE samples (biobanks)
Can we sequence an entire pool of total RNA at once? (i.e. all RNA from a tissue)
- Yes! ….. in principle ….. but…..
2 good reasons why it is rarely done:
RNA is an extremely diverse group of highly structured molecules, often with modified nucleotides -> one library prep method alone will not be optimal for all species of RNA in a pool.
To get good coverage, more reads are required than for sequencing a selected subgroup of RNA (e.g. protein coding RNA constitutes less than 0.5% of total RNA) = much more expensive. Better to spend the money on more biological replicates!!!
! Important! Specify the research question in advance
so that a focused approach can be selected!
How can we focus our approach?
Sample preparation:
The most common, standardized („kitti-fied“) options:
miRNA: RNA is size-selected to enrich for miRNA in the final sample. Size selection is repeated after conversion into cDNA.
mRNA: RNA is enriched for poly-A tailed RNA by capture to polyT oligos (0.5% -> 20-30%).
„total“ RNA: rRNA is depleted through capture to complementary probes (1-2% -> 80-90%).
Targeted enrichment of RNAs of interest: currently 2 methods, capture by binding to complementary probes or PCR with target-specific primers; both methods are primarily designed for DE-seq.
NB! Enrichment and depletion are very efficient but not completely perfect methods:
capture probes tend to also bind partially overlapping RNA, PCR is never unbiased…..
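The cost argument for enrichment can be made concrete with a small calculation: how many total reads are needed to obtain a given number of reads from the RNA class of interest. The fractions 0.5% (mRNA in unenriched RNA) and roughly 25% (midpoint of the 20-30% after poly-A capture) come from the figures above; the target of 10 million mRNA reads is a hypothetical example, and the function name is illustrative.

```python
from fractions import Fraction
from math import ceil


def total_reads_needed(target_reads, target_fraction):
    """Total reads required so that `target_fraction` of them yields `target_reads`."""
    fraction = Fraction(str(target_fraction))
    if not 0 < fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return ceil(target_reads / fraction)


TARGET_MRNA_READS = 10_000_000  # hypothetical goal, for illustration only

unenriched = total_reads_needed(TARGET_MRNA_READS, 0.005)  # 0.5% mRNA
enriched = total_reads_needed(TARGET_MRNA_READS, 0.25)     # ~25% after poly-A capture

print(f"without enrichment: {unenriched:,} reads")
print(f"with poly-A enrichment: {enriched:,} reads")
```

Under these assumptions the enriched library needs 50 times fewer reads, which is the budget that can go into more biological replicates.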
How can we focus our approach?
Sample preparation:
The most common, standardized („kitti-fied“) options:
DNA:
Exome-seq: Targeted capture of exons (+++)
Whole Genome Seq: Long reads, or mate-pair libraries
Panel seq: Targeted sequencing of specified genes / regions of interest.
NB: now also possible for RNA targets.
Frese et al, Biology 2013
Lots of omics…….
Applications of NGS technologies
Massively parallel seq (NGS-seq): Output data in the „omics“ size
Applications of NGS technologies
Rizzo J M , and Buck M J Cancer Prev Res 2012;5:887-900 ©2012 by American Association for Cancer Research
Standard Applications of NGS technologies
RNA-Seq Workshop, ICBI Nov
2014
Other, user-validated, manual sample prep approaches:
RIP-RNA (i.e. enrichment of miRNA bound to Dicer or mRNA bound to splicing factors)
ribosome fractionation (enrichment for translated RNA)
total small RNA-seq (enrichment for RNA of 10-200 nt)
tRNA-seq (enrichment of RNA 40-100 nt, and modified library prep to allow complex structures to be transcribed into cDNA)
circular RNA-seq (unique cDNA method to generate linear cDNA for library prep)
targeted RNA selection (custom-design probes or primers to select desired pools of RNA)
PCR-based selection is called: AmpliSeq
Knowing how sample preparation and library generation are done allows you to filter out the random noise stemming from technical sources.
DNA / RNA -> Fragment -> Add adapters -> PCR -> Sequencing -> Data output -> Analysis
(Fragment, Add adapters, PCR = Library preparation)
Applications of NGS technologies
Basic workflow for NGS experiments.
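The first two library preparation steps of this workflow (fragment, add adapters) can be mimicked on a string. This is a deliberately simplified sketch: real fragmentation is random rather than fixed-size, the adapter sequences below are placeholders rather than real adapter sequences, and the PCR step is not modeled.

```python
def fragment(dna, fragment_size):
    """Cut a sequence into consecutive pieces of at most `fragment_size` bases."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [dna[i:i + fragment_size] for i in range(0, len(dna), fragment_size)]


def add_adapters(fragments, adapter_5p, adapter_3p):
    """Attach one adapter to each end of every fragment."""
    return [adapter_5p + piece + adapter_3p for piece in fragments]


# Placeholder adapters, for illustration only.
ADAPTER_5P = "AAAA"
ADAPTER_3P = "TTTT"

fragments = fragment("ACGTACGTAC", 4)
library = add_adapters(fragments, ADAPTER_5P, ADAPTER_3P)
print(fragments)
print(library)
```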
Simplified view of NGS library preparation workflows for genomic DNA (gDNA)-seq and ChIP-seq.
Fragmentation options:
Sonication (if high-quality sonicator available)
Transposon-based fragmentation (Nextera, MuSeek)
Enzymatic digestion
Sonication (cryofragmentation)
* Clean, RNA-free, HQ DNA required
Modified from Dijk et al., Exp Cell Research 322 (2014) 12–20
Capture-seq (Exome-seq)
Applications of NGS technologies – Library Prep
Simplified view of NGS library preparation workflows for genomic DNA (gDNA)-seq AMPLI-Seq.
gDNA (10 ng) + multiplexed primers (several hundred amplicons possible in one reaction)
-> PCR -> Purify amplicons -> Adapter ligation
Some of the most applied primer panels:
Ion AmpliSeq™ Cancer Hotspot Panel v2
Ion AmpliSeq™ Inherited Disease Panel
Ion AmpliSeq™ Comprehensive Cancer Panel
Illumina TruSeq Cancer panels
Applications of NGS technologies – Library Prep
Dijk et al., Exp Cell Research 322 (2014) 12–20
The most common RNA-seq protocols fall in three main classes
(A) Classical Illumina protocol. Random-primed double-stranded cDNA synthesis is followed by adapter ligation and PCR.
(B) One class of strand-specific methods relies on marking one strand by chemical modification. The dUTP second strand marking method follows basically the same procedure as the classical protocol except that dUTP is incorporated during second strand cDNA synthesis, preventing this strand from being amplified by PCR. Most current transcriptome library preparation kits follow the dUTP method.
(C) The second class of strand-specific methods relies on attaching different adapters in a known orientation relative to the 5' and 3' ends of the RNA transcript. The Illumina ligation method is a well-known example of this class and is based on sequential ligation of two different adapters. Most current small RNA library preparation kits follow the RNA ligation method.
In Ion Torrent library prep both adapters are ligated in one reaction step, applying end-specific adapter structures.
small RNA enrichment
High-quality total RNA (RIN > 8), DNA free (!)
Applications of NGS technologies – Analysing billions of bases (Gb)
These standardized methods are good! and getting better all the
time…. but they are not perfect….
Typical problems to watch out for:
Size selection: is never completely reproducible at the cutting points…. so if you have a stringent size selection of 20 – 200 nt, expect that expression profiles of RNA at 20 nt or 200 nt will most likely be artefacts and not very reproducible between biological samples.
Enrichment of polyA RNA to obtain mRNA: captures more than mRNA. lncRNA is included, and some rRNA may still be present.
Ribo depletion: is never perfect, usually 10-20% remains „undepleted“. If you find rRNA in enriched/depleted samples: ignore it!!!! Its levels are not biologically correct information.
Strand specificity: is constantly improving, but some RNAs may still be present in the final library in the „wrong“ direction.
PCR amplification: the entire process of library prep (and templating) relies heavily on PCR amplification. Even when a minimal number of PCR cycles is applied there will be some „skewing“ of the data. However, this is partly compensated when samples are processed very conservatively and absolutely identically (at best simultaneously).
PCR duplicates: are hard to avoid. Again, a lower number of PCR cycles is the best cure (to check data for PCR duplicates I recommend the „FastQC“ software). Try not to compare samples with very different % of PCR duplicates.
Don't ever expect to obtain identical results with two different methods or, even worse, two different seq platforms.
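A crude way to see what a duplicate percentage means is to count reads with identical sequences. This is a simplified sketch, not how dedicated QC tools work in detail (they typically subsample the data or use mapping coordinates); the reads below are invented and the function name is illustrative.

```python
from collections import Counter


def duplicate_fraction(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    unique_sequences = len(Counter(reads))
    return (len(reads) - unique_sequences) / len(reads)


# Hypothetical reads from two samples, for illustration only.
sample_a = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
sample_b = ["ACGT", "TTGA", "GGCC", "CATG", "TGCA"]

print(f"sample A duplicates: {duplicate_fraction(sample_a):.0%}")
print(f"sample B duplicates: {duplicate_fraction(sample_b):.0%}")
```

Two samples with duplicate levels as different as these would be poor candidates for a direct comparison.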
- Lots of technical and methodological options…
- Let's mentally climb a mountain and get an overview!
What do all these techniques have in common?
What are the major differences?
If you could do an NGS experiment right now,
what would you like to do?
How?
Sequencing is successfully completed!
Time for analysis
So…. Wild and amazing techniques and methods!
Constantly new developments!
But how does it affect our lives?
as scientists?
as humans?
NGS in science
over 23 000 bacterial genomes,
400 fungal genomes and
100 protist genomes, in addition to
55 genomes from invertebrate metazoa
and 39 genomes from plants
NB: Bacterial genomes ->
microbiomes….
So today…
we can read DNA
Or rather: Genomes, RNA-omes / Transcriptomes (-> cDNA)
DISCOVERY
Genome-seq / Exome-seq: rare disease genes, cancer mutation profiles
Transcriptome-seq: >75% of all genes express several alternative transcripts
„Dark matter“ RNA discovered and enlightened
Body microbiome-seq: „human body atlas“ consortium. A new hot clue to health
Immuno-seq: characterizing immune populations (HLA, antigen, TCR, BCR etc.), e.g. in tumor infiltrating lymphocytes (TILs)
DE-seq: rapid, unbiased differential gene expression
Personalized medicine
Agricultural development
Understanding origins (anthropology,
human identification)
NGS in science
In 2006, Price et al. used principal components from genotype data to infer population structure and showed how it can help remove false associations in a GWAS setting. Later, two papers from J. Novembre showed how genotypes reflect continuous population structure across Europe.
Genotypes reflect geography
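The idea of inferring population structure from principal components can be illustrated on a tiny, invented genotype matrix (rows are individuals, columns are SNPs coded 0/1/2). The sketch below finds the first principal component by power iteration in plain Python; it is a toy version of the idea, not the method or software used in the cited papers, which also normalize each SNP and handle far larger data.

```python
import math


def first_principal_component(genotypes, iterations=100):
    """Return each individual's score on the first principal component."""
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Center every SNP column on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_snp)]
    centered = [[row[j] - means[j] for j in range(n_snp)] for row in genotypes]

    # Power iteration on X^T X, starting from a non-uniform vector.
    v = [1.0 / (j + 1) for j in range(n_snp)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n_ind))
             for j in range(n_snp)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            raise ValueError("genotypes show no usable variation")
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Invented genotypes for two hypothetical populations.
toy_genotypes = [
    [2, 2, 0, 0, 1],  # population A
    [2, 1, 0, 0, 1],  # population A
    [0, 0, 2, 2, 1],  # population B
    [0, 1, 2, 2, 1],  # population B
]
pc1 = first_principal_component(toy_genotypes)
for label, score in zip("AABB", pc1):
    print(label, round(score, 2))
```

Individuals from the same toy population end up on the same side of the first component, which is the kind of separation used to correct for population structure.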
NGS in science
So already as a science tool, we are bridging the effects for humans….
Bench Bedside
What is Personalized Medicine?
Only 5 years ago it was still a futuristic theoretic idea,
people discussed whether they „believed“ in it or not….
Politicians didn't discuss it at all….. (= no legislation)
Today its first implementations are becoming routine,
the discussions are about: more, faster, better;
accuracy and simplicity.
Politicians are struggling to catch up with at least some
legislation.
Empirical medicine
Precision medicine
Personalized medicine
Copyright © 2015 American Medical
Association. All rights reserved.
From: Population and Personalized Medicine in the Modern Era
JAMA. 2014;312(19):1969-1970. doi:10.1001/jama.2014.15224
Figure Legend: Tools Being Used in Clinical Research to Understand Population and Personalized Medicine
Why is „personalized medicine“ developing so fast?
Why Four Letters of Genetic Code Are So Complicated
It is evident that the genome is far more complex than previously thought.
While the understanding of its coding regions has considerably advanced, 99% of
the non-coding sequence is still challenging researchers from different disciplines
to finally unravel all of the functions of the genome.
Additionally, genetic variation across different individuals and populations is higher
than estimated, and the transition from common to rare variants is fluid, making
interpretation of their functional relevance difficult; structural changes, such as
insertions, deletions or copy number variations, are far more frequent than
previously thought. For instance, the Database of Genomic Variants (DGV) lists
about 60,000 CNVs, 850 inversions and 30,000 insertion/deletions identified in
healthy individuals
Frese et al, Biology
Applications of NGS in personalized medicine
Whole-exome sequencing to identify two disease-causing mutations in a patient with oculocutaneous albinism and congenital neutropenia (Cullinane et al., 2011).
Nature Reviews genetics
Applications of whole genome sequencing: mainly to identify conserved / deviating sequences between groups of individuals or between species, to learn more about the stability and function of the human genome.
Applications of Exome-seq: primary choice for identifying mutations by comparison
Genomic DNA (gDNA)-seq, Exome-seq analysis
Appendix A: List of drugs currently approved by the Food and Drug Administration (FDA) with associated
pharmacogenomic information. Specific nucleic sequence variants (genetic polymorphisms) in genes lead
to varying metabolism and/or distribution of individual drugs (Gullapalli et al, J Pathol Inform 2012)
Tomorrow in the lab
2 human cell lines
hMADS: human multipotent adipose-derived stem cells
Jurkat: immortalized line of human T lymphocyte cells that are used to study acute T cell leukemia
gDNA gDNA
Ion cancer hotspot AmpliSeq
To identify hotspot cancer alleles in the following target genes:
The hours are true…. when you have learned how to do it, and have a „pipeline“ all set up and run in.
However, we are going to take a little longer, for educational purposes
Let's go through the protocol to see how this is actually done
Separate pdf file + Run report example
What do the „reads“ actually look like in terms of genes and mutations?
IGV is freely available from the Broad Institute homepage.
Data Output:
Files containing hundreds of thousands of reads…
From data to results
Analysis part of the lab experiment:
Turning read data into scientific results……
Wishing you joyful learning and experimenting!!!!