Intro: High-throughput Sequencing - i-med.ac.at · Third generation sequencing The Technology Next...

Intro: High-throughput Sequencing

„Next generation sequencing“

NGS

Anne Krogsdam

Div Bioinformatics & I-med GenomeSeq Core

MolMed WM2, 2016-05-10

This lecture is part of the „Bioinformatics“ submodule,

during which you will

Learn the basic concepts of NGS (today)

Generate your own cancer-normal cell libraries for sequencing (tomorrow)

Analyse your data, May 23-25

Learn more about how very large scale NGS research projects can lead to

increased medical precision. June 13-14

Use what you have learned, to present specific themes/techniques (Seminar)

- And some hot papers relating to this field (Journal club). June 15-16

Today: The basics

What is NGS? What is sequencing? A brief history and some technical facts

What can we do (investigate) with NGS? (a motivational detour)

Methodology,

Bench and bedside

What will you be doing with NGS in the lab tomorrow? Thematic intro and method walk-through

Disclaimer

• The talk today is intended for educational purposes

• This is the fastest moving technology within current Biotech

• The contents of this presentation may already be out of date (!)

What is NGS? What is sequencing?

Sequence: follow/following…… arrange in a particular order

ascertain the sequence of amino-acid or nucleotide residues in (a protein, DNA, etc.)

NGS:

High-throughput: only nucleotide-sequences (RNA, DNA)

A couple of happy Nobel winners

A brief history of DNA sequencing technologies

2008 +

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available

at: www.genome.gov/sequencingcosts. Accessed 2016-05-09.

A brief history of DNA sequencing technologies

Moore's law: „the number of transistors in a dense integrated circuit doubles approximately every two years“. It describes the speed of computing and is considered the most epic development of our time…

Moore's law applied to sequencing: the cost per megabase/genome halves every two years….
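To make the comparison concrete, here is a minimal sketch of what a "Moore's law" cost curve looks like: the cost halves once per two-year period. The starting cost of 1000 is a hypothetical value chosen only to keep the arithmetic easy to follow; it is not taken from the NHGRI data.

```python
def cost_after(start_cost, years, halving_period_years=2.0):
    """Cost after `years` if the cost halves once every `halving_period_years`."""
    return start_cost * 0.5 ** (years / halving_period_years)


# Hypothetical starting cost (arbitrary units), for illustration only.
start = 1000.0
for years in (0, 2, 4, 10):
    print(years, cost_after(start, years))
```

Any real cost curve that falls faster than this line is outpacing Moore's law, which is the point of comparing the two on the NHGRI plot.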

Remember what happened in 2006-2008 ?

http://www.genome.gov/sequencingcosts

Sequencing Technology

Current DNA sequencing technologies


So how does it work?

Let's start with one of the first NGS machines.

It is still the most popular for massively high throughput sequencing!

updated Jan 2015

Output range: 0.3 – 1800 Gb
Run time: 5 h – 3 days
Single-end reads per flow cell: 20 million – 3 billion
Max read length: 150 – 300 bp

Solexa/Illumina


A rather boring box, but highly productive!

Illumina – NGS

amplified single-molecule sequencing Template amplification



Complementary strand elongation: DNA Polymerase


Illumina – NGS

amplified single-molecule sequencing Sequencing



Paired-end technology was developed to compensate for the rather short read lengths (originally max. 75 bp).

Today, single reads are generally preferred.

Instrument throughput is estimated using typical runs at a density of 700,000/panel.

Each Flowchip has 6 lanes = 4.2 billion reads per Flowchip, >8 billion reads per run (2 FCs), of which usually >65% pass final high-quality filtering (5.2 billion).
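The throughput figures above are simple arithmetic and can be checked with a short sketch. The constants are the values quoted on the slide; the function name is made up for this example.

```python
READS_PER_FLOWCHIP = 4.2e9   # from the slide: 6 lanes per Flowchip
FLOWCHIPS_PER_RUN = 2        # from the slide: 2 FCs per run
PASS_FRACTION = 0.65         # from the slide: >65% pass final filtering


def passing_reads_per_run(reads_per_flowchip=READS_PER_FLOWCHIP,
                          flowchips=FLOWCHIPS_PER_RUN,
                          pass_fraction=PASS_FRACTION):
    """Reads per run that survive the final quality filter."""
    raw_reads = reads_per_flowchip * flowchips
    return raw_reads * pass_fraction


print(f"{passing_reads_per_run():.3g} reads pass filtering")
```

With the exact inputs this gives about 5.46 billion; the rounder 5.2 billion on the slide corresponds to 65% of 8 billion.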

SOLiD Technology (ABI, LifeTech)


Prettier, smarter…. But slightly less productive….

ABI SOLiD: Sequencing by ligation:

Fluorescently labeled octamers: Template amplification

Templating: generating monoclonal clusters with „wildfire“

Isothermal Template walking


Complementary strand elongation: DNA Ligase

Fluorescently labeled octamers: Sequencing

Fluorescent imaging of the clusters

Satay plots: each dot represents the fluorescent imaging of a cluster.

This cluster is still within the good/best range (close to the axis), but part of the cluster is green, meaning that only part of the cluster reads as blue; this translates into a quality measure for that ligation / basecall.


Fluorescently Labeled octamers Sequencing



5 reading frames; each position is read twice. Optionally, an additional 6th frame can be read, increasing the basecall fidelity to 99.99%.
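Why five reading frames lead to every position being read twice can be illustrated with a small counting sketch. It rests on assumptions not spelled out on the slide: each ligated probe interrogates two adjacent bases, each ligation cycle advances by five bases, and each of the five primer rounds starts one base further upstream than the previous one. The function name and parameters are hypothetical.

```python
from collections import Counter


def interrogation_counts(read_length, primer_rounds=5, step=5, bases_per_probe=2):
    """Count how often each read position (1-based) is interrogated.

    Assumed model: in primer round `offset`, probes start at positions
    1 - offset, 1 - offset + step, ... and each probe reads
    `bases_per_probe` adjacent bases. Positions below 1 lie in the adapter.
    """
    counts = Counter()
    for offset in range(primer_rounds):
        start = 1 - offset
        while start <= read_length:
            for pos in range(start, start + bases_per_probe):
                if 1 <= pos <= read_length:
                    counts[pos] += 1
            start += step
    return counts


counts = interrogation_counts(50)
print(sorted(set(counts.values())))  # the distinct coverage values over the read
```

Under these assumptions the five offset rounds tile the read so that every base falls inside exactly two probes, which is what makes a per-position consistency check possible.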




But the SOLiD is being phased out…

So why bother to learn about the technology….

It comes later… wait for it!

Ion PGM™

Chip               Run time                   Output
                   200 bp read  400 bp read   200 bp read   400 bp read
Ion 314™ Chip v2   2.3 hr       3.7 hr        30–50 Mb      60–100 Mb
Ion 316™ Chip v2   3.0 hr       4.9 hr        300–500 Mb    600 Mb–1 Gb
Ion 318™ Chip v2   4.4 hr       7.3 hr        600 Mb–1 Gb   1.2–2 Gb

Ion Proton™

Chip            Run time     Output
                             200 bp read             400 bp read (S5)
Ion PI™ Chip    2–4 hr       10–15 Gb (>60M reads)   16 Gb (40M reads)
Ion P2 Chip*    2–8 hr (?)   >30 Gb (200M reads)

* coming 2016

Ion semiconductor sequencing: First „Post-light“ technology

„The Chip is the machine TM“ Ion PGM (2010) Ion Proton (2013) Ion S5 (2015)


http://www.lifetechnologies.com/order/catalog/product/4482261?ICID=search-product







Ion semiconductor sequencing


• Sub-micron-size spheres coated with P1 oligos
• Library / template: flanked by P1 and P2 adapters
• Release spheres (monoclonally coated with library) from emulsion
• Magnetic beads coated with P2 oligos are used to pull out the templated spheres (enrichment)

Ion semiconductor sequencing: Templating by emulsion PCR


Templated spheres & DNA polymerase; semiconductor chip (figure labels: dCTP, H+)

• Polymerase is bound to the templated, enriched spheres
• Templated spheres are loaded onto the chip, aiming at one sphere in each well inside the chip
• The chip is loaded into the Ion sequencer
• The chip is flushed with dNTPs, one at a time (dATP, dCTP, dGTP, dTTP)
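The flow-by-flow principle can be sketched in a few lines: the chip sees one nucleotide at a time, and the signal in each flow is taken to be the number of bases incorporated (zero if the nucleotide does not match, more than one for a homopolymer run). This is an idealised toy model, not the instrument's signal processing; the A, C, G, T flow order follows the order listed on the slide, and the function names are made up for this example.

```python
from itertools import cycle

FLOW_ORDER = "ACGT"  # order taken from the slide: dATP, dCTP, dGTP, dTTP


def flowgram(read, flow_order=FLOW_ORDER, max_flows=400):
    """Return (nucleotide, incorporations) per flow for the strand being synthesised."""
    signals = []
    pos = 0
    for flow_number, nucleotide in enumerate(cycle(flow_order)):
        if pos >= len(read) or flow_number >= max_flows:
            break
        run = 0
        # Incorporate as many consecutive matching bases as possible in this flow.
        while pos < len(read) and read[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
    return signals


def call_bases(signals):
    """Rebuild the sequence from the idealised per-flow signals."""
    return "".join(nucleotide * run for nucleotide, run in signals)


print(flowgram("AACGT"))
print(call_bases(flowgram("TTAGGC")))
```

In this model a homopolymer such as AA shows up as a single flow with signal 2, so the length of a long homopolymer has to be inferred from signal strength alone.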

Ion semiconductor sequencing: Sequencing by synthesis


Ion semiconductor sequencing: Sequencing by synthesis

The application of natural (non-labeled) dNTPs gives a better, unbiased global seq.

coverage


PacBio RS: Single molecule sequencing with labeled dNTPs

Single molecule resolution in real time

• Short waiting time for result and simple workflow
  – Generate basecalls in <1 day
  – Polymerase speed ≥1 base per second
• No amplification required
  – Bias not introduced
  – More uniform coverage
• Direct observation
  – Distinguish heterogeneous samples
  – Simultaneous kinetic measurements
• Long reads (8,000-30,000 nt)
  – Identify repeats and structural variants
  – Less coverage required
• Information content
  – One assay, multiple applications
    • Genetic variation (SVs to SNPs)
    • Methylation
    • Enzymology

- Long reads 6-10 kb
- Median size of molecules 3 kb
- Still 15% error rate
- No strobe sequencing

Software focus on:
- De novo assembly
- High-quality CCS consensus reads

In preparation:
- Load long molecules by magnetic beads
- Modified nucleotide detection


SMRT® Technology

The PacBio RS II is based on novel Single-Molecule, Real-Time (SMRT) technology which enables the observation

of DNA synthesis by a DNA polymerase in real time. Sequencing occurs on SMRT Cells, each containing thousands

of Zero-Mode Waveguides (ZMWs) in which polymerases are immobilized. The ZMWs provide a window for

watching the DNA polymerase as it performs sequencing by synthesis.

Zero-Mode Waveguide with embedded polymerase and added library


dNTPs


Discover the Epigenome

The PacBio® RS II detects DNA base modifications using the kinetics of the polymerization reaction during sequencing.


Sample Preparation

Standard: LS – long sequencing reads
• Large insert sizes (2 kb-10 kb)
• Generates one pass on each molecule sequenced

Circular Consensus: CCS – high quality sequencing reads
• Small insert sizes (500 bp)
• Generates multiple passes on each molecule sequenced
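Why multiple passes over the same molecule give higher-quality reads can be illustrated with a simple majority-vote model. It assumes that errors in different passes are independent and that a position is called correctly whenever more than half of the passes are correct; the 15% per-pass error rate is the figure quoted earlier in the deck. Real consensus calling is more sophisticated than this, so treat the numbers as an illustration only.

```python
from math import comb


def majority_correct_probability(passes, per_pass_error=0.15):
    """Probability that a strict majority of independent passes is correct.

    With an even number of passes, a tie counts as a failure.
    """
    p_ok = 1.0 - per_pass_error
    return sum(
        comb(passes, k) * p_ok ** k * per_pass_error ** (passes - k)
        for k in range(passes // 2 + 1, passes + 1)
    )


for n in (1, 3, 5, 9):
    print(n, round(majority_correct_probability(n), 4))
```

A single pass is correct 85% of the time under this model, and three passes already raise the per-position accuracy to roughly 94%, which is the trade the CCS mode makes: shorter inserts in exchange for repeated reads.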




The Technology

First generation sequencing
The first sequencing methods, developed by Gilbert and Sanger, read multiple copies of one DNA molecule at a time. Mainly used for confirming single-gene mutations or for validating cloning constructs (one sample prep -> one sequence read).

Second generation sequencing
The first next generation sequencing (NGS), also known as massively parallel sequencing, deep sequencing or genome sequencing. It relies on amplified template libraries and investigates RNA-omes and DNA-omes, or many selected RNA/DNA targets, in parallel. Each base call relies on incorporation of fluorescent probes.

Third generation sequencing
Next generation sequencing capable of sequencing amplified DNA/RNA through „natural“ detection of bases (no bulky fluorescent probes).

Fourth generation sequencing
Next generation sequencing capable of sequencing DNA/RNA molecules directly, without previous amplification. PCR bias is avoided.

Fifth generation sequencing
In situ sequencing for RNA analysis in preserved tissue and cells.

Second to fifth generation: High-throughput Seq (one sample prep -> billions of reads)

And, wait for it,…..

Highly Multiplexed Subcellular RNA

Sequencing in Situ

NGS in detail

Most frequent technologies - how does it work

Applications

What can we assay with these NGS-platforms, capable of

reading billions of sequences (reads) from each sample?

Basically any source of RNA or DNA we can imagine….

„The Glory of Science is to imagine more than we can prove.

The Fringe is the unexplored Territory where Truth and Fantasy

are not yet disentangled.“

Physicist Freeman Dyson

NGS in detail

But let's stick with some standard, validated applications for now

What can we sequence? Anything that can be converted into DNA. RNA is converted into cDNA before sequencing.

Sources of RNA: cells, tissues, body-fluids (blood, sweat, urine, tears,….), microbiomes,

FFPE samples (biobanks)

Can we sequence an entire pool of total RNA at once?

(i.e. all RNA from a tissue)

-Yes! ….. in principle ….. but…..

2 good reasons why it is rarely done:

• RNA is an extremely diverse group of highly structured molecules, often with modified nucleotides -> one library prep method alone will not be optimal for all species of RNA in a pool.
• To get good coverage, more reads are required than for sequencing a selected subgroup of RNA (e.g. protein-coding RNA constitutes less than 0.5% of total RNA) = much more expensive. Better to spend the money on more biological replicates!!!
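The cost argument can be put into numbers with a short sketch: if only a fraction of all reads comes from the RNA class of interest, the total number of reads needed scales with one over that fraction. The 0.5% figure is the one quoted above, and 25% stands in for the 20-30% range quoted for poly-A enriched samples; the target of 10 million mRNA reads is a hypothetical value chosen for illustration.

```python
def total_reads_needed(target_reads, target_fraction):
    """Total reads required so that `target_reads` come from the RNA class of interest."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return target_reads / target_fraction


wanted_mrna_reads = 10_000_000  # hypothetical target, for illustration only

unenriched = total_reads_needed(wanted_mrna_reads, 0.005)  # mRNA at 0.5% of total RNA
enriched = total_reads_needed(wanted_mrna_reads, 0.25)     # after poly-A enrichment

print(f"without enrichment: {unenriched:.2e} reads")
print(f"with enrichment:    {enriched:.2e} reads")
print(f"ratio:              {unenriched / enriched:.0f}x")
```

Under these assumptions the unenriched sample needs about 50 times more sequencing for the same mRNA depth, which is the budget that could instead go into replicates.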

! Important! Specify the research question in advance

so that a focused approach can be selected!


How can we focus our approach?

Sample preparation:

The most common, standardized („kitti-fied“) options:

miRNA: RNA is size-selected to enrich for miRNA in the final sample. Size selection is repeated after conversion into cDNA.

mRNA: RNA is enriched for poly-A-tailed RNA by capture on poly-T oligos (0.5% -> 20-30%).

„total“ RNA: rRNA is depleted through capture on complementary probes (1-2% -> 80-90%).

Targeted enrichment of RNAs of interest: currently 2 methods, capture by binding to complementary probes or PCR with target-specific primers. Both methods are primarily designed for DE-seq.

NB! Enrichment and depletion are very efficient, but not perfect, methods: capture probes tend to also bind partially overlapping RNA, and PCR is never unbiased…..


How can we focus our approach?

Sample preparation:

The most common, standardized („kitti-fied“) options:

DNA:

Exome-seq: targeted capture of exons (+++)

Whole Genome Seq: long reads, or mate-pair libraries

Panel seq: targeted sequencing of specified genes / regions of interest. NB: now also possible for RNA targets.

Frese et al, Biology 2013

Lots of omics…….

Applications of NGS technologies

Massive parallel seq (NGS-seq) Output data in the

„omics“ size


Rizzo J M , and Buck M J Cancer Prev Res 2012;5:887-900 ©2012 by American Association for Cancer Research

Standard Applications of NGS technologies

RNA-Seq Workshop, ICBI Nov 2014

Other, user-validated, manual sample prep approaches:

• RIP-RNA (e.g. enrichment of miRNA bound to Dicer or mRNA bound to splicing factors)
• Ribosome fractionation (enrichment for translated RNA)
• Total small RNA-seq (enrichment for RNA of 10-200 nt)
• tRNA-seq (enrichment of RNA of 40-100 nt, and modified library prep to allow complex structures to be transcribed into cDNA)
• Circular RNA-seq (unique cDNA method to generate linear cDNA for library prep)
• Targeted RNA selection (custom-designed probes or primers to select desired pools of RNA). PCR-based selection is called AmpliSeq.

Knowing how sample preparation and library generation are done allows you to filter out the random noise stemming from technical sources.


Rizzo J M , and Buck M J Cancer Prev Res 2012;5:887-900 ©2012 by American Association for Cancer Research

Basic workflow for NGS experiments: DNA / RNA -> library preparation (fragment, add adapters, PCR) -> sequencing -> data output -> analysis.

Simplified view of NGS library preparation workflows for genomic DNA (gDNA)-seq and ChIP-seq.

Fragmentation options:
• Sonication (if a high-quality sonicator is available)
• Transposon-based fragmentation (Nextera, MuSeek)
• Enzymatic digestion
• Sonication (cryofragmentation)

* Clean, RNA-free, HQ DNA required

Modified from Dijk et al., Exp Cell Research 322 (2014) 12–20

Capture-seq (Exome-seq)

Applications of NGS technologies – Library Prep

Simplified view of NGS library preparation workflows for genomic DNA (gDNA)-seq AMPLI-Seq.

gDNA (10 ng) + multiplexed primers (several hundred amplicons possible in one reaction) -> PCR -> purify amplicons -> adapter ligation

Some of the most applied primer panels:
• Ion AmpliSeq™ Cancer Hotspot Panel v2
• Ion AmpliSeq™ Inherited Disease Panel
• Ion AmpliSeq™ Comprehensive Cancer Panel
• Illumina TruSeq Cancer panels

Applications of NGS technologies – Library Prep

Dijk et al., Exp Cell Research 322 (2014) 12–20

The most common RNA-seq protocols fall in three main classes

(A) Classical Illumina protocol. Random-primed double-stranded cDNA synthesis is followed by adapter ligation and PCR.

(B) One class of strand-specific methods relies on marking one strand by chemical modification. The dUTP second-strand marking method follows basically the same procedure as the classical protocol, except that dUTP is incorporated during second-strand cDNA synthesis, preventing this strand from being amplified by PCR. Most current transcriptome library preparation kits follow the dUTP method.

(C) The second class of strand-specific methods relies on attaching different adapters in a known orientation relative to the 5' and 3' ends of the RNA transcript. The Illumina ligation method is a well-known example of this class and is based on sequential ligation of two different adapters. Most current small RNA library preparation kits follow the RNA ligation method.

In Ion Torrent library prep both adapters are ligated in one reaction step, applying end-specific adapter structures.

small RNA enrichment

High-quality total RNA (RIN > 8), DNA-free (!)

Applications of NGS technologies – Analysing billions of bases (Gb)


These standardized methods are good, and getting better all the time…. but they are not perfect….

Typical problems to watch out for:

• Size selection is never completely reproducible at the cutting points…. So if you have a stringent size selection of 20 – 200 nt, expect that expression profiles of RNA at 20 nt or 200 nt will most likely be artefacts and not very reproducible between biological samples.
• Enrichment of poly-A RNA to obtain mRNA includes more lncRNA than mRNA, and some rRNA may still be present.
• Ribo depletion is never perfect; usually 10-20% remains „undepleted“. If you find rRNA in enriched/depleted samples: ignore it!!!! Its levels are not biologically meaningful information.
• Strand specificity is constantly improving, but some RNAs may still be present in the final library in the „wrong“ direction.
• The entire process of library prep (and templating) relies heavily on PCR amplification. Even when a minimal number of PCR cycles is applied there will be some „skewing“ of the data. However, this is partly compensated when samples are processed very conservatively and absolutely identically (at best simultaneously).
• PCR duplicates are hard to avoid. Again, a lower number of PCR cycles is the best cure. (To check data for PCR duplicates I recommend the „FastQC“ software.) Try not to compare samples with very different percentages of PCR duplicates.
• Don't ever expect to obtain identical results with two different methods or, even worse, two different seq platforms.
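As a rough illustration of what a duplicate percentage measures, the sketch below counts reads whose sequence is identical to an earlier read. This is a crude stand-in written for this example: dedicated tools work on quality-aware or alignment-based criteria, and highly expressed short transcripts can produce identical reads that are not PCR artefacts. The function name and the toy reads are hypothetical.

```python
from collections import Counter


def duplicate_fraction(reads):
    """Fraction of reads that repeat a sequence already seen in the sample."""
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    return (len(reads) - distinct) / len(reads)


# Toy reads: three copies of one sequence plus two unique ones.
sample = ["ACGT", "ACGT", "ACGT", "TTGA", "CCGA"]
print(duplicate_fraction(sample))
```

Computing the same number for every library in an experiment gives a quick way to spot samples whose duplication level is far from the rest before comparing them.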


- Lots of technical and methodological options…

- Let‘s mentally climb a mountain and get an overview!

What do all these techniques have in common?

What are the major differences?

If you could do an NGS experiment right now,

what would you like to do?

How?

Sequencing is successfully completed!

Time for analysis

So…. Wild and amazing techniques and methods!

Constantly new developments!

But how does it affect our lives?

as scientists?

as humans?

NGS in science

over 23 000 bacterial genomes,

400 fungal genomes and

100 protist genomes, in addition to

55 genomes from invertebrate metazoa

and 39 genomes from plants

NB: Bacterial genomes ->

microbiomes….

So today…

we can read DNA

Or rather: Genomes, RNomes / Transcriptomes (->

cDNA)

DISCOVERY

Genome-seq / Exome-seq: rare disease genes, cancer mutation profiles

Transcriptome-seq: >75% of all genes express several alternative transcripts; „Dark matter“ RNA discovered and enlightened

Body microbiome-seq: „human body atlas“ consortium. A new hot clue to health

Immuno-seq: characterizing immune populations (HLA, antigen, TCR, BCR etc.), e.g. in tumor infiltrating lymphocytes (TILs)

DE-seq: rapid, unbiased differential gene expression

Personalized medicine

Agricultural development

Understanding origins (anthropology,

human identification)

NGS in science

In 2006, Price et al. used principal components from genotype data to infer population structure and showed how this can help remove false associations in a GWAS setting. Later, two papers from J. Novembre showed how genotype reflects continuous population structure across Europe.

Genotypes reflect geography

NGS in science

http://www.nature.com/ng/journal/v38/n8/abs/ng1847.html

So already as a science tool, we are bridging the effects for humans….

Bench Bedside

What is Personalized Medicine?

Only 5 years ago it was still a futuristic, theoretical idea; people discussed whether they „believed“ in it or not…. Politicians didn't discuss it at all….. (= no legislation)

Today its first implementations are becoming routine; the discussions are about more, faster, better; accuracy and simplicity. Politicians are struggling to catch up with at least some legislation.

Empirical medicine

Precision medicine

Personalized medicine

Copyright © 2015 American Medical

Association. All rights reserved.

From: Population and Personalized Medicine in the Modern Era

JAMA. 2014;312(19):1969-1970. doi:10.1001/jama.2014.15224

Figure Legend: Tools Being Used in Clinical Research to Understand Population and Personalized Medicine

Why is „personalized medicine“ developing so fast?

Why Four Letters of Genetic Code Are So Complicated

It is evident that the genome is far more complex than previously thought.

While the understanding of its coding regions has considerably advanced, 99% of

the non-coding sequence is still challenging researchers from different disciplines

to finally unravel all of the functions of the genome.

Additionally, genetic variation across different individuals and populations is higher

than estimated, and the transition from common to rare variants is fluid, making

interpretation of their functional relevance difficult; structural changes, such as

insertions, deletions or copy number variations, are far more frequent than

previously thought. For instance, the Database of Genomic Variants (DGV) lists

about 60,000 CNVs, 850 inversions and 30,000 insertion/deletions identified in

healthy individuals

Frese et al, Biology

Applications of NGS in personalized medicine

Whole-exome sequencing was used to identify two disease-causing mutations in a patient with oculocutaneous albinism and congenital neutropenia (Cullinane et al., 2011).

Nature Reviews genetics

Applications of whole genome sequencing: mainly to identify conserved /

deviating sequences between groups of individuals or between species. –To

learn more about the stability and function of the human genome.

Applications of Exome-seq: primary choice for

identifying mutations by comparison

Genomic DNA (gDNA)-seq, Exome-seq analysis

Applications of NGS technologies – Analysing billions of bases

(Gb)

Appendix A: List of drugs currently approved by the Food and Drug Administration (FDA) with associated

pharmacogenomic information. Specific nucleic sequence variants (genetic polymorphisms) in genes lead

to varying metabolism and/or distribution of individual drugs (Gullapalli et al, J Pathol Inform 2012)

Tomorrow in the lab

2 human cell lines

hMADS: human multipotent adipose-derived stem cells

Jurkat: immortalized line of human T lymphocyte cells that are used to study acute T cell leukemia

gDNA gDNA

Ion cancer hotspot AmpliSeq

To identify hotspot cancer alleles in the following target genes:


The hours are true…. When you have learned how to do it, and have a „pipeline“ all set

up and run in.

However, we are going to take a little longer, for educational purposes

Let‘s go through the protocol to see how this is actually done

Separate pdf file + Run report example

What do the „reads“ actually look like in terms of genes and mutations?

IGV is freely available from the Broad institute homepage.

Data Output:

Files containing hundreds of thousands of reads…

From data to results

Analysis part of the lab experiment:

Turning read data into scientific results……
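One small step of turning reads into results can be sketched as counting which bases the aligned reads show at a single hotspot position. The read representation used here, a start coordinate plus an ungapped sequence, is a simplification invented for this example; the coordinates and sequences are made up, and real pipelines work on aligned read files with indels, base qualities and strand information.

```python
from collections import Counter


def allele_frequencies(aligned_reads, position):
    """Base frequencies at `position`, given reads as (start, sequence) pairs.

    Assumes gap-free alignments: base i of a read sits at start + i.
    """
    counts = Counter()
    for start, sequence in aligned_reads:
        offset = position - start
        if 0 <= offset < len(sequence):
            counts[sequence[offset]] += 1
    depth = sum(counts.values())
    return {base: n / depth for base, n in counts.items()} if depth else {}


# Hypothetical toy alignments covering position 103.
reads = [(100, "ACGTAC"), (102, "GTACGG"), (103, "AACGGT"), (98, "TTACGT")]
print(allele_frequencies(reads, 103))
```

A position where a second base appears at a substantial frequency is a candidate variant; this is the kind of pattern that becomes visible when the same reads are inspected in a genome browser.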

Wishing you joyful learning and experimenting!!!!
