Outline - Unife

Outline

• Introduction;

• Ion Torrent platform – how does it work?;

• Library preparation;

• NGS data analysis;

• Pros and cons.

Amplicon sequencing project

Amplicon sequencing project – NGS data analysis

Ion Torrent report


“Ion Server” approach: • Alignment; • Duplicate removal; • Variant caller.
When it is useful: • Standard approaches (such as cancer panels, etc.); • Medical genetics.

Customised approach: • Up to you!
When it is useful: • Projects/organisms that are not straightforward; • Structural variation; • Exploring the data; • Customising some steps.

There are no strict rules for analysing NGS data. Some standard pipelines exist, but the right approach mostly depends on your case study.


What shall I do next?

Data processing (for all samples):

1. Raw data
2. Alignment (TMAP)
3. Indexing/sorting/RG line (samtools + Picard)
4. Local realignment (GATK)
5. Duplicate removal (Picard)
6. Multi-sample variant calling (samtools)
7. Filtering (VCF file)
8. Validation (SNP chip, Complete Genomics)

Validation:
• False positive (FP%) and false negative (FN%) percentages
• Chromosome position and reference allele concordance

It is not THE way to analyse NGS/Ion Torrent data but it is one possible way to analyse my data (hopefully the best one!).

What is needed to produce the raw data?

Filtering:
• Filtering variant sites
• Sets of filters for several parameters (BQ, MQ, DP, missing data per site/sample)

Amplicon sequencing project – before the raw data

Before the raw data, within the Ion Torrent suite (Ion Server):

• Basecalling
• Base recalibration
• Per-base quality scoring
• Trimming
• Filtering

Result: raw data


Before the raw data:

Basecalling: calling the base for each well.

Base recalibration: a process to improve base calls by relearning the homopolymer flow signal distribution from the alignment of a fraction of library reads.

[Figure: before recalibration vs after recalibration]

Per-base quality scoring

Base quality score (BQ): • Phred-scaled value; • BQ = -10 × log10(error rate).

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%
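The Phred relationship in the table above can be checked with a few lines of Python. This is only a minimal sketch; the function names are illustrative and not part of any Ion Torrent software.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the Phred table.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```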

Trimming: • Removal of adapter sequence; • Removal of low-quality 3' ends.

Filtering • Removal of short reads; • Removal of adapter dimers; • Removal of reads lacking sequencing key; • Removal of reads with off-scale signal; • Removal of polyclonal reads.



An Ion Sphere Particle is clonal if all of its DNA fragments are cloned from a single original template. All the fragments on such a bead are identical, and they respond in unison as each nucleotide is flowed in turn across the chip.

[Figure: library fragment layout: A adapter | barcode | DNA fragment to be sequenced | P1 adapter]


• Clonal amplification, ~50% of flows are 0-signal flows; • The positive flows cluster around integer values.

• Polyclonal amplification, 19% of flows are 0-signal flows; • The positive flows no longer cluster exclusively around integer values.

• Super-mixed beads, ~0% of flows are 0-signal flows; • The positive flows do not cluster around integer values at all.
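The two properties described above (the share of 0-signal flows, and how tightly the positive flows sit around integer values) can be computed for a list of normalised flow signals. The sketch below is illustrative only: the function name, the 0.5 cutoff used to count a flow as 0-signal, and the example values are assumptions, and this is not the polyclonal filter implemented in the Torrent software.

```python
def flow_signal_summary(flow_values, zero_cutoff=0.5):
    """Summarise the normalised flow signals of one bead.

    Returns (fraction of 0-signal flows, mean distance of the positive
    flows from the nearest integer). zero_cutoff is an illustrative
    assumption: signals below it are counted as 0-signal flows.
    """
    if not flow_values:
        raise ValueError("flow_values must not be empty")
    positive = [v for v in flow_values if v >= zero_cutoff]
    zero_fraction = (len(flow_values) - len(positive)) / len(flow_values)
    if positive:
        mean_offset = sum(abs(v - round(v)) for v in positive) / len(positive)
    else:
        mean_offset = 0.0
    return zero_fraction, mean_offset


# Made-up signals: one bead that looks clonal, one that looks mixed.
clonal_like = [0.02, 1.01, 0.0, 2.03, 0.05, 0.98, 0.01, 1.0]
mixed_like = [0.6, 1.4, 0.7, 2.5, 1.3, 0.8, 1.6, 0.55]
print(flow_signal_summary(clonal_like))
print(flow_signal_summary(mixed_like))
```

With these made-up values, half of the flows of the clonal-like bead are 0-signal flows and its positive flows sit close to integers, while the mixed-like bead has no 0-signal flows and its positive flows are far from integers.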

Amplicon sequencing project – Raw data

Raw data: FASTQ file



Amplicon sequencing project – Alignment

alignment – process of determining the most likely location within the genome for the observed DNA read

[Figure: raw reads aligned to the reference genome]


trade-off: speed vs sensitivity – the higher the accuracy the longer the alignment run

Two classes of methods:

Burrows-Wheeler: • fast; • less robust at high divergence from the reference genome; • e.g. bwa.

Hashing: • slow (needs more memory); • robust at high divergence from the reference genome; • e.g. stampy.

Short reads: ranging between 150 bp and 200 bp; the shorter the read, the harder it is to find its location in the genome.

Large amount of data: computationally challenging for memory and speed.


TMAP – Ion Torrent suite

[Figure: TMAP, shown with Burrows-Wheeler based software and hashing based software; raw reads and reference genome]

Amplicon sequencing project – Alignment, Mapping Quality (MQ)

What if there are several possible places to align your sequencing read? This may be due to:
- Repeated elements in the genome
- Low-complexity sequences
- Reference errors and gaps

MQ is a Phred-scaled score of the quality of the alignment:
• high MQ: a single match;
• low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches;
• MQ0: a perfect multiple match.

Amplicon sequencing project – Alignment, BAM and SAM file

SAM/BAM format

SAM: sequence alignment map. BAM: binary alignment map. These are the standard formats for alignments. BAM is the binary version of SAM: it is smaller and easier to store and to access, but its content cannot be read directly by the human eye.
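Because SAM is plain tab-separated text, the mapping quality is easy to inspect directly. The sketch below reads the mandatory SAM columns (MAPQ is the fifth) and keeps alignments above a chosen MQ; the function names and the threshold are illustrative assumptions, and for real BAM files a dedicated tool such as samtools would be used instead.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality (MQ)
    cigar: str
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def filter_by_mapq(sam_lines, min_mapq):
    """Yield alignments with MAPQ >= min_mapq, skipping '@' header lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue
        record = parse_sam_line(line)
        if record.mapq >= min_mapq:
            yield record
```

For example, `list(filter_by_mapq(open("sample.sam"), 50))` would keep only the confidently placed reads of a hypothetical file `sample.sam`.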



Amplicon sequencing project – Indexing/sorting/RG line

For each BAM file we have to:

• Index (.bai file): this file acts like an external table of contents and allows programs to jump directly to specific parts of the BAM file without reading through all of the sequences (samtools).
• Sort: sort the reads in the BAM file by either chromosome position or name (samtools).
• Add RG line: a line containing information about the read group identifier, platform name, sample name, library name, etc.
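To show what a read group (RG) line looks like, the sketch below builds one SAM header line from its usual tags: ID (read group identifier), SM (sample), LB (library) and PL (platform). The function name and the example values are hypothetical; in practice the line is normally added with Picard or samtools rather than by hand.

```python
def make_read_group_line(rg_id, sample, library, platform="IONTORRENT"):
    """Build an @RG header line for a SAM/BAM file.

    Tags follow the SAM header convention: ID, SM, LB and PL.
    """
    tags = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    for key, value in tags:
        if not value or "\t" in value:
            raise ValueError(f"invalid value for tag {key}: {value!r}")
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in tags)


# Hypothetical identifiers, for illustration only.
print(make_read_group_line("run1", "sampleA", "libA"))
```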

Tools to visualise BAM files include IGV (http://www.broadinstitute.org/igv/home) and Tablet (http://ics.hutton.ac.uk/tablet/).



Amplicon sequencing project – Local realignment

Short indels in the sample relative to the reference sequence can pose difficulties for alignment programs. Indels occurring towards the ends of the reads are often not aligned correctly, introducing an excess of SNPs.

Local realignment uses the full alignment context to determine whether the indel exists. It is a two-step process (GATK software):
1. RealignerTargetCreator: determines the small suspicious intervals which are likely in need of realignment;
2. IndelRealigner: runs the realignment on those intervals.
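The two realignment steps can be scripted. The sketch below only builds the two command lines, assuming the GATK 3.x command-line syntax (both tools belong to GATK 3 and are not part of GATK 4); the function name, the jar location and all file names are hypothetical.

```python
def realignment_commands(reference, bam, intervals, realigned_bam,
                         gatk_jar="GenomeAnalysisTK.jar"):
    """Return the two GATK 3.x command lines for local realignment.

    Step 1 writes the suspicious intervals, step 2 realigns the reads
    over those intervals. Nothing is executed here.
    """
    base = ["java", "-jar", gatk_jar]
    create_targets = base + [
        "-T", "RealignerTargetCreator",
        "-R", reference,
        "-I", bam,
        "-o", intervals,
    ]
    realign = base + [
        "-T", "IndelRealigner",
        "-R", reference,
        "-I", bam,
        "-targetIntervals", intervals,
        "-o", realigned_bam,
    ]
    return create_targets, realign


for command in realignment_commands("ref.fasta", "sample.sorted.bam",
                                    "sample.intervals", "sample.realigned.bam"):
    print(" ".join(command))
```

Each list could then be passed to `subprocess.run(command, check=True)`, run once per sample.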



PCR is used during library preparation. This can result in duplicate DNA fragments in the final library prep. PCR-free protocols exist but require a large amount of DNA.

Duplicates can result in false SNP calls: they may fake a high coverage, thus giving high support to some variants. They are removed with Picard.
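The idea behind duplicate removal can be sketched in a few lines: reads that start at the same position on the same strand are treated as copies of one fragment, and only the best one is kept. This is a simplified illustration, not Picard's actual algorithm; the dictionary keys and the choice of mean base quality as the tie-breaker are assumptions.

```python
def remove_duplicates(reads):
    """Keep one read per (chromosome, start position, strand).

    Each read is a dict with the keys 'chrom', 'pos', 'strand' and
    'mean_bq' (mean base quality). Among reads sharing the same key,
    the one with the highest mean base quality is kept.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["mean_bq"] > best[key]["mean_bq"]:
            best[key] = read
    return list(best.values())


# Made-up reads: r1 and r2 are duplicates of the same fragment.
reads = [
    {"name": "r1", "chrom": "chrX", "pos": 100, "strand": "+", "mean_bq": 25},
    {"name": "r2", "chrom": "chrX", "pos": 100, "strand": "+", "mean_bq": 30},
    {"name": "r3", "chrom": "chrX", "pos": 100, "strand": "-", "mean_bq": 28},
]
print([read["name"] for read in remove_duplicates(reads)])
```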


[Figure: bases observed at one site across reads (C, C, C, C, A): possible heterozygote SNP call vs ref call]



We can check the coverage!

Amplicon sequencing project – Coverage

Coverage per position (GATK/BEDtools):

Chromosome name              Position   Coverage
phax5574-500bp_up_down_ref   16994      154
phax5574-500bp_up_down_ref   16995      153
phax5574-500bp_up_down_ref   16996      152
phax5574-500bp_up_down_ref   16997      149
phax5574-500bp_up_down_ref   16998      148
phax5574-500bp_up_down_ref   16999      145
phax5574-500bp_up_down_ref   17000      149
phax5574-500bp_up_down_ref   17001      151
phax5574-500bp_up_down_ref   17002      149
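A per-position coverage table like the one above is easy to summarise. The sketch below parses the three whitespace-separated columns (chromosome name, position, coverage) and reports the mean, minimum and maximum depth; the function names are illustrative, and the nine rows are the ones shown in the table.

```python
def parse_coverage(text):
    """Parse 'chromosome position coverage' rows into tuples."""
    rows = []
    for line in text.strip().splitlines():
        chrom, pos, cov = line.split()
        rows.append((chrom, int(pos), int(cov)))
    return rows


def coverage_summary(rows):
    """Return mean, minimum and maximum coverage over the given rows."""
    depths = [cov for _, _, cov in rows]
    return {"mean": sum(depths) / len(depths),
            "min": min(depths),
            "max": max(depths)}


table = """
phax5574-500bp_up_down_ref 16994 154
phax5574-500bp_up_down_ref 16995 153
phax5574-500bp_up_down_ref 16996 152
phax5574-500bp_up_down_ref 16997 149
phax5574-500bp_up_down_ref 16998 148
phax5574-500bp_up_down_ref 16999 145
phax5574-500bp_up_down_ref 17000 149
phax5574-500bp_up_down_ref 17001 151
phax5574-500bp_up_down_ref 17002 149
"""
print(coverage_summary(parse_coverage(table)))
```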




Variant calling tools, by variant type:

• SNPs: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller);
• indels: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller); Dindel;
• SV: SVMerge, a pipeline combining many different tools.

Amplicon sequencing project – Variant calling and filtering

Haplotype caller: not for non-diploid organisms and pooled samples.

                          GATK    Samtools
SNP true positive rate    0.769   0.851
SNP false positive rate   0.231   0.148


Factors to consider:
- Base call qualities of each supporting base
- Proximity to indels and homopolymer runs
- Mapping qualities of the reads supporting the SNP (increased read length or paired-end reads help MQ scores)
- Sequencing depth
- Individual vs multi-sample calling

Multi-sample calling → better rescue of low frequency SNPs

VCF file:
- Standardised format for storing DNA polymorphism data (SNPs, indels, SV) with rich annotations
- Can store variant information over many samples
- Records meta-data about the site (dbSNP accession, filter status)
- Very flexible: tags can be introduced to describe new types of variants, and different VCF files may contain different information/annotations


A VCF file has two sections:
- Header
- Data

Header:
- lines starting with ##: an arbitrary number of meta-information lines;
- line starting with #: column definition. Mandatory columns include:

CHROM    chromosome
POS      position of the start of the variant
ID       unique identifier of the variant (e.g. rs number for SNPs)
REF      reference allele
ALT      comma-separated list of alternate non-reference alleles
QUAL     Phred-scaled quality score
FILTER   site filtering information
INFO     user-extensible annotation (e.g. samtools and GATK may differ in this)

The sample columns follow.

Data: one line per site (all the columns described above on each line), with useful information per site and per sample.


Per-sample fields:
• GT: genotype, 0 = ref, 1 = alt;
• PL: Phred-scaled genotype likelihoods (for a Phred-scaled likelihood P, the raw likelihood of that genotype is L = 10^(-P/10), so the higher the number, the less likely it is that your sample has that genotype);
• DP: depth of coverage;
• SP: Phred-scaled strand bias P-value; it tests whether variant bases tend to come from one strand;
• GQ: genotype quality, encoded as a Phred quality: -10 × log10 p(genotype call is wrong).

Example: GT:PL:DP:SP:GQ 0/0:0,255,255:138:0:99
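The per-sample field can be decoded by pairing the FORMAT keys with the sample values and converting each PL value back to a raw likelihood with L = 10^(-P/10). This is a minimal sketch, assuming the colon-separated sample string used by the VCF format; the function names are illustrative and not taken from vcftools or samtools.

```python
def parse_sample_field(format_string, sample_string):
    """Pair the FORMAT keys with the values of one sample column."""
    keys = format_string.split(":")
    values = sample_string.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample column have different lengths")
    return dict(zip(keys, values))


def pl_to_likelihoods(pl_string):
    """Convert comma-separated Phred-scaled likelihoods to raw likelihoods."""
    return [10 ** (-int(p) / 10) for p in pl_string.split(",")]


sample = parse_sample_field("GT:PL:DP:SP:GQ", "0/0:0,255,255:138:0:99")
likelihoods = pl_to_likelihoods(sample["PL"])
print(sample["GT"], sample["DP"], likelihoods)
```

Here the first genotype (0/0) has PL 0, i.e. a raw likelihood of 1, while the other two genotypes are vanishingly unlikely, which agrees with the 0/0 call.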

240 samples, 3 PHAX regions spanning 49070 bp:

• Variant calling with BQ13 and MQ0 (standard parameters): 512 sites;

• Variant calling with BQ20 and MQ50: 419 sites;

• Filtering for DP10 (DP >= 10) and excluding heterozygous positions (converted to missing data);

• Sites with >= 5% of missing data were removed (113): 306 sites;



Missing data per site (filtered at 5%, but several thresholds were tested):

Threshold for missing data   # removed sites
2.5%                         121
5%                           113
10%                          99
20%                          90
30%                          70
50%                          60
80%                          46
100%                         13
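The missing-data filter can be sketched as follows: for each site, compute the fraction of samples without a genotype and remove the site when that fraction reaches the threshold. The data layout (a dictionary from site to a list of genotype strings), the function names and the toy genotypes are assumptions made for illustration; the real filtering was done on the VCF file.

```python
def missing_fraction(genotypes):
    """Fraction of samples with a missing genotype ('./.' or '.')."""
    missing = sum(1 for gt in genotypes if gt in ("./.", "."))
    return missing / len(genotypes)


def filter_sites(sites, max_missing=0.05):
    """Keep sites whose missing fraction is below max_missing.

    Sites with a missing fraction >= max_missing are removed.
    """
    return {site: gts for site, gts in sites.items()
            if missing_fraction(gts) < max_missing}


# Toy example with 20 samples per site (made-up genotypes).
sites = {
    ("PHAX5574", 16994): ["0/0"] * 20,
    ("PHAX5574", 16995): ["0/0"] * 19 + ["./."],   # 5% missing: removed
    ("PHAX5574", 16996): ["1/1"] * 10 + ["./."] * 10,
}
print(sorted(filter_sites(sites)))
```

Calling `filter_sites` with several values of `max_missing` gives the kind of threshold comparison shown in the table.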

[Figure: bar chart "Variant_sites_240_samples", series "missing_data": number of sites (y axis, 0 to 140) against the threshold for missing data filtering (x axis, 2.5% to 100%)]


• Annotate the variants using dbSNP 138;

• SnpGap 10 (each SNP within 10 bp of a gap is filtered) and GapWin 3 (window size for filtering adjacent gaps);

QUAL threshold (0-999)   # kept sites
QUAL > 0                 306
QUAL > 10                302
QUAL > 50                300
QUAL > 100               297
QUAL > 200               283
QUAL > 300               283
QUAL > 500               283
QUAL >= 999              283

[Figure: bar chart "Variant_sites_240samples", series "QUAL": number of kept sites (y axis, 270 to 310) against the QUAL threshold (x axis, > 0 to >= 999)]

QUAL >= 200 would be the best threshold: 23 sites would be discarded, 3 of which are dbSNP annotated... I did not filter on it.

QUAL: Phred-scaled quality score for the assertion made in ALT, i.e. -10 × log10 prob(call in ALT is wrong). If ALT is "." (no variant), this is -10 × log10 p(variant); if ALT is not ".", this is -10 × log10 p(no variant).

• Samples with >= 5% of missing data were removed (2);


Final dataset: • 238 samples (96_NTH005 and SP15 removed); • 297 variant sites;

PHAX region              3115     5574     8913     Total
# variant sites (SNPs)   33       214      50       297
Average coverage         169.3x   208.3x   164.6x
SNP density (SNP/kb)     6.6      5.6      8.3

[Figure: X chromosome with PAR1, PAR2 and the regions PHAX 3115, PHAX 5574 and PHAX 8913; average coverage across 238 samples]


[Figure: "Singletons and non-singleton sites". Categories: non-singleton sites, singletons in dbSNP, new singletons. Percentages shown: 42%, 68%, 32%, 58%]

[Figure: percentage of variant sites (0% to 100%) by category: singletons - EU, singletons - ME, singletons - YRI, non-singleton sites]



Amplicon sequencing project – Validation

• Any NGS dataset needs a validation step to check the quality of the data and to estimate the error rate of the experiment;

• Validation can be performed in different ways:
  • Sanger sequencing of a subset of new SNPs;
  • Custom SNP chip approach;
  • NGS with a different platform;
  • Comparison with already sequenced samples in publicly available datasets (e.g. 1000 Genomes Project, Complete Genomics, etc.).


Drmanac et al., Science 2010

69 full genomes data:
• A Yoruban trio;
• A Puerto Rican trio;
• A 17-member CEPH pedigree across three generations;
• A diversity panel representing unrelated individuals from nine different populations.

Some African (YRI) and European (CEU) samples are included; 9 samples are included in my dataset.


Specificity vs sensitivity = false positives vs false negatives

Our sequenced sample is compared with an external source of variation for the same sample, with good-quality data (e.g. Complete Genomics):
• TP: true positive; FP: false positive; TN: true negative; FN: false negative;
• high specificity = low FP;
• high sensitivity = low FN.


9 samples included in my dataset; variant calling at all sites (including reference sites, not only SNPs); 427 kb compared across 9 samples.

True calls (TN + TP) (%)   FP (%)   FN (%)
99.9995                    0.0005   0

Overall, both FP and FN are low, confirming the high quality of the data.
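The validation percentages follow directly from the four counts obtained when comparing our calls with the external dataset. The sketch below shows the arithmetic, with percentages expressed over all compared sites as in the table above; the function name is illustrative and the counts in the example are made up, they are not the counts of this project.

```python
def validation_summary(tp, fp, tn, fn):
    """Summarise a comparison against an external call set.

    Percentages are relative to all compared sites. Sensitivity is
    TP / (TP + FN) and specificity is TN / (TN + FP).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no sites were compared")
    return {
        "true_calls_pct": 100 * (tp + tn) / total,
        "fp_pct": 100 * fp / total,
        "fn_pct": 100 * fn / total,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Made-up counts, for illustration only.
print(validation_summary(tp=90, fp=5, tn=900, fn=5))
```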


Amplicon sequencing project – Pros and cons

Ion PGM

Pros:
• The Ion Torrent platform performs well for small NGS projects;
• Fast run time and cost-effective for small-scale projects;
• Very high data quality;
• Torrent Suite software for standard approaches (for non-geeky people!);

Cons:
• No standard pipeline for analysing NGS data, although there are some general rules;
• Not extremely precise for either small or big indels;
• Not suitable for whole-genome sequencing (see Ion Proton);
• Remember homopolymer issues (e.g. telomeres).

Amplicon sequencing project – Conclusions

• Be smart in designing your experiment (e.g. coverage, barcodes, etc.);

• Be practical and “creative” in customising the best pipeline for your project;

• Be critical regarding your data!

• Consider the information you lose at each filtering step;

• Check the error rate in your experiment;

• NGS technology has many advantages but Sanger sequencing was easier!

software website

bwa http://bio-bwa.sourceforge.net/

picard http://picard.sourceforge.net/

samtools http://samtools.sourceforge.net/

GATK http://www.broadinstitute.org/gatk/

tablet http://bioinf.scri.ac.uk/tablet/

vcftools http://vcftools.sourceforge.net/

Useful resources:

Jia P et al, Plos One , 2012 – Variant calling. FreeBayes, https://wiki.gacrc.uga.edu/wiki/Freebayes , variant calling software.


Acknowledgments

• Mark Jobling

• Alec Jeffreys

• Rita Neumann

• Pille Hallast

• Chiara Batini

Conclusions and future work

• RepeatSeq performs really well for m5753 and m5751…even though this dataset is pretty small, I would expect a low error rate on a bigger scale project;

• LobSTR performs quite well for m2036 (70% accuracy) but probably it is still worth going on with the ABI typing;

• m9053 does not have enough reads to be called by these tools…possibly it depends on the genomic context related to sequencing issues…ABI typing still needed;

• The LobSTR error rate looks higher for alleles that are long compared to the reference sequence and, among the wrong calls, it prefers the reference allele;

• A validation dataset /subset typed with the ABI seems to be still needed.

Quality Score Predictors

Torrent software uses the following six predictors that are correlated with empirical base call quality:

P1 Penalty Residual: A penalty based on the difference between predicted and actual flow values. Computed by the base caller.

P2 Local noise: Noise (defined as the maximum absolute difference between the flow value and the nearest integer) in the immediate neighborhood (plus/minus 1 base) of the given base.

P3 Beverly Events: Number of high-residual flows in the 20-flow window around the flow containing the base. A flow has high residual when the normalized difference between the observed and model-predicted signal exceeds 0.4 or falls below –0.4. The more high-residual flows in the window, the lower the quality of the base call.

P4 Multiple incorporations: Number of incorporated bases in this flow, i.e. the length of the homopolymer. For multiple incorporations of the same nucleotide in one flow, the last base in the incorporation order is assigned a value equivalent to the total number of incorporations. All other bases in the sequence of the multiple incorporations are assigned the value 1.

P5 Environment noise: The average signal noise (defined as the absolute difference between the flow value and the nearest integer) in the neighborhood (plus/minus 5 bases) of the given base.

P6 State Inphase: Live polymerase in phase.
