www.jornadasaludinvestiga.es
Applications of Next Generation Sequencing (NGS) in clinical studies
Javier Pérez Florido
Antonio Rueda Martín
Bioinformaticians
Genomics and Bioinformatics Platform of Andalusia
(GBPA)
Agenda
• PART I
→ Introduction to NGS technologies
→ About GBPA
• PART II
→ NGS data analysis pipeline: from raw data to candidate variants
• PART III
→ Full example: from raw to candidate variants
→ Success stories
Why NGS?
• It works!
• Versatility of the data
• Key: the development of a technology able to massively parallelize the sequencing process drastically reduces sequencing time and costs
History of DNA Sequencing
Nature 458, 719-724 (2009)
Basics of the “new” Technology
→ Get DNA.
→ Fragment the DNA and attach adaptors.
→ Attach it to something (bead or glass).
→ Extend and amplify signal with some color scheme.
→ Detect fluorochrome by microscopy.
→ Interpret series of spots as short strings of DNA.
→ Simultaneously sequence entire libraries of DNA sequence fragments.
NGS Technologies
Differences among sequencing platforms
• Nanotechnology used.
• Detection system
• Resolution of the image analysis.
• Chemistry and enzymology.
• Read length and number of reads
• Signal to noise detection in the software (Q scores)
• Run time
• Cost
Roche 454 Pyrosequencing
M.L. Metzker, Nature Reviews Genetics (2010)
Roche 454 GS Systems
GS Junior
• Sequencing run: 10 h
• Avg read length: 400 bp
• Reads per run: 100,000
• Output: 40 Mbp
GS FLX+
• Sequencing run: 23 h
• Avg read length: 700 bp
• Reads per run: 1,000,000
• Output: 700 Mbp
Illumina: sequencing by synthesis
• DNA fragments are ligated at both ends to
adapters
• DNA fragments are immobilized at one
end on a solid support
• Single-stranded fragments create a
“bridge” structure
• Adapters act as primers for PCR
amplification
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
Illumina: PCR bridge amplification, reversible terminators
• Four reversible terminator nucleotides, each
labelled with a different fluorescent dye
• Incorporation of each nucleotide is detected by a
CCD camera
• Terminators are removed and synthesis repeated
• Just one nucleotide is incorporated in each cycle
Illumina Sequencers (www.illumina.com)
HiSeq 2500
• Max output: 1,000 Gb
• Max read number: 4,000 M
• Max read length: 2x125 bp
MiSeq
• Max output: 15 Gb
• Max read number: 25 M
• Max read length: 2x300 bp
NextSeq 500
• Max output: 120 Gb
• Max read number: 400 M
• Max read length: 2x150 bp
Solid 5500
200 Gb/run
35-75 bp fragments
1.8 - 4.8 billion reads/run
2x6 lanes/run
96 bar-codes
ECC: 99.99% accuracy
Colorspace reads
Third Generation NGS: PacBio RS
• SMRT: Single Molecule Real Time DNA synthesis.
• Single Molecule Sequencing: DNA synthesis is detected on a single DNA strand.
– Up to 15,000 nt, 50 bases/second
– DNA polymerase is affixed to the bottom of a tiny hole (~70nm).
– Only the bottom portion of the hole is illuminated allowing for detection of incorporation of dye-labeled nucleotide.
– Real-time Sequencing.
– DNA template is circularized by the use of “bell” shaped adapters.
– As long as the polymerase is stable this allows for continuous sequencing of both strands.
Advantages
• No amplification required.
• Extremely long read lengths.
• Average 2500 nt. Longest 15,000 nt.
Disadvantages
• High error rates.
• Error rate of ~15% for Indels. 1% Substitutions.
Most common applications of NGS
• RNA-seq / Transcriptomics
o Quantitative
o Descriptive: alternative splicing
o miRNA profiling
• Resequencing
o Mutation calling
o Profiling
o Genome annotation
• De novo sequencing
• Copy number variation
• ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation
• Metagenomics
• Metatranscriptomics
• Exome sequencing
• Targeted sequencing
DNA sequencing - 1
• Whole GENOME Resequencing
– Need reference genome
– Variation discovery
• Whole GENOME “de novo” sequencing
– Uncharacterized genomes with no reference genome available
– Known genomes where significant structural variation is expected.
– Long reads or mate-pair libraries. Sequencing mostly done by Roche 454, Illumina and PacBio.
– Assembly of reads is needed: computationally intensive
– E.g. bacterial genome sequencing
DNA sequencing - 2
• Targeted Resequencing
– Specific regions in the genome
– Need reference genome
– Need custom probes complementary to the genomic regions
• Nimblegen
• Agilent
• Custom gene panel sequencing
– Allows covering a high number of genes related to a disease
– Lower cost and quicker than capillary sequencing
– E.g. disease gene panel
• Whole EXOME Resequencing
– Available for Human and Mouse
– Variation discovery on ORFs
• 2% of human genome (lower cost)
• 85% of disease mutations are in the exome
Target Enrichment - Exome sequencing
Don’t sequence all, just what you need
→ DNA (patient): Gene A, Gene B
→ Produce shotgun library
→ Capture exome sequences
→ Wash & Sequence
→ Map against reference genome
→ Determine variants, annotate and filter
→ Candidate mutations / genes
DNA sequencing - 3
• Amplicon sequencing
– Sequencing of regions amplified by PCR.
– Shorter regions to cover than targeted capture
– No need of custom probes
– Primer design is needed
– High fidelity polymerase
– Multiplexing is needed
Genomics & Bioinformatics Platform of
Andalusia, GBPA
Edificio INSUR,
Albert Einstein Street.
Cartuja Scientific and Technology Park, Sevilla
• Platform based on Next Generation Sequencing technologies
• Genomics and Bioinformatics labs together
Infrastructure at GBPA
• SOLiD 5500 XL
• Roche 454 GS-FLX+
• High performance cluster
o 24 HPC nodes (72-192 Gb)
o Hyperthreading: up to 450 parallel jobs
o Total memory: 2 Tb
o Storage: 540 Tb
Infrastructure at GBPA
• Recently, GBPA got funding for:
o MiSeq (Illumina)
o HiSeq 2500 (Illumina)
o PacBio RSII (Pacific Biosciences)
Projects at GBPA
Medical Genome Project (MGP)
• A first step towards implementing personalized medicine in the Andalusian Health System
• Characterization of a number of genetic diseases by means of exome sequencing
– Rare genetic diseases
– Monogenic diseases
• Characterization of SNPs in a healthy Spanish control population
– 300 individuals
– More than 500,000 variants found, half of them not previously reported in any public repository
Other projects at GBPA
• Development of an NGS data analysis system for the clinical diagnosis of genetic diseases.
• Currently working on the development of High Performance Computing tools for the analysis of huge sets of variants, in collaboration with the EBI and CIPF.
• Other collaborations: IBIS, Hosp. San Cecilio, Hosp. San Joan de Deu, Hosp. Clinic-IDIBAPS, Hosp. Ramón y Cajal, CIEMAT, UGR, CABIMER, etc.
• Participated in the Sequencing Quality Control project (SEQC/MAQC-III)
– The SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32, pp. 903-914, 2014
Services at GBPA
• RNA-seq / Transcriptomics
o Quantitative
o Descriptive: alternative splicing
o miRNA profiling
• Resequencing
o Mutation calling
o Profiling
o Genome annotation
• De novo sequencing
• Copy number variation
• ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation
• Metagenomics
• Metatranscriptomics
• Exome sequencing
• Targeted sequencing
http://www.gbpa.es
Training: 4-day hands-on course for the analysis of genomics / transcriptomics NGS data
NGS data pipeline analysis
DNA sample → (library preparation) → NGS instrument → (sequencing) → Data → (data analysis)
Data analysis steps:
→ Sequence processing (quality control, sequence filtering)
→ Mapping
→ Variant calling
→ Variant annotation
→ Custom filtering
→ Candidate variants
RAW data
NGS instrument → Proprietary format → FastQ
• Different sequencers output different files (sff, csfasta, qual file, xsq, …)
• Nearly all downstream analyses take FastQ as input
Quality control. Data formats: FastQ
• The Fastq format is “a fasta with qualities”:
1. Header line (like fasta but starting with “@”)
2. Sequence (string of nucleotides)
3. “+” and sequence ID (optional)
4. Quality values of sequence encoded as a single byte ASCII code
• File extension: .fastq
• Sequence quality encoding
o Base quality must be encoded in just 1 byte!
o Each base has a corresponding quality value: the quality in position n refers to the base in position n
o Encoding procedure: error probability → Phred transformation (integer value) → ASCII encoding
• Example record:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Quality control. Data formats: FastQ
• Phred + 33
o Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41]
• Phred + 64
o Illumina 1.3 [0,40], Illumina 1.5 [3,40]
Prob. of incorrect base call    Phred quality score    Base call accuracy
1 in 10                         10                     90%
1 in 100                        20                     99%
1 in 1000                       30                     99.9%
1 in 10000                      40                     99.99%
1 in 100000                     50                     99.999%
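The relation between ASCII characters, Phred scores and error probabilities can be sketched in a few lines of Python. This is a minimal illustration, assuming Phred+33 encoding by default; the function names `phred_scores` and `error_probability` are made up for this example and do not come from any library.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes Sanger / Illumina 1.8+ encoding; use 64 for
    Illumina 1.3 and 1.5 files.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Quality line of the example record shown earlier in the slides.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)
print(scores[:5])  # first five Phred scores of the read
print([round(error_probability(q), 4) for q in (10, 20, 30, 40)])
```

Decoding with the wrong offset silently shifts every score by 31, so it is worth checking which encoding a file uses before filtering on quality.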
Quality Control
• Evaluation of sequence quality
o Primary tool to assess sequencing
o Evaluating sequences in depth is a valuable approach to assess how reliable our results will be
o QC determines subsequent filtering
o Any filtering decision will affect downstream analysis
o QC must be run after every critical step
o Huge files… don’t worry. Software tools will do it for us.
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
o Fastq file → FastQC → Quality control
Quality control: addressing QC with FastQC
• By means of FastQC, raw reads can be evaluated in terms of different quality metrics:
o Per base sequence quality
o Per sequence quality scores
o Per base sequence content
o Per base GC content
o Per sequence GC content
o Per base N content
o Sequence length distribution
o Duplicate sequences
o Overrepresented sequences
o Overrepresented k-mers
• Examples
o Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
o Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html
Per base sequence quality
Shows an overview of the range of quality values across all bases at each position in the fastq file
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
• The plot background marks three zones: good quality, reasonable quality and poor quality
• Good data
o Consistent
o High quality along the read
• Bad data
o High variance
o Quality decreases towards the end of the read
Per sequence quality scores
Shows whether a subset of sequences has universally low quality values
• Good data
o Most of the reads are high-quality sequences
• Bad data
o Bimodal distribution, with a peak of low quality reads
Sequence length distribution
• Some sequencers output reads of
different length (for example,
Roche 454)
• Some sequencers generate
sequence fragments of uniform
length
Sequence Filtering
• It is important to remove bad quality data: our confidence in the downstream analysis will be improved
• Sequence filtering criteria:
o Mean quality
o Read length
o Read length after trimming
o Percentage of bases above a quality threshold (minimum quality threshold)
o Adapter trimming
o Adapter reads
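As an illustration of how such criteria combine, the following is a minimal sketch in Python. It is not the algorithm of any of the tools listed in these slides; the function names and all default thresholds (`min_quality=20`, `min_length=30`, `min_fraction_above=0.8`) are assumptions chosen for the example.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filters(qualities, min_mean_quality=20, min_length=30,
                   min_fraction_above=0.8, threshold=20):
    """Apply length, mean quality and fraction-above-threshold filters to one read."""
    if len(qualities) < min_length:
        return False
    mean_quality = sum(qualities) / len(qualities)
    fraction_above = sum(q >= threshold for q in qualities) / len(qualities)
    return mean_quality >= min_mean_quality and fraction_above >= min_fraction_above


# Toy read: trim first, then decide whether the trimmed read is kept.
seq, quals = trim_3prime("ACGTACGT", [30, 30, 30, 30, 30, 10, 30, 5])
print(seq, quals)
print(passes_filters(quals, min_length=5))
```

Trimming before filtering matters: a read can fail on mean quality as a whole and still pass once its low-quality tail has been removed.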
Sequence Filtering
• Sequence filtering tools
o Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
o Galaxy (https://main.g2.bx.psu.edu/)
o SeqTK (https://github.com/lh3/seqtk)
o Cutadapt (https://code.google.com/p/cutadapt/)
o Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
o …
The mapping process
• Reference Genome
o Consensus sequence, built up from high quality sequencing samples
o Control reference sequence to compare our samples against
o Genome Reference Consortium: created to deliver assemblies: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
o Fasta format
o Different assemblies
NGS data
• DNA-seq, RNA-seq, BS-seq, ChIP-seq, …
• Read sizes ranging from 75 bp to 20 kbp
• Single-end and paired-end reads
• Basespace or colorspace
• Challenges
o Massive data:
• SOLiD 5500W: 240 Gb in 1x75 bp reads, 320 Gb in 2x50 bp
• Illumina HiSeq 2500: 160 Gb in 2x150 bp reads
o Natural variability: SNPs, indels, de novo mutations, CNVs…
o Sequencing errors
o RNA-seq: gapped alignment
o Computing resources
Mapping process considerations
• Which aligner should I use?
o Read length
o DNA or RNA
o Basespace or colorspace
o Computing resources
• Aligner parameters
o Single-end or paired-end
o SNVs, Indels
o Read quality
o Should multiple hits be allowed?
Mapping process: algorithms
• Smith-Waterman (SW)
o Aligns any two sequences
o Too slow for NGS and very high memory footprint
• Based on hashes
o Faster than SW
o High memory footprint
• Burrows-Wheeler Transform
o Very fast and low memory footprint
o Very sensitive to errors
• Hybrid approaches
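To give an idea of why the Burrows-Wheeler Transform suits read mapping, here is a toy Python sketch of the transform and of exact-match counting by backward search. It is an illustration only: real aligners build the index with suffix arrays and sampled occurrence tables, whereas this version sorts all rotations and recounts characters on every step, which only works for tiny inputs. The function names are invented for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern in the original text via backward search."""
    # first[c] = index of the first row starting with c in the sorted rotations
    first = {}
    for index, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, index)

    low, high = 0, len(bwt_string)  # current range of matching rows [low, high)
    for char in reversed(pattern):
        if char not in first:
            return 0
        # Number of occurrences of char before each boundary (naive recount).
        low = first[char] + bwt_string[:low].count(char)
        high = first[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low


index = bwt("banana")
print(index)
print(count_occurrences(index, "ana"))
```

The search only extends exact matches, one character at a time. That is one way to see why BWT-based mapping is fast on clean reads but needs extra work (backtracking or seeding) to tolerate mismatches and gaps.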
Mapping process: tools
• BWA, BWA-SW and BWA-MEM
o Burrows-Wheeler Aligner, based on the Burrows-Wheeler Transform
o http://bio-bwa.sourceforge.net/
o Widely used: supports many read lengths, valid for Illumina, 454, etc.
• Bowtie and Bowtie2
o Based on the Burrows-Wheeler Transform (BWT) algorithm
o Bowtie allows a few mismatches and no gaps, and claims to be the fastest
o http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• HPG Aligner
o Great speed and sensitivity
o DNA or RNA
o HPC technologies used to provide the fastest runtime: multicore, SSE, GPUs
o Valid for short and long reads
o http://wiki.opencb.org/projects/hpg/doku.php?id=aligner:overview
• Other tools
o DNA: Bfast, Shrimp & Shrimp2, Blat, Mosaik-aligner, NextGenMap
o RNA-Seq: TopHat (uses bowtie) & TopHat2 (uses bowtie2), SOAPSplice
o BS-seq: Bismark (uses bowtie2), BRAT & BRAT-BW
Mapping output: SAM / BAM format
• Specification: http://samtools.github.io/hts-specs/SAMv1.pdf
• A SAM file has two sections: header and alignment
• BAM format
o BAM is the binary (compressed) representation of a SAM file
o A BAM file is smaller than its corresponding SAM file and can be read faster, but the content is the same
• BAM index
o Indexing a BAM file gives access to the alignments overlapping a specified region without going through all the alignments
o BAM index file: .bai
o The BAM file must be sorted by coordinate to be indexed
• Tools
o Samtools, http://samtools.sourceforge.net
o Provides several utilities for manipulating alignments: SAM to BAM, sorting, BAM indexing, etc.
o Others: Picard, Pysam, Bio-SamTools
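To make the alignment section concrete, the sketch below parses one SAM alignment line into its eleven mandatory fields and reads two bits of the FLAG field. It is a minimal pure-Python illustration; in practice a library such as Pysam would be used. The `SamRecord` type, the helper names and the example read are made up for this example.

```python
from collections import namedtuple

# The eleven mandatory SAM fields, followed by any optional TAG:TYPE:VALUE fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


def is_unmapped(record):
    """FLAG bit 0x4: the read is unmapped."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    """FLAG bit 0x10: the read aligns to the reverse strand."""
    return bool(record.flag & 0x10)


# Invented example read aligned to chr21.
line = "read1\t16\tchr21\t1000\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```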
Mapping procedure
→ Choose a valid reference
→ Choose aligner
→ Choose aligner params
→ SAM to BAM
→ Sort BAM
→ Index BAM
→ QC
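The mapping procedure can be scripted. The sketch below only builds the command lines, assuming BWA-MEM as the aligner and samtools 1.x syntax (older samtools releases use a different `sort` syntax); the file names, the `mapping_commands` helper and the use of `samtools flagstat` as the QC step are assumptions for illustration, not the authors' pipeline.

```python
import subprocess


def mapping_commands(reference, reads_1, reads_2, prefix, threads=4):
    """Return (command, stdout_file) pairs for a paired-end mapping run."""
    sam = prefix + ".sam"
    bam = prefix + ".bam"
    sorted_bam = prefix + ".sorted.bam"
    return [
        (["bwa", "index", reference], None),                      # index the reference
        (["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2], sam),  # map
        (["samtools", "view", "-b", "-o", bam, sam], None),       # SAM to BAM
        (["samtools", "sort", "-o", sorted_bam, bam], None),      # sort by coordinate
        (["samtools", "index", sorted_bam], None),                # creates the .bai index
        (["samtools", "flagstat", sorted_bam], prefix + ".flagstat.txt"),  # basic QC
    ]


def run_all(commands):
    """Run each command in order, redirecting stdout where a file is given."""
    for command, stdout_file in commands:
        if stdout_file is None:
            subprocess.run(command, check=True)
        else:
            with open(stdout_file, "w") as handle:
                subprocess.run(command, check=True, stdout=handle)


commands = mapping_commands("ref.fa", "sample_R1.fastq", "sample_R2.fastq", "sample")
for command, stdout_file in commands:
    print(" ".join(command), ">" + stdout_file if stdout_file else "")
```

Calling `run_all(commands)` would execute the steps, provided `bwa` and `samtools` are installed and the input files exist.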
Mapping procedure: can the results be visualized?
• Yes! Use IGV: The Integrative Genomics Viewer
• It is an integrated visualization tool for large data sets
• http://www.broadinstitute.org/igv/
• IGV supports multiple file formats (not only BAM!)
• [Screenshots: RefSeq track; zooming in to focus on an exon; visualizing variants]
What is variant calling?
Variant calling identifies variable sites (i.e. sites that differ from the reference)
• Variant types
o SNV: single nucleotide variant
o Indel: small insertion/deletion variant
• NGS data can suffer from high error rates
o Base-calling and alignment errors
• Accurate variant calling is difficult
o There is often considerable uncertainty associated with the results
• It is crucial to quantify and account for this uncertainty
o It influences downstream analyses based on the inferred SNVs (identification of rare mutations, estimation of allele frequencies, etc.)
Variant calling pipeline based on GATK
→ Mapping
→ PRE-processing: mark duplicates, indel realignment, base quality recalibration
→ Variant calling
→ POST-processing: filtering and labeling
→ Variant annotation
M. DePristo et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43:491-498, 2011
Variant calling
• SNV calling
• Indel calling
SNV calling
• SNV: single nucleotide variant
o Examine the bases aligned to a position and look for differences
• Two steps:
o Variant calling: aims to determine the positions at which at least one of the bases differs from the reference
o Genotype calling: the process of determining the genotype of each individual at positions where a variant has already been called
• Early methods:
o Counting alleles at each site and using simple cutoff rules for when to call a SNV or genotype
• Probabilistic frameworks:
o Compute genotype likelihoods
o Advantages:
• Provide statistical measures of uncertainty
• Lead to higher accuracy in genotype calling
• Provide a natural framework for incorporating information (allele frequency, linkage disequilibrium, etc.)
SNV calling
• Probabilistic framework: Bayesian approach (used by GATK):
$$p(G \mid D) = \frac{p(G)\, p(D \mid G)}{p(D)}$$
where:
D represents our data (read base pileup at this reference base)
G represents the genotype under consideration
p(G|D) is the posterior probability of genotype G
p(D|G) is the genotype likelihood
p(G) is the prior probability of seeing this genotype (SNP DBs, population sample, etc.)
p(D) is constant over all genotypes (can be ignored)
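As a worked illustration of the Bayesian formula, the sketch below computes genotype posteriors for a small pileup. It is a simplified model written for this example, not GATK's implementation: reads are treated as independent, a diploid genotype emits each of its two alleles with probability 0.5, a miscalled base is equally likely to be any of the other three bases, and the prior values are invented.

```python
def base_likelihood(observed, allele, error):
    """p(observed base | true allele), with errors spread over the 3 other bases."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_posteriors(pileup, priors):
    """Return p(G|D) for every genotype in priors.

    pileup: list of (base, phred_quality) pairs observed at one position.
    priors: dict mapping two-letter genotypes such as "AG" to p(G).
    """
    unnormalized = {}
    for genotype, prior in priors.items():
        allele_1, allele_2 = genotype
        likelihood = 1.0  # p(D|G), product over independent reads
        for base, quality in pileup:
            error = 10 ** (-quality / 10)
            likelihood *= (0.5 * base_likelihood(base, allele_1, error)
                           + 0.5 * base_likelihood(base, allele_2, error))
        unnormalized[genotype] = prior * likelihood
    evidence = sum(unnormalized.values())  # p(D), identical for all genotypes
    return {genotype: value / evidence for genotype, value in unnormalized.items()}


# Reference base A; five reads support A and five support G, all with Q30.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
priors = {"AA": 0.998, "AG": 0.0015, "GG": 0.0005}  # hypothetical prior values
posteriors = genotype_posteriors(pileup, priors)
print(max(posteriors, key=posteriors.get))
```

Dividing by the sum plays the role of p(D): it does not change which genotype ranks first, which is why the slide notes that it can be ignored when only comparing genotypes.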
Indel calling
• Small insertions and deletions observed in the alignment of the read relative to the reference genome
o BAM format: an I or D character in the CIGAR string denotes an indel in the read
• Factors to consider when calling indels
o Misalignment of the read
o Alignment score (it is often cheaper to introduce multiple SNVs than an indel)
o Sufficient flanking sequence on either side of the indel
o Length of the reads
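The way I and D operations in a CIGAR string translate into indel positions can be sketched as follows. This is a minimal illustration in plain Python; `indels_from_cigar` is a name invented for this example, and the reported coordinate convention (first deleted reference base for a deletion, next reference base for an insertion) is a choice made here.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, pos):
    """List (kind, reference_position, length) for each indel in a read.

    pos is the 1-based leftmost mapping position of the read (SAM POS field).
    """
    indels = []
    reference_position = pos
    for length_text, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length_text)
        if operation == "I":
            # Insertions consume read bases only; the reference position stays.
            indels.append(("insertion", reference_position, length))
        elif operation == "D":
            indels.append(("deletion", reference_position, length))
            reference_position += length
        elif operation in "MN=X":
            # These operations consume reference bases.
            reference_position += length
        # S, H and P do not consume reference bases.
    return indels


print(indels_from_cigar("50M2D25M1I10M", 1000))
```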
Variant calling software
Nielsen R, et al., Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 2011; 12: 443-451
Variant Call Format (VCF)
[Figure: example VCF file showing the header, the data lines and the INFO fields]
FORMAT fields:
GT: Genotype. For a single ALT:
0|0 – the sample is homozygous reference
0|1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
1|1 – the sample is homozygous alternate
GQ: Genotype Quality: phred-scaled confidence that the true genotype is the one provided in GT
DP: Approximate read depth
HQ: Haplotype Quality
Variant filtration and labeling
GOAL
• Filter SNVs and indels based on certain criteria, for example using a set of expressions derived from INFO fields: depth, mapping quality, etc.
• Variants that pass such criteria are labeled as PASS. Otherwise, they are labeled as NOT_PASS (or whatever label we want to use)
• Filtering parameters for SNVs and indels are different
http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
A variant fails the filter if any of the following expressions is true:
SNVs
• QD < 2.0
• MQ < 40.0
• FS > 60.0
• HaplotypeScore > 13.0
• MQRankSum < -12.5
• ReadPosRankSum < -8.0
Indels
• QD < 2.0
• FS > 200.0
• ReadPosRankSum < -20.0
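The hard-filter thresholds above can be expressed as a small labeling function. This is a sketch in plain Python, not GATK's VariantFiltration tool. It assumes the INFO annotations have already been parsed into a dictionary of floats, and it skips annotations that are missing for a variant, which is a simplification made for this example.

```python
import operator

# (annotation, comparison that means "fail", threshold), taken from the lists above.
SNV_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("HaplotypeScore", operator.gt, 13.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
]


def label_variant(info, is_indel=False):
    """Return "PASS" or "NOT_PASS" for a variant given its INFO annotations."""
    filters = INDEL_FILTERS if is_indel else SNV_FILTERS
    for annotation, fails, threshold in filters:
        value = info.get(annotation)
        if value is not None and fails(value, threshold):
            return "NOT_PASS"
    return "PASS"


print(label_variant({"QD": 15.2, "MQ": 59.8, "FS": 1.3}))
print(label_variant({"QD": 1.5, "MQ": 59.8, "FS": 1.3}))
print(label_variant({"QD": 10.0, "FS": 100.0}, is_indel=True))
```

The third call shows why SNVs and indels need separate thresholds: FS = 100 fails the SNV filter but is accepted for an indel.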
Why annotation?
• Each healthy person carries in the exome:
o Approx. 11,000 synonymous variants
o Approx. 11,000 non-synonymous variants
o From 250 to 300 loss-of-function variants in annotated genes
o From 50 to 100 variants previously implicated in inherited disorders
• Freeman-Sheldon syndrome
o Only two variants are the true disease-causing mutations
Why annotation?
GOAL: To identify a small subset of functionally important variants from large amounts of sequencing data, in order to pinpoint potential disease-causing genes and mutations (“needles in stacks of needles”)
P. Cordero et al., Whole-genome sequencing in personalized therapeutics, Clinical Pharmacology & Therapeutics, vol. 91(6): 1001-1009, 2012
G.M. Cooper et al., Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data, Nature Reviews Genetics, vol. 12: 628-640, 2011
K. Wang et al., ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Research, vol. 38(16), e164, 2010
Annotation levels
• Genomic localization
Some DBs of functional information
Annotation software
• ANNOVAR
o http://www.openbioinformatics.org/annovar
o Local annotation of SNVs and indels
o DBs: dbSNP, 1000g, regulatory information…
o Prediction: SIFT, Polyphen, Mutation Taster
o Species: human, mouse, worm, fly, yeast
• Variant Effect Predictor (VEP)
o http://www.ensembl.org/info/docs/tools/vep/index.html
o Can annotate SNVs, indels and complex variants
o Prediction: SIFT, Polyphen
o Many species
• HPG variant
o http://docs.bioinfo.cipf.es/projects/hpg-variant
o Can annotate SNVs and indels
o Huge amount of DBs available: HGMD, 1000g, dbSNP, regulatory information…
o 11 species (human, mouse, worm, fly, yeast,…)
[Figure: pedigree with individuals 199, 200, 201 and 249]
Custom filtering
• The set of annotated variants can be huge.
• Filtering is needed, for example, based on:
o Genotype according to pedigree
o Variant type: synonymous, nonsynonymous, stoploss, stopgain…
o Population frequency
o Conservation
o Disease information
o Pathways or ontologies
o …
It depends on the hypothesis!!!
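As an example of one such hypothesis, the sketch below filters annotated variants for a rare autosomal recessive disease: keep protein-changing variants that are rare in the population, homozygous for the alternate allele in affected individuals and heterozygous in both parents. Everything here is hypothetical and only illustrates the idea: the dictionary layout, the field names, the 1% frequency cutoff and the toy variants are not taken from the slides.

```python
HETEROZYGOUS = {"0/1", "1/0"}


def normalize(genotype):
    """Treat phased (0|1) and unphased (0/1) genotypes alike."""
    return genotype.replace("|", "/")


def recessive_candidates(variants, affected, parents, max_frequency=0.01,
                         kept_types=("nonsynonymous", "stopgain", "stoploss")):
    """Keep variants compatible with autosomal recessive inheritance."""
    candidates = []
    for variant in variants:
        genotypes = {sample: normalize(gt) for sample, gt in variant["genotypes"].items()}
        if variant["type"] not in kept_types:
            continue  # variant type filter
        if variant["frequency"] > max_frequency:
            continue  # population frequency filter
        if not all(genotypes[sample] == "1/1" for sample in affected):
            continue  # affected individuals must be homozygous alternate
        if not all(genotypes[sample] in HETEROZYGOUS for sample in parents):
            continue  # both parents must be carriers
        candidates.append(variant)
    return candidates


# Toy data with invented sample names and variant identifiers.
variants = [
    {"id": "var1", "type": "nonsynonymous", "frequency": 0.001,
     "genotypes": {"child": "1/1", "father": "0/1", "mother": "1|0"}},
    {"id": "var2", "type": "synonymous", "frequency": 0.001,
     "genotypes": {"child": "1/1", "father": "0/1", "mother": "0/1"}},
    {"id": "var3", "type": "nonsynonymous", "frequency": 0.20,
     "genotypes": {"child": "1/1", "father": "0/1", "mother": "0/1"}},
    {"id": "var4", "type": "stopgain", "frequency": 0.0,
     "genotypes": {"child": "0/1", "father": "0/1", "mother": "0/0"}},
]
selected = recessive_candidates(variants, affected=["child"], parents=["father", "mother"])
print([variant["id"] for variant in selected])
```

A dominant or de novo hypothesis would need different genotype rules, which is the point of the remark that filtering depends on the hypothesis.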
NGS data pipeline analysis
→ Sequence processing (FASTQ)
→ Mapping (BAM)
→ Variant calling (VCF)
→ Variant annotation (VCF)
→ Custom filtering (VCF)
→ Candidate variants (XLS, PDF, HTML…)
From raw to candidate variants
FAMILY-21 STRATEGY
• DNA capture: custom capture to target chromosome 21 exons
• Multiplexed Illumina NextSeq sequencing
• Read quality control
• Read mapping using BWA
• Variant calling using GATK
• Variant quality filtration using GATK
• Phenotype: Muscular Dystrophy
• Monogenic disease
• Inheritance: autosomal recessive
• Linked to chromosome 21
• Consanguinity
[Figure: pedigree with individuals 199, 200, 201 and 249]
From raw to candidate variants
→ Raw variants: 1078 variants
→ Annovar annotation
→ Filter by variant type (non-synonymous variants): 155 variants
→ Filter by pedigree: 4 variants
→ Full annotation: 1 candidate
→ Experimental validation
Success stories
• Whole Genome Example
• Whole Exome Example
• Targeted Sequencing Example
Thank you for your attention.