Lecture 3. Topics in High-Throughput Sequencing (Identification of Genetic Variations) The Chinese...

Lecture 3. Topics in High-Throughput Sequencing (Identification of Genetic Variations)

The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Types of genetic variation
2. Single nucleotide variants and small insertions/deletions
3. Large insertions/deletions and translocations
4. Repeats and copy number variations
5. Inversions

Last update: 20-Sep-2015

TYPES OF GENETIC VARIATION

Part 1


Genetic “variation”
• Two main definitions:
  1. Differences in DNA among different individuals in a population
  2. Differences in DNA between an individual and a reference (focus of this lecture)
• Sometimes, it is easy to define the reference
  – The human reference sequence (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/ https://en.wikipedia.org/wiki/Human_Genome_Project)
  – “Normal” genome (e.g., blood from the same cancer patient)
• Sometimes, it is not easy to define
  – A’s insertion with respect to B is B’s deletion with respect to A
  – Which one is more “normal”?




Types of genetic variation
• Single nucleotide variants (SNVs)
  – Single nucleotide polymorphisms (SNPs) if found in >1% of individuals in a population
• Small insertions/deletions (indels)
  – Several nucleotides long
• Structural variations (SVs)
  – Larger variations



Some proposed definitions


Term: Definition
Structural variant (SV): A genomic alteration (e.g., a CNV, an inversion) that involves segments of DNA >1kb
Copy number variant (CNV): A duplication or deletion event involving >1kb of DNA
Duplicon: A duplicated genomic segment >1kb in length with >90% similarity between copies
Indel: Variation from insertion or deletion event involving <1kb of DNA
Intermediate-sized structural variant (ISV): A structural variant that is ~8kb to 40kb in size. This can refer to a CNV or a balanced structural rearrangement (e.g., an inversion)
Low copy repeat (LCR): Similar to segmental duplication
Multisite variant (MSV): Complex polymorphic variation that is neither a PSV nor a SNP
Paralogous sequence variant (PSV): Sequence difference between duplicated copies (paralogs)
Segmental duplication: Duplicated region ranging from 1kb upward with a sequence identity of >90%
Interchromosomal: Duplications distributed among nonhomologous chromosomes
Intrachromosomal: Duplications restricted to a single chromosome
Single nucleotide polymorphism (SNP): Base substitution involving only a single nucleotide; ~10 million are thought to be present in the human genome at >1%, leading to an average of one SNP difference per 1250 bases between randomly chosen individuals

Table source: Freeman et al., Genome Research 16(8):949-961, (2006)
More commonly used


Origin of genetic variations
• SNVs: Errors during DNA replication that survive the proof-reading and mismatch-repair mechanisms


Image credit: Wikipedia; Martin and D. Scharff, Nature Reviews Immunology 2(8):605-614, (2002)


Origin of genetic variations
• SVs: Various mechanisms

– FoSTeS: Fork stalling and template switching

– MEI: Mobile element insertion

– NAHR: Non-allelic homologous recombination

– NHEJ: Non-homologous end-joining


Image credit: Bickhart and Liu, Frontiers in Genetics 10.3389, (2014); Xing et al., Genome Research 19(9):1516-1526, (2009)


Origin of genetic variations
• SVs: More figures


Image credit: Gu et al., PathoGenetics 1(1):4, (2008)


Consequence of genetic variants
• Hitting genes:


Image source: http://www.nbs.csudh.edu/chemistry/faculty/nsturm/CHEMXL153/DNAMutationRepair.htm


Consequence of genetic variants
• Hitting genes:
  – Synonymous (silent) mutation (no change in protein sequence)
    • May still affect translational efficiency
  – Nonsense mutation (premature stop codon)
  – Read-through (removal of the stop codon)
  – Missense mutation (change of one/a few amino acids)
  – Frameshift (shifting the reading frame)
  – Affecting splicing (removal of an acceptor or donor site, or creation of a new one)
  – Deletion of whole exon/gene
  – Changing gene copy number
  – Gene fusion
  – ...
• Others (more difficult to determine):
  – Disrupting protein binding sites
  – Affecting gene regulation
  – Affecting DNA 3D structure
  – ...
• See “Effect prediction details” section of SnpEff manual (http://snpeff.sourceforge.net/SnpEff_manual.html) for more details
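The first few gene-level consequences can be illustrated for a change within a single codon. The following is a minimal sketch, assuming the standard genetic code and codons written on the coding strand; the function names are hypothetical, and real annotators such as SnpEff cover many more cases (splicing, frameshifts, multiple transcripts).

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(codon):
    """Translate one DNA codon into a one-letter amino acid ('*' = stop)."""
    i, j, k = (BASES.index(base) for base in codon.upper())
    return AMINO_ACIDS[16 * i + 4 * j + k]


def classify_codon_change(ref_codon, alt_codon):
    """Label the effect of replacing ref_codon with alt_codon."""
    ref_aa, alt_aa = translate(ref_codon), translate(alt_codon)
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # premature stop codon
    if ref_aa == "*":
        return "read-through"  # stop codon removed
    return "missense"


for ref, alt in [("GAA", "GAG"), ("GAA", "GTA"), ("TAC", "TAA"), ("TAA", "TAC")]:
    print(ref, alt, classify_codon_change(ref, alt))
```

The order of the checks matters: a stop-to-stop change is reported as synonymous before the nonsense and read-through branches are reached.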




Using NGS to identify genetic variations
• General steps:
  1. Align sequencing reads to reference
     OR
     Construct sequence assembly from sequencing reads
  2. Look for differences
• The alignment strategy only works when accurate and efficient read alignment is possible.
  – Cannot detect sequences that are entirely absent from the reference
• The assembly strategy only works for genomic regions that can be accurately assembled.
• Both strategies must also distinguish sequencing errors/biases from real variants.



DNA-seq vs. RNA-seq in calling variants
• Using DNA to identify genetic variants could identify variants not functionally significant
  – Example: Fused gene due to translocation not actually expressed
• Using RNA to identify genetic variants could falsely treat post-transcriptional modifications as genetic variants
  – Example: RNA editing
• In general, good to have support from both DNA and RNA data


SINGLE NUCLEOTIDE VARIANTS AND SMALL INSERTIONS/DELETIONS

Part 2


A typical pipeline
• The Genome Analysis Toolkit (GATK) workflow for calling variants in RNA-seq data (similar for DNA-seq)


Image credit: Broad Institute, https://www.broadinstitute.org/gatk/guide/tagged?tag=rnaseq


More details of pipeline
• The Genome Analysis Toolkit (GATK) workflow for calling variants in RNA-seq data


Image credit: Broad Institute, https://www.broadinstitute.org/gatk/guide/tagged?tag=rnaseq


Read re-alignment
• In standard sequence alignment, each read is aligned to reference independently.
• To discover indels accurately, re-alignment by combining information from multiple reads is recommended.
  – Usually fix mis-alignments at read ends
• Example:
  Reference: CGACCGT
  Read 1: ACCAGT (more likely to be one insertion than two SNVs)
  Read 2: CGACCA (not sure whether it is insertion or SNV by itself, more likely to be an insertion after considering read 1)



Read re-alignment


Figure panels: Before re-alignment; After re-alignment
Image credit: DePristo et al., Nature Genetics 43(5):491-498, (2011)


Re-calibration of base quality scores
1. Assuming the observed quality score is affected by:
   – Actual quality score
   – Machine cycle (i.e., base position on the read)
   – Di-nucleotide context (the base itself and the one before)
2. Estimating the weight of each factor using mismatches at loci not known to vary in the dbSNP database of genetic variants
   – All these mismatches are assumed to be due to sequencing errors
3. Adjusting the quality scores accordingly
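The idea behind these three steps can be sketched with a simplified stand-in: rather than estimating a weight per factor, the sketch below bins bases by (reported quality, machine cycle, di-nucleotide context) and converts each bin's observed mismatch rate into an empirical Phred score. The function name, the input layout, and the add-one smoothing are assumptions of this sketch, not the GATK implementation.

```python
from collections import defaultdict
from math import log10


def empirical_qualities(observations):
    """Return an empirical Phred score for each covariate bin.

    observations: iterable of (reported_quality, cycle, dinucleotide,
    is_mismatch) tuples collected at loci not known to vary.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for quality, cycle, dinucleotide, is_mismatch in observations:
        entry = counts[(quality, cycle, dinucleotide)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    # Add-one smoothing (an assumption of this sketch) avoids log10(0).
    return {
        key: -10.0 * log10((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }


# Toy data: 1000 bases reported as Q30 at cycle 5 in an "AC" context,
# 9 of which mismatch the reference.
toy = [(30, 5, "AC", True)] * 9 + [(30, 5, "AC", False)] * 991
recalibrated = empirical_qualities(toy)
print(recalibrated[(30, 5, "AC")])
```

In this toy bin the bases were reported as Q30 (one error in 1000), but the observed mismatch rate is close to one in 100, so the empirical score comes out near Q20.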



Re-calibration of base quality scores


Image credit: DePristo et al., Nature Genetics 43(5):491-498, (2011)


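Calling SNVs
• Notations:
  – $D$: data (all bases aligned to a position)
  – $D_i$: the i-th aligned base (i.e., the base aligned to the position on the i-th read)
  – $G_j$, $G_k$: genotypes
  – $H_{j1}$, $H_{j2}$: alleles (haplotypes) of $G_j$
  – $\varepsilon_i$: base calling error rate of the i-th aligned base
• Bayesian formulation:

\[
\Pr(G_j \mid D) = \frac{\Pr(G_j)\,\Pr(D \mid G_j)}{\sum_{G_k} \Pr(G_k)\,\Pr(D \mid G_k)}
\]

\[
\Pr(D \mid G_j) = \prod_i \left[ \frac{\Pr(D_i \mid H_{j1})}{2} + \frac{\Pr(D_i \mid H_{j2})}{2} \right]
\]

\[
\Pr(D_i \mid H_{j1}) =
\begin{cases}
1 - \varepsilon_i & \text{if } D_i = H_{j1} \\
\varepsilon_i & \text{if } D_i \neq H_{j1}
\end{cases}
\]

The sum in the denominator runs over the three candidate diploid genotypes $G_k$ (homozygous for one allele, heterozygous, homozygous for the other allele).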
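The Bayesian formulation can be evaluated directly for a small pileup. This is a minimal sketch: genotypes are written as two-letter strings, the function names are hypothetical, and the uniform prior in the example is an assumption made only for illustration (real callers use informative priors).

```python
from math import prod


def base_likelihood(base, allele, error_rate):
    """Pr(D_i | H): 1 - error if the base matches the allele, error otherwise."""
    return 1.0 - error_rate if base == allele else error_rate


def genotype_likelihood(bases, error_rates, genotype):
    """Pr(D | G_j): product over reads of the average of the two allele likelihoods."""
    allele1, allele2 = genotype
    return prod(
        0.5 * base_likelihood(b, allele1, e) + 0.5 * base_likelihood(b, allele2, e)
        for b, e in zip(bases, error_rates)
    )


def genotype_posteriors(bases, error_rates, priors):
    """Pr(G_j | D) for every genotype in priors, normalized to sum to 1."""
    joint = {g: p * genotype_likelihood(bases, error_rates, g) for g, p in priors.items()}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Six reads cover the position: three support A, three support G.
bases = "AAGGAG"
error_rates = [0.01] * len(bases)
priors = {"AA": 1 / 3, "AG": 1 / 3, "GG": 1 / 3}  # uniform prior (assumption)
posteriors = genotype_posteriors(bases, error_rates, priors)
print(max(posteriors, key=posteriors.get))
```

For long pileups the product of many small probabilities can underflow, so a practical implementation would usually work with log-likelihoods instead.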


Calling indels
• For indels, $\Pr(D_i \mid H_{j1})$ is computed based on a hidden Markov model:
  – $I_x$, $I_y$: the two indel haplotypes
  – $\delta$: gap opening penalty
  – $\varepsilon$: gap extension penalty
  – $p_{x_i, y_j}$: likelihood of aligning $x_i$ and $y_j$
  – $q_{x_i}$: likelihood of aligning $x_i$ and a gap


Image source: https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-5-Variant_calling.pdf

LARGE INSERTIONS/DELETIONS AND TRANSLOCATIONS

Part 3


Useful types of information
• Split reads: One single read aligned to two different locations on reference
  – Precisely define break points
  – Could be difficult to align
  – Relatively rare
• Paired-end reads: The two reads in a mate pair aligned to the reference with an unexpected distance, or one read cannot be aligned
  – More common than split reads
  – Reads easier to align
  – Cannot determine precise break points
  – Could be hard to judge if it is an SV due to inexact insert size
• Read depth/alignment quality: Drop of read depth/alignment quality around break points due to difficulty of alignment, and lack of aligned reads in deleted regions
  – Can be observed even in standard alignment pipelines
  – The drop is not always clear
  – Some drops could be due to other reasons
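The paired-end signal can be sketched as a simple outlier test on the mapped distance between mates. This is an illustrative sketch only: the function name and the threshold of 5 median absolute deviations are hypothetical choices, and the expected insert size is estimated from the same pairs on the assumption that most of them are concordant.

```python
from statistics import median


def flag_discordant_pairs(mapped_distances, n_mads=5.0):
    """Return (index, label) for pairs whose mapped distance is unexpected."""
    center = median(mapped_distances)
    mad = median([abs(d - center) for d in mapped_distances])
    tolerance = n_mads * max(mad, 1.0)  # guard against a zero MAD
    flagged = []
    for index, distance in enumerate(mapped_distances):
        if distance - center > tolerance:
            # Mates map farther apart than expected: sequence missing in the sample.
            flagged.append((index, "deletion-like"))
        elif center - distance > tolerance:
            # Mates map closer than expected: extra sequence in the sample.
            flagged.append((index, "insertion-like"))
    return flagged


distances = [300, 310, 295, 305, 290, 2000, 298]
print(flag_discordant_pairs(distances))
```

Because the insert size is inexact, pairs near the threshold remain ambiguous, which is the difficulty noted in the last sub-bullet under paired-end reads.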



Useful types of information
Image credit: Keane et al., Frontiers in Genetics 10.3389, (2014)
Figure labels: expected insert size; distance of the aligned locations on the reference


Alignment strategies
• Split mapping
  – Need to try different possible ways to split a read, or use specialized alignment algorithms
  – If the split is too imbalanced, the shorter part may not be aligned (uniquely)
• Constructing junction library (also used in aligning RNA-seq reads), then aligning reads onto the putative junction sequences
  – Need to first have a rough idea of the break points
  – Need to try different possible junctions



Junction library
• Suppose we have the following rough estimate of the break points of a deletion (e.g., based on alignment of paired-end reads):
  A C G A G A T A C T G A C A G A T T A C T G A T G C A G T A
• Possible junctions:
  A C G A G A T G A T G C A G T A
  A C G A G A T A T G C A G T A
  A C G A G A T T G C A G T A
  A C G A G A T G C A G T A
  A C G A G A G A T G C A G T A
  A C G A G A A T G C A G T A
  A C G A G A T G C A G T A
  A C G A G A G C A G T A
  A C G A G G A T G C A G T A
  A C G A G A T G C A G T A
  A C G A G T G C A G T A
  A C G A G G C A G T A
  A C G A G A T G C A G T A
  A C G A A T G C A G T A
  A C G A T G C A G T A
  A C G A G C A G T A
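The junction list for this example can be generated by joining every candidate left flank with every candidate right flank. The sketch below assumes 0-based slicing positions read off the example (left flanks ending after 7, 6, 5 or 4 bases; right flanks starting at offsets 21 to 24); the function and variable names are hypothetical.

```python
from itertools import product

REFERENCE = "ACGAGATACTGACAGATTACTGATGCAGTA"


def junction_library(reference, left_ends, right_starts):
    """Return (left_end, right_start, junction) for every break point pair.

    Each junction keeps reference[:left_end] and reference[right_start:],
    i.e. it models a deletion of reference[left_end:right_start].
    """
    return [
        (left, right, reference[:left] + reference[right:])
        for left, right in product(left_ends, right_starts)
    ]


library = junction_library(REFERENCE, left_ends=[7, 6, 5, 4], right_starts=[21, 22, 23, 24])
for left, right, junction in library:
    print(left, right, junction)

# Several break point pairs can produce the same junction sequence.
distinct = {junction for _, _, junction in library}
print(len(library), len(distinct))
```

Sixteen break point pairs yield fewer distinct sequences because ACGAGATGCAGTA arises from four different pairs; reads aligned to that junction cannot tell those pairs apart.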


Real SVs vs. sequencing/alignment errors
• Real SVs are usually indicated by:
  – Even coverage around break points (ladder)
  – Good base quality and alignment scores
• A good case:
  Figure labels: putative junction sequence; break points; paired-end reads; split reads
• A bad case (gray portions of reads are aligned perfectly; colored portions are mismatches; reads marked in dark red have unexpected insert sizes):
  Figure labels: break points; putative junction sequence


Break point confusion
• SVs could be due to micro-homology at the break points:
  Figure labels: Paternal; Maternal
  A C G A G A T G C A G T A
  – Does the GAT come from the paternal or maternal copy?
    • Does it matter?
    • It matters more if we want to know what happens to the other ends of the breaks

REPEATS AND COPY NUMBER VARIATIONS

Part 4


Copy number variation
• For a diploid organism, each cell contains two copies of the same chromosome.
  – If a gene is unique, there are exactly two copies of it.
• Sometimes, the copy number is not 2:
  – Paralogs (gene duplication – various mechanisms)
  – Retro-transcription
  – Aneuploidy (not exactly 2 copies of each chromosome)
    • Whole genome
    • Whole chromosome



Copy number variation
• In general, DNA regions can have a copy number other than 2 for many reasons
• Copy numbers can have significant consequences. For example,
  – Haploinsufficiency (having only one copy cannot maintain function)
  – Gene dosage (amount of transcripts/proteins)
  – Complex phenotypic consequences (e.g., copy number of DUF1220 domain related to human brain size and diseases)



Smaller-scale repeats
• Genomes contain many types of repeats
  – By size
    • Tandem repeats: one immediately after another
      – E.g., TTAGGG at telomeres: related to protection
    • Short interspersed nuclear elements (SINEs)
      – E.g., Alu elements: ~280bp, GC rich
    • Long interspersed nuclear elements (LINEs)
      – E.g., L1 elements: ~6-8kbp, AT rich
  – By number of occurrences
  – By mechanism: transposable elements (TEs)
    • Retrotransposons (transcription, then reverse transcription): copy and paste
    • DNA transposons: cut and paste
• Some regions are defined as low complexity regions (LCRs) – regions with low information content



Identifying CNVs
• Useful information:
  – For determining boundaries:
    • Split reads
    • Paired-end reads
    • Loss of heterozygosity (LOH)
  – For determining both boundaries and copy number:
    • Read depth, relative to “normal”
      – Could be hard to define the “normal” line
    • B-allele frequency (BAF)
    • Long reads, if long enough
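The read-depth signal amounts to simple arithmetic once a “normal” depth has been chosen. This is a minimal sketch, assuming the normal depth corresponds to two copies and that depth scales linearly with copy number; the function name and the window depths are hypothetical.

```python
def estimate_copy_number(region_depth, normal_depth, normal_copies=2):
    """Scale the normal copy number by the depth ratio and round to an integer."""
    return round(normal_copies * region_depth / normal_depth)


# Mean depth in consecutive windows, with 30x taken as the two-copy baseline.
window_depths = [31, 29, 45, 46, 15, 30]
print([estimate_copy_number(depth, 30) for depth in window_depths])
```

The result depends entirely on the baseline, which is why defining the “normal” line is the hard part in practice.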



LOH
• Typically, heterozygous variants appear in all different places in the genome
• A large region without heterozygous variants may indicate occurrence of CNV
  – Note: Having only one copy leads to LOH, but LOH can also happen in regions with other copy numbers
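A scan for candidate LOH regions can be sketched as a search for long runs of variant calls that contain no heterozygous site. The function name, the input layout, and the minimum span used in the example are assumptions of this sketch.

```python
def homozygous_runs(variants, min_span):
    """Return (start, end) of runs of homozygous variant calls spanning >= min_span.

    variants: (position, is_heterozygous) tuples sorted by position.
    """
    runs = []
    start = end = None
    for position, is_heterozygous in variants:
        if is_heterozygous:
            # A heterozygous call closes the current run, if any.
            if start is not None and end - start >= min_span:
                runs.append((start, end))
            start = end = None
        else:
            if start is None:
                start = position
            end = position
    if start is not None and end - start >= min_span:
        runs.append((start, end))
    return runs


calls = [(100, True), (200, False), (5000, False), (9000, False), (9500, True), (9600, False)]
print(homozygous_runs(calls, min_span=5000))
```

A reported run is only a candidate: as the note above says, LOH does not by itself imply a single copy.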



BAF
• LOH only indicates regions with one allele completely disappeared
• B-allele frequency is a more general concept: the number of reads that support the B allele (defined arbitrarily) divided by the total number of reads aligned to the location that support either the A or B allele
  – The concept was originally defined for microarray data



BAF, LOH and LRR
• LRR: log2(observed signal / expected signal)


An illustration of log R Ratio (LRR) and B Allele Freq (BAF) values for the chromosome 15 q-arm of an individual. A normal chromosome region has three BAF genotype clusters, as represented as AA, AB, and BB genotypes in boxes, and with LRR values centered around zero. The copy-neutral LOH region has normal LRR values, but without the AB genotype cluster. The increased copy number for a CNV region can be detected based on an increased number of peaks in the BAF distribution, as well as increased LRR values. The patterns of LRR and BAF for different CNV regions, normal regions, and copy-neutral LOH regions are distinct from each other, thus the combination of LRR and BAF can be used to generate CNV calls.

Image credit: Wang et al., Genome Research 17(11):1665-1674, (2007)
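The two quantities, and the BAF clusters described in the caption, can be written out as a small sketch. The function names are hypothetical, and the expected cluster positions assume a pure sample with noise-free allele counts.

```python
from fractions import Fraction
from math import log2


def b_allele_frequency(a_reads, b_reads):
    """Fraction of the reads supporting either allele that support B."""
    return b_reads / (a_reads + b_reads)


def log_r_ratio(observed_signal, expected_signal):
    """LRR = log2(observed signal / expected signal)."""
    return log2(observed_signal / expected_signal)


def expected_baf_clusters(copy_number):
    """Ideal BAF values when 0..copy_number of the copies carry the B allele."""
    return [Fraction(k, copy_number) for k in range(copy_number + 1)]


print(b_allele_frequency(12, 12), log_r_ratio(100, 100))  # normal heterozygous site
print(expected_baf_clusters(2))  # AA, AB, BB
print(expected_baf_clusters(3))  # one extra copy adds a cluster
```

With two copies there are three clusters (0, 1/2, 1); losing the middle cluster while LRR stays near zero matches the copy-neutral LOH pattern, and a gain adds clusters and raises LRR.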

INVERSIONS

Part 5


Balanced mutation
• Insertion, deletion and CNV result in copy number changes
• In contrast, translocations and inversions usually do not
  – They are called “balanced mutations”
• Balanced mutations cannot be detected by checking read depth



Inversions: A closer look
• Suppose we have the following sequence:
  – ACGCAT
• What would it look like if the CGCA part is inverted?
  – AACGCT?
  – AGCGTT?
  – ATGCGT?
• Even when both strands are sequenced and inversions are present, we never align a 3’-5’ sequence against a 5’-3’ sequence, so an inverted segment shows up as its reverse complement (here, ATGCGT)
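The three candidate answers correspond to reversing the segment, complementing it, and doing both. Since every sequence is read 5’ to 3’, the inverted segment appears as its reverse complement, which a short sketch can confirm (the function names are hypothetical; positions are 0-based with an exclusive end).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def invert_segment(sequence, start, end):
    """Replace sequence[start:end] by its reverse complement."""
    return sequence[:start] + reverse_complement(sequence[start:end]) + sequence[end:]


sequence = "ACGCAT"
print(sequence[:1] + sequence[1:5][::-1] + sequence[5:])                  # reversal only
print(sequence[:1] + sequence[1:5].translate(COMPLEMENT) + sequence[5:])  # complement only
print(invert_segment(sequence, 1, 5))                                     # inversion
```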



Inversion: Strand and read orientation

Image credit: Okamura et al., BMC Genomics 8:160, (2007), http://www.biomedcentral.com/1471-2164/8/160/figure/F1
Figure labels: Reference; Sequenced DNA

Fragment 1: AACTTG; Alignments 1: AAC TTG
Fragment 2: AACGTT; Alignments 2: AAC TTG
Fragment 3: CTTTTG; Alignments 3: TTC TTG
Fragment 4: CTTGTT; Alignments 4: TTCTTG
Assuming perfect alignments


More on read orientations
• Some SVs are complex


Image credit: Medvedev et al., Nature Methods 6(11S):S13-S20, (2009); Pevzner, PNAS 100(13):7672-7677, (2003)


Even more on read orientations
• If a fragment is too long, one can circularize it, segment the circularized DNA again, and sequence the segment with the junction


Image source: Illumina Nextera technical note, http://www.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf


VCF files
• There is a file format defined for genetic variants called VCF (Variant Call Format).
  – Specification available at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
  – Two main sections: header and content
  – Header provides basic information of the file, and defines content attributes and filters
  – Each line in the content section represents one variant in one or more samples






An example


Example source: http://samtools.github.io/hts-specs/VCFv4.2.pdf

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
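The layout of a content line can be made concrete by splitting the first record of this example into named fields. This is a minimal sketch with hypothetical function names: real VCF files are tab-delimited, whereas the sketch splits on any whitespace because the example is shown with spaces, and it leaves all values as strings except POS.

```python
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003"
RECORD = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")


def parse_info(text):
    """Split 'KEY=value;FLAG' items; flags without a value become True."""
    info = {}
    for item in text.split(";"):
        key, separator, value = item.partition("=")
        info[key] = value if separator else True
    return info


def parse_record(header_line, record_line):
    """Turn one content line into a dict keyed by the column names."""
    columns = header_line.lstrip("#").split()
    fields = record_line.split()
    record = dict(zip(columns[:8], fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    format_keys = fields[8].split(":")
    # One dict per sample, keyed by the FORMAT attributes (GT, GQ, DP, HQ).
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(columns[9:], fields[9:])
    }
    return record


record = parse_record(HEADER, RECORD)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["INFO"])
print({name: sample["GT"] for name, sample in record["samples"].items()})
```

In the GT values, "|" marks a phased genotype and "/" an unphased one, with 0 referring to REF and 1 to the first ALT allele.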


Final remarks
• Some types of genetic variation take more time and need more complex methods to detect. Detect the easy ones first:
  1. Use standard alignment results to:
     • Detect SNVs and small indels
     • Get rough information of large indels, translocations, CNVs and inversions
  2. Use unaligned reads and additional procedures to determine detailed information of the SVs



Final remarks
• Some methods call genetic variants by combining the information from multiple samples.
  – Consistency among samples
  – Contrast among samples (e.g., tumor vs. non-tumor from the same patient – somatic variants) [lecture]
• To study the relationships among multiple variants, one may further construct haplotypes [project] or identify epistatic interactions among variants [project].



Summary
• Genetic variations
  – SNVs
  – Small indels
  – SVs: Large indels, translocations, CNVs, inversions
• Methods for detecting genetic variants
  – Split read
  – Paired-end read
    • Orientations
  – Depth of coverage
  – Allele ratios and frequencies

