Identification of Copy Number Variants using Genome Graphs Dhawal Verma Advisor: Dr. Hesham Ali.

Identification of Copy Number Variants using Genome Graphs

Dhawal Verma

Advisor: Dr. Hesham Ali

Introduction

The genome of an organism offers great insight into its:

•Phylogenetic history

•Interaction with the environment

•Internal functions

Even within the same species, the genomes of two individuals differ. Although these genomic variations are relatively small, they account for the observed variation in:

•Phenotypes (heterozygosity)

•Susceptibility to various diseases

Motivation

Heterozygosity is of major interest to researchers of genetic variation in natural populations.

It refers to the state of having different alleles at one or more corresponding chromosomal loci.

It is often one of the first "parameters" that one presents in a data set. It can tell us a great deal about the structure and even history of a population.

Motivation

Role in diseases

SVs and CNVs have been associated with susceptibility or resistance to disease.

Gene copy number can be elevated in cancer cells.

Copy number variation has also been associated with autism, schizophrenia and idiopathic learning disability.

Visualization of Genome

Genome = A Book

Written in 4 letters of nucleotides – A T G C

23 chromosome pairs = 23 chapters

Genes = Stories in each chapter

[Figure: the genome written in the letters A, T, G, C]

Genomic Structural Variation

Every genome differs from another. However, just as books differ from one another while drawing their words from a known dictionary, genomes are written from a common vocabulary.

Just as changing the position of a word in a sentence gives a different meaning, a different position of the same gene in a genome can give a distinct feature and cause variation between genomes.

Genomic Structural Variation

Until fairly recently, single nucleotide polymorphisms (SNPs) were thought to be the main source of variation in the human genome.

SNPs are variations that involve a change in just one nucleotide.

THE RAT CAN RUN FAST

THE CAT CAN RUN FAST

High-throughput genome scanning technologies revealed that there are other forms of genomic variation beyond single base-pair substitutions.

Structural Variants

"Structural variant" is an umbrella term for a group of genomic alterations involving segments of DNA typically larger than 1 kb.

The structural variation may be:

•Quantitative (CNVs – indels and duplications)

•Positional (translocations)

•Orientational (inversions).

Copy Number Variants (CNVs)

CNVs are defined as chromosomal segments, at least 1000 bases (1 kb) in length, that vary in number of copies from human to human.

CNVs are large chunks of DNA that are deleted, copied, flipped or otherwise rearranged in combinations that can be unique for each individual.

YOU CAN RUN FAST

YOU CAN RUN RUN RUN FAST

SNP v CNV

SNPs typically occur as two alleles, while approximately 5% of the human genome is defined as structurally variant in the normal population, involving more than 800 independent genes.

Of the total amount of variation between two human individuals

CNVs + SVs >>> SNPs

Primitive methods for detection of CNVs

1. Whole-genome array comparative genomic hybridization (aCGH), which tests the relative frequencies of probe DNA segments between two genomes.

2. SNP arrays to measure the intensity of probe signals at known SNP loci.

Limitations of the methods

The size and breakpoint resolution of any prediction is correlated with the density of the probes on the array, which is limited by the density of the array itself (for aCGH) or by the density of known SNP loci (for SNP arrays).

The limited resolution of arrays for high copy count segments and the lack of unique probes make it difficult to identify CNVs in repetitive regions.

Research Proposal

An effective computational method for the identification of Copy Number Variants in genomes.

Model

Next-generation sequencing data can be modeled in a graph that we call a genome graph.

Algorithm

By mapping the reference genome graph to the donor graph and combining two existing methods, depth of coverage (DOC) and paired-end mapping (PEM), we aim to overcome their individual limitations and detect CNVs with higher sensitivity and specificity.

Research Proposal

Our literature survey indicates that the PEM method is used specifically for detecting SVs and the DOC method for detecting CNVs.

CNVs in general are considered as a subset of SVs.

By integrating the two methods we can use PEM signatures at a higher magnification level.

The complexity can also be reduced by using bi-directional genome graphs.

Genome Graphs

With the advent of Next Generation Sequencing data that provides as much as 40x coverage for a human genome, a special class of graphs known as Genome graphs emerged.

The vertices represent either the reads or their substrings (k-mers, i.e. strings of length k over the letters A, T, G and C).

The edges represent overlaps between them (the prefix of one read is the suffix of the other).
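The vertex and edge definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming k-mer vertices and edges between k-mers that are adjacent within a read (a de Bruijn-style construction); the function names `kmers` and `build_genome_graph` and the toy reads are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def kmers(read, k):
    """Return all length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_genome_graph(reads, k):
    """Build a directed graph whose vertices are k-mers.

    An edge u -> v is added when u and v are consecutive k-mers in a read,
    so the last k-1 letters of u equal the first k-1 letters of v.
    """
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for p in parts:
            graph.setdefault(p, set())   # register vertices with no out-edges
        for u, v in zip(parts, parts[1:]):
            graph[u].add(v)
    return dict(graph)

# Toy reads (assumed for illustration); they overlap on "GCA".
reads = ["ATGCA", "GCATT"]
graph = build_genome_graph(reads, 3)
```

Because both reads contain the k-mer GCA, the two reads are joined into a single path ATG, TGC, GCA, CAT, ATT without any explicit pairwise alignment.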

Genome Graphs

•A genome graph can be unidirectional or bi-directional.

•A bi-directional genome graph models the double-strandedness of DNA.

•Bi-directional graphs help reduce the complexity of the algorithm: in unidirectional graphs two “complementary” walks are searched, while in a bi-directional graph a single walk can fetch both the sequence and its complement.
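To make the "single walk fetches both strands" idea concrete, here is a small Python sketch. It assumes a walk is given as a list of overlapping k-mers and uses the reverse complement to recover the opposite strand. Storing one canonical form per k-mer is a common way to merge the two strands into one vertex; that choice and the helper names are assumptions made for illustration, not the slides' specific data structure.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence read from the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement,
    so both strands share a single vertex (an assumed convention)."""
    return min(kmer, reverse_complement(kmer))

def walk_both_strands(path):
    """Spell the sequence of a walk over overlapping k-mers and return
    it together with the sequence of the complementary strand."""
    forward = path[0] + "".join(v[-1] for v in path[1:])
    return forward, reverse_complement(forward)

# One walk yields both strands.
forward, reverse = walk_both_strands(["ATG", "TGC", "GCA"])
```

In this sketch `canonical("TGC")` and `canonical("GCA")` return the same string, which is why a strand-aware graph needs roughly half the vertices of a graph that stores each strand separately.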

Depth of Coverage method

Depth of coverage (DOC): the density of reads mapping to a region.

Several recent studies have shown that by comparing the DOC within a sliding window of the genome to what is expected in the reference genome, it is possible to detect changes in copy number.

Limitations

Very complicated: it is difficult to separate true changes in copy number from segments that are over- or under-sampled by the sequencing technology.
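The sliding-window comparison can be sketched as follows. This is a simplified illustration that assumes per-base depth is already available for the donor and for the expected (reference) profile; the gain and loss thresholds of 1.5 and 0.5 are arbitrary illustrative values, and the function names are hypothetical.

```python
def window_means(depth, window, step):
    """Mean depth in each window of the given size, advancing by `step`."""
    return [sum(depth[i:i + window]) / window
            for i in range(0, len(depth) - window + 1, step)]

def call_windows(donor_depth, expected_depth, window, step,
                 gain=1.5, loss=0.5):
    """Label each window by the ratio of donor depth to expected depth.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    donor = window_means(donor_depth, window, step)
    expected = window_means(expected_depth, window, step)
    calls = []
    for d, e in zip(donor, expected):
        if e == 0:
            calls.append("unknown")   # no expected coverage to compare against
            continue
        ratio = d / e
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

# Toy per-base depths: a duplicated block followed by a mostly deleted block.
expected_depth = [10] * 12
donor_depth = [10] * 4 + [20] * 4 + [2] * 4
calls = call_windows(donor_depth, expected_depth, window=4, step=4)
```

A fixed ratio threshold like this cannot tell a real copy number change from a region that the sequencer simply over- or under-samples, which is the limitation noted above.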

Depth of Coverage

In a genome graph, an increase/decrease in number of vertices between two known vertices in the reference genome gives an indication of CNV.

Paired End Mapping method

PEM method: two paired reads (called matepairs) are generated at an approximately known distance in the donor genome.

The reads are mapped to a reference genome, and matepairs mapping at a distance significantly different from the expected length (termed discordant) suggest structural variants.

Limitations

Difficulty in detecting larger insertions and variation within areas of segmental duplications.
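The discordance test can be illustrated with a short Python sketch. It assumes each matepair is summarized by the two reference positions where its reads map, and that "significantly different" is approximated by a fixed tolerance around the expected distance. The tolerance, the labels and the function name are hypothetical choices for illustration.

```python
def classify_matepair(left_pos, right_pos, expected, tolerance):
    """Classify a matepair by its mapped distance on the reference.

    Mates mapping farther apart than expected suggest that the donor is
    missing sequence present in the reference (a deletion); mates mapping
    closer together suggest extra sequence in the donor (an insertion).
    `tolerance` is an assumed cutoff, e.g. a few standard deviations of
    the library's insert size.
    """
    span = right_pos - left_pos
    if span > expected + tolerance:
        return "deletion"
    if span < expected - tolerance:
        return "insertion"
    return "concordant"

# Toy matepairs as (left position, right position) on the reference.
pairs = [(1000, 1500), (1000, 2500), (1000, 1200)]
labels = [classify_matepair(l, r, expected=500, tolerance=50)
          for l, r in pairs]
```

An insertion longer than the expected distance cannot produce a matepair that spans it, which is one way to see why PEM has difficulty detecting larger insertions.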

PEM signatures in Genome Graphs

PEM signatures v DOC signatures

In contrast to most PEM signatures, DOC signatures can be used to detect very large events.

The larger the event, the stronger the signature.

However, they are not able to accurately identify smaller events that PEM signatures, even with low coverage, are able to detect.

Next Steps:

While inversions do not cause any changes in copy number, an area that is deleted (SV) will correspond to a loss (CNV). Similarly, a region containing a tandem duplication will be annotated both as having an insertion (SV) and as exhibiting a gain (CNV). In this way, any PEM method for SV detection can be viewed as a method for detecting a subset of CNVs.

The depth of coverage method is used extensively for detecting CNVs, while the PEM technique is mainly used for detecting SVs.

Our hypothesis is that PEM techniques can be used to improve both the sensitivity and specificity of depth of coverage based methods using a probabilistic graph-theoretic framework.
