
CNVS calling and visualization from NGS data

CNVs calling Modules

Table of Contents

INTRODUCTION
INTRODUCTION REFERENCES
GLOSSARY and MODULE OVERVIEW
MODULE DESCRIPTION
CNV CALLING: three different paths
1. ERDS/SVA path (DOC) (see the STEP5 pipeline “Sequence Variant Analyzer v1.0, for hg18 annotations”; a compatible ERDS version has not been released yet for Sequence Variant Analyzer v1.1)
2. BOWTIE/CNVer/SAVANT path (DOC+PEM)
2.1 BOWTIE alignment
2.1.1 Build bowtie index for the reference genome
2.1.2 Bowtie alignment with SAM production
2.1.3 SAM2BAM conversion
2.2 CNVer
2.2.1 CNVer call
2.2.2 Sort .bam
2.2.3 Indexing the .bam file
2.2.4 Visualization
3. CNVseq path (DOC)
3.1 Hits file production
3.2 CNV call with CNVseq


INTRODUCTION

Discovering structural variation from NGS data is still an open computational and bioinformatics challenge. Many tools are currently available (as described at http://www.gen2phen.org/wiki/tools-and-methods-mapping-genomic-structural-variation#Programs). All structural variant discovery methods share the same framework: detect anomalous “signatures” or patterns, then call the underlying variant. There are four main strategies that map sequence reads to the reference genome and identify discordant signatures or patterns diagnostic of different classes of structural variants. None of them is comprehensive: each method has particular strengths and weaknesses in detection, depending on the variant type or on the properties of the underlying sequence at the variant locus. The four methods are:

(1) Read-pair technologies (aka Paired End Mapping, or PEM, strategies): these tools rely on the information that resides in the (1) order, (2) distance and (3) orientation of the reads to detect structural variants (deletions, insertions, inversions, indels). In principle, these methods can detect almost all classes of variation. Read pairs that map too far apart suggest the presence of a deletion, pairs that map too close together may indicate an insertion, and pairs with orientation inconsistencies can point to inversions and tandem duplications. When only one member of the pair maps to the reference genome (aka singlet read), the companion read suggests the presence of variant sequence not included in the reference genome. There are now many computational tools based on a read-pair approach, including PEMer [1], VariationHunter [2-4], BreakDancer [5], MoDIL [6], MoGUL [7], HYDRA [8], Corona [9] and SPANNER [10]. This is a powerful strategy with some limitations: (1) resolving ambiguous mapping assignments in repetitive regions may be difficult; (2) it depends on the insert size, whose distribution is usually far from perfectly tight, which raises problems in detecting larger events.

(2) Read-depth methods (aka Depth of Coverage, or DOC, strategies): these tools assume that mapping depth across the whole genome follows a random (typically Poisson or modified Poisson) distribution, so that the number of reads expected to map within a region is proportional to the number of times that region appears in the sequenced genome. Thus, a region that has been deleted or duplicated will display fewer or more reads than expected, respectively. Read depth is the only sequencing-based method that accurately predicts absolute copy numbers [11-12], even though (1) the breakpoint resolution is often poor and (2) it is less sensitive than PEM methods for the detection of smaller events.

Some tools combine the two approaches described above to detect CNVs more reliably (CNVer [12] and Genome STRiP [13]).

(3) Split-read approaches: these methods can detect deletions and smaller insertions down to single-base-pair resolution. They are based on the detection of a “split” sequence-read signature, in which the alignment of a read to the genome is broken: a stretch of gaps in the read indicates a deletion, and a stretch of gaps in the reference indicates an insertion. The application of this method is still limited owing to the computational overhead of the local gapped alignment, and it is currently applicable only in the unique regions of the genome. One available tool is PINDEL [14].

(4) De novo assembly: in theory, de novo assembly could capture all forms of structural variation, given reads that are long enough. However, sequence-assembly approaches are not yet extensively used, and typically a combination of de novo and local assembly algorithms is used to form contigs that are then compared to the reference genome. Well-known de novo assembly algorithms are EULER-USR [15], ABySS [16], SOAPdenovo [17] and ALLPATHS-LG [18], as well as the Cortex assembler [13], the NovelSeq framework [19] and TIGRA [13].
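As a rough illustration of the read-pair (PEM) and read-depth (DOC) signatures described in (1) and (2), the following Python sketch classifies one mapped read pair by insert size and orientation, and converts the read count of a window into a copy-number estimate. The function names, the three-standard-deviation cutoff and the diploid baseline of 2 are assumptions made for this example only; the insert-size values (mean 175, standard deviation 25) are borrowed from the CNVer example given in the module description. None of the tools cited above is claimed to work exactly this way.

```python
def classify_pair(insert_size, orientation_ok,
                  mean_insert=175.0, stdev_insert=25.0, n_sd=3.0):
    """Return the signature suggested by one mapped read pair (PEM idea).

    orientation_ok is False when the two mates map with an orientation
    that is inconsistent with the sequencing library.
    """
    if not orientation_ok:
        return "inversion_or_tandem_duplication"
    if insert_size > mean_insert + n_sd * stdev_insert:
        return "deletion"    # mates map too far apart
    if insert_size < mean_insert - n_sd * stdev_insert:
        return "insertion"   # mates map too close together
    return "concordant"


def estimate_copy_number(observed_reads, total_reads,
                         window_size, genome_size, ploidy=2):
    """Estimate the copy number of a window from its read depth (DOC idea).

    With uniform coverage, a window present at `ploidy` copies is expected
    to collect total_reads * window_size / genome_size reads.
    """
    expected = total_reads * window_size / genome_size
    return ploidy * observed_reads / expected


if __name__ == "__main__":
    print(classify_pair(400, True))   # pair spanning far more than expected
    print(classify_pair(90, True))    # pair spanning far less than expected
    print(estimate_copy_number(50, 1_000_000, 10_000, 100_000_000))
```

In this sketch a window with half the expected number of reads yields an estimate of one copy, which corresponds to a heterozygous deletion in a diploid genome.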

INTRODUCTION REFERENCES

[1] Korbel, J. O. et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 10, R23 (2009).

[2] Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 19, 1270–1278 (2009).

[3] Hormozdiari, F. et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics (Oxford, England) 26, i350–i357 (2010).

[4] Hormozdiari, F., Hajirasouliha, I., A., M., Eichler, E. E. & Sahinalp, S. C. Simultaneous structural variation discovery in multiple paired-end sequenced genomes. Proc. RECOMB 2011 (in the press).

[5] Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 677–681 (2009).

[6] Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods 6, 473–474 (2009).

[7] Lee, S., Xing, E. & Brudno, M. MoGUL: detecting common insertions and deletions in a population. Proc. RECOMB 2010 6044, 357–368 (2010).

[8] McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19, 1527–1541 (2009).

[9] Mills, R. E. et al. Mapping copy number variation at fine scale by population scale genome sequencing. Nature 470, 59–65 (2011).

[10] Alkan, C. et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genet. 41, 1061–1067 (2009).

[11] Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).

[12] Medvedev, P., Fiume, M., Dzamba, M., Smith, T. & Brudno, M. Detecting copy number variation with mated short reads. Genome Res. 20, 1613–1622 (2010).

[13] Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genet. 13 Feb 2011

[14] Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (Oxford, England) 25, 2865–2871 (2009).

[15] Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res. 19, 336–346 (2009).

[16] Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).

[17] Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2009).

[18] Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl Acad. Sci. USA 108, 1513–1518 (2011).

[19] Hajirasouliha, I. et al. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics (Oxford, England) 26, 1277–1283 (2010).

GLOSSARY and MODULE OVERVIEW

“Input”: name of the input file
“Label”: specified when the name of the input on the pipeline canvas has to differ slightly from the one given in the Input section, for clarity
“Tool”: script/program in use
“Server Location”: location on the fgene server
“Output”: name of the output file



MODULE DESCRIPTION

GOAL: call CNVs starting from raw reads (FASTQ format)
FINAL OUTPUT: files with CNV calls

CNV CALLING: three different paths

1. ERDS/SVA path (DOC) (see the STEP5 pipeline “Sequence Variant Analyzer v1.0, for hg18 annotations”; a compatible ERDS version has not been released yet for Sequence Variant Analyzer v1.1)

Input: alignment.rmd.sorted.md.recal.clean.bam (= output STEP#)
Tool: erds
Server Location: /applications/scripts_sva/erds1.01/erds.sh
Output: .gsap file (visualization of CNVs in Sequence Variant Analyzer). From the GUI it is possible to export the large variants to a .csv file.

Example:

Pipeline Module:

2. BOWTIE/CNVer/SAVANT path (DOC+PEM)

(PREREQUISITE: download the companion package from the CNVer website http://compbio.cs.toronto.edu/CNVer/. It allows CNVer to be used on alignments performed against the UCSC reference genome. See the notes on Ensembl alignments.)

2.1 BOWTIE alignment

2.1.1 Build bowtie index for the reference genome

Input: ref.fa reference files (preferably the UCSC FASTA reference genome, chr1-22,X,Y. See the notes if using the Ensembl genome)
Tool: bowtie (bowtie-build command)
Server Location: /applications/rseqtools/example_dataset/bowtie-0.12.7
Output: series of .ebwt files

Example: bowtie-build ${CD}/human_genome2.fa ucsc_hg18_new_bowtie > ucsc_hg18_new_bowtie.log

Pipeline Module:

2.1.2 Bowtie alignment with SAM production

Input: sequence.fastq file (SANGER FORMAT, output of STEP#1, I)
Label: Illumina reads sequence.fastq files
Tool: bowtie
Server Location: /applications/BOWTIE/bowtie-0.12.7
Output: alignment.sam

Example: bowtie ${CD}/ucsc_hg18_new_bowtie -1 ${CD}/941408_fwd.fastq -2 ${CD}/941408_rev.fastq -v 3 -a -m 600 --best --strata --sam ${CD}/941408_bduc_ucsc_hg18_new.sam > ${CD}/941408_bduc_ucsc_hg18_new.log

Pipeline Module:

2.1.3 SAM2BAM conversion

Input: alignment.sam
Tool: picard (SamFormatConverter.jar)
Server Location: /applications/picard_1.38/picard-tools-1.38
Output: alignment.bam

Example: java -jar /apps/picard/1.45/bin/SamFormatConverter.jar INPUT=${CD}/941408_bduc_ucsc_hg18_new.sam OUTPUT=${CD}/941408_bduc_ucsc_hg18_new.bam

2.2 CNVer

2.2.1 CNVer call

Input: alignment.bam
Tool: CNVer (cnver.pl)
Server Location: /applications/CNVer/cnver-0.8.1/src
Output: .cnv files

Example: cnver.pl --map_list /projects2/CNVer_0.8.1_testing/map_list.txt --ref_folder /applications/CNVer/cnver-0.8.1/hg18comp --work_dir /projects2/CNVer_0.8.1_testing --read_len 101 --mean_insert 175 --stdev_insert 25 --min_mps 3


Pipeline Module:

2.2.2 Sort .bam

Input: alignment.bam file
Tool: samtools (sort option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: alignment.sorted.bam file

Example: /applications/samtools-0.1.7_x86_64-linux/samtools sort /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsRemoveDuplicates_1.OutputNo-DuplicatesBAMfile-1.bam /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsSort_1.OutputSortedBAMfile-1.bam

Pipeline Module:

2.2.3 Indexing the .bam file

Input: alignment.sorted.md.bam file
Tool: samtools (index option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: alignment.sorted.md.bam.bai file

Example: /applications/samtools-0.1.7_x86_64-linux/samtools index /projects/pipelineCache/pipeline/2011January27_15h51m34s061ms/SamToolsCamldMDtag_1.OutputNo-DuplicatesBAMfile-1.bam

Pipeline Module:

2.2.4 Visualization

2.2.4.1 BAM sorting

Input: alignment.bam
Tool: samtools (sort option)
Server Location: /applications/SAVANT_gb_updated
Output: alignment.genome.cov.bam file

Example:

Pipeline Module:

2.2.4.2 Coverage track production

Input: alignment.bam
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: alignment.genome.cov.bam file

Example:

Pipeline Module:

2.2.4.3 Formatting the ref.fa file in a ref.fa.savant

Input: ref.fa file (see #2.1.1)
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: ref.fa.savant project

Example:

Pipeline Module:

2.2.4.4 Formatting the cnv file in cnv.bed

Input: .cnv file
Tool: SAVANT genome browser (GUI)
Server Location: /applications/SAVANT_gb_updated
Output: .cnv.bed

Example:

Pipeline Module:

2.2.4.5 Visualization

Input: alignment.sorted.bam, alignment.genome.cov.bam, .cnv.bed files
Tool: SAVANT genome browser
Server Location: /applications/SAVANT_gb_updated
Output: saved in a .savant project

Example:

Pipeline Module:

ENSEMBL ALIGNMENTS NOTES

PREREQUISITE: perform the alignments on a 1-22,X,Y Ensembl reference genome. AVOID the contigs!

1) Conversion chr1-22 (UCSC) to 1-22 (ENSEMBL).
a. In the hg18comp folder, remove "chr" from allchr.txt and autosomes.txt.
b. In the folders inside hg18comp (contig_breaks_folder, fasta_files_folder, repeat_regions_folder, self_alignments_folder), change the names of the files and also their contents accordingly, from chr1-22 (UCSC) to 1-22 (ENSEMBL).
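The renaming described in steps a and b can be pictured with a small Python sketch. It is only an illustration: the helper names are hypothetical, and the assumption that the companion files are named like chr1.fa (chromosome name, then a dot and an extension) has not been checked against the CNVer package. Contig-style names are deliberately left untouched, in line with the prerequisite above.

```python
import re

# Matches only the primary chromosomes chr1-22, chrX and chrY.
_UCSC_NAME = re.compile(r"chr([1-9]|1[0-9]|2[0-2]|X|Y)")


def ucsc_to_ensembl(name):
    """Turn 'chr7' into '7'; return any other name unchanged."""
    match = _UCSC_NAME.fullmatch(name)
    return match.group(1) if match else name


def rename_file(filename):
    """Apply the same conversion to the part of a file name before the first dot.

    Assumes (hypothetically) file names of the form '<chromosome>.<extension>'.
    """
    stem, dot, rest = filename.partition(".")
    return ucsc_to_ensembl(stem) + dot + rest


if __name__ == "__main__":
    for name in ["chr1", "chr22", "chrX", "chrM", "chr1_random"]:
        print(name, "->", ucsc_to_ensembl(name))
    print(rename_file("chr10.fa"))
```

The same function would have to be applied to the chromosome names inside each file, not just to the file names, for the conversion to be consistent.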

3. CNVseq path (DOC)

3.1 Hits file production

Input: alignment.bam
Tool: samtools (view option)
Server Location: /applications/samtools-0.1.7_x86_64-linux
Output: .hits file

Example: samtools view -F 4 my.bam | perl -lane 'print "$F[2]\t$F[3]"' > my.hits

/ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/samtools/samtools-0.1.10/samtools view -bT /ifs/ccb/CCB_SW_Tools/BioinformaticsGenetics/MAQ_BWA_2010/test_data/ref_chr2.fasta.fai -o /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsView_1.OutputBAMfile-1.bam /ifs/pl_cache/cranium/pipelnvr/2010December02_09h30m18s797ms/SamToolsmaq2sam-long_1.OutputSAMfile-1.sam
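For readers who want to see what the samtools/perl one-liner does, here is a minimal Python sketch of the same filtering, applied to SAM text rather than to a BAM file. The function name sam_to_hits is made up for this example; the logic follows the SAM column layout (column 2 is the FLAG, column 3 the reference name, column 4 the position) and drops reads whose FLAG has bit 4 set, as -F 4 does.

```python
def sam_to_hits(sam_lines):
    """Yield 'chromosome<TAB>position' for each mapped read in SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & 4:
            continue  # read is unmapped (what `samtools view -F 4` removes)
        yield fields[2] + "\t" + fields[3]


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0",
        "r1\t0\tchr2\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t16\tchr2\t250\t255\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    for hit in sam_to_hits(example):
        print(hit)
```

Each output line holds one chromosome name and one mapping position, which is the layout the perl command above writes to my.hits.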

Pipeline Module:

3.2 CNV call with CNVseq

Input: .hits file
Tool: CNVseq+R
Server Location: /applications/CNVseq/cnv-seq
Output: .hits.cnv file

Example: ./cnv-seq.pl --test SMS-135.hits --ref JAM-230.hits --genome human


In R:

> library(proto)
> library(grid)
> library(cnv)
> data <- read.delim("JLK-227.hits-vs-JAM-230.hits.log2-0.6.pvalue-0.001.minw-4.cnv") # load the .cnv file
> cnv.summary(data) # produces a description of the dataset like this:

CNV percentage in genome: 0.8%
CNV nucleotide content: 24516307
CNV count: 403
Mean size: 60835
Median size: 135865
Max Size: 503201
Min Size: 20129

> plot.cnv(data, chromosome=2, from=140036061, to=144238634) # plots a specific region: insert the interval you want and inspect the log R ratio, OR check the cnv.print output to see whether your region is among the significant ones.

> plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6) # plots a single CNV (run cnv.print(data), look at the CNV id, then choose which CNV to plot)
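To give an intuition for what a test-versus-reference comparison of two .hits files involves, the Python sketch below counts hits in fixed windows and computes a normalised log2 ratio per window. It is a deliberately simplified illustration and not CNVseq's actual algorithm: the function names are hypothetical, fixed non-overlapping windows are an assumption, and no p-value is computed. The 0.6 threshold is taken from the "log2-0.6" part of the example file name above.

```python
import math
from collections import Counter


def window_counts(hits, window_size):
    """Count hits per (chromosome, window index); hits are (chrom, pos) pairs."""
    return Counter((chrom, pos // window_size) for chrom, pos in hits)


def log2_ratios(test_hits, ref_hits, window_size):
    """Return {(chrom, window): log2 ratio}, normalised by each sample's total hits."""
    test = window_counts(test_hits, window_size)
    ref = window_counts(ref_hits, window_size)
    n_test, n_ref = sum(test.values()), sum(ref.values())
    ratios = {}
    for key, ref_count in ref.items():
        test_count = test.get(key, 0)
        if test_count == 0:
            continue  # log2 ratio is undefined without test hits
        ratios[key] = math.log2((test_count / n_test) / (ref_count / n_ref))
    return ratios


def candidate_windows(ratios, threshold=0.6):
    """Windows whose absolute log2 ratio reaches the threshold."""
    return sorted(key for key, value in ratios.items() if abs(value) >= threshold)


if __name__ == "__main__":
    ref = [("2", 10), ("2", 20), ("2", 1010), ("2", 1020)]
    test = [("2", 10), ("2", 20), ("2", 30), ("2", 1010)]
    ratios = log2_ratios(test, ref, window_size=1000)
    print(ratios)
    print(candidate_windows(ratios))
```

A negative ratio suggests fewer copies in the test sample than in the reference for that window, and a positive ratio suggests more.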

Pipeline Module:

NOTE:

What might a user want to do with the plotting (see OUTPUT PLOT)?

(1) Automatically export ALL the pdf plots of all the CNVs (so this step may be connected directly to CNVseq). It would be great to add a parameter that, if checked, enables this automatic production:

plot.cnv.all <- function (data, chrom.gap = 2e+07, colour = 5, title = "WG plots", ylim = c(-2, 2), xlabel = "Chromosome")

(2) Automatically export ALL the pdf plots of all the CNVs, one chromosome at a time (so this step may be connected directly to CNVseq). It would be great to add a parameter that, if checked, enables this automatic production (with the possibility of choosing the chromosome):

plot.cnv.chr <- function (data, chromosome = NA, from = NA, to = NA, title = NA, ylim = c(-4, 4), glim = c(-2, 2), xlabel = "Position (bp)") # chromosome: the number of the chromosome; from: beginning coordinate; to: ending coordinate; title: e.g. "chromosome x"

(3) In the vast majority of cases, the user will download the .cnv file produced by CNVseq, examine it and choose the events that look most interesting. For this case, what I have in mind is a parameter that, when checked, stops the process at the CNVseq step for now. Once the user has downloaded and examined the .cnv files and has x regions to visualize, they can go back to the same module (without having restarted it) and export plots:

By genomic region: > plot.cnv(data, chromosome=2, from=140036061, to=144238634) # plots a specific region: insert the interval you want and inspect the log R ratio

By CNV id: > plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6) # the CNV id is found by the user in the .cnv file

What might a user want to do with the output description file (see OUTPUT DATA DESCRIPTION)?

This step is ok.
