Zebra Finch Seg Dup Analysis 1.Genome 2.Parameters for Pipeline 3.Analysis.

Date posted: 21-Dec-2015

Page 1

Zebra Finch Seg Dup Analysis

1. Genome

2. Parameters for Pipeline

3. Analysis

Page 2

Zebra Finch Genome

• The genome (Jul. 2008 assembly of the zebra finch genome, taeGut1, WUSTL v3.2.4) was downloaded from UCSC. This assembly was produced by the Genome Sequencing Center at the Washington University in St. Louis (WUSTL) School of Medicine.

• The zebra finch DNA used for the shotgun sequencing and the BAC and cosmid libraries was derived from a single male domesticated zebra finch. The initial assembly was generated using PCAP with approximately 6X coverage. About 1.0 Gb of the 1.2-Gb genome has been ordered and oriented along 33 chromosomes and one linkage group. The chromosome names are based on their homologous chromosomes in the chicken (Gallus gallus).

• Total genome size (gapped) 1,233,186,341 bp

Page 3

Seg Dup detection pipelines

• WGAC (whole-genome assembly comparison) detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length, >90% identity).

• WSSD (whole-genome shotgun sequence detection) detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. A region is called when depth of coverage > average + 3 SD. Done by Ginger Cheng.
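The two cutoffs above can be written down as a minimal sketch. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the slide does not say which estimator the pipelines use.

```python
from statistics import mean, pstdev

MIN_LENGTH_BP = 1_000   # WGAC length cutoff stated on the slide (>1 kb)
MIN_IDENTITY = 0.90     # WGAC identity cutoff stated on the slide (>90%)


def passes_wgac_cutoffs(length_bp, identity):
    """Return True if an alignment pair exceeds both WGAC cutoffs."""
    return length_bp > MIN_LENGTH_BP and identity > MIN_IDENTITY


def wssd_depth_threshold(control_depths):
    """Depth cutoff = average + 3 SD over a set of control windows.

    Assumption: population SD (pstdev); the real pipeline may differ.
    """
    return mean(control_depths) + 3 * pstdev(control_depths)


def is_wssd_window(depth, threshold):
    """Return True if a window's read depth exceeds the cutoff."""
    return depth > threshold
```

For example, `passes_wgac_cutoffs(1500, 0.95)` is true, while a 1.5-kb pair at exactly 90% identity is rejected because both comparisons are strict.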

Page 4

Parameters and notes for WGAC pipeline

• Repeats

– The sequences downloaded from UCSC were already soft-masked.

– UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata'

– The repeat coordinates were recovered (reverse-generated) from the soft-masked sequences.

• BLAST parsing seeds in the WGAC pipeline

– The seed size is 250 bp.
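Recovering repeat coordinates from a soft-masked sequence amounts to locating the lowercase runs. The following is a minimal sketch of that idea, not the pipeline's own script; the function name and the 0-based, half-open coordinate convention are assumptions.

```python
import re


def softmask_to_intervals(sequence):
    """Return (start, end) spans of lowercase (soft-masked) bases.

    Coordinates are 0-based and half-open, as in BED files.
    """
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


# Two masked runs separated by an uppercase N gap.
print(softmask_to_intervals("ACGTacgtNNNNacACGT"))  # [(4, 8), (12, 14)]
```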

Page 5

Result from WGAC Pipeline

• Total pairs of WGAC detected (>1 kb and >90% identity): 198180

• Inter chromosome pairs: 81415

• Intra chromosome pairs: 116742

• Chromosome inter and intra (excluding chr_random and chrUn): 26510

• ChrUn inter and intra: 172670

• Total WGAC NR (bp): 384,501,909

• Total genome size (with gap): 1,233,186,341

Notes:

• The NR space of WGAC is about 31% of the zebra finch genome, which is too high. It is due either to incomplete repeat masking or to redundant sequences in chr_random and chrUn. In 87% of the total WGAC pairs (inter and intra), at least one sequence of the pair is on chrUn. The result indicates that a large portion of the false-positive WGAC comes from chrUn.
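The two percentages in the notes follow directly from the counts on this slide; a short check (variable names are illustrative):

```python
wgac_nr_bp = 384_501_909     # Total WGAC NR (bp)
genome_bp = 1_233_186_341    # Total genome size (with gap)
total_pairs = 198_180        # Total WGAC pairs
chrun_pairs = 172_670        # ChrUn inter and intra pairs

nr_fraction = wgac_nr_bp / genome_bp
chrun_fraction = chrun_pairs / total_pairs

print(f"NR space of WGAC: {nr_fraction:.1%} of the genome")      # 31.2%
print(f"Pairs involving chrUn: {chrun_fraction:.1%} of pairs")   # 87.1%
```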

Page 6

General analysis of WGAC length and identity distribution

1. Length distribution peaked at 1-2 kb, intra > inter, with 87% of WGAC related to chrUn.

2. Identity distribution peaked at 97-98%. Few are higher than 99%.

[Figure: WGAC identity distribution. X-axis: identity (90% to 100%); y-axis: total (bp); series: inter, intra.]

[Figure: WGAC length distribution. X-axis: WGAC length (1 kb to 50 kb); y-axis: total (bp); series: inter, intra.]

Page 7

General analysis: NR distribution by chromosome, with high SD content in chrUn

[Figure: Nonredundant WGAC length distribution on chromosome. Axes: chromosome (chr1 through chr28, chr1A, chr1B, chr4A, chrLG2, chrLG5, chrLGE22, chrZ, the _random sequences, and chrUn) versus total (bp); series: inter, intra, both.]

[Figure: Percentage of nonredundant WGAC on chromosome. Axes: chromosome (same set) versus percent (%); series: inter, intra, both.]
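The nonredundant (NR) length per chromosome counts each duplicated base once, even when several WGAC pairs cover it. One common way to get this is to merge overlapping intervals before summing. The sketch below assumes that approach and uses a hypothetical function name; the slides do not show how the pipeline computes NR.

```python
from collections import defaultdict


def nonredundant_bp(intervals):
    """Sum covered bases per chromosome after merging overlaps.

    intervals: iterable of (chrom, start, end), 0-based half-open.
    Returns a dict {chrom: covered_bp}.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
        covered[chrom] = total
    return covered
```

Dividing each value by the chromosome length would give the percentage shown in the second figure.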

Comment (tjbrown): Not sure if the slide header makes sense here. Also, the first figure title should start with "Nonredundant" (not "none redundant"); make the same correction in the second figure title too. "Chromosome" should not be capitalized in the first figure title.

Page 8

Global images show the inter- and intrachromosomal pairs of 10 kb and above 90% identity, without and with chrUn. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

Panels: Without chrUn; With chrUn

Page 10

WSSD analysis done by Ginger

http://eichlerlab.gs.washington.edu/help/ginger/zebrafinch/

• Downloaded the WGS reads (about 11,683,735 reads) from the trace archive at NCBI.

• Downloaded zfinch-finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold is 110 reads for a 5-kb window and 22 reads for a 1-kb window.

• Used the UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%. (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')
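The two thresholds on this slide are consistent with simple linear scaling by window size (110 reads per 5 kb is 22 reads per 1 kb). A minimal sketch of window-based calling under that reading follows. The function names are hypothetical, and assigning each read to a window by its start position and using a strict "greater than" comparison are assumptions.

```python
from collections import Counter

THRESHOLD_5KB = 110  # reads per 5-kb window, from the slide


def scaled_threshold(window_bp, ref_threshold=THRESHOLD_5KB, ref_window_bp=5_000):
    """Scale the 5-kb read-count threshold linearly to another window size."""
    return ref_threshold * window_bp / ref_window_bp


def high_depth_windows(read_starts, window_bp):
    """Return start coordinates of windows whose read count exceeds the cutoff.

    read_starts: iterable of 0-based read start positions on one sequence.
    """
    counts = Counter(pos // window_bp for pos in read_starts)
    cutoff = scaled_threshold(window_bp)
    return sorted(idx * window_bp for idx, n in counts.items() if n > cutoff)


print(scaled_threshold(1_000))  # 22.0
```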

Page 11

WSSD results

• A total of 16,076 regions covering 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cutoff); 13,782 of them are on chrUn.

• A summary table of the intersection of WGAC with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls
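Intersecting WGAC with WSSD comes down to measuring how many bases two interval sets share. The sketch below shows one standard way to do that with a two-pointer sweep; it is an illustration with a hypothetical function name, not the script that produced the summary table. It assumes each input is already sorted and non-overlapping on a single chromosome.

```python
def shared_bp(a, b):
    """Count bases covered by both interval lists.

    a, b: sorted, non-overlapping lists of (start, end) on one chromosome,
    0-based half-open.
    """
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


print(shared_bp([(0, 100), (200, 300)], [(50, 250)]))  # 100
```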

Page 12

General view showing WGAC (>5kb) and WSSD on all chromosomes

Grey, above the lines: WSSD. Brown, below the lines: WGAC.

Comment (tjbrown): May need to re-word as there are 2 "all" words and the header is not clear.

Page 13

Union of WSSD and WGAC; gene intersect with Seg Dups

• A nonredundant union of WGAC and WSSD was generated with a size cutoff of 10 kb (AllDup10kb.tab). There are 3,839 NR regions covering 50,902,487 bp, which is about 6.7 Mb more than WSSD alone (44,218,871 bp).

• However, be aware that there may be false-positive sites, especially on chrUn, since we know there is a high rate of false-positive WGAC on both the chromosomes and chrUn.

Page 14

Summary table 1

                      total          chrN           chrUn        No. nr interval  file
wssd (bp)             44,218,871     11,237,985     35,080,886   729              wssdGE10K_nogap.tab
wgac (bp)             384,501,909    232,493,308    152,008,601  7387             oo.weild10kb.join.all.cull
AllDup (bp)           394,988,746    235,022,961    159,965,785  5934             allDUP
Wssd and Wgac shared  8,195,577      3,182,128      5,013,449
Genome (bp)           1,233,186,341  1,057,961,026  175,225,315

Page 15

Large SDs >=10 kb

• SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a combined length of 50,902,487 bp in allDup.tab.

Page 16

The study of chromosome-only WGAC

• Segmental duplications on sequences assigned to chromosomes should be more reliable, with fewer artifacts.

• These sequences should reflect the best of the assembly.

Page 17

• Total Dup length: 105,145,288 bp

• Intra Dup length: 100,234,309 bp

• Inter Dup length: 8,499,428 bp

• More than 90% of the Dup is intrachromosomal.

• These intrachromosomal dups are predominantly short-range; see the global view on the next slide.
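The ">90%" figure can be checked from the three lengths on this slide. The overlap line below rests on an assumption: that the total is the nonredundant union of the intra and inter sets, so that intra + inter − total is the sequence counted in both classes.

```python
total_dup_bp = 105_145_288
intra_dup_bp = 100_234_309
inter_dup_bp = 8_499_428

intra_share = intra_dup_bp / total_dup_bp
# Assumption: total is the NR union, so this is sequence in both classes.
both_bp = intra_dup_bp + inter_dup_bp - total_dup_bp

print(f"Intra share of total dup: {intra_share:.1%}")   # 95.3%
print(f"Sequence in both classes: {both_bp:,} bp")      # 3,588,449 bp
```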

Page 18

Global views at 90%/5 kb and 94%/5 kb (identity/length cutoffs), respectively, showing that a significant number of WGAC pairs are short-range intrachromosomal duplications.

Page 19

Blowup view showing WGAC on chromosome 1 at 5 kb and 94%. This is WGAC detected only on sequences assigned to chromosomes.

Page 20

Detail of a sample region on chr1

Figure labels:

• WSSD

• Grey: depth of coverage by reads

• Assembly gaps

• Intrachromosomal homology pairs

• The average identity for the reads mapped to the region: red >99%, orange >98%, yellow >97%, green >96%

Page 21

Text description for slide 20

• Each black line represents a chromosome region, as indicated by the ticks.

• Blue bars and pairs are the intrachromosomal homologous pairs (segmental duplications) found.

• Red bars and pairs on the chromosome line represent the interchromosomal homologous pairs (interchromosomal segmental duplications).

• The grey bars under the chromosome line represent the depth of coverage of the region by WGS reads in 1-kb windows. The longer the bar, the higher the depth of coverage by sequence reads.

• The color bar under the chromosome line represents the average identity of all the reads mapped to the region: red (>99%), orange (>98%), yellow (>97%), green (>96%).

• The black bar above the chromosome line represents WSSD detected.

• The purple vertical lines on the chromosome line represent assembly gaps.

• Each tick represents 10,000 bp; each line is 100 kb.
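The identity color scale described above is a simple threshold lookup. The sketch below is illustrative only: the function name is hypothetical, and returning None for regions at or below 96% is an assumption, since the slide lists no color for that range.

```python
def identity_color(identity_pct):
    """Map average read identity (in percent) to the legend color."""
    if identity_pct > 99:
        return "red"
    if identity_pct > 98:
        return "orange"
    if identity_pct > 97:
        return "yellow"
    if identity_pct > 96:
        return "green"
    return None  # assumption: no color is drawn at or below 96%


print(identity_color(98.5))  # orange
```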

Page 22

Result

• Most of the intrachromosomal pairs are very close to each other. In most cases, one sequence of the pair has gaps on both ends, which suggests that the contig is not physically connected to its adjacent sequences. It was placed at its current position by the mate pairs.

• Some of them are also next to each other, separated by a gap.

• In the sampled regions, we have not seen a single contig that contains both sequences of an intrachromosomal segmental duplication pair.

• Considering the observations above, we think there is a high possibility that these pairs are assembly artifacts introduced by the assembler.
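The "gaps on both ends" observation could be screened for automatically. The sketch below is a hypothetical check, not part of the analysis on these slides. The function name and the `max_flank_bp` tolerance are assumptions, and coordinates are taken to be 0-based half-open on a single chromosome.

```python
def flanked_by_gaps(segment, gaps, max_flank_bp=1_000):
    """Return True if an assembly gap lies near both ends of the segment.

    segment: (start, end) of one duplicated sequence.
    gaps: list of (start, end) assembly gaps on the same chromosome.
    max_flank_bp: hypothetical tolerance between segment end and gap.
    """
    start, end = segment
    # A gap ending at or shortly before the segment start.
    left = any(0 <= start - gap_end <= max_flank_bp for _, gap_end in gaps)
    # A gap starting at or shortly after the segment end.
    right = any(0 <= gap_start - end <= max_flank_bp for gap_start, _ in gaps)
    return left and right
```

A segment that passes this check sits in a contig bounded by gaps on both sides, which matches the pattern described in the first bullet.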