VectorBase
Frank Collins, Scott Emrich, Dan Lawson, Greg Madey
BRC PI/PM Meeting
Bethesda, MD
April 27, 2012
Genome Sizes
• Pediculus humanus: ~110 Mb, N50 = 488 kb
• Anopheles gambiae S: ~260 Mb, N50 = 1,505 kb
• Culex quinquefasciatus: ~580 Mb, N50 = 487 kb
• Aedes aegypti: ~1.3 Gb, N50 = 1,500 kb
• Ixodes scapularis: ~1.8 Gb, N50 = 72 kb
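For readers unfamiliar with the N50 values quoted above, here is a minimal sketch of how the statistic is usually computed from a list of scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the slides or from any VectorBase tool.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reachable if lengths are negative, which is not meaningful here.
    return ordered[-1]


# Hypothetical toy assembly (arbitrary units), not real VectorBase data.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A higher N50 for a similar genome size generally indicates a more contiguous assembly, which is why the 72 kb figure for Ixodes scapularis stands out against the ~1.5 Mb values for the mosquito assemblies.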
Future genomes
White papers
Sandflies
• Lutzomyia longipalpis
• Phlebotomus papatasi
Anopheles (AGCC)
• Anopheles arabiensis
• Anopheles quadriannulatus
• Anopheles merus
• Anopheles melas
• Anopheles christyi
• Anopheles epiroticus
• Anopheles stephensi
• Anopheles maculatus
• Anopheles funestus
• Anopheles minimus
• Anopheles culicifacies
• Anopheles farauti
• Anopheles dirus
• Anopheles atroparvus
• Anopheles albimanus
Glossina
• Glossina palpalis
• Glossina fuscipes
• Glossina pallidipes
• Glossina brevipalpis
• Glossina austeni
• Stomoxys calcitrans
• Musca domestica
Simulium
• Simulium vittatum
• Simulium sirbanum
• Simulium damnosum
• Simulium ochraceum
• Simulium squamosum
• Simulium thyolense
• Simulium santipauli
• Simulium woodi
• Simulium exiguum
• Simulium yahense
Ticks & Mites
• Leptotrombidium deliense
• Ixodes scapularis*
• Dermacentor variabilis
• Ornithodoros turicata
Anopheles
• Anopheles darlingi*
• Anopheles stephensi
Others
Aedes
• Aedes albopictus
i5K initiative
First New Release in New Contract
Challenges of vector genomes
• Relatively large genomes from species that are hard to inbreed
• Heterozygosity in sequencing samples (up to 80 different males were sequenced for the new gambiae genomes) produces dubious scaffolds
• Inversions and heterochromatic regions induce gaps
• Newer generation sequencing has reduced cost but has not yet maintained overall quality
• Non-trivial annotations
An. gambiae forms
M-form
• More permanent
• Available year-round
• Allows slower development
• Predator-rich
S-form
• Ephemeral
• Rainy-season dependent
• Requires rapid development
• Largely predator-free
Divergence across chromosome arms
[Figure: divergence plotted per chromosome arm; panels 2L, 2R, X, 3R, 3L. C. Cheng et al., unpublished]
Optical mapping DBP: Wisconsin
Size matters

| Genome | MB optically mapped | Genes found |
|---|---|---|
| S Sanger | 145,837.97 | 14162 |
| S Illumina | 58,192.13 | 14124 |
| PEST | 60,239.6 | 14324 |
| Sanger + Ill | 204,030.1 | 14224 |
Annotation strategies
• Speeding up computational annotation
  • Use of MAKER system
  • Prediction by projection from ‘high quality’ reference
• Expanded use of RNA-Seq
  • Scripture, Trinity & Cufflinks/Bowtie
• Community engagement
  • Primarily deployed for new genomes (Glossina, Rhodnius)
  • Works for all other VectorBase genomes
de novo annotation
MAKER with RNA-Seq & reference proteomes
Aim:
• Gene prediction pipeline for the masses
• Used for a number of arthropod genome projects
• Touted as the default pipeline for many more (part of the GMOD toolkit)
Overview
• ab-initio gene predictions from SNAP, Augustus & FGENESH
• Final gene models from MAKER
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• Repeats from RepeatFinder & RepeatMasker
• Additional data sets integrated via GFF3 files (RNA-Seq)
• Uses MPI for parallelization over a compute farm
• Optimization for long scaffolds
Summary
• Iterative runs give acceptable reference gene sets
• Used for Glossina and An. stephensi
• Used by others for Strigamia, Manduca, published ant genomes
Community annotation
• Simple tool to capture community annotation
• Makes gene prediction and evidence available as GFF3
• Compatible with Artemis and Apollo tools
• Submissions in GFF3 format
  • Gene structure corrections
  • Gene metadata (symbol, description, citations)
• Glossina annotation effort (Nov 11 – Apr 12)
  • 790 GFF submissions
  • 2670 items of metadata
    • Gene symbols, descriptions
    • Structure confirmation
[Figure: community annotation workflow using ARTEMIS and APOLLO]
GFF3 evidence (first panel):
scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|
GFF3 evidence (second panel):
scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|
FASTA:
>MY SUPERCONTIG
ATATATGCGTTGAGCTGCGTTACGTTCGGGATGCGTTAGGCTTGTGAGCTGGATCGGTCCTGCCTGCGTCGATATAAACGACCT…
Steps: Identify gene, Modify model, Submit
Labels: CAP, GFF3, FASTA
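The ptn_match records in the figure follow the nine-column GFF3 layout. As a rough illustration of what a submission tool has to read, here is a minimal parsing sketch. The `Gff3Feature` class and `parse_gff3_line` function are hypothetical names, not part of the VectorBase tool. The sketch assumes tab-separated columns as the GFF3 format requires (the slide text shows spaces), assumes the three numbers are start, end and score, and skips the percent-decoding of attribute values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Gff3Feature:
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one tab-separated GFF3 feature line into a Gff3Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes: Dict[str, str] = {}
    for pair in attrs.split(";"):
        if pair:
            # Split on the first '=' only; values may contain '|' and so on.
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Gff3Feature(
        seqid=seqid,
        source=source,
        type=ftype,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=phase,
        attributes=attributes,
    )


# First record from the slide, rebuilt with tabs (column reading is assumed).
example = "\t".join([
    "scf7180000638805", "ptn2genome", "ptn_match",
    "52", "605", "892", "+", ".", "ID=xxxx;Name=tr|Q3UIQ2|",
])
feature = parse_gff3_line(example)
print(feature.seqid, feature.start, feature.end, feature.attributes["Name"])
```

Keeping the attributes as a plain dictionary makes it easy to check for duplicate IDs before submission; the slide's example reuses `ID=xxxx2` on two records, which a validator would normally flag.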
Population biology
• Chado Natural Diversity schema
  • 183 projects, 15190 samples
  • Incorporates IRbase samples
• Ensembl variation schema
  • 1,511,335 SNP calls
  • Visualization through browser
  • Data downloads through browser
  • Queries via BioMart interface