Slow and Steady: The Sea Urchin Genome Project

Slow and Steady: The Sea Urchin Genome Project. David A. Schwarz. Mentor: Dr. Andrew Cameron. Site: California Institute of Technology.


Page 1: Slow and Steady: The Sea Urchin Genome Project

Slow and Steady: The Sea Urchin Genome Project

David A. Schwarz

Mentor: Dr. Andrew Cameron

Site: California Institute of Technology

Page 2: Slow and Steady: The Sea Urchin Genome Project

Objective

► Curate the non-annotated, predicted genes of the sea urchin genome.

► Learn to annotate genes and register as many as possible at spbase.org.

Page 3: Slow and Steady: The Sea Urchin Genome Project

Importance

► The purple sea urchin: the only non-chordate deuterostome with a sequenced genome.

► It could help us understand the evolution of biological processes such as odor perception and immunity.

► Developments made in the project could benefit future genome projects.

Page 4: Slow and Steady: The Sea Urchin Genome Project

Strongylocentrotus purpuratus

► Phylum: Echinodermata

► Radially symmetrical shell, 3 – 10 cm.

► Spines can reach 3 cm long.

► Moves slowly, feeding mostly on algae.

► Reproduces by external fertilization.

Page 5: Slow and Steady: The Sea Urchin Genome Project

Phylogeny

Page 6: Slow and Steady: The Sea Urchin Genome Project

Data Flow

Sp Genome

CAPSS (v2.1)

WGS (v0.5)

GLEAN

Predicted set of 28,944 genes

Estimated set of 23,300 genes

Page 7: Slow and Steady: The Sea Urchin Genome Project

Genome Sequencing

► WGS = Whole Genome Shotgun Sequencing
   Genome assembly named Spur_v0.5

► CAPSS = Clone-Array Pooled Shotgun Sequencing Strategy
   Genome assembly named Spur_v2.1

Page 8: Slow and Steady: The Sea Urchin Genome Project

Data Flow

Sp Genome

CAPSS (v2.1)

WGS (v0.5)

GLEAN

Predicted set of 28,944 genes

Estimated set of 23,300 genes

Page 9: Slow and Steady: The Sea Urchin Genome Project

Sequencing

► WGS:
   ► Extract DNA
   ► Digest
   ► Sequence the fragments
   ► Assemble the genome.

► CAPSS:
   ► Combines WGS with BAC.
   ► Uses BACs as framework for genome assembly.

Page 10: Slow and Steady: The Sea Urchin Genome Project

CAPSS

Page 11: Slow and Steady: The Sea Urchin Genome Project

Data Flow

Sp Genome

CAPSS (v2.1)

WGS (v0.5)

GLEAN

Predicted set of 28,944 genes

Estimated set of 23,300 genes

Page 12: Slow and Steady: The Sea Urchin Genome Project

GLEAN

Gnomon

Genscan

Ensembl

GLEAN Statistical Algorithm

Page 13: Slow and Steady: The Sea Urchin Genome Project

Discrepancy

► Spur_v0.5:
   ► 28,944 predicted
   ► ~10,044 annotated
   ► 18,900 non-annotated

► Spur_v2.1:
   ► 23,300 estimated
   ► Gene number reduced when duplicates overlap

► ~5,700 gene difference possibly due to:
   4 – 5% species polymorphism (E. Davidson, et al.)
   Assembly error
   Prediction error

Page 14: Slow and Steady: The Sea Urchin Genome Project

Methods

► Python Filtering
   Infile: Gene list
   Check against: Data file
   If conditions are met: Print to outfile

► Python Searching
   ► BioPython module:
      BLAST hit FASTA sequences
   ► Grep-like functions:
      GLEAN models by protein type
      FASTA sequences in GLEAN protein database
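The filtering flow on this slide (gene list in, check each gene against a data file, print the ones that meet the condition) could be sketched in Python roughly as below. This is only an illustration, not the project's script: the names `filter_genes`, `read_ids`, and `write_ids`, the one-ID-per-line file layout, and the example condition are all assumptions.

```python
from typing import Callable, Iterable, List, Set


def read_ids(path: str) -> List[str]:
    """Read one ID per line, skipping blank lines (assumed file layout)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_ids(path: str, ids: Iterable[str]) -> None:
    """Write one ID per line to the outfile."""
    with open(path, "w") as handle:
        for gene_id in ids:
            handle.write(gene_id + "\n")


def filter_genes(gene_ids: Iterable[str],
                 data_ids: Set[str],
                 condition: Callable[[str, Set[str]], bool]) -> List[str]:
    """Return the gene IDs for which condition(gene_id, data_ids) is true."""
    return [g for g in gene_ids if condition(g, data_ids)]


# Example condition (hypothetical): keep genes absent from the data file.
genes = ["GLEAN3_00003", "GLEAN3_00004", "GLEAN3_00005"]
data_file_ids = {"GLEAN3_00004"}
kept = filter_genes(genes, data_file_ids, lambda g, data: g not in data)
```

Passing the condition in as a function lets the same loop serve each filtering step, with only the test changing.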

Page 15: Slow and Steady: The Sea Urchin Genome Project

Example List

GLEAN3_00003 ref|NP_104627.1| hypothetical protein [Mesorhizobium loti] >gi|1... 38 0.48
GLEAN3_00004 ref|NP_788284.1| CG33087-PC [Drosophila melanogaster] >gi|232403... 40 0.19
GLEAN3_00005 ref|NP_509604.1| abnormal NUClease NUC-1, deoxyribonuclease DLAD... 69 4e-11
GLEAN3_00008 ref|XP_293875.3| similar to RIKEN cDNA B130016O10 gene [Homo sap... 240 5e-62
GLEAN3_00010 gb|AAH36744.1| FLJ11712 protein [Homo sapiens] 86 6e-16
GLEAN3_00011 gb|AAH36744.1| FLJ11712 protein [Homo sapiens] 143 3e-32
GLEAN3_00014 ref|NP_062642.1| ubiquitin-conjugating enzyme E2A, RAD6 homolog;... 229 2e-59
GLEAN3_00018 failed
GLEAN3_00019 failed
GLEAN3_00020 failed
GLEAN3_00021 ref|NP_196259.2| chaperone protein - related [Arabidopsis thalia... 110 4e-23
GLEAN3_00023 failed
GLEAN3_00024 sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin... 130 1e-29
GLEAN3_00027 gb|AAD19348.1| reverse transcriptase-like protein [Takifugu rubr... 172 2e-41
GLEAN3_00028 gb|AAH53792.1| MGC64389 protein [Xenopus laevis] 164 3e-39
GLEAN3_00029 failed
GLEAN3_00030 ref|XP_060945.2| similar to Olfactory receptor 10T2 [Homo sapien... 54 5e-06
GLEAN3_00032 dbj|BAA22375.1| Nfrl [Xenopus laevis] 339 7e-92
GLEAN3_00033 ref|XP_354640.1| RIKEN cDNA D430035D22 gene [Mus musculus] 186 1e-45
GLEAN3_00034 dbj|BAC04242.1| unnamed protein product [Homo sapiens] 207 5e-52
GLEAN3_00037 dbj|BAC02921.1| zVeph-A [Danio rerio] 112 4e-23
GLEAN3_00038 ref|NP_004198.1| solute carrier family 16, member 3; monocarboxy... 44 0.008
GLEAN3_00039 failed
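A list in this shape (model name, then either a hit description with score and e-value, or the word "failed") could be read into records with a small parser. The sketch below is an assumption about the layout based only on the lines shown on this slide; `BlastSummary` and `parse_summary` are hypothetical names, not part of the project's code.

```python
from typing import NamedTuple, Optional


class BlastSummary(NamedTuple):
    model: str
    hit: Optional[str]       # None when the search is marked "failed"
    score: Optional[float]
    evalue: Optional[float]


def parse_summary(line: str) -> BlastSummary:
    """Parse one line of the example list into a BlastSummary."""
    model, rest = line.strip().split(None, 1)
    rest = rest.strip()
    if rest == "failed":
        return BlastSummary(model, None, None, None)
    # Assumed layout: the last two whitespace-separated fields
    # are the score and the e-value; everything before is the hit.
    hit, score, evalue = rest.rsplit(None, 2)
    return BlastSummary(model, hit, float(score), float(evalue))


example = parse_summary(
    "GLEAN3_00005 ref|NP_509604.1| abnormal NUClease NUC-1, "
    "deoxyribonuclease DLAD... 69 4e-11"
)
failed = parse_summary("GLEAN3_00018 failed")
```

Records whose `hit` is `None` are the ones a "no hits" filter would set aside.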

Page 16: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Condition: Different name, same genome coordinates

Genes removed: 139
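The coordinate condition (different name, same genome coordinates) could be checked along these lines. This is a sketch under assumptions: the `(scaffold, start, end)` tuple layout, the function name, and the rule of keeping the first model seen are all hypothetical, and the coordinates in the example are made up.

```python
from typing import Dict, List, Tuple

# Assumed coordinate layout: (scaffold, start, end).
Coordinates = Tuple[str, int, int]


def drop_coordinate_duplicates(models: Dict[str, Coordinates]) -> List[str]:
    """Keep the first model name seen at each set of coordinates."""
    seen = set()
    kept = []
    for name, coords in models.items():
        if coords in seen:
            continue  # same coordinates under a different name
        seen.add(coords)
        kept.append(name)
    return kept


# Made-up coordinates, for illustration only.
toy_models = {
    "GLEAN3_A": ("Scaffold1", 100, 900),
    "GLEAN3_B": ("Scaffold1", 100, 900),
    "GLEAN3_C": ("Scaffold2", 5, 50),
}
unique_models = drop_coordinate_duplicates(toy_models)
```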

Page 17: Slow and Steady: The Sea Urchin Genome Project
Page 18: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Condition: Evidence for gene expression

Genes removed: 1,603

Page 19: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Condition: No hits

Genes removed: 3,145

Page 20: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Condition: Exactly the same BLAST hit

Genes removed: 4,545
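One way to detect models that share exactly the same BLAST hit is to group them by hit accession. The slide does not say which model of a group was kept, so the rule below (keep the highest-scoring model per hit) is an assumption, and `best_model_per_hit` is a hypothetical name. The example rows reuse entries from the Example List slide, where GLEAN3_00010 and GLEAN3_00011 both hit gb|AAH36744.1|.

```python
from typing import Dict, List, Tuple


def best_model_per_hit(rows: List[Tuple[str, str, float]]) -> Dict[str, str]:
    """Map each hit accession to the model with the highest score.

    rows holds (model, hit_accession, score) tuples.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for model, hit, score in rows:
        if hit not in best or score > best[hit][1]:
            best[hit] = (model, score)
    return {hit: model for hit, (model, _score) in best.items()}


rows = [
    ("GLEAN3_00010", "gb|AAH36744.1|", 86.0),
    ("GLEAN3_00011", "gb|AAH36744.1|", 143.0),
    ("GLEAN3_00032", "dbj|BAA22375.1|", 339.0),
]
representatives = best_model_per_hit(rows)
```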

Page 21: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Condition: Successful Reciprocal BLAST match

Genes removed: 3,952

Page 22: Slow and Steady: The Sea Urchin Genome Project

Reciprocal BLAST

Sea urchin protein database (GLEAN)

NCBI nr database

[Diagram labels: A, B, X, Y]

GLEAN_A   NCBI Protein B   (score)   (e-value)

Good Reciprocal BLAST

Page 23: Slow and Steady: The Sea Urchin Genome Project

Reciprocal BLAST

Sea urchin protein database (GLEAN)

NCBI nr database

[Diagram labels: A, B, X, Y]

GLEAN_A   NCBI Protein B   (score)   (e-value)

Bad Reciprocal BLAST
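The good and bad cases on these two slides can be read as a reciprocal best hit test: sea urchin protein A hits NCBI protein B, and the match is good only if B, searched back against the GLEAN database, returns A. The sketch below assumes the best hit in each direction has already been extracted into dictionaries; the function name and the toy labels are hypothetical.

```python
from typing import Dict


def is_reciprocal_best_hit(model: str,
                           forward_best: Dict[str, str],
                           reverse_best: Dict[str, str]) -> bool:
    """True when model's best hit points back to model.

    forward_best: GLEAN model -> best hit in the NCBI nr database
    reverse_best: NCBI protein -> best hit in the GLEAN database
    """
    hit = forward_best.get(model)
    return hit is not None and reverse_best.get(hit) == model


# Toy labels only: GLEAN_A and GLEAN_X both hit protein B,
# but B's best hit in the GLEAN database is GLEAN_A.
forward = {"GLEAN_A": "B", "GLEAN_X": "B"}
reverse = {"B": "GLEAN_A"}
good = is_reciprocal_best_hit("GLEAN_A", forward, reverse)
bad = is_reciprocal_best_hit("GLEAN_X", forward, reverse)
```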

Page 24: Slow and Steady: The Sea Urchin Genome Project

Data Curation

Non-annotated Genes (18,900)

Filtering by coordinates (18,761)

Filtering by mRNA expression (17,159)

Filtering by BLAST failures (14,014)

Filtering by sequence (9,469)

Filtering by Reciprocal BLAST (5,519)

Filtering by Protein Quality (2,478)

Conditions: Names such as “hypothetical”, “predicted”, “unnamed”

Genes removed: 3,041
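The protein quality condition amounts to a keyword test on the hit description. A minimal sketch, assuming a case-insensitive substring match on the three words named on the slide (the slide says "such as", so the real list may have been longer); `has_uninformative_name` is a hypothetical name. The example descriptions come from the Example List slide.

```python
# Words taken from the slide; the project's full list is not known.
LOW_QUALITY_WORDS = ("hypothetical", "predicted", "unnamed")


def has_uninformative_name(description: str) -> bool:
    """True when the hit description contains a low-quality keyword."""
    lowered = description.lower()
    return any(word in lowered for word in LOW_QUALITY_WORDS)


descriptions = [
    "hypothetical protein [Mesorhizobium loti]",
    "unnamed protein product [Homo sapiens]",
    "FLJ11712 protein [Homo sapiens]",
]
flagged = [d for d in descriptions if has_uninformative_name(d)]
```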

Page 25: Slow and Steady: The Sea Urchin Genome Project

Annotation Process

Search sequences of proteins of similar type or domain (use GLEAN DB and Pfam).

Build a phylogenetic tree with Clustal X.

Annotate the gene following SpBase guidelines.

If necessary: do some research on the protein type or its domains (using Pfam).

Page 26: Slow and Steady: The Sea Urchin Genome Project
Page 27: Slow and Steady: The Sea Urchin Genome Project

Contributions to Annotation

► AnnotationAssist.py
   Automates searching for families in the GLEAN database
   Autofetches sequences for Clustal X
   Stores everything in a unique directory based on GLEAN model name and family
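The three behaviours listed for AnnotationAssist.py (search a family, fetch its sequences, store them in a directory named after the model and family) might look something like the sketch below. This is not the actual script, which is not shown in the slides: the function names, the `family.fasta` file name, the assumption that family names appear in FASTA headers, and the toy sequences are all invented for illustration. It uses only the standard library rather than BioPython.

```python
import os
import tempfile
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collect_family(fasta_path: str, model: str, family: str,
                   out_root: str) -> str:
    """Write records whose header mentions the family to a directory
    named after the model and family; return the output file path."""
    out_dir = os.path.join(out_root, f"{model}_{family}")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "family.fasta")
    with open(out_path, "w") as out:
        for header, seq in read_fasta(fasta_path):
            if family.lower() in header.lower():
                out.write(f">{header}\n{seq}\n")
    return out_path


# Toy database with made-up sequences, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "glean.fasta")
    with open(db, "w") as handle:
        handle.write(">GLEAN3_00014 ubiquitin-conjugating enzyme\nMSTPAR\n"
                     ">GLEAN3_00032 Nfrl\nMKLV\n")
    out_path = collect_family(db, "GLEAN3_00014", "ubiquitin", tmp)
    collected = list(read_fasta(out_path))
```

The resulting FASTA file is the kind of input Clustal X accepts for building an alignment.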

Page 28: Slow and Steady: The Sea Urchin Genome Project

References

► Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978).

► CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).

Page 29: Slow and Steady: The Sea Urchin Genome Project

Acknowledgments

► Dr. Andrew Cameron

► David Felt

► Lauren Lee and Nowelle Ibarra

► SoCalBSI Staff and Coordinator

► SoCalBSI Participants

► Funding:
   NIH
   NSF
   DOE
   Beckman Institute