Scientific Databasing with TreeGenes:
Genotype, Phenotype, & Environment
Jill Wegrzyn
Department of Ecology & Evolutionary Biology
Institute for Systems Genomics: Computational Biology Core
University of Connecticut, Storrs CT
treegenesdb.org
Big Data in Genomics
“Compared genomics with three other major generators of Big Data: Astronomy, YouTube, and Twitter... Genomics is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis”
Unit        Size (bytes)
Byte        1
Kilobyte    1,000
Megabyte    1,000,000
Gigabyte    1,000,000,000
Terabyte    1,000,000,000,000
Petabyte    1,000,000,000,000,000
Exabyte     1,000,000,000,000,000,000
Zettabyte   1,000,000,000,000,000,000,000
Mostly Genomic but… Proteomics, Phenomics, Metabolomics…
• Kb = 1,000 bp
• Mb = 1 x 10^6 bp
• Gb = 1 x 10^9 bp
• Tb = 1 x 10^12 bp
• Pb = 1 x 10^15 bp
Genomes are vast information repositories
[Figure: genome sizes on a scale marked 1 Gb, 10 Gb, 100 Gb; Human: 3 Gb]
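The base-pair prefixes above can be applied mechanically. The following is a minimal sketch, not part of the original slides; the function name `format_bp` is an assumption chosen for illustration.

```python
# Hypothetical helper: express a base-pair count with the largest
# decimal prefix (Kb, Mb, Gb, Tb, Pb) listed on the slide.
UNITS = [
    ("Pb", 10**15),
    ("Tb", 10**12),
    ("Gb", 10**9),
    ("Mb", 10**6),
    ("Kb", 10**3),
]


def format_bp(n_bp: int) -> str:
    """Return a short label such as '3 Gb' for a base-pair count."""
    for name, size in UNITS:
        if n_bp >= size:
            # ':g' drops trailing zeros, so 3.0 prints as '3'.
            return f"{n_bp / size:g} {name}"
    return f"{n_bp} bp"


if __name__ == "__main__":
    print(format_bp(3_000_000_000))  # human genome size quoted on the slide
    print(format_bp(1500))
```

With these definitions, the human genome size quoted on the slide (3,000,000,000 bp) is labelled "3 Gb".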
Sensors & Metadata: Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies
IO Systems: Hard drives, Networking, Databases, Compression, LIMS
Compute Systems: CPU, GPU, Distributed, Clouds
Scalable Algorithms: Streaming, Sampling, Indexing
Machine Learning: classification, modeling, visualization & data integration
Results: Domain Knowledge
Acquiring Knowledge through Big Data
Gene Conservation of Tree Species – Banking on the Future (2016)
• Survey Conducted
– Breeders, Geneticists, Land Managers, and Ecologists
– 31 Questions
• Trees (greenhouse, plots, landscape, numbers, species)
• Data collection (devices, software)
• Analytical tools (statistical, databases)
• Data storage
• Challenges
– 283 Respondents
Gene Conservation of Tree Species – Banking on the Future (2016)
[Bar chart of survey responses, axis marked 0 to 80. Categories: Computational Resources; Formatting Data; Hosting Data on the Web; Accessing Data from Databases; Integrating Data Across Databases; Scripting Support to Extract Information]
Motivation (Data Provider)
• Support next-generation data requirements for the biological database
– Increased quantity and availability of new data
– Support data integration across resources
– Support complex data analytics
– Move data efficiently
TreeGenes Database: History
– Began as a repository for forest tree genetic maps and associated markers
– Expanded to other data types
• Sequence
– Resequencing, Large-Scale Genotyping, Transcriptomics/Expression
– Full Genome Sequences
• Analysis and Visualization Tools
– Ability for users to mine the data
• Resources for the user community
– Literature, Colleagues
TreeGenes Database: Users
Unique Web Visitors to TreeGenes Database per month, January-December 2016
10,000
2,086 users from 862 organizations in 94 countries
TreeGenes Database: Species
• 1,774 species from 101 genera
– At least one genetic artifact from each species
• Full genome sequence: 21 species
• Transcriptome/Expression resources: 4,120,817 sequences from 283 species
• 106 genetic maps from 35 species
TreeGenes Database: Data Sources
Primary data sources (semi-automated)
• Primary databases such as NCBI/EBI
• Appropriate data should be submitted to primary databases
• Consistent with changing standards
– Currently no repository for non-human SNPs (new!)
User submissions
• For data and metadata not captured well by primary databases (Journals)
Project submissions
• Internal project management (private to public)
Curated Sources
• Phytozome and PlantGDB
• PLAZA (OrthoFinder)
• TRY-DB (Phenotypes)
• Dryad (Flat files)
TreeGenes Database: Data Sources
Data that is not collected!
Submit genetic maps, association or population study data
Most submissions come from journal requirements: Tree Genetics and Genomes, New Phytologist, and Forests
TreeGenes Database: Data Sources
Population Study
• Publication
• Species
Study Design
• Landscape
• Common Garden
• Greenhouse
• Growth Chamber
• Breeding (Plot)
Phenotype, Genotype, Environment
• Georeferenced
Raw Data
• Trees
• Genotypes
• Phenotypes
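One way to picture the submission structure on this slide is as a small record type. The sketch below is illustrative only and is not the TreeGenes schema: the class names, field names, and the two check methods are assumptions, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class StudyDesign(Enum):
    """Study designs listed on the slide."""
    LANDSCAPE = "Landscape"
    COMMON_GARDEN = "Common Garden"
    GREENHOUSE = "Greenhouse"
    GROWTH_CHAMBER = "Growth Chamber"
    BREEDING_PLOT = "Breeding (Plot)"


@dataclass
class Tree:
    """One sampled tree; coordinates are optional (hypothetical fields)."""
    tree_id: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None

    @property
    def georeferenced(self) -> bool:
        return self.latitude is not None and self.longitude is not None


@dataclass
class PopulationStudy:
    """Hypothetical container for a population study submission."""
    publication: str
    species: str
    design: StudyDesign
    trees: List[Tree] = field(default_factory=list)
    # tree_id -> marker name -> genotype call
    genotypes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # tree_id -> trait name -> measured value
    phenotypes: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def trees_missing_data(self) -> List[str]:
        """IDs of trees lacking either a genotype or a phenotype record."""
        return [t.tree_id for t in self.trees
                if t.tree_id not in self.genotypes
                or t.tree_id not in self.phenotypes]

    def ungeoreferenced_trees(self) -> List[str]:
        """IDs of trees without coordinates."""
        return [t.tree_id for t in self.trees if not t.georeferenced]


if __name__ == "__main__":
    # Placeholder values, not real data.
    study = PopulationStudy(
        publication="placeholder citation",
        species="placeholder species",
        design=StudyDesign.COMMON_GARDEN,
        trees=[Tree("T1", 41.8, -72.25), Tree("T2")],
        genotypes={"T1": {"snp_1": "A/G"}},
        phenotypes={"T1": {"height_m": 12.5}},
    )
    print(study.trees_missing_data(), study.ungeoreferenced_trees())
```

Keeping trees, genotypes, and phenotypes keyed by the same tree identifier makes it simple to check that the three raw data tables line up before submission.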
TreeGenes Database: Data Sources
Metadata on published studies!
Genetic maps, association or population studies
TreeGenes Database: Data Sources
Genetic maps, association or population studies
Obtain TGDR accession number!
Open source content management system (CMS) and database for biological data
Modules for genetic, genomic, and breeding data generated through a CMS and standardized schema
Benefits:
• Reduces development costs
• Provides an API for complete customization
• Uses GMOD Chado and community ontologies for standardization
• Access control for user/user groups
• Allows for sharing of extensions between sites
– Implemented in over 30 databases!
Current State of Tripal
• http://tripal.info
• Content Management System for Biological Data
• Over 100 Installations
• Current Version 2.0
Tripal Gateway Project (Data Provider)
• Support next-generation data requirements for the biological database
• Tripal Gateway Project
– Increased quantity and availability of new data
– Support data integration across resources (Web Services) – Tripal Exchange (v3.0)
– Support complex data analytics (Integration with Galaxy API)
– Move data efficiently (Software Defined Networking – Tripal Data Transfer BDSS)
Tripal Gateway Project: Tree (& Legume) Databases
[Diagram group labels: Project PIs; Collaborating Databases; Data Transfer Collaborators; Data Analysis Collaborators]
Alex Feltus, Kuangching Wang, Clemson Univ.: Data Transfer, SDN, SOS
Dorrie Main, Sook Jung, Stephen Ficklin, Washington State University
• Genome Database for Rosaceae
• Cool Season Food Legumes
• Citrus Genome Database
Kirstin Bett, Lacey Sanderson, Univ. of Saskatchewan
• KnowPulse
Jill Wegrzyn, University of Connecticut
• TreeGenes
Steve Cannon, Ethy Cannon, Iowa State; Andrew Farmer, NCGR
• LegumeInfo, PeanutBase
Meg Staton, University of Tennessee
• Hardwood Genomics
University of Utah NSF ACI-REF Collaborators
Galaxy Project; Texas Advanced Computing Center, public Galaxy Server
TreeGenes Database: Interfaces
Web-based framework (Galaxy) promotes genomics analysis
Integrating Galaxy with Tripal
Data analysis brought to the user via the database with Galaxy Workflows
DNA Sequence Data
• Re-sequencing alignment
• Variant discovery (against the reference)
• Variant discovery (between samples)
• Prediction of functional genetic variants
• Association Genetics
• Functional Annotation
RNA Sequence Data
• Transcriptome assembly
• Alignment to a reference
• Differential Expression analysis
• Gene co-expression network construction
• miRNA analysis
BDSS: Big Data Smart Socket
• Smart Data Transfer
• Standalone client with a metadata repository
• First step is to build an inventory of data sources relevant to a particular user community
– NCBI (Genbank for Raw Data)
– Cyverse (iPlant for analytics)
– Tripal supported websites for supporting data
• Determines optimal method for data transfer for each data source through testing
• Data transfer methodology is encoded into the metadata repository
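The idea of testing each data source and recording the best transfer method can be sketched as follows. This is not the BDSS implementation: the function names, the method names ("single-stream", "parallel-stream"), and the simulated throughput figures are all assumptions made up for illustration, and the "test" is a simple timing model rather than a real network transfer.

```python
from typing import Callable, Dict

# A transfer test takes a payload size in bytes and returns elapsed seconds.
TransferTest = Callable[[int], float]


def simulated(throughput_mb_s: float, setup_s: float) -> TransferTest:
    """Build a fake transfer test: fixed setup cost plus size / throughput."""
    def run(n_bytes: int) -> float:
        return setup_s + n_bytes / (throughput_mb_s * 1_000_000)
    return run


def choose_method(candidates: Dict[str, TransferTest], test_bytes: int) -> str:
    """Run every candidate method once and return the fastest one's name."""
    timings = {name: run(test_bytes) for name, run in candidates.items()}
    return min(timings, key=timings.get)


def build_repository(sources: Dict[str, Dict[str, TransferTest]],
                     test_bytes: int) -> Dict[str, str]:
    """Map each data source to its best-performing transfer method."""
    return {source: choose_method(candidates, test_bytes)
            for source, candidates in sources.items()}


# Invented numbers; a real inventory would be measured, not hard-coded.
SOURCES = {
    "NCBI": {
        "single-stream": simulated(20, 0.1),
        "parallel-stream": simulated(200, 1.0),
    },
    "Tripal site": {
        "single-stream": simulated(20, 0.1),
        "parallel-stream": simulated(15, 1.0),
    },
}

if __name__ == "__main__":
    print(build_repository(SOURCES, test_bytes=1_000_000_000))
```

The resulting dictionary plays the role of the metadata repository in this sketch: a client would look up the source and use the recorded method instead of re-testing on every download.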
BDSS: Moving data efficiently
Tripal Gateway: Use Cases
1. A user could search across community DBs for their set of SNPs of interest (from a genotyping array) using Tripal Exchange.
2. The probe sequences could be gathered as a list and transferred to the user with the Data Transfer (BDSS) tool.
3. If the user prefers to use Galaxy for analysis, the transfer could load the probes into the Tripal Galaxy module and align them to a recently released genome reference.
4. A basic workflow for alignment could be selected along with the appropriate target in Galaxy.
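The flow of this use case can be mimicked with a toy, in-memory example. Nothing below calls Tripal Exchange, BDSS, or Galaxy: the database contents, the function names, and the exact-match "alignment" are stand-ins invented for illustration, and a real alignment workflow would use a proper aligner.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for several community databases: db name -> SNP id -> probe.
COMMUNITY_DBS: Dict[str, Dict[str, str]] = {
    "db_a": {"snp_1": "ACGTAC", "snp_2": "TTGACA"},
    "db_b": {"snp_3": "GGCATC"},
}


def search_across_dbs(snp_ids: List[str]) -> Dict[str, Tuple[str, str]]:
    """Step 1: find which database holds each requested SNP."""
    hits = {}
    for db_name, probes in COMMUNITY_DBS.items():
        for snp_id in snp_ids:
            if snp_id in probes:
                hits[snp_id] = (db_name, probes[snp_id])
    return hits


def gather_probes(hits: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Step 2: flatten the hits into a list of (SNP id, probe sequence)."""
    return [(snp_id, seq) for snp_id, (_db, seq) in sorted(hits.items())]


def align_exact(probes: List[Tuple[str, str]],
                reference: str) -> Dict[str, Optional[int]]:
    """Steps 3-4 stand-in: report the 0-based position of each exact match."""
    positions = {}
    for snp_id, seq in probes:
        index = reference.find(seq)
        positions[snp_id] = index if index >= 0 else None
    return positions


if __name__ == "__main__":
    reference = "NNACGTACNNGGCATCNN"  # made-up reference sequence
    hits = search_across_dbs(["snp_1", "snp_2", "snp_3", "snp_9"])
    print(align_exact(gather_probes(hits), reference))
```

The point of the sketch is the hand-off between steps: a federated search yields a probe list, the list is moved to the analysis environment, and the alignment step consumes it unchanged.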
TreeGenes Database: Data Sources
Population Study
• Publication
• Species
Study Design
• Landscape
• Common Garden
• Greenhouse
• Growth Chamber
• Breeding (Plot)
Phenotype, Genotype, Environment
• Georeferenced
Raw Data
• Trees
• Genotypes
• Phenotypes
In addition to:
• Internal projects
• TREESNAP (public)
• DRYAD
• TRY-DB
TreeGenes Database: CartograTree
– Providing context to geo-referenced data
– Data from TreeGenes, WorldClim, Ameriflux, TRY-DB
TreeGenes Database: Interfaces
– Retrieve genotype, phenotype, environmental, and sequence data
– Further analysis (MUSCLE, TASSEL, PAML) via SSWAP
TreeGenes Database: SSWAP
– SSWAP “reasons” over the input data and responds with relevant applications
– Send data through pipeline with selection (parameters)
TreeGenes Database: Cyverse (TACC)
– Connect with Cyverse Views
– Download data locally or maintain on cloud-based storage
CartograTree: Current Development
• Flexible georeferenced tagging
– Approximate
– Exact
– Obscured (radius)
• Environmental layers (Geoserver)
– Soil
– Fire/Drought
– Climate models
– LIDAR
• Integration with Tripal
– User control of workspace
– Ability to upload their own trees/phenotypes
• Connection with Galaxy framework
– More analytical options (PLINK, TASSEL, MSA, PAML)
– Intelligent workflows
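The three tagging modes listed above (approximate, exact, obscured within a radius) could be modelled roughly as below. This is a sketch under stated assumptions, not CartograTree code: the function names are hypothetical, "approximate" is taken to mean rounding coordinates to one decimal place, and "obscured" is taken to mean a random offset within a radius, computed with a flat-earth approximation that is only reasonable for small radii away from the poles.

```python
import math
import random
from typing import Optional, Tuple

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE = EARTH_RADIUS_KM * math.pi / 180.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def tag_location(lat: float, lon: float, mode: str = "exact",
                 radius_km: float = 5.0,
                 rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Return the coordinates to publish for a tree under the given mode."""
    if mode == "exact":
        return lat, lon
    if mode == "approximate":
        # Assumption: one decimal place (roughly 11 km in latitude).
        return round(lat, 1), round(lon, 1)
    if mode == "obscured":
        rng = rng or random.Random()
        # sqrt() makes the offsets uniform over the disc, not clustered
        # near the centre.
        distance = radius_km * math.sqrt(rng.random())
        bearing = rng.uniform(0.0, 2.0 * math.pi)
        dlat = distance * math.cos(bearing) / KM_PER_DEGREE
        dlon = distance * math.sin(bearing) / (KM_PER_DEGREE * math.cos(math.radians(lat)))
        return lat + dlat, lon + dlon
    raise ValueError(f"unknown tagging mode: {mode}")


if __name__ == "__main__":
    true_lat, true_lon = 41.8077, -72.2540  # placeholder coordinates
    shown = tag_location(true_lat, true_lon, "obscured", radius_km=5.0)
    print(shown, haversine_km(true_lat, true_lon, *shown))
```

Storing the exact coordinates privately and publishing only the tagged ones is one way sensitive tree locations could be protected while still supporting map-based queries.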
CartograTree: TreeSNAP
• Validated accessions from TreeSNAP (obscured)
CartograTree: Galaxy Workflows
Transcriptomics
Exome Capture
RNA-Seq
Genotyping Array
• Affy
• Illumina
Whole Genome Resequencing
GBS
• RAD-Seq
• ddRAD-Seq
CartograTree: Advanced Interface
• 142 species
• 27,913 TGDR
• 17,412 Inventory
• 26,332 TRY-DB
• 815 TreeSNAP
• Release Date: December 2017
TreeGenes Database: Team
Project Leads
• Jill Wegrzyn
• Emily Grau
• Nic Herndon
Advising
• Damian Gessler, Semantic Options
Project Developers
• Sean Buehler
• Taylor Falk
• Peter Richter
• Clayton Michael
Collaborators
• Stephen Ficklin (Tripal)
• Alex Feltus (BDSS)
• Meg Staton (HWG)
• Dorrie Main (GDR)
@TreeGenes
TreeGenes Database