Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome...
Date posted: 22-Dec-2015
![Page 1: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/1.jpg)
Model of a real workflow
And issues to discuss…
![Page 2: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/2.jpg)
PlasmoDB workflow
P. falciparum standard genome
P. gallinaceum standard genome
P. vivax standard genome
P. yoelii standard genome
P. berghei standard genome
P. chabaudi standard genome
P. knowlesi standard genome
P. reichenowi standard genome
P. falciparum non-standard
synteny
![Page 3: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/3.jpg)
Standard Genome Workflow
blastx Nrdb genome
Splign (In: Pf, Pb, Py, Pv)
blastp nrdb proteins
molecular weight
Isoelectric point
molecular weight min/max
psipred
run TMHMM
Load TMHMM
taxonomy
SO
NRDB
Genome
TIGR TGI
Extract proteins
Extract genomic sequence
Copy proteins to cluster
Copy genomic seqs to cluster
Global steps (oval)
Subflows (double line)
Compile time Include/Exclude
Calculate translated protein (In: Pf, Pk)
![Page 4: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/4.jpg)
Standard Genome Workflow
blastx Nrdb genome
Splign (In: Pf, Pb, Py, Pv)
blastp nrdb proteins
Calculate translated protein (In: Pf, Pk)
molecular weight
Isoelectric point
molecular weight min/max
psipred
run TMHMM
Load TMHMM
taxonomy
SO
NRDB
Genome
TIGR TGI
Extract proteins
Extract genomic sequence
Copy proteins to cluster
Copy genomic seqs to cluster
![Page 5: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/5.jpg)
NRDB
Copy from download site
Shorten defline
NRDB resource
Copy to cluster
Copy to cluster
![Page 6: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/6.jpg)
Resources
acquire
unpack
ext db
Ext db rls
insert
![Page 7: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/7.jpg)
Psipred
fix protein IDs for psipred
create psipred task dir
copy data dir to cluster
copy psipred protein file to cluster
start psipred on cluster
wait for cluster
copy psipred files from cluster
fix psipred file names
make Alg Inv
load psipred
create psipred data dir
![Page 8: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/8.jpg)
BLAST
Create similarity dir
Start blast
Wait for cluster
Copy files from cluster
Extract IDs from BLAST result
Load subject subset
Load result
Optional step (runtime test)
![Page 9: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/9.jpg)
Splign
run Splign
Extract subject sequence (alt defline)
insert Splign
Extract query sequence (alt defline)
![Page 10: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/10.jpg)
Discussion
![Page 11: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/11.jpg)
Graph file -- features --
• Workflow xml file
• Subflows
  – Parameters
  – Constants
  – Interpolating variables
• Global steps
  – Steps that are only executed once by the whole workflow, even if they appear in multiple subflows
  – Declare a namespace?
• Include/exclude
  – Compile time inclusion/exclusion
  – If not compiled in, flow passes right through
• Skip-able steps
  – Runtime exclusion, based on a dynamic test
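The three mechanisms above (global steps, compile-time include/exclude, runtime skipping) can be sketched in a few lines. This is a minimal illustration only, not the workflow system's actual implementation: the `Step` fields, the function names, and the step and subflow names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Step:
    name: str
    is_global: bool = False                 # executed once for the whole workflow
    include_in: Optional[frozenset] = None  # subflows that compile this step in; None = all
    skip_test: Optional[Callable[[], bool]] = None  # runtime test; True means skip


def compile_flow(subflows):
    """Flatten {subflow_name: [Step, ...]} into an ordered list of (qualified_name, Step)."""
    compiled, seen_global = [], set()
    for subflow_name, steps in subflows.items():
        for step in steps:
            if step.include_in is not None and subflow_name not in step.include_in:
                continue  # not compiled in: the flow passes right through
            if step.is_global:
                if step.name in seen_global:
                    continue  # already scheduled by an earlier subflow
                seen_global.add(step.name)
                compiled.append((step.name, step))
            else:
                compiled.append((f"{subflow_name}.{step.name}", step))
    return compiled


def steps_to_run(compiled):
    """Names of the steps that would execute, honoring runtime skip tests."""
    return [name for name, step in compiled
            if not (step.skip_test and step.skip_test())]


# Hypothetical example: the same genome subflow reused for two species.
genome_steps = [
    Step("nrdb", is_global=True),
    Step("extractProteins"),
    Step("splign", include_in=frozenset({"Pf"})),
    Step("loadSubjectSubset", skip_test=lambda: True),
]
plan = steps_to_run(compile_flow({"Pf": genome_steps, "Pk": genome_steps}))
print(plan)
```

Here the global step is scheduled once, `splign` is compiled only into the `Pf` subflow, and the skippable step drops out at runtime.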
![Page 12: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/12.jpg)
Graph file -- sharing across projects --
• Live in svn: ApiCommonData/Load/lib/xml/workflow
• Found by system in $GUS_HOME/lib/xml/workflow
• Shared across all projects
  – Use include/exclude to specify project specific functionality
  – Therefore, each build must be on its own branch, to avoid interference
![Page 13: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/13.jpg)
Graph file -- step values --
• Avoid side effects in file system (ok in database)
  – All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  – Use Constants
• Avoid hand-coded values that change each build
  – Must be computed by step
  – Eg blast Y= value
• External Db Rls values
  – Always pass external db rls spec, eg
    • Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this
• Table names
  – Want to be able to reuse these values across steps
  – Always use same format, eg:
    • Dots.ExternalNaSequence
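A step that receives these values could validate them up front. The sketch below assumes, from the two examples on this slide only, that an external db rls spec is `name:version` and a table name is `Schema.Table`; the function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ExtDbRlsSpec(NamedTuple):
    db_name: str
    version: str


def parse_ext_db_rls_spec(spec: str) -> ExtDbRlsSpec:
    """Split 'name:version' on the last colon (an assumption based on the slide's example)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRlsSpec(name.strip(), version.strip())


def parse_table_name(qualified: str) -> Tuple[str, str]:
    """Split 'Schema.Table' into its two parts, rejecting any other shape."""
    parts = qualified.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return parts[0], parts[1]


spec = parse_ext_db_rls_spec("Plasmodium Falciparum Chromosomes:2008-07-13")
table = parse_table_name("Dots.ExternalNaSequence")
```

Failing early on a malformed value keeps a typo in the graph file from surfacing halfway through a long build.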
![Page 14: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/14.jpg)
Graph file -- cluster --
• Wait for cluster step
  – Sends email
  – (takes list of email addresses as config. Maybe we should set up mailing list?)
• Followed by a waitForHuman step.
  – By default is in “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
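Because WAIT_FOR_HUMAN is orthogonal to the other states and to offline status, one way to model it is as an independent flag rather than another value of the state field. A minimal sketch under that assumption; the class, the `READY` state name, and `release_by_pilot` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepStatus:
    state: str = "READY"          # hypothetical state name
    offline: bool = False         # independent of state
    wait_for_human: bool = False  # independent of both state and offline

    def runnable(self) -> bool:
        """A step runs only if it is ready and neither flag is holding it back."""
        return self.state == "READY" and not self.offline and not self.wait_for_human


def release_by_pilot(status: StepStatus) -> None:
    """The pilot turns the wait-for-human flag off; nothing else changes."""
    status.wait_for_human = False


# A waitForHuman step starts out held, then runs once the pilot releases it.
after_cluster = StepStatus(wait_for_human=True)
held = after_cluster.runnable()
release_by_pilot(after_cluster)
released = after_cluster.runnable()
```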
![Page 15: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/15.jpg)
Graph file -- resources pipeline --
• We still use a resources.xml file
  – Needed by the front end
    • Pubmed
    • Descriptions
    • Data sources and attributions
• Handled by a regular subflow
• Only one unpack step
  – Current multiple unpack steps need to be combined into a simple script
• Dedicated step classes:
  – ApiCommonData::Load::Step::AcquireExternalResource
  – ApiCommonData::Load::Step::UnpackExternalResource
  – ApiCommonData::Load::Step::InsertExternalDatabase
  – ApiCommonData::Load::Step::InsertExternalDatabaseRelease
  – ApiCommonData::Load::Step::InsertExternalResource
• Are subclasses of ApiCommonData::Load::Step::AcquireExternalStep
  – Knows how to parse the resources.xml file
![Page 16: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/16.jpg)
Configuration files
• Steps Configuration
  – Global
    • Commonly used properties
    • Not validated until runtime
  – Static
    • Defined per step class
    • Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file
• Distinguish between stable properties and mutable ones
  – Version numbers often change
    • Svn?
    • Pilot configuration?
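The slide leaves "Cascading?" as an open question. One possible reading is a lookup that checks step-specific properties first, then the step class defaults, then the global properties. The sketch below uses Python's standard `collections.ChainMap` for that; the property names and values are invented for illustration.

```python
from collections import ChainMap


def step_config(step_props, class_props, global_props):
    """Most specific mapping first: step overrides class, class overrides global."""
    return ChainMap(step_props, class_props, global_props)


def require(config, key):
    """Look a property up at runtime, with a clearer error if no level defines it."""
    try:
        return config[key]
    except KeyError:
        raise KeyError(f"property {key!r} is not set at step, class, or global level") from None


# Hypothetical properties at the three levels.
config = step_config(
    {"blastVersion": "2.2.18"},
    {"blastVersion": "2.2.17", "queue": "long"},
    {"clusterServer": "cluster.example.org"},
)
version = require(config, "blastVersion")
queue = require(config, "queue")
```

A lookup like this only fails when a step actually asks for a missing property, which matches the "not validated until runtime" behavior noted for global properties.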
![Page 17: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/17.jpg)
Runtime File & Directory Structure
• Avoid side-effects
• Use explicit input/output params in xml file
• Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/
  Seqfiles/
  nrdb.fsa
  Pvivax/
  Seqfiles/
  Psipred/
  Assembly/
  ESTs/
  Initial/
  Intermediate/
  – Would use the namespace attribute, somehow
• Use path statement, eg:
  – ../
  – ../tmhmm
• Steps directories
  – Use nested structure for subflows?
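If steps refer to each other's data with path statements such as `../tmhmm`, the workflow has to resolve them against the step's own data directory. A minimal sketch of that resolution, assuming POSIX paths; the function name and the directory names in the example are hypothetical, and the check that the result stays under the data root is an added safeguard rather than something the slides specify.

```python
import posixpath


def resolve_data_path(data_root: str, step_dir: str, relative: str) -> str:
    """Resolve a path statement relative to a step's directory under the data root."""
    root = posixpath.normpath(data_root)
    resolved = posixpath.normpath(posixpath.join(root, step_dir, relative))
    # Refuse paths that climb out of the workflow's data directory.
    if posixpath.commonpath([root, resolved]) != root:
        raise ValueError(f"{relative!r} escapes the data root {root!r}")
    return resolved


sibling = resolve_data_path("/workflow/data", "Pvivax/Psipred", "../tmhmm")
parent = resolve_data_path("/workflow/data", "Pvivax/Psipred", "../")
```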
![Page 18: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/18.jpg)
External Files Repository
• Do we need it?
• If so, what needs to be improved?
![Page 19: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/19.jpg)
Documentation of the workflow
• Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by front end
    • Methods xml file
    • Resources xml file
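A documentation mode amounts to walking the same step list without invoking any step. The sketch below shows the idea only: the `DocStep` class and its fields are hypothetical, and it returns plain text lines rather than the methods and resources xml files the front end expects.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DocStep:
    name: str
    description: str
    action: Callable[[], None]


def run_workflow(steps: List[DocStep], documentation_mode: bool = False) -> List[str]:
    """Run every step, or in documentation mode describe every step and run none."""
    if documentation_mode:
        return [f"{step.name}: {step.description}" for step in steps]
    for step in steps:
        step.action()
    return []


executed = []
steps = [
    DocStep("runTmhmm", "Predict transmembrane domains", lambda: executed.append("runTmhmm")),
    DocStep("loadTmhmm", "Load predictions into the database", lambda: executed.append("loadTmhmm")),
]
docs = run_workflow(steps, documentation_mode=True)
```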
![Page 20: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/20.jpg)
GUI
• Should it run in the web context?
  – Security issues
  – Avoids having to have installed software
  – Would work from home
  – All members of team could see the flow
  – Somehow restrict editability
  – Could be posted on real site as documentation?
    • Overkill? Too detailed?
• Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
    • Expand/collapse?
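The summary shown on a collapsed subflow node could be a count of the states of the steps inside it, including steps in nested subflows. A minimal sketch, assuming a subflow is represented as a dict whose values are either a state string or another nested dict; the representation and the state names are hypothetical.

```python
from collections import Counter


def summarize_states(subflow: dict) -> Counter:
    """Count step states inside a subflow, descending into nested subflows."""
    counts = Counter()
    for child in subflow.values():
        if isinstance(child, dict):
            counts += summarize_states(child)  # nested subflow
        else:
            counts[child] += 1                 # leaf step with a state string
    return counts


genome_flow = {
    "extractProteins": "DONE",
    "psipred": {"startOnCluster": "RUNNING", "loadPsipred": "READY"},
    "blast": {"startBlast": "DONE", "loadResult": "READY"},
}
summary = summarize_states(genome_flow)
```

A GUI could then color the collapsed node in proportion to these counts.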
![Page 21: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/21.jpg)
Mini-flows
• like mini-pipes, but for workflows…
![Page 22: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/22.jpg)
Slides after this are notes, and other junk
![Page 23: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/23.jpg)
Standard resources
taxonomy
EnzymeDB
SO
NRDB
dbEST [tax_id]
GO
GO Codes
BibliographicRef terms MO terms MO types MO InterProMO Entry
Orthomcl phyletic
orthomcl
![Page 24: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/24.jpg)
Plasmodb resources
IEDB epitopes
IEDB dbxrefs
NA Genbank dbrefs
AA Genbank dbrefs
pdb
Pdb index
![Page 25: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/25.jpg)
P.falciparum resources
Zhang ESTs
Apicoplast
Florens 2002
Pf plastid
Florent ESTs
Pf mitochon
Watanabe Pf transcripts
Watanabe Pf ESTs
Pf GO Associations
Sanger IT SNPs
SU SNPs
Broad SNPs
Combined SNPs
DeRisi Oligos
Winzeler Genetic Var. array
DeRisi Dd2
DeRisi HB3
Winzeler Cell Cycle
DeRisi 3D7
Scripps Array
Winzeler Gametocyte
DeRisi Array 7282
MTC KI Array
Baum Meta data
Durasingh Meta data
GSE5247 Meta data
Cowman Meta data
Pfab Array
E-MEXP 449 Meta data
E-MEXP 439 Meta data
Plasmodb Gene ids
E-MEXP 128 Meta data
Waters Meta data
Waters Gametocyte Mass spec
Daily Meta data
GSE2265 Meta data
GSE8099 Meta data
interactome
Waters Female Gametes mass
Mutual info
Plasmo map
y2h
Sage tag Array design
Sage tag freqs
Pf chr Genbank refs
TIGR gene indexes
Baum Array data
Durasingh Array data
GSE5247 Array data
Cowman Array data
E-MEXP 449 Array data
E-MEXP 439 Array data
E-MEXP 128 Array data
Waters Array data
Daily Array data
GSE2265 Array data
GSE8099 Array data
Baum RAD anal
Durasingh RAD anal
GSE5247 RAD anal
Cowman RAD anal
E-MEXP 449 RAD anal
E-MEXP 439 RAD anal
E-MEXP 128 RAD anal
Waters RAD anal
Daily RAD anal
GSE2265 RAD anal
GSE8099 RAD anal
Waters male Gametes mass
Waters mixed Gametes mass
PASA Db refs
Hagai EC
Winzeler Db refs
Winzeler Lit refs
Predicted Protein structs
mr4
Cowman subcellular
Haldar subcellular
Merozoite peptides
Lasonder oocysts
Florens 2004
Broad SNP coverage
evigan
Lasonder Oocysts sporozoites
Entrez Dbrefs
Pubmed dbrefs
Broad bar code
Broad 3k genotyping
Lasonder salivary sporozoites
![Page 26: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/26.jpg)
P. vivax resources
Watanabe Pv transcripts
Pv contigs
Watanabe Pv ESTs
Pv dbrefs
Pv GB dbrefs
Pv mitochon
Pv chromosomes
TIGR gene indexes
![Page 27: Model of a real workflow And issues to discuss…. PlasmoDB workflow P.Falciparum Standard genome P.Gallenacium Standard genome P.Vivax Standard genome.](https://reader033.fdocuments.us/reader033/viewer/2022052603/56649d7e5503460f94a61176/html5/thumbnails/27.jpg)
C.parvum C.hominis
Synteny
start
End
Plasmo Toxo
Api
End
Start