Software: What's good, bad and ugly
Bioinformatics Software
All software contains bugs:
• Industry average: "about 15-50 errors per 1000 lines of delivered code", Steve McConnell (author of Code Complete and Software Estimation: Demystifying the Black Art)
• Range from a spelling mistake in an error message to completely incorrect results
• Most software will process the input to produce an output without errors or warnings
Never blindly trust software or pipelines
• Always test and validate results
• Avoid black box software (definition: produces results, but no one knows how)
Different classes of bioinformatics software
Class       Examples                   Description
Processing  TopHat2, Bowtie2           Performing a computationally intensive task, applying mathematical models
Evaluation  FastQC, BamQC              Deriving QC metrics from output files
Converters  SamToFastq (Picard Tools)  Simply converting between file formats. Generally stable, no regular updates
Pipelines   Galaxy, ClusterFlow        The glue for joining software to create an automated pipeline
12 Step Guide for evaluating and selecting bioinformatics software tools
1. Finding Software to do the job
2. Has the software been published?
3. Software Availability
4. Documentation Availability
5. Presence on user groups
6. Installation and Running
7. Errors and Log Files
8. Use standard file formats
9. Evaluating Commercial Software
10. Bugs in scripts/pipelines to run software
11. Writing your own software
12. Using and creating pipelines
1. Finding software to do the job
Identify the required task
Alignment of methylation sequencing data to a reference genome
Are there related studies performing similar analysis? Publications/posters/talks
Required features
Must have
1. INPUT standard FASTQ format files
2. OUTPUT standard BAM alignments
3. OUTPUT compatible with methylKit
Like to have
1. Perform methylation calls
2. Must make use of multiple processors for large numbers of samples
2. Has the software been published?
✓ Published in a peer-reviewed journal
  As standalone software or as part of a study
✓ Cited by other peer-reviewed papers
✓ Has the software been benchmarked (by people other than the authors)?
  Short read mapping is a "generally solved problem"
  Informative for run times
  BMC Bioinformatics; 2016; 17(Suppl 4):69, DOI: 10.1186/s12859-016-0910-3
3. Software Availability
✓ Software available for download
  Hosted on a recognised software repository, e.g. GitHub, BitBucket, SourceForge
  Software regularly updated / bugs fixed / releases
  More than one developer (e.g. group account)
  Permanent archive of software releases, e.g. zenodo.org, figshare.com
✓ University/Institute/Company website
  Software is the responsibility of a group, not just an individual
✓ Bugs are reported and fixed
✓ New feature requests are added
4. Documentation Availability
✓ User documentation
✓ Release documentation
  Versions: aids reproducibility
[Figure: pairwise comparison matrix for genotypes WT, KO1, KO2 and KO3]
Example: RNA-Seq differential gene analysis using DESeq2
Discover limitations
DESeq2 manual: "The results function without any arguments will automatically perform a contrast of the last level of the last variable in the design formula over the first level."
Sample table (smplTbl):
Name    fileName                 genotype
WT1     wt1.htseq_counts.txt     WT
WT2     wt2.htseq_counts.txt     WT
WT3     wt3.htseq_counts.txt     WT
KO1.1   ko1.1.htseq_counts.txt   KO1
KO1.2   ko1.2.htseq_counts.txt   KO1
KO1.3   ko1.3.htseq_counts.txt   KO1
KO2.1   ko2.1.htseq_counts.txt   KO2
KO2.2   ko2.2.htseq_counts.txt   KO2
KO2.3   ko2.3.htseq_counts.txt   KO2
KO3.1   ko3.1.htseq_counts.txt   KO3
KO3.2   ko3.2.htseq_counts.txt   KO3
KO3.3   ko3.3.htseq_counts.txt   KO3
library(DESeq2)
# The sample table lists HTSeq count files, so they are read with DESeqDataSetFromHTSeqCount
count.data <- DESeqDataSetFromHTSeqCount(sampleTable = smplTbl, design = ~ genotype)
count.data <- DESeq(count.data)
# ✗ No arguments: contrasts the last genotype level over the first, which may not be the comparison you want
binomial.result <- results(count.data)
# ✓ Explicit contrast: KO1 over KO2
binomial.result <- results(count.data, contrast = c("genotype", "KO1", "KO2"))
5. Presence on user groups
✓ Evidence for support questions being answered, e.g. FAQ, searchable public support group
https://www.biostars.org
http://seqanswers.com
GitHub Issues
Google Groups
Is there someone nearby you can ask for help?
Bioinformatics Core Facility
Research group down the corridor
6. Installation and Running
✓ Will run on standard architecture
✓ Easy to install
✓ Release versions
$ bismark --version
Bismark - Bisulfite Mapper and Methylation Caller.
Bismark Version: v0.16.3_dev
Copyright 2010-15 Felix Krueger, Babraham Bioinformatics
www.bioinformatics.babraham.ac.uk/projects/
✓ Source code available
  Binaries can simplify installation
  Docker/Galaxy/BaseSpace
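Recording the release version of every tool used in a run makes an analysis easier to reproduce. The sketch below is one possible way to do it, not part of the original slides: the helper name record_version and the file name versions.tsv are hypothetical, and it assumes the tool accepts a --version flag, as bismark does above.

```shell
# Hypothetical helper: append "date <TAB> tool <TAB> first non-empty version line"
# to a log file. Assumes the tool supports a --version flag.
record_version() {
    local tool="$1" log_file="$2"
    {
        printf '%s\t%s\t' "$(date +%Y-%m-%d)" "$tool"
        # Keep only the first non-empty line of the version banner
        "$tool" --version 2>&1 | awk 'NF && !done { print; done = 1 }'
    } >> "$log_file"
}
```

Usage: record_version bismark versions.tsv, run once per tool at the start of an analysis.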
✓ Default parameters
  A sensible set of default parameters that are likely to produce a good first pass at the results
Example: Traceability of results through the steps in the analysis
Intermediate results are excellent checkpoints
[Figure: RNA-Seq Differential Gene Expression Analysis. Steps: Alignment (BAM), HTSeq-count (counts), DESeq2 (normalised read counts, differentially expressed genes). Checkpoint plots: Sample:Sample correlation, MA-plots, sample PCA, per-gene std. dev.]
7. Errors and Log Files
Warnings: Don't ignore warnings, they may be telling you something crucial about your data
Errors: Problem severe enough for the program to stop and produce an error
Keep and read log files for each software run
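One way to make sure warnings are not overlooked is to scan the kept log files after every run. This is a minimal sketch under assumptions that are not from the slides: logs are saved as *.log files in one directory, and tools write the words "warning" or "error" in their messages. The helper name scan_logs is hypothetical, and each tool's own log wording should be checked.

```shell
# Hypothetical helper: print the names of log files that mention a warning or error.
# Assumes logs are named *.log; a line such as "0 errors" will also match.
scan_logs() {
    local log_dir="$1"
    # -l: print file names only, -i: ignore case, -E: allow the | alternation
    grep -l -i -E 'warn|error' "$log_dir"/*.log 2>/dev/null
}
```

Usage: scan_logs logs/ lists the files to read first; the remaining logs are still worth a look.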
8. Use standard file formats
✓ Standard input files: FASTA, FASTQ
  Converting between formats could introduce errors
✓ Standard output files: BAM
  Compatible with downstream tools
Bioinformaticians spend an embarrassing amount of time converting between file formats
9. Evaluating Commercial Software
Should you use commercial software to do RNA-Seq DGE analysis?
Lots of good commercial software available, e.g. Partek
Pros                                    Cons
Graphical interface (no command line)   Run analysis without understanding the steps
Single application for all steps        Harder to trace back step by step
Dedicated customer support              Limited user group activity
                                        Less transparency (methods/bugs fixed)
                                        Expensive
                                        License required to reproduce analysis (e.g. reviewers)
10. Bugs in scripts/pipelines to run software
Often written specifically for each analysis or project and are prone to bugs
Examples of accidentally missing out samples
1. Bash script for running FastQC
# Only files ending in _1.fq.gz match this pattern; any other read files are silently skipped
for file in *_1.fq.gz; do
    fastqc "$file"
done
multiqc .
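A quick check for samples missed by the FastQC loop above is to compare the input files with the reports that were actually produced. This sketch is an illustration rather than part of the slides: missing_reports is a hypothetical helper, and it assumes FastQC writes a report called <name>_fastqc.html next to an input called <name>.fq.gz, which should be verified against the FastQC version in use.

```shell
# Hypothetical helper: list .fq.gz inputs that have no FastQC report beside them.
# Assumption: the report for sample_2.fq.gz is named sample_2_fastqc.html.
missing_reports() {
    local dir="$1" fq report
    for fq in "$dir"/*.fq.gz; do
        [ -e "$fq" ] || continue            # the pattern matched no files
        report="${fq%.fq.gz}_fastqc.html"   # swap the suffix for the report name
        [ -e "$report" ] || echo "$fq"
    done
}
```

Usage: missing_reports . prints nothing when every input has a report.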
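2. RNA-Seq DESeq2 sample table
# Inconsistent capitalisation creates extra genotype levels, so samples drop out of the contrast
genotype <- data.frame(genotype = c(
  "WT", "wt", "Wt", "KO1", "kO1", "KO1", "KO2", "Ko2", "KO2", "KO3", "KO3", "KO3"))
...
# "K02" (with a zero) does not match the label "KO2"
results(dds, contrast = c("genotype", "WT", "K02"))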
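Label typos like those in the sample table above can be caught before any analysis is run. The sketch below is one possible check, not the author's method: find_case_variants is a hypothetical helper, and it assumes a whitespace-separated sample table with a header row and the genotype in the third column. It only detects labels that differ by capitalisation; a zero typed for a letter O would need a separate check against the expected label list.

```shell
# Hypothetical helper: report groups of labels that differ only by capitalisation.
# Assumes a header row and the label in column 3 (override with a second argument).
find_case_variants() {
    local table="$1" column="${2:-3}"
    awk -v col="$column" 'NR > 1 {
        key = toupper($col)
        if (!((key, $col) in seen)) {          # first time this exact spelling is seen
            seen[key, $col] = 1
            variants[key] = (key in variants) ? variants[key] " " $col : $col
            count[key]++
        }
    }
    END {
        for (key in count) if (count[key] > 1) print variants[key]
    }' "$table" | LC_ALL=C sort
}
```

Usage: find_case_variants sample_table.txt prints one line per inconsistent group and nothing for a clean table.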
11. Utilising dedicated pipeline tools
The bad and ugly
• Homemade "glue" scripts for running software can be bug-prone
• "Dark script matter" isn't reviewed or assessed and is rarely released in methods sections
• In a 3000 sample study, errors are propagated 3000 times!
The good
• Purpose-built pipeline tools
• Premade pipelines, e.g. for RNA-Seq differential gene expression
• Job queuing: load balancing across hardware (laptop to cluster farm)
• Log files track a sample's progress through the pipeline
ClusterFlow: http://clusterflow.io/
• Command line interface
• Many pre-built pipelines
Galaxy: https://usegalaxy.org/
• Interaction via a web browser
• Public and private server installs
• Many pre-built pipelines
• Large user community
Common Workflow Language: https://github.com/common-workflow-language
• A language for building your own pipelines
• Utilised by other pipeline tools, e.g. NextIO
12. Developing your own software
If you are sure a great piece of software doesn't already exist or can't be modified for the task
Developing your own tools gives an appreciation of how difficult it can be
Rule 1: Identify the Missing Pieces
Rule 2: Collect Feedback from Prospective Users
Rule 3: Be Ready for Data Growth
Rule 4: Use Standard Data Formats for Input and Output
Rule 5: Expose Only Mandatory Parameters
Rule 6: Expect Users to Make Mistakes
Rule 7: Provide Logging Information
Rule 8: Get Users Started Quickly
Rule 9: Offer Tutorial Material
Rule 10: Consider the Future of Your Tool
Weighting the evaluation criteria
    Criteria                                   Importance  Comments
1   Finding Software to do the job             +++++       Use the right tools for the job
2   Has the software been published?           +++         New software is being released, so check for improved methods. Just because it's published and well used doesn't mean it's still the best
3   Software Availability                      ++          Openness is a good sign for finding errors/bugs and suggesting feature enhancements
4   Documentation Availability                 +++++
5   Presence on user groups                    +++++
6   Installation and Running                   +++
7   Errors and Log Files                       +++++
8   Use standard file formats                  ++++        Conversions could add sources of error
9   Evaluating Commercial Software             +           Price vs open source software
10  Bugs in scripts/pipelines to run software  +++++       Pipelines standardise workflows
11  Utilising dedicated pipeline tools         +
12  Writing your own software                  +           Don't re-invent the wheel
Compromises for run time vs accuracy/sensitivity
Project A has 3000 samples vs Project B with 12 samples
Method 1: 4 hours per sample, 98% accuracy
Method 2: 30 mins per sample, 97% accuracy
What would you choose if
Method 1: 4 hours per sample, 98% accuracy
Method 3: 15 mins per sample, 90% accuracy
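The run time side of this trade-off can be made concrete with a little arithmetic. The sketch below multiplies the per-sample times quoted above by the two project sizes; it assumes samples are processed one after another on a single machine (an assumption, since a cluster would divide these totals), and total_hours is a hypothetical helper.

```shell
# total_hours SAMPLES MINUTES_PER_SAMPLE: total serial run time in whole hours.
# Assumes samples run one at a time with no parallel processing.
total_hours() {
    local samples="$1" minutes_per_sample="$2"
    echo $(( samples * minutes_per_sample / 60 ))
}

# Method 1 = 240 min, Method 2 = 30 min, Method 3 = 15 min per sample
for samples in 3000 12; do
    echo "samples=$samples method1=$(total_hours "$samples" 240)h method2=$(total_hours "$samples" 30)h method3=$(total_hours "$samples" 15)h"
done
# prints:
# samples=3000 method1=12000h method2=1500h method3=750h
# samples=12 method1=48h method2=6h method3=3h
```

Under this serial assumption, Method 1 is affordable for the 12 samples of Project B (48 hours) but costs 12000 hours for Project A, so the acceptable loss of accuracy depends on the project.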
Summary
Many ways to evaluate software
• Openness and engagement with users is very important: bugs fixed, features added, large user base
• Evaluate features, e.g. run time, against your project requirements
• If you are using pipelines, use purpose-built pipelining tools