PyConFr 2015 - ASaiM: Lessons learned from developing a framework for biologists

Post on 14-Apr-2017


ASaiM: Lessons learned from developing a framework for biologists

Bérénice Batut, October 16th, 2015

PhD thesis in bioinformatics and computational biology

Contribution to aevol project

Development of simple Python scripts

Post-doc in bioinformatics

Development of ASaiM project

ASaiM project

Objectives

Development of a bioinformatics environment to analyze data from gut microbiota

Gut microbiota

Community of microorganism species that live in the digestive tracts

"Forgotten" organ

Metagenomics: study of microbiota

Complexity

Short sequences

Sequence variability

Incomplete reference databases

Need for numerous treatments to extract useful information

[Figure: workflow diagram. Stages recoverable from the node labels: SortMeRNA screens the input sequences against rRNA populus sequences and rRNA sclerotinia sequences; populus matches are filtered on the populus blast report (identity > 97% with coverage > 97%, target position = 1, or target position > target sequence length) and the populus sequences not conserved are concatenated back; the non-sclerotinia and non-populus sequences are run through SortMeRNA against Silva bacteria 16S, Silva archaea 16S, and Silva eukaryota 18S sequences, giving 16S and 18S sequences and blast reports. For each report, the first line is removed and the ids and e-values are extracted. Ids are compared between 16S and 18S (difference: specific ids; similarity: similar ids). For similar ids, the e-value columns are joined and the lines where the 16S e-value < 18S e-value, or the 18S e-value < 16S e-value, are extracted. The resulting ids are concatenated with the specific ids and used to extract the 16S and 18S sequences to conserve.]

Example of workflow to sort sequences given their type
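The comparison steps named in the workflow (compare difference, compare similarity, join columns, extract lines by e-value) can be sketched in a few lines of Python. This is only an illustration, not the code used in ASaiM: the function name `assign_reads`, the dictionary inputs, and the read ids are hypothetical, and dropping ties is an assumption based on the diagram showing only strict comparisons.

```python
def assign_reads(evalues_16s, evalues_18s):
    """Split read ids between 16S and 18S using their e-values.

    Both arguments map a read id to the e-value reported for it
    (hypothetical data layout, not taken from the talk).
    """
    ids_16s = set(evalues_16s)
    ids_18s = set(evalues_18s)

    # "Compare difference": ids reported for only one of the two types.
    keep_16s = ids_16s - ids_18s
    keep_18s = ids_18s - ids_16s

    # "Compare similarity" + "Join columns": ids reported for both types,
    # assigned to the type with the strictly lower e-value.
    for read_id in ids_16s & ids_18s:
        if evalues_16s[read_id] < evalues_18s[read_id]:
            keep_16s.add(read_id)
        elif evalues_18s[read_id] < evalues_16s[read_id]:
            keep_18s.add(read_id)
        # Assumption: equal e-values match neither strict test, so the
        # read is assigned to neither type.

    return keep_16s, keep_18s


# Hypothetical e-values for four reads.
example_16s = {"r1": 1e-30, "r2": 1e-5, "r3": 1e-10}
example_18s = {"r2": 1e-20, "r3": 1e-10, "r4": 1e-8}
kept_16s, kept_18s = assign_reads(example_16s, example_18s)
```

Each set of kept ids would then feed the "Extract sequences whose id in a list" step.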

ASaiM framework

Bioinformatics framework to generate workflows to analyze data from gut microbiota

Main requirements

Generation of workflows with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

First tested approach: simple Python scripts

Fit with framework requirements?

Generation of workflows with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Second tested approach: workflow managers such as Luigi, Airflow, ...

Airflow dependency graph (from Airbnb site)

Fit with framework requirements?

Generation of workflows with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Third tested approach: homemade approach

Configuration file

Workflow description

Web interface for generation

Python scripts to execute workflow in configuration file
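A configuration-driven runner of this kind might look roughly like the sketch below. The talk does not show the actual configuration format or scripts, so everything here is an assumption made for illustration: the JSON layout, the `STEPS` registry, and the two step functions are hypothetical.

```python
import json


# Hypothetical step implementations; the real ASaiM scripts are not shown.
def remove_first_line(lines):
    """Drop the header line of a tabular report."""
    return lines[1:]


def extract_ids(lines):
    """Keep the first tab-separated column of each line."""
    return [line.split("\t")[0] for line in lines]


# Registry mapping the step names used in the configuration to functions.
STEPS = {"remove_first_line": remove_first_line, "extract_ids": extract_ids}

# Hypothetical configuration file content: an ordered list of step names.
CONFIG = '{"workflow": ["remove_first_line", "extract_ids"]}'


def run_workflow(config_text, data):
    """Apply the configured steps to the data, one after the other."""
    config = json.loads(config_text)
    for step_name in config["workflow"]:
        data = STEPS[step_name](data)
    return data


report = ["id\tevalue", "read1\t1e-30", "read2\t1e-5"]
result = run_workflow(CONFIG, report)
```

A linear list like this one cannot express a step that consumes the outputs of several other steps, which is where such a design starts to strain.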

Fit with framework requirements?

Generation of workflows with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Main issue with these approaches: dependency between the tasks

Airflow dependency graph (from Airbnb site)

[Figure: the example workflow to sort sequences given their type, shown again in full and then in part]
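The dependency between tasks can be made concrete with a small sketch in which each task declares the datasets it reads and writes, and the execution order is derived from those declarations. This is not how any of the tools named in the talk work internally; the task table and dataset names are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical task table: each task lists the datasets it reads and writes.
TASKS = {
    "join": {"inputs": ["evalue_16s", "evalue_18s"], "outputs": ["joined"]},
    "extract_16s": {"inputs": ["blast_16s"], "outputs": ["evalue_16s"]},
    "extract_18s": {"inputs": ["blast_18s"], "outputs": ["evalue_18s"]},
    "sortmerna": {
        "inputs": ["input_sequences"],
        "outputs": ["blast_16s", "blast_18s"],
    },
}


def execution_order(tasks):
    """Order tasks so that every input is produced before it is consumed."""
    # Which task produces each dataset.
    producer = {
        output: name
        for name, task in tasks.items()
        for output in task["outputs"]
    }
    # For each task, the set of tasks whose outputs it reads. Datasets with
    # no producer (such as the raw input) create no dependency.
    graph = {
        name: {producer[i] for i in task["inputs"] if i in producer}
        for name, task in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(TASKS)
```

Here `sortmerna` always comes first and `join` last, while the two extraction tasks may run in either order.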

Final approach: Galaxy

Open-source project based on Python

International development community

Web interface

Galaxy ToolShed

Fit with framework requirements?

Generation of workflows with numerous tools

Easy to use

Flexibility

Heavily and easily documented

Easy to maintain

Galaxy dependency graph

[Figure: Galaxy workflow editor view. Input datasets are connected to the tools Extract (constrained) information (on similarity search reports), Remove beginning, Compare two Datasets, Join two Datasets, Cut, Filter, Concatenate datasets, and Extract (sequence file with constraints on sequences), with Line/Word/Character count applied to the intermediate outputs.]

Workflow to sort sequences given their type

ASaiM framework

Configuration of a Galaxy server

Development of wrappers for tool integration

Development of scripts to use Galaxy and API
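As a rough idea of what a script using the Galaxy API starts from, the sketch below builds the URL that lists the workflows stored on a server. It is not taken from the ASaiM scripts: the server address and API key are placeholders, the helper name `workflows_url` is hypothetical, and the sketch assumes the `/api/workflows` endpoint with the API key passed as the `key` query parameter. The request itself is not sent, since that needs a running Galaxy server.

```python
from urllib.parse import urlencode


def workflows_url(galaxy_url, api_key):
    """Build the URL listing the workflows visible to the given API key."""
    base = galaxy_url.rstrip("/")
    query = urlencode({"key": api_key})
    return f"{base}/api/workflows?{query}"


# Placeholder server address and key, for illustration only.
url = workflows_url("http://localhost:8080/", "HYPOTHETICAL_KEY")
```

Against a live server, the URL could be opened with `urllib.request.urlopen` and the response parsed as JSON.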

Used tools

Code

GitHub and submodules, GitLab

Documentation

Sphinx + Read the Docs + GitHub

Web page

Jekyll + GitHub Pages

Management

Trello, Slack

Learned from this project

Need to define the design correctly

No workflow manager with input/output dependency

Do not reinvent the wheel

Do not prefer a home-made solution

Join an active community

Need for good tools and good habits in big projects

Thank you. Questions?

bebatut.fr

github.com/bebatut

twitter.com/bebatut