Bonnal bosc2010 bio_ruby

BioRuby Project Update. Raoul J.P. Bonnal, Life Science Informatics, Integrative Biology Program, Fondazione INGM, Italy. 11th Annual Bioinformatics Open Source Conference (BOSC) 2010, Boston, Massachusetts, USA. Co-authors: Toshiaki Katayama, Pjotr Prins, Mitsuteru Nakao, Christian M Zmasek, Naohisa Goto.

Transcript of Bonnal bosc2010 bio_ruby

Page 1: Bonnal bosc2010 bio_ruby

BioRuby

Project Update

Raoul J.P. Bonnal


Life Science Informatics

Integrative Biology Program

Fondazione INGM

Italy

11th Annual Bioinformatics Open Source Conference (BOSC) 2010

Boston, Massachusetts, USA

co-authors:

Toshiaki Katayama

Pjotr Prins

Mitsuteru Nakao

Christian M Zmasek

Naohisa Goto

Page 2: Bonnal bosc2010 bio_ruby

Introduction

BioRuby - bioinformatics library for the Ruby language

• Object oriented scripting language, functional and reflective

• popularized by "Ruby on Rails"

• created by Matz in 1993 in Japan

Page 3: Bonnal bosc2010 bio_ruby

BioRuby & Platforms

[Diagram]

Ruby Interpreter

• Performances: Ruby, RubyEE

• Portability: JRuby, Java libraries

Operating Systems

• gem install bio

Page 4: Bonnal bosc2010 bio_ruby

BioRuby & Platforms: BioLib

[Same diagram as on page 3, with BioLib added]

Page 5: Bonnal bosc2010 bio_ruby

BioRuby & Platforms: Cytoscape

[Same diagram as on page 3, with Cytoscape added]

Page 6: Bonnal bosc2010 bio_ruby

History: 2008, 2009, 2010

[Timeline figure]

• Topics: Web Services, Workflows, Semantic Web

• Releases: 1.3.0, 1.3.1, 1.4.0

• Milestones: Code fest, GSoC, BOSC, git

• GSoC projects: phyloXML; Ruby 1.9.2; NeXML I/O, RDF triples; infer gene duplications

GitHub: http://github.com/bioruby/bioruby

GSoC references:

• Ruby 1.9.2 support of BioRuby (OBF)

• Develop an API for NeXML I/O, and RDF triples for BioRuby (NESCent)

• Implementation of algorithm to infer gene duplications in BioRuby (OBF)

• Implementing phyloXML support in BioRuby (NESCent)

Page 7: Bonnal bosc2010 bio_ruby

BioRuby Features

Category: Modules

Object: sequence, pathway, tree, bibliography reference

Sequence manipulation: translation, alignment, location, mapping, feature table, molecular weight, design siRNA, restriction enzyme

Format: GenBank, EMBL, UniProt, KEGG, PDB, MEDLINE, REBASE, FASTQ, GFF, MSF, ABIF, SCF, GCG, Lasergene, GEO SOFT, Gene Ontology

Tool: BLAST, FASTA, EMBOSS, HMMER, InterProScan, GenScan, BLAT, Sim4, Spidey, MEME, ClustalW, MUSCLE, MAFFT, T-Coffee, ProbCons

Phylogeny: PHYLIP, PAML, phyloXML, NEXUS, Newick

Web Service: NCBI, EBI, DDBJ, KEGG, TogoWS, PSORT, TargetP, PTS1, SOSUI, TMHMM

ODBA: BioSQL, BioFetch, indexed flat files

Shell: interactive environment for rapid bioinformatics analyses

Page 8: Bonnal bosc2010 bio_ruby

Relevant New Features 1

Bio::SQL Interoperable storage of sequences -Raoul Bonnal-

require 'bio'
# also requires active_record (ORM)
# and your database adapter (MySQL, PostgreSQL, JDBC)
connection = Bio::SQL.establish_connection(
  { 'development' => { 'hostname' => your_host_name,
                       'database' => 'CoolBioSeqDB',
                       'adapter'  => 'jdbcmysql',
                       'username' => 'Raoul',
                       'password' => 'SmartPassword' } },
  'development')

# read a GenBank file and store its entries:
my_storage = Bio::SQL::Biodatabase.find(:first)
genbank = Bio::GenBank.open('dbvrl1.gb')
genbank.each_entry do |gb|
  Bio::SQL::Sequence.new(:biosequence => gb.to_biosequence, :biodatabase => my_storage)
end

# fetching an accession is easy:
Bio::SQL.fetch_accession(your_accession).to_biosequence.output(:embl)

Page 9: Bonnal bosc2010 bio_ruby

Relevant New Features 2

Bio::PhyloXML r/w by -Diana Jaunzeikare, Christian M Zmasek-

require 'bio' # needs libxml-ruby

# Create a parser
phyloxml = Bio::PhyloXML::Parser.new('example.xml')

# Consume the trees
phyloxml.each do |tree|
  puts tree.name
end

# Writing
writer = Bio::PhyloXML::Writer.new('my_tree.xml')
writer.write(tree2)

# Extract information
phyloxml = Bio::PhyloXML::Parser.new('ncbi_taxonomy_mollusca.xml')
phyloxml.each do |tree|
  tree.each_node do |node|
    print 'Scientific name: ', node.taxonomies[0].scientific_name, "\n"
  end
end

Han, M. V. and Zmasek, C. M. (2009). phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics, 10, 356.
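For readers who have not seen the format, the sketch below shows a minimal phyloXML document and collects the scientific names from it. It deliberately uses REXML instead of Bio::PhyloXML, so it is not how BioRuby parses these files. The tree content is invented sample data, the `scientific_names` helper is hypothetical, and on recent Ruby versions REXML may have to be installed as a gem.

```ruby
require 'rexml/document'

# Invented sample data: a tiny phyloXML tree with two named taxa.
PHYLOXML = <<~XML
  <phyloxml xmlns="http://www.phyloxml.org">
    <phylogeny rooted="true">
      <name>Example tree</name>
      <clade>
        <clade>
          <taxonomy><scientific_name>Octopus vulgaris</scientific_name></taxonomy>
        </clade>
        <clade>
          <taxonomy><scientific_name>Sepia officinalis</scientific_name></taxonomy>
        </clade>
      </clade>
    </phylogeny>
  </phyloxml>
XML

# Hypothetical helper: walk every element of the document and keep the
# text of each <scientific_name> element, in document order.
def scientific_names(xml)
  doc = REXML::Document.new(xml)
  names = []
  doc.root.each_recursive do |element|
    names << element.text if element.name == "scientific_name"
  end
  names
end

scientific_names(PHYLOXML).each do |name|
  puts "Scientific name: #{name}"
end
```

The nested clade elements are what the `each_node` loop on the slide iterates over; each node may carry zero or more taxonomy entries.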

Page 10: Bonnal bosc2010 bio_ruby

Relevant New Features 3

Bio::FASTQ r/w Next Generation Sequencing FASTQ -Naohisa Goto-

require 'bio'
ff_fasta = Bio::FlatFile.open('filename.fasta')
ff_qual = Bio::FlatFile.open('filename.qual')

while entry_fasta = ff_fasta.next_entry
  seq = entry_fasta.to_biosequence
  seq.quality_score_type = :phred
  seq.quality_scores = ff_qual.next_entry.data
  puts seq.output(:fastq, :title => entry_fasta.definition)
end

● Formats supported: SOLEXA, ILLUMINA

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res, 38(6), 1767-1771.
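As a rough illustration of what a FASTA plus QUAL to FASTQ conversion produces, the sketch below builds one Sanger-style FASTQ record in plain Ruby, without BioRuby. It assumes Phred scores encoded as ASCII characters with an offset of 33, the Sanger variant described by Cock et al. The helper names `phred_to_sanger` and `fastq_record` are hypothetical and are not part of the BioRuby API.

```ruby
# Plain-Ruby sketch (not BioRuby code): build one Sanger FASTQ record.
# Assumption: Phred scores are encoded as ASCII characters with offset 33.

SANGER_OFFSET = 33

# Convert an array of integer Phred scores into a quality string.
# Scores are clamped to 0..93, the range printable with offset 33.
def phred_to_sanger(scores)
  scores.map { |q| (q.clamp(0, 93) + SANGER_OFFSET).chr }.join
end

# Format a four-line FASTQ record from a title, a sequence and its scores.
def fastq_record(title, sequence, scores)
  unless sequence.length == scores.length
    raise ArgumentError, "sequence and quality lengths differ"
  end
  "@#{title}\n#{sequence}\n+\n#{phred_to_sanger(scores)}\n"
end

puts fastq_record("read1", "ACGT", [40, 30, 20, 10])
```

The Solexa and Illumina variants mentioned on the slide use different offsets and score definitions, so this sketch covers the Sanger encoding only.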

Page 11: Bonnal bosc2010 bio_ruby

Relevant New Features 4

Bio::NCBI::REST example

require 'bio'
ncbi = Bio::NCBI::REST::ESearch.new
ncbi.search("nucleotide", "tardigrada")
ncbi.count("nucleotide", "tardigrada")
ncbi.nucleotide("tardigrada")
ncbi.taxonomy("tardigrada")
ncbi.pubmed("tardigrada", "reldate" => 365)
ncbi.pubmed("mammoth mitochondrial genome")

Bio::TogoWS: entry point for PDBj, NCBI, DDBJ, EBI, KEGG

require 'bio'
t = Bio::TogoWS::REST.new
puts t.entry('genbank', 'AF237819')
puts t.search('uniprot', 'lung cancer')
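Wrappers of this kind ultimately send HTTP requests to the NCBI E-utilities. The sketch below only builds the sort of ESearch URL that such a query corresponds to, using Ruby's standard library. It does not contact NCBI and does not reproduce BioRuby's internals. The helper `esearch_url` is hypothetical; `db`, `term` and `reldate` are E-utilities parameter names.

```ruby
require 'uri'

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical helper: build an ESearch URL for a database and a search term.
# Extra options (for example "reldate" => 365) become query parameters.
def esearch_url(db, term, options = {})
  params = { "db" => db, "term" => term }.merge(options)
  "#{ESEARCH_BASE}?#{URI.encode_www_form(params)}"
end

puts esearch_url("nucleotide", "tardigrada")
puts esearch_url("pubmed", "tardigrada", "reldate" => 365)
puts esearch_url("pubmed", "mammoth mitochondrial genome")
```

Keeping URL construction separate from the network call makes the query logic easy to test offline, which fits the testing emphasis later in the talk.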

Page 12: Bonnal bosc2010 bio_ruby

BioRuby is Agile

● OpenBio* developers are the Stakeholders

● Speed up the iteration process

● Frequent meetings (mail, skype/voice chat, irc)

● Test Everything (required for new features)

– Improve quality, maintainability and guarantee portability

– Ruby Unit Testing Framework, RSpec

● GitHub

● Low barriers for new developers

● 32 forks and 100 people watching us

Agile Manifesto
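To make the "test everything" rule concrete, here is a minimal sketch of a unit test written with Ruby's Test::Unit framework. The `reverse_complement` helper is a hypothetical stand-in written for this example, not a BioRuby method, and the test-unit gem is assumed to be available.

```ruby
require 'test/unit'

# Hypothetical helper for the example: reverse-complement a DNA string.
def reverse_complement(dna)
  dna.reverse.tr("ACGTacgt", "TGCAtgca")
end

# A new feature would ship together with tests like these.
class TestReverseComplement < Test::Unit::TestCase
  def test_simple_sequence
    assert_equal("GCAT", reverse_complement("ATGC"))
  end

  def test_preserves_case
    assert_equal("gcat", reverse_complement("atgc"))
  end

  def test_empty_sequence
    assert_equal("", reverse_complement(""))
  end
end
```

Running the file with `ruby` executes the test cases automatically when the script exits.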

Page 13: Bonnal bosc2010 bio_ruby

Moving to Agile Programming

[Chart: number of tests and tutorial lines per BioRuby release (1.0.0, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0); vertical axis from 0 to 2500]

Page 14: Bonnal bosc2010 bio_ruby

Refactoring

[Chart: number of files, classes, modules and methods per BioRuby release (1.0.0, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0); vertical axis from 0 to 3500]

Page 15: Bonnal bosc2010 bio_ruby

Ongoing Work

● Semantic Web (started @ BioHackathon 2010)

● Expose data in RDF

● Consuming SPARQL end points efficiently

● Ruby 1.9.2 support of BioRuby (GSoC & OBF)

● Improved performance

● Develop an API for NeXML I/O, and RDF triples for BioRuby (GSoC & NESCent)

● Implementation of algorithm to infer gene duplications in BioRuby (GSoC & OBF)

Page 16: Bonnal bosc2010 bio_ruby

PlugIn system

● We want a BioRuby core stable on every OS

● But… we want to use experimental code ASAP

● BioRuby + BioRuby Plugin + Rails: we can have multiple applications with a unique core and specific features

– User or Application

● Suggest guidelines for the plugin namespace

● On GitHub you can find our plugins by searching for bioruby-plugin-NAME

Page 17: Bonnal bosc2010 bio_ruby

PlugIn system

The plugin system will be delivered with the next BioRuby release

BioGraphics -Jan Aerts-

For biologists:

bioruby --plugin install graphics

For geeks:

bioruby --plugin install git://github.com/user/repo.git

It’s very experimental

Page 18: Bonnal bosc2010 bio_ruby

What We Need

● Better integration with R

● Better support for data visualization (interpretation)

● Detailed Roadmap

Page 19: Bonnal bosc2010 bio_ruby

Publications

BioRuby: Bioinformatics software for the Ruby programming language (submitted)

Naohisa Goto, Pjotr Prins, Mitsuteru Nakao, Raoul Bonnal, Jan Aerts and Toshiaki Katayama

The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows (accepted)

Toshiaki Katayama et al.

Toshiaki Katayama, Mitsuteru Nakao and Toshihisa Takagi (2010)

TogoWS: integrated SOAP and REST APIs for interoperable bioinformatics Web services, Nucleic Acids Research, 2010, Vol. 38, No. suppl_2 W706-W711, doi:10.1093/nar/gkq386 (Web Server Issue 2010)

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010).

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res, 38(6), 1767-1771.

Over 24 articles use BioRuby in their analyses; check the up-to-date list:

http://bioruby.open-bio.org/wiki/Research_using_BioRuby

Page 20: Bonnal bosc2010 bio_ruby

Acknowledgments

● BioRuby Team

● Toshiaki Katayama*

● Naohisa Goto*

● Pjotr Prins*

● Mitsuteru Nakao*

● Jan Aerts*

● Christian M Zmasek*

● All GSoC students

Open Bioinformatics Foundation

Google Summer of Code

Database Center for Life Science

NESCent (National Evolutionary Synthesis Center)

* co-author