Fields bosc2010 bio_perl

BioPerl Update 2010: Towards a Modern BioPerl

BOSC 7-10-10

Chris Fields (UIUC)

Present Day BioPerl

✤ Addressing new bioinformatics problems

✤ Collaborations in Open Bioinformatics Foundation

✤ Google Summer of Code

Towards a Modern BioPerl

✤ Lowering the barrier for new users to become involved

✤ Using Modern Perl language features

✤ Dealing with the BioPerl monolith

BioPerl 2.0?

✤ BioPerl and Modern Perl OOP (Moose)

✤ BioPerl and Perl 6

Background

✤ Started in 1996, many contributors over the years

✤ Jason Stajich (UCR)

✤ Hilmar Lapp (NESCent)

✤ Heikki Lehväslaiho (KAUST)

✤ Georg Fuellen (Bielefeld)

✤ Ewan Birney (Sanger, EBI)

✤ Aaron Mackey (Univ. Virginia)

✤ Chris Dagdigian (BioTeam)

✤ Steven Brenner (UC-Berkeley)

✤ Lincoln Stein (OICR, CSHL)

✤ Ian Korf (Wash U)

✤ Chris Mungall (NCBO)

✤ Brian Osborne (BioTeam)

✤ Steve Trutane (Stanford)

✤ Sendu Bala (Sanger)

✤ Dave Messina (Sonnhammer Lab)

✤ Mark Jensen (TCGA)

✤ Rob Buels (SGN)

✤ Many, many more!

✤ Open source: ‘Released under the same license as Perl itself’ i.e. Artistic

✤ http://bioperl.org

✤ Core developers - make releases, drive the project, set vision

✤ Regular contributors - have direct commit access


BioPerl Distributions

✤ BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev version)

✤ BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools

✤ BioPerl-DB - BioSQL ORM to BioPerl classes

Biological Sequences

✤ Bio::Seq - sequence record class

#!/bin/perl -w

use Modern::Perl;
use Bio::Seq;

my $seq_obj = Bio::Seq->new(
    -seq        => "aaaatgggggggggggccccgtt",
    -display_id => "ABC12345",
    -desc       => "example 1",
    -alphabet   => "dna",
);

say $seq_obj->display_id;   # ABC12345
say $seq_obj->desc;         # example 1
say $seq_obj->seq;          # aaaatgggggggggggccccgtt

my $revcom = $seq_obj->revcom;   # new Bio::Seq, but revcom
say $revcom->seq;                # aacggggcccccccccccatttt
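For plain DNA strings, the reverse complement that revcom returns can be reproduced with core Perl alone. This is only a sketch and not how Bio::Seq implements it: `revcom_string` is a hypothetical helper, and it assumes an unambiguous a/c/g/t alphabet.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: reverse-complement a plain DNA string.
# Assumes only a/c/g/t (either case); any other character passes through unchanged.
sub revcom_string {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar assignment, so the string is reversed
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base
    return $rc;
}

print revcom_string("atgc"),  "\n";   # gcat
print revcom_string("aaccg"), "\n";   # cggtt
```

The Bio::Seq version does more than this: it returns a new sequence object and respects the declared alphabet, which a bare string cannot do.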

Sequence I/O

✤ Bio::SeqIO - sequence I/O stream classes (pluggable)

#!/usr/bin/perl -w

use Modern::Perl;
use Bio::SeqIO;

my ($infile, $outfile) = @ARGV;

my $in  = Bio::SeqIO->new(-file => $infile, -format => 'genbank');
my $out = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta');

while (my $seq_obj = $in->next_seq) {
    say $seq_obj->display_id;
    $out->write_seq($seq_obj);
}

Sequence Features

✤ Bio::SeqFeature::Generic - generic SF implementation

use Modern::Perl;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => shift, -format => 'genbank');

while (my $seq_obj = $in->next_seq) {
    for my $feat_obj ($seq_obj->get_SeqFeatures) {
        say "Primary tag: ".$feat_obj->primary_tag;
        say "Location: ".$feat_obj->location->to_FTstring;
        for my $tag ($feat_obj->get_all_tags) {
            say " tag: $tag";
            for my $value ($feat_obj->get_tag_values($tag)) {
                say " value: $value";
            }
        }
    }
}

GenBank File

     source          1..2629
                     /organism="Enterococcus faecalis OG1RF"
                     /mol_type="genomic DNA"
                     /strain="OG1RF"
                     /db_xref="taxon:474186"
     gene            25..>2629
                     /gene="pyr operon"
                     /note="pyrimidine biosynthetic operon"

Primary tag: source
Location: 1..2629
 tag: db_xref
 value: taxon:474186
 tag: mol_type
 value: genomic DNA
 tag: organism
 value: Enterococcus faecalis OG1RF
 tag: strain
 value: OG1RF

Report Parsing

Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I,
homoserine dehydrogenase I [Escherichia coli]
         (820 letters)

Database: ecoli.aa
           4289 sequences; 1,358,990 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogen...  1567  0.0
gb|AAC76922.1| (AE000468) aspartokinase II and homoserine dehydr...   332  1e-91
gb|AAC76994.1| (AE000475) aspartokinase III, lysine sensitive [E...   184  3e-47
gb|AAC73282.1| (AE000126) uridylate kinase [Escherichia coli]          42  3e-04

>gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I
           [Escherichia coli]
          Length = 820

 Score = 1567 bits (4058), Expect = 0.0
 Identities = 806/820 (98%), Positives = 806/820 (98%)

Query: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
         MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
Sbjct: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60

Report Parsing

✤ Bio::SearchIO

#!/usr/bin/perl -w

use Modern::Perl;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast', -file => 'ecoli.bls');

while( my $result = $in->next_result ) {
    while( my $hit = $result->next_hit ) {
        while( my $hsp = $hit->next_hsp ) {
            say "Query=".$result->query_name;
            say " Hit=".$hit->name;
            say " Length=".$hsp->length('total');
            say " Percent_id=".$hsp->percent_identity."\n";
        }
    }
}

Query=gi|1786183|gb|AAC73113.1|
 Hit=gb|AAC73113.1|
 Length=820
 Percent_id=98.2926829268293
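The Percent_id value in the output lines up with the "Identities = 806/820" line of the BLAST report. The arithmetic can be checked with a few lines of core Perl. This sketch assumes percent identity is simply identical positions over HSP length; it is not taken from the Bio::SearchIO source.

```perl
use strict;
use warnings;

# Values copied from the BLAST report above: Identities = 806/820.
my ($identical, $hsp_length) = (806, 820);

# Assumed definition: identical positions as a percentage of the HSP length.
my $percent_id = 100 * $identical / $hsp_length;

printf "Percent_id=%.2f\n", $percent_id;   # Percent_id=98.29
```

The report itself shows the integer "98%" rather than the unrounded value.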




Local/Remote Database Interfaces

✤ Bio::DB::GenBank

#!/bin/perl -w

use Modern::Perl;
use Bio::DB::GenBank;

my $db_obj = Bio::DB::GenBank->new;   # query NCBI nuc db
my $seq_obj = $db_obj->get_Seq_by_acc('A00002');

say $seq_obj->display_id;   # A00002
say $seq_obj->length();     # 194

✤ Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.

And Lots More!

✤ Bio::Align/IO

✤ Bio::Assembly/IO

✤ Bio::Tree/IO

✤ Local flatfile databases

✤ Bio::Graphics

✤ SeqFeature databases

✤ Bio::Pedigree/IO

✤ Bio::Coordinate/IO

✤ Bio::Map/IO

✤ Bio::Restriction/IO

✤ Bio::Structure/IO

✤ Bio::Factory

✤ Bio::Tools::Run (catch-all namespace)

✤ Bio::Factory (create objects)

✤ Bio::Range/Location

Current Development

Next-Gen Sequence

✤ Second-generation/next-generation sequencing

✤ This is Lincoln Stein

✤ There is a reason he is smiling...

✤ Bio-SamTools - support for SAM and BAM data (via SamTools)

✤ Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools)

✤ Separate CPAN distributions

✤ GBrowse (Lincoln’s talk this afternoon), BioPerl

✤ Via SeqFeatures (high-level API for both modules)

✤ Via Bio::Assembly and BioPerl-Run (using the above modules)

Next-Gen Sequence

Data Courtesy R. Khetani, M. Hudson, G. Robinson

New Tools/Wrappers

✤ BowTie

✤ BWA

✤ MAQ

✤ BEDTools (beta)

✤ SAMTools

✤ HMMER3

✤ BLAST+

✤ PAML

✤ Infernal v.1.0

✤ NCBI eUtils (SOAP, CGI-based)

✤ TopHat/CuffLinks (upcoming)

✤ The Cloud - bioperl-max

Mark Jensen, Thomas Sharpton, Dave Messina, Kai Blin, Dan Kortschak

Collaborations

SURVEY AND SUMMARY

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

Peter J. A. Cock (1,*), Christopher J. Fields (2), Naohisa Goto (3), Michael L. Heuer (4) and Peter M. Rice (5)

(1) Plant Pathology, SCRI, Invergowrie, Dundee DD2 5DA, UK
(2) Institute for Genomic Biology, 1206 W. Gregory Drive, M/C 195, University of Illinois at Urbana-Champaign, IL 61801, USA
(3) Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan
(4) Harbinger Partners, Inc., 855 Village Center Drive, Suite 356, St. Paul, MN 55127, USA
(5) EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received October 13, 2009; Revised November 13, 2009; Accepted November 17, 2009

Published online 16 December 2009. Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771. doi:10.1093/nar/gkp1137

ABSTRACT

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score, despite lacking any formal definition to date, and existing in at least three incompatible variants. This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS. Being an open access publication, it is hoped that this description, with the example files provided as Supplementary Data, will serve in future as a reference for this important file format.

INTRODUCTION

One of the core issues of Bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some ad hoc simple human readable formats have over time attained the status of de facto standards. A ubiquitous example of this is the ‘FASTA sequence file format’, originally invented by Bill Pearson as an input format for his FASTA suite of tools (1). Over time, this format has evolved by consensus; however, in the absence of an explicit standard some parsers will fail to cope with very long ‘>’ title lines or very long sequences without line wrapping. There is also no standardization for record identifiers.

In the area of DNA sequencing, the FASTQ file format has emerged as another de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score associated with each nucleotide in a sequence. This is a very minimal representation of a sequencing read: nothing about the relative levels of the four nucleotides is captured [e.g. from Sanger capillary sequencing or Solexa/Illumina sequencing (2)] nor did this in any way attempt to deal with flow or colour space data [e.g. Roche 454 (3) or ABI SOLiD (4)].

No doubt because of its simplicity, the FASTQ format has become widely used as a simple interchange file format. Unfortunately, history has repeated itself, and the FASTQ format suffers from the absence of a clear definition (which we hope this manuscript will address), and several incompatible variants.

We discuss the history of the FASTQ format, describing key variants, and conventions adopted by the Open Bioinformatics Foundation (OBF, http://www.open-bio.org) projects Biopython (5), BioPerl (6), BioRuby (http://www.bioruby.org), BioJava (7), and EMBOSS (8) (each represented here by an author) for reading, writing and converting between them. This is intended to provide a public, open access and citable definition of this community consensus of the FASTQ format specification.

*To whom correspondence should be addressed. Tel: +44 1382 562731; Fax: +44 1382 562426

© The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
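To make the "conversion between them" concrete, here is a minimal core-Perl sketch of re-encoding quality strings. The ASCII offsets are not stated on the slide and are an assumption drawn from the published FASTQ conventions: Sanger stores PHRED + 33 and Illumina 1.3+ stores PHRED + 64. The helper names are hypothetical and are not the Bio::SeqIO API. The older Solexa variant uses a different score scale and is not handled here.

```perl
use strict;
use warnings;

# Hypothetical helpers, not BioPerl API.
# Assumed offsets: Sanger FASTQ = PHRED + 33, Illumina 1.3+ FASTQ = PHRED + 64.
my %OFFSET = (sanger => 33, illumina => 64);

# Decode a quality string into a list of PHRED scores.
sub phred_scores {
    my ($qual, $variant) = @_;
    my $offset = $OFFSET{$variant} // die "unknown variant: $variant";
    return map { ord($_) - $offset } split //, $qual;
}

# Re-encode an Illumina 1.3+ quality string in the Sanger encoding.
sub illumina_to_sanger {
    my ($qual) = @_;
    return join '', map { chr($_ + $OFFSET{sanger}) } phred_scores($qual, 'illumina');
}

print join(',', phred_scores('@Bh', 'illumina')), "\n";   # 0,2,40
print illumina_to_sanger('@Bh'), "\n";                    # !#I
```

Because the same character means different scores under each offset, a reader has to be told which variant a file uses; it cannot always be inferred from the data.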

The Google Summer of Code

✤ O|B|F was accepted this year for the first time

✤ Headed by Rob Buels (SGN), with some help from Hilmar Lapp and myself

✤ Six projects, covering BioPerl, BioJava, Biopython, BioRuby

The Google Summer of Code

✤ BioPerl has actually been part of the Google Summer of Code for the last three years (as have many other Bio*):

✤ NESCent - admin: H. Lapp:

✤ 2008 - PhyloXML parsing (student: Mira Han)

✤ 2009 - NeXML parsing (student: Chase Miller)

✤ O|B|F - admin: R. Buels:

✤ 2010 - Alignment subsystem refactoring (student: Jun Yin)

GSoC - Alignment Subsystem

✤ Clean up current code

✤ Include capability of dealing with large datasets

✤ Target next-gen data, very large alignments?

✤ Abstract the backend (DB, memory, etc.)

✤ SAM/BAM may work (via Bio::DB::SAM)

✤ ...but what about protein sequences?


✤ BioPerl will be turning 15 soon

✤ What can we improve?

✤ What can we do with the current code?

✤ Maybe some of it can be reused in a BioPerl 2.0?

✤ Or a BioPerl 6?

What We Can Do Now

✤ Lower the barrier

✤ Use Modern Perl

✤ Deal with the monolith

Lower the Barrier

✤ We have already started on this - May 2010

✤ Migrate source code repository to git and GitHub

✤ Original BioPerl developers are added as collaborators on GitHub...

✤ ...but now anyone can ‘fork’ BioPerl, make changes, submit ‘pull requests’, etc.

✤ Since May, we have had many forks and pull requests with code reviews (so a decent success)

Using Modern Perl

✤ Minimal version of Perl required for BioPerl is v5.6.1

✤ Even v5.8.1 is considered quite old

✤ Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)

say

print "I like newlines\n";

say "I like newlines";

yada yada

sub implement_me { shift->throw_not_implemented }

sub implement_me { ... } # yada yada

defined-or

# ||= assigns whenever $foo is false, even if it is defined (0, '')
$foo ||= 'default';

# assigns only if $foo is undefined
if (!defined($foo)) { $foo = 'default' }

$foo //= 'default';
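The difference between `||` and `//` only shows up for values that are defined but false. A short runnable sketch follows; the `%opt` hash and its keys are made up for illustration, and both `say` and `//` need Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'say';   # say needs Perl 5.10 or later, as does //

my %opt = (count => 0, name => undef);   # hypothetical option values

my $count_or  = $opt{count} || 10;            # 10: || discards the defined-but-false 0
my $count_dor = $opt{count} // 10;            # 0: // keeps any defined value
my $name_dor  = $opt{name}  // 'anonymous';   # 'anonymous': undef falls through

say "count via ||: $count_or";
say "count via //: $count_dor";
say "name via //: $name_dor";
```

This matters for BioPerl-style arguments where 0 is legitimate, such as a strand of 0 or a start coordinate of 0.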

Using Modern Perl

Smart Match

if ($key ~~ %hash) { # like exists
    # do something
}

if ($foo ~~ /\d+/) { # like =~
    # do something
}

given/when

given ($foo) {
    when (%lookup)      { ... }
    when (/^(\d+)/)     { ... }
    when (/^[A-Za-z]+/) { ... }
    default             { ... }
}


Dealing with the Monolith

✤ Release manager nightmares:

✤ Remote databases disappear (XEMBL)

✤ Others change service or URLs (SeqHound)

✤ Services become obsolete (Pise)

✤ Developers move on, disappear, modules bit-rot (not saying :)

✤ How do we solve this problem?


Distribution           Classes   Tests (Files)
bioperl-live (Core)    874       23146 (341)
bioperl-run            123*      2468 (80)
bioperl-db             72        113 (16)
bioperl-network        9         327 (9)

* Had 285 more prior to Pise module removal!


✤ Maybe we shouldn’t be friendly to the monolith

✤ Maybe we should ‘blow it up’

✤ (Of course, that means make the code modular)

✤ It was originally designed with that somewhat in mind (interfaces)


✤ Separate distributions make it easier to submit fixes as needed

✤ However, separate distributions make developing a little trickier

✤ Can we create a distribution that resembles BioPerl as users know it?

✤ Is this something we should worry about?

✤ YES

✤ Don’t alienate end-users!

Towards BioPerl 2.0?

✤ Biome: BioPerl with Moose

✤ BioPerl6: self-explanatory

Biome

✤ BioPerl classes implemented in Moose

✤ GitHub: http://github.com/cjfields/biome

✤ Implemented: Ranges, Locations, simple PrimarySeq, Annotation, SeqFeatures, prototype SeqIO

✤ Interfaces converted to Moose Roles

✤ ‘Type’-checking used for data types


Role

package Biome::Role::Range;

use Biome::Role;
use Biome::Types qw(SequenceStrand);

requires 'to_string';

# Attributes
has strand => (
    isa     => SequenceStrand,
    is      => 'rw',
    default => 0,
    coerce  => 1,
);

has start => (
    is  => 'rw',
    isa => 'Int',
);

has end => (
    is  => 'rw',
    isa => 'Int',
);

sub length {
    $_[0]->end - $_[0]->start + 1;
}

Class

package Biome::Range;

use Biome;

with 'Biome::Role::Range';

sub to_string {
    my ($self) = @_;
    return sprintf("(%s, %s) strand=%s",
        $self->start, $self->end, $self->strand);
}

BioPerl 6

✤ BioPerl6: http://github.com/cjfields/bioperl6

✤ Little has been done beyond simple implementations

✤ Code is open to anyone for experimentation

✤ Ex: Philip Mabon donated a FASTA grammar:


Grammar (FASTA)

grammar Bio::Grammar::Fasta {
    token TOP { ^<fasta>+ $ }
    token fasta { <description_line> <sequence> }
    token description_line { ^^\> <id> <.ws> <description> \n }
    token id { | <identifier> | <generic_id> }
    token identifier { \S+ }
    token generic_id { \S+ }
    token description { \N+ }
    token sequence { <-[>]>+ }
}

Actions (FASTA)

class Bio::Grammar::Actions::Fasta {
    method TOP($/){
        my @matches = gather for $/<fasta> -> $m {
            take $m.ast;
        };
        make @matches;
    }
    method fasta($/){
        my $id = $/<description_line>.ast<id>;
        my $desc = $/<description_line>.ast<description>;
        my $obj = Bio::PrimarySeq.new(
            display_id => $id,
            description => $desc,
            seq => $/<sequence>.ast);
        make $obj;
    }
    method description_line($/){ make $/; }
    method id($/) { make $/; }
    method description($/){ make $/; }
    method sequence($/){ make (~$/).subst("\n", '', :g); }
}


Acknowledgements

✤ All BioPerl developers

✤ Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus)

✤ Cross-Collaborative work: Peter Cock (Biopython), Pjotr Prins (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and Andreas Prlic (BioJava), Peter Rice (EMBOSS)

✤ Questions? Do we even have time?
