BioPerl Update 2010: Towards a Modern BioPerl BOSC 7-10-10 Chris Fields (UIUC)

Page 1: Fields bosc2010 bio_perl

BioPerl Update 2010: Towards a Modern BioPerl
BOSC 7-10-10
Chris Fields (UIUC)

Page 2: Fields bosc2010 bio_perl

Present Day BioPerl

✤ Addressing new bioinformatics problems

✤ Collaborations in Open Bioinformatics Foundation

✤ Google Summer of Code

Page 3: Fields bosc2010 bio_perl

Towards a Modern BioPerl

✤ Lowering the barrier for new users to become involved

✤ Using Modern Perl language features

✤ Dealing with the BioPerl monolith

Page 4: Fields bosc2010 bio_perl

BioPerl 2.0?

✤ BioPerl and Modern Perl OOP (Moose)

✤ BioPerl and Perl 6

Page 5: Fields bosc2010 bio_perl

Background

✤ Started in 1996, many contributors over the years

✤ Jason Stajich (UCR)

✤ Hilmar Lapp (NESCent)

✤ Heikki Lehväslaiho (KAUST)

✤ Georg Fuellen (Bielefeld)

✤ Ewan Birney (Sanger, EBI)

✤ Aaron Mackey (Univ. Virginia)

✤ Chris Dagdigian (BioTeam)

✤ Steven Brenner (UC-Berkeley)

✤ Lincoln Stein (OICR, CSHL)

✤ Ian Korf (Wash U)

✤ Chris Mungall (NCBO)

✤ Brian Osborne (BioTeam)

✤ Steve Trutane (Stanford)

✤ Sendu Bala (Sanger)

✤ Dave Messina (Sonnhammer Lab)

✤ Mark Jensen (TCGA)

✤ Rob Buels (SGN)

✤ Many, many more!

Page 6: Fields bosc2010 bio_perl

Background

✤ Open source: ‘Released under the same license as Perl itself’, i.e. Artistic

✤ http://bioperl.org

✤ Core developers - make releases, drive the project, set vision

✤ Regular contributors - have direct commit access

Page 7: Fields bosc2010 bio_perl

BioPerl Distributions

✤ BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev version)

✤ BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools

✤ BioPerl-DB - BioSQL ORM to BioPerl classes

Page 8: Fields bosc2010 bio_perl

Biological Sequences

✤ Bio::Seq - sequence record class

    #!/usr/bin/perl -w

    use Modern::Perl;
    use Bio::Seq;

    my $seq_obj = Bio::Seq->new(-seq        => "aaaatgggggggggggccccgtt",
                                -display_id => "ABC12345",
                                -desc       => "example 1",
                                -alphabet   => "dna");

    say $seq_obj->display_id;   # ABC12345
    say $seq_obj->desc;         # example 1
    say $seq_obj->seq;          # aaaatgggggggggggccccgtt

    my $revcom = $seq_obj->revcom;   # new Bio::Seq, but revcom
    say $revcom->seq;                # aacggggcccccccccccatttt
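What revcom computes can be sketched in core Perl. This is a minimal illustration, not BioPerl's implementation: the helper name reverse_complement is hypothetical, and it assumes a plain A/C/G/T string, ignoring IUPAC ambiguity codes and alphabet checks.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: reverse-complement a plain DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base, preserving case
    return $rc;
}

say reverse_complement('AAGC');                      # GCTT
say reverse_complement('aaaatgggggggggggccccgtt');   # compare with the revcom output above
```

Unlike the Bio::Seq method, this returns a bare string rather than a new sequence object.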

Page 9: Fields bosc2010 bio_perl

Sequence I/O

✤ Bio::SeqIO - sequence I/O stream classes (pluggable)

    #!/usr/bin/perl -w

    use Modern::Perl;
    use Bio::SeqIO;

    my ($infile, $outfile) = @ARGV;

    my $in  = Bio::SeqIO->new(-file => $infile,      -format => 'genbank');
    my $out = Bio::SeqIO->new(-file => ">$outfile",  -format => 'fasta');

    while (my $seq_obj = $in->next_seq) {
        say $seq_obj->display_id;
        $out->write_seq($seq_obj);
    }

Page 10: Fields bosc2010 bio_perl

Sequence Features

✤ Bio::SeqFeature::Generic - generic SF implementation

    use Modern::Perl;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-file => shift, -format => 'genbank');

    while (my $seq_obj = $in->next_seq) {
        for my $feat_obj ($seq_obj->get_SeqFeatures) {
            say "Primary tag: ".$feat_obj->primary_tag;
            say "Location: ".$feat_obj->location->to_FTstring;
            for my $tag ($feat_obj->get_all_tags) {
                say " tag: $tag";
                for my $value ($feat_obj->get_tag_values($tag)) {
                    say " value: $value";
                }
            }
        }
    }

GenBank File

    source          1..2629
                    /organism="Enterococcus faecalis OG1RF"
                    /mol_type="genomic DNA"
                    /strain="OG1RF"
                    /db_xref="taxon:474186"
    gene            25..>2629
                    /gene="pyr operon"
                    /note="pyrimidine biosynthetic operon"

    Primary tag: source
    Location: 1..2629
     tag: db_xref
     value: taxon:474186
     tag: mol_type
     value: genomic DNA
     tag: organism
     value: Enterococcus faecalis OG1RF
     tag: strain
     value: OG1RF

Page 13: Fields bosc2010 bio_perl

Report Parsing

    Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I,
           homoserine dehydrogenase I [Escherichia coli]
             (820 letters)

    Database: ecoli.aa
               4289 sequences; 1,358,990 total letters

    Searching..................................................done

                                                                         Score    E
    Sequences producing significant alignments:                          (bits) Value

    gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogen...  1567   0.0
    gb|AAC76922.1| (AE000468) aspartokinase II and homoserine dehydr...   332   1e-91
    gb|AAC76994.1| (AE000475) aspartokinase III, lysine sensitive [E...   184   3e-47
    gb|AAC73282.1| (AE000126) uridylate kinase [Escherichia coli]          42   3e-04

    >gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I
               [Escherichia coli]
               Length = 820

     Score = 1567 bits (4058), Expect = 0.0
     Identities = 806/820 (98%), Positives = 806/820 (98%)

    Query: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
               MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
    Sbjct: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60

Page 14: Fields bosc2010 bio_perl

Report Parsing

✤ Bio::SearchIO

    #!/usr/bin/perl -w

    use Modern::Perl;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'ecoli.bls');

    while ( my $result = $in->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                say "Query=".$result->query_name;
                say " Hit=".$hit->name;
                say " Length=".$hsp->length('total');
                say " Percent_id=".$hsp->percent_identity."\n";
            }
        }
    }

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73113.1|
     Length=820
     Percent_id=98.2926829268293

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76922.1|
     Length=821
     Percent_id=29.5980511571255

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76994.1|
     Length=471
     Percent_id=30.1486199575372

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73282.1|
     Length=97
     Percent_id=28.8659793814433
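The Percent_id reported for the first hit is consistent with the "Identities = 806/820" line of the BLAST report: identical positions divided by alignment length, times 100. A small sketch of that arithmetic follows; the helper name is hypothetical and this is not a claim about how Bio::SearchIO computes the value internally.

```perl
use strict;
use warnings;

# Hypothetical helper: percent identity from identical and aligned position counts.
sub percent_identity {
    my ($identical, $aligned) = @_;
    die "alignment length must be positive\n" unless $aligned > 0;
    return 100 * $identical / $aligned;
}

printf "%.2f\n", percent_identity(806, 820);   # 98.29
```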

Page 15: Fields bosc2010 bio_perl

Local/Remote Database Interfaces

✤ Bio::DB::GenBank

    #!/usr/bin/perl -w

    use Modern::Perl;
    use Bio::DB::GenBank;

    my $db_obj = Bio::DB::GenBank->new;   # query NCBI nuc db
    my $seq_obj = $db_obj->get_Seq_by_acc('A00002');

    say $seq_obj->display_id;   # A00002
    say $seq_obj->length();     # 194

✤ Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.

Page 16: Fields bosc2010 bio_perl

And Lots More!

✤ Bio::Align/IO

✤ Bio::Assembly/IO

✤ Bio::Tree/IO

✤ Local flatfile databases

✤ Bio::Graphics

✤ SeqFeature databases

✤ Bio::Pedigree/IO

✤ Bio::Coordinate/IO

✤ Bio::Map/IO

✤ Bio::Restriction/IO

✤ Bio::Structure/IO

✤ Bio::Tools::Run (catch-all namespace)

✤ Bio::Factory (create objects)

✤ Bio::Range/Location

Page 17: Fields bosc2010 bio_perl

Current Development

Page 18: Fields bosc2010 bio_perl

Next-Gen Sequence

✤ Second-generation/next-generation sequencing

✤ This is Lincoln Stein

✤ There is a reason he is smiling...

Page 19: Fields bosc2010 bio_perl

Next-Gen Sequence

✤ Bio-SamTools - support for SAM and BAM data (via SamTools)

✤ Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools)

✤ Separate CPAN distributions

✤ GBrowse (Lincoln’s talk this afternoon), BioPerl

✤ Via SeqFeatures (high-level API for both modules)

✤ Via Bio::Assembly and BioPerl-Run (using the above modules)

Page 20: Fields bosc2010 bio_perl

Data Courtesy R. Khetani, M. Hudson, G. Robinson

Page 21: Fields bosc2010 bio_perl

New Tools/Wrappers

✤ BowTie

✤ BWA

✤ MAQ

✤ BEDTools (beta)

✤ SAMTools

✤ HMMER3

✤ BLAST+

✤ PAML

✤ Infernal v.1.0

✤ NCBI eUtils (SOAP, CGI-based)

✤ TopHat/CuffLinks (upcoming)

✤ The Cloud - bioperl-max

Mark Jensen, Thomas Sharpton, Dave Messina, Kai Blin, Dan Kortschak

Page 22: Fields bosc2010 bio_perl

Collaborations

SURVEY AND SUMMARY

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

Peter J. A. Cock (1,*), Christopher J. Fields (2), Naohisa Goto (3), Michael L. Heuer (4) and Peter M. Rice (5)

(1) Plant Pathology, SCRI, Invergowrie, Dundee DD2 5DA, UK
(2) Institute for Genomic Biology, 1206 W. Gregory Drive, M/C 195, University of Illinois at Urbana-Champaign, IL 61801, USA
(3) Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan
(4) Harbinger Partners, Inc., 855 Village Center Drive, Suite 356, St. Paul, MN 55127, USA
(5) EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received October 13, 2009; Revised November 13, 2009; Accepted November 17, 2009

ABSTRACT

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score, despite lacking any formal definition to date, and existing in at least three incompatible variants. This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS. Being an open access publication, it is hoped that this description, with the example files provided as Supplementary Data, will serve in future as a reference for this important file format.

INTRODUCTION

One of the core issues of Bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some ad hoc simple human readable formats have over time attained the status of de facto standards. A ubiquitous example of this is the ‘FASTA sequence file format’, originally invented by Bill Pearson as an input format for his FASTA suite of tools (1). Over time, this format has evolved by consensus; however, in the absence of an explicit standard some parsers will fail to cope with very long ‘>’ title lines or very long sequences without line wrapping. There is also no standardization for record identifiers.

In the area of DNA sequencing, the FASTQ file format has emerged as another de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score associated with each nucleotide in a sequence. This is a very minimal representation of a sequencing read - nothing about the relative levels of the four nucleotides is captured [e.g. from Sanger capillary sequencing or Solexa/Illumina sequencing (2)] nor did this in any way attempt to deal with flow or colour space data [e.g. Roche 454 (3) or ABI SOLiD (4)].

No doubt because of its simplicity, the FASTQ format has become widely used as a simple interchange file format. Unfortunately, history has repeated itself, and the FASTQ format suffers from the absence of a clear definition (which we hope this manuscript will address), and several incompatible variants.

We discuss the history of the FASTQ format, describing key variants, and conventions adopted by the Open Bioinformatics Foundation (OBF, http://www.open-bio.org) projects Biopython (5), BioPerl (6), BioRuby (http://www.bioruby.org), BioJava (7), and EMBOSS (8) (each represented here by an author) for reading, writing and converting between them. This is intended to provide a public, open access and citable definition of this community consensus of the FASTQ format specification.

* To whom correspondence should be addressed.

Published online 16 December 2009. Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771. doi:10.1093/nar/gkp1137

© The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
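The variants the paper covers differ, among other things, in how quality scores are encoded as characters. A minimal sketch of decoding a quality string, assuming the Sanger convention of PHRED scores stored with an ASCII offset of 33 (the Illumina 1.3+ variant uses an offset of 64 instead). The function name and sample strings are made up for illustration, and the older Solexa score scale is not handled.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: convert a FASTQ quality string to numeric scores.
# $offset is 33 for Sanger FASTQ (the default here), 64 for Illumina 1.3+.
sub decode_quality {
    my ($qual, $offset) = @_;
    $offset //= 33;
    return map { ord($_) - $offset } split //, $qual;
}

say join ' ', decode_quality('I5!');       # 40 20 0
say join ' ', decode_quality('h@', 64);    # 40 0
```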

Page 23: Fields bosc2010 bio_perl

The Google Summer of Code

✤ O|B|F was accepted this year for the first time

✤ Headed by Rob Buels (SGN), with some help from Hilmar Lapp and myself

✤ Six projects, covering BioPerl, BioJava, Biopython, BioRuby

Page 24: Fields bosc2010 bio_perl

The Google Summer of Code

✤ BioPerl has actually been part of the Google Summer of Code for the last three years (as have many other Bio*):

✤ NESCent - admin: H. Lapp:

✤ 2008 - PhyloXML parsing (student: Mira Han)

✤ 2009 - NeXML parsing (student: Chase Miller)

✤ O|B|F - admin: R. Buels:

✤ 2010 - Alignment subsystem refactoring (student: Jun Yin)

Page 25: Fields bosc2010 bio_perl

GSoC - Alignment Subsystem

✤ Clean up current code

✤ Include capability of dealing with large datasets

✤ Target next-gen data, very large alignments?

✤ Abstract the backend (DB, memory, etc.)

✤ SAM/BAM may work (via Bio::DB::SAM)

✤ ...but what about protein sequences?

Page 26: Fields bosc2010 bio_perl

Towards a Modern BioPerl

Page 27: Fields bosc2010 bio_perl

Towards a Modern BioPerl

✤ BioPerl will be turning 15 soon

✤ What can we improve?

✤ What can we do with the current code?

✤ Maybe some of it can be reused in a BioPerl 2.0?

✤ Or a BioPerl 6?

Page 28: Fields bosc2010 bio_perl

What We Can Do Now

✤ Lower the barrier

✤ Use Modern Perl

✤ Deal with the monolith

Page 29: Fields bosc2010 bio_perl

Lower the Barrier

✤ We have already started on this - May 2010

✤ Migrate source code repository to git and GitHub

✤ Original BioPerl developers are added as collaborators on GitHub...

✤ ...but now anyone can ‘fork’ BioPerl, make changes, submit ‘pull requests’, etc.

✤ Since May, we have had many forks and pull requests with code reviews (so a decent success)

Page 30: Fields bosc2010 bio_perl

Using Modern Perl

✤ Minimal version of Perl required for BioPerl is v5.6.1

✤ Even v5.8.1 is considered quite old

✤ Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)

Page 32: Fields bosc2010 bio_perl

Using Modern Perl

say

    print "I like newlines\n";

    say "I like newlines";

yada yada

    sub implement_me { shift->throw_not_implemented }

    sub implement_me { ... } # yada yada

defined-or

    # ||= assigns whenever $foo is false, so a defined but false value (0, '') is overwritten
    $foo ||= 'default';

    if (!defined($foo)) { $foo = 'default' }

    $foo //= 'default';
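The difference between the two assignment operators is easiest to see side by side. A short sketch, assuming Perl 5.10 or later; the helper names with_or and with_defined_or are made up for this example.

```perl
use strict;
use warnings;
use feature 'say';

# ||= replaces any false value, including a legitimate 0 or empty string.
sub with_or {
    my ($value) = @_;
    $value ||= 'default';
    return $value;
}

# //= (Perl 5.10+) replaces only undef.
sub with_defined_or {
    my ($value) = @_;
    $value //= 'default';
    return $value;
}

for my $input (0, '', undef, 'dna') {
    my $shown = defined $input ? "'$input'" : 'undef';
    say "$shown: ||= gives '", with_or($input),
        "', //= gives '", with_defined_or($input), "'";
}
```

For 0 and the empty string the two columns differ; for undef and 'dna' they agree.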

Page 33: Fields bosc2010 bio_perl

Using Modern Perl

Smart Match

    if ($key ~~ %hash) {   # like exists
        # do something
    }

    if ($foo ~~ /\d+/) {   # like =~
        # do something
    }

given/when

    given ($foo) {
        when (%lookup)      { ... }
        when (/^(\d+)/)     { ... }
        when (/^[A-Za-z]+/) { ... }
        default             { ... }
    }

Page 34: Fields bosc2010 bio_perl

Dealing with the Monolith

✤ Release manager nightmares:

✤ Remote databases disappear (XEMBL)

✤ Others change service or URLs (SeqHound)

✤ Services become obsolete (Pise)

✤ Developers move on, disappear, modules bit-rot (not saying :)

✤ How do we solve this problem?

Page 35: Fields bosc2010 bio_perl

Dealing with the Monolith

                       Classes   Tests (Files)
bioperl-live (Core)    874       23146 (341)
bioperl-run            123*      2468 (80)
bioperl-db             72        113 (16)
bioperl-network        9         327 (9)

* Had 285 more prior to Pise module removal!

Page 36: Fields bosc2010 bio_perl

Dealing with the Monolith

✤ Maybe we shouldn’t be friendly to the monolith

✤ Maybe we should ‘blow it up’

✤ (Of course, that means make the code modular)

✤ It was originally designed with that somewhat in mind (interfaces)

Page 37: Fields bosc2010 bio_perl

Dealing with the Monolith

✤ Separate distributions make it easier to submit fixes as needed

✤ However, separate distributions make developing a little trickier

✤ Can we create a distribution that resembles BioPerl as users know it?

✤ Is this something we should worry about?

✤ YES

✤ Don’t alienate end-users!

Page 38: Fields bosc2010 bio_perl

Towards BioPerl 2.0?

✤ Biome: BioPerl with Moose

✤ BioPerl6: self-explanatory

Page 39: Fields bosc2010 bio_perl

Biome

✤ BioPerl classes implemented in Moose

✤ GitHub: http://github.com/cjfields/biome

✤ Implemented: Ranges, Locations, simple PrimarySeq, Annotation, SeqFeatures, prototype SeqIO

✤ Interfaces converted to Moose Roles

✤ ‘Type’-checking used for data types

Page 40: Fields bosc2010 bio_perl

Role

    package Biome::Role::Range;

    use Biome::Role;
    use Biome::Types qw(SequenceStrand);

    requires 'to_string';

    # Attributes
    has strand => (
        isa     => SequenceStrand,
        is      => 'rw',
        default => 0,
        coerce  => 1
    );

    has start => (
        is  => 'rw',
        isa => 'Int',
    );

    has end => (
        is  => 'rw',
        isa => 'Int'
    );

    sub length {
        $_[0]->end - $_[0]->start + 1;
    }

Class

    package Biome::Range;

    use Biome;

    with 'Biome::Role::Range';

    sub to_string {
        my ($self) = @_;
        return sprintf("(%s, %s) strand=%s",
            $self->start, $self->end, $self->strand);
    }
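For readers without Moose at hand, the behavior of the range class can be approximated in plain Perl 5. This is only a stand-in sketch: PlainRange is a hypothetical name, it uses a blessed hash instead of Moose attributes and roles, and it does no type checking or strand coercion.

```perl
use strict;
use warnings;
use feature 'say';

{
    # Hypothetical plain-Perl stand-in for Biome::Range (no Moose, no type checks).
    package PlainRange;

    sub new {
        my ($class, %args) = @_;
        my $self = {
            start  => $args{start},
            end    => $args{end},
            strand => $args{strand} // 0,   # same default as the strand attribute
        };
        return bless $self, $class;
    }

    sub start  { $_[0]{start} }
    sub end    { $_[0]{end} }
    sub strand { $_[0]{strand} }

    # Coordinates are inclusive on both ends, hence the + 1.
    sub length { $_[0]->end - $_[0]->start + 1 }

    sub to_string {
        my ($self) = @_;
        return sprintf("(%s, %s) strand=%s",
            $self->start, $self->end, $self->strand);
    }
}

my $range = PlainRange->new(start => 10, end => 20, strand => 1);
say $range->length;      # 11
say $range->to_string;   # (10, 20) strand=1
```

What the Moose version adds is the part this sketch leaves out: declared attribute types, coercion, and a role that forces consumers to supply to_string.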

Page 41: Fields bosc2010 bio_perl

BioPerl 6

✤ BioPerl6: http://github.com/cjfields/bioperl6

✤ Little has been done beyond simple implementations

✤ Code is open to anyone for experimentation

✤ Ex: Philip Mabon donated a FASTA grammar:

Page 42: Fields bosc2010 bio_perl

Grammar (FASTA)

    grammar Bio::Grammar::Fasta {
        token TOP { ^<fasta>+ $ }
        token fasta { <description_line> <sequence> }
        token description_line { ^^\> <id> <.ws> <description> \n }
        token id { | <identifier> | <generic_id> }
        token identifier { \S+ }
        token generic_id { \S+ }
        token description { \N+ }
        token sequence { <-[>]>+ }
    }

Page 43: Fields bosc2010 bio_perl

Actions (FASTA)

    class Bio::Grammar::Actions::Fasta {
        method TOP($/){
            my @matches = gather for $/<fasta> -> $m {
                take $m.ast;
            };
            make @matches;
        }
        method fasta($/){
            my $id = $/<description_line>.ast<id>;
            my $desc = $/<description_line>.ast<description>;
            my $obj = Bio::PrimarySeq.new(
                display_id => $id, description => $desc,
                seq => $/<sequence>.ast);
            make $obj;
        }
        method description_line($/){ make $/; }
        method id($/) { make $/; }
        method description($/){ make $/; }
        method sequence($/){ make (~$/).subst("\n", '', :g); }
    }
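For comparison, roughly the same record structure can be matched with a Perl 5 regular expression. This is only a sketch: parse_fasta and the hash layout are made up for illustration, records come back as plain hashes rather than Bio::PrimarySeq objects, and unlike the grammar it tolerates an empty description.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split FASTA text into records.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    # '>' id, optional horizontal whitespace, description, newline,
    # then everything up to the next '>' as the sequence.
    while ($text =~ /^>(\S+)[^\S\n]*([^\n]*)\n([^>]*)/mg) {
        my ($id, $desc, $seq) = ($1, $2, $3);
        $seq =~ s/\s+//g;   # drop newlines, like the 'sequence' action
        push @records, { id => $id, description => $desc, seq => $seq };
    }
    return @records;
}

my $fasta = ">seq1 first example\nACGT\nTTGA\n>seq2 second\nGGCC\n";
for my $record (parse_fasta($fasta)) {
    say "$record->{id} [$record->{description}] $record->{seq}";
}
```

The grammar keeps the parsing rules and the object construction in separate pieces (grammar and actions), whereas the regular expression version mixes both in one loop.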

Page 50: Fields bosc2010 bio_perl

Acknowledgements

✤ All BioPerl developers

✤ Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus)

✤ Cross-Collaborative work: Peter Cock (Biopython), Pjotr Prins (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and Andreas Prlic (BioJava), Peter Rice (EMBOSS)

✤ Questions? Do we even have time?