BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

BioPerl

Based on a presentation by Manish Anand/Jonathan Nowacki/

Ravi Bhatt/Arvind Gopu

Introduction

Objective of BioPerl: Develop reusable, extensible core Perl modules for

use as a standard for manipulating molecular biological data.

Background: Started in 1995 One of the oldest open source Bioinformatics

Toolkit Project

So what is BioPerl?

Higher level of abstraction

Re-usable collection of Perl modules that facilitate bioinformatics application development: Accessing databases with different formats Sequence manipulation Execution and Parsing of the results of molecular

biology programs

Catch? BioPerl does not include programs like Blast, ClustalW, etc Uses system calls to execute external programs

So what is BioPerl? (continued…)

551 modules (incl. 82 interface modules)37 module groups79,582 lines of code (223,310 lines total)144 lines of code per moduleFor More info: BioPerl Module Listing

Major Areas covered in Bioperl

Sequences, features, annotations,Pairwise alignment reportsMultiple sequence alignmentsBibliographic dataGraphical rendering of sequence tracksDatabase for features and sequences

Additional things

Gene prediction parsersTrees, parsing phylogenetic and

molecular evolution software outputPopulation genetic data and summary

statisticsTaxonomyProtein Structure

Downloading modules

Modules can be obtained from: www.CPAN.org (Perl Modules) www.BioPerl.org (BioPerl Modules)

Downloading modules from CPAN Interactive mode

perl -MCPAN -e shell

Batch mode use CPAN; clean, install, make, recompile, test

Directory Structure

BioPerl directory structure organization: Bio/ BioPerl modules models/ UML for BioPerl classes t/ Perl built-in tests t/data/ Data files used for the tests scripts/ Reusable scripts that use BioPerl scripts/contributed/ Contributed scripts not

necessarily integrated into BioPerl. doc/ "How To" files and the FAQ as XML

Parsing Sequences

Bio::SeqIOmultiple drivers: genbank, embl, fasta,...

Sequence objectsBio::PrimarySeqBio::SeqBio::Seq::RichSeq

Sequence Object Creation

Sequence Creation :$sequence = Bio::Seq->new( -seq => ‘AATGCAA’

-display_id => ‘my_sequence’);

Flat File Format Support :

Raw, FASTA, GCG, GenBank, EMBL, PIR

Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP

Sequence object

Common (Bio::PrimarySeq) methodsseq() - get the sequence as a string length() - get the sequence lengthsubseq($s,$e) - get a subseqeunce translate(...) - translate to protein [DNA] revcom() - reverse complement [DNA]display_id() - identifier stringdescription() - description string

Sequence Types

Different Sequence Objects:

Seq – Some annotations RichSeq – Additional annotations PrimarySeq – Bare minimum annotation

( id, accession number, alphabet) LocatableSeq – Start, stop and gap information also LargeSeq – Very long sequences LiveSeq – Newly sequenced genomes

Using a sequence

use Bio::PrimarySeq;my $str = “ATGAATGATGAA”;my $seq = Bio::PrimarySeq->new(-seq => $str,

-display_id=>”example”);print “id is “, $seq->display_id,”\n”;print $seq->seq, “\n”;my $revcom = $seq->revcom;print $revcom->seq, “\n”;print “frame1=”,$seq->translate->seq,“\n”;

id is exampleATGAATGATGAATTCATCATTCATtrans frame1=MNDE

Accessing remote databases

$gb = new Bio::DB::GenBank();

$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

$seq2 = $gb->get_Seq_by_acc('AF303112');

$seqio = $gb-> get_Stream_by_id(["J00522","AF303112","2981014"]);

Sequence – Accession numbers

# Get a sequence from RefSeq by accession number

use Bio::DB::RefSeq;

$gb = new Bio::DB::RefSeq;

$seq = $gb->get_Seq_by_acc(“NM_007304”);

print $seq->seq();

Reading and Writing Sequences

Bio::SeqIO fasta, genbank, embl, swissprot,...

Takes care of writing out associated features and annotations

Two functionsnext_seq (reading sequences)write_seq (writing sequences)

Writing a Sequence

use Bio::SeqIO;# Let’s convert swissprot to fasta formatmy $in = Bio::SeqIO->new( -format => ‘swiss’,

-file => ‘file.sp’);my $out = Bio::SeqIO->new( -format => ‘fasta’,

-file => ‘>file.fa’);`while( my $seq = $in->next_seq ) {

$out->write_seq($seq);}

Manipulating sequence data with Seq methods

Allows the easy manipulation of bioinformatics data Specific parts of various annotated formats can be

selected and rearranged. Unwanted information can be voided out of reports Important information can be highlighted, processed,

stored in arrays for graphs/charts/etc with relative ease

Information can be added and subtracted in a flash

The Code

#!/usr/local/bin/perl

use Bio::Seq;

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta');

my $seqobj = $seqin->next_seq();

my $seq = $seqobj->seq(),"\n"; #plain sequence

print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n";

$seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines

print $seq,"\n";

Before

Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord)

Molecular Weight Monomer Counter Codon Counter DNA weights RNA weights Amino Weights More

The Code

#!/usr/local/bin/perl

use Bio::PrimarySeq;

use Bio::Tools::SeqStats;

my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1');

$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);

$hash_ref = $seq_stats->count_monomers();

foreach $base (sort keys %$hash_ref) {

print "Number of bases of type ", $base, "= ",%$hash_ref->{$base},"\n";

}

The Results

More Code

use SeqStats;

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

$weight = $seq_stats->get_mol_wt();

-returns the molecular weight

$monomer_ref = $seq_stats->count_monomers();

-counts the number of monomers

$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence

-counts the number of codons

Monomer

Why the Large and The Small MW?

Note that since sequences may contain ambiguous monomers (eg "M"meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wtreturns a two-element array containing the greatest lower bound andleast upper bound of the molecule. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)

Identifying restriction enzyme sites (Restriction Enzyme)

Bioperl's standard RestrictionEnzyme object comes with data for

more than 150 different restriction enzymes. To select all available enzymes with cutting patterns that

are six bases long: $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');

@sixcutters = $re->available_list(6); sites for that enzyme on a given nucleic acid sequence

can be obtained using $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');

# $seqobj is the Seq object for the dna to be cut @fragments = $re1->cut_seq($seqobj);

Manipulating sequence alignments

Bioperl offers several perl objects to facilitate sequence alignment: pSW (Smith-Waterman) Clustalw.pm TCoffee.pm bl2seq option of StandAloneBlast.

Manipulating Alignments

Some of the manipulations possible with SimpleAlign include: slice(): Obtaining an alignment ``slice'', that is, a subalignment

inclusive of specified start and end columns. column_from_residue_number(): Finding column in an alignment

where a specified residue of a specified sequence is located. consensus_string(): Making a consensus string. This method

includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ``?'''s in the consensus

percentage_identity(): A fast method for calculating the average percentage identity of the alignment

consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.

The Code

use Bio::SimpleAlign; $aln = Bio::SimpleAlign->new('t/data/testaln.fasta'); $threshold_percent = 60; $consensus_with_threshold = $aln-

>consensus_string($threshold_percent); $iupac_consensus = $aln->consensus_iupac(); # dna/rna alignments only $percent_ident = $aln->percentage_identity; $seqname = 'AKH_HAEIN'; $pos = $aln-

>column_from_residue_number($seqname, 14);

Searching for Sequence Similarity

BLAST with BioPerlParsing Blast and FASTA Reports

Search and SearchIOBPLite, BPpsilite, BPbl2seq

Parsing HMM ReportsStandalone BioPerl BLAST

Remote Execution of BLAST

BioPerl has built in capability of running BLAST jobs remotely using RemoteBlast.pm

Runs these jobs at NCBI automatically NCBI has dynamic configurations (server side) to “always” be

up and ready Automatically updated for new BioPerl Releases

Convenient for independent researchers who do not have access to huge computing resources

Quick submission of Blast jobs without tying up local resources (especially if working from standalone workstation)

Legal Restrictions!!!

Example of Remote Blast

$remote_blast = Bio::Tools::Run::RemoteBlast->new( '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );

$r = $remote_blast->submit_blast("t/data/ecolitst.fa"); while (@rids = $remote_blast->each_rid ) {

foreach $rid ( @rids ) {

$rc = $remote_blast->retrieve_blast($rid);}

}

Sample Script to Read and Parse BLAST Report

# Get the report $searchio = new Bio::SearchIO (-format => 'blast', -file => $blast_report);

$result = $searchio->next_result; # Get info about the entire report $algorithm_type = $result->algorithm; # get info about the first hit $hit = $result->next_hit; $hit_name = $hit->name ; # get info about the first hsp of the first hit $hsp = $hit->next_hsp; $hsp_start = $hsp->query->start;

Running BLAST Locally

StandAloneBlast Bio::Tools::Run::StandAloneBlast Factory Objects

@params = ('program' => 'blastn', 'database' => 'ecoli.nt'); $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

Advantages: Private Use Customized Local Resources Avoid Network Problems

Examples

# Setting parameters similar to RemoteBlast$input = Bio::Seq->new(-id =>"test query", -seq =>"ACTAAGTGGGGG");

$blast_report = $factory->blastall($input); # Blast Report Object that directly accesses parser

while (my $sbjct = $blast_report->next_hit){ while (my $hsp = $sbjct->next_hsp){

print $hsp->score . " " . $hsp->subject->seqname . "\n";

} }

Format Conversion – Sequences Example

Use: Bio::SeqIO Core Code:

$in = Bio::SeqIO->new('-file' => "COG0001",

'-format' => 'Fasta');

$out = Bio::SeqIO->new('-file' => ">COG0001.gen",

'-format' => 'genbank');

while ( my $seq = $in->next_seq() ) {

$out->write_seq($seq);

}

Format Conversion – Alignments

Alignment formats supported: INPUT: fasta, selex (HMMER), bl2seq,

clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme

OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved).

Next_aln( ) and write_aln( ) methods of the ‘Bio::AlignIO’ object are used

ClustalW and Profile Align

ClustalW using BioPerlClustalw program should be installed and

environment variable ‘CLUSTALDIR’ set Setting Parameters – Build a factory

Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet‘

Need reference to sequence array object (See example)

Align( ) and Profile_align( ) methods used

ClustalW – Example

Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw

Core code (Simple Align):

@params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1);

$factory = Bio::Tools::Run::Alignment::

Clustalw->new(@params);

$seq_array_ref = \@seq_array;

$aln= $factory->align($seq_array_ref);

Smith Waterman Search

Smith Waterman pairwise alignmentStandard method for producing an optimal

local alignment of two sequences Auxilliary Bioperl-ext library required SW algorithm implemented in C and

incorporated into bioperl Align_and_show() & Pairwise_alignment() in

Bio::Tools::pSW module are methods used

Smith Waterman Search – Example

Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO Core code:

$factory = new Bio::Tools::pSW( '-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2);

$aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]);

my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT);$alnout->write_aln($aln);

Smith Waterman Search

AlignIO object in previous slide – could also be used to print into a file

Use double loop to do all pairwise comparisons

More Info: Bio::Tools::pSW mapage

BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Documents

Transcript of BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.