BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

44
BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu

Transcript of BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Page 1: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

BioPerl

Based on a presentation by Manish Anand/Jonathan Nowacki/

Ravi Bhatt/Arvind Gopu

Page 2: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Introduction

Objective of BioPerl: Develop reusable, extensible core Perl modules for

use as a standard for manipulating molecular biological data.

Background: Started in 1995 One of the oldest open source Bioinformatics

Toolkit Project

Page 3: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

So what is BioPerl?

Higher level of abstraction

Re-usable collection of Perl modules that facilitate bioinformatics application development: Accessing databases with different formats Sequence manipulation Execution and Parsing of the results of molecular

biology programs

Catch? BioPerl does not include programs like Blast, ClustalW, etc Uses system calls to execute external programs

Page 4: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

So what is BioPerl? (continued…)

551 modules (incl. 82 interface modules)37 module groups79,582 lines of code (223,310 lines total)144 lines of code per moduleFor More info: BioPerl Module Listing

Page 5: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Major Areas covered in Bioperl

Sequences, features, annotations,Pairwise alignment reportsMultiple sequence alignmentsBibliographic dataGraphical rendering of sequence tracksDatabase for features and sequences

Page 6: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Additional things

Gene prediction parsersTrees, parsing phylogenetic and

molecular evolution software outputPopulation genetic data and summary

statisticsTaxonomyProtein Structure

Page 7: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Downloading modules

Modules can be obtained from: www.CPAN.org (Perl Modules) www.BioPerl.org (BioPerl Modules)

Downloading modules from CPAN Interactive mode

perl -MCPAN -e shell

Batch mode use CPAN; clean, install, make, recompile, test

Page 8: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Directory Structure

BioPerl directory structure organization: Bio/ BioPerl modules models/ UML for BioPerl classes t/ Perl built-in tests t/data/ Data files used for the tests scripts/ Reusable scripts that use BioPerl scripts/contributed/ Contributed scripts not

necessarily integrated into BioPerl. doc/ "How To" files and the FAQ as XML

Page 9: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Parsing Sequences

Bio::SeqIOmultiple drivers: genbank, embl, fasta,...

Sequence objectsBio::PrimarySeqBio::SeqBio::Seq::RichSeq

Page 10: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Sequence Object Creation

Sequence Creation :$sequence = Bio::Seq->new( -seq => ‘AATGCAA’

-display_id => ‘my_sequence’);

Flat File Format Support :

Raw, FASTA, GCG, GenBank, EMBL, PIR

Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP

Page 11: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Sequence object

Common (Bio::PrimarySeq) methodsseq() - get the sequence as a string length() - get the sequence lengthsubseq($s,$e) - get a subseqeunce translate(...) - translate to protein [DNA] revcom() - reverse complement [DNA]display_id() - identifier stringdescription() - description string

Page 12: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Sequence Types

Different Sequence Objects:

Seq – Some annotations RichSeq – Additional annotations PrimarySeq – Bare minimum annotation

( id, accession number, alphabet) LocatableSeq – Start, stop and gap information also LargeSeq – Very long sequences LiveSeq – Newly sequenced genomes

Page 13: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Using a sequence

use Bio::PrimarySeq;my $str = “ATGAATGATGAA”;my $seq = Bio::PrimarySeq->new(-seq => $str,

-display_id=>”example”);print “id is “, $seq->display_id,”\n”;print $seq->seq, “\n”;my $revcom = $seq->revcom;print $revcom->seq, “\n”;print “frame1=”,$seq->translate->seq,“\n”;

id is exampleATGAATGATGAATTCATCATTCATtrans frame1=MNDE

Page 14: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Accessing remote databases

$gb = new Bio::DB::GenBank();

$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

$seq2 = $gb->get_Seq_by_acc('AF303112');

$seqio = $gb-> get_Stream_by_id(["J00522","AF303112","2981014"]);

Page 15: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Sequence – Accession numbers

# Get a sequence from RefSeq by accession number

use Bio::DB::RefSeq;

$gb = new Bio::DB::RefSeq;

$seq = $gb->get_Seq_by_acc(“NM_007304”);

print $seq->seq();

Page 16: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Reading and Writing Sequences

Bio::SeqIO fasta, genbank, embl, swissprot,...

Takes care of writing out associated features and annotations

Two functionsnext_seq (reading sequences)write_seq (writing sequences)

Page 17: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Writing a Sequence

use Bio::SeqIO;# Let’s convert swissprot to fasta formatmy $in = Bio::SeqIO->new( -format => ‘swiss’,

-file => ‘file.sp’);my $out = Bio::SeqIO->new( -format => ‘fasta’,

-file => ‘>file.fa’);`while( my $seq = $in->next_seq ) {

$out->write_seq($seq);}

Page 18: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Manipulating sequence data with Seq methods

Allows the easy manipulation of bioinformatics data Specific parts of various annotated formats can be

selected and rearranged. Unwanted information can be voided out of reports Important information can be highlighted, processed,

stored in arrays for graphs/charts/etc with relative ease

Information can be added and subtracted in a flash

Page 19: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

The Code

#!/usr/local/bin/perl

use Bio::Seq;

use Bio::SeqIO;

my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta');

my $seqobj = $seqin->next_seq();

my $seq = $seqobj->seq(),"\n"; #plain sequence

print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n";

$seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines

print $seq,"\n";

Page 20: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Before

Page 21: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

After

Page 22: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord)

Molecular Weight Monomer Counter Codon Counter DNA weights RNA weights Amino Weights More

Page 23: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

The Code

#!/usr/local/bin/perl

use Bio::PrimarySeq;

use Bio::Tools::SeqStats;

my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1');

$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);

$hash_ref = $seq_stats->count_monomers();

foreach $base (sort keys %$hash_ref) {

print "Number of bases of type ", $base, "= ",%$hash_ref->{$base},"\n";

}

Page 24: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

The Results

Page 25: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

More Code

use SeqStats;

$seq_stats = Bio::Tools::SeqStats->new($seqobj);

$weight = $seq_stats->get_mol_wt();

-returns the molecular weight

$monomer_ref = $seq_stats->count_monomers();

-counts the number of monomers

$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence

-counts the number of codons

Page 26: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Monomer

Page 27: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Why the Large and The Small MW?

Note that since sequences may contain ambiguous monomers (eg "M"meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wtreturns a two-element array containing the greatest lower bound andleast upper bound of the molecule. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)

Page 28: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Identifying restriction enzyme sites (Restriction Enzyme)

Bioperl's standard RestrictionEnzyme object comes with data for

more than 150 different restriction enzymes. To select all available enzymes with cutting patterns that

are six bases long: $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');

@sixcutters = $re->available_list(6); sites for that enzyme on a given nucleic acid sequence

can be obtained using $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');

# $seqobj is the Seq object for the dna to be cut @fragments = $re1->cut_seq($seqobj);

Page 29: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Manipulating sequence alignments

Bioperl offers several perl objects to facilitate sequence alignment: pSW (Smith-Waterman) Clustalw.pm TCoffee.pm bl2seq option of StandAloneBlast.

Page 30: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Manipulating Alignments

Some of the manipulations possible with SimpleAlign include: slice(): Obtaining an alignment ``slice'', that is, a subalignment

inclusive of specified start and end columns. column_from_residue_number(): Finding column in an alignment

where a specified residue of a specified sequence is located. consensus_string(): Making a consensus string. This method

includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ``?'''s in the consensus

percentage_identity(): A fast method for calculating the average percentage identity of the alignment

consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.

Page 31: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

The Code

use Bio::SimpleAlign; $aln = Bio::SimpleAlign->new('t/data/testaln.fasta'); $threshold_percent = 60; $consensus_with_threshold = $aln-

>consensus_string($threshold_percent); $iupac_consensus = $aln->consensus_iupac(); # dna/rna alignments only $percent_ident = $aln->percentage_identity; $seqname = 'AKH_HAEIN'; $pos = $aln-

>column_from_residue_number($seqname, 14);

Page 32: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Searching for Sequence Similarity

BLAST with BioPerlParsing Blast and FASTA Reports

Search and SearchIOBPLite, BPpsilite, BPbl2seq

Parsing HMM ReportsStandalone BioPerl BLAST

Page 33: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Remote Execution of BLAST

BioPerl has built in capability of running BLAST jobs remotely using RemoteBlast.pm

Runs these jobs at NCBI automatically NCBI has dynamic configurations (server side) to “always” be

up and ready Automatically updated for new BioPerl Releases

Convenient for independent researchers who do not have access to huge computing resources

Quick submission of Blast jobs without tying up local resources (especially if working from standalone workstation)

Legal Restrictions!!!

Page 34: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Example of Remote Blast

$remote_blast = Bio::Tools::Run::RemoteBlast->new( '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' );

$r = $remote_blast->submit_blast("t/data/ecolitst.fa"); while (@rids = $remote_blast->each_rid ) {

foreach $rid ( @rids ) {

$rc = $remote_blast->retrieve_blast($rid);}

}

Page 35: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Sample Script to Read and Parse BLAST Report

# Get the report $searchio = new Bio::SearchIO (-format => 'blast', -file => $blast_report);

$result = $searchio->next_result; # Get info about the entire report $algorithm_type = $result->algorithm; # get info about the first hit $hit = $result->next_hit; $hit_name = $hit->name ; # get info about the first hsp of the first hit $hsp = $hit->next_hsp; $hsp_start = $hsp->query->start;

Page 36: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Running BLAST Locally

StandAloneBlast Bio::Tools::Run::StandAloneBlast Factory Objects

@params = ('program' => 'blastn', 'database' => 'ecoli.nt'); $factory = Bio::Tools::Run::StandAloneBlast->new(@params);

Advantages: Private Use Customized Local Resources Avoid Network Problems

Page 37: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Examples

# Setting parameters similar to RemoteBlast$input = Bio::Seq->new(-id =>"test query", -seq =>"ACTAAGTGGGGG");

$blast_report = $factory->blastall($input); # Blast Report Object that directly accesses parser

while (my $sbjct = $blast_report->next_hit){ while (my $hsp = $sbjct->next_hsp){

print $hsp->score . " " . $hsp->subject->seqname . "\n";

} }

Page 38: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Format Conversion – Sequences Example

Use: Bio::SeqIO Core Code:

$in = Bio::SeqIO->new('-file' => "COG0001",

'-format' => 'Fasta');

$out = Bio::SeqIO->new('-file' => ">COG0001.gen",

'-format' => 'genbank');

while ( my $seq = $in->next_seq() ) {

$out->write_seq($seq);

}

Page 39: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Format Conversion – Alignments

Alignment formats supported: INPUT: fasta, selex (HMMER), bl2seq,

clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme

OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved).

Next_aln( ) and write_aln( ) methods of the ‘Bio::AlignIO’ object are used

Page 40: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

ClustalW and Profile Align

ClustalW using BioPerlClustalw program should be installed and

environment variable ‘CLUSTALDIR’ set Setting Parameters – Build a factory

Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet‘

Need reference to sequence array object (See example)

Align( ) and Profile_align( ) methods used

Page 41: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

ClustalW – Example

Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw

Core code (Simple Align):

@params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1);

$factory = Bio::Tools::Run::Alignment::

Clustalw->new(@params);

$seq_array_ref = \@seq_array;

$aln= $factory->align($seq_array_ref);

Page 42: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Smith Waterman Search

Smith Waterman pairwise alignmentStandard method for producing an optimal

local alignment of two sequences Auxilliary Bioperl-ext library required SW algorithm implemented in C and

incorporated into bioperl Align_and_show() & Pairwise_alignment() in

Bio::Tools::pSW module are methods used

Page 43: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Smith Waterman Search – Example

Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO Core code:

$factory = new Bio::Tools::pSW( '-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2);

$aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]);

my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT);$alnout->write_aln($aln);

Page 44: BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.

Smith Waterman Search

AlignIO object in previous slide – could also be used to print into a file

Use double loop to do all pairwise comparisons

More Info: Bio::Tools::pSW mapage