BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu.
Introduction
Objective of BioPerl: develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data.
Background: started in 1995; one of the oldest open-source bioinformatics toolkit projects.
So what is BioPerl?
Higher level of abstraction
Reusable collection of Perl modules that facilitate bioinformatics application development:
  Accessing databases with different formats
  Sequence manipulation
  Execution and parsing of the results of molecular biology programs
Catch? BioPerl does not include programs like BLAST, ClustalW, etc.
  Uses system calls to execute external programs
So what is BioPerl? (continued…)
551 modules (incl. 82 interface modules)
37 module groups
79,582 lines of code (223,310 lines total)
144 lines of code per module
For more info: BioPerl Module Listing
Major Areas covered in Bioperl
Sequences, features, annotations
Pairwise alignment reports
Multiple sequence alignments
Bibliographic data
Graphical rendering of sequence tracks
Database for features and sequences
Additional things
Gene prediction parsers
Trees, parsing phylogenetic and molecular evolution software output
Population genetic data and summary statistics
Taxonomy
Protein structure
Downloading modules
Modules can be obtained from:
  www.CPAN.org (Perl Modules)
  www.BioPerl.org (BioPerl Modules)
Downloading modules from CPAN
  Interactive mode:
    perl -MCPAN -e shell
  Batch mode:
    use CPAN;
    clean, install, make, recompile, test
Directory Structure
BioPerl directory structure organization:
  Bio/                  BioPerl modules
  models/               UML for BioPerl classes
  t/                    Perl built-in tests
  t/data/               Data files used for the tests
  scripts/              Reusable scripts that use BioPerl
  scripts/contributed/  Contributed scripts not necessarily integrated into BioPerl
  doc/                  "How To" files and the FAQ as XML
Parsing Sequences
Bio::SeqIO
  Multiple drivers: genbank, embl, fasta, ...
Sequence objects
  Bio::PrimarySeq
  Bio::Seq
  Bio::Seq::RichSeq
Sequence Object Creation
Sequence creation:
  $sequence = Bio::Seq->new(-seq        => 'AATGCAA',
                            -display_id => 'my_sequence');
Flat file format support:
  Raw, FASTA, GCG, GenBank, EMBL, PIR
  Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP
Sequence object
Common (Bio::PrimarySeq) methods
  seq() - get the sequence as a string
  length() - get the sequence length
  subseq($s,$e) - get a subsequence
  translate(...) - translate to protein [DNA]
  revcom() - reverse complement [DNA]
  display_id() - identifier string
  description() - description string
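To make the coordinate convention behind subseq() and the effect of revcom() concrete, here is a minimal plain-Perl sketch. It assumes 1-based, inclusive positions (the BioPerl convention) and unambiguous DNA. The helper names plain_subseq and plain_revcom are hypothetical stand-ins, not BioPerl functions.

```perl
use strict;
use warnings;

# Plain-Perl stand-ins for subseq() and revcom(); illustrative only,
# not BioPerl's implementation.

# Return the subsequence from position $start to $end (1-based, inclusive).
sub plain_subseq {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse-complement an unambiguous DNA string.
sub plain_revcom {
    my ($seq) = @_;
    my $rc = reverse $seq;          # scalar context: reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # swap each base for its complement
    return $rc;
}

my $dna = "ATGAATGATGAA";
print plain_subseq($dna, 4, 6), "\n";   # AAT
print plain_revcom($dna), "\n";         # TTCATCATTCAT
```

With a real Bio::PrimarySeq object the equivalent calls would be $seq->subseq(4,6) and $seq->revcom->seq.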
Sequence Types
Different sequence objects:
  Seq – Some annotations
  RichSeq – Additional annotations
  PrimarySeq – Bare minimum annotation (id, accession number, alphabet)
  LocatableSeq – Start, stop and gap information also
  LargeSeq – Very long sequences
  LiveSeq – Newly sequenced genomes
Using a sequence
use Bio::PrimarySeq;
my $str = "ATGAATGATGAA";
my $seq = Bio::PrimarySeq->new(-seq        => $str,
                               -display_id => "example");
print "id is ", $seq->display_id, "\n";
print $seq->seq, "\n";
my $revcom = $seq->revcom;
print $revcom->seq, "\n";
print "frame1=", $seq->translate->seq, "\n";

Output:
id is example
ATGAATGATGAA
TTCATCATTCAT
frame1=MNDE
Accessing remote databases
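use Bio::DB::GenBank;
$gb = Bio::DB::GenBank->new();
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');    # fetch one record by ID
$seq2 = $gb->get_Seq_by_acc('AF303112');    # fetch one record by accession number
# fetch several records at once as a sequence stream
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);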
Sequence – Accession numbers
# Get a sequence from RefSeq by accession number
use Bio::DB::RefSeq;
$gb = new Bio::DB::RefSeq;
$seq = $gb->get_Seq_by_acc("NM_007304");
print $seq->seq();
Reading and Writing Sequences
Bio::SeqIO
  fasta, genbank, embl, swissprot, ...
Takes care of writing out associated features and annotations
Two functions
  next_seq (reading sequences)
  write_seq (writing sequences)
Writing a Sequence
use Bio::SeqIO;
# Let's convert swissprot to fasta format
my $in = Bio::SeqIO->new(-format => 'swiss',
                         -file   => 'file.sp');
my $out = Bio::SeqIO->new(-format => 'fasta',
                          -file   => '>file.fa');
while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
}
Manipulating sequence data with Seq methods
Allows easy manipulation of bioinformatics data
Specific parts of various annotated formats can be selected and rearranged
Unwanted information can be removed from reports
Important information can be highlighted, processed, or stored in arrays for graphs, charts, etc. with relative ease
Information can be added and removed quickly
The Code
#!/usr/local/bin/perl
use Bio::Seq;
use Bio::SeqIO;
my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta');
my $seqobj = $seqin->next_seq();
my $seq = $seqobj->seq(); # plain sequence
print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n";
$seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines
print $seq,"\n";
[Figures "Before" and "After" not reproduced in this transcript]
Obtaining basic sequence statistics: molecular weights, residue & codon frequencies (SeqStats, SeqWords)
Molecular weight
Monomer counter
Codon counter
DNA weights
RNA weights
Amino acid weights
More
The Code
#!/usr/local/bin/perl
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;
my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1');
$seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj);
$hash_ref = $seq_stats->count_monomers();
foreach $base (sort keys %$hash_ref) {
print "Number of bases of type ", $base, " = ", $hash_ref->{$base}, "\n";
}
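For readers without BioPerl at hand, the counting step in the script above can be sketched in plain Perl. The function name plain_count_monomers is a hypothetical stand-in. It only mirrors the idea of count_monomers() (a hash reference keyed by residue) and is not the module's implementation.

```perl
use strict;
use warnings;

# Count each residue in a sequence string; a plain-Perl stand-in,
# not the Bio::Tools::SeqStats implementation.
sub plain_count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;   # one tally per character
    return \%count;
}

my $counts = plain_count_monomers('ATCGTAGCTAGCTGA');
foreach my $base (sort keys %$counts) {
    print "Number of bases of type $base = $counts->{$base}\n";
}
```

For the 15-base example sequence this tallies A = 4, C = 3, G = 4 and T = 4.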
The Results
More Code
use Bio::Tools::SeqStats;
$seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);
$weight = $seq_stats->get_mol_wt();            # returns the molecular weight
$monomer_ref = $seq_stats->count_monomers();   # counts the number of monomers
$codon_ref = $seq_stats->count_codons();       # counts the number of codons (nucleic acid sequences only)
Monomer
Why the Large and The Small MW?
Note that since sequences may contain ambiguous monomers (e.g. "M" meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wt returns a reference to a two-element array containing the greatest lower bound and least upper bound of the molecular weight. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)
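The bounds logic can be sketched as follows: for every ambiguous symbol, add the lightest possible residue to the lower bound and the heaviest to the upper bound. The weights below are placeholders chosen for easy arithmetic, not real molecular weights, and the sketch ignores any chemistry (such as bond formation) that a real calculation accounts for. The name weight_bounds is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# PLACEHOLDER weights for easy arithmetic; NOT real molecular weights.
my %weight = (A => 1, C => 2, G => 3, T => 4);

# A small subset of the IUPAC nucleotide ambiguity codes.
my %expands_to = (
    A => ['A'], C => ['C'], G => ['G'], T => ['T'],
    M => ['A', 'C'], R => ['A', 'G'], N => ['A', 'C', 'G', 'T'],
);

# Return [lower bound, upper bound] for the summed weight of a sequence.
sub weight_bounds {
    my ($seq) = @_;
    my ($low, $high) = (0, 0);
    for my $symbol (split //, uc $seq) {
        my $codes = $expands_to{$symbol}
            or die "Unknown symbol '$symbol'\n";
        my @options = map { $weight{$_} } @$codes;
        $low  += min(@options);   # lightest residue this symbol could be
        $high += max(@options);   # heaviest residue this symbol could be
    }
    return [$low, $high];
}

my $exact = weight_bounds('ACGT');   # no ambiguity: both bounds equal
my $range = weight_bounds('AMGT');   # M is A or C: bounds differ
print "ACGT: $exact->[0] to $exact->[1]\n";   # ACGT: 10 to 10
print "AMGT: $range->[0] to $range->[1]\n";   # AMGT: 9 to 10
```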
Identifying restriction enzyme sites (Restriction Enzyme)
Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes.
To select all available enzymes with cutting patterns that are six bases long:
  $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI');
  @sixcutters = $re->available_list(6);
Sites for that enzyme on a given nucleic acid sequence can be obtained using:
  $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI');
  # $seqobj is the Seq object for the dna to be cut
  @fragments = $re1->cut_seq($seqobj);
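As a rough illustration of what cutting a sequence means, the sketch below splits a DNA string at every EcoRI site, which cuts between G and AATTC within GAATTC. It is a plain-Perl approximation with a hypothetical helper name (cut_at_ecori) and works on a plain string rather than a Seq object. It is not how the BioPerl module is implemented.

```perl
use strict;
use warnings;

# Cut a DNA string at every EcoRI site (G^AATTC).
# Plain-Perl illustration only; not the BioPerl implementation.
sub cut_at_ecori {
    my ($dna) = @_;
    # Split at the zero-width position between the G and the AATTC.
    return split /(?<=G)(?=AATTC)/, uc $dna;
}

my @fragments = cut_at_ecori('TTGAATTCAAGGGAATTCTT');
print join(' | ', @fragments), "\n";   # TTG | AATTCAAGGG | AATTCTT
```

The lookbehind and lookahead keep every base in the output, so joining the fragments reproduces the input sequence.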
Manipulating sequence alignments
Bioperl offers several Perl objects to facilitate sequence alignment:
  pSW (Smith-Waterman)
  Clustalw.pm
  TCoffee.pm
  bl2seq option of StandAloneBlast
Manipulating Alignments
Some of the manipulations possible with SimpleAlign include:
  slice(): Obtaining an alignment "slice", that is, a subalignment inclusive of specified start and end columns.
  column_from_residue_number(): Finding the column in an alignment where a specified residue of a specified sequence is located.
  consensus_string(): Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by "?"s in the consensus.
  percentage_identity(): A fast method for calculating the average percentage identity of the alignment.
  consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.
The Code
use Bio::AlignIO;
# Read the alignment through AlignIO; next_aln returns a Bio::SimpleAlign object
$in = Bio::AlignIO->new(-file => 't/data/testaln.fasta', -format => 'fasta');
$aln = $in->next_aln();
$threshold_percent = 60;
$consensus_with_threshold = $aln->consensus_string($threshold_percent);
$iupac_consensus = $aln->consensus_iupac(); # dna/rna alignments only
$percent_ident = $aln->percentage_identity;
$seqname = 'AKH_HAEIN';
$pos = $aln->column_from_residue_number($seqname, 14);
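The threshold behaviour described for consensus_string() can be sketched in plain Perl: take the most common residue in each column and replace it with "?" when it covers less than the threshold percentage of rows. This is an assumption-laden illustration (equal-length rows, no special gap handling, ties broken alphabetically) with a hypothetical function name, not Bio::SimpleAlign's implementation.

```perl
use strict;
use warnings;

# Build a consensus from equal-length aligned strings. A column whose most
# common residue covers less than $threshold percent of the rows becomes '?'.
# Plain-Perl illustration; not Bio::SimpleAlign's implementation.
sub plain_consensus {
    my ($threshold, @rows) = @_;
    my $consensus = '';
    for my $col (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @rows;
        # Most frequent residue in this column (ties broken alphabetically).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / scalar(@rows);
        $consensus .= $percent >= $threshold ? $best : '?';
    }
    return $consensus;
}

my @alignment = ('ACGT', 'ACGA', 'ACTC');
print plain_consensus(60, @alignment), "\n";   # ACG?
print plain_consensus(70, @alignment), "\n";   # AC??
```

The third column is G in two of three rows (about 67%), so it survives a 60% threshold but not a 70% one.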
Searching for Sequence Similarity
BLAST with BioPerl
Parsing BLAST and FASTA reports
  Search and SearchIO
  BPLite, BPpsilite, BPbl2seq
Parsing HMM reports
Standalone BioPerl BLAST
Remote Execution of BLAST
BioPerl has a built-in capability for running BLAST jobs remotely using RemoteBlast.pm
Runs these jobs at NCBI automatically
  NCBI has dynamic configurations (server side) to "always" be up and ready
  Automatically updated for new BioPerl releases
Convenient for independent researchers who do not have access to huge computing resources
Quick submission of BLAST jobs without tying up local resources (especially if working from a standalone workstation)
Legal restrictions!
Example of Remote Blast
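use Bio::Tools::Run::RemoteBlast;
$remote_blast = Bio::Tools::Run::RemoteBlast->new('-prog' => 'blastp', '-data' => 'ecoli', '-expect' => '1e-10');
$r = $remote_blast->submit_blast("t/data/ecolitst.fa");
while ( @rids = $remote_blast->each_rid ) {
    foreach $rid ( @rids ) {
        $rc = $remote_blast->retrieve_blast($rid);
        $remote_blast->remove_rid($rid) if ref($rc) || $rc < 0;   # done or failed: stop polling
        sleep 5 unless ref($rc);                                  # still running: wait before retrying
    }
}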
Sample Script to Read and Parse BLAST Report
use Bio::SearchIO;
# Get the report
$searchio = new Bio::SearchIO(-format => 'blast', -file => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$algorithm_type = $result->algorithm;
# Get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name;
# Get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;
Running BLAST Locally
StandAloneBlast
  Bio::Tools::Run::StandAloneBlast
Factory objects
  @params = ('program' => 'blastn', 'database' => 'ecoli.nt');
  $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
Advantages:
  Private use
  Customized local resources
  Avoid network problems
Examples
# Setting parameters similar to RemoteBlast
$input = Bio::Seq->new(-id => "test query", -seq => "ACTAAGTGGGGG");
$blast_report = $factory->blastall($input);   # Blast report object that directly accesses the parser
while (my $sbjct = $blast_report->next_hit) {
    while (my $hsp = $sbjct->next_hsp) {
        print $hsp->score . " " . $hsp->subject->seqname . "\n";
    }
}
Format Conversion – Sequences Example
Use: Bio::SeqIO
Core code:
$in = Bio::SeqIO->new('-file' => "COG0001",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">COG0001.gen",
'-format' => 'genbank');
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq);
}
Format Conversion – Alignments
Alignment formats supported:
  INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme
  OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved)
The next_aln() and write_aln() methods of the Bio::AlignIO object are used
ClustalW and Profile Align
ClustalW using BioPerl
The Clustalw program should be installed and the environment variable 'CLUSTALDIR' set
Setting parameters: build a factory
  Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet'
Need a reference to the sequence array object (see example)
The align() and profile_align() methods are used
ClustalW – Example
Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw
Core code (Simple Align):
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1);
$factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
$seq_array_ref = \@seq_array;
$aln= $factory->align($seq_array_ref);
Smith Waterman Search
Smith-Waterman pairwise alignment
  Standard method for producing an optimal local alignment of two sequences
  Auxiliary bioperl-ext library required
  SW algorithm implemented in C and incorporated into BioPerl
  align_and_show() and pairwise_alignment() in the Bio::Tools::pSW module are the methods used
Smith Waterman Search – Example
Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO
Core code:
$factory = new Bio::Tools::pSW('-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2);
$aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]);
my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT);
$alnout->write_aln($aln);
Smith Waterman Search
The AlignIO object in the previous slide could also be used to print to a file
Use a double loop to do all pairwise comparisons
More info: Bio::Tools::pSW manpage
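The double loop for all pairwise comparisons can be sketched as below. The helper for_each_pair and the sequence labels are hypothetical. In the setting of the slides, the callback body would be a call such as $factory->pairwise_alignment($seq_array[$i], $seq_array[$j]); here it only records the pair so the sketch runs without bioperl-ext.

```perl
use strict;
use warnings;

# Call $compare->($i, $j) once for every unordered pair of indices i < j.
sub for_each_pair {
    my ($n, $compare) = @_;
    for my $i (0 .. $n - 2) {
        for my $j ($i + 1 .. $n - 1) {
            $compare->($i, $j);
        }
    }
}

my @names = qw(seqA seqB seqC seqD);   # hypothetical sequence labels
my @pairs;
for_each_pair(scalar @names, sub { push @pairs, "$names[$_[0]]-$names[$_[1]]" });
print "$_\n" for @pairs;               # 6 pairs for 4 sequences
```

Starting the inner loop at $i + 1 visits each pair once, n(n-1)/2 comparisons in total, and skips self-comparisons.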