Biopython: Overview, State of the Art and Outlook

Post on 10-May-2015





Twitter: @sbassi

A few words about Python:

Python is a general-purpose high-level and dynamic programming language.

It supports multiple programming paradigms (OOP, imperative and functional programming).

It features a fully dynamic type system and automatic memory management.

Python features

●Easy to learn
●Easy to read (looks like pseudocode)
●Interpreted (compiled to a vm bytecode, it is fast to program)
●High level data structures (lists, dictionaries, sets and more)
●Multiplatform (from supercomputers to phones)
●Batteries included philosophy
●Extensive 3rd party libraries
●Free (as in freedom and as in beer)
●Strong community

Read a file, load an array and sort it: VB

Dim i, j, Array_Used As Integer
Dim MyArray() As String
Dim InBuffer, Temp As String
Array_Used = 0
ReDim MyArray(50)
'open a text file here . . .
Do While Not EOF(file_no)
    Line Input #file_no, MyArray(Array_Used)
    Array_Used = Array_Used + 1
    'grow the array in blocks of 50 when it fills up
    If Array_Used = UBound(MyArray) Then
        ReDim Preserve MyArray(UBound(MyArray) + 50)
    End If
Loop
'simple bubble sort
For i = Array_Used - 1 To 0 Step -1
    For j = 1 To i
        If MyArray(j - 1) > MyArray(j) Then
            Temp = MyArray(j - 1)
            MyArray(j - 1) = MyArray(j)
            MyArray(j) = Temp
        End If
    Next j
Next i



Read a file, load an array and sort it: Python

# Open the file handle
file_object = open(FILENAME)
# Read all lines and store them in a list
lista = file_object.readlines()
# Sort the list
lista.sort()
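The same read-and-sort task can also be sketched with a context manager, so the file is closed automatically. This is a minimal illustration, not part of the original slides; the function name read_sorted_lines and the sample file contents are made up for the example.

```python
import os
import tempfile


def read_sorted_lines(path):
    """Return the lines of a text file, sorted, without trailing newlines."""
    # The with-block closes the file even if reading fails.
    with open(path) as handle:
        return sorted(line.rstrip("\n") for line in handle)


if __name__ == "__main__":
    # Build a small throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("pear\napple\nmango\n")
    try:
        print(read_sorted_lines(tmp.name))
    finally:
        os.remove(tmp.name)
```

Using sorted() returns a new list and leaves the input untouched, whereas list.sort() sorts in place; either fits the comparison with the VB version.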


What can be done with Python?

from pylab import *
from data_helper import get_daily_data
intc, msft = get_daily_data()
delta1 = diff([0]
# size in points ^2
volume = (15*intc.volume[:-2]/intc.volume[0])**2
close = 0.003*intc.close[:-2]/0.003*[:-2]
scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.75)
ticks = arange(-0.06, 0.061, 0.02)
xticks(ticks)
yticks(ticks)
xlabel(r'$\Delta_i$', fontsize=20)
ylabel(r'$\Delta_{i+1}$', fontsize=20)
title('Volume and percent change')
grid(True)
show()

Robots (made in Argentina)


A set of freely available Python tools for bioinformatics and molecular biology

Features include:

●Parsing bioinformatics files into Python structures
●A sequence class to store sequences, ids and features
●Interface to popular bioinformatics programs (clustalw, blast, primer3 and more)
●Tools for performing common operations on DNA/protein sequences (translation, transcription, Tm, weight)
●Code to deal with alignments
●Integration with other languages via BioCorba

Biopython in the lab (real world usage)

Contributions to Biopython


●Tm function
●LCC function
●Two checksum functions in Bio.SeqUtils.CheckSum


●Feedback
●Bug reporting
●Testing (BLAST, SFF files, BioSQL)

Sequence class

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna)
>>> rna_1 = seq_1.transcribe()
>>> str(rna_1)
'GAUCGAUGGGCCUAUAUAGGA'
>>> rna_1.translate()
Seq('DRWAYIG', IUPACProtein())
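To make the two operations in the session above concrete, here is a plain-Python sketch of transcription and translation with the standard genetic code. It is an illustration only and does not reproduce Biopython's implementation; the names CODON_TABLE, transcribe and translate are chosen for this example, and ambiguous bases are not handled (they would raise a KeyError).

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an RNA (or DNA) string codon by codon; '*' marks a stop."""
    dna = rna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    rna = transcribe("GATCGATGGGCCTATATAGGA")
    print(rna)
    print(translate(rna))
```

Run on the sequence from the slide, the sketch gives the same RNA string and the same peptide, DRWAYIG, as the Seq methods.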

Run a BLAST search

from Bio.Blast import NCBIStandalone as BLAST
r, e = BLAST.blastall(b_exe, 'blastn', b_db, f_in,
                      gap_open='3', gap_extend='2', wordsize=20,
                      expectation=1e-50, alignments=1, descriptions=1,
                      align_view='0', html='F')

Parse a BLAST result

from Bio.Blast import NCBIXML
for rec in NCBIXML.parse(r):
    for align in rec.alignments:
        for hsp in align.hsps:
            print hsp.query_start, hsp.query_end
            print hsp.sbjct_start, hsp.sbjct_end
            if hsp.identities>90:
                print align.title
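The loop above walks records, alignments and HSPs. As a rough illustration of what sits underneath, the sketch below reads the same fields straight from BLAST XML with the standard library. The XML fragment is a made-up, heavily trimmed example (real output carries many more elements), the function name strong_hits is hypothetical, and Hit_def stands in here for the alignment title.

```python
import xml.etree.ElementTree as ET

# Invented, minimal BLAST XML fragment used only for this illustration.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>example cloning vector</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>120</Hsp_query-to>
              <Hsp_hit-from>301</Hsp_hit-from>
              <Hsp_hit-to>420</Hsp_hit-to>
              <Hsp_identity>118</Hsp_identity>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>example weak match</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_query-from>10</Hsp_query-from>
              <Hsp_query-to>60</Hsp_query-to>
              <Hsp_hit-from>5</Hsp_hit-from>
              <Hsp_hit-to>55</Hsp_hit-to>
              <Hsp_identity>40</Hsp_identity>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def strong_hits(xml_text, min_identities=90):
    """Return (title, q_start, q_end, s_start, s_end) for HSPs above the cutoff."""
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("Hit"):
        title = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            # Same test as the slide: number of identical positions > 90.
            if int(hsp.findtext("Hsp_identity")) > min_identities:
                results.append((
                    title,
                    int(hsp.findtext("Hsp_query-from")),
                    int(hsp.findtext("Hsp_query-to")),
                    int(hsp.findtext("Hsp_hit-from")),
                    int(hsp.findtext("Hsp_hit-to")),
                ))
    return results


if __name__ == "__main__":
    for row in strong_hits(SAMPLE_XML):
        print(row)
```

For real work the NCBIXML parser is the better choice, since it handles multi-query files and the full set of fields.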

Typical bioinformatic problems and Biopython (1/3)

Problem: Sequence manipulation in batch
Tool: SeqRecord and SeqIO

Problem: Filtering vector contamination
Tool: SeqRecord, SeqIO, NCBIXML and NCBIStandalone

Problem: Searching for primers
Tool: Emboss.Applications

Problem: Calculate melting temperature
Tool: SeqUtils
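As a worked example of the melting temperature problem, the sketch below applies the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, a rough estimate meant for short oligonucleotides. It is a stand-in chosen for brevity and is not claimed to be the method SeqUtils implements; the function name wallace_tm is made up for this example.

```python
def wallace_tm(sequence):
    """Estimate Tm (in degrees Celsius) of a short DNA oligo with the Wallace rule."""
    seq = sequence.upper()
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    if at_count + gc_count != len(seq):
        raise ValueError("sequence contains characters other than A, C, G, T")
    # Each A/T pair contributes 2 degrees, each G/C pair 4 degrees.
    return 2 * at_count + 4 * gc_count


if __name__ == "__main__":
    print(wallace_tm("GATCGATGGGCCTATATAGGA"))
```

The rule ignores salt and primer concentration, so it is best read as a quick sanity check rather than a value to design experiments around.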

Typical bioinformatic problems and Biopython (2/3)

Problem: Introduce mutations with restrictions
Tool: Restriction and Data.CodonTable

Problem: Extract information from alignment
Tool: Clustalw.MultipleAlignCL

Problem: Get a substitution matrix from an alignment
Tool: Align.AlignInfo and SubsMat

Problem: Parse structural data
Tool: PDB.PDBParser

Typical bioinformatic problems and Biopython (3/3)

Problem: Calculate linkage disequilibrium
Tool: PopGen.GenePop

Problem: Running SIMCOAL2
Tool: PopGen.SimCoal

Problem: Data persistence (in relational database)
Tool: BioSQL

Problem: Retrieve data from Entrez
Tool: Entrez.efetch
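Entrez retrieval goes through the NCBI E-utilities web service. As a hedged sketch of what such a request looks like, the code below only builds an EFetch URL with the standard library and does not contact the network. The function name build_efetch_url and the identifier PLACEHOLDER_ID are invented for the example; a real query needs a valid accession or GI number.

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Return the EFetch URL for one record; no request is sent."""
    params = {
        "db": db,            # Entrez database name, e.g. "nucleotide"
        "id": record_id,     # accession or GI number (placeholder below)
        "rettype": rettype,  # record format requested
        "retmode": retmode,  # plain text or XML
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_efetch_url("nucleotide", "PLACEHOLDER_ID"))
```

In practice a dedicated client such as Bio.Entrez is preferable to hand-built URLs, since NCBI asks users to identify themselves and to limit request rates.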

Outlook for Biopython

Current version: 1.53 (December 2009)

For 1.54:

●Updated multiple sequence alignment object
●Bio.Phylo module
●Bio.SeqIO support for Standard Flowgram Format (SFF) files

●Extending Bio.PDB (GSoC grant)
●Support Python 3

Additional Resources

Biopython website:


Cock PJ, et al. “Biopython: freely available Python tools for computational molecular biology and bioinformatics”. Bioinformatics 2009 Jun 1; 25(11): 1422-3. doi:10.1093/bioinformatics/btp163 pmid:19304878.

Bassi S (2007) A Primer on Python for Life Science Researchers. PLoS Comput Biol 3(11): e199. doi:10.1371/journal.pcbi.0030199

Book: “Python for Bioinformatics” and

Mailing list:



Python in Argentina:

Thank you!