
Bioinformatics - Lecture 10

Bioinformatics Software support

Martin Saturka

http://www.bioplexity.org/lectures/

EBI version 0.4

Creative Commons Attribution-Share Alike 2.5 License

Martin Saturka www.Bioplexity.org Bioinformatics - Software


Software for bioinformatics

Common computation tools and systems in bioinformatics.
numerical, algebraic and statistical software.
computation systems specific to bioinformatics.

Main topics
  general software
    - scripting, licenses
    - mpi, sse, gpgpu
  scientific tools
    - emboss, 3D structures
    - algebra, regression, graphs
  R system
    - syntax, statistics
    - packages, examples


Open source

IP: copyrights, trademarks, patents

software licenses
  public domain
  BSD, MIT
  LGPL, GPL

multimedia, texts
  FDL
  CC - by, sa, nd, nc

open source licensing
  Open source initiative
    www.opensource.org/licenses/
  Creative commons
    creativecommons.org
    sciencecommons.org


Programming

theoretical systems and actual languages

approaches
  imperative
    most standard programming languages
  declarative
    functional, logic, constraint programming

languages
  compiled
    low level work: C/C++, Fortran
  interpreted
    Python, Tcl/Tk, Perl, Ruby, PHP, Lisp


Floating point

precision
  single, double, extended precision, double double, quadruple

hardware
  FPU: x87, RISC, pipelines
  SIMD: altivec, sse, gpgpu

inverse square root 'magic'

  // assumes a 32-bit int and IEEE 754 single-precision float
  float InvSqrt(float x) {
      float xhalf = 0.5f * x;
      int i = *(int*)&x;               // float → bits
      i = 0x5f3759df - (i >> 1);       // guess on result value
      x = *(float*)&i;                 // bits → float
      x = x * (1.5f - xhalf * x * x);  // result value adjusting (one Newton step)
      return x;
  }   // relative error below 0.002
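The bit-level trick can also be tried out in Python. The following is a minimal sketch, assuming IEEE 754 single precision; the function name `inv_sqrt` is made up for this illustration, and the final Newton step runs in Python's double precision, so results can differ slightly from the C version.

```python
import math
import struct

def inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x with the 'magic constant' trick."""
    xhalf = 0.5 * x
    # Reinterpret the single-precision float as a 32-bit unsigned integer.
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    # Initial guess: subtract the halved bit pattern from the magic constant.
    i = (0x5F3759DF - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer bits as a float again.
    (y,) = struct.unpack("<f", struct.pack("<I", i))
    # One Newton-Raphson step improves the guess.
    return y * (1.5 - xhalf * y * y)

print(inv_sqrt(4.0), 1.0 / math.sqrt(4.0))
```

`struct.pack` and `struct.unpack` play the role of the pointer casts in the C code: they copy the same 32 bits between the float and the integer view.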


MPI

parallel programming

methods
  MPI, PVM, threads
  www.open-mpi.org

  #include <mpi.h>
  ...
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  ...
  MPI_Send(msg, ln, MPI_INT, dest, tag, MPI_COMM_WORLD);
  MPI_Recv(msg, ln, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &st);
  ...
  MPI_Finalize();


Algebraic methods

non-trivial computational / numerical methods

hashing - data storage
  perfect hashing
    collision-free function from a given fixed set of strings to an integer interval
  cuckoo hashing
    simple implementation, usage of two hash functions

direct minimization
  linear programming
    used e.g. for robust (median) regression
  quadratic programming
    used e.g. for SVM - support vector machines

eigen problems
  eigen-vectors as linear data approximation
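To make the cuckoo hashing item concrete, here is a small sketch of a string set with two tables and two hash functions. Everything in it is an assumption for illustration: the class name `CuckooSet`, the table sizes, the eviction limit, and the way the two hash functions are derived from Python's built-in `hash()`.

```python
class CuckooSet:
    """Toy cuckoo hash set for strings: two tables, two hash functions."""

    def __init__(self, capacity: int = 11, max_kicks: int = 32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which: int, key: str) -> int:
        # Two hypothetical hash functions, obtained by salting hash() with the table index.
        return hash((which, key)) % self.capacity

    def __contains__(self, key: str) -> bool:
        # A lookup inspects at most two slots, one per table.
        return any(self.tables[t][self._slot(t, key)] == key for t in (0, 1))

    def add(self, key: str) -> None:
        if key in self:
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._slot(which, key)
            # Place the key and pick up whatever occupied the slot before.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return
            which = 1 - which  # the evicted key moves to its slot in the other table
        self._grow(key)

    def _grow(self, pending: str) -> None:
        # Too many evictions: rebuild with larger tables and reinsert every key.
        old = [k for table in self.tables for k in table if k is not None]
        self.capacity = 2 * self.capacity + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in old + [pending]:
            self.add(k)

genes = CuckooSet()
for name in ("BRCA1", "TP53", "EGFR", "MYC"):
    genes.add(name)
print("TP53" in genes, "KRAS" in genes)
```

The design choice that gives cuckoo hashing its constant-time lookups is that each key has exactly two possible homes; insertion pays for this by occasionally moving keys around or rebuilding the tables.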


FFT NLLS

Fast Fourier transform with non-linear least squares fitting

usage
  biological cycle / rhythm study
    period determination, run description

steps
  detrending
    arithmetic mean subtracting
  normalization
    scaling to unit variance
  FFT
    taking the frequency with the greatest amplitude
  NLLS
    (cosine) curve fitting to data
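The first three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's implementation: it uses a naive discrete Fourier transform instead of a real FFT, detrends by mean subtraction only, and the function name `dominant_period` and the synthetic 24-sample rhythm are made up. The period it returns would serve as the starting value for the NLLS cosine fit.

```python
import cmath
import math

def dominant_period(samples, dt=1.0):
    """Detrend, standardise, and return the period of the strongest DFT peak."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]            # detrending: subtract the mean
    sd = math.sqrt(sum(c * c for c in centred) / n)
    z = [c / sd for c in centred]                    # normalization: unit variance
    best_k, best_amp = 1, 0.0
    for k in range(1, n // 2 + 1):                   # skip the zero frequency
        coeff = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    return n * dt / best_k                           # period of the strongest component

# Synthetic rhythm: offset 3, amplitude 2, period 24 samples, four full cycles.
signal = [3.0 + 2.0 * math.cos(2 * math.pi * t / 24 + 0.5) for t in range(96)]
print(dominant_period(signal))
```

The frequency grid of the transform is coarse (only periods of the form n/k are available), which is one reason the period is refined afterwards by curve fitting.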


Data transformations

initial data preparation

[Figure panels: interpolated original data; detrended and standardized data]

peak shapes
  adjusting after the FFT and period determination are done


Non-linear fitting

regression with a class of non-linear functions

Levenberg-Marquardt method
  small errors generally approximated by quadratics
  iterative method - smooth interpolation between the steepest descent and the inverse Hessian method
  second derivatives give 'order' information, first derivatives give the minimizing direction
  damped iteration along the first derivatives

software
  implemented in many packages
  R system, GSL library, Octave, etc.
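The damped iteration can be illustrated with a compact pure-Python sketch. It is a teaching version under assumptions, not the code of any of the packages named above: the Hessian is approximated by J^T J (Gauss-Newton), the damping uses Marquardt's diagonal scaling, and the names `solve`, `levenberg_marquardt`, `cosine` and `cosine_jac` as well as the test data are made up for this example.

```python
import math

def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def levenberg_marquardt(model, jacobian, xs, ys, params, lam=1e-3, steps=100):
    """Minimise the sum of squared residuals of model(x, params) against ys."""
    def cost(p):
        return sum((y - model(x, p)) ** 2 for x, y in zip(xs, ys))

    k, current = len(params), cost(params)
    for _ in range(steps):
        rows = [jacobian(x, params) for x in xs]
        res = [y - model(x, params) for x, y in zip(xs, ys)]
        jtj = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
        jtr = [sum(r[i] * e for r, e in zip(rows, res)) for i in range(k)]
        # Large lam: short gradient-like step. Small lam: Gauss-Newton step.
        damped = [[jtj[i][j] * (1.0 + lam if i == j else 1.0) for j in range(k)]
                  for i in range(k)]
        trial = [p + s for p, s in zip(params, solve(damped, jtr))]
        trial_cost = cost(trial)
        if trial_cost < current:          # accept the step and trust the model more
            params, current, lam = trial, trial_cost, lam / 10.0
        else:                             # reject the step and damp harder
            lam *= 10.0
    return params

def cosine(x, p):
    amp, period, phase = p
    return amp * math.cos(2 * math.pi * x / period + phase)

def cosine_jac(x, p):
    amp, period, phase = p
    theta = 2 * math.pi * x / period + phase
    return [math.cos(theta),
            amp * math.sin(theta) * 2 * math.pi * x / period ** 2,
            -amp * math.sin(theta)]

xs = list(range(48))
ys = [cosine(x, (2.0, 24.0, 0.5)) for x in xs]          # noise-free synthetic data
fit = levenberg_marquardt(cosine, cosine_jac, xs, ys, [1.5, 23.5, 0.3])
print(fit)
```

As with any local method, the starting values matter: the sketch starts close to the true period, which in the FFT NLLS setting is what the preceding Fourier step is meant to provide.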


Algebra

linear algebra, series, groups, symbolic manipulations

linear algebra
  Octave (www.octave.org), Scilab (www.scilab.org)
  arpack, lapack, scalapack, blas, atlas
  www.netlib.org/lapack/
  math-atlas.sourceforge.net

algebra
  Maxima, GAP, Pari-GP, Axiom, R, GSL, NumPy
  maxima.sourceforge.net
  www.gnu.org/software/gsl/


Graphics

general visualization software

2D / graphs / diagrams
  Graphviz (www.graphviz.org)
  GD, Gnuplot, PLplot
  XFig, Dia

3D / OpenGL graphics
  VTK (www.vtk.org)
  OpenSceneGraph (www.openscenegraph.org)
  Pov-Ray (www.povray.org)
  Blender, DataExplorer, Mayavi, OpenInventor, etc.


VTK / Tcl interface

#!/usr/bin/env wish8.4
package require vtk
# provides the .vtkInteract window used by the UserEvent observer below
package require vtkinteraction

# geometry sources
vtkSphereSource sphere
sphere SetRadius 1.0
sphere SetCenter 1 1 1
sphere SetThetaResolution 8
sphere SetPhiResolution 8

vtkConeSource cone
cone SetHeight 3.0
cone SetRadius 1.0
cone SetResolution 10

# mappers turn the polygonal data into graphics primitives
vtkPolyDataMapper sphereMapper
sphereMapper SetInput [sphere GetOutput]
vtkPolyDataMapper coneMapper
coneMapper SetInput [cone GetOutput]

# actors place the mapped geometry in the scene
vtkActor sphereActor
sphereActor SetMapper sphereMapper

vtkActor coneActor
coneActor SetMapper coneMapper

vtkRenderer ren
ren AddActor sphereActor
ren AddActor coneActor
ren SetBackground 0.9 0.9 0.9

vtkRenderWindow renWin
renWin AddRenderer ren
renWin SetSize 300 300

# mouse interaction with trackball-style camera control
vtkRenderWindowInteractor iren
iren SetRenderWindow renWin
vtkInteractorStyleTrackballCamera style
iren SetInteractorStyle style
iren AddObserver UserEvent \
    {wm deiconify .vtkInteract}
iren Initialize
wm withdraw .


VTK example

VTK usage
  suitable for complex 3D data visualization
  interactive, screen export, many algorithms
  C++ libs, interface to Python, Tcl/Tk, Java


MM/MD

software for molecular modelling

classical MM/MD
  Gromacs (www.gromacs.org)
  Tinker, NAMD/VMD

molecular visualization
  RasMol, Raster3D, VieMol, Garlic, PyMol
  www.openrasmol.org
  pymol.sourceforge.net

file tools
  OpenBabel - data formats conversion
  openbabel.sourceforge.net


Sequences

standard sequence comparison / manipulation tools

Blast
  www.ncbi.nlm.nih.gov/blast/
  blast.wustl.edu

Emboss
  emboss.sourceforge.net
  the European Molecular Biology Open Software Suite
  usage for:
    sequence alignment, database search, motif identification,
    sequence patterns, presentation tools

Clustal X/W, Phylip, Molphy, fastDNAml
  multiple sequence alignment, phylogenies


Sequence profiles

hidden-stochastic approaches

HMM methods
  hmmer
    hmmer.janelia.org
    emboss.sourceforge.net/embassy/hmmer/
  build and calibrate models
  align and extract sequences

CM methods
  Rfam
    rfam.janelia.org
    www.sanger.ac.uk/Software/Rfam/
  sequence alignments, covariance models


Expression clustering

Cluster
  program, C library, Python/Perl interface
  bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
  clustering gene expression data
  k-means clusters, hierarchical clustering
  original software by Eisen
  rana.lbl.gov/EisenSoftware.htm
  cluster visualization by treeView
  jtreeview.sourceforge.net

Lingua
  R-system package for gene expression data-mining
  www.bioplexity.org
  relation search and clustering
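As an illustration of the k-means clustering mentioned for the Cluster software, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the Cluster library's API; the function name `kmeans`, the fixed iteration count, and the tiny expression profiles are made-up assumptions for this example.

```python
def kmeans(points, centres, iterations=20):
    """Lloyd's algorithm on tuples of floats; `centres` holds the initial guesses."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            # Assign each point to the centre with the smallest squared distance.
            nearest = min(range(len(centres)),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[nearest].append(p)
        # Move each centre to the mean of its members (unchanged if it has none).
        centres = [tuple(sum(col) / len(members) for col in zip(*members))
                   if members else centre
                   for members, centre in zip(clusters, centres)]
    return centres, clusters

# Hypothetical expression profiles of four genes over three conditions.
profiles = [(1.0, 1.1, 0.9), (0.9, 1.0, 1.2), (5.0, 5.2, 4.9), (5.1, 4.8, 5.0)]
centres, clusters = kmeans(profiles, centres=[profiles[0], profiles[2]])
print(centres)
print(clusters)
```

The result depends on the initial centres; production tools usually repeat the procedure from several random starts and keep the best partition.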


Open source sites

bioinformatics open-source software sites

common / bioinformatics repositories
  bioinformatics.org
    lists on bioinformatics software, databases, news
  sourceforge.net
    general repository of open-source software

common / bioinformatics projects
  www.r-project.org
    general statistics and microarray analysis software
  www.open-bio.org
    biological sequences oriented scripting tools


Statistics

statistics methods

branches
  exploratory / descriptive statistics
    data characteristics, such as mean, variance, etc.
  confirmatory / inferential statistics
    comparing achieved p-values to the α significance level (0.05)

parametric methods
  when we assume a known class of (usually normal) distributions of random errors
  example: Student’s t-test

robust methods
  tests without assumption of a distribution
  usually safe, but may have low power to detect differences
  example: quantile tests


R system

the standard open-source statistics software

description
  system for statistical computing
  with many statistical tests, modelling, time-series, etc.
  graphics with suitable 2D/3D plots
  data models on matrices, arrays, data-frames
  specific functionality by CRAN packages

about
  not for string processing (use Perl/Python/Ruby),
  not for internal processing of large databases (use respective DBMS)
  originally S system, now R and S-PLUS systems
  used commonly for bioinformatics, biostatistics, econometrics


R syntax

vectors, matrices, arrays for regular data
data frames: matrix-like structures for database-like tables,
i.e. particular columns of possibly different types

  z1 <- c(2.3, 3.5, 12.1, 4.9, 8.2)
  sum(z1)/length(z1); mean(z1); var(z1)
  z2 <- 2*z1 - 1
  z3 <- array(c(1:24), c(4,6))
  z3[1, 3:5] <- NA
  z3[is.na(z3)] <- 0

  f <- function (x1, x2) {
      x3 <- (x1 * x2)^0.5
      x3
  }
  f(2,3)


R statistical models

dependency description formulae

~ operator for model definition
  Y ~ X          the Y response depends on X
  Y ~ X1 + X2    the Y depends on both X1 and X2
  Y ~ X1 - X2    the Y depends on X1, not on X2

linear regression of y by x:
  x <- c(2.3, 3.5, 12.1, 4.9, 8.2)
  y <- c(4.3, 5.6, 30.0, 12.5, 20.7)
  y ~ x

classification analysis of variance:
  av <- state <- c("one", "one", "two", "one", "two")
  A <- factor(av)
  y ~ A

classification analysis of covariance:
  y ~ A + x
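What the formula y ~ x asks for in the simplest case, a straight line fitted by ordinary least squares, can be written out by hand. The sketch below is in Python rather than R and uses the closed-form solution for one predictor; the function name `least_squares_line` is made up for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# The x and y vectors from the slide.
x = [2.3, 3.5, 12.1, 4.9, 8.2]
y = [4.3, 5.6, 30.0, 12.5, 20.7]
intercept, slope = least_squares_line(x, y)
print(f"y = {intercept:.3f} + {slope:.3f} * x")
```

The model formula only names the dependency; the estimation itself is done by a fitting function such as lm in R.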


R example

simple R usage on (statistics) problems

linear regression
  lm(formula = y ~ x)

analysis of variance
  aov(formula = y ~ A)

Student’s t-test
  t.test(c(0.1, 0.11, 0.9, 0.8), c(2.1, 2.0, 1.5))

graphics
  plot(sin, 0, 7)
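For the two-sample call above, R's t.test by default does not assume equal variances (the Welch variant). The statistic behind it can be sketched with the Python standard library; the function name `welch_t` is made up here, and the p-value is left out because it needs the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t([0.1, 0.11, 0.9, 0.8], [2.1, 2.0, 1.5])
print(t, df)
```

A negative statistic here simply reflects that the first sample has the smaller mean.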


R data

methods for reading / writing data

file content:
        col1  col2  col3  ...
  row1  1.2   8.5   -2.0  ...
  row2  2.2   -6.1  3.2   ...
  ...

tabular data
  read.table("file", header = TRUE, row.names = 1)
  write.table(dataframe)

data import
  package foreign - for e.g. Octave data

relational databases
  packages RPgSQL, RdbiPgSQL, RSQLite, PL/R

BioConductor
  for microarray data


R extensions

R packages system

extending R
  usage of the R language and compiled languages C/C++, Fortran

extern interface
  Z <- .Fortran("fncnam", ..., PACKAGE="pkg")
  Z <- .C("functionname", ..., PACKAGE="pkg")

      subroutine fncnam(matrix, size1, size2, result)
      integer size1, size2
      double precision matrix(size1,size2), result
      ...
      result = 3.14
      end


R networking

parallel programming interface for R

R snow
  simple network of workstations
  can be used with MPI, PVM, sockets

functions
  library(snow)
  cl <- makeCluster(2, type = "MPI")
  clusterCall(cl, function() Sys.info())
  clusterEvalQ(cl, library(boot))
  clusterApply(cl, 1:2, get("+"), 3)
  stopCluster(cl)


R packages

the comprehensive R archive network

CRAN
  archive of packages for R
    survival models, time-series
    bootstrapping, sampling
    various clustering methods
    database interfaces
    quadratic programming

bioinformatics
  microarray processing
  bioconductor.org
  bioplexity.org


Python

the current standard open-source scripting language

www.python.org

characteristics
  can be viewed as a usable Java replacement
  dynamic, object-oriented, extensible language
  gluing tool with many usable packages
  high-level language, not for low-level work

package interfaces
  user interface: TkInter, wxPython
  database: DBI, SQLAlchemy, SQLObject
  algorithms: Boost library interface
  number crunching: NumPy/SciPy, MPI, RPy


Python example

sample python code - usage of VTK libraries

import vtk

def setImageWrite(self, ftyp, fname):
    if ("PS" == ftyp):
        wobj = vtk.vtkGL2PSExporter()
        wobj.SetFilePrefix(fname)
        wobj.SetRenderWindow(self.renWin)
        wobj.Write()
    else:
        wobj = vtk.vtkPNGWriter()
        ffname = fname + ".png"
        w2i = vtk.vtkWindowToImageFilter()
        w2i.SetInput(self.renWin)
        w2i.Update()
        wobj.SetFileName(ffname)
        wobj.SetInput(w2i.GetOutput())
        self.renWin.Render()
        wobj.Write()
    return


RPy

Python interface to R

rpy.sourceforge.net

package description
  simple / robust interface
  R objects available in Python
  all the R functions available
  R modules available as well

package usage

  import rpy
  rpy.r.t_test([0.1, 0.11, 0.9, 0.8], [2.1, 2.0, 1.5])
  rpy.r.plot(rpy.r.sin, 0, 7)


Items to remember

Nota bene:

programming approaches

Software
  scripting, algebra systems
  molecular, bioinformatic tools

R system
  statistics models, data
  packages, python interface
