
Post on 28-Mar-2018



Bioinformatics - Lecture 10

Bioinformatics: Software support

Martin Saturka

http://www.bioplexity.org/lectures/

EBI version 0.4

Creative Commons Attribution-Share Alike 2.5 License

Martin Saturka www.Bioplexity.org Bioinformatics - Software

Software for bioinformatics

Common computation tools and systems in bioinformatics:
- numerical, algebraic and statistical software
- computation systems specific to bioinformatics

Main topics

general software
- scripting, licenses
- mpi, sse, gpgpu

scientific tools
- emboss, 3D structures
- algebra, regression, graphs

R system
- syntax, statistics
- packages, examples


Open source

IP: copyrights, trademarks, patents

software licenses
- public domain
- BSD, MIT
- LGPL, GPL

multimedia, texts
- FDL
- CC - by, sa, nd, nc

open source licensing
- Open Source Initiative: www.opensource.org/licenses/
- Creative Commons: creativecommons.org, sciencecommons.org


Programming

theoretical systems and actual languages

approaches
- imperative: most standard programming languages
- declarative: functional, logic, constraint programming

languages
- compiled, for low-level work: C/C++, Fortran
- interpreted: Python, Tcl/Tk, Perl, Ruby, PHP, Lisp


Floating point

precision
- single, double, extended precision, double-double, quadruple

hardware
- FPU: x87, RISC, pipelines
- SIMD: altivec, sse, gpgpu

inverse square root 'magic'

    float InvSqrt(float x) {
        float xhalf = 0.5f*x;
        int i = *(int*)&x;          // float → bits (assumes 32-bit int and float)
        i = 0x5f3759df - (i>>1);    // guess on result value
        x = *(float*)&i;            // bits → float
        x = x*(1.5f-xhalf*x*x);     // result value adjusting (one Newton step)
        return x;
    } // relative error below 0.002
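The bit trick can also be checked outside C. The following is a minimal Python sketch, assuming IEEE 754 single precision and using the standard `struct` module to reinterpret the bits. The names `inv_sqrt` and `max_relative_error` are illustrative only, and the final adjusting step runs in double precision here, so the last digits may differ from the C version.

```python
import math
import struct

def inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the bit-level initial guess."""
    xhalf = 0.5 * x
    i = struct.unpack("<i", struct.pack("<f", x))[0]   # float -> bits
    i = 0x5F3759DF - (i >> 1)                          # guess on result value
    y = struct.unpack("<f", struct.pack("<i", i))[0]   # bits -> float
    return y * (1.5 - xhalf * y * y)                   # result value adjusting

def max_relative_error(samples):
    """Largest relative deviation from the exact 1/sqrt(x) over the samples."""
    return max(abs(inv_sqrt(x) * math.sqrt(x) - 1.0) for x in samples)

# Sample grid chosen for illustration: 0.01, 0.02, ..., 100.0
worst = max_relative_error([0.01 * k for k in range(1, 10001)])
print(worst)
```

The printed value can be compared against the 0.002 bound quoted on the slide.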


MPI

parallel programming

methods
- MPI, PVM, threads
- www.open-mpi.org

    #include <mpi.h>
    ...
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    ...
    MPI_Send(msg, ln, MPI_INT, dest, tag, MPI_COMM_WORLD);
    MPI_Recv(msg, ln, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &st);
    ...
    MPI_Finalize();


Algebraic methods

non-trivial computational / numerical methods

hashing - data storage
- perfect hashing: collision-free function from a given constant set of strings to an interval
- cuckoo hashing: simple implementation, usage of two hash functions

direct minimization
- linear programming: used e.g. for robust (median) regression
- quadratic programming: used e.g. for SVM - support vector machines

eigen problems
- eigen-vectors as linear data approximation
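Cuckoo hashing, as listed above, can be sketched in a few lines. This is a minimal illustration and not a production data structure: the class name `CuckooHashSet`, the seed scheme, and the growth policy are assumptions of this sketch, and Python's built-in `hash` on a `(seed, key)` tuple stands in for two independent hash functions.

```python
class CuckooHashSet:
    """Set of strings stored in two tables, one hash function per table."""

    def __init__(self, capacity: int = 11):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.seeds = [0, 1]   # hypothetical seeds selecting the two hash functions

    def _slot(self, which: int, key: str) -> int:
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key: str) -> bool:
        # A lookup inspects at most two slots.
        return any(self.tables[w][self._slot(w, key)] == key for w in (0, 1))

    def add(self, key: str) -> None:
        if key in self:
            return
        which = 0
        for _ in range(2 * self.capacity):      # bounded eviction chain
            slot = self._slot(which, key)
            # Place the key and pick up whatever occupied the slot before.
            key, self.tables[which][slot] = self.tables[which][slot], key
            if key is None:
                return
            which = 1 - which                   # evicted key moves to its other table
        self._grow(key)                         # chain too long: rebuild larger tables

    def _grow(self, pending: str) -> None:
        stored = [k for t in self.tables for k in t if k is not None] + [pending]
        self.capacity = 2 * self.capacity + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self.seeds = [s + 2 for s in self.seeds]   # switch to fresh hash functions
        for k in stored:
            self.add(k)

# Illustrative usage with made-up identifiers.
genes = CuckooHashSet(capacity=3)
names = [f"gene{i}" for i in range(200)]
for name in names:
    genes.add(name)
```

Every lookup touches one slot in each table, which is what makes the scheme attractive for read-heavy storage.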


FFT NLLS

Fast Fourier transform with non-linear least squares fitting

usage
- biological cycle / rhythm study: period determination, run description

steps
- detrending: arithmetic mean subtracting
- normalization: variation unification
- FFT: taking the greatest value
- NLLS: (cosine) curve fitting to data
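The first three steps can be sketched as follows. This is a minimal illustration under stated assumptions: a naive discrete Fourier transform stands in for a real FFT, the series is evenly sampled and not constant, and the function name `dominant_period` and the synthetic 24-hour series are hypothetical. The NLLS refinement of the period is left out.

```python
import cmath
import math

def dominant_period(values, dt=1.0):
    """Return the period of the strongest frequency component of the series."""
    n = len(values)
    mean = sum(values) / n
    detrended = [v - mean for v in values]                     # detrending
    sd = math.sqrt(sum(d * d for d in detrended) / (n - 1))
    normalized = [d / sd for d in detrended]                   # normalization
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):                             # k = 0 is the removed mean
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(normalized))
        power = abs(coeff) ** 2
        if power > best_power:                                 # taking the greatest value
            best_k, best_power = k, power
    return n * dt / best_k

# Synthetic example: four days of hourly samples with a 24-hour rhythm.
series = [5.0 + 2.0 * math.cos(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(series, dt=1.0)
print(period)
```

The transform only resolves periods of the form n*dt/k, which is one reason a curve-fitting step is used afterwards to refine the estimate.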


Data transformations

initial data preparation, peak shapes
- interpolated original data
- detrended and standardized data
- adjusting after the FFT and period determination are done


Non-linear fitting

regression with a class of non-linear functions

Levenberg-Marquardt method
- small errors generally approximated by quadratics
- iterative method - smooth interpolation between the steepest descent and the inverse Hessian method
- second derivatives give 'order' information, first derivatives give the minimizing direction
- damped iteration along the first derivatives

software
- implemented in many packages
- R system, GSL library, Octave, etc.
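The damping idea can be illustrated on a two-parameter model. The sketch below assumes the model y = a * exp(-b * t), which is chosen only because its 2x2 normal equations can be solved by hand; the function name, the starting values, and the factor of 10 for adjusting the damping are assumptions of this sketch and not taken from any of the packages named above.

```python
import math

def levenberg_marquardt(t, y, a, b, iterations=200, lam=1e-3):
    """Fit y ≈ a * exp(-b * t) by damped Gauss-Newton steps."""
    def cost(a, b):
        return sum((a * math.exp(-b * ti) - yi) ** 2 for ti, yi in zip(t, y))

    current = cost(a, b)
    for _ in range(iterations):
        # Accumulate J^T J (approximate Hessian) and J^T r (gradient).
        haa = hab = hbb = ga = gb = 0.0
        for ti, yi in zip(t, y):
            e = math.exp(-b * ti)
            r = a * e - yi                # residual
            ja, jb = e, -a * ti * e       # first derivatives w.r.t. a and b
            haa += ja * ja
            hab += ja * jb
            hbb += jb * jb
            ga += ja * r
            gb += jb * r
        # Damped system (H + lam * diag(H)) * delta = -g, solved by Cramer's rule.
        daa, dbb = haa * (1 + lam), hbb * (1 + lam)
        det = daa * dbb - hab * hab
        da = (-ga * dbb + gb * hab) / det
        db = (-gb * daa + ga * hab) / det
        if abs(da) + abs(db) < 1e-12:     # step negligible: converged
            break
        trial = cost(a + da, b + db)
        if trial < current:               # accept: lean toward the inverse-Hessian step
            a, b, current = a + da, b + db, trial
            lam /= 10
        else:                             # reject: lean toward steepest descent
            lam *= 10
    return a, b, current

# Noise-free synthetic data generated with a = 2.0 and b = 0.5.
times = [0.5 * i for i in range(9)]
data = [2.0 * math.exp(-0.5 * ti) for ti in times]
a_fit, b_fit, sse = levenberg_marquardt(times, data, a=1.0, b=1.0)
print(a_fit, b_fit, sse)
```

A large damping value shrinks the step toward a scaled gradient direction, while a small one recovers the Gauss-Newton step, which is the interpolation described on the slide.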


Algebra

linear algebra, series, groups, symbolic manipulations

linear algebra
- Octave (www.octave.org), Scilab (www.scilab.org)
- arpack, lapack, scalapack, blas, atlas
- www.netlib.org/lapack/
- math-atlas.sourceforge.net

algebra
- Maxima, GAP, Pari-GP, Axiom, R, GSL, NumPy
- maxima.sourceforge.net
- www.gnu.org/software/gsl/


Graphics

general visualization software

2D / graphs / diagrams
- Graphviz (www.graphviz.org)
- GD, Gnuplot, PLplot
- XFig, Dia

3D / OpenGL graphics
- VTK (www.vtk.org)
- OpenSceneGraph (www.openscenegraph.org)
- Pov-Ray (www.povray.org)
- Blender, DataExplorer, Mayavi, OpenInventor, etc.


VTK / Tcl interface

    #!/usr/bin/env wish8.4
    package require vtk
    package require vtkinteraction   ;# provides the .vtkInteract window used below

    # geometry sources
    vtkSphereSource sphere
    sphere SetRadius 1.0
    sphere SetCenter 1 1 1
    sphere SetThetaResolution 8
    sphere SetPhiResolution 8

    vtkConeSource cone
    cone SetHeight 3.0
    cone SetRadius 1.0
    cone SetResolution 10

    # mappers and actors
    vtkPolyDataMapper sphereMapper
    sphereMapper SetInput [sphere GetOutput]
    vtkPolyDataMapper coneMapper
    coneMapper SetInput [cone GetOutput]

    vtkActor sphereActor
    sphereActor SetMapper sphereMapper

    vtkActor coneActor
    coneActor SetMapper coneMapper

    # renderer, window and interactor
    vtkRenderer ren
    ren AddActor sphereActor
    ren AddActor coneActor
    ren SetBackground 0.9 0.9 0.9

    vtkRenderWindow renWin
    renWin AddRenderer ren
    renWin SetSize 300 300

    vtkRenderWindowInteractor iren
    iren SetRenderWindow renWin
    vtkInteractorStyleTrackballCamera style
    iren SetInteractorStyle style
    iren AddObserver UserEvent {wm deiconify .vtkInteract}
    iren Initialize
    wm withdraw .


VTK example

VTK usage
- suitable for complex 3D data visualization
- interactive, screen export, many algorithms
- C++ libs, interface to Python, Tcl/Tk, Java


MM/MD

software for molecular modelling

classical MM/MD
- Gromacs (www.gromacs.org)
- Tinker, NAMD/VMD

molecular visualization
- RasMol, Raster3D, ViewMol, Garlic, PyMol
- www.openrasmol.org
- pymol.sourceforge.net

file tools
- OpenBabel - data formats conversion
- openbabel.sourceforge.net


Sequences

standard sequence comparison / manipulation tools

Blast
- www.ncbi.nlm.nih.gov/blast/
- blast.wustl.edu

Emboss
- emboss.sourceforge.net
- the European Molecular Biology Open Software Suite
- usage for: sequence alignment, database search, motif identification, sequence patterns, presentation tools

Clustal X/W, Phylip, Molphy, fastDNAml
- multiple sequence alignment, phylogenies


Sequence profiles

hidden-stochastic approaches

HMM methods
- hmmer
- hmmer.janelia.org
- emboss.sourceforge.net/embassy/hmmer/
- build and calibrate models
- align and extract sequences

CM methods
- Rfam
- rfam.janelia.org
- www.sanger.ac.uk/Software/Rfam/
- sequence alignments, covariance models


Expression clustering

Cluster
- program, C library, Python/Perl interface
- bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
- clustering gene expression data
- k-means clusters, hierarchical clustering
- original software by Eisen: rana.lbl.gov/EisenSoftware.htm
- cluster visualization by treeView: jtreeview.sourceforge.net

Lingua
- R-system package for gene expression data-mining
- www.bioplexity.org
- relation search and clustering
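The k-means clustering mentioned above can be sketched in plain Python. This is an illustration of the general technique only, not the algorithm as implemented in Cluster or Lingua; the function names, the choice of the first k profiles as starting centroids, and the tiny expression table are assumptions of this sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, k, max_iterations=100):
    """Assign each profile to one of k clusters; returns (labels, centroids)."""
    centroids = [list(p) for p in profiles[:k]]   # assumption: first k profiles as seeds
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every profile.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:                  # assignments stable: done
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:                           # keep old centroid if cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up expression profiles: four genes measured under three conditions.
expression = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [1.2, 0.9, 1.0],
    [5.1, 4.8, 5.0],
]
labels, centroids = kmeans(expression, k=2)
print(labels)
```

Real tools typically use random restarts and other distance measures, such as correlation-based ones, which this sketch omits.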


Open source sites

bioinformatics open-source software sites

common / bioinformatics repositories
- bioinformatics.org: lists on bioinformatics software, databases, news
- sourceforge.net: general repository of open-source software

common / bioinformatics projects
- www.r-project.org: general statistics and microarray analysis software
- www.open-bio.org: biological sequences oriented scripting tools


Statistics

statistics methods

branches
- explorative / descriptive statistics: data characteristics, such as mean, variance, etc.
- confirmative / inferential statistics: comparing achieved p-values to the α significance level (0.05)

parametric methods
- when we assume a known class of (usually normal) distributions of random errors
- example: Student's t-test

robust methods
- tests without assumption of a distribution
- usually safe, but can be weak at distinguishing
- example: quantile tests


R system

the standard open-source statistics software

description
- system for statistical computing
- with many statistical tests, modelling, time-series, etc.
- graphics with suitable 2D/3D plots
- data models on matrices, arrays, data-frames
- specific functionality by CRAN packages

about
- not for string processing (use Perl/Python/Ruby)
- not for internal processing of large databases (use respective DBMS)
- originally S system, now R and S-PLUS systems
- used commonly for bioinformatics, biostatistics, econometrics


R syntax

vectors, matrices, arrays for regular data
data frames: matrix-like structures for database-like tables, i.e. particular columns of possibly different types

    z1 <- c(2.3, 3.5, 12.1, 4.9, 8.2)
    sum(z1)/length(z1); mean(z1); var(z1)
    z2 <- 2*z1 - 1
    z3 <- array (c(1:24), c(4,6))
    z3[1, 3:5] <- NA
    z3[is.na(z3)] <- 0

    f <- function (x1, x2) {
        x3 <- (x1 * x2)^0.5
        x3
    }
    f(2,3)


R statistical models

dependency description formulae

~ operator for model definition
- Y ~ X: the Y response depends on X
- Y ~ X1 + X2: the Y depends on both X1 and X2
- Y ~ X1 - X2: the Y depends on X1, not on X2

linear regression of y by x:

    x <- c(2.3, 3.5, 12.1, 4.9, 8.2)
    y <- c(4.3, 5.6, 30.0, 12.5, 20.7)
    y ~ x

classification analysis of variance:

    av <- state <- c("one", "one", "two", "one", "two")
    A <- factor(av)
    y ~ A

classification analysis of covariance:

    y ~ A + x
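What a model such as y ~ x asks for can be made concrete with ordinary least squares. The sketch below, in plain Python, fits an intercept and a slope to the x and y vectors from the slide; the function name `fit_line` is illustrative, and the closed-form formulas are the textbook ones rather than the numerical procedure R itself uses.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Data from the slide.
x = [2.3, 3.5, 12.1, 4.9, 8.2]
y = [4.3, 5.6, 30.0, 12.5, 20.7]
intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(intercept, slope)
```

With an intercept in the model the residuals sum to zero up to rounding, which gives a quick sanity check on any fit.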


R example

simple R usage on (statistics) problems

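linear regression

    lm(formula = y ~ x)

analysis of variance

    aov(formula = y ~ A)

Student's t-test

    # t.test uses the Welch variant by default;
    # add var.equal = TRUE for the classical Student's test
    t.test(c(0.1, 0.11, 0.9, 0.8), c(2.1, 2.0, 1.5))

graphics

    plot(sin, 0, 7)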
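The statistic behind the two-sample test can be computed by hand. The following Python sketch evaluates the Welch form of the t statistic and its approximate degrees of freedom for the two samples used above. The function name `welch_t` is illustrative, and the p-value is omitted because it needs the t distribution function, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)      # squared standard error of mean(a)
    vb = variance(b) / len(b)      # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Samples from the slide.
first = [0.1, 0.11, 0.9, 0.8]
second = [2.1, 2.0, 1.5]
t_stat, dof = welch_t(first, second)
print(t_stat, dof)
```

A negative statistic here simply reflects that the first sample has the smaller mean.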


R data

methods for reading / writing data

file content:

          col1  col2  col3 ...
    row1   1.2   8.5  -2.0 ...
    row2   2.2  -6.1   3.2 ...
    ...

tabular data

    read.table("file", header = TRUE, row.names = 1)
    write.table(dataframe)

data import
- package foreign - for e.g. Octave data

relational databases
- packages RPgSQL, RdbiPgSQL, RSQLite, PL/R

BioConductor
- for microarray data
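For comparison with a scripting language, the same file layout (a header of column names, then rows that start with a row name) can be parsed with a few lines of standard-library Python. The function name `read_table` and the in-memory sample are assumptions of this sketch; the sample holds only the values visible on the slide, with the elided columns and rows left out.

```python
import io

def read_table(stream):
    """Parse a whitespace-separated table with a header line and row names."""
    header = stream.readline().split()
    table = {}
    for line in stream:
        fields = line.split()
        if not fields:                      # skip blank lines
            continue
        name = fields[0]
        values = [float(v) for v in fields[1:]]
        table[name] = dict(zip(header, values))
    return table

# In-memory stand-in for the file shown above.
sample = io.StringIO(
    "col1 col2 col3\n"
    "row1 1.2 8.5 -2.0\n"
    "row2 2.2 -6.1 3.2\n"
)
table = read_table(sample)
print(table["row2"]["col2"])
```

The header has one field fewer than the data rows, which is also the layout from which R's read.table infers that the first column holds row names.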


R extensions

R packages system

extending R
- usage of the R language and compiled languages C/C++, Fortran

extern interface

    Z <- .Fortran("fncnam", ..., PACKAGE="pkg")
    Z <- .C("functionname", ..., PACKAGE="pkg")

          subroutine fncnam(matrix, size1, size2, result)
          integer size1, size2
          double precision matrix(size1,size2), result
          ...
          result = 3.14
          end


R networking

parallel programming interface for R

R snow
- simple network of workstations
- can be used with MPI, PVM, sockets

functions

    library(snow)
    cl <- makeCluster(2, type = "MPI")
    clusterCall(cl, function() Sys.info())
    clusterEvalQ(cl, library(boot))
    clusterApply(cl, 1:2, get("+"), 3)
    stopCluster(cl)


R packages

the comprehensive R archive network

CRAN
- archive of packages for R
- survival models, time-series
- bootstrapping, sampling
- various clustering methods
- database interfaces
- quadratic programming

bioinformatics
- microarray processing
- bioconductor.org
- bioplexity.org


Python

the current standard open-source scripting language

www.python.org

characteristics
- can be viewed as a usable Java replacement
- dynamic, object-oriented, extensible language
- gluing tool with many usable packages
- high-level language, not for low-level work

package interfaces
- user interface: TkInter, wxPython
- database: DBI, SQLAlchemy, SQLObject
- algorithms: Boost library interface
- number crunching: NumPy/SciPy, MPI, RPy


Python example

sample python code - usage of VTK libraries

    import vtk

    # method of a class that keeps its render window in self.renWin
    def setImageWrite(self, ftyp, fname):
        if ("PS" == ftyp):
            # vector export of the scene through GL2PS
            wobj = vtk.vtkGL2PSExporter()
            wobj.SetFilePrefix(fname)
            wobj.SetRenderWindow(self.renWin)
            wobj.Write()
        else:
            # raster export: grab the window contents and write a PNG file
            wobj = vtk.vtkPNGWriter()
            ffname = fname + ".png"
            w2i = vtk.vtkWindowToImageFilter()
            w2i.SetInput(self.renWin)
            w2i.Update()
            wobj.SetFileName(ffname)
            wobj.SetInput(w2i.GetOutput())
            self.renWin.Render()
            wobj.Write()
        return


RPy

Python interface to R

rpy.sourceforge.net

package description
- simple / robust interface
- R objects available in Python
- all the R functions available
- R modules available as well

package usage

    import rpy
    rpy.r.t_test([0.1, 0.11, 0.9, 0.8], [2.1, 2.0, 1.5])
    rpy.r.plot(rpy.r.sin, 0, 7)


Items to remember

Nota bene:

programming approaches

Software
- scripting, algebra systems
- molecular, bioinformatic tools

R system
- statistics models, data
- packages, python interface
