BioLib Development Report (BOSC2009)

C and C++ libraries for BioPerl, BioJAVA, BioPython, BioRuby...

Pjotr Prins (pjotr.prins at wur.nl)

Wageningen University, Dept. of Nematology; Groningen Bioinformatics Center

BioLib Development Report (BOSC 2009) – p. 1

The stated problem

Many high-level languages used in Biology (Perl, R, Java...)

Duplication of effort in all Bio* efforts - BioPerl, BioConductor, BioJAVA...

in particular for data IO/parsing/interpretation (Alan’s keynote)

What if?

What if you need some functionality (e.g. linear regression) in Perl? You can

Roll your own in Perl (performance?)

Bind against existing clib using Perl-XS (ugh)

Bind using SWIG (better, but one-off like Perl::GSL)

Bind using SWIG with Biolib (all languages)

In fact, it may already be there (GSL or Rlib)
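
To make the first option concrete, here is a minimal sketch of what "roll your own" linear regression looks like, written in Python rather than Perl. The function name and the sample data are illustrative assumptions, not part of BioLib; the point is only that every Bio* project ends up maintaining and testing code like this separately.

```python
from typing import Sequence, Tuple


def linear_regression(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

A pure-interpreter loop like this is easy to write but slow on large data sets, which is the performance question raised on the slide.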

DRY-DRO

Do not repeat yourself (DRY)

Do not repeat ourselves (DRO)

Bio*: BioPerl, BioPython, BioRuby, BioJAVA, BioConductor, BioHaskell, BioCPP, ...

Limited pool of programmers in bioinformatics

Usually 2 or 3 competing implementations

Use existing implementations

Why bother?

Open Source Software is about eyes

Eyes!

Eyes like these!

Eyes (3)

Eyes like these!...

Eyes (5)

Well, realistically...

BioLib project

Objectives:

Utilize existing C/C++ libraries

Create mappings to all Bio* languages

Focus on correctness and performance

A central place (plumbing)

An OBF affiliated project

Power Trio

Plumbing power trio:

Git - modular version control

CMake - make file generator

SWIG - simplified wrapper and interface generator

Power trio (1)

GIT

Version control on steroids

What source control should be

Easy branching of development

Submodules

Power trio (2)

CMake

Generator for make files

Very modular approach

Resolves complex dependencies

Looks like a simple programming language

Easy on the eyes and mind

Power trio (3)

SWIG

Code generator for mappings done right:

Rules for generating code

Macros (DRY)

Pattern matching

Flexible

Supports many languages

Achievements (year one)

Affyio: Affymetrix arrays (357 methods; 10K lines)

Staden: Sequencer trace files (95; 16K)

GSL: GNU Scientific Library (2702; 200K)

Rlib: R routines (> 176; 43K)

R/qtl: Quantitative genetics (> 100; 10K)*

Libsequence: Sequence analysis (> 1000; 21K)*

Bio++: Sequence analysis (> 1000; 52K)*

Code base: 350K lines; USD 10 million R&D

Source tree

|-- clibs
|   |-- affyio-1.8
|   |-- biolib_R
|   |-- biolib_microarray
|   |-- libsequence-1.6.6
|-- mappings
|   `-- swig
|       |-- perl
|       |   |-- affyio
|       |   |-- staden_io_lib
|       |   `-- test
|       |-- python
|       |-- ruby

104 directories, 668 files

Adding a C lib

Unpack C/C++ library in ./src/clibs/modulename

Add CMake file - compiles into .so shared library

Create Perl mapping in ./src/mapping/swig/perl/module

Add SWIG .i file

Add CMake file - compiles into .pm and .so shared library
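
The five steps above mostly amount to putting three files in the right places. The sketch below lays out that skeleton in a temporary directory. The helper name `scaffold_module` and the `<name>.i` file name are assumptions made for illustration; `CMakeLists.txt` is simply CMake's conventional file name, and the directory paths are copied from the slide.

```python
import tempfile
from pathlib import Path
from typing import List


def scaffold_module(root: Path, name: str) -> List[Path]:
    """Create empty placeholder files for a new C library and its Perl mapping."""
    clib_dir = root / "src" / "clibs" / name                      # unpacked C/C++ sources go here
    perl_dir = root / "src" / "mapping" / "swig" / "perl" / name  # Perl mapping lives here
    files = [
        clib_dir / "CMakeLists.txt",  # builds the .so shared library
        perl_dir / f"{name}.i",       # SWIG interface file (file name is an assumption)
        perl_dir / "CMakeLists.txt",  # builds the .pm and .so for Perl
    ]
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return files


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_module(Path(tmp), "modulename")
    layout = [p.relative_to(tmp).as_posix() for p in created]
    print("\n".join(layout))
```

The files are left empty on purpose: their contents depend on the library being wrapped.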

CMake goodies

# Defining a C library build in Biolib:
SET (M_NAME staden_io_lib)
SET (M_VERSION 1.11.6)

FIND_PACKAGE(ZLIB REQUIRED)

BUILD_CLIB()

ADD_LIBRARY(${LIBNAME} SHARED
  array.c
  compress.c
  compression.c
  ctfCompress.c
  # (...)
)

INSTALL_CLIB()

CMake for Perl

# Defining a C library mapping for Perl
SET (USE_ZLIB TRUE)
SET (USE_INCLUDEPATH io_lib)

FIND_PACKAGE(MapPerl)

POST_BUILD_PERL_BINDINGS()
TEST_PERL_BINDINGS()
INSTALL_PERL_BINDINGS()

SWIG Map

%include <Read.h>

#define TT_ANY 0
#define TT_ZTR 7

typedef struct
{
    int format;
    char *trace_name;
    int NPoints;
    int NBases;
    (...)
} Read;

Read *read_reading(char *fn, int format);
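
To show how the struct fields above become attributes in a scripting language, here is a small stand-alone sketch using Python's standard `ctypes` module. It mirrors only the four fields visible on the slide; because the rest of the struct is elided, this class does not match the real memory layout and must not be passed to the actual library. The trace name and the `NPoints` value are placeholders, while `NBases=766` is the value from the talk's example output.

```python
import ctypes

# Constants as declared in the SWIG map.
TT_ANY = 0
TT_ZTR = 7


class Read(ctypes.Structure):
    """Partial, illustrative mirror of the Read struct (elided fields omitted)."""

    _fields_ = [
        ("format", ctypes.c_int),
        ("trace_name", ctypes.c_char_p),
        ("NPoints", ctypes.c_int),
        ("NBases", ctypes.c_int),
    ]


# Fill in a record by hand; a real binding would get this from read_reading().
demo = Read(format=TT_ZTR, trace_name=b"example.ztr", NPoints=0, NBases=766)
print(demo.format, demo.trace_name, demo.NBases)
```

Writing and maintaining such a mirror by hand for every struct and every language is the repetitive work that the SWIG map is meant to generate.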

Perl

use biolib::staden_io_lib;

# $fn holds the path of the trace file to read
$result = staden_io_lib::read_reading($fn, $staden_io_lib::TT_ANY);
print("format=", staden_io_libc::Read_format_get($result), "\n");
print("NBases=", $result->{NBases}, "\n");
print("base=", staden_io_libc::Read_base_get($result), "\n");

Outputs:

format=7
NBases=766
base=NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCT
CGGTCCCAACTTAATTGTACA...

Python

import biolib.staden_io_lib as io_lib

# procsrffn holds the path of the trace file to read
result = io_lib.read_reading(procsrffn, io_lib.TT_ANY)
print(result.format)
print(result.NBases)
print(result.base)

Outputs:

7
766
NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCT
CGGTCCCAACTTAATTGTACA...
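
Since the mapped object exposes plain attributes, downstream code can be written and tested without the C library present. The sketch below assumes only the three attribute names used on the slide (`format`, `NBases`, `base`); the helper `describe_read` and the stand-in object are hypothetical, and the stand-in's values are the ones from the slide's output, with the base string cut at the first line shown.

```python
from types import SimpleNamespace


def describe_read(read) -> str:
    """Summarize any object that has format, NBases and base attributes."""
    return f"format={read.format} NBases={read.NBases} base={read.base[:10]}..."


# Stand-in for the object returned by read_reading(); not a real trace record.
stand_in = SimpleNamespace(
    format=7,
    NBases=766,
    base="NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCT",
)
summary = describe_read(stand_in)
print(summary)
```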

For the Perl coder

Adding functionality in language of choice

Easier deployment - 'install biolib-perl'

Shared correctness testing

Generated API documentation

For the authors

Independent source trees

Increased exposure (Ruby, Perl...)

Added unit/integration testing environment

Deployment, multi-platform support (Linux, OSX, Windows)

No autoconf pain (./configure and friends)

Implicit access to other libraries (GSL, Rlib)

Online generated API documentation

Future work

Automated API documentation (with doctests)

More libraries (Emboss, NCBI, ...)

New code (HPC)

More languages (JAVA, R, OCaml, ...)

Bio* integration (CPAN, Ruby gems, Python eggs)

Debian/Fedora/OSX/Windows packages

More platforms (Windows without Cygwin)
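
For readers unfamiliar with the doctests mentioned in the first item above, here is a minimal sketch of the idea using Python's standard `doctest` module: the examples in the docstring serve both as API documentation and as executable tests. The function `reverse_complement` is a hypothetical stand-in chosen for illustration, not a BioLib routine.

```python
import doctest

# Translation table for the DNA alphabet; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    >>> reverse_complement("NCTTGGG")
    'CCCAAGN'
    >>> reverse_complement("")
    ''
    """
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Extract the docstring examples and run them explicitly.
test = doctest.DocTestParser().get_doctest(
    reverse_complement.__doc__,
    {"reverse_complement": reverse_complement},
    "reverse_complement",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

If an example in the docstring drifts out of date, the run reports a failure, so the documentation cannot silently disagree with the code.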

Credits

Ben Bolstad (Affyio), James Bonfield (Staden), Karl Broman (R/qtl)

Jonathan Leto (GSL SWIG)

Xin Shuai (Google SoC libsequence)

Adam Smith (Google SoC Bio++)

Oswaldo Trelles, José Manuel Mateos-Duran and Andrés Rodríguez (UMA)

Chris Fields (BioPerl), Mark Jensen (BioPerl), Hilmar Lapp (NESCent, OBF)

Jaap Bakker (WU), Geert Smant (WU), Ritsert Jansen (GBIC)

BoF

BioLib: Birds of a Feather Session (BoF) at 16:50 hours
