Prins Bio Lib Bosc 2009

Transcript of Prins Bio Lib Bosc 2009

Page 1: Prins Bio Lib Bosc 2009

BioLib Development Report (BOSC 2009)

C and C++ libraries for BioPerl, BioJAVA, BioPython, BioRuby...

Pjotr Prins (pjotr.prins at wur.nl)

Wageningen University, Dept. of Nematology; Groningen Bioinformatics Center

BioLib Development Report (BOSC 2009) – p. 1

Page 2: Prins Bio Lib Bosc 2009

The stated problem

Many high-level languages used in Biology (Perl, R, Java...)

Duplication of effort in all Bio* efforts: BioPerl, BioConductor, BioJAVA...

In particular for data IO/parsing/interpretation (Alan's keynote)

Page 3: Prins Bio Lib Bosc 2009

What if?

What if you need some functionality (e.g. linear regression) in Perl? You can:

Roll your own in Perl (performance?)

Bind against existing clib using Perl-XS (ugh)

Bind using SWIG (better, but one-off like Perl::GSL)

Bind using SWIG with BioLib (all languages)

In fact, it may already be there (GSL or Rlib)
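
To make the first option concrete, here is a minimal sketch of a "roll your own" linear regression. The slide's example language is Perl; Python is used here purely for illustration, and the function name linear_regression is hypothetical, not part of BioLib. It shows the kind of small numerical routine that each Bio* project ends up rewriting when no shared C library is bound.

```python
def linear_regression(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Points lying exactly on y = 1 + 2x.
intercept, slope = linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)
```

A pure-language version like this is easy to write but has to be re-tested and re-optimized in every language, which is the duplication the later options avoid.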

Page 4: Prins Bio Lib Bosc 2009

DRY-DRO

Do not repeat yourself (DRY)

Do not repeat ourselves (DRO)

Bio*: BioPerl, BioPython, BioRuby, BioJAVA, BioConductor, BioHaskell, BioCPP...

Limited pool of programmers in bioinformatics

Usually 2 or 3 competing implementations

Use existing implementations

Page 5: Prins Bio Lib Bosc 2009

Why bother?

Open Source Software is about eyes

Page 6: Prins Bio Lib Bosc 2009

Eyes!

Eyes like these!

Page 7: Prins Bio Lib Bosc 2009

Eyes (3)

Eyes like these!. . .

Page 8: Prins Bio Lib Bosc 2009

Eyes (5)

Well, realistically. . .

Page 9: Prins Bio Lib Bosc 2009

BioLib project

Objectives:

Utilize existing C/C++ libraries

Create mappings to all Bio* languages

Focus on correctness and performance

A central place (plumbing)

An OBF affiliated project

Page 10: Prins Bio Lib Bosc 2009

Power Trio

Plumbing power trio:

Git - modular version control

CMake - make file generator

SWIG - simplified wrapper and interface generator

Page 11: Prins Bio Lib Bosc 2009

Power trio (1)

GIT

Version control on steroids

What source control should be

Easy branching of development

Submodules

Page 12: Prins Bio Lib Bosc 2009

Power trio (2)

CMake

Generator for make files

Very modular approach

Resolves complex dependencies

Looks like a simple programming language

Easy on the eyes and mind

Page 13: Prins Bio Lib Bosc 2009

Power trio (3)

SWIG

Code generator for mappings done right:

Rules for generating code

Macros (DRY)

Pattern matching

Flexible

Supports many languages

Page 14: Prins Bio Lib Bosc 2009

Achievements (year one)

Affyio: Affymetrix arrays (357 methods; 10K lines)

Staden: Sequencer trace files (95; 16K)

GSL: GNU Scientific Library (2702; 200K)

Rlib: R routines (> 176; 43K)

R/qtl: Quantitative genetics (> 100; 10K)*

Libsequence: Sequence analysis (> 1000; 21K)*

Bio++: Sequence analysis (> 1000; 52K)*

Code base: 350K lines; USD 10 million R&D

Page 15: Prins Bio Lib Bosc 2009

Source tree

|-- clibs
|   |-- affyio-1.8
|   |-- biolib_R
|   |-- biolib_microarray
|   |-- libsequence-1.6.6
|-- mappings
|   `-- swig
|       |-- perl
|       |   |-- affyio
|       |   |-- staden_io_lib
|       |   `-- test
|       |-- python
|       |-- ruby

104 directories, 668 files

Page 16: Prins Bio Lib Bosc 2009

Adding a C lib

Unpack C/C++ library in ./src/clibs/modulename

Add CMake file - compiles into .so shared library

Create Perl mapping in ./src/mappings/swig/perl/module

Add SWIG .i file

Add CMake file - compiles into .pm and .so shared library

Page 17: Prins Bio Lib Bosc 2009

CMake goodies

# Defining a C library build in BioLib:
SET (M_NAME staden_io_lib)
SET (M_VERSION 1.11.6)
FIND_PACKAGE(ZLIB REQUIRED)
BUILD_CLIB()
ADD_LIBRARY(${LIBNAME} SHARED
  array.c
  compress.c
  compression.c
  ctfCompress.c
  (...)
)
INSTALL_CLIB()

Page 18: Prins Bio Lib Bosc 2009

CMake for Perl

# Defining a C library mapping for Perl
SET (USE_ZLIB TRUE)
SET (USE_INCLUDEPATH io_lib)
FIND_PACKAGE(MapPerl)
POST_BUILD_PERL_BINDINGS()
TEST_PERL_BINDINGS()
INSTALL_PERL_BINDINGS()

Page 19: Prins Bio Lib Bosc 2009

SWIG Map

%include <Read.h>

#define TT_ANY 0
#define TT_ZTR 7

typedef struct
{
    int format;
    char *trace_name;
    int NPoints;
    int NBases;
    (...)
} Read;

Read *read_reading(char *fn, int format);
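
As a rough illustration of what such a mapping gives a scripting language, the sketch below mirrors the visible part of the Read struct with Python's standard ctypes module. This is a hand-written analogue, not SWIG output, and it covers only the four members shown on the slide; the real struct has further members (elided above), so this layout must not be used against the real library. The instance values are made up for the demonstration.

```python
import ctypes

# Format constants copied from the interface file above.
TT_ANY = 0
TT_ZTR = 7


class Read(ctypes.Structure):
    """Partial, illustrative mirror of the C 'Read' struct.

    Assumption: only the four members visible on the slide are declared,
    so this does NOT match the memory layout of the real struct.
    """

    _fields_ = [
        ("format", ctypes.c_int),
        ("trace_name", ctypes.c_char_p),
        ("NPoints", ctypes.c_int),
        ("NBases", ctypes.c_int),
    ]


# Build an instance by hand; a real binding would get it from read_reading().
demo = Read(format=TT_ZTR, trace_name=b"example.ztr", NPoints=0, NBases=766)
print(demo.format, demo.trace_name, demo.NBases)
```

The point of the SWIG approach is that this kind of per-struct, per-language declaration is generated from the C header instead of being written and maintained by hand.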

Page 20: Prins Bio Lib Bosc 2009

Perl

use biolib::staden_io_lib;

$result = staden_io_lib::read_reading($fn, $staden_io_lib::TT_ANY);
print("format=", staden_io_libc::Read_format_get($result), "\n");
print("NBases=", $result->{NBases}, "\n");
print("base=", staden_io_libc::Read_base_get($result), "\n");

Outputs:

format=7
NBases=766
base=NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCTCGGTCCCAACTTAATTGTACA...

Page 21: Prins Bio Lib Bosc 2009

Python

import biolib.staden_io_lib as io_lib

result = io_lib.read_reading(procsrffn, io_lib.TT_ANY)
print(result.format)
print(result.NBases)
print(result.base)

Outputs:

7
766
NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCTCGGTCCCAACTTAATTGTACA...
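
Because the mapped struct is exposed as plain attributes, downstream code can stay independent of the binding itself. The following sketch assumes only the three attributes used on this slide (format, NBases, base); the helper name summarize_read and the stub object are hypothetical and stand in for a real result of read_reading, so the snippet runs without BioLib installed.

```python
from types import SimpleNamespace


def summarize_read(read, preview=10):
    """Collect the fields printed on the slide into a dict.

    Works with any object exposing 'format', 'NBases' and 'base',
    which is an assumption about the mapped Read object.
    """
    return {
        "format": read.format,
        "NBases": read.NBases,
        "preview": read.base[:preview],
    }


# Stub standing in for the object returned by io_lib.read_reading(...).
stub = SimpleNamespace(format=7, NBases=766, base="NCTTGGGAAAGCATAAACC")
summary = summarize_read(stub)
print(summary)
```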

Page 22: Prins Bio Lib Bosc 2009

For the Perl coder

Adding functionality in language of choice

Easier deployment - 'install biolib-perl'

Shared correctness testing

Generated API documentation

Page 23: Prins Bio Lib Bosc 2009

For the authors

Independent source trees

Increased exposure (Ruby, Perl...)

Added unit/integration testing environment

Deployment, multi-platform support (Linux, OSX, Windows)

No autoconf pain (./configure and friends)

Implicit access to other libraries (GSL, Rlib)

Online generated API documentation

Page 24: Prins Bio Lib Bosc 2009

Future work

Automated API documentation (with doctests)

More libraries (Emboss, NCBI...)

New code (HPC)

More languages (JAVA, R, OCaml...)

Bio* integration (CPAN, Ruby gems, Python eggs)

Debian/Fedora/OSX/Windows packages

More platforms (Windows without Cygwin)
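
The first item, API documentation with doctests, can be pictured with Python's standard doctest module: the examples in a docstring double as documentation and as executable tests. The sketch below is only an illustration of that idea; count_bases and run_doctests are hypothetical helpers, not BioLib functions, and nothing here reflects how BioLib actually plans to generate its documentation.

```python
import doctest


def count_bases(sequence):
    """Count called (non-N) bases in a sequence string.

    >>> count_bases("NCTTGGGAAA")
    9
    >>> count_bases("")
    0
    """
    return sum(1 for base in sequence.upper() if base != "N")


def run_doctests(func):
    """Run the docstring examples of one function and return the tally."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Supply the globals explicitly so the examples can see the function.
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_doctests(count_bases)
print(results.failed, results.attempted)
```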

Page 25: Prins Bio Lib Bosc 2009

Credits

Ben Bolstad (Affyio), James Bonfield (Staden), Karl Broman (R/qtl)

Jonathan Leto (GSL SWIG)

Xin Shuai (Google SoC libsequence)

Adam Smith (Google SoC Bio++)

Oswaldo Trelles, José Manuel Mateos-Duran and Andrés Rodríguez (UMA)

Chris Fields (BioPerl), Mark Jensen (BioPerl), Hilmar Lapp (NESCent, OBF)

Jaap Bakker (WU), Geert Smant (WU), Ritsert Jansen (GBIC)

Page 26: Prins Bio Lib Bosc 2009

BoF

BioLib: Birds of a Feather Session (BoF) at 16:50 hours
