Gautier bosc2010 pythonbioconductor
Bioconductor with Python, what else?
ISMB / BOSC
Laurent Gautier
DMAC / CBS
July 10th, 2010
1 / 20
Disclaimer
• This is not about the comparative merits of scripting languages
• This is about being able to natively access libraries implemented in a different language
2 / 20
About Bioconductor
• Set of open-source packages for R
• Started circa 2002 with a focus on microarrays
• Rooted in statistics, data analysis, and visualization
• Several hundred packages, addressing NGS, HTS, flow cytometry, protein-protein interactions, ...
• Biannual releases
• Presence on the publication circuit (> 2,300 citations for the BioC publication, > 600 for limma, > 500 for affy)
3 / 20
About Python
• Simple and clear all-purpose scripting language
• Sometimes used in introductions to programming
• Popular for agile development
• Bioinformatics libraries:
  • biopython (libraries for bioinformatics)
  • galaxy (web front-end to pipelines)
  • PyCogent, pygr, bx-python (biological sequences-oriented)
• Large selection of libraries:
  • Web development: Zope, Django, Google App Engine
  • Scientific computing: Scipy / Numpy
  • Cloud computing: Disco, execnet
  • Interface with C: ctypes, Cython
4 / 20
A view on R/bioconductor and Python in bioinformatics
[Figure: diagram with three groups of labels.
 Bioinformatics data: automation, storage / retrieval, samples, microarray, NGS, annotation, flow-cytometry, proteomics, other assays, ...
 R/Bioconductor: statistical analysis, visualization, interactive programming.
 Python: non-interactive abilities, data storage / retrieval, web, algorithm development, scientific computing.
 Communities: computer scientists, physicists, biologists, statisticians.]
Python is an all-purpose scripting language.
5 / 20
Running R code from Python (an example)

Aim
Running edgeR from Python

Method
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

Data (lanes grouped as Control / Treated)
                 lane1  lane2  lane3  lane4  lane5  lane6  lane8
ENSG00000230758      0      0      1      0      0      0      0
ENSG00000182463      0      2      4      1      5      5      0
ENSG00000124208     82    124    102    136     90    120     40
ENSG00000230753      0      0      0      3      0      0      0
ENSG00000224628      7      8      8     18      8      7      1
ENSG00000125835    138    209    227    295    281    220     54
ENSG00000125834     25     31     48     56     67     61     15
ENSG00000197818     17     27     16     26     41     39      9
ENSG00000243473      0      0      0      2      0      0      0
ENSG00000226325      0      0      2      0      3      1      0
...                ...    ...    ...    ...    ...    ...    ...
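The call on the next slide passes `base.colSums(counts)` as the library size, that is, the total count per lane. As a rough pure-Python illustration of what that quantity is, the sketch below parses the first three rows of the table above and sums each lane. The `gene` header label and the helper names `parse_counts` and `library_sizes` are made up for this example; they are not part of rpy2 or edgeR.

```python
# First three rows of the count table shown on the slide.
# The "gene" label for the first column is an assumption added for parsing.
TABLE = """\
gene lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000230758 0 0 1 0 0 0 0
ENSG00000182463 0 2 4 1 5 5 0
ENSG00000124208 82 124 102 136 90 120 40
"""


def parse_counts(text):
    """Return (lane names, {gene id: [counts]}) from a whitespace-separated table."""
    lines = text.strip().splitlines()
    lanes = lines[0].split()[1:]  # drop the label of the gene column
    counts = {}
    for line in lines[1:]:
        gene, *values = line.split()
        counts[gene] = [int(v) for v in values]
    return lanes, counts


def library_sizes(lanes, counts):
    """Sum the counts of each lane over all genes (the role colSums plays in R)."""
    return {
        lane: sum(row[i] for row in counts.values())
        for i, lane in enumerate(lanes)
    }


lanes, counts = parse_counts(TABLE)
sizes = library_sizes(lanes, counts)
print(sizes)
```

With the full table, the same sum would run over every gene rather than these three rows.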
7 / 20
from rpy2.robjects.packages import importr
from bioc import edger

base = importr('base')

summarized = edger.DGEList.new(counts = counts,
                               lib_size = base.colSums(counts),
                               group = grp)
disp = edger.estimateCommonDisp(summarized)
tested = edger.exactTest(disp)
results = edger.topTags(tested)

                 logConc  logFC  PValue   FDR
ENSG00000127954   -31.03  37.97    0.00  0.00
ENSG00000151503   -12.96   5.40    0.00  0.00
ENSG00000096060   -11.78   4.90    0.00  0.00
ENSG00000091879   -15.36   5.77    0.00  0.00
ENSG00000132437   -14.15  -5.90    0.00  0.00
ENSG00000166451   -12.62   4.57    0.00  0.00
ENSG00000131016   -14.80   5.27    0.00  0.00
ENSG00000163492   -17.28   7.30    0.00  0.00
ENSG00000113594   -12.25   4.05    0.00  0.00
ENSG00000116285   -13.02   4.11    0.00  0.00
8 / 20
R code / Python code

R code:

library(edgeR)
summarized <- DGEList(counts = counts,
                      lib.size = colSums(counts),
                      group = grp)
disp <- estimateCommonDisp(summarized)

Python code:

from rpy2.robjects.packages import importr
base = importr('base')
from bioc import edger

summarized = edger.DGEList.new(counts = counts,
                               lib_size = base.colSums(counts),
                               group = grp)
disp = edger.estimateCommonDisp(summarized)

Note:
• explicit in searching through namespaces
• call R functions as native Python functions
• use R objects as Python objects
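One visible difference between the two versions is that the R argument `lib.size` is spelled `lib_size` in Python, since a dot is not valid inside a Python identifier. rpy2's `importr` applies a dot-to-underscore translation of this kind. The sketch below is only an illustrative stand-in for that idea, not rpy2's code; `translate_names` and `to_r_kwargs` are hypothetical helpers.

```python
def translate_names(r_names):
    """Map R names to Python-safe names by replacing '.' with '_'.

    Returns {python_name: r_name}. Raises ValueError when two distinct
    R names would collide after translation.
    """
    mapping = {}
    for r_name in r_names:
        py_name = r_name.replace(".", "_")
        if mapping.get(py_name, r_name) != r_name:
            raise ValueError(
                f"{r_name!r} and {mapping[py_name]!r} both map to {py_name!r}"
            )
        mapping[py_name] = r_name
    return mapping


def to_r_kwargs(mapping, **kwargs):
    """Rename Python keyword arguments back to their R spelling."""
    return {mapping[name]: value for name, value in kwargs.items()}


# Argument names taken from the DGEList call on the slide.
mapping = translate_names(["counts", "lib.size", "group"])
r_kwargs = to_r_kwargs(mapping, counts=[1, 2], lib_size=3, group="a")
print(r_kwargs)
```

The collision check matters because R allows both `lib.size` and `lib_size` as distinct names, so a blind replacement could silently hide one of them.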
9 / 20
Bioconductor library IRanges
10 / 20
Bioconductor library Biostrings
11 / 20
Separate communities
12 / 20
Bilingual community
13 / 20
Interpreters/Translators
14 / 20
Cost of translation
R package       Python module      Lines of code
AnnotationDbi   annotationdbi.py   168
Biobase         biobase.py         341
Biostrings      biostrings.py      591
BSgenome        bsgenome.py        112
edgeR           edger.py           107
GEOquery        geoquery.py        102
GGbase          ggbase.py          104
GGtools         ggtools.py          77
goseq           goseq.py            43
GSEABase        gseabase.py        149
IRanges         iranges.py         295
ShortRead       shortread.py       301
15 / 20
R within Python
• R runs embedded in the Python process
• R objects remain in the R workspace, but can be accessed from Python
• Python-level shells give access to the R objects
• The rpy2 package is used to achieve this
biostrings = importr('Biostrings')

class AAString(XString):
    _aastring_constructor = biostrings.AAString

    @classmethod
    def new(cls, x):
        """ :param x: a string of amino-acids """
        res = cls(cls._aastring_constructor(conversion.py2ri(x)))
        _setExtractDelegators(res)
        return res

aas = AAString.new("PROTEIN")
16 / 20
What is needed to continue

More interpreters/translators
• There are many Bioconductor packages.
• Existing translations must be kept up-to-date.

Keeping up-to-date
• Frequent API-breaking changes in Bioconductor
• Tailored interfaces increase maintenance
• Meta-programming and reflection can alleviate this
17 / 20
Example with meta-programming:
import rpy2.robjects.methods

class AssayData(rpy2.robjects.methods.RS4):
    """ Abstract class. That class is a ClassUnionRepresentation
    in R, that is a way to create a parent class for existing
    classes. This is currently not modelled in Python. """
    __rname__ = 'AssayData'
    __metaclass__ = rpy2.robjects.methods.RS4_Type

    __accessors__ = (
        ('featureNames', 'Biobase', 'featurenames',
         True, 'maps Biobase::featureNames'),
        ('sampleNames', 'Biobase', 'samplenames',
         True, 'maps Biobase::sampleNames'),
        ('storageMode', 'Biobase', 'storagemode',
         True, 'maps Biobase::storageMode'))
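To make the idea concrete, here is a minimal pure-Python sketch of how a metaclass can turn a declarative `__accessors__` table into attributes, which is why adding a mapping costs one line instead of one hand-written method. It is not rpy2's `RS4_Type`: the `FAKE_PACKAGES` lookup stands in for R functions, `AccessorType` is a hypothetical name, and the reading of the fourth field as "expose as a property" is an assumption.

```python
# Stand-in for functions that would live in an R package (hypothetical).
FAKE_PACKAGES = {
    "Biobase": {
        "featureNames": lambda obj: obj.data["features"],
        "sampleNames": lambda obj: obj.data["samples"],
    }
}


class AccessorType(type):
    """Metaclass turning each __accessors__ entry into a method or a property."""

    def __new__(mcls, name, bases, namespace):
        entries = namespace.get("__accessors__", ())
        for r_name, package, py_name, as_property, doc in entries:
            func = FAKE_PACKAGES[package][r_name]

            # Bind func through a default argument so that each accessor
            # keeps its own function rather than the last one of the loop.
            def accessor(self, _func=func):
                return _func(self)

            accessor.__name__ = py_name
            accessor.__doc__ = doc
            if as_property:
                namespace[py_name] = property(accessor, doc=doc)
            else:
                namespace[py_name] = accessor
        return super().__new__(mcls, name, bases, namespace)


class AssayData(metaclass=AccessorType):
    __accessors__ = (
        ("featureNames", "Biobase", "featurenames", True,
         "maps Biobase::featureNames"),
        ("sampleNames", "Biobase", "samplenames", False,
         "maps Biobase::sampleNames"),
    )

    def __init__(self, data):
        self.data = data


assay = AssayData({"features": ["g1", "g2"], "samples": ["s1"]})
print(assay.featurenames)   # generated as a property
print(assay.samplenames())  # generated as a regular method
```

The sketch uses the Python 3 `metaclass=` keyword, whereas the slide uses the Python 2 `__metaclass__` attribute.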
18 / 20
Example of a complete application
A web server to run edgeR.

from bottle import abort, request, route, run
# read_count_data is assumed to be provided by my_edger with the other helpers
from my_edger import get_toptags, make_results_page, read_count_data

@route('/')
def index():
    return '''<html> <body>
<form action="/edger" method="post" enctype="multipart/form-data">
<input type="file" name="data" /> <input type="submit" />
</form>
</body> </html>'''

@route('/edger', method='POST')
def run_edger():
    data = request.files.get('data')
    if data:
        counts, grp = read_count_data(data.file.name)
        top_tags = get_toptags(counts, grp)
        return make_results_page(top_tags)
    else:
        abort(404, "Invalid count file.")

run(host='localhost', port=8080)
19 / 20
Acknowledgements
• Users, and communities from R, Bioconductor, Python, Biopython
• (Vincent Davis, Nicolas Rapin, Brad Chapman)

URLs
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
http://bitbucket.org/lgautier/rpy2-bioc-extensions
http://packages.python.org/rpy2-bioconductor-extensions/
http://rpy2.sourceforge.net/
20 / 20