Building Data-intensive Pipelines
Ravi K Madduri
Argonne National Lab
University of Chicago
Recap from other talks on genomics
• FBIRN combining imaging, clinical and genetics data
• CIDR providing better value to end users
– Globus Online helping CIDR to reliably transfer large sequencing data sets to end users
• Ivo and Fabio presented various challenges in building pipelines in genomics
– Large data volumes
– Multiple, complex analytical tools
• In this talk we will focus on how we can provide workflow capabilities to end users in a way that is both easy to use and scalable
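As a rough illustration of what "workflow capabilities" means here, the sketch below chains a few analysis steps so that each step consumes the output of the previous one. It is a minimal, generic sketch and not Galaxy's API: `run_pipeline`, `normalize`, `drop_short`, `gc_fraction`, and the toy reads are all hypothetical names and data introduced only for this example.

```python
from functools import partial
from typing import Any, Callable, Iterable, List

# A step takes the previous step's output list and returns a new list.
Step = Callable[[List[Any]], List[Any]]


def run_pipeline(records: Iterable[Any], steps: Iterable[Step]) -> List[Any]:
    """Apply each step, in order, to the output of the previous step."""
    data = list(records)
    for step in steps:
        data = step(data)
    return data


def normalize(reads: List[str]) -> List[str]:
    """Strip whitespace and upper-case every read (hypothetical step)."""
    return [r.strip().upper() for r in reads]


def drop_short(reads: List[str], min_length: int) -> List[str]:
    """Keep only reads with at least min_length bases (hypothetical step)."""
    return [r for r in reads if len(r) >= min_length]


def gc_fraction(reads: List[str]) -> List[tuple]:
    """Pair each read with the fraction of its bases that are G or C."""
    return [(r, (r.count("G") + r.count("C")) / len(r)) for r in reads]


# Toy input, invented for illustration only.
toy_reads = ["acgt", " gg ", "GGCCAT"]

# The pipeline is just an ordered list of steps; partial() fixes parameters.
steps = [normalize, partial(drop_short, min_length=4), gc_fraction]
result = run_pipeline(toy_reads, steps)
```

Because the pipeline is plain data (an ordered list of steps), it can be saved, shared, and re-run on new inputs, which is the property a workflow system exposes to end users through a graphical interface.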
Enter Galaxy
• A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
• Open source software that makes it easy to integrate your own tools and data and customize your own site
• Flexible architecture -> Customizable
Galaxy Adoption
• ~50 deployments of Galaxy
– Galaxy for microarray analysis, machine learning, drug discovery, etc.
• ~130,000 jobs a month and growing on the public instance of Galaxy
• 1 TB/week in user uploads
– 60 TB from China
• 150+ attendees at the Galaxy users conference
– From 6 continents
• Adoption driven primarily by
– Ease of use
– Software as a service
– Responsiveness to user needs
Opportunities for BIRN collaborators
• Galaxy for biomedical informatics
– Researchers can discover and download interesting and useful datasets provided by BIRN
– Analyze data using various BIRN tools
– Create and share pipelines with other researchers
– Create virtual collaborations by leveraging flexible, secure user and group management
Use case: CVRG-Galaxy
• Created a Galaxy instance for CVRG community
• Integrated it with Globus Online file transfer capabilities so researchers can get data for analysis
• Created a CVRG Toolbox in Galaxy with Bioconductor tools from CRData.org
• Investigating how individual PIs can contribute their own compute and storage
CVRG CRData Galaxy