Advancing life sciences with IBM reference architecture for genomics

IBM Systems and Technology
Solution Brief

Life Sciences

A forward-looking, end-to-end and collaborative solution for genomics research and medicine

Highlights

● End-to-end, unified solution for genomics research, translational medicine and personalized medicine

● Scalable, software-defined and data-centric architecture designed for cutting-edge research and clinical use

● Developed in collaboration with leading researchers and partners worldwide

Genomic medicine is revolutionizing medical research and clinical care, enabling scientists and clinicians to identify individuals at risk of disease, provide early diagnoses, and recommend better treatments. Prolific use of next-generation sequencers is flooding researchers and clinicians with data. Infrastructures often cannot scale to process, store and distribute this data in time. Many institutions rely on third parties to process and store the raw data, which slows access for analysis. In addition, the context and mapping of genomic, phenotypic and environmental data must be available.

IBM, in collaboration with key researchers and partners, created the IBM reference architecture for genomics. The end-to-end reference architecture defines the enterprise data management, workflow orchestration and global access capabilities across key genomics, translational and personalized medicine platforms. It supports large-scale genomics sequencing and downstream data analytics, providing:

● Data lifecycle management to support large-scale data growth

● Software-based abstraction layers for compute, storage, big data and cloud

● Workload and workflow orchestrator for applications

By investigating the human genome in the context of biological pathways and environmental factors, genomic scientists and clinicians can now identify individuals at risk of disease, provide early diagnoses based on biomarkers and recommend effective treatments.


Due to new technology and research methods, the field of genomics is being flooded with data from next-generation sequencers, and this data must be quickly stored, analyzed, shared and archived. Many genomics, cancer and pharmaceutical research institutions are now generating so much data that they can no longer process or even transmit the information over standard communication lines in a timely fashion. Often researchers must physically ship raw data to an external computing center for processing and storage, which slows down and inhibits ready access to and analysis of the data. In addition to scale and speed, this information needs to be linked based on data models and taxonomy, or curated with machine or human knowledge. Only then can this “smart” data be factored into the equation when dealing with genomic, phenotypic and environmental data and made available to a common analytical platform.

The IBM reference architecture for genomics is an end-to-end reference architecture for genomic medicine. It defines the enterprise capability of data management, workflow orchestration and global access across key platforms for genomics, translational and personalized medicine. Based on this architecture, IBM has built a data-centric, software-defined and application-ready infrastructure in support of large-scale genomics sequencing and downstream data analytics:

● Data-centric: Helps you meet the challenge of managing the explosive growth of genomics and clinical data with data lifecycle management capabilities.

● Software-defined: Defines the architecture with software-based abstraction layers for computation, storage, big data and the cloud.

● Application-ready: Integrates a multitude of applications with a workload and workflow orchestrator.

Data management using a data hub

Data lifecycle management is critical for genomics due to data volume, velocity and variety. Genomic data volume is surging as the cost of sequencing drops precipitously. The I/O throughput on a genomic system can be extremely demanding due to data volume and to the large number of file and directory objects. The data also comes in many formats with varying lifecycle management requirements, ranging from transient files in scratch space to variant call format (VCF) files that must remain online perpetually.

Leveraging IBM Elastic Storage (see Figure 1), the data management layer, or data hub, defines an enterprise capability to meet these challenges based on its scalable, extensible and high-performance architecture. First developed and optimized as a high performance computing (HPC) file system, IBM Elastic Storage serves large volumes of data at high bandwidth and in parallel to all the compute nodes in the computing system(s). Because a genomic pipeline can consist of hundreds of applications engaged in concurrent data processing on a large number of files, this capability is critical to feeding data to the computational genomics workflow.

Figure 1. The data management layer or data hub of the IBM reference architecture for genomics. The data hub provides a global name space and high-performance access to data stored on SSD/Flash, fast disk, slow disk and tape archive. The key functions are high data Input/Output (I/O), policy-based Information Lifecycle Management (ILM), data sharing through replication and caching, and big data.

[Figure 1 labels: PACS, LIMS, NGS, Ref DB, Publication; Assembly & Alignment, Variant Calling, Annotation, Bioinformatics; Datahub (ILM, I/O, Sharing, Big Data); SSD/Flash, Fast Disk, Slow Disk, Tape]



Because the genomics pipeline can generate petabytes of data and metadata, a system pool built on solid-state drive (SSD) and flash storage with high-IOPS capability can be dedicated to storing metadata for files and directories and, in some cases, to storing small files directly. This feature drastically improves file system performance and responsiveness for metadata-heavy operations such as listing all files in a given directory.

As a file system with a connector to Hadoop MapReduce, the data hub can also serve Hadoop MapReduce big data jobs on the same set of compute nodes, eliminating the need for, and the complexity of, another file system, in this case the Hadoop Distributed File System (HDFS). As the bioinformatics industry adopts big data technologies including Hadoop MapReduce, this data hub feature will enable firms to gain both economies of scale and performance by sharing nodes between compute and big data analytics.

The policy-based data lifecycle management capability allows the data hub to move data from one storage pool to another, maximizing I/O performance and storage utilization while minimizing operational cost. These storage pools can range from high-I/O flash disk to a high-capacity storage appliance (GPFS™ Storage Server) to low-cost tape media (through integration with tape management solutions including IBM TSM/HSM and LTFS).
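The idea of policy-based movement between pools can be illustrated with a minimal Python sketch. It is not the data hub's policy language or API; the pool names, the `FileInfo` fields and the age and size thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical pool names, ordered from fastest to cheapest (not product identifiers).
POOLS = ("flash", "fast_disk", "slow_disk", "tape")


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    days_since_access: int


def choose_pool(f: FileInfo) -> str:
    """Pick a target pool from illustrative age and size thresholds."""
    if f.days_since_access <= 7 and f.size_bytes < 1_000_000:
        return "flash"       # small, recently used files
    if f.days_since_access <= 30:
        return "fast_disk"   # active working set
    if f.days_since_access <= 180:
        return "slow_disk"   # infrequently read data
    return "tape"            # archive


def plan_migrations(files, current_pool):
    """Return (path, from_pool, to_pool) for every file whose target pool differs."""
    moves = []
    for f in files:
        target = choose_pool(f)
        source = current_pool[f.path]
        if target != source:
            moves.append((f.path, source, target))
    return moves
```

Calling `plan_migrations` with a list of `FileInfo` records and a mapping of path to current pool yields the moves a policy run would schedule; a real deployment would express the same rules in the storage system's own policy mechanism.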

The increasingly distributed nature of genomics infrastructure requires data management on a much larger and global scale. Data not only needs to be moved or shared across different sites; its movement or sharing also needs to be coordinated with the computational workload and workflow. To achieve this coordination, the data hub leverages a key function of IBM Elastic Storage called Active File Management (AFM). AFM enables IBM Elastic Storage to extend the global name space to multiple sites, allowing them to share both a common metadata catalogue and a cached copy of home data, which gives remote client sites local access. For example, a genomic center can own, operate and version-control all reference databases or datasets, while affiliated or partnering sites or centers access the reference datasets through AFM. When the centralized copy of a database is updated, so are the cached copies at the other sites.
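The home-and-cache relationship described above can be pictured with a toy Python model. This is only a sketch of the general caching pattern, not AFM's actual protocol or interface; the class names, the version counter and the revalidate-on-read behavior are assumptions.

```python
class HomeSite:
    """Owns the authoritative, versioned copy of each reference dataset."""

    def __init__(self):
        self._data = {}  # name -> (version, content)

    def publish(self, name, content):
        """Store a new version of a dataset and return its version number."""
        version = self._data.get(name, (0, None))[0] + 1
        self._data[name] = (version, content)
        return version

    def version(self, name):
        return self._data[name][0]

    def fetch(self, name):
        return self._data[name]


class CacheSite:
    """Remote site that serves reads locally and refreshes stale copies."""

    def __init__(self, home):
        self.home = home
        self._cache = {}   # name -> (version, content)
        self.fetches = 0   # number of transfers from the home site

    def read(self, name):
        cached = self._cache.get(name)
        # Transfer data only on a miss or when the home copy is newer.
        if cached is None or cached[0] < self.home.version(name):
            self._cache[name] = self.home.fetch(name)
            self.fetches += 1
        return self._cache[name][1]
```

In this model a partner site reads a dataset repeatedly at local speed and pays the transfer cost only once per published version.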

With the data hub, a system-wide metadata engine can be built to index and search all the genomic and clinical data, enabling faster, more powerful downstream analytics and translational research.

Workflow management with an orchestrator

The workflow for genomics is complex yet important. A growing number of genomic applications have varying degrees of maturity and different programming models. Many are single-threaded (R, for example) or embarrassingly parallel (BWA, for example), while others are multi-threaded or MPI-enabled (MPI BLAST, for example). All of these applications need to work in concert or in tandem, in a high-throughput and high-performance mode, to generate final results.

The IBM reference architecture for genomics includes a workload management or orchestrator layer, which defines the capability to orchestrate applications and workflows. A unique combination of the IBM® Platform™ LSF® workload manager and the Platform Process Manager workflow engine links, coordinates and shepherds a spectrum of computational and analytical jobs into fully automated pipelines that can be easily built, customized, shared and run on a common platform. This capability provides the necessary abstraction of applications from the underlying infrastructure, including HPC clusters with graphics processing units (GPUs) and a big data cluster in the cloud.



The orchestrator distinguishes itself from scripted workflow tools by its capability to handle complex workflows dynamically. Individual workloads or jobs can be defined through a user-friendly interface, incorporating variables, parameters and data definitions using standard templates. The workload manager transparently handles the submission, placement, monitoring and completion of each job. The workflow engine connects jobs in linear progressions, conditional branches or loops, based on user-defined criteria and requirements for completing and advancing them.
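As a rough illustration of how a workflow engine can chain jobs in dependency order and take conditional branches, the following Python sketch uses the standard library's `graphlib`. It is not the orchestrator's implementation; the function name, the step callables and the condition mechanism are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, deps, conditions=None):
    """Run steps in dependency order.

    steps:      name -> callable taking the dict of earlier results
    deps:       name -> list of prerequisite step names
    conditions: name -> predicate on earlier results; a false predicate
                skips the step and everything that depends on it
    """
    conditions = conditions or {}
    results, skipped = {}, set()
    for name in TopologicalSorter(deps).static_order():
        if any(p in skipped for p in deps.get(name, ())):
            skipped.add(name)      # a prerequisite did not run
            continue
        cond = conditions.get(name)
        if cond is not None and not cond(results):
            skipped.add(name)      # conditional branch not taken
            continue
        results[name] = steps[name](results)
    return results, skipped
```

A linear progression is simply a chain of single dependencies, and a branch is a step whose condition inspects the results of an earlier step. Loops are outside the scope of this sketch.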

To maximize the throughput of the workflow for genomics sequencing analysis, a special type of workload can be defined by using job arrays so data can be split and processed by many jobs in parallel.
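A job array of this kind might be prepared as in the Python sketch below, which splits input files into chunks and builds an LSF `bsub` command line with an array index range. The script name and the round-robin split are assumptions for illustration; each array element would typically use the `LSB_JOBINDEX` environment variable to select its own chunk.

```python
def split_into_chunks(items, n_chunks):
    """Distribute items round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, item in enumerate(items):
        chunks[i % n_chunks].append(item)
    return chunks


def job_array_command(name, n_tasks, script):
    """Build a bsub argument list for a job array with elements 1..n_tasks."""
    return [
        "bsub",
        "-J", f"{name}[1-{n_tasks}]",   # job array specification
        "-o", f"{name}.%J.%I.out",      # %J = job ID, %I = array index
        script,                          # hypothetical per-chunk script
    ]
```

For example, `job_array_command("align", 4, "./align_chunk.sh")` describes one submission that fans out into four parallel array elements, one per chunk.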

In another innovative use case for genomics processing, multiple sub-flows can be defined as a parallel pipeline for variant analysis following the alignment of the genome. The results from each sub-flow can then be merged into a single output and provide analysts with a comparative view of multiple tools or settings.
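The parallel sub-flow pattern can be sketched in Python as follows. The caller functions are stand-ins supplied by the user, not real tool invocations, and the merged structure (variant mapped to the set of callers that reported it) is one assumed way of presenting a comparative view.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subflows_in_parallel(callers, aligned_input):
    """Run every caller on the same aligned input concurrently.

    callers: non-empty dict of name -> callable returning a list of variants
    """
    with ThreadPoolExecutor(max_workers=len(callers)) as pool:
        futures = {name: pool.submit(fn, aligned_input)
                   for name, fn in callers.items()}
        return {name: fut.result() for name, fut in futures.items()}


def merge_calls(per_caller):
    """Merge results into variant -> set of callers that reported it."""
    merged = {}
    for caller, variants in per_caller.items():
        for variant in variants:
            merged.setdefault(variant, set()).add(caller)
    return merged
```

Variants reported by several callers appear with a larger caller set, which gives analysts a quick view of where tools or settings agree.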

The workflow can also be designed as a module and embedded into larger workflows as a dynamic building block. Not only will this approach enable the efficient building and reuse of the pipelines, it will also encourage collaborative sharing of genomic pipelines among a group of users or within larger scientific communities.

Figure 3 shows a screenshot of an end-to-end workflow created using the orchestrator to process raw sequence data (BCL) into variants (VCF) using a combination of applications and tools.

As more institutions deploy hybrid cloud solutions with distributed resources, the orchestrator can coordinate the distribution of workloads based on data locality, pre-defined policies, thresholds and real-time information on resource availability. For example, a workflow can be designed to process raw genomic data close to the sequencers, followed by sequence alignment and assembly using the Hadoop MapReduce framework on a remote big data cluster. In another use case, a workflow can be designed to launch a proxy event that moves data from a satellite system to the central HPC cluster when the genomic processing reaches 50 percent completion. The computation and data movement can then happen concurrently to save time and costs.
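The 50 percent trigger in the second use case amounts to firing an action once when a progress threshold is crossed. A minimal Python sketch, assuming progress arrives as (completed, total) pairs and that `on_threshold` is a user-supplied callback that starts the data transfer:

```python
def watch_progress(progress_updates, threshold, on_threshold):
    """Call on_threshold once, the first time done/total reaches threshold.

    progress_updates: iterable of (done, total) pairs
    threshold:        fraction between 0 and 1, for example 0.5
    Returns True if the callback was fired.
    """
    fired = False
    for done, total in progress_updates:
        if not fired and total > 0 and done / total >= threshold:
            on_threshold(done, total)   # e.g. start moving data to the central cluster
            fired = True
    return fired
```

Because the callback fires while later updates are still arriving, the data movement it starts can overlap with the remaining computation.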

Figure 3. The orchestrator implemented as a genomic workflow pipeline. Starting from the left of the pipeline: Box 1, the arrival of data such as BCL files automatically triggers CASAVA as the first step of the workflow; Box 2, a dynamic subflow uses BWA for sequence alignment; Box 3, SAMtools performs post-processing in a job array; Box 4, different variant analysis subflows are triggered in parallel.

Figure 2. The workload management or orchestrator layer in the IBM genomic medicine reference architecture. The orchestrator provides workload management, workflow engine and provenance capabilities to orchestrate complex sets of applications and workflows for genomics pipelines, and to abstract applications from underlying computational resources.

[Figure 2 labels: PACS, LIMS, NGS, Ref DB, Publication; Assembly & Alignment, Variant Calling, Annotation, Bioinformatics; Orchestrator (Workload, Workflow, Provenance); Local Cluster 1, Local Cluster 2, Remote Cluster (Grid/Cloud)]

[Figure 3 labels: Base Conversion, Alignment, Pre-processing, Variant Analysis; CASAVA, BWA, Bowtie, Samtool, Picard, GATK; Mutect, CaVEMan, Pindel; BCL, FASTQ, SAM/BAM, Recalibrated BAM; SNP, Indels, Translocation, Low-pass copy number, Exome copy number]



Manage global access with an appcenter

In the IBM reference architecture for genomics, an appcenter is the user interface into the genomics platform. It provides an enterprise portal with role-based access and security controls while allowing researchers and clinicians easy access to data and workflow tools.

Built with Platform Application Center, this appcenter has advanced logging capabilities for tracking activities including jobs, workflow provenance and data access. This is a critical feature for event reporting, performance analysis, or rerunning analysis with prior settings.
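To show why logged provenance makes it possible to rerun an analysis with prior settings, here is a minimal Python sketch of a provenance log. It is not the Platform Application Center log format; the record fields and function names are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(step, tool, version, params, input_bytes):
    """Describe one job run: what ran, with which settings, on which input."""
    return {
        "step": step,
        "tool": tool,
        "version": version,
        "params": dict(params),
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def append_record(log_lines, record):
    """Append the record to an in-memory log as one JSON line."""
    log_lines.append(json.dumps(record, sort_keys=True))


def settings_for_rerun(log_lines, step):
    """Return (tool, version, params) from the most recent run of a step, or None."""
    for line in reversed(log_lines):
        rec = json.loads(line)
        if rec["step"] == step:
            return rec["tool"], rec["version"], rec["params"]
    return None
```

The input checksum lets a later audit confirm that a rerun used the same data, while the stored parameters supply the prior settings.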

To harvest all knowledge and information from users and enable sharing, the appcenter can function as a catalogue of pre-built and pre-tested workflow definitions and application templates so users can easily launch them directly from the portal after uploading the data. The portal can also be configured with a metadata index and search engine enabling users to browse, query or search pre-indexed data or reference documents.

Extensible reference architecture

The IBM reference architecture for genomics extends beyond genomics to cover translational and personalized medicine platforms as well. These platforms also leverage the data hub, orchestrator and appcenter as common enterprise capabilities. The integrated architecture and platforms are shown in Figure 4.

Figure 4. IBM reference architecture for genomics. The blue section depicts the genomics platform. The green section depicts the translational platform. The purple section depicts the personalized medicine platform.

[Figure 4 labels. Columns: Data Source, Data Service, Analytics, Access. Platforms: Personalized Medicine, Translational, Genomics. Other labels: Patient Portal, Clinical Portal, EMR, Publication, EMR Workflow, Disease Registry, Surveys, Cognitive Analytics, Outcome Evaluation, Translational Clinical Knowledge, Orchestrator; Clinical, RWE, Ontologies, MDM, ETL, NLP, Predictive Analytics, Associative Analytics, Parallel Modeling, Cohort Query, Data Explore, Clinical DW, Omics DW, Translational DW, Datahub; Genomics, Imaging, Proteomics, Cytometry, PACS, LIMS, NGS, Ref DB, Alignment & Assembly, Variant Analysis, Annotation, Functional Genomics, AppCenter, Visualization, Monitoring; fastq, bam, vcf]


Why IBM?

IBM Platform Computing software offerings enable IT scalability, performance and agility that help firms work smarter, faster and outperform their peers. Our comprehensive portfolio of solutions speeds and scales large-scale simulations, predictive analytics and visualization.

Platform Computing high-performance infrastructure management software automatically optimizes IT resource usage, on premises and in the cloud, enabling 24x7, real-time operating agility that can improve the bottom line faster.

For more information

To learn more about the IBM reference architecture for genomics, please contact your IBM representative or IBM Business Partner, or visit the following website: ibm.com/platformcomputing

Additionally, IBM Global Financing can help you acquire the IT solutions that your business needs in the most cost-effective and strategic way possible. We’ll partner with credit-qualified clients to customize an IT financing solution to suit your business goals, enable effective cash management, and improve your total cost of ownership. IBM Global Financing is your smartest choice to fund critical IT investments and propel your business forward. For more information, visit: ibm.com/financing

© Copyright IBM Corporation 2014

IBM Corporation Systems and Technology Route 100 Somers, NY 10589

Produced in the United States of America August 2014

IBM, the IBM logo, ibm.com, GPFS, Platform, and LSF are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

DCS03062-USEN-00

