ARES: Advanced Networking for Distributing Genomic Data Gianluca Reali University of Perugia VUB, Brussels, May 13, 2014


ARES: Advanced Networking for Distributing Genomic Data

Gianluca Reali University of Perugia

VUB, Brussels, May 13, 2014


Outline

• Description of ARES

• ARES research and implementation purposes

• Technologies and design choices

• Experimental Results


GÉANT

• GÉANT is the pan-European 500 Gbps research and education network that interconnects Europe’s National Research and Education Networks (NRENs).

• It connects over 50 million users at 10,000 institutions across Europe and, through extensive links to global partners, reaches over 100 NRENs worldwide.


ARES project

• The first GÉANT Open Call was launched on 1st April 2013 and closed on 29th April 2013. It was highly oversubscribed with a total of 70 projects received across 18 topic areas.

• 21 innovative R&D projects were selected, including ARES.


ARES Partners

– University of Perugia (UoP)

• Design and Implementation of the ARES CDN network;

• Deployment of software instances to manage both the network and the processing tools;

• Execution of experiments (network side);

– Polo d’Innovazione di Genomica, Genetica e Biologia SCARL (GGB)

• Definition of experimental scenarios and relevant metrological procedures.

• Execution of experiments as CDN customers;

• Evaluation of the grade of received network service.


Why ARES?

Future P4 medicine framework: proactive, personalized, predictive, and participatory [1].

Berge Minassian, Hospital for Sick Children in Toronto: “I am certain that in the next few years patients walking into children’s hospitals will have their whole genomes sequenced” [2].

FUTURE NEED OF SEQUENCING, STORING, MAKING AVAILABLE, AND CONTINUOUSLY ANALYZING THE GENOME OF EACH INDIVIDUAL through real-time knowledge of the latest findings!

Tremendous volume of data: NEED OF SUITABLE STORAGE, NETWORKS, PROTOCOL ARCHITECTURES, APPLICATIONS, …

[1] Hood, L., Balling, R., and Auffray, C. (2012). Revolutionizing medicine in the 21st century through systems approaches. Biotechnol. J., 7:1-10.

[2] http://blogs.nature.com/spoonful/2013/01/gene-sequencing-yields-breakthrough-for-children-with-rare-parkinsons-like-disorder.html


ARES Idea (1/2)

[Figure: ARES architecture. Labelled elements: PoP clusters, CDN nodes, public genome/annotation database, private genome/annotation database, controller. Labelled flows: control, CDN data, processed data.]

Combined use of CDN and cloud/grid technologies, specifically targeted to genomic data sets, supporting medical needs.


Reasoning behind technology and design choices

• Original aspects of genomic data sets

– i.1 Content growth

– i.2 Content popularity

– i.3 Logical content relationships

• Advanced CDN features

– i.4 Content distribution logic

– i.5 Suitable integration with cloud storage and processing services

– i.6 Novel cache instantiation procedures

– i.7 Parallel download algorithm

– i.8 Multiple classes of network services supporting different medical needs

– i.9 Networking and service paradigms contributing to the optimized solution


i.1 Content growth (1/2)

For just 1000 samples!


i.1 Content growth (2/2)

[Figure: two sketches plotted against time, each starting at the time of creation. First panel: typical web content size over time. Second panel: genome data set size over time.]

Any genome is a huge source of information still to be unveiled. Research will produce a significant increase of the genomic data set for each patient!


i.2 Content Popularity

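[Figure: two sketches plotted against time, each starting at the time of creation. First panel: typical web content popularity over time. Second panel: genome and metadata popularity over time.]

The popularity of genomes and metadata has no predictable shape, but the content never expires: there is only an arrival process. Huge implications for CDNs!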


i.3 Logical content relationships (1/2)

Content relationships based on gene “affinity”

Diseases may show a degree of genetic similarity. This information is useful for driving diagnostic investigations, and thus for managing data in CDNs.

Each circle is associated with a disease.

Each arc is associated with a gene relationship.


i.3 Logical content relationships (2/2)

For example, a diagnosis of Colon Cancer could induce further investigation about genetically similar diseases, such as Leukemia. The relevant metadata can be pre-loaded in suitable CDN caches.

e.g. genomic links with colon cancer.
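The pre-loading idea can be sketched as a lookup on a weighted disease graph. This is only an illustration under stated assumptions, not the ARES implementation: the affinity scores, the threshold, the Breast Cancer entry, and the name `diseases_to_preload` are hypothetical. Only the Colon Cancer/Leukemia link comes from the slide.

```python
from typing import Dict, List

# Hypothetical affinity graph: disease -> {related disease: affinity in [0, 1]}.
# Only the Colon Cancer <-> Leukemia link is taken from the slide; the scores
# and the Breast Cancer entry are invented for illustration.
AFFINITY: Dict[str, Dict[str, float]] = {
    "Colon Cancer": {"Leukemia": 0.8, "Breast Cancer": 0.3},
    "Leukemia": {"Colon Cancer": 0.8},
    "Breast Cancer": {"Colon Cancer": 0.3},
}


def diseases_to_preload(diagnosis: str, threshold: float = 0.5) -> List[str]:
    """Return related diseases whose metadata is worth pre-loading in a cache,
    strongest affinity first."""
    neighbours = AFFINITY.get(diagnosis, {})
    selected = [d for d, score in neighbours.items() if score >= threshold]
    return sorted(selected, key=lambda d: neighbours[d], reverse=True)


print(diseases_to_preload("Colon Cancer"))
```

A lower threshold pre-loads more metadata at the cost of cache space, so in practice it would have to be tuned against cache capacity.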


i.4 Content Distribution Logic (1/3)

• Based on NSIS advanced discovery algorithms and signaling

• Based on differentiated medical needs, that is, the time allowed for downloading data depends on the seriousness of a disease (illustrated in what follows)

• Leveraging cloud services

• Original management of virtualization services through NetServ


i.4 Content Distribution Logic (2/3)

NSIS signaling

– Suite of protocols envisioned to support various signaling applications

– IETF RFC 4080

Two layers:

– NTLP: NSIS Transport Layer Protocol

• GIST (General Internet Signalling Transport)

– NSLP: NSIS Signaling Layer Protocol

• NetServ-specific NSLP

– On-path based signaling

– Three messages

» SETUP + ACK

» PROBE REQUEST/RESPONSE

» REMOVE + ACK
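The three message types listed above can be illustrated with a toy node that reacts to each of them. This is a sketch under assumptions, not the NetServ NSLP wire protocol: the class names, the string replies, and the idea that a node keeps a set of active services are hypothetical.

```python
from enum import Enum
from typing import Set


class Msg(Enum):
    SETUP = "SETUP"
    PROBE_REQUEST = "PROBE REQUEST"
    REMOVE = "REMOVE"


class Node:
    """Toy on-path node that tracks which services are active on it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active: Set[str] = set()

    def handle(self, msg: Msg, service: str) -> str:
        if msg is Msg.SETUP:
            self.active.add(service)  # install/activate the service
            return "ACK"
        if msg is Msg.REMOVE:
            self.active.discard(service)  # deactivate the service
            return "ACK"
        # PROBE REQUEST: report whether the service is running here.
        status = "active" if service in self.active else "inactive"
        return f"PROBE RESPONSE: {service} {status} on {self.name}"


n1 = Node("N1")
print(n1.handle(Msg.SETUP, "cache"))
print(n1.handle(Msg.PROBE_REQUEST, "cache"))
```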


i.4 Content Distribution Logic (3/3)

[Figure: The NetServ architecture (developed in collaboration with Columbia University). Labelled elements: NSIS signaling daemons (NetServ NSLP over GIST, with GIST packet interception); a UNIX socket to the NetServ controller; service containers (Java OSGi) hosting server modules and packet processing modules as bundles; the NetServ repository, with modules verification and modules installation; the Linux kernel transport layer; Netfilter NFQUEUE #1, configured through iptables commands. Labelled flows: signaling packets, client-server data packets, forwarded data packets.]


i.5 Suitable integration with cloud storage and processing services

• The NSIS-driven caching allows accessing data, suitably located, through a cloud-like interface.

• Extensive virtualization through the IaaS OpenStack service allows aggregating computing resources and storage.


i.6 Novel cache instantiation algorithms and signaling protocols (1/3)

[Figure: message sequence chart over time. Participants: medical video interface, local DB, GCM (“the brain”), CDN/HTTP VM repository, CDN/HTTP metadata server, OpenStack controller, OpenStack PoPs, NetServ caches in PoPs plus controllers. Components implemented as NetServ bundles are filled in red.]

Labels recovered from the chart:

• Load genome and selected diseases.

• req metadata to DBMS metadata; list of metadata servers.

• req VMs to DBMS VMs; list of VM servers.

• NSIS signaling triggered by GCM, from VM and metadata servers, to discover caches storing VMs and metadata; selection of PoPs (optimization problem, optimization function f).

• get VMs @ repository selected; NSIS discovery from selected repository for available caches for VMs; req VMs; send VMs; VM started.

• get metadata @ repository selected; NSIS discovery from selected repository for available caches for metadata; req metadata; send metadata.

• Processing at VMs; results.

NetServ caches are populated through advanced NSIS signaling and remain available for future usage.


i.6 Novel cache instantiation algorithms and signaling protocols (2/3)

NSIS CDN Signaling

[Figure: message sequence chart. Participants: HTTP Client, N1, N2, HTTP Server. Recovered message labels: HTTP GET; NSIS signaling with Setup and Probe messages relayed between the nodes; replies “N1 Active” and “N2 Active”; “Setup N1->Server, N2->N1”; 200 OK.]


i.6 Novel cache instantiation algorithms and signaling protocols (3/3)

[Figure: message sequence chart. Participants: HTTP Client, N1, N2, HTTP Server. Recovered message labels: HTTP GET; HTTP REDIRECT TO N2; HTTP GET relayed between the nodes; HTTP DATA returned along the same chain; NSIS Signaling.]
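The redirect-to-cache behaviour suggested by the chart labels (HTTP GET, HTTP REDIRECT TO N2, HTTP DATA) can be sketched with a toy origin and a single cache node. This is a simplification under assumptions: the real exchange involves two nodes and NSIS signaling, and the class names, the `from_cache` flag, and the in-memory store are hypothetical.

```python
from typing import Dict, Optional, Tuple


class Origin:
    """Toy origin server that redirects clients to an active cache node."""

    def __init__(self, content: Dict[str, bytes]) -> None:
        self.content = content
        # Assumed to be set once signaling has activated a cache (e.g. N2).
        self.active_cache: Optional["Cache"] = None

    def get(self, path: str, from_cache: bool = False) -> Tuple[str, object]:
        if self.active_cache is not None and not from_cache:
            return ("REDIRECT", self.active_cache)
        return ("DATA", self.content[path])


class Cache:
    """Toy cache node: serves from its local store, else fetches upstream."""

    def __init__(self, name: str, origin: Origin) -> None:
        self.name = name
        self.origin = origin
        self.store: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        if path not in self.store:  # miss: fetch once from the origin
            _, data = self.origin.get(path, from_cache=True)
            self.store[path] = data
        return self.store[path]


def client_get(origin: Origin, path: str) -> bytes:
    """Issue a GET and follow at most one redirect to a cache node."""
    kind, payload = origin.get(path)
    if kind == "REDIRECT":
        return payload.get(path)
    return payload


origin = Origin({"/genome/sample1": b"ACGT"})
origin.active_cache = Cache("N2", origin)
print(client_get(origin, "/genome/sample1"))
```

After the first request the object stays in the cache store, so later clients redirected to the same node are served without contacting the origin.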


i.7 Parallel downloading (1/2)

• Use of a novel NSIS NSLP protocol for discovering bottleneck-disjoint paths of NSIS nodes.

– Off-path NSIS signaling

• Bubble, Balloon, Hose
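Once bottleneck-disjoint paths are known, the benefit of parallel downloading follows from simple arithmetic: if a file is split in proportion to each path's bottleneck bandwidth, all chunks finish together and the transfer time is the file size divided by the aggregate bandwidth. The sketch below shows that calculation only. The function names and the example figures are assumptions, and protocol overhead is ignored.

```python
from typing import Dict


def split_download(size_gb: float, path_gbps: Dict[str, float]) -> Dict[str, float]:
    """Split a file across paths in proportion to their bottleneck bandwidth,
    so that all chunks finish at the same time. Returns GB per path."""
    total = sum(path_gbps.values())
    if total <= 0:
        raise ValueError("at least one path with positive bandwidth is required")
    return {path: size_gb * bw / total for path, bw in path_gbps.items()}


def download_time_s(size_gb: float, path_gbps: Dict[str, float]) -> float:
    """Ideal transfer time in seconds (1 GB = 8 Gbit, no protocol overhead)."""
    return size_gb * 8 / sum(path_gbps.values())


# Hypothetical example: a 3 GB file over two disjoint paths.
paths = {"via N1": 1.0, "via N2": 0.5}  # bottleneck bandwidth in Gbit/s
print(split_download(3.0, paths))
print(download_time_s(3.0, paths))
```

With these figures the single best path would need 24 s, while the two paths together need 16 s.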


i.7 Parallel downloading (2/2)

• Optimization function f(g1, …, gk), where gi is a function of the i-th medical service request.

• gi(genome size, metadata size and location, VM size, network topology and link bandwidths, required clinical service time, quality of the sequencing machine, processing reliability, download parallelization capabilities, …)


i.8 Multiple classes of network services supporting different medical needs (1/2)

• e.g. peripheral neuroblastic tumours (Neuroblastoma, Ganglioneuroblastoma, Ganglioneuroma) must be diagnosed immediately, breast cancer may be handled in some days, and other diagnoses can be done in some weeks

• Different CDN services must be provided, such as:

– Minimum delay CDN services for handling urgent situations.

– Short delay CDN services for handling less urgent situations.

– Balanced network load CDN services for handling all other situations.


i.8 Multiple classes of network services supporting different medical needs (2/2)

The table below shows some examples of tolerable times for medical personnel requiring support from the project. These tolerable times include the CDN service time, in addition to other times which depend on other medical requirements, such as the type of sequencing, the portion of the genome to be analyzed, the processing software used, and the reliability of results. Through the expertise of the researchers involved in ARES, we will translate these times into CDN service classes.

| Disease | Time (days) |
|---|---|
| Neuroblastoma | 2 |
| Breast Cancer | 7 |
| Colon Cancer | 7 |
| Acute Lymphoblastic Leukemia | 4 |
| Leukemias | 4 |
| Lymphomas | 4 |
| Myeloma | 7 |
| Cervical Cancer | 7 |
| Pancreatic Cancer | 4 |
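One possible way to translate the tolerable times above into the three CDN service classes is a simple threshold rule. The sketch below is an assumption, not the mapping defined by the project: the thresholds (2 and 4 days) and the function name `cdn_service_class` are hypothetical. The times are the ones listed in the table.

```python
from typing import Dict

# Tolerable times in days, as listed in the table.
TOLERABLE_DAYS: Dict[str, int] = {
    "Neuroblastoma": 2,
    "Breast Cancer": 7,
    "Colon Cancer": 7,
    "Acute Lymphoblastic Leukemia": 4,
    "Leukemias": 4,
    "Lymphomas": 4,
    "Myeloma": 7,
    "Cervical Cancer": 7,
    "Pancreatic Cancer": 4,
}


def cdn_service_class(disease: str, urgent_days: int = 2, short_days: int = 4) -> str:
    """Map a disease to a CDN service class using hypothetical thresholds."""
    days = TOLERABLE_DAYS[disease]
    if days <= urgent_days:
        return "minimum delay"
    if days <= short_days:
        return "short delay"
    return "balanced network load"


for name in ("Neuroblastoma", "Pancreatic Cancer", "Breast Cancer"):
    print(name, "->", cdn_service_class(name))
```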


i.9 Networking and service paradigms contributing to the optimized solution

[Figure: networking and service paradigms converging on the optimized solution. Recovered labels: SaaS cloud service; medical interface for private genome management.]


Case Study (1/3)

Sample case study:

1. A doctor needs to investigate the occurrence of a gene mutation.

2. Assume that a Copy Number Variation (CNV) analysis is needed for this purpose.

3. The appropriate CDN service makes the data needed available.

4. The CNV analysis can start, as shown in what follows.

5. Outcome for measuring the client-side success of the procedure: achievement of results within the pre-established timeframe, compliant with the CDN service deployed.


Case Study (2/3)

Sample case study on genome mutation: find Copy Number Variation (CNV)

[Figure: pipeline flowchart. Recovered boxes, with the tool shown for each: Get raw sequence from 1000 Genomes repository; Quality Control (FastQC); Trimming Reads (Trimmomatic); Mapping Reads vs Genome (Bowtie 2 vs hg19); Find CNV (CNVnator); Annotation CNV (Custom script/BLAST); Produce a Report (Custom script); End. One decision branch is labelled “No”.]

• FastQC (OSS) is used for quality control; Trimmomatic (OSS) is used for trimming reads.

• Bowtie 2 is an OSS tool for aligning sequencing reads to long reference sequences.

• hg19 (human genome 19) is the current reference to the human genome sequence.

• CNVnator is an OSS tool for CNV discovery and genotyping from read-depth analysis of personal genome sequencing.

• BLAST finds regions of similarity between biological sequences.
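The pipeline can be written down as an ordered list of (step, tool) pairs. The sketch below only models the order of the steps and does not invoke any of the tools. Reading the “No” branch as “quality check not passed, so trim the reads” is an assumption, as is the function name `cnv_pipeline`.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (step name, tool or data source)


def cnv_pipeline(qc_passed: bool) -> List[Step]:
    """Return the ordered steps of the CNV case study.

    Assumption: reads are trimmed only when quality control does not pass.
    """
    steps: List[Step] = [
        ("Get raw sequence", "1000 Genomes repository"),
        ("Quality control", "FastQC"),
    ]
    if not qc_passed:
        steps.append(("Trimming reads", "Trimmomatic"))
    steps += [
        ("Mapping reads vs genome", "Bowtie 2 vs hg19"),
        ("Find CNV", "CNVnator"),
        ("Annotation CNV", "Custom script/BLAST"),
        ("Produce a report", "Custom script"),
    ]
    return steps


for name, tool in cnv_pipeline(qc_passed=False):
    print(f"{name}: {tool}")
```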


Case Study (3/3)

[Figure: validation flowchart. For each user request i = 1, …, n with required service time Ti, where T2 < T1 and Tn < Tn-1, the flow is: CDN service mapping and execution; processing and metadata creation; check “service time < Ti?”. A YES outcome leads to SUCCESS.]

METROLOGICAL VALIDATION TEST: EXECUTION OF THE SAME DATA PROCESSING REQUIRING DIFFERENT TIME SPECIFICATIONS SO AS TO STRESS THE NETWORK CAPABILITIES.

CONCLUSION: THE CDN CAN SATISFY THE SAME SERVICE, USING THE SAME DATA TYPE AND VOLUME, ALSO WITH DIFFERENT AND STRINGENT REQUIREMENTS ON SERVICE TIME
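The validation test can be sketched as a loop over progressively tighter deadlines. This is an illustration under assumptions: `run_service` stands for whatever executes the CDN service mapping and the processing and returns the measured service time, and both it and the function name `validate` are hypothetical.

```python
from typing import Callable, List


def validate(deadlines_h: List[float], run_service: Callable[[float], float]) -> bool:
    """Run the same processing once per deadline and check each result.

    Returns True only if every measured service time is below its deadline.
    """
    for deadline in deadlines_h:
        measured = run_service(deadline)  # measured service time in hours
        if measured >= deadline:
            return False
    return True


# Stub standing in for a real run: always finishes at 90% of the deadline.
print(validate([48.0, 24.0, 12.0], lambda deadline: 0.9 * deadline))
```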


Experimental Results

| Pipeline | Configuration | Hypervisor | VM image size | RAM size (min) | # CPU cores | VM storage size | Auxiliary file size |
|---|---|---|---|---|---|---|---|
| CNV | BOWTIE aligner (the computing is performed on the whole human genome) | KVM | 3.1 GB | 8 GB | 1 | 50 GB | 3.5 GB |
| CNV | BOWTIE aligner (the computing is performed chromosome by chromosome) | KVM | 3.1 GB | 4 GB | 1 | 50 GB | 3.5 GB |
| DE | BOWTIE aligner | KVM | 3.1 GB | 4 GB | 1 | 80 GB | 3.5 GB |
| DE | STAR aligner | KVM | 3.1 GB | 32 GB | 1 | 100 GB | 26 GB |


Experimental Results

[Figure: processing time [h] versus number of cores (axis from 0 to 18), two panels. CNV pipeline: curves labelled 4GB, 8GB, 12GB, 16GB; vertical axis from 5 to 9 h. DE pipeline, use of the STAR aligner: curves labelled 32GB, 64GB, 96GB; vertical axis from 2 to 3.4 h.]


Experimental Results


Experimental Results


Thank you for your attention!