Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials

Posted on 16-Jul-2015




Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials

Eran Withana

www.comprehend.com

Overview

• Clinical Trials – Lay of the Land
• Business and Technical Requirements
• Technology Evaluation
• High Level Architecture
• Implementation
• Managing Hardware
• Deployments
• Data Adapters: Implementation and Failure Modes
• Distributed File System
• Challenges
• Future Work


About me …

• Open Source: PMC member and committer at the Apache Software Foundation (Apache Axis2, Web Services, Synapse, Airavata)
• Education: PhD in Computer Science from Indiana University
• Software engineer at Comprehend Systems


Clinical Trials – Lay of the Land

[Figure: Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014; http://www.phrma.org/innovation/clinical-trials)]


Clinical Trials – Lay of the Land

Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs

[Diagram: latent, fragmented data flows from Sites, Labs, and Patients through Safety, EDC, PV Data, and Excel systems into reports, across the Sponsor, Contract Research Organization (CRO), and Sites and Investigators layers]


For decades, clinical development was primarily paper-based.


Technologies

[Slide: various software and practices used in each layer, shown as vendor logos (e.g. Medidata) for sponsors, CROs, and SIs]


Clinical Trials with Centralized Monitoring

[Diagram: data from Sites, Labs, and Patients, captured in Safety, EDC, PV Data, and Excel systems, feeds a Clinical Analytics & Collaboration layer used by Clinical Operations; the consolidated data is real-time, self-service, and mobile]


Providing up-to-date answers

[Diagram: data from EDC, CTMS, Safety, ePro, and other sources is delivered through Web, Ad-Hoc, Mobile, and Collaboration channels to Executives, Medical Review, CRAs, Data Management, and Clinical Operations]


Business Requirements

• FDA and HIPAA compliance
• Metadata/database structure synchronization – less frequent (once a day)
• Data synchronization – more frequent (multiple times a day)
• Ability to plug in various data sources: RAVE, Merge, BioClinica, file imports, DB-to-DB synchs
• Real-time event propagation: adverse events (AEs) need early identification


Technical Requirements

• Hardware agnostic, for resiliency and better utilization
• Repeatable deployments
• Real-time processing and real-time events
• Fault tolerance
• In-flight and end-state metrics for alerting and monitoring
• Flexible and pluggable adapter architecture
• Time travel: audit trails and report generation


Core Design Principles

• Events all the way
• Shared event bus for multiple consumers
• Language-agnostic data representations (via protobuf)
• Automatic datacenter resource management (Mesos/Marathon/Docker)
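As a concrete illustration of the language-agnostic representation, a protobuf event definition might look like the sketch below; the message and field names are invented for this example, since the deck does not show the actual schema:

```protobuf
// Hypothetical event envelope published to the shared event bus.
syntax = "proto2";

message AdapterEvent {
  enum EventType {
    SCHEMA_CHANGED = 1;   // emitted by Syncher
    DATA_SEEDED    = 2;   // emitted by Seeder
    ADVERSE_EVENT  = 3;   // propagated in near real time
  }
  optional string    study_id   = 1;  // which clinical study produced the event
  optional string    source     = 2;  // e.g. "rave", "file_import"
  optional int64     emitted_at = 3;  // epoch millis
  optional EventType type       = 4;
}
```

Because consumers only depend on the compiled message definition, adapters written in different languages can all read from the same bus.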


Technologies Evaluated

• Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and configuration management: Apache ZooKeeper, Redis, Apache Curator
• Event queue: Apache Kafka
• Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
• Database synchronization: Liquibase, Flyway DB
• Data representations: Apache Thrift, protobuf, Avro
• Deployments: Ansible
• File management: Apache HDFS
• Monitoring and alerting: Graphite, StatsD
• Database: PostgreSQL, Apache Spark
• Resource isolation: LXC, Docker


Data Processing Technology Evaluation

| Criteria | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban |
|---|---|---|---|---|---|---|---|---|---|
| DAG Support | Y | DAGScheduler | Y | Y | Y | Y | Y | N | Y |
| DAG Nodes Resiliency | Y | Y | Y | Y | Y | Y | Y | N | Y |
| Event Driven | Y | Y | Y | Y | N | N | N | N | N |
| Timed Execution | Y | Y | Y | Y | Y | Y | Y | Y |  |
| DAG Extension | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Inflight and end state metrics | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Hardware Agnostic | Y | Y | Y | Y | Y | Y | Y | Y | Y |


High Level Architecture


Managing Hardware

• Bare metal boxes, partitioned using LXC containers
• Mesos performs resource allocation for jobs as needed


Deployments

• Ansible
• Repeatable deployments
• Password management
• Inventory management (nodes; dev/staging/production)
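A minimal, illustrative Ansible playbook in this style might look as follows; the host group, file paths, and service name are assumptions, not the actual Comprehend playbooks, and passwords would live in an ansible-vault encrypted file:

```yaml
# deploy_adapters.yml -- illustrative sketch of a repeatable deployment
- hosts: adapters            # group defined per environment inventory
  become: yes
  vars_files:
    - vault.yml              # secrets encrypted with ansible-vault
  tasks:
    - name: Ship the release archive
      copy:
        src: dist/adapter.tar.gz
        dest: /opt/adapter/adapter.tar.gz
    - name: Unpack the release
      unarchive:
        src: /opt/adapter/adapter.tar.gz
        dest: /opt/adapter
        remote_src: yes
    - name: Restart the adapter service
      service:
        name: adapter
        state: restarted
```

Separate inventory files for dev, staging, and production let the same playbook run against each environment unchanged.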


Data Adapters – High Level

• Syncher handles DB structural changes
  – Creates a database schema from the source information
  – Runs a generic database diff and applies the differences to the target database
• Seeder handles data synchronization
  – Uses the database schema created by Syncher
• Seeders get jobs from Syncher or a timed scheduler
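The generic database diff that Syncher runs can be sketched as a toy function, assuming schemas are modeled as {table: {column: type}} dicts; the real implementation works against live database metadata:

```python
def diff_schemas(source, target):
    """Compute DDL that brings `target` in line with `source`.

    Both schemas are dicts of {table_name: {column_name: sql_type}}.
    Returns a list of simplified DDL strings.
    """
    ddl = []
    for table, columns in source.items():
        if table not in target:
            cols = ", ".join(f"{c} {t}" for c, t in columns.items())
            ddl.append(f"CREATE TABLE {table} ({cols})")
            continue
        for column, sql_type in columns.items():
            if column not in target[table]:
                ddl.append(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
    # Tables present only in the target are left alone; dropping
    # clinical data automatically would be unsafe.
    return ddl


source = {"ae": {"subject_id": "text", "severity": "int"}}
target = {"ae": {"subject_id": "text"}}
print(diff_schemas(source, target))
# → ['ALTER TABLE ae ADD COLUMN severity int']
```

Applying the statements inside one transaction matches the rollback behaviour described under failure modes.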

Data Adapters – Coordination and Configuration

• Coordination and configuration through ZooKeeper
  – Job configuration
  – Connection information
  – Distributed locking and counters
  – Metric maintenance (e.g. last successful run)
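One possible znode layout for this coordination data might be sketched as follows; the paths are illustrative, since the actual layout is not shown in the deck:

```python
# Sketch of a znode layout for adapter coordination in ZooKeeper.
# The path names are invented for illustration.

ROOT = "/comprehend/adapters"

def job_config_path(customer, study):
    """Where a Syncher/Seeder job reads its configuration."""
    return f"{ROOT}/config/{customer}/{study}"

def lock_path(customer, study):
    """Znode for a distributed lock (e.g. via a Curator/kazoo lock recipe)."""
    return f"{ROOT}/locks/{customer}/{study}"

def metric_path(customer, study, name):
    """Metrics such as the last successful run timestamp."""
    return f"{ROOT}/metrics/{customer}/{study}/{name}"

print(metric_path("acme", "study-42", "last_successful_run"))
```

Keying everything by customer and study is what allows per-study locking and per-study refresh schedules.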


Data Adapters - Implementation


Failure Modes

Syncher
• Connectivity to source/sink systems fails: retry the job N times and alert, if needed
• Schema change fails mid-way: transaction rollback

Seeder
• Connectivity to source/sink systems fails: retry the job N times and alert, if needed
• Seeding fails midway: Storm retries tuples; failing tuples are moved to an error queue
• Table- and row-level failures: option to skip the tables/rows but send a report at the end
• Effect on "live" tables during data synchronization: use transactions, or load into temporary tables and swap with the original upon completion
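The temporary-table swap can be sketched with sqlite3 standing in for the production database; the table and column names are invented for the example:

```python
import sqlite3

def reseed_with_swap(conn, rows):
    """Load fresh rows into a scratch table, then swap it in atomically,
    so readers only ever see the fully-old or fully-new table."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE ae_incoming (subject_id TEXT, severity INTEGER)")
    cur.executemany("INSERT INTO ae_incoming VALUES (?, ?)", rows)
    cur.execute("BEGIN")  # the swap happens inside one transaction
    cur.execute("ALTER TABLE ae RENAME TO ae_old")
    cur.execute("ALTER TABLE ae_incoming RENAME TO ae")
    cur.execute("DROP TABLE ae_old")
    cur.execute("COMMIT")

# isolation_level=None lets us manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ae (subject_id TEXT, severity INTEGER)")
conn.execute("INSERT INTO ae VALUES ('stale', 0)")
reseed_with_swap(conn, [("s1", 2), ("s2", 3)])
print(conn.execute("SELECT COUNT(*) FROM ae").fetchone()[0])  # → 2
```

The slow part (loading) happens outside the transaction; only the cheap rename is inside it, so the "live" table is never half-loaded.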


What Have We Gained

• Bring in data from more data sources and more studies effectively
• Run real-time reports on studies and configure alerts (future)
• Configure refreshes as needed by each use case
• Throttle input and output sources at the study/customer level
• Onboard new customers and deploy new studies with minimal human intervention


What Have We Gained

A generic framework which:
• eases integration with new data sources – for each new source, implement a method to create a virtual schema and a method to get data for a given table
• scales and is fault tolerant
• has generic monitoring and alerting
• eases maintenance, since it is mostly generic code
• notifies important events through messages
• runs on any hardware
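The per-source contract (a virtual schema plus table data) might be sketched like this; the class and method names are illustrative, not the actual API:

```python
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Contract each new data source must satisfy. Everything else
    (scheduling, retries, metrics, loading) stays generic."""

    @abstractmethod
    def virtual_schema(self):
        """Return {table_name: {column_name: sql_type}} for this source."""

    @abstractmethod
    def rows(self, table):
        """Yield rows (as tuples) for the given table."""

class InMemoryAdapter(SourceAdapter):
    """Toy adapter over an in-memory 'export'; a real one would read a
    RAVE/Merge/BioClinica API or files dropped over SFTP."""
    def __init__(self, tables):
        self._tables = tables
    def virtual_schema(self):
        return {name: {"value": "text"} for name in self._tables}
    def rows(self, table):
        yield from self._tables[table]

adapter = InMemoryAdapter({"ae": [("mild",), ("severe",)]})
print(adapter.virtual_schema())   # → {'ae': {'value': 'text'}}
print(list(adapter.rows("ae")))   # → [('mild',), ('severe',)]
```

With this split, adding a source means writing only these two methods; the diff, seeding, and monitoring machinery is shared.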


Distributed File System – Requirements

Accessibility
• Customers must be able to drop files securely (SFTP-like functionality)
• Ability to access resources through URLs
• Data storage

Scalability and Redundancy
• Scale out by adding nodes
• Resilience against loss of nodes and data centers, through replication

Miscellaneous
• Access control over reads/writes
• Performance/usage/resource-utilization monitoring


HDFS with High Availability Mode

• Two NameNodes running in HA mode, co-located with two JournalNodes
• Third JournalNode on a separate node
• DataNodes on all bare metal nodes
• HDFS mounted with FUSE; SFTP enabled through OS-level features
• Automatic failover through DNS and HAProxy
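An illustrative hdfs-site.xml fragment for such an HA setup; the nameservice and hostnames are placeholders:

```xml
<!-- Sketch only: two NameNodes (nn1, nn2) sharing edits via three JournalNodes. -->
<property>
  <name>dfs.nameservices</name>
  <value>comprehend</value>
</property>
<property>
  <name>dfs.ha.namenodes.comprehend</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/comprehend</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.comprehend</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Automatic failover additionally requires a ZooKeeper quorum and the ZKFailoverController processes alongside each NameNode.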


Challenges

• Regulatory requirements: data encryption requirements for clinical data, audit trails
• Data quality: source system constraints
• Coordination between Synchers and Seeders: distributed locks and counters
• Automatic failover when a NameNode fails in HDFS: HDFS HA mode stores the active NameNode in ZooKeeper as a Java serialized object, yikes!


Future Work

• Time travel: ability to go back in time and run reports at any given point, with a trail of data
• Containerization
• In-memory query execution with Apache Spark


Team


Thank You!

Questions …