Internet2 Bio IT 2016 v2
Transcript of Internet2 Bio IT 2016 v2
Pushing Discovery with Internet2
Cloud to Supercomputing in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
2 – 05/01/2023, © 2011 Internet2
Internet2 Overview
Internet2 is an advanced networking consortium of academia, corporations, and government. It operates a best-in-class national optical network: 15,000 miles of dedicated fiber, 100G routers and optical transport systems, and 8.8 Tbps of capacity. For over 20 years, our mission has been to:
• Provide cost-effective broadband and collaboration technologies to facilitate frictionless research in Big Science (broad collaboration, extremely large data sets)
• Create tomorrow's networks and a platform for networking research
• Engage stakeholders in bridging the IT/researcher gap and developing new technologies critical to their missions
[ 3 ]
The 4th Gen Internet2 Network, by the numbers:
• 17 Juniper MX960 nodes
• 31 Brocade and Juniper switches
• 49 custom colocation facilities
• 250+ amplification racks
• 15,717 miles of newly acquired dark fiber
• 2,400 miles of partnered capacity with Zayo Communications
• 8.8 Tbps of optical capacity
• 100 Gbps of hybrid Layer 2 and Layer 3 capacity
• 300+ Ciena ActiveFlex 6500 network elements
Technology
A research-grade, high-speed network optimized for "elephant flows"
• Layer 1: secure point-to-point wavelength networking
• Advanced Layer 2 Services: an open virtual network for Life Sciences with connectivity speeds up to 100 Gbps; SDN network virtualization customer trials now underway
• Advanced Layer 3 Services: high-speed IP connectivity to the world
Superior economics
Secure sharing of online research resources via a federated identity management system
[ 5 ]
Internet2 Members and Partners
• 255 Higher Education members
• 67 Affiliate members
• 41 R&E Network members
• 82 Industry members
• 65+ international partners reaching over 100 nations
• 93,000+ community anchor institutions
Focused on member technology needs since 1996
"The idea of being able to collaborate with anybody, anywhere, without constraint…"
—Jim Bottum, CIO, Clemson University
Community
Strong international partnerships
Agreements with international networking partners offer interoperability and access, enabling collaboration between U.S. researchers and overseas counterparts across more than 100 international R&E networks.
Community
Some of our Affiliate Members
[ 8 ]
R&E innovation sparked IT success
• Routers: Stanford
• Computer workstations: Berkeley, Stanford
• Security systems: Univ. of Michigan
• Security systems: Georgia Tech
• Social media: Harvard
• Network caching: MIT
• Search: Stanford
[ 9 ]
The Route to Innovation
Abundant Bandwidth
• Raw capacity now available on the Internet2 Network is a key imagination enabler
• Incent disruptive use of new, advanced capabilities
Software Defined Networking
• Open up the network layer itself to innovation
• Let innovators communicate with and program the network itself
• Allow developers to optimize the network for specific applications
Science DMZ
• Architect a special solution to allow higher-performance data flows
• Include an end-to-end performance monitoring server and software
• Include an SDN server to support programmability
Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) is key to basic and applied research
• Reproducibility: methods must be captured as well as raw data
• High variability in analytic processes and instruments
• Inconsistent formats and standards
• Lack of metadata and standards
Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant):
• 21k human genes can make >100k proteins
• >50% of genes are controlled by day-night cycles
• Proteins have an average half-life of 30 hours
• Several thousand metabolites are rapidly changing
• Traits are environmentally and genetically controlled
Information technology, through high-performance computing and networking, can now explore these systems through simulation.
Collaboration: cross-domain and cross-discipline; the distribution of systems and talent is global; resources are public, private and academic.
BIO-IT Trends in the Trenches 2015 with Chris Dagdigian
Take-aways:
- Science is changing faster than the IT funding cycle for data-intensive computing environments
- Forward-looking 100G multi-site, multi-party collaborations required
- Cloud adoption driven by capability vs. cost
- Centralized data center dead; the future is distributed computing/data stores
- Big pharma security challenge has been met
- SDN is real and happening now; part of the infrastructure automation wave
- Blast radius more important than ever: DOE's Science DMZ architecture is a solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
Change
[ 12 ]
Data Tsunami
Physics: Large Hadron Collider (CERN)
Life Sciences: Next Generation Sequencers (Illumina)
Networking
[ 14 ]
2012: US – China 10 Gbps Link
FedEx: 2 days
Internet + FTP: 26 hours
China-US 10G link: 30 secs
Dr. Lin Fang, Dr. Dawei Lin
Sample.fa (24 GB)
NCBI/UC Davis/BGI: first ultra-high-speed transfer of genomic data between China and the US, June 2012
“The 10 Gigabit network connection is even faster than transferring data to most local hard drives,” said Dr. Lin [of UC, Davis]. “The use of a 10 Gigabit network connection will be groundbreaking, very much like email replacing hand delivered mail for communication. It will enable scientists in the genomics-related fields to communicate and transfer data more rapidly and conveniently, and bring the best minds together to better explore the mysteries of life science.” (BGI press release)
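The comparison above can be sanity-checked with simple arithmetic: transfer time is file size in bits divided by effective link speed. A minimal sketch (the function name and the 64% efficiency figure are illustrative assumptions, not measured values):

```python
def transfer_seconds(size_bytes, link_bps, efficiency=1.0):
    """Seconds to move size_bytes over a link_bps link at a given efficiency."""
    return size_bytes * 8 / (link_bps * efficiency)

sample = 24e9  # the 24 GB Sample.fa file from the demo

ideal = transfer_seconds(sample, 10e9)        # ~19 s at the full 10 Gbps line rate
demo  = transfer_seconds(sample, 10e9, 0.64)  # ~30 s, matching the reported time
ftp   = transfer_seconds(sample, 2e6)         # ~26.7 h at a ~2 Mbps effective rate
```

The gap between the ideal 19 seconds and the demonstrated 30 seconds is where protocol overhead and end-system tuning, the kind of bottlenecks a Science DMZ targets, come in.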
Life Sciences Engagement
Community
Forward-Looking 100G Networks & Multi-Site, Multi-Party Collaboration
Accelerating Discovery: USDA ARS Science Network
[ 18 ]
USDA Agriculture Research Services Science Network
USDA scope is far beyond human
USDA Agricultural Research Services Use Cases
Drought (Soil Moisture) Project: challenging volumes of data
- NASA satellite data storage: 7 TB/mo. over a 36-month mission
- ARS Hydrology and Remote Sensing Lab analysis: 108 TB
- Data completely re-processed 3 to 5 times
Microbial Genomics Project: computational bottlenecks
- Individual strains of bacteria and microorganism communities related to food safety, animal health, and feed efficiency
[ 20 ]
ARS Big Data Initiative
Big Data Workshop Recommendations (February 2013)
Three pillars of the ARS Big Data Implementation Plan: Network, HPC, Virtual Research Support (April 2014)
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of research data to HPC and storage from ARS locations
• Virtual Researcher Support implementation complete (Nov. 2015): Clay Center, NE; Albany, CA; Beltsville Labs/Nat'l Ag. Library, Beltsville, MD; Stoneville, MS; Ft. Collins, CO; Ames/NADC, IA
• ARS Scientific Computing Assessment: final report March 2014
SCInet Locations and Gateways
[Map: USDA Agricultural Research Service SCInet sites at Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; and Beltsville, MD, with gateway links at 100 Gb and 10 Gb.]
Cloud & Distributed Research Computing @Scale
[ 22 ] Community
Internet2 approach:
- Agile scaling of resources and capacity
- Access to multi-domain, multi-discipline expertise in one dynamic global community
- A bottomless toolbox of innovation for the researcher
[ 23 ]
New High Speed Cloud Collaborations
[Diagram: a high-speed cloud collaboration fabric linking federal agencies, agribusiness, and pharma/biotech with federal data sources (NCBI, NAL, NOAA), TACC, CyVerse/iPlant, SDSC, NCSA, federal national labs and cloud data centers, global research institutions, Google and Azure (Layer 3), AWS Direct Connect (Layer 3), and A*Star, over 10G, x10G, and x100G links.]
Syngenta Science Network
Bringing plant potential to life through enhanced computing capacity
Syngenta is a leading agriculture company helping to improve global food security by enabling millions of farmers to make better use of available resources.
Key research challenge:
How to grow plants more efficiently? Internet2 members, especially land grant universities, are important research partners.
The Challenge
- Increasing size of scientific data sets
- Growing number of useful external resources and partners
- Increasing complexity of genomic analyses
- Need for big-data collaborations across the globe
- Must innovate
Solution: 10 Gbps Advanced Layer 2 Service
- Higher data throughput
- High-speed connectivity to AWS Direct Connect
- Surge HPC
- Collaborations with the academic community
High-speed connections to best-in-class supercomputing resources
NCSA (University of Illinois)
- Leverage NCSA expertise in building custom R&D workflows
- Leverage the NCSA Industry Partnership Program
A*Star Supercomputing Center in Singapore
- Supports a global, distributed scientific computing capability
Global scale: creating a global fabric for computing and collaboration
“I want to be 15 minutes behind NCSA and 6 months ahead of my competition”
- Keith Gray, BP
[ 28 ]
National Center for Supercomputing Applications
[ 29 ]
[Diagram: the R&D pipeline from theoretical and basic research, through prototyping and development, then optimization and robustification, to commercialization, yielding products that are better designed, more durable, and available sooner.]
[ 30 ]
NCSA-Mayo Clinic @Scale Genome-Wide Association Study for Alzheimer's Disease
The NCSA Private Sector Program, UIUC HPCBio, the Mayo Clinic, the Blue Waters team, and the Swiss Institute of Bioinformatics worked together to identify which genetic variants interact to influence gene expression patterns that may associate with Alzheimer's disease.
[ 31 ]
Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against 181 subjects and 24,544 genic regions
• Computationally large problem: PLINK, ~2 years at Mayo; FastEpistasis, ~6 hours on Blue Waters
• Can be a big data problem: 500 PB if all results are kept; 4 TB when using a conservative cutoff
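The numbers above imply some striking derived figures; a back-of-the-envelope sketch (the per-test record size is inferred from the slide's totals, not stated in it):

```python
# Scale arithmetic from the slide's figures; derived quantities are estimates.
pairs   = 50_011_495_056        # pairs of variants
regions = 24_544                # genic regions each pair is tested against
tests   = pairs * regions       # ~1.2e15 association tests overall

# PLINK ~2 years at Mayo vs FastEpistasis ~6 hours on Blue Waters
speedup = (2 * 365 * 24) / 6    # ~2,900x

# If all results were kept (~500 PB), the implied size of one test record:
bytes_per_test = 500e15 / tests  # roughly 400 bytes
```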
San Diego Supercomputing Center
[ 32 ]
UCSC Cancer Genomics Hub: Large Data Flows to End Users
[Chart: cumulative TBs of CGHub files downloaded, with end-user flows at 1G, 8G, and 15G and the repository approaching 30 PB. Data source: David Haussler, Brad Smith, UCSC; Larry Smarr, Calit2.]
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
[ 34 ]
SDSC Protein Data Bank Archive
• A repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. The information is annotated and publicly released into the archive by the wwPDB.
SDSC Expertise
– Bioinformatics programming and applications support
– Computational chemistry methods
– Compliance requirements, e.g., for dbGaP, FISMA and HIPAA
– Data mining techniques, machine learning and predictive analytics
– HPC and storage system architecture and design
– Scientific workflow systems and informatics pipelines
Education and Training
– Intensive boot camps for working professionals: Data Mining, Graph Analytics, and Bioinformatics and Scientific Workflows
– Customized, on-site training sessions/programs
– Data Science Certificate program
– "Hackathon" events in data science and other topics
Sherlock Cloud: A HIPAA-Compliant Cloud
Healthcare IT Managed Services, an SDSC Center of Excellence
• Expertise in systems, cyber security, data management, analytics, application development, advanced user support and project management
• Operating the first and largest FISMA data warehouse platform for Medicaid fraud, waste and abuse analysis
• Leveraged FISMA experience to offer HIPAA-compliant managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP and other UC campuses
• Sherlock services: Data Lab, Analytics, Case Management and Compliant Cloud
Lawrence Livermore National Lab
[ 37 ]
Lawrence Livermore NL HPC Innovation Center
Cardioid electrophysiology: human heart simulations allowing exploration of the causes of
• Arrhythmia
• Sudden cardiac arrest
• Predictive drug interactions
Depicts the activation of each heart muscle cell and the cell-to-cell transfer of voltage for up to 3 billion cells, in near-real time.
Metagenomic analysis with Catalyst:
• Comparing short genetic fragments in a query dataset against a large searchable index of genomes (14 million genomes, 3x larger than those currently in use) to determine the threat an organism poses to human health
Community Data Science Resources: RENCI RADII and GWU HIVE
Driving infrastructure virtualization
Enabling reproducibility for FDA submissions
[ 39 ]
RADII: Resource Aware Datacentric collaboratIve Infrastructure
Goal: Make data-driven collaborations a 'turn-key' experience for domain researchers and a 'commodity' for the science community.
Approach: A new cyberinfrastructure to manage data-centric collaborations, based upon the natural models of collaboration that occur among scientists.
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar. SDSC: Amit Majumdar. Duke: Erich Huang.
Workflows - especially data-driven workflows and workflow ensembles - are becoming a centerpiece of modern computational science.
RADII Rationale
Multi-institutional research teams grapple with a multitude of resources:
- Policy-restricted large data sets
- Campus compute resources
- National compute resources
- Instruments that produce data
- Interconnection by networks from campus, regional, and national providers
Many options, much complexity. Data and infrastructure are treated separately: infrastructure management has no visibility into data resources, and data management solutions have no visibility into the infrastructure.
RADII Creates
A cyberinfrastructure that integrates data and resource management from the ground up to support data-centric research. RADII allows scientists to easily map collaborative data-driven activities onto a dynamically configurable cloud infrastructure.
RADII: Foundational Technologies
- Data grids present distributed data under a single abstraction and authorization layer.
- Networked Infrastructure as a Service (NIaaS) enables rapid deployment of programmable virtual network infrastructure (clouds).
Today these are disjoint solutions with incompatible resource abstractions; RADII combines them to reduce the data-infrastructure management gap.
RADII System: Virtualizing Data, Compute and Network for Collaboration
- Novel mechanisms to represent data-centric collaborations using DFD formalism
- Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
- Novel mechanisms to map data processes, computations, storage and organizational entities onto infrastructure
FDA and George Washington University
Big Data Decisions: Linking Regulatory and Industry Organizations with HIVE Bio-Compute Objects
[ 44 ]
Presented by: Dan Taylor, Internet2 | Bio-IT | Boston | 2016
From Jan 2016: Vahan Simonyan and Raja Mazumder lecture, NIH Frontiers in Data Science Series: https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
High-performance Integrated Virtual Environment (HIVE): a regulatory NGS data analysis platform
BIG DATA – From a range of samples and instruments to approval for use
[Diagram: the NGS lifecycle, from sample, sequencing run, archival, and experiment through file transfer, data retrieval, computation pipelines, knowledge extraction, analysis and review, and regulation. Annotated pain points: produced files are massive in size; transfer is slow; data are too large to keep forever and not standardized; results are difficult to validate, visualize, and interpret; and how do we avoid mistakes?]
NGS lifecycle: from a biological sample to biomedical research and regulation
• Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult
• Data Standards: floating standards, a multiplicity of formats, inadequate communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: constrictive legal framework and ownership issues across the board, from the patient bedside to FDA regulation
• Data Security: a large number of complicated security rules and data protection requirements tax IT subsystems and cripple performance
• Computation Size: distributed computing, inefficiently parallelized, requires large investments in hardware, software and human-ware
• Computation Standards: non-canonical computation protocols make computations difficult to compare, reproduce, and rely on
• Computation Complexity: significant investment of time and effort to learn appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize and summarize
• Publication: peer review and audit require communicating massive amounts of information
... and how do we avoid mistakes?
software challenges and needs
HIVE is an End-to-End Solution
• Data retrieval from anywhere in the world
• Storage of extra-large-scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor-made analytics designed around needs
• Visualization made to help in the interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility, interoperability, and long-term referable storage of computations and results
HIVE is not:
• an application to perform a few tasks
• yet another database
• a computer cluster, a cloud, or a data center
• an IT subsystem
More: http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
[Diagram: a data typing engine instantiates data type definitions (definitions of metadata types, of computations, and of algorithms and pipeline descriptions), yielding metadata, data, and bio-compute objects as computational protocols. The result: verifiable results within acceptable uncertainty/error and scientifically reliable interpretation.]
HIVE data universe
[Diagram: regulatory iterations today. Industry (1) forms data, (2) computes, and (3) submits to FDA regulatory analysis; the FDA applies (4) SOPP/protocols and issues (5) a regulatory decision; (6) issues force resubmits, each iteration costing millions of dollars, until (7) a yes/no decision reaches the consumer. ~$800 million in R&D dollars for a single drug; ~$2.6 billion total cost.]
[Diagram: the same cycle with HIVE. Industry (1) forms data, (2) computes on HIVE, public-HIVE, Galaxy, CLC, or DNA-nexus, (3) submits, and (4) submits a bio-compute object; the FDA (2) applies HIVE SOPP/protocols, (3) computes, (4) integrates the submitted bio-compute object (facilitating integration), and issues (5) a bio-compute-based decision; (6) issues force resubmits, costing millions of dollars, until (7) a yes/no decision reaches the consumer.]
bio-compute as a way to link regulatory and industry organizations
Federated Identity
[ 52 ]
[ 53 ]
Community-developed framework of trust enables:
• Secure, streamlined sharing of protected resources
• Consolidated management of user identities and access
• Delivery of an integrated portfolio of community-developed solutions
Trusted Identity in Research
The standard for over 600 higher education institutions—and counting!
[ 54 ]
425+ Academic Participants
160+ Sponsored Partners
2,000+ Registered Service Providers
7.8 million Individuals served by federated IdM
Foundation for Trust & Identity
[ 55 ]
Acknowledgements
• Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Dawei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESnet, US Department of Energy
• BGI
• Nature
Thank you!
Daniel Taylor, Director, Business [email protected]
Backup slides
Science DMZ
[ 57 ]
[ 58 ]
Rising expectations
Network throughput required to move y bytes in x time (US Dept of Energy, http://fasterdata.es.net).
[Chart annotations: "should be easy"; "This year"]
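The chart's underlying formula is straightforward: required sustained throughput is data size in bits divided by the time window. A small sketch of that rule of thumb (the helper name and the example sizes are illustrative, not from the chart):

```python
def required_gbps(size_bytes, hours):
    """Sustained throughput in Gbps needed to move size_bytes within hours."""
    return size_bytes * 8 / (hours * 3600) / 1e9

day  = required_gbps(100e12, 24)      # 100 TB in a day: ~9.3 Gbps sustained
week = required_gbps(100e12, 24 * 7)  # the same data in a week: ~1.3 Gbps
```

The point of the chart is that data sets growing by orders of magnitude turn transfers that "should be easy" into ones needing engineered, research-grade paths.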
Science DMZ* and perfSONAR: a design pattern to address the most common bottlenecks to moving data
* fasterdata.es.net