Internet2 Bio IT 2016 v2
Transcript of Internet2 Bio IT 2016 v2
Pushing Discovery with Internet2
Cloud to Supercomputing in Life Sciences
DAN TAYLOR
Director, Business Development, Internet2
BIO-IT WORLD 2016
BOSTON
APRIL, 2016
2 – 05/01/2023, © 2011 Internet2
Internet2 Overview
Internet2 is an advanced networking consortium of academia, corporations, and government. It operates a best-in-class national optical network: 15,000 miles of dedicated fiber, 100G routers and optical transport systems, and 8.8 Tbps of capacity. For over 20 years, our mission has been to:
• Provide cost-effective broadband and collaboration technologies to facilitate frictionless research in Big Science (broad collaboration, extremely large data sets)
• Create tomorrow's networks and a platform for networking research
• Engage stakeholders in bridging the IT/researcher gap and developing new technologies critical to their missions
[ 3 ]
The 4th Gen Internet2 Network, by the numbers:
• 17 Juniper MX960 nodes
• 31 Brocade and Juniper switches
• 49 custom colocation facilities
• 250+ amplification racks
• 15,717 miles of newly acquired dark fiber
• 2,400 miles of partnered capacity with Zayo Communications
• 8.8 Tbps of optical capacity
• 100 Gbps of hybrid Layer 2 and Layer 3 capacity
• 300+ Ciena ActiveFlex 6500 network elements
Technology
A research-grade, high-speed network optimized for "elephant flows"
• Layer 1: secure point-to-point wavelength networking
• Advanced Layer 2 Services: an open virtual network for Life Sciences with connectivity speeds up to 100 Gbps; SDN network virtualization customer trials now underway
• Advanced Layer 3 Services: high-speed IP connectivity to the world
Superior economics
Secure sharing of online research resources via a federated identity management system
[ 5 ]
Internet2 Members and Partners
• 255 Higher Education members
• 67 Affiliate members
• 41 R&E Network members
• 82 Industry members
• 65+ international partners reaching over 100 nations
• 93,000+ community anchor institutions
Focused on member technology needs since 1996
"The idea of being able to collaborate with anybody, anywhere, without constraint…"
—Jim Bottum, CIO, Clemson University
Community
Strong international partnerships
Agreements with international networking partners offer interoperability and access, enabling collaboration between U.S. researchers and overseas counterparts across more than 100 international R&E networks.
Community
Some of our Affiliate Members
[ 8 ]
R&E innovation sparked IT success
• Routers: Stanford
• Computer workstations: Berkeley, Stanford
• Security systems: Univ. of Michigan
• Security systems: Georgia Tech
• Social media: Harvard
• Network caching: MIT
• Search: Stanford
[ 9 ]
The Route to Innovation
Abundant Bandwidth
• Raw capacity now available on the Internet2 Network is a key imagination enabler
• Incent disruptive use of new, advanced capabilities
Software Defined Networking
• Open up the network layer itself to innovation
• Let innovators communicate with and program the network itself
• Allow developers to optimize the network for specific applications
Science DMZ
• Architect a special solution to allow higher-performance data flows
• Include an end-to-end performance monitoring server and software
• Include an SDN server to support programmability
Life Sciences Research Today
• Sharing Big Data sets (genomic, environmental, imagery) is key to basic and applied research
• Reproducibility: methods must be captured as well as raw data
• High variability in analytic processes and instruments
• Inconsistent formats and standards
• Lack of metadata and standards
Biological systems are immensely complicated and dynamic (S. Goff, CyVERSE/iPlant):
• 21k human genes can make >100k proteins
• >50% of genes are controlled by day-night cycles
• Proteins have an average half-life of 30 hours
• Several thousand metabolites are rapidly changing
• Traits are environmentally and genetically controlled
Information technology, through high-performance computing and networking, can now explore these systems through simulation.
Collaboration: cross-domain and cross-discipline; the distribution of systems and talent is global; resources are public, private and academic.
BIO-IT Trends in the Trenches 2015 with Chris Dagdigian
Take-aways:
- Science is changing faster than the IT funding cycle for data-intensive computing environments
- Forward-looking 100G multi-site, multi-party collaborations required
- Cloud adoption driven by capability vs. cost
- Centralized data center dead; the future is distributed computing/data stores
- Big pharma security challenge has been met
- SDN is real and happening now; part of the infrastructure automation wave
- Blast radius more important than ever: DOE's Science DMZ architecture is a solution
https://youtu.be/U6i0THTxe4o
http://www.slideshare.net/chrisdag/2015-bioit-trends-from-the-frenches
2015 Bio-IT World Conference & Expo
• Change
• Networking
• Cloud
• Decentralized Collaboration
• Security
• Mission Networks
Change
[ 12 ]
Data Tsunami
Physics: Large Hadron Collider (CERN)
Life Sciences: Next Generation Sequencers (Illumina)
Networking
[ 14 ]
2012: US – China 10 Gbps Link
FedEx: 2 days
Internet + FTP: 26 hours
China-US 10G link: 30 secs
Dr. Lin Fang, Dr. Dawei Lin
Sample.fa (24 GB)
NCBI/UC Davis/BGI: first ultra-high-speed transfer of genomic data between China and the US, June 2012
“The 10 Gigabit network connection is even faster than transferring data to most local hard drives,” said Dr. Lin [of UC, Davis]. “The use of a 10 Gigabit network connection will be groundbreaking, very much like email replacing hand delivered mail for communication. It will enable scientists in the genomics-related fields to communicate and transfer data more rapidly and conveniently, and bring the best minds together to better explore the mysteries of life science.” (BGI press release)
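The comparison above can be sanity-checked with simple arithmetic: transfer time is file size in bits divided by effective link speed. A minimal sketch (the function name and the 64% efficiency figure are illustrative assumptions, not measured values):

```python
def transfer_seconds(size_bytes, link_bps, efficiency=1.0):
    """Seconds to move size_bytes over a link_bps link at a given efficiency."""
    return size_bytes * 8 / (link_bps * efficiency)

sample = 24e9  # the 24 GB Sample.fa file from the demo

ideal = transfer_seconds(sample, 10e9)        # ~19 s at the full 10 Gbps line rate
demo  = transfer_seconds(sample, 10e9, 0.64)  # ~30 s, matching the reported time
ftp   = transfer_seconds(sample, 2e6)         # ~26.7 h at a ~2 Mbps effective rate
```

The gap between the ideal 19 seconds and the demonstrated 30 seconds is where protocol overhead and end-system tuning, the kind of bottlenecks a Science DMZ targets, come in.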
Life Sciences Engagement
Community
Forward-Looking 100G Networks & Multi-Site, Multi-Party Collaboration
Accelerating Discovery: USDA ARS Science Network
[ 18 ]
USDA Agriculture Research Services Science Network
USDA scope is far beyond human
USDA Agricultural Research Services Use Cases
Drought (Soil Moisture) Project: challenging volumes of data
- NASA satellite data storage: 7 TB/mo. over a 36-month mission
- ARS Hydrology and Remote Sensing Lab analysis: 108 TB
- Data completely re-processed 3 to 5 times
Microbial Genomics Project: computational bottlenecks
- Individual strains of bacteria and microorganism communities related to food safety, animal health, and feed efficiency
[ 20 ]
ARS Big Data Initiative
Big Data Workshop Recommendations (February 2013)
Three pillars of the ARS Big Data Implementation Plan: Network, HPC, Virtual Research Support (April 2014)
• Develop a Science DMZ
• Enable high-speed, low-latency transfer of research data to HPC and storage from ARS locations
• Virtual Researcher Support implementation complete (Nov. 2015): Clay Center, NE; Albany, CA; Beltsville Labs/Nat'l Ag. Library, Beltsville, MD; Stoneville, MS; Ft. Collins, CO; Ames/NADC, IA
• ARS Scientific Computing Assessment: final report March 2014
SCInet Locations and Gateways
[Map: USDA Agricultural Research Service SCInet sites at Albany, CA; Ft. Collins, CO; Clay Center, NE; Ames, IA; Stoneville, MS; and Beltsville, MD, with gateway links at 100 Gb and 10 Gb.]
Cloud & Distributed Research Computing @Scale
[ 22 ] Community
Internet2 approach:
- Agile scaling of resources and capacity
- Access to multi-domain, multi-discipline expertise in one dynamic global community
- A bottomless toolbox of innovation for the researcher
[ 23 ]
New High Speed Cloud Collaborations
[Diagram: a high-speed cloud collaboration fabric linking federal agencies, agribusiness, and pharma/biotech with federal data sources (NCBI, NAL, NOAA), TACC, CyVerse/iPlant, SDSC, NCSA, federal national labs and cloud data centers, global research institutions, Google and Azure (Layer 3), AWS Direct Connect (Layer 3), and A*Star, over 10G, x10G, and x100G links.]
Syngenta Science Network
Bringing plant potential to life through enhanced computing capacity
Syngenta is a leading agriculture company helping to improve global food security by enabling millions of farmers to make better use of available resources.
Key research challenge:
How to grow plants more efficiently? Internet2 members, especially land grant universities, are important research partners.
The Challenge
- Increasing size of scientific data sets
- Growing number of useful external resources and partners
- Increasing complexity of genomic analyses
- Need for big-data collaborations across the globe
- Must innovate
Solution: 10 Gbps Advanced Layer 2 Service
- Higher data throughput
- High-speed connectivity to AWS Direct Connect
- Surge HPC
- Collaborations with the academic community
High-speed connections to best-in-class supercomputing resources
NCSA (University of Illinois)
- Leverage NCSA expertise in building custom R&D workflows
- Leverage the NCSA Industry Partnership Program
A*Star Supercomputing Center in Singapore
- Supports a global, distributed scientific computing capability
Global scale: creating a global fabric for computing and collaboration
“I want to be 15 minutes behind NCSA and 6 months ahead of my competition”
- Keith Gray, BP
[ 28 ]
National Center for Supercomputing Applications
[ 29 ]
[Diagram: the R&D pipeline from theoretical and basic research, through prototyping and development, then optimization and robustification, to commercialization, yielding products that are better designed, more durable, and available sooner.]
[ 30 ]
NCSA-Mayo Clinic @Scale Genome-Wide Association Study for Alzheimer's Disease
The NCSA Private Sector Program, UIUC HPCBio, the Mayo Clinic, the Blue Waters team, and the Swiss Institute of Bioinformatics worked together to identify which genetic variants interact to influence gene expression patterns that may associate with Alzheimer's disease.
[ 31 ]
Big Data and Big Compute Problem
• 50,011,495,056 pairs of variants
• Each variant pair is tested against 181 subjects and 24,544 genic regions
• Computationally large problem: PLINK, ~2 years at Mayo; FastEpistasis, ~6 hours on Blue Waters
• Can be a big data problem: 500 PB if all results are kept; 4 TB when using a conservative cutoff
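The numbers above imply some striking derived figures; a back-of-the-envelope sketch (the per-test record size is inferred from the slide's totals, not stated in it):

```python
# Scale arithmetic from the slide's figures; derived quantities are estimates.
pairs   = 50_011_495_056        # pairs of variants
regions = 24_544                # genic regions each pair is tested against
tests   = pairs * regions       # ~1.2e15 association tests overall

# PLINK ~2 years at Mayo vs FastEpistasis ~6 hours on Blue Waters
speedup = (2 * 365 * 24) / 6    # ~2,900x

# If all results were kept (~500 PB), the implied size of one test record:
bytes_per_test = 500e15 / tests  # roughly 400 bytes
```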
San Diego Supercomputing Center
[ 32 ]
UCSC Cancer Genomics Hub: Large Data Flows to End Users
[Chart: cumulative TBs of CGHub files downloaded, with end-user flows at 1G, 8G, and 15G and the repository approaching 30 PB. Data source: David Haussler, Brad Smith, UCSC; Larry Smarr, Calit2.]
http://blogs.nature.com/news/2012/05/us-cancer-genome-repository-hopes-to-speed-research.html
[ 34 ]
SDSC Protein Data Bank Archive
• A repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. The information is annotated and publicly released into the archive by the wwPDB.
SDSC Expertise
– Bioinformatics programming and applications support
– Computational chemistry methods
– Compliance requirements, e.g., for dbGaP, FISMA and HIPAA
– Data mining techniques, machine learning and predictive analytics
– HPC and storage system architecture and design
– Scientific workflow systems and informatics pipelines
Education and Training
– Intensive boot camps for working professionals: Data Mining, Graph Analytics, and Bioinformatics and Scientific Workflows
– Customized, on-site training sessions/programs
– Data Science Certificate program
– "Hackathon" events in data science and other topics
Sherlock Cloud: A HIPAA-Compliant Cloud
Healthcare IT Managed Services, an SDSC Center of Excellence
• Expertise in systems, cyber security, data management, analytics, application development, advanced user support and project management
• Operating the first and largest FISMA data warehouse platform for Medicaid fraud, waste and abuse analysis
• Leveraged FISMA experience to offer HIPAA-compliant managed hosting for UC and academia
• Supporting HHS CMS, NIH, UCOP and other UC campuses
• Sherlock services: Data Lab, Analytics, Case Management and Compliant Cloud
Lawrence Livermore National Lab
[ 37 ]
Lawrence Livermore NL HPC Innovation Center
Cardioid electrophysiology: human heart simulations allowing exploration of the causes of
• Arrhythmia
• Sudden cardiac arrest
• Predictive drug interactions
Depicts the activation of each heart muscle cell and the cell-to-cell transfer of voltage for up to 3 billion cells, in near-real time.
Metagenomic analysis with Catalyst:
• Comparing short genetic fragments in a query dataset against a large searchable index of genomes (14 million genomes, 3x larger than those currently in use) to determine the threat an organism poses to human health
Community Data Science Resources: RENCI RADII and GWU HIVE
Driving infrastructure virtualization
Enabling reproducibility for FDA submissions
[ 39 ]
RADII: Resource Aware Datacentric collaboratIve Infrastructure
Goal: Make data-driven collaborations a 'turn-key' experience for domain researchers and a 'commodity' for the science community.
Approach: A new cyberinfrastructure to manage data-centric collaborations, based upon the natural models of collaboration that occur among scientists.
RENCI: Claris Castillo, Fan Jiang, Charles Schmidt, Paul Ruth, Anirban Mandal, Shu Huang, Yufeng Xin, Ilya Baldin, Arcot Rajasekar. SDSC: Amit Majumdar. Duke: Erich Huang.
Workflows - especially data-driven workflows and workflow ensembles - are becoming a centerpiece of modern computational science.
RADII Rationale
Multi-institutional research teams grapple with a multitude of resources:
- Policy-restricted large data sets
- Campus compute resources
- National compute resources
- Instruments that produce data
- Interconnection by networks from campus, regional, and national providers
Many options, much complexity. Data and infrastructure are treated separately: infrastructure management has no visibility into data resources, and data management solutions have no visibility into the infrastructure.
RADII Creates
A cyberinfrastructure that integrates data and resource management from the ground up to support data-centric research. RADII allows scientists to easily map collaborative data-driven activities onto a dynamically configurable cloud infrastructure.
RADII: Foundational Technologies
- Data grids present distributed data under a single abstraction and authorization layer.
- Networked Infrastructure as a Service (NIaaS) enables rapid deployment of programmable virtual network infrastructure (clouds).
Today these are disjoint solutions with incompatible resource abstractions; RADII combines them to reduce the data-infrastructure management gap.
RADII System: Virtualizing Data, Compute and Network for Collaboration
- Novel mechanisms to represent data-centric collaborations using DFD formalism
- Data-centric resource management mechanisms for provisioning and de-provisioning resources dynamically throughout the lifecycle of collaborations
- Novel mechanisms to map data processes, computations, storage and organizational entities onto infrastructure
FDA and George Washington University
Big Data Decisions: Linking Regulatory and Industry Organizations with HIVE Bio-Compute Objects
[ 44 ]
Presented by: Dan Taylor, Internet2 | Bio-IT | Boston | 2016
From Jan 2016: Vahan Simonyan and Raja Mazumder lecture, NIH Frontiers in Data Science Series: https://videocast.nih.gov/summary.asp?Live=18299&bhcp=1
High-performance Integrated Virtual Environment (HIVE): a regulatory NGS data analysis platform
BIG DATA – From a range of samples and instruments to approval for use
[Diagram: the NGS lifecycle, from sample, sequencing run, archival, and experiment through file transfer, data retrieval, computation pipelines, knowledge extraction, analysis and review, and regulation. Annotated pain points: produced files are massive in size; transfer is slow; data are too large to keep forever and not standardized; results are difficult to validate, visualize, and interpret; and how do we avoid mistakes?]
NGS lifecycle: from a biological sample to biomedical research and regulation
• Data Size: petabyte scale, soon exabytes
• Data Transfer: too slow over existing networks
• Data Archival: retaining consistent datasets across many years of mandated evidence maintenance is difficult
• Data Standards: floating standards, a multiplicity of formats, inadequate communication protocols
• Data Complexity: sophisticated IT framework needed for complex dataflow
• Data Privacy: constrictive legal framework and ownership issues across the board, from the patient bedside to FDA regulation
• Data Security: a large number of complicated security rules and data protection requirements tax IT subsystems and cripple performance
• Computation Size: distributed computing, inefficiently parallelized, requires large investments in hardware, software and human-ware
• Computation Standards: non-canonical computation protocols make computations difficult to compare, reproduce, and rely on
• Computation Complexity: significant investment of time and effort to learn appropriate skills and avoid pitfalls in complex computational pipelines
• Interpretation: large outputs from enormous computations are difficult to visualize and summarize
• Publication: peer review and audit require communicating massive amounts of information
... and how do we avoid mistakes?
software challenges and needs
HIVE is an End-to-End Solution
• Data retrieval from anywhere in the world
• Storage of extra-large-scale data
• Security approved by OIM
• Integrator platform to bring different data and analytics together
• Tailor-made analytics designed around needs
• Visualization made to help in the interpretation of data
• Support of the entire hardware, software and knowledge infrastructure
• Expertise accumulated in the agency
• Bio-Compute objects repository to provide reproducibility, interoperability, and long-term referable storage of computations and results
HIVE is not:
• an application to perform a few tasks
• yet another database
• a computer cluster, a cloud, or a data center
• an IT subsystem
More: http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm491893.htm
[Diagram: a data typing engine instantiates data type definitions (definitions of metadata types, of computations, and of algorithms and pipeline descriptions), yielding metadata, data, and bio-compute objects as computational protocols. The result: verifiable results within acceptable uncertainty/error and scientifically reliable interpretation.]
HIVE data universe
[Diagram: regulatory iterations today. Industry (1) forms data, (2) computes, and (3) submits to FDA regulatory analysis; the FDA applies (4) SOPP/protocols and issues (5) a regulatory decision; (6) issues force resubmits, each iteration costing millions of dollars, until (7) a yes/no decision reaches the consumer. ~$800 million in R&D dollars for a single drug; ~$2.6 billion total cost.]
[Diagram: the same cycle with HIVE. Industry (1) forms data, (2) computes on HIVE, public-HIVE, Galaxy, CLC, or DNA-nexus, (3) submits, and (4) submits a bio-compute object; the FDA (2) applies HIVE SOPP/protocols, (3) computes, (4) integrates the submitted bio-compute object (facilitating integration), and issues (5) a bio-compute-based decision; (6) issues force resubmits, costing millions of dollars, until (7) a yes/no decision reaches the consumer.]
bio-compute as a way to link regulatory and industry organizations
Federated Identity
[ 52 ]
[ 53 ]
Community-developed framework of trust enables:
• Secure, streamlined sharing of protected resources
• Consolidated management of user identities and access
• Delivery of an integrated portfolio of community-developed solutions
Trusted Identity in Research
The standard for over 600 higher education institutions—and counting!
[ 54 ]
425+ Academic Participants
160+ Sponsored Partners
2,000+ Registered Service Providers
7.8 million Individuals served by federated IdM
Foundation for Trust & Identity
[ 55 ]
Acknowledgements
• Eric Boyd, Internet2
• Stephen Wolff, Internet2
• Stephen Goff, PhD, CyVERSE/iPlant, University of Arizona
• Chris Dagdigian, BioTeam
• Dawei Lin, PhD, NIAID, NIH
• Paul Gibson, USDA ARS
• Paul Travis, Syngenta
• Evan Burness, NCSA
• Sandeep Chandra, SDSC
• Jonathan Allen, PhD, Lawrence Livermore National Lab
• Claris Castillo, PhD, RENCI
• Vahan Simonyan, PhD, FDA
• Raja Mazumder, PhD, George Washington University
• Eli Dart, ESnet, US Department of Energy
• BGI
• Nature
Thank you!
Daniel Taylor, Director, Business [email protected]
Backup slides
Science DMZ
[ 57 ]
[ 58 ]
Rising expectations
Network throughput required to move y bytes in x time (US Dept of Energy, http://fasterdata.es.net).
[Chart annotations: "should be easy"; "This year"]
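The chart's underlying formula is straightforward: required sustained throughput is data size in bits divided by the time window. A small sketch of that rule of thumb (the helper name and the example sizes are illustrative, not from the chart):

```python
def required_gbps(size_bytes, hours):
    """Sustained throughput in Gbps needed to move size_bytes within hours."""
    return size_bytes * 8 / (hours * 3600) / 1e9

day  = required_gbps(100e12, 24)      # 100 TB in a day: ~9.3 Gbps sustained
week = required_gbps(100e12, 24 * 7)  # the same data in a week: ~1.3 Gbps
```

The point of the chart is that data sets growing by orders of magnitude turn transfers that "should be easy" into ones needing engineered, research-grade paths.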
Science DMZ* and perfSONAR: a design pattern to address the most common bottlenecks to moving data
* fasterdata.es.net