Life Science Software and High Performance Computing Seminar Series Part IV
Craig A. Stewart
Fulbright Senior Scholar at ZIH
Associate Vice President, Research & Academic Computing
License Terms
• Please cite this presentation as: Stewart, C.A. Life Science Software and High Performance Computing: Seminar Series Part IV. 2006. Presentation. Presented at: Technische Universitaet Dresden (Dresden, Germany, 27 Apr 2006). Available from: http://hdl.handle.net/2022/14767
• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.
• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright 2007 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Life Science Software and HPC Seminar Plan as of today
• Today:
– Some thoughts and observations on US national projects and centers
   • Funding agencies
   • HPC/grid computing
   • Bioinformatics and computational biology
– Performance analysis
• Late June – another visit to Dresden, associated with the ISC
• Late August – another visit to Dresden, associated with Euro-PAR
US Funding agencies (1)
• National Science Foundation - $5.5B annual budget; funds about 20% of all basic research in the US. Basic research in comp sci, math, biology, geology, etc. www.nsf.gov
• National Institutes of Health - $27.5B/year. Funds largest share of medical research. 27 separate institutes and centers www.nih.gov
• Department of Energy. Funds much applied and basic research. Funds: Argonne National Laboratory, Brookhaven National Laboratory, Fermi National Accelerator Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, Sandia National Laboratories, Stanford Linear Accelerator Center, Electron accelerators, Thomas Jefferson National Accelerator Facility www.doe.gov
US Funding agencies (2)
• Department of Defense. http://www.defenselink.mil/
– Defense Advanced Research Projects Agency (DARPA) http://www.darpa.mil/
– High Productivity Computing Systems program http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
• Military branches (esp. Army, Navy, Air Force)
• Department of Homeland Security http://www.dhs.gov/dhspublic/
• National Security Agency www.nsa.gov
• Congressional markups
Some shining successes
• DARPANet/Internet/Abilene
• NSF HPC Centers/NITRD
• “Hallmark” demos, e.g. Tornado, Caterpillar bulldozer design
• It’s really possible for a good researcher to get time on a nationally shared supercomputer and get help with it
DARPA High Productivity Computing System program
IBM, Cray, Sun currently phase II industry partners
http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
Real, not peak
http://www.darpa.mil/ipto/programs/hpcs/assessment.htm
Current Top500 list
• DOE impact on top of list!
• http://www.top500.org/lists/2005/11/basic
NSF strategies
• Office of Cyberinfrastructure. Daniel Atkins, Director
• Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. http://www.nsf.gov/publications/pub_summ.jsp?ods_key=cise051203 (aka “the Atkins Report”).
• Draft – NSF’s Cyberinfrastructure vision for the 21st century. http://www.nsf.gov/od/oci/ci_v5.pdf
• NSF Cyberinfrastructure panel
• Systems
– $30M/year x 4 solicitations for large shared systems
– $200M for a 1 PetaFLOPS *achieved* system
– Focus on science results
• Software
– National Middleware Initiative
National supercomputer centers
• Pittsburgh Supercomputing Center
• San Diego Supercomputer Center
• National Computational Science Alliance
• TeraGrid
• Other university centers of note:
– Purdue University
– Ohio Supercomputer Center
– Louisiana State University
– Texas Advanced Computing Center
– Texas Tech
– Rice
– Caltech
– Cornell
– U. Chicago (computation, electronic visualization lab)
– Florida/SURA
NIH
• National Center for Research Resources
• Really focused on clinical resources, not computing resources
• NIH is perhaps doing more than any other funding agency to promote openness in research as a result of its data access policies and support for open source software
• National Library of Medicine, Protein Data Bank (also supported by NSF)
A semi-random walk through some US projects
CIPRES: Cyberinfrastructure for Phylogenetic Research
• http://www.phylo.org/
• The largest active phylogenetics group going. “The goal of the CIPRES project is to enable large-scale phylogenetic reconstructions on a scale that will enable analyses of huge datasets containing hundreds of thousands of biomolecular sequences.” The project has 5 years of funding.
• Computational phylogenetics activities: phylogenetic reconstruction from gene order, gene sequences. Horizontal gene transfer.
RENCI (Renaissance Computing Institute)
• http://www.renci.org/
• Led by Dan Reed. “a major collaborative venture of Duke University, North Carolina State University, the University of North Carolina at Chapel Hill and the state of North Carolina.”
• Funding through the National Middleware Initiative
• Key role in the TeraGrid
Argonne National Lab Biosciences Division
• Led by Rick Stevens. http://www.bio.anl.gov/
• LOTS of structural biology. Very focused, well funded and dedicated group.
Cal-IT2
• Led by Larry Smarr. http://www.calit2.net/
• Lots of areas of focus, including:
– GEON: The Geosciences Network [GEON]
– Laboratory for the Ocean Observatory Knowledge INtegration Grid [LOOKING]
– Sensor Networks
BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human brain disorders
• Grid technology, using federated data system approach, based on Globus, SRB, etc.
OptIPuter
• “The OptIPuter, so named for its use of Optical networking, Internet Protocol, computer storage, processing and visualization technologies, is an envisioned infrastructure that will tightly couple computational resources over parallel optical networks using the IP communication mechanism. The OptIPuter exploits a new world in which the central architectural element is optical networking, not computers - creating "supernetworks".”
• LambdaRAM
• http://www.optiputer.net/index.html
Genomes to Life
• http://www.doegenomestolife.org/
• Original goals:
– Identify and Characterize the Molecular Machines of Life — the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form
– Characterize Gene Regulatory Networks
– Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level
– Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior
– (Goals taken directly from Genomes to Life web site)
Genomes to Life refactored
• The Department of Energy’s Office of Science announced ... that it is revising its plans for the deployment of new research facilities to support its Genomics:GTL program. … The specific goal of the new facilities plan will be to accelerate GTL systems biology research in the area of bioenergy, with the objective of developing cost-effective, biologically based renewable energy sources to reduce U.S. dependence on fossil fuels.
• http://www.sc.doe.gov/Sub/Newsroom/News_Releases/DOE-SC/2006/GTL/index.htm
Current Genomic Pipeline
[Figure: flowchart of the genomic annotation pipeline. Source: http://eol.sdsc.edu/methodology.html. Box and arrow labels follow.]
• Input: Arabidopsis protein sequences
• Processing steps:
– Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
– Structural assignment of domains by PSI-BLAST on FOLDLIB
– Structural assignment of domains by 123D on FOLDLIB
– Create PSI-BLAST profiles for protein sequences
– Store assigned regions in the DB
– Functional assignment by PFAM, NR, PSIPred assignments
– Domain location prediction by sequence
• Arrow label (appears twice): Only sequences w/out A-prediction
• Databases: FOLDLIB; NR, PFAM; SCOP, PDB
• Building FOLDLIB:
– PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
– 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps <30, ends <30)
• Legend: structure info, sequence info
Scale of Multi-genome Analysis
[Figure: the flowchart from the “Current Genomic Pipeline” slide, with the same box labels, now applied to whole genomes and annotated with compute costs. Source: http://eol.sdsc.edu/methodology.html]
• Input: genomes → protein sequences; ~800 genomes @ 10k-20k per genome ≈ 10^7 ORFs
• Database size label: 10^4 entries
• CPU-year labels on the pipeline steps, in extraction order: 4, 228, 3, 9, 252, 3 (which label belongs to which box is not recoverable from the extracted text)
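The scale figures on this slide can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the six CPU-year labels are per-step costs that simply add up, and it assumes perfect parallel scaling on a hypothetical machine of 1000 CPUs. Neither assumption comes from the slide, and the function names are made up for the example.

```python
# Figures read off the slide; how they combine is an assumption of this sketch.
GENOMES = 800
SEQS_PER_GENOME = (10_000, 20_000)        # "10k-20k per" genome
CPU_YEARS_PER_STEP = [4, 228, 3, 9, 252, 3]


def orf_range(genomes, per_genome):
    """Return the (low, high) number of ORFs implied by the genome count."""
    low, high = per_genome
    return genomes * low, genomes * high


def wall_clock_days(cpu_years, n_cpus):
    """Ideal wall-clock days, assuming perfect parallel scaling (an assumption)."""
    return cpu_years * 365.0 / n_cpus


low, high = orf_range(GENOMES, SEQS_PER_GENOME)
total_cpu_years = sum(CPU_YEARS_PER_STEP)

print(f"ORFs: {low:.1e} to {high:.1e}")   # consistent with the slide's ~10^7
print(f"Total CPU years (if the steps add): {total_cpu_years}")
print(f"Days on 1000 CPUs (hypothetical): {wall_clock_days(total_cpu_years, 1000):.1f}")
```

Under these assumptions, the two steps labeled 228 and 252 CPU years account for almost all of the cost, so they are the natural targets for parallelization.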
Other centers of note
• National Resource for Biomedical Supercomputing (NRBSC). Pittsburgh. Source of MCell. http://www.nrbsc.org/.
• Scientific Computing and Imaging Institute – Christopher R. Johnson http://www.sci.utah.edu/
• UCSD Bioinformatics Program - http://bioinformatics.ucsd.edu/
• Wash U bioinformatics http://www.ccb.wustl.edu/
• MIT, Johns Hopkins also have interesting programs
• List (incomplete) at http://zlab.bu.edu/~mfrith/BioinfoCenters.html
Some international efforts
• eScience project - http://www.nesc.ac.uk/. eDiaMoND
• Japanese Petaflops Protein Folding project - http://www.jsbi.org/journal/GIW02/GIW02P121.pdf
Some activities at IU
• Flybase – authoritative source of annotated fruit fly genomic information. http://flybase.bio.indiana.edu/
• Lifescienceweb http://www.lifescienceweb.org/
– Mutdb http://www.mutdb.org/
– SBLEST “The Structure-Based Local Environment Search Tool uses vectors of amino acid structural environments to perform K Nearest Neighbor queries against a database of protein structures. Our Web services allow for authenticated (password protected) submission of a protein structure, or selection of an existing structure and searching it against common databases and then visualization of the results using UCSF Chimera or PyMOL.” http://www.lifescienceweb.org/index.php?mode=sBlest_about
• TeraGrid – teragrid.iu.edu
• IU IT Strategic Plan
• IU Life Sciences Strategic Plan
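The K Nearest Neighbor query mentioned in the SBLEST description can be illustrated with a minimal sketch. This is not SBLEST's implementation: the talk does not say what the environment vectors contain, which distance metric is used, or how the database is indexed. The sketch assumes plain Euclidean distance and a brute-force scan, and the names `euclidean`, `k_nearest`, and `toy_db` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, database, k):
    """Return the k database entries closest to query as (name, distance) pairs.

    database maps an entry name to its feature vector. A brute-force scan is
    used here; a real service would likely use an index (assumption).
    """
    scored = sorted((euclidean(query, vec), name) for name, vec in database.items())
    return [(name, dist) for dist, name in scored[:k]]


# Toy stand-ins for "structural environment" vectors (made-up numbers).
toy_db = {
    "env_A": [0.0, 0.0, 1.0],
    "env_B": [0.1, 0.0, 0.9],
    "env_C": [5.0, 5.0, 5.0],
    "env_D": [1.0, 1.0, 1.0],
}

hits = k_nearest([0.0, 0.0, 1.0], toy_db, k=2)
for name, dist in hits:
    print(f"{name}: {dist:.3f}")
```

The brute-force scan costs one distance computation per database entry, which is fine for a toy example but would be the first thing to replace at the scale of a full structure database.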
Some .orgs and commercial activities
• Bioinformatics.org
– Includes BioBrew Linux
• BioPerl http://www.bioperl.org/wiki/Main_Page
• Biopython http://www.biopython.org/
• BioJava http://biojava.org/wiki/Main_Page
• BioMoby http://biomoby.open-bio.org/index.php/what-is-moby/
• Bio grid activities
– folding@home http://folding.stanford.edu/
– Protein predictor @ home http://predictor.scripps.edu/
– rosetta@home http://boinc.bakerlab.org/rosetta/
– Fight aids @ home http://fightaidsathome.scripps.edu/
– World community grid http://www.worldcommunitygrid.org/
• Commercial:
– Apple bioclusters (uses SGE)
– IBM Life Science Institutes of Innovation
– Sun Center of Excellence
– Dell Center of Excellence
Some Good Books
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed).
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Tisdall, J. 2003. Mastering perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex.
Acknowledgments
• Funding for projects described in this talk has come from the National Science Foundation, National Institutes of Health, Lilly Endowment, Inc., State of Indiana (particularly through support of I-light Initiative and the 21st Century Fund)
• The work described here was made possible by the faculty, students, and staff of Indiana University. Thanks especially to the staff of RAC, CPO, Telecommunications, PTL, UITS generally, the participants in the Indiana Genomics Initiative, and the participants in the METACyt Initiative.
• Several of the slides and ideas presented here were developed by colleagues or collaborators – the Research and Academic Computing Division of UITS in general, and Dick Repasky in particular.
• Stewart’s visit to Dresden is funded in part by the Center for the International Exchange of Scholars, the Technical University of Dresden, and Indiana University
• And thank you very much! This has been fun and educational for me!