doi: 10.1098/rsta.2002.0984

Promise and challenge of high-performance computing, with examples from molecular modelling

By Thom H. Dunning Jr¹, Robert J. Harrison², David Feller² and Sotiris S. Xantheas²

¹North Carolina Supercomputing Center, Research Triangle Park, NC 27709, and Department of Chemistry, University of North Carolina, Chapel Hill, NC 27599, USA

²Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, 906 Battelle Boulevard, Richland, WA 99352, USA

Published online 2 May 2002

Computational modelling is one of the most significant developments in the practice of scientific inquiry in the 20th century. During the past decade, advances in computing technologies have increased the speed of computers by a factor of 100; an increase of a factor of 1000 can be expected in the next decade. These advances have, however, come at a price, namely, radical change(s) in computer architecture. Will computational scientists and engineers be able to harness the power offered by these high-performance computers to solve the most critical problems in science and engineering? In this paper, we discuss the challenges that must be addressed if we are to realize the benefits offered by high-performance computing. The task will not be easy; it will require revision or replacement of much of the software developed for vector supercomputers as well as advances in a number of key theoretical areas. Because of the pace of computing advances, these challenges must be met by close collaboration between computational scientists, computer scientists and applied mathematicians. The effectiveness of such a multidisciplinary approach is illustrated in a brief review of NWChem, a general-purpose computational chemistry code designed for parallel supercomputers.

Keywords: high-performance computing; computational chemistry; NWChem; Ecce; water clusters; water–carbon structure interactions

1. Introduction

Computational modelling was one of the most significant developments in the practice of scientific inquiry in the 20th century. During this century, scientific research was extraordinarily successful in identifying the fundamental laws that govern our material world. At the same time, the advances promised by these discoveries have not been fully realized because the real-world systems governed by these laws can be extraordinarily complex. Computer-based modelling provides a means of addressing

One contribution of 15 to a Discussion Meeting ‘New science from high-performance computing’.

Phil. Trans. R. Soc. Lond. A (2002) 360, 1079–1105

© 2002 The Royal Society

these problems. Since the development of digital computers in the mid-20th century, computational modelling has allowed scientists to describe, with ever-increasing fidelity, such processes as fluid flow and turbulence in physics, molecular structure and reactivity in chemistry, and structure–function relationships in biology. These advances, in turn, have allowed scientists and engineers to better understand the behaviour of such complex processes as fuel combustion, weather patterns and aircraft performance.

A review of research programmes funded by the federal government, e.g. those funded by the Office of Science in the US Department of Energy and the National Science Foundation, reveals that computational modelling is now a significant contributor to most scientific and engineering research programmes. It is also finding increasing use in industrial applications, e.g. in the development of CFC replacements by DuPont in the early 1990s. Computational modelling is particularly important to the solution of problems that are insoluble, or nearly so, by traditional theoretical and experimental approaches (prediction of future climates); hazardous to study in the laboratory (determination of the properties of toxic or radioactive chemicals); or time-consuming or expensive to solve by traditional means (crash testing of automobiles; ‘cut-and-try’ methods for developing new materials—the problem faced by DuPont; X-ray and NMR analysis of protein structure; or the fate of contaminants released in the environment). In some cases, theoretical and experimental approaches do not provide sufficient information to understand and predict the behaviour of the systems being studied. Computational modelling, which allows a description of the system to be constructed from basic theoretical principles and the available experimental data, can be key to solving such problems.

The role of computational modelling in research and industry can be expected to increase dramatically in the 21st century. During the past decade, advances in computing technology have increased the speed of computers by a factor of 100; another factor of 1000 can be expected in the next decade. These advances in computing technology are setting the stage for a revolution in computational modelling. With these new capabilities, it will be possible to dramatically extend our exploration of the fundamental building blocks and processes of nature. Computational physicists will be able to thoroughly test the predictions of the Standard Model of particle physics, computational chemists will be able to design catalysts with improved selectivity and efficiency, and biologists will be able to model the multi-protein complexes that are the ‘machines of life’. At the same time, it will be possible to build truly predictive models of complex natural and engineered systems, such as automobile engines, the Earth’s climate, and cells, organs and species. The economic impact of this revolution will be enormous.

Unfortunately, advances in computing technology have come at a price: radical change(s) in the architecture of high-performance computers. Will computational scientists and engineers be able to efficiently harness the power offered by high-performance computers to solve the most critical problems in science and engineering? In this paper, we discuss the challenges that must be addressed if we are to fully realize the benefits offered by high-performance computing. The task will not be easy; it will require building a new generation of software for computer systems to form the foundation of computing at the terascale and beyond as well as revision or replacement of much of the scientific and engineering software developed for supercomputers in the past. It will also require progress in a number of key areas

of theoretical science to develop high-fidelity models of the complex systems whose study is enabled by these advanced computers. (For a more complete discussion of these issues, see DOE (2000).) These challenges can only be met by close collaboration between scientists and engineers who develop and employ computational models in the course of their work and computer scientists and applied mathematicians involved in the development of new systems software and mathematical and computational approaches. The pace of innovation in computing technologies is simply too fast for traditional, more loosely coupled approaches to be effective. In fact, this approach has already been found to be important in realizing the full potential of today’s parallel supercomputers, a point that we will illustrate with a brief review of the development of NWChem, a general-purpose computational chemistry code specifically designed for parallel supercomputers. Finally, we will discuss examples of the new science that is now possible using the capabilities provided by NWChem.

2. Issues in high-performance computing

By the early 1990s the peak performance of parallel computer systems was sufficiently high that many computational scientists and engineers began to view them as competitive with traditional (vector) supercomputers such as those offered by Cray Research, Inc., in the US and Fujitsu and NEC in Japan. By making use of commodity microprocessors, memory and disks, the new parallel computer systems also offered a dramatic improvement in cost-performance over traditional vector supercomputers. Nonetheless, the scientific and engineering communities have been slow to embrace parallel supercomputers. There are many reasons for this. The systems software for parallel supercomputers was (and still is) immature, with many important pieces missing (e.g. checkpoint-restart). In addition, much of the scientific and engineering software developed for vector supercomputers must be revised or replaced to take advantage of the architecture of parallel supercomputers. Finally, the levels of delivered performance promised by the high peak performance of parallel supercomputers have been difficult to realize (for a more detailed account, see Bailey (1998)). However, with a strong stimulus from the Accelerated Strategic Computing Initiative (ASCI) in the US Department of Energy, computer vendors have aggressively pushed the performance levels of parallel supercomputers higher and higher and, given the economies of scale, it is now clear that the immediate future of high-performance computing will be dominated by parallel supercomputers built from commodity servers tied together by a high-performance communications fabric.

Examples of the new parallel computers include ASCI White, the machine that IBM recently delivered to Lawrence Livermore National Laboratory with a peak performance of 12 trillion arithmetic operations per second (teraops). ASCI White is a cluster of IBM servers linked by a 500 Mb s⁻¹ (bidirectional) switch (Colony). A 30 teraops system (ASCI Q), which will be delivered to Los Alamos National Laboratory in 2002, uses similar technologies pioneered by Compaq (DEC) and Quadrics. The ASCI Roadmap, which envisions a 100 teraops computer in the 2004 timeframe based on this approach, is given in figure 1. Computers based on the ASCI roadmap are now becoming available to the civilian science and engineering communities. The DOE’s National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory recently installed a 5 teraops IBM computer based on the ASCI White design. NERSC provides computing resources for researchers in national

[Figure 1: peak teraops (0.1 to 1000, logarithmic scale) plotted against year (1996–2006) for ASCI Red (SNL), ASCI Blue (LANL, LLNL), ASCI White (LLNL) and ASCI Q (LANL).]

Figure 1. The series of parallel supercomputers built for the DOE’s Accelerated Strategic Computing Initiative and the DOE laboratories in which they have been installed. ASCI Red was built by Intel. One of the ASCI Blue computers was built by IBM, the other by SGI. ASCI White was built by IBM. ASCI Q is being built by Compaq and is scheduled for delivery in 2002.

laboratories and universities supported by the Office of Science. More recently, the Pittsburgh Supercomputer Center (PSC) announced the availability of a Compaq computer (a forerunner of ASCI Q) capable of 6 teraops. The PSC system is supported by the National Science Foundation for university researchers nationwide.

In the following sections, we discuss the nature of the challenges that must be overcome to realize the full benefits of the dramatic increases in peak performance promised by parallel supercomputers, with a focus on scientific and engineering applications. Before doing so, however, it is important to note that change will be endemic to high-performance computing in the next few decades. Although we are currently on a plateau in the evolution of parallel supercomputer architectures (clusters of shared memory computers), this will not last long. New architectures are already on the drawing boards that will be capable of a quadrillion arithmetic operations per second (petaops). Such computers cannot be built using the same technology as in today’s teraops computers—they would require too much space (10 acres!) and consume too much power (500 MW!). A glimpse at the future of high-performance computing is provided by IBM’s Blue Light and Blue Gene projects (Allen et al. 2001). The first will involve the use of a cellular architecture with 100 000 or more processors; the second involves the aggressive use of processors-in-memory technology (32 processors and 8 Mb of memory on the same chip) and will scale to a million processors. Although Blue Light is intended to be a general-purpose computer for scientific and engineering applications, Blue Gene is best viewed as an application-specific computer (its initial focus will be on protein folding, although it is likely to be suitable for a somewhat wider range of applications). However, the technical problems to be solved before either of these computers can be realized are substantial.

Besides the large-scale development efforts on Blue Light and Blue Gene, there may be much to gain by optimizing parallel supercomputers for specific types of scientific and engineering applications. In figure 2, we plot a kiviat diagram for two computers, one whose configuration is optimized for lattice gauge quantum

Figure 2. Kiviat diagram of the capacity (memory_M/F and disk_M/F) and bandwidth (memory_B/F, disk_B/F and network_B/F) ratios for a computer optimally configured for molecular electronic structure calculations (MSCF) and one configured for lattice gauge QCD calculations (QCD). M is the memory/disk storage capacity in bytes, B the memory/disk/network bandwidth in bytes per second, and F the speed of the processor in flop per second.

chromodynamics (QCD) calculations (N. Christ 2000, personal communication) and the other for quantum chemical calculations of the electronic structure of molecules. The axes in the figure correspond to the memory and disk storage (memory_M and disk_M) and bandwidth (memory_B, disk_B and network_B) needs of the two applications normalized by the floating-point speed of the processor (F). Without going into details, it is obvious from figure 2 that the needs of these two applications are markedly different. Clearly, it would be far more cost-efficient to run calculations on computers optimized for the two applications than on a computer configured to satisfy the average needs of the two (or, worse still, force quantum chemical calculations to be run on a computer configured for QCD calculations, which would greatly restrict the science that could be done). In many cases, much of the optimization indicated in figure 2 can be effected by properly configuring an existing commercial parallel supercomputer for the scientific or engineering application of interest.

(a) Computational challenges in high-performance computing

The fundamental reason that it has been difficult to realize the full benefits of parallel supercomputers is the radical change in architecture that has accompanied the transition from vector to parallel supercomputers. From the mid-1970s to the mid-1990s, computational scientists and engineers made good use of the vector supercomputer architecture pioneered by Cray Research, Inc., to refine their computational models and increase the speed of their calculations. The first Cray supercomputer (1976) had a peak speed of approximately 100 million arithmetic operations per second (100 megaops); the peak speed of the last traditional Cray supercomputer (1994) was ca. 30 000 megaops (or 30 gigaops). This increase, each increment of which represented a refinement of the basic Cray architecture, was quickly embraced by the

[Figure 3: three schematic panels, labelled ‘Good Ole Days’ (1990s: CPUs with flat memory), ‘yesterday’s computers’ (a single CPU with a cache in front of memory) and ‘today’s computers’ (several CPUs with caches sharing memory).]

Figure 3. Evolution of supercomputer architectures, from the vector supercomputers of the 1990s through single-processor nodes to multiprocessor nodes. This evolution continues. For example, IBM’s new Power4 systems have two processors per chip with 1.4 Mb of shared cache on-chip and a large off-chip shared cache plus main memory.

science and engineering communities and was responsible for many major advances in computational modelling and simulation.

The situation changed dramatically when parallel supercomputers were introduced in the 1990s. Parallel supercomputers are built from thousands of processors (the ASCI Q computer will have nearly 12 000 processors) using commodity memories (dynamic random access memory (DRAM), the same memories used in personal computers) interconnected by a communications fabric that, although fast by networking standards, is still far slower than access to local memory. To make full use of parallel supercomputers, computational scientists and engineers must address three major challenges:

(i) management of the memory hierarchy;

(ii) expression and management of concurrency;

(iii) efficient sequential execution.

We will briefly discuss each of these issues in turn.

(i) Management of the memory hierarchy

A defining feature of today’s parallel supercomputers is the presence of a deep memory hierarchy (see figure 3). Each processor in a parallel supercomputer has a memory hierarchy—registers, on- and/or off-chip caches, main memory and virtual

[Figure 4: schematic of the GET/compute/PUT cycle: a block of the shared array is copied to local memory (GET), computed on and updated locally, then copied back to the shared array (PUT).]

Figure 4. The Global Arrays Library contains functions to distribute data arrays over all of the memory in a distributed memory computer. The Library also contains functions to copy any needed data from this shared array into local memory (GET), where computations can be performed, and then copy the data from local memory back into the shared array (PUT). The functionality is similar to that provided by shared-memory programming models.

memory (disk)—with each level requiring successively more time to access (latency) and having slower transfer rates (bandwidth). A parallel computer adds an additional level to this hierarchy: remote memory. In exploiting the capabilities of parallel supercomputers, it is critical to carefully manage this memory hierarchy. A programming model based on the concept of non-uniform memory access (NUMA) explicitly recognizes this hierarchy, providing numerous advantages to the scientific programmer. For example, within this model, the same methods that are used to improve the performance of sequential algorithms, e.g. data blocking, may be directly applied to parallel algorithms. Also, from the NUMA perspective, parallel algorithms only differ from sequential algorithms in that they must also express and control concurrency, and distributed and shared-memory computers only differ in the manner in which memory is accessed. NUMA algorithms are typically more efficient, as well as much easier to design and maintain, than those based on MPI/OpenMP.

The Global Arrays Library (GAL), developed by one of the authors (R.J.H.) and his co-workers, implements a very efficient NUMA programming model (Nieplocha et al. 1996). In this model, the programmer explicitly manages data locality by optimally distributing the data among a set of distributed processors/memories and then using functions (GET, PUT) that transfer data between a global address space (distributed multi-dimension arrays) and local storage (array sub-blocks); see figure 4. In this respect, the global arrays (GAs) model has similarities to shared-memory programming models. However, the GAs programming model acknowledges that remote data are slower to access than local data and allows data locality to be explicitly specified and managed. The GAs model exposes the NUMA characteristics of modern high-performance computer systems to the programmer, and, by recognizing the communication overhead for remote data transfer, it promotes data reuse and locality of reference. The functions in the GAL allow each process in a parallel program to access, asynchronously, logical blocks of physically distributed matrices. This functionality has proven to be very useful in numerous computational

chemistry applications. Today, computational codes in other disciplines, e.g. atmospheric modelling, fluid dynamics, and even financial forecasting, are beginning to make use of the NUMA approach and the GAL.

The GAL is available for a broad range of parallel computers from collections of PCs (running Linux) or Unix workstations to large parallel systems such as those offered by SGI, IBM, Cray, Fujitsu, etc. For further information on the GAL (as well as other ParSoft tools), see http://www.emsl.pnl.gov:2080/docs/parsoft/.
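
The GET/compute/PUT pattern of figure 4 can be caricatured in a few lines. The following is a minimal conceptual sketch in Python: a single NumPy array stands in for the distributed global array, and the class and method names are purely illustrative, not the actual GAL interface (which is provided as C and Fortran bindings).

    import numpy as np

    class ToyGlobalArray:
        """Conceptual stand-in for a GA-style shared array.

        In the real Global Arrays Library the data would be spread over the
        memories of many processes and get/put would move blocks over the
        network; here one NumPy array plays that role."""

        def __init__(self, n):
            self._data = np.zeros((n, n))

        def get(self, lo, hi):
            # GET: copy a logical block of the shared array into local memory.
            (r0, c0), (r1, c1) = lo, hi
            return self._data[r0:r1, c0:c1].copy()

        def put(self, lo, hi, block):
            # PUT: copy a local buffer back into the shared array.
            (r0, c0), (r1, c1) = lo, hi
            self._data[r0:r1, c0:c1] = block

    def scale_blocks(g, n, block=64, factor=2.0):
        """Fetch a block, update it entirely in local memory, put it back."""
        for r0 in range(0, n, block):
            for c0 in range(0, n, block):
                lo, hi = (r0, c0), (min(r0 + block, n), min(c0 + block, n))
                local = g.get(lo, hi)   # remote -> local (expensive in reality)
                local *= factor         # compute only on local data
                g.put(lo, hi, local)    # local -> remote

    if __name__ == "__main__":
        n = 256
        g = ToyGlobalArray(n)
        g.put((0, 0), (n, n), np.ones((n, n)))
        scale_blocks(g, n)

The point of the sketch is the discipline, not the arithmetic: all computation happens on explicitly fetched local blocks, which is what makes the communication cost visible and controllable.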

(ii) Expression and management of concurrency

Most scientific algorithms are rich in parallelism, but to understand the performance of these algorithms on parallel supercomputers requires more information on how much parallelism is needed and at what level of granularity. Bailey (1997) applied Little’s law from queueing theory to address this question. By considering the amount of data that must be in flight in order to sustain a given level of performance, F, on a single processor, he showed that the minimum concurrency needed to sustain that level of performance was the product of F and L_M, the memory latency on the processor node. For a teraops computer built with commodity 100 ns DRAM memory, a minimum concurrency of 100 000 is needed to sustain one teraop (10¹² × 100 × 10⁻⁹). Using this approach, we can also estimate how much of this concurrency must be fine or coarse grain. For example, if each processor in the teraops computer has a peak speed of one gigaop, then within each processor the algorithm must provide a fine-grain concurrency of at least 100 to support one gigaop. The additional factor of 1000 required to reach one teraop must then come from coarse-grain parallelism.
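
The arithmetic behind this estimate is simple enough to set down explicitly. The short Python sketch below just restates the Little’s law argument with the numbers used in the text (a 1 teraop target, 100 ns DRAM latency, 1 gigaop processors); the numbers are illustrative, not measurements.

    # Little's law estimate of the concurrency needed to hide memory latency.
    F_total = 1.0e12   # target aggregate rate: 1 teraop/s
    F_proc  = 1.0e9    # per-processor rate: 1 gigaop/s
    L_M     = 100e-9   # local DRAM latency: 100 ns

    total_concurrency = F_total * L_M            # operations that must be in flight
    fine_grain  = F_proc * L_M                   # concurrency needed inside one processor
    coarse_grain = total_concurrency / fine_grain  # remaining factor from many processors

    print(f"total concurrency    ~ {total_concurrency:,.0f}")  # ~100,000
    print(f"fine grain (per CPU) ~ {fine_grain:,.0f}")          # ~100
    print(f"coarse-grain factor  ~ {coarse_grain:,.0f}")        # ~1,000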

How coarse is coarse grain? The same formula can be used where the latency is now the latency for accessing remote memory (L_rm). L_rm is determined by the software and hardware overhead to access data over the communications network. Current high-performance networks have latencies in the range of 10 µs, and bandwidths (B_N) of ca. 1 Gbit s⁻¹ (Gbps) or 16 megawords s⁻¹ (assuming double precision words). Let us again assume that the performance, F, of the processor is one gigaop. If only a single word in remote memory is accessed, we will experience the full latency of the network. If we wish to attain 90% of peak processor use (or 10% communication overhead), then the number of operations executed per word of remote data transferred must exceed 10FL_rm = 10 × 10⁹ × 10 × 10⁻⁶ = 100 000, a very imposing number indeed. If many words are transferred at once, the average latency is 1/B_N, and the number must exceed 10F/B_N = 10 × 1000/16 = 625 to attain 90% processor use. Clearly, large messages must dominate.
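
The same back-of-the-envelope granularity estimate, again with the values quoted in the text (10 µs remote latency, 16 megaword/s bandwidth, 1 gigaop processor, 90% target efficiency), can be written out as a few lines of Python; the figures are the paper’s illustrative ones, not measurements.

    # Operations per remote word needed to keep communication overhead to 10%.
    F    = 1.0e9   # processor speed, op/s
    L_rm = 10e-6   # remote access latency, s
    B_N  = 16e6    # network bandwidth, double-precision words/s

    ops_per_word_single = 10 * F * L_rm   # single-word transfers pay the full latency
    ops_per_word_bulk   = 10 * F / B_N    # large transfers amortize latency: ~1/B_N per word

    print(f"single-word transfers: > {ops_per_word_single:,.0f} ops per word")  # ~100,000
    print(f"large transfers:       > {ops_per_word_bulk:,.0f} ops per word")    # ~625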

While the above analysis is only semi-quantitative, the conclusions are clear.

(i) Fine-grain parallelism is essential to attain efficient sequential execution because of the mismatch in speed between that of the processor and that of the (local) memory subsystem. Further, modern superscalar processors achieve peak speed by executing multiple instructions per cycle. More instructions per cycle increase the demand for fine-grain parallelism.

(ii) The granularity of coarse-grain parallelism is determined by the ratio of the single-processor speed to the average latency of remote memory references. This latency is controlled by the characteristics of both the communication hardware

and the algorithm. With the network latencies and bandwidths common today, it is critical to transfer large blocks of numbers.

Given the different objectives of fine- and coarse-grain parallelism, it may initially seem unreasonable to combine them into a single programming model. However, the above discussion is dominated by consideration of different levels of the memory hierarchy (local and remote), and, thus, the same mechanisms (such as blocking of data to increase reuse) suffice to optimize for both. Unfortunately, current portable parallel environments do not provide this unity, and it is presently necessary to manually express coarse-grain parallelism (e.g. with MPI or GAs) and rely upon compilers or libraries (e.g. basic linear algebra subroutines (BLAS)) to take advantage of fine-grain parallelism. An excellent and current general reference to this topic is Hwang & Xu (1997).

(iii) Efficient sequential execution

To make effective use of a parallel supercomputer to solve scientific and engineering problems, the computational scientist or engineer must not only learn how to use efficiently a large number of processors to solve the problem of interest, but she/he must also make efficient use of each of these processors. This is a non-trivial problem. The mathematical approaches and algorithms developed in the Cray era often made explicit use of the high memory bandwidths of Cray supercomputers. To illustrate the magnitude of this problem, it is useful to compare the B/F ratio for vector and parallel supercomputers. The B/F ratio is the bandwidth to memory in bytes per second divided by the speed of the processor in operations per second. With a large B/F ratio, a processor can keep its functional units busy because the needed data can be rapidly transferred from memory to the units. A small B/F ratio, on the other hand, means that the processor may often sit idle, because it will be waiting for data to be moved from memory to the functional units. The last supercomputer built by Cray Research, Inc. (Cray T90), had a B/F ratio of 15. This was sufficient to run many vector operations at full speed out of main memory. The B/F ratio for the processor/memory subsystem in ASCI White, on the other hand, is just 0.67–1.33. This means that many of the traditional algorithms used in computational science and engineering will not perform well on parallel supercomputers without substantial tuning, revision, or even replacement.
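
A convenient way to see what a small B/F ratio implies is the standard bandwidth-bound estimate: a kernel that performs I operations per byte of memory traffic can sustain at most min(F, B × I) operations per second. The sketch below applies this rule of thumb to the B/F figures quoted above; the kernel intensities are illustrative assumptions, not measurements, and the estimate ignores latency and cache effects.

    def attainable(peak_ops, bytes_per_sec, ops_per_byte):
        """Bandwidth-bound performance estimate: min(peak, bandwidth * arithmetic intensity)."""
        return min(peak_ops, bytes_per_sec * ops_per_byte)

    # Work in units of the processor's peak (F = 1), so B/F is the bandwidth term.
    peak = 1.0
    for name, bf in [("Cray T90", 15.0), ("ASCI White (memory)", 1.0), ("ASCI White (cache)", 5.3)]:
        for kernel, intensity in [("streaming vector op (~0.1 op/byte, assumed)", 0.1),
                                  ("blocked matrix kernel (~2 op/byte, assumed)", 2.0)]:
            frac = attainable(peak, bf, intensity)
            print(f"{name:20s} {kernel:44s} <= {frac:5.1%} of peak")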

Designers of parallel supercomputers have attempted to mitigate the memory performance problem by interposing a high-speed cache between the processor and main memory. Caches are built from fast, but expensive memory. For example, each processor in the ASCI White system has an 8 Mb cache. The B/F ratio for this cache is 5.3, an improvement of as much as a factor of 8 over retrieving data from main memory. Clearly, if an algorithm can be structured so that the data are in cache when needed, the performance of the microprocessor will improve substantially. For a small number of scientific applications, it may be possible to keep the data entirely in cache. For others, the key to successful use of the cache is data reuse, but this often requires a significant change in the algorithm. State-of-the-art microprocessors also have a number of other features to enhance memory performance, including instruction and data pre-fetch, and the ability to schedule multiple outstanding memory requests. Both the ASCI Path Forward programme and the National Security Agency are currently funding research aimed at increasing the performance of memory subsystems

in parallel supercomputers. Although we expect to see substantial improvements in the B/F ratio of microprocessor-based parallel supercomputers in the next few years, in the meantime, there is much to be gained by research on algorithms that promise to improve the sequential performance of parallel scientific and engineering codes.
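
Data reuse through blocking is the classic restructuring meant here. The sketch below shows a tiled matrix multiply written in plain Python/NumPy; it is only meant to display the loop restructuring that keeps small blocks of the operands resident in cache, not to serve as production code (real codes delegate this to tuned BLAS).

    import numpy as np

    def matmul_blocked(a, b, tile=64):
        """Tiled matrix multiply: each (tile x tile) block of A, B and C is
        reused many times while it is (ideally) resident in cache."""
        n = a.shape[0]
        c = np.zeros_like(a)
        for i0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                a_blk = a[i0:i0 + tile, k0:k0 + tile]   # block kept "hot" across the j loop
                for j0 in range(0, n, tile):
                    c[i0:i0 + tile, j0:j0 + tile] += a_blk @ b[k0:k0 + tile, j0:j0 + tile]
        return c

    if __name__ == "__main__":
        n = 256
        a, b = np.random.rand(n, n), np.random.rand(n, n)
        assert np.allclose(matmul_blocked(a, b), a @ b)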

(iv) Computer science and applied mathematics

From the above it is clear that realizing the full power of parallel supercomputers is not a problem that can be solved by computational scientists or engineers alone. A significant effort in applied mathematics will be required to develop new mathematical approaches and algorithms that make full use of caches, increasing the reuse of data already stored in the caches and using the ability of modern microprocessors to keep several memory references in flight at the same time to (effectively) reduce memory latency and increase memory bandwidth. This is a research effort in applied mathematics; so time will be required to realize the benefits of any efforts directed at the problems in scientific computing.

In addition to the effort in applied mathematics, an allied effort in computer science is needed. We have already alluded to the fact that current high-performance programming environments do not support the simultaneous control and expression of coarse- and fine-grain parallelism and do not adequately address NUMA machine characteristics. Current programming languages are also very low level and one of the authors (R.J.H.) is engaged in an effort (Cociorva et al. 2002) to raise the level of composition for scientific codes, perhaps to the point where a mathematical expression (in this case restricted to a linear combination of tensor contractions) can be compiled directly to efficient parallel code that performs close to the minimum number of operations while respecting both the memory hierarchy and amount of memory available. Achieving this goal is still in the future, but its utility is clear.
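
The flavour of such higher-level composition can be suggested, loosely, with NumPy’s einsum, which accepts a tensor-contraction expression directly and (with optimize=True) chooses an evaluation order for the intermediates. This is only an analogy to, not an implementation of, the compiler effort described above, and the array shapes and the particular contraction are arbitrary illustrations.

    import numpy as np

    # A tensor contraction written as an expression rather than explicit loops.
    n_occ, n_vir = 8, 20
    t2  = np.random.rand(n_occ, n_occ, n_vir, n_vir)  # toy "amplitudes"
    eri = np.random.rand(n_vir, n_vir, n_vir, n_vir)  # toy "two-electron integrals"

    # R[i,j,a,b] = sum_{c,d} t2[i,j,c,d] * eri[a,b,c,d]
    r2 = np.einsum("ijcd,abcd->ijab", t2, eri, optimize=True)
    print(r2.shape)   # (8, 8, 20, 20)

The appeal of this style is exactly the one argued in the text: the programmer states what is to be contracted, and the choice of intermediates, loop order and blocking is left to software that can reason about operation counts and memory.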

Computer science efforts are also needed to help develop problem-solving environments for scientific and engineering applications. The set-up and analysis of large scientific calculations are time consuming and error prone. Unnecessary details of codes, computers, operating systems, networks and data-storage formats combine to increase complexity to unreasonable levels. Problem-solving environments aim to streamline the process of scientific computation and to enable the scientist to focus upon the scientific content with familiar concepts from the domain of inquiry, while hiding all unnecessary details of how the computation is actually launched and the resulting data stored. The Extensible Computational Chemistry Environment (Ecce) was developed as a sister project to NWChem and is an example of such a problem-solving environment for computational chemistry. We will describe Ecce in more detail in the section on NWChem.

(b) Theoretical challenges in high-performance computing

Another set of challenges that must be addressed if we are to realize the full power of advanced high-performance computers relates to the computational models that have been developed to describe physical, chemical, and/or biological systems as well as more complex natural and engineered systems. The theoretical challenges associated with the development of computational models for natural and engineered systems fall into two general classes: high-fidelity theoretical models and more efficient computational approaches.

(i) High-fidelity theoretical models

As noted in the Introduction, the fundamental physical laws that determine the behaviour of many physical, chemical and biological systems are known. For example, the mathematical equation that governs all of chemistry and biology and much of physics has been known since the mid-1920s. However, the Schrödinger equation is exactly solvable only for a few very simple systems such as the hydrogen atom. It cannot be solved exactly, even on a computer, for any of the molecules of real interest to chemists and biologists. Many approximate methods have been developed to solve the Schrödinger equation and are in use by a worldwide community of theoretical and experimental chemists. Recently, it was shown that one of the most widely used techniques for solving the electronic Schrödinger equation, perturbation theory, was fundamentally flawed. This discovery was a direct result of the ability of computational chemists to push approximate methods for solving the Schrödinger equation to the limit using the advances made in computing technology in the last decade. Such discoveries are likely to occur in other areas of computational science and engineering as increasing computational power allows scientists and engineers to thoroughly benchmark the approximate methods in current use.

Other issues arise in the computational models used to model complex systems. For example, in modelling the Earth’s climate or an automobile engine, many of the processes that determine the behaviour of the system cannot be resolved on the scale of the numerical grids used to describe the phenomenon. Examples of such processes include precipitation in global climate models and chemical reactions or droplet evaporation in automobile engines. Although advances in computing power will continue to increase the resolution of the grids, many of these processes may never be resolved—the spatial range that must be covered is simply too large. Work is needed to better understand how to properly represent such sub-grid processes. In molecular science, a related problem is encountered in modelling large molecules, such as proteins, where the overall motion of large floppy segments of the molecule can occur on time-scales that are very long (seconds) compared with the motion of the individual atoms in the molecule (femtoseconds, 10⁻¹⁵ s). Yet, these floppy motions can impact on the rates of chemical processes, e.g. by exposing a reactive cleft in an enzyme. Many other such problems, too numerous to discuss here, will, unless sufficient investments are made, limit our ability to fully realize the benefits of teraops and beyond computers.

(ii) Efficient computational approaches

Although it is tempting to simply scale up present-day approaches to modelling physical, chemical and biological systems for parallel supercomputers, and this is indeed an important first step, more advanced mathematical methods may offer significant advantages for terascale (and beyond) computers. An example of this in molecular science is the use of advanced mathematical methods to reduce the scaling of molecular calculations. The Hartree–Fock (HF) method is a simple, yet often effective, means of obtaining approximate molecular wave functions. A formal numerical analysis of the method shows that it scales as N⁴, where N is the number of basis functions used to describe the orbitals in the HF wave function. However, it has long been known that the use of screening techniques in the HF algorithm can reduce the exponent in the scaling law to two once the molecule becomes sufficiently

Figure 5. Computational science (shaded area) lies at the intersection of disciplinary theoretical science, applied mathematics and computer science.

large. Recently, Ochsenfeld et al. (1998) showed that, using the fast multipole method (FMM) to handle the long-range Coulomb interaction (Greengard & Rokhlin 1997) and a separate treatment of the exchange interaction, it is possible to develop an HF algorithm that approaches linear scaling (N) as the size of the molecule increases. This gives rise to a positive feedback loop: use of parallel supercomputers enables calculations on larger molecules, use of advanced mathematical methods reduces the cost of calculations on molecules as they become larger, which then enables calculations on even larger molecules, and so on.
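
One widely used screen of the kind referred to above (though not named in the text) is the Schwarz bound |(ij|kl)| ≤ sqrt((ij|ij)) sqrt((kl|kl)): if the product of the two precomputed diagonal estimates falls below a threshold, the integral (ij|kl) is never evaluated. The sketch below only illustrates how such a screen thins the integral list; the "diagonal" values are random placeholders standing in for real (ij|ij) integrals of a spatially extended molecule.

    import numpy as np

    def surviving_pairs(q_diag, threshold=1e-10):
        """Count pair combinations that survive Schwarz screening.

        q_diag[p] plays the role of sqrt((p|p)) for shell pair p; the integral
        (p|q) is skipped whenever q_diag[p] * q_diag[q] < threshold."""
        kept = 0
        for qp in q_diag:
            kept += np.count_nonzero(qp * q_diag >= threshold)
        return kept

    # Toy model: pair magnitudes decay with separation, so for a large molecule
    # most products are negligible and the surviving work grows far more slowly
    # than the formal fourth power.
    n_pairs = 2000
    q_diag = np.exp(-np.linspace(0.0, 40.0, n_pairs))   # placeholder magnitudes
    total = n_pairs * n_pairs
    kept = surviving_pairs(q_diag)
    print(f"kept {kept} of {total} pair combinations ({kept / total:.1%})")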

Other approaches to solving the electronic Schrödinger equation are also being actively investigated. Although basis-set techniques have been used very effectively to solve this equation, they are also the source of many of the problems in quantum chemistry. It can be shown that

\lim_{r_{12} \to 0} \Psi(r_{12}) = \bigl(1 + \tfrac{1}{2} r_{12}\bigr)\,\Psi(0), \qquad (2.1)

where r_12 is the interelectronic coordinate. This is a result of the singularity in the electron–electron interaction (1/r_12) when r_12 → 0. Expansions in basis sets of one-electron functions cannot reproduce this behaviour, and this is the reason for the slow convergence of many molecular properties, e.g. bond energies, with basis set. This problem can be addressed by including r_12 terms in the expansion. However, this leads to substantial mathematical and computational complications in the solution of the Schrödinger equation. Klopper (1998) has recently reviewed approaches that explicitly consider the use of interelectronic coordinates in the wave function. The performance of such approaches on parallel supercomputers has yet to be determined.
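
The origin of equation (2.1) can be recalled in one line by demanding that the local energy stay finite at electron coalescence. This is the standard Kato cusp argument for the spherically averaged, singlet-coupled pair (added here as a reminder, not part of the original text): in the relative coordinate the kinetic operator is -\nabla^2_{r_{12}} (reduced mass 1/2), and its radial part must cancel the Coulomb singularity.

    -\nabla^{2}_{r_{12}}\Psi + \frac{1}{r_{12}}\,\Psi
      \;=\; -\frac{\partial^{2}\Psi}{\partial r_{12}^{2}}
            - \frac{2}{r_{12}}\frac{\partial\Psi}{\partial r_{12}}
            + \frac{1}{r_{12}}\,\Psi + \cdots
      \;\xrightarrow{\;r_{12}\to 0\;}\;
      \frac{1}{r_{12}}\Bigl[\Psi(0) - 2\,\frac{\partial\Psi}{\partial r_{12}}\Big|_{0}\Bigr]
      + \text{(finite terms)}

    \text{Finiteness of } H\Psi \text{ therefore requires}\quad
      \frac{\partial\Psi}{\partial r_{12}}\Big|_{r_{12}=0} = \tfrac{1}{2}\,\Psi(0),

which is the first-order content of equation (2.1); a smooth expansion in one-electron functions has no such derivative discontinuity, hence the slow basis-set convergence noted above.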

(c) Educational challenges in high-performance computing

Computational science may be considered to be the intersection of theoretical science, computer science and applied mathematics; see figure 5. Because of the rapid pace of innovation in computing technologies, the usual approach of each discipline defining its own research agenda and then publishing work for use by others is not

effective. Close collaborations must be established between disciplinary computational scientists and engineers (physicists, chemists, biologists, mechanical engineers, etc.), computer scientists and applied mathematicians. However, each of the individuals involved in the collaboration must be able to understand enough about the work of the others to effectively contribute to the solution of the scientific and engineering problem of interest. Unfortunately, the trend over the past decade has been for the three groups to go their separate ways. The parting has been particularly acute in computer science, where there has been an increasing number of non-scientific outlets for their expertise. The net result of this trend is a near crisis in education in computational science and engineering at the same time that revolutionary advances in computer technologies are promising major benefits to science and society if these advances can be properly harnessed.

Changes in both graduate and undergraduate curricula will be needed to address the educational issues involved in computational modelling. In the graduate curriculum, the interdisciplinary nature of computational science and engineering must be explicitly recognized. Graduate students planning a career in computational science or engineering must gain a thorough understanding of the mathematical, computational and computer science issues that underlie high-performance computing. To accomplish this, courses in applied mathematics and computer science must be a requirement for all graduate students in disciplinary computational science fields, and courses in disciplinary computational science a requirement for all computer scientists and applied mathematicians interested in scientific computing. Defining the content of these courses will require close collaboration between the faculty in these three areas, with the courses in computer science and applied mathematics forming a core curriculum common to many disciplinary computational sciences. The DOE’s Computational Science Graduate Fellowship programme, which has been in existence for 10 years, is a prototype graduate programme that attempts to meet this need. More information on the Computational Science Graduate Fellowship programme may be found at http://www.krellinst.org/csgf/.

Undergraduates majoring in science and engineering may not need an in-depth understanding of the mathematical and computational issues underlying computational science and engineering. However, all undergraduates need to understand the role of computational modelling and simulation in science and engineering as well as the basic principles involved in using this approach to address scientific and engineering problems. This will only be possible if there is a bottom-to-top transformation of the undergraduate science and engineering curriculum. In chemistry, for example, a student should be exposed to computational modelling concepts and techniques in every year of college, from first-year general chemistry through second-year organic chemistry to third-year physical chemistry. In addition, there should be a fourth-year course in computational chemistry that provides a more detailed examination of the concepts and techniques of computational chemistry. Coverage of computational modelling should progress from simple applications of computational modelling packages (including spreadsheets and programs such as MatLab) in the first year to an examination of the details of the mathematical models used in molecular modelling packages such as Gaussian™ in the fourth-year course.

Only by taking a comprehensive approach to computational modelling and simulation in both the undergraduate and graduate curriculum will we be in a position to reap the full benefits of the revolution that is occurring in scientific computing. Only

by a bottom-to-top transformation of the educational process will it be possible to properly prepare young scientists and engineers for their careers in the 21st century.

3. NWCHEM: framework for molecular modelling on parallel supercomputers

Parallel supercomputers require a substantial change in the process by which scientific and engineering codes are designed and implemented. A thorough understanding of the issues in parallel computing and careful planning are required, as is the use of modern software engineering practices. Skimping on these can lead the software developer down the path of continual, and sometimes massive, revision of the software as computer architectures evolve.

Before the design process can begin, a number of fundamental issues must be carefully scrutinized. These include a thorough examination of the data structures and algorithms used in the existing sequential codes and an investigation of the performance issues that will arise when these data structures and algorithms are mapped to a distributed set of computational nodes. In addition, the choice of algorithms must be re-examined in the light of the memory hierarchy used in each of the processors. The results of these studies may indicate the need for new algorithms, data structures and/or data-distribution schemes. In fact, it was from such studies in the early 1990s that the benefits of a NUMA model became evident in computational chemistry.

In this section, we discuss the major issues in the development of NWChem (Kendall et al. 2000). NWChem implements many of the standard electronic structure methods currently used to compute the properties of molecules and periodic solids. In addition, NWChem has the ability to perform classical molecular-dynamics (MD) and free-energy simulations with the forces for these simulations being obtained from a variety of sources, including ab initio calculations. Examples of NWChem’s capabilities include the following.

(i) Direct, semi-direct, and conventional HF (RHF, UHF, ROHF) calculations using up to 10 000 Gaussian basis functions; analytic first and second derivatives of the HF energy.

(ii) Direct, semi-direct, and conventional density functional theory (DFT) calculations with a wide variety of local and non-local exchange-correlation potentials, using up to 10 000 basis functions; analytic first and second derivatives of the DFT energy.

(iii) Complete active space self-consistent-field (CASSCF) calculations; analytic first and numerical second derivatives of the CASSCF energy.

(iv) Semi-direct and RI (resolution-of-the-identity)-based second-order perturbation theory (Møller–Plesset perturbation theory (MP2)) calculations for RHF and UHF wave functions using up to 3000 basis functions; fully direct calculations based on RHF wave functions; analytic first derivatives and numerical second derivatives of the MP2 energy.

(v) Coupled cluster, CCSD and CCSD(T), calculations based on RHF wave functions using up to 3000 basis functions; numerical first and second derivatives of the coupled cluster energy.

In the NWChem self-consistent field (SCF) and DFT modules, all data of O(N²) or higher are distributed over all of the processors. Thus, calculations are limited by the resources available in the system, not those available in a single processor or node. Also, some of these calculations make use of the new O(N) or ‘fast’ Coulomb, exchange, and quadrature techniques, and more will be included in the future. For a detailed description of the capabilities of NWChem, including its capabilities for periodic solids and MD simulations, see the NWChem Web site http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html. In addition to the above, a powerful scripting interface, based on the widely used Python object-oriented language, is available for NWChem.
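
As an illustration of what such scripting makes possible, the fragment below sketches a small bond-length scan driven from inside an input deck. It is a sketch only: the deck directives (start, basis, python ... end, task python) and the functions input_parse and task_energy follow the conventions of NWChem’s documented Python interface in later public releases and are assumptions here; they should be checked against the release actually in use.

    start h2o_scan
    basis
      * library sto-3g
    end
    python
      # Scan the symmetric O-H stretch of water at the SCF level (sketch only).
      energies = []
      for r in [0.90, 0.95, 1.00, 1.05]:
          input_parse('''
            geometry
              O  0.000   0.000  0.000
              H  0.000   0.757  %f
              H  0.000  -0.757  %f
            end
          ''' % (r, r))
          energies.append((r, task_energy('scf')))
      for r, e in energies:
          print(r, e)
    end
    task python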

NWChem runs on essentially all parallel supercomputing platforms, including IBM SPs, Cray T3Ds and T3Es, and SGI Origin 2000s and 3000s. It also runs efficiently on vector supercomputers such as the Cray C90 and T90. Despite the fact that NWChem was designed for parallel supercomputers, it also runs on Unix workstations, PCs running Linux or Windows, as well as clusters of Unix desktop computers or workgroup servers.

(a) NWCHEM team

The NWChem Project was organized around a core team of theoretical and computational chemists, computer scientists and applied mathematicians at Pacific Northwest National Laboratory, augmented by a worldwide group of collaborators. The core team was responsible for designing the overall architecture of NWChem, writing a major portion of the code, delegating work to and monitoring the work of collaborators, resolving major issues related to the parallel implementation of mathematical algorithms, and integrating the software provided by collaborators into NWChem.

The core team consisted of six theoretical and computational chemists plus an equal number of postdoctoral fellows and three computer scientists and applied mathematicians. It was a tightly knit group, and this level of integration was essential to achieving the goals set for NWChem. Many of the problems encountered in developing NWChem were technically complex and their resolution required an in-depth understanding of both the scientific algorithms and the computer hardware and systems software. The solutions to these problems were often unique. Both GAs and Parallel IO (ParIO) were developed because of the need for efficient solutions to the data distribution/access and input/output (I/O) problems for molecular calculations.

The team of external collaborators consisted of more than 15 senior scientists plus their students and postdoctoral associates. Approximately half of the collaborators were from the US; the remainder were from Europe (England, Germany, Austria) and Australia. The collaborators were carefully selected to provide expertise that did not exist in the core team or to supplement the efforts of the core team when additional human resources were needed.

(b) Architecture of NWCHEM

A key to achieving the goals set for NWChem was a carefully designed architecture that emphasizes layering and modularity; see figure 6. At each layer of NWChem, subroutine interfaces or styles were specified in order to control critical characteristics of the code, such as ease of restart, interaction between different tasks in the

[Figure 6: layered architecture diagram. Generic tasks (molecular energy, molecular structure, ...) sit above the molecular calculation modules (SCF energy and gradient, MP2 energy and gradient, ...), which sit above the molecular modelling toolkit (integral API, geometry object, basis set object, PeIGS, ...), which sits on the parallel software development toolkit (global arrays, parallel IO, memory allocator); the RTDB ties the layers together.]

Figure 6. The architecture of NWChem is both layered and modular. Layering allows tasks to be allocated to different disciplines (e.g. computer scientists created and maintain the ParSoft Development Toolkit). Modularity facilitates both creation and maintenance.

same job, and reliable parallel execution. Object-oriented design concepts were used extensively within NWChem. Basis sets, molecular geometries, chunks of dynamically allocated local memory and shared parallel arrays are all examples of ‘objects’ within NWChem. NWChem is implemented in a mixture of C and Fortran-77, since neither C++ nor Fortran-90/95 was suitable at the start of the project (early 1990s), though they will be supported in the future. Since a true object-oriented language, one that supports inheritance, was not used, NWChem does not have ‘objects’ in the strict sense of the word. However, careful design with consideration of both the data and the actions performed upon the data, and the use of data hiding and abstraction, permitted many of the benefits of an object-oriented design to be realized.

In the bottom layer of NWChem are the memory allocator (MA), GAs and ParIO. MA is a type-safe memory allocator that allows a multi-language application to manage dynamically allocated memory. The GAL, the parallel-programming library that provides one-sided access to shared, distributed multi-dimension arrays, was briefly described earlier. ParIO is a library that supports several models of I/O on a wide range of parallel computers. The models include independent files for each process, or a file shared between processes which may independently perform operations with independent file-pointers. It also supports an extension of GAs to

disk, called disk arrays, through which processes can collectively move complete or partial GAs to/from external storage.

The Parallel Software Development Toolkit (ParSoft) was (and still is) primarily the responsibility of the computer scientists involved in the NWChem project. It defines a ‘hardware abstraction’ layer that provides a machine-independent interface to the upper layers of NWChem. When NWChem is ported from one computer system to another, nearly all changes occur in this layer, with most of the changes elsewhere being for tuning or to accommodate machine-specific problems such as compiler flaws. ParSoft contains only a small fraction of the code in NWChem, a few per cent, and only a small fraction of the code in ParSoft is machine dependent (notably the address-translation and transport mechanisms for one-sided memory operations). This is the reason that NWChem is so easily ported to new computer systems.

The next layer, the Molecular Modelling Toolkit, provides the basic functionality required by computational chemistry algorithms. This functionality is provided through ‘objects’ and application programmer interfaces (APIs). Two of the ‘objects’ defined here support the simultaneous use of multiple basis sets, e.g. as required in some density functional calculations, and multiple molecular geometries, e.g. as required for manipulating molecular fragments or molecular recognition. Examples of the APIs include those for the one- and two-electron integrals, that for the quadratures used by the DFT method, and a number of basic mathematical routines including those for parallel linear algebra and Fock-matrix construction. Nearly everything that might be used by more than one type of computational method is exposed through a subroutine interface. Common blocks are not used for passing data across APIs, but are used to support data hiding behind APIs. Both computational chemists and applied mathematicians were heavily involved in the development of the basic algorithms in the Molecular Modelling Toolkit.

The runtime database (RTDB) is a key component of NWChem, tying together all of the layers of NWChem. This database plays a role similar to the Gaussian™ checkpoint file or the GAMESS dumpfile, but is much easier and safer to use. Arrays of typed data are stored in the database using simple ASCII strings for keys (or names). The database may be accessed either sequentially or in parallel.

The next layer within NWChem, the molecular calculation modules, comprises independent modules that communicate with other modules only via the RTDB or other persistent forms of storage. This design ensures that, when a module completes, all critical information is in a consistent state. All modules that compute an energy store it in a consistently named database entry, in this case <module>:energy, substituting the name of the module for <module>. Examples of modules include computation of the energy for SCF, DFT and multiconfiguration SCF (MCSCF) wave functions. The code to read the user input is also a module. This makes the behaviour of the code more predictable when restarting a job with multiple tasks or steps, by forcing the state of persistent information to be consistent with the input already processed. Modules often invoke other modules, e.g. the MP2 module invokes the module to compute the SCF wave function, and the QM/MM module invokes the modules to compute the energy and gradient. Computational chemists were primarily responsible for the molecular calculation modules.

The highest layer within NWChem is the generic task layer. Functions at this level are also modules—all of their inputs and outputs are communicated via the database.


However, these capabilities are no longer tied to specific types of wave functions or other computational details. Thus, regardless of the type of wave function requested by the user, the energy may always be computed by invoking task_energy() and retrieving the energy from the database entry named task:energy. This greatly simplifies the use of generic capabilities such as optimization, numeric differentiation of energies or gradients and molecular dynamics. It is the responsibility of the task-layer routines to determine the appropriate module to invoke.
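A sketch of this dispatch pattern, with placeholder modules and a dictionary standing in for the database, is given below; the values and the key names other than task:energy and <module>:energy are assumptions made for illustration.

```python
# Sketch of the generic task layer: task_energy() reads the requested theory
# from the database, invokes the corresponding module, and republishes the
# result under the generic name 'task:energy'.  Illustrative only.
def scf_module(rtdb):
    rtdb["scf:energy"] = -76.026765     # placeholder value

def dft_module(rtdb):
    rtdb["dft:energy"] = -76.419000     # placeholder value

MODULES = {"scf": scf_module, "dft": dft_module}

def task_energy(rtdb: dict) -> float:
    theory = rtdb["task:theory"]        # e.g. recorded by the input module
    MODULES[theory](rtdb)               # run the appropriate module
    rtdb["task:energy"] = rtdb[f"{theory}:energy"]
    return rtdb["task:energy"]

rtdb = {"task:theory": "dft"}
print(task_energy(rtdb))                # optimizers etc. need only this one call
```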

NWChem was designed to be extensible. First, the clearly defined task and module layers make it easy to add new capabilities to NWChem. This can be done for both parallel and sequential codes. An example of this is the recent integration of POLYRATE, which performs direct dynamics calculations (Chuang et al. 2000), into NWChem. Although tighter integration is possible by using lower-level APIs, this simple approach can provide much of the needed functionality. Second, the wide selection of lower-level APIs, e.g. for manipulating basis sets and evaluating integrals, makes it easier to develop new capabilities within NWChem than within codes in which these capabilities are not as easy to access. Finally, having a standard API means that a change to an implementation propagates to the whole code. For example, to add scalar relativistic corrections to all supported wave-function or density-functional models, only the routines that compute the new one- or two-electron integrals had to be modified. Adding a faster integrals module also speeds up the whole code. Sometimes, however, new APIs must be added. This was the case when the Texas integral package was integrated into NWChem several years ago and an API that supported the more efficient batch evaluation of integrals was needed. Adding such an API can be done straightforwardly, but requires more extensive code or algorithmic changes to realize the benefit.

(c) ECCE: managing computational studies with NWCHEM

We would be remiss if we did not mention, however briefly, the other part of computational science and engineering—the use of computational tools such as NWChem to help solve problems in science and engineering. As investigations become more and more complex, it is critical that we use computer technology to help researchers compose, manage and analyse their work. The Extensible Computational Chemistry Environment (Ecce) is such a problem-solving environment for computational chemistry. Ecce is composed of a suite of distributed client/server Unix-based graphical user interface (GUI) applications that are seamlessly integrated. The resulting environment enables research scientists to use complex computational modelling software such as NWChem and Gaussian™, transparently accessing all of the high-performance computers available to them from their desktop workstations.

The Ecce software architecture is based on an object-oriented chemistry data model and supports the management of both computational and experimental molecular data. A flexible high-performance data-management component, based on Web technologies for data interchange and the Extensible Markup Language (XML) for data storage, has been developed by the Ecce team. The environment is extensible, supporting new computational codes and new capabilities through script files for managing both code input and output rather than reworking the core applications. The Ecce application software currently runs on Sun, SGI and Linux workstations and is written in C++ using the X Window System Motif user-interface toolkit and OpenGL graphics. Ecce provides a sophisticated GUI, analysis and visualization tools, and an underlying data-management framework enabling scientists to efficiently set up calculations and store, retrieve and analyse the rapidly growing volumes of data produced by computational chemistry studies.
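The script-based extensibility mentioned above can be illustrated with a small, hypothetical parsing script that scans a code's text output for the final energy and emits a minimal XML property record. The file name, the output format being matched and the XML tags are all assumptions made for illustration; they are not part of the Ecce distribution.

```python
# Hypothetical sketch of the kind of small script that could register a new
# code with a data-management layer: pull the final energy out of a text
# output file and emit a minimal XML property record.
import re
import xml.etree.ElementTree as ET

def extract_energy(output_file: str) -> float:
    # The pattern below assumes an output line like 'Total SCF energy = -76.026765'.
    pattern = re.compile(r"Total \w+ energy\s*=\s*(-?\d+\.\d+)")
    energy = None
    with open(output_file) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                energy = float(m.group(1))   # keep the last match in the file
    return energy

def to_xml(energy: float) -> str:
    prop = ET.Element("property", name="total_energy", units="hartree")
    prop.text = f"{energy:.6f}"
    return ET.tostring(prop, encoding="unicode")

# Example usage (assuming a file 'calc.out' exists):
# print(to_xml(extract_energy("calc.out")))
```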

Features of Ecce include:

(i) graphical composition of molecular models;

(ii) GUI for basis set selection;

(iii) remote submission of calculations to Unix and Linux workstations, Linux clusters and supercomputers running LoadLeveler™, Maui Scheduler, NQE/NQS™ and PBS™;

(iv) remote communications using the standard protocols: secure shell and copy (ssh/scp), telnet/ftp, remote shell and copy (rsh/rcp) and Globus;

(v) support for importing results from NWChem, Gaussian94™ and Gaussian98™ calculations run outside of the Ecce environment;

(vi) three-dimensional visualization and graphical display of a number of molecular properties while jobs are running and after completion.

Only the computational code must be installed on the remote compute servers (not even a 'root' daemon). Ecce also has an extensive Web-based help system and tutorials with examples. See http://www.emsl.pnl.gov:2080/docs/ecce/ for additional information on Ecce.

4. New science from NWCHEM

In this section we briefly discuss two examples of the new science that has been enabled by NWChem. Our examples are selected from research carried out in the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory and describe high-accuracy calculations of hydrogen-bond interactions in complex molecular systems. A detailed description of hydrogen-bond interactions is critical to understanding many environmental processes, e.g. chemical reactions in groundwater. These calculations, which sometimes stretch the capabilities of the small 1/4 teraops computer in the EMSL, would be little more than routine calculations on NERSC's 5 teraops computer. NWChem is now being used at over 400 institutions worldwide, so a much larger set of examples of the new science enabled by NWChem is certainly available. However, it would be impossible to review them in the limited space available here.

(a) Interaction of water with carbon nanostructures

Recently, there has been growing interest in the electrical and mechanical properties of a class of molecular aggregates known as carbon nanotubes (Iijima 1991). In addition, synthetic chemists are interested in nanotubes for their potential use as very small chemical reaction chambers, or 'nanoreactors'. Nanotubes can be thought of as elongated analogues of the spherical carbon molecules called fullerenes or 'buckyballs'. Since water is likely to be the preferred solvent for many nanoreactor applications, it is important to develop a reliable computational model of the behaviour of water inside a nanotube. Because the interior of nanotubes is quite narrow and the ratio of surface area to volume is large, water in nanotubes may assemble into structures quite different from the bulk (Koga et al. 2001).

To carry out reliable MD simulations of water inside nanotubes, it will be necessary to augment the existing, high-quality water–water potentials (Rahman & Stillinger 1971; Stillinger & David 1978; Niesar et al. 1990; Corongiu & Clementi 1992; Sprik & Klein 1988; Rick et al. 1994; Dang & Chang 1997; Burnham et al. 1999) with accurate water–nanotube interaction potentials. A natural starting point for this endeavour is the development of a model potential for water on planar graphite. Normally, this is accomplished by adjusting the parameters in a force-field representation of the potential to reproduce as closely as possible existing experimental or theoretical data. In this case, however, there is very little information on water–graphite interactions. The existing empirical force fields produce water–graphite binding energies for a single water monomer in the 1.7–4.3 kcal mol−1 range (Vernov & Steele 1992; Gale & Beebe 1964). The only available experimental measurement yields a binding energy of 3.6 kcal mol−1 (Avgul & Kieslev 1970).
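The fitting step described here can be pictured schematically. In the sketch below, a simple 12-6 form is fitted to a handful of water–surface energies by a crude grid search; the functional form, the data points and the parameter ranges are all invented for illustration and are not the potentials discussed in this paper.

```python
# Illustration of the parameter-fitting step described above: choose the
# parameters of a simple model potential so that it reproduces a set of
# reference binding energies.  The 12-6 form, the data points and the search
# ranges are hypothetical; they are not the potentials of this work.
import numpy as np

# Water-surface interaction energies (kcal/mol) at a few separations (angstrom),
# used here as stand-ins for ab initio reference data.
z_ref = np.array([2.8, 3.1, 3.4, 4.0, 5.0])
e_ref = np.array([-2.1, -3.3, -3.0, -1.8, -0.7])

def lj(z, eps, sigma):
    # Standard 12-6 Lennard-Jones form with well depth eps and size sigma.
    return 4.0 * eps * ((sigma / z) ** 12 - (sigma / z) ** 6)

# Crude grid search for the (eps, sigma) pair minimizing the RMS deviation.
best = None
for eps in np.linspace(0.5, 5.0, 100):
    for sigma in np.linspace(2.0, 4.0, 100):
        rms = np.sqrt(np.mean((lj(z_ref, eps, sigma) - e_ref) ** 2))
        if best is None or rms < best[0]:
            best = (rms, eps, sigma)

print("rms = %.2f kcal/mol, eps = %.2f kcal/mol, sigma = %.2f A" % best)
```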

Ab initio electronic-structure calculations can provide invaluable information on molecular interaction energies, provided they are performed at a high enough level to assure their reliability (see Dunning 1999). Current techniques for handling infinitely repeating structures, such as single-sheet graphite, are not adequate for systematically studying the strength of the water–graphite interaction. Consequently, the problem was approached via molecular-cluster calculations on water bound to a series of fused-carbon-ring systems, ranging from benzene (C6H6) up to C96H24.

In the case of the water–benzene system, it was possible to explore thoroughly the sensitivity of the binding energy to the completeness of the one-particle and n-particle basis sets, the two primary sources of error in these calculations. Previous studies of hydrogen-bonded systems involving water had demonstrated that second-order MP2 reproduced binding energies to within a few tenths of a kcal mol−1 (Feller 1992; Feyereisen et al. 1996). Thus, the initial calculations were carried out at the MP2 level of theory with correlation consistent basis sets (Dunning et al. 1998), which have the advantage that they systematically approach the complete basis-set limit. The largest of these sets (aug-cc-pV5Z) involved 1077 basis functions and required 10 h of computer time when using a 128-processor partition of the IBM SP2 computer in the EMSL. In addition to the raw binding energies, corresponding values were also determined with a counterpoise (CP) correction designed to remove the undesirable effects of basis-set superposition error (BSSE). BSSE is a result of the use of finite basis sets and decreases in magnitude as the basis set approaches completeness.
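For reference, the counterpoise correction referred to above is the standard recipe in which the fragment energies are recomputed in the full basis of the complex; stated in a generic notation (not spelled out in the original text), with B_X denoting the basis set of fragment or complex X,

```latex
% Counterpoise (CP) corrected interaction energy and the associated BSSE:
% each fragment is evaluated both in its own basis and in the full dimer basis.
\Delta E^{\mathrm{CP}}_{\mathrm{int}}
  = E_{AB}(\mathcal{B}_{AB}) - E_{A}(\mathcal{B}_{AB}) - E_{B}(\mathcal{B}_{AB}),
\qquad
\mathrm{BSSE}
  = \bigl[E_{A}(\mathcal{B}_{A}) - E_{A}(\mathcal{B}_{AB})\bigr]
  + \bigl[E_{B}(\mathcal{B}_{B}) - E_{B}(\mathcal{B}_{AB})\bigr] \;\ge\; 0 .
```

The correction thus removes the artificial stabilization each fragment gains by borrowing the partner's basis functions, and it vanishes as the basis approaches completeness.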

Geometry optimization showed that one of the water's hydrogen atoms points toward the π-electron cloud of the benzene ring (see the inset in figure 7). With the MP2 method, the binding energy at the basis-set limit is 3.7 ± 0.2 kcal mol−1 (Feller 1999). (One advantage of systematically approaching the basis-set limit is that it allows one to assign error bars with some confidence.) Previous theoretical estimates had ranged from 1.4 to 5.2 kcal mol−1, with what were thought to be the best values centring at ca. 2.5 kcal mol−1 (Fredericks et al. 1996). At a higher level of theory, known as coupled cluster theory with perturbative triple excitations (CCSD(T)), the binding energy is slightly larger, leading to a best estimate for ∆Ee of 3.9 ± 0.2 kcal mol−1 (Feller 1999). When zero-point vibrational effects are included, a ∆E0(0 K) value of 2.9 ± 0.2 kcal mol−1 is obtained. This value is in good agreement with a recent threshold photoionization experiment that yielded 2.4 ± 0.1 kcal mol−1 (Courty et al. 1998) and with the larger value of 2.8 kcal mol−1 determined from dispersed fluorescence spectra (Gotch & Zwier 1992). Since the present value is larger than most existing estimates of the water–graphite interaction energy, and the value is expected to grow as the size of the system of fused benzene rings grows, the reliability of the earlier estimates is already seen to be questionable.

Figure 7. Convergence of the equilibrium binding energy, ∆Eel (kcal mol−1), for H2O–C6H6 with basis set (aVDZ to V5Z), computed with MP2, MP2(CP), CCSD(T) and CCSD(T)(CP). The solid and the dashed horizontal lines are, respectively, the H2O–C6H6 and H2O–H2O binding energies predicted by the MP2 method at the (estimated) complete basis-set limit.

Having demonstrated the level of theory required for the benzene–water system, MP2 calculations with a large basis set were then performed on complexes formed from water and progressively larger centro-symmetric fused carbon rings (C24H12, C54H18 and C96H24). Figure 8 shows the largest of these. The position of the water molecule was optimized for the smallest of these carbon-ring systems and held fixed for the other two. The calculations show that the net binding energy of a water molecule to the surface grows slowly with increasing cluster size. Based on these results and those for the water–benzene complex, we estimate that the binding energy, ∆Ee, of a water molecule to the graphite surface is ca. 5 kcal mol−1 (Feller & Jordan 2000). Using the NWChem program running on 256 nodes of our IBM SP2, the largest calculations involved 1649 basis functions in C1 symmetry and required ca. 29 h of computer time.


Figure 8. The structure of the H2O–C96H24 equilibrium complex.

Future studies will involve a water monomer interacting with segments of carbon nanotubes. Even the smallest of these will require substantial increases in computational power relative to what was available for the previous work. The results will then be used to determine whether the model potential based on the water–graphite data is adequate for describing water–nanotube interactions. If this turns out not to be the case, the ab initio results on the nanotube systems will be used to reparametrize the model potential.

Although all of the calculations discussed so far have involved only a single water molecule, it will also be necessary to establish the magnitude of three-body interactions involving carbon surfaces and two water molecules. This will be accomplished by carrying out MP2 calculations of the graphite and nanotube model systems interacting with two water molecules over a range of geometries. We expect these three-body interactions to be sufficiently large as to necessitate their inclusion in the model potentials.

(b) Structure and energetics of water clusters

The ubiquitous nature of water and its role as a (nearly) universal solvent have fuelled a growing interest in developing a thorough knowledge of its properties under various conditions and in diverse environments. A molecular-level picture of the structure and function of water is essential to understanding the mechanisms of many important chemical and biological processes, and the growing sophistication of the experimental techniques that probe the properties of water and water clusters requires more detailed models for their interpretation than are currently available. Such models must go far beyond the simplistic approach of effective pairwise additive interactions and incorporate important cooperative effects into the functional form of the interaction potential. Non-additive (three- and higher-body) contributions to the interaction energy amount to 30–40% for clusters (Xantheas 1994, 2000; Hodges et al. 1997), liquid water (Dang & Chang 1997) and ice Ih (Burnham et al. 1999).


The issue of transferability of the interaction potential across different environments (gas, liquid, solid and interfaces) is also important in order to ultimately address environmentally significant problems such as solvation, reaction and transport.

The study of small clusters of water offers the advantage of probing intermolecular interactions and collective phenomena at the detailed molecular level and providing a quantitative account of the most important terms in these interactions. Current experimental techniques have provided detailed spectroscopic information about these systems but have yet to yield energetic information, which is necessary for parametrizing and benchmarking interaction potentials for water. First-principles electronic structure calculations offer a viable alternative. However, cluster binding energies that are accurate to 1–2% are needed in order to produce realistic potentials. The computational cost of these calculations, for which second-order perturbation theory (MP2) is usually sufficient (Dunning 1999), scales (depending on implementation) between N^3 and N^5, where (as before) N is the number of basis functions in the system. For the water hexamer, for example, reducing the error in the computed binding energies from 10% to 1% requires a 10 000× increase in the computational cost. Such calculations are currently impossible on conventional computers using serial architectures. However, they are feasible with NWChem on the IBM supercomputer in the EMSL. These studies with NWChem provide indispensable and irreplaceable information needed for the development of accurate interaction potentials.
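One way to see where a factor of this size can come from (an illustrative estimate, not a derivation given in the text): if the basis-set error of a computed binding energy falls off roughly as 1/N while the cost grows as N^4, the midpoint of the range quoted above, then

```latex
% Illustrative cost estimate under the stated assumptions
% (error ~ 1/N, cost ~ N^4):
\varepsilon \propto N^{-1}, \qquad t \propto N^{4}
\;\Longrightarrow\;
\frac{t(\varepsilon/10)}{t(\varepsilon)} = \left(\frac{10\,N}{N}\right)^{4} = 10^{4},
```

with the quoted N^3–N^5 range bracketing this factor between 10^3 and 10^5.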

We have computed the binding energies of water clusters from the trimer through to the hexamer at the MP2 level of theory with the family of augmented correlation-consistent basis sets of double through quintuple zeta quality. The largest calculation, for the water hexamer isomers with the aug-cc-pV5Z basis set (Dunning et al. 1998), involved a total of 1722 contracted Gaussian basis functions and was run on 128 processors of the parallel IBM SP2 at the EMSL. Complete basis-set (CBS) limits were estimated using a polynomial of inverse powers of the maximum angular momentum in the sets. The MP2/CBS estimates for the water cluster binding energies are 5.0 kcal mol−1 (dimer) (Feller 1992; Feyereisen et al. 1996; Halkier et al. 1997), 15.8 kcal mol−1 (trimer) (Nielsen et al. 1999), 27.6 kcal mol−1 (tetramer), 36.3 kcal mol−1 (pentamer), 45.9 kcal mol−1 (prism hexamer), 45.8 kcal mol−1 (cage hexamer), 45.6 kcal mol−1 (book hexamer) and 44.8 kcal mol−1 (ring S6 hexamer) (Xantheas et al. 2002). The effects of higher correlation were estimated at the coupled cluster singles and doubles with a perturbative estimate of the triple excitations (CCSD(T)) level of theory. In addition, core–valence correlation effects were included. These calculations indicate that the MP2 estimates are accurate to ca. 0.2 kcal mol−1. We used these calculated cluster energetics, as well as extensive parts of the water dimer potential energy surface (Burnham & Xantheas 2002a), to reparametrize an all-atom, Thole-type, rigid, polarizable interaction potential for water (TTM2-R) (Burnham & Xantheas 2002b). The binding energies for water clusters, (H2O)2–(H2O)6, obtained with the new potential, together with the MP2/CBS results and the binding energies predicted by other models commonly used in the simulation of the properties of liquid water, are listed in table 1.
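The extrapolation step can be illustrated with a short fit. The single inverse-cube term below is one common choice consistent with 'a polynomial of inverse powers' of the cardinal number X of the aug-cc-pVXZ sets, and the energies are synthetic numbers chosen only to show the procedure; they are not results from this work.

```python
# Sketch of a complete-basis-set (CBS) extrapolation: fit E(X) = E_CBS + A/X^3
# to energies from an aug-cc-pVXZ sequence (X = 2-5).  The inverse-cube form is
# one common choice of 'inverse-power polynomial'; the input energies are
# synthetic values used only to demonstrate the fit.
import numpy as np

X = np.array([2, 3, 4, 5], dtype=float)            # cardinal numbers D, T, Q, 5
E = np.array([-4.425, -4.830, -4.928, -4.963])     # synthetic binding energies (kcal/mol)

design = np.column_stack([np.ones_like(X), X ** -3])     # columns: [1, 1/X^3]
(e_cbs, a), *_ = np.linalg.lstsq(design, E, rcond=None)  # linear least squares

print(f"E_CBS = {e_cbs:.3f} kcal/mol (A = {a:.2f})")
```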

It is seen that the newly developed TTM2-R potential reproduces the MP2/CBS cluster binding energies for n = 2–6 to an impressive accuracy (a root-mean-square (RMS) error of just 0.44 kcal mol−1). As noted in a recent series of papers (Burnham & Xantheas 2002a, b; Xantheas et al. 2002), the potential also yields excellent agreement with the experimental second virial coefficient over the 423–773 K temperature range.


Table 1. Comparison of binding energies (in kcal mol−1) of water clusters, (H2O)n, n = 2–6, from existing interaction potentials, the new potential (TTM2-R), and MP2/CBS estimates

(A = TIP4P; B = DC; C = ASP-W; D = ASP-W4; E = ASP-VRT; F = TTM; G = TTM2-R; H = MP2/CBS.)

cluster       A^a       B^b       C^c       D^a       E^d       F^e       G^f       H^g
2 (Cs)      −6.24     −4.69     −4.68     −4.99     −4.91     −5.33     −4.98     −4.98
3          −16.73    −13.34    −14.81    −15.49    −15.65    −16.68    −15.59    −15.8
4 (S4)     −27.87    −23.95    −24.32    −26.97    −25.93    −28.57    −27.03    −27.6
5          −36.35    −31.80    −30.88    −35.09    −33.52    −37.91    −36.05    −36.3
6 (cage)   −47.27    −40.85    −42.54    −45.86    −45.00    −48.91    −45.65    −45.8
6 (prism)  −46.91    −41.00    −43.31    −47.06    −45.90    −48.54    −45.11    −45.9
6 (book)   −46.12    −40.43    −40.48    −45.00    −43.44    −47.73    −45.14    −45.6
6 (S6)     −44.38    −39.39    −37.58    −43.37    −41.09    −46.48    −44.28    −44.8
RMS error     0.88      4.25      4.14      0.84      1.92      1.89      0.44       —

^a Wales & Hodges (1998). ^b Dang & Chang (1997). ^c M. P. Hodges (2001, personal communication). ^d Liu et al. (1996). ^e Burnham et al. (1999). ^f Burnham & Xantheas (2002b). ^g Xantheas et al. (2002).
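The RMS errors in the last row of the table are straightforward to reproduce; for example, the TTM2-R value follows from the G and H columns with all eight clusters weighted equally:

```python
# Recompute the TTM2-R RMS error quoted in table 1 from its G (TTM2-R) and
# H (MP2/CBS) columns; all eight clusters are weighted equally.
import math

ttm2_r  = [-4.98, -15.59, -27.03, -36.05, -45.65, -45.11, -45.14, -44.28]
mp2_cbs = [-4.98, -15.8,  -27.6,  -36.3,  -45.8,  -45.9,  -45.6,  -44.8 ]

rms = math.sqrt(sum((g - h) ** 2 for g, h in zip(ttm2_r, mp2_cbs)) / len(ttm2_r))
print(f"RMS error = {rms:.2f} kcal/mol")   # ~0.44, matching the table
```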

The potential produces a diffusion coefficient of 2.23 × 10−5 cm2 s−1 for the liquid at 300 K, in excellent agreement with the experimental result of 2.3 × 10−5 cm2 s−1. The internal energies for liquid water (300 K) and ice (0 K) are −11.21 kcal mol−1 per molecule and −14.69 kcal mol−1 per molecule, respectively. The computed liquid radial distribution functions are in excellent agreement with the recent ones obtained from neutron scattering (Soper 2000) and X-ray diffraction experiments (Hura et al. 2000; Sorenson et al. 2000). Furthermore, the calculated densities are 1.046 g cm−3 for liquid water (300 K) and 0.942 g cm−3 for ice Ih (0 K). The macroscopic results for liquid water and ice with the newly developed potential are quite encouraging, since they validate the parametrization of interaction potentials for water from high-accuracy calculations on a range of clusters.

5. Conclusions

Parallel supercomputers provide computing resources that are orders of magnitude larger than those available in traditional vector supercomputers. For example, the ASCI Q computer will not only have a peak speed that is 1000× that of the last Cray supercomputer (T90) but it also will have 12 terabytes of memory and 600 terabytes of disk storage, again nearly 1000× that available in the Cray T90. These increases—in computer speed, memory size and disk storage—suggest more than a modest change in how we use computers to model physical, chemical and biological systems. However, to realize this potential will take a determined effort by disciplinary theoretical and computational scientists, computer scientists and applied mathematicians. Much of the software created for traditional (vector) supercomputers must be substantially revised or completely rewritten. This must be a collaborative effort. The more loosely coupled efforts that are traditional in scientific research are simply unable to respond in a timely fashion to the rapid pace of advancements in computing technologies.


The NWChem program was created by a team of theoretical and computational chemists, computer scientists and applied mathematicians specifically for parallel supercomputers. It is able to take full advantage of the capabilities of such computers, extending HF calculations to 1000 atoms and the far more accurate coupled-cluster calculations to 100 atoms. And all of this on the (now) small parallel computer (1/4 teraops) in the Environmental Molecular Sciences Laboratory, taking only limited advantage of the new mathematical approaches to reduce the scaling of the calculations with basis set (N^m → N). Because of its layered and modular design, little must be done to NWChem to enable it to take full advantage of the new parallel supercomputers being installed in the national supercomputing centres supported by the DOE and the NSF. Nonetheless, work on NWChem continues. The software developer's job is never done; there are always new features to add, more efficient algorithms to be developed, and a myriad other tasks.

NWChem has been distributed to over 400 institutions worldwide and is being used to model molecular systems of a complexity that was intractable just a few years ago. Within the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory, NWChem has been used to model the hydrogen-bonded molecules described in this paper as well as proteins and polyliposaccharides, chlorinated hydrocarbons, uranium compounds, and a wide range of other molecules. We look forward to the advances in our understanding of molecules and molecular processes that will be enabled by NWChem.

We are grateful to Dr Matt Hodges for providing the energies of the pentamer and hexamer isomers with the ASP, ASP-W4 and ASP-VRT models. The work referenced here was performed in the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) under the auspices of the Division of Chemical Sciences, Office of Basic Energy Sciences, US Department of Energy, under Contract DE-AC06-76RLO 1830 with Battelle Memorial Institute, which operates the Pacific Northwest National Laboratory. The EMSL is a national user facility funded by the Office of Biological and Environmental Research in the US Department of Energy.

References

Allen, F. et al. 2001 Blue Gene. A vision for protein science using a petaflop supercomputer. IBM Syst. J. 40, 310–327.

Avgul, N. N. & Kieslev, A. V. 1970 In Chemistry and physics of carbon (ed. P. L. Walker), vol. 6. New York: Dekker.

Bailey, D. H. 1997 Little's law and high-performance computing. Available at http://www.nersc.gov/~dhbailey/dhbpapers/little.pdf.

Bailey, D. H. 1998 Challenges of future high-end computing. In High performance computer systems and applications (ed. J. Schaeffer). Kluwer.

Burnham, C. J. & Xantheas, S. S. 2002a Development of transferable interaction potentials for water. I. Prominent features of the water dimer potential energy surface. J. Chem. Phys. 116, 1479–1492.

Burnham, C. J. & Xantheas, S. S. 2002b Development of transferable interaction potentials for water. III. Re-parameterization of an all-atom polarizable rigid model (TTM2-R) from first principles. J. Chem. Phys. 116, 1500–1510.

Burnham, C. J., Li, J. C., Xantheas, S. S. & Leslie, M. 1999 The parameterization of a Thole-type all-atom polarizable water model from first principles and its application to the study of water clusters (n = 2–21) and the phonon spectrum of ice Ih. J. Chem. Phys. 110, 4566–4581.

Chuang, Y.-Y. (and 18 others) 2000 POLYRATE-version 8.4.1, University of Minnesota, MN. For further information see http://comp.chem.umn.edu/polyrate/.


Cociorva, D., Wilkins, J. W., Baumgartner, G., Sadayappan, P., Ramanujan, J., Nooijen, M., Bernholdt, D. E. & Harrison, R. J. 2002 Towards automatic synthesis of high-performance codes for electronic structure calculations: data locality optimization. In Proc. 8th Int. Conf. on High-Performance Computing. Springer. (In the press.)

Corongiu, G. & Clementi, E. 1992 Liquid water with an ab initio potential: X-ray and neutron scattering from 238 to 368 K. J. Chem. Phys. 97, 2030–2038.

Courty, A., Mons, M., Dimicoli, I., Piuzzi, F., Gaigeot, M.-P., Brenner, V., Pujo, P. D. & Millie, P. 1998 Quantum effects in the threshold photoionization and energetics of the benzene–H2O and benzene–D2O complexes: experiment and simulation. J. Phys. Chem. A106, 6590–6600.

Dang, L. X. & Chang, T.-M. 1997 Molecular dynamics study of water clusters, liquid, and liquid–vapor interface of water with many-body potentials. J. Chem. Phys. 106, 8149–8159.

DOE 2000 US Department of Energy, Office of Science. Scientific discovery through advanced computing. Programme plan submitted to the US Congress, 30 March 2000. A copy of the SciDAC Plan is available at http://www.sc.doe.gov/production/octr/mics/scidac/index.html.

Dunning Jr, T. H. 1999 A road map for the calculation of molecular binding energies. J. Phys. Chem. A104, 9062–9080.

Dunning Jr, T. H., Peterson, K. A. & Woon, D. E. 1998 Basis sets: correlation consistent. In The encyclopedia of computational chemistry (ed. P. von Rague Schleyer, N. L. Allinger, T. Clark, J. Gasteiger, P. A. Kollman, H. F. Schaefer III & P. R. Schreiner). Wiley.

Feller, D. 1992 Application of systematic sequences of wave functions to the water dimer. J. Chem. Phys. 96, 6104–6114.

Feller, D. 1999 The strength of the benzene–water hydrogen bond. J. Phys. Chem. A103, 7558–7561.

Feller, D. & Jordan, K. D. 2000 Estimating the strength of the water/single-layer graphite interaction. J. Phys. Chem. A104, 9971–9975.

Feyereisen, M. W., Feller, D. & Dixon, D. A. 1996 Hydrogen bond energy of the water dimer. J. Phys. Chem. 100, 2993–2997.

Fredericks, S. Y., Jordan, K. D. & Zwier, T. S. 1996 Theoretical characterization of the structures and vibrational spectra of benzene–(H2O)n (n = 1–3) clusters. J. Phys. Chem. 100, 7810–7821.

Gale, R. L. & Beebe, R. A. 1964 Determination of heats of adsorption on carbon blacks and bone mineral by chromatography using the eluted pulse technique. J. Phys. Chem. 68, 555–567.

Gotch, A. J. & Zwier, T. S. 1992 Multiphoton ionization studies of clusters of immiscible liquids. I. C6H6–(H2O)n, n = 1, 2. J. Chem. Phys. 96, 3388–3401.

Greengard, L. & Rokhlin, V. 1997 A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numer. 6, 229 (and references therein).

Halkier, A., Koch, H., Jorgensen, P., Christiansen, O., Nielsen, I. M. B. & Helgaker, T. 1997 A systematic ab initio study of the water dimer in hierarchies of basis sets and correlation methods. Theor. Chem. Acc. 97, 150–157.

Hodges, M. P., Stone, A. J. & Xantheas, S. S. 1997 The contribution of many-body terms to the interaction energy of small water clusters: a comparison of ab initio and accurate model potentials. J. Phys. Chem. A101, 9163–9168.

Hura, G., Sorenson, J. M., Glaeser, R. M. & Head-Gordon, T. 2000 A high-quality X-ray scattering experiment on liquid water at ambient conditions. J. Chem. Phys. 113, 9140–9148.

Hwang, K. & Xu, Z. 1997 Scalable parallel computing. McGraw-Hill.

Iijima, S. 1991 Helical microtubules of graphitic carbon. Nature 354, 56–59.

Kendall, R. A. (and 12 others) 2000 High-performance computational chemistry: an overview of NWChem, a distributed parallel application. Comput. Phys. Commun. 128, 260–283.


Klopper, W. 1998 r12-dependent wavefunctions. In The encyclopedia of computational chemistry (ed. P. von Rague Schleyer, N. L. Allinger, T. Clark, J. Gasteiger, P. A. Kollman, H. F. Schaefer III & P. R. Schreiner). Wiley.

Koga, K., Gao, G. T., Tanaka, H. & Zeng, X. C. 2001 Formation of ordered ice nanotubes inside carbon nanotubes. Nature 412, 802–805.

Liu, K., Brown, M. G., Carter, C., Saykally, R. J., Gregory, J. K. & Clary, D. C. 1996 Characterization of a cage form of the water hexamer. Nature 381, 501–503.

Nielsen, I. M. B., Seidl, E. T. & Janssen, C. L. 1999 Accurate structures and binding energies for small water clusters: the water trimer. J. Chem. Phys. 110, 9435–9442.

Nieplocha, J., Harrison, R. J. & Littlefield, R. J. 1996 Global arrays: a nonuniform memory access programming model for high-performance computers. J. Supercomput. 10, 197–220.

Niesar, U., Corongiu, G., Clementi, E., Kneller, G. R. & Bhattachrya, D. K. 1990 Liquid water with an ab initio potential: X-ray and neutron scattering from 238 to 368 K. J. Phys. Chem. 94, 7949–7956.

Ochsenfeld, C., White, C. A. & Head-Gordon, M. 1998 Linear and sublinear scaling formation of Hartree–Fock-type exchange matrices. J. Chem. Phys. 109, 1663–1669.

Rahman, A. & Stillinger, F. H. 1971 Molecular dynamics study of liquid water. J. Chem. Phys. 55, 3336–3359.

Rick, S. W., Stuart, S. J. & Berne, B. J. 1994 Dynamical fluctuating charge force fields: application to liquid water. J. Chem. Phys. 101, 6141–6156.

Soper, A. K. 2000 The radial distribution functions of water and ice from 220 to 673 K and at pressures up to 400 MPa. Chem. Phys. 258, 121–137.

Sorenson, J. M., Hura, G., Glaeser, R. M. & Head-Gordon, T. 2000 What can X-ray scattering tell us about the radial distribution functions of water? J. Chem. Phys. 113, 9149–9161.

Sprik, M. & Klein, M. L. 1988 A polarizable model for water using distributed charge sites. J. Chem. Phys. 89, 7556–7560.

Stillinger, F. H. & David, C. W. 1978 Polarization model for water and its ionic dissociation products. J. Chem. Phys. 69, 1473–1484.

Vernov, A. & Steele, W. A. 1992 The electrostatic field at a graphite surface and its effect on molecule–solid interactions. Langmuir 8, 155–159.

Wales, D. & Hodges, M. P. 1998 Global minima of water clusters (H2O)n, n ≤ 21, described by an empirical potential. Chem. Phys. Lett. 286, 65–72.

Xantheas, S. S. 1994 Ab initio studies of the cyclic water clusters (H2O)n, n = 1–6. II. Analysis of many-body interactions. J. Chem. Phys. 100, 7523–7534.

Xantheas, S. S. 2000 Cooperativity and hydrogen bonding network in water clusters. Chem. Phys. 258, 225–231.

Xantheas, S. S., Burnham, C. J. & Harrison, R. J. 2002 Development of transferable interaction potentials for water. II. Accurate energetics of the first few water clusters from first principles. J. Chem. Phys. 116, 1493–1499.
