Science on Supercomputers:
Pushing the (back of) the envelope
Jeffrey P. Gardner
Pittsburgh Supercomputing Center
Carnegie Mellon University
University of Pittsburgh
Outline
History (the past)
• Characteristics of scientific codes
• Scientific computing, supercomputers, and the Good Old Days
Reality (the present)
• Is there anything “super” about computers anymore?
• Why “network” means more net work on your part.
Fantasy (the future)
• Strategies for turning a huge pile of processors into something scientists can actually use.
A (very brief) Introduction to Scientific Computing
Properties of “interesting” scientific datasets:
• Very large datasets where the calculation is “tightly coupled”
Example Science Application: Cosmology
Cosmological “N-Body” simulation
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years
To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling
Example Science Application: Cosmology
Cosmological “N-Body” simulation
• 100,000,000 particles
• 1 TB of RAM
• 100 million light years
To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling
Scientific Computing
Transaction Processing1: A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state.2
Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.
1Term borrowed (and generalized, with apologies) from database management
2From Wikipedia
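The two clauses can be illustrated with a toy sketch (pure Python; the function names are ours, purely illustrative): an element-wise sum subdivides almost for free, while a recurrence cannot be split into pieces that are independent of one another.

```python
# A toy contrast between a divisible task and a "transaction" in the
# functional sense above.

def divisible_sum(data, nchunks):
    # Each chunk can be summed independently, then combined:
    # subdivision is cheap and the pieces do not interact.
    size = len(data) // nchunks
    partials = [sum(data[i * size:(i + 1) * size]) for i in range(nchunks)]
    return sum(partials)

def coupled_recurrence(data):
    # x depends on the previous x: no subdivision of this loop can be
    # written so that the pieces are independent of one another.
    x = 0.0
    for d in data:
        x = 0.5 * x + d
    return x

data = list(range(8))
assert divisible_sum(data, 4) == sum(data)
```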
Scientific Computing
Functional definition: A transaction is any computational task:
1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete.
Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM
To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling
Scientific Computing
Functional definition: A transaction is any computational task:
2. Where any further subdivisions cannot be written in such a way that they are independent of one another.
Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM
To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling
Scientific Computing
In most business and web applications:
• A single CPU usually processes many transactions per second
• Transaction sizes are typically small
Scientific Computing
In many science applications:
• A single transaction can take CPU hours, days, or years
• Transaction sizes can be extremely large
What Made Computers “Super”?
Since the transaction must be memory-resident to avoid being I/O bound, the next bottleneck is memory.
The original supercomputers differed from “ordinary” computers in their memory bandwidth and latency characteristics.
The “Golden Age” of Supercomputing
1976-1982: The Cray-1 is the most powerful computer in the world
The Cray-1 is a vector platform:
i.e. it performs the same operation on many contiguous memory elements in one clock tick.
The memory subsystem was optimized to feed data to the processor at its maximum flop rate.
The “Golden Age” of Supercomputing
1985-1989: The Cray-2 is the most powerful computer in the world
The Cray-2 is also a vector platform
Scientists Liked Supercomputers. They were simple to program!
1. They were serial machines
2. “Caches? We don’t need no stinkin’ caches!”
• Scalar machines had no memory latency; this is as close as you get to an ideal computer
• Vector machines offered substantial performance increases over scalar machines if you could “vectorize” your code.
“Triumph” of the Masses
In the 1990s, commercial off-the-shelf (COTS) technology became so cheap, it was no longer cost-effective to produce fully-custom hardware
“Triumph” of the Masses
Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them.
A single-processor Cray-2; a 1024-processor Cray (CRI) T3D
“Triumph” of the Masses
These were known as massively parallel platforms, or MPPs.
A single-processor Cray-2; a 1024-processor Cray T3D
“Triumph” of the Masses(?)
A single-processor Cray-2, the world’s fastest computer in 1989
A 1024-processor Cray T3D, (almost) the world’s fastest computer in 1994
Part II: The Present
Why “network” means more net work on your part
The “Social Impact” of MPPs
The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers. MPPs are really hard to program!
Developing scientific applications for MPPs became an area of study in its own right: High Performance Computing (HPC)
Characteristics of HPC Codes
Large dataset: data must be distributed across many compute nodes
The CPU memory hierarchy / the MPP memory hierarchy:
• Processor registers
• L1 cache: ~2 cycles
• L2 cache: ~10 cycles
• Main memory: ~100 cycles
• Off-processor memory: ~300,000 cycles!
[Figure: an N-body cosmology simulation decomposed across Proc 0 through Proc 8]
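The cycle counts above fold into a quick back-of-the-envelope estimate. This sketch (names ours; cycle counts are the slide’s approximate figures) shows how even a tiny fraction of off-processor fetches comes to dominate the average access cost:

```python
# Back-of-the-envelope cost of a memory access stream on an MPP,
# using the approximate cycle counts from the memory hierarchy above.
LATENCY = {
    "L1 cache": 2,
    "L2 cache": 10,
    "main memory": 100,
    "off-processor": 300_000,
}

def mean_cost(fractions):
    """Average cycles per access for a given mix of access levels."""
    return sum(frac * LATENCY[level] for level, frac in fractions.items())

# If even 1 access in 3000 goes off-processor, remote latency already
# costs as much as all the local traffic combined:
mix = {"main memory": 1 - 1 / 3000, "off-processor": 1 / 3000}
print(round(mean_cost(mix)))  # ~200 cycles per access on average
```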
What makes computers “super” anymore?
Cray T3D in 1994: Cray-built interconnect fabric
PSC “Terascale Compute System” (TCS) in 2000: custom interconnect fabric by Quadrics
PSC Cray XT3 in 2006: Cray-built interconnect fabric
What makes computers “super” anymore?
I would propose the following definition:
A “supercomputer” differs from “a pile of workstations” in that a supercomputer is optimized to spread a single large transaction across many, many processors.
In practice, this means that the network interconnect fabric is identified as the principal bottleneck.
What makes computers “super” anymore?
Google’s 30-acre campus in The Dalles, Oregon
Review: Hallmarks of Computing
1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN
Review: HPC
High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors.
In general, this kind of computation is sensitive to network bandwidth and latency.
Therefore, most modern-day “supercomputers” seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.
Naïve algorithm is Order N²
Gasoline: N-Body treecode (Order N log N); began development in 1994… and continues to this day
[Figure: domain decomposition across PEs using a kd-tree (a subset of Binary Space Partitioning trees)]
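As a rough illustration of the data structure the slide names, here is a minimal kd-tree built by recursive median split. This is a generic textbook sketch with invented names, not Gasoline’s actual tree code:

```python
# Minimal kd-tree construction by recursive median split: the simplest
# form of the binary space partitioning referred to above.

def build_kdtree(points, depth=0):
    """Return a nested (axis, median_point, left, right) tuple."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # split at the median
    return (axis,
            points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(tree[1])  # root splits on x; the median point is (7, 2)
```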
Example HPC Application:Cosmological N-Body Simulation
Cosmological N-Body Simulation
PROBLEM:
• Everything in the Universe attracts everything else
• Dataset is far too large to replicate in every PE’s memory
• Difficult to parallelize
Cosmological N-Body Simulation
PROBLEM:
• Everything in the Universe attracts everything else
• Dataset is far too large to replicate in every PE’s memory
• Difficult to parallelize
Only 1 in 3000 memory fetches can result in an off-processor message being sent!
Features: Advanced interprocessor data caching
• Application data is organized into cache lines
• Read cache: requests for off-PE data result in fetching of a “cache line”; the cache line is stored locally and used for future requests
• Write cache: updates to off-PE data are processed locally, then flushed to the remote thread when necessary
• < 1 in 100,000 off-PE requests actually result in communication.
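The read-cache idea can be sketched in a few lines: a miss fetches a whole software cache line from remote memory and serves later requests locally. This is a toy model with invented names, not Gasoline’s implementation:

```python
LINE = 8  # elements per software cache line (illustrative size)

class ReadCache:
    """Toy read cache: a miss fetches a whole line from 'remote' memory."""
    def __init__(self, remote):
        self.remote = remote          # stand-in for another PE's memory
        self.lines = {}               # line index -> locally cached data
        self.messages = 0             # count of simulated communications

    def get(self, i):
        line = i // LINE
        if line not in self.lines:    # miss: one message fetches LINE items
            self.messages += 1
            self.lines[line] = self.remote[line * LINE:(line + 1) * LINE]
        return self.lines[line][i % LINE]

remote = list(range(64))
cache = ReadCache(remote)
total = sum(cache.get(i) for i in range(64))
print(cache.messages)  # 8 messages for 64 fetches: locality amortizes cost
```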
Features: Load balancing
• The amount of work required for each particle on step t is tracked.
• This information is used to distribute work evenly amongst processors for step t+1.
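A minimal sketch of the per-particle balancing idea (the greedy splitter and all names are our illustration, not Gasoline’s actual scheme): work measured on step t drives the domain split for step t+1:

```python
# Sketch of per-particle load balancing: given the work each particle
# cost on step t, re-split the particle list so every processor gets
# roughly equal total work on step t+1.

def balance(work, nproc):
    """Split indices 0..len(work)-1 into nproc contiguous chunks of
    approximately equal total work; returns a list of index lists."""
    target = sum(work) / nproc
    chunks, current, acc = [], [], 0.0
    for i, w in enumerate(work):
        current.append(i)
        acc += w
        if acc >= target and len(chunks) < nproc - 1:
            chunks.append(current)        # close this processor's chunk
            current, acc = [], 0.0
    chunks.append(current)
    return chunks

work = [4, 4, 4, 12]                      # one expensive particle
print([len(c) for c in balance(work, 2)]) # [3, 1]: uneven counts, even work
```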
Performance
• 85% linearity on 512 PEs with pure MPI (Cray XT3)
• 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem)
• 92% linearity on 2048 PEs on Cray XT3 for optimal problem size (>100,000 particles per processor)
Features: Portability
• Interprocessor communication by high-level requests to a “Machine-Dependent Layer” (MDL)
• MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication); only 800 lines of code per architecture
• MPI-1, POSIX Threads, SHMEM, Quadrics, & more
[Figure: two parallel threads, each running GASOLINE on top of MDL, communicating through the MDL layer]
Applications
Galaxy formation (10 million particles)
Applications
Solar system planet formation (1 million particles)
Applications
Asteroid collisions (2000 particles)
Applications
Piles of sand (?!) (~1000 particles)
Summary
• N-Body simulations are difficult to parallelize: gravity says everything interacts with everything else
• GASOLINE achieves high scalability by using several beneficial concepts:
  • Interprocessor data caching for both reads and writes
  • Maximal exploitation of any parallel architecture
  • Load balancing on a per-particle basis
• GASOLINE proved useful for a wide range of applications that simulate particle interactions
• Flexible client-server architecture aids in porting to new science domains
Part III: The Future
Turning a huge pile of processors into something that scientists can actually use.
How to turn simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge
Using 300 processors (circa 1996): happy scientist
How to turn simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on server
Step 3: Extract meaningful scientific knowledge
Using 1000 processors (circa 2000): happy scientist
How to turn simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on ???
Using 2000+ processors (circa 2005): unhappy scientist
How to turn simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on ???
Using 100,000 processors? (circa 2012)
The NSF has announced that it will be providing $200 million to build and operate a Petaflop machine by 2012.
Turning TeraFlops into Scientific Understanding
Problem: The size of simulations is no longer limited by the scalability of the simulation code, but by the scientists’ inability to process the resultant data.
Turning TeraFlops into Scientific Understanding
As MPPs increase in processor count, analysis tools must also run on MPPs!
PROBLEM:
1. Scientists usually write their own analysis programs
2. Parallel programs are hard to write!
The HPC world is dominated by simulations:
• Code is often reused for many years by many people
• Therefore, you can afford to spend lots of time writing the code.
• Example: Gasoline required 10 FTE-years of development!
Turning TeraFlops into Scientific Understanding
Data analysis implies: rapidly changing scientific inquiries, and much less code reuse
Data analysis requires rapid algorithm development!
We need to rethink how we as scientists interact with our data!
A Solution(?): N tropy
Scientists tend to write their own code, so give them something that makes that easier for them.
Build a framework that is:
• Sophisticated enough to take care of all of the parallel bits for you
• Flexible enough to be used for a large variety of data analysis applications
N tropy: A framework for multiprocessor development
GOAL: Minimize development time for parallel applications.
GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code.
GOAL: Provide seamless scalability from single processor machines to MPPs…potentially even several MPPs in a computational Grid.
GOAL: Do not restrict inquiry space.
Methodology
• Limited data structures: astronomy deals with point-like data in an N-dimensional parameter space, and the most efficient methods on these kinds of data use trees.
• Limited methods: analysis methods perform a limited number of fundamental operations on these data structures.
N tropy Design
GASOLINE already provides a number of advanced services. GASOLINE benefits to keep:
• Flexible client-server scheduling architecture: threads respond to service requests issued by the master; to do a new task, simply add a new service.
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
• Advanced interprocessor data caching: < 1 in 100,000 off-PE requests actually result in communication.
N tropy Design
• Dynamic load balancing (available now): workload and processor domain boundaries can be dynamically reallocated as computation progresses.
• Data pre-fetching (to be implemented): predict and request the off-PE data that will be needed for upcoming tree nodes.
N tropy Design
Computing across grid nodes is much more difficult than between nodes on a tightly-coupled parallel machine:
• Network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine.
• Nodes on far grid resources must be treated differently than the processor next door: data mirroring or aggressive prefetching; sophisticated workload management and synchronization.
N tropy Features
By using N tropy you get a lot of features “for free”:
• Tree objects and methods: highly optimized and flexible
• Automatic parallelization and scalability: you only write serial bits of code!
• Portability: interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), which is rewritten to take advantage of each parallel architecture.
  • MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix
N tropy Features
By using N tropy you get a lot of features “for free”:
• Collectives: AllToAll, AllGather, AllReduce, etc.
• Automatic reduction variables: all of your routines can return scalars to be reduced across all processors
• Timers: 4 automatic N tropy timers, 10 custom timers
• Automatic communication and I/O statistics: quickly identify bottlenecks
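The reduction-variable idea can be sketched serially: each “PE” returns a scalar, and the framework combines them so every PE sees the global result, as MPI’s AllReduce would. The names below are our illustration, not the N tropy API:

```python
from functools import reduce
import operator

# Serial stand-in for an AllReduce: every "PE" computes a local scalar,
# and the framework combines them with one operator so that all PEs
# end up holding the same global result.

def all_reduce(local_values, op=operator.add):
    global_value = reduce(op, local_values)
    return [global_value] * len(local_values)   # every PE gets the result

local_sums = [10.0, 20.0, 30.0, 40.0]           # one scalar per "PE"
print(all_reduce(local_sums))        # every PE now holds 100.0
print(all_reduce(local_sums, max))   # reductions need not be sums
```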
Serial Performance
N tropy vs. an existing serial n-point correlation function calculator: N tropy is 6 to 30 times faster in serial!
Conclusions:
1. Not only does it take much less time to write an application using N tropy,
2. your application may run faster than if you wrote it from scratch!
Performance
10 million particles, spatial 3-point function, 3 to 4 Mpc
This problem is substantially harder than gravity!
3 FTE-months of development time!
N tropy “Meaningful” Benchmarks
The purpose of this framework is to minimize development time!
Development time for:
1. N-point correlation function calculator: 3 months
2. Friends-of-friends group finder: 3 weeks
3. N-body gravity code: 1 day!*
*(OK, I cheated a bit and used existing serial N-body code fragments)
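For readers unfamiliar with the second benchmark, here is a minimal serial friends-of-friends group finder (an O(N²) union-find sketch of the generic algorithm, not N tropy’s tree-based version): particles closer than a linking length belong, transitively, to the same group.

```python
# Minimal friends-of-friends group finder: particles closer than a
# linking length b are "friends", and friendship is transitive.

def friends_of_friends(points, b):
    """Return a group label for each point."""
    n = len(points)
    label = list(range(n))                 # union-find parent array

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]     # path halving
            i = label[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - c) ** 2 for a, c in zip(points[i], points[j]))
            if d2 <= b * b:
                label[find(i)] = find(j)   # union the two groups
    return [find(i) for i in range(n)]

pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(friends_of_friends(pts, 0.6))  # first three linked; last isolated
```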
N tropy Conceptual Schematic
[Schematic: a Computational Steering Layer (C, C++, Python, Fortran?) and a Web Service Layer (at least from Python; WSDL? SOAP?; connecting to the VO) sit atop the framework “black box”, which provides domain decomposition/tree building, tree traversal, parallel I/O, collectives, and dynamic workload management over tree services. The user supplies serial collective staging and processing routines, serial I/O routines, tree traversal routines, and tree and particle data.]
Summary
Prehistoric times: FORTRAN is heralded as the first “high-level” language.
Ancient times: Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy.
early 1990s: MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms.
mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again.
early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed.
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN
Summary
N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment.
Our results so far show that it is worthwhile to invest time developing individual frameworks that are:
1. Serially optimized
2. Scalable
3. Flexible enough to be customized to many different applications, even applications that you do not currently envision.
Is this a solution for the 100,000 processor world of tomorrow??
Pittsburgh Supercomputing Center
• Founded in 1986
• Joint venture between Carnegie Mellon University, University of Pittsburgh, and Westinghouse Electric Co.
• Funded by several federal agencies as well as private industries.
• Main source of support is the National Science Foundation, Office of Cyberinfrastructure
Pittsburgh Supercomputing Center
PSC is the third-largest NSF-sponsored supercomputing center,
BUT we provide over 60% of the computer time used by NSF research,
AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research).
Pittsburgh Supercomputing Center
GOAL: To use cutting edge computer technology to do science that would not otherwise be possible
Conclusions
• Most data analysis in astronomy is done using trees as the fundamental data structure.
• Most operations on these tree structures are functionally identical.
• Based on our studies so far, it appears feasible to construct a general-purpose multiprocessor framework that users can rapidly customize to their needs.
Cosmological N-Body Simulation
Timings:
• Time required for 1 floating point operation: 0.25 ns
• Time required for 1 memory fetch: ~10 ns (40 floats)
• Time required for 1 off-processor fetch: ~10 µs (40,000 floats)
Lesson: Only 1 in 1000 memory fetches can result in network activity!
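The lesson follows from the ratio of the fetch times. A quick check (taking the off-processor fetch as ~10 µs, consistent with the 40 vs. 40,000 float counts at 0.25 ns per flop):

```python
# Quick consistency check of the timings above.
flop = 0.25e-9            # 1 floating point operation
local_fetch = 10e-9       # 1 memory fetch
remote_fetch = 10e-6      # 1 off-processor fetch

print(round(local_fetch / flop))          # 40 flops lost per memory fetch
print(round(remote_fetch / flop))         # 40,000 flops lost per remote fetch
print(round(remote_fetch / local_fetch))  # 1000: hence "1 in 1000"
```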
The very first “Super Computer”
1929: New York World newspaper coins the term “super computer” when talking about a giant tabulator custom-built by IBM for Columbia University
Review: Hallmarks of Computing
1956: FORTRAN heralded as the world’s first “high-level” language
1966: Seymour Cray develops the CDC 6600, the first “supercomputer”
1972: Seymour Cray founds Cray Research Inc. (CRI)
1976: Cray-1 marks the beginning of the Golden Age of supercomputing
1986: Pittsburgh Supercomputing Center is founded
1989: Cray-2 marks the end of the Golden Age of supercomputing
1989: Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC)
1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.)
1995: Cray Computer Corporation (CCC) goes bankrupt
1996: Cray Research Inc. acquired by SGI
1998: Google Inc. is founded
20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN
The T3D MPP
• 1024 DEC Alpha processors (COTS)
• 128 MB of RAM per processor (COTS)
• Cray custom-built network fabric ($$$)
A 1024-processor Cray T3D in 1994
General characteristics of MPPs
• COTS processors
• COTS memory subsystem
• Linux-based kernel
• Custom networking: custom networking in MPPs has replaced the custom memory systems of vector machines
The 2068-processor Cray XT3 at PSC in 2006
Why??
Example Science Application: Weather Prediction
Looking for Tornados(credits: PSC, Center for Analysis and Prediction of Storms)
Reasons for being sensitive to communication latency:
• A given processor (PE) may “touch” a very large subsample of the total dataset. Example: a self-gravitating system.
• PEs must exchange information many times during a single transaction. Example: along the domain boundaries of a fluid calculation.
Features: Flexible client-server scheduling architecture
• Threads respond to service requests issued by the master.
• To do a new task, simply add a new service.
• Computational steering involves trivial serial programming.
Design: Gasoline Functional Layout
• Computational Steering Layer (serial): executes on master processor only
• Parallel Management Layer (parallel): coordinates execution and data distribution among processors
• Serial Layer (Gravity Calculator, Hydro Calculator): executes “independently” on all processors
• Machine-Dependent Layer (MDL, parallel): interprocessor communication
Cosmological N-Body Simulation
SCIENCE: Simulate how structure in the Universe forms from initial linear density fluctuations:
1. Linear fluctuations in the early Universe supplied by cosmological theory.
2. Calculate non-linear final states of these fluctuations.
3. See if these look anything like the real Universe.
4. No? Go to step 1.