1 Interactive Tools for Combinatorial Supercomputing John R. Gilbert University of California, Santa...
-
date post
21-Dec-2015 -
Category
Documents
-
view
219 -
download
2
Transcript of 1 Interactive Tools for Combinatorial Supercomputing John R. Gilbert University of California, Santa...
1
Interactive Tools for Combinatorial Supercomputing
John R. GilbertUniversity of California, Santa Barbara
Viral Shah, Imran Patel (UCSB)Alan Edelman (MIT and Interactive Supercomputing)Ron Choy, David Cheng (MIT)Parry Husbands (Lawrence Berkeley Lab)Steve Reinhardt, Todd Letsche (SGI)
Support: DOE Office of Science, DARPA, SGI, ISC
2
Parallel Computing Today
Departmental Beowulf cluster
Columbia, NASA Ames Research Center
3
But!
How do you program it?
4
C with MPI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"
#define A(i,j) ( 1.0/((1.0*(i)+(j))*(1.0*(i)+(j)+1)/2 + (1.0*(i)+1)) )
void errorExit(void);
double normalize(double* x, int mat_size);
int main(int argc, char **argv)
{
int num_procs;
int rank;
int mat_size = 64000;
int num_components;
double *x = NULL;
double *y_local = NULL;
double norm_old = 1;
double norm = 0;
int i,j;
int count;
if (MPI_SUCCESS != MPI_Init(&argc, &argv)) exit(1);
if (MPI_SUCCESS != MPI_Comm_size(MPI_COMM_WORLD,&num_procs)) errorExit();
5
C with MPI (2)
if (0 == mat_size % num_procs) num_components = mat_size/num_procs;
else num_components = (mat_size/num_procs + 1);
mat_size = num_components * num_procs;
if (0 == rank) printf("Matrix Size = %d\n", mat_size);
if (0 == rank) printf("Num Components = %d\n", num_components);
if (0 == rank) printf("Num Processes = %d\n", num_procs);
x = (double*) malloc(mat_size * sizeof(double));
y_local = (double*) malloc(num_components * sizeof(double));
if ( (NULL == x) || (NULL == y_local) )
{
free(x);
free(y_local);
errorExit();
}
if (0 == rank)
{
for (i=0; i<mat_size; i++)
{
x[i] = rand();
}
norm = normalize(x,mat_size);
}
6
C with MPI (3)
if (MPI_SUCCESS !=
MPI_Bcast(x, mat_size, MPI_DOUBLE, 0, MPI_COMM_WORLD)) errorExit();
count = 0;
while (fabs(norm-norm_old) > TOL) {
count++;
norm_old = norm;
for (i=0; i<num_components; i++)
{
y_local[i] = 0;
}
for (i=0; i<num_components && (i+num_components*rank)<mat_size; i++)
{
for (j=mat_size-1; j>=0; j--)
{
y_local[i] += A(i+rank*num_components,j) * x[j];
}
}
if (MPI_SUCCESS != MPI_Allgather(y_local, num_components, MPI_DOUBLE, x, num_components, MPI_DOUBLE, MPI_COMM_WORLD)) errorExit();
7
C with MPI (4)
norm = normalize(x, mat_size);
}
if (0 == rank)
{
printf("result = %16.15e\n", norm);
}
free(x);
free(y_local);
MPI_Finalize();
exit(0);
}
void errorExit(void)
{
int rank;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
printf("%d died\n",rank);
MPI_Finalize();
exit(1);
}
8
C with MPI (5)
double normalize(double* x, int mat_size)
{
int i;
double norm = 0;
for (i=mat_size-1; i>=0; i--)
{
norm += x[i] * x[i];
}
norm = sqrt(norm);
for (i=0; i<mat_size; i++)
{
x[i] /= norm;
}
return norm;
}
9
Star-P
A = rand(4000*p, 4000*p);
x = randn(4000*p, 1);
y = zeros(size(x));
while norm(x-y) / norm(x) > 1e-11
y = x;
x = A*x;
x = x / norm(x);
end;
10
• Matlab*P 1.0 (1998): Edelman, Husbands, Isbell (MIT)
• Matlab*P 2.0 (2002- ): MIT / UCSB / LBNL
• Star-P (2004- ): Interactive Supercomputing / SGI
Background
11
Data-Parallel Operations
< M A T L A B >
Copyright 1984-2001 The MathWorks, Inc.
Version 6.1.0.1989a Release 12.1
>> A = randn(500*p, 500*p)
A = ddense object: 500-by-500
>> E = eig(A);
>> E(1)
ans = -4.6711 +22.1882i
e = E(:,:);
>> ppwhos
Name Size Bytes Class
A 500px500p 688 ddense object
E 500px1 652 ddense object
e 500x1 8000 double array (complex)
12
>> quad('4./(1+x.^2)', 0, 1);
ans = 3.14159270703219
>> a = (0:3*p) / 4
a = ddense object: 1-by-4
>> a(:,:)
ans =
0
0.25000000000000
0.50000000000000
0.75000000000000
>> b = a + .25;
>> c = ppeval('quad','4./(1+x.^2)', a, b);
c = ddense object: 1-by-4
>> sum(c)
ans = 3.14159265358979
Task-Parallel Operations
13
MATLAB®
Star-P Architecture
Ordinary Matlab variables
Star-P
client manager
server manager
package manager
processor #0
processor #N-1
processor #1
processor #2
processor #3
. . .
ScaLAPACK
FFTW
FPGA interface
matrix manager Distributed matrices
pressure
temperature
UPC user code
sort
dense/sparse
UPC user code
MPI user code
14
Matlab sparse matrix design principles
• All operations should give the same results for sparse and full matrices (almost all)
• Sparse matrices are never created automatically, but once created they propagate
• Performance is important -- but usability, simplicity, completeness, and robustness are more important
• Storage for a sparse matrix should be O(nonzeros)
• Time for a sparse operation should be O(flops)(as nearly as possible)
Star-P dsparse matrices: same principles, but some different tradeoffs
15
Sparse data structure
• Full:
• 2-dimensional array of real or complex numbers
• (nrows*ncols) memory
31 0 53
0 59 0
41 26 0
31 53 59 41 26
1 3 2 1 2
• Sparse:
• compressed row storage
• about (2*nzs + nrows) memory
16
P0
P1
P2
Pn
5941 532631
23 131
Each processor stores:
• # of local nonzeros• range of local rows• nonzeros in compressed form
Distributed sparse data structure
17
The sparse( ) constructor
• A = sparse (I, J, V, nr, nc)
• Input: ddense vectors I, J, V, dimensions nr, nc
• Output: A(I(k), J(k)) = V(k)
• Sort triples < i, j, v > by < i, j >
• Locally convert to compressed row indices
• Sum values with duplicate indices
18
Sparse array and matrix operations
• dsparse layout, same semantics as ddense
• Data distribution by rows
• Matrix arithmetic: +, *, max, sum, etc.
• Matrix indexing and concatenation
A (1:3, [4 5 2]) = [ B(:, J) C ] ;
• Linear solvers: x = A \ b; using SuperLU (MPI)
• Eigensolvers: [V, D] = eigs(A); using PARPACK (MPI)
19
Sorting in Star-P
• [V, perm] = sort (V)
• Common primitive for several sparse array algorithms, including the sparse constructor
• Star-P uses a parallel sample sort
20
Sparse matrix times dense vector
• y = A * x
• First matvec with A caches a communication schedule
• Later matvecs with A use the cached schedule
• Communication and computation overlap
21
Combinatorial Scientific Computing
• Sparse matrix methods
• Graph matching
• Machine learning
• Web search and information retrieval
• Geometric modeling
• Computational biology
• Bioinformatics
• . . .
How will combinatorial methods be used by nonexperts?
22
Analogy: Matrix division in Matlab
x = A \ b;
• Works for either full or sparse A
• Is A square?
no => use QR to solve least squares problem
• Is A triangular or permuted triangular?
yes => sparse triangular solve
• Is A symmetric with positive diagonal elements?
yes => attempt Cholesky after symmetric minimum degree
• Otherwise
=> use LU on A(:, colamd(A))
23
Combinatorics in Star-P
• Represent a graph as a sparse adjacency matrix
• A sparse matrix language is a good start on primitives for computing with graphs
– Random-access indexing: A(i,j)
– Neighbor sequencing: find (A(i,:))
– Sparse table construction: sparse (I, J, V)
– Breadth-first search step : A * v
24
Sparse adjacency matrix and graph
• Adjacency matrix: sparse array w/ nonzeros for graph edges
• Storage-efficient implementation from sparse data structures
x ATx
1 2
3
4 7
6
5
AT
25
Breadth-first search: sparse mat * vec
• Multiply by adjacency matrix step to neighbor vertices
• Efficient implementation from sparse data structures
x ATx
1 2
3
4 7
6
5
AT
26
Breadth-first search: sparse mat * vec
• Multiply by adjacency matrix step to neighbor vertices
• Efficient implementation from sparse data structures
x ATx
1 2
3
4 7
6
5
AT
27
Breadth-first search: sparse mat * vec
• Multiply by adjacency matrix step to neighbor vertices
• Efficient implementation from sparse data structures
AT
1 2
3
4 7
6
5
(AT)2x
x ATx
28
Connected components of a graph
• Sequential Matlab uses depth-first search (dmperm), which doesn’t parallelize well
• Shiloach-Vishkin pointer-jumping algorithm:– repeat
• Link every (super)vertex to a neighbor
• Shrink each tree to a supervertex by pointer jumping
– until no further change
• Other planned graph kernels:– Shortest-path search (after Husbands, LBNL)
– Bipartite matching (after Riedy, UCB)
– Strongly connected components (after Pinar, LBNL)
29
HPCS Scalable Synthetic Compact Apps
SSCA#1: Bioinformatics / Pattern Matching
SSCA#2: Graph Analysis
SSCA#3: Sensor Processing / Imaging
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
Data Generator
1. Kernel
2. Kernel
3. Kernel
4. Kernel
HPCchallengeBenchmarks
Micro &Kernel
BenchmarksMission Partner
ApplicationBenchmarks
6.Signal
ProcessingKnowledgeFormation
Ex
isti
ng
Ap
pli
ca
tio
ns
Em
erg
ing
Ap
pli
cati
on
s
Fu
ture
Ap
pli
cati
on
s
Sim
ula
tio
nIn
tellig
en
ce
Re
co
nn
ais
san
ce
5.Simulation
Multi-Physics
1.OptimalPattern
Matching
1.OptimalPattern
Matching
4.SimulationNAS PB AU
3.SimulationNWCHEM
Scalable SyntheticCompact Applications
HPCSSpanning
Set ofKernels
Kernels
DiscreteMath…GraphAnalysis…LinearSolvers…SignalProcessing…Simulation…I/O
ExecutionPerformance
Bounds
ExecutionPerformance
Indicators
LocalDGEMMSTREAM
RandomAccess1D FFT
GlobalLinpackPTRANS
RandomAccess1D FFT
CurrentUM2000
GAMESSOVERFLOW
LBMHDRFCTHHYCOM
Near-FutureNWChemALEGRA
CCSM
Execution andDevelopment
Performance Indicators
System Bounds
Data Generator
Graph Construction
Sort Large Sets
Graph Extraction
Graph Clustering
Data Generator
Graph Construction
Sort Large Sets
Graph Extraction
Graph Clustering
2.Graph
Analysis
2.Graph
Analysis
30
• Many tight clusters, loosely interconnected
• Input data is edge triples < i, j, label(i,j) >
• Vertices and edges permuted randomly
SSCA#2: “Graph Analysis”
Fine-grained, irregular data access
Searching and clustering
31
Concise SSCA#2 in Star-P
Kernel 1: Construct graph data structures
• Graphs are dsparse matrices, created by sparse( )
32
Kernels 2 and 3
Kernel 2: Search by edge labels
• About 12 lines of executable Matlab or Star-P
Kernel 3: Extract subgraphs
• Returns subgraphs consisting of vertices and edges within fixed distance of given starting vertices
• Sparse matrix-vector product for breadth-first search
• About 25 lines of executable Matlab or Star-P
33
Kernel 4: Vertex clustering
% Grow each seed to vertices
% reached by at least two
% paths of length 1 or 2
C = sparse(seeds, 1:ns, 1, n, ns);
C = A * C;
C = C + A * C;
C = C > 1;
• Grow local clusters from many seeds in parallel
• Breadth-first search by sparse matrix * matrix
34
Expressive Power: SSCA#2 Kernel 3
Star-P (25 lines)A = spones(G.edgeWeights{1});
nv = max(size(A));
npar = length(G.edgeWeights);
nstarts = length(starts);
for i = 1:nstarts
v = starts(i);
% x will be a vector whose nonzeros
% are the vertices reached so far
x = zeros(nv,1);
x(v) = 1;
for k = 1:pathlen
x = A*x;
x = (x ~= 0);
end;
vtxmap = find(x);
S.edgeWeights{1} = G.edgeWeights{1}(vtxmap,vtxmap);
for j = 2:npar
sg = G.edgeWeights{j}(vtxmap,vtxmap);
if nnz(sg) == 0
break;
end;
S.edgeWeights{j} = sg;
end;
S.vtxmap = vtxmap;
subgraphs{i} = S;
end
MATLABmpi (91 lines)declareGlobals;
intSubgraphs = subgraphs(G, pathLength, startSetInt);
strSubgraphs = subgraphs(G, pathLength, startSetStr);
%| Finish helping other processors.
if P.Ncpus > 1
if P.myRank == 0 % if we are the leader
for unused = 1:P.Ncpus-1
[src tag] = probeSubgraphs(G, [P.tag.K3.results]);
[isg ssg] = MPI_Recv(src, tag, P.comm);
intSubgraphs = [intSubgraphs isg];
strSubgraphs = [strSubgraphs ssg];
end
for dest = 1:P.Ncpus-1
MPI_Send(dest, P.tag.K3.done, P.comm);
end
else
MPI_Send(0, P.tag.K3.results, P.comm, ...
intSubgraphs, strSubgraphs);
[src tag] = probeSubgraphs(G, [P.tag.K3.done]);
MPI_Recv(src, tag, P.comm);
end
end
function graphList = subgraphs(G, pathLength, startVPairs)
graphList = [];
% Estimated # of edges in a subgraph. Memory will grow as needed.
estNumSubGEdges = 100; % depends on cluster size and path length
%--------------------------------------------------------------------------
% Find subgraphs.
%--------------------------------------------------------------------------
% Loop over vertex pairs in the starting set.
for vertexPair = startVPairs.'
subg.edgeWeights{1} = ...
spalloc(G.maxVertex, G.maxVertex, estNumSubGEdges);
startVertex = vertexPair(1);
endVertex = vertexPair(2);
% Add an edge with the first weight.
subg.edgeWeights{1}(endVertex, startVertex + P.myBase) = ...
G.edgeWeights{1}(endVertex, startVertex);
if ENABLE_PLOT_K3DB
plotEdges(subg.edgeWeights{1}, startVertex, endVertex, 1);
end
% Follow edges pathLength times in adj matrix to grow subgraph as big as
% required.
%| This code could be modified to launch new parallel requests (using
%| eliminating the need to pass back the start-set (and path length).
newStarts = [endVertex]; % Not including startVertex.
allStarts = newStarts;
for k = 2:pathLength
% Find the edges emerging from the current subgraph.
if ~P.paral
newEdges = G.edgeWeights{1}(:, newStarts);
subg.edgeWeights{1}(:, newStarts) = newEdges;
[allNewEnds unused] = find(newEdges);
else % elseif P.paral
allNewEnds = []; % Column vector of edge-ends so far.
numRqst = 0; % Number of requests made so far.
% For each processor which has any of the vertices we need:
startDests = floor((newStarts - 1) / P.myV);
uniqDests = unique(startDests);
for dest = uniqDests
starts = newStarts(startDests == dest);
if dest == P.myRank
newEdges = G.edgeWeights{1}(:, starts - P.myBase);
subg.edgeWeights{1}(:, starts) = newEdges;
[allNewEnds unused] = find(newEdges);
elseif ~isempty(starts)
MPI_Send(dest, P.tag.K3.dataReq, P.comm, starts);
numRqst = numRqst + 1;
end
end
% Wait for a response for each request we sent out.
for unused = 1:numRqst
[src tag] = probeSubgraphs(G, [P.tag.K3.dataResp]);
[starts newEdges] = MPI_Recv(src, tag, P.comm);
subg.edgeWeights{1}(:, starts) = newEdges;
[newEnds unused] = find(newEdges);
allNewEnds = [allNewEnds; newEnds];
end
end % of if ~P.paral
% Eliminate any new ends already in the all starts list.
newStarts = setdiff(allNewEnds.', allStarts);
allStarts = [allStarts newStarts];
if ENABLE_PLOT_K3DB
plotEdges(subg.edgeWeights{1}, startVertex, endVertex, k);
end % of ENABLE_PLOT_K3DB
if isempty(newStarts) % if empty we can quit early.
break;
end
end
% Append to array of subgraphs.
graphList = [graphList subg];
end
function [src, tag] = probeSubgraphs(G, recvTags)
while true
[ranks tags] = MPI_Probe('*', P.tag.K3.any, P.comm);
requests = find(tags == P.tag.K3.dataReq);
for mesg = requests.'
src = ranks(mesg);
starts = MPI_Recv(src, P.tag.K3.dataReq, P.comm);
newEdges = G.edgeWeights{1}(:, starts - P.myBase);
MPI_Send(src, P.tag.K3.dataResp, P.comm, starts, newEdges);
end
mesg = find(ismember(tags, recvTags));
if ~isempty(mesg)
break;
end
end
src = ranks(mesg(1));
tag = tags(mesg(1));
Lines of code cSSCA2 executable
specC/Pthreads/
SIMPLE
Kernel 1 29 68 256
Kernel 2 12 44 121
Kernel 3 25 91 297
Kernel 4 44 295 241
35
Scaling up
Early results on SGI Altix (up to 64 processors):
• Have run cSSCA2 on graphs with 227 = 134 million vertices
• Have manipulated graphs with 400 million vertices and 4 billion edges
• Timings scale well – for large graphs,
• 2x problem size 2x time• 2x problem size & 2x processors same time
Using this benchmark to tune lots of infrastructure
36
Work in progress: Toolbox for Graph Analysis and Pattern Discovery
Layer 1: Graph Theoretic Tools
• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities