Transcript of "High-performance and automatic computing"
hpac.rwth-aachen.de/~pauldj/talks/Frankfurt13.pdf
Paolo Bientinesi, AICES, RWTH Aachen — pauldj@aices.rwth-aachen.de

Page 1:

High-performance and automatic computing

Paolo Bientinesi

AICES, RWTH Aachen — pauldj@aices.rwth-aachen.de

October 22nd, 2013 — Goethe Universität Frankfurt am Main

1 / 23

Page 2:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

2 / 23

Page 3:

Overview

High-performance computing: numerical computations, parallel architectures → time-to-solution, efficiency, scalability, . . .

Automatic computing: optimization, search, but also derivation & deduction → range of algorithms, productivity

Applications → application-specific properties & needs, large scale, full code

3 / 23

Page 7:

hpac.rwth-aachen.de – group overview

Methods & applications — P. Bientinesi

E. Di Napoli Electronic structure calculations

M. Petschow Parallel eigensolvers

D. Fabregat Automation, Computational biology

D. Tameling — Molecular dynamics

E. Peise Performance modeling & prediction

L. Beyer Density Functional Theory

F. Kürten, Y. Madzhunkov, A. Frank, P. Springer, . . .

D. Fabregat

E. Peise

Y. Aulchenko

4 / 23

Page 8:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

5 / 23

Page 9:

Genome-Wide Association Studies

Source: David Hall

Page 11:

Genome-Wide Association Studies

Yurii Paolo

“Mixed models” ???

Linear regression with non-independent outcomes ???

Generalized least-squares problems . . .

b := (X^T M^{-1} X)^{-1} X^T M^{-1} y

Inputs: M ∈ R^{n×n}, X ∈ R^{n×p}, y ∈ R^n
Output: b ∈ R^p

To be repeated millions of times.

7 / 23

Page 16:

Mixed models:  b := (X^T M^{-1} X)^{-1} X^T M^{-1} y

Genome-wide association analysis:

y: phenotype (outcome; vector of observations). E.g.: height, blood pressure for a set of people

X: genome measurements and covariates (design matrix; predictors). E.g.: sex and age over height

M: dependencies between observations. E.g.: tall parents have tall children

b: relation between a variation in the outcome (y) and a variation in the genome sequence (X)

8 / 23
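The GLS formula above can be sketched numerically. The following is a minimal NumPy illustration on synthetic data (not the talk's actual implementation): it computes b via a Cholesky factorization of M instead of forming M^{-1} explicitly, and checks the result against the textbook formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

# Synthetic GLS instance: M symmetric positive definite, X design matrix, y outcome.
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# b := (X^T M^{-1} X)^{-1} X^T M^{-1} y, via M = L L^T ("whitening" X and y).
L = np.linalg.cholesky(M)
Xw = np.linalg.solve(L, X)      # Xw = L^{-1} X
yw = np.linalg.solve(L, y)      # yw = L^{-1} y
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Sanity check against the formula with an explicit inverse.
Minv = np.linalg.inv(M)
b_ref = np.linalg.solve(X.T @ Minv @ X, X.T @ Minv @ y)
assert np.allclose(b, b_ref)
```

Avoiding the explicit inverse is both cheaper and numerically safer; the same Cholesky-based structure reappears in the generated algorithms later in the talk.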

Page 17:

Stats

n: Population size

9 / 23

Page 18:

Stats

m: Number of SNPs

9 / 23

Page 19:

Problem definition (1)

b := (X^T M^{-1} X)^{-1} X^T M^{-1} y,   "to be repeated millions of times"
⇓
b_i := (X_i^T M_i^{-1} X_i)^{-1} X_i^T M_i^{-1} y_i,   for i = 1, . . . , m
⇓
Problem size

M_i ∈ R^{n×n}    1000 ≤ n ≤ 20k+     7.5 MB – 3 GB
X_i ∈ R^{n×p}    3 ≤ p ≤ 20          30 – 625 KB
y_i ∈ R^n                            8 – 780 KB
b_i ∈ R^p                            24 – 160 bytes
Total            10^6 ≤ m ≤ 10^8     7.5 – 3000 TB

10 / 23

Page 22:

Problem definition (2)

b_i := (X_i^T M_i^{-1} X_i)^{-1} X_i^T M_i^{-1} y_i
⇓
b_i := (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y,   with X_i = [X_L | X_Ri],   for i = 1, . . . , m
⇓
Problem size

M ∈ R^{n×n}      1000 ≤ n ≤ 100k     7.5 MB – 74.5 GB
X_Ri ∈ R^n                           8 – 780 KB
b_i ∈ R^p                            24 – 160 bytes
Total            10^6 ≤ m ≤ 10^8     74 GB – 7 TB

10 / 23
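The structural point of this reformulation is that M and y are shared across all m solves, so the expensive factorization can be amortized. A hedged NumPy sketch of that idea (synthetic data, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, pL, m = 300, 3, 50

A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
XL = rng.standard_normal((n, pL))   # covariates X_L shared by all SNPs
XR = rng.standard_normal((n, m))    # one extra column X_Ri per SNP
y = rng.standard_normal(n)

# Factor M = L L^T once; whiten X_L, all X_Ri, and y a single time.
# Each per-SNP problem then reduces to a tiny (pL+1) x (pL+1) solve.
L = np.linalg.cholesky(M)
XLw = np.linalg.solve(L, XL)
XRw = np.linalg.solve(L, XR)
yw = np.linalg.solve(L, y)

bs = []
for i in range(m):
    Xi = np.column_stack([XLw, XRw[:, i]])   # whitened X_i = [X_L | X_Ri]
    bs.append(np.linalg.solve(Xi.T @ Xi, Xi.T @ yw))
```

The loop body costs O(n·p) per SNP instead of O(n^3), which is what makes millions of repetitions feasible.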

Page 24:

Problem definition (3)

b_i := (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y
⇓
b_ij := (X_i^T M_j^{-1} X_i)^{-1} X_i^T M_j^{-1} y_j,   with M_j = σ_j(Φ + h_j I),
for i = 1, . . . , m and j = 1, . . . , t

Moreover, either t = 1 or t ≤ 10^5.

10 / 23
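The structure M_j = σ_j(Φ + h_j I) means all t matrices share the eigenvectors of Φ: with Φ = Z diag(w) Z^T, each M_j = Z · σ_j(diag(w) + h_j I) · Z^T, so a single eigendecomposition serves every trait. A small NumPy check of this identity (random data, assumed sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 200, 5

Phi = rng.standard_normal((n, n))
Phi = (Phi + Phi.T) / 2                      # symmetric kinship-like matrix
sigmas = rng.uniform(0.5, 2.0, t)
hs = rng.uniform(0.1, 0.9, t)

# One eigendecomposition of Phi covers all t shifted/scaled matrices M_j.
w, Z = np.linalg.eigh(Phi)
for j in range(t):
    d = sigmas[j] * (w + hs[j])              # eigenvalues of M_j
    Mj = sigmas[j] * (Phi + hs[j] * np.eye(n))
    assert np.allclose(Z @ (d[:, None] * Z.T), Mj)
```

Once Z and w are known, applying any M_j^{-1} costs only a diagonal scaling between two multiplications by Z, which is the basis of the eigendecomposition-based algorithm shown later (CLAK-Eig).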

Page 26:

GWAS: complete problem definition

[3-D diagram: result block B of size m × t × p, indexed by SNPs (γ, δ) and traits (α, β).]

b_{γ,α} := (X_γ^T M_α^{-1} X_γ)^{-1} X_γ^T M_α^{-1} y_α
b_{γ,β} := (X_γ^T M_β^{-1} X_γ)^{-1} X_γ^T M_β^{-1} y_β
b_{δ,β} := (X_δ^T M_β^{-1} X_δ)^{-1} X_δ^T M_β^{-1} y_β

with X_δ = [X_L | X_Rδ]

Problem size

M ∈ R^{n×n}       1000 ≤ n ≤ 100k       7.5 MB – 74.5 GB
X_Ri, y_j ∈ R^n                         8 – 780 KB
b_ij ∈ R^p        3 ≤ p ≤ 20            24 – 160 bytes
Total             m ≤ 10^8, t ≤ 10^5    1.5 – 100s TB

11 / 23

Page 28:

Vision: domain-specific compiler

α, β, γ ∈ R

μ := (γ * α^{-1} * γ)^{-1} * β
⇓
compiler
⇓
code for μ := αβ / γ^2

Nov. 1954: The IBM Mathematical FORmula TRANslating System, FORTRAN

“a set of programs to accept a concise formulation of a problem and to produce automatically a solution of the problem”

12 / 23
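The scalar example above can be spot-checked numerically: (γ · α^{-1} · γ)^{-1} · β simplifies to αβ/γ^2, which is exactly the kind of rewrite the envisioned compiler performs. A stdlib-only sketch:

```python
import math
import random

# Verify the simplification mu := (gamma * alpha^{-1} * gamma)^{-1} * beta
# == alpha * beta / gamma^2 on random positive scalars.
random.seed(0)
for _ in range(100):
    alpha, beta, gamma = (random.uniform(0.1, 10.0) for _ in range(3))
    mu = (gamma * alpha**-1 * gamma)**-1 * beta
    assert math.isclose(mu, alpha * beta / gamma**2, rel_tol=1e-12)
```

The matrix version of this rewrite is harder precisely because operand properties (symmetry, positive definiteness, structure) determine which transformations are legal, which is what the linear-algebra compiler must infer.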

Page 30:

Automatic generation

Y ∈ R^{n×p}, A ∈ R^{n×n}, b ∈ R^n

v := (Y^T * A^{-1} * Y)^{-1} * b
⇓
< lin. alg. compiler >
⇓
algorithm, code

Matrix algebra

Inference of properties

Building blocks

Sequences of operations

Cost analysis

Code generation

13 / 23

Page 32:

Algorithms generated

Algorithm 1, Algorithm 2, . . . , Algorithm 20, . . .

Algorithm 1 (Cholesky-based):
  L L^T = M
  X := L^{-1} X
  S := X^T X
  G G^T = S
  y := L^{-1} y
  b := X^T y
  b := G^{-1} b
  b := G^{-T} b

Algorithm 2 (QR-based):
  L L^T = M
  X := L^{-1} X
  Q R := X
  y := L^{-1} y
  b := Q^T y
  b := R^{-1} b

Algorithm 20 (eigendecomposition-based):
  Z W Z^T = Φ
  D := (h W + (1−h) I)^{-1}
  K K^T = D
  X := Z^T X
  X := K^T X
  Q R := X
  y := L^{-1} y
  b := Q^T y
  b := R^{-1} b

14 / 23
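Algorithms 1 and 2 above compute the same b by different factorizations. A minimal NumPy transcription of both (synthetic data), checking that they agree:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Algorithm 1: Cholesky of M, then Cholesky of S = X^T X (after whitening).
L = np.linalg.cholesky(M)
Xw = np.linalg.solve(L, X)        # X := L^{-1} X
S = Xw.T @ Xw                     # S := X^T X
G = np.linalg.cholesky(S)         # G G^T = S
yw = np.linalg.solve(L, y)        # y := L^{-1} y
b1 = Xw.T @ yw                    # b := X^T y
b1 = np.linalg.solve(G, b1)       # b := G^{-1} b
b1 = np.linalg.solve(G.T, b1)     # b := G^{-T} b

# Algorithm 2: QR of the whitened X.
Q, R = np.linalg.qr(Xw)
b2 = np.linalg.solve(R, Q.T @ yw)

assert np.allclose(b1, b2)
```

Mathematically equivalent, the variants differ in cost and stability, and, as the next slide shows, the differences become decisive once the 2-D sequence of problems is taken into account.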

Page 33:

Many algorithms! Predictions?

Flop count – rough estimate

                          Alg. 1             Alg. 2             Alg. 20
Single instance (t = 1)   O(n^3)             O(n^3)             O(n^3)
2D sequence (t ≫ 1)       O(tn^3 + mtn^2)    O(tn^3 + mtn^2)    O(n^3 + mtn)

Analytic models — Roman Iakymchuk
Model-based prediction — Elmar Peise

15 / 23
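To get a feel for the table above, the asymptotic terms can be plugged in directly. A back-of-the-envelope comparison (illustrative sizes, not figures from the talk; constants dropped):

```python
# Rough flop-count comparison for the 2-D sequence (t >> 1), constants dropped.
n, m, t = 10_000, 10**6, 10**4

alg1 = t * n**3 + m * t * n**2    # Cholesky-based: a factorization per trait
alg20 = n**3 + m * t * n          # eigen-based: one O(n^3) step, then O(n) per (i, j)

speedup = alg1 / alg20
print(f"asymptotic advantage of Alg. 20: ~{speedup:.0f}x")
```

Even this crude model predicts orders of magnitude between the variants, which is why algorithm selection (and the predictions by analytic and model-based tools) matters here.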

Page 36:

Algorithm→ implementations

Operands:
X  input   100s GB – 2 TB      streaming from disk
y  input   1 – 10 GB           streaming from disk
M  input   MBs – 80 GB         read once
b  output  100s MB or 10s TB   streaming to disk

Does M fit in memory?
  YES ⇒ single node + multithreading:
    streaming HD↔CPU, double buffering, in-core implementation
  Does M fit in GPU memory?
    YES ⇒ accelerator:
      streaming HD↔CPU↔GPU, triple+double buffering, GPU implementation
  NO ⇒ distributed memory + MPI:
    partitioning + streaming HD↔CPUs, double buffering, data distribution

16 / 23
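The double-buffering pattern mentioned above overlaps I/O with computation: while block k is being processed, block k+1 is already loading. A hedged stdlib/NumPy sketch (load_block and process_block are hypothetical stand-ins for the disk read and the per-block triangular solve):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_block(k, n=1000, cols=64):
    # Stand-in for reading the k-th block of X from disk (synthetic data here).
    rng = np.random.default_rng(k)
    return rng.standard_normal((n, cols))

def process_block(Xblk, L):
    # Stand-in for the per-block work, e.g. X_blk := L^{-1} X_blk.
    return np.linalg.solve(L, Xblk)

n, nblocks = 1000, 8
A = np.random.default_rng(9).standard_normal((n, n))
L = np.linalg.cholesky(A @ A.T + n * np.eye(n))

results = []
with ThreadPoolExecutor(max_workers=1) as io:
    nxt = io.submit(load_block, 0)            # prefetch the first block
    for k in range(nblocks):
        Xblk = nxt.result()                   # wait for the current block
        if k + 1 < nblocks:
            nxt = io.submit(load_block, k + 1)  # start loading the next one
        results.append(process_block(Xblk, L))
```

With the load hidden behind the compute, the out-of-core variant can, in the best case, run at in-core speed; the experiments later in the talk confirm exactly this.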

Page 41:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

17 / 23

Page 42:

Results, t = 1 — single-trait analysis

[Left plot: time (hours) vs. sample size n (1,000 – 40,000) for EMMAX, GWFGLS, FaSTLMM, CLAK-Chol; annotated runtimes: 44 hours, 25 hours, 7 hours.]

[Right plot: time (hours) vs. number of SNPs m (10^6 – 3.6·10^7) for the same four codes; annotated runtimes: 68 hours, 6 hours.]

18 / 23

Page 43:

Results t = 1, large-scale

[Plot: single node vs. MPI; time (1 min – 1 day) vs. m (10^3 – 10^7) for HP-GWAS and OOC-HP-GWAS.]

 1  load_start first X_blk
 2  L L^T := M
 3  X_L := L^{-1} X_L ;  y := L^{-1} y
 4  copy X_L := X_L ;  y := y
 5  S_TL := X_L^T X_L ;  b_T := X_L^T y
 6  for each blk
 7      load_wait current X_blk
 8      if not last blk: load_start next X_blk
 9      set X_blk := combine(X_blk)
10      X_blk := L^{-1} X_blk
11      set X_blk := local_part(X_blk)
12      S_blk := X_blk^T X_L
13      for i in {1, . . . , m_blk/np}
14          set X_Ri := X_blk[i] ;  S_BLi := S_blk[i]
15          S_BRi := X_Ri^T X_Ri
16          b_Bi := X_Ri^T y
17          set S_i := [ S_TL  ⋆ ; S_BLi  S_BRi ],  b_i := [ b_T ; b_Bi ]
18          b_i := S_i^{-1} b_i
19          set b_blk[i] := b_i
20      end
21      if not first blk: store_wait previous b_blk
22      store_start current b_blk
23  end
24  store_wait last b_blk

Algorithm 3: Distributed-memory version of Algorithm 1. Asynchronous I/O operations are depicted in green, distributed matrices and operations in blue, and quantities that differ across processes in red.

In addition to blocking X_Ri and b_Bi, the computation of all row vectors S_BLi belonging to the current block is combined into a single matrix product (line 12), resulting in the S_BLi being stacked in a block S_blk. In line 14, S_BLi is selected from S_blk, along with X_Ri from X_blk, for the innermost loop. This loop then computes the local b_blk independently on each process. Finally, b_blk (whose columns b_i correspond to the initially loaded vectors X_Ri within X_blk) is stored asynchronously, while the next iteration commences.

4.3 Performance Results

We compile Elem-OOC, the C++ implementation of Algorithm 3, with the GNU C compiler (version 4.7.2), use Elemental (version 0.78-dev) with OpenMPI (version 1.6.4), and link to Intel's Math Kernel Library (MKL version 11.0). In our tests, we use a compute cluster with 40 nodes, each equipped with 16 GB of RAM and two quad-core Intel Harpertown E5450 processors running at 3.00 GHz. The nodes are connected via InfiniBand and access a high-speed Lustre file system.

Throughout all our experiments, we use the empirically optimal local block size m_blk/np = 256 by choosing m_blk = 256·np.

4.3.1 Processing huge numbers of SNPs out-of-core

Since Elem-OOC incorporates the double-buffering method introduced in Section 3, it can process datasets with arbitrarily large m without introducing any overhead due to I/O operations. To confirm this claim, we perform a series of experiments, using np = 8, 16, 32, and 64 cores (1, 2, 4, and 8 nodes) to solve a system of size n = 40,000 and p = 4 with increasing dataset size m. The performance of these experiments is presented in Figure 5, where the vertical lines mark the points at which the 16 GB of RAM per node are insufficient to store all m vectors X_Ri. The plot shows a very smooth behavior with m (dominated by the triangular solve in Algorithm 3, line 10) well beyond this in-core memory limit.

Figure 5: Performance of Elem-OOC as a function of m. Here, n = 40,000, p = 4, and m ranges from 2,048 to 8.2 · 10^6. The vertical lines are limits for a theoretical in-core version of the parallel algorithm imposed by the accumulated RAM sizes.

Figure 6: Performance of Elem-OOC as a function of n. p = 4, m = 65,536, and n ranges from 5,000 to 100,000. The vertical lines indicate the limits imposed by the accumulated RAM sizes.

4.3.2 Increasing the population size n

We now turn to the main goal of our effort: performing computations on systems whose matrix M ∈ R^{n×n} exceeds the capacity of the main memory. For this purpose, we use m = 65,536, p = 4 and execute Elem-OOC on np = 8, 16, 32, and 64 cores (1, 2, 4, and 8 nodes) with increasing matrix size n. Figure 6 reports the performance of these executions, which is dominated by the cubic complexity of the Cholesky factorization of M (Algorithm 3, line 2). The vertical lines indicate where the nodes' memory would be exceeded by the size of the distributed M and the buffers for X_blk. The plot shows that our implementation succeeds

19 / 23

Page 45:

GPU scalability

[Table: runtime (s) for CPU only, CPU + 1 GPU, and CPU + 2 GPUs, for m from 1,000 to 100,000.]

[Plots: runtime (s) vs. m (SNP count) for CPU only, CPU + 1 GPU, CPU + 2 GPUs, with the in-core → out-of-core transition marked.]

[Plot: runtime (s) vs. number of GPUs (1–4): measured 40.7 s, 21.6 s, 16.2 s, 11.7 s, against ideal scalability 40.7 s, 20.4 s, 13.6 s, 10.2 s.]

20 / 23

Page 47:

Results, t ≫ 1 — multi-trait analysis, "OMICS" data

[Left plot: time (years) vs. number of traits t (10^4 – 10^5) for EMMAX, FaSTLMM, GWFGLS, CLAK-Eig; annotated runtimes: 14 hours, 20 months, 26 months, 4.58 years.]

[Right plot: ratio over CLAK-Eig vs. number of traits t (10^0 – 10^5): EMMAX 2789×, FaST-LMM 1352×, GWFGLS 1012×, CLAK-Eig 1×.]

21 / 23

Page 48:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

22 / 23

Page 49:

Final comments

HPC's perspective: in-core efficiency. How to sustain efficiency?

App's perspective: many data formats; missing data, bogus data; output data and post-processing?; new features

HUGE gap: algorithm ↔ optimized implementation (data management, parallelism)

Development cycle: several months!

How to deal with BIG problems? Expose knowledge, exploit knowledge.

23 / 23
