Transcript of "High-performance and automatic computing"
hpac.rwth-aachen.de/~pauldj/talks/Frankfurt13.pdf
Paolo Bientinesi, AICES, RWTH Aachen — pauldj@aices.rwth-aachen.de

Page 1:

High-performance and automatic computing

Paolo Bientinesi

AICES, RWTH Aachen — pauldj@aices.rwth-aachen.de

October 22nd, 2013 — Goethe Universität Frankfurt am Main

1 / 23

Page 2:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

2 / 23

Page 3:

Overview

High-performance computing: numerical computations, parallel architectures → time-to-solution, efficiency, scalability, . . .

Automatic computing: optimization, search, but also derivation & deduction → range of algorithms, productivity

Applications → application-specific properties & needs, large scale, full code

3 / 23

Page 7:

hpac.rwth-aachen.de – group overview

Methods & applications — P. Bientinesi

E. Di Napoli Electronic structure calculations

M. Petschow Parallel eigensolvers

D. Fabregat Automation, Computational biology

D. Tameling — Molecular dynamics

E. Peise Performance modeling & prediction

L. Beyer Density Functional Theory

F. Kürten, Y. Madzhunkov, A. Frank, P. Springer, . . .

D. Fabregat

E. Peise

Y. Aulchenko

4 / 23

Page 8:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

5 / 23

Page 9:

Genome-Wide Association Studies

Source: David Hall

Page 11:

Genome-Wide Association Studies

Yurii Paolo

“Mixed models” ???

Linear regression with non-independent outcomes ???

Generalized least-squares problems . . .

b := (X^T M^{-1} X)^{-1} X^T M^{-1} y

Inputs: M ∈ R^{n×n}, X ∈ R^{n×p}, y ∈ R^n
Output: b ∈ R^p

To be repeated millions of times.

7 / 23

Page 16:

Mixed models:  b := (X^T M^{-1} X)^{-1} X^T M^{-1} y

Genome-wide association analysis:

y: phenotype (outcome; vector of observations). E.g.: height, blood pressure for a set of people

X: genome measurements and covariates (design matrix; predictors). E.g.: sex and age over height

M: dependencies between observations. E.g.: tall parents have tall children

b: relation between a variation in the outcome (y) and a variation in the genome sequence (X)

8 / 23
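The GLS formula above can be sketched numerically. The following is a minimal NumPy illustration on synthetic data (not the talk's actual implementation): it computes b via a Cholesky factorization of M instead of forming M^{-1} explicitly, and checks the result against the textbook formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

# Synthetic GLS instance: M symmetric positive definite, X design matrix, y outcome.
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# b := (X^T M^{-1} X)^{-1} X^T M^{-1} y, via M = L L^T ("whitening" X and y).
L = np.linalg.cholesky(M)
Xw = np.linalg.solve(L, X)      # Xw = L^{-1} X
yw = np.linalg.solve(L, y)      # yw = L^{-1} y
b = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Sanity check against the formula with an explicit inverse.
Minv = np.linalg.inv(M)
b_ref = np.linalg.solve(X.T @ Minv @ X, X.T @ Minv @ y)
assert np.allclose(b, b_ref)
```

Avoiding the explicit inverse is both cheaper and numerically safer; the same Cholesky-based structure reappears in the generated algorithms later in the talk.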

Page 17:

Stats

n: Population size

9 / 23

Page 18:

Stats

m: Number of SNPs

9 / 23

Page 19:

Problem definition (1)

b := (X^T M^{-1} X)^{-1} X^T M^{-1} y,   "to be repeated millions of times"
⇓
b_i := (X_i^T M_i^{-1} X_i)^{-1} X_i^T M_i^{-1} y_i,   for i = 1, . . . , m
⇓
Problem size

M_i ∈ R^{n×n}    1000 ≤ n ≤ 20k+     7.5 MB – 3 GB
X_i ∈ R^{n×p}    3 ≤ p ≤ 20          30 – 625 KB
y_i ∈ R^n                            8 – 780 KB
b_i ∈ R^p                            24 – 160 bytes
Total            10^6 ≤ m ≤ 10^8     7.5 – 3000 TB

10 / 23

Page 22:

Problem definition (2)

b_i := (X_i^T M_i^{-1} X_i)^{-1} X_i^T M_i^{-1} y_i
⇓
b_i := (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y,   with X_i = [X_L | X_Ri],   for i = 1, . . . , m
⇓
Problem size

M ∈ R^{n×n}      1000 ≤ n ≤ 100k     7.5 MB – 74.5 GB
X_Ri ∈ R^n                           8 – 780 KB
b_i ∈ R^p                            24 – 160 bytes
Total            10^6 ≤ m ≤ 10^8     74 GB – 7 TB

10 / 23
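The structural point of this reformulation is that M and y are shared across all m solves, so the expensive factorization can be amortized. A hedged NumPy sketch of that idea (synthetic data, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, pL, m = 300, 3, 50

A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
XL = rng.standard_normal((n, pL))   # covariates X_L shared by all SNPs
XR = rng.standard_normal((n, m))    # one extra column X_Ri per SNP
y = rng.standard_normal(n)

# Factor M = L L^T once; whiten X_L, all X_Ri, and y a single time.
# Each per-SNP problem then reduces to a tiny (pL+1) x (pL+1) solve.
L = np.linalg.cholesky(M)
XLw = np.linalg.solve(L, XL)
XRw = np.linalg.solve(L, XR)
yw = np.linalg.solve(L, y)

bs = []
for i in range(m):
    Xi = np.column_stack([XLw, XRw[:, i]])   # whitened X_i = [X_L | X_Ri]
    bs.append(np.linalg.solve(Xi.T @ Xi, Xi.T @ yw))
```

The loop body costs O(n·p) per SNP instead of O(n^3), which is what makes millions of repetitions feasible.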

Page 24:

Problem definition (3)

b_i := (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y
⇓
b_ij := (X_i^T M_j^{-1} X_i)^{-1} X_i^T M_j^{-1} y_j,   with M_j = σ_j(Φ + h_j I),
for i = 1, . . . , m and j = 1, . . . , t

Moreover, either t = 1 or t ≤ 10^5.

10 / 23
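The structure M_j = σ_j(Φ + h_j I) means all t matrices share the eigenvectors of Φ: with Φ = Z diag(w) Z^T, each M_j = Z · σ_j(diag(w) + h_j I) · Z^T, so a single eigendecomposition serves every trait. A small NumPy check of this identity (random data, assumed sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 200, 5

Phi = rng.standard_normal((n, n))
Phi = (Phi + Phi.T) / 2                      # symmetric kinship-like matrix
sigmas = rng.uniform(0.5, 2.0, t)
hs = rng.uniform(0.1, 0.9, t)

# One eigendecomposition of Phi covers all t shifted/scaled matrices M_j.
w, Z = np.linalg.eigh(Phi)
for j in range(t):
    d = sigmas[j] * (w + hs[j])              # eigenvalues of M_j
    Mj = sigmas[j] * (Phi + hs[j] * np.eye(n))
    assert np.allclose(Z @ (d[:, None] * Z.T), Mj)
```

Once Z and w are known, applying any M_j^{-1} costs only a diagonal scaling between two multiplications by Z, which is the basis of the eigendecomposition-based algorithm shown later (CLAK-Eig).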

Page 26:

GWAS: complete problem definition

[3-D diagram: result block B of size m × t × p, indexed by SNPs (γ, δ) and traits (α, β).]

b_{γ,α} := (X_γ^T M_α^{-1} X_γ)^{-1} X_γ^T M_α^{-1} y_α
b_{γ,β} := (X_γ^T M_β^{-1} X_γ)^{-1} X_γ^T M_β^{-1} y_β
b_{δ,β} := (X_δ^T M_β^{-1} X_δ)^{-1} X_δ^T M_β^{-1} y_β

with X_δ = [X_L | X_Rδ]

Problem size

M ∈ R^{n×n}       1000 ≤ n ≤ 100k       7.5 MB – 74.5 GB
X_Ri, y_j ∈ R^n                         8 – 780 KB
b_ij ∈ R^p        3 ≤ p ≤ 20            24 – 160 bytes
Total             m ≤ 10^8, t ≤ 10^5    1.5 – 100s TB

11 / 23

Page 28:

Vision: domain-specific compiler

α, β, γ ∈ R

μ := (γ * α^{-1} * γ)^{-1} * β
⇓
compiler
⇓
code for μ := αβ / γ^2

Nov. 1954: The IBM Mathematical FORmula TRANslating System, FORTRAN

“a set of programs to accept a concise formulation of a problem and to produce automatically a solution of the problem”

12 / 23
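The scalar example above can be spot-checked numerically: (γ · α^{-1} · γ)^{-1} · β simplifies to αβ/γ^2, which is exactly the kind of rewrite the envisioned compiler performs. A stdlib-only sketch:

```python
import math
import random

# Verify the simplification mu := (gamma * alpha^{-1} * gamma)^{-1} * beta
# == alpha * beta / gamma^2 on random positive scalars.
random.seed(0)
for _ in range(100):
    alpha, beta, gamma = (random.uniform(0.1, 10.0) for _ in range(3))
    mu = (gamma * alpha**-1 * gamma)**-1 * beta
    assert math.isclose(mu, alpha * beta / gamma**2, rel_tol=1e-12)
```

The matrix version of this rewrite is harder precisely because operand properties (symmetry, positive definiteness, structure) determine which transformations are legal, which is what the linear-algebra compiler must infer.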

Page 30:

Automatic generation

Y ∈ R^{n×p}, A ∈ R^{n×n}, b ∈ R^n

v := (Y^T * A^{-1} * Y)^{-1} * b
⇓
< lin. alg. compiler >
⇓
algorithm, code

Matrix algebra

Inference of properties

Building blocks

Sequences of operations

Cost analysis

Code generation

13 / 23

Page 32:

Algorithms generated

Algorithm 1, Algorithm 2, . . . , Algorithm 20, . . .

Algorithm 1 (Cholesky-based):
  L L^T = M
  X := L^{-1} X
  S := X^T X
  G G^T = S
  y := L^{-1} y
  b := X^T y
  b := G^{-1} b
  b := G^{-T} b

Algorithm 2 (QR-based):
  L L^T = M
  X := L^{-1} X
  Q R := X
  y := L^{-1} y
  b := Q^T y
  b := R^{-1} b

Algorithm 20 (eigendecomposition-based):
  Z W Z^T = Φ
  D := (h W + (1−h) I)^{-1}
  K K^T = D
  X := Z^T X
  X := K^T X
  Q R := X
  y := L^{-1} y
  b := Q^T y
  b := R^{-1} b

14 / 23
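Algorithms 1 and 2 above compute the same b by different factorizations. A minimal NumPy transcription of both (synthetic data), checking that they agree:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Algorithm 1: Cholesky of M, then Cholesky of S = X^T X (after whitening).
L = np.linalg.cholesky(M)
Xw = np.linalg.solve(L, X)        # X := L^{-1} X
S = Xw.T @ Xw                     # S := X^T X
G = np.linalg.cholesky(S)         # G G^T = S
yw = np.linalg.solve(L, y)        # y := L^{-1} y
b1 = Xw.T @ yw                    # b := X^T y
b1 = np.linalg.solve(G, b1)       # b := G^{-1} b
b1 = np.linalg.solve(G.T, b1)     # b := G^{-T} b

# Algorithm 2: QR of the whitened X.
Q, R = np.linalg.qr(Xw)
b2 = np.linalg.solve(R, Q.T @ yw)

assert np.allclose(b1, b2)
```

Mathematically equivalent, the variants differ in cost and stability, and, as the next slide shows, the differences become decisive once the 2-D sequence of problems is taken into account.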

Page 33:

Many algorithms! Predictions?

Flop count – rough estimate

                          Alg. 1             Alg. 2             Alg. 20
Single instance (t = 1)   O(n^3)             O(n^3)             O(n^3)
2D sequence (t ≫ 1)       O(tn^3 + mtn^2)    O(tn^3 + mtn^2)    O(n^3 + mtn)

Analytic models — Roman Iakymchuk
Model-based prediction — Elmar Peise

15 / 23
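To get a feel for the table above, the asymptotic terms can be plugged in directly. A back-of-the-envelope comparison (illustrative sizes, not figures from the talk; constants dropped):

```python
# Rough flop-count comparison for the 2-D sequence (t >> 1), constants dropped.
n, m, t = 10_000, 10**6, 10**4

alg1 = t * n**3 + m * t * n**2    # Cholesky-based: a factorization per trait
alg20 = n**3 + m * t * n          # eigen-based: one O(n^3) step, then O(n) per (i, j)

speedup = alg1 / alg20
print(f"asymptotic advantage of Alg. 20: ~{speedup:.0f}x")
```

Even this crude model predicts orders of magnitude between the variants, which is why algorithm selection (and the predictions by analytic and model-based tools) matters here.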

Page 36:

Algorithm→ implementations

Operands:
X  input   100s GB – 2 TB      streaming from disk
y  input   1 – 10 GB           streaming from disk
M  input   MBs – 80 GB         read once
b  output  100s MB or 10s TB   streaming to disk

Does M fit in memory?
  YES ⇒ single node + multithreading:
    streaming HD↔CPU, double buffering, in-core implementation
  Does M fit in GPU memory?
    YES ⇒ accelerator:
      streaming HD↔CPU↔GPU, triple+double buffering, GPU implementation
  NO ⇒ distributed memory + MPI:
    partitioning + streaming HD↔CPUs, double buffering, data distribution

16 / 23
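The double-buffering pattern mentioned above overlaps I/O with computation: while block k is being processed, block k+1 is already loading. A hedged stdlib/NumPy sketch (load_block and process_block are hypothetical stand-ins for the disk read and the per-block triangular solve):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_block(k, n=1000, cols=64):
    # Stand-in for reading the k-th block of X from disk (synthetic data here).
    rng = np.random.default_rng(k)
    return rng.standard_normal((n, cols))

def process_block(Xblk, L):
    # Stand-in for the per-block work, e.g. X_blk := L^{-1} X_blk.
    return np.linalg.solve(L, Xblk)

n, nblocks = 1000, 8
A = np.random.default_rng(9).standard_normal((n, n))
L = np.linalg.cholesky(A @ A.T + n * np.eye(n))

results = []
with ThreadPoolExecutor(max_workers=1) as io:
    nxt = io.submit(load_block, 0)            # prefetch the first block
    for k in range(nblocks):
        Xblk = nxt.result()                   # wait for the current block
        if k + 1 < nblocks:
            nxt = io.submit(load_block, k + 1)  # start loading the next one
        results.append(process_block(Xblk, L))
```

With the load hidden behind the compute, the out-of-core variant can, in the best case, run at in-core speed; the experiments later in the talk confirm exactly this.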

Page 41:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

17 / 23

Page 42:

Results, t = 1 — single-trait analysis

[Left plot: time (hours) vs. sample size n (1,000 – 40,000) for EMMAX, GWFGLS, FaSTLMM, CLAK-Chol; annotated runtimes: 44 hours, 25 hours, 7 hours.]

[Right plot: time (hours) vs. number of SNPs m (10^6 – 3.6·10^7) for the same four codes; annotated runtimes: 68 hours, 6 hours.]

18 / 23

Page 43:

Results t = 1, large-scale

[Plot: single node vs. MPI; time (1 min – 1 day) vs. m (10^3 – 10^7) for HP-GWAS and OOC-HP-GWAS.]

 1  load_start first X_blk
 2  L L^T := M
 3  X_L := L^{-1} X_L ;  y := L^{-1} y
 4  copy X_L := X_L ;  y := y
 5  S_TL := X_L^T X_L ;  b_T := X_L^T y
 6  for each blk
 7      load_wait current X_blk
 8      if not last blk: load_start next X_blk
 9      set X_blk := combine(X_blk)
10      X_blk := L^{-1} X_blk
11      set X_blk := local_part(X_blk)
12      S_blk := X_blk^T X_L
13      for i in {1, . . . , m_blk/np}
14          set X_Ri := X_blk[i] ;  S_BLi := S_blk[i]
15          S_BRi := X_Ri^T X_Ri
16          b_Bi := X_Ri^T y
17          set S_i := [ S_TL  ⋆ ; S_BLi  S_BRi ],  b_i := [ b_T ; b_Bi ]
18          b_i := S_i^{-1} b_i
19          set b_blk[i] := b_i
20      end
21      if not first blk: store_wait previous b_blk
22      store_start current b_blk
23  end
24  store_wait last b_blk

Algorithm 3: Distributed-memory version of Algorithm 1. Asynchronous I/O operations are depicted in green, distributed matrices and operations in blue, and quantities that differ across processes in red.

In addition to blocking X_Ri and b_Bi, the computation of all row vectors S_BLi belonging to the current block is combined into a single matrix product (line 12), resulting in the S_BLi being stacked in a block S_blk. In line 14, S_BLi is selected from S_blk, along with X_Ri from X_blk, for the innermost loop. This loop then computes the local b_blk independently on each process. Finally, b_blk (whose columns b_i correspond to the initially loaded vectors X_Ri within X_blk) is stored asynchronously, while the next iteration commences.

4.3 Performance Results

We compile Elem-OOC, the C++ implementation of Algorithm 3, with the GNU C compiler (version 4.7.2), use Elemental (version 0.78-dev) with OpenMPI (version 1.6.4), and link to Intel's Math Kernel Library (MKL version 11.0). In our tests, we use a compute cluster with 40 nodes, each equipped with 16 GB of RAM and two quad-core Intel Harpertown E5450 processors running at 3.00 GHz. The nodes are connected via InfiniBand and access a high-speed Lustre file system.

Throughout all our experiments, we use the empirically optimal local block size m_blk/np = 256 by choosing m_blk = 256·np.

4.3.1 Processing huge numbers of SNPs out-of-core

Since Elem-OOC incorporates the double-buffering method introduced in Section 3, it can process datasets with arbitrarily large m without introducing any overhead due to I/O operations. To confirm this claim, we perform a series of experiments, using np = 8, 16, 32, and 64 cores (1, 2, 4, and 8 nodes) to solve a system of size n = 40,000 and p = 4 with increasing dataset size m. The performance of these experiments is presented in Figure 5, where the vertical lines mark the points at which the 16 GB of RAM per node are insufficient to store all m vectors X_Ri. The plot shows a very smooth behavior with m (dominated by the triangular solve in Algorithm 3, line 10) well beyond this in-core memory limit.

Figure 5: Performance of Elem-OOC as a function of m. Here, n = 40,000, p = 4, and m ranges from 2,048 to 8.2 · 10^6. The vertical lines are limits for a theoretical in-core version of the parallel algorithm imposed by the accumulated RAM sizes.

Figure 6: Performance of Elem-OOC as a function of n. p = 4, m = 65,536, and n ranges from 5,000 to 100,000. The vertical lines indicate the limits imposed by the accumulated RAM sizes.

4.3.2 Increasing the population size n

We now turn to the main goal of our effort: performing computations on systems whose matrix M ∈ R^{n×n} exceeds the capacity of the main memory. For this purpose, we use m = 65,536, p = 4 and execute Elem-OOC on np = 8, 16, 32, and 64 cores (1, 2, 4, and 8 nodes) with increasing matrix size n. Figure 6 reports the performance of these executions, which is dominated by the cubic complexity of the Cholesky factorization of M (Algorithm 3, line 2). The vertical lines indicate where the nodes' memory would be exceeded by the size of the distributed M and the buffers for X_blk. The plot shows that our implementation succeeds

19 / 23

Page 45:

GPU scalability

[Table: runtime (s) for CPU only, CPU + 1 GPU, and CPU + 2 GPUs, for m from 1,000 to 100,000.]

[Plots: runtime (s) vs. m (SNP count) for CPU only, CPU + 1 GPU, CPU + 2 GPUs, with the in-core → out-of-core transition marked.]

[Plot: runtime (s) vs. number of GPUs (1–4): measured 40.7 s, 21.6 s, 16.2 s, 11.7 s, against ideal scalability 40.7 s, 20.4 s, 13.6 s, 10.2 s.]

20 / 23

Page 47:

Results, t ≫ 1 — multi-trait analysis, "OMICS" data

[Left plot: time (years) vs. number of traits t (10^4 – 10^5) for EMMAX, FaSTLMM, GWFGLS, CLAK-Eig; annotated runtimes: 14 hours, 20 months, 26 months, 4.58 years.]

[Right plot: ratio over CLAK-Eig vs. number of traits t (10^0 – 10^5): EMMAX 2789×, FaST-LMM 1352×, GWFGLS 1012×, CLAK-Eig 1×.]

21 / 23

Page 48:

1 HPAC: Introduction

2 Application: Genome-Wide Association Studies

3 Performance experiments

4 Conclusions

22 / 23

Page 49:

Final comments

HPC's perspective: in-core efficiency. How to sustain efficiency?

App's perspective: many data formats; missing data, bogus data; output data and post-processing?; new features

HUGE gap: algorithm ↔ optimized implementation (data management, parallelism)

Development cycle: several months!

How to deal with BIG problems? Expose knowledge, exploit knowledge.

23 / 23
