
Page 1:

Erin C. Carson

Seminar of Numerical Mathematics

Katedra numerické matematiky, Matematicko-fyzikální fakulta, Univerzita Karlova

November 15, 2018

Sparse Matrix Computations in the Exascale Era

This research was partially supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495


Page 6:

Exascale Computing: The Modern Space Race

• "Exascale": 10^18 floating point operations per second

• Will enable new frontiers in science and engineering:
  • environment and climate
  • materials, manufacturing, design
  • healthcare, biology, biomedicine
  • cosmology and astrophysics
  • high-energy physics

• Advancing knowledge, addressing social challenges, improving quality of life, influencing policy, economic competitiveness

• Much research investment toward achieving exascale within 5-10 years
  • EuroHPC declaration (2017): €1 billion investment in building exascale infrastructure by 2023

• Challenges at all levels: from hardware, to methods and algorithms, to applications

"Nothing tends so much to the advancement of knowledge as the application of a new instrument."
- Sir Humphry Davy


Page 11:

Exascale System Projections

                            Today's Systems    Predicted Exascale Systems*    Factor Improvement
  System Peak               10^16 flops/s      10^18 flops/s                  100
  Node Memory Bandwidth     10^2 GB/s          10^3 GB/s                      10
  Interconnect Bandwidth    10^1 GB/s          10^2 GB/s                      10
  Memory Latency            10^-7 s            5·10^-8 s                      2
  Interconnect Latency      10^-6 s            5·10^-7 s                      2

*Sources: from P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)

• Movement of data (communication) is much more expensive than floating point operations (computation), in terms of both time and energy

• Gaps will only grow larger

• Reducing time spent moving data/waiting for data will be essential for applications at exascale!

[Figure: node architecture schematic - CPUs with caches attached to DRAM]
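A crude back-of-envelope calculation makes the trend concrete. The sketch below (illustrative Python, figures taken from the table above) mixes a system-level peak with a node-level bandwidth figure, so the absolute numbers are not meaningful, only the direction: the flops available per byte moved grow by roughly 10x, so algorithms must do ever more computation per unit of data movement.

```python
# Crude flops-per-byte estimate implied by the table above (assumed figures).
peak = {"today": 1e16, "exascale": 1e18}                    # system peak, flop/s
node_mem_bw = {"today": 1e2 * 1e9, "exascale": 1e3 * 1e9}   # node memory BW, bytes/s

for era in ("today", "exascale"):
    balance = peak[era] / node_mem_bw[era]
    print(f"{era}: {balance:.0e} flops per byte of memory traffic")
```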


Page 13:

Iterative Solvers

• Iterative solvers used when:
  • 𝐴 is very large, very sparse
  • 𝐴 is represented implicitly
  • only an approximate answer is required
  • solving nonlinear equations

• Focus: iterative solvers for sparse linear systems 𝐴𝑥 = 𝑏 and eigenvalue problems 𝐴𝑥 = 𝜆𝑥

[Flowchart: initial guess → refine solution → convergence to sufficient accuracy? → if yes, return solution; if no, refine again]


Page 16:

Krylov Subspace Methods

Krylov subspace method: projection process onto the Krylov subspace
$$\mathcal{K}_i(A, r_0) = \operatorname{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{i-1} r_0\},$$
where $A$ is an $N \times N$ matrix and $r_0$ is a length-$N$ vector.

In each iteration:

• Add a dimension to the Krylov subspace
  - forms a nested sequence of Krylov subspaces $\mathcal{K}_1(A, r_0) \subset \mathcal{K}_2(A, r_0) \subset \cdots \subset \mathcal{K}_i(A, r_0)$

• Orthogonalize (with respect to some $\mathcal{C}_i$)

• Linear systems: select the approximate solution $x_i \in x_0 + \mathcal{K}_i(A, r_0)$ using $r_i = b - Ax_i \perp \mathcal{C}_i$

[Figure: geometric sketch of the projection: $r_0$ split into the correction $A\delta$ and the new residual $r_{\text{new}}$, with $r_{\text{new}}$ orthogonal to $\mathcal{C}$]

Conjugate gradient method: $A$ is symmetric positive definite, $\mathcal{C}_i = \mathcal{K}_i(A, r_0)$:
$$r_i \perp \mathcal{K}_i(A, r_0) \iff \|x - x_i\|_A = \min_{z \in x_0 + \mathcal{K}_i(A, r_0)} \|x - z\|_A \implies \boldsymbol{r_{N+1} = 0}$$
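To make the definition concrete, here is a minimal Python sketch (names are illustrative, not from the slides) that generates a basis of $\mathcal{K}_i(A, r_0)$ by repeated multiplication, which is exactly the sequence of SpMVs an implementation performs:

```python
import numpy as np

def krylov_basis(A, r0, i):
    """Columns span K_i(A, r0) = span{r0, A r0, ..., A^{i-1} r0}.
    In practice the vectors are orthogonalized as they are generated
    (e.g., by Lanczos for symmetric A); the raw monomial sequence shown
    here becomes numerically ill-conditioned quickly."""
    V = np.zeros((len(r0), i))
    V[:, 0] = r0
    for j in range(1, i):
        V[:, j] = A @ V[:, j - 1]   # one SpMV per new dimension
    return V
```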

Page 17:

Krylov Subspace Methods in the Wild

• Climate Modeling
• Computational Cosmology (Dark Matter Simulation, Almgren et al., LBNL)
• Medical Treatment
• Computer Vision
• Power Grid Modeling
• Chemical Engineering (Low-Emission Combustion Simulation, CCSE, LBNL)
• Financial Portfolio Optimization
• Latent Semantic Analysis


Page 21:

Conjugate Gradient on the World's Fastest Computer

Summit - IBM Power System AC922 (current #1 on the TOP500)

  Site: Oak Ridge National Laboratory
  Manufacturer: IBM
  Cores: 2,282,544
  Memory: 2,801,664 GB
  Processor: IBM POWER9 22C 3.07 GHz
  Interconnect: dual-rail Mellanox EDR InfiniBand

Performance:

  Theoretical peak: 187,659 TFlops/s
  LINPACK benchmark (dense 𝐴𝑥 = 𝑏, direct): 122,300 TFlops/s (65% efficiency)
  HPCG benchmark (sparse 𝐴𝑥 = 𝑏, iterative): 2,926 TFlops/s (1.5% efficiency)

Page 22:

The Conjugate Gradient (CG) Method

Iteration loop structure: sparse matrix × vector → inner products → vector updates → inner products → vector updates.

    r_0 = b - A x_0,  p_0 = r_0
    for i = 1:nmax
        α_{i-1} = (r_{i-1}^T r_{i-1}) / (p_{i-1}^T A p_{i-1})
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        r_i = r_{i-1} - α_{i-1} A p_{i-1}
        β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})
        p_i = r_i + β_i p_{i-1}
    end
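A direct transcription of this pseudocode into Python (a minimal serial sketch; `A` is assumed symmetric positive definite, and the tolerance-based stopping test is an illustrative addition, not part of the slide's pseudocode):

```python
import numpy as np

def hscg(A, b, x0, nmax=1000, tol=1e-12):
    """Hestenes-Stiefel CG as in the pseudocode above."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rTr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(nmax):
        Ap = A @ p                # SpMV (neighbor communication in parallel)
        alpha = rTr / (p @ Ap)    # inner product -> one global synchronization
        x = x + alpha * p
        r = r - alpha * Ap
        rTr_new = r @ r           # inner product -> second global synchronization
        if np.sqrt(rTr_new) <= tol * bnorm:
            break
        beta = rTr_new / rTr
        p = r + beta * p
        rTr = rTr_new
    return x
```

Each iteration performs exactly one SpMV and two separate inner-product synchronizations, which is the cost breakdown examined on the next slide.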



Page 30:

Cost Per Iteration

Sparse matrix-vector multiplication (SpMV):
• 𝑂(nnz) flops
• must communicate vector entries with neighboring processors (nearest-neighbor MPI collective)

Inner products:
• 𝑂(𝑁) flops
• global synchronization (MPI_Allreduce): all processors must exchange data and wait for all communication to finish before proceeding

Low computation/communication ratio ⇒ performance is communication-bound
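The inner-product pattern looks like this in mpi4py (a minimal sketch; the row partitioning and names are illustrative assumptions, not from the slides):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def distributed_dot(x_local, y_local):
    """Inner product of row-partitioned vectors: O(N) flops in total, but
    every rank must enter the allreduce before any rank can proceed."""
    return comm.allreduce(np.dot(x_local, y_local), op=MPI.SUM)
```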


Page 34:

Synchronization-reducing variants

Communication cost has motivated many approaches to reducing synchronization in CG:

• Pipelined Krylov subspace methods
  • use modified coefficients and auxiliary vectors to reduce synchronization points to 1 per iteration
  • modifications also allow decoupling of matrix-vector products and inner products, which enables overlapping

• s-step Krylov subspace methods
  • compute iterations in blocks of s using a different Krylov subspace basis
  • enables one synchronization per s iterations

Both approaches are mathematically equivalent to classical CG.


Page 36:

The effects of finite precision

Well-known that roundoff error has two effects:

1. Delay of convergence
   • no longer have an exact Krylov subspace
   • computed basis can become numerically rank deficient
   • residuals no longer orthogonal, so the minimization of $\|x - x_i\|_A$ is no longer exact

2. Loss of attainable accuracy
   • rounding errors cause the true residual $b - A\hat{x}_i$ and the updated residual $\hat{r}_i$ to deviate!

Much work exists on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.

[Figure: convergence of CG in double precision vs. exact CG. 𝐴: bcsstk03 from SuiteSparse; 𝑏: equal components in the eigenbasis of 𝐴, ‖𝑏‖ = 1; 𝑁 = 112, 𝜅(𝐴) ≈ 7e6]


Page 38:

Optimizing high performance iterative solvers

• Synchronization-reducing variants are designed to reduce the time per iteration

• But this is not the whole story!

• What we really want to minimize is the runtime, subject to some constraint on accuracy:

  runtime = (time/iteration) × (# iterations)

• Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy

• Crucial that we understand and take into account how algorithm modifications will affect the convergence rate and attainable accuracy!


Page 43:

Maximum attainable accuracy

• Accuracy $\|x - \hat{x}_i\|$ is generally not computable, but $x - \hat{x}_i = A^{-1}(b - A\hat{x}_i)$

• Size of the true residual, $\|b - A\hat{x}_i\|$, used as a computable measure of accuracy

• Rounding errors cause the true residual, $b - A\hat{x}_i$, and the updated residual, $\hat{r}_i$, to deviate

• Writing $b - A\hat{x}_i = \hat{r}_i + (b - A\hat{x}_i - \hat{r}_i)$,
$$\|b - A\hat{x}_i\| \le \|\hat{r}_i\| + \|b - A\hat{x}_i - \hat{r}_i\|$$

• As $\hat{r}_i \to 0$, $\|b - A\hat{x}_i\|$ depends on $\|b - A\hat{x}_i - \hat{r}_i\|$

• Many results on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997); Sleijpen, van der Vorst, and Fokkema (1994); Sleijpen, van der Vorst, and Modersitzki (2001); Björck, Elfving, and Strakoš (1998); Gutknecht and Strakoš (2000).


Page 49:

Maximum attainable accuracy of HSCG

• In finite precision HSCG, iterates are updated by
$$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \boldsymbol{\delta x_i} \quad\text{and}\quad \hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1}A\hat{p}_{i-1} - \boldsymbol{\delta r_i}$$

• Let $f_i \equiv b - A\hat{x}_i - \hat{r}_i$. Then
$$f_i = b - A(\hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i) - (\hat{r}_{i-1} - \hat{\alpha}_{i-1}A\hat{p}_{i-1} - \delta r_i) = f_{i-1} + A\,\delta x_i + \delta r_i = f_0 + \sum_{m=1}^{i}(A\,\delta x_m + \delta r_m)$$

Bounds on the residual gap:
$$\|f_i\| \le O(\varepsilon)\,\|A\|\left(\|x\| + \max_{m=0,\ldots,i}\|\hat{x}_m\|\right) \qquad \text{[Greenbaum, 1997]}$$
$$\|f_i\| \le O(\varepsilon)\sum_{m=0}^{i}\left(N_A\,\|A\|\,\|\hat{x}_m\| + \|\hat{r}_m\|\right) \qquad \text{[van der Vorst and Ye, 2000]}$$
$$\|f_i\| \le O(\varepsilon)\,N_A\,\|A\|\,\|A^{-1}\|\sum_{m=0}^{i}\|\hat{r}_m\| \qquad \text{[Sleijpen and van der Vorst, 1995]}$$
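The gap $f_i$ is easy to observe numerically. A minimal sketch (illustrative, not from the slides) that tracks $\|b - A\hat{x}_i - \hat{r}_i\|$ alongside the updated residual during the HSCG iteration shown earlier:

```python
import numpy as np

def cg_residual_gap(A, b, x0, nmax):
    """Run HSCG and record the updated residual norm and the gap
    ||f_i|| = ||b - A x_i - r_i||; the gap levels off near the attainable
    accuracy while ||r_i|| keeps decreasing."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rTr = r @ r
    updated, gaps = [], []
    for _ in range(nmax):
        Ap = A @ p
        alpha = rTr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rTr_new = r @ r
        beta = rTr_new / rTr
        rTr = rTr_new
        p = r + beta * p
        updated.append(np.sqrt(rTr))
        gaps.append(np.linalg.norm(b - A @ x - r))   # ||f_i||
    return updated, gaps
```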


Page 54:

Pipelined CG (GVCG)

• Overall idea: use auxiliary recurrences and modified formulas for the recurrence coefficients 𝛼ᵢ and 𝛽ᵢ to reduce/decouple synchronization points

• Long history of related work:
  • modified recurrence coefficient computation: Johnson [1983, 1984], van Rosendale [1983, 1984], Saad [1985]
  • CG with two 3-term recurrences (STCG) [Stiefel, 1952/53]; analyzed by Gutknecht and Strakoš [2000]

• Approach of Chronopoulos and Gear [1989]
  • uses auxiliary vector 𝑠ᵢ ≡ 𝐴𝑝ᵢ and a different computation of 𝛼ᵢ to reduce the number of synchronizations per iteration from 2 to 1

• Pipelined CG of Ghysels and Vanroose [2014]
  • uses 3 auxiliary vectors: 𝐴𝑝ᵢ, 𝐴𝑟ᵢ, and 𝐴²𝑟ᵢ
  • removes the sequential dependency between matrix-vector products and inner products
  • computations can then be overlapped using nonblocking (asynchronous) communication ⇒ hides the latency of global communication


Page 61:

GVCG (Ghysels and Vanroose 2014)

Iteration loop structure: inner products overlapped with the SpMV (and the preconditioner application, in the preconditioned variant), then vector updates.

    r_0 = b - A x_0,  p_0 = r_0
    s_0 = A p_0,  w_0 = A r_0,  z_0 = A w_0,  α_0 = (r_0^T r_0) / (p_0^T s_0)
    for i = 1:nmax
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        r_i = r_{i-1} - α_{i-1} s_{i-1}
        w_i = w_{i-1} - α_{i-1} z_{i-1}
        q_i = A w_i
        β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})
        α_i = (r_i^T r_i) / (w_i^T r_i - (β_i / α_{i-1}) r_i^T r_i)
        p_i = r_i + β_i p_{i-1}
        s_i = w_i + β_i s_{i-1}
        z_i = q_i + β_i z_{i-1}
    end
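A serial Python transcription of GVCG (a sketch; in an MPI implementation the two dot products form one nonblocking allreduce, e.g. MPI_Iallreduce, overlapped with the SpMV q = A w):

```python
import numpy as np

def gvcg(A, b, x0, nmax=1000):
    """Ghysels-Vanroose pipelined CG, following the pseudocode above."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rTr = r @ r
    alpha = rTr / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        q = A @ w                    # SpMV; overlapped with the allreduce in parallel
        rTr_new = r @ r              # both dot products fit in
        wTr = w @ r                  #   a single synchronization
        beta = rTr_new / rTr
        alpha = rTr_new / (wTr - (beta / alpha) * rTr_new)
        p = r + beta * p
        s = w + beta * s
        z = q + beta * z
        rTr = rTr_new
    return x
```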


Page 64:

Attainable accuracy of pipelined CG

• What is the effect of adding auxiliary recurrences to the CG method?

• To isolate the effects, we consider a simplified version of a pipelined method

• Uses the same update formulas for α and β as HSCG, but an additional recurrence for 𝐴𝑝ᵢ

    r_0 = b - A x_0,  p_0 = r_0,  s_0 = A p_0
    for i = 1:nmax
        α_{i-1} = (r_{i-1}, r_{i-1}) / (p_{i-1}, s_{i-1})
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        r_i = r_{i-1} - α_{i-1} s_{i-1}
        β_i = (r_i, r_i) / (r_{i-1}, r_{i-1})
        p_i = r_i + β_i p_{i-1}
        s_i = A r_i + β_i s_{i-1}
    end

see [C., Rozložník, Strakoš, Tichý, Tůma, 2018]
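In Python, the simplified method differs from HSCG in just one line (a minimal sketch; `A` assumed symmetric positive definite):

```python
import numpy as np

def simple_pipelined_cg(A, b, x0, nmax):
    """Same alpha/beta formulas as HSCG, but A p_i is carried by the
    auxiliary recurrence s_i = A r_i + beta_i s_{i-1} rather than being
    recomputed; this change alone affects attainable accuracy."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    rTr = r @ r
    for _ in range(nmax):
        alpha = rTr / (p @ s)
        x = x + alpha * p
        r = r - alpha * s
        rTr_new = r @ r
        beta = rTr_new / rTr
        rTr = rTr_new
        p = r + beta * p
        s = A @ r + beta * s     # auxiliary recurrence replacing A @ p
    return x
```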


Page 70:

Attainable accuracy of simple pipelined CG

In finite precision, the iterates satisfy
$$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} + \boldsymbol{\delta x_i}, \qquad \hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1}\hat{s}_{i-1} + \boldsymbol{\delta r_i},$$
so the residual gap satisfies
$$f_i = \hat{r}_i - (b - A\hat{x}_i) = f_{i-1} - \hat{\alpha}_{i-1}(\hat{s}_{i-1} - A\hat{p}_{i-1}) + \delta r_i + A\,\delta x_i = f_0 + \sum_{m=1}^{i}(\delta r_m + A\,\delta x_m) - G_i d_i,$$
where $G_i = \hat{S}_i - A\hat{P}_i$ and $d_i = [\hat{\alpha}_0, \ldots, \hat{\alpha}_{i-1}]^T$.

Classical CG: $f_i = f_0 + \sum_{m=1}^{i}(A\,\delta x_m + \delta r_m)$


Page 75:

Attainable accuracy of simple pipelined CG

$$\|G_i\| \le \frac{O(\varepsilon)}{1 - O(\varepsilon)\,\kappa(\hat{U}_i)}\left(\|A\hat{P}_i\| + \|A\|\,\|\hat{R}_i\|\right)\|\hat{U}_i^{-1}\|$$

where
$$\hat{U}_i = \begin{bmatrix} 1 & -\hat{\beta}_1 & & \\ & 1 & \ddots & \\ & & \ddots & -\hat{\beta}_{i-1} \\ & & & 1 \end{bmatrix}, \qquad \hat{U}_i^{-1} = \begin{bmatrix} 1 & \hat{\beta}_1 & \cdots & \hat{\beta}_1\hat{\beta}_2\cdots\hat{\beta}_{i-1} \\ & 1 & \cdots & \hat{\beta}_2\cdots\hat{\beta}_{i-1} \\ & & \ddots & \vdots \\ & & & 1 \end{bmatrix}$$

with
$$\hat{\beta}_\ell\hat{\beta}_{\ell+1}\cdots\hat{\beta}_j = \frac{\|\hat{r}_j\|^2}{\|\hat{r}_{\ell-1}\|^2}, \quad \ell < j$$

• Residual oscillations can cause these factors to be large!
• Errors in the computed recurrence coefficients can be amplified!
• Resembles results for attainable accuracy in STCG (3-term recurrence CG)
• A seemingly innocuous change can cause drastic loss of accuracy
• For analysis of attainable accuracy in GVCG, see [Cools et al., 2018]


Page 79:

Simple pipelined CG

[Figure: convergence plots showing, in turn, the effect of using the auxiliary vector 𝑠ᵢ ≡ 𝐴𝑝ᵢ; of additionally changing the formula for the recurrence coefficient α; and of using the auxiliary vectors 𝑠ᵢ ≡ 𝐴𝑝ᵢ, 𝑤ᵢ ≡ 𝐴𝑟ᵢ, 𝑧ᵢ ≡ 𝐴²𝑟ᵢ]


Page 84:

Towards understanding convergence delay

• Coefficients α and β (related to the entries of 𝑇ᵢ) determine distribution functions $\omega^{(i)}(\lambda)$ which approximate the distribution function $\omega(\lambda)$ determined by the inputs 𝐴, 𝑏, 𝑥₀, in terms of the 𝑖th Gauss-Christoffel quadrature

• CG method = matrix formulation of Gauss-Christoffel quadrature (see, e.g., [Liesen & Strakoš, 2013])

• The A-norm of the CG error for $f(\lambda) = \lambda^{-1}$ is given as a scaled quadrature error:
$$\int \lambda^{-1}\,d\omega(\lambda) = \sum_{\ell=1}^{i} \omega_\ell^{(i)}\left(\theta_\ell^{(i)}\right)^{-1} + \frac{\|x - x_i\|_A^2}{\|r_0\|^2}$$

• For a particular CG implementation, can the computed $\hat{\omega}^{(i)}(\lambda)$ be associated with some distribution function $\tilde{\omega}(\lambda)$ related to the distribution function $\omega(\lambda)$, i.e.,
$$\int \lambda^{-1}\,d\omega(\lambda) \approx \int \lambda^{-1}\,d\tilde{\omega}(\lambda) = \sum_{\ell=1}^{i} \hat{\omega}_\ell^{(i)}\left(\hat{\theta}_\ell^{(i)}\right)^{-1} + \frac{\|x - \hat{x}_i\|_A^2}{\|r_0\|^2} + F_i,$$
where $F_i$ is small relative to the error term?

• For classical CG, yes; proved by Greenbaum [1989]

• For pipelined CG, thorough analysis is needed!
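For reference (a standard relation, found e.g. in Meurant and Strakoš (2006), stated here as background rather than taken from the slides), the entries of the Jacobi matrices 𝑇ᵢ compared on the next page are assembled from CG's recurrence coefficients as follows:

```latex
% Tridiagonal Jacobi (Lanczos) matrix built from CG's alpha_j, beta_j:
T_i = \begin{pmatrix}
\gamma_1 & \delta_2 & & \\
\delta_2 & \gamma_2 & \ddots & \\
& \ddots & \ddots & \delta_i \\
& & \delta_i & \gamma_i
\end{pmatrix},
\qquad
\gamma_1 = \frac{1}{\alpha_0}, \quad
\gamma_j = \frac{1}{\alpha_{j-1}} + \frac{\beta_{j-1}}{\alpha_{j-2}}, \quad
\delta_j = \frac{\sqrt{\beta_{j-1}}}{\alpha_{j-2}} \quad (j \ge 2).
```

This is why perturbations in the computed α and β translate directly into perturbed quadrature nodes and weights.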

Page 85:

[Figure: differences in the entries 𝛾ᵢ, 𝛿ᵢ of the Jacobi matrices 𝑇ᵢ in HSCG vs. GVCG (matrix bcsstk03)]

Page 86:

[Figure: histogram of frequency vs. value comparing the eigenvalues of 𝐴 with the eigenvalues of 𝑇₄₀₀ for HSCG and for GVCG]


Page 90:

s-step Krylov Subspace Methods

• Idea: compute blocks of 𝑠 iterations at once
  • generate an 𝑂(𝑠)-dimensional Krylov subspace basis; block orthogonalization
  • communicate every 𝑠 iterations instead of every iteration
  • reduces the number of synchronizations per iteration by a factor of 𝑠

• First related work: s-dimensional steepest descent, least squares
  • [Khabaza, 1963], [Forsythe, 1968], [Marchuk and Kuznecov, 1968]

• Flurry of work on s-step Krylov subspace methods in the 1980's/1990's; e.g.,
  • [Van Rosendale, 1983], [Chronopoulos and Gear, 1989], [de Sturler, 1991], [de Sturler and van der Vorst, 1995], ...

• Recent use in many applications:
  • combustion, cosmology [Williams, C., et al., IPDPS, 2014]: up to 4.2x speedup on 24K cores of a Cray XE6
  • geoscience dynamics [Anciaux-Sedrakian et al., 2016]
  • far-field scattering [Zhang et al., 2016]
  • wafer defect detection [Zhang et al., 2016]


Page 95:

s-step CG

Key observation: after iteration 𝑖, for 𝑗 ∈ {0, …, 𝑠},
$$x_{i+j} - x_i,\ r_{i+j},\ p_{i+j} \in \mathcal{K}_{s+1}(A, p_i) + \mathcal{K}_s(A, r_i)$$

𝑠 steps of s-step CG:

1. Expand the solution space 𝒔 dimensions at once: compute a "basis" matrix 𝒴 such that span(𝒴) = 𝒦_{s+1}(𝐴, 𝑝ᵢ) + 𝒦_s(𝐴, 𝑟ᵢ), satisfying the recurrence 𝐴𝒴 = 𝒴ℬ. [𝑂(1) messages]

2. Compute the inner products between basis vectors in one synchronization: 𝒢 = 𝒴ᵀ𝒴. [𝑂(1) messages]

3. Perform 𝑠 iterations of vector updates by updating coordinates in the basis 𝒴 (no data movement):
$$x_{i+j} - x_i = \mathcal{Y}x_j', \qquad r_{i+j} = \mathcal{Y}r_j', \qquad p_{i+j} = \mathcal{Y}p_j'$$

Number of synchronizations per step reduced by a factor of 𝑂(𝑠)!


Page 99: Sparse Matrix Computations - Univerzita Karlovacarson/ppt/KNM_MFF... · 2018. 11. 15. · ×Vector Inner Products Vector Updates Inner Products Vector Updates End Loop 0= −𝐴

s-step CG

𝑟0 = 𝑏 − 𝐴𝑥0, 𝑝0 = 𝑟0

for 𝑘 = 0:nmax/𝑠

Compute 𝒴𝑘 and ℬ𝑘 such that 𝐴𝒴𝑘 = 𝒴𝑘ℬ𝑘 and

span(𝒴𝑘) = 𝒦𝑠+1 𝐴, 𝑝𝑠𝑘 + 𝒦𝑠 𝐴, 𝑟𝑠𝑘

𝒢𝑘 = 𝒴𝑘𝑇𝒴𝑘

𝑥0′ = 0, 𝑟0

′ = 𝑒𝑠+2, 𝑝0′ = 𝑒1

for 𝑗 = 1: 𝑠

𝛼𝑠𝑘+𝑗−1 =𝑟𝑗−1

′𝑇 𝒢𝑘𝑟𝑗−1′

𝑝𝑗−1′𝑇 𝒢𝑘ℬ𝑘𝑝𝑗−1

𝑥𝑗′ = 𝑥𝑗−1

′ + 𝛼𝑠𝑘+𝑗−1𝑝𝑗−1′

𝑟𝑗′ = 𝑟𝑗−1

′ − 𝛼𝑠𝑘+𝑗−1ℬ𝑘𝑝𝑗−1′

𝛽𝑠𝑘+𝑗 =𝑟𝑗

′𝑇𝒢𝑘𝑟𝑗′

𝑟𝑗−1′𝑇 𝒢𝑘𝑟𝑗−1

𝑝𝑗′ = 𝑟𝑗

′ + 𝛽𝑠𝑘+𝑗𝑝𝑗−1′

end

[𝑥𝑠 𝑘+1 −𝑥𝑠𝑘 , 𝑟𝑠 𝑘+1 , 𝑝𝑠 𝑘+1 ] = 𝒴𝑘[𝑥𝑠′ , 𝑟𝑠

′, 𝑝𝑠′]

end

Outer Loop

Compute basis O(s) SPMVs

O(𝑠2) Inner Products (one

synchronization)

Inner Loop

Local Vector Updates (no

comm.)

End Inner Loop

Inner Outer Loop

s times

23

Page 100: Sparse Matrix Computations - Univerzita Karlovacarson/ppt/KNM_MFF... · 2018. 11. 15. · ×Vector Inner Products Vector Updates Inner Products Vector Updates End Loop 0= −𝐴
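A hedged, single-node Python sketch of this loop with the monomial basis follows. It is an illustration under simplifying assumptions (no preconditioning, no basis scaling or alternative polynomial bases), not the talk's implementation; the function name and test problem are our own:

# s-step CG sketch with the monomial basis Y = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r].
# B is the shift matrix with A Y(:, j) = Y(:, j+1) within each block; its columns
# for A^s p and A^{s-1} r are zero, and the inner loop never applies B to
# coordinates with support there. For illustration only.
import numpy as np
import scipy.sparse as sp

def s_step_cg(A, b, s=4, max_outer=500, tol=1e-10):
    x = np.zeros(b.size)
    r = b - A @ x
    p = r.copy()
    bnorm = np.linalg.norm(b)
    m = 2 * s + 1
    B = np.zeros((m, m))
    B[np.arange(1, s + 1), np.arange(s)] = 1.0             # p-block shift
    B[np.arange(s + 2, m), np.arange(s + 1, m - 1)] = 1.0  # r-block shift
    for _ in range(max_outer):
        cols = [p]
        for _ in range(s):                 # O(s) SpMVs
            cols.append(A @ cols[-1])
        cols.append(r)
        for _ in range(s - 1):
            cols.append(A @ cols[-1])
        Y = np.column_stack(cols)
        G = Y.T @ Y                        # one synchronization in parallel
        xc = np.zeros(m)
        rc = np.zeros(m); rc[s + 1] = 1.0  # r = Y e_{s+2}
        pc = np.zeros(m); pc[0] = 1.0      # p = Y e_1
        for _ in range(s):                 # local vector updates, no communication
            Bp = B @ pc
            alpha = (rc @ (G @ rc)) / (pc @ (G @ Bp))
            xc += alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ (G @ rc_new)) / (rc @ (G @ rc))
            pc = rc_new + beta * pc
            rc = rc_new
        x = x + Y @ xc                     # basis change back to length n
        r = Y @ rc
        p = Y @ pc
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x

# Usage on a small SPD 2D Poisson matrix:
n1 = 32
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n1, n1))
A = (sp.kron(sp.identity(n1), T) + sp.kron(T, sp.identity(n1))).tocsr()
b = np.ones(n1 * n1)
x = s_step_cg(A, b, s=4)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # should reach ~tol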

Numerical Behavior of s-step CG

s-step CG with the monomial basis $\mathcal{Y} = [p_i, Ap_i, \ldots, A^s p_i, r_i, Ar_i, \ldots, A^{s-1}r_i]$

$A$: bcsstk03 from UFSMC; $b$: equal components in the eigenbasis of $A$ with $\|b\| = 1$; $N = 112$, $\kappa(A) \approx 7\mathrm{e}6$

[Plot: convergence of CG and s-step CG for several values of $s$]

Effects of roundoff error: 1. convergence delay, 2. loss of accuracy


Sources of Roundoff Error in s-step CG

Error in outer iteration $k$:

• Error in computing the $s$-step Krylov subspace basis:
$A\hat{\mathcal{Y}}_k = \hat{\mathcal{Y}}_k\mathcal{B}_k + \Delta\mathcal{Y}_k$

• Error in updating the coordinate vectors in the inner loop, $j = 1{:}s$:
$\hat{x}_j' = \hat{x}_{j-1}' + \hat{q}_{j-1}' + \xi_j$
$\hat{r}_j' = \hat{r}_{j-1}' - \mathcal{B}_k\hat{q}_{j-1}' + \eta_j$, with $\hat{q}_{j-1}' = \mathrm{fl}(\hat{\alpha}_{sk+j-1}\hat{p}_{j-1}')$

• Error in the basis change, recovering the CG vectors for use in the next outer loop:
$\hat{x}_{sk+s} = \hat{\mathcal{Y}}_k\hat{x}_s' + \hat{x}_{sk} + \phi_{sk+s}$
$\hat{r}_{sk+s} = \hat{\mathcal{Y}}_k\hat{r}_s' + \psi_{sk+s}$


Attainable Accuracy of s-step CG

Residual gap: $f_i \equiv (b - A\hat{x}_i) - \hat{r}_i$

For CG (e.g., [van der Vorst and Ye, 2000], [Greenbaum, 1997]):

$\|f_i\| \le \|f_0\| + \varepsilon \sum_{m=1}^{i} (1 + N)\left(\|A\|\|\hat{x}_m\| + \|\hat{r}_m\|\right)$

For s-step CG, with $i \equiv sk + j$ [C., 2015]:

$\|f_i\| \le \|f_0\| + \varepsilon\,\Gamma \sum_{m=1}^{i} (1 + N)\left(\|A\|\|\hat{x}_m\| + \|\hat{r}_m\|\right)$

where $\Gamma = c \cdot \max_{\ell \le k} \|\mathcal{Y}_\ell^+\|\|\mathcal{Y}_\ell\|$ and $c$ is a low-degree polynomial in $s$.
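The residual gap is easy to observe experimentally. A small sketch (our own illustration, not from the talk): textbook CG in double precision, recording $\|(b - Ax_i) - r_i\|$ each iteration; the gap grows roughly as the bound above describes and determines the attainable accuracy:

# Track the residual gap f_i = (b - A x_i) - r_i during textbook CG, where r_i
# is the updated (recurrence) residual and b - A x_i is the true residual.
import numpy as np

def cg_with_gap(A, b, iters):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    gaps = []
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap          # updated residual
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        gaps.append(np.linalg.norm((b - A @ x) - r))  # residual gap ||f_i||
    return x, gaps

# Example on a small SPD system:
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50 * np.eye(50)
b = rng.standard_normal(50)
x, gaps = cg_with_gap(A, b, 60)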


Roundoff Error in Lanczos vs. s-step Lanczos

Finite precision Lanczos process ($A$ is $N \times N$ with at most $n$ nonzeros per row):

$A\hat{V}_m = \hat{V}_m\hat{T}_m + \hat{\beta}_{m+1}\hat{v}_{m+1}e_m^T + \delta\hat{V}_m$

where $\hat{V}_m = [\hat{v}_1, \ldots, \hat{v}_m]$, $\delta\hat{V}_m = [\delta\hat{v}_1, \ldots, \delta\hat{v}_m]$, and $\hat{T}_m$ is symmetric tridiagonal with diagonal entries $\hat{\alpha}_1, \ldots, \hat{\alpha}_m$ and off-diagonal entries $\hat{\beta}_2, \ldots, \hat{\beta}_m$. Let $\sigma \equiv \|A\|_2$ and $\theta\sigma \equiv \||A|\|_2$.

For $i \in \{1, \ldots, m\}$:

$\|\delta\hat{v}_i\|_2 \le \varepsilon_1\sigma$
$|\hat{\beta}_{i+1}\hat{v}_i^T\hat{v}_{i+1}| \le 2\varepsilon_0\sigma$
$|\hat{v}_{i+1}^T\hat{v}_{i+1} - 1| \le \varepsilon_0/2$
$\left|\hat{\beta}_{i+1}^2 + \hat{\alpha}_i^2 + \hat{\beta}_i^2 - \|A\hat{v}_i\|_2^2\right| \le 4i(3\varepsilon_0 + \varepsilon_1)\sigma^2$

Lanczos [Paige, 1976]: $\varepsilon_0 = O(\varepsilon N)$, $\varepsilon_1 = O(\varepsilon n\theta)$

s-step Lanczos [C., Demmel, 2015]: $\varepsilon_0 = O(\varepsilon N\,\Gamma^2)$, $\varepsilon_1 = O(\varepsilon n\theta\,\Gamma)$, where $\Gamma = c \cdot \max_{\ell \le k}\|\mathcal{Y}_\ell^+\|\|\mathcal{Y}_\ell\|$


Convergence of Ritz Values in s-step Lanczos

• All results of Paige [1980] relating loss of orthogonality to eigenvalue convergence hold for s-step Lanczos as long as

$\Gamma \le \left(24\varepsilon(N + 11s + 15)\right)^{-1/2} \approx \dfrac{1}{\sqrt{N\varepsilon}}$

where $\Gamma = c \cdot \max_{\ell \le k}\|\mathcal{Y}_\ell^+\|\|\mathcal{Y}_\ell\|$.

• Bounds on the accuracy of the Ritz values depend on $\Gamma^2$: the distance of a Ritz value to an eigenvalue $\lambda$ is $O(\varepsilon N^3\|A\|)$ for Lanczos and $O(\varepsilon N^3\|A\|\,\Gamma^2)$ for s-step Lanczos.

• If $\Gamma \approx 1$: s-step Lanczos behaves the same numerically as classical Lanczos.


A different problem...

$A$: nos4 from SuiteSparse; $b$: equal components in the eigenbasis of $A$ with $\|b\| = 1$; $N = 100$, $\kappa(A) \approx 2\mathrm{e}3$

If the application only requires $\|x - x_i\|_A \le 10^{-10}$, any of these methods will work!

Need an adaptive, problem-dependent approach based on an understanding of finite precision behavior!


Adaptive s-step CG

• Consider the growth of the relative residual gap ($f_i \equiv (b - A\hat{x}_i) - \hat{r}_i$) caused by errors in outer loop $k$, which begins with global iteration number $m$.

• We can approximate an upper bound on this quantity by

$\dfrac{\|f_{m+s}\| - \|f_m\|}{\|A\|\|x\|} \lesssim \varepsilon\left(1 + \kappa(A)\right)\Gamma_k \max_{j\in\{0,\ldots,s\}} \dfrac{\|\hat{r}_{m+j}\|}{\|A\|\|x\|}$

• If our application requires relative accuracy $\varepsilon_*$, we must have

$\Gamma_k \equiv c \cdot \|\mathcal{Y}_k^+\|\|\mathcal{Y}_k\| \lesssim \dfrac{\varepsilon_*}{\varepsilon \max_{j\in\{0,\ldots,s\}}\|\hat{r}_{m+j}\|}$

• $\|\hat{r}_i\|$ large → $\Gamma_k$ must be small; $\|\hat{r}_i\|$ small → $\Gamma_k$ can grow

⇒ the adaptive s-step approach [C., 2018]: $s$ starts off small and increases at a rate depending on $\|\hat{r}_i\|$ and $\varepsilon_*$

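A hedged sketch of how such a rule might be implemented: pick the largest $s$ whose basis-conditioning estimate still satisfies the bound above. The proxy gamma_estimate below (condition number of a freshly built monomial basis, which equals $\|\mathcal{Y}^+\|_2\|\mathcal{Y}\|_2$ for full-rank $\mathcal{Y}$) is our own illustrative choice, not necessarily the heuristic of [C., 2018]:

# Choose s adaptively: largest candidate with Gamma(s) <= eps_star/(eps*||r||).
import numpy as np

def gamma_estimate(A, p, r, s):
    # Monomial basis [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]; cond(Y) is
    # the ratio of extreme singular values, i.e. ||Y^+||_2 ||Y||_2.
    cols = [p]
    for _ in range(s):
        cols.append(A @ cols[-1])
    cols.append(r)
    for _ in range(s - 1):
        cols.append(A @ cols[-1])
    return np.linalg.cond(np.column_stack(cols))

def choose_s(A, p, r, rnorm, eps_star, eps=2.0**-53, s_max=16):
    s = 1
    for cand in range(2, s_max + 1):
        if gamma_estimate(A, p, r, cand) <= eps_star / (eps * rnorm):
            s = cand          # basis still well-enough conditioned
        else:
            break             # stop growing s once the bound is violated
    return s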

Adaptive s-step CG

[Plot: convergence of CG, s-step CG, and adaptive s-step CG]

mesh3e1 (UFSMC), $n = 289$, $\kappa(A) \approx 10$, $b_i = 1/\sqrt{N}$


Takeaway

runtime = (time per iteration) × (number of iterations until convergence)

Reduce time per iteration:
• reduced precision
• approximate operators ($\tilde{A}x \approx Ax$)
• asynchronous execution
• modify the algorithm to reduce communication

Reduce the number of iterations:
• increased precision (but doubled precision → twice as many bits moved)
• preconditioning ($Ax = b \Rightarrow M_L^{-1}AM_R^{-1}u = M_L^{-1}b$, $x = M_R^{-1}u$)
• block methods
• eigenvalue deflation
• subspace recycling

But if the convergence criteria are never met (divergence, or convergence to an inaccurate solution), the number of iterations is ∞.

To minimize runtime, we must understand how modifications affect: 1) attainable accuracy, 2) convergence rate, 3) time per iteration.


Future Work: Finite Precision Krylov Subspace Methods

• Convergence delay in high-performance CG variants: extending the results of Greenbaum [1989] to s-step and pipelined versions

• Deviation from exact Krylov subspaces in Lanczos: can the space spanned by the computed $\hat{V}_i$ be related to some exact Krylov subspace?

• Loss of orthogonality vs. backward error in finite precision GMRES: is

$\dfrac{\|\hat{r}_i\|}{\|b\| + \|A\|\|\hat{x}_i\|} \cdot \|I - \hat{V}_i^T\hat{V}_i\| \approx O(\varepsilon)$?

• Rigorous analysis of accuracy and convergence for various commonly-used techniques: deflation, incomplete preconditioning, matrix equilibration, look-ahead, etc.


Simulation + Data + Learning

• Data analytics and machine learning increasingly important in scientific discovery:
• event identification and correlation in high-energy physics
• climate simulation validation using sensor data
• determining patterns and trends from astronomical data
• genetic sequencing

• Driving changes in supercomputer architecture:
• multiprecision hardware
• specialized accelerators
• memory at node

• The convergence of simulation, data, and learning is a current hot topic: workshops, conferences, research initiatives, funding calls


Numerical Linear Algebra for Data Analytics + ML

• Numerical linear algebra routines are the core computational kernels in many data science and machine learning applications

• Growing problem sizes, growing datasets → need scalable performance

Challenges:

• Optimizing performance in a different space: different/new architectures, matrix structures, accuracy requirements, etc.

• Translation between (% accuracy on test dataset) ↔ (number of floating-point digits)

• Designing efficient and effective preconditioners

• More general error analyses: how do approximations (e.g., sparsification, low-rank representation) affect the convergence and accuracy of numerical algorithms?


[email protected]

www.karlin.mff.cuni.cz/~carson

Thank you!


The effects of finite precision

Errors have two effects:

1. Delay of convergence
• No longer have an exact Krylov subspace
• Basis can become numerically rank deficient
• Residuals no longer orthogonal – minimization no longer exact!

2. Loss of attainable accuracy
• Rounding errors cause the true residual $b - A\hat{x}_i$ and the updated residual $\hat{r}_i$ to deviate!

$A$: bcsstk03 from UFSMC; $b$: equal components in the eigenbasis of $A$ with $\|b\| = 1$; $N = 112$, $\kappa(A) \approx 7\mathrm{e}6$

Many existing results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.


Attainable accuracy of pipelined CG

• In finite precision:

$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} + \delta x_i$,  $\hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1}\hat{s}_{i-1} + \delta r_i$

$f_i = \hat{r}_i - (b - A\hat{x}_i) = f_{i-1} - \hat{\alpha}_{i-1}\left(\hat{s}_{i-1} - A\hat{p}_{i-1}\right) + \delta r_i + A\,\delta x_i = f_0 + \sum_{m=1}^{i}\left(A\,\delta x_m + \delta r_m\right) - G_i d_i$

where $G_i = \hat{S}_i - A\hat{P}_i$ and $d_i = [\hat{\alpha}_0, \ldots, \hat{\alpha}_{i-1}]^T$.

• The bound on $G_i$ will differ depending on the method (other recurrences or auxiliary vectors used).

• Both ChG CG and GVCG use the same update formulas for $x_i$ and $r_i$: $x_i = x_{i-1} + \alpha_{i-1}p_{i-1}$, $r_i = r_{i-1} - \alpha_{i-1}s_{i-1}$.


Preconditioning for s-step KSMs

• Much recent/ongoing work in developing communication-avoiding preconditioned methods

• Many approaches shown to be compatible:

• Diagonal

• Sparse Approximate Inverse (SAI) – for s-step BICGSTAB by Mehri (2014)

• HSS preconditioning (Hoemmen, 2010); for banded matrices (Knight, C., Demmel, 2014); the same general technique works for any system that can be written as sparse + low-rank

• CA-ILU(0) – Moufawad and Grigori (2013)

• Deflation – for s-step CG (C., Knight, Demmel, 2014), for s-step GMRES (Yamazaki et al., 2014)

• Domain decomposition – avoid introducing additional communication by "underlapping" subdomains (Yamazaki et al., 2014)


SpMV Dependency Graph

$G = (V, E)$ where $V = \{y_0, \ldots, y_{n-1}\} \cup \{x_0, \ldots, x_{n-1}\}$ and $(y_i, x_j) \in E$ if $A_{ij} \neq 0$

Example: tridiagonal matrix


Parallel Matrix Powers

Example: tridiagonal matrix, $s = 3$, $n = 40$, $p = 4$

[Figure: rows 0–40 partitioned across processors 1–4, showing the vector entries each processor needs at each of the $s$ SpMV levels]

Naïve algorithm: $s$ messages per neighbor

Matrix powers optimization: 1 message per neighbor


The Matrix Powers Kernel (Demmel et al., 2007)

Avoids communication:

• In serial, by exploiting temporal locality: reading $A$, reading vectors

• In parallel, by doing only 1 'expand' phase (instead of $s$); requires a sufficiently low 'surface-to-volume' ratio

Tridiagonal example, sequential and parallel [Figure: computing $[v, Av, A^2v, A^3v]$; black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies]

Also works for general graphs!
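To make the 'expand' phase concrete, here is a minimal sketch (our own, with illustrative names) that computes, for one processor's block of rows, the set of remote vector entries needed to perform $s$ SpMV levels locally, i.e., the $s$-level dependencies that can then be fetched in one message per neighbor:

# BFS over the SpMV dependency graph: rows reachable within s steps of the
# owned block are exactly the entries needed for s local SpMVs.
import numpy as np
import scipy.sparse as sp

def s_level_dependencies(A, owned_rows, s):
    A = A.tocsr()
    needed = set(owned_rows)         # vector entries we must hold locally
    frontier = set(owned_rows)
    for _ in range(s):
        nxt = set()
        for i in frontier:           # row i of the next level needs x_j for A[i, j] != 0
            nxt.update(A.indices[A.indptr[i]:A.indptr[i + 1]])
        frontier = nxt - needed
        needed |= nxt
    return needed - set(owned_rows)  # ghost entries to fetch in one message

# Tridiagonal example matching the slide: n = 40, 4 processors, s = 3.
n = 40
A = sp.diags([1.0, 1.0, 1.0], [-1, 0, 1], shape=(n, n))
ghosts = s_level_dependencies(A, range(10, 20), s=3)
print(sorted(ghosts))  # rows 7..9 and 20..22: s extra layers on each side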


Complexity comparison

Example of parallel (per-processor) complexity for $s$ iterations of CG vs. s-step CG for a 2D 9-point stencil, assuming each of $p$ processors owns $n/p$ rows of the matrix and $s \le \sqrt{n/p}$. All values in the table are meant in the Big-O sense (lower-order terms and constants not included):

                      Flops               Words moved             Messages
                  SpMV     Orth.      SpMV         Orth.       SpMV   Orth.
Classical CG      sn/p     sn/p      s√(n/p)     s log₂ p       s     s log₂ p
s-step CG         sn/p     s²n/p     s√(n/p)     s² log₂ p      1     log₂ p


Choosing the Block Size s

• The parameter $s$ is limited by the matrix sparsity structure and machine properties

• As we increase $s$, at some point the lower-order terms in flops and words moved will dominate runtime; this point depends on the relative costs on the machine of, e.g., a flop versus sending a message

[Plot: time per iteration vs. $s$]

• But $s$ is also limited by numerical properties ...

• We can auto-tune to find the best $s$ based on these properties, i.e., find the $s$ that gives the least time per iteration (a sketch follows)
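A minimal sketch of that auto-tuning loop, assuming a callable run_outer_loop(s) that performs one outer loop ($s$ iterations' worth of work) of an s-step solver; the candidate list and trial count are arbitrary illustrative choices:

# Empirically time candidate values of s and keep the one minimizing time per
# iteration. run_outer_loop is a stand-in for a real s-step outer loop.
import time

def autotune_s(run_outer_loop, candidates=(1, 2, 4, 6, 8, 12, 16), trials=3):
    best_s, best_time = None, float("inf")
    for s in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            run_outer_loop(s)       # one outer loop = s iterations of work
        per_iter = (time.perf_counter() - t0) / (trials * s)
        if per_iter < best_time:
            best_s, best_time = s, per_iter
    return best_s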


Choosing a Polynomial Basis

• Recall: in each outer loop of CA-CG, we compute bases for some Krylov subspaces, $\mathcal{K}_m(A, v) = \mathrm{span}\{v, Av, \ldots, A^{m-1}v\}$

• Simple loop unrolling gives the monomial basis $Y = [p, Ap, A^2p, A^3p, \ldots]$
• Its condition number (ratio of largest to smallest eigenvalues, $\lambda_{\max}/\lambda_{\min}$) can grow exponentially with $s$
• Recognized early on that this negatively affects convergence (Leland, 1989)

• Improve the basis condition number to improve convergence: use different polynomials to compute a basis for the same subspace

• Two choices based on spectral information that usually lead to well-conditioned bases: Newton polynomials and Chebyshev polynomials (a sketch of the latter follows)
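A small sketch of one of these options: building a Chebyshev basis for $\mathcal{K}_m(A, v)$ from estimates lmin, lmax of the extreme eigenvalues, using the three-term recurrence $T_{k+1}(t) = 2tT_k(t) - T_{k-1}(t)$ with the spectrum mapped to $[-1, 1]$ via $t = (x - \theta)/\delta$. The scaling conventions here are one common choice, not necessarily the talk's:

# Chebyshev basis [T_0(t(A))v, T_1(t(A))v, ...] for the same Krylov subspace.
import numpy as np

def chebyshev_basis(A, v, m, lmin, lmax):
    theta = (lmax + lmin) / 2.0   # center of the spectrum estimate
    delta = (lmax - lmin) / 2.0   # half-width; maps [lmin, lmax] to [-1, 1]
    Y = np.zeros((v.size, m))
    Y[:, 0] = v
    if m > 1:
        Y[:, 1] = (A @ Y[:, 0] - theta * Y[:, 0]) / delta
    for k in range(2, m):
        Y[:, k] = 2.0 * (A @ Y[:, k-1] - theta * Y[:, k-1]) / delta - Y[:, k-2]
    return Y

# e.g. for a diagonal test matrix with spectrum in [0.1, 4]:
A = np.diag(np.linspace(0.1, 4.0, 200))
v = np.ones(200)
Y = chebyshev_basis(A, v, m=8, lmin=0.1, lmax=4.0)
print(np.linalg.cond(Y))  # typically far smaller than for the monomial basis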


History of s-step Krylov Methods

[Timeline, 1983–2001:]
• Van Rosendale: CG (1983)
• Walker: GMRES (1988)
• Chronopoulos and Gear: CG (1989; first termed "s-step methods")
• Leland: CG (1989)
• Chronopoulos and Kim: Orthomin, GMRES
• Chronopoulos: MINRES, GCR, Orthomin
• Kim and Chronopoulos: Arnoldi, symmetric Lanczos
• de Sturler: GMRES
• Bai, Hu, and Reichel: GMRES
• Joubert and Carey: GMRES
• Chronopoulos and Kim: nonsymmetric Lanczos
• Erhel: GMRES
• Toledo: CG
• de Sturler and van der Vorst: GMRES
• Chronopoulos and Kincaid: Orthodir (2001)


Recent Years…

[Timeline, 2010–2014:]
• Hoemmen: Arnoldi, GMRES, Lanczos, CG (2010; first termed "CA" methods; first TSQR and general matrix powers kernel)
• Carson, Knight, and Demmel: BICG, CGS, BICGSTAB (first CA-BICGSTAB method)
• Feuerriegel and Bücker: Lanczos, BICG, QMR
• Grigori, Moufawad, Nataf: CG
• Carson and Demmel: 2-term Lanczos
• Carson and Demmel: CG-RR, BICG-RR (first theoretical results on finite precision behavior)
• Ballard, Carson, Demmel, Hoemmen, Knight, Schwartz: Arnoldi, GMRES, nonsymmetric Lanczos


[Plots: CG (PETSc solver) on Hopper, 4 MPI processes per node; strong scaling for 2D Poisson on 512², 1024², and 2048² grids, and weak scaling with 16², 32², and 64² grids per process]


Coarse-grid Krylov Solver on NERSC's Hopper (Cray XE6)

Weak scaling: 4³ points per process (0 slope ideal)

[Plot: time (seconds, 0–1.75) vs. processes (6 threads each, 0–4096); total bottom-solver time and total MPI_AllReduce time, i.e., solver time vs. communication time]

Solver performance and scalability limited by communication!


Communication-Avoiding Krylov Method Speedups

• Recent results: CA-BICGSTAB used as a geometric multigrid (GMG) bottom-solve (Williams, Carson, et al., IPDPS '14)

• Plot: net time spent on different operations over one GMG bottom solve using 24,576 cores; 64³ points/core on the fine grid, 4³ points/core on the coarse grid

• Hopper at NERSC (Cray XE6), four 6-core Opteron chips per node, Gemini network, 3D torus

• CA-BICGSTAB with $s = 4$

• 3D Helmholtz equation: $a\alpha u - b\nabla\cdot(\beta\nabla u) = f$, with $\alpha = \beta = 1.0$, $a = b = 0.9$

• 4.2× speedup in the Krylov solve; 2.5× in the overall GMG solve

• Implemented in BoxLib; applied to low-Mach-number combustion and 3D N-body dark matter simulation apps


Benchmark timing breakdown

• Plot: net time spent across all bottom solves at 24,576 cores, for BICGSTAB and CA-BICGSTAB with $s = 4$

[Plot: breakdown of bottom-solver time (seconds, 0–1.5) for BICGSTAB vs. CA-BICGSTAB: MPI (collectives), MPI (P2P), BLAS3, BLAS1, applyOp, residual]

• 11.2× reduction in MPI_AllReduce time (red)
– BICGSTAB requires $6s$× as many MPI_AllReduce's as CA-BICGSTAB
– Less than the theoretical 24× since messages in CA-BICGSTAB are larger and not always latency-limited

• P2P (blue) communication doubles for CA-BICGSTAB
– The basis computation requires twice as many SpMVs (P2P) per iteration as BICGSTAB


Representation of Matrix Structures

[2×2 grid, Hoemmen (2010), Fig. 2.5: representation of matrix structure (explicit/implicit) vs. representation of matrix values (explicit/implicit)]

• Explicit structure, explicit values – example: general sparse matrix
• Explicit structure, implicit values – example: Laplacian matrix of a graph
• Implicit structure, explicit values – example: stencil with variable coefficients
• Implicit structure, implicit values – example: stencil with constant coefficients


s-step (communication-avoiding) CG

For $s$ iterations of updates, inner products and SpMVs (in the basis $\mathcal{Y}$) can be computed independently by each processor without communication:

• SpMV: the length-$n$ product $Ap_{i+j}$ becomes $A\mathcal{Y}p_j' = \mathcal{Y}(\mathcal{B}p_j')$, a local $O(s)$ operation on coordinate vectors

• Inner product: the length-$n$ product $(r_{i+j}, r_{i+j})$ becomes $r_j'^T\mathcal{Y}^T\mathcal{Y}r_j' = r_j'^T\mathcal{G}r_j'$, a local $O(s)$ operation


Residual replacement for s-step CG

• Use a computable bound for $\|b - A\hat{x}_{sk+j+1} - \hat{r}_{sk+j+1}\|$ to update $d_{sk+j+1}$, an estimate of the error in computing $\hat{r}_{sk+j+1}$, in each iteration

• Set a threshold $\hat{\varepsilon} \approx \sqrt{\varepsilon}$ and replace whenever $d_{sk+j+1}/\|\hat{r}_{sk+j+1}\|$ reaches the threshold

Pseudo-code for residual replacement with group update for s-step CG:

if $d_{sk+j} \le \hat{\varepsilon}\|\hat{r}_{sk+j}\|$ and $d_{sk+j+1} > \hat{\varepsilon}\|\hat{r}_{sk+j+1}\|$ and $d_{sk+j+1} > 1.1\,d_{\mathrm{init}}$
    $z = z + \mathcal{Y}_k\hat{x}'_{k,j+1} + \hat{x}_{sk}$    (group update of the approximate solution)
    $\hat{x}_{sk+j+1} = 0$
    $\hat{r}_{sk+j+1} = b - Az$    (set residual to true residual)
    $d_{\mathrm{init}} = d_{sk+j+1} = \varepsilon\left((1 + 2N')\|A\|\|z\| + \|\hat{r}_{sk+j+1}\|\right)$
    $\hat{p}_{sk+j+1} = \mathcal{Y}_k\hat{p}'_{k,j+1}$
    break from the inner loop and begin a new outer loop
end
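The replacement test itself is a few scalar comparisons per iteration. A minimal sketch (our own packaging of the condition above; eps_hat defaults to $\sqrt{\varepsilon}$ in double precision):

# Replace when the error estimate d has just crossed the threshold relative to
# ||r|| and has grown meaningfully since the last replacement. All arguments
# are scalars maintained by the solver.
import math

def should_replace(d_prev, rnorm_prev, d_next, rnorm_next, d_init,
                   eps_hat=math.sqrt(2.0**-53)):
    return (d_prev <= eps_hat * rnorm_prev
            and d_next > eps_hat * rnorm_next
            and d_next > 1.1 * d_init)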
