From CVA to the Resolution of a Large Number of Small...

From CVA to the Resolution of a LargeNumber of Small Random Systems

Lokman Abbas-TurkiFirst part from a joint work with M. A. MikouLast part from a joint work with S. Graillat

UPMC, LPMA

6 April 2016

Lokman (UPMC, LPMA) GPU Tech. Conf. 2016 1 / 38

Plan

IntroductionFrom linear/linear to linear/nonlinearFrom nonlinear/linear to nonlinear/nonlinear

Simulation algorithmsWithout funding constraintsWith funding constraints

Adaptation issues toAmerican options

The difference with the referencesLDLtHouseholder tridiagonalization + PCRDivide and conquer for eigenproblem

Some simulationresults

Conclusion


Introduction

Plan






Conclusion


Introduction From linear/linear to linear/nonlinear

Credit ValuationAdjustment

In a financial transaction between a party C that has to pay another partyB some amount V , the CVA value is the price of the insurance contractthat covers the default of party C to pay the whole sum V .

CVAt,T = (1− R)Et(V+τ 1t<τ≤T

)(1)

I R is the recovery to make if the counterparty defaults (Assume R = 0),I τ is the random default time of the counterparty,I T is the protection time horizon.

Numericalsimulation

CVA0,T ≈N−1∑k=0

E(V+tk1τ∈(tk ,tk+1]

), (2)

N ≤ the number of time steps used for SDEs discretization.

ImportanceI Hold sufficient amount of liquid assets to face the counterparty default.I Basel III includes the calculation of the CVA (Credit Valuation

Adjustment) as an important part of the prudential rules.


Introduction From linear/linear to linear/nonlinear

Various kind ofcontracts

Simulating assets St = (S1t , ..., S

dt ) trajectories then contracts trajectories

to get Vt as a sum:

Vt =∑ie

φexpie (St) +∑ii

φeuiii (St) +∑id

φeudid,t(St) +∑ia

φamia,t(St), (3)

where ie, ii , id and ia are the exposure indices and:

φexp explicit function, for example: φexp(Stk ) = S1tk− S2

tk.

φeui is a path-independent European contract,

φeui (Stk ) = E(f eui (ST )|Stk ). (4)

φeudt is a path-dependent European contract,

φeudtk(Stk ) = E(f eudtk

(Stk+1 )|Stk ), (5)

for example: f eudtk(Stk+1 ) = ( max

i=0,..,kS1ti∨ S1

tk+1− S2

tk+1)+.

φamt is an American contract, involving an optimal stopping problem

φamtk (Stk ) = f (Stk ) ∨ E(φamtk+1(Stk+1 )|Stk ) (6)

with f an explicit payoff that generally does not depend on the asset path.

Common problem ϕ(x) = E(f (Stk+1 )|Stk = x).


Introduction From nonlinear/linear to nonlinear/nonlinear

TVA definition Total valuation adjustment (>CVA+DVA+FVA), it covers:I Both defaults: τ = τ c ∧ τb, CVA and DVA.I Funding our risk and the risk of the counterparty: Nonlinear BSDE part,

FVA.

S. Crépey(2012) Ignoring the external funding and denoting βt = e−∫ t0 rudu where r is the

risk-free short rate process, Θ satisfies the following BSDE on [0, τ ∧ T ]

βtΘt = E

[βτ1τ<T (Vτ − Rτ ) +

∫ τ∧T

tβsgs(Vs −Θs)ds

∣∣Gt] (7)

where G is the extension of F by the natural filtration generated by τ cand by τb. R is the total close-out cash-flow specified thanks to CSA(Credit Support Annex) and g is the funding coefficient.

TVA BSDEsimulation

I Only for European contracts.I Requires a good approximation of the exposure V .I Practitioners usually use rough approximations.I No trustable procedure in the general case.


Introduction From nonlinear/linear to nonlinear/nonlinear

WOLOG: V = P

For one Americancontract

I Pt is the American derivative exposition.I Let τ∗ ∈ [0,T ] be the optimal stopping time associated to Pt .

Theorem Θ = ΘJ + (1− J)ξτ , where Jt = 1{τ>t} and the pre-default TVA Θsatisfies the following BSDE on [0, τ∗]

βtΘt = Et

(∫ τ∗

tβs gs(Ps − Θs)ds

), gt(Pt − Θt) = gt(Pt − Θt) + γt ξt . (8)

I βt = e−∫ t0 (γu+ru)du .

I ξt := 1Et (1{τ>t})

Et(ξt1{τ>t}) with ξt := Pt − E(R1{t=τ}|Gt).

Extension Possible to extend (8) on a portfolio of various American options.


Simulation algorithms

Plan






Conclusion



Without fundingconstraints CVA0,T =

N−1∑k=0

E(P+k+11τ∈(kh,(k+1)h]

), h =

T

N.

With fundingconstraints Θk = Ek (Θk+1 + hg(k + 1,Pk+1,Θk+1)) , ΘN = 0.

The exposure Pk (x) = E(Φk,τk (Sτk )|Sk = x

)with

τN = N,∀k ∈ {N − 1, ..., 0}, τk = k1Γk + τk+11Γc

k,

(9)

where Γk ={

Φk+1,k (Sk ) > E(Pk+1(Sk+1)

∣∣∣Sk)}. The conditionalexpectation involved in Γk is approximated using a regression on a basisof monomial functions where Kk is its cardinal.



Without fundingconstraints CVA0,T =

N−1∑k=0

E(P+k+11τ∈(kh,(k+1)h]

), h =

T

N.

With fundingconstraints Θk = Ek (Θk+1 + hg(k + 1,Pk+1,Θk+1)) , ΘN = 0.

An example of atwo stage

simulation withM0 = 2, M6 = 8

and M8 = 4



CVA0,Tapproximation CVA0,T =

N−1∑k=0

1M0

M0∑i=1

F 2k+1

(P1(S i

1), ..., Pk+1(S ik+1)

)(10)

With{F 1k+1(x1, ..., xk+1) = E

(1τ∈(kh,(k+1)h]|P1 = x1, ...,Pk+1 = xk+1

),

F 2k+1(x1, ..., xk+1) = (xk+1)+F 1

k+1(x1, ..., xk+1).(11)

Θk approximation

For k = 1, ...,N − 1

Θk (x)= tψ(x)Ψ−1k

1M0

M0∑j=1

ψ(S jk )

(Θk+1(S j

k+1) +1Ng(k+1, Θk+1(S j

k+1), Pk+1(S jk+1)

))and ΘN(x) = 0, Θ0(S0) =

1M0

M0∑j=1

(Θ1(S j

1) +1Ng(1, Θ1(S j

1), P1(S j1))).

(12)

Where Ψk = T

1M0

M0∑i=0

ψ(S ik )tψ(S i

k )

with: {S i}i∈{1,...,M0} and {S i}i∈{1,...,M0} are two

independent simulations of the underlying asset S, ψ is a basis of monomial functions where Kis its cardinal and T is an operator that must satisfy some desired properties.


Simulation algorithms Without funding constraints

Theorem

E

[(CVA0,T − CVA0,T

)2]≤

N2

M0max

k∈{0,...,N−1}Var

(F 2k+1

(P1(S i

1), ..., Pk+1(S ik+1)

))

+N∑j=1

14NM2

j

(E[Vj (S

ij )f ′′j (Pj (S

ij ))F 3

j (P1(S i1), ...,Pj (S

ij ))])2

+N∑j=1

14NM2

j

EVj (S

ij )f ′′j (Pj (S

ij ))

N−1∑k=j

F 4k+1(P1(S i

1), ...,Pj (Sij ), S i

j )

2

+N∑j=1

N

4M2j

(E[Vj (S

ij )F 1

j (P1(S i1), ...,Pj (S

ij ))|Pj (S

ij ) = 0

]ϕj (0)

)2+ N

N∑j=1

(N − j + 1)2O

(1M4

j

)

Where ϕj is the density of Pj (Sij ), Vj (x) = Var

(√Mj

(Pj (x)− Pj (x)

)).

Good choices If ϕj (0) is big then take Mj ∼√M0, otherwise Mj ∼

√M0/N. In both

cases, N must be small when compared to√M0. When American options

are involved, make sure that K3j /Mj is small enough.

For example

Mj =N − j

N − 1M1 with either M1 =

√M0

Nor M1 =

√M0. (13)


Simulation algorithms With funding constraints

Theorem As long as {Θi (x)}0≤i≤N−1 are of class Cs on the support of S ∈ Rd ,there exists a positive constant C such that for each 0 ≤ k ≤ N − 1

E

[(Θk (S i

k )−Θk (S ik ))2]≤

CK

N2

N−1∑l=k

EVl+1(S j

l+1)∂2Pg(l + 1,Θl+1(S j

l+1),Pl+1(S jl+1)

)2Ml

2

+O

(K

M0+

K2

N2M0+

K

N4M2l

+K1−2s/d

N2 + K−2s/d

).

Good choice Take Ml ∼√

M0/N, N must be sufficiently small N ∼ 10. WhenAmerican options are involved, make sure that K3

l /Ml is small enough.

For example

Ml =N − l

N − 1M1 with M1 =

√M0

N. (14)


Adaptation issues to American options

Plan






Conclusion


Adaptation issues to American options

An example of atwo stage

simulation withM0 = 2, M6 = 8

and M8 = 4

Inner dynamicprogramming

τN = N,∀k ∈ {N − 1, ..., 0}, τk = k1Γk + τk+11Γc

k,

(15)

where Γk ={

Φk+1,k (Sk ) > Rk · ψ(Sk )}. The vector Rk minimizes the

quadratic error||Pk+1(Sk+1)− Rk · ψ(Sk )||L2 (16)

and denoting Ak = E (ψ(Sk )tψ(Sk )), we get

Rk = A−1k E (ψ(Sk )Pk+1(Sk+1)) . (17)


Adaptation issues to American options The difference with the references

Three main methods for symmetric big matricesCholesky

factorizationI V. Volkov and J. Demmel. LU, QR and Cholesky Factorizations using

Vector Capabilities of GPUs. Berkeley Technical Report. 2008.I G. Ballard, J. Demmel, O. Holtz and O. Schwartz,

Communication-Optimal Parallel and Sequential Cholesky Decomposition.SIAM J. SCI. COMPUT. 32(6), 3495–3523. 2010.

Tridiagonal form +cyclic reduction

I Y. Zhang , J. Cohen and J. D. Owens. 15th ACM SIGPLAN Symposiumon Principles and Practice of Parallel Programming, 127–136. 2010.

I D. Goddeke and R. Strzodka. Cyclic Reduction Tridiagonal Solvers onGPUs Applied to Mixed Precision Multigrid. Parallel and DistributedSystems, IEEE Trans. 22(1), 22–32. 2010.

Tridiagonal form +eigenproblem

I J. W. Demmel, O. A. Marques, B. N. Parlett and C. Vomel. Performanceand Accuracy of Lapack’s Symmetric Tridiagonal Eigensolvers. SIAM J.SCI. COMPUT. 30(3), 1508–1526. 2008.

I C. Vomel, S. Tomov and J. Dongarra. Divide & Conquer on HybridGpu-Accelerated Multicore Systems. SIAM J. SCI. COMPUT. 34(2),70–82. 2012.



None of the previous works can be used directlyThe reason

I Large number of small random linear systems: The size does not exceed64 and the communication is reduced.

I Some of these random systems could be ill-conditioned.

Ak,l =1Mk

Mk∑j=1

ψl (S(j)tk

)ψl (S(j)tk

)t (18)

Typical conditionnumbers for linearregression n = 30

in the Black &Scholes model



Must we systematically use Householdertridiagonalization with divide & conquer when wesuspect the random linear systems to be ill-conditioned?

Our answer

I Perform Householder tridiagonalization O(4n3/3) and solvethe linear systems cheaply using parallel cyclic reductionO(n log2(n)).

I Take a decision according to the value of the residueerror:

* If the residue error is small then we already have goodsolutions.

* Otherwise, we must perform divide & conquer O(4n3/3)diagonalizations and discard the smallest eigenvalues.

I The next time we solve this same kind of linear systems:* If they used to be well-conditioned then we just process LDLtO(n3/6).

* Otherwise we execute directly the combination ofHouseholder tridiagonalization and divide & conquerdiagonalization.


Adaptation issues to American options LDLt

Shared occupation n(n + 1)/2 + n and complexityO(n3/6)

A = LDLt , Dj,j = Aj,j −j−1∑k=1

L2j,kDk,k ,

Li,j =1

Dj,j

Ai,j −j−1∑k=1

Li,kLj,kDk,k

if i > j .

(19)

Standard LDLtparallel strategy



Three different versions studied

1. An SIMD version that requires only independent threads, one for eachlinear system.

2. A collaborative version that involves n collaborative threads for eachlinear system with n unknowns.

3. An optimal hybrid solution that involves n∗ (n∗ < n) collaborativethreads for each linear system with n unknowns.

The speedup of thecollaborative and

the hybrid versionswhen compared to

the SIMDimplementation.



(a) Optimal number of collaborative threads(b) Number of systems solved within a second

(a) (b)


Adaptation issues to American options Householder tridiagonalization + PCR

Householder tridiagonalization: Shared occupationn2 + 2n and complexity O(4n3/3)

1. An SIMD version that requires only independent threads, one for eachlinear system.

2. A collaborative version that involves n collaborative threads for eachlinear system with n unknowns.

For symmetric A

U = Ht3...H

tnAHn...H3 =

d1 c1

c1 d2 c2 0

c2 d3. . .

. . .. . .

. . .

0. . .

. . . cn−1cn−1 dn

, (20)

with each Householder matrix H given by

H = I − uut/b, b = utu/2. (21)



From CR: Shared occupation 3n and complexityO(nlog2(n))

CR when n = 8

e1 e7e5

z2 z8z4 z6

Stepg1:gForwardgreductiongtogag4-unknowngsystemginvolvinggz2,gz4,gz6gandgz8

Stepg2:gForwardgreductiongtoag2-unknowngsystemginvolvinggz4gandgz8

Stepg3:gSolveg2-unknowngsystem

Stepg4:gBackwardgsubstitutiongtosolvegthegrestg2gunknowns

Stepg5:gBackwardgsubstitutiongtosolvegthegrestg4gunknowns

A A A AA AA A

AAA

AAAA

AA

AA

AA

A AAA AAA

AA

AA

A

A

AA

e3

z3z1 z5 z7

e1 e2 e3 e4 e5 e6 e7 e8

e'8e'6e'4e'2

e''4 e''8

e'6e'2 z4 z8

z2 z4 z6 z8



To a new version of PCR: Shared occupation 4n andcomplexity O(nlog2(n))

PCR when n = 8

Step 1: Reduced to 2 systems of 4 unknowns

Step 2: Reduced to 4 systems of 2 unknowns

Step 3: Solve

AAAA AA AA AAA AAAA AA AA AA A

e'7e'5e'3e'1AAAA AA A AAA AAAA AA A AA A

e''6e''2e''5e''1 e''8e''4e''7e''3

e'7e'5e'3e'1 e'2 e'4 e'6 e'8

AAAA A A AA AAAAA A A A

z4z2z3z1 z8z6z7z5

e''7e''3e''5e''1 e''2 e''6 e''4 e''8

e'8e'6e'4e'2

e4e3e2e1 e5 e6 e7 e8



d1 c1c1 d2 c2

c2 d3 c3c3 d4 c4

c4 d5 c5c5 d6 c6

c6 d7

(R)−→

d ′1 0 c ′20 d ′2 0 c ′3c ′2 0 d ′3 0 c ′4

c ′3 0 d ′4 0 c ′5c ′4 0 d ′5 0 c ′6

c ′5 0 d ′6 0c ′6 0 d ′7

(P)−→

d ′1 c ′2c ′2 d ′3 c ′4

c ′4 d ′5 c ′6c ′6 d ′7 0

0 d ′2 c ′3c ′3 d ′4 c ′5

c ′5 d ′6

(R)−→

d ′′1 0 c ′′20 d ′′3 0 c ′′4c ′′2 0 d ′′5 0

c ′4 0 d ′7 00 d ′′2 0 c ′′3

0 d ′′4 0c ′′3 0 d ′′6

(P)−→

d ′′1 c ′′2c ′′2 d ′′5 0

0 d ′′3 c ′′4c ′′4 d ′7 0

0 d ′′2 c ′′3c ′′3 d ′′6 0

0 d ′′4



(a) Number of PCRs + two matrix/vectormultiplications performed per second.(b) LDLt vs. tridiagonal + PCR

(a) (b)


Adaptation issues to American options Divide and conquer for eigenproblem

For ill-conditioned systemsStandardprocedure

I Tridiagonal Householder decomposition A = QUQt where Q is orthogonaland U is symmetric tridiagonal.

I Divide & conquer algorithm for symmetric tridiagonal eigenproblems toestablish U = ODOt where O is orthogonal and D is diagonal.

I Discard the smallest eigenvalues of D that provide a condition numberlarger than 105.

U =

d1 c1

c1. . .

. . .. . .

. . . cm−1cm−1 dm − cm 0

0 dm+1 − cm cm+1

cm+1. . .

. . .. . .

. . . cn−1cn−1 dn

+ cm1m,m+11tm,m+1

=

(U1 00 U2

)+ cm1m,m+11tm,m+1

(22)



Shared occupation 2n(n + 2) + 21+blog2(n−1)c andcomplexity O(4n3/3)

Steps1.

U =

(O1 00 O2

)((D1 00 D2

)+ cmuu

t

)(Ot

1 00 Ot

2

)where u =

(Ot

1 00 Ot

2

)1m,m+1 =

(last column of Ot

1first column of Ot

2

).

2. Let Λ = {λ1, ..., λn}, ordered family of eigenvalues of(

D1 00 D2

). If

cm 6= 0 and the eigenvalue λ of U satisfies λ /∈ Λ, then its value isobtained as a solution of

n∑i=1

u2i

λi − λ+

1cm

= 0. (23)

3. From u and the solutions of (23), Löwner’s Theorem provides vector u

that is used to compute the eigenvector Vλ of(

D1 00 D2

)+ cmuut

4. Let W = (Vλ)λ eigenvalue of U , we get the eigenvectors of U thanks to the

multiplication(

Q1 00 Q2

)W .



Additional details on steps 1. and 2.

Step 1.

Advantage: Pure divide and conquer algorithm, it prevents to haveeigenvalues of multiplicity bigger than two at each conquering step.

Step 2. Using Gragg’s scheme (based on Newton’s method): Choose hk such thathk (λ) = xk,0 + xk,1/(λk − λ) + xk,2/(λk+1 − λ) matchesn∑

i=1

u2i

λi − λ+

1cm

at its root ∈ (λk , λk+1) up to the second derivative.

Advantage: Cubic monotonic convergence.



Comparison with Householder tridiagonalization

Execution timecomparison:

TDC/THouseholder

0 10 20 30 40 50 60 702

3

4

5

6

7

8

9

10

System size

CommmentsI Small matrices.I Iterative algorithm to solve the secular equation.I Divergence produced by deflation.


Some simulation results

Plan






Conclusion



Within less than 1 minute simulation on GPU:M0 = 131K , N = 10, Neds = 50

EuropeanPath-dependent

option Φ(ST ) =

(S1T

2+

S2T

2− S

3T

)+

M1 Θ0 Θ0 std CVA0,T CVA0,T std√M0

N0.01364 4 ∗ 10−5 0.0296 2 ∗ 10−4

√M0√N

0.01307 4 ∗ 10−5 0.0294 2 ∗ 10−4

√M0 0.01265 3 ∗ 10−5 0.0291 2 ∗ 10−4



Within less than 1 minute simulation on GPU:M0 = 131K , N = 10, Neds = 50

EuropeanPath-dependent

option Φ(ST ) =

(3S1

T

10+

7S2T

10− S

3T

)+

−(7S1

T

10+

3S2T

10− S

3T

)+

M1 Θ0 Θ0 std CVA0,T CVA0,T std√M0

N2.72 ∗ 10−3 10−5 0.0365 8 ∗ 10−4

√M0√N

2.44 ∗ 10−3 10−5 0.0453 8 ∗ 10−4

√M0 2.28 ∗ 10−3 10−5 0.0520 8 ∗ 10−4

√N√M0 2.24× 10−3 10−5 0.0528 8× 10−4



Within less than 1 minute simulation on GPU:M0 = 131K , N = 10, Neds = 50 and n = 10

American option

Φ(ST ) =

(K −

S1T

3−

S2T

3−

S3T

3

)+

M1 Θ0 Θ0 std CVA0,T CVA0,T std√M0√N

0.0242 10−4 0.0356 2 ∗ 10−4

√M0 0.0229 10−4 0.0351 2 ∗ 10−4


Conclusion

Plan






Conclusion


Conclusion

Mathematical and computing work suited to GPUs

Mathematical partI Extending CVA (TVA) on American options.I Judicious choice of the number of inner and outer trajectories.

Computing partI CUDA source code of: LDLt, Householder reduction, parallel cyclic

reduction that is not necessary a power of two and divide and conquer foreigenproblem.

I Execution time comparison of the different methods mentioned above.I Original method to further optimize the adaptation of LDLt to our

context.I Original parallel cyclic reduction that can be used for any vector size and

not only a power of two.I Precise answer to the following question: Must we systematically use

Householder tridiagonalization with divide & conquer when we suspectthe random linear systems to be ill-conditioned?


Conclusion

Future workI Further developments on the use of Nested Monte Carlo on GPUs for

BSDEs.I Studying the rounding errors and error propagation.I Use CADNA library to test each procedure:

http://www-pequan.lip6.fr/cadna/

Source codeI http://www.proba.jussieu.fr/~abbasturki/soft.htmI Premia library:

https://www.rocq.inria.fr/mathfi/Premia/index.html

ReferencesI L.A. Abbas-Turki and Stef Graillat. Resolution of a large number of small

random symmetric linear systems in single precision arithmetic on GPUs:https://hal.archives-ouvertes.fr/hal-01295549

I L.A. Abbas-Turki and M.A. Mikou. TVA on American Derivatives:https://hal.archives-ouvertes.fr/hal-01142874


http://www-pequan.lip6.fr/cadna/

http://www.proba.jussieu.fr/~abbasturki/soft.htm

https://www.rocq.inria.fr/mathfi/Premia/index.html

https://hal.archives-ouvertes.fr/hal-01295549

https://hal.archives-ouvertes.fr/hal-01142874

The End

Thank you

Questions?


From CVA to the Resolution of a Large Number of Small...

Documents

Transcript of From CVA to the Resolution of a Large Number of Small...