arXiv:2107.14478v2 [math.NA] 5 Sep 2021
ERROR ANALYSIS OF DEEP RITZ METHODS FOR ELLIPTIC EQUATIONS
YULING JIAO∗, YANMING LAI † , YISU LO‡ , YANG WANG § , AND YUNFEI YANG ¶
Abstract. Using deep neural networks to solve PDEs has attracted a lot of attention recently. However, the theoretical understanding of why deep learning methods work lags far behind their empirical success. In this paper, we provide a rigorous numerical analysis of the deep Ritz method (DRM) [43] for second order elliptic equations with Dirichlet, Neumann and Robin boundary conditions, respectively. We establish the first nonasymptotic convergence rate in the H¹ norm for DRM using deep networks with smooth activation functions, including the logistic and hyperbolic tangent functions. Our results show how to set the hyper-parameters of depth and width to achieve the desired convergence rate in terms of the number of training samples.
Key words. DRM, Neural Networks, Approximation error, Rademacher complexity, Chaining method,
Pseudo dimension, Covering number.
AMS subject classifications. 65N99
1. Introduction. Partial differential equations (PDEs) are among the fundamental mathematical models for studying a variety of phenomena arising in science and engineering. Many conventional numerical methods have been successfully established for solving PDEs in low dimensions (d ≤ 3), particularly the finite element method [6, 7, 33, 39, 22]. However, one encounters difficulties in both theoretical analysis and numerical implementation when extending conventional numerical schemes to high-dimensional PDEs. The classical analysis of convergence, stability and other properties becomes troublesome due to the complex construction of the finite element space [7, 6]. Moreover, in terms of practical computation, the scale of the discrete problem increases exponentially with respect to the dimension.
Motivated by the well-known fact that deep learning has achieved great success in high-dimensional data analysis, with applications in discriminative, generative and reinforcement learning [18, 14, 37], solving high-dimensional PDEs with deep neural networks has become a promising approach and has attracted much attention [3, 38, 27, 34, 43, 45, 5, 17]. Roughly speaking, these works can be divided into three categories. The first category uses deep neural networks to improve classical numerical methods, see for example [40, 42, 21, 15]. In the second category, neural operators are introduced to learn mappings between infinite-dimensional spaces with neural networks [24, 2, 25]. In the last category, one utilizes deep neural networks to approximate the solutions of PDEs directly, including physics-informed neural networks (PINNs) [34], the deep Ritz method (DRM) [43] and the deep Galerkin method (DGM) [45]. PINNs are based on residual minimization for solving PDEs [3, 38, 27, 34]. Proceeding from
∗School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
†School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
‡Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
§Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
¶Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
the variational form, [43, 45, 44] propose neural-network based methods related to the classical Ritz and Galerkin methods. In [45], weak adversarial networks (WAN) are proposed, inspired by the Galerkin method. Based on the Ritz method, [43] proposes the deep Ritz method (DRM) to solve variational problems corresponding to a class of PDEs.
1.1. Related works and contributions. The idea of using neural networks to solve PDEs goes back to the 1990s [23, 8]. Despite great empirical achievements in recent years, a challenging and interesting question is to provide a rigorous error analysis, as has been done for the finite element method. Several recent efforts have been devoted to making progress along this line, see for example [11, 28, 30, 32, 26, 20, 36, 41, 10]. In [28], a least squares minimization method with two-layer neural networks is studied; the optimization error is analyzed under an over-parametrization assumption, and the generalization error is analyzed without it. In [26, 44, 19], generalization error bounds for two-layer neural networks are derived under the assumption that the exact solutions lie in a spectral Barron space. Although the studies in [26, 44] can overcome the curse of dimensionality, it should be pointed out that it is difficult to generalize these results to deep neural networks, or to the situation where the underlying solutions lie in general Sobolev spaces.
Two important questions that have not been addressed in the above-mentioned studies are: Can we provide a convergence result for DRM requiring only that the target solution lies in H²? How should the depth and width be chosen to achieve the desired convergence rate? In this paper, we give firm answers to these questions by providing an error analysis of using DRM with sigmoidal deep neural networks to solve second order elliptic equations with Dirichlet, Neumann and Robin boundary conditions, respectively.

Let u_{φ_A} be the solution produced by a random solver for DRM (i.e., u_{φ_A} is the solution of (3.6)), and use the notation N_ρ(D, n_D, B_θ) to refer to the collection of functions implemented by a ρ-neural network with depth D, total number of nonzero weights n_D and each weight bounded by B_θ. Set ρ = 1/(1+e^{−x}) or (e^x − e^{−x})/(e^x + e^{−x}). Our main contributions are as follows:
• Let u_R be the weak solution of the Robin problem (3.1)(3.2c). For any ǫ > 0 and µ ∈ (0, 1), set the parameterized function class

P = N_ρ(C log(d+1), C(d, β)ǫ^{−d/(1−µ)}, C(d, β)ǫ^{−(9d+8)/(2−2µ)})

and number of samples

N = M = C(d, Ω, coe, α, β)ǫ^{−Cd log(d+1)/(1−µ)}.

If the optimization error of u_{φ_A} is E_opt ≤ ǫ, then

E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} ‖u_{φ_A} − u_R‖_{H¹(Ω)} ≤ C(Ω, coe, α)ǫ.

• Let u_D be the weak solution of the Dirichlet problem (3.1)(3.2a). Set α = 1, g = 0. For any ǫ > 0, let β = C(coe)ǫ be the penalty parameter, set the parameterized function class

P = N_ρ(C log(d+1), C(d)ǫ^{−5d/(2(1−µ))}, C(d)ǫ^{−(45d+40)/(4−4µ)})

and number of samples

N = M = C(d, Ω, coe)ǫ^{−Cd log(d+1)/(1−µ)}.

If the optimization error E_opt ≤ ǫ, then

E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} ‖u_{φ_A} − u_D‖_{H¹(Ω)} ≤ C(Ω, coe)ǫ.
We summarize the related works and our results in Table 1.1.
Table 1.1: Previous works and our result

Paper      | Depth and activation functions              | Equation(s)                               | Regularity condition
[28]       | D = 2, ReLU³                                | Second order differential equation        | u* ∈ Barron class
[26]       | D = 2, Softplus                             | Poisson equation and Schrödinger equation | u* ∈ Barron class
[19]       | D = 2, ReLUᵏ                                | 2m-th order differential equation         | u* ∈ Barron class
[9]        | D = O(log d), ReLU²                         | Second order elliptic equation            | u* ∈ C²
This paper | D = O(log d), Logistic and hyperbolic tangent | Second order elliptic equation          | u* ∈ H²
The rest of the paper is organized as follows. In Section 2, we give some preliminaries. In Section 3, we present the DRM and the error decomposition used for the analysis of DRM. In Sections 4 and 5, we give a detailed analysis of the approximation error and the statistical error, respectively. In Section 6, we present our main results. We give a conclusion and a short discussion in Section 7.
2. Neural Network. Due to its strong expressivity, the neural network function class plays an important role in machine learning. A variety of neural networks are chosen as parameterized function classes in the training process. We now introduce some notation related to neural networks which will simplify our later discussion. Let D ∈ N₊. A function f : R^d → R^{n_D} implemented by a neural network is defined by

f_0(x) = x,
f_ℓ(x) = ρ(A_ℓ f_{ℓ−1} + b_ℓ) for ℓ = 1, . . . , D − 1,
f := f_D(x) = A_D f_{D−1} + b_D,    (2.1)

where A_ℓ = (a_{ij}^{(ℓ)}) ∈ R^{n_ℓ×n_{ℓ−1}} and b_ℓ = (b_i^{(ℓ)}) ∈ R^{n_ℓ}. ρ is called the activation function and acts componentwise. D is called the depth of the network and W := max{n_ℓ : ℓ = 1, · · · , D} is called the width of the network. φ = {A_ℓ, b_ℓ}_ℓ are called the weight parameters. For convenience, we denote by n_i, i = 1, · · · , D, the number of nonzero weights in the first i layers of the representation (2.1). Clearly n_D is the total number of nonzero weights. Sometimes we denote a function implemented by a neural network by f_ρ for short. We use the notation N_ρ(D, n_D, B_θ) to refer to the collection of functions implemented by a ρ-neural network with depth D, total number of nonzero weights n_D and each weight bounded by B_θ.
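As an illustration, the feed-forward representation (2.1) can be sketched in a few lines of NumPy. The architecture, widths and weight bound below are hypothetical choices for demonstration, not ones used in the analysis.

```python
import numpy as np

def init_network(widths, B_theta, rng):
    """Random parameters for a network of the form (2.1); widths = [n_0, ..., n_D]
    are layer sizes (a hypothetical small instance, weights drawn in [-B_theta, B_theta])."""
    return [(rng.uniform(-B_theta, B_theta, (m, n)),
             rng.uniform(-B_theta, B_theta, m))
            for n, m in zip(widths[:-1], widths[1:])]

def forward(params, x, rho=np.tanh):
    """f_0(x) = x; f_l = rho(A_l f_{l-1} + b_l) for l < D; f = A_D f_{D-1} + b_D."""
    f = x
    for A, b in params[:-1]:      # hidden layers, activation acts componentwise
        f = rho(A @ f + b)
    A_D, b_D = params[-1]         # affine output layer, no activation
    return A_D @ f + b_D

rng = np.random.default_rng(0)
params = init_network([3, 8, 8, 1], B_theta=1.0, rng=rng)  # depth D = 3, width W = 8
y = forward(params, np.zeros(3))
```

Since tanh is bounded by 1 and each weight by B_θ = 1, the output of this instance is bounded by the width of the last hidden layer plus one.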
3. Deep Ritz Method and Error Decomposition. Let Ω be a convex bounded open set in R^d and assume that ∂Ω ∈ C^∞. Without loss of generality we assume that Ω ⊂ [0, 1]^d. We consider the following second order elliptic equation:

−Δu + wu = f in Ω    (3.1)

with three kinds of boundary conditions:

u = 0 on ∂Ω    (3.2a)
∂u/∂n = g on ∂Ω    (3.2b)
αu + β ∂u/∂n = g on ∂Ω, α, β ∈ R, β ≠ 0    (3.2c)

which are called the Dirichlet, Neumann and Robin boundary conditions, respectively. Note that for the Dirichlet problem, we only consider the homogeneous boundary condition here since the inhomogeneous case can be turned into the homogeneous case by translation. We also remark that the Neumann condition (3.2b) is covered by the Robin condition (3.2c). Hence in the following we only consider the Dirichlet problem and the Robin problem.

We make the following assumption on the known terms in the equation:

(A) f ∈ L^∞(Ω), g ∈ H^{1/2}(∂Ω), w ∈ L^∞(Ω), w ≥ c_w,

where c_w is some positive constant. In the following we abbreviate

C(‖f‖_{L^∞(Ω)}, ‖g‖_{H^{1/2}(∂Ω)}, ‖w‖_{L^∞(Ω)}, c_w),

constants depending on the known terms in the equation, as C(coe) for simplicity.
For problem (3.1)(3.2a), the variational problem is to find u ∈ H¹₀(Ω) such that

(∇u, ∇v) + (wu, v) = (f, v), ∀v ∈ H¹₀(Ω).    (3.3a)

The corresponding minimization problem is

min_{u∈H¹₀(Ω)} (1/2) ∫_Ω (|∇u|² + wu² − 2fu) dx.    (3.3b)

Lemma 3.1. Let (A) hold. Let u_D be the solution of problem (3.3a) (also (3.3b)). Then u_D ∈ H²(Ω).

Proof. See [12].
For problem (3.1)(3.2c), the variational problem is to find u ∈ H¹(Ω) such that

(∇u, ∇v) + (wu, v) + (α/β)(T₀u, T₀v)_{∂Ω} = (f, v) + (1/β)(g, T₀v)_{∂Ω}, ∀v ∈ H¹(Ω),    (3.4a)

where T₀ is the zero order trace operator. The corresponding minimization problem is

min_{u∈H¹(Ω)} ∫_Ω ((1/2)|∇u|² + (1/2)wu² − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T₀u)² − gT₀u) ds.    (3.4b)

Lemma 3.2. Let (A) hold. Let u_R be the solution of problem (3.4a) (also (3.4b)). Then u_R ∈ H²(Ω) and ‖u_R‖_{H²(Ω)} ≤ C(coe)/β for any β > 0.

Proof. See [12].
Intuitively, when α = 1, g = 0 and β → 0, we expect that the solution of the Robin problem converges to the solution of the Dirichlet problem. Hence we only need to consider the Robin problem, since the Dirichlet problem can be handled through a limit process. Define L as a functional on H¹(Ω):

L(u) := ∫_Ω ((1/2)|∇u|² + (1/2)wu² − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T₀u)² − gT₀u) ds.

The next lemma verifies this assertion.

Lemma 3.3. Let (A) hold. Let α = 1, g = 0. Let u_D be the solution of problem (3.3a) (also (3.3b)) and u_R the solution of problem (3.4a) (also (3.4b)). There holds

‖u_R − u_D‖_{H¹(Ω)} ≤ C(coe)β.
Proof. We first have

∫_Ω ∇u_D · ∇v dx − ∫_{∂Ω} T₁u_D v ds + ∫_Ω wu_D v dx = ∫_Ω fv dx, ∀v ∈ H¹(Ω),

with T₁ being the first order trace operator. Hence for any u ∈ H¹(Ω),

L(u) = ∫_Ω ((1/2)|∇u|² + (1/2)wu² − fu) dx + (1/2β) ∫_{∂Ω} (T₀u)² ds
     = ∫_Ω ((1/2)|∇u|² + (1/2)wu²) dx + (1/2β) ∫_{∂Ω} (T₀u)² ds − ∫_Ω ∇u_D · ∇u dx + ∫_{∂Ω} T₁u_D u ds − ∫_Ω wu_D u dx
     = ∫_Ω ((1/2)|∇u − ∇u_D|² + (1/2)w(u − u_D)²) dx + (1/2β) ∫_{∂Ω} (T₀u + βT₁u_D)² ds
       − ∫_Ω ((1/2)|∇u_D|² + (1/2)wu_D²) dx − (β/2) ∫_{∂Ω} (T₁u_D)² ds.    (3.5)

Define

R_β(u) = ∫_Ω ((1/2)|∇u − ∇u_D|² + (1/2)w(u − u_D)²) dx + (1/2β) ∫_{∂Ω} (T₀u + βT₁u_D)² ds.

Since u_R is the minimizer of L, from (3.5) we conclude that it is also the minimizer of R_β. Note that u_D ∈ H²(Ω) (Lemma 3.1); by the trace theorem we know T₁u_D ∈ H^{1/2}(∂Ω) and hence there exists φ ∈ H¹(Ω) such that T₀φ = −T₁u_D. Set u = βφ + u_D; then

C(coe)‖u_R − u_D‖²_{H¹(Ω)} ≤ R_β(u_R) ≤ R_β(u) = β² ∫_Ω ((1/2)|∇φ|² + (1/2)wφ²) dx = C(u_D, coe)β².
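The penalty convergence in Lemma 3.3 can be observed numerically in one dimension. The sketch below is a hypothetical instance with Ω = (0,1), w = 1 and f = (1+π²)sin(πx), so that u_D = sin(πx); it solves the Robin problem (α = 1, g = 0) by a simple finite-difference scheme and checks that the distance to u_D shrinks as β decreases. The discretization is purely illustrative, not part of the paper's method.

```python
import numpy as np

def solve_robin(beta, n=400):
    """Finite differences for -u'' + u = f on (0,1) with the Robin condition
    u + beta * du/dn = 0 at both endpoints (first-order one-sided boundary stencils)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    f = (1 + np.pi**2) * np.sin(np.pi * x)   # makes u_D = sin(pi x) the Dirichlet solution
    A = np.zeros((n + 1, n + 1)); rhs = np.zeros(n + 1)
    for i in range(1, n):
        A[i, i - 1] = A[i, i + 1] = -1.0 / h**2
        A[i, i] = 2.0 / h**2 + 1.0
        rhs[i] = f[i]
    # at x = 0 the outward normal derivative is -u'(0): u0*(1 + beta/h) - (beta/h)*u1 = 0
    A[0, 0] = 1.0 + beta / h; A[0, 1] = -beta / h
    # at x = 1 it is +u'(1): un*(1 + beta/h) - (beta/h)*u_{n-1} = 0
    A[n, n] = 1.0 + beta / h; A[n, n - 1] = -beta / h
    return x, np.linalg.solve(A, rhs)

x, u1 = solve_robin(0.10)
_, u2 = solve_robin(0.05)
uD = np.sin(np.pi * x)
err1, err2 = np.abs(u1 - uD).max(), np.abs(u2 - uD).max()  # should scale roughly like beta
```

Halving β roughly halves the maximum discrepancy, consistent with the O(β) bound of the lemma.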
Note that L can be written equivalently as

L(u) = |Ω| E_{X∼U(Ω)} ((1/2)|∇u(X)|² + (1/2)w(X)u²(X) − f(X)u(X))
     + (|∂Ω|/β) E_{Y∼U(∂Ω)} ((α/2)(T₀u)²(Y) − g(Y)T₀u(Y)),

where U(Ω) and U(∂Ω) are the uniform distributions on Ω and ∂Ω, respectively. We then introduce a discrete version of L:

L̂(u) := (|Ω|/N) Σ_{i=1}^N ((1/2)|∇u(X_i)|² + (1/2)w(X_i)u²(X_i) − f(X_i)u(X_i))
      + (|∂Ω|/(βM)) Σ_{j=1}^M ((α/2)(T₀u)²(Y_j) − g(Y_j)T₀u(Y_j)),

where {X_i}_{i=1}^N and {Y_j}_{j=1}^M are i.i.d. random variables distributed according to U(Ω) and U(∂Ω), respectively. We now consider a minimization problem with respect to L̂:

min_{u_φ∈P} L̂(u_φ)    (3.6)

where P refers to the parameterized function class. We denote by u_φ a solution of problem (3.6). Finally, we call a (random) solver A, say SGD, to minimize L̂, and denote the output of A, say u_{φ_A}, as the final solution.
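For a concrete (hypothetical) one-dimensional instance, the empirical loss L̂ in (3.6) is a plain Monte Carlo sum. The sketch below evaluates it for a fixed trial function u(x) = x(1−x) on Ω = (0,1) with w = 1, f = 1, α = β = 1, g = 0, for which the continuous loss L(u) = 1/60 has a closed form; the trial function and coefficients are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

u  = lambda x: x * (1.0 - x)       # hypothetical trial function (u = 0 on the boundary)
du = lambda x: 1.0 - 2.0 * x       # its derivative

def empirical_loss(N, M):
    """Monte Carlo estimate of the discrete loss in (3.6) for Omega = (0,1),
    w = 1, f = 1, alpha = beta = 1, g = 0; here |Omega| = 1 and |dOmega| = 2."""
    X = rng.uniform(0.0, 1.0, N)                 # X_i ~ U(Omega)
    Y = rng.integers(0, 2, M).astype(float)      # Y_j ~ U({0, 1}) = U(dOmega)
    interior = np.mean(0.5 * du(X)**2 + 0.5 * u(X)**2 - u(X))
    boundary = 2.0 * np.mean(0.5 * u(Y)**2)      # g = 0 term drops out
    return interior + boundary

est = empirical_loss(200_000, 1_000)
exact = 1.0 / 60.0   # closed-form L(u) for this trial function
```

With 2·10⁵ interior samples the estimate matches the exact value to within a few thousandths, illustrating the statistical error studied in Section 5.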
The following error decomposition enables us to apply different methods to deal with different kinds of error.

Proposition 3.1. Let (A) hold. Assume that P ⊂ H¹(Ω). Let u_R and u_D be the solutions of problem (3.4a) (also (3.4b)) and (3.3a) (also (3.3b)), respectively. Let u_{φ_A} be the solution of problem (3.6) generated by a random solver.

(1)

‖u_{φ_A} − u_R‖_{H¹(Ω)} ≤ C(coe) [E_app + E_sta + E_opt]^{1/2},

where

E_app = (1/β) C(Ω, coe, α) inf_{u∈P} ‖u − u_R‖²_{H¹(Ω)},
E_sta = sup_{u∈P} [L(u) − L̂(u)] + sup_{u∈P} [L̂(u) − L(u)],
E_opt = L̂(u_{φ_A}) − L̂(u_φ).

(2) Set α = 1, g = 0. Then

‖u_{φ_A} − u_D‖_{H¹(Ω)} ≤ C(coe) [E_app + E_sta + E_opt + ‖u_R − u_D‖²_{H¹(Ω)}]^{1/2}.
Proof. We only prove (1), since (2) is a direct consequence of (1) and the triangle inequality. For any u ∈ P, set v = u − u_R; then

L(u) = L(u_R + v)
     = (1/2)(∇(u_R + v), ∇(u_R + v))_{L²(Ω)} + (1/2)(u_R + v, u_R + v)_{L²(Ω;w)} − ⟨u_R + v, f⟩_{L²(Ω)}
       + (α/2β)(T₀u_R + T₀v, T₀u_R + T₀v)_{L²(∂Ω)} − (1/β)⟨T₀u_R + T₀v, g⟩_{L²(∂Ω)}
     = (1/2)(∇u_R, ∇u_R)_{L²(Ω)} + (1/2)(u_R, u_R)_{L²(Ω;w)} − ⟨u_R, f⟩_{L²(Ω)} + (α/2β)(T₀u_R, T₀u_R)_{L²(∂Ω)}
       − (1/β)⟨T₀u_R, g⟩_{L²(∂Ω)} + (1/2)(∇v, ∇v)_{L²(Ω)} + (1/2)(v, v)_{L²(Ω;w)} + (α/2β)(T₀v, T₀v)_{L²(∂Ω)}
       + [(∇u_R, ∇v)_{L²(Ω)} + (u_R, v)_{L²(Ω;w)} − ⟨v, f⟩_{L²(Ω)} + (α/β)(T₀u_R, T₀v)_{L²(∂Ω)} − (1/β)⟨T₀v, g⟩_{L²(∂Ω)}]
     = L(u_R) + (1/2)(∇v, ∇v)_{L²(Ω)} + (1/2)(v, v)_{L²(Ω;w)} + (α/2β)(T₀v, T₀v)_{L²(∂Ω)},

where the last equality is due to the fact that u_R is the solution of equation (3.4a). Hence

C(coe)‖v‖²_{H¹(Ω)} ≤ L(u) − L(u_R) = (1/2)(∇v, ∇v)_{L²(Ω)} + (1/2)(v, v)_{L²(Ω;w)} + (α/2β)(T₀v, T₀v)_{L²(∂Ω)} ≤ (1/β) C(Ω, coe, α)‖v‖²_{H¹(Ω)},

where we apply the trace inequality

‖T₀v‖_{L²(∂Ω)} ≤ C(Ω)‖v‖_{H¹(Ω)};

see [1] for more details. In other words, we obtain

C(coe)‖u − u_R‖²_{H¹(Ω)} ≤ L(u) − L(u_R) ≤ (1/β) C(Ω, coe, α)‖u − u_R‖²_{H¹(Ω)}.    (3.7)

Now, letting u be any element in P, we have

L(u_{φ_A}) − L(u_R)
  = L(u_{φ_A}) − L̂(u_{φ_A}) + L̂(u_{φ_A}) − L̂(u_φ) + L̂(u_φ) − L̂(u) + L̂(u) − L(u) + L(u) − L(u_R)
  ≤ sup_{u∈P} [L(u) − L̂(u)] + [L̂(u_{φ_A}) − L̂(u_φ)] + sup_{u∈P} [L̂(u) − L(u)] + (1/β) C(Ω, coe, α)‖u − u_R‖²_{H¹(Ω)},

where the last step is due to inequality (3.7) and the fact that L̂(u_φ) − L̂(u) ≤ 0. Since u can be any element in P, we take the infimum over u ∈ P on both sides of the above display:

L(u_{φ_A}) − L(u_R) ≤ inf_{u∈P} (1/β) C(Ω, coe, α)‖u − u_R‖²_{H¹(Ω)} + sup_{u∈P} [L(u) − L̂(u)] + sup_{u∈P} [L̂(u) − L(u)] + [L̂(u_{φ_A}) − L̂(u_φ)].    (3.8)

Combining (3.7) and (3.8) yields the result.
4. Approximation Error. For neural network approximation in Sobolev spaces, [16] is a comprehensive study covering a variety of activation functions, including ReLU, sigmoidal type functions, etc. The key idea in [16] for building the upper bound in Sobolev spaces is to construct an approximate partition of unity.

Denote F_{s,p,d} := {f ∈ W^{s,p}([0,1]^d) : ‖f‖_{W^{s,p}([0,1]^d)} ≤ 1}.

Theorem 4.1 (Proposition 4.8, [16]). Let p ≥ 1, s, k, d ∈ N₊, s ≥ k + 1. Let ρ be the logistic function 1/(1+e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ǫ > 0 and f ∈ F_{s,p,d}, there exists a neural network f_ρ with depth C log(d + s) and C(d, s, p, k)ǫ^{−d/(s−k−µk)} nonzero weights such that

‖f − f_ρ‖_{W^{k,p}([0,1]^d)} ≤ ǫ.

Moreover, the weights in the neural network are bounded in absolute value by

C(d, s, p, k) ǫ^{−2 − (2(d/p+d+k+µk)+d/p+d)/(s−k−µk)},

where µ is an arbitrarily small positive number.

Remark 4.1. The bounds in the theorem can be found in the proof of [16, Proposition 4.8], except that the bound on the depth is not given there explicitly. In their proof, they partition [0,1]^d into small patches, approximate f by a sum of localized polynomials Σ_m φ_m p_m, and approximately implement Σ_m φ_m p_m by a neural network, where the bump functions φ_m form an approximate partition of unity and p_m = Σ_{|α|<s} c_{f,m,α} x^α are the averaged Taylor polynomials. As shown in [16], φ_m can be approximated by products of the d-dimensional output of a neural network with a constant number of layers. The identity map I(x) = x and the product function ×(a, b) = ab can also be approximated by neural networks with a constant number of layers. In order to approximate φ_m x^α, we need to implement d + s − 1 products. Hence, the required depth can be bounded by C log(d + s).

Since the region [0,1]^d is larger than the region Ω we consider (recall that we assumed without loss of generality that Ω ⊂ [0,1]^d at the beginning), we need the following extension result.

Lemma 4.2. Let k ∈ N₊, 1 ≤ p < ∞. There exists a linear operator E from W^{k,p}(Ω) to W₀^{k,p}([0,1]^d) such that Eu = u in Ω.

Proof. See Theorem 7.25 in [13].

From Lemma 3.2 we know that our target function u_R ∈ H²(Ω). Hence we are able to obtain an approximation result in the H¹-norm.

Corollary 4.3. Let ρ be the logistic function 1/(1+e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ǫ > 0 and f ∈ H²(Ω) with ‖f‖_{H²(Ω)} ≤ 1, there exists a neural network f_ρ with depth C log(d+1) and C(d)ǫ^{−d/(1−µ)} nonzero weights such that

‖f − f_ρ‖_{H¹(Ω)} ≤ ǫ.

Moreover, the weights in the neural network are bounded by C(d)ǫ^{−(9d+8)/(2−2µ)}, where µ is an arbitrarily small positive number.

Proof. Set k = 1, s = 2, p = 2 in Theorem 4.1 and use the fact that ‖f − f_ρ‖_{H¹(Ω)} ≤ ‖Ef − f_ρ‖_{H¹([0,1]^d)}, where E is the extension operator of Lemma 4.2.
5. Statistical Error. In this section we investigate the statistical error sup_{u∈P} ±[L(u) − L̂(u)].

Lemma 5.1.

E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L(u) − L̂(u)] ≤ Σ_{k=1}^5 E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L_k(u) − L̂_k(u)],

where

L₁(u) = (|Ω|/2) E_{X∼U(Ω)} |∇u(X)|²,   L₂(u) = (|Ω|/2) E_{X∼U(Ω)} w(X)u²(X),
L₃(u) = −|Ω| E_{X∼U(Ω)} f(X)u(X),   L₄(u) = (α|∂Ω|/2β) E_{Y∼U(∂Ω)} (T₀u)²(Y),
L₅(u) = −(|∂Ω|/β) E_{Y∼U(∂Ω)} g(Y)T₀u(Y),

and L̂_k(u) is the discrete version of L_k(u); for example,

L̂₁(u) = (|Ω|/2N) Σ_{i=1}^N |∇u(X_i)|².

Proof. Direct result of the triangle inequality.
By the technique of symmetrization, we can bound the difference between the continuous loss L_i and the empirical loss L̂_i by the Rademacher complexity.

Definition 5.1. The Rademacher complexity of a set A ⊆ R^N is defined as

R_N(A) = E_{{σ_k}_{k=1}^N} [sup_{a∈A} (1/N) Σ_{k=1}^N σ_k a_k],

where {σ_k}_{k=1}^N are N i.i.d. Rademacher variables with P(σ_k = 1) = P(σ_k = −1) = 1/2. The Rademacher complexity of a function class F associated with random samples {X_k}_{k=1}^N is defined as

R_N(F) = E_{{X_k,σ_k}_{k=1}^N} [sup_{u∈F} (1/N) Σ_{k=1}^N σ_k u(X_k)].
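Definition 5.1 can be probed by direct simulation for a small finite set A ⊂ R^N; the vectors below are hypothetical examples. For a singleton the complexity is 0 by symmetry of the signs, while for the pair {a, −a} with a = (1,1,1,1) it equals E|σ̄| = 3/8.

```python
import numpy as np

rng = np.random.default_rng(1)

def rademacher_complexity(A, trials=40_000):
    """Monte Carlo estimate of R_N(A) = E_sigma sup_{a in A} (1/N) sum_k sigma_k a_k."""
    A = np.asarray(A, dtype=float)                     # shape (|A|, N)
    N = A.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(trials, N))  # i.i.d. Rademacher signs
    return (sigma @ A.T / N).max(axis=1).mean()        # average of per-draw suprema

a = np.ones(4)
single = rademacher_complexity([a])     # near 0: E (1/N) sum sigma_k = 0
pair = rademacher_complexity([a, -a])   # near E|mean of 4 signs| = 3/8
```

The symmetric pair shows how taking a supremum over a richer class inflates the complexity even though each individual function averages to zero.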
For the Rademacher complexity, we have the following structural result.

Lemma 5.2. Assume that w : Ω → R and |w(x)| ≤ B for all x ∈ Ω. Then for any function class F, there holds

R_N(w · F) ≤ B R_N(F),

where w · F := {ū : ū(x) = w(x)u(x), u ∈ F}.

Proof.

R_N(w · F) = (1/N) E_{{X_k,σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k w(X_k)u(X_k)
= (1/2N) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [w(X₁)u(X₁) + Σ_{k=2}^N σ_k w(X_k)u(X_k)]
  + (1/2N) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [−w(X₁)u(X₁) + Σ_{k=2}^N σ_k w(X_k)u(X_k)]
= (1/2N) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u′∈F} [w(X₁)(u(X₁) − u′(X₁)) + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u′(X_k)]
≤ (1/2N) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u′∈F} [B|u(X₁) − u′(X₁)| + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u′(X_k)]
= (1/2N) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u′∈F} [B(u(X₁) − u′(X₁)) + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u′(X_k)]
= (1/N) E_{{X_k,σ_k}_{k=1}^N} sup_{u∈F} [σ₁ B u(X₁) + Σ_{k=2}^N σ_k w(X_k)u(X_k)]
≤ · · · ≤ (B/N) E_{{X_k,σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k u(X_k) = B R_N(F).
Now we bound the difference between the continuous loss and the empirical loss in terms of Rademacher complexity.

Lemma 5.3.

E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L₁(u) − L̂₁(u)] ≤ C(Ω, coe) R_N(F₁),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L₂(u) − L̂₂(u)] ≤ C(Ω, coe) R_N(F₂),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L₃(u) − L̂₃(u)] ≤ C(Ω, coe) R_N(F₃),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L₄(u) − L̂₄(u)] ≤ (α/β) C(Ω, coe) R_N(F₄),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L₅(u) − L̂₅(u)] ≤ (1/β) C(Ω, coe) R_N(F₅),

where

F₁ = {|∇u|² : u ∈ P},  F₂ = {u² : u ∈ P},  F₃ = {u : u ∈ P},  F₄ = {u²|_{∂Ω} : u ∈ P},  F₅ = {u|_{∂Ω} : u ∈ P}.
Proof. We only present the proof for L₂, since the other inequalities can be shown similarly. Take {X̃_k}_{k=1}^N as an independent copy of {X_k}_{k=1}^N; then

L₂(u) − L̂₂(u) = (|Ω|/2) [E_{X∼U(Ω)} w(X)u²(X) − (1/N) Σ_{k=1}^N w(X_k)u²(X_k)]
= (|Ω|/2N) E_{{X̃_k}_{k=1}^N} Σ_{k=1}^N [w(X̃_k)u²(X̃_k) − w(X_k)u²(X_k)].

Hence

E_{{X_k}_{k=1}^N} sup_{u∈P} |L₂(u) − L̂₂(u)|
≤ (|Ω|/2N) E_{{X_k,X̃_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N [w(X̃_k)u²(X̃_k) − w(X_k)u²(X_k)]
= (|Ω|/2N) E_{{X_k,X̃_k,σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k [w(X̃_k)u²(X̃_k) − w(X_k)u²(X_k)]
≤ (|Ω|/2N) E_{{X̃_k,σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X̃_k)u²(X̃_k) + (|Ω|/2N) E_{{X_k,σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N (−σ_k) w(X_k)u²(X_k)
= (|Ω|/N) E_{{X_k,σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X_k)u²(X_k)
= |Ω| R_N(w · F₂) ≤ C(Ω, coe) R_N(F₂),

where the second step is due to the fact that inserting the Rademacher variables does not change the distribution, the fourth step holds because σ_k w(X_k)u²(X_k) and −σ_k w(X_k)u²(X_k) have the same distribution, and the last step uses Lemma 5.2.
In order to bound the Rademacher complexities, we need the concept of covering number.

Definition 5.4. An ǫ-cover of a set T in a metric space (S, τ) is a subset T_c ⊂ S such that for each t ∈ T there exists t_c ∈ T_c with τ(t, t_c) ≤ ǫ. The ǫ-covering number of T, denoted C(ǫ, T, τ), is defined as the minimum cardinality among all ǫ-covers of T with respect to the metric τ.

In Euclidean space, we can easily establish an upper bound on the covering number of a bounded set.
Lemma 5.5. Suppose that T ⊂ R^d and ‖t‖₂ ≤ B for all t ∈ T. Then

C(ǫ, T, ‖·‖₂) ≤ (2B√d/ǫ)^d.

Proof. Let m = ⌊2B√d/ǫ⌋ and define

T_c = {−B + ǫ/√d, −B + 2ǫ/√d, · · · , −B + mǫ/√d}^d.

Then for every t ∈ T there exists t_c ∈ T_c such that

‖t − t_c‖₂ ≤ (Σ_{i=1}^d (ǫ/√d)²)^{1/2} = ǫ.

Hence

C(ǫ, T, ‖·‖₂) ≤ |T_c| = m^d ≤ (2B√d/ǫ)^d.
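The explicit grid cover in the proof of Lemma 5.5 can be checked numerically in a small dimension; the values of d, B and ǫ below are arbitrary illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def grid_cover(d, B, eps):
    """The cover T_c from the proof of Lemma 5.5: an m^d grid with spacing eps/sqrt(d)."""
    m = int(np.floor(2 * B * np.sqrt(d) / eps))
    coords = [-B + k * eps / np.sqrt(d) for k in range(1, m + 1)]
    return np.array(list(itertools.product(coords, repeat=d)))

d, B, eps = 2, 1.0, 0.5
Tc = grid_cover(d, B, eps)                    # here m = 5, so |T_c| = 25 <= (2B sqrt(d)/eps)^d
pts = rng.uniform(-B, B, size=(5000, d))
pts = pts[np.linalg.norm(pts, axis=1) <= B]   # sample points of T = {||t||_2 <= B}
dists = np.linalg.norm(pts[:, None, :] - Tc[None, :, :], axis=2).min(axis=1)
```

Every sampled point of the ball lies within ǫ of a grid point, and the grid size respects the stated bound.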
A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space. Such a property plays an essential role in our analysis of the statistical error.

Lemma 5.6. Let F be a parameterized class of functions: F = {f(x; θ) : θ ∈ Θ}. Let ‖·‖_Θ be a norm on Θ and ‖·‖_F a norm on F. Suppose that the mapping θ ↦ f(x; θ) is L-Lipschitz, that is,

‖f(x; θ) − f(x; θ̃)‖_F ≤ L ‖θ − θ̃‖_Θ.

Then for any ǫ > 0, C(ǫ, F, ‖·‖_F) ≤ C(ǫ/L, Θ, ‖·‖_Θ).

Proof. Suppose that C(ǫ/L, Θ, ‖·‖_Θ) = n and {θ_i}_{i=1}^n is an ǫ/L-cover of Θ. Then for any θ ∈ Θ there exists 1 ≤ i ≤ n such that

‖f(x; θ) − f(x; θ_i)‖_F ≤ L ‖θ − θ_i‖_Θ ≤ ǫ.

Hence {f(x; θ_i)}_{i=1}^n is an ǫ-cover of F, implying that C(ǫ, F, ‖·‖_F) ≤ n.

To relate the Rademacher complexity to the covering number, we first need Massart's finite class lemma stated below.

Lemma 5.7. For any finite set A ⊂ R^N with diameter D = sup_{a∈A} ‖a‖₂,

R_N(A) ≤ (D/N) √(2 log |A|).

Proof. See, for example, [35, Lemma 26.8].
Lemma 5.8. Let F be a function class with ‖f‖_∞ ≤ B for all f ∈ F. Then

R_N(F) ≤ inf_{0<δ<B/2} (4δ + (12/√N) ∫_δ^{B/2} √(log C(ǫ, F, ‖·‖_∞)) dǫ).

Proof. We apply the chaining method. Set ǫ_k = 2^{−k+1}B. Denote by F_k an ǫ_k-cover of F with |F_k| = C(ǫ_k, F, ‖·‖_∞); hence for any u ∈ F there exists u_k ∈ F_k such that ‖u − u_k‖_∞ ≤ ǫ_k. Let K be a positive integer to be determined later. We have

R_N(F) = E_{{σ_i,X_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i u(X_i)]
= E_{{σ_i,X_i}_{i=1}^N} (1/N) sup_{u∈F} [Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) + Σ_{j=1}^{K−1} Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i)) + Σ_{i=1}^N σ_i u₁(X_i)]
≤ E_{{σ_i,X_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i))]
  + Σ_{j=1}^{K−1} E_{{σ_i,X_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i))]
  + E_{{σ_i,X_i}_{i=1}^N} [sup_{u₁∈F₁} (1/N) Σ_{i=1}^N σ_i u₁(X_i)].

We can choose F₁ = {0} to eliminate the third term. For the first term,

E_{{σ_i,X_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i))] ≤ E_{{σ_i,X_i}_{i=1}^N} sup_{u∈F} (1/N) Σ_{i=1}^N |σ_i| ‖u − u_K‖_∞ ≤ ǫ_K.

For the second term, for any fixed samples {X_i}_{i=1}^N, define

V_j := {(u_{j+1}(X₁) − u_j(X₁), . . . , u_{j+1}(X_N) − u_j(X_N)) ∈ R^N : u ∈ F}.

Then, for any v_j ∈ V_j,

‖v_j‖₂ = (Σ_{i=1}^N |u_{j+1}(X_i) − u_j(X_i)|²)^{1/2} ≤ √N ‖u_{j+1} − u_j‖_∞
≤ √N ‖u_{j+1} − u‖_∞ + √N ‖u_j − u‖_∞ ≤ √N ǫ_{j+1} + √N ǫ_j = 3√N ǫ_{j+1}.

Applying Lemma 5.7, we have

Σ_{j=1}^{K−1} E_{{σ_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i))]
= Σ_{j=1}^{K−1} E_{{σ_i}_{i=1}^N} [sup_{v_j∈V_j} (1/N) Σ_{i=1}^N σ_i v_{ji}] ≤ Σ_{j=1}^{K−1} (3ǫ_{j+1}/√N) √(2 log |V_j|).

By the definition of V_j, we know that |V_j| ≤ |F_j||F_{j+1}| ≤ |F_{j+1}|². Hence

Σ_{j=1}^{K−1} E_{{σ_i,X_i}_{i=1}^N} [sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i))] ≤ Σ_{j=1}^{K−1} (6ǫ_{j+1}/√N) √(log |F_{j+1}|).

Now we obtain

R_N(F) ≤ ǫ_K + Σ_{j=1}^{K−1} (6ǫ_{j+1}/√N) √(log |F_{j+1}|)
= ǫ_K + (12/√N) Σ_{j=1}^{K−1} (ǫ_{j+1} − ǫ_{j+2}) √(log C(ǫ_{j+1}, F, ‖·‖_∞))
≤ ǫ_K + (12/√N) ∫_{ǫ_{K+1}}^{B/2} √(log C(ǫ, F, ‖·‖_∞)) dǫ.

We conclude the lemma by choosing, for any given 0 < δ < B/2, the integer K such that ǫ_{K+2} < δ ≤ ǫ_{K+1}, so that ǫ_K = 4ǫ_{K+2} < 4δ.
From Lemma 5.6 we know that the key step in bounding C(ǫ, F_i, ‖·‖_∞), with F_i defined in Lemma 5.3, is to compute an upper bound on the Lipschitz constant of the class F_i, which is done in Lemmas 5.9-5.12.
Lemma 5.9. Let D, n_D, n_i ∈ N₊, n_D = 1, B_θ ≥ 1, and let ρ be a bounded Lipschitz continuous function with bound B_ρ and Lipschitz constant L_ρ satisfying B_ρ, L_ρ ≤ 1. Set the parameterized function class P = N_ρ(D, n_D, B_θ). Then any f(x; θ) ∈ P is √(n_D) B_θ^{D−1} (∏_{i=1}^{D−1} n_i)-Lipschitz continuous with respect to the variable θ, i.e.,

|f(x; θ) − f(x; θ̃)| ≤ √(n_D) B_θ^{D−1} (∏_{i=1}^{D−1} n_i) ‖θ − θ̃‖₂, ∀x ∈ Ω.

Proof. For ℓ = 2, · · · , D (the argument for the case ℓ = D is slightly different),

|f_q^{(ℓ)} − f̃_q^{(ℓ)}| = |ρ(Σ_{j=1}^{n_{ℓ−1}} a_{qj}^{(ℓ)} f_j^{(ℓ−1)} + b_q^{(ℓ)}) − ρ(Σ_{j=1}^{n_{ℓ−1}} ã_{qj}^{(ℓ)} f̃_j^{(ℓ−1)} + b̃_q^{(ℓ)})|
≤ L_ρ |Σ_{j=1}^{n_{ℓ−1}} a_{qj}^{(ℓ)} f_j^{(ℓ−1)} − Σ_{j=1}^{n_{ℓ−1}} ã_{qj}^{(ℓ)} f̃_j^{(ℓ−1)} + b_q^{(ℓ)} − b̃_q^{(ℓ)}|
≤ L_ρ Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)}| |f_j^{(ℓ−1)} − f̃_j^{(ℓ−1)}| + L_ρ Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} − ã_{qj}^{(ℓ)}| |f̃_j^{(ℓ−1)}| + L_ρ |b_q^{(ℓ)} − b̃_q^{(ℓ)}|
≤ B_θ L_ρ Σ_{j=1}^{n_{ℓ−1}} |f_j^{(ℓ−1)} − f̃_j^{(ℓ−1)}| + B_ρ L_ρ Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} − ã_{qj}^{(ℓ)}| + L_ρ |b_q^{(ℓ)} − b̃_q^{(ℓ)}|
≤ B_θ Σ_{j=1}^{n_{ℓ−1}} |f_j^{(ℓ−1)} − f̃_j^{(ℓ−1)}| + Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} − ã_{qj}^{(ℓ)}| + |b_q^{(ℓ)} − b̃_q^{(ℓ)}|.

For ℓ = 1,

|f_q^{(1)} − f̃_q^{(1)}| = |ρ(Σ_{j=1}^{n₀} a_{qj}^{(1)} x_j + b_q^{(1)}) − ρ(Σ_{j=1}^{n₀} ã_{qj}^{(1)} x_j + b̃_q^{(1)})|
≤ Σ_{j=1}^{n₀} |a_{qj}^{(1)} − ã_{qj}^{(1)}| + |b_q^{(1)} − b̃_q^{(1)}| = Σ_{j=1}^{n₁} |θ_j − θ̃_j|.

For ℓ = 2,

|f_q^{(2)} − f̃_q^{(2)}| ≤ B_θ Σ_{j=1}^{n₁} |f_j^{(1)} − f̃_j^{(1)}| + Σ_{j=1}^{n₁} |a_{qj}^{(2)} − ã_{qj}^{(2)}| + |b_q^{(2)} − b̃_q^{(2)}|
≤ B_θ Σ_{j=1}^{n₁} Σ_{k=1}^{n₁} |θ_k − θ̃_k| + Σ_{j=1}^{n₁} |a_{qj}^{(2)} − ã_{qj}^{(2)}| + |b_q^{(2)} − b̃_q^{(2)}|
≤ n₁ B_θ Σ_{j=1}^{n₂} |θ_j − θ̃_j|.

Assuming that for ℓ ≥ 2,

|f_q^{(ℓ)} − f̃_q^{(ℓ)}| ≤ (∏_{i=1}^{ℓ−1} n_i) B_θ^{ℓ−1} Σ_{j=1}^{n_ℓ} |θ_j − θ̃_j|,

we have

|f_q^{(ℓ+1)} − f̃_q^{(ℓ+1)}| ≤ B_θ Σ_{j=1}^{n_ℓ} |f_j^{(ℓ)} − f̃_j^{(ℓ)}| + Σ_{j=1}^{n_ℓ} |a_{qj}^{(ℓ+1)} − ã_{qj}^{(ℓ+1)}| + |b_q^{(ℓ+1)} − b̃_q^{(ℓ+1)}|
≤ B_θ Σ_{j=1}^{n_ℓ} (∏_{i=1}^{ℓ−1} n_i) B_θ^{ℓ−1} Σ_{k=1}^{n_ℓ} |θ_k − θ̃_k| + Σ_{j=1}^{n_ℓ} |a_{qj}^{(ℓ+1)} − ã_{qj}^{(ℓ+1)}| + |b_q^{(ℓ+1)} − b̃_q^{(ℓ+1)}|
≤ (∏_{i=1}^{ℓ} n_i) B_θ^{ℓ} Σ_{j=1}^{n_{ℓ+1}} |θ_j − θ̃_j|.

Hence by induction and the Hölder inequality we conclude that

|f − f̃| ≤ (∏_{i=1}^{D−1} n_i) B_θ^{D−1} Σ_{j=1}^{n_D} |θ_j − θ̃_j| ≤ √(n_D) B_θ^{D−1} (∏_{i=1}^{D−1} n_i) ‖θ − θ̃‖₂.
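The Lipschitz bound of Lemma 5.9 can be sanity-checked on a small tanh network (for which B_ρ, L_ρ ≤ 1). The instance below is hypothetical: depth D = 2, input dimension 2, hidden width 3, so that, in the convention of Section 2, the cumulative nonzero-weight counts are n₁ = 9 and n_D = n₂ = 13 (here n_D is read as the total number of nonzero weights).

```python
import numpy as np

rng = np.random.default_rng(3)
B_theta, D = 1.0, 2
n1, nD = 9, 13                              # nonzero weights in the first layer / in total
lip = np.sqrt(nD) * B_theta**(D - 1) * n1   # Lemma 5.9: sqrt(n_D) B^{D-1} prod_{i<D} n_i

def net(theta, x):
    """Depth-2 tanh network f = A_2 tanh(A_1 x + b_1) + b_2 with 13 parameters."""
    A1, b1 = theta[:6].reshape(3, 2), theta[6:9]
    A2, b2 = theta[9:12].reshape(1, 3), theta[12]
    return (A2 @ np.tanh(A1 @ x + b1) + b2).item()

ok = True
for _ in range(200):
    x = rng.uniform(0.0, 1.0, 2)            # x in Omega subset [0,1]^d
    t1 = rng.uniform(-B_theta, B_theta, 13)
    t2 = rng.uniform(-B_theta, B_theta, 13)
    gap = abs(net(t1, x) - net(t2, x))
    ok = ok and gap <= lip * np.linalg.norm(t1 - t2) + 1e-12
```

The bound holds with a comfortable margin here, which is expected: the Lipschitz constant in the lemma is a worst-case estimate over the whole class.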
Lemma 5.10. Let D, n_D, n_i ∈ N₊, n_D = 1, B_θ ≥ 1, and let ρ be a function such that ρ′ is bounded by B_{ρ′}. Set the parameterized function class P = N_ρ(D, n_D, B_θ). Let p = 1, · · · , d. We have

|∂_{x_p} f_q^{(ℓ)}| ≤ (∏_{i=1}^{ℓ−1} n_i) (B_θ B_{ρ′})^ℓ,  ℓ = 1, 2, · · · , D − 1,
|∂_{x_p} f| ≤ (∏_{i=1}^{D−1} n_i) B_θ^D B_{ρ′}^{D−1}.

Proof. For ℓ = 1, 2, · · · , D − 1,

|∂_{x_p} f_q^{(ℓ)}| = |ρ′(Σ_{j=1}^{n_{ℓ−1}} a_{qj}^{(ℓ)} f_j^{(ℓ−1)} + b_q^{(ℓ)}) Σ_{j=1}^{n_{ℓ−1}} a_{qj}^{(ℓ)} ∂_{x_p} f_j^{(ℓ−1)}| ≤ B_θ B_{ρ′} Σ_{j=1}^{n_{ℓ−1}} |∂_{x_p} f_j^{(ℓ−1)}|
≤ (B_θ B_{ρ′})² Σ_{k=1}^{n_{ℓ−1}} Σ_{j=1}^{n_{ℓ−2}} |∂_{x_p} f_j^{(ℓ−2)}| = n_{ℓ−1} (B_θ B_{ρ′})² Σ_{j=1}^{n_{ℓ−2}} |∂_{x_p} f_j^{(ℓ−2)}|
≤ · · · ≤ (∏_{i=2}^{ℓ−1} n_i) (B_θ B_{ρ′})^{ℓ−1} Σ_{j=1}^{n₁} |∂_{x_p} f_j^{(1)}|
≤ (∏_{i=2}^{ℓ−1} n_i) (B_θ B_{ρ′})^{ℓ−1} Σ_{j=1}^{n₁} B_θ B_{ρ′} = (∏_{i=1}^{ℓ−1} n_i) (B_θ B_{ρ′})^ℓ.

The bound for |∂_{x_p} f| can be derived similarly.
Lemma 5.11. Let D, n_D, n_i ∈ N₊, n_D = 1, B_θ ≥ 1, and let ρ be a function such that ρ, ρ′ are bounded by B_ρ, B_{ρ′} ≤ 1 and have Lipschitz constants L_ρ, L_{ρ′} ≤ 1, respectively. Set the parameterized function class P = N_ρ(D, n_D, B_θ). Then, for any f(x; θ) ∈ P and p = 1, · · · , d, ∂_{x_p} f(x; θ) is √(n_D)(D+1)B_θ^{2D}(∏_{i=1}^{D−1} n_i)²-Lipschitz continuous with respect to the variable θ, i.e.,

|∂_{x_p} f(x; θ) − ∂_{x_p} f(x; θ̃)| ≤ √(n_D)(D+1)B_θ^{2D} (∏_{i=1}^{D−1} n_i)² ‖θ − θ̃‖₂, ∀x ∈ Ω.

Proof. For ℓ = 1,

|∂_{x_p} f_q^{(1)} − ∂_{x_p} f̃_q^{(1)}|
= |a_{qp}^{(1)} ρ′(Σ_{j=1}^{n₀} a_{qj}^{(1)} x_j + b_q^{(1)}) − ã_{qp}^{(1)} ρ′(Σ_{j=1}^{n₀} ã_{qj}^{(1)} x_j + b̃_q^{(1)})|
≤ |a_{qp}^{(1)} − ã_{qp}^{(1)}| |ρ′(Σ_{j=1}^{n₀} a_{qj}^{(1)} x_j + b_q^{(1)})| + |ã_{qp}^{(1)}| |ρ′(Σ_{j=1}^{n₀} a_{qj}^{(1)} x_j + b_q^{(1)}) − ρ′(Σ_{j=1}^{n₀} ã_{qj}^{(1)} x_j + b̃_q^{(1)})|
≤ B_{ρ′} |a_{qp}^{(1)} − ã_{qp}^{(1)}| + B_θ L_{ρ′} Σ_{j=1}^{n₀} |a_{qj}^{(1)} − ã_{qj}^{(1)}| + B_θ L_{ρ′} |b_q^{(1)} − b̃_q^{(1)}| ≤ 2B_θ Σ_{k=1}^{n₁} |θ_k − θ̃_k|.

For ℓ ≥ 2, we establish the recurrence relation:

|∂_{x_p} f_q^{(ℓ)} − ∂_{x_p} f̃_q^{(ℓ)}|
≤ (Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)}| |∂_{x_p} f_j^{(ℓ−1)}|) |ρ′(Σ_{j=1}^{n_{ℓ−1}} a_{qj}^{(ℓ)} f_j^{(ℓ−1)} + b_q^{(ℓ)}) − ρ′(Σ_{j=1}^{n_{ℓ−1}} ã_{qj}^{(ℓ)} f̃_j^{(ℓ−1)} + b̃_q^{(ℓ)})|
  + Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} ∂_{x_p} f_j^{(ℓ−1)} − ã_{qj}^{(ℓ)} ∂_{x_p} f̃_j^{(ℓ−1)}| |ρ′(Σ_{j=1}^{n_{ℓ−1}} ã_{qj}^{(ℓ)} f̃_j^{(ℓ−1)} + b̃_q^{(ℓ)})|
≤ B_θ L_{ρ′} (Σ_{j=1}^{n_{ℓ−1}} |∂_{x_p} f_j^{(ℓ−1)}|) (Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} f_j^{(ℓ−1)} − ã_{qj}^{(ℓ)} f̃_j^{(ℓ−1)}| + |b_q^{(ℓ)} − b̃_q^{(ℓ)}|) + B_{ρ′} Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} ∂_{x_p} f_j^{(ℓ−1)} − ã_{qj}^{(ℓ)} ∂_{x_p} f̃_j^{(ℓ−1)}|
≤ B_θ L_{ρ′} (Σ_{j=1}^{n_{ℓ−1}} |∂_{x_p} f_j^{(ℓ−1)}|) (B_ρ Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} − ã_{qj}^{(ℓ)}| + B_θ Σ_{j=1}^{n_{ℓ−1}} |f_j^{(ℓ−1)} − f̃_j^{(ℓ−1)}| + |b_q^{(ℓ)} − b̃_q^{(ℓ)}|)
  + B_{ρ′} B_θ Σ_{j=1}^{n_{ℓ−1}} |∂_{x_p} f_j^{(ℓ−1)} − ∂_{x_p} f̃_j^{(ℓ−1)}| + B_{ρ′} Σ_{j=1}^{n_{ℓ−1}} |a_{qj}^{(ℓ)} − ã_{qj}^{(ℓ)}| |∂_{x_p} f̃_j^{(ℓ−1)}|.

Plugging in the bound on Σ_j |∂_{x_p} f_j^{(ℓ−1)}| from Lemma 5.10 and the bound on Σ_j |f_j^{(ℓ−1)} − f̃_j^{(ℓ−1)}| from the proof of Lemma 5.9, and using B_ρ, B_{ρ′}, L_{ρ′} ≤ 1, we obtain

|∂_{x_p} f_q^{(ℓ)} − ∂_{x_p} f̃_q^{(ℓ)}| ≤ B_θ Σ_{j=1}^{n_{ℓ−1}} |∂_{x_p} f_j^{(ℓ−1)} − ∂_{x_p} f̃_j^{(ℓ−1)}| + B_θ^{2ℓ} (∏_{i=1}^{ℓ−1} n_i)² Σ_{k=1}^{n_ℓ} |θ_k − θ̃_k|.

For ℓ = 2,

|∂_{x_p} f_q^{(2)} − ∂_{x_p} f̃_q^{(2)}| ≤ B_θ Σ_{j=1}^{n₁} |∂_{x_p} f_j^{(1)} − ∂_{x_p} f̃_j^{(1)}| + B_θ⁴ n₁² Σ_{k=1}^{n₂} |θ_k − θ̃_k|
≤ 2B_θ² n₁ Σ_{k=1}^{n₁} |θ_k − θ̃_k| + B_θ⁴ n₁² Σ_{k=1}^{n₂} |θ_k − θ̃_k| ≤ 3B_θ⁴ n₁² Σ_{k=1}^{n₂} |θ_k − θ̃_k|.

Assuming that for ℓ ≥ 2,

|∂_{x_p} f_q^{(ℓ)} − ∂_{x_p} f̃_q^{(ℓ)}| ≤ (ℓ + 1) B_θ^{2ℓ} (∏_{i=1}^{ℓ−1} n_i)² Σ_{k=1}^{n_ℓ} |θ_k − θ̃_k|,

we have

|∂_{x_p} f_q^{(ℓ+1)} − ∂_{x_p} f̃_q^{(ℓ+1)}|
≤ B_θ Σ_{j=1}^{n_ℓ} |∂_{x_p} f_j^{(ℓ)} − ∂_{x_p} f̃_j^{(ℓ)}| + B_θ^{2ℓ+2} (∏_{i=1}^{ℓ} n_i)² Σ_{k=1}^{n_{ℓ+1}} |θ_k − θ̃_k|
≤ B_θ Σ_{j=1}^{n_ℓ} (ℓ + 1) B_θ^{2ℓ} (∏_{i=1}^{ℓ−1} n_i)² Σ_{k=1}^{n_ℓ} |θ_k − θ̃_k| + B_θ^{2ℓ+2} (∏_{i=1}^{ℓ} n_i)² Σ_{k=1}^{n_{ℓ+1}} |θ_k − θ̃_k|
≤ (ℓ + 2) B_θ^{2ℓ+2} (∏_{i=1}^{ℓ} n_i)² Σ_{k=1}^{n_{ℓ+1}} |θ_k − θ̃_k|.

Hence by induction and the Hölder inequality we conclude that

|∂_{x_p} f − ∂_{x_p} f̃| ≤ (D + 1) B_θ^{2D} (∏_{i=1}^{D−1} n_i)² Σ_{k=1}^{n_D} |θ_k − θ̃_k| ≤ √(n_D)(D + 1) B_θ^{2D} (∏_{i=1}^{D−1} n_i)² ‖θ − θ̃‖₂.
Lemma 5.12. Let D, n_D, n_i ∈ N₊, n_D = 1, B_θ ≥ 1, and let ρ be a function such that ρ, ρ′ are bounded by B_ρ, B_{ρ′} ≤ 1 and have Lipschitz constants L_ρ, L_{ρ′} ≤ 1, respectively. Set the parameterized function class P = N_ρ(D, n_D, B_θ). Then, for any f_i(x; θ), f_i(x; θ̃) ∈ F_i, i = 1, · · · , 5, we have

|f_i(x; θ)| ≤ B_i, ∀x ∈ Ω,
|f_i(x; θ) − f_i(x; θ̃)| ≤ L_i ‖θ − θ̃‖₂, ∀x ∈ Ω,

with B₁ = d(∏_{i=1}^{D−1} n_i)² B_θ^{2D}, B₂ = B₄ = B₃², B₃ = B₅ = (n_{D−1} + 1)B_θ, and

L₁ = 2d√(n_D)(D+1)B_θ^{3D}(∏_{i=1}^{D−1} n_i)³,  L₂ = 2√(n_D)B_θ^D(n_{D−1}+1)(∏_{i=1}^{D−1} n_i),
L₃ = √(n_D)B_θ^{D−1}(∏_{i=1}^{D−1} n_i),  L₄ = L₂,  L₅ = L₃.

Proof. Direct result of Lemmas 5.9, 5.10, 5.11 and some calculation.
Now we state our main result on the statistical error.

Theorem 5.13. Let D, n_D, n_i ∈ N₊, n_D = 1, B_θ ≥ 1, and let ρ be a function such that ρ, ρ′ are bounded by B_ρ, B_{ρ′} ≤ 1 and have Lipschitz constants L_ρ, L_{ρ′} ≤ 1, respectively. Set the parameterized function class P = N_ρ(D, n_D, B_θ). Then, if N = M, we have

E_{{X_i}_{i=1}^N,{Y_j}_{j=1}^M} sup_{u∈P} ±[L(u) − L̂(u)] ≤ (C(Ω, coe, α)/β) · (d√D n_D^{2D} B_θ^{2D}/√N) √(log(dDn_DB_θN)).

Proof. From Lemmas 5.5, 5.6 and 5.8, we have

R_N(F_i) ≤ inf_{0<δ<B_i/2} (4δ + (12/√N) ∫_δ^{B_i/2} √(log C(ǫ, F_i, ‖·‖_∞)) dǫ)
≤ inf_{0<δ<B_i/2} (4δ + (12/√N) ∫_δ^{B_i/2} √(n_D log(2L_iB_θ√(n_D)/ǫ)) dǫ)
≤ inf_{0<δ<B_i/2} (4δ + (6√(n_D)B_i/√N) √(log(2L_iB_θ√(n_D)/δ))).

Choosing δ = 1/√N < B_i/2 and applying Lemma 5.12, we have

R_N(F_i) ≤ 4/√N + (6√(n_D)B_i/√N) √(log(2L_iB_θ√(n_D)√N))
≤ C (d√(n_D)(∏_{i=1}^{D−1} n_i)² B_θ^{2D}/√N) √(log(4dn_D(D+1)B_θ^{3D+1}(∏_{i=1}^{D−1} n_i)³ √N))
≤ C (d√D n_D^{2D} B_θ^{2D}/√N) √(log(dDn_DB_θN)).    (5.1)

Combining Lemmas 5.1, 5.3 and (5.1), we obtain, if N = M,

E_{{X_i}_{i=1}^N,{Y_j}_{j=1}^M} sup_{u∈P} ±[L(u) − L̂(u)] ≤ (C(Ω, coe, α)/β) · (d√D n_D^{2D} B_θ^{2D}/√N) √(log(dDn_DB_θN)).
6. Convergence Rate for the Ritz Method.

Theorem 6.1. Let (A) hold and assume that $\mathcal{E}_{opt} = 0$. Let $\rho$ be the logistic function $\frac{1}{1+e^{-x}}$ or the tanh function $\frac{e^x - e^{-x}}{e^x + e^{-x}}$. Let $u_{\phi_{\mathcal{A}}}$ be the solution of problem (3.6) generated by a random solver and $u_\phi$ be an optimal solution of problem (3.6).

(1) Let $u_R$ be the weak solution of the Robin problem (3.1), (3.2c). For any $\epsilon > 0$ and $\mu \in (0,1)$, set the parameterized function class

$$\mathcal{P} = \mathcal{N}_\rho\Big(C\log(d+1),\; C(d,\beta)\,\epsilon^{-d/(1-\mu)},\; C(d,\beta)\,\epsilon^{-(9d+8)/(2-2\mu)}\Big)$$

and number of samples

$$N = M = C(d, \Omega, coe, \alpha, \beta)\,\epsilon^{-C d \log(d+1)/(1-\mu)}.$$

If the optimization error $\mathcal{E}_{opt} = \mathcal{L}(u_{\phi_{\mathcal{A}}}) - \mathcal{L}(u_\phi) \le \epsilon$, then

$$\mathbb{E}_{\{X_i\}_{i=1}^N, \{Y_j\}_{j=1}^M} \|u_{\phi_{\mathcal{A}}} - u_R\|_{H^1(\Omega)} \le C(\Omega, coe, \alpha)\,\epsilon.$$

(2) Let $u_D$ be the weak solution of the Dirichlet problem (3.1), (3.2a). Set $\alpha = 1$, $g = 0$. For any $\epsilon > 0$, let $\beta = C(coe)\,\epsilon$ be the penalty parameter, and set the parameterized function class

$$\mathcal{P} = \mathcal{N}_\rho\Big(C\log(d+1),\; C(d)\,\epsilon^{-5d/(2(1-\mu))},\; C(d)\,\epsilon^{-(45d+40)/(4-4\mu)}\Big)$$

and number of samples

$$N = M = C(d, \Omega, coe)\,\epsilon^{-C d \log(d+1)/(1-\mu)}.$$

If the optimization error $\mathcal{E}_{opt} \le \epsilon$, then

$$\mathbb{E}_{\{X_i\}_{i=1}^N, \{Y_j\}_{j=1}^M} \|u_{\phi_{\mathcal{A}}} - u_D\|_{H^1(\Omega)} \le C(\Omega, coe)\,\epsilon.$$
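The hyper-parameter scalings of Theorem 6.1(1) can be read off programmatically. In the sketch below, every unspecified constant $C(\cdot)$ is collapsed to a single assumed value C = 1, so only the exponents, not the constants, reflect the theorem:

```python
import math

def drm_hyperparameters(d, eps, mu, C=1.0):
    # Scalings from Theorem 6.1(1) (Robin problem), with all unspecified
    # constants C(.) collapsed to the assumed value C; only the exponents
    # below come from the theorem.
    depth   = C * math.log(d + 1)                              # network depth
    width   = C * eps ** (-d / (1 - mu))                       # network width
    B_theta = C * eps ** (-(9 * d + 8) / (2 - 2 * mu))         # parameter bound
    samples = C * eps ** (-C * d * math.log(d + 1) / (1 - mu)) # N = M
    return depth, width, B_theta, samples

# Halving the target accuracy eps inflates width and sample count
# polynomially in 1/eps, with exponents growing with the dimension d.
for d in (2, 10):
    _, w1, _, n1 = drm_hyperparameters(d, 0.1, mu=0.5)
    _, w2, _, n2 = drm_hyperparameters(d, 0.05, mu=0.5)
    print(d, w2 / w1, n2 / n1)
```

For $d = 2$ and $\mu = 1/2$, the width scales as $\epsilon^{-4}$, so halving $\epsilon$ multiplies the prescribed width by $2^4 = 16$.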
Remark 6.1. The Dirichlet boundary condition corresponds to a constrained minimization problem, which may cause some difficulties in computation. The penalty method has been applied in finite element and finite volume methods [4, 29]. It has also been used in deep PDE solvers [43, 34, 44], since it is not easy to construct a network with prescribed values on the boundary. Recently, [32, 31] also studied the convergence of DRM with the Dirichlet boundary condition via the penalty method. However, the analysis in [32, 31] is based on some additional conditions, which we do not need in order to bound the error induced by the penalty. More importantly, we provide a convergence rate analysis involving the statistical error caused by the finite samples used in SGD training, while [32, 31] do not consider the statistical error at all.
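The boundary-penalty loss discussed in this remark can be illustrated with a minimal Monte Carlo sketch. The integrand below is a simplified instance of the penalized Ritz loss (taking $w = 0$, $f = 0$ and $g = 0$, as assumptions for illustration, matching the $g = 0$ setting of Theorem 6.1(2)); the trial function and all numerical values are illustrative only:

```python
import math
import random

def ritz_energy_mc(grad_u_sq, u_sq_boundary, f_times_u, beta, n_int, n_bdy, rng):
    # Monte Carlo estimate of a penalized Ritz loss on the unit square:
    #   L(u) = \int_Omega (|grad u|^2 / 2 - f u) dx + beta \int_{dOmega} u^2 ds
    interior = 0.0
    for _ in range(n_int):
        x, y = rng.random(), rng.random()
        interior += 0.5 * grad_u_sq(x, y) - f_times_u(x, y)
    interior /= n_int  # |Omega| = 1 for the unit square

    boundary = 0.0
    perimeter = 4.0  # |dOmega| for the unit square
    for _ in range(n_bdy):
        t = rng.random()
        side = rng.randrange(4)  # pick one of the four edges uniformly
        x, y = [(t, 0.0), (t, 1.0), (0.0, t), (1.0, t)][side]
        boundary += u_sq_boundary(x, y)
    boundary *= perimeter / n_bdy
    return interior + beta * boundary

# Trial function u(x, y) = sin(pi x) sin(pi y): it vanishes on the boundary,
# and |grad u|^2 = pi^2 (cos^2(pi x) sin^2(pi y) + sin^2(pi x) cos^2(pi y)).
gsq = lambda x, y: math.pi ** 2 * ((math.cos(math.pi * x) * math.sin(math.pi * y)) ** 2
                                   + (math.sin(math.pi * x) * math.cos(math.pi * y)) ** 2)
usq = lambda x, y: (math.sin(math.pi * x) * math.sin(math.pi * y)) ** 2

est = ritz_energy_mc(gsq, usq, lambda x, y: 0.0, beta=10.0,
                     n_int=200_000, n_bdy=10_000, rng=random.Random(0))
print(est)  # exact energy is pi^2/4, about 2.4674 (penalty term vanishes here)
```

Because the trial function satisfies the homogeneous Dirichlet condition exactly, the penalty term contributes nothing here; a trial function violating the boundary condition would be charged $\beta \int_{\partial\Omega} u^2\,ds$.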
Proof. We first normalize the solution:

$$\inf_{u \in \mathcal{P}} \|u - u_R\|_{H^1(\Omega)} = \|u_R\|_{H^1(\Omega)} \inf_{u \in \mathcal{P}} \bigg\|\frac{u}{\|u_R\|_{H^1(\Omega)}} - \frac{u_R}{\|u_R\|_{H^1(\Omega)}}\bigg\|_{H^1(\Omega)}$$
$$= \|u_R\|_{H^1(\Omega)} \inf_{u \in \mathcal{P}} \bigg\|u - \frac{u_R}{\|u_R\|_{H^1(\Omega)}}\bigg\|_{H^1(\Omega)}$$
$$\le \frac{C(coe)}{\beta} \inf_{u \in \mathcal{P}} \bigg\|u - \frac{u_R}{\|u_R\|_{H^1(\Omega)}}\bigg\|_{H^1(\Omega)},$$

where in the third step we apply Lemma 3.2. By Lemma 3.2 and Corollary 4.3, there exists a neural network function

$$u_\rho \in \mathcal{P} = \mathcal{N}_\rho\bigg(C\log(d+1),\; C(d)\Big(\frac{1}{\beta^{3/2}\epsilon}\Big)^{d/(1-\mu)},\; C(d)\Big(\frac{1}{\beta^{3/2}\epsilon}\Big)^{(9d+8)/(2-2\mu)}\bigg)$$

such that

$$\bigg\|u_\rho - \frac{u_R}{\|u_R\|_{H^1(\Omega)}}\bigg\|_{H^1(\Omega)} \le \beta^{3/2}\epsilon.$$

Hence,

$$\mathcal{E}_{app} = \frac{1}{\beta} C(\Omega, coe, \alpha) \inf_{u \in \mathcal{P}} \|u - u_R\|^2_{H^1(\Omega)} \le \frac{1}{\beta^3} C(\Omega, coe, \alpha) \inf_{u \in \mathcal{P}} \bigg\|u - \frac{u_R}{\|u_R\|_{H^1(\Omega)}}\bigg\|^2_{H^1(\Omega)} \le C(\Omega, coe, \alpha)\,\epsilon^2. \quad (6.1)$$

Since $\rho, \rho'$ are bounded and Lipschitz continuous with $B_\rho, B_{\rho'}, L_\rho, L_{\rho'} \le 1$ for $\rho = \frac{1}{1+e^{-x}}$ and $\rho = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, we can apply Theorem 5.13 with $D = C\log(d+1)$, $n_D = C(d)\big(\frac{1}{\beta^{3/2}\epsilon}\big)^{d/(1-\mu)}$, $B_\theta = C(d)\big(\frac{1}{\beta^{3/2}\epsilon}\big)^{(9d+8)/(2-2\mu)}$. We conclude that by setting

$$N = M = C(d, \Omega, coe, \alpha) \Big(\frac{1}{\beta^{3/2}\epsilon}\Big)^{C d \log(d+1)/(1-\mu)} \quad (6.2)$$

we have

$$\mathcal{E}_{sta} = \mathbb{E}_{\{X_i\}_{i=1}^N, \{Y_j\}_{j=1}^M} \Big[\sup_{u \in \mathcal{P}} \big[\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\big] + \sup_{u \in \mathcal{P}} \big[\widehat{\mathcal{L}}(u) - \mathcal{L}(u)\big]\Big] \le \epsilon^2. \quad (6.3)$$

Combining Proposition 3.1(1), (6.1) and (6.3) yields (1).

Setting the penalty parameter

$$\beta = C(coe)\,\epsilon \quad (6.4)$$

and combining Lemma 3.3, Proposition 3.1(2) and (6.1)--(6.4) yields (2).
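For the reader's convenience, the exponents in part (2) of Theorem 6.1 come from substituting (6.4) into the network sizes used in the proof of part (1): with $\beta \asymp \epsilon$,

$$\Big(\frac{1}{\beta^{3/2}\epsilon}\Big)^{\frac{d}{1-\mu}} \asymp \big(\epsilon^{-5/2}\big)^{\frac{d}{1-\mu}} = \epsilon^{-\frac{5d}{2(1-\mu)}}, \qquad \Big(\frac{1}{\beta^{3/2}\epsilon}\Big)^{\frac{9d+8}{2-2\mu}} \asymp \big(\epsilon^{-5/2}\big)^{\frac{9d+8}{2-2\mu}} = \epsilon^{-\frac{45d+40}{4-4\mu}},$$

which matches the width and the parameter bound stated in Theorem 6.1(2).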
7. Conclusions and Extensions. This paper provided a convergence rate analysis of deep Ritz methods for elliptic equations with Dirichlet, Neumann and Robin boundary conditions, respectively. Specifically, our study sheds light on how to set the depth and width of the networks, and how to set the penalty parameter, to achieve the desired convergence rate in terms of the number of training samples.

There are several interesting directions for further research. First, the approximation and statistical error bounds derived here can be used to study the nonasymptotic convergence rate of residual-based methods such as PINNs. Second, similar results may be applicable to deep Ritz methods for optimal control problems and inverse problems.
8. Acknowledgements. We would like to thank Ingo Guhring and Mones Raslan for helpful discussions on the approximation error.
The work of Y. Jiao is supported in part by the National Science Foundation of China
under Grant 11871474 and by the research fund of KLATASDSMOE. The work of Y. Wang is
supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416
and HK Innovation Technology Fund ITS/044/18FX, as well as Guangdong-Hong Kong-Macao
Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.
REFERENCES
[1] R. A. Adams and J. J. Fournier, Sobolev spaces, Elsevier, 2003.
[2] A. Anandkumar, K. Azizzadenesheli, K. Bhattacharya, N. Kovachki, Z. Li, B. Liu, and A. Stuart,
Neural operator: Graph kernel network for partial differential equations, in ICLR 2020 Workshop on
Integration of Deep Neural Models and Differential Equations, 2020.
[3] C. Anitescu, E. Atroshchenko, N. Alajlan, and T. Rabczuk, Artificial neural network methods for the
solution of second order boundary value problems, Cmc-computers Materials & Continua, 59 (2019),
pp. 345–359.
[4] I. Babuska, The finite element method with penalty, Mathematics of Computation, 27 (1973), pp. 221–228.
[5] J. Berner, M. Dablander, and P. Grohs, Numerically solving parametric families of high-dimensional
kolmogorov partial differential equations via deep learning, in Advances in Neural Information Pro-
cessing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 16615–16627.
[6] S. Brenner and R. Scott, The mathematical theory of finite element methods, vol. 15, Springer Science
& Business Media, 2007.
[7] P. G. Ciarlet, The finite element method for elliptic problems, SIAM, 2002.
[8] M. Dissanayake and N. Phan-Thien, Neural-network-based approximations for solving partial differential
equations, Communications in Numerical Methods in Engineering, 10 (1994), pp. 195–201.
[9] C. Duan, Y. Jiao, Y. Lai, X. Lu, and Z. Yang, Convergence rate analysis for deep ritz method, arXiv
preprint arXiv:2103.13330, (2021).
[10] W. E, C. Ma, and L. Wu, The Barron space and the flow-induced function spaces for neural network
models, arXiv preprint arXiv:1906.08039, (2021).
[11] W. E and S. Wojtowytsch, Some observations on partial differential equations in Barron and multi-layer
spaces, arXiv preprint arXiv:2012.01484, (2020).
[12] L. C. Evans, Partial differential equations, Graduate studies in mathematics, 19 (1998).
[13] D. Gilbarg and N. S. Trudinger, Elliptic partial differential equations of second order, vol. 224, Springer,
2015.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio, Generative adversarial networks, Advances in Neural Information Processing Systems, 3
(2014).
[15] D. Greenfeld, M. Galun, R. Basri, I. Yavneh, and R. Kimmel, Learning to optimize multigrid PDE
solvers, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and
R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, 09–15 Jun 2019,
pp. 2415–2423.
[16] I. Guhring and M. Raslan, Approximation rates for neural networks with encodable weights in smooth-
ness spaces, Neural Networks, 134 (2021), pp. 107–130.
[17] J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning,
Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
[18] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance
on imagenet classification, in Proceedings of the IEEE international conference on computer vision,
2015, pp. 1026–1034.
[19] Q. Hong, J. W. Siegel, and J. Xu, A priori analysis of stable neural network solutions to numerical
pdes, arXiv preprint arXiv:2104.02903, (2021).
[20] Q. Hong, J. W. Siegel, and J. Xu, Rademacher complexity and numerical quadrature analysis of stable
neural networks with applications to numerical pdes, arXiv preprint arXiv:2104.02903, (2021).
[21] J.-T. Hsieh, S. Zhao, S. Eismann, L. Mirabella, and S. Ermon, Learning neural pde solvers with
convergence guarantees, in International Conference on Learning Representations, 2018.
[22] T. J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, Courier
Corporation, 2012.
[23] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial
differential equations., IEEE Trans. Neural Networks, 9 (1998), pp. 987–1000.
[24] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar,
Multipole graph neural operator for parametric partial differential equations, in Advances in Neural
Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin,
eds., vol. 33, Curran Associates, Inc., 2020, pp. 6755–6766.
[25] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. liu, K. Bhattacharya, A. Stuart, and A. Anandku-
mar, Fourier neural operator for parametric partial differential equations, in International Conference
on Learning Representations, 2021.
[26] J. Lu, Y. Lu, and M. Wang, A priori generalization analysis of the deep ritz method for solving high
dimensional elliptic equations, arXiv preprint arXiv:2101.01708, (2021).
[27] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, Deepxde: A deep learning library for solving differential
equations, SIAM Review, 63 (2021), pp. 208–228.
[28] T. Luo and H. Yang, Two-layer neural networks for partial differential equations: Optimization and
generalization theory, arXiv preprint arXiv:2006.15733, (2020).
[29] B. Maury, Numerical analysis of a finite element/volume penalty method, Siam Journal on Numerical
Analysis, 47 (2009), pp. 1126–1148.
[30] S. Mishra and R. Molinaro, Estimates on the generalization error of physics informed neural networks
(pinns) for approximating pdes, arXiv preprint arXiv:2007.01138, (2020).
[31] J. Muller and M. Zeinhofer, Deep ritz revisited, in ICLR 2020 Workshop on Integration of Deep Neural
Models and Differential Equations, 2020.
[32] J. Muller and M. Zeinhofer, Error estimates for the variational training of neural networks with
boundary penalty, arXiv preprint arXiv:2103.01007, (2021).
[33] A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, vol. 23,
Springer Science & Business Media, 2008.
[34] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equations,
Journal of Computational Physics, 378 (2019), pp. 686–707.
[35] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms,
Cambridge University Press, 2014.
[36] Y. Shin, Z. Zhang, and G. Karniadakis, Error estimates of residual minimization using neural networks
for linear pdes, arXiv preprint arXiv:2010.08019, (2020).
[37] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of go with deep
neural networks and tree search, Nature, 529 (2016), pp. 484–489.
[38] J. A. Sirignano and K. Spiliopoulos, Dgm: A deep learning algorithm for solving partial differential
equations, Journal of Computational Physics, 375 (2018), pp. 1339–1364.
[39] J. Thomas, Numerical Partial Differential Equations: Finite Difference Methods, vol. 22, Springer Science
& Business Media, 2013.
[40] K. Um, R. Brand, Y. R. Fei, P. Holl, and N. Thuerey, Solver-in-the-loop: Learning from differentiable
physics to interact with iterative pde-solvers, in Advances in Neural Information Processing Systems,
vol. 33, Curran Associates, Inc., 2020, pp. 6111–6122.
[41] S. Wang, X. Yu, and P. Perdikaris, When and why pinns fail to train: A neural tangent kernel per-
spective, arXiv preprint arXiv:2007.14527, (2020).
[42] Y. Wang, Z. Shen, Z. Long, and B. Dong, Learning to discretize: Solving 1d scalar conservation laws via
deep reinforcement learning, Communications in Computational Physics, 28 (2020), pp. 2158–2179.
[43] E. Weinan and T. Yu, The deep ritz method: A deep learning-based numerical algorithm for solving
variational problems, Communications in Mathematics and Statistics, 6 (2017), pp. 1–12.
[44] J. Xu, Finite neuron method and convergence analysis, Communications in Computational Physics, 28
(2020), pp. 1707–1745.
[45] Y. Zang, G. Bao, X. Ye, and H. Zhou, Weak adversarial networks for high-dimensional partial differ-
ential equations, Journal of Computational Physics, 411 (2020), p. 109409.