Compressed Sensing and Neural Networks
Jan Vybíral
(Charles University & Czech Technical University, Prague, Czech Republic)
NOMAD Summer
Berlin, September 25-29, 2017
Outline
Lasso & Compressed Sensing
- Least squares & Regularization
- Convexity, P vs. NP
- Sparsity & ℓ1-minimization
- Compressed Sensing

Neural Networks
- Introduction
- Notation
- Training the network
- Applications
Part I: Lasso & Compressed Sensing
Least squares
Fitting a cloud of points by a hyperplane.
Considered already by Gauss and Legendre around 1800.
In 2D: [figure omitted: a point cloud fitted by a straight line]
Least squares
Objects (= points) described by Ω real numbers:

$d_1 = (d_{1,1}, \dots, d_{1,\Omega}) \in \mathbb{R}^{\Omega}$
  ⋮
$d_N = (d_{N,1}, \dots, d_{N,\Omega}) \in \mathbb{R}^{\Omega}$

N is the number of objects; D is the N × Ω matrix with rows d_1, ..., d_N.
P = (P_1, ..., P_N) are the properties of interest.
We look for a linear dependence P = f(d) with a linear f, i.e.

$P_i = \sum_{j=1}^{\Omega} c_j d_{i,j} \quad \text{or} \quad P = Dc$
Least squares
The solution is found by minimizing the least-squares error:

$c = \arg\min_{c \in \mathbb{R}^{\Omega}} \sum_{i=1}^{N} \Big( P_i - \sum_{j=1}^{\Omega} c_j d_{i,j} \Big)^2 = \arg\min_{c \in \mathbb{R}^{\Omega}} \|P - Dc\|_2^2$

- A closed formula exists
- The objective function is convex
- The minimizer c typically has all coordinates occupied, i.e. no zero entries
- The absolute term (intercept) is incorporated by an additional column full of ones
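As a concrete illustration, here is a minimal least-squares sketch in Python; numpy and the made-up data are my assumptions, not part of the lecture.

```python
# A minimal least-squares sketch: fit P ≈ Dc, with the intercept handled
# by appending a column of ones to D. The data are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, Omega = 50, 3
D = rng.normal(size=(N, Omega))              # N objects described by Omega features
P = D @ np.array([2.0, -1.0, 0.5]) + 1.0     # toy linear property with intercept 1
D1 = np.hstack([D, np.ones((N, 1))])         # extra column of ones -> intercept

c, *_ = np.linalg.lstsq(D1, P, rcond=None)   # minimizes ||P - D1 c||_2^2
print(c)                                     # approx. [2, -1, 0.5, 1]
```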
Regularization
How do we include prior knowledge about c?

Say we prefer a linear fit with small coefficients. We just weight the error of the fit against the size of the coefficients!

With λ > 0 the regularization parameter:

$c = \arg\min_{c \in \mathbb{R}^{\Omega}} \|P - Dc\|_2^2 + \lambda \|c\|_2^2$

- λ → 0: least squares
- λ → ∞: c = 0
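A sketch of this regularized fit via its closed form $c = (D^T D + \lambda I)^{-1} D^T P$; numpy and the toy data are my assumptions.

```python
# A minimal ridge-regression sketch: the penalized least-squares problem
# still has a closed-form solution, c = (D^T D + lam I)^{-1} D^T P.
import numpy as np

def ridge(D, P, lam):
    Omega = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(Omega), D.T @ P)

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 3))
P = D @ np.array([2.0, -1.0, 0.5])
print(ridge(D, P, lam=1e-6))   # lam -> 0: essentially least squares
print(ridge(D, P, lam=1e3))    # large lam: coefficients shrink toward 0
```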
Tractability
Convexity
- The minimizer is unique (for a strictly convex objective)
- A local minimum of a convex function is also a global one
- Many effective methods exist (convex optimization)

P vs. NP
- P-problems: solvable in polynomial time (in the size of the input)
- NP-problems: solution verifiable in polynomial time; P ⊆ NP
- One-million-dollar problem: P = NP?
- Studied in computational complexity theory
Sparsity
If Ω is large (especially Ω ≫ N), we are often interested in “selecting features”, i.e. in c with many coordinates equal to zero.

$\|c\|_0 := \#\{i : c_i \neq 0\}$ is the number of non-zero coordinates of c.

Looking for a linear fit using only two features:

$c = \arg\min_{c \in \mathbb{R}^{\Omega},\ \|c\|_0 \le 2} \|P - Dc\|_2^2$

Regularized version:

$c = \arg\min_{c \in \mathbb{R}^{\Omega}} \|P - Dc\|_2^2 + \lambda \|c\|_0$

NP-hard!
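To see the combinatorial nature of the ℓ0-constrained problem, here is a toy brute-force solver; Python/numpy and the made-up data are my assumptions. It must try all $\binom{\Omega}{s}$ supports, which explodes quickly with Ω, which is exactly why this approach does not scale.

```python
# A toy brute-force sketch for the l0-constrained fit: try every support of
# size s and least-squares fit on those columns. Only feasible for tiny Omega.
from itertools import combinations
import numpy as np

def best_s_sparse(D, P, s):
    best_S, best_c, best_err = None, None, np.inf
    for S in combinations(range(D.shape[1]), s):   # all supports of size s
        cols = list(S)
        c_S, *_ = np.linalg.lstsq(D[:, cols], P, rcond=None)
        err = np.linalg.norm(P - D[:, cols] @ c_S)
        if err < best_err:
            best_S, best_c, best_err = S, c_S, err
    return best_S, best_c

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 8))
P = 3.0 * D[:, 1] - 2.0 * D[:, 5]       # only features 1 and 5 matter
print(best_s_sparse(D, P, s=2))         # recovers support (1, 5), coefficients (3, -2)
```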
ℓ1-minimization
Other ways to measure the size of c: the ℓp-norms

$\|c\|_p = \Big( \sum_{j=1}^{\Omega} |c_j|^p \Big)^{1/p}$

- Unit balls of ℓp in ℝ² [figure omitted]
- p = ∞: $\|c\|_\infty = \max_{j=1,\dots,\Omega} |c_j|$
- p ≥ 1: convex problem
- p ≤ 1: promotes sparsity
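A small numeric sketch of these norms, checked against numpy's built-in implementation (numpy is my assumption):

```python
# A minimal sketch of the lp-norms, compared with numpy's np.linalg.norm.
import numpy as np

def lp_norm(c, p):
    return np.sum(np.abs(c) ** p) ** (1.0 / p)

c = np.array([3.0, -4.0, 0.0])
print(lp_norm(c, 1), np.linalg.norm(c, 1))            # 7.0
print(lp_norm(c, 2), np.linalg.norm(c, 2))            # 5.0
print(lp_norm(c, 0.5))                                # p < 1: not a norm, but well defined
print(np.max(np.abs(c)), np.linalg.norm(c, np.inf))   # p = infinity: 4.0
```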
ℓ1-minimization
p ≤ 1 promotes sparsity:

$S_p = \arg\min_{z \in \mathbb{R}^2} \|z\|_p \quad \text{s.t. } Az = y, \qquad p = 1,\ p = 2$

[Figure omitted: the solutions S_1 and S_2; the ℓ1 ball touches the constraint line in a sparse point, the ℓ2 ball generally does not]
ℓ1-minimization
Take p = 1 (Lasso; Tibshirani, 1996):

$c = \arg\min_{c \in \mathbb{R}^{\Omega}} \|P - Dc\|_2^2 + \lambda \|c\|_1$

- Chen, Donoho, Saunders: Basis pursuit (1998)
- λ → 0: least squares
- λ → ∞: c = 0
- In between: λ selects the sparsity of c
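A minimal Lasso sketch; my assumptions: scikit-learn is available, and note that sklearn minimizes $\frac{1}{2N}\|P - Dc\|_2^2 + \alpha\|c\|_1$, so its alpha is a rescaled λ.

```python
# A minimal Lasso sketch with scikit-learn; alpha plays the role of lambda
# (up to sklearn's 1/(2N) scaling of the data-fit term).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20))
P = 3.0 * D[:, 2] - 1.5 * D[:, 7]              # only features 2 and 7 matter

model = Lasso(alpha=0.1, fit_intercept=False).fit(D, P)
print(np.nonzero(model.coef_)[0])              # sparse support, ideally {2, 7}
```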
ℓ1-minimization
[Figure omitted: effect of λ > 0 on the support of the coefficient vector]
Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
Theorem: Let D ∈ ℝ^{N×Ω} have independent Gaussian entries, let 0 < ε < 1, let s be a natural number, and let

$N \ge C\,\big(s \log(\Omega) + \log(1/\varepsilon)\big)$, with C a universal constant.

If c ∈ ℝ^Ω is s-sparse and P = Dc, then the minimizer

$\hat{c} = \arg\min_{u \in \mathbb{R}^{\Omega}} \|u\|_1 \quad \text{s.t. } P = Du$

satisfies $\hat{c} = c$ with probability at least 1 − ε.
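The ℓ1-minimization in the theorem can be recast as a linear program by splitting u = u⁺ − u⁻ with u⁺, u⁻ ≥ 0, the standard trick for turning an ℓ1-objective into linear constraints. A minimal sketch, assuming numpy and scipy are available:

```python
# A minimal basis-pursuit sketch: recover an s-sparse c from P = Dc by
# l1-minimization, recast as a linear program over u = u_plus - u_minus.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, Omega, s = 40, 128, 4                         # measurements, dimension, sparsity

D = rng.normal(size=(N, Omega))                  # Gaussian measurement matrix
c = np.zeros(Omega)
c[rng.choice(Omega, s, replace=False)] = rng.normal(size=s)   # s-sparse vector
P = D @ c

# min 1^T (u_plus + u_minus)  s.t.  D u_plus - D u_minus = P,  u_plus, u_minus >= 0
cost = np.ones(2 * Omega)
res = linprog(cost, A_eq=np.hstack([D, -D]), b_eq=P, bounds=(0, None))
c_hat = res.x[:Omega] - res.x[Omega:]

print("recovery error:", np.linalg.norm(c_hat - c))   # tiny when N is large enough
```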
Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
- Candès, Romberg, Tao (2006); Donoho (2006)
- Extensive theory of recovery of sparse vectors from linear measurements
- Optimal conditions on the number of measurements (i.e. data points): N ≈ Cs log Ω
- This only holds if most of the features (i.e. the columns of D) are incoherent with the majority of the others; if two features are very similar, it is difficult to distinguish between them
- H. Boche, R. Calderbank, G. Kutyniok, J. Vybíral: A Survey of Compressed Sensing. First chapter in Compressed Sensing and its Applications, Birkhäuser/Springer, 2015
Dictionaries
Real-life signals are (almost) never sparse in the canonical basis of ℝ^Ω; more often they are sparse in some orthonormal basis, i.e.

$x = Bc$,

where c ∈ ℝ^Ω is sparse and the columns (and rows) of B ∈ ℝ^{Ω×Ω} are orthonormal vectors: wavelets, the Fourier basis, etc.

Compressed sensing then applies without any essential change: just replace D with DB, i.e. you rotate the problem.
Dictionaries
Even more often, the signal is represented in an overcomplete dictionary/lexicon:

$x = Lc$,

where c ∈ ℝ^ℓ is sparse and L ∈ ℝ^{Ω×ℓ} is the dictionary/lexicon; its columns form an overcomplete system (ℓ > Ω).

x is a sparse combination of non-orthogonal vectors, the columns of L.

Examples: unions of two or more orthonormal bases, each capturing different features.
Dictionaries
- Compressed sensing can be adapted to this situation as well
- Optimization:

  $\hat{x} = \arg\min_{u \in \mathbb{R}^{\Omega}} \|L^* u\|_1 \quad \text{s.t. } P = Du$

- We do not recover the (non-unique!) sparse coefficients c, but (an approximation of) the signal x
- The error bound involves L^* x and is reasonably small, for example, when L^* L is nearly diagonal, i.e. when not too many features in the dictionary are too strongly correlated
ℓ1-based optimization
- ℓ1-SVM: Support vector machines are a standard tool for classification problems; an ℓ1-penalty term leads to sparse classifiers.
- Nuclear norm: Minimizing the nuclear norm (= the sum of the singular values) of a matrix leads to low-rank matrices.
- TV (= total variation) norm: Minimizing $\sum_{i,j} |u_{i,j+1} - u_{i,j}|$ over images u gives images with edges and flat parts.
- L1: Minimizing the L1-norm (= the integral of the absolute value) of a function leads to functions with small support.
- TV-norm of f: Minimizing $\int |\nabla f|$ leads to functions with jumps along curves.
Part II: Neural Networks
Neural Networks
W. McCulloch, W. Pitts (1943). Motivated by biological research on the human brain and neurons.

A neural network is a graph of nodes, partially connected. Nodes represent neurons; oriented connections between the nodes represent the transfer of outputs of some neurons to inputs of other neurons.
Neural Networks
- In the 1970's and 1980's a number of obstacles appeared: insufficient computer power to train large neural networks, theoretical problems with processing exclusive-or, etc.
- Support vector machines (and other simpler algorithms) took over the field of machine learning
- 2010's: Algorithmic advances and higher computational power made it possible to train large neural networks to human (and superhuman) performance in pattern recognition
- Large neural networks (a.k.a. deep learning) are used successfully in many tasks
Neural Networks: Artificial Neuron
An artificial neuron gets activated if a linear combination of its inputs grows over a certain threshold:

- Inputs x = (x_1, ..., x_n) ∈ ℝ^n
- Weights w = (w_1, ..., w_n) ∈ ℝ^n
- Compare ⟨w, x⟩ with a threshold b ∈ ℝ
- Plug the result into the “activation function”, a jump (or smoothed jump) function σ

An artificial neuron is therefore a function

$x \mapsto \sigma(\langle x, w \rangle - b),$

where σ: ℝ → ℝ might be σ(x) = sgn(x) or σ(x) = e^x/(1 + e^x), etc.
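This is a one-liner in code; a minimal sketch with the logistic activation (numpy and the numbers are my assumptions):

```python
# A minimal artificial neuron x -> sigma(<x, w> - b) with the logistic
# activation sigma(t) = e^t / (1 + e^t), written in an equivalent stable form.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))     # equals e^t / (1 + e^t)

def neuron(x, w, b):
    return sigma(np.dot(x, w) - b)      # activate the thresholded linear combination

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.2])
print(neuron(x, w, b=0.1))
```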
Neural Networks: Layers
An artificial neural network is a directed, acyclic graph of artificial neurons. The neurons are grouped by their distance to the input into layers.
Neural Networks: Layers
- Input: x = (x_1, ..., x_n) ∈ ℝ^n
- First layer of neurons:

  $y_1 = \sigma(\langle x, w^1_1 \rangle - b^1_1),\ \dots,\ y_{n_1} = \sigma(\langle x, w^1_{n_1} \rangle - b^1_{n_1})$

- The outputs y = (y_1, ..., y_{n_1}) become inputs for the next layer, and so on; the last layer outputs y ∈ ℝ
- Training the network: given inputs x^1, ..., x^N and outputs y^1, ..., y^N, optimize over the weights w and thresholds b
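Stacking the layers gives a short forward pass; a minimal sketch (numpy, the layer sizes, and the random parameters are my assumptions):

```python
# A minimal forward pass through a feed-forward network: every layer
# computes sigma(W x - b); the last layer returns a single output.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers):
    # layers: list of (W, b), W of shape (n_out, n_in), b of shape (n_out,)
    for W, b in layers:
        x = sigma(W @ x - b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),   # 3 inputs -> 4 neurons
          (rng.normal(size=(1, 4)), rng.normal(size=1))]   # 4 neurons -> 1 output
print(forward(np.array([0.5, -1.0, 2.0]), layers))
```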
Neural Networks: Training
- The parameters p of the network are initialized (for example randomly), giving a network N_p
- For a set of input/output pairs (x^i, y^i) we calculate the output of the neural network with the current parameters: z^i = N_p(x^i)
- In the optimal case, z^i = y^i for all inputs
- Update the parameters of the neural network to minimize/decrease the loss function, i.e.

  $\sum_i |y^i - z^i|^2$

- ... and repeat ...
Neural Networks: Training
- Non-convex minimization over a huge space!
- A huge number of local minimizers exists
- The initialization of the minimization algorithm is important
- Backpropagation algorithm: the error at the output is redistributed to the neurons of the last hidden layer, then to the previous one, etc.
- The error is distributed back through the network and used to update the parameters of each neuron by a gradient descent method
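For a single neuron the whole loop (forward pass, loss, chain-rule gradient, update) fits in a few lines; a toy sketch, with numpy, the learning rate, and the made-up data as my assumptions:

```python
# A toy training loop for one neuron: the chain-rule factor below is
# backpropagation in miniature, feeding a plain gradient-descent update.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                      # inputs x^i
y = (X @ np.array([1.0, -2.0, 0.5]) > 0.2) * 1.0   # toy 0/1 outputs y^i

w, b, lr = np.zeros(3), 0.0, 1.0
for step in range(1000):
    z = sigma(X @ w - b)                  # forward pass: z^i = N_p(x^i)
    delta = 2 * (z - y) * z * (1 - z)     # chain rule through the loss and sigma
    w -= lr * (X.T @ delta) / len(y)      # gradient step for the weights
    b -= lr * (-delta.mean())             # gradient step for the threshold

print("mean squared loss:", np.mean((sigma(X @ w - b) - y) ** 2))
```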
Neural Networks: Training
- Discovered in the 1960's
- Applied to neural networks in the 1970's
- Theoretical progress in the 1980's and 1990's
- Profited from increased computational power in the 2010's, which allowed applications to large data sets and to neural networks with tens or hundreds of layers
- Achieved human and super-human performance in pattern recognition and later in many other applications
Neural Networks: Deep learning
- Training of networks with a large number (∼ 100) of layers
- Made possible by the use of GPUs (Nvidia), which accelerated deep learning by a factor of roughly 100
- The use of many parameters makes the network sensitive to overfitting (= too exact an adaptation to the training data, not matched by other data from the same area)
- Overfitting is reduced by regularization methods: an ℓ2 penalty on the weights (weight decay) or an ℓ1 penalty (sparsity)
- Further tricks are used to accelerate the learning algorithm
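The two penalties change the gradient step in a one-line way; a minimal sketch (numpy, the names lr and lam, and the numbers are my assumptions):

```python
# A minimal sketch of one regularized gradient step on loss(w) + lam * penalty(w):
# the l2 penalty shrinks the weights ("weight decay"), the l1 penalty pushes
# some of them to exact zero (sparsity).
import numpy as np

def regularized_step(w, grad_w, lr=0.1, lam=1e-3, penalty="l2"):
    if penalty == "l2":
        return w - lr * (grad_w + 2 * lam * w)     # gradient of lam * ||w||_2^2
    return w - lr * (grad_w + lam * np.sign(w))    # subgradient of lam * ||w||_1

w = np.array([0.5, -0.2, 0.0])
grad_w = np.array([0.1, 0.0, -0.05])
print(regularized_step(w, grad_w, penalty="l2"))
print(regularized_step(w, grad_w, penalty="l1"))
```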
Applications
- Pattern recognition
- Computer vision
- Speech recognition
- Social network filtering
- Recommendation systems
- Bioinformatics
- AlphaGo
- ...
Thank you for your attention!