Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences...
Transcript of Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences...
Big Data in the SocialSciences
Statistical methods for multi-source high-dimensional data
Katrijn Van Deun, Tilburg University
Outline
• Introduction & Motivation
• Methods• Model
• Objective function
• Estimation
• Model Selection
• Conclusion & Discussion
Statistical methods for multi-source high-dimensional data 2
Introduction
Statistical methods for multi-source high-dimensional data 3
Big Data in the Social Sciences
• Everything is measured• What we think and do: Social media, Web browsing behavior
• Where we are and with whom: GPS tracking, cameras
• At a very detailed level: neuron, DNA
• Data are shared• Open data: in science, governments (open government data)
• Data are linked• Government
• Science: multi-disciplinary
• Linked Data web-architecture
Statistical methods for multi-source high-dimensional data 4
Statistical methods for multi-source high-dimensional data 5
• Illustration: Health & Retirement Study; traditional data
Res
po
nd
ents W
ell b
ein
g
1..............
10 000
Survey data
1 … 500
Statistical methods for multi-source high-dimensional data 6
• Illustration: Health & Retirement Study; traditional + novel type of data
• Multiple sources, heterogeneous in nature
• High-dimensional (p>>n)
Res
po
nd
ents
501 … 50 000
Novel kind of data
We
ll be
ing
1.....
1000
Survey data
1 … 500
Big Data
• Illustration: ALSPAC household panel data
• Multiple sources / multi-block (heterogeneous)
• High-dimensional
Statistical methods for multi-source high-dimensional data 7
res
po
nd
en
ts
Extremely information-rich data
• 1) Adds context, detail => deeper understanding + more accurate prediction
Eg. Same income, social network, health but difference in well-being?
• 2) Gives insight in the interplay between multiple factors
Eg. gene-environment interactions: find (epi-)genetic markers that make someone susceptible to obesity together with the protective/risk-provokingenvironmental conditions associated to these markers
However, statistical tools fall short …
Statistical methods for multi-source high-dimensional data 8
Statistical methods for multi-source high-dimensional data 9
Statistical methods for multi-source high-dimensional data 10
• First challenge = Data fusion?
• How to guarantee that the different sources of variation complement eachother?
• Find common sources of structural variation?
• Yet, heterogeneous data sources; often common sources of variationare subtle while source-specific variation is dominant
Statistical methods for multi-source high-dimensional data 11
• Second challenge = Relevant variables?
• Find relevant variables
• Information may be hidden in a bulk of irrelevant variablese.g., genetic markers for obesity hidden in a bulk of irrelevant markers
• Interpretation based on many variables is not very insightful
• Note: in general, we do not know which variables are relevant and which are not
• S-O-A: penalties, eg lasso• However: selects only one variable out of a group of correlated variables
• (i) have a high risk of not selecting the most relevant variable, • (ii) are highly instable, and • (iii) tend to also select variables of irrelevant groups
Method:Sparse common and specific components
Statistical methods for multi-source high-dimensional data 12
Method: Sparse common & specific components
• Structured analysis
• Data fusion: Common components
• Detection of relevant variables: Penalties (eg lasso)
=> Selection of linked variables (between blocks)
13
Res
po
nd
ents
501 … 50 000
Nieuw soort data
We
ll be
ing
1.....
1000
Survey data
1 … 500
Statistical methods for multi-source high-dimensional data
• Notation and naming conventions
• data block: denotes the different data sources forming the multiblock data
• Xk : data block k (with k=1,…,K ); the outcome(s) is denoted by Y (y ifunivariate)
• Each of the data blocks: same set of observation units (respondents)
14
Res
po
nd
ents
501 … 50 000
X2: Novel kind of data
Y: We
ll be
ing
1.....
1000
X1: Survey
data
1 … 500
Statistical methods for multi-source high-dimensional data
• Point of departure = Simultaneous component analysis (SCA)
– Extension of PCA to the multiblock case
– Promising method for data integration
Method: Data fusion
Statistical methods for multi-source high-dimensional data 15
• Principal component models
• Weight based variant:
Xk = Xk WkPkT + Ek s.t. Wk
TWk = I, (1)
= Tk PkT + Ek
with Wk (Jk×R) the component weights, Tk (I×R) the component scores, and Pk
(Jk×R) the component loadings
Interpretation of component trk based on Jk (!) regression weights: tirk=Sjwjrkxijk
Statistical methods for multi-source high-dimensional data 16
• Principal component models
• Loading based variant
Xk = TkPkT + Ek s.t. Tk
TTk = I, (2)
Interpretation of component trk based on Jk (!) correlations:
r(xjk,trk)=pjrk
• Note: In a least squares approach subject to PkTPk=I, we have Wk=Pk
Statistical methods for multi-source high-dimensional data 17
• Simultaneous component analysis
For all k:Xk = TPk
T + Ek s.t. TTT=I (3)
->same component scores for all data blocks!
1. Weight based variant[X1 ... XK] = T [P1
T ... PKT] + [E1 ... EK
T]= [X1 ... XK] [W1
T ... WKT]T [P1
T ... PKT] + [E1 ... EK
T]
or, in shorthand notation:
Xconc = Xconc Wconc PconcT + Econc (4)
2. Loading based variantXconc = T Pconc
T + Econc (5)
Statistical methods for multi-source high-dimensional data 18
• Simultaneous components
• Guarantee same source of variation for the different data blocks
• But … do not guarantee that the components are common!
• Explanation: SC methods model the largest source of variation in the concatenated data, often this is source-specific variation as common variation is subtle (e.g. response tendencies and general socio-behavioral or genetic processes vs subtle gene-environment ineractions)
Statistical methods for multi-source high-dimensional data 19
• Common component analysis (CCA)
– Account for dominating source specific variation
– Common component model:
Xk = TPk’ such that T’T=I and Pk has common / specific structure
– Structure can be imposed (constrained analysis)
– Or: Xk= XconcWconcPk’ with the Wconc having common / specific structure
P1 P2
Common x x x x x x x
Spec1 x x x 0 0 0 0
Spec1 0 0 0 x x x x
Statistical methods for multi-source high-dimensional data 20
• So far, so good …
• Yet:
• Interest in relevant / important variables (social factors, genetic markers)?
• Interpretation of the components based on 1000s of loadings is infeasible
Need to automatically select the “relevant” variables!
Sparse common components are needed.
Statistical methods for multi-source high-dimensional data 21
• Structured sparsity:
• Sparse common components: few non-zero loadings in each data block Xk
• (Sparse) distinctive components: (few) non-zero loadings only in one/few data blocks Xk
Statistical methods for multi-source high-dimensional data 22
X1 X2
Common x 0 0 x x 0 x
Dist1 x x x 0 0 0 0
Dist1 0 0 0 x 0 x x
• Sparse analysis
– Impose restriction on the loadings: many should become zero
– S-O-A: add penalties (e.g., lasso) to the objective function
Variable selection: How to?
Statistical methods for multi-source high-dimensional data 23
• Sparse SCA: Objective function
Add penalty known to have variable selection properties to SCA objective function:
Minimize over T and PC and such that 𝐓′𝐓 = 𝐈
𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶′𝑜𝑛𝑐
2 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏
with 𝐩𝑟,𝑘 𝟏 = σ𝑗 𝑝𝑗𝑘𝑟 the L1 penalty or lasso tuned by lr,k ≥ 0
-> shrinks and selects variables
Fit / SCA Penalty
Statistical methods for multi-source high-dimensional data 24
• (adaptive) Lasso
• Oracle properties (under someconditions)
• Estimation: soft thresholdingoperator S(bOLS, l/2)
= bOLS – l/2 if bOLS>l/2
bLASSO = 0 if – l/2 <bOLS<l/2
= bOLS – l/2 if bOLS<-l/2
Statistical methods for multi-source high-dimensional data 25
0
• Sparse common and specific components: Objective function
Add penalties and/or constraints to obtain structured sparsity
𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏+ σ𝑘 σ𝑟
𝑆𝛾𝑟
𝑆𝐩𝑟
𝑆,𝑘 2+ σ𝑟
𝐶𝜀𝑟
𝐶𝐩𝑟
𝐶,𝑘 1,2
Penalties:
Lasso: 𝐩𝑟,𝑘 𝟏 = σ𝒋𝒌𝒑𝒋
𝒌𝒓 and 𝜆𝑟,𝑘≥0 a tuning parameter
Group lasso: 𝐩𝑟,𝑘 𝟐 = σ𝒋𝒌𝒑𝒋
𝒌𝒓 ² and 𝛾𝑟
𝑆≥0 a tuning parameter
Elitist lasso: 𝐩𝑟,𝑘 𝟏, 𝟐= σ𝒋
𝒌𝒑𝒋
𝒌𝒓 ² and 𝜀𝑟
𝐶≥0 a tuning parameter
Fit / SCAPenalties
Statistical methods for multi-source high-dimensional data 26
Kills blocksof variables
Kills variables withineach block
Generic objective function:
• Allows for combinations of penalties, some known
• Penalties can be imposed per component• Useful to find sparse common and specific components
Statistical methods for multi-source high-dimensional data 27
• Algorithm: Alternating procedure
Given fixed tuning parameters and number of common and distinctivecomponents, do
0. Initialize Pconc
1. Update T conditional upon Pconc
Closed form: T=UV’ with U and V from the SVD of XCP (I×R -> small for H-D data)
2. Update Pconc conditional upon T
MM+UST: Introduce surrogated objective function and applyunivariate soft thresholding to surrogate (see next)
3. Check stop criteria (convergence of the loss, maximum number of iterations) and return to step 1 or terminate
Statistical methods for multi-source high-dimensional data 28
• Algorithm: Conditional estimation of Pconc given T
Complicated optimization problem because of penalties => MM procedure
Statistical methods for multi-source high-dimensional data 29
• MM in brief
Find min of f(x) via surrogate
function g(x,c) (c:constant):
– g(x,c) easy to minimize
– g(x,c) ≥ f(x)
– g(c,c) = f(c) (supporting
point)
f(xmin) ≤ g(xmin,a) ≤ g(a,a)=f(a)
=>min of prev.iter. = supp.point current it.
Yields non-increasing series of loss values
Loss
Orig function
f(x)
MM function
g(x,c)
x
First supp.
point
2nd supp. point
Statistical methods for multi-source high-dimensional data 30
Example: constructing a majorizing function for the elitist lasso penalty
𝒋𝒌
𝑝𝑗𝑘𝑟 ² ≤
𝒋𝒌
𝑝𝑗𝑘𝑟
(0)
𝑗𝑘
𝑝𝑗𝑘𝑟
2
𝑝𝑗𝑘𝑟
(0)
=p’D1p with D1 a diagonal matrix of σ𝒋
𝒌𝑝𝑗
𝑘𝑟
(0)
𝑝𝑗𝑘𝑟
(0)
Applied to group, elistist and regularly lasso we obtain as a surrogate function
L ≤ 𝑘 + 𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + vec 𝐏𝐶𝑜𝑛𝑐
′𝐃 vec 𝐏𝐶𝑜𝑛𝑐
=𝑘 + vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐
′𝐃 vec 𝐏𝐶𝑜𝑛𝑐For which the root can be found using standard techniques
Statistical methods for multi-source high-dimensional data 31
Hence, the problem is now to find the minimum of
vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐
′𝐃 vec 𝐏𝐶𝑜𝑛𝑐
which can be found using standard techniques:
vec 𝐏𝐂𝐨𝐧𝐜 = 𝐃+ 𝐈 −𝟏vec 𝐓𝑇𝑿𝐶𝑜𝑛𝑐
Note that D+I is diagonal
Statistical methods for multi-source high-dimensional data 32
Alternative approach: Coordinate descent
L ≤ 𝑘 + 𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + vec 𝐏𝐶𝑜𝑛𝑐
′𝐃∗ vec 𝐏𝐶𝑜𝑛𝑐 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏
=𝑘 + vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐
′𝐃∗vec 𝐏𝐶𝑜𝑛𝑐 +σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏
=k + σ𝑗𝑘,𝑟σ𝑖 𝑥𝑖𝑗 − 𝑡𝑖𝑟𝑝𝑗
𝑘𝑟 ² + 𝑑𝑗𝑘𝑟
∗ 𝑝𝑗𝑘𝑟2 + 𝜆𝑟,𝑘 𝑝𝑗
𝑘𝑟
For which the root can be found using subgradient techniques
Statistical methods for multi-source high-dimensional data 33
• Hence, the following soft thresholding update of the loadings:
𝑝𝑗𝑘𝑟 =
σ𝑖 𝑥𝑖𝑗𝑘𝑡𝑖𝑟 − 𝜆𝑟/2
1 + 𝑑𝑗𝑘𝑟
if 𝑝𝑗𝑘𝑟 > 0
σ𝑖 𝑥𝑖𝑗𝑘𝑡𝑖𝑟 + 𝜆𝑟/2
1 + 𝑑𝑗𝑘𝑟
if 𝑝𝑗𝑘𝑟 < 0
0 else
which can be calculated for all loadings of component r simultaneously usingsimple vector and matrix operations!!!
=> Highly efficient (time+memory), scalable to large data
Statistical methods for multi-source high-dimensional data 34
• Algorithm: Weight based variant (sparseness on weights)
• Similar type of algorithms can be constructed
Standard MM: vec 𝐖𝐂𝐨𝐧𝐜 = 𝐃 + 𝐈⊗ 𝐗𝑐𝑜𝑛𝑐𝑇 𝐗𝐶𝑜𝑛𝑐
−𝟏vec 𝐓𝑇𝐗𝐶𝑜𝑛𝑐
• Expression to estimate 𝑤𝑗𝑘∗𝑟∗ using coordinate descent (cycle over all 𝑅σ𝑘 𝐽𝑘
coefficients); sparse group lasso case
The expression for an individual coefficient is not very expensive but has to becalculated many times.
Statistical methods for multi-source high-dimensional data 35
𝑤𝑗𝑘∗𝑟∗ =
σ𝑖,𝑘,𝑗𝑘𝑥𝑖𝑗𝑘
∗𝑟𝑖𝑗𝑘𝑝𝑗
𝑘𝑟∗ − 𝜆𝑟/2
1 + 𝑑𝑗𝑘𝑟
if 𝑤𝑗𝑘∗𝑟∗ > 0
σ𝑖,𝑘,𝑗𝑘𝑥𝑖𝑗𝑘
∗𝑟𝑖𝑗𝑘𝑝𝑗
𝑘𝑟∗ + 𝜆𝑟/2
1 + 𝑑𝑗𝑘𝑟
if 𝑤𝑗𝑘∗𝑟∗ < 0
0 else
• Algorithm
• Coordinate-wise approach allows to fix coefficients
• This can be used to define the specific components by fixing to zero thecoefficients corresponding to the block not accounted for by the component
Statistical methods for multi-source high-dimensional data 36
Model selection
• Number of components R, penalty tuning parameters and/or status (common or specific):
• How to select suitable values?
• No definite answer, only suggestions
• R: inspect VAF
• Use stability selection to tune variable selection (lL)
• Cross-validation
• Exhaustive strategy for status if nr blocks & components limited
Statistical methods for multi-source high-dimensional data 37
Some results
Statistical methods for multi-source high-dimensional data 38
Weights or loadings: Simulation results
Statistical methods for multi-source high-dimensional data 39
• Data generated with eithersparse W or sparse P
• All data analyzed with both a model with sparsity imposedon W and sparsity imposed on P
• Recovery best when generatedwith sparse loadings andanalyzed with sparse weights
• More direct link betweenreproduced data and loadings
Structured sparsity needed or just lasso?
Statistical methods for multi-source high-dimensional data 40
Tucker congruence between WTRUE and estimated W
DISCUSSION
Statistical methods for multi-source high-dimensional data 41
• We presented a generalization of sparse PCA to the multiblock case
• Sparsity can be imposed accounting for the block structure
• Sparsity can be imposed either on the weights or the loadings
Note: best recovery for generated with sp.loadings and estimated with sp.weights
• Different algorithmic strategies possible: here MM and coord. Descentconsidered
Statistical methods for multi-source high-dimensional data 42
Fit Stability estimates Comput. Effic.
Sparse weights + - -
Sparse loadings - + +
• Planned developments / in progress
• Selection of relevant clusters of correlated variables (to deal with issues of high-dimensionality that haunt lasso, elastic net and so on)
• Prediction by extensions of principal covariates regression
with 0 ≤ a ≤ 1 (a = 0 ->PCR; a = 1 ->MLR/RRR)
Statistical methods for multi-source high-dimensional data 43
2
21
2
2
2
2
1,,
WW
X
XWPX
y
XWpypPW
RL
T
Xy
yXL
ll
aa
• SPCovR Application: Find early (day 3) genetic signature that predictsflu vaccine efficacy (day 28) (public data: Nakaya et al., 2011)
Statistical methods for multi-source high-dimensional data 44
2
i
.
.
.
24
.
.
.
. . . . . .1
X y
1
.
.
.
.
.
24
• Comparison of results with sparse PLS
Statistical methods for multi-source high-dimensional data 45
• SPCovR: Biological content of selected transcripts
• Sparse PLS: no (significant) terms found
Statistical methods for multi-source high-dimensional data 46
Thank you!
&
48
• I am looking for a PhD candidate!!