big data in the social sciences - supélec van deun.pdf · big data in the social sciences...

48
Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Upload: others

Post on 25-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Big Data in the SocialSciences

Statistical methods for multi-source high-dimensional data

Katrijn Van Deun, Tilburg University

Page 2: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Outline

• Introduction & Motivation

• Methods• Model

• Objective function

• Estimation

• Model Selection

• Conclusion & Discussion

Statistical methods for multi-source high-dimensional data 2

Page 3: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Introduction

Statistical methods for multi-source high-dimensional data 3

Page 4: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Big Data in the Social Sciences

• Everything is measured• What we think and do: Social media, Web browsing behavior

• Where we are and with whom: GPS tracking, cameras

• At a very detailed level: neuron, DNA

• Data are shared• Open data: in science, governments (open government data)

• Data are linked• Government

• Science: multi-disciplinary

• Linked Data web-architecture

Statistical methods for multi-source high-dimensional data 4

Page 5: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Statistical methods for multi-source high-dimensional data 5

• Illustration: Health & Retirement Study; traditional data

Res

po

nd

ents W

ell b

ein

g

1..............

10 000

Survey data

1 … 500

Page 6: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Statistical methods for multi-source high-dimensional data 6

• Illustration: Health & Retirement Study; traditional + novel type of data

• Multiple sources, heterogeneous in nature

• High-dimensional (p>>n)

Res

po

nd

ents

501 … 50 000

Novel kind of data

We

ll be

ing

1.....

1000

Survey data

1 … 500

Big Data

Page 7: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Illustration: ALSPAC household panel data

• Multiple sources / multi-block (heterogeneous)

• High-dimensional

Statistical methods for multi-source high-dimensional data 7

res

po

nd

en

ts

Page 8: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Extremely information-rich data

• 1) Adds context, detail => deeper understanding + more accurate prediction

Eg. Same income, social network, health but difference in well-being?

• 2) Gives insight in the interplay between multiple factors

Eg. gene-environment interactions: find (epi-)genetic markers that make someone susceptible to obesity together with the protective/risk-provokingenvironmental conditions associated to these markers

However, statistical tools fall short …

Statistical methods for multi-source high-dimensional data 8

Page 9: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Statistical methods for multi-source high-dimensional data 9

Page 10: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Statistical methods for multi-source high-dimensional data 10

• First challenge = Data fusion?

• How to guarantee that the different sources of variation complement eachother?

• Find common sources of structural variation?

• Yet, heterogeneous data sources; often common sources of variationare subtle while source-specific variation is dominant

Page 11: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Statistical methods for multi-source high-dimensional data 11

• Second challenge = Relevant variables?

• Find relevant variables

• Information may be hidden in a bulk of irrelevant variablese.g., genetic markers for obesity hidden in a bulk of irrelevant markers

• Interpretation based on many variables is not very insightful

• Note: in general, we do not know which variables are relevant and which are not

• S-O-A: penalties, eg lasso• However: selects only one variable out of a group of correlated variables

• (i) have a high risk of not selecting the most relevant variable, • (ii) are highly instable, and • (iii) tend to also select variables of irrelevant groups

Page 12: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Method:Sparse common and specific components

Statistical methods for multi-source high-dimensional data 12

Page 13: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Method: Sparse common & specific components

• Structured analysis

• Data fusion: Common components

• Detection of relevant variables: Penalties (eg lasso)

=> Selection of linked variables (between blocks)

13

Res

po

nd

ents

501 … 50 000

Nieuw soort data

We

ll be

ing

1.....

1000

Survey data

1 … 500

Statistical methods for multi-source high-dimensional data

Page 14: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Notation and naming conventions

• data block: denotes the different data sources forming the multiblock data

• Xk : data block k (with k=1,…,K ); the outcome(s) is denoted by Y (y ifunivariate)

• Each of the data blocks: same set of observation units (respondents)

14

Res

po

nd

ents

501 … 50 000

X2: Novel kind of data

Y: We

ll be

ing

1.....

1000

X1: Survey

data

1 … 500

Statistical methods for multi-source high-dimensional data

Page 15: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Point of departure = Simultaneous component analysis (SCA)

– Extension of PCA to the multiblock case

– Promising method for data integration

Method: Data fusion

Statistical methods for multi-source high-dimensional data 15

Page 16: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Principal component models

• Weight based variant:

Xk = Xk WkPkT + Ek s.t. Wk

TWk = I, (1)

= Tk PkT + Ek

with Wk (Jk×R) the component weights, Tk (I×R) the component scores, and Pk

(Jk×R) the component loadings

Interpretation of component trk based on Jk (!) regression weights: tirk=Sjwjrkxijk

Statistical methods for multi-source high-dimensional data 16

Page 17: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Principal component models

• Loading based variant

Xk = TkPkT + Ek s.t. Tk

TTk = I, (2)

Interpretation of component trk based on Jk (!) correlations:

r(xjk,trk)=pjrk

• Note: In a least squares approach subject to PkTPk=I, we have Wk=Pk

Statistical methods for multi-source high-dimensional data 17

Page 18: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Simultaneous component analysis

For all k:Xk = TPk

T + Ek s.t. TTT=I (3)

->same component scores for all data blocks!

1. Weight based variant[X1 ... XK] = T [P1

T ... PKT] + [E1 ... EK

T]= [X1 ... XK] [W1

T ... WKT]T [P1

T ... PKT] + [E1 ... EK

T]

or, in shorthand notation:

Xconc = Xconc Wconc PconcT + Econc (4)

2. Loading based variantXconc = T Pconc

T + Econc (5)

Statistical methods for multi-source high-dimensional data 18

Page 19: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Simultaneous components

• Guarantee same source of variation for the different data blocks

• But … do not guarantee that the components are common!

• Explanation: SC methods model the largest source of variation in the concatenated data, often this is source-specific variation as common variation is subtle (e.g. response tendencies and general socio-behavioral or genetic processes vs subtle gene-environment ineractions)

Statistical methods for multi-source high-dimensional data 19

Page 20: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Common component analysis (CCA)

– Account for dominating source specific variation

– Common component model:

Xk = TPk’ such that T’T=I and Pk has common / specific structure

– Structure can be imposed (constrained analysis)

– Or: Xk= XconcWconcPk’ with the Wconc having common / specific structure

P1 P2

Common x x x x x x x

Spec1 x x x 0 0 0 0

Spec1 0 0 0 x x x x

Statistical methods for multi-source high-dimensional data 20

Page 21: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• So far, so good …

• Yet:

• Interest in relevant / important variables (social factors, genetic markers)?

• Interpretation of the components based on 1000s of loadings is infeasible

Need to automatically select the “relevant” variables!

Sparse common components are needed.

Statistical methods for multi-source high-dimensional data 21

Page 22: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Structured sparsity:

• Sparse common components: few non-zero loadings in each data block Xk

• (Sparse) distinctive components: (few) non-zero loadings only in one/few data blocks Xk

Statistical methods for multi-source high-dimensional data 22

X1 X2

Common x 0 0 x x 0 x

Dist1 x x x 0 0 0 0

Dist1 0 0 0 x 0 x x

Page 23: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Sparse analysis

– Impose restriction on the loadings: many should become zero

– S-O-A: add penalties (e.g., lasso) to the objective function

Variable selection: How to?

Statistical methods for multi-source high-dimensional data 23

Page 24: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Sparse SCA: Objective function

Add penalty known to have variable selection properties to SCA objective function:

Minimize over T and PC and such that 𝐓′𝐓 = 𝐈

𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶′𝑜𝑛𝑐

2 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏

with 𝐩𝑟,𝑘 𝟏 = σ𝑗 𝑝𝑗𝑘𝑟 the L1 penalty or lasso tuned by lr,k ≥ 0

-> shrinks and selects variables

Fit / SCA Penalty

Statistical methods for multi-source high-dimensional data 24

Page 25: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• (adaptive) Lasso

• Oracle properties (under someconditions)

• Estimation: soft thresholdingoperator S(bOLS, l/2)

= bOLS – l/2 if bOLS>l/2

bLASSO = 0 if – l/2 <bOLS<l/2

= bOLS – l/2 if bOLS<-l/2

Statistical methods for multi-source high-dimensional data 25

0

Page 26: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Sparse common and specific components: Objective function

Add penalties and/or constraints to obtain structured sparsity

𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏+ σ𝑘 σ𝑟

𝑆𝛾𝑟

𝑆𝐩𝑟

𝑆,𝑘 2+ σ𝑟

𝐶𝜀𝑟

𝐶𝐩𝑟

𝐶,𝑘 1,2

Penalties:

Lasso: 𝐩𝑟,𝑘 𝟏 = σ𝒋𝒌𝒑𝒋

𝒌𝒓 and 𝜆𝑟,𝑘≥0 a tuning parameter

Group lasso: 𝐩𝑟,𝑘 𝟐 = σ𝒋𝒌𝒑𝒋

𝒌𝒓 ² and 𝛾𝑟

𝑆≥0 a tuning parameter

Elitist lasso: 𝐩𝑟,𝑘 𝟏, 𝟐= σ𝒋

𝒌𝒑𝒋

𝒌𝒓 ² and 𝜀𝑟

𝐶≥0 a tuning parameter

Fit / SCAPenalties

Statistical methods for multi-source high-dimensional data 26

Kills blocksof variables

Kills variables withineach block

Page 27: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Generic objective function:

• Allows for combinations of penalties, some known

• Penalties can be imposed per component• Useful to find sparse common and specific components

Statistical methods for multi-source high-dimensional data 27

Page 28: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Algorithm: Alternating procedure

Given fixed tuning parameters and number of common and distinctivecomponents, do

0. Initialize Pconc

1. Update T conditional upon Pconc

Closed form: T=UV’ with U and V from the SVD of XCP (I×R -> small for H-D data)

2. Update Pconc conditional upon T

MM+UST: Introduce surrogated objective function and applyunivariate soft thresholding to surrogate (see next)

3. Check stop criteria (convergence of the loss, maximum number of iterations) and return to step 1 or terminate

Statistical methods for multi-source high-dimensional data 28

Page 29: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Algorithm: Conditional estimation of Pconc given T

Complicated optimization problem because of penalties => MM procedure

Statistical methods for multi-source high-dimensional data 29

Page 30: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• MM in brief

Find min of f(x) via surrogate

function g(x,c) (c:constant):

– g(x,c) easy to minimize

– g(x,c) ≥ f(x)

– g(c,c) = f(c) (supporting

point)

f(xmin) ≤ g(xmin,a) ≤ g(a,a)=f(a)

=>min of prev.iter. = supp.point current it.

Yields non-increasing series of loss values

Loss

Orig function

f(x)

MM function

g(x,c)

x

First supp.

point

2nd supp. point

Statistical methods for multi-source high-dimensional data 30

Page 31: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Example: constructing a majorizing function for the elitist lasso penalty

𝒋𝒌

𝑝𝑗𝑘𝑟 ² ≤

𝒋𝒌

𝑝𝑗𝑘𝑟

(0)

𝑗𝑘

𝑝𝑗𝑘𝑟

2

𝑝𝑗𝑘𝑟

(0)

=p’D1p with D1 a diagonal matrix of σ𝒋

𝒌𝑝𝑗

𝑘𝑟

(0)

𝑝𝑗𝑘𝑟

(0)

Applied to group, elistist and regularly lasso we obtain as a surrogate function

L ≤ 𝑘 + 𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + vec 𝐏𝐶𝑜𝑛𝑐

′𝐃 vec 𝐏𝐶𝑜𝑛𝑐

=𝑘 + vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐

′𝐃 vec 𝐏𝐶𝑜𝑛𝑐For which the root can be found using standard techniques

Statistical methods for multi-source high-dimensional data 31

Page 32: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Hence, the problem is now to find the minimum of

vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐

′𝐃 vec 𝐏𝐶𝑜𝑛𝑐

which can be found using standard techniques:

vec 𝐏𝐂𝐨𝐧𝐜 = 𝐃+ 𝐈 −𝟏vec 𝐓𝑇𝑿𝐶𝑜𝑛𝑐

Note that D+I is diagonal

Statistical methods for multi-source high-dimensional data 32

Page 33: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Alternative approach: Coordinate descent

L ≤ 𝑘 + 𝑿𝐶𝑜𝑛𝑐 − 𝐓𝐏𝐶𝑜𝑛𝑐′ 2 + vec 𝐏𝐶𝑜𝑛𝑐

′𝐃∗ vec 𝐏𝐶𝑜𝑛𝑐 + σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏

=𝑘 + vec(𝑿𝐶𝑜𝑛𝑐) − 𝐈⨂𝐓 vec(𝐏𝐂𝐨𝐧𝐜)2 + vec 𝐏𝐶𝑜𝑛𝑐

′𝐃∗vec 𝐏𝐶𝑜𝑛𝑐 +σ𝑟,𝑘 𝜆𝑟,𝑘 𝐩𝑟,𝑘 𝟏

=k + σ𝑗𝑘,𝑟σ𝑖 𝑥𝑖𝑗 − 𝑡𝑖𝑟𝑝𝑗

𝑘𝑟 ² + 𝑑𝑗𝑘𝑟

∗ 𝑝𝑗𝑘𝑟2 + 𝜆𝑟,𝑘 𝑝𝑗

𝑘𝑟

For which the root can be found using subgradient techniques

Statistical methods for multi-source high-dimensional data 33

Page 34: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Hence, the following soft thresholding update of the loadings:

𝑝𝑗𝑘𝑟 =

σ𝑖 𝑥𝑖𝑗𝑘𝑡𝑖𝑟 − 𝜆𝑟/2

1 + 𝑑𝑗𝑘𝑟

if 𝑝𝑗𝑘𝑟 > 0

σ𝑖 𝑥𝑖𝑗𝑘𝑡𝑖𝑟 + 𝜆𝑟/2

1 + 𝑑𝑗𝑘𝑟

if 𝑝𝑗𝑘𝑟 < 0

0 else

which can be calculated for all loadings of component r simultaneously usingsimple vector and matrix operations!!!

=> Highly efficient (time+memory), scalable to large data

Statistical methods for multi-source high-dimensional data 34

Page 35: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Algorithm: Weight based variant (sparseness on weights)

• Similar type of algorithms can be constructed

Standard MM: vec 𝐖𝐂𝐨𝐧𝐜 = 𝐃 + 𝐈⊗ 𝐗𝑐𝑜𝑛𝑐𝑇 𝐗𝐶𝑜𝑛𝑐

−𝟏vec 𝐓𝑇𝐗𝐶𝑜𝑛𝑐

• Expression to estimate 𝑤𝑗𝑘∗𝑟∗ using coordinate descent (cycle over all 𝑅σ𝑘 𝐽𝑘

coefficients); sparse group lasso case

The expression for an individual coefficient is not very expensive but has to becalculated many times.

Statistical methods for multi-source high-dimensional data 35

𝑤𝑗𝑘∗𝑟∗ =

σ𝑖,𝑘,𝑗𝑘𝑥𝑖𝑗𝑘

∗𝑟𝑖𝑗𝑘𝑝𝑗

𝑘𝑟∗ − 𝜆𝑟/2

1 + 𝑑𝑗𝑘𝑟

if 𝑤𝑗𝑘∗𝑟∗ > 0

σ𝑖,𝑘,𝑗𝑘𝑥𝑖𝑗𝑘

∗𝑟𝑖𝑗𝑘𝑝𝑗

𝑘𝑟∗ + 𝜆𝑟/2

1 + 𝑑𝑗𝑘𝑟

if 𝑤𝑗𝑘∗𝑟∗ < 0

0 else

Page 36: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Algorithm

• Coordinate-wise approach allows to fix coefficients

• This can be used to define the specific components by fixing to zero thecoefficients corresponding to the block not accounted for by the component

Statistical methods for multi-source high-dimensional data 36

Page 37: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Model selection

• Number of components R, penalty tuning parameters and/or status (common or specific):

• How to select suitable values?

• No definite answer, only suggestions

• R: inspect VAF

• Use stability selection to tune variable selection (lL)

• Cross-validation

• Exhaustive strategy for status if nr blocks & components limited

Statistical methods for multi-source high-dimensional data 37

Page 38: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Some results

Statistical methods for multi-source high-dimensional data 38

Page 39: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Weights or loadings: Simulation results

Statistical methods for multi-source high-dimensional data 39

• Data generated with eithersparse W or sparse P

• All data analyzed with both a model with sparsity imposedon W and sparsity imposed on P

• Recovery best when generatedwith sparse loadings andanalyzed with sparse weights

• More direct link betweenreproduced data and loadings

Page 40: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Structured sparsity needed or just lasso?

Statistical methods for multi-source high-dimensional data 40

Tucker congruence between WTRUE and estimated W

Page 41: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

DISCUSSION

Statistical methods for multi-source high-dimensional data 41

Page 42: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• We presented a generalization of sparse PCA to the multiblock case

• Sparsity can be imposed accounting for the block structure

• Sparsity can be imposed either on the weights or the loadings

Note: best recovery for generated with sp.loadings and estimated with sp.weights

• Different algorithmic strategies possible: here MM and coord. Descentconsidered

Statistical methods for multi-source high-dimensional data 42

Fit Stability estimates Comput. Effic.

Sparse weights + - -

Sparse loadings - + +

Page 43: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Planned developments / in progress

• Selection of relevant clusters of correlated variables (to deal with issues of high-dimensionality that haunt lasso, elastic net and so on)

• Prediction by extensions of principal covariates regression

with 0 ≤ a ≤ 1 (a = 0 ->PCR; a = 1 ->MLR/RRR)

Statistical methods for multi-source high-dimensional data 43

2

21

2

2

2

2

1,,

WW

X

XWPX

y

XWpypPW

RL

T

Xy

yXL

ll

aa

Page 44: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• SPCovR Application: Find early (day 3) genetic signature that predictsflu vaccine efficacy (day 28) (public data: Nakaya et al., 2011)

Statistical methods for multi-source high-dimensional data 44

2

i

.

.

.

24

.

.

.

. . . . . .1

X y

1

.

.

.

.

.

24

Page 45: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• Comparison of results with sparse PLS

Statistical methods for multi-source high-dimensional data 45

Page 46: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

• SPCovR: Biological content of selected transcripts

• Sparse PLS: no (significant) terms found

Statistical methods for multi-source high-dimensional data 46

Page 47: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

Thank you!

&

Page 48: Big Data in the Social Sciences - Supélec van deun.pdf · Big Data in the Social Sciences Statistical methods for multi-source high-dimensional data Katrijn Van Deun, Tilburg University

48

• I am looking for a PhD candidate!!