TRANSCRIPT
Sparse Proteomics Analysis (SPA): Toward a Mathematical Theory for Feature Selection from Forward Models
Martin Genzel
Technische Universität Berlin
Winter School on Compressed Sensing, December 5, 2015
Outline
1 Biological Background
2 Sparse Proteomics Analysis (SPA)
3 Theoretical Foundation by High-dimensional Estimation Theory
Martin Genzel Sparse Proteomics Analysis (SPA) WiCoS 2015 2 / 19
What is Proteomics?
The pathological mechanisms of many diseases, such as cancer, are manifested on the level of protein activities.
To improve clinical treatment options and early diagnostics, we need to understand protein structures and their interactions!
Proteins are long chains of amino acids, controlling many biological and chemical processes in the human body.
The entire set of proteins at a certain point of time is called a proteome.
Proteomics is the large-scale study of the human proteome.
http://www.topsan.org/Proteins/JCSG/3qxb
What is Mass Spectrometry?
How to “capture” a proteome?
Mass spectrometry (MS) is a popular technique to detect the abundance of proteins in samples (blood, urine, etc.).
Schematic Work-Flow
[Figure: a laser ionizes the sample; a detector records intensity (cts) over mass (m/z), producing a mass spectrum.]
Real-World MS-Data
[Figure: a real-world mass spectrum, intensity (cts) over mass (m/z).]
MS-vector: x = (x_1, ..., x_d) ∈ R^d, with d ≈ 10^4 ... 10^6
Index = mass/feature, entry = intensity/amplitude
Feature Selection from MS-Data
Goal: Detect a small set of features (a disease fingerprint) that allows for an appropriate distinction between the diseased and the healthy group.
Schematic Work-Flow
[Figure: blood samples from a healthy and a diseased individual are measured by MS; the resulting mass spectra, intensity (cts) over mass (m/z), are compared; feature selection yields the disease fingerprint.]
Mathematical Problem Formulation
Supervised Learning: We are given n samples (x_1, y_1), ..., (x_n, y_n).
x_k ∈ R^d: mass spectrum of the k-th patient
y_k ∈ {−1, +1}: health status of the k-th patient (healthy = +1, diseased = −1)
Goal: Learn a feature vector ω ∈ R^d which is
sparse, i.e., has few non-zero entries (⇒ stability, avoids overfitting),
and whose entries correspond to peaks that are highly correlated with the disease (⇒ interpretability, biological relevance).
How to learn a fingerprint ω?
1 Biological Background
2 Sparse Proteomics Analysis (SPA)
3 Theoretical Foundation by High-dimensional Estimation Theory
Sparse Proteomics Analysis (SPA)
Sparse Proteomics Analysis is a generic framework to meet this challenge.
Input: Sample pairs (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}
Compute:
1 Preprocessing (smoothing, standardization)
2 Feature Selection (LASSO, ℓ1-SVM, robust 1-bit CS)
3 Postprocessing (sparsification)
Output: Sparse feature vector ω ∈ R^d
⇒ Biomarker extraction, dimension reduction
[Figure: biomarker identification from a blood sample's mass spectrum, intensity (cts) over mass (m/z).]
Step 2, Feature Selection, is the focus of the rest of this talk.
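The preprocessing step is only named on the slide. A minimal numpy sketch of what smoothing and standardization might look like; the moving-average window and the guard for constant features are illustrative choices, not part of SPA:

```python
import numpy as np

def preprocess(X, window=5):
    """Smooth each spectrum (row) with a moving average, then standardize
    every feature (mass position) to zero mean and unit variance."""
    kernel = np.ones(window) / window
    # Smoothing: convolve each row with a box kernel of the same length.
    X_smooth = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, X)
    # Standardization: center and normalize each column (feature).
    mu = X_smooth.mean(axis=0)
    sigma = X_smooth.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X_smooth - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))   # 10 toy spectra, d = 100
Z = preprocess(X)
```

Standardizing the columns is also what later makes the normalized-columns assumption on the dictionary D plausible.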
Feature Selection (Geometric Intuition)
Linear Separation Model: Find a feature vector ω ∈ R^d such that
y_k = sign(⟨x_k, ω⟩) for "many" k ∈ {1, ..., n}.
Moreover, ω should be sparse and interpretable.
Feature Selection via the LASSO
The LASSO (Tibshirani '96)
min_{ω ∈ R^d} Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²  subject to  ‖ω‖_1 ≤ R
Multivariate approach, originally designed for linear regression models:
y_k ≈ ⟨x_k, ω⟩, k = 1, ..., n.
But it is also applicable to non-linear models → next part.
Later: R ≈ √s to allow for s-sparse solutions (with unit norm).
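The constrained program above can be solved by projected gradient descent, using the standard sort-based Euclidean projection onto the ℓ1-ball. A self-contained numpy sketch, not the implementation used in the talk:

```python
import numpy as np

def project_l1(v, R):
    """Euclidean projection of v onto the l1-ball of radius R
    (standard sort-based algorithm)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - R)[0][-1]
    theta = (css[k] - R) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_constrained(X, y, R, steps=2000):
    """min_w sum_k (y_k - <x_k, w>)^2  s.t.  ||w||_1 <= R,
    solved by projected gradient descent."""
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2   # squared spectral norm of X
    for _ in range(steps):
        grad = X.T @ (X @ w - y)    # gradient of the residual (up to factor 2)
        w = project_l1(w - grad / L, R)
    return w

# Toy check on a 3-sparse ground truth.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))
w0 = np.zeros(200)
w0[:3] = [1.0, -1.0, 0.5]
y = X @ w0
w = lasso_constrained(X, y, R=np.abs(w0).sum())
```

The iterate stays feasible (‖w‖_1 ≤ R) by construction, and the residual shrinks toward zero on this noiseless toy example.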
Some Numerical Results
5-fold cross-validation for real-world pancreas data (156 samples):
1 Learn a feature vector ω by SPA, using 80% of the samples.
2 Classify the remaining 20% of the samples by an ordinary SVM, after projecting onto supp(ω).
3 Repeat this procedure 12 times for random partitions.
[Figure: classification accuracy for different sparsity levels s = |supp(ω)|.]
But what about theoretical guarantees?
1 Biological Background
2 Sparse Proteomics Analysis (SPA)
3 Theoretical Foundation by High-dimensional Estimation Theory
Toward a Theoretical Foundation of SPA
Linear Separation Model: Explains the observations/labels:
y_k = sign(⟨x_k, ω_0⟩), k = 1, ..., n
Forward Model: Explains the random distribution of the data:
x_k = Σ_{m=1}^M s_{m,k} a_m + n_k, k = 1, ..., n
a_m: deterministic feature atom, a sampled Gaussian peak (∈ R^d)
s_{m,k}: random latent factor specifying the peak amplitude (∈ R)
n_k: random baseline noise (∈ R^d)
[Figure: a spectrum built from Gaussian peaks of the form s_{m,k} · exp(−(t − c_m)²/β_m²).]
Supposing that sufficiently many samples are given, can we learn the sparse fingerprint ω_0?
Problem: The vector ω_0 is not unique, because some features are perfectly correlated.
⇒ No hope for support recovery or approximation
Idea: Separate the fingerprint from its data representation!
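A toy data set following this forward model can be generated in a few lines; the peak centers c_m, widths β_m, and all dimensions below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, n, sigma = 500, 10, 100, 0.05
t = np.linspace(0.0, 1.0, d)             # sampled mass axis
c = rng.uniform(0.1, 0.9, size=M)        # peak centers c_m (illustrative)
beta = rng.uniform(0.01, 0.03, size=M)   # peak widths beta_m (illustrative)
# Row m of D is the sampled Gaussian peak atom a_m in R^d.
D = np.exp(-((t[None, :] - c[:, None]) ** 2) / beta[:, None] ** 2)
S = rng.standard_normal((n, M))          # latent amplitudes s_{m,k}
N = sigma * rng.standard_normal((n, d))  # baseline noise n_k
X = S @ D + N                            # row k is x_k = D^T s_k + n_k
```

Plotting a row of X reproduces the qualitative picture of the slide: a handful of Gaussian peaks with random amplitudes on top of baseline noise.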
Combining the Models
x_k = Σ_{m=1}^M s_{m,k} a_m + n_k, k = 1, ..., n
Assumptions:
s_k := (s_{1,k}, ..., s_{M,k}) ∼ N(0, I_M) — peak amplitudes
n_k ∼ N(0, σ² I_d) — noise vector
a_1, ..., a_M ∈ R^d — arbitrary (peak) atoms, D := [a_1^T; ...; a_M^T] ∈ R^{M×d}
Put this into the classification model:
y_k = sign(⟨x_k, ω_0⟩) = sign(⟨Σ_{m=1}^M s_{m,k} a_m + n_k, ω_0⟩)
    = sign(⟨D^T s_k + n_k, ω_0⟩)
    = sign(⟨s_k, D ω_0⟩ + ⟨n_k, ω_0⟩)
    = sign(⟨s_k, z_0⟩ + ⟨n_k, ω_0⟩),  where z_0 := D ω_0.
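The chain of equalities above is pure linear algebra, and easy to check numerically on random data (all dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
M, d = 8, 40
D = rng.standard_normal((M, d))       # rows are the atoms a_m
w0 = rng.standard_normal(d)           # fingerprint omega_0
s = rng.standard_normal(M)            # latent factors s_k
noise = 0.1 * rng.standard_normal(d)  # baseline noise n_k
x = D.T @ s + noise                   # forward model: x_k = D^T s_k + n_k
z0 = D @ w0                           # z_0 := D omega_0
lhs = x @ w0                          # <x_k, omega_0>
rhs = s @ z0 + noise @ w0             # <s_k, z_0> + <n_k, omega_0>
assert np.isclose(lhs, rhs)           # the two inner products agree
```

In particular, sign(⟨x_k, ω_0⟩) and sign(⟨s_k, z_0⟩ + ⟨n_k, ω_0⟩) always coincide.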
Signal Space vs. Coefficient Space
x_k = Σ_{m=1}^M s_{m,k} a_m + n_k = D^T s_k + n_k
Let us first assume that n_k = 0 (no baseline noise). Then
y_k = sign(⟨x_k, ω_0⟩) = sign(⟨s_k, z_0⟩),
where z_0 = D ω_0.
z_0 has a (non-unique) representation in the dictionary D with sparse coefficients ω_0.
z_0 "lives" in the signal space R^M (independent of the specific data type).
ω_0 "lives" in the coefficient space R^d (data dependent).
⇒ Try to show a recovery result for z_0!
What Does This Mean for the LASSO?
y_k = sign(⟨x_k, ω_0⟩) = sign(⟨s_k, z_0⟩) with z_0 = D ω_0
SPA via the LASSO
min_{ω ∈ R·B_1^d} Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²   [solvable in practice!]
  = min_{z ∈ R·D B_1^d} Σ_{k=1}^n (y_k − ⟨s_k, z⟩)²   [solvable in theory!]
where the substitution z := D ω turns ⟨x_k, ω⟩ into ⟨s_k, z⟩.
Warning: The minimizers "live" in different spaces!
Warning: We know neither D nor s_k, but only their product.
Idea: Apply results for the K-LASSO with K = R · D B_1^d!
A Simplified Version of Roman Vershynin's Result
Theorem (Plan, Vershynin '15)
Suppose that s_k ∼ N(0, I_M), z_0 ∈ S^{M−1}, and the observations follow
y_k = sign(⟨s_k, z_0⟩), k = 1, ..., n.
Put µ = √(2/π) and assume that µ z_0 ∈ K, where K is convex, and that
n ≳ w(K)².
Then, with high probability, the solution ẑ of the K-LASSO satisfies
‖ẑ − µ z_0‖_2 ≲ (w(K)/√n)^{1/2}.
The (global) mean width of a bounded set K ⊂ R^M is given by
w(K) = E sup_{u ∈ K} ⟨g, u⟩, where g ∼ N(0, I_M).
Now assume that K = µR · D B_1^d; then z_0 = D ω_0 for some ω_0 ∈ R · B_1^d.
If the columns of D are normalized, then
w(K) ≲ R · √(log d).
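Since the supremum over the polytope D B_1^d is attained at a vertex ±d_j (a signed column of D), the mean width can be estimated by Monte Carlo. An illustrative numpy sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
M, d, trials = 50, 2000, 500
D = rng.standard_normal((M, d))
D /= np.linalg.norm(D, axis=0)       # normalize the columns of D
# The sup over D*B_1^d is attained at a vertex +-d_j, so
# w(D B_1^d) = E max_j |<g, d_j>| for g ~ N(0, I_M).
G = rng.standard_normal((trials, M))
w_est = np.abs(G @ D).max(axis=1).mean()
bound = np.sqrt(2 * np.log(2 * d))   # classical Gaussian-maximum bound
```

The estimate stays below the √(2 log(2d)) bound, which is the source of the R·√(log d) estimate above.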
A Recovery Guarantee for SPA
Theorem (G. '15)
Suppose that s_k ∼ N(0, I_M). Let z_0 ∈ S^{M−1} and assume that there exists R > 0 such that z_0 = D ω_0 for some ω_0 ∈ R · B_1^d. The observations follow
y_k = sign(⟨s_k, z_0⟩) = sign(⟨x_k, ω_0⟩), k = 1, ..., n,
and the number of samples satisfies
n ≳ R² · log(d).
Then, with high probability, the solution of the LASSO
ẑ = D ω̂ = D · argmin_{ω ∈ R·B_1^d} Σ_{k=1}^n (y_k − ⟨x_k, ω⟩)²
satisfies
‖ẑ − √(2/π) z_0‖_2 = ‖D ω̂ − √(2/π) D ω_0‖_2 ≲ (R² · log(d) / n)^{1/4}.
Practical Relevance for MS-Data?
Extensions:
- Baseline noise n_k ∼ N(0, σ² I_d)
- Non-trivial covariance matrix, i.e., s_k ∼ N(0, Σ)
- Adversarial bit-flips in the model y_k = sign(⟨x_k, ω_0⟩)
How to achieve normalized columns in D? How to guarantee that R ≈ √s, i.e., that s-sparse vectors (with unit norm) are allowed?
→ Standardize the data (centering + normalizing)
Given ω, how to switch over to the signal space? (D is unknown)
→ Identify supp(ω) with peaks (manual approach)
Message of this talk
An s-sparse disease fingerprint can be accurately recovered from only O(s · log(d)) samples!
THANK YOU FOR YOUR ATTENTION!
Further Reading
M. Genzel. Sparse Proteomics Analysis: Toward a Mathematical Foundation of Feature Selection and Disease Classification. Master's Thesis, 2015.
Y. Plan, R. Vershynin. The generalized Lasso with non-linear observations. arXiv:1502.04071, 2015.
What to Do Next?
Development of an abstract framework → What kind of properties should the dictionary D have?
Extension/generalization of the results → More complicated models and algorithms
Numerical verification of the theory
Other examples from real-world applications → Bio-informatics, neuro-imaging, astronomy, chemistry, ...
Dictionary learning / factor analysis → What can we learn about D?