Stat Risk Models, Billion Alphas, & ... Cancer Signatures
Zura Kakushadze
Quantigic® Solutions LLC, Stamford, CT, USA
Business School & School of Physics, Free University of Tbilisi, Georgia
Talk Presented at Financial Engineering Division, School of Systems and Enterprises
Stevens Institute of Technology, Hoboken, NJ, USA
May 4, 2016
Zura Kakushadze (Quantigic & FreeUni) Stat RM, Bn α’s, & ... Cancer Signatures May 4, 2016 1 / 20
Motivation
Proliferation of Alphas
α mining: machines have taken over!
Olden days: #(α’s) N ∼ 10
Nowadays: $N \sim 10^4, 10^5, 10^6, \dots, 10^9$ (soon!)
α’s: ephemeral, faint, #(obs.) $\ll N$ ⇒ sample cov.mat singular!
“Mega” α: α combos → optimization → need invertible cov.mat
Factor Models
Sample cov.mat: $C_{ij} = \sigma_i \sigma_j \Psi_{ij}$ ($i, j = 1, \dots, N$)
Sample var: $\sigma_i^2$ (computable, relatively stable, skewed)
Sample cor.mat: $\Psi_{ij}$ (singular, out-of-sample unstable)
Factor model: $\Psi_{ij} \approx \xi_i^2 \delta_{ij} + \sum_{A=1}^{K} \beta_{iA} \beta_{jA}$ (pos-def, $K \ll N$)
Pair-wise cor: ⇐ factor loadings $\beta_{iA}$ ($\xi_i^2$ ⇐ $\Psi_{ii} = 1$)
What to Use for Factor Loadings?
Stocks: style factors (size, value, growth, mom, vol, liq, etc., ≲ 10)
Stocks: industry classification (GICS, ICB, BICS, etc.)
Alphas: no useful classification, few style factors (bad cor proxies)
Statistical Risk Models
Statistical Risk Models [ZK & Yu, 2016a]
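No add’l input: only α return time series $R_{is}$ ($s = 1, \dots, M + 1$)
Factor loadings $\beta_{iA}$: linear combos of $\tilde{R}_{is} = R_{is}/\sigma_i$ (norm.ret)
Principal components ($X_{is}$ = serially demeaned $\tilde{R}_{is}$):
$\Psi_{ij} = \frac{1}{M} \sum_{s=1}^{M+1} X_{is} X_{js} = \sum_{A=1}^{M} \lambda^{(A)} V_i^{(A)} V_j^{(A)}$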
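The singularity of the sample correlation matrix when there are fewer observations than alphas can be checked numerically. The following is a minimal sketch in R, assuming synthetic Gaussian returns and hypothetical sizes N = 50, M = 10 (neither is from the talk); it builds Ψ from serially demeaned normalized returns and counts the nonzero eigenvalues.

```r
set.seed(1)
N <- 50; M <- 10                             # hypothetical sizes: N alphas, M + 1 observations
R <- matrix(rnorm(N * (M + 1)), N, M + 1)    # synthetic returns R[i, s], not real data

sigma <- apply(R, 1, sd)                     # sample volatilities (sd divides by M here)
Rt <- R / sigma                              # normalized returns: row i divided by sigma[i]
X <- Rt - rowMeans(Rt)                       # serially demeaned normalized returns
Psi <- X %*% t(X) / M                        # sample correlation matrix

ev <- eigen(Psi, symmetric = TRUE)$values
n_pos <- sum(ev > 1e-10)                     # number of nonzero eigenvalues
cat("nonzero eigenvalues:", n_pos, "out of", N, "\n")
```

Because each row of X sums to zero over the M + 1 observations, Ψ has at most M nonzero eigenvalues, so it cannot be inverted when M < N.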
Statistical Risk Models
Truncation
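First K prin.comp ($\lambda^{(1)} > \lambda^{(2)} > \dots$):
$\Psi_{ij} \approx \xi_i^2 \delta_{ij} + \sum_{A=1}^{K} \beta_{iA} \beta_{jA}$, $\beta_{iA} = \sqrt{\lambda^{(A)}}\, V_i^{(A)}$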
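The truncation can be illustrated with a short R sketch. It assumes synthetic returns and a hypothetical fixed K = 3; the specific variances are set by requiring unit diagonal, as in the factor model definition. This is an illustration under those assumptions, not the authors' published code.

```r
set.seed(2)
N <- 50; M <- 10; K <- 3                     # hypothetical sizes and number of factors
R <- matrix(rnorm(N * (M + 1)), N, M + 1)    # synthetic returns, not real data
Psi <- cor(t(R))                             # singular sample correlation matrix (M < N)

e <- eigen(Psi, symmetric = TRUE)            # eigenvalues come back in decreasing order
# Factor loadings: beta[i, A] = sqrt(lambda_A) * V_i^(A) for the first K components
beta <- e$vectors[, 1:K] %*% diag(sqrt(e$values[1:K]), K)
xi2 <- 1 - rowSums(beta^2)                   # specific variances fixed by unit diagonal

Gamma <- diag(xi2) + beta %*% t(beta)        # model correlation matrix
Gamma_inv <- solve(Gamma)                    # invertible, unlike Psi
min_eig <- min(eigen(Gamma, symmetric = TRUE)$values)
```

With K < M the specific variances are generically positive, which is what makes the model matrix positive definite.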
What Should K Be?
Fix K : keep it simple
K = eRank(Ψij): eRank = effective rank [Roy & Vetterli, 2007]
eRank = effective dimensionality (R code):
eig = eigen(Psi)$values         # eigenvalues of cor.mat
eig = eig[eig > 0]              # positive eigenvalues
p = eig / sum(eig)              # normalized as weights
eRank = exp(-sum(p * log(p)))   # arg.exp = spectral entropy
Figure: eRank Illustration
Simple Example
Uniform cor: $\Psi_{ij} = (1 - \rho)\delta_{ij} + \rho$ [1-factor model, $\beta_i = \sqrt{\rho}$]
[Plot: eRank (y-axis, 0 to 100) vs. correlation in % (x-axis, 0 to 100), N = 100]
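The uniform-correlation illustration can be reproduced with a few lines of R. This is a sketch: the helper names `erank` and `uniform_cor` and the chosen grid of ρ values are assumptions for illustration; N = 100 matches the plot.

```r
# Effective rank: exponential of the spectral entropy of the eigenvalue weights
erank <- function(Psi) {
  eig <- eigen(Psi, symmetric = TRUE)$values
  eig <- eig[eig > 0]
  p <- eig / sum(eig)
  exp(-sum(p * log(p)))
}

# Uniform correlation matrix: 1 on the diagonal, rho off the diagonal
uniform_cor <- function(rho, N) (1 - rho) * diag(N) + rho

N <- 100
rho <- c(0, 0.2, 0.5, 0.9)                   # hypothetical grid of correlations
er <- sapply(rho, function(r) erank(uniform_cor(r, N)))
print(data.frame(rho = rho, eRank = er))
```

At ρ = 0 all eigenvalues are equal and eRank equals N; as ρ grows, one eigenvalue, 1 + (N − 1)ρ, dominates the remaining N − 1 eigenvalues, each equal to 1 − ρ, and eRank falls toward 1.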
How to Combine a Billion Alphas?
Sharpe → max [ZK & Yu, 2016b]
Exp.ret: Ei
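Weights: $w_i = \gamma \sum_{j=1}^{N} C^{-1}_{ij} E_j$ [$\gamma \Leftarrow \sum_{i=1}^{N} |w_i| = 1$]
Rescale: $\tilde{w}_i = \sigma_i \xi_i w_i / \gamma$, $\tilde{E}_i = E_i / (\sigma_i \xi_i)$, $\tilde{\beta}_{iA} = \beta_{iA} / \xi_i$
Claim: $N \gg 1 \Rightarrow \tilde{w}_i \approx \tilde{\varepsilon}_i$ = regression residuals of $\tilde{E}_i$ over $\tilde{\beta}_{iA}$
Example (K = 1, uniform cor):
$\tilde{w}_i = \tilde{E}_i - \frac{\rho}{1 + (N - 1)\rho} \sum_{j=1}^{N} \tilde{E}_j \approx \tilde{E}_i - \frac{1}{N} \sum_{j=1}^{N} \tilde{E}_j$
Reg over int: $\tilde{\beta}_i \equiv \sqrt{\rho / (1 - \rho)}$
General: cor $\not\ll 1$, $N \gg 1 \Rightarrow$ approx. regression (holds for prin.comp)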
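The step from optimization to regression can be spelled out as a short derivation. It is a sketch that assumes the factor-model form of the correlation matrix and uses the standard Woodbury identity; here a tilde denotes the rescaled quantities, $\tilde{w}_i = \sigma_i \xi_i w_i / \gamma$, $\tilde{E}_i = E_i / (\sigma_i \xi_i)$, $\tilde{\beta}_{iA} = \beta_{iA} / \xi_i$, and the symbol $Q_{AB}$ is introduced only for this derivation.

```latex
\begin{aligned}
\Psi_{ij} &= \xi_i^2 \delta_{ij} + \sum_{A=1}^{K} \beta_{iA} \beta_{jA}, \qquad
C_{ij} = \sigma_i \sigma_j \Psi_{ij} \\
\Psi^{-1}_{ij} &= \frac{\delta_{ij}}{\xi_i^2}
  - \frac{1}{\xi_i \xi_j} \sum_{A,B=1}^{K} \tilde{\beta}_{iA}\, Q^{-1}_{AB}\, \tilde{\beta}_{jB},
  \qquad Q_{AB} = \delta_{AB} + \sum_{i=1}^{N} \tilde{\beta}_{iA} \tilde{\beta}_{iB} \\
\tilde{w}_i &= \xi_i \sum_{j=1}^{N} \Psi^{-1}_{ij} \frac{E_j}{\sigma_j}
  = \tilde{E}_i - \sum_{A,B=1}^{K} \tilde{\beta}_{iA}\, Q^{-1}_{AB} \sum_{j=1}^{N} \tilde{\beta}_{jB} \tilde{E}_j \\
\tilde{\varepsilon}_i &= \tilde{E}_i - \sum_{A,B=1}^{K} \tilde{\beta}_{iA}
  \left[ \big(\tilde{\beta}^{T} \tilde{\beta}\big)^{-1} \right]_{AB}
  \sum_{j=1}^{N} \tilde{\beta}_{jB} \tilde{E}_j
\end{aligned}
```

The two expressions differ only by the $\delta_{AB}$ term in $Q_{AB}$. When N is large and the correlations are not small, the sum over i in $Q_{AB}$ dominates that term, so $\tilde{w}_i \approx \tilde{\varepsilon}_i$. For K = 1 and uniform correlation, $\tilde{\beta}_i^2 = \rho / (1 - \rho)$ gives $\tilde{\beta}_i^2 / Q = \rho / (1 + (N - 1)\rho)$, which is the coefficient in the example.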
Removing “Overall” Mode
Sharpe → max = Overkill
Hedge: incl. all α’s going bust at once
Unlikely: if avg. α cor is low
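Math: 1st prin.comp $V_i^{(1)} \approx 1/\sqrt{N}$ (“overall” ∼ “market” mode)
Residuals: $\tilde{\varepsilon}_i \perp \tilde{\beta}_{iA} \Rightarrow \tilde{\varepsilon}_i \perp V_i^{(1)} \Rightarrow$ ∼ 50% of $w_i$ & $\tilde{w}_i$ < 0
Solution: X-sec demean $R_{is}$, then calc $\Psi_{ij}$: remove “overall” mode!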
Computational Cost
Regression: $O(M^2 N)$
Prin.comp: $O(M^2 N)$ (not power iterations) [ZK & Yu, 2016a]
Reduce: $\beta_{iA}$ = X-sec demeaned $X_{is}$ ($s = 1, \dots, M - 1$), $\xi_i \equiv 1$
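The effect of cross-sectional demeaning on the “overall” mode can be seen in a small simulation. The sketch below assumes synthetic returns driven by one hypothetical common factor plus noise; the sizes, the factor scale, and the helper `avg_offdiag` are illustrative assumptions.

```r
set.seed(3)
N <- 200; M <- 20                              # hypothetical sizes
common <- 2 * rnorm(M + 1)                     # hypothetical common ("overall") driver
R <- outer(rep(1, N), common) +                # every alpha loads equally on the driver
     matrix(rnorm(N * (M + 1)), N, M + 1)      # plus idiosyncratic noise (synthetic data)

# Average off-diagonal element of a correlation matrix
avg_offdiag <- function(P) (sum(P) - sum(diag(P))) / (nrow(P) * (nrow(P) - 1))

Psi <- cor(t(R))
V1 <- eigen(Psi, symmetric = TRUE)$vectors[, 1]
overlap <- abs(sum(V1)) / sqrt(N)              # cosine between V1 and the uniform vector

Rd <- sweep(R, 2, colMeans(R))                 # X-sec demeaning: subtract the mean over alphas for each s
Psi_d <- cor(t(Rd))

before <- avg_offdiag(Psi)
after <- avg_offdiag(Psi_d)
print(c(before = before, after = after, overlap = overlap))
```

Before demeaning, the first principal component is close to the uniform vector and the average correlation is high; after demeaning, the average correlation is close to zero.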
Cancer Signatures
Motivation
Cancer: 1 in 8 human deaths
Diff: somatic mutations
Common: single nucleotide variation = single base
Exo: chemical insults, UV, etc.
Endo: imperfect DNA replic., spont. cytosine deamination, etc.
Mutational signatures: alteration patterns in cancer genome
Identify: understand origins and development of cancer
Therapy: if ∃ small #(sigs), cure for 1 cancer may help cure many
Prevention: pair obs. sigs w/ sigs caused by carcinogens
Figure: Double Helix
DNA = 2 strands. Each strand = string of A, C, G, T (adenine, cytosine, guanine and thymine).
Base complementarity: A in one strand always binds with T in the other, and G with C.
Mutation Categories
Base Mutations
6 mutations: C > A, C > G, C > T, T > A, T > C, T > G
Other 6: base complementarity
96 mut.cat: 4 (lhs) × 6 (base.mut) × 4 (rhs) (e.g.: TCG > TAG)
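The count of 96 categories follows from combining the six base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) with the four possible bases on each side. A minimal R sketch of the enumeration, with variable names chosen here for illustration:

```r
bases <- c("A", "C", "G", "T")
subs <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")   # six base substitutions

# All combinations of left neighbor, substitution, right neighbor
grid <- expand.grid(lhs = bases, sub = subs, rhs = bases,
                    stringsAsFactors = FALSE)
ref <- substr(grid$sub, 1, 1)                  # mutated base before
alt <- substr(grid$sub, 3, 3)                  # mutated base after

# Trinucleotide labels such as "TCG>TAG"
cats <- paste0(grid$lhs, ref, grid$rhs, ">", grid$lhs, alt, grid$rhs)
length(cats)                                   # 96
"TCG>TAG" %in% cats                            # TRUE
```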
Data
Samples: DNA sequenced whole cancer genomes
Occur.cts: matrix $G_{is} \geq 0$ ($i = 1, \dots, N = 96$; $s = 1, \dots, d$ samples)
By cancer type: $[G(\alpha)]_{is}$ ($\alpha = 1, \dots, n$ cancer types)
Nonnegative Matrix Factorization (NMF) [Alexandrov et al, 2013]
NMF: $G \approx WH$, $N \times K$ sig weights $W_{iA}$, $K \times d$ sig exposures $H_{As}$
Iter.algo: K = trial.err, many loc.min, no glob.conv, avg.samplings
Run: cancer type/“big matrix”, days/weeks, sig instability (var)...
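For readers unfamiliar with NMF, the iterative nature of the algorithm can be sketched with generic multiplicative updates for the squared-error objective. This is a textbook variant written for illustration, not necessarily the exact procedure of [Alexandrov et al, 2013]; the function name `nmf_mu`, its parameters, and the synthetic data are assumptions.

```r
# Generic NMF via multiplicative updates (squared-error objective); illustrative only
nmf_mu <- function(G, K, n_iter = 500, eps = 1e-9, seed = 1) {
  set.seed(seed)
  N <- nrow(G); d <- ncol(G)
  W <- matrix(runif(N * K), N, K)              # random nonnegative start (a local minimum may result)
  H <- matrix(runif(K * d), K, d)
  for (it in seq_len(n_iter)) {
    H <- H * (t(W) %*% G) / (t(W) %*% W %*% H + eps)   # update exposures
    W <- W * (G %*% t(H)) / (W %*% H %*% t(H) + eps)   # update signature weights
  }
  list(W = W, H = H, err = sqrt(sum((G - W %*% H)^2)))
}

# Synthetic nonnegative matrix with 3 planted signatures (not real data)
set.seed(5)
W0 <- matrix(runif(96 * 3), 96, 3)
H0 <- matrix(runif(3 * 30), 3, 30)
G <- W0 %*% H0

fit10 <- nmf_mu(G, K = 3, n_iter = 10)
fit500 <- nmf_mu(G, K = 3, n_iter = 500)
print(c(err_10 = fit10$err, err_500 = fit500$err))
```

The result depends on the random start, which is one reason signatures are typically averaged over many samplings.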
Stat Factor Models for Cancer Sigs [ZK & Yu, 2016c]
Skewed Counts, Aggregation, “Overall” Mode, etc.
bio ↔ fin dict: mut.cat ↔ tkr, sample ↔ trd.day, sigs ↔ “sectors”
Skewed cts: nonneg.cts ∼ log-norm distrib.
Fix: Ris = ln(1 + Gis) (some Gis = 0)
Cor.mat: calc Ψij based on Ris
Samples: too noisy in each cancer type
Fix: aggregate by cancer type
Data: published only (Q&A), 14 cancer types, 1389 samples
Avg.cor: whopping 96% ⇒ “overall” mode = somatic mut.noise
Fix: rm “overall” mode = X-sec demean Ris (s: 14 cancer types)
#(sigs.eRank): K = Round(eRank(Ψij)) = 7
NMF: re-exp $G_{is} = \exp(R'_{is})$ ($R'_{is}$ = X-sec demeaned $R_{is}$)
Result: #(sigs.NMF) = 7 = #(sigs.eRank), much less noisy sigs!
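The preprocessing steps listed above (log transform, cross-sectional demeaning, eRank, re-exponentiation) can be chained as follows. This is a sketch on synthetic Poisson counts with hypothetical parameters; it does not use the published data, so the resulting K need not equal 7.

```r
set.seed(4)
N <- 96; n_types <- 14                         # 96 mutation categories, 14 cancer types
G <- matrix(rpois(N * n_types, lambda = 50),   # synthetic aggregated counts, not real data
            N, n_types)

R <- log(1 + G)                                # tame the skewed counts
Rp <- sweep(R, 2, colMeans(R))                 # X-sec demean: remove the "overall" mode
Psi <- cor(t(Rp))                              # correlation matrix across mutation categories

eig <- eigen(Psi, symmetric = TRUE)$values
eig <- eig[eig > 1e-10]                        # keep numerically positive eigenvalues
p <- eig / sum(eig)
K <- round(exp(-sum(p * log(p))))              # number of signatures from eRank

G_tilde <- exp(Rp)                             # strictly positive matrix to feed into NMF
cat("K =", K, "\n")
```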
Average Correlation and 1st 5 Eigenvalues of Ψij
Cancer Type                    Avg.Cor (%)  Eig.1  Eig.2  Eig.3  Eig.4  Eig.5
B Cell Lymphoma                66.6         65.0   5.67   3.08   2.54   2.23
Bone Cancer                    48.1         48.1   3.31   2.03   1.88   1.81
Brain Lower Grade Glioma       17.7         20.3   4.52   3.91   3.47   3.23
Breast Cancer                  65.2         64.1   6.18   3.01   2.07   1.11
Chronic Lymphocytic Leukemia   79.6         77.6   1.73   1.23   1.17   1.07
Esophageal Cancer              16.7         23.3   9.58   7.59   7.34   6.98
Gastric Cancer                 80.9         78.2   6.41   3.95   1.62   0.68
Liver Cancer                   87.9         84.7   1.77   1.09   0.90   0.76
Lung Cancer                    80.0         78.3   6.03   4.47   1.93   1.17
Medulloblastoma                54.2         53.6   3.33   2.01   1.76   1.63
Ovarian Cancer                 75.6         73.8   6.04   2.64   1.99   1.31
Pancreatic Cancer              17.0         21.3   5.38   3.71   3.27   2.84
Prostate Cancer                68.1         68.5   11.5   8.60   7.43   0
Renal Cell Carcinoma           78.2         75.9   5.89   1.86   1.63   0.97
All Cancer Types               88.1         84.9   5.47   0.77   0.58   0.37
Aggregated by Cancer Type      96.1         92.4   1.26   0.89   0.51   0.34
Figure: NMF Reconstruction Accuracy (Our Method)
[Bar chart: reconstruction accuracy by cancer type; x-axis: cancer subtype; y-axis: Pearson correlation (0 to 1); series: 4, 5, 6, 7, and 8 signatures]
Figure: Signature Contributions (Our Method)
[Stacked bar chart: signature contribution to 14 cancer subtypes; x-axis: cancer subtypes; y-axis: % contribution (0% to 100%); series: Signature 1 through Signature 7]
Figure: Sig1 (Pancreatic), Vanilla NMF v. Our Method
Figure: Sig5 (Liver), Vanilla NMF v. Our Method
Figure: Yoda v. Yoda on Steroids
Novel Approach
7 Cancer Signatures
4 known sigs: [Nik-Zainal et al, 2012], [Alexandrov et al, 2013b]
Novel sig: dominant liver cancer sig: 96% contr. (exciting!)
Novel sig: renal cell carcinoma (kidney): 70% contr.
Novel sig: bone cancer, brain lower grade glioma, medulloblastoma
Bonus: more stable sigs, comp.cost cut by factor ∼ 10 (fewer iter)!
What’s Next?
Exome: much cheaper, faster than genome
Issue: small ⊂ genome, sparse (many 0’s), noisy
Fix: aggregate by cancer type, rm “overall” mode
Cleaner sigs: novel prop method, still a secret...
Stability: need ≫ data, Int’l Cancer Genome Consortium (embargoed)
Funding: gov, angels, biotech/pharma, etc.
References
ZK & Willie Yu (2016a) Statistical Risk Models. The Journal of Investment Strategies (forthcoming); http://ssrn.com/abstract=2732453 (Feb 15, 2016).
ZK & Willie Yu (2016b) How to Combine a Billion Alphas. Journal of Asset Management (under review); http://ssrn.com/abstract=2739219 (Feb 29, 2016).
ZK & Willie Yu (2016c) Factor Models for Cancer Signatures. Working Paper; http://ssrn.com/abstract=2772458 (Apr 29, 2016).
Thank you!