R21: Enhanced deconvolution and prediction of mutational signatures
Joshua D. Campbell, PhD
Masanao Yajima, PhD
Section of Computational Biomedicine, Department of Medicine
Boston University School of Medicine
ITCR Annual Meeting
5/28/2020
Various exogenous exposures or endogenous biological processes contribute to the mutational load in cancer.
[Figure: example mutational signatures in tri-nucleotide context. Smoking: C>A. UV radiation: C>T at TCC/CCC. APOBEC: C>G at TCT/TCA.]
Deconvolution of mutational signatures using non-negative matrix factorization (NMF)
Alexandrov et al, Deciphering Signatures of Mutational Processes Operative in Human Cancer, Cell Reports, 2013.
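As a minimal sketch of what NMF-based deconvolution does, the example below factors a toy count matrix V (mutation types x samples) into signatures W and exposures H using the standard Lee-Seung multiplicative updates for squared error. The function names (`nmf`, `reconstruction_error`), the 4-category toy matrix, and the iteration count are illustrative assumptions, not the procedure of any particular package.

```python
import random

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, k, n_iter=500, seed=0, eps=1e-12):
    """Factor V (types x samples) into W (types x k) and H (k x samples).

    Multiplicative updates keep every entry of W and H non-negative.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def reconstruction_error(V, W, H):
    """Frobenius norm of V - WH."""
    R = matmul(W, H)
    return sum((v - r) ** 2
               for row_v, row_r in zip(V, R)
               for v, r in zip(row_v, row_r)) ** 0.5

# Toy data (hypothetical): 4 mutation categories, 2 signatures, 5 samples.
true_signatures = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.2], [0.1, 0.6]]
true_exposures = [[50, 5, 30, 10, 20], [5, 40, 30, 25, 2]]
V = matmul(true_signatures, true_exposures)

W, H = nmf(V, k=2)
# Rescale each column of W to sum to 1 so it reads as a probability profile.
signatures = [[W[i][a] / sum(row[a] for row in W) for a in range(2)]
              for i in range(len(W))]
```

In practice V would have 96 rows (one per trinucleotide category) and the number of signatures k would be chosen by a separate model-selection step.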
Limitations of NMF and current software packages for mutational signature inference
1. No inherent method for predicting new samples given an existing training model.
2. Limited flexibility to include additional information into signature inference processes.
[Figure: training set vs. test set ("Train", "Test?")]
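One common workaround for the first limitation is to hold previously learned signatures fixed and solve only for the exposures of a new sample. The sketch below does this with a non-negative least-squares fit via multiplicative updates. It is an illustration under that assumption, not the prediction method proposed in this talk; `predict_exposures` and the toy signatures are hypothetical.

```python
def predict_exposures(v, W, n_iter=2000, eps=1e-12):
    """Estimate non-negative exposures h so that W h approximates v.

    v: mutation counts for one new sample (length = number of categories).
    W: fixed signature matrix (categories x signatures), not updated here.
    """
    n, k = len(W), len(W[0])
    # Precompute W^T v and W^T W once, since W never changes.
    Wtv = [sum(W[i][a] * v[i] for i in range(n)) for a in range(k)]
    WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    h = [1.0] * k
    for _ in range(n_iter):
        # h <- h * (W^T v) / (W^T W h); entries stay non-negative.
        h = [h[a] * Wtv[a] / (sum(WtW[a][b] * h[b] for b in range(k)) + eps)
             for a in range(k)]
    return h

# Two toy signatures over 4 categories (columns sum to 1).
trained_signatures = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.2], [0.1, 0.6]]
# New sample built as 30 mutations from signature 1 plus 10 from signature 2.
new_sample = [22, 4, 5, 9]
exposures = predict_exposures(new_sample, trained_signatures)
# exposures is approximately [30.0, 10.0]
```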
[Figure: three ways to represent sequence context.
A) Trinucleotide context: one distribution over C>A_ACA, C>A_ACC, C>A_ACG, C>A_ACT, ..., T>G_TTA, T>G_TTC, T>G_TTG, T>G_TTT.
B) Heptanucleotide context: one distribution over C>A_AAACAAA, C>A_AAACAAC, C>A_AAACAAG, C>A_AAACAAT, ..., T>G_TTTTTTA, T>G_TTTTTTC, T>G_TTTTTTG, T>G_TTTTTTT.
C) Joint model: Base-3 X Base-2 X Mutation (trinucleotide) X Base+2 X Base+3, each flanking base a distribution over A, C, G, T, optionally X Strand (+/-).]

Model                                  Distribution   Number of items in distribution   Powered with standard data?
NMF, trinucleotide context             Mutation       96                                Yes
NMF, heptanucleotide context           Mutation       24,576                            No
Novel model with joint probability,    Base-3         4                                 Yes
heptanucleotide context with strand    Base-2         4                                 Yes
                                       Mutation       96                                Yes
                                       Base+2         4                                 Yes
                                       Base+3         4                                 Yes
                                       Strand         2                                 Yes
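The item counts above follow from simple enumeration: 6 pyrimidine-normalized substitution types times 4 possible bases at each flanking position. The sketch below enumerates the categories using the same label style as the figure. The function name `context_categories` and the summed total for the joint model are illustrative additions, not from the slides.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def context_categories(flank):
    """Enumerate mutation categories with `flank` bases on each side."""
    categories = []
    for sub in SUBSTITUTIONS:
        ref = sub[0]  # reference base sits in the middle of the context
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                categories.append(f"{sub}_{''.join(left)}{ref}{''.join(right)}")
    return categories

tri = context_categories(1)    # 6 * 4**2 categories
hepta = context_categories(3)  # 6 * 4**6 categories

# Joint model: one 96-item mutation distribution, four 4-item flanking-base
# distributions (Base-3, Base-2, Base+2, Base+3), and one 2-item strand
# distribution, so the entries are added rather than multiplied.
joint_items = len(tri) + 4 * len(BASES) + 2

print(len(tri), len(hepta), joint_items)
```

Because the joint model keeps the extra positions in separate small distributions, each one can be estimated from far fewer mutations than a single 24,576-item distribution would require.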
Characterize flanking bases
Utilize pre-existing signatures (KnownSigs) in signature discovery
[Figure: known signatures and tumors]
LDA identified similar signatures to NMF in a Pan-Lung cancer dataset from TCGA.
[Figure: fraction of bases at flanking positions -6 to +6 around the mutation (upstream 5' and downstream 3' flanking bases) for APOBEC mutations vs. all other mutations (panels A, B), with differential base usage shown as P-value (-log10). Bases: T, G, C, A. Mutation types: T>G, T>C, T>A, C>T, C>G, C>A.]
[Figure: NMF-derived signatures vs. LDA-derived signatures in trinucleotide context; y-axis: mutation type probability (0.00 to 0.20); x-axis: C>A, C>G, C>T, T>A, T>C, T>G. Signatures: UV, Smoking, MMR, APOBEC, Clock-like/aging. Reported correlations: R=0.98, R=0.99, R=0.85, R=0.94, R=0.79.]
“Pan-Lung” dataset = 1,144 lung cancer exomes (Campbell et al, Nature Genetics, 2016)
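The R values reported for the NMF and LDA signatures can be read as a similarity score between matched signature profiles. The sketch below assumes R denotes Pearson correlation (the slides do not say) and pairs each signature from one method with its most similar counterpart from the other. All names and the 6-category toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_matches(sigs_a, sigs_b):
    """For each signature in sigs_a, find the most correlated one in sigs_b."""
    matches = {}
    for name_a, profile_a in sigs_a.items():
        name_b = max(sigs_b, key=lambda nb: pearson(profile_a, sigs_b[nb]))
        matches[name_a] = (name_b, pearson(profile_a, sigs_b[name_b]))
    return matches

# Toy profiles over 6 categories; real profiles would have 96 entries.
nmf_sigs = {
    "N1": [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    "N2": [0.05, 0.05, 0.70, 0.10, 0.05, 0.05],
}
lda_sigs = {
    "L1": [0.04, 0.06, 0.68, 0.12, 0.05, 0.05],
    "L2": [0.48, 0.12, 0.10, 0.10, 0.10, 0.10],
}
matches = best_matches(nmf_sigs, lda_sigs)
for name, (partner, r) in matches.items():
    print(f"{name} <-> {partner}: R={r:.2f}")
```

Greedy best-match pairing can assign two signatures to the same partner; a one-to-one assignment would need an extra step.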
Development of a novel Bayesian model that allows for inclusion of other features such as additional flanking bases.
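A) Model specification

\[
\begin{aligned}
\theta_i &\sim \mathrm{Dir}_K(\alpha) && \text{for } i = 1..S \\
M_k &\sim \mathrm{Dir}_T(\gamma) && \text{for } k = 1..K \\
D_{k,a} &\sim \mathrm{Dir}_4(\beta) && \text{for } k = 1..K;\ a = (p-w)..(p-1) \\
U_{k,b} &\sim \mathrm{Dir}_4(\beta) && \text{for } k = 1..K;\ b = (p+1)..(p+w) \\
G_{k,f} &\sim \mathrm{Dir}_{E_f}(\delta) && \text{for } k = 1..K;\ f = 1..F \\
z_{i,j} &\sim \mathrm{Categorical}(\theta_i) && \text{for } i = 1..S;\ j = 1..N_i \\
m_{i,j} &\sim \mathrm{Categorical}(M_{z_{i,j}}) && \text{for } i = 1..S;\ j = 1..N_i \\
d_{i,j,a} &\sim \mathrm{Categorical}(D_{z_{i,j},a}) && \text{for } i = 1..S;\ j = 1..N_i;\ a = (p-w)..(p-1) \\
u_{i,j,b} &\sim \mathrm{Categorical}(U_{z_{i,j},b}) && \text{for } i = 1..S;\ j = 1..N_i;\ b = (p+1)..(p+w) \\
g_{i,j,f} &\sim \mathrm{Categorical}(G_{z_{i,j},f}) && \text{for } i = 1..S;\ j = 1..N_i;\ f = 1..F
\end{aligned}
\]

S is the number of samples.
K is the number of mutational signatures.
N_i is the number of mutations for sample i.
T is the number of mutation types.
E_f is the number of entries in feature distribution G_f.
F is the total number of genomic feature distributions.
p is the position of the mutation in the genome.
w is the length of the flanking motif.
a, b are the index positions in the flanking sequence relative to p.
m_{i,j} is the jth observed mutation for sample i.
d_{i,j,a} is the jth observed base at downstream position a for sample i.
u_{i,j,b} is the jth observed base at upstream position b for sample i.
g_{i,j,f} is the jth observed feature in f for sample i.

B) Plate diagram

[Figure: plate diagram with hyperparameters α, γ, β_{p-w}..β_{p-1}, β_{p+1}..β_{p+w}, δ_1..δ_F; latent variables θ and z; signature distributions M, D_{p-w}..D_{p-1}, U_{p+1}..U_{p+w}, G_1..G_F; observed variables m, d_{p-w}..d_{p-1}, u_{p+1}..u_{p+w}, g_1..g_F; plates S, N, and K.]

C) Likelihood

\[
\begin{aligned}
p(\theta, Z, M, D, U, G, m, d, u, g \mid \alpha, \gamma, \beta, \delta) ={}
& \prod_{i=1}^{S} p(\theta_i \mid \alpha)
\prod_{k=1}^{K} \Big[ p(M_k \mid \gamma)
\prod_{a=p-w}^{p-1} p(D_{k,a} \mid \beta)
\prod_{b=p+1}^{p+w} p(U_{k,b} \mid \beta)
\prod_{f=1}^{F} p(G_{k,f} \mid \delta) \Big] \\
& \times \prod_{i=1}^{S} \prod_{j=1}^{N_i} \Big[ p(z_{i,j} \mid \theta_i)\,
p(m_{i,j} \mid M_{z_{i,j}})
\prod_{a=p-w}^{p-1} p(d_{i,j,a} \mid D_{z_{i,j},a})
\prod_{b=p+1}^{p+w} p(u_{i,j,b} \mid U_{z_{i,j},b})
\prod_{f=1}^{F} p(g_{i,j,f} \mid G_{z_{i,j},f}) \Big]
\end{aligned}
\]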
Models flanking bases in signatures as independently observed variables.
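Treating flanking bases as independently observed variables means the probability of one observed mutation under a signature is a product of small per-position terms. The sketch below shows that product and the resulting posterior over the signature assignment z for one mutation. It is an illustration only: the dictionary layout, the function names, the 6-type mutation distribution, the single flanking position per side (w = 1), and all probabilities are hypothetical toy choices.

```python
def mutation_likelihood(sig, mutation, downstream, upstream):
    """p(m, d, u | signature) with each flanking base modeled independently.

    Following the notation above, D holds positions p-w..p-1 and U holds
    positions p+1..p+w, keyed here by offset relative to p.
    """
    prob = sig["M"][mutation]
    for offset, base in downstream.items():
        prob *= sig["D"][offset][base]
    for offset, base in upstream.items():
        prob *= sig["U"][offset][base]
    return prob

def signature_posterior(theta, signatures, mutation, downstream, upstream):
    """p(z = k | observation) proportional to theta_k * p(m, d, u | k)."""
    weights = [t * mutation_likelihood(s, mutation, downstream, upstream)
               for t, s in zip(theta, signatures)]
    total = sum(weights)
    return [w / total for w in weights]

uniform_bases = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Toy signature favoring C>G/C>T with T before and A/T after the mutated base.
sig_tc_context = {
    "M": {"C>A": 0.05, "C>G": 0.45, "C>T": 0.40,
          "T>A": 0.03, "T>C": 0.04, "T>G": 0.03},
    "D": {-1: {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85}},
    "U": {+1: {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45}},
}
# Toy signature with no preference for mutation type or flanking base.
sig_flat = {
    "M": {m: 1 / 6 for m in sig_tc_context["M"]},
    "D": {-1: uniform_bases},
    "U": {+1: uniform_bases},
}

# One observed mutation: C>G with T at offset -1 and A at offset +1.
likelihood = mutation_likelihood(sig_tc_context, "C>G", {-1: "T"}, {+1: "A"})
posterior = signature_posterior([0.5, 0.5], [sig_tc_context, sig_flat],
                                "C>G", {-1: "T"}, {+1: "A"})
```

Adding another flanking position costs one more 4-item distribution per signature, rather than multiplying the number of categories by 4.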
Development of a novel Bayesian model that allows for joint learning of known and novel signatures.
[Figure: fixed signatures and estimated signatures]
Developing a comprehensive R package for mutational signature inference.
Comparison of features across existing software packages
[Figure: R package workflow. Input: VCF/MAF/Table. Create tables: SNV-96, SNV-192 with transcription strand, SNV-192 with replication strand, DBS, Ins, Del, Custom. Mix and match tables. Deconvolution.]
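To make the "create tables" step concrete, the sketch below builds an SNV-96 count table from a list of variants, normalizing purine reference bases to the pyrimidine strand and using the same label style as the categories shown earlier (for example C>A_ACA). It is a standalone Python illustration and does not reflect the R package's actual functions or input handling; the tuple layout of `variants` is an assumption.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def snv96_category(ref, alt, trinucleotide):
    """Return the pyrimidine-normalized category label, e.g. 'C>A_ACA'."""
    if len(trinucleotide) != 3 or trinucleotide[1] != ref:
        raise ValueError("reference base must be the middle of the context")
    if ref in "AG":
        # Purine reference: report the equivalent change on the other strand.
        ref = ref.translate(COMPLEMENT)
        alt = alt.translate(COMPLEMENT)
        trinucleotide = reverse_complement(trinucleotide)
    return f"{ref}>{alt}_{trinucleotide}"

def count_table(variants):
    """Count categories per sample.

    variants: iterable of (sample, ref, alt, trinucleotide) tuples.
    Returns {sample: Counter({category: count})}.
    """
    table = {}
    for sample, ref, alt, tri in variants:
        table.setdefault(sample, Counter())[snv96_category(ref, alt, tri)] += 1
    return table

variants = [
    ("s1", "C", "T", "TCC"),
    ("s1", "G", "A", "GGA"),  # same event as C>T at TCC on the other strand
    ("s2", "G", "T", "TGA"),  # equivalent to C>A at TCA
]
table = count_table(variants)
```

A real pipeline would also need to look up the trinucleotide context from a reference genome, which is omitted here.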
Comprehensive sets of visualizations for exploratory analysis.
[Figure panels: Signatures; Tumor profiles; Comparisons between signatures (COSMIC V2 Signatures); Embedding of tumors in 2D.]
LDA had significantly better performance than NMF/deconstructSigs in a 5-fold cross-validation.
Cross-validation can be useful for determining signature discovery stability.
Subsampling can be useful for determining signature discovery sensitivity.
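Both evaluation ideas reduce to simple resampling of the input. The sketch below shows one way to form 5-fold train/test splits over samples and to subsample a sample's mutation counts by keeping each mutation with a fixed probability. The function names, the seed handling, and the binomial-thinning choice are illustrative assumptions; the slides do not describe the exact procedure used.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def train_test_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    for f in range(k):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]

def subsample_counts(counts, fraction, seed=0):
    """Keep each mutation independently with probability `fraction`."""
    rng = random.Random(seed)
    return [sum(rng.random() < fraction for _ in range(c)) for c in counts]

splits = list(train_test_splits(n_samples=23, k=5))
# Counts for one sample across 4 toy categories, thinned to about half.
original = [40, 10, 25, 5]
thinned = subsample_counts(original, fraction=0.5)
```

For stability, signatures would be rediscovered on each training set and compared across folds; for sensitivity, discovery would be repeated at decreasing values of `fraction`.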
Acknowledgements
Boston University School of Medicine
Computational Biomedicine
Evan Johnson
Masanao Yajima
Shiyi Yang
Aaron Chevalier
Kelly Geyer
https://github.com/campbio/BAGEL/