Supplementary Figure 1
Motif size distribution.
The number of MS loci per motif size across the whole genome (red), the exome (green), and an annotated set of cancer genes from Lawrence et al. (ref. 1; blue). Mononucleotide and dinucleotide repeats represent ~99% of all MS loci.
1. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).
Nature Biotechnology: doi:10.1038/nbt.3966
[Figure axis labels: No. of repeats; No. of loci]
Supplementary Figure 2
Sequencing coverage across motifs.
The number of MS loci per length for different motifs (A, C, AC, and AG) across the exome is shown in red while the average number of MS loci covered by at least 10 reads is shown in blue. The number of MS loci covered at 10x depth decreases more rapidly than the number of MS loci, demonstrating the difficulty in achieving sufficient coverage for longer repeat lengths. Together, the motifs A, C, AC, and AG represent 98% of MS loci in the exome.
[Figure axis labels: MS repeat length; Fractions of reads not in the mode]
Supplementary Figure 3
Comparison of accuracy of sequence-alignment tools at MS loci.
Noise is plotted as a function of the MS repeat length for the standard alignment (using the Burrows-Wheeler Aligner, BWA; ref. 2) versus the MS-specific alignment (adapted from lobSTR; ref. 3). Data are shown for the AG motif. Noise was defined as the fraction of reads that differ from the modal number of repeats, aggregated over all the MS loci on the X chromosome from normal male samples (which are assumed to be homozygous at each MS locus). On average, the MS-specific alignment method reduces noise by approximately a factor of 5.
2. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
3. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).
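The noise definition used in Supplementary Figure 3 can be illustrated with a minimal sketch. It assumes that "aggregated" means pooling reads across loci before taking the fraction (the caption does not specify this), and the function name and input layout are hypothetical.

```python
from collections import Counter

def noise_fraction(loci_reads):
    """Fraction of reads whose repeat count differs from the per-locus mode.

    loci_reads: iterable of lists; each list holds the repeat count observed
    in every read covering one MS locus (hypothetical input layout).
    Reads are pooled across loci, which is an assumption of this sketch.
    """
    off_mode = 0
    total = 0
    for reads in loci_reads:
        if not reads:
            continue  # skip uncovered loci
        counts = Counter(reads)
        modal_reads = max(counts.values())  # reads supporting the modal length
        off_mode += len(reads) - modal_reads
        total += len(reads)
    return off_mode / total if total else 0.0

# Two hypothetical loci from a homozygous (male X chromosome) sample.
example_loci = [[10, 10, 10, 9, 10], [12, 12, 11, 12, 12, 13]]
noise = noise_fraction(example_loci)
```

In this example, 3 of the 11 pooled reads fall outside the mode of their locus.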
[Figure axis labels: No. of MS indels; Allele fraction]
Supplementary Figure 4
Analysis of true-positive rates.
The number of detected simulated MS indels (out of 200) across repeat lengths (shown in different colors) and allele fractions. The sensitivity to detect MS indels decreases markedly at low allele fractions.
[Figure axis labels: log10(KS test); No. of loci]
Supplementary Figure 5
False-positive rates.
False-positive rates for the A and C motifs as a function of MSMuTect parameters. Heat maps show the log10 false-positive rate per MS locus (i.e., the fraction of falsely called MS indels among all MS loci) for the A and C motifs. The y-axis is the threshold for the difference in AIC scores (Tr) and the x-axis is the threshold for the Kolmogorov-Smirnov (KS) filtering step.
Supplementary Figure 6
Distribution of MS indels and SNVs across cancer.
Comparison of the fraction of MS indels (upper panel) and number of SNVs (lower panel) across 4,041 tumors from 20 tumor types. Only samples with annotated MS indels and SNVs are shown. Red horizontal lines represent the mean number of MS indels in each tumor type.
[Figure: 2 × 4 grid of panels. Rows: One allele in the normal; Two alleles in the normal. Columns: One allele in the tumor; Two alleles in the tumor; Three alleles in the tumor; Four alleles in the tumor. y-axis: Number of MS indels. x-axis categories: MSI-H, MSI-L, MSS.]
Supplementary Figure 7
The number of MS indels for different changes in the number of alleles.
The number of MS indels in STAD samples (divided into MSI-H, MSI-L, and MSS) plotted for different numbers of germline and tumor alleles. MSMuTect not only detects the presence of a somatic MS indel but also infers the actual alleles in both the germline and tumor samples. The upper row shows the number of MS indels at loci with one allele in the germline, and the lower row at loci with two alleles in the germline. The columns represent the number of alleles in the tumor (ranging from one to four). For example, the plot in the third column of the second row shows cases in which the germline has two alleles (i.e., heterozygous sites) but the tumor sample has three alleles. MS indels are more common in MSI-H tumors in all settings except when the germline has two alleles but the tumor has only a single allele (bottom left corner), which reflects loss of heterozygosity (LOH). MSI designations (MSI-H, MSI-L, or MSS) are based on the Bethesda gel classification (taken from TCGA). The y-axis scale varies across panels. The significance of the difference was calculated using a one-tailed t-test (ns, p > 0.05; * p < 0.05; ** p < 10^-3; *** p < 10^-8; **** p < 10^-16).
Supplementary Figure 8
Correlation between germline variability and somatic MS indel frequency.
The x-axis represents the binned fraction of non-reference alleles at each MS locus (out of the 2N alleles in our cohort, where N is the number of covered normal samples). The somatic MS indel frequency for each MS locus is plotted as blue dots. Black dots represent the mean of each bin. The upper panel shows A8 loci with germline variability from 0 to 0.1 and the lower panel shows the full range from 0 to 1. The effect of germline variability on the somatic rate is minor for germline variability <0.1.
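As a minimal sketch of the germline variability measure described above, the fraction of non-reference alleles out of 2N can be computed as follows. The function name and the genotype representation (one pair of repeat lengths per covered normal sample) are assumptions made for illustration, not details given in the caption.

```python
def germline_variability(genotypes, reference_length):
    """Fraction of non-reference alleles among the 2N alleles at one MS locus.

    genotypes: list of (allele_1, allele_2) repeat lengths, one pair per
    covered normal sample (hypothetical representation).
    reference_length: repeat length of the locus in the reference genome.
    """
    n_samples = len(genotypes)
    if n_samples == 0:
        raise ValueError("locus has no covered normal samples")
    non_reference = sum(
        (a != reference_length) + (b != reference_length) for a, b in genotypes
    )
    return non_reference / (2 * n_samples)

# Hypothetical A8 locus genotyped in four normal samples.
a8_genotypes = [(8, 8), (8, 9), (7, 7), (8, 8)]
variability = germline_variability(a8_genotypes, reference_length=8)
```

Here 3 of the 8 alleles differ from the reference length, so this locus would fall well above the 0.1 range shown in the upper panel.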
[Figure axis labels: Number of MS indels (x-axis, 0 to 40); log10(Frequency of loci) (y-axis, −5 to 0)]
Supplementary Figure 9
Distribution of MS indels in A8 in noncoding regions.
The observed frequencies of A8 loci with a given number of MS indels are shown as black dots, whereas the expected frequencies from a fit based on a binomial distribution are represented by the red line. The x-axis represents the number of MS indels and the y-axis represents the fraction of loci that have a particular number of MS indels.
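The expected curve can be sketched as follows, assuming a single per-sample mutation probability shared by all A8 loci and estimated from the mean number of indels per locus. The caption does not describe the fitting procedure, so the estimator, the function names, and the example counts below are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k mutated samples out of n at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def fit_binomial_rate(indel_counts, n_samples):
    """Estimate of p from the mean indel count per locus.

    A single shared rate across loci is an assumption of this sketch.
    """
    return sum(indel_counts) / (len(indel_counts) * n_samples)

# Hypothetical data: number of samples with an MS indel at each of ten loci.
indel_counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
n_samples = 40
p_hat = fit_binomial_rate(indel_counts, n_samples)

# Expected fraction of loci carrying k indels, for k = 0..n_samples.
expected = [binomial_pmf(k, n_samples, p_hat) for k in range(n_samples + 1)]
```

Plotting log10 of `expected` against k gives a curve comparable to the red line, to be compared with the observed fraction of loci at each k.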
Supplementary Figure 10
STAD quantile–quantile plot.
MSMutSig QQ plot for stomach adenocarcinoma (STAD). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.
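For readers unfamiliar with this type of plot, the coordinates of a quantile-quantile plot of P-values can be sketched as below. The sketch assumes the expected quantiles are taken from a uniform null distribution at ranks i/(n+1), a common convention that the caption does not confirm; the function name is hypothetical, and P-values are assumed to be strictly positive.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Expected quantiles use rank / (n + 1) under a uniform null, which is
    an assumption of this sketch. All p_values must be greater than 0.
    """
    observed = sorted(p_values)
    n = len(observed)
    expected = [(rank + 1) / (n + 1) for rank in range(n)]
    return [
        (-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)
    ]

# Hypothetical P-values for three MS loci.
points = qq_points([0.5, 0.01, 0.2])
```

Points lying above the diagonal at the upper right correspond to loci whose observed P-values are smaller than expected under the null model.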
Supplementary Figure 11
COAD quantile–quantile plot.
MSMutSig QQ plot for colon adenocarcinoma (COAD). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.
Supplementary Figure 12
UCEC quantile–quantile plot.
MSMutSig QQ plot for endometrial cancer (UCEC). Quantile-quantile plot of observed vs. expected P-values under the negative binomial (also called gamma-Poisson) model. Significant MS loci (q<0.1) are shown in red.
[Figure: groups WT (n = 36) and p.K1489fs (n = 31). y-axis: Expression (RSEM), 0 to 1500.]
Supplementary Figure 13
PRDM2 transcript levels in WT versus mutant PRDM2 cases.
PRDM2 transcript levels (by RNA-seq) were lower in cases with a PRDM2 p.K1489fs frameshift mutation than in PRDM2 WT cases (P=0.016, two-tailed Mann-Whitney test).
Supplementary Figure 14
MutSig quantile–quantile plot for endometrial cancer (UCEC).
Quantile-quantile plot of observed vs. expected P-values for MSI-H cases using only previously identified mutations (red) and using previously identified mutations and MS indels (green). Using MutSig for datasets with large numbers of MS indels leads to an inflation in the number of significantly mutated genes.