

Supplementary Figure 1

Patterns of somatically acquired genomic variants in GCT, CCOC, ENOC, and HGSC ovarian cancers.

(a) Top, mutation load showing the total number of mutations (y axis; log10 scale) for each tumor (x axis). Samples are sorted in descending order based on the total number of mutations. Second panel, contributions of the six identified mutation signatures per sample. Six mutation signatures were extracted from the trinucleotide substitutions of 133 tumor genomes: S.APOBEC, signature similar to the COSMIC APOBEC signature (COSMIC signature 13); S.POLE, a mimic of COSMIC signature 10 related to altered activity of the error-prone polymerase POLE; S.AGE, the age-related signature (COSMIC signature 1) that has been known to correlate with age at cancer diagnosis; S.BC, closely matched with the pattern of COSMIC signature 8 previously found in breast cancer and medulloblastoma; S.MMR, matched to COSMIC signature 6 associated with defective mismatch repair; S.HRD, associated with COSMIC signature 3 representing deficiency in homologous recombination DNA repair. (b) Proportion of the genome harboring high-level copy number amplifications (AMP; top), dominant loss of heterozygosity (LOH; second panel), and copy number loss (LOSS; bottom) per sample. (c) Total number of rearrangements (top; y axis, log10 scale) and the proportion of rearrangement types (second panel) observed in each sample. Balanced, balanced rearrangements; Deletion, deletion rearrangements; Duplication, tandem duplications; Foldback, foldback inversions; Unbalanced, unbalanced rearrangements; Inversion, inversion rearrangements.

Nature Genetics: doi:10.1038/ng.3849


Supplementary Figure 2

Substitution mutational signatures.

(a) Top, total number of SNVs (y axis; log10 scale) for each tumor (x axis). Second panel, proportion of six base substitution patterns per sample. Samples are sorted in descending order according to the total number of mutations. (b) Residual sum of squares (RSS; top) and explained variance (second panel) for the number of signatures from 2 to 12, each with 200 replicates. (c) Six inferred mutational signature profiles. S.APOBEC, signature similar to the COSMIC APOBEC signature (COSMIC signature 13); S.POLE, a mimic of COSMIC signature 10 related to altered activity of the error-prone polymerase POLE; S.AGE, the age-related signature (COSMIC signature 1) that has been known to correlate with age at cancer diagnosis; S.BC, closely matched with the pattern of COSMIC signature 8 previously found in breast cancer; S.MMR, matched to COSMIC signature 6 associated with defective mismatch repair; S.HRD, associated with COSMIC signature 3 representing deficiency in homologous recombination DNA repair.


Supplementary Figure 3

Genomic feature descriptions.

Description of the 20 genomic features used in the integrative clustering of patients with ovarian cancer, including 6 mutation signatures (S.APOBEC, S.POLE, S.AGE, S.BC, S.MMR, and S.HRD); 6 rearrangement types and 1 homology length (Foldback.Inversion, Inversion, Tandem.Duplication, Deletion.Rearrangement, Balanced.Rearrangement, Unbalanced.Rearrangement, and Homology>=5bp); 3 copy number aberrations (CN.Amplification, CN.Loss, and CN.LOH); and 4 mutation variant types (Nonsynonymous, Splicesite, Stop.Lost/Gained, and Frameshift).


Supplementary Figure 4

Integration of genomic features stratifies patients with ovarian cancer.

(a) Hierarchical clustering of 133 patients with ovarian cancer (columns) by integrating genomic features including point mutation, copy number, and structural rearrangement profiles identifies seven major genomic subgroups. Scaled values of genomic features (rows in the top panel) are shown in a heat map with a dendrogram of the hierarchical cluster analysis. The color-coding reflects the scaled value of each genomic feature, obtained by subtracting the feature's mean from its value and dividing by its standard deviation. Mutation status (presence, gray; absence, white; rows for bottom panel) for the significantly mutated genes (MutSigCV q < 0.01) and DNA repair genes across patients is displayed. Histotype is included as an annotation row (red, HGSC; blue, CCOC; green, ENOC; purple, GCT). (b) Comparison of the estimated cellularities of the subgroups of each histotype, where no significant differences were observed. The cellularity of each sample was estimated using Titan. Student’s t test was performed, and the corresponding P value is annotated on top of the box plots for each histotype. (c) Determining the number of clusters by the ‘elbow’ rule. The plot shows the explained variance (EV; y axis) computed as a function of the number of clusters (x axis) generated from hierarchical clustering. Given the threshold of EV (at 0.45; horizontal dashed line) and its increment threshold of 0.05, the optimal number of clusters (k = 7) was identified (vertical dashed line). (d) Mutation load in the HGSC subgroups. The mutation load for the HGSC samples in the H-HRD subgroup, on average, was higher than in the H-FBI subgroup (Mann–Whitney–Wilcoxon test, P < 0.001). (e) Focal amplifications (red) and deletions (blue) in the CCOC (C-APOBEC and C-AGE) subgroups. (f) Focal amplifications (red) and deletions (blue) in E-MSI and MSS ENOC samples.


Supplementary Figure 5

Integration/clustering of genomic features/cases in the HGSC cohort only.

(a) Hierarchical clustering of 59 patients with HGSC by integrating genomic variant profiles highlights two major genomic subgroups, H-FBI (n = 22) and H-HRD (n = 37), of HGSC tumors. The plot shows the contribution of genomic features (scaled value; top heat map) with a dendrogram illustrating hierarchical clustering. The mutation status (presence, gray; absence, white) for the significantly mutated genes (MutSigCV q < 0.01) and DNA repair genes across patients is shown in the second panel. Histotype and two subgroups are included as annotation rows (red, HGSC; dark green, H-FBI subgroup; dark orange, H-HRD subgroup). (b) Importance of genomic features segregating the HGSC subgroups of H-FBI (n = 22) and H-HRD (n = 37). Left, genomic features (y axis) are sorted in descending order of the average Gini score (x axis), reflecting the importance of features in stratifying the two subgroups of HGSC tumors. Right, box plot showing the distribution of the top six genomic features contributing to the differences between H-HRD and H-FBI. The y axis shows the value of genomic features. (c,d) Kaplan–Meier plots showing significant differences in overall survival (c) and progression-free survival (d) between the HGSC subgroups H-HRD and H-FBI (log-rank test, P = 0.0083 and 0.0108), in which samples enriched in foldback inversions (H-FBI) had poor survival outcomes.


Supplementary Figure 6

Integration/clustering of genomic features/cases in patients with endometriosis-associated cancers only.

(a) Hierarchical clustering of 35 patients with CCOC and 29 patients with ENOC by integrating genomic variant profiles highlights six major genomic subgroups. The plot shows the contribution of genomic features (scaled value; top heat map) with a dendrogram illustrating hierarchical clustering. The mutation status (presence, gray; absence, white) for the significantly mutated genes (MutSigCV q < 0.01) and DNA repair genes across patients is displayed. (b) Contribution of genomic subgroup memberships in ENOC and CCOC. The number (n) and proportion (%) of samples from each subgroup are shown. (c) Importance of genomic features segregating the C-APOBEC (n = 9) and C-AGE (n = 15) subgroups of CCOC tumors. Left, genomic features (y axis) are sorted in descending order of the average Gini score (x axis), reflecting the importance of features in stratifying the two subgroups of CCOC tumors. Right, box plot showing the distribution of the top six genomic features contributing to the differences between C-APOBEC and C-AGE. The y axis shows the value of genomic features. (d) Importance of genomic features segregating the E-MSI (n = 8) and MSS (n = 20) subgroups of ENOC tumors. Left, genomic features (y axis) are sorted in descending order of the average Gini score (x axis), reflecting the importance of features in stratifying E-MSI and MSS ENOC tumors. Right, box plot showing the distribution of the top six genomic features contributing to the differences between the E-MSI and MSS subgroups. The y axis shows the value of genomic features. (e) Box plots showing the distribution of immunogenic epitope counts in the MSS and E-MSI ENOC subgroups.


Supplementary Figure 7

Homozygous deletions identified from HGSC tumors in PTEN and RB1.

(a,b) Examples of homozygous deletion in PTEN (a) and RB1 (b). In each example, the following are shown: a chromosome ideogram highlighting the region of interest (top); a log-ratio plot overlaying rearrangement events (if present, shown with arcs) on copy number aberration segments (middle); an allelic ratio plot showing the corresponding LOH profile in each region (bottom). ALOH, amplified LOH; HOMD, homozygous deletion; NLOH, neutral LOH; Deletion, deletion rearrangement.


Supplementary Figure 8

HGSC tumors stratified by foldback-inversion profile.

(a) HGSC cases could be stratified into two subgroups based on the proportion of foldback inversions, in which cases with a higher proportion of foldback inversions (with reference to the median), referred to as the High FBI group, had statistically significant inferior overall and progression-free survival outcomes (log-rank test, P = 0.0187 and 0.0286) as compared to cases with a low proportion of foldback inversions (Low FBI group). (b) Distribution of the break distance for foldback inversions in our HGSC cohort. (c–f) Two examples of foldback inversions at chromosome 8. In c and e, the genomic locus of the events is illustrated. A red bar marked on an ideogram of the chromosome 8 q arm shows where the events occurred (top). Two foldback inversions are illustrated: SV1, on the forward strand (arrows pointing to the right in orange; second panel) and SV2, on the reverse strand (backward arrows pointing to the left in grey; third panel). Coverage depth and reads (fourth and fifth panels) covering the breakpoints of foldback inversions and the locations of breakpoints on the genomic scale (bottom) are shown. In d and f, two foldback inversions are shown schematically at nucleotide sequence level. Annotated red on the genomic scale shows the breakpoints of the forward strand and the reverse strand sequences. In d, two foldback inversions co-occurred within 1 kb. Left, the reverse strand inverted and was fused to the forward strand by a 4-bp homology sequence, CTTT (highlighted in green). Right, the forward strand inverted and was fused to the reverse strand by a 12-bp homology sequence, TTCACATGTGAA (highlighted in green). In f, two foldback inversions co-occurred within 2 kb. Left, the reverse strand inverted and was fused to the forward strand by a 4-bp homology sequence, GAGC (highlighted in green). Right, the forward strand inverted and was fused to the forward strand by a 12-bp homology sequence, AGAGTATACTCT (highlighted in green).


Supplementary Figure 9

Co-occurrence of foldback inversions and focal high-level amplifications (HLAMPs) in HGSC samples.

Examples of focal HLAMPs colocalized with foldback inversions in CCNE1 and KRAS in our discovery HGSC cohort and in CCNE1 and MYC in the ICGC HGSC cohort. In each example, the following are shown: a chromosome ideogram highlighting the region of interest (top); a log-ratio plot overlaying rearrangement events (shown with arcs) on copy number (CN) aberration segments (middle); an allelic ratio plot showing the corresponding LOH profile in each region (bottom). ALOH, amplified LOH; ASCNA, allele-specific copy number amplification; BCNA, balanced copy number amplification; GAIN, copy number gain; HET, diploid heterozygous; NLOH, neutral LOH.


Supplementary Figure 10

High-level amplification–associated foldback inversions (HLAMP-FBIs) in HGSC cell lines.

(a) The proportion of HLAMP-FBI for the primary (TOV1369) and relapse (OV1369(R2)) cell lines (red dotted lines) superimposed on the distribution of HLAMP-FBI from the H-HRD (blue) and H-FBI (green) subgroups. (b) Examples of HLAMP-associated foldback inversions present in the relapse cell line (OV1369(R2)) but absent in the primary-tumor-derived cell line (TOV1369) from the same patient. In each example, the following are shown: a chromosome ideogram highlighting the region of interest (top); log-ratio plots overlaying rearrangement events (shown with arcs) on copy number (CN) aberration segments in the primary cell line (middle) and relapse cell line (bottom). HOMD, homozygous deletion; DLOH, deletion LOH; HET, diploid heterozygous; GAIN, CN gain; AMP, copy number amplification; HLAMP, high-level amplification.


Supplementary Note for: Genomic consequences of aberrant DNA repair mechanisms stratify ovarian cancer histotypes

Yi Kan Wang1,∗, Ali Bashashati1,∗, Michael S. Anglesio2,15, Dawn R. Cochrane1, Diljot S. Grewal1,2, Gavin Ha1,3, Andrew McPherson1,2, Hugo M. Horlings1, Janine Senz1, Leah M. Prentice1, Anthony N. Karnezis2, Daniel Lai1, Mohamed R. Aniba1, Allen W. Zhang1,4,5, Karey Shumansky1, Celia Siu1, Adrian Wan1, Melissa K. McConechy2, Hector Li-Chang2, Alicia Tone2,6, Diane Provencher7,8,9, Manon de Ladurantaye7,8, Hubert Fleury7,8, Aikou Okamoto10, Satoshi Yanagida10, Nozomu Yanaihara10, Misato Saito10, Andrew J. Mungall11, Richard Moore11, Marco A. Marra11,12, C. Blake Gilks2,13, Anne-Marie Mes-Masson7,8,14, Jessica N. McAlpine15, Samuel Aparicio1,2, David G. Huntsman1,2,15, Sohrab P. Shah1,2,11

1. Department of Molecular Oncology, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada

2. Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 2B5, Canada

3. Dana Farber Cancer Institute and the Broad Institute, Boston, Massachusetts, USA (present address)

4. Graduate Bioinformatics Training Program, University of British Columbia, Vancouver, British Columbia, Canada

5. Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, British Columbia, Canada

6. Division of Gynecologic Oncology, Princess Margaret Cancer Centre, Toronto (present address)

7. Centre de recherche du Centre hospitalier de l'Universite de Montreal (CRCHUM), Montreal, Canada

8. Institut du cancer de Montreal, Montreal, Canada

9. Division of Gynecologic Oncology, Universite de Montreal, Montreal, Canada

10. Department of Obstetrics and Gynecology, The Jikei University School of Medicine, Tokyo, Japan

11. Michael Smith Genome Sciences Centre, BC Cancer Agency, 675 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada

12. Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 2B5, Canada

13. Department of Pathology, Vancouver General Hospital, 899 W 12th Ave, Vancouver, BC, V5Z 1M9, Canada

14. Department of Medicine, Universite de Montreal, Montreal, Canada

15. Department of Gynecology and Obstetrics, University of British Columbia, BC, V6Z 2K5, Vancouver, Canada

* - equal contribution

Corresponding Authors: David G. Huntsman ([email protected]), Sohrab P. Shah ([email protected])


Keywords: ovarian cancer, high grade serous, fold-back inversions, clear cell, endometrioid, whole genome sequencing, patient stratification


Contents

1 Whole genome sequencing (WGS) analysis
1.1 Copy number alterations and loss of heterozygosity
1.2 Identification of significantly altered genome regions
1.3 Single nucleotide variant (SNV) and indel calling
1.4 Mutation signature extraction
1.5 Detection of kataegis events
1.6 Structural variation prediction
1.6.1 Rearrangement classification
2 Genomic feature computation for clustering
3 Fold-back inversion associated with HLAMP
4 Fold-back inversions (as a single feature) stratified HGSC cases
4.1 Discovery HGSC cases
4.2 ICGC HGSC cases
5 Prediction of structural variations from TCGA ovarian exome sequencing data
6 HGSC cell line sequencing and analysis
7 Validation experiments
7.1 PCR validation of breakpoints prediction
7.1.1 Target selection
7.1.2 Bioinformatics of validation approach and analysis results
7.2 Validation of single nucleotide variants (SNVs)
7.2.1 Targeted library construction and sequencing
7.2.2 Targeted deep sequencing analysis and results
8 Microsatellite instability testing
9 Nanostring molecular subtypes
10 Prediction of neoantigens in ENOC
10.1 HLA predictions
10.2 MHC-I binding prediction
10.3 Filtering
11 General statistical analysis


1 Whole genome sequencing (WGS) analysis

1.1 Copy number alterations and loss of heterozygosity. Titan [1] (version 1.5.5, available through the R Bioconductor TitanCNA package) was run on whole genome sequencing data to estimate cellularity and to profile clonal and subclonal regions of somatic copy number alterations (CNA) and loss of heterozygosity (LOH) from the matched tumour/normal samples. All tumour samples were run with ploidy initializations of 2 and 4. All Titan algorithm parameters other than the following were left at their defaults:

• norm_est_meth = ’map’ # estimate normal content using MAP
• max_iters = 50 # maximum number of EM iterations
• pseudo_counts = 1e-300
• txn_z_strength = 1e6
• txn_exp_len = 1e16
• alpha_high = 20000
• alpha_k = 15000 # prior on the copy number Gaussian variance parameter
• normal_params_n0 = 0.5 # initial normal content
• estimate_ploidy = TRUE

Using an internal clustering validation measure, the S_Dbw validity index, as guidance, the optimal number of clusters/clones from the Titan predictions was determined by manual inspection of the copy number, allelic ratio and cellular prevalence profiles from both the diploid and tetraploid runs. CNA segments shorter than 5 kb were additionally filtered out. Gene annotation for each copy number segment was performed using the Python library pygenes (version 1.0.2) with the human genome reference Homo sapiens GRCh37.73.gtf.

1.2 Identification of significantly altered genome regions. GISTIC2.0 (version 2.0.21) was used to identify significantly amplified or deleted copy number aberration regions in each histotype and in each subgroup of samples. Titan-predicted copy number segments and the corresponding median LogR values were used as segmented data, and the SNPs generated in the Titan analysis were used as markers. All GISTIC parameters other than the following three were left at their defaults:

• -conf = 0.9


• -maxseg = 2000

• -rx = 0

The number and proportion of samples harbouring deep deletion (t < -1.3), shallow deletion (-0.1 > t ≥ -1.3), neutral (0.1 ≥ t ≥ -0.1), low-level gain (0.9 ≥ t > 0.1) and high-level gain (t > 0.9) were computed for every significantly aberrant region (FDR q-value < 0.25).
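The five categories above amount to a simple binning of the per-sample value t. The following Python sketch is illustrative only and is not the code used in the study; the function names and the list-of-floats input are assumptions made for the example.

```python
from collections import Counter

# Category labels in the order used in the text.
CATEGORIES = [
    "deep deletion",
    "shallow deletion",
    "neutral",
    "low-level gain",
    "high-level gain",
]


def classify_copy_number(t: float) -> str:
    """Bin a value t using the thresholds quoted in the text."""
    if t < -1.3:
        return "deep deletion"      # t < -1.3
    if t < -0.1:
        return "shallow deletion"   # -0.1 > t >= -1.3
    if t <= 0.1:
        return "neutral"            # 0.1 >= t >= -0.1
    if t <= 0.9:
        return "low-level gain"     # 0.9 >= t > 0.1
    return "high-level gain"        # t > 0.9


def summarize_region(values):
    """Return {category: (count, proportion)} for one aberrant region.

    `values` is a hypothetical list holding one t value per sample.
    """
    counts = Counter(classify_copy_number(t) for t in values)
    n = len(values)
    return {
        cat: (counts.get(cat, 0), counts.get(cat, 0) / n if n else 0.0)
        for cat in CATEGORIES
    }
```

Calling `summarize_region` once per significant region (FDR q-value < 0.25) would give the per-region counts and proportions described above.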

1.3 Single nucleotide variant (SNV) and indel calling. SNVs were predicted using an updated version of mutationSeq (version 4.3.5; model v4.1.2.npz) [2], available at http://compbio.bccrc.ca/software/mutationSeq. We also used Strelka (version 1.0.13) [3] with default parameter settings to identify somatic SNVs and indels. Both SNVs and indels were then annotated for variant effects and gene-coding status using SnpEff [4] (version 3.6b).

We further identified a set of high confidence SNVs by taking the intersection of the high probability calls predicted by mutationSeq (probability ≥ 0.9) and the somatic SNVs predicted by Strelka. Significantly mutated genes (SMGs) were identified by MutSigCV [5] (version 1.4) on the entire data cohort. Genes with a false discovery rate (FDR) q < 0.1 were predicted as SMGs. SNVs and indels with the following SnpEff annotations, SPLICE_SITE_ACCEPTOR, SPLICE_SITE_DONOR, NON_SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED, STOP_LOST, in SMGs and DNA repair genes, including TP53, PIK3CA, ARID1A, PTEN, PER3, KRAS, CTNNB1, FOXL2, NF1, KMT2B, PPP2R1A, PIK3R1, RPL22, POLE, RB1, BRCA1 and BRCA2, are reported in the annotation track (bottom panel) of Fig. 1a.

The high confidence set of SNVs was further filtered by removing positions that fell within either of the following regions: (1) the UCSC Genome Browser blacklists (Duke and DAC), and (2) regions defined in the ’CRG Alignability 36mer track’ with more than two mismatch nucleotides, requiring a 36-nucleotide fragment to be unique in the genome even after allowing for two differing nucleotides. Post-processing of this set of high confidence SNVs and the somatic indels from Strelka involved removing known variants (both SNVs and indels) obtained from the 1000 Genomes Project (release 20130502) and dbSNP (version dbsnp_142.human_9606). The set of high confidence somatic SNVs and indels passing the above filters was then used in the downstream mutation signature analysis and feature computation.
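The SNV filtering chain described in this section (intersection of the two callers, removal of positions in excluded regions, removal of known variants) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the variant key `(chrom, pos, ref, alt)`, the in-memory region dictionary and all function names are hypothetical, and no real mutationSeq, Strelka or UCSC file format is parsed.

```python
from bisect import bisect_right

# A variant is keyed as (chrom, pos, ref, alt); this layout is an assumption.


def high_confidence_snvs(mutationseq_calls, strelka_calls, min_probability=0.9):
    """Intersect high-probability mutationSeq calls with Strelka somatic calls.

    mutationseq_calls: iterable of (variant_key, probability) pairs.
    strelka_calls: iterable of variant keys.
    """
    confident = {key for key, prob in mutationseq_calls if prob >= min_probability}
    return confident & set(strelka_calls)


def in_any_region(chrom, pos, regions):
    """True if pos lies in one of the sorted, non-overlapping intervals.

    regions: dict mapping chrom -> sorted list of inclusive (start, end) tuples.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def post_filter(snvs, excluded_regions, known_variants):
    """Drop variants in excluded regions or present in a known-variant set."""
    return {
        v for v in snvs
        if not in_any_region(v[0], v[1], excluded_regions)
        and v not in known_variants
    }
```

Here `excluded_regions` would hold the blacklist and low-alignability intervals, and `known_variants` the 1000 Genomes and dbSNP entries, each loaded by whatever parser suits the source files.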

Coding mutations were defined as positions having any of the following SnpEff annotations: SPLICE_SITE_ACCEPTOR, SPLICE_SITE_DONOR, START_LOST, NON_SYNONYMOUS_START, NON_SYNONYMOUS_CODING, FRAME_SHIFT, CODON_CHANGE, CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION, STOP_GAINED, STOP_LOST, RARE_AMINO_ACID.


1.4 Mutation signature extraction. Trinucleotide mutation signatures were deciphered from the nucleotide substitution contexts of 133 tumour genomes using non-negative matrix factorization (NMF) with a random seeding method and the ’brunet’ algorithm, executed by the R packages NMF (version 0.20.6) and SomaticSignatures (version 2.5.5).

NMF was run with the number of signatures (i.e., the NMF rank) ranging from 2 to 12. For each number of signatures, NMF was performed with 200 iterations (Supplementary Fig. 2b). The goodness of fit was then examined by computing the residual sum of squares (RSS) and the explained variance. The optimal number of signatures (rank = 6) was selected as the point at which the goodness of fit converged. The inferred mutation signatures (Supplementary Fig. 2c) were then compared to a curated list of cancer census mutational signatures and their presence in human cancer, available at http://cancer.sanger.ac.uk/cosmic/signatures. The proposed aetiology of the closest match was used to name each inferred mutation signature, i.e. S.APOBEC, S.POLE, S.AGE, S.BC, S.MMR and S.HRD.
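The rank survey described above can be illustrated with a small NumPy sketch. The analysis itself was run with the R packages NMF and SomaticSignatures; the code below is only a stand-in that uses the standard multiplicative updates for the Kullback-Leibler objective (the family the ’brunet’ algorithm belongs to). The function names, the number of update steps, and the definition of explained variance as 1 - RSS / sum(V^2) are assumptions made for this example.

```python
import numpy as np


def nmf_kl(V, rank, n_updates=200, seed=0, eps=1e-10):
    """Factorise V (contexts x samples) as W @ H with KL multiplicative updates."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)  # random seeding of W and H
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_updates):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def fit_quality(V, W, H):
    """Return (RSS, explained variance), with EV assumed to be 1 - RSS / sum(V^2)."""
    V = np.asarray(V, dtype=float)
    rss = float(np.sum((V - W @ H) ** 2))
    return rss, 1.0 - rss / float(np.sum(V ** 2))


def rank_survey(V, ranks, n_runs=5, n_updates=200):
    """For each rank, keep the (RSS, EV) pair of the best of n_runs random starts."""
    survey = {}
    for r in ranks:
        fits = [
            fit_quality(V, *nmf_kl(V, r, n_updates=n_updates, seed=s))
            for s in range(n_runs)
        ]
        survey[r] = min(fits, key=lambda q: q[0])
    return survey
```

Plotting the RSS and explained variance returned by `rank_survey(V, range(2, 13))` against the rank gives curves of the kind shown in Supplementary Fig. 2b, from which the rank where the fit stops improving can be read off.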

To remove the random seeding bias in the NMF results (i.e., to obtain stable mutation signatures), we performed NMF with multiple random seeds and computed a representative contribution profile for each mutation signature. Briefly, with the optimal number of signatures (rank = 6), NMF was performed 2000 times, and the inferred mutation signatures (basis component matrix) and their contribution profiles per sample (mixture coefficient matrix, i.e. $C_{\mathrm{signature}_i}$, $i = 1, \ldots, 6$, across 133 samples) were computed for each iteration.

The Partitioning Around Medoids (PAM) method, executed by ’pam’ in the R package cluster (version 2.0.3), was used to establish 6 clusters from the set of 2000 mixture coefficient matrices. The mean of each cluster was computed as the representative contribution of each mutation signature. The normalized contribution profiles (referred to as ’coefficients’ in the main text), i.e. $C_{\mathrm{S.APOBEC}}$, $C_{\mathrm{S.POLE}}$, $C_{\mathrm{S.AGE}}$, $C_{\mathrm{S.BC}}$, $C_{\mathrm{S.MMR}}$ and $C_{\mathrm{S.HRD}}$, were then used in the downstream analysis (Supplementary Fig. 1a) as the contribution of mutation signatures.

1.5 Detection of kataegis events. Post-processed high confidence SNVs were used to identify foci of kataegis, i.e. regions of localized hypermutation, in each sample according to the criteria and method proposed in [6]. Briefly, for each sample, all mutations were ordered by chromosomal position and the intermutation distance (defined as the number of base pairs from each mutation to the next one) was calculated. Intermutation distances were then segmented by fitting a piecewise constant curve, based on a recursive partitioning and regression tree model (executed by the R package rpart (version 4.1.10)), to find regions of constant intermutation distance. The minimum number of mutations that must exist in a node in order for a split to be attempted was set to six. Putative regions of kataegis were identified as those segments containing six or more consecutive mutations with an average intermutation distance of ≤ 1000 bp. The kataegic foci were further refined by retaining the mutation clusters enriched for C>T and C>G mutations with a predilection for a TpCN mutation context, i.e. clusters in which C>T or C>G mutations made up >50% of total mutations at the kataegic focus, and of which >50% were in a TpCN context.
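The distance criterion above can be illustrated with a minimal Python sketch. It does not reproduce the rpart segmentation: as a simplifying assumption, it greedily scans one chromosome for runs of at least six consecutive mutations whose mean intermutation distance is ≤ 1000 bp. The function names and the input format (a list of positions on a single chromosome) are hypothetical.

```python
def intermutation_distances(positions):
    """Distance in bp from each mutation to the next one (positions sorted)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def candidate_kataegis_runs(positions, min_mutations=6, max_mean_distance=1000):
    """Greedy scan for runs of >= min_mutations consecutive mutations whose
    mean intermutation distance is <= max_mean_distance.

    Simplification (assumption): the source segments distances with a
    regression tree; here a run is grown one mutation at a time instead.
    Returns a list of (first_position, last_position, n_mutations).
    """
    positions = sorted(positions)
    runs, start, n = [], 0, len(positions)
    while start + min_mutations <= n:
        end = start + min_mutations - 1  # index of the last mutation in the window
        if (positions[end] - positions[start]) / (end - start) > max_mean_distance:
            start += 1
            continue
        # Extend the run while the mean distance stays within the limit.
        while end + 1 < n:
            mean = (positions[end + 1] - positions[start]) / (end + 1 - start)
            if mean > max_mean_distance:
                break
            end += 1
        runs.append((positions[start], positions[end], end - start + 1))
        start = end + 1
    return runs
```

The substitution-type and TpCN-context refinement would then be applied to each returned run.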

1.6 Structural variation prediction. Rearrangement breakpoints were predicted using lumpy (version 0.2.13) [7], executed by SpeedSeq (version 0.1.0) [8], and destruct (version 0.4.5), which is derived from nFuse [9, 10] and available at https://bitbucket.org/dranew/destruct.

In brief, destruct (see related manuscript file) extracted discordant and non-mapping reads from BAM files and realigned the reads using a seed-and-extend strategy. Split alignment across a putative breakpoint was attempted for reads that did not fully align to a single locus. Discordant alignments were clustered according to the likelihood that they were produced from the same breakpoint. Multiply mapped reads were assigned to a single mapping location using previously described methods [11, 12]. Finally, heuristic filters removed predicted breakpoints with poor discordant read coverage of the sequence flanking the predicted breakpoints.

We applied stringent 3-step filtering criteria to identify high confidence breakpoint calls for downstream analysis, as follows:

• Step 1: breakpoints that were predicted by both algorithms, lumpy and destruct, were taken.

• Step 2: we removed (1) breakpoints in poor mappability regions, (2) events with break distance ≤ 30 bp, and (3) breakpoints annotated as deletions with breakpoint size < 1000. Furthermore, only high confidence breakpoints that had at least five supporting reads in tumour and no read support in the matched normal sample were used in the analysis. The breakpoints were further filtered by removing positions in either of the following regions: (1) UCSC Genome Browser blacklists (Duke and DAC), and (2) regions defined in the 'CRG Alignability 36mer track' with more than two mismatch nucleotides, requiring a 36-nucleotide fragment to be unique in the genome even after allowing for two differing nucleotides.

• Step 3: predictions with small break distance and a low number of supporting reads in tumour samples were excluded. We designed a targeted deep sequencing PCR experiment to inform the filtering criteria for this step (see Section 7.1).
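As an illustration of Steps 1 and 2, a minimal Python sketch of the filter might look as follows. The `Breakpoint` record and its field names are hypothetical, the mappability and blacklist checks are collapsed into a single precomputed flag, and the deletion "breakpoint size" is assumed to be the break distance in bp.

```python
from dataclasses import dataclass


@dataclass
class Breakpoint:
    callers: frozenset      # tools that predicted this breakpoint
    sv_type: str            # e.g. "deletion", "inversion"
    break_distance: int     # distance between break-ends (assumed to be in bp)
    tumour_reads: int       # supporting reads in the tumour sample
    normal_reads: int       # supporting reads in the matched normal sample
    low_mappability: bool   # True if in a blacklisted or poorly mappable region


def passes_filters(bp):
    """Apply Steps 1 and 2 of the filtering criteria to one breakpoint."""
    if not {"lumpy", "destruct"} <= bp.callers:
        return False  # Step 1: must be predicted by both callers
    if bp.low_mappability:
        return False  # Step 2 (1): poor mappability or blacklisted region
    if bp.break_distance <= 30:
        return False  # Step 2 (2): break distance <= 30 bp
    if bp.sv_type == "deletion" and bp.break_distance < 1000:
        return False  # Step 2 (3): small deletions (size assumed = break distance)
    # At least five tumour reads and no support in the matched normal.
    return bp.tumour_reads >= 5 and bp.normal_reads == 0
```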

1.6.1 Rearrangement classification

We classified breakpoints by orientation type and rearrangement type. Orientation type refers to the relative position and orientation of the break-ends in the genome and consists of 4 categories: deletion, duplication, inversion and translocation. As expected, translocation breakpoints are those for which the break-ends are on different chromosomes; deletion breakpoints are those resulting from removing a segment of a chromosome and rejoining the free ends; duplication breakpoints are those resulting from a copy of a segment being inserted before or after the segment (tandem duplication); and inversion breakpoints refer to one of the two breakpoints resulting from excision, inversion and reinsertion of a segment.

Rearrangement type refers to the type of rearrangement event that produced the breakpoint, where a rearrangement can be the result of one or more breakpoints. Rearrangement type consists of 6 categories: balanced, deletion, fold-back, inversion, duplication and unbalanced. Balanced rearrangements are any set of breakpoints that preserve the number of copies of adjacent chromosomal segments. We identify balanced rearrangements as alternating cycles in the breakpoint graph, as previously described [9]. Included in balanced rearrangements are reciprocal translocations, balanced insertions, and inversions greater than 1 Mb in size for which both breakpoints have been identified. Inversions less than 1 Mb in size are given the rearrangement type of inversion. Deletion and duplication rearrangement types are single breakpoint events, maximum 1 Mb in size, for which those breakpoints have not been identified as part of a balanced rearrangement. Fold-back rearrangements are inversion type breakpoints, maximum 30 kb in size, that have not been identified as part of an inversion or other balanced rearrangement. These breakpoints are termed fold-back as they imply an operation, duplication of a chromosome arm and subsequent joining of the two arms with opposing orientation, that results in the DNA sequence folding back on itself. The remaining set of unclassified breakpoints are given the rearrangement type of unbalanced.

2 Genomic feature computation for clustering

We generated 20 genomic features for clustering ovarian cancer tumours based on copy number aberrations (CNAs), mutation profiles and structural variation characteristics (Supplementary Fig. 3). The three CNA-related features were the proportion of the genome harbouring loss of heterozygosity (LOH), the proportion of the genome harbouring high-level copy number amplification (CN.Amplification) and the proportion of the genome harbouring copy number loss (CN.Loss). For each sample, LOH was computed as the total length of copy number segments with dominant clonal Titan calls of DLOH, NLOH or ALOH, divided by the total length of the genome. CN.Amplification was computed as the total length of copy number segments in which the estimated total copy number > estimated ploidy (rounded to the nearest integer) + 2, divided by the total length of the genome. CN.Loss was computed as the total length of copy number segments associated with Titan calls of DLOH or HOMD, divided by the total length of the genome.
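The three CNA features reduce to length-weighted fractions of the genome. The following Python sketch shows one way to compute them; the segment tuple layout, half-open coordinates and half-up rounding of ploidy are assumptions made for illustration, not details taken from the Titan output format.

```python
LOH_CALLS = {"DLOH", "NLOH", "ALOH"}
LOSS_CALLS = {"DLOH", "HOMD"}


def fraction_with_calls(segments, calls, genome_length):
    """Fraction of the genome covered by segments whose call is in `calls`.

    segments: iterable of (start, end, titan_call, total_copy_number),
    with half-open coordinates so that length = end - start (assumption).
    """
    covered = sum(end - start for start, end, call, _ in segments if call in calls)
    return covered / genome_length


def cn_amplification_fraction(segments, ploidy, genome_length):
    """Fraction of the genome with total copy number > round(ploidy) + 2."""
    threshold = int(ploidy + 0.5) + 2  # round half up (assumed rounding rule)
    covered = sum(end - start for start, end, _, cn in segments if cn > threshold)
    return covered / genome_length
```

For example, with a ploidy estimate of 2.1 the amplification threshold is 4, so only segments with a total copy number of 5 or more count towards CN.Amplification.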

The mutation profiles comprised the contribution of six mutation signatures and the proportion of four types of mutations: non-synonymous coding mutations, stop-gained/loss mutations, splice-site mutations and frameshifts. The contribution of mutation signatures was the normalized representative contribution of each mutation signature as described in Section 1.4. For each sample, the proportions of non-synonymous coding, stop-gained/loss, splice-site, and frameshift mutations with SnpEff effect in the following categories were computed: NON_SYNONYMOUS_CODING for Nonsynonymous; SPLICE_SITE_ACCEPTOR or SPLICE_SITE_DONOR for Splice-site; STOP_GAINED or STOP_LOST for Stop.Lost/Gained; and FRAME_SHIFT for Frameshift.

The structural variation characteristics were defined by the types of rearrangements and the length of homology associated with each rearrangement. The proportion of six rearrangement events, defined as Fold-back (Foldback.Inversion), Duplication (TandemDuplication), Deletion (DeletionRearrangement), Balanced (BalancedRearrangement), Unbalanced (UnbalancedRearrangement), and Inversion, was computed for each sample. For the cases with no breakpoints passing the stringent criteria (as described above), the proportion of rearrangement events was treated as NA. The proportion of rearrangements in each sample associated with large homology, i.e. homology size ≥ 5 bp (Homology ≥ 5bp), was computed.

The 20 genomic features for 133 tumour samples were combined to generate a feature matrix (Supplementary Table 5a), representing genomic characteristics of the patients. We imputed the missing values in the feature matrix, i.e. the proportions of rearrangement events, using the impute.knn function from the R package impute (version 1.44.0) with default parameter settings. Each feature in the matrix was then scaled by subtracting its mean from the values and then dividing by its standard deviation.
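The scaling step is a per-feature z-score. A minimal Python sketch is given below; the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used, and the function names are hypothetical.

```python
import statistics


def scale_feature(values):
    """Centre one feature on its mean and divide by its standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator): an assumption
    return [(v - mean) / sd for v in values]


def scale_matrix(rows):
    """Scale every column of a samples-by-features matrix (list of rows)."""
    columns = [scale_feature(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

After scaling, every feature has mean 0 and unit standard deviation, so features measured on different scales contribute comparably to the distance used for clustering.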

Hierarchical clustering analysis (using the R package pheatmap (version 1.0.8)), with the 'manhattan' distance measure and the 'ward.D' agglomeration method, was performed on the feature matrix to determine the subgroupings of the 133 patients. The cut-off selected for the dendrogram was determined by assessing the percentage of explained variance (EV) and its increment for a given number of clusters k using the 'elbow' rule. Given the distance matrix and the hierarchical clustering, the css.hclust function (R package GMD (version 0.3.3)) was used to compute the sum-of-squares. The percentage of variance explained was computed as the ratio of the total between-group variance to the total sum of squares of the data (Supplementary Table 5b). Following the 'elbow' rule, the elbow.batch function was used for clustering evaluation and the optimal number of clusters (i.e., seven clusters) was selected with an EV threshold of 0.45 and an EV increment threshold of 0.05 (Supplementary Fig. 4c).
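The explained variance and the elbow selection can be sketched in Python as follows. Two assumptions are made and should be read as such: the sum of squares is computed with Euclidean distances to centroids rather than from the Manhattan distance matrix used above, and the elbow rule is read as "the smallest k whose EV reaches the EV threshold and whose gain from k to k + 1 falls below the increment threshold". Neither function reproduces the GMD implementation.

```python
def explained_variance(points, labels):
    """Between-group sum of squares divided by the total sum of squares."""
    dims = range(len(points[0]))

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in dims]

    def sum_of_squares(rows, centre):
        return sum((r[d] - centre[d]) ** 2 for r in rows for d in dims)

    total = sum_of_squares(points, centroid(points))
    within = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        within += sum_of_squares(members, centroid(members))
    return (total - within) / total


def elbow_k(ev_by_k, ev_threshold=0.45, inc_threshold=0.05):
    """Smallest k with EV >= ev_threshold and EV(k + 1) - EV(k) < inc_threshold.

    This reading of the elbow rule is an assumption, not the GMD source.
    """
    ks = sorted(ev_by_k)
    for k, k_next in zip(ks, ks[1:]):
        if ev_by_k[k] >= ev_threshold and ev_by_k[k_next] - ev_by_k[k] < inc_threshold:
            return k
    return None
```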

3 Fold-back inversion associated with HLAMP

Given the rearrangement breakpoints (from destruct and lumpy) and copy number aberrations (from Titan) identified from WGS data, we computed the average LogR values of copy number gains for SNPs within a 100 kb window of a breakpoint (i.e. 50 kb on each side of a breakpoint). Afterwards, the mean value of the average LogR corresponding to each type of rearrangement for a given case was computed. The lower quartile, median and upper quartile were then calculated, separately, for cases in H-HRD and H-FBI.

4 Fold-back inversions (as a single feature) stratified HGSC cases

4.1 Discovery HGSC cases. The cases were stratified into two groups based on the median value of the fold-back inversion proportion: cases with a proportion of fold-back inversions > the median were assigned to the High FBI group and the remaining cases to the Low FBI group. A log-rank test was performed on the two groups to determine the significance of the difference between their survival outcomes (Supplementary Fig. 8a).

4.2 ICGC HGSC cases. ICGC HGSC cohort structural variants and clinical outcome data (release 17) were downloaded from the ICGC data portal. Only primary tumour samples were included. Inversions with break distance ≤ 30000 bp were reclassified as fold-back inversions. The proportion of fold-back inversions was computed for each sample. The ICGC HGSC cases were then stratified into two groups based on the median value of the fold-back inversion proportion: cases with a proportion of fold-back inversions > the median were assigned to the High FBI group and the remaining cases to the Low FBI group. Gene expression molecular subtypes and BRCA status for the ICGC HGSC cases were available from [13]. A log-rank test was performed on the two groups to determine the significance of the difference between their survival outcomes. In addition, using the same procedures with which we analyzed our in-house HGSC cohort, we profiled the copy number aberrations and rearrangement events from the raw BAM files for 62 (out of 82) ICGC HGSC cases for which both matched tumour/normal BAM files were available through the ICGC Data Portal (https://dcc.icgc.org/).
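The reclassification and median split used in Sections 4.1 and 4.2 can be sketched in a few lines of Python. The event tuple layout and function names are hypothetical; the thresholds are the ones stated above.

```python
import statistics


def foldback_proportion(events, max_distance=30000):
    """Proportion of events that are inversions with break distance <= 30000 bp.

    events: list of (sv_type, break_distance_bp); returns None if empty.
    """
    if not events:
        return None
    foldbacks = sum(1 for sv_type, dist in events
                    if sv_type == "inversion" and dist <= max_distance)
    return foldbacks / len(events)


def stratify_by_median(proportions):
    """Map each case to 'High FBI' (proportion > median) or 'Low FBI'."""
    median = statistics.median(proportions.values())
    return {case: "High FBI" if p > median else "Low FBI"
            for case, p in proportions.items()}
```

Because the comparison is strict, cases whose proportion equals the median fall into the Low FBI group.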

5 Prediction of structural variations from TCGA ovarian exome sequencing data

We analyzed the TCGA high grade serous ovarian cancer cases to determine whether the co-occurrence of amplifications (AMPs) and fold-back inversions stratifies cases into subgroups with distinct survival outcomes. To do so, we performed the following:

• A set of n = 435 TCGA ovarian serous cystadenocarcinoma cases with complete hg19 exome BAM files, copy number data, and clinical data was selected for this analysis. The copy number SNP array data and clinical data for these cases were downloaded from the TCGA Pancancer project under Synapse (https://www.synapse.org/) with Synapse ID: syn1461171. The corresponding exome BAM files were downloaded from the British Columbia Cancer Agency's Genome Sciences Centre (GSC) servers, which host the TCGA sequencing data.

• Genes associated with copy number LogR ≥ 1 were extracted for each case. The genomic positions for the genes were obtained from UCSC. A total of 360 cases were found to harbour amplifications in at least one gene (in other words, 75 cases were found to have no AMP events).

• To identify structural variations in copy number amplified regions, BreaKmer (version v0.0.6) [14], with default parameter settings, was run on the 360 cases.

• Post-processing of the BreaKmer-predicted rearrangements included (i) removing events with an undefined structural variation subtype, i.e. SV subtype = "None"; (ii) keeping the rearrangement events supported by at least 6 discordant reads; (iii) retaining for downstream analysis only the genomic breakpoints with a read depth of at least 60 and at least 2 split reads; and (iv) further classifying inversions with break distance ≤ 30000 bp as fold-back inversions.

• For each case, we identified all the AMP regions harbouring fold-back inversions and then computed the average LogR of these regions.

• Taking the median of the average LogR as the boundary, the 360 cases were split into two subgroups: cases with average LogR > median average LogR (FBI-AMP High, n = 174) and cases with average LogR ≤ median average LogR (FBI-AMP Low, n = 186).

• By incorporating the set of cases with no AMP (n = 75), we performed a survival analysis on the three subgroups using the R package survival (version 2.38.3). The Kaplan-Meier estimator and the log-rank test were computed to compare the survival outcomes between the three subgroups.
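The post-processing rules applied to the BreaKmer calls in the list above can be sketched as a simple filter. The dictionary keys below are hypothetical and are not BreaKmer's output column names; only the thresholds come from the text.

```python
def postprocess_calls(calls):
    """Filter predicted rearrangements and flag fold-back inversions.

    calls: list of dicts with the hypothetical keys 'sv_subtype',
    'discordant_reads', 'read_depth', 'split_reads', 'break_distance'.
    """
    kept = []
    for call in calls:
        if call["sv_subtype"] in (None, "None"):
            continue  # (i) undefined structural variation subtype
        if call["discordant_reads"] < 6:
            continue  # (ii) fewer than 6 discordant reads
        if call["read_depth"] < 60 or call["split_reads"] < 2:
            continue  # (iii) insufficient read depth or split-read support
        out = dict(call)
        # (iv) short inversions are further classified as fold-back inversions
        out["is_foldback"] = (call["sv_subtype"] == "inversion"
                              and call["break_distance"] <= 30000)
        kept.append(out)
    return kept
```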

6 HGSC cell line sequencing and analysis

Two HGSC cell lines derived from either solid tumour tissue (TOV) or ascites (OV) of the same patient (1369) in a previous study [15] were selected for whole-genome sequencing. The cell line TOV1369 was derived from the primary tumour sample, collected at diagnosis, and OV1369(R2) was derived from the relapse sample, which had been treated with chemotherapy. The corresponding IC50 values for carboplatin and olaparib were reported, and the methods were fully described and used, in [16, 17]. The cell lines did not have corresponding matched normals. However, similar to the matched tumour/normal analysis (Section 1.6), both destruct and lumpy, with some modifications (see below), were used to predict breakpoints. We ran destruct on a pool of samples including the cell line samples and eight normal samples (DAH290, DG1316, DG1230, DG1023, DAH145, DAH123, DG1331 and DAH168) chosen at random, with two samples from each of the 4 ovarian cancer types under consideration. The destruct run resulted in a set of breakpoint predictions for the pool of datasets and, for each prediction, the number of reads supporting that prediction in each dataset. Predictions supported by at least one read in any of the normal samples were marked as germline/artifact and filtered. Additionally, predictions supported by at least one read in two or more distinct cell line samples were marked as artifacts and filtered. Further filtering was identical to the matched normal destruct analysis (Section 1.6). In addition, lumpy (single-sample mode) was run on the cell line samples. Highly filtered breakpoints that were predicted by both destruct and lumpy, as previously described (Section 1.6), were used for computing the profile of rearrangement events.

In addition, we profiled copy number aberrations from the WGS data of the cell lines using HMMCopy [18]. High-level amplifications associated with fold-back inversions (HLAMP-FBI) were identified and the corresponding proportion of HLAMP-FBI events was then computed.

7 Validation experiments

7.1 PCR validation of breakpoints prediction

7.1.1 Target selection

We designed an experiment to comprehensively validate the presence of fold-back inversions and other rearrangement breakpoints in a single sample. We used a sample, DAH208, that harboured a wide spectrum of rearrangement predictions, including a high number of fold-back inversions at low prevalence (i.e., low read counts supporting the prediction) and some high prevalence events (i.e., high read counts supporting the prediction). A primary aim of the experiment was to determine whether the many fold-back inversions supported by low read counts were true rearrangements, false positive artifacts, or very low prevalence sample-specific events. A further aim was to identify features of fold-back inversions and other rearrangement breakpoints that could discern true from false positives. We targeted the following 9 categories of breakpoints, with the specified number of events for each category.

• 5 deletion breakpoints

• 5 duplication breakpoints

• 21 general fold-back breakpoints

• 10 high break distance breakpoints

• 10 high homology breakpoints

• 15 high read count breakpoints

• 10 low break distance breakpoints

• 10 low homology breakpoints

• 10 unbalanced breakpoints

Categories were defined as follows. High read count fold-backs had at least 5 WGS reads. Low break distance fold-backs were fold-backs with break-ends within 4 nucleotides of each other, and high break distance fold-backs had a break distance greater than 100. High and low homology were defined as greater than 10 and less than or equal to 3, respectively.

7.1.2 Bioinformatics of validation approach and analysis results

Targeted deep sequencing was performed according to internal lab standard operating procedures, which were previously described [19, 20], and the respective manufacturers' specifications. PCR and MiSeq sequencing produced 151 × 151 bp paired-end reads: 3953239 for the normal sample and 12790459 for the tumour sample. Reads were aligned to predicted breakpoint sequences using bwa version 0.7.12. Paired-end reads were discarded unless at least 100 bp of each read aligned to the same breakpoint sequence, within 5 bp of the expected start location given the location of the primers. Passing read alignments were counted for each breakpoint. Read counts were less than 98 for the normal sample. The read count distribution for the tumour sample was multi-modal, with read counts less than 100 for some breakpoints and greater than 1000 for others. Based on the distribution of read counts in the normal sample, we selected a presence/absence threshold of 100 reads for both tumour and normal samples. For predictions that passed the presence/absence threshold of 100, tumour read counts were greater than 2282, with median 84364 and 1st and 3rd quartiles at 23320 and 112015, respectively. We successfully validated 5/5 deletions, 4/5 duplications, 2/10 unbalanced and 8/15 high read count fold-backs (Supplementary Table 3). None of the predictions with a break distance of less than 100 validated. Homology of successfully validated events was 6 or less, and higher homology events did not validate. Based on the above results, we adjusted the breakpoint prediction filtering criteria to include breakpoints with read support ≥ 5 and break distance ≥ 30. This resulted in a true positive rate of 90.5% (19 true SVs out of 21 predictions).
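The reported true positive rate is the fraction of validated events among the predictions that pass the adjusted criteria. A minimal Python sketch of that calculation follows; the record layout is hypothetical and the per-event data are not reproduced here, only the stated totals.

```python
def true_positive_rate(predictions, min_reads=5, min_break_distance=30):
    """Fraction of validated events among predictions passing the criteria.

    predictions: list of dicts with hypothetical keys 'reads',
    'break_distance' and 'validated' (bool). Returns None if nothing passes.
    """
    kept = [p for p in predictions
            if p["reads"] >= min_reads and p["break_distance"] >= min_break_distance]
    if not kept:
        return None
    return sum(1 for p in kept if p["validated"]) / len(kept)


# The totals stated in the text: 19 validated events out of 21 predictions.
reported_rate_percent = round(100 * 19 / 21, 1)
```

Here `reported_rate_percent` evaluates to 90.5, matching the figure quoted above.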

7.2 Validation of single nucleotide variants (SNVs)

7.2.1 Targeted library construction and sequencing

To establish the sensitivity and specificity of our SNV prediction pipeline, we performed validation experiments on all 59 HGSC tumor/normal pairs. More specifically, we selected 192 predicted somatic SNVs per case as candidates for deep sequencing, which included all the somatic coding SNVs and, if the number of coding SNVs was less than 192, randomly selected high confidence non-coding SNVs to reach 192 targets per case.

Whole genome amplification (WGA) was performed on matched tumour/normal samples. 192 case-specific primers were designed, with an average primer length of 40 bases, followed by optimization and amplicon generation. Primer quality control (QC) and forward and reverse PCR amplification were performed. Genomic libraries were created for Illumina sequencing using plate-based small gap library construction. Libraries were indexed, pooled and sequenced on an Illumina HiSeq using 250 base PET lanes to a median depth >5000x. 42 of the initial 59 libraries passed WGA, QC and PCR amplification and were carried forward for downstream targeted deep sequencing analysis.

7.2.2 Targeted deep sequencing analysis and results

The FASTQ files containing the sequenced amplicon reads were aligned to the human reference genome GRCh37-lite using bowtie2 v2.0.2 [21]. The minimum base quality was set to 10, and the minimum mapping quality to 20. For each position in an amplicon region, the numbers of reads corresponding to the predicted variant allele and the reference allele were extracted. GATK v3.1-1 [22–24] was used to call variants within each amplicon. The sequencing error rate for each case was computed as the average variant allele frequency for within-amplicon positions in tumour/normal samples. A low-coverage threshold of 50 reads was applied, and the Binomial exact test was used to infer the presence/absence of the target, as previously described [25, 26]. P-values were adjusted using the Benjamini-Hochberg procedure, with a false discovery rate of 0.001. Following this, variant calls in the patient tumor samples were compared to the matched normal and each position was assigned a mutation status as follows:

• no coverage, if either the normal or tumor samples had no coverage

• low coverage, if either the normal or tumor sample had low coverage, i.e., coverage ≤ 50

• wildtype, if both the normal and tumor samples did not have a variant (called ’absent’)

• somatic, if a variant was absent in the normal sample and was present in the tumor with an allelic frequency >0.05; or if the germline was present with an allelic frequency <0.05 and the variant in the tumor was present with a high allelic frequency

• probable somatic, if normal had low coverage but with 0 allelic frequency while the variantin the tumor was present with high allelic frequency >0.05

• germline, if the germline was present and the germline allelic frequency was >0.05

• unknown otherwise

The positions annotated as 'somatic' or 'probable somatic' were considered as being validated. For each of the 42 cases passing QC and PCR, the validation rate was computed as the number of validated somatic SNVs divided by the total number of SNVs for which there was sufficient coverage to determine the mutation status. Overall, the average validation rate was 94% per case (Supplementary Table 2).
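The per-case validation rate can be written as a short Python function over the status labels listed above. One assumption is made explicit: positions labelled 'no coverage' or 'low coverage' are treated as lacking the coverage needed to determine a status and are excluded from the denominator. The function name is hypothetical.

```python
VALIDATED = {"somatic", "probable somatic"}
NOT_ASSESSABLE = {"no coverage", "low coverage"}  # excluded by assumption


def validation_rate(statuses):
    """Validated SNVs divided by SNVs with an assessable mutation status.

    statuses: list of status labels for one case. Returns None when no
    position has enough coverage to be assessed.
    """
    assessable = [s for s in statuses if s not in NOT_ASSESSABLE]
    if not assessable:
        return None
    return sum(1 for s in assessable if s in VALIDATED) / len(assessable)
```

Averaging this value over the cases that passed QC gives the per-case mean reported above.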

8 Microsatellite instability testing

A panel of five repeat loci, including the mononucleotide repeat markers BAT26 (MSH2) and BAT25 (C-Kit) and the dinucleotide repeat markers D17S250 (BRCA2), D5S346 (APC) and D2S123 (MSH2), was used for MSI testing. Tumours were defined as MSI-high if two or more of the markers exhibited instability. Cases exhibiting one instability marker were subjected to another set of five markers. Tumours were defined as MSI-low if instability was confirmed in less than four of the total ten markers. Tumours with no instability markers were classified as MSI-negative. ENOC tumours defined as MSI-high were referred to as ENOC MSI cases. Other than the ultra-mutant case (DG1285) with a POLE mutation, all the MSI-low or MSI-negative ENOC tumours were referred to as ENOC MSS cases.
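The MSI calling rules can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, and because the text does not say how a case with four or more unstable markers out of ten is called, that branch returns "unspecified" rather than guessing.

```python
def classify_msi(unstable_first_panel, unstable_second_panel=None):
    """Classify a tumour from the number of unstable markers per panel.

    unstable_first_panel: unstable markers among the first five.
    unstable_second_panel: unstable markers among the second five, which is
    only tested when exactly one marker of the first panel is unstable.
    """
    if unstable_first_panel >= 2:
        return "MSI-high"
    if unstable_first_panel == 0:
        return "MSI-negative"
    if unstable_second_panel is None:
        return "second panel required"
    total_unstable = unstable_first_panel + unstable_second_panel
    if total_unstable < 4:
        return "MSI-low"
    return "unspecified"  # outcome not defined in the text
```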

9 Nanostring molecular subtypes

Total RNA was extracted from tumour specimens using standard protocols. For fresh frozen RNA, tissue was cryo-sectioned and then processed using the Qiazol-column method from the Qiagen miRNeasy kit according to the manufacturer's recommendations (Qiagen). For FFPE derived tissue, 3 × 10 µm scrolls of FFPE tissue were cut, deparaffinized using xylene and processed following the recommendations of the Qiagen miRNeasy FFPE kit (Qiagen) with an extended (45 min) proteinase K digest at 55 °C. All RNA was quantified on a NanoDrop spectrophotometer and considered of sufficient quality for analysis if the absorbance ratio (260/280 nm) was between 1.7 and 2.1.

NanoString gene expression profiling was conducted according to the manufacturer's recommendations using 500 ng total RNA for FFPE derived specimens and 100 ng total RNA for fresh/frozen derived specimens. 365 genes were selected from a cross-reference of literature derived biomarkers [27–32] and genes differentially expressed between molecular subtypes and/or histological subtypes in publicly available expression datasets [33–39]. Data were normalized with nSOLVER software (NanoString Technologies) using the geometric mean of ACTB, SDHA, PGK1, RPL19, and POLR1B counts. To allow for multiple clustering methods resulting in molecular subtypes similar to those reported previously, a weighted consensus of clustering methods was used, including the NMF and K-means methods discussed in the original defining studies of ovarian HGSC molecular subtypes [36, 37]. Each clustering method was run on the log base 2 normalized counts, iteratively 1000 times, to establish 4 groups (k = 4). We then merged the consensus class assignment of each clustering method to establish each final class and used marker genes from the TCGA [37] and Tothill et al. [36] studies to assign class names or their equivalents. We verified the concordance of this method with the original Tothill and TCGA studies by reducing each of the datasets to the same genes selected for NanoString and then applying our consensus method. Concordance was 81% (Tothill) and 76% (TCGA), with the resulting overall survival comparison being highly similar to the original reports.

10 Prediction of neoantigens in ENOC

10.1 HLA predictions. To predict human leukocyte antigen (HLA) genotypes, we ran OptiType [40] with default parameters on both tumour and normal aligned WGS BAM files for each ENOC case. The two highest-scoring four-digit predictions for the HLA-A locus were retained. Predictions were consistent between all tumour/normal pairs.

10.2 MHC-I binding prediction. A modified version of the pVAC-Seq [41] pipeline was used for MHC-I binding prediction. A list of 7864 processed nonsynonymous somatic SNVs, along with the corresponding wildtype and variant peptide sequences, was used as input for netMHC 3.4 [42–44] and netMHCpan 2.8 [45, 46]. Given the HLA allele predictions described above, 8- to 11-mer peptides were predicted using default settings. When available for a given HLA allele, the predictions from netMHC were used; otherwise, those from netMHCpan were used. Peptides with an IC50 value <500 nM and better affinity than the corresponding wild-type peptide were kept for further analysis.

10.3 Filtering. Further filtering was performed on the predicted epitopes according to the criteria described in [41]. Predicted epitopes were retained for downstream analysis if they corresponded to variants with normal coverage ≥ 5X, normal variant allele fraction (VAF) ≤ 2%, tumour DNA and RNA coverage ≥ 10X, tumour DNA and RNA VAF ≥ 10%, and gene FPKM >1. VAF values were determined from WGS BAM files with MutationSeq, and from RNA-seq BAM files with ASEReadCounter [47]. Cufflinks (version v2.1.1) [48–51] was run on each case to compute gene FPKM values.
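The binding and coverage criteria of Sections 10.2 and 10.3 combine into a single predicate per candidate epitope. The Python sketch below uses hypothetical dictionary keys, expresses VAFs as fractions, and assumes that "better affinity" means a lower IC50 than the wild-type peptide; it is not the pVAC-Seq implementation.

```python
def keep_epitope(e):
    """Return True if a candidate epitope passes the binding and read filters.

    e: dict with hypothetical keys; VAFs are fractions (0.10 means 10%).
    """
    # Section 10.2: IC50 < 500 nM and stronger binding than the wild-type.
    binds = e["ic50_nm"] < 500 and e["ic50_nm"] < e["wildtype_ic50_nm"]
    # Section 10.3: coverage, VAF and expression thresholds.
    supported = (
        e["normal_coverage"] >= 5
        and e["normal_vaf"] <= 0.02
        and e["tumour_dna_coverage"] >= 10
        and e["tumour_rna_coverage"] >= 10
        and e["tumour_dna_vaf"] >= 0.10
        and e["tumour_rna_vaf"] >= 0.10
        and e["gene_fpkm"] > 1
    )
    return binds and supported
```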

11 General statistical analysis

We applied shrinkage discriminant analysis to determine the feature ranking for each subgroup using the function sda.ranking with default settings in the R package sda v1.3.7. The over-representation of each histotype per subgroup was tested using a one-sided Fisher's exact test, performed by the R function fisher.test, with the Fisher exact p-values adjusted using the Benjamini-Hochberg method. A feature importance measure was computed for the subgroups of histotypes using the pRF function from the R package pRF version 1.2 with ntree=50 and n.perms=100, which estimates the statistical significance of the Decrease in Gini Coefficient metric of random forest feature importance. The significance of differences between features of different subgroups was tested using Student's t-test (two-tailed, confidence level = 0.95), with the p-values adjusted by Benjamini-Hochberg correction. The enrichment of molecular subtypes and BRCA status between the two HGSC subgroups was tested using the Chi-squared test, performed by the R function chisq.test. The differences in the distribution of the average LogR values per type of structural variant between the two HGSC subgroups were tested using the Mann-Whitney-Wilcoxon test (two-tailed), performed by the wilcox.test function from the R package stats (version 3.2.3). The Kaplan-Meier estimator and the log-rank test were computed, using the R package survival (version 2.38.3), to compare the survival outcomes between HGSC subgroups. The difference in the number of immunogenic epitopes generated in the ENOC MSI cases and MSS cases was tested using the Kruskal-Wallis test, performed by the R function kruskal.test.
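Several of the tests above use Benjamini-Hochberg adjusted p-values. For readers unfamiliar with the procedure, a minimal Python sketch of the step-up adjustment is shown below; it is written from the standard definition of the method and is not the R code used in the analysis.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Visit p-values from largest to smallest, keeping a running minimum so
    # that adjusted values are monotone in the original p-values.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

A hypothesis is then called significant at false discovery rate q when its adjusted p-value is at most q.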

1. Ha, G. et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res 24, 1881–93 (2014).

2. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data. Bioinformatics 28, 167–175 (2012).

3. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics 28, 1811–1817 (2012).

4. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).

5. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

6. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–21 (2013).

7. Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology 15, 1 (2014).

8. Chiang, C. et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nature Methods (2015).

9. McPherson, A. et al. nFuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res 22, 2250–61 (2012).

10. Gunawardana, J. et al. Recurrent somatic mutations of PTPN1 in primary mediastinal B cell lymphoma and Hodgkin lymphoma. Nature Genetics 46, 329–335 (2014).

11. McPherson, A. et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 7, e1001138 (2011).

12. Hormozdiari, F. et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, i350–i357 (2010).

13. Patch, A.-M. et al. Whole-genome characterization of chemoresistant ovarian cancer.

14. Abo, R. P. et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Research 43, e19–e19 (2015).

15. Letourneau, I. J. et al. Derivation and characterization of matched cell lines from primary and recurrent serous ovarian cancer. BMC Cancer 12, 1 (2012).

16. Fleury, H. et al. Cumulative defects in DNA repair pathways drive the PARP inhibitor response in high-grade serous epithelial ovarian cancer cell lines. Oncotarget 5 (2016).

17. Fleury, H. et al. Novel high-grade serous epithelial ovarian cancer cell lines that reflect the molecular diversity of both the sporadic and hereditary disease. Genes & cancer 6, 378 (2015).

18. Ha, G. et al. Integrative analysis of genome-wide loss of heterozygosity and mono-allelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research (2012).

19. Eirew, P. et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature (2014).

20. McPherson, A. et al. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nature genetics (2016).

21. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature methods 9, 357–359 (2012).

22. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20, 1297–1303 (2010).

23. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491–498 (2011).

24. Auwera, G. A. et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current protocols in bioinformatics 11–10 (2013).

25. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395–399 (2012).

26. Shah, S. et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 461, 809–813 (2009).

27. Kalloger, S. E. et al. Calculator for ovarian carcinoma subtype prediction. Modern Pathology 24, 512–521 (2011).

28. Anglesio, M. S., Carey, M. S., Kobel, M., MacKay, H. & Huntsman, D. G. Clear cell carcinoma of the ovary: a report from the first Ovarian Clear Cell Symposium, June 24th, 2010. Gynecologic oncology 121, 407–415 (2011).

29. Madore, J. et al. Characterization of the molecular differences between ovarian endometrioid carcinoma and ovarian serous carcinoma. The Journal of pathology 220, 392–400 (2010).

30. Kobel, M. et al. IGF2BP3 (IMP3) expression is a marker of unfavorable prognosis in ovarian carcinoma of clear cell subtype. Modern Pathology 22, 469–475 (2009).

31. Kobel, M. et al. A limited panel of immunomarkers can reliably distinguish between clear cell and high-grade serous carcinoma of the ovary. The American journal of surgical pathology 33, 14–21 (2009).

32. Kobel, M. et al. Ovarian carcinoma subtypes are different diseases: implications for biomarker studies. PLoS Med 5, e232 (2008).

33. Helland, A. et al. Deregulation of MYCN, LIN28B and LET7 in a molecular subtype of aggressive high-grade serous ovarian cancers. PloS one 6, e18064 (2011).

34. Anglesio, M. S. et al. IL6-STAT3-HIF signaling and therapeutic response to the angiogenesis inhibitor sunitinib in ovarian clear cell cancer. Clinical cancer research 17, 2538–2548 (2011).

35. Anglesio, M. S. et al. Mutation of ERBB2 provides a novel alternative mechanism for the ubiquitous activation of RAS-MAPK in ovarian serous low malignant potential tumors. Molecular Cancer Research 6, 1678–1690 (2008).

36. Tothill, R. W. et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research 14, 5198–5208 (2008).

37. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–15 (2011).

38. Ramakrishna, M. et al. Identification of candidate growth promoting genes in ovarian cancer through integrated copy number and expression analysis. PloS one 5, e9983 (2010).

39. Hendrix, N. D. et al. Fibroblast growth factor 9 has oncogenic activity and is a downstream target of Wnt signaling in ovarian endometrioid adenocarcinomas. Cancer research 66, 1354–1362 (2006).

40. Szolek, A. et al. OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics 30, 3310–3316 (2014).

41. Hundal, J. et al. pVAC-Seq: A genome-guided in silico approach to identifying tumor neoantigens. Genome medicine 8, 1 (2016).

42. Nielsen, M. et al. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Science 12, 1007–1017 (2003).

43. Lundegaard, C. et al. NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8–11. Nucleic acids research 36, W509–W512 (2008).

44. Lundegaard, C., Lund, O. & Nielsen, M. Accurate approximation method for prediction of class I MHC affinities for peptides of length 8, 10 and 11 using prediction tools trained on 9mers. Bioinformatics 24, 1397–1398 (2008).

45. Hoof, I., Peters, B., Buus, S. & Nielsen, M. NetMHCpan: MHC class I binding prediction beyond HLA-A and -B. Tissue Antigens (2008).

46. Nielsen, M. et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PloS one 2, e796 (2007).

47. Castel, S. E., Levy-Moonshine, A., Mohammadi, P., Banks, E. & Lappalainen, T. Tools and best practices for data processing in allelic expression analysis. Genome biology 16, 1 (2015).

48. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology 28, 511–515 (2010).

49. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-seq expression estimates by correcting for fragment bias. Genome biology 12, 1 (2011).

50. Roberts, A., Pimentel, H., Trapnell, C. & Pachter, L. Identification of novel transcripts in annotated genomes using RNA-seq. Bioinformatics 27, 2325–2329 (2011).

51. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature biotechnology 31, 46–53 (2013).
