Identifying Susceptibility Genes for Familial Pancreatic Cancer Using Novel High-Resolution Genome

Interrogation Platforms

Wigdan Ridha Al-Sukhni

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Institute of Medical Science University of Toronto

Abstract

Familial Pancreatic Cancer (FPC) is a cancer syndrome characterized by clustering of pancreatic cancer in

families, but most FPC cases do not have a known genetic etiology. Understanding genetic predisposition

to pancreatic cancer is important for improving screening as well as treatment. The central aim of this

thesis is to identify candidate susceptibility genes for FPC, and I used three approaches of increasing

resolution. First, based on a candidate-gene approach, I hypothesized that BRCA1 is inactivated by loss-

of-heterozygosity in pancreatic adenocarcinoma of germline mutation carriers. I demonstrated that 5/7

pancreatic tumors from BRCA1-mutation carriers show LOH, compared to only 1/9 sporadic tumors,

suggesting that BRCA1 inactivation is involved in tumorigenesis in germline mutation carriers. Second, I

hypothesized that the germline genomes of FPC subjects differ in copy-number profile from healthy

genomes, and that regions affected by rare deletions or duplications in FPC subjects overlap candidate

tumor-suppressors or oncogenes. I found no significant difference in the global copy-number profile of

FPC and control genomes, but I identified 93 copy-number variable genomic regions unique to FPC

subjects, overlapping 88 genes of which several have functional roles in cancer development. I

investigated one duplication to sequence the breakpoints, but I found that this duplication did not

segregate with disease in the affected family. Third, I hypothesized that in a family with multiple

pancreatic cancer patients, genes containing rare variants shared by the affected members constitute

susceptibility genes. Using next-generation sequencing to capture most bases in coding regions of the

genome, I interrogated the germline exome of three relatives who died of pancreatic cancer and a relative

who is healthy at advanced age. I identified a short-list of nine candidate genes with unreported

mutations shared by the three affected relatives and absent in the unaffected relative, of which a few had

functional relevance to tumorigenesis. I performed Sanger sequencing to screen an unrelated cohort of

approximately 70 FPC patients for mutations in the top two candidate genes, but I found no additional

rare variants in those genes. In conclusion, I present a list of candidate FPC susceptibility genes for

further validation and investigation in future studies.

Acknowledgments

My research would not have been possible without the contribution of the following individuals:

A. Borgida, S. Holter, H. Rothenmund, and K. Smith at Ontario Pancreas Cancer Study and Ontario

Familial Gastrointestinal Cancer Registry for patient recruitment and selection. T. Selander of Samuel

Lunenfeld Research Institute Biospecimen Repository for DNA extraction. S. Joe (Gallinger Lab) for

script-writing; N. Zwingerman, A. Gropper, and S. Moore (Gallinger Lab) for assistance with qPCR; A.

Lionel (Scherer Lab) for computational analysis of Affy6.0 data on Birdsuite and iPattern; Q. Trinh

(McPherson Lab) for computational analysis of exome data; R. Grant (Gallinger Lab) for assistance with

exome data interpretation; H. Kim and T. McPherson (Gallinger Lab) for assistance with PCR and Sanger

validation of exome variants. K. Hay, J. Keating, and S. Levitt (Gallinger Lab) for administrative support;

J. McPherson (Ontario Institute for Cancer Research) for exome sequencing data; and C. Marshall, D.

Pinto, D. Merico (The Centre for Applied Genomics), A. Shlien and D. Malkin (Malkin Lab) for their

advice on my data analysis and manuscript preparations.

My sincere gratitude to the Pancreatic Cancer Genetic Epidemiology Consortium (PACGENE) (PI - G

Petersen, Mayo) for being an invaluable source of DNA samples and insight into pancreatic cancer

genetics.

I am very grateful to my Program Advisory Committee (Gary Bader, Steven Narod, Stephen Scherer) for

their insightful feedback and advice throughout the five years of my PhD. In particular, their thoughtful

review of my manuscripts and thesis was most helpful and deeply appreciated.

To my supervisor, Steve Gallinger – I cannot adequately thank you in this crowded page for all that your

mentorship has meant to me since I first met you seven years ago. You pushed me when I needed

pushing and supported me when I was afraid of falling. You listened patiently to my complaints. You

cared about my success. I will always appreciate your open-mindedness, your integrity, and your

compassion. I feel most fortunate that I am able to call you my mentor and friend. Thank you for

everything.

A special thank you to M. Crump for helping me maneuver around some unexpected bumps in the road of

my PhD, and for exemplifying the compassionate clinician.

I dedicate this thesis to my beautiful family:

To Mama and Baba – Your love for me has been the greatest gift and blessing in my life, it is the reason

for who I am today. Thank you for supporting my aspirations even when you did not always understand

where they were taking me.

To Eisar, Mayce, Mohammed, and Bann – Thank you for putting up with me in my worst days… I am

proud of you all.

To my aunts, uncles, and cousins in Iraq and elsewhere – Thank you for keeping me alive in your hearts

despite the long years and oceans separating us. You inspire me.

I am grateful for the financial support received from the CIHR Vanier Doctoral Research Award,

Lustgarten grant, Invest-in-Research grant from Princess Margaret Hospital, Canadian Society for

Surgical Oncology grant, Johnson & Johnson research award, American HepatoPancreaticoBiliary

Association grant, and the Department of Surgery at the University of Toronto.

Table of Contents

Abstract

Acknowledgments

List of Tables

List of Figures

List of Appendices

Abbreviations

Chapter 1 Literature Review

1. Pancreatic Cancer

2. Copy Number Variation

3. Whole-Exome Sequencing

Chapter 2 Loss of Heterozygosity at BRCA1 Locus in Pancreatic Adenocarcinoma

1. Abstract

2. Introduction

3. Materials & Methods

4. Results

5. Discussion

Chapter 3 Germline Genomic Copy Number Variation in Familial Pancreatic Cancer

1. Abstract

2. Introduction

3. Materials & Methods

4. Results

5. Discussion

Chapter 4 Exome Sequencing in a Familial Pancreatic Cancer Kindred

1. Abstract

2. Introduction

3. Materials & Methods

4. Results

5. Discussion

Chapter 5 General Discussion, Conclusions, and Future Directions

References

Appendices

List of Tables

Table 1 Studies estimating risk of pancreatic adenocarcinoma in relatives of affected patients

Table 2 Summary of published studies reporting germline genomic copy-number variation in non-

disease samples

Table 3 Studies using exome-sequencing to identify genetic cause of disease

Table 4 Characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients

Table 5 Pedigree summary for BRCA1 mutation carriers

Table 6 LOH results for BRCA1 mutation carriers and sporadic pancreatic cancer cases

Table 7 Proportion of high-confidence losses in cases and controls

Table 8 Proportion of high-confidence gains in cases and controls

Table 9 CNVs called by each of Birdsuite and iPattern in 36 samples on Affymetrix 6.0 array

Table 10 High confidence CNV profile of cases vs. controls (excluding EBV-derived samples and

excluding controls with data from only one chip)

Table 11 FPC specific CNVs

Table 12 Genes whose coding regions are affected by FPC-specific CNVs

Table 13 Summary of raw sequence data from Illumina GAII for each subject

Table 14 Sanger validation data for selected SNVs in each exome subject

Table 15 Sanger validation data for selected indels in each exome subject

Table 16 Number of variants identified in each exome subject

Table 17 Genes containing variants identified by filtration model #1, 2, 3, and/or 4

Table 18 Additional candidate variants in untranslated regions shared by exome subjects

List of Figures

Figure 1 Location of BRCA1 microsatellite markers on chromosome 17

Figure 2 Sample electropherogram of microsatellite marker fragment analysis

Figure 3 Three representative matched-pair electropherograms for microsatellite LOH

Figure 4 Representative sequencing result for an individual with 5382insC germline BRCA1 mutation

Figure 5 Analysis of 500K arrays in FPC cases and controls

Figure 6 Criteria for merging CNVs

Figure 7 CNV prioritization plan

Figure 8 Gains and losses identified in FPC cases by each algorithm/chip

Figure 9 Gains and losses identified in controls by each algorithm/chip

Figure 10 Duplications overlapping TGFBR3 gene

Figure 11 Pedigree of case ID-203, indicating results of qPCR testing for duplication G_97

Figure 12 Fine-mapping the breakpoint of duplication overlapping TGFBR3 using qPCR walk-along

method

Figure 13 PCR gel demonstrating amplification of ~1.5-2kb fragment containing G_97 duplication

breakpoint in case Id_203

Figure 14 G_97 duplication breakpoint mapping by Sanger sequencing

Figure 15 PCR gel illustrating amplification of test regions and duplication breakpoint in case Id-203 and

affected sister

Figure 16 FPC-specific losses and gains on autosomal chromosomes

Figure 17 Pedigree of FPC kindred investigated by exome sequencing

Figure 18 Average coverage of bases in target region of exome per subject

Figure 19 Read-depth per base in target region of exome in each subject

Figure 20 Genome-wide distribution of all SNVs identified in each exome subject

Figure 21 Genome-wide distribution of SNVs excluding synonymous variants in each exome subject

Figure 22 Genome-wide distribution of SNVs not reported in dbSNP131 in each exome subject

List of Appendices

Table S1 Primers for BRCA1 microsatellite markers

Table S2 BRCA1 mutations sequencing primers

Table S3 FPC cases in CNV study

Table S4 Controls (OFCCR and FGICR) in CNV study

Table S5 Primers for qPCR validation of CNVs

Table S6 Primers for qPCR breakpoint mapping of TGFBR3-transecting duplication

Table S7 High- and low-confidence losses on Affy500K array in FPC cases

Table S8 High- and low-confidence gains on Affy500K array in FPC cases

Table S9 High- and low-confidence losses on Affy500K array in controls

Table S10 High- and low-confidence gains on Affy500K array in controls

Table S11 High-confidence CNVs on Affy 6.0 array in FPC cases

Table S12 High-confidence CNVs on Affy 6.0 array in controls

Figure S1 qPCR of region D_180

Figure S5 qPCR of region D_234 (primer A)

Figure S6 qPCR of region D_234 (primer B)

Figure S10 qPCR of region D_30 & D_36

Figure S21 qPCR of region G_225

Figure S23 qPCR of region G_365 (primer A)

Figure S24 qPCR of region G_365 (primer B)

Figure S28 qPCR of region G_603/604

Figure S31 Region: G_97 (primer A) – ID_27

Figure S32 Region: G_97 (primer B) – ID_27

Figure S33 Region: G_97 (primer A) – ID_203 and family members

Figure S34 Region: G_97 (primer A) – ID_203’s family members

Figure S35 Region: G_97 (primer A) – ID_203 and family members

Figure S38 Region: G_97 (primer B) – ID_203 and family members

Figure S39 “T_Out_1” – Fine-mapping G_97 breakpoint in Id_203

Figure S43 “O_In_2” – Fine-mapping G_97 breakpoint in Id_203

Figure S44 “O_Out_1” – Fine-mapping G_97 breakpoint in Id_203

Figure S45 “O_Out_5” – Fine-mapping G_97 breakpoint in Id_203

Abbreviations

AD – autosomal dominant

AGTC - Analytical Genetics Technology Centre

AJ – Ashkenazi Jewish

AML – acute myeloid leukemia

AR – autosomal recessive

BAC – bacterial artificial chromosome

BC – breast cancer

CCDS - Collaborative Consensus Coding Sequence

CGH – comparative genomic hybridization

ChIP-seq - chromatin immunoprecipitation sequencing

CIN – chromosomal instability

CNV – copy number variation

Conc – concordant

COSMIC - Catalogue of Somatic Mutations in Cancer

CRC – colorectal cancer

CSI – chromosomal structure instability

ddNTPs - dideoxynucleotide triphosphates

del - deletion

DGV – Database of Genomic Variants

Disc - discordant

EBV – Epstein-Barr virus

FAMMM - familial atypical multiple mole melanoma

FDR – first degree relative

FFPE – formalin-fixed paraffin-embedded

FGICR – familial gastrointestinal cancer registry

FISH – fluorescence in-situ hybridization

FN – false negative

FoSTeS - fork stalling and template switching

FP – false positive

FPC – familial pancreatic cancer

GB – gallbladder

GDB – human genome database

GST – glutathione-S-transferase

GTC – genotyping console

GWAS – genome wide association study

HBOC - hereditary breast and ovarian cancer

Het - heterozygous

HMM – hidden Markov model

Homo - homozygous

HP – hereditary pancreatitis

HR – hazard ratio

ICGC - International Cancer Genome Consortium

IHGSC - International Human Genome Sequencing Consortium

Ins - insertion

IPMN – intraductal papillary mucinous neoplasm

LCL – lymphoblastoid cell lines

LD – linkage disequilibrium

LOD – logarithm of odds

LOH – loss of heterozygosity

MAF - minor allele frequency

MCN – mucinous cystic neoplasm

MEI – mobile element insertion

MLPA – multiplex ligation-dependent probe amplification

MMBIR - microhomology-mediated break-induced replication

MSKCC - Memorial Sloan Kettering Cancer Centre

NAHR – nonallelic homologous recombination

NBPF – neuroblastoma breakpoint family

NCBI – National Center for Biotechnology Information

NFPTR - National Familial Pancreas Tumor Registry

NHEJ – nonhomologous end joining

NIH – National Institutes of Health

NGS – next generation sequencing

NK – natural killer cell

nsSNV – nonsynonymous single nucleotide variants

OC – ovarian cancer

OFCCR - Ontario Familial Colon Cancer Registry

OHI – Ottawa Heart Institute

OMIM - Online Mendelian Inheritance in Man

OPCS – Ontario Pancreas Cancer Study

OR – odds ratio

OR genes – olfactory receptor genes

QC – quality control

PACGENE - Pancreatic Cancer Genetic Epidemiology Consortium

PanIN – pancreatic intraepithelial neoplasia

PARP – poly-(ADP-ribose)-polymerase

PC – pancreatic cancer

PCR – polymerase chain reaction

PGFE – pulsed gel field electrophoresis

PJS - Peutz-Jeghers syndrome

qPCR – quantitative polymerase chain reaction

qRT-PCR – quantitative reverse-transcription polymerase chain reaction

ROMA – representational oligonucleotide microarray analysis

RR – relative risk

SDR – second degree relative

SEER – surveillance, epidemiology and end results

SIR – standardized incidence ratio

SNP – single nucleotide polymorphism

SNV – single nucleotide variants

SPC – sporadic pancreatic cancer

TCAG – The Centre for Applied Genomics

TN – true negative

TP – true positive

UCSC - University of California, Santa Cruz

UPD – uniparental disomy

UTR – untranslated region

VNTR - variable number tandem repeat

WT - wildtype

Chapter 1 - Literature Review

1. Pancreatic Cancer

1.1 Pathology and epidemiology

Pancreatic ductal adenocarcinoma (otherwise known as pancreatic cancer) is a highly lethal invasive

epithelial neoplasm with ductal differentiation, obscuring the lobular pattern of normal pancreatic

parenchyma. Pancreatic cancer grossly appears as a firm highly sclerotic mass with poorly circumscribed

borders. Microscopically, infiltrating gland-forming neoplastic cells are commonly surrounded by non-

neoplastic stroma in a characteristically intense desmoplastic reaction which often results in low tumor

cellularity.1

Pancreatic cancer is the fourth leading cause of cancer death in North America. The estimated numbers of

incident cases and deaths due to pancreatic cancer in the US in 2010 were 43,140 and 36,800,

respectively.2 In Canada, the estimated numbers of new cases and deaths from pancreatic cancer in 2011

were 4,100 and 3,800, respectively.3 Age-adjusted incidence in the U.S. based on SEER (Surveillance,

Epidemiology and End Results) data from 2004 to 2008 was 12 per 100,000 men and women; total

lifetime risk was 1.45% (approximately 0.5% by age 70).2
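
The lethality implied by these estimates can be made concrete by dividing estimated deaths by estimated new cases for the same year. The following is a minimal sketch in Python using only the figures quoted above; the function name is hypothetical, and the same-year ratio is only a rough proxy for case fatality, not a survival analysis.

```python
def mortality_to_incidence(deaths: int, new_cases: int) -> float:
    """Ratio of estimated deaths to estimated new cases in the same year."""
    return deaths / new_cases

# Estimates quoted in the text (US 2010, Canada 2011).
us_2010 = mortality_to_incidence(deaths=36_800, new_cases=43_140)
canada_2011 = mortality_to_incidence(deaths=3_800, new_cases=4_100)

print(f"US 2010 deaths per new case: {us_2010:.1%}")
print(f"Canada 2011 deaths per new case: {canada_2011:.1%}")
```

Both ratios exceed 0.85, in keeping with the description of pancreatic cancer as highly lethal.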

Due to the retroperitoneal location of the pancreas and lack of specific symptoms of early pancreatic

cancer, most patients present with advanced disease that precludes surgical resection. For those patients,

the only treatment option is palliation, and despite many trials of various chemotherapeutic and

molecularly targeted drugs and/or radiotherapy, median survival is 9-11 months.4 Of the patients who do

undergo surgical resection of localized pancreatic cancer, 80-85% ultimately experience local and/or

systemic recurrence, resulting in 5-year survival of <20% after resection and overall 5-year survival for all

pancreatic cancer patients of <5%.5

1.2 Molecular biology

Three distinct pre-invasive lesions have been identified as precursors for pancreatic adenocarcinoma:

pancreatic intraepithelial neoplasia (PanIN), intraductal papillary mucinous neoplasms (IPMNs), and

mucinous cystic neoplasms (MCNs). Each of these lesions has been associated with increased risk of

cancer and the arising cancer has been shown to develop from cells within the precursor. PanINs are

microscopic lesions in the smaller pancreatic ducts, and they are associated with a progressive spectrum

of cytologic and architectural atypia (corresponding to the classification of PanIN-1A, PanIN-1B,

PanIN-2, and PanIN-3).6 Mouse models of pancreatic cancer develop very similar lesions to human PanINs, and

molecular analyses have demonstrated that PanINs sequentially accumulate genetic alterations found in

invasive cancer, suggesting an “adenoma-to-carcinoma” progressive model akin to that of colorectal

cancer.7

However, the natural history of PanINs is not yet clear: while it is evident that advanced stage PanIN-3

lesions are tightly associated with cancer8, early-stage PanIN-1 lesions are quite common and are most

prevalent in older subjects.9 Moreover, PanINs are frequently multi-focal, and although endoscopic

ultrasound can detect parenchymal changes associated with PanINs, it does so at less than 100%

specificity.10,11 Therefore, deciding if and when to resect pancreata with suspected PanIN lesions is

contentious.

IPMNs are grossly visible cystic lesions with direct communication to the main or branch

pancreatic ducts. The mutational spectrum of IPMNs differs somewhat from that of PanINs and invasive

adenocarcinoma, suggesting an alternate path of development.12 Main-duct IPMNs are associated with up

to 40% risk of malignant transformation and usually are resected, especially if they are growing and/or

larger than 3 cm, demonstrate mural nodularity on imaging, or are associated with main duct dilation.13

However, branch-duct IPMNs are more challenging to manage as their natural history is less clear. They

are associated with up to 15% risk of malignancy, and most authorities recommend resection if the

branch-duct IPMN exceeds 3 cm in size or has mural nodules or other suggestion of malignancy, but it is

unclear what to do with smaller lesions since most branch-duct IPMNs remain unchanged over long-term

follow-up.13,14 Since IPMNs are often multifocal, patients who undergo subtotal pancreatic resections

would need to continue surveillance for potential cancer recurrence.

MCNs are rare, mucin-producing cystic lesions that do not communicate directly with the pancreatic ducts

and have a distinctive ovarian-type stromal epithelium.15 They account for only approximately 1% of

pancreatic cancers, but if detected they should always be resected: they carry a 40% chance of malignancy,

and the cure rate is 100% if the MCN is resected before invasive carcinoma develops but only 50-60% if

cancer is present at the time of resection.15

Molecular analyses have identified a variety of genetic, epigenetic, and genomic alterations in pancreatic

adenocarcinoma. The most common genetic mutation is Kras2 activation, present in 90-95% of cases; it

also appears to be one of the earliest changes that promote tumor development, as evidenced by its

presence in 36% of PanIN-1A and the fact that mice engineered to express the activated KrasG12D mutant

develop PanIN-like lesions and eventually invasive pancreatic carcinoma.7 Kras2 is a well-established

proto-oncogene, part of the RAS family of GTP-binding proteins, which are involved in proliferation, cell

survival, cytoskeletal modeling, motility, and other cellular functions.16 In pancreatic cancer, activating

mutations primarily occurring in codon 12 cause constitutive activation of the intracellular signal

transduction function of the expressed protein. This constitutive signaling appears to be necessary for

maintenance of pancreatic cancer, in addition to initiating its development.17 Other oncogenes activated

in pancreatic cancer include BRAF,18 AKT2,19 cMYC,17 and EGFR.17 Moreover, constitutive activation of

the Hedgehog developmental signaling pathway has also been implicated in the development of

pancreatic cancer. The mammalian Hedgehog signaling pathway appears to play a critical role in

developmental patterning and mature tissue homeostasis, and it has been observed to be dysregulated in

many cancers, including pancreas.20 In fact, Hedgehog signaling activation appears to be one of the

initiating events in pancreatic cancer, as evidenced by ligand overexpression in PanINs21 and IPMNs22

and the fact that Hedgehog signaling cooperates with KrasG12D mutant in mouse models to promote

development of PanINs.23 Hedgehog signaling also appears to be important in regulating metastases.24

While the KrasG12D mutation is necessary for development of pancreatic cancer in mice, latency to tumor

development is significantly shortened if additional inactivating mutations of the tumor suppressor genes

TP53, p16, or BRCA2 are added.25 All three tumor suppressor genes, along with others, have been

found to be inactivated in pancreatic adenocarcinoma. Inactivating mutations (homozygous deletions, intragenic

mutations plus loss of second allele, or epigenetic silencing) of p16 are found in approximately 90% of

tumors.26 This gene is a well-known tumor suppressor that codes for a cyclin-dependent kinase inhibitor,

which blocks progression through the G1-S checkpoint of the cell cycle. TP53, the “guardian of the

genome”, is involved in maintenance of genomic stability, apoptosis, and activation of DNA repair

(among its many functions), and is inactivated in 50-75% of pancreatic cancers (almost always via

intragenic mutations coupled with loss of the second allele).27 Another tumor suppressor gene commonly

inactivated in pancreatic cancer (in about 55% of cases) is SMAD4, a critical signaling intermediate in the

transforming growth factor (TGF)-beta pathway; its loss provides a selective growth advantage to affected cells.28

Patients who undergo resection and whose pancreatic cancer has loss of SMAD4 function have a worse

prognosis than age- and stage-matched patients without SMAD4 mutations.29 Other tumor suppressor

genes inactivated at a lower frequency (5-10%) include BRCA2, STK11, TGFBR1, and TGFBR2.26 Of

note, p16 inactivation appears to be a relatively early event in tumor development, as it is detectable in

PanIN-2 lesions, whereas TP53, SMAD4, and BRCA2 mutations are not seen until the PanIN-3 stage.7

Genomic instability is a hallmark of most solid tumors, including pancreatic cancer. The types of

genomic rearrangements commonly identified in pancreatic adenocarcinoma are reviewed elsewhere (see

“Literature Review - CNVs and Cancer”). Telomere shortening, which predisposes to end-to-end

chromosomal fusions and breakage during anaphase thus generating amplifications and deletions in the

daughter cell genomes, is a very frequent and early event in pancreatic cancer development, demonstrated

in over 90% of the earliest stage PanINs.30 It is believed that the inactivation of TP53 allows the survival

of the pre-invasive cells which develop a heavy burden of genomic instability as a result of telomere

attrition, permitting them to progress through the activation of oncogenes and inactivation of tumor

suppressor genes to invasive status.31 It should be noted that most invasive pancreatic cancers appear to

reactivate telomerase, mitigating the degree of genomic instability and helping to stabilize the neoplastic

cells.32

In addition to genetic and genomic alterations, epigenetic silencing of tumor suppressor genes (via

methylation of CpG islands in the 5’ regulatory regions) is frequently observed in pancreatic

adenocarcinoma.33 Conversely, hypomethylation of candidate oncogenes (which are overexpressed in

pancreatic cancer) has also been observed.34 MicroRNAs have also been implicated in pancreatic cancer

tumorigenesis, both as potential tumor suppressors and as oncogenes.35 Furthermore, inflammation and

the tumor micro-environment appear to have a role in pancreatic tumorigenesis.36

Jones et al.37 examined the genomic profile of pancreatic adenocarcinoma in depth by sequencing the

coding regions of 20,661 genes in 24 pancreatic adenocarcinomas as well as hybridizing tumor DNA to a

high-resolution single nucleotide polymorphism (SNP) array to detect genomic rearrangements. The

authors identified 1,562 somatic mutations in 1,007 genes, of which 74.5% were missense,

nonsense, small insertions/deletions, or splice-site/untranslated region (UTR) changes. The average

number of mutated genes per tumor (48) was much lower than the number of mutations discovered in breast

cancer (101) or colorectal cancer (77) in previous studies, and one potential explanation given is that the

cells which initiate pancreatic tumorigenesis are likely to have undergone fewer divisions than tumor

initiating cells in breast or colorectal cancer. Gene-set analyses of the genes mutated in pancreatic cancer

identified 69 gene sets that were altered in most pancreatic tumors, of which 31 gene sets could be grouped

into 12 core signaling pathways with discernible functional relevance to neoplasia, which were affected in

67-100% of the pancreatic tumors. Notably, although the 12 core pathways were altered in almost all

cancers, the specific genes mutated in each tumor differed significantly across patients, aside

from the few frequently mutated genes discussed above.

These results emphasize the importance of the pathway approach to understanding tumorigenesis, and

suggest that successful anti-cancer therapy may depend more on targeting pathways rather than individual

genes. A subsequent study applied massively parallel sequencing to sequence the entire genome of

metastases from seven of the subjects included in the previous study.38 On average, two-thirds of

mutations detected in each metastasis were also present in the paired primary tumor and were called

“founders”, while the remaining mutations that were only identified in metastases were termed

“progressors”. Subclones that led to the development of metastases were identified within each primary

tumor. The authors devised a mathematical model for calculating the timing of different stages of

pancreatic cancer development and estimated that it takes an average of 11.7 years from the initiation of

tumorigenesis until the generation of the cell that develops into the parental clone; another 6.8 years were

estimated for the evolution into subclones with metastatic capacity, and 2.7 years until the death of the

patient. It should be noted that most of the tumors in this study were not from familial cases, and tumors

with highly-penetrant germline predisposing mutations may follow a different evolutionary timeline and

pathway. Nonetheless, it appears that a significant window of opportunity for screening and curative

intervention exists, if it is possible to identify tumors before metastatic subclones develop.
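
The arithmetic behind this window can be sketched by accumulating the three average stage durations quoted above. This is a minimal illustration, assuming the stages are strictly sequential and that the averages can simply be added; it is not the mathematical model used in the cited study, and the variable and function names are hypothetical.

```python
# Average stage durations in years, as quoted in the text above.
STAGES = [
    ("initiation to parental clone", 11.7),
    ("parental clone to metastatic subclone", 6.8),
    ("metastatic subclone to death", 2.7),
]

def cumulative_timeline(stages):
    """Return (stage name, years elapsed since initiation at end of stage)."""
    elapsed = 0.0
    timeline = []
    for name, duration in stages:
        elapsed += duration
        timeline.append((name, round(elapsed, 1)))
    return timeline

timeline = cumulative_timeline(STAGES)
total_years = timeline[-1][1]        # initiation to death
screening_window = timeline[1][1]    # initiation to first metastatic subclone

for name, elapsed in timeline:
    print(f"{name}: {elapsed:.1f} years since initiation")
```

Under these assumptions, roughly 18.5 of the estimated 21.2 years elapse before subclones with metastatic capacity arise.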

1.3 Risk factors

The list of putative risk factors for pancreatic cancer is long, with wide variability in degree of risk

conferred and strength of evidence for the association. Age is strongly correlated with increased risk of

pancreatic cancer, with the median age for diagnosis at 72 years and more than two-thirds of cases

occurring after age 65.2 Race is also a factor, with African-Americans having substantially higher rates of

pancreatic cancer than white, Asian, or Hispanic Americans.2 Perhaps the strongest association of a risk

factor exists for tobacco use, as numerous studies have demonstrated that smoking can double lifetime

risk and the estimated population attributable risk is 25%.39 Other risk factors with low-to-moderate

contribution to pancreatic cancer include alcohol consumption40, obesity40, occupational exposure to

certain chemicals41, long-standing diabetes mellitus42, and Helicobacter pylori infection43. However, only

smoking has been consistently associated with pancreatic cancer. Chronic pancreatitis is associated with

up to 13-fold increased risk of pancreatic cancer, and even higher risk in patients with hereditary

pancreatitis, caused by genetic mutations (e.g. PRSS1, SPINK1).44 Possible protective factors include

allergies45, Vitamin D intake46 (although this is contentious47), and consumption of citrus fruit48 and

“Mediterranean diet”49.
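
As a rough illustration of how a doubling of risk relates to a population attributable risk of about 25%, the sketch below applies Levin's formula, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the prevalence of the exposure. This is an assumption made for illustration only: the cited study may have estimated attributable risk by a different method, the relative risk of 2 is a simplification of "can double lifetime risk", and the function names are hypothetical.

```python
def attributable_fraction(prevalence: float, relative_risk: float) -> float:
    """Levin's population attributable fraction for a single exposure."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def implied_prevalence(paf: float, relative_risk: float) -> float:
    """Invert Levin's formula to get the exposure prevalence implied by a PAF."""
    return paf / ((1.0 - paf) * (relative_risk - 1.0))


# Figures quoted in the text: roughly doubled risk (RR = 2) and a PAF of 25%.
prevalence = implied_prevalence(0.25, 2.0)
print(f"implied exposure prevalence: {prevalence:.3f}")
print(f"round-trip PAF: {attributable_fraction(prevalence, 2.0):.3f}")
```

Under these assumptions, the two quoted figures are mutually consistent if about one-third of the population is exposed to tobacco.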

The role of germline genetic factors predisposing to pancreatic cancer is a subject of numerous studies

and ongoing collaborations. Polymorphisms in the following genes have been associated with increased

or decreased risk of sporadic pancreatic cancer: GCKR (odds ratio (OR) = 2.14)50, IGF1 and IGF1R (OR

= 0.6-0.7)51, IGFBP1 (OR = 1.46)51, SSTR5 (OR = 1.62)52, [MGMT (OR = 0.6), PMS2 (OR = 1.44),

PMS2L3 (OR = 5.54)]53, HNF1A (OR = 1.16-1.22)54, SDF1 (OR = 2.74)55, [FTO (OR = 1.12), MTNR1B

(OR = 1.11), MADD (OR = 1.14)]56, ALDH2 (OR = 1.37)57, HK2 (OR = 0.68 in diabetic/3.69 in

non-diabetic)58, [PPARG (OR = 0.21), NR5A2 (OR = 0.57-0.77), ADIPOQ (OR = 0.67), GGT1 (OR = 1.86)]59,

CASP9 (OR = 4.09-16.26)60, CAPN10 (OR = 1.57)61, p21 (OR = 1.70)62, CYP1B1 (OR = 0.67)63, CFTR

(OR = 1.4; OR = 1.83 if diagnosed under age 60)64, GSTP1 (OR = 3.09 if diagnosed under age 50)65,

CYP17A1 (OR = 0.63-0.77)66, PPARG in conjunction with high-dose Vitamin A (OR = 2.80)67, PTGS2

(OR = 1.34-1.63)68, MMS19L (OR = 0.7/1.34)69, IL1beta (OR = 2.0 for unresectable cancer)70, [LIG3 (OR

= 0.23), ATM (OR = 2.55)]71, IGF2 (OR = 0.07)72, [MTHFR (OR = 4.50), MTR (OR = 2.65), MTRR (OR

= 3.35) in heavy drinkers]73, MTRR (OR = 1.44-1.52)74, [FasL (OR = 0.35-0.73), CASP8 (OR = 0.56-

0.65)]75, NAT2 (slow-type, OR = 5.7)76, XRCC2 in smokers (OR = 2.32)77, ERCC2 in smokers (OR =

0.46)78, [MTHFR (OR = 2.6-5.12), TYMS (OR = 2.19)]79, NAT1-rapid type (OR = 1.5)80, RNASEL (OR =

2.12-3.5)81, UGT1A7 (OR = 1.98-4.7)82, XRCC1 in smokers (OR = 7.0 in women/OR = 2.4 in men)83.

Pathways affected by these genes include type II diabetes mellitus and glucose metabolism, insulin-like

growth factors, somatostatin, DNA repair, tumor growth, alcohol metabolism, obesity, glutathione

metabolism, cytochrome P450, cystic fibrosis transmembrane conductance regulator, fatty acid storage,

cyclooxygenase-2, nucleotide excision repair, inflammation, folate metabolism, cell cycle and cell death,

and toxin detoxification. Many of the aforementioned studies suggest gene-environment interactions.

To date, four genome-wide association studies (GWAS) of pancreatic cancer have been published: two

related GWAS conducted on subjects drawn from 12 cohort studies and 9 case-control studies

(mostly of European ancestry)84-85, a study performed in a Japanese population86, and, most recently, a

study in a Chinese population.87 While SNPs in several loci were observed to be associated at

sufficiently low p-values to suggest statistical significance (7q36-SHH, 15q14-gene desert)84,

(13q22.1-near KLF5 and KLF12, 1q32.1-NR5A2, 5p15.33-CLPTM1L-TERT)85, (6p25.3-FOXQ1, 12p11.21-BICD1,

7q36.2-DPP6)86, (21q21.3-BACH1, 5p13.1-DAB2, 10q26.11-near PRLHR, 21q22.3-near TFF1,

22q13.32-near FAM19A5)87, only one association has so far been successfully replicated in additional

studies: the ABO blood group locus at 9q34. In the GWAS by Amundadottir et al.84, the ABO locus was

identified as a potential associated locus in the initial phase of the study and confirmed in a replication

case-control set (odds ratio (OR) per non-O allele = 1.20). This association of non-O blood group with

pancreatic cancer risk was further replicated in other case-control studies (OR 1.33-2.42 [ref. 88], OR 1.37 [ref. 89], OR

1.43 [ref. 90], protective O-blood type OR 0.53 [ref. 91]). Furthermore, Wolpin et al.92 reported a higher risk of

pancreatic cancer for carriers of the A(1) variant of the A-allele, which has a higher glycosyltransferase

activity than the A(2) allele (OR 1.38). In addition, Risch et al.89 observed increased risk of pancreatic

cancer in non-O blood group subjects who are seropositive for H. pylori but negative for its virulence

protein CagA (OR 2.78). Analyses in non-Caucasian populations found similar risk effects of the non-O

alleles (OR 1.37-1.39 [ref. 93]; OR 1.67-3.28 [ref. 94]). Wang et al.95 also found evidence for an additive effect of A

blood type with Hepatitis B infection. It should be noted that the association of non-O blood type with

pancreatic cancer predates these GWAS; one of the earliest reports suggesting an association was in The

British Medical Journal in 1960.96 How blood type mediates pancreatic cancer risk and tumorigenesis is

unknown97, but it appears that approximately 20% of pancreatic cancers in European populations are

attributable to non-O blood type.88

Higher-penetrance genes may also predispose to pancreatic cancer, as shown by the co-occurrence of

pancreatic cancer with several known cancer syndromes. The highest-known risk is associated with

Peutz-Jeghers syndrome (PJS), caused by germline mutations of STK11. This autosomal dominant

syndrome is associated with melanocytic macules on the lips and buccal mucosa, gastrointestinal

hamartomas, and cancer. The lifetime risk of pancreatic cancer in PJS patients is up to 132-fold relative

to the general population, or about 66% by age 70.98,99 Another condition associated with up to 80-fold

higher risk of pancreatic cancer is hereditary pancreatitis, most commonly caused by mutations in PRSS1

in an autosomal dominant fashion (although SPINK1 mutations have also been implicated).100-101 Familial

atypical multiple mole melanoma (FAMMM) is an autosomal dominant syndrome characterized by

multiple nevi and increased risk of cancers, predominantly melanoma and pancreatic adenocarcinoma.

The primary genetic cause of FAMMM is mutations in CDKN2A/p16, and carriers (particularly of the

p16-Leiden founder) have up to 47-fold increased risk of developing pancreatic cancer.102 Some genes

that cause hereditary breast and ovarian cancer also raise the risk of pancreatic cancer. To date, the gene

contributing to the largest proportion of hereditary pancreatic cancer is BRCA2, which is estimated to

raise lifetime risk of pancreatic cancer by 3.5- to 10-fold and accounts for up to 19% of high-risk

families103-107 (although the contribution of BRCA2 may be population dependent, as it appears to be

significantly lower in German, Korean, and Spanish populations108-111). Although most BRCA2 families

with pancreatic cancer also cluster breast and/or ovarian cancer, some families are characterized by

exclusive presence of pancreatic cancer112, and even apparently sporadic cases have been demonstrated to

carry deleterious germline BRCA2 mutations.113 Interestingly, while the BRCA2 locus was first proposed

to contain a cancer-associated gene via linkage to familial breast cancer,114 the localization of the gene

itself and suggestion of its tumor-suppressor role was facilitated by discovery of a homozygous deletion

at 13q12 in a pancreatic adenocarcinoma.115-116 Germline mutations of other Fanconi-anemia pathway

genes have been reported in pancreatic cancer families but the magnitude of risk associated with these

genes is unclear: PALB2 in ~0.9-4% of families117-120, BRCA1 in 2.6-4.4% of families121-122 (although

Axilbund et al. failed to find mutations in a series of 66 familial pancreatic cancer patients123), and ATM in

2.4% of families124. Mutations in FANCC and FANCG have also been reported in young-onset pancreatic

cancer subjects125, although these genes do not appear to contribute significantly to familial pancreatic

cancer.126-128

Several other syndromes associated with risk of pancreatic cancer include Lynch syndrome (caused by

mutations of the mismatch repair genes MLH1, MSH2, MSH6, PMS2 or TACSTD1-3’ deletion),129-132

Li-Fraumeni syndrome (caused by mutations of TP53)133, Familial Adenomatous Polyposis (caused by

mutations of APC)134, and cystic fibrosis (caused by mutations of CFTR)135.

However, the contribution of known genetic syndromes to the overall heritability of pancreatic cancer is

limited; approximately 10% of all pancreatic cancer cases appear to be familial or hereditary and most do

not have a known genetic explanation.136 Perhaps the earliest indications that a familial pancreatic cancer

syndrome exists were several case reports and case series in the 1970s and 1980s describing clusters of

pancreatic cancer in first- and second-degree blood relatives.137-143 Subsequently, both retrospective

case-control and prospective cohort studies suggested increased risk of pancreatic cancer in close relatives

of patients compared to the general population (Table 1).

Table 1 - Studies estimating risk of pancreatic adenocarcinoma in relatives of affected patients

| Paper | Type of study | Description | Risk of pancreatic cancer in relatives of patients |
| --- | --- | --- | --- |
| Ghadirian et al.144 | Case-control | 179 cases vs. 179 controls (French Canadian) | OR in subjects with positive family history = 13 (p<0.001) |
| Fernandez et al.145 | Case-control | 362 cases vs. 1408 controls (Italian) | OR in FDR of affected cases = 3.0 (95% CI 1.4-6.6) |
| Silverman et al.146 | Case-control | 484 cases vs. 2099 controls (US) | |
| Schenk et al.147 | Case-control | 247 cases vs. 420 controls (US) | |
| Ghadirian et al.148 | Case-control | 174 cases vs. 136 controls (Canada) | OR in FDR of affected cases = 5.0 (p=0.01) |
| Inoue et al.149 | Case-control | 200 cases vs. 2000 controls (Japan) | OR in subjects with positive family history = 2.09 (95% CI 1.01-4.33) |
| Rulyak et al.150 | Nested case-control | 251 members of 28 families (US) | OR with each affected FDR = 1.8 (95% CI 1.1-2.7) |
| Cote et al.151 | Case-control | 247 cases vs. 420 controls (US) | OR in subjects with positive family history = 2.49 (95% CI 1.32-4.69) |
| Hassan et al.152 | Case-control | 808 cases vs. 808 controls (US) | OR in FDR of affected cases = 3.3 (95% CI 1.8-6.1); OR in SDR of affected cases = 2.9 (95% CI 1.3-6.3) |
| Jacobs et al.153 | Case-control | 1,183 cases vs. 1,205 controls (US, Europe, China) | |
| Matsubayashi et al.154 | Case-control | 577 cases vs. 577 controls (Japan) | OR in FDR of affected cases = 2.5 (p=0.02) |
| Coughlin et al.155 | Cohort | 1.1 million (US) | RR for PC mortality in FDR of affected cases (males) = 1.5 (95% CI 1.1-2.1); (females) = 1.7 (95% CI 1.3-2.3) |
| Tersmette et al.156 | Cohort | Prospectively followed 150 FPC kindreds and 191 SPC kindreds from NFPTR | SIR in FPC relatives if 2 or more affecteds = 18.3 (95% CI 4.74-44.5); SIR in FPC relatives if 3 or more affecteds = 56.6 (12.4-175) [no significantly elevated risk in SPC relatives: SIR in FDRs = 6.5 (0.78-23.3)] |
| Hemminki et al.157 | Cohort | 10.2 million Swedish (21,000 PC cases) | SIR for children of affected cases = 1.73 (95% CI 1.13-2.54) |
| Klein et al.158 | Cohort | Prospectively followed 370 FPC kindreds and 468 SPC kindreds from NFPTR | SIR in FDRs of FPC affecteds = 9.0 (4.5-16.1); if 1 FDR affected, SIR = 4.5 (95% CI 0.54-16.3); if 2 FDRs affected, SIR = 6.4 (95% CI 1.8-16.4); if 3 or more FDRs affected, SIR = 32 (95% CI 10.4-74.7) [no significantly elevated risk in FDRs of SPC affecteds, SIR = 1.8 (95% CI 0.2-6.42), or spouses/unrelated relatives, SIR = 2.4 (95% CI 0.06-13.5)] |
| Jacob et al.159 | Cohort | 1.1 million (US) | RR for PC mortality in FDR of affected cases = 1.66 (95% CI 1.43-1.94) |
| Brune et al.160 | Cohort | Prospectively followed 1,718 kindreds from NFPTR | SIR in FDR of FPC affected = 6.79 (95% CI 4.59-9.75); if 1 FDR affected, SIR = 6.86 (95% CI 3.75-11.04); if 2 FDRs affected, SIR = 3.97 (95% CI 1.59-8.2); if 3 or more FDRs affected, SIR = 17.02 (95% CI 7.34-33.5). Young-onset (< 50 years) in FDR associated with SIR = 9.31 (95% CI 3.42-20.28); late-onset (> 50 years) in FDR associated with SIR = 6.34 (95% CI 4.02-9.51) |

OR = odds ratio; 95% CI= 95% confidence interval; FDR= first-degree relative; SDR = second-degree relative; PC = pancreatic cancer; SIR = standardized incidence ratio; RR = relative risk; FPC = familial pancreatic cancer (at least 1 pair of affected FDRs); SPC = sporadic pancreatic cancer (no affected FDR pairs); NFPTR = National Familial Pancreas Tumor Registry at Johns Hopkins University (http://pathology.jhu.edu/pc/nfptr/index.php)
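
For readers less familiar with how the odds ratios and confidence intervals in Table 1 arise from case-control counts, the following is a minimal sketch using the standard Woolf (log-scale) interval. The counts are hypothetical and are not taken from any study in the table, and the individual studies may have used adjusted or matched analyses rather than this crude calculation.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases, exposed_controls,
                  unexposed_controls, z=1.96):
    """Crude odds ratio with a Woolf (log-scale) confidence interval."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: "exposed" means a positive family history.
or_, lo, hi = odds_ratio_ci(20, 180, 10, 190)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval whose lower bound falls below 1, as with these made-up counts, would not support an association at the 5% level, which is why the table reports the interval alongside each point estimate.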

Segregation analysis of 287 families with an index case of pancreatic cancer recruited by Johns Hopkins

Medical Institutions supports the hypothesis that a major gene is involved in pancreatic cancer risk, with

the most likely model including the autosomal dominant inheritance of a rare allele.161 The degree of risk

is linked to the number of affected relatives, the degree of relation, as well as the age of onset of disease

in relatives. Three large cohort studies following kindreds recruited by the National Familial Pancreas

Tumor Registry (NFPTR) at Johns Hopkins Medical Institutions found that, in families with at least one pair of

affected first-degree relatives (FDRs), the standardized incidence ratio (SIR) in FDRs of affected patients was 4.5-6.79 if only

one FDR is affected, 3.97-18.3 if two FDRs are affected, and 17.02-56.6 if three or more FDRs are

affected.156,158,160 Moreover, the younger the age of onset of cancer in the affected relative, the higher the

risk in first-degree relatives (hazard ratio (HR) 1.55 per decreased year of onset).160
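
The cohort estimates above are standardized incidence ratios, i.e. the number of cancers observed in a group divided by the number expected from population rates. The sketch below shows one common way to compute an SIR with an approximate 95% confidence interval. It uses Byar's approximation to the Poisson interval, which is an assumption made here for simplicity; the cited studies may have used exact Poisson limits. The observed and expected counts are hypothetical.

```python
import math


def sir_with_ci(observed: int, expected: float, z: float = 1.96):
    """Standardized incidence ratio with Byar's approximate confidence limits."""
    sir = observed / expected
    if observed == 0:
        lower = 0.0
    else:
        lower = observed * (1 - 1 / (9 * observed)
                            - z / (3 * math.sqrt(observed))) ** 3 / expected
    shifted = observed + 1
    upper = shifted * (1 - 1 / (9 * shifted)
                       + z / (3 * math.sqrt(shifted))) ** 3 / expected
    return sir, lower, upper


# Hypothetical example: 10 cancers observed where 2.0 were expected.
sir, lower, upper = sir_with_ci(10, 2.0)
print(f"SIR = {sir:.1f} (95% CI {lower:.2f}-{upper:.2f})")
```

Because the interval depends on the observed count, strata with very few events (such as kindreds with three or more affected relatives) yield wide intervals even when the point estimate is large.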

It is not clear whether the average age of onset of pancreatic cancer is significantly lower in FPC, as many

studies found no difference in age of onset of disease between FPC and sporadic cases143,144,156,162,163 and

even the few studies that identified a difference found it to be rather small (65-68 yrs in FPC vs. 70 yrs in

SEER database).160,164,165 However, there is evidence for genetic anticipation in FPC families, with

members of each successive generation developing cancer on average 6-15 years younger than the

previous generation.166-169 There is strong evidence for gene-environment interaction in FPC,

particularly with respect to tobacco use; FPC kindred smokers developed pancreatic cancer a decade

earlier than non-smokers168 and the relative risk of developing cancer is approximately 19-fold that of the

average population in smokers from FPC families.158

In some cancer syndromes, there is a significant difference in survival between familial and sporadic

cases (e.g. colorectal cancer), but it is not clear that there is such a difference in FPC. Several studies

have found no difference in survival between sporadic and familial pancreatic cancer.143,164,170,171 Ji et

al.172 found that familial cases had worse outcome than sporadic cases (HR=1.37) in a Swedish Family

Cancer database, while Yeo et al.173 identified significantly worse survival in unresected FPC cases

compared to unresected sporadic cases but no significant difference for resected cases. Interestingly,

recent anecdotal reports and small series of FPC patients with mutations in BRCA-related genes who were

treated with platinum-based chemotherapy, topoisomerase inhibitors, or poly-ADP-ribose-polymerase

(PARP1)-inhibitors suggest that this subset of familial cases may have good chemotherapy responses and

improved survival compared to sporadic cases.174-178

Aside from the difference in inactivation of BRCA-related pathway between familial and sporadic cases

(up to a fifth of FPC tumors vs. less than 10% in sporadic cases), there has been limited investigation into

molecular genetic and pathologic differences between familial and sporadic pancreatic cancers. Pancreata

from FPC subjects appear to have an increased prevalence of precursor lesions (PanINs and IPMNs)

compared to those from sporadic cases.179,180 Studies analyzing the rate and genome-wide distribution of

loss-of-heterozygosity (LOH) have shown conflicting results: Abe et al.181 identified LOH at

approximately 50% of informative markers in 20 FPC tumors while a similar study in 82 sporadic tumors

found the average LOH rate to be 25%182, but a third study that used a SNP array to identify LOH in 26

pancreatic cancer cell lines found a rate of LOH similar to that in familial tumors (average 43%).183

Differences in LOH rates aside, the pattern of LOH across the genome appeared similar across all three

studies. Brune et al.184 analyzed familial tumors for KRAS mutations, TP53 and SMAD4 expression, and

methylation rate of seven genes previously shown to be hypermethylated in sporadic tumors, and found

no significant difference between familial and sporadic tumors.
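
To make the notion of an LOH rate at "informative markers" concrete, the sketch below counts only markers that are heterozygous in the germline, since loss of an allele cannot be observed at a homozygous marker. The data structure and function name are hypothetical, and the cited studies may have applied additional thresholds (for example, allelic ratio cut-offs) that are not modelled here.

```python
def loh_rate(markers):
    """Fraction of informative markers showing LOH.

    Each marker is a pair (germline_heterozygous, tumor_retains_both_alleles).
    Markers that are homozygous in the germline are uninformative and skipped.
    """
    informative = [retained for heterozygous, retained in markers if heterozygous]
    if not informative:
        return None  # no informative markers, so the rate is undefined
    lost = sum(1 for retained in informative if not retained)
    return lost / len(informative)


# Hypothetical tumor: five markers, one of which is uninformative.
example = [(True, False), (True, True), (False, True), (True, False), (True, True)]
print(loh_rate(example))
```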

Given all the evidence supporting the existence of at least one major gene explaining the heritability of

pancreatic cancer in high-risk families, much effort has been directed at identifying the

responsible gene, including through genetic linkage studies. Linkage analysis is a statistical tool that uses family-based

data and the likelihood of recombination between loci on a chromosomal arm to identify genomic regions

that appear to be transmitted to affected members of the family more frequently than by chance alone.

Since linkage analysis was successful in mapping the location of and facilitating the identification of

highly-penetrant genes in many cancer syndromes (e.g. APC in Familial Adenomatous Polyposis185;

BRCA1 and BRCA2 in Hereditary Breast and Ovarian Cancer syndrome114,186), this technique has been

applied to the study of FPC. Familial registries fostered the collection of high-risk families, and a large

North American consortium has pooled the resources of six major sites: the Pancreatic Cancer Genetic

Epidemiology Consortium (PACGENE).165 This National Institutes of Health (NIH)-funded collaboration

includes the University of Toronto, Mayo Clinic, Johns Hopkins University, MD Anderson Cancer

Center, Dana-Farber Cancer Institute, and Karmanos Cancer Institute. Each site prospectively identifies

pancreatic cancer patients with a family history of at least two affected members. If a pedigree is deemed

suitable for linkage analysis (with the help of linkage simulation programs), probands are asked to

consent to contact their relatives for recruitment to the study. Consenting individuals complete

questionnaires about clinical and family history and provide blood samples for DNA extraction.
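
The statistic usually reported from linkage analysis is the LOD (logarithm of odds) score, which compares the likelihood of the family data at a given recombination fraction with the likelihood under no linkage (recombination fraction of 0.5). The sketch below covers only the simplest textbook case of phase-known, fully informative meioses for a two-point comparison. Real pedigree analyses, including those described in this chapter, rely on dedicated software that handles unknown phase, incomplete penetrance and missing genotypes; the function names here are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # any recombinant rules out complete linkage
    non_recombinants = meioses - recombinants
    log_linked = non_recombinants * math.log10(1.0 - theta)
    if recombinants:
        log_linked += recombinants * math.log10(theta)
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def max_lod(recombinants: int, meioses: int, steps: int = 50):
    """Grid search over theta in [0, 0.5]; returns (best LOD, best theta)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max((lod_score(recombinants, meioses, theta), theta) for theta in grid)


# Ten informative meioses with no recombinants, evaluated at theta = 0.
print(round(lod_score(0, 10, 0.0), 2))  # 3.01
```

A LOD score of 3 is the conventional threshold for declaring linkage, which illustrates why the number of informative meioses available in a pedigree matters so much for power.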

Linkage efforts in FPC have yielded limited results. The linkage work by PACGENE is ongoing, but to

date no highly significant loci have emerged. Investigators at the University of Washington (not

connected to PACGENE) published results of a linkage analysis conducted in a single FPC family

(identified as “Family X”) characterized by four generations of affected members with an autosomal

dominant pattern of inheritance suggesting high penetrance, young age of onset (median age 43), and

concomitant endocrine and/or exocrine pancreatic insufficiency.187 Based on a genome-wide screen using

373 microsatellite markers, significant linkage with LOD (logarithm of odds) scores 4.56-5.36 was

identified on chromosome 4q32-34. Although other centres failed to find a significant association at this

locus in European188 or North American189 FPC kindreds, the University of Washington group

subsequently claimed to have pinpointed the responsible gene as PALLD, which encodes palladin, a cytoskeletal scaffold protein.190

They demonstrated a variant (P239S) that segregated only with the affected members of the family linked

to 4q32-34, and they further presented evidence of PALLD overexpression in premalignant and cancerous

pancreatic tissue. However, significant doubt has been cast on the likelihood that PALLD is the

responsible gene for FPC, or at least that it is a significant cause of this cancer syndrome. Due to the

large number of candidate genes in the 4q32-34 locus, Pogue-Geile et al.187 were unable to screen all

candidates for mutations in Family X. Rather, they used a custom expression microarray to analyze RNA

extracted from whole tissue PanIN in one of the affected members of Family X and in another 10 sporadic

pancreatic cancers. PALLD appeared to have the highest expression, and it was based on this finding that

this gene was sequenced in Family X. However, Salaria et al.191 used immunohistochemistry of 177

pancreatic adenocarcinomas to show that palladin overexpression was primarily localized to

non-neoplastic stroma, with 96.6% of tumors demonstrating overexpression in the stroma and only 12.4% of

tumors overexpressing palladin in neoplastic cells. Furthermore, three studies of Canadian, US, and

European families found no deleterious PALLD mutations in any other FPC families. Zogopoulos et al.192

genotyped the P239S variant in 51 familial cases, 33 early-onset cases, and 555 controls and found it in only

one familial case diagnosed at age 74 (DNA was not available for the other family members)

and in one 91-year-old unaffected control. Slater et al.193 sequenced the locus containing the variant in 74

FPC families and found no mutations. Finally, Klein et al.194 performed sequencing on 92% of the coding

region of the entire PALLD gene in 48 FPC cases and found no deleterious mutations.

Since the PACGENE linkage study has not yet been completed, it is not known if any other loci will be

reliably linked to FPC. Some of the challenges associated with applying linkage analysis to FPC are: (1)

small number of affected individuals per family and rapid mortality, precluding recruitment and limiting

the number of meioses available to perform the analysis; (2) penetrance of the FPC gene(s) is likely lower

than in previously mapped hereditary cancer syndromes, reducing the power of linkage analysis; (3) there

is increasing evidence for locus heterogeneity in the etiology of FPC. To date, only BRCA2 has been

shown to account for a substantial portion of familial cases, while all other identified genes appear to be

responsible for fewer than 5% of cases each. Locus heterogeneity is a significant confounder of linkage

analysis, and the lack of distinguishing phenotypic or pedigree characteristics among families makes it

very difficult to confidently separate cases that are likely caused by different genes; (4) phenocopies, i.e.

sporadic cases arising within high-risk families, further reduce the power of linkage analysis. Given all these challenges, it is evident that other

techniques are needed in the effort to identify germline genetic alterations that predispose to FPC.

2. Copy Number Variation

2.1 Copy Number Variation – a novel paradigm

Our understanding of the nature and degree of variation in the human genome has accelerated in the past

few years. Until recently, single nucleotide polymorphisms (SNPs) appeared to be the most frequent and

important source of genomic variation in humans. Significant efforts have been directed at identifying

and genotyping SNPs in different populations, and numerous disease association and linkage studies have

been conducted using SNPs as genomic markers. Yet, the development of higher-resolution genomic

scanning technologies has highlighted a previously under-recognized but clearly significant degree of

submicroscopic structural variation in the human genome. Structural variants encompass copy-number

variants (CNVs) (defined as genomic segments which are present in variable copy numbers when

comparing two or more genomes) as well as inversions, novel sequence or mobile element insertions, and

translocations.195 The original definition of CNVs used 1,000 base pairs as a lower-limit size threshold, to

differentiate from smaller “insertions/deletions”. However, more recently the spectrum of CNVs has

been expanded to include any variants larger than 50bp, reflecting the identification of smaller variants

using sequencing technologies.195
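
The practical effect of the change in size threshold can be shown with a small sketch. The two constants reflect the thresholds stated above; the function name, the labels, and the choice to treat a variant exactly at the threshold as a CNV are assumptions made for illustration.

```python
ORIGINAL_MIN_CNV_BP = 1_000  # lower size limit in the original CNV definition
REVISED_MIN_CNV_BP = 50      # lower size limit in the expanded definition


def classify_by_size(length_bp: int, min_cnv_bp: int) -> str:
    """Label a gain or loss as 'CNV' or 'indel' under a given size threshold."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return "CNV" if length_bp >= min_cnv_bp else "indel"


# A hypothetical 300 bp deletion changes category between the two definitions.
print(classify_by_size(300, ORIGINAL_MIN_CNV_BP))
print(classify_by_size(300, REVISED_MIN_CNV_BP))
```

This matters when comparing CNV counts across studies, because the same variant may or may not have been counted as a CNV depending on the definition and detection platform in use.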

Although CNVs at certain loci had long been recognized as polymorphisms in normal individuals (e.g.

alpha-globin gene family; Rhesus blood group) as well as the cause of genomic disorders (e.g. Charcot-

Marie-Tooth neuropathy type IA; Williams-Beuren syndrome; Potocki-Lupski syndrome),196 the

ubiquitous presence of CNVs in normal human genomes first became apparent with the publication of

two genome-wide studies in 2004.197-198 Since that time, more CNV-detection surveys, with continually

improving genomic coverage and resolution, have reported thousands of CNVs affecting all human

chromosomes in apparently normal individuals.199-249 (See Table 2) While the number of known SNPs

(~11 million) exceeds that of CNVs, the proportion of genomic sequence that is different between any

two genomes due to indels/CNVs is approximately 12-fold that of SNPs (1.2% vs. 0.1%).238

Table 2 - Summary of published studies reporting germline genomic copy-number variation in non-disease samples

Study (Year Published)

Population Primary CNV detection method

Reference genome

Source of DNA

Number of CNVs

Size of reported CNVs

Proportion of CNVs detected in > 1 sample

Number of CNVs confirmed within same study

CNV confirmation methods

Sebat et al. (2004)197

20 ethnically diverse individuals

aCGH: ROMA (85,000 probes, 35kb apart; Bgl II restricti-on enzyme)

12 samples (mostly from a single male sample); single ref per hybridizati-on experiment

Blood, sperm, cell lines

76 Average = 465kb

41% 11/12 FISH, hybridization to HIND III ROMA platform

Iafrate et al. (2004)198

55 ethnically diverse individuals (39 unrelated healthy controls + 16 individuals with known chromoso-mal imbalances)

aCGH: BAC array (2632 clones, 1Mb apart)

Pooled male or female normal samples

Whole blood + cell lines

255 Average = 150kb

40% 19/19 qPCR, FISH

Sharp et al. (2005)199

47 ethnically diverse individuals

aCGH: BAC array (2194 clones, targeting 130 segment-al duplicat-ion regions)

Single male sample

Cell lines

160 (represe-nt 119 regions if merge BACs <250kb apart)

Average BAC insert size = 164kb, some CNVs involve > 1 clone

55% 7/11 FISH

Tuzun et al. (2005)200

Single female NA15510 (fosmid library)

In-silico Fosmid end sequence pair mapping

NCBI reference human genome Build 35 (hg17)

n/a 297 Median = 15.7 kb (8-329kb)

n/a 16/57 33/40 7/11

BAC array (comparing 97 genomes) Sequencing of fosmid inserts PCR

Conrad et al. (2006)201

30 YRI trios + 30 CEU trios (HapMap)

In-silico: Assessm-ent of Mendeli-an inconsis-tencies in trios

n/a n/a 586 (396 in YRI; 228 in CEU)

YRI median = 8.5kb (0.5-1200kb) CEU median = 10.6 kb (0.3-404kb)

61% 92/105 qPCR, hybridization to custom high-density oligo array

McCarroll et al. (2006)202

269 HapMap individuals (4 ethnic groups)

In-silico: Analysis of Mendeli-an

n/a n/a 541 Median = 7 kb (1-745kb)

51% 90/541 FISH, allele-specific fluorescence measure, PCR, qPCR

transmis-sion errors, HW disequili-brium, null genotyp-es

Hinds et al. (2006)203

24 ethnically diverse individuals (Discovery panel)

aCGH: High-density oligo custom array

NCBI reference human genome (build not indicated)

Cell lines

215 Median = 0.75kb (70bp – 10kb)

67% 100/215 PCR

Locke et al. (2006)204

269 HapMap individuals

aCGH: BAC array (2007 clones, targeting 130 segment-al duplicat-ion regions)

Well-characteriz-ed single male sample (GM15724)

Cell lines

384 (in 222 regions, if merge BACs < 250kb apart)

Average = 436kb (145kb-1.4Mb)

67% 136/207 Custom high-density oligo array

Mills et al. (2006)205

36 individuals (different ethnic groups)

In-silico: Computa-tional alignme-nt of DNA reseque-ncing traces from SNP studies to reference genome

NCBI reference human genome Build 35 (hg17)

n/a 294,498 2bp-9989bp

183/189 PCR, sequencing

Redon et al. (2006)206

aCGH: Whole Genome Tiling Path array (26,574 BACs) + SNP array intensity comparison: 500K SNP platform

Single male reference (NA10851) for aCGH; pairwise comparison between all samples for 500K

Cell lines

1447 merged CNVRs (913 on WGTP platform; 980 on 500K platform)

Average = 341kb (WGTP) 206kb (500K SNP)

~50% 173/1447 43% of all CNVs

Locus-specific quantitative assay Replicated on both platforms

Simon-Sanchez et al. (2007)207

276 well-phenotyped Cauasians, from NINDS study

SNP array intensity comparison: 1)109,365 gene-centric SNP array

Reference genotyping clusters (used in Illumina-specific CNV-detection algorithms)

Cell lines

340 ~20kb – 3Mb (for non-heteros-omic CNVs)

5 13/24

qPCR replication of CNV detection in DNA from whole blood

2) 300K SNP array

Wong et al. (2007)208

95 samples (include healthy blood donors, cancer screening program participants, 16 distinct ethnic groups)

aCGH: BAC array (26,363 clones)

Single male reference

Whole blood, cell lines

3654 >40 kb 22% detected in >2 samples

265 Confirmed in 5 cases on oligo array

Levy et al. (2007)209

Single diploid genome of Craig Venter

In-silico: Random shotgun sequenc-ing, compari-son to NCBI reference genome aCGH: 244K oligo array; 385 oligo array; 2 different SNP array platforms

NCBI reference genome Build 36 for one-to-one mapping of insertions/ deletions Single male reference (NA10851) for aCGH and SNP array compariso-ns

Whole blood

919,584 indels (600 ≥ 1kb in size) + 62 CNVs

Indels = 1-82,711 bp (average 2.4-11.7bp) CNV (~8kb-2Mb)

n/a 37/40 indels

Comparison to fosmid clones from 8 other individuals

Korbel et al. (2007)210

2 previously analyzed female subjects: NA15510 (presumed European ancestry) and NA18505 (YRI)

In-silico: Paired-end sequence mapping (generat-ed by next-generati-on massive parallel sequenc-ing)

NCBI reference human genome Build 36

Cell lines

1175 total (422 in NA15510; 753 in NA18505)

Majority <10kb, but variants up to >1Mb detected

89% of 249 variants tested in individuals from 4 population

132/261 (NA15510) 328/616 (NA18505) 95 (NA15510) 97 (NA18505) 31/48 (NA15510)

PCR (+ sequencing breakpoints in a subset of amplicons) Also present in Celera assembly aCGH with oligo tiling arrays comparing NA15510 to NA18505

Pinto et al. (2007)211

506 controls of North German descent (PopGen study)

SNP array intensity comparison: 500K SNP array

Multiple references

Cell lines

1023 CNVRs (430 high-confiden-ce; i.e. detected by ≥ 2 algorith-ms)

Average size of “high-confiden-ce” CNVRs = 369kb

4% of CNVRs in >2% of population

217/1010 Overlap with CNVRs called in 269 HapMap samples analyzed with identical algorithms to PopGen

Wang et al. (2007)212

SNP array intensity compari-son: 550K

Reference genotyping clusters (used in Illumina-specific

Cell lines

2633 Average 31.5kb-61.2kb (depend-ing on ethnic

52.6-74.8% of CNVs were also detected in parents

Assumes high heritability of CNVs, compares to CNVs called in parents

SNP array

CNV-detection algorithms)

group) 3 CNVs

PCR, re-sequencing of breakpoints

Zogopoulous et al. (2007)213

1190 controls from Ontario Familial Colorectal Cancer Registry (Canada); mostly Caucasian

SNP array intensity compari-son: 100K and 500K arrays

Multiple references

Blood 578 CNVRs

Average = 408kb (12bp – 4.5Mb)

< 7% are detected in >1% of population

4 qPCR

deSmith et al. (2007)214

50 males (north French origin)

aCGH--2-stages: 1) 185K oligo genome-wide array (in 35 individuals) 2) custom high-density 244K array

Pooled references for 185K array; single female reference (NA15510) for 244K array

Blood 9244 multi-probe CNVs (1469 CNVRs) 6089 single-probe CNVs (4705 CNVRs)

Median 4.4kb

45% 90-95% of common CNVRs detected on 185K array 21

Replication on 244K array PCR, MLPA

Jakobsson et al. (2008)215

485 individuals, from 29 populations (Human Genome Diversity Project)

SNP array intensity comparison: Illumina Infinium Human HapMap 500 Beadchip

Cell lines

3552 (map to 1428 loci)

Average = 82.7kb (deletion) 130.4kb (duplication) (2kb-998kb)

Perry et al. (2008)216

30 HapMap individuals (4 populations)

aCGH: Custom oligo array (470,163 probes) targeting CNVs previously detected by Redon et al. (2006)

Single male reference (NA10851)

Cell lines

2664 (map to 1153 loci)

15-33% smaller CNVs than detected by Redon et al. (2006) in same sample

50% 23/51 Sequencing over breakpoints

Takahashi et al. (2008)217

80 healthy Japanese offspring of atomic bomb survivors

aCGH: 2238 BAC custom array

One male and one female Japanese

Cell lines

251 (mapping to 30 regions)

Average: 120kb (deletion) 160kb (duplication)

53% 14/14 rare CNV regions

qPCR, FISH, PGFE-Southern Blot, sequencing

Wheeler et al. (2008)218

Single diploid genome of James Watson

In-silico: Next-generation sequencing, comparison to NCBI reference human genome + aCGH: 244K oligo array + 2.1 million probe array (3 experiments with 2 different references)

(sequence mapping) NCBI reference human genome Build 36 (aCGH) a) standard Caucasian male ref and b) NA10851

Blood

163,608 indels (by sequence comparison) 23 CNVs (by aCGH)

(2bp-38,896bp) 26kb-1.6Mb

n/a

Excellent concordance in CNV calls when using same reference on different oligo arrays (data not shown)

aCGH experiments against NA10851 on 244k and 2.1 million probe arrays

McCarroll et al. (2008)219

270 HapMap

SNP microarray (Affy6.0)

270 HapMap

Cell lines

3048 CNVs (1320 CNVRs)

50% 27 loci qPCR

Cooper et al. (2008)220

9 HapMap SNP microarray (Illumina)

Reference genotyping cluster

Cell lines

368 64-67% Fosmid sequence alignment data

Kidd et al. (2008)221

8 HapMap samples (4 ethnic groups)

In-silico: Fosmid-end sequence pair mapping

Cell lines

7184 predicted non-redundant CNVs

>6kb 50% 1471 MCD analysis (multiple complete restriction enzyme digest); High-density oligo arrays and SNP arrays; Correlation to SNP genotyping data for 130 deletions; Full-length sequencing of fosmid clones

Bentley et al. (2008)222

Single YRI male (NA18507)

In-silico: Paired reads of massively parallel sequencing

Cell line

4116 n/a

Wang et al. (2008)223

Single Asian male (Han Chinese)

In-silico: paired-end reads of massively parallel sequencing

Blood 2474 Median = 492 bp

Gusev et al. (2009)224

3000 individuals from Kosrae island (Micronesia)

In-silico: Uses novel algorithm to identify gaps in “identity-by-state” stretches of SNP genotypes

215 52 Used other computational methods and compared to previous reports

Itsara et al. (2009)225

2493 SNP microarrays (Illumina)

Cell lines; blood

13,843 (map to 3476 CNVRs)

77% Cross-platform comparison (to CGH array)

Shaikh et al. (2009)226

2026 (1320 Caucasian; 694 African-American; 12 Asian-American)

SNP microarray (Illumina HumanHap550)

Blood 54,462 (non-unique CNVs map to 3272 CNVRs)

Median = 8kb

77.8% 16/20 1753/2409 19/21

qPCR array-based comparison (affy vs illumina) comparison to previously published data of a HapMap samples (Kidd et al)

Kim et al. (2009)227

Single Korean male (AK1)

In-silico: paired-end reads of massively parallel sequencing and end-sequences of BAC clones aCGH: custom 24M microarray; SNP arrays

NCBI reference human genome Build 36 Reference for CGH arrays not identified

Blood, sperm

315 277bp-2Mb

n/a Sequence data complemented microarray data

Ahn et al. (2009)228

Single Korean male

In-silico: paired-end reads of massively parallel sequencing

Blood 2920 0.1-100Kb

n/a 2344 Detected in DGV (no direct confirmation)

Matsuzaki et al. (2009)229

90 HapMap YRI samples

aCGH: Custom oligonucleotide microarrays

Signal compared to normalized signal of all 90 samples

Cell lines

6578 Median = 4.9kb

3850 31/40 qPCR (also compared to findings of previous studies – 87-99.97% agreement)

McKernan et al. (2009)230

Single YRI male (NA18507)

In-silico: ABI SOLiD paired-end and split-reads (ligation-based sequencing assay)

NCBI reference human genome

Cell line

565 2-937kb n/a n/a n/a

McElroy et al. (2009)231

385 African Americans and 435 White Americans

SNP array (Affy 500K)

50 African Americans females (derived from blood)

Cell lines + Blood

1362 in African Americans + 1972 in White Americans (map to 412 African-American unique CNVRs; 580 White-unique CNVRs; 76 shared CNVRs)

Mean duplication = 827kb; mean deletion = 703kb

174 CNVRs

3 loci qPCR

Conrad et al. (2009)232

Discovery in 40 females (19 CEU + 20 YRI + 1 diversity panel); genotyping in 450 HapMap

Discovery: NimbleGen 42M arrays Genotyping: Custom Agilent 105k arrays; SNP array (Illumina Infinium Human660W)

Discovery: NA10851 Genotyping: pooled DNA of 10 European samples (9 males + 1 female)

Cell lines

11,700 Median = 2.7kb

49% 79/99 (qPCR) 15% FDR (microarray)

qPCR; other microarrays

Alkan et al. (2009)233

3 individuals

Read-depth of massively parallel sequencing reads

Reference human genome

Cell lines

725 97% of all variants

17/25 aCGH FISH

Lin et al. (2009)234

813 Taiwanese individuals

Illumina 550K BeadChip

Blood 4452 (map to 1025 CNVRs)

Mean = 497kb

365 CNVRs

279/365 CNVRs

Identified on Affy 500K array

Li et al. (2009)235

1000 Caucasians and 700 Han Chinese

SNP array (Affymetrix 500K)

Half the samples were used as references for the other half and vice-versa

Blood 2381 Median = 195kb

27.6% 680/985 overlap DGV

Compared to DGV No experimental validation

Altshuler et al. (2010)236

1184 (HapMap3-11 populations)

SNP arrays (Affymetrix 6.0 and Illumina 1M arrays)

Reference genotyping clusters

Cell lines

856 Median = 7.2 kb

All CNPs detected in ≥ 1% of population

n/a FDR of algorithms determined by comparing to CGH data for 34 individuals

Ju et al. (2010)237

Single Caucasian male (HapMap NA10851)

Data from previous aCGH studies that used NA10851 as reference + read-depth of NA10851 massively parallel sequencing

73 individuals (from Conrad et al, 2010 and Park et al. 2010)

Cell line

1309 Median = 2.7kb

n/a n/a n/a

Pang et al. (2010)238

Single diploid genome of Craig Venter

In silico: de novo assembly comparison; paired-end reads; split-reads aCGH: Agilent 24M + NimbleGen 42M arrays SNP arrays: Affymetrix 6.0 + Illumina 1M

NA15510 for Agilent 24M and NimbleGen 42M arrays

Whole blood

808,179 insertions or deletions (2641 ≥ 1kb)

(1-1.7Mb)

n/a 89/96 SVs identified by sequence analysis 20/25 CNVs identified by microarrays 11,140 SVs in common to this study and Levy et al

Compared to SVs called in previous analysis of same genome (Levy et al) PCR/qPCR

Park et al. (2010)239

30 females (10 Korean; 10 HapMap Chinese; 10 HapMap Japanese)

aCGH: 24M custom Agilent arrays

Single male reference (NA10851)

Cell lines

20,099 (map to 5177 loci)

Median = 2.7kb (438bp-1.1Mb)

39% 106/116 loci

Teague et al. (2010)240

NA15510, NA10860, NA18994

Optical Mapping (single-molecule restriction mapping)

Cell lines

5416 3kb-megabases

>1/3 all variants

42-61% (depends on platform being compared against)

Compared to fosmid-end sequencing, paired-end sequencing, SNP array (Affy6.0), tiling array CGH

Kidd et al. (2010)241

9 HapMap individuals

Identifying fosmid-end clones that did not map to reference genome

Cell lines

2363 novel insertion sites (correspond to 720 loci)

Median = 1kb (1-20kb)

192 loci Sequencing, genotyping

Kidd et al. (2010)242

17 individuals

Capillary end sequencing of fosmid clones

Cell lines

973 n/a n/a n/a n/a

Schuster et al. (2010)243

5 individuals

Read depth aCGH

Blood 187 n/a n/a n/a n/a

Yim et al. (2010)244

3578 Korean individuals

SNP array (Affy5.0)

NA10851 + pooled 100 Korean females

Median 18.9kb

656 CNVRs in ≥ 1% of samples

14/16 loci qPCR

Gayan et al. (2010)245

801 Spanish individuals

SNP array (Affymetrix 250 NspI array)

25 female samples from other studies

Blood 11,743 Median 150.7kb

623 CNVs present in >2 individuals

519 CNVs previously described

Comparison to DGV (no experimental validation)

The 1000 Genome Project Consortium (2010)246; Mills et al. (2011)247

Three pilots: (1) 3 trios from 2 families – deep sequencing (avg 42x) (2) 179 unrelated – low depth (2-6x) (3) deep sequencing of exons of 1000 genes in 697 individuals (avg >50x)

Paired-end mapping, read-depth analysis, split-read analysis, and sequence assembly of massively parallel sequencing

Cell lines

14,327 50bp - ~1Mb

<10% FDR PCR aCGH

Chen et al. (2011)248

2789 individuals from three European populations

SNP array (Illumina Infinium HumanHap 300)

Mean = 205kb

406 649 CNVRs

Overlap with reported CNVs in DGV (no experimental validation done)

Moon et al. (2011)249

Discovery: 100 Korean individuals Genotyping: 8842 Korean individuals

aCGH array (NimbleGen 3 x 720K) + SNP array (Affy 5.0)

NA10851 Blood 8779 (576 CNVRs chosen for frequency analysis)

Median length of 576 CNVRs = 113kb (1kb-4.56Mb)

807 CNVRs (576 chosen for frequency analysis in larger sample set)

66.7%-100% positive predictive values for 20 randomly chosen CNVRs

TaqMan assays

Studies listed in chronological order by publication date. CGH, comparative genomic hybridization; oligo, oligonucleotide; FISH, fluorescence in situ hybridization; ROMA, representational oligonucleotide microarray analysis; qPCR, quantitative polymerase chain reaction; BAC, bacterial artificial chromosome; YRI, Yoruba in Ibadan, Nigeria; CEU, Utah residents with ancestry from northern and western Europe; NCBI, National Center for Biotechnology Information; PGFE, pulsed gel field electrophoresis; MLPA, multiplex ligation-dependent probe amplification

2.2 CNV Databases

The Database of Genomic Variants (DGV) (http://projects.tcag.ca/variation/) was founded in conjunction

with the publication of the first few CNVs in 2004 by Sebat et al.197 and Iafrate et al.198, to catalogue

former and future discoveries of structural variants in the human genome. Curated by The Centre for

Applied Genomics (TCAG) in Toronto, the objective of this database is to summarize published data on

structural variation detected in healthy control samples, and it is periodically updated as new data

becomes available.198 At this time, the DGV presents data from each study separately, only merging

overlapping CNV calls (in the same direction) across samples within the same study. Moreover, calls

made by different platforms in the same study are also presented separately. Regions are displayed in

relation to the human genome reference assembly (Build 35/May 2004 or Build 36/March 2006 or

GRCh37/Feb 2009). The latest version of the DGV (updated Nov 02, 2010) contains 101,923 entries

mapped to the human genome Build 36, corresponding to 66,741 CNVs >1kb (mapping to 15,963

genomic loci), 34,229 InDels (relative gains or losses between 100bp-1000bp in size), and 953 inversions.

Forty-two published articles are cited as the source of data in the DGV. A beta-version of the database

has been released (October 2011) which provides access to data in partner databases at European

Bioinformatics Institute (DGVa) and National Center for Biotechnology Information (dbVar). The DGVa

repository has been the primary supplier of data to the DGV. dbVar includes structural variants from

multiple species and also includes data from clinical studies (non-healthy populations). Future

submission of CNV data will be managed by DGVa and dbVar, while the role of DGV will be to

manually curate and visualize selected studies to allow better interpretation of the clinical significance of structural variants.

Clinically significant CNVs (mainly those linked to genomic syndromes) are catalogued in DECIPHER250

(DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources,

https://decipher.sanger.ac.uk) and ECARUCA251 (European Cytogeneticists Association Register of

Unbalanced Chromosome Aberrations, http://umcecaruca01.extern.umcn.nl:8080/ecaruca/ecaruca.jsp).

In addition, there are several data sources for copy number alterations that are detected in tumors or

cancer cell lines. Those include The Wellcome Trust Sanger Institute Cancer Genome Project252

(http://www.sanger.ac.uk/cgi-bin/genetics/CGP/conan/search.cgi) and the Pancreatic Expression

Database253 (http://www.pancreasexpression.org/).
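
As an illustration of the within-study merging described above for the DGV, the following minimal Python sketch collapses overlapping CNV calls of the same direction into single regions. The tuple layout (chromosome, start, end, direction) and the function name merge_calls are hypothetical choices made for this example; this is not the DGV's actual implementation.

```python
def merge_calls(calls):
    """Merge overlapping CNV calls that share chromosome and direction.

    calls: iterable of (chrom, start, end, direction) tuples, where
    direction is a label such as "gain" or "loss" (assumed layout).
    """
    merged = []
    # Sort so that calls on the same chromosome and of the same direction
    # are adjacent and ordered by start coordinate.
    ordered = sorted(calls, key=lambda c: (c[0], c[3], c[1], c[2]))
    for chrom, start, end, direction in ordered:
        if merged:
            p_chrom, p_start, p_end, p_dir = merged[-1]
            if p_chrom == chrom and p_dir == direction and start <= p_end:
                # Overlaps the previous region: extend it.
                merged[-1] = (chrom, p_start, max(p_end, end), direction)
                continue
        merged.append((chrom, start, end, direction))
    return merged


example_calls = [
    ("chr1", 100, 200, "gain"),
    ("chr1", 150, 300, "gain"),
    ("chr1", 180, 250, "loss"),
    ("chr2", 100, 200, "gain"),
]
merged_regions = merge_calls(example_calls)
```

In this example the two overlapping gains on chr1 collapse into one region, while the overlapping loss is kept separate because it runs in the opposite direction.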

2.3 Discovery and Genotyping of CNVs

A variety of platforms and algorithms have been applied for CNV detection, with a wide range of

resolution, coverage, and signal-to-noise ratio, resulting in significant non-overlap in the CNVs detectable

between different platforms used to study the same samples. The earliest studies mapping CNVs in the

human genome were based on fluorescent in situ hybridization (FISH) and spectral karyotyping and were

limited in resolution to variants of large size (>500kb), most of which were associated with disease.254

Later, genome-wide CNV mapping became possible with array comparative genomic hybridization

(aCGH), a technique involving competitive hybridization of fluorescently labeled DNA samples from two

sources on a single array that contains immobilized target DNA sequences and use of computational

algorithms to analyze the hybridization ratio of the test and reference samples. The DNA targets on the

arrays originally comprised Bacterial Artificial Chromosome (BAC) clones but later were made of long

oligonucleotides.195 Early CGH arrays were of low resolution (typical CNV size detectable by these

platforms was greater than 100kb), and they significantly overestimated the true number of bases affected

by CNVs.197,198 Later, high density oligonucleotide tiling CGH microarrays became available, allowing

more accurate determination of CNV breakpoints and detecting many more CNVs of smaller size.232 One

important consideration in the use of CGH arrays for CNV detection is the reference sample. In any

given aCGH experiment, it is not possible to distinguish between a copy number loss on the test sample

versus a gain on the reference sample in the same region (or vice-versa), since both scenarios would

generate the same hybridization signal ratio. Moreover, a loss or gain present in both samples would be

entirely missed (since the signal ratio would appear to be 1). Ideally, the reference sample genome should

be well characterized using a variety of methods, and the same reference sample should be hybridized

against all test samples in an experiment to allow better comparison of the results. To date, several

individuals have had their genomes extensively mapped and have been used repeatedly in CNV studies

(HapMap NA10851, NA18507, NA15510).
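
The reference-sample ambiguity can be made concrete with an idealized calculation. The sketch below assumes that hybridization signal is directly proportional to copy number, which real arrays only approximate; the function name log2_ratio is illustrative and not taken from any analysis package.

```python
import math


def log2_ratio(test_copies, ref_copies):
    """Idealized aCGH log2 ratio, assuming signal scales with copy number."""
    return math.log2(test_copies / ref_copies)


# A one-copy loss in the test sample against a normal two-copy reference...
loss_in_test = log2_ratio(1, 2)
# ...gives the same ratio as a normal test sample against a reference
# carrying a gain to four copies in the same region.
gain_in_reference = log2_ratio(2, 4)
# A loss present in both samples produces a ratio of 1 (log2 ratio of 0)
# and is therefore missed.
shared_loss = log2_ratio(1, 1)
# For comparison, a one-copy gain in the test sample.
gain_in_test = log2_ratio(3, 2)
```

Under this simplified model the first two scenarios both yield a log2 ratio of -1, and the shared loss yields 0. The one-copy gain yields a smaller absolute shift (about 0.58) than the one-copy loss.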

Another type of microarray used for CNV detection is the SNP array. Originally designed to genotype

SNPs for genome-wide association studies, these arrays contain multiple probes corresponding to each

selected SNP, and a single test DNA sample is hybridized to each array. Various computational

algorithms have been developed to analyze the hybridization intensity data to estimate copy number at

each SNP location, and the two primary approaches are hidden Markov models and segmentation.

Earlier SNP arrays had lower resolution and coverage for CNV detection due to the nature of SNP

selection (focused on “tag SNPs” with minor allele frequencies of ≥ 1% to maximize coverage of the

genome while minimizing cost, and avoiding SNPs in regions that increase genotyping error due to

violation of Hardy-Weinberg Equilibrium or Mendelian inheritance errors).206,213 More recent SNP arrays

from Affymetrix and Illumina not only have a higher density of SNPs distributed genome-wide

(approximately 1 million) but also include probes for known CNV regions, hence allowing discovery of

smaller CNVs and the genotyping of polymorphic CNVs.219,220 Compared to CGH arrays, SNP arrays

have the added advantage of SNP genotype information which can be used to detect CNVs (by analyzing

“B-allele frequency”, which represents the proportion of total allele signal that is represented by a single

allele) as well as provide information on loss-of-heterozygosity (LOH) and uniparental disomy (UPD).
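
To illustrate why B-allele frequency is informative for copy number, the sketch below lists the B-allele frequencies expected at a SNP for a given total copy number, assuming noise-free signal and equal probe response for both alleles. The function name expected_baf_values is hypothetical.

```python
from fractions import Fraction


def expected_baf_values(copy_number):
    """Idealized B-allele frequencies for every possible count of B alleles.

    With copy_number total alleles, the B allele can be present 0 to
    copy_number times, giving a BAF of (number of B alleles) / copy_number.
    """
    if copy_number < 1:
        raise ValueError("BAF is undefined when no copies are present")
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]


baf_one_copy = expected_baf_values(1)     # hemizygous deletion
baf_two_copies = expected_baf_values(2)   # normal diploid state
baf_three_copies = expected_baf_values(3) # single-copy duplication
```

In this idealized model a diploid region shows clusters at 0, 1/2 and 1, a hemizygous deletion loses the heterozygous cluster at 1/2, and a single-copy duplication splits it into clusters at 1/3 and 2/3.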

Both CGH and SNP microarrays can detect only CNVs that map to regions present in the

reference genome on which the microarray design was based. Moreover, neither of those platforms

distinguishes between tandem and interspersed duplications, and they tend to be more sensitive in

detecting deletions than duplications (due to a higher signal ratio differential between 2 and 1 copies vs. 2

and 3 copies, for example).195 Furthermore, even the highest resolution arrays available lose sensitivity in

genome-wide detection of CNVs smaller than 10kb.219 Sequence-based methods have been used

increasingly to bridge the gap in mapping the full extent of variability of the genome. Even in the early

days of CNV discovery, several CNV papers were published based on mining of genotyping errors219-220,

fosmid paired-ends200,221, and massively parallel sequencing of paired ends of 3-kb fragments.210

Since then, many more studies have utilized the data from next-generation sequencing technologies to

identify CNVs, although there remain substantial bioinformatic challenges associated with analyzing this

data. The four main methods of using sequencing data to identify CNVs are255: (1) identifying read-pairs

whose mapping span is inconsistent with the reference genome; (2) identifying regions with significantly

increased or reduced read-depth compared to the distribution of read-depth across the (presumed diploid)

genome; (3) identifying “split-reads”, whereby there is a break in the alignment of a read relative to the

reference genome; (4) sequence assembly. To date the most commonly used method has been read-pair

mapping. All four approaches are limited in their sensitivity, specificity, and breakpoint accuracy

depending on read length, insert size, and physical coverage.
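
The first of the four methods above, read-pair mapping, can be sketched as follows. The example assumes a library with a known mean insert size and standard deviation and flags pairs whose mapped span on the reference deviates by more than a chosen number of standard deviations. The cutoff of three standard deviations, the 3-kb library values, and the function name classify_read_pairs are assumptions for illustration, not parameters from any cited study.

```python
def classify_read_pairs(spans, expected_mean, expected_sd, n_sd=3.0):
    """Label each read-pair span as concordant or suggestive of a variant.

    spans: mapped distances (bp) between paired ends on the reference.
    A pair mapping much farther apart than the library insert size
    suggests a deletion in the sample relative to the reference; a pair
    mapping much closer together suggests an insertion in the sample.
    """
    upper = expected_mean + n_sd * expected_sd
    lower = expected_mean - n_sd * expected_sd
    labels = []
    for span in spans:
        if span > upper:
            labels.append("deletion")
        elif span < lower:
            labels.append("insertion")
        else:
            labels.append("concordant")
    return labels


# Hypothetical 3-kb library with a 300-bp standard deviation.
pair_labels = classify_read_pairs([3000, 8000, 1000], 3000, 300)
```

A practical caller would additionally require several independent discordant pairs supporting the same event before reporting it, which is one reason sensitivity depends on physical coverage.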

Future directions in CNV detection include nascent technologies like optical mapping256, nanochannel

flow cells257, and emulsion picolitre droplet PCR258 that are being developed to allow high-throughput

detection of CNVs on an individual cellular and/or molecular level.

Multiple studies have demonstrated significant non-overlap between different platforms and algorithms

when analyzing the same samples.211,259 Given the variability in sensitivity and specificity of CNV

detection by the various platforms to date, validation is essential. Validation of detected CNVs has taken

two main forms in most studies: detection of the same (or overlapping) variants by different studies, and

replication within the same study (different array platform, PCR, qPCR, FISH, other experimental

methods). Overlap with regions identified in previous studies lends support to the variability of those

specific regions in the human genome, although many of the non-overlapping regions are also real (as

demonstrated by other replication methods). Similarly, replication on different platforms or with different

calling algorithms adds validity to detected CNVs in any tested sample, but regions identified by a single

approach can also be real. Experimental replication of CNVs provides the highest level of validation, but

those methods are often time-consuming and not optimized for high-throughput testing of multiple

regions and samples. As a result, most studies experimentally validated only a subset of their detected

CNVs (Table 2). However, high-throughput validation techniques have become available (e.g.

Sequenom©)260, so most CNVs published in the future should be confirmed more readily.
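
When CNVs are compared between studies or platforms, some rule is needed to decide whether two calls represent the same variant. One possible criterion, shown here purely as an assumption and not as the rule used by any study in Table 2, is reciprocal overlap: the shared segment must cover a minimum fraction of both calls. Coordinates are treated as half-open intervals, and the function name is illustrative.

```python
def reciprocal_overlap(call_a, call_b):
    """Return the smaller of the two overlap fractions for two intervals.

    call_a, call_b: (start, end) tuples on the same chromosome, with
    end exclusive. Returns 0.0 when the intervals do not overlap.
    """
    a_start, a_end = call_a
    b_start, b_end = call_b
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


partial = reciprocal_overlap((100, 200), (150, 250))
disjoint = reciprocal_overlap((100, 200), (300, 400))
nested = reciprocal_overlap((100, 200), (100, 1100))
```

Taking the minimum of the two fractions prevents a small call nested inside a much larger one from being counted as a match, which matters when older platforms overestimate CNV size.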

While most early CNV studies focused on variant discovery, determination of disease association with

specific CNVs requires accurate genotyping of the CNVs of interest. A number of techniques have been

employed for genotyping, including PCR based (e.g. PCR across breakpoints; quantitative PCR;

multiplex methodologies that assay multiple loci at once), SNP-array based (e.g. customizing arrays using

Illumina GoldenGate© assay for specific CNVs; using tag SNPs to impute common CNVs that are in high

linkage disequilibrium (LD) with the tag SNP), aCGH-based (e.g. customized high-density tiling arrays

with probes for known CNVs), and sequencing-based (e.g. building a library of breakpoints discovered

and validated from previous sequencing-based studies and comparing future de novo sequences against it

to rapidly genotype CNVs in those locations; calibrating aCGH data using sequencing-based data to

obtain absolute copy numbers).195 Accurate genotyping is easier for deletions than duplications, and is

particularly challenging in multi-allelic regions.

2.4 Structure and mechanism of CNV formation

Several mechanisms of genomic rearrangement that predispose to duplications and

deletions have been identified, driven by structural motifs in the genome. One of the earliest observations in CNV surveys

was the association of CNVs with segmental duplications.197,198,199,200,206,208,209,212 Segmental duplications

(also called low-copy repeats or duplicons) are genomic regions ≥ 1kb in size and with ≥ 90% sequence

homology, present in multiple copies and covering approximately 5% of the human genome.261

Segmental duplications, particularly those with 97% or greater sequence identity and less than 10Mb

distance between them, can cause misalignment of homologous chromosomes or sister chromatids and

mediate non-allelic homologous recombination (NAHR), thus producing genomic duplications and

deletions of regions flanked by the segmental duplications.262 In addition, segmental duplications

themselves may be CNVs if they are not yet fixed in the human genome and they vary in copy number

between individuals.199 Most recurrent CNVs appear to be caused by NAHR mediated by segmental

duplications.

However, not all CNVs are associated with segmental duplications and other mechanisms have been

implicated in CNV formation. Different repetitive elements found in the breakpoint junctions of CNVs

include Alu SINEs, L1 LINEs, and long terminal repeats.210,247 Other mechanisms associated with CNV

formation include non-homologous end-joining (NHEJ), retrotransposition events (otherwise known as

mobile element insertion, or MEI), Variable Number Tandem Repeat (VNTR) expansion/contraction

events, replication Fork Stalling and Template Switching (FoSTeS), and microhomology-mediated break-

induced replication (MMBIR).263 In some cases, a parental inversion may predispose to de novo

unbalanced variants in the children, such as in the example of 17q21.31 microdeletion syndrome.264

Multiple studies have noted certain genomic locations as “hotspots” for CNVs, including 6cen, 8pter,

15q13-14, 11q11, 19q13, and 7q11.197,212,210,221 Some regions, such as 8p23, appear to be hotspots for

recombination as well as sequence variation, containing an enrichment of both structural variants as well

as SNPs205,221. In a recent report analyzing next-generation sequencing data from the 1000 Genomes Project,

structural variants were found to cluster into hotspots by the mechanism of their formation, with VNTR

clustering near the centromeres and NAHR near the telomeres247. Possible explanations for genomic

variation hotspots include: older evolutionary age of the target genomic segments; biological functional

effect of involved regions driving selective pressure to maintain diverse alleles; or complete lack of

functional importance and selective pressure.205,247

2.5 Population Genetics of CNVs

The population genetics of CNVs is somewhat more complex than that of SNPs. Both forms of variation

may occur de novo or be inherited, but the de novo mutation rate for CNVs has been estimated to be 2-4

orders of magnitude greater than for single base mutations. Certain genomic regions are indeed

susceptible to recurrent rearrangements due to their structure (e.g. flanked by segmental duplications), but

when Mendelian inheritance was specifically investigated most common CNVs were indeed inherited

from a parent.219

Different studies have been differentially powered to detect common versus rare CNVs, thus yielding

conflicting data on the proportion of CNVs in the genome that are polymorphic (>1%). Earlier SNP

arrays and lower-resolution CGH arrays tended to be biased against common CNVs, so the majority of

CNVs identified using those platforms were rare in the general population. However, higher resolution

SNP arrays (such as Illumina 1M and Affymetrix 6.0) as well as very high-density CGH custom arrays

succeeded in detecting and genotyping a significant proportion of common CNVs over 1kb, and it is

evident that most of the variation between any two individuals at that resolution is due to common CNVs

that obey Hardy Weinberg Equilibrium.219,232 Sequencing based technologies have been identifying more

CNVs at a smaller size, and the data is a mix of rare and common CNVs.247
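
For a biallelic deletion polymorphism, Hardy-Weinberg Equilibrium predicts the proportions of individuals carrying two, one, or zero copies from the deletion allele frequency alone. The sketch below computes those expected counts; the function name and the example frequency are hypothetical.

```python
def hwe_expected_counts(n_individuals, deletion_allele_freq):
    """Expected number of individuals per diploid copy number under HWE.

    Returns a dict keyed by copy number: 2 (no deletion), 1 (heterozygous
    deletion) and 0 (homozygous deletion).
    """
    q = deletion_allele_freq
    p = 1.0 - q
    return {
        2: n_individuals * p * p,
        1: n_individuals * 2 * p * q,
        0: n_individuals * q * q,
    }


expected = hwe_expected_counts(100, 0.5)
```

Observed copy-number genotype counts that depart markedly from these expectations may indicate genotyping error, a multi-allelic locus, or a recurrent rather than inherited variant.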

Most common CNPs are biallelic (with a bias for detecting deletions on the platforms used), and most of

those were found to be tagged well by SNPs of similar frequencies, suggesting that they are ancestral

events.219 CNPs that are in strong LD with tagging SNPs can be easily genotyped in association studies,

thus facilitating their study. However, SNP “taggability” depends on the frequency as well as density of

nearby SNPs, meaning that some CNVs of lower frequency or present in regions not populated by many

SNPs will need to be genotyped directly. The same is true for complex CNVs or CNVs that have

multiple copy number alleles, as those tend to be in poor LD with nearby SNPs as well.
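
How well a SNP tags a biallelic CNP is commonly summarized by the squared correlation r² between the two loci. The sketch below computes r² from phased haplotype counts; the argument names and the example counts are hypothetical, and phased haplotypes are assumed to be available.

```python
def r_squared(n_del_alt, n_del_ref, n_nodel_alt, n_nodel_ref):
    """r-squared between a biallelic deletion and a biallelic SNP.

    Arguments are haplotype counts: n_del_alt is the number of haplotypes
    carrying both the deletion and the SNP alternate allele, and so on.
    """
    total = n_del_alt + n_del_ref + n_nodel_alt + n_nodel_ref
    p_del = (n_del_alt + n_del_ref) / total
    p_alt = (n_del_alt + n_nodel_alt) / total
    # D is the difference between the observed joint haplotype frequency
    # and the frequency expected if the two loci were independent.
    d = n_del_alt / total - p_del * p_alt
    denominator = p_del * (1 - p_del) * p_alt * (1 - p_alt)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


perfectly_tagged = r_squared(30, 0, 0, 70)
untagged = r_squared(25, 25, 25, 25)
```

In the first example the deletion always travels with the SNP alternate allele, so r² is 1 and the SNP is a complete proxy; in the second the loci are independent and r² is 0, so the CNP would need to be genotyped directly.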

Studies in populations of different ethnicities have suggested population differentiation in the frequency

of some CNVs, and some CNVs do appear to be population-specific.227-229,232 In keeping with the “out of

Africa” hypothesis, African populations have been found to have a higher number of rare or low-

frequency CNVs than non-African populations.229 These findings emphasize the importance of matching

the ethnicity of cases and controls in association studies to minimize spurious associations of population-

specific CNVs with disease.

2.6 Phenotypic impact of CNVs

The earliest known CNVs, usually large genomic deletions and duplications often encompassing many

genes, were invariably linked to significant genomic disorders. With the discovery of ubiquitous CNVs

in healthy controls, interpreting the functional significance of such genomic alterations became more

complex. Of note, many studies have observed a bias against genic CNVs in general, and large

genic deletions in particular265,232, suggesting that genomic alterations negatively impact fitness and

undergo purifying selection. Interestingly, there is also some evidence of positive selection (or potentially

reduced purifying selection266) acting on some genes, such as the salivary amylase gene AMY1 which

appears in higher copy number in humans than in other primates and which is found in higher copy

number in human populations with high-starch diets relative to populations with traditionally low-starch

diets.267 Alternatively, many common CNVs have been identified at high frequencies in all human

populations and appear to have only a modest effect, if any, on phenotype.

Early CNV surveys identified a large number of genes as copy number variable, but care must be

exercised in interpreting those results given the propensity of those early platforms to overestimate the

size of CNVs, and hence the actual number and identity of involved genes reported in earlier studies may

be inaccurate. However, even more recent studies, with the power to identify smaller CNVs with more

accurate breakpoints, have detected thousands of genes that are affected at least in part by deletions or

duplications. For example, Pang et al.238 reported an extensive analysis of the diploid genome of Dr.

Craig Venter based on multiple microarray and sequencing platforms, and they identified 189 genes

completely encompassed by gains or losses and an additional 4,867 genes whose exons were impacted by

CNVs. While they did find an overall paucity of CNVs affecting genes associated with autosomal

dominant or recessive diseases, cancer syndromes, imprinted and dosage-sensitive genes, 573 of the CNV

genes were in the Online Mendelian Inheritance in Man (OMIM) database. Conrad et al.232 used a

discovery cohort of 20 CEU and 20 YRI HapMap individuals to detect common CNVs using a high-

density CGH array, then genotyped 450 HapMap samples at approximately 5,000 common CNVs. On

average, they found 445/1,098 CNVs overlapping 622 genes between any two individuals, and they

identified 2,698 genes affected by CNVs in the total sample set. Over half of partial gene deletions were

predicted to induce frameshifts, and 267 genes appeared to be affected by unambiguous loss of function

CNVs. Genes affected by CNVs appeared to be enriched for extracellular functions such as cell adhesion,

recognition, and communication, whereas they appeared to be biased away from intracellular functions

such as metabolic and biosynthetic pathways. These results extended those of previous as well as

subsequent CNV surveys, which also reported enrichment of immune and defense responses as well as

neurological system processes.239,268,247 Those latter functions are also proposed to have been involved in

the adaptive differentiation of humans and chimpanzees.269

The exact contribution of CNVs to gene expression variability, and how they relate to SNPs, is unclear.

Stranger et al.270 interrogated the contribution of CNVs detected by Redon et al.206 on BAC-CGH array

and Affymetrix 500K array to gene expression variability in lymphoblastoid cell lines from 210 HapMap

samples (within a 2Mb CNV-gene), and found that 17.7% of 1,061 genes with expression variability were

associated with CNVs, with over half of the associations appearing to be long-range (i.e. the CNV did not

overlap the gene whose expression it appeared to impact). While 83.6% of variability was attributed to

SNPs, only 1.3% of genes were associated with both CNVs and SNPs. Schlattl et al.271 extended this

analysis of CNV-expression association by comparing normalized transcriptome data for lymphoblastoid

cell lines (LCLs) from 60 CEU and 69 YRI HapMap samples to CNV data published in the same samples

on multiple platforms (high-resolution tiling CGH array232, high-resolution SNP array219, and next-

generation sequencing data247). By concentrating on common CNVs and restricting to an effect range of

200kb or less, they found a significant association between CNVs and the expression of 110 genes.

Despite an abundance of deletions in the CNV set, Schlattl et al.271 found enrichment of duplications

among CNVs associated with variable expression, suggesting purifying selection acting against deletions

that impact gene expression. While comparing results from this analysis to previously published studies,

the authors were able to confirm several CNV-gene expression associations, including 6/13 that were

identified by Stranger et al.270 within the same effect range. Most of the CNV associations (70%)

occurred without overlap of the CNV with the respective gene, although the range of effect appeared to be

<100 kb in most cases. Interestingly, several intronic deletions were associated with gene expression, but

expression was decreased in only half of the cases, whereas it was increased in the other half. Such a mix

of positive and negative CNV effect on expression was also observed for the CNVs which did not directly

overlap genes. CNVs that overlapped exons or completely encompassed genes usually affected

expression in the same direction as the copy number change. Unlike Stranger et al.270 , Schlattl et al.271

found that most CNVs associated with gene expression (70%) overlap previously published SNP-

expression associations. This discrepancy in overlap likely reflects the differences in CNV characteristics

detectable by earlier platforms (more rare than common CNVs, biased away from common SNPs) relative

to the platforms used by Schlattl et al.271 Conrad et al.232 proposed that since most common genotyped

CNVs were well tagged by SNPs, it would be expected that SNP-based genome-wide association studies

would have already screened most common CNVs for association with common diseases. Based on the

finding by Conrad et al.232 that less than 5% of trait-associated SNPs in 279 publications were in linkage

disequilibrium > 0.5 with a nearby CNV and the additional finding by the Wellcome Trust Case Control

Consortium that only three CNV loci were reliably associated with one or more of eight common diseases (all

of which are tagged by SNPs that were previously detected in genome-wide association studies), the

authors of those papers argued that common genotyped CNVs do not explain a significant proportion of

heritability in common diseases. Nonetheless, the findings of Schlattl et al.271 indicate that a non-

negligible proportion of CNVs associated with gene expression variability do not link to SNPs, and

moreover 57% of genes with expression associated with CNVs were found to have a greater correlation

with their most strongly associated CNV than with any nearby SNP. This was especially true for CNVs

that overlap exons (10/10). Other studies of CNVs in mice, rats, and Drosophila have observed similar

impact of CNVs on gene expression.272-274

Many diseases have been associated with CNVs. Recurrent de novo microdeletions and

microduplications are linked to many sporadic genomic disorders such as Williams-Beuren syndrome,

Angelman syndrome/Prader-Willi syndrome, Charcot-Marie-Tooth disease 1A, and idiopathic mental

retardation.195 Rare CNVs (de novo or heritable) have been associated with neuropsychiatric disorders

such as autism spectrum disorder and schizophrenia; neurodegenerative diseases such as Parkinson

Disease275; and metabolic disorders such as obesity276, among others. Common heritable CNVs have

been associated with autoimmune and infectious diseases such as Crohn’s disease277, rheumatoid

arthritis278, diabetes mellitus278, psoriasis279, lupus280, and susceptibility to HIV infection281. Both rare as

well as common CNVs have also been associated with susceptibility to cancer, as discussed below.

Determining the pathogenicity of CNVs, and delineating the responsible gene(s) or genomic elements,

can be challenging. CNVs may affect phenotype in a number of ways, including: increasing or

decreasing copy number of dosage-sensitive genes; disrupting genes or producing fusion genes; position effect;

unmasking recessive alleles; affecting communication between alleles on homologous chromosomes.264

The effect of CNVs is also moderated by variable penetrance and expressivity.264 Some CNVs have been

associated with a wide range of phenotypes (e.g. 1q21.1 has been associated with dysmorphic features,

cardiac abnormalities, learning difficulties, mental retardation, autism, and schizophrenia)282; this may

reflect ascertainment bias due to the study design (e.g. phenotype-driven vs. genotype-driven)264 but may

also reflect variability in expressivity. Some studies have also demonstrated buffering effect in cells,

whereby the observed expression level of a given gene does not correspond linearly to the expected level

based on copy number.271,272 It should be noted that in addition to copy number, the phase information

and genomic context of CNVs is also important for understanding the potential effect of the variant.264

Other challenges in CNV research include distinguishing germline from somatic alterations. Many

studies used DNA from immortalized lymphoblast cell lines, and it has become apparent that some

structural variants occur exclusively in or may be amplified by the Epstein-Barr virus (EBV)

transformation process.278,283 Moreover, few studies addressed the issue of somatic mosaicism or

heterosomy (variants present in only a fraction of cells in the tissue/blood sample), since most

platforms/algorithms are not designed to identify the “partial” nature of these regions, and few studies

compared the genomes of different tissues from the same individual.207,212,284 One survey of large

structural variations in blood-derived DNA in 957 controls and 1,034 bladder cancer patients identified

mosaic structural variations in 1.7% of all individuals with no significant difference between cases and

controls.285 The regions most commonly found to be somatic or cell-line artifact are T cell receptors or

immunoglobulin genes, including loci at 2q11200, 2p11.2208,212, 22q11.2200,208,212, 14q32.3200,208,212, and

14q11.2212 as well as chromosomes 9 and 20.285 Interestingly, some studies identified copy-number

variation within monozygotic twin pairs, both phenotypically concordant as well as discordant,

suggesting post-twinning somatic development of CNVs.286,287,288

2.7 CNVs and cancer

Chromosomal aneuploidy, whether involving entire chromosomes, chromosomal arms, or segments of

chromosomes, is a characteristic feature of most solid malignant tumors. Chromosomal instability (CIN)

is the high rate of loss and gain of whole chromosomes and has been attributed to various mechanisms

that interfere with correct segregation of chromosomes during mitotic division.289 Chromosomal structure

instability (CSI) is another hallmark of most solid cancers, involving multiple chromosomal segmental

breakages and fusions associated with telomere shortening, inappropriate DNA repair of double-strand

breaks, and chromosomal fragile sites, resulting in amplifications or deletions of the involved genomic

regions. A “chicken-vs-egg” debate has revolved around the relationship of CIN and CSI with the

development of cancer: not all aneuploid cells are unstable or tumorigenic and certainly many copy

number alterations in tumors appear to be “passengers” rather than driver mutations. Nonetheless, there

is evidence for a role of CIN and CSI in cancer development, such as generating LOH at loci of inactivated tumor

suppressor genes or amplified oncogenes.290 Two decades ago, comparative genomic hybridization

(CGH) was developed to facilitate identifying regions of copy number gain and loss by hybridizing

biotinylated DNA from paired tumor and normal samples to metaphase chromosome spreads. Several

years later, array-based CGH was introduced and became a commonly used tool in the study of cancer

genomes. Later, SNP microarrays also came into use, providing the added advantage of detecting regions

of copy-neutral LOH and uniparental disomy. Very recently, the drop in cost of whole-genome and

exome sequencing has allowed the use of these technologies to identify a wide range of variants in

tumors, from single base to large structural variants.

In keeping with the classical Knudson two-hit hypothesis for inactivation of tumor suppressors, a number

of well-known tumor suppressor genes were first identified by analyzing focal homozygous deletions in

cancer in combination with linkage and/or LOH results (e.g. CDKN2A/B, PTEN, WT1, BRCA2). Those

discoveries spurred the identification of numerous candidate tumor suppressors by characterizing

recurrent deletions in tumors or cancer cell lines. Mouse studies have even suggested that

haploinsufficiency of some cancer genes can be sufficient to cooperate with other oncogenic alterations in

initiating tumor development (e.g. LKB1 and BRCA2 heterozygosity have been reported to accelerate

pancreatic tumor development in mice with activated Kras mutations). Similarly, genomic amplifications

in cancer can help identify candidate oncogenes. Moreover, some deletions and amplifications carry

prognostic significance (e.g. MYCN amplification in neuroblastoma, ERBB2 amplification in breast

cancer, 18q deletion in colon cancer), and whole-genome profiling of copy number alterations in tumors

can be diagnostic or prognostic (e.g. distinguishing gastrointestinal stromal tumors from

leiomyosarcomas291; aCGH classifier based on BRCA1-mutated breast cancer predicting sensitivity to

double-strand-DNA-break-inducing chemotherapy in patients without germline BRCA1/2 mutations292).

Structural rearrangements of pancreatic adenocarcinoma have been described in multiple studies, ranging

from cytogenetic karyotyping293 and microsatellite genotyping12,182,294 to CGH295-306 and SNP

microarrays37,307,308,309 to next-generation sequencing38. Certain patterns have emerged: all chromosomal

arms manifest genomic rearrangements, and the most frequently reported rearrangements are losses on

1p, 3p, 6p, 6q, 8p, 9p, 9q, 17p, 18q, 19p and gains on 8q. Some studies attempted to identify candidate

tumor suppressor genes or oncogenes, and while most results were of insufficient resolution to pinpoint a

target gene, certain genes were highlighted by multiple studies using a combination of genomic and

expression data (e.g. SMURF1 on 7q22.1301,303 and GATA6 on 18q11.2304,310 were proposed as novel

oncogenes). LOH is a common event across the pancreatic cancer genome, often occurring in the form of

whole chromosome loss, and there was no significant difference in the pattern of LOH between sporadic

and familial tumors.12,182 One recent study that used massive parallel sequencing technology to detect

variants at fine resolution in 3 primary tumors and 10 metastases reported significant inter-patient

heterogeneity in the number, type, and distribution of rearrangements.38 Interestingly, one sixth of all

rearrangements were in a pattern they termed “fold-back inversions”, whereby regions are duplicated but

with the duplications facing in opposite directions. This appeared to be an early event in the development

of pancreatic cancer and is associated with telomere loss. Moreover, sequence analysis of metastases

indicated that this type of rearrangement did not continue occurring later in the pancreatic cancer

developmental pathway, suggesting a reactivation of telomere repair function. Other interesting findings

from this analysis of somatic rearrangements in pancreatic cancer metastases were: evidence of ongoing

clonal evolution in the primary tumor among cells capable of initiating metastases (based on

finding some rearrangements only in some metastases), evidence for driver mutations involved in

metastatic spread (based on finding some rearrangements only in the metastases but not in the primary

tumor), and evidence for differences in evolution of metastases within each organ.

Less well studied than somatic genomic rearrangements in cancer is the relationship between germline

CNVs and cancer susceptibility. It is well known that moderate-to-high-penetrance rare germline CNVs

contribute to the heritability of familial cancer. Large germline genomic rearrangements that are absent

or rare in healthy populations have been reported as the cause of 15% of Familial Adenomatous Polyposis

(APC)311, 19% of Von Hippel Lindau disease (VHL)312, 4% of Hereditary Diffuse Gastric Cancer

(CDH1)313, 2-12% of Hereditary Breast and Ovarian Cancer (BRCA1 and BRCA2)314-320, 6-27% of Lynch

Syndrome (MSH2 & MLH1 genes)321,322, 16% of Peutz-Jeghers Syndrome (STK11)323, and 15% of

juvenile polyposis (SMAD4, BMPR1A, and PTEN)324 cases. Deleterious germline CNVs have also been

reported in non-BRCA1/2 associated familial breast cancer (PALB2325; BARD1326), Hereditary

Leiomyomatosis and Renal Cell Cancer (FH)327, Cowden disease (PTEN)328, Familial Atypical Multiple

Mole Melanoma (CDKN2A)329, Neurofibromatosis Type 1 (NF1)330, Ataxia Telangiectasia (ATM)331, Li

Fraumeni syndrome (TP53)332, familial retinoblastoma (Rb)333, and Multiple Endocrine Neoplasia Type 1

(MEN1)334. Interestingly, there are examples of copy number alterations at a distance from the coding

region of a gene influencing its expression, whether by affecting regulatory elements or by inducing

epigenetic changes that inactivate the gene. For example, in approximately 20% of suspected Lynch

syndrome cases with MSH2 loss but no detectable germline mutations or rearrangements in MSH2335

(about 1-3% of all Lynch Syndrome patients336), the causative mutation is a large heritable deletion at the

3’ end of the TACSTD1 gene, which causes transcriptional read-through and epigenetic silencing of the

adjacent MSH2 gene. In one juvenile polyposis kindred with 10 affected members who had no mutations

or rearrangements in the coding regions of SMAD4 and BMPR1A, Calva-Cerqueira et al.337 identified a

large deletion mapping 119kb upstream of the coding region of BMPR1A segregating with disease. The

deletion affected a promoter of BMPR1A and was demonstrated to diminish expression of the gene.

Common copy number polymorphisms at some genes linked to cancer have also been associated with

modest risk. For example, the glutathione-S-transferases (GSTs) constitute a family of genes involved in

drug and toxin metabolism and are thus hypothesized to protect cells against xenobiotics and oxidative

stress. Two of those genes, GSTT1 and GSTM1, have polymorphic deletions shown to correlate with

lowered enzyme activity. In one recent study that accurately quantified the copy number of those genes

in approximately 2,000 cancer patients and 8,000 controls, a gene dosage effect was demonstrated in

GSTT1 for prostate cancer in men and corpus uteri cancer in women, and in GSTM1 for bladder cancer.338

Another interesting association between a common copy number polymorphism and cancer was identified

in familial breast cancer for a deletion that eliminates exon 4 of MTUS1, a gene implicated as a tumor

suppressor. Interestingly, the common deletion was found to have a protective effect against breast

cancer, suggesting that the exon 4 deletion may paradoxically increase the tumor suppressor activity of

the gene (although this has yet to be demonstrated in functional studies).339

All of the aforementioned germline rearrangements were identified in targeted studies, commonly

utilizing PCR-based assays, which specifically searched for and/or quantified deletions or duplications at

or near known cancer genes in high-risk populations. The discovery of predisposition germline

rearrangements in cancer subjects without a priori knowledge of the region/gene of interest requires a

different approach. Most studies addressing this question have adopted two main strategies: genome-

wide CNV surveys in large cohorts of sporadic cancer patients and controls allow the identification of

statistically significant associations between common CNVs and a low-to-modest cancer risk;

alternatively, genome-wide CNV surveys in familial or hereditary cancer patients should facilitate the

detection of rare heritable CNVs (not previously published in controls nor present in a concurrently

studied control cohort) that potentially alter cancer genes and produce a modest-to-high risk of cancer.

Genome-wide case-control CNV association studies have identified candidate risk alleles for several

sporadic cancers: neuroblastoma in a Caucasian population (deletion at 1q21.1, OR=2.49, p=2.97 x 10^-17)340,

aggressive prostate cancer in Caucasian populations (deletion at 2p24.3, OR=1.31, p=0.006;

deletion at 20p13, OR=1.17, p=2.75 x 10^-4)341,342, and nasopharyngeal carcinoma in Han Chinese males

(deletion at 6p21.3, OR=18.92, ).343 Most recently, Huang et al.344 identified a common 10,379bp

deletion at 6q13 that was found to be higher in frequency in sporadic pancreatic cancer Han Chinese

patients compared to controls, and confirmed via a qPCR assay to have an odds ratio of 1.31 for 1-copy

carriers compared to 2-copy carriers. All those studies replicated their results in a confirmation cohort

and used ethnicity-matched cases and controls, and all but Diskin et al.340 used a PCR-based assay as the

confirmation assay; Diskin et al.340 applied multiple-testing correction to verify the statistical significance

of their results. Three of the identified CNVs overlapped genes: The neuroblastoma CNV overlapped a

novel transcript that demonstrated high sequence homology to the neuroblastoma breakpoint family

(NBPF) genes, was shown to correlate in expression with copy number, and was highly expressed in fetal

brains. The prostate cancer CNV at 20p13 differentially affects isoforms of the SIRPB1 gene, which

codes for a signal regulatory protein. The CNV at 6p21.3 encompassed MICA, a major histocompatibility

complex (MHC) class I chain-related gene which functions to mediate natural killer (NK) cell activation and

T-lymphocyte costimulation and which has been associated with nasopharyngeal cancer in previous studies.

The pancreatic cancer CNV at 6q13 and the prostate cancer CNV at 2p24.3 are non-genic and are

hypothesized to impact risk through long-range regulatory effects on an unidentified gene. Indeed,

functional analysis of the non-genic deletion associated with pancreatic cancer suggested that it may be

involved in long-range regulation of CDKN2B, an established tumor-suppressor gene. While these results

are interesting, they remain to be further validated in future studies. Some analyses may be confounded

by inaccurate genotyping of the CNV of interest: for example, the Database of Genomic Variants has

reports of gains as well as deletions at several of these putative cancer-associated CNVs, suggesting that

they may not be simple biallelic variants. Moreover, previous studies of CNVs in Asian populations232,239

reported higher frequencies of the deletion at 6p21 in controls than was identified in the population

studied in the nasopharyngeal carcinoma study. This is particularly significant because the odds ratio

identified for the 6p21 deletion (~19) was much higher than for any other common CNV or SNP

associations, and it may in fact be an overestimation if the deletion was undercalled in controls.
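
The concern about undercalling can be illustrated with a short calculation. The following is a minimal sketch only: the carrier counts are hypothetical values chosen for illustration and are not taken from any of the cited studies, and the function name is my own. It shows how an odds ratio computed from a 2x2 table rises when a deletion is missed in some control genomes.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio for carrying a deletion in cases versus controls (2x2 table)."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)

# Hypothetical counts (assumption, for illustration only).
cases = (60, 940)            # carriers, non-carriers among 1,000 cases
controls_called = (5, 995)   # control carriers as called by the genotyping assay
controls_true = (20, 980)    # control carriers if the deletion had not been undercalled

or_called = odds_ratio(*cases, *controls_called)
or_true = odds_ratio(*cases, *controls_true)

print(f"OR with undercalled controls: {or_called:.2f}")
print(f"OR with corrected controls:   {or_true:.2f}")
```

In this toy example, missing three quarters of the control carriers inflates the odds ratio roughly four-fold, which is why accurate genotyping of the CNV in both groups matters for interpreting very large effect sizes.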

A few studies have been published surveying germline CNVs in familial solid cancer patients, and

although they have proposed several candidate predisposition genes based on overlap with patient-

specific CNVs, none to date have been able to show a significant contribution or segregation with disease

of any one gene to those cancer syndromes. One of the earliest studies analyzed 57 predominantly

Caucasian pancreatic cancer patients from 56 high-risk kindreds (each containing at least a pair of

affected first-degree relatives) using an oligonucleotide-based CGH platform, filtering out losses or gains

that were also identified in 607 mostly Caucasian controls (372 were analyzed in the same study, and 235

were previously reported in two other studies).345 Twenty-five losses overlapping 81 genes and 31 gains

overlapping 425 genes were identified specific to the cancer patients, and those genes were presented as

potential candidate predisposition genes. Due to lack of sufficient related samples, the authors were

unable to demonstrate heritability or segregation with disease of the patient-specific CNVs. Moreover,

the resolution of the CGH array used in this study was relatively lower than current platforms

(approximately 30kb), which resulted in relatively large CNV calls that likely overestimated the actual

breakpoint boundaries of rearrangements. Furthermore, the control data available at the time of

publication was limited, so some of the supposedly familial pancreatic cancer (FPC)-specific CNVs were

identified in control populations in subsequent studies. The abstract of the paper refers to two deletions

that were observed in two different patients and one deletion that was observed in three different

individuals, yet no discussion of these regions is found in the main text of the manuscript. If such regions

were truly found to be recurrent in patients and absent in controls, they would be of particular interest as

candidate predisposition CNVs, but we cannot draw any conclusions given the paucity of information

provided.
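
The filtering step used in this kind of study design can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the cited study: the `CNV` record, the rule that any overlap of the same type (loss or gain) on the same chromosome counts as "also identified in controls", and the coordinates are all hypothetical.

```python
from typing import NamedTuple

class CNV(NamedTuple):
    chrom: str
    start: int  # inclusive coordinate
    end: int    # inclusive coordinate
    kind: str   # "loss" or "gain"

def overlaps(a, b):
    """True if two CNVs of the same type share at least one base on the same chromosome."""
    return (a.chrom == b.chrom and a.kind == b.kind
            and a.start <= b.end and b.start <= a.end)

def case_specific(case_cnvs, control_cnvs):
    """Keep only case CNVs that do not overlap any control CNV of the same type."""
    return [c for c in case_cnvs if not any(overlaps(c, k) for k in control_cnvs)]

# Hypothetical calls (assumption, for illustration only).
case_cnvs = [
    CNV("chr1", 1000, 5000, "loss"),
    CNV("chr2", 200, 900, "gain"),
    CNV("chr3", 100, 400, "loss"),
]
control_cnvs = [
    CNV("chr1", 4000, 8000, "loss"),
    CNV("chr2", 200, 900, "loss"),
]

specific = case_specific(case_cnvs, control_cnvs)
print(specific)
```

Under an any-overlap rule, low-resolution calls with overestimated boundaries are more likely to be filtered out or to appear to overlap genes spuriously, and the result also depends directly on how complete the control set is, which are the two limitations noted above.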

Two other studies similarly provided a list of candidate genes in familial cancer. Yoshihara et al.346

compared 68 Japanese subjects with germline BRCA1 mutations (including 51 subjects with ovarian

cancer), 34 sporadic ovarian cancer patients, and 47 healthy controls, and they identified 31 CNVs

specific to the BRCA1-mutation group. All 31 CNVs overlapped genes, and three CNVs segregated with

ovarian cancer in affected members of the same family (of which two CNVs were present in two different

families each). No significant difference was found in the per-genome total number of CNVs between

BRCA1-mutation carriers and controls, although the number of deletions was higher in the BRCA1-

mutation subjects. Otherwise, they found no evidence for differential clustering of the global CNV data

between groups, and no correlation of age at diagnosis with CNV frequency. Since the BRCA1 gene was

already identified as the primary genetic mutation in this study, the list of genes overlapped by CNVs

represented potential modifying genes that may contribute to the unique biological characteristics of

BRCA1-mutated ovarian cancer. Venkatachalam et al.347 studied 41 young-onset and/or familial

colorectal cancer patients with microsatellite-stable tumors and identified four losses and three gains in six

patients (one patient had a loss and a gain) which were not present in a large control cohort nor reported

in previous control studies. Each CNV overlapped at least one gene and each was detected in a single

patient only.

A study by Shlien et al.348 presented an intriguing perspective of the connection between germline CNVs

and somatic tumor development in TP53 germline mutation carriers. They studied 53 Li-Fraumeni family

members (20 with wildtype TP53, 23 with TP53 mutations and history of cancer, and 8 with TP53

mutations and no cancer) and 70 unrelated healthy controls, and demonstrated a significantly elevated

frequency of germline CNVs in the TP53 mutation carriers relative to controls with wild-type TP53.

There was also a trend for a higher frequency of germline CNVs in cancer patients carrying TP53

mutations relative to mutation carriers without a history of cancer, but this did not reach statistical

significance possibly due to the small sample size. Furthermore, not only was the number of individual

CNVs elevated in mutation carriers but the number of copy-number variable bases was also higher, even

when the absolute number of CNVs was not, due to a tendency toward larger CNVs in the TP53 mutation

cohort. Comparison between germline and choroid plexus tumor DNA in four patients identified 15/21

loci overlapping germline CNVs that became substantially larger in the paired tumors, and three of four

tumors had loci at which a germline hemizygous deletion had progressed to homozygous deletion. These

findings suggested a model of tumor development in Li-Fraumeni syndrome in which germline genomic

instability (manifested as a higher than average CNV frequency) predisposes to additional genomic

rearrangements and/or expansion of germline CNVs in somatic tissue, affecting genes that drive the

development of cancer. The authors also report a list of cancer-related genes overlapped by germline

CNVs in the TP53-mutation carriers which may act synergistically with the TP53 mutation in promoting

cancer development. Of course, the role of TP53 in maintaining the genome is well known349, and it is

not surprising to find that even non-malignant cells exhibit increased genomic instability in Li-Fraumeni

patients. However, it is unclear if this phenomenon applies to other tumor suppressor genes that

predispose to familial cancer. Future surveys of CNV burden in other cancer syndromes would shed

more light on this question.
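
The distinction drawn above between the number of CNVs and the number of copy-number variable bases can be made concrete with a small sketch. The function name and the example coordinates are hypothetical and only illustrate the two burden measures; they do not reproduce the analysis of Shlien et al.

```python
def cnv_burden(cnvs):
    """Return (number of CNV calls, number of distinct copy-number variable bases).

    Each CNV is a (chrom, start, end) tuple with inclusive coordinates.
    Bases covered by more than one call are counted once.
    """
    count = len(cnvs)
    total_bases = 0
    prev_chrom, prev_end = None, 0
    for chrom, start, end in sorted(cnvs):
        if chrom != prev_chrom:
            # First call on a new chromosome: nothing is covered yet.
            prev_chrom, prev_end = chrom, start - 1
        # Count only bases not already covered by an earlier call.
        new_start = max(start, prev_end + 1)
        if end >= new_start:
            total_bases += end - new_start + 1
            prev_end = end
    return count, total_bases

# Hypothetical genomes (assumption, for illustration only).
genome_a = [("chr1", 100, 199), ("chr1", 150, 249), ("chr2", 1, 50)]
genome_b = [("chr1", 1, 1000), ("chr5", 1, 500)]

print(cnv_burden(genome_a))
print(cnv_burden(genome_b))
```

Here the second genome has fewer CNV calls but more variable bases, which is the situation described for carriers whose CNVs tend to be larger.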

3. Whole-Exome Sequencing

The human genome is comprised of approximately 3 billion base pairs, of which less than 2% code for

proteins. The release of the first reference build of the human genome in 2003, after a 13-year

collaborative international effort, opened the door to significant advancements in understanding the

genetic and genomic makeup of individuals, populations, and cancers. The Human Genome Project

expanded understanding of the identity and population frequency of SNPs, the most frequently occurring

variant in the human genome, and efforts to determine haplotype structure (blocks of SNPs present in

different combinations and segregating in populations) have accelerated progress in the fields of

population genetics, human evolution, and disease-gene associations.

The original sequencing effort was based on the technique developed by Frederick Sanger in the 1970s,

utilizing labeled dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators and separating

terminated chains of various lengths by gel electrophoresis to determine base order in the sequence.

High-throughput requirements of the DNA sequencing effort drove the development of automated

capillary electrophoresis and other laboratory process automation. The International Human Genome

Sequencing Consortium (IHGSC) employed a “hierarchical shotgun sequencing” approach that involved

fragmenting and cloning DNA (initially using yeast artificial chromosomes, then subsequently bacterial

artificial chromosomes), mapping clones on the physical map of the genome with the help of established

genomic markers, shot-gun sequencing clones, and finally aligning sequenced fragments to the

developing map.350 In the last few years of the IHGSC project, a competing effort undertaken by Craig

Venter’s company CELERA utilized a “whole genome shotgun sequencing” approach which was

considered by Venter to be more efficient and faster, although CELERA did end up incorporating

publicly available data that was generated by the IHGSC to allow accurate mapping of sequenced

fragments due to the difficulty of mapping to highly repetitive regions of the genome (which constitute a

large portion of the human genome) without the use of additional genome map information.350,351 The

approximate cost of sequencing the first reference human genome was $3 billion. Importantly, neither the

IHGSC nor the CELERA genomes was the sequence of a single diploid genome but rather each was a

haploid consensus sequence of DNA derived from several anonymous individuals of different ancestries

(although the IHGSC sequence was primarily based on a single male individual, and the CELERA

reference sequence may have included Craig Venter’s genome). Building on the data discovered from the

reference human genome, the International HapMap Project set out to identify common SNPs (defined as

minor allele frequency (MAF) >1%, although most identified by this project have a MAF >5%) and

their haplotype structure in members of different populations.352 This important source of information

allowed the development of genotyping arrays for genome-wide association studies.

Only four years after the release of the nearly complete human reference genome, the first diploid human

genome sequence was published. It belonged to Craig Venter and was generated using the CELERA whole-genome shotgun

sequencing method; it cost $70-100 million and was completed in about 4 years. (The cost estimate

incorporates costs incurred during the development of the CELERA reference genome).209 While this

sequence presented an interesting perspective on the makeup of individual genomes, it is also clear that

many more genomes need to be sequenced before the full potential of genomic analysis and comparisons

among individuals can be realized.

Making whole-genome sequencing possible for many genomes required a dramatic reduction in cost and

increase in the speed of the process. To that end, the development of massively-parallel next-generation

technologies presented a breakthrough in genomics. Since publication of the first sequencing-by-

synthesis technology in 2005353, a number of different platforms have been developed. While they

employ different techniques of sequencing (Illumina and Roche/454 use DNA polymerase-based

sequencing-by-synthesis approaches while ABI SOLiD uses DNA ligase-based sequencing by ligation),

all are based on clonal cluster amplification of target molecules to generate a sufficiently strong signal.354

The first human genome to be fully sequenced by a massively-parallel platform belonged to James

Watson, co-discoverer of the DNA double helix.218 In a demonstration of the significantly increased

power of next-generation sequencers, the Watson genome was sequenced in 4.5 months and this effort

cost less than $1.5 million.355 Since then, many other individuals of different ancestries have been

sequenced.209,218,222,223,227,228,230,239,243,356,357,358,359 The 1000 Genomes project is an endeavour to sequence

the genomes of 2,500 unidentified individuals from 29 populations to discover, genotype, and accurately

identify haplotypes, with the overarching goal of characterizing 95% of variants with allele frequency of

1% or greater in genomic regions that can be sequenced by the most recently available next-generation

platforms.246 To date, three pilot projects have been completed: (1) low-coverage sequencing (2-4x) of

the whole genome of 180 individuals – to provide data on SNPs of 1% or higher frequency; (2) deep

sequencing (20-60x) of the whole genomes of two mother-father-adult child trios – to allow quality control of data

from pilot project (1) and inference of haplotypes; (3) targeted capture and deep sequencing (50x) of ~8,000

exons from approximately 900 randomly selected genes – to test the effectiveness of targeted capture

sequencing in identifying common, low-frequency, and rare variants in protein-coding regions of the

genome. The main project involves low-depth sequencing (4x) of the whole genome of 2,500 individuals

as well as deeper sequencing of their exomes by the target-enrichment method (See below for more detail

on exome sequencing).

Whereas the Sanger-based automated sequencers generated approximately 100 kbp of data per day on a

single machine, the earliest next-generation platform increased the output by two orders of magnitude and

this was very quickly surpassed by further developments of other platforms with larger output, and a

single sequencer in 2011 produces around 40 Gbp per day.360,361 An important distinction between

Sanger-based and next-generation sequencers is the read length: 700-1000 bp for capillary Sanger

sequencers compared to 75-400bp in next-generation sequencers, depending on the platform. The cost of

whole-genome sequencing has dropped significantly, currently as low as $5000-$10000. Interestingly,

while the cost of generating a genome sequence has dropped dramatically, the capacity to analyze the data

has advanced less rapidly. Some challenges have included the inadequate adaptation of software

originally designed for alignment and variant calling of Sanger sequencing and the need for newer

validated software packages that can handle the significantly larger quantity of data that is generated with

newer platforms.362 The relatively short reads have also posed a problem for de novo genome assembly

and correct alignment to repetitive or highly homologous regions. In recent years, “third-generation”

sequencing methodologies have been introduced, characterized by the ability to directly sequence single

molecules without needing to amplify the template.363 Those newest methods of sequencing may address

some of the limitations of next-generation sequencers (e.g. they appear to generate longer reads

approximating the length obtainable by the Sanger capillary sequencers) but they have their own

challenges, such as higher raw read error rate from the single molecule sequencing approach. As such,

ongoing improvements in both sequencing technologies as well as bioinformatic tools will be necessary

to achieve the most cost-effective means of sequencing large numbers of genomes for disease gene

discovery and clinical diagnostic purposes. (I am not addressing other applications of next-generation

sequencing such as transcriptomics, epigenomics, and chromatin immunoprecipitation sequencing (ChIP-

seq) as they are outside the scope of this thesis).
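To put the throughput figures quoted above in perspective, the following is a minimal arithmetic sketch. The per-day outputs come from the text; the genome size (about 3.1 Gbp) and the 30x target depth are assumptions chosen only for illustration.

```python
# Illustrative throughput arithmetic; genome size and depth are assumptions.
GENOME_SIZE_BP = 3.1e9       # assumed approximate size of the human genome
TARGET_DEPTH = 30            # assumed whole-genome depth (not taken from the text)

SANGER_BP_PER_DAY = 100e3    # ~100 kbp/day on a single machine (from the text)
NGS_2011_BP_PER_DAY = 40e9   # ~40 Gbp/day on a single 2011 sequencer (from the text)


def days_to_sequence(bp_per_day, genome_size=GENOME_SIZE_BP, depth=TARGET_DEPTH):
    """Machine-days needed to generate `depth`-fold raw coverage of the genome."""
    return genome_size * depth / bp_per_day


sanger_days = days_to_sequence(SANGER_BP_PER_DAY)
ngs_days = days_to_sequence(NGS_2011_BP_PER_DAY)

print(f"Sanger: {sanger_days:,.0f} machine-days")
print(f"2011 next-generation sequencer: {ngs_days:.1f} machine-days")
```

Under these assumptions, the calculation only counts raw bases generated; it ignores read quality, duplicate reads, and the analysis time discussed above.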

The cost of whole-genome sequencing has not yet reached the promised “$1,000-genome” level that has

been identified as a goal for the genomic community, particularly if post-sequencing analysis cost is taken

into consideration; moreover, much of the information identified in a whole genome remains difficult to

evaluate in terms of functional impact on disease or phenotype since only 1-2% of the entire genome has

been annotated as protein-coding. Indeed, to date, several reports of whole-genome sequencing in disease

cases have been published but invariably they focus on coding region variants to identify candidate

causative genes.364-371 These two current limitations of whole-genome sequencing (cost and functional

annotation of the genome) have made exome-sequencing an attractive alternative for researchers. Exome

sequencing is based on capturing and subsequently amplifying and sequencing the coding region of the

genome using massively-parallel sequencing. Since the target region in exome sequencing is less than

2% of that in whole-genome sequencing, it is possible to obtain much greater read-depth per base per run.

This means that more samples can be sequenced in the same amount of time and for the same price as a

single whole genome. A number of methods of target enrichment have been introduced, including both

solid-phase (e.g. Nimblegen Sequence Capture Human Exome 2.1M array) as well as in-solution

oligonucleotide arrays (e.g. Agilent SureSelect System).372,373 The latest arrays can capture up to 44-

50Mb of genomic sequence, encompassing most of the annotation of the Collaborative Consensus Coding

Sequence (CCDS 2009)374 database and flanking base pairs of target regions as well as microRNAs and

other non-coding RNAs. It should be noted that, although the coverage of exome sequencing for coding

regions and adjacent regulatory sequences is excellent, it is not perfect: the success of capture varies

to some extent between arrays and with sequence-specific characteristics such as high GC content.375
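The relationship between target size and read depth described in this paragraph can be sketched with simple arithmetic. The 50 Mb target is the upper end of the array sizes quoted above; the run yield, the on-target fraction, the genome size, and the 50x per-exome depth are assumptions made only for illustration.

```python
# Illustrative depth arithmetic; all values marked "assumed" are hypothetical.
GENOME_SIZE_BP = 3.1e9     # assumed approximate size of the human genome
EXOME_TARGET_BP = 50e6     # upper end of the 44-50 Mb capture size quoted in the text
RUN_YIELD_BP = 40e9        # assumed yield of a single run
ON_TARGET_FRACTION = 0.6   # assumed fraction of reads mapping to the capture target
DEPTH_PER_EXOME = 50       # assumed desired mean depth per exome


def mean_depth(yield_bp, target_bp, usable_fraction=1.0):
    """Mean depth = usable sequenced bases divided by the size of the target."""
    return yield_bp * usable_fraction / target_bp


target_share = EXOME_TARGET_BP / GENOME_SIZE_BP
wgs_depth = mean_depth(RUN_YIELD_BP, GENOME_SIZE_BP)
exome_depth = mean_depth(RUN_YIELD_BP, EXOME_TARGET_BP, ON_TARGET_FRACTION)
exomes_per_run = int(exome_depth // DEPTH_PER_EXOME)

print(f"Exome target as share of genome: {target_share:.1%}")
print(f"Whole-genome depth from one run: {wgs_depth:.1f}x")
print(f"Single-exome depth from one run: {exome_depth:.0f}x")
print(f"Exomes per run at {DEPTH_PER_EXOME}x each: {exomes_per_run}")
```

The same yield that gives only modest depth across a whole genome can therefore be divided among several exomes, which is the basis of the cost argument made above.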

The first description of a human exome was based on the coding variants identified in the previously

published diploid genome of Craig Venter (HuRef).376 The authors reported that most nonsynonymous

SNPs are common (15-20% are rare and ~95% of the rare variants were heterozygous). They also

identified 105 premature-terminating codons, many of which are common and do not appear to be under

negative selection. They noted that many of these variants were present in duplicated genes and

hypothetical genes, suggesting that their impact in this setting may be less deleterious. They also noted

that half of all coding indels occurred in tandem repeats, and tended to occur at the C and N termini of

genes and/or near exon boundaries (which in some cases were considered likely mapping errors in the

reference genome). There was a bias toward indels composed of multiples of 3 bases (3n) in coding

regions that are likely to be functionally significant, suggesting purifying selection acting on frameshift

indels in those regions. Of additional importance, the authors noted that the Venter genome contained at

least 680 nonsynonymous SNPs affecting 443 genes with some association with disease, including 7 that

were in dbSNP and OMIM database, which foreshadowed the challenge that would be encountered in

interpreting the clinical significance of coding variants as more genomes and exomes are sequenced.

The first report of target-captured exome sequencing using next-generation sequencing was published in

2009 by Ng et al.377, describing the exomes of 8 HapMap individuals whose genomes were previously

characterized by sequencing fosmid-clones to identify structural variants. In addition, in a proof of

concept experiment, the exomes of four unrelated individuals with a rare autosomal dominant disorder

(Freeman-Sheldon Syndrome) caused by MYH3 mutations were sequenced to demonstrate a filtering

strategy that would identify the causative gene. The average depth of coverage was 51x, translating into

95% of coding bases in 78% of genes being successfully called (based on a threshold of ≥ 8x depth per

base required to reliably call a heterozygous variant). The estimated average number of truncating single

base variants per genome was higher in African than non-African genomes (20/African vs. 10/non-

African), and a similar ratio was observed for rare frameshift indels (17/African vs. 8/non-African). As

was observed in the Venter exome, most indels in coding regions were non-frameshift. To identify the

causative gene in the four Freeman-Sheldon Syndrome patients, the authors filtered variants to focus on

non-synonymous and/or splice-site variants or indels that were not previously reported in dbSNP or found

in the 8 HapMap exomes, and which were in the same gene in all four affected patients. This approach

reduced the number of candidate genes to precisely one, namely MYH3. A subsequent study applied the

same filtering strategy to successfully identify the unknown genetic cause of a rare autosomal recessive

Mendelian disorder (Miller Syndrome), the first of approximately 90 such studies to be published in quick

succession over a period of 24 months (Table 3). Currently ongoing large-scale projects employing

exome sequencing include the 1000 Genomes Project (which aims to sequence the exomes of 2,500

anonymous individuals) as well as the Exome Sequencing Project, which aims to discover variants

relevant to heart, lung, and blood diseases and has to date sequenced the exomes of nearly 5,400

individuals from multiple study cohorts (the project plans to sequence approximately 7,000 exomes).
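The filtering strategy described above for the Freeman-Sheldon Syndrome exomes can be sketched as a set intersection. This is a minimal illustration and not the published pipeline: the `Variant` record, the effect labels, the variant keys, and the gene names (GENE_A to GENE_D) are all hypothetical.

```python
from dataclasses import dataclass

# Effect classes retained by the filter (labels are hypothetical).
PROTEIN_ALTERING = {"nonsynonymous", "splice_site", "indel"}


@dataclass(frozen=True)
class Variant:
    gene: str    # gene symbol
    key: str     # hypothetical identifier, e.g. "chrom:pos:ref:alt"
    effect: str  # predicted effect class


def candidate_genes(affected_exomes, known_variant_keys):
    """Genes carrying at least one novel protein-altering variant in every affected exome.

    The variants need not be identical between individuals; only the gene must be shared.
    """
    per_sample_genes = []
    for variants in affected_exomes:
        genes = {
            v.gene
            for v in variants
            if v.effect in PROTEIN_ALTERING and v.key not in known_variant_keys
        }
        per_sample_genes.append(genes)
    return set.intersection(*per_sample_genes) if per_sample_genes else set()


# Toy data: variants already seen in dbSNP or in control exomes.
known = {"1:100:A:G"}
patient1 = [
    Variant("GENE_A", "17:10:C:T", "nonsynonymous"),
    Variant("GENE_B", "1:100:A:G", "nonsynonymous"),  # known variant, removed
    Variant("GENE_C", "2:5:G:A", "synonymous"),       # not protein-altering, removed
]
patient2 = [
    Variant("GENE_A", "17:22:G:A", "splice_site"),
    Variant("GENE_B", "1:100:A:G", "nonsynonymous"),
]
patient3 = [
    Variant("GENE_A", "17:10:C:T", "nonsynonymous"),
    Variant("GENE_D", "3:7:T:C", "indel"),
]

candidates = candidate_genes([patient1, patient2, patient3], known)
print(sorted(candidates))
```

In this toy example only GENE_A survives all three filters, mirroring how the approach narrows the search as more unrelated affected individuals are added.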

Table 3 – Studies using exome-sequencing to identify genetic cause of disease

Columns: Authors; Year; Journal; Disease; Inheritance (AD = autosomal dominant, AR = autosomal recessive, or sporadic); Description

Vissers et al.378 2010 Nat Genet Mental Retardation Sporadic Studied 10 trios; identified de novo mutations as potential cause for unexplained mental retardation

Walsh et al.379 2010 Am J Hum Genet

Nonsyndromic Hearing Loss

AR Combined homozygosity mapping in consanguineous family with exome sequencing to identify DFNB82 as cause

Lalonde et al.380 2010 Hum Mutat Fowler Syndrome AR Identified compound hets in FLVCR2 in two fetuses from consanguineous families

Pierce et al.381 2010 Am J Hum Genet

Perrault Syndrome AR Identified compound hets in HSD17B4 in two sisters

Ng et al.382 2010 Nat Genet Kabuki Syndrome AD Studied 10 unrelated affected subjects; identified MLL2 as cause

Bilguvar et al.383 2010 Nature Malformation of Cortical Development

AR Combined homozygosity mapping and exome sequencing in family with two affected members; identified WDR62 as cause

Gilissen et al.384 2010 Am J Hum Genet

Sensenbrenner Syndrome

AR Identified compound hets in WDR35 in two unrelated affected subjects

Krawitz et al.385 2010 Nat Genet Hyperphosphatasia Mental Retardation Syndrome

AR Performed identity-by-descent filtering on exome data to identify PIGV as cause in 3 affected siblings of nonconsanguineous family

Anastasio et al.386

2010 Am J Hum Genet

Van Den Ende-Gupta Syndrome

AR Combined homozygosity mapping with exome sequencing to identify SCARF2 as cause in 4 affecteds from 3 consanguineous families

Johnson et al.387 2010 Am J Hum Genet

Brown-Vialetto-van Laere Syndrome

AR Identified C20orf54 as cause in three affected siblings

Sirmaci et al.388 2010 Am J Hum Genet

Michels Syndrome AR Combined homozygosity mapping with exome sequencing to identify

MASP1 as cause in 3 individuals from 2 consanguineous families

Haack et al.389 2010 Nat Genet Isolated complex I deficiency

AR Identified compound hets in ACAD9 in single affected individual

Wang et al.390 2010 Brain Spinocerebellar ataxia AD Combined linkage analysis with exome sequencing in a Chinese family with 4 affecteds; identified TGM6 as cause

Musunuru et al.391

2010 NEJM Combined hypolipidemia

AR Identified compound hets in ANGPTL3 in 2 affected sibs

Johnson et al.392 2010 Neuron ALS AD Combined linkage analysis with exome sequencing in 2 affected relatives, identified VCP as cause

Bolze et al.393 2010 Am J Hum Genet

Autoimmune lymphoproliferative syndrome (ALPS)

AR Found homozygous variants in FADD

Liu et al.394 2011 PLoS One Moyamoya disease AD Combined linkage analysis with exome sequencing to identify RNF213

Zuchner et al.395 2011 Am J Hum Genet

Retinitis pigmentosa AR Identified homozygous variants in DHDDS

Glazov et al.396 2011 PLoS Genet Anauxetic dysplasia-like condition

AR Identified compound hets in POP1

Worthey et al.397 2011 Genet Med Inflammatory bowel disease

AR Identified hemizygous variant on X chromosomes (XIAP)

Simpson et al.398 2011 Nat Genet Hajdu-Cheney Syndrome

AD Exome sequencing of 3 unrelated affecteds identified NOTCH2

Becker et al.399 2011 Am J Hum Genet

Osteogenesis imperfecta

AR Identified homozygous variants in SERPINF1 in 2 affected sibs

Ostergaard et al.400

2011 J Med Genet Primary lymphoedema

AD Combined linkage analysis with exome sequencing to identify GJC2

Caliskan et al.401 2011 Hum Mol Genet

Non-syndromic mental retardation

AR Combined homozygosity mapping with exome sequencing to identify TECR

Erlich et al.402 2011 Genome Res Hereditary spastic paraparesis

AR Combined homozygosity mapping with exome sequencing to identify KIF1A

Sundaram et al.403

2011 Ann Neurol Tourette syndrome/chronic tic phenotype

AD Identified OFCC1 as cause

Puente et al.404 2011 Am J Hum Genet

Hereditary Progeroid Syndrome

AR Identified homozygous mutations in BANF1

Vissers et al.405 2011 Am J Hum Genet

Chondrodysplasia and abnormal joint development syndrome

AR Identified homozygous variants in IMPAD1 in three affected unrelated individuals

O’Sullivan et al.406

2011 Am J Hum Genet

Amelogenesis imperfecta and gingival hyperplasia syndrome

AR Combined homozygosity mapping with exome sequencing to identify FAM20A

Gotz et al.407 2011 Am J Hum Genet

Infantile hypertrophic mitochondrial cardiomyopathy

AR Identified compound heterozygous mutations in mtAlaRS

Shi et al.408 2011 PLoS Genet Myopia AD Identified mutations in ZNF644 in 2 relatives

Klein et al.409 2011 Nat Genet Hereditary sensory neuropathy with dementia and hearing loss

AD Combined linkage with exome data to identify mutations in DNMT1

Barak et al.410 2011 Nat Genet Malformations of occipital cortical development

AR Identified homozygous mutation in single affected child of consang parents

O’Roak et al.411 2011 Nat Genet Autism Sporadic Identified 11 de novo protein-altering mutations, some genes previously connected to autism

Alvarado et al.412 2011 J Bone Joint Surg Am

Distal arthrogryposis type 1

AD Identified MYH3 as cause

De Greef et al.413 2011 Am J Hum Genet

Immunodeficiency, centromeric instability, and facial anomalies

AR Combined homozygosity mapping with exome sequencing to identify ZBTB24

Yamaguchi et al.414

2011 J Bone Miner Res

Primary failure of tooth eruption

AD Combined linkage with exome sequencing to identify PTH1R as cause

Zhou et al.415 2011 Hum Mutat Hereditary hypotrichosis simplex

AD Combined linkage with exome sequencing to identify RPL21 as cause

Le Goff et al.416 2011 Am J Hum Genet

Geleophysic and acromicric dysplasia

AD Identified FBN1 as candidate gene in 5 patients

Hanson et al.417 2011 Am J Hum Genet

3-M syndrome AR Combined homozygosity mapping with exome sequencing to identify mutation in CCDC8

Vilarino-Guell et al.418

2011 Am J Hum Genet

Late-onset Parkinson AD Identified mutation in VPS35

Zimprich et al.419 2011 Am J Hum Genet

Late-onset Parkinson AD Identified VPS35 as cause (different patients from Vilarino-Guell)

Sergouniotis et al.420

2011 Am J Hum Genet

Leber congenital amaurosis

AR Combined homozygosity mapping with exome sequencing to identify KCNJ13 as cause

Albers et al.421 2011 Nat Genet Gray Platelet Syndrome

AR Identified NBEAL2 as cause

Sanna-Cherchi et al.422

2011 Kidney Int Steroid-resistant nephrotic syndrome

AR Combined homozygosity mapping with exome sequencing in 3 affected sibs

of consang parents to identify homozygous mutations in MYO1E and NEIL1

Liu et al.423 2011 J Exp Med Chronic mucocutaneous candidiasis disease

AD Identified mutations in STAT1 as cause

Yariz et al.424 2011 Fertil Steril Empty Follicle Syndrome

AR Identified homozygous mutation in LHCGR in 2 sisters

Xu et al.425 2011 Nat Genet Schizophrenia Sporadic Identified 40 rare de novo protein altering mutations in 40 genes (in 27 cases), including DGCR2, a gene in schizophrenia-predisposing region 22q11.2

Sirmaci et al.426 2011 Am J Hum Genet

KBG syndrome AD Identified ANKRD11 as cause

Shaheen et al.427 2011 Am J Hum Genet

Adams-Oliver syndrome

AR Combined homozygosity mapping with exome sequencing to identify homozygous mutations in DOCK6

Noskova et al.428 2011 Am J Hum Genet

Adult-onset neuronal ceroid lipofuscinosis

AD Identified 5 unrelated individuals with mutations in DNAJC5

Weedon et al.429 2011 Am J Hum Genet

Charcot-Marie-Tooth

AD Found DYNC1H1 as cause in 3 relatives

Ozgul et al.430 2011 Am J Hum Genet

Retinitis pigmentosa AR Identified homozygous mutation in MAK as cause

Doi et al.431 2011 Am J Hum Genet

Cerebellar ataxia AR Identified mutation in SYT14 as cause

Sloan et al.432 2011 Nat Genet Malonic and methylmalonic aciduria

AR Identified mutation in ACSF3 as cause

Aldahmesh et al.433

2011 J Med Genet Knobloch Syndrome AR Identified ADAMTS18 as cause

Murdock et al.434 2011 Am J Med Genet A

Recurrent polymicrogyria

AR Identified compound het mutations in WDR62 as cause in 2 sibs

Regalado et al.435 2011 Circ Res Thoracic aortic aneurysms leading to acute aortic dissection

AD Identified SMAD3 as cause

Dickinson et al.436

2011 Blood Dendritic cell, monocyte, B and NK lymphoid deficiency

AD Identified GATA2 as cause in 4 unrelated affecteds

Hor et al.437 2011 Am J Hum Genet

Familial narcolepsy with cataplexy

AR Combined linkage with exome sequencing to identify MOG as cause

Marti-Masso et al.438

2011 Hum Genet Early-onset generalized dystonia

AR Identified GCDH as cause in 2 affected siblings

Tariq et al.439 2011 Genome Biol heterotaxy AR Combined homozygosity mapping with exome

sequencing to identify SHROOM3 as candidate cause

Takata et al.440 2011 Genome Biol Progressive external ophthalmoplegia

AR Combined homozygosity mapping with exome sequencing to identify RRM2B as cause in patient from consang family

Theis et al.441 2011 Circ Cardiovasc Genet

Dilated cardiomyopathy

AR Combined homozygosity mapping with exome sequencing to identify GATAD1 mutations in 2 affected sisters

Pierson et al.442 2011 PLoS Genet Spastic ataxia-neuropathy syndrome

AR Identified AFG3L2 as cause in 2 brothers of consang family

Al Badr et al.443 2011 J Pediatr Urol Ochoa (urofacial) syndrome

AR Combined homozygosity mapping with exome sequencing to identify HPSE2 as cause in child of consang parents

Cullinane et al.444

2011 J Invest Dermatol

Oculocutaneous albinism and neutropenia

AR Combined homozygosity mapping with exome sequencing to identify two candidate genes (SLC45A2 and G6PC3)

Ovunc et al.445 2011 J Am Soc Nephrol

Intermittent nephrotic-range proteinuria

AR Identified CUBN as cause in 2 sibs of consang parents

Bowne et al.446 2011 Eur J Hum Genet

Retinitis pigmentosa with choroidal involvement

AD Combined linkage analysis with exome sequencing to identify RPE65 as cause

Kitamura et al.447 2011 J Clin Invest Autoinflammation and lipodystrophy

AR Identified PSMB8 as cause in patients from 2 consang families

Tyynismaa et al.448

2011 Hum Mol Genet

Progressive external ophthalmoplegia with multiple mitochondrial DNA deletions

AR Identified TK2 as cause

Bjursell et al.449 2011 Am J Hum Genet

hypermethioninemia AR Identified ADK as cause

Zangen et al.450 2011 Am J Hum Genet

XX female gonadal dysgenesis

AR Combined homozygosity mapping with exome sequencing to identify PSMC3IP/HOP2 as cause

Galmiche et al.451

2011 Hum Mutat Mitochondrial cardiomyopathy

AR Identified compound hets in MRPL3 as cause in 4 affected sibs

Bredrup et al.452 2011 Am J Hum Genet

Ciliopathies with skeletal anomalies with renal insufficiency

AR Identified compound hets in WDR19 as cause

Saitsu et al.453 2011 Am J Hum Genet

Hypomyelinating leukoencephalopathy

AR Identified POLR3A and POLR3B as cause

Clayton-Smith et al.454

2011 Am J Hum Genet

Say-Barber-Biesecker variant of Ohdo syndrome

sporadic Identified KAT6B as cause in 4 individuals

Aldahmesh et al.455

2011 Am J Hum Genet

Ichthyosis, intellectual disability, and spastic quadriplegia

AR Combined homozygosity mapping with exome sequencing to identify ELOVL4 as cause in 2 individuals

Chen et al.456 2011 Nat Genet Paroxysmal kinesigenic dyskinesia

AD Identified PRRT2 as cause in 8 families

Logan et al.457 2011 Nat Genet Early onset myopathy, areflexia, respiratory distress and dysphagia (EMARDD)

AR Identified MEGF10 as cause

Dauber et al.458 2011 J Clin Endocrinol Metab

Severe infantile hypercalcemia

AR Identified CYP24A1 as cause

Shamseldin et al.459

2011 J Med Genet Split hand and foot malformation

AR Combined homozygosity mapping with exome sequencing in consang family to identify DLX5 as cause

Sergouniotis et al.460

2011 Am J Hum Genet

Benign Fleck Retina AR Combined homozygosity mapping with exome sequencing to identify PLA2G5 as cause

Berger et al.461 2011 Mol Genet Metabol

Early prenatal ventriculomegaly

AD Combined linkage with exome sequencing to identify AIFM1 as cause

Bhat et al.462 2011 Clin Genet Primary microcephaly

AR Identified WDR62 as cause

Wang et al.463 2011 Hum Mutat Leber congenital amaurosis

AR Identified ALMS1, IQCB1, CNGA3, MYO7A as candidates

To date, most successful exome-based studies were in monogenic Mendelian disorders. The first filtering

step in most studies was to exclude variants reported in dbSNP and any other exome data available to the

investigators. Depending on the version of dbSNP used and the number of available exomes, this step

usually eliminates at least half of the called variants. Furthermore, only variants that cause potential

protein change or truncation are included in the analysis (i.e. nonsynonymous single nucleotide variants;

splice-site variants; nonsense variants; and indels). At this point, studies diverge in their strategies,

depending on the nature of the condition being studied and the available samples for sequencing. A

notable characteristic of most exome studies published to date is that the diseases being investigated are

recessive (Table 3). This allows the application of homozygosity mapping or identity-by-descent analysis

to family data, or even simply filtering out all genes except those that have homozygous variants or

compound heterozygous variants in the exome samples. If multiple affected relatives and/or more than

one family are available for a rare, fairly homogeneous condition, this strategy is very successful at

narrowing down the list of candidate genes to just one or at most a few genes. Even if only one sample is

available, it is possible to identify the causative gene for an autosomal recessive condition using this

method. For autosomal dominant conditions, where the causative variant is heterozygous, the use of

family linkage data can aid in significantly reducing the number of candidate genes. Alternatively, for

diseases caused by mutations in a single gene in most affected cases, identifying genes with novel

variants in more than one subject also helps pinpoint the causal gene. Additional filtering by predicted

effect of variants (using such tools as PolyPhen-2464 (http://genetics.bwh.harvard.edu/pph2/index.shtml)

and SIFT465 (http://sift.jcvi.org/)) and/or conservation scores (using PhyloP and GERP) may help in

ranking multiple candidate genes. However, those latter tools have their limitations and are often not

consistent in ascribing functional importance to the same variant. Some investigators have presented

statistical attempts at ranking variants and genes identified in such exome studies, but their applicability

and success rates are not known as of yet.468-470 Regardless, almost all studies provide further evidence in

support of the gene identified by sequencing the gene in other patients with the disease and/or presenting

functional analysis of the gene in the disease process.
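For recessive conditions, the zygosity-based filter described above (keeping only genes with a homozygous variant or with two heterozygous variants) can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is taken to be variants that have already passed the novelty and protein-altering filters, and the gene names and the `(gene, zygosity)` tuple format are hypothetical.

```python
from collections import defaultdict


def recessive_candidates(variants):
    """Genes compatible with recessive inheritance in a single exome.

    `variants` is an iterable of (gene, zygosity) tuples, with zygosity "hom" or "het".
    A gene qualifies if it has a homozygous variant or at least two heterozygous
    variants. The latter are only *possible* compound heterozygotes, because
    short-read data alone do not show whether the two variants lie on different alleles.
    """
    hom_genes = set()
    het_counts = defaultdict(int)
    for gene, zygosity in variants:
        if zygosity == "hom":
            hom_genes.add(gene)
        elif zygosity == "het":
            het_counts[gene] += 1
    possible_compound_het = {g for g, n in het_counts.items() if n >= 2}
    return hom_genes | possible_compound_het


def shared_recessive_candidates(exomes):
    """Genes that qualify in every affected individual (e.g. affected siblings)."""
    gene_sets = [recessive_candidates(v) for v in exomes]
    return set.intersection(*gene_sets) if gene_sets else set()


# Toy data for two affected siblings.
sib1 = [("GENE_A", "het"), ("GENE_A", "het"), ("GENE_B", "hom"), ("GENE_C", "het")]
sib2 = [("GENE_A", "het"), ("GENE_A", "het"), ("GENE_C", "het"), ("GENE_D", "hom")]

shared = shared_recessive_candidates([sib1, sib2])
print(sorted(shared))
```

Even with a single sample the first function already removes every gene with only one heterozygous variant, which is why recessive disorders lend themselves to this approach.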

The somatic genomes of many cancers have been sequenced, shedding light on important genes and

pathways involved in driving tumorigenesis and/or metastasis. The earliest of those involved a laborious

approach of sequencing coding regions exon-by-exon using the conventional Sanger method.37,471-472 The

first cancer genome to be sequenced using next-generation platforms was that of a cytogenetically normal

acute myeloid leukemia (AML)473; subsequently, genomes have been sequenced for additional cases of AML474-475; breast cancer476-477;

lung cancer478-479; uveal melanoma480; colorectal cancer481; multiple myeloma482; hepatocellular

carcinoma483; hairy cell leukemia484; diffuse large B-cell lymphoma485; pancreatic neuroendocrine

tumor486; and gastric cancer487. An international collaboration under the auspices of the International

Cancer Genome Consortium (ICGC)488 is currently undertaking a large-scale integrative analysis of 50

different cancer types and/or subtypes at the genomic, epigenomic, and transcriptomic levels.

In addition to investigating the somatic genome of cancer, germline sequencing can help identify genes

that predispose to Mendelian cancer syndromes and/or familial cancer clustering. The first such study

used paired germline-tumor exome data to identify PALB2 as a new FPC gene in a patient who did not

carry mutations in known predisposition genes.117 The paired tumor variants allowed Jones et al.117 to

narrow the search down to genes that had a germline truncating mutation as well as a somatic “second-

hit” deleterious mutation, thus excluding all but three genes, two of which were previously reported to

have truncating mutations in healthy controls. Resequencing the full PALB2 coding region in a cohort of

96 FPC subjects identified an additional three families with protein-truncating mutations in the gene,

whereas truncating mutations in PALB2 are rare in control populations, further supporting PALB2 as an

FPC predisposition gene. In addition, the function of PALB2, a partner of BRCA2 which is already

implicated in pancreatic tumorigenesis, provided further weight to this discovery.

Despite the success of this initial report, few familial and/or syndromic cancer exome studies have been

published to date. Two studies, investigating the cause of childhood classic Kaposi Sarcoma489 and

mosaic variegated aneuploidy490, were able to take advantage of apparently recessive inheritance to filter

the exome data and identify the causative genes. In the case of Kaposi Sarcoma, variants were filtered for

homozygosity, protein-altering effect, and absence in dbSNP129, 1000 Genomes, or 49 in-house exomes,

leaving only 1 splice-site variant and 11 missense variants. The splice-site variant affects a gene (STIM1)

that is also mutated in a recessive immunodeficiency syndrome, and given the previous link of Kaposi

Sarcoma to immunodeficiency, this was considered a strong candidate. The investigators of mosaic

variegated aneuploidy syndrome sequenced two siblings of non-consanguineous parents and attempted to identify a

gene with two loss-of-function mutations shared by both siblings (as compound heterozygotes).

Interestingly, they did not initially identify a single causal gene, and rather identified 12 genes with a

single loss-of-function mutation in common to the siblings. Focusing on a gene with a putative functional

connection to the disease (CEP57, which has centrosomal localization), Snape et al. sequenced its full coding region

in both siblings and identified a second mutation, an 11-bp deletion that was not called in the exome data.

This highlights current limitations of sensitivity and specificity of exome analysis. Two additional

unrelated patients were also found to carry compound heterozygous mutations in CEP57.

Two studies of autosomal dominant hereditary cancer were able to harness the power of sequencing

multiple unrelated individuals or linkage analysis to narrow down the list of susceptibility gene

candidates. In a study of hereditary pheochromocytoma491, three unrelated patients were sequenced and

the variants filtered to only include heterozygous protein-altering mutations shared by all three subjects

and absent in dbSNP and 1000 Genomes data. This reduced the list of candidates to just two genes, of

which only one segregated with disease in the respective families (MAX). By demonstrating LOH at the

MAX locus and absence of MAX expression in tumors from the affected families, Comino-Mendez et al.491

presented strong evidence for the role of MAX as a tumor suppressor gene in pheochromocytoma.

Moreover, they identified five additional unrelated patients with mutations in this gene (2 truncating and 3

missense). To identify susceptibility genes for familial nodular Hodgkin’s lymphoma, Saarinen et al.492

used information from linkage analysis of a large family in conjunction with exome sequencing of one

family member to narrow the list of candidates with a deleterious mutation segregating in the affected

family members and not present in controls to one gene: a 2-bp deletion in NPAT. Further sequencing of

this gene in other unrelated patients identified no other rare deleterious mutations in NPAT but they did

find a common amino-acid deletion that seemed to be significantly more frequent in Hodgkin’s patients

than controls (4.2% vs. 1.1%, OR 4.11, p=0.018). Gene expression array demonstrated decreased NPAT

mRNA in carriers of the 2-bp deletion. These findings, in addition to the fact that NPAT shares a putative

promoter with another known tumor suppressor gene (ATM) and is thought to have a role in cell cycle

regulation, suggest that NPAT germline mutations predispose to nodular Hodgkin’s lymphoma.
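The odds ratio quoted above for the common NPAT amino-acid deletion can be related to the two reported frequencies with a short calculation. This is only a sketch: it assumes the percentages are carrier frequencies in patients and controls, and it uses the rounded values from the text, so it is not expected to reproduce the published estimate of 4.11 exactly.

```python
def odds_ratio_from_frequencies(freq_cases, freq_controls):
    """Odds ratio computed from the proportion of carriers in cases and in controls."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls


# Rounded frequencies as quoted in the text (4.2% of patients vs. 1.1% of controls).
approx_or = odds_ratio_from_frequencies(0.042, 0.011)
print(f"Approximate odds ratio from rounded frequencies: {approx_or:.2f}")
```

The result is close to 4; the remaining difference from the reported value presumably reflects rounding of the percentages or the exact counts used by the authors, which are not given here.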

One of the promises of whole-genome and exome sequencing is the power to bridge the gap occupied by

low-frequency moderately penetrant variants in explaining disease heritability which until recently could

not be identified by family-based studies (because they usually do not segregate with disease) nor by

genome-wide association studies based on common SNPs.493 Such variants have been identified in the

past through candidate gene sequencing in cases, and require relatively large case-control studies to

demonstrate significant enrichment in the disease population (e.g. BRIP1 in prostate cancer494; CHEK2 in

breast cancer495). With the increasing number of exomes or whole genomes being sequenced, it is

possible to capture those functional variants on a genome-wide level. For example, a recent report

describes whole-genome sequencing of approximately 450 Icelandic individuals then imputes the

genotype of detected variants in a large cohort of Icelandic ovarian cancer cases and controls, thus

identifying the most significant association to be for an intronic SNP in BRIP1. Subsequent fine-mapping

of the associated regions revealed a 2-bp deletion in exon 14 of BRIP1 that was in partial linkage

disequilibrium with the intronic SNP, and which had an odds ratio > 8 for ovarian cancer. Alternatively,

exome or whole-genome data itself may reveal the functional variant directly in family-based studies,

although the challenge lies in determining which non-segregating rare/low-frequency variant is causally

important. In a recent study by Yokoyama et al.496, whole-genome sequencing of a single member of a

large familial melanoma kindred identified over 400 germline variants, one of which was a missense

variant in a gene called MITF. Genotyping of this variant in the remaining family members demonstrated

non-segregation (only three of eight affected members carried the variant). However, due to interest in

the previously reported role of MITF in development of melanoma, the investigators genotyped this

variant in two large case-control cohorts and identified a significantly elevated frequency of the MITF

variant in cases, with an odds ratio of approximately 2, supporting the hypothesis that this low-frequency

variant is enriched in familial cases and confers a moderate risk of melanoma. In a similar study by Park

et al.497 in which members of four early-onset, multiple-case breast cancer pedigrees underwent exome

sequencing, a functionally interesting gene (FAN1) with two deleterious-predicted missense variants in

two families (one family segregated while the second did not segregate the variant) was identified, but

Park et al.497 reported no statistically significant association of the variant with breast cancer in two

case-control analyses.

Chapter 2 - Loss of Heterozygosity at BRCA1 Locus in Pancreatic Adenocarcinoma

The contents of this chapter have been published in Human Genetics 2008 Oct;124(3):271-8.

PMID: 18762988 [http://www.springerlink.com/content/9723278j89678256/] The final publication is

available at www.springerlink.com. (I am first author).

1. Abstract

Although the association of germline BRCA2 mutations with pancreatic adenocarcinoma is well

established, the role of BRCA1 mutations is less clear. We hypothesized that loss of heterozygosity at the

BRCA1 locus occurs in pancreatic cancers of germline BRCA1 mutation carriers, acting as a “second-hit”

that contributes to tumorigenesis. Seven germline BRCA1 mutation carriers with pancreatic

adenocarcinoma and 9 patients with sporadic pancreatic cancer were identified from clinic- and

population-based registries. DNA was extracted from paraffin-embedded tumor and non-tumor samples.

Three polymorphic microsatellite markers for the BRCA1 gene, and an internal control marker on

chromosome 16p, were selected to test for loss of heterozygosity. Tumor DNA demonstrating loss of

heterozygosity in BRCA1 mutation carriers was sequenced, to identify the retained allele. The loss of

heterozygosity rate for the control marker was 20%, an expected baseline frequency. Loss of

heterozygosity at the BRCA1 locus was 5/7 (71%) in BRCA1 mutation carriers; tumor DNA was available

for sequencing in 4/5 cases, and three demonstrated loss of the wild-type allele. Only 1/9 (11%) sporadic

cases demonstrated loss of heterozygosity at the BRCA1 locus. Loss of heterozygosity occurs frequently

in pancreatic cancers of germline BRCA1 mutation carriers, with loss of the wild-type allele, and

infrequently in sporadic cancer cases. Therefore, BRCA1 germline mutations likely predispose to the

development of pancreatic cancer, and individuals with these mutations may be considered for pancreas

cancer screening programs.
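As an illustration of how the two LOH proportions in this abstract (5/7 in mutation carriers versus 1/9 in sporadic cases) could be compared, the sketch below computes a two-sided Fisher's exact test from first principles. The choice of test is an assumption made for this example and the resulting p-value is not a figure reported in the abstract.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability of a table with x in the top-left cell, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )


# Rows: BRCA1 mutation carriers, sporadic cases. Columns: LOH, no LOH.
p_value = fisher_exact_two_sided(5, 2, 1, 8)
print(f"Two-sided Fisher's exact p = {p_value:.3f}")
```

With samples this small, an exact test is generally preferred over a chi-square approximation, although the estimate remains sensitive to a change in even a single tumor.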

2. Introduction

As discussed in the Literature Review section of the thesis, identifying genes implicated in predisposition

to FPC is important for developing early-detection and prevention strategies as well as more effective

therapeutic options. Several hereditary syndromes due to mutations in tumor suppressor/caretaker genes

cause an elevated risk of pancreatic cancer. These syndromes contribute to a small proportion of familial

cases, and it is expected that other genes play an important role136. Both BRCA1 and BRCA2 were

initially identified as highly penetrant genes in familial breast and ovarian cancer, but germline mutations

of these genes are also associated with several other malignancies498. Studies of cancer risks in BRCA2

germline carriers have reported a relative risk of 3.51 – 6.61 for pancreatic cancer498-500, and it is

estimated that BRCA2 mutations contribute to 6-19% of FPC cases103,121,501,502. Molecular genetic studies

have confirmed the role of BRCA2 inactivation in the development of pancreatic cancer115,503-507.

As with BRCA2, clinic-based studies have suggested an increased risk of pancreatic cancer in germline

BRCA1 mutation carriers508,509. There is also evidence for downregulation of BRCA1 expression in

sporadic pancreatic tumors510. However, the evidence is much weaker

for BRCA1 than for BRCA2. Inactivation of the wild-type BRCA1 allele in breast and ovarian cancer

most commonly occurs by loss of heterozygosity (LOH)511. We hypothesized that LOH at the BRCA1

locus occurs in pancreatic cancers of germline BRCA1 mutation carriers, acting as a “second-hit” event

contributing to pancreatic tumorigenesis. In this study, we compared the rate of LOH at BRCA1 in

pancreatic tumors in mutation-carriers and patients with sporadic pancreatic cancers.

3. Materials & Methods

Ethical approval for this study was obtained from the Mount Sinai Hospital Research Ethics Board.

Microdissection and DNA extraction from formalin-fixed paraffin-embedded (FFPE) tissue, primer

design and optimization for sequencing, PCR amplification, and interpretation of genotyping and

sequencing results were performed by W. Al-Sukhni. Microsatellite genotyping and Sanger sequencing

were performed by the Analytical Genetics Technology Centre (AGTC) at Princess Margaret Hospital,

Toronto.

3.1 Tissue Specimens

Germline BRCA1 mutation carriers were identified by: (1) clinic-based recruitment of incident cases of

pancreatic cancer at the University of Toronto, as described in a previous report by our group121; and (2)

population-based recruitment of pancreatic cancer cases through the Ontario Pancreas Cancer Study

(OPCS)45. BRCA1 testing was performed at provincial labs in most cases due to a strong history of

breast/ovarian cancer; in one case, a BRCA1 mutation was identified by our research group as part of a series of 102

unselected hereditary pancreatic cancer patients screened for several germline mutations. This latter

mutation was subsequently confirmed by testing in an offsite provincial lab121. All seven mutation

carriers included in this study had pathologically-confirmed adenocarcinoma of the pancreas. Pancreatic

tumor resection or biopsy specimens were obtained for all patients. Non-tumor tissue and/or blood

samples were also obtained for each patient. Microdissected, formalin-fixed paraffin-embedded samples

were prepared from each tumor (≥ 70% cellularity) and non-tumor specimen, and DNA was extracted

using the QIAamp DNA FFPE Tissue Kit, as per the manufacturer’s recommendations (QIAGEN Inc.,

Mississauga, Ontario, Canada). Blood lymphocyte DNA was extracted using standard Ficoll-Paque

technique, as per the manufacturer’s recommendations (Amersham Biosciences, Baie d’Urfe, Quebec,

Canada).

Nine patients recruited through the clinic-based Familial Gastrointestinal Cancer Registry (FGICR)121

with newly-diagnosed pancreatic cancer and no known BRCA1 germline mutations or family history of

breast/ovarian syndrome were selected for comparison. Tumor and non-tumor/lymphocyte DNA was

similarly extracted for each patient.

All patients were deceased before this study was performed; tissue specimens were previously banked for

research after obtaining consent from patients or from family members.

3.2 LOH Assay

Three microsatellite markers linked to the BRCA1 locus were used for LOH analysis: D17S855,

D17S1322, and D17S579. The first two markers are intragenic. (See Figure 1 for locations of

microsatellite markers on chromosome 17)

Figure 1 - Location of BRCA1 microsatellite markers on chromosome 17

Figure 1 Legend: D17S1322 and D17S855 are intragenic (in introns 19 and 20, respectively), while

D17S579 is distal to BRCA1. The distance in base pairs between markers is identified.

Primer pair sequences were published in previous studies576-578, and primers were purchased from

Invitrogen Canada Inc. (Burlington, Ontario, Canada). Primer sequences are listed in Appendix Table S1.

A microsatellite marker on 16p (D16S2616) was selected as an internal control. The expected allelic loss

rate on this chromosomal arm in sporadic pancreatic cancer and FPC is 20-25%.181,182

For each primer pair, a (FAM-6) 5’-labeled forward primer and an unlabeled reverse primer were used.

Platinum Taq DNA Polymerase from Invitrogen was used for polymerase chain reaction amplification.

For each reaction, 20-25ng of genomic DNA were amplified in 25 µL reaction volume containing 10X

PCR buffer (Invitrogen Canada Inc.), 2mM MgCl2, 0.5µL of 10mM dNTP, 1-1.5µL of 10mM primers,

and 0.2µL of Invitrogen Platinum Taq DNA Polymerase. Initial denaturation was performed at 95°C x 2

minutes; followed by 35 cycles of (a) 94°C x 30 seconds, (b) primer-specific annealing temperature x 30

seconds, and (c) 72°C x 30 seconds; and final extension at 72°C x 5 minutes.

Automated DNA fragment analysis was performed using the ABI 3100 Prism sequencer (Applied

Biosystems), and GeneMapper Software version 3.7 was used to measure the allelic peak intensities. A

case was informative for a particular marker if two distinct alleles were amplified in the non-

tumor/lymphocyte DNA. Allelic peak ratio was calculated in informative cases as (T1/T2)/(N1/N2),

where T1, N1 = peak intensities for larger alleles; T2, N2 = peak intensities for smaller alleles; T = tumor

DNA; N = non-tumor or lymphocyte DNA (Figure 2).

Figure 2 - Sample electropherogram of microsatellite marker fragment analysis

Figure 2 Legend: T=tumor DNA; N=non-tumor/lymphocyte DNA; T1,N1=peak intensities of larger alleles;

T2,N2=peak intensities of smaller alleles; Allelic peak ratio = (T1/T2)/(N1/N2); LOH = allelic ratio < 0.70 or > 1.43

An allelic ratio of < 0.70 or > 1.43 was considered evidence of LOH in tumor DNA. Results were

confirmed with at least 2 separate PCRs.
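As a minimal illustration of the scoring rule described above, the following Python sketch computes the allelic peak ratio and applies the 0.70 and 1.43 cut-offs. Only the formula and the thresholds come from the text; the function names and the example peak intensities are hypothetical and are not taken from the GeneMapper output.

```python
def allelic_peak_ratio(t1, t2, n1, n2):
    """Return (T1/T2)/(N1/N2).

    t1, n1: peak intensities of the larger allele in tumor / non-tumor DNA.
    t2, n2: peak intensities of the smaller allele in tumor / non-tumor DNA.
    """
    return (t1 / t2) / (n1 / n2)


def shows_loh(ratio, low=0.70, high=1.43):
    """Call LOH when the ratio falls outside the [low, high] window."""
    return ratio < low or ratio > high


# Hypothetical peak intensities for one informative (heterozygous) case.
example_ratio = allelic_peak_ratio(t1=400, t2=1000, n1=1000, n2=1000)
example_call = shows_loh(example_ratio)
```

Because 1.43 is approximately the reciprocal of 0.70, the call is essentially independent of which allele is placed in the numerator.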

3.3 Tumor DNA Sequencing in BRCA1 Mutation Carriers

For carriers of germline BRCA1 mutations who demonstrated LOH in their pancreatic tumors, the DNA

of the pancreatic cancer tissue was sequenced to determine if the wild-type or mutated allele was retained.

Since paraffin-extracted DNA was being amplified, unique primers were designed for each BRCA1

mutation to obtain amplification products < 110 bp. Appendix Table S2 lists primer sequences. Non-

tumor/lymphocyte DNA was sequenced for comparison for each case. Unlabeled primers were purchased

from Invitrogen. The ABI Prism 3130 XL Genetic Analyzer (Applied Biosystems) was used to perform

automated sequencing. The forward primer was used for sequencing, and results were confirmed by

sequencing two independently amplified PCR products for each sample.

4. Results

4.1 Patient Characteristics

Table 4 compares the characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients.

Table 4 - Characteristics of BRCA1 mutation carriers and sporadic pancreatic cancer patients

Patient Characteristic                                         BRCA1 Mutation Carriers (N=7)   Sporadic Pancreatic Cancer (N=9)
Gender (F:M)                                                   0:7                             4:5
Age at diagnosis with pancreatic cancer, years (mean +/- SD)   65.4 +/- 12.2                   63.6 +/- 10.9
Ethnicity, n (%)
  Ashkenazi Jewish                                             5 (71%)
  Caucasian                                                    2 (29%)                         8 (89%)
  Other                                                                                        1 (11%)
Source of specimen, n (%)
  Whipple resection                                            2 (29%)                         6 (67%)
  Biopsy                                                       4 (57%)                         3 (33%)
  Autopsy                                                      1 (14%)                         0
BRCA1 mutation
  5382insC                                                                                     N/A
  185delAG                                                                                     N/A
  2318delG                                                                                     N/A

Families with BRCA1 mutations demonstrated a history of breast +/- ovarian cancer, and four families

also had ≥ 2 pancreatic cancer cases (one of these cases has been previously reported)121. Most BRCA1

mutation carriers were of Ashkenazi Jewish descent, whereas we excluded patients with Jewish ancestry

from the sporadic cancer group due to the elevated prevalence of BRCA1 mutations in this population.

The two founder Ashkenazi Jewish BRCA1 mutations, 5382insC and 185delAG, were present in the

majority of mutation carriers (6/7 families). Table 5 summarizes the pedigree information for the seven

mutation carriers.

Table 5 - Pedigree summary for BRCA1 mutation carriers

BRCA1 mutation carrier ID

Ethnicity Mutation Age at diagnosis of PC (years)

Number of relatives with

BC and/or OC

Tumors at other sites

BRC-1 AJ 5382insC 79 2 (brother, 1st cousin)

BRC-2 Caucasian 5382insC 57 1 (1st cousin) 5 -

BRC-3* AJ 5382insC 52 1 (son) 1 (sister; dx age 42)

BRC-4 AJ 185delAG 77 0 1 (daughter; dx age 39)

Prostate

BRC-5 AJ 185delAG 76 0 3 Prostate

BRC-6 Caucasian 2318delG 51 0 6 -

BRC-7 AJ 185delAG 66 2 (sister, 1st cousin)

AJ = Ashkenazi Jewish; PC = pancreatic cancer; BC = breast cancer; OC = ovarian cancer; CRC = colorectal cancer

*This patient did not have molecular testing to confirm mutation; his brother and son both have confirmed 5382insC mutation

The mean age at diagnosis was similar for the two groups: 65.4 years in mutation-carriers vs. 63.6 years

in sporadic patients. Three BRCA1 mutation carriers had a history of other malignancies: two prostate

cancer and one colorectal cancer. No sporadic cancer patient had a history of multiple primary tumors.

4.2 LOH Analysis

All cases (BRCA1 mutation carriers and sporadic cancers) were informative for at least one BRCA1

marker. D17S855 was informative in 11/16 (69%) cases; D17S1322 and D17S579 were each informative

in 13/16 (81%) cases. The internal control marker D16S2616 was informative in 10/16 (63%) of all

cases. Two BRCA1 mutation carriers did not have enough tumor DNA to test for LOH with D16S2616;

tumor DNA from one sporadic cancer patient could not be amplified when testing for LOH with

D17S855.

Table 6 shows the LOH results for each case with each marker.

Table 6 - LOH results for BRCA1 mutation carriers and sporadic pancreatic cancer cases

Marker      BRCA1 Mutation Carriers     Sporadic Pancreatic Cancer Cases
            (one column per case)       (one column per case)
D17S855     + + + U + + U               + U * - - U - - -
D17S1322    - + U - U + -               + - - - - - U - -
D17S579     - U U - + + -               + U - - - - - - -
D16S2616    U + * - * U -               - - - + - - - U U

(+) = LOH [allelic peak ratio < 0.70 or > 1.43]
(-) = No LOH [0.70 ≤ allelic peak ratio ≤ 1.43]
(U) = uninformative sample (homozygous at the tested microsatellite marker in germline DNA)
(*) = DNA unavailable for amplification/DNA did not amplify

Ten cases in total were successfully tested with D16S2616, and only 2/10 (20%) demonstrated LOH.

Five of seven (71%) BRCA1 mutation carriers demonstrated LOH with at least one marker, whereas only

one of nine (11%) sporadic cancer cases demonstrated LOH with any BRCA1 marker (p = 0.035, 2-tailed

Fisher’s Exact test). In four of the five BRCA1-mutated cases with LOH, the allelic peak ratio was < 0.5

or > 2.0. (See Figure 3 for representative genotyping results).
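The reported p-value can be checked from the 2x2 table of LOH status by group (5 of 7 carriers versus 1 of 9 sporadic cases). The following is a minimal sketch of a two-sided Fisher's exact test using only the Python standard library; the text does not state which software was used for the original calculation, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def weight(k):
        # Unnormalised probability of k "positives" in row 1 (integer arithmetic).
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / denominator


# Rows: carriers, sporadic cases; columns: LOH, no LOH.
p_value = fisher_exact_two_sided(5, 2, 1, 8)
```

Working with integer binomial coefficients until the final division avoids floating-point ties when deciding which tables are at least as extreme as the observed one.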

Figure 3 - Three representative matched-pair electropherograms for microsatellite LOH

Figure 3 Legend: T=tumor DNA; N=non-tumor DNA. (a) and (b) represent LOH; (c) represents no LOH

The histopathologies of pancreatic tumors from BRCA1 mutation carriers were moderately- and poorly-

differentiated ductal adenocarcinoma, with no distinguishing pathologic characteristics of tumors with

LOH compared to tumors without LOH.

4.3 Sequencing to Identify Retained Allele in LOH Tumors

Four of five BRCA1-mutation carriers demonstrating LOH had sufficient tumor DNA for sequencing.

Three cases (BRC-1, BRC-2, and BRC-3) had the 5382insC mutation, and one (BRC-6) the 2318delG

mutation. Three of four sequenced cases (BRC-2, BRC-3, and BRC-6) demonstrated loss of, or a decrease

in, the wild-type allele, while BRC-1 was inconclusive. (Figure 4 shows a sample sequencing result.)

Figure 4 - Representative sequencing result for an individual with 5382insC germline BRCA1 mutation

Figure 4 Legend: T=tumor DNA; N=non-tumor/lymphocyte DNA. The top panel demonstrates

sequencing of two alleles in non-tumor DNA (mutant and wild-type allele); the bottom panel demonstrates only the mutant allele sequence in tumor DNA of the same individual.

Of note, patient BRC-3 who did not have molecular confirmation of the germline mutation was

successfully sequenced for the 5382insC mutation carried by his brother and son, confirming that he is a

carrier.

5. Discussion

This analysis sheds light, at the molecular level, on the putative role of BRCA1 in pancreatic cancer

tumorigenesis. The importance of LOH as a “second-hit” in tumorigenesis is well-established in many

cancers. Since BRCA1 inactivation occurs via LOH in the majority of breast and ovarian tumors in

BRCA1-mutation carriers, we hypothesized that LOH also plays a primary role in inactivation of BRCA1

in mutation-positive pancreatic cancer. Indeed, we found that the majority of our mutation-positive

pancreatic cancer subjects (5/7) did demonstrate LOH in tumor DNA. In comparison, we found that only

1/9 sporadic cancer patients demonstrated LOH at the BRCA1 locus in tumor DNA. It is possible that the

remaining two subjects had inactivation of their wild-type allele by epigenetic methylation of the

promoter; promoter hypermethylation of the wild-type allele in a minority of BRCA1 mutation-positive

breast tumors has been previously reported512. Due to the limitations of quantity and quality of our

paraffin-embedded specimens, we were not able to correlate LOH with decreased BRCA1 expression.

However, our sequencing results did confirm loss of wild-type in most of the cases with LOH, suggesting

that only the truncated protein product from the mutated allele would be expressed in those cases.

The link between BRCA2 mutations and pancreatic cancer is well-established, and most authors recommend

including this gene in mutational screening for high-risk pancreatic cancer individuals and their relatives.

However, the contribution of germline BRCA1 mutations to increased risk of pancreatic cancer is less

clear. Both BRCA1 and BRCA2 have important roles in the repair of double-stranded DNA breaks.513 A

number of anecdotal reports have described pancreatic cancer in association with BRCA1 mutations.514,515

Our group previously identified 38 individuals from a group of 102 pancreatic cancer patients who were

considered to have intermediate/high-risk families, of whom one Ashkenazi Jewish patient screened

positive for a deleterious BRCA1 mutation.121 A study by Tonin et al.516 screened 220 Ashkenazi Jewish

breast cancer families for BRCA1 and BRCA2 mutations, and reported pancreatic cancer in 11/91 families

with a BRCA1 mutation compared to 5 cases in 120 families without BRCA1 mutations. More recently,

Skudra et al.122 screened 90 consecutive Latvian patients presenting with pancreatic cancer and 640

controls for several germline BRCA1 mutations, including two Latvian founder mutations (5382insC,

4154delA) and two less common mutations (300T>G, 185delAG) in the BRCA1 gene. Four of 90 (4.4%)

pancreatic cancer patients were found to carry a BRCA1 mutation compared to 1/640 (0.15%) controls. It

was noted, however, that the rate of mutation in controls likely underestimates the true prevalence of the

founder mutations in the general Latvian population since control subjects were relatively older, hence

selecting against highly penetrant mutations.

Two large studies used family-based designs to study cancer risk at sites other than breast or ovary in

families with multiple breast/ovarian cancers or with young age of onset of breast cancer. There was some

overlap in the families used between the two studies, but different analytical methods were used.508,509,517

Both studies found a statistically significant association for pancreatic cancer, albeit lower than the

association with BRCA2: Brose et al.509 reported a three-fold increase in pancreatic cancer risk among

BRCA1 carriers (3.6%, compared to 1.3% estimated general population risk); Thompson et al.508 reported

a relative risk of 2.26 (95% CI 1.26-4.06) for developing pancreatic cancer in BRCA1 mutation carriers,

with a greater association in individuals diagnosed under age 65 (RR 3.10, 95% CI 1.43-6.70). One

limitation of these studies was the family-based design, which may overestimate cancer risks due to

possible confounding effects of other genetic and/or environmental factors shared by members of a

family. To circumvent this problem, Risch et al.498 performed a population-based study of 1171

unselected women from Ontario, Canada who presented with new-onset ovarian carcinoma. Subjects

were screened for BRCA1 and BRCA2 mutations, and information about other cancers in their first-degree

relatives was used to estimate cancer risk at other sites in mutation carriers, and compared to estimated

cancer incidence rates in Ontario. Seventy-five BRCA1 mutation carriers were identified, and a relative

risk of 3.1 was calculated for pancreatic cancer; however, this was not statistically significant (95% CI

0.45-21).

More recently (and subsequent to completion of our study), Ferrone et al.502 published an analysis of

unselected Ashkenazi Jewish patients who underwent pancreatic cancer resection and found no significant

increase in BRCA1 frequency relative to the general Ashkenazi population (1.3% vs. 1.1%); however, the

BRCA1 mutation rate was based on previous reports and not directly assessed in a control cohort in this

study, and the authors acknowledged that the small size (145 subjects) may have resulted in insufficient

power to detect a statistically significant difference. Axilbund et al.123 did not find carriers of BRCA1

mutations in 66 FPC patients (defined as having at least two additional relatives with pancreatic cancer),

but most of the subjects did not report Ashkenazi Jewish ancestry. In the non-Jewish North American

population, the estimated frequency of BRCA1 mutations is 1/500-1/800518,519; this suggests that Axilbund

et al.’s study was underpowered to identify an association of BRCA1 with FPC unless the effect size was

at least 15-fold, a value exceeding the estimated risk of BRCA2. Kim et al.520 reported a statistically

lower age of onset for pancreatic cancer in BRCA1-mutation carriers than in non-carriers.
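The power argument for a cohort of 66 patients can be made concrete with a rough calculation. The sketch below is not the derivation behind the 15-fold figure quoted above, which the text does not spell out. It simply assumes that patients are independent and that the carrier frequency among cases equals the population frequency multiplied by a hypothetical fold enrichment, and then asks how likely such a cohort is to contain at least one carrier.

```python
def prob_at_least_one_carrier(n_patients, baseline_freq, fold_enrichment):
    """P(at least one carrier among n_patients).

    Assumes independent patients and a carrier frequency among cases of
    baseline_freq * fold_enrichment (capped at 1). Both are simplifying
    assumptions made for illustration.
    """
    carrier_freq = min(1.0, baseline_freq * fold_enrichment)
    return 1.0 - (1.0 - carrier_freq) ** n_patients


N_PATIENTS = 66  # cohort size cited in the text

# Population frequencies of 1/500 and 1/800, with no enrichment and 15-fold.
scenarios = {
    (freq_label, fold): prob_at_least_one_carrier(N_PATIENTS, freq, fold)
    for freq_label, freq in (("1/500", 1 / 500), ("1/800", 1 / 800))
    for fold in (1, 15)
}
```

Under these assumptions, a cohort of this size would usually contain no carriers at all unless the enrichment in cases were large, which is consistent with the point made in the text that a null finding in 66 patients is weakly informative.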

For our study, we identified seven unrelated individuals with pathologically-confirmed pancreatic

adenocarcinoma whose families have BRCA1 mutations. In all but one of these cases, a molecular

confirmation of the mutation was previously available. The patient without molecular confirmation had a

brother and son who carried the identical 5382insC mutation; we later confirmed the presence of the same

mutation in this patient when we sequenced his tumor DNA to identify the remaining allele. The age at

diagnosis of pancreatic cancer did not differ significantly between the mutation carriers and sporadic

cases; this is similar to findings of other studies.515,521 Though further studies are needed to definitively

determine if BRCA1 is associated with increased pancreatic cancer risk, current data suggest that the

penetrance of BRCA1 mutations for pancreatic cancer is lower than that of BRCA2.498 Moreover, some

studies have suggested that some pancreatic cancer patients with BRCA2 mutations may not have a family

history of breast or ovarian cancers.501,522 It is not clear if the same may be true for pancreatic cancer

patients with BRCA1 mutations; most studies to date have characterized families selected for breast or

ovarian cancer.

Possible sources of experimental artifact include contamination of microdissected tumor cells with

adjacent stromal cells and potential bias from PCR-based microsatellite assay. Measures to reduce the

impact of such bias included using microdissected tumor samples with a minimum of 70% cellularity (as

identified by an experienced pathologist) and confirming PCR-based results with at least two separate

PCR experiments. Since FFPE-specimens often yield DNA of variable quality as a result of nucleic acid

cross-linking by the fixation process, we minimized potential bias from degraded DNA by selecting

primers for microsatellite markers that amplify small fragments (125-150bp). Due to the limitation of

available DNA, and the amplicon size restriction in selecting microsatellite markers, we were limited to

just three BRCA1 markers for our experiments. However, every sample produced informative results for

at least one marker, and most generated results for two or more markers. We also attempted to include an

internal control, an unrelated microsatellite marker at chromosome 16 with a previously reported LOH

frequency of 20-25%. Due to technical reasons and inadequate DNA for further testing, only three of the

seven familial samples successfully amplified this marker, with 1/3 demonstrating LOH. In comparison,

seven of nine sporadic cases amplified this internal control marker, with 1/7 showing LOH. Overall, 2/10

(20%) of samples showed LOH at this locus, consistent with previous reports. Although the inadequate

number of informative samples among the familial cases reduced the value of this control in our

comparison, our results remain valid given the confirmatory Sanger sequencing that demonstrated

decreased signal for the functional allele in tumors from samples that demonstrated LOH.

Our small sample size (seven germline BRCA1 mutation carriers with pancreatic cancer) reflects the

challenges inherent in studying a malignancy as lethal as pancreatic cancer, in which only 15% of cases

are resectable. To our knowledge, this is the first molecular genetic study investigating BRCA1 LOH in

pancreatic cancer of germline BRCA1 mutation carriers. Two previous studies have investigated BRCA1

in sporadic pancreatic tumors. Beger et al.510 used quantitative reverse-transcription PCR (qRT-PCR) and

immunohistochemistry antibody staining to analyze BRCA1 and BRCA2 gene expression in 13 normal

pancreas samples, 30 chronic pancreatitis samples, and 53 sporadic pancreatic adenocarcinomas. They

found decreased BRCA1, but not BRCA2, mRNA and protein expression in 50% of pancreatic cancer

samples, and also found decreased BRCA1 mRNA expression in chronic pancreatitis samples, whereas

normal expression was observed in normal pancreatic tissue. Correlation of these findings with clinical

information demonstrated worse 1-year survival in patients whose tumors had reduced BRCA1

expression, compared to patients with normal BRCA1 expression. Another study by Peng et al.523 found

that BRCA1 was frequently methylated in sporadic pancreatic adenocarcinoma as well as in ductal cells

showing inflammatory background without histologic change. The authors suggested that promoter

methylation of the BRCA1 gene may be the mechanism explaining the reduced gene expression reported

by Beger et al.510 in pancreatic cancer and in chronic pancreatitis. However, they noted heterogeneity of

methylation in different sections of the same tumor, and they did not directly measure gene expression

level, so it is not clear how promoter methylation impacted expression. Moreover, they found

methylation of BRCA1 even in normal ductal cells. Our study adds to the evidence for BRCA1 in

pancreatic tumorigenesis by specifically demonstrating an inactivating mechanism in the pancreatic tumor

DNA of BRCA1 mutation carriers, likely akin to the role of BRCA1 in breast and ovarian cancer

tumorigenesis.

Determining the association between BRCA1 and pancreatic cancer has diagnostic and therapeutic

implications. The implication of BRCA2 in pancreatic cancer has allowed incorporation of this gene in

mutational screening panels and identification of kindreds at risk; the same can be done for BRCA1. As

for treatment, current chemotherapeutic protocols for pancreatic cancer are based on 5-FU and

gemcitabine.524 Interestingly, in-vitro and in-vivo studies have found BRCA1-deficient tumors to be

particularly sensitive to certain chemotherapeutic agents that take advantage of the impaired DNA repair

mechanism that characterizes these tumors, such as cross-linking agents (e.g. Mitomycin C), type II

topoisomerase inhibitors (e.g. etoposide), and PARP1 (Poly ADP-ribose polymerase family, member 1)

inhibitors.525-527 Recently, case reports and small series have shown that patients with BRCA1 or BRCA2

mutations respond to such therapies.174,178,528,529,530

In conclusion, we demonstrate that LOH occurs at the BRCA1 locus in pancreatic cancers of BRCA1-

mutation carriers, suggesting that this gene is inactivated in these tumors and may play a role in

pancreatic tumorigenesis. Further research into the role of BRCA1 in pancreatic cancer is needed to

assess the expression of this gene in pre-invasive and invasive pancreatic lesions. Subjects with germline

BRCA1 mutations should be considered for inclusion in pancreas cancer screening programs, and they

may benefit from chemotherapies that target the DNA repair pathway.

Chapter 3 - Germline Genomic Copy Number Variation in Familial Pancreatic Cancer

The contents of this chapter have been published in Human Genetics 2012 Jun 5 (Epub ahead of print).

PMID: 22665139 [http://www.springerlink.com/content/6665070t28854647/]. The final publication is

available at www.springerlink.com. (I am first author).

1. Abstract

Adenocarcinoma of the pancreas is a significant cause of cancer mortality, and up to 10% of cases appear

to be familial. Heritable genomic copy number variants (CNVs) can modulate gene expression and

predispose to disease. We hypothesized that genes overlapped by rare germline genomic losses or gains

identified exclusively in pancreatic cancer patients from high-risk families are candidate FPC genes. A

total of 120 FPC cases and 1194 controls were genotyped on the Affymetrix 500K array, and 36 cases and

2357 controls were genotyped on the Affymetrix 6.0 array. Detection of CNVs was performed by

multiple computational algorithms and partially validated by quantitative PCR. We found no significant

difference in the germline CNV profiles of cases and controls. A total of 93 non-redundant FPC-specific

CNVs (53 losses and 40 gains) were identified in 50 cases, each CNV present in a single individual.

FPC-specific CNVs overlapped the coding region of 88 RefSeq genes. Several of these genes have been

reported to be differentially expressed and/or affected by copy number alterations in pancreatic

adenocarcinoma. Further investigation in high-risk subjects may elucidate the role of one or more of these

genes in genetic predisposition to pancreatic cancer.

2. Introduction

As illustrated in Chapter 1 of this thesis, a small proportion of familial pancreatic cancer cases can be

attributed to known cancer syndromes and their associated genes, such as Hereditary Breast and Ovarian Cancer (HBOC;

BRCA2/BRCA1/PALB2), Peutz-Jeghers Syndrome (PJS; STK11), Familial Atypical Multiple Mole

Melanoma (FAMMM; p16/CDKN2A), and Hereditary Pancreatitis (HP; PRSS1). However, most cases of

Familial Pancreatic Cancer (FPC) have an unknown genetic etiology.136 Segregation analysis of families

with multiple affected members suggests that FPC is caused by heritable alterations in at least one rare

“major gene”, likely in an autosomal dominant manner.161 Moreover, multiple case-control and cohort

studies have demonstrated that members of FPC families, particularly those with an affected first-degree

relative, have a significantly elevated lifetime risk of developing the disease (up to 32-56 fold).156,158,160

However, to date, traditional linkage analysis for identifying predisposition genes has met

with challenges in FPC, owing in part to probable genetic heterogeneity and to the difficulty of

collecting DNA specimens from multiple affected members of a family, given the rapid mortality of the

disease.

Recently, it has become clear that submicroscopic copy number variants (CNVs) are prevalent throughout

all genomes, accounting for at least 1.2% of nucleotide variation between any two individuals.238 CNVs

have been linked to rare genomic disorders531 as well as common neurodevelopmental196, psychiatric532,

autoimmune533 and metabolic534 diseases. Some studies have suggested an association between common

CNVs and sporadic cancers (e.g. pancreatic cancer (6q13)344, neuroblastoma (1q21.1)340, prostate cancer

(2p24.3; 20p13; GSTT1)338,341,342, nasopharyngeal carcinoma (6p21.3)343, and endometrial cancer

(GSTT1)535). The recent paper by Huang et al.344 is the first to describe an association of a germline CNV

with pancreatic cancer risk: a common 10,379bp deletion at 6q13 was found to be higher in frequency in

sporadic pancreatic cancer patients compared to controls, with an odds ratio of 1.31 for 1-copy carriers

compared to 2-copy carriers. Interestingly, functional analysis of this non-genic deletion suggested that it

may be involved in long-range regulation of CDKN2B, an established tumor-suppressor gene.

In addition, it is well known that rare germline CNVs contribute to the genetic basis of familial cancer.

Indeed, large germline genomic rearrangements cause 15% of Familial Adenomatous Polyposis (APC

gene)311, 2% of breast and ovarian cancer (BRCA1 gene)536, and 5% of Lynch Syndrome (MSH2 & MLH1

genes)321 cases. In 1-3% of Lynch Syndrome patients, the causative mutation is a large heritable deletion

at the 3’ end of the TACSTD1 gene, which causes transcriptional read-through and epigenetic silencing of

the adjacent MSH2 gene.336 Furthermore, a report by Shlien et al.348 identified an elevated frequency of

germline CNVs in individuals with Li Fraumeni syndrome (TP53 mutation), and suggested that the

increased predisposition to cancer in this syndrome may be proportional to the frequency of germline

CNVs, many of which overlap known cancer genes.

Since germline CNVs implicated in familial cancers to date are rare with relatively high penetrance, we

hypothesized that familial and young-onset pancreatic cancer patients have a distinctive germline

genomic copy number variation (CNV) profile compared to non-cancer controls and that tumor

suppressor genes or oncogenes predisposing to pancreatic cancer may be overlapped by one or more

CNVs that are detected exclusively in patients. Here we present an analysis of germline CNVs detected

in 120 high-risk pancreatic cancer patients and compare them to CNVs in a large cohort of unaffected

controls.

3. Materials & Methods

This study was approved by the Research Ethics Boards at Mount Sinai Hospital and University Health

Network in Toronto, Canada; Office for Human Research Studies at Dana Farber/Harvard Cancer Centre

in Boston, Massachusetts; Institutional Review Board at Mayo Clinic in Rochester, Minnesota;

Institutional Review Board at M.D. Anderson Cancer Centre in Houston, Texas; Office of Human

Subjects Research at Johns Hopkins University in Baltimore, Maryland; and Human Investigation

Committee at Karmanos Cancer Institute, Wayne State University in Detroit, Michigan.

DNA extraction from blood or EBV-transformed cell lines was performed by technicians at each

participating site and provided to W. Al-Sukhni. Genotyping of samples and ancestry verification on

STRUCTURE was performed by W. Al-Sukhni. Computational analysis of Affy 500K data on dChip,

CNAG, and Partek was performed by W. Al-Sukhni, with assistance from S. Joe in script-writing for

organization and filtration of data (as directed by W. Al-Sukhni). To standardize the analysis of Affy6.0

chips in the same manner used for the POPGEN and OHI controls, computational analysis of Affy6.0

data on Birdsuite and iPattern was performed by A. Lionel at TCAG. Filtration and annotation of all

CNV data was performed by W. Al-Sukhni. Validation of CNVs by qPCR was performed by W. Al-

Sukhni with technical assistance from N. Zwingerman, A. Gropper, and S. Moore. Breakpoint-mapping

of CNV by qPCR and Sanger sequencing was performed entirely by W. Al-Sukhni. Comparison of case and

control CNVs and statistical analysis were performed by W. Al-Sukhni.

3.1 DNA extraction

DNA was extracted at each centre from either whole blood (white blood cells/lymphocytes) or

EBV-transformed cell lines. Cells were purified from whole blood using Ammonium Chloride-Tris lysis of red

blood cells. DNA was extracted using MaXtract Low Density tubes, which is an adaptation of the

standard organic solvent method of DNA extraction using phenol and chloroform. Purified DNA was

precipitated with 95% ethanol and dissolved in low TE buffer.

3.2 FPC cases recruitment

Genomic DNA was extracted from peripheral blood or EBV-transformed cell lines of 133 pancreatic

cancer patients from 131 high-risk families recruited by PACGENE (Pancreatic Cancer Genetic

Epidemiology Consortium; PI, G Petersen, Mayo)165, a six-centre consortium that recruits kindreds

containing two or more blood relatives affected with pancreatic cancer for genetic studies. Inclusion

criteria in the current study included: subjects with two or more affected relatives (“3+ FPC”; N=79);

subjects with only one affected relative diagnosed at age 49 years or younger (“2 FPC”; N=22); and

subjects without affected relatives who were diagnosed at age 49 years or younger (“single young”;

N=32). (Some of the families were reassigned based on updated information after analysis – see Results

section). We included young cases with no family history of pancreatic cancer because they may have de

novo mutations in the gene(s) of interest, although we acknowledge that the definition of FPC involves

more than one affected member in the family. Subjects were excluded if they carried known mutations or

were in families with syndromes which predispose to pancreatic cancer (BRCA2, BRCA1, p16/FAMMM,

STK11/PJS, PRSS1/HP, Lynch Syndrome). The majority of DNA samples were extracted from blood

(N=97) and the remaining samples were from EBV-transformed lymphoblast cell lines. (See Appendix Table

S3 (Excel sheet on attached CD) for details.)

3.3 Controls recruitment

Control samples of matched ancestry (> 95% of cases and controls reported Caucasian ancestry) were

obtained from two sources: 45 samples were healthy controls recruited by the Familial Gastrointestinal

Cancer Registry (FGICR)537 at Mount Sinai Hospital, Toronto, and 1,153 samples were recruited by the

Ontario Familial Colon Cancer Registry (OFCCR)538. Almost all control DNA samples were extracted

from blood (only 12 OFCCR controls were from lymphoblasts). (See Appendix Table S4 (Excel sheet on

attached CD) for details.)

In addition, we had access to CNV data for 1,234 controls recruited through the Ottawa Heart Institute

(OHI)539 and 1,123 controls of German descent recruited by the POPGEN project540. Most of the OHI

and POPGEN DNA samples were extracted from blood, and the platform for CNV detection was the

Affymetrix 6.0 array.

3.4 SNP genotyping

For primary CNV discovery, 128 cases and all 1,198 FGICR + OFCCR controls were genotyped at

approximately 500,000 genome-wide SNPs on the Affymetrix GeneChip Human Mapping 500K Array

(NspI and StyI chips) according to Affymetrix standard protocol. Genotyping of the cases and the 45 FGICR

controls was performed at The Centre for Applied Genomics (TCAG) in Toronto, while the 1,153

OFCCR controls were previously genotyped at Genome Quebec Innovation Centre as part of the

ARCTIC case-control colorectal cancer GWAS study. Briefly, whole genomic DNA was digested with

restriction enzyme (NspI or StyI) and ligated to universal adaptors, and adaptor-ligated fragments were

PCR-amplified with preference for 200bp-1,100bp size range. Subsequently, PCR amplicons were

fragmented, labeled, and hybridized to NspI or StyI chips. Chips were scanned using GeneChip Scanner

3000 7G, and Affymetrix GeneChip Command Console (AGCC) files were produced for further

processing. Intensity files (CEL) and genotype files (CHP) were converted from AGCC files using

GeneChip Operating Software (GCOS) and GeneChip Genotyping Analysis (GTYPE) software,

respectively. Genotype calls were made by Affymetrix Genotyping Console (GTC 2.1), which

implements the BRLMM genotype calling algorithm (Bayesian Robust Linear Model with Mahalanobis

distance classifier), using default settings (Score Threshold = 0.5, Block Size = 0, Prior Size = 10,000,

DM Threshold = 0.7).

GTC 2.1 performs a quality control (QC) analysis of the SNP genotype call rate, to estimate overall

quality of the chip hybridization, based on the Dynamic Model genotype calling algorithm. For 500K

arrays, Affymetrix considers QC < 93% call rate to suggest poor hybridization. However, QC call rate in

the range of 88-93% can also produce useable data for CNV analysis, in the experience of collaborators at

TCAG. Therefore, when we were unable to obtain rehybridized chips for a sample, we retained arrays

with QC call rate > 88% in the CNV analysis but inspected the raw calls made from those arrays to check

whether they appeared to be false.
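
The retention rule described above can be sketched as a small decision function. This is an illustrative sketch only, not code used in the study: the function name and the returned labels are hypothetical, and treating arrays at or below 88% as excluded is an assumption, since the text only states which arrays were retained.

```python
# Illustrative sketch; names and labels are hypothetical, not from the study.
PASS_CALL_RATE = 93.0        # Affymetrix guideline for 500K arrays (percent)
MIN_USABLE_CALL_RATE = 88.0  # lower bound for arrays retained in the CNV analysis

def triage_array(qc_call_rate, rehybridized_chip_available):
    """Decide how to handle one 500K chip given its QC call rate in percent."""
    if qc_call_rate >= PASS_CALL_RATE:
        return "pass"
    if qc_call_rate > MIN_USABLE_CALL_RATE:
        # Borderline hybridization: prefer a rehybridized chip when one exists,
        # otherwise keep the array and inspect its raw CNV calls by hand.
        if rehybridized_chip_available:
            return "use_rehybridized_chip"
        return "retain_and_inspect_calls"
    # Assumption: arrays at or below the lower bound are not analyzed.
    return "exclude"
```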

A subset of the original FPC cohort (33 samples) plus five new cases (Appendix Table S3) were

genotyped on the Affymetrix 6.0 array according to standard protocol to validate CNVs detected on the

Affymetrix 500K array as well as detect new CNVs. Arrays meeting Affymetrix quality control

guidelines of Contrast QC > 0.4 were used for further analysis. The Affymetrix Power Tools platform

was used to extract normalized intensities for each array and inter-array intensity correlation was

calculated; arrays with average correlation of > 0.9 were considered suitable for joint analysis.

3.5 Ancestry verification

Subject ancestry was verified using STRUCTURE software

(http://pritch.bsd.uchicago.edu/structure.html), which infers population structure using genotype data of

unlinked markers541. We used 1,089 unlinked genome-wide autosomal SNPs that map to the Affymetrix

500K array (NspI and StyI chips), with differing minor allele frequencies across three major HapMap

populations (Caucasian (CEU), African (YRI), and Asian (CHB/JPT)). The observed alleles (major and

minor) at each SNP in HapMap populations were obtained using UCSC genome browser “Tables”

function. To determine the population cluster (assuming three ancestral populations), 270 unrelated

HapMap samples were used (90 CEU, 90 YRI, 90 CHB/JPT) as reference of known ancestry. Ancestries

were assigned using a coefficient of ancestry threshold > 0.9.
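
As a minimal sketch of the assignment step, assuming the ancestry coefficients for the three clusters are already available per subject (for example, parsed from STRUCTURE output), the threshold rule could look like the following. The function name, the dictionary layout, and the "admixed" label for subjects with no coefficient above 0.9 are assumptions made for illustration.

```python
# Illustrative sketch; input layout and labels are hypothetical.
def assign_ancestry(coefficients, threshold=0.9):
    """Return the cluster whose coefficient exceeds the threshold, else 'admixed'.

    coefficients: dict mapping cluster label -> ancestry coefficient (sums to ~1).
    """
    label, value = max(coefficients.items(), key=lambda item: item[1])
    return label if value > threshold else "admixed"

print(assign_ancestry({"CEU": 0.97, "YRI": 0.01, "CHB/JPT": 0.02}))
print(assign_ancestry({"CEU": 0.60, "YRI": 0.05, "CHB/JPT": 0.35}))
```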

3.6 CNV discovery

Figure 5 is a summary flow chart of the primary CNV discovery on the Affy500K arrays.

Figure 5 – Analysis of 500K arrays in FPC cases and controls

[Flow chart, 500K array analysis pipeline: 128 FPC cases and 45 FGICR controls (Affymetrix 500K SNP arrays, TCAG) and 1153 OFCCR controls (Affymetrix 500K SNP arrays, Genome Quebec) were each analyzed with dChip, CNAG, and Partek Genomics Suite (HMM). Overlapping CNVs were merged per sample and divided into low-confidence CNVs (single algorithm/chip) and high-confidence CNVs (≥2 algorithms or chips). 8 cases were excluded (noise, no longer FPC), leaving 120 cases; 4 controls were excluded (personal PC or family history suggests FPC), leaving 1194 controls. FPC-specific CNVs were identified by comparing the high-confidence sets of cases vs. controls.]

Figure 5 Legend: Cases and controls were analyzed in a parallel fashion on three independent computational algorithms. A high-confidence CNV set (based on support by at least two algorithms or chips) was obtained for each of cases and controls and compared.

Copy number at each SNP position was estimated using three validated Hidden Markov Model (HMM)-

based CNV-calling algorithms (dChip 2006542, CNAG 2.0543, and Partek Genomics Suite v6.3©). NspI

and StyI chips were analyzed separately for each individual. After conducting several trials of different

analysis approaches, we identified the following as the method that best addresses the noise level in our

data: for dChip and Partek, samples were analyzed in batches corresponding to the grouping of samples

during chip hybridization (to minimize “batch effect” differences in hybridization that may lead to false

differences in intensity between samples): FPC cases and FGICR controls were analyzed in two batches

(batch 1 contained 47 cases and 22 controls; batch 2 contained 81 cases and 23 controls); OFCCR

controls were analyzed on dChip and Partek in 10 batches of approximately 100 samples each. For

CNAG, use of a maximum number of samples improves CNV detection, so the full group of FPC cases

and FGICR (173 samples) were analyzed concurrently, while the ARCTIC controls were analyzed in 6

random batches of approximately 200 samples each. Default analysis settings were used for each of the

computational programs: invariant-set probe normalization and hidden markov model copy inference

method for dChip; “non-paired reference/test sample” category and “automated analysis” option for

CNAG; 2-probe minimum used for calling CNV on Partek Suite (HMM method). The Partek CNV

coordinates were based on hg18 genome build and were converted to hg17 to merge with dChip and/or

A loss was defined by two or more consecutive SNPs with estimated copy number of < 2; a gain was

defined by two or more consecutive SNPs with estimated copy number of > 2. CNVs whose size was

less than 1,100bp were excluded to avoid the bias of PCR artifact causing false calls (since the fragment

size of amplified fragments was 200-1,100bp). Losses larger than 2 Mb and gains larger than 7 Mb were

also excluded (the cut-off was based on the largest CNVs seen in cases, with intention of maximizing

sensitivity in detecting case CNVs while removing excessively large CNVs in controls that are likely

false calls and/or represent somatic events). CNVs that crossed the centromere were removed because

they were incompatible with chromosomal stability and expected to be false calls. For any given chip and

algorithm, if the number of CNVs (losses + gains) called in a sample exceeded 40 (after above filters),

that sample was eliminated from the analysis for that given algorithm and chip (i.e. considered too noisy).
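
The calling and filtering rules in this paragraph can be illustrated with a short sketch. It is not the code used in the study: the function names are hypothetical, CNV size is assumed to be the distance between the first and last SNP of the run (inclusive), and the centromere is assumed to be supplied as a (start, end) interval for the relevant chromosome.

```python
from itertools import groupby

# Thresholds taken from the text above.
MIN_SIZE_BP = 1_100
MAX_LOSS_BP = 2_000_000
MAX_GAIN_BP = 7_000_000
MAX_CALLS_PER_RUN = 40

def state(copy_number):
    """Map an estimated copy number to 'loss', 'gain', or None (normal)."""
    if copy_number < 2:
        return "loss"
    if copy_number > 2:
        return "gain"
    return None

def call_segments(positions, copy_numbers, min_snps=2):
    """Return (type, start, end, n_snps) for runs of >= min_snps consecutive SNPs."""
    segments = []
    index = 0
    for kind, run in groupby(copy_numbers, key=state):
        n = len(list(run))
        if kind is not None and n >= min_snps:
            segments.append((kind, positions[index], positions[index + n - 1], n))
        index += n
    return segments

def passes_filters(segment, centromere):
    """Apply the size and centromere filters to one called segment."""
    kind, start, end, _ = segment
    size = end - start + 1  # assumption: inclusive span between flanking SNPs
    if size < MIN_SIZE_BP:
        return False
    if size > (MAX_LOSS_BP if kind == "loss" else MAX_GAIN_BP):
        return False
    cen_start, cen_end = centromere
    # A call that starts on one arm and ends on the other crosses the centromere.
    return not (start < cen_start and end > cen_end)

def sample_is_too_noisy(filtered_segments):
    """Flag a sample/chip/algorithm run with more than 40 calls after filtering."""
    return len(filtered_segments) > MAX_CALLS_PER_RUN
```

For example, a run of two SNPs at copy number 1 is called as a loss, while an isolated SNP at copy number 3 is ignored because it does not meet the two-SNP minimum.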

For each sample on a given chip, CNVs identified by two or more algorithms with overlapping

breakpoints (same direction on all algorithms) were merged if the length of the overlap area corresponded to

at least 20% of the length of any of the overlapping CNVs (Figure 6).
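
One way to read the 20% criterion is sketched below. This is a hedged illustration rather than the script used in the study: CNVs are represented as hypothetical (start, end, direction) tuples, the overlap is required to reach 20% of the length of at least one of the two CNVs, and the merged CNV is assumed to take the outermost boundaries of the pair (Figure 6 defines the actual boundary rule).

```python
# Illustrative sketch; tuple layout and merged-boundary rule are assumptions.
def overlap_length(a, b):
    """Number of base pairs shared by intervals a and b (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def should_merge(a, b, min_fraction=0.20):
    """True if two same-direction CNVs overlap by >= 20% of either one's length."""
    if a[2] != b[2]:  # a loss is never merged with a gain
        return False
    shared = overlap_length(a, b)
    if shared == 0:
        return False
    return any(shared / (cnv[1] - cnv[0]) >= min_fraction for cnv in (a, b))

def merge(a, b):
    """Assumed merge rule: keep the outermost start and end coordinates."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2])

dchip_call = (100_000, 200_000, "gain")
cnag_call = (170_000, 400_000, "gain")
if should_merge(dchip_call, cnag_call):
    print(merge(dchip_call, cnag_call))
```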

Figure 6 – Criteria for merging CNVs

For each sample, CNVs identified on both chips of the 500K array with overlapping breakpoints (same

direction on both chips) were merged if the length of the overlap area corresponded to at least 20% of the

length of either of the overlapping CNVs (Figure 6). “High-confidence calls” were identified as CNVs

called by at least two different algorithms and/or on both chips. Note that a CNV called by a different

algorithm on each chip was not considered “high-confidence”. For the purpose of identifying “CNV

loci”, overlapping CNVs in multiple samples were merged (using the above-described 20%

threshold).
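
The high-confidence rule can be expressed compactly. The sketch below is one interpretation, assuming each merged CNV in a sample carries a list of (algorithm, chip) pairs that supported it: a call is high-confidence if two different algorithms agree on the same chip, or if the same algorithm calls it on both chips. The function name and input layout are hypothetical.

```python
from collections import defaultdict

# Illustrative sketch; the (algorithm, chip) representation is an assumption.
def is_high_confidence(supporting_calls):
    """supporting_calls: iterable of (algorithm, chip) pairs for one merged CNV."""
    algorithms_by_chip = defaultdict(set)
    chips_by_algorithm = defaultdict(set)
    for algorithm, chip in supporting_calls:
        algorithms_by_chip[chip].add(algorithm)
        chips_by_algorithm[algorithm].add(chip)
    two_algorithms_same_chip = any(len(a) >= 2 for a in algorithms_by_chip.values())
    same_algorithm_both_chips = any(len(c) >= 2 for c in chips_by_algorithm.values())
    return two_algorithms_same_chip or same_algorithm_both_chips
```

Under this reading, a CNV supported only by dChip on the NspI chip and CNAG on the StyI chip stays low-confidence, which matches the exception noted in the text.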

CNV calling on Affy6.0 arrays was performed using the Birdsuite tools (Canary + Birdseye algorithms)544

and iPattern545 algorithms, using a reference set that included the 38 FPC cases in addition to 100 other

closely-correlated Affy6 arrays previously analyzed at TCAG (based on correlation coefficient > 0.9).

(Samples were also analyzed on GTC 4.1, but this data was only used to support calls made on Birdsuite

or iPattern). For each of these algorithms, we required CNVs to span 5 or more consecutive array probes

and be at least 20 kb in length. Detection by either Birdsuite or iPattern was sufficient for the purpose of

validating 500K array CNVs. Only “high-confidence” calls (i.e. called by at least two of Birdsuite,

iPattern, and/or GTC 4.1 software – boundaries of overlapping regions were determined in the same

manner as for 500K data) were included as novel FPC-specific CNVs. Samples with number of calls

greater than three times the standard deviation from the mean number of calls for an analysis batch were

excluded from the study. The combined results of Birdsuite (Canary and Birdseye) were filtered to

remove the following: CNVs crossing the centromere (“centromere jumpers”); X chromosome variants; and CNVs tagged

as “loss” with a copy number > 1 or tagged as “gain” with a copy number < 3. The iPattern results were

filtered to remove CNVs on the X chromosome and CNVs tagged as “complex”.
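
The Affy6.0 filters above lend themselves to a brief sketch. It is illustrative only and does not reproduce the Birdsuite or iPattern tooling: the function names are hypothetical, and the use of the sample standard deviation for the per-batch noise cut-off is an assumption.

```python
from statistics import mean, stdev

# Illustrative sketch; function names and the SD estimator are assumptions.
def keep_call(n_probes, length_bp):
    """Size filter: at least 5 consecutive probes and at least 20 kb."""
    return n_probes >= 5 and length_bp >= 20_000

def birdsuite_tag_is_consistent(tag, copy_number):
    """Keep 'loss' calls with copy number <= 1 and 'gain' calls with copy number >= 3."""
    if tag == "loss":
        return copy_number <= 1
    if tag == "gain":
        return copy_number >= 3
    return False

def noisy_samples(calls_per_sample, n_sd=3.0):
    """Samples whose call count exceeds the batch mean by more than n_sd SDs."""
    counts = list(calls_per_sample.values())
    cutoff = mean(counts) + n_sd * stdev(counts)
    return sorted(s for s, n in calls_per_sample.items() if n > cutoff)

# Toy batch: 19 samples with 90 calls each and one sample with 400 calls.
batch = {f"S{i}": 90 for i in range(19)}
batch["S19"] = 400
print(noisy_samples(batch))
```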

3.7 PCR validation of CNVs

Quantitative PCR validation of a subset of CNVs was performed using Invitrogen Platinum SYBR Green

qPCR Supermix – UDG, with primers designed within the CNV of interest, and MSH2-exon2 used as a

reference gene. (See Appendix Table S5 for primer sequences.) Standard PCR conditions were used (50°C x

2 min; 95°C x 2 min; (95°C x 15 sec; 60°C x 32 sec) x 40 cycles). Reactions were performed in replicates of

4-8x per sample. A standard curve was performed on each plate using control DNA (from a single

sample for all experiments) to ensure that primer efficiency was between 90% and 110% (slope between -3.6 and -3.1) and that the

correlation coefficient (R²) of the standard curve samples was > 0.99. The dissociation curve was checked for a

single peak (indicating a single product). Data was analyzed on the ABI 7500 real-time machine, setting

the baseline and threshold manually to reflect the exponential phase of amplification. Finally, data from

each plate was analyzed using the ddCt method546: for each sample with at least 4 replicates, one replicate

may be excluded from the calculation if it falls outside the range of Mean +/- 2*SD of Ct values (range

calculated after removal of uppermost or lowermost value); a “validation” curve of dCt vs. log input DNA

amount was generated for each primer set to confirm that the absolute slope is < 0.1, signifying that the

efficiencies of the test gene and reference gene primer sets are approximately equal. The calculations for

ddCt are made as follows:

dCt = mean Ct (test gene) – mean Ct (control gene (MSH2))

Standard deviation (SD) of dCt = SquareRoot[(SD Ct (test gene))^2 + (SD Ct (MSH2))^2]

ddCt = dCt (test sample) – dCt (control sample)

Fold difference in copy number = 2^(-ddCt)

SD of fold difference in copy number = Ln(2) * SD of dCt * 2^(-ddCt)
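
The calculations above, together with the primer efficiency check, can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, efficiency is derived from the standard-curve slope with the usual relation E = 10^(-1/slope) - 1, the fold difference uses the conventional 2^(-ddCt) form of the ddCt method, and the SD of dCt is taken from the test sample.

```python
import math
from statistics import mean, stdev

# Illustrative sketch; names and sign convention are assumptions (see lead-in).
def efficiency_from_slope(slope):
    """Primer efficiency from the standard-curve slope (Ct vs. log10 input DNA)."""
    return 10 ** (-1.0 / slope) - 1.0

def delta_ct(test_gene_cts, reference_gene_cts):
    """Return (dCt, SD of dCt) for one sample from replicate Ct values."""
    d_ct = mean(test_gene_cts) - mean(reference_gene_cts)
    sd = math.sqrt(stdev(test_gene_cts) ** 2 + stdev(reference_gene_cts) ** 2)
    return d_ct, sd

def fold_difference(test_sample, control_sample):
    """Return (fold difference, SD) given (dCt, SD) tuples for two samples."""
    d_ct_test, sd_test = test_sample
    d_ct_control, _ = control_sample
    dd_ct = d_ct_test - d_ct_control
    fold = 2 ** (-dd_ct)
    return fold, math.log(2) * sd_test * fold

# A slope of about -3.32 corresponds to roughly 100% efficiency.
print(round(efficiency_from_slope(-3.32), 2))

case = delta_ct([24.0, 24.1, 23.9, 24.0], [25.0, 25.1, 24.9, 25.0])
control = delta_ct([25.0, 25.1, 24.9, 25.0], [25.0, 25.1, 24.9, 25.0])
fold, fold_sd = fold_difference(case, control)
```

In this toy example the test gene amplifies one cycle earlier than in the control sample, so the estimated fold difference is about 2.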

3.8 Prioritization of CNVs

Figure 7 illustrates the priority order for investigating CNVs detected in cases.

Figure 7 – CNV prioritization plan

Figure 7 Legend: CNVs segregating with disease in a family or de novo in single case are highest priority,

followed by recurrent CNVs in unrelated affected individuals that are not found in unaffected controls. Single-affected disease-specific CNVs are lower in priority, and least likely to yield candidate genes are CNVs found in both affecteds and unaffecteds.

We defined “FPC-specific CNVs” as losses or gains detected in FPC cases on the 500K or Affymetrix 6.0

array that did not overlap (by 20% or more) with losses or gains in FGICR, OFCCR, OHI, or

POPGEN controls, nor with CNVs reported from non-BAC based platforms in the Database of

Genomic Variants (DGV)547 (http://projects.tcag.ca/variation, updated Nov 2010). Although we did not

control for ancestry in this analysis, we did note which FPC-specific CNVs were detected in non-

Caucasian samples.
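
A minimal sketch of this definition is given below, assuming CNVs are represented as hypothetical (chromosome, start, end) tuples on the same genome build. Two details are assumptions made for illustration: the 20% overlap is measured relative to the length of the case CNV, and control losses and gains are pooled into one list without regard to direction.

```python
# Illustrative sketch; tuple layout and overlap denominator are assumptions.
def overlap_fraction(case_interval, control_interval):
    """Fraction of the case CNV covered by the control CNV."""
    shared = max(0, min(case_interval[1], control_interval[1])
                 - max(case_interval[0], control_interval[0]))
    return shared / (case_interval[1] - case_interval[0])

def is_fpc_specific(case_cnv, control_cnvs, min_fraction=0.20):
    """True if no control or DGV CNV overlaps the case CNV by 20% or more."""
    chrom, start, end = case_cnv
    for c_chrom, c_start, c_end in control_cnvs:
        if c_chrom != chrom:
            continue
        if overlap_fraction((start, end), (c_start, c_end)) >= min_fraction:
            return False
    return True

controls = [("chr1", 1_000_000, 1_200_000), ("chr2", 500_000, 900_000)]
print(is_fpc_specific(("chr1", 1_150_000, 1_400_000), controls))
print(is_fpc_specific(("chr1", 1_190_000, 1_400_000), controls))
```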

3.9 Annotation of CNVs

Affymetrix 500K and Affymetrix 6.0 array coordinates were aligned to the NCBI hg17 and NCBI hg18

human genome builds, respectively. Genes overlapped by CNVs were identified through the University

of California, Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu/), using the respective human

genome build. Information about CNV-overlapped genes was obtained from Entrez Gene

(http://www.ncbi.nlm.nih.gov/gene) and Pubmed (http://www.ncbi.nlm.nih.gov/pubmed/). The Memorial

Sloan Kettering Cancer Centre (MSKCC) CancerGenes database

(http://cbio.mskcc.org/CancerGenes/Select.action)548 was used to identify genes with reported pathways

or functions linked to cancer development. The Wellcome-Trust Sanger Catalogue of Somatic Mutations

in Cancer (COSMIC version 55) database (http://www.sanger.ac.uk/genetics/CGP/cosmic)549 and the

Pancreatic Expression Database – version 2.0 (http://www.pancreasexpression.org)253 were used to identify

genes that had previously reported point mutations or copy number alterations in tumors or cancer cell

lines, or that were reported to be differentially expressed in pancreatic cancer according to published gene

expression studies. For COSMIC, Biomart was used to identify all genes with mutation type

“complex-compound substitution; complex – frameshift; deletion-frameshift; insertion-frameshift;

substitution-missense; substitution – nonsense; unknown”. To get all COSMIC genes fitting these

categories, the “gene” field was left empty; otherwise the desired gene lists were used.

3.10 Comparing Affy500K CNV profile between cases and controls

Only “high-confidence” CNVs from non-EBV samples were included in the CNV profile comparison to

minimize potential cell line artifacts and false calls.278 As well, only controls with data available for both

NspI and StyI chips were included in this comparison to minimize bias of undercalling CNVs in single-

chip samples. To minimize CNV calling errors for “complex” CNVs (i.e. losses and gains in different

samples overlapping the same region), we performed the “rare CNV” analysis only on regions reported

exclusively as losses or exclusively as gains. CNV loci present in fewer than 1% of the total number of samples

(cases + controls) were considered “rare”, excluding EBV samples and the complex CNVs. For losses,

32 cases and 235 controls (total 267 samples) were included in the “rare loss” analysis, so a rare loss was

defined as present in fewer than 3 individuals. For gains, 56 cases and 551 controls (total 607 samples)

were included in the “rare gain” analysis, so a rare gain was defined as present in fewer than 7 samples.
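
The two cut-offs follow directly from the 1% rule: 1% of 267 samples is 2.67, so a locus carried by at most 2 individuals is rare, and 1% of 607 samples is 6.07, so a locus carried by at most 6 samples is rare. A small sketch of this arithmetic, with a hypothetical function name and integer arithmetic to avoid rounding issues:

```python
# Illustrative sketch of the 1% rule; the function name is hypothetical.
def is_rare(n_carriers, n_samples, max_percent=1):
    """True if the locus is present in fewer than max_percent % of samples."""
    return n_carriers * 100 < max_percent * n_samples

# Losses: 32 cases + 235 controls = 267 samples.
print(is_rare(2, 32 + 235), is_rare(3, 32 + 235))
# Gains: 56 cases + 551 controls = 607 samples.
print(is_rare(6, 56 + 551), is_rare(7, 56 + 551))
```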

3.11 Statistical analysis

Comparison of medians was performed using the Mann Whitney U test and comparison of means was

performed using the two-tailed Student’s t-test with Levene’s test for equal variance. Testing for

significant difference in proportions was performed with the two-tailed Fisher’s exact test. A p-value <

0.05 was considered significant. Statistical testing was performed using the SPSS© software package

(version 17).

For comparing differences in proportions of cases and controls at each CNV locus, we only considered

regions containing only losses or only gains (in cases and/or controls) for non-EBV samples, and we

excluded samples with only a single chip in the analysis. After calculating two-tailed Fisher’s exact test

p-values for each loss and gain locus, we performed a Bonferroni correction to account for multiple-

testing. The number of multiple tests was defined as the total number of loss or gain loci in the above

comparison (losses and gains were assessed separately).
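
For readers who wish to reproduce the per-locus testing outside SPSS, the following is a minimal standard-library sketch of a two-tailed Fisher's exact test followed by a Bonferroni correction. It is not the procedure used in the study (which was run in SPSS); the function names are hypothetical, and the two-tailed p-value is computed in the common way, by summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

# Illustrative sketch; not the SPSS procedure used in the study.
def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    For a CNV locus: a/b = cases with/without the CNV, c/d = controls with/without.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    p = sum(q for q in (prob(x) for x in range(lowest, highest + 1))
            if q <= observed * (1 + 1e-7))
    return min(1.0, p)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]
```

Losses and gains would each be passed to `bonferroni` as a separate list, since the two classes were assessed separately.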

3.12 Breakpoint Mapping and Sequencing

To precisely identify the CNV breakpoints, qPCR was performed at several positions near the estimated

breakpoints (based on the SNP microarray results), narrowing down the estimated location of the

breakpoint to a region approximately 1,000 bp in length. (See Appendix Table S6 for primer sequences;

standard PCR conditions were used as described previously). Primers were designed to PCR-amplify the

region estimated to contain the breakpoint (see Appendix Table S6) and Sanger sequencing was used to

identify the exact base pairs delineating the breakpoint. Products were cleaned up using Qiagen MinElute

PCR purification kit. Sanger sequencing was performed by the AGTC service lab.

4. Results

4.1 Affymetrix 500K results

Of the original 128 FPC cases genotyped on the Affymetrix 500K array, eight were subsequently

excluded (two subjects had excessively noisy data based on CNV count > 40 per analysis run; one subject

was discovered to have had chronic lymphocytic leukemia at the time of blood sample donation, making

it difficult to distinguish germline from somatic CNVs detected in the sample; and five subjects no longer

met inclusion criteria in light of new information that became available after the start of the study),

leaving 120 cases in the final analysis with both NspI and StyI chips represented for each sample. Some

of the subjects were reassigned to different inclusion criteria after updated information became available,

resulting in 68 “3+ FPC” subjects, 28 “2 FPC” subjects, and 24 “single young” subjects contributing to

the final set of case CNVs detected on Affymetrix 500K array. Two controls were discovered to have a

history of sporadic pancreatic cancer (no affected relatives), and two other controls each reported having

two relatives with pancreatic cancer, suggesting potential FPC kindreds. After excluding those four

samples, 1,194 controls remained in the final analysis. For 236 of those controls, only one chip was

included in the analysis (137 NspI only; 99 StyI only) due to inadequate hybridization of the second chip.

STRUCTURE software was used for estimating population ancestry of the 120 FPC cases and 958

controls that had NspI + StyI chips available for analysis: 89.2% of cases and 94.8% of controls were

Caucasian; 1.7% of cases and 2.1% of controls were Asian; and 9.2% of cases and 3.1% of controls were

of admixed background.

Figures 8 and 9 summarize the number of gains and losses called by each algorithm on each chip in cases

and controls.

Figure 8 – Gains and losses identified in FPC cases by each algorithm/chip

Figure 8 Legend: Number of losses and gains identified by each algorithm and resultant number of losses and gains

after merging overlapping CNVs.

Figure 9 - Gains and losses identified in controls by each algorithm/chip

Figure 9 Legend: Number of losses and gains identified by each algorithm and resultant number of losses and gains

after merging overlapping CNVs.

The total number of autosomal CNVs identified in cases and controls was 873 and 10,794 respectively, of

which 382 CNVs (123 losses + 259 gains) in cases and 3,115 CNVs (805 losses + 2,310 gains) in controls

were considered high confidence calls (corresponding to 66 loss loci + 105 gain loci in cases and 313 loss

loci + 467 gain loci in controls). (Appendix Tables S7 to S10 for high- and low-confidence CNVs in cases

and controls (available as excel files on attached CD)). The proportion of losses and gains considered

“high-confidence” was significantly larger in cases than in controls (losses: 48% cases vs. 33% controls,

p<0.001; gains: 42% cases vs. 28% controls, p<0.001). As well, the percentage of cases with at least one

high-confidence loss was significantly greater than controls (68% vs 47%, p<0.001), but no significant

difference existed between cases and controls in the percentage of samples with high-confidence gains

(85% vs. 80%, p=0.227). Significance testing results were the same whether or not the 236 controls with

only one chip in the analysis were included, or whether the denominator is all samples vs. only samples

that had at least one CNV call. We note that no significant difference was observed between cases and

controls when restricting the analysis to FGICR controls that were genotyped at the same centre (TCAG).

(Tables 7 and 8)

Table 7 - Proportion of high-confidence losses in cases and controls

Columns: (A) % of losses that are high-confidence (HC); (B) % of HC losses if remove controls with only 1 chip; (C) % of samples with HC losses; (D) % of samples with HC losses if remove controls with only 1 chip; (E) % of samples with HC losses among 2-chip samples with at least one loss call

Group | (A) | (B) | (C) | (D) | (E)
Cases | 48 | 48 | 68 | 68 | 76
All Controls | 33 | 35 | 47 | 53 | 63
Fisher's exact | p < 0.001 | p < 0.001 | p < 0.001 | p=0.002 | p=0.009
FGICR controls | 41 | 43 | 51 | 55 | 64
Fisher's exact (compared to cases) | p=0.303 | p=0.512 | p=0.070 | p=0.190 | p=0.190

Table 8 - Proportion of high-confidence gains in cases and controls

Columns: (A) % of gains that are high-confidence (HC); (B) % of HC gains if remove controls with only 1 chip; (C) % of samples with HC gains; (D) % of samples with HC gains if remove controls with only 1 chip; (E) % of samples with HC gains among 2-chip samples with at least one gain call

Group | (A) | (B) | (C) | (D) | (E)
Cases | 42 | 42 | 85 | 85 | 87
All Controls | 28 | 29 | 80 | 86 | 88
Fisher's exact | p < 0.001 | p < 0.001 | p=0.227 | p=0.782 | p=0.882
FGICR controls | 49 | 50 | 80 | 81 | 85
Fisher's exact (compared to cases) | p=0.109 | p=0.086 | p=0.227 | p=0.626 | p=0.789

4.2 Affymetrix 6.0 results

In 36 cases genotyped on the Affymetrix 6.0 array (two of the original 38 samples were excluded due to

excess noise – see methods), a total of 3,364 autosomal CNVs (2,665 losses and 699 gains) were

identified using Birdsuite, and 3,266 autosomal CNVs were identified using iPattern (1,975 losses and

1,291 gains). Table 9 summarizes some key parameters of CNVs identified by each algorithm.

Table 9 - CNVs called by each of Birdsuite and iPattern in 36 samples on Affymetrix 6.0 array

Parameter | Birdsuite | iPattern
# losses | 2,665 | 1,975
# gains | 699 | 1,291
median size losses (bp) | 7,793 | 10,388
median size gains (bp) | 60,599 | 19,857
# genic losses (% of all losses) | 969 (36%) | 693 (35%)
# genic gains (% of all gains) | 512 (73%) | 690 (53%)
# losses called as HC losses in 500K array (in same sample) | 33 | 35
# losses called as LC losses in 500K array (in same sample) | 20 | 20
# gains called as HC gains in 500K array (in same sample) | 70 | 70
# gains called as LC gains in 500K array (in same sample) | 33 | 38
mean # losses per sample/mean # gains per sample | 74/19 | 55/36

HC = high-confidence; LC = low-confidence on 500K array

The high-confidence set of Affy6 CNVs (incorporating GTC-supported CNVs) comprised 2,187 CNVs

(1,656 losses + 531 gains). (Appendix Tables S11 to S12 for high-confidence CNVs on Affy6 array in

FPC cases and controls (available as excel files on attached CD)). The median size of high-confidence

losses and gains was 12.7kb (1kb-1.4Mb) and 48.9kb (1kb-1.6Mb), respectively, and the average number

of losses and gains per genome was 46 and 15, respectively.

4.3 CNV validation

Quantitative PCR was used to attempt validation of 18 losses (13 high-confidence and 5 low-confidence)

and 10 gains (all high-confidence) in FPC cases, of which all the high-confidence CNVs validated and 4/5

low-confidence CNVs validated. (Appendix Figures S1 to S32 for qPCR results). Of the 33 FPC cases

that were hybridized to both Affy 500K and Affy6.0 arrays, 31 yielded useable results on both arrays.

For those 31 cases, 113 high-confidence CNVs and 142 low-confidence CNVs were called on the 500K

array, of which 107 (95%) high-confidence CNVs and 63 (44%) low-confidence CNVs were validated on

the Affy6 array. The combined results of qPCR validation and Affy6 genotyping demonstrated a

validation rate of 95% (121/127) for high-confidence CNVs but only 45% (66/146) for low-confidence

CNVs. Therefore, the remainder of this analysis was limited to high-confidence CNVs in cases and

controls. Approximately one third (121/382) of all high-confidence case CNVs identified on the 500K

array, corresponding to half (88/171) of all high-confidence CNV loci in cases, have been confirmed by

either the Affymetrix 6.0 array and/or qPCR.

4.4 Comparing CNV profile of cases and controls

We compared several characteristics of CNVs identified on the 500K array between FPC cases and

FGICR/OFCCR controls. Table 10 compares several key CNV attributes between cases and controls

(based on high-confidence CNVs and excluding EBV-derived samples and controls with only one chip in

the analysis).

Table 10 - High confidence CNV profile of cases vs. controls (excluding EBV-derived samples and excluding controls with data from only one chip)

Attribute | FPC cases | Controls | p-value
# Lymphocyte samples | 91 | 950 |
# High-confidence losses/high-confidence gains | 91/190 | 731/2,059 |
Median CNV size (range) | 219.5kb (1.2kb-6.4Mb) | 219.5kb (1.2kb-6.8Mb) | 0.439
Median CNV SNP count (range) | 42 (2-417) | 40 (2-1318) | 0.578
# Genic CNVs/all CNVs, losses | 52/91 (57%) | 400/731 (55%) | 0.738
# Genic CNVs/all CNVs, gains | 153/190 (81%) | 1,646/2,059 (80%) | 0.850
# Samples with genic CNVs/samples with any CNVs, losses | 43/59 (73%) | 327/500 (65%) | 0.309
# Samples with genic CNVs/samples with any CNVs, gains | 70/75 (93%) | 765/816 (94%) | 0.805
# CNV genes identified as “Cancer Genes” in MSKCC CancerGenes database/all CNV genes recognized by the MSKCC database, losses | 8/36 (22%) | 35/264 (13%) | 0.201
# CNV genes identified as “Cancer Genes” in MSKCC CancerGenes database/all CNV genes recognized by the MSKCC database, gains | 53/335 (16%) | 507/2940 (17%) | 0.541
# CNV loci included in rare analysis/all CNV loci, losses | 36/52 (69%) | 203/290 (70%) | 1.000
# CNV loci included in rare analysis/all CNV loci, gains | 65/83 (78%) | 349/428 (82%) | 0.541
# CNVs that are part of rare loci/all CNVs, losses | 23/91 (25%) | 199/731 (27%) | 0.802
# CNVs that are part of rare loci/all CNVs, gains | 47/190 (25%) | 461/2,059 (22%) | 0.469
# Samples with CNVs included in rare analysis/samples with any CNV, losses | 32/59 (54%) | 235/500 (47%) | 0.335
# Samples with CNVs included in rare analysis/samples with any CNV, gains | 56/75 (75%) | 551/816 (68%) | 0.244
# Samples with rare CNVs/samples with any CNV, losses | 21/59 (36%) | 169/500 (34%) | 0.773
# Samples with rare CNVs/samples with any CNV, gains | 37/75 (49%) | 348/816 (43%) | 0.275
# Genic rare CNVs/all rare CNVs, losses | 10/23 (43%) | 69/199 (35%) | 0.491
# Genic rare CNVs/all rare CNVs, gains | 33/47 (70%) | 330/461 (72%) | 0.866
# Samples with genic rare CNVs/samples with rare CNVs, losses | 10/21 (48%) | 63/169 (37%) | 0.476
# Samples with genic rare CNVs/samples with rare CNVs, gains | 27/37 (73%) | 267/348 (77%) | 0.684
Mean CNVs per genome*, losses | 1.5 | | 0.443
Mean CNVs per genome*, gains | 2.5 | | 0.956
Mean rare CNVs per genome*, losses | 0.4 | | 0.919
Mean rare CNVs per genome*, gains | 0.6 | | 0.498

*mean and t-test calculated for losses and gains based only on samples with at least one high-confidence loss or gain, respectively (to avoid the bias of samples which didn’t get a high-confidence CNV call due to noise)

Overall, no significant difference was observed in the CNV profile of cases and controls, including such

parameters as CNV size, proportion of genic CNVs, proportion of rare CNVs, and average number of

CNVs per individual genome. In both groups, gains were larger than losses (median size - cases: 228.7kb

vs. 176.6kb, p=0.016; controls: 224.4kb vs. 168.0kb, p<0.001) and were more likely to overlap genes

(cases: 153/190 gains vs. 52/91 losses are genic, p<0.001; Controls: 1,641/2,059 gains vs. 400/731 losses

are genic, p<0.001).

4.5 CNVs of interest

Figure 7 summarizes the CNV prioritization plan that we applied to our data. The highest priority is

assigned to CNVs that segregate with disease status in blood relatives, or alternatively de novo CNVs in

singleton young affected subjects.

Since no trios were available for analysis, we could not determine which CNVs were de novo. Only two

pairs of siblings were genotyped, while the remaining were all unrelated subjects. In one pair of siblings

whose parents are not consanguineous, only a single gain was shared by the two siblings and this CNV was

also identified in many other cases and controls. In the second pair of siblings whose parents are first-

cousins, one loss and three gains were shared by the two siblings but all the CNVs were also shared by

controls. Hence, no FPC-specific CNVs were found to segregate in either of the two pairs of siblings.

Next in priority are CNVs that overlap in two or more unrelated cases and are absent in controls. We also

considered CNVs present in cases and controls if they met the following conditions: (1) CNV present in

two or more cases; (2) CNV overlaps gene(s) in cases; (3) the genic portion of the region is not

overlapped by control CNVs or DGV CNVs. (To ensure that we are not missing anything significant, we

assessed the data for loci overlapping two or more cases and no controls even if reported in the DGV, but

none fit these criteria). A total of 64 FPC CNVs (27 losses and 37 gains) detected on the 500K array were

not identified in FGICR or OFCCR controls. After further excluding regions that overlapped POPGEN

or OHI controls or were reported in the DGV, the number of FPC-specific CNVs identified on the 500K

array is 37 CNVs (16 losses and 21 gains). On the Affymetrix 6.0 array, 119 FPC CNVs (71 losses and 48

gains) were not identified in POPGEN or OHI controls, and after further excluding regions which

overlapped FGICR and OFCCR controls or were in the DGV, 73 FPC-specific CNVs (45 losses and 28

gains) remained. Combining results from the two arrays (including regions identified on both platforms)

yielded a total of 93 non-redundant FPC-specific CNVs (53 losses and 40 gains), each CNV present in a

single individual only (a total of 50 FPC cases, including 7 EBV-derived samples); 13 losses and 8 gains

were in non-Caucasian individuals.
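The filtering described above can be pictured as an interval-overlap problem. The sketch below is illustrative only: it assumes that any base-pair overlap with a control or DGV region on the same chromosome disqualifies a case CNV (the actual overlap rule used in the analysis may have differed), and all coordinates and names (`Region`, `overlaps`, `case_specific`) are hypothetical.

```python
from typing import List, NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive

def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def case_specific(case_cnvs: List[Region],
                  reference_sets: List[List[Region]]) -> List[Region]:
    """Keep case CNVs that overlap no region in any reference set
    (for example control cohorts or a variant database)."""
    return [
        cnv for cnv in case_cnvs
        if not any(overlaps(cnv, ref) for refs in reference_sets for ref in refs)
    ]

# Hypothetical toy data, not taken from the thesis
cases = [Region("chr1", 100, 500), Region("chr2", 1000, 2000), Region("chr3", 50, 80)]
controls = [Region("chr1", 450, 900)]
database = [Region("chr3", 10, 40)]

kept = case_specific(cases, [controls, database])
print(kept)
```

In this toy example the chr1 CNV is dropped because it overlaps a control region, while the other two are retained. Applying the same filter in two passes (first against one pair of control cohorts, then against the remaining cohorts and the database) mirrors the stepwise exclusion described in the text.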

One duplication (G_97) appeared to affect the same gene (TGFBR3) in two unrelated cases, albeit with

different breakpoints in each case (Figure 10). This gene codes for a receptor of TGF-beta, a signaling

molecule with an important role in pancreatic cancer initiation and progression, and decreased expression

of TGFBR3 has been observed in various cancers suggesting that it behaves as a tumor-suppressor. Given

the potential significance of this gene for pancreatic cancer, we aimed to investigate this duplication

further.

Figure 10 – Duplications overlapping TGFBR3 gene

Figure 10 Legend: TGFBR3 transcripts circled; red bars represent breakpoints of CNVs identified on SNP arrays

Although an overlapping duplication was also present in one POPGEN control, the control duplication

only overlapped the beginning of one of the multiple isoforms of this gene. (There was also a large low-

confidence duplication called in one of our ARCTIC controls, but this appeared to be a false call as

demonstrated by qPCR – see Appendix Figure S33). The duplication in case ID-27 was validated by

qPCR using two different primer sets. We validated the duplication in case ID-203 using those same

primer sets, and additionally tested family members of this subject for whom DNA was available

(Figure 11; Appendix Figures S33-S38).

Figure 11 – Pedigree of case ID-203, indicating results of qPCR testing for duplication G_97

Figure 11 Legend: GB = gallbladder; PC = Pancreas cancer; dup = duplication identified; no dup = no

duplication identified; blood = source of DNA is lymphocytes; tissue = source of DNA is FFPE resected specimen

At this point, we observed that the mother of the proband did not carry the duplication, which weakened

the argument for this CNV being causative for pancreatic cancer (since the pancreatic cancer was

considered matrilineal in this family, with a maternal grandmother reported to have died of the disease).

However, we considered the possibility of the disease being inherited from the paternal side, particularly

since the paternal grandmother was reported to have died of “gallbladder cancer”, which could have been

a misdiagnosis of pancreatic cancer. We did not have access to DNA from the father or paternal

grandmother, but as noted in the pedigree, a sister of the proband had also died of pancreatic cancer.

We wished to test for segregation of the duplication with the disease, but only formalin-fixed paraffin-

embedded (FFPE) tissue was available for DNA extraction from this sister. Due to the fragmented nature

of FFPE-derived DNA (caused by cross-linking and degradation of nucleic acid by formalin

preservation), qPCR performed on FFPE-DNA can be biased and difficult to verify. Therefore, we

decided to fine-map the breakpoints of the duplication to allow Sanger sequencing of the tandem

duplication point. Our fine-mapping method involved designing qPCR probes at several positions falling

within as well as outside the array-defined boundaries of the duplication (Figure 12; Appendix Figures

S39 to S45 for qPCR results).

Figure 12 – Fine-mapping the breakpoint of duplication overlapping TGFBR3 using qPCR walk-along method

Figure 12 Legend: Panel [A] depicts the array-based estimation of the duplication breakpoints; panels [B] and

[C] indicate the locations of the qPCR probes at either end of the duplication (shown as small vertical black bars). In panels [B] and [C], the red arrows indicate the area between the confirmed duplicated and non-duplicated positions at either end of the CNV.
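As a rough illustration of the walk-along logic, the sketch below converts qPCR Ct values to an estimated copy number with the standard 2^-ΔΔCt calculation and then brackets each breakpoint between the outermost duplicated probe and its nearest non-duplicated neighbour. It assumes roughly 100% amplification efficiency, a two-copy calibrator sample, and a copy-number threshold of 2.5 for calling a probe duplicated. The probe positions are hypothetical and the function names are illustrative, not taken from the thesis.

```python
import math

def copy_number(ct_target, ct_ref, ct_target_cal, ct_ref_cal):
    """Estimate copy number by the 2^-ddCt method, relative to a
    calibrator sample assumed to carry two copies."""
    ddct = (ct_target - ct_ref) - (ct_target_cal - ct_ref_cal)
    return 2.0 * 2.0 ** (-ddct)

def bracket_breakpoints(probes, threshold=2.5):
    """probes: list of (position, estimated_copy_number).
    Returns ((last_normal, first_dup), (last_dup, first_normal)), i.e. the
    intervals that must contain the two breakpoints; None marks a missing
    flanking probe. Returns None if no probe is called duplicated."""
    ordered = sorted(probes)
    dup = [i for i, (_, cn) in enumerate(ordered) if cn >= threshold]
    if not dup:
        return None
    first, last = dup[0], dup[-1]
    left = (ordered[first - 1][0] if first > 0 else None, ordered[first][0])
    right = (ordered[last][0],
             ordered[last + 1][0] if last + 1 < len(ordered) else None)
    return left, right

# A three-copy locus shifts Ct by log2(3/2) cycles relative to two copies
cn_dup = copy_number(25.0 - math.log2(1.5), 25.0, 25.0, 25.0)

# Hypothetical probe positions (arbitrary units) and copy-number estimates
probes = [(100, 2.0), (200, 2.1), (300, 3.0), (400, 2.9), (500, 3.1), (600, 1.9)]
intervals = bracket_breakpoints(probes)
print(round(cn_dup, 3), intervals)
```

Adding probes inside each returned interval and repeating the call narrows the brackets step by step, which is the idea behind walking along the region until a junction-spanning PCR becomes feasible.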

At this point, we selected two primers used for qPCR analysis (O_Out_5 and T_Out_3) to attempt PCR

amplification of the region containing the duplication breakpoint. Although we did not know at this point

the exact size of the duplication, we were able to amplify a fragment approximately 1.5-2kb in size (see

Figure 13), whereas a control sample not containing the duplication failed to amplify anything using these

primers (as would be expected).

Figure 13 – PCR gel demonstrating amplification of ~1.5-2kb fragment containing G_97 duplication breakpoint in case ID-203

Figure 13 Legend: Each well represents a separate PCR reaction (three for duplication-carrying sample and

three for non-duplication control)
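The expectation that the control fails to amplify follows from primer orientation, if one assumes (the text does not spell this out) that the two primers point away from each other on a normal chromosome and face each other only across a head-to-tail tandem duplication junction. A minimal sketch of that geometry, with hypothetical coordinates and illustrative function names:

```python
def reference_amplicon_size(fwd_pos, rev_pos):
    """Product size on an unrearranged chromosome. The forward primer extends
    toward higher coordinates and the reverse primer toward lower ones, so a
    product forms only if the forward primer lies upstream of the reverse."""
    return rev_pos - fwd_pos if fwd_pos < rev_pos else None

def junction_amplicon_size(dup_start, dup_end, fwd_pos, rev_pos, inserted=0):
    """Approximate product size across a tandem duplication junction.
    fwd_pos lies near the 3' end of the duplicated segment, rev_pos near its
    5' start; in the tandem arrangement the end of copy 1 abuts the start of
    copy 2, which brings the two primers face to face."""
    if not (dup_start <= rev_pos < fwd_pos <= dup_end):
        raise ValueError("primers must lie inside the segment, reverse upstream of forward")
    return (dup_end - fwd_pos) + inserted + (rev_pos - dup_start)

# Hypothetical coordinates chosen only to illustrate the arithmetic
dup_start, dup_end = 1_000, 51_000
fwd_pos, rev_pos = 50_000, 1_700

control_product = reference_amplicon_size(fwd_pos, rev_pos)
carrier_product = junction_amplicon_size(dup_start, dup_end, fwd_pos, rev_pos, inserted=3)
print(control_product, carrier_product)
```

With these toy values the control yields no product while the carrier yields a fragment of about 1.7 kb, showing how a short junction fragment can arise even when the duplication itself is much larger.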

We submitted the fragment for Sanger sequencing from both ends; although the size of the fragment was

too large to read completely from either primer, we obtained sufficient length of reads from each primer

such that they overlapped at the breakpoint of the duplication, thus allowing us to pinpoint the exact

location of the breakpoint (see Figure 14).

Figure 14 – G_97 duplication breakpoint mapping by Sanger sequencing

Figure 14 Legend: Sequence [A] is located at the end of G_97 that does not transect TGFBR3; the purple-highlighted

portion is seen in Sanger sequence reads from forward primer (O_Out_5) located at that end of the duplication. Sequence [C] is located at the end of G_97 that transects TGFBR3; the yellow-highlighted portion is seen in Sanger sequence reads from reverse primer (T_Out_3) located at that end of the duplication. The non-highlighted portion of each of those reads represents the normally expected sequence in each location if no duplication were present. The red-highlighted sequence is the region of the tandem duplication breakpoint that is observed in each of the Sanger sequence reads from the above-described primers; note the insertion of “TAT” at the point of duplication.
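To make the junction read-out concrete, the sketch below splits a junction-spanning read into the part matching the end of the duplicated segment, any inserted bases, and the part matching the start of the segment. The sequences are invented toy strings (only the "TAT" insertion echoes the figure), the function name `split_junction` is hypothetical, and exact matching is assumed; real reads would need quality trimming, tolerance for sequencing errors, and reverse-complementing of reads from the reverse primer.

```python
def split_junction(read, upstream_ref, downstream_ref):
    """Split a read spanning a tandem duplication junction.
    upstream_ref: reference sequence ending at the 3' end of the duplicated segment.
    downstream_ref: reference sequence starting at the 5' start of the segment.
    Returns (left_match, inserted_bases, right_match)."""
    # Longest read prefix that is a suffix of the upstream reference
    left = 0
    for k in range(min(len(read), len(upstream_ref)), 0, -1):
        if upstream_ref.endswith(read[:k]):
            left = k
            break
    # Longest remaining read suffix that is a prefix of the downstream reference
    right = 0
    for k in range(min(len(read) - left, len(downstream_ref)), 0, -1):
        if downstream_ref.startswith(read[len(read) - k:]):
            right = k
            break
    cut = len(read) - right
    return read[:left], read[left:cut], read[cut:]

# Toy sequences, not the actual TGFBR3 locus
upstream_ref = "GGCATTCAGGACCTA"
downstream_ref = "CCGTTAGCAAGTCCA"
read = "CAGGACCTA" + "TAT" + "CCGTTAGCA"

parts = split_junction(read, upstream_ref, downstream_ref)
print(parts)
```

Because the prefix match is greedy, bases shared by both flanks (microhomology) are assigned to the left side, so the reported insertion is the minimum that explains the read under these assumptions.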

Based on this information, we designed a primer set to amplify a smaller fragment encompassing the

breakpoint (~100 bp), to allow amplification of FFPE-derived DNA (obtained from non-tumor region of

the specimen block) from the affected sister of the proband. We also performed PCR amplification of

several other amplicons of similar size to control for DNA degradation, and we used case ID-203 as a

positive control for the duplication. As Figure 15 illustrates, although the FFPE DNA appeared to

amplify the four other test amplicons well, no amplification of the duplication breakpoint region was

observed in the affected sister, indicating that she did not inherit the duplication.

Figure 15 – PCR gel illustrating amplification of test regions and duplication breakpoint in case ID-203 and affected sister

Figure 15 Legend: Wells within the blue boxes belong to sister of ID-203 (source of FFPE DNA); wells

outside blue boxes belong to case ID-203 (blood-derived DNA); every fifth column is water control

4.6 FPC-specific CNVs

Since the TGFBR3 duplication did not segregate with pancreatic cancer in the family we studied, and no

FPC-specific CNV occurred in more than one case, we proceeded to annotate the FPC-specific CNVs and

to prioritize them based on gene content and their association with cancer. (Figure 16 illustrates the

distribution of FPC-specific CNVs across the genome).


Figure 16 – FPC-specific losses and gains on autosomal chromosomes

Figure 16 Legend: Red box = loss; Green box = gain

Twenty-three FPC-specific losses and 23 FPC-specific gains overlapped introns, exons, and/or

untranslated regions of 104 RefSeq genes (Table 11).

Table 11 – FPC specific CNVs

| CNV type | CNV Id | Sample Id | Coordinates (hg18) | Size (kb) | RefSeq Genes | Overlaps Pancreatic Expression Database CNVs? |
|---|---|---|---|---|---|---|
| Gain | Affy6.0_G_11 | 127 | chr1:49856085-50089082 | 233.0 | AGBL4 | no |
| Gain | Affy500K_G_280 & Affy6_G_298 | 62 | chr18:6838462-7291170 | 452.7 | ARHGAP28, LAMA1, LRRC30, LOC400643 | High-level amplification |
| Gain | Affy500K_G_380 | 82 | chr3:143693491-143928895 | 235.4 | ATR, PLS1, TRPC1 | no |
| Gain | Affy6.0_G_324 | 20 (Admixed) | chr19:60436319-60696243 | 259.9 | BRSK1, UBE2S, SHISA7, TMEM190, COX6B2, FAM71E2, HSPBP1, TMEM150B, ISOC2, IL11, RPL28, TMEM238, ZNF628, SUV420H2, NAT14, PPP6R1, SSC5D | no |
| Gain | Affy500K_G_136 | 37 (EBV) | chr16:78810438-79258408 | 448.0 | DYNLRB2, CDYL2, MIR548H4 |  |
|  |  |  | chr7:133223330-133393933 | 170.6 | EXOC4 | no |
| Gain | Affy6.0_G_235 | 99 | chr15:32814039-32848252 | 34.2 | GJD2 | no |
| Gain | Affy500K_G_365 | 79 | chr4:93344017-93591992 | 248.0 | GRID2 | no |
| Gain | Affy6.0_G_226 | 44 | chr15:70381008-70436843 | 55.8 | HEXA, CELF6 | no |
| Gain | Affy500K_G_603/604 & Affy6_G_93 | 123 (Admixed) | chr8:39935640-39943638 | 8.0 | IDO2 | no |
| Gain | Affy6.0_G_39 | 123 (Admixed) | chr3:161448573-161518365 | 69.8 | IFT80 | no |
| Gain | Affy6.0_G_143 | 17 | chr10:71778181-71797516 | 19.3 | LRRC20 | no |
| Gain | Affy6.0_G_170 | 20 (Admixed) | chr11:65027491-65201466 | 174.0 | LTBP3, PCNXL3, MAP3K11, MIR4489, MALAT1, RELA, SIPA1, SSSCA1, FAM89B, KCNK7, MIR4690, EHBP1L1, LOC254100, SCYL1 | no |
|  |  |  | chr18:2254263-2555103 | 300.8 | METTL4 | no |
| Gain | Affy6.0_G_33 | 69 | chr2:216465517-216485115 | 19.6 | none | no |
| Gain | Affy500K_G_88 | 24 | chr4:26691114-26985948 | 294.8 | (mRNA present) | no |
| Gain | Affy500K_G_369 | 80 | chr4:29195980-29209908 | 13.9 | none | no |
| Gain | Affy500K_G_602 & Affy6_G_50 | 123 (Admixed) | chr4:72734028-72817447 | 83.4 | none | no |
| Gain | Affy500K_G_511 | 107 (EBV) | chr4:105853937-106127766 | 273.8 | none | no |
| Gain | Affy500K_G_407 | 86 | chr6:48829836-49492706 | 662.9 | none | no |
| Gain | Affy6.0_G_70 | 44 | chr6:132466247-132479169 | 12.9 | none | no |
| Gain | Affy500K_G_49 | 12 (Admixed) | chr9:81978854-82021829 | 43.0 | none | no |
| Gain | Affy6.0_G_152 | 54 | chr11:41420026-41456633 | 36.6 | (mRNA present) | no |
|  |  |  | chr11:81521790-81598468 | 76.7 | (mRNA present) | no |
| Gain | Affy500K_G_502 | 106 (EBV) | chr12:57378034-57482408 | 104.4 | (mRNA present) | no |
| Gain | Affy500K_G_225 | 58 | chr21:28431800-28667362 | 235.6 | (mRNA present) | no |
| Gain | Affy500K_G_226 | 58 | chr21:35973166-36013145 | 40.0 | (mRNA present) | no |
|  | Affy500K_G_105 & Affy6_G_283 & Affy6_G_284 | 28 | chr17:2919396-3184579 | 265.2 | OR1D2, OR1G1, OR1A2, OR1A1, OR1D4, OR3A2, OR3A1, OR3A4P | no |
| Gain | Affy500K_G_95 | 26 | chr10:19849680-20589237 | 739.6 | PLXDC2 |  |
| Gain | Affy6.0_G_90 | 202 | chr8:49008716-49049657 | 40.9 | PRKDC, MCM4 | no |
| Gain | Affy6.0_G_3 | 123 (Admixed) | chr1:157133096-157188413 | 55.3 | PYHIN1 | no |
|  |  |  | chr8:108696004-109010881 | 314.9 | RSPO2 |  |
| Gain | Affy500K_G_303 | 65 | chr2:230753632-230823051 | 69.4 | SP110, SP140 | no |
| Gain | Affy6.0_G_179 | 11 (Asian) | chr12:81711207-81762121 | 50.9 | TMTC2 | no |
| Gain | Affy6.0_G_212 | 67 | chr14:73405361-73432688 | 27.3 | ZNF410, PTGR2 | no |
| Gain | Affy6.0_G_315 | 62 | chr19:60824299-60923809 | 99.5 | ZNF784, NLRP9, EPN1, CCDC106, ZNF580, U2AF2, ZNF581 | no |
| Loss | Affy500K_D_125 & Affy6_D_1246 | 68 | chr12:39394850-39501843 | 107.0 | CNTN1 |  |
| Loss | Affy6.0_D_870 | 123 (Admixed) | chr5:11220277-11229088 | 8.8 | CTNND2 | no |
| Loss | Affy6.0_D_1507 | 11 (Asian) | chr18:3670476-3715553 | 45.1 | DLGAP1 | no |
|  |  |  | chr10:128752241-128780181 | 27.9 | DOCK1 | no |
| Loss | Affy6.0_D_637 | 204 | chr2:55010996-55019655 | 8.7 | EML6 | no |
| Loss | Affy500K_D_24 & Affy6_D_1342 | 11 (Asian) | chr13:93544008-93670507 | 126.5 | GPC6 |  |
| Loss | Affy6.0_D_739 | 69 | chr4:70867305-70952889 | 85.6 | HTN1, HTN3, STATH | no |
| Loss | Affy500K_D_152 | 85 | chr3:125676839-125815545 | 138.7 | KALRN | no |
| Loss | Affy6.0_D_477 | 97 | chr1:62528216-62538049 | 9.8 | KANK4 | no |
| Loss | Affy6.0_D_1548 | 61 | chr19:61684427-61697318 | 12.9 | LOC100128252 | no |
| Loss | Affy6.0_D_844 | 40 | chr4:178997998-179018809 | 20.8 | LOC285501 | no |
|  |  |  | chr6:119578774-119604698 | 25.9 | MAN1A1 | no |
| Loss | Affy500K_D_220 | 112 (EBV) | chr8:6371546-6430547 | 59.0 | MCPH1, ANGPT2 | no |
| Loss | Affy500K_D_142 | 77 (Admixed) | chr8:17998784-18145035 | 146.3 | NAT1 | no |
| Loss | Affy6.0_D_535 | 62 | chr2:41356049-41390177 | 34.1 | none | no |
|  |  |  | chr2:41474986-41608172 | 133.2 | none | no |
| Loss | Affy6.0_D_677 | 20 (Admixed) | chr3:22405124-22481450 | 76.3 | none | no |
| Loss | Affy6.0_D_769 | 28 | chr4:123803190-123806840 | 3.7 | (mRNA present) | no |
| Loss | Affy6.0_D_930 | 35 | chr6:142219243-142324891 | 105.6 | (mRNA present) | no |
| Loss | Affy6.0_D_992 | 64 (Admixed) | chr7:23094182-23110722 | 16.5 | (mRNA present) | no |
| Loss | Affy500K_D_93 | 48 | chr8:89782116-89849946 | 67.8 | (mRNA present) | no |
| Loss | Affy6.0_D_1644 | 91 | chr8:131657747-131683625 | 25.9 | (mRNA present) | no |
| Loss | Affy500K_D_134 | 74 | chr9:2235919-2351848 | 115.9 | none | no |
|  |  |  | chr9:75525136-75638229 | 113.1 | none | no |
| Loss | Affy6.0_D_1108 | 4 | chr9:102517861-102553347 | 35.5 | (mRNA present) | no |
|  |  |  | chr11:39882017-40010124 | 128.1 | none | no |
|  |  |  |  |  | (mRNA present) | no |
| Loss | Affy6.0_D_1205 | 204 | chr11:104741261-104793318 | 52.1 | (mRNA present) | no |
|  |  |  | chr12:130382166-130686668 | 304.5 | (mRNA present) | no |
| Loss | Affy500K_D_121 & Affy6_D_1383 | 64 (Admixed) | chr14:85216336-85436133 | 219.8 | none | no |
| Loss | Affy6.0_D_1428 | 35 | chr15:60314660-60333770 | 19.1 | (mRNA present) | no |
| Loss | Affy6.0_D_1467 | 67 | chr16:54046835-54056160 | 9.3 | (mRNA present) | no |
| Loss | Affy6.0_D_1601 | 11 (Asian) | chr20:50766640-50780316 | 13.7 | none | no |
| Loss | Affy500K_D_225 | 114 (EBV) | chr21:23160325-23267106 | 106.8 | none | no |
| Loss | Affy6.0_D_542 | 61 | chr2:148426768-148464448 | 37.7 | ORC4 | no |
| Loss | Affy6.0_D_925 | 101 | chr6:162342089-162365931 | 23.8 | PARK2 | no |
|  |  |  |  |  | PCSK1, ERAP1, CAST |  |
| Loss | Affy6.0_D_1065 | 61 | chr8:85558196-85579549 | 21.4 | RALYL | no |
| Loss | Affy6.0_D_1527 | 203 | chr18:38603464-38605275 | 1.8 | RIT2 | no |
| Loss | Affy6.0_D_1484 | 35 | chr17:75852813-75870192 | 17.4 | RNF213 | no |
| Loss | Affy6.0_D_741 | 28 | chr4:53829489-53875712 | 46.2 | SCFD2 | no |
| Loss | Affy6.0_D_549 | 99 | chr2:78025162-78059816 | 34.7 | SNAR-H | no |
|  |  |  | chr4:147802903-148190197 | 387.3 | TTC29 | no |

Fourteen genes (including one small nuclear RNA) had at least part of their coding regions affected by

FPC-specific losses, and 74 genes (including 3 microRNAs) had at least part of their coding regions

affected by FPC-specific gains (Table 12).

Table 12 – Genes whose coding regions are affected by FPC-specific CNVs

| CNV type | Gene | Entrez Id | Official full name | Position (hg18) | Array | Sample | Extent of gene affected |
|---|---|---|---|---|---|---|---|
| Gain | OR1A1 | 8383 | olfactory receptor, family 1, subfamily A, member 1 | chr17:2932535-3161719 | 500K | 28 | full |
| Gain | OR1D2 | 4991 | olfactory receptor, family 1, subfamily D, member 2 | chr17:2919396-3019805 | 500K & Affy6 | 28 | full |
| Gain | OR1G1 | 8390 | olfactory receptor, family 1, subfamily G, member 1 | chr17:2919396-3019805 |  |  |  |
| Gain | OR1D4 | 653166 | olfactory receptor, family 1, subfamily D, member 4 (gene/pseudogene) | chr17:2932535-3184579 |  |  |  |
| Gain | CDYL2 | 124359 | chromodomain protein, Y-like 2 | chr16:78810438-79258408 | 500K | 37 | partial |
| Gain | DYNLRB2 | 83657 | dynein, light chain, roadblock-type 2 | chr16:78810438-79258408 | 500K | 37 | full |
| Gain | MIR548H4 | 100313884 | microRNA 548h-4 | chr16:78810438-79258408 | 500K | 37 | partial |
| Gain | METTL4 | 64863 | methyltransferase like 4 | chr18:2254263-2555103 | 500K & Affy6 | 44 | partial |
| Gain | ARHGAP28 | 79822 | Rho GTPase activating protein 28 | chr18:6838462-7291170 |  |  |  |
| Gain | LAMA1 | 284217 | laminin, alpha 1 | chr18:6838462-7291170 |  |  |  |
| Gain | LOC400643 | 400643 | hypothetical LOC400643 | chr18:6838462-7291170 |  |  |  |
| Gain | LRRC30 | 339291 | leucine rich repeat containing 30 | chr18:6838462-7291170 |  |  |  |
| Gain | SP110 | 3431 | SP110 nuclear body protein | chr2:230753632-230823051 | 500K | 65 | partial |
| Gain | SP140 | 11262 | SP140 nuclear body protein | chr2:230753632-230823051 | 500K | 65 | partial |
| Gain | GRID2 | 2895 | glutamate receptor, ionotropic, delta 2 | chr4:93344017-93591992 | 500K | 79 | partial |
| Gain | ATR | 545 | ataxia telangiectasia and Rad3 related | chr3:143693491-143928895 | 500K | 82 | partial |
| Gain | PLS1 | 5357 | plastin 1 | chr3:143693491-143928895 | 500K | 82 | full |
| Gain | TRPC1 | 7220 | transient receptor potential cation channel, subfamily C, member 1 | chr3:143693491-143928895 | 500K | 82 | partial |
| Gain | IDO2 | 169355 | indoleamine 2,3-dioxygenase 2 | chr8:39935640-39943638 |  |  |  |
| Gain | EXOC4 | 60412 | exocyst complex component 4 | chr7:133223330-133393933 |  |  |  |
| Gain | RSPO2 | 340419 | R-spondin 2 homolog (Xenopus laevis) | chr8:108696004-108994913 |  |  |  |
| Gain | PLXDC2 | 84898 | plexin domain containing 2 | chr10:19849680-20589237 | 500K | 26 | partial |
| Gain | AGBL4 | 84871 | ATP/GTP binding protein-like 4 | chr1:49856085-50089082 | Affy6 | 127 | partial |
| Gain | EHBP1L1 | 254102 | EH domain binding protein 1-like 1 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | FAM89B | 23625 | family with sequence similarity 89, member B | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | KCNK7 | 10089 | potassium channel, subfamily K, member 7 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | LOC254100 | 254100 | hypothetical LOC254100 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | LTBP3 | 4054 | latent transforming growth factor beta binding protein 3 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | MALAT1 | 378938 | metastasis associated lung adenocarcinoma transcript 1 (non-protein coding) | chr11:65027491-65201466 | Affy6 | 20 | partial |
| Gain | MAP3K11 | 4296 | mitogen-activated protein kinase kinase kinase 11 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | MIR4489 | 100616284 | microRNA 4489 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | MIR4690 | 100616292 | microRNA 4690 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | PCNXL3 | 399909 | pecanex-like 3 (Drosophila) | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | RELA | 164014 | v-rel reticuloendotheliosis viral oncogene homolog A (avian) | chr11:65027491-65201466 | Affy6 | 20 | partial |
| Gain | SCYL1 | 57410 | SCY1-like 1 (S. cerevisiae) | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | SIPA1 | 602180 | signal-induced proliferation-associated 1 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | SSSCA1 | 10534 | Sjogren syndrome/scleroderma autoantigen 1 | chr11:65027491-65201466 | Affy6 | 20 | full |
| Gain | PTGR2 | 145482 | prostaglandin reductase 2 | chr14:73405361-73432688 | Affy6 | 67 | partial |
| Gain | ZNF410 | 57862 | zinc finger protein 410 | chr14:73405361-73432688 | Affy6 | 67 | partial |
| Gain | CELF6 | 60677 | CUGBP, Elav-like family member 6 | chr15:70381008-70436843 | Affy6 | 44 | partial |
| Gain | HEXA | 3073 | hexosaminidase A (alpha polypeptide) | chr15:70381008-70436843 | Affy6 | 44 | partial |
| Gain | GJD2 | 57369 | gap junction protein, delta 2, 36kDa | chr15:32814039-32848252 | Affy6 | 99 | full |
| Gain | CCDC106 | 29903 | coiled-coil domain containing 106 | chr19:60824299-60923809 | Affy6 | 62 | full |
| Gain | EPN1 | 29924 | epsin 1 | chr19:60824299-60923809 | Affy6 | 62 | full |
| Gain | NLRP9 | 338321 | NLR family, pyrin domain containing 9 | chr19:60824299-60923809 | Affy6 | 62 | partial |
| Gain | U2AF2 | 11338 | U2 small nuclear RNA auxiliary factor 2 | chr19:60824299-60923809 | Affy6 | 62 | full |
| Gain | ZNF580 | 51157 | zinc finger protein 580 | chr19:60824299-60923809 | Affy6 | 62 | full |
| Gain | ZNF784 | 147808 | zinc finger protein 784 | chr19:60824299-60923809 | Affy6 | 62 | partial |
| Gain | BRSK1 | 84446 | BR serine/threonine kinase 1 | chr19:60436319-60696243 | Affy6 | 20 | full |

Gain COX6B2 125965

cytochrome c oxidase subunit VIb polypeptide 2 (testis)