Page 1: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

Integrated studies of copy number and genotype

Wei Sun Department of Biostatistics, Genetics University of North Carolina, Chapel Hill

Page 2

Outline

  1. Introduction to copy number variation (CNV)

  2. Identification of CNV by segmentation algorithms

  3. Integrated study of copy number and genotypes

  4. Copy number aberration (CNA) in tumor tissue

  5. Conclusion and future work

Page 3

Central Dogma

Page 4

DNA

Deoxyribonucleic acid (DNA)

A nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses.

DNA consists of two long polymers of simple units called nucleotides.

Base pairs: A - T, C - G

Double Helix

Page 5

Definitions

  Structural Variant
    An umbrella term for genomic alterations that involve segments of DNA larger than 1 kb.

  Copy Number Variation/Variant (CNV)
    A segment of DNA that is 1 kb or larger and is present at a variable copy number in comparison with a reference genome.

  Copy Number Polymorphism
    A CNV that occurs in more than 1% of the population. Originally, this definition was used to refer to all CNVs.

  Inversion, Translocation

Feuk et al. Nature Reviews Genetics vol. 7, p. 85
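The size and frequency thresholds in these definitions can be written as a small rule. The sketch below is only an illustration: the function name and the returned labels are hypothetical, and it assumes the length and population frequency of a copy-number change are already known.

```python
def classify_variant(length_bp, population_frequency):
    """Label a copy-number change using the thresholds on the Definitions slide.

    length_bp: size of the affected segment in base pairs.
    population_frequency: fraction of the population carrying it (0 to 1).
    """
    if length_bp < 1_000:
        # The slide defines a CNV as a segment of 1 kb or larger.
        return "below the 1 kb CNV threshold"
    if population_frequency > 0.01:
        # A CNV seen in more than 1% of the population.
        return "copy number polymorphism (CNP)"
    return "copy number variant (CNV)"


print(classify_variant(150_000, 0.002))
print(classify_variant(20_000, 0.05))
```

Note that a copy number polymorphism is still a CNV; the function simply returns the most specific label.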

Page 6

Structural variation under the microscope

Feuk et al. Nature Reviews Genetics vol. 7, p. 85

Page 7

How frequently do CNVs occur?

  Size of the human genome
    23 pairs of chromosomes, 3 billion base pairs
    One nucleotide corresponds to one base pair
    Size of each chromosome ranges from 50 Mb to 247 Mb

  Large-scale CNVs (> 50 kb) cover 5% to 18% of the genome
    < 0.5% for any pair of genomes

  There may be many rare and short CNVs
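To put the percentages on this slide into base pairs, the short calculation below converts them using the 3 billion bp genome size quoted above. It is a rough illustration only; the function name is hypothetical and the inputs are the slide's rounded figures.

```python
GENOME_BP = 3_000_000_000  # genome size quoted on the slide


def span_in_mb(percent, genome_bp=GENOME_BP):
    """Convert a percentage of the genome into megabases (Mb)."""
    return genome_bp * percent / 100 / 1e6


low_mb = span_in_mb(5)     # lower end of the 5% to 18% range
high_mb = span_in_mb(18)   # upper end of the range
pair_mb = span_in_mb(0.5)  # the slide's bound for any pair of genomes

print(f"Large-scale CNV regions: {low_mb:.0f} Mb to {high_mb:.0f} Mb")
print(f"Bound for one pair of genomes: under {pair_mb:.0f} Mb")
```

The gap between the two figures reflects that the 5% to 18% range describes regions where CNVs have been seen across many people, while any two individual genomes differ over a much smaller fraction.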

Page 8

LETTERS

Rare chromosomal deletions and duplications increase risk of schizophrenia

The International Schizophrenia Consortium*

Schizophrenia is a severe mental disorder marked by hallucinations, delusions, cognitive deficits and apathy, with a heritability estimated at 73–90% (ref. 1). Inheritance patterns are complex, and the number and type of genetic variants involved are not understood. Copy number variants (CNVs) have been identified in individual patients with schizophrenia2–7 and also in neurodevelopmental disorders8–11, but large-scale genome-wide surveys have not been performed. Here we report a genome-wide survey of rare CNVs in 3,391 patients with schizophrenia and 3,181 ancestrally matched controls, using high-density microarrays. For CNVs that were observed in less than 1% of the sample and were more than 100 kilobases in length, the total burden is increased 1.15-fold in patients with schizophrenia in comparison with controls. This effect was more pronounced for rarer, single-occurrence CNVs and for those that involved genes as opposed to those that did not. As expected, deletions were found within the region critical for velo-cardio-facial syndrome, which includes psychotic symptoms in 30% of patients12. Associations with schizophrenia were also found for large deletions on chromosome 15q13.3 and 1q21.1. These associations have not previously been reported, and they remained significant after genome-wide correction. Our results provide strong support for a model of schizophrenia pathogenesis that includes the effects of multiple rare structural variants, both genome-wide and at specific loci.

The International Schizophrenia Consortium was established to promote rapid progress towards the identification of genetic causes underlying schizophrenia. The consortium is composed of investigators from the University of Aberdeen, Cardiff University, the University of Edinburgh, Karolinska Institutet, Massachusetts General Hospital, the University of North Carolina-Chapel Hill, the Queensland Institute of Medical Research, the University of Southern California, the Stanley Center for Psychiatric Research at the Broad Institute of Harvard and MIT, Trinity College Dublin and University College London.

We surveyed single nucleotide polymorphisms (SNPs) and CNVs using the Affymetrix Genome-Wide Human SNP 5.0 and 6.0 arrays in European cases of schizophrenia and in ancestrally matched controls (Table 1 and Supplementary Information)13. On the basis of the genome-wide SNP data there was no evidence of major population stratification within each site14 (data not shown). Intensity data from both SNP and CNV probes were used to identify autosomal deletions and duplications, based on a hidden Markov model15.
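The excerpt says only that deletions and duplications were identified with a hidden Markov model and gives no details. As a generic illustration of the idea, and not the model used in that study, the sketch below runs Viterbi decoding over three hypothetical states with Gaussian emissions on log2 intensity ratios. All state means, the standard deviation and the stay probability are assumed values chosen for the example.

```python
import math

STATES = ("deletion", "normal", "duplication")
# Assumed log2 ratios for 1, 2 and 3 copies relative to 2 copies.
MEANS = {"deletion": -1.0, "normal": 0.0, "duplication": 0.58}
SD = 0.25    # assumed probe noise
STAY = 0.98  # assumed probability of keeping the same state at the next probe


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi(log_ratios):
    """Return the most likely copy-number state for each probe."""
    switch = (1.0 - STAY) / (len(STATES) - 1)
    log_trans = {(a, b): math.log(STAY if a == b else switch)
                 for a in STATES for b in STATES}
    # Uniform prior over states at the first probe.
    score = {s: math.log(1.0 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            new_score[s] = score[prev] + log_trans[(prev, s)] + log_gauss(x, MEANS[s], SD)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


segment = [0.0] * 5 + [-1.0] * 5 + [0.0] * 5
print(viterbi(segment))
```

With these assumed parameters a run of five low probes is called as a deletion, while a single low probe surrounded by normal ones is not, because the cost of switching state twice outweighs one probe's evidence.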

This study focused on rare but highly penetrant structural variation in schizophrenia, following a natural extension of the classical medical genetic approach. Common CNVs are better identified with different algorithms and are better tested for association separately13,15. Considering CNVs that were present in less than 1% of our total sample, there were 6,753 larger than 100 kilobases (kb) that passed sample and CNV quality filtering (see Supplementary Information and Supplementary Table 1). The median size was 182.1 kb (166.3 kb for deletions, 194.4 kb for duplications), 39% were deletions and the median number per individual was 1. We assessed the impact of rare structural variation on the risk for schizophrenia in two ways: first in terms of an individual's genome-wide burden, and second by searching for specific loci that were significantly associated with disease.

Structural variants have been identified for severe neurodevelopmental disorders9–11,16,17. Because it has been postulated that schizophrenia might, at least in part, have a developmental aetiology18, we posited a role for CNVs in schizophrenia, as have others2–6. Several loci have been identified, including variants containing genes with neurodevelopmental roles2–5. However, a critical question is the extent to which this is a general mechanism for producing schizophrenia in typical clinical populations rather than in cases selected for atypical phenotypic features such as very early onset or mental retardation. This motivated our primary hypothesis: that individuals with schizophrenia have a greater genome-wide burden of CNVs. Considering all CNVs, we observed that cases had a greater average burden than controls (one-sided, empirical $P = 3 \times 10^{-5}$ controlling for array type; Table 2). Controls on average had 0.99 CNVs per person, whereas cases showed a 1.15-fold higher rate.

We next explored this subtle, but highly statistically significant, observation of increased burden. We defined burden in two ways: as the number of CNVs carried by an individual (as above), and also as the number of genes spanned by those CNVs. This second metric (the 'gene count') in fact showed a stronger association with schizophrenia (1.41-fold increase, empirical $P = 2 \times 10^{-6}$) than burden defined simply as the number of CNVs. Characteristics of CNV subgroups studied here are their frequency, type, size, and proximity to a gene (Tables 2 and 3, and Supplementary Table 2). We observed an increased burden across multiple independent subgroups of CNVs,

*Lists of members and affiliations appear at the end of the paper.

Table 1 | Study sample characteristics and genotyping platforms

Sample                        Ancestry    Case (n)  Control (n)  Genotyping platform
University of Aberdeen        Scottish    727       694†         5.0
University College London     British     547       n/a‡         5.0
Portuguese Island Collection  Portuguese  333       200†         5.0
Karolinska Institutet         Swedish     622       437          5.0/6.0§
Cardiff University            Bulgarian   479*      646          6.0
Trinity College Dublin        Irish       280       914          6.0
University of Edinburgh       Scottish    403*      290          6.0

Figures are the numbers of cases and controls passing quality control and included in the final analyses. Case samples received a diagnosis of schizophrenia. 'Genotyping platform' indicates Affymetrix array type (5.0 or 6.0).
* Cases were excluded if IQ was less than 70.
† Controls were screened for psychiatric disorders.
‡ University College London control samples genotyped with the Affymetrix 500K two-chip genotyping platform were excluded because CNV data were not available.
§ Swedish cases and controls matched for array type for all analyses.
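As a consistency check, the per-site counts in Table 1 can be summed and compared with the totals quoted in the abstract (3,391 cases and 3,181 controls). The snippet below does only that, plus the arithmetic implied by the reported 0.99 CNVs per control and the 1.15-fold increase. The implied case rate is a derived illustration, not a figure reported in the excerpt, and the variable names are hypothetical.

```python
# (site, cases, controls) as listed in Table 1; None marks the n/a entry.
cohort = [
    ("University of Aberdeen", 727, 694),
    ("University College London", 547, None),
    ("Portuguese Island Collection", 333, 200),
    ("Karolinska Institutet", 622, 437),
    ("Cardiff University", 479, 646),
    ("Trinity College Dublin", 280, 914),
    ("University of Edinburgh", 403, 290),
]

total_cases = sum(cases for _, cases, _ in cohort)
total_controls = sum(controls for _, _, controls in cohort if controls is not None)
print(total_cases, total_controls)

# Implied average CNV count per case from the two reported numbers.
control_rate = 0.99
fold_increase = 1.15
implied_case_rate = control_rate * fold_increase
print(f"Implied CNVs per case: about {implied_case_rate:.2f}")
```

Both sums match the abstract, which suggests the table rows were recovered correctly from the flattened text.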

Vol 455 | 11 September 2008 |doi:10.1038/nature07239

237 ©2008 Macmillan Publishers Limited. All rights reserved

Why do we care about CNVs?

De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot

Steven C Greenway1, Alexandre C Pereira2, Jennifer C Lin1, Steven R DePalma1, Samuel J Israel1, Sonia M Mesquita2, Emel Ergul3, Jessie H Conta3, Joshua M Korn1,4, Steven A McCarroll1,4, Joshua M Gorham1, Stacey Gabriel4, David M Altshuler1,4, Maria de Lourdes Quintanilla-Dieck1,5, Maria Alexandra Artunduaga1,5, Roland D Eavey5, Robert M Plenge4,6, Nancy A Shadick6, Michael E Weinblatt6, Philip L De Jager4,7, David A Hafler4,7, Roger E Breitbart3, Jonathan G Seidman1,9 & Christine E Seidman1,8,9

Tetralogy of Fallot (TOF), the most common severe congenital heart malformation, occurs sporadically, without other anomaly, and from unknown cause in 70% of cases. Through a genome-wide survey of 114 subjects with TOF and their unaffected parents, we identified 11 de novo copy number variants (CNVs) that were absent or extremely rare (<0.1%) in 2,265 controls. We then examined a second, independent TOF cohort (n = 398) for additional CNVs at these loci. We identified CNVs at chromosome 1q21.1 in 1% (5/512, P = 0.0002, OR = 22.3) of nonsyndromic sporadic TOF cases. We also identified recurrent CNVs at 3p25.1, 7p21.3 and 22q11.2. CNVs in a single subject with TOF occurred at six loci, two that encode known (NOTCH1, JAG1) disease-associated genes. Our findings predict that at least 10% (4.5–15.5%, 95% confidence interval) of sporadic nonsyndromic TOF cases result from de novo CNVs and suggest that mutations within these loci might be etiologic in other cases of TOF.

The combination of a malpositioned aorta that overrides both ventricles, ventricular septal defect, pulmonary stenosis (which obstructs blood flow into the lungs) and right ventricular hypertrophy (Fig. 1) defines TOF. The most prevalent form of cyanotic heart disease, TOF occurs in one of 3,000 live births and accounts for 10% of all serious congenital heart disease1. With recent advances in corrective surgery, early lethality from TOF is rare, but long-term sequelae, including arrhythmia, ventricular dysfunction and often life-long disability, persist.

TOF can arise in the context of prenatal infections, exposure to teratogens or maternal illness, and from dominant mutations that

Figure 1 Anatomy and pathophysiology of tetralogy of Fallot. (a) Normal heart structure promotes unidirectional flow of deoxygenated blood (blue) into the lungs and oxygenated blood (red) into the aorta. (b) In TOF, pulmonary stenosis and narrowing of the right ventricular outflow tract (RVOT) impedes the flow of deoxygenated blood into the lungs, and both the ventricular septal defect (VSD) and overriding aorta (*) promote the flow of deoxygenated blood into the systemic circulation to produce cyanosis ('blue baby' syndrome). Right ventricular hypertrophy (RVH) is also present. (c) A Doppler echocardiogram shows mixing of deoxygenated blood from the right ventricle (RV) and oxygenated blood from the left ventricle (LV) as blood is pumped out the overriding aorta (Ao) in an individual with TOF. RA, right atrium; LA, left atrium. Images from Multimedia Library of Congenital Heart Disease, Children's Hospital, Boston, Massachusetts (ed. R. Geggel, http://www.childrenshospital.org/mml/cvp; used with permission).

[Figure 1 panels: a, Normal heart; b, Tetralogy of Fallot; c, Doppler echocardiogram. Labels: Aorta, RVOT, RVH, VSD, Ao, LV, RV, RA, LA.]

Received 17 February; accepted 3 June; published online 13 July 2009; doi:10.1038/ng.415

1Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. 2Laboratory of Genetics and Molecular Cardiology, Heart Institute, University of Sao Paulo Medical School, Sao Paulo, Brazil. 3Department of Cardiology, Children's Hospital, Boston, Massachusetts, USA. 4The Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA. 5Department of Otology and Laryngology, Massachusetts Eye & Ear Infirmary, Boston, Massachusetts, USA. 6Division of Rheumatology, Immunology and Allergy, 7Department of Neurology and 8Cardiovascular Division, Brigham and Women's Hospital, Boston, Massachusetts, USA. 9These authors contributed equally to this study. Correspondence should be addressed to C.E.S. ([email protected]).

NATURE GENETICS | VOLUME 41 | NUMBER 8 | AUGUST 2009 | 931

LETTERS


Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes

Barbara E. Stranger,1 Matthew S. Forrest,1 Mark Dunning,2 Catherine E. Ingle,1 Claude Beazley,1 Natalie Thorne,2 Richard Redon,1 Christine P. Bird,1 Anna de Grassi,3 Charles Lee,4,5 Chris Tyler-Smith,1 Nigel Carter,1 Stephen W. Scherer,6,7 Simon Tavaré,2,8 Panagiotis Deloukas,1 Matthew E. Hurles,1* Emmanouil T. Dermitzakis1*

Extensive studies are currently being performed to associate disease susceptibility with one form of genetic variation, namely, single-nucleotide polymorphisms (SNPs). In recent years, another type of common genetic variation has been characterized, namely, structural variation, including copy number variants (CNVs). To determine the overall contribution of CNVs to complex phenotypes, we have performed association analyses of expression levels of 14,925 transcripts with SNPs and CNVs in individuals who are part of the International HapMap project. SNPs and CNVs captured 83.6% and 17.7% of the total detected genetic variation in gene expression, respectively, but the signals from the two types of variation had little overlap. Interrogation of the genome for both types of variants may be an effective way to elucidate the causes of complex phenotypes and disease in humans.

Understanding the genetic basis of phenotypic variation in human populations is currently one of the major goals in human genetics. Gene expression (the transcription of DNA into mRNA) has been interrogated in a variety of species and experimental scenarios in order to investigate the genetic basis of variation in gene regulation (1–8), as well as to tease apart regulatory networks (9, 10). In some respects, a comprehensive survey of gene expression phenotypes (steady-state levels of mRNA) serves as a proxy for the breadth and nature of phenotypic variation in human populations (11). Much of the observed variation in mRNA transcript levels may be compensated at higher stages of regulatory networks, but an understanding of the nature of genetic variants that affect gene expression will provide an essential framework and model for elucidating the causes of other types of phenotypic variation. Single-nucleotide polymorphisms (SNPs) have long been known to be associated with phenotypic variation either through direct causal effects or by serving as proxies for other causal variants with which they are highly correlated (i.e., in linkage disequilibrium) (1, 2, 12). An understanding of this association has been facilitated by the validation of millions of SNPs by the International HapMap project (13). However, during the last few years, structural variants, such as copy number variants (CNVs), defined as DNA segments that are 1 kb or larger in size present at variable copy number in comparison with a reference genome (14), have attracted much attention (2). It has become apparent that they are quite common in the human genome (15–19) and can have dramatic phenotypic consequences as a result of altering gene dosage, disrupting coding se-

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. 2Department of Oncology, University of Cambridge, Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK. 3Istituto di Tecnologie Biomediche-Sezione di Bari, Consiglio Nazionale della Ricerche (CNR), 70126 Bari, Italy. 4Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA. 5Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA. 6The Centre for Applied Genomics and Program in Genetics and Genomic Biology, The Hospital for Sick Children, MaRS Centre, Toronto, Ontario, M5G 1L7, Canada. 7Department of Molecular and Medical Genetics, University of Toronto, Toronto, Ontario, Canada. 8Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089–2910, USA.

*To whom correspondence should be addressed. E-mail: [email protected] (E.T.D.); [email protected] (M.E.H.)

9 FEBRUARY 2007 VOL 315 SCIENCE www.sciencemag.org 848

REPORTS


Page 9

  1. Introduction to copy number variation (CNV)

  2. Identification of CNV by segmentation algorithms

  3. Integrated study of copy number and genotypes

  4. Copy number aberration (CNA) in tumor tissue

  5. Conclusion and future work

Page 10

Array CGH (Comparative Genomic Hybridization)

© 2006 Nature Publishing Group

[Figure 2 panel labels. a: Test DNA and Reference DNA (Cy3, Cy5); Block repeats with COT-1 DNA; Hybridize to arrays; Detect and quantify signals (Cy3:Cy5), ratio axis 0.5 to 1.5; Duplication; Deletion; Spurious signal. b: Genomic DNA (Test DNA, Reference DNA); BglII digested; Adaptor ligation; PCR amplification (<1.2 kb); ~2.5% of genome; Label reference and test DNA and hybridize; Array spotted with computationally designed oligonucleotide probes (70 nt).]

COT-1 DNA: DNA that is mainly composed of repetitive sequences. It is produced when short fragments of denatured genomic DNA are re-annealed.

Fosmid: A bacterially propagated phagemid vector system that is suitable for cloning genomic inserts of approximately 40 kb in size.

genomic segments in an inverted orientation29,76. Of these, 23 intervals between 10 kb and 100 kb in size make up ~450 kb. Most of these differences probably represent the incompleteness of the respective assemblies, but a fraction are likely to be due to actual structural variation between the individuals on whom the different assemblies were based.

In a second computational approach, anchor points are derived from sequences at the ends of clones (for example, fosmids) from a genomic library of a selected genome28. These anchor points are then aligned to the reference assembly, and the distance between them is compared with the expected size of the clone. Any discrepancy highlights potential insertion or deletion variants. This method, known as the paired-end sequence approach, is also suitable for the detection of some inversions, as end sequences would be in an incorrect orientation with respect to the reference assembly. Although this approach does not provide the same resolution as comparing sequence assemblies, it will remain a viable alternative until a reduction in the cost of generating further genome assemblies of high accuracy is achieved. Alternatively, structural variants can be identified by analysing sequence read-depths from shotgun-sequencing data and comparing this to what is expected from the reference genome, a method that has been used successfully for annotation of segmental duplications in the human genome79. Variations of this method will become more relevant when whole-genome shotgun sequencing of multiple human genomes becomes cost-efficient. Finally, comparison of human and primate (in particular chimpanzee) assemblies can highlight interspecies structural variants, and in some cases these genomic sites also show intraspecies polymorphism29.
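The paired-end sequence approach described above reduces to a comparison between the span of the two clone ends on the reference and the expected clone size. The sketch below illustrates that comparison under stated assumptions: the function name, the 5 kb tolerance and the boolean orientation flag are hypothetical, and the 40 kb expected size is taken from the fosmid definition on this page.

```python
def classify_clone(mapped_span_bp, expected_bp=40_000, tolerance_bp=5_000,
                   orientation_ok=True):
    """Compare where clone ends map on the reference with the clone's size.

    mapped_span_bp: distance between the two end sequences on the reference.
    expected_bp: expected insert size of the clone (about 40 kb for a fosmid).
    tolerance_bp: assumed allowance for normal variation in insert size.
    orientation_ok: False if the ends map in an unexpected orientation.
    """
    if not orientation_ok:
        return "possible inversion"
    delta = mapped_span_bp - expected_bp
    if delta > tolerance_bp:
        # Ends are farther apart on the reference than the clone is long,
        # so the sampled genome appears to lack sequence the reference has.
        return "possible deletion in the sampled genome"
    if delta < -tolerance_bp:
        # Ends are closer on the reference than the clone is long,
        # so the sampled genome appears to carry extra sequence.
        return "possible insertion in the sampled genome"
    return "concordant"


print(classify_clone(41_500))
print(classify_clone(65_000))
print(classify_clone(22_000))
print(classify_clone(40_000, orientation_ok=False))
```

In practice a call would normally require several independent clones supporting the same discrepancy; a single discordant clone is weak evidence on its own.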

Figure 2 | Array-based, genome-wide methods for the identification of copy-number variants. a | In array-based comparative genome hybridization (array-CGH), reference and test DNA samples are differentially labelled with fluorescent tags (Cy5 and Cy3, respectively), and are then hybridized to genomic arrays after repetitive-element binding is blocked using COT-1 DNA. The array can be spotted with one of several DNA sources, including BAC clones, PCR fragments or oligonucleotides. After hybridization, the fluorescence ratio (Cy3:Cy5) is determined, which reveals copy-number differences between the two DNA samples. Typically, array-CGH is carried out using a 'dye-swap' method, in which the initial labelling of the reference and test DNA samples is reversed for a second hybridization (indicated by the left and right sides of the panel). This detects spurious signals for which the reciprocal ratio is not observed. An example output for a dye-swap experiment is shown: the red line represents the original hybridization, whereas the blue line represents the reciprocal, or dye-swapped, hybridization. b | Representational oligonucleotide microarray analysis (ROMA) is a variant of array-CGH in which the reference and test DNA samples are made into 'representations' to reduce the sample complexity before hybridization. DNA is digested with a restriction enzyme that has uniformly distributed cleavage sites (BglII is shown here). Adaptors (with PCR primer sites) are then ligated to each fragment, which are amplified by PCR. However, owing to the PCR conditions that are used, only DNA of less than 1.2 kb (yellow) is amplified. Fragments that are greater than this size (red) are lost, therefore reducing the complexity of the DNA that will be hybridized to the array. It is estimated that around 200,000 fragments of DNA are amplified, comprising approximately 2.5% of the human genome49. In ROMA, an oligonucleotide array is used, which is spotted with computationally designed 70-nt probes. Each probe is designed to hybridize to one of the fragments in the representation.
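The dye-swap logic in panel a can be sketched for a single probe. In the original hybridization the test sample is labelled with Cy3, so a gain shows up as a Cy3:Cy5 ratio above 1. After the swap, the same gain should appear as the reciprocal ratio. The function name and the 0.3 threshold on the log2 scale are assumptions made for this illustration.

```python
import math


def call_probe(fwd_cy3, fwd_cy5, swap_cy3, swap_cy5, threshold=0.3):
    """Call one probe from an original and a dye-swapped hybridization.

    Original hybridization: test = Cy3, reference = Cy5.
    Dye swap:               test = Cy5, reference = Cy3.
    threshold is an assumed cut-off on the log2(Cy3:Cy5) scale.
    """
    forward = math.log2(fwd_cy3 / fwd_cy5)
    swapped = math.log2(swap_cy3 / swap_cy5)
    if forward > threshold and swapped < -threshold:
        return "duplication"
    if forward < -threshold and swapped > threshold:
        return "deletion"
    if abs(forward) > threshold or abs(swapped) > threshold:
        # A shift that is not mirrored in the swap is treated as an artefact.
        return "spurious signal"
    return "no change"


print(call_probe(1500, 1000, 1000, 1500))
print(call_probe(500, 1000, 1000, 500))
print(call_probe(1500, 1000, 1500, 1000))
```

Working on the log2 scale makes the reciprocal-ratio check symmetric: a real change gives values of opposite sign and similar size in the two hybridizations.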

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 7 | FEBRUARY 2006 | 89

Feuk et al. Nature Reviews Genetics vol. 7, p. 85

Page 11

DNA copy number profiling platforms

Affymetrix: 25 bp probe

Illumina: 50 bp probe

Agilent: 60 bp probe

Agilent: 60 bp probe

Bengtsson et al. Bioinformatics vol. 25, p. 861

Page 12

Segmentation Methods

Biostatistics (2004), 5, 4, pp. 557–572
doi: 10.1093/biostatistics/kxh008

Circular binary segmentation for the analysis of array-based DNA copy number data

ADAM B. OLSHEN, E. S. VENKATRAMAN
Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, [email protected]

ROBERT LUCITO, MICHAEL WIGLER
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA

SUMMARY

DNA sequence copy number is the number of copies of DNA at a region of a genome. Cancer progression often involves alterations in DNA copy number. Newly developed microarray technologies enable simultaneous measurement of copy number at thousands of sites in a genome. We have developed a modification of binary segmentation, which we call circular binary segmentation, to translate noisy intensity measurements into regions of equal copy number. The method is evaluated by simulation and is demonstrated on cell line data with known copy number alterations and on a breast cancer cell line data set.
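The summary does not give the test statistic, so the sketch below shows only the core idea as commonly described for circular binary segmentation: for every arc of consecutive markers, compare the mean inside the arc with the mean outside it, and take the arc with the largest standardized difference. It is a simplified illustration under assumptions (unit noise variance, no permutation test, no recursion into the resulting segments), and the function name is hypothetical.

```python
import math


def best_arc(values):
    """Find the arc (i, j] whose mean differs most from the rest.

    Returns (i, j, z), where markers i..j-1 (0-based) form the arc and z is
    the mean difference divided by sqrt(1/k + 1/(n-k)), with k = j - i.
    Arcs starting at i = 0 correspond to ordinary binary segmentation.
    """
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[n]
    best = (0, 0, 0.0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the whole sequence has no complement
            arc_sum = prefix[j] - prefix[i]
            inside = arc_sum / k
            outside = (total - arc_sum) / (n - k)
            z = (inside - outside) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if abs(z) > abs(best[2]):
                best = (i, j, z)
    return best


# Noise-free toy profile: a gain covering markers 10 to 19.
profile = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
i, j, z = best_arc(profile)
print(i, j, round(z, 3))
```

A full implementation would scale the statistic by an estimate of the noise, assess significance (for example by permutation), and then repeat the search within each resulting segment until no further change-points are found.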

Keywords: Array CGH; Binary segmentation; Change-point; ROMA.

1. INTRODUCTION

The DNA copy number of a region of a genome is the number of copies of genomic DNA. Inhumans the normal copy number is two for all the autosomes. Variations in copy number are commonin cancer and other diseases. These variations are a result of genomic events causing discrete gains andlosses in contiguous segments of the genome. For this reason, efforts have been made over the last tenyears to make whole genome copy number maps from a single study. Technologies to accomplish thishave included comparative genomic hybridization (CGH) (Kallioniemi et al., 1992) and representationaldifference analysis (RDA) (Lisitsyn et al., 1993). In order to increase the resolution of the resulting maps,both techniques have been modified for use with microarrays, the laboratory techniques for which aresimilar to cDNA gene expression experiments. Each microarray consists of thousands of genomic targetsor probes, which we will refer to as markers, that are spotted or printed on a glass surface. In a copynumber experiment a DNA sample of interest, called the test sample, and a diploid reference sampleare differentially labelled with dyes, typically Cy3 and Cy5, and mixed. This combined sample is thenhybridized to the microarray and imaged which results in test and reference intensities for all the markers.

The modification of conventional CGH to obtain high resolution data is called array CGH (aCGH)(Pinkel et al., 1998; Snijders et al., 2001). Here the genomic targets are bacterial artificial chromosomes(BACs), which are large segments of DNA, typically 100–200 kilobases. Representational oligonucleotidemicroarray analysis (ROMA) (Lucito et al., 2000, 2003) is the high resolution version of RDA. In ROMA,the test and reference samples are based on representations, (Lisitsyn et al., 1993), which are subsets of agenome. To create representations, genomic DNA is first shattered using an enzyme. The DNA pieces of

Biostatistics Vol. 5 No. 4 © Oxford University Press 2004; all rights reserved.

BioMed Central


BMC Bioinformatics

Open Access

Methodology article

A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array

Tianwei Yu*1, Hui Ye2,3, Wei Sun4, Ker-Chau Li4, Zugen Chen5, Sharoni Jacobs6, Dione K Bailey6, David T Wong7 and Xiaofeng Zhou*2,8

Address: 1Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, GA, USA, 2Center for Molecular Biology of Oral Diseases, College of Dentistry, University of Illinois at Chicago, Chicago, IL, USA, 3Shanghai Children's Medical Center, Shanghai Jiao-Tong University, Shanghai, China, 4Department of Statistics, University of California at Los Angeles, CA, USA, 5Department of Human Genetics & Microarray Core, University of California at Los Angeles, Los Angeles, CA, USA, 6Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA, USA, 7Dental Research Institute, School of Dentistry, David Geffen School of Medicine & Henry Samueli School of Engineering & Jonsson Comprehensive Cancer Center, University of California at Los Angeles, Los Angeles, CA, USA and 8Guanghua School & Research Institute of Stomatology, Sun Yat-Sen University, Guangzhou, China

* Corresponding authors: Tianwei Yu and Xiaofeng Zhou

Abstract

Background: DNA copy number aberration (CNA) is one of the key characteristics of cancer cells. Recent studies demonstrated the feasibility of utilizing high density single nucleotide polymorphism (SNP) genotyping arrays to detect CNA. Compared with the two-color array-based comparative genomic hybridization (array-CGH), the SNP arrays offer much higher probe density and lower signal-to-noise ratio at the single SNP level. To accurately identify small segments of CNA from SNP array data, segmentation methods that are sensitive to CNA while resistant to noise are required.

Results: We have developed a highly sensitive algorithm for the edge detection of copy number data which is especially suitable for the SNP array-based copy number data. The method consists of an over-sensitive edge-detection step and a test-based forward-backward edge selection step.

Conclusion: Using simulations constructed from real experimental data, the method shows high sensitivity and specificity in detecting small copy number changes in focused regions. The method is implemented in an R package FASeg, which includes data processing and visualization utilities, as well as libraries for processing Affymetrix SNP array data.

Background

Most human cancers are characterized by genomic instabilities. In-depth knowledge of genomic aberrations has important clinical values in diagnosis, treatment, and prognostics of cancer [1]. Genomic aberrations can be analyzed using a variety of high-throughput genetic and molecular technologies, such as array-based comparative genomic hybridization (array-CGH) [2] and SNP array-

Published: 3 May 2007

BMC Bioinformatics 2007, 8:145 doi:10.1186/1471-2105-8-145

Received: 16 October 2006
Accepted: 3 May 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/145

© 2007 Yu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 13: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

  1. Introduction of copy number variation (CNV)

  2. Identification of CNV by segmentation algorithms

  3. Integrated study of copy number and genotypes

  4. Copy number aberration (CNA) in tumor tissue

  5. Conclusion and Future works

Page 14: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

Copy Number & Genotype   Normal

  Two copies of each chromosome: one inherited from the father and one from the mother.

  SNP (Single Nucleotide Polymorphism) Genotype   ACGGTCA vs. ACGTTCA   Let A and B be the two alleles of one SNP; then in the normal state, the genotype could be AA, AB, or BB.

  Genotype at CNV regions   A or B if one allele is deleted   AAA, AAB, ABB, or BBB if one allele is amplified

Page 15: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

Data Processing of Illumina SNP array   The raw intensities measured for the A and B alleles of one SNP are normalized to produce X and Y

  R = X + Y measures the overall copy number   LRR = log2(R_observed/R_expected), where R_expected is computed from linear interpolation of canonical genotype clusters

  θ = arctan(Y/X)/(π/2) measures the contrast between the two alleles.

of copy number and genotype of each SNP. A complete parameter estimation scheme is developed so that those HMM parameters are estimated from the data. The HMMs for genoCNV and genoCNA are designed differently to incorporate different genotype classes in CNV and CNA data as well as the effects of tissue contamination in CNA data. In addition, genoCNA is able to utilize genotype calls from normal tissue to improve its robustness and accuracy.

METHOD

Calculation of LRR and BAF

Recall that for each SNP, X and Y are the normalized intensity measurements of allele A and B, respectively. X and Y are first transformed to be R = X + Y and θ = arctan(Y/X)/(π/2) so that R measures the overall copy number and θ measures the allelic contrast. LRR is defined as log2(R_observed/R_expected). If SNP arrays are performed for both tumor and normal samples of the same individual, we simply have R_observed = R_tumor and R_expected = R_normal. Otherwise, R_expected is computed by linear interpolation of the canonical genotype cluster centroids. The canonical genotype clusters are three clusters corresponding to genotype AA, AB and BB on the scatter plot of R versus θ [Figure 1 of Peiffer et al. (13)]. The canonical genotype clusters for each SNP can be generated from all the samples in the study or from a set of reference samples, such as HapMap samples. The BAF is normalized θ. Specifically,

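\[
\mathrm{BAF} =
\begin{cases}
0 & \text{if } \theta < \theta_{AA} \\
0.5\,(\theta - \theta_{AA})/(\theta_{AB} - \theta_{AA}) & \text{if } \theta_{AA} \le \theta < \theta_{AB} \\
0.5 + 0.5\,(\theta - \theta_{AB})/(\theta_{BB} - \theta_{AB}) & \text{if } \theta_{AB} \le \theta < \theta_{BB} \\
1 & \text{if } \theta \ge \theta_{BB}
\end{cases}
\]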

where θ_AA, θ_AB and θ_BB are the θ values for the centroids of the three canonical genotype clusters corresponding to AA, AB and BB, respectively. Based on the above formula, BAF should be around 0, 0.5 and 1 for genotype AA, AB and BB, respectively. If the BAF value of an SNP deviates from these three values, it may indicate copy number alterations. For example, a BAF value of 0.33 may indicate a genotype of AAB.
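The transformation from normalized intensities to LRR and BAF can be sketched in a few lines of Python. This is only an illustration of the formulas above, not the genoCN implementation: the function names are hypothetical, the centroid values passed to `baf` are made up for the example, and `atan2` is used in place of arctan(Y/X) so that X = 0 does not cause a division by zero (the two agree for non-negative intensities).

```python
import math


def theta(x, y):
    """Allelic contrast arctan(Y/X) / (pi/2); lies in [0, 1] for x, y >= 0."""
    return math.atan2(y, x) / (math.pi / 2)


def lrr(r_observed, r_expected):
    """Log R Ratio: log2 of observed over expected total intensity R = X + Y."""
    return math.log2(r_observed / r_expected)


def baf(th, th_aa, th_ab, th_bb):
    """Piecewise-linear normalization of theta against the AA, AB, BB centroids."""
    if th < th_aa:
        return 0.0
    if th < th_ab:
        return 0.5 * (th - th_aa) / (th_ab - th_aa)
    if th < th_bb:
        return 0.5 + 0.5 * (th - th_ab) / (th_bb - th_ab)
    return 1.0


if __name__ == "__main__":
    # Hypothetical centroids for one SNP; real values come from reference samples.
    th_aa, th_ab, th_bb = 0.1, 0.5, 0.9
    t = theta(1.0, 1.0)
    print(t, baf(t, th_aa, th_ab, th_bb), lrr(1.0, 2.0))
```

A probe whose θ falls exactly on the AB centroid maps to BAF 0.5, and halving the total intensity relative to the expectation gives an LRR of -1.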

Two continuous time HMMs with discrete states

We employ HMMs to infer both copy number states and genotypes from SNP array data. HMMs have been widely used in speech recognition (24), and more recently in DNA/protein sequence alignment (25). In those applications, the ‘time’ space of the Markov process is discrete. For example, in DNA sequence studies, one time point is just 1 nt. Therefore, discrete time HMMs are used in these studies. In the studies of SNP array data, ‘time’ is equivalent to the genomic location. Because the data (DNA allele intensities) are only observed at SNP probes and the distances between adjacent SNP probes vary, we employ a continuous time HMM to model the transition between adjacent probes. Two HMMs with different states are designed for CNV and CNA studies, which we refer to as genoCNV and genoCNA, respectively.

Similar to the previous studies (22,23), genoCNV has six states (Table 1). State 2 is often referred to as ‘copy number neutral LOH’. It is a genomic aberration, which may be due to uniparental disomy, mitotic recombination events or deletion of one allele and subsequent duplication of the remaining allele. LOH has been associated with cancer, since losing one allele of a tumor suppressor gene may lead to cancer genesis. Unlike the previous studies (22,23), we estimate the parameters from data instead of fixing them or imposing prior distributions. More importantly, genoCNV dissects both copy number and genotype calls, while the previous studies (22,23) only output copy number state estimates.

Unlike genoCNV, genoCNA has nine states (Table 2). For copy number 3 or 4, it is possible that one allele is deleted first before the other allele is amplified. States 6 and 8 correspond to this situation. State 7 is due to simultaneous amplification of both alleles, and State 9 is due to the amplification of one allele twice. In addition, tissue contamination leads to two extra genotype classes for the states having only homozygous genotypes (States 2, 4, 6 and 8). For example, in a locus of hemizygous deletions (loss of one allele) in tumor tissue, the remaining allele could be either A (corresponding to AA or AB in normal tissue) or B (corresponding to AB or BB in normal tissue). With tissue contamination, the observed LRR and BAF reflect the mixture distribution of (A, AA), (A, AB), (B, AB) and (B, BB). In each pair, the second genotype (underlined in the original article) is the one from normal tissue contamination. The expected LRR of these four mixtures are the same, but closer to 0 compared with the LRR without normal tissue contamination. This is because without normal tissue contamination, the copy number is 1, and the corresponding LRR is negative, denoted by b (b < 0). In contrast, the copy number within normal tissue is 2, thus the corresponding LRR is approximately 0. With normal tissue contamination, the copy number is between 1 and 2, and thus the LRR for the mixture is between b and 0. The expected BAFs of (A, AA) and (B, BB) are still 0 and 1, respectively. The BAFs of (A, AB) and (B, AB) are no longer 0 and 1. Their exact values depend on the proportion of tissue

Table 2. Nine states of genoCNA

State   Copy number   Genotype
1       2             AA, AB, BB
2       2             AA, (AA, AB), (BB, AB), BB
3       0             Null
4       1             (A, AA), (A, AB), (B, AB), (B, BB)
5       3             (AAA, AA), (AAB, AB), (ABB, AB), (BBB, BB)
6       3             (AAA, AA), (AAA, AB), (BBB, AB), (BBB, BB)
7       4             (AAAA, AA), (AABB, AB), (BBBB, BB)
8       4             (AAAA, AA), (AAAA, AB), (BBBB, AB), (BBBB, BB)
9       4             (AAAA, AA), (AAAB, AB), (ABBB, AB), (BBBB, BB)

Genotype classes in parentheses, such as (A, AB), arise from normal tissue contamination: the first entry (here A) is the genotype in tumor tissue and the second entry (here AB, underlined in the original article) is the genotype in the contaminating normal tissue.
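To see why contamination creates extra BAF bands, it can help to compute a rough expected B allele fraction for a (tumor, normal) genotype pair. The sketch below is an assumption for illustration, not the genoCN model: it simply counts B alleles in proportion to the fraction of tumor cells and ignores array-specific nonlinearities. The function name and the `tumor_fraction` parameter are hypothetical.

```python
def expected_b_fraction(tumor_gt, normal_gt, tumor_fraction):
    """Expected fraction of B alleles in a mixture of tumor and normal cells.

    tumor_gt / normal_gt are genotype strings such as "A" or "AB";
    tumor_fraction is the proportion of tumor cells in the sample (0..1).
    Assumes at least one allele copy is present in the mixture.
    """
    p = tumor_fraction
    b_copies = p * tumor_gt.count("B") + (1 - p) * normal_gt.count("B")
    all_copies = p * len(tumor_gt) + (1 - p) * len(normal_gt)
    return b_copies / all_copies


if __name__ == "__main__":
    # Hemizygous deletion (State 4) with 50% normal cells.
    for pair in [("A", "AA"), ("A", "AB"), ("B", "AB"), ("B", "BB")]:
        print(pair, round(expected_b_fraction(pair[0], pair[1], 0.5), 3))
```

Under this simplification, (A, AA) and (B, BB) stay at 0 and 1, while (A, AB) and (B, AB) move to intermediate values that depend on the tumor fraction, which is the qualitative behaviour described for the two extra bands.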

Nucleic Acids Research, 2009 3

Page 16: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

LRR (Log R Ratio) & BAF (B Allele Frequency)   For each SNP, we have three canonical genotype clusters, which indicate the locations of the three genotypes AA, AB, and BB in the scatter plot of R and θ


Page 17: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

An example

Wang et al. Genome Res.17(11):1665-74.

Page 18: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

An HMM to use both LRR and BAF: genoCNV

These states are the same as the HMM states used by two previous works. QuantiSNP: Colella et al. Nucleic Acids Research, 35:2013 PennCNV: Wang et al. Genome Research, 17:1665

frequency (BAF), which are overall copy number measure and allelic contrast measure, respectively (13). The calculation of LRR and BAF is elaborated at the beginning of the ‘Method’ section.

First, segmentation methods such as circular binary segmentation (14) and forward–backward fragment assembling (FASeg) (15) have been applied to dissect copy number states based on overall copy number measurements (e.g. LRR). These methods are simple and robust, but have the limitation that they cannot produce allele-specific copy number estimates. A more advanced segmentation method is proposed by Staaf et al. (16), which is able to detect allelic imbalance and loss of heterozygosity (LOH).

Second, model-based approaches such as CARAT (17) and PLASQ (18) have been developed to identify CNAs in tumor tissue based on the assumption that the relationship between copy number and probe intensity is approximately linear on a log–log scale. The inputs are genotypes in normal tissues and allele-specific copy number measurements (i.e. X and Y) in both normal and tumor tissues. The outputs are allele-specific copy number estimates in tumor tissue. Specifically, a linear model in log–log scale is built for each SNP based on the data from normal tissues, and then the resulting model is used to predict the copy number in tumor tissue. At the end, the results are further smoothed across SNPs. One weakness of these approaches is that the model parameters estimated from normal tissue may not be appropriate for tumor tissue, for example, due to normal tissue contamination. Normal tissue contamination is inevitable in cancer studies and it may arise for different reasons: for example, normal tissue adjacent to tumor tissue may be incompletely removed during processing, and nontumor stromal cells and immune cells are typically a part of every solid tumor examined.

The third type of approach focuses on identifying LOH together with qualitative copy number states (e.g. deletion, normal and amplification) in tumor tissue (19) or in generic situations (20). Their inputs include copy number measurements and some prior knowledge regarding the copy number and/or genotypes. Specifically, in Yamamoto et al. (19), a genomic region of copy number 2 in tumor tissue needs to be known and the heterozygosity of each SNP needs to be redefined based on empirical results. Scharpf et al. (20) proposed a hidden Markov model (HMM) integrating observed heterozygosity status and copy number measurements. Their method enjoys the ability to exploit the confidence scores of the genotype calls. One shared limitation of these two methods is that the prior knowledge of the copy number and/or genotype may not be available, or may be inaccurate in CNV/CNA regions.

A recent paper (21) proposed a framework for integrated study of genotype and copy number by analyzing common and rare CNVs separately. Specifically, Korn et al. (21) treated those common CNVs (>1% frequency in the population) as copy number polymorphisms (CNPs) with known locations as well as a few allele-specific copy number states. Therefore the identification of common CNVs reduced to ‘genotyping’ the CNPs. For the rare CNVs (≤1% frequency), they identified copy number states within each sample by an HMM using allele-specific copy number measurements. Then the genotypes of each SNP within CNV regions are identified by a two dimensional clustering across individuals.

All the above methods are mainly designed for Affymetrix SNP arrays. For Illumina SNP arrays, two HMM-based approaches, QuantiSNP (22) and PennCNV (23), have been developed to identify copy number states based on both LRR and BAF. Both QuantiSNP and PennCNV are based on an HMM with hidden states listed in Table 1. PennCNV assumes that the mean value and SD of LRR and BAF for each HMM state are known. QuantiSNP imposes some common priors for the LRR/BAF parameters so that only a few hyper-parameters need to be estimated. In addition, PennCNV has the advantage that family relationships can be utilized. However, neither method is designed for CNA studies, and neither provides output on allele-specific information, such as genotypes.

Despite the successes of the aforementioned methods in different applications, some important issues remain to be addressed, which motivated our study. First, as we mentioned previously, normal tissue contamination in tumors occurs and complicates the determination of true tumor-specific copy alterations of solid tumors, and as shown in the ‘Results’ section, it may lead to significant changes of the data. However, no method has been designed to dissect copy number and genotype calls within CNA regions in the presence of normal tissue contamination. Although both CARAT (17) and PLASQ (18) are able to dissect allele-specific copy number states, and hence genotypes, neither of them has taken normal tissue contamination into account. In addition, these two methods heavily depend on the design of Affymetrix SNP arrays, thus it is not trivial to extend them to the data generated from Illumina SNP arrays. Secondly, existing methods such as QuantiSNP and PennCNV either assume that the model parameters are known or impose some common priors. These restrictions may be reasonable for CNV studies, but they reduce the flexibility of CNA studies. For example, varying proportions of normal tissue contamination across samples require sample-specific model parameters.

In this article, we propose a more sophisticated HMM-based framework: genoCN. GenoCN consists of two components: genoCNV and genoCNA, which are designed for CNV and CNA studies, respectively. The input data are LRR and BAF of each SNP. For CNA studies, the genotype calls of normal tissue (of the same patient) are an optional input. The outputs are the posterior probabilities

Table 1. Six states of genoCNV

State   Copy number   Genotype
1       2             AA, AB, BB
2       2             AA, BB
3       0             Null
4       1             A, B
5       3             AAA, AAB, ABB, BBB
6       4             AAAA, AAAB, AABB, ABBB, BBBB


Page 19: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

HMM setup   Overall Likelihood

  We use ri, bi, and zi to indicate the LRR, BAF, and the hidden state at the i-th SNP probe, respectively. Assuming LRR and BAF are independent given the underlying states, the overall likelihood is:

contamination and they form two extra bands in BAF plot, or equivalently, two genotype classes.

Following the notations of Wang et al. (23), we use $r_i$, $b_i$ and $z_i$ to indicate the LRR, BAF, and the hidden state at the i-th SNP probe, respectively. Assuming that the LRR and BAF are independent given the underlying states, the full likelihood is

\[
p(r_1,\ldots,r_L, b_1,\ldots,b_L) = \sum_{z_1}\cdots\sum_{z_L}\left[ p(z_1)\prod_{i=1}^{L} p(r_i \mid z_i)\prod_{i=1}^{L} p(b_i \mid z_i)\prod_{i=2}^{L} p(z_i \mid z_{i-1}) \right]. \tag{1}
\]

If SNP arrays are performed in both tumor and normal tissues of the same individual, we incorporate the genotype calls in normal tissue into genoCNA. Let $g_i$ be the genotype of the i-th SNP in normal tissue; then the overall likelihood is

\[
p(r_1,\ldots,r_L, b_1,\ldots,b_L \mid g_1,\ldots,g_L) = \sum_{z_1}\cdots\sum_{z_L}\left[ p(z_1)\prod_{i=1}^{L} p(r_i \mid z_i)\prod_{i=1}^{L} p(b_i \mid z_i, g_i)\prod_{i=2}^{L} p(z_i \mid z_{i-1}) \right]. \tag{2}
\]

The genotype in normal tissue is based on the assumption that the copy number is 2, thus it can only be AA, AB or BB; therefore it does not provide information on the actual copy number in tumor tissue. This is why the likelihood of $r_i$ does not depend on $g_i$. However, the genotype in normal tissue restricts the genotype in tumor tissue. For example, if the genotype in normal tissue is AA, then either deletion or amplification can only produce genotypes of homozygous A. Specifically, Table 3 lists all the possible correspondences between genotypes in normal tissue and tumor tissue. We also allow a small probability that those correspondences are violated, which could be due to genotyping error in normal tissue.

The full likelihoods in Equations (1) and (2) include transition probabilities ($p(z_i \mid z_{i-1})$) and the emission probabilities of LRR ($p(r_i \mid z_i)$) and BAF ($p(b_i \mid z_i)$ or $p(b_i \mid z_i, g_i)$). Some SNP arrays incorporate some copy-number-only probes, which only have one allele. For those probes, we discard the BAF information, and only keep the emission probability of LRR and transition probability in the full likelihood. Next we discuss how to formulate these probabilities.

Transition probability

For a continuous time HMM, the transition probability is evaluated according to time. $p_{jk}(t) \equiv p(s(w+t)=k \mid s(w)=j)$ is the transition probability from state j to k during time t, where $s(w)$ and $s(w+t)$ indicate states at time $w$ and $w+t$, respectively. An intensity matrix $\Lambda = (\lambda_{jk})$ is used to model the instantaneous transition rate, where $\lambda_{jk} = \lim_{\Delta t \to 0} p_{jk}(\Delta t)/\Delta t$ for $j \ne k$, and $\lambda_{jj} = -\sum_{k \ne j} \lambda_{jk}$. The matrix of transition probabilities can be calculated by matrix exponential: $(p_{jk}(t)) = \exp(t\Lambda)$. However, matrix exponentials are often difficult to compute efficiently and reliably (26). We bypass this problem by assuming that there is at most one state transition between two adjacent SNP probes. Under this assumption, no matrix exponential is needed. This assumption is reasonable for high-density SNP arrays, as the adjacent SNP probes are generally close to each other. Occasionally, the distance between two adjacent SNP probes is large (or, in the extreme case, the two probes are on different chromosomes); then we restart the Markov process.

Let $\lambda_j = \sum_{k \ne j} \lambda_{jk}$. The waiting time that the Markov process stays at state j, denoted by $T_j$, follows an exponential distribution with parameter $\lambda_j$. Let $d_i$ be the distance between the (i−1)-th probe and the i-th probe. Then

\[
p(z_i = j \mid z_{i-1} = j) = p(T_j \ge d_i) = \exp(-\lambda_j d_i). \tag{3}
\]

For a Markov process at time i and state j, once it leaves state j, the transition probability to another state k is $a_{jk} = \lambda_{jk}/\lambda_j$. In addition, $T_j$ is independent of the destination state k (27). Based on our assumption that ‘there is at most one state transition between two adjacent probes’, the transition probability from state j to k ($k \ne j$) is

\[
p(z_i = k \mid z_{i-1} = j, d_i) = p(z_i = k \mid z_{i-1} = j)\, p(T_j < d_i) = a_{jk}\left(1 - \exp(-\lambda_j d_i)\right), \tag{4}
\]

where $\sum_{k \ne j} a_{jk} = 1$.
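Equations (3) and (4) together define one row of a distance-dependent transition matrix per state. The following Python sketch builds that matrix for a single pair of adjacent probes; the function name, argument layout and the two-state numbers in the example are assumptions made for illustration, and the rate values are not taken from the paper.

```python
import math


def transition_matrix(leave_rate, dest_prob, distance):
    """Transition probabilities between two probes separated by `distance`.

    leave_rate[j]   : rate lambda_j of leaving state j
    dest_prob[j][k] : a_jk, probability of landing in k once j is left
                      (dest_prob[j][j] is ignored; each row sums to 1 over k != j)
    Assumes at most one state change between the two probes.
    """
    n = len(leave_rate)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        stay = math.exp(-leave_rate[j] * distance)       # Equation (3)
        for k in range(n):
            if k == j:
                matrix[j][k] = stay
            else:
                matrix[j][k] = dest_prob[j][k] * (1.0 - stay)  # Equation (4)
    return matrix


if __name__ == "__main__":
    # Two hypothetical states with illustrative rates (per base pair).
    rates = [1e-5, 1e-4]
    dest = [[0.0, 1.0], [1.0, 0.0]]
    for row in transition_matrix(rates, dest, 5000):
        print([round(v, 4) for v in row])
```

Because the off-diagonal entries of each row share the factor 1 − exp(−λ_j d_i) and the a_jk sum to one, every row sums to one regardless of the probe spacing.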

Emission probability of LRR

Similar to the previous studies (22,23), we model the emission probability of LRR (denoted as r) by the mixture of a uniform distribution and a normal distribution. Let $\phi(r; \mu, \sigma)$ be the density function of normal distribution with mean $\mu$ and SD $\sigma$. Then

\[
p(r \mid z) = \pi_{r,z}\frac{1}{R_m} + (1 - \pi_{r,z})\,\phi(r; \mu_{r,z}, \sigma_{r,z}), \tag{5}
\]

where the uniform distribution (with density $1/R_m$) models the background noise, the normal distribution with mean $\mu_{r,z}$ and SD $\sigma_{r,z}$ models the LRR signals of state z, and $\pi_{r,z}$ is the mixture proportion of the uniform component. We treat $R_m$ as a known constant, which is simply the length of LRR’s range.
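As a small illustration of Equation (5), the uniform-plus-normal mixture density can be written directly. The function and argument names below are hypothetical, and the parameter values in the example are chosen only to show the call, not estimates from any data set.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lrr_emission(r, uniform_weight, mean, sd, lrr_range):
    """Mixture of a uniform background (density 1/lrr_range) and a normal signal."""
    background = uniform_weight / lrr_range
    signal = (1.0 - uniform_weight) * normal_pdf(r, mean, sd)
    return background + signal


if __name__ == "__main__":
    # Illustrative values: a state centred at LRR = 0 with 1% background noise.
    print(lrr_emission(0.05, 0.01, 0.0, 0.2, 8.0))
```

The uniform component keeps the density bounded away from zero for outlying LRR values, so a single noisy probe cannot force a state change on its own.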

Table 3. Correspondence between genotypes in normal tissue and tumor tissue

         HMM states and genotypes in tumor tissue
Normal   1    2        3      4      5          6          7      8            9
AA       AA   AA       Null   A      AAA        AAA        AAAA   AAAA         AAAA
BB       BB   BB       Null   B      BBB        BBB        BBBB   BBBB         BBBB
AB       AB   AA, BB   Null   A, B   AAB, ABB   AAA, BBB   AABB   AAAA, BBBB   AAAB, ABBB


Page 20: Department of Biostatistics, Genetics University of North ...weisun/research/genoCN.pptx.pdf · Definitions Structural Variant An umbrella term.Genomic alterations that involve segments

Transition Prob and Emission Prob   Emission Prob of LRR

  Emission Prob of BAF


Emission probability of BAF

We model BAF (denoted as b) by the mixture of a uniform component for background noise and several (truncated) normal components:

\[
p(b \mid z) = \pi_{b,z}\, I(0<b<1) + (1-\pi_{b,z}) \sum_{h=1}^{H_z} w_{z,h}\, \phi(b;\theta_{z,h})^{I(0<b<1)}\, \Phi(0;\theta_{z,h})^{I(b=0)}\, \left\{1-\Phi(1;\theta_{z,h})\right\}^{I(b=1)}, \tag{6}
\]

where $\pi_{b,z}$ is the mixture proportion of the uniform component for state z, $I(\cdot)$ is the indicator function, $H_z$ indicates the total number of normal components of state z, and $w_{z,h}$ is the weight of the h-th component. $\phi$ and $\Phi$ indicate normal density and cumulative normal distribution, and $\theta_{z,h} = \{\mu_{b,z,h}, \sigma_{b,z,h}\}$ indicates the mean and SD of the h-th normal component for state z. The genotype classes of one state are ordered by the number of B alleles as shown in Tables 1 and 2.
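The BAF emission density can be sketched as follows, under one reading of the formula above: the uniform component contributes only for 0 < b < 1, each normal component contributes its density inside the interval, and the probability mass below 0 or above 1 is assigned to the boundary values b = 0 and b = 1. The function names and the example parameters are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def baf_emission(b, uniform_weight, weights, means, sds):
    """Uniform background plus truncated-normal mixture for one HMM state.

    weights/means/sds hold one entry per genotype class of the state,
    ordered by the number of B alleles.
    """
    mixture = 0.0
    for w, mean, sd in zip(weights, means, sds):
        if b <= 0.0:
            component = normal_cdf(0.0, mean, sd)        # mass piled up at b = 0
        elif b >= 1.0:
            component = 1.0 - normal_cdf(1.0, mean, sd)  # mass piled up at b = 1
        else:
            component = normal_pdf(b, mean, sd)
        mixture += w * component
    background = uniform_weight if 0.0 < b < 1.0 else 0.0
    return background + (1.0 - uniform_weight) * mixture


if __name__ == "__main__":
    # Normal diploid state (AA, AB, BB) with illustrative weights and SDs.
    print(baf_emission(0.48, 0.01, [0.25, 0.5, 0.25], [0.0, 0.5, 1.0], [0.02, 0.03, 0.02]))
```

A homozygous-A component centred at 0 places half of its mass at b = 0, which is why many observed BAF values sit exactly on the boundary.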

Now we discuss the values of $w_{z,h}$. The weight for State 3 (the null state with both alleles deleted) is 1 for either genoCNV or genoCNA. Besides State 3, for genoCNV or genoCNA without genotype from normal tissue, $w_{z,h}$ are binomial probabilities based on the population frequencies of the B alleles. For genoCNA with genotype from normal tissue, we can refine the weights $w_{z,h}$ according to the correspondences in Table 3. First, if the genotype in normal tissue is homozygous, the genotype in tumor tissue is also homozygous for the same allele. Second, if the genotype in normal tissue is heterozygous, BAF in tumor tissue follows a mixture distribution with fewer components than in a general situation (except State 2). In either case, there is a small probability of exception, which can be attributed to factors such as genotyping error. See the Supplementary Data for the detailed formulation.
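For states whose genotype classes enumerate every possible count of B alleles (for example AA, AB, BB at copy number 2, or AAA, AAB, ABB, BBB at copy number 3), binomial weights can be sketched as below. This is an assumption about the simplest case only: it treats the allele copies as independent draws with the population B allele frequency, and it does not cover states with restricted genotype sets, whose exact weights are given in the Supplementary Data. The function name is hypothetical.

```python
from math import comb


def binomial_weights(copy_number, b_allele_freq):
    """Weights for genotype classes with 0, 1, ..., copy_number B alleles."""
    f = b_allele_freq
    return [
        comb(copy_number, h) * f ** h * (1.0 - f) ** (copy_number - h)
        for h in range(copy_number + 1)
    ]


if __name__ == "__main__":
    # Copy number 2 with B allele frequency 0.3: weights for AA, AB, BB.
    print(binomial_weights(2, 0.3))
```

The weights are ordered by the number of B alleles, matching the ordering of genotype classes used for the normal components of BAF.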

The parameters to be estimated

There are a large number of parameters to be estimated for either genoCNV or genoCNA. We reduce the number of parameters by some reasonable or obvious simplifications. First, in either genoCNV or genoCNA, some states share the same copy number and the same genotype, so that the corresponding parameters can be estimated jointly. For example, in genoCNA, States 7, 8 and 9 have the same copy number and they all share the genotype classes AAAA and BBBB. Second, mean values of some normal components of BAF can be assumed as constants. Specifically, for State 3, the null state with both alleles deleted, we assume $\mu_{b,3,1} = 0.5$. For the other states, we assume $\mu_{b,z,1} = 0$ for genotypes of homozygous A allele and $\mu_{b,z,H_z} = 1.0$ for genotypes of homozygous B allele.

For transition probability, we set $\lambda_j$ as constant based on prior knowledge/preference. Because the duration of state j follows an exponential distribution with parameter $\lambda_j$, the average duration, denoted as $\bar{T}_j$, equals $1/\lambda_j$. Therefore, $\lambda_j$ can be estimated by $1/\bar{T}_j$. However, $\bar{T}_j$ is difficult to estimate because (i) state changes could occur at any position between two adjacent probes, which we cannot observe and (ii) even if we assume that state changes always happen at SNP probes, the Baum–Welch algorithm (24) for parameter estimation requires the posterior probability that any segment arises from state j, which is computationally infeasible because the number of segments increases exponentially as the total length of DNA sequence increases. Due to the above computational difficulties, and also because $\lambda_j$ can be treated as a tuning parameter that determines the duration of state j, we choose to specify $\lambda_j$ based on prior knowledge/preference.

Parameter and state posterior probability estimation

The final maximum likelihood estimates (MLEs) of the parameters can be obtained by an EM (Expectation–Maximization) algorithm known as the Baum–Welch or forward–backward algorithm (24), by numerical optimization methods (28), or by MCMC (Markov chain Monte Carlo) methods (29). Numerical optimization methods, such as the Nelder–Mead method, become less reliable if there are a large number of parameters, which is the case in our study. The MCMC methods are computationally demanding, especially for large-scale studies such as a genome-wide dissection of CNVs/CNAs. Therefore we employ the Baum–Welch algorithm to estimate the parameters. The estimation algorithm is briefly described as follows and the details are left in the Supplementary Data.

Let $\Theta$ be all the parameters to be estimated. First, given $\Theta$, either initial values or estimates from the previous EM step, we can calculate the posterior probability that probe i is from state z, i.e. $\gamma(i, z) = p(q_i = z \mid X, \Theta)$, where $q_i$ indicates the state of probe i, and X indicates the observed data. Furthermore, we can calculate the posterior probability that probe i is from state z and it belongs to a particular genotype class, i.e. $\gamma(i, z, N_{b,z,h}) = p(q_i = z, \eta_i = N_{b,z,h} \mid X, \Theta)$, where $\eta_i = N_{b,z,h}$ indicates that probe i belongs to the h-th genotype class of state z, and the subscript b indicates this is the normal component for BAF. With these posterior probability estimates, we can re-estimate $\Theta$. We iterate this procedure until the estimates of $\Theta$ converge. By default, we use the convergence criterion that for at least 10 iterations, the maximum change of any parameter estimate is <0.002.

At the end, using the parameters at convergence, we can estimate the posterior probabilities for each SNP belonging to a particular copy number state or a genotype class. These posterior probability estimates are our final outputs. The posterior probability of a certain copy number is either $\gamma(i, z)$ or the summation of the $\gamma(i, z)$'s of all the HMM states corresponding to the same copy number. For example, the copy numbers of States 1 and 2 are both 2, so

$$p(\text{copy number of the } i\text{-th SNP is } 2 \mid X, \Theta) = \gamma(i, 1) + \gamma(i, 2).$$

Similarly, the posterior probability of a certain genotype class is the summation of all the corresponding $\gamma(i, z, N_{b,z,h})$'s.
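The summation over states that share a copy number can be illustrated with a short Python sketch. The state-to-copy-number mapping below is copied from Table 2 (genoCNA); the function name, the variable names and the example posterior values are hypothetical and are not taken from the genoCN software.

```python
# Copy number of each genoCNA state, as listed in Table 2.
GENOCNA_COPY_NUMBER = {1: 2, 2: 2, 3: 0, 4: 1, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4}


def copy_number_posterior(gamma_i, state_copy_number):
    """Sum the state posteriors gamma(i, z) over states with the same copy number.

    gamma_i: dict mapping state z -> gamma(i, z) for one SNP i.
    Returns a dict mapping copy number -> posterior probability.
    """
    posterior = {}
    for state, prob in gamma_i.items():
        cn = state_copy_number[state]
        posterior[cn] = posterior.get(cn, 0.0) + prob
    return posterior


# Made-up state posteriors for a single SNP (they sum to 1).
gamma_i = {1: 0.60, 2: 0.25, 3: 0.0, 4: 0.10, 5: 0.05,
           6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0}
cn_post = copy_number_posterior(gamma_i, GENOCNA_COPY_NUMBER)
# cn_post[2] combines States 1 and 2, as in the example in the text.
```

The same pattern applies to genotype classes: replace the state-to-copy-number map with a map from (state, genotype class) to genotype.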

RESULTS

CNVs in HapMap individuals

To evaluate genoCNV, we applied it to study CNVs in chromosomes 1–22 of 162 HapMap individuals (30): 12

Nucleic Acids Research, 2009 5

Page 21:

Comparison to the previous approach

  GenoCNV uses an EM algorithm to estimate all the parameters of the HMM. In contrast, QuantiSNP imposes prior distributions on the model parameters and PennCNV assumes the model parameters are known.

  GenoCNV estimates the posterior probabilities of all the genotypes, in addition to copy number.

Page 22:

  1. Introduction of copy number variation (CNV)

  2. Identification of CNV by segmentation algorithms

  3. Integrated study of copy number and genotypes

  4. Copy number aberration (CNA) in tumor tissue

  5. Conclusion and Future works

Page 23:

CNA study vs. CNV study

  In tumor tissue, CNA is much more abundant than CNV in normal individuals

  Tissue contamination: mixture of tumor tissue and normal tissue.

  We often have paired data: SNP array for both normal tissue and tumor tissue of the same individual

Page 24:

Copy Number Aberrations

  Standard CNV method fails, due to contamination

  GenoCN clearly dissects the CNA regions

  [Figure labels: Normal (AA/AB/BB); copy number neutral loss of heterozygosity (AA/BB); one copy deletion (A/B)]

Page 25:

GenoCNA: an HMM to detect Copy Number Aberrations

of copy number and genotype of each SNP. A complete parameter estimation scheme is developed so that those HMM parameters are estimated from the data. The HMMs for genoCNV and genoCNA are designed differently to incorporate different genotype classes in CNV and CNA data as well as the effects of tissue contamination in CNA data. In addition, genoCNA is able to utilize genotype calls from normal tissue to improve its robustness and accuracy.

METHOD

Calculation of LRR and BAF

Recall that for each SNP, X and Y are the normalized intensity measurements of allele A and B, respectively. X and Y are first transformed to $R = X + Y$ and $\theta = \arctan(Y/X)/(\pi/2)$ so that R measures the overall copy number and $\theta$ measures the allelic contrast. LRR is defined as $\log_2(R_{observed}/R_{expected})$. If SNP arrays are performed for both tumor and normal samples of the same individual, we simply have $R_{observed} = R_{tumor}$ and $R_{expected} = R_{normal}$. Otherwise, $R_{expected}$ is computed by linear interpolation of the canonical genotype cluster centroids. The canonical genotype clusters are three clusters corresponding to genotype AA, AB and BB on the scatter plot of R versus $\theta$ [Figure 1 of Peiffer et al. (13)]. The canonical genotype clusters for each SNP can be generated from all the samples in the study or from a set of reference samples, such as HapMap samples. The BAF is normalized $\theta$. Specifically,

$$
\mathrm{BAF} =
\begin{cases}
0 & \text{if } \theta < \theta_{AA} \\
0.5\,(\theta - \theta_{AA})/(\theta_{AB} - \theta_{AA}) & \text{if } \theta_{AA} \le \theta < \theta_{AB} \\
0.5 + 0.5\,(\theta - \theta_{AB})/(\theta_{BB} - \theta_{AB}) & \text{if } \theta_{AB} \le \theta < \theta_{BB} \\
1 & \text{if } \theta \ge \theta_{BB}
\end{cases}
$$

where $\theta_{AA}$, $\theta_{AB}$ and $\theta_{BB}$ are the $\theta$ values for the centroids of the three canonical genotype clusters corresponding to AA, AB and BB, respectively. Based on the above formula, BAF should be around 0, 0.5 and 1 for genotype AA, AB and BB, respectively. If the BAF value of an SNP deviates from these three values, it may indicate copy number alterations. For example, a BAF value of 0.33 may indicate a genotype of AAB.
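The LRR and BAF definitions above translate directly into code. The following Python sketch is only an illustration of the formulas; the function names and the centroid values in the example are assumptions, and the intensities are assumed to be non-negative.

```python
import math


def theta_from_intensities(x, y):
    """Allelic contrast theta = arctan(Y/X) / (pi/2), for non-negative X and Y."""
    # atan2 equals arctan(y/x) for x > 0 and also handles x == 0 gracefully.
    return math.atan2(y, x) / (math.pi / 2)


def lrr(r_observed, r_expected):
    """Log R ratio: log2(R_observed / R_expected)."""
    return math.log2(r_observed / r_expected)


def baf(theta, theta_aa, theta_ab, theta_bb):
    """B allele frequency: theta normalized by the three cluster centroids."""
    if theta < theta_aa:
        return 0.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    if theta < theta_bb:
        return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
    return 1.0


# Hypothetical centroids for one SNP.
centroids = (0.05, 0.48, 0.95)
example_baf = baf(theta_from_intensities(1.0, 1.0), *centroids)
example_lrr = lrr(1.0, 2.0)  # half the expected intensity gives LRR = -1
```

In a paired tumor/normal design, `r_observed` and `r_expected` would be the tumor and normal values of R for the same SNP, as described above.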

Two continuous time HMMs with discrete states

We employ HMMs to infer both copy number states and genotypes from SNP array data. HMMs have been widely used in speech recognition (24), and more recently in DNA/protein sequence alignment (25). In those applications, the 'time' space of the Markov process is discrete. For example, in DNA sequence studies, one time point is just 1 nt. Therefore, discrete time HMMs are used in these studies. In the studies of SNP array data, 'time' is equivalent to the genomic location. Because the data (DNA allele intensities) are only observed at SNP probes and the distances between adjacent SNP probes vary, we employ a continuous time HMM to model the transition between adjacent probes. Two HMMs with different states are designed for CNV and CNA studies, which we refer to as genoCNV and genoCNA, respectively.

Similar to the previous studies (22,23), genoCNV has six states (Table 1). State 2 is often referred to as 'copy number neutral LOH'. It is a genomic aberration, which may be due to uniparental disomy, mitotic recombination events or deletion of one allele and subsequent duplication of the remaining allele. LOH has been associated with cancer since losing one allele of a tumor suppressor gene may lead to cancer genesis. Unlike the previous studies (22,23), we estimate the parameters from data instead of fixing them or imposing prior distributions. More importantly, genoCNV dissects both copy number and genotype calls, while the previous studies (22,23) only output copy number state estimates.

Unlike genoCNV, genoCNA has nine states (Table 2). For copy number 3 or 4, it is possible that one allele is deleted first before the other allele is amplified. States 6 and 8 correspond to this situation. State 7 is due to simultaneous amplification of both alleles, and State 9 is due to the amplification of one allele twice. In addition, tissue contamination leads to two extra genotype classes for the states having only homozygous genotypes (States 2, 4, 6 and 8). For example, in a locus of hemizygous deletions (loss of one allele) in tumor tissue, the remaining allele could be either A (corresponding to AA or AB in normal tissue) or B (corresponding to AB or BB in normal tissue). With tissue contamination, the observed LRR and BAF reflect the mixture distribution of (A, AA), (A, AB), (B, AB) and (B, BB). Here the second genotype in each pair (underscored in the original typesetting) is the genotype from normal tissue contamination. The expected LRR of these four mixtures are the same, but closer to 0 compared with the LRR without normal tissue contamination. This is because without normal tissue contamination, the copy number is 1, and the corresponding LRR is negative, denoted by b (b < 0). In contrast, the copy number within normal tissue is 2, thus the corresponding LRR is approximately 0. With normal tissue contamination, the copy number is between 1 and 2, and thus the LRR for the mixture is between b and 0. The expected BAFs of (A, AA) and (B, BB) are still 0 and 1, respectively. The BAFs of (A, AB) and (B, AB) are no longer 0 and 1. Their exact values depend on the proportion of tissue

Table 2. Nine states of genoCNA

| State | Copy number | Genotype |
|-------|-------------|----------|
| 1 | 2 | AA, AB, BB |
| 2 | 2 | AA, (AA, AB), (BB, AB), BB |
| 3 | 0 | Null |
| 4 | 1 | (A, AA), (A, AB), (B, AB), (B, BB) |
| 5 | 3 | (AAA, AA), (AAB, AB), (ABB, AB), (BBB, BB) |
| 6 | 3 | (AAA, AA), (AAA, AB), (BBB, AB), (BBB, BB) |
| 7 | 4 | (AAAA, AA), (AABB, AB), (BBBB, BB) |
| 8 | 4 | (AAAA, AA), (AAAA, AB), (BBBB, AB), (BBBB, BB) |
| 9 | 4 | (AAAA, AA), (AAAB, AB), (ABBB, AB), (BBBB, BB) |

Genotype classes in parentheses, such as (A, AB), are due to normal tissue contamination: genotype A is from tumor tissue and genotype AB is from normal tissue. In each pair, the second genotype (underscored in the original typesetting) is the one from normal tissue contamination.


Page 26:

GenoCNA

  Makes use of genotype data from normal tissue

  Output copy number and genotype probabilities

  Can be used to predict tumor tissue percentage

contamination and they form two extra bands in the BAF plot, or equivalently, two genotype classes.

Following the notations of Wang et al. (23), we use $r_i$, $b_i$ and $z_i$ to indicate the LRR, BAF, and the hidden state at the i-th SNP probe, respectively. Assuming that the LRR and BAF are independent given the underlying states, the full likelihood is

$$
p(r_1, \ldots, r_L; b_1, \ldots, b_L)
= \sum_{z_1} \cdots \sum_{z_L} \left[ p(z_1) \prod_{i=1}^{L} p(r_i \mid z_i) \prod_{i=1}^{L} p(b_i \mid z_i) \prod_{i=2}^{L} p(z_i \mid z_{i-1}) \right]. \tag{1}
$$

If SNP arrays are performed in both tumor and normal tissues of the same individual, we incorporate the genotype calls in normal tissue into genoCNA. Let $g_i$ be the genotype of the i-th SNP in normal tissue; we then have the overall likelihood:

$$
p(r_1, \ldots, r_L; b_1, \ldots, b_L \mid g_1, \ldots, g_L)
= \sum_{z_1} \cdots \sum_{z_L} \left[ p(z_1) \prod_{i=1}^{L} p(r_i \mid z_i) \prod_{i=1}^{L} p(b_i \mid z_i, g_i) \prod_{i=2}^{L} p(z_i \mid z_{i-1}) \right]. \tag{2}
$$

The genotype in normal tissue is based on the assumption that the copy number is 2, thus it can only be AA, AB or BB; therefore it does not provide information of the actual copy number in tumor tissue. This is why the likelihood of $r_i$ does not depend on $g_i$. However, the genotype in normal tissue restricts the genotype in tumor tissue. For example, if the genotype in normal tissue is AA, then either deletion or amplification can only produce genotypes of homozygous A. Specifically, Table 3 lists all the possible correspondences between genotypes in normal tissue and tumor tissue. We also allow a small probability that those correspondences are violated, which could be due to genotyping error in normal tissue.

The full likelihoods in Equations (1) and (2) include transition probabilities ($p(z_i \mid z_{i-1})$) and the emission probabilities of LRR ($p(r_i \mid z_i)$) and BAF ($p(b_i \mid z_i)$ or $p(b_i \mid z_i, g_i)$). Some SNP arrays incorporate some copy-number-only probes, which only have one allele. For those probes, we discard the BAF information, and only keep the emission probability of LRR and transition probability in the full likelihood.

Next we discuss how to formulate these probabilities.
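The nested sums over hidden states in Equations (1) and (2) do not have to be enumerated: a standard forward recursion evaluates them in time linear in the number of probes. The Python sketch below is a generic scaled forward pass, not the genoCN implementation; it assumes the emission and transition probabilities have already been computed, and all names and numbers are hypothetical.

```python
import math


def forward_log_likelihood(init, trans, emis):
    """Log of the full likelihood, computed by the scaled forward recursion.

    init[z]        : p(z_1 = z)
    trans[i][j][k] : p(z_{i+2} = k | z_{i+1} = j), one matrix per adjacent
                     probe pair (0-based list of length L - 1)
    emis[i][z]     : p(r | z) * p(b | z) for probe i (use p(r | z) alone for
                     copy-number-only probes)
    """
    n_states = len(init)
    alpha = [init[z] * emis[0][z] for z in range(n_states)]
    log_lik = 0.0
    for i in range(len(emis)):
        if i > 0:
            step = trans[i - 1]
            alpha = [
                sum(alpha[j] * step[j][k] for j in range(n_states)) * emis[i][k]
                for k in range(n_states)
            ]
        # Rescale at every probe to avoid numerical underflow on long chains.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Toy example: 2 states, 3 probes, distance-dependent transition matrices.
init = [0.9, 0.1]
trans = [
    [[0.95, 0.05], [0.20, 0.80]],
    [[0.90, 0.10], [0.30, 0.70]],
]
emis = [[0.5, 0.1], [0.4, 0.3], [0.05, 0.6]]
log_lik = forward_log_likelihood(init, trans, emis)
```

Because each adjacent probe pair has its own transition matrix, the same routine accommodates the varying inter-probe distances of a continuous time HMM.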

Transition probability

For a continuous time HMM, the transition probability is evaluated according to time. $p_{jk}(t) \equiv p(s(w + t) = k \mid s(w) = j)$ is the transition probability from state j to k during time t, where $s(w)$ and $s(w + t)$ indicate states at time w and w + t, respectively. An intensity matrix $\Lambda = (\lambda_{jk})$ is used to model the instantaneous transition rate, where $\lambda_{jk} = \lim_{\Delta t \to 0} p_{jk}(\Delta t)/\Delta t$ for $j \ne k$, and $\lambda_{jj} = -\sum_{k \ne j} \lambda_{jk}$. The transition probabilities can be calculated by a matrix exponential: $P(t) = (p_{jk}(t)) = \exp(t\Lambda)$. However, matrix exponentials are often difficult to compute efficiently and reliably (26). We bypass this problem by assuming that there is at most one state transition between two adjacent SNP probes. Under this assumption, no matrix exponential is needed. This assumption is reasonable for high-density SNP arrays, as the adjacent SNP probes are generally close to each other. When the distance between two adjacent SNP probes is occasionally large (or, in the extreme case, when the two probes lie on different chromosomes), we restart the Markov process.

Let $\lambda_j = \sum_{k \ne j} \lambda_{jk}$. The waiting time that the Markov process stays at state j, denoted by $T_j$, follows an exponential distribution with parameter $\lambda_j$. Let $d_i$ be the distance between the (i − 1)-th probe and the i-th probe. Then

$$p(z_i = j \mid z_{i-1} = j) = p(T_j \ge d_i) = \exp(-\lambda_j d_i). \tag{3}$$

For a Markov process at time i and state j, once it leaves state j, the transition probability to another state k is $a_{jk} = \lambda_{jk}/\lambda_j$. In addition, $T_j$ is independent of the destination state k (27). Based on our assumption that 'there is at most one state transition between two adjacent probes', the transition probability from state j to k ($k \ne j$) is

$$p(z_i = k \mid z_{i-1} = j, d_i) = p(z_i = k \mid z_{i-1} = j)\, p(T_j < d_i) = a_{jk}\,(1 - \exp(-\lambda_j d_i)), \tag{4}$$

where $\sum_{k \ne j} a_{jk} = 1$.
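Equations (3) and (4) give one row of the transition matrix for a given inter-probe distance. The Python sketch below assembles such a row; the function name, the rates and the jump probabilities are made-up illustrations (the text specifies the rate of state j as the reciprocal of a preferred mean duration, and the values here are not from the paper).

```python
import math


def transition_row(j, d, lam, jump):
    """Transition probabilities out of state j over a probe distance d.

    lam[j]     : total rate lambda_j of leaving state j
    jump[j][k] : a_jk, probability of moving to k given that j is left
                 (jump[j][j] = 0 and each row sums to 1)
    Assumes at most one state change between adjacent probes.
    """
    stay = math.exp(-lam[j] * d)  # Equation (3)
    row = [jump[j][k] * (1.0 - stay) for k in range(len(lam))]  # Equation (4)
    row[j] = stay
    return row


# Hypothetical 3-state example with rates per base pair.
lam = [1e-5, 1e-4, 1e-4]
jump = [
    [0.0, 0.5, 0.5],
    [0.9, 0.0, 0.1],
    [0.9, 0.1, 0.0],
]
row = transition_row(0, 5000.0, lam, jump)
```

Each row sums to 1 by construction, since the off-diagonal entries share the probability mass 1 − exp(−λ_j d) in proportion to a_jk.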

Emission probability of LRR

Similar to the previous studies (22,23), we model the emission probability of LRR (denoted as r) by the mixture of a uniform distribution and a normal distribution. Let $\phi(r; \mu, \sigma)$ be the density function of the normal distribution with mean $\mu$ and SD $\sigma$. Then

$$p(r \mid z) = \pi_{r,z} \frac{1}{R_m} + (1 - \pi_{r,z})\, \phi(r; \mu_{r,z}, \sigma_{r,z}), \tag{5}$$

where the uniform distribution (with density $1/R_m$) models the background noise, the normal distribution with mean $\mu_{r,z}$ and SD $\sigma_{r,z}$ models the LRR signals of state z, and $\pi_{r,z}$ is the mixture proportion of the uniform component. We treat $R_m$ as a known constant, which is simply the length of LRR's range.
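The uniform-plus-normal mixture for the LRR emission probability can be written in a few lines of Python. This is a sketch of the formula only; the function names and the parameter values in the example are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and SD."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def lrr_emission(r, pi_unif, mu, sigma, r_range):
    """Emission density of LRR for one state.

    pi_unif : mixture proportion of the uniform (background noise) component
    mu, sigma : mean and SD of the normal component for this state
    r_range : R_m, the length of the range of LRR values
    """
    return pi_unif / r_range + (1.0 - pi_unif) * normal_pdf(r, mu, sigma)


# Hypothetical parameters for a one-copy-deletion state.
density_at_mean = lrr_emission(-0.6, pi_unif=0.01, mu=-0.6, sigma=0.2, r_range=6.0)
density_outlier = lrr_emission(2.5, pi_unif=0.01, mu=-0.6, sigma=0.2, r_range=6.0)
```

The uniform component keeps the density of an outlying LRR value bounded away from zero, so a single noisy probe cannot force a state change on its own.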

Table 3. Correspondence between genotypes in normal tissue and tumor tissue

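Rows: genotype in normal tissue. Columns: HMM states (1–9), with the corresponding genotypes in tumor tissue.

| Normal | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|---|---|---|---|---|---|---|---|---|
| AA | AA | AA | Null | A | AAA | AAA | AAAA | AAAA | AAAA |
| BB | BB | BB | Null | B | BBB | BBB | BBBB | BBBB | BBBB |
| AB | AB | AA, BB | Null | A, B | AAB, ABB | AAA, BBB | AABB | AAAA, BBBB | AAAB, ABBB |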


Page 27:
Page 28:
Page 29:

Estimation of tumor tissue proportion

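Let $\beta_O$ and $\beta_T$ be the observed mean value of BAF and the expected BAF in pure tumor tissue, respectively, and let $p_T$ be the proportion of tumor tissue. Given the assumption that BAF can be approximated by the ratio of the number of B alleles to the total number of alleles, and that the contaminating normal tissue has genotype AB, we have

$$\beta_T = \frac{n_B}{n_A + n_B}, \tag{C-1}$$

$$\beta_O = \frac{1 - p_T + p_T n_B}{2 - 2p_T + p_T(n_A + n_B)}, \tag{C-2}$$

where $n_A$ and $n_B$ are the numbers of A and B alleles in pure tumor tissue, respectively. Solving (C-2) for $p_T$, the tumor proportion can be estimated as

$$p_T = \frac{1 - 2\beta_O}{\beta_O(n_A + n_B - 2) + 1 - n_B}. \tag{C-3}$$

Let $\beta_{O,G}$ be the observed mean BAF value for genotype or genotype mixture G. When copy number is 1, we update $\beta_{O,(A,AB)}$ and $\beta_{O,(B,AB)}$ to take systematic dye bias into account. Denoting the updated values by a tilde,

$$\tilde{\beta}_{O,(A,AB)} = 0.5\,\beta_{O,(A,AB)}/\beta_{O,AB}, \tag{C-4}$$

$$\tilde{\beta}_{O,(B,AB)} = 0.5 + 0.5\,(\beta_{O,(B,AB)} - \beta_{O,AB})/(1 - \beta_{O,AB}). \tag{C-5}$$

Then we estimate $\beta_{O,(A,AB)}$ by averaging $\tilde{\beta}_{O,(A,AB)}$ and $1 - \tilde{\beta}_{O,(B,AB)}$:

$$\beta_{O,1} = 0.5\left[\tilde{\beta}_{O,(A,AB)} + 1 - \tilde{\beta}_{O,(B,AB)}\right] = 0.25\left[\beta_{O,(A,AB)}/\beta_{O,AB} + 1 - (\beta_{O,(B,AB)} - \beta_{O,AB})/(1 - \beta_{O,AB})\right]. \tag{C-6}$$

Similarly, $\beta_{O,(AA,AB)}$ is estimated by

$$\beta_{O,2} = 0.25\left[\beta_{O,(AA,AB)}/\beta_{O,AB} + 1 - (\beta_{O,(BB,AB)} - \beta_{O,AB})/(1 - \beta_{O,AB})\right], \tag{C-7}$$

and $\beta_{O,(AAB,AB)}$ is estimated by

$$\beta_{O,3} = 0.25\left[\beta_{O,(AAB,AB)}/\beta_{O,AB} + 1 - (\beta_{O,(ABB,AB)} - \beta_{O,AB})/(1 - \beta_{O,AB})\right]. \tag{C-8}$$

Then we can separately plug $\beta_{O,1}$, $\beta_{O,2}$ and $\beta_{O,3}$ into Equation (C-3), together with the corresponding $n_A$ and $n_B$, to estimate $p_T$. We denote the estimates of $p_T$ from $\beta_{O,1}$, $\beta_{O,2}$ and $\beta_{O,3}$ as $p_{T1}$, $p_{T2}$ and $p_{T3}$, respectively. As shown in Figure 5 (a) in the main text, $p_{T1}$ and $p_{T2}$ are highly consistent. Overall, $p_{T1}$ and $p_{T3}$ are also consistent (Figure C-1). However, $p_{T1}$ and $p_{T3}$ show a larger discrepancy than $p_{T1}$ and $p_{T2}$. This may be due to the dye bias, which may not have been completely corrected by Equations (C-4) and (C-5).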
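As a numerical check on the relation between the observed BAF and the tumor proportion, the Python sketch below implements Equation (C-2) and its algebraic inverse, obtained by solving (C-2) for p_T. The function names are hypothetical, and the normal tissue is assumed to have genotype AB, as (C-2) implies.

```python
def observed_baf(p_t, n_a, n_b):
    """Expected observed BAF, Equation (C-2), for tumor proportion p_t.

    n_a, n_b : numbers of A and B alleles in pure tumor tissue.
    The contaminating normal tissue is assumed to have genotype AB.
    """
    return (1 - p_t + p_t * n_b) / (2 - 2 * p_t + p_t * (n_a + n_b))


def tumor_proportion(beta_o, n_a, n_b):
    """Tumor proportion obtained by solving Equation (C-2) for p_T.

    Undefined when n_a == n_b == 1, where beta_o is 0.5 for every p_T.
    """
    return (1 - 2 * beta_o) / (beta_o * (n_a + n_b - 2) + 1 - n_b)


# Round trip for the three genotype mixtures used in the text:
# (A, AB), (AA, AB) and (AAB, AB), with a made-up tumor proportion of 0.6.
true_p_t = 0.6
estimates = [
    tumor_proportion(observed_baf(true_p_t, n_a, n_b), n_a, n_b)
    for n_a, n_b in [(1, 0), (2, 0), (2, 1)]
]
```

For example, with one copy deleted (tumor genotype A) and a tumor proportion of 0.5, the expected BAF of the (A, AB) mixture is (1 − 0.5)/(2 − 0.5) = 1/3, and inverting that value recovers 0.5.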

The clinically estimated tumor purity tends to be very high despite the apparent patternin the data which indicates a relatively low tumor purity. Figure C-2 shows an example.


Page 30:

Conclusion and Future works

  Extend GenoCN to study Affymetrix SNP Array Data

  Use the output of GenoCN for Genome-wide Association Studies (GWAS)

  Combining SNP arrays with next-generation sequencing

Page 31:

Acknowledgement

Nucleic Acids Research, 2009, 1–13
doi:10.1093/nar/gkp493

Integrated study of copy number states and genotype calls using high-density SNP arrays

Wei Sun1,2,*, Fred A. Wright1, Zhengzheng Tang1, Silje H. Nordgard2,3, Peter Van Loo3,4,5, Tianwei Yu6, Vessela N. Kristensen3 and Charles M. Perou2,7,*

1Department of Biostatistics, 2Department of Genetics, University of North Carolina, Chapel Hill, NC, USA, 3Department of Genetics, Institute for Cancer Research, Oslo University Hospital-Radiumhospitalet, Oslo, Norway, 4Department of Molecular and Developmental Genetics, Vlaams Instituut voor Biotechnologie, 5Department of Human Genetics, Katholieke Universiteit Leuven, Leuven, Belgium, 6Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA and 7Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA

Received February 17, 2009; Revised April 21, 2009; Accepted May 21, 2009

ABSTRACT

We propose a statistical framework, named genoCN, to simultaneously dissect copy number states and genotypes using high-density SNP (single nucleotide polymorphism) arrays. There are at least two types of genomic DNA copy number differences: copy number variations (CNVs) and copy number aberrations (CNAs). While CNVs are naturally occurring and inheritable, CNAs are acquired somatic alterations most often observed in tumor tissues only. CNVs tend to be short and more sparsely located in the genome compared with CNAs. GenoCN consists of two components, genoCNV and genoCNA, designed for CNV and CNA studies, respectively. In contrast to most existing methods, genoCN is more flexible in that the model parameters are estimated from the data instead of being decided a priori. GenoCNA also incorporates two important strategies for CNA studies. First, the effects of tissue contamination are explicitly modeled. Second, if SNP arrays are performed for both tumor and normal tissues of one individual, the genotype calls from normal tissue are used to study CNAs in tumor tissue. We evaluated genoCN by applications to 162 HapMap individuals and a brain tumor (glioblastoma) dataset and showed that our method can successfully identify both types of copy number differences and produce high-quality genotype calls.

INTRODUCTION

Several recent studies have documented the extensive presence of inheritable copy number variations (CNVs) in the human genome (1–8). Copy number aberrations (CNAs), which are acquired somatic alterations, are often observed in tumor tissues (9,10). In contrast to CNVs, CNAs tend to be longer and occupy a significant proportion of the genome. In addition to the traditional array CGH approach (11), CNVs or CNAs can also be detected by SNP arrays, which typically have higher resolution and are able to capture allele-specific information (12). In this article, we propose a statistical framework to simultaneously dissect copy number states and genotypes within CNV/CNA regions. Currently, the two most frequently used SNP array platforms are from Affymetrix (8) and Illumina (13). In this article, we focus on Illumina SNP arrays; however, our method, accompanied with an appropriate normalization and transformation of the raw data, can also be applied to Affymetrix SNP arrays. Adjustments for CNV and CNA are also needed to ensure precise genotype calls when using SNP arrays for genotyping.

Various methods have been proposed to study copy number alterations. These methods can be classified based on their input and output data. We first briefly introduce different types of input data. Denote the two alleles of one SNP as A and B, respectively, and let X/Y be the normalized intensities of allele A/B, i.e. allele-specific copy number measurements. X and Y can be transformed to a measure of overall copy number and a measure of allelic contrast. For example, the outputs of Illumina SNP arrays are Log R ratio (LRR) and B allele

*To whom correspondence should be addressed. Tel: 919-966-7266; Fax: 919-966-3804; Email: [email protected]
Correspondence may also be addressed to Charles Perou. Tel: 919-843-5740; Fax: 919-843-5718; Email: [email protected]

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nucleic Acids Research Advance Access published July 6, 2009