Analysis of Sequence-Based Copy Number Variation Detection Tools for Cancer Studies


Upload: amia

Post on 07-Nov-2014


DESCRIPTION

2013 Summit on Translational Bioinformatics

TRANSCRIPT

Page 1: Analysis of Sequence-Based Copy Number Variation Detection Tools for Cancer Studies

Analysis of sequence-based copy number variation detection tools for cancer studies

Sheida Nabavi, Zhengqiu Cai, Peter J Tonellato

Laboratory for Personalized Medicine, Center for Biomedical Informatics, Harvard Medical School

Why focus on CNV?

• CNVs have been associated with various diseases, including cancers.
– Amplification of oncogenes and loss of tumor suppressor genes

• CNVs (amplifications and deletions of chromosomal segments) are major structural variations.

[Figure label: amplification]

Page 2

NGS is emerging as the best platform for CNV detection

• Array-based technologies have been used to study CNVs
• Sequence-based technologies (using next generation sequencing (NGS)) are emerging

Array-based technologies:

Advantages | Disadvantages
Affordable | Noisy, hybridization
Established | Limited resolution
High resolution arrays | Predefined probes

Sequence-based technologies (NGS):

Advantages | Disadvantages
Higher resolution and accuracy | Currently expensive
Single platform | No standard analysis
Rapid cost reduction | Computationally demanding

Objectives of this work

• Indicate the performance characteristics of the recent sequence-based CNV detection tools to facilitate tool selection.
• Identify strengths and weaknesses of current CNV tools.
– Address limitations
– Identify modifications to improve performance
– Develop new method

Page 3

Typical pipeline for read depth-based CNV detection

• The number of reads that align to a position in a genome is proportional to the copy number at that position

[Figure: pipeline. NGS data and reference genome → Alignment → Counting reads → Normalization → Segmentation → CNV detection → CNV list. Short reads are aligned to the reference genome and counted in a counting window to give a read count.]

Example NGS data (short reads):

@SRR034720.3267591 length=36
ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT
+
88888888888888888888885888%888888/8)
@SRR034720.3267592 length=36
TCGGGAACGTCTCGACCGAAATTATTTTGTATGTCT
+
8888788888888888878-188288881878"888
.
.
.

Normalization reduces biases

[Figure: aligned short reads for the sample and the control, with a somatic amplification and a somatic deletion marked]

• Main sources of bias are GC content, mappability, and sample preparation
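As an illustration of correcting one of these biases, the sketch below rescales window read counts so that every GC-content bin has the same median count. This median-ratio scheme is one commonly used correction and is an assumption here; the slides do not say how each tool normalizes. The names `gc_fraction`, `gc_normalize`, and the `n_bins` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence


def gc_fraction(sequence: str) -> float:
    """Fraction of G or C bases in a window's reference sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_normalize(counts: Sequence[float],
                 gc_values: Sequence[float],
                 n_bins: int = 20) -> List[float]:
    """Scale each window by (overall median / median of its GC bin).

    Assumes `counts` is non-empty and `gc_values` holds one GC fraction
    in [0, 1] per window.
    """
    overall = median(counts)
    # Assign each window to a GC bin; a GC fraction of 1.0 goes in the last bin.
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_values]
    by_bin: Dict[int, List[float]] = defaultdict(list)
    for b, c in zip(bins, counts):
        by_bin[b].append(c)
    bin_median = {b: median(values) for b, values in by_bin.items()}
    normalized = []
    for b, c in zip(bins, counts):
        m = bin_median[b]
        # Leave the count unchanged when the bin median is zero.
        normalized.append(c * overall / m if m > 0 else float(c))
    return normalized
```

Mappability and sample-preparation effects are not handled by this sketch; using a matched control sample is one way to cancel biases that affect both samples equally.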

Page 4

Read depth-based CNV detection tools

Tool | Method | SE/PE | Control | Reference
CNVnator | Analyze depth of coverage (DOC), mean-shift technique for segmentation, correct GC bias | SE, PE | No | Abyzov et al. (2011), Genome Res, 21, 974-984
ReadDepth | Analyze DOC, circular binary segmentation algorithm, use negative binomial distribution, use PE information | SE, PE | No | Miller et al. (2011), PLoS One, 6, e16327
FREEC | Analyze DOC, use LASSO-based algorithm for segmentation, normalize for GC content | SE, PE | Yes/No | Boeva et al. (2011), Bioinformatics, 27, 268-269
CNVer | Analyze DOC with paired-end mapping information, repeat graph algorithm | PE | No | Medvedev et al. (2010), Genome Res, 20, 1613-1622
CNV-seq | Analyze DOC, fixed window, no segmentation | SE, PE | Yes | Xie and Tammi (2009), BMC Bioinformatics, 10, 80
SegSeq | Analyze DOC, extend a window to include a fixed number of reads | SE, PE | Yes | Chiang et al. (2009), Nat Methods, 6, 99-103

Cell line and synthesized NGS datasets

Eight breast cancer cell line NGS datasets

Cell line | Read length | Single end/paired end | PF reads | Coverage | Illumina platform
BT-20 | 50 | PE | 13,822,800 | 0.43X | GA II
BT-474 | 50 | PE | 15,787,700 | 0.49X | GA II
MDA-MB-231 | 50 | PE | 12,372,400 | 0.39X | GA II
MDA-MB-468 | 50 | PE | 13,265,400 | 0.41X | GA II
MCF-7 | 50 | PE | 9,838,800 | 0.31X | GA II
T47D | 50 | PE | 13,514,100 | 0.42X | GA II
ZR-75-1 | 50 | PE | 13,484,800 | 0.42X | GA II
HCC1143 | 36 | SE | 15,038,736 | 0.25X | 1G

Six synthetic NGS datasets

• 500 amplifications and 500 deletions with random lengths from 500 bp to 1 Mbp on chr1
• Coverage: 0.5x, 2x, 5x, 10x, 20x, 50x
• 50 base pair short reads generated with SAMtools wgsim
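For the synthetic datasets, the number of reads a simulator must produce to reach a target coverage follows from coverage = reads × read length / region length. The sketch below works this out for the six coverage levels and 50 bp reads. The chr1 length is the hg19/GRCh37 value and is an assumption, since the slides do not name the assembly; `reads_for_coverage` is a hypothetical helper, not part of wgsim.

```python
import math

# Assumption: chr1 length in the hg19/GRCh37 assembly (not stated in the slides).
CHR1_LENGTH_HG19 = 249_250_621


def reads_for_coverage(coverage: float, region_length: int, read_length: int) -> int:
    """Smallest read count with reads * read_length / region_length >= coverage."""
    return math.ceil(coverage * region_length / read_length)


if __name__ == "__main__":
    for cov in (0.5, 2, 5, 10, 20, 50):
        n_reads = reads_for_coverage(cov, CHR1_LENGTH_HG19, 50)
        print(f"{cov}x -> {n_reads:,} reads of 50 bp")
```

When paired-end reads are simulated with a tool that takes a number of read pairs, the count above would be halved.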

Page 5

There is no consistent agreement among the tools

• For the cell lines, the size, number, span, and type of the detected CNVs differ across the tools.

[Figure: CNV size distributions. y-axis: 0% to 100%; x-axis: tools 1 to 6, repeated in eight groups; legend (CNV size): 0-1k, 1k-2k, 2k-5k, 5k-10k, 10k-50k, 50k-100k, 100k-500k, 500k-1000k, >1000k]

Few consensus genes among tools

[Figure legend: amplification, deletion. Tools: 1: SegSeq, 2: CNVer, 3: CNVnator, 4: FREEC, 5: CNV-seq, 6: ReadDepth]

Page 6

Tools show high sensitivity for cell line datasets

• Benchmarks: common CNVs in two published results
• Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives
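The 70% overlap criterion and the resulting sensitivity and precision can be sketched as below. The slides do not say which interval the overlap is measured against, so this example assumes reciprocal overlap (overlap length divided by the longer of the two intervals) and also requires the same chromosome and CNV type. The `CNV` record, `reciprocal_overlap`, and `evaluate` are hypothetical names for the illustration.

```python
from typing import List, NamedTuple, Tuple


class CNV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # "amp" or "del"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Overlap length divided by the longer interval; 0.0 if not comparable."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)


def evaluate(detected: List[CNV], benchmark: List[CNV],
             threshold: float = 0.7) -> Tuple[float, float]:
    """Return (sensitivity, precision) for the given overlap threshold.

    Sensitivity: fraction of benchmark CNVs matched by some detected CNV.
    Precision: fraction of detected CNVs matching some benchmark CNV.
    """
    true_positives = [d for d in detected
                      if any(reciprocal_overlap(d, b) > threshold for b in benchmark)]
    recovered = [b for b in benchmark
                 if any(reciprocal_overlap(d, b) > threshold for d in detected)]
    sensitivity = len(recovered) / len(benchmark) if benchmark else 0.0
    precision = len(true_positives) / len(detected) if detected else 0.0
    return sensitivity, precision
```

A tool that reports many extra calls keeps its sensitivity but loses precision under this definition, which is the pattern the slides describe as many false positive CNVs.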

[Figure: sensitivity and precision (0% to 100%) for MCF7]

There are many false positive CNVs

[Figure: Chr20, FREEC. Tracks: detected CNV, sample read count, normal read count, benchmark CNV region]

Page 7

There are many false positive CNVs

Sensitivity is higher than precision for high-coverage synthesized datasets

• Benchmark: known synthesized CNVs
• Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives

[Figure: precision versus coverage and sensitivity versus coverage. y-axis: 0% to 100%; x-axis (coverage): 0.5x, 2x, 5x, 10x, 20x, 50x; series: SegSeq, CNVer, CNVnator, FREEC, CNV-seq, ReadDepth]

Page 8

Conclusions

• The CNV results are not consistent across the tools.
• Most of the tools show high sensitivity and breakpoint accuracy; however, their precision is not high.
• Tools with advanced algorithms, such as CNVnator and FREEC, perform better; however, they are computationally more expensive.
• Tools that utilize paired-end information, such as CNVer and ReadDepth, detect CNVs more accurately.
• Development of more efficient and accurate tools is required.

Acknowledgements

– Peter Tonellato
– Zhengqiu Cai
– Erik Gafni
– Vincent Fusaro
– Chih-Lin Chi
– Michiyo Yamada
– Jessica Correia
– Matthew Crawford

Laboratory of Personalized Medicine (LPM)