analysis of sequence-based copy number variation detection tools for cancer studies
DESCRIPTION
2013 Summit on Translational BioinformaticsTRANSCRIPT
1
Analysis of sequence-‐based copy number varia7on detec7on tools for
cancer studies
Sheida Nabavi Zhengqiu Cai
Peter J Tonellato
Laboratory for Personalized Medicine Center For Biomedical Informa7cs
Harvard Medical School
Why focus on CNV?
CNVs have been associated with various diseases including cancers. – Amplifica7on of oncogenes and loss of tumor suppressor genes
CNVs (amplifica7ons and dele7ons of chromosomal segments) are major structural varia7ons.
amplifica7on
2
NGS is emerging as the best plaTorm for CNV detec7on
Array based technologies have been used to study CNVs
Sequence based technologies (using next genera7on sequencing (NGS)) are emerging
Advantages Disadvantages
Affordable Noisy , Hybridiza7on
Established Limited resolu7on
High resolu7on arrays Predefined probes
Advantages Disadvantages
Higher resolu7on and accuracy Currently expensive
Single plaTorm No standard analysis
Rapid cost reduc7on Computa7onally demanding
Objec7ves of this work
• Indicate the performance characteris7cs of the recent sequence-‐based CNV detec7on tools to facilitate tool selec7on.
• Iden7fy strengths and weaknesses of current CNV tools. – Address limita7ons
– Iden7fy modifica7ons to improve performance – Develop new method
3
Typical pipeline for read depth-‐based CNV detec7on
The number of reads that align to a posi7on in a genome is propor7onal to the copy number at that posi7on
Alignment
NGS data CNV list Coun7ng reads Normaliza7on
CNV detec7on
Segmenta7on
Reference genome
Short read Coun7ng window @SRR034720.3267591 length=36 ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT + 88888888888888888888885888%888888/8) @SRR034720.3267592 length=36 TCGGGAACGTCTCGACCGAAATTATTTTGTATGTCT + 8888788888888888878-‐188288881878"888 . . .
NGS data Read count
Normaliza7on reduces biases
Soma7c amplifica7on
Soma7c dele7on
Control
Aligned short reads
Sample
Main sources of biases are GC-‐content, mapability, and sample prepara7on
4
Read depth-‐based CNV detec7on Tools
Tool Method SE/PE
control Reference
CNVnator Analyze depth of coverage (DOC), mean-shift technique for segmentation, correct GC bias
SE PE
No Abyzov et al. (2011), Genome Res, 21, 974-984
ReadDepth Analyze DOC, circular binary segmentation algorithm, use negative binomial distribution, use PE information
SE PE
No Miller et al. (2011), PLoS One, 6, e16327
FREEC Analyze DOC, use LASSO-based algorithm for segmentation, normalize for GC content
SE PE
Yes/No
Boeva et al. (2011), Bioinformatics, 27, 268-269
CNVer Analyze DOC with paired-end mapping information, repeat graph algorithm
PE No Medvedev et al. (2010), Genome Res, 20, 1613-1622
CNV-seq Analyze DOC, fixed window, no segmentation
SE PE
Yes Xie and Tammi (2009), BMC Bioinformatics, 10, 80
SegSeq Analyze DOC, extend a window to include a fixed number of reads
SE PE
Yes Chiang et al. (2009), Nat Methods, 6, 99-103
Cell line and synthesized NGS datasets
Eight breast cancer cell line NGS datasets
• 500 amplifications and 500 deletions with random lengths form 500 bp to 1Mbp on chr1 • Coverage: 0.5, 2, 5, 10, 20, 50 • 50 base pair short read using SAMtools wgsim
Cell Line Read length
Single end/paired end PF Reads Coverage Illumina
platform BT-20 50 PE 13,822,800 0.43 X GA II
BT-474 50 PE 15,787,700 0.49 X GA II
MDA-MB-231 50 PE 12,372,400 0.39 X GA II
MDA- MB-468 50 PE 13,265,400 0.41 X GA II
MCF-7 50 PE 9,838,800 0.31 X GA II
T47D 50 PE 13,514,100 0.42 X GA II
ZR-75-1 50 PE 13,484,800 0.42 X GA II
HCC1143 36 SE 15,038,736 0.25 X 1G
Six Synthe7c NGS datasets
5
There is no consistent agreement among the tools
Cell lines detected CNVs’ size, number, span and type are different across the tools.
0%
20%
40%
60%
80%
100% >1000K 500k-‐1000k 100k-‐500k 50k-‐100k 10k-‐50k 5k-‐10k 2k-‐5k 1k-‐2k 0-‐1k
CNV size
CNV size distribu7ons
Few consensus genes among tools
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
amplifica7on, dele7on 1: SegSeq, 2: CNVer, 3: CNVnator, 4: FREEC, 5: CNV-‐seq, 6: ReadDepth
6
Tools show high sensi7vity for cell line datasets
Benchmarks: Common CNVs in two published results
Detected CNVs with more 70% overlap with the benchmark CNVs are called true posi7ve
0%
20%
40%
60%
80%
100% MCF7�
Sensi7vity
Precision
There are many false posi7ve CNVs
Chr20, FREEC
Detected CNV Sample read count Normal read count Benchmark CNV region
7
There are many false posi7ve CNVs
Higher sensi7vity compare to precision for high coverage synthesized datasets
Benchmark: known synthesized CNVs
Detected CNVs with more 70% overlap with the benchmark CNVs are called true posi7ve
0%
20%
40%
60%
80%
100%
0.5x 2x 5x 10x 20x 50x
Precision�
Coverage
0%
20%
40%
60%
80%
100%
0.5x 2x 5x 10x 20x 50x
Sensi7vity�
Coverage�
SegSeq CNVer CNVnator FREEC CNV-‐seq ReadDepth
8
Conclusions
• The CNV results across the tools are not consistence. • Most of the tools show high sensi7vity and breakpoint accuracy, however their precision is not high.
• Tools with advanced algorithms such as CNVnator and FREEC perform beoer, however they are computa7onally more expensive.
• Tools u7lize pair end informa7on, such as CNVer and ReadDepth, detect CNVs more accurately.
• Development of more efficient and accurate tools is required.
Acknowledgements
– Peter Tonellato – Zengqui Cai – Erik Gafni – Vincent Fusaro – Chih-‐Lin Chi – Michiyo Yamada – Jessica Correia – Maohew Crawford
Laboratory of Personalized Medicine (LPM)