analysis of sequence-based copy number variation detection tools for cancer studies

1

Analysis of sequence-‐based copy number varia7on detec7on tools for

cancer studies

Sheida Nabavi Zhengqiu Cai

Peter J Tonellato

Laboratory for Personalized Medicine Center For Biomedical Informa7cs

Harvard Medical School

Why focus on CNV?

CNVs have been associated with various diseases including cancers. –  Amplifica7on of oncogenes and loss of tumor suppressor genes

CNVs (amplifica7ons and dele7ons of chromosomal segments) are major structural varia7ons.

amplifica7on

2

NGS is emerging as the best plaTorm for CNV detec7on

Array based technologies have been used to study CNVs

Sequence based technologies (using next genera7on sequencing (NGS)) are emerging

Advantages Disadvantages

Affordable Noisy , Hybridiza7on

Established Limited resolu7on

High resolu7on arrays Predefined probes

Advantages Disadvantages

Higher resolu7on and accuracy Currently expensive

Single plaTorm No standard analysis

Rapid cost reduc7on Computa7onally demanding

Objec7ves of this work

•  Indicate the performance characteris7cs of the recent sequence-‐based CNV detec7on tools to facilitate tool selec7on.

•  Iden7fy strengths and weaknesses of current CNV tools. –  Address limita7ons

–  Iden7fy modifica7ons to improve performance –  Develop new method

3

Typical pipeline for read depth-‐based CNV detec7on

The number of reads that align to a posi7on in a genome is propor7onal to the copy number at that posi7on

Alignment

NGS data CNV list Coun7ng reads Normaliza7on

CNV detec7on

Segmenta7on

Reference genome

Short read Coun7ng window @SRR034720.3267591 length=36 ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT + 88888888888888888888885888%888888/8) @SRR034720.3267592 length=36 TCGGGAACGTCTCGACCGAAATTATTTTGTATGTCT + 8888788888888888878-‐188288881878"888 . . .

NGS data Read count

Normaliza7on reduces biases

Soma7c amplifica7on

Soma7c dele7on

Control

Aligned short reads

Sample

Main sources of biases are GC-‐content, mapability, and sample prepara7on

4

Read depth-‐based CNV detec7on Tools

Tool Method SE/PE

control Reference

CNVnator Analyze depth of coverage (DOC), mean-shift technique for segmentation, correct GC bias

SE PE

No Abyzov et al. (2011), Genome Res, 21, 974-984

ReadDepth Analyze DOC, circular binary segmentation algorithm, use negative binomial distribution, use PE information

SE PE

No Miller et al. (2011), PLoS One, 6, e16327

FREEC Analyze DOC, use LASSO-based algorithm for segmentation, normalize for GC content

SE PE

Yes/No

Boeva et al. (2011), Bioinformatics, 27, 268-269

CNVer Analyze DOC with paired-end mapping information, repeat graph algorithm

PE No Medvedev et al. (2010), Genome Res, 20, 1613-1622

CNV-seq Analyze DOC, fixed window, no segmentation

SE PE

Yes Xie and Tammi (2009), BMC Bioinformatics, 10, 80

SegSeq Analyze DOC, extend a window to include a fixed number of reads

SE PE

Yes Chiang et al. (2009), Nat Methods, 6, 99-103

Cell line and synthesized NGS datasets

Eight breast cancer cell line NGS datasets

•  500 amplifications and 500 deletions with random lengths form 500 bp to 1Mbp on chr1 •  Coverage: 0.5, 2, 5, 10, 20, 50 •  50 base pair short read using SAMtools wgsim

Cell Line Read length

Single end/paired end PF Reads Coverage Illumina

platform BT-20 50 PE 13,822,800 0.43 X GA II

BT-474 50 PE 15,787,700 0.49 X GA II

MDA-MB-231 50 PE 12,372,400 0.39 X GA II

MDA- MB-468 50 PE 13,265,400 0.41 X GA II

MCF-7 50 PE 9,838,800 0.31 X GA II

T47D 50 PE 13,514,100 0.42 X GA II

ZR-75-1 50 PE 13,484,800 0.42 X GA II

HCC1143 36 SE 15,038,736 0.25 X 1G

Six Synthe7c NGS datasets

5

There is no consistent agreement among the tools

Cell lines detected CNVs’ size, number, span and type are different across the tools.

0%

20%

40%

60%

80%

100% >1000K 500k-‐1000k 100k-‐500k 50k-‐100k 10k-‐50k 5k-‐10k 2k-‐5k 1k-‐2k 0-‐1k

CNV size

CNV size distribu7ons

Few consensus genes among tools

1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6

amplifica7on, dele7on 1: SegSeq, 2: CNVer, 3: CNVnator, 4: FREEC, 5: CNV-‐seq, 6: ReadDepth

6

Tools show high sensi7vity for cell line datasets

Benchmarks: Common CNVs in two published results

Detected CNVs with more 70% overlap with the benchmark CNVs are called true posi7ve

0%

20%

40%

60%

80%

100% MCF7�

Sensi7vity

Precision

There are many false posi7ve CNVs

Chr20, FREEC

Detected CNV Sample read count Normal read count Benchmark CNV region

7

There are many false posi7ve CNVs

Higher sensi7vity compare to precision for high coverage synthesized datasets

Benchmark: known synthesized CNVs

Detected CNVs with more 70% overlap with the benchmark CNVs are called true posi7ve

0%

20%

40%

60%

80%

100%

0.5x 2x 5x 10x 20x 50x

Precision�

Coverage

0%

20%

40%

60%

80%

100%

0.5x 2x 5x 10x 20x 50x

Sensi7vity�

Coverage�

SegSeq CNVer CNVnator FREEC CNV-‐seq ReadDepth

8

Conclusions

•  The CNV results across the tools are not consistence. •  Most of the tools show high sensi7vity and breakpoint accuracy, however their precision is not high.

•  Tools with advanced algorithms such as CNVnator and FREEC perform beoer, however they are computa7onally more expensive.

•  Tools u7lize pair end informa7on, such as CNVer and ReadDepth, detect CNVs more accurately.

•  Development of more efficient and accurate tools is required.

Acknowledgements

– Peter Tonellato – Zengqui Cai – Erik Gafni – Vincent Fusaro – Chih-‐Lin Chi – Michiyo Yamada –  Jessica Correia – Maohew Crawford

Laboratory of Personalized Medicine (LPM)

analysis of sequence-based copy number variation detection tools for cancer studies

Documents