The Breakage Fusion Bridge and other complex SVs: Combinatorics & Cancer Genomics
Vineet Bafna
Computer Science & Engineering, UCSD
Non-Allelic Homologous Recombination (NAHR)
3/11/13 Bafna
Can these mechanisms explain complex rearrangements?
Kitada, 2008 Zhao, Cancer Res. 2004
Somatic rearrangements in BACs. (A) Chromosome 17q21 amplicon in HCC1954 including ERBB2; (B) chimeric amplicon in HCC1954 including MYC; (C) chimeric amplicon in NCI-
H2171 including MYC; (D) chromosome 2 amplicon in NCI-H1770 including MYCN. The color of t...
Bignell G R et al. Genome Res. 2007;17:1296-1303
©2007 by Cold Spring Harbor Laboratory Press
…The architecture of rearrangements in this amplicon recapitulates remarkably well the structural features predicted by a classical breakage–fusion–bridge cycle involving sister chromatids, as originally proposed by Barbara McClintock (1941) (see Supplemental Fig. 4)…
…Therefore, the ERBB2 amplicon in HCC1954 bears all the architectural hallmarks at the sequence level of a classical sister chromatid breakage–fusion–bridge process, the first time that this has been demonstrated in human cancer…
PNAS 1939
B A A B C D E F G
B A A B C D E F G D C B A A B B A A B C D
C C C C D D C B A A B B A A B C D E F G
Telomere fusions in breast carcinoma
Tanaka, PNAS 2012
B A A B C D E F G D C B A A B B A A B C D E F G
C C C C D D C B A A B B A A B C D E F G
A B C D E F G
The BFB count-vector problem
While we cannot sequence and assemble the genome, we can measure copy counts of individual segments
The BFB reconstruction problem
A B C D E F G
Given a segmentation of the genome, and counts of copy numbers for each segment, are the copy numbers consistent with a BFB sequence?
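A minimal Python sketch of how a BFB sequence gives rise to a count vector, assuming the telomere-lacking end is on the left, so that one cycle prepends a reversed prefix of the current chromosome. The names `bfb_cycle` and `count_vector` and the break points 2 and 6 are illustrative choices, not from the talk; they reproduce the example strings B A A B C D E F G and D C B A A B B A A B C D E F G shown on the earlier slides.

```python
from collections import Counter


def bfb_cycle(chrom, k):
    """Apply one breakage-fusion-bridge cycle to a list of segment labels.

    Assumption: the sister chromatids fuse at the left end, giving
    reversed(chrom) + chrom, and the bridge breaks so that only the k
    segments of the reversed copy nearest the fusion point are kept.
    """
    if not 0 <= k <= len(chrom):
        raise ValueError("break point must fall inside the duplicated copy")
    return chrom[:k][::-1] + chrom


def count_vector(chrom, segments):
    """Copy count of each reference segment, in reference order."""
    counts = Counter(chrom)
    return [counts[s] for s in segments]


segments = list("ABCDEFG")
after_one = bfb_cycle(segments, 2)   # B A A B C D E F G
after_two = bfb_cycle(after_one, 6)  # D C B A A B B A A B C D E F G
print(" ".join(after_two), count_vector(after_two, segments))
```

The reconstruction problem runs in the opposite direction: given only the count vector, decide whether some sequence of such cycles could have produced it.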
A B C D E F G
✔ ✗
Input: 12 10 2
BFB Trees
Properties of BFB trees
• Properties
  – Long ends
  – Node symmetry
  – Pair symmetry
• A traversal of a tree with these 3 properties is a BFB sequence.
Pairs
Finding Longest Palindromes
2 10 12
1
1·n1= 5
122
2·n1 + 2·n2 + 1·n3 = 6
2 4 2·n1 + 4·n2 = x
BFB-Tree Has Better Worst-Case Performance
BFB-Tree for [20, 3, 19, 19, 19] took 12 seconds. BFB-Pivot for [20, 18, 20, 20, 6] took 22,251 seconds
Most Count Vectors Do Not Admit BFB
• Expanded analysis to 64,000,000 count vectors of length <= 6 and ni <= 20. Of those count vectors, only 7.2% can be achieved by BFB.
• Finding a count vector achievable by BFB is strong evidence BFB occurred?
Experimental Count Vectors Are Imprecise
Lung cancer cell-line PTX250
[3 3 4 6 34]
Many CVs close to a BFB sequence
A non-BFB count vector
[1, 2, 1, 2, 8, 1, 3, 1, 3] → [2, 2, 2, 2, 8, 3, 3, 3, 3]
Zhao, Cancer Res. 2004 (lung cancer cell-line NCI-H2171)
A linear time algorithm for BFB
• Once you allow errors in counts, many count vectors can conceivably be explained by BFB, although there are a few that are conclusively not BFB
• What if we choose large numbers of segments (k=8,10,12)? The BFB-tree heuristic is much too slow to deal with large data sets.
Count vector canonization
4  D D D D D D D D
3  C C C C C C C C C C C C
2  B B B B B B B B B B B B B B B B B B
1  A A A A A A A A A A A A A A
0  $ $
n′ = [2, 14, 18, 12, 8]
Layer 4
n′_4 = [0, 0, 0, 0, 8]
• 4× DD
Layer 3
• There are 6 CC at level 3 (12 copies of C), but only 4 DD blocks.
• Solution:
  – Add 2 ε to the 4 DDs
  – Wrap to get 4× CDDC, 2× CC
n′_3 = [0, 0, 0, 12, 8]
Layer 2
• We have 6 blocks (4× CDDC, 2× CC) at level 3, but need 9 at level 2.
• Solution: add 3 ε, then wrap to get 4× BCDDCB, 2× BCCB, 3× BB
n′_2 = [0, 0, 18, 12, 8]
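The ε-padding and wrapping steps for layers 4 through 2 can be sketched in Python as follows. This is only an illustration under the assumption that blocks are plain strings and ε is the empty string; the helper name `wrap` is hypothetical, and the folding needed for layers 1 and 0 is not covered.

```python
from collections import Counter

EPSILON = ""  # the empty block


def wrap(blocks, label, n_epsilon=0):
    """Pad the collection with n_epsilon empty blocks, then flank every
    block with `label` on both sides."""
    padded = list(blocks) + [EPSILON] * n_epsilon
    return [label + block + label for block in padded]


# Layer 4: 8 copies of D form 4 blocks DD.
layer4 = wrap([EPSILON] * 4, "D")
# Layer 3: 12 copies of C need 6 blocks, so add 2 ε before wrapping.
layer3 = wrap(layer4, "C", n_epsilon=2)
# Layer 2: 18 copies of B need 9 blocks, so add 3 ε before wrapping.
layer2 = wrap(layer3, "B", n_epsilon=3)

print(Counter(layer3))
print(Counter(layer2))
```

Each wrapped block contributes exactly two copies of the new label, which is why a layer with count c needs c / 2 blocks.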
Layer 1
• 9 blocks at level 2, but 7 at level 1
• Fold 2× {BCDDCB-BCCB-BCDDCB} to get 5 blocks
• Add 2 ε to get 7
• Wrap to get 2× {A-BCDDCB-BCCB-BCDDCB-A}, 3× ABBA, 2× AA
n′_1 = [0, 14, 18, 12, 8]
Layer 0
• Fold all of the 7 blocks at level 1 to get a single block.
• Wrap the $ around the block
n′_0 = [2, 14, 18, 12, 8]
Wrapping and folding
• The algorithm does a sequence of wrapping and folding.
• The folding must be done in a special way so that a greedy algorithm always folds correctly.
Folding
• Only the height of the block is important. Sort the blocks by their heights.
• Consider a non-increasing (by height) collection with 2i copies of the i-th height block.
• We call such a collection an l-convex collection.
• An l-convex collection can be rearranged into an l-convex palindrome.
Folding
• An l-convex collection can be rearranged into an l-convex palindrome.
• An l-convex palindrome may not be a BFB block.
• However, all BFB palindromes can be created by flanking an l-BFB-palindrome by two blocks of max (or higher) height
The algorithm
• In going from layer i to layer i-1:
  – If the number of blocks in layer i-1 is larger, just add extra ε
  – If the number is smaller, do a series of l-convex-collection folding operations, each time reducing the number of blocks by a power of 2. If a fold overshoots the target, add extra ε
• How can we fold in a way that will not corrupt the blocks for future layers?
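A rough Python sketch of the case analysis above, assuming a canonized count vector whose entries are all even, so that layer i holds counts[i] / 2 blocks. `layer_plan` is a hypothetical helper: it only reports which of the two cases applies at each step and by how many blocks the layers differ. It does not perform the folds themselves, which (as in the Layer 1 example) may overshoot and then be padded with ε.

```python
def layer_plan(counts):
    """For each step from layer i to layer i-1, report whether ε-padding
    or folding is needed, and the difference in block counts.

    counts[i] is the copy count at layer i, with layer 0 outermost.
    """
    if any(c % 2 for c in counts):
        raise ValueError("this sketch assumes every count is even")
    blocks = [c // 2 for c in counts]
    plan = []
    for i in range(len(blocks) - 1, 0, -1):
        have, need = blocks[i], blocks[i - 1]
        if need >= have:
            plan.append((i, i - 1, "add", need - have))
        else:
            plan.append((i, i - 1, "fold", have - need))
    return plan


for step in layer_plan([2, 14, 18, 12, 8]):
    print(step)
```

For the running example this reports ε-padding for the steps 4 to 3 and 3 to 2, and folding for 2 to 1 and 1 to 0, matching the layer-by-layer walkthrough.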
Recursive Block decompositions
• Remove the odd block(s) B, and put them in the center.
• Blocks higher than B can be put around B
• Divide remaining blocks into half, and recurse.
Recursive Block decompositions
d   B_d (half of remaining)    L_d (odd blocks)   H_d (higher than odd)   s_d
0   {4β10, 2β7, 2β9, β8}       β8                 2β7
1   {2β10, β9}                 β9                 Φ
2   β10                        β10                Φ
3   Φ                          Φ                  Φ
Block decompositions and signature
d   B_d (half of remaining)    L_d (odd blocks)   H_d (higher than odd)   s_d
0   {4β10, 2β7, 2β9, β8}       β8                 2β7                     1
1   {2β10, β9}                 β9                 Φ                       0
2   β10                        β10                Φ                       0
3   Φ                          Φ                  Φ                       -1
s_0 = |L_0|
s_d = |L_d| - |L_{d-1}| - \frac{1}{2}|H_{d-1}| + \max\{s_{d-1}, 0\}
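A small Python check of the signature recurrence, reading it as s_0 = |L_0| and s_d = |L_d| - |L_{d-1}| - (1/2)|H_{d-1}| + max{s_{d-1}, 0}. The cardinality bars are an assumption recovered from the table values, and the function name `signature` is illustrative. The inputs are the sizes of L_d and H_d from the decomposition table.

```python
from fractions import Fraction


def signature(l_sizes, h_sizes):
    """Compute [s_0, s_1, ...] from the sizes |L_d| and |H_d|.

    Fractions keep the half-integer term exact when |H_d| is odd.
    """
    sig = [Fraction(l_sizes[0])]
    for d in range(1, len(l_sizes)):
        s_d = (
            l_sizes[d]
            - l_sizes[d - 1]
            - Fraction(h_sizes[d - 1], 2)
            + max(sig[d - 1], 0)
        )
        sig.append(s_d)
    return sig


# Sizes read off the table: |L_d| = 1, 1, 1, 0 and |H_d| = 2, 0, 0, 0.
print([int(s) for s in signature([1, 1, 1, 0], [2, 0, 0, 0])])
```

Under this reading the recurrence reproduces the s_d column of the table (1, 0, 0, -1). Signatures stored as Python lists or tuples compare lexicographically, which matches the ordering used on the next slide.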
Signature
• We order the signature using lexicographic sort.
• Let B1, B2 be block collections of the same size with s(B1) <= s(B2).
• Theorem 1: If B2 can be folded to size n, B1 can be folded to size n
• Theorem 2: We can fold in a way so as to always achieve the minimum signature in the folded block
Results
• There is an algorithm following the described sketch that produces a BFB string corresponding to the input vector iff such a string exists.
  – Time complexity: O(n), where n is the maximum count in the vector (i.e., linear in the length of the output string, yet exponential in the length of the input representation, which is O(log n)).
  – Variants:
    • An O(log n) algorithm for the decision problem
    • An O(n^2 log n) algorithm for the problem in the case where the input is assumed to contain measurement errors
Using BFB in practice
• We apply a sliding window. At each position (and window length), we identify a count vector with minimum distance to a BFB.
• For each window, we output the distance between the observed count vector and the nearest BFB
• If the distance meets a threshold, we accept; else we reject
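The sliding-window procedure can be sketched as below. The talk does not define the distance measure here, so `distance_to_bfb` is passed in as a parameter; `scan_windows` and `toy_distance` are hypothetical names, and `toy_distance` is purely a placeholder so the example runs, not a real BFB distance. "Meets a threshold" is assumed to mean distance <= threshold.

```python
def scan_windows(counts, window_lengths, distance_to_bfb, threshold):
    """Slide windows of each length over the segment counts and keep
    those whose distance to the nearest BFB count vector is small enough.

    Returns (start, length, distance) for every accepted window.
    """
    hits = []
    for length in window_lengths:
        for start in range(len(counts) - length + 1):
            window = counts[start:start + length]
            d = distance_to_bfb(window)
            if d <= threshold:
                hits.append((start, length, d))
    return hits


def toy_distance(window):
    """Placeholder only: fraction of odd counts in the window."""
    return sum(c % 2 for c in window) / len(window)


print(scan_windows([2, 4, 3, 8], [2, 3], toy_distance, threshold=0.0))
```

In a real pipeline the placeholder would be replaced by a routine that searches for the nearest count vector admitting a BFB sequence.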
Detecting BFB versus other rearrangements
• The BFB operations take place among other rearrangements that change the copy number
• The count of copies is inexact at best
• Negative examples: We simulate 15000 diploid chromosomes undergoing over 50 rearrangements (inversions, tandem duplications, deletions)
• Positive examples: 5000 cases with 50 rearrangements on diploid chromosomes, and one with 2-10 BFB operations
• In each case, a count vector is created
• δ = min. distance of the count vector from a true BFB
63% FPR
Fold-backs
• Long count vectors certainly help, but they might not be enough either.
• If 10% of the cases are BFB, a 10% FPR means that roughly 50% of the positive calls in naturally occurring samples are false positives.
• Working with 1% FPR reduces TP to 18%.
• Excess of fold-backs (e.g. -A, A) might provide other clues
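The base-rate arithmetic behind the 10% FPR bullet can be checked with a few lines of Python. The function name `false_call_fraction` is illustrative, and the true-positive rate of 1.0 used for the first case is an assumption (the slide does not state it); with a lower TPR the fraction of false calls only gets larger.

```python
def false_call_fraction(prevalence, fpr, tpr):
    """Fraction of positive calls that are false, given the share of
    true BFB cases (prevalence), the false-positive rate and the
    true-positive rate."""
    true_calls = prevalence * tpr
    false_calls = (1 - prevalence) * fpr
    total = true_calls + false_calls
    if total == 0:
        raise ValueError("no positive calls are made at these rates")
    return false_calls / total


# 10% BFB cases, 10% FPR, assumed perfect sensitivity.
print(round(false_call_fraction(0.10, 0.10, 1.0), 3))
# 10% BFB cases, 1% FPR, 18% TPR as quoted on the slide.
print(round(false_call_fraction(0.10, 0.01, 0.18), 3))
```

The first case gives a little under one half, consistent with the slide's rough 50% figure.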
-C -B -A A B C D E F
Detecting fold-backs
Donor
Reference
A simple score function
• f = fraction of breakpoints that are fold-backs (higher is better)
• δ = min. distance of the count vector from a true BFB
• s = (1-λ)δ + λ(1-f) is a generic score function (lower is better)
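The score function translates directly into Python. The name `bfb_score` is illustrative, and restricting λ and f to the range [0, 1] is an assumption that follows from reading λ as a mixing weight and f as a fraction.

```python
def bfb_score(delta, foldback_fraction, lam):
    """s = (1 - lam) * delta + lam * (1 - f); lower scores are more BFB-like.

    delta: min. distance of the count vector from a true BFB
    foldback_fraction: f, fraction of breakpoints that are fold-backs
    lam: weight given to the fold-back evidence (assumed in [0, 1])
    """
    if not 0 <= lam <= 1:
        raise ValueError("lam is assumed to lie in [0, 1]")
    if not 0 <= foldback_fraction <= 1:
        raise ValueError("foldback_fraction is a fraction in [0, 1]")
    return (1 - lam) * delta + lam * (1 - foldback_fraction)


# lam = 0 ignores fold-backs; lam = 1 ignores the count vector.
print(bfb_score(0.2, 0.5, 0.0), bfb_score(0.2, 0.5, 1.0))
```

Setting λ between the two extremes trades off count-vector distance against fold-back enrichment.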
63% FPR
Detecting BFBs
• With errors in segment copy counts, it is easy to create short, false BFB-like count vectors.
• We cannot detect short BFB cycles.
• With 8 or more BFB cycles, and correspondingly long count vectors, we need advanced algorithms to identify BFBs.
• Long count vectors allow us to detect BFB operations with high confidence even amidst noise of other rearrangements.
• Other signals, such as fold-back operations, help in further improving the signal.
Campbell, Nature 2010 (exemplar of BFB in a patient with pancreatic cancer)
• Note the excess of head-head fold-backs relative to other rearrangements
• We identify a count vector of length 10 with a very low distance threshold that allows us to detect this as BFB with some confidence.
Chr 12: BFB in primary tumor (pancreatic) sample
Chromothripsis
Observations
• Massive remodeling of a single chromosome
• Tens to hundreds of genomic rearrangements
• Very few copy number states (two or perhaps three).
• Retention of heterozygosity in regions with higher copy number.
• Breakpoints show significantly more clustering than observed by chance.
Concluding thoughts: complex structural variations, or not?
• There are many lines of enquiry for complex structural variation
  – Given fragmentary data, can you use it to reconstruct the architecture of the sampled genome (AKA sequence assembly)?
  – Given a set of operations, predict a plausible sequence of events (or the number) that takes a sampled genome to a reference target (AKA genome rearrangements)
  – Given a sampled genome, what kind of rearrangements are the most likely explanation? Is there evidence for more complex rearrangements (e.g., this talk)?