succinct - ucf computer sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 diego díaz...
TRANSCRIPT
![Page 1: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/1.jpg)
Succinct de Bruijn graphsDiego Díaz DomínguezDepartment of Computer Science
University of Chile
![Page 2: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/2.jpg)
de Bruijn graphs
Given a parameter k and a set ! of strings, a de Bruijn graph
(dBG) is a directed graph that encodes the k-1 overlaps
between all the k-substrings (k-mers) of !.
ATGC TGCT
GCTA
CTAT
TATG
TGCG GCGT
k=5
T
A
G
T
G
C
T
ATGCTATGCGTATGCTGCT
k-mer = ATGCT
!={ }
10/25/18 2Diego Díaz Domínguez
![Page 3: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/3.jpg)
de Bruijn graphs
Different values for k yield different graph topologies
10/25/18 3Diego Díaz Domínguez
!=
C
ATGCTACTATGC
ATGCGT
ATG
TGC GCT
CTATAT
GCG CGT
k=4
T
A
TG
G
T
ATGC TGCT
GCTA
CTAT
TATG
TGCG GCGT
k=5
T
A
G
T
G
C
![Page 4: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/4.jpg)
de Bruijn graphsAdvantages:• They are easy to construct• They catch the underlying structure of the string set• They can be used to solve several problems in Bioinformatics
Disadvantages:• Choosing the right value for k is always a major concern.• The size of the graph grows with k.• Having several dBG, with different orders, for the same
dataset is expensive.
10/25/18 4Diego Díaz Domínguez
![Page 5: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/5.jpg)
Encoding a dBG
10/25/18 5Diego Díaz Domínguez
Things we have to do:• Represent the nodes and the edges succinctly• Traverse the graph efficiently• Combine the information of several graph orders• Colour the nodes
![Page 6: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/6.jpg)
Operations in a dBG
10/25/18 6Diego Díaz Domínguez
Navigational operations we are interested in:
• outdegree(v):number of outgoing edges of node v• indegree(v):number of incoming edges of v• outgoing(v,a):follow outgoing edge of v labeled with a• incoming(v,a):follow an incoming node of v whose label
starts with a
![Page 7: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/7.jpg)
Succinct dBG: BOSS
10/26/18 7Diego Díaz Domínguez
C
$$$ATGCTA$$$$$$CTATGC$$$$$$ATGCGT$$$
ATG
TGC
GCTCTA
TATGCG
CGT
k=4
T
A
T
G
G
T
$AT$$A
$$$
$$C $CT
TA$A$$
GT$
T$$
GC$
C$$
A
TG
C
TA
$
$$
$$
$
$$
$
$$$ A1 0
C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0
T2 1$$C T3 1TGC $ 0
G1 0
T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1
BOSSMatrix
EdgeBWT
NodeMarks
C[$] = 0C[A] = 7C[C] = 9
C[G] = 11 C[T] = 13
2
2
![Page 8: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/8.jpg)
BOSS:outdegree
10/26/18 8Diego Díaz Domínguez
2
2
select(NodeMarks,11)
select(NodeMarks,10)+1
![Page 9: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/9.jpg)
BOSS: forward
10/26/18 9Diego Díaz Domínguez
$$$ A1 0
C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0
T2 1$$C T3 1TGC $ 0
G1 0
T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1
BOSSMatrix
EdgeBWT
NodeMarks
C
ATG
TGC
GCTCTA
TATGCG
CGT
T
A
T
G
G
T
$AT$$A
$$$
$$C $CT
TA$A$$
GT$
T$$
GC$
C$$
A
TG
C
TA
$
$$
$$
$
$$
$
11
18
C[$]=0C[A]=7C[C]=9C[G]=11C[T]=13
forward(11,T)= 17= C[C] + rank(EdgeBWT,T,15)= 13 + 4 = 17
2
217
11
15
![Page 10: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/10.jpg)
$$$ A1 0
C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0
T2 1$$C T3 1TGC $ 0
G1 0
T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1
BOSSMatrix
EdgeBWT
NodeMarks
C
ATG
TGC
GCTCTA
TATGCG
CGT
T
A
T
G
G
T
$AT$$A
$$$
$$C $CT
TA$A$$
GT$
T$$
GC$
C$$
A
TG
C
TA
$
$$
$$
$
$$
$
C[$]=0C[A]=7C[C]=9C[G]=11C[T]=13
a
b
BOSS: indegree
10/26/18 10Diego Díaz Domínguez
select(EdgeBWT,G,3)=b
select(EdgeBWT,G,2)=aindegree(13)= 2= 1 + rank(EdgeBWT,*G,b)–
rank(EdgeBWT,*G,a)= 2
13
2
2
![Page 11: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/11.jpg)
Bidirectional BOSS
10/26/18 11Diego Díaz Domínguez
incoming is more expensive than outgoing because we don’t have accessto the first symbol of the nodes. By doubling the size of the index, we cancompute incoming fast.
C
ATG
TGC
GCTCTA
TATGCG
CGT
T
A
T
G
G
T
$AT$$A
$$$
$$C $CT
TA$A$$
GT$
T$$
GC$
C$$
A
TG
C
TA
$
$$
$$
$
$$
$
$$$ A 0
C 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T 1CTA $ 0
T 1$$C T 1TGC $ 0
G 0
T 1GCG T 1ATG C 1$AT G 1TAT *G 1$CT A 1GCT *A 1CGT $ 1
$$$ $ 1A$$ *$ 1C$$ *$ 1TA$ $ 1TC$ $ 1$$A T 1GTA $ 0
T 1$$C G 1TGC G 1ATC $ 0
G 1$CG T 1GCG *T 1TCG *T 1$TG C 1$$T G 1$AT C 1TAT *C 1CGT A 1
BOSSrBOSS
outgoingselect
incoming(13,G)= 21
13
21
![Page 12: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/12.jpg)
Variable-order BOSS
10/26/18 12Diego Díaz Domínguezk=4 k=2
By augmenting BOSS with the LCP, we can represent all the dBG up to order k
i
i’->LCP[i’]<(k-1)
j->LCP[j]<(k-1)
![Page 13: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/13.jpg)
Encoding EdgeBWT: Wavelet trees
10/25/18 13Diego Díaz Domínguez
![Page 14: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/14.jpg)
Encoding EdgeBWT: RL-BWT
10/25/18 14Diego Díaz Domínguez
![Page 15: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/15.jpg)
Closing remarks
10/26/18 15Diego Díaz Domínguez
• We can compress several dBG and still supporting fast operations
• It is still necessary to propose new algorithms that work ontop of BOSS
• Can we encode the patterns of the graph? (bubbles, tips cross-links)
• Can we combine BOSS with other compression schemes (LZ, grammars) to improve compression even further?
![Page 16: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/16.jpg)
10/26/18 16Diego Díaz Domínguez
Thanks!
![Page 17: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan](https://reader035.vdocuments.net/reader035/viewer/2022071009/5fc7231ef36839315236ea5f/html5/thumbnails/17.jpg)
Encoding EdgeBWT: RL-BWT
10/25/18 17Diego Díaz Domínguez
1 $1 *$1 *$1 $1 $0 T1 $0 T0 G0 G1 $0 G0 T0 *T0 *T0 C0 G0 C0 *C0 A
0 T0 T0 G0 G0 G0 T1 *T1 *T0 C0 G0 C1 *C0 A
T T
T T
C G
G
G
T
C
G
C
A
D EdgeBWT
M
E* RL-E