succinct - ucf computer sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 diego díaz...

17
Succinct de Bruijn graphs Diego Díaz Domínguez Department of Computer Science University of Chile

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Succinct de Bruijn graphsDiego Díaz DomínguezDepartment of Computer Science

University of Chile

Page 2: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

de Bruijn graphs

Given a parameter k and a set ! of strings, a de Bruijn graph

(dBG) is a directed graph that encodes the k-1 overlaps

between all the k-substrings (k-mers) of !.

ATGC TGCT

GCTA

CTAT

TATG

TGCG GCGT

k=5

T

A

G

T

G

C

T

ATGCTATGCGTATGCTGCT

k-mer = ATGCT

!={ }

10/25/18 2Diego Díaz Domínguez

Page 3: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

de Bruijn graphs

Different values for k yield different graph topologies

10/25/18 3Diego Díaz Domínguez

!=

C

ATGCTACTATGC

ATGCGT

ATG

TGC GCT

CTATAT

GCG CGT

k=4

T

A

TG

G

T

ATGC TGCT

GCTA

CTAT

TATG

TGCG GCGT

k=5

T

A

G

T

G

C

Page 4: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

de Bruijn graphsAdvantages:• They are easy to construct• They catch the underlying structure of the string set• They can be used to solve several problems in Bioinformatics

Disadvantages:• Choosing the right value for k is always a major concern.• The size of the graph grows with k.• Having several dBG, with different orders, for the same

dataset is expensive.

10/25/18 4Diego Díaz Domínguez

Page 5: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Encoding a dBG

10/25/18 5Diego Díaz Domínguez

Things we have to do:• Represent the nodes and the edges succinctly• Traverse the graph efficiently• Combine the information of several graph orders• Colour the nodes

Page 6: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Operations in a dBG

10/25/18 6Diego Díaz Domínguez

Navigational operations we are interested in:

• outdegree(v):number of outgoing edges of node v• indegree(v):number of incoming edges of v• outgoing(v,a):follow outgoing edge of v labeled with a• incoming(v,a):follow an incoming node of v whose label

starts with a

Page 7: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Succinct dBG: BOSS

10/26/18 7Diego Díaz Domínguez

C

$$$ATGCTA$$$$$$CTATGC$$$$$$ATGCGT$$$

ATG

TGC

GCTCTA

TATGCG

CGT

k=4

T

A

T

G

G

T

$AT$$A

$$$

$$C $CT

TA$A$$

GT$

T$$

GC$

C$$

A

TG

C

TA

$

$$

$$

$

$$

$

$$$ A1 0

C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0

T2 1$$C T3 1TGC $ 0

G1 0

T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1

BOSSMatrix

EdgeBWT

NodeMarks

C[$] = 0C[A] = 7C[C] = 9

C[G] = 11 C[T] = 13

2

2

Page 8: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

BOSS:outdegree

10/26/18 8Diego Díaz Domínguez

2

2

select(NodeMarks,11)

select(NodeMarks,10)+1

Page 9: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

BOSS: forward

10/26/18 9Diego Díaz Domínguez

$$$ A1 0

C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0

T2 1$$C T3 1TGC $ 0

G1 0

T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1

BOSSMatrix

EdgeBWT

NodeMarks

C

ATG

TGC

GCTCTA

TATGCG

CGT

T

A

T

G

G

T

$AT$$A

$$$

$$C $CT

TA$A$$

GT$

T$$

GC$

C$$

A

TG

C

TA

$

$$

$$

$

$$

$

11

18

C[$]=0C[A]=7C[C]=9C[G]=11C[T]=13

forward(11,T)= 17= C[C] + rank(EdgeBWT,T,15)= 13 + 4 = 17

2

217

11

15

Page 10: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

$$$ A1 0

C1 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T1 1CTA $ 0

T2 1$$C T3 1TGC $ 0

G1 0

T4 1GCG T5 1ATG C2 1$AT G2 1TAT *G2 1$CT A1 1GCT *A1 1CGT $ 1

BOSSMatrix

EdgeBWT

NodeMarks

C

ATG

TGC

GCTCTA

TATGCG

CGT

T

A

T

G

G

T

$AT$$A

$$$

$$C $CT

TA$A$$

GT$

T$$

GC$

C$$

A

TG

C

TA

$

$$

$$

$

$$

$

C[$]=0C[A]=7C[C]=9C[G]=11C[T]=13

a

b

BOSS: indegree

10/26/18 10Diego Díaz Domínguez

select(EdgeBWT,G,3)=b

select(EdgeBWT,G,2)=aindegree(13)= 2= 1 + rank(EdgeBWT,*G,b)–

rank(EdgeBWT,*G,a)= 2

13

2

2

Page 11: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Bidirectional BOSS

10/26/18 11Diego Díaz Domínguez

incoming is more expensive than outgoing because we don’t have accessto the first symbol of the nodes. By doubling the size of the index, we cancompute incoming fast.

C

ATG

TGC

GCTCTA

TATGCG

CGT

T

A

T

G

G

T

$AT$$A

$$$

$$C $CT

TA$A$$

GT$

T$$

GC$

C$$

A

TG

C

TA

$

$$

$$

$

$$

$

$$$ A 0

C 1A$$ $ 1C$$ $ 1T$$ $ 1TA$ $ 1GC$ $ 1GT$ $ 1$$A T 1CTA $ 0

T 1$$C T 1TGC $ 0

G 0

T 1GCG T 1ATG C 1$AT G 1TAT *G 1$CT A 1GCT *A 1CGT $ 1

$$$ $ 1A$$ *$ 1C$$ *$ 1TA$ $ 1TC$ $ 1$$A T 1GTA $ 0

T 1$$C G 1TGC G 1ATC $ 0

G 1$CG T 1GCG *T 1TCG *T 1$TG C 1$$T G 1$AT C 1TAT *C 1CGT A 1

BOSSrBOSS

outgoingselect

incoming(13,G)= 21

13

21

Page 12: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Variable-order BOSS

10/26/18 12Diego Díaz Domínguezk=4 k=2

By augmenting BOSS with the LCP, we can represent all the dBG up to order k

i

i’->LCP[i’]<(k-1)

j->LCP[j]<(k-1)

Page 13: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Encoding EdgeBWT: Wavelet trees

10/25/18 13Diego Díaz Domínguez

Page 14: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Encoding EdgeBWT: RL-BWT

10/25/18 14Diego Díaz Domínguez

Page 15: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Closing remarks

10/26/18 15Diego Díaz Domínguez

• We can compress several dBG and still supporting fast operations

• It is still necessary to propose new algorithms that work ontop of BOSS

• Can we encode the patterns of the graph? (bubbles, tips cross-links)

• Can we combine BOSS with other compression schemes (LZ, grammars) to improve compression even further?

Page 16: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

10/26/18 16Diego Díaz Domínguez

Thanks!

Page 17: Succinct - UCF Computer Sciencecs.ucf.edu/stringbio2018/talks/talk9.pdf · 10/26/18 Diego Díaz Domínguez 11 incoming ismoreexpensivethan outgoing becausewedon’thaveaccess tothefirstsymbolofthenodes.Bydoublingthesizeoftheindex,wecan

Encoding EdgeBWT: RL-BWT

10/25/18 17Diego Díaz Domínguez

1 $1 *$1 *$1 $1 $0 T1 $0 T0 G0 G1 $0 G0 T0 *T0 *T0 C0 G0 C0 *C0 A

0 T0 T0 G0 G0 G0 T1 *T1 *T0 C0 G0 C1 *C0 A

T T

T T

C G

G

G

T

C

G

C

A

D EdgeBWT

M

E* RL-E