ppam presentatie

30
Department of Information Technology (INTEC), Ghent University, Belgium A Parallel, Distributed-Memory Framework for Comparative Motif Discovery 1 Dieter De Witte, 2,3 Michiel Van Bel, 1 Pieter Audenaert, 1 Piet Demeester, 1 Bart Dhoedt, 2,3 Klaas Vandepoele, and 1 Jan Fostier email: [email protected] 1 Department of Information Technology (INTEC), Ghent University, Belgium 2 Department of Plant Systems Biology, VIB, Ghent, Belgium 3 Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium

Upload: dieter-de-witte

Post on 18-Jul-2015

35 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Department of Information Technology (INTEC), Ghent University, Belgium

A Parallel, Distributed-Memory Framework

for Comparative Motif Discovery

1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester, 1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier

email: [email protected]

1Department of Information Technology (INTEC), Ghent University, Belgium2Department of Plant Systems Biology, VIB, Ghent, Belgium3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium

Background (1)

Department of Information Technology (INTEC), Ghent University, Belgium 2

CCACGTG

gene

promoter

transcription

factor

protein

regulate

Outline

Department of Information Technology (INTEC), Ghent University, Belgium 3

1) motif

target

prediction

2) motif

discovery

algorithm

3) parallel

framework

Parallel, comparative motif discovery framework

Gene families

Department of Information Technology (INTEC), Ghent University, Belgium 4

zma

sbi

bdi

osa

DNA sequences

orthologous genes (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012)

TSS

2kbp promoter

gene family 1

TSS

2kbp promoter

gene family 17724

TSS

2kbp promoter

gene family 2

Branch Length Score (BLS)

Department of Information Technology (INTEC), Ghent University, Belgium 5

zma

sbi

osa

bdi

23,66%

5,38%26,88%

8,60%

Branch Length Score (BLS) 17.20 %

2kbp promoter

Scoring conservation within a gene family

motif occurrencesMST

Branch Length Score (BLS)

Department of Information Technology (INTEC), Ghent University, Belgium 6

zma

sbi

osa

bdi

23,66%

5,38%26,88%

8,60%

Branch Length Score (BLS) 64.52 %

2kbp promoter

Scoring conservation within a gene family

motif occurrencesMST

Branch Length Score (BLS)

Department of Information Technology (INTEC), Ghent University, Belgium 7

zma

sbi

osa

bdi

23,66%

5,38%26,88%

8,60%

Branch Length Score (BLS) 100 %

2kbp promoter

Scoring conservation within a gene family

motif occurrencesMST

Branch Length Score (BLS)

Department of Information Technology (INTEC), Ghent University, Belgium 8

zma

sbi

osa

bdi

23,66%

5,38%26,88%

8,60%

Branch Length Score (BLS) 0 %

2kbp promoter

Scoring conservation within a gene family

motif occurrence

Genomewide scoring (1)

Department of Information Technology (INTEC), Ghent University, Belgium 9

2kbp promoter

gene family 1

2kbp promoter

gene family 17724

2kbp promoter

gene family 2

Genome wide scoring of conservation

BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %

> BLSthres ? > BLSthres ? > BLSthres ?

yes no yes

Fw = # families with BLS > BLSthres

Genomewide scoring (2)

Department of Information Technology (INTEC), Ghent University, Belgium 10

Genome wide scoring of conservation (2)

Fw = # families with BLS > BLSthres for word w

Fbg = median # families with BLS > BLSthres for random permutations of w

C = 1 −FbgFw

confidence score

Retain only motifs with confidence score C > 0.9

Stark et al., Genome Res. 2007

Example

Department of Information Technology (INTEC), Ghent University, Belgium 11

ACCCGGT 201 195 139 46 10 6

TCCAGCG 158 100 95 28 13 12

CGATCCG 204 191 185 100 46 11

CGCTGAC 246 203 145 39 23 12

GTGCACC 0 0 0 0 0 0

>40 % > 50% > 60% > 70% > 80% >90%

CCACGTG 578 545 423 224 135 116

C = 1 −Fbg

Fw

= 1 -201

578= 0.64

1000 random permutations of w

BLS threshold

Outline

Department of Information Technology (INTEC), Ghent University, Belgium 12

1) motif

target

prediction

2) motif

discovery

algorithm

3) parallel

framework

Parallel, comparative motif discovery framework

Motif discovery

Department of Information Technology (INTEC), Ghent University, Belgium 13

Main idea: extend previous ideas

to all words that occur in the sequences

Enumerating all words

Department of Information Technology (INTEC), Ghent University, Belgium 14

Generalized suffix tree (GST)

Example sequences:

S1 = ATGTAT$1

S2 = TTATGC$2

GC$2

AT C$2

GT

G

TAT$1

TAT$1$1

C$2

C$2 $1AT G TATGC$2

GC$2$1 $1

$1

$2

Enumerating all words

Department of Information Technology (INTEC), Ghent University, Belgium 15

GC$2

AT C$2

GT

G

TAT$1

TAT$1$1

C$2

C$2 $1AT G TATGC$2

GC$2$1 $1

$1

$2

Generalized suffix tree (GST)

Example sequences:

S1 = ATGTAT$1

S2 = TTATGC$2

Enumerating all words

Department of Information Technology (INTEC), Ghent University, Belgium 16

GC$2

AT C$2

GT

G

TAT$1

TAT$1$1

C$2

C$2 $1AT G TATGC$2

GC$2$1 $1

$1

$2

Generalized suffix tree (GST)

Example sequences:

S1 = ATGTAT$1

S2 = TTATGC$2

Enumerating all words

Department of Information Technology (INTEC), Ghent University, Belgium 17

GC

AT CG

T

G

TAT

TAT

C

C AT G TATGC

GC

Generalized suffix tree (GST)

Example sequences:

S1 = ATGTAT

S2 = TTATGC

Actual discovery

Department of Information Technology (INTEC), Ghent University, Belgium 18

GC

AT CG

T

G

TAT

TAT

C

C AT G TATGC

GC

word seq1 seq2

A y y

AT y y

ATG y y

ATGC - y

ATGT y -

ATGTA y -

ATGTAT y -

C - y

G y y

GC - y

Depth-first walk in the tree

Can be extended to IUPAC alphabet (Sagot, 2001)

Discovery of words

Department of Information Technology (INTEC), Ghent University, Belgium 19

Single gene family, words with length [6 … 12], max 3 deg. chars

0

50

100

150

200

250

300

350

400

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

ACGT ACGT+N IUPAC \ {BDHV} Full IUPAC

Tim

e (

s)

Nu

mb

er

of

wo

rds

Number of words Time

Outline

Department of Information Technology (INTEC), Ghent University, Belgium 20

1) motif

target

prediction

2) motif

discovery

algorithm

3) parallel

framework

Parallel, comparative motif discovery framework

Parallel discovery

Department of Information Technology (INTEC), Ghent University, Belgium 21

Intra-family phase

Process 1 Process P

w conserved in

3/4 organisms

BLS score

for w: 73% w’ conserved in

2/4 organisms

BLS score

for w’: 23%

gene family x gene family y

local hash table local hash table...

Communication phase

Department of Information Technology (INTEC), Ghent University, Belgium 22

Communication

phase

...

Core idea: words with the same composition are sent to the same processes

e.g. CACGTG and CCATGG

Process 1 Process P

Statistical evaluation

Department of Information Technology (INTEC), Ghent University, Belgium 23

Inter-family phase

aggregated hash table aggregated hash table

statistical evaluation

CACGTG: significant

ACCGGT: not significant

TGGCAC: not significant

statistical evaluation

TGANTGA: significant

GTATAGN: not significant

GANATTG: not significant

...

MapReduce pseudocode

Department of Information Technology (INTEC), Ghent University, Belgium 24

GC

AT CG

T

G

TAT

TAT

C

C AT G TATGC

GC

class Mapper

method Map(familyID id, family F)

GST = createGeneralizedSuffixTree(F);

for all word w ∊ GST do

BLS = computeBLS(w)

f = computeFrequencyVector(BLS)

Emit(sort(w), pair(w, f))

w = CCACGTG

BLS = 62%

f = [ 1 | 1 | 1 | 0 | 0 | 0 ]

Emit(ACCCGGT, <CCACGTG, [111000]>)

MapReduce pseudocode

Department of Information Technology (INTEC), Ghent University, Belgium 25

class Reducer

method Reduce(word s, pairs[<w1,f1>, <w2,f2>, …])

H = new AssociativeArray

for all <w,f> ∊ [<w1,f1>, <w2,f2>, …] do

H{w} = H{w} + f

BGM = computeBackgroundModel(H)

for all w ∊ H

C = computeConfidence(w, H{w}, BGM)

if (C > 0.9) Emit(w, H{w})

ACCCGGT 201 195 139 46 10 6

TCCAGCG 158 100 95 28 13 12

CGATCCG 204 191 185 100 46 11

CGCTGAC 246 203 145 39 23 12

GTGCACC 0 0 0 0 0 0

H C = 0.75

C = 0.60

C = 0.91

C = 0.93

C = 0.75

Results

Dataset description

4 organisms (zma, sbi, osa, bdi)

17639 gene families

Alphabet: IUPAC \ {BDHV} alphabet with length [6 ... 12]

maximum of 3 degenerate characters

20 instances on AWS (19 workers type m1.xlarge)

Map time: 20h 50 min

Reduce time: 12h 38 min

Cost price of 421.6$

Number of <k, v> emitted by mappers: 531 835 669 220

Map output bytes = 13 827 727 399 720

26 bytes per kv

Output materialized: 4 019 787 705 377 = 3.65 TByte

Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)

Number of <k, v> emitted by reducers: 620 294 857Department of Information Technology (INTEC), Ghent University, Belgium 26

Where do we go from here?

Output data is still big (~ 50 Gbyte of data)

Redundancy present in output

Clustering algorithms

Mapping words back to the promoter sequences

Filtering output is easy

Subsets with higher confidence scores

Subsets with specific BLS scores

Subsets which match specific motif

– Worked example: KN1 motif

Department of Information Technology (INTEC), Ghent University, Belgium 27

We need further post-processing

Knotted 1 motif

Department of Information Technology (INTEC), Ghent University, Belgium 28

Motif BLS min Confidence PWM score # target predictedCHiP-seq

overlap

TGACTGACTGAC 15 100 13.94 5 3

TGACKGACTGAC 15 100 13.69 5 3

TGACKGAYTGAC 15 100 13.69 5 3

TGACTGACKGAC 15 100 13.45 5 3

TGACTGAYKGAC 15 100 13.45 5 3

TGACTGACWGAC 15 100 13.25 5 3

TGACTGAYWGAC 15 100 13.25 5 3

TGACKGACKGAC 15 100 13.20 6 3

TGACKGACWGAC 15 100 13.00 5 3

TGACWGACTGAC 15 100 12.33 7 4

TGACWGAYTGAC 15 100 12.33 8 4

… … … … … …

Bolduc et al. 2011

Conclusion

We developed…

A Parallel Framework for Motif Discovery

Based on Phylogenetic Footprinting

MPI implementation (memory issues)

MapReduce implementation (scales well)

Key advantages…

Runs in the cloud

Exhaustively explores search space (IUPAC)

Alignment-free (unique feature)

Flexible framework

Future work…

Development of post-processing tools

Likely also cloud-based

Department of Information Technology (INTEC), Ghent University, Belgium 29

Department of Information Technology – Internet Based Communication Networks and Services (IBCN)

Questions ?

Jan Fostier

[email protected]

Department of Information Technology (INTEC)

Ghent University