ppam presentatie
TRANSCRIPT
Department of Information Technology (INTEC), Ghent University, Belgium
A Parallel, Distributed-Memory Framework
for Comparative Motif Discovery
1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester, 1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier
email: [email protected]
1Department of Information Technology (INTEC), Ghent University, Belgium2Department of Plant Systems Biology, VIB, Ghent, Belgium3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
Background (1)
Department of Information Technology (INTEC), Ghent University, Belgium 2
CCACGTG
gene
promoter
transcription
factor
protein
regulate
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 3
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Gene families
Department of Information Technology (INTEC), Ghent University, Belgium 4
zma
sbi
bdi
osa
DNA sequences
orthologous genes (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012)
TSS
2kbp promoter
gene family 1
TSS
2kbp promoter
gene family 17724
TSS
2kbp promoter
gene family 2
…
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 5
zma
sbi
osa
bdi
23,66%
5,38%26,88%
8,60%
Branch Length Score (BLS) 17.20 %
2kbp promoter
Scoring conservation within a gene family
motif occurrencesMST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 6
zma
sbi
osa
bdi
23,66%
5,38%26,88%
8,60%
Branch Length Score (BLS) 64.52 %
2kbp promoter
Scoring conservation within a gene family
motif occurrencesMST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 7
zma
sbi
osa
bdi
23,66%
5,38%26,88%
8,60%
Branch Length Score (BLS) 100 %
2kbp promoter
Scoring conservation within a gene family
motif occurrencesMST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 8
zma
sbi
osa
bdi
23,66%
5,38%26,88%
8,60%
Branch Length Score (BLS) 0 %
2kbp promoter
Scoring conservation within a gene family
motif occurrence
Genomewide scoring (1)
Department of Information Technology (INTEC), Ghent University, Belgium 9
2kbp promoter
gene family 1
2kbp promoter
gene family 17724
2kbp promoter
gene family 2
…
Genome wide scoring of conservation
BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %
> BLSthres ? > BLSthres ? > BLSthres ?
yes no yes
Fw = # families with BLS > BLSthres
Genomewide scoring (2)
Department of Information Technology (INTEC), Ghent University, Belgium 10
Genome wide scoring of conservation (2)
Fw = # families with BLS > BLSthres for word w
Fbg = median # families with BLS > BLSthres for random permutations of w
C = 1 −FbgFw
confidence score
Retain only motifs with confidence score C > 0.9
Stark et al., Genome Res. 2007
Example
Department of Information Technology (INTEC), Ghent University, Belgium 11
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
>40 % > 50% > 60% > 70% > 80% >90%
CCACGTG 578 545 423 224 135 116
C = 1 −Fbg
Fw
= 1 -201
578= 0.64
1000 random permutations of w
BLS threshold
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 12
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Motif discovery
Department of Information Technology (INTEC), Ghent University, Belgium 13
Main idea: extend previous ideas
to all words that occur in the sequences
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 14
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
GC$2
AT C$2
GT
G
TAT$1
TAT$1$1
C$2
C$2 $1AT G TATGC$2
GC$2$1 $1
$1
$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 15
GC$2
AT C$2
GT
G
TAT$1
TAT$1$1
C$2
C$2 $1AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 16
GC$2
AT C$2
GT
G
TAT$1
TAT$1$1
C$2
C$2 $1AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 17
GC
AT CG
T
G
TAT
TAT
C
C AT G TATGC
GC
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT
S2 = TTATGC
Actual discovery
Department of Information Technology (INTEC), Ghent University, Belgium 18
GC
AT CG
T
G
TAT
TAT
C
C AT G TATGC
GC
word seq1 seq2
A y y
AT y y
ATG y y
ATGC - y
ATGT y -
ATGTA y -
ATGTAT y -
C - y
G y y
GC - y
…
Depth-first walk in the tree
Can be extended to IUPAC alphabet (Sagot, 2001)
Discovery of words
Department of Information Technology (INTEC), Ghent University, Belgium 19
Single gene family, words with length [6 … 12], max 3 deg. chars
0
50
100
150
200
250
300
350
400
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
ACGT ACGT+N IUPAC \ {BDHV} Full IUPAC
Tim
e (
s)
Nu
mb
er
of
wo
rds
Number of words Time
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 20
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Parallel discovery
Department of Information Technology (INTEC), Ghent University, Belgium 21
Intra-family phase
Process 1 Process P
w conserved in
3/4 organisms
BLS score
for w: 73% w’ conserved in
2/4 organisms
BLS score
for w’: 23%
gene family x gene family y
local hash table local hash table...
Communication phase
Department of Information Technology (INTEC), Ghent University, Belgium 22
Communication
phase
...
Core idea: words with the same composition are sent to the same processes
e.g. CACGTG and CCATGG
Process 1 Process P
Statistical evaluation
Department of Information Technology (INTEC), Ghent University, Belgium 23
Inter-family phase
aggregated hash table aggregated hash table
statistical evaluation
…
CACGTG: significant
ACCGGT: not significant
TGGCAC: not significant
…
statistical evaluation
…
TGANTGA: significant
GTATAGN: not significant
GANATTG: not significant
…
...
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 24
GC
AT CG
T
G
TAT
TAT
C
C AT G TATGC
GC
class Mapper
method Map(familyID id, family F)
GST = createGeneralizedSuffixTree(F);
for all word w ∊ GST do
BLS = computeBLS(w)
f = computeFrequencyVector(BLS)
Emit(sort(w), pair(w, f))
w = CCACGTG
BLS = 62%
f = [ 1 | 1 | 1 | 0 | 0 | 0 ]
Emit(ACCCGGT, <CCACGTG, [111000]>)
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 25
class Reducer
method Reduce(word s, pairs[<w1,f1>, <w2,f2>, …])
H = new AssociativeArray
for all <w,f> ∊ [<w1,f1>, <w2,f2>, …] do
H{w} = H{w} + f
BGM = computeBackgroundModel(H)
for all w ∊ H
C = computeConfidence(w, H{w}, BGM)
if (C > 0.9) Emit(w, H{w})
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
H C = 0.75
C = 0.60
C = 0.91
C = 0.93
C = 0.75
Results
Dataset description
4 organisms (zma, sbi, osa, bdi)
17639 gene families
Alphabet: IUPAC \ {BDHV} alphabet with length [6 ... 12]
maximum of 3 degenerate characters
20 instances on AWS (19 workers type m1.xlarge)
Map time: 20h 50 min
Reduce time: 12h 38 min
Cost price of 421.6$
Number of <k, v> emitted by mappers: 531 835 669 220
Map output bytes = 13 827 727 399 720
26 bytes per kv
Output materialized: 4 019 787 705 377 = 3.65 TByte
Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)
Number of <k, v> emitted by reducers: 620 294 857Department of Information Technology (INTEC), Ghent University, Belgium 26
Where do we go from here?
Output data is still big (~ 50 Gbyte of data)
Redundancy present in output
Clustering algorithms
Mapping words back to the promoter sequences
Filtering output is easy
Subsets with higher confidence scores
Subsets with specific BLS scores
Subsets which match specific motif
– Worked example: KN1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 27
We need further post-processing
Knotted 1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 28
Motif BLS min Confidence PWM score # target predictedCHiP-seq
overlap
TGACTGACTGAC 15 100 13.94 5 3
TGACKGACTGAC 15 100 13.69 5 3
TGACKGAYTGAC 15 100 13.69 5 3
TGACTGACKGAC 15 100 13.45 5 3
TGACTGAYKGAC 15 100 13.45 5 3
TGACTGACWGAC 15 100 13.25 5 3
TGACTGAYWGAC 15 100 13.25 5 3
TGACKGACKGAC 15 100 13.20 6 3
TGACKGACWGAC 15 100 13.00 5 3
TGACWGACTGAC 15 100 12.33 7 4
TGACWGAYTGAC 15 100 12.33 8 4
… … … … … …
Bolduc et al. 2011
Conclusion
We developed…
A Parallel Framework for Motif Discovery
Based on Phylogenetic Footprinting
MPI implementation (memory issues)
MapReduce implementation (scales well)
Key advantages…
Runs in the cloud
Exhaustively explores search space (IUPAC)
Alignment-free (unique feature)
Flexible framework
Future work…
Development of post-processing tools
Likely also cloud-based
Department of Information Technology (INTEC), Ghent University, Belgium 29
Department of Information Technology – Internet Based Communication Networks and Services (IBCN)
Questions ?
Jan Fostier
Department of Information Technology (INTEC)
Ghent University