comparativemotiffinding

Exact comparative motif discovery in Monocots. Dieter De Witte, Department Intec-IBCN, Ghent University. March 26 2012, Ghent.

Upload: dieter-de-witte

Post on 08-Aug-2015

Category:

Data & Analytics


TRANSCRIPT

Page 1: ComparativeMotifFinding

Exact comparative motif discovery in Monocots

Dieter De Witte

Why motif discovery?

How to detect sequence motifs

Representation

Different algorithmic approaches

Dataset

Branch Length Speller Algorithm

Distributed motif discovery

Validation and Publication

Conclusion and Future research

Exact comparative motif discovery in Monocots

Dieter De Witte

Department Intec-IBCN, Ghent University

March 26 2012, Ghent

Page 2: ComparativeMotifFinding

Outline

1 Why motif discovery?

2 How to detect sequence motifs: Representation; Different algorithmic approaches

3 Dataset

4 Branch Length Speller Algorithm

5 Distributed motif discovery

6 Validation and Publication

7 Conclusion and Future research

Page 3: ComparativeMotifFinding

How do cells access their hereditary information?

• Proteins take part in virtually all chemical processes in the cell

• The information to build proteins is stored in the coding DNA, or the genes

• Gene expression: $\text{Gene} \underbrace{\mapsto}_{\text{transcription}} \text{RNA} \underbrace{\mapsto}_{\text{translation}} \text{Protein}$

• RNA polymerase reads the coding DNA and produces a complementary RNA strand

• General transcription factors help with positioning RNA polymerase

• Repressors/silencers slow down or block RNA polymerase's progress during transcription

• Activators encourage gene expression by increasing the attraction of RNA polymerase

Page 4: ComparativeMotifFinding

The presence of binding sites influences expression

• Transcription factors bind to transcription factor binding sites close to the gene (example: TATA-box)

• Other factors bind to regulatory sites, which can be over 1000 bp away from the gene

Page 6: ComparativeMotifFinding

Binding site identification

Sequence elements that have a biological function are under selective pressure

Regulatory sites are:

o 6-15 bp long

o not exact (there are biological examples where only 2 of 6 bp were conserved among all binding sites)

⇒ FLEXIBLE MOTIF MODEL REQUIRED!

Page 7: ComparativeMotifFinding

Different Motif Models

3 Motif models

• Mismatch String (MM)

(GACAGG, e = 2)

• IUPAC Degenerate String

• Position Weight Matrix (PWM)

Sequences

seq1 AGATACGACC
seq2 ACGGACAGGC
seq3 ACGAGATAGG
seq4 AGATACGGGG
seq5 GACAGGGTAC
seq6 ACGATAGGAC

     1    2    3     4    5     6
A    0    1    0     1    0     0
C    0    0    0.66  0    0.33  0
G    1    0    0     0    0.66  1
T    0    0    0.33  0    0     0
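To make the difference between the three models concrete, here is a minimal Python sketch. The function names are illustrative and not taken from the talk, and the IUPAC string GAYASG is simply the consensus read off the matrix above (Y = C/T, S = C/G).

```python
# Sketch of the three motif models; all function names are illustrative.

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_mismatch(site, motif, e):
    """Mismatch model: the site may differ from the motif in at most e positions."""
    return len(site) == len(motif) and sum(a != b for a, b in zip(site, motif)) <= e

def matches_iupac(site, motif):
    """IUPAC model: each base must belong to the set named by the motif symbol."""
    return len(site) == len(motif) and all(b in IUPAC[m] for b, m in zip(site, motif))

def pwm_score(site, pwm):
    """PWM model: product of the per-position base frequencies."""
    score = 1.0
    for i, base in enumerate(site):
        score *= pwm[base][i]
    return score

# The matrix from the slide, one row per base, one column per position.
PWM = {
    "A": [0, 1, 0,    1, 0,    0],
    "C": [0, 0, 0.66, 0, 0.33, 0],
    "G": [1, 0, 0,    0, 0.66, 1],
    "T": [0, 0, 0.33, 0, 0,    0],
}
```

The mismatch and IUPAC models give a yes/no answer per site, whereas the PWM gives a graded score that still needs a cut-off.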

Page 8: ComparativeMotifFinding

Assessment of different models

3 Motif models

• MM

• IUPAC

• PWM

⇒ Our choice: IUPAC

Assessment by Tompa

Page 9: ComparativeMotifFinding

Which algorithms?

Random algorithms

• Expectation Maximization strategies

• Gibbs Sampling methods

• Drawbacks: Methods are only guaranteed to provide a locally optimal solution!

Combinatorial algorithms

• Exhaustive methods based on indexing

• Graph methods based on finding cliques in t-partite graphs

• Drawbacks: Computationally demanding, a lot of candidate motifs

Page 10: ComparativeMotifFinding

Algorithm assessment

Page 11: ComparativeMotifFinding

Which dataset?

ORIGINAL APPROACH (<2005):

• Datasets were promoter sequences of coregulated genes

• Drawback: Difficult to get a clean dataset! (coexpression doesn't imply coregulation!)

PHYLOGENETIC FOOTPRINTING APPROACH (>2003):

• More genomes available ⇒ study orthologous genes in different organisms!

• Phylogenetic footprinting: Add phylogenetic information about the sequences to motif discovery algorithms

Page 12: ComparativeMotifFinding

Multiple Alignment

• Current algorithms for phylogenetic footprinting are adaptations of Gibbs sampling or EM algorithms.

• Most of them rely on multiple alignments

• Are short degenerate binding sites (also on the reverse strand!) always alignable?

⇒ Important motifs are missed!

Page 13: ComparativeMotifFinding

How to quantify phylogenetic relationships?

• BLS score of a motif = weight of the minimal spanning tree containing all species in which the motif occurs

• One tree per family, due to the presence of paralogs

⇒ Method is robust against missing data
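As a rough illustration of the score, the following Python sketch sums the branch lengths of the smallest subtree connecting the species in which a motif occurs. The tree encoding, the node names and all branch lengths are made-up assumptions for the example; they are not the values used in the talk.

```python
# child -> (parent, branch length). Topology and lengths are hypothetical.
TREE = {
    "zma": ("panicoid", 0.2),    # Zea mays
    "sbi": ("panicoid", 0.1),    # Sorghum bicolor
    "osa": ("bep", 0.3),         # Oryza sativa
    "bdi": ("bep", 0.25),        # Brachypodium distachyon
    "panicoid": ("root", 0.15),
    "bep": ("root", 0.05),
}

def edges_to_root(leaf, tree):
    """Edges on the path from a leaf to the root, each identified by its child node."""
    edges = set()
    node = leaf
    while node in tree:
        edges.add(node)
        node = tree[node][0]
    return edges

def branch_length_score(species, tree):
    """Total branch length of the minimal subtree connecting the given species."""
    paths = [edges_to_root(s, tree) for s in species]
    if not paths:
        return 0.0
    union = set().union(*paths)
    # Edges shared by every path lie above the common ancestor and are excluded.
    shared = set.intersection(*paths)
    return sum(tree[child][1] for child in union - shared)
```

With these example lengths, a motif found only in maize and sorghum scores 0.3, while a motif found in all four species scores the full tree length of 1.05. A species missing from the occurrence set only lowers the score, which is one way to read the robustness claim above.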

Page 14: ComparativeMotifFinding

Nonparametric statistics based on control motifs

• We validate (motif, BLS) pairs instead of motifs

• Statistical validation is more precise and conservative

• For every motif we generate 5-20 control motifs

• Control motifs are permutations of the original motif at maximum edit distance (to avoid large overlap)

• Motif confidence for a certain BLS score: $\frac{\#occ - \#FP}{\#occ}$

• #FP = the median of the number of control motif occurrences

• Results in a BLS-Confidence graph per motif
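The confidence formula can be written down directly. This is a minimal sketch; the function name and the example counts are hypothetical, and the handling of a motif with zero occurrences is an assumption.

```python
from statistics import median

def motif_confidence(n_occ, control_counts):
    """(#occ - #FP) / #occ, where #FP is the median control-motif count."""
    if n_occ == 0:
        return 0.0  # assumption: a motif without occurrences gets no confidence
    n_fp = median(control_counts)
    return (n_occ - n_fp) / n_occ

# Hypothetical example: 40 conserved occurrences at some BLS threshold,
# five control motifs with 8 to 12 occurrences each (median 10).
confidence = motif_confidence(40, [8, 12, 10, 9, 11])
```

Repeating this calculation for each BLS threshold gives the points of the BLS-Confidence graph for one motif.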

Page 15: ComparativeMotifFinding

BLS-Confidence Graph

Page 16: ComparativeMotifFinding

Wrap up

OUR METHOD IS:

• Exhaustive and therefore guaranteed to return the optimal solution

• Uses the best motif model available

• Alignment-free

• Better defined due to statistical evaluation of (motif, conservation) pairs

• Robust: can handle imperfect data

Page 18: ComparativeMotifFinding

Monocots Dataset: Some numbers

• 17724 gene families

• 4 organisms: Zea mays, Sorghum bicolor, Oryza sativa, Brachypodium distachyon

• On average one paralog per tree: 5 promoter sequences per family

• Investigate transcription factor binding sites with promoters of 500 bp length

• Investigate other regulatory sites with promoters of 2000 bp length

Page 19: ComparativeMotifFinding

Monocots: Which organisms?

Zea mays

• Regular maize, very important for European agriculture.

Sorghum bicolor

• Known in Dutch as Kafferkoren; it is also a grass species. Its edible grains are mainly used in fodder.

Page 20: ComparativeMotifFinding

Monocots: Which organisms?

Oryza sativa

• Better known as Asian rice. It is also the cereal with the smallest genome.

Brachypodium distachyon

• Not of specific agricultural interest. Serves as a model organism for other grass species (small genome)

Page 22: ComparativeMotifFinding

Text indexing with generalized suffix trees

Fragment of a generalized suffix tree:

seq1: · · · ACGACG · · ·
seq2: · · · TATATG · · ·
seq3: · · · TATAGG · · ·
seq4: · · · TATACG · · ·

Page 23: ComparativeMotifFinding

Text indexing with generalized suffix trees

Features of (generalized) suffix tree:

• Tree construction requires O(Nn log Nn) time

• Sequence information per node: O(1) quorum queries!

• Any pattern can be found in O(s|P|) time

• Memory requirements are O(Nn log Nn)

• Memory (per char): tree ≈ 42 b, SeqIds ≈ 36 b

• N: Number of genes in one family

• n: Length of promoter sequences

• s: Motif degeneracy (number of exact patterns matching the motif)

• q: quorum is the number of sequences a pattern occurs in

• Alternative structures: Suffix Trie (Sergio), Suffix Array (Dries)
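To show how storing sequence information per node gives cheap quorum queries, here is a small Python sketch. It deliberately uses a depth-limited suffix trie rather than a true (compressed) generalized suffix tree, so it does not have the construction or memory bounds listed above; the class and function names are illustrative.

```python
class Node:
    """One trie node: outgoing edges plus the ids of sequences passing through it."""
    def __init__(self):
        self.children = {}
        self.seq_ids = set()

def build_trie(sequences, max_depth):
    """Insert every suffix of every sequence, truncated to max_depth characters."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for base in seq[start:start + max_depth]:
                node = node.children.setdefault(base, Node())
                node.seq_ids.add(seq_id)
    return root

def quorum(root, pattern):
    """Number of sequences containing the exact pattern (len(pattern) <= max_depth)."""
    node = root
    for base in pattern:
        node = node.children.get(base)
        if node is None:
            return 0
    return len(node.seq_ids)

# The four fragments shown on the suffix tree slide.
trie = build_trie(["ACGACG", "TATATG", "TATAGG", "TATACG"], max_depth=6)
```

Once the walk for a pattern ends at a node, the quorum is just the size of the stored set, which is the idea behind the constant-time quorum query.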

Page 24: ComparativeMotifFinding

Architecture

1 big distributed tree

• Could allow for minimal motif discovery (tested but ineffective)

• Distribution based on degenerate prefixes ⇒ total tree requires cp · F × O(Nn)

1 tree per gene family

• Simple map between phylogenetic tree and suffix tree

• Suffix tree can be removed after discovery

• Perfect memory scalability: total tree requires F × O(Nn) space

• This is our choice since it anticipates all suffix tree memory issues!

Page 25: ComparativeMotifFinding

Motif Discovery Algorithm

• The algorithm traverses a virtual lexicographic motif tree

• In this tree all (degenerate) motifs are spelled in a depth-first way

• The generalized suffix tree is used to branch and bound the spelling operation

• A subtree of the lexicographic tree is not explored if its prefix doesn't meet the quorum condition

• Quorum condition: The motif has to occur in at least q sequences
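The branch-and-bound idea can be sketched in a few lines of Python. Two simplifications are assumptions of this sketch rather than the talk's implementation: motifs are matched with regular expressions instead of a generalized suffix tree, and the alphabet is restricted to the four bases plus the don't-care symbol N with a cap `max_n` on the number of N's.

```python
import re

ALPHABET = "ACGTN"  # exact bases plus the don't-care symbol N

def count_sequences(motif, sequences):
    """Number of sequences with at least one match (N matches any base)."""
    regex = re.compile(motif.replace("N", "[ACGT]"))
    return sum(1 for seq in sequences if regex.search(seq))

def spell(sequences, q, k_min, k_max, max_n=1, prefix=""):
    """Depth-first enumeration of motifs meeting quorum q, pruning failed prefixes."""
    found = []
    for symbol in ALPHABET:
        motif = prefix + symbol
        if motif.count("N") > max_n:
            continue
        if count_sequences(motif, sequences) < q:
            continue  # bound: no extension of this prefix can meet the quorum
        if len(motif) >= k_min:
            found.append(motif)
        if len(motif) < k_max:
            found.extend(spell(sequences, q, k_min, k_max, max_n, motif))
    return found

fragments = ["ACGACG", "TATATG", "TATAGG", "TATACG"]
motifs = spell(fragments, q=3, k_min=6, k_max=6)
```

Pruning is safe because every occurrence of a longer motif contains an occurrence of its prefix, so the quorum can only shrink as the motif grows. On the four example fragments the only length-6 motif with quorum 3 and at most one N is TATANG.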

Page 26: ComparativeMotifFinding

Motif Discovery Algorithm

• TATA(!A)G is a motif with degeneracy s = 3

• It matches 3 sites in the suffix tree

• The number of sequences it occurs in is calculated as the union of the sequenceID sets: q = 3
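The example can be reproduced with a short Python sketch. It assumes that "(!A)" corresponds to the IUPAC symbol B (C, G or T) and uses only the fragments visible on the suffix tree slide; the helper names are illustrative, and plain substring search stands in for the suffix tree lookup.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(motif):
    """All exact patterns matched by a degenerate motif; their number is s."""
    return ["".join(p) for p in product(*(IUPAC[m] for m in motif))]

def sequence_ids(motif, sequences):
    """Union of the sets of sequences containing each exact pattern."""
    ids = set()
    for pattern in expand(motif):
        ids |= {name for name, seq in sequences.items() if pattern in seq}
    return ids

fragments = {"seq1": "ACGACG", "seq2": "TATATG", "seq3": "TATAGG", "seq4": "TATACG"}
patterns = expand("TATABG")
occurs_in = sequence_ids("TATABG", fragments)
```

Here `patterns` holds the three exact strings TATACG, TATAGG and TATATG (s = 3), and `occurs_in` contains seq2, seq3 and seq4, so q = 3.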

Page 27: ComparativeMotifFinding

Complexity

• No tight bounds for the algorithm's complexity in terms of s

• The motif discovery frequency is rather stable per degeneracy

⇒ The number of motifs M is a good indication for the processing time per gene family!

Page 28: ComparativeMotifFinding

Number of motifs in IUPAC alphabet

Page 29: ComparativeMotifFinding

Number of motifs in Don't Care alphabet

Page 30: ComparativeMotifFinding

Quorum dependence of the number of motifs

Page 31: ComparativeMotifFinding

From quorum to BLS

• Simple Mapping from quorum to BLS score

• BLS cutoff ↦ quorum constraint

• BLS score calculated with minimal spanning tree algorithm

Page 32: ComparativeMotifFinding

From BLS to BLS Frequency vector

Interfamily merging via BLS Frequency vectors:
⇒ counts the number of times a (motif, BLS) pair occurs

RAM dominated by motif frequency hash map:

• Motif: encoded in an 8-byte integer

• Degeneracy: 4 bytes

• Frequency vector: 10 to 20 bytes (depends on resolution)

• Google Sparse Hash: overhead negligible (2 bits)
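A possible shape for this hash map is sketched below in Python. Several details are assumptions made for the example and are not stated on the slide: the 4-bits-per-symbol packing, the list of BLS thresholds, and the choice to count a family at every threshold its BLS reaches. A plain `defaultdict` stands in for the sparse hash map.

```python
from collections import defaultdict

THRESHOLDS = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical BLS cut-offs
SYMBOLS = "ACGTRYSWKMBDHVN"             # 15 IUPAC symbols; code 0 means "empty"

def encode(motif):
    """Pack a motif of up to 16 symbols into one integer, 4 bits per symbol."""
    if len(motif) > 16:
        raise ValueError("motif does not fit in 64 bits")
    value = 0
    for symbol in motif:
        value = (value << 4) | (SYMBOLS.index(symbol) + 1)
    return value

def add_family(freq, family_scores):
    """Merge one family's motif -> BLS results into the global frequency map."""
    for motif, bls in family_scores.items():
        vector = freq[encode(motif)]
        for i, threshold in enumerate(THRESHOLDS):
            if bls >= threshold:
                vector[i] += 1

freq = defaultdict(lambda: [0] * len(THRESHOLDS))
add_family(freq, {"TATABG": 0.6})
add_family(freq, {"TATABG": 0.35, "ACGT": 0.95})
```

After the two calls, the vector for TATABG is [2, 2, 1, 0, 0]: both families conserve it at the two lowest thresholds, and only one at 0.5. With 4 bits per symbol a motif of length 15, the longest considered in the time estimates, fits in a single 8-byte key.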

Page 34: ComparativeMotifFinding

Intramerging

Joining motif maps: s ↗ ⇒ overlap → 0!

Page 35: ComparativeMotifFinding

Permutation group partitioning

• Scalable!

• Memory limitations due to large motif map

⇒ Restricted Don't Care alphabet: #N: 0 ↦ 4-5

⇒ IUPAC alphabet, high BLS only

Page 36: ComparativeMotifFinding

Time estimates, 100 families per process

Time estimates for Don't Care alphabet

s      Time (k = 6-10)   Time (k = 6-15)
1      1 min             1 min
4      4.5 min           6 min
16     15 min            30 min
64     1h                1h45
256    1h20              5h
1024   2h                10h45

Page 38: ComparativeMotifFinding

Validation possibilities

• Transfac database: Rice motifs

• Jaspar database: Zea mays motifs

• Paper by Wang (2009): Discovering cis-elements between sorghum and rice

Page 39: ComparativeMotifFinding

Transfac Motifs: Matching with our results

• Motifs are represented as consensus strings (ACGT alphabet)

• Levenshtein distance is a bad measure for finding the best match with our results

• Better measure: Maximum motif overlap

• Degenerate symbols are considered matching

• Further optimizing: Minimizing length difference d and degeneracy #N

⇒ Simple modification of the Manhattan algorithm
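The overlap measure can be illustrated with a simplified Python sketch. It slides the discovered motif along the database consensus without gaps and counts compatible positions, which is a simpler stand-in for the dynamic-programming approach named on the slide, not a reimplementation of it. The function name is illustrative, and the tie-breaking on d and #N is left out.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def max_overlap(consensus, motif):
    """Best number of compatible positions over all ungapped alignments.

    consensus: database motif in the ACGT alphabet.
    motif: discovered motif in the IUPAC alphabet; a degenerate symbol
    counts as a match when it includes the consensus base.
    """
    best = 0
    for offset in range(-(len(motif) - 1), len(consensus)):
        score = 0
        for j, symbol in enumerate(motif):
            i = offset + j
            if 0 <= i < len(consensus) and consensus[i] in IUPAC[symbol]:
                score += 1
        best = max(best, score)
    return best
```

For example, `max_overlap("GGTATACG", "TATABG")` is 6: the discovered motif lines up fully inside the longer consensus, a case that an edit distance would penalise because of the two extra bases.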

Page 40: ComparativeMotifFinding

Transfac Motifs: Early results

Simulation with k ≤ 10 ↦ ≈ 50000 motifs with confidence ≥ 75%

3 classes: (total 49)

• Short motifs (11): found as substrings of longer motifs (d ≈ 2)

• Perfect matches (7)

• Quasi-perfect matches (16): (d ≤ 1, #N ≤ 2)

• Long motifs (15): k = 10 overlapping motifs (#N ≈ 1)

Page 41: ComparativeMotifFinding

Representation of results

• Tables with motifs vs biological function

• Most interesting motifs: Confidence-BLS graphs

• Venn diagrams to compare our results with biological motifs in databases

• Number of conserved motifs as a function of confidence (one curve per BLS threshold)

• Scatterplot for database motifs: Number of motif occurrences (grep) versus number of BLS-conserved occurrences

• Number of conserved motifs in Monocots dataset versus number of conserved motifs in shuffled dataset

• Compare with alignment data!

Page 43: ComparativeMotifFinding

Timeline

• June: Paper about BLS method on Monocot dataset

o Comparison with known motifs: Transfac Rice motifs, Jaspar Zea mays motifs, Rice-Sorghum motifs

o Sensitivity analysis (1): Effect of phylogenetic tree (shuffle sequences between different trees)

o Sensitivity analysis (2): False Discovery Rate (shuffle bps in sequences)

• September/October: Paper on parallelization strategy for comparative motif discovery, together with release of source code

o Focus on architecture: one tree versus many trees

o Scalability and Parallel Efficiency

o Add features concerning motif database matching within BLS context

• Divide and conquer motif discovery: still needs testing/only partially implemented

Page 44: ComparativeMotifFinding

Suggestions for further research

• More advanced BLS measures

• Incorporating coregulation

• Adding positional priors (ChIP-seq data)

• Motif combinations

• Gene clustering based on transcription factor matching

• Adding Epigenetics

• Design advanced motif models and crack the Sandve Benchmark dataset

Page 45: ComparativeMotifFinding

Conclusions

• We have reintroduced combinatorial motif discovery in a comparative context

• Low BLS motifs will be found with the Don't Care alphabet

• High BLS motifs can be analyzed with the full IUPAC alphabet

• Initial results show that we are able to reproduce biological results

• This implies the algorithm can predict new regulatory sites!

• Algorithm is distributed using MPI and able to handle extremely large datasets

Page 46: ComparativeMotifFinding

Questions?

Are there any questions?