Comparative motif finding

Exact comparative motif discovery in Monocots

Dieter De Witte

Why motif discovery?

How to detect sequence motifs

Representation

Different algorithmic approaches

Dataset

Branch Length Speller Algorithm

Distributed motif discovery

Validation and Publication

Conclusion and Future research

Exact comparative motif discovery in Monocots

Dieter De Witte

Department Intec-IBCN, Ghent University

March 26 2012, Ghent

Outline

1 Why motif discovery?

2 How to detect sequence motifs
   Representation
   Different algorithmic approaches

3 Dataset

4 Branch Length Speller Algorithm

5 Distributed motif discovery

6 Validation and Publication

7 Conclusion and Future research

How do cells access their hereditary information?

• Proteins take part in virtually all chemical processes in the cell

• The information to build proteins is stored in the coding DNA or the genes

• Gene expression: Gene → RNA (transcription) → Protein (translation)

• RNA polymerase reads the coding DNA and produces a complementary RNA strand

• General transcription factors help with positioning RNA polymerase

• Repressors/Silencers slow down or block RNA polymerase's progress during transcription

• Activators encourage gene expression by increasing the attraction of RNA polymerase

The presence of binding sites influences expression

• Transcription factors bind to transcription factor binding sites close to the gene (example: TATA-box)

• Other factors bind to regulatory sites which can be over 1000 bp away from the gene

Binding site identification

Sequence elements that have a biological function are under selective pressure

Regulatory sites are:

o 6-15 bp long

o not exact (biological examples where only 2/6 bp were conserved among all binding sites)

⇒ FLEXIBLE MOTIF MODEL REQUIRED!

Different Motif Models

3 Motif models

• Mismatch String (MM)

(GACAGG, e = 2)

• IUPAC Degenerate String

• Position Weight Matrix (PWM)

Sequences

seq1 AGATACGACC
seq2 ACGGACAGGC
seq3 ACGAGATAGG
seq4 AGATACGGGG
seq5 GACAGGGTAC
seq6 ACGATAGGAC

   1  2  3     4  5     6
A  0  1  0     1  0     0
C  0  0  0.66  0  0.33  0
G  1  0  0     0  0.66  1
T  0  0  0.33  0  0     0
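To make the three motif models concrete, here is a minimal Python sketch of how each one decides whether a site matches. The function names (`mismatch_match`, `iupac_match`, `pwm_score`) are illustrative and not taken from the presented software. The PWM is copied from the table above, with scoring assumed to be a plain product of column probabilities.

```python
# IUPAC nucleotide codes mapped to the bases each symbol stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_match(motif, site, e):
    """Mismatch model: the site may differ from the motif in at most e positions."""
    return len(motif) == len(site) and sum(a != b for a, b in zip(motif, site)) <= e


def iupac_match(motif, site):
    """IUPAC model: each base must be allowed by the motif symbol at that position."""
    return len(motif) == len(site) and all(b in IUPAC[m] for m, b in zip(motif, site))


def pwm_score(pwm, site):
    """PWM model (assumed scoring): product of the per-position base probabilities."""
    score = 1.0
    for column, base in zip(pwm, site):
        score *= column.get(base, 0.0)
    return score


# PWM from the slide, one dict per position (zero entries omitted).
PWM = [{"G": 1.0}, {"A": 1.0}, {"C": 0.66, "T": 0.33},
       {"A": 1.0}, {"C": 0.33, "G": 0.66}, {"G": 1.0}]
```

For example, `mismatch_match("GACAGG", "GATACG", 2)` and `iupac_match("GAYASG", "GATACG")` both accept the site GATACG, while the PWM assigns it a graded score instead of a yes/no answer.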

Assessment of different models

3 Motif models

• MM

• IUPAC

• PWM

⇒ Our choice: IUPAC

Assessment by Tompa

Which algorithms?

Random algorithms

• Expectation Maximization strategies

• Gibbs Sampling methods

• Drawback: Methods are only guaranteed to provide a locally optimal solution!

Combinatorial algorithms

• Exhaustive methods based on indexing

• Graph methods based on finding cliques in t-partite graphs

• Drawbacks: Computationally demanding, a lot of candidate motifs

Algorithm assessment

Which dataset?

ORIGINAL APPROACH (<2005):

• Datasets were promoter sequences of coregulated genes

• Drawback: Difficult to get a clean dataset! (coexpression doesn’t imply coregulation!)

PHYLOGENETIC FOOTPRINTING APPROACH (>2003):

• More genomes available ⇒ study orthologous genes in different organisms!

• Phylogenetic footprinting: Add phylogenetic information about the sequences to motif discovery algorithms

Multiple Alignment

• Current algorithms for phylogenetic footprinting are adaptations of Gibbs sampling or EM algorithms.

• Most of them rely on multiple alignments

• Are short degenerate binding sites (also reverse strand!) always alignable?

⇒ Important motifs are missed!

How to quantify phylogenetic relationships?

• BLS score of a motif = Weight of the minimal spanning tree containing all species occurrences

• One tree per family, due to presence of paralogs

⇒ Method is robust against missing data
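The BLS definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree topology groups maize with sorghum and rice with Brachypodium, but the branch lengths, the node names (`panicoid`, `bep`, `root`) and the function name `bls` are invented for the example.

```python
# Hypothetical phylogenetic tree: child -> (parent, branch length).
# Branch lengths are invented and chosen to sum to 1.0.
TREE = {
    "zma": ("panicoid", 0.125),
    "sbi": ("panicoid", 0.125),
    "osa": ("bep", 0.25),
    "bdi": ("bep", 0.25),
    "panicoid": ("root", 0.125),
    "bep": ("root", 0.125),
}


def bls(tree, species):
    """Total branch length of the smallest subtree connecting the given species."""
    species = set(species)
    below = {}  # edge (named by its child node) -> number of selected species beneath it
    for leaf in species:
        node = leaf
        while node in tree:
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    # An edge is part of the connecting subtree only if selected species lie on both sides of it.
    return sum(tree[node][1] for node, count in below.items() if count < len(species))
```

With this toy tree, a motif found only in maize and sorghum scores `bls(TREE, ["zma", "sbi"])`, which is 0.25, while a motif found in all four species scores 1.0. A species with missing data simply contributes no branch, which is one way to read the robustness claim.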

Nonparametric statistics based on control motifs

• We validate (Motif, BLS) pairs instead of motifs

• Statistical validation is more precise and conservative

• For every motif we generate 5-20 control motifs

• Control motifs are permutations of the original motif at maximum edit distance (to avoid large overlap)

• Motif confidence for a certain BLS score: (#occ − #FP) / #occ

• #FP = The median of the number of control motif occurrences

• Results in a BLS-Confidence graph per motif
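The confidence formula can be written down directly. The sketch below assumes the occurrence counts of the motif and of its control motifs have already been collected per BLS threshold. The function names and the numbers in the usage note are made up for illustration.

```python
from statistics import median


def confidence(n_occ, control_counts):
    """(#occ - #FP) / #occ, where #FP is the median control-motif count."""
    if n_occ == 0:
        return 0.0
    n_fp = median(control_counts)
    return (n_occ - n_fp) / n_occ


def confidence_curve(occ_by_bls, controls_by_bls):
    """One confidence value per BLS threshold: the data behind a BLS-Confidence graph."""
    return {t: confidence(occ_by_bls[t], controls_by_bls[t]) for t in sorted(occ_by_bls)}
```

For instance, a motif with 40 conserved occurrences whose five control motifs occur 8, 10, 12, 9 and 11 times has a median false positive count of 10 and a confidence of 0.75. Note that the raw formula becomes negative when the controls occur more often than the motif itself.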

BLS-Confidence Graph

Wrap up

OUR METHOD IS:

• Exhaustive and therefore guaranteed to return the optimal solution

• Uses the best motif model available

• Alignment-free

• Better defined due to statistical evaluation of (motif, conservation) pairs

• Robust: can handle imperfect data

Monocots Dataset: Some numbers

• 17724 gene families

• 4 organisms: Zea mays, Sorghum bicolor, Oryza sativa, Brachypodium distachyon

• On average one paralog per tree: 5 promoter sequences per family

• Investigate transcription factor binding sites with promoters of 500 bp length

• Investigate other regulatory sites with promoters of 2000 bp length

Monocots: Which organisms?

Zea mays

• Regular maize, very important for European agriculture.

Sorghum bicolor

• Known in Dutch as Kafferkoren, it is also a grass species. Its edible grains are mainly used as fodder.

Monocots: Which organisms?

Oryza sativa

• Better known as Asian rice. It is also the cereal with the smallest genome.

Brachypodium distachyon

• Not of specific agricultural interest. Serves as a model organism for other grass species (small genome)

Text indexing with generalized suffix trees

Fragment of a generalized suffix tree:

seq1: · · · ACGACG · · ·
seq2: · · · TATATG · · ·
seq3: · · · TATAGG · · ·
seq4: · · · TATACG · · ·

Text indexing with generalized suffix trees

Features of (generalized) suffix tree:

• Tree construction requires O(Nn log Nn) time

• Sequence information per node: O(1) quorum queries!

• Any pattern can be found in O(s|P|) time

• Memory requirements are O(Nn log Nn)

• Memory (per char): tree ≈ 42b, SeqIds ≈ 36b

• N: Number of genes in one family

• n: Length of promoter sequences

• s: Motif degeneracy (number of exact patterns matching with motif)

• q: quorum is the number of sequences a pattern occurs in

• Alternative structures: Suffix Trie (Sergio), Suffix Array (Dries)
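The idea of storing sequence information per node can be illustrated with a much simpler structure than a real suffix tree. The sketch below builds a depth-limited, uncompressed suffix trie, so it does not achieve the construction or memory bounds listed above. It only shows how a set of sequence IDs per node answers a quorum query once the node is reached. All names are illustrative.

```python
def build_suffix_trie(sequences, max_depth):
    """Index every substring up to max_depth; each node stores the IDs of sequences containing it."""
    root = {"ids": set(), "children": {}}
    for seq_id, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for base in seq[start:start + max_depth]:
                node = node["children"].setdefault(base, {"ids": set(), "children": {}})
                node["ids"].add(seq_id)
    return root


def sequence_ids(trie, pattern):
    """Follow the pattern from the root; the node reached already holds the answer."""
    node = trie
    for base in pattern:
        node = node["children"].get(base)
        if node is None:
            return set()
    return node["ids"]


# The four fragments shown on the previous slide.
SEQUENCES = {"seq1": "ACGACG", "seq2": "TATATG", "seq3": "TATAGG", "seq4": "TATACG"}
TRIE = build_suffix_trie(SEQUENCES, max_depth=6)
```

Here `sequence_ids(TRIE, "TATA")` returns seq2, seq3 and seq4, so the quorum of TATA is 3.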

Architecture

1 big distributed tree

• Could allow for minimal motif discovery (tested but ineffective)

• Distribution based on degenerate prefixes ⇒ total tree requires c_p · F × O(Nn)

1 tree per gene family

• Simple map between phylogenetic tree and suffix tree

• Suffix tree can be removed after discovery

• Perfect memory scalability: total tree requires F × O(Nn) space

• This is our choice since it avoids all suffix tree memory issues!

Motif Discovery Algorithm

• The algorithm traverses a virtual lexicographic motif tree

• In this tree all (degenerate) motifs are spelled in a depth-first way

• The generalized suffix tree is used to branch and bound the spelling operation

• A subtree of the lexicographic tree is not explored if its prefix doesn’t meet the quorum condition

• Quorum condition: The motif has to occur in at least q sequences
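The branch-and-bound spelling described above can be sketched as a recursive depth-first search. To keep the example self-contained it checks the quorum by scanning the sequences directly rather than by walking a generalized suffix tree, and it uses a small Don't Care alphabet (A, C, G, T, N). The names `quorum`, `spell` and the `max_n` limit on don't-care symbols are assumptions made for this illustration.

```python
ALPHABET = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def quorum(motif, sequences):
    """Number of sequences containing at least one match of the motif."""
    k = len(motif)
    count = 0
    for seq in sequences:
        if any(all(seq[i + j] in ALPHABET[m] for j, m in enumerate(motif))
               for i in range(len(seq) - k + 1)):
            count += 1
    return count


def spell(sequences, k, q, max_n=1, prefix=""):
    """Depth-first enumeration of length-k motifs; prefixes below quorum q are pruned."""
    if prefix and quorum(prefix, sequences) < q:
        return []  # bound: no extension of this prefix can occur in more sequences
    if len(prefix) == k:
        return [prefix]
    found = []
    for symbol in ALPHABET:  # branch: extend the prefix in lexicographic order
        if symbol == "N" and prefix.count("N") >= max_n:
            continue
        found.extend(spell(sequences, k, q, max_n, prefix + symbol))
    return found
```

The pruning is safe because every occurrence of an extended motif starts with an occurrence of its prefix, so the quorum can only decrease while spelling. On the four fragments ACGACG, TATATG, TATAGG and TATACG, `spell(seqs, k=6, q=3)` explores only a small part of the motif tree.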

Motif Discovery Algorithm

• TATA(!A)G is a motif with degeneracy s = 3

• It matches 3 sites in the suffix tree

• The number of sequences it occurs in is calculated as the union of the sequenceID sets: q = 3
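The TATA(!A)G example can be reproduced with a short sketch. In IUPAC notation "not A" is written B (C, G or T), so the motif becomes TATABG. The helper names `expand` and `quorum` are illustrative, and a plain substring search stands in for the suffix tree lookup.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(motif):
    """All exact patterns matched by a degenerate motif; their number is the degeneracy s."""
    return ["".join(p) for p in product(*(IUPAC[m] for m in motif))]


def quorum(motif, sequences):
    """Size of the union of the sequence-ID sets of all exact patterns."""
    ids = set()
    for pattern in expand(motif):
        ids |= {seq_id for seq_id, seq in sequences.items() if pattern in seq}
    return len(ids)


SEQUENCES = {"seq1": "ACGACG", "seq2": "TATATG", "seq3": "TATAGG", "seq4": "TATACG"}
```

`expand("TATABG")` yields TATACG, TATAGG and TATATG (s = 3), and `quorum("TATABG", SEQUENCES)` gives q = 3, matching the numbers on the slide.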

Complexity

• No tight bounds for the algorithm’s complexity in terms of s

• The motif discovery frequency is rather stable per degeneracy

⇒ The number of motifs M is a good indication for the processing time per gene family!

Number of motifs in IUPAC alphabet

Number of motifs in Don’t Care alphabet

Quorum dependence of the number of motifs

From quorum to BLS

• Simple Mapping from quorum to BLS score

• BLS cutoff ↦ quorum constraint

• BLS score calculated with minimal spanning tree algorithm
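The slide does not spell out how a BLS cutoff is turned into a quorum constraint. One plausible reading, sketched below, is to take the smallest number of species whose connecting subtree can reach the cutoff at all: a motif present in fewer species can never pass, so spelling can be pruned with that quorum. The tree, its branch lengths and the names `bls` and `min_quorum` are assumptions made for illustration, and paralogs are ignored.

```python
from itertools import combinations

# Hypothetical tree: child -> (parent, branch length); lengths are invented.
TREE = {
    "zma": ("panicoid", 0.125), "sbi": ("panicoid", 0.125),
    "osa": ("bep", 0.25), "bdi": ("bep", 0.25),
    "panicoid": ("root", 0.125), "bep": ("root", 0.125),
}


def bls(tree, species):
    """Total branch length of the smallest subtree connecting the given species."""
    species = set(species)
    below = {}
    for leaf in species:
        node = leaf
        while node in tree:
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    return sum(tree[node][1] for node, count in below.items() if count < len(species))


def min_quorum(tree, cutoff):
    """Smallest number of species whose connecting subtree can reach the BLS cutoff."""
    leaves = sorted(set(tree) - {parent for parent, _ in tree.values()})
    for q in range(1, len(leaves) + 1):
        if any(bls(tree, subset) >= cutoff for subset in combinations(leaves, q)):
            return q
    return None  # the cutoff exceeds the total branch length of the tree
```

With this toy tree a cutoff of 0.5 can already be met by two species (rice and Brachypodium), a cutoff of 0.7 needs three, and a cutoff of 0.9 needs all four.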

From BLS to BLS Frequency vector

Interfamily merging via BLS Frequency vectors:

⇒ counts the number of times a (motif, BLS) pair occurs

RAM dominated by motif frequency hash map:

• Motif: encoded in 8 byte integer

• Degeneracy: 4 bytes

• Frequency vector: 10 to 20 bytes (depends on resolution)

• Google Sparse Hash: overhead negligible (2 bits)
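A motif fits in an 8-byte integer when each IUPAC symbol takes 4 bits, since 16 symbols of 4 bits fill exactly 64 bits. The sketch below shows one such packing together with a frequency vector update. The symbol order, the thresholds and the cumulative counting rule (a family counts for every threshold its BLS score reaches) are assumptions, not details given on the slide. A plain `dict` stands in for the sparse hash map.

```python
SYMBOLS = "ACGTRYSWKMBDHVN"  # 15 IUPAC symbols -> codes 1..15; code 0 means "no symbol"
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]  # hypothetical BLS resolution


def encode(motif):
    """Pack a motif of up to 16 symbols into a 64-bit integer, 4 bits per symbol."""
    if len(motif) > 16:
        raise ValueError("motif too long for a 64-bit key")
    value = 0
    for symbol in motif:
        value = (value << 4) | (SYMBOLS.index(symbol) + 1)
    return value


def decode(value):
    """Inverse of encode."""
    symbols = []
    while value:
        symbols.append(SYMBOLS[(value & 0xF) - 1])
        value >>= 4
    return "".join(reversed(symbols))


def add_family(freq_map, motif, bls_score):
    """Record one gene family in which the motif reaches the given BLS score."""
    vector = freq_map.setdefault(encode(motif), [0] * len(THRESHOLDS))
    for i, threshold in enumerate(THRESHOLDS):
        if bls_score >= threshold:
            vector[i] += 1
```

Merging the maps of two processes then amounts to adding the vectors of equal keys element by element.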

Intramerging

Joining motif maps: s ↗ ⇒ Overlap → 0!

Permutation group partitioning

• Scalable!

• Memory limitations due to large motif map

⇒ Restricted Don’t Care alphabet: #N from 0 to 4-5

⇒ IUPAC alphabet, high BLS only
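One way to read permutation group partitioning: since control motifs are permutations of the original motif, sending all motifs with the same symbol content to the same process keeps a motif and its controls together, so confidences can be computed locally. The sketch below assumes exactly that. The function names and the use of CRC32 as a stable hash are illustrative choices, not taken from the presented implementation.

```python
import zlib


def permutation_group(motif):
    """Canonical key shared by a motif and all of its permutations."""
    return "".join(sorted(motif))


def owner(motif, n_processes):
    """Rank of the process responsible for the motif's permutation group."""
    key = permutation_group(motif).encode("ascii")
    return zlib.crc32(key) % n_processes
```

For example, `owner("TATABG", 8)` and `owner("GBATAT", 8)` return the same rank because both motifs share the key AABGTT.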

Time Estimates: 100 families per process

Time estimates for Don’t Care Alphabet

s      Time (k = 6-10)   Time (k = 6-15)
1      1 min             1 min
4      4.5 min           6 min
16     15 min            30 min
64     1h                1h45
256    1h20              5h
1024   2h                10h45

Validation possibilities

• Transfac database: Rice motifs

• Jaspar database: Zea Mays motifs

• Paper by Wang (2009): Discovering cis-elements between sorghum and rice

Transfac Motifs: Matching with our results

• Motifs are represented as consensus strings (ACGT alphabet)

• Levenshtein distance is a bad measure to find the best match with our results

• Better measure: Maximum motif overlap

• Degenerate symbols are considered matching

• Further optimizing: Minimizing length difference d and degeneracy #N

⇒ Simple modification of Manhattan algorithm
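The maximum overlap measure can be illustrated with a simplified, ungapped variant: slide the candidate motif along the database consensus and count positions whose IUPAC symbols are compatible. This is not the modified Manhattan (dynamic programming) algorithm mentioned above, only a sketch of the scoring idea. The names `compatible` and `best_overlap` are assumptions.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """Two symbols match when they share at least one base, so degenerate symbols count as matching."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def best_overlap(reference, candidate):
    """Slide candidate along reference without gaps; return (matching positions, offset)."""
    best = (0, 0)
    for offset in range(-len(candidate) + 1, len(reference)):
        matches = 0
        for j, symbol in enumerate(candidate):
            i = offset + j
            if 0 <= i < len(reference) and compatible(reference[i], symbol):
                matches += 1
        if matches > best[0]:
            best = (matches, offset)
    return best
```

For a hypothetical consensus CACGTG and a discovered motif ACGNG, `best_overlap("CACGTG", "ACGNG")` finds 5 matching positions at offset 1. Ties between candidates could then be broken by the length difference d and the number of degenerate positions #N.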

Transfac Motifs: Early results

Simulation with k ≤ 10 ↦ ≈ 50000 motifs with Confidence ≥ 75%

4 classes (total 49):

• Short motifs (11): found as substrings of longer motifs (d ≈ 2)

• Perfect matches (7)

• Quasi-perfect matches (16): (d ≤ 1, #N ≤ 2)

• Long motifs (15): k = 10 overlapping motifs (#N ≈ 1)

Representation of results

• Tables with motifs vs biological function

• Most interesting motifs: Confidence-BLS graphs

• Venn diagrams to compare our results with biological motifs in databases

• Number of conserved motifs as a function of confidence (one curve per BLS threshold)

• Scatterplot for database motifs: Number of motif occurrences (grep) versus number of BLS-conserved occurrences

• Number of conserved motifs in Monocots dataset versus number of conserved motifs in shuffled dataset

• Compare with alignment data!

Timeline

• June: Paper about BLS method on Monocot dataset

o Comparison with known motifs: Transfac Rice motifs, Jaspar Zea mays motifs, Rice-Sorghum motifs

o Sensitivity analysis (1): Effect of phylogenetic tree (shuffle sequences between different trees)

o Sensitivity analysis (2): False Discovery Rate (shuffle bps in sequences)

• September/October: Paper on parallelization strategy for comparative motif discovery, together with release of source code

o Focus on architecture: one tree versus many trees

o Scalability and Parallel Efficiency

o Add features concerning motif database matching within BLS context

• Divide and conquer motif discovery: still needs testing/only partially implemented

Suggestions for further research

• More advanced BLS measures

• Incorporating coregulation

• Adding positional priors (ChIP-seq data)

• Motif combinations

• Gene clustering based on transcription factor matching

• Adding Epigenetics

• Design advanced motif models and crack the Sandve Benchmark dataset

Conclusions

• We have reintroduced combinatorial motif discovery in a comparative context

• Low BLS motifs will be found with the Don’t Care alphabet

• High BLS motifs can be analyzed with the full IUPAC alphabet

• Initial results show that we are able to reproduce biological results

• This suggests the algorithm can predict new regulatory sites!

• Algorithm is distributed using MPI and able to handle extremely large datasets

Questions?

Are there any questions?
