Comparative motif finding

Exact comparative motif discovery in Monocots

Dieter De Witte

Why motif discovery?

How to detect sequence motifs

Representation

Different algorithmic approaches

Dataset

Branch Length Speller Algorithm

Distributed motif discovery

Validation and Publication

Conclusion and Future research

Exact comparative motif discovery in Monocots

Dieter De Witte

Department Intec-IBCN, Ghent University

March 26 2012, Ghent


Outline

1 Why motif discovery?

2 How to detect sequence motifs

   Representation

   Different algorithmic approaches

3 Dataset

4 Branch Length Speller Algorithm

5 Distributed motif discovery

6 Validation and Publication

7 Conclusion and Future research


How do cells access their hereditary information?

• Proteins take part in virtually all chemical processes in the cell

• The information to build proteins is stored in the coding DNA, i.e. the genes

• Gene expression: Gene → (transcription) → RNA → (translation) → Protein

• RNA polymerase reads the coding DNA and produces a complementary RNA strand

• General transcription factors help to position RNA polymerase

• Repressors/silencers slow down or block RNA polymerase’s progress during transcription

• Activators encourage gene expression by increasing the attraction of RNA polymerase


The presence of binding sites influences expression

• Transcription factors bind to transcription factor binding sites close to the gene (example: TATA box)

• Other factors bind to regulatory sites which can be over 1000 bp away from the gene


Outline

2 How to detect sequence motifs


Binding site identification

Sequence elements that have a biological function are under selective pressure

Regulatory sites are:

o 6-15 bp long

o not exact (there are biological examples where only 2 of 6 bp were conserved among all binding sites)

⇒ FLEXIBLE MOTIF MODEL REQUIRED!


Different Motif Models

3 Motif models

• Mismatch String (MM), e.g. (GACAGG, e = 2)

• IUPAC Degenerate String

• Position Weight Matrix (PWM)

Sequences

seq1 AGATACGACC
seq2 ACGGACAGGC
seq3 ACGAGATAGG
seq4 AGATACGGGG
seq5 GACAGGGTAC
seq6 ACGATAGGAC

Position weight matrix (rows: bases, columns: motif positions)

    1     2     3     4     5     6
A   0     1     0     1     0     0
C   0     0     0.66  0     0.33  0
G   1     0     0     0     0.66  1
T   0     0     0.33  0     0     0
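The mismatch and IUPAC models listed above can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the presented software; the symbol table is the standard IUPAC nucleotide code.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_mismatch(motif, site, e):
    """Mismatch model: the site may differ from the motif in at most e positions."""
    if len(motif) != len(site):
        return False
    return sum(a != b for a, b in zip(motif, site)) <= e

def matches_iupac(motif, site):
    """IUPAC model: each site base must be allowed by the motif symbol at that position."""
    if len(motif) != len(site):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(motif, site))

def degeneracy(motif):
    """Number of exact strings an IUPAC motif matches (product of the symbol set sizes)."""
    count = 1
    for sym in motif:
        count *= len(IUPAC[sym])
    return count
```

For example, `matches_mismatch("GACAGG", "GATACG", 2)` holds because the two strings differ in exactly two positions, while the IUPAC motif `GAYASG` states explicitly which positions may vary and by which bases.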


Assessment of different models

3 Motif models

• MM

• IUPAC

• PWM

⇒ Our choice: IUPAC

Assessment by Tompa


Which algorithms?

Randomized algorithms

• Expectation Maximization strategies

• Gibbs sampling methods

• Drawback: methods are only guaranteed to provide a locally optimal solution!

Combinatorial algorithms

• Exhaustive methods based on indexing

• Graph methods based on finding cliques in t-partite graphs

• Drawbacks: computationally demanding, a lot of candidate motifs


Algorithm assessment


Which dataset?

ORIGINAL APPROACH (<2005):

• Datasets were promoter sequences of coregulated genes

• Drawback: difficult to get a clean dataset (coexpression doesn’t imply coregulation!)

PHYLOGENETIC FOOTPRINTING APPROACH (>2003):

• More genomes available ⇒ study orthologous genes in different organisms!

• Phylogenetic footprinting: add phylogenetic information about the sequences to motif discovery algorithms


Multiple Alignment

• Current algorithms for phylogenetic footprinting are adaptations of Gibbs sampling or EM algorithms

• Most of them rely on multiple alignments

• Are short degenerate binding sites (also on the reverse strand!) always alignable?

⇒ Important motifs are missed!


How to quantify phylogenetic relationships?

• BLS (Branch Length Score) of a motif = weight of the minimal spanning tree containing all species in which the motif occurs

• One tree per family, due to the presence of paralogs

⇒ Method is robust against missing data
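As a rough illustration of the BLS definition, the sketch below sums the branch lengths of the smallest subtree that connects all species in which a motif occurs. The tree topology, the node names and all branch lengths are invented for the example, and `bls` is a hypothetical helper, not the presented implementation.

```python
# Hypothetical rooted tree: child -> (parent, branch_length). All lengths are invented.
TREE = {
    "zma": ("pan", 0.10),   # Zea mays
    "sbi": ("pan", 0.10),   # Sorghum bicolor
    "osa": ("bep", 0.20),   # Oryza sativa
    "bdi": ("bep", 0.20),   # Brachypodium distachyon
    "pan": ("root", 0.15),
    "bep": ("root", 0.15),
}

def path_to_root(node, tree):
    """Nodes from `node` up to (excluding) the root; each stands for the branch above it."""
    path = []
    while node in tree:
        path.append(node)
        node = tree[node][0]
    return path

def bls(species, tree=TREE):
    """Total branch length of the minimal subtree connecting the given species."""
    species = set(species)
    if len(species) < 2:
        return 0.0
    below = {}  # node -> number of occurrences located at or below that node
    for s in species:
        for node in path_to_root(s, tree):
            below[node] = below.get(node, 0) + 1
    # A branch belongs to the connecting subtree iff some, but not all,
    # occurrences lie below it.
    return sum(tree[n][1] for n, c in below.items() if c < len(species))
```

With these invented lengths, a motif found only in maize and sorghum scores 0.20, whereas a motif found in all four species scores the full tree length of 0.90. A missing species simply lowers the score, which is one way to read the robustness claim above.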


Nonparametric statistics based on control motifs

• We validate (motif, BLS) pairs instead of motifs

• Statistical validation is more precise and conservative

• For every motif we generate 5-20 control motifs

• Control motifs are permutations of the original motif at maximum edit distance (to avoid large overlap)

• Motif confidence for a certain BLS score: (#occ − #FP) / #occ

• #FP = the median of the number of control motif occurrences

• Results in a BLS-confidence graph per motif
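The confidence formula can be written down directly. The following is a minimal sketch with a hypothetical function name and invented counts; it assumes the occurrence counts at a given BLS threshold are already available.

```python
from statistics import median

def motif_confidence(n_occ, control_counts):
    """Confidence = (#occ - #FP) / #occ, where #FP is the median control-motif count."""
    if n_occ == 0:
        return 0.0  # convention chosen for this sketch: no occurrences, no confidence
    n_fp = median(control_counts)
    return (n_occ - n_fp) / n_occ

# Invented example: 40 conserved occurrences of the motif at some BLS threshold,
# and five control motifs with these conserved-occurrence counts.
confidence = motif_confidence(40, [8, 10, 12, 9, 11])
```

Here the median control count is 10, so the confidence is 30/40 = 0.75. Repeating the calculation for a range of BLS thresholds yields the points of a BLS-confidence graph for that motif.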


BLS-Confidence Graph


Wrap up

OUR METHOD IS:

• Exhaustive and therefore guaranteed to return the optimal solution

• Uses the best motif model available

• Alignment-free

• Better defined due to statistical evaluation of (motif, conservation) pairs

• Robust: can handle imperfect data


Outline

3 Dataset


Monocots Dataset: Some numbers

• 17724 gene families

• 4 organisms: Zea mays, Sorghum bicolor, Oryza sativa, Brachypodium distachyon

• On average one paralog per tree: 5 promoter sequences per family

• Investigate transcription factor binding sites with promoters of 500 bp length

• Investigate other regulatory sites with promoters of 2000 bp length


Monocots: Which organisms?

Zea mays

• Regular maize, very important for European agriculture.

Sorghum bicolor

• Known in Dutch as Kafferkoren; it is also a grass species. The edible grains are mainly used in fodder.


Monocots: Which organisms?

Oryza sativa

• Better known as Asian rice. It is also the cereal with the smallest genome.

Brachypodium distachyon

• Not of specific agricultural interest. Serves as a model organism for other grass species (small genome)


Outline

4 Branch Length Speller Algorithm


Text indexing with generalized suffix trees

Fragment of a generalized suffix tree:

seq1: · · · ACGACG · · ·
seq2: · · · TATATG · · ·
seq3: · · · TATAGG · · ·
seq4: · · · TATACG · · ·


Text indexing with generalized suffix trees

Features of (generalized) suffix tree:

• Tree construction requires O(Nn log Nn) time

• Sequence information per node: O(1) quorum queries!

• Any pattern can be found in O(s|P|) time

• Memory requirements are O(Nn log Nn)

• Memory (per character): tree ≈ 42b, SeqIds ≈ 36b

• N: number of genes in one family

• n: length of the promoter sequences

• s: motif degeneracy (number of exact patterns matching the motif)

• q: quorum, the number of sequences a pattern occurs in

• Alternative structures: suffix trie (Sergio), suffix array (Dries)


Architecture

1 big distributed tree

• Could allow for minimal motif discovery (tested but ineffective)

• Distribution based on degenerate prefixes ⇒ total tree requires cp · F × O(Nn) space

1 tree per gene family

• Simple map between phylogenetic tree and suffix tree

• Suffix tree can be removed after discovery

• Perfect memory scalability: total tree requires F × O(Nn) space

• This is our choice since it anticipates all suffix tree memory issues!


Motif Discovery Algorithm

• The algorithm traverses a virtual lexicographic motif tree

• In this tree all (degenerate) motifs are spelled depth first

• The generalized suffix tree is used to branch and bound the spelling operation

• A subtree of the lexicographic tree is not explored if its prefix doesn’t meet the quorum condition

• Quorum condition: the motif has to occur in at least q sequences
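The branch-and-bound idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a naive scan of the sequences stands in for the generalized suffix tree, the alphabet is ACGT plus a don't-care symbol N, and all names (`spell`, `support`, `max_n`) are hypothetical.

```python
# Exact bases plus one don't-care symbol; each symbol maps to the bases it allows.
ALPHABET = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}

def occurs(motif, seq):
    """True if the (possibly degenerate) motif matches somewhere in seq."""
    k = len(motif)
    return any(
        all(seq[i + j] in ALPHABET[sym] for j, sym in enumerate(motif))
        for i in range(len(seq) - k + 1)
    )

def support(motif, seqs):
    """Number of sequences containing the motif (the quantity compared with q)."""
    return sum(occurs(motif, s) for s in seqs)

def spell(seqs, q, k_min, k_max, max_n=1):
    """Depth-first spelling of all motifs of length k_min..k_max meeting quorum q."""
    found = {}

    def extend(prefix):
        for sym in ALPHABET:
            if sym == "N" and not prefix:
                continue  # in this sketch, motifs do not start with a don't care
            motif = prefix + sym
            if motif.count("N") > max_n:
                continue
            n = support(motif, seqs)
            if n < q:
                continue  # bound: no extension of this prefix can reach the quorum
            if len(motif) >= k_min and sym != "N":
                found[motif] = n
            if len(motif) < k_max:
                extend(motif)  # branch: spell one symbol deeper

    extend("")
    return found
```

Running `spell(["ACGACG", "TATATG", "TATAGG", "TATACG"], q=3, k_min=6, k_max=6)` on the four fragments shown earlier returns only `TATANG` with support 3. The pruning is safe because a motif can never occur in more sequences than any of its prefixes.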


Motif Discovery Algorithm

• TATA(!A)G is a motif with degeneracy s = 3

• It matches 3 sites in the suffix tree

• The number of sequences it occurs in is the size of the union of the sequence ID sets: q = 3
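The TATA(!A)G example can be reproduced on the four sequence fragments shown earlier. The sketch below uses a naive, uncompressed generalized suffix trie rather than a real suffix tree, purely to keep the code short; the data layout and function names are assumptions made for this illustration.

```python
def build_suffix_trie(seqs):
    """Naive generalized suffix trie; every node stores the IDs of the sequences below it."""
    root = {"ids": set(), "children": {}}
    for seq_id, seq in enumerate(seqs, start=1):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"ids": set(), "children": {}})
                node["ids"].add(seq_id)
    return root

def match(root, motif):
    """Follow a degenerate motif (a list of allowed-base strings) through the trie."""
    nodes = [root]
    for allowed in motif:
        nodes = [n["children"][b] for n in nodes for b in allowed if b in n["children"]]
    return nodes

seqs = ["ACGACG", "TATATG", "TATAGG", "TATACG"]
trie = build_suffix_trie(seqs)

# TATA(!A)G: the fifth position allows every base except A.
sites = match(trie, ["T", "A", "T", "A", "CGT", "G"])
quorum_ids = set().union(*(n["ids"] for n in sites))
```

The motif ends in three different trie nodes (TATACG, TATAGG, TATATG), and the union of their sequence ID sets is {2, 3, 4}, which gives q = 3 as stated above.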


Complexity

• No tight bounds for the algorithm’s complexity in terms of s

• The motif discovery frequency is rather stable per degeneracy

⇒ The number of motifs M is a good indication for the processing time per gene family!


Number of motifs in IUPAC alphabet


Number of motifs in Don’t Care alphabet


Quorum dependence of the number of motifs


From quorum to BLS

• Simple mapping from quorum to BLS score

• BLS cutoff → quorum constraint

• BLS score calculated with minimal spanning tree algorithm


From BLS to BLS frequency vector

Interfamily merging via BLS frequency vectors:

⇒ counts the number of times a (motif, BLS) pair occurs

RAM dominated by motif frequency hash map:

• Motif: encoded in an 8-byte integer

• Degeneracy: 4 bytes

• Frequency vector: 10 to 20 bytes (depends on resolution)

• Google Sparse Hash: overhead negligible (2 bits)
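One way a motif could fit into an 8-byte integer is to spend 4 bits per symbol, with one bit per base, so that each IUPAC symbol becomes a non-zero nibble. This packing scheme is an assumption made for illustration; the slides only state that 8 bytes are used, not how.

```python
# One bit per base; an IUPAC symbol is the OR of the bits of the bases it allows.
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
MASK = {sym: sum(BASE_BITS[b] for b in bases) for sym, bases in IUPAC.items()}
SYMBOL = {mask: sym for sym, mask in MASK.items()}

def encode(motif):
    """Pack up to 16 IUPAC symbols into one unsigned 64-bit value (4 bits each)."""
    if len(motif) > 16:
        raise ValueError("motif does not fit in 8 bytes")
    value = 0
    for sym in motif:
        value = (value << 4) | MASK[sym]
    return value

def decode(value):
    """Inverse of encode; works because no symbol is encoded as the nibble 0."""
    symbols = []
    while value:
        symbols.append(SYMBOL[value & 0xF])
        value >>= 4
    return "".join(reversed(symbols))
```

Under this scheme motifs of length 6 to 15 fit comfortably, and the integer can serve directly as the hash map key.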


Outline

5 Distributed motif discovery


Intramerging

Joining motif maps: as s increases, the overlap → 0!


Permutation group partitioning

• Scalable!

• Memory limitations due to large motif map

⇒ Restricted Don’t Care alphabet: #N: 0 → 4-5

⇒ IUPAC alphabet, high BLS only


Time estimates, 100 families per process

Time estimates for Don’t Care alphabet

s      Time (k = 6-10)   Time (k = 6-15)
1      1 min             1 min
4      4.5 min           6 min
16     15 min            30 min
64     1h                1h45
256    1h20              5h
1024   2h                10h45


Outline

6 Validation and Publication


Validation possibilities

• Transfac database: Rice motifs

• Jaspar database: Zea Mays motifs

• Paper by Wang (2009): Discovering cis-elements between sorghum and rice


Transfac motifs: matching with our results

• Motifs are represented as consensus strings (ACGT alphabet)

• Levenshtein distance is a bad measure for finding the best match with our results

• Better measure: maximum motif overlap

• Degenerate symbols are considered matching

• Further optimizing: minimizing length difference d and degeneracy #N

⇒ Simple modification of Manhattan algorithm


Transfac Motifs: Early results

Simulation with k ≤ 10 → ≈ 50000 motifs with confidence ≥ 75%

4 classes (total 49):

• Short motifs (11): found as substrings of longer motifs (d ≈ 2)

• Perfect matches (7)

• Quasi-perfect matches (16): (d ≤ 1, #N ≤ 2)

• Long motifs (15): k = 10 overlapping motifs (#N ≈ 1)


Representation of results

• Tables with motifs vs biological function

• Most interesting motifs: Confidence-BLS graphs

• Venn diagrams to compare our results with biological motifs in databases

• Number of conserved motifs as a function of confidence (one curve per BLS threshold)

• Scatterplot for database motifs: number of motif occurrences (grep) versus number of BLS-conserved occurrences

• Number of conserved motifs in Monocots dataset versus number of conserved motifs in shuffled dataset

• Compare with alignment data!


Outline

7 Conclusion and Future research


Timeline

• June: Paper about BLS method on Monocot dataset

o Comparison with known motifs: Transfac rice motifs, Jaspar Zea mays motifs, rice-sorghum motifs

o Sensitivity analysis (1): effect of phylogenetic tree (shuffle sequences between different trees)

o Sensitivity analysis (2): False Discovery Rate (shuffle bp in sequences)

• September/October: Paper on parallelization strategy for comparative motif discovery, together with release of source code

o Focus on architecture: one tree versus many trees

o Scalability and parallel efficiency

o Add features concerning motif database matching within BLS context

• Divide and conquer motif discovery: still needs testing/only partially implemented


Suggestions for further research

• More advanced BLS measures

• Incorporating coregulation

• Adding positional priors (ChIP-seq data)

• Motif combinations

• Gene clustering based on transcription factor matching

• Adding epigenetics

• Design advanced motif models and crack the Sandve benchmark dataset


Conclusions

• We have reintroduced combinatorial motif discovery in a comparative context

• Low BLS motifs will be found with the Don’t Care alphabet

• High BLS motifs can be analyzed with the full IUPAC alphabet

• Initial results show that we are able to reproduce biological results

• This implies the algorithm can predict new regulatory sites!

• Algorithm is distributed using MPI and able to handle extremely large datasets


Questions?

Are there any questions?
