splash: structural pattern localization analysis by sequential histograms

rgan-l, thisshareroteiniduesple is

tic acid,

SPLASH: Structural Pattern LocalizationAnalysis by Sequential Histograms

Andrea Califano1

IBM TJ Watson Research CenterPO Box 704, Yorktown Heights, NY 10598

AbstractMotivation: The discovery of sparse amino acid patterns that match repeatedly in a set ofprotein sequences is an important problem in computational biology. Statistically significantpatterns, that is, patterns that occur more frequently than expected, may identify regions thathave been preserved by evolution and which may therefore play a key functional or structuralrole. Sparseness can be important because a handful of non-contiguous residues may play akey role, while others, in between, may be changed without significant loss of function orstructure. Similar arguments may be applied to conserved DNA patterns.

Available sparse pattern discovery algorithms are either inefficient or impose limitations onthe type of patterns that can be discovered.

Results: This paper introduces a deterministic pattern discovery algorithm, called Splash,which can find sparse amino or nucleic acid patterns matching identically or similarly in a setof protein or DNA sequences. Sparse patterns of any length, up to the size of the inputsequence, can be discovered without significant loss in performances.

Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as acomplete genome or the non-redundant SWISS-PROT database can be processed in a fewhours on a typical workstation. Alternatively, a protein family or superfamily, with low overallhomology, can be analyzed to discover common functional or structural signatures. Someexamples of biologically interesting motifs discovered by Splash are reported for the histone Iand for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used tosystematically and exhaustively identify conserved regions in protein family sets. These canthen be used to build accurate and sensitive PSSM or HMM models for sequence analysis.

Availability: Splash is available to non-commercial research centers upon request, conditionalon the signing of a test field agreement.

Contact: [email protected], Splash main page http://www.research.ibm.com/topics/popups/deep/math/html/p1.html

1. IntroductionWhenever Nature finds a “recipe” to accomplish a task that gives differential fitness to an oism, chances are that such recipe will be conserved through evolution. At the molecular levemeans that biological sequences, belonging sometimes to widely distant species, will likelycommon motifs. By motif, we mean one or more highly conserved, ungapped regions of a por DNA sequence (Bailey and Gribskov, 1998). In many proteins, however, a handful of resthat play a key functional role can be separated by highly variable regions. An extreme examthat of serine proteases, where three individual residues, a histidine, a serine, and an asparin widely distant regions of the protein, join together in 3D to form the active site.

1. The results in the “Statistical Significance of Pattern” section were obtained in collaboration withGustavo Stolovitzky. They will appear independently in a separate joint publication.

SPLASH: Structural Pattern Localization Analysis by Sequential HistogramsNovember 4, 1999 1

nsited1991)formcreas-cter” is

ided in, andncy,

resiasoveralgo-ussioname-atterns

con-mera-cticalwild-uences.rns.ions

hichentifyimi-only

t startarseiently.rou-

wever,of ran-struc-

se will

t andpar-dis-calesuteddis-

resort-ring at

haveon a

As a result, the identification of sparse patterns in biological databases is becoming a very travenue within the bioinformatics community. Pattern databases such as PROSITE (Bairoch,are examples of this trend. And methods for the automatic discovery of patterns of the

, which match at least twice in a sequence or sequence database, are becoming iningly relevant in computational biology (Brazma et al, 1998). In this notation, is a charafrom finite-size alphabet, which can match either identically or similarly in a sequence, and “the “wildcard” character which matches any sequence character.

In recent years a number of interesting pattern discovery tools have emerged. These are divtwo main categories: statistical algorithms, such as the Gibbs Sampler (Neuwald, LiuLawrence, 1995) and MEME (Bailey and Elkan, 1994), which use heuristics to improve efficieand deterministic algorithms, such as Pratt (Jonassen, Collins, Higgins, 1995) and Tei(Rigoutsos and A.Floratos, 1998). Another interesting approach is that of EMOTIF which discinteresting motifs in pre-aligned sequences (Nevill-Manning, Wu, and Brutlag, 1998). Theserithms are valuable in that they can discover weak motifs hidden in the biosequences. A discof some of the most interesting algorithms is available in (Brazma et al, 1998). A statistical frwork has also been proposed to help determine the statistical significance of discovered p(Stolovitzky and Califano, 1999). However, probabilistic approaches use heuristics or ad hocstraints and become inefficient when only a small subset of the proteins is homologous. Enution and verification algorithms must explore an exponential search space. Therefore, pralimits are imposed on the number of characters in the pattern or in its maximum number ofcards. Both approaches work best on databases that contain no more than a few tens of seqSome hybrid algorithms are more efficient but they are limited to identically occurring patteThe rest rely on a multiple alignment preprocessing step which suffers from similar limitat(Wang and Jiang, 1994).

This paper introduces an efficient, deterministic pattern discovery algorithm called Splash, wdoes not require enumerating all possible patterns that could occur in the input. Splash can idall sparse patterns, of the form , which match at least twice, either identically or slarly, in one or more sequences and satisfy a predefined density constraint. The latter limitsthe maximum number of wildcards over any consecutive stretch of the pattern that does nowith a wildcard. It does not limit the total number of wildcards in the patterns. Therefore, sppatterns that extend over the entire input sequence are possible and will be discovered efficExtremely low density constraints, such as 28 wildcards in any stretch of 32 characters, aretinely used by Splash as shown in Section 9. Even lower densities are possible. These, hotend to be less useful because interesting patterns would then be hidden in large numbersdom patterns. This is discussed in detail in Section 3.3 on page 8 and it is supported by theture of PROSITE patterns.

Finally, Splash can also be used to discover patterns of values from a continuous range. Thebe called proximal patterns and will be more rigorously defined in Section 3.

In Sections 4. and 6., we will show that Splash is both computationally and memory efficiensignificantly outperforms some of the most commonly used pattern discovery algorithms. Inticular, we will demonstrate that only a minimal set of patterns must be explored in order tocover those that occur in the input. The algorithm is embarrassingly parallel in nature and it slinearly with the number of processors. It has been implemented both for SMP and distribarchitectures. In Section 5. we will discuss the statistical significance of the type of patternscovered by Splash.

The efficiency of Splash makes it possible to explore entire databases or genomes, withouting to supercomputers. For instance, in the single threaded version, all similar patterns occurleast 3 times in the current release of the non-redundant PDB (Bernstein et al., 1977), whichno more than 6 wildcards in any window of 12 characters, are discovered in one hour

Σ ′. ′∪( )*Σ

.

Σ ′. ′∪( )*


dis-ds indues.with

e G-ch as

ningfuln theOSITE

sparsekov,

ure ofntly thanmuta-t func-latterify andsparse

ce, or

gousprovidepre-

arese ofgicalyond

he dis-

slevantigids or to

er)fam-y bepat-

t one

nceboth

sultsthose

266MHz Pentium II laptop. Similarly, the parallel version of Splash takes about 8 hours tocover all identical patterns that are statistically significant and have no more than 5 wildcarany window of 10 characters, in a non-redundant protein database with over 90 million resiThe hardware configuration used for this purpose is a 133MHz 4-way SMP workstation (F50)1GB of memory.

In Sections 8. and 9., we will discuss some experimental findings for the histone I and for thprotein-coupled receptor families. Typical runs, required to analyze large protein families suthe GPCR or the histone, are performed in seconds to minutes on similar configurations.

2. Biological applications of pattern discovery algorithmsOne important and often controversial point about sparse patterns is whether these are meabiological entities by themselves. Clearly, amino acids do not freely exist in space but live icontext of a continuous sequence. And, although signatures such as the ones available in PRare useful to rapidly screen unknown sequences for functional or structural annotation, non-models such as Hidden Markov Models (Krogh et al., 1994) and PSSM (Bailey and Gribs1998) seem to perform better.

The position of this paper is that patterns are only meaningful as a local statistical meassequence conservation across a set of sequences. That is, patterns that occur more frequeexpected in a set protein or DNA sequences are likely to correspond to regions where lowertion rates have been favored. These patterns may then identify regions that play an importantional or structural role as well as pinpoint which residues may play a more crucial role. Thewill be the ones that are most conserved. Therefore unusual patterns can be used to identalign hard to find, highly specific regions in sequences. Once these have been identified, thepattern model can lead directly to a contiguous representation, via HMM or PSSM, for instanto multiple sequence alignments using algorithms such as MUSCA (Parida et al., 1999).

This is especially important if the sequence set is large or if it contains many non homolosequences. In this case, an efficient pattern discovery algorithm, such as Splash, can easilythe core for a large number of highly specific local alignments. Again, this makes it an idealprocessing candidate to systematically build HMM and PSSM models.

The following is a small, non-exhaustive list of applications of the Splash algorithm, whichbeing currently investigated in the context of protein function and structure studies. The purpothis section is to support the importance of efficient pattern discovery techniques in a biolocontext. Each of these constitutes an individual piece of work whose details extend well bethe scope of this paper. Rather, this paper is devoted to the description of the algorithm and tcussion of the algorithm’s performance.

Single (super)family analysis: pattern discovery in a family or superfamily can identify regionthat have been subjected to lower rates of mutation and that may therefore play a refunctional or structural role. HMMs or PSSM can then be derived from these local, rmultiple sequence alignments. These can then be used to label orphan sequencescreen large genomic databases, such as dbEST, for unknown members of the (supily. Conducting such an analysis systematically with existing discovery algorithms macomputationally prohibitive. Splash, for instance, can find all highly conserved sparseterns in each one of the more than 1300 protein families associated with at leasPROSITE motif. This process requires about 2 hours on a 4-way SMP workstation.

Structural similarity : many families, such as TNF and C1Q, show no homology at the sequelevel but are clearly similar from a structural perspective. Patterns that occur acrossfamilies can highlight elements that are responsible for the structural similarity. Reobtained with Splash over these families are in fundamental agreement and extend


C1Q36,ose of

candi-

na-ber offor

rs or

basetartingentifyh 3Dccess-er,

rsrob-ut.either

ns thatens allthe

nticalrac-ndingsual

ll also, or

dis-

,

mes

reported in (Shapiro and Scherer, 1998). In this paper, two short patterns,L..G andG.Y. ,have been identified by manual structural alignment and subsequent analysis of oneand one TNF structure. By analyzing all TNF and C1Q proteins in SWISS-PROTSplash finds these two most statistically significant patterns in the same regions as th(Shapiro and Scherer, 1998).[ILMV][ILMFV]. L...[DQEK][RQEHK][ILMV] in 52 out of 68 sequences.[ILMFV]..... G[ILMFV] Y.[ILMFV]..[RQEHK] in 60 out of 68 sequences.These have been used to screen dbEST and have produced 6 previously unknowndates that are now under investigation.

Pairwise (super)family analysis: pairs of functional families that are not homologous can be alyzed to discover patterns that are common across both members. Because the numpairs is extremely high, the efficiency of the algorithm is paramount. This is useful,instance, in the analysis of signature common to many different transcription factoantibodies.

Full Genome/Database analysis: Regions that are unusually preserved across an entire datacan be first automatically discovered and then clustered using an RMSD measure. Sfrom the Brookhaven database, this methodology can be used to systematically idpatterns that are structurally, as well as sequentially, conserved. A small set of sucsignatures, which were obtained by multiple sequence alignment, have been used sufully for predicting tertiary structure from sequence information (Bystroff and Bak1988).

2.1 The problemLet us start by defining theidentical patterndiscovery problem. Given a string of characte

from an alphabet , pattern discovery can be defined as the plem of identifying patterns of the form that occur at least times in the inpWe shall call a characters from the alphabet a “full character.” A pattern character is thena full character or a wildcard.

Example: A.B..A is an identical pattern which is repeated twice in the stringACBDDABBDCA.

Let us use the notation of (Brazma et al., 1998) to represent a more general class of patterwe shall callsimilar patterns. Let be different subsets of generated as follows: giva similarity metric and a threshold , for each character the subset containcharacters such that . Useful similarity metrics will be discussed in more detail innext section. A non-redundant set is obtained by removing all but one instance of any idesubset. Let us then define a similarity alphabet , disjoint from , with one chater for each subset . We will say that matches any character contained in the corresposubset . The characters in the class are usually denoted by the uregular expression sysntax as .

As in the case of identical patterns, we shall call any character in a full character. We shasay that a character matches a character , or that if either

. A wildcard matches any character . Then, one can define the similar patterncovery problem as the problem of finding every pattern of the form , which matches atleast times inS.

Example: let be an alphabet and , , ,, , a similarity metric. Then , ,, and . The similarity alphabet is then defined as

with , ..., . In this case, a pattern would match any ofAxBxC, AxBxD,BxBxC, andBXBXD. Using the regular expression syntax, the previous pattern beco[AB].B.[CD] .

S s1 … sL, ,= Σ a1 … an, ,{ }=Σ ′. ′∪( )* J0 1>

Σ

K1 … Kn, , ΣH Σ Σ′,( ) h Σi K iH Σi Σ,( ) h≤

Ψ b1 … bn, ,= ΣKi Ψ

K Ψ( ) Ki ai1 … aini, ,{ }=

ai1…aini[ ]

ΠΠ Π∈ Σ Σ∈ Π Σ≡ Π Σ=

Σ K Π( )∈ Σ Σ∈Π ′. ′∪( )*

J0 1>Σ A B C D, , ,{ }= H A B,( ) h< H B A,( ) h< H B C,( ) h<

H C B,( ) h< H C D,( ) h< H D C,( ) h< K1 AB[ ]= K2 ABC[ ]=K3 BCD[ ]= K4 CD[ ]= Ψ a,b,c,d{ }=

a K1≡ d K4≡ a.B.d


omeand

th anarityterns.prop-

ere-

et-ilar-

eid to

atrixtions to

efinedUMoff,ed:

Many biological problems can be best modeled using similarity metrics rather than identity. Suseful metrics are obtained from the well established probability of mutation matrix (SchwartzDayhoff, 1978).

A variant of the similar pattern discovery problem is obtained by replacing the alphabet wiinterval of the real axis . In that case, the pattern takes the form and the similmetric becomes a distance metric, . Such patterns are called proximal patThey are useful, for instance, to describe correlation in the value of specific physiochemicalerties of the amino acids, such as hydrophobicity or charge, in a string.

We shall call patterns that do not have leading and trailing wildcardscompact patterns. Splashonly reports compact patterns. Therefore, unless otherwise indicated, bypatternwe will intend acompact pattern.

2.2 Similarity MetricsSince we are interested in similar patterns, let us define a similarity metric , wh

, as a positive function ofx andy, such that and which satisfies the triangle inequality:

(2.1)

A symmetric similarity metric, i.e., , is equivalent to a distance metric (Gerrsen, 1962). For instance, if the log-probabilities of amino acid mutation are used to define simity classes,

, (2.2)

the resulting metric will be slightly asymmetric. because thprobability of alanine to mutate into aspartic acid is slightly larger than that of aspartic acmutate into alanine. Known log-probability matrices satisfy the triangle inequality.

Note that the minimum of is reached when . The opposite is true for a score mwhere the score is highest along the diagonal. Given an alphabet , one can use these funcdefine the similarity classes and the relative alphabets and .

2.3 Similarity Metrics for Protein SequencesAs mentioned above, a convenient similarity metric for protein sequence analysis can be dusing the of the amino acid mutation probabilities, such as the ones in PAM or BLOSmatrices. For instance, using a BLOSUM50 mutation probability matrix (Henikoff and Henik1992), and a threshold , the following similarity classes are obtain

(2.3)

Then, given the sequence

Σℜ ρ ′. ′∪( )*

H ρ ρ′,( ) ρ ρ′–=

H x y,( )x Σ∈ y Σ∈, H x x,( ) 0=

H x y,( ) H x w,( ) H w y,( )+≤

H x y,( ) H y x,( )=

H x y,( ) p x y→( )p y( )

-----------------------log–=

H Ala Asp,( ) H Asp Ala,( )<

H x y,( ) x y=Σ

Ki Ψ Π

P( )log–

h 2= K{ } K1 … K18, ,=

ala Ala[ ]≈ arg Arg, Lys[ ]≈ asn Asn, Asp[ ]≈asp Asp, Glu[ ]≈ cys Cys[ ]≈ gln Gln, Glu, Lys[ ]≈glu Glu, Asp, Gln[ ]≈ gly Gly[ ]≈ his His, Tyr[ ]≈ile Ile, Leu, Met, Val[ ]≈ leu met≈ lys Lys, Arg, Gln[ ]≈met Met, Ile, Leu[ ]≈ phe Phe, Tyr[ ]≈ pro Pro[ ]≈ser Ser, Thr[ ]≈ thr ser≈ trp Trp, Tyr[ ]≈tyr Tyr, His, Phe, Trp[ ]≈ val Val, Ile[ ]≈


at,ould

A/t

om-

.

n. Theusly.

n ofters

.-

e andd-

s that

Ala Cys Gln Gln Val Trp Ala Gly Ala Phe Ile Tyr Leu His Pro (2.4)

the pattern , for instance, would have locus . Note, however thbased on the same metric, the pattern , which appears at position 0, whave a smaller locus since both and .

3. Definitions and NotationLet be a string with values from a fixed alphabet . This could be the DNRNA alphabet , , or the prote in a lphabe

. could also be a valuefrom a continuous interval on the real line. If is discrete, given a similarity metric, one can cpute the alphabets and , as described in the previous section. Apattern is then any stringof the form . If is continuous, then a pattern is any string of the formThis is also described in the previous section.

Let us call thecompositionof . This is formed by taking each fullcharacter in the pattern and its relative position, in characters, starting at zero, on the pattercomposition of a pattern and the number of leading and trailing wildcards define it unambiguoThe length of a pattern is the total number of full characters it contains. A patterlengthk is called a -pattern. The span of a pattern is the total number of charac(full characters and wildcards) it contains. Ak-pattern with a span ofl characters is called a -pattern.

For instance,A.B..C has length 3, span 6, and composit ion[AB].B..[BCD] has length 3, span 6, and composition . Compact patterns start and end in a full character, therefore they have .

By definition, a -pattern matches (or occurs) at offsetw on a stringS if and only if, for eachi,there exists a such that

(3.1)

for each value of . The set ofj absolute offsets where a pattern matches inS is called thepattern locus.This is represented as . Akl-pattern that matches atoffsets inS is called ajkl-pattern.j is called the patternsupport.

Finally, the comb of a pattern is the string obtained by replacing each full character with a oneach wildcard with a zero. The termk-comb andkl-comb have similar meaning as the corresponing terms for patterns.

Example: The comb ofA..BC..D...E is 100110010001.

Given a stringS, by we shall indicate the pattern determined by the comb at offsetw onS.

Example: let ABCDEFGHI be a string. Then the pattern isD..G

3.1 Basic Pattern Operators and relationshipsThe purpose of this section is to define a small number of pattern operators and relationshipwill be conveniently used for a compact description of the algorithm.

Trim operator : is a pattern obtained by removing all leading and trailing wildcards from

Example: let ..A..B... be a pattern . Then .

Translation operator:is a pattern with locus and com-

position .

ala ... ile tyr 0 6 8, ,{ }ala ... val trp

0 6,{ } f Val,Leu( ) 2> f Trp,His( ) 2>

S s1 … sL, ,= L ΣΣDNA A,C,G T,{ }= ΣRNA A,C,G U,{ }=

ΣP A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V{ }= ΣΣ

Ψ Π πΠ ′. ′∪( )* Σ π ρ .∪( )*

Cπ Πi di,( ) i 1 … k, ,={ }= π

k Cπ=k l dk 1+=

kl

A 0,( ) B 2,( ) C 5,( ), ,{ }AB[ ] 0,( ) B 2,( ) BCD[ ] 5,( ), ,{ }

d1 0=

kΠi Π∈

sw di+ Πi∈

1 i k≤ ≤Wπ w1 … wj, ,{ }= j Wπ=

πψ w, ψπ1001 3,

π′ Tt π( )= ππ Tt π( ) A..B=

π′ π x+= W′ W x+ w1 x+ w2 x+ … wj x+, , ,{ }= =C′ C x– Πi di x–,( ){ }= =


-2,not

their

of-

n

nsd

an

rsd

ttern.

of

.

h

-rn, by

Example: let A..B be a pattern with locus . Such pattern, translated bybecomes..A..B with locus . Patterns produced by positive translations canbe conveniently represented as strings. However, they can still be represented bycomposition and locus.

Append Operator:is a pattern with all the characters of followed by all the characters

. If the span of isl, then has locus and composition .

Example: let =A..B.C and =B.C . Then =A..B.CB.C .

Add Operator :is a new pattern with locus and compositio.

Example: let = A..B.C...D be a pattern with locus {2, 13, 25, 40} and =...B.CE a pattern with locus {2, 7, 13, 25, 49}. Then, =A..B.CE..D ,with locus {2, 13, 25} = .

If there exists a translationx such that the intersection of the locus of two patteris non null, it is sometimes convenient to add an

to form a longer pattern.

Using these operators, we can define the following pattern relations:

Sub-pattern (super-pattern):is a sub-pattern of if there exists a translationx such that and

. Conversely, is a super-pattern of if is a sub-pattern of .Example: A..BC.D with locus is a sub-pattern ofBC.D with locus

. The translation is .

Aligned sub-pattern (super-pattern):is an aligned sub-pattern of if it is a sub-pattern of for . Then, is

aligned super-pattern of .Example: Given A..B.C. . .D with locus andA..B with locus

, thenA..B.C...D is an aligned sub-pattern ofA..B because, for a trans-lation value , .

Optimally aligned sub-pattern (super-pattern):, with spanl, is an optimally aligned sub-pattern of if has more full characte

than and is identical to the firstl characters of . Then, is an optimally alignesuper-pattern of .By definition, an optimally aligned sub/super-pattern is also an aligned sub/super-paThat is, and vice versa.For simplicity, if is an optimally aligned sub-pattern of , we shall call a child

and a parent of .Example: BothA..BC..E andA..BC.D are children ofA..BC .

Equivalent sub-pattern (super-pattern):is an equivalent sub-pattern of if it is a sub-pattern of and

Then, is an equivalent super-pattern of .Example: AB.D, with locus , inABCDABQD, is an equivalent sub-pattern of botA..D , with locus , andB.D , with locus .

3.2 Maximal PatternsA pattern is said to bemaximal in composition(or c-maximal) if there does not exist any equivalent sub-pattern of with the same span. That is, if cannot be extended to a longer patte

5 12,{ }3 10,{ }

πc πa πb⊕= πaπb πa πc W πc( ) W πa( ) W πb( ) l–( )∩=

Cc Ca Cb l–( )∪=

πa πb πa πb⊕

πc πa πb+= W πc( ) W πa( ) W πb( )∩=Cc Ca Cb∪=

πa πbπc πa πb+=

2 13 25 40, , ,{ } 2 7 13 25 49, , , ,{ }∩

W πc( ) W πa( ) W πb( ) x–∩= πaπb x–

πb πa W πa( ) W πb( ) x+⊇Cπb

Cπax–⊃ πa πb πb πa

1 31 76, ,{ }4 12 34 61 79, , , ,{ } x 3=

πa πb πb x 0= πbπa

2 35 60, ,{ }2 10 35 60, , ,{ }

x 0= 2 10 35 60, , ,{ } 2 35 60, ,{ }⊇

πb πa πbπa πb πa πa

πb

W πb( ) W πa( )⊇πb πa πb

πa πa πb

πb πa πa W πa( ) W πb( )=πa πb

0 4,{ }0 4,{ } 1 5,{ }

ππ π


ern isnedn, by

dingother

idse. We

ximal

sub-prettraintm-is-is

e laterme-

l char-

allull

it

nd

,to pro-allySITE

1254).ions,

dereferencing one of its wildcards into a full character, without decreasing its support. A pattsaid to beleft maximal(or l-maximal) if there does not exist any equivalent sub-pattern obtaiby appending it to another pattern. That is, if it cannot be extended into a longer patterappending it to another pattern, without decreasing its support. A pattern is said to beright maxi-mal (or r-maximal) if there does not exist any equivalent sub-pattern obtained by appenanother pattern it to. That is, if it cannot be extended into a longer pattern, by appending anpattern to it, without decreasing its support.

A maximal patternis a pattern that does not have any equivalent sub-pattern. That is, it isl-, c-,andr-maximal. Maximality is an essential property of pattern discovery algorithms. It avoreporting a combinatorial number of super-patterns of each maximal pattern in the sequencwill require that the Splash algorithm report only maximal patterns.

An important class, as we shall see later, is that of patterns that are both left maximal and main composition. These will be calledlc-maximal patterns, for short.

Example: let the patternA.B..C occur 100 times in a string and the patternA.B..C.EF occurtwice. That is, in 2 of the 100 occurrencesA.B..C is flanked by.EF . ThenA.B..C occurring 100times is maximal,A.B..C.EF , occurring twice, is also maximal. HoweverA.B..C.E , occurringtwice, andA....C , occurring 100 times, are not maximal.

3.3 Density ConstraintIt is important to be able to specify the minimum number of full characters required in anystring of a pattern which starts with a full character and has length . It is intuitive to inter

as a density of full characters per unit length. Therefore, we shall call such a consthedensity constraint. Given and , we shall say that a comb is a -valid comb (or siply a valid comb) if either its length is smaller than or the relation is satfied for eachi such that . A pattern with a valid comb is called a valid pattern. Ituseful to consider valid even patterns that have fewer than full characters, as they may bextended to form longer patterns with more than full characters and would otherwise be imdiately rejected. In any case, as we shall see later, patterns that have fewer than fulacters, with a user defined threshold, will not be reported anyway in the end.

Example: the patternA....BC....D satisfies the density constraint , becausesubstrings (A....BC andBC....D ), of length 7, starting on a full character, have at least 3 fcharacters. The fact that..BC... of length 7 has only 2 full characters is irrelevant becausedoes not start with a full character. The fact that substringC....D has 2 full characters is alsoirrelevant because the pattern contains only six positions.A....B is also valid pattern, for thesame parameters, because it has fewer than 3 full characters.

If one were not to specify such constraint,j-patterns in random strings would be distributed arouan average length.

(3.2)

where is the frequency of the symbol inSand is 1 if and 0 other-wise. If is an interval of the real axis, the sum must be replaced by an integral.

The Splash algorithm is usually run at three different densities, from ,aimed at dense to very sparse patterns. Lower densities are not generally useful as they tendduce large numbers of patterns that are not statistically significant. It is unlikely that biologicsignificant, rigid patterns contain more than 28 consecutive wildcards. For instance, the PROdatabase contains only one rigid patterns with more than 20 consecutive wildcards (PS00This is because, in general, long regions without matching full characters include loop reg

k0l0

r k0 l0⁄=k0 l0 k0 l0,⟨ ⟩

k0 di k0 1–+ di l0≤–1 i k k0– 1+≤ ≤

k0k0

K0 k0≤K0

k0 3= l0 7=

k L p Σi( ) p Σ j( )αi jj

∑j 1–

i∑=

p Σi( )[ ] 1– Σi αi j H Σi Σ j,( ) h≤Σ

k0 4= l0 8 16 32, ,=


in thecutivee easilyan callrnsTheoducesds.

, ,

,lledd in

calletotion,

racter;

tallfinite.

e-finalast

ich is-

hesi-

where insertions and deletions are likely. Not many very sparse flexible patterns are reportedliterature. For instance, PROSITE contains only 8 flexible patterns with more than 20 consewildcards. These are detected by Splash as two or more separate rigid patterns and can bfused in a simple post-processing phase. The algorithm has been implemented so that it citself recursively. This allows the user to find flexible patterns by discovering any rigid pattethat occur within a predefined window on the left and on the right of any other rigid pattern.process can be repeated until no more patterns are found. This deterministic approach prflexible patterns defined by rigid components separated by variable-length strings of wildcar

4. The AlgorithmLet us define the input of the algorithm as a string and a set of parameters

, and . Let us also define the set of patterns that occur at least times inS, andwhich are -valid. We shall call any such pattern afinal pattern. The output of the algorithmis then the set of all final patterns which have lengthwith their leading and trailing wildcards removed by the trim operator. The latter are careported patterns. The extension from one to multiple sequences is trivial and will be assumethe rest of the paper.

We shall start by defining a method to construct an initial set of patterns , which wea Seed Set, from any stringS. either contains a parent of any final pattern or itself. Wshall then define a recursive Splash operator ,process the input set and the output set . On the first iteration . At each iterathis operator performs three tasks: (a) it eliminates patterns in the input that are notlc-maximal; (b)it systematically extends all input patterns into valid patterns that have at least one more cha(c) it inserts any pattern which is a final pattern in .

We shall demonstrate that, at each iterationn, the set contains at least a parenof any final pattern that is not yet in . Therefore, when , will containthe final patterns. Finally, we shall demonstrate that the algorithm converges because isThe theorems and demonstrations apply to identical, similar, and proximal patterns.

Let us start by defining the following dataset and operators:

Seed-Set: Given the stringSand the parameters , let be the set of all thfinal patterns ofS. Then, a aSeed-Set is any set of patterns that either (a) contains at least one parent of each pattern or (b) contains . In other words, anypattern is either contained in or it is a child, possibly with a smaller locus, of at leone pattern in .Example: given the stringABCDABEFABCand the constraints , , ,and , the maximal patterns areAB..AB with locus {0,4}, ABCwith locus {0,8},andAB with locus {0,4,8}. All three of them can be obtained by extending the patternABwith locus {0,4,8}. Therefore, the set is a Seed Set.

Canonical Seed-Set: A seed set is a Canonical Seed Set if it does not have any subset whalso a Seed Set. That is, each valid pattern onS is either present once in or it is a subpattern of one and only one pattern in .Example: Given the stringABCDABQDand the constraints , , ,

, the set is a Canonical Seed Set. This is becauseAB is the only parent inof the only valid and maximal pattern,AB.D. The set is not canonical

because the valid patternAB.D is a sub-pattern of bothAB andA.C .

A trivial canonical Seed Set, for , , , and any is , where is tpattern with empty composition and locus , which includes all the pos

S s1 … sL, ,= k0 l0J0 K0 Qs′ J0 1>

k0 l0,⟨ ⟩Qs π{ } Tt Qs′ k K0≥( )( )= = k K0 k0≥ ≥

Ps π{ }=Ps π π

Ts Ps Qs,( ) Tsn

Ps Qs,( ) Ts Tsn 1–

Ps Qs,( )( )=Ps Qs Qs ∅=

Qs

Psn Ts

nPs Qs,( )=

Qs Tsn∅ Ps Qs,( ) ∅= Qs

n∅

l0 k0 J0 K0, , , Qs π{ }=Ps π{ }=

π π πPs

Psk0 2= l0 4= J0 2=

K0 2=

AB{ }Ps

PsPs

k0 2= l0 3= J0 2=K0 2= AB{ }AB{ } AB,A.C{ }

J0 2= k0 1= K0 1= l0 π∅{ } π∅W π∅( ) 1 2 … L, , ,{ }=


rn in

cu-100,tifiedic.the

:

y byigout-match

ble offsets ofS.This is because, by definition, this pattern is a super-pattern of any other pattes. Since this set contains only one pattern it is, by definition, canonical.

A convenient way to generate Seed Sets of a less general nature is to start from all possible

(4.1)

-combs, ,with , that start with a 1 and have 1s in the next consetive positions. For instance, for and , there are 4 of these combs, 11000, 1010010, and 10001. Given one such comb , one can list all the different patterns , idenby the comb and by an offsetw on S, and evaluate their support based on the similarity metrFinally, one can remove all the patterns with support . The resulting set will be called

-set or for short. The following pseudocode illustrates a procedure for assembling

Create an empty Seed Set ;for each comb {

for each offset on S and corresponding seed pattern {Create an initial locus with just that offset;count = 1;for each other offset on S such that {

if ( matches ) {add to ;count ++;

}}// Check that is not the same as any other withNotPreviouslyFound = true;for each offset on S such that {

if( ) {// the two loci are identicalNotPreviouslyFound = false;break;

}}// If this is a new pattern with a valid supportif ( NotPreviouslyFound && ( )) {

add to}

}return

}

Theorem 1:The set is a Seed Set.

Proofs of all theorems can be found in Appendix A. Sets of this nature can be built efficientlmeans of a look-up table, such as the one described in the Flash algorithm (Califano and Rsos, 1993). This procedure works also for continuous values, based on the definition of ausing a distance metric.

l0 1–

k0 1–

k0 ψ{ }k0k0 K0≤ k0 1– l0 1–

k0 2= l0 5=ψ πw ψ,

j J0<l0 k0,⟨ ⟩ Ps* Ps*

Ps* ∅=ψ ψ{ }k0

∈w πψ w,

Wψ w, w{ }=

w′ w w′≠πψ w, πψ w′,

w′ W

Wψ w, Wψ w′, w′ w<

w′ w′ w<Wψ w, Wψ w′,≡

count J0≥πψ w, Ps*

Ps*

Ps*


rial setatternsoting in

al

ther

and.

natesexpo-s are

ithm

4.1 MaximalityAs we discussed before, only maximal patterns should be reported. Otherwise a combinatoof super-patterns of any given pattern would be reported. Since, as we shall see, we extend pexclusively by appending other patterns to them, if we were to extend any pattern that is nlc-maximal we would be guaranteed to get a non-maximal pattern.On the other hand, by extendthis way anlc-maximal pattern, we may get new patterns that are maximal.

For instance, given the stringABCAABCACBCD, the patternBCwith locus {1,5,9} is lc-maximal.However,BCA is not lc-maximal since it is an equivalent super-pattern ofABCA. If we were toextendBCA, therefore, we would get non maximal patterns.

We define the operator to check for patternlc-maximality:

(4.2)

Theorem 2:Any parent of a final pattern islc-maximal.

It follows that if is a Seed Set, is also a Seed Set.

Theorem 3: Any valid lc-maximal pattern is either a final pattern or it is a parent of one finpattern with the same support.

Theorem 4: If a set contains only unique patterns and no pattern is a child of anopattern , then the set is a Canonical Seed Set.

For a compact description of , let us define a boolean function which is true ifonly if there exists any character such that for each value ofThen, can be implemented by the following pseudo code:

{// First check that pattern is composition-maximalfor each offset corresponding to a wildcard in {

if( ) {return ;

}}// Then Check for left-maximalityfor each offset such that

if( ) {return

}}return

}

This operator is a key component of the efficiency of the Splash algorithm because it elimimost potential pattern candidates very early, before they have a chance to contribute to annential growth of the number of hypotheses. If this operator is not implemented and patternchecked a posteriori for maximality, both memory and computational efficiency of the algordecrease exponentially in the size of the input.

Tm π( )

Tm π( ) π{ }=

Tm π( ) ∅=

Tm πi{ }( ) Tm πi( )∪≡ if π is lc maximal–

if π is not lc maximal–

Ps Tm Ps( )

π

Ps π Ps∈π′ Ps∈ Tm Ps( )

Tm E W x,( )Πi Π∈ s wm x+( ) Πi= 1 m j≤ ≤

Tm

Tm π( )

x πE Wπ x,( )

∅

x l– 0 dk0 1–+ x< 0<E Wπ x,( )

∅

π{ }


-rt at

by at

e full

e fulle

ed asofents

it is

y forfol-

4.2 The Histogram operatorGiven ajkl-pattern , the Histogram operator is used to create all the possible validpatterns , obtained by appending a string of the form to , which have suppoleast .

Given anlc-maximal pattern, , of lengthk, we can define theextension interval , adjacent to, such that would be maximal unless it could be extended, without reducing its support,

least one full character in .

Example: let A.B be alc-maximal pattern and , be density parameters. ThenA.Bwould be maximal unless one could extend it, without reducing its support, by at least oncharacter at either one or two positions to the right of theB. Similarly, thelc-maximal patternA.B.CD would also be maximal for the same parameters unless one could extend it by oncharacter, without reducing its support, either at one, two, or three positions to the right of thC.

can be defined as follows:

(4.3)

Here,l is the span of . In the previous example,

(4.4)

The Histogram operator, , which, given a pattern , produces the set can be definfollows. For each offset , let us define all the positions , with , the locus

. Our goal is to find how many of the similarity classes are supported by at least elemin each set defined by a specific value ofx. Any such class and offsetx can be used to extendinto a longer pattern which is valid by definition. If a pattern has support less thaneliminated.

If the alphabet is discrete, this can be accomplished with an array , with one entreach value of the offset and for each similarity class in . This is summarized by thelowing pseudo code:

{create an empty result setfor each in {

// Initialize the countersfor each {

set each}for each {

c =for each such that {

add to}

}// Remove all redundant countersfor each {

for each with {if {

set}

π Th l 1+( )π̇{ } ′. ′* Π′. ′* π

J0

π I ππ π

I πk0 3= l0 5=

I π

I π 1 l0 l k0 k–( )––,[ ]=

I π 1 l0 d+ k k0– 2+ l–,[ ]= if k k0<

if k k0≥

π

I A.B

I A.B.CD

1 5 3– 3 2–( )–,[ ]=

1 5 4 6–+,[ ]=

1 2,[ ]=

1 3,[ ]=

Th π π̇{ }x Iπ∈ w l x+ + w Wπ∈

π Πi J0π

π̇ π̇ J0

C x Πi,[ ]x Iπ∈ Π

Th π Qs,( )R ∅=

x Iπ

Πi Π∈C x Πi,[ ] ∅=

w Wπ∈S w l x+ +( )

Πi Π∈ c Πi∈w C x Πi,[ ]

Πi Π∈Πi ′ Π∈ i ′ i>

C x Πi,[ ] C x Πi ′,[ ]=C x Πi ′,[ ] ∅=


county, for

c con-

, thenit is

sn,an

.

, thehich

ained

here-

ort

re alsohere-

}if {

create a pattern with the character at offsetx and wildcards at any other positionsetadd the pattern to ,

}}

}if no pattern has support {

add to}return

}

If is an interval of the real axis, instead, one can use a similar deterministic procedure toall sets of values that satisfy the distance metric constraints. This can be done efficientlinstance, (a) by sorting the values , for each and for a fixed value ofx; (b)starting at each value, by counting how many consecutive values satisfy the distance metristraints; (c) by removing any set that is a subset of another set.

If none of the patterns is an equivalent sub-pattern of , i.e., it has the same support ofis maximal. In that case, if it has at least full characters, it is a reported pattern and

added to . As a result, is never contained in .

Example: let S = ABCDEABCCEABCDAbe a string andAB be a pattern. Given the constraint, the extension interval is . If the only similarity class is [CD], the

the only array entries with a positive count are ,, and . As a consequence, the following patterns c

be generated:

and

Since both and are equivalent sub-patterns of , the latter is not a maximal pattern

4.3 Enumerate Operatorhas been used to list all possible valid, single full character extensions of . Then

purpose of the enumerate operator is to create all the possible valid patterns wextend by one, two, or more full characters in the extension interval . These can be obtby adding together any possible subset of . For instance, if =AB and contains the pat-ternsAB.C.. andAB..D. , then the pattern also extendsAB.Any such pattern is valid by definition because it is formed by adding valid patterns and can tfore only be denser. Resulting patterns that have support less than are eliminated.

Let us define as the set of all possible patterns with supp, obtained by adding any possible subset of .

Theorem 5: The set is a Seed Set for any final pattern that is a child of .

Note that, since independent subsets of are unique, the prerequisites of Theorem 4 asatisfied. That is, no two patterns of are either identical or have a parent-child relation. Tfore, from Theorems 5 and 4, is a Canonical Seed Set.

C x Πi,[ ] J0≥π′ Πi

I ππ̇ π π′⊕=

π̇ R

π̇ R∈ Wπ̇ Wπ=π Qs

R

Σ

s w l x+ +( ) w W π( )∈

π̇ π ππ K0 k0≥

Qs π π̇{ }

k0 3 l0, 5= = I AB 1 3,[ ]=C 1 C,[ ] 2 7 12, ,{ }= C 2 D,[ ] 2 12,{ }=

C 2 [CD],[ ] 2 7 12, ,{ }= C 3 E,[ ] 2 7,{ }=

x 1= π̇0 ABC.. Wπ0; 0 5 10, ,{ }= =

x 2= π̇1 AB.D. ; Wπ10 10,{ }= = π̇2 AB.[CD]. Wπ2

; 0 5 10, ,{ }= =

x 3= π̇3 AB..E Wπ3; 0 5,{ }= =

π̇0 π̇2 AB

Th π̇{ } πTe π̇{ }( )

π I ππ̇{ } π π̇{ }

AB.CD. AB.C.. AB..D.+=

J0

π̃{ } Te π̇{ }( ) Te Th π Qs,( )( )= =j J0≥ π̇{ }

π̃{ } π

π̇{ }π̇{ }

Tm π̃( )


ing

recur-

Seed

set

.

eedll

The Enumerate operator can be defined recursively and efficiently by the followpseudocode:

{if( {

return} else {

define a results set ;for each pattern in {

initialize a temporary empty set ;for each pattern in such that {

computeif has support greater or equal than {

insert in ;}

}

}}

}return ;

}

4.4 SplashGiven the Seed-Set the Splash algorithm can be compactly represented by the followingsive notation:

(4.5)

where

(4.6)

Pattern discovery terminates at then-th. iteration, if . The following five proper-ties of the algorithm have been demonstrated:

1. An initial Seed Set can be constructed for any set of constraints: From Theorem 1, is aSet for any choice of .

2. At any iterationn, is a Seed Set for any final pattern not contained in the result. This follows from Theorems 2, 3 and 5.

3. As a consequence, when , all final patterns are contained in .

4. The algorithm converges. At each iterationn, patterns in contain at least onemore full character than patterns in . Therefore, for ,

5. Patterns in are maximal. This follows by the definition of the histogram operator.

6. The algorithm is efficient. Theorem 4 proves that the set is a Canonical SSet for any final pattern not in . This is, by definition, the smallest set of patterns that aremaining final patterns can be derived from.

Te π̇{ }( )

Te π̇{ }( )π̇{ } ∅=

∅

Rs π̇{ }=π̇i π̇{ }

Ts ∅=π̇i ′ π̇{ } i ′ i>

π̃ π̇i π̇i ′+=π̃ J0

π̃ Ts

Rs Rs Te Ts{ }∪=

Rs

Ps*

Tsn

Ps* Qs,( )Ts Ps* Qs,( )

Ts Ts

n 1–Ps* Qs,( )( )=

Teh Tm Ps*( ) Qs,( )=

Teh π{ } Qs,( ) Te Th π Qs,( )( )π∪=

Tsn

Ps* Qs,( ) ∅=

Ps*l0 k0 J0 K0, , ,

Tsn

Ps* Qs,( )Qs

Tsn∅ Ps* Qs,( ) ∅= Qs

Tsn 1+

Ps* Qs,( )Ts

nPs* Qs,( ) n∅ L≤ Ts

n∅ Ps* Qs,( ) ∅=

Qs

Tm Tsn

Ps* Qs,( )( )Qs


y other

, canrithmuired

ason

he par-at

havebino-here-

l-

tt andper-wardsset of

4.5 ParallelismAs shown in previous sections, each Seed Pattern can be processed independently of anseed pattern, as there are no global dependencies in any of the operators. Even thelc-maximaloperator, , which must verify that is not a sub-pattern of any other maximal patternbe implemented without any knowledge of other discovered patterns. It follows that the algois embarrassingly parallel in the number of patterns in the Seed Set. The only globally reqinformation is the input stringS. Also, final results must be collected and reported.

A typical approach to parallelize this kind of algorithm is to use a function such, where is a hash function that returns a positive integer based

the pattern and is the number of available processors. Then, each node operates on tt ia l Seed Set and returns a part ia l resul ts set such th

. More sophisticated load-balancing functions can be used as well.

5. The Statistical Significance of PatternsIn (Stolovitzky and Califano, 1999), it is shown how the average number of maximaljkl-patternsin a random databases of lengthL, with a density constraint, is given by

(5.1)

In the above expression, is the number of -valid combs that have a spanl, andlengthk. is given by the following expression,

, (5.2)

where is the probability of the value to occur in the sequence and

(5.3)

is the probability of the value to randomly match any other character inS. Also, andare respectively the probability that a givenjkl-pattern is maximal in composition and length. Fromthis analysis, it is possible to estimate the probability that any discovered pattern wouldoccurred in a random database of similar size and composition. This probability is close to amial distribution and its first and second moment, its mean and variance, are well defined. Tfore, it is useful to compute az-score as:

(5.4)

where is the number of discoveredjkl-patterns. Full details of this analysis, which is in excelent agreement with experimental data, are available in (Stolovitzky and Califano, 1999).

6. PerformanceTo test the scaleability of the algorithm we have run a range of comparison tests against PraMEME. The tests show that Splash significantly outperforms both algorithms in terms of rawformance, limitations on discovered patterns, and scaleability. Because MEME is geared tothe discovery of the most conserved regions in protein families rather than of the exhaustive

π

Tm π( ) π

f π( ) g π( ) mod Npr[ ]= g π( )π Npr

Ps* Npr( ) f Ps*( )= Qs Npr( )Qs Qs Npr( )∪=

l0 k0,⟨ ⟩

njkl⟩ N0 k l ),( ) L l– 2+j

℘⟨ ⟩k j 1–( ) 1 ℘⟨ ⟩k–[ ]L l– 2– j–

pin⟨ ⟩ pout⟨ ⟩≈

N0 k l,( ) l0 k0,⟨ ⟩℘⟨ ⟩

℘⟨ ⟩ p Σ( ) ℘ Σ( )[ ]Σ∑=

p Σ( ) Σ

℘ Σ( ) p Σ′( )H Σ Σ′,( )Σ′∑=

Σ pin pout

znjkl njkl⟨ ⟩–

σnjkl

-----------------------------=

njkl


.2 on

tabaseelectedminosame

0% of.

se isy

in-s are

lash is8 KRes,hilewould

et al.,case

,

n oure foras a

agni-

tabase,

conserved patterns, it cannot be compared directly. This will be discussed in Section 6page 17.

6.1 PrattIn Fig. 1, we report the performance of Splash and Pratt as a function of an increasing dasize. The database is produced as follows. First an appropriate sequence sample set is sfrom the Brookhaven PDB database to obtain the desired size. Then the position of the aacids are randomized. This results in a random database of an appropriate length, with theamino acid frequency as the corresponding PDB sample. Patterns that occur in more that 2the sequences in the database are reported. Total size of the databases is , withDatabases for values ofi ranging from 0 to 6 are processed. The largest random databaapproximately 512,000 residues. On the leftY axis, we report the time, in seconds, required bSplash and Pratt to process the random databases in log scale. On theX axis, we report the data-base size, also in log scale. On the rightYaxis, we report the number of discovered patterns in lear scale. This is shown by the curve with the diamond symbol. The density constraint

, . The maximum memory footprint of the program is 12MB.

The patterns discovered by Pratt are identical to those discovered by Splash. However, Spincreasingly faster than Pratt as the database size increases. At the smallest database size,it is about 6 times faster. At size 128 KRes., it is about 3 orders of magnitude faster. Also, wpatterns reported by Pratt are limited to a maximum span. 50 characters in this case, Splashdiscover any pattern that satisfies the density and support constraints, no matter how long.

We report similar performance measurement against a histone I database (Makalowska1999), with 209 proteins, at increasingly higher values of the support. This is an interestingbecause this database is pattern-rich, generating in excess of 10,000 patterns for ,and .

In Fig. 2(a) and (b), we report discovery performance for Splash and Pratt. Pratt crashes omachine for support . In Fig. 2(a), time is reported as a function of an increasing valu

, the minimum number of sequences containing the pattern. In Fig. 2(b) time is reportedfunction of the number of discovered patterns. For , Splash is almost 2 orders of mtude faster. The maximum memory footprint of the program is 8MB for .

Fig. 1: Splash and Pratt: Time versus Database Size. Pattern discovery time is reported versus dasize. Identical patterns are reported by the two algorithms. Discovery parameters are ,support is 20% of sequences in databases.

8192 2i⋅ 0 i 6≤ ≤

k0 2= l0 5=

k0 2= l0 5=J0 100=

8K 16K 32K 64K 128K 256K 512K

Database Size in Thousands of Residues

1

10

100

1000

Dis

cove

ry T

ime

(Sec

.)

Pattern Discovery Performance

Splash

Pratt

0

500

1000

1500

2000

2500

Patterns

Patterns

k0 2= l0 5=

J0 180<J0

J0 180=J0 100=


ervedontextom-

coverwithe 5

econds.nces,

with

revi-

samehe pat-

his isse oriety ofic acidsesig-

tically,, iso 2 ins well,fam-

er and

for

6.2 MEMEMEME, which is based on a PSSM model, is geared towards the discovery of the most conscontiguous regions in a protein set. Also, the density constraint does not make sense in the cof PSSM. Therefore a straightforward comparison is difficult and probably inappropriate. To cpare the performance of the two algorithms, therefore, we only tested Splash’s ability to disthe most conserved regions in a protein set. We run against the lipocalin file which is includedthe MEME distribution. MEME takes 8.9 seconds to discover the two top patterns in thsequences. Splash discovered two patterns that extend over the same regions in 1.16 sThese were the two most statistically significant pattern among those that occur in all sequefor , and . As discussed, this is the standard initial density constraint usedSplash to discover PROSITE-like patterns or motifs.

To test scaleability, we also performed the analysis of the histone I family described in the pous section. Splash takes 8.7 seconds to discover the motifG.S...[ILMV]...[ILMV] whichoccurs in every sequence. This motif is discussed in detail in Section 8. on page 19. On theCPU, MEME takes 3771.38 seconds to discover a PSSM spanning the first 6 characters of ttern discovered by Splash.

6.3 Density ConstraintsFinally, in Fig. 3(a) and (b), we study the impact of the density constraints on performance. Tan important parameter, as it determines the ability of the algorithm to discover either densparse patterns. Here we report the performance of pattern discovery, for a substantial vardensity constraints, against a set of 124 proteins, about 150 KRes, that contain the aspartand asparagine hydroxylation site. For we use . For , we u

. Larger values of result in patterns that are too sparse and not statisticallynificant.

This protein set has been obtained from the PROSITE entry PS00010. The goal is to automadiscover the PROSITE signatureC-x-[DN]-x(4)-[FY]-x-C-x-C . The support constraint, thenis set at , or 100% of the proteins in the set. BLOSUM50, with a thresholdused to produce the similarity metric. That is, pairs of amino acids that score better or equal tBLOSUM50 are considered similar. Other values for the support parameters are possible afor instance, to discover PROSITE signatures that do not occur in all proteins of a functionalily. A full scale analysis of the PROSITE database, however, is beyond the scope of this paphas been conducted separately. Results will be reported in a follow-up paper.

A pattern virtually identical to the PROSITE signatures (C-x-[DN]-x(4)-[FYW]-x-C-x-C ) iscorrectly discovered as the most statistically significant pattern for and

Fig. 2: Splash vs. Pratt (a) vs. pattern supportj (b) vs. the number of discovered patterns

100 120 140 160 180 200

Sequence Support

1

10

100

Dis

cove

ry T

ime

(Sec

.)

(a) Histones I and V: Time vs. support

SplashPratt

100 1000 10000

Discovered Patterns

1

10

100

Dis

cove

ry T

ime

(Sec

.)

(b) Histones I and V: Time vs. Patterns

SplashPratt

k0 4= l0 8=

k0 2= l0 2 4 8 16 32, , , ,= k0 4=l0 4 8 16 32, , ,= l0

J0 124= h0 2=

k0 2 l0, 16 32,= =


me-small

rainterable

aly-e beeneach

d withyhoff etMs.

er ofery,wn

0 res-oper-

es notf the

ith aginallutiong dis-met-hicho pat-bles

0

. Fig. 3(b) reports the number of discovered patterns for the various parater choices. As shown here, there are no patterns discovered for and just anumber of patterns, with two amino acids only and low statistical significance (e.g.,G......D ),for . Fig. 3(a) shows the time, in seconds, as a function of the density constparameters. As can be seen, the efficiency of the algorithm allows one to explore a considrange of density and support parameters in minimal time.

7. Sensitivity of Identity vs. Similarity MetricsTo analyze the increase in sensitivity resulting from similar pattern discovery, the following ansis has been performed. First, ten identical copies of a random sequence of length 40 havgenerated with composition identical to the histone I family (Makalowska et al., 1999). Then,sequence has been padded left and right to a predefined lengthL with random amino acids usingthe same composition. From this set, subsequent mutated populations of identical length anthe same number of sequences have been generated using the procedure described in (Daal., 1978), starting at an evolutionary period of 5 PAMs and increasing that in steps of 2 PAFinally pattern analysis is performed using either an identity or a similarity metric. The numbstatistically significant patterns (z-score > 3) are computed. Results for identical pattern discovfor increasing values ofL, are shown in Fig. 4(a). Results for similar pattern discovery are shoin Fig. 4(b).

The goal of this exercise is to model the likelihood of detecting an active site (chosen to be 4idue in length) which has been preferentially mutated to preserve amino acids of similar prties, in sequences of different length. The statistical significance of pattern afterx PAMs is ameasure of the correlation of the original sequences. This is an oversimplified model and doaccurately model evolution. However as discussed in (Dayhoff et al., 1978), it is indicative oamount of conservation one can expect in a fairly small peptide region.

The similarity metric of Section 2.3 on page 5 has been used. This is based on BLOSUM50 wthreshold of 2. Populations have been produced by repeatedly applying PAM1 to the orisequences. We have deliberately mixed PAM and BLOSUM matrices, during simulated evoand analysis, to show that even with different models, similar patterns are better at pinpointintant relationships. The number of statistically significant patterns discovered with the identityric drops to zero quickly. With the exception of the shortest sequence length , wgenerates the smallest amount of noise (i.e., patterns that are not statistically significant), nterns is reported after more than eleven PAMs. As shown, the similarity metric practically douthe time horizon.

Fig. 3: Pattern discovery performance as a function of the density constraint. Runs on the PS0001PROSITE family with , . and , .

k0 4 l0, 16 32,= =k0 4 l0 16<,=

k0 2 l0 16<,=

4.0 8.0 12.0 16.0 20.0 24.0 28.0 32.0

Window Sizel0

0.0

10.0

20.0

30.0

Dis

cove

ry T

ime

(Sec

.)

(a) Discovery Time vs. Density

k0 = 2k0 = 4

4.0 8.0 12.0 16.0 20.0 24.0 28.0 32.0

Window Sizel0

0.0

20.0

40.0

60.0

80.0

100.0

Dis

cove

red

Pat

tern

s

(b) Discovered Patterns vs. Density

k0 = 2k0 = 4

k0 2= l0 2 4 8 16 32, , , ,= k0 4= l0 4 8 16 32, , ,=

L 100=


hid-equalnsity

h aasedllyni-d.

et al.,nd

roce-

egion.

ed as:

t

on-alysis,

thebularken.to dis-there

the

8. Experimental Results: HistonesAn important measure of a pattern discovery algorithm’s performance is its ability to discoverden motifs in large protein databases. A typical approach is to start with a minimum supportto a high percentage of the total number of proteins in the set, say 95%, and a fairly high deconstraint. Typical values are , . BLOSUM50 or BLOSUM62 can be used witthreshold of 1 or 2. If no interesting motif is discovered, first the density constraint is decre

, until patterns that are not statistically significant start appearing (typicafor a few hundred proteins). If statistically significant patterns still fail to occur, the mi

mum support is progressively dropped to increasingly lower values and the cycle is restarte

In this case, we have run Splash against a histone I database with 209 proteins (Makalowska1999). If a similarity metric defined by the BLOSUM50 matrix with a threshold of 2 is used athe density constraint has and (this is the second attempt in the above pdure), Splash takes 8.7 seconds to detects the following motif:

, ( , )

Pratt and Meme take respectively 66 seconds and 3771, to find basically the same protein r

Thez-score is computed from the analysis described in Section 5. on page 15. It is defin

, (8.1)

where and are, respectively, the mean and the variance of the number ofjkl-pat-terns that would occur in a random database of that size and composition. Herebecause we are considering each pattern independently. That is, thejkl-pattern occurred at leasonce.

Given the very large value of the , it is highly unlikely that a pattern like this would arise sptaneously from a database with a histone-like amino acid frequency of the same size. The anaccounts for the fact that two of the four residues match exactly and two match similarly.

This is confirmed empirically by the analysis of the 3D structures of the globular domain ofonly two histone proteins where the motif occurs in PDB. These are respectively the glodomain of the H1 (Cerf et al., 1993) and H5 (Ramakrishnan et al., 1993) histone from chicNote that although H1 and H5 are homologous, H5 proteins were not in the database usedcover the motif. As can be seen from Fig. 5(a) and (b) where the two motifs are highlighted,

Fig. 4: Statistically significant patterns as a function of simulated evolution. Different curves model detection of a 40 amino acid conserved region within a longer uncorrelated sequence.

5.0 10.0 15.0 20.0

PAMs

0.0

5.0

10.0

15.0

20.0

Sta

t. S

igni

fican

t Pat

tern

s

Evolutionary Motif Detection (Identity)

L = 100L = 200L = 300L = 400

5.0 10.0 15.0 20.0

PAMs

0.0

5.0

10.0

15.0

20.0

Sta

t. S

igni

fican

t Pat

tern

s

Evolutionary Motif Detection (Similarity)

L = 100L = 200L = 300L = 400

k0 4= l0 8=

l0 12 16 …, ,=( )l0 40≤

k0 4= l0 12=

π0 G.S...[ILMV]...[ILMV]= j 209= Zs 3.013 105⋅=

Zs

Zs π jkl[ ]n π jkl( ) n π jkl( )–

σ π jkl( )----------------------------------------=

n π jkl( ) σ π jkl( )n π jkl( ) 1=

Zs


1 and

alpha-thestoneause

s anll the

pat-s con-of a

pledin the

.

nd:

aThese

is significant structural similarity. This occurs in the region between residues 22 and 32 on Hbetween residues 44 and 54 on H5.

The glycine and serine are exposed on the surface of the protein in a loop at the end of anhelix. The two leucines in H1, mutated into isolucines in H5, contribute to the stability ofhydrophobic core on the alpha-helix. Although the structural properties of this area of the higlobular domain have not been fully understood, it could play an important structural role becit harbors the most conserved motif in the whole histone I family.

It should be noted that, in this case, the ability to discover patterns with a similarity metric playimportant role. In particular, there are no 3 full character sub-patterns of that appear in asequences. The sub-pattern , which occurs in 201 proteins, haswhich makes it virtually indistinguishable from background noise. In fact, there are about 170terns of this kind spread all over the input sequences. From a statistical point of view, patterntaining two full characters are not significant in this database. Therefore, without the usesimilarity metric, directly in the discovery process, pattern would not readily emerge.

9. Experimental Results: G-Protein-Coupled ReceptorsAnother result is shown from a superfamily of 918 known and hypothetical G-Protein-CouReceptors. Here, Splash has been used to discover statistically significant motifs that occurlargest number of protein sequences in a set, with a procedure identical to that of Section 8

If an identity metric is used, even with a very low density constraint defined by a, a single 4 amino acid pattern is found that occurs in more than half of the proteins

= N.......................DRY,( , )

This pattern contains the previously known and well preservedDRYtripeptide for family A. Basedon the , however, this pattern would be difficult to separate from the noise.

Under identical conditions, if the similarity metric defined by the BLOSUM50 matrix withthreshold of 2 is used with the same density constraint, almost 500 patterns are discovered.

Fig. 5: Globular domain of histone I and V proteins (a) Chicken histone H1 (b) Chicken histone H5

π0π3 G.S.....K= Zs 0.64–=

π0

k0 4=l0 30=π0 j 474= Zs 11.13–=

Zs


. The

Thisly 30

ative

uch as

tical

otif

turns

have

l run-s to a

f sta-sults

e the

imalilitych as

on con-ignifi-fs that

aticsitive

milarityntlynif-rom thes.

have at least 2 identical and 2 similar residues and occur in more than half of the proteinsmost frequent pattern with a is

= C.................[ILMV]..[ILMV]..[NDE]R...[ILMV] , .

This highly statistically significant pattern has 2 identical and 3 similar residue matches.motif, used to search against SWISS-PROT Rel. 36, returns 874 GPCR proteins and onpotential false positives. About half of these are hypothetical proteins or proteins with a putannotation.

Even a pattern with a significantly lower chance of appearing in a large database than , s

= C.................L..[ILMV]..[DEB][RQK][HFWY]..[ILMV] , .

with 2 identical and 5 similar amino acid matches, still occurs in more proteins than the idenpattern , and it has a much higher , vs. . Note that thez-score is not a mea-sure of how likely a motif is to occur in a database. Rather it is a measure of how likely the mwould be to occurj times in a random database of the same size and composition.

This motif is extremely selective. If it is used to search against SWISS-PROT Rel. 36, it re578 GPCR proteins and only 4 possible false positives. These are:

1. C555_METCA Cytochrome C-555.

2. KEMK_MOUSE: Putative serine/threonine-Protein Kinase EMK

3. SRG8_CAEEL: SRG-8 Protein.

4. UL29_HCMVA: Hypothetical Protein UL29.

Given the extremely low false positive ratio (0.68%), it is quite possible that either 2 or 4 maybeen mislabeled and should really be considered members of the GPCR family.

We have been unable to replicate these results using either MEME or Pratt. MEME was stilning after several days. PRATT crashed on our hardware configuration. We attributed thicombination of the database size, its low overall homology, and its high pairwise homology.

In this paper we limit ourselves to single pattern analysis. However, given the large number otistically significant yet different patterns that can be discovered, even better classification recan be obtained by using multiple statistically significant patterns simultaneously. This will bsubject of a follow-up paper.

10. ConclusionsThis paper introduces Splash, a new deterministic algorithm for identical, similar, and proxpattern discovery. The most significant contribution of Splash is its efficiency and scaleabwithout any sacrifice in terms of the completeness of the results. Very large biological sets, suthe full Swissprot or PDB database or entire genomes, can be processed by Splash in hoursventional workstation class hardware. These reduced computational requirements allow a scant portion of the search parameter space to be explored and studied. As a result, motiwould normally elude discovery can be quickly and systematically identified.

The resulting motifs can be used for a variety of biologically significant purposes, from automsequence clustering, to the definition of HMM and PSSM models for accurate and sensequence screening, to multiple sequence alignment.

The paper also shows that, as expected from pairwise sequence comparison, the use of a simetric significantly increases the chances of detecting statistically significant motifs in distarelated families. Finally, by coupling the algorithmic part with the analysis of the statistical sigicance of patterns, we have shown that interesting motifs can, in some cases, be separated flarge background of patterns that are not statistically significant discovered in large data set

Zs 3≥π1 j 619 Zs; 2.26 1034⋅= =

π0

π2 j 483 Zs; 2.46 1034⋅= =

π0 Zs 2.46 1034⋅ 11.13–


matedvery of

blebson,ttern

ver

.”

es.,

ardrchi-

nce-

-

n of(1993)

”

atl.

es,”

uta-

ic loci,”

tein

Splash is currently being used for a wide variety of analysis tasks that range from the autofunctional and structural annotation of orphan sequences, to the systematic, automatic discoPROSITE motifs.

AcknowledgmentsWe would like to thank Gustavo Stolovitzky, Ajay Royyuru, and Laxmi Parida for their valuahelp with this paper and for many useful suggestions. We would also like to thank Barry RoAris Floratos, Yuan Gao, and Mike Pitman for the many useful discussions on the topic of padiscovery.

ReferencesBailey T.L. and Elkan C., 1994 “Fitting a mixture model by expectation maximization to disco

motifs in biopolymers.” in Proc. of 2nd ISMB Conf., 28-36, AAAI Press, Menlo Park

Bailey T.L. and Gribskov M., 1998 “Methods and statistics for combining motif match scoresJ.Comp.Bio. 5:211-221.

Bairoch. A, 1991 “PROSITE: a dictionary of sites and patterns in proteins.” in Nucleic Acids R19:2241-2245.

Bernstein F.C., Koetzle T.F., Williams G.J.B., Meyer Jr. E.F., Brice M.D., Rodgers J.R., KennO., Shimanouchi T., and Tasumi M., 1977 “The Protein Data Bank: A Computer-based aval file for macromolecular structures,” J. Mol.Bio. 112:535-542.

A.Brazma et al.: “Approaches to the Automatic Discovery of Patterns in Biosequences,”J.Comp.Bio. 5(2):279-305, (1998)

C. Bystroff and D. Baker, “Prediction of Local Structure in Proteins Using a Library of SequeStructure Motifs,” in J.Mol.Bio 281:565-577, (1988)

A.Califano, I.Rigoutsos: “FLASH: A Fast Look-up Algorithm for String Homology,” in Proceedings of 1993 ISMB, pages 56-64, Bethesda, MD, (1993)

C.Cerf, G.Lippens, S.Mulyldermans, A.Segers, V.Ramakrishnan, S.J.Wodak, K.Hallenga,L.Wyns, “Homo- and heteronuclear two-dimensional NMR studies of the globular domaiHistone H1: Sequential assignment and secondary structure,” Biochemistry 32:11345,

M.O. Dayhoff, R.M.Schwartz, and B.C.Orcutt: “A Model of Evolutionary Change in Proteins,Atlas of Protein Sequence and Structure, ed. Dayhoff, M.O., 345-352, (1978)

J.C.H. Gerretsen, “Lectures on Tensor Calculus and Differential Geometry,” ed. P.NoorthoffN.V.Groningem, (1962)

S.Henikoff and J.G.Henikoff: “Amino acid substitution matrices from protein blocks.” Proc. NAcad. Sci. USA. 89:10915-10919, (1992)

I.Jonassen, J.F.Collins, D.G.Higgins. “Finding flexible patterns in unaligned protein sequencProtein Science 4:1587-1595, (1995)

A.Krogh, M.Brown, I.S.Mian, K.Sjoelander and D.Haussler, “Hidden Markov models in comptional biology. Applications to protein modeling,” J. Mol. Bio., 235:1501-1531, (1994)

Izabela Makalowska, Erik S. Ferlanti, Andreas D. Baxevanis and David Landsman, “HistoneSequence Database: sequences, structures, post-translational modifications and genetNucleic Acids Research, 27(1):323-324, (1999)

Neuwald, Liu, & Lawrence, “Gibbs motif sampling: detection of bacterial outer membrane prorepeats,” Protein Science 4:1618-1632, (1995)

Nevill-Manning, C.G. Wu, T.D. & Brutlag, D.L. “Highly Specific Protein Sequence Motifs forGenome Analysis,” Proc. Natl. Acad. Sci. USA, 95(11):5865-5871, (1998)


93)

IRE-

ein

ts an

d to

37-

gd thepportthat oftly

to

r-

the

inition

orThen,

,rst

snonthe

nerator

L.Parida, A.Floratos and I.Rigoutsos, “An Approximation Algorithm for Alignment of MulitpleSequences using Motif Discovery,” J. Comb. Opt., 3:(2/3):247-275, (1999)

V.Ramakrishnan, J.T.Finch, V.Graziano, P.L.Lee, R.M.Sweet “Crystal Structure of globulardomain of histone H5 and its implications for nucleosome binding,” Nature 362:219, (19

I.Rigoutsos and A.Floratos: “Combinatorial pattern discovery in biological sequences: the TESIAS algorithm,” Bioinformatics 14(1):56-67, (1998)

R.M.Schwartz and M.O.Dayhoff: “Matrices for Detecting Distant Relationships,” Atlas of ProtSequence and Structure, ed. Dayhoff, M.O., 353-358 (1978)

L. Shapiro and P. E. Scherer, “The crystal structure of a complement-1q family protein suggesevolutionary link to tumor necrosis factor.” Current Biology, 8:335 - 338 (1998)

G.Stolovitzky and A.Califano: “Pattern Statistics in Biological Datasets,” to be communicateJ.Comp.Bio, available at www.research.ibm.com/topics/popups/deep/math/html/statis-tics.pdf (1999)

L.Wang and T.Jiang: “On the complexity of multiple sequence alignment,” J.Comp.Biol. 1(4):3338, (1994)

M.S.Waterman, “Introduction to Computational Biology,” Chapman & Hall, London, (1995)

Appendix ATheorem 1, Proof:Given any final pattern onS, let us construct , a parent of , by selectinits leftmost full characters. If is a final pattern, it must have at least full characters anfirst ones must span no more than positions. Given that must also satisfy the surequirements , then also satisfies it because, as a super-pattern, its locus includes

. Then, is contained in the set because this set contains all patterns that have exacfull characters over at most positions and satisfy the support requirements.

Theorem 2, Proof:Let us define , a parent of a final pattern of lengthk, obtained by remov-ing the lasti full characters of , with . By definition, the support of is greater or equalthat of because . Therefore, if were notlc-maximal then, any child of ,including , would also not belc-maximal. Since any parent of is of the form , then any paent of a final pattern must belc-maximal.

Theorem 3, Proof:Consider the set of all the valid, equivalent sub-patterns of a pattern . Ifset is empty, then is a final pattern by definition. Otherwise, If the stringSis finite, there must beone such sub-pattern that does not have any other valid, equivalent sub-pattern. By defthat pattern is a final pattern. is a parent of because, by definition, anlc-maximal pattern doesnot have any equivalent sub-pattern that is not a child.

Theorem 4, Proof: From theorem 3, it follows that can only contain final patternsparents of a final pattern . Suppose there are two independent parents of in .say that the first one, , contains thek leftmost full characters of and that the second one,contains the leftmost full characters of . If , then which violates the fiassumptions. If , is a child of , which violates the second assumption.

Theorem 5, Proof: If is maximal, it has no valid sub-patterns and satisfieour requirements. Otherwise, let be a final pattern which is a child of . There must be aempty set of full characters in that occur at an offset after . Otherwise,density constraint would be violated. Each such full character must occur at least times iSatthe offset after . Then, each one of these full characters is detected by the histogram op

π πk0π

k0 π k0k0 l0 π

j J0≥ πk0π πk0

Ps* k0l0

π i– ππ i k< π i–

π W π i–( ) W π( )⊇ π i– π i–

π π π i–

ππ

π′π π′

Tm π{ }( )π π Tm π{ }( )

πk π πk′k′ k≤ π k k′= πk πk′≡

k′ k< πk πk′

π π̃{ } π̇{ } ∅= =π π

Πi xi( ){ } π xi I π∈ πJ0

xi π


n bet .on-pat-

and it is used to extend , by a single full character, into a pattern . The latter cawritten as , where consists of wildcards and the character at the offseThe set of all the patterns , formed from , is therefore a subset of . By cstruction, the pattern obtained by adding all the together is contained in the set . Thistern is also a parent of since itsk full characters are the firstk full characters of .

π π̇i π̇{ }∈π̇i π πi⊕= πi I π Πi xi

π̇i{ } Πi xi( ){ } π̇{ }π̇i π̃{ }

π π


splash: structural pattern localization analysis by sequential histograms

Documents