Fast Search in Biological Sequences using Multiple Hash Functions
Algorithms & Complexity Evaluation
A T A C G T T C A G A T T G C C A G C A C G T T
Presentation by Simone Tino - All rights reserved. Authored from November 2012 to December 2012 - University of Catania - Faculty of Computer Science - Algoritmi e Complessità
Grasping the problem
string matching??? what’s this?
We are going to deal with a very tiny alphabet representing the nucleotides in a genetic sequence.
Alphabet: A (adenine), T (thymine), G (guanine), C (cytosine)
DNA sequence: A T G A C G A C T
Patterns to search: T G T C G, A G G C A, T G A G C
Searching a sequence for multiple patterns.
A search window is placed on the DNA sequence and every pattern is compared against it.
After verifying matches, advance the window by 1 position: pos++
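The window-by-window scan described above can be sketched in Python as follows. This is only an illustrative baseline, not code from the presentation: the function name `naive_multi_search` and the short patterns used in the usage note are assumptions made for the example.

```python
def naive_multi_search(text, patterns):
    """Check every pattern at every window position, then shift by 1."""
    hits = []
    pos = 0
    while pos < len(text):
        for p in patterns:
            # str.startswith with an offset compares p against text[pos:pos+len(p)]
            if text.startswith(p, pos):
                hits.append((pos, p))
        pos += 1  # advance the window: pos++
    return hits
```

With the slide's sequence and two hypothetical patterns, `naive_multi_search("ATGACGACT", ["GAC", "ACG"])` reports `GAC` at positions 2 and 5 and `ACG` at position 3. Every position is visited, which is the cost the following slides try to avoid.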
Let's talk about Wu & Manber
Don't worry! It's not a magic spell... it's just an algorithm
First we have the pre-processing stage...
A pattern, NOT a text!
T G A G C A C T G
q-gram size: q = 3
Extracting the first q-gram: T G A
Feeding the hash function with the extracted q-gram, a hash is returned, 0 <= hash <= MAX: HASH(T G A) = #@!*$%£&?
The calculated hash is used as an index in the shift array; the stored value is used to shift the window: sh[#@!*$%£&?] = shift
The pattern is filed under the hash of its last q-gram: F[HASH('CTG')] = patterns[cur]
Then we can move to the real search...
Now... a text!
C T G A C C G C T C C
T G A G T A G T A G A
G T A G C G T G A G C
A C A A C T G G C G A
Window: A C A A C T G G C (window size = pattern size = m)
Extracting the last q-gram only: G G C
The hash function gets the q-gram, a hash is returned, 0 <= hash <= MAX: HASH(G G C) = ^@!*%£$?#
The hash is the shift index: sh[^@!*%£$?#] = shift
shift == 0? If true: NAIVE CHECK
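The two stages above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the presenter's implementation: the 2-bit nucleotide encoding in `CODE`, the concrete `hash_gram` function, and the names `wm_preprocess` and `wm_search` are all hypothetical. The shift value `m - q - i` for a q-gram starting at pattern index `i` follows the formula shown later in the deck.

```python
Q = 3                                     # q-gram size, as on the slide
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding per nucleotide
MAX = (1 << (2 * Q)) - 1                  # hashes fall in 0..MAX

def hash_gram(gram):
    """Pack the q-gram into an integer, 2 bits per character (assumed hash)."""
    h = 0
    for ch in gram:
        h = ((h << 2) | CODE[ch]) & MAX
    return h

def wm_preprocess(patterns):
    m = min(len(p) for p in patterns)     # window size = shortest pattern
    if m < Q:
        raise ValueError("every pattern must hold at least one q-gram")
    sh = [m - Q + 1] * (MAX + 1)          # default: q-gram occurs in no pattern
    F = {}                                # hash of last q-gram -> candidate patterns
    for p in patterns:
        for i in range(m - Q + 1):
            h = hash_gram(p[i:i + Q])
            sh[h] = min(sh[h], m - Q - i) # keep the smallest (safest) shift
        F.setdefault(hash_gram(p[m - Q:m]), []).append(p)
    return m, sh, F

def wm_search(text, patterns):
    m, sh, F = wm_preprocess(patterns)
    hits, pos = [], 0
    while pos + m <= len(text):
        h = hash_gram(text[pos + m - Q:pos + m])  # last q-gram of the window
        shift = sh[h]
        if shift == 0:
            for p in F.get(h, []):        # naive check of the candidates only
                if text.startswith(p, pos):
                    hits.append((pos, p))
            pos += 1
        else:
            pos += shift
    return hits
```

For instance, `wm_search("ACAACTGGCGATGAGCACTG", ["TGAGCACTG", "ACAACTGGC"])` inspects only four window positions instead of twelve, because non-zero entries of `sh` let the window jump ahead.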
W-M limit
cannot increase them both...
A q-gram such as T G A is hashed into a word of w bits, k bits per character:
k = Math.floor(w/q);
Increase q: more text to analyze
Increase k: more bits per char
Goal in both cases: decrease the number of false positives
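The trade-off between q and k can be made concrete with a short Python sketch. It assumes, purely for illustration, that the hash concatenates the low k bits of each character code; the slides do not state how the hash is built, so `bits_per_char` and `hash_gram` are hypothetical names and this is not the presenter's hash function.

```python
def bits_per_char(w, q):
    """k = floor(w / q): bits available to each character of the q-gram."""
    return w // q

def hash_gram(gram, w):
    """Assumed hash: concatenate the low k bits of each character code."""
    k = bits_per_char(w, len(gram))
    mask = (1 << k) - 1
    h = 0
    for ch in gram:
        h = (h << k) | (ord(ch) & mask)
    return h  # always below 2**w, because q * k <= w

# For a fixed word size w, a larger q leaves fewer bits per character.
for q in (2, 3, 4, 8):
    print(q, bits_per_char(16, q))
```

Under this assumed hash, k = 2 is already too small for ASCII nucleotides: the two low bits of `C` and `G` coincide, so `hash_gram("CC", 4)` equals `hash_gram("GG", 4)`, while with k = 3 (`w = 6`) the two q-grams hash differently.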
to be continued...
Enhancing W-M... pre-processing
T G A G C A C T G
γ = 1
HASH(T G A) = #@!*$%
sh1[#@!*$%] = m-q-i
HASH('CTG') = h1
γ = 2
HASH(T G A) = #@!*$%
sh2[#@!*$%] = m-2q-i
HASH('GCA') = h2
h = (h1 << 1) + h2
F[h] = patterns[cur]
...now you can’t go back
...Enhancing W-M search
A text:
C T G A C C G C T C C
T G A G T A G T A G A
G T A G C G T G A G C
A C A A C T G G C G A
Window: A C A A C T G G C
HASH(G G C) = §+!#*£$?% = h1
sh1[§+!#*£$?%] = shift1
HASH(A C T) = ^@!*%£$?# = h2
sh2[^@!*%£$?#] = shift2
h = (h1 << 1) + h2
In the end...
if (shift1 == 0 && shift2 == 0) foreach (p in F[h]) checkOccurrInWin(p);
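The two-hash variant (γ = 2) can be sketched in Python as follows. Several details are assumptions rather than statements from the slides: the 2-bit `CODE` encoding and `hash_gram`, the names `wm2_preprocess` and `wm2_search`, the default shift values for unseen q-grams, and in particular the rule of advancing by `max(shift1, shift2)` when the two shifts are not both zero. The table formulas `m-q-i` and `m-2q-i` and the combined key `(h1 << 1) + h2` are taken from the slides.

```python
Q = 3
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding per nucleotide

def hash_gram(gram):
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h

def wm2_preprocess(patterns):
    m = min(len(p) for p in patterns)
    if m < 2 * Q:
        raise ValueError("patterns must hold two consecutive q-grams")
    size = 1 << (2 * Q)
    sh1 = [m - Q + 1] * size              # table for the last q-gram
    sh2 = [m - 2 * Q + 1] * size          # table for the q-gram before it
    F = {}
    for p in patterns:
        for i in range(m - Q + 1):
            h = hash_gram(p[i:i + Q])
            sh1[h] = min(sh1[h], m - Q - i)
            if i <= m - 2 * Q:
                sh2[h] = min(sh2[h], m - 2 * Q - i)
        h1 = hash_gram(p[m - Q:m])
        h2 = hash_gram(p[m - 2 * Q:m - Q])
        F.setdefault((h1 << 1) + h2, []).append(p)
    return m, sh1, sh2, F

def wm2_search(text, patterns):
    m, sh1, sh2, F = wm2_preprocess(patterns)
    hits, pos = [], 0
    while pos + m <= len(text):
        end = pos + m
        h1 = hash_gram(text[end - Q:end])
        h2 = hash_gram(text[end - 2 * Q:end - Q])
        shift1, shift2 = sh1[h1], sh2[h2]
        if shift1 == 0 and shift2 == 0:
            for p in F.get((h1 << 1) + h2, []):
                if text.startswith(p, pos):   # plays the role of checkOccurrInWin
                    hits.append((pos, p))
            pos += 1
        else:
            pos += max(shift1, shift2)        # assumed rule: both are safe lower bounds
    return hits
```

Each table gives a lower bound on how far the window must move before its q-gram can line up with some pattern, so taking the larger of the two should never skip an occurrence. Requiring both shifts to be zero before the naive check is what cuts down the false positives compared with a single hash.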
Complexities
Pre-processing
Space requirement: $O((\mathrm{MAX} + 1) + r)$
Time requirement: $O(\mathrm{MAX} + r\,m\,q)$
Search phase
Time requirement: $O(m(1) \cdot n)$, where $m(1) = \sum_{i=1}^{r} \mathrm{len}(p_i)$
Experimental results
Comparison of execution times between WM(q,γ) and MBNDM, one of the fastest algorithms currently in the literature
[Figure: three plots of time against w (w = 8, 16, 32, 64, 128), each comparing the best WM(q,γ) with MBNDM. Best-variant labels are listed in the order they appear on each plot.]
|P| = 100: WM(6,1), WM(8,1), WM(8,1), WM(8,1), WM(8,1)
|P| = 1000: WM(4,2), WM(8,1), WM(8,2), WM(8,3), WM(8,3)
|P| = 10000: WM(4,2), WM(8,2), WM(8,2), WM(8,2), WM(8,2)
The End