Indexing DNA Sequences Using q-Grams
Adriano Galati & Bram Raats
Indexing DNA Sequences Using q-Grams
A method for indexing DNA sequences efficiently, based on q-grams, to facilitate similarity search in a DNA database
Goal: to sidestep the linear scan of the entire database
Proposed data structures, both based on the q-grams:
Hash table
c-trees
These data structures allow quick detection of sequences
Introduction
Two sequences share a certain number of q-grams if the edit distance (ed) between them is within a certain threshold
Since there are 4 letters, there are $4^q$ possible q-gram combinations
Two-level index to prune data sequences
Introduction (2): Two-level index
Two-level index to prune data sequences:
1. First level
Clusters of similar q-grams in DNA (qClusters) are generated
A typical hash table is built on the segments with respect to the qClusters
2. Second level
The segments are transformed into c-signatures based on their q-grams
A new index, called the c-signature trees (c-trees), is proposed to organize the c-signatures of all segments of a DNA sequence for search efficiency
Edit distance
To process approximate matching, one common and simple approximation metric is the edit distance
Definition: the edit distance between two sequences is defined as the minimum number of edit operations (i.e. insertions, deletions and substitutions) of single characters needed to transform the first sequence into the second
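The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration and not code from the slides; the function name `edit_distance` is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("ACGGT", "ACGTT"))
```

The table is filled row by row, so only two rows are kept in memory at any time.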
Preliminaries
Intuition: two sequences have a large number of q-grams in common when the edit distance (ed) between them is within a certain number
Given a sequence S, its q-grams are obtained by sliding a window of length q over the characters of S
There are |S| - q + 1 q-grams for a sequence S
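A small sketch of the sliding window, assuming a helper named `qgrams` (not a name from the slides):

```python
def qgrams(s: str, q: int) -> list[str]:
    """All overlapping q-grams of s, in order of occurrence."""
    # range() is empty when len(s) < q, so short sequences yield no q-grams
    return [s[i:i + q] for i in range(len(s) - q + 1)]


grams = qgrams("ACGGTACT", 2)
print(grams)       # the window moves one character at a time
print(len(grams))  # |S| - q + 1 = 8 - 2 + 1
```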
Question (Bogdan)
1. I have noticed that the segments of the database text that are considered in this method are disjoint (see page 4, Introduction). I understand that for each segment all the consecutive, non-disjoint, q-grams are taken into consideration when computing the q-cluster and the c-signature of the segment. However, I am a bit puzzled that at the border between two adjacent segments nothing is done, which means that (q-1) q-grams are disregarded at each border. Since each segment contains w-q+1 q-grams, it means that overall a ratio of approximately (q-1)/(w-q+1) of all q-grams are disregarded (if we ignore the difference of 1 between the nr. of segments and the nr. of borders between adjacent segments). For common values of q=3 and w=30, this means about 7% of the q-grams. Do you see a solution for overcoming this problem?
Answer (Bogdan)
The aim is to improve efficiency by filtering, i.e. discarding the regions with low sequence similarity
Approximate sequence matching is preferred to exact matching in genomic databases, due to evolutionary mutations in the genomic sequences and the presence of noisy data in a real sequence database
q-gram Signature
There are $|\Sigma^q| = 4^q$ kinds of q-grams, with $\Sigma = \{A, C, G, T\}$
All the possible q-grams are denoted as $\{r_0, r_1, \ldots, r_{4^q-1}\}$
The q-gram signature is a bitmap with $4^q$ bits, where the i-th bit corresponds to the presence or absence of $r_i$
For a sequence S, the i-th bit is set to '1' if $r_i$ occurs at least once in S, else '0'
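A minimal sketch of the bitmap construction. It assumes the q-grams are numbered in lexicographic order over A, C, G, T (AA, AC, AG, AT, CA, ...), which reproduces the signature given for P = "ACGGTACT" in the c-signature example; the name `qgram_signature` is illustrative only.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed ordering of the four letters


def qgram_signature(s: str, q: int) -> list[int]:
    """Bitmap with 4**q entries; bit i is 1 when q-gram r_i occurs in s."""
    # Enumerate r_0 ... r_{4^q - 1} in lexicographic order (an assumption)
    all_qgrams = ["".join(p) for p in product(ALPHABET, repeat=q)]
    present = {s[i:i + q] for i in range(len(s) - q + 1)}
    return [1 if r in present else 0 for r in all_qgrams]


sig = qgram_signature("ACGGTACT", 2)
print(sig)
print(len(sig))  # 4**2 bits
```

A q-gram that occurs several times (AC here) still sets its bit only once, since the signature records presence and not counts.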
c-signature
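q-gram signature: $\mathrm{sig}^1(S) = (a_0, a_1, \ldots, a_{n-1})$, where $n = 4^q$
c-signature: $\mathrm{sig}^c(S) = (u_0, u_1, \ldots, u_{k-1})$, where $k = \lceil n/c \rceil$ and $u_i = \sum_{j=ic}^{(i+1)c-1} a_j$
$a_j = 0$ when $n \le j < kc$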
Example c-signature
P = “ACGGTACT”
The q-gram signature is (01 00 00 11 00 11 10 00), with $4^2 = 16$ dimensions, when q = 2
$\mathrm{sig}^2(P) = (1, 0, 0, 2, 0, 2, 1, 0)$
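The example can be checked with a short sketch that sums each run of c consecutive signature bits and pads with zeros when c does not divide the signature length. The function name `c_signature` is an assumption; the input bitmap is the q-gram signature of P listed above.

```python
import math


def c_signature(sig: list[int], c: int) -> list[int]:
    """Sum each block of c consecutive bits of a q-gram signature."""
    n = len(sig)
    k = math.ceil(n / c)
    padded = sig + [0] * (k * c - n)  # a_j = 0 for positions past n - 1
    return [sum(padded[i * c:(i + 1) * c]) for i in range(k)]


# q-gram signature of P = "ACGGTACT" for q = 2, as given in the example
sig_p = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
print(c_signature(sig_p, 2))
```

Each entry of the result lies between 0 and c, and the signature shrinks from $4^q$ to about $4^q / c$ entries.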
Hash table
Given $qClusters = \{qCluster_1, qCluster_2, \ldots, qCluster_\lambda\}$, any DNA segment s can be encoded into a λ-bit bitmap $e = (e_1, e_2, \ldots, e_\lambda)$ by the coding function:
if some q-gram of s belongs to $qCluster_i$ then $e_i = 1$, else $e_i = 0$
$coding(s) = \sum_{i=1}^{\lambda} 2^{i-1} e_i$
A hash table of size $2^\lambda$ is built with respect to the qClusters
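A sketch of the coding function under stated assumptions: the slides do not say how the qClusters are formed, so the four clusters below (q-grams grouped by their first letter) are purely hypothetical, as is the function name `coding`.

```python
def coding(segment: str, q: int, qclusters: list[set[str]]) -> int:
    """Hash key of a segment: bit i is set when the segment contains
    at least one q-gram that belongs to the i-th cluster."""
    grams = {segment[i:i + q] for i in range(len(segment) - q + 1)}
    key = 0
    for i, cluster in enumerate(qclusters, start=1):
        e_i = 1 if grams & cluster else 0
        key += 2 ** (i - 1) * e_i
    return key


# Hypothetical clustering for q = 2, lambda = 4: group q-grams by first letter
clusters = [{a + b for b in "ACGT"} for a in "ACGT"]
print(coding("ACGGT", 2, clusters))  # q-grams AC, CG, GG, GT set bits 1, 2, 3
```

With λ clusters the key lies in the range 0 to $2^\lambda - 1$, which is why the table needs $2^\lambda$ heads.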
Question (Jacob)
I can't get my hands on the c-Trees (mentioned first on page 9). Could you please explain how such a tree is built up, because I can't figure it out.
c-Trees
A group of rooted dynamic trees built for indexing c-signatures
The height l is set by the user
Given $\mathrm{sig}^c(S) = (u_0, u_1, \ldots, u_{k-1})$, there are $k/l$ trees $T_0, \ldots, T_{k/l-1}$
Each path from the root to a leaf in $T_i$ corresponds to the c-signature string $\mathrm{sig}^c_i(s) = (v_{il}, v_{il+1}, \ldots, v_{(i+1)l-1})$
At each internal node there are up to $c+1$ children
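Example c-Trees
Consider the five DNA segments:
$s_0$ = "ACGGT", $s_1$ = "CTTAG", $s_2$ = "ACGTT", $s_3$ = "TAAGC", $s_4$ = "GACGT"
With q = 2 and c = 2:
$\mathrm{sig}^2(s_0) = (0, 2, 0, 0, 1, 0, 0, 1)$
$\mathrm{sig}^2(s_1) = (0, 0, 1, 1, 0, 1, 0, 1)$
$\mathrm{sig}^2(s_2) = (0, 1, 0, 1, 1, 0, 0, 1)$
$\mathrm{sig}^2(s_3) = (1, 0, 1, 0, 1, 1, 0, 0)$
$\mathrm{sig}^2(s_4) = (1, 1, 0, 0, 1, 0, 0, 1)$
If $l = 4$, we get $4^q / (c \, l) = 2$ trees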
Example c-Trees(2)
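One way to picture how the trees are built is as a set of tries, one per c-signature string. The sketch below is an assumption about the structure and not the authors' implementation: nested dictionaries stand in for tree nodes, and the names `split_signature` and `build_c_trees` are invented. The data are the five c-signatures of the example with l = 4.

```python
def split_signature(csig, l):
    """Cut a c-signature into consecutive c-signature strings of length l."""
    return [tuple(csig[i:i + l]) for i in range(0, len(csig), l)]


def build_c_trees(csigs, l):
    """Build one trie per c-signature string position.

    csigs maps a segment id to its c-signature; all share one length k.
    """
    k = len(next(iter(csigs.values())))
    trees = [{} for _ in range(0, k, l)]
    for seg_id, csig in csigs.items():
        for tree, string in zip(trees, split_signature(csig, l)):
            node = tree
            for v in string:  # one tree level per signature entry
                node = node.setdefault(v, {})
            # the node reached after l steps is a leaf; record the segment
            node.setdefault("segments", []).append(seg_id)
    return trees


csigs = {
    "s0": [0, 2, 0, 0, 1, 0, 0, 1],
    "s1": [0, 0, 1, 1, 0, 1, 0, 1],
    "s2": [0, 1, 0, 1, 1, 0, 0, 1],
    "s3": [1, 0, 1, 0, 1, 1, 0, 0],
    "s4": [1, 1, 0, 0, 1, 0, 0, 1],
}
trees = build_c_trees(csigs, 4)
print(len(trees))                          # k / l = 8 / 4 trees
print(trees[1][1][0][0][1]["segments"])    # segments sharing the path 1,0,0,1
```

Segments whose c-signature strings coincide end up in the same leaf, so a whole group can be accepted or pruned with one traversal.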
Query Processing
The hash table (HT) and the c-trees (c-T) are built on the DNA segments
The query sequence Q is also partitioned into $|Q| - w + 1$ sliding query patterns $q_1, \ldots, q_{|Q|-w+1}$
Two-level filtering:
FLF: hash table based similarity search
SLF: c-trees based similarity search
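The difference between the two partitionings can be sketched as follows: the database is cut into disjoint segments of length w, while the query is cut into overlapping patterns of the same length. Both function names are illustrative, and dropping a trailing partial segment is an assumption the slides do not address.

```python
def disjoint_segments(database: str, w: int) -> list[str]:
    """Non-overlapping database segments of length w.
    A trailing piece shorter than w is dropped (an assumption)."""
    return [database[i:i + w] for i in range(0, len(database) - w + 1, w)]


def sliding_patterns(query: str, w: int) -> list[str]:
    """Overlapping query patterns q_1 ... q_{|Q| - w + 1} of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


print(disjoint_segments("ACGTACGTAC", 4))
print(sliding_patterns("ACGTAC", 4))
```

Sliding the window over the query means every alignment offset against a database segment is tried, even though the segments themselves do not overlap.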
Hash Table Based Similarity Search
The query pattern $q_i$ is encoded to a hash key $h_i$ (λ bits)
The neighbors (ngbr) of $h_i$ are enumerated
The neighbors $e_{ngbr}$ are the λ-bit encodings of the segments which are within a certain ed from $q_i$
Once $e_{ngbr}$ is enumerated, the segments in the bucket $HT[e_{ngbr}]$ are retrieved as candidates and stored into $C_{ht}$
c-Trees Based Similarity Search
The candidates $C_{ht}$ will be further verified by the c-trees
The c-signature of query q, $\mathrm{sig}^c(q)$, is divided into c-signature strings $\mathrm{sig}^c_i(q)$, $i \ge 0$
The algorithm retrieves the segments s which satisfy the range constraint on $ed(q, s)$
During query processing, computations are carried out for each leaf in the tree $T_1$
Space and Time complexity
Space complexity of the HT: $O(2^\lambda)$ for the table head and $O(|D|/w)$ for the buckets of the table
Thus the total space complexity for the hash structure is $O(2^\lambda + |D|/w)$
Time complexity for query: $O(q)$
Space complexity of each tree: $O(4^q / c)$
Question (Bogdan)
I have trouble understanding the graphic in Fig. 2(a). My intuition would tell me that the more common q-grams exist in the 2 sequences, the higher the probability of finding a high score alignment between them. However, the figure seems to show the opposite: as the nr. of q-grams increases, the probability decreases. I've obviously got something mixed up here, but I can't figure out what it is. Could you please explain?
The Sensitivity vs The Number of Common q-grams
Answer (Bogdan)
Sensitivity can be measured by the probability that a high score alignment is found by the algorithm
The graph starts with a probability of almost 1 when only 1 common q-gram is required; as the required number of common q-grams increases, the probability (sensitivity) of finding the alignment decreases