SISAP’08 – 20080411
Approximate Similarity Search in Genomic Sequence Databases using
Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department, Middle East Technical University
Ankara, TURKEY
Outline
• Background
  – Sequence Alignment
  – Blast
• Embedding Subsequences
  – Fastmap, LMDS
  – Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences
Sequence Similarity Search
• Sequence similarity search is at the heart of bioinformatics research
  – Similarity information allows structural, functional, and evolutionary inferences
Sequence Alignment
• Goal: maximize “alignment score”
• Score of aligning two residues:
  – Substitution matrix
• Optimal solution: Dynamic Programming
  – Global: Needleman-Wunsch (1970)
  – Local: Smith-Waterman (1981)
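The local dynamic-programming recurrence can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function names, the linear gap penalty, and the toy +2/-1 scoring function are all assumptions standing in for a real substitution matrix.

```python
def smith_waterman_score(a, b, score, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


def toy_score(x, y):
    # Hypothetical substitution scores: +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


print(smith_waterman_score("ACDEF", "XCDEY", toy_score))
```

Dropping the `0` option from the `max` and initializing the borders with gap penalties would turn this into the global (Needleman-Wunsch) variant.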
Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from query
   • CDEFG → CDE, DEF, EFG
   • CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions.
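Step 1 can be sketched as below. The helper names, the toy +2/-1 scoring function, the four-letter alphabet, and the threshold are assumptions chosen only to reproduce the CDEFG and CDE examples; Blast itself scores words with a substitution matrix.

```python
from itertools import product


def k_tuples(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(x, y):
    # Hypothetical substitution scores: +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    ]


print(k_tuples("CDEFG", 3))
print(neighborhood("CDE", "ACDE", toy_score, threshold=3))
```

With these toy scores a threshold of 3 admits every word with at most one mismatch. Enumerating all candidate words grows as σ^k, which is one reason exact neighborhood generation is limited to small k.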
Time-accuracy trade-off
• Challenge:
  – Allow flexible matching for larger words in reasonable time
[Figure: range of k-tuple lengths, k = 1, 2, 3, 4, …, 11. Small k: too many k-tuple hits to process, which slows down the extension phase. Large k: few or no k-tuple hits and fast execution, but exact k-tuple matching is not sensitive and gives too many false negatives. Proteins: 20^3 tuples; DNA: 4^11 tuples.]
Raising the bar for k
1. Map k-tuples to a vector space
   • Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples
Mapping k-tuples
• Requirements:
  – Need to support out-of-sample extension
  – Speed
• Candidate methods:
  – Fastmap (Faloutsos, 1995)
  – Landmark MDS (de Silva, 2003)
Fastmap
1. Select two pivots
   • Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat
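The four steps above can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name, the choice of object 0 as the starting point of the pivot heuristic, and the clamping of negative residual distances are not taken from the talk.

```python
import math


def fastmap(objects, dist, dims):
    """Embed objects into `dims` coordinates; returns (coords, pivot index pairs)."""
    n = len(objects)
    coords = [[0.0] * dims for _ in range(n)]
    pivots = []

    def d2(i, j, level):
        # Squared distance left over after removing the first `level` coordinates.
        s = dist(objects[i], objects[j]) ** 2
        for c in range(level):
            s -= (coords[i][c] - coords[j][c]) ** 2
        return max(s, 0.0)  # clamp: non-Euclidean inputs can go negative

    for level in range(dims):
        # Distant-pivots heuristic: start at object 0, jump to the farthest object twice.
        b = max(range(n), key=lambda j: d2(0, j, level))
        a = max(range(n), key=lambda j: d2(b, j, level))
        dab2 = d2(a, b, level)
        pivots.append((a, b))
        if dab2 == 0.0:
            break  # all residual distances are zero
        for i in range(n):
            # Cosine law: projection of object i onto the line through a and b.
            coords[i][level] = (d2(a, i, level) + dab2 - d2(b, i, level)) / (
                2 * math.sqrt(dab2)
            )
    return coords, pivots
```

For data that is truly Euclidean and embedded into its own dimensionality, the pairwise distances are reproduced exactly; the clamp in `d2` is where non-Euclidean inputs lose information.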
Fastmap
• Fast! O(Nd)
  – N: number of data points
  – d: target dimensionality
• For a query, only the distances to the set of pivots need to be calculated
• Unstable (esp. if original space is non-Euclidean)
Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply triangulation based on their distances to the landmarks
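Steps 2 and 3 can be sketched with NumPy as follows. This follows the standard Landmark MDS formulation; the function names and the error raised when too few positive eigenvalues exist are assumptions made for the sketch.

```python
import numpy as np


def lmds_fit(landmark_sq_dists, dims):
    """Classical MDS on the landmarks; returns what is needed to place new objects."""
    D = np.asarray(landmark_sq_dists, dtype=float)   # n x n squared distances
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * H @ D @ H                             # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:dims]             # indices of the largest ones
    lam, V = evals[top], evecs[:, top]
    if lam.min() <= 1e-12:
        raise ValueError("fewer than `dims` positive eigenvalues")
    coords = V * np.sqrt(lam)                        # landmark coordinates, n x dims
    pinv_t = (V / np.sqrt(lam)).T                    # triangulation matrix, dims x n
    return coords, pinv_t, D.mean(axis=0)


def lmds_embed(sq_dists_to_landmarks, pinv_t, mean_sq):
    """Place objects from their squared distances to the landmarks."""
    delta = np.asarray(sq_dists_to_landmarks, dtype=float)
    return -0.5 * (delta - mean_sq) @ pinv_t.T
```

Embedding a query only needs its distances to the n landmarks, which is what makes the out-of-sample extension cheap.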
Landmark MDS
• Provides stable results
• Good selection of landmarks is critical.
  – LMDSrandom
  – LMDSmaxmin
    • Add new landmarks that maximize the minimum distance to already selected landmarks
  – LMDSfastmap
    • Use the same landmarks as found by Fastmap
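The maxmin rule can be sketched as a greedy loop. The function name and the choice of the first landmark (index 0 by default, where a random seed is also common) are assumptions.

```python
def maxmin_landmarks(objects, dist, n_landmarks, first=0):
    """Greedy maxmin selection; returns indices of the chosen landmarks."""
    chosen = [first]
    # min_dist[i]: distance from object i to its nearest landmark chosen so far.
    min_dist = [dist(objects[first], o) for o in objects]
    while len(chosen) < n_landmarks:
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], o))
    return chosen


values = [0, 1, 2, 10, 11, 5]
print(maxmin_landmarks(values, lambda a, b: abs(a - b), 3))
```

Each round costs one pass over the data, so selecting n landmarks from N objects takes O(N·n) distance computations.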
Evaluation
• Synthetic datasets
  – Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
  – Yeast proteins benchmark (σ=20)
  – 6,341 proteins, 2.9 million residues
  – 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)
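A weighted Hamming distance between two k-tuples can be sketched as a sum of per-position substitution costs. The function names are assumptions, and the 0/1 cost below is only a placeholder that reduces to the plain Hamming distance; the CB-EUC matrix entries are not reproduced here.

```python
def weighted_hamming(u, v, cost):
    """Sum of per-position substitution costs between two equal-length k-tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    return sum(cost(a, b) for a, b in zip(u, v))


def identity_cost(a, b):
    # Placeholder cost (an assumption): 0 for identical residues, 1 otherwise.
    return 0.0 if a == b else 1.0


print(weighted_hamming("CDE", "ADE", identity_cost))
```

Any residue-to-residue distance table can be passed in as `cost`, for example a dictionary lookup wrapped in a function.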
• Sammon’s metric stress:
• Breaking point dimensionality
Target dimensionality (d)
k=5, synthetic dataset, identity matrix
Subsequence length (k) and alphabet size (σ)
Number of landmarks
k=5, d=7, synthetic dataset, identity matrix
Approximate k-tuple search performance
• Find all k-tuples within a specified radius from a query k-tuple
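The range query and a simple way to score its accuracy can be sketched as below. The linear scan is a stand-in for a spatial index such as an R-tree, and the function names and the recall measure are assumptions rather than the talk's evaluation code.

```python
import math


def range_search(query_vec, vectors, radius):
    """Indices of embedded k-tuples within `radius` of the query (linear scan)."""
    return [i for i, v in enumerate(vectors) if math.dist(query_vec, v) <= radius]


def recall(approx_hits, true_hits):
    """Fraction of the true neighbors that the approximate search returned."""
    true_hits = set(true_hits)
    if not true_hits:
        return 1.0
    return len(true_hits & set(approx_hits)) / len(true_hits)


embedded = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
print(range_search((0.0, 0.0), embedded, radius=2.0))
```

Because the embedding distorts distances, hits found in the vector space can differ from those under the original distance, which is why the results are approximate.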
k=6, d=8, real dataset, CB-EUC matrix
Homology search
k=6, d=8, real dataset, CB-EUC matrix
Search time
search radius=7, database size=100,000
Conclusion
• Applied an embedding-based approach to approximate sequence similarity search for the first time
• Significant time improvements with negligible degradation in accuracy
• Achieved more stable embedding with combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of the dataset