SISAP’08 – 20080411
Approximate Similarity Search in Genomic Sequence Databases using
Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department, Middle East Technical University
Ankara, TURKEY
Outline
• Background
– Sequence Alignment
– Blast
• Embedding Subsequences
– Fastmap, LMDS
– Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences
Sequence Similarity Search
• Sequence similarity search is at the heart of bioinformatics research
– Similarity information allows structural, functional, and evolutionary inferences
Sequence Alignment
• Goal: maximize “alignment score”
• Score of aligning two residues:
– Substitution matrix
• Optimal solution: Dynamic Programming
– Global: Needleman-Wunsch (1970)
– Local: Smith-Waterman (1981)
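The dynamic programming recurrence for local alignment can be sketched as below. This is a minimal illustration, assuming a linear gap penalty and a toy match/mismatch score in place of a real substitution matrix; the names `smith_waterman_score` and `toy_score` are hypothetical and not from the talk.

```python
def toy_score(x, y):
    # Toy substitution score (assumption): +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


def smith_waterman_score(a, b, score=toy_score, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # DP row for the previous residue of a
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                         # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),   # align a[i-1] with b[j-1]
                prev[j] + gap,                             # gap in b
                curr[j - 1] + gap,                         # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Dropping the `0` option and initializing the borders with cumulative gap penalties would turn this into the global (Needleman-Wunsch) variant.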
Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from query
• CDEFG → CDE, DEF, EFG
• CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions.
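Steps 1 and 2 can be sketched roughly as follows. This is not Blast's actual implementation: the scoring function, the threshold, and all function names are illustrative assumptions, and the neighborhood is enumerated by brute force, which is only feasible for small k.

```python
from itertools import product


def match_score(x, y):
    # Toy residue score (assumption): +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


def k_tuples(seq, k):
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighborhood(word, alphabet, threshold, score=match_score):
    """Words of the same length that score at least `threshold` against `word`."""
    out = []
    for cand in product(alphabet, repeat=len(word)):
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold:
            out.append("".join(cand))
    return out


def build_index(db_seq, k):
    """Map each k-tuple of the database sequence to its start positions."""
    index = {}
    for pos, word in enumerate(k_tuples(db_seq, k)):
        index.setdefault(word, []).append(pos)
    return index


def seed_hits(query, index, k, alphabet, threshold):
    """(query position, database position) pairs with an exact neighbor match."""
    hits = []
    for qpos, word in enumerate(k_tuples(query, k)):
        for neighbor in neighborhood(word, alphabet, threshold):
            for dpos in index.get(neighbor, []):
                hits.append((qpos, dpos))
    return hits
```

With the alphabet `"ACDE"` and a threshold of 3, the neighborhood of `CDE` contains `CDE` itself and every word that differs from it in one position, such as `ADE`, `CDC`, and `CCE`. Step 3 (extension) would then start from each returned hit.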
Time-accuracy trade-off
• Challenge:
– Allow flexible matching for larger words in reasonable time
k: 1 2 3 4 … 11
• Small k: too many k-tuple hits to process; slows down the extension phase
• Large k: few or no k-tuple hits; fast execution; exact k-tuple matching is not sensitive; too many false negatives
• Proteins (20^3 tuples), DNA (4^11 tuples)
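The tuple counts on this slide follow from the number of distinct words of length k over an alphabet of size σ, which is σ^k. A one-line check, with `tuple_space_size` as a hypothetical helper name:

```python
def tuple_space_size(sigma, k):
    """Number of distinct k-tuples over an alphabet of size sigma."""
    return sigma ** k


protein_words = tuple_space_size(20, 3)   # 20 amino acids, k = 3
dna_words = tuple_space_size(4, 11)       # 4 nucleotides, k = 11
```

Here `protein_words` is 8,000 and `dna_words` is 4,194,304.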
Raising the bar for k
1. Map k-tuples to a vector space
• Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples
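To illustrate step 2, the sketch below indexes embedded points and answers a range query. It uses a uniform grid as a simple stand-in for an R-tree or X-tree, which are considerably more involved; the class `GridIndex` and its cell-size parameter are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


class GridIndex:
    """Uniform grid over embedded points (a stand-in, not an R-tree)."""

    def __init__(self, cell):
        self.cell = cell                  # edge length of one grid cell
        self.cells = defaultdict(list)    # cell key -> [(point, label), ...]

    def _key(self, point):
        return tuple(floor(x / self.cell) for x in point)

    def insert(self, point, label):
        self.cells[self._key(point)].append((tuple(point), label))

    def range_query(self, q, radius):
        """Labels of all stored points within Euclidean `radius` of q."""
        lo = self._key([x - radius for x in q])
        hi = self._key([x + radius for x in q])
        found = []
        # Visit only the cells overlapping the query's bounding box.
        for key in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
            for point, label in self.cells.get(key, []):
                if dist(point, q) <= radius:
                    found.append(label)
        return found
```

The number of cells visited grows exponentially with the dimensionality, which is one reason tree-based access methods are preferred for embeddings with more than a few dimensions.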
Mapping k-tuples
• Requirements:
– Need to support out-of-sample extension
– Speed
• Candidate methods:
– Fastmap (Faloutsos, 1995)
– Landmark MDS (de Silva, 2003)
Fastmap
1. Select two pivots
• Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat
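The four steps can be sketched as below. This is a minimal version written for readability, assuming a user-supplied distance function; `fastmap` and `residual_sq` are illustrative names. It recomputes distances on demand instead of caching them, so it does more distance evaluations than a tuned implementation would.

```python
from math import sqrt


def fastmap(objects, distance, d):
    """Embed objects into at most d dimensions; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []

    def residual_sq(i, j, col):
        # Squared distance left after projecting out the first `col` axes.
        s = distance(objects[i], objects[j]) ** 2
        for c in range(col):
            s -= (coords[i][c] - coords[j][c]) ** 2
        return max(s, 0.0)  # clamp: non-Euclidean inputs can go negative

    for col in range(d):
        # Distant pivots heuristic: start at object 0, hop to the farthest
        # object, then to the object farthest from that one.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, col))
        a = max(range(n), key=lambda j: residual_sq(b, j, col))
        dab_sq = residual_sq(a, b, col)
        if dab_sq == 0.0:
            break  # no residual distance left to explain
        pivots.append((a, b))
        dab = sqrt(dab_sq)
        for i in range(n):
            # Cosine law: position of object i along the line through a and b.
            coords[i][col] = (
                residual_sq(a, i, col) + dab_sq - residual_sq(b, i, col)
            ) / (2 * dab)
    return coords, pivots
```

The clamp in `residual_sq` is where a non-Euclidean original space shows up: residual squared distances can become negative and are forced to zero.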
Fastmap
• Fast! O(Nd)
– N: number of data points
– d: target dimensionality
• For query, need only to calculate distances to set of pivots
• Unstable (esp. if original space is non-Euclidean)
Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks
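A dependency-free sketch of these three steps is given below, assuming the landmarks have already been selected. Classical MDS is done by double-centering the squared landmark distances and extracting the top eigenpairs with plain power iteration, chosen here only to avoid external libraries; a linear algebra package would normally be used instead. The function names and the fixed start vector are assumptions.

```python
from math import sqrt


def top_eigenpairs(B, d, iters=500):
    """Top-d eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(B)
    B = [row[:] for row in B]                        # work on a copy
    pairs = []
    for _k in range(d):
        v = [float(i + 1) for i in range(n)]         # fixed start vector (assumption)
        for _it in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):                           # deflate the found component
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def lmds_fit(landmarks, distance, d):
    """Step 2: classical MDS on the landmarks."""
    n = len(landmarks)
    D2 = [[distance(a, b) ** 2 for b in landmarks] for a in landmarks]
    mean_sq = [sum(row) / n for row in D2]           # mean squared distance per landmark
    total = sum(mean_sq) / n
    B = [[-0.5 * (D2[i][j] - mean_sq[i] - mean_sq[j] + total) for j in range(n)]
         for i in range(n)]                          # double centering
    pairs = top_eigenpairs(B, d)
    if any(lam <= 1e-12 for lam, _v in pairs):
        raise ValueError("fewer than d positive eigenvalues; lower d")
    coords = [[sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]
    return coords, pairs, mean_sq


def lmds_embed(obj, landmarks, distance, pairs, mean_sq):
    """Step 3: distance-based triangulation of one non-landmark object."""
    delta = [distance(obj, lm) ** 2 for lm in landmarks]
    return [
        -0.5 * sum(v[i] * (delta[i] - mean_sq[i]) for i in range(len(landmarks)))
        / sqrt(lam)
        for lam, v in pairs
    ]
```

Embedding a query with `lmds_embed` needs only its distances to the landmarks, which is what makes the method usable for out-of-sample objects.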
Landmark MDS
• Provides stable results
• Good selection of landmarks is critical.
– LMDSrandom
– LMDSmaxmin
• Add new landmarks that maximize the minimum distance to already selected landmarks
– LMDSfastmap
• Use the same landmarks as found by Fastmap
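The maxmin strategy can be sketched as a greedy loop. This is an illustrative version: the seed landmark is passed in as `first` (in practice it is often chosen at random), the function name `maxmin_landmarks` is hypothetical, and n is assumed not to exceed the number of objects.

```python
def maxmin_landmarks(objects, distance, n, first=0):
    """Indices of n landmarks chosen greedily by the maxmin rule."""
    chosen = [first]
    # min_d[i]: distance from object i to its nearest chosen landmark.
    min_d = [distance(o, objects[first]) for o in objects]
    while len(chosen) < n:
        # Next landmark: the object farthest from all landmarks chosen so far.
        nxt = max(range(len(objects)), key=lambda i: min_d[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_d[i] = min(min_d[i], distance(o, objects[nxt]))
    return chosen
```

Each round costs one pass over the objects, so selecting n landmarks from N objects takes O(nN) distance computations.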
Evaluation
• Synthetic datasets
– Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
– Yeast proteins benchmark (σ=20)
– 6,341 proteins, 2.9 million residues
– 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)
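A weighted Hamming distance sums a per-position substitution cost over two equal-length k-tuples. The sketch below uses made-up costs; the CB-EUC values are not reproduced here, and `identity_cost`, `toy_cost`, and `TOY_COSTS` are hypothetical names.

```python
def identity_cost(a, b):
    # Identity matrix: every mismatch costs 1 (plain Hamming distance).
    return 0.0 if a == b else 1.0


# Made-up cost for one residue pair (assumption); all other mismatches cost 1.
TOY_COSTS = {frozenset("DE"): 0.5}


def toy_cost(a, b):
    if a == b:
        return 0.0
    return TOY_COSTS.get(frozenset((a, b)), 1.0)


def weighted_hamming(u, v, cost=identity_cost):
    """Sum of per-position substitution costs between equal-length k-tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    return sum(cost(a, b) for a, b in zip(u, v))
```

With `identity_cost`, `CDE` and `ADE` are at distance 1. With `toy_cost`, `CDE` and `CED` are also at distance 1, because each of the two D/E swaps costs 0.5.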
• Sammon’s metric stress:
• Breaking point dimensionality
[Figure: x-axis is target dimensionality (d); k=5, synthetic dataset, identity matrix]
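As a rough illustration of how such a stress value can be computed, the sketch below assumes the commonly cited form of Sammon's stress: the sum over pairs of (δ − d)² / δ, divided by the sum of δ, where δ is the original distance and d the embedded distance. Whether the talk used exactly this normalization is an assumption, and `sammon_stress` is a hypothetical name.

```python
def sammon_stress(orig, emb):
    """Sammon's stress from two pairwise distance matrices of equal size.

    orig[i][j]: distance in the original space; emb[i][j]: distance after
    embedding. Pairs with zero original distance are skipped.
    """
    n = len(orig)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i][j] > 0:
                num += (orig[i][j] - emb[i][j]) ** 2 / orig[i][j]
                den += orig[i][j]
    return num / den if den else 0.0
```

A stress of 0 means every pairwise distance is preserved exactly; larger values indicate more distortion.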
Subsequence length (k) and alphabet size (σ)
Number of landmarks
k=5, d=7, synthetic dataset, identity matrix
Approximate k-tuple search performance
• Find all k-tuples within a specified radius from a query k-tuple
k=6, d=8, real dataset, CB-EUC matrix
Homology search
k=6, d=8, real dataset, CB-EUC matrix
Search time
search radius=7, database size=100,000
Conclusion
• Applied an embedding-based approach to approximate sequence similarity search for the first time
• Significant time improvements with negligible degradation in accuracy
• Achieved more stable embedding with combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of the dataset