MoBIoS
A Metric-space DBMS to Support Biological Discovery
Presenter: Enohi I. Ibekwe
Overview
• MoBIoS Project
• Motivation
• The challenge
• Established similarity measures
• Metric-space distance measure
• Disk-based metric tree index
• MoBIoS as a DBMS
• Application of MoBIoS
MoBIoS Project
• Molecular Biological Information System
• Project at the UT-Austin Center for Computational Biology and Bioinformatics.
• A DBMS built on metric-space indexing techniques, an object-relational model of genomic and proteomic data types, and a database query language that embodies the semantics of genomic and proteomic data.
Motivation
Develop a DBMS to power biological information systems.
The Challenge
• Established biological similarity measures do not form a metric.
• Scalable disk-based metric indexes suffer from the curse of dimensionality.
Established Similarity Measure (I)
• Sequence Homology
– Query sequence
– Database of sequences
– Substitution matrix (PAM / BLOSUM)
– Similarity measure
  – Global sequence alignment (edit distance)
  – Local sequence alignment (most important)
Established Similarity Measure (II)
• Local Sequence Alignment
– Given a query sequence S, a database of sequences T, and a similarity matrix corresponding to an evolutionary model, a local sequence alignment query returns all subsequences of T that are sufficiently similar to a subsequence of S.
– Main issue: the result is a set of answers.
• A metric distance function must return a single value for each pair of arguments.
Established Similarity Measure (III)
• Global Sequence Alignment
– Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t finds strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is the maximum (similarity measure) over all such pairs of strings. (example)
– Issue: the result may be negative, since the substitution matrix is based on log-odds probabilities. The similarity measure also favors larger positive scores, the opposite of a distance.
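The global alignment score defined above can be sketched with the standard Needleman-Wunsch dynamic program. This is only an illustration: `toy_score` and the gap penalty of -2 are hypothetical stand-ins for a real PAM or BLOSUM lookup, not values taken from MoBIoS.

```python
def global_alignment_score(s, t, score, gap=-2):
    """Best global alignment score of s and t (Needleman-Wunsch)."""
    # prev[j] is the best score for aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # align s[i-1] with t[j-1]
                prev[j] + gap,                            # s[i-1] against a space
                curr[j - 1] + gap,                        # t[j-1] against a space
            ))
        prev = curr
    return prev[-1]


def toy_score(a, b):
    # Hypothetical substitution "matrix": +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1
```

With these toy values, two unrelated single letters already score below zero, which illustrates why the raw similarity score cannot serve directly as a distance.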
Metric-space Distance measure (I)
• Homology Search
– Query sequence: substrings of length q (q-grams)
– Database of sequences: metric-indexed records of fixed-length strings of length q (indexed q-grams)
– Substitution matrix (mPAM)
– Similarity measure (distance measure)
  – Local alignment is computed from global alignment.
• mPAM substitution matrix
– Accepted Point Mutation model.
– PAM calculates scores based on the frequency with which individual pairs of amino acids are substituted for each other.
– mPAM, instead of calculating the frequency of substitutions (PAM), computes the expected time between substitutions.
– mPAM has been validated. (Validation)
Metric-space Distance measure (II)
Metric-space Distance measure (III)
• Computing Local Alignment from Global Alignment (Algorithm)
– Offline
1. Divide the database of sequences into substrings (q-grams).
2. Build a metric-space index structure on the q-grams.
– Online
1. Divide the query sequence into substrings (q-grams).
2. Use global alignment as the distance function to match the query q-grams against the index.
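The offline step can be sketched as follows. The names `qgrams` and `build_qgram_table` are hypothetical, and a plain dictionary stands in for the metric-space index, so this sketch supports exact lookups only.

```python
def qgrams(sequence, q, step=1):
    """Return (offset, q-gram) pairs for every window of length q."""
    return [(i, sequence[i:i + q])
            for i in range(0, len(sequence) - q + 1, step)]


def build_qgram_table(database, q):
    """Offline phase: map each q-gram to the (sequence id, offset) pairs
    where it occurs. `database` maps sequence ids to sequence strings."""
    table = {}
    for seq_id, seq in database.items():
        for offset, gram in qgrams(seq, q):
            table.setdefault(gram, []).append((seq_id, offset))
    return table
```

The online step reuses `qgrams` on the query sequence. A real metric index would then answer each query q-gram with a range search under the distance function rather than a dictionary lookup.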
Disk-based metric-tree index
• Phases
– Initialization
– Searching
• Query performance metrics
– Number of disk I/Os (nodes visited)
– Number of distance computations
• Options exploited
– M-Tree
– Generalized hyperplane tree
– MVP-Tree (optimal)
Disk-based metric-tree index (initialization)
• M-Tree initialization
– Best case: O(n log n)
– Worst case: O(n^3)
• Generalized hyperplane tree (GH-Tree) initialization
– Best case: O(n log n)
– Worst case: O(n^2)
• GH-Tree: bi-directional
• M-Tree: bottom-up
• In practice, both M-Tree and GH-Tree scale linearly.
Disk-based metric-tree index (Searching)
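As an illustration of how a metric tree prunes a range search, here is a minimal in-memory vantage-point tree, a simpler relative of the MVP-Tree. It is a generic textbook sketch, not the MoBIoS disk-based implementation. The names `VPNode`, `build_vp_tree` and `range_search` are hypothetical, and the first point is used as the vantage point for simplicity.

```python
class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree with d(vantage, p) <= radius
        self.outside = outside  # subtree with d(vantage, p) > radius


def build_vp_tree(points, dist):
    """Initialization: recursively split the points around a vantage point."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    distances = [dist(vantage, p) for p in rest]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(rest, distances) if d <= radius]
    outside = [p for p, d in zip(rest, distances) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def range_search(node, query, r, dist, found=None):
    """Searching: collect every indexed point within distance r of query."""
    if found is None:
        found = []
    if node is None:
        return found
    d = dist(node.point, query)
    if d <= r:
        found.append(node.point)
    # By the triangle inequality, the inside subtree can hold a match
    # only if d - r <= radius, and the outside subtree only if d + r > radius.
    if d - r <= node.radius:
        range_search(node.inside, query, r, dist, found)
    if d + r > node.radius:
        range_search(node.outside, query, r, dist, found)
    return found
```

Each pruned subtree saves both node visits and distance computations, the two query performance metrics listed earlier.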
MoBIoS as a DBMS (I)
• Mckoi (Java RDBMS)
– Plus metric-space indexing
– Plus biological data types
– Plus biological semantics
• Life science data store
– Biological sequence data
– Mass-spectrometry protein signatures
MoBIoS as a DBMS (III)
• Language extension
– M-SQL
• Data type extension
– Data type for sequences (DNA, RNA, peptide)
– Data type for mass spectra
• Semantics extension
– Subsequence operators
– Local alignment
MoBIoS as a DBMS (IV)
• Semantics extension
– Similarity (metric distance) between data types
  • mPAM250
  • Cosine distance
  • Lk norms
• Keys extension
– Primary key (metrickey)
– Index (metric)
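Two of the distance functions named above can be sketched on plain numeric vectors. The function names are hypothetical. The cosine distance is written here in its angular form, which is an assumption: the angle satisfies the triangle inequality, whereas 1 minus the cosine similarity does not in general.

```python
import math


def cosine_distance(u, v):
    """Angle in radians between two non-zero vectors (assumed angular form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def lk_distance(u, v, k=2):
    """Lk (Minkowski) distance; k=1 is Manhattan, k=2 is Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)
```

For mass spectra, the vectors would hold peak intensities per mass bin. That binning is not specified in these slides.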
Application of MoBIoS (I)
• MS/MS Protein Identification
1. Break the protein down into fragments called peptides using a protease enzyme.
2. Identify the protein by using a mass spectrometer to measure the mass-to-charge ratio of the fragments and comparing the experimental result to a database of precomputed spectra.
Application of MoBIoS (II)
M-SQL Solution
Create table protein_sequences (
    accession_id int,
    sequence peptide,
    primary metrickey(sequence, mPAM250));
Create table digested_sequences (
    accession_id int,
    fragment peptide,
    enzyme varchar,
    ms_peak int,
    primary key(enzyme, accession_id));
Create index fragment_sequence on digested_sequences (fragment)
    metric(mPAM250);
Create table mass_spectra (
    accession_id int,
    enzyme varchar,
    spectrum spectrum,
    primary metrickey(spectrum, cosine_distance));
Application of MoBIoS (III)
• M-SQL Solution
SELECT Prot.accession_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
    Cosine_Distance(S, MS.spectrum, range1) and
    DS.accession_id = MS.accession_id = Prot.accession_id and
    DS.ms_peak = P and
    MPAM250(PS, DS.sequence, range2)
BLAST vs MoBIoS
MoBIoS
1. Molecular Biological Information System
2. DBMS specialized for storage, retrieval and mining of biological data
3. Both the sequence database and the query sequence are divided into q-grams, and the database is indexed offline.
BLAST
1. Basic Local Alignment Search Tool
2. Utility specialized for retrieval and mining of biological data outside a database
3. Only the query sequence is divided, and the hot-point index is built at query time.
MoBIoS Demo
• MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp
• PDB : http://www.rcsb.org/pdb/
Conclusion
• Biological data is not random and very likely exhibits the intrinsic structure necessary for metric-space indexing to succeed.
References
• http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf
• http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf
• http://www.cs.utexas.edu/users/mobios/
• http://www.mckoi.com/database/
Appendix
Appendix I- Metric
A metric space is a set of objects S with a distance function d such that, for any three objects x, y, z in S:
1. Non-negativity
d(x,y) > 0 for x ≠ y; d(x,y) = 0 for x = y
2. Symmetry
d(x,y) = d(y,x)
3. Triangle inequality
d(x,y) + d(y,z) ≥ d(x,z)
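The three properties can be checked by brute force on a small sample of objects. This is only a sanity check on the sample given, not a proof, and the helper name `is_metric` is hypothetical.

```python
from itertools import product


def is_metric(points, d, tol=1e-12):
    """Check the metric properties of d on every pair and triple in points."""
    for x, y in product(points, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            return False            # d(x, x) must be 0
        if x != y and d(x, y) <= 0:
            return False            # distinct objects need a positive distance
        if abs(d(x, y) - d(y, x)) > tol:
            return False            # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            return False            # triangle inequality
    return True
```

For example, the absolute difference on numbers passes the check, while the squared difference fails the triangle inequality on the sample 0, 1, 3.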
Appendix II - Sequence
• 2 RNA sequences from a DNA strand.
Appendix III - PAM
Percent Accepted Mutation (PAM)
A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution is calculated from the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence (e.g. PAM250).
PAM is also a unit that quantifies the amount of evolutionary change in a protein sequence. Scores are based on log-odds probabilities.
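A log-odds score compares how often a pair of residues is observed aligned with how often it would be expected by chance. The sketch below uses the general form with hypothetical frequencies. The parameter names and the scaling by 10 with a base-10 logarithm are assumptions for illustration, not values read from an actual PAM table.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=10.0):
    """Scaled log of observed pair frequency over the chance expectation."""
    return scale * math.log10(pair_freq / (freq_a * freq_b))
```

A pair seen more often than chance scores above zero and a pair seen less often scores below zero, which is why alignment scores built from such matrices can be negative.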
Appendix IV – PAM250
• At this evolutionary distance (250 substitutions per hundred residues)
Appendix V - BLOSUM
Blocks Substitution Matrix (BLOSUM)
A substitution matrix in which the score for each position is derived from the observed frequencies of substitutions in blocks of local alignments in related proteins (e.g. BLOSUM62).
A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odds probabilities.
Appendix VI – BLOSUM62
• BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical
Appendix VII – mPAM250
• Expected time based on 250 PAM distance as a unit.
Appendix VIII – mPAM Validation
• Based on benchmark query set by Smith-Waterman.
• Graph shows ROC50 values (Receiver Operating Characteristics)
• Negative x-axis values indicate that mPAM performs better.
Difference between ROC50 values using mPAM and PAM250
Appendix IX - Distance measure
Global Sequence Alignment
Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t finds strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is a maximum (similarity measure) or a minimum (distance measure) over all such pairs of strings.
Appendix X – Homology Search
Build Index Structure (Offline)
1. Divide the database sequences into a set of overlapping substrings of length q (q-grams) with step size 1.
2. Build a metric-space index D based on global alignment to support constant-time lookup of exact matches.
Homology Search Query (Online)
1. Divide the query sequence W into overlapping substrings of length q with step size 1: F = {w_i | i = 0..|W|-q}.
2. For each w_i in F, run the range query Q(w_i, r) against database D to find the set of matching q-grams R_i = {f_i,j | d(f_i,j, w_i) ≤ r, f_i,j ∈ D, w_i ∈ F}, where d is the distance function.
3. Use a greedy heuristic algorithm to extend and chain all fragments in R_0 ∪ R_1 ∪ … ∪ R_|W|-q to deduce the result of the homology search, based on local alignment, for query W.
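Online steps 1 and 2 can be sketched as below. Everything here is illustrative: a linear scan stands in for the metric index D, Hamming distance stands in for the mPAM-based global alignment distance, and the names `hamming` and `homology_hits` are hypothetical. The greedy extension and chaining of step 3 is left out.

```python
def hamming(a, b):
    # Stand-in metric for equal-length q-grams: number of differing positions.
    return sum(x != y for x, y in zip(a, b))


def homology_hits(query, db_grams, q, r, dist=hamming):
    """For each query q-gram w_i, return the set R_i of database q-grams
    within distance r. db_grams is a list of (seq_id, offset, gram)."""
    hits = {}
    for i in range(len(query) - q + 1):
        w = query[i:i + q]
        matches = [(sid, off, g) for sid, off, g in db_grams
                   if dist(g, w) <= r]
        if matches:
            hits[i] = matches  # R_i, keyed by the query offset i
    return hits
```

The returned dictionary maps each query offset i to its hit list R_i, which is the input a chaining step would consume.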
Appendix XI - GSA