mobios a metric-space dbms to support biological discovery presenter: enohi i. ibekwe

MoBIoS

A Metric-space DBMS to Support Biological Discovery

Presenter: Enohi I. Ibekwe

Overview

• MoBIoS Project
• Motivation
• The challenge
• Established similarity measures
• Metric-space distance measure
• Disk-based metric tree index
• MoBIoS as a DBMS
• Application of MoBIoS

MoBIoS Project

• Molecular Biological Information System

• Project at the UT-Austin Center for Computational Biology and Bioinformatics.

• A DBMS based on metric-space indexing techniques, an object-relational model of genomic and proteomic data types, and a database query language that embodies the semantics of genomic and proteomic data.

Motivation

Develop a DBMS to power biological information systems.

The Challenge

• Established biological models of similarity do not form a metric.

• Scalable disk-based metric indexes suffer from the curse of dimensionality.

Established Similarity Measure (I)

• Sequence Homology
– Query sequence
– Database of sequences
– Substitution matrix (PAM / BLOSUM)
– Similarity measure
  • Global sequence alignment (edit distance)
  • Local sequence alignment (most important)

Established Similarity Measure (II)

• Local Sequence Alignment
– Given a query sequence S, a database of sequences T, and a similarity matrix corresponding to an evolutionary model, a local sequence alignment query returns all subsequences of T that are sufficiently similar to a subsequence of S.
– Main issue: the result is a set of answers.
  • A metric distance function must return a single value for each pair of arguments.

Established Similarity Measure (III)

• Global Sequence Alignment
– Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t finds strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is maximal (the similarity measure) over all such pairs of strings (see Appendix XI for an example).
– Issue: the result may be negative, since the substitution matrix is based on log-odds probabilities, and the similarity measure favors larger positive numbers. A metric distance must instead be non-negative and smaller for more similar objects.
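The global alignment score described above can be sketched with the standard Needleman-Wunsch dynamic program. This is a minimal illustration, not the MoBIoS implementation: the `toy_score` function is a hypothetical stand-in for a real substitution matrix such as PAM or BLOSUM, and the linear gap penalty of -1 is an assumption.

```python
def global_alignment_score(s, t, score, gap=-1):
    """Best score over all global alignments of s and t (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # s[:i] aligned against an empty prefix of t
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # substitute or match
                prev[j] + gap,                            # space inserted into t
                curr[j - 1] + gap,                        # space inserted into s
            ))
        prev = curr
    return prev[-1]


def toy_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA", toy_score))
    print(global_alignment_score("AAA", "TTT", toy_score))
```

The second call illustrates the issue raised on the slide: two dissimilar sequences yield a negative score, so the raw similarity score cannot serve directly as a metric distance.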

Metric-space Distance measure (I)

• Homology Search
• Query sequence: substrings of length q (q-grams)
• Database of sequences: metric-indexed records of fixed-length (q) strings (indexed q-grams)
• Substitution matrix (mPAM)
• Similarity measure (distance measure)
– Local alignments are computed from global alignments.
• mPAM substitution matrix
– Based on the accepted point mutation (PAM) model.
– PAM calculates scores from the frequency with which individual pairs of amino acids are substituted for each other.
– mPAM, instead of calculating the frequency of substitutions as PAM does, computes the expected time between substitutions.
– mPAM has been validated (see Appendix VIII).

Metric-space Distance measure (II)

Metric-space Distance measure (III)

• Computing Local Alignment from Global Alignment (algorithm: see Appendix X)
– Offline
1. Divide the database sequences into substrings (q-grams).
2. Build a metric-space index structure on the q-grams.
– Online
1. Divide the query sequence into substrings (q-grams).
2. Use global alignment as the distance function to match the query q-grams against the indexed q-grams.
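The offline and online phases above can be sketched as follows. Two simplifications are assumptions made for brevity: a flat list scanned linearly stands in for the metric-space index, and Hamming distance stands in for global alignment under mPAM. All function names are hypothetical.

```python
def qgrams(seq, q):
    """All overlapping substrings of length q (step size 1), with their offsets."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]


def hamming(a, b):
    """Stand-in distance for equal-length strings: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))


def build_index(database, q):
    """Offline phase: collect every q-gram of every database sequence.
    A real system would insert these into a metric tree instead of a list."""
    return [(name, off, gram)
            for name, seq in database.items()
            for off, gram in qgrams(seq, q)]


def match_query(index, query, q, radius, dist=hamming):
    """Online phase: for each query q-gram, report indexed q-grams within radius."""
    hits = []
    for q_off, w in qgrams(query, q):
        for name, d_off, gram in index:
            if dist(w, gram) <= radius:
                hits.append((q_off, name, d_off))
    return hits


if __name__ == "__main__":
    index = build_index({"s1": "ACGTAC"}, q=3)
    print(match_query(index, "CGTT", q=3, radius=1))
```

Each hit pairs a query offset with a database sequence name and offset; these are the fragments that a later step would extend and chain into local alignments.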

Disk-based metric-tree index

• Phases
– Initialization
– Searching
• Query performance metrics
– Number of disk I/Os (nodes visited)
– Number of distance computations
• Options Exploited
– M-Tree
– Generalized Hyperplane tree (GH-Tree)
– MVP-Tree (optimal)

Disk-based metric-tree index (initialization)

• M-Tree initialization
– Best case: O(n log n)
– Worst case: O(n^3)
• Generalized Hyperplane tree (GH-Tree) initialization
– Best case: O(n log n)
– Worst case: O(n^2)
• GH-Tree: bi-directional
• M-Tree: bottom-up
• In practice, both M-Tree and GH-Tree initialization scale linearly.
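To make the two query performance metrics concrete, here is a minimal in-memory sketch of a generalized hyperplane tree with a range search that prunes subtrees via the triangle inequality and counts distance computations. It is a simplification under stated assumptions: construction is plain top-down with the first two points as pivots (not the bi-directional scheme named above), and there is no disk paging. Class and function names are hypothetical.

```python
class GHNode:
    """One node of a generalized hyperplane tree: two pivots, two subtrees."""
    def __init__(self, p1, p2=None, left=None, right=None):
        self.p1, self.p2, self.left, self.right = p1, p2, left, right


def build(points, dist):
    """Top-down construction; pivots are simply the first two points (an assumption)."""
    if not points:
        return None
    if len(points) == 1:
        return GHNode(points[0])
    p1, p2, rest = points[0], points[1], points[2:]
    left = [x for x in rest if dist(x, p1) <= dist(x, p2)]   # closer to p1
    right = [x for x in rest if dist(x, p1) > dist(x, p2)]   # closer to p2
    return GHNode(p1, p2, build(left, dist), build(right, dist))


def range_search(node, query, radius, dist):
    """Return (matches, number_of_distance_computations) for d(query, x) <= radius."""
    if node is None:
        return [], 0
    d1 = dist(query, node.p1)
    matches, count = ([node.p1] if d1 <= radius else []), 1
    if node.p2 is None:
        return matches, count
    d2 = dist(query, node.p2)
    count += 1
    if d2 <= radius:
        matches.append(node.p2)
    # A match x in the left subtree needs d(x,p1) <= d(x,p2); by the triangle
    # inequality that is impossible when d1 - radius > d2 + radius.
    if d1 - radius <= d2 + radius:
        m, c = range_search(node.left, query, radius, dist)
        matches, count = matches + m, count + c
    if d2 - radius <= d1 + radius:
        m, c = range_search(node.right, query, radius, dist)
        matches, count = matches + m, count + c
    return matches, count


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    tree = build([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], d)
    print(range_search(tree, 4.2, 1.5, d))
```

The count returned alongside the matches corresponds to the "number of distance computations" metric; pruning is what keeps it below a full linear scan.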

Disk-based metric-tree index (Searching)

MoBIoS as a DBMS (I)

• Mckoi (Java RDBMS)
– Plus metric-space indexing
– Plus biological data types
– Plus biological semantics
• Life science data store
– Biological sequence data
– Mass-spectrometry protein signatures

MoBIoS as a DBMS (III)

• Language extension
– M-SQL
• Data type extension
– Data type for sequences (DNA, RNA, peptide)
– Data type for mass spectra
• Semantics extension
– Subsequence operators
– Local alignment

MoBIoS as a DBMS (IV)

• Semantics extension
– Similarity (metric distance) between data types
  • mPAM250
  • Cosine distance
  • Lk norms
• Keys extension
– Primary key (metrickey)
– Index (metric)
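Two of the distance functions listed above are easy to sketch for numeric vectors such as binned mass spectra. The cosine distance is written here in its angular form (the angle between the two vectors), which is an assumption about the intended definition; the angular form is the one that satisfies the triangle inequality. Function names are hypothetical.

```python
import math


def lk_distance(u, v, k=2):
    """Lk (Minkowski) distance; k=1 is Manhattan, k=2 is Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)


def cosine_distance(u, v):
    """Angle in radians between two non-zero vectors (angular form, an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


if __name__ == "__main__":
    print(lk_distance((0, 0), (3, 4)))        # Euclidean
    print(lk_distance((0, 0), (3, 4), k=1))   # Manhattan
    print(cosine_distance((1, 0), (0, 1)))    # orthogonal vectors
```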

Application of MoBIoS (I)

• MS/MS Protein Identification
1. Break the protein down into fragments, called peptides, using a protease enzyme.
2. Identify the protein by using a mass spectrometer to measure the mass-to-charge ratio of the fragments and comparing the experimental result to a database of precomputed spectra.
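Step 1 can be simulated in software to precompute the fragments stored in the database. The sketch below assumes trypsin as the protease, using the commonly cited rule that it cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); the slides do not name a specific enzyme, and the function name is hypothetical.

```python
def digest(protein, cleave_after="KR", blocked_before="P"):
    """Split a protein sequence into peptides at the enzyme's cleavage sites.
    Defaults model trypsin (an assumption, not taken from the slides)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        # Cleave after this residue unless it is the last one or the next blocks cleavage.
        if residue in cleave_after and nxt and nxt not in blocked_before:
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # trailing fragment
    return fragments


if __name__ == "__main__":
    print(digest("MKWVTFRPLLKA"))
```

In the example sequence the R followed by P is not a cleavage site, so only the two K positions produce cuts.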

Application of MoBIoS (II)

M-SQL Solution

Create table protein_sequences (
    accession_id int,
    sequence peptide,
    primary metrickey(sequence, mPAM250));

Create table digested_sequences (
    accession_id int,
    fragment peptide,
    enzyme varchar,
    ms_peak int,
    primary key(enzyme, accession_id));

Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250);

Create table mass_spectra (
    accession_id int,
    enzyme varchar,
    spectrum spectrum,
    primary metrickey(spectrum, cosine_distance));

Application of MoBIoS (III)

• M-SQL Solution

SELECT Prot.accession_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme and DS.enzyme = E
  and Cosine_Distance(S, MS.spectrum, range1)
  and DS.accession_id = MS.accession_id and MS.accession_id = Prot.accession_id
  and DS.ms_peak = P
  and MPAM250(PS, DS.fragment, range2)

BLAST vs MoBIoS

MoBIoS
1. Molecular Biological Information System
2. DBMS specialized for storage, retrieval and mining of biological data
3. Both the sequence database and the query sequence are divided into q-grams, and the database is indexed offline.

BLAST
1. Basic Local Alignment Search Tool
2. Utility specialized for retrieval and mining of biological data outside a database
3. Only the query sequence is divided, and the hot-point index is built at query time.

MoBIoS Demo

• MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp

• PDB : http://www.rcsb.org/pdb/

Conclusion

• Biological data is not random and very likely exhibits the intrinsic structure necessary for metric-space indexing to succeed.

References

• http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf

• http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf

• http://www.cs.utexas.edu/users/mobios/

• http://www.mckoi.com/database/

Appendix


Appendix I- Metric

A metric space is a set of objects S with a distance function d such that, for any three objects x, y, z in S:

1. Non-negativity

d(x,y) > 0 for x ≠ y; d(x,y) = 0 for x = y

2. Symmetry

d(x,y) = d(y,x)

3. Triangle inequality

d(x,y) + d(y,z) ≥ d(x,z)
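The three axioms can be checked empirically on a finite sample of objects. The helper below is a hypothetical sketch: passing it shows only that no violation was found on the sample, not that the function is a metric in general.

```python
from itertools import product


def metric_violations(points, d, tol=1e-9):
    """Return the names of the metric axioms that d violates on the sample."""
    violations = set()
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        if x == y and abs(dxy) > tol:
            violations.add("non-negativity")      # d(x,x) must be 0
        if x != y and dxy <= 0:
            violations.add("non-negativity")      # d(x,y) must be > 0 for x != y
        if abs(dxy - d(y, x)) > tol:
            violations.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            violations.add("triangle inequality")
    return violations


if __name__ == "__main__":
    sample = [0, 1, 2, 5]
    print(metric_violations(sample, lambda a, b: abs(a - b)))
    print(metric_violations(sample, lambda a, b: (a - b) ** 2))
    print(metric_violations(sample, lambda a, b: 1 if a == b else -1))
```

The last call uses a match/mismatch similarity score of the kind produced by substitution matrices; it fails non-negativity, which is the reason a similarity measure cannot be used as a metric distance without transformation.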


Appendix II - Sequence

• 2 RNA sequences from a DNA strand.


Appendix III - PAM

Percent Accepted Mutation (PAM)

A PAM(x) substitution matrix is a look-up table in which the score for each amino acid substitution is calculated from the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence (e.g., PAM250).

PAM is also a unit that quantifies the amount of evolutionary change in a protein sequence. Scores are based on log-odds probabilities.

Appendix IV – PAM250

• At this evolutionary distance (250 substitutions per hundred residues)

Appendix V - BLOSUM

Blocks Substitution Matrix (BLOSUM)

A substitution matrix in which the score for each position is derived from the observed frequencies of substitutions in blocks of local alignments in related proteins (e.g., BLOSUM62).

A unit to quantify the amount of evolutionary change in a protein sequence. Scores are based on log-odds probabilities.

Appendix VI – BLOSUM62

• BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical

Appendix VII – mPAM250

• Expected time based on 250 PAM distance as a unit.

Appendix VIII – mPAM Validation

• Based on a benchmark query set by Smith-Waterman.

• The graph shows ROC50 values (ROC: Receiver Operating Characteristic).

• Negative x-axis values indicate that mPAM performs better.

Figure: Difference between ROC50 values using mPAM and PAM250

Appendix IX - Distance measure

Global Sequence Alignment

Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t finds strings a and b, obtained from s and t respectively by inserting spaces into or at the ends of s and t, whose score computed using M is maximal (similarity measure) or minimal (distance measure) over all such pairs of strings.

Appendix X – Homology Search

Build Index Structure (Offline)

1. Divide the database sequences into a set of overlapping substrings of length q (q-grams) with step size 1.

2. Build a metric-space index D based on global alignment to support constant-time lookup of exact matches.

Homology Search Query (Online)

1. Divide the query sequence W into overlapping substrings of length q with step size 1: F = { w_i | i = 0..|W|-q }.

2. For each w_i in F, run the range query Q(w_i, r) against database D to find the set of matching q-grams R_i = { f_{i,j} | d(f_{i,j}, w_i) ≤ r, f_{i,j} ∈ D, w_i ∈ F }, where d is the distance function.

3. Use a greedy heuristic algorithm to extend and chain all fragments in R_0 ∪ R_1 ∪ … ∪ R_{|W|-q} to deduce the result of the homology search, based on local alignment, for query W.
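The slides do not spell out the greedy heuristic used in step 3. One simple possibility, shown purely as an illustration, is to group q-gram hits by diagonal (database offset minus query offset) and merge hits on the same diagonal that overlap or touch into longer ungapped segments. The function name and the hit representation are hypothetical.

```python
from collections import defaultdict


def chain_hits(hits, q):
    """Merge q-gram hits into ungapped segments.

    hits: iterable of (query_offset, db_offset) pairs for matching q-grams.
    Returns a sorted list of (query_start, db_start, length).
    """
    by_diagonal = defaultdict(set)
    for q_off, d_off in hits:
        by_diagonal[d_off - q_off].add(q_off)

    segments = []
    for diagonal, offsets in by_diagonal.items():
        ordered = sorted(offsets)
        start = prev = ordered[0]
        for q_off in ordered[1:]:
            if q_off > prev + q:
                # Gap on this diagonal: close the current segment, start a new one.
                segments.append((start, start + diagonal, prev + q - start))
                start = q_off
            prev = q_off
        segments.append((start, start + diagonal, prev + q - start))
    return sorted(segments)


if __name__ == "__main__":
    print(chain_hits([(0, 10), (1, 11), (2, 12), (7, 3)], q=3))
```

Three consecutive hits on one diagonal collapse into a single segment of length 5, while the isolated hit remains a segment of length q.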


Appendix XI - GSA

