mobios a metric-space dbms to support biological discovery presenter: enohi i. ibekwe

Post on 15-Jan-2016

34 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe. Overview. MoBIoS Project Motivation The challenge Established similarity measures Metric-space distance measure Disk-based metric tree index MoBIoS as a DBMS Application of MoBIoS. MoBIoS Project. - PowerPoint PPT Presentation

TRANSCRIPT

MoBIoS

A Metric-space DBMS to Support Biological Discovery

Presenter: Enohi I. Ibekwe

Overview

• MoBIoS Project• Motivation• The challenge• Established similarity measures• Metric-space distance measure• Disk-based metric tree index• MoBIoS as a DBMS• Application of MoBIoS

MoBIoS Project

• Molecular Biological Information System

• Project at UT-Austin center for computational biology and bioinformatics.

• DBMS based on metric-space indexing techniques, object-relational model of genomic and proteomic data types and a database query language that embodies the semantics of genomic and proteomic data.

Motivation

Develop a DBMS to power Biological Information System

The Challenge

• Established biological model of similarity measure do not form a metrics.

• Scalable disk-based metric-indexes suffer from the Curse of dimensionality

Established Similarity Measure (I)

• Sequence Homology– Query Sequence– Database of sequences– Substitution Matrix (PAM / BLOSUM)– Similarity Measure

– Global Sequence Alignment (Edit distance)

– Local Sequence Alignment (Most important)

Established Similarity Measure (II)

• Local Sequence Alignment– A local sequence alignment query asks, given a query

sequence S, a database of sequences T and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S

– Main issue: Result is a set of answer.

• A metric distance function must return a single value for each pair of argument

Established Similarity Measure (III)

• Global Sequence Alignment– Given an alphabet A , a similarity substitution matrix

M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) over all pairs of such strings obtained from s and t. (example)

– Issue: Result maybe negative since substitution matrix is based on log-odd probability. Similarity measure favors greater positive number.

Metric-space Distance measure (I)

• Homology Search• Query Sequence: Sub strings of length q (q-grams)

• Database of sequences: Metric indexed records of fixed length q (indexed q-grams) strings.

• Substitution Matrix (mPAM)

• Similarity Measure (distance measure)

– Local Alignments is computed from global alignment.

• mPAM substitution Matrix – Accepted Point Mutation Model.

– PAM calculates scores based on frequency in which individual pairs of amino acids substituted for each other.

– mPAM instead of calculating frequency of substitutions (PAM), computes expected time between substitution.

– mPAM has been validated.(Validation)

Metric-space Distance measure (II)

Metric-space Distance measure (III)

• Computing Local Alignment from Global Alignment (Algorithm)

– Offline

1. Divide database of sequence into sub strings (q-grams)

2. Build metric-space index structure on q-grams

– Online

1. Divide query sequence into sub strings (q-grams)

2. Using global alignment as a distance function to match query q-grams.

Disk-based metric-tree index• Phases

• Initialization• Searching

• Query performance metric• Number of disk I/O ( nodes visited)• Number of distance computation

• Options Exploited• M-Tree• Generalized Hyper plane tree• MVP-Tree (optimal)

Disk-based metric-tree index (initialization)

• M-Tree initialization– Best case : O(nlogn);

– worst case: O(n3)

• Generalized Hyper plane (GH-Tree) initialization– Best case : O(nlogn);

– worst case: O(n2)

• GH-tree: Bi-direction

• M-Tree: Bottom-up

• In practice, both M-Tree and GH-Tree scale linearly

Disk-based metric-tree index (Searching)

MoBIoS as a DBMS (I)

• Mckoi (Java RDBMS).– Plus metric-space indexing

– Plus Biological data types

– Plus biological semantics

• Life science data store– Biological sequence data

– Mass-spectrometry protein signature

MoBIoS as a DBMS (III)

• Language Extension– M-SQL

• Data type Extension– Data type for Sequences (DNA,RNA,peptide)– Data type for Mass spectrum

• Semantics Extension– Subsequence Operators– Local alignment

MoBIoS as a DBMS (IV)

• Semantics Extension – Similarity (metric distance) between data types

• mPAM250• Cosine distance• Lk norms

• Keys Extension – Primary key (metrickey)– Index (metric)

Application of MoBIoS (I)

• MS/MS Protein Identification1. Breakdown protein into fragments called

peptide using a protease enzyme

2. Identify protein by using a mass-spectrometer to measure the mass-charge ratio of the fragments and comparing the experiment result to a database of precomputed spectra.

Application of MoBIoS(II)

M-SQL Solution Create table protein_sequences

(accesion_id int,

sequence peptide,

primary metrickey(sequence, mPAM250);

Create table digested_sequences (accession_id int,

fragment peptide,

enzyme varchar,

ms_peak int, primary key(enzyme, accession_id);

Create index fragment_sequence on digested_sequences (fragment)

metric(mPAM250);

Create table mass_spectra(accession_id int, enzyme varchar,spectrum spectrum, primary

metrickey(spectrum, cosine_distance);

Application of MoBIoS(III) • M-SQL SolutionSELECT Prot.accesion_id, Prot.sequenceFROM protein_sequences Prot, digested_sequences DS,mass_spectra MS

WHEREMS.enzyme = DS.enzyme = E andCosine_Distance(S, MS.spectrum, range1) andDS.accession_id = MS.accession_id = Prot.accesion_id andDS.ms_peak = P andMPAM250(PS, DS.sequence, range2)

BLAST vs MoBIoS

MoBIoS 1. Molecular Biological

Information System

2. DBMS specialized for storage, retrieval and mining of biological data

3. Sequence Database and query sequence is divided into q-grams and Database is indexed offline.

BLAST1. Basic Local Alignment Search

Tool

2. Utility specialized for retrieval and mining of biological data outside a database

3. Only query sequence is divide and hot-point index is done at query time

MoBIoS Demo

• MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp

• PDB : http://www.rcsb.org/pdb/

Conclusion

• Biological data is not random and very likely exhibit the intrinsic structure necessary for metric-space indexing to succeed.

References

• http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf

• http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf

• http://www.cs.utexas.edu/users/mobios/

• http://www.mckoi.com/database/

Appendix

ReturnReturn

Appendix I- Metric

A metric-space is a set of objects S, with a distance function d, such that given any three objects x, y, z,

1. Non-Negativity

d(x,y) > 0 for x = y; d(x,y) = 0 for x = y

2. Symmetry

d(x,y) = d(y,x)

3. Triangular inequality

d(x,y) + d(y,z) = d(x,y)

ReturnReturn

Appendix II - Sequence

• 2 RNA sequences from a DNA strand.

ReturnReturn

Appendix III - PAM

Percent Accepted Mutation(PAM)

A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. (e.g PAM250)

A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability.

ReturnReturn

Appendix IV – PAM250• At this evolutionary distance (250 substitutions per hundred

residues)

ReturnReturn

Appendix V - BLOSUM

Blocks Substitution Matrix (BLOSUM)

A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related ( e.g BLOSUM62)

A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability

ReturnReturn

Appendix VI – BLOSUM62• BLOSUM62 matrix is calculated from protein blocks such

that if two sequences are more than 62% identical

ReturnReturn

Appendix VII – mPAM250• Expected time based on 250 PAM distance as a unit.

ReturnReturn

Appendix VIII – mPAM Validation

• Based on benchmark query set by Smith-Waterman.

• Graph shows ROC50 values (Receiver Operating Characteristics)

• Negative x- axis indicate mPAM has better performance

Difference between ROC50 values using mPAM and PAM250

ReturnReturn

Appendix IX - Distance measure

Global Sequence Alignment

Given an alphabet A , a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) or minimum (distance measure) over all pairs of such strings obtained from s and t.

ReturnReturn

Appendix X – Homology Search

Build Index Structure(Offline)1. Divide the database sequences into a set of overlapping sub

strings of length q (q-grams) with step size 1.2. Build a metric-space index D based on global alignment to

support constant time lookup of exact match.

Homology Search Query (Online)1. Divide the query sequence W into overlapping sub string , F

= {wi | i =0..| W |-q }, of length q with step size 1.

2. For each wi in F, run range query Q(wi, r) against database D to find a set of matching q-grams, Ri = f i,j | d( f i,j , wi) <= r, f i,j E D wi E F }, where d is the distance function.

3. Using a greedy heuristic algorithm to extend and chain all fragments in R0UR1U…Rw-t to deduce the result of homology search based on local alignment for query W

ReturnReturn

Appendix XI - GSA

ReturnReturn

top related