MotifML: A Novel Ontology-based XML Model for Data-Exchange of Regulatory DNA Motif Profiles

Eric Neumann, Beyond Genomics
Tian Niu, Harvard University
Ken Baclawski, Northeastern University


Post on 26-Jan-2016


Page 1

MotifML: A Novel Ontology-based XML Model for Data-Exchange of Regulatory DNA Motif Profiles

Eric Neumann, Beyond Genomics

Tian Niu, Harvard University

Ken Baclawski, Northeastern University


Page 2

       ========== = ============ === = ===== === = = ===== == =======
human  GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse  GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
| | | |
-70 -45 -20 +1

DNA Motifs

Alignment Profile

Functional Significance?

Motifs

Page 3

Motif Finding Tools

AlignACE, Gibbs, Consensus, BioProspector

Page 4

Information resides at multiple sources
Data follow multiple structures
Multiple interfaces

The Need for motifML

BioProspector, Gibbs, AlignACE, Consensus

MotifML

Integrated XML view

Page 5

Gene expression regulation that is dependent on activated transcriptional factors

Key element of Gene Networks: Complex analysis of microarrays

Motif Function

Cis-Elements Associated with a Gene

Transcriptional Factors

++ Regulated Gene Expression

Page 6

motifML Goals

to allow the full specification of all experimental information known about motifs

to provide an extensible framework for this annotation and provide a common vehicle for exchanging the motif information

to provide a single document interface to integrate all project information, complete with protocols for network data retrieval.

Page 7

motifML Design

formal and concise: ontology based
motifML documents easy to create
clarity more important than brevity
use both XML Schema and XML DTD

Page 8

motifML Semantics

Annotation
» The collection of features for a given set of sequence(s) that have built-in semantics
Features
» Characteristics supported by analytic evidence
Analyses
» Computational
» Experimental

Page 9

motifML Semantics

Annotation

Features

Motifs

Results

Property

Intentional Extraction

Semantically Definable & Searchable

Ontology

Pragmatic Objects

Analyses

Page 10

<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <dbxref>
    <database>GenBank</database>
    <unique_id>14588658</unique_id>
  </dbxref>
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
    <evidence>
      <reference paper="Davies, J Mol Biol. 1993 296:1205-14"/>
    </evidence>
  </feature>
  <residues type="dna">
    ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGGGTCCTTTAGACAGCTCATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCTGAGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGCGCAGCCACGGAGGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGCCCTCAGGGTCATCGAGCATGTGGAGCAAGGTAATGCTGATGAGTTCGGGGTGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGGTTCTCAAGTGGGATCTCCCACCAGCAACATCAGCATC
    ACCTGGAAAC
  </residues>
</seq>


motifML Sequence Item
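A minimal sketch of how a consumer might read a sequence item, assuming the element and attribute names shown on this slide. The residue string is shortened to its first 45 bases, the function name `motif_sites` is hypothetical, and treating `start`/`end` as 1-based inclusive coordinates is an assumption, since the slide does not state the convention.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's sequence item (first 45 residues only).
SEQ_XML = """
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
  </feature>
  <residues type="dna">ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGGAGC</residues>
</seq>
"""

def motif_sites(seq_xml):
    """Return (motif name, start, end, subsequence) for each feature."""
    seq = ET.fromstring(seq_xml)
    # Drop any whitespace inside the residue text before indexing.
    residues = "".join(seq.findtext("residues").split())
    sites = []
    for feature in seq.findall("feature"):
        motif = feature.find("motif")
        pos = feature.find("position")
        start, end = int(pos.get("start")), int(pos.get("end"))
        # Assumption: coordinates are 1-based and inclusive.
        sites.append((motif.get("name"), start, end, residues[start - 1:end]))
    return sites

for site in motif_sites(SEQ_XML):
    print(site)
```

Keeping the feature coordinates separate from the residues, as the format does, means the same sequence can carry any number of annotated motifs without being duplicated.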

Page 11

Computational Analysis


<!ELEMENT computational_analysis (date?, program, version?, parameter*, database?, result_set+)>
<!ATTLIST computational_analysis seq IDREF #REQUIRED>
<!ELEMENT program (#PCDATA)>
<!ELEMENT result_set (score?, output*, result*)>
<!ELEMENT result (score, type, subtype?, seq_relationship+, output*)>
<!ATTLIST result id ID #IMPLIED>
<!ELEMENT seq_relationship (location, alignment?)>
<!ATTLIST seq_relationship
  seq IDREF #REQUIRED
  type (query | subject | peer) #REQUIRED>
<!ELEMENT alignment (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT parameter (type, value)>
<!ELEMENT output (type, value)>
<!ELEMENT database (name, date?, version?)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT score (#PCDATA)>
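As an illustration of the content model above, the following sketch builds one `computational_analysis` element with children in the order the DTD requires. The helper names `text_child` and `build_analysis` are hypothetical, the version string is a placeholder, and plain text content for `<location>` is an assumption because that element is referenced but not declared on the slide.

```python
import xml.etree.ElementTree as ET

def text_child(parent, tag, text):
    """Append a child element that holds only character data."""
    child = ET.SubElement(parent, tag)
    child.text = text
    return child

def build_analysis(seq_id, program, version, score, result_type, location):
    # Child order follows the DTD: program, version?, ..., result_set+
    analysis = ET.Element("computational_analysis", {"seq": seq_id})
    text_child(analysis, "program", program)
    text_child(analysis, "version", version)
    result_set = ET.SubElement(analysis, "result_set")
    result = ET.SubElement(result_set, "result")
    # result: score, type, subtype?, seq_relationship+, output*
    text_child(result, "score", score)
    text_child(result, "type", result_type)
    rel = ET.SubElement(result, "seq_relationship",
                        {"seq": seq_id, "type": "query"})
    # Assumption: <location> carries plain text; the slide does not declare it.
    text_child(rel, "location", location)
    return analysis

analysis = build_analysis("demo_seq", "AlignACE", "hypothetical-version",
                          "794.004", "motif", "788")
print(ET.tostring(analysis, encoding="unicode"))
```

ElementTree does not validate against a DTD, so the ordering here is enforced only by how the function is written; a validating parser would be needed to check arbitrary input.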

Page 12

Heat shock and other environmental and pathophysiologic stresses stimulate synthesis of heat shock proteins (Hsps). These proteins enable the cell to survive and recover from stressful conditions by as yet incompletely understood mechanisms.

A conserved 14 base pair regulatory sequence, referred to as the heat shock element (HSE), is found in multiple imperfect copies upstream of the TATA box of all heat shock genes.

Genes with an HSE in the upstream region may be co-regulated.

HSP and HSE

Page 13

Dataset (Vertebrates)*

> gid 3004462, start=1, end=1027
> gid 7861931, start=1, end=666
> gid 7108904, start=1, end=1519
> gid 7739662, start=1, end=800
> gid 64795, start=1, end=487
> gid 64791, start=1, end=614
> gid 64789, start=1, end=1128
> gid 64786, start=1, end=374
> gid 32480, start=1, end=483
> gid 32484, start=1, end=711
> gid 7669470, start=1, end=424
> gid 5729878, start=1, end=313
> gid 5031770, start=1, end=760
> gid 1816451, start=1, end=2179
> gid 184422, start=1, end=2634
> gid 184416, start=1, end=488
> gid 188491, start=1, end=959
> gid 4691417, start=1, end=2631
> gid 188489, start=1, end=485
> gid 188487, start=1, end=489
> gid 184416, start=1, end=488
> gid 211940, start=1, end=391
> gid 63508, start=1, end=1421
> gid 63512, start=1, end=2300
> gid 409185, start=1, end=1231
> gid 163160, start=1, end=491
> gid 414974, start=1, end=426

*Data are from GenBank

Page 14

Uses a Gibbs sampling strategy similar to that described by Neuwald et al., 1995

An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set

Reference: Hughes et al., J Mol Biol. 2000 296:1205-14

AlignACE Program

Page 15

AlignACE Results

...
Motif 1
GGGGAGGGGGTGGGGGGGC 23 788 0
GGCGGGCGGGCGGCGGGGG 23 867 1
GGACAGCGGCGGCTGGCTG 11 107 0
GGGGTGCGGGGGCAGGCGC 23 1417 1
CCGCGGGGGCGGGCGGGGC 13 2034 1
...
** * ***** ** *** *
MAP Score: 794.004

Motif 2
GGGGAGGGGGTGGGGGGGCGGGG 23 784 0
GTGCGGGGGCAGGCGCGGAGAGC 23 1420 1
GCGGAGCGGGAGGGGGCGTGGCC 13 1932 1
GGGGTGCGGGAGGGCGGGCGGGC 23 1448 1
GGGCAGTGGGCGGCTGGCAGCTG 14 1452 1
...
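Output like this is what a common motif format has to absorb. As a rough sketch, the five Motif 1 sites shown above can be turned into a per-position frequency profile. The reading of the four columns as site, sequence number, position, and strand flag is an assumption, and `parse_site` and `frequency_matrix` are hypothetical helper names, not part of AlignACE.

```python
from collections import Counter

# Motif 1 site lines as shown on the slide.
ALIGNACE_LINES = [
    "GGGGAGGGGGTGGGGGGGC 23 788 0",
    "GGCGGGCGGGCGGCGGGGG 23 867 1",
    "GGACAGCGGCGGCTGGCTG 11 107 0",
    "GGGGTGCGGGGGCAGGCGC 23 1417 1",
    "CCGCGGGGGCGGGCGGGGC 13 2034 1",
]

def parse_site(line):
    # Assumed columns: site, sequence number, position, strand flag.
    site, seq_no, position, strand = line.split()
    return {"site": site, "seq": int(seq_no),
            "pos": int(position), "strand": int(strand)}

def frequency_matrix(sites):
    """Per-column base frequencies for equal-length aligned sites."""
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)  # missing bases count as 0
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix

records = [parse_site(line) for line in ALIGNACE_LINES]
matrix = frequency_matrix([r["site"] for r in records])
for i, row in enumerate(matrix, start=1):
    print(i, row)
```

Each row of the resulting matrix corresponds to one position of the motif, which is the same shape of data that a MotifML block is meant to carry.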

Page 16

Uses stochastic iterative sampling
The Bernoulli motif sampler assumes that each sequence can contain zero or more ungapped motif elements of each motif type
References:
» Lawrence et al., Science 1993;262(5131):208-14
» Neuwald et al., Protein Sci. 1995 Aug;4(8):1618-32

Gibbs Motif Sampler Program

Page 17

Gibbs Results

...

4, 1 284 agtgc AGAGTCTGGAGAGC cgaat 271 0.87 R gid 7739662, start=1, end=800

4, 2 425 ggtat AGATGTCGGAGAGT cgttt 412 0.79 R gid 7739662, start=1, end=800

4, 3 643 atgga AGCCTCGGGAAACT tcggg 656 0.86 F gid 7739662, start=1, end=800

5, 1 239 atgga AGCCTCGGGAAACT tcggg 252 0.86 F gid 64795, start=1, end=487

7, 1 401 agtgt GGGTGCTGGAGGCT gacgg 388 0.99 R gid 64789, start=1, end=1128

9, 1 26 ggagt GGCGGTGGGAAGGG tgttg 13 0.99 R gid 32480, start=1, end=483

...

**************

...

Page 18

Uses entropy-based scoring functions
References:
» Stormo and Hartzell, PNAS 1989;86:1183-1187
» Hertz et al., CABIOS 1990;6:81-92

Consensus Program

Page 19

Consensus Results

MATRIX 1
...
1|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 2
...
1|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 3
1|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 4
1|21 : 1/38 GGGAAAGCTCGAGA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...

Page 20

A program that examines the upstream region of genes in the same gene expression pattern group to search for regulatory sequence motifs.

Uses zero- to third-order Markov background models
Allows for the searching of gapped motifs and motifs with palindromic patterns
Reference: Liu et al., Pac Symp Biocomput. 2001:127-38

BioProspector Program
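To make the idea of a k-th order Markov background concrete, here is a minimal sketch that estimates the probability of each base given the preceding k bases from raw counts. This is not BioProspector's implementation: the function name `markov_background` is hypothetical, no pseudocounts or smoothing are applied, and the toy input sequence is chosen only for illustration.

```python
from collections import defaultdict

def markov_background(sequences, order):
    """Estimate P(base | previous `order` bases) from raw counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]  # empty string when order == 0
            counts[context][seq[i]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {base: n / total for base, n in following.items()}
    return model

# Toy input; a real background would be estimated from many upstream regions.
toy = ["TCATCCAATCAGAG", "TCAACCGAACAGAA", "TCGACCAATCAAAA"]
for k in range(4):  # zero- to third-order
    model = markov_background(toy, k)
    print(k, len(model), "contexts")
```

A zero-order model reduces to plain base frequencies; higher orders capture short-range composition such as dinucleotide bias, at the cost of needing more data per context.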

Page 21

BioProspector Results

...
Motif #1:
...
Seq #1 seg 1 r998 TCATCCAATCAGAG
Seq #2 seg 1 f91 TCAACCGAACAGAA
Seq #3 seg 1 r638 TCGACCAATCAAAA
...
Motif #2:
...
Seq #1 seg 1 f38 GGGAAAGCTCGAGA
Seq #2 seg 1 r648 TGGAAGCCTCCAGT
Seq #3 seg 1 r620 TGGAAGCCTCCAGT
...
Motif #3:
...
Seq #1 seg 1 r997 CTCATCCAATCAGA
Seq #2 seg 1 f90 CTCAACCGAACAGA
Seq #3 seg 1 r637 TTCGACCAATCAAA
...

Page 22

Conceptions and Interactions of the Underlying Statistical Algorithms Used by the Motif Searching Programs

Gibbs

AlignACE

CONSENSUS

$I_{seq} = \sum_{b=A}^{T} f_b \log_2 (f_b / p_b)$

Information Content

BioProspector

Gibbs Sampler; Iterative Updating Strategy

Two Block Motif Model
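A small worked example of the information content measure for a single motif position, summing f_b log2(f_b / p_b) over the four bases. The uniform background (p_b = 0.25) is an assumption made for illustration, and the two frequency rows are taken from the Blk1 table on the "Example of a motif" slide. Terms with f_b = 0 are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def information_content(freqs, background):
    """Sum over bases of f_b * log2(f_b / p_b); bases with f_b == 0 add nothing."""
    return sum(f * math.log2(f / background[b])
               for b, f in freqs.items() if f > 0)

# Assumption: equal background probability for every base.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Rows 6 and 1 of the Blk1 table.
row6 = {"A": 0.00, "G": 0.00, "C": 1.00, "T": 0.00}
row1 = {"A": 0.00, "G": 0.21, "C": 0.21, "T": 0.59}

print(information_content(row6, UNIFORM))  # fully conserved position
print(information_content(row1, UNIFORM))  # weakly conserved position
```

Under a uniform background the value ranges from 0 bits (no preference) to 2 bits (one base always present), which is why conserved positions dominate a motif's total score.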

Page 23

Motif Data Representation

Common data representation for motif information.
Uses XML Schema to specify format.
Both human and machine readable.
Supports "knowledge mining".
Statements can be asserted about a motif such as a role in gene regulation.

Page 24

Example of a motif

<motif id="GXY1">
  <block>
    <base type="G">0.21</base>
    <base type="C">0.21</base>
    <base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base>
    <base type="C">0.50</base>
    <base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base>
    <base type="G">0.29</base>
  </block>
  ...
</motif>

Blk1  A     G     C     T
1     0.00  0.21  0.21  0.59
2     0.00  0.44  0.50  0.06
3     0.70  0.29  0.00  0.00
4     0.32  0.62  0.00  0.06
5     0.03  0.00  0.97  0.00
6     0.00  0.00  1.00  0.00
7     0.85  0.09  0.03  0.03
8     0.88  0.12  0.00  0.00
9     0.03  0.00  0.03  0.94
10    0.03  0.09  0.88  0.00
11    0.70  0.12  0.18  0.00
...
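The correspondence between the XML and the table can be sketched in a few lines: each `<block>` becomes one row, and any base that is omitted from a block is read as 0.00, which matches the table above. The function name `motif_matrix` is hypothetical, and only the three blocks shown on the slide are included.

```python
import xml.etree.ElementTree as ET

# The three blocks shown on the slide; the elided remainder is left out.
MOTIF_XML = """
<motif id="GXY1">
  <block>
    <base type="G">0.21</base><base type="C">0.21</base><base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base><base type="C">0.50</base><base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base><base type="G">0.29</base>
  </block>
</motif>
"""

def motif_matrix(xml_text):
    """Return (motif id, list of per-block {base: probability} rows)."""
    motif = ET.fromstring(xml_text)
    rows = []
    for block in motif.findall("block"):
        # Bases not listed in a block default to probability 0.0.
        row = dict.fromkeys("AGCT", 0.0)
        for base in block.findall("base"):
            row[base.get("type")] = float(base.text)
        rows.append(row)
    return motif.get("id"), rows

motif_id, rows = motif_matrix(MOTIF_XML)
print(motif_id)
for number, row in enumerate(rows, start=1):
    print(number, row)
```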

Page 25

XML Schema

Extends the XML document type language:
» Data format restrictions.
» Data value (min and max) restrictions.
» Element occurrence (min and max) restrictions.
No sophisticated restrictions:
» Probability distribution.
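Because the schema language has no way to require that the values in a block form a probability distribution, that check would have to live in application code. A minimal sketch follows; the function name `check_block` is hypothetical, and the tolerance of 0.02 is an assumption chosen because the slide values are rounded to two decimals (the first example block sums to 1.01).

```python
VALID_BASES = {"A", "C", "G", "T"}

def check_block(probabilities, tolerance=0.02):
    """Return a list of problems found in one block's base probabilities."""
    problems = []
    for base, p in probabilities.items():
        if base not in VALID_BASES:
            problems.append(f"unknown base type {base!r}")
        if not 0.0 <= p <= 1.0:
            problems.append(f"probability {p} for {base!r} is outside [0, 1]")
    total = sum(probabilities.values())
    # Allow for rounding in the published values.
    if abs(total - 1.0) > tolerance:
        problems.append(f"probabilities sum to {total:.2f}, expected 1.00")
    return problems

print(check_block({"G": 0.21, "C": 0.21, "T": 0.59}))  # block 1 from the example
print(check_block({"A": 0.70, "G": 0.70}))             # deliberately invalid
```

The range check on individual values is something XML Schema can express; the sum across sibling elements is the part that cannot be stated there.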

Page 26

XML Schema for MotifML

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="motif" type="MotifType"/>
<!-- A motif consists of a sequence of blocks. -->
<xsd:complexType name="MotifType">
  <xsd:sequence>
    <xsd:element name="block" minOccurs="0" maxOccurs="unbounded" type="BlockType"/>
  </xsd:sequence>
</xsd:complexType>
<!-- A block specifies a probability for each DNA base type. -->
<xsd:complexType name="BlockType">
  <xsd:sequence>
    <xsd:element name="base" minOccurs="1" maxOccurs="4">
...

Page 27

Statements about motifs

<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:mml="http://www.beyondgenomics.com/2001/07/motifml#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
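Each property inside a `Description` is one statement (subject, predicate, object) about the motif. As a rough sketch, the statements can be listed with a plain XML parser; a dedicated RDF library would be the more robust choice. The function name `triples` is hypothetical, and the sample keeps two of the three statements from the slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
"""

def triples(rdf_text):
    """Return (subject, predicate, object) URIs for each statement."""
    root = ET.fromstring(rdf_text)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")  # unprefixed attribute, as on the slide
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            namespace, _, local = prop.tag[1:].partition("}")
            obj = prop.get(f"{{{RDF_NS}}}resource")
            out.append((subject, namespace + local, obj))
    return out

for statement in triples(RDF_XML):
    print(statement)
```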

Page 28

How do biologists learn the element structure of a document describing the heterogeneous sequence alignment output?

How do biologists share the structure and meta-data on motif profiles efficiently and unambiguously?

The Need for Bio-Ontologies


Page 29

       ========== = ============ === = ===== === = = ===== == =======
human  GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse  GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
PCE-I -CBE-- AP-4 8888 cETS cETS
| | | |
-70 -45 -20 +1

A multiple sequence alignment linked with TRANSFAC/TRANSPATH

Shown here is the alignment from -70 to +1. The numbering shown corresponds to the mouse sequence. Identical bases are shown by the = above each nucleotide. Consensus sequence matches conserved among all three species are: the Ret-1/PCE-I element at -65 to -60, the CRX-binding element (CBE) at -55 to -50, an AP-4 consensus core sequence at -37 to -34, a cETS consensus core at -35 to -31 and another at positions -57 to -54, and an S8 homeodomain is shown by "8888" at -64 to -61. Only the core bases are marked. The criteria for searching the TRANSFAC Database by MatInspector were a match to the core sequence of at least 80% and to the entire consensus sequence of at least 85%. The Genbank entries for human, bovine, and mouse are X53044, M32733, and M32734, respectively. (Boatright, Mol Vis 1997; 3:15)

Alignment Profile

Page 30

Transcriptional Factors Ontology

Composite Element

Site

Transcriptional Motif Elements

Transcriptional Factors

Context Transcript

•Tissue

•Stage

•Disease

•Env.Cond.

•Induced

Kind of

Part of

Binds to

Upstream to

Within

Found in

produces

Gene

Observation

contains

Page 31

Develop a data exchange format for DNA motif data
Handling output from motif analyses
Annotation and data mining of micro-array data
Important in modeling transcriptional regulatory networks in eukaryotes

MotifML Applications

Page 32

Future Directions

Distributed Annotation System (Lincoln Stein, Open-Bio)
Exchange with Other XML Dialects
DAML development