
Upload: shona-jenkins

Post on 13-Dec-2015


Dataset Classes

• A dataset class tells us:
  – How to handle a particular type of dataset
  – Exactly how to put it into manual delivery
    • (it specifies the API for manual delivery)
  – How to put it in the database
    • (resource XML)
  – How to process it in the workflow
    • (graph XML)


Human Roles

• Dataset Integrator
  – Puts datasets into manual delivery (conforming to the dataset class API)
  – Provides a specification of each dataset for the workflow
• Workflow Pilot
  – Configures the workflow
  – Runs the workflow
• Workflow Developer
  – Writes dataset classes
  – Writes graph files
  – Writes step classes
  – Writes plugins
• ReFlow Developer
  – Develops the underlying workflow system


Organism Abbrev

• Throughout the workflow system, we use a unique, stable “identifier” for an organism: its organism abbrev
• We do not use things like taxon IDs, scientific names, etc.
• Examples:
  – tgonME49
  – pfal3D7
  – ncanLIV
• It always includes:
  – One letter for the genus
  – Three letters for the species
  – The strain
• Once it is set, it does not change, even if we adjust the name of the organism
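The naming rule can be sketched in a few lines of Python. This is only an illustration, not part of the workflow system: the function names are hypothetical, and the lowercase prefix and the alphanumeric strain are assumptions read off the three examples, since the slide states only the three parts.

```python
import re

# Assumed shape, inferred from the examples on the slide: one lowercase
# genus letter, three lowercase species letters, then an alphanumeric strain.
ORG_ABBREV_RE = re.compile(r"^[a-z][a-z]{3}[A-Za-z0-9]+$")


def make_org_abbrev(genus: str, species: str, strain: str) -> str:
    """Build an abbrev such as 'tgonME49' from genus, species and strain."""
    if not genus or len(species) < 3 or not strain:
        raise ValueError("need a genus, a species of 3+ letters, and a strain")
    return genus[0].lower() + species[:3].lower() + strain


def is_org_abbrev(candidate: str) -> bool:
    """Check a string against the assumed abbrev shape."""
    return ORG_ABBREV_RE.match(candidate) is not None


print(make_org_abbrev("Toxoplasma", "gondii", "ME49"))  # tgonME49
```

Because the abbrev never changes once set, a function like this would be run once when an organism is first registered; the stored value, not a recomputed one, is what the workflow keeps using.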


Manual Delivery

• Manual delivery has a very specific structure:

    manualDelivery/
      project/
        organismAbbrev/
          category/
            datasetName/
              datasetVersion/
                final/
                fromProvider/
                workspace/
                README

• final/ contains standard file names that conform to the dataset class API
  – E.g. SNPs.gff
  – They never have the name of the provider or any other dataset-specific info
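As a rough illustration of how a fixed layout makes file locations predictable, the sketch below builds the path of a file in final/. It assumes final/ sits directly under datasetVersion/; the function name and the example category and dataset name ("SNP", "exampleSNPs") are hypothetical and not taken from the slides.

```python
from pathlib import PurePosixPath


def final_dir(project, org_abbrev, category, dataset_name, dataset_version,
              root="manualDelivery"):
    """Return the final/ directory for one dataset version.

    Assumes the nesting root/project/organismAbbrev/category/datasetName/
    datasetVersion/final.
    """
    return PurePosixPath(root, project, org_abbrev, category,
                         dataset_name, dataset_version, "final")


# Hypothetical dataset; SNPs.gff is the standard file name from the slide.
snp_file = final_dir("ToxoDB", "tgonME49", "SNP", "exampleSNPs", "1.0") / "SNPs.gff"
print(snp_file)
```

Since the file name comes from the dataset class and never from the provider, code that loads the data only needs the dataset's properties to find its input.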


Dataset class:

    <datasetClass name="dbxrefs">
      <prop name="orgAbbrev"/>
      <prop name="name"/>
      <prop name="version"/>
      <graphPlanFile name="dbXRefs.xml"/>
      <resource name="${orgAbbrev}_${name}_dbxrefs">
        <manualGet/>
        …
      </resource>
    </datasetClass>

Dataset:

    <dataset class="dbxrefs">
      <prop name="orgAbbrev">myOrg</prop>
      <prop name="name">uniprot</prop>
      <prop name="version">2.0</prop>
    </dataset>

Resources:

    <resources>
      <resource name="myOrg_uniprot_dbxrefs">
        …
      </resource>
      …
    </resources>

Workflow plan:

    <workflow>
      <datasetTemplate class="dbxrefs">
        <prop name="orgAbbrev"/>
        <prop name="name"/>
        <subgraph name="${orgAbbrev}_${name}_dbxrefs" xmlFile="loadResources.xml">
          <paramValue name="what">for</paramValue>
        </subgraph>
      </datasetTemplate>
      ..
    </workflow>

Graph with the expanded subgraph:

    <workflow>
      <step>
        <subgraph name="myOrg_uniprot_dbxrefs"/>
      </step>
    </workflow>

[Diagram: a code generator reads the Datasets (myOrg.xml), the Dataset Classes (classes.xml) and the Workflow Plan (dbXRefs.xml), and writes the generated files: Resources (myOrg.xml) and WorkflowGraph (myOrg/dbXRefs.xml). The diagram also shows a Top Level Graph and other graphs.]
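The step from `${orgAbbrev}_${name}_dbxrefs` to `myOrg_uniprot_dbxrefs` is plain property substitution. The sketch below shows that one step using Python's standard `string.Template`, which happens to use the same `${...}` syntax. It is not the ReFlow code generator, and the `expand` helper is a hypothetical name.

```python
from string import Template


def expand(template: str, props: dict) -> str:
    """Fill ${prop} placeholders from a dataset's properties.

    Raises KeyError if the template names a property the dataset lacks.
    """
    return Template(template).substitute(props)


# Properties of the example <dataset class="dbxrefs"> on the slide.
dataset_props = {"orgAbbrev": "myOrg", "name": "uniprot", "version": "2.0"}

resource_name = expand("${orgAbbrev}_${name}_dbxrefs", dataset_props)
print(resource_name)  # myOrg_uniprot_dbxrefs
```

Using the same template for the resource name and the subgraph name is what lets the generated graph refer to the generated resource without either being written by hand.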


[Diagram: the dataset files generate the resource files and the graph files.]

• Dataset Files
  – ToxoDB.xml
  – ToxoDB/tgonME49.xml
  – ToxoDB/tgonME49/Einstein.xml
• Resource Files
  – ToxoDB.xml
  – ToxoDB/tgonME49.xml
  – ToxoDB/tgonME49/Einstein.xml
• Graph Files
  – ToxoDB/project.xml
  – ToxoDB/tgonME49/ESTs.xml
  – ToxoDB/tgonME49/Einstein/chipChipSamples.xml
  – ToxoDB/tgonME49/dbXRefs.xml
  – ToxoDB/tgonME49/arrayStudies.xml
  – ToxoDB/tgonME49/SNPs.xml


DataSource

• We store simple meta information in the database about each dataset
  – Provider contact info
  – Descriptions
  – Display names
  – References to WDK searches, tables and attributes that use the data
• The information is stored in two tables:
  – DataSource: pulled right from the <resource>
  – DataSourceInfo: provided by a specific file after loading data is completed
• And it is available in the WDK as a DataSource record
  – The search and record pages (e.g. Gene) can access this info for display purposes
  – Soon we will support searches for these, e.g. find all searches that involve a certain dataset
• It makes no sense to have two names:
  – <resource>
  – DataSource table and Perl objects
• So, either:
  – Rename <resource> to <datasource>
    • This is a pain to transition to in our code
  – Or, rename DataSource to DataResource and keep <resource> as is


DataResource?

• It makes no sense to have two names:
  – <resource>
  – DataSource table, Perl objects, and WDK record
• So, either:
  – Rename <resource> to <datasource>
    • This is a pain to transition to in our code
  – Or, rename DataSource to DataResource and keep <resource> as is


DataResourceInfo

• DatasetClasses do not include meta info about the dataset:
  – Contact info
  – Description
  – Mapping to WDK searches and records
• DatasetClasses describe how to load the data

• But, we can have DatasetClass