Dataset Classes
• A dataset class tells us:
  – How to handle a particular type of dataset
  – Exactly how to put it into manual delivery
    • (it specifies the API for manual delivery)
  – How to put it in the database
    • (resource XML)
  – How to process it in the workflow
    • (graph XML)
Human Roles
• Dataset Integrator
  – Puts datasets into manual delivery (conforming to the dataset class API)
  – Provides a specification of each dataset for the workflow
• Workflow Pilot
  – Configures the workflow
  – Runs the workflow
• Workflow Developer
  – Writes dataset classes
  – Writes graph files
  – Writes step classes
  – Writes plugins
• ReFlow Developer
  – Develops the underlying workflow system
Organism Abbrev
• Throughout the workflow system, we use a unique, stable “identifier” for an organism: its organism abbrev
• We do not use things like taxon IDs, scientific names, etc.
• Examples:
  – tgonME49
  – pfal3D7
  – ncanLIV
• It always includes:
  – One letter for the genus
  – Three letters for the species
  – The strain
• Once it is set, it does not change, even if we adjust the name of the organism
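The naming rule can be sketched in Python. This is only an illustration, not code from the workflow system: the helper names `make_org_abbrev` and `parse_org_abbrev` are hypothetical, and taking the first three letters of the species in lowercase is an assumption inferred from the examples (the slide only says "three letters for the species").

```python
import re
from typing import Tuple

# Assumed shape: one lowercase genus letter, three lowercase species
# letters, then the strain written as-is (e.g. "ME49", "3D7", "LIV").
ORG_ABBREV_RE = re.compile(r"^([a-z])([a-z]{3})(\S+)$")


def make_org_abbrev(genus: str, species: str, strain: str) -> str:
    """Build an organism abbrev from genus, species and strain (hypothetical helper)."""
    if not genus or len(species) < 3 or not strain:
        raise ValueError("need a genus, a species of at least 3 letters, and a strain")
    return genus[0].lower() + species[:3].lower() + strain


def parse_org_abbrev(abbrev: str) -> Tuple[str, str, str]:
    """Split an organism abbrev into (genus letter, species letters, strain)."""
    match = ORG_ABBREV_RE.match(abbrev)
    if match is None:
        raise ValueError("not a valid organism abbrev: %r" % abbrev)
    return match.group(1), match.group(2), match.group(3)


print(make_org_abbrev("Toxoplasma", "gondii", "ME49"))
print(parse_org_abbrev("pfal3D7"))
```

Because the abbrev is meant to stay fixed once set, a helper like this would only be run when an organism is first registered, never to recompute an existing abbrev after a rename.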
Manual Delivery
• Manual delivery has a very specific structure:
manualDelivery/
  project/
    organismAbbrev/
      category/
        datasetName/
          datasetVersion/
            final/
            fromProvider/
            workspace/
            README
• final/ contains standard file names that conform to the dataset class API
  – Eg: SNPs.gff
  – They never have the name of the provider or any other dataset specific info
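A short Python sketch of how a path into manual delivery could be assembled from its parts. The function names are hypothetical, the category `SNP` and dataset name `exampleDataset` are made-up values, and the sketch assumes that final/ sits directly under the datasetVersion/ directory.

```python
from pathlib import PurePosixPath


def dataset_version_dir(project, org_abbrev, category, dataset_name,
                        dataset_version, root="manualDelivery"):
    """Return the directory for one version of one dataset (hypothetical helper)."""
    return PurePosixPath(root, project, org_abbrev, category,
                         dataset_name, dataset_version)


def final_file(version_dir, standard_name):
    """Return the path of a standard-named file inside final/."""
    return version_dir / "final" / standard_name


# Illustrative values only; "SNP" and "exampleDataset" are not from the slides.
version_dir = dataset_version_dir("ToxoDB", "tgonME49", "SNP",
                                  "exampleDataset", "1.0")
print(final_file(version_dir, "SNPs.gff"))
```

Note that the provider and dataset-specific details appear only in the directory levels, so the file name inside final/ can stay the same for every dataset of a given class.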
<datasetClass name="dbxrefs">
  <prop name="orgAbbrev"/>
  <prop name="name"/>
  <prop name="version"/>
  <graphPlanFile name="dbXRefs.xml"/>
  <resource name="${orgAbbrev}_${name}_dbxrefs">
    <manualGet/>
    …
  </resource>
</datasetClass>

<dataset class="dbxrefs">
  <prop name="orgAbbrev">myOrg</prop>
  <prop name="name">uniprot</prop>
  <prop name="version">2.0</prop>
</dataset>

<resources>
  <resource name="myOrg_uniprot_dbxrefs">
    …
  </resource>
  …
</resources>

<workflow>
  <datasetTemplate class="dbxrefs">
    <prop name="orgAbbrev"/>
    <prop name="name"/>
    <subgraph name="${orgAbbrev}_${name}_dbxrefs" xmlFile="loadResources.xml">
      <paramValue name="what">for</paramValue>
    </subgraph>
  </datasetTemplate>
  ..
</workflow>
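The XML above suggests that a dataset's property values are substituted into the `${...}` placeholders of its dataset class. A minimal Python sketch of that substitution follows. It uses the standard library's `string.Template` as a stand-in; how the real code generator performs the expansion is not shown in the slides, and `expand_name` is a hypothetical helper.

```python
from string import Template

# Taken from the <datasetClass name="dbxrefs"> example.
RESOURCE_NAME_TEMPLATE = "${orgAbbrev}_${name}_dbxrefs"
DECLARED_PROPS = ("orgAbbrev", "name", "version")


def expand_name(template, declared_props, dataset_props):
    """Fill ${prop} placeholders with a dataset's property values (hypothetical helper)."""
    missing = [p for p in declared_props if p not in dataset_props]
    if missing:
        raise KeyError("dataset is missing props: " + ", ".join(missing))
    return Template(template).substitute(dataset_props)


# Values from the <dataset class="dbxrefs"> example.
dataset = {"orgAbbrev": "myOrg", "name": "uniprot", "version": "2.0"}
print(expand_name(RESOURCE_NAME_TEMPLATE, DECLARED_PROPS, dataset))
```

With the example values this yields `myOrg_uniprot_dbxrefs`, the same name that appears on the generated `<resource>`.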
[Diagram. Labels: Datasets, Dataset Classes, Code generator, Workflow Plan, Top Level Graph, Another Graph (×3)]

<workflow>
  <step/>
  <subgraph name="myOrg_uniprot_dbxrefs"/>
  <step/>
</workflow>
[Diagram. Labels: Dataset Files, Graph Files, Resource Files, Generated files, Resources, Workflow Graph. File names shown: myOrg.xml, classes.xml, dbXRefs.xml, myOrg.xml, myOrg/dbXRefs.xml]
[Diagram: files and what they generate]
– ToxoDB.xml
– ToxoDB/tgonME49.xml
– ToxoDB/tgonME49/Einstein.xml
Generates
– ToxoDB/project.xml
– ToxoDB/tgonME49/ESTs.xml
– ToxoDB/tgonME49/Einstein/chipChipSamples.xml
– ToxoDB/tgonME49/dbXRefs.xml
– ToxoDB/tgonME49/arrayStudies.xml
– ToxoDB/tgonME49/SNPs.xml
DataSource
• We store simple meta information in the database about each dataset:
  – Provider contact info
  – Descriptions
  – Display names
  – References to WDK searches, tables and attributes that use the data
• The information is stored in two tables:
  – DataSource -- pulled right from the <resource>
  – DataSourceInfo -- provided by a specific file after loading data is completed
• And it is available in the WDK as a DataSource record
  – The search and record pages (eg Gene) can access this info for display purposes
  – Soon we will support searches for these, eg, find all searches that involve a certain dataset
DataResource?
• It makes no sense to have two names:
  – <resource>
  – DataSource table, Perl objects, and WDK record
• So, either:
  – Rename <resource> to <datasource>
    • This is a pain to transition to in our code
  – Or, rename DataSource to DataResource and keep <resource> as is
DataResourceInfo
• DatasetClasses do not include meta info about the dataset:
  – Contact info
  – Description
  – Mapping to WDK searches and records
• DatasetClasses describe how to load the data
• But, we can have DatasetClass