TeraGrid Science Workflow Survey
Using Taverna and Swift in OLSG and SIDGrid
Wenjun Wu, Aashish Adhikari
Thomas D. Uram, Michael Wilde, Michael E. Papka
Outline
• Open Life Science Gateway (OLSG)
  • Provides Web services for commonly used bio-applications such as BLAST, CLUSTALW, InterProScan, ….
  • Bioinformatics researchers can compose Taverna workflows using OLSG services
  • OLSG does not manage Taverna workflows
• Social Informatics Data Grid (SIDGrid)
  • Uses the Swift engine inside the gateway for workflow management
  • The Swift workflow engine enables users to write parallel scripts for their domain applications on TeraGrid
  • With existing Swift scripts, gateway developers can focus on application management and Web interfaces
• Open Protein Simulation Science Gateway
  • Reuses the application framework from the SIDGrid science gateway
  • Generates the oops Web UI from the Swift script
Using Taverna in Open Life Science Gateways
Compose Taverna workflows based on OLSG services
• Wrap bio-applications as Web services
• Add WSDL scavengers
• Compose a workflow
  • Add methods
  • Define input/output
  • Link components together
Social Informatics Data Grid
http://sidgrid.ci.uchicago.edu
• SIDGrid enables social and behavioral scientists to collect and annotate data, collaborate and share data, and analyze and mine large data repositories
  • Data types: speech, gesture, facial expression, and physiological measurements
• SIDGrid services
  • Data importing/exporting
  • Query
  • Streaming
  • Large-scale data analysis on TeraGrid, especially multimedia processing and data mining tasks
SIDGrid Workflow Management http://www.ci.uchicago.edu/swift/
• Swift supports data-intensive scientific applications that execute many tasks coupled by disk-resident datasets
• A simple scripting language for describing workflows
• A flexible data mapping mechanism for accessing large-scale scientific datasets
• An efficient task execution framework for high-throughput computation
• Easy to integrate into science gateways
SIDGrid Science Gateway Framework
[Figure: SIDGrid gateway framework diagram. Labels: Application Mobyle XML, Swift workflow scripts, Gadgets XML, Render gadget instances, SIDGrid data URLs, Run application-specific workflows]
• Integrates social applications and provides a Web 2.0 interface
• Extended Mobyle Application XML for application description: Swift script templates as applications
• Swift workflow engine: start/stop/resume workflows
Run Social and Behavior Science Tools as SIDGrid Gadgets
SIDGrid experiment browsing page
• Lists project files and available analysis tools
• Provides a browser-side gadget execution environment
Three steps to launch SIDGrid application gadgets:
1. Select data files to analyze
2. Select an analysis application
3. Launch SIDGrid gadgets (Praat and the workflow history gadget) to run the analysis and monitor its progress
Build a domain-specific scientific workspace
[Figure: example workspace layout. Labels: Fluid Flow, Breadcrumbs, ParaView, Workflow, Configurator, Monitor]
• Develop Swift workflow scripts for domain applications
• Define the application execution descriptions
• Generate the execution profiles and Web gadgets
• Reuse general-purpose gadgets from the gadget library
• Create a customized layout manager for these gadgets
• Describe the event channels between the gadgets
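The last step, describing event channels between gadgets, can be pictured as a small publish/subscribe bus. The sketch below is in Python and is only an illustration of the idea: the `EventBus` class, the channel name `file.selected`, and the message fields are all hypothetical and are not taken from the SIDGrid framework.

```python
class EventBus:
    """Minimal publish/subscribe bus; a stand-in for gadget event channels."""

    def __init__(self):
        self._subscribers = {}   # channel name -> list of handler callables

    def subscribe(self, channel, handler):
        # Register a handler to be called for every message on `channel`.
        self._subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel, message):
        # Deliver `message` to each subscriber; return how many received it.
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


if __name__ == "__main__":
    bus = EventBus()
    # Hypothetical wiring: a results viewer reacts to a file browser selection.
    bus.subscribe("file.selected", lambda msg: print("viewer opens", msg["path"]))
    bus.publish("file.selected", {"path": "output/result.pdb"})
```

Keeping gadgets coupled only through named channels is what lets general-purpose gadgets be reused across workspaces: a new domain gadget only has to publish or subscribe to the agreed channel names.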
TeraGrid Science Gateway:
Open Protein Simulator (OOPS) for the UChicago Department of Chemistry
and Institute for Biophysical Dynamics
Science topic: 3D Protein structure prediction
• 3D protein structure prediction “ab initio”
• Specifically, protein targets that have few or no known homologies
• Primary algorithmic scaffold for the lab is called “ItFix” – iterative fixing
• Uses many parallel “rounds” of simulation: independent, randomly seeded simulated annealing runs
• Consensus structures are formed after each round to seed the next round
• Various exploratory algorithms and variations (e.g. “Loop Modeling” and “SPEED”) are built on top of this framework
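The round structure described above (many independent, randomly seeded annealing runs, then a consensus that seeds the next round) can be illustrated with a toy example. The sketch below is in Python and is not the OOPS/ItFix code: the one-dimensional `energy` function, the consensus rule (mean of the lowest-energy fifth of the runs), and the convergence tolerance are all assumptions chosen only to make the loop runnable.

```python
import math
import random


def energy(x):
    # Toy 1-D energy surface standing in for a protein energy function.
    return (x - 3.0) ** 2


def anneal(start, seed, steps=200, temp=1.0, cooling=0.98):
    # One independent, randomly seeded simulated annealing run.
    rng = random.Random(seed)
    x, t = start, temp
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)
        delta = energy(candidate) - energy(x)
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = candidate
        t *= cooling
    return x


def consensus(results, keep=0.2):
    # Assumed consensus rule: average the lowest-energy fraction of a round.
    best = sorted(results, key=energy)[: max(1, int(len(results) * keep))]
    return sum(best) / len(best)


def itfix_toy(start, nsim=50, max_rounds=5, tol=1e-2):
    # Each round's consensus seeds the next round, until it stops moving.
    seed_point, history = start, []
    for r in range(max_rounds):
        runs = [anneal(seed_point, seed=r * nsim + i) for i in range(nsim)]
        new_point = consensus(runs)
        history.append(new_point)
        converged = abs(new_point - seed_point) < tol
        seed_point = new_point
        if converged:
            break
    return history


if __name__ == "__main__":
    print(itfix_toy(-5.0))
```

The runs within a round share no state, which is why each round maps naturally onto many parallel tasks.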
ItFix: iterative fixing for structure prediction
Slide courtesy of Glen Hocky
Science Value
• The special realm of OOPS/ItFix in the folding world is predicting the structure of new proteins whose sequence is known but for which few or no homologies to known structures exist (making the common “template-based” approaches ineffective)
• OOPS is now being applied to massive-scale prediction on pathogens (e.g., Staph. aureus) and on metagenomes with biomedical and energy applications
• The CPU demands for these applications are significant: tens of millions of CPU hours are expected to be required over the next few years
• The core OOPS protein library was recently re-written and is now being recalibrated for resource estimation
Protein structure prediction data flow
[Figure: data-flow diagram. Stages: Simulate (100 to 10K ×), Analyze (1 to 10 ×), with a "refined" label on the flow. Data items: per-protein data (4 files, ~100 KB; unique per protein); per-protein stats (4 MB, 3 files); common stat data (16 files, ~90 MB); all-protein stats (600 MB, 8000 files); BioPython and Protlib libs (1 tar, ~2 MB); per-protein output (3 files, ~3 MB). Repeated for each protein and each parameter set. Target: 100s of proteins, 100s of parameter sets.]
Protein structure prediction
    ItFix(Protein p, int nsim, int maxr, float temp, float dt)
    {
      ProtSim prediction[ ][ ];
      boolean converged[ ];
      PSimCf config;

      config.st = temp;
      config.tui = dt;
      config.coeff = 0.1;

      iterate r {
        prediction[r] =
          doRoundCf(p, nsim, config);
        converged[r] =
          analyze(prediction[r], r, maxr);
      } until ( converged[r] );
    }
Protein structure prediction
    Sweep( )
    {
      int nSim = 1000;
      int maxRounds = 3;
      Protein pSet[ ] <ext; exec="Protein.map">;
      float startTemp[ ] = [ 100.0, 200.0 ];
      float delT[ ] = [ 1.0, 1.5, 2.0, 5.0, 10.0 ];
      foreach p, pn in pSet {
        foreach t in startTemp {
          foreach d in delT {
            ItFix(p, nSim, maxRounds, t, d);
          }
        }
      }
    }

    Sweep();
10 proteins × 1000 simulations × 3 rounds × 2 temps × 5 deltas = 300K tasks
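The task count above can be checked with a short calculation. The sketch below, in Python rather than Swift, enumerates the same parameter combinations as the Sweep script. The protein names are placeholders, and the total is an upper bound that assumes every ItFix call runs all three rounds before converging.

```python
from itertools import product

N_SIM = 1000                             # nSim in the Sweep script
MAX_ROUNDS = 3                           # maxRounds in the Sweep script
START_TEMPS = [100.0, 200.0]             # startTemp[ ]
DELTA_TS = [1.0, 1.5, 2.0, 5.0, 10.0]    # delT[ ]


def count_tasks(proteins):
    """Upper bound on simulation tasks for a full sweep over `proteins`."""
    # One ItFix call per (protein, start temperature, delta) combination.
    itfix_calls = sum(1 for _ in product(proteins, START_TEMPS, DELTA_TS))
    # Each call runs at most MAX_ROUNDS rounds of N_SIM simulations.
    return itfix_calls * MAX_ROUNDS * N_SIM


if __name__ == "__main__":
    proteins = [f"protein_{i}" for i in range(10)]   # placeholder names
    print(count_tasks(proteins))   # 10 * 2 * 5 * 3 * 1000 = 300000
```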
Structure prediction runs on TeraGrid and Blue Gene/P
Work of Tobin Sosnick, Karl Freed, Glen Hocky, Joe DeBartolo, Aashish Adhikari.
[Figure panels: T1af7, T1r69, T1b72]
Generating a Mobyle XML for the oops swift script
<parameter ismandatory="1" issimple="1" ismaininput="1">
<name>plist</name>
<prompt lang="en">input protein fasta file</prompt>
<type>
<datatype>
<class>File</class>
</datatype>
</type>
<format>
<code proglang="python"> ("","-plist="+str(value))[value is not None] </code>
</format>
<argpos>1</argpos>
</parameter>
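The `<format>` element above carries a Python expression that turns the parameter value into a command-line fragment. As a minimal sketch of how that expression behaves when evaluated on its own, outside Mobyle, the snippet below wraps it in a function; the function name `format_plist` and the example file name are hypothetical.

```python
def format_plist(value):
    # Same expression as in the <code proglang="python"> element:
    # a two-element tuple indexed by a boolean (False -> 0, True -> 1),
    # so an unset parameter contributes nothing to the command line.
    return ("", "-plist=" + str(value))[value is not None]


if __name__ == "__main__":
    print(repr(format_plist(None)))              # ''
    print(repr(format_plist("proteins.fasta")))  # '-plist=proteins.fasta'
```

The resulting `-plist=...` argument lines up with the `@arg("plist", "")` lookup in the oops Swift script, which is how a value entered in the Web form reaches the workflow.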
Oops script (the parameter XML above is used to generate its Web gadget):

    main() {
      string plistfile = @arg("plist", "");         // input protein fasta file
      string indir = @arg("indir", "oops.input");   // …
      string outdir = @arg("outdir", "output");     // …
      string nsims = @arg("nsims", "1");            // number of simulations
      string st = @arg("st", "100");                // start temperature
      string tui = @arg("tui", "100");              // time update interval
      string coeff = @arg("coeff", "0.1");
      string plist[] = readData(plistfile);
      RAMAIn ramain[] <ext; exec="RAMAInProts.map.sh", i=indir, p=plistfile>;
      RAMAOut ramaout[][] <ext; exec="RandProtRadialMapper.py", o=outdir, p=plistfile, n=nsims, c=create>;
      foreach sim in [ 0 : @toint(nsims) - 1 ] {
        foreach prot, index in plist {
          ramaout[index][sim] = predictCf(prot, ramain[index], st, tui, coeff);
          VizOut outpng[] <ext; exec="pngmapper.py", o=metadir, p=@filename(ramaout[index][sim].pdb)>;
          outpng[0] = pngviz(ramaout[index][sim]);
        }
      }
    }
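The script relies on external (`ext`) mappers, which are executables that tell Swift which file backs each array element. The sketch below shows what a simple one-dimensional mapper could look like in Python. It assumes the external mapper convention of printing one `[index] path` pair per line; the directory layout and the `.pdb` naming are hypothetical, and this is not the `RandProtRadialMapper.py` or `pngmapper.py` used by the gateway.

```python
def mapper_lines(protein_names, outdir):
    # One line per array element, in the assumed "[index] path" form.
    return [
        f"[{i}] {outdir}/{name}.pdb"
        for i, name in enumerate(protein_names)
    ]


if __name__ == "__main__":
    # Target names borrowed from the structure prediction slide.
    for line in mapper_lines(["T1af7", "T1r69", "T1b72"], "output"):
        print(line)
```

Because the mapping lives in a separate executable, the same workflow script can be pointed at a different file layout by swapping the mapper rather than editing the script.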
Build the Protein Folding Simulation Workspace Prototype using the framework
• A prototype portal for large-scale protein 3D structure simulations
• Domain-specific gadget
  • Protein folding simulation gadget
• Reusable gadgets
  • Workflow history gadget
  • Viewing results gadget
  • File browsing gadget
• Integrate all these gadgets into a desktop-application layout
• Message passing between gadgets