Post on 18-Dec-2015
Copyright ©2003 TurboWorx, Inc. 1
High Performance Workflows for Networks and Grids
Andrew H. Sherman
Chief Technology Officer
Outline
Technical Computing Workflows
Deploying Workflows in HPC Environments
TurboWorx Workflow Products
Complex Technical Computations are Critical in Many Industries
• Complex technical computing problems and algorithms have become “business critical”
• Solutions often involve integrating several applications and many data sources into workflows
• Automated coarse-grain parallelism and grid computing are emerging as key technologies
Life Sciences & Medicine
Discovery and Development
• Data- & compute-intensive applications
• Huge databases from multiple sources & in diverse formats
• Manual workflows
Information-Based Medicine
• Complex, heterogeneous databases & applications
• Better and more effective diagnosis & treatment from faster, more accurate information interpretation
Automotive/Aero
Design and Development
• Concurrent Engineering requires integration and collaboration between Concept, Design and Development processes
• Global design teams that work around the clock
• Suppliers part of the design and development process
Finance
Portfolio Management/Pricing
• Scenario-based modeling
• Huge quantities of real-time data
• Time is money!
What is a Workflow?
“The automation of a business process, in whole or parts, where documents, information or tasks are passed from one participant to another to be processed, according to a set of procedural rules”
— Workflow Management Coalition
Technical Computing Workflows
How do technical computing workflows differ from traditional business process workflows?
Data flow vs. control flow
Widely distributed data (often with multiple owners)
Dynamic operating environment (e.g., the Grid)
Hierarchical workflow constructs
Requirement for parameterized executions
Evolving/Customized workflow definitions
Significance of collaboration and reuse
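The "data flow" and "parameterized executions" points above can be illustrated with a minimal sketch. It assumes a workflow is described as a graph in which each step runs once the outputs of its upstream steps exist, and that one definition is run several times with different parameters. The function `run_dataflow`, the step names, and the parameters `n` and `factor` are hypothetical and are not taken from any TurboWorx interface.

```python
from graphlib import TopologicalSorter


def run_dataflow(nodes, edges, params):
    """Run a dataflow graph once.

    nodes:  step name -> function taking (inputs: dict, params: dict)
    edges:  step name -> list of upstream step names whose outputs it consumes
    params: run-time parameters for this particular execution
    """
    results = {}
    # static_order() yields each step only after all of its upstream steps.
    for name in TopologicalSorter(edges).static_order():
        inputs = {up: results[up] for up in edges.get(name, [])}
        results[name] = nodes[name](inputs, params)
    return results


# Hypothetical three-step workflow: load -> scale -> total
nodes = {
    "load": lambda inputs, p: list(range(p["n"])),
    "scale": lambda inputs, p: [x * p["factor"] for x in inputs["load"]],
    "total": lambda inputs, p: sum(inputs["scale"]),
}
edges = {"load": [], "scale": ["load"], "total": ["scale"]}

# Parameterized executions: the same definition run with two parameter sets.
runs = [run_dataflow(nodes, edges, {"n": 4, "factor": f}) for f in (1, 10)]
print([r["total"] for r in runs])  # prints [6, 60]
```

The execution order here is derived from the data dependencies rather than written out as explicit control steps, which is the distinction the list draws.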
Characterizing Technical Computing Workflows
[Figure: workflow categories (Collaborative, Production, Ad Hoc, Administrative) plotted on axes of Repetition and Business Value, with Technical Workflows marked]
Ref: Production Workflows (Leymann, Roller)
HPC Platforms: SMPs & Clusters
Linux Clusters
• Cost-effective
• Scalable
• Modular: easy to upgrade to faster, better CPUs (e.g. 64-bit)
• Great for computation
Blade Solutions
• Similar attributes to Linux clusters
• More compact: better flops/ft³
• Often cheaper
Shared Memory Multiprocessor
• Expensive to buy, costly to upgrade
• Poor scalability for computation
• Best use: Data storage & access
[Figure labels: Linux Computation Cluster; UNIX Database Server]
HPC Platforms: Enterprise Grids
Enterprise Grids
• Efficient: uses all the hardware available
• Provides user comfort and familiarity
• More than cycle stealing on idle desktops; usually includes computing on heterogeneous collections of servers
• Great for computation, particularly for Life Sciences, where desktop platforms are appropriate for many algorithms
[Figure labels: Linux Computation Cluster; UNIX Database Server; AIX, Linux, Windows, and Mac OS X machines]
Technical Computing and Workflows
Workflows can address some critical computing challenges:
• Integrate, manage, and accelerate collections of heterogeneous applications, data, and platforms
• Provide horsepower to process massive amounts of data by applying parallelism without source code modification
• Address the needs of key user groups (end users, application experts, and IT staff) through easy-to-use interfaces
• Facilitate collaboration and reuse to save time in the design, trials and testing, and deployment of new computing solutions
But . . .
There are difficulties to overcome:
• Scalability & performance: going beyond multithreading with “transparent parallelism”
• Management of dynamic computing environments
• Automated data and application staging
• Integration with rapidly evolving grid standards (to support reuse and collaboration)
• Desktop tools for workflow creation; portals for execution
• Debugging and monitoring interfaces
Traditional Workflow Implementation
• Large, complex scripts to orchestrate applications
• Static embedded infrastructure control; usually aimed at single machine
• Communication via temp files
• “Human-in-the-loop” operation
What’s wrong with this?
• Poor performance: mainly aimed at SMPs (but scalability often limited)
• Lack of automation is inefficient and error-prone
• No support for application integration or data conversion
• Difficult to create, maintain, modify (even for skilled programmers)
• Little reusability or portability
Traditional Life Science Workflows
Typical “Human-in-the-Loop” Workflow:
• Manual component startup
• “Cut and paste” data movement
• Sequential execution
• Limited throughput due to “bottleneck components”
[Figure: Access Data → A → B → C → Store Data; two components labeled Fast, one labeled Slow]
A Better Way: Automation & Parallelism
TurboWorx High-Performance Workflow:
• Automated component startup & data conversion
• Transparent data-driven parallelism to eliminate bottlenecks
• Pipeline acceleration: asynchronous, dynamic, concurrent execution on distributed machines
[Figure: Access Data → A → B → C → Store Data, with additional copies of B running in parallel; all components labeled Fast]
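As a rough illustration of the two ideas on this slide, the sketch below runs components A, B, and C as concurrent pipeline stages connected by queues and starts several copies of the slow stage B. It uses threads on one machine rather than distributed machines, has no error handling, and every name in it (`start_stage`, `run_pipeline`, `slow_b`) is hypothetical; it is not the TurboWorx implementation.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the data stream


def start_stage(func, inbox, outbox, workers=1):
    """Start `workers` threads that each apply func to items from inbox."""
    state = {"active": workers}
    lock = threading.Lock()

    def loop():
        while True:
            item = inbox.get()
            if item is _DONE:
                inbox.put(_DONE)  # re-post so sibling workers also stop
                with lock:
                    state["active"] -= 1
                    last_one = state["active"] == 0
                if last_one:
                    outbox.put(_DONE)  # forward once all copies are finished
                return
            index, value = item
            outbox.put((index, func(value)))

    for _ in range(workers):
        threading.Thread(target=loop, daemon=True).start()


def run_pipeline(items, stages):
    """Run items through stages, a list of (func, workers) pairs.

    Results come back in input order even when a stage is replicated.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    for (func, workers), inbox, outbox in zip(stages, queues, queues[1:]):
        start_stage(func, inbox, outbox, workers)
    for pair in enumerate(items):
        queues[0].put(pair)
    queues[0].put(_DONE)

    results = {}
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results[out[0]] = out[1]
    return [results[i] for i in sorted(results)]


def slow_b(x):
    time.sleep(0.01)  # stands in for a bottleneck component
    return x * 2


# A and C run as single copies; the slow stage B runs as three copies.
output = run_pipeline(range(5), [(lambda x: x + 1, 1), (slow_b, 3), (str, 1)])
print(output)  # prints ['2', '4', '6', '8', '10']
```

Each item carries its input index so that the final ordering does not depend on which copy of B finishes first.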
TurboWorx Enterprise Architecture
[Figure: architecture diagram. Labels: User; Builder; Interfaces (Command Line, Web Portal); TurboWorx Hub; Component Library; Data Storage; Data Repository; Workstations (AIX, Linux, Windows, Mac OS X); Compute Clusters (Linux; managed by BQS/DRM systems)]
Workflow Lifecycle
Design
– End user or developer?
– Component & workflow development environment
– Integration with data
– Testing & Debugging
Deployment
– Local storage vs. centralized storage
– Sharing & Collaboration
Execution
– Execution interface: CLI, Proprietary GUI, Portal, Web/Grid Service
– Access Control for workflows and data
– Resource management
Monitoring
– Events reflecting workflow and service execution
Refinement & Reuse
TurboWorx Workflows
Design & Deployment
Atomic Components
– Command-line programs (e.g. C/C++/Fortran, Perl), Java, Jython
– XML wrappers created by wizards or by editing templates
Dataflow Components
– Workflows built from other components (including other workflows)
– Automated data flow & transformations between components
– Created using visual programming tool
Deployment
– Components stored in a “Component Library” (Local or Centralized)
– Import/Export and component sharing (collaboration)
– Data references via a virtual “Data Repository” interface (supports WebDAV, Avaki, FTP, NFS)
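The relationship between atomic and dataflow components can be sketched as follows. This is an illustration under stated assumptions only: the classes `Atomic` and `Dataflow` are hypothetical, the components exchange plain text over stdin/stdout, and the XML wrappers and Component Library mentioned above are not modeled. The "programs" are small Python one-liners so that the example runs anywhere.

```python
import subprocess
import sys


class Atomic:
    """Wraps a command-line program: text in on stdin, text out on stdout."""

    def __init__(self, argv):
        self.argv = argv

    def run(self, data):
        done = subprocess.run(
            self.argv, input=data, capture_output=True, text=True, check=True
        )
        return done.stdout


class Dataflow:
    """A linear workflow built from other components, including Dataflows."""

    def __init__(self, *parts):
        self.parts = parts

    def run(self, data):
        # The output of each component becomes the input of the next one.
        for part in self.parts:
            data = part.run(data)
        return data


# Two stand-in command-line programs, launched through the Python interpreter.
upper = Atomic([sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"])
reverse = Atomic([sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read()[::-1])"])

inner = Dataflow(upper, reverse)   # a workflow of atomic components
outer = Dataflow(inner, upper)     # a workflow that reuses another workflow
print(outer.run("abc"))  # prints CBA
```

Because `Atomic` and `Dataflow` expose the same `run` method, a workflow can be nested inside another workflow without the outer one knowing the difference, which is the hierarchical reuse the slide describes.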
TurboWorx Builder
[Figure labels: ClustalW; Application, Java Method, Jython Script; Wizard; TurboWorx Component; Component Library; Atomic Component Creation; Workflow Component Creation]
Special Components: Conditionals
Special Components: Loops
Support for: “For”, “While”, “Do Until”
[Figure: While Loop example]
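The three loop forms differ mainly in when the condition is tested. The sketch below shows that difference with a loop component wrapping a body component; it assumes the usual meanings of the three constructs, and the function names and the `halve` body are hypothetical.

```python
def for_loop(count, body, state):
    """Run the body a fixed number of times."""
    for _ in range(count):
        state = body(state)
    return state


def while_loop(condition, body, state):
    """Test first: the body may run zero times."""
    while condition(state):
        state = body(state)
    return state


def do_until(condition, body, state):
    """Test afterwards: the body always runs at least once."""
    while True:
        state = body(state)
        if condition(state):
            return state


def halve(x):
    return x // 2


print(for_loop(3, halve, 8))                     # prints 1
print(while_loop(lambda x: x > 100, halve, 8))   # prints 8 (body never runs)
print(do_until(lambda x: x <= 100, halve, 8))    # prints 4 (body runs once)
```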
Special Components: Splitters & Joiners
• Components to convert between groups of many data elements and sequences of the individual data elements
• Support “Fork-Join” data parallelism
• Standard splitters/joiners provided with the TurboWorx system. Examples:
  – Arrays: Convert between array and individual elements (in order)
  – Collections: Convert between a java.util.Collection and its elements
  – Strings/Patterns: Split input stream based on regular expressions
• Users may create additional types using Jython or Java
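A minimal fork-join sketch, assuming a pattern-based splitter like the Strings/Patterns example above: the input is split on a regular expression, each piece is processed independently, and a joiner reassembles the results in order. The functions `split_on_pattern`, `join_lines`, `fork_join`, and `describe` are hypothetical, and the FASTA-style input is made up for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_on_pattern(text, pattern):
    """Splitter: turn one string into a list of non-empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


def join_lines(pieces):
    """Joiner: combine the individual results back into one string."""
    return "\n".join(pieces)


def fork_join(data, splitter, worker, joiner, max_workers=4):
    """Split the data, process the pieces concurrently, join the results."""
    parts = splitter(data)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input pieces.
        results = list(pool.map(worker, parts))
    return joiner(results)


def describe(record):
    """Report the name and sequence length of one FASTA-style record."""
    header, sequence = record.split("\n", 1)
    return f"{header[1:]} {len(sequence)}"


fasta = ">s1\nACGT\n>s2\nGG\n>s3\nTTTAA"
report = fork_join(
    fasta,
    splitter=lambda text: split_on_pattern(text, r"\n(?=>)"),
    worker=describe,
    joiner=join_lines,
)
print(report)
```

The splitter and joiner are ordinary functions passed in as arguments, so a different pair (for arrays or collections, say) could be swapped in without changing `fork_join`.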
Parallelism in Practice
TurboWorx High-Performance Workflow:
[Figure: Access Data, A, B, C, Store Data pipeline with SPLIT and JOIN components added; two components labeled Fast, one labeled Slow]
Splitting enables pipeline parallelism (A, B, C run concurrently on different data)
Parallelism in Practice
TurboWorx High-Performance Workflow:
[Figure: the same pipeline with SPLIT and JOIN components and additional copies of B; all components labeled Fast]
Scheduler determines amount of data parallelism dynamically at run time
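One way to picture a run-time decision about data parallelism is sketched below. It assumes the number of copies is chosen from the amount of data and the resources available when the run starts, and that copies pull work from a shared queue so faster copies naturally take more items. The names `pick_worker_count` and `run_dynamic` are hypothetical, and this is not a description of the TurboWorx scheduler.

```python
import os
import queue
import threading


def pick_worker_count(n_items, available=None):
    """Choose the number of copies at run time, not at design time."""
    available = available or os.cpu_count() or 1
    return max(1, min(n_items, available))


def run_dynamic(items, func, available=None):
    """Apply func to every item using a run-time chosen number of workers.

    Returns (results in input order, number of items handled per worker).
    """
    items = list(items)
    n_workers = pick_worker_count(len(items), available)
    todo = queue.Queue()
    for pair in enumerate(items):
        todo.put(pair)

    results = [None] * len(items)
    handled = [0] * n_workers

    def worker(slot):
        while True:
            try:
                index, item = todo.get_nowait()
            except queue.Empty:
                return  # no work left for this copy
            results[index] = func(item)
            handled[slot] += 1

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, handled


squares, handled = run_dynamic(range(10), lambda x: x * x, available=3)
print(squares)
print(len(handled), sum(handled))  # prints 3 10
```

How the ten items are divided among the three workers varies from run to run, which is why the example prints only the worker count and the total.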
Protein Characterization Example
Overall Task: Group protein domains into families
Steps:
• Identify homologous pairs
• Build families around pairs
• Refine & optimize protein families
• Find consensus sequences
• Compute identity scores vs. leaders
Key Programs: BLASTP, clustalw, hmmbuild, hmmsearch
[Figure label: Process Family Subworkflow]
Example: “Process Family” Workflow
Protein Family Example
Take-Home Points
Technical computing workflows are important in various industries
Effective application of workflows requires HPC, including fault-tolerant automation and dynamic parallelism in a grid-like computing environment
TurboWorx workflow products offer one end-to-end solution for developing and deploying high performance technical workflows