Dheeraj Bhardwaj <[email protected]>, December 2003
Department of Computer Science & Engineering
Indian Institute of Technology, Delhi – 110 016, India
http://www.cse.iitd.ac.in/~dheerajb

BioGrid: Challenges, Problems and Opportunities
[Diagram: a biological phenomenon is turned into data by a measurement process; data analysis and learning yield a model; inference from the model yields conclusions.]
Bioinformatics vs. Biocomputing

[Diagram: bioinformatics and biocomputing as the two bridges between IT and BT.]
Biological Data: From Genome to Phenome

[Illustration: biological data as a "maze" on a jigsaw puzzle.]
Equipment for the New Quest

• Data, knowledge and tools
• High-performance computers
• Collaboration of human experts

Illustrations from: www.dnr.state.wi.us/org/aw/air/ed/educatio.htm, www.mtnbrook.k12.al.us/academy/2ndgrade/mtn/map.htm
Needs of High Performance Computing

• Increase of genome sequence information
• Combinatorial increase of search space: Genome × Transcriptome × Proteome × … × Phenome
• Computer simulation and unknown parameter estimation

Knowledge integration in "Omic Space"
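The "combinatorial increase of search space" can be made concrete with a quick sketch: integrating across omic layers multiplies the sizes of the individual spaces. The sizes below are purely illustrative assumptions, not figures from the slides.

```python
# Illustrative sketch (hypothetical sizes): the search space for knowledge
# integration in "omic space" grows as the product of the layer sizes.
omic_spaces = {
    "genome": 30_000,         # assumed figure, e.g. number of genes
    "transcriptome": 100_000, # assumed
    "proteome": 500_000,      # assumed
    "phenome": 10_000,        # assumed
}

combined = 1
for name, size in omic_spaces.items():
    combined *= size

print(f"combined search space: {combined:.2e} combinations")
```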
Needs of High Performance Computing

• Impact of genome sequence projects: the human genome (3,000 Mbp, 2000) has driven a rapid increase in genome sequence databases and a strong computation demand for homology search
• Start of structural genomics projects: determining 10,000 folds in 5 years creates a strong computation demand for molecular simulation
1st Issue: Homology Search

• Rapid increase of data size: doubling every year, with daily updates (17 million entries, 50 gigabytes as of October 2002)
[Chart: growth of the EMBL, GenBank and DDBJ nucleotide databases, from 0 to ~14 billion bases.]

Rough estimate of homology-search time for mouse cDNA (5,000 sequences) against the human genome (3,000 Mbp):
1 CPU ≈ 1 year; 8 CPUs ≈ 1 month; 32 CPUs ≈ 1 week; 256 CPUs ≈ 1 day; 6,400 CPUs ≈ 1 hour.
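The scaling behind these estimates can be sketched directly: homology search is embarrassingly parallel, so under the (idealized) assumption of linear speedup the wall time is just the serial time divided by the CPU count. The slide's figures are rougher round numbers.

```python
# Sketch of the rough scaling behind the slide's estimate, assuming ideal
# linear speedup for an embarrassingly parallel homology search.
SERIAL_HOURS = 365 * 24  # ~1 year on 1 CPU (the slide's baseline)

for cpus in (1, 8, 32, 256, 6400):
    hours = SERIAL_HOURS / cpus
    print(f"{cpus:>5} CPUs: ~{hours:,.1f} hours")
```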
2nd Issue: Molecular Simulation

Nanosecond-order molecular dynamics simulation of protein molecules of 100,000–1,000,000 molecular weight:

• Stability analysis
• Affinity analysis
• Folding simulation

Example: Ras p21 G protein – 189 residues, molecular weight 21 kD; oncogene variant Gly12 → Val. Simulating 5 ns takes about 1,000 hours on a 32 Gflops computer.

[Illustration: Ras p21 structure with bound GTP, Mg and Lys16 highlighted.]
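The 1,000-hour figure fixes a total operation count, from which runtimes on other machines can be extrapolated. A minimal sketch, using the slide's numbers; the faster machine speeds are hypothetical comparison points.

```python
# Back-of-envelope check: 5 ns of MD in 1,000 hours on a 32 Gflops machine
# implies a fixed total operation count for the run.
GFLOPS = 32
HOURS = 1000
total_ops = GFLOPS * 1e9 * HOURS * 3600  # operations for the 5 ns run

# Hypothetical faster machines: 1 Tflops and 10 Tflops.
for machine_gflops in (32, 1000, 10000):
    hours = total_ops / (machine_gflops * 1e9) / 3600
    print(f"{machine_gflops:>6} Gflops: {hours:,.1f} h for 5 ns")
```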
Needs of Resource Sharing

• Biological databases (UniGene, TrEMBL, …)
• Bioinformatics tools (BLAST, HMMER, …)
• Programming libraries (BioPerl, BioJava, …)
Needs of Human Collaboration
Grid for Bioinformatics

• Effective for "embarrassingly parallel" computation: homology search, motif search, unknown parameter estimation for cellular models, etc.
• Distributed resource sharing among organizations: web services, workflow and computational pipelines, autonomous database updates, etc.
• A "field" for human collaboration: group work on genome annotation, whole-cell simulation, collaboration between biologists and computer scientists, etc.
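The "embarrassingly parallel" pattern named above can be sketched in a few lines: each query is scored against the database independently, with no inter-task communication. The scoring function below is a toy stand-in for a real homology search such as BLAST, not real biology.

```python
# Minimal sketch of embarrassing parallelism: independent query-vs-database
# tasks fanned out over a worker pool. toy_search is a placeholder scorer.
from concurrent.futures import ThreadPoolExecutor

def toy_search(query: str, database: list[str]) -> tuple[str, str]:
    # Stand-in similarity: count of positionally matching characters.
    def score(seq: str) -> int:
        return sum(1 for a, b in zip(query, seq) if a == b)
    return query, max(database, key=score)

database = ["ACGTACGT", "TTTTAAAA", "ACGGGGGG"]
queries = ["ACGTAAAA", "TTTTTTTT"]

# Each task touches only its own query, so they can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda q: toy_search(q, database), queries))

for query, best in results:
    print(query, "->", best)
```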
Summary of Bioinformatics Trends

• The rapid increase in genomic database size causes severe overhead for database services.
• The demand for molecular dynamics simulation requires high-performance computers (including special-purpose machines).

Both point to the need for a new bioinformatics platform for sharing databases and high-performance computers.
Strategic Technology Domains

• Information integration from genome to phenome
• Modeling and simulation from molecule to cell
• High-performance computing (PC clusters, SMP, vector)
• Grid
Evolution of the Scientific Process

• Pre-electronic: theorize and/or experiment, alone or in small teams; publish papers.
• Post-electronic: construct and mine very large databases of observational or simulation data; develop computer simulations and analyses; exchange information quasi-instantaneously within large, distributed, multidisciplinary teams.
[Chart: algorithmic complexity / data volume vs. compute requirements, 1970–2005, spanning mainframes, vector processors, supercomputers, MPP/SMP, scalable parallel systems, and distributed & Grid computing.]

Approximate price/performance, in chronological order:
• IBM 360/370, CDC 1604/6600, UNIVAC 1100: ~3 MFLOPS per $ million
• DEC VAX/FPS, IBM, CDC, UNIVAC: ~5 MFLOPS per $ million
• CRAY 1, CDC 203: ~20 MFLOPS per $ million
• CRAY X-MP, CONVEX C1, ALLIANT: ~60 MFLOPS per $ million
• CRAY Y-MP, CONVEX C2: ~200–400 MFLOPS per $ million
• SGI Power Challenge, IBM SP2, CM5: ~2–3 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP: ~5–8 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP, SUN ES 10000: ~20 GFLOPS per $ million
• Linux clusters: ~100 GFLOPS per $ million
• Computational Grid: ~1,000 GFLOPS per $ million
• Systems are getting larger by 2–4x per year!
  – Increasing parallelism: add more and more processors
• A new kind of parallelism: the Grid
  – Harness the power of computing resources that keep growing
HPC Applications Issues

• Architectures and programming models
  – Distributed-memory systems (MPP, clusters) – message passing
  – Shared-memory systems (SMP) – shared-memory programming
  – Specialized architectures – vector processing, data-parallel programming
  – The computational Grid – grid programming
• Application I/O
  – Parallel I/O
  – Need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations
• Checkpointing and recovery
• Monitoring and steering
• Visualization (remote visualization)
• Programming frameworks
Future of Scientific Computing

• Requires large-scale simulations, beyond the reach of any single machine
• Requires large, geo-distributed, cross-disciplinary collaborations
• Systems are getting larger by 2–4x per year!
  – Increasing parallelism: add more and more processors
• A new kind of parallelism: the Grid
  – Harness the power of computing resources that keep growing
What Do We Want to Achieve?

• Develop high-performance computing (HPC) applications that are
  – Portable (laptop → supercomputer → Grid)
  – Future-proof (Grid-ready)
• Develop HPC infrastructure (parallel and Grid systems) that is
  – User friendly
  – Based on open source
  – Efficient in problem solving
  – Able to achieve high performance
  – Able to handle large data volumes
Parallel Computer and Grid
A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”.
A Computational Grid is an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks and other resources.
A Comparison

• SERIAL: fetch/store; compute.
• PARALLEL: fetch/store; compute/communicate – a cooperative game.
• GRID: fetch/store; discovery of resources; interaction with remote applications; authentication/authorization; security; compute/communicate; etc.
Serial and Parallel Algorithms – Evaluation

• Serial algorithm
  – Execution time as a function of input size
• Parallel algorithm
  – Execution time as a function of input size, parallel architecture and number of processors used

Parallel system: a parallel system is the combination of an algorithm and the parallel architecture on which it is implemented.
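Because parallel execution time depends on the processor count, a parallel system is usually summarized by speedup and efficiency derived from measured times. A minimal sketch with hypothetical measurements:

```python
# Standard evaluation metrics for a parallel system: speedup and efficiency.
def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    return speedup(t_serial, t_parallel) / processors

# Hypothetical measurements: 100 s serially, 16 s on 8 processors.
s = speedup(100.0, 16.0)
e = efficiency(100.0, 16.0, 8)
print(f"speedup {s:.2f}, efficiency {e:.2%}")
```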
What is the Grid?

• "Grid Computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation… we review the 'Grid problem', which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources – what we refer to as virtual organizations."

From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke
Distributed Computing vs. Grid

• The Grid is an evolution of distributed computing
  – Dynamic
  – Geographically independent
  – Built around standards
  – Internet backbone
• Distributed computing is an older term
  – Typically built around proprietary software and networks
  – Tightly coupled systems/organizations
Web vs. Grid

• Web: uniform naming and access to documents
• Grid: uniform, high-performance access to computational resources

[Diagram: http:// links connecting colleges/R&D labs, software catalogs and sensor nets.]
Is the World Wide Web a Grid?

• Seamless naming? Yes
• Uniform security and authentication? No
• Information service? Yes and no
• Co-scheduling? No
• Accounting & authorization? No
• User services? No
• Event services? No
• Is the browser a global shell? No
What Does the World Wide Web Bring to the Grid?

• Uniform naming
• A seamless, scalable information service
• A powerful new metadata language: XML
  – XML will be the standard language for describing information in the Grid
  – SOAP (Simple Object Access Protocol) uses XML for encoding and HTTP as its transport
  – SOAP may become a standard RPC mechanism for Grid services
• Portal ideas
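What "SOAP uses XML for encoding" means in practice: an RPC request is just an XML envelope carried over HTTP. A minimal sketch using only the standard library; the method name `runBlast` and its `query` parameter are hypothetical, not from any real grid service.

```python
# Building a bare SOAP 1.1 envelope with the stdlib XML API.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
call = ET.SubElement(body, "runBlast")          # hypothetical method
ET.SubElement(call, "query").text = "ACGTACGT"  # hypothetical parameter

message = ET.tostring(envelope, encoding="unicode")
print(message)
```

In a real deployment this envelope would be POSTed over HTTP to the service endpoint described by its WSDL.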
The Ultimate Goal

• In the future, I will not know or care where my application is executed: I will acquire and pay for these resources as I need them.
Why Grids?

• Large-scale science and engineering are done through the interaction of people, heterogeneous computing resources, information systems and instruments, all of which are geographically and organizationally dispersed.
• The overall motivation for Grids is to facilitate the routine interaction of these resources in order to support large-scale science and engineering.
Why Now?

• Moore's-law improvements in computing produce highly functional end systems
• The Internet and burgeoning wired and wireless networks provide universal connectivity
• Changing modes of working and problem solving emphasize teamwork and computation
• Network exponentials produce dramatic changes in geometry and geography
Network Exponentials

• Network vs. computer performance
  – Computer speed doubles every 18 months
  – Network speed doubles every 9 months
  – Difference: an order of magnitude every 5 years
• 1986 to 2000 – computers: ×500; networks: ×340,000
• 2001 to 2010 – computers: ×60; networks: ×4,000

Moore's law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.
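The "order of magnitude every 5 years" claim follows directly from the two doubling rates, and is easy to verify:

```python
# Quick check of the slide's claim: computers double every 18 months,
# networks every 9 months; over 5 years the gap is ~10x.
months = 5 * 12
computer_growth = 2 ** (months / 18)  # ~10x
network_growth = 2 ** (months / 9)    # ~100x
print(f"computers: x{computer_growth:.0f}, networks: x{network_growth:.0f}, "
      f"gap: x{network_growth / computer_growth:.0f}")
```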
Why Grid?

Motivation: "When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special-purpose appliances." – Gilder Technology Report, June 2002

We are seeing a fundamental change in scientific applications:
• They have become multidisciplinary
• They require an incredible mix of varied technologies and expertise

"Many problems require tightly coupled computers, with low latencies and high communication bandwidths; Grid computing may well increase … demand for such systems by making access easier." – Foster, Kesselman and Tuecke, The Anatomy of the Grid
Convergence between e-Science and e-Business

• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
• A biologist combines a range of diverse and distributed resources (databases, tools, instruments) to answer complex questions
• 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data
• Civil engineers collaborate to design, execute and analyze shake-table experiments
• An enterprise configures internal and external resources to support an e-business workload

From Steve Tuecke, 12 Oct 2001
Convergence between e-Science and e-Business

• Climate scientists visualize, annotate and analyze terabytes of simulation data
• An emergency response team couples real-time data, weather models and population data
• A multidisciplinary analysis in aerospace couples code and data across four companies
• A home user invokes architectural design functions at an application service provider
• An insurance company mines data from partner hospitals for fraud detection
Important Grid Applications
• Data-intensive
• Distributed computing (metacomputing)
• Collaborative
• Remote access to, and computer enhancement of, experimental facilities
An Example Virtual Organization: CERN's Large Hadron Collider

• 1,800 physicists, 150 institutes, 32 countries
• 100 PB of data by 2010; 50,000 CPUs?

www.griphyn.org  www.ppdg.org  www.eu-datagrid.org
Grid Communities & Applications: Data Grids for High-Energy Physics

[Diagram: the LHC tiered computing model. The online system feeds an offline processor farm (~20 TIPS) at the Tier 0 CERN Computer Centre at ~100 MBytes/sec (raw detector output ~PBytes/sec). Tier 1 regional centres (FermiLab ~4 TIPS; France, Italy and Germany regional centres) connect at ~622 Mbits/sec – or air freight (deprecated). Tier 2 centres (~1 TIPS each, e.g. Caltech) connect at ~622 Mbits/sec; institutes (~0.25 TIPS) hold physics data caches fed at ~100 MBytes/sec; physicist workstations (Tier 4) connect at ~1 MBytes/sec.]

• There is a "bunch crossing" every 25 nsec
• There are 100 "triggers" per second
• Each triggered event is ~1 MByte in size
• Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
• 1 TIPS is approximately 25,000 SpecInt95 equivalents

www.griphyn.org  www.ppdg.net  www.eu-datagrid.org
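The figure's numbers are self-consistent and worth checking: 100 triggers per second at ~1 MByte per event gives the ~100 MBytes/sec link out of the online system, and sustained running implies petabyte-scale archives. The "continuous running for a year" assumption below is mine, for scale only.

```python
# Back-of-envelope check of the tiered-model data rates.
triggers_per_sec = 100
event_mbytes = 1
seconds_per_year = 365 * 24 * 3600

rate = triggers_per_sec * event_mbytes     # MBytes/sec out of the farm
yearly_pb = rate * seconds_per_year / 1e9  # MB -> PB
print(f"~{rate} MBytes/sec, ~{yearly_pb:.1f} PB/year if run continuously")
```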
A Brain is a Lot of Data! (Mark Ellisman, UCSD)

• Comparisons must be made among many brains.
• We need to get to one micron to know the location of every cell. We're just now starting to get to 10 microns – Grids will help get us there and further.
Biomedical Informatics Research Network (BIRN)

• An evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer's, etc.)
• Today: one lab, a small patient base, a 4 TB collection
• Tomorrow: tens of collaborating labs; a larger population sample; a 400 TB data collection (more brains, higher resolution); multiple-scale data integration and analysis
The Grid: A Brief History

• Early '90s: gigabit testbeds, metacomputing
• Mid to late '90s: early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments
• 2002: dozens of application communities and projects; major infrastructure deployments; a significant technology base (esp. the Globus Toolkit™); growing industrial interest; Global Grid Forum: ~500 people, 20+ countries
Today's Grid

• A single system interface
• Transparent wide-area access to large data banks
• Transparent wide-area access to applications on heterogeneous platforms
• Transparent wide-area access to processing resources
• Security, certification, single sign-on authentication – Grid Security Infrastructure
• Data access, transfer & replication – GridFTP, Giggle
• Computational resource discovery, allocation and process creation – GRAM, Unicore, Condor-G
Grid Evolution

• First-generation Grid
  – Computationally intensive; file access/transfer
  – A bag of various heterogeneous protocols and toolkits
  – Recognizes the Internet, ignores the Web
  – Academic teams
• Second-generation Grid
  – Data intensive and knowledge intensive
  – Service-based architecture
  – Recognizes the Web and web services
  – Global Grid Forum
  – Industry participation
Challenging Technical Requirements

• Dynamic formation and management of virtual organizations
• Online negotiation of access to services: who, what, why, when, how
• Establishment of applications and systems able to deliver multiple qualities of service
• Autonomic management of infrastructure elements

Open Grid Services Architecture: http://www.globus.org/ogsa
Elements of the Problem
• Resource sharing– Computers, storage, sensors, networks, …– Heterogeneity of device, mechanism, policy– Sharing conditional: negotiation, payment, …
• Coordinated problem solving– Integration of distributed resources– Compound quality of service requirements
• Dynamic, multi-institutional virtual orgs– Dynamic overlays on classic org structures– Map to underlying control mechanisms
The Grid

• Diverse resources
  – Dynamic
  – Unreliable
  – Shared
• Administrative issues
  – Security
  – Multiple organizations
  – Coordinated problem solving
Grid Technologies: Resource Sharing

Mechanisms that …
• Address the security and policy concerns of resource owners and users
• Are flexible enough to deal with many resource types and sharing modalities
• Scale to large numbers of resources, many participants and many program components
• Operate efficiently when dealing with large amounts of data and computation
Aspects of the Problem

1) Need for interoperability when different groups want to share resources
   – Diverse components, policies, mechanisms
   – E.g., standard notions of identity, means of communication, resource descriptions
2) Need for shared infrastructure services to avoid repeated development and installation
   – E.g., one port/service/protocol for remote access to computing, not one per tool/application
   – E.g., certificate authorities: expensive to run

Both point to a common need for protocols and services.
Hence, a Protocol-Oriented View of Grid Architecture, which Emphasizes …

• Development of Grid protocols and services
  – Protocol-mediated access to remote resources
  – New services: e.g., resource brokering
  – "On the Grid" = speaks InterGrid protocols
  – Mostly (extensions to) existing protocols
• Development of Grid APIs and SDKs
  – Interfaces to Grid protocols and services
  – Facilitate application development by supplying higher-level abstractions
The Hourglass Model

• Focus on architecture issues
  – Propose a set of core services as basic infrastructure
  – Use them to construct high-level, domain-specific solutions
• Design principles
  – Keep participation cost low
  – Enable local control
  – Support adaptation
  – The "IP hourglass" model

[Figure: hourglass – applications and diverse global services at the top, core services at the narrow waist, local OS at the bottom.]
Layered Grid Architecture (by Analogy to the Internet Architecture)

• Application
• Collective – "Coordinating multiple resources": ubiquitous infrastructure services, application-specific distributed services
• Resource – "Sharing single resources": negotiating access, controlling use
• Connectivity – "Talking to things": communication (Internet protocols) & security
• Fabric – "Controlling things locally": access to, and control of, resources

[Figure: the Grid layers shown beside the Internet protocol architecture: application, transport, Internet, link.]
Globus Toolkit™

• A software toolkit addressing key technical problems in the development of Grid-enabled tools, services and applications
  – Offers a modular set of orthogonal services
  – Enables incremental development of grid-enabled tools and applications
  – Implements standard Grid protocols and APIs
  – Available under a liberal open-source license
  – Large community of developers and users
  – Commercial support
Grid Architecture & Globus Toolkit

[Diagram: building a Grid – the layered architecture mapped to Globus components:
• Application
• Collective: Grid Information Index Service, replica management, certificate repository (MyProxy), co-allocation library
• Resource: Grid Resource Information Service, Grid Resource Access & Management, GridFTP
• Connectivity: Internet protocols, Globus Security Infrastructure
• Fabric: resources to share, core Grid services, local OS]
Key Protocols

• The Globus Toolkit™ centers around four key protocols
  – Connectivity layer:
    • Security: Grid Security Infrastructure (GSI)
  – Resource layer:
    • Resource management: Grid Resource Allocation Management (GRAM)
    • Information services: Grid Resource Information Protocol (GRIP) and Index Information Protocol (GIIP)
    • Data transfer: Grid File Transfer Protocol (GridFTP)
• Also key collective-layer protocols
  – Information services, replica management, etc.
Why is Grid Security Hard?

• The resources being used may be extremely valuable, and the problems being solved extremely sensitive
• Resources are often located in distinct administrative domains
  – Each resource may have its own policies and procedures
• The set of resources used by a single computation may be large, dynamic and/or unpredictable
  – Not just client/server
• Security must be broadly available and applicable
  – Standard, well-tested, well-understood protocols
  – Integration with a wide variety of tools
Grid Security Requirements

User view:
1) Easy to use
2) Single sign-on
3) Run applications: ftp, ssh, MPI, Condor, Web, …
4) User-based trust model
5) Proxies/agents (delegation)

Resource owner view:
1) Specify local access control
2) Auditing, accounting, etc.
3) Integration with local systems: Kerberos, AFS, license managers
4) Protection from compromised resources

Developer view:
API/SDK with authentication, flexible message protection, flexible communication, delegation, … Direct calls to various security functions (e.g. GSS-API), or security integrated into higher-level SDKs: e.g. GlobusIO, Condor-G, MPICH-G2, HDF5, etc.
Convergence on Service-Oriented Architecture

• Development of service-oriented grid middleware using different technologies (such as Java/Jini or web services) to instantiate the service architecture.

[Diagram: a typical SOA – the service provider registers its service with a lookup service; the service requester uses a service locator to discover matching services, then interacts with the service directly.]
The Future: Web Services

• Web services are self-describing applications that can find and interact with other web applications to complete complex tasks over the Internet.
• Unlike the hard-wired applications of the client-server era, web services are loosely coupled software components that can find and interact with other components on the Internet without manual human intervention.
The Future: Web Services

• Increasingly popular standards-based frameworks for accessing network applications
  – W3C standardization; Microsoft, IBM, Sun and others
• WSDL: Web Services Description Language
  – Interface definition language for web services
• SOAP: Simple Object Access Protocol
  – XML-based RPC protocol, common WSDL target
• WS-Inspection
  – Conventions for locating service descriptions
• UDDI: Universal Description, Discovery & Integration
  – Discovery for web services
Open Grid Services Architecture (OGSA)

• Utilizes standard web services infrastructure
• Builds on the current Globus Toolkit:
  – Grid service: semantics for service interactions
  – Management of transient instances (and state)
  – Factory, registry, discovery and other services
  – Reliable and secure transport
• Multiple hosting targets: J2EE, .NET, "C", …
• Service-oriented architecture enables resource virtualization
• Delivery via the open-source Globus Toolkit 3.0
  – Leverages GT experience, code and mindshare
BioGrid approach
• Standardize interfaces
• Provide global directory of objects
• Distribute computation transparently
• Distribute data transparently
• Provide security on all object storage, transfer and communications
• Provide accountability, credibility and identification
• Bundle everything in a plug-and-play package
Typical Computing in Bioinformatics

[Diagram: a job is made up of a great many similar tasks (Task 1 … Task 1000), independent of each other. The tasks are divided into batches (tasks 1–250, 251–500, 501–750, 751–1000), each sent to a node holding its own copy of the database and software.]
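The decomposition in the diagram is a simple chunking step: divide the task list into equal batches, one per node. A minimal sketch (node count and task names are illustrative):

```python
# Divide a job of many similar, independent tasks into equal batches.
def divide(tasks: list, n_batches: int) -> list:
    size = len(tasks) // n_batches
    return [tasks[i * size:(i + 1) * size] for i in range(n_batches)]

tasks = [f"task-{i}" for i in range(1, 1001)]  # Task 1 .. Task 1000
batches = divide(tasks, 4)                     # 4 nodes, 250 tasks each

for i, batch in enumerate(batches, 1):
    print(f"node {i}: {batch[0]} .. {batch[-1]} ({len(batch)} tasks)")
```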
Bioinformatics Environment

[Diagram: the bioinformatics grid environment. A job (a list of tasks) is divided by the job dispatcher (obidispatch), which performs a node search against an environment information server. Each node in the set runs the Globus Toolkit, an environment scanner (obiregist) reporting its software, hardware and database environment, a temporal work area for job execution, and PostgreSQL; environment information is transferred and updated by the obiupdate command via a P2P server. OBIEnv users are authenticated locally against a list of OBIEnv users, keeping out unauthorized local users. Results are reported back.]
Parallel Job Execution

[Diagram: a job is a task list of homology searches – "blast Q1 genbank" … "blast Q10 genbank" – whose tasks are independent of each other. The job dispatcher (obidispatch) asks the environment information server for nodes with TrEMBL and BLAST, then distributes the queries across the matching set of nodes, two per node (Q1–Q2, Q3–Q4, Q5–Q6, Q7–Q8, Q9–Q10).]
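The dispatch step in the diagram has two parts: filter nodes by capability, then deal the queries out to the eligible nodes. A minimal sketch, dealing round-robin rather than in consecutive pairs; node names and capability sets are hypothetical.

```python
# Capability-matching dispatch: find nodes holding both the database and
# the tool, then assign queries round-robin across them.
nodes = {
    "node-a": {"TrEMBL", "BLAST"},
    "node-b": {"TrEMBL", "BLAST"},
    "node-c": {"GenBank"},           # lacks the required resources
    "node-d": {"TrEMBL", "BLAST"},
    "node-e": {"TrEMBL", "BLAST"},
    "node-f": {"TrEMBL", "BLAST"},
}
required = {"TrEMBL", "BLAST"}
eligible = [n for n, caps in sorted(nodes.items()) if required <= caps]

queries = [f"Q{i}" for i in range(1, 11)]
assignment = {n: [] for n in eligible}
for i, q in enumerate(queries):
    assignment[eligible[i % len(eligible)]].append(q)

for node, qs in assignment.items():
    print(node, qs)
```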
Typical Database Access in Bioinformatics

[Diagram: two common patterns – applications (App1, App2) accessing databases at Site A and Site B through web services; and mirroring, where the sites' databases are replicated so that mirrored applications (App1', App2') run against local copies.]
Database Federation and Computational Pipeline

[Diagram: a computational pipeline of applications (App1 … App5) carries data up through the omic layers – genome, transcriptome, proteome, metabolome, phenome – built on database federation plus web services.]
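The pipeline in the figure is just staged composition: each application's output feeds the next, from genome-level data toward phenome-level conclusions. A minimal sketch; the stage functions are hypothetical placeholders.

```python
# A computational pipeline as function composition: each stage consumes
# the previous stage's output.
def stage(name):
    def run(data):
        return data + [name]  # record that this stage processed the data
    return run

pipeline = [stage(s) for s in
            ("genome", "transcriptome", "proteome", "metabolome", "phenome")]

result = []
for step in pipeline:
    result = step(result)
print(" -> ".join(result))
```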
Virtual Organizations on the Grid

A virtual organization (VO) provides a boundary for knowledge sharing across geographical and organizational limits.

[Diagram: projects as VOs cutting across organizations A, B, C and D.]
BioGrid Schematic
• Grid-aware client software
• Data and software resource directories
• Grid of processing computers
Open Grid Service Architecture
Future Grid Challenges

• Need a "power station" on the Grid
  – Buy (obtain) resources as required
• Need to understand how applications behave
  – Balance data transfer vs. compute shipping
• Need scalable wide-area service discovery
  – Peer-to-peer or centralized servers
  – Metadata to describe Grid services
• Need to exploit distributed services
  – Grid service orchestration
  – Optimise service selection and recover from failure
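The "data transfer vs. compute shipping" trade-off above reduces to a comparison of movement costs: ship whichever side (the data or the program) is cheaper to move. A minimal sketch with hypothetical sizes and bandwidth:

```python
# Decide whether to move the data to the computation or the computation
# to the data, based on transfer cost. All figures are hypothetical.
def transfer_seconds(size_mb: float, bandwidth_mb_s: float) -> float:
    return size_mb / bandwidth_mb_s

data_mb = 50_000   # a 50 GB database
code_mb = 10       # the analysis program
bandwidth = 12.5   # ~100 Mbit/s link, in MB/s

ship_data = transfer_seconds(data_mb, bandwidth)
ship_code = transfer_seconds(code_mb, bandwidth)
choice = ("ship the computation to the data" if ship_code < ship_data
          else "ship the data to the computation")
print(f"data: {ship_data:.0f}s, code: {ship_code:.1f}s -> {choice}")
```

A real scheduler would also weigh compute capacity at each site, but the bandwidth asymmetry alone usually dominates for large databases.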
The Grid

• The Grid is all about the coordinated, transparent, secure and effective utilization of geographically distributed heterogeneous resources (both hardware and software) for applications.
• To be successful, the Grid has to support applications in the same way that the power utilities support the use of household appliances.
• The metaphor: computers act as generators of computational "power", applications become computational appliances, and the software infrastructure acts as the utility responsible for managing the interaction between them.
Whom Does Grid Computing Serve?

• The users and their applications
• Large, complex applications which need resources beyond the traditional
  – Parallel/distributed processing in a box
  – Put-together-yourself clusters
• Applications that describe multiple aspects of a system
• Applications consisting of multiple modules
• Applications with multi-source data
• Applications interfacing with measurement and visualization systems

Application programmers will be able to write applications that leverage teraflops of computation and petabytes of storage.
Grid System – Three-Point Checklist

• Coordinates resource sharing that is not subject to centralized control
• Uses standard, open, general-purpose protocols and interfaces
• Delivers nontrivial qualities of service
Application Development on the Grid

What do application developers need to think about in Grid environments?

• This is very similar to the requirements for an application to run on many different architectures
• Developers now also need to consider that not all processes in an application necessarily run on the same resource, or even the same architecture
• Not all processes have access to the same environment, or may be able to reach the same set of remote resources
Hook enough computers together and what do you get?
A new kind of utility that offers supercomputer processing on tap.
Access Grid

• High-end group work and collaboration technology
• Grid services used for discovery, configuration and authentication
• O(50) systems deployed worldwide
• Basis for the SC Global event at SC'2001 in November 2001 – www.scglobal.org

[Diagram: an Access Grid node, with ambient (tabletop) mic, presenter mic, presenter camera and audience camera.]

www.accessgrid.org
Building Bridges for the Future of Science

Grid computing is a paradigm that will have considerable impact on how computing resources are provisioned – and Java™ technology is a primary technology that will enable it.