

Dheeraj Bhardwaj <dheerajb@cse.iitd.ac.in> December 2003

Dheeraj Bhardwaj
Department of Computer Science & Engineering
Indian Institute of Technology, Delhi 110 016, India
http://www.cse.iitd.ac.in/~dheerajb

BioGrid: Challenges, Problems and Opportunities

[Figure: cycle relating BIOLOGICAL PHENOMENON, DATA and MODEL, with arrows labeled "measurement process", "data analysis, learning" and "inference, conclusions"]

Bioinformatics vs. Biocomputing

[Figure labels: Bioinformatics, Biocomputing, IT, BT]


Genome

Phenome

Biological Data

“Maze” on a Jigsaw Puzzle

Equipment for a New Quest

• Data, Knowledge and Tools
• High Performance Computers
• Collaboration of Human Experts

The illustrations are quoted from the following sites:
www.dnr.state.wi.us/org/ aw/air/ed/educatio.htm
www.mtnbrook.k12.al.us/academy/2ndgrade/mtn/map.htm

Needs of High Performance Computing

• Increase of Genome Sequence Information
• Combinatorial Increase of Search Space: Genome * Transcriptome * Proteome * ... * Phenome
• Computer Simulation and Unknown Parameter Estimation

Knowledge integration in “Omic Space”

Needs of High Performance Computing

• Impact of Genome Sequence Projects
  – Human Genome (3,000 Mbp, 2000)
  – Rapid Increase of Genome Sequence Databases
  – Strong Computation Demand for Homology Search
• Start of Structural Genomics Projects
  – Determine 10,000 folds in 5 years
  – Strong Computation Demand for Molecular Simulation

1st Issue: Homology Search

• Rapid Increase of Data Size: doubles every year, updated daily (17 million entries, 50 GBytes as of Oct. 2002)

[Figure: growth of the EMBL, GENBANK and DDBJ databases; vertical axis from 0 to 14,000,000,000]

[Figure: Rough Estimation of Homology Search Time for Mouse cDNA (5,000 Seq.) * Human Genome (3,000 Mbp); CPU counts 1, 8, 32, 256 and 6,400 plotted against a time scale of 1 year, 1 month, 1 week, 1 day and 1 hour]
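The time scale in the chart is roughly what ideal linear scaling would produce. The sketch below is only an illustration under two assumptions that the slide does not state: that a single CPU needs about one year, and that the search is embarrassingly parallel with no communication overhead. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def ideal_search_hours(cpus, single_cpu_hours=HOURS_PER_YEAR):
    """Wall-clock hours if the search scaled perfectly with CPU count.

    The one-year single-CPU baseline is an assumption read off the chart.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return single_cpu_hours / cpus


if __name__ == "__main__":
    # CPU counts taken from the chart labels.
    for cpus in (1, 8, 32, 256, 6400):
        hours = ideal_search_hours(cpus)
        print(f"{cpus:>5} CPUs: {hours:8.1f} h ({hours / 24:6.1f} days)")
```

Under these assumptions 256 CPUs land between one and two days and 6,400 CPUs at a little over one hour, which matches the order of magnitude of the chart's labels.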

2nd Issue: Molecular Simulation

Nanosecond-order Molecular Dynamics simulation of protein molecules with 100,000 – 1,000,000 molecular weight

• Stability Analysis
• Affinity Analysis
• Folding Simulation

Ex. Ras p21
  – Number of residues: 189
  – Molecular weight: 21 kD
  – Oncogene variant: Gly12 → Val
  – 5 ns: 1,000 h / 32 Gflops computer

[Figure labels: GTP, Mg, Lys16]
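To put the Ras p21 figure in perspective, the arithmetic below reads the slide's "5ns1000h/32Gflops" as "a 5 ns simulation takes 1,000 hours on a computer sustaining 32 Gflops". That reading, the assumption of a fully sustained rate, and the linear scaling with simulated time are all assumptions for illustration.

```python
def total_flop(hours, sustained_gflops):
    """Floating-point operations performed in `hours` at a sustained rate."""
    return hours * 3600 * sustained_gflops * 1e9


def hours_for(simulated_ns, ref_ns=5, ref_hours=1000):
    """Scale the reference run linearly with simulated time (an assumption)."""
    return ref_hours * simulated_ns / ref_ns


flop_5ns = total_flop(1000, 32)
print(f"5 ns run: {flop_5ns:.3e} floating-point operations")
print(f"1 microsecond on the same machine: {hours_for(1000):,.0f} h")
```

On this reading, extending the run from nanoseconds to a microsecond on the same machine would take hundreds of thousands of hours, which is the kind of demand the slide is pointing at.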


Needs of Resource Sharing

• Biological Databases (Unigene, TrEMBL,...)

• Bioinformatics Tools (BLAST, HMMER, ...)

• Programming Language (Bioperl, Biojava, ...)


Needs of Human Collaboration

Grid for Bioinformatics

• Effective for “Embarrassingly Parallel Computation”: Homology Search, Motif Search, Unknown Parameter Estimation for Cellular Models, etc.
• “Distributed Resource Sharing” among organizations: Web Services, Workflow and Computational Pipeline, Autonomous Database Update, etc.
• “Field” for Human Collaboration: Group Work for Genome Annotation, Whole Cell Simulation, Collaboration between Biologists and Computer Scientists, etc.

Summary of Bioinformatics Trend

• Rapid increase of Genomic database size: causes severe overhead for database service
• Demand for Molecular Dynamics Simulation: requires High performance computers (including special-purpose computers)

Needs a new Bioinformatics Platform for sharing Databases and High performance computers

Strategic Technology Domain

• Information Integration from Genome to Phenome
• Modeling and Simulation from Molecular to Cell
• High Performance Computing (PC-cluster, SMP, Vector)
• Grid

Evolution of the Scientific Process

• Pre-electronic
  – Theorize and/or experiment, alone or in small teams; publish paper
• Post-electronic
  – Construct and mine very large databases of observational or simulation data
  – Develop computer simulations & analyses
  – Exchange information quasi-instantaneously within large, distributed, multidisciplinary teams

[Figure: Compute Requirements, 1970 to 2005. Vertical axis: Algorithmic Complexity/Data Volume. Eras along the curve: Mainframes, Vector Processors, Supercomputers, MPP/SMP, Scalable Parallel Systems, Distributed & Grid]

• IBM 360/370, CDC 1604/600, UNIVAC 1100: ~3 MFLOPS per $ million
• DEC VAX/FPS, IBM, CDC, UNIVAC: ~5 MFLOPS per $ million
• CRAY 1, CDC 203: ~20 MFLOPS per $ million
• CRAY XMP, CONVEX C1, ALLIANT: ~60 MFLOPS per $ million
• CRAY YMP, CONVEX C2: ~200-400 MFLOPS per $ million
• SGI Power Ch, IBM SP2, CM5: ~2-3 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP: ~5-8 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP, SUN ES 10000: ~20 GFLOPS per $ million
• LINUX CLUSTERS: ~100 GFLOPS per $ million
• COMPUTATIONAL GRID: ~1000 GFLOPS per $ million

• Systems getting larger by 2x, 3x, 4x per year!
  – Increasing parallelism: add more and more processors
• New Kind of Parallelism: GRID
  – Harness the power of Computing Resources which are growing

HPC Applications Issues

• Architectures and Programming Models
  – Distributed Memory Systems (MPP, Clusters): Message Passing
  – Shared Memory Systems (SMP): Shared Memory Programming
  – Specialized Architectures: Vector Processing, Data Parallel Programming
  – The Computational Grid: Grid Programming
• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation
• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks

Future of Scientific Computing

• Require Large Scale Simulations, beyond the reach of any machine
• Require Large Geo-distributed Cross Disciplinary Collaborations
• Systems getting larger by 2x, 3x, 4x per year!
  – Increasing parallelism: add more and more processors
• New Kind of Parallelism: GRID
  – Harness the power of Computing Resources which are growing

What do we want to Achieve?

• Develop High Performance Computing (HPC) Applications which are
  – Portable (Laptop, Supercomputers, Grid)
  – Future Proof
  – Grid Ready
• Develop HPC Infrastructure (Parallel & Grid Systems) which is
  – User Friendly
  – Based on Open Source
  – Efficient in Problem Solving
  – Able to Achieve High Performance
  – Able to Handle Large Data Volumes


Parallel Computer and Grid

A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”.

A Computational Grid is an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks and other resources.

A Comparison

SERIAL
  – Fetch/Store
  – Compute

PARALLEL
  – Fetch/Store
  – Compute/Communicate
  – Cooperative game

GRID
  – Fetch/Store
  – Discovery of Resources
  – Interaction with remote application
  – Authentication / Authorization
  – Security
  – Compute/Communicate
  – Etc.


Serial and Parallel Algorithms - Evaluation

• Serial Algorithm

– Execution time as a function of size of input

• Parallel Algorithm

– Execution time as a function of input size, parallel architecture and number of processors used

Parallel System

A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented
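To illustrate why a parallel algorithm has to be evaluated together with its architecture and processor count, the toy model below charges a fixed communication cost per additional processor on top of perfectly divided work. The cost model and all function names are assumptions made for this sketch, not something taken from the slides.

```python
def parallel_time(serial_time, procs, comm_cost=0.0):
    """Toy model: work divides evenly, plus comm_cost per extra processor."""
    if procs < 1:
        raise ValueError("procs must be at least 1")
    return serial_time / procs + comm_cost * (procs - 1)


def speedup(serial_time, procs, comm_cost=0.0):
    """Ratio of serial execution time to parallel execution time."""
    return serial_time / parallel_time(serial_time, procs, comm_cost)


def efficiency(serial_time, procs, comm_cost=0.0):
    """Speedup per processor; 1.0 means ideal scaling."""
    return speedup(serial_time, procs, comm_cost) / procs


if __name__ == "__main__":
    for p in (1, 4, 8, 64):
        print(p, round(speedup(100.0, p, comm_cost=1.0), 2),
              round(efficiency(100.0, p, comm_cost=1.0), 2))
```

With a non-zero communication cost the speedup peaks and then falls as processors are added, so the same algorithm can look good or bad depending on the machine it runs on.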


What is the Grid

• “Grid Computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high performance orientation…we review the “Grid problem”, which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources- what we refer to as virtual organizations.”

From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke

Distributed Computing vs. GRID

• Grid is an evolution of distributed computing
  – Dynamic
  – Geographically independent
  – Built around standards
  – Internet backbone
• Distributed computing is an “older term”
  – Typically built around proprietary software and network
  – Tightly coupled systems/organization

Web vs. GRID

• Web: uniform naming access to documents
• Grid: uniform, high performance access to computational resources

[Figure labels: Colleges/R&D Labs, Software Catalogs, Sensor nets, http://]

Is the World Wide Web a Grid?

• Seamless naming? Yes
• Uniform security and Authentication? No
• Information Service? Yes or No
• Co-Scheduling? No
• Accounting & Authorization? No
• User Services? No
• Event Services? No
• Is the Browser a Global Shell? No

What does the World Wide Web bring to the Grid?

• Uniform Naming
• A seamless, scalable information service
• A powerful new meta-data language: XML
  – XML will be the standard language for describing information in the grid
  – SOAP: Simple Object Access Protocol
     • Uses XML for encoding, HTTP for protocol
  – SOAP may become a standard RPC mechanism for Grid services
• Portal Ideas
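As a sketch of what "XML for encoding" means for SOAP as an RPC mechanism, the snippet below builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name `runBlast` and its parameters are hypothetical, chosen only for illustration. In practice such a message would typically be sent as the body of an HTTP POST.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:biogrid"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Return a SOAP 1.1 envelope (as a string) for a remote procedure call."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element and its children carry the call and its arguments.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_soap_request("runBlast", {"query": "Q1", "database": "genbank"}))
```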


The Ultimate Goal

• In future I will not know or care where my application will be executed as I will acquire and pay to use these resources as I need them


Why Grids?

• Large-scale science and engineering are done through the interaction of people, heterogeneous computing resources, information systems, and instruments, all of which are geographically and organizationally dispersed.

• The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale science and Engineering.

Why Now?

• Moore’s law improvements in computing produce highly functional end systems
• The Internet and burgeoning wired and wireless networks provide universal connectivity
• Changing modes of working and problem solving emphasize teamwork, computation
• Network exponentials produce dramatic changes in geometry and geography

Network Exponentials

• Network vs. computer performance
  – Computer speed doubles every 18 months
  – Network speed doubles every 9 months
  – Difference = order of magnitude per 5 years
• 1986 to 2000
  – Computers: x 500
  – Networks: x 340,000
• 2001 to 2010
  – Computers: x 60
  – Networks: x 4000

Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vined Khoslan, Kleiner, Caufield and Perkins.
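The figures above follow from compound doubling. A minimal sketch, assuming clean exponential growth at exactly the stated doubling periods (the function name is hypothetical):

```python
COMPUTER_DOUBLING_MONTHS = 18
NETWORK_DOUBLING_MONTHS = 9


def growth_factor(months, doubling_months):
    """Multiplicative improvement after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


# Gap that opens between networks and computers over five years.
five_year_gap = (growth_factor(60, NETWORK_DOUBLING_MONTHS)
                 / growth_factor(60, COMPUTER_DOUBLING_MONTHS))
print(f"network/computer gap after 5 years: x{five_year_gap:.1f}")

# Nine years, as in the 2001 to 2010 row.
print(f"computers over 9 years: x{growth_factor(108, COMPUTER_DOUBLING_MONTHS):.0f}")
print(f"networks over 9 years:  x{growth_factor(108, NETWORK_DOUBLING_MONTHS):.0f}")
```

The five-year gap comes out at about one order of magnitude, and nine years of doubling gives factors of 64 and 4096, in line with the rounded x 60 and x 4000 quoted for 2001 to 2010. The 1986 to 2000 figures are historical estimates and only match this idealized model to within the same order of magnitude.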

Why Grid?

Motivation: When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances. (Gilder Technology Report, June 2002)

We are seeing a Fundamental Change in Scientific Applications

• They have become multidisciplinary
• Require an incredible mix of various technologies and expertise

“Many problems require tightly coupled computers, with low latencies and high communication bandwidths; Grid computing may well increase … demand for such systems by making access easier” - Foster, Kesselman, Tuecke, The Anatomy of the Grid

Convergence between e-Science and e-Business

• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

• A biologist combines a range of diverse and distributed resources (databases, tools, instruments) to answer complex questions

• 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data

• Civil engineers collaborate to design, execute, & analyze shake table experiments

• An enterprise configures internal & external resources to support eBusiness workload

From Steve Tuecke 12 Oct’01

Convergence between e-Science and e-Business

• Climate scientists visualize, annotate, & analyze terabyte simulation datasets

• An emergency response team couples real time data, weather model, population data

• A multidisciplinary analysis in aerospace couples code and data in four companies

• A home user invokes architectural design functions at an application service provider

• An insurance company mines data from partner hospitals for fraud detection


Important Grid Applications

• Data-intensive

• Distributed computing (metacomputing)

• Collaborative

• Remote access to, and computer enhancement of, experimental facilities

An Example Virtual Organization: CERN’s Large Hadron Collider

1800 Physicists, 150 Institutes, 32 Countries

100 PB of data by 2010; 50,000 CPUs?

www.griphyn.org www.ppdg.org www.eu-datagrid.org

Grid Communities & Applications: Data Grids for High Energy Physics

[Figure: tiered data grid, with tiers labeled Tier 0, Tier 1, Tier 2 and Tier 4. Components shown: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France, Italy and Germany Regional Centres; Tier2 Centres (~1 TIPS each); Caltech (~1 TIPS); Institutes (~0.25 TIPS) with a physics data cache; Physicist workstations. Link labels: ~PBytes/sec, ~100 MBytes/sec, ~622 Mbits/sec, ~622 Mbits/sec or Air Freight (deprecated), ~1 MBytes/sec]

• There is a “bunch crossing” every 25 nsecs.
• There are 100 “triggers” per second
• Each triggered event is ~1 MByte in size
• Physicists work on analysis “channels”.
• Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server

1 TIPS is approximately 25,000 SpecInt95 equivalents

www.griphyn.org www.ppdg.net www.eu-datagrid.org
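The ~100 MBytes/sec link label follows directly from the trigger rate and event size quoted on the slide. The sketch below works through that arithmetic; the function names are hypothetical, and the per-day figure assumes continuous running and decimal units (1 TByte = 10^6 MBytes), neither of which the slide states.

```python
SECONDS_PER_DAY = 86_400


def crossings_per_second(interval_ns=25):
    """Bunch crossings per second for a given crossing interval."""
    return 1e9 / interval_ns


def recorded_mbytes_per_second(triggers_per_sec=100, event_mbytes=1):
    """Data rate of triggered events leaving the online system."""
    return triggers_per_sec * event_mbytes


def recorded_tbytes_per_day(triggers_per_sec=100, event_mbytes=1):
    """Daily volume, assuming continuous running and decimal units."""
    rate = recorded_mbytes_per_second(triggers_per_sec, event_mbytes)
    return rate * SECONDS_PER_DAY / 1e6


print(f"bunch crossings per second: {crossings_per_second():,.0f}")
print(f"fraction of crossings kept: {100 / crossings_per_second():.1e}")
print(f"recorded rate: {recorded_mbytes_per_second()} MBytes/sec")
print(f"recorded volume: {recorded_tbytes_per_day():.2f} TBytes/day")
```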

A Brain is a Lot of Data! (Mark Ellisman, UCSD)

And comparisons must be made among many

We need to get to one micron to know the location of every cell. We’re just now starting to get to 10 microns – Grids will help get us there and further

Biomedical Informatics Research Network (BIRN)

• Evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer’s, etc.).
• Today
  – One lab, small patient base
  – 4 TB collection
• Tomorrow
  – 10s of collaborating labs
  – Larger population sample
  – 400 TB data collection: more brains, higher resolution
  – Multiple scale data integration and analysis

The Grid: A Brief History

• Early 90s
  – Gigabit testbeds, metacomputing
• Mid to late 90s
  – Early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments
• 2002
  – Dozens of application communities & projects
  – Major infrastructure deployments
  – Significant technology base (esp. Globus Toolkit™)
  – Growing industrial interest
  – Global Grid Forum: ~500 people, 20+ countries

Today’s Grid

• A single system interface
• Transparent wide-area access to large data banks
• Transparent wide-area access to applications on heterogeneous platforms
• Transparent wide-area access to processing resources
• Security, certification, single sign-on authentication
  – Grid Security Infrastructure
• Data access, Transfer & Replication
  – GridFTP, Giggle
• Computational resource discovery, allocation and process creation
  – GRAM, Unicore, Condor-G

Grid Evolution

• First Generation Grid
  – Computationally intensive, file access/transfer
  – Bag of various heterogeneous protocols & toolkits
  – Recognizes internet, ignores web
  – Academic Team
• Second Generation Grid
  – Data intensive, knowledge intensive
  – Service based architecture
  – Recognizes Web and Web services
  – Global Grid Forum
  – Industry participation

Challenging Technical Requirements

• Dynamic formation and management of virtual organizations
• Online negotiation of access to services: who, what, why, when, how
• Establishment of applications and systems able to deliver multiple qualities of service
• Autonomic management of infrastructure elements

Open Grid Services Architecture
http://www.globus.org/ogsa

Elements of the Problem

• Resource sharing
  – Computers, storage, sensors, networks, …
  – Heterogeneity of device, mechanism, policy
  – Sharing conditional: negotiation, payment, …
• Coordinated problem solving
  – Integration of distributed resources
  – Compound quality of service requirements
• Dynamic, multi-institutional virtual orgs
  – Dynamic overlays on classic org structures
  – Map to underlying control mechanisms

The Grid

• Diverse Resources
  – Dynamic
  – Unreliable
  – Shared
• Administrative Issues
  – Security
  – Multiple organizations
  – Coordinated problem solving

Grid Technologies: Resource Sharing

Mechanisms That …

• Address security and policy concerns of resource owners and users

• Are flexible enough to deal with many resource types and sharing modalities

• Scale to large number of resources, many participants, many program components

• Operate efficiently when dealing with large amounts of data & computation

Aspects of the Problem

1) Need for interoperability when different groups want to share resources
  – Diverse components, policies, mechanisms
  – E.g., standard notions of identity, means of communication, resource descriptions
2) Need for shared infrastructure services to avoid repeated development, installation
  – E.g., one port/service/protocol for remote access to computing, not one per tool/application
  – E.g., Certificate Authorities: expensive to run

• A common need for protocols & services

Hence, a Protocol-Oriented View of Grid Architecture, that Emphasizes …

• Development of Grid protocols & services
  – Protocol-mediated access to remote resources
  – New services: e.g., resource brokering
  – “On the Grid” = speak Intergrid protocols
  – Mostly (extensions to) existing protocols
• Development of Grid APIs & SDKs
  – Interfaces to Grid protocols & services
  – Facilitate application development by supplying higher-level abstractions

The Hourglass Model

• Focus on architecture issues
  – Propose set of core services as basic infrastructure
  – Use to construct high-level, domain-specific solutions
• Design principles
  – Keep participation cost low
  – Enable local control
  – Support for adaptation
  – “IP hourglass” model

[Figure: hourglass with labels Applications, Diverse global services, Core services, Local OS]

Layered Grid Architecture (By Analogy to Internet Architecture)

Grid layers:
• Application
• Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
• Resource: “Sharing single resources”: negotiating access, controlling use
• Connectivity: “Talking to things”: communication (Internet protocols) & security
• Fabric: “Controlling things locally”: access to, & control of, resources

Internet Protocol Architecture, shown alongside: Application, Transport, Internet, Link

Globus Toolkit™

• A software toolkit addressing key technical problems in the development of Grid-enabled tools, services, and applications
  – Offer a modular set of orthogonal services
  – Enable incremental development of grid-enabled tools and applications
  – Implement standard Grid protocols and APIs
  – Available under liberal open source license
  – Large community of developers & users
  – Commercial support

Grid Architecture & Globus ToolKit

[Figure: Building Grid. Layer labels: Application, Collective, Resource, Connectivity, Fabric; side labels: Core Grid Services, Local OS]

Components shown in the figure:
• Grid Information Index service, Replica management, Certificate repository (MyProxy), Co-allocation library
• Grid Resource Information Service, Grid Resource Access & Management, GridFTP
• Internet protocol, Globus Security Infrastructure
• Resources to Share

Key Protocols

• The Globus Toolkit™ centers around four key protocols
  – Connectivity layer:
     • Security: Grid Security Infrastructure (GSI)
  – Resource layer:
     • Resource Management: Grid Resource Allocation Management (GRAM)
     • Information Services: Grid Resource Information Protocol (GRIP) and Index Information Protocol (GIIP)
     • Data Transfer: Grid File Transfer Protocol (GridFTP)
• Also key collective layer protocols
  – Info Services, Replica Management, etc.

Why is Grid Security Hard?

• Resources being used may be extremely valuable & the problems being solved extremely sensitive
• Resources are often located in distinct administrative domains
  – Each resource may have its own policies & procedures
• The set of resources used by a single computation may be large, dynamic, and/or unpredictable
  – Not just client/server
• It must be broadly available & applicable
  – Standard, well-tested, well-understood protocols
  – Integration with wide variety of tools

Grid Security Requirements

User View
1) Easy to use
2) Single sign-on
3) Run applications: ftp, ssh, MPI, Condor, Web, …
4) User based trust model
5) Proxies/agents (delegation)

Resource Owner View
1) Specify local access control
2) Auditing, accounting, etc.
3) Integration w/ local system: Kerberos, AFS, license mgr.
4) Protection from compromised resources

Developer View
API/SDK with authentication, flexible message protection, flexible communication, delegation, ...
  – Direct calls to various security functions (e.g. GSS-API)
  – Or security integrated into higher-level SDKs: e.g. GlobusIO, Condor-G, MPICH-G2, HDF5, etc.

Convergence on Service Oriented Architecture

• Development of service oriented grid middleware using different technologies (such as Java/Jini, web services) to instantiate the service architecture.

[Figure: a typical SOA. Roles: Service Requester, Service locator (Lookup Service), Service provider. Arrows: Register Service, Discover Service, Service Matches, Interaction with Service]
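The register, discover and interact steps of a typical SOA can be sketched in a few lines. Everything below is a hypothetical in-memory stand-in written for illustration: it is not Jini, UDDI or any Globus API, and the `reverse-complement` service is an invented example.

```python
class LookupService:
    """In-memory stand-in for the service locator (hypothetical)."""

    def __init__(self):
        self._providers = {}

    def register(self, service_name, provider):
        """Called by a service provider to advertise a callable service."""
        self._providers.setdefault(service_name, []).append(provider)

    def discover(self, service_name):
        """Called by a requester; returns the matching providers, if any."""
        return list(self._providers.get(service_name, []))


def reverse_complement(seq):
    """Example service: reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


lookup = LookupService()
lookup.register("reverse-complement", reverse_complement)  # Register Service
matches = lookup.discover("reverse-complement")            # Discover Service / Service Matches
result = matches[0]("GATTACA")                             # Interaction with Service
print(result)
```

The requester never names the provider directly; it only names the service it wants, which is what lets providers come and go behind the lookup service.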


The future.. Web Services

• Web services are self-describing applications that can find and interact with other web applications to complete complex tasks over the internet.

• Unlike the hard-wired applications of the client-server computing days, web services are loosely coupled software components that can find and interact with other components on the internet without manual human intervention

The future… Web services

• Increasingly popular standards-based frameworks for accessing network applications
  – W3C standardization, Microsoft, IBM, SUN, others
• WSDL: Web Services Description Language
  – Interface definition language for web services
• SOAP: Simple Object Access Protocol
  – XML-based RPC protocol, common WSDL target
• WS-Inspection
  – Conventions for locating service descriptions
• UDDI: Universal Description, Discovery, & Integration
  – Discovery for Web services

Open Grid Service Architecture (OGSA)

• Utilize standard Web services infrastructures
• Building on current Globus toolkit:
  – Grid service: semantics for service interactions
  – Management of transient instances (& state)
  – Factory, registry, discovery, other services
  – Reliable and secure transport
• Multiple hosting targets: J2EE, .NET, “C”, …
• Service-oriented architecture enables resource virtualization
• Delivery via open source Globus Toolkit 3.0
  – Leverage GT experience, code, mindshare


BioGrid approach

• Standardize interfaces

• Provide global directory of objects

• Distribute computation transparently

• Distribute data transparently

• Provide security on all object storage, transfer and communications

• Provide accountability, credibility and identification

• Bundle everything in a plug-and-play package

Typical Computing in Bioinformatics

A great many similar tasks, independent of each other

[Figure: a Job made of Task 1, Task 2, ..., Task 999, Task 1000 is divided into four blocks (Task 1-250, Task 251-500, Task 501-750, Task 751-1000), each paired with its own DB and Software]
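The division of 1,000 independent tasks into four blocks can be sketched as a contiguous block partition. The function name is hypothetical, and spreading any remainder over the first nodes is a design choice made for this sketch rather than something the slide specifies.

```python
def split_into_blocks(n_tasks, n_nodes):
    """Return (first, last) task numbers, 1-based and inclusive, per node.

    Blocks are contiguous; any remainder goes one task at a time to the
    first nodes. Nodes that would receive no task are omitted.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    base, extra = divmod(n_tasks, n_nodes)
    blocks, first = [], 1
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        if size == 0:
            break  # more nodes than tasks
        blocks.append((first, first + size - 1))
        first += size
    return blocks


print(split_into_blocks(1000, 4))
```

Because the tasks are independent, no block needs results from another, so each node only needs its own copy of the database and software.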

Bioinformatics Environment

[Figure: OBIEnv architecture. Labels recovered from the diagram:]
• OBIEnv User, List of OBIEnv Users, Local Authentication, Unauthorized Local Users
• Job (List of Tasks), Job Dispatcher (obidispatch), Divided Jobs, Node Search, Set of Nodes
• Environment Information Server (PostgreSQL, P2P Server)
• Node: Globus Tool Kit, DB, SW, HW, Environment Scanner (obiregist), Reporting Environmental Information, Temporary Work Area for Job Execution, Results
• Transferred and updated by obiupdate command

Parallel Job Execution

Job (Task List):
  blast Q1 genbank
  blast Q2 genbank
  :
  blast Q10 genbank

[Figure: the Job Dispatcher (obidispatch) queries the Environment Information Server (“Nodes with TrEMBL and BLAST?”) and obtains a Set of Nodes; five TrEMBL nodes receive Q1,Q2; Q3,Q4; Q5,Q6; Q7,Q8; Q9,Q10]

Tasks are independent of each other
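The dispatch step in the figure can be sketched as a capability query followed by a block assignment. The code below is an illustration only: `Node`, `find_nodes` and `dispatch` are hypothetical names, the registry is an in-memory list standing in for the Environment Information Server, and nothing here reflects the actual interface of obidispatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One registry entry: a node and what is installed on it (hypothetical)."""
    name: str
    databases: frozenset
    tools: frozenset


def find_nodes(nodes, database, tool):
    """Node search: keep only nodes that hold both the database and the tool."""
    return [n for n in nodes if database in n.databases and tool in n.tools]


def dispatch(tasks, nodes):
    """Assign contiguous blocks of independent tasks to the given nodes."""
    if not nodes:
        raise ValueError("no eligible nodes")
    per_node = -(-len(tasks) // len(nodes))  # ceiling division
    return {node.name: tasks[i * per_node:(i + 1) * per_node]
            for i, node in enumerate(nodes)}


# Five nodes with TrEMBL and BLAST, plus one node that does not qualify.
registry = [Node(f"node{i}", frozenset({"TrEMBL"}), frozenset({"BLAST"}))
            for i in range(1, 6)]
registry.append(Node("node6", frozenset({"Unigene"}), frozenset({"HMMER"})))

queries = [f"Q{i}" for i in range(1, 11)]
plan = dispatch(queries, find_nodes(registry, "TrEMBL", "BLAST"))
print(plan)
```

With ten queries and five eligible nodes, each node receives two queries, as in the figure.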

Typical Database Access in Bioinformatics

[Figure: two access styles between Site A and Site B. Web Services: App1, App2. Mirroring: App1’, App2’]

Database Federation and Computational Pipeline

[Figure: a Computational Pipeline of App1, App2, App3, App4 and App5 over Database Federation + Web Services, spanning the layers Genome, Transcriptome, Proteome, Metabolome and Phenome]

VO on Grid

Virtual Organization on Grid

VO provides the boundary of knowledge sharing over geometrical and organizational limitations.

[Figure labels: Project, Project, A, B, C, D]


BioGrid Schematic

• Grid-aware client software

• Data and software resource directories

• Grid of processing computers


Open Grid Service Architecture

Future Grid Challenges

• Need ‘power station’ on the Grid
  – Buy (obtain) resources as required
• Need to understand how applications behave
  – Balance out data transfer vs. compute shipping
• Need scalable wide-area service discovery
  – Peer to Peer or centralized servers
  – Meta-data to describe Grid Services
• Need to exploit distributed services
  – Grid Service Orchestration
  – Optimise service selection and recover from failure
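The trade-off between data transfer and compute shipping can be made concrete with a back-of-the-envelope comparison. The model below is a deliberately simple assumption for illustration: it compares only transfer times, treats compute time as equal at both sites, and ignores latency, queuing and cost. All names are hypothetical; the 50 GByte and 622 Mbits/sec figures in the example are borrowed from earlier slides.

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_sec):
    """Time to move `size_bytes` over a link, ignoring latency."""
    return size_bytes / bandwidth_bytes_per_sec


def choose_plan(data_bytes, code_bytes, result_bytes, bandwidth_bytes_per_sec):
    """Pick the cheaper option under the simplified transfer-only model.

    "ship data":    move the input data to where the program already is.
    "ship compute": move the program to the data, then bring back the result.
    """
    move_data = transfer_seconds(data_bytes, bandwidth_bytes_per_sec)
    move_code = transfer_seconds(code_bytes + result_bytes, bandwidth_bytes_per_sec)
    return "ship compute" if move_code < move_data else "ship data"


link = 622e6 / 8  # a 622 Mbits/sec link, in bytes per second
print(f"50 GBytes over the link: {transfer_seconds(50e9, link):.0f} s")
print(choose_plan(data_bytes=50e9, code_bytes=10e6, result_bytes=1e6,
                  bandwidth_bytes_per_sec=link))
```

For a sequence database tens of gigabytes in size, moving the program and a small result is far cheaper than moving the data, which is one reason to understand application behaviour before placing work.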

The GRID

The GRID is all about
• The Coordinated, Transparent, Secure and Effective Utilization of Geographically distributed heterogeneous resources (both hardware & software) for Applications

To be Successful
• The Grid has to support applications in the same way that the power utilities support the use of household appliances

The Metaphor
• Computers to act as generators of computational “power”, for applications to become computational appliances
• The software infrastructure to act as the utility responsible for managing the interaction between them

Whom Does Grid Computing Serve?

• The users and their applications
• Large complex applications which need resources beyond the traditional
  – Parallel/Distributed processing in a box
  – Put-it-together-yourself clusters
• Applications that describe multiple aspects of a system
• Applications consisting of multiple modules
• Applications with multi-source data
• Applications interfacing with measurement systems and visualization systems

Application programmers will be able to write applications that leverage TeraFlops of computation and PetaBytes of storage

Grid System – Three Point Checklist

• Coordinated sharing of resources that are not subject to centralized control

• Using standard, open, general-purpose protocols and interfaces

• To deliver nontrivial qualities of service


Applications Development On Grid

What do Application Developers Need to Think About in Grid Environments ?

• This is very similar to the requirements for an application to be able to run on many different architectures

• Need now to also think that not all processes in an application are necessarily running on the same resource or even the same architecture

• Not all processes have access to the same environment, or may be able to reach the same set of remote resources

Hook enough computers together and what do you get?

A new kind of utility that offers supercomputer processing on tap.

Access Grid

• High-end group work and collaboration technology
• Grid services being used for discovery, configuration, authentication
• O(50) systems deployed worldwide
• Basis for SC’2001 SC Global event in November 2001
  – www.scglobal.org

[Figure labels: Ambient mic (tabletop), Presenter mic, Presenter camera, Audience camera]

www.accessgrid.org

Building Bridges for the Future of Science

Grid Computing is a paradigm that will have considerable impact on how computing resources will be provisioned, and Java™ technology is a primary technology that will enable it
