
BioGrid: Challenges, Problems and Opportunities

Dheeraj Bhardwaj <[email protected]>
Department of Computer Science & Engineering
Indian Institute of Technology, Delhi – 110 016, India
http://www.cse.iitd.ac.in/~dheerajb

December 2003


[Diagram: BIOLOGICAL PHENOMENON → (measurement process) → DATA → (data analysis, learning) → MODEL → (inference, conclusions) → BIOLOGICAL PHENOMENON]


Bioinformatics Vs. Biocomputing

Bioinformatics

Biocomputing

IT BT


Genome

Phenome

Biological Data

“Maze” on a Jigsaw Puzzle


Equipment for a New Quest

Data, Knowledge and Tools

High Performance Computers

Collaboration of Human Experts

The illustrations are quoted from the following sites:
www.dnr.state.wi.us/org/aw/air/ed/educatio.htm
www.mtnbrook.k12.al.us/academy/2ndgrade/mtn/map.htm


Needs of High Performance Computing

• Increase of Genome Sequence Information
• Combinatorial Increase of Search Space: Genome * Transcriptome * Proteome * ... * Phenome
• Computer Simulation and Unknown Parameter Estimation

Knowledge integration in “Omic Space”


Needs of High Performance Computing

• Impact of Genome Sequence Projects
  – Human Genome (3,000 Mbp, 2000)
  – Rapid Increase of Genome Sequence Databases
  – Strong Computation Demand for Homology Search

• Start of Structural Genomics Projects
  – Determine 10,000 folds in 5 years
  – Strong Computation Demand for Molecular Simulation


1st Issue: Homology Search

• Rapid increase of data size: doubles every year, updated daily (17 million entries, 50 gigabytes as of Oct. 2002)

[Chart: growth in size of the EMBL, GenBank and DDBJ databases; vertical axis from 0 to 14,000,000,000]

[Chart: rough estimate of homology search time for mouse cDNA (5,000 sequences) against the human genome (3,000 Mbp) on 1, 8, 32, 256 and 6,400 CPUs; time axis marked 1 hour, 1 day, 1 week, 1 month, 1 year]
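The chart's message, that search time falls as CPUs are added, can be illustrated with a small ideal-scaling calculation. This is only a sketch: the one-year single-CPU baseline is an assumption chosen to match the chart's time axis, not a figure stated on the slide, and the function name is hypothetical.

```python
# Ideal (linear) scaling of an embarrassingly parallel homology search.
# ASSUMPTION: the 1-CPU baseline of one year is hypothetical; the slide
# does not state the measured time for any CPU count.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def ideal_search_hours(baseline_hours: float, cpus: int) -> float:
    """Wall-clock hours if the work divides perfectly across `cpus`."""
    return baseline_hours / cpus


if __name__ == "__main__":
    for cpus in (1, 8, 32, 256, 6400):
        hours = ideal_search_hours(HOURS_PER_YEAR, cpus)
        print(f"{cpus:>5} CPUs: {hours:10.2f} h ({hours / 24:8.2f} days)")
```

Real searches scale less than perfectly because of database I/O and scheduling overhead, so these numbers are a lower bound under the stated assumption.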


2nd Issue: Molecular Simulation

Nanosecond-order Molecular Dynamics simulation of protein molecules with molecular weights of 100,000 – 1,000,000

• Stability Analysis
• Affinity Analysis
• Folding Simulation

Ex. Ras p21 G
• Number of residues: 189
• Molecular weight: 21 kD
• Oncogene variant: Gly12 → Val
• A 5 ns simulation takes 1,000 h on a 32 GFLOPS computer

[Figure: Ras p21 structure with GTP, Mg and Lys16 labelled]
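To put the Ras p21 figures in perspective, the arithmetic below converts "5 ns in 1,000 h on a 32 GFLOPS computer" into a total operation count and extrapolates to other machines. It is a rough sketch that assumes 32 GFLOPS is the sustained rate and that cost grows linearly with simulated time; the function names are illustrative.

```python
# Cost of the quoted Ras p21 run: 5 ns of simulated time in 1,000 hours
# on a 32 GFLOPS machine (figures from the slide).
SIM_NS = 5
WALL_HOURS = 1_000
SUSTAINED_FLOPS = 32e9  # ASSUMPTION: treated as the sustained rate


def total_operations(wall_hours: float, flops: float) -> float:
    """Floating-point operations executed in `wall_hours` at `flops`."""
    return wall_hours * 3600 * flops


def hours_for(sim_ns: float, flops: float) -> float:
    """Estimated wall-clock hours for `sim_ns` of simulation at `flops`,
    assuming cost is linear in simulated time."""
    ops_per_ns = total_operations(WALL_HOURS, SUSTAINED_FLOPS) / SIM_NS
    return sim_ns * ops_per_ns / flops / 3600


if __name__ == "__main__":
    print(f"Total operations for 5 ns: {total_operations(WALL_HOURS, SUSTAINED_FLOPS):.3e}")
    print(f"1 microsecond at 1 TFLOPS: {hours_for(1000, 1e12):.0f} h")
```

The 5 ns run corresponds to about 1.15e17 operations, which is why longer folding simulations motivate both special-purpose hardware and shared high performance resources.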


Needs of Resource Sharing

• Biological Databases (Unigene, TrEMBL,...)

• Bioinformatics Tools (BLAST, HMMER, ...)

• Programming Language (Bioperl, Biojava, ...)


Needs of Human Collaboration


Grid for Bioinformatics

• Effective for “Embarrassingly Parallel Computation”: Homology Search, Motif Search, Unknown Parameter Estimation for Cellular Models, etc.
• “Distributed Resource Sharing” among organizations: Web Services, Workflow and Computational Pipeline, Autonomous Database Update, etc.
• “Field” for Human Collaboration: Group Work for Genome Annotation, Whole Cell Simulation, Collaboration between Biologists and Computer Scientists, etc.
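The first point can be sketched in a few lines: each query sequence is scored against the database independently, so the work splits across workers with no communication between tasks. The k-mer overlap score below is a toy stand-in for a real tool such as BLAST, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq: str, k: int = 3) -> set:
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query: str, database: dict) -> tuple:
    """Return (database id, shared k-mer count) of the best match.
    Toy similarity score, NOT a real alignment algorithm."""
    q = kmers(query)
    scores = {name: len(q & kmers(seq)) for name, seq in database.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


def search_all(queries, database, workers: int = 4):
    # Each query is independent, so tasks never need to talk to each other.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: best_hit(q, database), queries))


if __name__ == "__main__":
    db = {"a": "ACGTACGT", "b": "TTTTGGGG"}
    print(search_all(["ACGTAC", "TTTGGG"], db))
```

A thread pool keeps the sketch self-contained; on a cluster or Grid the same per-query function would be dispatched to separate processes or machines by a scheduler.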


Summary of Bioinformatics Trend

• Rapid increase of genomic database size causes severe overhead for database services

• Demand for Molecular Dynamics simulation requires high performance computers (including special-purpose computers)

Both trends call for a new bioinformatics platform for sharing databases and high performance computers


Strategic Technology Domain

Information Integration from Genome to Phenome

Modeling and Simulation from Molecule to Cell

High Performance Computing (PC-cluster, SMP, Vector)

Grid


Evolution of the Scientific Process

• Pre-electronic
  – Theorize &/or experiment, alone or in small teams; publish paper

• Post-electronic
  – Construct and mine very large databases of observational or simulation data
  – Develop computer simulations & analyses
  – Exchange information quasi-instantaneously within large, distributed, multidisciplinary teams


[Chart: Compute Requirements, 1970 to 2005; vertical axis: Algorithmic Complexity/Data Volume; eras: Mainframes, Vector Processors, Supercomputers, MPP/SMP, Scalable Parallel Systems, Distributed & Grid]

Representative systems and cost-performance, in chart order:

• IBM 360/370, CDC 1604/600, UNIVAC 1100: ~3 MFLOPS per $ million
• DEC VAX/FPS, IBM, CDC, UNIVAC: ~5 MFLOPS per $ million
• CRAY 1, CDC 203: ~20 MFLOPS per $ million
• CRAY XMP, CONVEX C1, ALLIANT: ~60 MFLOPS per $ million
• CRAY YMP, CONVEX C2: ~200-400 MFLOPS per $ million
• SGI Power Ch, IBM SP2, CM5: ~2-3 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP: ~5-8 GFLOPS per $ million
• CRAY T3E, SGI Origin, IBM SP, SUN ES 10000: ~20 GFLOPS per $ million
• LINUX CLUSTERS: ~100 GFLOPS per $ million
• COMPUTATIONAL GRID: ~1000 GFLOPS per $ million

• Systems getting larger by 2x, 3x, 4x per year!
  – Increasing parallelism: add more and more processors

• New Kind of Parallelism: GRID
  – Harness the power of Computing Resources which are growing


HPC Applications Issues

• Architectures and Programming Models
  – Distributed Memory Systems (MPP, Clusters): Message Passing
  – Shared Memory Systems (SMP): Shared Memory Programming
  – Specialized Architectures: Vector Processing, Data Parallel Programming
  – The Computational Grid: Grid Programming

• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation

• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks


Future of Scientific Computing

• Require Large Scale Simulations, beyond reach of any machine

• Require Large Geo-distributed Cross Disciplinary Collaborations

• Systems getting larger by 2x, 3x, 4x per year!
  – Increasing parallelism: add more and more processors

• New Kind of Parallelism: GRID
  – Harness the power of Computing Resources which are growing


What Do We Want to Achieve?

• Develop High Performance Computing (HPC) Applications which are
  – Portable (Laptop, Supercomputers, Grid)
  – Future Proof
  – Grid Ready

• Develop HPC Infrastructure (Parallel & Grid Systems) which is
  – User Friendly
  – Based on Open Source
  – Efficient in Problem Solving
  – Able to Achieve High Performance
  – Able to Handle Large Data Volumes


Parallel Computer and Grid

A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”.

A Computational Grid is an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks and other resources.


A Comparison

SERIAL

Fetch/Store

Compute

PARALLEL

Fetch/Store

Compute/ communicate

Cooperative game

GRID

Fetch/Store

Discovery of Resources

Interaction with remote application

Authentication / Authorization

Security

Compute/Communicate

Etc


Serial and Parallel Algorithms - Evaluation

• Serial Algorithm

– Execution time as a function of size of input

• Parallel Algorithm

– Execution time as a function of input size, parallel architecture and number of processors used

Parallel System

A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented.
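As an illustration of execution time depending on input size, architecture and processor count, the sketch below uses a textbook cost model for summing n numbers on p processors. The model and its parameters (unit-cost additions, a per-step communication cost `t_comm`, p a power of two that divides n) are assumptions made for this example, not something given in the slides.

```python
import math


def serial_time(n: int) -> float:
    """Serial sum of n numbers: n - 1 additions of unit cost."""
    return float(n - 1)


def parallel_time(n: int, p: int, t_comm: float = 1.0) -> float:
    """Each processor adds n/p values locally, then partial sums are
    combined in a tree of log2(p) steps; each step costs one addition
    plus t_comm (the architecture-dependent communication cost)."""
    local = n / p - 1
    combine = math.log2(p) * (1 + t_comm)
    return local + combine


if __name__ == "__main__":
    n = 1024
    for p in (1, 2, 8, 64):
        tp = parallel_time(n, p)
        print(f"p={p:>3}: T_p={tp:7.1f}  speedup={serial_time(n) / tp:5.2f}")
```

Raising `t_comm` models a slower interconnect, which shows how the same algorithm behaves differently on different parallel architectures.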


What is the Grid

• “Grid Computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high performance orientation…we review the “Grid problem”, which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources- what we refer to as virtual organizations.”

From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke


Distributed Computing vs. GRID

• Grid is an evolution of distributed computing
  – Dynamic
  – Geographically independent
  – Built around standards
  – Internet backbone

• Distributed computing is an “older term”
  – Typically built around proprietary software and network
  – Tightly coupled systems/organizations


Web vs. GRID

• Web
  – Uniform naming and access to documents

• Grid
  – Uniform, high performance access to computational resources

[Diagram: documents reached via http://; Grid resources include Colleges/R&D Labs, Software Catalogs and Sensor nets]


Is the World Wide Web a Grid ?

• Seamless naming? Yes
• Uniform security and Authentication? No
• Information Service? Yes or No
• Co-Scheduling? No
• Accounting & Authorization? No
• User Services? No
• Event Services? No
• Is the Browser a Global Shell? No


What does the World Wide Web bring to the Grid ?

• Uniform Naming
• A seamless, scalable information service
• A powerful new meta-data language: XML
  – XML will be the standard language for describing information in the grid
  – SOAP (Simple Object Access Protocol)
    • Uses XML for encoding, HTTP for protocol
  – SOAP may become a standard RPC mechanism for Grid services
• Portal Ideas
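A SOAP call is an XML envelope that is normally sent as the body of an HTTP POST. The sketch below builds a SOAP 1.1 envelope with the Python standard library; the service namespace `urn:example:biogrid`, the method name and the parameter are hypothetical, and nothing is actually sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:biogrid"  # HYPOTHETICAL service namespace


def build_request(method: str, params: dict) -> bytes:
    """Encode a remote procedure call as a SOAP 1.1 envelope (XML bytes)."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def http_headers(payload: bytes, action: str) -> dict:
    """Headers for carrying the envelope in an HTTP POST (SOAP 1.1 binding)."""
    return {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f'"{action}"',
        "Content-Length": str(len(payload)),
    }


if __name__ == "__main__":
    payload = build_request("runSearch", {"sequence": "ACGT"})
    print(payload.decode("utf-8"))
    print(http_headers(payload, "urn:example:biogrid#runSearch"))
```

The split between `build_request` and `http_headers` mirrors the two roles involved: XML handles the encoding of the call, HTTP handles its transport.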


The Ultimate Goal

• In future I will not know or care where my application will be executed as I will acquire and pay to use these resources as I need them


Why Grids?

• Large-scale science and engineering are done through the interaction of people, heterogeneous computing resources, information systems, and instruments, all of which are geographically and organizationally dispersed.

• The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale science and Engineering.


Why Now ?

• Moore’s law improvements in computing produce highly functional end systems

• The Internet and burgeoning wired and wireless networks provide universal connectivity

• Changing modes of working and problem solving emphasize teamwork, computation

• Network exponentials produce dramatic changes in geometry and geography


Network Exponentials

• Network vs. computer performance
  – Computer speed doubles every 18 months
  – Network speed doubles every 9 months
  – Difference = order of magnitude per 5 years

• 1986 to 2000
  – Computers: x 500
  – Networks: x 340,000

• 2001 to 2010
  – Computers: x 60
  – Networks: x 4000

Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.
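The "order of magnitude per 5 years" claim follows directly from the two doubling times. The short calculation below checks it; the function name is illustrative and the only inputs are the 18-month and 9-month doubling periods quoted above.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` given a doubling period."""
    return 2 ** (months / doubling_months)


FIVE_YEARS = 60  # months

computers = growth_factor(FIVE_YEARS, 18)  # roughly 10x
networks = growth_factor(FIVE_YEARS, 9)    # roughly 100x
ratio = networks / computers               # roughly 10x: one order of magnitude

if __name__ == "__main__":
    print(f"Computers over 5 years: x{computers:.1f}")
    print(f"Networks over 5 years:  x{networks:.1f}")
    print(f"Network gain relative to computers: x{ratio:.1f}")
```

Because the network doubling period is half the computer doubling period, the network factor is always the square of the computer factor over the same interval, so the gap itself grows exponentially.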


Why Grid ?

Motivation: When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances. (Gilder Technology Report, June 2002)

We are seeing a Fundamental Change in Scientific Applications

• They have become multidisciplinary

• They require an incredible mix of various technologies and expertise

“Many problems require tightly coupled computers, with low latencies and high communication bandwidths; Grid computing may well increase … demand for such systems by making access easier” (Foster, Kesselman, Tuecke, The Anatomy of the Grid)


Convergence between e-Science and e-Business

• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

• A biologist combines a range of diverse and distributed resources (databases, tools, instruments) to answer complex questions

• 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data

• Civil engineers collaborate to design, execute, & analyze shake table experiments

• An enterprise configures internal & external resources to support eBusiness workload

From Steve Tuecke 12 Oct’01


Convergence between e-Science and e-Business

• Climate scientists visualize, annotate, & analyze terabyte simulation datasets

• An emergency response team couples real time data, weather model, population data

• A multidisciplinary analysis in aerospace couples code and data in four companies

• A home user invokes architectural design functions at an application service provider

• An insurance company mines data from partner hospitals for fraud detection


Important Grid Applications

• Data-intensive

• Distributed computing (metacomputing)

• Collaborative

• Remote access to, and computer enhancement of, experimental facilities


An Example Virtual Organization: CERN’s Large Hadron Collider

1800 Physicists, 150 Institutes, 32 Countries

100 PB of data by 2010; 50,000 CPUs?

www.griphyn.org www.ppdg.org www.eu-datagrid.org


Grid Communities & Applications: Data Grids for High Energy Physics

[Diagram: tiered data grid for the LHC]

• Tier 0: Online System, Offline Processor Farm (~20 TIPS), CERN Computer Centre
• Tier 1: FermiLab (~4 TIPS), France Regional Centre, Italy Regional Centre, Germany Regional Centre
• Tier 2: Caltech (~1 TIPS) and further Tier2 Centres (~1 TIPS each)
• Institutes (~0.25 TIPS each), with a physics data cache
• Tier 4: Physicist workstations

Link rates shown in the diagram: ~PBytes/sec, ~100 MBytes/sec, ~622 Mbits/sec (or Air Freight, deprecated), ~1 MBytes/sec

• There is a “bunch crossing” every 25 nsecs.
• There are 100 “triggers” per second.
• Each triggered event is ~1 MByte in size.
• Physicists work on analysis “channels”.
• Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.

1 TIPS is approximately 25,000 SpecInt95 equivalents

www.griphyn.org www.ppdg.net www.eu-datagrid.org
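The ~100 MBytes/sec figure follows from the trigger rate and event size quoted on this slide. The arithmetic below reproduces it and extends it to a daily volume; the function names are illustrative, and decimal units (1 TB = 1,000,000 MB) are an assumption.

```python
BUNCH_CROSSING_NS = 25      # one bunch crossing every 25 ns (from the slide)
TRIGGERS_PER_SECOND = 100   # accepted events per second (from the slide)
EVENT_SIZE_MB = 1.0         # ~1 MByte per triggered event (from the slide)


def crossings_per_second(spacing_ns: float) -> float:
    """Bunch crossings per second for a given spacing in nanoseconds."""
    return 1e9 / spacing_ns


def stream_rate_mb_per_s(triggers: float, event_mb: float) -> float:
    """Sustained data rate leaving the trigger system."""
    return triggers * event_mb


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Data accumulated in 24 hours, in decimal terabytes."""
    return rate_mb_per_s * 86_400 / 1e6


if __name__ == "__main__":
    rate = stream_rate_mb_per_s(TRIGGERS_PER_SECOND, EVENT_SIZE_MB)
    kept = TRIGGERS_PER_SECOND / crossings_per_second(BUNCH_CROSSING_NS)
    print(f"Recorded stream: {rate:.0f} MB/s, {daily_volume_tb(rate):.2f} TB/day")
    print(f"Fraction of bunch crossings kept: {kept:.1e}")
```

Only a tiny fraction of crossings is recorded, yet the recorded stream alone accumulates terabytes per day, which is what drives the tiered distribution of data to regional centres.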


A Brain is a Lot of Data! (Mark Ellisman, UCSD)

And comparisons must be made among many

We need to get to one micron to know the location of every cell. We’re just now starting to get to 10 microns. Grids will help get us there and further.


Biomedical Informatics Research Network (BIRN)

• Evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer’s, etc.).

• Today
  – One lab, small patient base
  – 4 TB collection

• Tomorrow
  – 10s of collaborating labs
  – Larger population sample
  – 400 TB data collection: more brains, higher resolution
  – Multiple scale data integration and analysis


The Grid: A Brief History

• Early 90s
  – Gigabit testbeds, metacomputing

• Mid to late 90s
  – Early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments

• 2002
  – Dozens of application communities & projects
  – Major infrastructure deployments
  – Significant technology base (esp. Globus Toolkit™)
  – Growing industrial interest
  – Global Grid Forum: ~500 people, 20+ countries


Today’s Grid

• A single system interface

• Transparent wide-area access to large data banks

• Transparent wide-area access to applications on heterogeneous platforms

• Transparent wide-area access to processing resources

• Security, certification, single sign-on authentication
  – Grid Security Infrastructure

• Data access, Transfer & Replication
  – GridFTP, Giggle

• Computational resource discovery, allocation and Process creation
  – GRAM, Unicore, Condor-G


Grid Evolution

• First Generation Grid
  – Computationally intensive, file access/transfer
  – Bag of various heterogeneous protocols & toolkits
  – Recognizes internet, ignores web
  – Academic Team

• Second Generation Grid
  – Data intensive, knowledge intensive
  – Service based architecture
  – Recognizes Web and Web services
  – Global Grid Forum
  – Industry participation


Challenging Technical Requirements

• Dynamic formation and management of virtual organizations

• Online negotiation of access to services: who, what, why, when, how

• Establishment of applications and systems able to deliver multiple qualities of service

• Autonomic management of infrastructure elements

Open Grid Services Architecture
http://www.globus.org/ogsa


Elements of the Problem

• Resource sharing
  – Computers, storage, sensors, networks, …
  – Heterogeneity of device, mechanism, policy
  – Sharing conditional: negotiation, payment, …

• Coordinated problem solving
  – Integration of distributed resources
  – Compound quality of service requirements

• Dynamic, multi-institutional virtual orgs
  – Dynamic overlays on classic org structures
  – Map to underlying control mechanisms


The Grid

• Diverse Resources
  – Dynamic
  – Unreliable
  – Shared

• Administrative Issues
  – Security
  – Multiple organizations
  – Coordinated problem solving


Grid Technologies: Resource Sharing

Mechanisms That …

• Address security and policy concerns of resource owners and users

• Are flexible enough to deal with many resource types and sharing modalities

• Scale to large number of resources, many participants, many program components

• Operate efficiently when dealing with large amounts of data & computation


Aspects of the Problem

1) Need for interoperability when different groups want to share resources
  – Diverse components, policies, mechanisms
  – E.g., standard notions of identity, means of communication, resource descriptions

2) Need for shared infrastructure services to avoid repeated development, installation
  – E.g., one port/service/protocol for remote access to computing, not one per tool/application
  – E.g., Certificate Authorities: expensive to run

• A common need for protocols & services


Hence, a Protocol-Oriented View of Grid Architecture, that Emphasizes …

• Development of Grid protocols & services
  – Protocol-mediated access to remote resources
  – New services: e.g., resource brokering
  – “On the Grid” = speak Intergrid protocols
  – Mostly (extensions to) existing protocols

• Development of Grid APIs & SDKs
  – Interfaces to Grid protocols & services
  – Facilitate application development by supplying higher-level abstractions


The Hourglass Model

• Focus on architecture issues
  – Propose set of core services as basic infrastructure
  – Use to construct high-level, domain-specific solutions

• Design principles
  – Keep participation cost low
  – Enable local control
  – Support for adaptation
  – “IP hourglass” model

[Diagram: hourglass with Applications and Diverse global services at the top, Core services at the neck, and Local OS at the base]


Layered Grid Architecture (By Analogy to Internet Architecture)

Grid layers, top to bottom:

• Application
• Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
• Resource: “Sharing single resources”: negotiating access, controlling use
• Connectivity: “Talking to things”: communication (Internet protocols) & security
• Fabric: “Controlling things locally”: access to, & control of, resources

Internet Protocol Architecture layers, top to bottom: Application, Transport, Internet, Link


Globus Toolkit™

• A software toolkit addressing key technical problems in the development of Grid-enabled tools, services, and applications
  – Offer a modular set of orthogonal services
  – Enable incremental development of grid-enabled tools and applications
  – Implement standard Grid protocols and APIs
  – Available under liberal open source license
  – Large community of developers & users
  – Commercial support


Grid Architecture & Globus Toolkit

• Application
• Collective: Grid Information Index service, Replica management, Certificate repository (MyProxy), Co-allocation library
• Resource: Grid Resource Information Service, Grid Resource Access & Management, GridFTP
• Connectivity: Internet protocol, Globus Security Infrastructure
• Fabric: Resources to Share, Local OS

[Diagram labels: Core Grid Services; Building Grid]


Key Protocols

• The Globus Toolkit™ centers around four key protocols
  – Connectivity layer:
    • Security: Grid Security Infrastructure (GSI)
  – Resource layer:
    • Resource Management: Grid Resource Allocation Management (GRAM)
    • Information Services: Grid Resource Information Protocol (GRIP) and Index Information Protocol (GIIP)
    • Data Transfer: Grid File Transfer Protocol (GridFTP)

• Also key collective layer protocols
  – Info Services, Replica Management, etc.


Why Is Grid Security Hard?

• Resources being used may be extremely valuable & the problems being solved extremely sensitive

• Resources are often located in distinct administrative domains
  – Each resource may have its own policies & procedures

• The set of resources used by a single computation may be large, dynamic, and/or unpredictable
  – Not just client/server

• It must be broadly available & applicable
  – Standard, well-tested, well-understood protocols
  – Integration with wide variety of tools


Grid Security Requirements

User View

1) Easy to use

2) Single sign-on

3) Run applications: ftp, ssh, MPI, Condor, Web, …

4) User-based trust model

5) Proxies/agents (delegation)

Resource Owner View

1) Specify local access control

2) Auditing, accounting, etc.

3) Integration with local systems: Kerberos, AFS, license manager

4) Protection from compromised resources

Developer View

• API/SDK with authentication, flexible message protection, flexible communication, delegation, ...

• Direct calls to various security functions (e.g. GSS-API)

• Or security integrated into higher-level SDKs, e.g. GlobusIO, Condor-G, MPICH-G2, HDF5, etc.


Convergence on Service Oriented Architecture

• Development of service oriented grid middleware using different technologies (such as Java/Jini, web services) to instantiate the service architecture.

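Figure: A typical SOA

– Service provider → service locator: Register Service

– Service requester → service locator: Discover Service / Lookup Service

– Service locator → service requester: Service Matches

– Service requester ↔ service provider: Interaction with Service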
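The register / discover / interact cycle of a typical SOA can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Jini or any Grid middleware API: the `ServiceLocator` class, the `blast_service` function, and the `"sequence-search"` service type are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServiceLocator:
    """Hypothetical in-memory stand-in for the service locator (lookup service)."""
    _registry: Dict[str, List[Callable[[str], str]]] = field(default_factory=dict)

    def register(self, service_type: str, endpoint: Callable[[str], str]) -> None:
        # Provider side: "Register Service"
        self._registry.setdefault(service_type, []).append(endpoint)

    def discover(self, service_type: str) -> List[Callable[[str], str]]:
        # Requester side: "Discover Service" returns the "Service Matches"
        return list(self._registry.get(service_type, []))


def blast_service(query: str) -> str:
    # Hypothetical provider; a real one would run a sequence search remotely.
    return f"alignment results for {query}"


locator = ServiceLocator()
locator.register("sequence-search", blast_service)   # provider registers
matches = locator.discover("sequence-search")         # requester discovers
result = matches[0]("Q1")                             # requester interacts with the service
print(result)
```

The requester never hard-codes the provider; it only knows the service type, which is the loose coupling the slide describes.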


The Future: Web Services

• Web services are self-describing applications that can find and interact with other web applications to complete complex tasks over the internet.

• Unlike the hard-wired applications of the client-server computing days, web services are loosely coupled software components that can find and interact with other components on the internet without manual human intervention


The Future: Web Services

• Increasingly popular standards-based frameworks for accessing network applications

– W3C standardization, Microsoft, IBM, Sun, others

• WSDL: Web Services Description Language

– Interface definition language for web services

• SOAP: Simple Object Access Protocol

– XML-based RPC protocol, common WSDL target

• WS-Inspection

– Conventions for locating service descriptions

• UDDI: Universal Description, Discovery, & Integration

– Discovery for web services
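To make "XML-based RPC protocol" concrete, here is a minimal sketch that builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the application namespace `urn:example:biogrid`, the operation name `runBlast`, and its parameters are assumptions made up for illustration and do not come from any real service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:biogrid"                         # hypothetical application namespace


def build_request(operation: str, params: dict) -> str:
    """Wrap an RPC-style call in a SOAP Envelope/Body and return it as a string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameters, for illustration only.
request_xml = build_request("runBlast", {"query": "Q1", "database": "genbank"})
print(request_xml)
```

In practice a WSDL document would describe the operation and its message types, and a SOAP toolkit would generate this envelope rather than hand-built code.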


Open Grid Service Architecture (OGSA)

• Utilize standard Web services infrastructures

• Building on current Globus Toolkit:

– Grid service: semantics for service interactions

– Management of transient instances (& state)

– Factory, registry, discovery, other services

– Reliable and secure transport

• Multiple hosting targets: J2EE, .NET, “C”, …

• Service-oriented architecture enables resource virtualization

• Delivery via open source Globus Toolkit 3.0

– Leverage GT experience, code, mindshare


BioGrid approach

• Standardize interfaces

• Provide global directory of objects

• Distribute computation transparently

• Distribute data transparently

• Provide security on all object storage, transfer and communications

• Provide accountability, credibility and identification

• Bundle everything in a plug-and-play package


Typical Computing in Bioinformatics

Figure: a job of 1000 tasks (Task 1, Task 2, …, Task 999, Task 1000) is divided into four groups (Tasks 1-250, 251-500, 501-750, 751-1000), each paired with a DB and software.

• A great many similar tasks, independent of each other
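The division of a job into contiguous task ranges can be sketched as below. The function name `split_tasks` is hypothetical, and an even split by count is an assumption; a real dispatcher might weight the split by node speed or load.

```python
def split_tasks(num_tasks: int, num_nodes: int) -> list:
    """Split tasks numbered 1..num_tasks into contiguous (first, last) ranges."""
    base, extra = divmod(num_tasks, num_nodes)
    ranges = []
    start = 1
    for node in range(num_nodes):
        # The first `extra` nodes take one additional task when the split is uneven.
        size = base + (1 if node < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


chunks = split_tasks(1000, 4)
print(chunks)  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Because the tasks are independent, no communication between the groups is needed; only the results have to be gathered at the end.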


Bioinformatics Environment

Figure: Bioinformatics environment (OBIEnv). Components:

• OBIEnv User: local authentication; job (list of tasks); results

• Job Dispatcher (obidispatch): node search; set of nodes; divided jobs

• Environment Information Server: PostgreSQL; P2P server; list of OBIEnv users (transferred and updated by the obiupdate command)

• Node (two shown): Globus Toolkit; DB, SW, HW; Environment Scanner (obiregist) reporting environmental information; temporary work area for job execution

• Unauthorized local users


Parallel Job Execution

Figure: parallel job execution

• Job (task list): blast Q1 genbank, blast Q2 genbank, …, blast Q10 genbank

• Job Dispatcher (obidispatch) asks the Environment Information Server: “Nodes with TrEMBL and BLAST?” and receives a set of nodes

• Five nodes holding TrEMBL receive Q1,Q2; Q3,Q4; Q5,Q6; Q7,Q8; Q9,Q10

• Tasks are independent of each other
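The dispatch pattern on this slide (ask which nodes have the required database and software, then hand each node a share of the queries) can be sketched as follows. This is not the obidispatch implementation: the `NODES` table, the node names, and the `run_on_node` placeholder are assumptions, and local threads stand in for remote execution.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical environment information: what each node has installed.
NODES = {
    "node1": {"TrEMBL", "BLAST"},
    "node2": {"TrEMBL", "BLAST"},
    "node3": {"TrEMBL", "BLAST"},
    "node4": {"TrEMBL", "BLAST"},
    "node5": {"TrEMBL", "BLAST"},
    "node6": {"BLAST"},  # lacks TrEMBL, so it must not be selected
}


def find_nodes(required: set) -> list:
    """Return the nodes whose installed set covers every requirement."""
    return sorted(name for name, installed in NODES.items() if required <= installed)


def assign(tasks: list, nodes: list) -> dict:
    """Give each node a contiguous slice of the task list."""
    per_node = -(-len(tasks) // len(nodes))  # ceiling division
    return {node: tasks[i * per_node:(i + 1) * per_node] for i, node in enumerate(nodes)}


def run_on_node(node: str, tasks: list) -> list:
    # Placeholder for remote execution on the node.
    return [f"{node}: {task} done" for task in tasks]


tasks = [f"Q{i}" for i in range(1, 11)]
nodes = find_nodes({"TrEMBL", "BLAST"})
plan = assign(tasks, nodes)

# Tasks are independent, so the per-node batches can run concurrently.
with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    futures = [pool.submit(run_on_node, node, batch) for node, batch in plan.items()]
    results = [line for future in futures for line in future.result()]

print(plan)
```

With the table above, the plan pairs Q1,Q2 with the first node through Q9,Q10 with the fifth, matching the assignment shown in the figure.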


Typical Database Access in Bioinformatics

Figure: two ways of accessing databases between Site A and Site B

• Web Services: App1 (Site A), App2 (Site B)

• Mirroring: App1’ (Site A), App2’ (Site B)


Database Federation and Computational Pipeline

Figure: database federation and computational pipeline

• Data layers: Genome, Transcriptome, Proteome, Metabolome, Phenome

• Database Federation + Web Services

• Computational Pipeline: App1, App2, App3, App4, App5


VO on Grid

Virtual Organization on Grid

A VO defines the boundary of knowledge sharing across geographical and organizational limits.

Figure labels: Project, Project; A, B, C, D


BioGrid Schematic

• Grid-aware client software

• Data and software resource directories

• Grid of processing computers


Open Grid Service Architecture


Future Grid Challenges

• Need a ‘power station’ on the Grid

– Buy (obtain) resources as required

• Need to understand how applications behave

– Balance data transfer vs. compute shipping

• Need scalable wide-area service discovery

– Peer-to-peer or centralized servers

– Metadata to describe Grid services

• Need to exploit distributed services

– Grid service orchestration

– Optimise service selection and recover from failure
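The "data transfer vs. compute shipping" trade-off can be illustrated with a deliberately crude cost model. The sketch below only compares how long it takes to move the dataset against how long it takes to move the program; the function names and all numbers are assumptions, and it ignores differences in compute capacity, queueing, and licensing that a real scheduler would have to weigh.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float,
                     latency_s: float = 0.0) -> float:
    """Estimated time to move `size_bytes` over a link (latency plus size/bandwidth)."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def placement(data_bytes: float, code_bytes: float,
              bandwidth_bytes_per_s: float, latency_s: float = 0.0) -> str:
    """Pick the cheaper move under this simplified, transfer-only model."""
    move_data = transfer_seconds(data_bytes, bandwidth_bytes_per_s, latency_s)
    move_code = transfer_seconds(code_bytes, bandwidth_bytes_per_s, latency_s)
    return "ship compute to data" if move_code < move_data else "ship data to compute"


# Illustrative numbers only: a 50 GB database, a 20 MB program, a 100 Mbit/s link.
decision = placement(data_bytes=50e9, code_bytes=20e6, bandwidth_bytes_per_s=12.5e6)
print(decision)  # ship compute to data
```

For large sequence databases the model tends to favor moving the computation, which matches the node-selection approach where jobs go to nodes that already hold the database.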


The GRID

The GRID is all about

• The coordinated, transparent, secure, and effective utilization of geographically distributed heterogeneous resources (both hardware and software) for applications

To be successful

• The Grid has to support applications in the same way that the power utilities support the use of household appliances

The metaphor

• Computers act as generators of computational “power”, and applications become computational appliances

• The software infrastructure acts as the utility responsible for managing the interaction between them


Whom Does Grid Computing Serve ?

• The users and their applications

• Large, complex applications that need resources beyond the traditional

– Parallel/distributed processing in a box

– Put-it-together-yourself clusters

• Applications that describe multiple aspects of a system

• Applications consisting of multiple modules

• Applications with multi-source data

• Applications interfacing with measurement systems and visualization systems

Application programmers will be able to write applications that leverage teraflops of computation and petabytes of storage


Grid System – Three Point Checklist

• Coordinates resources that are not subject to centralized control

• Uses standard, open, general-purpose protocols and interfaces

• Delivers nontrivial qualities of service


Applications Development On Grid

What do Application Developers Need to Think About in Grid Environments ?

• This is very similar to the requirements for an application to be able to run on many different architectures

• Developers must now also consider that not all processes in an application necessarily run on the same resource, or even on the same architecture

• Not all processes have access to the same environment or can reach the same set of remote resources


Hook enough computers together and what do you get?

A new kind of utility that offers supercomputer processing on tap.


Access Grid

• High-end group work and collaboration technology

• Grid services being used for discovery, configuration, authentication

• O(50) systems deployed worldwide

• Basis for the SC Global event at SC’2001 in November 2001

– www.scglobal.org

Figure: Access Grid room with ambient mic (tabletop), presenter mic, presenter camera, and audience camera (www.accessgrid.org)


Building Bridges for the Future of Science

Grid computing is a paradigm that will have considerable impact on how computing resources are provisioned, and Java™ technology is a primary technology that will enable it