AAAS: Data-Intensive Science and Grid

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

New computing platforms for data-intensive science

Growth of GenBank (1982-2005)

Broad Institute

Proteomics, Genomics, Transcriptomics, Protein sequence prediction, Phenotypic studies, Phylogeny, Sequence analysis, Protein structure prediction, Protein-protein interaction, Metabolomics, Model organism collections, Systems biology, Health epidemiology, Organisms, Disease, …

1,070 molecular biology databases (Nucleic Acids Research, Jan 2008)

(96 in Jan 2001)

Slide: Carole Goble

New problem solving methodologies

<0 1700 1950 1990

Empirical

Data

Theory

Simulation

“Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences”

– G. Djorgovski

More data does not always mean more knowledge

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

Data is enormous, distributed, noisy

Infrastructure: storage & computing; economies of scale

Aggregation: data & software; people & disciplines

Algorithms: scalable, probabilistic; errors & ambiguity

Cloud

Grid

Data

An incomplete list of process steps

Discover

Access

Integrate

Analyze

Mine

Publish

Annotate

Validate

Curate

Share

Artisanal

Industrial

Data

Analyses

Models

Experiments

Literature

SOA as an integrating framework?

We expose data and software as services …

which others discover, decide to use, …

and compose to create new functions ...

which they publish as new services.
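The expose, discover, compose, publish cycle above can be sketched in a few lines of Python. This is a minimal illustration only: `Registry`, `compose`, the service names, and the tags are hypothetical, and plain in-process functions stand in for remote services.

```python
from typing import Callable, Dict, List, Set

Service = Callable[[object], object]


class Registry:
    """Hypothetical in-memory stand-in for a service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._tags: Dict[str, Set[str]] = {}

    def publish(self, name: str, service: Service, tags: List[str]) -> None:
        """Expose a function as a named, tagged service."""
        self._services[name] = service
        self._tags[name] = set(tags)

    def discover(self, tag: str) -> List[str]:
        """Return the names of all services carrying the given tag."""
        return sorted(n for n, t in self._tags.items() if tag in t)

    def get(self, name: str) -> Service:
        return self._services[name]


def compose(*steps: Service) -> Service:
    """Chain services so that each step consumes the previous step's output."""
    def pipeline(value: object) -> object:
        for step in steps:
            value = step(value)
        return value
    return pipeline


registry = Registry()

# Expose data and software as services (toy functions, not real endpoints).
registry.publish("sequence-data", lambda _: ["ACGT", "acgg", "TTGA"], tags=["data"])
registry.publish("uppercase", lambda seqs: [s.upper() for s in seqs], tags=["analysis"])
registry.publish(
    "gc-content",
    lambda seqs: [sum(c in "GC" for c in s) / len(s) for s in seqs],
    tags=["analysis"],
)

# Others discover the services and decide which to use ...
analysis_names = registry.discover("analysis")

# ... compose them into a new function ...
gc_pipeline = compose(
    registry.get("sequence-data"),
    registry.get("uppercase"),
    registry.get("gc-content"),
)

# ... and publish the composite as a new service.
registry.publish("gc-of-sequences", gc_pipeline, tags=["analysis", "composite"])
result = registry.get("gc-of-sequences")(None)
print(result)
```

The composite is registered through the same `publish` call as its parts, so a later user can discover and reuse it without knowing it was built from other services.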

Technical …
• Complexity
• Semantics
• Distribution
• Scale

… and socio-technical challenges
• Incentives
• Policy, trust
• Reproducibility
• Life cycle

“Service-oriented science”, Science, 2005

Grid technology

NAE Grand Challenges

The future of multi-site data integration: An example

fMRI

Are positive symptom schizophrenics associated with more severe superior temporal gyrus dysfunction?

Receptor Density

ERP

Web: PubMed, Expasy, Brain Map, etc.

Structure

Clinical

Portal

[Chart: values by Treatment Group (ARIP - 20MG, ARIP - 30MG, RISP - 06MG, PLACEBO)]

caBIG: sharing of infrastructure, applications, and data.

Aggregation in cancer biology

Globus

As of Feb 16, 2009:

123 participants

104 services (65 data, 39 analytical)

Microarray clustering in caBIG

1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub

2. Normalize microarray data using a GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService

3. Perform hierarchical clustering using a geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
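The shape of this three-step workflow (retrieve, normalize, cluster) can be sketched locally in Python. This is an illustrative stand-in only: it does not call the caGrid services listed above, the gene names and expression values are invented toy data, and z-score normalization with single-linkage clustering are assumptions chosen for brevity, not necessarily what the GenePattern and geWorkbench services compute.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def retrieve_microarray_data():
    """Step 1 stand-in: return toy expression profiles keyed by gene name."""
    return {
        "geneA": [1.0, 2.0, 3.0],
        "geneB": [2.0, 4.0, 6.0],
        "geneC": [3.0, 2.0, 1.0],
        "geneD": [9.0, 6.0, 3.0],
    }


def normalize(data):
    """Step 2 stand-in: z-score each gene's profile (assumed normalization)."""
    out = {}
    for gene, values in data.items():
        mu, sigma = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return out


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(data, c1, c2):
    """Distance between two clusters: the closest pair of members."""
    return min(euclidean(data[g], data[h]) for g in c1 for h in c2)


def hierarchical_clusters(data, n_clusters):
    """Step 3 stand-in: agglomerative clustering, merging until n_clusters remain."""
    clusters = [{gene} for gene in data]
    while len(clusters) > n_clusters:
        # Find the two closest clusters; combinations() guarantees i < j.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: single_linkage(data, clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Chain the three steps, each consuming the previous step's output.
clusters = hierarchical_clusters(normalize(retrieve_microarray_data()), n_clusters=2)
print(clusters)
```

After normalization the rising profiles (geneA, geneB) and the falling profiles (geneC, geneD) coincide, so the two clusters separate them by trend rather than by magnitude.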

Workflow in/output

caGrid services

“Shim” services

others

Wei Tan (Taverna workflow)

Children’s Oncology Group clinical imaging trials (Erberich)

Wide-area medical interface service

Converts local medical workflow actions into wide-area operations
• Image workflow, EHR, …

Transparently manages federation of
• Security
• Data replication and recovery
• Data discovery

Enterprise/Grid Interface Service

DICOM Protocols

Grid Protocols(Web services)

DICOM

XDS

HL7

Vendor Specific

Wide Area Service Actor

Plug-in Adapters
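One way to read the plug-in adapter layer above is as a dispatch table: each adapter translates a protocol-specific local action into a protocol-neutral wide-area operation. The sketch below is a guess at that structure, not the actual service design. `WideAreaOperation`, `InterfaceService`, the operation kinds, and the message keys (`study_uid`, `patient_id`) are all hypothetical, and no real DICOM or HL7 parsing is performed.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class WideAreaOperation:
    """Hypothetical protocol-neutral operation handed to the grid side."""
    kind: str
    subject: str


# A plug-in adapter maps one protocol's local message to a wide-area operation.
Adapter = Callable[[dict], WideAreaOperation]


def dicom_adapter(msg: dict) -> WideAreaOperation:
    # Toy mapping: a local image store request becomes a wide-area replication.
    return WideAreaOperation("replicate", msg["study_uid"])


def hl7_adapter(msg: dict) -> WideAreaOperation:
    # Toy mapping: a local record query becomes a wide-area discovery.
    return WideAreaOperation("discover", msg["patient_id"])


class InterfaceService:
    """Routes local actions to the adapter registered for their protocol."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, protocol: str, adapter: Adapter) -> None:
        self._adapters[protocol] = adapter

    def handle(self, protocol: str, msg: dict) -> WideAreaOperation:
        if protocol not in self._adapters:
            raise ValueError(f"no adapter registered for {protocol}")
        return self._adapters[protocol](msg)


service = InterfaceService()
service.register("DICOM", dicom_adapter)
service.register("HL7", hl7_adapter)

op = service.handle("DICOM", {"study_uid": "1.2.3"})
print(op)
```

Supporting another protocol such as XDS or a vendor-specific format would mean registering one more adapter, leaving the wide-area side unchanged.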

Main ESG Portal
• 198 TB of data at four locations
• 1,150 datasets
• 1,032,000 files
• Includes the past 6 years of joint DOE/NSF climate modeling experiments
• 8,000 registered users
• Downloads to date: 49 TB, 176,000 files

CMIP3 (IPCC AR4) ESG Portal
• 35 TB of data at one location
• 74,700 files
• Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
• Data from 13 countries, representing 25 models
• 1,900 registered projects
• Downloads to date: 387 TB, 1,300,000 files, 500 GB/day (average)
• 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data

Earth System Grid

ESG usage: over 500 sites worldwide

ESG monthly download volumes

Globus

www.earthsystemgrid.org

Understanding interactions between human and natural systems

IPCC Emissions scenarios

Numerical Simulations

IPCC 4th Assessment

2007

IPCC process: Bill Collins, LBNL

Mitigation

Adaptation

A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)

Dynamics, foresight, uncertainty, resolution, …

Agriculture, transport, taxation, …

Data (global, local, …)

(Super)computers

CIM-EARTH Framework

Community process

Open code, data

www.cim-earth.org

Alleviating Poverty in Thailand: Modeling Entrepreneurship

Consider only wealth, access to capital

Consider also distance to 6 major cities

Rob Townsend, Tibi Stef-Praun, Victor Zhorin

Match

High

Low

Data is enormous, distributed, noisy

Infrastructure: storage & computing; economies of scale

Aggregation: data & software; people & disciplines

Algorithms: scalable, probabilistic; errors & ambiguity

Cloud

Grid

Computation Institute

www.ci.uchicago.edu

Thank you!
