AAAS: Data-Intensive Science and Grid
DESCRIPTION
These slides were presented in a session that we organized at the American Association for the Advancement of Science (AAAS) meeting in Chicago, February 2009.
Abstract: New laboratory devices, sensor networks, high-throughput instruments, and numerical simulation systems are producing data at rates that are both without precedent and rapidly growing. The resulting increases in the size, number, and variety of data are revolutionizing scientific practice. These changes demand new computing infrastructures and tools. Until recently, most laboratories and collaborations managed their own data, operated their own computers, and used remote high-performance computers only when required. We are moving to a paradigm in which data will primarily be located and managed on remote clusters, grids, and data centers. In this symposium, we will examine the computing infrastructure designed to serve this emerging era of data-intensive computing from three perspectives: (1) that of grid computing, which enables the creation of virtual organizations that can share remote and distributed resources over the Internet; (2) that of data centers, which are transitioning to providers of integrated storage, data, compute, and collaboration services (the offering of one or more of these integrated services over the Internet is beginning to be called cloud computing); and (3) that of e-science, in which grids, Web 2.0 technologies, and new collaboration and analysis services are merging and changing the way science is conducted. Each speaker will focus on one perspective but also compare and contrast it with the others.
TRANSCRIPT
Ian Foster
Computation Institute
Argonne National Lab & University of Chicago
New computing platforms
for data-intensive science
3
4
Growth of Genbank
(1982-2005)
Broad Institute
5
Proteomics Genomics Transcriptomics Protein sequence prediction Phenotypic studies Phylogeny Sequence analysis Protein structure prediction Protein-protein interaction Metabolomics Model organism collections Systems biology Health epidemiology Organisms Disease ….
1,070 molecular biology databases (Nucleic Acids Research, Jan 2008)
(96 in Jan 2001)
Slide: Carole Goble
6
New problem solving methodologies
<0 1700 1950 1990
Empirical
Data
Theory
Simulation
“Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences”
– G. Djorgovski
7
8
More data does not always mean more knowledge
Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.
9
enormous
Data is
Infrastructure: Storage & computing; Economics of scale
Aggregation: Data & software; People & disciplines
Algorithms: Scalable, probabilistic; Errors & ambiguity
distributed
noisy
Cloud
Grid
10
Data
An incomplete list of process steps
Discover
Access
Integrate
Analyze
Mine
Publish
Annotate
Validate
Curate
Share
Artisanal
Industrial
Data
Analyses
Models
Experiments
Literature
11
SOA as an integrating framework?
We expose data and software as services …
which others discover, decide to use, …
and compose to create new functions ...
which they publish as new services.
Technical …
• Complexity
• Semantics
• Distribution
• Scale
and socio-technical challenges
• Incentives
• Policy, trust
• Reproducibility
• Life cycle
“Service-oriented science”, Science, 2005
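The publish, discover, compose, republish cycle described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Registry` class, the `compose` helper, and the service names are all hypothetical, and an in-memory dictionary stands in for real networked services.

```python
from typing import Callable, Dict, List

# Hypothetical: a "service" is modeled as a plain function from input to output.
Service = Callable[[object], object]


class Registry:
    """In-memory stand-in for a service registry (an assumption, not a real API)."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def publish(self, name: str, service: Service) -> None:
        """Expose data or software as a named service."""
        self._services[name] = service

    def discover(self, keyword: str) -> List[str]:
        """Return the names of published services that contain the keyword."""
        return sorted(n for n in self._services if keyword in n)

    def get(self, name: str) -> Service:
        """Look up a service chosen after discovery."""
        return self._services[name]


def compose(*services: Service) -> Service:
    """Chain services so that each one consumes the previous one's output."""
    def composed(value: object) -> object:
        for service in services:
            value = service(value)
        return value
    return composed


registry = Registry()
# Expose data and software as services (toy examples).
registry.publish("data.readings", lambda _: [3.0, 1.0, 2.0])
registry.publish("analysis.sort", lambda xs: sorted(xs))
registry.publish("analysis.top2", lambda xs: xs[-2:])

# Compose existing services into a new function and publish it as a new service.
pipeline = compose(
    registry.get("data.readings"),
    registry.get("analysis.sort"),
    registry.get("analysis.top2"),
)
registry.publish("analysis.top2_readings", pipeline)

result = registry.get("analysis.top2_readings")(None)
found = registry.discover("analysis")
```

The composed pipeline is itself published under a new name, so later users can discover and reuse it exactly like the services it was built from.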
12
Grid technology
13
NAE Grand Challenges
14
The future of multi-site data integration: An example
fMRI
Are positive symptom schizophrenics associated with more severe superior temporal gyrus dysfunction?
Receptor Density
ERP
Web
PubMed, Expasy, Brain Map, etc.
Structure
Clinical
Portal
[Chart by Treatment Group: ARIP - 20MG, ARIP - 30MG, RISP - 06MG, PLACEBO; data labels 0.15, 0.18, 0.14, 0.11]
15
caBIG: sharing of infrastructure, applications, and data.
Aggregation in cancer biology
Globus
16
As of Feb 16, 2009
123 participants
104 services: 65 data, 39 analytical
17
Microarray clustering in caBIG
1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
2. Normalize microarray data using GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
3. Hierarchical clustering using geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
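The shape of this three-step workflow (retrieve, normalize, cluster) can be illustrated with a small local sketch. Nothing here calls the caGrid services listed above: the functions, the toy expression values, and the choice of single-linkage clustering are all assumptions made for illustration, and the real services may normalize and cluster differently.

```python
from itertools import combinations


def query_microarray():
    """Step 1 stand-in: return made-up expression values per gene."""
    return {
        "geneA": [1.0, 2.0, 3.0],
        "geneB": [2.0, 4.0, 6.0],
        "geneC": [9.0, 1.0, 5.0],
    }


def normalize(data):
    """Step 2 stand-in: center each row on zero and scale to unit max magnitude."""
    normalized = {}
    for gene, values in data.items():
        mean = sum(values) / len(values)
        centered = [v - mean for v in values]
        scale = max(abs(v) for v in centered) or 1.0  # avoid dividing by zero
        normalized[gene] = [v / scale for v in centered]
    return normalized


def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hierarchical_clustering(data):
    """Step 3 stand-in: single-linkage agglomerative clustering.

    Returns the clusters in the order they were formed.
    """
    clusters = [(gene,) for gene in sorted(data)]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            left, right = pair
            return min(distance(data[g], data[h]) for g in left for h in right)

        # Merge the two clusters whose closest members are nearest to each other.
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters.remove(left)
        clusters.remove(right)
        merged = tuple(sorted(left + right))
        clusters.append(merged)
        merges.append(merged)
    return merges


# Chain the three steps, as the workflow does with remote services.
merges = hierarchical_clustering(normalize(query_microarray()))
```

In this toy data, geneA and geneB have the same profile after normalization, so they are merged first and geneC joins last.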
Workflow in/output
caGrid services
“Shim” services
others
Wei Tan (Taverna workflow)
18
Children’s Oncology Group clinical imaging trials (Erberich)
19
Wide-area medical interface service
Converts local medical workflow actions into wide-area operations
• Image workflow, EHR, …
Transparently manages federation of
• Security
• Data replication and recovery
• Data discovery
Enterprise/Grid Interface Service
DICOM Protocols
Grid Protocols (Web services)
DICOM
XDS
HL7
Vendor Specific
Wide Area Service Actor
Plug-in Adapters
20
Main ESG Portal
• 198 TB of data at four locations
• 1,150 datasets
• 1,032,000 files
• Includes the past 6 years of joint DOE/NSF climate modeling experiments
• 8,000 registered users
• Downloads to date: 49 TB, 176,000 files
CMIP3 (IPCC AR4) ESG Portal
• 35 TB of data at one location
• 74,700 files
• Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
• Data from 13 countries, representing 25 models
• 1,900 registered projects
• Downloads to date: 387 TB, 1,300,000 files, 500 GB/day (average)
• 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data
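A few derived quantities help put the totals quoted above in perspective. The sketch below is back-of-the-envelope arithmetic only; it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the slide does not specify, and the variable names are illustrative.

```python
# Assumption: decimal units, since the slide does not say which convention it uses.
TB = 10**12
GB = 10**9
MB = 10**6


def mean_file_size_mb(total_tb, files):
    """Average file size in MB for an archive of total_tb terabytes."""
    return total_tb * TB / files / MB


# Archive sizes and file counts as quoted on the slide.
archives = {
    "198 TB archive": (198, 1_032_000),
    "35 TB archive": (35, 74_700),
}
mean_sizes = {
    name: round(mean_file_size_mb(tb, files))
    for name, (tb, files) in archives.items()
}

# Days needed to move 387 TB at the quoted average rate of 500 GB/day.
days_at_average_rate = 387 * TB / (500 * GB)
```

Under these assumptions, `mean_sizes` gives the average file size of each archive in megabytes, and `days_at_average_rate` is the number of days of downloading at the average rate that the 387 TB total corresponds to.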
Earth System Grid
ESG usage: over 500 sites worldwide
ESG monthly download volumes
Globus
www.earthsystemgrid.org
21
Understanding interactions between human and natural systems
IPCC Emissions scenarios
Numerical Simulations
IPCC 4th Assessment
2007
IPCC process: Bill Collins, LBNL
Mitigation
Adaptation
22
A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)
Dynamics, foresight, uncertainty, resolution, …
Agriculture, transport, taxation, …
Data (global, local, …)
(Super)computers
CIM-EARTH Framework
Community process
Open code, data
www.cim-earth.org
23
Alleviating Poverty in Thailand: Modeling Entrepreneurship
Consider only wealth, access to capital
Consider also distance to 6 major cities
Rob Townsend, Tibi Stef-Praun, Victor Zhorin
Match
High
Low
24
enormous
Data is
Infrastructure: Storage & computing; Economics of scale
Aggregation: Data & software; People & disciplines
Algorithms: Scalable, probabilistic; Errors & ambiguity
distributed
noisy
Cloud
Grid
Computation Institute
www.ci.uchicago.edu
Thank you!