Computation and Knowledge
Physics Colloquium, University of Chicago, May 2008
Ian Foster
Computation Institute
Argonne National Lab & University of Chicago
Computation and Knowledge
Knowledge Generation in Astronomy ~1600
30 years, ? years, 10 years, 6 years, 2 years
Astronomy, from 1600 to 2000
Automation: 10^-1 to 10^8 Hz data capture
Community: 10^0 to 10^4 astronomers (10^6 amateur)
Computation: 10^-1 to 10^15 Hz peak
Data: 10^6 to 10^15 B aggregate
Literature: 10^1 to 10^5 pages/year
Text mining
Federation/collaboration
Data analysis
Complex system modeling
Hypothesis generation
Experiment design
FLASH Turbulence Simulation (R. Fisher, D. Lamb, et al.)
Largest compressible homogeneous isotropic turbulence simulation
LLNL BG/L: 1 week, 65K CPUs, 11M CPU hours
74 million files, 154 TB
23 TB
3 weeks @ 20 MB/sec (GridFTP)
External users access processed dataset
Globus
Data Delivery as a Systems Problem
Data complexity
Many components
Parallelism (in many places)
Network heterogeneities (e.g., firewalls)
Space (or the lack of it)
Protocols
Failures at many levels
Deadlines
Resource contention
Policy and priorities
74 million files, 154 TB
23 TB:
2 weeks @ 20 MB/sec
3 hours @ 2000 MB/sec
2 mins? @ 200,000 MB/sec
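The transfer times quoted on this slide follow from dividing the dataset size by the sustained rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^6 MB), no protocol overhead, and that all three rates refer to the 23 TB dataset; the function name is illustrative:

```python
def transfer_seconds(size_tb: float, rate_mb_per_s: float) -> float:
    """Idealized time to move size_tb terabytes at a sustained rate."""
    size_mb = size_tb * 1_000_000  # assumption: decimal units, 1 TB = 10**6 MB
    return size_mb / rate_mb_per_s


# The three rates listed on the slide, applied to the 23 TB dataset.
for rate in (20, 2000, 200_000):
    seconds = transfer_seconds(23, rate)
    print(
        f"{rate:>7} MB/sec: "
        f"{seconds / 86400:.2f} days, "
        f"{seconds / 3600:.2f} hours, "
        f"{seconds / 60:.2f} minutes"
    )
```

Under these assumptions the results land close to the slide's rounded figures of about 2 weeks, 3 hours, and 2 minutes.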
Send 1 GB partitioned into equal-sized files over a 60 ms RTT, 1 Gbit/s WAN (16 MB TCP buffer)
[Plot: throughput (Megabit/sec) against file size (Kbyte) and number of files]
John Bresnahan et al., Argonne
Globus
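One way to reason about why splitting 1 GB into many small files can hurt throughput on this link is a simple cost model. The sketch below is not the measured result from the plot: it assumes, hypothetically, that each file costs one extra round trip of protocol overhead and otherwise moves at line rate. It also computes the bandwidth-delay product for a 1 Gbit/s, 60 ms RTT path, which is 7.5 MB and so fits within the 16 MB TCP buffer.

```python
RTT_S = 0.060           # round-trip time from the slide
LINK_BITS_PER_S = 1e9   # 1 Gbit/s WAN from the slide
TOTAL_BYTES = 1e9       # 1 GB payload (assumption: decimal units)


def bandwidth_delay_product_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full."""
    return link_bits_per_s * rtt_s / 8


def modeled_throughput_mbit_s(n_files: int, round_trips_per_file: float = 1.0) -> float:
    """Assumed model: line-rate transfer plus a fixed per-file round-trip cost."""
    wire_time_s = TOTAL_BYTES * 8 / LINK_BITS_PER_S
    overhead_s = n_files * round_trips_per_file * RTT_S
    return TOTAL_BYTES * 8 / (wire_time_s + overhead_s) / 1e6


bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_S, RTT_S)
print(f"bandwidth-delay product: {bdp / 1e6:.1f} MB")
for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>6} files: {modeled_throughput_mbit_s(n):8.1f} Mbit/s (modeled)")
```

In this model the per-file cost is negligible for a handful of large files and dominates once the file count reaches the thousands.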
LIGO Gravitational Wave Observatory
>1 Terabyte/day to 8 sites
770 TB replicated to date: >120 million replicas
MTBF = 1 month
Sites labeled: Birmingham, Cardiff, AEI/Golm
Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
Globus
Lag Plot for Data Transfers to Caltech
Credit: Kevin Flasch, LIGO
Cancer Biology
Globus
Service-Oriented Science
People create services (data, code, instr.) …
which I discover (& decide whether to use) …
& compose to create a new function ...
& then publish as a new service.
I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
I hope that this “someone else” can manage security, reliability, scalability, …
“Service-Oriented Science”, Science, 2005
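The discover, compose, and publish cycle on this slide can be sketched with a toy in-memory registry. Everything here is hypothetical: the `Registry` class, the service names, and the stand-in functions illustrate the pattern only and are not any Globus or caGrid API.

```python
from typing import Callable, Dict, List


class Registry:
    """Toy stand-in for a hosted service directory (hypothetical)."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., object]] = {}
        self._descriptions: Dict[str, str] = {}

    def publish(self, name: str, description: str, func: Callable[..., object]) -> None:
        self._services[name] = func
        self._descriptions[name] = description

    def discover(self, keyword: str) -> List[str]:
        """Return names of services whose description mentions the keyword."""
        return sorted(n for n, d in self._descriptions.items() if keyword in d)

    def get(self, name: str) -> Callable[..., object]:
        return self._services[name]


registry = Registry()

# People create services (data, code, ...).
registry.publish("fetch_spectrum", "data: spectrum for a named object",
                 lambda name: [1.0, 4.0, 2.0])
registry.publish("peak", "code: largest value in a spectrum", max)


# I discover them and compose them into a new function ...
def peak_of_object(name: str) -> float:
    spectrum = registry.get("fetch_spectrum")(name)
    return registry.get("peak")(spectrum)


# ... and then publish that function as a new service.
registry.publish("peak_of_object", "composite: peak of an object's spectrum",
                 peak_of_object)

print(registry.discover("composite"))
print(registry.get("peak_of_object")("M31"))
```

A real deployment would replace the dictionary with a hosted registry and the lambdas with remote calls, which is where the hosting, security, and reliability concerns on the slide come in.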
Discovering Services
Billions of services
Assume success
Syntax, semantics: types, ontologies
Permissions: can I use it?
Reputation: the ultimate arbiter?
A B
Discovery (1): Registries
Duke
OSU
NCI
Globus
NCI
Discovery (2): Standardized Vocabularies
Core Services
Components: Grid Service, Cancer Data Standards Repository, Enterprise Vocabulary Services, Service Metadata, Index Service, Discovery Client API
Relationships: Uses Terminology Described In; References Objects Defined In; Publishes; Subscribes to and Aggregates; Queries Service Metadata Aggregated In; Registers To
Globus
Text Mining
More Knowledge (?)
US papers/year in ApJ+AJ+PASP
GeneWays
Online Journals
Pathways
GeneWays
Andrey Rzhetsky et al.
Screening 250,000 journal articles
2.5M reasoning chains
4M statements
Image: Andrey Rzhetsky
Evidence Integration: Genetics & Disease Susceptibility
Identify Genes
Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4
Predictive Disease Susceptibility
Physiology, Metabolism, Endocrine, Proteome, Immune, Transcriptome, Biomarker Signatures, Morphometrics, Pharmacokinetics, Ethnicity, Environment, Age, Gender
Source: Terry Magnuson
The Data Deluge
Images courtesy Mark Ellisman, UCSD
A human brain @ 1 micron voxel = 3.5 peta (10^15) bytes
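The order of magnitude can be checked with a short calculation. The slide gives neither the brain volume nor the storage per voxel, so this sketch assumes a volume of roughly 1,400 cm^3 and reports the bytes per voxel that the 3.5 petabyte figure would then imply; both the volume and the function names are assumptions made for illustration.

```python
MICRONS_PER_CM = 10_000  # 1 cm = 10^4 micron


def voxel_count(volume_cm3: float, voxel_edge_micron: float = 1.0) -> float:
    """Number of cubic voxels of the given edge length in the volume."""
    voxels_per_cm3 = (MICRONS_PER_CM / voxel_edge_micron) ** 3
    return volume_cm3 * voxels_per_cm3


ASSUMED_BRAIN_VOLUME_CM3 = 1400.0  # assumption, not stated on the slide
SLIDE_TOTAL_BYTES = 3.5e15         # 3.5 peta (10^15) bytes, from the slide

voxels = voxel_count(ASSUMED_BRAIN_VOLUME_CM3)
implied_bytes_per_voxel = SLIDE_TOTAL_BYTES / voxels

print(f"voxels at 1 micron: {voxels:.2e}")
print(f"implied bytes per voxel: {implied_bytes_per_voxel:.2f}")
```

With these assumptions the voxel count is on the order of 10^15, so the slide's figure corresponds to a few bytes per voxel.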
• Understanding increases far more slowly
Methodological bottlenecks? Improved technology
Human limitations? Computer-assisted discovery
Towards an Open Analytics Environment
Data in
Programs & rules in
Results out
“No limits”: storage, computing, format, program
Allowing for: provenance, collaboration, annotation
Tagging & Social Networking
GLOSS: Generalized Labels Over Scientific data Sources
(Foster, Nestorov)
High-Performance Data Analytics
Functional MRI
Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
Globus
start
PDB protein descriptions: 1 protein (1 MB)
ZINC 3-D structures: 2M structures (6 GB)
Manually prep DOCK6 rec file: DOCK6 Receptor (1 per protein: defines pocket to bind to)
Manually prep FRED rec file: FRED Receptor (1 per protein: defines pocket to bind to)
NAB script parameters (defines flexible residues, #MD steps)
NAB Script Template, BuildNABScript, NAB Script
DOCK6, FRED: ~4M x 60s x 1 cpu, ~60K cpu-hrs
Select best ~5K; select best ~5K
Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript
Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript
Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs
Select best ~500
GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
ligands, complexes
report
end
4 million tasks, 500K cpu-hrs
(Mike Kubal, Benoit Roux, and others)
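The per-stage costs on this slide are each a product of task count, time per task, and CPUs per task. A minimal sketch of that arithmetic, using the slide's approximate figures; the stage tuples and the function name are illustrative, and the slide's totals are rounded, so only the order of magnitude should be expected to match.

```python
def cpu_hours(tasks: int, hours_per_task: float, cpus_per_task: int) -> float:
    """CPU-hours for a stage: tasks x time per task x CPUs per task."""
    return tasks * hours_per_task * cpus_per_task


# (stage, tasks, hours per task, CPUs per task), read off the slide.
stages = [
    ("DOCK6/FRED", 4_000_000, 60 / 3600, 1),  # ~4M x 60s x 1 cpu
    ("Amber", 10_000, 20 / 60, 1),            # ~10K x 20m x 1 cpu
    ("GCMC", 500, 10.0, 100),                 # ~500 x 10hr x 100 cpu
]

total = 0.0
for name, tasks, hours, cpus in stages:
    cost = cpu_hours(tasks, hours, cpus)
    total += cost
    print(f"{name:<11} {cost:>12,.0f} cpu-hrs")
print(f"{'all stages':<11} {total:>12,.0f} cpu-hrs")
```

The GCMC stage dominates the total even though it has by far the fewest tasks, because each task uses 100 CPUs for 10 hours.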
DOCK on SiCortex
CPU cores: 5760
Power: 15,000 W
Tasks: 92160
Elapsed time: 12821 sec
Compute time: 1.94 CPU years
(does not include ~800 sec to stage input data)
Ioan Raicu, Zhao Zhang
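The figures reported for this run are enough to estimate how busy the cores were. A minimal sketch, assuming a 365-day year for the CPU-year conversion and defining utilization as compute time divided by cores times elapsed time; that definition is an assumption made here, not a metric stated on the slide.

```python
CORES = 5760
TASKS = 92160
ELAPSED_S = 12821
COMPUTE_CPU_YEARS = 1.94

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

compute_s = COMPUTE_CPU_YEARS * SECONDS_PER_YEAR  # total CPU-seconds of work
available_s = CORES * ELAPSED_S                   # CPU-seconds offered by the machine

utilization = compute_s / available_s
tasks_per_core = TASKS / CORES
mean_task_s = compute_s / TASKS

print(f"tasks per core:    {tasks_per_core:.0f}")
print(f"mean task length:  {mean_task_s:.0f} sec")
print(f"core utilization:  {utilization:.1%}")
```

Under these assumptions each core ran 16 tasks on average, and a bit more than four fifths of the available core time went to useful computation.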
An NSF MRI Proposal: PADS: Petascale Active Data Store
500 TB reliable storage (data & metadata)
180 TB, 180 GB/s, 17 Top/s analysis
1000 TB tape backup
Data ingest
Dynamic provisioning
Parallel analysis
Remote access
Offload to remote data centers
Diverse users
Diverse data sources
Using Computation to Accelerate Science
Complex modeling
Experiment automation
Data analysis
Collaboration & federation
Hypothesis generation
Integrated View of Simulation, Experiment, & Informatics
Database
Analysis Tools
Simulation: SIMS*, Problem Specification, Browsing & Visualization
Experiment: LIMS+, Experimental Design, Browsing & Visualization
*Simulation Information Management System
+Laboratory Information Management System
Robot Scientist (University of Wales)
A Concluding Thought
“Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework & exploratory apparatus for other sciences.” George Djorgovski
The Computation Institute
A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
www.ci.uchicago.edu
Faculty, fellows, staff, students, computers, projects.