Computation and Knowledge

Ian Foster, Computation Institute, Argonne National Lab & University of Chicago

DESCRIPTION

Physics Colloquium, University of Chicago May 2008

TRANSCRIPT

Page 1: Computation and Knowledge

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

Computation and Knowledge

Page 2: Computation and Knowledge

Page 3: Computation and Knowledge

Knowledge Generation in Astronomy, ~1600

30 years, ? years, 10 years, 6 years, 2 years

Page 4: Computation and Knowledge

Astronomy, from 1600 to 2000

Automation: 10^-1 to 10^8 Hz data capture

Community: 10^0 to 10^4 astronomers (10^6 amateur)

Computation: 10^-1 to 10^15 Hz peak

Data: 10^6 to 10^15 B aggregate

Literature: 10^1 to 10^5 pages/year
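To put the growth figures above on a common scale, the sketch below converts each start and end value into an equivalent doubling time. It assumes steady exponential growth over the 400 years, which is a simplification introduced here and not a claim made on the slide; the pairing of values with quantities follows the reading of the slide's layout given above.

```python
import math

# (value around 1600, value around 2000), as read from the slide.
QUANTITIES = {
    "data capture (Hz)": (1e-1, 1e8),
    "astronomers": (1e0, 1e4),
    "computation, peak (Hz)": (1e-1, 1e15),
    "data, aggregate (B)": (1e6, 1e15),
    "literature (pages/year)": (1e1, 1e5),
}
SPAN_YEARS = 400  # 1600 to 2000


def doubling_time_years(start, end, span_years=SPAN_YEARS):
    """Doubling time under the assumed model of steady exponential growth."""
    doublings = math.log2(end / start)
    return span_years / doublings


if __name__ == "__main__":
    for name, (start, end) in QUANTITIES.items():
        orders = math.log10(end / start)
        print(f"{name}: {orders:.0f} orders of magnitude, "
              f"doubling every {doubling_time_years(start, end):.1f} years")
```

On these assumptions, peak computation doubles several times faster than the number of astronomers or the volume of literature.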

Page 5: Computation and Knowledge

Astronomy, from 1600 to 2000 (same five quantities as on the previous slide), annotated with:

Text mining

Federation/collaboration

Data analysis

Complex system modeling

Hypothesis generation

Experiment design

Page 6: Computation and Knowledge

FLASH Turbulence Simulation (R. Fisher, D. Lamb, et al.)

Largest compressible homogeneous isotropic turbulence simulation

1 week, 65K CPUs, 11M CPU hours

74 million files, 154 TB

LLNL BG/L

23 TB

3 weeks @ 20 MB/sec (GridFTP)

External users access processed dataset

Globus

Page 7: Computation and Knowledge

Data Delivery as a Systems Problem

Data complexity
Many components
Parallelism (in many places)
Network heterogeneities (e.g., firewalls)
Space (or the lack of it)
Protocols
Failures at many levels
Deadlines
Resource contention
Policy and priorities

74 million files, 154 TB

23 TB

2 weeks @ 20 MB/sec

3 hours @ 2000 MB/sec

2 mins? @ 200,000 MB/sec
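The three transfer times on this slide follow from dividing a data size by a sustained rate. The sketch below reproduces the arithmetic, assuming that all three rates refer to the 23 TB dataset and that 1 TB = 10^6 MB; the function name is illustrative.

```python
def transfer_time_seconds(size_tb, rate_mb_per_s):
    """Time to move size_tb terabytes at a sustained rate, with 1 TB = 10**6 MB."""
    return size_tb * 1_000_000 / rate_mb_per_s


if __name__ == "__main__":
    SIZE_TB = 23  # assumption: the rates on the slide apply to the 23 TB dataset
    for rate in (20, 2_000, 200_000):
        seconds = transfer_time_seconds(SIZE_TB, rate)
        print(f"{rate:>7} MB/sec: {seconds:>9.0f} s "
              f"= {seconds / 3600:.1f} h = {seconds / 86400:.1f} days")
```

At 20 MB/sec the transfer takes about 13 days, at 2000 MB/sec a little over 3 hours, and at 200,000 MB/sec just under 2 minutes, which matches the rounded figures on the slide.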

Page 8: Computation and Knowledge

Send 1 GB partitioned into equi-sized files over 60 ms RTT, 1 Gbit/s WAN

[Plot: throughput (Megabit/sec) against file size (Kbyte) and number of files; 16 MB TCP buffer]

John Bresnahan et al., Argonne

Globus
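Two pieces of arithmetic help in reading this experiment. The first is the bandwidth-delay product of the link, which comes out below the 16 MB TCP buffer. The second is a toy model, introduced here purely for illustration, in which every file costs a fixed number of idle round trips; it is not the measured curve from the plot and it ignores pipelining, parallel streams, and TCP slow start.

```python
def bandwidth_delay_product_bytes(link_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the link full."""
    return link_bits_per_s * rtt_s / 8


def toy_throughput_mbit_per_s(total_bytes, n_files, link_bits_per_s, rtt_s,
                              round_trips_per_file=1):
    """Toy model (an assumption, not a measurement): wire time plus a fixed
    number of idle round trips per file, with no overlap between files."""
    total_bits = total_bytes * 8
    wire_time_s = total_bits / link_bits_per_s
    overhead_s = n_files * round_trips_per_file * rtt_s
    return total_bits / (wire_time_s + overhead_s) / 1e6


if __name__ == "__main__":
    LINK = 1e9       # 1 Gbit/s
    RTT = 0.060      # 60 ms
    TOTAL = 10 ** 9  # 1 GB, taken as 10**9 bytes
    bdp_mb = bandwidth_delay_product_bytes(LINK, RTT) / 1e6
    print(f"bandwidth-delay product: {bdp_mb:.1f} MB")
    for n_files in (1, 10, 100, 1_000, 10_000):
        rate = toy_throughput_mbit_per_s(TOTAL, n_files, LINK, RTT)
        print(f"{n_files:>6} files of {TOTAL // n_files // 1000:>9} KB: {rate:7.1f} Mbit/s")
```

Even this crude model shows throughput collapsing as the same gigabyte is split into more, smaller files, which is the kind of effect the slide's experiment measures.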

Page 9: Computation and Knowledge

LIGO Gravitational Wave Observatory

>1 Terabyte/day to 8 sites

770 TB replicated to date: >120 million replicas

MTBF = 1 month

Sites shown: Birmingham, Cardiff, AEI/Golm

Ann Chervenak et al., ISI; Scott Koranda et al., LIGO

Globus
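The LIGO figures can be turned into two rough averages: the sustained rate implied by 1 Terabyte/day, and the mean size of a replica implied by 770 TB spread over 120 million replicas. This is a back-of-the-envelope sketch assuming 1 TB = 10^6 MB. Because the slide gives ">" bounds, the rate is a lower bound and the replica size an upper bound.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # assumption: decimal units


def mean_rate_mb_per_s(tb_per_day):
    """Average sustained rate needed to move tb_per_day terabytes each day."""
    return tb_per_day * MB_PER_TB / SECONDS_PER_DAY


def mean_replica_size_mb(total_tb, n_replicas):
    """Average size of one replica if total_tb is spread over n_replicas."""
    return total_tb * MB_PER_TB / n_replicas


if __name__ == "__main__":
    print(f"more than {mean_rate_mb_per_s(1):.1f} MB/sec sustained")
    print(f"less than {mean_replica_size_mb(770, 120_000_000):.1f} MB per replica on average")
```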

Page 10: Computation and Knowledge

Lag Plot for Data Transfers to Caltech

Credit: Kevin Flasch, LIGO

Page 11: Computation and Knowledge

Page 12: Computation and Knowledge

Page 13: Computation and Knowledge

Cancer Biology

Globus

Page 14: Computation and Knowledge

Service-Oriented Science

People create services (data, code, instruments) …

which I discover (& decide whether to use) …

& compose to create a new function ...

& then publish as a new service.

I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!

I hope that this “someone else” can manage security, reliability, scalability, …

“Service-Oriented Science”, Science, 2005
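The create, discover, compose, publish cycle described on this slide can be sketched in a few lines. Everything below is hypothetical: the `Registry` class, its method names, and the toy services are illustrations invented here, not the API of Globus or of any real registry, and a real deployment would involve hosted network services, not in-process functions.

```python
from typing import Callable, Dict, Iterable, List

Service = Callable[[float], float]


class Registry:
    """Hypothetical in-memory stand-in for a hosted service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._tags: Dict[str, List[str]] = {}

    def publish(self, name: str, service: Service, tags: Iterable[str]) -> None:
        """Make a service available to others under a name and some tags."""
        self._services[name] = service
        self._tags[name] = list(tags)

    def discover(self, tag: str) -> List[str]:
        """Return the names of all services carrying the given tag."""
        return sorted(name for name, tags in self._tags.items() if tag in tags)

    def lookup(self, name: str) -> Service:
        return self._services[name]


def compose(first: Service, second: Service) -> Service:
    """Build a new function that feeds the output of first into second."""
    return lambda x: second(first(x))


if __name__ == "__main__":
    registry = Registry()
    # People create services ...
    registry.publish("calibrate", lambda x: 2.0 * x, ["signal"])
    registry.publish("subtract_baseline", lambda x: x - 1.0, ["signal"])
    # ... which I discover ...
    print(registry.discover("signal"))
    # ... and compose to create a new function ...
    pipeline = compose(registry.lookup("calibrate"),
                       registry.lookup("subtract_baseline"))
    # ... and then publish as a new service.
    registry.publish("calibrated_signal", pipeline, ["signal", "composite"])
    print(registry.lookup("calibrated_signal")(3.0))
```

The sketch leaves out exactly the parts the slide delegates to "someone else": hosting, security, reliability, and scalability.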

Page 15: Computation and Knowledge

Discovering Services

Assume success: billions of services

Syntax, semantics: types, ontologies

Permissions: can I use it?

Reputation: the ultimate arbiter?

Page 16: Computation and Knowledge

Discovery (1): Registries

Duke, OSU, NCI

Globus

Page 17: Computation and Knowledge

Discovery (2): Standardized Vocabularies

[Diagram. Components: Core Services, Grid Service, Cancer Data Standards Repository, Enterprise Vocabulary Services, Service Metadata, Index Service, Discovery Client API. Relationship labels: Uses Terminology Described In; References Objects Defined In; Publishes; Subscribes to and Aggregates; Queries Service Metadata Aggregated In; Registers To]

Globus

Page 18: Computation and Knowledge

Page 19: Computation and Knowledge

Text Mining

Page 20: Computation and Knowledge

More Knowledge (?)

US papers/year in ApJ+AJ+PASP

Page 21: Computation and Knowledge

GeneWays

Online Journals

Pathways

GeneWays

Andrey Rzhetsky et al.

Screening 250,000 journal articles

2.5M reasoning chains

4M statements

Page 22: Computation and Knowledge

Image: Andrey Rzhetsky

Page 23: Computation and Knowledge
Page 24: Computation and Knowledge

Image: Andrey Rzhetsky

Page 25: Computation and Knowledge

Image: Andrey Rzhetsky

Page 26: Computation and Knowledge

Image: Andrey Rzhetsky

Page 27: Computation and Knowledge

Evidence Integration: Genetics & Disease Susceptibility

Identify Genes

Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4

Predictive Disease Susceptibility

Evidence sources: Physiology, Metabolism, Endocrine, Proteome, Immune, Transcriptome, Biomarker Signatures, Morphometrics, Pharmacokinetics, Ethnicity, Environment, Age, Gender

Source: Terry Magnuson

Page 28: Computation and Knowledge
Page 29: Computation and Knowledge

The Data Deluge

Page 30: Computation and Knowledge

Images courtesy Mark Ellisman, UCSD

Page 31: Computation and Knowledge

Images courtesy Mark Ellisman, UCSD

Page 32: Computation and Knowledge

Images courtesy Mark Ellisman, UCSD

Page 33: Computation and Knowledge

Images courtesy Mark Ellisman, UCSD

A human brain @ 1 micron voxel = 3.5 peta (10^15) bytes
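The order of magnitude of this figure can be checked with a short calculation. The sketch below assumes a brain volume of about 1,200 cm^3, which is a round number chosen here for illustration and does not appear on the slide, and then asks how many bytes per voxel the 3.5 petabyte total implies.

```python
MICRONS_PER_CM = 10_000
VOXELS_PER_CM3 = MICRONS_PER_CM ** 3  # 1-micron voxels in one cubic centimetre


def voxel_count(volume_cm3):
    """Number of 1-micron voxels in the given volume."""
    return volume_cm3 * VOXELS_PER_CM3


def bytes_per_voxel(total_bytes, volume_cm3):
    """Storage per voxel implied by a total size and a volume."""
    return total_bytes / voxel_count(volume_cm3)


if __name__ == "__main__":
    ASSUMED_VOLUME_CM3 = 1_200  # assumption, not from the slide
    TOTAL_BYTES = 3.5e15        # from the slide
    print(f"voxels: {voxel_count(ASSUMED_VOLUME_CM3):.2e}")
    print(f"implied bytes per voxel: {bytes_per_voxel(TOTAL_BYTES, ASSUMED_VOLUME_CM3):.1f}")
```

Under that assumption the slide's total corresponds to roughly three bytes per voxel.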

Page 34: Computation and Knowledge

• Understanding increases far more slowly

Methodological bottlenecks? Improved technology

Human limitations? Computer-assisted discovery

Page 35: Computation and Knowledge

Towards an Open Analytics Environment

Data in

Programs & rules in

Results out

“No limits”: Storage, Computing, Format, Program

Allowing for: Provenance, Collaboration, Annotation

Page 36: Computation and Knowledge

Tagging & Social Networking

GLOSS: Generalized Labels Over Scientific data Sources (Foster, Nestorov)

Page 37: Computation and Knowledge

High-Performance Data Analytics

Functional MRI

Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao

Globus

Page 38: Computation and Knowledge

[Workflow diagram, from start to end, producing ligands, complexes, and a report]

Inputs:
PDB protein descriptions: 1 protein (1 MB)
ZINC 3-D structures: 2M structures (6 GB)
Manually prep DOCK6 rec file; manually prep FRED rec file
DOCK6 Receptor and FRED Receptor (1 per protein: defines pocket to bind to)
NAB script parameters (defines flexible residues, #MD steps)
NAB Script Template, BuildNABScript, NAB Script

Stages:
DOCK6 and FRED: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript
Amber Score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript
GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs

Total: 4 million tasks, 500K cpu-hrs

(Mike Kubal, Benoit Roux, and others)
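The per-stage CPU-hour estimates in this workflow are products of task count, time per task, and CPUs per task. The sketch below redoes that arithmetic from the figures on the slide; the stage labels and the helper name are illustrative.

```python
SECONDS_PER_HOUR = 3600


def cpu_hours(tasks, seconds_per_task, cpus_per_task=1):
    """CPU-hours consumed by a stage of independent tasks."""
    return tasks * seconds_per_task * cpus_per_task / SECONDS_PER_HOUR


# Task counts and durations as given on the slide (all approximate there).
STAGES = {
    "DOCK6/FRED": cpu_hours(4_000_000, 60),
    "Amber": cpu_hours(10_000, 20 * 60),
    "GCMC": cpu_hours(500, 10 * 3600, cpus_per_task=100),
}

if __name__ == "__main__":
    for stage, hours in STAGES.items():
        print(f"{stage:>10}: {hours:>9,.0f} cpu-hrs")
    print(f"{'sum':>10}: {sum(STAGES.values()):>9,.0f} cpu-hrs")
```

The exact products are about 67K, 3.3K, and 500K cpu-hrs, which the slide rounds to ~60K, ~3K, and ~500K. The GCMC stage dominates, so the slide's overall total of 500K cpu-hrs is consistent to one significant figure.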

Page 39: Computation and Knowledge

DOCK on SiCortex

CPU cores: 5760
Power: 15,000 W
Tasks: 92160
Elapsed time: 12821 sec
Compute time: 1.94 CPU years (does not include ~800 sec to stage input data)

Ioan Raicu, Zhao Zhang
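A few derived quantities make the SiCortex run easier to interpret: tasks per core, mean task length, the fraction of available core time spent computing, and the energy used. This is a rough sketch from the slide's numbers, assuming a 365-day year for "CPU years" and constant power draw over the elapsed time; neither assumption is stated on the slide.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# Figures from the slide.
CORES = 5760
POWER_W = 15_000
TASKS = 92_160
ELAPSED_S = 12_821
COMPUTE_CPU_YEARS = 1.94


def summarize():
    """Derive per-core and per-task figures from the run totals."""
    available_core_s = CORES * ELAPSED_S
    used_core_s = COMPUTE_CPU_YEARS * SECONDS_PER_YEAR
    return {
        "tasks_per_core": TASKS / CORES,
        "mean_task_s": used_core_s / TASKS,
        "utilization": used_core_s / available_core_s,
        "energy_kwh": POWER_W * ELAPSED_S / 3.6e6,  # joules to kWh
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:.2f}")
```

On these assumptions each core ran 16 tasks of about 11 minutes each, and roughly 83% of the available core time went to computation.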

Page 40: Computation and Knowledge

An NSF MRI Proposal: PADS: Petascale Active Data Store

500 TB reliable storage (data & metadata)

180 TB, 180 GB/s, 17 Top/s analysis

1000 TB tape backup

Data ingest

Dynamic provisioning

Parallel analysis

Remote access

Offload to remote data centers

Diverse users

Diverse data sources

Page 41: Computation and Knowledge

Using Computation to Accelerate Science

Complex modeling

Experiment automation

Data analysis

Collaboration & federation

Hypothesis generation

Page 42: Computation and Knowledge

Integrated View of Simulation, Experiment, & Informatics

Database

Analysis Tools

SIMS*: Problem Specification, Simulation, Browsing & Visualization

LIMS+: Experimental Design, Experiment, Browsing & Visualization

* Simulation Information Management System
+ Laboratory Information Management System

Page 43: Computation and Knowledge

Robot Scientist (University of Wales)

Page 44: Computation and Knowledge

A Concluding Thought

“Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework & exploratory apparatus for other sciences.”

George Djorgovski

Page 45: Computation and Knowledge

The Computation Institute

A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.

Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).

www.ci.uchicago.edu

Faculty, fellows, staff, students, computers, projects.

Page 46: Computation and Knowledge
