Big Process for Big Data @ NASA


DESCRIPTION

A talk at NASA Goddard, February 27, 2013. Large and diverse data result in challenging data management problems that researchers and facilities are often ill-equipped to handle. I propose a new approach to these problems based on the outsourcing of research data management tasks to software-as-a-service providers. I argue that this approach can both achieve significant economies of scale and accelerate discovery by allowing researchers to focus on research rather than mundane information technology tasks. I present early results with the approach in the context of Globus Online.

TRANSCRIPT

Page 1: Big Process for Big Data @ NASA


Big process for big data

Ian Foster

[email protected]

NASA Goddard, February 27, 2013

Page 2: Big Process for Big Data @ NASA


The Computation Institute

= UChicago + Argonne

= Cross-disciplinary nexus

= Home of the Research Cloud

Page 3: Big Process for Big Data @ NASA


High energy physics

Molecular biology

Cosmology

Genetics

Metagenomics

Linguistics

Economics

Climate change

Visual arts

Page 4: Big Process for Big Data @ NASA


x10 in 6 years

x10^5 in 6 years

Will data kill genomics?

Kahn, Science, 331 (6018): 728-729

Page 5: Big Process for Big Data @ NASA


18 orders of magnitude in 5 decades!

12 orders of magnitude in 6 decades!

Moore’s Law for X-Ray Sources

Page 6: Big Process for Big Data @ NASA


1.2 PB of climate data delivered to 23,000 users

Page 7: Big Process for Big Data @ NASA


We have exceptional infrastructure for the 1%

Page 8: Big Process for Big Data @ NASA


What about the 99%?

We have exceptional infrastructure for the 1%

Page 9: Big Process for Big Data @ NASA


What about the 99%?

We have exceptional infrastructure for the 1%

Big science. Small labs.

Page 10: Big Process for Big Data @ NASA


Need: A new way to deliver research cyberinfrastructure

Frictionless

Affordable

Sustainable

Page 11: Big Process for Big Data @ NASA


We asked ourselves:

What if the research workflow could be managed as easily as…

…our pictures
…home entertainment
…our e-mail

Page 12: Big Process for Big Data @ NASA


What makes these services great?

Great User Experience
+
High performance (but invisible) infrastructure

Page 13: Big Process for Big Data @ NASA


We aspire (initially) to create a great user experience for research data management.

What would a "Dropbox for science" look like?

Page 14: Big Process for Big Data @ NASA


…for BIG DATA:

• Collect
• Move
• Sync
• Share
• Analyze
• Annotate
• Publish
• Search
• Backup
• Archive

Page 15: Big Process for Big Data @ NASA


A common workflow…

[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror linked in a typical research data workflow]

Page 16: Big Process for Big Data @ NASA


[Diagram: the same Registry / Staging Store / Ingest Store / Analysis Store / Community Store / Archive / Mirror workflow as the previous slide]

… with common challenges:

Data movement, sync, and sharing
• Between facilities, archives, researchers
• Many files, large data volumes
• With security, reliability, performance

Page 17: Big Process for Big Data @ NASA


• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive

• Collect • Move • Sync • Share

Capabilities delivered using a Software-as-a-Service (SaaS) model

Page 18: Big Process for Big Data @ NASA


Data Source → Data Destination

1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
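The three steps above can be illustrated with a short, hedged sketch using the present-day Globus Python SDK (which postdates this 2013 talk); the endpoint UUIDs, paths, and access token below are placeholders, not values from the talk.

import globus_sdk

TOKEN = "<globus-transfer-access-token>"   # assumed: a valid transfer token
SRC = "<source-endpoint-uuid>"             # hypothetical source endpoint
DST = "<destination-endpoint-uuid>"        # hypothetical destination endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN)
)

# Step 1: the user initiates a transfer/sync request
tdata = globus_sdk.TransferData(
    tc, SRC, DST, label="example-transfer", sync_level="checksum"
)
tdata.add_item("/source/data/", "/dest/data/", recursive=True)

# Step 2: the service moves/syncs the files (with automatic retries)
task = tc.submit_transfer(tdata)

# Step 3: the service notifies the user; here we simply poll for completion
tc.task_wait(task["task_id"], timeout=3600, polling_interval=30)
print(tc.get_task(task["task_id"])["status"])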

Page 19: Big Process for Big Data @ NASA


Data Source

1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
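As a hedged sketch of step 1, the later Globus Python SDK expresses a share permission as an access rule on a shared endpoint; the endpoint ID, identity, and path below are placeholders, not values from the talk.

import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("<globus-transfer-access-token>")
)

SHARED_ENDPOINT = "<shared-endpoint-uuid>"   # hypothetical shared endpoint

rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "<user-b-identity-uuid>",   # the collaborator being granted access
    "path": "/shared/dataset/",
    "permissions": "r",                      # read-only share
}
tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)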

Page 20: Big Process for Big Data @ NASA


Extreme ease of use

• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click "Globus Connect" install
• 5-minute Globus Connect Multi User install
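The web, REST, and command-line interfaces sit on top of an OAuth-style login. As a hedged illustration, here is a minimal native-app login sketch using the much later Globus Python SDK (not the 2013-era interfaces); the client ID is a placeholder.

import globus_sdk

CLIENT_ID = "<native-app-client-id>"        # hypothetical registered application
client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow(refresh_tokens=True)

print("Log in at:", client.oauth2_get_authorize_url())
code = input("Paste the authorization code here: ").strip()

tokens = client.oauth2_exchange_code_for_tokens(code)
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]
print("Transfer token acquired; expires at:", transfer_tokens["expires_at_seconds"])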

Page 21: Big Process for Big Data @ NASA


Early adoption is encouraging

Page 22: Big Process for Big Data @ NASA


Early adoption is encouraging

8,000 registered users; ~100 daily
~10 PB moved; ~1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on AWS

Page 23: Big Process for Big Data @ NASA


Delivering a great user experience relies on high-performance network infrastructure

Page 24: Big Process for Big Data @ NASA


Science DMZ optimizes performance

Page 25: Big Process for Big Data @ NASA


What is a Science DMZ?

Three key components, all required:

• "Friction free" network path
  – Highly capable network devices (wire-speed, deep queues)
  – Virtual circuit connectivity option
  – Security policy and enforcement specific to science workflows
  – Located at or near site perimeter if possible

• Dedicated, high-performance Data Transfer Nodes (DTNs)
  – Hardware, operating system, libraries optimized for transfer
  – Optimized data transfer tools: Globus Online, GridFTP

• Performance measurement/test node
  – perfSONAR

Details at http://fasterdata.es.net/science-dmz/

Page 26: Big Process for Big Data @ NASA


Globus GridFTP architecture

[Diagram: GridFTP layered over the Globus XIO stack, with parallel TCP, UDP, or RDMA transports across dedicated and shared network paths (LFN)]

Internal layered XIO architecture allows alternative network and filesystem interfaces to be plugged into the stack
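As a toy illustration of this layered-driver idea (not the real Globus XIO C API), the Python sketch below stacks a compression "transform" driver on top of interchangeable transport drivers:

import zlib

class TCPTransport:
    """Stand-in transport driver."""
    def write(self, data: bytes) -> None:
        print(f"TCP send: {len(data)} bytes")

class UDPTransport:
    """Alternative transport driver with the same interface."""
    def write(self, data: bytes) -> None:
        print(f"UDP send: {len(data)} bytes")

class CompressDriver:
    """Transform driver: compresses data, then passes it down the stack."""
    def __init__(self, below):
        self.below = below
    def write(self, data: bytes) -> None:
        self.below.write(zlib.compress(data))

# The upper layer (think: a GridFTP-like data channel) is composed with
# different transports without changing its own code.
for transport in (TCPTransport(), UDPTransport()):
    stack = CompressDriver(transport)
    stack.write(b"x" * 1_000_000)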

Page 27: Big Process for Big Data @ NASA


GridFTP performance options

• TCP configuration
• Concurrency: Multiple flows per node
• Parallelism: Multiple nodes
• Pipelining of requests to support small files
• Multiple cores for integrity, encryption
• Alternative protocol selection*
• Use of circuits and multiple paths*

Globus Online can configure these options based on what it knows about a transfer.

* Experimental
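For the "TCP configuration" item, the usual rule of thumb is to size socket buffers to roughly one bandwidth-delay product (BDP) per stream; the quick calculation below is illustrative and not from the talk.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes for a path with the given bandwidth and RTT."""
    return bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1e3)

# Example: a 10 Gb/s path with a 50 ms round-trip time
bdp = bdp_bytes(10, 50)
print(f"BDP ~= {bdp / 2**20:.0f} MiB per stream")              # ~60 MiB
print(f"With 8 parallel streams: ~{bdp / 8 / 2**20:.0f} MiB of buffer each")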

Page 28: Big Process for Big Data @ NASA


Exploiting multiple paths

• Take advantage of multiple interfaces in multi-homed data transfer nodes
• Use circuit as well as production IP link
• Data will flow even while the circuit is being set up
• Once circuit is set up, use both paths to improve throughput

Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter

Page 29: Big Process for Big Data @ NASA


Exploiting multiple paths

Default, commodity IP routes + dedicated circuits = significant performance gains

[Plots: multipath transfer throughput between NERSC and ANL, and between UMich and Caltech]

Raj Kettimuthu, Ezra Kissel, Martin Swany, Jason Zurawski, Dan Gunter

Page 30: Big Process for Big Data @ NASA
Page 31: Big Process for Big Data @ NASA


K. Heitmann (Argonne) moves 22 TB of cosmology data from LANL to ANL at 5 Gb/s

Page 32: Big Process for Big Data @ NASA


B. Winjum (UCLA) moves 900K-file plasma physics datasets from UCLA to NERSC

Page 33: Big Process for Big Data @ NASA


Dan Kozak (Caltech) replicates 1 PB of LIGO astronomy data for resilience

Page 34: Big Process for Big Data @ NASA


…for BIG DATA:

• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive

Page 35: Big Process for Big Data @ NASA


…for BIG DATA:

• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive

Page 36: Big Process for Big Data @ NASA


Globus Online: Research Data Management-as-a-Service

SaaS: Sharing, Collaboration, Annotation…
      Backup, Archival, Retrieval
      Ingest, Cataloging, Integration
      (many more capabilities planned…)

PaaS: Globus Integrate (Globus Nexus, Globus Connect)

Page 37: Big Process for Big Data @ NASA


A platform for integration

Page 38: Big Process for Big Data @ NASA


Catalog as a service

Approach:
• Hosted user-defined catalogs
• Based on tag model: <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services

Three REST APIs:
• /query/: retrieve subjects
• /tags/: create, delete, retrieve tags
• /tagdef/: create, delete, retrieve tag definitions

Builds on USC Tagfiler project (C. Kesselman et al.)
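A hedged sketch of how a client might call the three REST APIs above, using Python's requests library; the base URL, catalog ID, auth header, and JSON/query formats are assumptions for illustration. Only the /query/, /tags/, and /tagdef/ paths come from the slide.

import requests

BASE = "https://catalog.example.org/catalog/42"          # hypothetical catalog endpoint
HDRS = {"Authorization": "Bearer <access-token>"}        # assumed auth scheme

# /tagdef/: define a tag (optionally with a schema constraint on its type)
requests.post(f"{BASE}/tagdef/", headers=HDRS,
              json={"name": "instrument", "type": "text"})

# /tags/: attach a <subject, name, value> triple to a dataset
requests.post(f"{BASE}/tags/", headers=HDRS,
              json={"subject": "dataset-001", "name": "instrument",
                    "value": "beamline 2-BM"})

# /query/: retrieve subjects whose tags match a predicate
resp = requests.get(f"{BASE}/query/", headers=HDRS,
                    params={"instrument": "beamline 2-BM"})
print(resp.json())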

Page 39: Big Process for Big Data @ NASA


Other early successes in services for science…

Page 40: Big Process for Big Data @ NASA


Page 41: Big Process for Big Data @ NASA


Page 42: Big Process for Big Data @ NASA


Other innovative science SaaS projects

Page 43: Big Process for Big Data @ NASA


Other innovative science SaaS projects

Page 44: Big Process for Big Data @ NASA


Our vision for a 21st century cyberinfrastructure:

To provide more capability for more people at substantially lower cost by creatively aggregating ("cloud") and federating ("grid") resources

"Science as a service"

Page 45: Big Process for Big Data @ NASA


Thank you to our sponsors!