
Page 1: Urgent Computing, Sharing Grid Resources, and Elastic

Urgent Computing, Sharing Grid Resources, and Elastic Computing

Pete Beckman
Argonne National Laboratory
University of Chicago

http://www.mcs.anl.gov/~beckman

Page 2: Urgent Computing, Sharing Grid Resources, and Elastic
Page 3: Urgent Computing, Sharing Grid Resources, and Elastic

SPRUCE Urgent Computing, Flat Rock, North Carolina, 2006
http://www.mcs.anl.gov/~beckman

Argonne Nat’l Lab/U Chicago

Page 4: Urgent Computing, Sharing Grid Resources, and Elastic


Page 5: Urgent Computing, Sharing Grid Resources, and Elastic


Page 6: Urgent Computing, Sharing Grid Resources, and Elastic


Urgent Computing: I Need it Now!

• Applications with dynamic data and result deadlines are being deployed
• Late results are useless
  - Wildfire path prediction
  - Storm/Flood prediction
  - Influenza modeling
• Some jobs need priority access: a “Right-of-Way Token”

Page 7: Urgent Computing, Sharing Grid Resources, and Elastic


How can we get cycles?

• Build supercomputers for the app
  - Pros: Resource is ALWAYS available
  - Cons: Incredibly costly (99% idle)
  - Example: Coast Guard rescue boats
• Share public infrastructure
  - Pros: Low cost
  - Cons: Requires complex system for authorization, resource mgmt, and control
  - Examples: school buses for evacuation, cruise ships for temporary housing

Page 8: Urgent Computing, Sharing Grid Resources, and Elastic


Introducing SPRUCE

• The Vision: Build cohesive infrastructure that can provide urgent computing cycles
• Technical Challenges:
  - Provide high degree of reliability
  - Elevated priority mechanisms
  - Resource selection, data movement
• Social Challenges:
  - Who? When? What?
  - How will emergency use impact regular use?
  - Decision-making, workflow, and interpretation

Page 9: Urgent Computing, Sharing Grid Resources, and Elastic


Existing “Digital Right-of-Way” Emergency Phone System

[Diagram labels: GETS User; GETS User Organization]

• GETS priority is invoked “call-by-call”
• Calling cards are in widespread use and easily understood by the NS/EP User, simplifying GETS usage
• GETS is a "ubiquitous" service in the Public Switched Telephone Network… if you can get a DIAL TONE, you can make a GETS call

Page 10: Urgent Computing, Sharing Grid Resources, and Elastic


SPRUCE Architecture Overview (1/3): Right-of-Way Tokens

[Diagram labels: Event; Automated Trigger; Human Trigger; First Responder; Right-of-Way Token; SPRUCE Science Gateway; steps 1 and 2]

Page 11: Urgent Computing, Sharing Grid Resources, and Elastic


SPRUCE Architecture Overview (2/3): Submitting Urgent Jobs

[Diagram labels: User Team Authentication; Conventional Job Submission Parameters; Urgent Computing Parameters; Urgent Computing Job Submission; Choose a Resource; SPRUCE Job Manager; Local Site Policies; Priority Job Queue; Supercomputer Resource; steps 3, 4, and 5]

Page 12: Urgent Computing, Sharing Grid Resources, and Elastic


SPRUCE Architecture Overview (3/3): Analyzing Urgent Jobs

[Diagram labels: Supercomputer Resource Domain; Results; Specialist Interpreter; Decision Maker; steps 6 and 7]

Page 13: Urgent Computing, Sharing Grid Resources, and Elastic

Student fun with AJAX…

Page 14: Urgent Computing, Sharing Grid Resources, and Elastic
Page 15: Urgent Computing, Sharing Grid Resources, and Elastic


Site-Local Response Policies: How will Urgent Computing be treated?

• “Next-to-run” status for priority queue; wait for running jobs to complete
• Force checkpoint of existing jobs; run urgent job
• Suspend current job in memory (kill -STOP); run urgent job
• Kill all jobs immediately; run urgent job
• Provide differentiated CPU accounting
  - “jobs that can be killed because they maintain their own checkpoints will be charged 20% less”
• Other incentives
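The "suspend current job in memory" policy can be illustrated with a minimal Python sketch. It assumes a POSIX system and a single regular job running as a local child process; the helper names (`suspend`, `resume`, `run_urgent_over`) are hypothetical and are not part of SPRUCE or of any site's scheduler. A real batch system would apply the same signals through its own job manager.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Freeze a process in memory, equivalent to `kill -STOP <pid>`."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a frozen process continue, equivalent to `kill -CONT <pid>`."""
    os.kill(pid, signal.SIGCONT)


def run_urgent_over(regular: subprocess.Popen, urgent_cmd: list) -> int:
    """Suspend the regular job, run the urgent command, then resume.

    Returns the exit code of the urgent command. The regular job is
    resumed even if launching the urgent command raises an exception.
    """
    suspend(regular.pid)
    try:
        return subprocess.run(urgent_cmd).returncode
    finally:
        resume(regular.pid)


if __name__ == "__main__":
    # Stand-in for a long-running regular job (hypothetical workload).
    regular_job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    code = run_urgent_over(
        regular_job, [sys.executable, "-c", "print('urgent job done')"]
    )
    print("urgent exit code:", code)
    regular_job.terminate()  # clean up the stand-in job
    regular_job.wait()
```

A suspended job keeps its memory footprint, so this policy only helps when the urgent job fits in what is left; the checkpoint and kill policies trade that constraint for lost or replayed work.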

Page 16: Urgent Computing, Sharing Grid Resources, and Elastic


Emergency Preparedness Testing: “Warm Standby”

• In urgent computing situation, there is no time to port applications
  - Applications must be in “warm standby”
  - Verification and validation runs test readiness periodically (Inca)
  - Only verified apps participate in urgent computing
• Grid-wide Information Catalog
  - Application was last tested & validated on <date>
  - Also provides key success/failure history logs
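The warm-standby rule ("only verified apps participate") could be checked against a catalog roughly as follows. This is a sketch under assumptions: the record fields, the 30-day freshness window, and the function names are all hypothetical, chosen only to mirror the "last tested & validated on <date>" entry and the success/failure history mentioned above. It does not reflect the actual catalog schema or Inca's output format.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ValidationRecord:
    """One catalog entry (hypothetical schema)."""
    app: str
    platform: str
    last_validated: date
    passed: bool  # outcome of the most recent validation run


def is_warm(record: ValidationRecord, today: date, max_age_days: int = 30) -> bool:
    """An app is in warm standby if its last validation passed and is recent.

    The 30-day default is an assumption, not a SPRUCE setting.
    """
    age = today - record.last_validated
    return record.passed and age <= timedelta(days=max_age_days)


def warm_platforms(catalog, app: str, today: date, max_age_days: int = 30):
    """Return the sorted platforms on which `app` is currently in warm standby."""
    return sorted(
        r.platform
        for r in catalog
        if r.app == app and is_warm(r, today, max_age_days)
    )
```

For example, with illustrative records, `warm_platforms(catalog, "tornado", date(2006, 10, 1))` keeps only platforms whose latest tornado validation passed within the previous 30 days.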

Page 17: Urgent Computing, Sharing Grid Resources, and Elastic

Choosing a Resource: An Advisor

[Diagram labels: User Team; Urgent Computation Request (Deadline, Urgency Level); Advisor; SPRUCE Data; MDS4 Service; Best HPC Resource]

Warm Standby Validation History (values listed per column; row alignment across columns is not preserved)
• Validated App. (last validated): City Airflow (45 days ago), Tornado (8 days ago), Influenza (30 days ago), City Airflow (14 days ago)
• Platform: SDSC::Elimidata, SDSC::Elimidata, NCSA::Cobalt, NCSA::Cobalt
• Reliability: 59%, 78%, 98%, 95%

Site Policies (values listed per column; row alignment across columns is not preserved)
• Platform: SDSC::Datastar, SDSC::Elimidata, PSC::Rachel, NCSA::Cobalt
• Policy:
  - Normal priority, no SPRUCE support (SDSC::Datastar)
  - Automated, immediate access, kill existing jobs, 10 min turnaround
  - Automated, next job
  - Human-in-the-loop, immediate access, kill existing jobs, 15 min. turnaround

Live Job/Queue Data (values listed per column; row alignment across columns is not preserved)
• Platform: SDSC::Datastar, PSC::Rachel, NCSA::Cobalt
• Status: (5.3 hrs, 1024 nodes) (SDSC::Datastar); Immediate; Immediate; Next Available Job (Policy Based)
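One way an advisor of this kind might combine validation history, site policy, and live queue data is sketched below. Everything here is an assumption made for illustration: the `Candidate` fields, the 30-day validation window, and the ranking rule (most reliable feasible platform, ties broken by earliest finish) are hypothetical and are not taken from the SPRUCE advisor.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One platform as seen by the advisor (hypothetical fields)."""
    platform: str
    validated_days_ago: int  # from the warm-standby validation history
    reliability: float       # fraction of successful validation runs, 0..1
    est_wait_min: float      # from site policy plus live queue data
    est_run_min: float       # expected run time of the urgent job


def advise(
    candidates: Iterable[Candidate],
    deadline_min: float,
    max_validation_age_days: int = 30,
) -> Optional[Candidate]:
    """Pick a platform that can meet the deadline, or None if none can.

    A candidate is feasible if its validation is recent enough and its
    estimated wait plus run time fits within the deadline. Among feasible
    candidates, prefer higher reliability, then earlier estimated finish.
    """
    feasible = [
        c
        for c in candidates
        if c.validated_days_ago <= max_validation_age_days
        and c.est_wait_min + c.est_run_min <= deadline_min
    ]
    if not feasible:
        return None
    return max(
        feasible,
        key=lambda c: (c.reliability, -(c.est_wait_min + c.est_run_min)),
    )
```

Returning `None` rather than a best-effort choice keeps the "late results are useless" constraint explicit: the caller has to decide what to do when no platform can meet the deadline.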

Page 18: Urgent Computing, Sharing Grid Resources, and Elastic


Deployment Status

• Deployed and available:
  - UC/ANL
  - Purdue
  - TACC
  - SDSC
• Very close:
  - Indiana
  - LSU
• Ready to integrate LEAD into SPRUCE
  - First user-customer
  - Warm standby apps

Page 19: Urgent Computing, Sharing Grid Resources, and Elastic


What About “Capacity” Computing?

• SPRUCE works well with “capability” computing:
  - Interface to small set of large resources
• Imagine a larger set of smaller resources?
  - Condor management?
  - Real on-demand servers?
• Amazon S3 & EC2

Page 20: Urgent Computing, Sharing Grid Resources, and Elastic


Amazon S3 & EC2: It’s a Web Services World

• S3: Simple Storage Service
  - Cost: $0.20/GB transfer, $0.15/GB-month
• EC2: Elastic Compute Cloud
  - Cost: $0.10/cpu-hr, $0.20/GB transfer
  - No cost for internal bandwidth
• Cost is extraordinarily good
• Commoditization is good!!
• The real keys are reliability and dynamic behavior
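To make the quoted prices concrete, here is a small cost calculator. The rates are the 2006 figures from the slide; the function names and the example workload are hypothetical, and the model ignores anything the slide does not mention (request fees, minimum billing increments, and so on).

```python
# Rates as quoted on the slide (2006 prices, US dollars).
S3_TRANSFER_PER_GB = 0.20
S3_STORAGE_PER_GB_MONTH = 0.15
EC2_CPU_HOUR = 0.10
EC2_TRANSFER_PER_GB = 0.20


def ec2_cost(cpus: int, hours: float, transfer_gb: float = 0.0) -> float:
    """Compute cost of an EC2 run: cpu-hours plus external data transfer.

    Internal bandwidth is free according to the slide, so only external
    transfer is charged here.
    """
    return cpus * hours * EC2_CPU_HOUR + transfer_gb * EC2_TRANSFER_PER_GB


def s3_cost(stored_gb: float, months: float, transfer_gb: float = 0.0) -> float:
    """Compute cost of S3 usage: storage over time plus data transfer."""
    return stored_gb * months * S3_STORAGE_PER_GB_MONTH + transfer_gb * S3_TRANSFER_PER_GB


if __name__ == "__main__":
    # Hypothetical urgent run: 1000 CPUs for 5 hours, 50 GB moved in and out.
    print(f"EC2: ${ec2_cost(1000, 5, 50):.2f}")
    # Hypothetical archive: 100 GB kept for 2 months, 10 GB transferred.
    print(f"S3:  ${s3_cost(100, 2, 10):.2f}")
```

At these rates the hypothetical run works out to 1000 × 5 × $0.10 = $500 for compute plus 50 × $0.20 = $10 for transfer, paid only when the urgent event occurs, which is the contrast with a dedicated machine that sits 99% idle.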

Page 21: Urgent Computing, Sharing Grid Resources, and Elastic
Page 22: Urgent Computing, Sharing Grid Resources, and Elastic


Imagine…

• Other companies catching up…
• Commoditization (like web email)
• A standardized interface to web-service “request vm”
• Dynamic capacity provides availability of 250K “node instances”
• Urgent computing resources available immediately
• Missing bisection bandwidth, but great for capacity computing

Page 23: Urgent Computing, Sharing Grid Resources, and Elastic


The Future

• Web services interfaces to all the portal functions
• Extended submission schema
• Flexible tokens - aggregation, extension
• Encode local site policies
• Warm standby integration
• Automated ‘advisor’
• Data movement
• Redundancy to avoid downtime of portal

Page 24: Urgent Computing, Sharing Grid Resources, and Elastic


Questions? Ready to Join?

[email protected]@mcs.anl.gov

http://spruce.teragrid.org