TeraGrid-Wide Operations DRAFT #2 Mar 31 Von Welch


Page 1

TeraGrid-Wide Operations

DRAFT #2 Mar 31, Von Welch

Page 2

Highlights

• TeraGrid surpassed 1 petaflops of aggregate computing power.
  – Aggregate compute power available increased 3.5x from 2007 to 2008.
  – Primarily the result of Track 2 systems at TACC and NICS coming online.
  – NUs used and allocated increased ~4x from 2007 to 2008.
• Significant improvement in instrumentation, including tracking of grid usage and data transfers.
• Inca provides historical tracking of software and service reliability, along with a new interface for both users and administrators.
• An international security incident touched TeraGrid, resulting in a very strong incident response as well as improved procedures for a new attack vector.
• Improvements in authentication procedures and cross-resource single sign-on.

Page 3

Big Picture Resource Changes

• Sun Constellation Cluster (Ranger) at TACC, Feb ’08
  – Initially 504 Tflops; upgraded in July 2008 to ~63,000 compute cores and 580 Tflops
• Cray XT4 (Kraken) at NICS, Aug ’08
  – 166 Tflops and 18,000 compute cores
• Additional resources that entered production in 2008:
  – Two Dell PowerEdge 1950 clusters: the 668-node system at LONI (QueenBee) and the 893-node system at Purdue (Steele)
  – PSC’s SGI Altix 4700 shared-memory NUMA system (Pople)
  – FPGA-based resource at Purdue (Brutus)
  – Remote visualization system at TACC (Spur)
• Other improvements:
  – The Condor Pool at Purdue grew from 7,700 to more than 22,800 processor cores.
  – Indiana integrated its Condor resources with the Purdue flock, simplifying use.
• Decommissioned systems:
  – NCSA’s Tungsten, PSC’s Rachel, Purdue’s Lear, SDSC’s DataStar and Blue Gene, and TACC’s Maverick.

Page 4

TeraGrid HPC Usage, 2008

[Chart: quarterly NU usage, annotated with Ranger entering service in Feb. 2008 and Kraken in Aug. 2008; 3.9B NUs delivered in all of 2007 vs. 3.8B NUs in Q4 2008 alone.]

In 2008:
• Aggregate HPC power increased by 3.5x
• NUs requested and awarded quadrupled
• NUs delivered increased by 2.5x

Page 5

TeraGrid Operations Center

• Created 7,762 tickets
• Resolved 2,652 tickets (34%)
• Took 675 phone calls
• Resolved 454 phone calls (67%)

• Manage the TG ticket system and 24x7 toll-free call center
• Respond to all users and provide front-line resolution if possible (34% resolution rate)
• Route remaining tickets to RP sites and other second-tier resolution centers
• Maintain situational awareness across the TG project (upgrades, maintenance, etc.)
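The resolution percentages quoted above follow directly from the raw counts; a quick check:

```python
# Front-line resolution rates for the TeraGrid Operations Center,
# recomputed from the ticket and call counts on this slide.
tickets_created = 7_762
tickets_resolved = 2_652
calls_taken = 675
calls_resolved = 454

ticket_rate = round(100 * tickets_resolved / tickets_created)  # 34%
call_rate = round(100 * calls_resolved / calls_taken)          # 67%

print(f"ticket resolution: {ticket_rate}%, call resolution: {call_rate}%")
```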

Page 6

Instrumentation and Monitoring

• Monitoring and statistics gathering for TG services
  – E.g. backbone, Grid services (GRAM, GridFTP)
• Used for measuring adoption, detecting problems, and resource provisioning.
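A minimal sketch of the kind of service probe such instrumentation relies on: a timed TCP connect to each monitored endpoint. The hostnames below are placeholders, not actual TeraGrid endpoints; 2811 (GridFTP control channel) and 2119 (pre-WS GRAM gatekeeper) are the standard ports for those services.

```python
import socket
import time

# Placeholder endpoints for illustration; a real TG probe would target the
# GRAM and GridFTP services at each RP site and feed results into the
# statistics gathering described above.
SERVICES = [
    ("gridftp.example.org", 2811),  # GridFTP control channel (standard port)
    ("gram.example.org", 2119),     # pre-WS GRAM gatekeeper (standard port)
]

def probe(host: str, port: int, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (reachable, elapsed_seconds) for a plain TCP connect."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start

if __name__ == "__main__":
    for host, port in SERVICES:
        up, elapsed = probe(host, port)
        print(f"{host}:{port} {'up' if up else 'down'} ({elapsed:.2f}s)")
```

A TCP connect only measures reachability and latency; deeper checks (e.g. a full GridFTP transfer) belong in user-level testing, which the Inca slide covers.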

Page 7

Inca Grid Monitoring System

• Automated, user-level testing that improves reliability by detecting Grid infrastructure problems.
  – Provides detailed information about tests and their execution to aid in debugging problems.
• Originally designed for TeraGrid; used in other large-scale projects including ARCS, DEISA, and NGS.
• Improvements in 2008 include a new version of the Inca Web server, which provides custom views of the latest results.
  – The TeraGrid User Portal uses a custom view of SSH and batch job tests in its resources viewer.
• Added email notification upon test failures.
• New historical views were created to summarize overall data trends.
• Developed a plug-in that allows Inca to recognize scheduled downtimes.
• 20 new tests were written and 77 TeraGrid tests were modified.
• 2,538 pieces of test data are being collected.
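The features above (user-level tests, timestamped results for historical views, failure notification, and downtime recognition) can be sketched as a tiny harness. This is illustrative only; the names and structure are assumptions, not Inca's actual API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable

# Minimal Inca-style harness sketch: each test runs as an ordinary user
# would, every result is timestamped (for historical trend views), and a
# failure that falls inside a scheduled downtime is suppressed rather
# than reported.

@dataclass
class Result:
    test: str
    passed: bool
    detail: str        # error text kept in full to aid debugging
    when: datetime
    suppressed: bool   # True if a failure fell inside a scheduled downtime

def run_suite(tests: dict[str, Callable[[], None]],
              in_downtime: Callable[[datetime], bool]) -> list[Result]:
    results = []
    for name, test in tests.items():
        now = datetime.now()
        try:
            test()
            results.append(Result(name, True, "ok", now, False))
        except Exception as exc:
            suppressed = in_downtime(now)
            results.append(Result(name, False, str(exc), now, suppressed))
            if not suppressed:
                pass  # a real harness would email administrators here

    return results

# Example: one passing and one failing test, no downtime scheduled.
def ssh_login_works():      # stand-in for an SSH reachability test
    pass

def batch_submit_works():   # stand-in for a batch-job submission test
    raise RuntimeError("qsub: connection refused")

results = run_suite(
    {"ssh": ssh_login_works, "batch": batch_submit_works},
    in_downtime=lambda t: False,
)
print([(r.test, r.passed) for r in results])  # [('ssh', True), ('batch', False)]
```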

Page 8

TeraGrid Backbone Network

Provides a dedicated high-speed interconnect between TG high-end resources.

• The TeraGrid 10 Gb/s backbone runs from Chicago to Denver to Los Angeles, contracted from NLR.
• Dedicated 10 Gb/s link(s) run from each RP to one of the three core routers.

[Map image from Indiana University]

Page 9

Security

• Gateway Summit held to develop an understanding of the security needs of RPs and Gateways.
  – Co-organized with the Science Gateways team
  – 30 attendees from RP sites and Gateways
• User Portal password reset procedure
• Risk assessments for Science Gateways and the User Portal
• TAGPMA participation and leadership
• Uncovered a large-scale attack in collaboration with EU Grid partners.
  – Established secure communications: secure wiki, SELS

Page 10

Single Sign-on

• Java-based GSI-SSHTERM application added to the User Portal.
  – Consistently in the top 5 apps.
  – Augments command-line functionality already in place.
• Replicating the MyProxy CA at PSC to provide catastrophic failover for the server at NCSA.
  – Implemented client changes on RPs and the User Portal for failover.
• Developed a set of guidelines for management of grid identities (X.509 distinguished names) in the TeraGrid Central Database (TGCDB) and at RP sites.
  – Tests written for TGCDB; Inca tests for RPs will follow.
• Started technical implementation of Shibboleth support for the User Portal.
  – TeraGrid is now a member of InCommon (as a service provider).
  – Will transfer to the new Internet Framework.
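The client-side failover change can be sketched as: try the primary MyProxy server first, then fall back to the replica. The hostnames below are placeholders, not the real NCSA/PSC servers; 7512 is MyProxy's standard port, and the `myproxy-logon` flags shown (`-s` host, `-p` port, `-l` username, `-o` output file) are that client's real options. Whether the actual change wrapped `myproxy-logon` this way is an assumption.

```python
import subprocess

# Placeholder hostnames; the real deployment used the NCSA primary and
# the PSC replica described above. 7512 is MyProxy's standard port.
MYPROXY_SERVERS = [
    ("myproxy.ncsa.example.org", 7512),  # primary (placeholder)
    ("myproxy.psc.example.org", 7512),   # replica (placeholder)
]

def get_proxy(username: str, out_path: str,
              servers=MYPROXY_SERVERS,
              run=subprocess.run) -> tuple[str, int]:
    """Fetch a short-lived proxy credential, failing over between servers.

    Returns the (host, port) that served the request; raises RuntimeError
    if every server fails. `run` is injectable for testing.
    """
    errors = []
    for host, port in servers:
        cmd = ["myproxy-logon", "-s", host, "-p", str(port),
               "-l", username, "-o", out_path]
        proc = run(cmd, capture_output=True)
        if proc.returncode == 0:
            return host, port
        errors.append(f"{host}:{port} -> rc={proc.returncode}")
    raise RuntimeError("all MyProxy servers failed: " + "; ".join(errors))
```

Ordering the list primary-first keeps the replica idle except during an outage, matching the "catastrophic failover" role described on the slide.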

Page 11

END OF PRESENTATION

Reference material and future plans slides for Towns follow.

Page 12

Allocation Statistics

Page 13

Tickets per Quarter, TeraGrid Operations Center

[Bar chart: tickets per quarter from 1st Qtr 2003 through 3rd Qtr 2008; y-axis 0–800 tickets.]

Page 14

New Resources for 2009

• The NICS Kraken system was upgraded in February 2009 to a 66,048-core, 600-Tflops Cray XT5 system.
• NCSA placed the 192-node GPU-accelerated Dell PowerEdge 1950 cluster, Lincoln, into production.
• Further planned additions for 2009 include NCAR’s Sun Ultra 40 system dedicated to data analysis and visualization.

Page 15

Inca Plans for 2009

• Integration of Inca into the Internet Framework.
• Create an interface for RP administrators to execute tests on demand.
• Integrate with ticket systems to connect tickets to tests.
• Start work on a Knowledge Base of errors, causes, and solutions.
• Develop and maintain views based on the needs and output of the QA and CUE groups.

Page 16

SSO Plans for 2009

• Complete the PSC deployment of the backup MyProxy service.
• Complete integration of Shibboleth support into the Internet Framework.
  – Develop a full trust model for TeraGrid/campuses
  – Start recruiting campuses and growing usage
• Work on bridging authorization with OSG and EGEE to support other activities.

Page 17

Other Continuing Tasks

• TOC
  – 24x7x365 point of contact
  – Trouble ticket creation and management
• Helpdesk
  – First-tier support at RP sites, integrated with the TOC
• Instrumentation services
• Backbone network and network coordination
• Security coordination, TAGPMA, etc.