2010 ASQ World Conference on Quality and Improvement
May 24-26, 2010, St. Louis, MO

Quality in Chaos: a view from the TeraGrid environment

John Towns
TeraGrid Forum Chair
Director of Persistent Infrastructure
National Center for Supercomputing Applications
University of Illinois
[email protected]

with the assistance of many TeraGrid colleagues!!

What is Cyberinfrastructure?

• Computing systems,

• data storage systems, and data repositories,

• visualization environments,

• and people,

• all linked together by high performance networks.

The Vision of TeraGrid

• Three part mission:
  – support the most advanced computational science in multiple domains
  – empower new communities of users
  – provide resources and services that can be extended to a broader cyberinfrastructure
• TeraGrid is…
  – an advanced, nationally distributed, open cyberinfrastructure comprised of supercomputing, storage, and visualization systems, data collections, and science gateways, integrated by software services and high bandwidth networks, coordinated through common policies and operations, and supported by computing and technology experts, that enables and supports leading edge scientific discovery and promotes science and technology education
  – a complex collaboration of over a dozen organizations and NSF awards working together to provide collective services that go beyond what can be provided by individual institutions

What is TeraGrid? (simple definition)

A complex collaboration of over a dozen organizations working together to provide cyberinfrastructure that goes beyond what can be provided by individual institutions, to improve research productivity and enable breakthroughs not otherwise possible.

TeraGrid Objectives

• DEEP Science: Enabling Petascale Science
  – make science more productive through an integrated set of very-high capability resources
    • address key challenges prioritized by users
• WIDE Impact: Empowering Communities
  – bring TeraGrid capabilities to the broad science community
    • partner with science community leaders - “Science Gateways”
• OPEN Infrastructure, OPEN Partnership
  – provide a coordinated, general purpose, reliable set of services and resources
    • partner with campuses and facilities

What you can do with the TeraGrid: Simulation of cell membrane processes

Work by Emad Tajkhorshid and James Gumbart of the University of Illinois Urbana-Champaign.
  – Mechanics of Force Propagation in TonB-Dependent Outer Membrane Transport. Biophysical Journal 93:496-504 (2007).
  – Results of the simulation may be seen at www.life.uiuc.edu/emad/TonB-BtuB/btub-2.5Ans.mpg

• Modeled mechanisms for transport of molecules through cell membrane.

• Used 400,000 CPU hours [45 processor-years] on systems at National Center for Supercomputing Applications, IU, Pittsburgh Supercomputing Center

Image courtesy of Emad Tajkhorshid, UIUC
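
The bracketed processor-years figure above follows from dividing CPU hours by the number of hours in a year. A minimal sketch of that conversion, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def cpu_hours_to_processor_years(cpu_hours: float) -> float:
    """Convert CPU hours into the equivalent years of one processor running nonstop."""
    return cpu_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    years = cpu_hours_to_processor_years(400_000)
    print(f"{years:.1f} processor-years")
```

With 400,000 CPU hours this comes to a little under 46 processor-years, consistent with the slide's rounded figure of 45.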

TG App: SCEC-PSHA

• Part of Southern California Earthquake Center (Tom Jordan, USC)

• Using large scale simulation data, estimate probabilistic seismic hazard (PSHA) curves for sites in southern California (probability that ground motion will exceed some threshold over a given time period)

• Used by hospitals, power plants, schools, etc. as part of their risk assessment

• For each location, need a Cybershake run followed by roughly 840,000 parallel short jobs
  – parallelize across locations, not individual workflows

• Completed over 300 locations to date, targeting 2000 sites in 2010

Managing these requires effective grid workflow tools for job submission, data management and error recovery, using Pegasus (ISI) and DAGman (Wisconsin)

Information/image courtesy of Phil Maechling
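
The hazard-curve definition on this slide (probability that ground motion exceeds a threshold over a given period) can be illustrated with a small sketch. This is not the SCEC or CyberShake code: it assumes earthquake occurrence is Poisson, and the rupture rates, ground-motion values, and function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def annual_exceedance_rate(
    rupture_rates: Sequence[float],
    ground_motions: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Sum over ruptures of (annual rate) x (fraction of simulated motions above threshold)."""
    total = 0.0
    for rate, motions in zip(rupture_rates, ground_motions):
        if not motions:
            continue  # no simulations for this rupture, so it contributes nothing
        fraction_above = sum(1 for m in motions if m > threshold) / len(motions)
        total += rate * fraction_above
    return total


def exceedance_probability(rate_per_year: float, years: float) -> float:
    """Probability of at least one exceedance in `years` (assumption: Poisson occurrence)."""
    return 1.0 - math.exp(-rate_per_year * years)


def hazard_curve(
    rupture_rates: Sequence[float],
    ground_motions: Sequence[Sequence[float]],
    thresholds: Sequence[float],
    years: float,
) -> List[Tuple[float, float]]:
    """Return (threshold, exceedance probability) pairs for one site."""
    return [
        (x, exceedance_probability(
            annual_exceedance_rate(rupture_rates, ground_motions, x), years))
        for x in thresholds
    ]


if __name__ == "__main__":
    rates = [0.01, 0.002]                          # hypothetical annual rupture rates
    motions = [[0.1, 0.3, 0.5], [0.4, 0.8, 1.2]]   # hypothetical simulated peak motions (g)
    for x, p in hazard_curve(rates, motions, [0.2, 0.6, 1.0], years=50):
        print(f"P(motion > {x} g in 50 yr) = {p:.4f}")
```

Each site's curve is independent of the others, which is why the slide's advice to parallelize across locations fits this kind of calculation.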

What is the TeraGrid?

• An instrument that delivers high-end IT resources/services: computation, storage, visualization, and data/services
  – a computational facility: over a PetaFLOP in parallel computing capability
  – a data storage and management facility: over 20 PetaBytes of storage (disk and tape), over 100 scientific data collections
  – a high-bandwidth national data network

• A service: help desk and consulting, Advanced Support for TeraGrid Applications (ASTA), education and training events and resources

• Something you can use without financial cost
  – research accounts allocated via peer review
  – Startup and Education accounts automatic

• World’s largest distributed cyberinfrastructure for scientific research
  – supported by National Science Foundation

11 Resource Providers, One Facility

[Figure: map of TeraGrid sites. Labels shown: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC/ISI, UNC/RENCI, UW, NICS, LONI. Legend: Resource Provider (RP), Software Integration Partner, Grid Infrastructure Group (UChicago), Network Hub.]

TeraGrid Resources and Services

• Computing
  – more than one petaflop of computing power today and growing
    • 500 Tflop Ranger (Sun Constellation) at Texas Advanced Computing Center (TACC)
    • 1.03 PFlop Kraken (Cray XT5) at National Institute for Computational Sciences (NICS), University of Tennessee

• Remote visualization servers and software
  – 60 TFlop Condor-based viz resource at Purdue University

• Data
  – allocation of data storage facilities
  – over 100 Scientific Data Collections

• Central allocations process

• Technical Support
  – central point of contact for support of all systems
  – Advanced Support for TeraGrid Applications (ASTA)
  – education and training events and resources
  – over 30 Science Gateways

How is TeraGrid Organized?

• TG is set up like a large cooperative research group
  – evolved from many years of collaborative arrangements between the centers
  – still evolving!

• Federation of 12 awards
  – Resource Providers (RPs)
    • provide the computing, storage, and visualization resources
  – Grid Infrastructure Group (GIG)
    • central planning, reporting, coordination, facilitation, and management group

• Strategically led by the TeraGrid Forum
  – made up of the PIs from each RP and the GIG
  – led by the TG Forum Chair, who is responsible for coordinating the group (elected position)
    • John Towns, TG Forum Chair
  – responsible for the strategic decision making that affects the collaboration

• Day-to-Day Functioning via Working Groups (WGs):
  – each WG under a GIG Area Director (AD), includes RP representatives and/or users, and focuses on a targeted area of TeraGrid

Impacting Many Agencies

[Two pie charts, captioned "Supported Research Funding by Agency" ($91.5M Direct Support of Funded Research) and "Resource Usage by Agency" (10B NUs Delivered). Percentages are listed below in the order the two charts appear in the transcript.]

Agency          First chart   Second chart
NSF             52%           49%
DOE             13%           11%
NIH             19%           15%
NASA            10%           9%
DOD             1%            5%
International   0%            3%
University      2%            1%
Other           2%            6%
Industry        1%            1%

So why are you here anyhow??

• For moderate scale research projects funded by federal agencies, quality is an afterthought
  – $10s of millions/year
  – well.. perhaps just assumed as implicitly needed
  – no explicit treatment of quality in many programs

• Of course, large scale projects have quality as a first class concern
  – $100s of millions/year
  – DOD has recognized the importance of quality in modeling and simulation efforts
    • specifically designed verification, validation, and accreditation (VV&A) processes
    • understand the simulation’s capabilities, limitations, and performance relative to the real-world objects it simulates
    • http://vva.msco.mil/
  – NSF MREFC planning processes have quality concerns stated in solicitations for these projects

TeraGrid is no exception

• Initially defined largely as a technology research activity with intent to support production (academic definition) resources

• Behaved organizationally much like an individual investigator research team
  – lack of clear structure and processes

• In the end, TeraGrid is both operations and research
  – operations:
    • facilities/services on which researchers rely
    • infrastructure on which other providers build
  – research:
    • learning how to do distributed, collaborative science on a global, federated infrastructure
    • learning how to run multi-institution shared infrastructure

• Further, lack of recognition of what TeraGrid really is
  – an emerging and evolving infrastructure for enabling science and engineering
  – (initially) treated as a research project

• Thus, something of a “perfect storm”

but….

• TeraGrid has become quite successful anyhow
  – the picture was perhaps not so bleak
    • participant centers embodied a great deal of experience and expertise
    • no lack of vision (perhaps too much) or passion amongst participants
    • we came to some basic realizations

• Fundamentally, we had to mature as a distributed infrastructure organization
  – while we provided many technically interesting things, we had lost sight of the “quality” of what we provided
    • we had to understand what that meant!

Quality in a TeraGrid Context

• What did this mean for us?
  – TeraGrid must deliver important services reliably and without barriers to entry to a community of scientists and engineers not interested in the nerdy details we TeraGrid geeks loved to wallow in…

TeraGrid faced many challenges on this front …

• Relied on software we obtained elsewhere and had little control over
  – “academic grade” quality was a generous description for much of it

• We integrated this software along with many services into a distributed environment
  – resources at various sites governed by conflicting policies
  – software often not based on standards or did not comply with them

• The distributed organization presented many faces to the user community for many of the services provided
  – participants desire to maintain their own identity while playing nice in the larger environment

• TeraGrid had (has) many organizational challenges
  – no strong central management/authority
  – participants frequently pitted against one another in life/death funding competitions

• And the list goes on…

What TeraGrid needed to do…

• Create a more stable distributed environment and facilitate use by the user community
  – institute basic quality assurance mechanisms
    • Quality Assurance working group
  – increased stability/reliability of software infrastructure
    • Inca system
  – new interfaces to environment
    • Science Gateways, workflow support

• Reduce the number of faces presented to the user community
  – reduce electronic interfaces
    • User Portal, POPS, trouble ticket submission
    • create common user environment across multiple heterogeneous systems
  – reduce “human faces”
    • centralized helpdesk, integrated/coordinated advanced support functions

• Focus on facilitating use and not new technology development
  – support for new and advanced users
  – understand the challenges users face in our environment

Something Important Going for Us

• TeraGrid was not a revolutionary idea suddenly instituted as a project
  – built on a long history of NSF-funded supercomputing centers
    • initially funded in 1985
  – a progression of NSF programs
    • a handful of major centers funded early on
    • some loose collaboration of those centers through the 1980s and 1990s
    • first NSF program to fund collections of centers in 1997
    • evolution of that program to TeraGrid

• This provided an important resource
  – staff with a passion for delivering resources and services to support science and engineering
  – a culture of striving to do our best in this developed

• But…
  – most staff were subject matter experts and not process driven
  – we regularly work with cutting edge technologies
    • no luxury of spending 2 years developing software using traditional software engineering processes

Creating a stable and reliable environment: QA Working Group

• Goal: improve reliability of production TeraGrid software components/services

• Increase reliability of services:
  – prioritize testing/debugging of services most relevant to users
  – identify existing tests to be used and/or develop new tests
    • improving the use of the Inca monitoring framework

• Increase availability of CTSS services:
  – improve time from detected failure to notification
  – map errors to potential problem resolution procedures

• Develop/propose a more formal process for CTSS software deployment
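
The two availability goals above (shorter time from detected failure to notification, and a mapping from errors to resolution procedures) can be pictured with a small sketch. This is not the Inca API; the error codes, procedures, and names below are hypothetical and stand in for whatever the working group actually recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical error-to-procedure table; real entries would come from support staff.
RESOLUTION_PROCEDURES = {
    "gridftp_timeout": "Check the data transfer service and restart it if unresponsive.",
    "cert_expired": "Renew the host certificate and reload the service.",
}
DEFAULT_PROCEDURE = "Escalate to the helpdesk for manual triage."


@dataclass
class ProbeFailure:
    """One failed service test, with when it was detected and when staff were notified."""
    service: str
    error_code: str
    detected_at: datetime
    notified_at: Optional[datetime] = None


def resolution_for(failure: ProbeFailure) -> str:
    """Look up the suggested procedure for a failure, falling back to escalation."""
    return RESOLUTION_PROCEDURES.get(failure.error_code, DEFAULT_PROCEDURE)


def notification_delay(failure: ProbeFailure) -> Optional[timedelta]:
    """Time from detection to notification, or None if nobody has been notified yet."""
    if failure.notified_at is None:
        return None
    return failure.notified_at - failure.detected_at


if __name__ == "__main__":
    failure = ProbeFailure(
        service="data-transfer",
        error_code="gridftp_timeout",
        detected_at=datetime(2010, 5, 24, 9, 0),
        notified_at=datetime(2010, 5, 24, 9, 5),
    )
    print(resolution_for(failure))
    print(notification_delay(failure))
```

Tracking the delay per failure makes "improve time to notification" something that can be measured rather than only asserted.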

Creating a stable and reliable environment: Build & Test Facility

• Lower software build and support costs across providers

• Improve software quality

• Make software builds reproducible

• Faster software turnaround time

• Provide public access to software manufacture process
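
One way to picture the "reproducible builds" goal is a fingerprint over the build inputs, so that two builds from identical inputs can be recognized as such. The sketch below is an illustration only, not part of the NMI Build/Test Framework; the function name and spec fields are hypothetical.

```python
import hashlib
import json


def build_fingerprint(spec: dict, source_bytes: bytes) -> str:
    """Return a SHA-256 hex digest that changes whenever the spec or the sources change."""
    # Serialize the spec with sorted keys so key order does not affect the digest.
    canonical_spec = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(canonical_spec.encode("utf-8"))
    digest.update(b"\0")  # separator between the spec and the source digest
    digest.update(hashlib.sha256(source_bytes).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    spec = {"package": "example-tool", "compiler": "gcc", "flags": ["-O2"]}  # hypothetical
    print(build_fingerprint(spec, b"source tarball contents"))
```

Storing such a fingerprint next to each entry in a build results database would let a later build be compared against an earlier one from the same inputs.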

[Figure: Build & Test Facility architecture. Labels shown: TG Submit Host (build.teragrid.org); TeraGrid Software Sources and Binaries; NMI Build/Test Framework; Condor / Condor-G / DAGMan; TG Build Pool; NMI Framework Wrapper; Condor Startd; TG Software Build Scripts; TG Build Tools; TeraGrid Recipes (Build Scripts, Build Specs); Build Results DB; Build Job Info. Legend: Central framework, Evolved TG component, New TG component.]

Reducing the number of electronic faces

• TeraGrid User Portal
  – access RP resources and special multi-site services
  – current, up-to-date information about TG environment
  – manage and monitor allocations via common tools
  – first line of support for users
    • documentation, information about hardware and software resources
    • education, outreach and training events and resources

• Common User Environment Working Group
  – remove barriers to user movement between TeraGrid resources
  – coordinate with RP staff and TG WG
    • CUE Management System, CUE Build Environment, CUE Testing Platform, CUE Variable Collection

• Science Gateways
  – user access without allocation request
    • simplifies access to resources
    • immediate reach to communities of researchers

What is a Science Gateway?

• A Science Gateway
  – enables scientific communities of users with a common scientific goal
  – uses high performance computing
  – has a common interface
  – leverages community investment

• Three common forms:
  – web-based portals
  – application programs running on users' machines but accessing services in TeraGrid
  – coordinated access points enabling users to move seamlessly between TeraGrid and other grids

How can a Gateway help?

• Make science more productive
  – researchers use same tools
  – complex workflows
  – common data formats
  – data sharing

• Bring TeraGrid capabilities to the broad science community
  – lots of disk space
  – lots of compute resources
  – powerful analysis capabilities
  – nice interface to information

Support for New and Established Users

• Advanced Support for TeraGrid Applications (ASTA)
  – request help with code optimization, workflow improvement and gateways through
• TeraGrid Pathways
  – new user support, mentoring, fellowships
• Campus Champions
  – individuals at your institution to offer support
• HPC University
  – online public resources
• TeraGrid Annual Conference
  – showcases capabilities, achievements and impact of TeraGrid in research
  – presentations, demos, posters, visualizations
  – tutorials, training and peer support
  – student competitions and volunteer opportunities

Understanding the Challenges Users Face

• Established User Interaction Council
  – key group of project leaders chaired by Director of Science

• Regular analysis of trouble tickets to identify problem areas
  – leverage expertise and experience of other staff in resolving
    • often results in an agreement among support teams at the 11 RPs how to (better, faster) resolve problems in future
  – relevant insights are promptly reflected in the online materials
    • documentation, User Portal, Knowledge Base
  – cross-cutting operational issues identified and reported to the User Interaction
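
The regular trouble-ticket analysis described above amounts to counting which problem areas recur most often. A minimal sketch, assuming each ticket is a dictionary with a hypothetical "category" field (the real ticket system's schema is not given in the talk):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_problem_areas(tickets: Iterable[dict], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent ticket categories with their counts."""
    counts = Counter(t.get("category", "uncategorized") for t in tickets)
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical tickets for illustration only.
    sample = [
        {"id": 1, "category": "login"},
        {"id": 2, "category": "file transfer"},
        {"id": 3, "category": "login"},
        {"id": 4, "category": "job submission"},
        {"id": 5, "category": "login"},
        {"id": 6, "category": "file transfer"},
    ]
    for category, count in top_problem_areas(sample):
        print(f"{category}: {count}")
```

The ranked list is only a starting point; the slide's emphasis is on what support staff agree to do about the recurring areas.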

But what did we learn?

• We did many things to improve the quality of the product we delivered to our customers
  – established many practices and procedures
    • adopted many formal software engineering practices
  – improved the user experience in using our resources and services
    • paid attention to the experiences our users had in making use of the environment

• But these were not the heart of what has made us successful

It's all about the people!

• Staff with a passion to produce a quality product in the form of integrated software, services and resources
  – who were willing to go beyond traditional research activities to attain the goal

• Staff with a passion to enable the work of scientists and engineers
  – with the expertise in the use of advanced technologies

• Staff with a vision for excellence
  – who connected with our user community on many levels

Never underestimate the value of the staff working on your projects!

Questions?