TeraGrid Review Summary

Charlie Catlett, TeraGrid Director

University of Chicago and Argonne National Laboratory

March 2006

Panel Comments

The project has made a very good start.

The functionality provided in the Coordinated TeraGrid Software and Services (CTSS – software stack) provides the basic elements of a uniform computing job management system, a common and strong user authentication system, and basic global data handling tools across all of the TeraGrid resources.

A very good start has also been made on a uniform global file system that can manage both on-line and near-line data (almost every user’s first priority after the basic functions).

Successful federated operations across the eight sites, security, and user services groups are in place and functioning well.

All of this is running on a very impressive set of resources comprised of some 50 teraflops of computing resources and 1.5 petabytes of online storage, interconnected with a minimum of 10 Gb/s network bandwidth.

This current version of TeraGrid is being used by an impressive collection of early adopters, some of whom are doing things that they could not do without the TeraGrid.

Panel Comments: Impact and Demand

It is clear that TeraGrid is supporting a large amount of research activity and demand for its services is high.

TeraGrid currently has 1,600 users, and plans to expand this to 2,500 in 2007 and to 6,000 by 2010.

There is encouraging evidence of innovation, both in terms of scientific results and methodology.

REC: Get Users Involved in Planning

(R24) Major facilities like TeraGrid must have early and substantial advice from the broad user community, not just a collection of experienced and highly motivated early adopters. All such large-scale projects need such a group that not only meets in “all hands” style meetings, but also has an executive group that should participate in the GIG meetings. The GIG should move as quickly as possible to form such a body.

– Cyberinfrastructure User Advisory Committee (CUAC) is currently being selected and will meet for the first time at the TeraGrid’06 conference in June 2006.

– While the original CUAC design was to advise TeraGrid, NSF OCI has suggested that CUAC advise all OCI projects. This may be less useful to TeraGrid than a CUAC that is more targeted.

– The Michigan project (or whoever is selected) is also a critical element in obtaining user feedback.

– Science Gateway projects bring requirements from entire communities, and indeed this input has already driven CTSS v3 (Globus web services), a new information services design, and plans for enhanced authorization frameworks.

REC: Assessing Impact

(R3) Research supported by TeraGrid should be documented more thoroughly and systematically, with more background to make clearer the contribution each project makes to its specific field(s), and more details of how TeraGrid has contributed. The TeraGrid team should engage users more closely in this activity, perhaps by making it a condition of support. The use made of data currently collected to quantify research impact should be improved and a program of work initiated to define a richer set of research impacts, identify metrics for them, and collect and analyze the data. This should be done independently of any future external evaluation process.

– We initiated a RAT specifically for these purposes in November 2005 (it is in process).

•This is a significant challenge that cannot be readily addressed with incremental changes to the measurement techniques our community has developed over 20 years delivering HPC cycles at supercomputer centers.

There is a great need for systematic and thorough assessment of the entire project at every level. For example:

REC: Assessing Impact

(R7) The plan to introduce better monitoring of how the TeraGrid is used and how various services are used over the remainder of the grant period is a good one, and should be useful in guiding software selection and other administrative decisions for the future. This fits well with the panel’s overarching recommendation of incorporating better assessment and response capabilities into all levels of the TeraGrid, Core, and RP projects.

(R2) It might help to have users categorize their papers so as to distinguish between, for example, results and methodological outputs.

– We are in the process of developing new guidelines for TeraGrid users with respect to reporting papers and acknowledging use of TeraGrid. We will incorporate this recommendation into our recommendations and requirements for users.

REC: Process

(R4) The methodology for the selection and prioritization of projects should be documented in detail and performed on a regular basis several times a year. The process should be open, formal, and inclusive. Addressing such issues is important to wide community acceptance of the TeraGrid approach. Furthermore, reflecting the science priorities of the NSF and the nation is important in the process of clarifying selection prioritization.

– We are finalizing the processes for selecting ASTA projects, and intend to work with the CUAC to optimize the selection criteria and processes.

– The Science Gateway program is less exclusive, however, and we have welcomed many gateways into the project since its inception.

Panel Comments: Architecture

Implementation of the project over the past fifteen months has been quite satisfactory. The review committee made a number of positive observations.

– The most important positive observation is that architectural choices made by the project early on have been a key to successes thus far demonstrated by the TeraGrid. The project participants have done an excellent job in deploying a grid environment with a diverse set of resources that range from large distributed memory clusters to Condor flocks, and include data collections, storage, visualization and instruments. They are also commended for the innovative approach to deal with user communities through “gateways”—an approach that provides improved scalability as opposed to dealing with single users.

REC: Adding New Services

(R9-a) In addition to grid-appropriate scheduling software, the panel identified other missing capabilities. If the TeraGrid is to promote collaborative science, it needs software components such as persistent CVS servers, Bugzilla servers, and other software to support collaboration.

– While we do have internal CVS, Bugzilla, forum, and other collaborative services, we are considering whether there is enough user demand to incorporate them into the user portal. However, we find that science gateways are likely the best place for this functionality.

(R9-b) Another need identified by the panel regards workflow software tools: many of the RPs are developing workflow applications, but there’s a more important need for such software to be incorporated into the TeraGrid itself. A mechanism for combining the best solutions or electing an appropriate solution is needed, and deployment of the chosen solution is important to usability for the entire TeraGrid.

– Indeed this is precisely our approach.

•We provide DAGMan as part of CTSS due to its popularity, and have deployed other tools such as VDS and GridShell. Workflow is indeed an important capability, and tends to be closely tied to the applications: each community has its own favorite workflow approach and there is no widely accepted “standard.”

– We have thus ensured that TeraGrid resources provide services that these various workflow approaches require, allowing users to use their package of choice or a selection of useful alternatives.

•TeraGrid’s information services, job submission, and other services are based on standards (de facto or, increasingly, web services) and thus any workflow system can readily incorporate TeraGrid resources.

REC: End-to-End Tools

(R12) There should be monitoring tools integrated with TeraGrid for end-to-end application+network performance so that when a user has a problem, the exact path at issue can be tested.

– Our user support and networking teams include experts to work with users who require this level of analysis. The tools and expertise to diagnose end-to-end performance are generally not packaged, or designed, for non-expert use.

•To date this approach appears to be working for our user community.

– Developing or packaging such tools would be a significant endeavor, and one that even large-scale network provider partners (Abilene, ESnet, NLR, others) have not undertaken.

REC: Beyond Computation

(R16) The Grid vision is that it can be much greater than infrastructure designed to access and harness distributed computational resources. There are also opportunities for data grids and for collaboration grids. The TeraGrid recognizes that data management and collaboration services are also important elements to include in the infrastructure although progress towards realizing those capabilities has not kept pace with progress on computational issues. The review committee recommends that the project move as quickly as possible to incorporate capabilities that will extend the utility of the TeraGrid in these directions to appeal to an even broader community.

– This is largely paced by maturity of solutions. We have asked Kelly Gaither (TACC) to join the GIG as an Area Director to oversee planning for a variety of new services such as better integrating our data collections. Many of the lessons here are coming from science gateway partners.

(R23) A set of policies and procedures should be developed to ensure the sustainability and persistence of enterprise-critical programs throughout their useful life cycle, with attention to issues including the possible adoption of standards, licensing vs. open source, and funding for maintenance and support. The GIG must figure out how to replicate the sort of trust that is enjoyed by the Core centers in the federated TeraGrid.

– (R23) This is incorporated into the selection criteria for TeraGrid capabilities, and where possible we have selected standard protocols and interfaces. Where standards are not yet available we have worked with software providers who are either creating standards or who are committed to adoption of standards. We are also looking at commercial solutions as they become feasible.

REC: Science Gateways Program

(R20) The concept of science gateways is very good, but the implementation of gateways is immature and ways to improve this should be better articulated in the plan for 2006.

– The concept of science gateways is relatively new, but in the six months since we initiated science gateways work, progress has been significant and the impact has been global. Because the concept resonates so deeply with the community, it is easy to forget that it is a very new activity.

Panel Comments: Security

User identity management and authentication seem well implemented in the current instantiation of the TeraGrid.

The security stance of TeraGrid and the RPs is generally excellent. Their stance is informed by dramatic real-world experience, especially dealing with the successful, large-scale hacker attack that started in early 2004.

As a result of this there is a well structured organization for dealing with security issues, there are documented best practices that all RPs are required to adhere to, and there are well developed prophylactic, response, and recovery procedures.

Panel Comments: Technical Approach

The approach for handling the software environment has worked well—there are several important developments. One outcome is that the project has aligned the build process with that which has come out of the NMI effort.

The whole data management environment of the TeraGrid is well focused. It has the capability of enhancing the outreach of scientific applications, and it is perfectly adapted to the vision and goals of the TeraGrid infrastructure.

These two services [GPFS, GridFTP] are perfectly adapted to the TeraGrid high capability environment.

The plan to introduce better monitoring of how the TeraGrid is used, what services are used, etc. over the next year is a good one, and should be useful in guiding software selection in the future. 

REC: Data Management

(R17) The rapid adoption by users of GPFS-WAN is a very positive experience that underlines the relevance of the TeraGrid vision. TeraGrid should be prepared to rapidly expand the disk storage available to this very useful and visible service.

– The engineering of the network between the LA and Chicago hubs took into account the growing use of GPFS, and this is why we anticipate moving to 20 Gb/s by mid 2006.

– We are also working with other RPs to investigate the feasibility of bringing additional GPFS resources into service at their sites, providing both additional storage capacity and robustness by avoiding single points of failure.

– This recommendation suggests that NSF solicitations for new resource investments should be designed with a view toward data requirements in addition to computational power.

REC: Coordinated Visualization

(R25) Closer coordination of visualization activities across the TeraGrid is needed. In particular, the visualization activities at UC/ANL, TACC and NCSA (e.g. the Visualization Pipeline) should be reviewed at the GIG level and a coordination plan put in place.

– The majority of visualization user support is done with non-TeraGrid funding; however, there is good internal collaboration via a working group led by ANL (Papka) and TACC (Gaither).

– Kelly Gaither’s responsibilities as a new GIG AD demonstrate an increased priority on this coordination, and she will work with Mike Papka at ANL to further address this area.

– ANL, TACC and Purdue are developing a joint tutorial for the TeraGrid ’06 conference, and our EOT team is discussing with them a TeraGrid Institute session focused on visualization.

(R26) The allocation of resources on the TeraGrid needs to be made cognizant of specialized resources, such as visualization hardware and software, so that those resources can be specified in allocation requests and allocated as a scarce resource.

– We initiated a RAT in 2005 to look at non-compute allocations and have begun to tie this work into our accounting planning. We anticipate that we will begin to address this in 2006.

– We are also examining the allocations process to this end, given that the process was developed during an earlier era when the only service was high-end compute resources for individual PIs.

•We will be working with the CUAC to evaluate the allocations process and help us evolve it.

REC: Deep and Wide Balance

(R25) The tension between the stated “Deep” and “Wide” goals of the TeraGrid exists in the software realm as well as in other areas of TeraGrid. With “width” comes many software scaling problems, allocation management problems, and user policy decisions. These must be addressed carefully over the next several years, since choices made now will affect scalability in the future. The committee has reiterated the need to recognize coming petascale computing technology: this technology will need to be integrated into the TeraGrid, and will have significant software implications.

– We recognize these challenges and they factor into every level of our plans. We do not view these initiatives as being in tension, but rather complementary.

– The “Wide” focus has in fact initiated improvements of benefit to the “Deep” community as we find increasing synergy between the needs of the two types of users.

Panel Comments: Management, Scope

The design, management approach, and scope of the projects appear to be poised to provide a powerful next generation cyberinfrastructure to the U.S. research and education community.

An excellent start has been made on the most basic functionality of a uniform shell environment and all aspects of its operational support across all eight Resource Provider sites.

The structure of the federation of Resource Providers to ensure software consistency, operational reliability, and security, appears to be headed for success, at least at the current scale of the system.

Resources and Resource Providers

Metrics are needed for the GIG to assess the RPs. For example, ORNL had the goal of bringing an instrument onto TG, and they, in particular, need a metric for evaluating the success of this effort. The panel feels that the choice of this particularly complex instrument as the first (and currently only) example of integrating instrumentation into the TeraGrid will prove challenging. Consideration should be given to the integration of additional instrumentation efforts, possibly linking with other related NSF supported activities.

– The GIG is working toward more precise service descriptions and these will provide metrics that will support continuous improvement overall, by the GIG, and by RPs. We are also working to define non-computational services such as data and instrument services.

(R27) As a TeraGrid RP, Purdue provides an important set of capabilities to the TeraGrid. In addition to making the technical and science gateway advances planned for 2006, it is critical that the Purdue Condor clusters be fully integrated into the TeraGrid and used to help offload smaller jobs from larger TeraGrid supercomputers that are becoming overloaded.

– Purdue’s Condor clusters are fully integrated today.

– Fully harnessing these types of resources at all of the sites is a priority for 2006. The roaming allocation capability sets the stage to provide incentives for users to self-select as well.

REC: Robustness and Scale

(R15) One of the issues that the project has not addressed adequately is robustness for scalability—applied to systems, resource providers, and number of users. The project should explicitly address this issue in future plans.

– Verification and validation have been an emphasis of TeraGrid since early production with the design, development, and incorporation of Inca in 2003.

– CTSS v3 represents significant streamlining and improving robustness of the TeraGrid software and services.

– The next step will be to more clearly define service levels, including metrics that can be used to drive improvements in stability of components.

REC: User Documentation

(R6) The quality, structure and content of available documentation are uneven across the various components of the TeraGrid. The TeraGrid, including the GIG and all RPs, should re-program sufficient resources to provide centralized, cohesive documentation with centralization of both version and URL control, which is essential to ensure the usability of the documentation.

– We recognized this as a significant issue in early 2005 and worked with a RAT to begin to address the largest holes, but there is much work yet to be done.

– One result of the RAT was to adjust GIG budgets to support an FTE at Indiana to convert existing and new training and education material to leverage their knowledgebase system. This work began in late 2005.

– The RAT, consisting of external relations staff from many RP sites, continues to work with Scott Lathrop as a working group to coordinate this activity.

– We have also seen confusion with much of the same material being hosted at both TeraGrid and CIP websites, and are working with the NCSA and SDSC leaders to “combine forces” to create a deeper, more coherent set of materials for the users, and a central point from which they can access all of these resources available from all of the RPs.

The user documentation that is currently available for the TeraGrid is a good first start, but substantial enhancements and improvements are required.

REC: User Documentation

(R11) The panel acknowledges that a “primer” document exists, but the initial release is inadequate with respect to providing necessary information for beginning to use the TeraGrid. A substantive and useful introductory document that links to the more comprehensive documentation products that already exist or are contemplated and a plan to maintain and update it as the facility scales up are needed.

– The primer is not aimed at end-users, but was written to outline the processes and steps necessary for integrating a new computational resource into TeraGrid. As such, it is certainly not appropriate for new TeraGrid users.

– We have structured our online documentation and training materials to help new users, adding in the past year new features such as process flowcharts for some of the more complex operations. As part of this effort we are developing an increasingly rich new user primer.

REC: Training Materials

(R14) Much more online (web-based) training material and educational information is needed. This may be more straightforward than live user training. Sufficient human resources should be dedicated to the preparation, maintenance and updating of user manuals, copious examples and other kinds of documentation.

– In practice, good online training is much more difficult to design than live user training, where immediate feedback allows the instructor to adjust the material and pace.

– We have been working with the science gateway and RP partners to harvest the “best of” their growing collection of TeraGrid training and educational materials.

•Evaluation of these materials will be essential as the goal is not quantity but quality.

REC: User portal

(R18) The review panel recommends that, given the potential value of the user portal to contribute to realizing the “TeraGrid Wide” vision, the GIG team should work with a representative sample of the user community to evaluate and enhance the functionalities of the portal.

– The user portal design to date has come from extensive prior experience working with NPACI users as well as current work with TACC and TeraGrid users.

– We intend to work with the CUAC in this regard as well.

(R6) The existence and utility of the user portal should be more widely advertised, documented, and described in the training material.

– As the user portal moves from prototype to production this year we intend to invest heavily in advertising, documenting, and training for this important new capability.

•The User Portal will be one of the topics addressed in the TeraGrid Institute training series to expose more users to the capabilities and to solicit their feedback. The Michigan project (or related endeavor) will be a tremendous asset in understanding the user experience and feedback on these and other tools and resources.

The user portal provides a simplified way to submit and manage jobs (as opposed to traditional submission of job scripts).

Panel Comments: EOT, Broad Impact

The GIG and RPs have defined a strong program to ensure broad impacts of TeraGrid computing on the research and education infrastructure of the United States through the engagement of a number of specific scientific and engineering communities.

– A continued emphasis on the development of the technologies these communities need to effectively apply TeraGrid computing and working closely with them to ensure their needs are met is critical to the success of the TeraGrid.

The ETF team has recognized the importance of [EOT] by bringing on board a Director for EOT to co-ordinate the activity. The group has identified a set of laudable EOT goals.

REC: Outreach and Communications

(R5) A clearer representation of the overall TeraGrid vision and strategy is needed in order to inform the broader community.

– In July 2005 we developed some of this material as part of an outreach to the US National Congress on Computational Mechanics in Austin, and adapted those materials for distribution at SC05.

•We continue to refine and expand these materials, for example to tailor them toward different audiences including K-12 and specific science disciplines. The goal is to raise their awareness of the opportunities for them to benefit from and engage in the use of TeraGrid resources.

– The External Relations team is also developing a joint Science Highlights document to highlight the broad scientific impact resulting from the use of TeraGrid.

REC: Education and Training

There is a need for continuing education of the community to introduce new technologies or transition to new software systems/services; new communities may need more tailored training; the next generation of computational science researchers needs to be educated. Some of these needs can be met by leveraging existing activities but a more focused TeraGrid approach is needed which in turn means a larger dedicated resource. The TeraGrid project should have a larger amount of resource dedicated to training – whether this comes about by a reallocation of existing resource or additional resource from NSF is for NSF to decide, however the panel could not identify any area of the TeraGrid project that seemed over resourced.

– EOT often appears less resourced because of broad staff involvement, with EOT staff primarily serving to coordinate the efforts. However, we agree that this is an area that requires more resource investment.

– Scott Lathrop and the EOT team are developing an aggressive, comprehensive plan including:

•An SC07-09 education program has been designed, to be funded by the SCxx organization, which precisely addresses the panel’s suggestion for preparing faculty across the country to directly train students.

•A TeraGrid Institute training series is being launched with the TeraGrid ’06 conference and through the summer workshops offered by RPs to raise the level of TeraGrid training.

•A schedule of EOT events has been added and will be continually updated on the TeraGrid web site to provide a centralized list of EOT events and resources across the GIG and RPs.

•Scott is engaging the EOT-PACI / EPIC community to develop TeraGrid-specific proposals, providing a method for engaging the community with TeraGrid.

•TeraGrid EOT is working with EPIC to publish a monthly “Cyberinfo Beat” newsletter with resources for teaching and learning.

REC: User Outreach and Education

(R13) Given that current users require regular updates on new additions to TeraGrid hardware and software, and there is a strong incentive to reach out to researchers who may become new users, education, outreach and training are critical to success. The GIG and all RPs should explore the degree to which education about the TeraGrid could be carried out at the universities as part of student training.

– This is a goal that we have for the EOT program. Our focus to date has been to create materials for both the traditional users and for science gateway users; however, we have had several university courses use TeraGrid as a means for teaching grid computing.

•For example, the SC07-09 Education Programs are focused on preparing college/university faculty to integrate these tools and resources into classes for students across all disciplines. The GIG and RPs are involved in a number of education efforts across the K-12, undergraduate and graduate spectrum to affect student training at all levels.

Next Steps

Dane Skow and the GIG Area Directors are mid-way through a comprehensive update of our WBS, taking these recommendations into account.

•This includes adjustments in focus and organization, moving from 4 to 5 area directors with the addition of Kelly Gaither.

The formation of the CUAC will provide us with a user-driven resource that we can leverage, with many of these recommendations at the heart of a strong agenda for the CUAC.

We look forward to working with OCI as we update our plans and pursue these recommendations to improve the effectiveness and science impact of the TeraGrid facility.