A survey of task mapping on production grids








<ul><li><p>37</p><p>A Survey of Task Mapping on Production Grids</p><p>XAVIER GREHANT and ISABELLE DEMEURE, Institut Telecom, Telecom ParisTech, CNRS, LTCI; SVERRE JARP, CERN openlab</p><p>Grids designed for computationally demanding scientific applications started experimental phases ten years ago and have been continuously delivering computing power to a wide range of applications for more than half of this time. The observation of their emergence and evolution reveals actual constraints and successful approaches to task mapping across administrative boundaries. Beyond differences in distributions, services, protocols, and standards, a common architecture is outlined. Application-agnostic infrastructures built for resource registration, identification, and access control dispatch delegation to grid sites. Efficient task mapping is managed by large, autonomous applications or collaborations that temporarily infiltrate resources for their own benefits.</p><p>Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Computer Systems Organization]: Performance of Systems - Design studies; K.6.0 [Management of Computing and Information Systems]: General</p><p>General Terms: Design, Management, Performance, Reliability</p><p>Additional Key Words and Phrases: Computing Grids, Resource Allocation, Task Scheduling, Task Mapping, Grid Architecture, Resource Utilization, Meta-Scheduling, Pull Model, Late Binding</p><p>ACM Reference Format: Grehant, X., Demeure, I., and Jarp, S. 2013. A survey of task mapping on production grids. ACM Comput. Surv. 45, 3, Article 37 (June 2013), 25 pages. DOI: http://dx.doi.org/10.1145/2480741.2480754</p><p>1. INTRODUCTION</p><p>1.1. Scope</p><p>The focus of this article is on task mapping in production grids. The historical perspective aims to help identify fundamental constraints behind grid resource allocation and the logic of the evolution.
This survey is written as a bibliographical basis for further research towards efficiency in grids.</p><p>In this section, we define important concepts and specify the limits of our study.</p><p>Definition 1 (Grid [Foster et al. 2001]). Grids coordinate resource sharing and problem solving in dynamic, multi-institutional virtual organizations.</p><p>Authors' addresses: X. Grehant and I. Demeure, Telecom ParisTech, 46 rue Barrault, 75013 Paris, France; S. Jarp, CERN openlab, CH-1211 Genève 23, Switzerland; corresponding author's email: xavier.grehant@gmail.com. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org. © 2013 ACM 0360-0300/2013/06-ART37 $15.00</p><p>DOI: http://dx.doi.org/10.1145/2480741.2480754</p><p>ACM Computing Surveys, Vol. 45, No. 3, Article 37, Publication date: June 2013.</p></li><li><p>Originally, virtual organizations federated individuals to participate in grids.</p><p>Virtual Organizations enable disparate groups of organizations and/or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal [Foster et al. 2001].</p><p>In practice, resource users are distinguished from resource providers.</p><p>Definition 2 (Virtual Organization).
A virtual organization (VO) is a collaboration of individual users, perceived from the outside as a single user because they identify as such, who run the same applications and have common objectives.</p><p>Definition 3 (Grid Site). A grid site is a set of grid resources under a single administrative domain, including computing and storage resources.</p><p>An institution or organization controls the management of its own sites, including configuration, access control, resource allocation mechanisms, and policies.</p><p>Definition 4 (Production Grid). A production grid is a solution implemented and used for the execution of large-scale applications from independent collaborations (VOs) on resources that span independent institutions (grid sites).</p><p>Definition 5 (Task Mapping [Kim et al. 2007]). An important research problem is how to assign resources to tasks (match) and order the execution of tasks on the resources (schedule) to maximize some performance criterion of a heterogeneous computing system. This procedure of matching and scheduling is called mapping or resource allocation.</p><p>The mechanisms developed for resource allocation include dynamic scheduling and configuration, market simulations, feedback control, and a variety of heuristics [Milojicic et al. 2000; Ricci et al. 2006; Chen and Zhang 2009]. This survey identifies the mechanisms used in production grids. The analysis of production grid designs, their emergence, and evolution helps to understand the fundamental constraints that should be taken into account when pursuing research on grid resource allocation for computationally demanding applications.</p><p>The article is of limited relevance to the following related areas.</p><p>Clusters. The term grid is often misused to designate clusters inside a single organization. A cluster is a set of connected computers inside the same administrative domain. Definition 1 eliminates the confusion.</p><p>Research Grids.
Research grids aggregate resources from research teams in distributed systems. Research teams lend slices of their own local resources to the community. In return, they access distributed slices from other members to test their systems on large-scale testbeds. PlanetLab, Grid5000, and FutureGrid belong in this category [Robert 2003; Fiuczynski 2006; Crowcroft et al. 2008]. The distinction between research and production grids is not a judgement on the readiness of their resource allocation systems. The allocation systems of PlanetLab and Grid5000 (OAR) are both production quality and in use 24/7 [Peterson et al. 2005; Georgiou and Richard 2009]. Users of a research grid experiment with the deployment of networked services, while users of a production grid run computations. The allocation problem in research grids is not covered in this survey.</p><p>Prototypes. The goal is to capture constraints that govern grids in order for the reader to build insight on the relevance of research hypotheses in this area. Therefore, the analysis does not cover grid prototypes, that is, solutions proposed as part of research on grids, and not used in production.</p></li><li><p>Desktop grids. A desktop grid (such as SETI@home or any system based on BOINC) is the distribution of computations from a single application to personal computers [Anderson 2003]. Task mapping faces fewer constraints on a desktop grid than on a grid with many users and applications, where resource providers are entire institutions. However, the reader familiar with desktop grids may find that some elements of this analysis also apply to their context.</p><p>Enterprise clouds. Companies such as Amazon Web Services sell access to computing resources.
Every paying user obtains dedicated and isolated virtual storage devices and processors, which she manages just like her own cluster. In grids, the responsibilities of resource users and providers are not so well distinguished. At the end of the article, we see how grids are evolving towards a closer relationship to computing clouds.</p><p>1.2. Outline</p><p>The remainder of this article is organized as follows. Section 2 describes the aggregation of multiple sites into a grid. Distribution of control introduces a number of challenges, such as security, user identification, information flow, seamless resource integration, and central monitoring. Initially, grid designs were focused on solving these issues rather than on optimizing computing performance. Section 3 identifies constraints on allocation and their impact on performance. To improve performance, major users centralize control on temporarily accessed resources. Section 4 presents this trend and characterizes the resulting environments on which efficient allocation can take place.</p><p>2. FEDERATING RESOURCES</p><p>Grid infrastructures take on a distinctive challenge. They aggregate resources from different institutions to run multiple applications from multiple users. Section 2.1 presents major grids, their applications, and their participants, and Section 2.2 presents the motivation of their federation in grids. They share workload constraints (Section 2.3) and parallelization methods (Section 2.4). The initial idea was to replicate the structure of cluster management systems (Section 2.5) to face the same problems in the wide area (Section 2.6).</p><p>2.1. Infrastructures</p><p>Grids started an initial exploratory phase in 2000 [Ruda 2001]. Production grids were designed to serve as large-scale computing resources for computationally demanding scientific workloads. Around 2004, they started to be used continuously for a wide range of applications.
In production grids, academic or other publicly funded institutions voluntarily offer a certain amount of their computing resources to external scientific projects.</p><p>A few grid infrastructures scale to tens of thousands of nodes. They aggregate computing, data warehousing, and networking facilities from several institutions. They distribute middleware to operate resources and manage users and applications. Grid-specific components are often open-source academic projects.</p><p>The following is a non-exhaustive list of grid infrastructures.</p><p>LCG. LCG is the computing grid for the Large Hadron Collider, CERN's particle accelerator in Switzerland [Andreetto 2004; Codispoti et al. 2009]. It aggregates sites mainly in Europe, but also in Taiwan and Korea. It groups together over 41,000 CPUs from 240 sites. The LCG is supported by the European Commission and more than 90 organizations from over 30 countries. Participating regions identify their contributions separately, like GridPP in the U.K. [the GridPP Collaboration 2006]. LCG resources also support applications from different scientific domains. It forms a general-purpose European grid. EGEE (Enabling Grids for E-sciences) coordinated it at CERN from 2003 to 2009, and EGI (European Grid Initiatives), a federation of national projects, took over in 2010.</p></li><li><p>OSG. The Open Science Grid started in 2005. It is funded by U.S. LHC software and computing programs, the National Science Foundation (NSF), and the U.S. Department of Energy. It continues Grid3, which started in 2003 [Avery 2007].</p><p>TeraGrid. TeraGrid started in 2001 with funds from the NSF to establish a Distributed Terascale Facility (DTF) [Pennington 2002; Hart 2011]. It brings together nine major national computer centers in the U.S. It is deployed on several teraFLOPS systems, including a petaFLOPS system.
TeraGrid is continually increasing in capacity thanks to NSF grants.</p><p>NorduGrid. NorduGrid was funded in 2001 by the NORDUNet2 program to build a grid for countries in northern Europe based on the ARC middleware [Ellert et al. 2007]. The NORDUNet2 program aimed to respond to the American challenge of the Next Generation Initiative (NGI) and Internet2 (I2). NorduGrid provides around 5,000 CPUs over 50 sites.</p><p>Each of these grids is distinguished by the hardware resources integrated, by the organizations supplying these resources, by the projects supported, and by the middleware.</p><p>These infrastructures must not be confused with software development projects like Gridbus<sup>1</sup>, Globus<sup>2</sup>, VDT<sup>3</sup>, Unicore [Streit et al. 2010], and Naregi Grid Middleware [Miura 2006; Sakane et al. 2009], which distribute consistent sets of grid middleware components and are active in the standardization effort [Foster et al. 2002]. Grid infrastructures use and redistribute some of these components [Asadzadeh et al. 2004].</p><p>2.2. Objectives</p><p>Participants in grids usually have one of the following intents. They want either to consolidate CPU cycles or to avoid data movement.</p><p>Consolidated resources from multiple institutions make it possible to solve problems of common interest, where the efforts of a single institution would require unreasonable execution time. The resolution of NUG30, a quadratic assignment problem, illustrates the use of a grid for high performance computing (HPC) [Anstreicher et al. 2000; Goux et al. 2000]. The intent in HPC is to maximize the execution speed of each application. However, supercomputers are better suited for HPC in general. They offer a lower latency between computing units and schedule processes at a lower level.</p><p>By contrast with HPC, high throughput computing (HTC) systems intend to maximize the sustained, cumulative amount of computation executed.
A grid consolidates resource allocation from multiple institutions and accepts applications from external collaborations. Its overall throughput is potentially higher than the sum of the throughputs of participating institutions acting separately. However, an improvement is realized only if allocation is sufficiently accurate and responsive.</p><p>Grids have been prominently driven by the will to analyze unprecedented amounts of data. High energy physics (HEP) experiments at CERN, the European Center for Nuclear Research, and Fermilab, home of an American proton-antiproton collider, attract collaborations of particle physicists who account for most users of LCG, OSG, TeraGrid, and NorduGrid [Terekhov 2002; Graham et al. 2004]. They search for interesting events in the vast amount of data generated by detectors. The dozens of petabytes of data generated by the Large Hadron Collider (LHC) are entirely replicated on a few primary sites and partially on secondary sites to avoid further data movement. Grids are primarily interesting for distributed data analysis. Computations run on the computer centers that store the data to avoid transferring data to every individual user. The gain depends on the capability to appropriately distribute data in the first place and to map tasks according to data location.</p><p><sup>1</sup>gridbus.org. <sup>2</sup>globus.org. <sup>3</sup>Virtual Data Toolkit: vdt.cs.wisc.edu.</p></li><li><p>An increasing number of non-HEP applications are now using production grids [Lin and Yen 2010]. More data movement is expected in their case, as most of the CPUs accessible on existing infrastructures still coincide with HEP data centers. However, grids provide affordable and scalable resources to computationally intensive projects, because...</p></li></ul>
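<p>As an illustration of mapping tasks according to data location, the following is a minimal sketch and not a mechanism described in the survey. It assumes a greedy heuristic in which each task reads one dataset, each dataset is replicated on a known list of sites, and each site advertises a number of free slots. The function name, the argument layout, and the site and dataset names are hypothetical.</p>

```python
def map_tasks_by_data_location(tasks, replicas, free_slots):
    """Greedy data-aware mapping (illustrative assumption, not from the survey).

    tasks:      dict mapping task id -> name of the dataset the task reads
    replicas:   dict mapping dataset name -> list of sites storing a replica
    free_slots: dict mapping site -> number of free execution slots

    Returns a dict mapping task id -> chosen site, or None when no site
    holding the dataset has a free slot (the task would have to wait or
    the data would have to be moved).
    """
    remaining = dict(free_slots)  # work on a copy; the caller's dict is untouched
    mapping = {}
    for task, dataset in tasks.items():
        # Only sites that already store the data are candidates.
        candidates = [site for site in replicas.get(dataset, [])
                      if remaining.get(site, 0) > 0]
        if not candidates:
            mapping[task] = None
            continue
        # Among candidates, pick the site with the most free slots.
        site = max(candidates, key=lambda s: remaining[s])
        remaining[site] -= 1
        mapping[task] = site
    return mapping


if __name__ == "__main__":
    tasks = {"t1": "run_A", "t2": "run_A", "t3": "run_B", "t4": "run_C"}
    replicas = {"run_A": ["site1", "site2"], "run_B": ["site2"]}
    free_slots = {"site1": 1, "site2": 2}
    print(map_tasks_by_data_location(tasks, replicas, free_slots))
```

<p>In this sketch, a task whose dataset has no replica on a site with free capacity is left unmapped, which reflects the trade-off noted above: the gain from avoiding data movement depends on how the data was distributed in the first place.</p>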

