Grid Computing
TRANSCRIPT
1
GRID COMPUTING
Faisal N. Abu-Khzam
&
Michael A. Langston
University of Tennessee
2
Outline
Hour 1: Introduction
Break
Hour 2: Using the Grid
Break
Hour 3: Ongoing Research
Q&A Session
3
Hour 1: Introduction
What is Grid Computing?
Who Needs It?
An Illustrative Example
Grid Users
Current Grids
4
What is Grid Computing?
Computational Grids
– Homogeneous (e.g., Clusters)
– Heterogeneous (e.g., with one-of-a-kind instruments)
Cousins of Grid Computing
Methods of Grid Computing
5
Computational Grids
A network of geographically distributed resources, including computers, peripherals, switches, instruments, and data.
Each user should have a single login account to access all resources.
Resources may be owned by diverse organizations.
6
Computational Grids
Grids are typically managed by gridware.
Gridware can be viewed as a special type of middleware that enables sharing and manages grid components based on user requirements and resource attributes (e.g., capacity, performance, availability).
7
Cousins of Grid Computing
Parallel Computing
Distributed Computing
Peer-to-Peer Computing
Many others: Cluster Computing, Network Computing, Client/Server Computing, Internet Computing, etc.
8
Distributed Computing
People often ask: is Grid Computing a fancy new name for the concept of distributed computing?
In general, the answer is “no.” Distributed computing is most often concerned with distributing the load of a program across two or more processes.
9
Peer-to-Peer Computing
Sharing of computer resources and services by direct exchange between systems.
Computers can act as clients or servers depending on what role is most efficient for the network.
10
Methods of Grid Computing
Distributed Supercomputing
High-Throughput Computing
On-Demand Computing
Data-Intensive Computing
Collaborative Computing
Logistical Networking
11
Distributed Supercomputing
Combining multiple high-capacity resources on a computational grid into a single, virtual distributed supercomputer.
Tackles problems that cannot be solved on a single system.
12
High-Throughput Computing
Uses the grid to schedule large numbers of loosely coupled or independent tasks, with the goal of putting unused processor cycles to work.
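To make the idea concrete, here is a minimal Python sketch of high-throughput computing: a pool of otherwise-idle workers drains a collection of independent tasks. The task (compound screening), the "hit" criterion, and the worker count are all invented for illustration; real gridware adds scheduling, matchmaking, and fault tolerance on top of this pattern.

```python
# Minimal sketch of high-throughput computing: many loosely coupled,
# independent tasks drained by whatever workers happen to be idle.
from concurrent.futures import ThreadPoolExecutor

def screen_compound(compound_id):
    """Stand-in for one independent task (e.g., screening one compound)."""
    return compound_id, compound_id % 7 == 0   # dummy "hit" criterion

tasks = range(1000)                            # independent work units
with ThreadPoolExecutor(max_workers=8) as idle_cycles:   # the "unused cycles"
    results = list(idle_cycles.map(screen_compound, tasks))

hits = [cid for cid, hit in results if hit]
print(f"{len(hits)} hits out of {len(results)} tasks")
```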
13
On-Demand Computing
Uses grid capabilities to meet short-term requirements for resources that are not locally accessible.
Models real-time computing demands.
14
Data-Intensive Computing
The focus is on synthesizing new information from data that is maintained in geographically distributed repositories, digital libraries, and databases.
Particularly useful for distributed data mining.
15
Collaborative Computing
Concerned primarily with enabling and enhancing human-to-human interactions.
Applications are often structured in terms of a virtual shared space.
16
Logistical Networking
Global scheduling and optimization of data movement.
Contrasts with traditional networking, which does not explicitly model storage resources in the network.
Called “logistical” because of the analogy it bears to systems of warehouses, depots, and distribution channels.
17
Who Needs Grid Computing?
A chemist may utilize hundreds of processors to screen thousands of compounds per hour.
Teams of engineers worldwide pool resources to analyze terabytes of structural data.
Meteorologists seek to visualize and analyze petabytes of climate data with enormous computational demands.
18
An Illustrative Example
Tiffany Moisan, a NASA research scientist, collected microbiological samples in the tidewaters around Wallops Island, Virginia.
She needed the high-performance microscope located at the National Center for Microscopy and Imaging Research (NCMIR), University of California, San Diego.
19
Example (continued)
She sent the samples to San Diego and used NPACI’s Telescience Grid and NASA’s Information Power Grid (IPG) to view and control the output of the microscope from her desk on Wallops Island. Thus, in addition to viewing the samples, she could move the platform holding them and make adjustments to the microscope.
20
Example (continued)
The microscope produced a huge dataset of images.
This dataset was stored using a storage resource broker on NASA’s IPG.
Moisan was able to run algorithms on this very dataset while watching the results in real time.
21
Grid Users
Grid developers
Tool developers
Application developers
End users
System administrators
22
Grid Developers
A very small group.
Implementers of a grid “protocol” who provide the basic services required to construct a grid.
23
Tool Developers
Implement the programming models used by application developers.
Implement basic services similar to conventional computing services:
– User authentication/authorization
– Process management
– Data access and communication
24
Tool Developers
Also implement new (grid) services such as:
– Resource location
– Fault detection
– Security
– Electronic payment
25
Application Developers
Construct grid-enabled applications for end-users who should be able to use these applications without concern for the underlying grid.
Provide programming models that are appropriate for grid environments and services that programmers can rely on when developing (higher-level) applications.
26
System Administrators
Balance local and global concerns.
Manage grid components and infrastructure.
Some tasks are still not well delineated due to the high degree of sharing required.
27
Some Highly Visible Grids
The NSF PACI/NCSA Alliance Grid.
The NSF PACI/SDSC NPACI Grid.
The NASA Information Power Grid (IPG).
The Distributed Terascale Facility (DTF) Project.
28
DTF
Currently being built by NSF’s Partnerships for Advanced Computational Infrastructure (PACI).
A collaboration: NCSA, SDSC, Argonne, and Caltech will work in conjunction with IBM, Intel, Qwest Communications, Myricom, Sun Microsystems, and Oracle.
29
DTF Expectations
A 40-billion-bits-per-second optical network (called TeraGrid) is to link computers, visualization systems, and data at four sites.
Will perform 11.6 trillion calculations per second.
Will store more than 450 trillion bytes of data.
30
GRID COMPUTING
BREAK
31
Hour 2: Using the Grid
Globus
Condor
Harness
Legion
IBP
NetSolve
Others
32
Globus
A collaboration of Argonne National Laboratory’s Mathematics and Computer Science Division, the University of Southern California’s Information Sciences Institute, and the University of Chicago’s Distributed Systems Laboratory.
Started in 1996 and is gaining popularity year after year.
33
Globus
A project to develop the underlying technologies needed for the construction of computational grids.
Focuses on execution environments for integrating widely-distributed computational platforms, data resources, displays, special instruments and so forth.
34
The Globus Toolkit
The Globus Resource Allocation Manager (GRAM)
– Creates, monitors, and manages services.
– Maps requests to local schedulers and computers.
The Grid Security Infrastructure (GSI)
– Provides authentication services.
35
The Globus Toolkit
The Monitoring and Discovery Service (MDS)
– Provides information about system status, including server configurations, network status, and locations of replicated datasets.
Nexus and globus_io
– Provide communication services for heterogeneous environments.
36
The Globus Toolkit
Global Access to Secondary Storage (GASS)
– Provides data movement and access mechanisms that enable remote programs to manipulate local data.
Heartbeat Monitor (HBM)
– Used by both system administrators and ordinary users to detect failure of system components or processes.
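The Python sketch below shows how these pieces fit together conceptually. It is not the Globus API: the registry contents, the heartbeat timeout, and the scheduler names are all invented. An MDS-like registry answers "what is out there?", a GRAM-like layer maps a request onto a site's local scheduler, and HBM-like timestamps flag components that have stopped reporting.

```python
# Conceptual sketch of the toolkit's division of labor (invented names,
# not the real Globus interfaces).
import time

registry = {   # MDS-like: system status and configuration
    "siteA": {"scheduler": "pbs", "free_cpus": 16, "last_heartbeat": time.time()},
    "siteB": {"scheduler": "lsf", "free_cpus": 4,  "last_heartbeat": time.time() - 120},
}

def alive(site, timeout=60):
    """HBM-like failure detection: no heartbeat within timeout => failed."""
    return time.time() - registry[site]["last_heartbeat"] < timeout

def submit(request_cpus):
    """GRAM-like mapping of a request onto a live site's local scheduler."""
    for site, info in registry.items():
        if alive(site) and info["free_cpus"] >= request_cpus:
            return f"{site}: forwarded to local {info['scheduler']} scheduler"
    raise RuntimeError("no live site can satisfy the request")

print(submit(8))   # siteB's heartbeat is stale, so the job lands on siteA
```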
37
Condor
The Condor project started in 1988 at the University of Wisconsin-Madison.
The main goal is to develop tools to support High-Throughput Computing on large collections of distributively owned computing resources.
38
Condor
Runs on a cluster of workstations to glean wasted CPU cycles.
A “Condor pool” consists of any number of machines, possibly of different architectures and operating systems, that are connected by a network.
Condor pools can share resources through a feature of Condor called flocking.
39
The Condor Pool Software
Job management services:
– Supports requests about the job queue.
– Puts a job on hold.
– Enables the submission of new jobs.
– Provides information about jobs that are already finished.
A machine with job management installed is called a submit machine.
40
The Condor Pool Software
Resource management:
– Keeps track of available machines.
– Performs resource allocation and scheduling.
Machines with resource management installed are called execute machines.
A machine can be both a “submit” and an “execute” machine simultaneously.
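Condor's actual mechanism for pairing jobs with execute machines is ClassAd matchmaking. Here is a toy Python version of that matching step, with invented attribute names, just to show the idea of comparing a job's requirements against the advertisements of the machines in a pool.

```python
# Toy version of Condor-style matchmaking (the real mechanism is
# ClassAds; these attribute names are illustrative only).
machines = [   # the "Condor pool": mixed architectures and OSes
    {"name": "wk01", "arch": "INTEL", "opsys": "LINUX",   "idle": True},
    {"name": "wk02", "arch": "SUN4u", "opsys": "SOLARIS", "idle": False},
    {"name": "wk03", "arch": "INTEL", "opsys": "LINUX",   "idle": True},
]

job = {"arch": "INTEL", "opsys": "LINUX"}   # submitted from a submit machine

def match(job, machine):
    """An execute machine matches if it is idle and satisfies the job."""
    return machine["idle"] and all(machine[k] == v for k, v in job.items())

candidates = [m["name"] for m in machines if match(job, m)]
print("job can run on:", candidates)        # -> ['wk01', 'wk03']
```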
41
Condor-G
A version of Condor that uses Globus to submit jobs to remote resources.
Allows users to monitor jobs submitted through the Globus toolkit.
Can be installed on a single machine, so there is no need to have a Condor pool installed.
42
Legion
An object-based metasystem software project designed at the University of Virginia to support millions of hosts and trillions of objects linked together by high-speed connections.
Allows groups of users to construct shared virtual workspaces in which to collaborate on research and exchange information.
43
Legion
An open system designed to encourage third-party development of new or updated applications, run-time library implementations, and core components.
The key feature of Legion is its object-oriented approach.
44
Harness
A Heterogeneous Adaptable Reconfigurable Networked System.
A collaboration between Oak Ridge National Laboratory, the University of Tennessee, and Emory University.
Conceived as a natural successor to the PVM project.
45
Harness
An experimental system based on a highly customizable, distributed virtual machine (DVM) that can run on anything from a supercomputer to a PDA.
Built on three key areas of research: Parallel Plug-in Interface, Distributed Peer-to-Peer Control, and Multiple DVM Collaboration.
46
IBP
The Internet Backplane Protocol (IBP) is middleware for managing and using remote storage.
It was devised at the University of Tennessee to support Logistical Networking in large-scale, distributed systems and applications.
47
IBP
Named because it was designed to enable applications to treat the Internet as if it were a processor backplane.
On a processor backplane, the user has access to memory and peripherals, and can direct communication between them with DMA.
48
IBP
IBP gives the user access to remote storage and standard Internet resources (e.g. content servers implemented with standard sockets) and can direct communication between them with the IBP API.
49
IBP
By providing a uniform, application-independent interface to storage in the network, IBP makes it possible for applications of all kinds to use logistical networking to exploit data locality and more effectively manage buffer resources.
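The Python sketch below illustrates the allocate/store/load pattern that logistical networking rests on: an application reserves storage "in the network", names the reservation with a capability, and stages data through it. This is a local stand-in, not the real IBP API; the Depot class and its capability strings are invented for illustration.

```python
# Sketch of the logistical-networking idea behind IBP: reserve storage
# near where data will be needed, then stage data through it.
import uuid

class Depot:
    """A toy storage depot: allocate a byte buffer, then store/load it."""
    def __init__(self):
        self._areas = {}

    def allocate(self, size):
        cap = str(uuid.uuid4())            # "capability" naming the allocation
        self._areas[cap] = (bytearray(size), size)
        return cap

    def store(self, cap, data):
        buf, size = self._areas[cap]
        if len(data) > size:
            raise ValueError("allocation too small")
        buf[:len(data)] = data

    def load(self, cap, length):
        buf, _ = self._areas[cap]
        return bytes(buf[:length])

depot = Depot()                            # imagine it running near the consumer
cap = depot.allocate(1024)                 # reserve storage in the network
depot.store(cap, b"intermediate result")   # stage data close to where it is needed
print(depot.load(cap, 19))                 # -> b'intermediate result'
```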
50
NetSolve
A client-server-agent model.
Designed for solving complex scientific problems in a loosely coupled, heterogeneous environment.
51
The NetSolve Agent
A “resource broker” that represents the gateway to the NetSolve system.
Maintains an index of the available computational resources and their characteristics, in addition to usage statistics.
52
The NetSolve Agent
Accepts requests for computational services from the client API and dispatches them to the best-suited server.
Runs on Linux and UNIX.
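To make the brokering role concrete, here is a minimal Python sketch of an agent choosing a best-suited server from its index. The server entries, their attributes, and the scoring rule are invented for illustration; NetSolve's real agent weighs richer resource characteristics and usage statistics.

```python
# Sketch of the agent's brokering step (invented data and scoring;
# not NetSolve's actual implementation).
servers = [   # the agent's index of computational resources
    {"host": "cypher", "offers": {"dgesv", "fft"}, "load": 0.9, "mflops": 400},
    {"host": "torc",   "offers": {"dgesv"},        "load": 0.1, "mflops": 250},
]

def best_server(problem):
    """Dispatch a request to the best-suited server: one that offers the
    routine, preferring low load relative to peak speed."""
    able = [s for s in servers if problem in s["offers"]]
    if not able:
        raise LookupError(f"no server offers {problem}")
    return min(able, key=lambda s: s["load"] / s["mflops"])

print(best_server("dgesv")["host"])   # -> 'torc' (lightly loaded)
```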
53
The NetSolve Client
Provides access to remote resources through simple and intuitive APIs.
Runs on a user’s local system.
Contacts the NetSolve system through the agent, which in turn returns the server that can best service the request.
Runs on Linux, UNIX, and Windows.
54
The NetSolve Server
The computational backbone of the system.
A daemon process that awaits client requests.
Runs on a variety of platforms: a single workstation, a cluster of workstations, symmetric multiprocessors (SMPs), or massively parallel processors (MPPs).
55
The NetSolve Server
A key component of the server is the Problem Description File (PDF).
With the PDF, routines local to a given server are made available to clients throughout the NetSolve system.
56
The PDF Template
PROBLEM Program Name …
LIB Supporting Library Information …
INPUT specifications …
OUTPUT specifications …
CODE
57
Network Weather Service
Supports grid technologies.
Uses sensor processes to monitor CPU loads and network traffic.
Applies statistical models to the collected data to generate forecasts of future behavior.
NetSolve is currently integrating NWS into its agent.
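As a concrete illustration of the forecasting step, the Python sketch below turns a series of sensor readings into a prediction using simple exponential smoothing. NWS itself runs several competing statistical models and favors whichever has predicted best recently; this single predictor and the sample load series are only meant to show the flavor.

```python
# Sketch of NWS-style forecasting: a statistical model over measurements.
def forecast(measurements, alpha=0.3):
    """Exponentially smoothed prediction of the next measurement."""
    estimate = measurements[0]
    for x in measurements[1:]:
        estimate = alpha * x + (1 - alpha) * estimate   # blend new reading in
    return estimate

cpu_load = [0.20, 0.25, 0.22, 0.60, 0.58, 0.55]   # sensor readings (invented)
print(f"predicted next load: {forecast(cpu_load):.2f}")
```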
58
Gridware Collaborations
NetSolve is using Globus’ “Heartbeat Monitor” to detect failed servers.
A NetSolve client that allows access to Globus is now in testing.
Legion has adopted NetSolve’s client-user interface to leverage its metacomputing resources.
The NetSolve client uses Legion’s data-flow graphs to keep track of data dependencies.
59
Gridware Collaborations
NetSolve can access Condor pools among its computational resources.
IBP-enabled clients and servers allow NetSolve to allocate and schedule storage resources as part of its resource brokering. This improves fault tolerance.
60
GRID COMPUTING
BREAK
61
Hour 3: Ongoing Research
Motivation
Special Projects
– Ongoing work at Tennessee
General Issues
– Open questions of interest to the entire research community
62
Motivation
Computer speed doubles every 18 months.
Network speed doubles every 9 months.
[Graph from Scientific American (January 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.]
63
Special Projects
The SInRG Project
– Grid Service Clusters (GSCs)
– Data Switches
Incorporating Hardware Acceleration
Unbridled Parallelism
– SETI@home and Folding@home
– The Vertex Cover Solver
Security
64
The SInRG Project
65
The Grid Service Cluster
The basic grid building block.
Each GSC will use the same software infrastructure as is now being deployed on the national Grid, but tuned to take advantage of the highly structured and controlled design of the cluster.
Some GSCs are general-purpose and some are special-purpose.
66
The Grid Service Cluster
67
An Advanced Data Switch
The components that make up a GSC must be able to access each other at very high speeds and with guaranteed Quality of Service (QoS).
Links of at least 1 Gbps assure QoS in many circumstances simply by overprovisioning.
68
Computational Ecology GSC
Collaboration between computer science and mathematical ecology.
An 8-processor symmetric multiprocessor (SMP).
Initial in-core memory (RAM) is approximately 4 gigabytes.
An out-of-core data storage unit provides a minimum of 450 gigabytes.
69
Medical Imaging GSC
Collaboration between computer science and the medical school.
High-end graphics workstations.
Distinguished by the need to have these workstations attached as directly as possible to the switch, to facilitate interactive manipulation of the reconstructed images.
70
Molecular Design GSC
Collaboration between computer science and chemical engineering.
Data visualization laboratory.
32 dual processors.
High-performance switch.
71
Machine Design GSC
Collaboration between computer science and electrical engineering.
12 Unix-based CAD workstations.
8 Linux boxes with Pilchard boards.
Investigating the potential of reconfigurable computing in grid environments.
72
Machine Design GSC
73
Types of Hardware
General-purpose hardware – can implement any function.
ASICs – hardware that can implement only a specific application.
FPGAs – reconfigurable hardware that can implement any function.
74
The FPGA
FPGAs offer reprogrammability.
This allows an optimal logic design of each function to be implemented.
Hardware implementations offer acceleration over software implementations, which run on general-purpose processors.
75
The Pilchard Environment
Developed at the Chinese University of Hong Kong.
Plugs into a 133 MHz RAM DIMM slot and is an example of “programmable active memory.”
Pilchard is accessed through memory read/write operations.
Higher bandwidth and lower latency than other environments.
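The memory read/write access model can be sketched in a few lines of Python. A temporary file stands in for the DIMM-mapped region; the real Pilchard interface maps a physical address range instead, and the offsets and values below are invented.

```python
# Sketch of "programmable active memory" access: the host talks to the
# board with ordinary memory reads and writes. A temp file stands in
# for the mapped region (the real board maps physical addresses).
import mmap, os, struct, tempfile

PAGE = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, PAGE)                        # stand-in for the mapped region
region = mmap.mmap(fd, PAGE)

region[0:4] = struct.pack("<I", 0xCAFE)       # "write to the FPGA": an operand
value = struct.unpack("<I", region[0:4])[0]   # "read from the FPGA": a result
print(hex(value))                             # -> 0xcafe

region.close(); os.close(fd); os.unlink(path)
```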
76
Objectives
Evaluate the utility of NetSolve gridware.
Determine the effectiveness of hardware acceleration in this environment.
Provide an interface for the remote use of FPGAs.
Allow users to experiment and gauge whether a given problem would benefit from hardware acceleration.
77
Sample Implementations
Fast Fourier Transform (FFT)
Data Encryption Standard algorithm (DES)
Image backprojection algorithm
A variety of combinatorial algorithms
78
Implementation Techniques
Two types of functions are implemented:
– Software version – runs on the PC’s processor.
– Hardware version – runs in the FPGA.
To implement the hardware version of the function, VHDL code is needed.
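Here is a minimal Python sketch of this two-version scheme, with invented function and problem names: each problem has a software path and, once its bit stream has been loaded, a hardware path, and a small dispatcher chooses between them. The byte-reversal bodies are placeholders, not real DES.

```python
# Sketch of the two-version scheme: a software implementation and a
# hardware (FPGA) implementation per problem, behind one dispatcher.
def des_software(block):
    return block[::-1]        # placeholder for the real DES software routine

def des_hardware(block):
    return block[::-1]        # placeholder for the FPGA-accelerated path

fpga_configured = {"des": True, "fft": False}   # which bit streams are loaded

def run(problem, data):
    """Prefer the hardware version when its configuration file is loaded."""
    hardware = {"des": des_hardware}.get(problem)
    if fpga_configured.get(problem) and hardware:
        return hardware(data)
    return {"des": des_software}[problem](data)

print(run("des", b"8bytes!!"))   # dispatches to the hardware placeholder
```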
79
The Hardware Function
Implemented in VHDL or some other hardware description language.
The VHDL code is then mapped onto the FPGA (synthesis).
CAD tools help make mapping decisions based on constraints such as chip area, I/O pin counts, routing resources and topologies, partitioning, and resource usage minimization.
80
The Hardware Function
The result of synthesis is a configuration file (bit stream).
This file defines how the FPGA is to be reprogrammed in order to implement the desired new functionality.
To run, a copy of the configuration file must be loaded onto the FPGA.
81
Behind the Scenes
[Diagram: a VHDL programmer writes VHDL code, which synthesis turns into a configuration file; a software programmer supplies the software and hardware functions; a server administrator installs the PDFs and libraries on the NetSolve server; a client sends a request to the NetSolve server and receives the results.]
82
Conclusions
Hardware acceleration is offered to both local and remote users.
Resources are available through an efficient and easy-to-use interface.
A development environment is provided for devising and testing a wide variety of software, hardware, and hybrid solutions.
83
Unbridled Parallelism
Sometimes the overhead of gridware is unneeded. Well-known examples include SETI@home and Folding@home.
We are currently building a Vertex Cover solver with multiple levels of acceleration.
84
A Naked SSH Approach
A bit of blasphemy: the anti-gridware paradigm.
Our work raises several questions (a sketch follows the list).
– When does it make sense?
– How much efficiency are we gaining?
– What are its limitations?
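The anti-gridware idea fits in a few lines of Python, assuming only passwordless ssh access to machines you already have accounts on. The host names and the solver command are placeholders; there is no middleware, matchmaking, or fault tolerance here, which is exactly the trade-off in question.

```python
# Sketch of the "naked ssh" approach: farm independent tasks onto
# remote machines with no gridware at all (placeholder hosts/command).
import subprocess

hosts = ["node1", "node2", "node3"]              # hypothetical machines
tasks = [f"./solve_vc --seed {i}" for i in range(6)]

procs = [
    subprocess.Popen(["ssh", hosts[i % len(hosts)], cmd],
                     stdout=subprocess.PIPE, text=True)
    for i, cmd in enumerate(tasks)               # round-robin, no scheduler
]
for p in procs:
    out, _ = p.communicate()                     # gather results as jobs finish
    print(out.strip())
```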
85
Grid Security
Algorithm complexity theory
– Verifiability
– Concealment
Cryptography and checkpointing
– Corroboration
– Scalability
Voting and spot-checking (see the sketch below)
– Fault tolerance
– Reliability
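To illustrate voting and spot-checking, here is a small Python sketch: each task is replicated across untrusted workers, the majority answer is accepted, and a re-execution on a trusted resource serves as an occasional spot check. The worker model and its 10% fault rate are invented for illustration.

```python
# Sketch of result verification by voting and spot-checking.
from collections import Counter
import random

def untrusted_worker(task):
    answer = task * task                      # the honest answer
    return answer + (1 if random.random() < 0.1 else 0)   # 10% cheat/fault

def vote(task, replicas=5):
    """Majority vote over redundant executions of the same task."""
    tally = Counter(untrusted_worker(task) for _ in range(replicas))
    answer, count = tally.most_common(1)[0]
    return answer, count > replicas // 2      # flag non-majority outcomes

def spot_check(task, answer):
    """Occasionally re-run on a trusted resource and compare."""
    return answer == task * task

answer, majority = vote(7)
print(answer, majority, spot_check(7, answer))
```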
86
Some General Issues
Grid architecture.
Resource management.
QoS mechanisms.
Performance monitoring.
Fault tolerance.
87
References
URL to these slides: http://www.cs.utk.edu/~abukhzam/grid-tutorial.htm
Condor: http://www.cs.wisc.edu/condor
Globus: http://www.globus.org
88
References
NetSolve: http://icl.cs.utk.edu/netsolve
Harness:
NWS: http://nws.cs.ucsb.edu/
SInRG: http://icl.cs.utk.edu/sinrg
89
GRID COMPUTING
END