Cluster Computing with Linux
Prabhaker Mateti
Wright State University
(Transcript of a PowerPoint presentation)
Mateti, Linux Clusters 2
Abstract
Cluster computing distributes the computational load to collections of similar machines. This talk describes what cluster computing is, the typical Linux packages used, and examples of large clusters in use today. This talk also reviews cluster computing modifications of the Linux kernel.
What Kind of Computing, did you say?
Sequential, Concurrent, Parallel, Distributed, Networked, Migratory
Cluster, Grid, Pervasive, Quantum, Optical, Molecular
Fundamentals Overview
Granularity of Parallelism, Synchronization, Message Passing, Shared Memory
Granularity of Parallelism
Fine-Grained Parallelism, Medium-Grained Parallelism, Coarse-Grained Parallelism, NOWs (Networks of Workstations)
Fine-Grained Machines
Tens of thousands of Processor Elements
Processor Elements: slow (bit serial); small, fast, private RAM
Shared memory; interconnection networks; message passing
Single Instruction Multiple Data (SIMD)
Medium-Grained Machines
Typical configurations: thousands of processors
Processors have power between coarse- and fine-grained
Either shared or distributed memory
Traditionally: research machines
Single Code Multiple Data (SCMD)
Coarse-Grained Machines
Typical configurations: hundreds/thousands of processors
Processors: powerful (fast CPUs); large (cache, vectors, multiple fast buses)
Memory: shared or distributed-shared
Multiple Instruction Multiple Data (MIMD)
Networks of Workstations
Exploit inexpensive workstations/PCs
Commodity network
The NOW becomes a “distributed memory multiprocessor”
Workstations send and receive messages
C and Fortran programs with PVM, MPI, etc. libraries
Programs developed on NOWs are portable to supercomputers for production runs
Definition of “Parallel”
S1 begins at time b1, ends at e1
S2 begins at time b2, ends at e2
S1 || S2 begins at min(b1, b2), ends at max(e1, e2)
Commutative (equivalent to S2 || S1)
Data Dependency
x := a + b; y := c + d;
x := a + b || y := c + d;
y := c + d; x := a + b;
x depends on a and b; y depends on c and d
Assumed a, b, c, d were independent
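Because x and y read and write disjoint variables, all three orderings on the slide produce the same final state. A minimal sketch in plain Python (used here as an illustrative stand-in for the slide's pseudocode), with threads playing the role of the || operator:

```python
import threading

a, b, c, d = 1, 2, 3, 4
result = {}

def assign_x():
    result["x"] = a + b  # depends only on a and b

def assign_y():
    result["y"] = c + d  # depends only on c and d

# Sequential: x then y
result.clear(); assign_x(); assign_y(); seq1 = dict(result)
# Sequential: y then x
result.clear(); assign_y(); assign_x(); seq2 = dict(result)
# Parallel: x := a + b || y := c + d
result.clear()
t1 = threading.Thread(target=assign_x)
t2 = threading.Thread(target=assign_y)
t1.start(); t2.start(); t1.join(); t2.join()
par = dict(result)

assert seq1 == seq2 == par == {"x": 3, "y": 7}
```

If the statements shared a variable (say y := x + d), the orderings would no longer be interchangeable.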
Types of Parallelism
Result: data structure can be split into parts of the same structure.
Specialist: each node specializes. Pipelines.
Agenda: have a list of things to do. Each node can generalize.
Result Parallelism
Also called embarrassingly parallel or perfectly parallel
Computations that can be subdivided into sets of independent tasks that require little or no communication
Monte Carlo simulations
F(x, y, z)
Specialist Parallelism
Different operations performed simultaneously on different processors
E.g., simulating a chemical plant: one processor simulates the preprocessing of chemicals, one simulates reactions in the first batch, another simulates refining the products, etc.
Agenda Parallelism: MW Model
Manager: initiates computation; tracks progress; handles workers' requests; interfaces with the user
Workers: spawned and terminated by the manager; make requests to the manager; send results to the manager
Embarrassingly Parallel
Result parallelism is obvious
Ex1: Compute the square root of each of the million numbers given.
Ex2: Search for a given set of words among a billion web pages.
Reduction
Combine several sub-results into one
Reduce r1 r2 … rn with op becomes r1 op r2 op … op rn
Hadoop is based on this idea
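The reduction above maps directly onto a fold over the list of sub-results. A small Python illustration:

```python
from functools import reduce
from operator import add, mul

# Reduce r1 r2 ... rn with op  ==>  r1 op r2 op ... op rn
sub_results = [3, 1, 4, 1, 5]
total = reduce(add, sub_results)    # 3 + 1 + 4 + 1 + 5
product = reduce(mul, sub_results)  # 3 * 1 * 4 * 1 * 5
assert total == 14
assert product == 60
```

When op is associative, the fold can be evaluated as a tree, letting a cluster combine sub-results in parallel; that associativity requirement is what MapReduce-style systems rely on.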
Shared Memory
Process A writes to a memory location
Process B reads from that memory location
Synchronization is crucial
Excellent speed
Semantics … ?
Shared Memory
Needs hardware support: multi-ported memory
Atomic operations: Test-and-Set, semaphores
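A software sketch of Test-and-Set: atomically read the old value of a flag and set it to True. Real hardware does this in one instruction; in this toy Python version a lock makes the two steps appear atomic.

```python
import threading

class TestAndSet:
    def __init__(self):
        self._guard = threading.Lock()
        self.flag = False

    def test_and_set(self):
        # Atomically: old := flag; flag := True; return old
        with self._guard:
            old = self.flag
            self.flag = True
            return old

tas = TestAndSet()
assert tas.test_and_set() is False  # first caller sees False: it acquires
assert tas.test_and_set() is True   # later callers see True: location is taken
```

A spin lock is then just a loop that retries test_and_set until it returns False.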
Shared Memory Semantics: Assumptions
Global time is available, in discrete increments.
Shared variable: s = vi at ti, i = 0, …
Process A: s := v1 at time t1.
Assume no other assignment occurred after t1.
Process B reads s at time t and gets value v.
Shared Memory: Semantics
Value of the shared variable:
v = v1, if t > t1
v = v0, if t < t1
v = ??, if t = t1 ± discrete quantum
Next update of the shared variable occurs at t2; t2 = t1 + ?
Distributed Shared Memory
“Simultaneous” read/write access by spatially distributed processors
Abstraction layer of an implementation built from message-passing primitives
Semantics not so clean
Semaphores
Semaphore s;
V(s) ::= s := s + 1
P(s) ::= when s > 0 do s := s – 1
Deeply studied theory.
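The P/V operations above correspond to acquire/release on a counting semaphore. A small Python sketch: a semaphore initialized to 2 bounds how many threads can be in the critical section at once.

```python
import threading

s = threading.Semaphore(2)       # counting semaphore, initial value 2
counter_lock = threading.Lock()
inside = 0
max_inside = 0

def worker():
    global inside, max_inside
    s.acquire()                  # P(s): when s > 0 do s := s - 1
    with counter_lock:
        inside += 1
        max_inside = max(max_inside, inside)
    # ... critical section ...
    with counter_lock:
        inside -= 1
    s.release()                  # V(s): s := s + 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
assert max_inside <= 2           # the semaphore bounded concurrency
```

With an initial value of 1 this degenerates into a mutex.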
Condition Variables
Condition C;
C.wait()
C.signal()
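The wait/signal pair can be sketched with Python's threading.Condition (note that Python spells signal() as notify()). The waiter rechecks its predicate in a loop, the standard guard against spurious wakeups.

```python
import threading

cond = threading.Condition()
ready = False
seen = []

def waiter():
    with cond:
        while not ready:      # recheck the predicate after every wakeup
            cond.wait()       # C.wait(): release the lock and sleep
        seen.append("woke")

def signaler():
    global ready
    with cond:
        ready = True
        cond.notify()         # C.signal(), spelled notify() in Python

t = threading.Thread(target=waiter)
t.start()
signaler()
t.join()
assert seen == ["woke"]
```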
Distributed Shared Memory
A common address space that all the computers in the cluster share.
Difficult to describe semantics.
Distributed Shared Memory: Issues
Distributed spatially: LAN, WAN
No global time available
Distributed Computing
No shared memory
Communication among processes: send a message, receive a message
Asynchronous or synchronous
Synergy among processes
Messages
Messages are sequences of bytes moving between processes
The sender and receiver must agree on the type structure of values in the message
“Marshalling”: data layout so that there is no ambiguity such as “four chars” vs. “one integer”
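The "four chars vs. one integer" ambiguity is easy to demonstrate with Python's struct module: the same four bytes decode to different values depending on the agreed-upon layout, so marshalling must fix both the types and the byte order.

```python
import struct

# Marshal four characters into a message (little-endian layout "<").
payload = struct.pack("<4s", b"abcd")

# The receiver's interpretation depends entirely on the agreed format:
as_chars = struct.unpack("<4s", payload)[0]  # "four chars"
as_int = struct.unpack("<I", payload)[0]     # same bytes as "one integer"

assert as_chars == b"abcd"
assert as_int == 0x64636261  # little-endian: 'd','c','b','a'
```

This is why message-passing libraries carry type information (or a fixed wire format) rather than raw bytes alone.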
Message Passing
Process A sends a data buffer as a message to process B.
Process B waits for a message from A, and when it arrives copies it into its own local memory.
No memory shared between A and B.
Message Passing
Obviously: messages cannot be received before they are sent; a receiver waits until there is a message.
Asynchronous: the sender never blocks, even if infinitely many messages are waiting to be received.
Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering.
Message Passing: Point to Point
Q: send(m, P): send message m to process P
P: recv(x, Q): receive a message from process Q, and place it in variable x
The message data: the type of x must match that of m; as if x := m
Broadcast
One sender Q, multiple receivers P
Not all receivers may receive at the same time
Q: broadcast(m): send message m to processes
P: recv(x, Q): receive the message from process Q, and place it in variable x
Synchronous Message Passing
Sender blocks until receiver is ready to receive.
Cannot send messages to self.
No buffering.
Asynchronous Message Passing
Sender never blocks.
Receiver receives when ready.
Can send messages to self.
Infinite buffering.
Message Passing
Speed not so good: the sender copies the message into system buffers; the message travels the network; the receiver copies the message from system buffers into local memory. Special virtual memory techniques help.
Programming quality: less error-prone compared to shared memory.
Architectures of Top 500 Systems
[Two chart slides; the architecture-share charts are not included in this transcript.]
“Parallel” Computers
Traditional supercomputers: SIMD, MIMD, pipelines; tightly coupled shared memory; bus-level connections; expensive to buy and to maintain
Cooperating networks of computers
Traditional Supercomputers
Very high starting cost: expensive hardware, expensive software
High maintenance
Expensive to upgrade
Computational Grids
“Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations.”
Computational Grids
Individual nodes can be supercomputers, or NOWs
High availability
Accommodate peak usage
LAN : Internet :: NOW : Grid
Buildings-Full of Workstations
1. Distributed OSs have not taken a foothold.
2. Powerful personal computers are ubiquitous.
3. Mostly idle: more than 90% of the up-time?
4. 100 Mb/s LANs are common.
5. Windows and Linux are the top two OSs in terms of installed base.
Networks of Workstations (NOW)
Workstation
Network
Operating system
Cooperation
Distributed+parallel programs
What is a Workstation?
PC? Mac? Sun …?
“Workstation OS”
“Workstation OS”
Authenticated users
Protection of resources
Multiple processes
Preemptive scheduling
Virtual memory
Hierarchical file systems
Network-centric
Clusters of Workstations
Inexpensive alternative to traditional supercomputers
High availability: lower down time, easier access
Development platform, with production runs on traditional supercomputers
Clusters of Workstations
Dedicated nodes
Come-and-go nodes
Clusters with Part-Time Nodes
Cycle stealing: running jobs that do not belong to the owner on a workstation.
Definition of idleness: e.g., no keyboard and no mouse activity
Tools/libraries: Condor, PVM, MPI
Cooperation
Workstations are “personal”
Others' use slows you down
…
Willing to share
Willing to trust
Cluster Characteristics
Commodity off-the-shelf hardware
Networked
Common home directories
Open-source software and OS
Support for message-passing programming
Batch scheduling of jobs
Process migration
Beowulf Cluster
Dedicated nodes
Single system view
Commodity off-the-shelf hardware
Internal high-speed network
Open-source software and OS
Support for parallel programming such as MPI, PVM
Full trust in each other: login from one node into another without authentication; shared file-system subtree
Example Clusters
July 1999
1000 nodes
Used for genetic algorithm research by John Koza, Stanford University
www.genetic-programming.com/
Typical Big Beowulf
1000-node Beowulf cluster system
Used for genetic algorithm research by John Koza, Stanford University
http://www.genetic-programming.com/
Largest Cluster System
IBM BlueGene, 2007
DOE/NNSA/LLNL
Memory: 73,728 GB
OS: CNK/SLES 9
Interconnect: proprietary
PowerPC 440
106,496 nodes
478.2 TeraFLOPS on LINPACK
2008 World's Fastest: Roadrunner
Operating system: Linux
Interconnect: InfiniBand
129,600 cores: PowerXCell 8i, 3200 MHz
1105 TFLOPS
At DOE/NNSA/LANL
Cluster Computers for Rent
Transfer executable files, source code, or data to your secure personal account on TTI servers (1). Do this securely using WinSCP for Windows or “secure copy” (scp) for Linux.
To execute your program, simply submit a job (2) to the scheduler using the “menusub” command, or do it manually using “qsub” (we use the popular PBS batch system). There are working examples of how to submit your executable. Your executable is securely placed on one of our in-house clusters for execution (3).
Your results and data are written to your personal account in real time. Download your results (4).
Turnkey Cluster Vendors
Fully integrated Beowulf clusters with commercially supported Beowulf software systems are available from:
HP www.hp.com/solutions/enterprise/highavailability/
IBM www.ibm.com/servers/eserver/clusters/
Northrop Grumman
Accelerated Servers
Penguin Computing
www.aspsys.com/clusters
www.pssclabs.com
Why are Linux Clusters Good?
Low initial implementation cost: inexpensive PCs; standard components and networks; free software: Linux, GNU, MPI, PVM
Scalability: can grow and shrink
Familiar technology; easy for users to adopt the approach, and to use and maintain the system
2007 OS Share of Top 500
OS       Count  Share    Rmax (GF)  Rpeak (GF)  Processors
Linux    426    85.20%   4897046    7956758     970790
Windows  6      1.20%    47495      86797       12112
Unix     30     6.00%    408378     519178      73532
BSD      2      0.40%    44783      50176       5696
Mixed    34     6.80%    1540037    1900361     580693
MacOS    2      0.40%    28430      44816       5272
Totals   500    100%     6966169    10558086    1648095
http://www.top500.org/stats/list/30/osfam Nov 2007
2007 OS Share of Top 500
OS Family Count Share % Rmax (GF) Rpeak (GF) Procs
Linux 439 87.80 % 13309834 20775171 2099535
Windows 5 1.00 % 328114 429555 54144
Unix 23 4.60 % 881289 1198012 85376
BSD Based 1 0.20 % 35860 40960 5120
Mixed 31 6.20 % 2356048 2933610 869676
Mac OS 1 0.20 % 16180 24576 3072
Totals 500 100% 16927325 25401883 3116923
Many Books on Linux Clusters
Search: google.com, amazon.com
Example book: William Gropp, Ewing Lusk, Thomas Sterling, MIT Press, 2003, ISBN 0-262-69292-9
Why Is Beowulf Good?
Low initial implementation cost: inexpensive PCs; standard components and networks; free software: Linux, GNU, MPI, PVM
Scalability: can grow and shrink
Familiar technology; easy for users to adopt the approach, and to use and maintain the system
Single System Image
Common filesystem view from any node
Common accounts on all nodes
Single software installation point
Easy to install and maintain the system
Easy to use for end-users
Closed Cluster Configuration
[Diagram: compute nodes on a high-speed network and a service network, reached through a gateway node (front end) and a file server node; only the gateway connects to the external network.]
Open Cluster Configuration
[Diagram: compute nodes on a high-speed network with a file server node and a front end; the nodes connect to the external network directly.]
DIY Interconnection Network
Most popular: Fast Ethernet
Network topologies: mesh, torus
Switch vs. hub
Software Components
Operating system: Linux, FreeBSD, …
“Parallel” programs: PVM, MPI, …
Utilities
Open source
Cluster Computing
Running ordinary programs as-is on clusters is not cluster computing.
Cluster computing takes advantage of: result parallelism, agenda parallelism, reduction operations, process-grain parallelism
Google Linux Clusters
GFS: the Google File System
Thousands of terabytes of storage across thousands of disks on over a thousand machines
150 million queries per day
Average response time of 0.25 sec
Near-100% uptime
Cluster Computing Applications
Mathematical: fftw (fast Fourier transform); pblas (parallel basic linear algebra software); atlas (a collection of mathematical libraries); sprng (scalable parallel random number generator); MPITB (MPI toolbox for MATLAB)
Quantum chemistry software: Gaussian, qchem
Molecular dynamics solvers: NAMD, gromacs, gamess
Weather modeling: MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)
Development of Cluster Programs
New algorithms + code
Old programs re-done: reverse-engineer the design and re-code; use new languages that have distributed and parallel primitives; use new libraries
Parallelize legacy code: mechanical conversion by software tools
Distributed Programs
Spatially distributed programs: a part here, a part there, …; parallel; synergy
Temporally distributed programs: compute half today, half tomorrow; combine the results at the end
Migratory programs: have computation, will travel
Technological Bases of Distributed+Parallel Programs
Spatially distributed programs: message passing
Temporally distributed programs: shared memory
Migratory programs: serialization of data and programs
Technological Bases for Migratory Programs
Same CPU architecture: x86, PowerPC, MIPS, SPARC, …, JVM
Same OS + environment
Be able to “checkpoint”: suspend, and then resume computation without loss of progress
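A toy illustration of the checkpoint idea with Python's pickle: serialize the computation's explicit state, then resume from the snapshot without loss of progress. Real migratory systems checkpoint entire processes (registers, memory, open files); here only a state dictionary travels.

```python
import pickle

def step(state):
    # One unit of work: accumulate i into the running total.
    state["total"] += state["i"]
    state["i"] += 1
    return state

state = {"i": 0, "total": 0}
for _ in range(5):                 # compute half today ...
    state = step(state)

snapshot = pickle.dumps(state)     # checkpoint: suspend (could travel to another node)

resumed = pickle.loads(snapshot)   # ... migrate, then resume
for _ in range(5):                 # ... and the rest tomorrow
    resumed = step(resumed)

assert resumed["total"] == sum(range(10))
```

The "same architecture, same environment" requirements on the slide exist because a full process image, unlike this portable state dictionary, is not meaningful on a different CPU or OS.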
Parallel Programming Languages
Shared-memory languages
Distributed-memory languages
Object-oriented languages
Functional programming languages
Concurrent logic languages
Data flow languages
Linda: Tuple Spaces, Shared Memory
<v1, v2, …, vk>
Atomic primitives: in(t), read(t), out(t), eval(t)
Host language: e.g., C/Linda, JavaSpaces
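A hypothetical, much-simplified tuple space in Python, sketching Linda's out/read/in primitives (eval omitted). Real Linda matches on typed tuple patterns; this toy version matches exact values, with None as a wildcard.

```python
import threading

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, t):
        # out(t): add a tuple to the space.
        with self._cond:
            self._tuples.append(t)
            self._cond.notify_all()

    def _match(self, pattern):
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, t)
            ):
                return t
        return None

    def read(self, pattern):
        # read(t): blocking, non-destructive lookup.
        with self._cond:
            while (t := self._match(pattern)) is None:
                self._cond.wait()
            return t

    def in_(self, pattern):
        # in(t): blocking lookup that removes the tuple.
        with self._cond:
            while (t := self._match(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(t)
            return t

ts = TupleSpace()
ts.out(("sum", 7))
assert ts.read(("sum", None)) == ("sum", 7)  # still in the space
assert ts.in_(("sum", None)) == ("sum", 7)   # now removed
```

Processes coordinating only through out/in get Linda's associative shared memory without ever naming each other.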
Data Parallel Languages
Data is distributed over the processors as arrays
Entire arrays are manipulated: A(1:100) = B(1:100) + C(1:100)
The compiler generates parallel code
Fortran 90
High Performance Fortran (HPF)
Mateti, Linux ClustersMateti, Linux Clusters 8181
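A serial Python analogue of the whole-array statement above, with made-up data values. In HPF the compiler would split the index range across processors; here a single comprehension stands in for the elementwise loop.

```python
# Elementwise whole-array addition in the style of the Fortran 90
# statement A(1:100) = B(1:100) + C(1:100); data values are invented.
B = list(range(100))
C = [2 * i for i in range(100)]
A = [b + c for b, c in zip(B, C)]   # one logical operation on entire arrays
```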
Parallel Functional Languages
Erlang http://www.erlang.org/
SISAL http://www.llnl.gov/sisal/
PCN (Argonne)
Haskell-Eden http://www.mathematik.uni-marburg.de/~eden
Objective Caml with BSP
SAC Functional Array Language
Mateti, Linux Clusters 82
Message Passing Libraries
Programmer is responsible for initial data distribution, synchronization, and sending and receiving information
Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)
Bulk Synchronous Parallel model (BSP)
Mateti, Linux Clusters 83
BSP: Bulk Synchronous Parallel model
Divides computation into supersteps
In each superstep a processor can work on local data and send messages
At the end of the superstep, a barrier synchronization takes place and all processors receive the messages that were sent in the previous superstep
Mateti, Linux Clusters 84
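The superstep discipline can be imitated with threads and a barrier. In this sketch (standard-library Python; the inbox bookkeeping is invented for illustration) each of four workers sends its rank to its right neighbour during one superstep and reads the message only after the barrier, in the next superstep.

```python
import threading

NPROCS = 4
barrier = threading.Barrier(NPROCS)
# Messages sent in superstep s become visible only after the barrier.
inboxes = [[] for _ in range(NPROCS)]
pending = [[] for _ in range(NPROCS)]
results = [0] * NPROCS

def worker(rank):
    # Superstep 1: local work, plus a send to the right neighbour.
    pending[(rank + 1) % NPROCS].append(rank)
    barrier.wait()                        # end of superstep 1
    if rank == 0:                         # one thread delivers the mail
        for i in range(NPROCS):
            inboxes[i], pending[i] = pending[i], []
    barrier.wait()                        # start of superstep 2
    # Superstep 2: consume messages sent in the previous superstep.
    results[rank] = sum(inboxes[rank])

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NPROCS)]
for t in threads: t.start()
for t in threads: t.join()
```

A real BSP library batches message delivery inside the runtime; here one thread swaps the pending/inbox buffers between two barrier waits to get the same "messages arrive after the barrier" effect.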
BSP: Bulk Synchronous Parallel model
http://www.bsp-worldwide.org/
Book: Rob H. Bisseling, “Parallel Scientific Computation: A Structured Approach using BSP and MPI,” Oxford University Press, 2004, 324 pages, ISBN 0-19-852939-2.
Mateti, Linux Clusters 85
BSP Library
Small number of subroutines to implement process creation, remote data access, and bulk synchronization
Linked to C, Fortran, … programs
Mateti, Linux Clusters 86
Portable Batch System (PBS)
Prepare a .cmd file:
naming the program and its arguments
properties of the job
the needed resources
Submit .cmd to the PBS Job Server: qsub command
Routing and Scheduling: the Job Server
examines .cmd details to route the job to an execution queue
allocates one or more cluster nodes to the job
communicates with the Execution Servers (mom's) on the cluster to determine the current state of the nodes
when all of the needed nodes are allocated, passes the .cmd on to the Execution Server on the first node allocated (the “mother superior”)
Execution Server
will login on the first node as the submitting user and run the .cmd file in the user's home directory
runs an installation-defined prologue script
gathers the job's output to the standard output and standard error
executes an installation-defined epilogue script
delivers stdout and stderr to the user
Mateti, Linux Clusters 87
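A hypothetical .cmd job file illustrating the three ingredients listed above (program, job properties, resources); the job name, resource amounts, and program are all invented for the example.

```shell
#!/bin/sh
#PBS -N myjob                 # job property: name
#PBS -l nodes=4:ppn=2         # needed resources: 4 nodes, 2 processors each
#PBS -l walltime=01:00:00     # needed resources: wall-clock limit
#PBS -j oe                    # job property: merge stdout and stderr
cd "$PBS_O_WORKDIR"           # PBS starts the job in the user's home directory
mpirun -np 8 ./my_program arg1 arg2   # the program and its arguments
```

Submitted with `qsub myjob.cmd`; the Job Server then routes and schedules it as described above.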
TORQUE, an open source PBS
Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS
Fault tolerance: additional failure conditions checked/handled; node health check script support
Scheduling interface
Scalability: significantly improved server-to-MOM communication model; ability to handle larger clusters (over 15 TF/2,500 processors), larger jobs (over 2,000 processors), and larger server messages
Logging
http://www.supercluster.org/projects/torque/
Mateti, Linux Clusters 88
PVM and MPI
Message passing primitives
Can be embedded in many existing programming languages
Architecturally portable
Open-source implementations
Mateti, Linux Clusters 89
Parallel Virtual Machine (PVM)
PVM enables a heterogeneous collection of networked computers to be used as a single large parallel computer
Older than MPI
Large scientific/engineering user community
http://www.csm.ornl.gov/pvm/
Mateti, Linux Clusters 90
Message Passing Interface (MPI)
http://www-unix.mcs.anl.gov/mpi/
MPI-2.0: http://www.mpi-forum.org/docs/
MPICH: www.mcs.anl.gov/mpi/mpich/ by Argonne National Laboratory and Mississippi State University
LAM: http://www.lam-mpi.org/
Open MPI: http://www.open-mpi.org/
Mateti, Linux Clusters 91
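The heart of an MPI program is a pair of send/receive calls between ranked processes. The sketch below is not MPI: it imitates the MPI_Send/MPI_Recv pattern with Python's standard multiprocessing pipes (POSIX fork start method), just to show the point-to-point shape in which rank 0 ships data and rank 1 computes and replies.

```python
import multiprocessing as mp

# Point-to-point message passing in the style of MPI_Send/MPI_Recv,
# sketched with standard-library pipes rather than a real MPI library
# such as MPICH, LAM, or Open MPI.
ctx = mp.get_context("fork")            # POSIX-only start method

def worker(rank, conn):
    if rank == 0:
        conn.send([x * x for x in range(5)])   # "send": rank 0 -> rank 1
    else:
        data = conn.recv()                     # "receive" on rank 1
        conn.send(sum(data))                   # reply with a reduction

end0, end1 = ctx.Pipe()
p0 = ctx.Process(target=worker, args=(0, end0))
p1 = ctx.Process(target=worker, args=(1, end1))
p0.start(); p1.start()
result = end0.recv()        # rank 1's reply comes back on rank 0's end
p0.join(); p1.join()
```

In a real MPI program the same branching is rank-based inside one SPMD executable: rank 0 calls the send routine, the other rank calls the receive.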
OpenMP for shared memory
Shared-memory API
User gives hints as directives to the compiler
http://www.openmp.org
Mateti, Linux Clusters 92
SPMD
Single program, multiple data
Contrast with SIMD
Same program runs on multiple nodes
May or may not be lock-step
Nodes may be of different speeds
Barrier synchronization
Mateti, Linux Clusters 93
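The SPMD pattern can be sketched with threads standing in for nodes: every worker runs the same program text, selects its own data by rank, proceeds at its own speed, and meets the others at a barrier. The worker count, data set, and sleep-based speed skew are all invented for the sketch.

```python
import threading
import time

DATA = list(range(8))                   # made-up global data set
NWORKERS = 4
partial = [0] * NWORKERS
barrier = threading.Barrier(NWORKERS + 1)   # workers + the main thread

def program(rank):
    # Same program text on every "node"; the rank selects the data.
    chunk = DATA[rank::NWORKERS]
    time.sleep(0.01 * rank)             # nodes need not run in lock-step
    partial[rank] = sum(chunk)
    barrier.wait()                      # barrier synchronization

for r in range(NWORKERS):
    threading.Thread(target=program, args=(r,)).start()
barrier.wait()                          # main thread waits for all workers
total = sum(partial)
```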
Condor
Cooperating workstations: come and go
Migratory programs
Checkpointing
Remote I/O
Resource matching
http://www.cs.wisc.edu/condor/
Mateti, Linux Clusters 94
Migration of Jobs
Policies: Immediate-Eviction; Pause-and-Migrate
Technical issues:
Checkpointing: preserving the state of the process so it can be resumed
Migrating from one architecture to another
Mateti, Linux Clusters 95
Kernels Etc. Mods for Clusters
Dynamic load balancing
Transparent process migration
Kernel mods:
http://openmosix.sourceforge.net/
http://kerrighed.org/
http://openssi.org/
http://ci-linux.sourceforge.net/ CLuster Membership Subsystem (“CLMS”) and Internode Communication Subsystem
http://www.gluster.org/ GlusterFS: clustered file storage of petabytes; GlusterHPC: high-performance compute clusters
http://boinc.berkeley.edu/ open-source software for volunteer computing and grid computing
Mateti, Linux Clusters 96
OpenMosix Distro
Quantian Linux: boot from DVD-ROM; compressed file system on DVD; several GB of cluster software; http://dirk.eddelbuettel.com/quantian.html
Live CD/DVD or single-floppy bootables:
http://bofh.be/clusterknoppix/
http://sentinix.org/
http://itsecurity.mq.edu.au/chaos/
http://openmosixloaf.sourceforge.net/
http://plumpos.sourceforge.net/
http://www.dynebolic.org/
http://bccd.cs.uni.edu/
http://eucaristos.sourceforge.net/
http://gomf.sourceforge.net/
Can be installed on HDD
Mateti, Linux Clusters 97
What is openMOSIX?
An open source enhancement to the Linux kernel
Cluster with come-and-go nodes
System image model: virtual machine with lots of memory and CPU
Granularity: process
Improves the overall (cluster-wide) performance
Multi-user, time-sharing environment for the execution of both sequential and parallel applications
Applications unmodified (no need to link with a special library)
Mateti, Linux Clusters 98
What is openMOSIX?
Execution environment: farm of diskless x86-based nodes, UP (uniprocessor) or SMP (symmetric multiprocessor), connected by standard LAN (e.g., Fast Ethernet)
Adaptive resource management to dynamic load characteristics: CPU, RAM, I/O, etc.
Linear scalability
Mateti, Linux Clusters 99
Users’ View of the Cluster
Users can start from any node in the cluster, or the sysadmin sets up a few nodes as login nodes
Round-robin DNS: “hpc.clusters” with many IPs assigned to the same name
Each process has a Home-Node
Migrated processes always appear to run at the home node; e.g., “ps” shows all your processes, even if they run elsewhere
Mateti, Linux Clusters 100
MOSIX architecture
network transparency
preemptive process migration
dynamic load balancing
memory sharing
efficient kernel communication
probabilistic information dissemination algorithms
decentralized control and autonomy
Mateti, Linux Clusters 101
A two-tier technology
Information gathering and dissemination: supports scalable configurations by probabilistic dissemination algorithms; same overhead for 16 nodes or 2056 nodes
Preemptive process migration that can migrate any process, anywhere, anytime, transparently: supervised by adaptive algorithms that respond to global resource availability; transparent to applications, no change to user interface
Mateti, Linux Clusters 102
Tier 1: Information gathering and dissemination
In each unit of time (e.g., 1 second) each node gathers information about: CPU(s) speed, load, and utilization; free memory; free proc-table/file-table slots
Info sent to a randomly selected node
Scalable: more nodes, better scattering
Mateti, Linux Clusters 103
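A toy round of this dissemination scheme, with made-up node names and load values: each node reports its current reading to one randomly chosen peer, so every node accumulates a partial, constantly refreshed view of the cluster without any polling or central collector.

```python
import random

def gossip_round(loads, views, rng):
    # One dissemination step: every node sends its own load reading
    # to a single randomly chosen peer.
    n = len(loads)
    for node in range(n):
        peer = rng.choice([p for p in range(n) if p != node])
        views[peer][node] = loads[node]   # peer learns node's current load

rng = random.Random(42)                 # fixed seed for a repeatable sketch
loads = [0.9, 0.1, 0.5, 0.7]            # hypothetical per-node CPU loads
views = [{} for _ in loads]             # each node's partial view of others
for _ in range(5):                      # a few gossip rounds
    gossip_round(loads, views, rng)
```

Because targets are chosen at random, the scheme keeps working as nodes join, leave, or fail, which is exactly the property the slide attributes to random-subset dissemination.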
Tier 2: Process migration
Load balancing: reduce variance between pairs of nodes to improve the overall performance
Memory ushering: migrate processes from a node that has nearly exhausted its free memory, to prevent paging
Parallel file I/O: bring the process to the file server; direct file I/O from migrated processes
Mateti, Linux Clusters 104
Network transparency
The user and applications are provided a virtual machine that looks like a single machine
Example: disk access from diskless nodes to the fileserver is completely transparent to programs
Mateti, Linux Clusters 105
Preemptive process migration
Any user's process, transparently and at any time, can migrate to any other node
The migrating process is divided into:
system context (deputy), which may not be migrated from the home workstation (UHN)
user context (remote), which can be migrated to a diskless node
Mateti, Linux Clusters 106
Splitting the Linux process
System context (environment): site dependent, “home” confined
Connected by an exclusive link for both synchronous (system calls) and asynchronous (signals, MOSIX events) interactions
Process context (code, stack, data): site independent, may migrate
[Diagram: the deputy remains in the kernel of the master node; the remote (userland) context runs on a diskless node; the two halves are joined by the local openMOSIX link.]
Mateti, Linux Clusters 107
Dynamic load balancing
Initiates process migrations in order to balance the load of the farm
Responds to variations in the load of the nodes, runtime characteristics of the processes, number of nodes and their speeds
Makes continuous attempts to reduce the load differences among nodes
The policy is symmetrical and decentralized: all of the nodes execute the same algorithm, and the reduction of the load differences is performed independently by any pair of nodes
Mateti, Linux Clusters 108
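A toy version of the decentralized pairwise balancing described above: every node repeatedly pairs up with a partner, and the two shift work between themselves to shrink their load difference. The loads are hypothetical "process counts" and the fixed ring pairing is invented for the sketch; real MOSIX weighs CPU speed, memory, and more.

```python
def balance_pair(loads, a, b):
    # The busier node of the pair gives away half the difference.
    if loads[a] < loads[b]:
        a, b = b, a
    move = (loads[a] - loads[b]) // 2
    loads[a] -= move
    loads[b] += move

loads = [12, 2, 7, 3]                 # made-up per-node loads
for _ in range(4):                    # a few symmetric, decentralized rounds
    for a in range(len(loads)):
        b = (a + 1) % len(loads)      # each node pairs with its neighbour
        balance_pair(loads, a, b)
```

Note that the total load is conserved; only its distribution changes, and repeated pairwise steps drive the nodes toward equal load without any central coordinator.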
Memory sharing
Places the maximal number of processes in the farm's main memory, even if that implies an uneven load distribution among the nodes
Delays swapping out of pages as much as possible
The decision of which process to migrate, and where, is based on knowledge of the amount of free memory in other nodes
Mateti, Linux Clusters 109
Efficient kernel communication
Reduces overhead of the internal kernel communications (e.g., between the process and its home site, when it is executing in a remote site)
Fast and reliable protocol with low startup latency and high throughput
Mateti, Linux Clusters 110
Probabilistic information dissemination algorithms
Each node has sufficient knowledge about available resources in other nodes, without polling
Measure the amount of available resources on each node
Receive resource indices that each node sends at regular intervals to a randomly chosen subset of nodes
The use of a randomly chosen subset of nodes facilitates dynamic configuration and overcomes node failures
Mateti, Linux Clusters 111
Decentralized control and autonomy
Each node makes its own control decisions independently
No master-slave relationships
Each node is capable of operating as an independent system
Nodes may join or leave the farm with minimal disruption
Mateti, Linux Clusters 112
File System Access
MOSIX is particularly efficient for distributing and executing CPU-bound processes
However, it is inefficient for processes with significant file operations
I/O accesses through the home node incur high overhead
“Direct FSA” is for better handling of I/O:
reduces the overhead of executing I/O-oriented system calls of a migrated process
a migrated process performs I/O operations locally, in the current node, not via the home node
processes migrate more freely
Mateti, Linux Clusters 113
DFSA Requirements
DFSA can work with any file system that satisfies some properties.
Unique mount point: the FS is identically mounted on all nodes
File consistency: when an operation is completed in one node, any subsequent operation on any other node will see the results of that operation
Required because an openMOSIX process may perform consecutive syscalls from different nodes
Time-stamp consistency: if file A is modified after B, A must have a timestamp > B's timestamp
Mateti, Linux Clusters 114
DFSA-Conforming FS
Global File System (GFS)
openMOSIX File System (MFS)
Lustre global file system
General Parallel File System (GPFS)
Parallel Virtual File System (PVFS)
Available operations: all common file-system and I/O system calls
Mateti, Linux Clusters 115
Global File System (GFS)
Provides local caching and cache consistency over the cluster using a unique locking mechanism
Provides direct access from any node to any storage entity
GFS + process migration combine the advantages of load balancing with direct disk access from any node, for parallel file operations
Non-GNU license (SPL)
Mateti, Linux Clusters 116
The MOSIX File System (MFS)
Provides a unified view of all files and all mounted FSs on all the nodes of a MOSIX cluster, as if they were within a single file system
Makes all directories and regular files throughout an openMOSIX cluster available from all the nodes
Provides cache consistency
Allows parallel file access by proper distribution of files (a process migrates to the node with the needed files)
Mateti, Linux Clusters 117
MFS Namespace
[Diagram: each node's root (/) holds the usual etc, usr, var, and bin directories plus an mfs mount point, through which the file systems of the other nodes appear.]
Mateti, Linux Clusters 118
Lustre: A Scalable File System
http://www.lustre.org/
Scalable data serving through parallel data striping
Scalable metadata
Separation of file metadata and storage allocation metadata to further increase scalability
Object technology, allowing stackable, value-add functionality
Distributed operation
Mateti, Linux Clusters 119
Parallel Virtual File System (PVFS)
http://www.parl.clemson.edu/pvfs/
User-controlled striping of files across nodes
Commodity network and storage hardware
MPI-IO support through ROMIO
Traditional Linux file system access through the pvfs-kernel package
The native PVFS library interface
Mateti, Linux Clusters 120
General Parallel File System (GPFS)
www.ibm.com/servers/eserver/clusters/software/gpfs.html
“GPFS for Linux provides world class performance, scalability, and availability for file systems. It offers compliance to most UNIX file standards for end user applications and administrative extensions for ongoing management and tuning. It scales with the size of the Linux cluster and provides NFS Export capabilities outside the cluster.”
Mateti, Linux Clusters 121
Mosix Ancillary Tools
Kernel debugger
Kernel profiler
Parallel make (all exec() become mexec())
openMosix pvm
openMosix mm5
openMosix HMMER
openMosix Mathematica
Mateti, Linux Clusters 122
Cluster Administration
LTSP (www.ltsp.org)
ClumpOS (www.clumpos.org)
Mps
Mtop
Mosctl
Mateti, Linux Clusters 123
Mosix commands & files
setpe – starts and stops Mosix on the current node
tune – calibrates the node speed parameters
mtune – calibrates the node MFS parameters
migrate – forces a process to migrate
mosctl – comprehensive Mosix administration tool
mosrun, nomig, runhome, runon, cpujob, iojob, nodecay, fastdecay, slowdecay – various ways to start a program in a specific way
mon & mosixview – CLI and graphic interfaces to monitor the cluster status
/etc/mosix.map – contains the IP numbers of the cluster nodes
/etc/mosgates – contains the number of gateway nodes present in the cluster
/etc/overheads – contains the output of the ‘tune’ command, to be loaded at startup
/etc/mfscosts – contains the output of the ‘mtune’ command, to be loaded at startup
/proc/mosix/admin/* – various files, sometimes binary, to check and control Mosix
Mateti, Linux Clusters 124
Monitoring
Cluster monitor – ‘mosmon’ (or ‘qtop’)
Displays load, speed, utilization, and memory information across the cluster
Uses the /proc/hpc/info interface for retrieving information
Applet/CGI-based monitoring tools display cluster properties
Access via the Internet
Multiple resources
openMosixview with X GUI
Mateti, Linux Clusters 125
openMosixview
by Matthias Rechenburg
www.mosixview.com
Mateti, Linux Clusters 126
Qlusters OS
http://www.qlusters.com/
Based in part on openMosix technology
Migrating sockets
Network RAM already implemented
Cluster Installer, Configurator, Monitor, Queue Manager, Launcher, Scheduler
Partnership with IBM, Compaq, Red Hat, and Intel
Mateti, Linux Clusters 128
More Information on Clusters
www.ieeetfcc.org/ IEEE Task Force on Cluster Computing (now Technical Committee on Scalable Computing, TCSC)
lcic.org/ “a central repository of links and information regarding Linux clustering, in all its forms”
www.beowulf.org resources for clusters built on commodity hardware deploying the Linux OS and open source software
linuxclusters.com/ “Authoritative resource for information on Linux Compute Clusters and Linux High Availability Clusters”
www.linuxclustersinstitute.org/ “To provide education and advanced technical training for the deployment and use of Linux-based computing clusters to the high-performance computing community worldwide”
Mateti, Linux Clusters 129
Levels of Parallelism
Large grain (task level): program – Task i-1, Task i, Task i+1 – exploited via PVM/MPI
Medium grain (control level): function (thread) – func1(), func2(), func3() – exploited via threads
Fine grain (data level): loop – a(0)=.., a(1)=.., a(2)=.. – exploited by the compiler
Very fine grain (multiple issue): individual operations (+, x, load) – exploited by the CPU hardware