Lecture 11: Unix Clusters
Assoc. Prof. Guntis Barzdins, Assist. Girts Folkmanis
University of Latvia, Dec 10, 2004
Moore’s Law - Density
Moore's Law and Performance
The performance of computers is determined by architecture and clock speed. Clock speed doubles over roughly a three-year period due to the scaling laws on chip. Processors using identical or similar architectures gain performance directly as a function of Moore's Law. Improvements in internal architecture can yield gains beyond what Moore's Law alone predicts.
Future of Moore’s Law
Short-Term (1-5 years): will operate (prototypes already exist in the lab); fabrication cost will rise rapidly.
Medium-Term (5-15 years): exponential growth rate will likely slow; a trillion-dollar industry is motivated.
Long-Term (>15 years): may need new technology (chemical or quantum); we can do better (e.g., the human brain); I would not close the patent office.
Different kinds of PC cluster
High Performance Computing Cluster
Load Balancing
High Availability
High Performance Computing Cluster (Beowulf)
Started in 1994, when Donald Becker of NASA assembled the world’s first cluster from 16 DX4 PCs and 10 Mb/s Ethernet
Also called a Beowulf cluster
Built from commodity off-the-shelf hardware
Applications include data mining, simulations, parallel processing, weather modelling, computer graphical rendering, etc.
Examples of Beowulf cluster
Scyld Cluster O.S. by Donald Becker http://www.scyld.com
ROCKS from NPACI http://www.rocksclusters.org
OSCAR from open cluster group http://oscar.sourceforge.net
OpenSCE from Thailand http://www.opensce.org
Cluster Sizing Rule of Thumb
System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most about 2048 nodes for most HPC applications. Limiting factors include:
Max socket connections
Direct-access message tag lists & buffers
NFS / storage system clients
Debugging
Etc.
It is probably hard to rewrite MPI and all Linux system software for O(100,000) node clusters
Apple Xserve G5 with Xgrid Environment
Alternative to a Beowulf PC cluster
Server node + 10 compute nodes
Dual-CPU G5 processors (2 GHz, 1 GB memory)
Gigabit Ethernet inter-connectivity
3 TB Xserve RAID array
Xgrid offers ‘easy’ pool-of-processors computing model
MPI available for heritage code
Xgrid Computing Environment
Suitable for loosely coupled distributed computing
Controller distributes tasks to agent processors (tasks include data and code)
Collects results when agents finish
Distributes more chunks to agents as they become free and join the cluster/grid (see the task-farm sketch below)
(Diagram: Xgrid client → Xgrid controller → Xgrid agents, with server storage attached)
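To make the pool-of-processors model concrete, here is a minimal task-farm sketch in C. It is not the Xgrid API: the "agents" are simulated as POSIX threads that repeatedly pull the next free chunk from a shared counter, mirroring how the controller hands out more chunks as agents become free and collects the results.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_AGENTS 4
#define NUM_CHUNKS 16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_chunk = 0;              /* controller's position in the work queue */
static double results[NUM_CHUNKS];      /* results collected per chunk             */

/* The "task": here just a dummy computation on one chunk. */
static double process_chunk(int chunk)
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)
        sum += (double)chunk * i;
    return sum;
}

/* Each agent asks for the next free chunk until none are left. */
static void *agent(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int chunk = next_chunk < NUM_CHUNKS ? next_chunk++ : -1;
        pthread_mutex_unlock(&lock);
        if (chunk < 0)
            break;                      /* no work left */
        results[chunk] = process_chunk(chunk);
        printf("agent %ld finished chunk %d\n", id, chunk);
    }
    return NULL;
}

int main(void)
{
    pthread_t agents[NUM_AGENTS];
    for (long i = 0; i < NUM_AGENTS; i++)
        pthread_create(&agents[i], NULL, agent, (void *)i);
    for (int i = 0; i < NUM_AGENTS; i++)
        pthread_join(agents[i], NULL);
    puts("controller: all chunks collected");
    return 0;
}
```

In Xgrid the agents are separate machines rather than threads, but the dispatch pattern (hand a chunk to whichever agent is idle, gather results as they come back) is the same.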
Xgrid Work Flow
Cluster Status
Offline: turned off
Unavailable: turned on, but busy with other non-cluster tasks
Working: computing on this cluster job
Available: waiting to be assigned cluster work
Cluster Status Displays
The tachometer illustrates the total processing power available to the cluster at any time.
The level will change if running on a cluster of desktop workstations, but will stay steady if monitoring a dedicated cluster.
Rocky’s Tachy Tach
Load Balancing Cluster
PC clusters deliver load-balancing performance
Commonly used with busy FTP and web servers with a large client base
A large number of nodes share the load (see the selection sketch below)
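As a minimal illustration of how a balancer spreads requests over nodes, here is a round-robin back-end selection sketch in C. The server list and pick_backend() helper are hypothetical; real balancers such as the Linux Virtual Server also offer weighted and least-connection scheduling and track server health.

```c
#include <stdio.h>

/* Hypothetical back-end pool; a real balancer would also track health. */
static const char *backends[] = { "10.0.0.11", "10.0.0.12", "10.0.0.13" };
static const int num_backends = 3;
static int next = 0;

/* Round-robin: hand each new connection to the next server in turn. */
static const char *pick_backend(void)
{
    const char *chosen = backends[next];
    next = (next + 1) % num_backends;
    return chosen;
}

int main(void)
{
    for (int request = 0; request < 7; request++)
        printf("request %d -> %s\n", request, pick_backend());
    return 0;
}
```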
High Availability Cluster
Avoids downtime of services
Avoids a single point of failure
Always built with redundancy
Almost all load-balancing clusters also have HA capability
Examples of Load Balancing and High Availability Cluster
RedHat HA cluster http://ha.redhat.com
Turbolinux Cluster Server http://www.turbolinux.com/products/tcs
Linux Virtual Server Project http://www.linuxvirtualserver.org/
High Availability Approach: Redundancy + Failover
Redundancy eliminates Single Points Of Failure (SPOF)
Auto-detect failures (hardware, network, applications)
Automatic recovery from failures (no human intervention)
Real-time disk replication with DRBD (Distributed Replicated Block Device)
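The detect-and-recover loop can be pictured with a small sketch in C. This is not how Linux-HA (Heartbeat) is implemented; check_primary_alive() and promote_standby() are hypothetical placeholders for a real health check (e.g., a missed heartbeat packet) and a real takeover action (e.g., acquiring the service IP address).

```c
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical: returns false once the primary stops answering heartbeats. */
static bool check_primary_alive(int round)
{
    return round < 3;                   /* simulate a failure on the third check */
}

/* Hypothetical: take over the service IP and start the service locally. */
static void promote_standby(void)
{
    puts("standby: primary lost, taking over service (failover)");
}

int main(void)
{
    for (int round = 0; ; round++) {
        if (!check_primary_alive(round)) {
            promote_standby();          /* automatic recovery, no human intervention */
            break;
        }
        puts("standby: primary healthy");
        sleep(1);                       /* heartbeat interval */
    }
    return 0;
}
```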
IBM Supported Solutions
Linux-HA (Heartbeat)
Open source project
Multi-platform solution for IBM eServers, Solaris, BSD
Packaged with several Linux distributions
Strong focus on ease-of-use, security, simplicity, low cost
> 10K clusters in production since 1999

Tivoli System Automation (TSA) for Multi-Platform
Proprietary IBM solution
Used across all eServers, ia32 from any vendor
Available on Linux, AIX, OS/400
Rules-based recovery system
Over 1000 licenses since 2003
HPCC Cluster and parallel computing applications
Message Passing Interface: MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/), LAM/MPI (http://lam-mpi.org) -- see the minimal MPI sketch after this list
Mathematical: fftw (fast Fourier transform), pblas (parallel basic linear algebra subprograms), atlas (a collection of mathematical libraries), sprng (scalable parallel random number generator), MPITB (MPI toolbox for MATLAB)
Quantum chemistry software: Gaussian, Q-Chem
Molecular dynamics solvers: NAMD, GROMACS, GAMESS
Weather modelling: MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)
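As a minimal illustration of message passing with MPI (the same calls exist in both MPICH and LAM/MPI), here is a short C sketch that sums a value across all ranks with MPI_Reduce; it is a generic example, not tied to any particular application above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */

    int local = rank + 1;                   /* each rank contributes rank+1 */
    int total = 0;

    /* Combine all local values on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with, e.g., mpirun -np 4, each rank contributes its value and rank 0 prints the combined sum.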
MOSIX and openMosix
MOSIX: MOSIX is a software package that enhances the Linux kernel with cluster capabilities. The enhanced kernel supports a cluster of any size built from x86/Pentium-based boxes. MOSIX allows for the automatic and transparent migration of processes to other nodes in the cluster, while standard Linux process control utilities, such as 'ps', show all processes as if they were running on the node from which they originated.
openMosix: openMosix is a spin-off of the original MOSIX. The first version of openMosix is fully compatible with the last version of MOSIX, but it is going to go in its own direction.
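Because migration is transparent, a program needs no MOSIX-specific API. An ordinary Unix program like the C sketch below, which forks several CPU-bound workers, is the kind of job the kernel can migrate to other nodes on its own, while 'ps' on the home node still lists every process.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORKERS 4

/* A plain CPU-bound job; nothing here refers to MOSIX at all. */
static void crunch(int id)
{
    double sum = 0.0;
    for (long i = 0; i < 200000000L; i++)
        sum += (double)i / (id + 1);
    printf("worker %d done (sum=%g)\n", id, sum);
}

int main(void)
{
    for (int id = 0; id < WORKERS; id++) {
        pid_t pid = fork();
        if (pid == 0) {                /* child: do the work and exit */
            crunch(id);
            _exit(0);
        } else if (pid < 0) {
            perror("fork");
            exit(1);
        }
    }
    for (int id = 0; id < WORKERS; id++)
        wait(NULL);                    /* parent collects all workers */
    return 0;
}
```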
MOSIX architecture (3/9)
Preemptive process migration
Any user’s process, transparently and at any time, can migrate to any available node.
The migrating process is divided into two contexts:
system context (deputy), which may not be migrated from the “home” workstation (UHN);
user context (remote), which can be migrated to a diskless node.
MOSIX architecture (4/9)
Preemptive process migration
(Diagram: process migration between a master node and a diskless node)
Multi-CPU Servers
Benchmark - Memory
                                        1x Stream           2x Stream          4x Stream
2x Opteron, 1.8 GHz, HyperTransport:    1006 – 1671 MB/s    975 – 1178 MB/s    924 – 1133 MB/s
2x Xeon, 2.4 GHz, 400 MHz FSB:          1202 – 1404 MB/s    561 – 785 MB/s     365 – 753 MB/s
(Both systems: 4x DIMM, 1 GB DDR266, Avent Techn.)
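The Stream figures above come from a memory-bandwidth benchmark. The sketch below shows the idea with a triad-style kernel in C; it is a simplified stand-in for the real STREAM benchmark, with an assumed array size and timing via clock_gettime.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L     /* assumed array size: large enough to defeat the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Triad kernel: a = b + scalar * c, the core operation of STREAM. */
    const double scalar = 3.0;
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* 3 arrays of N doubles move through memory (2 reads + 1 write). */
    double mbytes = 3.0 * N * sizeof(double) / 1e6;
    printf("triad bandwidth: %.0f MB/s (check value %g)\n",
           mbytes / secs, a[N / 2]);   /* print a result so the loop is not optimized away */

    free(a); free(b); free(c);
    return 0;
}
```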
Sybase DBMS Performance
Multi-CPU Hardware and Software
Service Processor (SP)
Dedicated on-board SP, PowerPC-based
Own IP name/address
Front panel
Command-line interface
Web server
Remote administration: system status, boot/reset/shutdown, flash the BIOS
Unix Scheduling
Process Scheduling
When to run the scheduler:
1. Process creation
2. Process exit
3. Process blocks
4. System interrupt
Non-preemptive – a process runs until it blocks or gives up the CPU (1, 2, 3)
Preemptive – a process runs for some time unit, then the scheduler selects another process to run (1-4); see the round-robin sketch below
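To illustrate preemptive scheduling, here is a small round-robin simulation in C. It is a toy model, not the Linux or Solaris scheduler: each "process" is just a remaining-burst counter, and the time quantum is an assumed constant.

```c
#include <stdio.h>

#define NPROC   4
#define QUANTUM 3   /* assumed time slice, in arbitrary ticks */

int main(void)
{
    int remaining[NPROC] = { 7, 4, 9, 2 };   /* CPU time each process still needs */
    int done = 0, clock = 0;

    /* Round-robin: run each runnable process for at most one quantum, */
    /* then preempt it and move on, until every process has finished.  */
    while (done < NPROC) {
        for (int p = 0; p < NPROC; p++) {
            if (remaining[p] == 0)
                continue;                    /* already finished */
            int slice = remaining[p] < QUANTUM ? remaining[p] : QUANTUM;
            clock += slice;
            remaining[p] -= slice;
            printf("t=%2d: ran P%d for %d tick(s), %d left\n",
                   clock, p, slice, remaining[p]);
            if (remaining[p] == 0)
                done++;
        }
    }
    return 0;
}
```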
Solaris Overview
Multithreaded, symmetric multi-processing
Preemptive kernel with protected data structures
Interrupts handled using threads
MP support: per-CPU dispatch queues, one global kernel preempt queue
System threads
Priority inheritance
Turnstiles rather than wait queues
Linux Today
Linux scales very well in SMP systems with up to 4 CPUs.
Linux on 8 CPUs is still competitive, but between 4-way and 8-way systems the price per CPU increases significantly.
For SMP systems with more than 8 CPUs, classic Unix systems are the best choice.
With Oracle Real Application Clusters (RAC), small 4- or 8-way systems can be clustered to overcome today’s Linux limitations.
Commodity, inexpensive 4-way Intel boxes, clustered with Oracle 9i RAC, help to reduce TCO.