Design and Implementation of High
Performance Computing Cluster for
Educational Purpose
Dissertation
submitted in partial fulfillment of the requirements
for the degree of
Master of Technology, Computer Engineering
by
SURAJ CHAVAN
Roll No: 121022015
Under the guidance of
PROF. S. U. GHUMBRE
Department of Computer Engineering and Information Technology
College of Engineering, Pune
Pune - 411005.
June 2012
Dedicated to
My Mother
Smt. Kanta Chavan
DEPARTMENT OF COMPUTER
ENGINEERING AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation titled
Design and Implementation of High Performance Computing Cluster for Educational Purpose
has been successfully completed
By
SURAJ CHAVAN
(121022015)
and is approved for the degree of
Master of Technology, Computer Engineering.
PROF. S. U. GHUMBRE,
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

DR. JIBI ABRAHAM,
Head,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.
Date :
Abstract
This project work confronts the issue of bringing high performance computing
(HPC) education to those who do not have access to a dedicated clustering
environment, in an easy, fully functional, inexpensive manner, through the use of
ordinary PCs, Fast Ethernet, and free and open source software such as Linux,
MPICH, Torque and Maui. Many undergraduate institutions in India do not
have the facilities, time, or money to purchase hardware, maintain user accounts,
configure software components, and keep ahead of the latest security advisories
for a dedicated clustering environment. The project's primary goal is to provide
an instantaneous, distributed computing environment. A consequence of providing
such an environment is the ability to promote the education of high performance
computing issues at the undergraduate level, through the ability to turn
ordinary off-the-shelf networked computers into a non-invasive, fully functional
cluster. The cluster is used to solve problems that require a high degree of
computation, such as the satisfiability problem for Boolean circuits, the Radix-2
FFT algorithm, the one-dimensional time-dependent heat equation and others.
The cluster is also benchmarked using High Performance Linpack and the HPCC
benchmark suite. This cluster can be used for research on data mining applications
with large data sets, object-oriented parallel languages, recursive matrix
algorithms, network protocol optimization, graphical rendering, Fast Fourier
transforms, building the college's private cloud, and more. Using this cluster,
students and faculty will receive extensive experience in the configuration,
troubleshooting, utilization, debugging and administration issues uniquely
associated with parallel computing on such a cluster. Several students and faculty
can use it for their project and research work in the near future.
Acknowledgments
It is a great pleasure for me to acknowledge the assistance and contributions of a
number of individuals who helped me in my project titled Design and Implementation
of HPCC for Educational Purpose.
First and foremost, I would like to express my deepest gratitude to my guide, Prof.
S. U. Ghumbre, who has encouraged, supported and guided me during every step
of the project. Without his invaluable advice, completion of this project would not
have been possible. I take this opportunity to thank our Head of Department, Prof. Dr.
Jibi Abraham, for her able guidance and for providing all the necessary facilities,
which were indispensable in the completion of this project. I am also thankful
to the staff of the Computer Engineering Department for their invaluable suggestions
and advice. I thank the college for providing the required magazines, books and
access to the Internet for collecting information related to the project.
I am thankful to Dr. P. K. Sinha, Senior Director HPC, C-DAC, Pune for
granting me permission to study C-DAC's PARAM Yuva facility. I am also thankful
to Dr. Sandeep Joshi, Mr. Rishi Pathak and Mr. Vaibhav Pol of the PARAM
Yuva Supercomputing Facility, C-DAC, Pune for their continuous encouragement
and support throughout the course of this project.
Last, but not least, I am also grateful to my friends for their valuable
comments and suggestions.
Contents
Abstract iii
Acknowledgments iv
List of Figures vi
1 Introduction 1
1.1 High Performance Computing . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Types of HPC architectures . . . . . . . . . . . . . . . . . . 2
1.1.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Characteristics and features of clusters . . . . . . . . . . . . . . . . 4
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Survey 6
2.1 HPC opportunities in Indian Market . . . . . . . . . . . . . . . . . 6
2.2 HPC at Indian Educational Institutes . . . . . . . . . . . . . . . . . 6
2.3 C-DAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 C-DAC and HPC . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 PARAM Yuva . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.1 GARUDA: The National Grid Computing Initiative of India 10
2.5.2 Garuda: Objectives . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 Single Program, Multiple Data (SPMD) . . . . . . . . . . . . . . . 13
2.8 Message Passing and Parallel Programming Protocols . . . . . . . . 14
2.8.1 Message Passing Models . . . . . . . . . . . . . . . . . . . . 14
2.9 Speedup and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.9.1 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.9.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.9.3 Factors affecting performance . . . . . . . . . . . . . . . . . 19
2.9.4 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.10 Maths Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.11 HPL Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.11.1 Description of the HPL.dat File . . . . . . . . . . . . . . . . 25
2.11.2 Guidelines for HPL.dat configuration . . . . . . . . . . . . . 30
2.12 HPCC Challenge Benchmark . . . . . . . . . . . . . . . . . . . . . 32
3 Design and Implementation 35
3.1 Beowulf Clusters: A Low cost alternative . . . . . . . . . . . . . . . 35
3.2 Logical View of proposed Cluster . . . . . . . . . . . . . . . . . . . 36
3.3 Hardware Configuration . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Master Node . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Compute Nodes . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 MPICH2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 HYDRA: Process Manager . . . . . . . . . . . . . . . . . . . 44
3.4.3 TORQUE: Resource Manager . . . . . . . . . . . . . . . . . 44
3.4.4 MAUI: Cluster Scheduler . . . . . . . . . . . . . . . . . . . . 45
3.5 System Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Experiments 48
4.1 Finding Prime Numbers . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 PI Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Circuit Satisfiability Problem . . . . . . . . . . . . . . . . . . . . . 50
4.4 1D Time Dependent Heat Equation . . . . . . . . . . . . . . . . . . 51
4.4.1 The finite difference discretization . . . . . . . . . . . . . . . 51
4.4.2 Using MPI to compute the solution . . . . . . . . . . . . . . 53
4.5 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5.1 Radix-2 FFT algorithm . . . . . . . . . . . . . . . . . . . . 54
4.6 Theoretical Peak Performance . . . . . . . . . . . . . . . . . . . . . 55
4.7 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8 HPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.1 HPL Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.2 Run HPL on cluster . . . . . . . . . . . . . . . . . . . . . . 58
4.8.3 HPL results . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.9 Run HPCC on cluster . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.9.1 HPCC Results . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Results and Applications 63
5.1 Discussion on Results . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 Observations about Small Tasks . . . . . . . . . . . . . . . . 63
5.1.2 Observations about Larger Tasks . . . . . . . . . . . . . . . 63
5.2 Factors affecting Cluster performance . . . . . . . . . . . . . . . . . 64
5.3 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Challenges of parallel computing . . . . . . . . . . . . . . . . . . . . 65
5.5 Common applications of high-performance computing clusters . . . 67
6 Conclusion and Future Work 69
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Bibliography 71
Appendix A PuTTy 74
A.1 How to use PuTTY to connect to a remote computer . . . . . . . . 74
A.2 PSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.2.1 Starting PSCP . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.2.2 PSCP Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
1.1 Basic Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Evolution of PARAM Supercomputers & HPC Roadmap . . . . . . 8
2.2 Block Diagram of PARAM Yuva . . . . . . . . . . . . . . . . . . . . 9
2.3 Single Instruction, Single Data stream (SISD) . . . . . . . . . . . 12
2.4 Single Instruction, Multiple Data streams (SIMD) . . . . . . . . . . 12
2.5 Multiple Instruction, Single Data stream (MISD) . . . . . . . . . . 13
2.6 Multiple Instruction, Multiple Data streams (MIMD) . . . . . . . . 13
2.7 General MPI Program Structure . . . . . . . . . . . . . . . . . . . . 17
2.8 Speedup of a program using multiple processors . . . . . . . . . . . 21
3.1 The Schematic structure of proposed cluster . . . . . . . . . . . . . 35
3.2 Logical view of proposed cluster . . . . . . . . . . . . . . . . . . . . 36
3.3 The Network interconnection . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Graph showing performance for Finding Primes . . . . . . . . . . . 49
4.2 Graph showing performance for Calculating π . . . . . . . . . . . . 50
4.3 Graph showing performance for solving C-SAT Problem . . . . . . . 51
4.4 Graph showing performance for solving 1D Time Dependent Heat
Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Symbolic relation between four nodes . . . . . . . . . . . . . . . . . 52
4.6 Graph showing performance of Radix-2 FFT algorithm . . . . . . . 54
4.7 8-point Radix-2 FFT: Decimation in frequency form . . . . . . . . . 55
4.8 Graph showing High Performance Linpack (HPL) Results . . . . . . 60
5.1 Application Perspective of Grand Challenges . . . . . . . . . . . . . 67
A.1 Putty GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.2 Putty Security Alert . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.3 Putty Remote Login Screen . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 1
Introduction
An HPC system is a collection, or cluster, of connected, independent computers
that work in unison to solve a problem. In general, the machines are tightly
coupled at one site, connected by InfiniBand or some other high-speed interconnect
technology.
With HPC, the primary goal is to crunch numbers, not to sort data. It demands
specialized program optimizations to get the most from a system in terms of
input/output, computation, and data movement. And the machines all have to
trust each other because they're shipping information back and forth.
The development of new materials and production processes based on advanced
technologies requires solving increasingly complex computational problems. However,
even as computer power, data storage, and communication speed continue to improve
exponentially, available computational resources are often failing to keep up
with what users demand of them. Therefore, high-performance computing (HPC)
infrastructure becomes a critical resource for research and development as well as
for many business applications. Traditionally, HPC applications were oriented
toward the use of high-end computer systems, the so-called "supercomputers".
1.1 High Performance Computing
High Performance Computing (HPC) allows scientists and engineers to deal
with very complex problems using fast computer hardware and specialized software.
Since these problems often require hundreds or even thousands of processor
hours to complete, an approach based on the use of supercomputers has traditionally
been adopted. The recent tremendous increase in the speed of PC-class computers
opens a relatively cheap and scalable route to HPC using cluster technologies.
Linux clustering is popular in many industries these days. With the advent
of clustering technology and the growing acceptance of open source software,
supercomputers can now be created for a fraction of the cost of traditional
high-performance machines.
Cluster operating systems divide the tasks amongst the available systems.
Clusters of systems or workstations, on the other hand, connect a group of systems
together to jointly share a critically demanding computational task. Theoretically,
a cluster operating system should provide seamless optimization in every case.
At the present time, cluster server and workstation systems are mostly used
in High Availability applications and in scientific applications such as numerical
computations.
1.1.1 Types of HPC architectures
Most HPC systems use the concept of parallelism. Many software platforms are
oriented for HPC, but first let’s look at the hardware aspects. HPC hardware falls
into three categories:
• Symmetric multiprocessors (SMP)
• Vector processors
• Clusters
Symmetric multiprocessors (SMP)
SMP is a type of HPC architecture in which multiple processors share the same
memory. (In clusters, by contrast, also known as massively parallel processors
(MPPs), the processors do not share memory.) SMPs are generally more expensive
and less scalable than MPPs.
Vector processors
In vector processors, the CPU is optimized to perform well with arrays or vectors;
hence the name. Vector processor systems deliver high performance and were
the dominant HPC architecture in the 1980s and early 1990s, but clusters have
become far more popular in recent years.
Clusters
Clusters are the predominant type of HPC hardware these days; a cluster is a set
of independent computers combined into a massively parallel system. A computer
in a cluster is commonly referred to as a node: each node has its own CPU,
memory, operating system, and I/O subsystem and is capable of
communicating with other nodes. These days it is common to use a commodity
workstation running Linux and other open source software as a node in a cluster.
Clustering is the use of multiple computers, typically PCs or UNIX work-
stations, multiple storage devices, and redundant interconnections, to form what
appears to users as a single highly available system. Cluster computing can be
used for load balancing, high performance computing as well as for high avail-
ability. It is used as a relatively low-cost form of parallel processing machine for
scientific and other applications that lend themselves to parallel operations.
Figure 1.1 illustrates a basic cluster.
Figure 1.1: Basic Cluster
Computer cluster technology puts clusters of systems together to provide better
system reliability and performance. Cluster server systems connect a group of
systems together in order to jointly provide processing service for the clients in
the network.
1.1.2 Clustering
The term ”cluster” can take different meanings in different contexts. This section
focuses on three types of clusters:
• Fail-over clusters
• Load-balancing clusters
• High-performance clusters
Fail-over clusters
The simplest fail-over cluster has two nodes: one stays active and the other stays
on stand-by but constantly monitors the active one. In case the active node goes
down, the stand-by node takes over, allowing a mission-critical system to continue
functioning.
Load-balancing clusters
Load-balancing clusters are commonly used for busy Web sites where several nodes
host the same site, and each new request for a Web page is dynamically routed to
a node with a lower load.
High-performance clusters
These clusters are used to run parallel programs for time-intensive computations
and are of special interest to the scientific community. They commonly run simu-
lations and other CPU-intensive programs that would take an inordinate amount
of time to run on regular hardware.
1.2 Characteristics and features of clusters
1. Very high performance-price ratio.
2. Recycling possibilities of the hardware components.
3. Guarantee of usability/upgradeability in the future.
4. Clusters are built using commodity hardware and cost a fraction of the
price of vector processors. In many cases, the price is lower by more than
an order of magnitude.
5. Clusters use a message-passing paradigm for communication, and programs
have to be explicitly coded to make use of distributed hardware.
6. Open source software components and Linux lead to lower software costs.
7. Clusters have a much lower maintenance cost (they take up less space, take
less power, and need less cooling).
1.3 Motivation
1.3.1 Problem Definition
A computer cluster is a group of linked computers working together closely, thus
in many respects forming a single computer. High-performance computing (HPC)
uses supercomputers and computer clusters to solve advanced computation problems.
The benefits of HPCC (High-Performance Computing Clusters) are availability,
scalability and, to a lesser extent, investment protection and simple administration.
A portable and extensible parallel computing system has been built, with
capability approaching that of a commercial high performance supercomputer, using
general-purpose PCs, networking facilities, and open-source software such as Linux
and MPI. To check the cluster's performance, the popular HPL (High Performance
Linpack) benchmark and the HPCC benchmark suite are used.
1.3.2 Scope
Computing clusters provide a reasonably inexpensive method to aggregate com-
puting power and dramatically cut the time needed to find answers in research
that requires the analysis of vast amounts of data.
This HPCC can be used for research on object-oriented parallel languages,
recursive matrix algorithms, network protocol optimization, graphical rendering,
etc. It can also be used to create the college's own cloud and deploy cloud
applications on it, accessible from anywhere in the world through a web browser.
1.3.3 Objectives
The project's primary goal is to support an instantaneous, easily available
distributed computing environment. A consequence of providing such an environment
is the ability to promote the education of high performance computing issues at
the undergraduate level, through the ability to turn ordinary off-the-shelf
networked computers into a non-invasive, fully functional cluster. Using this
cluster, students and teachers will be able to gain insight into the configuration,
utilization, troubleshooting, debugging, and administration issues uniquely
associated with parallel computing in a live, easy-to-use clustering environment.
The availability of such a system will encourage more and more students and
faculty to use it for their project and research work.
Chapter 2
Literature Survey
2.1 HPC opportunities in Indian Market
While sectors such as education, R&D, biotechnology, and weather forecasting
have taken some good lead, it is likely to see industries such as oil & gas catching
up soon.
But challenges remain, largely on the application side: there is a need for
more homegrown applications. Today, the bulk of the code is serial, running as
multiple instances of the same program. There is a genuine need to focus on code
parallelization to leverage the true power of HPC. Also, the trend in HPC is toward
packing more and more power into less and less footprint at the lowest possible price.
Getting people from diverse domains to share and collaborate along one plat-
form is the other challenge facing HPC deployment.
2.2 HPC at Indian Educational Institutes
India has the potential to be a global technology leader. Indian industry is
competing globally in various sectors of science and engineering. A critical issue
for the future success of the state and Indian industry is the growth of engineering
and research education in India. High performance computing power is key to
scientific and engineering leadership, industrial competitiveness, and national
security. Right now, the hardware and expertise needed for such systems is available
only with a few top-notch colleges such as IISc, the IITs and a few other renowned
institutes. But if we want to harness the true power of HPC, we have to make sure
that such systems are available to each and every engineering college.
2.3 C-DAC
C-DAC was set up in 1988 with the explicit purpose of demonstrating India's
HPC capability, after the US government denied the import of technology for
weather forecasting purposes. Since then, C-DAC's developments have mirrored
the progress of HPC worldwide.
During the second mission, C-DAC introduced the Open Frame Architecture
for cluster computing, culminating in the PARAM 10000 in 1998 and the 1 TF
PARAM Padma in 2002.
Along with 60 installations worldwide, C-DAC now has two HPC facilities of
its own: the 100 GF (GigaFlop) PARAM 10000 at the National Param Supercomputing
Facility (NPSF) in Pune and the 1 TF (TeraFlop) PARAM Padma
at C-DAC's Terascale Supercomputing Facility (CTSF) in Bangalore. The
indigenously built PARAM Padma debuted at rank 171 on the Top500 list of
supercomputers in May 2003.
After the completion of PARAM Padma (1 TF peak computing power, subsequently
upgraded by another 1 TF peak) in December 2002 and its dedication to
the nation in June 2003, it was used extensively as a third-party facility (CTSF)
by a wide spectrum of users from academia, research labs and end-user agencies.
In addition, C-DAC has been actively working since then to build its Next
Generation HPC system (Param NG) and associated technology components. C-DAC
commissioned the system, called PARAM Yuva, in November 2008. This system,
with an Rmax (sustained performance) of 37.80 TF and an Rpeak (peak performance)
of 54.01 TF, was ranked 109th in the TOP500 list released in June 2009. The
system is an intermediate milestone of C-DAC's HPC roadmap towards petaflop
computing by 2012.
C-DAC has made significant contributions to the Indian HPC arena in terms of
awareness (by means of training programmes), consultancy, skilled manpower and
technology development as well as through deployment of systems and solutions
for use by the scientific, engineering and business community.
2.3.1 C-DAC and HPC
C-DAC has taken the initiative in conducting national awareness programs in
high performance computing for the scientific and engineering community, and
encourages the establishment of high performance computing labs in all universities
and colleges. This will help in capacity building, and such labs will act as
computational research centres for scientific and academic programs, addressing
and catalysing the impact of high-quality engineering education and high-end
computational work for the research community in the eastern region. They will
also promote research and teaching by integrating leading-edge high performance
computing and visualization for the faculty, students, graduates and postgraduates
of the institute, and will provide solutions to many of our most pressing national
challenges.
Figure 2.1: Evolution of PARAM Supercomputers & HPC Roadmap
2.4 PARAM Yuva
The latest in the series is PARAM Yuva, which was developed in 2008 and
ranked 68th in the TOP500 list released in November 2008 at the Supercomputing
Conference in Austin, Texas, United States. The system, according to C-DAC
scientists, is an intermediate milestone of C-DAC's HPC road map towards
achieving petaflops (million billion floating-point operations per second)
computing speed by 2012.
As part of this, C-DAC has also set up the National PARAM Supercomputing
Facility (NPSF) in Pune, where C-DAC is headquartered, to allow researchers
access to HPC systems for their compute-intensive problems. C-DAC's efforts
in this strategically and economically important area have put India on the
supercomputing map alongside select developed nations. As of 2008, 52 PARAM
systems have been deployed in the country and abroad, eight of them at locations
in Russia, Singapore, Germany and Canada.
The PARAM series of cluster computing systems is based on what is called
OpenFrame Architecture. PARAM Yuva, in particular, uses a high-speed 10
gigabits per second (Gbps) system area network called PARAM Net-3, developed
indigenously by C-DAC over the preceding three years, as the primary interconnect.
This HPC cluster system is built with nodes designed around a state-of-the-art
x86 architecture based on quad-core processors. In all, PARAM Yuva, in its
complete configuration, has 4,608 cores of Intel Xeon 73XX ("Tigerton")
processors with a clock speed of 2.93 gigahertz (GHz). The system has a sustained
performance of 37.8 TFlops and a peak speed of 54 TFlops.
Figure 2.2: Block Diagram of PARAM Yuva
A novel feature of PARAM Yuva is its reconfigurable computing (RC) capabil-
ity, which is an innovative way of speeding up HPC applications by dynamically
configuring hardware to a suite of algorithms or applications run on PARAM Yuva
for the first time. The RC hardware essentially uses acceleration cards as external
add-ons to boost speed significantly while saving on power and space. C-DAC is
one of the first organisations to bring the concept of reconfigurable hardware re-
sources to the country. C-DAC has not only implemented the latest RC hardware,
it has also developed system software and hardware libraries to achieve appropriate
accelerations in performance.
As C-DAC has been scaling different milestones in HPC hardware, it has also
been developing HPC application software, providing end-to-end solutions in an
HPC environment to different end-users in mission mode. In early January,
C-DAC set up a supercomputing facility around a scaled-down version of PARAM
Yuva at North-Eastern Hill University (NEHU) in Shillong, complete with all
allied C-DAC technology components and application software.
2.5 Grid Computing
Grid computing is a term referring to the federation of computer resources from
multiple administrative domains to reach a common goal. The grid can be thought
of as a distributed system with non-interactive workloads that involve a large
number of files. What distinguishes grid computing from conventional high per-
formance computing systems such as cluster computing is that grids tend to be
more loosely coupled, heterogeneous, and geographically dispersed. Although a
grid can be dedicated to a specialized application, it is more common that a single
grid will be used for a variety of different purposes. Grids are often constructed
with the aid of general-purpose grid software libraries known as middleware.
Grid size can vary by a considerable amount. Grids are a form of distributed
computing whereby a super virtual computer is composed of many networked
loosely coupled computers acting together to perform very large tasks. For certain
applications, distributed or grid computing can be seen as a special type of parallel
computing that relies on complete computers (with onboard CPUs, storage, power
supplies, network interfaces, etc.) connected to a network (private, public or the
Internet) by a conventional network interface, such as Ethernet. This is in contrast
to the traditional notion of a supercomputer, which has many processors connected
by a local high-speed computer bus.
2.5.1 GARUDA: The National Grid Computing Initiative
of India
GARUDA is a collaboration of science researchers and experimenters on a nation-
wide grid of computational nodes, mass storage and scientific instruments that
aims to provide the technological advances required to enable data and compute
intensive science for the 21st century. One of GARUDA’s most important chal-
lenges is to strike the right balance between research and the daunting task of
deploying innovation into some of the most complex scientific and engineering
endeavors being undertaken today.
Building a commanding position in Grid computing is crucial for India. By
allowing researchers to easily access supercomputer-level processing power and
knowledge resources, grids will underpin progress in Indian science, engineering
and business. The challenge facing India today is to turn technologies developed
for researchers into industrial strength business tools.
The Department of Information Technology (DIT), Government of India has
funded the Centre for Development of Advanced Computing (C-DAC) to deploy
the nationwide computational grid ‘GARUDA’, which will connect 17 cities across
the country in its Proof of Concept (PoC) phase, with an aim to bring “Grid”
networked computing to research labs and industry. GARUDA will accelerate
India’s drive to turn its substantial research investment into tangible economic
benefits.
2.5.2 Garuda: Objectives
GARUDA aims at strengthening and advancing scientific and technological excellence
in the area of Grid and Peer-to-Peer technologies. The strategic objectives of
GARUDA are to:
• Create a test bed for the research and engineering of technologies, architectures,
standards and applications in Grid Computing
• Bring together all potential research, development and user groups who can help
develop a national initiative on Grid computing
• Create the foundation for the next generation grids by addressing long term
research issues in the strategic areas of knowledge and data management,
programming models, architectures, grid management and monitoring, problem
solving environments, grid tools and services
The following key deliverables have been identified as important to achieving
the GARUDA objectives:
• Grid tools and services to provide an integrated infrastructure to applications
and higher-level layers
• A Pan-Indian communication fabric to provide seamless and high-speed access
to resources
• Aggregation of resources including compute clusters, storage and scientific
instruments
• Creation of a consortium to collaborate on grid computing and contribute
towards the aggregation of resources
• Grid enablement and deployment of select applications of national importance
requiring aggregation of distributed resources
To achieve the above objectives, GARUDA brings together a critical mass of
well-established researchers from 45 research laboratories and academic institutions
that have formulated an ambitious program of activities.
2.6 Flynn’s Taxonomy
The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) and data streams available in the architecture:
Single Instruction, Single Data stream (SISD)
A sequential computer which exploits no parallelism in either the instruction or
data streams. A single control unit (CU) fetches a single instruction stream (IS)
from memory. The CU then generates the appropriate control signals to direct a
single processing element (PE) to operate on a single data stream (DS), i.e. one
operation at a time.
Figure 2.3: Single Instruction, Single Data stream (SISD)
Examples of SISD architecture are the traditional uniprocessor machines like
a PC (currently manufactured PCs have multiple processors) or old mainframes.
Single Instruction, Multiple Data streams (SIMD)
Figure 2.4: Single Instruction, Multiple Data streams (SIMD)
A computer which exploits multiple data streams against a single instruction
stream to perform operations which may be naturally parallelized. For example,
an array processor or GPU.
Multiple Instruction, Single Data stream (MISD)
Multiple instructions operate on a single data stream. Uncommon architecture
which is generally used for fault tolerance. Heterogeneous systems operate on the
same data stream and must agree on the result.
Figure 2.5: Multiple Instruction, Single Data stream (MISD)
Examples include the Space Shuttle flight control computer.
Multiple Instruction, Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously executing different instructions on
different data. Distributed systems are generally recognized to be MIMD
architectures, exploiting either a single shared memory space or a distributed
memory space. A multi-core superscalar processor is an MIMD processor.
Figure 2.6: Multiple Instruction, Multiple Data streams (MIMD)
2.7 Single Program, Multiple Data (SPMD)
The proposed cluster mostly uses a variation of the MIMD category, namely SPMD:
multiple autonomous processors simultaneously executing the same program (but at
independent points, rather than in the lockstep that SIMD imposes) on different
data. SPMD is sometimes expanded as 'Single Process, Multiple Data', but this
usage is erroneous and should be avoided: SPMD is a parallel execution model and
assumes multiple cooperating processes executing a program. SPMD is the most
common style of parallel programming. The SPMD model and the term were proposed
by Frederica Darema.
2.8 Message Passing and Parallel Programming
Protocols
Message passing is a form of communication used in parallel computing, object-
oriented programming, and interprocess communication. In this model processes
or objects can send and receive messages (comprising zero or more bytes, complex
data structures, or even segments of code) to other processes. By waiting for
messages, processes can also synchronize.
Three protocols are presented here for parallel programming: one which has become
the standard, one which used to be the standard, and one which some feel might be
the next big thing. For a while, the parallel protocol war was waged between PVM
and MPI. By most accounts, MPI won. It is a highly efficient and easy-to-learn
protocol that has been implemented on a wide variety of platforms. One criticism
is that different implementations of MPI do not always talk to one another.
However, most cluster install packages provide both of the two most common
implementations (MPICH and LAM/MPI). When setting up a small cluster, choose
freely between them: both work well, and as long as the same version of MPI is on
each machine, there is no need to rewrite any MPI code. MPI stands for Message
Passing Interface; basically, independent processes send messages to each other.
Both LAM/MPI and MPICH simplify the process of starting large jobs on multiple
machines. MPI is the most common and efficient parallel protocol in current use.
2.8.1 Message Passing Models
Message passing models for parallel computation have been widely adopted be-
cause of their similarity to the physical attributes of many multiprocessor architec-
tures. Probably the most widely adopted message passing model is MPI. MPI, or
Message Passing Interface, was released in 1994 after two years in the design phase.
MPI's functionality is fairly straightforward. For several years, MPI has been the
de facto standard for writing parallel applications. One of the most popular MPI
implementations is MPICH. Its successor, MPICH2, features a completely new
design that provides more performance and flexibility. To ensure portability, it
has a hierarchical structure that allows porting to be done at different levels.
MPICH2 programs are written in C or FORTRAN and linked against the MPI
libraries; C++ and Fortran90 bindings are also supported. MPI applications run
in a multiple-instruction multiple-data (MIMD) manner.
MPI
MPI provides a straightforward interface for writing software that can use multiple
cores of a computer, and multiple computers in a cluster or nodes in a
supercomputer. Using MPI, one can write code that uses all of the cores and all of
the nodes in a multicore computer cluster, and that will run faster as more cores
and more compute nodes become available.
MPI is a well-established, standard method of writing parallel programs. The
MPI-1 standard was released in 1994, and the standard has since evolved through
the MPI-2 series. MPI is implemented as a library, which is available for nearly
all computer platforms (e.g. Linux, Windows, OS X), and with interfaces for many
popular languages (e.g. C, C++, Fortran, Python).
MPI stands for ”Message Passing Interface”, and it parallelizes computational
work by providing tools that use a team of processes to solve the problem, and
for the team to then share the solution by passing messages amongst one another.
MPI can be used to parallelize programs that run locally, by having all processes
in the team run locally, or it can be used to parallelize programs across a compute
cluster, by running one or more processes per node. MPI can be combined with
other parallel programming technologies, e.g. OpenMP.
Basic MPI Calls
It is often said that there are two views of MPI. One view is that MPI is a
lightweight protocol with only six commands. The other view is that it is an
in-depth protocol with hundreds of specialized commands.
The 6 Basic MPI Commands
• MPI_Init
• MPI_Comm_size
• MPI_Comm_rank
• MPI_Send
• MPI_Recv
• MPI_Finalize
In short, set up an MPI program, get the number of processes participating in
the program, determine which of those processes corresponds to the one calling the
command, send messages, receive messages, and stop participating in a parallel
program.
1. MPI_Init(int *argc, char ***argv): Takes the command line arguments to a
program, checks for any MPI options, and passes the remaining command line
arguments to the main program.
2. MPI_Comm_size(MPI_Comm comm, int *size): Determines the size of the
given MPI communicator. A communicator is a set of processes that work
together. For typical programs this is the default MPI_COMM_WORLD,
which is the communicator for all processes available to an MPI program.
3. MPI_Comm_rank(MPI_Comm comm, int *rank): Determines the rank of the
current process within a communicator. Typically, if an MPI program is being
run on N processes, the communicator would be MPI_COMM_WORLD, and
the rank would be an integer from 0 to N-1.
4. MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm): Sends the contents of buf, which contains count elements
of type datatype, to the process of rank dest in the communicator comm, flagged
with the message tag. Typically, the communicator is MPI_COMM_WORLD.
5. MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status): Reads into buf count values of type
datatype from process source in communicator comm, if a message is sent
flagged with tag. Also receives information about the transfer into status.
6. MPI_Finalize(): Handles anything that the current MPI implementation needs
to do before exiting a program. Typically this should be the final or near-final
line of a program.
MPICH2: Message Passing Interface
The MPICH implementation of MPI is one of the most popular versions of MPI.
Recently, MPICH was completely rewritten; the new version is called MPICH2 and
includes all of MPI, both MPI-1 and MPI-2. This section describes how to obtain,
build, and install MPICH2 on a Beowulf cluster. Then it describes how to set up
an MPICH2 environment in which MPI programs can be compiled, executed, and
debugged. MPICH2 is recommended for all Beowulf clusters by many researchers.
Figure 2.7: General MPI Program Structure
Original MPICH is still available but is no longer being developed.
PVM
PVM (Parallel Virtual Machine) is a freely available, portable, message-passing
library generally implemented on top of sockets. PVM's daemon-based
implementation makes it easy to start large jobs on multiple machines. PVM was
the first standard for parallel computing to become widely accepted. As a result,
there is a large amount of legacy code in PVM still available. PVM also allows a
program to spawn multiple programs from within the original program, and can
easily spawn other processes recursively. It is a simple implementation that works
across different platforms. Nowadays it is mostly used by people who have legacy
PVM code that they do not want to modify.
JavaSpaces
Java is a versatile computer language that is object oriented and is widely used
in computer science schools around the country. JavaSpaces is Java’s parallel
programming framework which operates by writing entries into a shared space.
Programs can access the space, and either add an entry, read an entry without
removing it, or take an entry.
Java is an interpreted language, and as such typical programs will not run at
the same speed as compiled languages such as C/C++ and Fortran. However,
much progress has been made in the area of Java efficiency, and many operating
systems have what are known as just-in-time compilers. Current claims are that a
well-optimized Java platform can run Java code at about 90% of the speed of similar
C/C++ code. Java has a versatile security policy that is extremely flexible, but
can also be difficult to learn.
JavaSpaces suffers from high latency and a lack of network optimization, but for
embarrassingly parallel problems that do not require synchronization, the
JavaSpaces model of putting jobs into a space, letting any "worker" take jobs out
of the space, and having the workers put results into the space when done leads to
very natural approaches to load balancing, and may be well suited to non-coupled,
highly distributed computations such as SETI@home. JavaSpaces does not have any
simple mechanism for starting large jobs on multiple machines. JavaSpaces is a
good choice if one needs to pass not just data, but instructions on what to do with
that data; it also provides an object-oriented parallel framework.
2.9 Speedup and Efficiency
2.9.1 Speedup
The speedup of a parallel code is how much faster it runs in parallel. If the time
it takes to run a code on 1 processor is T1 and the time it takes to run the same
code on N processors is TN, then the speedup is given by
S = T1 / TN
This can depend on many things, but primarily depends on the ratio of the
amount of time the code spends communicating to the amount of time it spends
computing.
2.9.2 Efficiency
Efficiency is a measure of how much of the available processing power is being used.
The simplest way to think of it is as the speedup per processor. This is equivalent
to defining efficiency as the ratio of the time to run N models on N processors to
the time to run 1 model on 1 processor.
E = S / N = T1 / (N × TN)
This gives a more accurate measure of the true efficiency of a parallel program
than CPU usage, as it takes into account redundant calculations as well as idle
time.
2.9.3 Factors affecting performance
The factors which can affect an MPI application’s performance are numerous,
complex and interrelated. Because of this, generalizing about an application’s
performance is usually very difficult. Most of the important factors are briefly
described below.
Platform / Architecture Related
1. cpu - clock speed, number of cpus
2. Memory subsystem - memory and cache configuration, memory-cache-cpu
bandwidth, memory copy bandwidth
3. Network adapters - type, latency and bandwidth characteristics
4. Operating system characteristics - many
Network Related
1. Protocols - TCP/IP, UDP/IP, other
2. Configuration, routing, etc
3. Network tuning options (”no” command)
4. Network contention / saturation
Application Related
1. Algorithm efficiency and scalability
2. Communication to computation ratios
3. Load balance
4. Memory usage patterns
5. I/O
6. Message size used
7. Types of MPI routines used - blocking, non-blocking, point-to-point, collec-
tive communications
MPI Implementation Related
1. Message buffering
2. Message passing protocols - eager, rendezvous, other
3. Sender-Receiver synchronization - polling, interrupt
4. Routine internals - efficiency of algorithm used to implement a given routine
Network Contention
1. Network contention occurs when the volume of data being communicated
between MPI tasks saturates the bandwidth of the network.
2. Saturation of the network bandwidth results in an overall decrease of com-
munications performance for all tasks.
Because of these challenges and complexities, performance analysis tools are
essential to optimizing an application's performance. They can assist in
understanding what a program is "really doing" and suggest how its performance
might be improved.
The primary issue with speedup is the communication to computation ratio.
To get a higher speedup,
• Communicate less
• Compute more
• Make connections faster
• Communicate faster
The amount of time the computer requires to make a connection to another
computer is referred to as its latency, and the rate at which data can be transferred
is the bandwidth. Both can have an impact on the speedup of a parallel code.
Collective communication can also help speed up the code. As an example,
imagine you are trying to tell a number of people about a party. One method would
be to tell each person individually, another would be to tell people to ”spread the
word”. Collective communication refers to improving communication speed by
having any node with the information being sent participate in sending the infor-
mation to other nodes. Not all protocols allow for collective communication, and
even protocols which do may not require a vendor to implement collective com-
munication. An example is the broadcast routine in MPI. Many vendor-specific
versions of MPI allow for broadcast routines which use a "tree" method of
communication. The more common implementations found on most clusters (Open
MPI, LAM/MPI and MPICH) simply have the sending machine contact each
receiving machine in turn.
2.9.4 Amdahl’s Law
Amdahl’s law, also known as Amdahl’s argument, is named after computer ar-
chitect Gene Amdahl, and is used to find the maximum expected improvement
to an overall system when only part of the system is improved. It is often used
in parallel computing to predict the theoretical maximum speedup using multiple
processors.
Figure 2.8: Speedup of a program using multiple processors
OverallSpeedup = 1 / ((1 − f) + f/s)
where
f = fraction of the code that is parallel
s = speedup of the enhanced portion
The speedup of a program using multiple processors in parallel computing is
limited by the time needed for the sequential fraction of the program. For example,
if a program needs 20 hours using a single processor core, and a particular portion
of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can
be parallelized, then regardless of how many processors are devoted to a parallelized
execution of this program, the minimum execution time cannot be less than that
critical 1 hour. Hence the speedup is limited to at most 20×, as the diagram
illustrates.
2.10 Maths Libraries
For computer programmers, calling pre-written subroutines to do complex calcu-
lations dates back to early computing history. With minimal effort, any developer
can write a function that multiplies two matrices, but these same developers would
not want to re-write that function for every new program that requires it. Further,
with good theory and practice, one can optimize practically any algorithm to run
several times faster, though it would typically take several hours to days to match
the performance of a highly optimized implementation.
Scientific computing, and the use of math libraries, was traditionally limited
to research labs and engineering disciplines. In recent decades, this niche com-
puting market has blossomed across a variety of industries. While research in-
stitutes and universities are still the largest users of math libraries, especially in
the High Performance Computing (HPC) arena, industries like financial services
and biotechnology are increasingly turning to math libraries as well. Even the
business analytics arena around business intelligence and data mining is starting
to leverage the existing tools. From bond pricing and portfolio optimization to
exotic instrument evaluations and exchange rate analysis, the financial services
industry has a wide variety of requirements for complex mathematical algorithms.
Similarly, the biology disciplines have aligned with statisticians to analyze exper-
imental procedures which produce hundreds of thousands of results.
The core area of the math library market implements linear algebra algorithms.
More specialized functions, such as numerical optimization and time series fore-
casting, are often invoked explicitly by users. In contrast, linear algebra functions
are often used as key background components for solving a wide variety of prob-
lems. Eigen analysis, matrix inversion and other linear calculations are essential
components in nearly every statistical analysis in use today including regression,
factor analysis, discriminate analysis, etc. The most basic suite of such algorithms
is the BLAS (Basic Linear Algebra Subprograms) libraries for basic vector and
matrix operations.
BLAS
BLAS is the Basic Linear Algebra Subprograms, a set of routines used to perform
common low-level matrix manipulations such as rotations or dot products. BLAS
should be optimized to run on the given hardware. This can be done by obtaining a
vendor-supplied package (i.e., provided by Sun or Intel), or else by using the
ATLAS software.
ATLAS
ATLAS is the Automatically Tuned Linear Algebra Software package. It is
software that attempts to tune the BLAS implementation that it provides to the
hardware. ATLAS also provides only a very minimal LAPACK implementation, so it
is better to install the complete LAPACK package separately.
LAPACK
LAPACK is the Linear Algebra Package. It extends BLAS to provide higher level
linear algebra routines such as computing eigenvalues, or finding the solutions to
a system of linear equations. LAPACK is a library of Fortran 77 subroutines for
solving the most commonly occurring problems in numerical linear algebra. It has
been designed to be efficient on a wide range of modern high-performance comput-
ers. The name LAPACK is an acronym for Linear Algebra PACKage. Previously
LINPACK was used for benchmarking. LINPACK is a collection of Fortran sub-
routines that analyse and solve linear equations and linear least-squares problems.
But now it is completely superseded by LAPACK.
Problems that LAPACK can Solve
LAPACK can solve systems of linear equations, linear least squares problems,
eigenvalue problems and singular value problems. LAPACK can also handle
many associated computations such as matrix factorizations or estimating condition
numbers.
LAPACK contains driver routines for solving standard types of problems, com-
putational routines to perform a distinct computational task, and auxiliary rou-
tines to perform a certain subtask or common low-level computation. Each driver
routine typically calls a sequence of computational routines. Taken as a whole,
the computational routines can perform a wider range of tasks than are covered
by the driver routines. Many of the auxiliary routines may be of use to numerical
analysts or software developers, so the Fortran source for these routines is
documented with the same level of detail used for the LAPACK routines and driver
routines.
Dense and band matrices are provided for, but not general sparse matrices. In
all areas, similar functionality is provided for real and complex matrices.
2.11 HPL Benchmark
HPL is a software package that solves a (random) dense linear system in double
precision (64 bits) arithmetic on distributed-memory computers. It can thus be
regarded as a portable as well as freely available implementation of the High
Performance Computing Linpack Benchmark.
The algorithm used by HPL can be summarized by the following keywords:
• two-dimensional block-cyclic data distribution
• right-looking variant of the LU factorization with row partial pivoting featuring
multiple look-ahead depths
• recursive panel factorization with pivot search and column broadcast combined
• various virtual panel broadcast topologies
• bandwidth-reducing swap-broadcast algorithm
• backward substitution with look-ahead of depth 1
The HPL package provides a testing and timing program to quantify the accuracy
of the obtained solution as well as the time it took to compute it. The best
performance achievable by this software on a system depends on a large variety
of factors. Nonetheless, with some restrictive assumptions on the interconnection
network, the algorithm described here and its attached implementation are scalable
in the sense that their parallel efficiency is maintained constant with respect
to the per-processor memory usage.
The HPL software package requires the availability of an implementation of
the Message Passing Interface (MPI) on the system. An implementation of either
the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing
Library (VSIPL) is also needed. Machine-specific as well as generic implementations
of MPI, the BLAS and VSIPL are available for a large variety of systems.
2.11.1 Description of the HPL.dat File
Line 1: (unused) Typically one would use this line to summarize the content of
the input file. By default this line reads:
HPL Linpack benchmark input file
Line 2: (unused) same as line 1. By default this line reads:
Innovative Computing Laboratory, University of Tennessee
Line 3: the user can choose where the output should be redirected to. In the
case of a file, a name is necessary, and this is the line where one wants to specify
it. Only the first name on this line is significant. By default, the line reads:
HPL.out output file name (if any)
This means that if one chooses to redirect the output to a file, the file will be called
"HPL.out". The rest of the line is unused, and this space can be used to put an
informative comment on the meaning of this line.
Line 4: This line specifies where the output should go. The line is formatted:
it must begin with a positive integer; the rest is insignificant. Three choices are
possible for the positive integer: 6 means that the output will go to the standard
output, and 7 means that the output will go to the standard error. Any other integer
means that the output should be redirected to a file whose name has been specified
in the line above. This line by default reads:
6 device out (6=stdout,7=stderr,file)
which means that the output generated by the executable should be redirected to
the standard output.
Line 5: This line specifies the number of problem sizes to be executed. This
number should be less than or equal to 20. The first integer is significant, the rest
is ignored. If the line reads:
3 # of problems sizes (N)
this means that the user is willing to run 3 problem sizes that will be specified in
the next line.
Line 6: This line specifies the problem sizes one wants to run. Assuming the
line above started with 3, the 3 first positive integers are significant, the rest is
ignored. For example:
3000 6000 10000 Ns
means that one wants xhpl to run 3 (specified in line 5) problem sizes, namely
3000, 6000 and 10000.
Line 7: This line specifies the number of block sizes to be run. This number
should be less than or equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
5 # of NBs
this means that the user is willing to use 5 block sizes that will be specified in the
next line.
Line 8: This line specifies the block sizes one wants to run. Assuming the line
above started with 5, the 5 first positive integers are significant, the rest is ignored.
For example:
80 100 120 140 160 NBs
means that one wants xhpl to use 5 (specified in line 7) block sizes, namely 80,
100, 120, 140 and 160.
Line 9: This line specifies how the MPI processes should be mapped onto the
nodes of the platform. There are currently two possible mappings, namely row- and
column-major. This feature is mainly useful when these nodes are themselves
multi-processor computers. A row-major mapping is recommended.
0 PMAP process mapping (0=Row-,1=Column-major)
Line 10: This line specifies the number of process grids to be run. This number
should be less than or equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
2 # of process grids (P x Q)
this means that it will try 2 process grid sizes that will be specified in the next line.
Line 11-12: These two lines specify the number of process rows and columns
of each grid to run on. Assuming the line above (10) started with 2, the 2 first
positive integers of those two lines are significant, the rest is ignored. For example:
1 2 Ps
6 8 Qs
means that one wants to run xhpl on 2 process grids (line 10), namely 1-by-6 and
2-by-8. Note: In this example, it is required then to start xhpl on at least 16
nodes (max of Pi-by-Qi). The runs on the two grids will be consecutive. If one
was starting xhpl on more than 16 nodes, say 52, only 6 would be used for the
first grid (1x6) and then 16 (2x8) would be used for the second grid. The fact
that you started the MPI job on 52 nodes will not make HPL use all of them. In
this example, only 16 would be used. If one wants to run xhpl with 52 processes,
one needs to specify a grid of 52 processes, for example the following lines would
do the job:
4 2 Ps
13 8 Qs
Line 13: This line specifies the threshold to which the residuals should be
compared. The residuals should be of order 1, but are in practice slightly less
than this, typically 0.001. This line is made of a real number; the rest is not
significant. For example:
16.0 threshold
In practice, a value of 16.0 will cover most cases. For various reasons, it is possible
that some of the residuals become slightly larger, say for example 35.6. xhpl will
flag those runs as failed, but they can be considered correct. A run should be
considered failed if the residual is a few orders of magnitude bigger than 1, for
example 10^6 or more. Note: if one was to specify a threshold of 0.0, all tests
would be flagged as failed, even though the answer is likely to be correct. It is
allowed to specify a negative value for this threshold, in which case the checks
will be bypassed, no matter what the threshold value is, as soon as it is negative.
This feature allows one to save time when performing a lot of experiments, say for
instance during the tuning phase. Example:
-16.0 threshold
The remaining lines allow one to specify algorithmic features. xhpl will run all
possible combinations of those for each problem size, block size and process grid
combination. This is handy when one looks for an "optimal" set of parameters. To
understand this a little better, let us first say a few words about the algorithm
implemented in HPL. Basically this is a right-looking version with row-partial
pivoting. The panel factorization is matrix-matrix operation based and recursive,
dividing the panel into NDIV subpanels at each step. This part of the panel
factorization is denoted below by "recursive panel fact. (RFACT)". The recursion
stops when the current panel is made of less than or equal to NBMIN columns. At
that point, xhpl uses a matrix-vector operation based factorization denoted below
by "PFACTs".
Classic recursion would then use NDIV=2, NBMIN=1. There are essentially 3
numerically equivalent LU factorization algorithm variants (left-looking, Crout and
right-looking). In HPL, one can choose any one of those for the RFACT, as well
as the PFACT. The following lines of HPL.dat allow one to set those parameters.
Lines 14-21: (Example 1)
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 No. of panels in recursion
2 3 4 NDIVs
3 No. of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try all variants of PFACT, 4 values for NBMIN, namely 1, 2,
4 and 8, 3 values for NDIV namely 2, 3 and 4, and all variants for RFACT.
Lines 14-21: (Example 2)
2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try 2 variants of PFACT namely right looking and left look-
ing, 2 values for NBMIN, namely 4 and 8, 1 value for NDIV namely 2, and one
variant for RFACT.
In the main loop of the algorithm, the current panel of columns is broadcast
in process rows using a virtual ring topology. HPL offers various choices, and one
most likely wants to use the increasing-ring modified topology, encoded as 1; 3 and
4 are also good choices.
Lines 22-23: (Example 1)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring mod-
ified topology.
Lines 22-23: (Example 2)
2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring vir-
tual topology and the long message algorithm.
Lines 24-25 allow one to specify the look-ahead depth used by HPL. A depth of 0
means that the next panel is factorized after the update by the current panel is
completely finished. A depth of 1 means that the next panel is immediately
factorized after being updated; the update by the current panel is then finished. A
depth of k means that the k next panels are factorized immediately after being
updated; the update by the current panel is then finished. It turns out that a
depth of 1 seems to give the best results, but may need a large problem size before
one can see the performance gain. So use 1 if you do not know better; otherwise
you may want to try 0. Look-ahead of depths 3 and larger will probably not give
better results.
Lines 24-25: (Example 1):
1 No. of lookahead depth
1 DEPTHs (>= 0)
This will cause HPL to use a look-ahead of depth 1.
Lines 24-25: (Example 2):
2 No. of lookahead depth
0 1 DEPTHs (>= 0)
This will cause HPL to use a look-ahead of depths 0 and 1.
Lines 26-27 allow one to specify the swapping algorithm used by HPL for all tests.
There are currently two swapping algorithms available: one based on "binary
exchange" and the other based on a "spread-roll" procedure (also called "long"
below). For large problem sizes, the latter is likely to be more efficient. The user
can also choose to mix both variants, that is, "binary-exchange" for a number of
columns less than a threshold value, and then the "spread-roll" algorithm. This
threshold value is then specified on Line 27.
Lines 26-27: (Example 1):
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm. Note that a threshold is specified in this example but not used by HPL.
Lines 26-27: (Example 2):
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm as soon as there are more than 60 columns in the row panel. Otherwise, the "binary exchange" algorithm will be used instead.
Line 28 allows one to specify whether the upper triangle of the panel of columns should be stored in transposed or non-transposed form. Example:
0 L1 in (0=transposed,1=no-transposed) form
Line 29 allows one to specify whether the panel of rows U should be stored in transposed or non-transposed form. Example:
0 U in (0=transposed,1=no-transposed) form
Line 30 enables / disables the equilibration phase. This option will not be used
unless 1 or 2 are selected in Line 26. Example:
1 Equilibration (0=no,1=yes)
Line 31 allows one to specify the alignment in memory for the memory space allocated by HPL. On modern machines, one probably wants to use 4, 8 or 16. This may result in a tiny amount of wasted memory. Example:
8 memory alignment in double (> 0)
2.11.2 Guidelines for HPL.dat configuration
1. Figure out a good block size for the matrix multiply routine. The best
method is to try a few out. If the block size used by the matrix-matrix
multiply routine is known, a small multiple of that block size will do fine.
This particular topic is discussed in the FAQs section.
2. The process mapping should not matter if the nodes of the platform are single-processor computers. If the nodes are multi-processors, a row-major mapping is recommended.
3. HPL likes "square" or slightly flat process grids. Unless a very small process grid is used, stay away from the 1-by-Q and P-by-1 process grids. This particular topic is also discussed in the FAQs section.
4. Panel factorization parameters: a good start are the following for the lines
14-21:
1 No. of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 No. of recursive stopping criterium
4 8 NBMINs (>= 1)
1 No. of panels in recursion
2 NDIVs
1 No. of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
5. Broadcast parameters: at this time it is far from obvious to me what the best setting is, so I would probably try them all. If I had to guess, I would probably start with the following for lines 22-23:
2 No. of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
The best broadcast depends on the problem size and hardware performance. Usually 4 or 5 may be competitive for machines whose nodes are very fast relative to the network.
6. Look-ahead depth: as mentioned above, 0 or 1 are likely to be the best choices. This also depends on the problem size and machine configuration, so I would try "no look-ahead (0)" and "look-ahead of depth 1 (1)". That is, for lines 24-25:
2 No. of lookahead depth
0 1 DEPTHs (>= 0)
7. Swapping: one can select only one of the three algorithms in the input file. Theoretically, mix (2) should win; however, long (1) might just be good enough. The difference between those two should be small, assuming a swapping threshold of the order of the selected block size (NB). If this threshold is very large, HPL will use bin-exch (0) most of the time, and if it is very small (< NB), it will essentially always use long (1). So for lines 26-27:
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
I would also try the long variant. For a very small number of processes in every column of the process grid (say < 4), very little performance difference should be observable.
8. Local storage: I do not think Line 28 matters; pick 0 if in doubt. Line 29 is more important. It controls how the panel of rows should be stored. No doubt 0 is better. The caveat is that in that case the matrix-multiply function is called with ( Notrans, Trans, ... ), that is, C := C - A * B^T. Unless the computational kernel used has a very poor (with respect to performance) implementation of that case and is much more efficient with ( Notrans, Notrans, ... ), just pick 0 as well. So, the choice:
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
9. Equilibration: it is hard to tell whether equilibration should always be performed or not. Not knowing much about the random matrix generated, and because the overhead is so small compared to the possible gain, I turn it on all the time.
1 Equilibration (0=no,1=yes)
10. For alignment, 4 should be plenty, but just to be safe, one may want to pick
8 instead.
8 memory alignment in double (> 0)
2.12 HPCC Challenge Benchmark
HPCC was developed to study future petascale computing systems, and is intended to provide a realistic measurement of modern computing workloads. HPCC is made up of seven common computational kernels: STREAM, HPL, DGEMM (matrix multiply), PTRANS (parallel matrix transpose), FFT, RandomAccess, and b_eff (bandwidth/latency tests). The benchmarks attempt to measure both high and low spatial and temporal locality. The tests are scalable and can be run on a wide range of platforms, from single processors to the largest parallel supercomputers.
The HPCC benchmarks test three particular regimes: local or single-processor, embarrassingly parallel, and global, where all processors compute and exchange data with each other. STREAM measures a processor's memory bandwidth. HPL is the LINPACK TPP (Toward Peak Performance) benchmark; RandomAccess measures the rate of random updates of memory; PTRANS measures the rate of transfer of very large arrays of data from memory; b_eff measures the latency and bandwidth of increasingly complex communication patterns.
All of the benchmarks are run in two modes: base and optimized. The base run allows no source modifications of any of the benchmarks, but allows generally available optimized libraries to be used. The optimized run allows significant changes to the source code. The optimizations can include alternative programming languages and libraries that are specifically targeted for the platform being tested.
The HPC Challenge benchmark consists at this time of seven benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff latency/bandwidth.
HPL ( system performance )
The LINPACK TPP benchmark, which measures the floating-point rate of execution for solving a randomly generated dense linear system of equations in double-precision (IEEE 64-bit) floating-point arithmetic using MPI. The linear system matrix is stored in a two-dimensional block-cyclic fashion, and multiple variants of code are provided for the computational kernels and communication patterns. The solution method is LU factorization through Gaussian elimination with partial row pivoting, followed by backward substitution. Unit: Tera Flops per Second
PTRANS (A = A + B^T) ( system performance )
Implements a parallel matrix transpose for two-dimensional block-cyclic storage. It is an important benchmark because it heavily exercises the communications of the computer on a realistic problem where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network. Unit: Giga Bytes per Second
RandomAccess ( system performance )
Global RandomAccess, also called GUPS, measures the rate at which the computer can update pseudo-random locations of its memory; this rate is expressed in billions (giga) of updates per second (GUP/s). Unit: Giga Updates per Second
FFTE ( system performance )
It measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform (DFT). Global FFTE performs the same test as FFTE but across the entire system by distributing the input vector in block fashion across all the processes. Unit: Giga Flops per Second
STREAM ( system performance - derived )
The Embarrassingly Parallel STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple numerical vector kernels. It is run in an embarrassingly parallel manner: all computational processes perform the benchmark at the same time, and the arithmetic average rate is multiplied by the number of processes to obtain this value ( EP-STREAM Triad * MPI Processes ). Unit: Giga Bytes per Second
DGEMM ( per process )
The Embarrassingly Parallel DGEMM benchmark measures the floating-point execution rate of a double-precision real matrix-matrix multiply performed by the DGEMM subroutine from the BLAS (Basic Linear Algebra Subprograms). It is run in an embarrassingly parallel manner: all computational processes perform the benchmark at the same time, and the arithmetic average rate is reported. Unit: Giga Flops per Second
Effective bandwidth benchmark (b_eff)
The effective bandwidth benchmark is a set of tests that measure the latency and bandwidth of a number of simultaneous communication patterns.
Random Ring Bandwidth ( per process )
Randomly Ordered Ring Bandwidth reports the bandwidth achieved in the ring communication pattern. The communicating processes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator). The result is averaged over various random assignments of processes in the ring. Unit: Giga Bytes per Second
Random Ring Latency ( per process )
Randomly Ordered Ring Latency reports the latency in the ring communication pattern. The communicating processes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator). The result is averaged over various random assignments of processes in the ring. Unit: micro-seconds
Giga-updates per second (GUPS) is a measure of computer performance. GUPS measures how frequently a computer can issue updates to randomly generated RAM locations. GUPS measurements stress the latency and especially the bandwidth capabilities of a machine.
Chapter 3
Design and Implementation
3.1 Beowulf Clusters: A Low cost alternative
Beowulf is not a particular product. It is a concept for clustering varying numbers
of small, relatively inexpensive computers running the Linux operating system.
The goal of Beowulf clustering is to create a parallel-processing supercomputer
environment at a price well below that of conventional supercomputers.
Figure 3.1: The Schematic structure of proposed cluster
A Beowulf cluster is a PC cluster that normally runs under a Linux OS. Each PC (node) is dedicated to the work of the cluster and connected through a network with the other nodes. Figure 3.1 schematically shows the structure of the proposed cluster. In this cluster, a master node controls the worker nodes by communicating through the network using the Message Passing Interface (MPI). The proposed cluster will have a better price/performance ratio and scalability than other parallel computers due to the use of off-the-shelf components and the Linux OS. It is easy and economical to add more nodes as needed without changing software programs.
3.2 Logical View of proposed Cluster
The primary and most often used view is termed the logical view, and this is the view that one generally interacts with when using a cluster. In this view, the physical components are categorized and displayed in a layered manner; the primary concerns here are the parallel applications, the message-passing library, the OS and the interconnect.
Figure 3.2: Logical view of proposed cluster
3.3 Hardware Configuration
As previously indicated, a cluster is comprised of computers interconnected through a LAN. Let us talk first about the requirements of this cluster in terms of hardware, and then about the software that will run on the system.
3.3.1 Master Node
The master server provides access to the primary network and ensures the availability of the cluster. The server has a Fast Ethernet connection to the network in order to better keep up with the high speed of the PCs. Any system from Intel, AMD or another vendor can be used as the server. Here, a PC with an Intel i7-2600 processor and 4 GB of RAM is used as the server.
3.3.2 Compute Nodes
Building custom PCs from commodity off-the-shelf components requires a lot of work to assemble the cluster, but the result can be fine-tuned as needed. One can buy generic PCs and shelves, and may want keyboard switches for smaller configurations. For larger configurations, a better solution is to use the serial ports of each machine and connect them to a terminal server; custom rack-mount nodes can even be used, which are more expensive but save space, though they may complicate cooling due to closely packed components. For the complete setup, old unused PCs from the college are used. Here, for testing purposes, similar PCs with Intel i7-2600 processors and 4 GB of RAM are used.
3.3.3 Network
As previously indicated, the computers in a cluster communicate using a network interconnection, as can be seen in Figure 3.3. The master and the compute nodes have NICs, and all the computers are connected to a switch that performs the delivery of messages. The cost per port of an Ethernet switch is about four times larger than that of an Ethernet hub, but an Ethernet switch is used for the following reasons. An Ethernet hub is a network device that acts as a broadcast bus, where an input signal is amplified and distributed to all ports. However, only a couple of computers can communicate properly at once, and if two or more computers send packets simultaneously, a collision occurs. Therefore, the bandwidth of an Ethernet hub is equivalent to the bandwidth of the communication link: 10 Mb/s for standard Ethernet, 100 Mb/s for Fast Ethernet and 1 Gb/s for Gigabit Ethernet. An Ethernet switch provides more aggregate bandwidth by allowing multiple simultaneous communications; if there are no conflicts on the output ports, the switch can send multiple packets simultaneously. A major disadvantage that clusters have compared to supercomputers is their latency. The bandwidth of each computer could be increased using multiple NICs, which is possible through what is known in Linux as channel bonding: the simulation of a network interface linking multiple NICs so that applications see only a single interface. Access to the cluster is often made remotely, which is why the frontend has two NICs, one to access the Internet and another to connect to the other nodes in the cluster. The maximum bandwidth provided by the college Ethernet is 100 Mb/s, and the minimum latency for Fast Ethernet is 80 microseconds. All cluster machines are connected through the college's Ethernet.
Figure 3.3: The Network interconnection
3.4 Softwares
The system that has been designed and implemented uses the Linux kernel with GNU applications. These applications range from servers to compilers.
1. Operating System: The operating system used is the Linux-based CentOS 6.2. It is an enterprise-quality operating system, because it is based on the source code of Red Hat Enterprise Linux, which is tested and stabilized extensively prior to release. At the same time, CentOS (Community ENTerprise Operating System) is completely free and open source, offering all of the user support and features of a community-run Linux distribution. Version 6 has been chosen because it is the latest stable version. The operating system that runs on the frontend includes the standard applications of the distribution in addition to others required for the construction of the cluster: message-passing libraries, compilers, servers and software for monitoring the resources of the cluster.
2. Message-passing libraries: In parallel computation, in order to perform task resolution and intensive calculations, one must divide and distribute independent tasks to the different computers using message-passing libraries. There are several libraries of this type, the most well known being MPI and PVM (Parallel Virtual Machine). The system integrates MPI. The reason for this choice is that it is the library most commonly used by the numerical analysis community for message passing. Specifically, MPICH2 has been used in the proposed system.
3. Compilers: Languages commonly used in parallel computing are C, C++, Python and FORTRAN. For this reason these programming languages are supported within the system that has been developed, integrating the compilers gcc, g++ and gfortran.
4. Compute nodes: The operating system that runs on the nodes is a basic CentOS 6.2 without a GUI. It integrates the kernel and the basic services necessary for adequate performance of the nodes; unnecessary software has been discarded. MPICH2 is included, as well as the compilers gcc, g++ and gfortran.
3.4.1 MPICH2
MPICH2 is architected so that a number of communication infrastructures can be used. These are called "devices." The device most relevant for the Beowulf environment is the channel device (also called "ch3" because it is the third version of the channel approach for implementing MPICH); it supports a variety of communication methods and can be built to support the use of both TCP over sockets and shared memory. In addition, MPICH2 uses a portable interface to process management systems, providing access both to external process managers (allowing the process managers direct control over starting and running the MPI processes) and to the MPD scalable process manager included with MPICH2. To run a first MPI program, carry out the following installation steps:
1. Download mpich2-1.4.1p1.tar.gz from www.mcs.anl.gov/mpi/mpich and copy it to /home/beowulf/sw/
2. Extract the contents in /home/beowulf/sw/
$tar xvfz mpich2-1.4.1p1.tar.gz
3. Create a folder for the installation
$mkdir /opt/mpich2-1.4.1p1
4. Create a build directory
$mkdir /tmp/mpich2-1.4.1p1
$cd /tmp/mpich2-1.4.1p1
5. Run configure, redirecting the output to a log file. Most users should specify a prefix for the installation path when configuring:
$/home/beowulf/sw/mpich2-1.4.1p1/configure --prefix=/opt/mpich2-1.4.1p1 2>&1 | tee configure.log
6. By default, this creates the channel device for communication with TCP over sockets. Now build:
$make 2>&1 | tee make.log
7. Install the MPICH2 commands:
$make install 2>&1 | tee install.log
8. Add the <prefix>/bin directory to the PATH by adding the line below to the .bashrc file in the home directory
$vi ~/.bashrc
export PATH=<prefix>/bin:$PATH
9. Test the MPICH2 installation
$which mpicc
SSH login without password
Public-key authentication allows one to log in to a remote host via the SSH protocol without a password, and is more secure than password-based authentication. Try creating a passwordless connection from master to node1 using public-key authentication.
Create key
Press ENTER at every prompt.
[root@master]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
b2:ad:a0:80:85:ad:6c:16:bd:1c:e7:63:4f:a0:00:15 user@host
The key's randomart image is:
[root@master]#
For added security, the key itself should be protected using a strong passphrase. If a passphrase is used to protect the key, ssh-agent can be used to cache the passphrase.
Copy key to remote host
[root@master]# ssh-copy-id root@node1
root@node1’s password:
Now try logging into the machine, with "ssh 'root@node1'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
[root@master]#
Login to remote host
Note that no password is required.
root@master# ssh root@node1
Last login: Tue May 18 12:47:53 2012 from 10.1.11.210
[root@node1]#
It is also necessary to disable the firewall on all cluster machines so that the cluster can work seamlessly. To achieve this, first log in as the root user, then enter the following three commands to disable the firewall.
#service iptables save
#service iptables stop
#chkconfig iptables off
Now MPI programs can be run on the cluster.
Running MPI Program
The following assumes that MPICH2 is installed on all cluster machines running CentOS 6.2 and that every machine has access to every other via the mpiexec command. It is also assumed that the command line, or a terminal window, is used to compile, copy, and run the code. Typically, running an MPI program consists of three steps:
Compile
Assuming that the code to compile is ready (if only binary executables are available, proceed to step 2, Copy), an executable needs to be created. This involves compiling the code with the appropriate compiler, linked against the MPI libraries. It is possible to pass all options through a standard cc or f77 command, but MPICH provides a "wrapper" (mpicc for cc/gcc, mpicxx/mpic++ for c++/g++ on UNIX/Linux and mpif77 for f77) that appropriately links against the MPI libraries and sets the appropriate include and library paths.
Example: hello.c (use the vi text editor to create the file hello.c)
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int rank, size;
    char name[80];
    int length;
    MPI_Init(&argc, &argv); /* note that argc and argv are passed by address */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("Hello MPI: processor %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
After saving the above example file, compile the program using the mpicc command.
$mpicc -o hello hello.c
The "-o" option provides an output file name; otherwise the executable would be saved as "a.out". Be careful to provide an executable name when the "-o" option is used: many programmers have deleted part of their source code by accidentally giving their source file as the output file name. If the file name is typed correctly and there are no bugs in the code, the code will compile successfully, and an "ls" command should show that the output file "hello" has been created.
$ls
hello.c
$mpicc -o hello hello.c
$ls
hello.c hello
Copy
In order for the program to run on each node, the executable must exist on each node. There are as many ways to make sure that the executable exists on all of the nodes as there are ways to put the cluster together in the first place. One method is covered below.
This method assumes that a directory (/home/beowulf/testing) exists on all the nodes, that authentication is done via ssh, and that public keys have been shared for the account to allow login and remote execution without a password.
One command that can be used to copy files between machines is "scp": a Unix command that securely copies files between remote machines and, in its simplest use, acts as a secure remote copy. It takes arguments similar to the Unix "cp" command.
With the example saved in the directory /home/beowulf/testing (i.e. the file is saved as /home/beowulf/testing/hello), the following command will copy the file hello to a remote node.
$scp hello root@node1:/home/beowulf/testing
This needs to be done for each host. To check whether the copy is working properly, ssh into each host and check that the files are there using the "ls" command.
Execute
Once the code is compiled and copied to all of the nodes, run it using the mpiexec command. Two of the more common arguments to the mpiexec command are the "-np" (or "-n") argument, which specifies how many processes to use, and the "-f" argument, which specifies exactly which nodes are available for use. An entry for the hosts file has already been made in .bashrc in the home directory, so there is no need to use this argument.
Change to the directory where the executable is located, and run hello using 4 processes:
$mpiexec -n 4 ./hello
Hello MPI: processor 0 of 4 on master
Hello MPI: processor 3 of 4 on node3
Hello MPI: processor 2 of 4 on node2
Hello MPI: processor 1 of 4 on node1
3.4.2 HYDRA: Process Manager
Hydra is a process management system for starting parallel jobs. Hydra is designed to work natively with multiple launchers and resource managers such as ssh, rsh, PBS, SLURM and SGE. Starting with MPICH2-1.3, Hydra is the default process manager, which is automatically used with mpiexec.
As there is a bug in the hydra-1.4 that comes with mpich2-1.4.1p1, hydra-1.5b1 has been installed separately. Once built, the new Hydra executables are in mpich2/bin, or in the bin subdirectory of the install directory if an install has been done. Put this bin directory in the PATH in .bashrc for convenience:
Put in .bashrc: export PATH=/opt/mpich2-1.4.1p1/bin/bin:$PATH
HYDRA_HOST_FILE: this variable points to the default host file to use when the "-f" option is not provided to mpiexec. For bash:
export HYDRA_HOST_FILE=<path to host file>/hosts
3.4.3 TORQUE: Resource Manager
The TORQUE Resource Manager is a distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Terascale Open-source Resource and QUEue Manager. It is a community effort based on the original PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the US DOE, Sandia, PNNL, UB, TeraGrid, and many other leading-edge HPC organizations. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling and administration on a cluster. TORQUE is described by its developers as open-source software, using the OpenPBS version 2.3 license, but is classified as non-free under the Debian Free Software Guidelines owing to issues with that license.
Feature Set
TORQUE provides enhancements over standard OpenPBS in the following areas:
Fault Tolerance
• Additional failure conditions checked/handled
• Node health check script support
Scheduling Interface
• Extended query interface providing the scheduler with additional and more
accurate information
• Extended control interface allowing the scheduler increased control over job
behavior and attributes
• Allows the collection of statistics for completed jobs
Scalability
• Significantly improved server-to-MOM (per-node daemon) communication model
• Ability to handle larger clusters (over 15 TF/2,500 processors)
• Ability to handle larger jobs (over 2000 processors)
• Ability to support larger server messages
Usability
• Extensive logging additions
• More human readable logging (i.e. no more ’error 15038 on command 42’)
3.4.4 MAUI: Cluster Scheduler
The Maui Cluster Scheduler is an open-source job scheduler for clusters and supercomputers, initially developed by Cluster Resources, Inc. Maui is capable of supporting multiple scheduling policies, dynamic priorities, reservations, and fairshare capabilities. Maui satisfies some definitions of open-source software but is not available for commercial usage. It improves the manageability and efficiency of machines ranging from clusters of a few processors to multi-teraflops supercomputers.
Job State
Jobs in Maui can be in one of three major states:
Running
A job that has been allotted its required resources and has started its computation is considered running until it finishes.
Queued (idle)
Jobs that are eligible to run. Their priority is calculated here and the jobs are sorted according to the calculated priority. Advance reservations are made starting with the job at the front of the queue.
Non-queued
Jobs that, for some reason, are not allowed to start. Jobs in this state do not gain any queue-time priority.
There is a limit on the number of jobs a group/user can have in the Queued state. This prevents users from acquiring more queue time than deserved by submitting a large number of jobs.
3.5 System Considerations
The following sections discuss system considerations and requirements:
Design/development Debug
There are a number of critical tools necessary for the implementation of a successful HPC cluster solution. The first is a compiler that can take advantage of the architectural features of the processor. Next, a debugger such as gdb allows the developer to debug the code and assists in finding the problem areas or sections of code to be further tuned for performance. A profiler is also necessary to assist in finding the performance bottlenecks in the overall system, including the system interconnect.
Job Control
Once an application has been developed or ported to a Beowulf cluster, the application must be started and run on a portion of, or the entire, cluster. One must understand the particular needs and requirements for system partitioning, how jobs are started and run, and how a queue of jobs can be set up to run automatically.
Checkpoint Restart
Many applications running on even very large HPC clusters require many hours, days, or weeks of execution time to run to completion. A failure in one part of the system could corrupt a job execution run, forcing a restart. The solution is to periodically checkpoint the current state, writing the intermediate data calculations available at the end of the interval to a disk subsystem. Writing out the data usually takes a small amount of time, with the compute functions temporarily paused; the time depends on the storage architecture. If there is a system failure of one of the computing components, the failing component can be taken out of the cluster and the job restarted with the data available from the previous period's checkpoint save.
Performance Monitoring
Even if a considerable amount of time is spent during the debug phase tuning the application for best performance, a performance-monitoring function is still necessary to watch the cluster's performance over time. With potentially multiple job streams running concurrently on the system, each taking differing amounts of CPU or memory, there may be situations where the applications are not running at the expected efficiency. The performance-monitoring tool can assist in detecting these situations.
Benchmarking
An excellent collection of benchmarks is the HPCC benchmarking suite, which consists of seven well-known public-domain benchmarks. The latest version allows one to compare network performance with raw TCP, PVM, MPICH and LAM/MPI, among others. It is also worthwhile to use the latest version of the HPL (High Performance Linpack) benchmark. For parallel benchmarks, the above-mentioned benchmarks are a reasonable test (especially if running numerical computations on the cluster). These and other benchmarks are necessary to evaluate different architectures, motherboards and network cards.
Chapter 4
Experiments
To evaluate the usage, acceptability and performance of the cluster, a few parallel programs were implemented. The first finds the prime numbers in a given range. The second calculates the value of π. Then an embarrassingly parallel program that solves the circuit satisfiability problem was tested. The 1D time-dependent heat equation and radix-2 FFT algorithms were implemented as real-life programs.
Two standard benchmarking experiments, which are also used to rank the Top500 supercomputers, were also conducted. The first of them is the High Performance Linpack benchmark, and the other is HPCC, a complete suite of seven tests covering many performance factors.
The work of a global problem can often be divided into a number of independent
tasks which rarely need to synchronize; Monte Carlo simulations and numerical
integration are examples. So, in the examples below, the code that can be
parallelized is identified and then executed simultaneously on different
cluster nodes with different data. If the parallelizable code does not depend
on the output of other nodes, we get better performance. The essence is to
divide the entire computation evenly among collaborating processors: divide
and conquer.
4.1 Finding Prime Numbers
This C program counts the number of primes between 1 and N, using MPI to carry
out the calculation in parallel. The algorithm is completely naive: for each
integer I, it simply checks whether any smaller J evenly divides it. The total
amount of work for a given N is thus roughly proportional to N²/2. Figure 4.1
shows the performance of the cluster for finding various primes as compared to
a single machine. This program is mainly a starting point for investigations
into parallelization.
Figure 4.1: Graph showing performance for Finding Primes
Here the total range of numbers in which we want to find the primes is divided
into equal parts, which are distributed among the computing nodes. Every node
carries out its task and sends its results back to the master node. Finally,
it is the job of the master node to combine the results of all the nodes and
give the final result.
4.2 PI Calculation
The number π is a mathematical constant that is the ratio of a circle’s circumfer-
ence to its diameter. The constant, sometimes written pi, is approximately equal
to 3.14159. The program calculates the value of π using:
∫₀¹ 4/(1 + x²) dx = π
The calculated value is then compared with the known value of π to find the
accuracy of the output, and the time taken by the program is also displayed.
Figure 4.2 shows the time taken by different numbers of PCs to calculate π.
To parallelize the code, identify the part(s) of the sequential algorithm that
can be executed in parallel (this is the difficult part), then distribute the
global work and data among the cluster nodes. Here, different iterations over
the N rectangles can be run in parallel.
Figure 4.2: Graph showing performance for Calculating π
4.3 Circuit Satisfiability Problem
CSAT is a C program which demonstrates, for a particular circuit, an exhaustive
search for solutions of the circuit satisfiability problem. This version of the
program uses MPI to carry out the solution in parallel. The problem assumes
that a logical circuit of AND, OR, and NOT gates is given, with N binary inputs
and a single output, and asks for all inputs which produce a 1 as the output.
The general problem is NP-complete, so there is no known polynomial-time
algorithm to solve the general case. The natural way to search for solutions
is therefore exhaustive search. In an interesting way, this is a very extreme
and discrete version of the problem of maximizing a scalar function of multiple
variables; the difference is that here both the input and the output take only
the values 0 and 1, rather than a continuous range of real values.
This problem is a natural candidate for parallel computation, since the
individual evaluations of the circuit are completely independent. The complete
problem domain is therefore divided into equal parts, and each node performs
its share of the work to produce the final results.
Figure 4.3: Graph showing performance for solving C-SAT Problem
4.4 1D Time Dependent Heat Equation
The heat equation is an important partial differential equation which describes
the distribution of heat (or variation in temperature) in a given region over time.
This program solves
∂u/∂t - k ∂²u/∂x² = f(x, t)
over the interval [A, B] with boundary conditions
u(A, t) = uA(t), u(B, t) = uB(t),
over the time interval [t0, t1] with initial condition
u(x, t0) = u0(x).
4.4.1 The finite difference discretization
To apply the finite difference method, define a grid of points x(1) through x(n),
and a grid of times t(1) through t(m). In the simplest case, both grids are evenly
spaced. The approximate solution at spatial point x(i) and time t(j) is denoted
by u(i,j).
Figure 4.4: Graph showing performance for solving 1D Time Dependent Heat
Equation
A second order finite difference can be used to approximate the second deriva-
tive in space, using the solution at three points equally separated in space.
A forward Euler approximation to the first derivative in time is used, which
relates the value of the solution to its value at a short interval in the future.
Thus, at the spatial point x(i) and time t(j), the discretized differential equa-
tion defines a relationship between u(i-1,j), u(i,j), u(i+1,j) and the ”future” value
u(i,j+1). This relationship can be drawn symbolically as a four node stencil:
Figure 4.5: Symbolic relation between four nodes
Since the value of the solution at the initial time is given, use the stencil, plus
the boundary condition information, to advance the solution to the next time step.
Repeating this operation gives us an approximation to the solution at every point
in the space-time grid.
4.4.2 Using MPI to compute the solution
To solve the 1D heat equation using MPI, use a form of domain decomposition.
Given P processors, divide the interval [A, B] into P equal subintervals. Each
processor can set up the stencil equations that define the solution almost
independently. The exception is that every processor needs to receive a copy of
the solution values determined for the nodes on its immediate left and right.
Thus, each processor uses MPI to send its leftmost solution value to its left
neighbour and its rightmost solution value to its right neighbour. Of course,
each processor must then also receive the corresponding information that its
neighbours send to it. (However, the first and last processors have only one
neighbour, and use boundary condition information to determine the behaviour
of the solution at the node which is not next to another processor's node.)
The naive way of setting up the information exchange works, but can be
inefficient, since each processor sends a message and then waits for
confirmation of receipt, which cannot happen until some processor has moved to
the "receive" stage; that only happens because the first or last processor does
not have to receive information on a given step.
4.5 Fast Fourier Transform
To make the DFT operation more practical, several FFT algorithms have been
proposed. The fundamental approach in all of them is to exploit the properties
of the DFT operation itself; all of them reduce the computational cost of
performing the DFT on the given input sequence.
W_N^kn = e^(-j2πkn/N)
This value W_N is referred to as the twiddle factor or phase factor. Being a
trigonometric function evaluated at discrete points around the four quadrants
of the two-dimensional plane, the twiddle factor has symmetry and periodicity
properties:
Symmetry property: W_N^(k+N/2) = -W_N^k
Periodicity property: W_N^(k+N) = W_N^k
Figure 4.6: Graph showing performance Radix-2 FFT algorithm
Using these properties of the twiddle factor, unnecessary computations can
be eliminated. Another approach that can be used is the divide-and-conquer
approach. In this approach, the given one-dimensional input sequence of length
N can be represented in two-dimensional form with M rows and L columns, with
N = M x L. It can be shown that a DFT performed on such a representation leads
to fewer computations: N(M+L+1) complex multiplications and N(M+L-2) complex
additions. Note that this approach is applicable only when the value of N is
composite.
4.5.1 Radix-2 FFT algorithm
This algorithm is a special case of the approaches described earlier, in which
N can be represented as a power of 2, i.e., N = 2^v. This means that the
number of complex multiplications and additions is reduced to N(N+6)/2 and
N²/2 respectively just by using the divide-and-conquer approach. When the
symmetry and periodicity properties of the twiddle factor are used, it can be
shown that the number of complex additions and multiplications can be further
reduced to N log₂N and (N/2) log₂N respectively. Hence, from an O(N²)
algorithm, the computational complexity has been reduced to O(N log N). The
entire process is divided into log₂N stages, and in each stage N/2 two-point
DFTs are performed. The computation involving each pair of data is called a
butterfly. The Radix-2 algorithm can be implemented as a decimation-in-time
(M = N/2, L = 2) or decimation-in-frequency (M = 2, L = N/2) algorithm.
Figure 4.7: 8-point Radix-2 FFT: Decimation in frequency form
Figure 4.7 gives the decimation-in-frequency form of the Radix-2 algorithm for
an input sequence of length N = 8.
4.6 Theoretical Peak Performance
The theoretical peak is based not on an actual performance from a benchmark run,
but on a paper computation to determine the theoretical peak rate of execution of
floating point operations for the machine. This is the number manufacturers often
cite; it represents an upper bound on performance. That is, the manufacturer
guarantees that programs will not exceed this rate for a given computer.
To calculate the theoretical peak performance of the HPC system, first
calculate the theoretical peak performance of one node (server) in GFlops, and
then multiply the node performance by the number of nodes in the system. The
following formula is used for node theoretical peak performance:
Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores)
x (CPU instructions per cycle) x (number of CPUs per node)
For cluster:
CPUs based on Intel i7-2600 (3.40GHz 4-cores):
3.40 x 4 x 4 = 54.4 GFlops
CPU speed in GHz: 3.40
No. of cores per CPU: 4
No. of instructions per cycle: 4
Four PC Clusters Theoretical Peak Performance:
54.4 GFlops x 4 = 217.6 GFlops
4.7 Benchmarking
It is generally a good idea to verify that the newly built cluster can actually
do work. This can be accomplished by running a few industry-accepted
benchmarks. The purpose of benchmarking is not to obtain the best possible
numbers, but to get consistent, repeatable, accurate results.
4.8 HPL
HPL (High Performance Linpack) is a software package that solves a (random)
dense linear system of equations in double-precision (64-bit) arithmetic on
distributed-memory computers. The performance measured using this program on
several computers forms the basis for the Top500 supercomputer list. Using
ATLAS (Automatically Tuned Linear Algebra Software) for the BLAS library, it
gives 28.67 GFlops for the 4-node cluster.
4.8.1 HPL Tuning
After building the executable /root/hpl-2.0/bin/Linux_PII_CBLAS/xhpl, one may
want to modify the input data file HPL.dat. This file should reside in the same
directory as the executable. An example HPL.dat file is provided by default. It
contains information about the problem sizes, machine configuration, and
algorithm features to be used by the executable, and is 31 lines long. All the
selected parameters are printed in the output generated by the executable.
There are many ways to tackle tuning, for example:
1. Fixed Processor Grid, Fixed Block size and Varying Problem size N.
2. Fixed Processor Grid, Fixed Problem size and Varying Block size.
3. Fixed Problem size, Fixed Block size and Varying the Processor grid.
4. Fixed Problem size, Varying the Block size and Varying the Processor grid.
5. Fixed Block size, Varying the Problem size and Varying the Processor grid.
HPL.dat file for cluster
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
41328 Ns
1 # of NBs
168 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>= 0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
4.8.2 Run HPL on cluster
At this point all that remains is to add some software that can run on the
cluster, and there is nothing better than HPL (Linpack), which is widely used
to measure cluster efficiency (the ratio between actual and theoretical
performance). Do the following steps on all nodes:
Copy the Make.Linux_PII_CBLAS file from $(HOME)/hpl-2.0/setup/ to
$(HOME)/hpl-2.0/
Edit the Make.Linux_PII_CBLAS file:
# ———————————————————————-
# - HPL Directory Structure / HPL library ——————————
# ———————————————————————-
#
TOPdir = $(HOME)/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ———————————————————————-
# - Message Passing library (MPI) ————————————–
# ———————————————————————-
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
#MPdir = /usr/lib64/mpich2
#MPinc = -I$(MPdir)/include
#MPlib = $(MPdir)/lib/libmpich.a
#
# ———————————————————————-
# - Linear Algebra library (BLAS or VSIPL) —————————–
# ———————————————————————-
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /usr/lib/atlas
LAinc =
LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
# ———————————————————————-
# - Compilers / linkers - Optimization flags —————————
# ———————————————————————-
#
CC = /opt/mpich2-1.4.1p1/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = /opt/mpich2-1.4.1p1/bin/mpicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ———————————————————————-
After configuring the above file Make.Linux_PII_CBLAS, run (from
$(HOME)/hpl-2.0):
$ make arch=Linux_PII_CBLAS
Now run Linpack (on a single node):
$ cd bin/Linux_PII_CBLAS
$ mpiexec -n 4 ./xhpl
Repeat steps 1-5 on all nodes; Linpack can then be run on all nodes like this
(from the directory $(HOME)/hpl-2.0/bin/Linux_PII_CBLAS/):
$ mpiexec -n x ./xhpl
where x is the number of cores in the cluster.
4.8.3 HPL results
The first thing to note is that the HPL.dat file available post-install is
simply useless for extracting any kind of meaningful performance numbers, so the
file needs to be edited. The first test uses the default configuration; then
HPL.dat is tuned and the test repeated. For a single PC, HPL gave a performance
of 11.25 GFlops. The highest value given by the cluster of four machines is
28.67 GFlops, which means there is an absolute performance gain of the cluster
over a single machine.
Figure 4.8: Graph showing High Performance Linpack (HPL) Results
It is interesting to note that the maximum performance (28.67 GFlops) was
achieved for a problem size of 30000 and a block size of 168, although, to be
fair, the difference between block sizes of 168 and 128 is small. Also
interesting is how much the data varies for different problem sizes; the PCs in
the cluster do not have a separate network, and thus performance is unlikely to
ever be constant.
The efficiency of the cluster is 13%, which is appalling, but given the various
limitations in the system it is perhaps not that surprising.
4.9 Run HPCC on cluster
The HPC Challenge Benchmark set of tests is primarily High Performance Linpack
along with some additional bells-and-whistles tests. The nice thing is that
experience in running HPL can be directly leveraged in running HPCC, and vice
versa.
Instead of a binary named xhpl, a binary named hpcc is generated after
compiling the HPC Challenge Benchmark. This binary runs the whole series of
tests. First download hpcc-1.4.1.tar.gz and save it into the /root directory.
The following
is a set of commands to get going with HPCC:
#cd /root
#tar xzvf hpcc-1.4.1.tar.gz
#cd hpcc-1.4.1
#cd hpl
#cp setup/Make.Linux_PII_CBLAS .
#vi Make.Linux_PII_CBLAS
Apply the same changes to this file as in the HPL section above, except
TOPdir = ../../.. Next, build the HPC Challenge Benchmark, configure
hpccinf.txt (which can be derived from the previous settings for HPL.dat), and
then invoke the tool. After modifying Make.Linux_PII_CBLAS:
#cd /root/hpcc-1.4.1
#make arch=Linux_PII_CBLAS
Copy the sample hpccinf.txt found in the top-level hpcc-1.4.1 directory into
place as hpccinf.txt. Make the following changes to lines 33-36 to control the
problem sizes and blocking factors for PTRANS.
Change lines 33-34 (number of PTRANS problems sizes and the sizes) to:
4 Number of additional problem sizes for PTRANS
1000 2500 5000 10000 values of N
Change lines 35-36 (number of block sizes and the sizes) to:
2 Number of additional blocking sizes for PTRANS
64 128 values of NB
Now run it:
#cd /root/hpcc-1.4.1
#mpiexec -np <numprocs> ./hpcc
The results will be in hpccoutf.txt.
4.9.1 HPCC Results
Finally, a few HPCC benchmark runs were carried out. As with the Linpack
benchmark, the HPCC benchmark is compiled with ATLAS.
Generally speaking, the cluster continues to perform better than a single PC,
but clearly some of the benchmarks are hardly affected at all.
It is worth bearing in mind that this four-node cluster does not have its own
separate network switch, and thus results will vary more than in a cluster with
dedicated networking. Table 4.1 shows some important results from various tests
of the HPCC Benchmark Suite.
Test One Processor Cluster
HPL Tflops 0.00072716 0.0283605
StarDGEMM Gflops 4.83506 4.77583
SingleDGEMM Gflops 4.79438 4.92708
PTRANS GBs 0.0425573 0.0409784
MPIRandomAccess LCG GUPs 0.00707042 0.00663434
MPIRandomAccess GUPs 0.00706074 0.00660636
StarRandomAccess LCG GUPs 0.176132 0.0170042
SingleRandomAccess LCG GUPs 0.171557 0.0344993
StarRandomAccess GUPs 0.24612 0.0183594
SingleRandomAccess GUPs 0.241174 0.0448233
StarSTREAM Copy 27.0668 2.92135
StarSTREAM Scale 25.3788 2.91262
StarSTREAM Add 27.1221 3.23188
StarSTREAM Triad 25.4848 3.40194
SingleSTREAM Copy 26.7578 10.9827
SingleSTREAM Scale 24.7451 10.9912
SingleSTREAM Add 26.3792 12.6537
SingleSTREAM Triad 24.7451 12.7064
StarFFT Gflops 2.14797 1.33174
SingleFFT Gflops 2.10237 1.85049
MPIFFT N 65536 134217728
MPIFFT Gflops 0.0587352 0.107084
MaxPingPongLatency usec 340.059 344.502
RandomlyOrderedRingLatency usec 167.139 154.586
MinPingPongBandwidth GBytes 0.0116524 0.0116511
NaturallyOrderedRingBandwidth GBytes 0.0104104 0.00243253
RandomlyOrderedRingBandwidth GBytes 0.00981377 0.00228357
MinPingPongLatency usec 322.998 0.203097
AvgPingPongLatency usec 334.724 267.98
MaxPingPongBandwidth GBytes 0.0116628 0.0116638
AvgPingPongBandwidth GBytes 0.0116578 0.0116587
NaturallyOrderedRingLatency usec 126.505 130.391
Table 4.1: HPCC Results on Single PC and Cluster
Chapter 5
Results and Applications
5.1 Discussion on Results
Clusters effectively reduce the overall computational time, demonstrating
excellent performance improvement in terms of Flops. Performance on clusters
may, however, be limited by interconnect speed. The choice of which
interconnect to use depends on whether inter-server communication will be a
bottleneck in the mix of jobs to be run.
5.1.1 Observations about Small Tasks
1. Jobs with very small inputs are bound by communication time.
2. Since the sequential runtime is so small, the time to send to and receive
from the head node makes the program take longer with more nodes; adding
processors slows down the program's runtime.
3. Parallel execution of such computations is impractical.
4. Speedup is observed by using a small cluster, but it does not scale well at
all.
5. It is better to use one processor than even a moderately large cluster.
5.1.2 Observations about Larger Tasks
1. Jobs with larger inputs are bound by sequential computation time for a small
number of processors, but eventually adding processors causes communication
time to take over.
2. Sequential runtime with large inputs is much larger, so such jobs scale much
better than those with small inputs.
3. Inter-node communication has a much larger effect on runtime than intra-node
communication.
4. With very large inputs, communication time becomes negligible relative to
computation.
5. Unlike a job requiring very little sequential computation and a lot of
communication, this job achieved speedup with large numbers of processors.
Due to the various overheads discussed throughout, and because certain parts of
a sequential algorithm cannot be parallelized, we may not achieve optimal
parallelization. In such cases there is no performance gain; in some cases
performance is even degraded by communication and synchronization overhead.
5.2 Factors affecting Cluster performance
As per the result analysis of the various tests and benchmarks, the following
are a few of the most important factors affecting cluster performance. Metrics
having a significant effect on Linpack are:
1. Problem Size
2. Size of Blocks
3. Topology
Tightly coupled MPI applications are:
1. Very sensitive to network performance characteristics such as inter-node
communication delay and the OS network stack.
2. Very sensitive to mismatched node performance: random OS activity can add
millisecond delays to microsecond-scale communication delays.
5.3 Benefits
1. Cost-effective: Built from relatively inexpensive commodity components
that are widely available.
2. Keeps pace with technologies: Use mass-market components. Easy to em-
ploy the latest technologies to maintain the cluster.
3. Flexible configuration: Users can tailor a configuration that is feasible to
them and allocate the budget wisely to meet the performance requirements
of their applications.
4. Scalability: Can be easily scaled up by adding more compute nodes.
5. Usability: The system can be used by specified users to achieve specified
goals with effectiveness, efficiency, and satisfaction in a specified context of
use.
6. Manageability: Group of systems can be managed as a single system or
single database, without having to sign on to individual systems. Even a
cluster administrative domain can be used to more easily manage resources
that are shared within a cluster.
7. Reliability: The system, including all hardware, firmware, and software, will
satisfactorily perform the task for which it was designed or intended, for a
specified time and in a specified environment.
8. High availability: Each compute node is an individual machine. The failure
of a compute node will not affect other nodes or the availability of the entire
cluster.
9. Compatibility and Portability: A parallel application using MPI can be
easily ported from expensive parallel computers to a Beowulf cluster.
5.4 Challenges of parallel computing
Parallel programming is not just the problem of choosing whether to code using
threads, message passing, or some other tool. In general, anybody working in
the field of parallelization must consider the overall picture, containing a
plethora of issues:
1. Understanding the hardware: An understanding of the parallel computer
architecture is necessary for efficient mapping and distribution of computa-
tional tasks. A simplified classification of parallel architectures is UMA/NUMA
and distributed systems. A typical application may have to run on a com-
bination of these architectures.
2. Mapping and distribution on to the hardware: Mapping and distribution of
both computational tasks on processors and of data onto memory elements
must be considered. The whole application must be divided into compo-
nents and subcomponents and then these components and subcomponents
distributed on the hardware. The distribution may be static or dynamic.
3. Parallel Overhead: Parallel overhead refers to the amount of time required
to coordinate parallel tasks as opposed to doing useful work. Typical par-
allel overhead includes the time to start/terminate a task, the time to pass
messages between tasks, synchronization time, and other extra computation
time. When parallelizing a serial application, overhead is inevitable. De-
velopers have to estimate the potential cost and try to avoid unnecessary
overhead caused by inefficient design or operations.
4. Synchronization: Synchronization is necessary in multi-threading programs
to prevent race conditions. Synchronization limits parallel efficiency even
more than parallel overhead in that it serializes parts of the program. Im-
proper synchronization methods may cause incorrect results from the pro-
gram. Developers are responsible for pinpointing the shared resources that
may cause race conditions in a multi-threaded program, and they are re-
sponsible also for adopting proper synchronization structures and methods
to make sure resources are accessed in the correct order without inflicting
too much of a performance penalty.
5. Load Balance: Load balance is important in a threaded application because
poor load balance causes under utilization of processors. After one task fin-
ishes its job on a processor, the processor is idle until new tasks are assigned
to it. In order to achieve the optimal performance result, developers need to
find out where the imbalance of the work load lies between different threads
running on the processors and fix this imbalance by spreading out the work
more evenly for each thread.
6. Granularity: For a task that can be divided and performed concurrently by
several subtasks, it is usually more efficient to introduce threads to perform
some subtasks. However, there is always a tipping point where performance
cannot be improved by dividing a task into smaller-sized tasks (or introduc-
ing more threads). The reasons for this are 1) multi-threading causes extra
overhead; 2) the degree of concurrency is limited by the number of proces-
sors; and 3) for most of the time, one subtask’s execution is dependent on
another’s completion. That is why developers have to decide to what extent
they make their application parallel. The bottom line is that the amount of
work per each independent task should be sufficient to leverage the threading
cost.
5.5 Common applications of high-performance
computing clusters
Almost everyone needs fast processing power. With the increasing availability of
cheaper and faster computers, more people are interested in reaping the techno-
logical benefits. There is no upper boundary to the needs of computer processing
power; even with the rapid increase in power, the demand is considerably more
than what’s available.
1. Scheduling: Manufacturing; Transportation (dairy delivery to military
deployment); University classes; Airline scheduling.
2. Network Simulations: Power Utilities, Telecommunications providers simu-
lations.
3. Computational ElectroMagnetics: Antenna design; Stealth vehicles; Noise
in high frequency circuits; Mobile phones.
Figure 5.1: Application Perspective of Grand Challenges
4. Environmental Modelling-Earth/Ocean/Atmospheric Simulation: Weather
forecasting, climate simulation, oil reservoir simulation, waste repository
simulation
5. Simulation on Demand: Education, tourism, city planning, defense mission
planning, generalized flight simulator.
6. Graphics Rendering: Hollywood movies, Virtual reality.
7. Complex Systems Modelling and Integration: Defense (SIMNET, flight
simulators), Education (SimCity), Multimedia/VR in entertainment, Multiuser
virtual worlds, Chemical and nuclear plant operation.
8. Financial and Economic Modelling: Real time optimisation, Mortgage backed
securities, Option pricing.
9. Image Processing: Medical instruments, EOS Mission to Planet Earth, De-
fense Surveillance, Computer Vision.
10. Healthcare and Insurance Fraud Detection: Inefficiency, Securities fraud,
Credit card fraud.
11. Market Segmentation Analysis: Marketing and sales planning. Sort and
classify records to determine customer preference by region (city and house).
Chapter 6
Conclusion and Future Work
6.1 Conclusion
The implemented HPCC system allows any research center to install and use a
low-cost parallel programming environment, which may be administered on an
easy-to-use basis even by staff unfamiliar with clusters. Such clusters allow
evaluating the efficiency of any parallel code to solve the computational
problems faced by the scientific community. This type of parallel programming
environment is expected to be the subject of great development effort in the
coming years, since an increasing number of universities and research centers
around the world include Beowulf clusters in their hardware. The main
disadvantage of this type of environment is the latency of the interconnections
between the machines.
This HPCC can be used for research on object-oriented parallel languages,
recursive matrix algorithms, network protocol optimization, graphical
rendering, etc. It can also be used to create the college's own cloud and
deploy cloud applications on it, which can be accessed from anywhere in the
outside world with just a web browser. Computer Science and Information
Technology students will receive extensive experience using such a cluster,
and it is expected that several students and faculty will use it for their
project and research work.
6.2 Future Work
As computer networks become cheaper and faster, a new computing paradigm,
called the Grid, has evolved. The Grid is a large system of computing resources
that performs tasks and provides to users a single point of access, commonly based
on the World Wide Web interface, to these distributed resources. Users can submit
thousands of jobs at a time without being concerned about where they run. The
Grid may scale from single systems to supercomputer-class compute farms that
utilise thousands of processors.
By providing scalable, secure, high-performance mechanisms for discovering
and negotiating access to remote resources, the Grid promises to make it possible
for colleges and universities in collaboration to share resources on an unprece-
dented scale, and for geographically distributed groups to work together in ways
that were previously impossible.
Additionally, the HPCC can be used to create cloud applications and give
students real experience with this booming technology. The advantages of cloud
computing work to the student's advantage when it comes to getting hands-on
experience in managing environments. Before virtualization, it would have been
impossible for an individual student to practice managing their own
multiple-server environment; even just three servers would have cost thousands
of dollars in years past. Now, with virtualization, it takes just a few minutes
to spin up three new VMs. If a college were to leverage virtualization in its
classroom, students could manage their own multi-server environment in the
cloud with ease. The student could control everything from creation of the VMs
to their retirement, gaining great experience in one of the hottest fields in
IT.
Bibliography
[1] Christian Vecchiola, Suraj Pandey, and Rajkumar Buyya : High-Performance
Cloud Computing: A View of Scientific Applications at Proceedings of the 10th
International Symposium on Pervasive Systems, Algorithms and Networks (I-
SPAN 2009, IEEE CS Press, USA), Kaohsiung, Taiwan, December 14-16, 2009
[2] Luiz Carlos Pinto, Luiz H. B. Tomazella, M. A. R. Dantas : An Experimental
Study on How to Build Efficient Multi-Core Clusters for High Performance
Computing at 2008 11th IEEE International Conference on Computational
Science and Engineering.
[3] Iker Castaños, Izaskun Garrido, Aitor Garrido, Goretti Sevillano: Design and
Implementation of an easy-to-use Automated System to build Beowulf Parallel-
Computing Clusters at University of the Basque, IEEE International Confer-
ence 2009
[4] Azzedine Boukerche, Raed Al-Shaikh, and Mirela Sechi Moretti Notare: To-
wards Building a Highly-Available Cluster Based Model for High Performance
Computing at Proceedings 20th IEEE International Parallel and Distributed
Processing Symposium 2006
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G.
Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia. : Above the Clouds: A
Berkeley View of Cloud computing. Technical Report No. UCB/EECS-2009-
28, University of California at Berkeley, USA, Feb. 10, 2009.
[6] R. Buyya, C.S. Yeo, and S. Venugopal, Market-Oriented Cloud Computing:
Vision, Hype, and Reality for Delivering IT Services as Computing Utilities,
Keynote Paper, in Proc. 10th IEEE International Conference on High Perfor-
mance Computing and Communications (HPCC 2008), IEEE CS Press, Sept.
25-27, 2008, Dalian, China.
[7] Bonnie Holte Bennett, Emmett Davis, Timothy Kunau : Beowulf Parallel
Processing for Dynamic Load-balancing, IEEE International Conference 1999
[8] Lustre File System : High-Performance Storage Architecture and Scalable
Cluster File System White Paper December 2007
[9] Amina Saify, Garima Kochhar, Jenwei Hsieh, and Onur Celebioglu: Enhancing
High-Performance Computing Clusters with Parallel File Systems, 2005
[10] Rajkumar Buyya, High Performance Cluster Computing: Architectures and
Systems, Vol. 1 Ed. Prentice Hall PTR, Upper Saddle River, NJ, 1999
[11] Prabhu, C.S.R., Grid and Cluster Computing. Prentice Hall India 2009
[12] Judith Hurwitz, Robin Bloor, Marcia Kaufman, Fern Halper, Cloud Com-
puting For Dummies, John Wiley and Sons, 2009
[13] Barrie Sosinsky, Cloud Computing Bible, Wiley India 2011
[14] Christopher Negus, Timothy Boronczyk, CentOS Bible, Wiley, 2009
[15] Vladimir Silva, Grid Computing for Developers, Dreamtech Press, 2006
[16] Peter Membrey, Tim Verhoeven, Ralph Angenendt , The Definitive Guide to
CentOS, Apress, 2009
[17] Grid computing, http://www.ctwatch.org/quarterly/articles/2006/
02/garuda-indias-national-grid-computing-initiative/1/index.html
[18] Torque resources, http://www.adaptivecomputing.com/products/
open-source/torque/
[19] Introduction to Torque, http://www.clusterresources.com/
torquedocs21/p.introduction.shtml
[20] High Performance Computing Training, https://computing.llnl.gov/
?set=training&page=index
[21] Applications of HPCC, http://www.new-npac.org/projects/cdroms/
cewes-1999-06-vol1/nhse/roadmap/applications/
[22] Beowulf Project Overview, http://www.beowulf.org/overview/index.
html
[23] Beowulf clusters, http://www.lehigh.edu/computing/linux/beowulf/
[24] Parallel Virtual Machine, http://www.csm.ornl.gov/pvm/
[25] Open MPI Project, http://www.open-mpi.org.
[26] Message Passing Interface, http://www.unix.mcs.anl.gov/mpi/
[27] Beowulf Overview, http://www.beowulf.org/overview/faq.html17
[28] High-performance Linux clustering, Part 1: Clustering fundamentals, http:
//www.ibm.com/developerworks/linux/library/l-cluster1/, 2005
Appendix A
PuTTY
PuTTY is a free and open-source terminal emulator which can act as a client
for the SSH, Telnet, rlogin, and raw TCP protocols, and as a serial console
client. The name "PuTTY" has no definitive meaning, though "tty" is the name
for a terminal in the Unix tradition, usually held to be short for Teletype.
PuTTY was originally written for Microsoft Windows, but it has been ported
to various other operating systems. Official ports are available for some Unix-
like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and
unofficial ports have been contributed to platforms such as Symbian and Windows
Mobile.
A.1 How to use PuTTY to connect to a remote
computer
1. First download and install PuTTY, then open it by double-clicking the
PuTTY icon.
2. In the Host Name box, enter the name or IP address of the server on which
the account is hosted (for example: 115.119.224.72). Under Protocol, choose
SSH and then press Open.
3. A security-alert dialogue box will then appear; do not be alarmed, simply
press Yes when prompted.
4. It will prompt for the login name (username) and then the password. Enter
the username, hit Enter, and then type the password (the password will not
be visible; this is how Linux and Unix servers work), then hit Enter again.
Also remember that passwords are case sensitive.
Figure A.1: PuTTY GUI
Figure A.2: PuTTY Security Alert
A.2 PSCP
PSCP, the PuTTY Secure Copy client, is a tool for transferring files securely
between computers using an SSH connection. If the server supports SSH-2, prefer
PSFTP for interactive use; PSFTP does not, in general, work with SSH-1 servers.
Figure A.3: PuTTY Remote Login Screen
A.2.1 Starting PSCP
PSCP is a command-line application. This means that simply double-clicking
its icon will not work; instead, it must be run from a console window. On Windows
95, 98, and ME this is called an MS-DOS Prompt, and on Windows XP, Vista
and Windows 7 it is called a Command Prompt. It should be available from the
Programs section of the Start Menu.
To start PSCP, it will need either to be on the PATH or in the current directory.
To add the directory containing PSCP to the PATH environment variable, type
into the console window:
set PATH=C:\Program Files (x86)\PuTTY;%PATH%
This will only work for the lifetime of that particular console window. To set the
PATH more permanently on Windows NT, XP, Vista and 7, use the Environment
Variables dialogue of the System Control Panel. On Windows 95, 98, and ME,
edit AUTOEXEC.BAT to include a set command like the one above.
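The same idea applies on the Linux side of the cluster. A small POSIX shell sketch (the directory /opt/putty below is a hypothetical install location, not a real path on any node) shows a session-scoped PATH change equivalent to the set PATH command:

```shell
#!/bin/sh
# Session-scoped PATH change: prepend a (hypothetical) tool directory
# so its binaries are found first. The change lasts only as long as
# this shell session, mirroring the Windows `set PATH=` behaviour.
PUTTY_DIR="/opt/putty"
PATH="$PUTTY_DIR:$PATH"
export PATH

# The new directory is now the first PATH entry.
printf '%s\n' "${PATH%%:*}"   # → /opt/putty
```

To make such a change permanent on Linux, the same line would go in the user's shell start-up file instead.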
A.2.2 PSCP Usage
To copy the local file c:\documents\foo.txt from Windows to the folder /tmp on
the Linux server example.com as user beowulf, type:
C:\Users\FOSS>pscp c:\documents\foo.txt beowulf@example.com:/tmp
To copy the file /root/hosts from the Linux machine to e:\tmp on Windows,
type:
C:\Users\FOSS>pscp root@example.com:/root/hosts e:\tmp
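Both commands follow one pattern: pscp source destination, where the remote side is written user@host:path. The shell sketch below assembles such commands from variables (reusing the placeholder account beowulf@example.com from above; nothing is actually transferred, the commands are only printed), including the -r flag for whole directories and -P for a non-default port:

```shell
#!/bin/sh
# Build PSCP transfer commands from parts so the same pattern can be
# reused for any user/host/path. The account and paths are the
# placeholders used in the examples above; the commands are printed,
# not executed.
USER=beowulf
HOST=example.com

# Upload: local Windows file -> remote /tmp
UPLOAD="pscp c:\documents\foo.txt $USER@$HOST:/tmp"
# Download: remote file -> local Windows drive
DOWNLOAD="pscp $USER@$HOST:/root/hosts e:\tmp"
# Recursive copy of a whole directory (-r), explicit port (-P)
RECURSIVE="pscp -r -P 22 c:\projects $USER@$HOST:/home/$USER"

printf '%s\n' "$UPLOAD" "$DOWNLOAD" "$RECURSIVE"
```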