
Evolution Towards Cloud: Overview of Next Generation Computing Architecture

by

Monowar Hasan
(Student ID 0605021)

&

Sabbir Ahmed
(Student ID 0605013)

A Thesis submitted to the Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science (B.Sc.) in Computer Science and Engineering

Thesis Supervisor: Dr. Md. Humayun Kabir

Bangladesh University of Engineering and Technology
Dhaka, Bangladesh

17 March 2012

Certification

The thesis titled “Evolution Towards Cloud: Overview of Next Generation Computing Architecture”, submitted by Monowar Hasan, Student No. 0605021, and Sabbir Ahmed, Student No. 0605013, to the Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering.

Supervisor

Dr. Md. Humayun Kabir
Professor,
Department of Computer Science and Engineering,
Bangladesh University of Engineering and Technology,
Dhaka-1000, Bangladesh.

Declaration

We, hereby, declare that the work presented in this thesis is the outcome of the investigation performed by us under the supervision of Dr. Md. Humayun Kabir, Professor, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology. We also declare that no part of this thesis has been submitted elsewhere for the award of any degree or diploma.

Signature of the Students

Monowar Hasan
Student No. 0605021

Sabbir Ahmed
Student No. 0605013

Abstract

Nowadays Cloud Computing has become a buzzword in distributed processing. Cloud Computing originated from the ideas of concurrent processing in Computer Clusters, and it has enhanced the established architecture and standards of Grid Computing with the ideas of Utility and Service-oriented Computing. Computing through the Cloud supports a business model in the form of X-as-a-Service, where X stands for hardware, software, a development platform or some storage medium. End-users can consume any of these services from providers on a pay-as-you-go basis without knowing the details of the underlying architecture. Hence, the Cloud offers layers of abstraction to end-users and a scope to modify application demand for end-users, developers and providers.

Acknowledgements

We are grateful to several people, without whom this work would not have been successful. Our heartiest thanks go to our supervisor, Professor Dr. Md. Humayun Kabir, for his support and valuable guidelines. His continuous feedback and assistance helped us clarify our ideas and understanding of the topic.

Special thanks to Professor Dr. Hanan Lutfiyya of the University of Western Ontario, Canada, and Professor Dr. Ivona Brandic of the Vienna University of Technology, Vienna, Austria, for providing their research publications, which helped our thesis progress.

The Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology provided us with a sound working environment and helped us obtain electronic copies of the publications.

Last but not least, we acknowledge the contribution and support of our family members for being with us and encouraging us all the way. Without their sacrifice this work would not have been a success.

Table of Contents

Certification
Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction

2 Computer Clusters
  2.1 Architecture of Computer Clusters
  2.2 Cluster Interconnection
  2.3 Protocols for Cluster Communication
    2.3.1 Internet Protocols
    2.3.2 Low-latency Protocols
      2.3.2.1 Active Messages
      2.3.2.2 Fast Messages
      2.3.2.3 VMMC
      2.3.2.4 U-net
      2.3.2.5 BIP
    2.3.3 Standards for Cluster Communication
      2.3.3.1 VIA
      2.3.3.2 InfiniBand
  2.4 Cluster Middleware
    2.4.1 Message-based Middleware
    2.4.2 RPC-based Middleware
    2.4.3 Object Request Broker
  2.5 Single System Image (SSI)
    2.5.1 Benefits of SSI
    2.5.2 Features of SSI Clustering Systems
    2.5.3 Functional Relationship among Middleware SSI Modules
      2.5.3.1 Resource Management and Scheduling (RMS)
  2.6 Examples of Cluster Implementation
    2.6.1 Linux Virtual Server
    2.6.2 Windows Compute Cluster Server 2003
      2.6.2.1 Compute Cluster Components
      2.6.2.2 Network Architecture
      2.6.2.3 Software Architecture
      2.6.2.4 Job Execution
  2.7 Concluding Remarks

3 Grid Computing: An Introduction
  3.1 Grid Computing: Definitions and Overview
  3.2 Grids over Cluster Computing
  3.3 An Example of a Grid Computing Environment
  3.4 Grid Architecture
    3.4.1 Fabric Layer: Interfaces to Local Resources
    3.4.2 Connectivity Layer: Managing Communications
    3.4.3 Resource Layer: Sharing of a Single Resource
    3.4.4 Collective Layer: Co-ordination with Multiple Resources
    3.4.5 Application Layer: User-defined Grid Applications
  3.5 Grid Computing with Globus
  3.6 Resource Management in Grid Computing
    3.6.1 Resource Specification Language
    3.6.2 Globus Resource Allocation Manager (GRAM)
  3.7 Resource Monitoring in Grid Computing
  3.8 Evolution towards Cloud Computing from Grid
  3.9 Concluding Remarks

4 An Overview of Cloud Architecture
  4.1 Cloud Components
  4.2 Cloud Architectures
    4.2.1 A Layered Model of Cloud Architecture: Cloud Ontology
    4.2.2 Cloud Business Model
    4.2.3 Cloud Deployment Model
  4.3 Cloud Services
    4.3.1 Infrastructure as a Service (IaaS)
    4.3.2 Platform as a Service (PaaS)
    4.3.3 Software as a Service (SaaS)
  4.4 Virtualization on Cloud
    4.4.1 Full Virtualization
    4.4.2 Paravirtualization
    4.4.3 Motivations of Virtualization
  4.5 Example of a Cloud Implementation
    4.5.1 Amazon S3 Concepts
    4.5.2 Amazon S3 Data Consistency Model
    4.5.3 Managing Concurrent Applications
  4.6 Concluding Remarks

5 Comparisons of Grid and Cloud: Similarities & Differences
  5.1 Major Focus
  5.2 Points of Consideration
    5.2.1 Business Model
    5.2.2 Scalability Issues
    5.2.3 Multitasking and Availability
    5.2.4 Resource Management
    5.2.5 Application Model
    5.2.6 Other Issues
  5.3 Case Study
    5.3.1 Comparative Results
  5.4 Concluding Remarks

6 Conclusion and Future Works

Bibliography

List of Tables

2.1 Categories of Cluster Interconnection Hardware
4.1 Example of existing Cloud Systems w.r.t. classification into layers of Cloud Ontology
5.1 Comparative analysis between an existing Grid and Cloud implementation

List of Figures

2.1 Architecture of a computer Cluster
2.2 Traditional Protocol Overhead and Transmission Time
2.3 The InfiniBand Architecture
2.4 Functional Relationship Among Middleware SSI Modules
2.5 Resource Management and Scheduling (RMS)
2.6 Linux Virtual Server
2.7 Network Architecture
2.8 Software Architecture
2.9 Serial Task Execution
2.10 Parallel Task Execution
3.1 Evolution of Grid Computing
3.2 Serving job requests in a traditional environment
3.3 Serving job requests in a Grid environment
3.4 Google search architecture
3.5 Grid Protocol Architecture
3.6 Collective and Resource layer protocols are combined in various ways to provide application functionality
3.7 Programmer's view of Grid Architecture; thin lines denote protocol interactions while bold lines represent a direct call
3.8 A resource management architecture for a Grid Computing environment
3.9 Globus GRAM Architecture
3.10 Grid Monitoring Architecture Components
3.11 Enhancement of generic Grid architecture to Service Oriented Grid
4.1 Components of a Cloud Computing Solution
4.2 Hierarchical abstraction layers of Cluster, Grid and Cloud Computing
4.3 Cloud layered architecture: consists of five layers; the figure represents the inter-dependency between layers
4.4 Virtualization reduces the number of servers
4.5 Cloud Computing business model
4.6 External or Public Cloud
4.7 Internal or Private Cloud
4.8 Example of Hybrid Cloud
4.9 Correlation between Cloud Architecture and Cloud Services
4.10 Infrastructure as a Service
4.11 Platform as a Service
4.12 Software as a Service
4.13 Full virtualization
4.14 A paravirtualized deployment where many OSs can run simultaneously
4.15 Paravirtualization
4.16 Conceptual view of Amazon Simple Storage Service
4.17 Managing Concurrent Applications: W1 & W2 complete before the start of R1 & R2
4.18 Managing Concurrent Applications: W2 does not complete before the start of R1
4.19 Managing Concurrent Applications: W2 is performed before S3 returns a ‘success’ for W1
5.1 Motivation of Grid and Cloud
5.2 Comparison regarding performance, reliability and cost

Chapter 1

Introduction

Sometimes applications need more computing power than a sequential computer can provide. A feasible and cost-effective solution is to connect multiple processors together and coordinate their computational power. The resulting systems are popularly known as parallel computers or Computer Clusters, and they allow the sharing of a computational task among multiple processors. The components of a Cluster are usually connected to each other through fast Local Area Networks. This is in contrast to a traditional supercomputer, which has many processors connected by a local high-speed computer bus. Each node in a Cluster runs its own instance of an operating system. Traditionally, Computer Clusters run on separate physical computers with the same operating system; hence, the nodes of a Cluster are homogeneous and tightly-coupled. The activities of the computing nodes are monitored by ‘Clustering Middleware’, a software layer that sits atop the nodes and allows the users to treat the Cluster as a single computing unit, through a ‘Single System Image’ concept. Computer Clusters are covered in detail in Chapter 2.

Computational Grids, another approach to distributed processing, also use many nodes like Computer Clusters, but form a more dynamic and usually heterogeneous system. Heterogeneous pools of servers, storage systems and networks are brought together in a virtualized system that is exposed to the user as a single computing entity. In a Grid, a computing job uses one or a few nodes, with little or no inter-node communication. Job requests are first pooled and then allocated to the available processors in an efficient way. ‘Grid middleware’ is specific software which provides the necessary functionality required to enable the sharing of heterogeneous resources; Grid Computing is the deployment of such Grid middleware. Architectures and issues of Computer Grids are covered in Chapter 3.

Cluster Grids (or Computer Clusters) are local resources that operate inside the firewall and are controlled by a single administrative entity that has complete control over each component. Thus, Clusters do not actually involve sharing of resources and cannot be considered Grids in the narrow sense. The term Enterprise Grid refers to the application of Grid Computing for sharing resources within the bounds of a single company. All components of an Enterprise Grid operate inside the firewall of a company, but may be heterogeneous and physically distributed across multiple company locations. A Grid that is owned and deployed by a third-party service provider is called a Utility Grid. The service offered via a Utility Grid is utility computing, i.e. compute capacity and storage in a pay-per-use manner. A Utility Grid operates outside the firewall of the user. The trend toward Utility Grids popularized the influential approach of Cloud Computing.

Cloud Computing, a relatively recent term, is a computing paradigm in which a large pool of systems is connected in private or public networks to provide dynamically scalable infrastructure for application, data and file storage. It implies a service-oriented architecture, reduced information technology overhead for the end-user, great flexibility, reduced total cost of ownership, on-demand service and many other things. In the Cloud, applications are delivered as services over the Internet. Infrastructure resources (hardware, storage and system software) and applications are provided in an X-as-a-Service manner. When a Cloud is made available in a pay-as-you-go manner to the general public, we call it a Public Cloud. We use the term Private Cloud to refer to the internal datacenters of a business or other organization that are not made available to the general public. Thus, Cloud Computing is the combination of SaaS and Utility Computing, but does not include Private Clouds. A detailed overview of Cloud Computing is given in Chapter 4.

Cloud Computing is not entirely similar to Computer Grids or Utility Grids; the Cloud differs from the Grid in various respects. The similarities and differences between Grid and Cloud Computing are discussed in Chapter 5.

Chapter 2

Computer Clusters

A Cluster [1] is a type of parallel or distributed processing system. It consists of a collection of interconnected stand-alone computers that work together as a single, integrated computing resource. All the component subsystems of a Cluster are supervised within a single administrative domain, usually reside in a single room, and are managed as a single computer system. Cluster Computing can be used for load balancing as well as for high availability [2]. Cluster Computing can also be used as a relatively low-cost form of parallel processing for scientific and other applications that lend themselves to parallel operations.

Some properties of Cluster Computing:

• Computers, also known as nodes, in a Cluster are networked in a tightly-coupled fashion. They are all on the same subnet of the same domain and are often networked with very high bandwidth connections.

• Nodes of a Cluster are homogeneous. They all use the same hardware, run the same software, and are generally configured identically. Each node in a Cluster is a dedicated resource; generally only the Cluster applications run on a Cluster node.

• Message Passing Interface (MPI) [3] is used in Clusters. It is a programming interface that allows the distributed application instances to communicate with each other and share information (a minimal sketch is given after this list).

• Dedicated hardware, high-speed interconnects, and MPI give Clusters the ability to work efficiently on fine-grained parallel problems where the subtasks must communicate many times per second, including problems with short tasks, some of which may depend on the results of previous tasks.
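To make the role of MPI concrete, the following is a minimal sketch of an MPI program in C; the payload value and message tag are illustrative only:

    #include <mpi.h>
    #include <stdio.h>

    /* A minimal MPI program: rank 0 sends an integer to rank 1. */
    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process are we? */

        if (rank == 0) {
            value = 42;                         /* illustrative payload  */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();                         /* shut down cleanly     */
        return 0;
    }

Built with an MPI wrapper compiler (e.g. mpicc) and launched with two or more processes, rank 0 sends and rank 1 receives without either process knowing the details of the underlying interconnect.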

2.1 Architecture of Computer Clusters

In Cluster Computing a computer node can be a single or multiprocessor system [4]. The nodes can be PCs, workstations, or Symmetric Multiprocessors (SMPs) with memory, I/O facilities, and an operating system. In Cluster Computing, two or more nodes are connected together. These nodes can exist in a single cabinet or be physically separated and connected via a LAN. This LAN-based interconnected Cluster of computers appears as a single system to users and applications. Cluster Computing can provide a cost-effective way to gain features and benefits, like fast and reliable services, that could previously be found only on more expensive proprietary shared-memory systems. The typical architecture of a Cluster is shown in Figure 2.1.

Figure 2.1: Architecture of a computer Cluster

In Cluster Computing several high-performance networks or switches are used to connect the nodes of the Cluster. Among them Gigabit Ethernet and Myrinet are the most common. Switched networks are preferred since they allow multiple simultaneous messages to be sent, which can improve overall application performance. Cluster interconnections use Network Interface Cards. Interconnection technologies may be classified into four categories, depending on whether the internal connection is from the I/O bus or the memory bus, and depending on whether the communication between the computers is performed primarily using messages or using shared storage. We will discuss Cluster Interconnection in Section 2.2. Several fast communication protocols and services are used to communicate among nodes. We will discuss them briefly in Section 2.3.

The operating system in the individual nodes of the Cluster provides the fundamental system support for Cluster operations. Whether the user is opening files, sending messages, or starting additional processes, the operating system is always present. The primary role of an operating system is to multiplex multiple processes onto the hardware components that comprise a system (resource management and scheduling), as well as to provide a high-level software interface for user applications. These services include protection boundaries, process and thread co-ordination, inter-process communication and device handling.

There is a Middleware which sits between the operating system and applications. Middleware layers enable the seamless usage of heterogeneous components across the Cluster. Middleware provides the system with a Single System Image (SSI) and a System Availability Infrastructure. Cluster Middleware and the Single System Image (SSI) are discussed in Sections 2.4 and 2.5. Both sequential and parallel or distributed applications can be run using Cluster Computing. For parallel applications, several parallel programming environments and tools, such as compilers and MPI (Message Passing Interface), are used. We will conclude the Chapter by describing two applications of Cluster Computing: Linux Virtual Server (LVS) in Section 2.6.1 and Windows Compute Cluster Server 2003 in Section 2.6.2.

2.2 Cluster Interconnection

In Cluster Computing the choice of interconnection technology is a key component. We can classify the interconnection technologies into four categories. These four categories depend on the internal connection and on how the nodes communicate with each other. The internal connection can be from the I/O bus or the memory bus, and the communication between the computers can be performed primarily using messages or using shared storage [5]. Table 2.1 illustrates the four types of interconnection.

Table 2.1: Categories of Cluster Interconnection Hardware

    Type            | Message Based                        | Shared Storage
    ----------------+--------------------------------------+---------------------------
    I/O Attached    | Most common type; includes most      | Shared disk subsystems.
                    | high-speed networks: VIA, TCP/IP.    |
    Memory Attached | Usually implemented in software as   | Global shared memory,
                    | optimizations of I/O attached        | distributed shared memory.
                    | message-based systems.               |

Among the four interconnection categories, I/O attached message-based systems are by far the most common. This category includes all commonly-used wide-area and local-area network technologies, as well as several recent products that are specifically designed for Cluster Computing. I/O attached shared storage systems include computers that share a common disk subsystem. Memory attached systems are not as common as I/O attached systems, since the memory bus of an individual computer generally has a design that is unique to that type of computer. However, many memory-attached systems are implemented, most of the time in software or with memory-mapped I/O, such as Reflective Memory [6].

There are several hybrid systems that combine the features of more than one category. An example of a hybrid system is the InfiniBand standard [7]. InfiniBand is an I/O attached interconnection that can be used to send data to a shared disk subsystem as well as to send messages to another computer. Many factors affect the choice of interconnect technology for a Cluster, such as compatibility with the Cluster hardware and operating system, price, and performance. The performance of a Cluster depends on latency and bandwidth:

• Latency is the time needed to send data from one computer to another. It includes the overhead for the software to construct the message as well as the time to transfer the bits from one computer to another.

• Bandwidth is the number of bits per second that can be transmitted over the interconnect hardware.

Applications that use small messages gain performance mainly as the latency is reduced. Applications that send large messages gain performance mainly as the bandwidth increases. The latency is a function of both the communication software and the network hardware.
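A common first-order model combines these two quantities: the time to send a message of n bytes is approximately T(n) = latency + n / bandwidth. The following sketch applies this model; the sample figures are illustrative, not measurements of any particular interconnect:

    #include <stdio.h>

    /* First-order cost model: time = latency + size / bandwidth. */
    double message_time(double latency_s, double bandwidth_Bps, double bytes)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void)
    {
        /* Illustrative figures: 50 us latency, 1 Gbit/s (125 MB/s). */
        double latency = 50e-6, bandwidth = 125e6;

        /* A 64-byte message is dominated by latency ...             */
        printf("64 B : %.1f us\n",
               message_time(latency, bandwidth, 64) * 1e6);
        /* ... while a 1 MB message is dominated by bandwidth.       */
        printf("1 MB : %.1f us\n",
               message_time(latency, bandwidth, 1e6) * 1e6);
        return 0;
    }

The two printed cases show why small-message workloads motivate the low-latency protocols of Section 2.3, while bulk transfers depend chiefly on raw bandwidth.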

2.3 Protocols for Cluster Communication

A communication protocol defines a set of rules and conventions for communicating among the nodes in the Cluster [8]. Each protocol uses a different technology to exchange information. Communication protocols can be classified by:

• Whether they are connection-oriented or connectionless.

• The level of reliability they offer. A protocol can be reliable, so that messages are fully guaranteed to arrive in order, or unreliable, with no such guarantee.

• Whether communication is unbuffered (synchronous) or buffered (asynchronous).

• The number of intermediate data copies between buffers, which may be zero, one or more.

Several protocols are used in Clusters. Initially, traditional Internet protocols were used for Clustering. Later, several protocols were designed specifically for Cluster communication. Finally, two new protocol standards were specially designed for use in Cluster Computing.

2.3.1 Internet Protocols

The Internet Protocol (IP) is the standard for networking worldwide. The Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) are both transport layer protocols built over the Internet Protocol. The TCP and UDP protocols, and the de facto standard BSD sockets Application Programmer's Interface (API) to TCP and UDP, were among the first messaging libraries used for Cluster Computing [9]. With the Internet protocols:

• One or more buffers in system memory are used, with the help of operating system services.

• The user application constructs the message in user memory, and then makes an operating system request to copy the message into a system buffer.

• A system interrupt is required to send and receive the message (the sockets sketch after this list shows this send path from the application's point of view).
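As an illustration, the following is a minimal sketch of this send path using the BSD sockets API in C; the peer address and payload are placeholders, and error handling is kept to a single check:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* The message is first built in user memory ...              */
        const char msg[] = "hello cluster";

        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket     */

        struct sockaddr_in peer = { 0 };
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5000);              /* placeholder    */
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr); /* placeholder */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) == 0) {
            /* ... then copied by the kernel into a system buffer on
               send(), the per-message cost described above.          */
            send(fd, msg, strlen(msg), 0);
        }
        close(fd);
        return 0;
    }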

With the Internet protocols, operating system overhead and the overhead of copies to and from system memory are a significant portion of the total time to send a message. As network hardware became faster during the 1990s, the overhead of the communication protocols became significantly larger than the actual hardware transmission time for messages, as shown in Figure 2.2. This created the need for new types of protocols for Cluster Computing.

Figure 2.2: Traditional Protocol Overhead and Transmission Time.

2.3.2 Low-latency Protocols

Several research projects during the 1990s sought to avoid operating system intervention in message transmission. These projects led to the development of low-latency protocols, which also provide user-level messaging services across high-speed networks. Low-latency protocols developed during the 1990s include Active Messages, Fast Messages, the VMMC (Virtual Memory-Mapped Communication) system, U-net, and the Basic Interface for Parallelism (BIP), among others.

2.3.2.1 Active Messages

Active Messages was developed at the University of California, Berkeley. It provided the low-latency communications library for the Berkeley Network of Workstations (NOW) project [10, 11]. The short messages used in Active Messages are synchronous and based on the concept of a request-reply primitive:

• The user-level application on the sending side constructs a message in user memory. The receiving process allocates a receive buffer in user memory on the receiving side and sends a request to the sender.

• The sender replies by copying the message from the user buffer on the sending side directly to the network buffer. No buffering in system memory is performed.

• The network hardware transfers the message to the receiver, and then the message is transferred from the network buffer to the receive buffer in user memory.

User virtual memory on both the sending and receiving sides must be pinned to an address in physical memory so that it is not paged out during the network operation (a minimal pinning sketch is given at the end of this Section). Once the pinned user memory buffers are established, no operating system intervention is required for a message to be sent. Since no copies from user memory to system memory are used, this protocol is known as a zero-copy protocol.

To support multiple concurrent parallel applications in a Cluster, Active Messages was extended to Generic Active Messages (GAM). In GAM, a copy sometimes occurs to a buffer in system memory on the receiving side so that user buffers can be reused more efficiently. In this case, the protocol is referred to as a ‘one-copy’ protocol.
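On a POSIX system, the pinning step these protocols rely on can be illustrated with mlock(), which keeps a range of user pages resident in physical memory. This is only a minimal sketch of the idea, not code from the NOW project:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BUF_SIZE 4096

    int main(void)
    {
        char *buf = malloc(BUF_SIZE);   /* ordinary pageable user memory */
        if (!buf)
            return 1;

        /* Pin the buffer: its pages stay resident in physical memory,
           so the network hardware can safely read or write them at
           any time without a page fault. */
        if (mlock(buf, BUF_SIZE) != 0) {
            perror("mlock");
            return 1;
        }

        /* ... a messaging layer could now use buf for zero-copy
           transfers without operating system intervention ... */

        munlock(buf, BUF_SIZE);         /* unpin when finished */
        free(buf);
        return 0;
    }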

2.3.2.2 Fast Messages

Fast Messages was developed at the University of Illinois and is similar to Active Messages [12]. Fast Messages extends Active Messages by imposing stronger guarantees on the underlying communication:

• Fast Messages guarantees that all messages arrive reliably and in order, even if the underlying network hardware does not.

• Fast Messages uses flow control to ensure that a fast sender cannot overrun a slow receiver and thus cause messages to be lost. Flow control is implemented in Fast Messages with a credit system that manages pinned memory in the host computers (a schematic sketch follows this list).
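The credit mechanism can be sketched schematically as follows. The structure and function names are invented for illustration and are not the actual Fast Messages implementation: the sender spends one credit per message, stalls at zero, and regains credits when the receiver reports that pinned receive buffers have been freed:

    #include <stdbool.h>

    /* Schematic credit-based flow control (illustrative only). */
    struct credit_channel {
        int credits;   /* pinned receive buffers currently free at peer */
    };

    /* Sender side: only transmit if the receiver has a free buffer. */
    bool try_send(struct credit_channel *ch /*, message args ... */)
    {
        if (ch->credits == 0)
            return false;        /* would overrun the receiver: stall */
        ch->credits--;           /* one pinned buffer is now in use   */
        /* ... hand the message to the network hardware here ...      */
        return true;
    }

    /* Called when the receiver acknowledges draining n buffers. */
    void credits_returned(struct credit_channel *ch, int n)
    {
        ch->credits += n;
    }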

2.3.2.3 VMMC

The Virtual Memory-Mapped Communication (VMMC) system [13] was developed as a low-latency protocol for the Princeton SHRIMP project. One goal of VMMC was to view messaging as reads and writes into the user-level virtual memory system.

• VMMC works by mapping a page of user virtual memory to physical memory, establishing a correspondence between pages on the sending and the receiving sides.

• It uses specially designed hardware that allows the network interface to snoop writes to memory on the local host and have these writes automatically updated in the remote host's memory. Various optimizations of these writes have been developed to minimize the total number of writes and the network traffic, and to improve overall application performance.

VMMC is an example of a paradigm known as distributed shared memory (DSM). In DSM systems, memory is physically distributed among the nodes in a system, but processes in an application may view shared memory locations as identical and perform reads and writes to them.

2.3.2.4 U-net

The U-net network interface architecture [14] was developed at Cornell University. U-net provides zero-copy messaging where possible.

• U-net adds the concept of a virtual network interface for each connection in a user application, just as an application has a virtual memory address space that is mapped to real physical memory on demand.

• Each communication endpoint of the application is viewed as a virtual network interface mapped to a real set of network buffers and queues on demand.

The advantage of this architecture is that once the mapping is defined, each active interface has direct access to the network without operating system intervention. The result is that communication can occur with very low latency.

2.3.2.5 BIP

The Basic Interface for Parallelism (BIP) is a low-latency protocol that was developed at the University of Lyon [15].

• BIP is designed as a low-level message layer over which a higher-level layer such as the Message Passing Interface (MPI) [3] can be built. Programmers can use MPI over BIP for parallel application programming.

• The initial BIP interface consisted of both blocking and non-blocking calls. Later versions (BIP-SMP) provide multiplexing between the network and shared memory under a single API for use on Clusters of symmetric multiprocessors.

BIP achieves low latency and high bandwidth by using different protocols, like Active Messages and Fast Messages, for various message sizes. It also provides zero or single memory copies of user data. To simplify the design and keep the overheads low, BIP guarantees in-order delivery of messages, although some flow control issues for small messages are passed to higher software levels.

2.3.3 Standards for Cluster Communication

Research on low-latency protocols had progressed sufficiently for a new standard for low-latency messaging to be developed: the Virtual Interface Architecture (VIA). At the same time, industrial researchers worked on standards for shared storage subsystems. The combination of the efforts of many researchers has resulted in the InfiniBand standard.

2.3.3.1 VIA

The Virtual Interface Architecture [16] is a communications standard that combines many of the best features of various academic projects. A consortium of academic and industrial partners, including Intel, Compaq, and Microsoft, developed the standard.

• VIA supported heterogeneous hardware and was available as of early 2001.

• It was based on the concept of a virtual network interface. Before a message can be sent in VIA, send and receive buffers must be allocated and pinned to physical memory locations.

• No system calls are needed after the buffers and associated data structures are allocated.

• A send or receive operation in a user application consists of posting a descriptor to a queue. The application can choose to wait for a confirmation that the operation has completed, or can continue host processing while the message is being processed (the sketch after this list illustrates this post-and-poll pattern).
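The post-and-poll pattern can be sketched generically in C. The types and function names below are invented for illustration and only mimic the shape of VIA-style descriptor queues; they are not the actual VIPL API:

    #include <stddef.h>

    /* Illustrative VIA-style descriptor queue (hypothetical types). */
    struct descriptor {
        void   *buffer;    /* pre-registered, pinned user buffer      */
        size_t  length;
        int     done;      /* set by the NIC when the transfer ends   */
    };

    struct send_queue {
        struct descriptor *slots[64];
        int head;
    };

    /* Post a send: enqueue the descriptor; the interface hardware
       consumes it directly, with no system call on the data path. */
    void post_send(struct send_queue *q, struct descriptor *d)
    {
        d->done = 0;
        q->slots[q->head++ % 64] = d;
    }

    /* The application may poll for completion, or keep computing. */
    int send_done(const struct descriptor *d)
    {
        return d->done;
    }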

Several hardware vendors and some independent developers have developed VIA implementations for various network products [17][18]. VIA implementations can be classified as native or emulated.

• A native implementation of VIA off-loads a portion of the processing required to send and receive messages to special hardware on the network interface card. When a message arrives in a native VIA implementation, the network card performs at least a portion of the work required to copy the message into user memory.

• In an emulated VIA implementation, the host CPU performs the processing to send and receive messages. Although the host processor is used in both cases, an emulated implementation of VIA has less overhead than TCP/IP. However, the services provided by VIA are different from those provided by TCP/IP, since communication may not be guaranteed to arrive reliably in VIA.

2.3.3.2 InfiniBand

The InfiniBand standard [19] is another standard for Cluster protocols and is supported by a large consortium of industrial partners, including Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft and Sun Microsystems. The InfiniBand architecture replaces the standard shared bus for I/O on current computers with a high-speed serial, channel-based, message-passing, scalable, switched fabric. There are two types of adapters: host channel adapters (HCA) and target channel adapters (TCA). All systems and devices attach to the fabric through host channel adapters or target channel adapters, as shown in Figure 2.3.

Figure 2.3: The InfiniBand Architecture

In InfiniBand, data is sent as packets, and six types of transfer methods are available, including:

• Reliable and unreliable connections.

• Reliable and unreliable datagrams.

• Multicast connections.

• Raw packets.

InfiniBand supports remote direct memory access (RDMA) read and write operations, which allow one processor to read or write the contents of memory at another processor. It also directly supports IPv6 [20] messaging for the Internet. InfiniBand has several components:

• Host channel adapter (HCA): An interface that resides within a server. The HCA communicates directly with the server's memory, processor, and a target channel adapter or a switch. It guarantees delivery of data and can recover from transmission errors.

• Target channel adapter (TCA): Enables I/O devices to be located within the network, independent of a host computer. It includes an I/O controller that is specific to its particular device's protocol. TCAs can communicate with an HCA or a switch.

• Switch: A switch is virtually equivalent to a traffic police officer. It allows many HCAs and TCAs to connect to it and handles network traffic, offering higher availability, higher aggregate bandwidth, load balancing, data mirroring and much more. It looks at the ‘local route header’ on each packet of data and forwards it to the appropriate location. A group of switches is referred to as a fabric. If a host computer is down, the switch still continues to operate, and the switch also frees up servers and other devices by handling network traffic.

• Router: Forwards data packets from a local network (called a subnet) to other external subnets. It reads the ‘global route header’ and forwards the packet to the appropriate address, rebuilding each packet with the proper local address header as it passes it to the new subnet.

• Subnet Manager: An application responsible for configuring the local subnet and ensuring its continued operation. Configuration responsibilities include managing switch and router setups and reconfiguring the subnet if a link goes down or a new one is added.

The InfiniBand Architecture (IBA) comprises four primary layers that describe communication devices and methodology:

• Physical Layer: Defines the electrical and mechanical characteristics of the IBA, including the cables, connectors and hot-swap characteristics. IBA connectors include fiber, copper and backplane connectors. There are three link speeds, specified as 1X, 4X and 12X. A 1X link cable has four wires, two for each direction of communication (read and write).

• Link Layer: Includes packet layout, point-to-point link instructions, switching within a local subnet, and data integrity. There are two types of packets: management and data. Management packets handle link configuration and maintenance. Data packets carry up to 4 kilobytes of transaction payload. Every device in a local subnet has a local ID (LID) used for forwarding data appropriately. The layer handles data integrity by including variant and invariant cyclic redundancy checks (CRC); the variant CRC checks fields that change from point to point, and the invariant CRC provides end-to-end data integrity.

• Network Layer: Responsible for routing packets from one subnet to another. The global route header located within a packet includes an IPv6 address for the source and destination of each packet. For single-subnet environments, the network layer information is not used.

• Transport Layer: Handles the order of packet delivery. It also handles partitioning, multiplexing and the transport services that determine reliable connections.

2.4 Cluster Middleware

Middleware is the layer of software sandwiched between the operating system and applications. It has re-emerged as a means of integrating software applications that run in a heterogeneous environment. There is a large overlap between the infrastructure that high-level Single System Image (SSI) services provide to a Cluster and that provided by the traditional view of middleware. Middleware helps a developer overcome three potential problems when developing applications on a heterogeneous Cluster:

• It gives the ability to access software inside or outside their site.

• It helps to integrate software from different sources.

• It enables rapid application development.

The services that middleware provides are not restricted to application development. Middleware also provides services for the management and administration of a heterogeneous system.

2.4.1 Message-based Middleware

Message-based middleware uses a common communication protocol to exchange data between applications. The communication protocol hides many of the low-level message-passing primitives from the application developer. Message-based middleware software can pass messages directly between applications, send messages via software that queues waiting messages, or use some combination of the two. Examples of this type of middleware are the three upper layers of the OSI model [21]: the session, presentation and application layers.

2.4.2 RPC-based Middleware

In many applications, the interactions between processes in a distributed system are remote operations, often with a return value. For these applications Remote Procedure Call (RPC) is used. Implementing the client/server model in terms of Remote Procedure Calls (RPC) allows the code of the application to remain the same whether the procedures it calls are local or remote. Inter-process communication mechanisms serve four important functions [22] (a schematic client stub is sketched after this list):

• They offer mechanisms against failure, and they provide the means to cross administrative boundaries.

• They allow communication between separate processes over a computer network.

• They enforce clean and simple interfaces, thus providing a natural aid for the modular structure of large distributed applications.

• They hide the distinction between local and remote communication, thus allowing static or dynamic reconfiguration.
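To make the stub idea concrete, the following is a minimal hand-written sketch of an RPC client stub in C. The opcode, wire format and transport function are invented for illustration; here the transport is a local stand-in so the sketch is self-contained:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in transport: a real RPC middleware would ship the
       request over the network and block for the reply. */
    static void send_to_server(const void *req_buf, void *reply_buf)
    {
        const uint8_t *req = req_buf;
        int32_t a, b, sum;
        memcpy(&a, req + 1, sizeof a);        /* unmarshal arguments  */
        memcpy(&b, req + 1 + sizeof a, sizeof b);
        sum = a + b;                          /* "remote" procedure   */
        memcpy(reply_buf, &sum, sizeof sum);  /* marshal return value */
    }

    /* Client stub for a remote add(a, b): to the caller it looks
       like an ordinary local call, hiding marshalling and transport. */
    static int32_t rpc_add(int32_t a, int32_t b)
    {
        uint8_t req[1 + 2 * sizeof(int32_t)];
        int32_t result;

        req[0] = 7;                           /* invented opcode       */
        memcpy(req + 1, &a, sizeof a);        /* marshal the arguments */
        memcpy(req + 1 + sizeof a, &b, sizeof b);

        send_to_server(req, &result);
        return result;
    }

    int main(void)
    {
        printf("rpc_add(2, 3) = %d\n", rpc_add(2, 3));
        return 0;
    }

The stub is what lets the application code "remain the same": only the transport behind send_to_server() changes when the procedure moves to a remote node.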

2.4.3 Object Request Broker

An Object Request Broker (ORB) is a type of middleware that supports the remote execution of objects. An international ORB standard is CORBA (Common Object Request Broker Architecture). It is supported by more than 700 groups and managed by the Object Management Group (OMG) [23]. The OMG is a non-profit organization whose objective is to define and promote standards for object orientation in order to integrate applications based on existing technologies. The Object Management Architecture (OMA) is characterized by the following:

• The Object Request Broker (ORB): the controlling element of the architecture; it supports the portability of objects and their interoperability in a network of heterogeneous systems.

• Object services: specific system services for the manipulation of objects. Their goal is to simplify the process of constructing applications.

• Application services: a set of facilities allowing applications to access databases and printing services, to synchronize with other applications, and so on.

• Application objects: allow the rapid development of applications. A new application can be formed from objects in a combined library of application services.

2.5 Single System Image (SSI)

SSI is the illusion, created by software or hardware, that presents a collection of computing resources as one, more whole resource [24]. In other words, it is the property of a system that hides the heterogeneous and distributed nature of the available resources and presents them to users and applications as a single unified computing resource. SSI makes the Cluster appear like a single machine to the user, to applications, and to the network. SSI Cluster-based systems are mainly focused on complete transparency of resource management, scalable performance, and system availability in supporting user applications. SSI is supported by a middleware layer that resides between the OS and the user-level environment. This middleware consists of essentially two sub-layers of software infrastructure:

• SSI infrastructure: glues together the operating systems on all nodes to offer unified access to system resources.

• System availability infrastructure: enables Cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant support among all nodes of the Cluster.

2.5.1 Benefits of SSI

There are several benefits of SSI:

• Use of system resources is transparent.

• Transparent process migration and load balancing across nodes.

• Improved reliability and higher availability.

• Improved system response time and performance.

• Simplified system management.

• Reduction in the risk of operator errors.

• Users need not be aware of the underlying system architecture to use the machines effectively.

2.5.2 Features of SSI Clustering Systems

• Single I/O Space: Any node can access any peripheral or disk device without knowledge of its physical location.

• Single Process Space: A process on any node can create processes with Cluster-wide process identifiers, and processes communicate through signals, pipes, etc., as if they were on a single node.

• Single Global Job Management System: SSI provides a single global job management system. The manager node manages all the operations.

• Checkpointing: Some SSI systems allow checkpointing of running processes, allowing their current state to be saved and reloaded at a later date. Checkpointing can be seen as related to migration, as migrating a process from one node to another can be implemented by first checkpointing the process, then restarting it on another node. Alternatively, checkpointing can be considered as migration to disk.

• Process Migration: Many SSI systems provide process migration. Processes may start on one node and be moved to another node, possibly for resource balancing or administrative reasons. As processes are moved from one node to another, other associated resources may be moved with them.

2.5.3 Functional Relationship among Middleware SSI Modules

Every SSI has a boundary. Single system support can exist at different levels within a system, one able to be built on another. In SSI there can be three levels of abstraction: the application and subsystem level, the operating system kernel level, and the hardware level. Figure 2.4 shows the functional relationship among the middleware SSI modules. Resource Management and Scheduling is done at the subsystem level.

Figure 2.4: Functional Relationship Among Middleware SSI Modules

2.5.3.1 Resource Management and Scheduling (RMS)

The RMS system is responsible for distributing applications among Cluster nodes. It enables the effective and efficient utilization of the resources available. In RMS there are two types of software components; the basic architecture of RMS, a client-server system, is shown in Figure 2.5.

• Resource manager: handles locating and allocating computational resources, authentication, and process creation and migration.

• Resource scheduler: handles queuing applications, and resource location and assignment. It instructs the resource manager what to do and when (policy).

Figure 2.5: Resource Management and Scheduling (RMS)

There are several services provided by RMS:

• Process Migration.

• Checkpointing.

• Fault Tolerance.

• Minimization of Impact on Users.

• Load Balancing.

• Multiple Application Queues.

2.6 Examples of Cluster Implementation

In this Section, we will discuss two existing Cluster implementations: Linux Virtual Server (LVS), an open source project which is an advanced load balancing solution for Linux systems; and Windows Compute Cluster Server 2003, a commercial Cluster server developed by Microsoft Corporation.

2.6.1 Linux Virtual Server

In this Section, we will briefly discuss the Linux Virtual Server [25]. Linux Virtual Server (LVS) is an advanced load balancing solution for Linux systems. It is an open source project started by Wensong Zhang in May 1998. The mission of the project was to build a high-performance and highly available server for Linux using Clustering technology, providing good scalability, reliability and serviceability. The Linux Virtual Server directs clients' network connection requests to multiple servers that share the workload, and can be used to build scalable and highly available Internet services.

The Linux Virtual Server directs clients' network connection requests to the different servers according to scheduling algorithms and makes the parallel services of the Cluster appear as a single virtual service with a single IP address. The Linux Virtual Server extends the TCP/IP stack of the Linux kernel to support three IP load-balancing techniques:

• NAT (Network Address Translation): Maps IP addresses from one group to another. NAT is used when hosts in internal networks want to access the Internet and be accessed from the Internet.

• IP tunneling: Encapsulates IP datagrams within IP datagrams. This allows datagrams destined for one IP address to be wrapped and redirected to another IP address.

• Direct routing: Allows responses to be routed directly to the actual user machine instead of through the load balancer.

The Linux Virtual Server also provides four scheduling algorithms for selecting servers from the Cluster for new connections (a sketch of the least-connection policy follows this list):

• Round robin: Directs network connections to the different servers in a round-robin manner.

• Weighted round robin: Treats real servers as having different processing capacities. A scheduling sequence is generated according to the server weights, and clients' requests are directed to the different real servers based on the scheduling sequence in a round-robin manner.

• Least-connection: Directs clients' network connection requests to the server with the least number of established connections.

• Weighted least-connection: A performance weight can be assigned to each real server. Servers with a higher weight value will receive a larger percentage of live connections at any time.
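As an illustration of the simplest dynamic policy, the following sketch implements the least-connection rule in C. The server table is a stand-in for the kernel's internal state, not actual LVS code:

    #include <stdio.h>

    /* Stand-in for an LVS real-server entry (illustrative only). */
    struct server {
        const char *name;
        int active_conns;   /* currently established connections */
    };

    /* Least-connection scheduling: pick the server with the fewest
       established connections. */
    struct server *least_connection(struct server *pool, int n)
    {
        struct server *best = &pool[0];
        for (int i = 1; i < n; i++)
            if (pool[i].active_conns < best->active_conns)
                best = &pool[i];
        return best;
    }

    int main(void)
    {
        struct server pool[] = {
            { "web1", 12 }, { "web2", 7 }, { "web3", 9 }
        };
        struct server *s = least_connection(pool, 3);
        s->active_conns++;                  /* assign the new connection */
        printf("new connection -> %s\n", s->name);
        return 0;
    }

The weighted variants follow the same pattern, with the comparison scaled by each server's weight.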

Client applications interact with the Cluster as if it were a single server. The clients are not affected by the interaction with the Cluster and do not need modification. Application performance scalability is achieved by adding one or more nodes to the Cluster; high availability is achieved by automatically detecting node or daemon failures and reconfiguring the system appropriately. The Linux Virtual Server follows the three-tier architecture shown in Figure 2.6. The functionality of each tier is:

Figure 2.6: Linux Virtual Server

• Load Balancer: The front end of the service as viewed by connecting clients. The load balancer directs network connections from clients, who access a single IP address for a particular service, to a set of servers that actually provide the service.

• Server Pool: Consists of a Cluster of servers that implement the actual services, such as Web, FTP, mail, DNS, and so on.

• Back-end Storage: Provides the shared storage for the servers, so that it is easy for the servers to keep the same content and provide the same services.

The load balancer handles incoming connections using IP load-balancing techniques. It selects servers from the server pool, maintains the state of concurrent connections and forwards packets, and all this work is performed inside the kernel, so the handling overhead of the load balancer is low. The load balancer can handle much larger numbers of connections than a general server; therefore the load balancer can schedule a large number of servers without becoming a potential bottleneck in the system.

  • The server nodes may be replicated for either scalability or high availability. When

    the load on the system saturates the capacity of the current server nodes, more server

    nodes can be added to handle the increasing workload. One of the advantages of

    a Clustered system is that it can be built with hardware and software redundancy.

    Detecting a node or daemon failure and then reconfiguring the system appropriately

    so that its functionality can be taken over by the remaining nodes in the Cluster is

    a means of providing high system availability. A Cluster-monitor-daemon can run on

    the load balancer and monitor the health of server nodes. If a server node cannot be

    reached by ICMP (Internet Control Message Protocol) ping or there is no response

    of the service in the specified period, the monitor will remove or disable the server in

    the scheduling table, so that the load balancer will not schedule new connections to

    the failed one and the failure of a server node can be masked.

The back-end storage for this system is usually provided by a distributed, fault-tolerant file system, which also takes care of the availability and scalability of file system accesses. The server nodes access the distributed file system much as they would a local file system. However, multiple identical applications running on different server nodes may access shared data concurrently, and any conflict among them must be reconciled so that the data remains in a consistent state.
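As a minimal illustration of such reconciliation, each server node could take an exclusive advisory lock before updating a shared file, so concurrent writers are serialized; the file name is invented for the example:

    import fcntl

    def update_shared_file(path, record):
        # Serialize concurrent writers from different server nodes with an
        # exclusive advisory lock on the shared file.
        with open(path, "a") as f:
            fcntl.flock(f, fcntl.LOCK_EX)   # block until the lock is held
            try:
                f.write(record + "\n")      # the critical section
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    update_shared_file("shared_state.txt", "record from node real1")

A production distributed file system would provide its own locking or leasing mechanism; the sketch only shows the mutual-exclusion idea.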

    2.6.2 Windows Compute Cluster Server 2003

    In this Section, we will briefly discuss Windows Compute Cluster Server 2003 [26].

    It is an integrated platform for running, managing, and developing high performance

computing applications.

    2.6.2.1 Compute Cluster Components

    Each Windows Compute Cluster Server 2003 Cluster consists of a head node and one

    or more compute nodes. The head node mediates all access to the Cluster resources

    and acts as a single point for Cluster deployment, management, and job scheduling.

    A Cluster can consist of only a head node.

    • Head node: The head node is responsible for providing user interface and

    management services to the Cluster. The user interface consists of the Com-

    pute Cluster Administrator, which is a Microsoft Management Console (MMC)

snap-in, the Compute Cluster Job Manager, which is a Win32 graphical user

    interface, and a Command Line Interface (CLI). Management services include

    job scheduling, job and resource management, node management, and Remote

    Installation Services (RIS).

    • Compute node: A compute node is a computer configured as part of a high

    performance Cluster to provide computational resources for the end user. Com-

    pute nodes on a Windows Compute Cluster Server 2003 Cluster must have a

    supported operating system installed, but nodes within the same Cluster can

    have different operating systems and different hardware configurations.

    2.6.2.2 Network Architecture

The network configuration consists of a head node and a scalable number of compute

    nodes. The nodes can be connected as part of a larger server network, or as a private

    network with the head node serving as a gateway. Figure 2.7 shows both types of

arrangement. The networking medium can be Ethernet or it can be a high-speed

    medium such as InfiniBand (typically used only for MPI or similar communication

    among the nodes).

    Figure 2.7: Network Architecture

    2.6.2.3 Software Architecture

    The software architecture consists of a user interface layer, a scheduling layer, and an

    execution layer. The interface and scheduling layers reside on the head node. The

    execution layer resides primarily on the compute nodes. The execution layer as shown

    in Figure 2.8 includes the Microsoft implementation of MPI, called MS MPI, which

    was developed for Windows and is included in the Microsoft Compute Cluster Pack.

• Interface layer: The user interface layer consists of the Compute Cluster Job

    Manager, the Compute Cluster Administrator, and Command Line Interface

(CLI). The Compute Cluster Job Manager is a Win32 graphical user interface to

    the Job Scheduler that is used for job creation and submission. The Compute

    Cluster Administrator is a Microsoft Management Console (MMC) snap-in that

    is used for configuration and management of the Cluster. The Command Line

    Interface is a standard Windows command prompt which provides a command-

    line alternative to use of the Job Manager and the Administrator.

Figure 2.8: Software Architecture

    • Scheduling layer: The scheduling layer consists of the Job Scheduler, which is

    responsible for queuing the jobs and tasks, reserving resources, and dispatching

    jobs to the compute nodes.

    • Execution layer: The execution layer consists of the following components

    replicated on each compute node: the Node Manager Service, the MS MPI

    launcher mpiexec, and the MS MPI Service. The Node Manager is a service

    that runs on all compute nodes in the Cluster. The Node Manager executes jobs

on the node, sets task environment variables, and sends a heartbeat (health

    check) signal to the Job Scheduler at specified intervals (the default interval is

one minute). mpiexec is the MPICH2-compatible multi-threading executable

    within which all MPI tasks are run. The MS MPI Service is responsible for

starting the job tasks on the various processors. (A minimal sketch of the heartbeat mechanism follows this list.)
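The heartbeat mechanism can be sketched as follows; the function names and the in-memory table are our own invention (the real service communicates over Windows RPC mechanisms, not a plain function call):

    import time

    HEARTBEAT_INTERVAL = 60          # the default interval: one minute
    last_beat = {}                   # node name -> time of last heartbeat

    def send_heartbeat(node):
        # Called periodically by each compute node's Node Manager.
        last_beat[node] = time.time()

    def unreachable_nodes(grace=2 * HEARTBEAT_INTERVAL):
        # Scheduler-side check: nodes silent for too long are unreachable.
        now = time.time()
        return [n for n, t in last_beat.items() if now - t > grace]

    send_heartbeat("compute-01")
    print(unreachable_nodes())       # [] while the node keeps beating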

    2.6.2.4 Job Execution

The steps of job execution are as follows:

    1. Creating and submitting jobs:

    Creating a job is the first step in Cluster computing. It is a resource request con-

    taining one or more computing tasks to be run in parallel. Each task may in turn

be parallel or it may be serial. One can create a job using the Job Manager or the CLI. Creating a job means describing its priority, run-time limit, number of processors required, any specific nodes requested, and whether nodes will be reserved exclusively for the job, and then adding the tasks that the job will execute. A task's

    properties also include any input, output, and error files required, as well as a list

    of any other tasks on which this task depends. After defining the job and its tasks,

    the next step is to submit it to the Job Scheduler. After the job is submitted, it

    takes its place in the job queue with the status Queued and waits its turn to be

    activated.
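The job and task properties listed above can be modelled roughly as follows; the field names mirror the text but are our own invention, not the actual Compute Cluster API:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Task:
        command: str
        stdin: str = ""
        stdout: str = ""
        stderr: str = ""
        depends_on: List[str] = field(default_factory=list)

    @dataclass
    class Job:
        priority: int
        run_time_limit_minutes: int
        processors_required: int
        requested_nodes: List[str] = field(default_factory=list)
        exclusive: bool = True
        tasks: List[Task] = field(default_factory=list)
        status: str = "NotSubmitted"

    job = Job(priority=2, run_time_limit_minutes=60, processors_required=4)
    job.tasks.append(Task(command="myapp.exe", stdout="out.txt"))
    job.status = "Queued"    # after submission it waits in the job queue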

    2. Job Scheduler:

When a job is submitted, it is placed under the management of the Job Scheduler. The Job

    Scheduler determines the job’s place in the queue and allocates resources to the

    job when the job reaches the top of the queue and as resources become available.

    Jobs are ordered in the queue according to a set of rules called scheduling policies.

Resource allocation is based on resource sorting. When the requested resources

    have been allocated, the scheduler dispatches the job tasks to the compute nodes

    and takes on a management and monitoring function. The scheduler manages jobs

    by enforcing certain job and task options, as well as managing job or task status

    changes. It monitors jobs by reporting on the status of the job and its tasks, as

well as the health of the nodes. The Job Scheduler implements the following scheduling

    policies:

    • Priority-based, first-come, first-served scheduling: Priority-based, first-

    come, first-served (FCFS) scheduling is a combination of FCFS and priority-

    based scheduling. Using priority-based FCFS scheduling, the scheduler places

    a job into a higher or lower priority group depending on the job’s priority set-

    ting, but always places that job at the end of the queue in that priority group

    because it is the last submitted job.

    • Backfilling: Backfilling maximizes node utilization by allowing a smaller

    job or jobs lower in the queue to run ahead of a job waiting at the top of

    the queue, as long as the job at the top is not delayed as a result. When a

job reaches the top of the queue, a sufficient number of nodes may not be available to meet its minimum processor requirement. When this happens, the job reserves any nodes that are immediately available and waits for currently running jobs to complete (a minimal sketch of these two policies follows this list).

    • Exclusive scheduling: By default, a job has exclusive use of the nodes

    reserved by it. This can produce idle reserved processors on a node. Idle

    reserved processors are processors that are not used by the job but are also

    not available to other jobs. By turning off the exclusive property, the user

    allows the job to share its unused processors with other jobs that have also

been set as nonexclusive. Therefore, non-exclusivity is a reciprocal agreement

    among participating jobs, allowing each to take advantage of the other’s un-

    used processors.
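A minimal sketch of the first two policies, with invented job fields, could look like this:

    def enqueue(queue, job):
        # Priority-based FCFS: insert the job at the end of its own
        # priority group (behind all jobs of equal or higher priority).
        idx = len(queue)
        for i, queued in enumerate(queue):
            if queued["priority"] < job["priority"]:
                idx = i
                break
        queue.insert(idx, job)

    def can_backfill(candidate, free_nodes, head_start_time, now, runtime):
        # Backfilling: a lower-queued job may jump ahead only if it fits in
        # the currently free nodes and finishes before the head job starts.
        return (candidate["min_nodes"] <= free_nodes
                and now + runtime <= head_start_time)

    queue = []
    enqueue(queue, {"name": "A", "priority": 1, "min_nodes": 8})
    enqueue(queue, {"name": "B", "priority": 3, "min_nodes": 2})
    print([j["name"] for j in queue])                  # ['B', 'A']
    print(can_backfill({"min_nodes": 2}, free_nodes=4,
                       head_start_time=100.0, now=0.0, runtime=50.0))  # True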

    3. Task execution:

    Job Scheduler dispatches tasks to the compute nodes in the order that they appear

    in the task list. To dispatch the task, Job Scheduler passes the task to a desig-

    nated node, which can be any of the compute nodes allocated to the job. Unless

dependencies have been specified, the tasks are dispatched on a first-come, first-served

    (FCFS) basis.

    For serial tasks, the first two tasks will be dispatched to and run on the designated

    node (assuming it has two processors), the next two tasks will be dispatched to

    and run on a second designated node, and the sequence will repeat itself until

    there are no more tasks or until all the processors in the Cluster are being used.

    Any remaining tasks must wait for the next available processor and run when it

    becomes available. The following Figure 2.9 shows this process. The file server

    shown on the head node may not actually reside there. It can reside anywhere

    in the external or internal network. An MSDE server stores the job specifications

    and user log-on credentials. The task ID number, which also contains the job ID

    number, allows Job Scheduler to keep track of the status of the task as part of the

    job, displaying both job and task status to the user.

Figure 2.9: Serial Task execution

For parallel tasks, execution flow depends on the user application and the software that supports it. For jobs that are run using the Microsoft Message Passing Interface Service, tasks are executed as follows. The MS MPI executable mpiexec

    is started on the designated node. mpiexec, in turn, starts all the task processes

    through the node-specific MS MPI Service. If more than one node is required for

    the task, additional instances of MS MPI, one per node, are spawned before the

    task processes themselves are started. Parallel task flow is shown in Figure 2.10.

    In the Figure, P0 through P5 represent the processes that are created, each part

    of a single task. This illustration shows the most common case, in which only one

process, P0, handles all the standard input and output files.

Figure 2.10: Parallel Task execution

    2.7 Concluding Remarks

As the beginning of our work, we have studied the issues related to parallel computation, focusing on the architectures, protocols, and standards of Computer Clusters. The ideas behind distributed processing with Computer Clusters evolve into a more advanced technology known as Grid Computing, which we discuss in the next Chapter.

Chapter 3

    Grid Computing : An Introduction

Grid Computing, or more specifically a ‘Grid Computing System’, is a virtualized distributed

    environment. Grid environment provides dynamic runtime selection, sharing and ag-

    gregation of geographically distributed resources based on availability, capability, per-

    formance and cost of these computing resources. Fundamentally, Grid Computing is

    the advanced form of distributed processing which is the combination of decentralized

    architecture for managing computing resources and a layered hierarchical architecture

    for providing services to the user [27].

The rest of the Chapter is organized as follows. We begin with the definition of Grid Computing in Section 3.1 and compare Grids with Computer Clusters in Section 3.2. In Sections 3.4 and 3.5 we consider the underlying layers of Grid Computing in detail. Resource management architecture is discussed in Section 3.6, and the protocol for resource management (GRAM) is discussed in Section 3.6.2. We also present a resource monitoring architecture for the Grid environment in Section 3.7. We conclude our discussion in Section 3.8 by introducing a new approach to distributed processing known as Cloud Computing.

3.1 Grid Computing: definitions and overview

The concept of the Grid was introduced in the early 1990s, when high-performance computers were connected by fast data communication links. The motivation of that approach was to support computation- and data-intensive scientific applications. Figure 3.1 [28] shows the evolution of the Grid over time.

    Figure 3.1: Evolution of Grid Computing

The basic idea of the Grid is the co-allocation of distributed computational resources. The most cited definition of the Grid is [29]:

    “A computational grid is a hardware and software infrastructure

    that provides dependable, consistent, pervasive, and inexpensive

    access to high-end computational capabilities.”

Again, according to IBM's definition [30],

    “A grid is a collection of distributed computing resources available

    over a local or wide area network that appear to an end user or

    application as one large virtual computing system. The vision is to

create virtual dynamic organizations through secure, coordinated

    resource-sharing among individuals, institutions, and resources.”

A Grid Computing environment must include:

Coordinated resources: The Grid environment must provide the necessary infrastructure for coordination of resources based upon policies and service-level agreements.

    Open standard protocols and frameworks: Open standards can provide inter-

operability and integration facilities. These standards should be applied for resource discovery, resource access and resource co-ordination. The Open Grid Services Infrastructure (OGSI) [31] and Open Grid Services Architecture (OGSA) [32] were published by the Global Grid Forum (GGF) as proposed recommendations

    for this approach.

Grid Computing can also be distinguished from High Performance Computing (HPC) and Clustered Systems in the following way: the Grid focuses on resource sharing and can result in HPC, whereas HPC does not necessarily involve sharing of resources [33]. The Grid enables the abstraction of distributed systems and resources such as pro-

    cessing, network bandwidth and data storage to create a Single System Image. Such

abstraction provides continuous access to a large pool of IT capabilities. Figures 3.2 and 3.3 [28] compare the Grid environment with traditional computation. An organization-owned computational Grid is shown in Figure 3.3, where a

    scheduler sets policies and priorities for placing jobs in the Grid infrastructure.

Figure 3.2: Serving job requests in traditional environment

    3.2 Grids over Cluster Computing

The Computer Clusters discussed in Chapter 2 are local to a domain. Clusters are designed to resolve the problem of inadequate computing power; they provide more computational power by pooling resources and parallelizing the workload. As Clusters provide dedicated functionality to a local domain, they are not a suitable solution for resource sharing between users of various domains. Nodes in the Cluster are controlled centrally, and a Cluster manager monitors the state of the nodes [34]. So, in brief, Clusters provide only a subset of Grid functionality.

    The big difference is that a Cluster is homogeneous while Grids are heterogeneous

    [35]. The computers that are part of a Grid can run different operating systems and

    have different hardware whereas the Cluster Computers all have the same hardware

    and OS. A Grid can make use of spare computing power on a desktop computer while

the machines in a Cluster are dedicated to work as a single unit. Grids are inherently distributed over a LAN or WAN, whereas the computers in a Cluster are normally contained in a single location.

Figure 3.3: Serving job requests in Grid environment

Clusters are configurable in Active-Active or Active-Passive ways. Active-Active means that each computer runs its own set of services (say, one runs a SQL instance, the other runs a web server) while sharing some resources such as storage. If one of the computers in a Cluster goes down, its services fail over to the other node and almost seamlessly start running there. Active-Passive is similar, but only one machine runs the services, and the other takes over only once there is a failure. Cluster components can be shared or dedicated. On the other hand, some Grid resources may be shared while others may be dedicated or reserved.

Another difference lies in the way resources are handled. In the case of a Cluster, all nodes present a single system view and resources are managed by a centralized resource manager. In the case of a Grid, every node is autonomous: it has its own resource manager and behaves like an independent entity.

    3.3 An example of Grid Computing environment

    Figure 3.4: Google search architecture

We consider searching the World Wide Web with Google as an example of a Grid Computing environment. Figure 3.4 shows an abstract view of the Google search architecture [36]. Google processes tens of thousands of queries per second. Each query is first received by one of the Web Servers, which passes it to an array of Index Servers. The Index Servers are responsible for keeping an index of the words and phrases found in websites. These servers are distributed across several machines, so the search runs concurrently. In a fraction of a second, the Index Servers perform a logical AND operation and return references to the websites containing the query phrase. The resulting references are then sent

    to Store Servers. Store Servers maintain compressed copies of all the pages known

to Google. These compressed copies are used to prepare page snippets, which are finally presented to the end user in a readable form.

Crawler Machines continuously traverse the web and update the Google database of pages stored in the Index and Store Servers. So, the Store Servers actually contain relatively recent, compressed copies of all the pages available on the web.

Grid Computing facilitates this scenario of efficient searching. As stated earlier, the servers are distributed and the searching must be parallel in order to achieve efficiency. The infrastructure also needs to scale with the growth of the web as the number of pages and indexes increases. Different organizations share their numerous servers with Google, allowing it to copy their content and transform it into its local resources. These local resources comprise the keyword database of the Index Servers and the cached content in the database of the Store Servers. The resources are partially shared with end-users, who send queries through their browsers. Users can then directly contact the original servers to request the full content of a web page.

Google also shares computing cycles: it shares its computing resources, such as storage and processing capability, with the end-user by performing data caching, ranking and query searching.

3.4 Grid Architecture

    In this Section, we will discuss Grid architecture, which identifies the basic compo-

    nents of a Grid system. It also defines the purpose and functions of such components.

This layered Grid architecture also indicates how the components actually interact with one another.

Here, we present the Grid architecture described in [37]. Figure 3.5 shows the Grid layers

    from top to bottom.

    Figure 3.5: Grid Protocol Architecture

    3.4.1 Fabric Layer: Interfaces to Local Resources

The Fabric layer provides the resources that can be shared in a Grid environment. Examples of such resources are computational resources, storage systems, sensors and

    network systems. Grid architecture does not deal with resources like distributed file

    systems, where resource implementation requires individual internal protocols [37].

    The computational resources represent multiple architectures such as clusters, super-

computers, servers and ordinary PCs which run on a variety of operating systems (such

as UNIX variants or Windows) [38].

    Components of the Fabric layer implement the local and resource-specific operations

on specific resources. Such resources may be physical or even logical; logical resources include software components, policy files, workflow applications, etc. [39]. These resource-specific operations provide the functionality for sharing operations at higher levels. In order to support sharing mechanisms, we need to provide (a minimal sketch follows the list) [34]:

• an inquiry mechanism so that the components of the Fabric layer are allowed to discover and monitor resources.

• an appropriate (either application-dependent or unified, or both) resource management functionality to control the QoS in the Grid environment.
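As a rough illustration of these two requirements, a Fabric-layer resource might expose an inquiry call and a QoS-management hook along the following lines (the interface is entirely invented; real fabrics expose this through their own local protocols):

    class FabricResource:
        def __init__(self, name, total_cpus):
            self.name = name
            self.total_cpus = total_cpus
            self.load = 0.0
            self.qos_limit = None

        def describe(self):
            # Inquiry: report the structure and current state of the resource.
            return {"name": self.name, "cpus": self.total_cpus,
                    "load": self.load}

        def set_qos_limit(self, max_load):
            # Management: cap utilisation to honour a QoS agreement.
            self.qos_limit = max_load

    node = FabricResource("cluster-42", total_cpus=64)
    print(node.describe())
    node.set_qos_limit(0.8)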

    3.4.2 Connectivity Layer: Managing Communications

    Connectivity layer defines the core communication and authentication protocols neces-

    sary for grid networks. Communication protocol transfers data between Fabric layer

    resources. Authentication protocols, however, build on communication services for

    providing cryptographically secure mechanisms to the Grid users and resources.

    The communication protocol can work with any of the networking layer protocols

that support transport, routing, and naming functionalities. In computational Grids, the TCP/IP Internet protocol stack is commonly used [37].

3.4.3 Resource Layer: Sharing of a Single Resource

The Resource layer sits on top of the Connectivity layer and defines the protocols, along with APIs and SDKs, for secure negotiation, monitoring, initialization, control and payment of sharing operations on individual resources. The Resource layer uses Fabric layer interfaces and functions to access and control local resources. This layer considers only local, individual resources and therefore ignores global resource management issues. To share a single resource, we distinguish two classes of Resource layer protocols (a minimal sketch follows the list) [37]:

• Information protocols: Information protocols are used to discover information about the state and structure of a resource, for example its configuration, current load, usage policy or cost.

• Management protocols: Management protocols in the Resource layer are used to control access to a shared resource. These protocols specify resource requirements, including advance reservation and QoS, and the operations to be performed on the resource, such as process creation and data access.
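The division of labour between the two protocol classes can be sketched with an invented resource-manager stub; this is an illustration of the roles, not the actual GRIP or GRAM wire protocols:

    class ResourceManagerStub:
        def __init__(self):
            self.jobs = []

        def query_state(self):
            # Information protocol: configuration, load and usage policy.
            return {"current_load": len(self.jobs),
                    "usage_policy": "members-only",
                    "cost_per_hour": 0}

        def create_process(self, executable, reservation=None):
            # Management protocol: request an operation on the resource,
            # optionally against an advance reservation.
            self.jobs.append({"exe": executable, "reservation": reservation})
            return len(self.jobs) - 1        # a handle for the new job

    rm = ResourceManagerStub()
    print(rm.query_state())
    handle = rm.create_process("myprog")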

    3.4.4 Collective Layer : Co-ordination with multiple resources

The Resource layer, described in Section 3.4.3, deals with the operation and management of a single resource (for example, a computational resource, storage or network system). The Collective layer of the Grid architecture, in contrast, contains protocols and services that are not associated with any one specific resource but rather are global in nature and handle interactions across collections of resources. This layer provides the necessary APIs and SDKs for the global resources of the overall Grid environment.

Figure 3.6: Collective and Resource layer protocols are combined in various ways to provide application functionality

    The implementation of Collective layer functions can be built on Resource layer or

    other Collective layer protocols and APIs [37]. Figure 3.6 shows a Collective co-

    allocation API and SDK that uses a Resource layer management protocol to control

resources. On top of this, a co-reservation service protocol and the service itself are defined. To implement co-allocation operations, the co-allocation API is called, which provides additional functionality such as authorization and fault tolerance. An application then uses the co-reservation service protocol to request and perform end-to-end reservations.

    3.4.5 Application Layer : User defined Grid Applications

    The top layer of the Grid consists of user applications, which are constructed by uti-

    lizing the services defined at each lower layer. At each layer, we have well-defined

protocols that give access to useful services, for example resource management, data access and resource discovery. Figure 3.7 shows the correlation between the different layers

[37]. APIs are implemented by SDKs, which use Grid protocols to provide functionality to the end user. A higher-level SDK can also provide functionality that is not directly mapped to a specific protocol; it may combine protocol operations with calls to additional APIs to implement local functionality.

Figure 3.7: Programmer's view of the Grid Architecture. Thin lines denote protocol interactions while bold lines represent a direct call

    3.5 Grid Computing with Globus

Globus [40] provides a software infrastructure that lets applications treat distributed computing resources as a single virtual machine [41]. The Globus Toolkit, the core component of the infrastructure, defines the basic services and capabilities required for a computational Grid. Globus is designed as a layered architecture where high-level global services are

built on top of low-level local services. In this Section, we will discuss how Globus

    toolkit protocols actually interact with Grid layers.

    • Fabric Layer:

The Globus Toolkit is designed to use existing fabric components [37]. For example, enquiry software is provided for discovering structure and state information of various common resources, such as computer information (e.g., OS version and hardware configuration) and storage systems (e.g., available space). In the higher-level protocols (particularly at the Resource layer), the implementation of resource management is normally assumed to be the domain of local resource managers.

    • Connectivity Layer:

    Globus uses public-key based Grid Security Infrastructure (GSI) protocols [42,

    43] for authentication, communication protection, and authorization. GSI ex-

    tends the Transport Layer Security (TLS) protocols [44] to address the issues

of single sign-on, delegation, and integration with various local security solutions.

    • Resource Layer:

The Grid Resource Information Protocol (GRIP) [45] defines a standard resource information protocol. The HTTP-based Grid Resource Access and Man-

    agement (GRAM) [46] protocol is used for allocation of computational resources

    and also for monitoring and controlling the computation of those resources. An

    extended version of the FTP, GridFTP [47], is used for partial file access and

    management of parallelism in the high-speed data transfers [37].

    The Globus Toolkit defines client-side C and Java APIs and SDKs for these

protocols. Server-side SDKs are also provided for each protocol, to support the integration of various resources (computational, storage, network) into the Grid [37].

    • Collective Layer:

Grid Information Index Servers (GIISs) support arbitrary views on resource subsets, the LDAP information protocol is used to access resource-specific GRISs to obtain resource state, and the Grid Resource Registration Protocol (GRRP) is used for resource registration. Replica catalog and replica management services are also used to support the management of dataset replicas. An online credential repository service known as ‘MyProxy’ provides secure storage for

    proxy credentials [48]. The Dynamically-Updated Request Online Coallocator

    (DUROC) provides an SDK and API for resource co-allocation [49].

    3.6 Resource Management in Grid Computing

In this Section, we will discuss the resource management architecture described in [46], which operates as a Resource layer protocol. A block diagram of the architecture is shown in Figure 3.8.

    Figure 3.8: A resource management architecture for Grid Computing environment

A Resource Specification Language (RSL), described in detail in Section 3.6.1, is used to communicate resource requests between components. Through a process called specialization, Resource Brokers transform a high-level RSL specification into a concrete specification of resources. This concrete specification, called a ground request, is passed to a co-allocator, which is responsible for allocating and managing resources at multiple sites. A multi-request is a request that involves resources at multiple sites; resource co-allocators can break such a multi-request into components and pass each element to the appropriate resource manager. The information service, sitting between the Resource Broker and the Co-allocator, is responsible for giving access to the availability and capability of resources.

    3.6.1 Resource Specification Language

The Resource Specification Language (RSL) is a combination of parameter specifications composed with the operators:

    • & : conjunction of parameter specifications

    • | : disjunction of parameter specifications

• + : combines two or more requests into a single compound request, or multi-request

Resource brokers, co-allocators and resource managers each define a set of parameter names. Resource managers generally recognize two types of parameter names in order to communicate with local schedulers.

• MDS attribute names: used to express constraints on resources, for example memory>64 or network=atm.

• Scheduler parameters: used to communicate job-related information, e.g. count (number of nodes required), max_time (maximum time required), executable, environment (environment variables), etc.

For example, the following simple specification, taken from [46],

&(executable=myprog)(|(&(count=5)(memory>=64))(&(count=10)(memory>=32)))

requests 5 nodes with at least 64 MB of memory, or 10 nodes with at least 32 MB of memory.

    Here, executable and count are scheduler parameters.
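The operators compose mechanically, as the following small helper shows; it merely concatenates strings in RSL's prefix syntax and is our own illustration, not part of any Globus API:

    def conj(*clauses):              # & : all constraints must hold
        return "&" + "".join(clauses)

    def disj(*clauses):              # | : any alternative may be chosen
        return "|" + "".join("(%s)" % c for c in clauses)

    def param(name, value, op="="):
        return "(%s%s%s)" % (name, op, value)

    spec = conj(
        param("executable", "myprog"),
        "(%s)" % disj(
            conj(param("count", 5), param("memory", 64, op=">=")),
            conj(param("count", 10), param("memory", 32, op=">=")),
        ),
    )
    print(spec)
    # &(executable=myprog)(|(&(count=5)(memory>=64))(&(count=10)(memory>=32)))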

Again, the following is an example of a multi-request:

+(&(count=80)(memory>=64)(executable=my_executable)(resourcemanager=rm1))
(&(count=256)(network=atm)(executable=my_executable)(resourcemanager=rm2))

Here, two requests are combined by the + operator. This is also an example of a ground request, as every component of