The SG Cluster with Load Balance and Fault Tolerance

Shang Rong Tsai

Department of Electrical Engineering

National Cheng-Kung University

2001 Nov. 20

What is an SG Cluster

• The SG cluster combines a load-balancing cluster with a high-availability cluster. It lets you build a load-balanced, fault-tolerant, highly available cluster for most existing applications.

• A typical SG cluster contains one or two load balancers and several back-end application servers. With more than one load balancer, the cluster can tolerate the failure of a load balancer.

• It was developed at the DSLab EE.NCKU.

Features of SG Clusters

• Client Transparent
  – A group of back-end application servers, which may run on different platforms, appears as a single server to the client
• Scalable
  – System service capacity can be increased by adding new servers to the cluster
• Extensible
  – Various read/write models
  – Turns existing applications into a scalable system with little or no modification
• Manageable
  – Simple to install (single floppy) and easy to administer (web interface)

Features of SG Clusters (continued)

• Load Balancing
  – Incoming requests are routed to the least loaded servers, based on various policies, for optimal performance
• Fault Tolerant
  – The load balancer monitors the availability of back-end servers and routes client requests only to those that are alive
  – More than one load balancer can be set up to avoid a single point of failure in the whole system
• High Availability
  – The SG cluster can mask faults on the load balancer or back-end servers if there is sufficient redundancy. It can also keep the service available during a system upgrade
• Robust against denial-of-service attacks

A Typical Physical Wiring

[Figure: physical wiring diagram. Labels: Internet access, two high-speed switches, SG load balancer (primary), SG load balancer (backup), and a server pool with addresses 192.168.1.1 to 192.168.1.5.]

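Logical View

[Figure: logical view. User requests arrive at virtual servers on the SG load balancer. Labels: virtual server addresses 140.116.72.114:* and 140.116.72.115:23; back-end addresses 192.168.1.1:*, 192.168.1.2:*, 192.168.1.3:*, 192.168.1.2:23, 192.168.1.3:23, 192.168.1.4:23 and 192.168.1.5:23; other addresses 140.116.72.114 and 140.116.72.219.]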

Network Address Translation

• A technique that converts the IP address fields of an IP packet between a private IP address and a public IP address
• The same private IP addresses can be reused in different private networks at the same time, which saves public IP addresses
• Private IP ranges
  – 10.0.0.0-10.255.255.255
  – 172.16.0.0-172.31.255.255
  – 192.168.0.0-192.168.255.255
• Applications that embed IP addresses in protocol contents may have problems
• Private IP addresses are generally used by client-only hosts
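The three private ranges listed above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `is_private_ipv4` is an illustrative choice and not part of SG.

```python
import ipaddress

# The three private ranges listed on the slide, written as CIDR networks.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),       # 10.0.0.0-10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),    # 172.16.0.0-172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),   # 192.168.0.0-192.168.255.255
]


def is_private_ipv4(address: str) -> bool:
    """Return True if the address falls inside one of the three private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_NETWORKS)
```

For example, `is_private_ipv4("192.168.1.3")` is true for a back-end server address, while `is_private_ipv4("140.116.72.114")` is false for the virtual server address.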

The Operational Principle of NAT

[Figure: client-only hosts with private addresses (IPs1, IPs2, IPs3) reach Internet hosts (IPd1, IPd3) through a NAT device that owns the public address IP pub. A packet sent by the host with IPs1 carries (IPd1, IPs1, S-port1, D-port1) and leaves the NAT device as (IPd1, IP pub, S-port1, D-port1); likewise (IPd3, IPs3, S-port3, D-port3) becomes (IPd3, IP pub, S-port3, D-port3). The NAT table holds the entries (IPd1, D-port1, IPs1, S-port1) and (IPd3, D-port3, IPs3, S-port3).]
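The translation in the figure can be sketched as a small table lookup. This is an illustration under assumptions, not the NATD implementation: the class and field names are hypothetical, and ports are left unchanged as in the figure (a production NAT usually rewrites ports as well, to avoid collisions between private hosts).

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    dst_ip: str
    src_ip: str
    src_port: int
    dst_port: int


class Nat:
    """Toy NAT: rewrites the source address of outgoing packets to the public IP."""

    def __init__(self, public_ip: str) -> None:
        self.public_ip = public_ip
        # (remote ip, remote port, local port) -> private source ip
        self.table: Dict[Tuple[str, int, int], str] = {}

    def outbound(self, pkt: Packet) -> Packet:
        # Remember which private host opened this flow, then hide its address.
        self.table[(pkt.dst_ip, pkt.dst_port, pkt.src_port)] = pkt.src_ip
        return pkt._replace(src_ip=self.public_ip)

    def inbound(self, pkt: Packet) -> Optional[Packet]:
        # A reply comes from the remote host and is addressed to the public IP.
        private_ip = self.table.get((pkt.src_ip, pkt.src_port, pkt.dst_port))
        if private_ip is None:
            return None  # no matching flow: drop the packet
        return pkt._replace(dst_ip=private_ip)
```

An outgoing packet from 192.168.1.1 leaves with the public source address, and the matching reply is mapped back to 192.168.1.1; a packet with no table entry is dropped.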

The Overall Architecture

[Figure: the load balancer (140.116.72.219) runs NATD, mrouted, SGctrld, SGmon, SGcmd and bidd around the shared Server Group Properties. IP packets for the virtual server 140.116.72.114:* are mapped to the AP servers 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*. The AP servers are probed with "Alive?" checks and report back through the feedback protocol, optionally via SGhb. Bidd sends heartbeats to the other bidds for SG failover.]

The Major Components in SG Cluster

• Bidd
  – Used for the election of a new primary. Bidd on the primary load balancer generates heartbeats, and the Bidds on the backups monitor those heartbeats
  – Uses a bidding model: each server gives a price (a unique value) as its bid, and the server giving the highest price becomes the new primary
  – Fully symmetric: every node can have exactly the same configuration
  – Independent of the service being supported
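The bidding model can be sketched in a few lines. This is a minimal illustration, assuming each node's bid is already known as a unique integer; the function names, the node names and the timeout parameter are hypothetical, and the real Bidd message exchange is not shown.

```python
from typing import Dict, Optional


def elect_primary(bids: Dict[str, int]) -> Optional[str]:
    """Return the node that gave the highest price, or None if nobody bid."""
    if not bids:
        return None
    return max(bids, key=bids.get)


def primary_failed(last_heartbeat: float, now: float, timeout: float) -> bool:
    """A backup treats the primary as failed once heartbeats stop for too long."""
    return now - last_heartbeat > timeout
```

Because every node runs the same rule on the same set of unique bids, all nodes agree on the winner without any node being configured differently from the others.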

The Major Components in SG Cluster (continued)

• Server Group Properties
  – A block of shared memory accessed by the SG processes. It contains the membership, the load balancing policy, the properties of each server group, and statistical information for all servers
• NATD
  – The key component of the SG load balancer. It changes the IP address in the IP packet header based on the Server Group Properties
• mrouted
  – The SG cluster supports not only the "select one" model of service but also the "write all" model. A write request under the "write all" model is multicast to all servers in the server group. This modified mrouted supports multicast service in the SG cluster
• sgctrld
  – Provides an interface for processes outside the load balancer to modify the Server Group Properties. Processes use the feedback protocol to communicate with sgctrld and change the Server Group Properties. For example, an application server can feed its current load back to the load balancer for a specific load balancing policy
• sgcmd
  – A client of sgctrld that provides a command-line interface to the feedback protocol. It can be used from shell scripts or interactively
• sgmon
  – Normally, NATD can detect the failure of a server when the server does not respond to a client's request. But NATD cannot detect a failure if no request arrives at all, and it cannot detect the recovery of a dead server, since no request is sent to a dead server. Sgmon monitors the failure and recovery of servers by periodically sending requests to the application servers
• sghb
  – Optional. A small monitor process that runs on the application servers. Since not all server components are reachable over the network, sghb can monitor those quiet servers and generate heartbeats to the SG load balancer

Load Balancing

• Balancing Type
  – Whole server
  – A specific service port
• Balancing Policy
  – By round robin
  – By connection count
  – By packet traffic
  – By external counter: the application service program can define its own load metric and update it in this external counter
  – Weighted on any of the above counters
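The policies above can be sketched as selection functions over per-server counters. This is an illustration only: the `Server` fields are hypothetical, and dividing a counter by the server weight is one plausible reading of "weighted", not a documented SG formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    address: str
    connections: int = 0   # connection count
    packets: int = 0       # packet traffic
    external: int = 0      # load value reported by the application itself
    weight: float = 1.0    # larger weight = more capacity (assumption)


class RoundRobin:
    """Hand out servers in turn, ignoring their load."""

    def __init__(self) -> None:
        self._next = 0

    def pick(self, servers: List[Server]) -> Server:
        server = servers[self._next % len(servers)]
        self._next += 1
        return server


def pick_least(servers: List[Server], counter: str) -> Server:
    """Pick the server with the smallest weighted value of the given counter."""
    return min(servers, key=lambda s: getattr(s, counter) / s.weight)
```

With `counter="connections"` the function implements the connection-count policy; passing `"packets"` or `"external"` selects the other counter-based policies.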

Load Balancing (continued)

• Load balancing is done by selecting a target server when a link is created.
  – A link becomes active when a response packet is seen on the link
  – Once a link is active, its mapping is not changed until the link is closed or removed
  – If the target server of a link dies before the link becomes active, the load balancer remaps the link to another target server

Link creation

[Figure: a request from client 140.116.72.118:1029 to virtual server 140.116.72.114:23 (back ends 192.168.1.1:*, 192.168.1.2:*, 192.168.1.3:*) is mapped to 192.168.1.1:23, creating the link (140.116.72.118:1029, 140.116.72.114:23, 192.168.1.1:23).]
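The link rules above (select on creation, freeze once active, remap only while inactive) can be sketched as follows. The class and method names are hypothetical and the policy is passed in as a plain function; this is not the NATD link table itself.

```python
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class Link:
    def __init__(self, client: Endpoint, virtual: Endpoint, target: Endpoint) -> None:
        self.client, self.virtual, self.target = client, virtual, target
        self.active = False  # becomes True once a response packet is seen


class LinkTable:
    def __init__(self, choose: Callable[[List[Endpoint]], Endpoint]) -> None:
        self.choose = choose  # balancing policy (assumed to be given)
        self.links: Dict[Tuple[Endpoint, Endpoint], Link] = {}

    def on_request(self, client: Endpoint, virtual: Endpoint,
                   alive: List[Endpoint]) -> Link:
        key = (client, virtual)
        link = self.links.get(key)
        if link is None:
            # New link: this is the moment load balancing happens.
            link = Link(client, virtual, self.choose(alive))
            self.links[key] = link
        elif not link.active and link.target not in alive:
            # Target died before the link became active: remap it.
            link.target = self.choose(alive)
        return link

    def on_response(self, client: Endpoint, virtual: Endpoint) -> None:
        self.links[(client, virtual)].active = True

    def close(self, client: Endpoint, virtual: Endpoint) -> None:
        self.links.pop((client, virtual), None)
```

With the addresses from the figure, the first request creates the link (140.116.72.118:1029, 140.116.72.114:23, 192.168.1.1:23); after the first response the mapping stays fixed until `close` is called.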

Load Balancing (continued)

• The target server for a newly created link is chosen based on the balancing policy. Sometimes, however, two different links are related, and the packets from a particular client should be redirected to the same target server.
• Examples:
  – Port mapper: an RPC client asks the port mapper which port a specific service is bound to, then sends its request to that port
  – Squid: a Squid proxy uses ICP to query its neighbors and parent for a specific object, then uses HTTP to fetch that object from one of them on a cache hit
• "Keep Same Server" redirects packets to the same target server as long as any link from that client is still present in the SG internal link table.

Keep Same Server

Examples of Using "Keep Same Server"

• Port mapper: packets to the port mapper and packets to RPC server ZZ are related
  1. RPC server ZZ tells the port mapper: ZZ is on port 2345
  2. The RPC client asks the port mapper: which port is server ZZ on?
  3. The port mapper answers: server ZZ is on port 2345
  4. The RPC client sends its request to port 2345
• Squid: packets in ICP and packets in HTTP are related
  1. ICP request from one squid to another: do you have xx.html?
  2. ICP reply: yes
  3. HTTP: get xx.html
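The "Keep Same Server" rule can be sketched as a lookup by client address before the normal policy runs. The link-table layout and the function name are assumptions made for this example, and the first candidate stands in for whatever balancing policy is configured.

```python
from typing import Dict, List, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


def choose_target(client_ip: str,
                  links: Dict[Tuple[Endpoint, Endpoint], str],
                  candidates: List[str],
                  keep_same_server: bool = True) -> str:
    """Pick a target server IP for a new link from client_ip.

    links maps (client endpoint, virtual endpoint) to the target server IP
    of each link currently held in the table.
    """
    if keep_same_server:
        for (client, _virtual), target_ip in links.items():
            # Any surviving link from this client pins the new link too.
            if client[0] == client_ip and target_ip in candidates:
                return target_ip
    return candidates[0]  # stand-in for the configured balancing policy
```

In the port mapper case, the client's earlier link to port 111 pins its follow-up request for port 2345 to the same back-end server, which is the one that actually registered service ZZ.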

Read/Write Model Supported by SG

• ReadAny
  – For TCP/UDP read-only services
  – Data on every application server is identical
  – Uses unicast to forward requests
• ReadOne/WriteAll
  – For UDP read/write services
  – Data on every application server is identical
  – Uses multicast to forward requests
• ReadFirst/WriteAll
  – For UDP read/write services
  – Data may be partitioned across the application server cluster
  – Uses multicast to forward requests

ReadAny

[Figure: a request to the virtual server 140.116.72.114:* on the SG load balancer is forwarded to one of 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*.]

• Operation
  – When a connection is created, SG selects a real server to serve the request
• Benefit
  – TCP, UDP and ICMP are supported
  – No modification of the service program is required
• Requirement/Limitation
  – The data must be fully identical on all servers
  – If any data modification is required, it must be handled by a central database or file server on the back end

ReadOne/WriteAll

[Figure: a read request to the virtual server 140.116.72.114:* on the SG load balancer goes to one of 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*; a write request is sent to all of them through multicast group 234.116.72.114.]

• Operation
  – ReadOne
    • If no write is in progress, read any
    • If a write is in progress, switch to read preferred to guarantee a consistent view for clients
  – WriteAll
    • Multicast the write
    • Collect all replies
    • Any server replying failure is turned off immediately
    • Servers replying success are grouped by return value
    • Return the value that the majority of servers agree on

ReadOne/WriteAll (continued)

• Benefit
  – Supports both read and write operations
  – Application service programs do not have to care about the membership of the service group
• Requirement/Limitation
  – Data must be identical on all servers
  – Any session key (which uniquely identifies an application session, e.g. the transaction id in an RPC) generated by the service program must be deterministic
  – The service program needs a little modification
    • Join the multicast group at startup
    • When serving a write request, an application server has to set an IP option to represent the return status (SG uses this information to determine whether a write request succeeded or failed)
  – A new packet analyzer needs to be implemented to support protocols other than RPC-type services such as NFS
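The WriteAll reply handling (turn off failed servers, group the successful replies by return value, return the value most servers agree on) can be sketched as below. The reply format is an assumption for this example, and "majority" is implemented as the most common value among successful replies.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def resolve_write(replies: Dict[str, Tuple[bool, int]]
                  ) -> Tuple[Optional[int], List[str]]:
    """Decide the outcome of one multicast write.

    replies maps server address -> (success flag, return value).
    Returns (value to send back to the client or None, servers to turn off).
    """
    # Any server replying failure is turned off immediately.
    turned_off = sorted(addr for addr, (ok, _) in replies.items() if not ok)
    # Successful replies are grouped by their return value.
    groups = Counter(value for ok, value in replies.values() if ok)
    if not groups:
        return None, turned_off
    value, _count = groups.most_common(1)[0]
    return value, turned_off
```

The source does not say what happens to a server whose successful reply disagrees with the majority, so this sketch leaves such servers untouched.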

ReadFirst/WriteAll

[Figure: both read and write requests to the virtual server 140.116.72.114:* on the SG load balancer are sent to 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:* through multicast group 234.116.72.114.]

• Operation
  – ReadFirst
    • Multicast the read
    • Return the earliest reply to the client
  – WriteAll
    • Multicast the write
    • Collect all replies
    • Any server replying failure is turned off immediately
    • Servers replying success are grouped by return value
    • Return the reply that the majority of servers agree on

ReadFirst/WriteAll (continued)

• Benefit
  – Supports both read and write operations
  – Data can be partitioned across the server cluster
• Requirement/Limitation
  – The service program has to know the membership for job assignment
  – Any session key generated by a service program must be deterministic
  – The service program needs a little modification
    • Join the multicast group at startup
    • When serving a read request, an application server uses the membership information to determine which server is responsible for the request (an application server that is not responsible simply drops the request)
    • When serving a write request, an application server has to set an IP option to represent the return status (a server that is not responsible for the request just returns ok)
  – A new packet analyzer needs to be implemented to support protocols other than RPC-type services
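Under ReadFirst/WriteAll each application server decides from the membership whether a request is its job. The source does not describe the assignment rule, so the sketch below assumes a simple one: hash the request key and index into the sorted member list. All names are hypothetical.

```python
import zlib
from typing import List


def responsible_server(key: bytes, members: List[str]) -> str:
    """Map a request key to exactly one member, identically on every server."""
    ordered = sorted(members)  # same order everywhere, whatever the input order
    return ordered[zlib.crc32(key) % len(ordered)]


def should_answer_read(me: str, key: bytes, members: List[str]) -> bool:
    """True if this server answers the read, False if it silently drops it."""
    return responsible_server(key, members) == me
```

`zlib.crc32` is used instead of Python's built-in `hash` because it gives the same result in every process, which is what lets all servers reach the same decision independently.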

Mcast Service Support Routines

These routines are used by an application service program to set the status and return value of a mcast request in the IP option, which the load balancer inspects to determine whether a reply is successful or not (in the read/write models).

• int sock_joingroup(int sockfd, struct in_addr groupaddr, int ttl);
  – Joins sockfd to the multicast group groupaddr
  – Used after the creation of the server socket
• int prepare_ipopt_mcast(u_short type, int retval);
  – Stores the return type and return value in a global variable
  – Used before returning from a write function of a mcast service
• int sock_set_ipopt_mcast(int sockfd);
  – Sets the IP option with the value stored by prepare_ipopt_mcast
  – Used before sending the reply
• int sock_clear_ipopt_mcast(int sockfd);
  – Clears the IP option
  – Used after sending the reply

Packet Analyzer API

A packet analyzer is used by NATD to parse the request and reply packets of a multicast service, return their unique ids, and check whether a request is a write or not. A designer (for read/write request models) should implement the following API, which is called by NATD:

• int mcast_init_xxxx(void);
  – Initializes the internal data structures
• int mcast_check_port_xxxx(u_short port);
  – Returns whether the service is located on a special port
• int mcast_check_request_xxxx(struct ip *pip, int *id, int *rwmode);
  – Validates the structure of a request packet
  – Gets the unique id and read/write mode of the request
• int mcast_check_reply_xxxx(struct ip *pip, int *id);
  – Validates the structure of a reply packet
  – Gets the unique id of the reply
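What a packet analyzer for an RPC-type service has to extract can be sketched as follows. This is not the SG code: the real API is C and receives a `struct ip *`, whereas this sketch works on the UDP payload only. It relies on the ONC RPC message layout (xid, message type, RPC version, program, version, procedure as big-endian 32-bit words), and the set of procedure numbers treated as writes is an assumption (8 is WRITE in NFS version 2).

```python
import struct
from typing import Optional, Tuple

RPC_CALL, RPC_REPLY = 0, 1        # ONC RPC message types
READ_MODE, WRITE_MODE = 0, 1      # hypothetical rwmode values

# Assumption: procedure numbers that modify data.
WRITE_PROCEDURES = {8}


def check_request(payload: bytes) -> Optional[Tuple[int, int]]:
    """Validate an RPC call and return (unique id, rwmode), or None if invalid."""
    if len(payload) < 24:
        return None
    xid, msg_type, rpcvers, _prog, _vers, proc = struct.unpack("!6I", payload[:24])
    if msg_type != RPC_CALL or rpcvers != 2:
        return None
    mode = WRITE_MODE if proc in WRITE_PROCEDURES else READ_MODE
    return xid, mode


def check_reply(payload: bytes) -> Optional[int]:
    """Validate an RPC reply and return its unique id, or None if invalid."""
    if len(payload) < 8:
        return None
    xid, msg_type = struct.unpack("!2I", payload[:8])
    return xid if msg_type == RPC_REPLY else None
```

The xid returned for a request is the same value the matching replies carry, which is what allows replies from several servers to be collected and compared for one write.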

Feedback Protocol

The feedback protocol is designed for updating the group or server properties stored in the shared memory of the load balancer.

• UDP based
• Command message fields: id, handle, class, op, group, server, property, datalen, data…
• Result message fields: id, datalen, data…, status
• A library, libsgmsg.a, is available for application server developers and eases the use of the feedback protocol
• An executable, sgcmd, is available for system administrators; it can be used from shell scripts, so existing applications can use the feedback protocol too
• A web interface to the feedback protocol is also available for interactive administration
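A command message with the fields named above could be serialized as in the sketch below. Only the field names come from the slide; the field widths (32-bit unsigned, network byte order), their exact order on the wire, and all Python names are assumptions, so this does not interoperate with sgctrld as written.

```python
import struct
from typing import NamedTuple

# Assumed header: id, handle, class, op, group, server, property, datalen.
HEADER = struct.Struct("!8I")


class Command(NamedTuple):
    msg_id: int
    handle: int
    msg_class: int
    op: int
    group: int
    server: int
    prop: int
    data: bytes


def pack_command(cmd: Command) -> bytes:
    """Serialize the header followed by the variable-length data."""
    header = HEADER.pack(cmd.msg_id, cmd.handle, cmd.msg_class, cmd.op,
                         cmd.group, cmd.server, cmd.prop, len(cmd.data))
    return header + cmd.data


def unpack_command(message: bytes) -> Command:
    """Parse a message produced by pack_command."""
    *fields, datalen = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + datalen]
    if len(data) != datalen:
        raise ValueError("truncated feedback message")
    return Command(*fields, data)
```

Since the protocol is UDP based, a packed message would be sent to the load balancer as a single datagram, for instance with `socket.sendto`.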

Fault Tolerance Support

• Fault detection
  – Packet snoop
  – Port test
  – Heartbeat monitor
  – Multicast write result comparison
• Fault recovery
  – Recovery happens on the real server, so the SG system simply waits for the recovery to complete
• Recovery detection
  – Packet snoop
  – Port test
  – Heartbeat monitor
  – Triggered by the server through the feedback protocol

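Server status transition

[Figure: state diagram with three states: Alive, Pending and Dead.]

• Fault transitions at the first threshold (toward Pending)
  – Keyport pkt delta in > P && timeout > T
  – heartbeat timeout > H
  – sgmon_porttest_error > E
  – mcast_error > M
• Fault transitions at the doubled threshold (toward Dead)
  – Keyport pkt delta in > P && timeout > 2T
  – heartbeat timeout > 2H
  – sgmon_porttest_error > 2E
  – mcast_error > 2M
• Recovery transitions
  – Keyport pkt responded
  – sgmon_porttest ok
  – heartbeat received
  – User recovery
• Thresholds
  – P (labeled D in the figure legend): packet delta threshold
  – T: response timeout threshold
  – E: porttest_error_threshold
  – H: heartbeat timeout threshold
  – M: mcast_error_threshold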

Server status transition (continued)

• A server has three states: Alive, Pending or Dead.
• Various fault/recovery detection mechanisms are used in the SG system. The server status is calculated by sorting all fault/recovery events by timestamp; the latest event decides the result.
• Candidates for load balancing selection
  – Alive: the default candidates
  – Pending: candidates if no alive server is available
  – Dead: candidates if all servers are dead
• Why a Pending state?
  – A server that does not respond to a client's request or a monitor's test may have crashed, or may be busy serving others under heavy load. We first put such a server into the Pending state rather than the Dead state, expecting it to come back later.
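The two rules above (the latest event decides the status, and candidates are taken from the best non-empty class) can be sketched as follows. The event representation is hypothetical, and treating a server with no events as Alive is an assumption.

```python
from typing import Dict, List, Tuple

ALIVE, PENDING, DEAD = "alive", "pending", "dead"


def current_status(events: List[Tuple[float, str]]) -> str:
    """events holds (timestamp, resulting state); the latest event wins."""
    if not events:
        return ALIVE  # assumption: no fault has been observed yet
    return max(events, key=lambda event: event[0])[1]


def candidates(status: Dict[str, str]) -> List[str]:
    """Alive servers first; Pending if none is alive; Dead if all are dead."""
    for wanted in (ALIVE, PENDING, DEAD):
        chosen = sorted(addr for addr, state in status.items() if state == wanted)
        if chosen:
            return chosen
    return []
```

Falling back to Pending and then Dead servers means the load balancer always forwards a request somewhere while the group has members, which fits the idea that a non-responding server may only be busy.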

Fault Recovery

• The load balancer does not handle the recovery of application servers
  – Recovery happens on the real server, so the SG load balancer simply waits for the recovery to complete
  – The recovering server should not respond to requests before the recovery is done, since fault detection in SG targets fail-stop faults
• The group should be in read-only mode during state transfer
  – For a read-only service, the dead server can do the state transfer from an alive server directly
  – For a readone/writeall service, the dead server should switch the server group to read-only mode before the state transfer and switch it back to read-write mode when the transfer is complete

Denial-of-Service Attack

Attacks on Unix systems typically come in two ways:

• Process saturation
  – Some servers have a limit on the number of TCP connections they can handle and stop responding to clients once this limit is reached
  – Servers that use fork() to handle new connections consume system resources (e.g. the process table)
• Mbuf exhaustion
  – The mbufs of a connection are not released while the connection stays in the FIN_WAIT_1 state
  – A BSD machine has only 1536 mbufs when maxuser=64
  – A Linux machine has no such limit, but since mbufs and mbuf clusters are non-pageable, a malicious client can lock a large amount of physical memory away from others

Protection against DoS Attack

• Per-client limits
  – Max connections
  – Max connection rate
  – Max TCP connections in the FIN_WAIT_1 state
  – Any client exceeding one of these limits is denied new connections. The deny interval can be specified by the SG admin.
• Per-server ACL
  – Allows or denies a client's requests based on its IP/subnet address
  – Servers in the same group can have different ACLs to provide differentiated service to different clients
  – Example: reserving the best computer in a group for internal use in a computing cluster
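The per-client limit with a deny interval can be sketched as below. Only the max-connections limit is shown; the connection-rate and FIN_WAIT_1 limits would follow the same pattern. Class and parameter names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Dict


class ClientLimiter:
    """Deny new connections from clients that exceed a connection limit."""

    def __init__(self, max_connections: int, deny_interval: float) -> None:
        self.max_connections = max_connections
        self.deny_interval = deny_interval      # seconds, set by the admin
        self.open: Dict[str, int] = {}          # client ip -> open connections
        self.denied_until: Dict[str, float] = {}

    def allow(self, client_ip: str, now: float) -> bool:
        if now < self.denied_until.get(client_ip, 0.0):
            return False                        # still inside the deny interval
        if self.open.get(client_ip, 0) >= self.max_connections:
            # Limit broken: start denying this client for a while.
            self.denied_until[client_ip] = now + self.deny_interval
            return False
        self.open[client_ip] = self.open.get(client_ip, 0) + 1
        return True

    def closed(self, client_ip: str) -> None:
        if self.open.get(client_ip, 0) > 0:
            self.open[client_ip] -= 1
```

A client that breaks the limit stays denied for the whole interval even if it closes connections in the meantime, while other clients are unaffected.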

A distributed NFS file server cluster based on SG

• UDP service
• Based on synchronous RPC
• ReadOne/WriteAll
• Modification
  – Make the file handle from the pathname; this guarantees that the same handle maps to the same file on different servers
  – After server socket creation: sock_joingroup(int sockfd, struct in_addr groupaddr, int ttl)
  – At the end of each write function: prepare_ipopt_mcast(MCAST_SUCCESS, return_value)
  – Before sending the reply: sock_set_ipopt_mcast(int sockfd)
  – After sending the reply: sock_clear_ipopt_mcast(int sockfd)
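The first modification, a file handle derived only from the pathname, can be sketched as follows. The slide does not say how the handle is computed; hashing the path with SHA-256 is an assumption chosen because it yields exactly the 32 bytes of an NFS version 2 file handle.

```python
import hashlib

FHSIZE = 32  # size in bytes of an NFS version 2 file handle


def filehandle_from_path(path: str) -> bytes:
    """Derive a handle that is identical on every server exporting this path."""
    return hashlib.sha256(path.encode("utf-8")).digest()[:FHSIZE]
```

A handle built from server-local data such as an inode number would differ between replicas, so a handle returned by one server could not be used in a write that is multicast to all of them. A real server would also need a table mapping handles back to paths, which is omitted here.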

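Performance Test

[Figure: NFS write test. A client at 140.116.72.128 on a 100 Mb/s LAN writes through the SG load balancer (virtual server 140.116.72.114:*, multicast group 234.116.72.114) to 192.168.1.2:*, 192.168.1.3:* and 192.168.1.4:*. Throughputs shown: 581.44 K/s, 559.77 K/s, 421.13 K/s and 373.40 K/s.]

NFS write efficiency: 373.40/421.13 = 88.66%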

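Performance Test

[Figure: ping echo test. A client at 140.116.72.128 on a 100 Mb/s LAN pings through the SG load balancer (140.116.72.115:* mapped to 192.168.1.1:*). Times shown: 0.489 ms, 0.293 ms and 0.891 ms.]

Ping echo efficiency: (0.293 + 0.489)/0.891 = 87%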

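Performance Test

[Figure: FTP download test. A client at 140.116.72.128 on a 100 Mb/s LAN downloads through the SG load balancer (140.116.72.115:* mapped to 192.168.1.1:*). Throughputs shown: 2.67 MB/s, 4.33 MB/s and 2.24 MB/s.]

FTP download efficiency: 2.24/2.67 = 83.89%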

Some Other Application Examples

• Web Proxy Server Cluster
• Web Server Cluster
• Telnet Server Cluster
• Mail Server Cluster

Proxy Server Cluster

[Figure: requests to the virtual proxy server 140.116.72.114:* on the SG load balancer are forwarded to the proxy servers 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*, each with its own cache.]

• Configuration
  – Each proxy server has its own disk for the cache pool
  – Each proxy server sets the others as its siblings
• Data Consistency
  – Each proxy server uses ICP to query objects in the others' cache pools and fetches an object from another server if needed

Web Server Cluster

[Figure: requests to the virtual web server 140.116.72.114:* on the SG load balancer are forwarded to the web servers 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*, each with its own disk; a DB server and an NFS server sit behind them.]

• Configuration
  – Each web server has its own disk to store static data (web pages, images)
  – A common DB server and NFS server on the back end store dynamic data (customer input, session…)
• Data Consistency
  – Multiple copies of static data are maintained by administrators
  – There is only one copy of dynamic data, on the central DB/NFS server, so no maintenance is required

Telnet Server Cluster

[Figure: requests to the virtual telnet server 140.116.72.114:* on the SG load balancer are forwarded to 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*; an NFS server (disk) and an NIS server (accounts) sit behind them.]

• Configuration
  – An NIS server is used to store user accounts
  – An NFS server is used to store the user home directories and the mail spool (/var/mail)
• Data Consistency
  – There is only one copy of user data/mail, so no maintenance is required

Mail Server Cluster

[Figure: requests to the virtual mail server 140.116.72.114:* on the SG load balancer are forwarded to 192.168.1.1:*, 192.168.1.2:* and 192.168.1.3:*, each running Sendmail; an NFS server (disk) and an NIS server (accounts) sit behind them.]

• Configuration
  – An NIS server is used to store user accounts
  – An NFS server is used to store the user home directories and the mail spool (/var/mail)
  – The Sendmail daemon on each server must accept mail addressed to the virtual mail server
  – The Sendmail daemon on each server masquerades each outgoing mail as if it were sent from the virtual mail server

Mail Server Cluster (continued)

• Data Consistency
  – There is only one copy of user data/mail, so no maintenance is required
• Sendmail Setup
  – Accept mail addressed to the virtual mail server
    • Search sendmail.cf for a line like Fw-o /etc/mail/sendmail.cw
    • Add the hostname of the virtual mail server to sendmail.cw
  – Masquerade outgoing mail as if it were sent from the virtual mail server
    • Search sendmail.cf for a line containing only 'DM'
    • Change the line to 'DM you.virtual.mail.server'

Epilogue

• A free, working clustering system; all required binaries fit on a 1.4M floppy (http://turtle.ee.ncku.edu.tw/sgcluster/)
• Some good features
  – Load balancing with various policies
  – Fault tolerance support for both the application servers and the load balancer
  – Support for the readany, readone/writeall and readfirst/writeall models
  – Enables quick development of applications for load-balancing and high-availability clusters
  – The bidding algorithm supports the election of a primary server in a symmetrical way. It is used for the fault tolerance of the load balancer
  – Flexibility in application cluster configuration
  – Denial-of-service protection and access control
  – The feedback protocol permits customized policy control and administration
• The SG cluster has been used by the Tainan City Education Network Center (台南市教育網路中心) for one year to support a proxy cluster
