
Page 1:

June 18, 2002 http://www.cs.northwestern.edu/~donglu


Communication Networks of Parallel & Distributed Systems: Low Latency & High Bandwidth comes to clusters & grids

Dong Lu

Dept. of Computer Science

Northwestern University

Page 2:

Introduction

Communication networks play a vital role in parallel & distributed systems.

Modern communication networks support low latency & high bandwidth communication services.

Page 3:

How is low latency & high bandwidth achieved?

DMA-based zero copy and OS bypass, which give applications direct access to the Network Interface Card.

Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter. Often, TCP/IP is not used.

Switched networks, or hypercubes with high-speed routers, that support cut-through routing.

Page 4:

TREND

These technologies are migrating from inside parallel systems to clusters. Example: the InfiniBand Architecture.

Lower latency & higher bandwidth communication networks are becoming available to Grid computing. Example: high-speed optical networking is becoming dominant in the Internet.

Page 5:

Outline of this talk

Communication networks in parallel systems.

Communication networks in Clusters and System Area Networks.

New and improved Communication network protocols & technologies in Grid computing.

Trend: Low latency & High bandwidth comes to clusters & the Grid.

Page 6:

Communication networks in parallel systems

IBM SP2

SGI Origin 2000

Page 7:

IBM SP2

Any-to-any packet-switched, multistage network. Excellent scalability.

The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing.

The adapter can move messages to and from processor memory directly via direct memory access (DMA), and thus supports zero-copy message passing.

Page 8:

64-node IBM SP2 network topology

Page 9:

SGI Origin 2000

Origin is a distributed shared memory, cc-NUMA multiprocessor; cc-NUMA stands for cache-coherent non-uniform memory access.

Hypercube network connected by SPIDER routers, which support wormhole "cut-through" routing: a router starts forwarding a packet before it has received the whole packet, in contrast to "store and forward".

Low latency remote memory access is supported, and the ratio of remote to local memory latency is very low.

Page 10:

128-processor SGI Origin 2000 network

Page 11:

Communication Networks in Clusters

Gigabit Ethernet

Myrinet

Virtual Interface Architecture

InfiniBand Architecture

Page 12:

Gigabit Ethernet

Can be switch based --- higher bandwidth and a smaller collision domain.

Jumbo frames are supported, up to 9 KB.

Some NICs are programmable, with an onboard processor. Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, which makes Gigabit Ethernet a low-cost, low-latency, high-bandwidth architecture!
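As an illustration of the jumbo-frame point: on a Linux host, jumbo frames are enabled by raising the interface MTU. A minimal sketch, assuming an interface named eth0 and root privileges; `ifconfig eth0 mtu 9000` does the same thing from the shell.

/* Sketch: request a 9000-byte MTU (jumbo frames) on "eth0" via SIOCSIFMTU.
 * The interface name is an assumption; the NIC and switch must support it too. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_mtu = 9000;                      /* jumbo frame size */

    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
        perror("SIOCSIFMTU");
    close(fd);
    return 0;
}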

Page 13:

High performance GigaE Architecture

EMP system, which appeared in HPDC 2001.

Page 14:

Myrinet

Developed from the technology of a parallel system --- the Intel Paragon.

The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor.

Switch based; cut-through routing is supported.

Page 15:

Myrinet

LANai is the host interface that has a processor and DMA engine onboard.

High bandwidth & low latency, but very expensive and not very stable.

Page 16:

Myrinet

Page 17:

Virtual Interface Architecture

Supports zero copy and OS bypass to provide low-latency and high-bandwidth communication service. Message send/receive operations and Remote DMA are supported.

To a user process, VIA provides direct access to the network interface in a fully protected fashion.

Page 18:

Remote DMA

Page 19:

Virtual Interface Architecture

Each process owns a VI, and each VI consists of one send queue and one receive queue. Memory regions are registered and connections are established with Open/Connect operations before data transfer. After the connection and memory registration, user data can be transferred without involving the operating system. Memory protection is provided by a protection-tag mechanism: protection tags are associated with VIs and memory regions.
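A hedged sketch of that setup sequence, in C-like pseudocode. The vip_* names and handle types are hypothetical stand-ins introduced only to make the order of operations concrete; they are not the literal VIA (VIPL) API.

/* Hypothetical handle types: a VI (one send queue + one receive queue) and a
 * registered, protection-tagged memory region. */
typedef struct vi  *vi_handle_t;
typedef struct mem *mem_handle_t;

/* Hypothetical wrappers; declarations only, to show the order of operations. */
vi_handle_t  vip_create_vi(void);
void         vip_connect(vi_handle_t vi, const char *remote);
mem_handle_t vip_register_mem(void *buf, long len, int protection_tag);
void         vip_post_send(vi_handle_t vi, void *buf, long len, mem_handle_t mh);
void         vip_wait_send_done(vi_handle_t vi);

void via_send_example(void)
{
    static char buf[4096];

    vi_handle_t  vi = vip_create_vi();               /* each process owns its VI          */
    vip_connect(vi, "remote-node");                  /* open/connect before any transfer  */
    mem_handle_t mh = vip_register_mem(buf, sizeof buf, 7 /* protection tag */);

    vip_post_send(vi, buf, sizeof buf, mh);          /* enqueue a send descriptor         */
    vip_wait_send_done(vi);                          /* completes without entering the OS */
}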

Page 20:

Virtual Interface Architecture

Page 21:

InfiniBand Architecture

Page 22:

InfiniBand Architecture

Encompasses a system-area network for connecting multiple independent processor and I/O platforms.

Defines the communication and management infrastructure supporting both I/O and inter-processor communications.

Page 23:

InfiniBand Architecture

Components: a host channel adapter (HCA), a target channel adapter (TCA), and a fabric switch. The channel adapters offload protocol processing from the CPU. DMA/RDMA is supported.

Zero-copy data transfers proceed without kernel involvement, and hardware is used to provide highly reliable, fault-tolerant communication.

Page 24:

Communication Networks in the GRID

IPv6

High performance TCP Reno

TCP tuning for distributed applications on the WAN

TCP Vegas vs. TCP Reno

Random Early Detection gateways

Aggressive TCP Reno: What I have done on Linux kernel

Page 25:

IPv6

Expanded addressing capabilities: 128 bits vs. 32 bits in IPv4.

Flow labeling capability: good news for real-time and high-performance applications.

Header format simplification.

Improved support for extensions and options.

Authentication and privacy capabilities.

Page 26:

IPv6 header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |              Flow Label               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Payload Length         |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Source Address (128 bits)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Destination Address (128 bits)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Page 27:

High performance TCP Reno (RFC1323)

TCP extension for high performance.

TCP performance depends not upon the transfer rate itself, but rather upon the "bandwidth * delay product", which is growing quickly and is now much bigger than 64 KB.

The TCP header uses a 16-bit field to report the receive window size to the sender, so the largest window that can be used is about 2**16 = 64 KB.

Page 28:

High performance TCP Reno (RFC1323)

A TCP option, "Window Scale", is adopted to allow windows larger than 2**16 bytes.

However, a high transfer rate alone can threaten TCP reliability by violating the assumptions behind the TCP mechanism for duplicate detection and sequencing: any sequence number may eventually be reused, and errors may result from an accidental reuse of TCP sequence numbers in data segments.
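The arithmetic behind both points, as a small illustrative program (the 1 Gb/s rate is only an assumed example, not a figure from the slides):

#include <stdio.h>

int main(void)
{
    double max_plain  = 65535.0;                 /* largest 16-bit advertised window       */
    double max_scaled = 65535.0 * (1 << 14);     /* RFC 1323 limits the scale shift to 14  */
    double seq_space  = 4294967296.0;            /* 2^32 bytes of sequence-number space    */
    double rate       = 1e9 / 8.0;               /* assume a 1 Gb/s path, in bytes/second  */

    printf("max window, no scaling:  %.0f bytes (~64 KB)\n", max_plain);
    printf("max window, scale = 14:  %.0f bytes (~1 GB)\n", max_scaled);
    printf("sequence space wraps in: %.1f s at 1 Gb/s\n", seq_space / rate);
    return 0;
}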

Page 29:

High performance TCP Reno (RFC1323)

The PAWS (Protect Against Wrapped Sequence numbers) mechanism is proposed to avoid this potential problem.

Page 30:

TCP tuning for distributed applications

TCP uses the congestion window size to control how many packets it sends into the network; the send & receive buffer sizes, together with the network congestion status, determine the congestion window size.

Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux uses only 8 KB).

Page 31:

TCP tuning for distributed applications

Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/s), and the typical latency across the US is about 25 ms. Then bandwidth * delay = 12 MB/s * 25 ms = 300 KB. If the default 24 KB is used as the TCP buffer, 24/300 = 8%, so only a small portion of the bandwidth is used!

Buffer size = 2 * bandwidth * delay, or equivalently Buffer size = bandwidth * RTT.
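A minimal sketch of how an application would apply this rule, using the slide's example numbers (100 Mb/s bottleneck, 25 ms one-way delay); the OS may silently clamp the request to its per-socket maximum:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    double bandwidth = 100e6 / 8.0;                      /* 100 Mb/s ~= 12.5 MB/s            */
    double delay     = 0.025;                            /* 25 ms one-way delay              */
    int    bufsize   = (int)(2.0 * bandwidth * delay);   /* = bandwidth * RTT ~= 625 KB      */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));

    printf("requested TCP buffer size: %d bytes\n", bufsize);
    return 0;
}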

Page 32:

TCP Vegas vs. TCP Reno

Researchers have shown that aggregate network traffic can be characterized as self-similar, or fractal, which is usually bad for Internet performance. Several researchers claim that the primary source of this self-similarity is TCP Reno's "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism.

Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion.
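For reference, Reno-style AIMD in its simplest form looks like the following sketch (illustrative only, not any kernel's actual code):

/* cwnd is the congestion window measured in segments. */
static double cwnd = 1.0;

/* Additive increase: roughly one extra segment per round-trip time. */
void on_ack(void)  { cwnd += 1.0 / cwnd; }

/* Multiplicative decrease: halve the window when loss is detected. */
void on_loss(void) { cwnd /= 2.0; }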

Page 33:

TCP Vegas

Vegas has two threshold values, A and B; the default values are A=1, B=3. ESR is the expected sending rate and ASR is the actual sending rate. Let diff = ESR - ASR.

If diff < A, increase the congestion window linearly during the next round-trip time.

If diff > B, decrease the window linearly during the next RTT.

Otherwise, don't change the congestion window size.
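A sketch of that update rule, applied once per RTT; ESR and ASR are assumed to be measured elsewhere, and the units follow the slide rather than the original Vegas paper:

#define VEGAS_A 1.0   /* lower threshold (slide's default) */
#define VEGAS_B 3.0   /* upper threshold (slide's default) */

/* Returns the congestion window to use for the next round-trip time. */
double vegas_update(double cwnd, double esr, double asr)
{
    double diff = esr - asr;          /* expected minus actual sending rate  */

    if (diff < VEGAS_A)
        return cwnd + 1.0;            /* little data queued: grow linearly   */
    if (diff > VEGAS_B)
        return cwnd - 1.0;            /* queue building up: shrink linearly  */
    return cwnd;                      /* between A and B: leave it alone     */
}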

Page 34:

TCP Vegas vs. TCP Reno

Some researchers show that with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment.

The problem with Vegas is that it has not been verified on a large-scale network, and the optimal values of A and B are not easy to determine.

Page 35:

Random Early Detection gateways

RED gateways maintain a weighted average of the queue length, a minimum and maximum threshold (REDmin, REDmax), and an early drop rate P. Packets are then queued as follows:

If (queue length < REDmin), queue all packets.

If (REDmin < queue length < REDmax), drop packets with probability P.

If (queue length > REDmax), drop all packets.

RED can increase fairness and overall network performance, so it is widely deployed in Internet routers. Since the Grid is built on top of the Internet, the effect of RED routers should be considered.
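A sketch of that drop decision; the weighted-average constant and the concrete threshold and probability values are illustrative assumptions, since the slide only names the parameters:

#include <stdlib.h>

#define RED_MIN   5.0     /* REDmin, in packets            */
#define RED_MAX  15.0     /* REDmax, in packets            */
#define RED_P     0.1     /* early drop probability P      */
#define RED_W     0.002   /* weight for the moving average */

static double avg_qlen = 0.0;

/* Called on every packet arrival; returns 1 if the packet should be dropped. */
int red_should_drop(int queue_length)
{
    avg_qlen = (1.0 - RED_W) * avg_qlen + RED_W * queue_length;

    if (avg_qlen < RED_MIN)
        return 0;                                   /* short queue: accept everything */
    if (avg_qlen > RED_MAX)
        return 1;                                   /* long queue: drop everything    */
    return ((double)rand() / RAND_MAX) < RED_P;     /* in between: early random drop  */
}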

Page 36:

Aggressive TCP Reno: What I have done

Linux kernel modification on TCP congestion control.

Some studies have shown that TCP Reno congestion control is too conservative, so the bandwidth is not fully utilized.

So, make it more aggressive. How?

Page 37:

Aggressive TCP Reno: What I have done

The window size starts from more than one packet (for example 20) and increases more quickly during "slow start".

"Congestion avoidance" is unchanged.

Whenever there is a packet loss, don't drop to one packet; instead, drop to 80% of the window size, and the new threshold becomes 90% of the current window size.
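A sketch of the modified behaviour in plain C (illustrative only, not the actual kernel patch):

struct tcp_state {
    double cwnd;       /* congestion window, in segments */
    double ssthresh;   /* slow-start threshold           */
};

/* Start slow start from a larger initial window (20 segments instead of 1). */
void aggressive_init(struct tcp_state *tp)
{
    tp->cwnd     = 20.0;
    tp->ssthresh = 65535.0;
}

/* On packet loss, back off gently instead of collapsing to one segment. */
void aggressive_on_loss(struct tcp_state *tp)
{
    tp->ssthresh = 0.90 * tp->cwnd;   /* new threshold: 90% of the current window */
    tp->cwnd     = 0.80 * tp->cwnd;   /* keep 80% of the window                   */
}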

Page 38:

Aggressive TCP Reno: What I have done

Built into Linux kernel

(Figure: TCP Reno vs. aggressive TCP)

Page 39:

Some Performance gains for Virtualized Audio

Frequency (computational workload)        1000      1600      2000

Run time on modified kernel              169.9     166.5     169.3
(aggressive TCP)                        1035.5    1032.0    1035.8
                                        2187.9    2134.8    2130.2

Run time on unmodified kernel            182.9     185.3     183.1
                                        1171.7    1161.8    1159.7
                                        2207.2    2206.4    2207.7

Page 40:

Aggressive TCP Reno

Through the kernel modification, some performance gains are achieved without modifying the application code.

But the results are still not very satisfactory and not very stable. Why?

Page 41:

Aggressive TCP Reno

That can be due to three reasons. First, virtual audio is very computationally intensive, so improving communication performance even drastically will not change the overall performance much (most of the time is spent computing). Second, the bandwidth*delay product on the cluster is small, which implies that this technique may be more effective on the WAN (with much bigger RTTs). Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important.

Page 42:

Conclusion

Some technologies used by parallel systems are moving into clusters, making low latency and high bandwidth available there.

With the development of the Internet, new & improved protocols are being proposed and tested to provide lower latency & higher bandwidth for Grid computing.

A new proposal for aggressive TCP has been implemented and tested, and some performance gains were achieved. More work is needed to make it more effective and stable.

Page 43:

Questions?