
Page 1:

June 18, 2002 http://www.cs.northwestern.edu/~donglu


Communication Networks of Parallel & Distributed Systems: Low Latency & High Bandwidth comes to clusters & grids

Dong Lu

Dept. of Computer Science

Northwestern University

Page 2:

Introduction

Communication networks play a vital role in parallel & distributed systems.

Modern communication networks support low latency & high bandwidth communication services.

Page 3:

How is low latency & high bandwidth achieved?

DMA-based zero copy and OS bypass, which give applications direct access to the Network Interface Card.

Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter. Often, TCP/IP is not used.

Switched networks, or hypercubes with high-speed routers, that support cut-through routing.

Page 4:

TREND

These technologies are migrating from inside parallel systems to clusters. Example: the InfiniBand Architecture.

Lower latency & higher bandwidth communication networks are becoming available to Grid computing. Example: high-speed optical networking is becoming dominant in the Internet.

Page 5:

Outline of this talk

Communication networks in parallel systems.

Communication networks in Clusters and System Area Networks.

New and improved Communication network protocols & technologies in Grid computing.

Trend: Low latency & High bandwidth comes to clusters & the Grid.

Page 6:

Communication networks in parallel systems

IBM SP2

SGI Origin 2000

Page 7:

IBM SP2

Any-to-any packet-switched, multistage network. Excellent scalability.

The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing.

The adapter can move messages to and from processor memory directly via direct memory access (DMA), and thus supports zero-copy message passing.

Page 8:

64-node IBM SP2 network topology

Page 9:

SGI Origin 2000

Origin is a distributed shared memory, cc-NUMA multiprocessor; cc-NUMA stands for cache-coherent non-uniform memory access.

Hypercube network connected by SPIDER routers, which support wormhole "cut-through" routing: a router starts forwarding a packet before it has received the whole packet, in contrast to "store and forward".

Low latency remote memory access is supported, and the ratio of remote to local memory latency is very low.

Page 10:

128-processor SGI Origin 2000 network

Page 11:

Communication Networks in Clusters

Gigabit Ethernet

Myrinet

Virtual Interface Architecture

InfiniBand Architecture

Page 12:

Gigabit Ethernet

Can be switch based --- higher bandwidth and a smaller collision domain.

Jumbo frames are supported, up to 9 KB.

Some NICs are programmable, with an onboard processor. Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, which makes Gigabit Ethernet a low-cost, low-latency, high-bandwidth architecture!
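As an illustration of the jumbo-frame point: on a Linux host, jumbo frames are enabled by raising the interface MTU. A minimal sketch, assuming an interface named eth0 and root privileges; `ifconfig eth0 mtu 9000` does the same thing from the shell.

/* Sketch: request a 9000-byte MTU (jumbo frames) on "eth0" via SIOCSIFMTU.
 * The interface name is an assumption; the NIC and switch must support it too. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_mtu = 9000;                      /* jumbo frame size */

    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
        perror("SIOCSIFMTU");
    close(fd);
    return 0;
}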

Page 13:

High performance GigaE Architecture

EMP system, which appeared in HPDC 2001.

Page 14:

Myrinet

Developed from the technology of a parallel system --- the Intel Paragon.

The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor.

Switch based; cut-through routing is supported.

Page 15:

Myrinet

LANai is the host interface that has a processor and DMA engine onboard.

High bandwidth & low latency, but very expensive and not very stable.

Page 16:

Myrinet

Page 17:

Virtual Interface Architecture

Supports zero copy and OS bypass to provide low-latency and high-bandwidth communication service. Message send/receive operations and Remote DMA are supported.

To a user process, VIA provides direct access to the network interface in a fully protected fashion.

Page 18:

Remote DMA

Page 19:

Virtual Interface Architecture

Each process owns a VI, and each VI consists of one send queue and one receive queue. Memory regions are registered and connections are established with Open/Connect operations before data transfer. After the connection and memory registration, user data can be transferred without involving the operating system. Memory protection is provided by a protection-tag mechanism: protection tags are associated with VIs and memory regions.
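A hedged sketch of that setup sequence, in C-like pseudocode. The vip_* names and handle types are hypothetical stand-ins introduced only to make the order of operations concrete; they are not the literal VIA (VIPL) API.

/* Hypothetical handle types: a VI (one send queue + one receive queue) and a
 * registered, protection-tagged memory region. */
typedef struct vi  *vi_handle_t;
typedef struct mem *mem_handle_t;

/* Hypothetical wrappers; declarations only, to show the order of operations. */
vi_handle_t  vip_create_vi(void);
void         vip_connect(vi_handle_t vi, const char *remote);
mem_handle_t vip_register_mem(void *buf, long len, int protection_tag);
void         vip_post_send(vi_handle_t vi, void *buf, long len, mem_handle_t mh);
void         vip_wait_send_done(vi_handle_t vi);

void via_send_example(void)
{
    static char buf[4096];

    vi_handle_t  vi = vip_create_vi();               /* each process owns its VI          */
    vip_connect(vi, "remote-node");                  /* open/connect before any transfer  */
    mem_handle_t mh = vip_register_mem(buf, sizeof buf, 7 /* protection tag */);

    vip_post_send(vi, buf, sizeof buf, mh);          /* enqueue a send descriptor         */
    vip_wait_send_done(vi);                          /* completes without entering the OS */
}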

Page 20:

Virtual Interface Architecture

Page 21:

InfiniBand Architecture

Page 22:

InfiniBand Architecture

Encompasses a system-area network for connecting multiple independent processor and I/O platforms.

Defines the communication and management infrastructure supporting both I/O and inter-processor communications.

Page 23:

InfiniBand Architecture

Components: a host channel adapter (HCA), a target channel adapter (TCA), and a fabric switch. The channel adapters offload protocol processing from the CPU. DMA/RDMA is supported.

Zero-copy data transfers proceed without kernel involvement, and hardware is used to provide highly reliable, fault-tolerant communication.

Page 24:

Communication Networks in the GRID

IPv6

High performance TCP Reno

TCP tuning for distributed applications on the WAN

TCP Vegas vs. TCP Reno

Random Early Detection gateways

Aggressive TCP Reno: What I have done on Linux kernel

Page 25:

IPv6

Expanded addressing capabilities: 128 bits vs. 32 bits in IPv4.

Flow labeling capability: good news for real-time and high-performance applications.

Header format simplification.

Improved support for extensions and options.

Authentication and privacy capabilities.

Page 26:

IPv6 header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |              Flow Label               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Payload Length         |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Source Address (128 bits)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Destination Address (128 bits)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Page 27:

High performance TCP Reno (RFC1323)

TCP extension for high performance.

TCP performance depends not upon the transfer rate itself, but rather upon the "bandwidth * delay product", which is growing quickly and is now much bigger than 64 KB.

The TCP header uses a 16-bit field to report the receive window size to the sender, so the largest window that can be used is about 2**16 = 64 KB.

Page 28:

High performance TCP Reno (RFC1323)

A TCP option, "Window Scale", is adopted to allow windows larger than 2**16 bytes.

However, a high transfer rate alone can threaten TCP reliability by violating the assumptions behind the TCP mechanism for duplicate detection and sequencing: any sequence number may eventually be reused, and errors may result from an accidental reuse of TCP sequence numbers in data segments.
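The arithmetic behind both points, as a small illustrative program (the 1 Gb/s rate is only an assumed example, not a figure from the slides):

#include <stdio.h>

int main(void)
{
    double max_plain  = 65535.0;                 /* largest 16-bit advertised window       */
    double max_scaled = 65535.0 * (1 << 14);     /* RFC 1323 limits the scale shift to 14  */
    double seq_space  = 4294967296.0;            /* 2^32 bytes of sequence-number space    */
    double rate       = 1e9 / 8.0;               /* assume a 1 Gb/s path, in bytes/second  */

    printf("max window, no scaling:  %.0f bytes (~64 KB)\n", max_plain);
    printf("max window, scale = 14:  %.0f bytes (~1 GB)\n", max_scaled);
    printf("sequence space wraps in: %.1f s at 1 Gb/s\n", seq_space / rate);
    return 0;
}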

Page 29:

High performance TCP Reno (RFC1323)

The PAWS (Protect Against Wrapped Sequence numbers) mechanism is proposed to avoid this potential problem.

Page 30:

TCP tuning for distributed applications

TCP uses the congestion window size to control how many packets it sends into the network; the send & receive buffer sizes, together with the network congestion status, determine the congestion window size.

Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux uses only 8 KB).

Page 31:

TCP tuning for distributed applications

Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/s), and the typical latency across the US is about 25 ms. Then bandwidth * delay = 12 MB/s * 25 ms = 300 KB. If the default 24 KB is used as the TCP buffer, 24/300 = 8%, so only a small portion of the bandwidth is used!

Buffer size = 2 * bandwidth * delay, or equivalently Buffer size = bandwidth * RTT.
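A minimal sketch of how an application would apply this rule, using the slide's example numbers (100 Mb/s bottleneck, 25 ms one-way delay); the OS may silently clamp the request to its per-socket maximum:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    double bandwidth = 100e6 / 8.0;                      /* 100 Mb/s ~= 12.5 MB/s            */
    double delay     = 0.025;                            /* 25 ms one-way delay              */
    int    bufsize   = (int)(2.0 * bandwidth * delay);   /* = bandwidth * RTT ~= 625 KB      */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));

    printf("requested TCP buffer size: %d bytes\n", bufsize);
    return 0;
}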

Page 32:

TCP Vegas vs. TCP Reno

Researchers have shown that aggregate network traffic can be characterized as self-similar, or fractal, which is usually bad for Internet performance. Several researchers claim that the primary source of this self-similarity is TCP Reno's "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism.

Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion.
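For reference, Reno-style AIMD in its simplest form looks like the following sketch (illustrative only, not any kernel's actual code):

/* cwnd is the congestion window measured in segments. */
static double cwnd = 1.0;

/* Additive increase: roughly one extra segment per round-trip time. */
void on_ack(void)  { cwnd += 1.0 / cwnd; }

/* Multiplicative decrease: halve the window when loss is detected. */
void on_loss(void) { cwnd /= 2.0; }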

Page 33:

TCP Vegas

Vegas has two threshold values, A and B; the default values are A=1, B=3. ESR is the expected sending rate and ASR is the actual sending rate. Let diff = ESR - ASR.

If diff < A, increase the congestion window linearly during the next round-trip time.

If diff > B, decrease the window linearly during the next RTT.

Otherwise, don't change the congestion window size.
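A sketch of that update rule, applied once per RTT; ESR and ASR are assumed to be measured elsewhere, and the units follow the slide rather than the original Vegas paper:

#define VEGAS_A 1.0   /* lower threshold (slide's default) */
#define VEGAS_B 3.0   /* upper threshold (slide's default) */

/* Returns the congestion window to use for the next round-trip time. */
double vegas_update(double cwnd, double esr, double asr)
{
    double diff = esr - asr;          /* expected minus actual sending rate  */

    if (diff < VEGAS_A)
        return cwnd + 1.0;            /* little data queued: grow linearly   */
    if (diff > VEGAS_B)
        return cwnd - 1.0;            /* queue building up: shrink linearly  */
    return cwnd;                      /* between A and B: leave it alone     */
}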

Page 34:

TCP Vegas vs. TCP Reno

Some researchers show that with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment.

The problem with Vegas is that it has not been verified on a large-scale network, and the optimal values of A and B are not easy to determine.

Page 35:

Random Early Detection gateways

RED gateways maintain a weighted average of the queue length, a minimum and maximum threshold (REDmin, REDmax), and an early drop rate P. Packets are then queued as follows:

If (queue length < REDmin), queue all packets.

If (REDmin < queue length < REDmax), drop packets with probability P.

If (queue length > REDmax), drop all packets.

RED can increase fairness and overall network performance, so it is widely deployed in Internet routers. Since the Grid is built on top of the Internet, the effect of RED routers should be considered.
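A sketch of that drop decision; the weighted-average constant and the concrete threshold and probability values are illustrative assumptions, since the slide only names the parameters:

#include <stdlib.h>

#define RED_MIN   5.0     /* REDmin, in packets            */
#define RED_MAX  15.0     /* REDmax, in packets            */
#define RED_P     0.1     /* early drop probability P      */
#define RED_W     0.002   /* weight for the moving average */

static double avg_qlen = 0.0;

/* Called on every packet arrival; returns 1 if the packet should be dropped. */
int red_should_drop(int queue_length)
{
    avg_qlen = (1.0 - RED_W) * avg_qlen + RED_W * queue_length;

    if (avg_qlen < RED_MIN)
        return 0;                                   /* short queue: accept everything */
    if (avg_qlen > RED_MAX)
        return 1;                                   /* long queue: drop everything    */
    return ((double)rand() / RAND_MAX) < RED_P;     /* in between: early random drop  */
}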

Page 36:

Aggressive TCP Reno: What I have done

Linux kernel modification on TCP congestion control.

Some studies have shown that TCP Reno congestion control is too conservative, so the bandwidth is not fully utilized.

So, make it more aggressive. How?

Page 37:

Aggressive TCP Reno: What I have done

The window size starts from more than one packet (for example 20) and increases more quickly during "slow start".

"Congestion avoidance" is unchanged.

Whenever there is a packet loss, don't drop to one packet; instead, drop to 80% of the window size, and the new threshold becomes 90% of the current window size.
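A sketch of the modified behaviour in plain C (illustrative only, not the actual kernel patch):

struct tcp_state {
    double cwnd;       /* congestion window, in segments */
    double ssthresh;   /* slow-start threshold           */
};

/* Start slow start from a larger initial window (20 segments instead of 1). */
void aggressive_init(struct tcp_state *tp)
{
    tp->cwnd     = 20.0;
    tp->ssthresh = 65535.0;
}

/* On packet loss, back off gently instead of collapsing to one segment. */
void aggressive_on_loss(struct tcp_state *tp)
{
    tp->ssthresh = 0.90 * tp->cwnd;   /* new threshold: 90% of the current window */
    tp->cwnd     = 0.80 * tp->cwnd;   /* keep 80% of the window                   */
}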

Page 38:

Aggressive TCP Reno: What I have done

Built into Linux kernel

(Figure: TCP Reno vs. aggressive TCP)

Page 39:

Some Performance gains for Virtualized Audio

Frequency (computational workload)        1000      1600      2000

Run time on modified kernel              169.9     166.5     169.3
(aggressive TCP)                        1035.5    1032.0    1035.8
                                        2187.9    2134.8    2130.2

Run time on unmodified kernel            182.9     185.3     183.1
                                        1171.7    1161.8    1159.7
                                        2207.2    2206.4    2207.7

Page 40:

Aggressive TCP Reno

Through the kernel modification, some performance gains are achieved without modifying the application code.

But the results are still not very satisfactory and not very stable. Why?

Page 41:

Aggressive TCP Reno

That can be due to three reasons. First, virtual audio is very computationally intensive, so improving communication performance even drastically will not change the overall performance much (most of the time is spent computing). Second, the bandwidth*delay product on the cluster is small, which implies that this technique may be more effective on the WAN (with much bigger RTTs). Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important.

Page 42:

Conclusion

Some technologies used by parallel systems are moving into clusters, making low latency and high bandwidth available there.

With the development of the Internet, new & improved protocols are being proposed and tested to provide lower latency & higher bandwidth for Grid computing.

A new proposal for aggressive TCP has been implemented and tested, and some performance gains were achieved. More work is needed to make it more effective and stable.

Page 43:

Questions?