June 18, 2002 http://www.cs.northwestern.edu/~donglu

Communication Networks of Parallel & Distributed Systems: Low Latency & High Bandwidth comes to clusters & grids

Dong Lu
Dept. of Computer Science
Northwestern University

Introduction

Communication networks play a vital role in parallel & distributed systems.
Modern communication networks support low-latency, high-bandwidth communication services.

How is low latency & high bandwidth achieved?

DMA-based zero copy and OS bypass, which give applications direct access to the network interface card (NIC).
Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter. Often, TCP/IP is not used.
Switched networks or hypercubes with high-speed routers that support cut-through routing.

Trend

These technologies are migrating from inside parallel systems to clusters. Example: InfiniBand Architecture.

Lower latency & higher bandwidth communication networks are becoming available to Grid computing. Example: high-speed optical networking is becoming dominant in the Internet.

Outline of this talk

Communication networks in parallel systems.
Communication networks in clusters and System Area Networks.
New and improved communication network protocols & technologies in Grid computing.
Trend: Low latency & High bandwidth comes to clusters & the Grid.

Communication networks in parallel systems

IBM SP2
SGI Origin 2000

IBM SP2

Any-to-any, packet-switched, multistage network with excellent scalability.
The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing.
The adapter can move messages to and from processor memory directly via direct memory access (DMA), and thus supports zero-copy message passing.

[Figure: 64-node IBM SP2 network topology]

SGI Origin 2000

Origin is a distributed shared memory, cc-NUMA multiprocessor. cc-NUMA stands for cache-coherent non-uniform memory access.
Hypercube network connected by SPIDER routers, which support wormhole “cut-through” routing: a router starts forwarding a packet before it has received the whole packet, in contrast to “store and forward”.
Low-latency remote memory access is supported, and the ratio of remote to local memory latency is very low.
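
A rough back-of-the-envelope model shows why cut-through routing lowers latency compared with store-and-forward. The sketch below is illustrative only: it ignores contention and propagation delay, and the packet size, header size, link rate, and hop count are assumed example values, not SPIDER router figures.

```python
def store_and_forward_latency(packet_bytes, link_bytes_per_s, hops):
    """Latency when every router receives the whole packet before forwarding it."""
    return hops * packet_bytes / link_bytes_per_s


def cut_through_latency(packet_bytes, header_bytes, link_bytes_per_s, hops):
    """Latency when every router forwards as soon as the header has arrived.

    Only the header is delayed at each intermediate router; the packet body
    is serialized once and streams through the routers behind the header.
    """
    header_time = header_bytes / link_bytes_per_s
    return (hops - 1) * header_time + packet_bytes / link_bytes_per_s


# Assumed example: 1024-byte packet, 16-byte header, 1 GB/s links, 4 links.
sf = store_and_forward_latency(1024, 1e9, 4)
ct = cut_through_latency(1024, 16, 1e9, 4)
print(f"store-and-forward: {sf * 1e6:.3f} us, cut-through: {ct * 1e6:.3f} us")
```

In this model the store-and-forward latency grows with packet size times hop count, while the cut-through latency pays the full packet time only once.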

[Figure: 128-processor SGI Origin 2000 network]

Communication Networks in Clusters

Gigabit Ethernet
Myrinet
Virtual Interface Architecture
InfiniBand Architecture

Gigabit Ethernet

Can be switch based: higher bandwidth and smaller collision domains.
Jumbo frames of up to 9K bytes are supported.
Some NICs are programmable and have an onboard processor.
Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, which makes Gigabit Ethernet a low-cost, low-latency, high-bandwidth architecture!

High performance GigaE Architecture

EMP system (appeared in HPDC 2001)

Myrinet

Developed from the technology of a parallel system, the Intel Paragon.
The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor.
Switch based; cut-through routing is supported.

Myrinet

LANai is the host interface; it has a processor and a DMA engine on board.
High bandwidth & low latency, but very expensive and not very stable.

Myrinet

Virtual Interface Architecture

Supports zero copy and OS bypass to provide a low-latency, high-bandwidth communication service.
Message send/receive operations and remote DMA are supported.
To a user process, VIA provides direct access to the network interface in a fully protected fashion.

Remote DMA

Virtual Interface Architecture

Each process owns a VI, and each VI consists of a send queue and a receive queue.
Memory regions are registered before data transfer, as part of the Open/Connect operations.
After Open/Connect and memory registration, user data can be transferred without involving the operating system.
Memory protection is provided by a protection tag mechanism: protection tags are associated with VIs and memory regions.
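
The relationship between VIs, registered memory regions, and protection tags can be pictured with a toy model. This is only a sketch of the idea described above: the class and method names (`MemoryRegion`, `VirtualInterface`, `post_send`) are hypothetical and are not the VIA programming interface, and the check shown stands in for what the network interface would enforce in hardware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A registered buffer together with the protection tag it was registered under."""
    buffer: bytearray
    protection_tag: int


@dataclass
class VirtualInterface:
    """A VI: one send queue and one receive queue, plus its protection tag."""
    protection_tag: int
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def post_send(self, region, offset, length):
        """Queue a send descriptor after checking tag and bounds (no OS call involved)."""
        if region.protection_tag != self.protection_tag:
            raise PermissionError("protection tag mismatch")
        if offset < 0 or length < 0 or offset + length > len(region.buffer):
            raise ValueError("descriptor lies outside the registered region")
        self.send_queue.append((region, offset, length))


vi = VirtualInterface(protection_tag=7)
own_region = MemoryRegion(bytearray(4096), protection_tag=7)
vi.post_send(own_region, 0, 1024)  # accepted: the tags match
```

A descriptor that names a region registered under a different tag is rejected, which is how one process is kept out of another process's buffers in this model.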

Virtual Interface Architecture

InfiniBand Architecture

InfiniBand Architecture

Encompasses a system-area network for connecting multiple independent processor and I/O platforms.
Defines the communication and management infrastructure supporting both I/O and inter-processor communications.

InfiniBand Architecture

Components: a host channel adapter (HCA), a target channel adapter (TCA), and a fabric switch.
Channel adapters offload the protocol processing load from the CPU. DMA/RDMA is supported.
Zero-copy data transfers proceed without kernel involvement, and hardware is used to provide highly reliable, fault-tolerant communication.

Communication Networks in the GRID

IPv6
High performance TCP Reno
TCP tuning for distributed applications on the WAN
TCP Vegas vs. TCP Reno
Random Early Detection gateways
Aggressive TCP Reno: What I have done on the Linux kernel

IPv6

Expanded addressing capabilities: 128 bits vs. 32 bits in IPv4.
Flow labeling capability: good news for real-time applications and high-performance applications.
Header format simplification.
Improved support for extensions and options.
Authentication and privacy capabilities.

IPv6 header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |           Flow Label                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Source Address (128 bits)                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Destination Address (128 bits)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
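
The header layout can be made concrete by packing the fixed 40-byte IPv6 header by hand. The following is a minimal sketch using only the Python standard library; the function name `pack_ipv6_header` and the example field values and addresses are assumptions chosen for illustration.

```python
import ipaddress
import struct


def pack_ipv6_header(traffic_class, flow_label, payload_length,
                     next_header, hop_limit, src, dst):
    """Build the fixed 40-byte IPv6 header in network byte order."""
    if not 0 <= traffic_class <= 0xFF:
        raise ValueError("traffic class is an 8-bit field")
    if not 0 <= flow_label <= 0xFFFFF:
        raise ValueError("flow label is a 20-bit field")
    # First 32-bit word: 4-bit version (6), 8-bit traffic class, 20-bit flow label.
    first_word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack(
        "!IHBB16s16s",
        first_word,
        payload_length,                      # 16 bits
        next_header,                         # 8 bits
        hop_limit,                           # 8 bits
        ipaddress.IPv6Address(src).packed,   # 128 bits
        ipaddress.IPv6Address(dst).packed,   # 128 bits
    )


# Example values (assumed): next header 6 is TCP; the addresses come from the
# 2001:db8::/32 documentation prefix.
header = pack_ipv6_header(0, 0x12345, 20, 6, 64, "2001:db8::1", "2001:db8::2")
print(len(header), header.hex())
```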

High performance TCP Reno (RFC1323)

TCP extensions for high performance.
TCP performance depends not upon the transfer rate itself, but rather upon the “bandwidth*delay product”, which is growing quickly and is now much bigger than 65K bytes.
The TCP header uses a 16-bit field to report the receive window size to the sender. Therefore, the largest window that can be used is 2**16 = 65K bytes.
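
The effect of the 16-bit window field can be quantified with a short calculation: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. The sketch below is illustrative; the 50 ms round-trip time and the 12 MB/s path are assumed example values.

```python
MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value a 16-bit window field can hold


def max_throughput(window_bytes, rtt_s):
    """Upper bound on throughput (bytes/s): at most one window per round trip."""
    return window_bytes / rtt_s


def window_needed(bandwidth_bytes_per_s, rtt_s):
    """Bandwidth*delay product: the window (bytes) needed to keep the path full."""
    return bandwidth_bytes_per_s * rtt_s


rtt = 0.050  # assumed 50 ms round trip
print(max_throughput(MAX_UNSCALED_WINDOW, rtt))  # bytes/s without window scaling
print(window_needed(12e6, rtt))                  # bytes needed to fill a 12 MB/s path
```

With these assumed numbers the unscaled window caps throughput at roughly 1.3 MB/s, while filling the path would need a window of about 600 KB.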

High performance TCP Reno (RFC1323)

A TCP option, "Window Scale", is adopted to allow windows larger than 2**16 bytes.
However, a high transfer rate alone can threaten TCP reliability by violating the assumptions behind the TCP mechanism for duplicate detection and sequencing. That is, any sequence number may eventually be reused, and errors may result from an accidental reuse of TCP sequence numbers in data segments.

High performance TCP Reno (RFC1323)

The PAWS (Protect Against Wrapped Sequence numbers) mechanism is proposed to avoid this potential sequence number reuse problem. It relies on the TCP Timestamps option to reject old duplicate segments.
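
To get a feel for how quickly sequence numbers can be reused, the sketch below computes how long it takes a sender to cycle through the full 32-bit TCP sequence space at a sustained rate. The link rates are assumed example values, and the calculation is a simple upper-level estimate rather than an analysis of any particular TCP stack.

```python
SEQUENCE_SPACE_BYTES = 2**32  # TCP sequence numbers count bytes in a 32-bit space


def seconds_to_wrap(rate_bits_per_s):
    """Time (s) to send 2**32 bytes at the given sustained rate."""
    return SEQUENCE_SPACE_BYTES / (rate_bits_per_s / 8)


for name, rate in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    print(f"{name}: sequence space wraps after {seconds_to_wrap(rate):.1f} s")
```

At gigabit rates the wrap time drops to tens of seconds, short enough that a delayed segment from an earlier pass could still be in the network.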

TCP tuning for distributed applications

TCP uses the congestion window to control how many packets are sent into the network. The send & receive buffer sizes, together with the network congestion status, determine the congestion window size.
Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux uses only 8 KB).

TCP tuning for distributed applications

Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/sec), and the typical latency across the US is about 25 ms. The bandwidth*delay product is then 12 MB/sec * 25 ms = 300 KB. If the default 24 KB is used as the TCP buffer, then 24/300 = 8%. So, only a small portion of the bandwidth is used!

Buffer size = 2 * bandwidth * delay
or
Buffer size = bandwidth * RTT
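
The buffer-size rule can be applied directly when opening a socket. The sketch below computes bandwidth * RTT and requests buffers of that size through the standard `SO_SNDBUF` and `SO_RCVBUF` socket options; the helper names and the 12 MB/s and 50 ms figures are assumptions for illustration, and the operating system may clamp or adjust the requested size.

```python
import socket


def recommended_buffer_bytes(bandwidth_bytes_per_s, rtt_s):
    """Buffer size = bandwidth * RTT, rounded to whole bytes."""
    return int(round(bandwidth_bytes_per_s * rtt_s))


def open_tuned_socket(buffer_bytes):
    """Create a TCP socket and request send/receive buffers of the given size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


# Assumed example: 12 MB/s bottleneck, 25 ms one-way delay, so RTT = 50 ms.
buffer_bytes = recommended_buffer_bytes(12_000_000, 0.050)
print(buffer_bytes)
```

The buffers should be requested before the connection is established, for example by calling `open_tuned_socket(buffer_bytes)` and only then `connect`, so that the window negotiated at connection setup can reflect the larger size.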

TCP Vegas vs. TCP Reno

Researchers have shown that aggregate network traffic can be characterized as self-similar or fractal, which is usually a bad property for the performance of the Internet.
Several researchers claim that the primary source of self-similarity is TCP Reno's "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism.
Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion.

TCP Vegas

Vegas has two threshold values, A and B, with default values A = 1 and B = 3.
ESR is the expected sending rate and ASR is the actual sending rate.
Let diff = ESR - ASR.
    If diff < A, increase the congestion window linearly during the next round trip time (RTT).
    If diff > B, decrease the window linearly during the next RTT.
    Otherwise, don’t change the congestion window size.
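
The Vegas rule can be sketched as a per-round-trip window update. This is a simplified illustration, not the kernel algorithm: the function name is hypothetical, the rates are estimated as window / RTT, and multiplying the rate difference by the base RTT (so that A and B are expressed in packets) is an assumption made here.

```python
def vegas_next_window(cwnd, base_rtt, current_rtt, a=1.0, b=3.0):
    """Return the congestion window (packets) to use for the next round trip."""
    expected_rate = cwnd / base_rtt    # ESR: rate if nothing were queued
    actual_rate = cwnd / current_rtt   # ASR: rate actually observed
    # Assumption: scale by base_rtt so diff is in packets, matching A and B.
    diff = (expected_rate - actual_rate) * base_rtt
    if diff < a:
        return cwnd + 1                # little queuing: grow linearly
    if diff > b:
        return max(cwnd - 1, 1)        # too much queuing: shrink linearly
    return cwnd                        # between A and B: hold steady


print(vegas_next_window(10, 0.100, 0.100))  # RTT at its minimum
print(vegas_next_window(10, 0.100, 0.125))  # moderate queuing
print(vegas_next_window(10, 0.100, 0.200))  # heavy queuing
```

The three calls cover the three branches: grow, hold, and shrink.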

TCP Vegas vs. TCP Reno

Some researchers show that, with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment.
The problem is that Vegas has not been verified on a large-scale network, and the optimal values of A and B are not easy to determine.

Random Early Detection gateways

RED gateways maintain a weighted average of the queue length, a minimum and a maximum threshold (REDmin, REDmax), and an early drop rate P. Packets are then queued as follows:
    If (queue length < REDmin), queue all packets.
    If (queue length > REDmin, and queue length < REDmax), drop packets with probability P.
    If (queue length > REDmax), drop all packets.

RED can increase fairness and overall network performance, so it is widely deployed in routers around the world. Since the Grid is built on top of the Internet, the effect of RED routers should be considered.
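
The three RED rules can be sketched as a small queue class. This is a toy model under stated assumptions: the class name, the averaging weight of 0.002, and the handling of the exact boundary values (an average equal to REDmin falls in the probabilistic band, an average equal to REDmax is dropped) are choices made here, and the drop probability is the fixed P of the slide rather than one that varies with the average queue length.

```python
import random


class RedQueue:
    """Toy RED gateway queue following the three rules above."""

    def __init__(self, red_min, red_max, drop_prob, weight=0.002, rng=None):
        self.red_min = red_min
        self.red_max = red_max
        self.drop_prob = drop_prob
        self.weight = weight   # weight of the newest sample in the average
        self.rng = rng or random.Random()
        self.avg = 0.0         # weighted average queue length
        self.queue = []

    def enqueue(self, packet):
        """Return True if the packet was queued, False if it was dropped."""
        # Exponentially weighted moving average of the instantaneous length.
        self.avg = (1 - self.weight) * self.avg + self.weight * len(self.queue)
        if self.avg < self.red_min:
            drop = False
        elif self.avg < self.red_max:
            drop = self.rng.random() < self.drop_prob
        else:
            drop = True
        if not drop:
            self.queue.append(packet)
        return not drop


q = RedQueue(red_min=5, red_max=15, drop_prob=0.1)
accepted = sum(q.enqueue(n) for n in range(100))
print(accepted, round(q.avg, 3))
```

Because the decision uses the weighted average rather than the instantaneous length, a short burst can pass through even when the queue is momentarily long.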

Aggressive TCP Reno: What I have done

Linux kernel modification of TCP congestion control.
Some studies have shown that TCP Reno congestion control is too conservative, so the bandwidth is not fully utilized.
So, make it more aggressive. How?

Aggressive TCP Reno: What I have done

The window size starts from more than one packet (for example, 20) and increases more quickly during “slow start”.
“Congestion avoidance” is unchanged.
Whenever there is a packet loss, the window does not drop to one packet; instead, it drops to 80% of its current size, and the new threshold is 90% of the current window size.
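
The difference between the two loss reactions can be sketched with a per-round-trip window update. This is a simplified model, not the kernel patch itself: the function names are hypothetical, the baseline is the textbook Reno reaction to a timeout, and the slow-start growth factor of 3 for the aggressive variant is an assumed value, since the slide only says the window increases "more quickly".

```python
def reno_step(cwnd, ssthresh, loss):
    """Textbook Reno reaction to a timeout, one round trip at a time."""
    if loss:
        return 1.0, max(cwnd / 2, 2.0)            # restart from one packet
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh  # slow start: double
    return cwnd + 1, ssthresh                     # congestion avoidance


def aggressive_step(cwnd, ssthresh, loss, growth=3.0):
    """Aggressive variant: milder cut on loss, faster slow start (assumed factor)."""
    if loss:
        return 0.8 * cwnd, 0.9 * cwnd             # window to 80%, threshold to 90%
    if cwnd < ssthresh:
        return min(cwnd * growth, ssthresh), ssthresh
    return cwnd + 1, ssthresh                     # same congestion avoidance


# Both senders hit a loss at a window of 100 packets, then run 5 loss-free rounds.
reno = reno_step(100.0, 200.0, True)
aggr = aggressive_step(100.0, 200.0, True)
for _ in range(5):
    reno = reno_step(*reno, False)
    aggr = aggressive_step(*aggr, False)
print("Reno window:", reno[0], "aggressive window:", aggr[0])
```

In this model the aggressive sender is back near its pre-loss window within a few round trips, while the baseline sender is still climbing from one packet.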

Aggressive TCP Reno: What I have done

Built into the Linux kernel

[Figure panels: TCP Reno; Aggressive TCP]

Some performance gains for Virtualized Audio

Run time, three measurements per configuration:

Frequency (computational work load)    1000                   1600                      2000
Modified kernel (aggressive TCP)       169.9  166.5  169.3    1035.5  1032.0  1035.8    2187.9  2134.8  2130.2
Unmodified kernel                      182.9  185.3  183.1    1171.7  1161.8  1159.7    2207.2  2206.4  2207.7
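
The size of the gain can be read off the table by averaging the three measurements per configuration. The sketch below does only that arithmetic; it assumes the three values under each frequency are repeated measurements of the same configuration, which is an interpretation of the slide layout, and the function name is hypothetical.

```python
from statistics import mean

# Run times transcribed from the table above (three measurements per cell).
modified = {
    1000: [169.9, 166.5, 169.3],
    1600: [1035.5, 1032.0, 1035.8],
    2000: [2187.9, 2134.8, 2130.2],
}
unmodified = {
    1000: [182.9, 185.3, 183.1],
    1600: [1171.7, 1161.8, 1159.7],
    2000: [2207.2, 2206.4, 2207.7],
}


def reduction_percent(freq):
    """Percentage by which the mean run time drops on the modified kernel."""
    base = mean(unmodified[freq])
    return 100.0 * (base - mean(modified[freq])) / base


for freq in sorted(modified):
    print(f"frequency {freq}: mean run time reduced by {reduction_percent(freq):.1f}%")
```

Under that reading, the reduction in mean run time is roughly 8% at frequency 1000, 11% at 1600, and under 3% at 2000.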

Aggressive TCP Reno

Through the kernel modification, some performance gains are achieved without modifying the application code.
But the results are still not very satisfactory and not very stable. Why?

Aggressive TCP Reno

There may be three reasons.
First, virtualized audio is very computationally intensive, so even a drastic improvement in communication performance will not change the overall performance much (most of the time is spent computing).
Second, the bandwidth*delay product on the cluster is small, which implies that this technique may be more effective on a WAN (with a much bigger RTT).
Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important.

Conclusion

Some technologies used by parallel systems are moving into clusters, making low latency and high bandwidth available there.
With the development of the Internet, new & improved protocols are being proposed and tested to provide lower latency & higher bandwidth to Grid computing.
A new proposal for an aggressive TCP has been implemented and tested, and some performance gains were achieved. More work is needed to make it more effective and stable.

Questions?