ddrlink aggregation with failover apr09

Page 1: DDRLink Aggregation With Failover Apr09

Link Aggregation with Failover 4/15/2009

Contents
    Introduction
    Terminology
    References
    Link Aggregation Types
    Topologies
        Direct Connect
        Private Network
        Local Network
        Remote Network
    Data Domain Link Aggregation and Failover
        Bond Functions Available in Linux Distribution
        Hash Methods Used
        Link Failures
    Other Link Aggregation
        Cisco
        Sun
        Windows
        AIX
        HPUX
    Data Domain Link Aggregation and Failover in the Customer’s Environment
        Normal Link Aggregation
        Failover of NICs
        Failover Associated with Link Aggregation
        Recommended Link Aggregation
    Switch Information

Introduction

This document describes the use of link aggregation and failover techniques to maximize throughput on networks with Data Domain systems installed. The basic topologies are described with notes on the usefulness of different aggregation methods, so the right method can be chosen for the site.

The goal of link aggregation is to split the network traffic evenly across all the links or ports in the aggregation group. This maximizes network throughput on the LAN or LANs until the maximum speed of the computer is reached. Normally the aggregation is between the local system and the network device or system to which it is connected, usually a switch or router. In theory, aggregating two links allows the system to send data on both links at the same time and therefore get up to double the throughput.

There are a few things that can impact how well the aggregation actually performs.

1. Speed of the switch
2. How much the DDR can process
3. Network overhead
4. Acknowledging and coalescing out of order packets
5. Aggregation method may not effectively distribute the data evenly across all the links
6. Number of clients
7. Number of streams (connections) per client

For impact 1, the switch can normally handle the speed of each link connected to it, but it may lose some packets if all the packets coming from several ports are concentrated on one uplink, all running at maximum speed. Note: this implies that only one switch can be used for port aggregation coming out of a system. For most implementations this is true, but some network topologies allow link aggregation across multiple switches.

Impact 2 addresses the DDR systems. The rate at which DDR systems and programs can process data is limited. As the hardware gets faster and the use of parallel processing improves, DDR systems will support a higher network throughput, but as processing speed increases the network link speed will also increase. For example, with the current systems it makes sense to aggregate 1 GbE links but not 10 GbE links, because one 10 GbE link can provide enough data to saturate the processing power of the current DDR systems. As system speed improves it will make sense to aggregate 10 GbE links.

Impact 3 addresses the inherent overhead of the network programs. This overhead guarantees that the transfer speed will never reach 100%. The throughput will always be reduced by the overhead it takes to create and send a packet of data through the system until it is put onto the wire. There is also an inherent delay separating the sending of packets on Ethernet.

Impact 4 deals with the case where packets arrive out of order. The network program will need to coalesce out of order packets into the original order. If the link aggregation mode allows packets to be sent out of order and the protocol requires that they be put back into the original order, the added overhead may reduce throughput to the point where that mode of link aggregation should not be used.

Impact 6 asks whether a single client can drive data fast enough to fully utilize multiple aggregated links. In most cases, either the physical or OS resources cannot drive data at multiple Gbps. Also, due to hashing limitations, multiple clients would be required to push data at those speeds.

Impact 7, the number of streams, which translates to separate connections, can play a significant role in link utilization depending on the hashing that is used.

A final impact, item 5 in the list above, deals with the effectiveness of the aggregation method used. If two systems are connected together by direct connect cables, Layer 2 (MAC) hashing or Layer 3 (IP) hashing would not provide any aggregation at all, because there is only one source address and one destination address. All the packets would go over the same link. In general the number of systems communicating with the Data Domain system will be small, so the aggregation method used needs to work for a limited number of target systems.

The number of links that are aggregated will depend on the switch performance, the DDR system and application performance, and the link aggregation mode used.

Terminology:

DDR
    Data Domain appliance, a Linux system used to perform only Data Domain operations.

EtherChannel
    This is a term used by Cisco to define the bundling of network links as described under Ethernet Channel. With Cisco there are three ways to form an EtherChannel: manually, automatically using PAgP, and automatically using LACP. If it is done manually, both sides have to be set up by the administrator. If one of the protocols is used, packets for that protocol are sent to the other side, where the EtherChannel is set up based on the information in the packets.

Ethernet Channel
    This is a set of individual Ethernet links bundled into a single logical link between systems. This provides a higher throughput than a single link does. The term used by Cisco to identify this is EtherChannel. The actual throughput depends on the number of links bundled together, the speed of the individual links, and the switch or router that is actually being used. If a link within the Ethernet Channel fails, the traffic that would normally use the failed link is sent over the remaining links within the bundle.

LACP
    Link Aggregation Control Protocol (LACP) provides dynamic network aggregation as defined in the IEEE 802.3ad standard. This is not available in DDOS 4.9 and before.

Link Aggregation
    Using multiple Ethernet network cables or ports in parallel, link aggregation increases the link speed beyond the limits of any one single cable or port. Link aggregation is usually limited to being connected to the same switch. Other terms used are EtherChannel (from Cisco), Trunking, Port Trunking, Port aggregation, NIC bonding, and Load balancing. There are proprietary methods that are used, but the main standard method is IEEE 802.3ad. Link aggregation can be used for a type of failover too.

Load Balancing
    Aggregation methods used to try to distribute loads across all available links or ports.

Port Aggregation Protocol (PAgP)
    This is Cisco’s proprietary networking protocol providing logical aggregation of Ethernet ports. It is used in Cisco’s EtherChannel. This is the older method used by Cisco. Later releases of their software use the standard LACP to provide the same type of functions. Note that PAgP EtherChannels do not interoperate with LACP EtherChannels. PAgP is not supported by DDRs.

Round Robin
    Each new packet is sent to the least busy link or port. This usually means that packets are not sent to the first link or port again until packets have been sent to all the other links or ports, but it may take into account the packet size in the distribution of packets.

RSTP
    Rapid Spanning Tree Protocol, IEEE 802.1w, allows a network topology with bridges to provide redundant paths. This allows for failover of network traffic among systems. This is an extension to the Spanning Tree Protocol (STP). The two names are used interchangeably.

TOE
    TCP Offload Engine: network cards (NICs) that have the full TCP/IP stack on the card.

Trunking
    Trunking is the use of multiple communication links to provide an aggregated data transfer among systems. For computers this may be referred to as port trunking to distinguish it from other types of trunking such as frequency sharing.

    Note: Cisco uses the term “trunking” to refer to VLAN tagging, not link aggregation, whereas other vendors use this term in reference to link aggregation.

References:

Catalyst 4500 Series Switch Cisco IOS Software Configuration Guide (also used for the 4900 Series Switch), Release 12.2(44)SG, available from the Cisco Documentation site.

Cisco Documentation, http://www.cisco.com/univercd/home/home.htm

IEEE 802.3 Standard, http://standards.ieee.org/getieee802/802.3.html
Also available under: http://iweb.datadomain.com/eweb/technical_library/Vendor/Cisco/
The IEEE 802.3ad Standard is “Clause 43” under IEEE802_3-sec3.pdf of the standards documents listed.

Linux distribution documentation, http://www.kernel.org/

Linux Ethernet Bonding Driver HOWTO, http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver-howto.php, http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces-nic-into-single-interface.html

Wikipedia, http://en.wikipedia.org/wiki/Main_Page

Various links on the web as noted within the document by hotlinks

Link Aggregation Types

Link aggregation needs to balance the number of packets across all the links within the aggregation group with minimum impact from the splitting, assembling, and reordering of out of order packets.

Currently IEEE 802.3ad is the accepted standard. This can be used by most systems that can support link aggregation, but there is no “one size fits all.” There are other aggregation types that may work better in some situations, such as round robin, which is not part of the IEEE 802.3ad standard.

The IEEE 802.3ad standard is contained in clause 43 of the IEEE 802.3 standard, which is freely available on the web. In the IEEE standards the term “clause 43” can be thought of as chapter 43. Clause 43 is part of IEEE 802.3-2005, Section Three of the pdf file on the IEEE web site. A large part of the IEEE 802.3ad standard is the LACP. This is a protocol that is used to coordinate the aggregation between the two systems that are directly connected. Note: This standard does not identify how the actual link is selected to send a packet, but it does emphatically state two things: packets within a “conversation” should always be kept in order, and packets should not be duplicated. For the purposes of this document, “conversation” is the same as connection. Although the aggregation process is described in great detail in this standard, the only thing that has an impact on systems outside the local system is the LACP and the messages sent that use it. Note: The LACP protocol is not supported in releases 4.9 and before.

If the IEEE standard is not used, the aggregation on both sides has to be set up manually. Other than round robin, the DDR uses the Linux bonding module’s balance-xor type to provide link aggregation. As implied by the name, the aggregation is done by applying an XOR function to one or more of the addresses and/or port numbers within the packet headers. This aggregation has to be set up manually on both sides, and the aggregation used needs to match. For example, if Layer 3+4 is used on the DDR it also needs to be used on the system connected to the DDR.

An important consideration is the network topology. Important things to consider in the network topology are:

• The equipment directly connected to the DDR
    o It may be the media server or another DDR
    o If it is a switch or a router, the make and model number should also be determined
• Whether the target system is local or remote; there may be a gateway involved
• The DDR may be on a private network or shared with the rest of the customer’s network
• The number of target computer systems that will be connected
• Single DDR or multiple DDRs

Each part of this information will have an impact on the type of Link aggregation that is used.

Consider what systems will be doing the link aggregation. Normally link aggregation configuration requires coordination between the DDR system and the switch. Another type of link aggregation configuration can be handled from the DDR system only (both transmit and receive). There is at least one network topology where a switch may not be part of the configuration, i.e. direct connect. This will need the link aggregation to be configured between the DDR and the media servers. If the DDR is on the local network and is communicating with many systems, then using Layer 2 (MAC address) could be acceptable. If the connection path goes through a router/gateway, then Layer 3 (IP address only) or Layer 3+4 (IP address and port number) may be needed.

The link aggregation will need to address the use of links of different speeds, for example using both 1 GbE and 10 GbE. The 10 GbE TOE cards may have aggregation on the cards and not support aggregation off the card. Most aggregation methods do not support links running at different speeds, so mixing speeds should be avoided.

There is also the question of failover. Failover can be considered part of aggregation. Most link aggregation modes include a failover component by allowing data transfer to continue in a degraded state. For example, if one of the links goes down, the link aggregation can recognize this, drop that link from the aggregation list, and continue with one less link. The customer may feel full failover is more important than link aggregation. Instead of aggregating over multiple links, these links can be configured in full failover mode, where one or more idle spares that carry no data are kept in standby until the active link fails. This way there is no degradation of throughput if one link fails and data is sent over another.

An administration network interface is also needed with DDRs. For direct connections and one-to-one server connections there is a separate Ethernet interface for this, but it could also be part of the link aggregation unless physical separation is needed between the links.

Topologies

The basic types of network topologies are described below, along with their differing suitability for various types of aggregation methods.

Direct Connect

The Data Domain system is directly connected to one or more backup servers. Providing link aggregation within this topology requires multiple links between each backup server and the Data Domain system. Usually link aggregation is not done with this topology, especially with multiple backup servers, because of the limited number of links available on the Data Domain system.

[Figure: direct connect topology. Labels: Backup/media server, Network switch, Tape Library, Data Domain, Business Servers]

Private Network

This topology is the same as direct connect except that the connections are through a switch rather than being directly connected. This would normally be used to connect multiple media servers to multiple DDRs. The link aggregation would be between a DDR and the switch or between a media server and the switch. The aggregation would be used to get the data to and from the switch. In this case the aggregation between the DDR and the switch would be independent of the aggregation used between the media server and the switch. Note: there is a possible special case where the switch would be only a pass-through and would be transparent to the aggregation. That would not be the norm and is discussed in further detail later.

[Figure: private network topology. Labels: Backup/media server, Network switch, Tape Library, Data Domain, Business Servers, Private network switch]

Local Network

The Data Domain system is connected to the backup server through a common switch. In the previous network topologies the Data Domain system may have a connection through the common switch to handle administration and maintenance tasks, which need not be part of the aggregation. In this example the data is also being sent through the shared network.

[Figure: local network topology. Labels: Backup/media servers, Network switch, Tape Library, Data Domain System, Business Servers]

Remote Network

This is similar to the local network except that the connection goes through a router before it gets to the media server. There will normally be a switch between the DDR and the router unless the router also provides switch functionality. What is important to note in this diagram is that a gateway function is involved in the network data flow. It is important to maximize the data throughput between the DDR and the media servers, so normally the DDR will be located on the same LAN and use the same switch as the media server. With multiple media servers there may be cases where some of them are on separate VLANs. The DDR would need to go through at least one gateway to get to them. It is not expected that the remote network will go across a WAN. A topology that crosses a WAN is more likely for a DDR with replication. Normally the data flow in replication is low enough that it does not need aggregation, and the WAN would also tend to make aggregation ineffective. Yet one customer has asked about it and may be pursuing it.

[Figure: remote network topology. Labels: Backup/media servers, Tape Library, Data Domain, Network router, Business Servers, Network router]

Data Domain Link Aggregation and Failover

There are two link aggregation methods supported by Data Domain:

• Round Robin and
• Balanced-xor (set up manually on both sides).

The balanced-xor aggregation is selected by choosing the specific hash that is supported:

• Layer 2 or
• Layer 3+4.

There are four virtual interfaces that can be used to define the aggregation or failover:

• veth0,
• veth1,
• veth2,
• veth3.

Any of the physical links that are available on the system can be included: eth0, eth1, eth2, eth3, etc. The on-board links (eth0 and eth1) have only recently been allowed to be added to the aggregation group. Older installations of the Data Domain software may not allow those two links to be aggregated.

To specify aggregation of eth2 and eth3 in the virtual interface veth0, one of the following commands would be used:

Net aggregate add veth0 mode roundrobin interfaces eth2 eth3

The first network packet sent to veth0 will be forwarded to one of the interfaces and the next packet will be forwarded to the other. Sending of packets will continue to alternate between the interfaces until there are no more packets or a link fails. If eth3 loses its physical connection, all packets are sent through eth2 until the eth3 link is brought back up. To make this effective, the other side of the network also needs to be set up to do round robin. For direct connect (the only topology that is recommended for round robin) the media server must be able to set up and support round robin.
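The alternating behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the bonding module's code; the function name `round_robin` and the `down` parameter are hypothetical names introduced here.

```python
def round_robin(packets, links, down=()):
    """Pair each packet with the next usable link, in rotation.

    `links` is the ordered list of interfaces in the group and `down`
    holds the names of links that have lost their physical connection
    (both are illustrative names, not DD CLI terms).
    """
    up = [link for link in links if link not in down]
    if not up:
        raise RuntimeError("no link in the group is up")
    return [(pkt, up[i % len(up)]) for i, pkt in enumerate(packets)]


# Both links up: packets alternate between eth2 and eth3.
print(round_robin(range(4), ["eth2", "eth3"]))

# eth3 down: every packet is sent through eth2.
print(round_robin(range(4), ["eth2", "eth3"], down={"eth3"}))
```

The sketch ignores packet size and ordering effects on the receiving side, which is where round robin tends to cause trouble in practice.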

Net aggregate add veth0 mode xor-L2 interfaces eth2 eth3

The aggregation used would be balanced-xor. The packets are distributed across eth2 and eth3 based on the XOR of the source and destination MAC addresses. Because there are only 2 links to be aggregated, the lowest bit of the result determines the interface to use for the packet. If the result is 0 one interface will be chosen. If the result is 1 the other interface will be used. For packets to be spread across the two links, data must be sent to more than one destination, and the destination MAC addresses must differ in such a way that the XOR results differ. This means that one address needs to be odd and the other needs to be even. If three links are aggregated, the XOR result is split 3 ways. There must be at least two media servers, with odd and even MAC addresses, to get any aggregation at all. In general, this aggregation should not be used with fewer than 4 media servers.
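A minimal Python sketch of this Layer 2 selection follows, using the formula as it is stated in this document: (source MAC XOR destination MAC) modulo the number of links. The MAC addresses are made up for illustration, and an actual bonding driver may use only part of the address, so treat this as an approximation.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer."""
    return int(mac.replace(":", ""), 16)


def layer2_link(src_mac, dst_mac, links):
    """Pick a link from (src MAC XOR dst MAC) modulo link count."""
    return links[(mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % len(links)]


links = ["eth2", "eth3"]
ddr = "00:1b:21:00:00:10"            # hypothetical DDR MAC (even)
servers = [
    "00:1b:21:00:00:21",             # odd  -> XOR is odd  -> eth3
    "00:1b:21:00:00:22",             # even -> XOR is even -> eth2
    "00:1b:21:00:00:24",             # even -> XOR is even -> eth2
]
for mac in servers:
    print(mac, layer2_link(ddr, mac, links))
```

With only two links, two of the three hypothetical servers share eth2, which shows why a small number of media servers can leave the links unevenly loaded.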

Net aggregate add veth0 mode xor-L3L4 interfaces eth2 eth3

The aggregation used with this command will also be balanced-xor. The packets are distributed across eth2 and eth3 based on the XOR of the source IP address, destination IP address, source port number, and destination port number. The result gives a number whose lowest bit determines which link is used to send the packet. An even result will go over one link and an odd result will go over the other. With three links the result is divided by 3, with the remainder determining which interface to use. This aggregation would be used when there are a lot of connections (there is one connection per stream) or a lot of media servers or both. This is the mode of choice for Data Domain, but some switches do not support this type of hashing.
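The Layer 3+4 selection can be sketched the same way, using the hash given later in this document: ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo the number of links. The IP addresses and port numbers below are hypothetical, and the sketch handles IPv4 TCP/UDP packets only.

```python
import ipaddress


def layer34_link(src_ip, src_port, dst_ip, dst_port, links):
    """Pick a link from the Layer 3+4 hash described in this document."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h = (src_port ^ dst_port) ^ (ip_xor & 0xFFFF)
    return links[h % len(links)]


links = ["eth2", "eth3"]

# One media server, two streams that differ only in the source port.
print(layer34_link("10.0.0.5", 40000, "10.0.0.9", 2051, links))
print(layer34_link("10.0.0.5", 40001, "10.0.0.9", 2051, links))
```

The two streams land on different links even though both endpoints are the same pair of hosts, which is why this hash can work with as little as one media server when there are several connections.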

Net failover add veth0 interfaces eth2 eth3

This is not aggregation, but the command will group together interfaces eth2 and eth3 for failover. There is only one failover type supported. If the active physical link goes away, the data is sent to the second physical link. By default the active interface is whichever link comes up first when the group is set up. This is nondeterministic. It depends on several factors such as switch activity, network activity, and which interface is brought up first when they are enabled. The active interface can be made deterministic by specifying one of the links as primary. The primary interface will always be set as active if it is UP and RUNNING.

Functions available in Linux distribution

The following is a summary of the aggregation and failover modes and hashing available. A more complete description can be found in Documentation/networking/bonding.txt in the Linux distribution:

Mode Options

1. balance-rr or BOND_MODE_ROUNDROBIN (0)
    • Aggregation using Round Robin
    • Failover with degradation
    • Normally a good type to use with direct connect or something equivalent
    • To get full aggregation both ends of the link need to be set up to use round robin

2. active-backup or BOND_MODE_ACTIVEBACKUP (1)
    • Failover method used by Data Domain
    • Works only when one or more standby links are in the group
    • There is one active link and all others in the group are standby
    • The active link is non-deterministic unless a primary is specified

3. balance-xor or BOND_MODE_XOR (2)
    • Sends each transmission to a specific NIC based on the specified hash method
    • Default: (source MAC address XOR destination MAC address) modulo size of aggregation group
    • Note: this only aggregates transmissions.
    • The receive needs to be aggregated on the other end
    • This mode is referred to as static because of the manual setup that is needed.

Hash method used:

1. Layer 2
    • Uses (source MAC XOR destination MAC) modulo count of links in aggregation group
    • This works best when there are many hosts and they are connected to the same switch
    • All packets to or through a specific MAC address go through the same link

2. Layer 3+4
    • Uses ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
    • This works best with many connections and/or many media servers
    • This can work with as little as one media server
    • For packets that do not include the port number, such as IP fragmentation packets and non-TCP and non-UDP packets, this method will use the IP address only. For non-IP packets the Layer 2 mode is used. It is because of these exceptions that this is not IEEE compliant. Note that the Data Domain network configuration is set up so that packets are not fragmented.

The DD CLI simplifies the interface by not requiring the administrator to specify any more than is necessary. Therefore there is no option to specify balance-xor (item 3 in the list above) directly. Rather, when a hash method is specified, the CLI sets the mode to balance-xor.

The aggregation method used is very important to getting the desired performance. In general the aggregation of choice is “mode xor-L3L4” along with many streams, but if the DDR and media servers are directly connected and there are enough links to do aggregation then “mode roundrobin” may work best. There are some switches that do not support port number hashing. In this case “mode xor-L3L4” will not work.

Consider also that the best “aggregation” may be to have each media server use a different link instead of grouping them together. Consider the following example:

• four media servers
• each media server is sending data at the same time
• there are 4 links available on the DDR

Assign a different IP address to each link and set up each media server to send data to one unique IP address on the DDR. That way the throughput will approach 4 times a single link speed, versus around 2.5 times if aggregation is used. This is very dependent on the expected traffic pattern from the media servers.
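One way to see why hashed aggregation falls short of 4 times the link speed is a simple counting model. The sketch below is an assumption made here for illustration, not the calculation behind the "around 2.5" figure: it supposes each of the 4 streams is hashed independently and uniformly onto one of the 4 links, and counts how many links end up carrying traffic on average. The function name `expected_links_used` is hypothetical.

```python
from itertools import product


def expected_links_used(streams, links):
    """Average number of distinct links used when every stream is
    assigned to a link independently and uniformly (assumed model)."""
    total = 0
    count = 0
    # Enumerate every possible stream-to-link assignment.
    for assignment in product(range(links), repeat=streams):
        total += len(set(assignment))
        count += 1
    return total / count


# Four media servers hashed onto four links.
print(expected_links_used(4, 4))   # 2.734375
```

Under this model fewer than three links are busy on average, which is broadly in line with the document's estimate, while dedicating one link per media server keeps all four busy.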

Link failures

A link can fail at several places. The failure can occur in the driver, the wire, the switch, the router, or the remote system. For failover to work the program (this is the bonding module in the Data Domain case) must be able to determine that a link to the other side is down. This information is normally provided by the hardware driver.

For a simple case consider a direct connect where the wire is disconnected. The driver can sense that the carrier is down and will report this back to the bonding module. The bonding module will mark the link as down and switch to a different link. The bonding module will continue to monitor the link, and when it comes back up it will mark it as up. If the restored link is marked as the primary, the data will be switched back to using that link again. Otherwise the data flow will stay on the current link.
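The selection rules described for failover (primary wins when it is up, otherwise stay on the current link, otherwise take any link that is up) can be sketched as follows. The function `choose_active` and its parameters are hypothetical names used only for this illustration; the actual bonding module tracks more state than this.

```python
def choose_active(link_up, current=None, primary=None):
    """Return the link that should carry traffic, or None if all are down.

    link_up: dict mapping interface name to True (up) or False (down).
    current: the link carrying traffic now, if any.
    primary: the link configured as primary, if any.
    """
    if primary is not None and link_up.get(primary):
        return primary                      # primary always wins when up
    if current is not None and link_up.get(current):
        return current                      # no needless switch-back
    for name, up in link_up.items():
        if up:
            return name                     # fail over to any link that is up
    return None


# eth2 fails while active: traffic moves to eth3.
print(choose_active({"eth2": False, "eth3": True}, current="eth2"))
# eth2 is restored, no primary set: traffic stays on eth3.
print(choose_active({"eth2": True, "eth3": True}, current="eth3"))
# eth2 is restored and is the primary: traffic switches back to eth2.
print(choose_active({"eth2": True, "eth3": True}, current="eth3", primary="eth2"))
```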

Note: the failover method that is currently supported is for directly attached hardware. The driver can sense when the directly attached link is no longer functioning, but beyond that it gets harder. Consider the case where there is a switch, or maybe two, in the middle. Can the driver determine that the connection to the remote system has failed and that it therefore needs to switch to the backup? This is possible if the switch provides link fault signaling similar to what is defined in IEEE 802.3ae. This is supported by the Fujitsu 10GbE switch, and a similar feature is supported by Cisco. It applies to a rather limited network topology where the systems are directly connected via switches and there are no other routes available. This would be an extension of the direct connect to the media server. Currently the driver and the bonding module do not support link fault signaling because it is not widely available and applies to too limited a network topology.

For a more complex case consider the local network but with a switch and a router in the network path. There are at least two distinct paths that can be followed to get to the router. Failures have to be detectable on any part of each network path. For example, if there is a failure at the router port to which the DDR link is connected via the switch, the driver would have to be able to determine that the remote link is down and mark that link as down. In this case the switch itself would be able to switch the signal to the other path between the switch and the router, and a failover at the DDR is not needed. Once again, the DDR need only determine that there is a failure between its NIC and the switch or router to which it is attached.

There are two types of failover. One is failover to a standby. The standby is not used until a failure happens and the traffic is redirected to the standby link. This is a waste of resources if there is never a failover. This is the method used by Data Domain when the bonding method “failover” is specified:

Net failover add veth1 interfaces eth3

Another type of failover is failover with degradation. In this method there is no standby. All the links in the group are being used. If there is a failure, the failed link is removed and the network traffic from that link is redirected to the other links in the group. This is the failover associated with link aggregation, but it can become complex if the bonding driver has to determine that a path to the target system no longer exists and that it must stop sending data to that link.
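Failover with degradation can be pictured as re-running the link selection over the links that remain. The sketch below assumes each flow already has a numeric hash (as produced by one of the XOR methods) and simply takes it modulo the number of surviving links; the function `distribute`, the `failed` parameter, and the three-link group are hypothetical and chosen only for illustration.

```python
def distribute(flow_hashes, links, failed=()):
    """Map each flow hash to a link, skipping links marked as failed."""
    active = [link for link in links if link not in failed]
    if not active:
        raise RuntimeError("no link in the group is up")
    return {h: active[h % len(active)] for h in flow_hashes}


links = ["eth1", "eth2", "eth3"]
flows = range(6)

# All three links up: two flows per link.
print(distribute(flows, links))

# eth2 fails: its flows are spread over eth1 and eth3, three flows each.
print(distribute(flows, links, failed={"eth2"}))
```

The group keeps working with one less link, at the cost of each remaining link carrying more flows.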

Other Link Aggregation

The link aggregation used depends on the network equipment to which the DDR is connected and on the network topology. The equipment connected to the DDR could be a switch or router, or the target system. So it is important to understand what aggregation is provided by other systems. Most switches and routers support LACP link aggregation (IEEE 802.3ad standard). Some offer proprietary aggregation types. If they offer aggregation they support the XOR of Layer 2 to define which packet goes to which port.

Cisco

Some of the older Cisco switches and routers only support the older proprietary protocol, PAgP. The Data Domain system will not support this type of aggregation. Fortunately, the newer switches and routers support the IEEE 802.3ad standard. When using Cisco switches and routers, IEEE 802.3ad should be used with Layer 3 and 4 hashing. It may be possible in some cases to set the aggregation with PAgP to round robin, but that is not currently supported for the DDR when connected to a switch or a router because of throughput delays from potential packet ordering issues. At high speeds with fast retransmissions, out of order packets can generate many more packets, which would decrease the overall performance.

Nortel

Nortel supports an aggregation called Split Multi-Link Trunking, which uses LACP_AUTO mode link aggregation.

Sun

The initial version of Solaris 10 and earlier releases supported Sun Trunking. Later releases of Solaris 10 and beyond support the IEEE 802.3ad standard in communicating with switches. Back-to-back link aggregation is supported, in which two systems are directly connected over multiple ports. The balancing of the load can be done with L2 (MAC address), L3 (IP address), L4 (TCP port number), or any combination of these. Note that the DDR currently only supports L2 or L3+L4. Link aggregation can run in either passive mode or active mode. At least one side must be in active mode. The DDR always uses active mode.

Sun Trunking supports the round robin type of aggregation. This type of aggregation could be used if the DDR is connected directly to a Sun system.

For more information on Sun Aggregation refer to the following:

http://docs.sun.com/app/docs/doc/816-4554/fpjvl?l=en&q=%22link+aggregation%22&a=view

For more information on Sun Trunking refer to the following:

http://docs.sun.com/source/817-3374-11/preface.html

Windows

Microsoft’s view of link aggregation is that it is a switch problem or a hardware problem, so it should be handled by the switch/router and the NIC card. There is nothing in the OS that directly supports it. Rather, if the customer wants it they should get NIC cards that support it and either have a special driver to initiate it or use the switch to drive it. The current documentation for Windows Server 2008 refers to the support of PAgP, an old proprietary Cisco aggregation protocol:

http://blogs.technet.com/winserverperformance/

They also refer to Receive-Side Scaling (RSS):

http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx

This refers to a way to allocate a program to handle packets across NIC cards which are normally tied to specific CPUs. There are drivers from outside of Microsoft that provide at least passive IEEE 802.3ad support, if not active. Passive support means that the Windows system will respond to the IEEE 802.3ad protocol packets, but it will not generate them. For direct connect this may be the only way to have a directly connected aggregated link. The following link provides Microsoft’s view of servers for 2008:

http://technet2.microsoft.com/windowsserver2008/en/library/59e1e955-3159-41a1-b8fd-047defcbd3f41033.mspx?mfr=true

If the Windows server is not directly connected then it is not important to the DDR system if or how link aggregation is provided by Windows. That would be between the Windows server and the switch/router.


More specific information on which NIC cards support link aggregation is still to be determined.

AIX

According to the RSCT administration guide for AIX and Linux, AIX supports the EtherChannel and IEEE 802.3ad types of link aggregation:

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.rsct.doc/rsct_aix5l53/bl5adm05/bl5adm0559.html

When using the DDR, the round robin available through EtherChannel can be used when directly connected. IEEE 802.3ad can be used if Layer 4 hashing is included. If the system is not directly connected, the choice depends on the switch or router being used.

AIX uses a variant of EtherChannel for backup, referred to as EtherChannel backup. This is similar to the active backup supported by the Linux bonding driver and does not need any handshake from the equipment connected to the links except to have multiple links available.

HPUX

The link aggregation product is referred to as HP Auto Port Aggregation (APA). As with link bonding, this product provides either a full standby failover or a degradation failover that overloads the other links within an aggregation group. The aggregation can use Layer 2, Layer 3, and/or Layer 4 hashing for aggregating across the links. It also supports the IEEE 802.3ad standard. A summary of the product is given here:

http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=J4240AA

The administration guide can be found here:

http://docs.hp.com/en/J4240-90039/index.html

According to the administration guide, direct connect server to server is supported, but the round robin type of aggregation does not seem to be. This is further brought out in figure 3-4 in the document, where for direct connect it is recommended to have many connections “for load balancing to be effective”. With round robin, multiple connections are not required for effective aggregation. With this understanding, HPUX systems would not support round robin with a directly connected system.

Data Domain Link Aggregation and Failover in the Customer’s Environment

The history of the use of failover and link aggregation in the Data Domain products is as follows:

1. Failover was added in release 4.3

2. Link aggregation was added in release 4.4

3. Source routing was set as the default in release 4.5 to allow separate NICs to reside on the same network while response packets are still sent over the correct route. This means the following settings are the default:

net option set net.ipv4.route.src_ip_routing 1
net option set net.ipv4.route.flush 2

4. Since July 2008 a primary can be specified for failover. If the primary is up it will always be the active link.

5. Since the July 2008 releases (4.4.4, 4.5.2, and beyond) the on-board NICs can be included in link aggregation.

Normal Link Aggregation

Normally link aggregation is between the DDR system and the equipment it is connected to. For example, if the DDR is directly connected to a media server then the link aggregation is between the DDR and the media server. All the links in the aggregation bond would be directly connected to the media server, and the media server would need to provide an aggregation compatible with the type used in the DDR.


If the DDR is connected to a switch, the link aggregation is between the DDR and the switch. All the links in the aggregation bond would be directly connected to this switch, and the switch would need to be set up to handle the aggregation chosen. The type of aggregation done on the target system, which may be connected to this same switch or a different switch, is independent of the method used by the DDR.

There is a case in which there may be one or more switches between the DDR and the target system, but it is still considered a direct connection to the target server. The following diagram shows the network topology of this setup. Notice that there is a separate switch for each link and the switches do not communicate with each other. This is important because the IP address for each link on the DDR is the same. The target server would also have to have a similar setup, with the IP address at the media server being shared.

This setup may be used to handle distances that are too long for a direct connection when the user still wants to directly connect the two systems. In this case the aggregation handshake would be between the two end systems. It would be expected that round robin would be used, and it would have to be set up on both sides. There are some concerns with this setup in dealing with failures. If the link between the target system and the switch goes down, the local system would have to be able to detect this and send everything over the other link(s). For example, suppose the link between switch B and en4 is broken. The media server would sense that the carrier is lost and route the traffic to en5, but the driver for eth3 on the DDR would also have to be able to sense this and indicate a carrier down condition to the bonding module so the DDR would also route all the traffic through eth2. With the current software and switch hardware this is not done, and since the switches are isolated the packets would just get dropped.

Failover of NICs

The case of pure failover is different. In this case the bonded links do not necessarily need to be connected to the same switch or router, as long as all the links in the bond can transfer data to the target system. With failover the data is not split among the links. It is sent over one link only, and that link is referred to as the active link. A single virtual IP is shared across all links in the failover bond, but the MAC does not necessarily need to be the same. While the active link is up the other links are idle. When a failure occurs, the DDR sends the packets on another link and redirects received packets to the new active link through the use of ARP. To get the received packets to go to the active link, ARP is turned off for all the links except the active link and a gratuitous ARP is sent on the new active link. This updates the ARP cache in the associated switches and routers.
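The active-link selection described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not the bonding module's code: the class name FailoverBond, its methods, and the garp_log list (which merely records where a gratuitous ARP would be sent) are hypothetical. The rule that a primary, when up, is always the active link comes from the release history given earlier.

```python
class FailoverBond:
    """Toy model of a failover bond: one active link, the rest idle."""

    def __init__(self, links, primary=None):
        self.up = {name: True for name in links}  # link state per interface
        self.primary = primary                    # optional preferred link
        self.active = None
        self.garp_log = []                        # links that announced via gratuitous ARP
        self._select()

    def _select(self):
        candidates = [name for name, ok in self.up.items() if ok]
        if self.primary in candidates:
            new = self.primary                    # a primary that is up always wins
        elif self.active in candidates:
            new = self.active                     # otherwise stay on the current link
        else:
            new = candidates[0] if candidates else None
        if new != self.active:
            self.active = new
            if new is not None:
                self.garp_log.append(new)         # gratuitous ARP on the new active link

    def set_link(self, name, is_up):
        self.up[name] = is_up
        self._select()


bond = FailoverBond(["eth2", "eth3"], primary="eth2")
bond.set_link("eth2", False)   # active link fails, traffic moves to eth3
bond.set_link("eth2", True)    # primary returns and becomes active again
print(bond.active, bond.garp_log)
```

Without a primary, the model stays on whichever link is currently active after a recovery, which is one way to read the statement that the active link is non-deterministic unless a primary is specified.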

Failover Associated with Link Aggregation

When aggregation is being performed there is a limited failover capability. The bonding module can sense when a link is down; for link aggregation it will mark the link as down and remove it from the list of active aggregated links. This will also be conveyed to the associated switch or directly connected system through the aggregation network protocol. If an aggregation network protocol is not supported, the other side will also sense that a link is down and stop using it for aggregation. Once the link is brought back up, this condition will be sensed by the aggregation software and data will flow over the link again.

While a link within the aggregated group is down the data will be distributed among the remaining links. So communication will be maintained, but the throughput will be degraded from the level that can be achieved by the full aggregation. Note that this degradation may have a temporary impact on the retransmission of packets, but over time that will be corrected as timeout values adjust.
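The degraded behavior can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each flow already has an integer hash value and that a flow is mapped to a link by taking the hash modulo the number of links still marked up; the interface name eth4 and the function assign are hypothetical, and the actual redistribution in the bonding module may differ in detail.

```python
def assign(flow_hashes, active_links):
    """Map each flow hash to a link: hash modulo the number of active links."""
    return {h: active_links[h % len(active_links)] for h in flow_hashes}


flows = range(6)                                 # six hypothetical flow hash values
full = assign(flows, ["eth2", "eth3", "eth4"])   # all three links up
degraded = assign(flows, ["eth2", "eth4"])       # eth3 marked down and removed

print(full)
print(degraded)
```

Every flow is still carried in the degraded case, but the same traffic now shares two links instead of three, which is the throughput degradation described above.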

[Figure: network topology. The Data Domain appliance (eth2, eth3) connects to the backup/media servers (en5, en4) through two isolated switches, Network switch A and Network switch B.]


Recommended Link Aggregation

The following should be considered when deciding on aggregation. If no aggregation is to be done then failover should be considered; therefore the last choice given in each list is failover as an alternative to aggregation. Some important questions when considering aggregation:

How many media servers are actively doing backups simultaneously? If the number is less than 4, xor-L2 will not be very effective. As the number of aggregated links increases, the number of active clients also needs to increase if xor-L2 is to be used. With three aggregated links the number of active clients should be above 5.

What is the network topology? If the route goes through gateways, xor-L2 won’t work because the destination MAC will always be that of the gateway router.

Direct Connect
1. mode roundrobin (if it is supported by the media servers)
2. separate NIC per media server (if there are enough NICs)
3. mode xor-L3L4
4. failover (if aggregation cannot be used)

Private Network (fewer than 4 active clients or route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. mode xor-L2 (if there are a suitable number of clients)
4. failover (if aggregation cannot be used)

Private Network (more than 4 active clients and the network path has no gateways)
1. mode xor-L2
2. mode xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation cannot be used)

Local Network (fewer than 4 active clients or route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. mode xor-L2 (if there are a suitable number of active clients)
4. failover (if aggregation cannot be used)

Local Network (more than 4 active clients and the network path has no gateways)
1. mode xor-L2
2. mode xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation cannot be used)

Remote Network (normally through gateway and routers)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. failover (if aggregation cannot be used)
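The preference lists above can be captured in a small lookup function. This is a sketch only: the function name recommend, its parameters, and the topology keywords are hypothetical. The lists cover fewer than 4 and more than 4 active clients; treating exactly 4 clients like the smaller case is an assumption made here.

```python
from typing import List

SEPARATE = "separate NIC per media server"


def recommend(topology: str, active_clients: int = 0,
              gateways_in_path: bool = False) -> List[str]:
    """Return the aggregation choices for a topology, most preferred first."""
    if topology == "direct":
        return ["mode roundrobin", SEPARATE, "mode xor-L3L4", "failover"]
    if topology == "remote":
        return [SEPARATE, "mode xor-L3L4", "failover"]
    if topology in ("private", "local"):
        # Assumption: exactly 4 active clients is handled like the smaller case.
        if active_clients > 4 and not gateways_in_path:
            return ["mode xor-L2", "mode xor-L3L4", SEPARATE, "failover"]
        return [SEPARATE, "mode xor-L3L4", "mode xor-L2", "failover"]
    raise ValueError("unknown topology: " + topology)


print(recommend("local", active_clients=6))
print(recommend("private", active_clients=2))
```

The conditions attached to each choice (enough NICs, round robin support on the media servers, a suitable number of clients) still have to be checked by the administrator; the function only returns the order of preference.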

Switch information

Link aggregation is set up on both sides of a link. The aggregation method does not necessarily have to match on both sides of the link. For example, the DDR may be set to xor-L3L4 while the switch is set to src-ip. A good rule of thumb is to keep the aggregations close, such as xor-L3L4 on the DDR and src-dst-port on the switch. The reason for this is that if an aggregation is good enough for one direction it is good enough for the other direction.

Aggregation on the switch is used to distribute traffic being received by the DDR. If the main set of operations being done is backup, the switch aggregation is very important. Backup network traffic is mostly data being received by the DDR.


Because of the limited number of clients communicating with the DDR, the recommended aggregation method is balance-xor with Layer 3+4 hashing. To support this, the device directly connected to the DDR, e.g. a switch or router (see Normal Link Aggregation), needs to support src-dst-port or at least src-port load balancing. This section uses the vendor’s documentation to list switches that may work with Layer 3+4 hashing and also some that may not. There are no plans to validate or certify these. The final authority on whether a switch supports the desired aggregation is to physically try it. For example, there is at least one case where round robin was desired and tried, and it worked satisfactorily even though it is listed as not supported. Note again that even though round robin may be supported by a switch, the aggregation performance is poor or even worse than not having it. This is mostly due to out-of-order packets.

Note: There are few switches that support layer 3 + 4 aggregation. The supported aggregation may be for layer 3 only or layer 4 only. Matching layer 4 (port aggregation) with layer 3 + 4 (IP address and port aggregation) is not a problem, but be aware that it may cause data to be sent on one link and received on a different link. Out-of-order packets should not be a concern in that case: which link the data is sent on is not important as long as all the data associated with a connection is sent on the same link.

Definitions:

Dest := Destination
IP := IP address
L4 := Layer 4 of the network stack, i.e. TCP
MAC := MAC or hardware address
Port := TCP port number
Src := Source
SW := software

Switch brand & model           Switch vendor  Src  Dest  Src-Dest  Src  Dest  Src-Dest  Src      Dest     Src-Dest  Round
                               SW release     MAC  MAC   MAC       IP   IP    IP        L4 Port  L4 Port  L4 Port   Robin
Cisco Catalyst 6500 CatOS      8.6            Yes  Yes   Yes       Yes  Yes   Yes       Yes      Yes      Yes       No
Cisco Catalyst 6500 IOS        12.2SXF        Yes  Yes   Yes       Yes  Yes   Yes       Yes      Yes      Yes       No
Cisco Catalyst 3560            12.2(44)SE     Yes  Yes   Yes       Yes  Yes   Yes       No       No       No        No
Cisco Catalyst 2960            12.2(44)SE     Yes  Yes   Yes       Yes  Yes   Yes       No       No       No        No
Cisco Catalyst 3750            12.2(44)SE     Yes  Yes   Yes       Yes  Yes   Yes       No       No       No        No
Cisco Catalyst 4500/4948/4924  12.2(37)SG     Yes  Yes   Yes       Yes  Yes   Yes       Yes      Yes      Yes       No

For directly connected systems the support for round robin is as follows:

• Sun - yes
• AIX - yes, it can
• HPUX - no
• Windows - maybe, it depends on the NIC software, but don’t count on it.

Cisco Configuration

Set the etherchannel mode to “on”:
Manually set the ports to participate in the channel group.


DDR Configuration    Cisco Load Balance Configuration
xor-l3l4             src-dst-port
xor-l2               src-dst-mac


Appendix

This appendix gives more details as to what aggregation is normally offered by the Linux system being used by Data Domain. The other options are not made available because they do not provide better aggregation or failover than what is already available. It is expected that this section will be used by developers.

Data Domain uses the link aggregation and failover provided by the bonding module available in the Linux distribution. The bonding module was developed separately from the Linux OS, but is now provided with each distribution under drivers/net/bonding. For each mode used on the system a separate bonding module is loaded. To do this, each bonding module instance is tied to a specific virtual interface. The names used by Data Domain are: veth0, veth1, veth2, veth3. You can see these, along with all the physical interfaces available and which aggregation or failover is being used, with the command:

net show settings

Bond functions available in Linux distribution

The following is a summary of the bonding modes and hashing available. A more complete description can be found in Documentation/networking/bonding.txt in the Linux distribution:

Mode Options

1. balance-rr or BOND_MODE_ROUNDROBIN (0)
• Aggregation using Round Robin
• Failover with degradation
• Normally a good type to use with direct connect or something equivalent
• To get full aggregation both ends of the link need to be set up to use round robin

2. active-backup or BOND_MODE_ACTIVEBACKUP (1)
• Failover method used by Data Domain
• Works only when one or more standby links are in the group
• The active link is non-deterministic unless a primary is specified

3. balance-xor or BOND_MODE_XOR (2)
• Note: this only aggregates transmissions. The receive needs to be aggregated on the other end
• Sends each transmit to a specific NIC based on the specified hash method being used
• Default: (Source MAC address XOR Destination MAC address) modulo size of aggregation group
• This mode is used when mode XOR is specified in the CLI

4. broadcast or BOND_MODE_BROADCAST (3)
• Failover - send everything on all links in group
• This mode is not available when using the Data Domain shell CLI

5. 802.3ad or BOND_MODE_8023AD (4)
• IEEE dynamic link aggregation using the same hash as used by mode 3
• The aggregation is determined by the hash method chosen, layer 2 is the default
• Requires the driver to support ethtool
• Requires the switch to support the IEEE 802.3ad standard, specifically the protocol
• Requires the same IP address and MAC address across all the slaves
• All the slaves must run at the same speed and be connected to the same switch
• This mode is not available when using the Data Domain shell CLI

6. balance-tlb or BOND_MODE_TLB (5)
• Aggregation and failover, does not require switch support, but deals with transmit only.
• Aggregated according to load of each link and the link speed.
• Different speed links can be used.
• The MAC addresses across the links do not have to be the same.
• Matching MACs may be required for receive aggregation, but it is not tied to this.
• The associated drivers must support ethtool interface
• This is not currently supported by the Data Domain shell CLI.

7. balance-alb or BOND_MODE_ALB (6)
• Adaptive link aggregation for both transmit and receive without switch support
• Uses ARP to control which link the receive uses. Different MAC addresses are used for each link
• Links can be added to and removed from the aggregation group.
• The switch should not be using aggregation to this system.
• Links with different speeds can be used, but this is not recommended.
• The associated drivers must support ethtool interface
• This is not currently supported by the Data Domain shell CLI.

Hash method used:

1. Layer 2
• Uses (source MAC XOR destination MAC) modulo count of links in aggregation group
• This works best when there are many hosts and they are connected to the same switch
• All packets to or through a specific MAC address (first hop peer) go through the same link

2. Layer 3+4
• Uses ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
• This works best with many connections
• This can work with many systems or just two systems
• For packets that do not include the port number, such as IP fragmentation packets and non-TCP and non-UDP packets, this method will use the IP address. For non-IP packets the Layer 2 mode is used. It is because of these exceptions that this is not IEEE compliant. Note that the configuration is set up so that the packets are not fragmented.

3. Layer 2+3
• Uses ((source MAC XOR dest MAC) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
• This works best when there are many target hosts.
• Not available from the bonding version used by Data Domain
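The Layer 2 and Layer 3+4 formulas above can be written out as a short Python sketch. It follows the formulas as stated in this document and is not the bonding driver's code; the function names and the sample MAC addresses, IP addresses, and port numbers are assumptions chosen only for illustration.

```python
import ipaddress


def mac_to_int(mac: str) -> int:
    """Convert a MAC address such as '00:00:00:00:00:01' to an integer."""
    return int(mac.replace(":", ""), 16)


def layer2_hash(src_mac: str, dst_mac: str, link_count: int) -> int:
    """(source MAC XOR destination MAC) modulo count of links."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % link_count


def layer34_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 link_count: int) -> int:
    """((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff)) modulo count of links."""
    ip_part = (int(ipaddress.ip_address(src_ip)) ^
               int(ipaddress.ip_address(dst_ip))) & 0xFFFF
    return ((src_port ^ dst_port) ^ ip_part) % link_count


# Hypothetical peers on a two-link bond.
print(layer2_hash("00:00:00:00:00:01", "00:00:00:00:00:02", 2))
print(layer34_hash("10.0.0.1", "10.0.0.2", 1000, 2049, 2))
print(layer34_hash("10.0.0.1", "10.0.0.2", 1001, 2049, 2))
```

With Layer 2 hashing the result depends only on the two MAC addresses, so every packet between the same pair of first-hop peers uses the same link. With Layer 3+4 hashing, changing only the source port (as in the last two calls) can move a connection to a different link.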

Note, in some cases there is the comment that a feature is not supported by the CLI. This means that the feature is in the code, but cannot be activated through the CLI. In the case of hash method three, the feature is in later versions of the Linux code, but it is not in the current version used by Data Domain.

The CLI simplifies the interface by not requiring the administrator to specify any more than is necessary. Therefore there is no option to specify mode 5, 802.3ad, directly. Rather, by specifying the hash method to use, the CLI will set the mode to number 5. Modes 6 and 7 do not support hash mode 2. So if these are enabled in the CLI they would be referenced only by the mode name. For example, for mode 7 all that would be specified would be balance-alb.

The aggregation method used is very important to getting the desired performance. Consider the different network topologies that may be used, but especially consider the number of target systems. If there is only one target system then aggregation methods that use the MAC address or the IP address will not be effective because the addresses will always be the same. Also consider having a limited number of target systems. As long as the addresses allow the traffic to go over different links there will be some aggregation, but if one system has much more data than the other, or if the target systems are not transferring data at the same time, then the aggregation will not provide the desired performance. This is why it is recommended that the Layer 3+4 hash method be used along with many streams. The multiple streams will create multiple connections, and each connection will have at least one unique port number. If the port numbers are suitably distributed they will distribute the traffic across multiple links based on the port number.
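The effect of using many streams with a single target system can be illustrated with a short Python sketch. All values here are hypothetical: one media server and one DDR on a two-link bond, eight streams with consecutive source ports, and destination port 2049. The hash formulas are the ones listed under "Hash method used", with MAC and IP addresses written as plain integers for brevity.

```python
from collections import Counter


def l2_link(src_mac: int, dst_mac: int, links: int) -> int:
    """Layer 2 hash: (source MAC XOR destination MAC) modulo link count."""
    return (src_mac ^ dst_mac) % links


def l34_link(src_ip: int, dst_ip: int, src_port: int, dst_port: int, links: int) -> int:
    """Layer 3+4 hash: ports XOR low 16 bits of the IP XOR, modulo link count."""
    return ((src_port ^ dst_port) ^ ((src_ip ^ dst_ip) & 0xFFFF)) % links


LINKS = 2
SRC_MAC, DST_MAC = 0x001B21000001, 0x001B21000002  # hypothetical MAC addresses
SRC_IP, DST_IP = 0x0A000001, 0x0A000002            # 10.0.0.1 and 10.0.0.2
DST_PORT = 2049                                    # hypothetical service port
src_ports = range(40000, 40008)                    # eight streams, one port each

l2_usage = Counter(l2_link(SRC_MAC, DST_MAC, LINKS) for _ in src_ports)
l34_usage = Counter(l34_link(SRC_IP, DST_IP, p, DST_PORT, LINKS) for p in src_ports)

print("Layer 2:  ", dict(l2_usage))
print("Layer 3+4:", dict(l34_usage))
```

With these sample values the Layer 2 hash places all eight streams on one link, because the two MAC addresses never change, while the Layer 3+4 hash splits the streams four and four. Real port numbers are not guaranteed to be this evenly distributed.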