1/29/2002 CS 545 - Distributed Systems
Infiniband Architecture
Aniruddha Bohra
Distributed Applications and Data Transfer
  Traditional distributed applications
    Need low latency message delivery
    Data volume in transfers between nodes is not too high
  Server applications
    Need low latency and high bandwidth data transfers
    Data volumes in transfers are high, e.g. in cluster based storage or streaming multimedia servers
    Need reliable and available services
    Need easy maintenance
Traditional message send
  One kernel boundary crossing
  Two memory copies!!
  [Figure: Application (memory buffers) -> system call -> Kernel: TCP sendmsg (copy from user space) -> IP and lower layers (backup buffers) -> to NIC]
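The cost counted on this slide can be traced with a toy model. This is a sketch, not kernel code: it assumes the two copies are the copy out of user space and a copy into the "backup" buffers kept by the lower layers, which is one reading of the figure. The names `SendTrace` and `traditional_send` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SendTrace:
    """Counts the costs the slide highlights for one message send."""
    kernel_crossings: int = 0
    memory_copies: int = 0
    steps: List[str] = field(default_factory=list)


def traditional_send(payload: bytes) -> Tuple[bytes, SendTrace]:
    """Toy model of a sockets-style send; returns the bytes handed to the NIC."""
    trace = SendTrace()

    # The application enters the kernel once through a system call.
    trace.kernel_crossings += 1
    trace.steps.append("system call")

    # Copy 1: TCP sendmsg copies the data out of user space.
    kernel_buffer = bytes(bytearray(payload))
    trace.memory_copies += 1
    trace.steps.append("copy from user space")

    # Copy 2 (assumed): the lower layers keep a backup copy.
    backup_buffer = bytes(bytearray(kernel_buffer))
    trace.memory_copies += 1
    trace.steps.append("copy to backup buffer")

    trace.steps.append("hand to NIC")
    return backup_buffer, trace
```

Calling `traditional_send(b"hello")` returns the payload unchanged together with a trace showing one kernel crossing and two copies, which is the overhead that user level networking tries to remove.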
Lessons from parallel computing
  Co-processors that can access memory directly used for communication
    FLASH, J-Machine, Alewife
  User level networking
  Virtual Memory Mapped Communication
    U-Net
    VMMC
    VIA
Interconnect bottleneck
  Servers require high data transfer rate
    CPUs operate at GHz speed
    Gigabit ethernet is commonly used in cluster based servers
    Data volumes are high
  PCI bus is much slower
    operates at 32 bit/33 MHz or 64 bit/66 MHz
    the next generation bus PCI-X operates at 133 MHz
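The bus figures above translate into peak transfer rates with simple arithmetic: width in bits times clock rate, divided by 8 for bytes. The sketch below assumes one transfer per clock cycle and a 64 bit wide PCI-X bus (the slide gives only its clock rate); the function and table names are illustrative.

```python
from typing import Dict


def peak_bandwidth_mbytes(width_bits: int, clock_mhz: float) -> float:
    """Peak rate in MB/s, assuming one bus-width transfer per clock cycle."""
    return width_bits * clock_mhz / 8


BUSES = {
    "PCI 32 bit/33 MHz": (32, 33),
    "PCI 64 bit/66 MHz": (64, 66),
    "PCI-X 64 bit/133 MHz": (64, 133),  # 64 bit width is an assumption
}

# Line rate of one Gigabit ethernet link in one direction, in MB/s.
GIGABIT_ETHERNET_MBYTES = 1000 / 8


def report() -> Dict[str, float]:
    """Peak MB/s for every bus in the table."""
    return {name: peak_bandwidth_mbytes(w, c) for name, (w, c) in BUSES.items()}
```

Under these assumptions the 32 bit/33 MHz bus peaks at 132 MB/s, only slightly above the 125 MB/s of a single Gigabit ethernet link, and that bus is shared by every peripheral attached to it.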
Some solutions
  HyperTransport
    Runs at 800MHz full duplex
    Bridges with current buses and other HyperTransport buses
  3GIO
    Switch based
    Provides a layered implementation
    Promises more than 40 Gb/s transfer rate
More problems with bus based interconnects
  Cannot keep up with the increasing CPU and peripheral speed
  Bus is shared between all peripherals
  The pin count is high – PCB space is limited!
  Buses are not able to extend to long distances
  Do not support a large number of devices
Outline
  Motivation and background
  Infiniband architecture
  Infiniband components
  Infiniband operation
  Other Infiniband features
  Status
  Summary
Infiniband Architecture
  Provides switch based interconnect
    Increased reliability
    Scalable and easily maintainable
  Supports memory to memory communication
    Low latency communication
  Provides support for “out of box” components
    Scalable
    Easier to manage and operate
  Is complementary to the 3GIO and HyperTransport buses
What is Infiniband?
  Infiniband Architecture (IBA) defines a System Area Network (SAN)
    IBA SAN is a communications and management infrastructure for I/O and IPC
  IBA defines a switched communications fabric
    high bandwidth and low latency
    protected, remotely managed environment
  IBA hardware off-loads from the CPU much of the I/O communications operation.
An IBA SAN
  [Figure not recovered]
Topologies and components
  IBA serves as an interconnect for endnodes
  A node can be a processor node, an I/O unit and/or a router to another network
  [Figure: nodes attached to the Infiniband Fabric]
Topologies and Components
  An IBA network is subdivided into subnets interconnected by routers
    Endnodes can attach to a single or multiple subnets
  An IBA subnet is composed of endnodes, switches, routers and subnet managers
    Each IBT device may attach to a single switch or multiple switches and/or directly with each other
IBT device – processor node
  [Figure labels: Consumer (x3), Message and Data Service, Verbs, Channel Adapter (endnode) with two ports (x2)]
Processor node
  Each channel adapter constitutes a node on the fabric
    Architecture supports multiple channel adapters per unit with each adapter providing one or more ports to the fabric
  Message and Data service is an OS component
  Verbs describe the functions to configure, manage and operate a host channel adapter
    Verbs are not an API but provide the framework for the OS to specify it
Channel Adapter
  An IBA channel adapter (CA) is a programmable DMA engine with special protection features that allow DMA operations to be initiated locally and remotely.
  Host Channel Adapter (HCA) provides a consumer interface providing the functions specified by IBA verbs.
  Target Channel Adapter (TCA) provides an interface to the device
Channel Adapter
  [Figure not recovered]
Addressing in IBA
  Each endnode has one or more CAs and each CA has one or more ports
  Each Queue Pair (QP) has a QP number (QPN) assigned by the CA
  Each port has a unique Local ID (LID) and at least one IPv6 address – Global ID (GID)
Switches
  Do not generate or consume packets – pass them along based on the destination address
  Are the routing components for intra-subnet routing – support uni or multicast
  Every destination is configured with one or more unique Local IDs (LIDs)
  Subnet manager configures switches including loading their forwarding tables
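The switch behaviour described above amounts to a table lookup keyed by destination LID. The following is a minimal sketch of the unicast case only (multicast is not modelled); the class and method names are hypothetical and do not come from the IBA specification.

```python
from typing import Dict, Optional


class Switch:
    """Toy intra-subnet switch: forwards by destination LID only."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Destination LID -> output port number.
        self.forwarding_table: Dict[int, int] = {}

    def load_forwarding_table(self, table: Dict[int, int]) -> None:
        """Called by the subnet manager; the switch does not build this itself."""
        self.forwarding_table = dict(table)

    def forward(self, dest_lid: int) -> Optional[int]:
        """Return the output port for a packet, or None if the LID is unknown."""
        return self.forwarding_table.get(dest_lid)
```

A subnet manager would call `load_forwarding_table({5: 1, 6: 2})`, after which `forward(5)` selects port 1 and an unconfigured LID yields `None`, which this sketch treats as a drop.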
Routers
  Routers are inter-subnet routing elements
  Routers forward packets based on the packet’s global route header
  Routers expose one or more ports between which packets are relayed
  IPv6 specifies the protocol performed between routers to derive their routing tables
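Since GIDs are IPv6 addresses, inter-subnet forwarding can be pictured as matching the destination GID in the global route header against subnet prefixes. The sketch below is an assumption-laden illustration: the `Router` class, its static route table and the example prefixes are hypothetical, and no routing protocol is modelled.

```python
import ipaddress
from typing import Dict, Optional


class Router:
    """Toy inter-subnet router keyed by the subnet prefix of the destination GID."""

    def __init__(self) -> None:
        # Subnet prefix -> output port number.
        self.routes: Dict[ipaddress.IPv6Network, int] = {}

    def add_route(self, prefix: str, port: int) -> None:
        self.routes[ipaddress.IPv6Network(prefix)] = port

    def forward(self, dest_gid: str) -> Optional[int]:
        """Pick the port whose subnet prefix contains the destination GID."""
        address = ipaddress.IPv6Address(dest_gid)
        for network, port in self.routes.items():
            if address in network:
                return port
        return None
```

With routes for `fec0:0:0:1::/64` on port 1 and `fec0:0:0:2::/64` on port 2, a packet addressed to `fec0:0:0:2::5` leaves through port 2, while a destination outside both prefixes has no route.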
Subnet Managers
  A Subnet Manager (SM) is an entity attached to a subnet responsible for its management
  Tasks
    Discover topology
    Configure the CA port with a range of LIDs, GIDs, subnet prefix and Partition_Keys
    Maintain LID/GID resolution tables
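The subnet manager tasks listed above can be sketched as a sweep of the fabric followed by address assignment. This is a simplified illustration, not the IBA management protocol: it treats every node as a single port, assigns one LID per port in discovery order, and forms each GID from a 64 bit subnet prefix plus a 64 bit port identifier. The function names and the example values are assumptions.

```python
import ipaddress
from collections import deque
from typing import Dict, List, Tuple


def discover(links: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first sweep of the fabric, starting at the SM's own port."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


def assign_addresses(
    ports: List[str], subnet_prefix: int, guids: Dict[str, int]
) -> Dict[str, Tuple[int, ipaddress.IPv6Address]]:
    """Build a port -> (LID, GID) resolution table."""
    table = {}
    for lid, port in enumerate(ports, start=1):
        # GID = subnet prefix in the upper 64 bits, port identifier in the lower 64.
        gid = ipaddress.IPv6Address((subnet_prefix << 64) | guids[port])
        table[port] = (lid, gid)
    return table
```

For a fabric where `sm` and two endnodes `a` and `b` hang off one switch `sw`, `discover` visits `sm`, `sw`, `a`, `b` in that order, and `assign_addresses` then hands out LIDs 1 through 4 and the matching GIDs.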
Communication
  Queuing
    Consumer queues up a set of instructions for hardware to execute (Work queue).
    Work queues are created in pairs (Queue pairs – QP) for send and receive operations
    Each Work Queue has a corresponding Completion Queue
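The queuing model above can be sketched in a few lines. This is a toy illustration with hypothetical class and method names, not the verbs interface: `process_send` stands in for the hardware, and the sketch assumes the peer has already posted a receive.

```python
from collections import deque
from typing import Optional, Tuple


class CompletionQueue:
    """Holds completion entries until the consumer polls for them."""

    def __init__(self) -> None:
        self.entries: deque = deque()

    def poll(self) -> Optional[Tuple]:
        return self.entries.popleft() if self.entries else None


class QueuePair:
    """A send work queue and a receive work queue, each with a completion queue."""

    def __init__(self, qpn: int, send_cq: CompletionQueue, recv_cq: CompletionQueue) -> None:
        self.qpn = qpn
        self.send_queue: deque = deque()
        self.recv_queue: deque = deque()
        self.send_cq = send_cq
        self.recv_cq = recv_cq

    def post_send(self, wr_id: int, payload: bytes) -> None:
        self.send_queue.append((wr_id, payload))

    def post_recv(self, wr_id: int) -> None:
        self.recv_queue.append(wr_id)

    def process_send(self, peer: "QueuePair") -> None:
        """Stand-in for hardware: execute the oldest send work request."""
        wr_id, payload = self.send_queue.popleft()
        recv_wr_id = peer.recv_queue.popleft()  # assumes a receive was posted
        peer.recv_cq.entries.append((recv_wr_id, payload))
        self.send_cq.entries.append((wr_id, len(payload)))
```

The consumer only posts work requests and polls completion queues; everything in between is left to the adapter, which is where the CPU off-load comes from.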
Work Queue Operations
  Send operations
    SEND
      Block in memory space to send to destination
    RDMA
      RDMA_READ, RDMA_WRITE, ATOMIC
    Memory Binding
      Alters the memory binding relationship – gives the R_KEY to components which allows secure DMA
  Receive operation
    Specifies a receive data buffer
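How an R_KEY gates remote DMA can be illustrated with a toy channel adapter. The sketch below is an assumption-heavy model: the `MemoryRegion` and `ChannelAdapter` classes are hypothetical, the key is just a random 32 bit number, and ATOMIC operations are left out.

```python
import secrets
from typing import Dict


class MemoryRegion:
    """Registered memory plus the key that a remote peer must present."""

    def __init__(self, size: int) -> None:
        self.buffer = bytearray(size)
        self.r_key = secrets.randbits(32)


class ChannelAdapter:
    """Toy adapter that checks the R_KEY before touching registered memory."""

    def __init__(self) -> None:
        self.regions: Dict[int, MemoryRegion] = {}

    def register_memory(self, size: int) -> MemoryRegion:
        region = MemoryRegion(size)
        self.regions[region.r_key] = region
        return region

    def _lookup(self, r_key: int, offset: int, length: int) -> MemoryRegion:
        region = self.regions.get(r_key)
        if region is None:
            raise PermissionError("unknown R_KEY")
        if offset < 0 or offset + length > len(region.buffer):
            raise ValueError("access outside the registered region")
        return region

    def rdma_write(self, r_key: int, offset: int, data: bytes) -> None:
        region = self._lookup(r_key, offset, len(data))
        region.buffer[offset:offset + len(data)] = data

    def rdma_read(self, r_key: int, offset: int, length: int) -> bytes:
        region = self._lookup(r_key, offset, length)
        return bytes(region.buffer[offset:offset + length])
```

The owner of the memory hands `region.r_key` to a peer it trusts; a request carrying any other key is rejected without the host CPU taking part in the transfer.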
Work Queue Operations
  [Figure not recovered]
Communication Stack
  [Figure not recovered]
Keys
  Keys are used to provide isolation and protection
    M_KEY
      Enforces the control of a master Subnet Manager
    B_KEY
      Enforces control of a subnet baseboard manager
    P_KEY
      Enforces membership in a partition
    Q_KEY
      Enforces access rights for reliable or unreliable datagram service
    L_KEY and R_KEY
      Provide access rights to registered memory: L_KEY for local access, R_KEY for remote access
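Two of these keys can be pictured as checks applied to an incoming datagram. The sketch below is a simplification under stated assumptions: the `Datagram` and `ReceivingQP` names are hypothetical, the P_KEY check is plain set membership, and the Q_KEY check is plain equality.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Datagram:
    p_key: int
    q_key: int
    payload: bytes


class ReceivingQP:
    """Toy receive side that drops datagrams failing either key check."""

    def __init__(self, port_p_keys: Iterable[int], q_key: int) -> None:
        self.port_p_keys = set(port_p_keys)  # keys configured on the port
        self.q_key = q_key
        self.delivered: List[bytes] = []

    def receive(self, datagram: Datagram) -> bool:
        if datagram.p_key not in self.port_p_keys:
            return False  # sender is outside this port's partitions
        if datagram.q_key != self.q_key:
            return False  # sender lacks access rights to this queue pair
        self.delivered.append(datagram.payload)
        return True
```

Only a datagram that carries both a P_KEY known to the port and the queue pair's Q_KEY reaches the consumer; everything else is discarded before delivery.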
Virtual Lanes
  A virtual lane represents a set of transmit and receive buffers in a port
  VL15 is used for subnet management
  Each port must have at least one data VL
  Separate flow control is maintained over each VL
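Separate flow control per lane means that one stalled lane does not block the others. The sketch below models this with a simple credit counter per lane; the credit scheme, the class names and the choice to give VL15 the same treatment as data lanes are assumptions made for illustration.

```python
from typing import Dict, List


class VirtualLane:
    """One set of buffers with its own credit count."""

    def __init__(self, credits: int) -> None:
        self.credits = credits  # free receive buffers at the peer
        self.in_flight: List[str] = []

    def try_send(self, packet: str) -> bool:
        if self.credits == 0:
            return False  # this lane is stalled; other lanes are unaffected
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def receiver_drained(self) -> None:
        """The peer consumed one packet and returned its credit."""
        if self.in_flight:
            self.in_flight.pop(0)
            self.credits += 1


class Port:
    MANAGEMENT_VL = 15

    def __init__(self, data_vls: int, credits_per_vl: int) -> None:
        if not 1 <= data_vls <= 15:
            raise ValueError("a port needs between 1 and 15 data VLs")
        self.lanes: Dict[int, VirtualLane] = {
            vl: VirtualLane(credits_per_vl) for vl in range(data_vls)
        }
        self.lanes[self.MANAGEMENT_VL] = VirtualLane(credits_per_vl)
```

With `Port(2, 1)`, a second packet on VL0 is refused until the receiver drains the first, while VL1 and VL15 keep accepting traffic.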
Service Levels
  Service levels (SLs) are maintained by attaching a VL to an SL
  IBA does not specify any QoS levels (e.g. best effort)
  The SMA must keep a mapping of Service Level to Virtual Lane and propagate it through the switch
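The mapping of service levels to virtual lanes is a small table. The sketch below assumes 16 service levels and spreads them over the available data VLs round-robin; that policy is purely illustrative, since the slide notes that IBA leaves QoS levels undefined, and the function names are hypothetical.

```python
from typing import Dict

NUM_SERVICE_LEVELS = 16  # assumed size of the SL space
MANAGEMENT_VL = 15


def build_sl_to_vl(num_data_vls: int) -> Dict[int, int]:
    """Map every service level onto one of the data VLs (never VL15)."""
    if not 1 <= num_data_vls <= 15:
        raise ValueError("need between 1 and 15 data VLs")
    return {sl: sl % num_data_vls for sl in range(NUM_SERVICE_LEVELS)}


def select_vl(sl_to_vl: Dict[int, int], service_level: int) -> int:
    """Lane a packet with the given service level is placed on."""
    return sl_to_vl[service_level]
```

A port with four data VLs would place service level 5 on VL1, and with a single data VL every service level collapses onto VL0.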
Status
  Intel Developer Forum had several status talks
    http://www.intel.com/idf/us
  IBA enabled network storage has been demonstrated at industry shows
    Banderacom
    Windriver
  The first products are expected to be in the market by the middle of 2002
Summary
  Future bandwidth requirements for servers will make the interconnect a bottleneck – IBA is an attempt to alleviate the problem
  IBA provides a thorough migration from a bus based to a switch based architecture while maintaining interoperability
  Further deployment is needed to reveal other issues that would arise in operation