Infiniband Architecture
Aniruddha Bohra
CS 545 - Distributed Systems, 1/29/2002

Page 2: Distributed Applications and Data Transfer

- Traditional distributed applications
  - Need low-latency message delivery
  - Data volume in transfers between nodes is not too high
- Server applications
  - Need low-latency, high-bandwidth data transfers
  - Data volumes in transfers are high, e.g. in cluster-based storage or streaming multimedia servers
  - Need reliable and available services
  - Need easy maintenance

Page 3: Traditional message send

- One kernel boundary crossing
- Two memory copies!!

[Figure: send path. Application memory buffers -> system call -> kernel: TCP sendmsg (copy from user space) -> IP and lower layers (backup buffers) -> to NIC]

Page 4: Lessons from parallel computing

- Co-processors that can access memory directly are used for communication
  - FLASH, J-Machine, Alewife
- User level networking
  - Virtual Memory Mapped Communication
  - Unet, VMMC, VIA

Page 5: Interconnect bottleneck

- Servers require high data transfer rates
  - CPUs operate at GHz speeds
  - Gigabit Ethernet is commonly used in cluster-based servers
  - Data volumes are high
- The PCI bus is much slower
  - Operates at 32 bit/33 MHz or 64 bit/66 MHz
  - The next-generation bus, PCI-X, operates at 133 MHz
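
To put rough numbers on the bus figures above, peak bandwidth can be estimated as bus width times clock rate. The sketch below uses nominal clocks and ignores protocol overhead; the 64-bit width for PCI-X is an assumption, since the slide gives only its clock, and the function and table names are illustrative.

```python
# Peak bus bandwidth in Mb/s = bus width (bits) x nominal clock (MHz).
# Assumption: PCI-X is taken to be 64 bits wide; the slide states only 133 MHz.
BUSES = {
    "PCI 32 bit/33 MHz": (32, 33),
    "PCI 64 bit/66 MHz": (64, 66),
    "PCI-X 64 bit/133 MHz": (64, 133),
}

GIGABIT_ETHERNET_MBPS = 1000


def peak_mbps(width_bits, clock_mhz):
    """Theoretical peak in megabits per second, ignoring protocol overhead."""
    return width_bits * clock_mhz


for name, (width, clock) in BUSES.items():
    mbps = peak_mbps(width, clock)
    ratio = mbps / GIGABIT_ETHERNET_MBPS
    print(f"{name}: {mbps} Mb/s ({mbps / 8:.0f} MB/s), "
          f"{ratio:.2f}x one Gigabit Ethernet link")
```

Because the bus is shared by every device attached to it, a 32 bit/33 MHz PCI bus has little headroom left once a single Gigabit Ethernet adapter runs at line rate.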

Page 6: Some solutions

- HyperTransport
  - Runs at 800 MHz full duplex
  - Bridges with current buses and other HyperTransport buses
- 3GIO
  - Switch based
  - Provides a layered implementation
  - Promises more than 40 Gb/s transfer rate

Page 7: More problems with bus based interconnects

- Cannot keep up with increasing CPU and peripheral speeds
- The bus is shared between all peripherals
- The pin count is high, and PCB space is limited!
- Buses cannot extend over long distances
- Do not support a large number of devices

Page 8: Outline

- Motivation and background
- Infiniband architecture
- Infiniband components
- Infiniband operation
- Other Infiniband features
- Status
- Summary

Page 9: Infiniband Architecture

- Provides switch based interconnect
  - Increased reliability
  - Scalable and easily maintainable
- Supports memory to memory communication
  - Low latency communication
- Provides support for "out of box" components
  - Scalable
  - Easier to manage and operate
- Is complementary to the 3GIO and HyperTransport buses

Page 10: What is Infiniband?

- Infiniband Architecture (IBA) defines a System Area Network (SAN)
  - An IBA SAN is a communications and management infrastructure for I/O and IPC
- IBA defines a switched communications fabric
  - High bandwidth and low latency
  - Protected, remotely managed environment
- IBA hardware off-loads much of the I/O communications operation from the CPU

Page 11: An IBA SAN

[Figure: an IBA SAN]

Page 13: Topologies and components

- IBA serves as an interconnect for endnodes
- A node can be a processor node, an I/O unit and/or a router to another network

[Figure: nodes attached to the Infiniband fabric]

Page 14: Topologies and Components

- An IBA network is subdivided into subnets interconnected by routers
  - Endnodes can attach to a single subnet or to multiple subnets
- An IBA subnet is composed of endnodes, switches, routers and subnet managers
  - Each IBT device may attach to a single switch or to multiple switches, and/or devices may attach directly to each other

Page 15: IBT device (processor node)

[Figure: processor node. Labels: consumers, verbs, message and data service, and two channel adapters (endnodes), each with two ports]

Page 16: Processor node

- Each channel adapter constitutes a node on the fabric
  - The architecture supports multiple channel adapters per unit, with each adapter providing one or more ports to the fabric
- The message and data service is an OS component
- Verbs describe the functions to configure, manage and operate a host channel adapter
  - Verbs are not an API, but provide the framework for the OS to specify one

Page 17: Channel Adapter

- An IBA channel adapter (CA) is a programmable DMA engine with special protection features that allow DMA operations to be initiated locally and remotely.
- A Host Channel Adapter (HCA) provides a consumer interface with the functions specified by IBA verbs.
- A Target Channel Adapter (TCA) provides an interface to the device.

Page 18: Channel Adapter

[Figure: channel adapter]

Page 19: Addressing in IBA

- Each endnode has one or more CAs, and each CA has one or more ports
- Each Queue Pair (QP) has a QP number (QPN) assigned by the CA
- Each port has a Local ID (LID), unique within its subnet, and at least one Global ID (GID) in IPv6 address format
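
The addressing hierarchy above (endnode, CA, port, QP) can be pictured as a small data model. This is only an illustrative sketch: the class and field names are hypothetical, and starting the QPN counter at 2 assumes that QP0 and QP1 are set aside for management traffic.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List


@dataclass
class Port:
    lid: int         # Local ID
    gids: List[str]  # one or more Global IDs, written in IPv6 notation


@dataclass
class ChannelAdapter:
    ports: List[Port]
    # Assumption: numbering starts at 2, leaving QP0 and QP1 for management.
    _qpn_counter: Iterator[int] = field(
        default_factory=lambda: count(2), repr=False
    )

    def create_qp(self) -> int:
        """Return a new QP number, unique within this channel adapter."""
        return next(self._qpn_counter)


@dataclass
class Endnode:
    channel_adapters: List[ChannelAdapter]


ca = ChannelAdapter(ports=[Port(lid=0x12, gids=["fe80::2:c903:0:1"])])
node = Endnode(channel_adapters=[ca])
qpn = ca.create_qp()

# A remote peer would name this queue pair by (port LID or GID, QPN).
print((ca.ports[0].lid, qpn))
```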

Page 20: Switches

- Do not generate or consume packets; they pass them along based on the destination address
- Are the routing components for intra-subnet routing, and support unicast or multicast
- Every destination is configured with one or more unique Local IDs (LIDs)
- The subnet manager configures switches, including loading their forwarding tables
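
A switch's job as described above reduces to a table lookup keyed by destination LID. The following is a minimal sketch, not a model of real switch hardware: the `Switch` class, its method names, and the dict-based packet are all hypothetical.

```python
class Switch:
    """Forwards packets by destination LID; never originates or consumes them."""

    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}  # destination LID -> output port number

    def load_forwarding_table(self, table):
        """Called by the subnet manager in this sketch (hypothetical method)."""
        self.forwarding_table = dict(table)

    def forward(self, packet):
        """Return the output port for the packet, or None if the LID is unknown."""
        return self.forwarding_table.get(packet["dlid"])


switch = Switch("sw1")
switch.load_forwarding_table({7: 1, 9: 2})
print(switch.forward({"dlid": 9, "payload": b"hello"}))
```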

Page 21: Routers

- Routers are inter-subnet routing elements
- Routers forward packets based on the packet's global route header
- Routers expose one or more ports between which packets are relayed
- IPv6 specifies the protocol performed between routers to derive their routing tables
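
Forwarding on the global route header can be sketched as a lookup on the destination GID. The sketch assumes the upper 64 bits of the 128-bit GID act as a subnet prefix; the `Router` class, the dict-based header, and the example prefixes are hypothetical.

```python
import ipaddress


def subnet_prefix(gid):
    """Upper 64 bits of a 128-bit GID written in IPv6 notation."""
    return int(ipaddress.IPv6Address(gid)) >> 64


class Router:
    """Relays packets between its ports based on the destination GID."""

    def __init__(self):
        self.routes = {}  # subnet prefix (int) -> output port number

    def add_route(self, prefix_gid, port):
        self.routes[subnet_prefix(prefix_gid)] = port

    def forward(self, global_route_header):
        """Return the output port, or None if no route matches."""
        return self.routes.get(subnet_prefix(global_route_header["dgid"]))


router = Router()
router.add_route("fec0:0:0:1::", port=1)
router.add_route("fec0:0:0:2::", port=2)
print(router.forward({"dgid": "fec0:0:0:2::abcd"}))
```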

Page 22: Subnet Managers

- A Subnet Manager (SM) is an entity attached to a subnet that is responsible for its management
- Tasks
  - Discover the topology
  - Configure each CA port with a range of LIDs, GIDs, the subnet prefix and Partition_Keys
  - Maintain LID/GID resolution tables
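
The first two tasks can be illustrated with a toy model: walk the fabric from the SM's own node, then hand out LIDs in discovery order. This is a sketch only. A real subnet manager exchanges management packets with each device, whereas here the topology is a plain adjacency dict and the function names are hypothetical.

```python
from collections import deque


def discover(links, start):
    """Breadth-first walk of the fabric, starting from the SM's own node."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def assign_lids(nodes, first_lid=1):
    """Give each discovered node a distinct LID (one per node in this sketch)."""
    return {node: lid for lid, node in enumerate(nodes, start=first_lid)}


links = {
    "sm": ["sw1"],
    "sw1": ["sm", "hostA", "hostB"],
    "hostA": ["sw1"],
    "hostB": ["sw1"],
}
order = discover(links, "sm")
lids = assign_lids(order)
print(lids)
```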

Page 24: Communication

- Queuing
  - The consumer queues up a set of instructions for the hardware to execute (work queue)
  - Work queues are created in pairs (Queue Pairs, QP) for send and receive operations
  - Each work queue has a corresponding Completion Queue
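
The queuing model can be sketched in a few lines: the consumer posts work requests, something standing in for the hardware executes them, and results appear on a completion queue. All names below are hypothetical, and for brevity both work queues of a pair report to a single completion queue.

```python
from collections import deque


class QueuePair:
    """A send work queue and a receive work queue reporting to one completion queue."""

    def __init__(self, completion_queue):
        self.send_queue = deque()
        self.receive_queue = deque()
        self.completion_queue = completion_queue

    def post_send(self, wr_id, data):
        self.send_queue.append({"wr_id": wr_id, "data": bytes(data)})

    def post_receive(self, wr_id, buffer):
        self.receive_queue.append({"wr_id": wr_id, "buffer": buffer})


def execute_send(sender, receiver):
    """Stand-in for the hardware: match the oldest send with the oldest receive."""
    send_wr = sender.send_queue.popleft()
    recv_wr = receiver.receive_queue.popleft()
    data, buffer = send_wr["data"], recv_wr["buffer"]
    fits = len(data) <= len(buffer)
    if fits:
        buffer[:len(data)] = data  # place the message in the posted buffer
    status = "success" if fits else "error"
    sender.completion_queue.append({"wr_id": send_wr["wr_id"], "status": status})
    receiver.completion_queue.append({"wr_id": recv_wr["wr_id"], "status": status})


cq_a, cq_b = deque(), deque()
a, b = QueuePair(cq_a), QueuePair(cq_b)
inbox = bytearray(16)
b.post_receive(wr_id=1, buffer=inbox)
a.post_send(wr_id=7, data=b"hello")
execute_send(a, b)
print(bytes(inbox[:5]), cq_a[0], cq_b[0])
```

The consumer never blocks in this model: it learns the outcome of a request by reading the completion queue.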

Page 25: Work Queue Operations

- Send operations
  - SEND
    - Block in memory space to send to destination
  - RDMA
    - RDMA_READ, RDMA_WRITE, ATOMIC
  - Memory Binding
    - Alters the memory binding relationship: gives the R_KEY to components, which allows secure DMA
- Receive operation
  - Specifies a receive data buffer
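
The role of the R_KEY in an RDMA write can be shown with a small model: the target registers a memory region, hands the key to a peer, and the adapter refuses any write that does not present a known key or that falls outside the region. The class names and the 32-bit random key are assumptions made for illustration.

```python
import secrets


class RegisteredMemory:
    def __init__(self, size):
        self.buffer = bytearray(size)
        # Assumption: the key is modeled as a random 32-bit value.
        self.r_key = secrets.randbits(32)


class TargetAdapter:
    """Checks the R_KEY and bounds before touching registered memory."""

    def __init__(self):
        self.regions = {}  # r_key -> RegisteredMemory

    def register(self, size):
        region = RegisteredMemory(size)
        self.regions[region.r_key] = region
        return region

    def rdma_write(self, r_key, offset, data):
        region = self.regions.get(r_key)
        if region is None:
            raise PermissionError("unknown R_KEY")
        if offset < 0 or offset + len(data) > len(region.buffer):
            raise ValueError("write falls outside the registered region")
        region.buffer[offset:offset + len(data)] = data


adapter = TargetAdapter()
region = adapter.register(16)
adapter.rdma_write(region.r_key, 4, b"data")
print(bytes(region.buffer))
```

No receive work request is consumed on the target side in this model, which is the point of RDMA: the remote CPU is not involved in the transfer.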

Page 26: Work Queue Operations

[Figure: work queue operations]

Page 27: Communication Stack

[Figure: communication stack]

Page 28: Keys

- Keys are used to provide isolation and protection
  - M_KEY
    - Enforces the control of a master Subnet Manager
  - B_KEY
    - Enforces the control of a subnet baseboard manager
  - P_KEY
    - Enforces membership in a partition
  - Q_KEY
    - Enforces access rights for reliable or unreliable datagram service
  - L_KEY and R_KEY
    - Provide access rights to registered memory (L_KEY for local access, R_KEY for remote access)
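
As one example of how a key provides isolation, the P_KEY check can be pictured as a table lookup at the receiving port: a packet is accepted only if the key it carries appears in the port's table. This is a simplified sketch. The class name, the dict-based packet, and the key values are hypothetical, and any distinction between membership types is ignored.

```python
class PartitionTable:
    """Per-port table of P_KEY values the port belongs to."""

    def __init__(self, p_keys):
        self.p_keys = set(p_keys)

    def accepts(self, packet):
        """Accept the packet only if its P_KEY matches an entry in the table."""
        return packet["p_key"] in self.p_keys


storage_port = PartitionTable({0x8001})
same_group = {"p_key": 0x8001, "payload": b"read block 7"}
other_group = {"p_key": 0x8002, "payload": b"read block 7"}
print(storage_port.accepts(same_group), storage_port.accepts(other_group))
```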

Page 30: Virtual Lanes

- A virtual lane (VL) represents a set of transmit and receive buffers in a port
- VL15 is used for subnet management
- Each port must have at least one data VL
- Separate flow control is maintained over each VL
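
Per-lane flow control means that one congested lane does not stall the others. The sketch below models this with a simple credit counter per data VL; the credit scheme, class names, and method names are assumptions for illustration, and the management lane (VL15) is left out.

```python
class VirtualLane:
    def __init__(self, credits):
        self.credits = credits  # free receive buffers advertised by the peer
        self.in_flight = []


class LinkPort:
    """One port with several data VLs, each tracking its own credits."""

    def __init__(self, data_vls=1, credits_per_vl=2):
        if data_vls < 1:
            raise ValueError("a port needs at least one data VL")
        self.lanes = {vl: VirtualLane(credits_per_vl) for vl in range(data_vls)}

    def send(self, vl, packet):
        """Send on a lane if it has a credit; return False when it is blocked."""
        lane = self.lanes[vl]
        if lane.credits == 0:
            return False
        lane.credits -= 1
        lane.in_flight.append(packet)
        return True

    def credit_returned(self, vl):
        """The receiver freed a buffer on this lane."""
        lane = self.lanes[vl]
        lane.in_flight.pop(0)
        lane.credits += 1


port = LinkPort(data_vls=2, credits_per_vl=1)
first = port.send(0, "a")    # uses the only credit on VL0
blocked = port.send(0, "b")  # VL0 has no credits left
other = port.send(1, "c")    # VL1 is unaffected
port.credit_returned(0)
retry = port.send(0, "b")
print(first, blocked, other, retry)
```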

Page 31: Service Levels

- Service levels (SLs) are maintained by attaching a VL to an SL
- IBA does not specify any QoS levels (e.g. best effort)
- The SMA must keep a mapping of Service Level to Virtual Lane and propagate it through the switch
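
The SL-to-VL mapping is essentially a small lookup table. The sketch below assumes 16 service levels and spreads them over the available data VLs round-robin; that policy and the function names are illustrative choices, not something the architecture mandates.

```python
NUM_SERVICE_LEVELS = 16  # assumption: SL is a 4-bit field


def build_sl_to_vl(num_data_vls):
    """Map every service level onto one of the available data VLs."""
    if not 1 <= num_data_vls <= 15:
        raise ValueError("expected between 1 and 15 data VLs")
    # Round-robin is an arbitrary policy chosen for this sketch.
    return {sl: sl % num_data_vls for sl in range(NUM_SERVICE_LEVELS)}


def select_vl(table, service_level):
    """Pick the virtual lane a packet with this service level travels on."""
    return table[service_level]


table = build_sl_to_vl(4)
print(select_vl(table, 5))
```

With a single data VL every service level collapses onto VL0, so differentiated treatment needs more than one data lane.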

Page 32: Status

- Intel Developer Forum had several status talks
  - http://www.intel.com/idf/us
- IBA enabled network storage has been demonstrated at industry shows
  - Banderacom
  - Windriver
- The first products are expected to be on the market by the middle of 2002

Page 33: Summary

- Future bandwidth requirements for servers would lead to the interconnect becoming a bottleneck; IBA is an attempt to alleviate the problem
- IBA provides a thorough migration from a bus based to a switch based architecture while maintaining interoperability
- Further deployment is needed to reveal other issues that would arise in operation