© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public
Cisco’s Journey From Verbs to Libfabric
Abandon the shackles of Verbs
Embrace the freedom of Libfabric
Jeffrey M. Squyres, Cisco Systems, 23 September 2015
[Diagram: two software stacks sharing a Cisco VIC ethX port]
TCP/IP: Application, userspace sockets API, kernel TCP stack, general Ethernet driver, enic.ko
usNIC: Application, userspace library, kernel Verbs IB core, usnic.ko, plus a send and receive fast path
Verbs is a fine API. …if you make InfiniBand hardware.
...but now there’s this libfabric thing (see libfabric.org community for details)
Keep in mind, Cisco already supports UD Verbs
Verbs
• MTU is a monotonic enum: IBV_MTU_256, IBV_MTU_512, IBV_MTU_1024, IBV_MTU_2048, IBV_MTU_4096
• Could not add popular Ethernet values: 1500, 9000
• usNIC verbs provider had to lie (!) …just like iWARP providers
• MPI had to match verbs device with IP interface to find real MTU
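A minimal sketch of why an Ethernet provider cannot report its MTU truthfully through a five-value enum. The `example_` names and the round-down policy are assumptions made for illustration; only the numeric values 1 through 5 mirror the verbs MTU enum, and the slides do not say which value the usNIC provider actually reported.

```c
#include <assert.h>

/* Local mirror of the five verbs MTU enum values (names are illustrative). */
enum example_verbs_mtu {
    EXAMPLE_MTU_256  = 1,
    EXAMPLE_MTU_512  = 2,
    EXAMPLE_MTU_1024 = 3,
    EXAMPLE_MTU_2048 = 4,
    EXAMPLE_MTU_4096 = 5
};

/* Assumed policy: report the largest enum value that does not exceed
 * the real MTU in bytes. */
static enum example_verbs_mtu example_mtu_to_enum(int mtu_bytes)
{
    assert(mtu_bytes >= 256);
    if (mtu_bytes >= 4096) return EXAMPLE_MTU_4096;
    if (mtu_bytes >= 2048) return EXAMPLE_MTU_2048;
    if (mtu_bytes >= 1024) return EXAMPLE_MTU_1024;
    if (mtu_bytes >= 512)  return EXAMPLE_MTU_512;
    return EXAMPLE_MTU_256;
}

/* Convert an enum value back to bytes: 1 -> 256, ..., 5 -> 4096. */
static int example_enum_to_bytes(enum example_verbs_mtu mtu)
{
    return 128 << (int)mtu;
}
```

Under this policy an Ethernet MTU of 1500 round-trips to 1024 and 9000 round-trips to 4096, so the caller has to look up the real value some other way.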
Libfabric
• Integer (not enum) endpoint attribute
DONE
Verbs
• Mandatory GRH structure
  InfiniBand-specific header
• 40 bytes
  UDP header is 42 bytes …and a different format
• Breaks ibv_ud_pingpong
• usnic verbs provider used “magic” ibv_query_port() to return extensions pointers
  E.g., enable 42-byte UDP mode
[Diagram: UDP header, 42 bytes (et, len, chk, smac, dmac, …) versus GRH, 40 bytes (ver, len, next, hop, sgid, dgid)]
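The 42-byte figure can be reproduced with simple arithmetic. This sketch assumes an untagged Ethernet frame (no VLAN tag) and an IPv4 header without options; the `EXAMPLE_` names are illustrative and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>

enum {
    EXAMPLE_ETH_HDR_LEN    = 14,  /* dmac (6) + smac (6) + ethertype (2) */
    EXAMPLE_IPV4_HDR_LEN   = 20,  /* IPv4 header without options */
    EXAMPLE_UDP_HDR_LEN    = 8,   /* ports, length, checksum */
    EXAMPLE_UDP_PREFIX_LEN = EXAMPLE_ETH_HDR_LEN + EXAMPLE_IPV4_HDR_LEN
                             + EXAMPLE_UDP_HDR_LEN,
    EXAMPLE_GRH_LEN        = 40   /* size of the InfiniBand GRH */
};

/* Number of header bytes a receiver would mistake for payload if it
 * skipped only assumed_hdr_len bytes of a UDP-mode packet. */
static size_t example_misread_bytes(size_t assumed_hdr_len)
{
    assert(assumed_hdr_len <= (size_t)EXAMPLE_UDP_PREFIX_LEN);
    return (size_t)EXAMPLE_UDP_PREFIX_LEN - assumed_hdr_len;
}
```

Code that skips a fixed 40-byte GRH would be off by two bytes on such a packet, before even considering that the field layout differs.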
Libfabric
• FI_MSG_PREFIX and ep_attr.msg_prefix_size
[Diagram: message prefix (et, len, chk, smac, dmac, …) followed by payload]
DONE
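A sketch of how a receiver might locate the payload once the provider has reported its prefix size. The helper and struct below are hypothetical, not libfabric API; `prefix_size` stands in for the value read from `ep_attr.msg_prefix_size`, and the sketch assumes the received length counts the prefix bytes, which should be checked against the provider documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the payload inside a prefixed receive buffer. */
struct example_msg_view {
    const uint8_t *payload;
    size_t payload_len;
};

/* Skip prefix_size bytes at the front of buf. Returns 0 on success,
 * or -1 if the message is shorter than the prefix. */
static int example_strip_prefix(const uint8_t *buf, size_t recv_len,
                                size_t prefix_size,
                                struct example_msg_view *out)
{
    assert(buf != NULL && out != NULL);
    if (recv_len < prefix_size)
        return -1;
    out->payload = buf + prefix_size;
    out->payload_len = recv_len - prefix_size;
    return 0;
}
```

Because the size is a runtime value rather than a fixed structure, the same code works for a 40-byte or a 42-byte prefix.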
Verbs
• Tuple: (device, port)
  Usually a physical device and port
  Does not match virtualized VIC hardware
• Queue pair
• Completion queue
[Figure: hardware topology map of a two-socket server (64GB total; one 32GB NUMA node and 10 cores per socket), dated Sat Mar 14 09:27:31 2015. Cisco VIC ports eth4 to eth7 (PCI 1137:0043) appear as usnic_0 to usnic_3, each shown with four PCI 1137:00cf devices. Overlay: ibv_device and ibv_port, with QPs and a CQ.]
Libfabric
• Maps nicely to SR-IOV
• Fabric → PCI physical function (PF)
• Domain → PCI virtual function (VF)
• Endpoint → Resources in VF
[Figure: the same hardware topology map, overlaid with fi_fabric, fi_domain, and fi_endpoint (resources in domain), with EPs and a CQ.]
Verbs
• GID and GUID
  No easy mapping back to IP interface
• usnic verbs provider encoded MAC in GID
  Still cumbersome to map back to IP interface
• Could use RDMA CM
  …but that would be a ton more code

    mac[0] = gid->raw[8] ^ 2;
    mac[1] = gid->raw[9];
    mac[2] = gid->raw[10];
    mac[3] = gid->raw[13];
    mac[4] = gid->raw[14];
    mac[5] = gid->raw[15];
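The byte mapping above can be wrapped in a small self-contained function. `struct example_gid` is a hypothetical stand-in for the 16 raw GID bytes so the sketch builds without the verbs headers; the byte positions are taken directly from the slide, and nothing is assumed about what the provider stores in the skipped bytes 11 and 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16 raw bytes of a verbs GID. */
struct example_gid {
    uint8_t raw[16];
};

/* Recover the 6-byte MAC address using the mapping shown on the slide. */
static void example_gid_to_mac(const struct example_gid *gid, uint8_t mac[6])
{
    assert(gid != NULL && mac != NULL);
    mac[0] = gid->raw[8] ^ 2;   /* undo the flipped 0x02 bit */
    mac[1] = gid->raw[9];
    mac[2] = gid->raw[10];
    /* raw[11] and raw[12] are not part of the MAC */
    mac[3] = gid->raw[13];
    mac[4] = gid->raw[14];
    mac[5] = gid->raw[15];
}
```

Even with this helper, the caller still has to search the system's IP interfaces for a matching MAC, which is the cumbersome step the slide refers to.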
Libfabric
• Can use IP addressing directly
Everything is awesome
DONE
Verbs
• Generic send call
  ibv_post_send(…SG list…)
  Lots of branches
• Wasteful allocations
• No prefixed receive
• Branching in completions
Libfabric
• Multiple types of send calls
  fi_send(buffer, …)
• Variable-length prefix receive
  Provider-specific
• Fewer branches in completions
[Figure: Open MPI with usNIC: IMB PingPong Latency. Time (microseconds) versus buffer size. Series: imb-pingpong-ompi-1.8-verbs.out and imb-pingpong-ompi-1.8-libfabric.out.]
[Figure: Open MPI with usNIC: IMB SendRecv Bandwidth. Bandwidth (megabits/second) versus buffer size. Series: imb-sendrecv-ompi-1.8-verbs.out and imb-sendrecv-ompi-1.8-libfabric.out.]
Verbs
• Performance issues
• Memory registration still a problem
• No MPI-style tag matching
• One-sided capabilities do not match MPI
• Network topology is a separate API
Libfabric
• Performance happiness
• Many MPI-helpful features:
  Tag matching
  One-sided operations
  Triggered operations
• Inherently designed to be more than just point-to-point
• More work to be done… but promising
  MMU notify
  Network topology
Verbs
• Long design discussions about how to expose Ethernet / VIC concepts in the verbs API
  …usually with few good answers
  Especially problematic with new VIC features over time
• Conclusion: possible (obviously), but not preferable
Libfabric
• Whole API designed with multiple vendor hardware models in mind
• Much easier to match our hardware to core Libfabric concepts
• Conclusion: much preferable to verbs
Ok, so let’s do libfabric!
Does it play well with MPI?
Byte Transport Layer (BTL) plugins
Matching Transport Layer (MTL) plugins
MPI_Send(…)
Byte Transport Layer (BTL) plugins
• Inherently multi-device
  • Round-robin for small messages
  • Striping for large messages
• Major protocol decisions and MPI message matching driven by an Open MPI engine
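The two multi-device policies named above can be sketched as follows. This is an illustration of round-robin and even striping in general, not Open MPI's actual scheduler; the function names and the even split (with the remainder going to the lowest-numbered devices) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Round-robin: return the device for the next small message and
 * advance the cursor. */
static size_t example_next_device(size_t *cursor, size_t num_devices)
{
    assert(cursor != NULL && num_devices > 0);
    size_t dev = *cursor % num_devices;
    *cursor = (dev + 1) % num_devices;
    return dev;
}

/* Striping: number of bytes of a large message carried by device dev
 * when the message is split as evenly as possible. */
static size_t example_stripe_len(size_t msg_len, size_t num_devices,
                                 size_t dev)
{
    assert(num_devices > 0 && dev < num_devices);
    size_t base = msg_len / num_devices;
    size_t extra = msg_len % num_devices;
    return base + (dev < extra ? 1 : 0);
}
```

Small messages go whole to one device at a time, so latency is not paid for splitting; large messages are divided so that all devices contribute bandwidth.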
Matching Transport Layer (MTL) plugins
• Most details hidden by network API
  • MXM
  • Portals
  • PSM
• As a side effect, must handle:
  • Process loopback
  • Server loopback (usually via shared memory)
Byte Transport Layer (BTL) plugins
• IB / iWARP (verbs)
• Portals
• SCIF
• Shared memory
• TCP
• uGNI
• usNIC (verbs)
Matching Transport Layer (MTL) plugins
• MXM
• Portals
• PSM
• PSM2
Byte Transport Layer (BTL) plugins
• IB / iWARP (verbs)
• Portals
• SCIF
• Shared memory
• TCP
• uGNI
• usNIC
Matching Transport Layer (MTL) plugins
• MXM
• Portals
• PSM
• PSM2
• ofi
[Diagram: the usNIC BTL and the ofi MTL both sit on libfabric]
usnic BTL (on libfabric)
• Cisco developed
• usNIC-specific
• OFI point-to-point / UD
• Tested with usNIC
ofi MTL (on libfabric)
• Intel developed
• Provider neutral
• OFI tag matching
• Tested with PSM / PSM2
There are two main parts of the usNIC BTL:
• Bootstrapping
• Message passing
These two parts were previously written to the Verbs API:
• verbs bootstrapping
• verbs message passing
Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping
• verbs bootstrapping
• verbs message passing
• sideband bootstrapping
  1. Find the corresponding ethX device
  2. Obtain MTU
  3. Open usNIC-specific configuration options
Now let’s convert to use the libfabric API
• verbs bootstrapping → libfabric bootstrapping
  Bootstrapping sequence totally different / not comparable
• verbs message passing → libfabric message passing
  Pretty much a ~1:1 swap of verbs → libfabric calls
• sideband bootstrapping
  …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)
usnic BTL
• For a specific provider
  Ask fi_getinfo() for prov_name=“usnic”
• Use usNIC extensions
  Netmask, link speed, IP device name, etc.
• usNIC-specific error messages
ofi MTL
• For any tag-matching provider
• No extension use
  100% portable
• Generic error messages
Both libfabric usage models co-exist (and play well with each other) inside a single MPI implementation. Proof positive of successful co-design of libfabric and MPI implementations.
• Libfabric is the Way Forward for Cisco
  Open community
  Matches our hardware
  Performance benefits
  Features benefits
• Libfabric matches MPI
  Has features MPI has been asking for… for years
  Optimistic about its future (come join us!)
http://libfabric.org
Thank you.