LCSC 2004 SCI SOCKET: The fastest socket on earth? Atle Vesterkjær [email protected] http://www.dolphinics.com Olaf Helsets vei 6 NO-0619 Oslo, Norway Phone: +47 23 16 70 00 Fax: +47 23 16 71 80

Post on 19-Dec-2015


SCI SOCKET - Outline

• The fastest socket on earth and the impact on storage and applications
  SCI technology
  SCI SOCKET for storage and applications
  SCI SOCKET benchmarks

Highlights of the Dolphin SCI Technology

• Ultra low latency
  CPU has direct access to remote memory
  No protocol overhead
  1.4 µs for a 4-byte write
  < 3 µs for a 512-byte write
  0.2 µs pipelined write
  Fast failover for HA systems
• Highly efficient bus bridging
  Bus requests and responses (CPU load/store operations) are translated directly in hardware to request and response packets
  Point-to-point links give bus performance and latency over distance
• High data throughput: ~346 MByte/s

Highlights of the Dolphin SCI Technology

• Wide Application Area - Common Mode: Multiprocessing
  Storage, Clustering, Multiprocessing, Embedded Systems, Telecommunication, Defense, Medical Imaging
• Choice of Topologies: Ring, Torus, Switched
• Shipping in Critical Applications for more than 10 years
• Based on the ANSI/IEEE 1596-1992 Scalable Coherent Interface (SCI) Standard

Linköping University - NSC - SCI Clusters

• Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, AMD 900 MHz, 2D SCI
• Otto: 48 nodes, P4 2.26 GHz, 2D SCI
• Maxwell: 40 nodes, 2x Xeon, 2D SCI
• Bris: 16+2 nodes, 2x Xeon
• Total: 336 SCI nodes

Also in Sweden: Umeå University, 120 Athlon nodes

Applications, Database Clustering

• SUN's High End servers are clustered with Dolphin Cards
  Money Transaction and Data Base Applications
  High Availability and Performance
  Dolphin Ships: Cards and Switches
  7th year of shipments
  Oracle 9i Performance and Scalability
  SCI runs natively on SUN's RSM (Remote Shared Memory) API

Ultra Enterprise Cluster

Mirage 2000 Upgrade, First Test Flight January 2001

Thales uses Dolphin's technology as the main interconnect in the on-board multiprocessor

Offered with systems like Mirage 2000-9, Mirage 2000-5, Rafale and more

Space Mission Application

http://sim.jpl.nasa.gov/

Dolphin’s technology is chosen for evaluation

Dolphins in Space!


SCI Adapter Cards - 64 bit 66 MHz

• PCI-, PMC(VME)- and CompactPCI™-SCI Adapter Cards
• Industry-best latency
  1.4 microseconds for a 4-byte write
  < 3 microseconds for a 512-byte write
  0.2 microseconds pipelined write
• High data throughput: ~346 MBytes/s
• Supports Direct Memory Access (DMA), Remote Memory Access (RMA) and Remote Interrupt
• Hot-pluggable cabling
• Redundant SCI adapters can be used for fault tolerance
• Cluster Adapter, PCI to PCI Bridge, PCI Extension, Reflected Memory

Card block diagram labels: SCI, PSB, PCI, LC

Dolphin Products: Switches, Chips and Cards


Torus Topology

• 1D topology (ring): up to 10 nodes
• 2D torus topology: 100+ nodes
• 3D torus topology: 1000s of nodes
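
The node counts scale with the number of dimensions: with rings of roughly 10 nodes, a 2D torus of 10 x 10 reaches 100 nodes and a 3D torus of 10 x 10 x 10 reaches 1000. The following is a minimal sketch in C of how node counts and ring neighbours could be computed for such a torus. The `torus_t` type, the function names and the linear ID layout (dimension 0 varies fastest) are assumptions made for illustration, not Dolphin's actual node numbering.

```c
#include <assert.h>

#define MAX_DIMS 3

/* Hypothetical torus description: number of dimensions and nodes per ring. */
typedef struct {
    int dims;            /* 1 = ring, 2 = 2D torus, 3 = 3D torus */
    int size[MAX_DIMS];  /* number of nodes on each ring of dimension d */
} torus_t;

/* Total number of nodes: the product of the ring sizes. */
int torus_node_count(const torus_t *t) {
    int n = 1;
    for (int d = 0; d < t->dims; d++)
        n *= t->size[d];
    return n;
}

/* Linear ID of the neighbour one step (+1 or -1) along dimension d,
 * wrapping around at the end of the ring. IDs are laid out with
 * dimension 0 varying fastest (an assumption of this sketch). */
int torus_neighbor(const torus_t *t, int id, int d, int step) {
    assert(d >= 0 && d < t->dims && (step == 1 || step == -1));
    int stride = 1;
    for (int i = 0; i < d; i++)
        stride *= t->size[i];
    int coord = (id / stride) % t->size[d];
    int next = (coord + step + t->size[d]) % t->size[d];
    return id + (next - coord) * stride;
}
```

Each node has two neighbours per dimension, so a message crosses at most half a ring in each dimension regardless of the total node count.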


Dolphin SW

• All Dolphin SW is free open source (GPL or LGPL)
• SISCI – shared memory interface
• SCI-Sockets
  Low Latency Socket Library
  TCP and UDP Replacement
  User and Kernel level support
  Release 2.3 available
• SCI-MPICH (RWTH Aachen)
  MPICH 1.2 and some MPICH 2 features. MPICH 2 in development.
  New release is being prepared, beta available
• SCI Interconnect Manager
  Automatic failover recovery.
  No single point of failure in 2D and 3D networks.
• Other
  SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster Cray-compatible shmem and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprimes X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters and SunCluster 3.1 (Oracle 9i), MySQL Cluster

Latency vs SW

SW                     Latency (1/2 ping-pong round trip)
SISCI (Direct HW)      1.4 µs
SCI-Sockets            2.3 µs
Scali MPI Connect      3.5 µs
SCI-MPICH              3.8 µs

SCI SOCKET


Motivation

• Link-level speeds of interconnects are increasing
  The communication bottleneck has moved to protocol software
  High-speed networks provide their own efficient interfaces
• On the other hand:
  A large number of applications are built around legacy protocols such as the TCP/IP suite
  De facto standard: Berkeley Sockets API
  Porting to hardware-specific APIs is unprofitable in many cases
• SCI SOCKET aims to bring together:
  Legacy socket applications
  The low-latency SCI interconnect

Berkeley Sockets over SCI

• High-speed, low-latency replacement for Gigabit Ethernet for critical applications
• Bypasses traditional network stacks like TCP/UDP/IP
  Eliminates protocol overhead
  Reduces latency
• Transparent to applications, no modifications or recompilation required
• Ultra low latency
  2.27 µs socket send/receive latency

Berkeley Sockets over SCI

• Data transfer through remote shared memory
• Offers new socket transport family AF_SCI
• Flexible using configuration files
  Specifying cluster nodes
  Specifying ports

LD_PRELOAD

• Standard mechanism to preload C library functions
• User-defined library functions are called instead of the C library's
• AF_INET selects the traditional TCP/IP path
• AF_SCI selects SCI SOCKET

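int socket(int family, int type, int protocol) {
    /* Redirect TCP (SOCK_STREAM) and UDP (SOCK_DGRAM) sockets to SCI;
       all other requests go to the requested family unchanged. */
    if ((family == AF_INET) && (type == SOCK_STREAM || type == SOCK_DGRAM))
        return socket_lib(AF_SCI, type);
    else
        return socket_lib(family, type);
}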
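
The wrapper above can be tried without SCI hardware. The following is a minimal sketch, not the SCI SOCKET implementation: in a real LD_PRELOAD library the wrapper would be named `socket()` and would typically look up the C library's version with `dlsym(RTLD_NEXT, "socket")`. Here the underlying call is passed in as a function pointer so the selection logic can run on its own. The names `AF_SCI_SKETCH`, `sci_select_family`, `sci_socket_dispatch` and `echo_family` are hypothetical, and the numeric value of the SCI address family is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Placeholder value for this sketch only; the real AF_SCI constant is
 * defined by the SCI SOCKET software and is not given in the slides. */
#define AF_SCI_SKETCH 0x7fff

/* Signature shared by socket() and any stand-in for it. */
typedef int (*socket_fn)(int family, int type, int protocol);

/* Decide which address family a socket() call should really use:
 * TCP and UDP sockets on AF_INET are redirected, the rest pass through. */
int sci_select_family(int family, int type) {
    if (family == AF_INET && (type == SOCK_STREAM || type == SOCK_DGRAM))
        return AF_SCI_SKETCH;
    return family;
}

/* Body of a preloaded socket() wrapper. `next` is the underlying
 * implementation that actually creates the socket. */
int sci_socket_dispatch(socket_fn next, int family, int type, int protocol) {
    assert(next != NULL);
    return next(sci_select_family(family, type), type, protocol);
}

/* Stand-in for the underlying socket(): it only reports the family it
 * was asked for, which makes the redirection visible. */
int echo_family(int family, int type, int protocol) {
    (void)type;
    (void)protocol;
    return family;
}
```

Calling `sci_socket_dispatch(echo_family, AF_INET, SOCK_STREAM, 0)` returns the placeholder SCI family, while an `AF_UNIX` request comes back unchanged. On Linux the `type` argument may also carry flags such as `SOCK_NONBLOCK`; this sketch does not mask them out.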


SCI SOCKET

• Easy installation of the SCI socket library
  Configuration file

Diagram: Application on top of the Standard Socket library (Ethernet) and the SCI Socket library (SCI)

Configuration File /etc/sci/scisock.conf

#This is a SCI socket config file
#Should be placed in /etc/sci
#
#hostname          SCI NodeId
nodeA              4
193.71.152.89      8
Mailhost           16
File-serv          20

• Selects which machines can be reached using SCI

• Optionally, /etc/sci/scisock_opt.conf selects which ports can be reached using SCI

#This is a SCI socket_opt config file
#Should be placed in /etc/sci directory
#
#-key                 -Type      -value
EnablePortsByDefault             yes/no
EnablePort            tcp|udp    'portnumber'
DisablePort           tcp|udp    'portnumber'
EnablePortRange       tcp|udp    'start_port end_port'
DisablePortRange      tcp|udp    'start_port end_port'
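
How a library might consume these two files can be sketched in C. This is an illustration under stated assumptions, not the SCI SOCKET parser: the function and type names are hypothetical, and the precedence used for port rules (the last matching rule wins, otherwise `EnablePortsByDefault` applies) is an assumption, since the slides do not say how overlapping rules are resolved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Parse one "hostname NodeId" line from scisock.conf.
 * Returns 1 on success, 0 for comments, blank lines or malformed input. */
int parse_host_line(const char *line, char host[64], unsigned *node_id) {
    assert(line && host && node_id);
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '#' || *line == '\0' || *line == '\n')
        return 0;
    return sscanf(line, "%63s %u", host, node_id) == 2;
}

/* One EnablePort/DisablePort/EnablePortRange/DisablePortRange entry. */
typedef struct {
    bool enable;           /* true for Enable*, false for Disable* */
    bool udp;              /* false = tcp, true = udp */
    unsigned first, last;  /* inclusive range; first == last for one port */
} port_rule_t;

/* Decide whether traffic on a port should use SCI. Assumed precedence:
 * the last matching rule wins, otherwise the default applies. */
bool port_uses_sci(const port_rule_t *rules, size_t n, bool enabled_by_default,
                   bool udp, unsigned port) {
    bool result = enabled_by_default;
    for (size_t i = 0; i < n; i++)
        if (rules[i].udp == udp && port >= rules[i].first && port <= rules[i].last)
            result = rules[i].enable;
    return result;
}
```

With `EnablePortsByDefault no`, an `EnablePortRange tcp 5000 5100` rule followed by `DisablePort tcp 5050` would send TCP port 5000 over SCI and leave port 5050 on Ethernet under this assumed precedence.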


Linux Kernel Socket Switch

Diagram (top to bottom):
  User space: User App, Socket lib
  Kernel space: Cluster File System, iSCSI, Linux Kernel Socket Switch
  Below the switch: SCI SOCKET, or Native SOCKET with TCP/UDP, IP and the Ethernet driver
  Hardware: SCI HW, Ethernet HW

Small Message Latency


TCP STREAM


TCP-RR SCI SOCKET vs Gigabit Ethernet


Scali MPI over SCI SOCKET

• SCI SOCKET is 1.6 - 6.0 times faster than TCP/GigE


Why is SCI SOCKET so fast?

• Small messages are sent using basic CPU instructions
  Data are normally located in the CPU cache
  Low-cost write post to a local memory address
  A single CPU store instruction sends 8 bytes
  Raw send latency for 8 bytes is approximately 210 nanoseconds
  No need to lock down or register memory
• Large messages are sent using DMA
• Streamlined and lock-free messaging protocol on top of shared memory
• Combination of polling and interrupts
• Receiving a message causes the received message to be cached
  No additional memory access
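
The idea of a lock-free messaging protocol on top of shared memory can be illustrated with a single-producer, single-consumer ring written with C11 atomics. This is a generic sketch, not Dolphin's protocol: the `ring_t` layout, the function names and the slot count are assumptions, and the ring lives in ordinary local memory here, whereas with SCI it would sit in a segment mapped by both nodes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 64  /* must be a power of two */

/* One-directional message ring shared by one sender and one receiver. */
typedef struct {
    _Atomic uint32_t head;       /* next slot the sender writes */
    _Atomic uint32_t tail;       /* next slot the receiver reads */
    uint64_t slot[RING_SLOTS];   /* 8-byte payloads */
} ring_t;

void ring_init(ring_t *r) {
    assert(r != NULL);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Sender: one 8-byte store for the payload, then publish the new head.
 * Returns false when the ring is full. */
bool ring_send(ring_t *r, uint64_t msg) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slot[head % RING_SLOTS] = msg;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Receiver: poll for a new head value; no lock is taken.
 * Returns false when nothing has arrived. */
bool ring_poll(ring_t *r, uint64_t *msg) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *msg = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Because each index is written by only one side, neither side needs a lock. A receiver would typically call `ring_poll` in a short loop and fall back to waiting for an interrupt when the ring stays empty, which matches the combination of polling and interrupts listed above.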


Cluster File Systems

• SCI SOCKET: A typical cluster file system will run out of the box
• PVFS
  Open Source / GPL software
  http://www.parl.clemson.edu/pvfs/desc.html
• Lustre
  Open Source / GPL software
  http://www.lustre.org/
• GFS (Global File System)
  Commercial file system available from Sistina
  www.sistina.com/products_gfs.htm

iSCSI

• SCSI over IP
  Protocol for encapsulating SCSI commands into IP packets
  I/O block data transport over IP networks
• iSCSI and SCI SOCKET can be used to build scalable SAN / NAS solutions

Diagram: iSCSI Driver, TCP/IP, NIC <-> IP network <-> NIC, TCP/IP, SCSI Driver

iSCSI over SCI SOCKET

• Latency is approximately 10x better than Gigabit Ethernet (8.1x to 9.3x for the operations below)
  Latency is reported by Intel's 'ktest'

                               Gigabit Ethernet   SCI SOCKET
SCSI op 0x28 (READ(10))        250 µs             29 µs
SCSI op 0x2A (WRITE(10))       250 µs             31 µs
SCSI op 0x25 (READ CAPACITY)   250 µs             27 µs

iSCSI over SCI SOCKET

• Throughput is 2-4 times that of Gigabit Ethernet

SCI SOCKET comparison

Technology      Latency    Throughput   Reference
Infiniband      28 µs      3768 Mbps    IEEE Symposium IPASS 2004
Gbit Ethernet   23 µs      936 Mbps     www.dolphinics.com
Myrinet         12 µs      1818 Mbps    www.myrinet.com
SCI             2.26 µs    2016 Mbps    www.dolphinics.com

Note: The new version has a 'bug' causing the latency to increase from 2.2 µs to 2.6 µs. A fix is being investigated.

SCI vs other interconnects

• As reported by Ames Laboratory (Iowa State University, USA) using the NetPIPE benchmark

Applications running SCI SOCKET

• Intel iSCSI
• PVFS
• LUSTRE
• MySQL Cluster
• LAM-MPI
• MPICH2
• PVM
• Oracle (Client/Server sqlplus)
• TerraGrid (tm) by Terrascale
• Scali MPI Connect™
• Latency_bench
• Netpipe TCP/PVM
• Netperf

Current Development

• Available on X86 and X86_64, Linux 2.4 and 2.6
• Itanium beta release is ready
• Porting to Windows in progress
• Support for multiple adapters in progress
  Data striping multiplies throughput with no latency penalty or extra CPU load
  Redundancy and transparent failover to another SCI adapter and to Ethernet


http://www.gria.org/

• Would you like your computers to earn you extra money?
• Would you like to have cheap access to tons of computing power?
  The GRIA project will take Grid technology into the real world, enabling industrial users to trade computational resources on a commercial basis to meet their needs more cost effectively.
• GRIA enables organizations to:
  Outsource computation. If you need short-term computation and cannot justify the expense of the hardware purchase, GRIA provides a mechanism to discover, negotiate and utilize other organizations' spare computing resources.
  Rent out spare CPU cycles. GRIA provides a mechanism allowing you to commercially offer your spare computing resources on the Grid.

Acknowledgement

• SCI SOCKET kernel module has been developed in the IST-33240 project GRIA (http://www.gria.org)

• SCI SOCKET user space software library has been developed in the ITEA project HYADES (http://www.hyades-itea.org)

• The SCI SOCKET software is open source and available under GPL/LGPL. Dolphin strongly appreciates the contribution to the code and testing done by volunteer programmers and partners.

• More information about SCI SOCKET can be found at http://www.dolphinics.com/products/software/sci_sockets.html