Network Server Performance and Scalability

Scott Rixner, Rice Computer Architecture Group
June 22, 2005
http://www.cs.rice.edu/CS/Architecture/


TRANSCRIPT

Page 1: Network Server Performance and Scalability

Network Server Performance and Scalability

June 22, 2005

Scott Rixner
Rice Computer Architecture Group

http://www.cs.rice.edu/CS/Architecture/

Page 2: Network Server Performance and Scalability


Rice Computer Architecture

Faculty
– Scott Rixner

Students
– Mike Calhoun
– Hyong-youb Kim
– Jeff Shafer
– Paul Willmann

Research Focus
– System architecture
– Embedded systems

http://www.cs.rice.edu/CS/Architecture/

Page 3: Network Server Performance and Scalability


Network Servers Today

Content types
– Mostly text, small images
– Low quality video (300-500 Kbps)

[Diagram: clients on 3 Mbps links connect through the Internet to a network server on a 1 Gbps link]

Page 4: Network Server Performance and Scalability


Network Servers in the Future

Content types
– Diverse multimedia content
– DVD quality video (10 Mbps)

[Diagram: clients on 100 Mbps links connect through the Internet to a network server on a 100 Gbps link]

Page 5: Network Server Performance and Scalability


TCP Performance Issues

Network Interfaces
– Limited flexibility
– Serialized access

Computation
– Only about 3000 instructions per packet
– However, very low IPC and parallelization difficulties

Memory
– Large connection data structures (about 1 KB each)
– Low locality, high DRAM latency
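A back-of-envelope calculation (mine, not from the slide) of why roughly 1 KB of per-connection state defeats on-chip caches once connection counts grow; the 1 MB L2 cache is an assumed size for illustration.

```c
/* Back-of-envelope: why TCP connection state defeats CPU caches.
 * Assumes ~1 KB of state per connection (per the slide) and a
 * hypothetical 1 MB L2 cache; exact sizes vary by OS and CPU. */
#include <stdio.h>

int main(void) {
    const int state_bytes = 1024;          /* ~1 KB per connection */
    const int l2_bytes = 1 << 20;          /* assumed 1 MB L2 cache */
    int conns[] = {128, 1024, 8192, 65536};

    for (int i = 0; i < 4; i++) {
        long total = (long)conns[i] * state_bytes;
        printf("%6d connections -> %6ld KB of state (%.1fx a 1 MB L2)\n",
               conns[i], total / 1024, (double)total / l2_bytes);
    }
    return 0;
}
```

With tens of thousands of connections the working set is tens of megabytes, which is why accesses show low locality and pay full DRAM latency.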

Page 6: Network Server Performance and Scalability


Selected Research

Network Interfaces
– Programmable NIC design
– Firmware parallelization
– Network interface data caching

Operating Systems
– Connection handoff to the network interface
– Parallelizing network stack processing

System Architecture
– Memory controller design

Page 7: Network Server Performance and Scalability


Designing a 10 Gigabit NIC

Programmability for performance
– Computation offloading improves performance

NICs have power and area concerns
– Architecture solutions should be efficient

Above all, must support 10 Gbps links
– What are the computation and memory requirements?
– What architecture efficiently meets them?
– What firmware organization should be used?

Page 8: Network Server Performance and Scalability


Aggregate Requirements: 10 Gbps, Maximum-sized Frames

            Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame    229 MIPS                 2.6 Gbps                 19.75 Gbps
RX Frame    206 MIPS                 2.2 Gbps                 19.75 Gbps
Total       435 MIPS                 4.8 Gbps                 39.5 Gbps

1514-byte frames at 10 Gbps: 812,744 frames/s
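A small sketch that reproduces the 812,744 frames/s figure, assuming standard Ethernet per-frame wire overhead (8 B preamble, 12 B inter-frame gap, 4 B CRC). The frame-data estimate assumes each frame crosses NIC memory twice (host-bus side and wire side); it lands close to, but not exactly on, the table's 19.75 Gbps, so the table presumably counts slightly different bytes.

```c
/* Reproduces the slide's frame-rate figure for maximum-sized frames
 * at 10 Gbps. Assumes standard Ethernet overhead per frame:
 * 8 B preamble + 12 B inter-frame gap + 4 B CRC on the wire. */
#include <stdio.h>

int main(void) {
    const double link_bps   = 10e9;
    const double frame_b    = 1514.0;               /* headers + payload, no CRC */
    const double wire_bytes = frame_b + 8 + 12 + 4; /* 1538 B per frame on the wire */

    double frames_per_s = link_bps / (wire_bytes * 8);   /* ~812,744 */

    /* Assumption: frame data crosses NIC memory twice per direction
     * (once from the host bus, once to/from the wire). */
    double frame_data_gbps = frames_per_s * frame_b * 8 * 2 / 1e9;

    printf("frames/s             : %.0f\n", frames_per_s);
    printf("frame data (approx.) : %.2f Gbps per direction\n", frame_data_gbps);
    return 0;
}
```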

Page 9: Network Server Performance and Scalability


Meeting 10 Gbps Requirements

Processor Architecture
– At least 435 MIPS within an embedded device
– Limited instruction-level parallelism
– Abundant task-level parallelism

Memory Architecture
– Control data needs low latency, small capacity
– Frame data needs high bandwidth, large capacity
– Must partition storage

Page 10: Network Server Performance and Scalability


Processor Architecture

                 Perfect    1BP    No BP
In-order         1          0.87   0.87
Out-of-order     2          1.74   1.21

2x performance is costly
– Branch prediction, reorder buffer, renaming logic, wakeup logic
– Overheads translate to greater than 2x core power and area costs
– Great for a GP processor; not for an embedded device

Are there other opportunities for parallelism?
– Many steps to process a frame: run them simultaneously
– Many frames need processing: process them simultaneously

Solution: use parallel single-issue cores
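A rough sizing sketch (not from the slides) for the 435 MIPS requirement, assuming the ~0.87 sustained IPC of a single-issue in-order core from the table above; it ignores memory and synchronization effects, which the scaling results on a later slide show are significant, so the real design needs more cores than this lower bound.

```c
/* Rough core-count lower bound for the 435 MIPS firmware requirement,
 * using the ~0.87 sustained IPC of a single-issue in-order core.
 * Frequencies are illustrative only. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double required_mips = 435.0;
    const double ipc = 0.87;
    int freqs_mhz[] = {100, 150, 200, 250, 300};

    for (int i = 0; i < 5; i++) {
        double mips_per_core = ipc * freqs_mhz[i];
        int cores = (int)ceil(required_mips / mips_per_core);
        printf("%3d MHz: %5.1f MIPS/core -> at least %d cores\n",
               freqs_mhz[i], mips_per_core, cores);
    }
    return 0;
}
```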

Page 11: Network Server Performance and Scalability


Control Data Caching

SMPCache trace analysis of a 6-processor NIC architecture

[Figure: hit ratio (percent) vs. cache size, from 16 B to 32 KB, for the 6-processor configuration]

Page 12: Network Server Performance and Scalability


A Programmable 10 Gbps NIC

[Block diagram: P single-issue CPUs (CPU 0 through CPU P-1), each with its own instruction cache fed from a shared instruction memory; S scratchpads (Scratchpad 0 through S-pad S-1); a (P+4)x(S) 32-bit crossbar connecting the CPUs and scratchpads to the PCI interface, the Ethernet interface, and an external (off-chip) memory interface to DRAM]

Page 13: Network Server Performance and Scalability


Network Interface Firmware

NIC processing steps are well defined

Must provide high latency tolerance
– DMA to host
– Transfer to/from network

Event mechanism is the obvious choice
– How do you process and distribute events? (see the sketch below)

Page 14: Network Server Performance and Scalability


Task Assignment with an Event Register

[Diagram: an event register with bits such as a PCI read bit and a SW event bit. The PCI interface finishes work and sets its bit; processors inspect the register, process the transactions, enqueue the TX data, and pass it to the Ethernet interface]
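A minimal sketch of event-register dispatch in the spirit of this slide: hardware units set status bits, processors poll the register and run the matching handler. The bit names and the simulated register are hypothetical, not the actual firmware interface.

```c
/* Minimal event-register dispatch sketch; bit names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define EV_PCI_READ_DONE  (1u << 0)   /* PCI interface finished a read */
#define EV_SW_EVENT       (1u << 1)   /* software-raised event */

static volatile uint32_t event_register;   /* stands in for the HW register */

static void handle_pci_read(void) { printf("enqueue TX data for the Ethernet interface\n"); }
static void handle_sw_event(void) { printf("process software event\n"); }

int main(void) {
    event_register = EV_PCI_READ_DONE | EV_SW_EVENT;  /* pretend HW set two bits */

    while (event_register != 0) {
        uint32_t events = event_register;       /* snapshot, then dispatch */
        if (events & EV_PCI_READ_DONE) {
            event_register &= ~EV_PCI_READ_DONE;
            handle_pci_read();
        }
        if (events & EV_SW_EVENT) {
            event_register &= ~EV_SW_EVENT;
            handle_sw_event();
        }
    }
    return 0;
}
```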

Page 15: Network Server Performance and Scalability


Task-level Parallel Firmware

[Timeline (Proc 0 and Proc 1 vs. time): with task-level assignment, one processor transfers DMAs 0-4 and then 5-9 while the other processes DMAs 0-4 and then 5-9; each idles while waiting on the other and on the PCI read hardware status bit]

Page 16: Network Server Performance and Scalability


Frame-level Parallel Firmware

[Timeline (Proc 0 and Proc 1 vs. time): with frame-level assignment, each processor transfers, processes, and builds the event for its own batch of DMAs (0-4 on one, 5-9 on the other), leaving shorter idle periods than the task-level organization]
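A simplified, single-threaded model (my own, not the actual firmware) of the difference between the two organizations: under frame-level parallelism, whichever processor claims a batch of frames carries it through every step, so no work is handed between cores mid-frame.

```c
/* Frame-level organization sketch: each "processor" (a loop iteration
 * here; a separate core on the NIC) claims a whole batch of DMAs and
 * runs every step itself, avoiding cross-core hand-offs. */
#include <stdio.h>

static void transfer_dmas(int first, int last) { printf("  transfer DMAs %d-%d\n", first, last); }
static void process_dmas(int first, int last)  { printf("  process  DMAs %d-%d\n", first, last); }

int main(void) {
    for (int proc = 0; proc < 2; proc++) {
        int first = proc * 5, last = first + 4;   /* DMAs 0-4, then 5-9 */
        printf("processor %d:\n", proc);
        transfer_dmas(first, last);
        process_dmas(first, last);   /* same core: no hand-off between steps */
    }
    return 0;
}
```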

Page 17: Network Server Performance and Scalability


Scaling in Two Dimensions

[Figure: throughput (Gb/s) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, plotted against the 10 Gbps Ethernet limit]

Page 18: Network Server Performance and Scalability


A Programmable 10 Gbps NIC

This NIC architecture relies on:
– Data Memory System: partitioned organization, not coherent caches
– Processor Architecture: parallel scalar processors
– Firmware: frame-level parallel organization
– RMW Instructions: reduce ordering overheads

A programmable NIC: A substrate for offload services

Page 19: Network Server Performance and Scalability


NIC Offload Services

– Network Interface Data Caching
– Connection Handoff
– Virtual Network Interfaces
– …

Page 20: Network Server Performance and Scalability


Network Interface Data Caching

Cache data in network interface
Reduces interconnect traffic
Software-controlled cache (sketched below)
Minimal changes to the operating system

Prototype web server
– Up to 57% reduction in PCI traffic
– Up to 31% increase in server performance
– Peak 1571 Mbps of content throughput
  • Breaks PCI bottleneck
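A hypothetical sketch of the software-controlled cache idea: the host tags each send with a cache key so the NIC can serve repeated blocks from its own memory instead of re-reading them over PCI. The structures and names below are illustrative, not the prototype's.

```c
/* Hypothetical NIC block cache: the host picks the slot and key, the
 * NIC obeys.  A hit means no frame data crosses the PCI bus. */
#include <stdio.h>
#include <string.h>

#define NIC_CACHE_BLOCKS 4        /* tiny for illustration; the real cache is MBs */
#define BLOCK_SIZE       4096     /* 4 KB blocks, as in the prototype */

struct cache_block {
    long key;                     /* host-assigned block identifier */
    char data[BLOCK_SIZE];
};

static struct cache_block nic_cache[NIC_CACHE_BLOCKS];

/* Returns 1 on a hit (send from NIC memory), 0 on a miss (DMA over PCI). */
static int nic_send_block(int slot, long key, const char *host_data) {
    if (nic_cache[slot].key == key)
        return 1;                              /* hit: reuse cached block */
    nic_cache[slot].key = key;                 /* miss: fetch over PCI ... */
    memcpy(nic_cache[slot].data, host_data, BLOCK_SIZE);
    return 0;
}

int main(void) {
    static char block[BLOCK_SIZE] = "index.html contents";
    printf("first send : %s\n", nic_send_block(0, 42, block) ? "hit" : "miss");
    printf("second send: %s\n", nic_send_block(0, 42, block) ? "hit" : "miss");
    return 0;
}
```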

Page 21: Network Server Performance and Scalability


Results: PCI Traffic

[Figure: PCI traffic. Annotations: ~1260 Mb/s is the practical PCI limit and the bus is saturated; ~60% of the traffic is content (1198 Mb/s of HTTP content) and ~30% is overhead]

Page 22: Network Server Performance and Scalability


Content Locality

– Block cache with 4 KB block size
– 8-16 MB caches capture locality

Page 23: Network Server Performance and Scalability


Results: PCI Traffic Reduction

[Figure annotations: traces with low temporal reuse show low PCI utilization; traces with good temporal reuse hit a CPU bottleneck; PCI traffic drops 36-57% across the four traces, with up to a 31% performance improvement]

Page 24: Network Server Performance and Scalability


Connection Handoff to the NIC

No magic processor on NIC
– OS must control work between itself and NIC

Move established connections between OS and NIC
– Connection: unit of control
– OS decides when and what

Benefits
– Sockets are intact: no need to change applications
– Zero-copy
– No port allocation or routing on NIC
– Can adapt to route changes

[Diagram: the host stack in the OS (sockets, TCP, IP, Ethernet, driver) and the NIC stack (TCP, IP, Ethernet/lookup) connected through a handoff layer]

Handoff interface (sketched below):
1. Handoff
2. Send
3. Receive
4. Ack
5. …
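One hypothetical way to render the handoff interface above as a driver-level operations table; the real prototype's types and names differ. The point it illustrates is the division of labor: the OS keeps sockets and policy, and the NIC handles only connections the OS chooses to hand off.

```c
/* Hypothetical handoff operations table; names are illustrative only. */
#include <stdio.h>

struct conn;                               /* opaque connection handle */

struct nic_handoff_ops {
    int (*handoff)(struct conn *c);                            /* 1. move an established conn to the NIC */
    int (*send)(struct conn *c, const void *buf, int len);     /* 2. */
    int (*receive)(struct conn *c, void *buf, int len);        /* 3. */
    int (*ack)(struct conn *c, int bytes);                     /* 4. */
};

/* The OS keeps the socket layer intact and only calls handoff() for
 * connections it chooses; if the NIC declines (resource limits), the
 * host stack simply keeps handling that connection. */
int maybe_handoff(const struct nic_handoff_ops *ops, struct conn *c) {
    if (ops && ops->handoff && ops->handoff(c) == 0)
        return 1;                          /* now processed on the NIC */
    return 0;                              /* stays in the host stack */
}

int main(void) { printf("handoff ops sketch only\n"); return 0; }
```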

Page 25: Network Server Performance and Scalability


Connection Handoff

Traditional offload
– NIC replicates entire network stack
– NIC can limit connections due to resource limitations

Connection handoff
– OS decides which subset of connections NIC should handle
– NIC resource limitations limit amount of offload, not number of connections


Page 26: Network Server Performance and Scalability


Establishment and Handoff

OS establishes connections

OS decides whether or not to hand off each connection

[Diagram: (1) the OS establishes a connection; (2) the connection is handed off to the NIC]

Page 27: Network Server Performance and Scalability


Data Transfer

Offloaded connections require minimal support from OS for data transfers
– Socket layer for interface to applications
– Driver layer for interrupts, buffer management

[Diagram: (3) send, receive, ack, and other operations move data between the connection state in the OS and the offloaded connection on the NIC]

Page 28: Network Server Performance and Scalability


Connection Teardown

Teardown requires both NIC and OS to deallocate connection data structures

[Diagram: (4, 5) the NIC and the OS each de-allocate their connection data structures]

Page 29: Network Server Performance and Scalability


Connection Handoff Status

Working prototype built on FreeBSD

Initial results for web workloads
– Reductions in cycles and cache misses on the host
– Transparently handle multiple NICs
– Fewer messages on PCI
  • 1.4 per packet to 0.6 per packet
  • Socket-level instead of packet-level communication
– ~17% throughput increase (simulations)

To do
– Framework for offload policies
– Test zero-copy, more workloads
– Port to Linux

Page 30: Network Server Performance and Scalability


Virtual Network Interfaces

Traditionally used for user-level network access
– Each process has its own "virtual NIC"
– Provide protection among processes

Can we use this concept to improve network stack performance within the OS?
– Possibly, but we need to understand the behavior of the OS on networking workloads first

Page 31: Network Server Performance and Scalability


Networking Workloads

Performance is influenced by
– The operating system’s network stack
– The increasing number of connections
– Microprocessor architecture trends

Page 32: Network Server Performance and Scalability


Networking Performance

Bound by TCP/IP processing
– 2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream

[Figure: execution-time breakdown (Other, User, System Call, TCP, IP, Ethernet, Driver) for the SPECweb, Rice, IBM, NASA, and World Cup workloads]

– Hurwitz and Feng, IEEE Micro 2004

Page 33: Network Server Performance and Scalability


Throughput vs. Connections

Faster links mean more connections; more connections mean worse performance

[Figure: HTTP content throughput (Mb/s, 0-1200) vs. number of connections (4 to 2048) for the CS, IBM, NASA, and WC traces]

Page 34: Network Server Performance and Scalability


The End of the Uniprocessor?

Uniprocessors have become too complicated
– Clock speed increases have slowed down
– Increasingly complicated architectures for performance

Multi-core processors are becoming the norm
– IBM Power 4: 2 cores (2001)
– Intel Pentium 4: 2 hyperthreads (2002)
– Sun UltraSPARC IV: 2 cores (2004)
– AMD Opteron: 2 cores (2005)
– Sun Niagara: 8 cores, 4 threads each (est. 2006)

How do we use these cores for networking?

Page 35: Network Server Performance and Scalability


Parallelism with Data-Synchronized Stacks

Linux 2.4.20+, FreeBSD 5+

Page 36: Network Server Performance and Scalability


Parallelism with Control-Synchronized Stacks

DragonflyBSD, Solaris 10

Page 37: Network Server Performance and Scalability


Parallelization Challenges

Data-Synchronous
– Lots of thread parallelism
– Significant locking overheads

Control-Synchronous
– Reduces locking
– Load balancing issues

Which approach is better?
– Throughput? Scalability?
– We’re optimizing both schemes in FreeBSD 5 to find out (both styles are sketched below)

Network Interface
– Serialization point
– Can virtualization help?
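A simplified contrast of the two schemes (illustrative only; the real FreeBSD code paths are far more involved): data-synchronous processing locks per-connection state from any CPU, while control-synchronous processing forwards each packet to the one thread that owns the connection, trading locks for load-balancing concerns. The queue function is a stand-in for a per-CPU message queue.

```c
/* Data-synchronous vs. control-synchronous input paths (sketch only). */
#include <pthread.h>
#include <stdio.h>

struct connection {
    pthread_mutex_t lock;     /* used only by the data-synchronous path */
    int owner_cpu;            /* used only by the control-synchronous path */
    long bytes_rcvd;
};

/* Data-synchronous: any CPU may process the packet, so it must lock. */
static void input_data_sync(struct connection *c, int len) {
    pthread_mutex_lock(&c->lock);
    c->bytes_rcvd += len;
    pthread_mutex_unlock(&c->lock);
}

/* Control-synchronous: hand the packet to the owning CPU; the state is
 * then touched by only that thread, so no lock is needed on it. */
static void enqueue_for_cpu(int cpu, struct connection *c, int len) {
    c->bytes_rcvd += len;     /* would run later, on 'cpu', lock-free */
    printf("queued %d bytes to CPU %d\n", len, cpu);
}

static void input_ctrl_sync(struct connection *c, int len) {
    enqueue_for_cpu(c->owner_cpu, c, len);
}

int main(void) {
    struct connection c = { PTHREAD_MUTEX_INITIALIZER, 1, 0 };
    input_data_sync(&c, 1460);
    input_ctrl_sync(&c, 1460);
    return 0;
}
```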

Page 38: Network Server Performance and Scalability


Memory Controller Architecture

Improve DRAM efficiency
– Memory access scheduling
– Virtual Channels

Improve copy performance
– 45-61% of kernel execution time can be copies
– Best copy algorithm depends on copy size, cache residency, and cache state (selection sketched below)
– Probe copy
– Hardware copy acceleration

Improve I/O performance…
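The copy-selection bullet can be pictured as a size-dependent dispatch like the sketch below; the 4 KB threshold is an arbitrary placeholder, and neither probe copy nor the hardware copy engine mentioned above is modeled here.

```c
/* Size-dependent copy selection (illustration only). */
#include <stdio.h>
#include <string.h>

static void copy_bytes(void *dst, const void *src, size_t n) {
    if (n <= 4096) {
        memcpy(dst, src, n);   /* small copy: plain load/store loop is fine */
    } else {
        /* large copy: a real kernel might use non-temporal stores or a
         * hardware copy engine here to avoid polluting the cache */
        memcpy(dst, src, n);
    }
}

int main(void) {
    char a[8192] = {0}, b[8192] = {1};
    copy_bytes(a, b, sizeof a);
    printf("copied %zu bytes\n", sizeof a);
    return 0;
}
```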

Page 39: Network Server Performance and Scalability


Summary

Our focus is on system-level architectures for networking

Network interfaces must evolve
– No longer just a PCI-to-Ethernet bridge
– Need to provide capabilities to help the operating system

Operating systems must evolve
– Future systems will have 10s to 100s of processors
– Networking must be parallelized; many bottlenecks remain

Synergy between the NIC and OS cannot be ignored

Memory performance is also increasingly a critical factor