Memory Performance Evaluation of High Throughput Servers

Garba Ya’u Isa
Master’s Thesis Oral Defense
Computer Engineering
King Fahd University of Petroleum & Minerals

Saturday, 7th June 2003

2

Outline

Introduction
Problem Statement
Analysis of Memory Accesses
Measurement Based Performance Evaluation
Design and Implementation of Prototype
Contributions
Conclusions
Future Work

3

Introduction

Processor and memory performance discrepancy

Growing network bandwidth
  Data rates in terabits per second possible
  Gigabit-per-second LANs already deployed

High throughput servers in network infrastructure
  Streaming media servers
  Web servers
  Software routers

[Figure: performance versus year, log scale from 1 to 10,000.]

4

Dealing with Performance Gap

Hierarchical memory architecture
  Temporal locality
  Spatial locality

Constraints
  Characteristics of network payload data:
    Large: will not fit into cache
    Hardly reusable: poor temporal locality

5

Problem Statement

Network servers should:
  Deliver high throughput
  Respond to requests with low latency
  Respond to a large number of clients

Our goal
  Identify specific conditions at which server memory becomes a bottleneck
  Includes: cache, main memory, and virtual memory

Benefits
  Better server design that alleviates memory bottlenecks
  Optimal performance can be achieved

Constraints
  Large amount of data flowing through CPU and memory
  Writing code to optimize memory utilization is a challenge

6

Analysis of Memory Accesses: Data Flow Analysis

Four data transfer paths:
  Memory-CPU
  Memory-memory
  Memory-I/O
  Memory-network

[Figure: server data-flow diagram. Components: processor, on-chip cache, off-chip cache, internal (CPU-memory) bus, main memory, bus/DMA controller, I/O bus, disk controller, disks, network interfaces. Labeled flows: cache-memory transfers, memory-memory transfer via CPU, disk transfer via DMA, network transfer via DMA.]

7

Latency Model and Memory Overhead

Each transaction involves:
  CPU cycles
  Data transfers: one or more of the four identified types

Transaction latency:

$$T_{trans} = T_{cpu} + n_1 T_{m-c} + n_2 T_{m-m} + n_3 T_{m-disk} + n_4 T_{m-net}$$

  $T_{cpu}$: total CPU time needed for the transaction
  $T_{m-c}$: time to transfer an entire PDU from memory to CPU for processing
  $T_{m-m}$: latency of a memory-memory copy of a PDU
  $T_{m-disk}$: latency of a memory-I/O read/write of a block of data
  $T_{m-net}$: latency of a memory-network read/write of a PDU
  $n_i$: number of each type of data movement operation
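The latency model above is a weighted sum, so it can be sketched directly in C. This is only an illustrative sketch: the struct and function names (`latency_terms`, `op_counts`, `transaction_latency_us`) are hypothetical and do not come from the thesis, and all latencies are assumed to be in microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Per-operation latencies in microseconds (hypothetical names). */
typedef struct {
    double t_cpu;     /* total CPU time for the transaction       */
    double t_m_c;     /* memory-to-CPU transfer of one PDU         */
    double t_m_m;     /* memory-to-memory copy of one PDU          */
    double t_m_disk;  /* memory-I/O read/write of one data block   */
    double t_m_net;   /* memory-network read/write of one PDU      */
} latency_terms;

/* Number of data movement operations of each type (n1..n4). */
typedef struct {
    unsigned n1, n2, n3, n4;
} op_counts;

/* T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net */
double transaction_latency_us(const latency_terms *t, const op_counts *n)
{
    assert(t != NULL && n != NULL);
    return t->t_cpu
         + n->n1 * t->t_m_c
         + n->n2 * t->t_m_m
         + n->n3 * t->t_m_disk
         + n->n4 * t->t_m_net;
}
```

A caller would fill in one `latency_terms` per platform and one `op_counts` per application, which keeps the hardware parameters separate from the per-protocol operation counts.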

8

Memory-CPU Transfers

PDU processing
  Checksum computation and header updating
  Typically one-way data flow (memory to CPU via cache)

Memory stall cycles
  Number of memory stall cycles = (IC)(AR)(MR)(MP)

Cache miss rate
  Worst case: MR = 1 (not as bad!)
  Best case: MR = 0 (trivial)

9

Memory-CPU Transfers cont.

Cache overhead in various cases:
  Worst case: MR = 1, MP = 10, and (MR)(MP) = 10
  Best case: MR = 0 (trivial)
  Average case: MR = 0.1, MP = 10, and (MR)(MP) = 1

Memory-CPU latency depends on internal bus bandwidth
  $T_{m-c} = S / (32 B_i)$ usec, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s

10

Memory-Memory Transfers

Memory-memory transfer:
  Due to memory copy of PDU between protocol layers
  Transfers through caches and CPU
  Stride = 1 (contiguous)
  Transfer involves memory → cache → CPU → cache → memory data movement

Latency:
  Depends on internal (system) bus bandwidth
  $T_{m-m} = 2S / B_i$ usec

11

Memory-I/O and Memory-Network Transfers

Memory-network transfers:
  Pass over the I/O bus
  DMA can be used
  Again, stride = 1 (contiguous)

Latency:
  Limiting factor is the I/O bus bandwidth
  $T_{m-net} = S / B_e$ usec

12

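Latency of Reference Applications

RTP transaction latency:

$$T_{RTP} = T_{cpu}^{RTP} + \frac{S}{32 B_i} + \frac{4S}{B_i} + \frac{S}{B_e} \qquad (1)$$

HTTP transaction latency:

$$T_{HTTP} = T_{cpu}^{HTTP} + \frac{S}{32 B_i} + \frac{4S}{B_i} + \frac{S}{B_e} \qquad (2)$$

IP transaction latency:

$$T_{IP} = T_{cpu}^{IP} + \frac{2S}{B_e} \qquad (3)$$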

13

Peak Throughputs

Assumptions
  CPU latency is negligible compared to data transfer latency and can be ignored
  Bus contention from multiple simultaneously executing transactions does not add any overhead

Server throughput = S/T
  S = size of transaction data
  T = latency of a transaction given by equations 1, 2, and 3

14

Peak Throughputs cont.

Throughput of three network applications

Processor                    Internal bus    IP forwarding   HTTP            RTP streaming
                             (MB/sec)        (Mbits/sec)     (Mbits/sec)     (Mbits/sec)
Intel Pentium IV 3.06 GHz    3200            4264            3640            3640
AMD Athlon XP 3000+          2700            4264            3291            3291
MIPS R16000 700 MHz          3200            4264            3640            3640
Sun UltraSPARC III 900 MHz   1200            4264            1862            1862
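The peak figures can be approximated with a short calculation. The sketch below takes throughput = S/T with the CPU term dropped, using T = S/(32 Bi) + 4S/Bi + S/Be for HTTP and RTP, and T = 2S/Be for IP forwarding. The I/O bus bandwidth is an assumption: the slides do not state Be, and 1066 MB/s is only inferred here from the 4264 Mbits/sec IP forwarding column. Function names are hypothetical.

```c
#include <assert.h>

/* ASSUMPTION: I/O bus bandwidth in MB/s. Not given on the slides;
 * inferred from the IP forwarding column (4264 Mbits/sec). */
#define ASSUMED_BE_MB_PER_S 1066.0

/* Peak IP forwarding throughput in Mbits/sec.
 * T_IP = 2S/Be, so S/T = Be/2 MB/s; multiply by 8 for Mbits/sec. */
double peak_ip_mbits(double be_mb_per_s)
{
    assert(be_mb_per_s > 0.0);
    return 8.0 * be_mb_per_s / 2.0;
}

/* Peak HTTP or RTP throughput in Mbits/sec.
 * With S in bytes and bandwidth in MB/s, S/B is in microseconds,
 * so the per-byte latency below is in usec/byte and its inverse
 * is in MB/s. */
double peak_http_rtp_mbits(double bi_mb_per_s, double be_mb_per_s)
{
    assert(bi_mb_per_s > 0.0 && be_mb_per_s > 0.0);
    double usec_per_byte = 1.0 / (32.0 * bi_mb_per_s)
                         + 4.0 / bi_mb_per_s
                         + 1.0 / be_mb_per_s;
    return 8.0 / usec_per_byte;
}
```

Under that assumed Be, calling `peak_http_rtp_mbits` with the internal bus bandwidths from the table gives values close to the HTTP and RTP columns, which suggests the I/O bus term is shared by all four processors while the internal bus term separates them.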

15

Measurement Based Performance Evaluation

Experimental testbed
  Dual-boot server (Pentium IV 2.0 GHz)
    256 MB RAM
    1 Gbps NIC
  Closed LAN (Cisco Catalyst 3550 1 Gbps switch)

Tools
  Intel VTune
  Windows Performance Monitor
  Netstat
  Linux tools: vmstat, sar, iostat

16

Platforms and Applications

Platforms
  Linux (kernel 2.4.7-10)
  Windows 2000

Applications
  Streaming media servers
    Darwin Streaming Server
    Windows Media Server
  Web servers
    Apache web server
    Microsoft Internet Information Server
  Software router
    Linux kernel IP forwarding

17

Analysis of Operating System Role

Memory throughput test
  ECT (extended copy transfer), memperf

Locality of reference:
  Temporal locality: varying working set size (block size)
  Spatial locality: varying access pattern (strides)

[Figure: memory bandwidth (Mbytes/sec, 0 to 6000) versus block size (working set); series: Linux, Windows.]

18

Analysis of Operating System Role cont.

Context switching overhead

[Figure: usec per context switch (0 to 8) versus number of threads (2 to 128); series: Linux, Windows.]

19

Streaming Media Servers

Experimental design

Factors
  Number of streams (streaming clients)
  Media encoding rate (56 kbps and 300 kbps)
  Stream distribution (unique and multiple media)

Metrics
  Cache misses (L1 and L2 cache)
  Page fault rate
  Throughput

Benchmarking tools
  DSS: streaming load tool
  WMS: media load simulator

20

Cache Performance

[Figure: L1 cache misses (56 kbps). Number of cache misses (millions) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

21

Cache Performance cont.

[Figure: L1 cache misses (300 kbps). Number of cache misses (millions) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

22

Memory Performance

[Figure: Page fault (300 kbps). Page faults/sec (0 to 400) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

23

Throughput

[Figure: Throughput (300 kbps). Throughput (kbps) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

24

Summary: Streaming Media Server Memory Performance

Highest degradation in cache performance (both L1 and L2) when the number of clients is large and the encoding rate is 300kbps with multiple multimedia objects.

When clients demand unique media objects, page fault rate is constant. However, if the request is for multiple objects, the page fault rate increases with the number of clients.

Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields higher throughput. Darwin Streaming Server has lower throughput than Windows Media Server.

25

Web Servers

Experimental design

Factors
  Number of web clients
  Document size

Metrics
  Cache misses (L1 and L2 cache)
  Page fault rate
  Throughput
  Transactions/sec (connection rate)
  Average latency

Benchmarking tool
  WebStone

26

Transactions

[Figure: transactions/sec (0 to 9000) versus file size (5 B to 50 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

27

L1 Cache Miss

[Figure: L1 cache misses (millions, 0 to 1000) versus file size (5 B to 5 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

28

Page Fault

[Figure: page faults/sec (0 to 1000) versus file size (5 B to 50 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

29

Throughput

[Figure: throughput (Mbytes/sec, 0 to 700) versus file size (5 B to 50 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

30

Summary: Web Server Memory Performance Evaluation

Comparing Apache and IIS for an average file size of 10 KB

Attribute                           Apache    IIS
Max. transaction rate (conn/sec)    2586      4178 (58% more than Apache)
Max. throughput (Mbps)              217       349 (62% more than Apache)
CPU utilization (%)                 71        63
L1 misses (millions)                424       200
L2 misses (millions)                1673      117
Page fault rate (pfs/sec)           < 10      < 10

31

Software Router

Experimental design

Factors
  Routing configurations
  TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)

Metrics
  Throughput
  Number of context switches
  Number of active pages

Benchmarking tool
  Netperf

32

Software Router Throughput

[Figure: four panels of throughput (Mbits/sec) versus configuration (1 to 8), including panels titled Ethernet interface 0 and Ethernet interface 1; series: 64-byte packet, 10K packet, 64K packet.]

33

CPU Utilization

[Figure: CPU utilization (%, 0 to 90) versus configuration (1 to 8); series: 64-byte packet, 10K packet, 64K packet.]

34

Context Switching

[Figure: context switches/sec (0 to 6000) versus configuration (1 to 8); series: 64-byte packet, 10K packet, 64K packet.]

35

Active Page

[Figure: number of active pages (860 to 1020) versus configuration (1 to 8); series: 64-byte packet, 10K packet, 64K packet.]

36

Summary: Software Router Performance Evaluation

Maximum throughput of 449 Mbps for configuration number 2 - full duplex one-to-one communication.

Highest CPU utilization was 84%

Highest context switching rate was 5378/sec

Number of active pages fairly uniformly distributed. Indicates low memory activity.

37

Design, Implementation and Evaluation of Prototype DB-RTP Server

Implementation
  Linux platform (C)
  Our implementation of RTSP/RTP (why?)

Architecture

[Figure: DB-RTP server architecture. Components: disk, memory buffer, RTP server, RTP packetizer, parser, RTSP server and scheduler, UDP/IP and TCP/IP stack, NIC. Data units: media chunk, RTP packet, IP packet. Flows: to media client, from media client.]

38

Double Buffering and Synchronization

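Buffer read
  Start
  Check the Next bit (Next = 0: buffer A, Next = 1: buffer B)
  If the dirty bit of the selected buffer is 1 (Dirty_bit_A = 1 or Dirty_bit_B = 1), read Buffer_A or Buffer_B

Buffer write
  Start
  Fetch media chunk from disk
  Check the Next bit (Next = 0: buffer A, Next = 1: buffer B)
  If the dirty bit of the selected buffer is 0 (Dirty_bit_A = 0 or Dirty_bit_B = 0), write Buffer_A or Buffer_B

[Figure: two flowcharts, one for buffer read and one for buffer write.]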
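The double-buffering flowcharts can be sketched as a small state machine in C. This is a single-threaded illustration, not the DB-RTP source. It assumes that a write sets the dirty bit and a read clears it, that the reader and the writer each keep their own Next bit, and that the server adds its own synchronization around these calls. The names `double_buffer`, `db_write`, `db_read`, and `CHUNK_SIZE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 1024  /* hypothetical media chunk capacity in bytes */

typedef struct {
    unsigned char data[2][CHUNK_SIZE]; /* index 0 = buffer A, 1 = buffer B   */
    size_t len[2];                     /* bytes currently held per buffer    */
    int dirty[2];                      /* 1 = unread chunk, 0 = free to fill */
    int next_write;                    /* Next bit used by the writer        */
    int next_read;                     /* Next bit used by the reader        */
} double_buffer;

void db_init(double_buffer *db)
{
    assert(db != NULL);
    memset(db, 0, sizeof *db);
}

/* Writer side: store a chunk only if the selected buffer is clean.
 * Returns 1 on success, 0 if the buffer still holds unread data. */
int db_write(double_buffer *db, const unsigned char *chunk, size_t len)
{
    assert(db != NULL && chunk != NULL && len <= CHUNK_SIZE);
    int b = db->next_write;
    if (db->dirty[b])
        return 0;
    memcpy(db->data[b], chunk, len);
    db->len[b] = len;
    db->dirty[b] = 1;          /* assumption: writing marks the buffer dirty */
    db->next_write = 1 - b;    /* alternate between A and B */
    return 1;
}

/* Reader side: consume a chunk only if the selected buffer is dirty.
 * 'out' must have room for CHUNK_SIZE bytes.
 * Returns 1 on success, 0 if nothing is ready. */
int db_read(double_buffer *db, unsigned char *out, size_t *len)
{
    assert(db != NULL && out != NULL && len != NULL);
    int b = db->next_read;
    if (!db->dirty[b])
        return 0;
    memcpy(out, db->data[b], db->len[b]);
    *len = db->len[b];
    db->dirty[b] = 0;          /* assumption: reading frees the buffer */
    db->next_read = 1 - b;
    return 1;
}
```

With two buffers the disk fetch for the next chunk can proceed while the previous chunk is being packetized, which is the latency-hiding effect the prototype is evaluated for.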

39

RTP Server Throughput

[Figure: bandwidth (Mbps, 0 to 70) versus number of streams; series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple.]

40

Jitter

[Figure: jitter (usec, 0 to 25000) versus number of streams (10 to 110); series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple.]

41

Summary: DB-RTP Server Performance Evaluation

Throughput
  DB-RTP server: 63.85 Mbps
  RTP server: 59 Mbps

Both servers exhibit steady jitter, but the DB-RTP server has lower jitter than the RTP server.

42

Contributions

Cache overhead analysis
Memory latency and bandwidth analysis
Measurement-based performance evaluation
Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server

43

Conclusions

High throughput is possible with server design enhancement.
Server throughput is significantly degraded by excessive cache misses and page faults.
Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.

44

Future Work

Server development
  Hybrid = multiplexing + multithreading

Special architectures (network processors and ASICs)
  Resource scheduling
  Investigation of the role of I/O
  Use of IRAM (intelligent RAM) architectures
  Integrated network infrastructure server

45

Thank you

46

Array padding

Array restructuring

    float rgbFrames[64][64][64][8];

Loop nest transformation

Original code

    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    /* l is the innermost loop but indexes the outermost dimension,
       so consecutive accesses are 64*64*64 floats apart */
    for (i = 0; i < 64; i++)
      for (j = 0; j < 64; j++)
        for (k = 0; k < 64; k++)
          for (l = 0; l < 8; l++)
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];

Transformed code

    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    /* loop order now matches the array layout, so accesses have stride 1 */
    for (l = 0; l < 8; l++)
      for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j++)
          for (k = 0; k < 64; k++)
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];

47

Testbeds

[Figure: two testbeds. Streaming media/web server testbed: dual-boot server (Windows 2000/Linux Server), triple-boot client machines (Windows 2000/Linux Server), Catalyst 3550 1 Gbps switch. Software router testbed: Linux router server with NICs, router clients.]

48

Communication Configurations

[Figure: host-to-host communication configurations.]

1-1 communication (1. simplex, 2. duplex)
Double 1-1 communication (3. simplex, 4. duplex)
1-4 communication (5. simplex, 6. duplex)
Ring communication (7. simplex, 8. duplex)

49

Backup slides

50

Memory Performance

[Figure: Page fault. Two panels, 56 kbps and 300 kbps: page faults/sec (0 to 400) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

51

Streaming Server: CPU Utilization

[Figure: CPU utilization (%, 0 to 90) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]

52

[Figure: number of cache misses (millions, 0 to 2500) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]



53

[Figure: number of cache misses (millions, 0 to 2500) versus number of streams, 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple.]



54

Web Servers

[Figure: Cache performance. Two panels: L1 cache misses (millions, 0 to 1000) and L2 cache misses (millions, 0 to 2500) versus file size (5 B to 5 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

[Figure: Transaction. Transactions/sec (0 to 9000) versus file size (5 B to 50 MB); same series.]

55

Web Servers

[Figure: Latency. Latency (sec, 0 to 350) versus file size (5 B to 50 MB); series: Apache 1 client, Apache 400 clients, IIS 1 client, IIS 400 clients.]

[Figure: CPU Utilization. CPU utilization (%, 0 to 120) versus file size (5 B to 5 MB); same series.]

56

DB-RTP Server

[Figure: three panels versus number of streams: L1 cache misses (millions), L2 cache misses (millions), and CPU utilization (%); series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple.]

57

Memory Performance Evaluation Methodologies

Analytical
  Requires just paper and pencil
  Accuracy?

Simulation
  Requires programming
  Time and cost?

Measurement
  Real system or a prototype required
  Using on-chip counters
  Benchmarking tools
  More accurate

58

Server Performance Tuning

Memory performance tuning
  Array padding
  Array restructuring
  Loop nest transformation

Latency hiding and multithreading
  EPIC (IA-64)
  VIRAM
  Impulse

Multiprocessing and clustering
  Task parallelization
  E.g. Panama cluster router

Special architectures
  Network processors
  ASICs and data flow architectures

59

Temporal vs. spatial locality
  A PDU lacks temporal locality
  Observation: PDU processing exhibits excellent spatial locality
    Suppose a data cache line is 32 bytes (or 16 words) long
    Sequential accesses with stride = 1
    Accessing one word brings in the other 15 words as well
    Thus, effective MR = 1/16 = 6.25%, better than even scientific apps
    Thus, generally MR = W/L
      W: width of each memory access (in bytes)
      L: length of each cache line (in bytes)

Validation of the above observation:
  Similar spatial locality characteristics reported via measurements:
    S. Sohoni et al., “A Study of Memory System Performance of Multimedia Applications,” in Proc. of ACM SIGMETRICS 2001
  MR for streaming media player better than SPEC benchmark apps!
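The MR = W/L rule can be checked by counting line misses over one sequential pass through a PDU. The sketch below is a toy model, not a measurement: it assumes a cold cache, no hardware prefetching, a buffer that starts on a cache line boundary, and an access width W that divides the line length L. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Miss rate of one stride-1 pass over a PDU, counting a miss each
 * time an access touches a cache line not touched by the previous
 * access. All sizes are in bytes. */
double sequential_miss_rate(size_t pdu_bytes, size_t access_width,
                            size_t line_bytes)
{
    assert(access_width > 0 && line_bytes > 0 && pdu_bytes >= access_width);

    size_t accesses = 0;
    size_t misses = 0;
    size_t last_line = (size_t)-1;   /* no line loaded yet */

    for (size_t addr = 0; addr + access_width <= pdu_bytes;
         addr += access_width) {
        size_t line = addr / line_bytes;
        if (line != last_line) {     /* first touch of this line */
            misses++;
            last_line = line;
        }
        accesses++;
    }
    return (double)misses / (double)accesses;
}
```

For the slide's example of 2-byte accesses and 32-byte lines, a PDU whose size is a multiple of the line length gives exactly 1/16. PDU sizes that are not a multiple of L give a slightly higher rate because the last line is only partly used.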

60

Memory-CPU Transfers

PDU processing
  Checksum computation and header updating
  Typically one-way data flow (memory to CPU via cache)

Memory stall cycles
  Number of memory stall cycles = (IC)(AR)(MR)(MP)
    IC: instruction count per transaction
    AR: number of memory accesses per instruction (AR = 1)
    MR: ratio of cache misses to memory accesses
    MP: miss penalty in clock cycles

Cache miss rate
  Worst case: MR = 1 while typically MP = 10
  Stall cycles = 10 x IC

61

Memory-CPU Transfers cont.

Determine cache overhead with respect to execution time:
  (Execution time)no-cache = (IC)(CPI)(CC)
  (Execution time)with-cache = (IC)(CPI)(CC){1 + (MR)(MP)}
  Cache overhead = 1 + (MR)(MP)

Cache overhead in various cases:
  Worst case: MR = 1 and MP = 10
    Cache results in 11 times higher latency for each transaction!
  Best case: MR = 0 (trivial)
  Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1
    Latency due to stalls = ideal execution time without stalls

Memory-CPU latency depends on internal bus bandwidth
  $T_{m-c} = S / (32 B_i)$ usec, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s
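The stall-cycle and cache-overhead formulas are simple products, so a small C sketch can make the three cases concrete. Function names are hypothetical. The overhead factor 1 + (MR)(MP) follows the slide as written, which appears to take AR = 1 and CPI = 1; that reading is an assumption.

```c
#include <assert.h>

/* Number of memory stall cycles = IC * AR * MR * MP. */
double memory_stall_cycles(double ic, double ar, double mr, double mp)
{
    assert(ic >= 0.0 && ar >= 0.0 && mp >= 0.0);
    assert(mr >= 0.0 && mr <= 1.0);
    return ic * ar * mr * mp;
}

/* Cache overhead factor = 1 + MR * MP: the ratio of execution time
 * with cache misses to the ideal execution time without stalls
 * (assumes AR = 1 and CPI = 1, as the slide formula implies). */
double cache_overhead_factor(double mr, double mp)
{
    assert(mp >= 0.0);
    assert(mr >= 0.0 && mr <= 1.0);
    return 1.0 + mr * mp;
}
```

With MP = 10, the worst case (MR = 1) gives a factor of 11, the best case (MR = 0) gives 1, and the average case (MR = 0.1) gives about 2, meaning stalls cost as much as the ideal execution time.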

62

Open Questions

Role of special-purpose architectures (e.g. network processors) in the performance of high throughput servers

Role of memory compression

Role of scheduling
