Memory Performance Evaluation of High Throughput Servers


Page 1: MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS

Garba Ya’u Isa
Master’s Thesis Oral Defense
Computer Engineering
King Fahd University of Petroleum & Minerals

Saturday, 7th June 2003

Page 2: Outline

Introduction
Problem Statement
Analysis of Memory Accesses
Measurement Based Performance Evaluation
Design and Implementation of Prototype
Contributions
Conclusions
Future Work

Page 3: Introduction

Processor and memory performance discrepancy
Growing network bandwidth
  Data rates in terabits per second possible
  Gigabit per second LANs already deployed
High throughput servers in network infrastructure
  Streaming media servers
  Web servers
  Software routers

[Figure: performance (log scale, 1 to 10,000) versus year]

Page 4: Dealing with Performance Gap

Hierarchical memory architecture
  Temporal locality
  Spatial locality
Constraints
  Characteristics of network payload data:
    Large: won’t fit into cache
    Hardly reusable: poor temporal locality

Page 5: Problem Statement

Network servers should:
  Deliver high throughput
  Respond to requests with low latency
  Respond to a large number of clients
Our goal
  Identify specific conditions at which server memory becomes a bottleneck
  Includes: cache, main memory, and virtual memory
Benefits
  Better server design that alleviates memory bottlenecks
  Optimal performance can be achieved
Constraints
  Large amount of data flowing through CPU and memory
  Writing code to optimize memory utilization is a challenge

Page 6: Analysis of Memory Accesses: Data Flow Analysis

Four data transfer paths:
  Memory-CPU
  Memory-memory
  Memory-I/O
  Memory-network

[Figure: server data paths. Components: processor, on-chip cache, off-chip cache, main memory, internal (CPU-memory) bus, bus/DMA controller, I/O bus, disk controller, network interfaces, disks. Labeled transfers: cache-memory transfers, memory-memory transfer via CPU, disk transfer via DMA, network transfer via DMA]

Page 7: Latency Model and Memory Overhead

Each transaction involves:
  CPU cycles
  Data transfers: one or more of the four identified types

Transaction latency:

  $T_{\text{trans}} = T_{\text{cpu}} + n_1 T_{\text{m-c}} + n_2 T_{\text{m-m}} + n_3 T_{\text{m-disk}} + n_4 T_{\text{m-net}}$

  $T_{\text{cpu}}$: total CPU time needed for the transaction
  $T_{\text{m-c}}$: time to transfer the entire PDU from memory to CPU for processing
  $T_{\text{m-m}}$: latency of a memory-memory copy of a PDU
  $T_{\text{m-disk}}$: latency of a memory-I/O read/write of a block of data
  $T_{\text{m-net}}$: latency of a memory-network read/write of a PDU
  $n_i$: number of data movement operations of each type
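The latency model above can be evaluated directly once the per-transfer latencies are known. The following is a minimal sketch, not code from the thesis; the struct and function names are hypothetical and all times are assumed to be in microseconds.

```c
/* Number of data movement operations of each type (n1..n4 in the model).
 * Struct and field names are illustrative only. */
struct transfer_counts {
    int mem_cpu;   /* n1: memory-CPU transfers */
    int mem_mem;   /* n2: memory-memory copies */
    int mem_disk;  /* n3: memory-I/O (disk) transfers */
    int mem_net;   /* n4: memory-network transfers */
};

/* Latency of one operation of each type, in microseconds. */
struct transfer_latency_us {
    double mem_cpu;   /* T_m-c */
    double mem_mem;   /* T_m-m */
    double mem_disk;  /* T_m-disk */
    double mem_net;   /* T_m-net */
};

/* T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net */
double transaction_latency_us(double t_cpu_us,
                              struct transfer_counts n,
                              struct transfer_latency_us t)
{
    return t_cpu_us
         + n.mem_cpu  * t.mem_cpu
         + n.mem_mem  * t.mem_mem
         + n.mem_disk * t.mem_disk
         + n.mem_net  * t.mem_net;
}
```

For example, a transaction with 1 usec of CPU time, one memory-CPU pass (0.5 usec), two memory-memory copies (2 usec each) and one network transfer (3 usec) comes to 8.5 usec under this model.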

Page 8: Memory-CPU Transfers

PDU processing
  Checksum computation and header updating
  Typically, one-way data flow (memory to CPU via cache)
Memory stall cycles
  Number of memory stall cycles = (IC)(AR)(MR)(MP)
Cache miss rate
  Worst case: MR = 1 (not as bad!)
  Best case: MR = 0 (trivial)

Page 9: Memory-CPU Transfers cont.

Cache overhead in various cases:
  Worst case: MR = 1, MP = 10, so (MR)(MP) = 10
  Best case: MR = 0 (trivial)
  Average case: MR = 0.1, MP = 10, so (MR)(MP) = 1
Memory-CPU latency is dependent on internal bus bandwidth
  $T_{\text{m-c}} = \frac{S}{32 B_i}$ usec, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s

Page 10: Memory-Memory Transfers

Memory-memory transfer:
  Due to memory copy of PDU between protocol layers
  Transfers through caches and CPU
  Stride = 1 (contiguous)
  Transfer involves memory → cache → CPU → cache → memory data movement
Latency:
  Dependent on internal (system) bus bandwidth
  $T_{\text{m-m}} = \frac{2S}{B_i}$ usec

Page 11: Memory-I/O and Memory-Network Transfers

Memory-network transfers:
  Pass over the I/O bus
  DMA can be used
  Again, stride = 1 (contiguous)
Latency:
  Limiting factor is the I/O bus bandwidth
  $T_{\text{m-net}} = \frac{S}{B_e}$ usec, where $B_e$ is the I/O bus bandwidth in MB/s

Page 12: Latency of Reference Applications

RTP transaction latency:
  $T_{\text{RTP}} = T_{\text{cpu}}^{\text{RTP}} + \frac{S}{32 B_i} + \frac{4S}{B_i} + \frac{S}{B_e}$   (1)

HTTP transaction latency:
  $T_{\text{HTTP}} = T_{\text{cpu}}^{\text{HTTP}} + \frac{S}{32 B_i} + \frac{4S}{B_i} + \frac{S}{B_e}$   (2)

IP transaction latency:
  $T_{\text{IP}} = T_{\text{cpu}}^{\text{IP}} + \frac{2S}{B_e}$   (3)

Page 13: Peak Throughputs

Assumptions
  CPU usage latency is negligible compared to data transfer latency and can be ignored
  Bus contention from multiple simultaneously executed transactions does not result in any additional overhead
Server throughput = S/T
  S = size of transaction data
  T = latency of a transaction, given by equations 1, 2 and 3

Page 14: Peak Throughputs cont.

Throughput of three network applications

Processor                    Internal bus bandwidth   IP forwarding   HTTP         RTP streaming
                             (MB/sec)                 (Mbits/sec)     (Mbits/sec)  (Mbits/sec)
Intel Pentium IV 3.06 GHz    3200                     4264            3640         3640
AMD Athlon XP 3000+          2700                     4264            3291         3291
MIPS R16000 700 MHz          3200                     4264            3640         3640
Sun UltraSPARC III 900 MHz   1200                     4264            1862         1862
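The tabulated peak throughputs can be reproduced from the latency model with CPU time ignored, that is, throughput = S/T with T = S/(32 B_i) + 4S/B_i + S/B_e for RTP and HTTP, and T = 2S/B_e for IP forwarding. The sketch below is illustrative only: the function names are hypothetical, and the I/O bus bandwidth B_e = 1066 MB/s used in the example is an assumption inferred from the 4264 Mbits/sec IP forwarding figure, not a value stated on the slides.

```c
/* Peak throughput in Mbit/s for RTP and HTTP, ignoring CPU time.
 * bi_mbs: internal bus bandwidth in MB/s; be_mbs: I/O bus bandwidth in MB/s. */
double rtp_http_peak_mbps(double bi_mbs, double be_mbs)
{
    /* Latency per byte in usec: 1/(32*Bi) + 4/Bi + 1/Be (S = 1 byte). */
    double usec_per_byte = 1.0 / (32.0 * bi_mbs) + 4.0 / bi_mbs + 1.0 / be_mbs;
    /* bytes per usec equals MB/s; multiply by 8 for Mbit/s. */
    return 8.0 / usec_per_byte;
}

/* Peak throughput in Mbit/s for IP forwarding: T = 2S/Be, so S/T = Be/2. */
double ip_peak_mbps(double be_mbs)
{
    return 8.0 * be_mbs / 2.0;
}
```

With the assumed B_e of 1066 MB/s, `rtp_http_peak_mbps(3200.0, 1066.0)` evaluates to roughly 3640 and `ip_peak_mbps(1066.0)` to 4264, which agrees with the Pentium IV row of the table.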

Page 15: Measurement Based Performance Evaluation

Experimental testbed
  Dual boot server (Pentium IV 2.0 GHz)
    256 MB RAM
    1.0 Gbps NIC
  Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
Tools
  Intel VTune
  Windows Performance Monitor
  Netstat
  Linux tools: vmstat, sar, iostat

Page 16: Platforms and Applications

Platforms
  Linux (kernel 2.4.7-10)
  Windows 2000
Applications
  Streaming media servers
    Darwin streaming server
    Windows media server
  Web servers
    Apache web server
    Microsoft Internet Information Server
  Software router
    Linux kernel IP forwarding

Page 17: Analysis of Operating System Role

[Figure: memory bandwidth (Mbytes/sec, 0 to 6000) versus block size (working set); series: Linux, Windows]

Memory throughput test
  ECT (extended copy transfer) – memperf
Locality of reference:
  Temporal locality – varying working set size (block size)
  Spatial locality – varying access pattern (strides)
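The two knobs named above, working set size and stride, can be illustrated with a small copy kernel. This is a hedged sketch of the general idea only; it is not the memperf ECT code, and the function name `strided_copy` is hypothetical. A bandwidth measurement would time repeated calls and divide the bytes moved by the elapsed time.

```c
#include <stddef.h>

/* Copy n doubles from src to dst, visiting elements `stride` apart.
 * n sets the working set size (temporal locality); stride sets the
 * access pattern (spatial locality). Every element is copied exactly
 * once; only the visiting order changes. Returns the number copied. */
size_t strided_copy(double *dst, const double *src, size_t n, size_t stride)
{
    size_t copied = 0;
    if (stride == 0)
        return 0;  /* a zero stride would never advance */
    /* One pass per starting offset so that all n elements are covered. */
    for (size_t start = 0; start < stride && start < n; start++) {
        for (size_t i = start; i < n; i += stride) {
            dst[i] = src[i];
            copied++;
        }
    }
    return copied;
}
```

With stride = 1 the kernel walks memory contiguously; larger strides touch a new cache line on most accesses once the stride exceeds the line length, which is the effect a stride sweep is meant to expose.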

Page 18: Analysis of Operating System Role cont.

Context switching overhead

[Figure: usec per context switch (0 to 8) versus number of threads (2, 4, 8, 16, 32, 64, 128); series: Linux, Windows]

Page 19: Streaming Media Servers

Experimental design
  Factors
    Number of streams (streaming clients)
    Media encoding rate (56 kbps and 300 kbps)
    Stream distribution (unique and multiple media)
  Metrics
    Cache misses (L1 and L2 cache)
    Page fault rate
    Throughput
  Benchmarking tools
    DSS – streaming load tool
    WMS – media load simulator

Page 20: Cache Performance

[Figure: L1 cache misses (56 kbps). Number of cache misses (millions) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 21: Cache Performance cont.

[Figure: L1 cache misses (300 kbps). Number of cache misses (millions) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 22: Memory Performance

[Figure: Page fault (300 kbps). Page faults/sec (0 to 400) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 23: Throughput

[Figure: Throughput (300 kbps). Throughput (kbps) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 24: Summary: Streaming Media Server Memory Performance

Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple multimedia objects.

When clients demand unique media objects, the page fault rate is constant. However, if the requests are for multiple objects, the page fault rate increases with the number of clients.

Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields higher throughput. Darwin streaming server has lower throughput than Windows media server.

Page 25: Web Servers

Experimental design
  Factors
    Number of web clients
    Document size
  Metrics
    Cache misses (L1 and L2 cache)
    Page fault rate
    Throughput
    Transactions/sec (connection rate)
    Average latency
  Benchmarking tool
    Webstone

Page 26: Transactions

[Figure: transactions/sec (0 to 9000) versus file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 27: L1 Cache Miss

[Figure: L1 cache misses (millions, 0 to 1000) versus file size (5 B to 5 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 28: Page Fault

[Figure: page faults/sec (0 to 1000) versus file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 29: Throughput

[Figure: throughput (Mbytes/sec, 0 to 700) versus file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 30: Summary: Web Server Memory Performance Evaluation

Comparing Apache and IIS for an average file size of 10 KB

Attribute                          Apache   IIS
Max. transaction rate (conn/sec)   2586     4178 (58% more than Apache)
Max. throughput (Mbps)             217      349 (62% more than Apache)
CPU utilization (%)                71       63
L1 misses (millions)               424      200
L2 misses (millions)               1673     117
Page fault rate (pfs/sec)          < 10     < 10

Page 31: Software Router

Experimental design
  Factors
    Routing configurations
    TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)
  Metrics
    Throughput
    Number of context switches
    Number of active pages
  Benchmarking tool
    Netperf

Page 32: Software Router Throughput

[Figure: four panels, one per Ethernet interface (0 to 3). Each plots throughput (Mbits/sec) versus configuration (1 to 8); series: 64 bytes packet, 10K packet, 64K packet]

Page 33: CPU Utilization

[Figure: CPU utilization (%, 0 to 90) versus configuration (1 to 8); series: 64 bytes packet, 10K packet, 64K packet]

Page 34: Context Switching

[Figure: context switches/sec (0 to 6000) versus configuration (1 to 8); series: 64 bytes packet, 10K packet, 64K packet]

Page 35: Active Page

[Figure: number of active pages (860 to 1020) versus configuration (1 to 8); series: 64 bytes packet, 10K packet, 64K packet]

Page 36: Summary: Software Router Performance Evaluation

Maximum throughput was 449 Mbps, for configuration number 2 (full duplex one-to-one communication).

Highest CPU utilization was 84%.

Highest context switching rate was 5378/sec.

The number of active pages is fairly uniformly distributed across configurations, which indicates low memory activity.

Page 37: Design, Implementation and Evaluation of Prototype DB-RTP Server

Architecture

[Figure: DB-RTP server architecture. Components: RTSP server & scheduler, parser, RTP server, RTP packetizer, disk, memory buffer, UDP/IP & TCP/IP stack, NIC. Data units: media chunk, RTP packet, IP packet. Flows: to media client, from media client]

Implementation
  Linux platform (C)
  Our implementation of RTSP/RTP (why?)

Page 38: Double Buffering and Synchronization

[Flowchart: Buffer read. Start; test the Next bit (Next = 0 selects buffer A, Next = 1 selects buffer B); test the dirty bit of the selected buffer (Dirty_bit_A = 1 or Dirty_bit_B = 1, yes/no branches); on yes, read Buffer_A or Buffer_B]

[Flowchart: Buffer write. Start; fetch media chunk from disk; test the Next bit (Next = 0 selects buffer A, Next = 1 selects buffer B); test the dirty bit of the selected buffer (Dirty_bit_A = 0 or Dirty_bit_B = 0, yes/no branches); on yes, write Buffer_A or Buffer_B]
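The read and write flowcharts can be sketched as a pair of functions over two buffers with one dirty bit each. This is a minimal single-threaded illustration under stated assumptions, not the DB-RTP source: `CHUNK_SIZE`, the struct layout, the function names, and the use of separate next bits for the reader and the writer are all hypothetical. A threaded server would additionally need a mutex, condition variable, or atomics around the dirty bits.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4096  /* hypothetical media chunk capacity in bytes */

struct double_buffer {
    unsigned char data[2][CHUNK_SIZE];  /* index 0 = buffer A, 1 = buffer B */
    size_t len[2];                      /* valid bytes in each buffer */
    int dirty[2];                       /* 1 = holds an unread chunk, 0 = free */
    int next_write;                     /* writer's Next bit */
    int next_read;                      /* reader's Next bit */
};

void db_init(struct double_buffer *db)
{
    memset(db, 0, sizeof *db);
}

/* Buffer write: store a chunk fetched from disk into the buffer selected
 * by the Next bit, but only if its dirty bit is 0. Returns 1 on success,
 * 0 if the buffer is still unread (caller retries later) or len is invalid. */
int db_write(struct double_buffer *db, const void *chunk, size_t len)
{
    int b = db->next_write;
    if (db->dirty[b] || len == 0 || len > CHUNK_SIZE)
        return 0;
    memcpy(db->data[b], chunk, len);
    db->len[b] = len;
    db->dirty[b] = 1;
    db->next_write = !b;  /* alternate between A and B */
    return 1;
}

/* Buffer read: consume the buffer selected by the Next bit, but only if
 * its dirty bit is 1. Returns the number of bytes copied to out, or 0 if
 * no chunk is ready. out must have room for CHUNK_SIZE bytes. */
size_t db_read(struct double_buffer *db, void *out)
{
    int b = db->next_read;
    size_t len;
    if (!db->dirty[b])
        return 0;
    len = db->len[b];
    memcpy(out, db->data[b], len);
    db->dirty[b] = 0;
    db->next_read = !b;
    return len;
}
```

The point of the scheme is that the disk side can fill one buffer while the RTP side drains the other; the dirty bits keep the writer from overwriting an unread chunk and the reader from consuming a stale one.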

Page 39: RTP Server Throughput

[Figure: bandwidth (Mbps, 0 to 70) versus number of streams; series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]

Page 40: Jitter

[Figure: jitter (usec, 0 to 25000) versus number of streams (10 to 110); series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]

Page 41: Summary: DB-RTP Server Performance Evaluation

Throughput
  DB-RTP server – 63.85 Mbps
  RTP server – 59 Mbps

Both servers exhibit steady jitter, but the DB-RTP server has lower jitter than the RTP server.

Page 42: Contributions

Cache overhead analysis
Memory latency and bandwidth analysis
Measurement-based performance evaluation
Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server

Page 43: Conclusions

High throughput is possible with server design enhancement.
Server throughput is significantly degraded by excessive cache misses and page faults.
Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.

Page 44: Future Work

Server development
  Hybrid = multiplexing + multithreading
Special architectures (network processors & ASICs)
  Resource scheduling
  Investigation of the role of I/O
  Use of IRAM (intelligent RAM) architectures
  Integrated network infrastructure server

Page 45: Thank you

Page 46: Array Padding, Array Restructuring, Loop Nest Transformation

Array padding
  Original code:
    float rgbFrames[64][64][64][8];
  Transformed code:
    float rgbFrames[65][65][65][9];

Array restructuring
  Original code:
    float rgbFrames[64][64][64][8];
  Transformed code:
    float rgbFrames[8][64][64][64];

Loop nest transformation
  Original code:
    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    for (i = 0; i < 64; i++)
      for (j = 0; j < 64; j++)
        for (k = 0; k < 64; k++)
          for (l = 0; l < 8; l++)  /* innermost loop walks the outermost dimension */
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];
  Transformed code:
    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    for (l = 0; l < 8; l++)
      for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j++)
          for (k = 0; k < 64; k++)  /* innermost loop walks contiguous elements */
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];
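Why the loop nest transformation helps can be seen from the byte distance between the elements touched by two consecutive innermost iterations. The sketch below is an illustration added here, not part of the slides; the function names are hypothetical and the array dimensions are taken from the slide's declarations.

```c
#include <stddef.h>

enum { FRAMES = 8, DIM = 64 };  /* dimensions of rgbFrames[8][64][64][64] */

/* Original loop order: l (frame index) is innermost, so consecutive
 * iterations touch elements one whole frame apart. */
size_t stride_frame_innermost(void)
{
    return sizeof(float[DIM][DIM][DIM]);
}

/* Transformed loop order: k (last index) is innermost, so consecutive
 * iterations touch adjacent elements. */
size_t stride_k_innermost(void)
{
    return sizeof(float);
}
```

On a platform where `float` is 4 bytes, the original order jumps 1,048,576 bytes per iteration, so every access lands on a different cache line, while the transformed order advances 4 bytes and reuses each fetched line.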

Page 47: Testbeds

[Figure: two testbeds. Streaming media/web server testbed: dual boot server (Windows 2000/Linux Server) and triple-boot client machines (Windows 2000/Linux Server) connected through a Catalyst 3550 1 Gbps switch. Software router testbed: Linux router server with multiple NICs connected to router clients]

Page 48: Communication Configurations

1-1 communication (1. simplex, 2. duplex)
Double 1-1 communication (3. simplex, 4. duplex)
1-4 communication (5. simplex, 6. duplex)
Ring communication (7. simplex, 8. duplex)

[Figure: host-to-host connection diagram for each configuration]

Page 49: Backup slides

Page 50: Memory Performance

[Figure: Page fault, two panels (56 kbps and 300 kbps). Each plots page faults/sec (0 to 400) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 51: Streaming Server: CPU Utilization

[Figure: CPU utilization (%, 0 to 90) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 52: Cache Performance cont.

[Figure: L2 cache misses (56 kbps). Number of cache misses (millions, 0 to 2500) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 53: Cache Performance cont.

[Figure: L2 cache misses (300 kbps). Number of cache misses (millions, 0 to 2500) versus number of streams (clients), 1 to 1000; series: dss unique, dss multiple, wms unique, wms multiple]

Page 54: Web Servers

[Figure: three panels versus file size. Cache performance: L1 cache misses (millions, 0 to 1000) and L2 cache misses (millions, 0 to 2500), file sizes 5 B to 5 MB. Transactions: transactions/sec (0 to 9000), file sizes 5 B to 50 MB. Series in each panel: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 55: Web Servers

[Figure: two panels versus file size. Latency: latency (sec, 0 to 350), file sizes 5 B to 50 MB. CPU utilization: CPU utilization (%, 0 to 120), file sizes 5 B to 5 MB. Series in each panel: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Page 56: DB-RTP Server

[Figure: three panels versus number of streams. L1 cache misses: number of cache misses (millions). L2 cache misses: number of cache misses (millions). CPU utilization: CPU utilization (%, 0 to 4), 10 to 110 streams. Series in each panel: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]

Page 57: Memory Performance Evaluation Methodologies

Analytical
  Requires just paper and pencil
  Accuracy?
Simulation
  Requires programming
  Time and cost?
Measurement
  Real system or a prototype required
  Using on-chip counters
  Benchmarking tools
  More accurate

Page 58: Server Performance Tuning

Memory performance tuning
  Array padding
  Array restructuring
  Loop nest transformation
Latency hiding and multithreading
  EPIC (IA-64)
  VIRAM
  Impulse
Multiprocessing and clustering
  Task parallelization
  E.g. Panama cluster router
Special architectures
  Network processors
  ASICs and data flow architectures

Page 59: Temporal vs. Spatial Locality

A PDU lacks temporal locality
Observation: PDU processing exhibits excellent spatial locality
  Suppose the data cache line is 32 bytes (or 16 words) long
  Sequential accesses with stride = 1
  Accessing one word brings in the other 15 words as well
  Thus, effective MR = 1/16 = 6.25%, better than even scientific apps
  Thus, generally MR = W/L
    W – width of each memory access (in bytes)
    L – length of each cache line (in bytes)
Validation of the above observation:
  Similar spatial locality characteristics reported via measurements:
    S. Sohoni et al., “A Study of Memory System Performance of Multimedia Applications,” in Proc. of ACM SIGMETRICS 2001
  MR for streaming media player better than SPEC benchmark apps!
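The relation MR = W/L for a sequential, stride-1 pass can be checked with a few lines that count cache line fills. This is a simplified sketch, not a cache simulator from the thesis: the function name is hypothetical, and it assumes aligned accesses, W dividing L, a cold cache, and data that is never revisited.

```c
#include <stddef.h>

/* Walk `bytes` bytes sequentially with accesses of width w bytes and a
 * cache line of l bytes. An access misses when it falls on a line other
 * than the most recently fetched one. Returns misses / accesses. */
double sequential_miss_rate(size_t bytes, size_t w, size_t l)
{
    size_t accesses = 0, misses = 0;
    size_t cached_line = (size_t)-1;  /* sentinel: no line fetched yet */

    if (w == 0 || l == 0)
        return 0.0;
    for (size_t addr = 0; addr + w <= bytes; addr += w) {
        size_t line = addr / l;
        if (line != cached_line) {  /* first touch of this line: a miss */
            misses++;
            cached_line = line;
        }
        accesses++;
    }
    return accesses ? (double)misses / (double)accesses : 0.0;
}
```

For the slide's case of 2-byte words and a 32-byte line, `sequential_miss_rate(1024, 2, 32)` gives 32 misses over 512 accesses, i.e. 1/16.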

Page 60: Memory-CPU Transfers

PDU processing
  Checksum computation and header updating
  Typically, one-way data flow (memory to CPU via cache)
Memory stall cycles
  Number of memory stall cycles = (IC)(AR)(MR)(MP)
    IC – instruction count per transaction
    AR – number of memory accesses per instruction (AR = 1)
    MR – ratio of cache misses to memory accesses
    MP – miss penalty in clock cycles
Cache miss rate
  Worst case: MR = 1, while typically MP = 10
  Stall cycles = 10 x IC

Page 61: Memory-CPU Transfers cont.

Determine cache overhead with respect to execution time:
  $(\text{Execution time})_{\text{no-cache}} = (IC)(CPI)(CC)$
  $(\text{Execution time})_{\text{with-cache}} = (IC)(CPI)(CC)\{1 + (MR)(MP)\}$
  Cache overhead = 1 + (MR)(MP)
Cache overhead in various cases:
  Worst case: MR = 1 and MP = 10
    Cache results in 11 times higher latency for each transaction!
  Best case: MR = 0 (trivial)
  Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1
    Latency due to stalls = ideal execution time without stalls
Memory-CPU latency is dependent on internal bus bandwidth
  $T_{\text{m-c}} = \frac{S}{32 B_i}$ usec, where $S$ is the PDU size and $B_i$ is the internal bus bandwidth in MB/s
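The stall-cycle and overhead formulas reduce to two one-line functions. The sketch below uses hypothetical function names; note that the factor 1 + (MR)(MP) matches the general expression IC x (CPI + AR x MR x MP) x CC only when AR/CPI = 1, so the code assumes AR = 1 (as stated on the slides) and CPI = 1 (an assumption made here).

```c
/* Memory stall cycles = IC * AR * MR * MP.
 * ic: instructions per transaction, ar: memory accesses per instruction,
 * mr: cache miss rate, mp: miss penalty in clock cycles. */
double memory_stall_cycles(double ic, double ar, double mr, double mp)
{
    return ic * ar * mr * mp;
}

/* Execution time with cache stalls relative to the stall-free time,
 * assuming AR = 1 and CPI = 1: overhead = 1 + MR * MP. */
double cache_overhead(double mr, double mp)
{
    return 1.0 + mr * mp;
}
```

With MP = 10, the worst case (MR = 1) gives an overhead of 11, the best case (MR = 0) gives 1, and the average case (MR = 0.1) gives about 2, meaning stalls cost as much time as the ideal execution itself.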

Page 62: Open Questions

Role of special-purpose architectures (e.g. network processors) in the performance of high throughput servers

Role of memory compression

Role of scheduling