On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
in collaboration with
Sun Microsystems and Ericsson Research
Per Stenström
Department of Computer Engineering, Chalmers,
Göteborg, Sweden
http://www.ce.chalmers.se/~pers
Motivation
• Database applications dominate (32%)
• Yet, major focus is on scientific/eng. apps (16%)
Dominating multiprocessor server applications, 1995 (Source: Dataquest)

[Pie chart: Database 32%, Scientific and engineering 16%; the remaining 52% is split among file server, media and e-mail, print server, and other.]
Project Objective
• Design principles for high-performance memory systems for emerging applications
• Systems considered:
– high-performance compute nodes
– SMP and DSM systems built out of them
• Applications considered:
– Decision support and on-line transaction processing
– Emerging applications:
• Computer graphics
• video/sound coding/decoding
• handwriting recognition
• ...
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Experimental Platform
Single- and multiprocessor system models

The platform enables:
• analysis of commercial workloads
• analysis of OS effects
• tracking architectural events to the OS or application level

[Figure: simulated system — application and operating system (Linux) running on a SPARC V8 CPU model with cache ($), memory (M), and devices (interrupt, TTY, SCSI, Ethernet).]
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Decision-Support Systems (DSS)
Compile a list of matching entries in several database relations
Will moderately sized caches suffice for huge databases?
[Figure: DSS query plan — a tree of join nodes at levels 2 through i, with scan nodes at the leaves (level 1).]
Our Findings
• MWS: footprint of instructions and private data needed to access a single tuple
– typically small (< 1 Mbyte) and not affected by database size
• DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
– typically a small impact (~0.1%) on the overall miss rate
[Figure: cache miss rate versus cache size, with the working sets MWS and DWS_1, DWS_2, ..., DWS_i marked along the cache-size axis.]
Methodological Approach
Challenges:
• Not feasible to simulate huge databases
• Need source code: we used PostgreSQL and MySQL
Approach:
• Analytical model using:
– parameters that describe the query
– parameters measured on downscaled query executions
– system parameters
Footprints and Reuse Characteristics in DSS
• MWS: instructions, private data, and metadata
– can be measured on a downscaled simulation
• DWS: all tuples accessed at lower levels
– can be computed from the query composition and the probability of a match
[Figure: the query plan tree annotated with footprints per tuple access — MWS and DWS_k for a tuple access at level k, k = 1 .. i.]
Analytical Model: An Overview
• Goal: predict the miss rate versus cache size for fully associative caches with LRU replacement on single-processor systems
Number of cold misses:
• footprint size / block size
– |MWS| is measured
– |DWS_i| is computed from parameters describing the query (sizes of relations, probability of matching a search criterion, index versus sequential scan, etc.)

Number of capacity misses for a tuple access at level i (C = cache size):
• CM_0 * (1 - (C - C_0) / (|MWS| - C_0)) if C_0 < C < |MWS|
• (tuple size) / (block size) if |MWS| <= C < |MWS| + |DWS_i|

Number of accesses per tuple: measured
Total number of misses and accesses: computed
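As a complement, a minimal C sketch of how such a model can be evaluated is given below. Only the piecewise formulas above come from the slide; the structure, names, and parameter values are illustrative assumptions, not the authors' code.

#include <stdio.h>

/* Model inputs: |MWS| and CM0 are measured on downscaled runs,
 * |DWS_i| is computed from the query parameters. */
typedef struct {
    double mws;        /* |MWS| in bytes (measured)                       */
    double dws_i;      /* |DWS_i| in bytes (computed)                     */
    double c0;         /* cache size where capacity misses start to decay */
    double cm0;        /* capacity misses per tuple access at C <= C0     */
    double tuple_size; /* bytes of tuple data touched per access          */
    double block_size; /* cache block size in bytes                       */
    double accesses;   /* accesses per tuple (measured)                   */
} model_params;

/* Capacity misses per tuple access at level i for a fully
 * associative LRU cache of C bytes (the piecewise model above). */
static double capacity_misses(const model_params *p, double c)
{
    if (c <= p->c0)
        return p->cm0;                 /* cache too small: all CM0 remain   */
    if (c < p->mws)                    /* linear decay between C0 and |MWS| */
        return p->cm0 * (1.0 - (c - p->c0) / (p->mws - p->c0));
    if (c < p->mws + p->dws_i)         /* MWS fits but DWS_i does not:      */
        return p->tuple_size / p->block_size;  /* the tuple itself misses   */
    return 0.0;                        /* both footprints fit               */
}

int main(void)
{
    /* illustrative numbers only */
    model_params p = { 1 << 20, 64 << 20, 64 << 10, 50.0, 200.0, 64.0, 400.0 };
    for (double c = 16 << 10; c <= (double)(128 << 20); c *= 2.0)
        printf("%10.0f KB: miss ratio %.4f\n", c / 1024.0,
               capacity_misses(&p, c) / p.accesses);
    return 0;
}

Cold misses, footprint size / block size per the slide, are a one-time cost and are left out of the per-access loop above.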
Model Validation
Goal:
• Prediction accuracy for queries with different compositions
– Q3, Q6, and Q10 from TPC-D
• Prediction accuracy when scaling up the database
– parameters measured on a 5-Mbyte database are used to predict 200-Mbyte databases
• Robustness across database engines
– Two engines: PostgreSQL and MySQL
[Graph: measured versus predicted miss ratio as a function of cache size (16 Kbytes to 4 Mbytes), shown for database data alone and in total.]

Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins
Model Predictions: Miss rates for Huge Databases
Miss ratio (%) versus cache size (Kbytes) for Q3 on a 10-Terabyte database

[Graph: predicted miss ratio (0 to 2.5%), broken down into metadata, database data, private data, and instructions.]
• The miss rate components for instructions, private data, and metadata decay rapidly (by 128 Kbytes)
• The miss rate component for database data is small
What’s in the tail?
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Cache Issues for Linked Data Structures
• Pointer-chasing shows up in many interesting applications:
– 35% of the misses in OLTP (TPC-B)
– 32% of the misses in an expert system
– 21% of the misses in Raytrace
• Traversal of lists may exhibit poor temporal locality
• The result is chains of data-dependent loads, called pointer-chasing
SW Prefetch Techniques to Attack Pointer-Chasing
• Greedy Prefetching (G)
– falls short when the computation per node is less than the miss latency
• Jump Pointer Prefetching (J)
– falls short for short lists, or when the traversal is not known a priori
• Prefetch Arrays (P.S/P.H)
– a generalization of G and J that addresses the above shortcomings
– trades memory space and bandwidth for more latency tolerance
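To make the three schemes concrete, here is a hedged C sketch of a list traversal under each one. The node layout, the distance D, and the use of gcc's __builtin_prefetch are our illustrative assumptions, not the code evaluated in the study.

#define D 4                      /* jump/prefetch distance (assumed) */

typedef struct node {
    struct node *next;
    struct node *jump;           /* jump pointer: D nodes ahead      */
    int payload;
} node;

typedef struct {
    node *head;
    node *pa[D];                 /* prefetch array: the first D nodes */
} list;

/* Greedy (G): prefetch the next node; hides latency only when the
 * computation per node is at least one miss latency. */
int sum_greedy(node *n)
{
    int s = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->next);
        s += n->payload;
    }
    return s;
}

/* Jump pointers (J): prefetch D nodes ahead; no coverage for the
 * first D nodes, and little help for short lists. */
int sum_jump(node *n)
{
    int s = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->jump);
        s += n->payload;
    }
    return s;
}

/* Prefetch arrays (P): a prefetch array at the head covers the first
 * D nodes, jump pointers cover the rest; more latency tolerance at
 * the cost of extra memory and bandwidth. */
int sum_parray(list *l)
{
    for (int i = 0; i < D; i++)
        __builtin_prefetch(l->pa[i]);
    int s = 0;
    for (node *n = l->head; n; n = n->next) {
        __builtin_prefetch(n->jump);
        s += n->payload;
    }
    return s;
}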
Results: Hash Tables and Lists in Olden

[Bar chart: normalized execution time, split into busy time and memory stall time, for MST and HEALTH under the baseline (B), greedy (G), jump pointers (J), and software and hardware prefetch arrays (P.S, P.H).]
Prefetch arrays do better because:
• MST has short lists and little computation per node
• they prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching
Results: Tree Traversals in OLTP and Olden
Hardware-based prefetch arrays do better because:
• the traversal path is not known in DB.tree (depth-first search)
• data for the first nodes is prefetched in Tree.add
[Bar chart: normalized execution time, split into busy time and memory stall time, for DB.tree and Tree.add under B, G, J, P.S, and P.H.]
Other Results in Brief
• Impact of longer memory latencies:
– robust for lists
– for trees, prefetch arrays may cause severe cache pollution
• Impact of memory bandwidth:
– performance improvements are sustained at bandwidths typical of high-end servers (2.4 Gbytes/s)
– prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
• Node insertion and deletion with jump pointers and prefetch arrays:
– result in instruction overhead (-)
– however, insertion/deletion is itself sped up by prefetching (+)
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Coherence Issues in OLTP
• Favorite protocol: write-invalidate
• Ownership overhead: invalidations cause write stall and inval. traffic
[Figure: a DSM system built from SMP nodes; each node contains processors (P) with caches ($) and memory modules (M).]
Ownership Overhead in OLTP

Simulation setup:
• CC-NUMA with 4 nodes
• MySQL, TPC-B, 600 MB database
                 Kernel   DBMS   Library   Total
Load-Store          44%    31%       27%     40%
Stored-Between      32%    61%       66%     41%
Loaded-Between      24%     8%        7%     19%
• 40% of all ownership transactions stem from load/store sequences
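What such a load/store sequence looks like at the source level, and why it is expensive under plain write-invalidate, can be sketched in a few lines of C (the struct and function are an illustrative example, not taken from the workload):

struct account { long balance; };

void credit(struct account *a, long amount)
{
    long b = a->balance;      /* load: read miss fetches the block shared */
    a->balance = b + amount;  /* store: a separate ownership (invalidation)
                                 transaction upgrades it to exclusive     */
}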
Techniques to Attack Ownership Overhead
• Dynamic detection of migratory sharing
– detects two load/store sequences by different processors
– covers only a subset of all load/store sequences (~40% in OLTP)
• Static detection of load/store sequences
– compiler algorithms tag a load that is followed by a store, so the block is brought into the cache in exclusive state
– poses problems in TPC-B
New Protocol Extension
Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store.
[Bar chart: normalized execution time, split into busy time, read stall, and write stall, for the baseline, migratory (Mig), and load/store (LS) protocols.]
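A minimal sketch of the directory-side bookkeeping this criterion implies is given below; all names and the exact per-block state are our assumptions, not the authors' implementation.

#include <stdbool.h>

typedef struct {
    int  last_load_miss_cpu;   /* processor whose load last missed here */
    bool tagged_load_store;    /* block classified as Load/Store        */
} dir_entry;

/* On a load miss from 'cpu': record the requester and, if the block
 * is already tagged, grant the copy in exclusive state so the store
 * that follows needs no separate ownership transaction. */
bool grant_exclusive_on_load_miss(dir_entry *e, int cpu)
{
    e->last_load_miss_cpu = cpu;
    return e->tagged_load_store;
}

/* On a global store from 'cpu': if it follows that same processor's
 * load miss, the block exhibits a load/store sequence; tag it. */
void on_global_store(dir_entry *e, int cpu)
{
    if (e->last_load_miss_cpu == cpu)
        e->tagged_load_store = true;
}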
Concluding Remarks
• The focus on DSS and OLTP has revealed challenges not exposed by traditional applications:
– pointer-chasing
– load/store optimizations
• Application scaling is not fully understood:
– our work on combining simulation with analytical modeling shows some promise