On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
in collaboration with
Sun Microsystems and Ericsson Research
Per Stenström
Department of Computer Engineering, Chalmers,
Göteborg, Sweden
http://www.ce.chalmers.se/~pers
Motivation
• Database applications dominate (32%)
• Yet, major focus is on scientific/eng. apps (16%)
Dominating multiprocessor server applications, 1995 (Source: Dataquest)

[Pie chart: Database 32%, Scientific and engineering 16%; the remaining 52% is split among file server, media and e-mail, print server, and other.]
Project Objective
• Design principles for high-performance memory systems for emerging applications
• Systems considered:
– high-performance compute nodes
– SMP and DSM systems built out of them
• Applications considered:
– Decision support and on-line transaction processing
– Emerging applications:
• Computer graphics
• video/sound coding/decoding
• handwriting recognition
• ...
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Experimental Platform
Single- and multiprocessor system models

The platform enables:
• analysis of commercial workloads
• analysis of OS effects
• tracking architectural events to the OS or application level

[Figure: simulated system — application and operating system (Linux) running on a SPARC V8 CPU model with cache ($), memory (M), and devices (interrupt, TTY, SCSI, Ethernet).]
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Decision-Support Systems (DSS)
Compile a list of matching entries in several database relations
Will moderately sized caches suffice for huge databases?
[Figure: DSS query plan — a tree of join nodes at levels 2 through i, with scan nodes at the leaves (level 1).]
Our Findings
• MWS: footprint of instructions and private data needed to access a single tuple
– typically small (< 1 Mbyte) and not affected by database size
• DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
– typically a small impact (~0.1%) on the overall miss rate
[Figure: cache miss rate versus cache size, with the working sets MWS and DWS_1, DWS_2, ..., DWS_i marked along the cache-size axis.]
Methodological Approach
Challenges:
• Not feasible to simulate huge databases
• Need source code: we used PostgreSQL and MySQL
Approach:
• Analytical model using:
– parameters that describe the query
– parameters measured on downscaled query executions
– system parameters
Footprints and Reuse Characteristics in DSS
• MWS: instructions, private data, and metadata
– can be measured on a downscaled simulation
• DWS: all tuples accessed at lower levels
– can be computed from the query composition and the probability of a match
[Figure: the query plan tree annotated with footprints per tuple access — MWS and DWS_k for a tuple access at level k, k = 1 .. i.]
Analytical Model: An Overview
• Goal: predict the miss rate versus cache size for fully associative caches with LRU replacement on single-processor systems
Number of cold misses:
• footprint size / block size
– |MWS| is measured
– |DWS_i| is computed from parameters describing the query (sizes of relations, probability of matching a search criterion, index versus sequential scan, etc.)

Number of capacity misses for a tuple access at level i (C = cache size):
• CM_0 * (1 - (C - C_0) / (|MWS| - C_0)) if C_0 < C < |MWS|
• (tuple size) / (block size) if |MWS| <= C < |MWS| + |DWS_i|

Number of accesses per tuple: measured
Total number of misses and accesses: computed
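As a complement, a minimal C sketch of how such a model can be evaluated is given below. Only the piecewise formulas above come from the slide; the structure, names, and parameter values are illustrative assumptions, not the authors' code.

#include <stdio.h>

/* Model inputs: |MWS| and CM0 are measured on downscaled runs,
 * |DWS_i| is computed from the query parameters. */
typedef struct {
    double mws;        /* |MWS| in bytes (measured)                       */
    double dws_i;      /* |DWS_i| in bytes (computed)                     */
    double c0;         /* cache size where capacity misses start to decay */
    double cm0;        /* capacity misses per tuple access at C <= C0     */
    double tuple_size; /* bytes of tuple data touched per access          */
    double block_size; /* cache block size in bytes                       */
    double accesses;   /* accesses per tuple (measured)                   */
} model_params;

/* Capacity misses per tuple access at level i for a fully
 * associative LRU cache of C bytes (the piecewise model above). */
static double capacity_misses(const model_params *p, double c)
{
    if (c <= p->c0)
        return p->cm0;                 /* cache too small: all CM0 remain   */
    if (c < p->mws)                    /* linear decay between C0 and |MWS| */
        return p->cm0 * (1.0 - (c - p->c0) / (p->mws - p->c0));
    if (c < p->mws + p->dws_i)         /* MWS fits but DWS_i does not:      */
        return p->tuple_size / p->block_size;  /* the tuple itself misses   */
    return 0.0;                        /* both footprints fit               */
}

int main(void)
{
    /* illustrative numbers only */
    model_params p = { 1 << 20, 64 << 20, 64 << 10, 50.0, 200.0, 64.0, 400.0 };
    for (double c = 16 << 10; c <= (double)(128 << 20); c *= 2.0)
        printf("%10.0f KB: miss ratio %.4f\n", c / 1024.0,
               capacity_misses(&p, c) / p.accesses);
    return 0;
}

Cold misses, footprint size / block size per the slide, are a one-time cost and are left out of the per-access loop above.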
Model Validation
Goal:
• Prediction accuracy for queries with different compositions
– Q3, Q6, and Q10 from TPC-D
• Prediction accuracy when scaling up the database
– parameters measured on a 5-Mbyte database are used to predict 200-Mbyte databases
• Robustness across database engines
– Two engines: PostgreSQL and MySQL
[Graph: measured versus predicted miss ratio as a function of cache size (16 Kbytes to 4 Mbytes), shown for database data alone and in total.]

Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins
Model Predictions: Miss rates for Huge Databases
Miss ratio (%) versus cache size (Kbytes) for Q3 on a 10-Terabyte database

[Graph: predicted miss ratio (0 to 2.5%), broken down into metadata, database data, private data, and instructions.]
• The miss rate components for instructions, private data, and metadata decay rapidly (by 128 Kbytes)
• The miss rate component for database data is small
What’s in the tail?
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Cache Issues for Linked Data Structures
• Pointer-chasing shows up in many interesting applications:
– 35% of the misses in OLTP (TPC-B)
– 32% of the misses in an expert system
– 21% of the misses in Raytrace
• Traversal of lists may exhibit poor temporal locality
• The result is chains of data-dependent loads, called pointer-chasing
SW Prefetch Techniques to Attack Pointer-Chasing
• Greedy Prefetching (G)
– falls short when the computation per node is less than the miss latency
• Jump Pointer Prefetching (J)
– falls short for short lists, or when the traversal is not known a priori
• Prefetch Arrays (P.S/P.H)
– a generalization of G and J that addresses the above shortcomings
– trades memory space and bandwidth for more latency tolerance
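To make the three schemes concrete, here is a hedged C sketch of a list traversal under each one. The node layout, the distance D, and the use of gcc's __builtin_prefetch are our illustrative assumptions, not the code evaluated in the study.

#define D 4                      /* jump/prefetch distance (assumed) */

typedef struct node {
    struct node *next;
    struct node *jump;           /* jump pointer: D nodes ahead      */
    int payload;
} node;

typedef struct {
    node *head;
    node *pa[D];                 /* prefetch array: the first D nodes */
} list;

/* Greedy (G): prefetch the next node; hides latency only when the
 * computation per node is at least one miss latency. */
int sum_greedy(node *n)
{
    int s = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->next);
        s += n->payload;
    }
    return s;
}

/* Jump pointers (J): prefetch D nodes ahead; no coverage for the
 * first D nodes, and little help for short lists. */
int sum_jump(node *n)
{
    int s = 0;
    for (; n; n = n->next) {
        __builtin_prefetch(n->jump);
        s += n->payload;
    }
    return s;
}

/* Prefetch arrays (P): a prefetch array at the head covers the first
 * D nodes, jump pointers cover the rest; more latency tolerance at
 * the cost of extra memory and bandwidth. */
int sum_parray(list *l)
{
    for (int i = 0; i < D; i++)
        __builtin_prefetch(l->pa[i]);
    int s = 0;
    for (node *n = l->head; n; n = n->next) {
        __builtin_prefetch(n->jump);
        s += n->payload;
    }
    return s;
}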
Results: Hash Tables and Lists in Olden

[Bar chart: normalized execution time, split into busy time and memory stall time, for MST and HEALTH under the baseline (B), greedy (G), jump pointers (J), and software and hardware prefetch arrays (P.S, P.H).]
Prefetch arrays do better because:
• MST has short lists and little computation per node
• they prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching
Results: Tree Traversals in OLTP and Olden
Hardware-based prefetch arrays do better because:
• the traversal path is not known in DB.tree (depth-first search)
• data for the first nodes is prefetched in Tree.add
[Bar chart: normalized execution time, split into busy time and memory stall time, for DB.tree and Tree.add under B, G, J, P.S, and P.H.]
Other Results in Brief
• Impact of longer memory latencies:
– robust for lists
– for trees, prefetch arrays may cause severe cache pollution
• Impact of memory bandwidth:
– performance improvements are sustained at bandwidths typical of high-end servers (2.4 Gbytes/s)
– prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
• Node insertion and deletion with jump pointers and prefetch arrays:
– result in instruction overhead (-)
– however, insertion/deletion is itself sped up by prefetching (+)
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Coherence Issues in OLTP
• Favorite protocol: write-invalidate
• Ownership overhead: invalidations cause write stall and inval. traffic
[Figure: a DSM system built from SMP nodes; each node contains processors (P) with caches ($) and memory modules (M).]
Ownership Overhead in OLTP

Simulation setup:
• CC-NUMA with 4 nodes
• MySQL, TPC-B, 600 MB database
                 Kernel   DBMS   Library   Total
Load-Store          44%    31%       27%     40%
Stored-Between      32%    61%       66%     41%
Loaded-Between      24%     8%        7%     19%
• 40% of all ownership transactions stem from load/store sequences
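What such a load/store sequence looks like at the source level, and why it is expensive under plain write-invalidate, can be sketched in a few lines of C (the struct and function are an illustrative example, not taken from the workload):

struct account { long balance; };

void credit(struct account *a, long amount)
{
    long b = a->balance;      /* load: read miss fetches the block shared */
    a->balance = b + amount;  /* store: a separate ownership (invalidation)
                                 transaction upgrades it to exclusive     */
}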
Techniques to Attack Ownership Overhead
• Dynamic detection of migratory sharing
– detects two load/store sequences by different processors
– covers only a subset of all load/store sequences (~40% in OLTP)
• Static detection of load/store sequences
– compiler algorithms tag a load that is followed by a store, so the block is brought into the cache in exclusive state
– poses problems in TPC-B
New Protocol Extension
Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store.
[Bar chart: normalized execution time, split into busy time, read stall, and write stall, for the baseline, migratory (Mig), and load/store (LS) protocols.]
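A minimal sketch of the directory-side bookkeeping this criterion implies is given below; all names and the exact per-block state are our assumptions, not the authors' implementation.

#include <stdbool.h>

typedef struct {
    int  last_load_miss_cpu;   /* processor whose load last missed here */
    bool tagged_load_store;    /* block classified as Load/Store        */
} dir_entry;

/* On a load miss from 'cpu': record the requester and, if the block
 * is already tagged, grant the copy in exclusive state so the store
 * that follows needs no separate ownership transaction. */
bool grant_exclusive_on_load_miss(dir_entry *e, int cpu)
{
    e->last_load_miss_cpu = cpu;
    return e->tagged_load_store;
}

/* On a global store from 'cpu': if it follows that same processor's
 * load miss, the block exhibits a load/store sequence; tag it. */
void on_global_store(dir_entry *e, int cpu)
{
    if (e->last_load_miss_cpu == cpu)
        e->tagged_load_store = true;
}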
Concluding Remarks
• The focus on DSS and OLTP has revealed challenges not exposed by traditional applications:
– pointer-chasing
– load/store optimizations
• Application scaling is not fully understood:
– our work on combining simulation with analytical modeling shows some promise