STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System
Russell M. Clapp
IBM NUMA-Q
[email protected]
January 21st, 2001


Background
• STiNG was the code name of an engineering development.
– Architecture introduced at ISCA in 1996.
• NUMA-Q was the name of the product family that resulted.
– Features phased in over multiple releases.
– Not yet feature complete in mid-1997.
– Unable to validate simulation results at that time.
• Left Sequent to pursue other opportunities in 1997.
• Recently returned to what is now IBM NUMA-Q.
– Looking at data collected between late 1997 and 1999.
– Now trying to validate simulation results – sort of.
• This presentation compares apples and oranges.
– At least they are both types of fruit!
• A lot of the details are different, but the overall conclusions are the same.
Outline
• Brief review of STiNG architecture.
– Updates on changes made for second-generation NUMA-Q.
• Examination of OLTP and DSS workloads.
– Per instruction workload profiles.
– How they look on a CC-NUMA machine.
• Comparison of architectural simulation and measured results.
– OLTP and DSS workloads.
– Report latencies and consumed bandwidths throughout the system.
• Conclusions
STiNG Architecture
• 1 GByte/sec SCI interconnect.
• MESI and modified SCI coherence protocols.
• Up to 63 nodes in a coherence domain.
[Block diagram: Quad 0 and Quad 1 joined by 18-bit SCI in/out links through the DataPump; within a quad, the Xeon bus (147 bits) connects through the Lynxbus to the SCI remote cache tags on the LYNX2 and SBB boards.]
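As a rough illustration of how a line fill is serviced in this organization, the following sketch (my own, not from the presentation; the mapping and names are illustrative assumptions) classifies an L2 miss by whether the line's home is the local quad, hits in the quad's SCI remote cache, or must traverse the SCI ring to a remote quad.

```python
# Illustrative sketch (not from the slides): where an L2 miss is serviced
# in a CC-NUMA quad system with an SCI remote cache. The home-quad mapping
# and addresses below are toy assumptions.

from dataclasses import dataclass

@dataclass
class Quad:
    quad_id: int
    remote_cache: set          # addresses currently held in this quad's remote cache

def home_quad(address: int, num_quads: int) -> int:
    """Toy home-quad mapping (addresses interleaved across quads)."""
    return address % num_quads

def service_miss(address: int, local: Quad, num_quads: int) -> str:
    """Classify an L2 cache miss the way the cache miss service
    distribution slides do: local memory, remote cache hit, or a
    remote fetch over the SCI ring."""
    if home_quad(address, num_quads) == local.quad_id:
        return "local memory (quad bus only)"
    if address in local.remote_cache:
        return "remote cache hit (stays on the local quad)"
    return "remote memory via SCI (2 or more hops)"

if __name__ == "__main__":
    q0 = Quad(quad_id=0, remote_cache={17})
    for addr in (4, 17, 21):            # toy addresses
        print(addr, "->", service_miss(addr, q0, num_quads=4))
```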
STiNG to NUMA-Q Product Evolution
• STiNG assumed product features that were introduced in NUMA-Q over time:
– Multipath I/O.
– Intelligent I/O drivers.
– Improved chipset behavior.
– Dual protocol processing engine in SCLIC.
– Larger processor and remote cache sizes.
• These changes were all incorporated in the second generation of NUMA-Q, code named “Scorpion”.
– This is the data that I will present today for OLTP workloads.
– I will also present some data from the first generation of NUMA-Q.
• “Centurion” increases the bus and ASIC speeds to 100 MHz and adds some other minor enhancements.
STiNG Simulation vs. NUMA-Q Data Collection
• “Application profiles” were determined for several database benchmark workloads.
– We compare to measured profiles on NUMA-Q.
• The TPC-B benchmark and Query 6 of the TPC-D suite were chosen as representative of OLTP and DSS usage models.
– TPC-C and Query 5 of TPC-D were measured on NUMA-Q.
• A simulator with behavioral-level models of all system ASICs and data paths was built.
• The application profiles were used to drive the simulation model.
• The simulation model reports latencies, resource utilization rates, and instruction throughput.
– Data was collected using hardware performance counters.
– The counters provide data for over 1000 different metrics.
STiNG Simulation vs. NUMA-Q Data Collection
• STiNG simulations included the “Dual SCLIC.”
– TPC-C data is for the dual sequencer version of the SCLIC, the “DSCLIC”.
• Clearly, we are comparing apples and oranges.
– But many of the results are similar!
[Table: simulated vs. measured system configurations, including one or two protocol sequencers per SCLIC.]
Application Profiles
• Larger cache leads to lower MPI for TPC-C vs. TPC-B.
• Query 5 has higher MPI than Query 6 despite the larger cache.
• Remote memory access rates are lower than what we assumed.
– Lots of OS work made this happen.
• Remote cache miss rates are higher than expected.
– Fewer L2 capacity misses to remote memory that would have hit in the remote cache.
• TPC-C has a lower I/O rate than TPC-B.
                                          Query 5     Query 6     TPC-C       TPC-B
                                          (NUMA-Q)    (STiNG)     (NUMA-Q)    (STiNG)
L2 Cache Size (per processor)             1M          512K        2M          512K
L2 Cache Miss (per instruction)           0.0031      0.0018      0.0073      0.0223
Remote Memory Access Rate (per L2 miss)   27%         35%         24%         35%
Remote Cache Size (MB, all 4-way)         32          32          128         32
I/O (bits per instruction)                0.44        0.49        0.08        0.36
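To relate these profile metrics to the CPI breakdowns later in the talk, here is a back-of-the-envelope model of my own (not the STiNG simulator): memory-stall CPI is roughly misses per instruction times a service time that mixes local and remote latencies by the remote access rate. The latency and core-clock values in the example are assumed placeholders, not figures from the presentation.

```python
# Rough memory-stall CPI model (illustrative only; the latencies and core
# clock below are assumed placeholders, not figures from the presentation).

def external_cpi(mpi, remote_rate, local_ns, remote_ns, core_ghz):
    """Memory-stall cycles per instruction:
    misses/instruction * average miss penalty (ns) * cycles/ns."""
    avg_penalty_ns = (1.0 - remote_rate) * local_ns + remote_rate * remote_ns
    return mpi * avg_penalty_ns * core_ghz

if __name__ == "__main__":
    # Profile inputs taken from the TPC-B-on-STiNG column above
    # (MPI 0.0223, 35% of L2 misses going remote); the 300 ns local,
    # 3000 ns remote, and 0.2 GHz core values are assumptions.
    print(round(external_cpi(0.0223, 0.35, 300.0, 3000.0, 0.2), 2))
```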
Cache Miss Service Distribution for OLTP
• Most invalidations completed locally as well.
Cache Miss Service Distribution for DSS, 32 Processors
• Larger systems should have more remote references take 4 hops.
• Query 5 does not show this behavior.
– This is probably due to the higher remote cache miss rate.
– These are compulsory misses for Query 5.
[Charts: cache miss service distribution for Query 5 on NUMA-Q, and STiNG/Query 6 vs. NUMA-Q/Query 5 plotted against processor count.]
Average Cache Miss Penalty
• Higher locality leads to lower average latency.
• Rate of increase is decreasing with quads, as predicted.
Remote Quad Access Latency
• OLTP latency is higher than simulated, in part due to the speed difference between the Scorpion quads and the SCI ring.
[Chart: Average Remote Latency vs. Processors.]
Remote Latency Breakdown – OLTP
• Similar breakdown despite loading and chipset differences.
• SCI is a bigger percentage due to the faster quad speed.
[Pie charts: shares of average remote latency. STiNG TPC-B with Dual SCLIC Engine: Bus 26%, Lynx 62%, SCI 12%. NUMA-Q TPC-C with Dual SCLIC Engine: Bus 16%, Lynx 65%, SCI 19%.]
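The shift toward SCI can be illustrated with a minimal model (my own sketch; the component times are assumptions, not the measured values): speeding up the on-quad bus and chipset portions shrinks their terms while the SCI ring time stays fixed, so SCI becomes a larger share of the same remote access.

```python
# Illustrative only: how a faster quad shifts the remote-latency breakdown
# toward SCI. Component times are assumed placeholders, not measured data.

def breakdown(bus_ns, lynx_ns, sci_ns):
    total = bus_ns + lynx_ns + sci_ns
    return {"bus": bus_ns / total, "lynx": lynx_ns / total, "sci": sci_ns / total}

if __name__ == "__main__":
    slow_quad = breakdown(bus_ns=600, lynx_ns=1400, sci_ns=300)   # assumed
    fast_quad = breakdown(bus_ns=400, lynx_ns=1300, sci_ns=300)   # assumed faster bus/chipset
    for name, shares in (("slower quad", slow_quad), ("faster quad", fast_quad)):
        print(name, {k: f"{v:.0%}" for k, v in shares.items()})
```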
Remote Latency Breakdown – 8 Quad DSS
• Much lower latency makes SCI responsible for a larger percentage.
• Simulation had low SCLIC utilization; NUMA-Q has fewer “4 Hop” line fills.
[Pie charts: Bus, Lynx, and SCI shares of average remote latency for STiNG Query 6 and NUMA-Q Query 5.]
Quad Bus Bandwidth Consumption
• Bus utilization for TPC-C on NUMA-Q is close to STiNG despite much lower MPI.
– Lower overall latency results in lower CPI, which allows this to occur.
• Less I/O keeps Query 5 bus utilization close to Query 6 despite higher MPI.
[Chart: Data Bandwidth Consumed vs. Maximum Available (0–45%), plotted against 0–32 processors; series STiNG/DSCLIC/TPC-B and NUMA-Q/DSCLIC/TPC-C.]
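A rough way to see why much lower MPI does not automatically translate into proportionally lower bus traffic (my own sketch; every figure is an assumed placeholder): consumed bandwidth scales with instruction throughput times misses per instruction times line size, and instruction throughput rises as CPI falls.

```python
# Illustrative only: consumed data bandwidth vs. an assumed maximum.
# All inputs are placeholders, not measured values from the talk.

def consumed_mb_per_s(procs, core_mhz, cpi, mpi, line_bytes):
    """instructions/sec * misses/instruction * bytes/miss, in MB/s."""
    ips = procs * core_mhz * 1e6 / cpi          # aggregate instructions per second
    return ips * mpi * line_bytes / 1e6

if __name__ == "__main__":
    max_mb_per_s = 4 * 800.0   # assumed: 4 quad buses at an assumed 800 MB/s each
    cases = {
        # assumed TPC-B-like case: high MPI but high CPI (slow instruction rate)
        "higher MPI, higher CPI": consumed_mb_per_s(16, 200, 7.0, 0.0223, 32),
        # assumed TPC-C-like case: much lower MPI but lower CPI (fast instruction rate)
        "lower MPI, lower CPI":   consumed_mb_per_s(16, 450, 3.0, 0.0073, 32),
    }
    for name, bw in cases.items():
        print(f"{name}: {bw:.0f} MB/s ({bw / max_mb_per_s:.0%} of assumed max)")
```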
Relative CPI Breakdown for OLTP
[Pie charts: CPI components for TPC-B on STiNG (4 quads) and TPC-C on NUMA-Q, broken into internal (core) and external contributions.]
• Despite different core speeds and ratios, the relative contributions from the sources of CPI are similar.
Conclusions
• Simulation results were reasonably close to measured results in most areas.
– Architectural simulation provided an accurate high-level view of system behavior.
– The simulation results had a significant impact on implementation decisions.
• Dual protocol processing engines.
• Hardware assist.
• Lots of other areas not covered today.
– The 1 man-year investment in modeling was well worth it.
– All this despite known deficiencies in the modeling technique.
• NUMA architectures are clearly capable of executing commercial workloads.
– Average latency is similar to other large-scale UMA machines in the same timeframe.
– However, OS restructuring for NUMA was required to get the right mix of local and remote transfers.
Final Remarks
• The industry is moving to NUMA.
– New offerings from Sun, HP, and Compaq, for example.
• Lower remote-to-local latency ratios than NUMA-Q will be required to reduce the effort spent on OS restructuring.
– This helps the “shrink wrap” OS segment as well.
• As NUMA becomes more pervasive, even “shrink wrap” OSes will make the necessary modifications:
– Memory allocation.
– Affinity scheduling.
– Multipath I/O.
Multipath I/O
[Diagram: quad I/O configuration (PIIX4, MDC, PCI-A0, external memory bus) with point-to-point multipath connections.]
• With multipath I/O, all DMA transfers are to/from local memory addresses.
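A hedged sketch of the idea behind that claim (mine, not the product's driver code; all names are hypothetical): with an I/O path on every quad, the driver can issue a DMA through the controller on the same quad as the buffer's home memory, so the transfer never has to cross the SCI ring.

```python
# Illustrative only: choosing a DMA path under multipath I/O so the
# transfer stays on the quad that owns the buffer. Names are made up.

def buffer_home_quad(buffer_addr: int, num_quads: int) -> int:
    """Toy mapping of a physical buffer address to its home quad."""
    return buffer_addr % num_quads

def pick_io_path(buffer_addr: int, paths_by_quad: dict) -> str:
    """With one I/O path per quad, route the DMA through the controller
    local to the buffer's home memory; fall back to any path otherwise."""
    home = buffer_home_quad(buffer_addr, num_quads=len(paths_by_quad))
    return paths_by_quad.get(home, next(iter(paths_by_quad.values())))

if __name__ == "__main__":
    paths = {0: "pci-bridge-quad0", 1: "pci-bridge-quad1",
             2: "pci-bridge-quad2", 3: "pci-bridge-quad3"}   # hypothetical names
    print(pick_io_path(buffer_addr=0x2001, paths_by_quad=paths))
```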
Workload Characterization: OLTP vs. DSS
• OLTP does more “bad” things than DSS.
– OLTP has higher processor cache miss rates than DSS.
– OLTP has higher I/O operations per second than DSS.
– OLTP spends more time and instructions in kernel mode than DSS.
– OLTP requires higher cache-to-cache transfer bandwidth than DSS.
• However, DSS requires higher I/O bandwidth than OLTP.
– But large block I/O softens the blow.
• With sufficient I/O bandwidth available, both workloads are processor-bound with throughput limited by the latency of the interconnect and memory.
– This is true for all results I present today.
• DSS has lower CPI and scales better than OLTP.
• These conclusions are based on benchmarks.
– Real OLTP applications behave better, but still not as well as DSS.
Quad Local Access Latency
• Model was too pessimistic at low utilizations and too optimistic at high utilizations.
[Chart: Quad Local Latencies for 16-processor OLTP and 32-processor DSS; series labels include 450NX 90 MHz.]
Sequencer Core Utilization
• NUMA-Q has far fewer remote invalidates and some “hardware assist.”
• This doesn’t help Query 5, as its higher MPI and remote cache miss rate add up to higher bandwidth consumption.
[Chart: SCLIC Sequencer Utilization vs. Processors.]
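A minimal utilization model (my own; the message rate and per-message occupancy are assumptions) of why dual sequencers and hardware assist matter here: utilization is roughly the protocol-message rate times the service time per message, divided across the available sequencers.

```python
# Illustrative only: protocol-sequencer utilization as
# (messages/sec * busy time per message) / number of sequencers.
# The message rate and per-message occupancy are assumptions.

def sclic_utilization(msgs_per_sec, usec_per_msg, sequencers):
    busy_fraction = msgs_per_sec * usec_per_msg * 1e-6
    return busy_fraction / sequencers

if __name__ == "__main__":
    print(f"single sequencer: {sclic_utilization(1.5e6, 0.6, 1):.0%}")
    print(f"dual sequencers:  {sclic_utilization(1.5e6, 0.6, 2):.0%}")
    # assumed: hardware assist removes a third of the messages from the sequencers
    print(f"dual + assist:    {sclic_utilization(1.0e6, 0.6, 2):.0%}")
```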
Quad-to-Quad Bandwidth Consumption
• No-ops cause unanticipated bandwidth consumption for OLTP.
• Rate of increase is higher due to the speed of the Scorpion quads relative to SCI.
• Despite higher CPI, Query 5 uses more remote bandwidth than Query 6 due to higher core speed, MPI, and remote cache miss rate.
[Chart: SCI Ring Utilization vs. Processors.]
Relative CPI Breakdown for DSS
[Pie charts: CPI components for Query 6 on STiNG (8 quads) and Query 5 on NUMA-Q, broken into internal (core) and external contributions.]
• Higher miss rate for Query 5 leads to higher external CPI.
• OLTP benchmarks are the worst case; over 60% efficiency is acceptable.
[Chart: Performance Relative to One Quad vs. number of Quads (up to 8).]
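For reference, the “over 60% efficiency” figure can be read as speedup relative to one quad divided by the number of quads, as in this small sketch (the throughput values are made-up placeholders, not benchmark results).

```python
# Illustrative only: scaling efficiency relative to one quad.
# Throughputs below are assumed placeholders, not benchmark results.

def efficiency(throughput, baseline, quads):
    """(speedup relative to one quad) / (number of quads)."""
    return (throughput / baseline) / quads

if __name__ == "__main__":
    one_quad = 100.0                     # assumed baseline throughput
    for quads, tput in ((2, 185.0), (4, 340.0), (8, 560.0)):   # assumed
        print(f"{quads} quads: {efficiency(tput, one_quad, quads):.0%} efficiency")
```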