STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System
Russell M. Clapp
IBM NUMA-Q
[email protected]
January 21st, 2001


Background
• STiNG was the code name of an engineering development.
– Architecture introduced at ISCA in 1996.
• NUMA-Q was the name of the product family that resulted.
– Features phased in over multiple releases.
– Not yet feature complete in mid-1997.
– Unable to validate simulation results at that time.
• Left Sequent to pursue other opportunities in 1997.
• Recently returned to what is now IBM NUMA-Q.
– Looking at data collected between late 1997 and 1999.
– Now trying to validate simulation results – sort of.
• This presentation compares apples and oranges.
– At least they are both types of fruit!
• A lot of the details are different, but the overall conclusions are the same.
Outline
• Brief review of STiNG architecture.
– Updates on changes made for second-generation NUMA-Q.
• Examination of OLTP and DSS workloads.
– Per instruction workload profiles.
– How they look on a CC-NUMA machine.
• Comparison of architectural simulation and measured results.
– OLTP and DSS workloads.
– Report latencies and consumed bandwidths throughout the system.
• Conclusions
STiNG Architecture
• 1 GByte/sec SCI interconnect.
• MESI and modified SCI coherence protocols.
• Up to 63 nodes in a coherence domain.
[Block diagram: Quad 0 and Quad 1 joined by 18-bit SCI in/out links through the DataPump; within a quad, the Xeon bus (147 bits) connects through the Lynxbus to the SCI remote cache tags on the LYNX2 and SBB boards.]
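As a rough illustration of how a line fill is serviced in this organization, the following sketch (my own, not from the presentation; the mapping and names are illustrative assumptions) classifies an L2 miss by whether the line's home is the local quad, hits in the quad's SCI remote cache, or must traverse the SCI ring to a remote quad.

```python
# Illustrative sketch (not from the slides): where an L2 miss is serviced
# in a CC-NUMA quad system with an SCI remote cache. The home-quad mapping
# and addresses below are toy assumptions.

from dataclasses import dataclass

@dataclass
class Quad:
    quad_id: int
    remote_cache: set          # addresses currently held in this quad's remote cache

def home_quad(address: int, num_quads: int) -> int:
    """Toy home-quad mapping (addresses interleaved across quads)."""
    return address % num_quads

def service_miss(address: int, local: Quad, num_quads: int) -> str:
    """Classify an L2 cache miss the way the cache miss service
    distribution slides do: local memory, remote cache hit, or a
    remote fetch over the SCI ring."""
    if home_quad(address, num_quads) == local.quad_id:
        return "local memory (quad bus only)"
    if address in local.remote_cache:
        return "remote cache hit (stays on the local quad)"
    return "remote memory via SCI (2 or more hops)"

if __name__ == "__main__":
    q0 = Quad(quad_id=0, remote_cache={17})
    for addr in (4, 17, 21):            # toy addresses
        print(addr, "->", service_miss(addr, q0, num_quads=4))
```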
STiNG to NUMA-Q Product Evolution
• STiNG assumed product features that were introduced in NUMA-Q over time:
– Multipath I/O.
– Intelligent I/O drivers.
– Improved chipset behavior.
– Dual protocol processing engine in SCLIC.
– Larger processor and remote cache sizes.
• These changes were all incorporated in the second generation of NUMA-Q, code named “Scorpion”.
– This is the data that I will present today for OLTP workloads.
– I will also present some data from the first generation of NUMA-Q.
• “Centurion” increases the bus and ASIC speeds to 100 MHz and adds some other minor enhancements.
STiNG Simulation vs. NUMA-Q Data Collection
• “Application profiles” were determined for several database benchmark workloads.
– We compare to measured profiles on NUMA-Q.
• The TPC-B benchmark and Query 6 of the TPC-D suite were chosen as representative of OLTP and DSS usage models.
– TPC-C and Query 5 of TPC-D were measured on NUMA-Q.
• A simulator with behavioral-level models of all system ASICs and data paths was built.
• The application profiles were used to drive the simulation model.
• The simulation model reports latencies, resource utilization rates, and instruction throughput.
– Data was collected using hardware performance counters.
– The counters provide data for over 1000 different metrics.
STiNG Simulation vs. NUMA-Q Data Collection
• STiNG simulations included the “Dual SCLIC.”
– TPC-C data is for the dual sequencer version of the SCLIC, the “DSCLIC”.
• Clearly, we are comparing apples and oranges.
– But many of the results are similar!
[Table: simulated vs. measured system configurations, including one or two protocol sequencers per SCLIC.]
Application Profiles
• Larger cache leads to lower MPI for TPC-C vs. TPC-B.
• Query 5 has higher MPI than Query 6 despite the larger cache.
• Remote memory access rates are lower than what we assumed.
– Lots of OS work made this happen.
• Remote cache miss rates are higher than expected.
– Fewer L2 capacity misses to remote memory that would have hit in the remote cache.
• TPC-C has a lower I/O rate than TPC-B.
                                          Query 5     Query 6     TPC-C       TPC-B
                                          (NUMA-Q)    (STiNG)     (NUMA-Q)    (STiNG)
L2 Cache Size (per processor)             1M          512K        2M          512K
L2 Cache Miss (per instruction)           0.0031      0.0018      0.0073      0.0223
Remote Memory Access Rate (per L2 miss)   27%         35%         24%         35%
Remote Cache Size (MB, all 4-way)         32          32          128         32
I/O (bits per instruction)                0.44        0.49        0.08        0.36
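To relate these profile metrics to the CPI breakdowns later in the talk, here is a back-of-the-envelope model of my own (not the STiNG simulator): memory-stall CPI is roughly misses per instruction times a service time that mixes local and remote latencies by the remote access rate. The latency and core-clock values in the example are assumed placeholders, not figures from the presentation.

```python
# Rough memory-stall CPI model (illustrative only; the latencies and core
# clock below are assumed placeholders, not figures from the presentation).

def external_cpi(mpi, remote_rate, local_ns, remote_ns, core_ghz):
    """Memory-stall cycles per instruction:
    misses/instruction * average miss penalty (ns) * cycles/ns."""
    avg_penalty_ns = (1.0 - remote_rate) * local_ns + remote_rate * remote_ns
    return mpi * avg_penalty_ns * core_ghz

if __name__ == "__main__":
    # Profile inputs taken from the TPC-B-on-STiNG column above
    # (MPI 0.0223, 35% of L2 misses going remote); the 300 ns local,
    # 3000 ns remote, and 0.2 GHz core values are assumptions.
    print(round(external_cpi(0.0223, 0.35, 300.0, 3000.0, 0.2), 2))
```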
Cache Miss Service Distribution for OLTP
• Most invalidations completed locally as well.
Cache Miss Service Distribution for DSS, 32 Processors
• Larger systems should have more remote references take 4 hops.
• Query 5 does not show this behavior.
– This is probably due to the higher remote cache miss rate.
– These are compulsory misses for Query 5.
[Charts: cache miss service distribution for Query 5 on NUMA-Q, and STiNG/Query 6 vs. NUMA-Q/Query 5 plotted against processor count.]
Average Cache Miss Penalty
• Higher locality leads to lower average latency.
• Rate of increase is decreasing with quads, as predicted.
Remote Quad Access Latency
• OLTP latency is higher than simulated, in part due to the speed difference between the Scorpion quads and the SCI ring.
[Chart: Average Remote Latency vs. Processors.]
Remote Latency Breakdown – OLTP
• Similar breakdown despite loading and chipset differences.
• SCI is a bigger percentage due to the faster quad speed.
[Pie charts: shares of average remote latency. STiNG TPC-B with Dual SCLIC Engine: Bus 26%, Lynx 62%, SCI 12%. NUMA-Q TPC-C with Dual SCLIC Engine: Bus 16%, Lynx 65%, SCI 19%.]
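The shift toward SCI can be illustrated with a minimal model (my own sketch; the component times are assumptions, not the measured values): speeding up the on-quad bus and chipset portions shrinks their terms while the SCI ring time stays fixed, so SCI becomes a larger share of the same remote access.

```python
# Illustrative only: how a faster quad shifts the remote-latency breakdown
# toward SCI. Component times are assumed placeholders, not measured data.

def breakdown(bus_ns, lynx_ns, sci_ns):
    total = bus_ns + lynx_ns + sci_ns
    return {"bus": bus_ns / total, "lynx": lynx_ns / total, "sci": sci_ns / total}

if __name__ == "__main__":
    slow_quad = breakdown(bus_ns=600, lynx_ns=1400, sci_ns=300)   # assumed
    fast_quad = breakdown(bus_ns=400, lynx_ns=1300, sci_ns=300)   # assumed faster bus/chipset
    for name, shares in (("slower quad", slow_quad), ("faster quad", fast_quad)):
        print(name, {k: f"{v:.0%}" for k, v in shares.items()})
```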
Remote Latency Breakdown – 8 Quad DSS
• Much lower latency makes SCI responsible for a larger percentage.
• Simulation had low SCLIC utilization; NUMA-Q has fewer “4 Hop” line fills.
[Pie charts: Bus, Lynx, and SCI shares of average remote latency for STiNG Query 6 and NUMA-Q Query 5.]
Quad Bus Bandwidth Consumption
• Bus utilization for TPC-C on NUMA-Q is close to STiNG despite much lower MPI.
– Lower overall latency results in lower CPI, which allows this to occur.
• Less I/O keeps Query 5 bus utilization close to Query 6 despite higher MPI.
[Chart: Data Bandwidth Consumed vs. Maximum Available (0–45%), plotted against 0–32 processors; series STiNG/DSCLIC/TPC-B and NUMA-Q/DSCLIC/TPC-C.]
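A rough way to see why much lower MPI does not automatically translate into proportionally lower bus traffic (my own sketch; every figure is an assumed placeholder): consumed bandwidth scales with instruction throughput times misses per instruction times line size, and instruction throughput rises as CPI falls.

```python
# Illustrative only: consumed data bandwidth vs. an assumed maximum.
# All inputs are placeholders, not measured values from the talk.

def consumed_mb_per_s(procs, core_mhz, cpi, mpi, line_bytes):
    """instructions/sec * misses/instruction * bytes/miss, in MB/s."""
    ips = procs * core_mhz * 1e6 / cpi          # aggregate instructions per second
    return ips * mpi * line_bytes / 1e6

if __name__ == "__main__":
    max_mb_per_s = 4 * 800.0   # assumed: 4 quad buses at an assumed 800 MB/s each
    cases = {
        # assumed TPC-B-like case: high MPI but high CPI (slow instruction rate)
        "higher MPI, higher CPI": consumed_mb_per_s(16, 200, 7.0, 0.0223, 32),
        # assumed TPC-C-like case: much lower MPI but lower CPI (fast instruction rate)
        "lower MPI, lower CPI":   consumed_mb_per_s(16, 450, 3.0, 0.0073, 32),
    }
    for name, bw in cases.items():
        print(f"{name}: {bw:.0f} MB/s ({bw / max_mb_per_s:.0%} of assumed max)")
```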
Relative CPI Breakdown for OLTP
[Pie charts: CPI components for TPC-B on STiNG (4 quads) and TPC-C on NUMA-Q, broken into internal (core) and external contributions.]
• Despite different core speeds and ratios, the relative contributions from the sources of CPI are similar.
Conclusions
• Simulation results were reasonably close to measured results in most areas.
– Architectural simulation provided an accurate high-level view of system behavior.
– The simulation results had a significant impact on implementation decisions.
• Dual protocol processing engines.
• Hardware assist.
• Lots of other areas not covered today.
– The 1 man-year investment in modeling was well worth it.
– All this despite known deficiencies in the modeling technique.
• NUMA architectures are clearly capable of executing commercial workloads.
– Average latency is similar to other large-scale UMA machines in the same timeframe.
– However, OS restructuring for NUMA was required to get the right mix of local and remote transfers.
Final Remarks
• The industry is moving to NUMA.
– New offerings from Sun, HP, and Compaq, for example.
• Lower remote-to-local latency ratios than NUMA-Q will be required to reduce the effort spent on OS restructuring.
– This helps the “shrink wrap” OS segment as well.
• As NUMA becomes more pervasive, even “shrink wrap” OSes will make the necessary modifications:
– Memory allocation.
– Affinity scheduling.
– Multipath I/O.
Multipath I/O
[Diagram: quad I/O configuration (PIIX4, MDC, PCI-A0, external memory bus) with point-to-point multipath connections.]
• With multipath I/O, all DMA transfers are to/from local memory addresses.
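A hedged sketch of the idea behind that claim (mine, not the product's driver code; all names are hypothetical): with an I/O path on every quad, the driver can issue a DMA through the controller on the same quad as the buffer's home memory, so the transfer never has to cross the SCI ring.

```python
# Illustrative only: choosing a DMA path under multipath I/O so the
# transfer stays on the quad that owns the buffer. Names are made up.

def buffer_home_quad(buffer_addr: int, num_quads: int) -> int:
    """Toy mapping of a physical buffer address to its home quad."""
    return buffer_addr % num_quads

def pick_io_path(buffer_addr: int, paths_by_quad: dict) -> str:
    """With one I/O path per quad, route the DMA through the controller
    local to the buffer's home memory; fall back to any path otherwise."""
    home = buffer_home_quad(buffer_addr, num_quads=len(paths_by_quad))
    return paths_by_quad.get(home, next(iter(paths_by_quad.values())))

if __name__ == "__main__":
    paths = {0: "pci-bridge-quad0", 1: "pci-bridge-quad1",
             2: "pci-bridge-quad2", 3: "pci-bridge-quad3"}   # hypothetical names
    print(pick_io_path(buffer_addr=0x2001, paths_by_quad=paths))
```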
Workload Characterization: OLTP vs. DSS
• OLTP does more “bad” things than DSS.
– OLTP has higher processor cache miss rates than DSS.
– OLTP has higher I/O operations per second than DSS.
– OLTP spends more time and instructions in kernel mode than DSS.
– OLTP requires higher cache-to-cache transfer bandwidth than DSS.
• However, DSS requires higher I/O bandwidth than OLTP.
– But large block I/O softens the blow.
• With sufficient I/O bandwidth available, both workloads are processor-bound with throughput limited by the latency of the interconnect and memory.
– This is true for all results I present today.
• DSS has lower CPI and scales better than OLTP.
• These conclusions are based on benchmarks.
– Real OLTP applications behave better, but still not as well as DSS.
Quad Local Access Latency
• Model was too pessimistic at low utilizations and too optimistic at high utilizations.
[Chart: Quad Local Latencies for 16-processor OLTP and 32-processor DSS; series labels include 450NX 90 MHz.]
Sequencer Core Utilization
• NUMA-Q has far fewer remote invalidates and some “hardware assist.”
• This doesn’t help Query 5, as its higher MPI and remote cache miss rate add up to higher bandwidth consumption.
[Chart: SCLIC Sequencer Utilization vs. Processors.]
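A minimal utilization model (my own; the message rate and per-message occupancy are assumptions) of why dual sequencers and hardware assist matter here: utilization is roughly the protocol-message rate times the service time per message, divided across the available sequencers.

```python
# Illustrative only: protocol-sequencer utilization as
# (messages/sec * busy time per message) / number of sequencers.
# The message rate and per-message occupancy are assumptions.

def sclic_utilization(msgs_per_sec, usec_per_msg, sequencers):
    busy_fraction = msgs_per_sec * usec_per_msg * 1e-6
    return busy_fraction / sequencers

if __name__ == "__main__":
    print(f"single sequencer: {sclic_utilization(1.5e6, 0.6, 1):.0%}")
    print(f"dual sequencers:  {sclic_utilization(1.5e6, 0.6, 2):.0%}")
    # assumed: hardware assist removes a third of the messages from the sequencers
    print(f"dual + assist:    {sclic_utilization(1.0e6, 0.6, 2):.0%}")
```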
Quad-to-Quad Bandwidth Consumption
• No-ops cause unanticipated bandwidth consumption for OLTP.
• Rate of increase is higher due to the speed of the Scorpion quads relative to SCI.
• Despite higher CPI, Query 5 uses more remote bandwidth than Query 6 due to higher core speed, MPI, and remote cache miss rate.
[Chart: SCI Ring Utilization vs. Processors.]
Relative CPI Breakdown for DSS
[Pie charts: CPI components for Query 6 on STiNG (8 quads) and Query 5 on NUMA-Q, broken into internal (core) and external contributions.]
• Higher miss rate for Query 5 leads to higher external CPI.
• OLTP benchmarks are the worst case; over 60% efficiency is acceptable.
[Chart: Performance Relative to One Quad vs. number of Quads (up to 8).]
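For reference, the “over 60% efficiency” figure can be read as speedup relative to one quad divided by the number of quads, as in this small sketch (the throughput values are made-up placeholders, not benchmark results).

```python
# Illustrative only: scaling efficiency relative to one quad.
# Throughputs below are assumed placeholders, not benchmark results.

def efficiency(throughput, baseline, quads):
    """(speedup relative to one quad) / (number of quads)."""
    return (throughput / baseline) / quads

if __name__ == "__main__":
    one_quad = 100.0                     # assumed baseline throughput
    for quads, tput in ((2, 185.0), (4, 340.0), (8, 560.0)):   # assumed
        print(f"{quads} quads: {efficiency(tput, one_quad, quads):.0%} efficiency")
```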